Improving Cancer Subtype Classification in High-Dimensional Gene Expression Data via PCA and Machine Learning

Authors:
Rejwan Bin Sulaiman, Noman Javed

Addresses:
Department of Computing Science and Technology, Northumbria University, Newcastle upon Tyne, England, United Kingdom. School of Computing, Engineering and Physical Science, University of the West of Scotland, Paisley, Scotland, United Kingdom.

Abstract:

Predicting cancer subtypes is central to contemporary cancer research, because gene-expression data reveal the patterns that distinguish one subtype from another. This paper evaluates three machine learning models, Logistic Regression (LR), Support Vector Machines (SVM), and Random Forests (RF), for classifying subtypes of LGG and BRCA cancer. The main challenges addressed are high-dimensional data and small sample sizes. Principal Component Analysis (PCA) was used for dimensionality reduction to limit overfitting and improve generalisation. The models were assessed with accuracy, macro-F1 score, ROC-AUC, and confusion matrices, and nested cross-validation was used to obtain reliable performance estimates. Stratified sampling and class weighting were applied to address class imbalance, which was most pronounced in the BRCA dataset. The results show that PCA improves model generalisation, with the SVM+PCA combination being the most accurate and robust across both datasets. Beyond reducing overfitting and improving model stability, PCA also simplified the interpretation of results, without relying on complex, multifaceted omics data or deep learning techniques. Although the findings are encouraging, future work could consider more sophisticated resampling methods and omics approaches to further increase predictive value.
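The evaluation setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn, uses synthetic data from make_classification in place of the LGG and BRCA expression matrices, and the function name nested_cv_macro_f1, the hyperparameter grid, and the fold counts are all hypothetical choices. It shows PCA and a class-weighted SVM in one pipeline, tuned in an inner stratified loop and scored by macro-F1 in an outer stratified loop.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def nested_cv_macro_f1(X, y, seed=0):
    """Return outer-fold macro-F1 scores for an SVM+PCA pipeline (illustrative)."""
    # Scaling and PCA sit inside the pipeline so they are refit on each
    # training fold, which avoids leaking test-fold information.
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("pca", PCA(random_state=seed)),
        # class_weight="balanced" reweights classes inversely to their frequency.
        ("svm", SVC(kernel="rbf", class_weight="balanced")),
    ])
    # Hypothetical grid; the paper's actual search space is not given here.
    grid = {"pca__n_components": [5, 10], "svm__C": [0.1, 1.0, 10.0]}

    # Stratified folds keep class proportions similar across splits.
    inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

    # Inner loop selects hyperparameters; outer loop estimates generalisation.
    search = GridSearchCV(pipe, grid, scoring="f1_macro", cv=inner)
    return cross_val_score(search, X, y, scoring="f1_macro", cv=outer)


# Synthetic stand-in: few samples, many features, imbalanced classes.
X, y = make_classification(
    n_samples=120, n_features=500, n_informative=20,
    n_classes=3, weights=[0.6, 0.3, 0.1], random_state=0,
)
scores = nested_cv_macro_f1(X, y)
print("macro-F1 per outer fold:", scores.round(3))
print("mean macro-F1:", round(float(scores.mean()), 3))
```

Keeping the scaler and PCA inside the pipeline is the design choice that matters most here: fitting them on the full dataset before cross-validation would let test-fold information influence the components and inflate the scores.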

Keywords: Logistic Regression (LR); Support Vector Machines (SVM); Biological Data; Random Forests (RF); Cancer Subtypes Prediction; Machine Learning Models; Gene-Expression Data.

Received: 03/11/2024, Revised: 19/12/2024, Accepted: 20/03/2025, Published: 07/12/2025

DOI: 10.64091/ATICL.2025.000235

AVE Trends in Intelligent Computer Letters, 2025 Vol. 1 No. 4, Pages: 208-231
