Authors:
Nikhil Baiju Punnen, M. Pranav, Allent S. Manakatt, R. Regin, K. Senthamilselvan, S. Tejas
Addresses:
Department of Computer Science and Engineering, SRM Institute of Science and Technology, Ramapuram, Chennai, Tamil Nadu, India. Department of Electronics and Communication Engineering, Dhaanish Ahmed College of Engineering, Chennai, Tamil Nadu, India. Department of Data Science, Analytics and Engineering, Arizona State University, Tempe, Arizona, United States of America.
Abstract:
This paper presents a comparative study of three classical machine learning classifiers, Logistic Regression (LR), Support Vector Machine (SVM), and Random Forest (RF), each evaluated as a baseline and with Principal Component Analysis (PCA) preprocessing for dimensionality reduction, for lung disease prediction using high-dimensional biomedical data. Lung diseases (pneumonia, tuberculosis, COPD, lung cancer) pose a major global health challenge and are responsible for several million deaths per year. Prompt, accurate, and automated detection of lung diseases is urgently needed to improve patient care and reduce the variability inherent in human interpretation of diagnostic results. Machine learning on high-dimensional biomedical data is complicated by feature interdependencies, redundancy, correlation, and noise amplification, all of which increase the risk of overfitting, particularly when the number of training samples is much smaller than the number of features. This study compares six classification pipelines (three baselines and three preprocessed with PCA) across three train-test partition ratios (80/20, 70/30, 60/40) and two cross-validation settings (3-fold and 5-fold), using a leakage-free nested cross-validation methodology in which feature scaling and PCA transformations are computed only from the training folds. Our experiments show that the baseline classifiers reach training accuracies of 97–100% but significantly lower test accuracies, indicating overfitting in the high-dimensional domain. In contrast, the PCA-preprocessed pipelines generalize better, narrowing the train-test gap across all three partition ratios.
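As an illustration of what "leakage-free" means here, the following is a minimal sketch, not the authors' implementation. It assumes scikit-learn, uses a synthetic dataset from `make_classification` as a stand-in for the biomedical data, and picks a hypothetical grid of PCA component counts; none of these choices come from the paper. Scaling and PCA sit inside a `Pipeline`, so they are refit on the training portion of every fold, and an inner cross-validation loop is nested inside an outer one.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for high-dimensional data, with far fewer
# samples (120) than features (500). The paper's dataset is not used here.
X, y = make_classification(
    n_samples=120, n_features=500, n_informative=15, random_state=0
)

# Scaling and PCA are pipeline steps, so their statistics are estimated
# only from the training part of each fold and then applied to the held-out part.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(random_state=0)),
    ("clf", LogisticRegression(max_iter=5000)),
])

# Inner loop: choose the number of components (hypothetical grid).
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(
    pipeline,
    param_grid={"pca__n_components": [5, 10, 20]},
    cv=inner_cv,
)

# Outer loop: estimate generalization of the whole tuning procedure.
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
outer_scores = cross_val_score(search, X, y, cv=outer_cv)

print("Outer-fold accuracies:", outer_scores)
print("Mean accuracy:", outer_scores.mean())
```

Swapping `LogisticRegression` for `SVC` or `RandomForestClassifier`, or dropping the `"pca"` step to obtain a baseline, would give the other pipeline variants under the same assumptions.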
Keywords: Machine Learning; Dimensionality Reduction; Support Vector Machine; Random Forest; Logistic Regression; Biomedical Data; Lung Disease Prediction; Classification Pipelines; Nested Cross-Validation.
Received on: 01/05/2025, Revised on: 16/06/2025, Accepted on: 15/08/2025, Published on: 01/03/2026
DOI: 10.64091/ATIIR.2026.000280
Ave Trends in Intelligent Informatics Reports, 2026, Vol. 1, No. 1, Pages: 52–66