Machine Learning-Based Classification of SARS-CoV-2 Structural Proteins Using Amino Acid Composition Analysis
DOI:
https://doi.org/10.56979/1002/2026/1204
Keywords:
COVID-19, Protein Classification, Machine Learning, Amino Acid Composition, SARS-CoV-2, Bioinformatics, Feature Engineering, Viral Genomics, Computational Biology
Abstract
Classifying SARS-CoV-2 protein types is important for understanding viral structure. This study presents a machine learning approach for classifying the four major SARS-CoV-2 structural proteins: Spike, Membrane, Envelope, and Nucleocapsid. We collected 40,000 protein sequences from the NCBI protein database, 10,000 for each protein type, through automated web scraping and parsing. After preprocessing and outlier removal, the dataset contained 28,206 proteins. We trained five machine learning algorithms: Random Forest, Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Decision Tree, and Logistic Regression. The models were evaluated using accuracy, precision, recall, and F1 score. The K-Nearest Neighbors classifier achieved the highest accuracy, at 98%. Feature importance analysis showed that sequence length and specific amino acids were the main discriminating factors, offering biological insight into the differences between the protein types. These results demonstrate the effectiveness of amino acid composition-based features for SARS-CoV-2 protein classification and provide an efficient framework for automated protein type identification.
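As an illustration of the kind of feature the abstract describes, the sketch below turns one protein sequence into a vector of sequence length plus the fraction of each of the 20 standard amino acids. It is a minimal example under stated assumptions, not the authors' pipeline: the function name `composition_features`, the handling of non-standard residues, and the toy sequence are all hypothetical.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so every vector has the same layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_features(sequence):
    """Return [length, frac_A, frac_C, ..., frac_Y] for one protein sequence.

    Characters outside the 20 standard residues (e.g. X) are ignored.
    This handling is an assumption made for the sketch, not the paper's rule.
    """
    seq = sequence.strip().upper()
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    fractions = [counts[aa] / total for aa in AMINO_ACIDS]
    # Length here counts standard residues only.
    return [float(total)] + fractions


# Toy sequence made up for illustration; it is not a real SARS-CoV-2 protein.
toy = "MKVLAAGLLK"
features = composition_features(toy)
```

Vectors of this shape, one per sequence, could then be passed to a standard classifier such as scikit-learn's `KNeighborsClassifier`; whether the features should be scaled first (length is on a very different scale from the fractions) is a design choice the abstract does not specify.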
License
This is an open access article published by the Research Center of Computing & Biomedical Informatics (RCBI), Lahore, Pakistan, under the CC BY 4.0 International License.



