Machine Learning-Based Classification of SARS-CoV-2 Structural Proteins Using Amino Acid Composition Analysis

Authors

  • Anam Fatima, Department of Computer Science, Namal University, Mianwali, Pakistan.
  • Nasreen, Department of Computer Science, Namal University, Mianwali, Pakistan.
  • Harram Sattar, Department of Computer Science, Namal University, Mianwali, Pakistan.
  • Muhammad Bilal, Department of Computer Science, Namal University, Mianwali, Pakistan.
  • Shafiq ur Rehman Khan, Department of Computer Science, Namal University, Mianwali, Pakistan.

DOI:

https://doi.org/10.56979/1002/2026/1204

Keywords:

COVID-19, Protein Classification, Machine Learning, Amino Acid Composition, SARS-CoV-2, Bioinformatics, Feature Engineering, Viral Genomics, Computational Biology

Abstract

Classifying SARS-CoV-2 protein types is important for understanding viral structure. This study presents a machine learning approach for classifying the four major structural proteins of SARS-CoV-2: Spike, Membrane, Envelope, and Nucleocapsid. We collected 40,000 protein sequences from the NCBI protein database, 10,000 for each protein type, using automated web scraping and parsing. After preprocessing and outlier removal, the final dataset contained 28,206 proteins. We trained five machine learning algorithms (Random Forest, Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Decision Tree, and Logistic Regression) and evaluated them using accuracy, precision, recall, and F1 score. The K-Nearest Neighbors classifier achieved the highest accuracy, 98%. Feature importance analysis showed that sequence length and specific amino acids were the most discriminative features, which gives biological insight into the differences between the protein types. These results demonstrate the effectiveness of amino acid composition-based features for SARS-CoV-2 protein classification and provide an efficient framework for automated protein type identification.
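
As a rough illustration of what amino acid composition features can look like, the sketch below turns one protein sequence into a feature vector: the sequence length followed by the fraction of each of the 20 standard residues. This is only a minimal sketch under stated assumptions, not the authors' pipeline. The function name `composition_features`, the choice to ignore non-standard residue codes when computing fractions, and the toy sequence are all hypothetical.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature positions are stable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_features(sequence):
    """Return [length, frac_A, frac_C, ..., frac_Y] for one protein sequence.

    Assumption: fractions are computed over standard residues only, so
    ambiguous codes such as X, B, or Z do not contribute to any fraction.
    The length feature counts every character in the sequence.
    """
    seq = sequence.strip().upper()
    counts = Counter(seq)
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [float(len(seq))] + [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical toy sequence, not taken from the study's dataset.
features = composition_features("MKTAYIAKQR")
```

Vectors of this shape (21 numbers per sequence) could then be passed to any of the classifiers named in the abstract. Which additional features or preprocessing steps the study used is not stated here.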

Published

2026-03-01

How to Cite

Anam Fatima, Nasreen, Harram Sattar, Muhammad Bilal, & Shafiq ur Rehman Khan. (2026). Machine Learning-Based Classification of SARS-CoV-2 Structural Proteins Using Amino Acid Composition Analysis. Journal of Computing & Biomedical Informatics, 10(02). https://doi.org/10.56979/1002/2026/1204