Email Spam Detection Using Machine Learning with Optimized Feature Engineering and Classification Techniques
Keywords:
Spam Detection, Machine Learning, TF-IDF, Support Vector Machine, Email Classification, Ensemble Learning
Abstract
Spam email remains a major challenge for digital communication: it reduces productivity, consumes storage, and carries serious cybersecurity threats such as phishing, malware, and identity theft. Traditional filters based on hand-written rules and keyword matching struggle against evasion techniques such as content obfuscation, dynamically generated messages, and URL cloaking. To address this, the present study reports a machine-learning approach to spam detection that uses natural language processing (NLP) for preprocessing and TF-IDF for feature extraction. Several supervised classifiers were built and evaluated on the publicly available mail_data.csv data set: Logistic Regression, Naïve Bayes, Random Forest, Gradient Boosting, Support Vector Machine (SVM), and an ensemble model. The data were split 80:20 into training and test sets, and the models were compared on accuracy, precision, recall, and F1 score. SVM attained the highest accuracy (98.9%), indicating its effectiveness at separating spam from legitimate email.
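The abstract names TF-IDF weighting and four evaluation metrics without defining them. The following is a minimal standard-library sketch of both, not the study's implementation. The function names `tfidf` and `binary_metrics`, the whitespace tokenizer, the unsmoothed formula tf × ln(N / df), and the toy messages and labels are all assumptions made for illustration; library implementations typically apply additional smoothing and normalization, so their weights will differ.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * ln(N / df)."""
    # Assumption: lowercase whitespace tokenization stands in for NLP preprocessing.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def binary_metrics(y_true, y_pred, positive="spam"):
    """Accuracy, precision, recall, and F1, treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy data invented for illustration only; not drawn from mail_data.csv.
docs = ["win a free prize now", "meeting moved to monday", "free prize inside"]
weights = tfidf(docs)

y_true = ["spam", "ham", "spam", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "spam"]
scores = binary_metrics(y_true, y_pred)
```

In this toy corpus, "win" appears in one message and "free" in two, so "win" receives the larger weight in the first message. Precision and recall are reported alongside accuracy because accuracy alone can look high on a data set where one class dominates.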
License
This is an open access article published by the Research Center of Computing & Biomedical Informatics (RCBI), Lahore, Pakistan, under the CC BY 4.0 International License.



