Webpage Classification for Search Engine Optimization using Machine Learning
Keywords:
Malicious & Benign Websites, Machine Learning, Deep Neural Network, URL, SEO (Search Engine Optimization)
Abstract
Webpage classification for search engine optimization (SEO) is an important area of study in which machine learning, especially Deep Neural Networks (DNNs), plays a crucial role. This paper aims to develop an accurate classifier that separates malicious from benign webpages using DNNs. The research approach covers data collection, feature selection, model construction, training and evaluation, handling of imbalanced data, and practical implementation considerations. The dataset contains features such as raw webpage content, geographical location, JavaScript length, and obfuscated JavaScript code. It comprises about 1.5 million webpages, of which 1.2 million are used for training and 300,000 for testing. The dataset is highly skewed: 98.35% of its pages are reported as benign and 2.27% as malicious. The training set is reported as 401,806 instances, including 25,770 benign webpages (6.41%) and 9,472 malicious webpages (2.35%); the test set is reported as 398,125 instances, including 23,298 benign webpages (5.8%) and 9,344 malicious webpages (2.34%). The model is trained to identify patterns indicative of malicious intent. Because of this skew, accuracy alone does not give a reliable evaluation, so the evaluation metrics must be chosen carefully: we report an F1-score of 97.73%, a recall of 95.2%, a precision of 96%, and a confusion matrix. The paper thus addresses the challenge of accurately differentiating between malicious and benign websites, and its outcomes contribute to webpage classification in SEO by applying DNNs to that task.
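The point about accuracy being misleading on a skewed dataset can be illustrated with a minimal sketch. The toy labels below (1,000 pages, 20 of them malicious) and the helper names `confusion_counts` and `metrics` are assumptions made for illustration only; they are not the paper's data, model, or code.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Derive accuracy, precision, recall and F1 from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical toy data: label 1 = malicious, label 0 = benign.
y_true = [1] * 20 + [0] * 980

# A trivial classifier that labels every page benign.
always_benign = [0] * 1000
baseline = metrics(*confusion_counts(y_true, always_benign))

# A classifier that catches 18 of 20 malicious pages with 2 false alarms.
detector_pred = [1] * 18 + [0] * 2 + [1] * 2 + [0] * 978
detector = metrics(*confusion_counts(y_true, detector_pred))

print("always benign:", baseline)
print("detector:     ", detector)
```

In this sketch the always-benign baseline reaches high accuracy while its recall and F1 are zero, which is why precision, recall, F1 and the confusion matrix are more informative than accuracy for a dataset dominated by benign pages.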
License
This is an open access article published by the Research Center of Computing & Biomedical Informatics (RCBI), Lahore, Pakistan, under the CC BY 4.0 International License.