A Hybrid Approach for Analysis of Urdu Tweets Authorship Empowered by Genetic Algorithm and K-Nearest Neighbors
Keywords:
K-nearest neighbors (KNN), latent Dirichlet allocation (LDA), Natural Language Toolkit (NLTK), term frequency-inverse document frequency (TF-IDF)
Abstract
Authorship attribution is the process of identifying the author of a disputed or anonymous text from a body of candidate material. As communication shifts toward shorter, more constrained messages, Internet crimes such as phishing and harassment are becoming more common, and identifying the culprits during cybercrime investigations is a challenge. This research evaluates current authorship attribution algorithms at the semantic level and measures their accuracy on Urdu text. The Urdu-language dataset, referred to as Urdu TD, comprises 600 Urdu tweets per author. The LDA model was used to extract stylometric features that distinguish the writing style of specific authors, using the n-gram method and cosine similarity. After applying the LDA model for feature selection, we used a genetic algorithm and then applied the KNN classifier to the resulting features. The idea behind combining the genetic algorithm and the KNN classifier is to create a hybrid model that outperforms each individual classifier in terms of classification accuracy. In this study, the proposed authorship attribution model showed an excellent ability to classify both simple and varied Urdu text, with a highest accuracy of 98.20%, recall of 99%, precision of 97%, and F1 measure of 98%. The task was carried out without using any authorship labels. This system should help improve standards for authorship attribution and classification methods.
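To make the combination of a genetic algorithm with KNN more concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It skips the LDA step entirely and uses raw character bigram counts as stand-in stylometric features. The toy corpus, the function names (`char_ngrams`, `genetic_select`, `loo_accuracy`, and so on), the leave-one-out fitness, and the crossover and mutation settings are all hypothetical choices made for illustration.

```python
import math
import random
from collections import Counter

def char_ngrams(text, n=2):
    """Count overlapping character n-grams in a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def vectorize(text, vocab, n=2):
    """Map a text to a list of n-gram counts ordered by vocab."""
    counts = char_ngrams(text, n)
    return [counts[g] for g in vocab]

def cosine(a, b, mask):
    """Cosine similarity using only the features switched on in mask."""
    dot = sum(x * y for x, y, m in zip(a, b, mask) if m)
    na = math.sqrt(sum(x * x for x, m in zip(a, mask) if m))
    nb = math.sqrt(sum(y * y for y, m in zip(b, mask) if m))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(train, query, mask, k=3):
    """Majority vote among the k training vectors most similar to query."""
    ranked = sorted(train, key=lambda item: cosine(item[0], query, mask),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def loo_accuracy(data, mask, k=3):
    """Leave-one-out KNN accuracy; serves as the GA fitness (an assumption)."""
    hits = 0
    for i, (vec, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, vec, mask, k) == label
    return hits / len(data)

def genetic_select(data, n_features, pop_size=12, generations=15, seed=0):
    """Evolve a boolean feature mask that maximises leave-one-out accuracy."""
    rng = random.Random(seed)
    pop = [[rng.random() < 0.5 for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda m: loo_accuracy(data, m), reverse=True)
        parents = pop[:pop_size // 2]          # keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = a[:cut] + b[cut:]
            j = rng.randrange(n_features)       # flip one random bit
            child[j] = not child[j]
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda m: loo_accuracy(data, m))

# Toy stand-in corpus: placeholder strings, NOT real Urdu tweets.
corpus = [
    ("aaaa bbbb", "A"), ("aaab bbba", "A"), ("aabb aabb", "A"),
    ("xyxy zzzz", "B"), ("xxyy zzzy", "B"), ("xyzz xyzz", "B"),
]
vocab = sorted(set().union(*(char_ngrams(text) for text, _ in corpus)))
data = [(vectorize(text, vocab), label) for text, label in corpus]

full_mask = [True] * len(vocab)
best_mask = genetic_select(data, len(vocab))
query = vectorize("abab bbaa", vocab)
print("selected features:", sum(best_mask), "of", len(vocab))
print("predicted author:", knn_predict(data, query, best_mask))
```

In this sketch the genetic algorithm searches over subsets of features while KNN supplies the fitness signal, which is one common way to pair the two. How the paper itself couples them, and how the LDA output feeds in, is not specified in the abstract.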
License
This is an open access article published by the Research Center of Computing & Biomedical Informatics (RCBI), Lahore, Pakistan, under the CC BY 4.0 International License.