A Hybrid Approach for Analysis of Urdu Tweets Authorship Empowered by Genetic Algorithm and K-Nearest Neighbors

Authors

  • Zain Ali, Department of Computer Science, Lahore Garrison University, Lahore, 54000, Pakistan.
  • Arfan Ali Nagra, Department of Computer Science, Lahore Garrison University, Lahore, 54000, Pakistan. https://orcid.org/0000-0002-2149-8165
  • Khalid Masood, Department of Computer Science, Lahore Garrison University, Lahore, 54000, Pakistan.
  • Muhammad Abubakar, Department of Computer Science, Lahore Garrison University, Lahore, 54000, Pakistan. https://orcid.org/0000-0001-6902-6549
  • Muhammad Mudassar, Department of Technology, The University of Lahore, Lahore, 54000, Pakistan.

Keywords:

K-nearest neighbors (KNN), latent Dirichlet allocation (LDA), Natural Language Toolkit (NLTK), term frequency-inverse document frequency (TF-IDF)

Abstract

Authorship attribution is the process of identifying the author of a text whose origin is unknown or disputed. As communication moves toward shorter, more constrained online exchanges, Internet crimes such as phishing and harassment are becoming more common, and identifying the culprits during cybercrime investigations is a challenge. This research evaluates current authorship attribution algorithms at the semantic level, along with their accuracy on Urdu text. The Urdu-language dataset, referred to as Urdu TD, is based on 600 Urdu tweets per author. The LDA model was used to extract stylometric features that distinguish the writing style of specific authors, using the n-gram method and cosine similarity. After applying the LDA model for feature selection, we applied a genetic algorithm and then ran the KNN classifier on the resulting features. The idea of combining the genetic algorithm and the KNN classifier is to create a hybrid model that outperforms either one on its own in terms of classification accuracy. In this study, the proposed authorship attribution model classified both simple and varied Urdu text well, with a highest accuracy of 98.20%, recall of 99%, precision of 97%, and F1 measure of 98%. The task was managed without utilizing any labels for authorship. This system should help improve standards for authorship attribution and classification methods.
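
To make the pipeline above concrete, the following is a minimal sketch, not the authors' implementation. It assumes character bigram counts as features, cosine similarity as the KNN distance, and a simple elitist genetic algorithm that searches for a binary feature mask. The LDA step is omitted, the toy strings are placeholders rather than Urdu tweets, and every function name and parameter here (`char_ngrams`, `ga_select`, `pop_size`, and so on) is hypothetical.

```python
import random
from collections import Counter
from math import sqrt

def char_ngrams(text, n=2):
    """Count overlapping character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b, mask):
    """Cosine similarity of two vectors, using only features where mask is 1."""
    dot = sum(x * y for x, y, m in zip(a, b, mask) if m)
    na = sqrt(sum(x * x for x, m in zip(a, mask) if m))
    nb = sqrt(sum(y * y for y, m in zip(b, mask) if m))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(train, query, mask, k=3):
    """Majority vote among the k training vectors most similar to the query."""
    ranked = sorted(train, key=lambda item: cosine(item[0], query, mask), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def loo_accuracy(data, mask, k=3):
    """Leave-one-out KNN accuracy using only the masked features."""
    hits = 0
    for i, (vec, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, vec, mask, k) == label
    return hits / len(data)

def ga_select(data, n_features, pop_size=12, generations=15, seed=0):
    """Evolve a binary feature mask that maximises leave-one-out KNN accuracy."""
    rng = random.Random(seed)

    def fitness(mask):
        return loo_accuracy(data, mask)

    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]            # keep the fitter half (elitism)
        children = []
        while len(parents) + len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)   # one-point crossover
            child = p1[:cut] + p2[cut:]
            child[rng.randrange(n_features)] ^= 1  # mutation: flip one bit
            children.append(child)
        pop = parents + children
    best = max(pop, key=fitness)
    return best, fitness(best)

# Placeholder "tweets" from two hypothetical authors (not real data).
texts = [
    ("aab aab aba", "A"), ("aba aab aab", "A"), ("aab aba aab", "A"), ("aab aab aab", "A"),
    ("xyz zyx xyz", "B"), ("zyx xyz xyz", "B"), ("xyz xyz zyx", "B"), ("zyx zyx xyz", "B"),
]
vocab = sorted({g for t, _ in texts for g in char_ngrams(t)})
data = [([char_ngrams(t)[g] for g in vocab], label) for t, label in texts]

mask, acc = ga_select(data, len(vocab))
print("selected n-grams:", [g for g, m in zip(vocab, mask) if m])
print("leave-one-out accuracy:", acc)
```

In this sketch the genetic algorithm acts as a wrapper around KNN: each candidate mask is scored by the classifier's own leave-one-out accuracy, which is one common way to combine the two, though the abstract does not specify how the authors' hybrid is wired.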

Published

2023-12-05

How to Cite

Zain Ali, Arfan Ali Nagra, Khalid Masood, Muhammad Abubakar, & Muhammad Mudassar. (2023). A Hybrid Approach for Analysis of Urdu Tweets Authorship Empowered by Genetic Algorithm and K-Nearest Neighbors. Journal of Computing & Biomedical Informatics, 6(01), 1–11. Retrieved from https://jcbi.org/index.php/Main/article/view/261