A Hybrid Approach for Analysis of Urdu Tweets Authorship Empowered by Genetic Algorithm and K-Nearest Neighbors

Zain Ali; Arfan Ali Nagra; Khalid Masood; Muhammad Abubakar; Muhammad Mudassar

Authors

Zain Ali, Department of Computer Science, Lahore Garrison University, Lahore, 54000, Pakistan.
Arfan Ali Nagra, Department of Computer Science, Lahore Garrison University, Lahore, 54000, Pakistan. https://orcid.org/0000-0002-2149-8165
Khalid Masood, Department of Computer Science, Lahore Garrison University, Lahore, 54000, Pakistan.
Muhammad Abubakar, Department of Computer Science, Lahore Garrison University, Lahore, 54000, Pakistan. https://orcid.org/0000-0001-6902-6549
Muhammad Mudassar, Department of Technology, The University of Lahore, Lahore, 54000, Pakistan.

Keywords:

K-nearest neighbors (KNN), latent Dirichlet allocation (LDA), Natural Language Toolkit (NLTK), term frequency-inverse document frequency (TF-IDF)

Abstract

Authorship attribution is the task of identifying the author of a text whose origin is unknown or disputed. As communication moves toward more constrained online exchanges, Internet crimes such as phishing and harassment are becoming more common, and identifying the culprits during cybercrime investigations is a challenge. This research evaluates current authorship attribution algorithms at the semantic level, along with their accuracy on Urdu text. The experiments used an Urdu-language dataset, referred to as Urdu TD, which contains 600 Urdu tweets per author. A latent Dirichlet allocation (LDA) model was used to extract stylometric features that distinguish the writing style of specific authors, using the n-gram method and cosine similarity. After applying the LDA model for feature selection, we used a genetic algorithm, and then applied the K-nearest neighbors (KNN) classifier to the resulting features. The idea of combining the genetic algorithm and the KNN classifier is to create a hybrid model that outperforms each individual classifier in classification accuracy. In this study, the proposed authorship attribution model performed very well in classifying simple and varied Urdu text, with a highest accuracy of 98.20%, recall of 99%, precision of 97%, and F1 measure of 98%. The task was carried out without using any authorship labels. This system should help improve standards for authorship attribution and classification methods.
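
The combination of n-gram features, cosine similarity, a genetic algorithm, and KNN named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the LDA step is omitted, character bigram counts stand in for the stylometric features, the genetic algorithm is assumed to search over binary feature masks with leave-one-out KNN accuracy as its fitness, and the four short strings are invented placeholders rather than tweets from Urdu TD. All function names and parameter values are hypothetical.

```python
import math
import random
from collections import Counter

def char_ngrams(text, n=2):
    """Count overlapping character n-grams in a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b, mask):
    """Cosine similarity of two vectors, using only features where mask is 1."""
    dot = sum(x * y for x, y, m in zip(a, b, mask) if m)
    na = math.sqrt(sum(x * x for x, m in zip(a, mask) if m))
    nb = math.sqrt(sum(y * y for y, m in zip(b, mask) if m))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(train, query, mask, k=1):
    """Majority label among the k training vectors most similar to the query."""
    ranked = sorted(train, key=lambda item: cosine(item[0], query, mask), reverse=True)
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

def loo_accuracy(data, mask, k=1):
    """Leave-one-out KNN accuracy under a feature mask (used as GA fitness)."""
    hits = sum(knn_predict(data[:i] + data[i + 1:], vec, mask, k) == label
               for i, (vec, label) in enumerate(data))
    return hits / len(data)

def ga_select(data, n_features, generations=20, pop_size=12, seed=0):
    """Evolve binary feature masks: keep the fitter half, refill with children."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda m: loo_accuracy(data, m), reverse=True)
        parents = pop[:pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)   # one-point crossover
            child = p1[:cut] + p2[cut:]
            j = rng.randrange(n_features)        # flip one bit as mutation
            child[j] = 1 - child[j]
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda m: loo_accuracy(data, m))

# Invented placeholder texts with hypothetical author labels (not real data).
tweets = [("aaa bbb aaa", "A"), ("aab aaa bba", "A"),
          ("xyz zzy xyx", "B"), ("zyx xxz yzz", "B")]
counts = [char_ngrams(text) for text, _ in tweets]
vocab = sorted(set().union(*counts))
data = [([c[g] for g in vocab], label) for c, (_, label) in zip(counts, tweets)]

best_mask = ga_select(data, len(vocab))
print("selected features:", [g for g, m in zip(vocab, best_mask) if m])
print("leave-one-out accuracy:", loo_accuracy(data, best_mask))
```

Keeping the fitter half of each generation unchanged means the best fitness never decreases between generations; a real experiment would also need a held-out test set so that the selected features are not tuned to the evaluation data.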
