Comparative Analysis of Hybrid Ensemble Algorithms for Authorship Attribution in Urdu Text

Authors

  • Talha Farooq Khan, Department of Computer Science & IT, Institute of Southern Punjab Multan (ISP-Multan), Multan, Pakistan.
  • Muhammad Sabir, Department of Computer Science & IT, Institute of Southern Punjab Multan (ISP-Multan), Multan, Pakistan.
  • Mubasher H. Malik, Department of Computer Science & IT, Institute of Southern Punjab Multan (ISP-Multan), Multan, Pakistan.
  • Hamid Ghous, Department of Computer Science & IT, Institute of Southern Punjab Multan (ISP-Multan), Multan, Pakistan.
  • Hafiz Muhammad Ijaz, Department of Computer Science & IT, Institute of Southern Punjab Multan (ISP-Multan), Multan, Pakistan.
  • Asma Nadeem, Department of Information Technology, The Islamia University, Bahawalpur, Pakistan.
  • Abiha Ejaz, Department of Information Technology, The Islamia University, Bahawalpur, Pakistan.

Keywords:

Authorship Attribution, Text Classification, Low Resource Language, Natural Language Processing

Abstract

Computer crime investigation based on digital text has advanced significantly over the past decade, driven by the growing use of cell phones and computers. As text-based forensics gains prominence for its pivotal role in investigations, the need for improved authorship attribution techniques becomes pressing. Identifying the author of textual or digital content from its style, language, and textual variations is a long-standing subject of study. Traditionally, authorship claims for unpublished works were often validated posthumously for copyright purposes by comparing the stylistic elements of the work. With ongoing digitization, however, the demand for authorship attribution in digital text has surged. This study introduces an approach to authorship attribution using hybrid ensemble methods, focusing on a corpus of Urdu texts. The Urdu text is first preprocessed and then converted into vectors using Word2Vec. Six algorithms that combine a Support Vector Machine with boosted algorithms, namely SVM-XGB, SVM-ABC, SVM-CBC, SVM-GBC, SVM-LGBC, and SVM-HGBC, are then applied. These algorithms performed well on the authorship attribution task, with SVM-CBC achieving the highest accuracy of 92%.
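
The pipeline outlined in the abstract (preprocessing, word vectors, then a combination of an SVM with a boosted classifier) can be illustrated with a minimal sketch. The abstract does not state how word vectors become document features or how the two classifiers are combined, so the sketch rests on assumptions: document vectors are taken as the mean of per-word vectors, and the two models are combined by soft voting over their per-author probabilities. The function names, the punctuation set, the toy embedding table, and the probability values are all hypothetical and stand in for a trained Word2Vec model and trained classifiers.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed punctuation set; includes the Urdu full stop, comma, and question mark.
PUNCTUATION = "۔،؟!.,:;\"'()"


def tokenize(text: str) -> List[str]:
    """Split on whitespace and strip surrounding punctuation from each token."""
    stripped = (tok.strip(PUNCTUATION) for tok in text.split())
    return [tok for tok in stripped if tok]


def document_vector(
    tokens: Sequence[str], embeddings: Dict[str, List[float]], dim: int
) -> List[float]:
    """Average the vectors of known tokens (assumed pooling); skip unknown tokens."""
    known = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not known:
        return [0.0] * dim  # no known token: fall back to a zero vector
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def soft_vote(
    prob_a: Dict[str, float], prob_b: Dict[str, float], weight_a: float = 0.5
) -> Tuple[str, Dict[str, float]]:
    """Combine two per-author probability maps by weighted averaging.

    Returns the top-scoring author and the combined scores. Ties are broken
    alphabetically so the result is deterministic.
    """
    authors = sorted(set(prob_a) | set(prob_b))
    combined = {
        a: weight_a * prob_a.get(a, 0.0) + (1.0 - weight_a) * prob_b.get(a, 0.0)
        for a in authors
    }
    best = max(authors, key=lambda a: combined[a])
    return best, combined


# Toy 2-dimensional embedding table standing in for a trained Word2Vec model.
toy_embeddings = {"کتاب": [1.0, 0.0], "قلم": [0.0, 1.0]}
features = document_vector(tokenize("کتاب قلم۔"), toy_embeddings, dim=2)

# Hypothetical per-author probabilities from an SVM and a boosted classifier.
svm_probs = {"author_1": 0.6, "author_2": 0.4}
boost_probs = {"author_1": 0.3, "author_2": 0.7}
predicted_author, scores = soft_vote(svm_probs, boost_probs)
```

In practice the feature vector would be passed to both trained classifiers, and their predicted probabilities would replace the hard-coded values above. Stacking, where one model is trained on the other's outputs, is another plausible way to build such hybrids; the abstract does not say which approach was used.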

Published

2024-04-01

How to Cite

Talha Farooq Khan, Muhammad Sabir, Mubasher H. Malik, Hamid Ghous, Hafiz Muhammad Ijaz, Asma Nadeem, & Abiha Ejaz. (2024). Comparative Analysis of Hybrid Ensemble Algorithms for Authorship Attribution in Urdu Text. Journal of Computing & Biomedical Informatics. Retrieved from https://jcbi.org/index.php/Main/article/view/495