Comparative Analysis of Hybrid Ensemble Algorithms for Authorship Attribution in Urdu Text
Keywords:
Authorship Attribution, Text Classification, Low Resource Language, Natural Language Processing
Abstract
Computer crime investigation based on digital text has advanced significantly over the past decade, driven by the growing use of cell phones and computers. As text-based forensics gains prominence through its pivotal role in investigations, the need for better authorship attribution techniques becomes paramount. Identifying the author of textual or digital content from its style, language, and textual variations is a longstanding subject of study. Traditionally, authorship claims for unpublished works were often validated posthumously for copyright purposes by comparing the stylistic elements of the work. With ongoing digitization, however, the demand for authorship attribution in digital text has surged. This study introduces a new approach to authorship attribution using Hybrid Ensemble Methods, focusing on a corpus of Urdu texts. The methodology first applies preprocessing to the Urdu textual data and then converts the text into vectors using the Word2Vec technique. Six novel algorithms that combine a Support Vector Machine (SVM) with boosting algorithms are then applied: SVM-XGB, SVM-ABC, SVM-CBC, SVM-GBC, SVM-LGBC, and SVM-HGBC. These algorithms performed strongly on the authorship attribution task, with SVM-CBC achieving the highest accuracy of 92%.
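The pipeline summarized above (word vectors, then an SVM combined with a boosted classifier) can be illustrated with a minimal sketch. The abstract does not say how word vectors are pooled into a document vector or how the two classifiers are combined, so the mean pooling and the soft-voting rule below are assumptions, not the authors' method. The function names `document_vector` and `soft_vote`, the toy two-dimensional embeddings, and the probability values are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def document_vector(tokens: Sequence[str],
                    embeddings: Dict[str, List[float]],
                    dim: int) -> List[float]:
    """Mean-pool the vectors of known tokens (assumed pooling strategy).

    Tokens missing from the embedding table are skipped; a document with
    no known tokens maps to the zero vector.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def soft_vote(prob_svm: Dict[str, float],
              prob_boost: Dict[str, float],
              weight_svm: float = 0.5) -> Tuple[str, Dict[str, float]]:
    """Blend per-author probabilities from two classifiers (assumed rule).

    Returns the author with the highest blended score and the blended scores.
    """
    authors = set(prob_svm) | set(prob_boost)
    blended = {
        a: weight_svm * prob_svm.get(a, 0.0)
        + (1.0 - weight_svm) * prob_boost.get(a, 0.0)
        for a in authors
    }
    winner = max(blended, key=blended.get)
    return winner, blended


# Toy stand-in for a trained Word2Vec table (hypothetical values).
toy_embeddings = {"کتاب": [1.0, 0.0], "علم": [0.0, 1.0]}
doc_vec = document_vector(["کتاب", "علم", "نامعلوم"], toy_embeddings, dim=2)

# Hypothetical class probabilities from an SVM and a boosted classifier.
winner, blended = soft_vote({"author_a": 0.6, "author_b": 0.4},
                            {"author_a": 0.2, "author_b": 0.8})
```

In practice the embedding table would come from a Word2Vec model trained on the Urdu corpus, and the two probability dictionaries from fitted classifiers; the sketch only shows how the pieces could fit together.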
License
This is an open access article published by the Research Center of Computing & Biomedical Informatics (RCBI), Lahore, Pakistan, under the CC BY 4.0 International License.