Optimized Feature Extraction and Cross-Lingual Text Reuse Detection using Ensemble Machine Learning Models

Authors

  • Muhammad Sajid Maqbool, Department of Computer Science, Bahauddin Zakariya University, Multan, Pakistan.
  • Israr Hanif, Department of Computer Science, Bahauddin Zakariya University, Multan, Pakistan.
  • Sajid Iqbal, Department of Computer Science, Bahauddin Zakariya University, Multan, Pakistan.
  • Abdul Basit, Department of Computer Science, Bahauddin Zakariya University, Multan, Pakistan.
  • Aiman Shabbir, Department of Computer Science, Muhammad Nawaz Sharif University of Agriculture, Multan, Pakistan.

Keywords:

Cross-lingual Plagiarism Detection, Urdu English Plagiarism Detection, Plagiarism Detection, Machine Learning, Ensemble machine learning methods

Abstract

With the growing availability of digital text in many languages, cross-lingual plagiarism detection has become increasingly important. Cross-lingual plagiarism is difficult to detect because the suspicious and source texts can be written in different languages, and processing digitized text in each language presents its own challenges. In this work, we propose a cross-lingual plagiarism detection method based on an ensemble of machine learning algorithms. To evaluate the methodology, we use CLPD-UE-19, a corpus focusing on the Urdu-English language pair. The corpus is a collection of 2398 documents in which the source text is written in Urdu and the suspicious text in English. Using NLP methods, optimal features are extracted and fed to the ensemble for document classification. Several aggregation techniques are employed: majority voting, stacking, averaging, boosting, and bagging. Among these, stacking performed best, achieving an accuracy of 96 percent.
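
As an illustration of the two simplest aggregation techniques named in the abstract, majority voting and averaging, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the binary label (1 for reused, 0 for not reused), the 0.5 threshold, and the sample predictions are all assumptions made for this example.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Hard voting: predictions[m][i] is model m's label for document pair i."""
    n_docs = len(predictions[0])
    voted = []
    for i in range(n_docs):
        # Count how many base models chose each label for this document pair.
        votes = Counter(model_preds[i] for model_preds in predictions)
        voted.append(votes.most_common(1)[0][0])
    return voted


def average_probabilities(
    probabilities: Sequence[Sequence[float]], threshold: float = 0.5
) -> List[int]:
    """Averaging: probabilities[m][i] is model m's estimated P(reused) for pair i."""
    n_models = len(probabilities)
    n_docs = len(probabilities[0])
    labels = []
    for i in range(n_docs):
        mean_p = sum(model_probs[i] for model_probs in probabilities) / n_models
        # Assumed decision rule: label as reused when the mean reaches the threshold.
        labels.append(1 if mean_p >= threshold else 0)
    return labels


# Hypothetical outputs of three base classifiers on four document pairs.
hard_preds = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]
soft_preds = [
    [0.9, 0.4, 0.6, 0.1],
    [0.8, 0.7, 0.1, 0.2],
    [0.4, 0.6, 0.7, 0.3],
]

voted_labels = majority_vote(hard_preds)
averaged_labels = average_probabilities(soft_preds)
```

In this made-up data the two aggregators disagree on the third document pair: two of three models vote for reuse, but one model's low confidence pulls the mean probability below the threshold. Stacking, which the abstract reports as the best performer, goes a step further by training a second-level classifier on the base models' outputs rather than applying a fixed rule.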

Published

2023-06-05

How to Cite

Muhammad Sajid Maqbool, Israr Hanif, Sajid Iqbal, Abdul Basit, & Aiman Shabbir. (2023). Optimized Feature Extraction and Cross-Lingual Text Reuse Detection using Ensemble Machine Learning Models. Journal of Computing & Biomedical Informatics, 5(01), 26–40. Retrieved from https://jcbi.org/index.php/Main/article/view/133