Optimized Feature Extraction and Cross-Lingual Text Reuse Detection using Ensemble Machine Learning Models

Muhammad Sajid Maqbool; Israr Hanif; Sajid Iqbal; Abdul Basit; Aiman Shabbir

Authors

Muhammad Sajid Maqbool, Department of Computer Science, Bahauddin Zakariya University, Multan, Pakistan.
Israr Hanif, Department of Computer Science, Bahauddin Zakariya University, Multan, Pakistan.
Sajid Iqbal, Department of Computer Science, Bahauddin Zakariya University, Multan, Pakistan.
Abdul Basit, Department of Computer Science, Bahauddin Zakariya University, Multan, Pakistan.
Aiman Shabbir, Department of Computer Science, Muhammad Nawaz Sharif University of Agriculture, Multan, Pakistan.

Keywords:

Cross-lingual Plagiarism Detection, Urdu-English Plagiarism Detection, Plagiarism Detection, Machine Learning, Ensemble Machine Learning Methods

Abstract

With the growing availability of digital data in many languages, cross-lingual plagiarism detection has become increasingly important. Cross-lingual plagiarism is difficult to detect because the suspicious and source texts can be written in different languages, and processing digitized text in each language presents its own challenges. In this work, we propose a cross-lingual plagiarism detection method based on an ensemble of machine learning algorithms. To evaluate the methodology, we use CLPD-UE-19, a corpus for the Urdu-English language pair. The corpus contains 2398 documents in which the source text is written in Urdu and the suspicious text in English. Using NLP methods, optimal features are extracted and fed to the ensemble for document classification. Several aggregation techniques are employed: majority voting, stacking, averaging, boosting, and bagging. Among these, stacking performed best, achieving an accuracy of 96 percent.
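
As a rough illustration of two of the aggregation techniques named in the abstract, the sketch below contrasts majority voting with averaging for a single document pair. It is not the authors' implementation: the function names `majority_vote` and `average_vote`, the 0.5 threshold, the tie-breaking rule, and the classifier outputs are all assumptions made up for this example.

```python
from collections import Counter
from statistics import mean


def majority_vote(labels):
    """Hard voting: return the label predicted by most base classifiers."""
    counts = Counter(labels)
    top = max(counts.values())
    # Assumed tie-break: pick the smallest label among those tied for first.
    return min(label for label, n in counts.items() if n == top)


def average_vote(probabilities, threshold=0.5):
    """Averaging: mean of the positive-class probabilities, then threshold."""
    return 1 if mean(probabilities) >= threshold else 0


# Hypothetical outputs of three base classifiers for one Urdu-English pair
# (1 = text reuse, 0 = no reuse). The numbers are invented for illustration.
hard_labels = [1, 0, 1]
soft_probs = [0.55, 0.10, 0.60]

hard_decision = majority_vote(hard_labels)  # two of three classifiers say 1
soft_decision = average_vote(soft_probs)    # mean is about 0.42, below 0.5
print(hard_decision, soft_decision)
```

With these invented inputs the two rules disagree: two weakly confident positive votes win under majority voting, while one strongly confident negative vote pulls the average below the threshold. Differences of this kind are one reason to compare several aggregation schemes, as the abstract describes.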
