Optimized Feature Extraction and Cross-Lingual Text Reuse Detection using Ensemble Machine Learning Models
Keywords:
Cross-lingual Plagiarism Detection, Urdu-English Plagiarism Detection, Plagiarism Detection, Machine Learning, Ensemble Machine Learning Methods
Abstract
With the growing availability of digital data in different languages, cross-lingual plagiarism detection has gained importance. Cross-lingual plagiarism is difficult to detect because the suspicious and source texts can be written in different languages, and processing digitized text in each language presents its own challenges. In this work, we propose a cross-lingual plagiarism detection method based on an ensemble of machine learning algorithms. To evaluate the methodology, we use CLPD-UE-19, a corpus focusing on the Urdu-English language pair. The corpus is a collection of 2398 documents in which the source text is written in Urdu and the suspicious text in English. Using NLP methods, optimal features are extracted and fed to the ensemble for document classification. Several aggregation techniques are employed: majority voting, stacking, averaging, boosting, and bagging. Among these, stacking performed best, achieving an accuracy of 96 percent.
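As an illustration of two of the simpler aggregation techniques named in the abstract, the following minimal Python sketch shows majority voting and probability averaging over the outputs of several base classifiers. It is not the authors' implementation: the function names, the labels "plagiarized" and "original", and the 0.5 threshold are assumptions chosen for the example.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by the most base classifiers.

    `predictions` is a list of labels, one per base classifier.
    On a tie, the label that appears first in the list wins.
    """
    counts = Counter(predictions)
    return counts.most_common(1)[0][0]


def average_probability(probabilities, threshold=0.5):
    """Average the per-classifier probabilities of the positive class.

    Returns the decided label (hypothetical names) and the mean probability.
    """
    mean = sum(probabilities) / len(probabilities)
    label = "plagiarized" if mean >= threshold else "original"
    return label, mean


# Hypothetical outputs of three base classifiers for one document pair.
votes = ["plagiarized", "original", "plagiarized"]
probs = [0.9, 0.4, 0.7]

voted_label = majority_vote(votes)
averaged_label, mean_prob = average_probability(probs)
```

Stacking, which the abstract reports as the best performer, differs from both: instead of a fixed rule, a second-level model is trained on the base classifiers' outputs.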
License
This is an Open Access article published by the Research Center of Computing & Biomedical Informatics (RCBI), Lahore, Pakistan, under the CC BY 4.0 International License.