An Efficient Machine Learning Approach for Plagiarism Detection in Text Documents

Authors

  • Muhammad Mubashir Zahid, Department of Computer Science, NFC Institute of Engineering and Technology, Multan, Pakistan.
  • Kamran Abid, Department of Computer Science, NFC Institute of Engineering and Technology, Multan, Pakistan.
  • Abdul Rehman, Faculty of Computer Science, Lahore Garrison University, Lahore, 54000, Pakistan.
  • M Fuzail, Department of Computer Science, NFC Institute of Engineering and Technology, Multan, Pakistan.
  • Naeem Aslam, Department of Computer Science, NFC Institute of Engineering and Technology, Multan, Pakistan.

Keywords:

Plagiarism detection, cross-lingual plagiarism detection, obfuscation level, similarity matrix, extrinsic plagiarism detection, Urdu-English plagiarism detection

Abstract

Plagiarism is the use of someone else's words or ideas as one's own. The internet offers information on virtually every subject, so people can easily copy material and use various techniques to disguise the plagiarism. Plagiarism can be detected with extrinsic or intrinsic methods. Extrinsic plagiarism detection compares the source text with the suspicious text directly, using similarity measures such as Jaccard similarity and cosine similarity. Intrinsic plagiarism detection, by contrast, needs no source material: it identifies plagiarism from the author's writing style and other notable characteristics. Cross-lingual plagiarism (CLP) is a kind of plagiarism in which the author steals content by translating text from one language to another, for example from Urdu to English. CLP is hard to identify because the source and suspicious documents are in two different languages. Various approaches to cross-lingual plagiarism detection (CLPD) in text documents have been presented. We apply a machine learning (ML) approach to the CLPD problem. Plagiarism detection (PD) performance is evaluated on a corpus, so we use the Urdu-English language-pair corpus CLPD-UE-19 [1], in which the source text is written in Urdu and the suspicious text is supplied in English. We construct a dataset that machine learning tools can process by extracting optimized features from the corpus with Python NLP techniques. The resulting dataset is in CSV format and records distinctive features of each pair of source and suspicious texts, such as Jaccard similarity and cosine similarity. These similarity measures are computed over unigrams and trigrams of the preprocessed text. Five ML classifiers (KNN, Naïve Bayes, SVM, Decision Tree, and Random Forest) are used to build models, implemented in Python with the PyCharm tool. We examine the models' accuracy with two methods, cross-validation and percentage split. The experiments show that KNN and RF produce better results than the other models.
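
As an illustration of the similarity features named in the abstract, the following is a minimal Python sketch of Jaccard and cosine similarity computed over unigrams and trigrams. It is not the authors' implementation. It assumes both texts are already in the same language and that preprocessing is only lowercasing and whitespace tokenization; the abstract does not specify how the Urdu and English texts are aligned or preprocessed. The function names and feature names (for example `jaccard_unigram`) are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the word n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def jaccard_similarity(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| over the sets of n-grams."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def cosine_similarity(a, b):
    """Cosine similarity between the n-gram count vectors of a and b."""
    count_a, count_b = Counter(a), Counter(b)
    shared = count_a.keys() & count_b.keys()
    dot = sum(count_a[g] * count_b[g] for g in shared)
    norm_a = math.sqrt(sum(c * c for c in count_a.values()))
    norm_b = math.sqrt(sum(c * c for c in count_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_features(source_text, suspicious_text):
    """Build one feature row for a (source, suspicious) text pair.

    Assumption: simple lowercasing and whitespace tokenization stand in
    for the preprocessing step, which the abstract does not describe.
    """
    src = source_text.lower().split()
    sus = suspicious_text.lower().split()
    features = {}
    for n, name in ((1, "unigram"), (3, "trigram")):
        src_grams, sus_grams = ngrams(src, n), ngrams(sus, n)
        features[f"jaccard_{name}"] = jaccard_similarity(src_grams, sus_grams)
        features[f"cosine_{name}"] = cosine_similarity(src_grams, sus_grams)
    return features


if __name__ == "__main__":
    row = similarity_features("the cat sat on the mat", "the cat sat on a mat")
    for key, value in row.items():
        print(f"{key}: {value:.3f}")
```

Each call yields one row of four numeric features; rows of this kind, together with a plagiarised or non-plagiarised label, could be written to a CSV file and passed to a classifier.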

Published

2023-03-29

How to Cite

Muhammad Mubashir Zahid, Kamran Abid, Abdul Rehman, M Fuzail, & Naeem Aslam. (2023). An Efficient Machine Learning Approach for Plagiarism Detection in Text Documents. Journal of Computing & Biomedical Informatics, 4(02), 241–248. Retrieved from https://jcbi.org/index.php/Main/article/view/153