An Efficient Machine Learning Approach for Plagiarism Detection in Text Documents
Keywords:
Plagiarism detection, cross-lingual plagiarism detection, obfuscation level, similarity matrix, extrinsic plagiarism detection, Urdu-English plagiarism detection
Abstract
Plagiarism is the use of someone else's words or ideas as one's own. The internet offers information on virtually every subject, so people can easily copy content and use various techniques to disguise the plagiarism. Plagiarism can be detected with extrinsic or intrinsic methods. Extrinsic plagiarism detection compares the suspicious text against the source text to compute similarity measures such as Jaccard and cosine similarity. Intrinsic plagiarism detection, by contrast, does not require source documents; it identifies plagiarism from the author's writing style and other distinctive characteristics. Cross-lingual plagiarism (CLP) is a kind of plagiarism in which the author appropriates content by translating text from one language to another, such as Urdu to English. CLP is hard to identify because the source and suspicious documents are in two different languages. Various approaches have been presented to tackle the problem of cross-lingual plagiarism detection (CLPD) in text documents. We apply a machine learning (ML) approach to the problem of CLPD. A corpus is needed to evaluate the performance of plagiarism detection (PD), so we use the Urdu-English language-pair corpus CLPD-UE-19 [1]. In this corpus, the source text is written in Urdu, whereas the suspicious text is in English. We extract optimized features from the corpus using Python NLP techniques to construct a dataset that machine learning tools can process. The resulting dataset is in CSV format and records distinctive features of each source and suspicious text pair, such as Jaccard similarity and cosine similarity. We use unigrams and trigrams of the preprocessed text to compute the similarity measures. Five ML classifiers, namely KNN, Naïve Bayes, SVM, Decision Tree, and Random Forest, are used to build models. The models are built in Python using the PyCharm tool.
We examine the models' accuracy in Python using two methods: cross-validation and percentage split. The experiments show that KNN and Random Forest (RF) produce better results than the other models.
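As a rough illustration of the similarity features mentioned in the abstract, the following minimal sketch computes Jaccard and cosine similarity over word unigrams and trigrams for one source/suspicious pair. It is not the authors' implementation: the function names, the whitespace tokenization, and the feature names are assumptions made for this example, and it assumes both texts have already been brought into the same language (the abstract does not say how the Urdu source is aligned with the English text).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the word n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def jaccard(a, b):
    """Jaccard similarity between the sets of n-grams in a and b."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine(a, b):
    """Cosine similarity between the n-gram count vectors of a and b."""
    count_a, count_b = Counter(a), Counter(b)
    dot = sum(count_a[g] * count_b[g] for g in count_a.keys() & count_b.keys())
    norm_a = math.sqrt(sum(v * v for v in count_a.values()))
    norm_b = math.sqrt(sum(v * v for v in count_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_features(source_text, suspicious_text):
    """Build one feature row (hypothetical column names) for a text pair.

    Assumes simple lowercasing and whitespace tokenization as preprocessing.
    """
    src = source_text.lower().split()
    sus = suspicious_text.lower().split()
    features = {}
    for n, name in ((1, "unigram"), (3, "trigram")):
        a, b = ngrams(src, n), ngrams(sus, n)
        features[f"jaccard_{name}"] = jaccard(a, b)
        features[f"cosine_{name}"] = cosine(a, b)
    return features
```

Each returned dictionary could serve as one row of a CSV dataset, with a plagiarism label added as a further column before training a classifier.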
License
This is an open access article published by Research Center of Computing & Biomedical Informatics (RCBI), Lahore, Pakistan under the CC BY 4.0 International License.