Transformer-Based Semantic Similarity Framework for Extrinsic Plagiarism Detection in Low-Resource Gujarati Language
Keywords:
Plagiarism Detection, Transformer Models, Gujarati Language, Semantic Similarity, Natural Language Processing
Abstract
This research proposes a transformer-based NLP model for plagiarism detection in Gujarati, a low-resource and morphologically rich language. The challenges posed by Gujarati include the scarcity of annotated corpora, complex morphology, and a variety of syntactic structures. To address these problems, a hybrid solution is developed that combines statistical and contextual embeddings. Baseline TF-IDF and cosine similarity approaches produced a similarity score of 0.4214, demonstrating limited ability to capture semantic relations. An optimized BERT model scored much higher at 0.9935, showing better contextual comprehension and paraphrase recognition. The transformer's self-attention mechanism is well suited to modeling long-range dependencies, which enables the identification of paraphrased and obfuscated text. The results highlight the usefulness of transformer-based representations in low-resource language settings and provide a practical approach to improving plagiarism detection.
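As a rough illustration of the two pipelines compared above, the sketch below contrasts the TF-IDF cosine-similarity baseline with cosine similarity over contextual sentence embeddings. The example sentences, the public multilingual checkpoint paraphrase-multilingual-MiniLM-L12-v2, and all preprocessing choices are assumptions made for illustration; the paper's actual fine-tuned Gujarati BERT model is not reproduced here.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

# Two short Gujarati sentences (a sentence and a loose paraphrase),
# used purely as placeholder inputs for this sketch.
source_text = "આ એક ઉદાહરણ વાક્ય છે."
suspect_text = "આ વાક્ય માત્ર એક ઉદાહરણ છે."

# --- Baseline: TF-IDF vectors + cosine similarity (surface-level overlap) ---
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform([source_text, suspect_text])
baseline_score = cosine_similarity(tfidf[0:1], tfidf[1:2])[0, 0]

# --- Contextual approach: cosine similarity over transformer sentence
# embeddings. A public multilingual checkpoint stands in here for the
# paper's optimized Gujarati BERT model. ---
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode([source_text, suspect_text])
semantic_score = cosine_similarity(embeddings[0:1], embeddings[1:2])[0, 0]

print(f"TF-IDF cosine similarity:      {baseline_score:.4f}")
print(f"Transformer cosine similarity: {semantic_score:.4f}")

On paraphrased pairs like this, the lexical baseline typically scores low while the embedding-based score remains high, mirroring the gap between the 0.4214 and 0.9935 scores reported above.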
License
This is an Open Access article published by the Research Center of Computing & Biomedical Informatics (RCBI), Lahore, Pakistan, under the CC BY 4.0 International License.