Analyzing the Impact of Pretrained Language Models on Low-Resource Languages
DOI: https://doi.org/10.56979/1101/2026/1332
Keywords: Low-Resource Languages, Pretrained Language Models, Cross-Lingual Transfer, Multilingual NLP, Adaptation Framework
Abstract
The rapid advancement of natural language processing has predominantly benefited high-resource languages such as English, Chinese, and Spanish, leaving thousands of languages underserved. This digital language divide limits equitable access to technology and threatens global linguistic diversity. This paper presents a systematic evaluation of eight pretrained language models across seven low-resource languages representing five distinct language families. Through extensive experiments on sentiment analysis, named entity recognition, and machine translation tasks, we demonstrate that multilingual BERT achieves the highest average accuracy of 74.5%. We further propose a novel adaptation framework combining vocabulary augmentation, continual pretraining, task-adaptive fine-tuning, and knowledge distillation that improves performance by up to 18.7%. Our analysis identifies vocabulary overlap as the strongest predictor of cross-lingual transfer success, explaining 76.3% of performance variance. These findings provide evidence-based guidelines for researchers and practitioners developing inclusive NLP technologies for underserved language communities. Limitations of this study include the focus on seven languages (generalizability to other low-resource languages requires further validation), computational constraints that prevented evaluation of models exceeding 300M parameters, and potential biases introduced by dataset availability and quality across languages.
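The abstract summarizes the proposed adaptation framework only at a high level. As a rough illustration of what its first two stages typically involve, the sketch below shows vocabulary augmentation followed by continual (masked-language-model) pretraining of multilingual BERT using the Hugging Face transformers and datasets libraries. The model name, token list, corpus path, and hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch (assumed setup): augment mBERT's vocabulary with target-language
# subwords, then continue masked-language-model pretraining on monolingual text.
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Vocabulary augmentation: add frequent target-language subwords that are missing
# from the multilingual vocabulary, then resize the embedding matrix to match.
new_tokens = ["example_subword_1", "example_subword_2"]  # hypothetical tokens
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

# Continual pretraining: MLM training on a monolingual corpus in the
# low-resource language ("lowres_corpus.txt" is a placeholder path).
raw = load_dataset("text", data_files={"train": "lowres_corpus.txt"})
tokenized = raw.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="mbert-adapted", num_train_epochs=3,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```

The later stages reported in the abstract (task-adaptive fine-tuning and knowledge distillation) would follow on top of the adapted checkpoint produced here.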
License
This is an Open Access article published by the Research Center of Computing & Biomedical Informatics (RCBI), Lahore, Pakistan, under the CC BY 4.0 International License.