Analyzing the Impact of Pretrained Language Models on Low-Resource Languages

Authors

  • Muhammad Irshad Hussain, Riyadh Lake Real Estate Development Company (PIF subsidiary), Saudi Arabia.
  • Shafiq Hussain, Department of Computer Science, University of Sahiwal, Sahiwal, 57000, Pakistan.
  • Aleena Jamil, Department of Computer Science, University of Sahiwal, Sahiwal, 57000, Pakistan.
  • Adeen Amjad, Department of Computer Science, University of Sahiwal, Sahiwal, 57000, Pakistan.
  • Sajid Iqbal, Department of Information Systems, King Faisal University, Al-Ahsa, Saudi Arabia.

DOI:

https://doi.org/10.56979/1101/2026/1332

Keywords:

Low-Resource Languages, Pretrained Language Models, Cross-Lingual Transfer, Multilingual NLP, Adaptation Framework

Abstract

The rapid advancement of natural language processing has predominantly benefited high-resource languages such as English, Chinese, and Spanish, leaving thousands of languages underserved. This digital language divide limits equitable access to technology and threatens global linguistic diversity. This paper presents a systematic evaluation of eight pretrained language models across seven low-resource languages representing five distinct language families. Through extensive experiments on sentiment analysis, named entity recognition, and machine translation tasks, we demonstrate that multilingual BERT achieves the highest average accuracy (74.5%). We further propose a novel adaptation framework combining vocabulary augmentation, continual pretraining, task-adaptive fine-tuning, and knowledge distillation that improves performance by up to 18.7%. Our analysis identifies vocabulary overlap as the strongest predictor of cross-lingual transfer success, explaining 76.3% of performance variance. These findings provide evidence-based guidelines for researchers and practitioners developing inclusive NLP technologies for underserved language communities. Limitations of this study include its focus on seven languages (generalizability to other low-resource languages requires further validation), computational constraints that prevented evaluation of models exceeding 300M parameters, and potential biases introduced by dataset availability and quality across languages.
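
To make the adaptation framework concrete, the sketch below illustrates its first two stages, vocabulary augmentation and continual pretraining, using the Hugging Face transformers and datasets APIs. It is a minimal illustration of the general technique, not the authors' implementation; the added token list and corpus path are hypothetical placeholders.

```python
# Illustrative sketch (not the authors' code): vocabulary augmentation
# followed by continual pretraining of multilingual BERT on unlabeled
# text in a low-resource language.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "bert-base-multilingual-cased"  # multilingual BERT
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Vocabulary augmentation: register frequent target-language subwords
# missing from mBERT's vocabulary, then grow the embedding matrix so
# the new tokens receive trainable vectors.
new_tokens = ["subword_a", "subword_b"]  # hypothetical mined subwords
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

# Continual pretraining: masked language modeling on raw monolingual
# text in the target language (hypothetical file path).
raw = load_dataset("text", data_files={"train": "target_language_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

train_set = raw["train"].map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mbert-adapted",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=train_set,
    data_collator=collator,
)
trainer.train()
```

Resizing the embedding matrix immediately after adding tokens is what allows the new subwords to acquire useful representations during the masked language modeling stage; the remaining stages of the framework (task-adaptive fine-tuning and knowledge distillation) would follow as separate steps.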

Published

2026-04-18

How to Cite

Muhammad Irshad Hussain, Shafiq Hussain, Aleena Jamil, Adeen Amjad, & Sajid Iqbal. (2026). Analyzing the Impact of Pretrained Language Models on Low-Resource Languages. Journal of Computing & Biomedical Informatics, 11(01). https://doi.org/10.56979/1101/2026/1332

Issue

Vol. 11 No. 01 (2026)

Section

Articles