Analysis and Clustering of Pakistani Music by Lyrics: A Study of CokeStudio Pakistan


  • Zia Ur Rahman, University of Engineering & Technology, Peshawar, 25000, Pakistan.
  • Muhammad Imran Khan Khalil, University of Engineering & Technology, Peshawar, 25000, Pakistan.
  • Asif Nawaz, Faculty of Electrical Engineering, Engineering Technology and Sciences Division, Higher Colleges of Technology, United Arab Emirates, 16062.
  • Izaz Ahmad Khan, Bacha Khan University, Charsadda, 24420, Pakistan.
  • Naveed Jan, Shuhada-e-APS, University of Technology, Nowshera, Pakistan.
  • Sheeraz Ahmad, Iqra National University, Peshawar, 25000, Pakistan.


Clustering, Lyrics Analysis, Cultural Exploration, Unsupervised Learning, Text Classification, Natural Language Processing, CokeStudio


This research applies unsupervised learning techniques to categorize and interpret the lyrical content of CokeStudio songs. Music transcends cultural boundaries, and this study examines the lyrics for the emotions, themes, and cultural nuances they carry. We begin by employing Natural Language Processing (NLP) and analysis techniques to uncover the emotional content of the lyrics, which then serves as the foundation for the clustering process. Multiple unsupervised learning algorithms, including K-Means, Hierarchical Clustering, and DBSCAN, are used to group songs into thematic clusters. Cluster quality is assessed using the silhouette score; the optimal number of clusters is found to be 5, with a silhouette score of 0.41641. To evaluate the clustering, we further develop a classification model using machine learning algorithms: Logistic Regression, Decision Trees, Random Forest, Gradient Boosting, and Multinomial Naive Bayes. This model assigns CokeStudio songs to thematic clusters based on the results of topic modeling. Logistic Regression, with SMOTE (Synthetic Minority Over-sampling Technique) applied to NMF (Non-negative Matrix Factorization) values, is the best-performing model, achieving a testing score of 89.47%. The findings highlight the emotions and narratives in CokeStudio songs and demonstrate the practical application of machine learning to music analysis. By identifying and classifying thematic clusters within song lyrics, this study contributes to the understanding of cultural expression through music and opens avenues for personalized music recommendations.
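
For readers unfamiliar with the silhouette score used above to judge cluster quality, the following is a minimal sketch of how it can be computed. It is not the authors' implementation: the abstract does not state the distance metric or tooling, so Euclidean distance is assumed, the function name `silhouette_score` is illustrative, and the toy points are invented purely for demonstration and are unrelated to the CokeStudio data.

```python
from collections import defaultdict
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance assumed)."""
    # Group point indices by their cluster label.
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton cluster: s(i) is 0 by convention
        # a: mean distance from point i to the other members of its own cluster.
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance from point i to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Toy 2-D points (hypothetical, not song features): two well-separated pairs.
toy_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good_labels = [0, 0, 1, 1]  # groups nearby points together
bad_labels = [0, 1, 0, 1]   # splits each nearby pair across clusters

print("good clustering:", round(silhouette_score(toy_points, good_labels), 3))
print("bad clustering: ", round(silhouette_score(toy_points, bad_labels), 3))
```

Scores range from -1 to 1, with higher values indicating tighter, better-separated clusters. Under this reading, choosing the number of clusters amounts to running the clustering for several candidate values and keeping the one with the highest score.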




How to Cite

Zia Ur Rahman, Muhammad Imran Khan Khalil, Asif Nawaz, Izaz Ahmad Khan, Naveed Jan, & Sheeraz Ahmad. (2024). Analysis and Clustering of Pakistani Music by Lyrics: A Study of CokeStudio Pakistan. Journal of Computing & Biomedical Informatics, 7(01), 281–296. Retrieved from