Language Identification in Telugu-English Code-Switched Sentences
Abstract
In multilingual societies, code-mixing and code-switching have become prevalent, particularly in informal communication on social media platforms. This project addresses word-level language identification in code-mixed and code-switched sentences that blend English and Telugu, using Named Entity Recognition (NER) techniques to improve accuracy. The study employs machine learning models, including Support Vector Machines (SVM) and Hidden Markov Models (HMM), to classify each word into a language category. The dataset comprises annotated social media posts, chosen to give a diverse representation of linguistic patterns. Our approach integrates NER to identify named entities, which serve as critical indicators for language classification. Preliminary results indicate that incorporating NER significantly improves language identification, achieving an F1-score of 0.91. This research contributes to the development of robust language identification systems and supports better understanding and processing of code-mixed data in natural language processing applications. Future work will explore deep learning techniques to further enhance performance.
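To make the word-level tagging idea concrete, the following is a minimal sketch of an HMM tagger decoded with the Viterbi algorithm. It is not the authors' implementation: the abstract does not specify the tag set, features, or training data, so the tags (`EN`, `TE`, `NE`), the tiny romanized toy corpus, the add-one smoothing, and the function names `train_hmm`, `viterbi`, and `tag_sentence` are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Assumed tag set: English, Telugu, named entity (not taken from the paper).
TAGS = ["EN", "TE", "NE"]

# Hypothetical toy corpus of romanized Telugu-English sentences, for illustration only.
TRAIN = [
    [("nenu", "TE"), ("movie", "EN"), ("chusanu", "TE")],
    [("idi", "TE"), ("chala", "TE"), ("good", "EN"), ("movie", "EN")],
    [("hyderabad", "NE"), ("lo", "TE"), ("traffic", "EN"),
     ("chala", "TE"), ("ekkuva", "TE")],
    [("nenu", "TE"), ("hyderabad", "NE"), ("lo", "TE"), ("unnanu", "TE")],
]


def train_hmm(sentences):
    """Count start, transition, and emission events from tagged sentences."""
    start = Counter()
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    vocab = set()
    for sent in sentences:
        prev = None
        for word, tag in sent:
            vocab.add(word)
            emit[tag][word] += 1
            if prev is None:
                start[tag] += 1
            else:
                trans[prev][tag] += 1
            prev = tag
    return start, trans, emit, vocab


def log_prob(count, total, n_outcomes):
    """Add-one (Laplace) smoothed log probability."""
    return math.log((count + 1) / (total + n_outcomes))


def viterbi(words, model):
    """Return the most likely tag sequence for a list of words."""
    if not words:
        return []
    start, trans, emit, vocab = model
    n_words = len(vocab) + 1  # one extra slot for unseen words
    n_start = sum(start.values())

    def e(tag, word):
        return log_prob(emit[tag][word], sum(emit[tag].values()), n_words)

    def t(prev, tag):
        return log_prob(trans[prev][tag], sum(trans[prev].values()), len(TAGS))

    # scores[i][tag] = best log probability of any path ending in `tag` at word i
    scores = [{tag: log_prob(start[tag], n_start, len(TAGS)) + e(tag, words[0])
               for tag in TAGS}]
    back = []
    for word in words[1:]:
        row, ptr = {}, {}
        for tag in TAGS:
            best_prev = max(TAGS, key=lambda p: scores[-1][p] + t(p, tag))
            row[tag] = scores[-1][best_prev] + t(best_prev, tag) + e(tag, word)
            ptr[tag] = best_prev
        scores.append(row)
        back.append(ptr)

    # Follow the back-pointers from the best final tag.
    path = [max(TAGS, key=lambda tag: scores[-1][tag])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))


def tag_sentence(text, model):
    """Lowercase and whitespace-tokenize `text`, then pair each word with a tag."""
    words = text.lower().split()
    return list(zip(words, viterbi(words, model)))


model = train_hmm(TRAIN)
print(tag_sentence("nenu movie chusanu", model))
```

With this toy model, `tag_sentence("Hyderabad lo traffic", model)` labels the place name as `NE`, which illustrates how a named-entity tag can keep proper nouns from being forced into either language class. A real system would need a far larger annotated corpus and richer features such as character n-grams.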