Language identification in Telugu-English Code-Switched sentences

Settipalli Manikanta, Somuguttu Rohith Reddy, Katta Chennakesava Naidu, Enjula Uchoi, Roddam Jaswanth Kumar Reddy, Appikonda Subrahmanyeswar Rao, Manikala Jayanth

doi:10.48165/bapas.2024.44.2.1

Published: Nov 23, 2024

Keywords: HMM, SVM, LSTM, NER, N-Gram, Rule-based Method

Abstract

In multilingual societies, code-mixing and code-switching have become prevalent, particularly in informal communication on social media platforms. This project focuses on identifying the language of each word in code-mixed and code-switched sentences, drawing on Named Entity Recognition (NER) techniques. By analyzing sentences that blend languages such as English and Telugu, we aim to improve language identification accuracy at the word level. The study employs machine learning models, including Support Vector Machines (SVM) and Hidden Markov Models (HMM), to classify words into distinct language categories. The dataset comprises annotated social media posts, providing a diverse representation of linguistic patterns. Our approach integrates NER to identify named entities, which serve as critical indicators for language classification. Preliminary results indicate that incorporating NER significantly improves the identification process, achieving an F1-score of 0.91. This research contributes to the development of robust language identification systems, facilitating better understanding and processing of code-mixed data in natural language processing applications. Future work will explore deep learning techniques to further enhance performance.
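To make the idea of word-level language identification with an HMM concrete, the following is a minimal sketch, not the authors' implementation. It assumes romanized Telugu, a two-tag set (`te`, `en`), character-bigram emissions with add-one smoothing, and a tiny hand-made training set; all names and data here are hypothetical and do not come from the paper's dataset, and the NER step is omitted.

```python
import math
from collections import Counter, defaultdict

# Toy, hand-made training data (hypothetical; NOT the paper's dataset).
# Tags: "te" = romanized Telugu, "en" = English.
TRAIN = [
    [("nenu", "te"), ("office", "en"), ("ki", "te"), ("veltunna", "te")],
    [("this", "en"), ("movie", "en"), ("chala", "te"), ("bagundi", "te")],
    [("meeru", "te"), ("ela", "te"), ("unnaru", "te")],
    [("see", "en"), ("you", "en"), ("at", "en"), ("the", "en"), ("meeting", "en")],
]
TAGS = ["te", "en"]


def char_bigrams(word):
    """Character bigrams of a word, padded with start/end markers."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def train(sentences):
    """Count start tags, tag-to-tag transitions, and per-tag bigram emissions."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    vocab = set()
    for sent in sentences:
        prev = None
        for word, tag in sent:
            if prev is None:
                start[tag] += 1
            else:
                trans[prev][tag] += 1
            for bg in char_bigrams(word):
                emit[tag][bg] += 1
                vocab.add(bg)
            prev = tag
    return start, trans, emit, vocab


def log_smoothed(counter, key, n_outcomes):
    """Add-one smoothed log-probability of `key` under `counter`."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def emission_logprob(model, tag, word):
    """Log-probability of a word's character bigrams given a language tag."""
    _, _, emit, vocab = model
    # +1 reserves probability mass for bigrams never seen in training.
    return sum(log_smoothed(emit[tag], bg, len(vocab) + 1)
               for bg in char_bigrams(word))


def viterbi(model, words):
    """Most likely tag sequence for `words` under the HMM."""
    if not words:
        return []
    start, trans, _, _ = model
    scores = [{t: log_smoothed(start, t, len(TAGS))
               + emission_logprob(model, t, words[0]) for t in TAGS}]
    back = []
    for word in words[1:]:
        row, ptr = {}, {}
        for t in TAGS:
            def through(p, t=t):
                return scores[-1][p] + log_smoothed(trans[p], t, len(TAGS))
            best_prev = max(TAGS, key=through)
            row[t] = through(best_prev) + emission_logprob(model, t, word)
            ptr[t] = best_prev
        scores.append(row)
        back.append(ptr)
    # Follow back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: scores[-1][t])
    path = [tag]
    for ptr in reversed(back):
        tag = ptr[tag]
        path.append(tag)
    return path[::-1]


model = train(TRAIN)
sentence = ["nenu", "meeting", "ki", "veltunna"]
print(list(zip(sentence, viterbi(model, sentence))))
```

Character-level emissions are used here so that words unseen in training still receive a score; a realistic system would train on a much larger annotated corpus and could add a named-entity tag alongside the two language tags.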

Issue

Vol. 44 No. 3 (2024): LIB PRO. 44(3), JUL-DEC 2024

Section

Articles
