Marathi Dialect Detector: Text and Speech Normalization to Standard Marathi and Hindi

Prasad Sanjay Gavali

doi:10.69968/ijisem.2026v5i2116-121

Authors

Prasad Sanjay Gavali, KIT College of Engineering, Kolhapur, Dept. of Computer Science and Engineering (AI & ML)

Keywords:

Marathi dialects, speech processing, BERT, wav2vec, Indic languages, language normalization, NLP, ASR

Abstract

Marathi is spoken across Maharashtra and nearby regions in many local forms, such as Varhadi, Puneri, Kolhapuri, Marathwada and coastal varieties like Malvani and Konkani-influenced Marathi. These dialects differ in pronunciation, vocabulary and sometimes grammar, while most digital systems still expect clean Standard Marathi or Hindi text. When users speak or write in their natural dialect, systems such as educational portals, government websites and chatbots may fail to understand the input or may return poor-quality output. This paper describes a small but complete framework for handling such cases. The proposed Indic Language Dialect Detector accepts both text and speech in selected Marathi regional dialects and produces normalized text in Standard Marathi and, optionally, in Hindi. For text input, the system fine-tunes a multilingual BERT-style encoder on pairs of dialect sentences and their normalized versions, and uses the resulting representation for both dialect classification and text normalization. For speech input, a wav2vec-style automatic speech recognition (ASR) model first converts audio to text, which is then passed through the same BERT-based module. A simple web interface connects these components: users can type text or record audio, and then see the dialect label and the normalized output. The system is evaluated on a small curated dataset collected from speakers of five dialects. We report dialect classification accuracy, normalization quality using sequence-level metrics, and qualitative examples. Although the dataset is limited, the results suggest that combining modern text and speech models with basic rule-based handling is a practical way to support dialect users in low-resource Indian language settings.
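The routing described in the abstract (speech goes through ASR first, then both input types share one text module) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the class name DialectPipeline, the Result fields, and all three stub functions are hypothetical, and the tokens "dx_word" and "std_word" are placeholders rather than real Marathi data. In a real system the stubs would be replaced by a wav2vec-style ASR model and a fine-tuned BERT-style encoder.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union


@dataclass
class Result:
    """Output shown to the user: dialect label plus normalized text."""
    dialect: str
    standard_marathi: str
    hindi: Optional[str] = None


class DialectPipeline:
    """Hypothetical wiring of the components named in the abstract."""

    def __init__(
        self,
        asr: Callable[[bytes], str],
        classify: Callable[[str], str],
        normalize: Callable[[str, str], str],
        to_hindi: Optional[Callable[[str], str]] = None,
    ) -> None:
        self.asr = asr
        self.classify = classify
        self.normalize = normalize
        self.to_hindi = to_hindi

    def run(self, item: Union[str, bytes], want_hindi: bool = False) -> Result:
        # Speech input is transcribed first; text input skips this step.
        text = self.asr(item) if isinstance(item, bytes) else item
        # Both input types then share the same text module.
        dialect = self.classify(text)
        standard = self.normalize(text, dialect)
        hindi = None
        if want_hindi and self.to_hindi is not None:
            hindi = self.to_hindi(standard)
        return Result(dialect, standard, hindi)


# --- Stand-in components (assumptions, for illustration only) ---

def stub_asr(audio: bytes) -> str:
    # Placeholder: pretends the audio bytes already hold the transcript.
    return audio.decode("utf-8")


def stub_classify(text: str) -> str:
    # Placeholder for the encoder's classification head.
    return "dialect_A" if "dx_word" in text.split() else "standard"


def stub_normalize(text: str, dialect: str) -> str:
    # Placeholder for normalization: a one-entry rule-based lexicon.
    lexicon = {"dx_word": "std_word"}
    return " ".join(lexicon.get(token, token) for token in text.split())


pipeline = DialectPipeline(stub_asr, stub_classify, stub_normalize)
text_result = pipeline.run("a dx_word b")
speech_result = pipeline.run(b"a dx_word b")
```

Because the ASR step only produces text, typed and spoken versions of the same sentence yield the same result in this sketch, which is the main design point of sharing one text module across both input paths.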
