Title | : | Building Automatic Speech Recognition Systems for Indian Languages |
Speaker | : | Tahir Javed (IITM) |
Details | : | Tue, 25 Mar, 2025 2:00 PM @ SSB 334 |
Abstract | : | Automatic Speech Recognition (ASR) for Indian languages, particularly low-resource ones, remains significantly limited due to the scarcity of labelled datasets and reliable evaluation benchmarks. Recent studies in ASR have demonstrated that large-scale unsupervised pre-training, combined with fine-tuning on smaller labelled datasets, effectively reduces the dependency on large labelled datasets. Building upon this insight, we introduced IndicWav2Vec, the first pretrained speech representation model specifically tailored for Indian languages. IndicWav2Vec was pretrained on 17,000 hours of multilingual speech data across 40 Indian languages, achieving state-of-the-art performance in 7 languages on widely used benchmarks such as MUCS, MSR, and OpenSLR. Empirical evaluations indicate that language-specific pretrained backbones significantly enhance performance; however, pretraining alone does not completely eliminate the need for high-quality, speaker-diverse, human-labelled datasets. To address the limitations of IndicWav2Vec, specifically its restricted language coverage and limited data volume, and to expand benchmarks beyond ASR, we developed IndicSUPERB, a benchmark suite containing approximately 1,680 hours of human-labelled speech data (Kathbath) collected from 1,218 speakers across 203 districts in India. IndicSUPERB covers six Speech-Language Understanding (SLU) tasks for 12 Indian languages, viz. Automatic Speech Recognition, Speaker Verification, Speaker Identification (mono/multi), Language Identification, Query by Example, and Keyword Spotting. Evaluations reveal that self-supervised models are excellent feature encoders and outperform traditional features such as FBANK across all tasks. This benchmark enables comprehensive evaluation across various modelling approaches and SLU tasks. 
Next, we focus on the problem of evaluating ASR systems for Indian-accented English and find that existing English ASR models inadequately represent the linguistic diversity inherent in Indian-accented English, resulting in significant performance gaps for non-native Indian English speakers. To study this issue, we introduce Svarah, a benchmark comprising 9.6 hours of transcribed English speech from 117 speakers whose native language is not English, distributed across 65 geographic locations throughout India. Evaluations conducted using six open-source and two commercially available ASR models reveal considerable disparities in performance between Indian-accented speech and its native counterparts, highlighting scope for improvement. Svarah thus serves as a critical resource for enhancing inclusivity and accuracy in English ASR systems for India's linguistically diverse population. |
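(The performance gaps reported on benchmarks such as Svarah and IndicSUPERB are conventionally measured as word error rate, WER. As a minimal illustrative sketch, not code from the talk, WER is the word-level Levenshtein distance between a reference transcript and an ASR hypothesis, normalised by the reference length:)

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table for word-level Levenshtein distance:
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)


# Example: one deleted word ("on") out of six reference words -> WER = 1/6.
print(wer("the cat sat on the mat", "the cat sat the mat"))
```

(In practice, benchmark evaluations use library implementations and text normalisation before scoring; the transcripts above are made-up examples.)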