Title   : Giving Voice to the Next Billion: Advancing Text-to-Speech for Indian Languages
Speaker : Praveen Srinivasa Varadhan (IITM)
Details : Wed, 26 Mar, 2025, 11:00 AM @ SSB 334
Abstract:

Can text-to-speech (TTS) systems achieve human-like quality and emotional depth even for languages with limited resources? While English TTS technology has made impressive strides, Indian languages remain significantly underserved due to data scarcity, computational challenges, and inadequate evaluation frameworks. This seminar outlines our concerted efforts to bridge these gaps, sharing innovative approaches in acoustic modeling, expressive speech synthesis, and evaluation methodology to realize more natural synthetic speech for the many languages spoken in India.

Building effective neural TTS systems for diverse Indian languages demands careful design choices. We systematically explored leading neural architectures across 13 Dravidian and Indo-Aryan languages, evaluating acoustic models, vocoders, loss functions, and training methodologies. Our experiments reveal that monolingual FastPitch models, enhanced with alignment learning and paired with HiFi-GAN vocoders, substantially improve read-speech quality, achieving significant gains in Mean Opinion Scores (MOS) over previous systems.

Beyond intelligibility, expressiveness is a major challenge: how well can machines truly convey authentic emotions like happiness or surprise? To address this, we created "Rasa", India's first expressive TTS dataset, featuring professional actors recording six fundamental emotions (happiness, sadness, anger, fear, disgust, and surprise) along with neutral speech in Assamese, Bengali, and Tamil. Our analysis demonstrated that prioritizing the collection of neutral speech alongside minimal expressive data significantly enhances emotional realism, as validated by MUSHRA evaluations. Crucially, our findings offer a scalable blueprint for extending expressiveness and highlight the importance of syllabically balanced datasets and pooling emotions.

Evaluation methodologies themselves, however, remain an unresolved challenge.
Through rigorous listening studies involving 492 participants, we uncovered critical shortcomings in popular evaluation tests, including reference-matching bias and judgement ambiguity. To address these, we proposed refined variants of the MUSHRA test, enabling fairer and more precise assessment, especially for high-quality synthetic speech that surpasses traditional human benchmarks. As a by-product of this analysis, we introduced MANGO, a large-scale dataset of 246,000 human ratings, the largest of its kind for Indian languages, to facilitate future advances in automatic TTS evaluation.

Together, these interconnected initiatives, (i) improved acoustic models, (ii) expressive speech datasets, and (iii) more reliable evaluation frameworks, bring us closer to our broader vision: building multi-speaker, multi-style neural speech synthesizers that are natural, intelligible, and prosodically rich for the major languages spoken across India.
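For readers unfamiliar with how MOS comparisons like those described above are aggregated, the sketch below shows the standard computation: average the listeners' 1-5 ratings per system and attach a normal-approximation 95% confidence interval. The ratings and system names here are purely hypothetical, not results from the talk.

```python
import statistics

def mean_opinion_score(ratings):
    """Return (MOS, 95% CI half-width) for a list of 1-5 listener ratings.

    Uses the normal approximation, which is reasonable for the
    large listening panels described in the abstract.
    """
    n = len(ratings)
    mos = statistics.mean(ratings)
    half_width = 1.96 * statistics.stdev(ratings) / n ** 0.5 if n > 1 else 0.0
    return mos, half_width

# Hypothetical ratings for two systems on the same test sentences.
baseline_ratings = [3, 4, 3, 3, 4, 3, 2, 4, 3, 3]
proposed_ratings = [4, 5, 4, 4, 5, 4, 4, 5, 4, 4]

for name, ratings in [("baseline", baseline_ratings),
                      ("proposed", proposed_ratings)]:
    mos, ci = mean_opinion_score(ratings)
    print(f"{name}: MOS = {mos:.2f} +/- {ci:.2f}")
```

Non-overlapping confidence intervals between systems are a common (if conservative) signal that a MOS improvement is meaningful; MUSHRA tests differ mainly in using a 0-100 scale with hidden reference and anchor stimuli, but are aggregated similarly.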