Vaani: The Dataset Powering Inclusive Indian Language Research

Community Article Published April 24, 2026

India is a linguistically diverse country, with hundreds of languages spoken across its regions. However, building robust models and enabling research across this spectrum remains challenging due to the limited availability of resources for many languages. Vaani is designed to address this gap by capturing multimodal data that reflects diverse linguistic, geographic, and accent variations across the country.

Vaani is a multi-modal, multilingual dataset representative of India's linguistic diversity. The current version comprises approximately 31,255 hours of spontaneous, image-prompted speech collected from 156,000 speakers across 165 districts. The dataset includes descriptions of 288,000 images and covers 109 languages. From this corpus, 2,043 hours of transcribed speech data are available, distributed nearly evenly across the 165 districts. At its core, Vaani is a geocentric speech corpus: rather than collecting audio opportunistically, it was designed around place. By anchoring data collection to 165 distinct districts, the project captured not just linguistic diversity but the living texture of regional accent: the way a speaker from coastal Andhra differs from one in the Deccan plateau, or how Bhojpuri sounds across the plains of eastern UP.
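The geocentric design can be illustrated with a small sketch: given a manifest of clips tagged with district and duration, one can audit per-district balance before sampling or training. The manifest layout and district names below are illustrative assumptions, not Vaani's actual schema.

```python
from collections import defaultdict

# Hypothetical manifest entries: (district, duration in hours) per clip.
# The field layout is illustrative only, not Vaani's actual schema.
manifest = [
    ("Visakhapatnam", 1.5),
    ("Visakhapatnam", 2.0),
    ("Bhojpur", 3.0),
    ("Bhojpur", 0.5),
    ("East Garo Hills", 2.5),
]

def hours_per_district(entries):
    """Aggregate clip durations by district to audit geographic balance."""
    totals = defaultdict(float)
    for district, hours in entries:
        totals[district] += hours
    return dict(totals)

print(hours_per_district(manifest))
# {'Visakhapatnam': 3.5, 'Bhojpur': 3.5, 'East Garo Hills': 2.5}
```

A balance check like this is what distinguishes a place-anchored corpus from an opportunistic one: skew toward any single district becomes visible before it leaks into a trained model.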

Since the release of the dataset, it has been used across a wide spectrum of research, in addition to its applications in industry. These use cases can be broadly categorized as follows.

| Use-Case | What Vaani Enables |
| --- | --- |
| 🎙️ ASR Development | The linguistic diversity of Vaani enables training accent-diverse models that generalize beyond standard dialects, while also supporting ASR development for low-resource languages. |
| 🌐 Speech Translation | Supports both end-to-end and cascaded speech translation for Indic languages. |
| 📊 Benchmarks & Evaluation | Enables region-specific and culturally grounded evaluation of speech and language models. |
| 🔁 Voice Conversion | The diverse speaker coverage facilitates the development of robust voice conversion systems. |
| 🖼️ Multimodal Systems | The multimodal nature of the dataset enables training and adaptation of audio-visual models in Indian language settings. |

Building Inclusive ASR, One Accent at a Time

The most direct and impactful application of Vaani data has been in automatic speech recognition. When researchers set out to build speech recognition for Garo, a Tibeto-Burman language spoken primarily in Meghalaya, the Vaani corpus provided a foundation that would have taken years to assemble independently.

Equally significant is the ongoing effort to build ASR for Bhojpuri-speaking women. This project is notable not just for its linguistic scope, but for what it reveals about the gaps in existing speech data: most large-scale corpora skew toward male speakers, towards urban accents, and towards dominant regional varieties. Vaani's geocentric design at least partially corrects for these biases. In another work, a team from the University of Notre Dame used the Vaani dataset to study cross-lingual ASR transfer for low-resource Indic language varieties, focusing on spontaneous, noisy, and code-mixed speech across a wide range of dialects.


Benchmarking Cultural and Phonetic Reality

Evaluation is only as good as the benchmark, and most benchmarks were built for English. Two recent efforts have used Vaani to change that for Indian languages.

HinTel-AlignBench evaluates Vision-Language Models on cultural and region-specific context. The benchmark leverages images from Hindi- and Telugu-speaking regions in the Vaani corpus, using them as visual grounding for language-specific multiple-choice questions generated with GPT-4.1. The result is an evaluation suite that probes whether a Vision-Language Model (VLM) genuinely understands regional Indian context.
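At evaluation time, a benchmark of this shape reduces to scoring a model's picks against gold options. A minimal sketch, with hypothetical answers rather than HinTel-AlignBench's actual format:

```python
def mcq_accuracy(predictions, gold):
    """Fraction of multiple-choice questions answered correctly."""
    assert len(predictions) == len(gold), "prediction/gold lists must align"
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Hypothetical model picks vs. gold answers for four image-grounded questions.
preds = ["B", "C", "A", "D"]
gold  = ["B", "C", "B", "D"]
print(mcq_accuracy(preds, gold))  # 0.75
```

What makes the benchmark valuable is not this scoring step, which is trivial, but the grounding: the options are tied to region-specific Vaani images, so a model cannot score well from generic world knowledge alone.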

PRiSM takes a different angle: it is the first open-source benchmark designed specifically to reveal blind spots in phonetic perception. Given the phonological richness of Indian languages (retroflex consonants, aspirated stops, tonal distinctions), a benchmark built on Vaani data is well-positioned to expose exactly the kinds of errors that models trained on Western speech data tend to make.


Voice Conversion and Multimodal Frontiers

The scale and speaker diversity of Vaani have also made it a natural fit for voice conversion research. EZ-VC simplifies voice conversion by pairing a self-supervised speech encoder with a non-autoregressive flow-matching decoder. Trained on 3,790 hours of multilingual Vaani speech, EZ-VC demonstrates that high-quality voice conversion is achievable without the complexity that has historically made the task inaccessible to smaller research groups.
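For readers unfamiliar with the decoder family mentioned above, the core of flow matching is a simple regression target. The sketch below shows the generic linear-interpolant formulation in NumPy; it is a textbook illustration of the technique, not EZ-VC's implementation, and the feature dimensions are arbitrary.

```python
import numpy as np

def flow_matching_target(x0, x1, t):
    """Linear-interpolant conditional flow matching.

    x0: a noise sample, x1: a data sample, t in [0, 1].
    Returns the interpolated point x_t = (1-t)*x0 + t*x1 and the
    constant velocity target (x1 - x0) the decoder learns to regress.
    """
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

rng = np.random.default_rng(0)
x0 = rng.normal(size=(80,))   # noise, e.g. one mel-spectrogram frame
x1 = rng.normal(size=(80,))   # target speech features
x_t, v = flow_matching_target(x0, x1, t=0.3)

# At t=0 the interpolant is pure noise; at t=1 it is the data sample.
assert np.allclose(flow_matching_target(x0, x1, 0.0)[0], x0)
assert np.allclose(flow_matching_target(x0, x1, 1.0)[0], x1)
```

Because the target velocity is computed in closed form rather than through an iterative denoising chain, training is a plain regression problem, which is part of why the approach is attractive to smaller research groups.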

Perhaps the most distinctive aspect of Vaani, and the one least often discussed, is its multimodal nature. The dataset includes not just audio and transcriptions, but images, making it one of the few Indian language corpora that spans the audio-visual space. Researchers have used this to demonstrate the effectiveness of modern audio-visual learning techniques in low-resource settings, an important proof point as the field moves toward richer, multi-sense representations of language.


A Standalone Resource and a Building Block

What emerges from surveying Vaani's use cases is a picture of a dataset that is both self-sufficient and collaborative. It stands alone as a training resource for ASR, voice conversion, and multimodal systems. But it also integrates gracefully with other data collection initiatives: researchers have combined Vaani with complementary corpora to fill geographic or demographic gaps that neither alone could cover.

This combination of scale, geographic intentionality, speaker diversity, and multimodal coverage explains why Vaani has become a reference point for Indic speech research. In a field where data scarcity is the primary bottleneck for low-resource languages, a corpus designed around place rather than convenience is not just useful; it is rare.

The work is far from finished. There are hundreds of Indian languages and thousands of dialects still underrepresented in existing corpora. But Vaani has demonstrated what a geocentric, community-oriented approach to data collection can achieve, and in doing so it has set a bar for what inclusive speech technology should look like.
