Challenge Description

The TidyVoice Challenge directly addresses the open problem of speaker verification under language mismatch. The field has long been shaped by the NIST Speaker Recognition Evaluation (SRE), a prestigious benchmark that has driven significant technological advances. Despite its influence, however, the SRE framework has two key limitations in today’s globalized landscape. First, the evaluation data is proprietary, which restricts access and hinders reproducible research across the broader community. Second, and more crucially, the evaluations have historically covered a limited set of languages, failing to adequately address the unique challenges of cross-lingual speaker verification in a diverse, multilingual world.

By building the challenge on a large-scale, public, multilingual dataset, we aim to create a more inclusive and accessible benchmark, fostering the development of robust and equitable speaker recognition systems that perform reliably across languages.

By extending speaker verification to new languages and naturalistic cross-lingual conditions, the TidyVoice Challenge aligns closely with the Interspeech 2026 theme, “Speaking Together,” which emphasizes diversity, equity, and inclusivity in speech science and technology. The challenge provides a standardized benchmark and a rigorous evaluation protocol to assess system robustness, calibration, and fairness across a wide array of languages and demographic groups. To increase fairness and safeguard privacy, the original speaker identities are pseudonymized.
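The description above does not fix the evaluation metrics. In speaker verification, discrimination is conventionally summarized by the equal error rate (EER) and calibration by the Cllr cost (Brümmer and du Preez, 2006); the sketch below illustrates both, assuming only that systems emit one log-likelihood-ratio score per trial (nothing here is TidyVoice-specific):

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    """Equal error rate: the threshold-sweep point where the
    false-rejection rate (FRR) meets the false-acceptance rate (FAR)."""
    tar = np.sort(np.asarray(target_scores, dtype=float))
    non = np.asarray(nontarget_scores, dtype=float)
    # Candidate thresholds at each target score (ascending): FRR is the
    # fraction of targets strictly below the threshold, FAR the fraction
    # of non-targets at or above it.
    frr = np.arange(tar.size) / tar.size
    far = np.array([(non >= t).mean() for t in tar])
    i = np.argmin(np.abs(frr - far))
    return 0.5 * (frr[i] + far[i])

def cllr(target_llrs, nontarget_llrs):
    """Calibration-sensitive log-likelihood-ratio cost; inputs must be
    base-e LLRs. Lower is better; 1.0 is an uninformative system."""
    tar = np.asarray(target_llrs, dtype=float)
    non = np.asarray(nontarget_llrs, dtype=float)
    # logaddexp(0, x) = ln(1 + e^x), computed stably; divide by ln 2
    # to express the cost in bits.
    return 0.5 * (np.mean(np.logaddexp(0.0, -tar))
                  + np.mean(np.logaddexp(0.0, non))) / np.log(2.0)

# Toy check with synthetic, roughly calibrated scores.
rng = np.random.default_rng(0)
tar = rng.normal(2.0, 1.0, 1000)     # same-speaker trial scores
non = rng.normal(-2.0, 1.0, 10000)   # different-speaker trial scores
print(f"EER  = {eer(tar, non):.2%}")
print(f"Cllr = {cllr(tar, non):.3f}")
```

Reporting such metrics per language or demographic subgroup, rather than only pooled, is one natural way to surface the fairness gaps the protocol targets.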

Why is this challenge significant?

  • Addresses a Key Scientific Gap: While many benchmarks test speaker verification “in the wild”, they often fail to disentangle the various sources of variability, such as noise, channel effects, speaking style, and especially language. The TidyVoice dataset is uniquely designed to isolate the impact of language switching by offering curated data from the same speaker across different languages. This enables a focused investigation into the development of truly language-independent speaker embeddings (see the trial-construction sketch after this list).

  • Promotes Fairness and Inclusivity: The historical overreliance on English-centric datasets has been a significant bottleneck in the field. This challenge is built upon a dataset encompassing approximately 40 distinct languages and around 4,500 multilingual speakers, motivating the community to create systems that are not biased towards a single language and that perform equitably for multilingual speakers, a large and ever-growing global demographic.

  • Provides a New Public Resource: Unlike some challenges that rely on proprietary benchmarks such as the NIST SRE, the TidyVoice dataset, along with its evaluation protocols and baseline models, will be made publicly available alongside the challenge. This commitment to open science promotes reproducible research and lowers the barrier to entry for academic and industry labs worldwide to tackle the cross-lingual speaker verification problem.

  • Establishes a Benchmark for Read Speech and Beyond: Because it is built on the Mozilla Common Voice (MCV) corpus, this challenge focuses specifically on the domain of read speech. This controlled setting minimizes stylistic and phonetic variability, allowing a more precise analysis of cross-lingual acoustic modeling. Prior work has already highlighted both the potential and the limitations of MCV for phonetic and speaker-related research. Building on these insights, this large-scale, cleanly labeled, multilingual dataset of bona fide speech provides an invaluable foundation for adjacent research areas, such as the development and evaluation of robust anti-spoofing countermeasures.
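To make concrete how same-speaker recordings in multiple languages isolate language switching (the point raised in the first bullet above), the sketch below pairs utterances into trials labeled along two independent axes: speaker match and language match. The metadata fields (`utt_id`, `speaker`, `lang`) are illustrative assumptions, not the official TidyVoice schema:

```python
import itertools

# Hypothetical utterance records; the real TidyVoice metadata may differ.
utts = [
    {"utt_id": "u1", "speaker": "spk_a", "lang": "de"},
    {"utt_id": "u2", "speaker": "spk_a", "lang": "fr"},
    {"utt_id": "u3", "speaker": "spk_a", "lang": "de"},
    {"utt_id": "u4", "speaker": "spk_b", "lang": "de"},
    {"utt_id": "u5", "speaker": "spk_b", "lang": "fr"},
]

def make_trials(utts):
    """Enumerate all utterance pairs as verification trials, tagging
    each with a same-speaker label and a cross-lingual flag so the two
    factors can be analyzed independently."""
    return [
        {
            "enroll": a["utt_id"],
            "test": b["utt_id"],
            "target": a["speaker"] == b["speaker"],
            "cross_lingual": a["lang"] != b["lang"],
        }
        for a, b in itertools.combinations(utts, 2)
    ]

for trial in make_trials(utts):
    print(trial)
```

With the trial lists matched in every other respect, the gap between same-language and cross-language EER (computed as in the earlier sketch) directly quantifies the language-mismatch penalty.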

Why is this challenge timely?

  • Growing Need for Multilingual Systems: As speech technology becomes increasingly global, the demand for systems that work across languages has never been greater. This challenge directly addresses this need by providing a rigorous evaluation framework for cross-lingual speaker verification.

  • Advancements in Multilingual Modeling: Recent advances in multilingual and cross-lingual speech processing have made this an opportune time to establish benchmarks that push the boundaries of language-independent speaker recognition.

  • Open Science Movement: The commitment to open data and reproducible research aligns with the growing movement towards more accessible and transparent scientific practices in speech technology.