
The TidyVoice Challenge addresses the open problem of speaker verification under language mismatch. Speaker verification performance degrades significantly when the enrollment and test languages differ, a problem exacerbated by the field's reliance on English-centric data.
This challenge leverages the TidyVoiceX dataset from the novel TidyVoice benchmark, a large-scale multilingual corpus derived from the Mozilla Common Voice dataset and specifically curated to isolate the effect of language switching across around 40 languages. Participants are tasked with building systems robust to this mismatch, with performance primarily evaluated using the Equal Error Rate (EER) on cross-language trials.
By providing standardized data, open-source baselines, and a rigorous evaluation protocol, this challenge aims to drive research towards fairer, more inclusive, and language-independent speaker recognition technologies.
The TidyVoice Challenge is an open-condition challenge: participants are permitted to use any public or private datasets to train their systems, in addition to the provided TidyVoiceX training partition. Participants are also encouraged to use pre-trained models (e.g., ResNet, or SSL models such as wav2vec2 and WavLM). The only restriction concerns Mozilla Common Voice: only the official TidyVoiceX training partition may be used, and all other Common Voice data is strictly forbidden. The core task is speaker verification: for each trial pair, systems must output a similarity score.
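As a concrete illustration of the required output, the sketch below scores a single trial with cosine similarity between two speaker embeddings. It is a minimal example, not the official baseline: the embedding extractor is left abstract (any of the permitted pre-trained models could produce the vectors), and all names and dimensions are illustrative.

```python
# Illustrative scoring sketch: given speaker embeddings for the enrollment and
# test utterances of a trial, output a cosine similarity score.
import numpy as np

def cosine_score(enroll_emb: np.ndarray, test_emb: np.ndarray) -> float:
    """Cosine similarity between two fixed-dimensional speaker embeddings."""
    enroll_emb = enroll_emb / np.linalg.norm(enroll_emb)
    test_emb = test_emb / np.linalg.norm(test_emb)
    return float(np.dot(enroll_emb, test_emb))

# Placeholder embeddings; a real system would extract these from the audio
# with a pre-trained or fine-tuned speaker encoder.
rng = np.random.default_rng(0)
enroll, test = rng.normal(size=256), rng.normal(size=256)
print(cosine_score(enroll, test))
```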
Primary Evaluation Metric: Equal Error Rate (EER).
Secondary Metric: Minimum Detection Cost Function (minDCF) for comprehensive performance analysis.
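For reference, the sketch below shows how EER and minDCF can be computed with NumPy from a list of trial scores and target/non-target labels. The minDCF operating-point parameters (p_target, c_miss, c_fa) are placeholder assumptions for illustration; the official values are those defined in the challenge evaluation protocol.

```python
import numpy as np

def compute_eer(scores, labels):
    """Equal Error Rate from raw trial scores and binary labels (1 = target)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(scores)[::-1]           # sort trials by score, descending
    labels = labels[order]
    n_target = labels.sum()
    n_nontarget = len(labels) - n_target
    fa = np.cumsum(1 - labels) / n_nontarget   # false-acceptance rate per threshold
    fr = 1 - np.cumsum(labels) / n_target      # false-rejection rate per threshold
    idx = np.argmin(np.abs(fa - fr))           # operating point where FA ~= FR
    return (fa[idx] + fr[idx]) / 2

def compute_min_dcf(scores, labels, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Normalized minimum detection cost (parameter values are illustrative)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(scores)[::-1]
    labels = labels[order]
    n_target = labels.sum()
    n_nontarget = len(labels) - n_target
    p_fa = np.cumsum(1 - labels) / n_nontarget
    p_miss = 1 - np.cumsum(labels) / n_target
    dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)
    return dcf.min() / min(c_miss * p_target, c_fa * (1 - p_target))

if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
    labels = [1, 1, 0, 1, 0, 0]
    print("EER:", compute_eer(scores, labels))       # 0.333 for this toy example
    print("minDCF:", compute_min_dcf(scores, labels))
```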
The challenge uses the TidyVoiceX dataset, a curated partition of the Mozilla Common Voice dataset spanning around 40 languages.
The TidyVoice Challenge is organized in two main phases:
Development Phase: Participants will use the provided training and development datasets to build and tune their systems, experimenting freely with different approaches, architectures, and hyperparameters.
Validation Phase: In this phase, the development dataset will be released with ground-truth labels. Participants will submit their results on the development set by uploading them to the CodaBench website. Rankings will be determined by performance on the development set, allowing participants to compare their systems against others on the leaderboard.
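As a rough illustration of preparing a submission, the sketch below writes one similarity score per trial pair to a plain-text file. The trial identifiers, filename, and column layout here are hypothetical; the authoritative submission format and packaging are those specified on the CodaBench page.

```python
# Hypothetical submission sketch: one "enrollment test score" line per trial.
# Adjust to the format required by the CodaBench submission instructions.
trials = [
    ("enrol_utt_001.wav", "test_utt_042.wav", 0.731),
    ("enrol_utt_002.wav", "test_utt_017.wav", -0.128),
]

with open("dev_scores.txt", "w") as f:
    for enrol, test, score in trials:
        f.write(f"{enrol} {test} {score:.6f}\n")
```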
Development Phase: The development trial pairs include four types to help participants assess whether their systems distinguish speakers rather than languages:
Evaluation Phase: Participants will evaluate and submit results for two trial pair lists:
These trial structures are designed to evaluate systems’ ability to eliminate language effects and perform robust speaker verification across languages, including languages not encountered during training.
A speech signal from a single person carries multiple types of information: the speaker's identity, the linguistic content, emotional cues, the language being spoken, and more. In this challenge, we aim to develop systems that, given a speech signal, suppress the effect of language and perform speaker verification in a language-independent manner.