
The TidyVoice Challenge addresses the open problem of speaker verification under language mismatch. Speaker verification performance degrades significantly when the enrollment and test languages differ, a problem exacerbated by the field's reliance on English-centric data.
This challenge leverages the TidyVoiceX dataset from the novel TidyVoice benchmark, a large-scale multilingual corpus derived from the Mozilla Common Voice dataset and specifically curated to isolate the effect of language switching across around 40 languages. Participants are tasked with building systems robust to this mismatch, with performance primarily evaluated using the Equal Error Rate (EER) on cross-language trials.
By providing standardized data, open-source baselines, and a rigorous evaluation protocol, this challenge aims to drive research towards fairer, more inclusive, and language-independent speaker recognition technologies.
The TidyVoice Challenge is an open-condition challenge: participants are permitted to use any public or private datasets to train their systems, in addition to the provided TidyVoiceX training partition. Participants are also encouraged to use pre-trained models (e.g., ResNet, or SSL models such as wav2vec2 and WavLM). The only restriction concerns Mozilla Common Voice: only the official TidyVoiceX training partition may be used, and all other Common Voice data is strictly forbidden. The core task is speaker verification: for each trial pair, systems must output a similarity score.
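As a concrete illustration of the required output, the sketch below scores a single trial with cosine similarity between two speaker embeddings. It is a minimal example, not the official baseline: the embedding extractor is left abstract (any of the permitted pre-trained models could produce the vectors), and all names and dimensions are illustrative.

```python
# Illustrative scoring sketch: given speaker embeddings for the enrollment and
# test utterances of a trial, output a cosine similarity score.
import numpy as np

def cosine_score(enroll_emb: np.ndarray, test_emb: np.ndarray) -> float:
    """Cosine similarity between two fixed-dimensional speaker embeddings."""
    enroll_emb = enroll_emb / np.linalg.norm(enroll_emb)
    test_emb = test_emb / np.linalg.norm(test_emb)
    return float(np.dot(enroll_emb, test_emb))

# Placeholder embeddings; a real system would extract these from the audio
# with a pre-trained or fine-tuned speaker encoder.
rng = np.random.default_rng(0)
enroll, test = rng.normal(size=256), rng.normal(size=256)
print(cosine_score(enroll, test))
```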
Primary Evaluation Metric: Equal Error Rate (EER).
Secondary Metric: Minimum Detection Cost Function (minDCF) for comprehensive performance analysis.
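For reference, the sketch below shows how EER and minDCF can be computed with NumPy from a list of trial scores and target/non-target labels. The minDCF operating-point parameters (p_target, c_miss, c_fa) are placeholder assumptions for illustration; the official values are those defined in the challenge evaluation protocol.

```python
import numpy as np

def compute_eer(scores, labels):
    """Equal Error Rate from raw trial scores and binary labels (1 = target)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(scores)[::-1]           # sort trials by score, descending
    labels = labels[order]
    n_target = labels.sum()
    n_nontarget = len(labels) - n_target
    fa = np.cumsum(1 - labels) / n_nontarget   # false-acceptance rate per threshold
    fr = 1 - np.cumsum(labels) / n_target      # false-rejection rate per threshold
    idx = np.argmin(np.abs(fa - fr))           # operating point where FA ~= FR
    return (fa[idx] + fr[idx]) / 2

def compute_min_dcf(scores, labels, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Normalized minimum detection cost (parameter values are illustrative)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(scores)[::-1]
    labels = labels[order]
    n_target = labels.sum()
    n_nontarget = len(labels) - n_target
    p_fa = np.cumsum(1 - labels) / n_nontarget
    p_miss = 1 - np.cumsum(labels) / n_target
    dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)
    return dcf.min() / min(c_miss * p_target, c_fa * (1 - p_target))

if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
    labels = [1, 1, 0, 1, 0, 0]
    print("EER:", compute_eer(scores, labels))       # 0.333 for this toy example
    print("minDCF:", compute_min_dcf(scores, labels))
```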
The challenge uses the TidyVoiceX dataset, a curated partition of the Mozilla Common Voice dataset spanning around 40 languages.
The TidyVoice Challenge is organized in two main phases:
Development Phase: Participants will use the provided training and development datasets to build and tune their systems, experimenting freely with different approaches, architectures, and hyperparameters.
Validation Phase: In this phase, the development dataset will be released with ground-truth labels. Participants will submit their results on the development set by uploading them to the CodaBench website. Rankings will be determined by performance on the development set, allowing participants to compare their systems against others on the leaderboard.
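As a rough illustration of preparing a submission, the sketch below writes one similarity score per trial pair to a plain-text file. The trial identifiers, filename, and column layout here are hypothetical; the authoritative submission format and packaging are those specified on the CodaBench page.

```python
# Hypothetical submission sketch: one "enrollment test score" line per trial.
# Adjust to the format required by the CodaBench submission instructions.
trials = [
    ("enrol_utt_001.wav", "test_utt_042.wav", 0.731),
    ("enrol_utt_002.wav", "test_utt_017.wav", -0.128),
]

with open("dev_scores.txt", "w") as f:
    for enrol, test, score in trials:
        f.write(f"{enrol} {test} {score:.6f}\n")
```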
Development Phase: The development trial pairs include four types to help participants assess whether their systems distinguish speakers rather than languages:
Evaluation Phase: Participants will evaluate and submit results for two trial pair lists:
These trial structures are designed to evaluate systems’ ability to eliminate language effects and perform robust speaker verification across languages, including languages not encountered during training.
A speech signal from a single person carries multiple types of information: the speaker's identity, the linguistic content, emotional cues, the language being spoken, and more. In this challenge, we aim to develop systems that, given a speech signal, suppress the effect of language and perform speaker verification in a language-independent manner.