Here you can access the training data, development data, and trial pairs needed to participate in the challenge.
The complete dataset package containing both training and development data for the TidyVoice2026 Challenge.
Contents:
Download:
Click the link below to access the dataset page or download the API script:
Available via Mozilla Data Collective
Trial pairs file containing the evaluation protocol for the development set.
Click the link below to download the trial pairs file:
🔗 Download Trial Pairs (trial_pairs_dev.txt)
Available via Google Drive
The official evaluation dataset for the TidyVoice2026 Challenge.
Contents:
Download:
Click the link below to access the dataset page or download the API script:
Available via Mozilla Data Collective
Official trial pairs for the evaluation dataset.
Trial Files Included:
Click the link below to download the trial pairs file (contains both tv26_eval-A.txt and tv26_eval-U.txt):
🔗 Download Evaluation Trial Pairs (ZIP)
Available via Google Drive
Registration Required: Please complete the registration process before downloading the dataset.
pip install datacollective
Download Using Python Script:
Download the download_tidyvoice.py script from the dataset download section above, then:
Replace YOUR_API_KEY_HERE with your Mozilla Data Collective API key.
Set OUTPUT_DIR to your desired download location.
Run python download_tidyvoice.py and python download_tidyvoice2.py.

The dataset is organized with speakerID folders directly inside each dataset folder. Each speakerID folder contains languageID subfolders, and each languageID subfolder holds the audio files for that speaker in that language.
TidyVoiceX_Train/Dev
├── speaker_001/
│ ├── en/ # English recordings
│ │ ├── file1.wav
│ │ ├── file2.wav
│ │ └── ...
│ ├── fa/ # Persian recordings
│ │ ├── file1.wav
│ │ └── ...
│ └── fr/ # French recordings
│ └── ...
├── speaker_002/
│ ├── de/ # German recordings
│ ├── it/ # Italian recordings
│ └── ...
└── ...
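
As a rough illustration of how the layout above could be traversed, the following Python sketch builds an index from speaker ID to language ID to audio files. The function name index_dataset and the dataset root path are hypothetical and not part of the official challenge tooling. The sketch assumes only the folder layout shown in the tree: speaker folders at the top level, language subfolders beneath them, and .wav files inside.

```python
from collections import defaultdict
from pathlib import Path


def index_dataset(root):
    """Return {speaker_id: {language_id: [wav paths]}} for a dataset folder.

    Hypothetical helper: assumes speaker folders sit directly under `root`
    and each contains language subfolders holding .wav files.
    """
    index = defaultdict(dict)
    for speaker_dir in sorted(Path(root).iterdir()):
        if not speaker_dir.is_dir():
            continue  # skip stray files at the top level
        for lang_dir in sorted(speaker_dir.iterdir()):
            if not lang_dir.is_dir():
                continue  # skip stray files inside a speaker folder
            wavs = sorted(lang_dir.glob("*.wav"))
            if wavs:  # ignore language folders with no audio
                index[speaker_dir.name][lang_dir.name] = wavs
    return dict(index)


def summarize(index):
    """Print one line per speaker with per-language file counts."""
    for speaker, langs in index.items():
        counts = ", ".join(f"{lang}: {len(files)}" for lang, files in langs.items())
        print(f"{speaker} -> {counts}")
```

Calling summarize(index_dataset(path)) on the extracted dataset folder would list each speaker with the number of recordings per language, which can serve as a quick check that the download unpacked as expected.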
Structure Explanation:
If you encounter any issues with the dataset download or have questions about the data format, please contact:
If you use the TidyVoice dataset in your research, please cite:
@misc{farhadi2026tidy,
title={TidyVoice: A Curated Multilingual Dataset for Speaker Verification Derived from Common Voice},
author={Aref Farhadipour and Jan Marquenie and Srikanth Madikeri and Eleanor Chodroff},
year={2026},
journal={ICASSP2026},
url={https://arxiv.org/abs/2601.16358},
}