Download TidyVoiceX Dataset

Here you can access the training data, development data, and trial pairs needed to participate in the challenge.

Dataset Components

TidyVoiceX Dataset (Train + Dev)

The complete dataset package containing both training and development data for the TidyVoice2026 Challenge.

Contents:

TidyVoiceX_Train/: Training dataset with multilingual speaker recordings
TidyVoiceX_Dev/: Development dataset for system tuning and validation
Speaker identity labels and language annotations
Cross-lingual speaker samples across both splits

Download:

Size: 50 GB
Format: .wav files with a 16 kHz sampling rate

📥 TidyVoiceX Complete Dataset

Click the link below to access the dataset page or download the API script:

🔗 View Dataset Page 📥 Download Script

Available via Mozilla Data Collective
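
After downloading, it can be useful to confirm that the files match the stated format. The following is a minimal sketch using only the Python standard library; it assumes the recordings are plain PCM .wav files (the only kind the `wave` module reads), and the function names `check_wav` and `find_bad_files` are hypothetical helpers, not part of the challenge tooling.

```python
import wave
from pathlib import Path


def check_wav(path, expected_rate=16000):
    """Return (is_ok, sample_rate) for a PCM .wav file."""
    # wave.open expects a string path or a file object, so convert Path to str.
    with wave.open(str(path), "rb") as wav_file:
        rate = wav_file.getframerate()
    return rate == expected_rate, rate


def find_bad_files(dataset_root, expected_rate=16000):
    """List (path, sample_rate) pairs for .wav files that do not match expected_rate."""
    bad = []
    # rglob walks every subfolder, so the speaker/language nesting does not matter here.
    for wav_path in sorted(Path(dataset_root).rglob("*.wav")):
        ok, rate = check_wav(wav_path, expected_rate)
        if not ok:
            bad.append((wav_path, rate))
    return bad
```

For example, `find_bad_files("TidyVoiceX_Dev")` would return an empty list if every file under that folder is sampled at 16 kHz.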

Trial Pairs for Dev Data

Trial pairs file containing the evaluation protocol for the development set.

📥 Trial Pairs Download

Click the link below to download the trial pairs file:

🔗 Download Trial Pairs (trial_pairs_dev.txt)

Available via Google Drive

TidyVoiceX2_ASV (Evaluation Dataset)

The official evaluation dataset for the TidyVoice2026 Challenge.

Contents:

TidyVoiceX2_ASV/: Evaluation dataset with multilingual speaker recordings covering approximately 38 additional languages
Cross-lingual speaker samples for evaluation

Download:

Size: 32 GB
Format: .wav files with a 16 kHz sampling rate

📥 TidyVoiceX2_ASV Evaluation Dataset

Click the link below to access the dataset page or download the API script:

🔗 View Dataset Page 📥 Download Script

Available via Mozilla Data Collective

Evaluation Trial Files

Official trial pairs for the evaluation dataset.

Trial Files Included:

tv26_eval-A.txt: Evaluation trial pairs for All languages
tv26_eval-U.txt: Evaluation trial pairs for Unseen languages

📥 Evaluation Trial Files Download

Click the link below to download a ZIP archive containing both trial files (tv26_eval-A.txt and tv26_eval-U.txt):

🔗 Download Evaluation Trial Pairs (ZIP)

Available via Google Drive

Download Instructions

Registration Required: Please complete the registration process before downloading the dataset.
Create Mozilla Data Collective API Key:
- Visit https://datacollective.mozillafoundation.org/api-reference
- Navigate to Profile > API to create your API credentials
- Save your API key securely
Install Required Package:
```
pip install datacollective
```
Download Using Python Script:

Download the script from each dataset section above (download_tidyvoice.py and download_tidyvoice2.py), then in each script:
- Replace YOUR_API_KEY_HERE with your Mozilla Data Collective API key
- Update OUTPUT_DIR to your desired download location
- Run the scripts: python download_tidyvoice.py and python download_tidyvoice2.py
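
To avoid storing the API key in the script file itself, one option is to read it from the environment. The sketch below is an assumption-laden illustration: the environment variable names `TIDYVOICE_API_KEY` and `TIDYVOICE_OUTPUT_DIR`, the default folder, and the helper `load_download_config` are all hypothetical and are not defined by the challenge scripts or by the datacollective package.

```python
import os
from pathlib import Path


def load_download_config(env=None):
    """Read the API key and output directory from environment variables.

    TIDYVOICE_API_KEY and TIDYVOICE_OUTPUT_DIR are hypothetical names
    chosen for this sketch only.
    """
    env = os.environ if env is None else env
    api_key = env.get("TIDYVOICE_API_KEY", "").strip()
    # Reject a missing key or the unedited placeholder from the script.
    if not api_key or api_key == "YOUR_API_KEY_HERE":
        raise ValueError(
            "Set TIDYVOICE_API_KEY to your Mozilla Data Collective API key"
        )
    output_dir = Path(env.get("TIDYVOICE_OUTPUT_DIR", "./tidyvoice_data")).expanduser()
    # Create the download location up front so the script fails early if it cannot.
    output_dir.mkdir(parents=True, exist_ok=True)
    return api_key, output_dir
```

The returned values could then be used in place of the YOUR_API_KEY_HERE placeholder and the OUTPUT_DIR setting when editing the download scripts.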

Data Structure

Each dataset folder contains speakerID folders directly at its root. Each speakerID folder in turn contains languageID subfolders, which hold that speaker's audio files in that language.

TidyVoiceX_Train/Dev
├── speaker_001/
│   ├── en/          # English recordings
│   │   ├── file1.wav
│   │   ├── file2.wav
│   │   └── ...
│   ├── fa/          # Persian recordings
│   │   ├── file1.wav
│   │   └── ...
│   └── fr/          # French recordings
│       └── ...
├── speaker_002/
│   ├── de/          # German recordings
│   ├── it/          # Italian recordings
│   └── ...
└── ...

Structure Explanation:

TidyVoiceX_Train/: Contains training data with speakerID folders directly at the root
TidyVoiceX_Dev/: Contains development data with speakerID folders directly at the root
Each speakerID folder contains all recordings for that specific speaker
languageID subfolders organize recordings by language (en, fa, fr, de, it, etc.)
Audio files for each language are stored in their respective languageID folders
This structure enables easy access to cross-lingual data for the same speaker
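
The layout described above can be traversed with a few lines of Python. This is a minimal sketch, assuming the folder structure is exactly speakerID/languageID/*.wav as shown; `index_dataset` and `cross_lingual_speakers` are hypothetical helper names introduced for illustration.

```python
from collections import defaultdict
from pathlib import Path


def index_dataset(dataset_root):
    """Map speakerID -> languageID -> sorted list of .wav paths."""
    index = defaultdict(dict)
    for speaker_dir in sorted(Path(dataset_root).iterdir()):
        if not speaker_dir.is_dir():
            continue  # skip stray files at the dataset root
        for lang_dir in sorted(speaker_dir.iterdir()):
            if not lang_dir.is_dir():
                continue
            wav_files = sorted(lang_dir.glob("*.wav"))
            if wav_files:
                index[speaker_dir.name][lang_dir.name] = wav_files
    return dict(index)


def cross_lingual_speakers(index):
    """Return speakers that have recordings in at least two languages."""
    return {
        speaker: sorted(languages)
        for speaker, languages in index.items()
        if len(languages) >= 2
    }
```

For example, `cross_lingual_speakers(index_dataset("TidyVoiceX_Train"))` would list each speaker with two or more languageID subfolders alongside those language codes.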

Support

If you encounter any issues with the dataset download or have questions about the data format, please contact:

Email: aref.farhadipour@uzh.ch

Citation

If you use the TidyVoice dataset in your research, please cite:

@misc{farhadi2026tidy,
      title={TidyVoice: A Curated Multilingual Dataset for Speaker Verification Derived from Common Voice}, 
      author={Aref Farhadipour and Jan Marquenie and Srikanth Madikeri and Eleanor Chodroff},
      year={2026},
      journal={ICASSP2026},
      url={https://arxiv.org/abs/2601.16358}, 
}

TidyVoice2026 Challenge - Interspeech 2026