Download TidyVoiceX Dataset

Here you can access the training data, development data, and trial pairs needed to participate in the challenge.


Dataset Components

TidyVoiceX Dataset (Train + Dev)

The complete dataset package containing both training and development data for the TidyVoice2026 Challenge.

Contents:

  • TidyVoiceX_Train/: Training dataset with multilingual speaker recordings
  • TidyVoiceX_Dev/: Development dataset for system tuning and validation
  • Speaker identity labels and language annotations
  • Cross-lingual speaker samples across both splits

Download:

  • Size: 50 GB
  • Format: .wav files with a 16 kHz sampling rate

📥 TidyVoiceX Complete Dataset

Click the link below to access the dataset page or download the API script:

🔗 View Dataset Page 📥 Download Script

Available via Mozilla Data Collective


Trial Pairs for Dev Data

Trial pairs file containing the evaluation protocol for the development set.

📥 Trial Pairs Download

Click the link below to download the trial pairs file:

🔗 Download Trial Pairs (trial_pairs_dev.txt)

Available via Google Drive


TidyVoiceX2_ASV (Evaluation Dataset)

The official evaluation dataset for the TidyVoice2026 Challenge.

Contents:

  • TidyVoiceX2_ASV/: Evaluation dataset with multilingual speaker recordings covering approximately 38 additional languages
  • Cross-lingual speaker samples for evaluation

Download:

  • Size: 32 GB
  • Format: .wav files with a 16 kHz sampling rate

📥 TidyVoiceX2_ASV Evaluation Dataset

Click the link below to access the dataset page or download the API script:

🔗 View Dataset Page 📥 Download Script

Available via Mozilla Data Collective


Evaluation Trial Files

Official trial pairs for the evaluation dataset.

Trial Files Included:

  • tv26_eval-A.txt: Evaluation trial pairs for All languages
  • tv26_eval-U.txt: Evaluation trial pairs for Unseen languages

📥 Evaluation Trial Files Download

Click the link below to download the ZIP archive containing both tv26_eval-A.txt and tv26_eval-U.txt:

🔗 Download Evaluation Trial Pairs (ZIP)

Available via Google Drive


Download Instructions

  1. Registration Required: Please complete the registration process before downloading the dataset.

  2. Create Mozilla Data Collective API Key:

  3. Install Required Package:

    pip install datacollective

  4. Download Using Python Script:

    Download the download_tidyvoice.py script from the dataset download section above, then:

    • Replace YOUR_API_KEY_HERE with your Mozilla Data Collective API key
    • Update OUTPUT_DIR to your desired download location
    • Run each script in turn: python download_tidyvoice.py, then python download_tidyvoice2.py
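
After the download finishes, it can be useful to confirm that the audio matches the stated format (.wav, 16 kHz). The following is a minimal sketch, not part of the official tooling: the function name find_unexpected_rates is hypothetical, and it assumes the files are PCM-encoded WAV that Python's standard wave module can open.

```python
import wave
from pathlib import Path

EXPECTED_RATE = 16_000  # 16 kHz, as stated in the dataset description


def find_unexpected_rates(root, expected_rate=EXPECTED_RATE):
    """Return (path, rate) for every .wav under root whose rate differs.

    Hypothetical helper; assumes PCM WAV files readable by the wave module.
    """
    mismatches = []
    # rglob walks all subfolders, so speaker/language nesting is covered.
    for wav_path in sorted(Path(root).rglob("*.wav")):
        with wave.open(str(wav_path), "rb") as wav_file:
            rate = wav_file.getframerate()
        if rate != expected_rate:
            mismatches.append((wav_path, rate))
    return mismatches


# Example usage (replace the path with your own OUTPUT_DIR):
# for path, rate in find_unexpected_rates("/path/to/OUTPUT_DIR"):
#     print(f"{path}: {rate} Hz")
```

An empty result means every file that could be opened reports a 16 kHz sampling rate.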


Data Structure

Each dataset folder contains one folder per speaker (speakerID). Each speaker folder contains one subfolder per language (languageID), which holds that speaker's audio files in that language.

TidyVoiceX_Train/Dev
├── speaker_001/
│   ├── en/          # English recordings
│   │   ├── file1.wav
│   │   ├── file2.wav
│   │   └── ...
│   ├── fa/          # Persian recordings
│   │   ├── file1.wav
│   │   └── ...
│   └── fr/          # French recordings
│       └── ...
├── speaker_002/
│   ├── de/          # German recordings
│   ├── it/          # Italian recordings
│   └── ...
└── ...


Structure Explanation:

  • TidyVoiceX_Train/: Contains training data with speakerID folders directly at the root
  • TidyVoiceX_Dev/: Contains development data with speakerID folders directly at the root
  • Each speakerID folder contains all recordings for that specific speaker
  • languageID subfolders organize recordings by language (en, fa, fr, de, it, etc.)
  • Audio files for each language are stored in their respective languageID folders
  • This structure enables easy access to cross-lingual data for the same speaker
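
The layout described above can be traversed with a few lines of Python. This is a sketch under the assumption that the folders follow the speakerID/languageID/*.wav pattern shown in the tree; the function names index_dataset and cross_lingual_speakers are hypothetical and not part of any official toolkit.

```python
from collections import defaultdict
from pathlib import Path


def index_dataset(root):
    """Map speaker ID -> language ID -> sorted list of .wav paths.

    Assumes the speakerID/languageID/*.wav layout described above.
    """
    index = defaultdict(dict)
    for speaker_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for lang_dir in sorted(p for p in speaker_dir.iterdir() if p.is_dir()):
            wavs = sorted(lang_dir.glob("*.wav"))
            if wavs:  # skip language folders without audio
                index[speaker_dir.name][lang_dir.name] = wavs
    return dict(index)


def cross_lingual_speakers(index):
    """Return speakers that have recordings in at least two languages."""
    return {
        speaker: sorted(langs)
        for speaker, langs in index.items()
        if len(langs) >= 2
    }


# Example usage (replace the path with your local copy):
# index = index_dataset("/path/to/TidyVoiceX_Train")
# for speaker, langs in cross_lingual_speakers(index).items():
#     print(speaker, langs)
```

Keeping the index as a nested dictionary mirrors the folder hierarchy, so looking up all recordings of one speaker in one language is a direct index[speaker][language] access.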


Support

If you encounter any issues with the dataset download or have questions about the data format, please contact:

  • Email: aref.farhadipour@uzh.ch


Citation

If you use the TidyVoice dataset in your research, please cite:

@misc{farhadi2026tidy,
      title={TidyVoice: A Curated Multilingual Dataset for Speaker Verification Derived from Common Voice}, 
      author={Aref Farhadipour and Jan Marquenie and Srikanth Madikeri and Eleanor Chodroff},
      year={2026},
      journal={ICASSP2026},
      url={https://arxiv.org/abs/2601.16358}, 
}