Download TidyVoiceX Dataset

Here you can access the training data, development data, and trial pairs needed to participate in the challenge.


Dataset Components

TidyVoiceX Dataset (Train + Dev)

The complete dataset package containing both training and development data for the TidyVoice2026 Challenge.

Contents:

  • TidyVoiceX_Train/: Training dataset with multilingual speaker recordings
  • TidyVoiceX_Dev/: Development dataset for system tuning and validation
  • Speaker identity labels and language annotations
  • Audio files in .wav format (16 kHz)
  • Cross-lingual speaker samples across both splits

Download:

  • Size: 50 GB
  • Format: .wav files, 16 kHz sampling rate
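After downloading, it can be useful to confirm that the audio matches the stated 16 kHz format before training. The following is a minimal sketch using only Python's standard `wave` module. The helper names `check_wav` and `find_unexpected_rates` are illustrative and not part of any official tooling, and the sketch assumes the files are uncompressed PCM, which is the only kind `wave` can read.

```python
import wave
from pathlib import Path


def check_wav(path, expected_rate=16_000):
    """Return (ok, sample_rate) for one PCM .wav file."""
    # wave.open raises wave.Error for non-PCM or malformed files.
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
    return rate == expected_rate, rate


def find_unexpected_rates(root, expected_rate=16_000):
    """List (path, rate) for every .wav under root whose rate differs."""
    mismatches = []
    for wav_path in sorted(Path(root).rglob("*.wav")):
        ok, rate = check_wav(wav_path, expected_rate)
        if not ok:
            mismatches.append((wav_path, rate))
    return mismatches
```

For example, `find_unexpected_rates("TidyVoiceX_Dev")` returns an empty list if every file under that folder is sampled at 16 kHz.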

📥 TidyVoiceX Complete Dataset

Use the links below to open the dataset page or to download the download_tidyvoice.py script:

🔗 View Dataset Page 📥 Download Script

Available via Mozilla Data Collective


Trial Pairs for Dev Data

Trial pairs file containing the evaluation protocol for the development set.

📥 Trial Pairs Download

Click the link below to download the trial pairs file:

🔗 Download Trial Pairs (trial_pairs_dev.txt)

Available via Google Drive


TidyVoiceX_Evaluation (Coming Soon)

The evaluation dataset will be released closer to the evaluation phase of the challenge.

⏳ TidyVoiceX_Evaluation

Will be released during the evaluation phase


Evaluation Trial Files (Coming Soon)

Trial files for the official evaluation phase will be made available here.

⏳ Evaluation Trial Files

Will be released during the evaluation phase


Download Instructions

  1. Registration Required: Please complete the registration process before downloading the dataset.

  2. Create Mozilla Data Collective API Key: you will need it for the download script in step 4.

  3. Install Required Package:

    pip install datacollective

  4. Download Using Python Script:

    Download the download_tidyvoice.py script from the dataset download section above, then:

    • Replace YOUR_API_KEY_HERE with your Mozilla Data Collective API key
    • Update OUTPUT_DIR to your desired download location
    • Run: python download_tidyvoice.py
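Because the package is listed as 50 GB, a quick free-space check on the chosen OUTPUT_DIR can save a failed download. The sketch below uses only the standard library. `has_room_for_dataset` is a hypothetical helper, not part of download_tidyvoice.py, and it assumes that "50 GB" means decimal gigabytes and that 10% headroom is enough. If the download arrives as an archive that must be unpacked, more space may be needed.

```python
import shutil
from pathlib import Path

# "Size: 50 GB" from this page, read as decimal gigabytes (assumption).
DATASET_SIZE_BYTES = 50 * 10**9


def has_room_for_dataset(output_dir, required_bytes=DATASET_SIZE_BYTES, headroom=1.1):
    """True if the filesystem holding output_dir has required_bytes * headroom free."""
    target = Path(output_dir).expanduser().resolve()
    # Walk up to the nearest existing directory so the check also works
    # before OUTPUT_DIR has been created.
    while not target.exists() and target != target.parent:
        target = target.parent
    free = shutil.disk_usage(target).free
    return free >= required_bytes * headroom
```

Calling `has_room_for_dataset("/path/to/output")` with the same path you set as OUTPUT_DIR returns `False` when the disk is too full for the download.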


Data Structure

Each dataset folder contains one folder per speaker (speakerID). Each speaker folder contains one subfolder per language (languageID), which holds that speaker's audio files in that language.

TidyVoiceX_Train/    # TidyVoiceX_Dev/ follows the same layout
├── speaker_001/
│   ├── en/          # English recordings
│   │   ├── file1.wav
│   │   ├── file2.wav
│   │   └── ...
│   ├── fa/          # Persian recordings
│   │   ├── file1.wav
│   │   └── ...
│   └── fr/          # French recordings
│       └── ...
├── speaker_002/
│   ├── de/          # German recordings
│   ├── it/          # Italian recordings
│   └── ...
└── ...


Structure Explanation:

  • TidyVoiceX_Train/: Contains training data with speakerID folders directly at the root
  • TidyVoiceX_Dev/: Contains development data with speakerID folders directly at the root
  • Each speakerID folder contains all recordings for that specific speaker
  • languageID subfolders organize recordings by language (en, fa, fr, de, it, etc.)
  • Audio files for each language are stored in their respective languageID folders
  • This structure enables easy access to cross-lingual data for the same speaker
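The speakerID/languageID layout described above can be walked with `pathlib` to build an index and to pick out speakers recorded in more than one language. This is a minimal sketch, assuming the folder layout shown in the tree. The function names `index_dataset` and `cross_lingual_speakers` are illustrative and not part of the challenge tooling.

```python
from collections import defaultdict
from pathlib import Path


def index_dataset(root):
    """Map speaker ID -> language ID -> sorted list of .wav paths."""
    index = defaultdict(dict)
    for speaker_dir in sorted(Path(root).iterdir()):
        if not speaker_dir.is_dir():
            continue  # skip stray files at the dataset root
        for lang_dir in sorted(speaker_dir.iterdir()):
            if lang_dir.is_dir():
                index[speaker_dir.name][lang_dir.name] = sorted(lang_dir.glob("*.wav"))
    return dict(index)


def cross_lingual_speakers(index):
    """Speakers with at least one recording in two or more languages."""
    return {
        spk: sorted(lang for lang, files in langs.items() if files)
        for spk, langs in index.items()
        if sum(1 for files in langs.values() if files) >= 2
    }
```

A typical call would be `index = index_dataset("TidyVoiceX_Train")` followed by `cross_lingual_speakers(index)`, which maps each multilingual speaker to the list of language codes found for that speaker.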


Support

If you encounter any issues with the dataset download or have questions about the data format, please contact:

  • Email: aref.farhadipour@uzh.ch


Citation

If you use the TidyVoice dataset in your research, please cite:

[Citation information will be provided upon dataset release]



Note: Links to the evaluation dataset will be activated closer to the challenge start date. Please check back regularly for updates.