Multilingual TEDx (mTEDx) is a multilingual corpus for automatic speech recognition (ASR) and spoken language translation (SLT), created to facilitate the training of ASR and SLT models in additional languages.

The corpus comprises audio recordings and transcripts from TEDx Talks in 8 languages (Spanish, French, Portuguese, Italian, Russian, Greek, Arabic, German) with translations into up to 5 languages (English, Spanish, French, Portuguese, Italian).
The audio recordings are automatically aligned at the sentence level with their manual transcriptions and translations.
Each .tgz file contains two directories, data and docs. The docs directory contains a README detailing the files provided in data and their structure.
Test sets for all IWSLT 2021 language pairs can be found in mtedx_iwslt2021.tgz.
For more information on the dataset please see the dataset paper.
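
As a minimal sketch of how one of the downloaded archives might be inspected before unpacking it, the Python snippet below lists the top-level entries of a .tgz file and reads the README stored under docs. The function names top_level_dirs and read_docs_readme are hypothetical helpers written for this illustration, not part of any official mTEDx tooling. The sketch also assumes the README file name starts with "README" and is UTF-8 text; whether the archive wraps data and docs in an extra parent directory is not stated here, so the README lookup accepts docs at any depth.

```python
import tarfile
from pathlib import PurePosixPath
from typing import List, Optional


def top_level_dirs(archive_path: str) -> List[str]:
    """Return the sorted first path components of all archive members."""
    with tarfile.open(archive_path, "r:gz") as tar:
        names = set()
        for member in tar.getmembers():
            parts = PurePosixPath(member.name).parts
            if parts:  # skip a bare "." entry
                names.add(parts[0])
    return sorted(names)


def read_docs_readme(archive_path: str) -> Optional[str]:
    """Return the text of the first README found under a docs directory.

    Assumption: the file name starts with "README" (any case) and the
    content is UTF-8 text. Returns None if no such member exists.
    """
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            path = PurePosixPath(member.name)
            in_docs = "docs" in path.parts[:-1]
            is_readme = path.name.upper().startswith("README")
            if member.isfile() and in_docs and is_readme:
                fileobj = tar.extractfile(member)
                if fileobj is not None:
                    return fileobj.read().decode("utf-8", errors="replace")
    return None
```

For example, after downloading mtedx_iwslt2021.tgz to the current directory, top_level_dirs("mtedx_iwslt2021.tgz") shows what the archive contains at its root, and read_docs_readme("mtedx_iwslt2021.tgz") returns the README text, or None if the layout differs from the assumptions above.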

Contact: Elizabeth Salesky (esalesky@jhu.edu), Matthew Wiesner (wiesner@jhu.edu)

Citation: If you use the Multilingual TEDx corpus in your work, please cite the dataset paper:

  @inproceedings{salesky2021mtedx,
    title={Multilingual TEDx Corpus for Speech Recognition and Translation},
    author={Elizabeth Salesky and Matthew Wiesner and Jacob Bremerman and Roldano Cattoni and Matteo Negri and Marco Turchi and Douglas W. Oard and Matt Post},
    booktitle={Proceedings of Interspeech},
    year={2021},
  }