Multilingual TEDx
Identifier: SLR100
Summary: a multilingual corpus of TEDx talks for speech recognition and translation
Category: Speech
License: Creative Commons Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)
Downloads (use a mirror closer to you):
 mtedx_es.tgz  [35G]  (Spanish speech and transcripts)  Mirrors: [EU] [EU] [CN]
 mtedx_fr.tgz  [34G]  (French speech and transcripts)  Mirrors: [EU] [EU] [CN]
 mtedx_pt.tgz  [29G]  (Portuguese speech and transcripts)  Mirrors: [EU] [EU] [CN]
 mtedx_it.tgz  [19G]  (Italian speech and transcripts)  Mirrors: [EU] [EU] [CN]
 mtedx_ru.tgz  [10G]  (Russian speech and transcripts)  Mirrors: [EU] [EU] [CN]
 mtedx_el.tgz  [5.5G]  (Greek speech and transcripts)  Mirrors: [EU] [EU] [CN]
 mtedx_ar.tgz  [3.6G]  (Arabic speech and transcripts)  Mirrors: [EU] [EU] [CN]
 mtedx_de.tgz  [2.6G]  (German speech and transcripts)  Mirrors: [EU] [EU] [CN]
 mtedx_es-en.tgz  [13G]  (Spanish speech and transcripts with aligned English translations)  Mirrors: [EU] [EU] [CN]
 mtedx_es-fr.tgz  [1.9G]  (Spanish speech and transcripts with aligned French translations)  Mirrors: [EU] [EU] [CN]
 mtedx_es-it.tgz  [1.9G]  (Spanish speech and transcripts with aligned Italian translations)  Mirrors: [EU] [EU] [CN]
 mtedx_es-pt.tgz  [8.1G]  (Spanish speech and transcripts with aligned Portuguese translations)  Mirrors: [EU] [EU] [CN]
 mtedx_fr-en.tgz  [9.8G]  (French speech and transcripts with aligned English translations)  Mirrors: [EU] [EU] [CN]
 mtedx_fr-es.tgz  [7.1G]  (French speech and transcripts with aligned Spanish translations)  Mirrors: [EU] [EU] [CN]
 mtedx_fr-pt.tgz  [4.7G]  (French speech and transcripts with aligned Portuguese translations)  Mirrors: [EU] [EU] [CN]
 mtedx_pt-en.tgz  [10G]  (Portuguese speech and transcripts with aligned English translations)  Mirrors: [EU] [EU] [CN]
 mtedx_pt-es.tgz  [4.5G]  (Portuguese speech and transcripts with aligned Spanish translations)  Mirrors: [EU] [EU] [CN]
 mtedx_it-en.tgz  [9.1G]  (Italian speech and transcripts with aligned English translations)  Mirrors: [EU] [EU] [CN]
 mtedx_it-es.tgz  [1.6G]  (Italian speech and transcripts with aligned Spanish translations)  Mirrors: [EU] [EU] [CN]
 mtedx_ru-en.tgz  [2.3G]  (Russian speech and transcripts with aligned English translations)  Mirrors: [EU] [EU] [CN]
 mtedx_el-en.tgz  [2.4G]  (Greek speech and transcripts with aligned English translations)  Mirrors: [EU] [EU] [CN]
 mtedx_iwslt2021.tgz  [5.7G]  (Test sets for IWSLT'21 Multilingual task)  Mirrors: [EU] [EU] [CN]
 MTEDx-french-talks-gender-annotation.csv  [105K]  (Gender annotations for French talks, contributed by Laurent Besacier and Marcely Zanon Boito)  Mirrors: [EU] [EU] [CN]
About this resource:
  The corpus comprises audio recordings and transcripts from TEDx Talks in 8 languages (Spanish, French, Portuguese, Italian, Russian, Greek, Arabic, German) with translations into up to 5 languages (English, Spanish, French, Portuguese, Italian).
  The audio recordings are automatically aligned at the sentence level with their manual transcriptions and translations.
  Each .tgz file contains two directories: data and docs. docs contains a README detailing the files provided in data and their structure.
  Test sets for all IWSLT 2021 language pairs can be found in mtedx_iwslt2021.tgz.
  For more information on the dataset please see the dataset paper.
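
  After downloading an archive, it can be useful to confirm that the data and docs directories described above are present before unpacking tens of gigabytes. The following is a minimal Python sketch using only the standard library. The function name locate_dirs and the max_depth parameter are illustrative and not part of the corpus tooling. It assumes the two directories sit either at the archive root or one level below a single wrapping directory; the README inside docs remains the authoritative description of the layout.

```python
import tarfile
from pathlib import PurePosixPath


def locate_dirs(archive_path, wanted=("data", "docs"), max_depth=2):
    """Map each wanted directory name to the first path where it appears.

    Only the first `max_depth` path components of every member are checked,
    so a name is found at the archive root or under one wrapping directory
    (an assumption, see the lead-in). Reading is sequential, so a large
    archive takes a while to scan.
    """
    found = {}
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            # Drop a leading "." so "./data/x" and "data/x" look the same.
            parts = [p for p in PurePosixPath(member.name).parts if p != "."]
            for depth, part in enumerate(parts[:max_depth]):
                if part in wanted and part not in found:
                    found[part] = "/".join(parts[: depth + 1])
            if len(found) == len(wanted):
                break  # everything located, stop scanning early
    return found
```

  For example, locate_dirs("mtedx_de.tgz") returns a dictionary with one entry per directory that was found; a missing key means that directory was not seen within the first max_depth path components.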
Contact: Elizabeth Salesky, Matthew Wiesner. esalesky@jhu.edu, wiesner@jhu.edu
Citation: If you use the Multilingual TEDx corpus in your work, please cite the dataset paper:
  @inproceedings{salesky2021mtedx,
    title={Multilingual TEDx Corpus for Speech Recognition and Translation},
    author={Elizabeth Salesky and Matthew Wiesner and Jacob Bremerman and Roldano Cattoni and Matteo Negri and Marco Turchi and Douglas W. Oard and Matt Post},
    booktitle={Proceedings of Interspeech},
    year={2021},
  }