This data set contains multi-speaker high quality transcribed audio data for four languages of South Africa. The data set consists of wave files, and a TSV file transcribing the audio. In each folder, the file line_index.tsv contains a FileID, which in turn contains the UserID and the Transcription of audio in the file.

The data set has had some quality checks, but there might still be errors.

This data set was collected by as a collaboration between North West University and Google.

See LICENSE.txt file for license information.

If you use this data in publications, please cite it as follows:

  @inproceedings{van-niekerk-etal-2017,
    title = {{Rapid development of TTS corpora for four South African languages}},
    author = {Daniel van Niekerk and Charl van Heerden and Marelie Davel and Neil Kleynhans and Oddur Kjartansson and Martin Jansche and Linne Ha},
    booktitle = {Proc. Interspeech 2017},
    pages = {2178--2182},
    address = {Stockholm, Sweden},
    month = aug,
    year  = {2017},
    URL   = {http://dx.doi.org/10.21437/Interspeech.2017-1139}
  }