The Pansori TEDxKR Corpus is a Korean automatic speech recognition (ASR) corpus generated from Korean-language TEDx talks given in Korea from 2010 to 2014. It contains about 3 hours of paired speech audio and transcripts from 41 speakers. The corpus was generated with Pansori, a new corpus data ingestion and processing system. Please refer to this code repository and the following paper for further information on the Pansori ASR corpus generation system:
@inproceedings{choi_2018,
title={{Pansori: ASR corpus generation from open online video contents}},
author={Choi, Yoona and Lee, Bowon},
booktitle={Proceedings of the IEEE Seoul Section Student Paper Contest 2018},
pages={117--121},
month={Nov},
year={2018},
}
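As a rough illustration of working with audio-transcript pairs like those described above, the following minimal sketch tallies utterance count, speaker count, and total hours. The `Utterance` fields, the file paths, and the sample records are hypothetical and are not taken from the corpus or the Pansori system; the actual on-disk layout may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """One audio-transcript pair (all field names are assumptions)."""
    speaker_id: str      # hypothetical speaker label
    audio_path: str      # hypothetical path to an audio segment
    transcript: str      # Korean transcript text
    duration_sec: float  # segment length in seconds


def summarize(utterances):
    """Return basic corpus statistics for a list of Utterance records."""
    total_hours = sum(u.duration_sec for u in utterances) / 3600.0
    speakers = {u.speaker_id for u in utterances}
    return {
        "utterances": len(utterances),
        "speakers": len(speakers),
        "hours": total_hours,
    }


# Made-up sample records, for illustration only.
sample = [
    Utterance("spk01", "spk01/0001.flac", "안녕하세요", 2.5),
    Utterance("spk01", "spk01/0002.flac", "감사합니다", 1.5),
    Utterance("spk02", "spk02/0001.flac", "반갑습니다", 3.2),
]
stats = summarize(sample)
print(stats)
```

Run over a full set of records, a summary of this kind could be compared against the figures stated for the corpus (about 3 hours, 41 speakers) as a quick sanity check.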
Extra care was taken to maintain the quality of the generated corpus.
Electronics Engineering, Inha University (link)