This dataset contains approximately 30 hours of speech by Shaul Amsterdamski, recorded in a studio at a sampling rate of 44,100 Hz, with corresponding transcriptions.

The data is divided into a gold-standard subset of roughly 4 hours with manual transcriptions and an automatic subset with machine-generated transcriptions.

See README files inside the archives for more details.
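After extracting an archive, a quick sanity check is to confirm the stated sampling rate and total duration. The following is a minimal sketch, not part of the dataset tooling: it assumes the audio is stored as PCM `.wav` files somewhere under the extracted directory (the README files inside the archives describe the actual layout), and the function name `summarize_wavs` is purely illustrative.

```python
import wave
from pathlib import Path

EXPECTED_RATE_HZ = 44100  # sampling rate stated in the dataset description


def summarize_wavs(root):
    """Return (total_hours, mismatched_files) for all .wav files under root.

    Assumes PCM WAV files readable by the standard-library wave module.
    """
    total_seconds = 0.0
    mismatched = []
    for path in sorted(Path(root).rglob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
            # Duration of one file = number of frames / frames per second.
            total_seconds += wav.getnframes() / rate
            if rate != EXPECTED_RATE_HZ:
                mismatched.append(path)
    return total_seconds / 3600.0, mismatched
```

Calling `summarize_wavs` on the directory where an archive was extracted returns the total hours of audio found and a list of any files whose sampling rate differs from 44,100 Hz.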

The dataset was originally published as part of the robo-shaul competition under a license agreement that is available in Hebrew only. The license is also included in the dataset archives as the file robo_shaul_terms.pdf. If the attached license conflicts with the version available online, the online version takes precedence.

A summary of the terms in English:

Copyright for the recordings and corresponding transcriptions is owned solely by the Israeli Public Broadcast Corporation (IPBC).

The dataset is free for use for non-commercial purposes, under the following limitations, whether by positive act or by omission:

You can cite the data using the following BibTeX entry:

@inproceedings{sharoni23_interspeech,
    author={Orian Sharoni and Roee Shenberg and Erica Cooper},
    title={{SASPEECH: A Hebrew Single Speaker Dataset for Text To Speech and Voice Conversion}},
    year=2023,
    booktitle={Proc. Interspeech 2023},
    pages={To Appear}
}