Armenian Speech Crowdsourcing Data is part of an ongoing effort to expand the availability of high-quality speech resources for low-resource languages such as Armenian.
The dataset contains speech recordings collected via the Toloka crowdsourcing platform, in which contributors read aloud text provided by Yerevan City Magazine. The magazine agreed to share portions of its published text archive for open use in language technology research and development. All recordings were contributed voluntarily by native speakers recruited through the crowdsourcing platform.
To prepare the dataset, the collected audio was segmented into short utterances (typically 3–15 seconds) aligned with their corresponding transcriptions. All segments underwent a verification process to ensure reading accuracy before inclusion.
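The stated segment length can be spot-checked after download. The following is a minimal sketch, not part of the dataset tooling: it assumes the audio files are uncompressed PCM .wav files readable by Python's standard `wave` module, and the function names and the directory argument are hypothetical. Because the 3–15 second range is described as typical rather than strict, some segments outside it are to be expected.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of a PCM .wav file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def find_atypical_durations(audio_dir, min_seconds=3.0, max_seconds=15.0):
    """List (file name, duration) pairs that fall outside the given range.

    The default bounds mirror the "typically 3-15 seconds" statement;
    they are an assumption about what counts as atypical, not a hard rule.
    """
    outliers = []
    for wav_path in sorted(Path(audio_dir).rglob("*.wav")):
        duration = wav_duration_seconds(wav_path)
        if not (min_seconds <= duration <= max_seconds):
            outliers.append((wav_path.name, duration))
    return outliers
```

Calling `find_atypical_durations("pitched")` on the extracted audio directory would return the segments worth inspecting by hand.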
To protect the privacy of contributors and prevent potential misuse, the voices themselves were anonymized so that they could not be easily identified or matched to individual speakers. This step was taken to avoid risks such as unauthorized voice cloning, impersonation, or other privacy violations, and to ensure that the dataset can be safely used for research and development purposes.
The dataset contains:
pitched/ – Audio files in .wav format (70 hours total)
pitched.jsonl – Transcriptions and metadata

License: CC BY 4.0
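As a starting point for working with pitched.jsonl, the sketch below reads the file as JSON Lines (one JSON object per line, which the .jsonl extension suggests). The field names inside each record are not documented here, so the code makes no assumption about them and instead reports whichever keys it finds. The function names are hypothetical.

```python
import json


def load_manifest(manifest_path):
    """Parse a JSON Lines file into a list of dictionaries, one per line."""
    records = []
    with open(manifest_path, encoding="utf-8") as manifest_file:
        for line in manifest_file:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


def field_names(records):
    """Return the sorted set of keys that appear across all records."""
    return sorted({key for record in records for key in record})
```

Running `field_names(load_manifest("pitched.jsonl"))` should reveal which keys hold the audio path, the transcription, and any other metadata before further processing is written against them.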
About Yerevan City Magazine:
Yerevan City Magazine is a cultural and social publication serving the Armenian-speaking community. Through this collaboration, the magazine has contributed openly licensed text content to support the preservation and technological development of the Armenian language.
Please cite this dataset as:
@misc{armenian_speech_crowdsourcing_2025,
  title = {Armenian Speech Crowdsourcing Data},
  author = {Nikolay Karpov, nkarpov@nvidia.com},
  year = {2025},
  howpublished = {\url{https://evnmag.com/}}
}