Open Speech and Language Resources

Samrómur Unverified 22.07

Identifier: SLR128

Summary: Samrómur Icelandic Speech, 2,200 hours of mostly unverified data approved for release in July 2022

Category: Speech

License: CC BY 4.0

Downloads (use a mirror closer to you): [8.9G]   (Icelandic speech and metadata files)   Mirrors: [US]   [EU]   [CN]
samromur_unverified_22.07.z01 [10G]   (part 2)   Mirrors: [US]   [EU]   [CN]
samromur_unverified_22.07.z02 [10G]   (part 3)   Mirrors: [US]   [EU]   [CN]
samromur_unverified_22.07.z03 [10G]   (part 4)   Mirrors: [US]   [EU]   [CN]
samromur_unverified_22.07.z04 [10G]   (part 5)   Mirrors: [US]   [EU]   [CN]
samromur_unverified_22.07.z05 [10G]   (part 6)   Mirrors: [US]   [EU]   [CN]
samromur_unverified_22.07.z06 [10G]   (part 7)   Mirrors: [US]   [EU]   [CN]
samromur_unverified_22.07.z07 [10G]   (part 8)   Mirrors: [US]   [EU]   [CN]
samromur_unverified_22.07.z08 [10G]   (part 9)   Mirrors: [US]   [EU]   [CN]
samromur_unverified_22.07.z09 [10G]   (part 10)   Mirrors: [US]   [EU]   [CN]
samromur_unverified_22.07.z10 [10G]   (part 11)   Mirrors: [US]   [EU]   [CN]
samromur_unverified_md5sums.txt [704 bytes]   (checksums to verify download)   Mirrors: [US]   [EU]   [CN]
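
Because the archive is split into several 10G parts, it is worth checking each part against the checksum file before extracting. The following is a minimal sketch of one way to do that in Python. It assumes that samromur_unverified_md5sums.txt uses the usual `md5sum` line layout (`<hex digest>  <file name>`), which this page does not state, and the helper names `md5_of_file`, `parse_md5sums`, and `verify_downloads` are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5sums(text):
    """Parse lines of the ASSUMED form '<hex digest>  <file name>' into a dict."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            digest, name = parts
            # md5sum marks binary-mode entries with a leading '*'; drop it.
            expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def verify_downloads(md5sums_path, download_dir):
    """Return {file name: True / False / None}; None means the file is missing."""
    expected = parse_md5sums(Path(md5sums_path).read_text())
    results = {}
    for name, digest in expected.items():
        path = Path(download_dir) / name
        results[name] = md5_of_file(path) == digest if path.exists() else None
    return results
```

Calling `verify_downloads("samromur_unverified_md5sums.txt", "downloads")`, where `downloads` is a placeholder directory name, returns one entry per listed file; any `False` value suggests that part should be downloaded again.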

About this resource:

This release from the Samrómur collection contains mostly UNVERIFIED data. It contains 2,159,314 speech recordings in Icelandic (2,233 hours), of which 84,161 have been verified. In addition, 700,000 utterances have been scored with marosijo, which indicates whether an utterance is likely to be valid.

The corpus is the result of a crowd-sourcing effort run by the Language and Voice Lab (LVL) at Reykjavik University, in cooperation with Almannarómur, the Icelandic Center for Language Technology. The recording process started in October 2019 and continues to this day (July 2022). The present edition of the corpus was authorized for release in July 2022. The aim is to create an open-source speech corpus to enable research and development for Icelandic Language Technology. The corpus consists of audio recordings and a metadata file containing the sentences read by the participants.

Participants range in age from 6 to 80+ years. The distributed audio files are encoded at a 16 kHz sampling rate, 16-bit linear PCM, 1 channel, in *.flac format. The corpus is NOT split into train, dev, and test subsets. If you need such subsets, please see the other Samrómur releases. All demographics are self-reported. The dataset contains folders that correspond to speaker IDs, and the audio files inside use the following naming convention: {speaker_ID}-{utterance_ID}.flac.
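
The folder layout and naming convention described above can be turned into a per-speaker index with a few lines of Python. This is only a sketch under stated assumptions: it assumes that speaker IDs contain no hyphen, so the first hyphen in a file name separates the two IDs, and the function names `parse_recording_name` and `index_by_speaker` are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def parse_recording_name(path):
    """Split '{speaker_ID}-{utterance_ID}.flac' into (speaker_id, utterance_id).

    ASSUMPTION: the speaker ID contains no hyphen, so the name is split
    at the first hyphen.
    """
    stem = Path(path).stem
    speaker_id, sep, utterance_id = stem.partition("-")
    if not sep or not speaker_id or not utterance_id:
        raise ValueError(f"unexpected file name: {path}")
    return speaker_id, utterance_id


def index_by_speaker(corpus_root):
    """Map each speaker ID to a sorted list of that speaker's .flac paths."""
    index = defaultdict(list)
    for flac_path in sorted(Path(corpus_root).rglob("*.flac")):
        speaker_id, _ = parse_recording_name(flac_path)
        index[speaker_id].append(flac_path)
    return dict(index)
```

Since the corpus ships without train, dev, and test subsets, an index like this is a convenient starting point for building speaker-disjoint splits of your own.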

You can cite the data using the following BibTeX entry:
        title={{Samr{\'o}mur Unverified 22.07}},
        author={Staffan Hedstr{\"o}m and Ragnhei{\dh}ur {\TH}{\'o}rhallsd{\'o}ttir and
            David Erik Mollberg and Sm{\'a}ri Freyr Gu{\dh}mundsson and {\'O}lafur Helgi
            J{\'o}nsson and Sunneva {\TH}orsteinsd{\'o}ttir and Judy Y. Fong and
            Eyd{\'\i}s Huld Magn{\'u}sd{\'o}ttir and Jon Gudnason},
        publisher={Reykjavik University: Language and Voice Lab},