What data we host
We are open to hosting any type of data that's useful for speech recognition and related tasks, that needs a stable URL where it can be downloaded from. We may think more carefully in cases where the data is very large (e.g. tens of gigabytes or more).
Submitting your data
The process of adding data to OpenSLR is as follows. First you might want to quickly check with us whether the data you want to contribute is something we want to host; you can email jtrmal@gmail.com. If we think it's a good idea, you can prepare a .tar.gz file containing a directory with your data in it.
The format of submitted data
The directory that you transfer to us as a .tar.gz file should not contain subdirectories; it should just contain the files you want to host and two special files called info.txt and about.html, whose format we'll explain below. Here is an example of such a directory:
# ls /var/www/openslr/resources/6
about.html  data_voip_cs.tgz  data_voip_en.tgz  info.txt

Note: the .tgz files inside it are the actual files that we're offering for download (and there is no limitation on their names or file type, except for the no-subdirectories rule). What you would transfer to us is a .tar.gz file containing /var/www/openslr/resources/6, i.e. the four files you see in the listing above. The information in the two special files is used to automatically populate the web page at http://www.openslr.org/6/. An example of what the info.txt file looks like is as follows:
root@www:/var/www/openslr# cat /var/www/openslr/resources/6/info.txt
name: Vystadial
summary: English and Czech data, mirrored from the Vystadial project
category: speech
license: Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0 US)
file: data_voip_cs.tgz Czech speech and transcripts
file: data_voip_en.tgz English speech and transcripts
alternate_url: https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-4670-6 Czech data
alternate_url: https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-4671-4 English data

This is a plain-text file that will be parsed by PHP scripts on our site. Some of the fields are mandatory and must appear only once: the name, summary, category and license fields.
The name field gives the name of your resource, which shouldn't be too long. The summary is a short, sentence-length description of the resource. The category will normally be either "speech", "text" or "software", but it can have other values too. The license line should be concise; it can just summarize the license, which we assume is explained more fully in the download itself or in the about.html file. There may be multiple instances of the file field; each one corresponds to one of the files in the directory you sent us. The text after the filename in the file field is optional; if your resource only contains one file it may not be necessary. The alternate_url field is optional and, if it occurs, may be repeated; the text after the URL is optional.
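The field rules above can be checked before you send anything. The following is only a minimal sketch in Python, not the PHP parser the site actually uses: the function names parse_info, check_info and split_entry are hypothetical, and it assumes each line has the form "key: value" and that filenames and URLs contain no spaces.

```python
from collections import defaultdict

# Fields the text above describes as mandatory and appearing only once.
MANDATORY = ("name", "summary", "category", "license")


def parse_info(text):
    """Parse info.txt-style text into a dict mapping field name to a list of values.

    Assumption: every non-empty line looks like "key: value".
    """
    fields = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Split at the first colon only, so URLs in the value stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"line is not of the form 'key: value': {line!r}")
        fields[key.strip()].append(value.strip())
    return dict(fields)


def check_info(fields):
    """Return a list of problems; an empty list means the mandatory-field check passed."""
    problems = []
    for key in MANDATORY:
        count = len(fields.get(key, []))
        if count != 1:
            problems.append(f"'{key}' must appear exactly once, found {count}")
    return problems


def split_entry(value):
    """Split a file or alternate_url value into (target, optional description).

    Assumption: the filename or URL itself contains no spaces.
    """
    target, _, description = value.partition(" ")
    return target, description.strip()
```

For instance, calling check_info(parse_info(text)) on the contents of the example info.txt should report no problems, and split_entry applied to a file value separates the filename from its optional description.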
The about.html file is generic HTML which will be included in the "about this resource" section of the automatically generated webpage. Just send us a first guess and you can edit it later if needed. In our example, the about.html file looks like this:
This data is transcribed telephone converation data, in English and Czech.
<p>
The data collection process and development of these training scripts was partly funded by the Ministry of Education, Youth and Sports of the Czech Republic under the grant agreement LK11221 and core research funding of Charles University in Prague.
<p>
You can cite the data using the following BibTeX entry:
<pre>
@inproceedings{korvas_2014,
  title={{Free English and Czech telephone speech corpus shared under the CC-BY-SA 3.0 license}},
  author={Korvas, Mat\v{e}j and Pl\'{a}tek, Ond\v{r}ej and Du\v{s}ek, Ond\v{r}ej and \v{Z}ilka, Luk\'{a}\v{s} and Jur\v{c}\'{i}\v{c}ek, Filip},
  booktitle={Proceedings of the Eigth International Conference on Language Resources and Evaluation (LREC 2014)},
  pages={To Appear},
  year={2014},
}
</pre>

Once you have your .tar.gz file containing the info.txt and about.html files and your actual data, you can transfer it to us (we'll have to discuss the exact mechanism if it's too big to fit in email) and we'll check it and put it on the site.