Open Speech and Language Resources



Cantab-TEDLIUM Release 1.1 (February 2015)

Identifier: SLR27

Summary: Cantab Research language models for the TEDLIUM database

Category: Text

License: unspecified

Downloads (use a mirror closer to you):
cantab-TEDLIUM.tar.bz2 [1.6G]  (Original archive)  Mirrors: [US] [EU] [CN]
cantab-TEDLIUM-partial.tar.bz2 [220M]  (Partial archive for the Kaldi TEDLIUM recipe)  Mirrors: [US] [EU] [CN]
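
Both archives can be fetched and unpacked with the Python standard library; a minimal sketch for the full archive is below, using the release URL given in the README further down (substitute a mirror URL if one is closer to you).

    # Minimal sketch: download and unpack the full (~1.6 GB) archive.
    import tarfile
    import urllib.request

    URL = "http://cantabResearch.com/cantab-TEDLIUM.tar.bz2"
    local = "cantab-TEDLIUM.tar.bz2"

    urllib.request.urlretrieve(URL, local)         # large download, be patient
    with tarfile.open(local, "r:bz2") as archive:  # bzip2-compressed tar
        archive.extractall()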

About this resource:

Cantab-TEDLIUM Release 1.1 (February 2015)

This is the README from the release http://cantabResearch.com/cantab-TEDLIUM.tar.bz2.

This release contains all the files required to reproduce the IWSLT baseline results quoted in Section 5.2 of "Scaling Recurrent Neural Network Language Models" (ICASSP 2015), which can be found at http://arxiv.org/abs/1502.00512.

Contents

  • cantab-TEDLIUM.txt contains 155,290,779 tokens of text, entropy-filtered from http://cantabResearch.com/cantab-1bn-norm.tar.bz2, which in turn was generated from https://code.google.com/p/1-billion-word-language-modeling-benchmark/.
  • cantab-TEDLIUM-unpruned.lm3 is a 3-gram language model built from cantab-TEDLIUM.txt with Witten-Bell smoothing.
  • cantab-TEDLIUM-pruned.lm3 is the pruned version of cantab-TEDLIUM-unpruned.lm3, suitable for a first-pass decode with Kaldi.
  • cantab-TEDLIUM-unpruned.lm4 is an unpruned, Kneser-Ney-smoothed 4-gram language model for rescoring lattices produced by the first-pass decode (see the scoring sketch after this list).
  • cantab-TEDLIUM.dct is the 150,000-word vocabulary for the above two LMs, including phonetic pronunciations (see the parsing sketch after this list).
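
A quick way to sanity-check the 4-gram before wiring it into a rescoring pipeline is to query it directly. The sketch below uses the kenlm Python bindings (https://github.com/kpu/kenlm), which are not part of this release, and assumes cantab-TEDLIUM-unpruned.lm4 is a standard ARPA file; adjust the path to wherever you unpacked the archive.

    # Sketch only: score text with the unpruned 4-gram via kenlm.
    # Assumes the .lm4 file is plain ARPA and that kenlm is installed
    # separately (it is not shipped with this release).
    import kenlm

    lm = kenlm.Model("cantab-TEDLIUM/cantab-TEDLIUM-unpruned.lm4")  # adjust path

    sentence = "this is a test sentence"
    # Total log10 probability, with sentence-start/end symbols added.
    print(lm.score(sentence, bos=True, eos=True))
    # Per-sentence perplexity, handy for comparing against the pruned 3-gram.
    print(lm.perplexity(sentence))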
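
The dictionary can be loaded with a few lines of standard Python. This is a hypothetical sketch that assumes the usual plain-text lexicon layout of one entry per line, the word first and its phone sequence after it; check the file itself before relying on that assumption.

    # Sketch only: read cantab-TEDLIUM.dct into a word -> pronunciations map,
    # assuming "WORD phone phone ..." per line (format not verified here).
    lexicon = {}
    with open("cantab-TEDLIUM.dct", encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            word, phones = parts[0], parts[1:]
            # Keep every pronunciation variant listed for a word.
            lexicon.setdefault(word, []).append(phones)

    print(len(lexicon), "words loaded")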
Contact: tonyr _at_ cantabresearch.com