Open Speech and Language Resources



AISHELL-5

Identifier: SLR159

Summary: The First Open-Source In-Car Multi-Channel Multi-Speaker Speech Dataset for Automatic Speech Diarization and Recognition, provided by Beijing AISHELL Technology Co., Ltd.

Category: Speech

License: CC BY-SA 4.0

Downloads (use a mirror closer to you):
train.tar.gz [52G]   (Training set: far-field and near-field microphone speech with transcripts)   Mirrors: [US]   [EU]   [CN]
Dev.tar.gz [2.1G]   (Development set)   Mirrors: [US]   [EU]   [CN]
Eval1.tar.gz [1.8G]   (Evaluation set 1)   Mirrors: [US]   [EU]   [CN]
Eval2.tar.gz [2.1G]   (Evaluation set 2)   Mirrors: [US]   [EU]   [CN]
noise.tar.gz [13G]   (Noise set)   Mirrors: [US]   [EU]   [CN]
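
A minimal Python sketch for fetching and unpacking the archives, assuming the usual OpenSLR mirror URL pattern (https://us.openslr.org/resources/159/<file>); substitute the mirror link for your region if that pattern does not match:

import tarfile
import urllib.request
from pathlib import Path

# Assumed US-mirror URL pattern; swap in the [EU] or [CN] link above if closer.
BASE_URL = "https://us.openslr.org/resources/159"
FILES = ["train.tar.gz", "Dev.tar.gz", "Eval1.tar.gz", "Eval2.tar.gz", "noise.tar.gz"]

def fetch_and_extract(name, dest):
    """Download one archive (if not already present) and unpack it under dest."""
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / name
    if not archive.exists():
        urllib.request.urlretrieve(BASE_URL + "/" + name, archive)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)

if __name__ == "__main__":
    for name in FILES:
        fetch_and_extract(name, Path("AISHELL-5"))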

About this resource:

The AISHELL-5 dataset was recorded inside a hybrid electric car. A far-field microphone is placed above the door handle of each of the four doors to capture far-field audio from different areas of the car; in addition, each speaker wears a high-fidelity microphone to collect near-field audio for data annotation. The recorded language is Chinese. A total of 165 participants, none with notable accents, took part in the recording. During each session, 2-4 speakers are randomly seated in the four positions inside the car and engage in free conversation without content restrictions, ensuring the naturalness and authenticity of the audio data. The average duration of a session is 10 minutes. The transcripts for all speech data are provided in TextGrid format.

The AISHELL-5 dataset contains more than 100 hours of speech, divided into 94 hours of training data (Train), 3.3 hours of validation data (Dev), and two test sets (Eval1 and Eval2) of 3.3 and 3.58 hours, respectively. Each subset includes far-field audio from 4 channels; only the training set also contains near-field audio. Additionally, to promote research on speech simulation techniques, we provide a large-scale noise dataset (Noise) of approximately 40 hours, recorded with the same settings as the far-field data but without any speaker speech. We also release a training and evaluation framework as a baseline system to promote reproducible research in this field. The baseline system code and generated samples are available here.
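
Since the transcripts are TextGrid files, a common first step is cutting a far-field channel into labelled utterance segments. The sketch below is illustrative only: it assumes the third-party `textgrid` and `soundfile` Python packages, and the wav/TextGrid paths are hypothetical, as the actual directory layout may differ.

import soundfile as sf
import textgrid  # pip install textgrid

# Hypothetical paths for one recording session; adjust to the real layout.
WAV = "train/session001/far_channel1.wav"
GRID = "train/session001/session001.TextGrid"

def cut_segments(wav_path, grid_path):
    """Yield (tier_name, start_s, end_s, text, samples) for each labelled interval."""
    rate = sf.info(wav_path).samplerate
    tg = textgrid.TextGrid.fromFile(grid_path)
    for tier in tg:                      # typically one tier per speaker
        for interval in tier:
            if not interval.mark.strip():
                continue                 # skip unlabelled (silence) intervals
            start = int(interval.minTime * rate)
            stop = int(interval.maxTime * rate)
            audio, _ = sf.read(wav_path, start=start, stop=stop)
            yield tier.name, interval.minTime, interval.maxTime, interval.mark, audio

for name, t0, t1, text, audio in cut_segments(WAV, GRID):
    print(f"[{name}] {t0:.2f}-{t1:.2f}s {text}")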

You can cite the data using the following BibTeX entry:


@inproceedings{AISHELL-5_2025,
  title={AISHELL-5: The First Open-Source In-Car Multi-Channel Multi-Speaker Speech Dataset for Automatic Speech Diarization and Recognition},
  author={Yuhang Dai and He Wang and Xingchen Li and Zihan Zhang and Shuiyuan Wang and Lei Xie and Xin Xu and Hongxiao Guo and Shaoji Zhang and Hui Bu and Wei Chen},
  booktitle={Interspeech},
  url={https://arxiv.org/pdf/2505.23036},
  year={2025}
}

External URL: https://www.aishelltech.com/AISHELL_5 (full description on the company website).