Our challenge has reached the final evaluation stage. We have released our Test dataset; the download link is in the dataset section. Please be sure to submit your prediction results on the submission page.
As cars become an indispensable part of daily life, a secure and comfortable driving environment is increasingly attractive. Touch-based interaction in the traditional cockpit can easily distract the driver's attention, leading to inefficient operation and potential safety risks. Thus, the concept of the intelligent cockpit is gradually on the rise.
The intelligent cockpit aims to achieve a seamless driving experience by integrating multimodal intelligent interactions, such as speech, gestures, and body movements, with different driving functions, such as command recognition, entertainment, and navigation. As a natural human-computer interaction method, a robust speech or command recognition system is crucial to the intelligent cockpit. Although speech recognition has achieved great progress in many applications, there are still many challenges in the driving scenario. First, the acoustic environment of the cockpit is complex. Since the cockpit is a closed and irregular space, it has a special room impulse response (RIR), resulting in special reverberation conditions. In addition, there are various kinds of noise during driving from both inside and outside the vehicle, such as wind, engine, wheel, background music, and interfering speakers. Second, the main content of intelligent cockpit speech interaction is the user's commands, which include controlling the air conditioner, playing songs, navigating, etc. These commands may involve a large number of named entities such as contacts, singer names, and points of interest (POI).
Nowadays there is a large amount of open-source data for speech recognition, and models trained with such data have achieved good performance in many applications. However, these models often perform poorly in the intelligent cockpit scene because of its special acoustic environment and content characteristics. Therefore, we launch the Intelligent Cockpit Speech Recognition Challenge (ICSRC), in which we release an intelligent cockpit dataset and aim to explore speech recognition techniques for intelligent cockpit scenes. The corpus consists of 20 hours of real-world data recorded by a Hi-Fi microphone placed in a car under different driving conditions. The competition consists of 2 tracks with different limits on model configuration.
We set up two tracks in the challenge for participants to investigate intelligent cockpit speech recognition under different model size constraints.
Both tracks allow participants to use the training data listed in the dataset section. Participants must specify the data used in the final system description paper and describe the data simulation scheme in detail.
The accuracy of the ASR system is measured by Character Error Rate (CER). The CER indicates the percentage of characters that are incorrectly predicted. For a given hypothesis, it computes the minimum number of insertions (Ins), substitutions (Subs), and deletions (Del) of characters required to transform the hypothesis into the reference transcript. Specifically, CER is calculated as

CER = (N_Ins + N_Subs + N_Del) / N_Total × 100%

where N_Ins, N_Subs, and N_Del are the numbers of the three kinds of character errors, and N_Total is the total number of characters in the reference. As is standard, insertions, deletions, and substitutions all count as errors.
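The minimum edit counts in the CER formula can be obtained with the standard Levenshtein dynamic program. A minimal sketch (the function name and interface here are illustrative, not the official scoring tool):

```python
def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: minimum edit distance from hyp to ref,
    divided by the number of reference characters, as a percentage."""
    m, n = len(ref), len(hyp)
    # dp[i][j] = min edits to transform first j chars of hyp
    # into first i chars of ref
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # i deletions needed if hyp is empty
    for j in range(n + 1):
        dp[0][j] = j  # j insertions needed if ref is empty
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])  # match/sub
            dp[i][j] = min(sub,
                           dp[i - 1][j] + 1,   # deletion
                           dp[i][j - 1] + 1)   # insertion
    return dp[m][n] / m * 100.0

# Example: one substitution out of four reference characters -> 25% CER
print(cer("打开车窗", "打开天窗"))  # prints 25.0
```

Note that because insertions also count as errors, CER can exceed 100% when the hypothesis is much longer than the reference.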
The challenge dataset contains 20 hours of speech data in total. It was collected in a new energy vehicle with a Hi-Fi microphone placed on the car's display screen. During recording, the speakers sat in the passenger seats, at a distance of around 0.5 m from the microphone. All speakers are native Chinese speakers of Mandarin without strong accents. During driving, the driver may change speed, open windows, and play music, covering various scenes and conditions. The dataset can be divided into five categories:
The detailed statistics of the dataset are shown in Table 1.
In this challenge, the dataset is divided into 10 hours for evaluation (Eval set) and 10 hours for scoring and ranking (Test set). Both the Eval and Test sets contain 50 speakers with balanced gender coverage. The Eval set will be released to participants at the beginning of the challenge, while the Test set will be released at the final scoring stage. For the training set, participants are allowed to use only the following open-source corpora from OpenSLR.
All participants should adhere to the following rules to be eligible for the challenge.
Potential participants from both academia and industry should send an email to email@example.com to register for the challenge on or before September 10, meeting the following requirements:
The organizer will notify qualified teams by email within 3 working days. Qualified teams must abide by the challenge rules.
Participants should submit their results via the submission system. Once a submission is completed, it will appear on the Leaderboard, where all participants can check their standings. For each track, participants may submit results no more than 3 times a day.
The ICSRC 2022 final ranking list is shown below:
The top-ranking teams will be invited to submit challenge papers; accepted papers will be included in the ISCSLP 2022 conference proceedings and presented in the challenge session of the technical program.