Our challenge has reached the final evaluation stage. We have released our Test dataset; the download link is in the dataset section. Please be sure to submit your prediction results on the submission page.
As cars become an indispensable part of daily life, a secure and comfortable driving environment is increasingly attractive. Touch-based interaction in the traditional cockpit can easily distract the driver's attention, leading to inefficient operation and potential safety risks. Thus, the concept of the intelligent cockpit is gradually on the rise.
The intelligent cockpit aims to achieve a seamless driving experience by integrating multimodal intelligent interactions, such as speech, gestures, and body movements, with different driving functions, such as command recognition, entertainment, and navigation. As a natural human-computer interaction method, a robust speech or command recognition system is crucial to the intelligent cockpit. Although speech recognition has achieved great progress in many applications, there are still many challenges in the driving scenario. First, the acoustic environment of the cockpit is complex. Since the cockpit is a closed and irregular space, it has a special room impulse response (RIR), resulting in special reverberation conditions. In addition, there are various kinds of noise during driving from both inside and outside the vehicle, such as wind, engine, wheel, background music, and interfering speakers. Second, the main content of intelligent cockpit speech interaction is the user's commands, which include controlling the air conditioner, playing songs, navigating, etc. These commands may involve a large number of named entities such as contacts, singer names, and points of interest (POI).
Nowadays there is a large amount of open-source data for speech recognition, and models trained with such data have achieved good performance in many applications. However, these models often perform poorly in the intelligent cockpit scene because of its special acoustic environment and content characteristics. Therefore, we launch the Intelligent Cockpit Speech Recognition Challenge (ICSRC), in which we release an intelligent cockpit dataset and aim to explore speech recognition techniques for intelligent cockpit scenes. The corpus consists of 20 hours of real-world data recorded by a Hi-Fi microphone placed in a car under different driving conditions. The competition consists of 2 tracks with different limits on model configuration.
We set up two tracks in the challenge for participants to investigate intelligent cockpit speech recognition under different model size constraints.
Both tracks allow participants to use the training data listed in the dataset section. Participants must specify the data used in the final system description paper and describe the data simulation scheme in detail.
The accuracy of the ASR system is measured by Character Error Rate (CER). The CER indicates the percentage of characters that are incorrectly predicted. For a given hypothesis, it computes the minimum number of insertions (Ins), substitutions (Subs), and deletions (Del) of characters required to transform the hypothesis into the reference transcript. Specifically, CER is calculated as

CER = (N_Ins + N_Subs + N_Del) / N_Total × 100%

where N_Ins, N_Subs, and N_Del are the numbers of the three kinds of character errors, and N_Total is the total number of characters in the reference. As is standard, insertions, deletions, and substitutions all count as errors.
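The minimum edit counts in the CER formula can be obtained with the standard Levenshtein dynamic program. A minimal sketch (the function name and interface here are illustrative, not the official scoring tool):

```python
def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: minimum edit distance from hyp to ref,
    divided by the number of reference characters, as a percentage."""
    m, n = len(ref), len(hyp)
    # dp[i][j] = min edits to transform first j chars of hyp
    # into first i chars of ref
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # i deletions needed if hyp is empty
    for j in range(n + 1):
        dp[0][j] = j  # j insertions needed if ref is empty
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])  # match/sub
            dp[i][j] = min(sub,
                           dp[i - 1][j] + 1,   # deletion
                           dp[i][j - 1] + 1)   # insertion
    return dp[m][n] / m * 100.0

# Example: one substitution out of four reference characters -> 25% CER
print(cer("打开车窗", "打开天窗"))  # prints 25.0
```

Note that because insertions also count as errors, CER can exceed 100% when the hypothesis is much longer than the reference.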
The challenge dataset contains 20 hours of speech data in total. It was collected in a new energy vehicle with a Hi-Fi microphone placed on the car's display screen. During recording, the speakers sat in the passenger seats, at a distance of around 0.5 m from the microphone. All speakers are native Chinese speakers of Mandarin without strong accents. During driving, the driver may change speed, open windows, and play music, covering various scenes and conditions. The dataset can be divided into five categories:
The detailed statistics of the dataset are shown in Table 1.
In this challenge, the dataset is divided into 10 hours for evaluation (Eval set) and 10 hours for scoring and ranking (Test set). Both the Eval and Test sets contain 50 speakers with balanced gender coverage. The Eval set will be released to participants at the beginning of the challenge, while the Test set will be released at the final scoring stage. For the training set, participants are allowed to use only the following open-source corpora from OpenSLR.
All participants should adhere to the following rules to be eligible for the challenge.
Potential participants from both academia and industry should send an email to email@example.com to register for the challenge on or before September 10, meeting the following requirements:
The organizer will notify qualified teams by email within 3 working days. Qualified teams must abide by the challenge rules.
Participants should submit their results via the submission system. Once a submission is completed, it will appear on the Leaderboard, where all participants can check their standings. For each track, participants may submit results no more than 3 times a day.
The ICSRC 2022 final ranking list is shown below:
The top-ranking teams will be invited to submit challenge papers; accepted papers will be included in the ISCSLP 2022 conference proceedings and presented in the challenge session of the technical program.