Open-source

Our lab has led and participated in the development of several open-source toolkits, projects, and datasets. A selection is listed below.

Software


ESPnet is an end-to-end speech processing toolkit, with broad coverage of speech recognition, text-to-speech, speech enhancement/separation, and speech translation. ESPnet uses PyTorch as its main deep learning engine, and also follows Kaldi-style data processing, feature extraction/formats, and recipes to provide a complete setup for speech recognition and other speech processing experiments.
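As a rough illustration of how ESPnet is typically driven from Python, the sketch below runs ASR inference with a pre-trained model from the companion espnet_model_zoo package. The model tag is a placeholder, and the exact arguments may differ across ESPnet versions.

    # Minimal ESPnet2 ASR inference sketch; the model tag is a placeholder.
    import soundfile
    from espnet_model_zoo.downloader import ModelDownloader
    from espnet2.bin.asr_inference import Speech2Text

    downloader = ModelDownloader()
    # download_and_unpack returns the config/checkpoint paths that
    # Speech2Text accepts as keyword arguments
    speech2text = Speech2Text(
        **downloader.download_and_unpack("<model-tag-from-espnet-model-zoo>")
    )

    speech, rate = soundfile.read("speech.wav")  # 16 kHz mono audio
    text, *_ = speech2text(speech)[0]            # best hypothesis from the n-best list
    print(text)

Training, by contrast, typically runs through the Kaldi-style shell recipes (stage-based run.sh scripts) shipped with each task rather than this Python interface.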

Kaldi is a toolkit for speech recognition written in C++ and licensed under the Apache License v2.0. Kaldi is intended for use by speech recognition researchers.

s3prl is an open-source toolkit whose name stands for Self-Supervised Speech Pre-training and Representation Learning. Self-supervised pre-trained speech models are called upstream models in this toolkit, and are utilized in various downstream tasks.
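As a hedged sketch of the upstream/downstream split, the snippet below loads an upstream model through s3prl's hub interface and extracts layer-wise representations that a downstream task could build on. The upstream name ("wav2vec2") is only an example, and the output keys may differ between s3prl versions.

    # Sketch: load an s3prl upstream model and extract speech representations.
    import torch
    import s3prl.hub as hub

    upstream = getattr(hub, "wav2vec2")()  # a self-supervised pre-trained "upstream"
    upstream.eval()

    wavs = [torch.randn(16000) for _ in range(2)]  # two 1-second, 16 kHz waveforms
    with torch.no_grad():
        outputs = upstream(wavs)

    # layer-wise hidden states consumed by downstream tasks (ASR, speaker ID, ...)
    features = outputs["hidden_states"]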

Espresso is an open-source, modular, extensible end-to-end neural automatic speech recognition (ASR) toolkit based on the deep learning library PyTorch and the popular neural machine translation toolkit fairseq. Espresso supports distributed training across GPUs and computing nodes, and features various decoding approaches commonly employed in ASR, including look-ahead word-based language model fusion, for which a fast, parallelized decoder is implemented.
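To make the language-model fusion idea concrete, here is a toy sketch of shallow-fusion scoring, i.e. interpolating ASR and LM log-probabilities at each decoding step. It is a schematic of the general technique only, not Espresso's look-ahead word-based decoder or its API.

    # Toy illustration of shallow fusion during beam search:
    # combined score = ASR log-prob + lm_weight * LM log-prob.
    # Not Espresso's actual decoder; just the underlying scoring idea.
    import torch

    def fused_scores(asr_log_probs, lm_log_probs, lm_weight=0.3):
        """Combine per-token ASR and LM log-probabilities for one decoding step."""
        return asr_log_probs + lm_weight * lm_log_probs

    # scores over a toy 5-token vocabulary for a single beam hypothesis
    asr = torch.log_softmax(torch.randn(5), dim=-1)
    lm = torch.log_softmax(torch.randn(5), dim=-1)
    next_token = fused_scores(asr, lm).argmax().item()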

Muskits is an open-source music processing toolkit, currently focusing on benchmarking end-to-end singing voice synthesis, with more tasks expected in the future. Muskits employs PyTorch as its deep learning engine and also follows ESPnet- and Kaldi-style data processing and recipes to provide a complete setup for various music processing experiments.

Projects


OWSM Open Whisper-style Speech Models (OWSM, pronounced "awesome") are a series of speech foundation models developed by WAVLab at Carnegie Mellon University. We reproduce Whisper-style training using publicly available data and our open-source toolkit ESPnet. By publicly releasing data preparation scripts, training and inference code, pre-trained model weights, and training logs, we aim to promote transparency and open science in large-scale speech pre-training.
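As a hedged example of running recognition with a released OWSM checkpoint through ESPnet, the sketch below uses the espnet2 speech-to-text inference interface. The model tag and the language/task symbol arguments are assumptions and should be checked against the current OWSM documentation.

    # Hedged sketch: ASR with a released OWSM checkpoint via ESPnet.
    # Model tag and lang/task symbols are assumptions; see the OWSM docs.
    import soundfile
    from espnet2.bin.s2t_inference import Speech2Text

    s2t = Speech2Text.from_pretrained(
        "espnet/owsm_v3.1_ebf",  # assumed Hugging Face model tag
        lang_sym="<eng>",        # target-language token (assumed)
        task_sym="<asr>",        # task token, e.g. ASR vs. speech translation (assumed)
    )

    speech, rate = soundfile.read("speech.wav")  # 16 kHz mono audio
    text, *_ = s2t(speech)[0]                    # best hypothesis
    print(text)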

Datasets


Following the success of the 1st, 2nd, 3rd, 4th and 5th CHiME challenges, we are pleased to announce the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6). The new challenge will consider distant multi-microphone conversational speech diarization and recognition in everyday home environments. Speech material was elicited using a dinner party scenario with efforts taken to capture data that is representative of natural conversational speech.

AVSR (MISP2021) The Audio-Visual Speech Recognition (AVSR) corpus of the MISP2021 challenge is a large-scale audio-visual Chinese conversational corpus consisting of 141 hours of audio and video data collected by far/middle/near microphones and far/middle cameras in 34 real-home TV rooms. To the best of our knowledge, this corpus is the first distant multi-microphone conversational Chinese audio-visual corpus and the first large-vocabulary continuous Chinese lip-reading dataset in the adverse home-TV scenario.

AVWWS (MISP2021) Audio-Visual Wake Word Spotting (AVWWS) concerns the identification of predefined wake word(s) in utterances. A label of '1' indicates that the sample contains the wake word, and '0' indicates the opposite. For more information, please refer to the MISP Challenge task 1 description.

SPGISpeech SPGISpeech is a corpus of 5,000 hours of professionally transcribed financial audio. In contrast to previous transcription datasets, SPGISpeech contains a broad cross-section of L1 and L2 English accents, strongly varying audio quality, and both spontaneous and narrated speech. The transcripts have each been cross-checked by multiple professional editors for high accuracy and are fully formatted, including capitalization, punctuation, and denormalization of non-standard words.

GigaSpeech GigaSpeech is an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training.

ASR corpus for endangered language documentation (Yoloxóchitl Mixtec) The substantive material of the Yoloxóchitl Mixtec speech corpus (Glottocode: yolo1241 | ISO 639-3 = xty) presented here was brought together over ten years by Jonathan D. Amith (PI) and Rey Castillo García, a native-speaker linguist from the community of Yoloxóchitl. The corpus is designed for ASR research in endangered language documentation.

ASR and ST corpus for endangered language documentation (Puebla Nahuatl) The substantive material of the Puebla Nahuatl speech corpus was gathered over ten years by Jonathan D. Amith (PI) and a team of native-speaker colleagues who have participated in the project for many years, one from its inception in 2009. The corpus is designed for ASR and ST research in endangered language documentation.

ASR corpus for endangered language documentation (Totonac) The substantive material of Totonac from the northern sierras of Puebla and adjacent areas of Veracruz was compiled starting in 2016 by Jonathan D. Amith and continues to the present as part of a joint effort by Amith and Osbel López Francisco, a native-speaker biologist from Zongozotla.