Installation

Starting a bash Slurm job

srun --pty -n 1 --cpus-per-task=8  --gres=gpu:1 --mem=12G /bin/bash -l

Loading Required Modules

module load cuda-11.1.1 cudnn-11.1.1-v8.0.4.30 gcc-5.5.0

Miniconda installation

   wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
   bash Miniconda3-latest-Linux-x86_64.sh

Installing ESPnet & Kaldi

cd <path-to-your-projects>
git clone https://github.com/espnet/espnet

cd <espnet-root>/tools/
git clone https://github.com/kaldi-asr/kaldi

# setup virtual environment (venv) for python
cd <espnet-root>/tools/
./setup_venv.sh $(command -v python3)

Building Kaldi

cd <espnet-root>/tools/
. activate_python.sh

Check dependencies and install OpenBLAS (MKL and ATLAS installations need sudo privileges)

cd <kaldi-root>/tools/
extras/check_dependencies.sh
make -j 8
./extras/install_openblas.sh
./extras/install_irstlm.sh

cd <kaldi-root>/src
# without CUDA (ESPnet uses Kaldi only as a feature extractor, so you can disable CUDA)
./configure --openblas-root=../tools/OpenBLAS/install --use-cuda=no
make -j clean depend; make -j 8

Building ESPnet

cd <espnet-root>/tools
make -j 8 CUDA_VERSION=11.1 TH_VERSION=1.8.1

Check ESPnet + Kaldi installation

cd <espnet-root>/egs/an4/asr1/
./run.sh

Exit the bash Slurm job

exit

Misc.

Installing sox manually (if unable to install via conda)

mkdir -p ~/rpm
cd ~/rpm
wget http://mirror.centos.org/centos/7/os/x86_64/Packages/sox-14.4.1-7.el7.x86_64.rpm
rpm2cpio ~/rpm/sox-14.4.1-7.el7.x86_64.rpm | cpio -id

export PATH="$HOME/rpm/usr/sbin:$HOME/rpm/usr/bin:$HOME/rpm/bin:$PATH"
L='/lib:/lib64:/usr/lib:/usr/lib64'
export LD_LIBRARY_PATH="$L:$HOME/rpm/usr/lib:$HOME/rpm/usr/lib64"

Note: You can also add the last 3 lines to your ~/.bashrc file, since logging in over ssh reads and executes commands from ~/.bashrc.
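
For example, a quick way to append them (a sketch; the paths follow the rpm extraction above):

cat >> ~/.bashrc <<'EOF'
# manually extracted sox (see "Installing sox manually" above)
export PATH="$HOME/rpm/usr/sbin:$HOME/rpm/usr/bin:$HOME/rpm/bin:$PATH"
L='/lib:/lib64:/usr/lib:/usr/lib64'
export LD_LIBRARY_PATH="$L:$HOME/rpm/usr/lib:$HOME/rpm/usr/lib64"
EOF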

Install Sox from scratch with flac support: refer to this.
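
A minimal sketch of such a from-source build, assuming a local prefix under $HOME (version numbers and URLs are illustrative, not pinned by this guide):

# build libflac into a local prefix first (no sudo needed)
wget https://downloads.xiph.org/releases/flac/flac-1.3.3.tar.xz
tar xf flac-1.3.3.tar.xz && cd flac-1.3.3
./configure --prefix=$HOME/local && make -j 8 && make install && cd ..

# build sox against it so flac support is picked up at configure time
wget https://downloads.sourceforge.net/project/sox/sox/14.4.2/sox-14.4.2.tar.gz
tar xzf sox-14.4.2.tar.gz && cd sox-14.4.2
CPPFLAGS="-I$HOME/local/include" LDFLAGS="-L$HOME/local/lib" ./configure --prefix=$HOME/local
make -j 8 && make install
export PATH="$HOME/local/bin:$PATH"
sox --help | grep -i flac   # flac should appear among the supported formats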

Login

A simple ssh should work:

ssh <your_username>@tir.lti.cs.cmu.edu

Usage details

General guidelines

There is a general document for TIR usage at https://docs.google.com/document/d/1ieMgNos6F97XAtfD_m6WINte1gAQcPDqpqbbB82Rv4A/edit?usp=sharing

Data storage

We have stored many databases in /projects/tir5/data/speech_corpora. Please check that directory before downloading a corpus on your own. At the same time, please add new databases there if you download any others.
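
For example, to check whether a corpus is already there before downloading your own copy (the corpus name is just an example):

ls /projects/tir5/data/speech_corpora | grep -i librispeech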

IO issues

TIR may have IO issues when models are trained directly on data stored on the storage node. One option is to copy the prepared features to /tmp/ on the compute node and run training from there; delete the related files from /tmp before exiting.
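
Before copying, it is worth checking how much space is free on the compute node's /tmp and how large your prepared features are:

df -h /tmp
du -sh <path-to-prepared-features>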

The procedure for ESPnet1 looks like:

dumpdir=`mktemp -d /tmp/st-XXXX`    # directory to dump full features

feat_tr_dir=${dumpdir}/${train_set}/delta${do_delta}; mkdir -p ${feat_tr_dir}
feat_dt_dir=${dumpdir}/${train_dev}/delta${do_delta}; mkdir -p ${feat_dt_dir}

The procedure for ESPnet2 looks like:

In run.sh, set the audio format to reduce IO issues:

--audio_format "flac.ark" \

At the start of the training stage (e.g., asr.sh stage 11), add:

tempdir=$(mktemp -d "/tmp/<your_projectname>-$$.XXXXXXXX")
trap 'rm -rf ${tempdir}' EXIT
cp -r "${data_feats}" ${tempdir}
# or rsync -zav --progress --bwlimit=100 "${data_feats}" ${tempdir}
data_feats="${tempdir}/$(basename ${data_feats})"
scp_lists=$(find ${tempdir} -type f -name "*.scp")
for f in ${scp_lists}; do
    sed -i -e "s/${dumpdir//\//\\/}/${tempdir//\//\\/}/g" $f
done

At the end of asr.sh, add:

rm -rf ${tempdir}

Since the /tmp folder is specific to each compute node, please set the cmd to local and submit run.sh with sbatch, as in the sketch below.
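
A minimal sketch of such a submission script (the file name, resources, and --ngpu flag are illustrative; it assumes cmd_backend is set to "local" in cmd.sh):

#!/bin/bash
#SBATCH --job-name=<your_projectname>
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=2-00:00:00

# cmd.sh should have cmd_backend='local' so all stages run on this node
./run.sh --ngpu 1

# submit from the recipe directory with:
#   sbatch submit_run.sh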

Other option

Since the /tmp method depends on limited /tmp storage and requires copying the data every time, you can also copy your whole environment to /scratch/<your_node_name> and execute your jobs only on that node by setting --nodelist=<your_node_name>, as sketched below.
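
A rough sketch of that workflow (the /scratch path and node name are placeholders; adjust to the actual scratch location on your assigned node):

# copy your working tree to the node-local scratch space
rsync -av <espnet-root> /scratch/<your_node_name>/

# pin interactive or batch jobs to that node so they can see the local copy
srun --pty --nodelist=<your_node_name> --gres=gpu:1 --cpus-per-task=8 --mem=16G /bin/bash -l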

Notes for running multiple GPUs

When running jobs with multiple GPUs, you should submit them with arguments like --mem Xgb --cpus-per-task Y --gres gpu:ngpus. A simple rule of thumb is (an example follows below):

X = 16 * ngpus or 4 * ncpus
Y = 5 * ngpus
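
For example, a 2-GPU interactive job following this rule:

# ngpus = 2  ->  Y = 5 * 2 = 10 CPUs, X = 16 * 2 = 32 GB
srun --pty --gres=gpu:2 --cpus-per-task=10 --mem=32G /bin/bash -l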

Use slurm backend for ESPnet

Because of the I/O issue, we recommend submitting jobs with the local backend. However, if you want to use the slurm backend directly, you can use the following configuration to replace conf/slurm.conf:

# Default configuration
command sbatch --export=PATH
option name=* --job-name $0
option time=* --time $0
option exclude=* --exclude $0
option mem=* --mem-per-cpu $0
option mem=0
option num_threads=* --cpus-per-task $0
option num_threads=1 --cpus-per-task 1
option num_nodes=* --nodes $0
default gpu=0
default mem=6000
# option gpu=0 -p cpu
option gpu=0
option gpu=* --gres=gpu:$0 -c $0  # Recommended: allocate at least as many CPUs as GPUs

Since TIR does not have CPU-only machines, CPU jobs may block good GPUs instead of the worse ones. In that case, please try to exclude some of the nodes by setting:

    export train_cmd="slurm.pl --mem 4000 --time 1-0:00:00 --exclude tir-0-19,tir-1-23,tir-1-28,tir-1-11,tir-1-7,tir-0-28,tir-0-3,tir-0-36,tir-0-32,tir-1-13,tir-1-18,tir-0-11"
    export cuda_cmd="slurm.pl --mem 6000 --time 3-0:00:00 --exclude tir-0-19,tir-1-23,tir-1-28,tir-1-11,tir-1-7,tir-0-28,tir-0-3,tir-0-36,tir-0-32,tir-1-13,tir-1-18,tir-0-11"
    export decode_cmd="slurm.pl --mem 8000 --time 1-0:00:00 --exclude tir-0-19,tir-1-23,tir-1-28,tir-1-11,tir-1-7,tir-0-28,tir-0-3,tir-0-36,tir-0-32,tir-1-13,tir-1-18,tir-0-11"