Kaldi
Kaldi is an open-source toolkit for speech recognition, written in C++ and licensed under the Apache License v2.0, designed primarily for use by speech recognition researchers and professionals.[1] It provides a modular framework for building automatic speech recognition (ASR) systems, supporting advanced techniques such as finite-state transducers (FSTs) for modeling and decoding, as well as integration with neural networks and traditional acoustic models.[1] First publicly released on May 14, 2011, during the ICASSP conference in Prague, Kaldi has become a foundational tool in ASR research due to its emphasis on flexibility, efficiency, and reproducibility through complete recipes for standard datasets like those from the Linguistic Data Consortium (LDC).[2]
The project originated in 2009 from a workshop at Johns Hopkins University focused on low-cost, high-quality speech recognition, where initial development centered on subspace Gaussian mixture models (SGMMs) and lexicon learning, building partly on the Hidden Markov Model Toolkit (HTK).[2] A follow-up workshop in 2010 in Brno, Czech Republic, refined the toolkit into a general-purpose, clean, and releasable form, leading to its debut presentation.[2] Led by principal developer Daniel Povey, with significant contributions from researchers like Karel Veselý (neural network training) and Arnab Ghoshal (acoustic modeling), Kaldi has involved approximately 70 contributors who provided code, scripts, and patches.[2] Development has been supported by institutions including Microsoft Research and Johns Hopkins University and by funding from agencies like IARPA and the NSF, with ongoing evolution on a single "master" branch; formal versioned releases were introduced from 2017 to 2019 (versions 5.0 to 5.5), followed by continuous updates thereafter.[2][3] As of 2025, Kaldi remains actively maintained on GitHub and widely used in research and industry, with integrations such as NVIDIA GPU support enhancing its performance.[4][5]
Key features of Kaldi include deep code-level integration with the OpenFst library for FST operations, a custom matrix library that wraps BLAS and LAPACK for efficient linear algebra computations, and templated decoders that allow extensibility for various scoring sources, such as neural networks.[1] It prioritizes generic, provably correct algorithms and rigorous testing, enabling the creation of state-of-the-art systems while providing recipes for baseline setups on corpora like the Wall Street Journal, Resource Management, and Switchboard datasets.[1] Documentation is geared toward expert users, and the toolkit's design facilitates research in areas like acoustic modeling, language modeling, and pronunciation modeling, making it a staple in academic and industrial ASR advancements.[1]
Overview
Introduction
Kaldi is a C++-based open-source toolkit designed for automatic speech recognition (ASR) and related signal processing tasks, primarily targeted at researchers and professionals building advanced speech recognition systems.[1][6] Its architecture emphasizes flexibility for experimenting with cutting-edge ASR methods, leveraging finite-state transducers (FSTs) to achieve computational efficiency in modeling and decoding.[6][1] The toolkit's core objective is to facilitate rapid prototyping and integration of novel algorithms within a robust framework, supporting the development of state-of-the-art systems through modular components that handle diverse acoustic environments and languages.[1] Kaldi encompasses a complete end-to-end pipeline, from raw audio input through feature extraction, acoustic modeling, and language modeling to final transcription output, enabling comprehensive ASR workflows without reliance on proprietary software.[6][7] As of 2025, Kaldi remains actively maintained, with recent optimizations sustaining its relevance in ASR research amid the rise of end-to-end deep learning models by providing a reliable foundation for hybrid approaches and custom adaptations. Originating from efforts at Johns Hopkins University in 2009, it continues to support multi-platform deployment for academic and industrial applications.[6][4]
Licensing and Availability
Kaldi is released under the Apache License 2.0, which allows users to freely use, modify, and distribute the software, including for commercial purposes, provided that appropriate attribution is given and the license terms are preserved in any derivative works.[1][6] The primary repository for Kaldi is hosted on GitHub at https://github.com/kaldi-asr/kaldi, where the source code is maintained and updated continuously.[4][8] Mirrors of the repository are available on platforms such as SourceForge for alternative access. Kaldi does not use formal release tags; instead, it employs version branches, with the 5.x series representing stable development up to version 5.5 in February 2020, followed by ongoing updates on the master branch through 2025.[3][9] Kaldi is distributed exclusively as open-source code, requiring users to compile it from source on compatible platforms, with no pre-built binaries provided by the project.[4] The distribution includes extensive examples and recipes in the egs/ directory for tasks such as acoustic model training and speech recognition on common datasets.
Contributions to Kaldi are welcomed through pull requests on GitHub, typically starting with forking the repository and creating a feature branch for changes.[4] The project emphasizes maintaining backward compatibility in updates to ensure stability for existing users and scripts.[3]
History and Development
Origins and Early Development
Kaldi emerged from the 2009 summer workshop at Johns Hopkins University (JHU), titled "Low Development Cost, High Quality Speech Recognition for New Languages and Domains," which focused on scaling lattice-based methods for automatic speech recognition (ASR) using subspace Gaussian mixture models (SGMMs) and lexicon learning techniques.[2] The project was initiated to address the challenges of developing efficient ASR systems for resource-limited languages, with early implementations relying on the Hidden Markov Model Toolkit (HTK) for baseline functionality.[10] Led primarily by Daniel Povey, an associate research scientist at JHU, along with collaborators such as Arnab Ghoshal, Nagendra Goel, and participants from institutions including Brno University of Technology, the effort aimed to overcome the limitations of existing toolkits like HTK, which suffered from restrictive licensing, limited support for modern mathematical operations, and fragmented scripting.[6] A follow-up workshop in Brno, Czech Republic, in 2010 further refined the codebase, emphasizing the creation of a cohesive, independent toolkit free from HTK dependencies to foster broader community adoption.[2] Key motivations included integrating finite-state transducers (FSTs) for graph-based decoding from the outset, leveraging libraries like OpenFst, and supporting advanced acoustic modeling approaches such as discriminative training to improve efficiency and accuracy over prior systems.[6]
The toolkit's name draws from the Ethiopian legend of Kaldi, the goatherder who discovered the stimulating effects of coffee beans when his goats became unusually energetic after consuming them.[1] This choice reflects the project's intent to invigorate ASR research by providing a flexible, open-source alternative to legacy tools. Between 2009 and 2011, development prioritized a C++-based architecture with robust linear algebra support, culminating in the initial public release on May 14, 2011, under the Apache 2.0 license.[2]
Key Milestones and Releases
Kaldi's first public release occurred in 2011, coinciding with the presentation of the seminal paper "The Kaldi Speech Recognition Toolkit" by Daniel Povey and colleagues at the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (ASRU).[6] This initial version established Kaldi as an open-source C++ toolkit for speech recognition, emphasizing modular design and support for Gaussian mixture models (GMMs) and subspace Gaussian mixture models (SGMMs).[2] In 2014, Kaldi integrated deep neural network (DNN) support, marking a significant enhancement for acoustic modeling through contributions like Dan Povey's DNN implementation and recipes combining Kaldi with the PDNN library.[11][12] This transition enabled hybrid DNN-HMM systems, improving recognition accuracy on large-scale datasets and aligning Kaldi with emerging deep learning trends in speech processing.
A formal version numbering scheme was introduced in January 2017 with version 5.0.0, retroactively recognizing prior development; subsequent releases like 5.1 (February 2017) added features such as online decoding for LSTMs and variable chunk sizes in the nnet3 framework.[3] By version 5.2 (May 2017), support for convolutional components, dropout, and cross-platform builds—including Android via NDK—expanded Kaldi's applicability.[3][4] Late 2010s updates, particularly in versions 5.1 through 5.4 (2017–2018), incorporated nnet3 extensions for attention mechanisms and backstitching, facilitating end-to-end model training.[3] Version 5.5.636, released in February 2020, served as a stable milestone with over 600 patches, incorporating batched nnet3 computations, SpecAugment integration, and Python 3 compatibility.[3] Following this, Kaldi transitioned to greater community-driven maintenance, with Daniel Povey at Johns Hopkins University overseeing contributions from approximately 70 developers via GitHub.[2]
Around 2020–2022, development of "next-generation Kaldi" began as a parallel effort, replacing the OpenFst library with the faster k2 library for GPU-accelerated finite-state transducer operations, enabling more efficient training and decoding for modern neural architectures like conformers. This project, hosted under k2-fsa, continued to evolve alongside the original Kaldi, with updates in 2025 including support for new acoustic models and deployment on mobile platforms.[13][14] Ongoing activity in the main repository persisted through 2025, including commits for build fixes and optimizations, as evidenced by updates in July 2025 and acoustic model enhancements detailed in recent technical reports.[15][16]
Technical Architecture
Core Components
Kaldi's architecture is built around a modular design that facilitates the construction of automatic speech recognition (ASR) pipelines through a series of interconnected command-line tools organized into distinct stages, including data preparation, feature extraction, training, and decoding. This structure allows users to process audio data progressively, with each stage producing outputs that serve as inputs for the next, often piped together for efficiency. The toolkit's core is implemented in C++ and relies on external libraries such as OpenFst for finite-state transducer operations and BLAS/LAPACK for linear algebra computations, enabling flexible integration of various acoustic modeling approaches without requiring recompilation of the entire system.[6][17] At the heart of Kaldi's model integration lies the use of finite-state transducers (FSTs), which are essential for composing the acoustic model (H), pronunciation lexicon (L), and language model (G) into a single decoding graph. FSTs represent these components as weighted automata, allowing efficient operations like composition, determinization, and minimization to create compact, searchable structures for recognition. Kaldi leverages the OpenFst library to perform these tasks; for instance, determinization via the custom DeterminizeStar() algorithm removes epsilon transitions and handles non-determinism by creating intermediate states, while minimization reduces the graph size using OpenFst's algorithms with extensions for encoded weights. Key binaries such as fstcompose facilitate model integration by combining FSTs (e.g., lexicon with language model), and fstdeterminizestar ensures the resulting graph is deterministic for fast lattice generation during decoding.[6][18]
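The sketch below shows how these FST binaries are typically chained on the command line, in a simplified form of what utils/mkgraph.sh does internally (disambiguation-symbol handling and the later composition with the context-dependency and HMM transducers are omitted); the paths assume a standard data/lang directory, and fsttablecompose is Kaldi's table-driven variant of fstcompose.
  # Compose the lexicon FST with the grammar FST, then determinize and minimize.
  fsttablecompose data/lang/L.fst data/lang/G.fst | \
    fstdeterminizestar --use-log=true | \
    fstminimizeencoded > LG.fst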
Specific tools exemplify this modularity, including gmm-init-mono for initializing Gaussian mixture model (GMM) training with monophone systems, which performs a flat-start initialization of the foundational acoustic model parameters. These binaries are invoked through scripting layers, typically in Bash, to automate workflows, with Python options available for advanced customization. Kaldi's extensibility is further enhanced by recipe-based workflows housed in the egs/ (examples) directory, which provide templated scripts for common datasets like RM or WSJ, allowing users to adapt pipelines for new corpora or models by modifying high-level recipes without altering the underlying C++ code. This design promotes reproducibility and community contributions, as recipes encapsulate best practices for chaining tools across stages.[6][17]
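As an example of invoking one of these binaries directly, the call below mirrors the monophone initialization performed inside steps/train_mono.sh; the feature dimension of 39 (MFCCs plus deltas and delta-deltas) and the output paths are assumptions for illustration rather than fixed conventions.
  # Flat-start initialization of a monophone GMM system and its trivial decision tree.
  gmm-init-mono --shared-phones=data/lang/phones/sets.int \
    data/lang/topo 39 exp/mono/0.mdl exp/mono/tree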
Feature Extraction and Acoustic Modeling
Kaldi's feature extraction processes raw audio waveforms into compact representations suitable for acoustic modeling. It generates cepstral coefficients such as Mel-frequency cepstral coefficients (MFCCs), perceptual linear prediction (PLP) features, and filterbank (FBank) energies. MFCCs are computed by framing the audio (typically 25 ms frames with 10 ms shifts), applying a Hamming window, performing a fast Fourier transform, extracting mel-scale filterbank energies, taking logarithms, and applying a discrete cosine transform to yield 13 coefficients per frame, with options for liftering and mean subtraction for normalization.[19] PLP features follow a similar pipeline but incorporate perceptual weighting and linear prediction to better approximate human auditory perception.[19] FBank features provide raw log-energies from mel-scale filters (e.g., 23-40 bins), serving as an intermediate step or direct input for neural models.[19]
To enhance robustness and handle variability, Kaldi applies linear transformations during feature extraction. Linear discriminant analysis (LDA) reduces dimensionality while maximizing class separability across concatenated frames, often combined with speaker-independent transforms like MLLT (maximum likelihood linear transform).[20][6] For speaker adaptation, feature-space maximum likelihood linear regression (fMLLR) estimates utterance-specific affine transforms applied to features, improving generalization without retraining the model.[1][20] Additionally, iVectors capture low-dimensional speaker-specific embeddings from Gaussian posteriors, enabling on-the-fly adaptation by appending them to input features, particularly in online decoding scenarios.[21]
Kaldi employs Gaussian mixture models (GMMs) as a baseline for acoustic modeling, where each state in a hidden Markov model (HMM) is modeled by a mixture of Gaussians. The likelihood of an observation vector \mathbf{o}_t given mixture m is computed as p(\mathbf{o}_t | m) = \sum_{i=1}^N c_{mi} \mathcal{N}(\mathbf{o}_t; \mu_{mi}, \Sigma_{mi}), with c_{mi} as mixing weights, \mu_{mi} as means, and \Sigma_{mi} as covariances (typically diagonal for efficiency).[6]
Beyond maximum likelihood estimation, Kaldi supports advanced discriminative training methods to refine models. Maximum mutual information (MMI) training optimizes the objective \log \frac{P(\mathcal{O}|M_\lambda)^{k} P(M_\lambda)}{\sum_{m'} P(\mathcal{O}|M_{m'})^{k} P(M_{m'})}, where \mathcal{O} is the observation sequence, M_\lambda the reference transcription, k an acoustic scale (typically well below 1), and the denominator sums over competing hypotheses; in practice the denominator is approximated using lattices.[1][22] Boosted MMI extends this by boosting the likelihoods of competing paths in the denominator in proportion to their error, controlled by a boosting factor, which enhances discrimination.[1] Minimum classification error (MCE) training minimizes misclassification risk via a smoothed approximation of the classification error.[1]
For modern acoustic modeling, Kaldi integrates neural networks through the nnet3 toolkit, supporting feed-forward deep neural networks (DNNs) that produce posteriors over context-dependent HMM states, long short-term memory (LSTM) recurrent networks for sequential dependencies, and time-delay neural networks (TDNNs) for efficient temporal modeling with spliced inputs.
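As a concrete illustration of the front-end portion of this pipeline, the sketch below chains the low-level feature binaries that the steps/ scripts wrap: it computes MFCCs, applies per-speaker cepstral mean and variance normalization, splices neighboring frames, and applies a previously estimated LDA+MLLT transform. The archive names and the exp/tri2b/final.mat path are illustrative assumptions, not fixed conventions.
  # Compute MFCCs, apply per-speaker CMVN, splice +/-3 frames,
  # and project with an LDA+MLLT matrix estimated in an earlier stage.
  compute-mfcc-feats scp:data/train/wav.scp ark:- | \
    apply-cmvn --utt2spk=ark:data/train/utt2spk ark:cmvn.ark ark:- ark:- | \
    splice-feats --left-context=3 --right-context=3 ark:- ark:- | \
    transform-feats exp/tri2b/final.mat ark:- ark:feats.ark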
Within nnet3, chain models use sequence-level objectives to train hybrid DNN-HMM systems, achieving improved word error rates and faster decoding compared to conventional approaches.[23][1][24] These neural models replace or hybridize GMM-HMM systems, with nnet3 enabling flexible architectures trained via stochastic gradient descent on GPUs. Feature-space transforms such as fMLLR can be applied to neural-network inputs much as they are to GMM features, while iVectors provide speaker conditioning.[1][24]
Usage and Implementation
Installation and Platforms
Kaldi is primarily supported on Unix-like systems, including various distributions of Linux such as Debian and Red Hat, macOS via the Darwin kernel, and BSD variants.[25][26] For Windows environments, compatibility is achieved through Cygwin or the Windows Subsystem for Linux (WSL), which provides a Unix-like interface.[27][26] Additionally, Kaldi supports cross-compilation for embedded platforms, such as Android using the Android Native Development Kit (NDK) with Clang++ and OpenBLAS, and WebAssembly via Emscripten for browser-based execution.[4][28] The toolkit requires several core dependencies for compilation and operation. Essential libraries include a BLAS implementation for linear algebra operations, such as ATLAS (preferred for its performance) or OpenBLAS as an alternative, and OpenFst for weighted finite-state transducers used in recognition modeling.[26] Other prerequisites encompass standard Unix tools like g++, make, Git, wget, and zlib, along with optional components such as CUDA for GPU-accelerated neural network training.[26] These dependencies are typically installed via system package managers (e.g., apt on Debian-based systems) or Kaldi's provided scripts in the tools directory.[26]
Installation begins by cloning the official repository from GitHub using git clone https://github.com/kaldi-asr/kaldi.git, followed by navigating to the cloned directory.[29] In the tools/ subdirectory, run extras/check_dependencies.sh to verify and install prerequisites like OpenFst and BLAS if needed, then execute make to build the tools (approximately 10-20 minutes).[30] Next, move to the src/ directory, configure with ./configure --shared (adding --use-cuda=yes for GPU support if applicable), followed by make depend -j and make -j to compile the core library and binaries (20-40 minutes on modern multi-core hardware).[31] Finally, run make check to verify the build integrity.[31] Depending on system resources and dependency availability, the full build can range from well under an hour on modern multi-core hardware to several hours on slower systems.[32][30]
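The sequence described above can be summarized in shell form as follows; the -j values are illustrative and should match the number of available CPU cores, and --use-cuda=yes should only be added when a supported CUDA toolkit is installed.
  git clone https://github.com/kaldi-asr/kaldi.git
  cd kaldi/tools
  extras/check_dependencies.sh   # report any missing prerequisites
  make -j 4                      # build OpenFst and other third-party tools
  cd ../src
  ./configure --shared           # add --use-cuda=yes for GPU support
  make depend -j 4
  make -j 4                      # compile libraries and binaries
  make check                     # verify the build, as noted above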
Common installation challenges include errors related to missing BLAS libraries, often resolved by installing ATLAS via sudo apt-get install libatlas-base-dev on Debian-based systems or equivalent packages on other distributions, or by using Kaldi's extras/install_openblas.sh script.[26][33] For environments with complex dependency management, Docker containers are recommended, particularly NVIDIA's GPU Cloud (NGC) images, which provide a pre-configured Kaldi environment with CUDA support as of 2025.[34] These containers simplify setup on supported hosts with NVIDIA Docker runtime, avoiding manual compilation issues.
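A typical way to use the NGC container is shown below; the image tag is a placeholder to be replaced with a concrete release, and the NVIDIA container runtime is assumed to be installed on the host.
  docker pull nvcr.io/nvidia/kaldi:<xx.xx>-py3
  docker run --gpus all -it --rm nvcr.io/nvidia/kaldi:<xx.xx>-py3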
Training and Decoding Workflows
Kaldi's training and decoding workflows are designed as modular pipelines, typically orchestrated through shell scripts and Python programs in the egs/ (example recipes) directories of the toolkit. These workflows begin with data preparation to organize audio corpora, transcripts, and linguistic resources, followed by a multi-stage acoustic model training process that progresses from basic Gaussian Mixture Model (GMM) systems to advanced deep neural network (DNN) models, and culminate in decoding for inference on new audio.[35][36]
Data Preparation
Data preparation in Kaldi involves formatting acoustic and linguistic resources into standardized directory structures. For acoustic data, key files include wav.scp mapping utterance IDs to audio file paths (e.g., NIST SPHERE or WAV formats), text containing transcripts aligned to utterances, and utt2spk linking utterances to speaker IDs; additional files like spk2utt (speaker to utterances) and segments (for time offsets in long recordings) handle speaker variability and segmentation.[35] Linguistic preparation starts with a pronunciation lexicon in lexicon.txt (word-to-phone mappings) and phone lists in data/local/dict/, followed by the utils/prepare_lang.sh script to build the language directory (data/lang/) with finite-state transducers (FSTs) for phones (L.fst) and a word symbol table (words.txt), incorporating an out-of-vocabulary (OOV) symbol like <UNK>.[35]
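A minimal data directory for a single utterance could be assembled by hand as in the sketch below; real recipes generate these files from corpus metadata, and the utterance ID, speaker ID, audio path, and transcript here are made up for illustration.
  mkdir -p data/train
  echo "utt001 /corpus/audio/utt001.wav" > data/train/wav.scp   # utterance ID -> audio path
  echo "utt001 HELLO WORLD" > data/train/text                   # utterance ID -> transcript
  echo "utt001 spk01" > data/train/utt2spk                      # utterance ID -> speaker ID
  utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt
  utils/validate_data_dir.sh --no-feats data/train              # sanity-check the directory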
For robustness, recipes apply perturbations such as speed perturbation (altering audio playback rate by factors like 0.9 or 1.1) to augment training data, reducing overfitting to specific speaking rates; this is common in corpora like Wall Street Journal (WSJ) or Switchboard, where scripts in egs/wsj/s5/local/ or egs/swbd/s5/ generate augmented segments and alignments, as shown in the sketch after this paragraph.[35] Transcripts are normalized to a consistent case and stripped of punctuation, with initial alignments generated by tools such as align-equal-compiled, which produces evenly spaced alignments to bootstrap GMM training. For example, WSJ runs local/wsj_data_prep.sh on the corpus directories to create the data directories, while Switchboard uses local/swbd1_data_prep.sh to process sphere files into WAVs and handle conversational transcripts.[35]
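The standard three-way speed perturbation can be applied to a prepared data directory with a single utility call, as in this sketch; the destination directory name is arbitrary.
  # Create 0.9x, 1.0x, and 1.1x copies of the training data in one augmented directory.
  utils/data/perturb_data_dir_speed_3way.sh data/train data/train_sp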
Training Pipeline
The training pipeline is multi-stage, starting with GMM-HMM models for initial alignments before fine-tuning DNNs. Feature extraction precedes modeling, using steps/make_mfcc.sh to compute Mel-frequency cepstral coefficients (MFCCs) from WAV files (e.g., 13-dimensional with deltas, at 10ms frame shift), followed by cepstral mean and variance normalization (steps/compute_cmvn_stats.sh).[35]
GMM training begins with monophone models via steps/train_mono.sh (e.g., --nj 4 --cmd "run.pl" data/train data/lang exp/mono), training context-independent HMMs starting with a small number of Gaussians (e.g., 100-1000 total across the model) on a subset like the shortest 500 utterances for efficiency. This is followed by triphone models (steps/train_deltas.sh for delta features, then steps/train_lda_mllt.sh for linear discriminant analysis and maximum likelihood linear transform), increasing Gaussian counts to 2000-15000 and using speaker-adaptive training (SAT) with feature-space maximum likelihood linear regression (fMLLR). Alignments from each stage (steps/align_si.sh or steps/align_fmllr.sh) provide supervision for the next.[6][36]
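A condensed sketch of this progression, adapted from typical s5 recipes, is shown below; the job counts, leaf counts, and Gaussian totals are illustrative rather than recommended values.
  steps/make_mfcc.sh --nj 20 --cmd run.pl data/train exp/make_mfcc/train mfcc
  steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train mfcc
  steps/train_mono.sh --nj 20 --cmd run.pl data/train data/lang exp/mono
  steps/align_si.sh --nj 20 --cmd run.pl data/train data/lang exp/mono exp/mono_ali
  steps/train_deltas.sh --cmd run.pl 2000 10000 data/train data/lang exp/mono_ali exp/tri1
  steps/align_si.sh --nj 20 --cmd run.pl data/train data/lang exp/tri1 exp/tri1_ali
  steps/train_lda_mllt.sh --cmd run.pl 2500 15000 data/train data/lang exp/tri1_ali exp/tri2b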
DNN training uses these alignments, transitioning to neural acoustic models in steps/nnet3/chain/ for chain models. A representative entry point is steps/nnet3/chain/train.py (invoked from recipe scripts such as local/chain/run_tdnn_1a.sh), which trains time-delay neural networks (TDNNs) with the lattice-free maximum mutual information (LF-MMI) objective over speed-perturbed data, incorporating i-vectors for speaker adaptation; typical parameters include 4-6 epochs, minibatch sizes of 128-512 chunks (roughly 10,000-50,000 frames depending on chunk length and subsampling factor), and learning rates decaying from 0.002 to 0.00025.[36] This stage requires high-resolution features (40-dimensional MFCCs plus i-vectors) and a new decision tree built for the chain topology (steps/nnet3/chain/build_tree.sh) for context-dependent tying.[36]
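An illustrative call to the chain training driver follows; the flag names are taken from steps/nnet3/chain/train.py as used in recipe scripts, while the specific values and directory names are examples rather than canonical settings, and the network configuration under exp/chain/tdnn1a/configs is assumed to have been generated beforehand (e.g., with steps/nnet3/xconfig_to_configs.py).
  steps/nnet3/chain/train.py --cmd run.pl \
    --trainer.num-epochs 4 \
    --trainer.optimization.initial-effective-lrate 0.002 \
    --trainer.optimization.final-effective-lrate 0.0002 \
    --egs.chunk-width 140,100,160 \
    --feat.online-ivector-dir exp/nnet3/ivectors_train_sp \
    --feat-dir data/train_sp_hires \
    --tree-dir exp/chain/tree \
    --lat-dir exp/chain/tri3b_train_sp_lats \
    --dir exp/chain/tdnn1a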
Decoding
Decoding employs lattice-based search with weighted finite-state transducers (WFSTs) for efficient hypothesis generation. The decoding graph is built using utils/mkgraph.sh data/lang exp/tri3b exp/tri3b/graph, composing the HMM topology (H), context dependency (C), lexicon (L.fst), and language model (G.fst) into HCLG.fst, with options controlling transition and self-loop scaling.[37] The steps/decode.sh script invokes decoders like LatticeFasterDecoder, performing Viterbi beam search (beam ~10-20) to generate word lattices representing alternative paths within a lattice-beam (e.g., 8) of the best-scoring hypothesis; determinization via lattice-determinize ensures one path per word sequence.[38][37]
Output formats include one-best text transcripts, lattices in a compressed binary format (.gz), and word-level timings in CTM (time-marked conversation) files produced by post-processing tools such as lattice-to-ctm-conf. For real-time applications, Kaldi's online decoding in online2/ processes streaming audio in small chunks with online2-wav-nnet3-latgen-faster, using i-vector-based speaker adaptation and moving-window cepstral mean normalization (e.g., a 600-frame window); the older GMM-based online decoder instead re-estimates fMLLR transforms at intervals of a few seconds. This supports low-latency inference on DNN models without full-utterance buffering.[21][38]
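A decoding pass for a speaker-adaptively trained GMM system might look like the sketch below, where exp/tri3b, data/lang_test, and data/test are illustrative directory names; steps/decode_fmllr.sh performs the fMLLR-adapted decoding appropriate for SAT models, and the final pipeline prints human-readable one-best transcripts.
  utils/mkgraph.sh data/lang_test exp/tri3b exp/tri3b/graph
  steps/decode_fmllr.sh --nj 20 --cmd run.pl \
    exp/tri3b/graph data/test exp/tri3b/decode_test
  # Convert the best path of each lattice to words using the graph's symbol table.
  lattice-best-path --acoustic-scale=0.1 \
    "ark:gunzip -c exp/tri3b/decode_test/lat.*.gz |" ark,t:- | \
    utils/int2sym.pl -f 2- exp/tri3b/graph/words.txt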
Example Recipe: LibriSpeech Dataset
The LibriSpeech recipe (egs/librispeech/s5/run.sh) exemplifies a complete workflow on the 960-hour English audiobook corpus. The early stages download and untar audio and transcripts, run local/data_prep.sh to generate wav.scp, text, and utt2spk for the training subsets and the dev_clean and test_clean sets, then run utils/prepare_lang.sh with a provided lexicon (data/local/dict/) to create data/lang/; pre-built n-gram language models (including a 4-gram) are downloaded and formatted by local/download_lm.sh and local/format_lms.sh.[39]
Feature extraction uses steps/make_mfcc.sh --nj 70, with speed perturbation (utils/data/perturb_data_dir_speed_3way.sh at factors 0.9/1.0/1.1) producing an augmented training set for robustness. GMM training proceeds through monophone (steps/train_mono.sh --nj 30), tri1 (steps/train_deltas.sh 4500 35000), tri2b (steps/train_lda_mllt.sh), and tri3b (steps/train_sat.sh) stages, with alignments generated at each step (e.g., steps/align_fmllr.sh). DNN training then trains an i-vector extractor and extracts online i-vectors (steps/online/nnet2/train_ivector_extractor.sh, steps/online/nnet2/extract_ivectors_online.sh), builds a chain tree (steps/nnet3/chain/build_tree.sh), and trains the TDNN via steps/nnet3/chain/train.py --stage -7 --cmd "run.pl" --trainer.num-epochs 4 ... --dir exp/chain/tdnn_1a_sp on 4 GPUs, using LF-MMI with 3x frame subsampling.[39][36]
Decoding builds graphs (utils/mkgraph.sh --self-loop-scale 1.0 data/lang_test_tgmed exp/chain/tdnn_1a_sp exp/chain/tdnn_1a_sp/graph_tgmed), then runs steps/nnet3/decode.sh --nj 30 --cmd "run.pl" --acwt 1.0 --post-decode-acwt 10.0 --scoring-opts "--min-lmwt 5 --max-lmwt 15" exp/chain/tdnn_1a_sp/graph_tgmed data/dev_clean exp/chain/tdnn_1a_sp/decode_dev_clean_tgmed, producing lattices and scoring WER via the recipe's scoring scripts (local/score.sh), with word error rates of roughly 3-4% on test_clean for strong chain systems. Real-time decoding can adapt this setup by swapping in the online2 binaries.[39][21]
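In practice the recipe is launched from its s5 directory as shown below; many recipes parse a --stage option through utils/parse_options.sh so that an interrupted run can be resumed, and otherwise the stage variable near the top of run.sh can be edited directly.
  cd egs/librispeech/s5
  ./run.sh              # run the full pipeline end to end
  ./run.sh --stage 10   # resume from a later stage, if the script parses --stage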