# GigaAM
Foundational Model for Speech Recognition Tasks
GigaAM: the family of open-source acoustic models for speech processing
## Latest News
- 2025/11 — GigaAM-v3: 30% WER reduction on new data domains; GigaAM-v3-e2e: end-to-end transcription support (70:30 win in Side-by-Side vs Whisper-large-v3)
- 2025/06 — Our research paper on GigaAM was accepted to InterSpeech 2025!
- 2024/12 — MIT License, GigaAM-v2 (15% and 12% WER reduction for the CTC and RNN-T models, respectively), ONNX export support
- 2024/05 — GigaAM-RNNT (19% WER reduction), long-form inference using external Voice Activity Detection
- 2024/04 — GigaAM Release: GigaAM-CTC (SoTA Speech Recognition model for the Russian language), GigaAM-Emo
## Setup

### Requirements
- Python ≥ 3.10
- ffmpeg installed and added to your system's PATH
### Install the GigaAM Package
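The install commands appear to have been dropped here; a typical setup from source, assuming the repository lives at `salute-developers/GigaAM` on GitHub (the URL is an assumption):

```shell
# Assumption: repository location; adjust the URL if it differs.
git clone https://github.com/salute-developers/GigaAM.git
cd GigaAM
pip install -e .
```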
## GigaAM overview

GigaAM is a Conformer-based foundational model (220–240M parameters) pre-trained on diverse Russian speech data. It serves as the backbone for the entire GigaAM family, enabling state-of-the-art fine-tuned performance in speech recognition and emotion recognition. More information about GigaAM-v1 can be found in our post on Habr. We fine-tuned the GigaAM encoder for ASR using CTC and RNN-T decoders. The GigaAM family includes three lines of models:
| Line | Pretrain Method | Pretrain (hours) | ASR (hours) | Available Versions |
|---|---|---|---|---|
| v1 | Wav2vec 2.0 | 50,000 | 2,000 | , , , |
| v2 | HuBERT–CTC | 50,000 | 2,000 | , , |
| v3 | HuBERT–CTC | 700,000 | 4,000 | , , , , |
Some versions additionally support punctuation and text normalization.
## Model Performance

GigaAM-v3 training incorporates new internal datasets: call center, music, speech with atypical characteristics, and voice messages. As a result, the models perform on average 30% better on these new domains while maintaining the same quality on public benchmarks. In end-to-end ASR comparisons against Whisper (judged via an independent LLM-as-a-Judge side-by-side), GigaAM models win by an average margin of 70:30. Our emotion recognition model outperforms existing models by 15% Macro F1-Score.
For detailed results, see here.
## Usage

### Model inference
Note: short-form ASR is applicable only to audio up to 25 seconds. To enable long-form transcription, install the additional pyannote.audio dependencies.
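A minimal short-form inference sketch, assuming the package exposes a `gigaam.load_model` loader and that the returned model has a `transcribe` method; the version name `"v2_ctc"` and both call signatures are assumptions, so check them against the table of available versions:

```python
import importlib

# Assumptions: `gigaam.load_model(version)` returns a model with a
# `transcribe(path)` method; the version name below is illustrative.
def transcribe_file(path: str, version: str = "v2_ctc") -> str:
    gigaam = importlib.import_module("gigaam")
    model = gigaam.load_model(version)
    return model.transcribe(path)

# transcribe_file("example.wav")  # requires the gigaam package and a <=25 s audio file
```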
### Longform setup instruction
- Generate Hugging Face API token
- Accept the conditions to access pyannote/segmentation-3.0 files and content
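To see why external VAD enables long-form inference, here is a self-contained sketch (a hypothetical helper, not part of the GigaAM API) of packing VAD speech segments into chunks that each fit the 25-second short-form limit, assuming the VAD output is a list of `(start, end)` pairs in seconds:

```python
# Hypothetical helper, not part of the GigaAM API: pack VAD speech
# segments into chunks no longer than 25 s, so each chunk can be
# transcribed with the short-form ASR path.
MAX_CHUNK_SEC = 25.0

def pack_segments(segments, max_len=MAX_CHUNK_SEC):
    chunks, current = [], None
    for start, end in segments:
        if current is None:
            current = [start, end]
        elif end - current[0] <= max_len:
            current[1] = end  # extend the current chunk to cover this segment
        else:
            chunks.append(tuple(current))
            current = [start, end]
    if current is not None:
        chunks.append(tuple(current))
    return chunks

print(pack_segments([(0.0, 10.0), (12.0, 24.0), (26.0, 40.0)]))
```

A real implementation would also split any single segment longer than the limit.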
### Loading from Hugging Face
Note: Install requirements from the example.
### ONNX Export and Inference

Note: GPU support can be enabled with `pip install onnxruntime-gpu==1.23.*`, if applicable.

- Export the model to ONNX using the `model.to_onnx` method.
- Run ONNX inference.
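A sketch of the inference step with `onnxruntime`, assuming the export above produced a single `.onnx` file; the exported graph's tensor names are model-specific, so the sketch queries the session rather than hard-coding them:

```python
# Assumptions: `model.to_onnx` takes an output path (per the note above)
# and the result is loadable with onnxruntime.
def run_onnx(onnx_path, feeds):
    import onnxruntime as ort
    sess = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    for i in sess.get_inputs():
        # Inspect what the exported graph expects before building `feeds`.
        print("input:", i.name, i.shape, i.type)
    return sess.run(None, feeds)

# Step 1 (assumed signature): model.to_onnx("gigaam.onnx")
# Step 2: outputs = run_onnx("gigaam.onnx", {"<input_name>": features})
```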
These examples, along with more advanced ones (e.g. custom audio loading, batching), can be found in the Colab notebook.
## Citation
If you use GigaAM in your research, please cite our paper:
## Links
- [arxiv] GigaAM: Efficient Self-Supervised Learner for Speech Recognition
- [habr] GigaAM: a class of open models for processing spoken speech
- [youtube] How to teach an LLM to hear: GigaAM 🤝 GigaChat Audio
- [youtube] GigaAM: a family of acoustic models for the Russian language
- [youtube] Speech-only Pre-training: training a universal audio encoder