# GigaAM
GigaAM: the family of open-source acoustic models for speech processing
## Latest News
- 2024/12 — MIT License, GigaAM-v2 (-15% and -12% WER Reduction for CTC and RNN-T models, respectively), ONNX export support
- 2024/05 — GigaAM-RNNT (-19% WER Reduction), long-form inference using external Voice Activity Detection
- 2024/04 — GigaAM Release: GigaAM-CTC (SoTA Speech Recognition model for the Russian language), GigaAM-Emo
## Table of Contents
- Overview
- Installation
- GigaAM: The Foundational Model
- GigaAM for Speech Recognition
- GigaAM-Emo: Emotion Recognition
- License
- Links
## Overview
GigaAM (Giga Acoustic Model) is a family of open-source models for Russian speech processing tasks, including speech recognition and emotion recognition. The models are built on top of the Conformer architecture and leverage self-supervised learning (wav2vec2-based for GigaAM-v1 and HuBERT-based for GigaAM-v2).
GigaAM models are state-of-the-art open-source solutions for their respective tasks in the Russian language.
This repository includes:
- GigaAM: A foundational self-supervised model pre-trained on massive Russian speech datasets.
- GigaAM-CTC and GigaAM-RNNT: Fine-tuned models for automatic speech recognition (ASR).
- GigaAM-Emo: A fine-tuned model for emotion recognition.
## Installation

### Requirements
- Python ≥ 3.8
- ffmpeg installed and added to your system's PATH
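Before installing, you can sanity-check both requirements. The snippet below is an illustrative check, not part of the GigaAM package:

```python
# Illustrative pre-install check: confirm the Python version and that
# ffmpeg is discoverable on PATH (both are GigaAM requirements).
import shutil
import sys

assert sys.version_info >= (3, 8), "GigaAM requires Python >= 3.8"
assert shutil.which("ffmpeg"), "ffmpeg was not found on PATH"
print("Requirements check passed")
```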
### Install the GigaAM Package
1. Clone the repository:

   ```bash
   git clone https://github.com/salute-developers/GigaAM.git
   cd GigaAM
   ```

2. Install the package in editable mode:

   ```bash
   pip install -e .
   ```

3. Verify the installation:

   ```python
   import gigaam

   model = gigaam.load_model("ctc")
   print(model)
   ```
## GigaAM: The Foundational Model
GigaAM is a Conformer-based foundational model (240M parameters) pre-trained on 50,000+ hours of diverse Russian speech data.
It serves as the backbone for the entire GigaAM family, enabling state-of-the-art fine-tuned performance in speech recognition and emotion recognition.
There are two available versions:

- GigaAM-v1 was trained with a wav2vec2-like approach and can be used by loading the `v1_ssl` model version.
- GigaAM-v2 was trained with a HuBERT-like approach, which yields fine-tuned ASR models of better quality. It can be used by loading the `v2_ssl` or `ssl` model version.
More information about GigaAM-v1 can be found in our post on Habr.
### GigaAM Usage Example

```python
import gigaam

audio_path = "example.wav"  # path to your audio file
model = gigaam.load_model("ssl")  # Options: "ssl", "v1_ssl"
embedding, _ = model.embed_audio(audio_path)
```
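A natural use of these embeddings is comparing utterances. The sketch below is illustrative and assumes `embed_audio` returns a PyTorch tensor whose second-to-last axis is time; verify the actual return shape before relying on it. The file names are placeholders.

```python
# Hedged sketch: compare two utterances via mean-pooled SSL embeddings.
# Assumes embed_audio returns a torch tensor shaped (..., time, features).
import torch

import gigaam

model = gigaam.load_model("ssl")
emb_a, _ = model.embed_audio("first.wav")   # placeholder file names
emb_b, _ = model.embed_audio("second.wav")

# Mean-pool over the time axis to get fixed-size utterance vectors.
vec_a = emb_a.mean(dim=-2).flatten()
vec_b = emb_b.mean(dim=-2).flatten()

similarity = torch.nn.functional.cosine_similarity(vec_a, vec_b, dim=0)
print(f"cosine similarity: {similarity.item():.3f}")
```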
## GigaAM for Speech Recognition
We fine-tuned the GigaAM encoder for ASR using two different architectures:
- GigaAM-CTC was fine-tuned with Connectionist Temporal Classification and a character-based tokenizer.
- GigaAM-RNNT was fine-tuned with RNN Transducer loss.
Fine-tuning was done for both GigaAM-v1 and GigaAM-v2 SSL models, so there are 4 ASR models: `v1` and `v2` versions for both CTC and RNNT.
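To make the CTC side concrete: a CTC model emits one label (or a blank) per frame, and greedy decoding collapses repeated labels and drops blanks. The sketch below uses a toy vocabulary, not GigaAM's actual tokenizer:

```python
# Toy greedy CTC decoding: collapse repeated frame labels, then drop blanks.
# BLANK and VOCAB are illustrative, not GigaAM's character tokenizer.
BLANK = 0
VOCAB = {1: "п", 2: "р", 3: "и", 4: "в", 5: "е", 6: "т"}

def ctc_greedy_decode(frame_ids):
    out, prev = [], None
    for idx in frame_ids:
        if idx != prev and idx != BLANK:
            out.append(VOCAB[idx])
        prev = idx
    return "".join(out)

# Per-frame argmax ids: repeats and blanks disappear in the output.
print(ctc_greedy_decode([1, 1, 0, 2, 3, 3, 0, 4, 5, 0, 6]))  # -> привет
```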
### Training Data
The models were trained on publicly available Russian datasets:
| Dataset | Size (hours) | Weight |
|---|---|---|
| Golos | 1227 | 0.6 |
| SOVA | 369 | 0.2 |
| Russian Common Voice | 207 | 0.1 |
| Russian LibriSpeech | 93 | 0.1 |
### Performance Metrics (Word Error Rate)
| Model | Parameters | Golos Crowd | Golos Farfield | OpenSTT YouTube | OpenSTT Phone Calls | OpenSTT Audiobooks | Mozilla Common Voice 12 | Mozilla Common Voice 19 | Russian LibriSpeech |
|---|---|---|---|---|---|---|---|---|---|
| Whisper-large-v3 | 1.5B | 13.9 | 16.6 | 18.0 | 28.0 | 14.4 | 5.7 | 5.5 | 9.5 |
| NVIDIA FastConformer | 115M | 2.2 | 6.6 | 21.2 | 30.0 | 13.9 | 2.7 | 5.7 | 11.3 |
| GigaAM-CTC-v1 | 242M | 3.0 | 5.7 | 16.0 | 23.2 | 12.5 | 2.0 | 10.5 | 7.5 |
| GigaAM-RNNT-v1 | 243M | 2.3 | 5.0 | 14.0 | 21.7 | 11.7 | 1.9 | 9.9 | 7.7 |
| GigaAM-CTC-v2 | 242M | 2.5 | 4.3 | 14.1 | 21.1 | 10.7 | 2.1 | 3.1 | 5.5 |
| GigaAM-RNNT-v2 | 243M | 2.2 | 3.9 | 13.3 | 20.0 | 10.2 | 1.8 | 2.7 | 5.5 |
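For context on the numbers above, word error rate is the word-level edit distance (substitutions, insertions, deletions) between hypothesis and reference, divided by the number of reference words. A minimal reference implementation:

```python
# Minimal WER: Levenshtein distance over words, normalized by reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = int(ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(dp[i - 1][j - 1] + cost, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)

print(wer("привет мир", "привет весь мир"))  # one insertion / two words = 0.5
```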
### Speech Recognition Example (GigaAM-ASR)

#### Basic usage: short audio transcription (up to 30 seconds)

```python
import gigaam

audio_path = "example.wav"  # path to your audio file
model_name = "rnnt"  # Options: "v2_ctc" or "ctc", "v2_rnnt" or "rnnt", "v1_ctc", "v1_rnnt"
model = gigaam.load_model(model_name)
transcription = model.transcribe(audio_path)
```
#### Long-form audio transcription

1. Install the external VAD dependencies (the pyannote.audio library):

   ```bash
   pip install gigaam[longform]
   ```

2. Generate a Hugging Face API token.
3. Accept the conditions to access the pyannote/voice-activity-detection files and content.
4. Accept the conditions to access the pyannote/segmentation files and content.
5. Use the `model.transcribe_longform` method:

   ```python
   import os

   import gigaam

   os.environ["HF_TOKEN"] = "<HF_TOKEN>"

   model = gigaam.load_model("ctc")
   recognition_result = model.transcribe_longform("long_example.wav")

   for utterance in recognition_result:
       transcription = utterance["transcription"]
       start, end = utterance["boundaries"]
       print(f"[{gigaam.format_time(start)} - {gigaam.format_time(end)}]: {transcription}")
   ```
#### ONNX inference example

1. Export the model to ONNX using the `model.to_onnx` method:

   ```python
   import gigaam

   onnx_dir = "onnx"
   model_type = "rnnt"  # or "ctc"

   model = gigaam.load_model(
       model_type,
       fp16_encoder=False,  # only fp32 tensors
       use_flash=False,  # disable flash attention
   )
   model.to_onnx(dir_path=onnx_dir)
   ```

2. Run ONNX inference:

   ```python
   from gigaam.onnx_utils import load_onnx_sessions, transcribe_sample

   sessions = load_onnx_sessions(onnx_dir, model_type)
   transcribe_sample("example.wav", model_type, sessions)
   ```
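To poke at the exported files directly, a generic onnxruntime session works; the file name below is a placeholder, so list the contents of `onnx_dir` for the actual names the export produces:

```python
# Hedged sketch: inspect an exported ONNX file with onnxruntime directly.
# "onnx/encoder.onnx" is a placeholder name; check the export directory.
import onnxruntime as ort

session = ort.InferenceSession("onnx/encoder.onnx", providers=["CPUExecutionProvider"])
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)
```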
All these examples can also be found in the inference_example.ipynb notebook.
## GigaAM-Emo: Emotion Recognition
GigaAM-Emo is a fine-tuned model for emotion recognition trained on the Dusha dataset. It significantly outperforms existing models on several metrics.
### Performance Metrics
| Model | Crowd: Unweighted Accuracy | Crowd: Weighted Accuracy | Crowd: Macro F1-score | Podcast: Unweighted Accuracy | Podcast: Weighted Accuracy | Podcast: Macro F1-score |
|---|---|---|---|---|---|---|
| DUSHA baseline (MobileNetV2 + Self-Attention) | 0.83 | 0.76 | 0.77 | 0.89 | 0.53 | 0.54 |
| АБК (TIM-Net) | 0.84 | 0.77 | 0.78 | 0.90 | 0.50 | 0.55 |
| GigaAM-Emo | 0.90 | 0.87 | 0.84 | 0.90 | 0.76 | 0.67 |
### Emotion Recognition Example (GigaAM-Emo)

```python
from typing import Dict

import gigaam

model = gigaam.load_model("emo")
emotion2prob: Dict[str, float] = model.get_probs("example.wav")
print(", ".join(f"{emotion}: {prob:.3f}" for emotion, prob in emotion2prob.items()))
```
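As a follow-up, selecting the most likely label from the returned dictionary is a one-liner; this sketch repeats the model load so it runs on its own:

```python
# Follow-up sketch: pick the most probable emotion from get_probs output.
import gigaam

model = gigaam.load_model("emo")
emotion2prob = model.get_probs("example.wav")
print("predicted:", max(emotion2prob, key=emotion2prob.get))
```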
## License
GigaAM's code and model weights are released under the MIT License.