GigaAM: the family of open-source acoustic models for speech processing


Latest News


Setup

Requirements

  • Python ≥ 3.10
  • ffmpeg installed and added to your system's PATH

Install the GigaAM Package
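The install command appears to have been lost in extraction. A typical installation sketch (assuming the package is published on PyPI under the name `gigaam`, and that the repository lives at `salute-developers/GigaAM`; check the project page for the exact instructions):

```shell
# Install from PyPI (assuming the package is published there)
pip install gigaam

# Or install an editable copy from source
git clone https://github.com/salute-developers/GigaAM.git
cd GigaAM
pip install -e .
```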


GigaAM overview

GigaAM is a Conformer-based foundational model (220-240M parameters) pre-trained on diverse Russian speech data. It serves as the backbone for the entire GigaAM family, enabling state-of-the-art fine-tuned performance in speech recognition and emotion recognition. More information about GigaAM-v1 can be found in our post on Habr. We fine-tuned the GigaAM encoder for ASR using CTC and RNNT decoders. The GigaAM family includes three lines of models:

| Line | Pretrain Method | Pretrain (hours) | ASR (hours) | Available Versions |
|------|-----------------|------------------|-------------|--------------------|
| v1 | Wav2vec 2.0 | 50,000 | 2,000 | `v1_ssl`, `emo`, `v1_ctc`, `v1_rnnt` |
| v2 | HuBERT-CTC | 50,000 | 2,000 | `v2_ssl`, `v2_ctc`, `v2_rnnt` |
| v3 | HuBERT-CTC | 700,000 | 4,000 | `v3_ssl`, `v3_ctc`, `v3_rnnt`, `v3_e2e_ctc`, `v3_e2e_rnnt` |

Here `v3_e2e_ctc` and `v3_e2e_rnnt` additionally support punctuation and text normalization.

Model Performance

`GigaAM-v3` training incorporates new internal datasets: call-center recordings, music, speech with atypical characteristics, and voice messages. As a result, the models perform on average 30% better on these new domains while maintaining the same quality as `GigaAM-v2` on public benchmarks. In end-to-end ASR comparisons of `e2e_ctc` and `e2e_rnnt` against Whisper (judged via an independent LLM-as-a-Judge side-by-side evaluation), GigaAM models win by an average margin of 70:30. Our emotion recognition model `GigaAM-Emo` outperforms existing models by 15% in Macro F1-score.

For detailed results, see here.


Usage

Model inference

Note: ASR with the `.transcribe` function is applicable only to audio up to 25 seconds long. To enable `.transcribe_longform`, install the additional pyannote.audio dependencies.
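The inference snippet appears to have been lost in extraction. A minimal sketch (the `load_model` entry point, the version string, and the file name `example.wav` are assumptions; see the repository examples for the exact API):

```python
import gigaam

# Load one of the released checkpoints by its version name from the table above
model = gigaam.load_model("v3_ctc")

# Short-form recognition: audio up to 25 seconds
transcription = model.transcribe("example.wav")
print(transcription)

# Long-form recognition (requires the extra pyannote.audio dependencies)
# for segment in model.transcribe_longform("long_example.wav"):
#     print(segment)
```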

Longform setup instruction

Loading from Hugging Face

Note: Install requirements from the example.
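The loading snippet appears to have been lost in extraction. A hedged sketch using `huggingface_hub` (the repo id and checkpoint filename here are illustrative, not the real identifiers; follow the linked example for the actual recipe):

```python
from huggingface_hub import hf_hub_download

# Download a checkpoint from the Hugging Face Hub.
# repo_id and filename below are placeholders for illustration only.
ckpt_path = hf_hub_download(repo_id="salute-developers/GigaAM", filename="v3_ctc.ckpt")
print(ckpt_path)  # local path to the cached checkpoint
```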

ONNX Export and Inference

Note: GPU support can be enabled with `pip install onnxruntime-gpu==1.23.*` if applicable.

  1. Export the model to ONNX using the `model.to_onnx` method.
  2. Run ONNX inference.
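The code for both steps appears to have been lost in extraction. A hedged sketch (the `to_onnx` call, the output path, and the input tensor shapes are assumptions; the Colab notebook shows the exact calls):

```python
import numpy as np
import onnxruntime as ort

# Step 1: export (assumed call; run once with a loaded GigaAM model)
# model.to_onnx(dir_path="onnx")

# Step 2: run the exported graph with onnxruntime
session = ort.InferenceSession("onnx/v3_ctc.onnx", providers=["CPUExecutionProvider"])

# Feed preprocessed audio features; the shapes and dtypes here are
# illustrative -- inspect session.get_inputs() for the real interface
features = np.zeros((1, 64, 100), dtype=np.float32)
lengths = np.array([100], dtype=np.int64)
inputs = {inp.name: arr for inp, arr in zip(session.get_inputs(), (features, lengths))}
logits = session.run(None, inputs)
```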

These and more advanced examples (e.g. custom audio loading, batching) can be found in the Colab notebook.


Citation

If you use GigaAM in your research, please cite our paper: