
GigaAM: the family of open-source acoustic models for speech processing

Overview

GigaAM (Giga Acoustic Model) is a family of open-source models for Russian speech processing tasks, including speech recognition and emotion recognition. The models are built on top of the Conformer architecture and leverage self-supervised learning (wav2vec2-based for GigaAM-v1 and HuBERT-based for GigaAM-v2).

GigaAM models are state-of-the-art open-source solutions for their respective tasks in the Russian language.

This repository includes:

  • GigaAM: A foundational self-supervised model pre-trained on massive Russian speech datasets.
  • GigaAM-CTC and GigaAM-RNNT: Fine-tuned models for automatic speech recognition (ASR).
  • GigaAM-Emo: A fine-tuned model for emotion recognition.

Installation

Requirements

  • Python ≥ 3.8
  • ffmpeg installed and available on your system's PATH (a quick check is sketched below)
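
If you are unsure whether ffmpeg is discoverable from Python, a minimal standard-library check (an illustrative sketch, not part of the gigaam package) is:

import shutil

# Returns the path to the ffmpeg binary, or None if it is not on PATH.
ffmpeg_path = shutil.which("ffmpeg")
if ffmpeg_path is None:
    raise RuntimeError("ffmpeg not found: install it and add it to your PATH")
print(f"ffmpeg found at {ffmpeg_path}")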

Install the GigaAM Package

  1. Clone the repository:

    git clone https://github.com/salute-developers/GigaAM.git
    cd GigaAM
  2. Install the package in editable mode:

    pip install -e .
  3. Verify the installation:

    import gigaam

    # If installation succeeded, this loads the CTC ASR model
    # and prints its architecture.
    model = gigaam.load_model("ctc")
    print(model)

GigaAM: The Foundational Model

GigaAM is a Conformer-based foundational model (240M parameters) pre-trained on 50,000+ hours of diverse Russian speech data.

It serves as the backbone for the entire GigaAM family, enabling state-of-the-art fine-tuned performance in speech recognition and emotion recognition.

There are two available versions:

  • GigaAM-v1 was trained with a wav2vec2-like objective and can be loaded as the v1_ssl model version.
  • GigaAM-v2 was trained with a HuBERT-like objective, which yields a higher-quality fine-tuned ASR model; it can be loaded as the v2_ssl or ssl model version.

More information about GigaAM-v1 can be found in our post on Habr.

GigaAM Usage Example

import gigaam
model = gigaam.load_model('ssl') # Options: "ssl", "v1_ssl"
audio_path = "example.wav"  # path to a speech recording
embedding, _ = model.embed_audio(audio_path)
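
The embedding returned above is the encoder's representation of the utterance. As a minimal follow-up sketch (assuming, which this README does not state, that the result is a torch tensor shaped (batch, time, features)), mean pooling over time gives a single utterance-level vector:

# Illustrative only: collapse frame-level embeddings into one utterance vector.
# Assumption: `embedding` is a torch tensor of shape (batch, time, features).
utterance_vector = embedding.mean(dim=1)
print(utterance_vector.shape)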

GigaAM for Speech Recognition

We fine-tuned the GigaAM encoder for ASR using two decoder architectures: CTC and RNN-Transducer (RNNT).

Fine-tuning was done for both the GigaAM-v1 and GigaAM-v2 SSL models, so there are four ASR models: v1 and v2 versions for both CTC and RNNT, as sketched below.
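
All four checkpoints go through the same load_model/transcribe API demonstrated in the usage examples below; as a quick illustrative sketch ("example.wav" is a placeholder path to a short Russian recording):

import gigaam

# Compare all four ASR checkpoints on the same short (< 30 s) recording.
for model_name in ["v1_ctc", "v1_rnnt", "v2_ctc", "v2_rnnt"]:
    model = gigaam.load_model(model_name)
    print(model_name, "->", model.transcribe("example.wav"))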

Training Data

The models were trained on publicly available Russian datasets:

| Dataset | Size (hours) | Weight |
| --- | --- | --- |
| Golos | 1227 | 0.6 |
| SOVA | 369 | 0.2 |
| Russian Common Voice | 207 | 0.1 |
| Russian LibriSpeech | 93 | 0.1 |

Performance Metrics (Word Error Rate)

| Model | Parameters | Golos Crowd | Golos Farfield | OpenSTT YouTube | OpenSTT Phone Calls | OpenSTT Audiobooks | Mozilla Common Voice 12 | Mozilla Common Voice 19 | Russian LibriSpeech |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Whisper-large-v3 | 1.5B | 13.9 | 16.6 | 18.0 | 28.0 | 14.4 | 5.7 | 5.5 | 9.5 |
| NVIDIA FastConformer | 115M | 2.2 | 6.6 | 21.2 | 30.0 | 13.9 | 2.7 | 5.7 | 11.3 |
| GigaAM-CTC-v1 | 242M | 3.0 | 5.7 | 16.0 | 23.2 | 12.5 | 2.0 | 10.5 | 7.5 |
| GigaAM-RNNT-v1 | 243M | 2.3 | 5.0 | 14.0 | 21.7 | 11.7 | 1.9 | 9.9 | 7.7 |
| GigaAM-CTC-v2 | 242M | 2.5 | 4.3 | 14.1 | 21.1 | 10.7 | 2.1 | 3.1 | 5.5 |
| GigaAM-RNNT-v2 | 243M | 2.2 | 3.9 | 13.3 | 20.0 | 10.2 | 1.8 | 2.7 | 5.5 |
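
For reference, the word error rate is the word-level edit distance (substitutions + deletions + insertions) between hypothesis and reference, divided by the number of reference words. A minimal self-contained implementation, independent of the gigaam package:

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,                          # deletion
                curr[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word), # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)

print(wer("мама мыла раму", "мама мыла рамы"))  # 0.333... (1 substitution / 3 words)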

Speech Recognition Example (GigaAM-ASR)

Basic usage: short audio transcription (up to 30 seconds)

import gigaam
model_name = "rnnt" # Options: "v2_ctc" or "ctc", "v2_rnnt" or "rnnt", "v1_ctc", "v1_rnnt"
model = gigaam.load_model(model_name)
audio_path = "example.wav"  # path to a short speech recording
transcription = model.transcribe(audio_path)

Long-form audio transcription

  1. Install the external VAD dependencies (the pyannote.audio library):

    pip install gigaam[longform]
  2. Use the model.transcribe_longform method:

    import os
    import gigaam

    # pyannote models are gated on Hugging Face, so a valid token is required.
    os.environ["HF_TOKEN"] = "<HF_TOKEN>"

    model = gigaam.load_model("ctc")
    recognition_result = model.transcribe_longform("long_example.wav")

    for utterance in recognition_result:
        transcription = utterance["transcription"]
        start, end = utterance["boundaries"]
        print(f"[{gigaam.format_time(start)} - {gigaam.format_time(end)}]: {transcription}")

ONNX inference example

  1. Export the model to ONNX using the model.to_onnx method:

    import gigaam

    onnx_dir = "onnx"
    model_type = "rnnt"  # or "ctc"

    model = gigaam.load_model(
        model_type,
        fp16_encoder=False,  # only fp32 tensors
        use_flash=False,     # disable flash attention
    )
    model.to_onnx(dir_path=onnx_dir)
  2. Run ONNX inference:

    from gigaam.onnx_utils import load_onnx_sessions, transcribe_sample

    sessions = load_onnx_sessions(onnx_dir, model_type)
    transcribe_sample("example.wav", model_type, sessions)
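
The loaded sessions can be reused across recordings, so the export step is paid only once; a short illustrative loop (the file names are placeholders):

# Reuse the already-loaded ONNX sessions for several files.
for path in ["example.wav", "another_example.wav"]:
    transcribe_sample(path, model_type, sessions)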

All these examples can also be found in the inference_example.ipynb notebook.


GigaAM-Emo: Emotion Recognition

GigaAM-Emo is a fine-tuned model for emotion recognition trained on the Dusha dataset. It significantly outperforms existing models on several metrics.

Performance Metrics

| Model | Crowd Unweighted Acc. | Crowd Weighted Acc. | Crowd Macro F1 | Podcast Unweighted Acc. | Podcast Weighted Acc. | Podcast Macro F1 |
| --- | --- | --- | --- | --- | --- | --- |
| DUSHA baseline (MobileNetV2 + Self-Attention) | 0.83 | 0.76 | 0.77 | 0.89 | 0.53 | 0.54 |
| АБК (TIM-Net) | 0.84 | 0.77 | 0.78 | 0.90 | 0.50 | 0.55 |
| GigaAM-Emo | 0.90 | 0.87 | 0.84 | 0.90 | 0.76 | 0.67 |

Emotion Recognition Example (GigaAM-Emo)

from typing import Dict

import gigaam

model = gigaam.load_model('emo')
emotion2prob: Dict[str, float] = model.get_probs("example.wav")
print(", ".join([f"{emotion}: {prob:.3f}" for emotion, prob in emotion2prob.items()]))

License

GigaAM's code and model weights are released under the MIT License.

