voice-input

Description

A free offline voice-input tool for Windows with three ASR engines: GigaAM (3.3% WER, best quality for Russian), Whisper (multilingual, GPU), and Vosk (fast, offline). 90% reduction in recognition errors, 93 automated tests, push-to-talk into any application. Open source, MIT.

https://borisovai.tech/ru/projects/voice-input

Languages

  • Python 98.8%
  • C 0.6%
  • PowerShell 0.4%
  • Other 0.2%

ScribeAir

Offline voice input for Windows — push-to-talk and voice activation

Speak — text appears at cursor. Hold a hotkey or say "record" to start. Works in any application.

License: MIT · Platform: Windows · Python 3.10+

Download | Features | Benchmarks | Installation | Русский


GigaAM achieves 3.3% WER on Russian (CPU) — outperforms every Whisper model, even on an RTX 4090 GPU. Fully offline after initial model download. Free and open source.

Why ScribeAir?

  • Offline — works without internet (Windows Voice Typing, Google, Dragon require cloud)
  • Best Russian quality — 3.3% WER vs ~25% Windows, ~10% Google, ~8% Dragon
  • Free & open source — MIT license (Dragon costs $300+, Google charges per request)
  • Configurable — 3 ASR backends to choose from (GigaAM, Whisper, Vosk)
  • Private — 100% local, your speech never leaves your computer

Download

Pre-built Windows executables are available in Releases — no Python required.

Extract the ZIP and run `ScribeAir.exe`. Models download automatically on first launch.

Features

  • Push-to-talk with configurable hotkeys (LShift+RShift, Win+Shift, etc.)
  • Wake word activation — say "запись" / "record" to start, "стоп" / "stop" to finish (no hotkey needed)
  • 3 ASR backends: Whisper (GPU/CPU), GigaAM (ONNX, Russian-optimized), Vosk (lightweight)
  • Progressive transcription — see intermediate results in real-time during recording
  • IT term replacement — automatic питон→Python, гугл→Google (81 terms with morphology)
  • Real-time overlay showing recording/transcription progress
  • Streaming pipeline with Silero VAD for instant voice detection
  • Fully offline after initial model download
  • Multi-language: Russian, English, auto-detect, mixed RU+EN, RU→EN translation
  • T5 text correction for fixing ASR errors in Russian
  • System tray UI with full settings menu
  • GPU acceleration via NVIDIA CUDA (auto CPU fallback)
  • Dual builds — CUDA (~4.9 GB) and CPU-only (~800 MB)
  • Custom vocabulary for domain-specific terms
  • Windows autostart support
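The IT term replacement feature can be pictured with a small sketch. This is a toy version, not the project's actual `term_replacer.py`: the real app ships 81 terms with full morphology handling, and the term table and ending list below are illustrative.

```python
import re

# Toy subset of the term table (the real app covers 81 terms).
TERMS = {"питон": "Python", "гугл": "Google"}

# A few common Russian case endings, so inflected forms still match.
ENDINGS = r"(?:ами|ах|ом|а|е|у|ы)?"

def replace_terms(text: str) -> str:
    """Replace Cyrillic IT terms (in any listed case form) with Latin spellings."""
    for cyr, lat in TERMS.items():
        text = re.sub(rf"\b{cyr}{ENDINGS}\b", lat, text, flags=re.IGNORECASE)
    return text

print(replace_terms("пишу на питоне и ищу в гугле"))
# пишу на Python и ищу в Google
```

Python's `re` treats `\b` as a Unicode word boundary by default, which is why the pattern works on Cyrillic text without extra flags.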

Architecture

```
Wake Word (OWW)  ──→  Recording  ──→  Progressive GigaAM (live preview)
   or Hotkey             │                      │
                         ▼                      ▼
  "стоп/stop"       Silero VAD → ASR Engine → Term Replacement → Clipboard
                                  ├── GigaAM  (Russian, CPU)
                                  ├── Whisper (multilingual, GPU/CPU)
                                  └── Vosk    (lightweight, CPU)
```
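In code terms, the chain boils down to a simple composition. The sketch below is schematic with stub stages, not the project's actual API:

```python
def run_pipeline(audio, vad, asr, replace_terms):
    """Schematic processing chain: trim silence, transcribe, fix IT terms."""
    speech = vad(audio)           # Silero VAD: keep only voiced segments
    text = asr(speech)            # GigaAM / Whisper / Vosk
    return replace_terms(text)    # Cyrillic → Latin term replacement

# Stub stages to show the data flow:
result = run_pipeline(
    b"...raw pcm...",
    vad=lambda a: a,
    asr=lambda a: "пример текста",
    replace_terms=lambda t: t.capitalize(),
)
print(result)  # Пример текста
```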

Recognition Quality

Benchmarked on Russian audiobook data (detailed benchmarks):

| Backend | Model           | WER   | Reduction | Latency     |
|---------|-----------------|-------|-----------|-------------|
| GigaAM  | v3-e2e-rnnt     | 3.3%  | 90.0%     | 0.66s (CPU) |
| GigaAM  | v3-rnnt         | 3.3%  | 90.0%     | 0.82s (CPU) |
| GigaAM  | v3-e2e-ctc      | 4.2%  | 87.2%     | 1.08s (CPU) |
| Whisper | large-v3-turbo  | 7.9%  | 75.7%     | 0.44s (GPU) |
| Whisper | large-v3        | 8.8%  | 72.9%     | 2.30s (GPU) |
| Whisper | medium          | 10.7% | 67.2%     | 1.75s (GPU) |
| Vosk    | small-ru        | 13.0% | 60.0%     | 0.75s (CPU) |
| Whisper | base (baseline) | 32.6% | —         | 0.42s (CPU) |

Key finding: GigaAM on CPU (3.3% WER) outperforms Whisper large-v3-turbo on RTX 4090 GPU (7.9% WER) for Russian.
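WER here is the standard metric: word-level edit distance divided by the number of reference words. A minimal reference implementation (not the project's benchmark harness) looks like:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[-1][-1] / len(r)

print(wer("мама мыла раму", "мама мыла рану"))  # 1 substitution / 3 words ≈ 0.33
```

The "Reduction" column appears to be measured against the Whisper base baseline: 1 - 3.3/32.6 ≈ 90%.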

Installation

From source

Requirements: Windows 10/11, Python 3.10+, 8 GB RAM, microphone.

Models download automatically on first run (~240 MB for GigaAM, ~1.6 GB for Whisper).

Usage

  1. Launch the application — a microphone icon appears in the system tray
  2. Wait for model loading (icon turns from blue to gray)
  3. Start recording using either method:
    • Hotkey — hold LShift + RShift (configurable), release to finish
    • Voice — say "запись" / "record" to start, "стоп" / "stop" to finish
  4. Speak into your microphone
  5. Text is inserted at the cursor position

Voice Activation (Wake Word)

Enable "Wake Word" in the tray menu for hands-free operation — no hotkey needed, fully voice-controlled.

Tray Menu (right-click)

  • Language — Auto-detect, Russian, English, RU+EN Mixed, RU→EN Translate
  • Model — Tiny, Small, Medium, Large v3, Large v3 Turbo
  • ASR Backend — Auto, Whisper, GigaAM, Vosk
  • GigaAM Model — v2-ctc, v3-rnnt, v3-e2e-rnnt, etc.
  • Hotkey — Key combination for push-to-talk
  • Audio Device — Select input microphone
  • Wake Word — Toggle voice activation (say "запись"/"record" to start)
  • IT Term Replacement — Toggle automatic Cyrillic → Latin term conversion
  • Text Correction (T5) — Toggle T5 spell correction
  • Start with Windows — Toggle autostart
  • Show/Hide Console — Open a debug console with live logs

Configuration

Settings are stored in `%APPDATA%\ScribeAir\config.json`.

ASR Backend Modes

  • auto
    — Default: GigaAM for Russian on CPU, Whisper on GPU (3.3–7.9% WER, 0.4–0.8s)
  • whisper
    — English or GPU-accelerated transcription (7.9% WER, 0.44s on GPU)
  • gigaam
    — Russian on CPU, best quality (3.3% WER, 0.66s)
  • vosk
    — Ultra-low latency, short phrases (13% WER, 0.7s)
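A config might look like the sketch below. The field names are illustrative guesses, except `device`, which the Troubleshooting section mentions; check the generated file for the real schema.

```json
{
  "asr_backend": "auto",
  "language": "auto",
  "hotkey": "lshift+rshift",
  "wake_word": false,
  "device": "cuda"
}
```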

Building EXE

  • CUDA (~4.9 GB) — Full GPU support, includes NVIDIA DLLs
  • CPU (~800 MB) — CPU-only, no CUDA dependencies

AI models are not bundled; they download automatically on first launch and are cached in `%APPDATA%\ScribeAir\models\`.

Project Structure

```
scribe-air/
├── src/
│   ├── main.py                 # Application entry point
│   ├── config.py               # Configuration management
│   ├── hotkey.py               # Global hotkey detection
│   ├── recorder.py             # Audio recording (sounddevice)
│   ├── transcriber.py          # Whisper transcription
│   ├── gigaam_transcriber.py   # GigaAM ONNX transcription
│   ├── vosk_transcriber.py     # Vosk transcription
│   ├── streaming_pipeline.py   # Streaming VAD + transcription
│   ├── wakeword_listener.py    # Wake word detection (openWakeWord)
│   ├── term_replacer.py        # IT term replacement (Cyrillic → Latin)
│   ├── audio_processor.py      # Audio preprocessing
│   ├── text_corrector_t5.py    # T5 text correction
│   ├── model_downloader.py     # Model download with mirror fallback
│   ├── inserter.py             # Text insertion via clipboard
│   ├── overlay.py              # Floating transcription window
│   ├── tray.py                 # System tray UI
│   └── autostart.py            # Windows autostart
├── wakeword_data/models/       # Wake word ONNX models (zapis, stop)
├── tests/                      # Pytest test suite
├── docs/guides/                # User and developer guides
├── assets/icon.ico             # Application icon
├── requirements.txt            # Dependencies
├── voice_app.spec              # PyInstaller config
└── build.py                    # Build script
```

Troubleshooting

  • Model not loading — Check internet connection. First download is ~240 MB–1.6 GB from HuggingFace. Models cache in `%APPDATA%\ScribeAir\models\`
  • No audio — Verify the microphone in Windows Sound Settings. Select the device via Audio Device in the tray menu
  • GPU not used — Install CUDA 12.x and update NVIDIA drivers. Set `"device": "cuda"` in the config. Use the CUDA build
  • Text not inserted — Ensure the cursor is in a text field. Test with Notepad first
  • Debug logs — Right-click the tray icon → "Show Console" for live logs. Log file: `%APPDATA%\ScribeAir\scribe_air.log`

Testing

The project ships a pytest suite (93 automated tests) under `tests/`.

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

License

MIT License. See LICENSE.

Acknowledgements