🎉 The proposal was accepted in v3.0.0rc3. See https://nightly.spacy.io/models/ru for the official Russian pretrained models.

natasha-spacy

SpaCy official Russian model proposal. The work is heavily inspired by and based on spacy-ru by @buriy.

The Russian model is trained on two resources, both available under the MIT license:

  1. Nerus — part of the Natasha project, a large silver-standard Russian corpus annotated with morphology tags, syntax trees and PER, LOC, ORG NER tags.
  2. Navec — also part of the Natasha project, pretrained word embeddings for the Russian language.

The code in this repo is also available under the MIT license.

The resulting model is relatively small (138 MB, thanks to embeddings table pruning) and runs fast on CPU. It shows near-SOTA performance on morphology tagging and syntax parsing, beating the heavy DeepPavlov BERT on the news and wiki domains. On NER the model shows quality comparable to other top Russian systems, beating DeepPavlov, PullEnti and Stanza. See the Naeval morphology, syntax and NER sections.

Download

| Model | Size | SpaCy version |
|-------|------|---------------|
| ru_core_news_md-2.3.0.tar.gz | 138 MB | 2.3.* |
| ru_core_news_md-3.0.0.tar.gz | 135 MB | 3.0.* |

Usage

First download and install the model. The 2.3.0 model requires SpaCy 2.3.*; it won't work with SpaCy 2.1 or 2.2.

A model for SpaCy 3.0 is also available.

```
wget https://storage.yandexcloud.net/natasha-spacy/models/ru_core_news_md-3.0.0.tar.gz
pip install ru_core_news_md-3.0.0.tar.gz
```

Use ipymarkup for NER and syntax visualization.

Training

v2

Both Nerus and Navec are adapted to fit SpaCy utilities. The training procedure uses only the standard `spacy convert`, `spacy init-model` and `spacy train` commands.

Initialize the environment. We use SpaCy 2.3 for training; Russian language support in SpaCy requires PyMorphy for morphology.

Download the 650 MB embeddings table. Navec is precomputed on fiction texts and has 500 000 words in its vocabulary.

Download the 1.5 GB of training data. We use a 10% slice of the original Nerus: 100 000 documents, 1 000 000 sentences.

WARNING! Conversion requires 32 GB of RAM and produces a JSON file 4.5 GB in size.

The original Navec embeddings have 500 000 words in the vocabulary. Pruning to 125 000 words costs just 0.5 percentage points of accuracy.
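The pruning step itself is simple: keep the rows for the N most frequent words and drop the rest. A toy sketch of the idea (the words, counts and dimensions here are hypothetical, not the actual Navec pipeline):

```python
import numpy as np

# Toy stand-in for an embeddings table (the real Navec is 500 000 x 300).
vocab = ["и", "в", "не", "на", "я", "быть", "он", "с"]
vectors = np.random.rand(len(vocab), 4).astype("float32")

# Hypothetical corpus frequencies used to rank the vocabulary.
freq = {"и": 120, "в": 110, "не": 90, "на": 80,
        "я": 60, "быть": 40, "он": 30, "с": 20}

def prune(vocab, vectors, freq, top_n):
    # Keep only the top_n most frequent words and their vector rows.
    kept = sorted(vocab, key=lambda w: freq[w], reverse=True)[:top_n]
    idx = [vocab.index(w) for w in kept]
    return kept, vectors[idx]

kept, pruned = prune(vocab, vectors, freq, 4)
print(kept)          # the 4 most frequent words
print(pruned.shape)  # (4, 4)
```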

Training takes ~2 hours per epoch on CPU (~5 times faster on GPU).

v3

We use SpaCy projects; the training procedure is described in project/project.yml.

Download and uncompress the embeddings table and the training data.

Convert the training data to SpaCy binary format. WARNING! 32 GB of RAM is required.

Convert and prune embeddings table.

Training takes ~3 hours per epoch on CPU and requires ~24 GB of RAM.

Package

Update `meta.json` with the description, authors and sources. On the model name `core_news_md`:

  • `core` — provides all three components: tagger, parser and ner;
  • `news` — trained on Nerus, a large automatically annotated news corpus;
  • `md` — in SpaCy, small models are 10-50 MB in size, `md` 50-200 MB, `lg` 200-600 MB; our model is ~140 MB.
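A sketch of the relevant `meta.json` fields (the values here are illustrative, not the released metadata):

```json
{
  "lang": "ru",
  "name": "core_news_md",
  "version": "2.3.0",
  "description": "Russian multi-task model: tagger, parser and ner trained on Nerus with pruned Navec embeddings",
  "license": "MIT",
  "sources": [
    {"name": "Nerus", "license": "MIT"},
    {"name": "Navec", "license": "MIT"}
  ]
}
```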

v2

Use `spacy package` and `python sdist` to produce the tar.gz archive.

v3

Change the versions; the rest is the same as in v2.

Use SpaCy projects to build the package; the config is in project/project.yml.

History