🎉 The proposal was accepted in v3.0.0rc3. See https://nightly.spacy.io/models/ru for the official Russian pretrained models.

natasha-spacy

SpaCy official Russian model proposal. The work is heavily inspired by and based on spacy-ru by @buriy.

The Russian model is trained on two resources, both available under the MIT license:

  1. Nerus — part of the Natasha project, a large silver-standard Russian corpus annotated with morphology tags, syntax trees and PER, LOC, ORG NER tags.
  2. Navec — also part of the Natasha project, pretrained word embeddings for the Russian language.

The code in this repo is also available under the MIT license.

The resulting model is relatively small (138 MB, thanks to embeddings table pruning) and runs fast on CPU. It shows near-SOTA performance on morphology tagging and syntax parsing, beating the heavy DeepPavlov BERT on the news and wiki domains. On NER the model shows quality comparable to other top Russian systems, beating DeepPavlov, PullEnti and Stanza. See the Naeval morphology, syntax and NER sections.

Download

| Model | Size | SpaCy version |
|-------|------|---------------|
| ru_core_news_md-2.3.0.tar.gz | 138 MB | 2.3.* |
| ru_core_news_md-3.0.0.tar.gz | 135 MB | 3.0.* |

Usage

First download and install the model. The 2.3.0 model requires SpaCy 2.3.*; it won't work with SpaCy 2.1 or 2.2.

A model for SpaCy 3.0 is also available.

```
wget https://storage.yandexcloud.net/natasha-spacy/models/ru_core_news_md-3.0.0.tar.gz
pip install ru_core_news_md-3.0.0.tar.gz
```

Use ipymarkup for NER and syntax visualization.

Training

v2

Both Nerus and Navec are adapted to fit SpaCy utilities. The training procedure uses only the standard `spacy convert`, `spacy init-model` and `spacy train` commands.

Initialize the environment. We use SpaCy 2.3 for training; Russian language support in SpaCy requires PyMorphy for morphology.

Download the 650 MB embeddings table. Navec is precomputed on fiction texts and has 500 000 words in its vocabulary.

Download the 1.5 GB of training data. We use a 10% slice of the original Nerus: 100 000 documents, 1 000 000 sentences.

WARNING! Conversion requires 32 GB of RAM and produces a JSON file 4.5 GB in size.

The original Navec embeddings have 500 000 words in the vocabulary. Pruning to 125 000 words costs just 0.5 percentage points of accuracy.
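The pruning step itself is simple: keep the rows for the N most frequent words and drop the rest. A toy sketch of the idea (the words, counts and dimensions here are hypothetical, not the actual Navec pipeline):

```python
import numpy as np

# Toy stand-in for an embeddings table (the real Navec is 500 000 x 300).
vocab = ["и", "в", "не", "на", "я", "быть", "он", "с"]
vectors = np.random.rand(len(vocab), 4).astype("float32")

# Hypothetical corpus frequencies used to rank the vocabulary.
freq = {"и": 120, "в": 110, "не": 90, "на": 80,
        "я": 60, "быть": 40, "он": 30, "с": 20}

def prune(vocab, vectors, freq, top_n):
    # Keep only the top_n most frequent words and their vector rows.
    kept = sorted(vocab, key=lambda w: freq[w], reverse=True)[:top_n]
    idx = [vocab.index(w) for w in kept]
    return kept, vectors[idx]

kept, pruned = prune(vocab, vectors, freq, 4)
print(kept)          # the 4 most frequent words
print(pruned.shape)  # (4, 4)
```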

Training takes ~2 hours per epoch on CPU (~5 times faster on GPU).

v3

We use SpaCy projects; the training procedure is described in project/project.yml.

Download and uncompress the embeddings table and the training data.

Convert the training data to SpaCy binary format. WARNING! 32 GB of RAM is required.

Convert and prune embeddings table.

Training takes ~3 hours per epoch on CPU and requires ~24 GB of RAM.

Package

Update `meta.json` with the description, authors and sources. On the model name `core_news_md`:

  • `core` — provides all three components: tagger, parser and ner;
  • `news` — trained on Nerus, a large automatically annotated news corpus;
  • `md` — in SpaCy, small models are 10-50 MB in size, `md` 50-200 MB, `lg` 200-600 MB; our model is ~140 MB.
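A sketch of the relevant `meta.json` fields (the values here are illustrative, not the released metadata):

```json
{
  "lang": "ru",
  "name": "core_news_md",
  "version": "2.3.0",
  "description": "Russian multi-task model: tagger, parser and ner trained on Nerus with pruned Navec embeddings",
  "license": "MIT",
  "sources": [
    {"name": "Nerus", "license": "MIT"},
    {"name": "Navec", "license": "MIT"}
  ]
}
```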

v2

Use `spacy package` and `python sdist` to produce the tar.gz archive.

v3

Change the versions; the rest is the same as in v2.

Use SpaCy projects to build the package; the config is in project/project.yml.

History