natasha


Solves basic Russian NLP tasks, API for lower level Natasha projects


Natasha solves basic NLP tasks for the Russian language: tokenization, sentence segmentation, word embedding, morphology tagging, lemmatization, phrase normalization, syntax parsing, NER tagging, and fact extraction. On news articles, quality on every task is similar to or better than the current state of the art for Russian; see the Evaluation section. Natasha is not a research project: the underlying technologies are built for production, with attention to model size, RAM usage, and performance. Models run on CPU and use NumPy for inference.

Natasha integrates libraries from the Natasha project under one convenient API:

  • Razdel — token, sentence segmentation for Russian
  • Navec — compact Russian embeddings
  • Slovnet — modern deep-learning techniques for Russian NLP, compact models for Russian morphology, syntax, NER.
  • Yargy — rule-based fact extraction similar to Tomita parser.
  • Ipymarkup — NLP visualizations for NER and syntax markups.

⚠ The API may change; for real-world tasks, consider using the low-level libraries from the Natasha project directly. Models are optimized for news articles, so quality on other domains may be lower. To use the old `NamesExtractor` and `AddressExtractor`, downgrade with `pip install "natasha<1" "yargy<0.13"`.

Install

Natasha supports Python 3.7+ and PyPy3. Install it from PyPI with `pip install natasha`.

Usage

Import the modules, initialize them, and build a `Doc` object.
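The setup step can be sketched roughly as follows. The helper names `build_pipeline` and `build_doc` are hypothetical and exist only for this illustration; the imported classes are the ones Natasha exports, and the imports are placed inside the functions so the snippet can be loaded before the models are needed.

```python
def build_pipeline():
    """Create the processing modules used by the later steps (illustrative helper)."""
    # Local import: natasha is only required when the function is called.
    from natasha import (
        Segmenter,
        MorphVocab,
        NewsEmbedding,
        NewsMorphTagger,
        NewsSyntaxParser,
        NewsNERTagger,
        NamesExtractor,
    )

    emb = NewsEmbedding()       # one embedding instance shared by all models
    morph_vocab = MorphVocab()  # vocabulary used for lemmatization and normalization
    return {
        "segmenter": Segmenter(),
        "morph_vocab": morph_vocab,
        "morph_tagger": NewsMorphTagger(emb),
        "syntax_parser": NewsSyntaxParser(emb),
        "ner_tagger": NewsNERTagger(emb),
        "names_extractor": NamesExtractor(morph_vocab),
    }


def build_doc(text):
    """Wrap raw text in a Doc; no analysis has been run at this point."""
    from natasha import Doc

    return Doc(text)


# Example (requires natasha to be installed):
# pipeline = build_pipeline()
# doc = build_doc("Посол Израиля на Украине Йоэль Лион посетил Львов.")
```

Sharing one `NewsEmbedding` instance between the tagger, parser, and NER model avoids loading the embeddings more than once.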

Segmentation

Split text into tokens and sentences. Defines the `tokens` and `sents` properties of `doc`. Uses Razdel internally.
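A minimal sketch of the segmentation step, assuming a fresh `Segmenter` is created on each call. The wrapper function `segment` is a hypothetical name used only here.

```python
def segment(text):
    """Return a Doc whose tokens and sents are filled in."""
    # Local import: natasha is only required when the function is called.
    from natasha import Doc, Segmenter

    doc = Doc(text)
    doc.segment(Segmenter())
    return doc


# Example (requires natasha to be installed):
# doc = segment("Привет, мир. Как дела?")
# for sent in doc.sents:
#     print(sent.start, sent.stop, sent.text)
# for token in doc.tokens:
#     print(token.start, token.stop, token.text)
```

Each token and sentence carries `start` and `stop` character offsets into the original text, along with its `text`.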

Morphology

For every token, extract rich morphology tags. Depends on the segmentation step. Defines the `pos` and `feats` properties of `doc.tokens`. Uses the Slovnet morphology model internally.

Call `morph.print()` to visualize the morphology markup.
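As a rough sketch, morphology tagging can be run right after segmentation. The function name `tag_morphology` is hypothetical, and the snippet builds its own models so that it stands alone.

```python
def tag_morphology(text):
    """Return a Doc whose tokens carry pos and feats."""
    # Local import: natasha is only required when the function is called.
    from natasha import Doc, Segmenter, NewsEmbedding, NewsMorphTagger

    doc = Doc(text)
    doc.segment(Segmenter())  # morphology depends on segmentation
    doc.tag_morph(NewsMorphTagger(NewsEmbedding()))
    return doc


# Example (requires natasha to be installed):
# doc = tag_morphology("Мама мыла раму.")
# for token in doc.tokens:
#     print(token.text, token.pos, token.feats)
# doc.sents[0].morph.print()  # text visualization of one sentence
```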

Lemmatization

Lemmatize every token. Depends on the morphology step. Defines the `lemma` property of `doc.tokens`. Uses Pymorphy internally.
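Lemmatization is applied token by token, so a sketch of this step loops over `doc.tokens`. The function name `lemmatize` and the returned list of pairs are choices made for this illustration only.

```python
def lemmatize(text):
    """Return (token text, lemma) pairs for every token in the text."""
    # Local import: natasha is only required when the function is called.
    from natasha import Doc, Segmenter, MorphVocab, NewsEmbedding, NewsMorphTagger

    doc = Doc(text)
    doc.segment(Segmenter())
    doc.tag_morph(NewsMorphTagger(NewsEmbedding()))  # lemmatization needs pos and feats

    morph_vocab = MorphVocab()
    for token in doc.tokens:
        token.lemmatize(morph_vocab)  # sets token.lemma in place
    return [(token.text, token.lemma) for token in doc.tokens]


# Example (requires natasha to be installed):
# for text, lemma in lemmatize("Мама мыла раму."):
#     print(text, lemma)
```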

Syntax

For every sentence, run the syntax analyzer. Depends on the segmentation step. Defines the `id`, `head_id`, and `rel` properties of `doc.tokens`. Uses the Slovnet syntax model internally.

Use `syntax.print()` to visualize the syntax markup. Uses Ipymarkup internally.
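A minimal sketch of syntax parsing, assuming only segmentation has been run beforehand. The function name `parse_syntax` is hypothetical.

```python
def parse_syntax(text):
    """Return a Doc whose tokens carry id, head_id and rel."""
    # Local import: natasha is only required when the function is called.
    from natasha import Doc, Segmenter, NewsEmbedding, NewsSyntaxParser

    doc = Doc(text)
    doc.segment(Segmenter())  # the parser works sentence by sentence
    doc.parse_syntax(NewsSyntaxParser(NewsEmbedding()))
    return doc


# Example (requires natasha to be installed):
# doc = parse_syntax("Мама мыла раму.")
# for token in doc.tokens:
#     print(token.id, token.head_id, token.rel, token.text)
# doc.sents[0].syntax.print()  # dependency tree for one sentence
```

Each token's `head_id` refers to the `id` of its syntactic parent, and `rel` names the dependency relation between the two.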

NER

Extract standard named entities: names, locations, organizations. Depends on the segmentation step. Defines the `spans` property of `doc`. Uses the Slovnet NER model internally.

Call `ner.print()` to visualize the NER markup. Uses Ipymarkup internally.
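The NER step can be sketched in the same way. The function name `tag_entities` is hypothetical, and the sample sentence is an arbitrary illustration, not taken from the Natasha documentation.

```python
def tag_entities(text):
    """Return a Doc whose spans hold the detected named entities."""
    # Local import: natasha is only required when the function is called.
    from natasha import Doc, Segmenter, NewsEmbedding, NewsNERTagger

    doc = Doc(text)
    doc.segment(Segmenter())  # NER depends on segmentation
    doc.tag_ner(NewsNERTagger(NewsEmbedding()))
    return doc


# Example (requires natasha to be installed):
# doc = tag_entities("Посол Израиля на Украине Йоэль Лион посетил Львов.")
# for span in doc.spans:
#     print(span.start, span.stop, span.type, span.text)
# doc.ner.print()  # inline visualization of the spans
```

The `type` of each span is one of the constants `PER`, `LOC`, or `ORG`, which can be imported from `natasha`.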

Named entity normalization

For every NER span, apply the normalization procedure. Depends on the NER, morphology, and syntax steps. Defines the `normal` property of `doc.spans`.

One cannot simply lemmatize every token inside an entity span; otherwise "Организации украинских националистов" would become "Организация украинские националисты". Natasha uses syntax dependencies to produce the correct "Организация украинских националистов".
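Since normalization depends on three earlier steps, a sketch of it has to run the whole chain first. The function name `normalize_entities` and the returned dictionary shape are choices made for this illustration.

```python
def normalize_entities(text):
    """Map each entity's surface text to its normalized form."""
    # Local import: natasha is only required when the function is called.
    from natasha import (
        Doc,
        Segmenter,
        MorphVocab,
        NewsEmbedding,
        NewsMorphTagger,
        NewsSyntaxParser,
        NewsNERTagger,
    )

    emb = NewsEmbedding()
    doc = Doc(text)
    doc.segment(Segmenter())
    doc.tag_morph(NewsMorphTagger(emb))      # morphology tags for inflection
    doc.parse_syntax(NewsSyntaxParser(emb))  # dependencies inside each span
    doc.tag_ner(NewsNERTagger(emb))          # the spans to normalize

    morph_vocab = MorphVocab()
    for span in doc.spans:
        span.normalize(morph_vocab)  # sets span.normal in place
    return {span.text: span.normal for span in doc.spans}


# Example (requires natasha to be installed):
# text = "Он напомнил о роли Организации украинских националистов."
# print(normalize_entities(text))
```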

Named entity parsing

Parse `PER` named entities into first name, surname, and patronymic. Depends on the NER step. Defines the `fact` property of `doc.spans`. Uses Yargy-parser internally.
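A rough sketch of name parsing follows. As a precaution it runs normalization before fact extraction, on the assumption that the extractor works on the normalized span text. The function name `parse_person_names` is hypothetical.

```python
def parse_person_names(text):
    """Map each normalized PER entity to its parsed name parts."""
    # Local import: natasha is only required when the function is called.
    from natasha import (
        Doc,
        Segmenter,
        MorphVocab,
        NewsEmbedding,
        NewsMorphTagger,
        NewsSyntaxParser,
        NewsNERTagger,
        NamesExtractor,
        PER,
    )

    emb = NewsEmbedding()
    morph_vocab = MorphVocab()
    names_extractor = NamesExtractor(morph_vocab)

    doc = Doc(text)
    doc.segment(Segmenter())
    doc.tag_morph(NewsMorphTagger(emb))
    doc.parse_syntax(NewsSyntaxParser(emb))
    doc.tag_ner(NewsNERTagger(emb))

    for span in doc.spans:
        span.normalize(morph_vocab)  # assumed prerequisite, see the note above
        if span.type == PER:
            span.extract_fact(names_extractor)  # sets span.fact when a name is parsed
    return {span.normal: span.fact.as_dict for span in doc.spans if span.fact}


# Example (requires natasha to be installed):
# print(parse_person_names("Посол Израиля на Украине Йоэль Лион посетил Львов."))
```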

Natasha also has built-in extractors for dates, money, and addresses.
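These extractors work on raw text and do not need a `Doc`. A minimal sketch for dates, assuming the `DatesExtractor` class exported by `natasha` and treating the extractor as a callable that yields matches with a `fact` attribute; the function name `extract_dates` is hypothetical.

```python
def extract_dates(text):
    """Return the date facts found in raw text."""
    # Local import: natasha is only required when the function is called.
    from natasha import MorphVocab, DatesExtractor

    extractor = DatesExtractor(MorphVocab())
    # Calling the extractor yields matches; each match carries start, stop and fact.
    return [match.fact for match in extractor(text)]


# Example (requires natasha to be installed):
# for fact in extract_dates("Встреча состоится 24.01.2017."):
#     print(fact)
```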

Documentation

Evaluation

Support

Development

Dev env

Test

Docs

Release