natasha
Description
Solves basic Russian NLP tasks; API for lower-level Natasha projects
Natasha solves basic NLP tasks for the Russian language: tokenization, sentence segmentation, word embedding, morphology tagging, lemmatization, phrase normalization, syntax parsing, NER tagging and fact extraction. On news articles, quality on every task is similar to or better than the current SOTA for Russian; see the Evaluation section. Natasha is not a research project: the underlying technologies are built for production. We pay attention to model size, RAM usage and performance. Models run on CPU and use NumPy for inference.
Natasha integrates libraries from the Natasha project under one convenient API:
- Razdel — token, sentence segmentation for Russian
- Navec — compact Russian embeddings
- Slovnet — modern deep-learning techniques for Russian NLP, compact models for Russian morphology, syntax, NER.
- Yargy — rule-based fact extraction similar to Tomita parser.
- Ipymarkup — NLP visualizations for NER and syntax markups.
⚠ API may change; for real-world tasks consider using the low-level libraries from the Natasha project. Models are optimized for news articles; quality on other domains may be lower. To use the old `NamesExtractor` and `AddressExtractor`, downgrade: `pip install "natasha<1" "yargy<0.13"`
Install
Natasha supports Python 3.7+ and PyPy3. Install it from PyPI with `pip install natasha`.
Usage
Import and initialize the modules, then build a `Doc` object from the text.
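A minimal sketch of the setup step, assuming the class names exported by Natasha 1.x (`Segmenter`, `MorphVocab`, `NewsEmbedding`, `NewsMorphTagger`, `NewsSyntaxParser`, `NewsNERTagger`, `NamesExtractor`, `Doc`). The helper `build_pipeline` and the dictionary keys are hypothetical names used only for illustration.

```python
def build_pipeline():
    """Create the components used by the later steps (hypothetical helper)."""
    # Imported inside the function to keep the sketch self-contained.
    from natasha import (
        Segmenter,
        MorphVocab,
        NewsEmbedding,
        NewsMorphTagger,
        NewsSyntaxParser,
        NewsNERTagger,
        NamesExtractor,
    )

    emb = NewsEmbedding()        # embeddings shared by the three taggers
    morph_vocab = MorphVocab()   # vocabulary used for lemmatization
    return {
        'segmenter': Segmenter(),
        'morph_vocab': morph_vocab,
        'morph_tagger': NewsMorphTagger(emb),
        'syntax_parser': NewsSyntaxParser(emb),
        'ner_tagger': NewsNERTagger(emb),
        'names_extractor': NamesExtractor(morph_vocab),
    }


def build_doc(text):
    """Wrap raw text in a Doc; the steps below annotate it in place."""
    from natasha import Doc
    return Doc(text)
```

Building the components once and reusing them for many documents is likely preferable, since model loading happens in the constructors.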
Segmentation
Split text into tokens and sentences. Defines the `tokens` and `sents` properties of `doc`. Uses Razdel internally.
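The segmentation step might look like the sketch below. It assumes the `Doc.segment(segmenter)` method of Natasha 1.x; the function name `segment_text` is hypothetical.

```python
def segment_text(text):
    """Return a Doc whose tokens and sents are filled in."""
    from natasha import Segmenter, Doc

    doc = Doc(text)
    doc.segment(Segmenter())  # fills doc.tokens and doc.sents
    return doc


# Example use (requires natasha to be installed):
# doc = segment_text('Посол Израиля на Украине Йоэль Лион признался, что пришел в шок.')
# print([token.text for token in doc.tokens][:5])
# print([sent.text for sent in doc.sents])
```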
Morphology
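For every token, extract rich morphology tags. Depends on the segmentation step. Defines the `pos` and `feats` properties of `doc.tokens`. Uses the Slovnet morphology model internally.
Call `morph.print()` on a sentence, for example `doc.sents[0].morph.print()`, to visualize morphology markup.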
Lemmatization
Lemmatize every token. Depends on the morphology step. Defines the `lemma` property of `doc.tokens`. Uses Pymorphy internally.
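The morphology and lemmatization steps can be sketched together, since lemmatization needs the morphology tags. This assumes the `Doc.tag_morph(tagger)` and `token.lemmatize(vocab)` methods of Natasha 1.x; `tag_and_lemmatize` is a hypothetical helper name.

```python
def tag_and_lemmatize(text):
    """Return (text, pos, lemma) for every token in the text."""
    from natasha import Segmenter, MorphVocab, NewsEmbedding, NewsMorphTagger, Doc

    doc = Doc(text)
    doc.segment(Segmenter())                         # required before tagging
    doc.tag_morph(NewsMorphTagger(NewsEmbedding()))  # sets token.pos and token.feats

    morph_vocab = MorphVocab()
    for token in doc.tokens:
        token.lemmatize(morph_vocab)                 # sets token.lemma

    return [(token.text, token.pos, token.lemma) for token in doc.tokens]
```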
Syntax
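For every sentence, run the syntax analyzer. Depends on the segmentation step. Defines the `id`, `head_id` and `rel` properties of `doc.tokens`. Uses the Slovnet syntax model internally.
Use `syntax.print()` on a sentence, for example `doc.sents[0].syntax.print()`, to visualize syntax markup. Uses Ipymarkup internally.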
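A sketch of the syntax step, assuming the `Doc.parse_syntax(parser)` method of Natasha 1.x. The helper name `dependency_rows` is hypothetical.

```python
def dependency_rows(text):
    """Return (id, head_id, rel, text) for every token in the text."""
    from natasha import Segmenter, NewsEmbedding, NewsSyntaxParser, Doc

    doc = Doc(text)
    doc.segment(Segmenter())                            # syntax works per sentence
    doc.parse_syntax(NewsSyntaxParser(NewsEmbedding())) # sets id, head_id, rel
    return [(token.id, token.head_id, token.rel, token.text) for token in doc.tokens]
```

Each row links a token to its head through `head_id`, so the rows together describe one dependency tree per sentence.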
NER
Extract standard named entities: names, locations, organizations. Depends on the segmentation step. Defines the `spans` property of `doc`. Uses the Slovnet NER model internally.
Call `doc.ner.print()` to visualize NER markup. Uses Ipymarkup internally.
Named entity normalization
For every NER span, apply the normalization procedure. Depends on the NER, morphology and syntax steps. Defines the `normal` property of `doc.spans`.
One cannot simply lemmatize every token inside an entity span; otherwise "Организации украинских националистов" would become "Организация украинские националисты". Natasha uses syntax dependencies to produce the correct "Организация украинских националистов".
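The NER and normalization steps can be sketched as one function, because normalization needs morphology, syntax and NER markup on the same document. This assumes the `Doc.tag_ner(tagger)` and `span.normalize(vocab)` methods of Natasha 1.x; `normalized_entities` is a hypothetical helper name.

```python
def normalized_entities(text):
    """Map each entity as written in the text to (type, normalized form)."""
    from natasha import (
        Segmenter,
        MorphVocab,
        NewsEmbedding,
        NewsMorphTagger,
        NewsSyntaxParser,
        NewsNERTagger,
        Doc,
    )

    emb = NewsEmbedding()
    doc = Doc(text)
    doc.segment(Segmenter())
    doc.tag_morph(NewsMorphTagger(emb))      # normalization needs morphology
    doc.parse_syntax(NewsSyntaxParser(emb))  # and syntax dependencies
    doc.tag_ner(NewsNERTagger(emb))          # fills doc.spans

    morph_vocab = MorphVocab()
    for span in doc.spans:
        span.normalize(morph_vocab)          # sets span.normal

    return {span.text: (span.type, span.normal) for span in doc.spans}
```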
Named entity parsing
Parse named entities into first name, surname and patronymic. Depends on the NER step. Defines the `fact` property of `doc.spans`. Uses Yargy-parser internally.
Natasha also has built-in extractors for dates, money and addresses.
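A sketch of name parsing for person spans, assuming the `PER` constant, `NamesExtractor`, `span.extract_fact(extractor)` and the `as_dict` property of the resulting fact as found in Natasha 1.x. The spans are normalized before extraction here as a precaution, which is an assumption of this sketch rather than a documented requirement; `parse_person_names` is a hypothetical helper name.

```python
def parse_person_names(text):
    """Map each person entity to its parsed name parts."""
    from natasha import (
        Segmenter,
        MorphVocab,
        NewsEmbedding,
        NewsMorphTagger,
        NewsSyntaxParser,
        NewsNERTagger,
        NamesExtractor,
        PER,
        Doc,
    )

    emb = NewsEmbedding()
    morph_vocab = MorphVocab()
    names_extractor = NamesExtractor(morph_vocab)

    doc = Doc(text)
    doc.segment(Segmenter())
    doc.tag_morph(NewsMorphTagger(emb))
    doc.parse_syntax(NewsSyntaxParser(emb))
    doc.tag_ner(NewsNERTagger(emb))

    names = {}
    for span in doc.spans:
        if span.type != PER:
            continue                       # only person spans hold names
        span.normalize(morph_vocab)        # assumption: normalize before parsing
        span.extract_fact(names_extractor) # sets span.fact when a name is found
        if span.fact:
            names[span.normal] = span.fact.as_dict
    return names
```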
Documentation
- Examples with description + reference
- Natasha section in longread on Natasha project (in Russian)
- Natasha section of Datafest 2020 talk (in Russian)
Evaluation
- Segmentation — Razdel evaluation section
- Embedding — Navec evaluation section
- Morphology — Slovnet Morph evaluation section
- Syntax — Slovnet Syntax evaluation section
- NER — Slovnet NER evaluation section
Support
- Chat — https://t.me/natural_language_processing
- Issues — https://github.com/natasha/natasha/issues
- Commercial support — https://lab.alexkuk.ru
Development
Dev env
Test
Docs
Release