razdel
razdel is a rule-based system for Russian sentence and word tokenization.
Usage
Installation
razdel supports Python 3.7+ and PyPy 3.
Documentation
Materials are in Russian.
Evaluation
Unfortunately, there is no single correct way to split text into sentences and tokens. For example, one tool may split a passage into three sentences while razdel splits the same passage into two. Tokens are equally ambiguous: for the same string, one tool may produce one split and razdel another.
razdel tries to mimic the segmentation of these 4 datasets: SynTagRus, OpenCorpora, GICRYA and RNC. These datasets mainly consist of news and fiction, and razdel rules are optimized for these kinds of texts. The library may perform worse on other domains such as social media, scientific articles and legal documents.
We measure the absolute number of errors. The tokenization task contains many trivial cases. Non-trivial texts, where a plausible-looking split differs from the correct tokenization, are rare. The vast majority of cases are trivial and are tokenized correctly even with Python's built-in string handling. Because of the large number of trivial cases, the overall quality of all segmenters is high, and it is hard to differentiate between, for example, 99.33%, 99.95% and 99.88%, so we report the absolute number of errors.
errors is the number of errors per 1000 tokens/sentences. For example, if the prediction misses one split that is present in the reference segmentation and adds two splits that are absent from it, the number of errors is 3: 1 for the missing split + 2 for the extra splits.
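The counting rule above (missing splits plus extra splits) can be sketched as follows. This is an illustration, not the benchmark's actual code: the helper names `boundaries` and `count_errors` are hypothetical, the sample segmentations are our own, and comparing boundaries by non-whitespace character offset is an assumption.

```python
def boundaries(parts):
    """Offsets, counted in non-whitespace characters, where one part ends."""
    offsets = set()
    position = 0
    # The end of the last part is not a split, so it is skipped.
    for part in parts[:-1]:
        position += len("".join(part.split()))
        offsets.add(position)
    return offsets


def count_errors(reference, prediction):
    """Return (missing, extra) split counts for one text."""
    ref = boundaries(reference)
    pred = boundaries(prediction)
    missing = len(ref - pred)  # splits in the reference that were not predicted
    extra = len(pred - ref)    # predicted splits that the reference lacks
    return missing, extra


# Hypothetical example: the reference keeps the hyphenated word whole.
missing, extra = count_errors(["что-то", "там"], ["что", "-", "то там"])
print(missing + extra)  # 3
```

Ignoring whitespace when computing offsets lets the two segmentations be compared even when one of them keeps a space inside a part.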
time is the number of seconds taken to process the whole dataset.
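Timing a segmenter over a whole dataset could look roughly like this. The helper `measure_seconds` is hypothetical, and `str.split` stands in for a real segmenter; the benchmark's own timing code may differ.

```python
import time


def measure_seconds(segment, texts):
    """Return wall-clock seconds needed to segment every text once."""
    start = time.perf_counter()
    for text in texts:
        # Consume the result in case the segmenter is a lazy generator.
        for _ in segment(text):
            pass
    return time.perf_counter() - start


elapsed = measure_seconds(str.split, ["в 5 часов", "чуть-чуть"])
print(f"{elapsed:.6f} s")
```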
The evaluated models are defined in naeval/segment/models.py; for links to the models, see the Naeval registry. The tables are computed in naeval/segment/main.ipynb.
Tokens
| model | corpora errors | corpora time, s | syntag errors | syntag time, s | gicrya errors | gicrya time, s | rnc errors | rnc time, s |
|---|---|---|---|---|---|---|---|---|
| re.findall(\w+\|\d+\|\p+) | 24 | 0.5 | 16 | 0.5 | 19 | 0.4 | 60 | 0.4 |
| spacy | 26 | 6.2 | 13 | 5.8 | 14 | 4.1 | 32 | 3.9 |
| nltk.word_tokenize | 60 | 3.4 | 256 | 3.3 | 75 | 2.7 | 199 | 2.9 |
| mystem | 23 | 5.0 | 15 | 4.7 | 19 | 3.7 | 14 | 3.9 |
| mosestokenizer | 11 | 2.1 | 8 | 1.9 | 15 | 1.6 | 16 | 1.7 |
| segtok.word_tokenize | 16 | 2.3 | 8 | 2.3 | 14 | 1.8 | 9 | 1.8 |
| aatimofeev/spacy_russian_tokenizer | 17 | 48.7 | 4 | 51.1 | 5 | 39.5 | 20 | 52.2 |
| koziev/rutokenizer | 15 | 1.1 | 8 | 1.0 | 23 | 0.8 | 68 | 0.9 |
| razdel.tokenize | 9 | 2.9 | 9 | 2.8 | 3 | 2.0 | 16 | 2.2 |
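The regex baseline in the first row is written in shorthand. Python's standard `re` module does not support `\p` classes, so the sketch below assumes an equivalent pattern (runs of letters, runs of digits, or a single punctuation character). The exact pattern used in the benchmark is not given here, and `regex_tokenize` is a hypothetical name.

```python
import re

# Assumed stand-in for the shorthand pattern in the table:
# letters, digits, or one non-word, non-space character.
TOKEN = re.compile(r"[^\W\d_]+|\d+|[^\w\s]")


def regex_tokenize(text):
    """Naive tokenizer: return every match of TOKEN in order."""
    return TOKEN.findall(text)


print(regex_tokenize("в 5 часов"))    # ['в', '5', 'часов']
print(regex_tokenize("чуть-чуть?!"))  # ['чуть', '-', 'чуть', '?', '!']
```

The second call shows that such a pattern breaks hyphenated words and punctuation runs into single characters, which is one plausible source of its errors.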
Sentences
| model | corpora errors | corpora time, s | syntag errors | syntag time, s | gicrya errors | gicrya time, s | rnc errors | rnc time, s |
|---|---|---|---|---|---|---|---|---|
| re.split([.?!…]) | 114 | 0.9 | 53 | 0.6 | 63 | 0.7 | 130 | 1.0 |
| segtok.split_single | 106 | 17.8 | 36 | 13.4 | 1001 | 1.1 | 912 | 2.8 |
| mosestokenizer | 238 | 8.9 | 182 | 5.7 | 80 | 6.4 | 287 | 7.4 |
| nltk.sent_tokenize | 92 | 10.1 | 36 | 5.3 | 44 | 5.6 | 183 | 8.9 |
| deeppavlov/rusenttokenize | 57 | 10.9 | 10 | 7.9 | 56 | 6.8 | 119 | 7.0 |
| razdel.sentenize | 52 | 6.1 | 7 | 3.9 | 72 | 4.5 | 59 | 7.5 |
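For comparison, a punctuation-based sentence splitter in the spirit of the regex baseline can be sketched as below. This is a variant, not the benchmark's code: it uses a lookbehind so that the punctuation stays attached to each sentence, and `regex_sentenize` is a hypothetical name.

```python
import re

# Split on whitespace that follows sentence-final punctuation.
BOUNDARY = re.compile(r"(?<=[.?!…])\s+")


def regex_sentenize(text):
    """Naive sentence splitter based only on punctuation."""
    return [part for part in BOUNDARY.split(text.strip()) if part]


simple = regex_sentenize("Привет. Как дела? Хорошо!")
abbrev = regex_sentenize("Он родился в 1799 г. в Москве.")
print(simple)  # ['Привет.', 'Как дела?', 'Хорошо!']
print(abbrev)  # ['Он родился в 1799 г.', 'в Москве.']
```

The second example shows the typical failure of this approach: the period after an abbreviation is treated as a sentence end.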
Support
- Chat — https://telegram.me/natural_language_processing
- Issues — https://github.com/natasha/razdel/issues
- Commercial support — https://lab.alexkuk.ru
Development
Dev env
Test
Release
errors on
Non-trivial token tests
Update integration tests
and diff
performance