naeval


Naeval compares the quality and performance of NLP systems for the Russian language. It is used to evaluate the components of the Natasha project: Razdel, Navec, and Slovnet.

Install

Naeval supports Python 3.7+

$ pip install naeval

Documentation

Materials are in Russian:

Models

| Model | Tags | Description |
|---|---|---|
| DeepPavlov NER | ner | BiLSTM-CRF NER trained on Collection5. Original repo, docs, paper |
| DeepPavlov BERT NER | ner | Current SOTA for Russian language. Docs, video |
| DeepPavlov Slavic BERT NER | ner | DeepPavlov solution for BSNLP-2019. Paper |
| DeepPavlov Morph | morph | Docs |
| DeepPavlov BERT Morph | morph | Docs |
| DeepPavlov BERT Syntax | syntax | BERT + biaffine head. Docs |
| Slovnet NER | ner | |
| Slovnet BERT NER | ner | |
| Slovnet Morph | morph | |
| Slovnet BERT Morph | morph | |
| Slovnet Syntax | syntax | |
| Slovnet BERT Syntax | syntax | |
| PullEnti | ner, morph | First place at factRuEval-2016, a highly sophisticated rule-based system |
| Stanza | ner, morph, syntax | Tool by Stanford NLP released in 2020. Paper |
| SpaCy | token, sent, ner, morph, syntax | Uses Russian models trained by @buriy |
| Texterra | morph, syntax, ner, token, sent | Multifunctional NLP solution by ISP RAS |
| Tomita | ner | GLR parser by Yandex; only the implementation for person names is publicly available |
| MITIE | ner | Engine developed at MIT + third-party model for Russian language |
| RuPosTagger | morph | CRF tagger, part of the Solarix project |
| RNNMorph | morph | First-place solution at morphoRuEval-2017. Post on Habr |
| Maru | morph | |
| UDPipe | morph, syntax | Model trained on SynTagRus |
| NLTK | token, sent | Multifunctional library, provides a model for Russian text segmentation. Docs |
| MyStem | token, morph | Wrapper for Yandex morphological analyzers |
| Moses | token, sent | Wrapper for Perl Moses utils |
| SegTok | token, sent | |
| RuTokenizer | token | |
| Razdel | token, sent | |
| Spacy Russian Tokenizer | token, sent | Spacy segmentation pipeline for Russian texts by @aatimofeev |
| RuSentTokenizer | sent | DeepPavlov sentence segmentation |

Tokenization

See the Razdel evaluation section for more info.

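| | corpora errors | corpora time | syntag errors | syntag time | gicrya errors | gicrya time | rnc errors | rnc time |
|---|---|---|---|---|---|---|---|---|
| `re.findall(\w+\|\d+\|\p+)` | 24 | 0.5 | 16 | 0.5 | 19 | 0.4 | 60 | 0.4 |
| spacy | 26 | 6.2 | 13 | 5.8 | 14 | 4.1 | 32 | 3.9 |
| nltk.word_tokenize | 60 | 3.4 | 256 | 3.3 | 75 | 2.7 | 199 | 2.9 |
| mystem | 23 | 5.0 | 15 | 4.7 | 19 | 3.7 | 14 | 3.9 |
| mosestokenizer | 11 | 2.1 | 8 | 1.9 | 15 | 1.6 | 16 | 1.7 |
| segtok.word_tokenize | 16 | 2.3 | 8 | 2.3 | 14 | 1.8 | 9 | 1.8 |
| aatimofeev/spacy_russian_tokenizer | 17 | 48.7 | 4 | 51.1 | 5 | 39.5 | 20 | 52.2 |
| koziev/rutokenizer | 15 | 1.1 | 8 | 1.0 | 23 | 0.8 | 68 | 0.9 |
| razdel.tokenize | 9 | 2.9 | 9 | 2.8 | 3 | 2.0 | 16 | 2.2 |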
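
The first row of the table is a plain regular-expression baseline. A minimal sketch of such a tokenizer is shown below. Python's built-in `re` module has no `\p` class, so punctuation is approximated here with `[^\w\s]+`; that substitution and the name `regex_tokenize` are assumptions made for illustration, not necessarily what the benchmark runs.

```python
import re

# Assumed stand-in for the `\w+|\d+|\p+` baseline. `\d+` is kept to mirror
# the table row, although `\w+` already covers digits in Python.
TOKEN = re.compile(r'\w+|\d+|[^\w\s]+')


def regex_tokenize(text):
    """Return word, number and punctuation tokens in order of appearance."""
    return TOKEN.findall(text)


print(regex_tokenize('Привет, мир!'))
```

A pattern like this has no notion of abbreviations, hyphenated words or decimal numbers, which is the kind of case a dedicated tokenizer is expected to handle better.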

Sentence segmentation

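| | corpora errors | corpora time | syntag errors | syntag time | gicrya errors | gicrya time | rnc errors | rnc time |
|---|---|---|---|---|---|---|---|---|
| `re.split([.?!…])` | 114 | 0.9 | 53 | 0.6 | 63 | 0.7 | 130 | 1.0 |
| segtok.split_single | 106 | 17.8 | 36 | 13.4 | 100 | 11.1 | 91 | 22.8 |
| mosestokenizer | 238 | 8.9 | 182 | 5.7 | 80 | 6.4 | 287 | 7.4 |
| nltk.sent_tokenize | 92 | 10.1 | 36 | 5.3 | 44 | 5.6 | 183 | 8.9 |
| deeppavlov/rusenttokenize | 57 | 10.9 | 10 | 7.9 | 56 | 6.8 | 119 | 7.0 |
| razdel.sentenize | 52 | 6.1 | 7 | 3.9 | 72 | 4.5 | 59 | 7.5 |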
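
The `re.split([.?!…])` row is the naive baseline: cut the text at every sentence-final punctuation mark. The sketch below is one possible reading of that row using only the standard library; the function name `regex_sentenize` and the whitespace stripping are assumptions added for illustration.

```python
import re


def regex_sentenize(text):
    """Split on '.', '?', '!' and '…', dropping the delimiters and empty parts."""
    parts = re.split(r'[.?!…]', text)
    return [part.strip() for part in parts if part.strip()]


# The abbreviation "г." (year) triggers a false split in the first sentence.
print(regex_sentenize('Он родился в 1799 г. в Москве. Это известно всем!'))
```

The example input contains two sentences, but the baseline returns three pieces because it treats the period in the abbreviation as a boundary.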

Pretrained embeddings

See the Navec evaluation section for more info.

| | type | init, s | get, µs | disk, MB | ram, MB | vocab |
|---|---|---|---|---|---|---|
| hudlit_12B_500K_300d_100q | navec | 1.1 | 21.6 | 50.6 | 95.3 | 500K |
| news_1B_250K_300d_100q | navec | 0.8 | 20.7 | 25.4 | 47.7 | 250K |
| ruscorpora_upos_cbow_300_20_2019 | w2v | 3.3 | 1.4 | 220.6 | 236.1 | 189K |
| ruwikiruscorpora_upos_skipgram_300_2_2019 | w2v | 5.0 | 1.5 | 290.0 | 309.4 | 248K |
| tayga_upos_skipgram_300_2_2019 | w2v | 5.2 | 1.4 | 290.7 | 310.9 | 249K |
| tayga_none_fasttextcbow_300_10_2019 | fasttext | 8.0 | 13.4 | 2741.9 | 2746.9 | 192K |
| araneum_none_fasttextcbow_300_5_2018 | fasttext | 16.4 | 10.6 | 2752.1 | 2754.7 | 195K |

| | type | simlex | hj | rt | ae | ae2 | lrwc |
|---|---|---|---|---|---|---|---|
| hudlit_12B_500K_300d_100q | navec | 0.310 | 0.707 | 0.842 | 0.931 | 0.923 | 0.604 |
| news_1B_250K_300d_100q | navec | 0.230 | 0.590 | 0.784 | 0.866 | 0.861 | 0.589 |
| ruscorpora_upos_cbow_300_20_2019 | w2v | 0.359 | 0.685 | 0.852 | 0.758 | 0.896 | 0.602 |
| ruwikiruscorpora_upos_skipgram_300_2_2019 | w2v | 0.321 | 0.723 | 0.817 | 0.801 | 0.860 | 0.629 |
| tayga_upos_skipgram_300_2_2019 | w2v | 0.429 | 0.749 | 0.871 | 0.771 | 0.899 | 0.639 |
| tayga_none_fasttextcbow_300_10_2019 | fasttext | 0.369 | 0.639 | 0.793 | 0.682 | 0.813 | 0.536 |
| araneum_none_fasttextcbow_300_5_2018 | fasttext | 0.349 | 0.671 | 0.801 | 0.706 | 0.793 | 0.579 |

Morphology taggers

See Slovnet evaluation section for more info.

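| | news | wiki | fiction | social | poetry |
|---|---|---|---|---|---|
| slovnet | 0.961 | 0.815 | 0.905 | 0.807 | 0.664 |
| slovnet_bert | 0.982 | 0.884 | 0.990 | 0.890 | 0.856 |
| deeppavlov | 0.940 | 0.841 | 0.944 | 0.870 | 0.857 |
| deeppavlov_bert | 0.951 | 0.868 | 0.964 | 0.892 | 0.865 |
| udpipe | 0.918 | 0.811 | 0.957 | 0.870 | 0.776 |
| spacy | 0.964 | 0.849 | 0.942 | 0.857 | 0.784 |
| stanza | 0.934 | 0.831 | 0.940 | 0.873 | 0.825 |
| rnnmorph | 0.896 | 0.812 | 0.890 | 0.860 | 0.838 |
| maru | 0.894 | 0.808 | 0.887 | 0.861 | 0.840 |
| rupostagger | 0.673 | 0.645 | 0.661 | 0.641 | 0.636 |

| | init, s | disk, MB | ram, MB | speed, it/s |
|---|---|---|---|---|
| slovnet | 1.0 | 27 | 115 | 532.0 |
| slovnet_bert | 5.0 | 475 | 8087 | 285.0 (gpu) |
| deeppavlov | 4.0 | 32 | 10240 | 90.0 (gpu) |
| deeppavlov_bert | 20.0 | 1393 | 8704 | 85.0 (gpu) |
| udpipe | 6.9 | 45 | 242 | 56.2 |
| spacy | 8.0 | 140 | 579 | 50.0 |
| stanza | 2.0 | 591 | 393 | 92.0 |
| rnnmorph | 8.7 | 10 | 289 | 16.6 |
| maru | 15.8 | 44 | 370 | 36.4 |
| rupostagger | 4.8 | 3 | 118 | 48.0 |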
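
The `init, s` and `speed, it/s` columns can be read as "seconds to load the model" and "items processed per second". The sketch below shows one way such numbers could be collected. The `benchmark` helper and its `load` and `process` callables are hypothetical and are not part of the Naeval API, and what counts as one item is left to the caller.

```python
import time


def benchmark(load, process, items):
    """Return (init_seconds, items_per_second).

    `load` builds the model and `process` handles a single item; both are
    hypothetical callables supplied by the caller.
    """
    start = time.perf_counter()
    model = load()
    init = time.perf_counter() - start

    start = time.perf_counter()
    for item in items:
        process(model, item)
    elapsed = time.perf_counter() - start

    # Guard against a zero reading on very coarse timers.
    speed = len(items) / elapsed if elapsed > 0 else float('inf')
    return init, speed


# Toy stand-ins so the sketch runs without any NLP library installed.
init, speed = benchmark(
    load=lambda: str.split,
    process=lambda model, text: model(text),
    items=['мама мыла раму'] * 1000,
)
print(f'init: {init:.4f} s, speed: {speed:.1f} it/s')
```

Timing initialization separately from processing keeps a slow model load from distorting the throughput figure.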

Syntax parser

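| | news uas | news las | wiki uas | wiki las | fiction uas | fiction las | social uas | social las | poetry uas | poetry las |
|---|---|---|---|---|---|---|---|---|---|---|
| slovnet | 0.907 | 0.880 | 0.775 | 0.718 | 0.806 | 0.776 | 0.726 | 0.656 | 0.542 | 0.469 |
| slovnet_bert | 0.965 | 0.936 | 0.891 | 0.828 | 0.958 | 0.940 | 0.846 | 0.782 | 0.776 | 0.706 |
| deeppavlov_bert | 0.962 | 0.910 | 0.882 | 0.786 | 0.963 | 0.929 | 0.844 | 0.761 | 0.784 | 0.691 |
| udpipe | 0.873 | 0.823 | 0.622 | 0.531 | 0.910 | 0.876 | 0.700 | 0.624 | 0.625 | 0.534 |
| spacy | 0.943 | 0.916 | 0.851 | 0.783 | 0.901 | 0.874 | 0.804 | 0.737 | 0.704 | 0.616 |
| stanza | 0.940 | 0.886 | 0.815 | 0.716 | 0.936 | 0.895 | 0.802 | 0.714 | 0.713 | 0.613 |

| | init, s | disk, MB | ram, MB | speed, it/s |
|---|---|---|---|---|
| slovnet | 1.0 | 27 | 125 | 450.0 |
| slovnet_bert | 5.0 | 504 | 3427 | 200.0 (gpu) |
| deeppavlov_bert | 34.0 | 1427 | 8704 | 75.0 (gpu) |
| udpipe | 6.9 | 45 | 242 | 56.2 |
| spacy | 9.0 | 140 | 579 | 41.0 |
| stanza | 3.0 | 591 | 890 | 12.0 |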
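
The `uas` and `las` columns are the standard dependency parsing scores: unlabeled attachment score counts tokens whose predicted head is correct, and labeled attachment score additionally requires the correct relation label. The sketch below computes both for one sentence. The `(head, relation)` pair representation and the function name `uas_las` are assumptions for illustration; the README does not state how the benchmark treats punctuation or aggregates over sentences.

```python
def uas_las(gold, pred):
    """Return (UAS, LAS) for one sentence.

    `gold` and `pred` are equal-length lists of (head_index, relation)
    pairs, one pair per token.
    """
    if len(gold) != len(pred):
        raise ValueError('gold and pred must describe the same tokens')
    total = len(gold)
    if total == 0:
        return 0.0, 0.0
    # UAS: head matches. LAS: head and relation both match.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    full_hits = sum(g == p for g, p in zip(gold, pred))
    return head_hits / total, full_hits / total


gold = [(2, 'nsubj'), (0, 'root'), (2, 'obj')]
pred = [(2, 'nsubj'), (0, 'root'), (2, 'nmod')]
uas, las = uas_las(gold, pred)
print(f'uas={uas:.3f} las={las:.3f}')
```

In this example every head is correct but one label is wrong, so UAS is higher than LAS, matching the pattern in every row of the table.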

NER

See the Slovnet evaluation section for more info.

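| f1 | factru PER | factru LOC | factru ORG | gareev PER | gareev ORG | ne5 PER | ne5 LOC | ne5 ORG | bsnlp PER | bsnlp LOC | bsnlp ORG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| slovnet | 0.959 | 0.915 | 0.825 | 0.977 | 0.899 | 0.984 | 0.973 | 0.951 | 0.944 | 0.834 | 0.718 |
| slovnet_bert | 0.973 | 0.928 | 0.831 | 0.991 | 0.911 | 0.996 | 0.989 | 0.976 | 0.960 | 0.838 | 0.733 |
| deeppavlov | 0.910 | 0.886 | 0.742 | 0.944 | 0.798 | 0.942 | 0.919 | 0.881 | 0.866 | 0.767 | 0.624 |
| deeppavlov_bert | 0.971 | 0.928 | 0.825 | 0.980 | 0.916 | 0.997 | 0.990 | 0.976 | 0.954 | 0.840 | 0.741 |
| deeppavlov_slavic | 0.956 | 0.884 | 0.714 | 0.976 | 0.776 | 0.984 | 0.817 | 0.761 | 0.965 | 0.925 | 0.831 |
| pullenti | 0.905 | 0.814 | 0.686 | 0.939 | 0.639 | 0.952 | 0.862 | 0.683 | 0.900 | 0.769 | 0.566 |
| spacy | 0.901 | 0.886 | 0.765 | 0.970 | 0.883 | 0.967 | 0.928 | 0.918 | 0.919 | 0.823 | 0.693 |
| stanza | 0.943 | 0.865 | 0.687 | 0.953 | 0.827 | 0.923 | 0.753 | 0.734 | 0.938 | 0.838 | 0.724 |
| texterra | 0.900 | 0.800 | 0.597 | 0.888 | 0.561 | 0.901 | 0.777 | 0.594 | 0.858 | 0.783 | 0.548 |
| tomita | 0.929 | | | 0.921 | | 0.945 | | | 0.881 | | |
| mitie | 0.888 | 0.861 | 0.532 | 0.849 | 0.452 | 0.753 | 0.642 | 0.432 | 0.736 | 0.801 | 0.524 |

| | init, s | disk, MB | ram, MB | speed, it/s |
|---|---|---|---|---|
| slovnet | 1.0 | 27 | 205 | 25.3 |
| slovnet_bert | 5.0 | 473 | 9500 | 40.0 (gpu) |
| deeppavlov | 5.9 | 1024 | 3072 | 24.3 (gpu) |
| deeppavlov_bert | 34.5 | 2048 | 6144 | 13.1 (gpu) |
| deeppavlov_slavic | 35.0 | 2048 | 4096 | 8.0 (gpu) |
| pullenti | 2.9 | 16 | 253 | 6.0 |
| spacy | 8.0 | 140 | 625 | 8.0 |
| stanza | 3.0 | 591 | 11264 | 3.0 (gpu) |
| texterra | 47.6 | 193 | 3379 | 4.0 |
| tomita | 2.0 | 64 | 63 | 29.8 |
| mitie | 28.3 | 327 | 261 | 32.8 |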
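
The quality table reports F1 separately for each entity type (PER, LOC, ORG). A minimal sketch of a per-type F1 over entity spans is given below. It assumes exact-match scoring on `(start, stop, type)` triples; that matching rule, the span representation and the function name `f1_by_type` are illustrative assumptions, since the README does not spell out the scoring procedure.

```python
def f1_by_type(gold, pred):
    """Return {entity_type: F1} for two sets of (start, stop, type) spans.

    A predicted span counts as correct only if its boundaries and type
    both match a gold span exactly (an assumption of this sketch).
    """
    scores = {}
    types = {span[2] for span in gold | pred}
    for type_ in sorted(types):
        gold_spans = {span for span in gold if span[2] == type_}
        pred_spans = {span for span in pred if span[2] == type_}
        correct = len(gold_spans & pred_spans)
        precision = correct / len(pred_spans) if pred_spans else 0.0
        recall = correct / len(gold_spans) if gold_spans else 0.0
        total = precision + recall
        scores[type_] = 2 * precision * recall / total if total else 0.0
    return scores


gold = {(0, 5, 'PER'), (10, 16, 'LOC'), (20, 27, 'ORG')}
pred = {(0, 5, 'PER'), (10, 16, 'ORG'), (20, 27, 'ORG')}
scores = f1_by_type(gold, pred)
print(scores)
```

Here the location is mislabeled as an organization, which costs LOC its only hit and lowers ORG precision, while PER is unaffected.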

Support

Development

Dev env

python -m venv ~/.venvs/natasha-naeval
source ~/.venvs/natasha-naeval/bin/activate
pip install -r requirements/dev.txt
pip install -e .
python -m ipykernel install --user --name natasha-naeval

Lint

make lint
