corus
Links to publicly available Russian corpora + code for loading and parsing. 20+ datasets, 350Gb+ of text.
Usage
For example, let's use the dump of lenta.ru by @yutkin. Manually download the archive (link in the Reference section):
wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz
Use corus to load the data:
>>> from corus import load_lenta
>>> path = 'lenta-ru-news.csv.gz'
>>> records = load_lenta(path)
>>> next(records)
LentaRecord(
    url='https://lenta.ru/news/2018/12/14/cancer/',
    title='Названы регионы России с\xa0самой высокой смертностью от\xa0рака',
    text='Вице-премьер по социальным вопросам Татьяна Голикова рассказала, в каких регионах России зафиксирована наиболее высокая смертность от рака, сооб...',
    topic='Россия',
    tags='Общество'
)
Iterate over texts:
>>> records = load_lenta(path)
>>> for record in records:
...     text = record.text
...     ...
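Because `next(records)` works in the session above, the loader behaves as an iterator, so a sample can be inspected without reading the whole archive. The sketch below counts topics over the first few records. `fake_load_lenta`, `Record` and `topic_counts` are hypothetical stand-ins written for this example so that it runs without the archive; with corus installed, one would presumably pass `load_lenta(path)` instead.

```python
from collections import Counter, namedtuple
from itertools import islice

# Stand-in for the LentaRecord shown above (assumption: same field names).
Record = namedtuple('Record', ['url', 'title', 'text', 'topic', 'tags'])


def fake_load_lenta():
    """Hypothetical generator that mimics load_lenta(path): yields records lazily."""
    rows = [
        ('u1', 'title one', 'text one', 'Россия', 'Общество'),
        ('u2', 'title two', 'text two', 'Мир', 'Политика'),
        ('u3', 'title three', 'text three', 'Россия', 'Общество'),
    ]
    for row in rows:
        yield Record(*row)


def topic_counts(records, limit=None):
    """Count records per topic, consuming at most `limit` records (all if None)."""
    return Counter(record.topic for record in islice(records, limit))


sample = topic_counts(fake_load_lenta(), limit=2)
full = topic_counts(fake_load_lenta())
print(sample)
print(full)
```

Note that an iterator is exhausted after one pass, which is why the session above calls `load_lenta(path)` again before the loop.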
For links to other datasets and their loaders see the Reference section.
Documentation
Materials are in Russian:
Install
corus supports Python 3.5+, PyPy 3.
$ pip install corus
Reference
| Dataset | API `from corus import` | Tags | Texts | Uncompressed | Description |
|---|---|---|---|---|---|
| Lenta.ru | | | | | |
| Lenta.ru v1.0 | `load_lenta` | news | 739 351 | 1.66 Gb | `wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz` |
| Lenta.ru v1.1+ | `load_lenta2` | news | 800 975 | 1.94 Gb | `wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.1/lenta-ru-news.csv.bz2` |
| Lib.rus.ec | `load_librusec` | fiction | 301 871 | 144.92 Gb | Dump of lib.rus.ec prepared for RUSSE workshop<br>`wget http://panchenko.me/data/russe/librusec_fb2.plain.gz` |
| Rossiya Segodnya | `load_ria_raw`, `load_ria` | news | 1 003 869 | 3.70 Gb | `wget https://github.com/RossiyaSegodnya/ria_news_dataset/raw/master/ria.json.gz` |
| Mokoron Russian Twitter Corpus | `load_mokoron` | social, sentiment | 17 633 417 | 1.86 Gb | Russian Twitter sentiment markup<br>Manually download https://www.dropbox.com/s/9egqjszeicki4ho/db.sql |
| Wikipedia | `load_wiki` | | 1 541 401 | 12.94 Gb | Russian Wiki dump<br>`wget https://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2` |
| GramEval2020 | `load_gramru` | | 162 372 | 30.04 Mb | `wget https://github.com/dialogue-evaluation/GramEval2020/archive/master.zip`<br>`unzip master.zip`<br>`mv GramEval2020-master/dataTrain train`<br>`mv GramEval2020-master/dataOpenTest dev`<br>`rm -r master.zip GramEval2020-master`<br>`wget https://github.com/AlexeySorokin/GramEval2020/raw/master/data/GramEval_private_test.conllu` |
| OpenCorpora | `load_corpora` | morph | 4 030 | 20.21 Mb | `wget http://opencorpora.org/files/export/annot/annot.opcorpora.xml.zip` |
| RusVectores SimLex-965 | `load_simlex` | emb, sim | | | `wget https://rusvectores.org/static/testsets/ru_simlex965_tagged.tsv`<br>`wget https://rusvectores.org/static/testsets/ru_simlex965.tsv` |
| Omnia Russica | `load_omnia` | morph, web, fiction | | 489.62 Gb | Taiga + Wiki + Araneum. Read "Even larger Russian corpus" https://events.spbu.ru/eventsContent/events/2019/corpora/corp_sborn.pdf<br>Manually download http://bit.ly/2ZT4BY9 |
| factRuEval-2016 | `load_factru` | ner, news | 254 | 969.27 Kb | Manual PER, LOC, ORG markup prepared for 2016 Dialog competition<br>`wget https://github.com/dialogue-evaluation/factRuEval-2016/archive/master.zip`<br>`unzip master.zip`<br>`rm master.zip` |
| Gareev | `load_gareev` | ner, news | 97 | 455.02 Kb | Manual PER, ORG markup (no LOC)<br>Email Rinat Gareev (gareev-rm@yandex.ru) and ask for the dataset<br>`tar -xvf rus-ner-news-corpus.iob.tar.gz`<br>`rm rus-ner-news-corpus.iob.tar.gz` |
| Collection5 | `load_ne5` | ner, news | 1 000 | 2.96 Mb | News articles with manual PER, LOC, ORG markup<br>`wget http://www.labinform.ru/pub/named_entities/collection5.zip`<br>`unzip collection5.zip`<br>`rm collection5.zip` |
| WiNER | `load_wikiner` | ner | 203 287 | 36.15 Mb | Sentences from Wiki automatically annotated with PER, LOC, ORG tags<br>`wget https://github.com/dice-group/FOX/raw/master/input/Wikiner/aij-wikiner-ru-wp3.bz2` |
| BSNLP-2019 | `load_bsnlp` | ner | 464 | 1.16 Mb | Markup prepared for 2019 BSNLP Shared Task<br>`wget http://bsnlp.cs.helsinki.fi/TRAININGDATA_BSNLP_2019_shared_task.zip`<br>`wget http://bsnlp.cs.helsinki.fi/TESTDATA_BSNLP_2019_shared_task.zip`<br>`unzip TRAININGDATA_BSNLP_2019_shared_task.zip`<br>`unzip TESTDATA_BSNLP_2019_shared_task.zip -d test_pl_cs_ru_bg`<br>`rm TRAININGDATA_BSNLP_2019_shared_task.zip TESTDATA_BSNLP_2019_shared_task.zip` |
| Persons-1000 | `load_persons` | ner, news | 1 000 | 2.96 Mb | Same as Collection5, only PER markup + normalized names<br>`wget http://ai-center.botik.ru/Airec/ai-resources/Persons-1000.zip` |
| The Russian Drug Reaction Corpus (RuDReC) | `load_rudrec` | ner | 4 809 | 1.73 Kb | RuDReC is a new partially annotated corpus of consumer reviews in Russian about pharmaceutical products for the detection of health-related named entities and the effectiveness of pharmaceutical products. Here you can download and work with the annotated part; to get the raw part (1.4M reviews), please refer to https://github.com/cimm-kzn/RuDReC.<br>`wget https://github.com/cimm-kzn/RuDReC/raw/master/data/rudrec_annotated.json` |
| Taiga | | | | | Large collection of Russian texts from various sources: news sites, magazines, literature, social networks<br>`wget https://linghub.ru/static/Taiga/retagged_taiga.tar.gz`<br>`tar -xzvf retagged_taiga.tar.gz` |
| Arzamas | `load_taiga_arzamas` | news | 311 | 4.50 Mb | |
| Fontanka | `load_taiga_fontanka` | news | 342 683 | 786.23 Mb | |
| Interfax | `load_taiga_interfax` | news | 46 429 | 77.55 Mb | |
| KP | `load_taiga_kp` | news | 45 503 | 61.79 Mb | |
| Lenta | `load_taiga_lenta` | news | 36 446 | 95.15 Mb | |
| Taiga/N+1 | `load_taiga_nplus1` | news | 7 696 | 24.96 Mb | |
| Magazines | `load_taiga_magazines` | | 39 890 | 2.19 Gb | |
| Subtitles | `load_taiga_subtitles` | | 19 011 | 909.08 Mb | |
| Social | `load_taiga_social` | social | 1 876 442 | 648.18 Mb | |
| Proza | `load_taiga_proza` | fiction | 1 732 434 | 38.25 Gb | |
| Stihi | `load_taiga_stihi` | | 9 157 686 | 12.80 Gb | |
| Russian NLP Datasets | | | | | Several Russian news datasets from webhose.io, lenta.ru and other news sites. |
| News | `load_buriy_news` | news | 2 154 801 | 6.84 Gb | Dump of top 40 news + 20 fashion news sites.<br>`wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2014.tar.bz2`<br>`wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part1.tar.bz2`<br>`wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part2.tar.bz2` |
| Webhose | `load_buriy_webhose` | news | 285 965 | 859.32 Mb | Dump from webhose.io, 300 sources for one month.<br>`wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/webhose-2016.tar.bz2` |
| ODS #proj_news_viz | | | | | Several news sites scraped by members of the #proj_news_viz ODS project. |
| Interfax | `load_ods_interfax` | news | 543 961 | 1.22 Gb | `wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/interfax.csv.gz` |
| Gazeta | `load_ods_gazeta` | news | 865 847 | 1.63 Gb | `wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/gazeta.csv.gz` |
| Izvestia | `load_ods_izvestia` | news | 86 601 | 307.19 Mb | `wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/iz.csv.gz` |
| Meduza | `load_ods_meduza` | news | 71 806 | 270.11 Mb | `wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/meduza.csv.gz` |
| RIA | `load_ods_ria` | news | 101 543 | 233.88 Mb | `wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/ria.csv.gz` |
| Russia Today | `load_ods_rt` | news | 106 644 | 187.12 Mb | `wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/rt.csv.gz` |
| TASS | `load_ods_tass` | news | 1 135 635 | 3.27 Gb | `wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/tass-001.csv.gz` |
| Universal Dependencies | | | | | |
| GSD | `load_ud_gsd` | morph, syntax | 5 030 | 1.01 Mb | `wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-dev.conllu`<br>`wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-test.conllu`<br>`wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-train.conllu` |
| Taiga | `load_ud_taiga` | morph, syntax | 3 264 | 353.80 Kb | `wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-dev.conllu`<br>`wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-test.conllu`<br>`wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-train.conllu` |
| PUD | `load_ud_pud` | morph, syntax | 1 000 | 207.78 Kb | `wget https://github.com/UniversalDependencies/UD_Russian-PUD/raw/master/ru_pud-ud-test.conllu` |
| SynTagRus | `load_ud_syntag` | morph, syntax | 61 889 | 11.33 Mb | `wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-dev.conllu`<br>`wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-test.conllu`<br>`wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-train.conllu` |
| morphoRuEval-2017 | | | | | |
| General Internet-Corpus | `load_morphoru_gicrya` | morph | 83 148 | 10.58 Mb | `wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/GIKRYA_texts_new.zip`<br>`unzip GIKRYA_texts_new.zip`<br>`rm GIKRYA_texts_new.zip` |
| Russian National Corpus | `load_morphoru_rnc` | morph | 98 892 | 12.71 Mb | `wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/RNC_texts.rar`<br>`unrar x RNC_texts.rar`<br>`rm RNC_texts.rar` |
| OpenCorpora | `load_morphoru_corpora` | morph | 38 510 | 4.80 Mb | `wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/OpenCorpora_Texts.rar`<br>`unrar x OpenCorpora_Texts.rar`<br>`rm OpenCorpora_Texts.rar` |
| RUSSE Russian Semantic Relatedness | | | | | |
| HJ: Human Judgements of Word Pairs | `load_russe_hj` | emb, sim | | | `wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/hj.csv` |
| RT: Synonyms and Hypernyms from the Thesaurus RuThes | `load_russe_rt` | emb, sim | | | `wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/rt.csv` |
| AE: Cognitive Associations from the Sociation.org Experiment | `load_russe_ae` | emb, sim | | | `wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-train.csv`<br>`wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-test.csv`<br>`wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/ae2.csv` |
| Toloka Datasets | | | | | |
| Lexical Relations from the Wisdom of the Crowd (LRWC) | `load_toloka_lrwc` | emb, sim | | | `wget https://tlk.s3.yandex.net/dataset/LRWC.zip`<br>`unzip LRWC.zip`<br>`rm LRWC.zip` |
| The Russian Adverse Drug Reaction Corpus of Tweets (RuADReCT) | `load_ruadrect` | social | 9 515 | 2.09 Mb | This corpus was developed for the Social Media Mining for Health Applications (#SMM4H) Shared Task 2020<br>`wget https://github.com/cimm-kzn/RuDReC/raw/master/data/RuADReCT.zip`<br>`unzip RuADReCT.zip`<br>`rm RuADReCT.zip` |
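The `wget` commands in the Reference table can also be scripted from Python when several archives are needed. The following is a minimal sketch using only the standard library; `filename_from_url` and `download` are hypothetical helpers written for this example, not part of corus. The demonstration at the bottom pre-creates the target file, so it runs offline and never touches the network.

```python
import os
import tempfile
from urllib.parse import urlparse
from urllib.request import urlretrieve


def filename_from_url(url):
    """Return the last path component of a URL, e.g. 'lenta-ru-news.csv.gz'."""
    return os.path.basename(urlparse(url).path)


def download(url, directory='.'):
    """Fetch url into directory unless the file already exists; return its path."""
    path = os.path.join(directory, filename_from_url(url))
    if not os.path.exists(path):
        urlretrieve(url, path)  # plain download, no resume or retries
    return path


url = 'https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz'
name = filename_from_url(url)
print(name)

# Offline demonstration: an existing file is treated as already downloaded.
with tempfile.TemporaryDirectory() as directory:
    open(os.path.join(directory, name), 'w').close()
    cached = download(url, directory)
    print(os.path.basename(cached))
```

For the multi-gigabyte archives in the table, `wget` remains the more practical choice, since this sketch cannot resume an interrupted transfer.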
Support
- Chat — https://t.me/natural_language_processing
- Issues — https://github.com/natasha/corus/issues
- Commercial support — https://lab.alexkuk.ru
Add new source
- Implement corus/sources/<source>.py
- Add import into corus/sources/__init__.py
- Add meta into corus/sources/meta.py
- Add example into docs.ipynb (check meta table is correct)
- Run tests (readme is updated)
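As a rough illustration of the first step, a source module usually pairs a record type with a loader that yields records lazily. The sketch below is standard-library only and makes several assumptions: `ExampleRecord`, `parse_example` and `load_example` are hypothetical names, the input is a gzip-compressed CSV with a header row, and corus's own record classes and I/O helpers are deliberately not used. Check an existing module under corus/sources/ for the project's actual conventions.

```python
import csv
import gzip
import os
import tempfile
from collections import namedtuple

# Hypothetical record type; corus defines its own record classes.
ExampleRecord = namedtuple('ExampleRecord', ['url', 'title', 'text'])


def parse_example(lines):
    """Turn CSV lines (header first) into ExampleRecord objects, one at a time."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row
    for row in reader:
        yield ExampleRecord(*row)


def load_example(path):
    """Lazily yield records from a gzip-compressed CSV file."""
    with gzip.open(path, mode='rt', encoding='utf8', newline='') as file:
        yield from parse_example(file)


# Round trip on a tiny temporary archive, so the sketch runs without real data.
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, 'example.csv.gz')
    with gzip.open(path, mode='wt', encoding='utf8', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(['url', 'title', 'text'])
        writer.writerow(['https://example.org/1', 'Title', 'Body text'])
    records = list(load_example(path))

print(records[0].title)
```

Writing the loader as a generator keeps memory use flat on large dumps, which matches how `load_lenta` is consumed in the Usage section.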
Development
Dev env
python -m venv ~/.venvs/natasha-corus
source ~/.venvs/natasha-corus/bin/activate
pip install -r requirements/dev.txt
pip install -e .
python -m ipykernel install --user --name natasha-corus
Lint + update docs
make lint
make exec-docs
Release
# Update setup.py version
git commit -am 'Up version'
git tag v0.10.0
git push
git push --tags