/

/

Обзор Центр заботыВойти

slovnet

Ветки: 2 Коммиты: 234 Теги: 6

..

/

scripts

ignore_id -> mask

4 года назад

Added the switch allowing to train big BERT model for custom tags on the custom dataset. Also added the separate notebook to run pretrained big BERT in order to get big synthetic dataset with the markup as the input for the 05_ner model

2 года назад

5 лет назад

Add syntax, up bert syntax

4 года назад

Added the switch allowing to train small NER model for custom tags on the custom dataset.

2 года назад

Shared navec, Model -> Model.load, change model urls

4 года назад

Shared navec, Model -> Model.load, change model urls

4 года назад

2 года назад

Added JSON config containing the S3 bucket access keys. Some code formatting fixes

2 года назад

README.md

How to run notebooks in the /slovnet/scripts:

Navigate to the repository root folder
Install the required dependencies by running these commands:

pip3 install -r requirements/dev.txt
pip3 install -e .

How to train the NER model on your custom texts and tags:

Step 1: train big BERT NER model

Specify three variables in the slovnet/scripts/02_bert_ner/main.py

If you are going to train the model on your own custom texts and tags, then:
- firstly set CUSTOM_TUNING flag to True
- then specify the list of your custom tags in the TAGS variable
- and also specify the name of the file with your custom dataset in the CUSTOM_TEXTS variable (custom-dataset.jl.gz by default)
By running the cells in the slovnet/scripts/02_bert_ner/main.ipynb
- Download the default datasets factru.jl.gz and ne5.jl.gz in order to understand the input file format
- Annotate your texts (about 1K-2K texts ~1000 symbols each) with any NER annotator, for example this online tool
- Combine annotated texts into the ".jl" file and name it according to the value specified in the CUSTOM_TEXTS variable from the step above
- Zip you ".jl" file
- Put your zipped ".jl.gz" file into the same 02_bert_ner/data/ directory
- Download the encoder (encoder.pt), tokens (vocab.txt) and embeddings (emb.pt) pretrained on hundreds of thousands of articles into 02_bert_ner/bert/ directory
- Configure and train the model
- And lastly dump the resulting model into the 02_bert_ner/model/ directory as encoder.pt and ner.pt files

Step 2: mark out big dataset with BERT NER model from step one

Prepare big dataset containing about 500K-1000K of texts ~1000 symbols each
By running the cells in the slovnet/scripts/02_bert_ner/infer.ipynb
- Download (if needed) the tokens (vocab.txt) pretrained on hundreds of thousands of articles into 02_bert_ner/bert/ directory
- Configure the model pretrained on the previous step
- Iterate through the list of your texts and feed them to the inference as the small chunks (~1000 symbols each)

Step 3: train small NER model on the big synthetic markup from step two

Specify four variables in the slovnet/scripts/05_ner/main.py

If you are going to train the model on your own custom texts and tags, then:
- firstly set CUSTOM_TUNING flag to True
- then specify the list of your custom tags in the TAGS variable (Important! These must be the same TAGS as in step one )
- then specify the name of the file with your big synthetic custom dataset (from the previous step) in the CUSTOM_TEXTS variable (big-synthetic-dataset.jl.gz by default)
- and finally specify your resulting package name by setting the ID variable (slovnet_ner_custom_tags by default)
By running the cells in the slovnet/scripts/05_ner/main.ipynb
- Download the default big synthetic dataset nerus.jl.gz in order to understand the input file format
- Put your zipped big synthetic dataset from the previous step in ".jl.gz" file into the same 05_ner/data/ directory
- Download the navec embeddings (navec_news_v1_1B_250K_300d_100q.tar) pretrained on hundreds of thousands of articles into 05_ner/navec/ directory
- Configure and train the model
- And lastly dump the resulting model into the 05_ner/model/ directory as encoder.pt, ner.pt and shape.pt files

Step 4: pack and test small NER model

By running the cells in the slovnet/scripts/05_ner/pack.ipynb

Download (if needed) the navec embeddings (navec_news_v1_1B_250K_300d_100q.tar) pretrained on hundreds of thousands of articles into 05_ner/navec/ directory
Configure the model pretrained on the previous step
Prepare and dump the resulting package - it will be 05_ner/slovnet_ner_custom_tags.tar by default
Load the package, pass the embeddings into it and then test with your piece of text

Использование cookies

Мы используем файлы cookie в соответствии с Политикой конфиденциальности и Политикой использования cookies.

Нажимая кнопку «Принимаю», Вы даете АО «СберТех» согласие на обработку Ваших персональных данных в целях совершенствования нашего веб-сайта и Сервиса GitVerse, а также повышения удобства их использования.

Запретить использование cookies Вы можете самостоятельно в настройках Вашего браузера.