tokenizers
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.
Main features:
- Train new vocabularies and tokenize, using today's most used tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
- Does all the pre-processing: truncate, pad, and add the special tokens your model needs (see the sketch after this list).
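For instance, a minimal sketch of that pre-processing step, assuming a tokenizer has already been trained and saved (the file name `tokenizer.json` and the `[PAD]` token are placeholders from that assumed training setup):

```python
from tokenizers import Tokenizer

# Assumed: a tokenizer previously trained and saved with tokenizer.save("tokenizer.json")
tokenizer = Tokenizer.from_file("tokenizer.json")

# Truncate every encoding to at most 512 tokens.
tokenizer.enable_truncation(max_length=512)

# Pad each batch to the length of its longest sequence, using the [PAD] token
# (assumed to be part of the trained vocabulary).
tokenizer.enable_padding(pad_token="[PAD]", pad_id=tokenizer.token_to_id("[PAD]"))

outputs = tokenizer.encode_batch(["A short sentence.", "And a slightly longer one."])
print(outputs[0].ids)             # token ids, padded to the batch's longest sequence
print(outputs[0].attention_mask)  # 1 for real tokens, 0 for padding
```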
Bindings
We provide bindings to the following languages (more to come!):
- Rust (the original implementation)
- Python
- Node.js
- Ruby (contributed by @ankane, external repo)
Quick example using Python:
Choose your model between Byte-Pair Encoding, WordPiece or Unigram and instantiate a tokenizer:
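For instance, a minimal sketch with a BPE model (an empty, untrained model here; WordPiece or Unigram can be plugged in the same way):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# Instantiate a tokenizer around an (untrained) Byte-Pair Encoding model,
# using [UNK] as the unknown token.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
```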
You can customize how pre-tokenization (e.g., splitting into words) is done:
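For example, splitting on whitespace and punctuation with the built-in Whitespace pre-tokenizer:

```python
from tokenizers.pre_tokenizers import Whitespace

# Split raw input into words and punctuation before the model sees it.
tokenizer.pre_tokenizer = Whitespace()
```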
Then training your tokenizer on a set of files takes just two lines of code:
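Along these lines (the file names below are placeholders for your own training corpus):

```python
from tokenizers.trainers import BpeTrainer

# Reserve the special tokens so they end up in the vocabulary.
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
# Placeholder file names: point this at your own text files.
tokenizer.train(files=["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer=trainer)
```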
Once your tokenizer is trained, encode any text with just one line:
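For example (the exact tokens depend on the vocabulary you trained):

```python
output = tokenizer.encode("Hello, y'all! How are you?")

print(output.tokens)   # the produced tokens, e.g. ["Hello", ",", "y", "'", "all", ...]
print(output.ids)      # the corresponding token ids
print(output.offsets)  # (start, end) character offsets into the original text, per token
```

The offsets are what make the alignment tracking mentioned above possible: every token maps back to a span of the original string.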
Check the documentation or the quicktour to learn more!