tokenizers


πŸ’₯ Fast State-of-the-Art Tokenizers optimized for Research and Production

Π―Π·Ρ‹ΠΊΠΈ

  • Rust70,5%
  • Python20,3%
  • Jupyter Notebook4,9%
  • TypeScript2,5%
  • JavaScript1,2%
  • CSS0,3%
  • ΠžΡΡ‚Π°Π»ΡŒΠ½Ρ‹Π΅0,3%
2 Π³ΠΎΠ΄Π° Π½Π°Π·Π°Π΄
2 Π³ΠΎΠ΄Π° Π½Π°Π·Π°Π΄
3 Π³ΠΎΠ΄Π° Π½Π°Π·Π°Π΄
3 Π³ΠΎΠ΄Π° Π½Π°Π·Π°Π΄
3 Π³ΠΎΠ΄Π° Π½Π°Π·Π°Π΄
6 Π»Π΅Ρ‚ Π½Π°Π·Π°Π΄
2 Π³ΠΎΠ΄Π° Π½Π°Π·Π°Π΄
README.md



Build GitHub

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Main features:

  • Train new vocabularies and tokenize, using today's most used tokenizers.
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: Truncate, Pad, add the special tokens your model needs.

Bindings

We provide bindings to the following languages (more to come!):
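  • Rust (original implementation)
  • Python
  • Node.js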

Quick example using Python:

Choose your model between Byte-Pair Encoding, WordPiece or Unigram and instantiate a tokenizer:
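For instance, a BPE model can be wrapped in a Tokenizer like this (a minimal sketch using the tokenizers Python API; the `unk_token` value is just an illustrative choice):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# Instantiate a tokenizer backed by a (still untrained) BPE model.
# "[UNK]" is the token used for out-of-vocabulary content.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
```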

You can customize how pre-tokenization (e.g., splitting into words) is done:
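For example, splitting on whitespace and punctuation can be configured with the Whitespace pre-tokenizer (a sketch continuing the tokenizer created above):

```python
from tokenizers.pre_tokenizers import Whitespace

# Split the raw input on whitespace and punctuation before the model runs.
tokenizer.pre_tokenizer = Whitespace()
```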

Then training your tokenizer on a set of files just takes two lines of code:
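Roughly, using a BpeTrainer to hold the training options (the special tokens and file names below are placeholders; point them at your own corpus):

```python
from tokenizers.trainers import BpeTrainer

# The trainer carries the training options, e.g. which special tokens to add.
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

# Train on a list of plain-text files (placeholder file names).
tokenizer.train(files=["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer=trainer)
```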

Once your tokenizer is trained, encode any text with just one line:
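Something like the following, where the returned Encoding exposes the tokens, ids, offsets and more:

```python
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")

# Inspect the resulting Encoding object.
print(output.tokens)
print(output.ids)
```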

Check the documentation or the quicktour to learn more!