tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Main features:

  • Train new vocabularies and tokenize, using today's most used tokenizers.
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking: it is always possible to recover the part of the original sentence that corresponds to a given token.
  • Handles all the pre-processing: truncation, padding, and adding the special tokens your model needs.
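
To make the pre-processing bullet concrete, here is a minimal pure-Python sketch of the three steps it names: adding special tokens, truncating, and padding. It does not use the library. The helper name `post_process` and the BERT-style `[CLS] ... [SEP]` layout are assumptions chosen for illustration, not the library's API.

```python
def post_process(tokens, max_length, cls="[CLS]", sep="[SEP]", pad="[PAD]"):
    """Wrap tokens in special tokens, then truncate or pad to max_length.

    Hypothetical helper for illustration; returns (tokens, attention_mask).
    """
    if max_length < 2:
        raise ValueError("max_length must leave room for the two special tokens")
    # Truncate the content so that the two special tokens still fit.
    body = tokens[: max_length - 2]
    out = [cls] + body + [sep]
    # Real tokens get mask 1, padding gets mask 0.
    mask = [1] * len(out)
    n_pad = max_length - len(out)
    return out + [pad] * n_pad, mask + [0] * n_pad


padded, mask = post_process(["Hello", ",", "world"], max_length=8)
print(padded)  # ['[CLS]', 'Hello', ',', 'world', '[SEP]', '[PAD]', '[PAD]', '[PAD]']
print(mask)    # [1, 1, 1, 1, 1, 0, 0, 0]
```

In the Python bindings, the corresponding behaviour is configured on the tokenizer itself (for example with `enable_truncation` and `enable_padding`) rather than written by hand as above.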

Bindings

We provide bindings to the following languages (more to come!):

Quick example using Python:

Choose your model between Byte-Pair Encoding, WordPiece or Unigram and instantiate a tokenizer:

from tokenizers import Tokenizer
from tokenizers.models import BPE

# Set unk_token so that characters missing from the vocabulary map to "[UNK]"
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

You can customize how pre-tokenization (e.g., splitting into words) is done:

from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()
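
To show what this pre-tokenization step does, here is a small standalone sketch using only Python's `re` module. The `Whitespace` pre-tokenizer is documented as splitting on the pattern `\w+|[^\w\s]+`; the helper name `whitespace_pre_tokenize` is hypothetical, and Python's notion of `\w` may differ from the Rust regex engine on some Unicode characters, so treat this as an approximation.

```python
import re

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace (punctuation, symbols).
PATTERN = re.compile(r"\w+|[^\w\s]+")


def whitespace_pre_tokenize(text):
    """Return (piece, (start, end)) pairs with offsets into the original text."""
    return [(m.group(), m.span()) for m in PATTERN.finditer(text)]


for piece, span in whitespace_pre_tokenize("Hello, y'all!"):
    print(piece, span)
# Hello (0, 5)
# , (5, 6)
# y (7, 8)
# ' (8, 9)
# all (9, 12)
# ! (12, 13)
```

Keeping the offsets alongside each piece is what later allows a token to be traced back to its position in the input.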

Then training your tokenizer on a set of files takes just two lines of code:

from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer=trainer)
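
For readers who want to see what BPE training does conceptually, the following is a toy pure-Python sketch of the merge-learning loop: count adjacent symbol pairs, merge the most frequent pair, repeat. It is not the library's Rust implementation. The function name `learn_bpe_merges` and the lexicographic tie-breaking rule are assumptions made for this illustration, and it ignores special tokens and vocabulary-size limits.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges BPE merge rules from a list of words."""
    # Each distinct word is a tuple of symbols, stored with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair; ties go to the lexicographically smallest pair.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        # Rewrite every word, replacing occurrences of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
        merges.append(best)
    return merges


words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
print(learn_bpe_merges(words, num_merges=3))
# [('e', 's'), ('es', 't'), ('l', 'o')]
```

The learned merges are applied in the same order at encoding time, which is why frequent substrings such as `est` end up as single tokens.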

Once your tokenizer is trained, encode any text with just one line:

output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
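
A token such as `[UNK]` no longer shows which characters it replaced, and this is where alignment tracking helps. In the Python bindings the returned encoding exposes `output.offsets` for this purpose. The sketch below illustrates the idea without the library: it normalizes text (lowercasing and stripping accents) while recording which original character each normalized character came from. The helper names are hypothetical, and the sketch assumes precomposed input characters; it is a simplification, not how the library implements alignment.

```python
import re
import unicodedata


def normalize_with_alignment(text):
    """Lowercase and strip accents, recording each output char's source index."""
    chars, alignment = [], []
    for i, ch in enumerate(text):
        for piece in unicodedata.normalize("NFD", ch):
            if unicodedata.combining(piece):
                continue  # drop combining accent marks
            for low in piece.lower():
                chars.append(low)
                alignment.append(i)
    return "".join(chars), alignment


def tokens_with_original_spans(text):
    """Split the normalized text and map each piece back to the original."""
    normalized, alignment = normalize_with_alignment(text)
    result = []
    for m in re.finditer(r"\w+|[^\w\s]+", normalized):
        start = alignment[m.start()]
        end = alignment[m.end() - 1] + 1
        result.append((m.group(), (start, end)))
    return result


text = "H\u00e9llo, W\u00f6rld!"  # "Héllo, Wörld!" with precomposed accents
for token, (start, end) in tokens_with_original_spans(text):
    print(token, "->", repr(text[start:end]))
# hello -> 'Héllo'
# , -> ','
# world -> 'Wörld'
# ! -> '!'
```

Each token is produced from the normalized string, yet the recorded span still slices the untouched original text.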

Check the documentation or the quicktour to learn more!
