1{
2"cells": [
3{
4"attachments": {},
5"cell_type": "markdown",
6"metadata": {},
7"source": [
8"# How to count tokens with tiktoken\n",
9"\n",
10"[`tiktoken`](https://github.com/openai/tiktoken/blob/main/README.md) is a fast open-source tokenizer by OpenAI.\n",
11"\n",
12"Given a text string (e.g., `\"tiktoken is great!\"`) and an encoding (e.g., `\"cl100k_base\"`), a tokenizer can split the text string into a list of tokens (e.g., `[\"t\", \"ik\", \"token\", \" is\", \" great\", \"!\"]`).\n",
13"\n",
14"Splitting text strings into tokens is useful because GPT models see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token).\n",
15"\n",
16"\n",
17"## Encodings\n",
18"\n",
19"Encodings specify how text is converted into tokens. Different models use different encodings.\n",
20"\n",
21"`tiktoken` supports three encodings used by OpenAI models:\n",
22"\n",
23"| Encoding name | OpenAI models |\n",
24"|-------------------------|-----------------------------------------------------|\n",
25"| `cl100k_base` | `gpt-4`, `gpt-3.5-turbo`, `text-embedding-ada-002`, `text-embedding-3-small`, `text-embedding-3-large` |\n",
26"| `p50k_base` | Codex models, `text-davinci-002`, `text-davinci-003`|\n",
27"| `r50k_base` (or `gpt2`) | GPT-3 models like `davinci` |\n",
28"\n",
29"You can retrieve the encoding for a model using `tiktoken.encoding_for_model()` as follows:\n",
30"```python\n",
31"encoding = tiktoken.encoding_for_model('gpt-3.5-turbo')\n",
32"```\n",
33"\n",
34"Note that `p50k_base` overlaps substantially with `r50k_base`, and for non-code applications, they will usually give the same tokens.\n",
35"\n",
36"## Tokenizer libraries by language\n",
37"\n",
38"For `cl100k_base` and `p50k_base` encodings:\n",
39"- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md)\n",
40"- .NET / C#: [SharpToken](https://github.com/dmitry-brazhenko/SharpToken), [TiktokenSharp](https://github.com/aiqinxuancai/TiktokenSharp)\n",
41"- Java: [jtokkit](https://github.com/knuddelsgmbh/jtokkit)\n",
42"- Golang: [tiktoken-go](https://github.com/pkoukk/tiktoken-go)\n",
43"- Rust: [tiktoken-rs](https://github.com/zurawiki/tiktoken-rs)\n",
44"\n",
45"For `r50k_base` (`gpt2`) encodings, tokenizers are available in many languages:\n",
46"- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md) (or alternatively [GPT2TokenizerFast](https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2TokenizerFast))\n",
47"- JavaScript: [gpt-3-encoder](https://www.npmjs.com/package/gpt-3-encoder)\n",
48"- .NET / C#: [GPT Tokenizer](https://github.com/dluc/openai-tools)\n",
49"- Java: [gpt2-tokenizer-java](https://github.com/hyunwoongko/gpt2-tokenizer-java)\n",
50"- PHP: [GPT-3-Encoder-PHP](https://github.com/CodeRevolutionPlugins/GPT-3-Encoder-PHP)\n",
51"- Golang: [tiktoken-go](https://github.com/pkoukk/tiktoken-go)\n",
52"- Rust: [tiktoken-rs](https://github.com/zurawiki/tiktoken-rs)\n",
53"\n",
54"(OpenAI makes no endorsements or guarantees of third-party libraries.)\n",
55"\n",
56"\n",
57"## How strings are typically tokenized\n",
58"\n",
59"In English, tokens commonly range in length from one character to one word (e.g., `\"t\"` or `\" great\"`), though in some languages tokens can be shorter than one character or longer than one word. Spaces are usually grouped with the starts of words (e.g., `\" is\"` instead of `\"is \"` or `\" \"`+`\"is\"`). You can quickly check how a string is tokenized at the [OpenAI Tokenizer](https://platform.openai.com/tokenizer), or the third-party [Tiktokenizer](https://tiktokenizer.vercel.app/) webapp."
60]
61},
62{
63"attachments": {},
64"cell_type": "markdown",
65"metadata": {},
66"source": [
67"## 0. Install `tiktoken`\n",
68"\n",
69"If needed, install `tiktoken` with `pip`:"
70]
71},
72{
73"cell_type": "code",
74"execution_count": null,
75"metadata": {},
76"outputs": [],
77"source": [
78"%pip install --upgrade tiktoken\n",
79"%pip install --upgrade openai"
80]
81},
82{
83"attachments": {},
84"cell_type": "markdown",
85"metadata": {},
86"source": [
87"## 1. Import `tiktoken`"
88]
89},
90{
91"cell_type": "code",
92"execution_count": 1,
93"metadata": {},
94"outputs": [],
95"source": [
96"import tiktoken"
97]
98},
99{
100"attachments": {},
101"cell_type": "markdown",
102"metadata": {},
103"source": [
104"## 2. Load an encoding\n",
105"\n",
106"Use `tiktoken.get_encoding()` to load an encoding by name.\n",
107"\n",
108"The first time this runs, it will require an internet connection to download the encoding data. Later runs won't need an internet connection, as the data is cached locally."
109]
110},
111{
112"cell_type": "code",
113"execution_count": 3,
114"metadata": {},
115"outputs": [],
116"source": [
117"encoding = tiktoken.get_encoding(\"cl100k_base\")\n"
118]
119},
120{
121"attachments": {},
122"cell_type": "markdown",
123"metadata": {},
124"source": [
125"Use `tiktoken.encoding_for_model()` to automatically load the correct encoding for a given model name."
126]
127},
128{
129"cell_type": "code",
130"execution_count": 4,
131"metadata": {},
132"outputs": [],
133"source": [
134"encoding = tiktoken.encoding_for_model(\"gpt-3.5-turbo\")"
135]
136},
137{
138"attachments": {},
139"cell_type": "markdown",
140"metadata": {},
141"source": [
142"## 3. Turn text into tokens with `encoding.encode()`\n",
143"\n"
144]
145},
146{
147"attachments": {},
148"cell_type": "markdown",
149"metadata": {},
150"source": [
151"The `.encode()` method converts a text string into a list of token integers."
152]
153},
154{
155"cell_type": "code",
156"execution_count": 5,
157"metadata": {},
158"outputs": [
159{
160"data": {
161"text/plain": [
162"[83, 1609, 5963, 374, 2294, 0]"
163]
164},
165"execution_count": 5,
166"metadata": {},
167"output_type": "execute_result"
168}
169],
170"source": [
171"encoding.encode(\"tiktoken is great!\")\n"
172]
173},
174{
175"attachments": {},
176"cell_type": "markdown",
177"metadata": {},
178"source": [
179"Count tokens by counting the length of the list returned by `.encode()`."
180]
181},
182{
183"cell_type": "code",
184"execution_count": 6,
185"metadata": {},
186"outputs": [],
187"source": [
188"def num_tokens_from_string(string: str, encoding_name: str) -> int:\n",
189" \"\"\"Returns the number of tokens in a text string.\"\"\"\n",
190" encoding = tiktoken.get_encoding(encoding_name)\n",
191" num_tokens = len(encoding.encode(string))\n",
192" return num_tokens\n"
193]
194},
195{
196"cell_type": "code",
197"execution_count": 7,
198"metadata": {},
199"outputs": [
200{
201"data": {
202"text/plain": [
203"6"
204]
205},
206"execution_count": 7,
207"metadata": {},
208"output_type": "execute_result"
209}
210],
211"source": [
212"num_tokens_from_string(\"tiktoken is great!\", \"cl100k_base\")\n"
213]
214},
215{
216"attachments": {},
217"cell_type": "markdown",
218"metadata": {},
219"source": [
220"## 4. Turn tokens into text with `encoding.decode()`"
221]
222},
223{
224"attachments": {},
225"cell_type": "markdown",
226"metadata": {},
227"source": [
228"`.decode()` converts a list of token integers to a string."
229]
230},
231{
232"cell_type": "code",
233"execution_count": 8,
234"metadata": {},
235"outputs": [
236{
237"data": {
238"text/plain": [
239"'tiktoken is great!'"
240]
241},
242"execution_count": 8,
243"metadata": {},
244"output_type": "execute_result"
245}
246],
247"source": [
248"encoding.decode([83, 1609, 5963, 374, 2294, 0])\n"
249]
250},
251{
252"attachments": {},
253"cell_type": "markdown",
254"metadata": {},
255"source": [
256"Warning: although `.decode()` can be applied to single tokens, beware that it can be lossy for tokens whose bytes do not fall on UTF-8 character boundaries."
257]
258},
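{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal illustration, take two `cl100k_base` token IDs that appear in the Japanese example in section 5 and together encode the single character 誕: decoding the pair recovers the character, while decoding the first token alone yields an incomplete UTF-8 byte sequence."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# the two tokens together decode to one complete character\n",
"print(encoding.decode([45918, 243]))  # 誕\n",
"# the first token alone covers only part of the character's UTF-8 bytes,\n",
"# so the invalid sequence is replaced with the replacement character\n",
"print(encoding.decode([45918]))  # �\n"
]
},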
259{
260"attachments": {},
261"cell_type": "markdown",
262"metadata": {},
263"source": [
264"For single tokens, `.decode_single_token_bytes()` safely converts a single integer token to the bytes it represents."
265]
266},
267{
268"cell_type": "code",
269"execution_count": 9,
270"metadata": {},
271"outputs": [
272{
273"data": {
274"text/plain": [
275"[b't', b'ik', b'token', b' is', b' great', b'!']"
276]
277},
278"execution_count": 9,
279"metadata": {},
280"output_type": "execute_result"
281}
282],
283"source": [
284"[encoding.decode_single_token_bytes(token) for token in [83, 1609, 5963, 374, 2294, 0]]\n"
285]
286},
287{
288"attachments": {},
289"cell_type": "markdown",
290"metadata": {},
291"source": [
292"(The `b` in front of the strings indicates that the strings are byte strings.)"
293]
294},
295{
296"attachments": {},
297"cell_type": "markdown",
298"metadata": {},
299"source": [
300"## 5. Comparing encodings\n",
301"\n",
302"Different encodings vary in how they split words, group spaces, and handle non-English characters. Using the methods above, we can compare different encodings on a few example strings."
303]
304},
305{
306"cell_type": "code",
307"execution_count": 10,
308"metadata": {},
309"outputs": [],
310"source": [
311"def compare_encodings(example_string: str) -> None:\n",
312" \"\"\"Prints a comparison of three string encodings.\"\"\"\n",
313" # print the example string\n",
314" print(f'\\nExample string: \"{example_string}\"')\n",
315" # for each encoding, print the # of tokens, the token integers, and the token bytes\n",
316" for encoding_name in [\"r50k_base\", \"p50k_base\", \"cl100k_base\"]:\n",
317" encoding = tiktoken.get_encoding(encoding_name)\n",
318" token_integers = encoding.encode(example_string)\n",
319" num_tokens = len(token_integers)\n",
320" token_bytes = [encoding.decode_single_token_bytes(token) for token in token_integers]\n",
321" print()\n",
322" print(f\"{encoding_name}: {num_tokens} tokens\")\n",
323" print(f\"token integers: {token_integers}\")\n",
324" print(f\"token bytes: {token_bytes}\")\n",
325" "
326]
327},
328{
329"cell_type": "code",
330"execution_count": 11,
331"metadata": {},
332"outputs": [
333{
334"name": "stdout",
335"output_type": "stream",
336"text": [
337"\n",
338"Example string: \"antidisestablishmentarianism\"\n",
339"\n",
340"r50k_base: 5 tokens\n",
341"token integers: [415, 29207, 44390, 3699, 1042]\n",
342"token bytes: [b'ant', b'idis', b'establishment', b'arian', b'ism']\n",
343"\n",
344"p50k_base: 5 tokens\n",
345"token integers: [415, 29207, 44390, 3699, 1042]\n",
346"token bytes: [b'ant', b'idis', b'establishment', b'arian', b'ism']\n",
347"\n",
348"cl100k_base: 6 tokens\n",
349"token integers: [519, 85342, 34500, 479, 8997, 2191]\n",
350"token bytes: [b'ant', b'idis', b'establish', b'ment', b'arian', b'ism']\n"
351]
352}
353],
354"source": [
355"compare_encodings(\"antidisestablishmentarianism\")\n"
356]
357},
358{
359"cell_type": "code",
360"execution_count": 12,
361"metadata": {},
362"outputs": [
363{
364"name": "stdout",
365"output_type": "stream",
366"text": [
367"\n",
368"Example string: \"2 + 2 = 4\"\n",
369"\n",
370"r50k_base: 5 tokens\n",
371"token integers: [17, 1343, 362, 796, 604]\n",
372"token bytes: [b'2', b' +', b' 2', b' =', b' 4']\n",
373"\n",
374"p50k_base: 5 tokens\n",
375"token integers: [17, 1343, 362, 796, 604]\n",
376"token bytes: [b'2', b' +', b' 2', b' =', b' 4']\n",
377"\n",
378"cl100k_base: 7 tokens\n",
379"token integers: [17, 489, 220, 17, 284, 220, 19]\n",
380"token bytes: [b'2', b' +', b' ', b'2', b' =', b' ', b'4']\n"
381]
382}
383],
384"source": [
385"compare_encodings(\"2 + 2 = 4\")\n"
386]
387},
388{
389"cell_type": "code",
390"execution_count": 13,
391"metadata": {},
392"outputs": [
393{
394"name": "stdout",
395"output_type": "stream",
396"text": [
397"\n",
398"Example string: \"お誕生日おめでとう\"\n",
399"\n",
400"r50k_base: 14 tokens\n",
401"token integers: [2515, 232, 45739, 243, 37955, 33768, 98, 2515, 232, 1792, 223, 30640, 30201, 29557]\n",
402"token bytes: [b'\\xe3\\x81', b'\\x8a', b'\\xe8\\xaa', b'\\x95', b'\\xe7\\x94\\x9f', b'\\xe6\\x97', b'\\xa5', b'\\xe3\\x81', b'\\x8a', b'\\xe3\\x82', b'\\x81', b'\\xe3\\x81\\xa7', b'\\xe3\\x81\\xa8', b'\\xe3\\x81\\x86']\n",
403"\n",
404"p50k_base: 14 tokens\n",
405"token integers: [2515, 232, 45739, 243, 37955, 33768, 98, 2515, 232, 1792, 223, 30640, 30201, 29557]\n",
406"token bytes: [b'\\xe3\\x81', b'\\x8a', b'\\xe8\\xaa', b'\\x95', b'\\xe7\\x94\\x9f', b'\\xe6\\x97', b'\\xa5', b'\\xe3\\x81', b'\\x8a', b'\\xe3\\x82', b'\\x81', b'\\xe3\\x81\\xa7', b'\\xe3\\x81\\xa8', b'\\xe3\\x81\\x86']\n",
407"\n",
408"cl100k_base: 9 tokens\n",
409"token integers: [33334, 45918, 243, 21990, 9080, 33334, 62004, 16556, 78699]\n",
410"token bytes: [b'\\xe3\\x81\\x8a', b'\\xe8\\xaa', b'\\x95', b'\\xe7\\x94\\x9f', b'\\xe6\\x97\\xa5', b'\\xe3\\x81\\x8a', b'\\xe3\\x82\\x81', b'\\xe3\\x81\\xa7', b'\\xe3\\x81\\xa8\\xe3\\x81\\x86']\n"
411]
412}
413],
414"source": [
415"compare_encodings(\"お誕生日おめでとう\")\n"
416]
417},
418{
419"attachments": {},
420"cell_type": "markdown",
421"metadata": {},
422"source": [
423"## 6. Counting tokens for chat completions API calls\n",
424"\n",
425"ChatGPT models like `gpt-3.5-turbo` and `gpt-4` use tokens in the same way as older completions models, but because of their message-based formatting, it's more difficult to count how many tokens will be used by a conversation.\n",
426"\n",
427"Below is an example function for counting tokens for messages passed to `gpt-3.5-turbo` or `gpt-4`.\n",
428"\n",
429"Note that the exact way that tokens are counted from messages may change from model to model. Consider the counts from the function below an estimate, not a timeless guarantee.\n",
430"\n",
431"In particular, requests that use the optional functions input will consume extra tokens on top of the estimates calculated below."
432]
433},
434{
435"cell_type": "code",
436"execution_count": 2,
437"metadata": {},
438"outputs": [],
439"source": [
440"def num_tokens_from_messages(messages, model=\"gpt-3.5-turbo-0613\"):\n",
441" \"\"\"Return the number of tokens used by a list of messages.\"\"\"\n",
442" try:\n",
443" encoding = tiktoken.encoding_for_model(model)\n",
444" except KeyError:\n",
445" print(\"Warning: model not found. Using cl100k_base encoding.\")\n",
446" encoding = tiktoken.get_encoding(\"cl100k_base\")\n",
447" if model in {\n",
448" \"gpt-3.5-turbo-0613\",\n",
449" \"gpt-3.5-turbo-16k-0613\",\n",
450" \"gpt-4-0314\",\n",
451" \"gpt-4-32k-0314\",\n",
452" \"gpt-4-0613\",\n",
453" \"gpt-4-32k-0613\",\n",
454" }:\n",
455" tokens_per_message = 3\n",
456" tokens_per_name = 1\n",
457" elif model == \"gpt-3.5-turbo-0301\":\n",
458" tokens_per_message = 4 # every message follows <|start|>{role/name}\\n{content}<|end|>\\n\n",
459" tokens_per_name = -1 # if there's a name, the role is omitted\n",
460" elif \"gpt-3.5-turbo\" in model:\n",
461" print(\"Warning: gpt-3.5-turbo may update over time. Returning num tokens assuming gpt-3.5-turbo-0613.\")\n",
462" return num_tokens_from_messages(messages, model=\"gpt-3.5-turbo-0613\")\n",
463" elif \"gpt-4\" in model:\n",
464" print(\"Warning: gpt-4 may update over time. Returning num tokens assuming gpt-4-0613.\")\n",
465" return num_tokens_from_messages(messages, model=\"gpt-4-0613\")\n",
466" else:\n",
467" raise NotImplementedError(\n",
468" f\"\"\"num_tokens_from_messages() is not implemented for model {model}. See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens.\"\"\"\n",
469" )\n",
470" num_tokens = 0\n",
471" for message in messages:\n",
472" num_tokens += tokens_per_message\n",
473" for key, value in message.items():\n",
474" num_tokens += len(encoding.encode(value))\n",
475" if key == \"name\":\n",
476" num_tokens += tokens_per_name\n",
477" num_tokens += 3 # every reply is primed with <|start|>assistant<|message|>\n",
478" return num_tokens\n"
479]
480},
481{
482"cell_type": "code",
483"execution_count": 4,
484"metadata": {},
485"outputs": [
486{
487"name": "stdout",
488"output_type": "stream",
489"text": [
490"gpt-3.5-turbo-0301\n",
491"127 prompt tokens counted by num_tokens_from_messages().\n",
492"127 prompt tokens counted by the OpenAI API.\n",
493"\n",
494"gpt-3.5-turbo-0613\n",
495"129 prompt tokens counted by num_tokens_from_messages().\n",
496"129 prompt tokens counted by the OpenAI API.\n",
497"\n",
498"gpt-3.5-turbo\n",
499"Warning: gpt-3.5-turbo may update over time. Returning num tokens assuming gpt-3.5-turbo-0613.\n",
500"129 prompt tokens counted by num_tokens_from_messages().\n",
501"129 prompt tokens counted by the OpenAI API.\n",
502"\n",
503"gpt-4-0314\n",
504"129 prompt tokens counted by num_tokens_from_messages().\n",
505"129 prompt tokens counted by the OpenAI API.\n",
506"\n",
507"gpt-4-0613\n",
508"129 prompt tokens counted by num_tokens_from_messages().\n",
509"129 prompt tokens counted by the OpenAI API.\n",
510"\n",
511"gpt-4\n",
512"Warning: gpt-4 may update over time. Returning num tokens assuming gpt-4-0613.\n",
513"129 prompt tokens counted by num_tokens_from_messages().\n",
514"129 prompt tokens counted by the OpenAI API.\n",
515"\n"
516]
517}
518],
519"source": [
520"# let's verify the function above matches the OpenAI API response\n",
521"\n",
522"from openai import OpenAI\n",
523"import os\n",
524"\n",
525"client = OpenAI(api_key=os.environ.get(\"OPENAI_API_KEY\", \"<your OpenAI API key if not set as env var>\"))\n",
526"\n",
527"example_messages = [\n",
528" {\n",
529" \"role\": \"system\",\n",
530" \"content\": \"You are a helpful, pattern-following assistant that translates corporate jargon into plain English.\",\n",
531" },\n",
532" {\n",
533" \"role\": \"system\",\n",
534" \"name\": \"example_user\",\n",
535" \"content\": \"New synergies will help drive top-line growth.\",\n",
536" },\n",
537" {\n",
538" \"role\": \"system\",\n",
539" \"name\": \"example_assistant\",\n",
540" \"content\": \"Things working well together will increase revenue.\",\n",
541" },\n",
542" {\n",
543" \"role\": \"system\",\n",
544" \"name\": \"example_user\",\n",
545" \"content\": \"Let's circle back when we have more bandwidth to touch base on opportunities for increased leverage.\",\n",
546" },\n",
547" {\n",
548" \"role\": \"system\",\n",
549" \"name\": \"example_assistant\",\n",
550" \"content\": \"Let's talk later when we're less busy about how to do better.\",\n",
551" },\n",
552" {\n",
553" \"role\": \"user\",\n",
554" \"content\": \"This late pivot means we don't have time to boil the ocean for the client deliverable.\",\n",
555" },\n",
556"]\n",
557"\n",
558"for model in [\n",
559" \"gpt-3.5-turbo-0301\",\n",
560" \"gpt-3.5-turbo-0613\",\n",
561" \"gpt-3.5-turbo\",\n",
562" \"gpt-4-0314\",\n",
563" \"gpt-4-0613\",\n",
564" \"gpt-4\",\n",
565" ]:\n",
566" print(model)\n",
567" # example token count from the function defined above\n",
568" print(f\"{num_tokens_from_messages(example_messages, model)} prompt tokens counted by num_tokens_from_messages().\")\n",
569" # example token count from the OpenAI API\n",
570" response = client.chat.completions.create(model=model,\n",
571" messages=example_messages,\n",
572" temperature=0,\n",
573" max_tokens=1)\n",
574" print(f'{response.usage.prompt_tokens} prompt tokens counted by the OpenAI API.')\n",
575" print()\n"
576]
577}
585],
586"metadata": {
587"kernelspec": {
588"display_name": "Python 3",
589"language": "python",
590"name": "python3"
591},
592"language_info": {
593"codemirror_mode": {
594"name": "ipython",
595"version": 3
596},
597"file_extension": ".py",
598"mimetype": "text/x-python",
599"name": "python",
600"nbconvert_exporter": "python",
601"pygments_lexer": "ipython3",
602"version": "3.11.5"
603},
604"vscode": {
605"interpreter": {
606"hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97"
607}
608}
609},
610"nbformat": 4,
611"nbformat_minor": 2
612}