{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Working with structured outputs\n",
    "\n",
    "If you've seen my [talk](https://www.youtube.com/watch?v=yj-wSRJwrrc&t=1s) on this topic, you can skip this chapter.\n",
    "\n",
    "tl;dr\n",
    "\n",
    "When we work with LLMs, we are often not building chatbots; instead, we want structured outputs that return machine-readable data to solve a problem. However, the way we think about the problem is still heavily influenced by how we think about chatbots, and that mismatch leads to a lot of confusion and frustration. In this chapter we'll try to understand why this happens and how we can fix it.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import traceback"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "RED = \"\\033[91m\"\n",
    "RESET = \"\\033[0m\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The fundamental problem with JSON and Dictionaries\n",
    "\n",
    "Let's say we have a simple JSON object that we want to work with. We can use the `json` module to load it into a dictionary, but then we have to manually check the types of the data and manually check that the data is valid, which is a pain. For example, suppose we have JSON objects that look like this:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = [{\"first_name\": \"Jason\", \"age\": 10}, {\"firstName\": \"Jason\", \"age\": \"10\"}]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We expect a `first_name` field, which is a string, and an `age` field, which is an integer. However, once this is loaded into a dictionary, we have no way of knowing whether the data is valid: the age could be a string or a float, the name key could be spelled differently, and nothing warns us until the code breaks.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Jason is 10\n",
      "None is 10\n",
      "Next year Jason will be 11 years old\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Traceback (most recent call last):\n",
      "  File \"/var/folders/l2/jjqj299126j0gycr9kkkt9xm0000gn/T/ipykernel_24047/2607506000.py\", line 10, in <module>\n",
      "    age_next_year = age + 1\n",
      "                    ~~~~^~~\n",
      "TypeError: can only concatenate str (not \"int\") to str\n"
     ]
    }
   ],
   "source": [
    "for obj in data:\n",
    "    name = obj.get(\"first_name\")\n",
    "    age = obj.get(\"age\")\n",
    "    print(f\"{name} is {age}\")\n",
    "\n",
    "for obj in data:\n",
    "    name = obj.get(\"first_name\")\n",
    "    age = obj.get(\"age\")\n",
    "    try:\n",
    "        age_next_year = age + 1\n",
    "        print(f\"Next year {name} will be {age_next_year} years old\")\n",
    "    except TypeError:\n",
    "        traceback.print_exc()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can see that while we were able to program against a dictionary, we ran into problems with invalid data. We would have had to manually check the types and the validity of every field ourselves. This is a pain, and we can do better.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pydantic to the rescue\n",
    "\n",
    "Pydantic is a library that allows us to define data structures, and then validate them.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Person(name='Sam', age=30)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from pydantic import BaseModel, Field, ValidationError\n",
    "\n",
    "class Person(BaseModel):\n",
    "    name: str\n",
    "    age: int\n",
    "\n",
    "\n",
    "person = Person(name=\"Sam\", age=30)\n",
    "person"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Person(name='Sam', age=30)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Data is correctly coerced to the right type\n",
    "person = Person.model_validate({\"name\": \"Sam\", \"age\": \"30\"})\n",
    "person"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Traceback (most recent call last):\n",
      "  File \"/var/folders/l2/jjqj299126j0gycr9kkkt9xm0000gn/T/ipykernel_24047/3040264600.py\", line 5, in <module>\n",
      "    assert person.age == 20\n",
      "           ^^^^^^^^^^^^^^^^\n",
      "AssertionError\n"
     ]
    }
   ],
   "source": [
    "assert person.name == \"Sam\"\n",
    "assert person.age == 30\n",
    "\n",
    "try:\n",
    "    assert person.age == 20\n",
    "except AssertionError:\n",
    "    traceback.print_exc()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Validation Error:\n",
      "Field: name, Error: Field required\n",
      "Field: age, Error: Input should be a valid integer, unable to parse string as an integer\n",
      "\u001b[91m\n",
      "Original Traceback Below\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Traceback (most recent call last):\n",
      "  File \"/var/folders/l2/jjqj299126j0gycr9kkkt9xm0000gn/T/ipykernel_24047/621989455.py\", line 3, in <module>\n",
      "    person = Person.model_validate({\"first_name\": \"Sam\", \"age\": \"30.2\"})\n",
      "             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n",
      "  File \"/opt/homebrew/Caskroom/miniconda/base/envs/instructor/lib/python3.11/site-packages/pydantic/main.py\", line 509, in model_validate\n",
      "    return cls.__pydantic_validator__.validate_python(\n",
      "           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n",
      "pydantic_core._pydantic_core.ValidationError: 2 validation errors for Person\n",
      "name\n",
      "  Field required [type=missing, input_value={'first_name': 'Sam', 'age': '30.2'}, input_type=dict]\n",
      "    For further information visit https://errors.pydantic.dev/2.6/v/missing\n",
      "age\n",
      "  Input should be a valid integer, unable to parse string as an integer [type=int_parsing, input_value='30.2', input_type=str]\n",
      "    For further information visit https://errors.pydantic.dev/2.6/v/int_parsing\n"
     ]
    }
   ],
   "source": [
    "# Data is validated to get better error messages\n",
    "try:\n",
    "    person = Person.model_validate({\"first_name\": \"Sam\", \"age\": \"30.2\"})\n",
    "except ValidationError as e:\n",
    "    print(\"Validation Error:\")\n",
    "    for error in e.errors():\n",
    "        print(f\"Field: {error['loc'][0]}, Error: {error['msg']}\")\n",
    "\n",
    "    print(f\"{RED}\\nOriginal Traceback Below{RESET}\")\n",
    "    traceback.print_exc()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By introducing Pydantic into any Python codebase you get a lot of benefits: type checking, validation, and autocomplete. This is a huge win, because it means you can catch errors before they happen, and it is even more useful when we rely on language models to generate data for us.\n",
    "\n",
    "You can also define custom validators that run on the data, which lets you catch domain-specific errors early. For example, you can define a validator that checks that the age is greater than 0, as sketched in the next cell.\n"
   ]
  },
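  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal, untested sketch of that idea: the `ValidatedPerson` model below is hypothetical and only used here to illustrate a custom `field_validator`; it is not used elsewhere in this notebook.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pydantic import BaseModel, ValidationError, field_validator\n",
    "\n",
    "\n",
    "class ValidatedPerson(BaseModel):\n",
    "    name: str\n",
    "    age: int\n",
    "\n",
    "    # Hypothetical validator, for illustration only: reject non-positive ages.\n",
    "    @field_validator(\"age\")\n",
    "    @classmethod\n",
    "    def age_must_be_positive(cls, v: int) -> int:\n",
    "        if v <= 0:\n",
    "            raise ValueError(\"age must be greater than 0\")\n",
    "        return v\n",
    "\n",
    "\n",
    "try:\n",
    "    ValidatedPerson(name=\"Sam\", age=-5)\n",
    "except ValidationError as e:\n",
    "    print(e)"
   ]
  },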
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fundamental problem with asking for JSON from OpenAI\n",
    "\n",
    "As we will see below, the JSON we want looks like this:\n",
    "\n",
    "```json\n",
    "{\n",
    "    \"name\": \"Jason\",\n",
    "    \"age\": 10\n",
    "}\n",
    "```\n",
    "\n",
    "However, we often get erroneous outputs like:\n",
    "\n",
    "```json\n",
    "{\n",
    "  \"jason\": 10\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "json that we want:\n",
      "\n",
      "{\n",
      "    \"name\": \"Jason\",\n",
      "    \"age\": 10\n",
      "}\n",
      "\n",
      "error!!\n",
      "{\n",
      "  \"jason\": 10\n",
      "}\n",
      "correctly parsed person=Person(name='Jason', age=10)\n",
      "correctly parsed person=Person(name='jason', age=10)\n",
      "error!!\n",
      "{\n",
      "  \"Jason\": {\n",
      "    \"age\": 10\n",
      "  }\n",
      "}\n",
      "error!!\n",
      "{\n",
      "  \"Jason\": {\n",
      "    \"age\": 10\n",
      "  }\n",
      "}\n",
      "error!!\n",
      "{\n",
      "  \"Jason\": {\n",
      "    \"age\": 10\n",
      "  }\n",
      "}\n",
      "error!!\n",
      "{\n",
      "  \"Jason\": {\n",
      "    \"age\": 10\n",
      "  }\n",
      "}\n",
      "correctly parsed person=Person(name='Jason', age=10)\n",
      "correctly parsed person=Person(name='Jason', age=10)\n",
      "error!!\n",
      "{\n",
      "  \"jason\": 10\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "from openai import OpenAI\n",
    "\n",
    "client = OpenAI()\n",
    "\n",
    "resp = client.chat.completions.create(\n",
    "    model=\"gpt-3.5-turbo\",\n",
    "    messages=[\n",
    "        {\"role\": \"user\", \"content\": \"Please give me jason is 10 as a json object ```json\\n\"},\n",
    "    ],\n",
    "    n=10,\n",
    "    temperature=1,\n",
    ")\n",
    "\n",
    "print(\"json that we want:\")\n",
    "print(\"\"\"\n",
    "{\n",
    "    \"name\": \"Jason\",\n",
    "    \"age\": 10\n",
    "}\n",
    "\"\"\")\n",
    "\n",
    "for choice in resp.choices:\n",
    "    json = choice.message.content\n",
    "    try:\n",
    "        person = Person.model_validate_json(json)\n",
    "        print(f\"correctly parsed {person=}\")\n",
    "    except Exception as e:\n",
    "        print(\"error!!\")\n",
    "        print(json)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction to Function Calling\n",
    "\n",
    "The JSON could be anything! We could keep adding more and more instructions to the prompt and hope it works, or we can use [function calling](https://platform.openai.com/docs/guides/function-calling) to directly specify the schema we want.\n",
    "\n",
    "**Function Calling**\n",
    "\n",
    "In an API call, you can describe _functions_ and have the model intelligently\n",
    "choose to output a _JSON object_ containing _arguments_ to call one or many\n",
    "functions. The Chat Completions API does **not** call the function; instead, the\n",
    "model generates _JSON_ that you can use to call the function in **your code**.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "PersonBirthday(name='Jason Liu', age=30, birthday=datetime.date(1994, 3, 26))"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import datetime\n",
    "\n",
    "\n",
    "class PersonBirthday(BaseModel):\n",
    "    name: str\n",
    "    age: int\n",
    "    birthday: datetime.date\n",
    "\n",
    "\n",
    "schema = {\n",
    "    \"properties\": {\n",
    "        \"name\": {\"type\": \"string\"},\n",
    "        \"age\": {\"type\": \"integer\"},\n",
    "        \"birthday\": {\"type\": \"string\", \"format\": \"YYYY-MM-DD\"},\n",
    "    },\n",
    "    \"required\": [\"name\", \"age\"],\n",
    "    \"type\": \"object\",\n",
    "}\n",
    "\n",
    "resp = client.chat.completions.create(\n",
    "    model=\"gpt-3.5-turbo\",\n",
    "    messages=[\n",
    "        {\n",
    "            \"role\": \"user\",\n",
    "            \"content\": f\"Extract `Jason Liu is thirty years old his birthday is yesterday` into json today is {datetime.date.today()}\",\n",
    "        },\n",
    "    ],\n",
    "    functions=[{\"name\": \"Person\", \"parameters\": schema}],\n",
    "    function_call=\"auto\",\n",
    ")\n",
    "\n",
    "PersonBirthday.model_validate_json(resp.choices[0].message.function_call.arguments)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But it turns out Pydantic does more than serialization: it can also generate the JSON schema for us, and we can attach additional documentation to it!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'properties': {'name': {'title': 'Name', 'type': 'string'},\n",
       "  'age': {'title': 'Age', 'type': 'integer'},\n",
       "  'birthday': {'format': 'date', 'title': 'Birthday', 'type': 'string'}},\n",
       " 'required': ['name', 'age', 'birthday'],\n",
       " 'title': 'PersonBirthday',\n",
       " 'type': 'object'}"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "PersonBirthday.model_json_schema()"
   ]
  },
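  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal, untested sketch of why this matters: instead of hand-writing the `schema` dictionary above, we could pass the schema Pydantic generates straight into the function-calling request.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: reuse the Pydantic-generated schema instead of a hand-written dict.\n",
    "resp = client.chat.completions.create(\n",
    "    model=\"gpt-3.5-turbo\",\n",
    "    messages=[\n",
    "        {\n",
    "            \"role\": \"user\",\n",
    "            \"content\": f\"Extract `Jason Liu is thirty years old his birthday is yesterday` into json today is {datetime.date.today()}\",\n",
    "        },\n",
    "    ],\n",
    "    functions=[\n",
    "        {\"name\": \"PersonBirthday\", \"parameters\": PersonBirthday.model_json_schema()}\n",
    "    ],\n",
    "    function_call={\"name\": \"PersonBirthday\"},\n",
    ")\n",
    "\n",
    "PersonBirthday.model_validate_json(resp.choices[0].message.function_call.arguments)"
   ]
  },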
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can even define nested, complex schemas and their documentation with ease.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'$defs': {'Address': {'properties': {'address': {'description': 'Full street address',\n",
       "     'title': 'Address',\n",
       "     'type': 'string'},\n",
       "    'city': {'title': 'City', 'type': 'string'},\n",
       "    'state': {'title': 'State', 'type': 'string'}},\n",
       "   'required': ['address', 'city', 'state'],\n",
       "   'title': 'Address',\n",
       "   'type': 'object'}},\n",
       " 'description': 'A Person with an address',\n",
       " 'properties': {'name': {'title': 'Name', 'type': 'string'},\n",
       "  'age': {'title': 'Age', 'type': 'integer'},\n",
       "  'address': {'$ref': '#/$defs/Address'}},\n",
       " 'required': ['name', 'age', 'address'],\n",
       " 'title': 'PersonAddress',\n",
       " 'type': 'object'}"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "class Address(BaseModel):\n",
    "    address: str = Field(description=\"Full street address\")\n",
    "    city: str\n",
    "    state: str\n",
    "\n",
    "\n",
    "class PersonAddress(Person):\n",
    "    \"\"\"A Person with an address\"\"\"\n",
    "\n",
    "    address: Address\n",
    "\n",
    "\n",
    "PersonAddress.model_json_schema()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These simple concepts are what we built into `instructor`; most of the work has been around documenting how we can leverage schema engineering.\n",
    "The difference is that we now use `instructor.patch()` to add a number of extra capabilities to the OpenAI SDK.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# The core idea around Instructor\n",
    "\n",
    "1. Using function calling lets us use an LLM that is fine-tuned to consume json_schema and output JSON.\n",
    "2. Pydantic can define the object, the schema, and the validation in one single class, allowing us to encapsulate everything neatly.\n",
    "3. As a library with 100M downloads, Pydantic can do all the heavy lifting for us and fits nicely into the Python ecosystem.\n"
   ]
  },
537
  {
538
   "cell_type": "code",
539
   "execution_count": 13,
540
   "metadata": {},
541
   "outputs": [
542
    {
543
     "data": {
544
      "text/plain": [
545
       "PersonAddress(name='Jason Liu', age=30, address=Address(address='123 Main St', city='San Francisco', state='CA'))"
546
      ]
547
     },
548
     "execution_count": 13,
549
     "metadata": {},
550
     "output_type": "execute_result"
551
    }
552
   ],
553
   "source": [
554
    "import instructor\n",
555
    "import datetime\n",
556
    "\n",
557
    "# patch the client to add `response_model` to the `create` method\n",
558
    "client = instructor.patch(OpenAI(), mode=instructor.Mode.MD_JSON)\n",
559
    "\n",
560
    "resp = client.chat.completions.create(\n",
561
    "    model=\"gpt-3.5-turbo-1106\",\n",
562
    "    messages=[\n",
563
    "        {\n",
564
    "            \"role\": \"user\",\n",
565
    "            \"content\": f\"\"\"\n",
566
    "            Today is {datetime.date.today()}\n",
567
    "\n",
568
    "            Extract `Jason Liu is thirty years old his birthday is yesturday`\n",
569
    "            he lives at 123 Main St, San Francisco, CA\"\"\",\n",
570
    "        },\n",
571
    "    ],\n",
572
    "    response_model=PersonAddress,\n",
573
    ")\n",
574
    "resp"
575
   ]
576
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By defining a `response_model` we can leverage Pydantic to do all the heavy lifting. Later we'll introduce the other features that `instructor.patch()` adds to the OpenAI SDK,\n",
    "but for now this small change already allows us to do a lot more with the API.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Is instructor the only way to do this?\n",
    "\n",
    "No. Libraries like Marvin, Langchain, and LlamaIndex all now leverage Pydantic objects in similar ways. The goal is to be as lightweight as possible, get you as close as possible to the OpenAI API, and then get out of your way.\n",
    "\n",
    "More importantly, we've also added straightforward validation and reasking to the mix.\n",
    "\n",
    "The goal of instructor is to show you how to think about structured prompting and provide examples and documentation that you can take with you to any framework.\n",
    "\n",
    "For further exploration:\n",
    "\n",
    "- [Marvin](https://www.askmarvin.ai/)\n",
    "- [Langchain](https://python.langchain.com/docs/modules/model_io/output_parsers/pydantic)\n",
    "- [LlamaIndex](https://gpt-index.readthedocs.io/en/latest/examples/output_parsing/openai_pydantic_program.html)\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}