{
2
 "cells": [
3
  {
4
   "cell_type": "markdown",
5
   "id": "df019bc4-bdc3-4351-9f03-294be147bf01",
6
   "metadata": {},
7
   "source": [
8
    "# Chain Of Density Summarization"
9
   ]
10
  },
11
  {
12
   "attachments": {},
13
   "cell_type": "markdown",
14
   "id": "2b2ec7b8-96f0-44ae-afad-2d578a7164aa",
15
   "metadata": {},
16
   "source": [
17
    "## Introduction\n",
18
    "\n",
19
    "**What is Chain Of Density summarization?**\n",
20
    "\n",
    "Summarizing long texts with AI can be challenging. Chain Of Density is an iterative approach: the model first produces an initial summary, then rewrites it over several iterations. Each iteration adds entities from the article that were missing from the previous summary while keeping the length roughly constant, which leads to an entity-dense, informative final summary.\n",
    "\n",
    "It was first introduced in the paper *From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting*.\n",
    "\n",
    "The original paper asked GPT-4 to generate all of the rewritten summaries in a single pass using the prompt below."
26
   ]
27
  },
28
  {
29
   "cell_type": "markdown",
30
   "id": "3850682a-91ac-43ec-8279-fa12cfb88c2f",
31
   "metadata": {},
32
   "source": [
33
    "> Article: {{ARTICLE}}\n",
34
    ">\n",
35
    "> You will generate increasingly concise, entity-dense summaries of the\n",
36
    "> above Article.\n",
37
    ">\n",
38
    "> Repeat the following 2 steps 5 times.\n",
39
    ">\n",
40
    "> Step 1. Identify 1-3 informative Entities (\";\" delimited) from the\n",
41
    "> Article which are missing from the previously generated summary.\n",
42
    "> Step 2. Write a new, denser summary of identical length which covers\n",
43
    "> every entity and detail from the previous summary plus the Missing\n",
44
    "> Entities.\n",
45
    ">\n",
46
    "> A Missing Entity is:\n",
47
    "> - Relevant: to the main story.\n",
48
    "> - Specific: descriptive yet concise (5 words or fewer).\n",
    "> - Novel: not in the previous summary.\n",
50
    "> - Faithful: present in the Article.\n",
51
    "> - Anywhere: located anywhere in the Article.\n",
52
    ">\n",
53
    "> Guidelines:\n",
    "> - The first summary should be long (4-5 sentences, ~80 words) yet\n",
    "> highly non-specific, containing little information beyond the\n",
    "> entities marked as missing. Use overly verbose language and fillers\n",
    "> (e.g., \"this article discusses\") to reach ~80 words.\n",
58
    "> - Make every word count: re-write the previous summary to improve\n",
59
    "> flow and make space for additional entities.\n",
60
    "> - Make space with fusion, compression, and removal of uninformative\n",
61
    "> phrases like \"the article discusses\"\n",
62
    "> - The summaries should become highly dense and concise yet\n",
63
    "> self-contained, e.g., easily understood without the Article.\n",
64
    "> - Missing entities can appear anywhere in the new summary.\n",
65
    "> - Never drop entities from the previous summary. If space cannot be\n",
66
    "> made, add fewer new entities.\n",
67
    ">\n",
68
    "> Remember, use the exact same number of words for each summary.\n",
69
    ">\n",
70
    "> Answer in JSON. The JSON should be a list (length 5) of dictionaries\n",
71
    "> whose keys are \"Missing_Entities\" and \"Denser_Summary\""
72
   ]
73
  },
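Because the prompt above asks for a JSON list of dictionaries with the keys "Missing_Entities" and "Denser_Summary", the single-prompt reply can be checked with the standard library alone. The snippet below is a minimal sketch under assumptions: `raw_reply` is a made-up, shortened reply (two steps instead of five), `parse_chain` is a hypothetical helper name, and the entities are assumed to arrive as one ";" delimited string, as Step 1 of the prompt suggests.

```python
import json

# Hypothetical model reply, shortened to two steps for illustration; the
# prompt itself asks for a list of length 5.
raw_reply = """
[
  {"Missing_Entities": "Jenson Button; Pastor Maldonado",
   "Denser_Summary": "This article discusses a collision between Jenson Button and Pastor Maldonado."},
  {"Missing_Entities": "Chinese Grand Prix",
   "Denser_Summary": "Jenson Button and Pastor Maldonado collided at the Chinese Grand Prix."}
]
"""


def parse_chain(reply: str, expected_steps: int) -> list:
    """Parse the reply and check that it has the shape the prompt asks for."""
    steps = json.loads(reply)
    if not isinstance(steps, list) or len(steps) != expected_steps:
        raise ValueError(f"expected a list of {expected_steps} steps")
    parsed = []
    for step in steps:
        if set(step) != {"Missing_Entities", "Denser_Summary"}:
            raise ValueError(f"unexpected keys: {sorted(step)}")
        # Entities are assumed to be ';' delimited, as in Step 1 of the prompt.
        entities = [e.strip() for e in step["Missing_Entities"].split(";") if e.strip()]
        parsed.append({"missing": entities, "summary": step["Denser_Summary"]})
    return parsed


chain = parse_chain(raw_reply, expected_steps=2)
for step in chain:
    print(step["missing"], "->", step["summary"])
```

A check like this only confirms the shape of the reply. It cannot tell whether an entity was dropped or whether the summary kept its length, which is the gap that step-by-step validation is meant to close.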
74
  {
75
   "cell_type": "markdown",
76
   "id": "758c99e8-2c9e-4a2b-9ae2-cebce820dde2",
77
   "metadata": {},
78
   "source": [
    "While the original paper used a single prompt to produce every iteration, we can go one step further with `Instructor` and break the process down into smaller API calls, with validation along the way.\n",
80
    "\n",
81
    "The process can be broken down as seen below."
82
   ]
83
  },
84
  {
85
   "attachments": {
86
    "e3835897-9292-49af-a248-95eaa1d0b86a.png": {
87
     "image/png": ""
88
    }
89
   },
90
   "cell_type": "markdown",
91
   "id": "ed20a4f2-ec79-44d7-9550-7ad5699c136d",
92
   "metadata": {},
93
   "source": [
94
    "![image.png](attachment:e3835897-9292-49af-a248-95eaa1d0b86a.png)"
95
   ]
96
  },
97
  {
98
   "cell_type": "markdown",
99
   "id": "e11663b3-ff06-4f4d-a17f-b215b22f99cd",
100
   "metadata": {},
101
   "source": [
102
    "### Setup and Dependencies\n",
103
    "\n",
    "We'll be using two new libraries for our demonstration:\n",
    "\n",
    "1. `spaCy`: provides a handful of useful utilities for generic NLP tasks\n",
    "2. `nltk`: used by the original paper to count the number of tokens in the generated summaries"
108
   ]
109
  },
110
  {
111
   "cell_type": "markdown",
112
   "id": "35dd5dae-0659-4b86-b8f2-57ec56087831",
113
   "metadata": {},
114
   "source": [
    "We'll need to download the NLTK tokenizer data and the spaCy English model before we can proceed with the rest of the lesson."
116
   ]
117
  },
118
  {
119
   "cell_type": "code",
120
   "execution_count": 1,
121
   "id": "0dbdda0a-2648-4e0f-8633-ea19bef4a460",
122
   "metadata": {},
123
   "outputs": [
124
    {
125
     "name": "stderr",
126
     "output_type": "stream",
127
     "text": [
128
      "[nltk_data] Downloading package punkt to /Users/admin/nltk_data...\n",
129
      "[nltk_data]   Package punkt is already up-to-date!\n"
130
     ]
131
    },
132
    {
133
     "name": "stdout",
134
     "output_type": "stream",
135
     "text": [
136
      "\u001b[38;5;2m✔ Download and installation successful\u001b[0m\n",
137
      "You can now load the package via spacy.load('en_core_web_sm')\n"
138
     ]
139
    }
140
   ],
141
   "source": [
142
    "import nltk\n",
143
    "nltk.download('punkt')\n",
144
    "\n",
145
    "!python -m spacy download en_core_web_sm --quiet"
146
   ]
147
  },
148
  {
149
   "cell_type": "markdown",
150
   "id": "90874bad-06b5-4656-beec-73fe984efbcb",
151
   "metadata": {},
152
   "source": [
153
    "Once that's done, let's now move on to writing some code."
154
   ]
155
  },
156
  {
157
   "cell_type": "markdown",
158
   "id": "424ca094-9ae2-4da4-90f8-32ec89cddabc",
159
   "metadata": {},
160
   "source": [
161
    "## Definitions"
162
   ]
163
  },
164
  {
165
   "cell_type": "markdown",
166
   "id": "68397732-fd6f-424d-8823-7818a0752aea",
167
   "metadata": {},
168
   "source": [
    "There are a few concepts we'll need to understand in this tutorial:\n",
    "\n",
    "1. Tokens and tokenizers\n",
    "2. Entities\n",
    "3. Entity density\n",
    "\n",
    "Once we've got the hang of these concepts, we'll walk through a simple implementation of a Chain Of Density summarizer."
176
   ]
177
  },
178
  {
179
   "cell_type": "markdown",
180
   "id": "4cf72a9d-db37-4ec9-b242-171468090bc1",
181
   "metadata": {},
182
   "source": [
183
    "### Tokens and Tokenizers\n",
184
    "\n",
    "In the original paper, the authors used `NLTK` to split the generated summary into tokens. Tokens are the smallest units a sentence can be broken into while each still holds semantic meaning, such as words and punctuation marks.\n",
    "\n",
    "Let's walk through a simple example to see how the `NLTK` tokenizer works."
188
   ]
189
  },
190
  {
191
   "cell_type": "code",
192
   "execution_count": 2,
193
   "id": "bd6ebf95-60c6-4ec8-be17-d5ab436a67fd",
194
   "metadata": {},
195
   "outputs": [
196
    {
197
     "data": {
198
      "text/plain": [
199
       "['My', 'favourite', 'type', 'of', 'Sashimi', 'is', 'Toro']"
200
      ]
201
     },
202
     "execution_count": 2,
203
     "metadata": {},
204
     "output_type": "execute_result"
205
    }
206
   ],
207
   "source": [
208
    "import nltk\n",
209
    "sentence = \"My favourite type of Sashimi is Toro\"\n",
210
    "\n",
211
    "nltk.word_tokenize(sentence)"
212
   ]
213
  },
214
  {
215
   "cell_type": "markdown",
216
   "id": "281f523d-7707-4e33-af29-f233a1f7bf2a",
217
   "metadata": {},
218
   "source": [
    "NLTK's word tokenizer does more than just split on whitespace. It also handles edge cases such as punctuation and contractions like `don't` or `I'm`."
220
   ]
221
  },
222
  {
223
   "cell_type": "code",
224
   "execution_count": 3,
225
   "id": "8a87b231-57b0-426c-98d5-cd7d8b512121",
226
   "metadata": {},
227
   "outputs": [
228
    {
229
     "data": {
230
      "text/plain": [
231
       "['I', \"'m\", 'fascinated', 'by', 'machine', 'learning', '!']"
232
      ]
233
     },
234
     "execution_count": 3,
235
     "metadata": {},
236
     "output_type": "execute_result"
237
    }
238
   ],
239
   "source": [
240
    "sentence = \"I'm fascinated by machine learning!\"\n",
241
    "\n",
242
    "nltk.word_tokenize(sentence)"
243
   ]
244
  },
245
  {
246
   "cell_type": "markdown",
247
   "id": "6719c508-f575-41a5-91a2-47b2fa76cd3f",
248
   "metadata": {},
249
   "source": [
250
    "We can then calculate the number of tokens by simply finding the `len` of the generated sequence."
251
   ]
252
  },
253
  {
254
   "cell_type": "code",
255
   "execution_count": 4,
256
   "id": "c905dff4-5753-4274-90fe-44aa3393ff0f",
257
   "metadata": {},
258
   "outputs": [
259
    {
260
     "name": "stdout",
261
     "output_type": "stream",
262
     "text": [
263
      "['I', \"'m\", 'fascinated', 'by', 'machine', 'learning', '!']\n",
264
      "7\n"
265
     ]
266
    }
267
   ],
268
   "source": [
269
    "sentence = \"I'm fascinated by machine learning!\"\n",
270
    "tokens = nltk.word_tokenize(sentence)\n",
271
    "print(tokens)\n",
272
    "print(len(tokens))"
273
   ]
274
  },
275
  {
276
   "cell_type": "markdown",
277
   "id": "692316bc-10e6-421f-adba-5323376b95d6",
278
   "metadata": {},
279
   "source": [
280
    "### Entities\n",
281
    "\n",
    "A named entity is a real-world object that we identify by name. Common examples include people, countries, products, or even books that we know and love. We can use the `spaCy` library to detect the entities in a given sentence."
283
   ]
284
  },
285
  {
286
   "cell_type": "code",
287
   "execution_count": 5,
288
   "id": "47a4a8f6-295d-4040-beb1-3c8e9ff3bf99",
289
   "metadata": {},
290
   "outputs": [],
291
   "source": [
292
    "# First we load in the library\n",
293
    "import spacy\n",
294
    "\n",
295
    "# Then we initialise an NLP object. \n",
296
    "nlp = spacy.load(\"en_core_web_sm\")"
297
   ]
298
  },
299
  {
300
   "cell_type": "code",
301
   "execution_count": 6,
302
   "id": "51197222-2124-46f8-9a57-555d43836401",
303
   "metadata": {},
304
   "outputs": [
305
    {
306
     "data": {
307
      "text/plain": [
308
       "(Apple, U.K., $1 billion)"
309
      ]
310
     },
311
     "execution_count": 6,
312
     "metadata": {},
313
     "output_type": "execute_result"
314
    }
315
   ],
316
   "source": [
317
    "sentence = \"Apple is looking at buying U.K. startup for $1 billion\"\n",
318
    "\n",
319
    "doc = nlp(sentence)\n",
320
    "doc.ents"
321
   ]
322
  },
323
  {
324
   "cell_type": "markdown",
325
   "id": "5e2560b2-ca27-4223-84ed-e01f9542fdbd",
326
   "metadata": {},
327
   "source": [
    "We can see that spaCy identified the named entities present in the sentence, which are exposed through the `doc.ents` property. Let's see a few more examples."
329
   ]
330
  },
331
  {
332
   "cell_type": "code",
333
   "execution_count": 7,
334
   "id": "9c2ad5a0-2f24-442e-a46a-3a265ef873f6",
335
   "metadata": {},
336
   "outputs": [
337
    {
338
     "data": {
339
      "text/plain": [
340
       "()"
341
      ]
342
     },
343
     "execution_count": 7,
344
     "metadata": {},
345
     "output_type": "execute_result"
346
    }
347
   ],
348
   "source": [
349
    "sentence = \"A knowledge graph, also known as a semantic network\\\n",
350
    ", represents real-world entities and their relationships\"\n",
351
    "\n",
352
    "doc = nlp(sentence)\n",
353
    "doc.ents"
354
   ]
355
  },
356
  {
357
   "cell_type": "code",
358
   "execution_count": 8,
359
   "id": "dc7964d3-61f6-436e-bfb0-080cd46c41bf",
360
   "metadata": {},
361
   "outputs": [
362
    {
363
     "data": {
364
      "text/plain": [
365
       "(J.K., one, Harry Potter')"
366
      ]
367
     },
368
     "execution_count": 8,
369
     "metadata": {},
370
     "output_type": "execute_result"
371
    }
372
   ],
373
   "source": [
374
    "sentence = \"For example, a node representing an author like 'J.K. Rowling'\\\n",
375
    "can be connected to another node representing one of her books, 'Harry Potter'\\\n",
376
    ", with the edge 'author of'\"\n",
377
    "\n",
378
    "doc = nlp(sentence)\n",
379
    "doc.ents"
380
   ]
381
  },
382
  {
383
   "cell_type": "markdown",
384
   "id": "11b7737d-d5a7-4aa4-bdea-b0d12d1589ed",
385
   "metadata": {},
386
   "source": [
    "As we can see from the examples above, entities are not simply nouns. They're direct or indirect references to specific people, places, or concepts."
388
   ]
389
  },
390
  {
391
   "cell_type": "markdown",
392
   "id": "c8e69fa8-defa-4f47-b8cc-cfcfa4cbcfba",
393
   "metadata": {},
394
   "source": [
395
    "### Entity Density\n",
396
    "\n",
    "Now that we know what tokens and entities are, we can move on to our last concept: entity density. Entity density is simply the mean number of entities per token in a piece of text."
398
   ]
399
  },
400
  {
401
   "cell_type": "code",
402
   "execution_count": 9,
403
   "id": "15accf59-a264-4e1c-9b77-8b486e423f95",
404
   "metadata": {},
405
   "outputs": [],
406
   "source": [
407
    "import math\n",
408
    "nlp = spacy.load(\"en_core_web_sm\")\n",
409
    "\n",
410
    "def calculate_entity_density(sentence:str):\n",
411
    "    tokens = nltk.word_tokenize(sentence)\n",
412
    "    entities = nlp(sentence).ents\n",
413
    "    entity_density = round(len(entities)/len(tokens),3)\n",
414
    "\n",
415
    "    return len(tokens),len(entities),entity_density"
416
   ]
417
  },
418
  {
419
   "cell_type": "code",
420
   "execution_count": 10,
421
   "id": "648206dc-a734-49eb-bd2e-8b46a914cacf",
422
   "metadata": {},
423
   "outputs": [
424
    {
425
     "data": {
426
      "text/plain": [
427
       "(17, 0, 0.0)"
428
      ]
429
     },
430
     "execution_count": 10,
431
     "metadata": {},
432
     "output_type": "execute_result"
433
    }
434
   ],
435
   "source": [
436
    "sentence_1 = \"A knowledge graph, also known as a semantic network\\\n",
437
    ", represents real-world entities and their relationships\"\n",
438
    "\n",
439
    "calculate_entity_density(sentence_1)"
440
   ]
441
  },
442
  {
443
   "cell_type": "code",
444
   "execution_count": 11,
445
   "id": "9fd5717f-202a-4b39-976c-a32d0f1a4b29",
446
   "metadata": {},
447
   "outputs": [
448
    {
449
     "data": {
450
      "text/plain": [
451
       "(11, 3, 0.273)"
452
      ]
453
     },
454
     "execution_count": 11,
455
     "metadata": {},
456
     "output_type": "execute_result"
457
    }
458
   ],
459
   "source": [
460
    "sentence_2 = \"Apple is looking at buying U.K. startup for $1 billion\"\n",
461
    "\n",
462
    "calculate_entity_density(sentence_2)"
463
   ]
464
  },
465
  {
466
   "cell_type": "markdown",
467
   "id": "1d9ac4df-5e7a-4186-83f2-bb542dba6189",
468
   "metadata": {},
469
   "source": [
    "This gives us a quantitative way to compare two different sentences or summaries.\n",
    "\n",
    "We want summaries that are more entity-dense."
473
   ]
474
  },
475
  {
476
   "cell_type": "code",
477
   "execution_count": 12,
478
   "id": "ae27bcc5-da32-4aaa-9ebb-dbc21700ee14",
479
   "metadata": {},
480
   "outputs": [
481
    {
482
     "data": {
483
      "text/plain": [
484
       "((82, 11, 0.134), (71, 17, 0.239))"
485
      ]
486
     },
487
     "execution_count": 12,
488
     "metadata": {},
489
     "output_type": "execute_result"
490
    }
491
   ],
492
   "source": [
493
    "summary_1 = \"\"\"\n",
494
    "This article discusses an incident that occurred during the Chinese Grand Prix\n",
495
    "involving two racing drivers, Jenson Button and Pastor Maldonado. The two were \n",
496
    "competing for the 13th place when Button collided with Maldonado's vehicle, \n",
497
    "causing damage to both cars. The incident resulted in a penalty for Button, \n",
498
    "who was demoted to 14th place. Maldonado, on the other hand, had to retire from \n",
499
    "the race due to the damage his car sustained.\n",
500
    "\"\"\"\n",
501
    "\n",
502
    "summary_2 = \"\"\"\n",
503
    "Jenson Button's McLaren collided with Pastor Maldonado's Lotus during the Chinese \n",
504
    "Grand Prix, causing front wing damage to Button's car and rear-end damage to \n",
505
    "Maldonado's, forcing his retirement. Button received a five-second penalty and \n",
506
    "two superlicence points, dropping himto 14th. Fernando Alonso advanced two places, \n",
507
    "while Button was lapped by Nico Rosberg and Alonso by Sebastian Vettel and \n",
508
    "Kimi Raikkonen.\n",
509
    "\"\"\"\n",
510
    "\n",
511
    "calculate_entity_density(summary_1),calculate_entity_density(summary_2)"
512
   ]
513
  },
514
  {
515
   "cell_type": "markdown",
516
   "id": "9d59c170-a4fb-4687-8012-9cb0ed807a8c",
517
   "metadata": {},
518
   "source": [
    "We can see that the second summary has almost twice the entity density of the first (0.239 vs. 0.134), so it is the more *entity-dense* of the two."
520
   ]
521
  },
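As a quick sanity check on the comparison above, the densities can be recomputed by hand from the token and entity counts in the notebook output. This is only the arithmetic, with no tokenizer or NER model involved; `entity_density` is a hypothetical helper name introduced here for illustration.

```python
def entity_density(num_entities: int, num_tokens: int) -> float:
    """Entities per token, rounded to three places like calculate_entity_density."""
    return round(num_entities / num_tokens, 3)


# (tokens, entities) counts copied from the output shown above:
# summary_1 -> (82, 11), summary_2 -> (71, 17)
first = entity_density(num_entities=11, num_tokens=82)
second = entity_density(num_entities=17, num_tokens=71)
ratio = second / first

print(first, second, round(ratio, 2))  # prints: 0.134 0.239 1.78
```

The second summary packs more entities into fewer tokens, so its density is roughly 1.8 times that of the first.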
522
  {
523
   "cell_type": "markdown",
524
   "id": "112b2f52-b15a-46d5-9767-e8a95d1f674f",
525
   "metadata": {},
526
   "source": [
527
    "## Implementation\n",
528
    "### Data Classes\n",
529
    "\n",
    "Let's start by walking through the data models that we'll be using as the `response_model` for our OpenAI function calls. We'll need two classes:\n",
    "\n",
    "1. `InitialSummary`: the lengthy and overly verbose first summary of the article\n",
    "2. `RewrittenSummary`: a denser rewrite of the previous summary that adds the missing entities"
534
   ]
535
  },
536
  {
537
   "cell_type": "code",
538
   "execution_count": 13,
539
   "id": "2ac40d98-2843-4c9c-bc18-50ab1d4ffa94",
540
   "metadata": {},
541
   "outputs": [],
542
   "source": [
543
    "from pydantic import BaseModel,Field,field_validator\n",
544
    "from typing import List"
545
   ]
546
  },
547
  {
548
   "cell_type": "code",
549
   "execution_count": 14,
550
   "id": "486e85fc-3fc8-4143-bdf4-d7cef91a37cf",
551
   "metadata": {},
552
   "outputs": [],
553
   "source": [
554
    "class InitialSummary(BaseModel):\n",
555
    "    \"\"\"\n",
556
    "    This is an initial summary which should be long ( 4-5 sentences, ~80 words)\n",
557
    "    yet highly non-specific, containing little information beyond the entities marked as missing.\n",
558
    "    Use overly verbose languages and fillers (Eg. This article discusses) to reach ~80 words.\n",
559
    "    \"\"\"\n",
560
    "\n",
561
    "    summary: str = Field(\n",
562
    "        ...,\n",
563
    "        description=\"This is a summary of the article provided which is overly verbose and uses fillers. \\\n",
564
    "        It should be roughly 80 words in length\",\n",
565
    "    )"
566
   ]
567
  },
568
  {
569
   "cell_type": "markdown",
570
   "id": "c3b8e382-dcfc-487f-8141-6dd9093c01b0",
571
   "metadata": {},
572
   "source": [
    "Pydantic is extremely handy because it allows us to do two things:\n",
    "\n",
    "1. We can validate that our generated outputs are consistent with what we want, **and write the validation logic in plain Python**\n",
    "2. We can export the generated class definition into a simple schema that fits in perfectly with OpenAI's function calling"
577
   ]
578
  },
579
  {
580
   "cell_type": "code",
581
   "execution_count": 15,
582
   "id": "609a9edd-7c4e-4586-a5be-037c4c3c7ff7",
583
   "metadata": {},
584
   "outputs": [
585
    {
586
     "data": {
587
      "text/plain": [
588
       "{'description': 'This is an initial summary which should be long ( 4-5 sentences, ~80 words)\\nyet highly non-specific, containing little information beyond the entities marked as missing.\\nUse overly verbose languages and fillers (Eg. This article discusses) to reach ~80 words.',\n",
589
       " 'properties': {'summary': {'description': 'This is a summary of the article provided which is overly verbose and uses fillers.         It should be roughly 80 words in length',\n",
590
       "   'title': 'Summary',\n",
591
       "   'type': 'string'}},\n",
592
       " 'required': ['summary'],\n",
593
       " 'title': 'InitialSummary',\n",
594
       " 'type': 'object'}"
595
      ]
596
     },
597
     "execution_count": 15,
598
     "metadata": {},
599
     "output_type": "execute_result"
600
    }
601
   ],
602
   "source": [
603
    "InitialSummary.model_json_schema()"
604
   ]
605
  },
606
  {
607
   "cell_type": "markdown",
608
   "id": "e910611e-2033-4db5-91b6-ebc97c11d252",
609
   "metadata": {},
610
   "source": [
    "It's important here to provide a good description of the overall class and the respective fields. This is because all of the descriptions that we write for the individual fields and the class itself **are directly used by the LLM when generating outputs**.\n",
    "\n",
    "Now, as a quick recap, when we rewrite our summaries at each step, we do a few things:\n",
    "\n",
    "1. We identify relevant entities from the original article that are **missing from our current summary**.\n",
    "2. We then rewrite our summary, making sure to include as many of these new entities as possible, with the goal of increasing the entity density of the new summary.\n",
    "3. We then make sure that every entity in our previous summary is still present in the new rewritten summary.\n",
618
    "\n",
619
    "We can express this in the form of the data model seen below called `RewrittenSummary`."
620
   ]
621
  },
622
  {
623
   "cell_type": "code",
624
   "execution_count": 16,
625
   "id": "d3d589ca-00cd-42cc-9a7a-a8f0620b4ea1",
626
   "metadata": {},
627
   "outputs": [],
628
   "source": [
629
    "class RewrittenSummary(BaseModel):\n",
630
    "    \"\"\"\n",
631
    "    This is a new, denser summary of identical length which covers every entity\n",
632
    "    and detail from the previous summary plus the Missing Entities.\n",
633
    "\n",
634
    "    Guidelines\n",
635
    "    - Make every word count : Rewrite the previous summary to improve flow and make space for additional entities\n",
636
    "    - Never drop entities from the previous summary. If space cannot be made, add fewer new entities.\n",
637
    "    - The new summary should be highly dense and concise yet self-contained, eg., easily understood without the Article.\n",
638
    "    - Make space with fusion, compression, and removal of uninformative phrases like \"the article discusses\"\n",
639
    "    - Missing entities can appear anywhere in the new summary\n",
640
    "\n",
    "    An Entity is a real-world object that's assigned a name - for example, a person, a country, a product or a book title.\n",
642
    "    \"\"\"\n",
643
    "\n",
644
    "    summary: str = Field(\n",
645
    "        ...,\n",
646
    "        description=\"This is a new, denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities. It should have the same length ( ~ 80 words ) as the previous summary and should be easily understood without the Article\",\n",
647
    "    )\n",
648
    "    absent: List[str] = Field(\n",
650
    "        default_factory=list,\n",
651
    "        description=\"this is a list of Entities found absent from the new summary that were present in the previous summary\",\n",
652
    "    )\n",
653
    "    missing: List[str] = Field(\n",
654
    "        default_factory=list,\n",
655
    "        description=\"This is a list of 1-3 informative Entities from the Article that are missing from the new summary which should be included in the next generated summary.\",\n",
656
    "    )"
657
   ]
658
  },
659
  {
660
   "cell_type": "markdown",
661
   "id": "06529289-309f-4143-979b-8d4119b7d141",
662
   "metadata": {},
663
   "source": [
    "We'd also want our rewritten summary to have:\n",
    "\n",
    "1. No dropped entities: `absent` should have a length of 0\n",
    "2. New entities to add in the next rewrite: `missing` should have at least 1 entry\n",
    "3. A minimum length of 60 tokens and an entity density of at least 0.08 (**NOTE**: the 60-token and 0.08 cut-offs are chosen arbitrarily. Feel free to raise them, although this might require you to add more retries in your code.)\n",
    "\n",
    "We can enforce these requirements with the `field_validator` that we learnt about in the previous lesson. It lets us attach a validator to a specific field to ensure it meets our requirements.\n",
    "\n",
    "This gives us the final definition of our `RewrittenSummary` class, as seen below."
673
   ]
674
  },
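The note about retries can be pictured with a plain-Python loop: generate a candidate, validate it, and feed the error message back when validation fails. This is a conceptual sketch only, not how Instructor is implemented. `generate_with_retries`, `validate_summary` and `fake_model` are hypothetical names, and whitespace splitting stands in for the NLTK tokenizer so that the snippet has no dependencies.

```python
from typing import Callable, List, Optional

MIN_TOKENS = 60  # same arbitrary cut-off as in the text


def validate_summary(summary: str) -> None:
    """Raise ValueError when the summary is too short (whitespace tokens for simplicity)."""
    if len(summary.split()) < MIN_TOKENS:
        raise ValueError("The current summary is too short.")


def generate_with_retries(
    generate: Callable[[Optional[str]], str], max_retries: int
) -> str:
    """Call `generate`, passing back the last validation error, until a candidate passes."""
    feedback: Optional[str] = None
    for _ in range(max_retries + 1):
        candidate = generate(feedback)
        try:
            validate_summary(candidate)
            return candidate
        except ValueError as err:
            feedback = str(err)  # a real client would send this back to the model
    raise RuntimeError(f"validation still failing after {max_retries} retries: {feedback}")


# Hypothetical stand-in for a model call: too short at first, longer after feedback.
calls: List[Optional[str]] = []


def fake_model(feedback: Optional[str]) -> str:
    calls.append(feedback)
    return "word " * (80 if feedback else 10)


result = generate_with_retries(fake_model, max_retries=2)
print(len(result.split()), "tokens after", len(calls), "calls")
```

Stricter cut-offs make the first candidate more likely to fail, which is why raising them may call for a larger retry budget.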
675
  {
676
   "cell_type": "code",
677
   "execution_count": 17,
678
   "id": "8f81f281-0950-4973-81b6-e1acd8b35aa0",
679
   "metadata": {},
680
   "outputs": [],
681
   "source": [
682
    "class RewrittenSummary(BaseModel):\n",
683
    "    \"\"\"\n",
684
    "    This is a new, denser summary of identical length which covers every entity\n",
685
    "    and detail from the previous summary plus the Missing Entities.\n",
686
    "\n",
687
    "    Guidelines\n",
688
    "    - Make every word count : Rewrite the previous summary to improve flow and make space for additional entities\n",
689
    "    - Never drop entities from the previous summary. If space cannot be made, add fewer new entities.\n",
690
    "    - The new summary should be highly dense and concise yet self-contained, eg., easily understood without the Article.\n",
691
    "    - Make space with fusion, compression, and removal of uninformative phrases like \"the article discusses\"\n",
692
    "    - Missing entities can appear anywhere in the new summary\n",
693
    "\n",
    "    An Entity is a real-world object that's assigned a name - for example, a person, a country, a product or a book title.\n",
695
    "    \"\"\"\n",
696
    "\n",
697
    "    summary: str = Field(\n",
698
    "        ...,\n",
699
    "        description=\"This is a new, denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities. It should have the same length ( ~ 80 words ) as the previous summary and should be easily understood without the Article\",\n",
700
    "    )\n",
701
    "    absent: List[str] = Field(\n",
703
    "        default_factory=list,\n",
704
    "        description=\"this is a list of Entities found absent from the new summary that were present in the previous summary\",\n",
705
    "    )\n",
706
    "    missing: List[str] = Field(\n",
707
    "        default_factory=list,\n",
708
    "        description=\"This is a list of 1-3 informative Entities from the Article that are missing from the new summary which should be included in the next generated summary.\",\n",
709
    "    )\n",
710
    "        \n",
711
    "    \n",
712
    "    @field_validator(\"summary\")\n",
713
    "    def min_length(cls, v: str):\n",
714
    "        tokens = nltk.word_tokenize(v) \n",
715
    "        num_tokens = len(tokens)\n",
716
    "        if num_tokens < 60:\n",
717
    "            raise ValueError(\n",
718
    "                \"The current summary is too short. Please make sure that you generate a new summary that is around 80 words long.\"\n",
719
    "            )\n",
720
    "        return v\n",
721
    "    \n",
722
    "    @field_validator(\"missing\")\n",
723
    "    def has_missing_entities(cls, missing_entities: List[str]):\n",
724
    "        if len(missing_entities) == 0:\n",
725
    "            raise ValueError(\n",
726
    "                \"You must identify 1-3 informative Entities from the Article which are missing from the previously generated summary to be used in a new summary\"\n",
727
    "            )\n",
728
    "        return missing_entities\n",
729
    "    \n",
730
    "    @field_validator(\"absent\")\n",
731
    "    def has_no_absent_entities(cls, absent_entities: List[str]):\n",
732
    "        absent_entity_string = \",\".join(absent_entities)\n",
733
    "        if len(absent_entities) > 0:\n",
734
    "            print(f\"Detected absent entities of {absent_entity_string}\")\n",
735
    "            raise ValueError(\n",
736
    "                f\"Do not omit the following Entities {absent_entity_string} from the new summary\"\n",
737
    "            )\n",
738
    "        return absent_entities\n",
739
    "    \n",
740
    "    @field_validator(\"summary\")\n",
741
    "    def min_entity_density(cls, v: str):\n",
742
    "        tokens = nltk.word_tokenize(v)\n",
743
    "        num_tokens = len(tokens)\n",
744
    "    \n",
745
    "        # Extract Entities\n",
746
    "        doc = nlp(v) \n",
747
    "        num_entities = len(doc.ents)\n",
748
    "    \n",
749
    "        density = num_entities / num_tokens\n",
750
    "        if density < 0.08: \n",
751
    "            raise ValueError(\n",
752
    "                f\"The summary of {v} has too few entities. Please regenerate a new summary with more new entities added to it. Remember that new entities can be added at any point of the summary.\"\n",
753
    "            )\n",
754
    "    \n",
755
    "        return v"
756
   ]
757
  },
758
  {
759
   "cell_type": "markdown",
760
   "id": "3e182039-ad7f-4918-b2f9-4c567d95a890",
761
   "metadata": {},
762
   "source": [
763
    "### Putting it all together\n",
764
    "\n",
    "Now that we have our models, let's implement a function to summarize a piece of text using Chain Of Density summarization."
766
   ]
767
  },
768
  {
769
   "cell_type": "code",
770
   "execution_count": 18,
771
   "id": "fc66ffcc-db30-429a-8007-4d4a24bf2426",
772
   "metadata": {},
773
   "outputs": [],
774
   "source": [
775
    "from openai import OpenAI\n",
776
    "import instructor\n",
777
    "\n",
778
    "client = instructor.patch(OpenAI()) \n",
779
    "\n",
780
    "def summarize_article(article: str, summary_steps: int = 3):\n",
781
    "    summary_chain = []\n",
782
    "    # We first generate an initial summary\n",
783
    "    summary: InitialSummary = client.chat.completions.create(  \n",
784
    "        model=\"gpt-4-1106-preview\",\n",
785
    "        response_model=InitialSummary,\n",
786
    "        messages=[\n",
787
    "            {\n",
788
    "                \"role\": \"system\",\n",
    "                \"content\": \"Write a summary about the article that is long (4-5 sentences) yet highly non-specific. Use overly verbose language and fillers (e.g., 'this article discusses') to reach ~80 words\",\n",
790
    "            },\n",
791
    "            {\"role\": \"user\", \"content\": f\"Here is the Article: {article}\"},\n",
792
    "            {\n",
793
    "                \"role\": \"user\",\n",
794
    "                \"content\": \"The generated summary should be about 80 words.\",\n",
795
    "            },\n",
796
    "        ],\n",
797
    "        max_retries=2,\n",
798
    "    )\n",
799
    "    prev_summary = None\n",
800
    "    summary_chain.append(summary.summary)\n",
    "    for i in range(summary_steps):\n",
    "        # From the second rewrite onward, pass along the Missing Entities reported by the previous rewrite\n",
    "        missing_entity_message = (\n",
    "            []\n",
    "            if prev_summary is None\n",
    "            else [\n",
    "                {\n",
    "                    \"role\": \"user\",\n",
    "                    \"content\": f\"Please include these Missing Entities: {','.join(prev_summary.missing)}\",\n",
    "                },\n",
    "            ]\n",
    "        )\n",
    "        new_summary: RewrittenSummary = client.chat.completions.create(\n",
    "            model=\"gpt-4-1106-preview\",\n",
    "            messages=[\n",
    "                {\n",
    "                    \"role\": \"system\",\n",
    "                    \"content\": \"\"\"\n",
    "                You are going to generate an increasingly concise, entity-dense summary of the following article.\n",
    "\n",
    "                Perform the following two tasks\n",
    "                - Identify 1-3 informative entities from the following article which are missing from the previous summary\n",
    "                - Write a new denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities\n",
    "\n",
    "                Guidelines\n",
    "                - Make every word count: re-write the previous summary to improve flow and make space for additional entities\n",
    "                - Make space with fusion, compression, and removal of uninformative phrases like \"the article discusses\".\n",
    "                - The summaries should become highly dense and concise yet self-contained, e.g., easily understood without the Article.\n",
    "                - Missing entities can appear anywhere in the new summary\n",
    "                - Never drop entities from the previous summary. If space cannot be made, add fewer new entities.\n",
    "                \"\"\",\n",
    "                },\n",
    "                {\"role\": \"user\", \"content\": f\"Here is the Article: {article}\"},\n",
    "                {\n",
    "                    \"role\": \"user\",\n",
    "                    \"content\": f\"Here is the previous summary: {summary_chain[-1]}\",\n",
    "                },\n",
    "                *missing_entity_message,\n",
    "            ],\n",
    "            max_retries=3,\n",
    "            max_tokens=1000,\n",
    "            response_model=RewrittenSummary,\n",
    "        )\n",
    "        summary_chain.append(new_summary.summary)\n",
    "        prev_summary = new_summary\n",
    "\n",
    "    return summary_chain"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0a034f57-1299-4fae-8fd5-f2d9a9ca985b",
   "metadata": {},
   "source": [
    "### Trial Run\n",
    "\n",
    "Let's try running this on some sample text, which we can load from our repository. We've provided a sample article in a file called `article.txt`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "6044c72b-fdc7-4cea-893b-a408c7b60230",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the full sample article in read-only mode\n",
    "with open(\"./assets/article.txt\", \"r\") as file:\n",
    "    article = file.read()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2302dedc-f22a-41e9-b9c2-1579a4e8f623",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "summaries = summarize_article(article)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a17de9a7-17c0-4b5f-b788-74a7347c4952",
   "metadata": {},
   "source": [
    "We can see that it took roughly 40 seconds to run an iterative Chain Of Density on this article. But does our approach increase the entity density of each successive summary? We can check by calculating the entity density of each summary in our list of summaries using the `calculate_entity_density` function we defined above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "99f7361c-2737-44ef-8515-1919e009e718",
   "metadata": {},
   "outputs": [],
   "source": [
    "for index, summary in enumerate(summaries):\n",
    "    tokens, entity, density = calculate_entity_density(summary)\n",
    "    print(f\"Summary {index+1} -> Results (Tokens: {tokens}, Entity Count: {entity}, Density: {density})\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "70571151-f378-4936-889d-0e1ca5082307",
   "metadata": {},
   "source": [
    "We can also look at the summaries themselves to see whether they improve qualitatively."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7149f4d-41ca-4cb1-8438-65cd97cb4246",
   "metadata": {},
   "outputs": [],
   "source": [
    "for summary in summaries:\n",
    "    print(f\"\\n{summary}\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba77b7b2-152a-4ad0-9076-4c59a454bed0",
   "metadata": {},
   "source": [
    "As we can see, the summaries progressively introduce more entities and become more entity-dense. We've generated 4 summaries here (the initial summary plus 3 rewrites), but 2-3 may be enough if latency is a significant issue."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c2932bc2-7e93-4434-b9ad-a68981630961",
   "metadata": {},
   "source": [
    "## Future Steps"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cf93e36c-f28a-4824-8b15-b23478577ce7",
   "metadata": {},
   "source": [
    "This guide showed how to generate complex summaries using Chain Of Density summarization. We covered how to apply more complex validators, using `spaCy` and `NLTK` to enforce a minimum number of tokens and a minimum entity density, as well as how you might apply `Instructor` in a multi-stage process.\n",
    "\n",
    "Building in validation at each step of the process helps to improve the performance of your LLM across various tasks.\n",
    "\n",
    "For those looking to delve deeper, here are some tasks to explore.\n",
    "\n",
    "- **Validate Increasing Entity Density**: `Pydantic` validators can receive an arbitrary Python dictionary as validation context. Use this context to compare the entity density of the previous summary with that of the new summary, validating that our model has generated a more entity-dense rewrite.\n",
    "- **Fine-Tuning**: `Instructor` comes with a simple-to-use interface to help you fine-tune other OpenAI models for your needs. This can be accomplished by capturing the outputs of LLMs using the `Instructions` module to generate training data for fine-tuning. In this specific case, fine-tuning a model to generate dense summaries could decrease latency and cost significantly by replacing the iterative LLM calls that we make.\n",
    "\n",
    "By accomplishing these tasks, you'll gain practical experience in tuning your models to suit your specific tasks, as well as in building more complex validation processes when working with LLMs to ensure more reliable, accurate and consistent outputs."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}