{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "id": "s4eK-unQlKTF"
   },
   "source": [
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/generation/llama-index/llama-index-intro.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/generation/llama-index/llama-index-intro.ipynb)\n",
    "\n",
    "# Llama-Index with Pinecone\n",
    "\n",
    "**Note:** This notebook is built to run end-to-end in Google Colab. Users might run into issues when running locally, depending on their local environment setup.\n",
    "\n",
    "In this notebook, we will demo how to use the `llama-index` (previously GPT-index) library with Pinecone for semantic search. This notebook is an introduction and does not cover the more advanced features of `llama-index`. You will find these in future releases in the [Pinecone examples repo](https://github.com/pinecone-io/examples/tree/master/learn/generation).\n",
    "\n",
    "We will start by installing the necessary libraries and initializing Pinecone."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "id": "YenjP38jtRdE"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.2.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.3.2\u001b[0m\n",
      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "!pip install -qU llama-index==0.9.29 datasets==2.16.1 pinecone-client==3.1.0 openai==1.7.1 transformers==4.36.2"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "id": "KjeiHqOIoGnI"
   },
   "source": [
    "We can go ahead and load the **SQuAD** dataset, which contains question-and-answer pairs based on Wikipedia articles. We'll then convert the dataset into a pandas DataFrame and keep only the rows with unique 'context' fields, which are the text passages that the questions are based on."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 244
    },
    "id": "TlKkc8Jbqiin",
    "outputId": "34a6aa74-477e-45bf-fede-4d89603c291b"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>context</th>\n",
       "      <th>title</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5733be284776f41900661182</td>\n",
       "      <td>Architecturally, the school has a Catholic cha...</td>\n",
       "      <td>University_of_Notre_Dame</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5733bf84d058e614000b61be</td>\n",
       "      <td>As at most other universities, Notre Dame's st...</td>\n",
       "      <td>University_of_Notre_Dame</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>5733bed24776f41900661188</td>\n",
       "      <td>The university is the major seat of the Congre...</td>\n",
       "      <td>University_of_Notre_Dame</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>5733a6424776f41900660f51</td>\n",
       "      <td>The College of Engineering was established in ...</td>\n",
       "      <td>University_of_Notre_Dame</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>5733a70c4776f41900660f64</td>\n",
       "      <td>All of Notre Dame's undergraduate students are...</td>\n",
       "      <td>University_of_Notre_Dame</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                          id  \\\n",
       "0   5733be284776f41900661182   \n",
       "5   5733bf84d058e614000b61be   \n",
       "10  5733bed24776f41900661188   \n",
       "15  5733a6424776f41900660f51   \n",
       "20  5733a70c4776f41900660f64   \n",
       "\n",
       "                                              context  \\\n",
       "0   Architecturally, the school has a Catholic cha...   \n",
       "5   As at most other universities, Notre Dame's st...   \n",
       "10  The university is the major seat of the Congre...   \n",
       "15  The College of Engineering was established in ...   \n",
       "20  All of Notre Dame's undergraduate students are...   \n",
       "\n",
       "                       title  \n",
       "0   University_of_Notre_Dame  \n",
       "5   University_of_Notre_Dame  \n",
       "10  University_of_Notre_Dame  \n",
       "15  University_of_Notre_Dame  \n",
       "20  University_of_Notre_Dame  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from datasets import load_dataset\n",
    "\n",
    "data = load_dataset('squad', split='train')\n",
    "data = data.to_pandas()[['id', 'context', 'title']]\n",
    "data.drop_duplicates(subset='context', keep='first', inplace=True)\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "ZT6Z5gW4oTkG",
    "outputId": "c4797ae0-4091-4b8e-fa63-293174deb073"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "18891"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(data)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "id": "IK90FFhf42hd"
   },
   "source": [
    "This code transforms our DataFrame into a list of Document objects, ready for indexing with llama_index. Each document contains the text passage, a unique id, and an extra field for the article title."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "uXOYMGfLtO-Z",
    "outputId": "62466814-5797-40f4-c606-a163c44581e4"
   },
   "outputs": [],
   "source": [
    "from llama_index import Document\n",
    "\n",
    "docs = []\n",
    "\n",
    "# wrap each passage in a Document, keeping its id and article title\n",
    "for i, row in data.iterrows():\n",
    "    docs.append(Document(\n",
    "        text=row['context'],\n",
    "        doc_id=row['id'],\n",
    "        extra_info={'title': row['title']}\n",
    "    ))\n",
    "docs[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "hG81jmfCobYF",
    "outputId": "e8813daa-98ad-4f37-a0dd-560046a1073a"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "18891"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(docs)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "id": "uQ4d9UlS5OeU"
   },
   "source": [
    "Here, we're setting up the OpenAI API key and initializing a `SimpleNodeParser`. This parser processes our list of `Document` objects into 'nodes', which are the basic units that `llama_index` uses for indexing and querying. The first node is displayed below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "id": "1MSXio_atoSq"
   },
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ['OPENAI_API_KEY'] = '<your OpenAI API key>'  # platform.openai.com"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "gKTwFpkmte7o",
    "outputId": "f3774aa6-4fca-44a8-ae1f-58063eae50c6"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Node(text='Architecturally, the school has a Catholic character. Atop the Main Building\\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', doc_id='8968aa3c-abbe-4fe0-ad91-be0298830542', embedding=None, doc_hash='ba8eb7d843763d4f7e7f71f0be64c70c4597fd712bfcce77007fc4c821b8d1b6', extra_info={'title': 'University_of_Notre_Dame'}, node_info={'start': 0, 'end': 695}, relationships={<DocumentRelationship.SOURCE: '1'>: '5733be284776f41900661182'})"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from llama_index.node_parser import SimpleNodeParser\n",
    "\n",
    "parser = SimpleNodeParser()\n",
    "\n",
    "nodes = parser.get_nodes_from_documents(docs)\n",
    "nodes[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "LQL4X2_WodU2",
    "outputId": "34d05a8e-1c39-4248-e4d6-25433fd31b58"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "18892"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(nodes)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "id": "JQrAcAdZ1fS9"
   },
   "source": [
    "### Indexing in Pinecone\n",
    "\n",
    "Pinecone is a managed vector database service designed for machine learning applications. We use it here to store and retrieve the embeddings generated by our embedding model, enabling efficient and scalable semantic search.\n",
    "\n",
    "We initialize Pinecone with the API key that we [get for **free** in the console](https://app.pinecone.io/), then create a new index. The index has a dimension of 1536, matching the size of the vectors produced by the `text-embedding-ada-002` model we'll be using, and uses cosine similarity, which is the recommended metric for comparing those vectors."
   ]
  },
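  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the choice of metric concrete, the cell below is a minimal sketch of cosine similarity in plain Python. The `cosine_similarity` helper and the toy 3-dimensional vectors are illustrative assumptions and are not part of Pinecone or `llama-index`; Pinecone computes the metric internally over the 1536-dimensional embeddings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "\n",
    "def cosine_similarity(a, b):\n",
    "    # dot product divided by the product of the two vector lengths\n",
    "    dot = sum(x * y for x, y in zip(a, b))\n",
    "    norm_a = math.sqrt(sum(x * x for x in a))\n",
    "    norm_b = math.sqrt(sum(y * y for y in b))\n",
    "    return dot / (norm_a * norm_b)\n",
    "\n",
    "\n",
    "# toy vectors standing in for real embeddings (hypothetical values)\n",
    "query = [1.0, 0.0, 1.0]\n",
    "same_direction = [2.0, 0.0, 2.0]\n",
    "orthogonal = [0.0, 1.0, 0.0]\n",
    "\n",
    "# vectors pointing the same way score close to 1, unrelated ones score 0\n",
    "print(cosine_similarity(query, same_direction))\n",
    "print(cosine_similarity(query, orthogonal))"
   ]
  },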
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pinecone import Pinecone\n",
    "from pinecone import ServerlessSpec\n",
    "\n",
    "\n",
    "# find API key in console at app.pinecone.io\n",
    "os.environ['PINECONE_API_KEY'] = '<your API key here>'\n",
    "\n",
    "# initialize connection to pinecone\n",
    "pc = Pinecone(api_key=os.getenv('PINECONE_API_KEY'))\n",
    "\n",
    "\n",
    "# create the index if it does not exist already\n",
    "index_name = 'llama-index-intro'\n",
    "existing_indexes = [i.get('name') for i in pc.list_indexes()]\n",
    "\n",
    "if index_name not in existing_indexes:\n",
    "    pc.create_index(\n",
    "        name=index_name,\n",
    "        dimension=1536,\n",
    "        metric='cosine',\n",
    "        spec=ServerlessSpec(cloud='aws', region='us-west-2')\n",
    "    )\n",
    "\n",
    "# connect to the index\n",
    "pinecone_index = pc.Index(index_name)"
   ]
  },
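  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A newly created index may take a moment to become available. As a hedged sketch, assuming the client's `describe_index` call reports a `ready` flag under `status` (as the v3 Pinecone client is understood to do), we can poll until the index is ready before writing to it. The one-second polling interval is an arbitrary choice."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "\n",
    "# assumption: status['ready'] becomes True once the index accepts requests\n",
    "while not pc.describe_index(index_name).status['ready']:\n",
    "    time.sleep(1)"
   ]
  },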
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "id": "r4KQXwLh645t"
   },
   "source": [
    "Here, we're initializing a `PineconeVectorStore` with our previously created Pinecone index. This object will serve as the storage and retrieval interface for our document embeddings in Pinecone's vector database."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "id": "AO4eAvSn4t-R"
   },
   "outputs": [],
   "source": [
    "from llama_index.vector_stores import PineconeVectorStore\n",
    "\n",
    "# we can select a namespace (acts as a partition in an index)\n",
    "namespace = '' # default namespace\n",
    "\n",
    "# pass the namespace through so the setting above actually takes effect\n",
    "vector_store = PineconeVectorStore(pinecone_index=pinecone_index, namespace=namespace)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "id": "qotocu00ZjfF"
   },
   "source": [
    "Next we initialize the `GPTVectorStoreIndex` with our list of `Document` objects, using the `PineconeVectorStore` as storage and the `OpenAIEmbedding` model for embeddings.\n",
    "\n",
    "`StorageContext` is used to configure the storage setup, and `ServiceContext` sets up the embedding model. The `GPTVectorStoreIndex` handles the indexing and querying process, making use of the provided storage and service contexts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "id": "V64_Y5FZqJ8w"
   },
   "outputs": [],
   "source": [
    "from llama_index import GPTVectorStoreIndex, StorageContext, ServiceContext\n",
    "from llama_index.embeddings.openai import OpenAIEmbedding\n",
    "\n",
    "# setup our storage (vector db)\n",
    "storage_context = StorageContext.from_defaults(\n",
    "    vector_store=vector_store\n",
    ")\n",
    "# setup the index/query process, i.e. the embedding model (and completion if used)\n",
    "embed_model = OpenAIEmbedding(model='text-embedding-ada-002', embed_batch_size=100)\n",
    "service_context = ServiceContext.from_defaults(embed_model=embed_model)\n",
    "\n",
    "index = GPTVectorStoreIndex.from_documents(\n",
    "    docs, storage_context=storage_context,\n",
    "    service_context=service_context\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "id": "Z7veOHn3Zyv1"
   },
   "source": [
    "Finally, we can build a query engine from the `index` we built and use this engine to perform a query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "4loa7g_y5x2v",
    "outputId": "e87ab48f-f641-46a1-f06b-be2d5daf4a0b"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "The College of Engineering was established in 1920.\n"
     ]
    }
   ],
   "source": [
    "query_engine = index.as_query_engine()\n",
    "res = query_engine.query(\"in what year was the college of engineering established at the University of Notre Dame?\")\n",
    "print(res)"
   ]
  },
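  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see which passages the answer was drawn from, we can inspect the retrieved nodes. This is a sketch under assumptions: it relies on `as_query_engine` accepting a `similarity_top_k` argument and on the response exposing a `source_nodes` list of scored nodes, as in the `llama-index` version pinned above. The value `3` is an arbitrary choice."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# assumption: similarity_top_k controls how many passages are retrieved\n",
    "query_engine = index.as_query_engine(similarity_top_k=3)\n",
    "res = query_engine.query('in what year was the college of engineering established at the University of Notre Dame?')\n",
    "\n",
    "for match in res.source_nodes:\n",
    "    # each match pairs a retrieved node with its similarity score\n",
    "    print(match.score, match.node.metadata.get('title'))"
   ]
  },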
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "id": "xQEzZJqWaAgS"
   },
   "source": [
    "That's our quick intro to using Llama-Index and Pinecone! Once we're done testing the system, we should delete the Pinecone index to save resources:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "id": "z89CVJMGaJ4j"
   },
   "outputs": [],
   "source": [
    "pc.delete_index(index_name)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "id": "DWtX3Nxl5V71"
   },
   "source": [
    "---"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "gpuType": "T4",
   "provenance": []
  },
  "gpuClass": "standard",
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}