milvus-io_bootcamp

imdb_milvus_client.ipynb
1
{
2
 "cells": [
3
  {
4
   "cell_type": "markdown",
5
   "id": "369c3444",
6
   "metadata": {},
7
   "source": [
8
    "# IMDB Vector Search using Milvus Client"
9
   ]
10
  },
11
  {
12
   "cell_type": "markdown",
13
   "id": "f6ffd11a",
14
   "metadata": {},
15
   "source": [
16
    "First, import some common libraries and define the data reading functions."
17
   ]
18
  },
19
  {
20
   "cell_type": "code",
21
   "execution_count": 1,
22
   "id": "d7570b2e",
23
   "metadata": {},
24
   "outputs": [],
25
   "source": [
26
    "# For colab install these libraries in this order:\n",
27
    "# !pip install pymilvus, langchain, torch, transformers, python-dotenv\n",
28
    "\n",
29
    "# Import common libraries.\n",
30
    "import sys, time, pprint\n",
31
    "import pandas as pd\n",
32
    "import numpy as np\n",
33
    "\n",
34
    "# Import custom functions for splitting and search.\n",
35
    "sys.path.append(\"..\")  # Adds higher directory to python modules path.\n",
36
    "import milvus_utilities as _utils"
37
   ]
38
  },
39
  {
40
   "cell_type": "markdown",
41
   "id": "fb844837",
42
   "metadata": {},
43
   "source": [
44
    "## Start up a Zilliz free tier cluster.\n",
45
    "\n",
46
    "Code in this notebook uses fully-managed Milvus on [Ziliz Cloud free trial](https://cloud.zilliz.com/login).  \n",
47
    "  1. Choose the default \"Starter\" option when you provision > Create collection > Give it a name > Create cluster and collection.  \n",
48
    "  2. On the Cluster main page, copy your `API Key` and store it locally in a .env variable.  See note below how to do that.\n",
49
    "  3. Also on the Cluster main page, copy the `Public Endpoint URI`.\n",
50
    "\n",
51
    "💡 Note: To keep your tokens private, best practice is to use an **env variable**.  See [how to save api key in env variable](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety). <br>\n",
52
    "\n",
53
    "In Jupyter, you also need a .env file (in same dir as notebooks) containing lines like this:\n",
54
    "- VARIABLE_NAME=value\n"
55
   ]
56
  },
57
  {
58
   "cell_type": "code",
59
   "execution_count": 2,
60
   "id": "0806d2db",
61
   "metadata": {},
62
   "outputs": [
63
    {
64
     "name": "stdout",
65
     "output_type": "stream",
66
     "text": [
67
      "Type of server: zilliz_cloud\n"
68
     ]
69
    }
70
   ],
71
   "source": [
72
    "# STEP 1. CONNECT TO MILVUS\n",
73
    "\n",
74
    "# !pip install pymilvus #python sdk for milvus\n",
75
    "from pymilvus import connections, utility\n",
76
    "\n",
77
    "import os\n",
78
    "from dotenv import load_dotenv\n",
79
    "load_dotenv()\n",
80
    "TOKEN = os.getenv(\"ZILLIZ_API_KEY\")\n",
81
    "\n",
82
    "# Connect to Zilliz cloud using endpoint URI and API key TOKEN.\n",
83
    "# TODO change this.\n",
84
    "CLUSTER_ENDPOINT=\"https://in03-xxxx.api.gcp-us-west1.zillizcloud.com:443\"\n",
85
    "connections.connect(\n",
86
    "  alias='default',\n",
87
    "  #  Public endpoint obtained from Zilliz Cloud\n",
88
    "  uri=CLUSTER_ENDPOINT,\n",
89
    "  # API key or a colon-separated cluster username and password\n",
90
    "  token=TOKEN,\n",
91
    ")\n",
92
    "\n",
93
    "# Check if the server is ready and get colleciton name.\n",
94
    "print(f\"Type of server: {utility.get_server_version()}\")"
95
   ]
96
  },
97
  {
98
   "cell_type": "markdown",
99
   "id": "b01d6622",
100
   "metadata": {},
101
   "source": [
102
    "## Load the Embedding Model checkpoint and use it to create vector embeddings\n",
103
    "**Embedding model:**  We will use the open-source [sentence transformers](https://www.sbert.net/docs/pretrained_models.html) available on HuggingFace to encode the documentation text.  We will download the model from HuggingFace and run it locally. \n",
104
    "\n",
105
    "Two model parameters of note below:\n",
106
    "1. EMBEDDING_LENGTH refers to the dimensionality or length of the embedding vector. In this case, the embeddings generated for EACH token in the input text will have the SAME length = 1024. This size of embedding is often associated with BERT-based models, where the embeddings are used for downstream tasks such as classification, question answering, or text generation. <br><br>\n",
107
    "2. MAX_SEQ_LENGTH is the maximum length the encoder model can handle for input sequences. In this case, if sequences longer than 512 tokens are given to the model, everything longer will be (silently!) chopped off.  This is the reason why a chunking strategy is needed to segment input texts into chunks with lengths that will fit in the model's input."
108
   ]
109
  },
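  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Sketch: what the two parameters mean.**  The cell below is a minimal, illustrative sketch, not the model's actual code path. It uses random numbers as stand-in token vectors (an assumption; no model is loaded) to show that mean pooling yields one 1024-dimensional vector regardless of input length, and a made-up list of 600 tokens to show what truncation at 512 tokens discards."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Stand-in token vectors (random, NOT real model output): 7 tokens x 1024 dims.\n",
    "rng = np.random.default_rng(0)\n",
    "token_vectors = rng.normal(size=(7, 1024))\n",
    "\n",
    "# Mean pooling averages over the token axis, giving one vector per text.\n",
    "sentence_vector = token_vectors.mean(axis=0)\n",
    "print(sentence_vector.shape)  # (1024,)\n",
    "\n",
    "# Truncation: tokens beyond the limit never reach the model.\n",
    "MAX_TOKENS = 512\n",
    "tokens = [f\"tok{i}\" for i in range(600)]  # hypothetical over-long input\n",
    "kept = tokens[:MAX_TOKENS]\n",
    "print(f\"dropped tokens: {len(tokens) - len(kept)}\")  # dropped tokens: 88"
   ]
  },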
110
  {
111
   "cell_type": "code",
112
   "execution_count": 3,
113
   "id": "dd2be7fd",
114
   "metadata": {},
115
   "outputs": [
116
    {
117
     "name": "stdout",
118
     "output_type": "stream",
119
     "text": [
120
      "device: cpu\n"
121
     ]
122
    },
123
    {
124
     "name": "stderr",
125
     "output_type": "stream",
126
     "text": [
127
      "No sentence-transformers model found with name /Users/christybergman/.cache/torch/sentence_transformers/WhereIsAI_UAE-Large-V1. Creating a new one with MEAN pooling.\n"
128
     ]
129
    },
130
    {
131
     "name": "stdout",
132
     "output_type": "stream",
133
     "text": [
134
      "<class 'sentence_transformers.SentenceTransformer.SentenceTransformer'>\n",
135
      "SentenceTransformer(\n",
136
      "  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel \n",
137
      "  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})\n",
138
      ")\n",
139
      "model_name: WhereIsAI/UAE-Large-V1\n",
140
      "EMBEDDING_LENGTH: 1024\n",
141
      "MAX_SEQ_LENGTH: 512\n"
142
     ]
143
    }
144
   ],
145
   "source": [
146
    "# STEP 2. DOWNLOAD AN OPEN SOURCE EMBEDDING MODEL.\n",
147
    "\n",
148
    "# Import torch.\n",
149
    "import torch\n",
150
    "from torch.nn import functional as F\n",
151
    "from sentence_transformers import SentenceTransformer\n",
152
    "\n",
153
    "# Initialize torch settings\n",
154
    "torch.backends.cudnn.deterministic = True\n",
155
    "DEVICE = torch.device('cuda:3' if torch.cuda.is_available() else 'cpu')\n",
156
    "print(f\"device: {DEVICE}\")\n",
157
    "\n",
158
    "# Load the model from huggingface model hub.\n",
159
    "# python -m pip install -U angle-emb\n",
160
    "model_name = \"WhereIsAI/UAE-Large-V1\"\n",
161
    "encoder = SentenceTransformer(model_name, device=DEVICE)\n",
162
    "print(type(encoder))\n",
163
    "print(encoder)\n",
164
    "\n",
165
    "# Get the model parameters and save for later.\n",
166
    "EMBEDDING_LENGTH = encoder.get_sentence_embedding_dimension()\n",
167
    "MAX_SEQ_LENGTH_IN_TOKENS = encoder.get_max_seq_length() \n",
168
    "# # Assume tokens are 3 characters long.\n",
169
    "# MAX_SEQ_LENGTH = MAX_SEQ_LENGTH_IN_TOKENS * 3\n",
170
    "# HF_EOS_TOKEN_LENGTH = 1 * 3\n",
171
    "# Test with 512 sequence length.\n",
172
    "MAX_SEQ_LENGTH = MAX_SEQ_LENGTH_IN_TOKENS\n",
173
    "HF_EOS_TOKEN_LENGTH = 1\n",
174
    "\n",
175
    "# Inspect model parameters.\n",
176
    "print(f\"model_name: {model_name}\")\n",
177
    "print(f\"EMBEDDING_LENGTH: {EMBEDDING_LENGTH}\")\n",
178
    "print(f\"MAX_SEQ_LENGTH: {MAX_SEQ_LENGTH}\")"
179
   ]
180
  },
181
  {
182
   "cell_type": "markdown",
183
   "id": "d2b12728",
184
   "metadata": {},
185
   "source": [
186
    "## Create a Milvus collection\n",
187
    "\n",
188
    "You can think of a collection in Milvus like a \"table\" in SQL databases.  The **collection** will contain the \n",
189
    "- **Schema** (or [no-schema Milvus client](https://milvus.io/docs/using_milvusclient.md)).  \n",
190
    "💡 You'll need the vector `EMBEDDING_LENGTH` parameter from your embedding model.\n",
191
    "Typical values are:\n",
192
    "   - 768 for sbert embedding models\n",
193
    "   - 1536 for ada-002 OpenAI embedding models\n",
194
    "- **Vector index** for efficient vector search\n",
195
    "- **Vector distance metric** for measuring nearest neighbor vectors\n",
196
    "- **Consistency level**\n",
197
    "In Milvus, transactional consistency is possible; however, according to the [CAP theorem](https://en.wikipedia.org/wiki/CAP_theorem), some latency must be sacrificed. 💡 Searching movie reviews is not mission-critical, so [`eventually`](https://milvus.io/docs/consistency.md) consistent is fine here.\n"
198
   ]
199
  },
200
  {
201
   "cell_type": "markdown",
202
   "id": "87a34e05",
203
   "metadata": {},
204
   "source": [
205
    "### Exercise #1 (2 min):\n",
206
    "Create a collection named \"movies\".  Use the default AUTOINDEX.\n",
207
    "> 💡 AUTOINDEX works on both Milvus and Zilliz Cloud (where it is the fastest!)"
208
   ]
209
  },
210
  {
211
   "cell_type": "code",
212
   "execution_count": null,
213
   "id": "20a05d59",
214
   "metadata": {},
215
   "outputs": [],
216
   "source": [
217
    "from pymilvus import MilvusClient\n",
218
    "\n",
219
    "# Set the Milvus collection name.\n",
220
    "COLLECTION_NAME = # TODO (exercise): code here\n",
221
    "\n",
222
    "# Use no-schema Milvus client uses flexible json key:value format.\n",
223
    "# https://milvus.io/docs/using_milvusclient.md\n",
224
    "mc = MilvusClient(\n",
225
    "    uri=CLUSTER_ENDPOINT,\n",
226
    "    # API key or a colon-separated cluster username and password\n",
227
    "    token=TOKEN)\n",
228
    "\n",
229
    "mc.drop_collection(COLLECTION_NAME)\n",
230
    "mc.create_collection(COLLECTION_NAME, \n",
231
    "                     EMBEDDING_LENGTH, \n",
232
    "                    )\n",
233
    "\n",
234
    "print(mc.describe_collection(COLLECTION_NAME))\n",
235
    "print(f\"Created collection: {COLLECTION_NAME}\")"
236
   ]
237
  },
238
  {
239
   "cell_type": "markdown",
240
   "metadata": {},
241
   "source": [
242
    "## Add a Vector Index\n",
243
    "\n",
244
    "The vector index determines the vector **search algorithm** used to find the closest vectors in your data to the query a user submits.  \n",
245
    "\n",
246
    "Most vector indexes use different sets of parameters depending on whether the database is:\n",
247
    "- **inserting vectors** (creation mode) - vs - \n",
248
    "- **searching vectors** (search mode) \n",
249
    "\n",
250
    "Scroll down the [docs page](https://milvus.io/docs/index.md) to see a table listing different vector indexes available on Milvus.  For example:\n",
251
    "- FLAT - deterministic exhaustive search\n",
252
    "- IVF_FLAT or IVF_SQ8 - Hash index (stochastic approximate search)\n",
253
    "- HNSW - Graph index (stochastic approximate search)\n",
254
    "- AUTOINDEX - Automatically determined based on OSS vs [Zilliz cloud](https://docs.zilliz.com/docs/autoindex-explained), type of GPU, size of data.\n",
255
    "\n",
256
    "Besides a search algorithm, we also need to specify a **distance metric**, that is, a definition of what is considered \"close\" in vector space.  In the cell below, the [`HNSW`](https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md) search index is chosen.  Its possible distance metrics are one of:\n",
257
    "- L2 - L2-norm\n",
258
    "- IP - Dot-product\n",
259
    "- COSINE - Angular distance\n",
260
    "\n",
261
    "💡 Most use cases work better with normalized embeddings, in which case L2 is useless (every vector has length=1) and IP and COSINE are the same.  Only choose L2 if you plan to keep your embeddings unnormalized."
262
   ]
263
  },
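  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Sketch: how the three metrics relate for normalized vectors.**  A minimal numpy check, using two random unit vectors as stand-ins for real embeddings (an assumption; no model or Milvus call is involved). It verifies that the dot product equals the cosine similarity, and that the squared L2 distance equals `2 - 2 * IP`, so all three metrics rank neighbors identically once vectors are normalized."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Two random stand-in embeddings (NOT real model output), scaled to unit length.\n",
    "rng = np.random.default_rng(0)\n",
    "a = rng.normal(size=8)\n",
    "b = rng.normal(size=8)\n",
    "a = a / np.linalg.norm(a)\n",
    "b = b / np.linalg.norm(b)\n",
    "\n",
    "ip = float(a @ b)                                            # inner product (IP)\n",
    "cosine = ip / float(np.linalg.norm(a) * np.linalg.norm(b))   # cosine similarity\n",
    "l2_squared = float(np.sum((a - b) ** 2))                     # squared L2 distance\n",
    "\n",
    "print(np.isclose(ip, cosine))              # True\n",
    "print(np.isclose(l2_squared, 2 - 2 * ip))  # True"
   ]
  },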
264
  {
265
   "cell_type": "code",
266
   "execution_count": 5,
267
   "id": "4a85b295",
268
   "metadata": {},
269
   "outputs": [
270
    {
271
     "name": "stdout",
272
     "output_type": "stream",
273
     "text": [
274
      "Embedding length: 1024\n",
275
      "Successfully dropped collection: `movies`\n",
276
      "Created collection: movies\n",
277
      "{'collection_name': 'movies', 'auto_id': True, 'num_shards': 1, 'description': '', 'fields': [{'field_id': 100, 'name': 'id', 'description': '', 'type': 5, 'params': {}, 'element_type': 0, 'auto_id': True, 'is_primary': True}, {'field_id': 101, 'name': 'vector', 'description': '', 'type': 101, 'params': {'dim': 1024}, 'element_type': 0}], 'aliases': [], 'collection_id': 446268198622108304, 'consistency_level': 3, 'properties': {}, 'num_partitions': 1, 'enable_dynamic_field': True}\n"
278
     ]
279
    }
280
   ],
281
   "source": [
282
    "# STEP 3. CREATE A NO-SCHEMA MILVUS COLLECTION AND DEFINE THE DATABASE INDEX.\n",
283
    "\n",
284
    "# Re-run create collection and add vector index specifying custom params.\n",
285
    "from pymilvus import MilvusClient\n",
286
    "\n",
287
    "# For vector length, use the embedding length from the embedding model.\n",
288
    "print(f\"Embedding length: {EMBEDDING_LENGTH}\")\n",
289
    "\n",
290
    "# Set the Milvus collection name.\n",
291
    "COLLECTION_NAME = \"movies\"\n",
292
    "\n",
293
    "# Add custom HNSW search index to the collection.\n",
294
    "# M = max number graph connections per layer. Large M = denser graph.\n",
295
    "# Choice of M: 4~64, larger M for larger data and larger embedding lengths.\n",
296
    "M = 16\n",
297
    "# efConstruction = num_candidate_nearest_neighbors per layer. \n",
298
    "# Use Rule of thumb: int. 8~512, efConstruction = M * 2.\n",
299
    "efConstruction = M * 2\n",
300
    "# Create the search index for local Milvus server.\n",
301
    "INDEX_PARAMS = dict({\n",
302
    "    'M': M,               \n",
303
    "    \"efConstruction\": efConstruction })\n",
304
    "index_params = {\n",
305
    "    \"index_type\": \"HNSW\", \n",
306
    "    \"metric_type\": \"COSINE\", \n",
307
    "    \"params\": INDEX_PARAMS\n",
308
    "    }\n",
309
    "\n",
310
    "# Use no-schema Milvus client (flexible json key:value format).\n",
311
    "# https://milvus.io/docs/using_milvusclient.md\n",
312
    "mc = MilvusClient(\n",
313
    "    uri=CLUSTER_ENDPOINT,\n",
314
    "    # API key or a colon-separated cluster username and password\n",
315
    "    token=TOKEN)\n",
316
    "\n",
317
    "# Check if collection already exists, if so drop it.\n",
318
    "has = utility.has_collection(COLLECTION_NAME)\n",
319
    "if has:\n",
320
    "    drop_result = utility.drop_collection(COLLECTION_NAME)\n",
321
    "    print(f\"Successfully dropped collection: `{COLLECTION_NAME}`\")\n",
322
    "\n",
323
    "mc.create_collection(\n",
324
    "    COLLECTION_NAME, \n",
325
    "    EMBEDDING_LENGTH, \n",
326
    "    consistency_level=\"Eventually\", \n",
327
    "    auto_id=True,  \n",
328
    "    overwrite=True,\n",
329
    "    # skip setting params below, if using AUTOINDEX\n",
330
    "    params=index_params\n",
331
    "    )\n",
332
    "\n",
333
    "print(f\"Created collection: {COLLECTION_NAME}\")\n",
334
    "print(mc.describe_collection(COLLECTION_NAME))"
335
   ]
336
  },
337
  {
338
   "cell_type": "markdown",
339
   "id": "e735fe08",
340
   "metadata": {},
341
   "source": [
342
    "## Read CSV data into a pandas dataframe\n",
343
    "\n",
344
    "The data used in this notebook is the [IMDB large movie review dataset](https://ai.stanford.edu/~amaas/data/sentiment/) from the Stanford AI Lab. It is a conveniently processed 50,000 dataset (50:50 sampled ratio Positive/Negative reviews). This data has columns: movie_index, raw review text, and movie rating."
345
   ]
346
  },
347
  {
348
   "cell_type": "code",
349
   "execution_count": 6,
350
   "id": "6861beb7",
351
   "metadata": {},
352
   "outputs": [],
353
   "source": [
354
    "# 1. Download data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz\n",
355
    "# 2. Move .csv file to data/ folder.\n",
356
    "\n",
357
    "# citation:  ACL 2011, @InProceedings{maas-EtAl:2011:ACL-HLT2011,\n",
358
    "#   author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},\n",
359
    "#   title     = {Learning Word Vectors for Sentiment Analysis},\n",
360
    "#   booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},\n",
361
    "#   month     = {June},\n",
362
    "#   year      = {2011},\n",
363
    "#   address   = {Portland, Oregon, USA},\n",
364
    "#   publisher = {Association for Computational Linguistics},\n",
365
    "#   pages     = {142--150},\n",
366
    "#   url       = {http://www.aclweb.org/anthology/P11-1015}\n",
367
    "# }"
368
   ]
369
  },
370
  {
371
   "cell_type": "code",
372
   "execution_count": 7,
373
   "id": "6a381e57",
374
   "metadata": {},
375
   "outputs": [
376
    {
377
     "name": "stdout",
378
     "output_type": "stream",
379
     "text": [
380
      "original df shape: (100, 4)\n",
381
      "df_train shape: (100, 4), df_val shape: (0, 4), df_test shape: (0, 4)\n",
382
      "Example text length: 1113\n",
383
      "Example text: The whole town of Blackstone is afraid, because they lynched Bret Dixon's brother - and he is coming back for revenge! At least that's what they think.<br /><br />A great Johnny Hallyday and a very interesting, early Mario Adorf star in this Italo-Western, obviously filmed in the Alps.<br /><br />Bret Dixon is coming back to Blackstone to investigate why his brother was lynched. He is a loner and gunslinger par excellance, everybody is afraid of him - the Mexican bandits (fighting the Gringos that took their land!) as well as the \"decent\" citizens that lynched Bret's brother. They lynched him, because they thought he stole their money instead of bringing it to Dallas to the safety of the bank there. But this is is only half the truth, as we find out in the course of this psychologically interesting western.<br /><br />But beware, it's kind of a depressing movie as everybody turns out to be guilty somehow and definitely everybody is bad to the bone...<br /><br />Still, I enjoyed it very much and gave it an 8/10. Strange, that only less than 5 people voted for this movie as of January 12th 2002....\n"
384
     ]
385
    },
386
    {
387
     "data": {
388
      "text/html": [
389
       "<div>\n",
390
       "<style scoped>\n",
391
       "    .dataframe tbody tr th:only-of-type {\n",
392
       "        vertical-align: middle;\n",
393
       "    }\n",
394
       "\n",
395
       "    .dataframe tbody tr th {\n",
396
       "        vertical-align: top;\n",
397
       "    }\n",
398
       "\n",
399
       "    .dataframe thead th {\n",
400
       "        text-align: right;\n",
401
       "    }\n",
402
       "</style>\n",
403
       "<table border=\"1\" class=\"dataframe\">\n",
404
       "  <thead>\n",
405
       "    <tr style=\"text-align: right;\">\n",
406
       "      <th></th>\n",
407
       "      <th>movie_index</th>\n",
408
       "      <th>text</th>\n",
409
       "      <th>label_int</th>\n",
410
       "      <th>label</th>\n",
411
       "    </tr>\n",
412
       "  </thead>\n",
413
       "  <tbody>\n",
414
       "    <tr>\n",
415
       "      <th>0</th>\n",
416
       "      <td>80</td>\n",
417
       "      <td>The whole town of Blackstone is afraid, becaus...</td>\n",
418
       "      <td>1</td>\n",
419
       "      <td>Positive</td>\n",
420
       "    </tr>\n",
421
       "    <tr>\n",
422
       "      <th>1</th>\n",
423
       "      <td>84</td>\n",
424
       "      <td>This Harold Lloyd short wasn't really much; no...</td>\n",
425
       "      <td>0</td>\n",
426
       "      <td>Negative</td>\n",
427
       "    </tr>\n",
428
       "  </tbody>\n",
429
       "</table>\n",
430
       "</div>"
431
      ],
432
      "text/plain": [
433
       "   movie_index                                               text  label_int  \\\n",
434
       "0           80  The whole town of Blackstone is afraid, becaus...          1   \n",
435
       "1           84  This Harold Lloyd short wasn't really much; no...          0   \n",
436
       "\n",
437
       "      label  \n",
438
       "0  Positive  \n",
439
       "1  Negative  "
440
      ]
441
     },
442
     "metadata": {},
443
     "output_type": "display_data"
444
    }
445
   ],
446
   "source": [
447
    "# Read locally stored data.\n",
448
    "filepath = \"data/movie_data.csv\"\n",
449
    "\n",
450
    "df = pd.read_csv(f\"{filepath}\")\n",
451
    "\n",
452
    "# Drop duplicates\n",
453
    "df.drop_duplicates(keep='first', inplace=True)\n",
454
    "\n",
455
    "# Change label column names.\n",
456
    "df.columns = ['text', 'label_int']\n",
457
    "\n",
458
    "# Map numbers to text 'Postive' and 'Negative' for sentiment labels.\n",
459
    "df[\"label\"] = df[\"label_int\"].apply(_utils.sentiment_score_to_name)\n",
460
    "\n",
461
    "# Split data into train/valid/test.\n",
462
    "columns = ['movie_index', 'text', 'label_int', 'label']\n",
463
    "df, df_train, df_val, df_test = _utils.partition_dataset(df, columns, smoke_test=False)\n",
464
    "print(f\"original df shape: {df.shape}\")\n",
465
    "print(f\"df_train shape: {df_train.shape}, df_val shape: {df_val.shape}, df_test shape: {df_test.shape}\")\n",
466
    "assert df_train.shape[0] + df_val.shape[0] + df_test.shape[0] == df.shape[0]\n",
467
    "\n",
468
    "# Inspect data.\n",
469
    "print(f\"Example text length: {len(df.text[0])}\")\n",
470
    "print(f\"Example text: {df.text[0]}\")\n",
471
    "display(df.head(2))\n"
472
   ]
473
  },
474
  {
475
   "cell_type": "code",
476
   "execution_count": 8,
477
   "id": "654dd135",
478
   "metadata": {},
479
   "outputs": [
480
    {
481
     "name": "stdout",
482
     "output_type": "stream",
483
     "text": [
484
      "Count samples positive: 50\n",
485
      "Count samples negative: 50\n"
486
     ]
487
    }
488
   ],
489
   "source": [
490
    "# Check if approx. equal number training examples for each class.\n",
491
    "class1 = df_train.loc[(df_train.label == \"Positive\"), :].copy()\n",
492
    "class2 = df_train.loc[(df_train.label == \"Negative\"), :].copy()\n",
493
    "print(f\"Count samples positive: {class1.shape[0]}\")\n",
494
    "print(f\"Count samples negative: {class2.shape[0]}\")"
495
   ]
496
  },
497
  {
498
   "cell_type": "code",
499
   "execution_count": 9,
500
   "metadata": {},
501
   "outputs": [],
502
   "source": [
503
    "# Uncomment this to create the small sample of data for github.\n",
504
    "# df_small = df.head(100)[['text', 'label_int']].copy()\n",
505
    "# display(df_small.head())\n",
506
    "# df_small.to_csv(\"data/movie_data_small.csv\", index=False)"
507
   ]
508
  },
509
  {
510
   "cell_type": "markdown",
511
   "id": "c60423a5",
512
   "metadata": {},
513
   "source": [
514
    "## Chunking\n",
515
    "\n",
516
    "Before embedding, it is necessary to decide your chunk strategy, chunk size, and chunk overlap.  In this demo, I will use:\n",
517
    "- **Strategy** = Keep movie reveiws as single chunks unless they are too long.\n",
518
    "- **Chunk size** = Use the embedding model's parameter `MAX_SEQ_LENGTH`\n",
519
    "- **Overlap** = Rule-of-thumb 10-15%\n",
520
    "- **Function** = Langchain's convenient `RecursiveCharacterTextSplitter` to split up long reviews recursively.\n"
521
   ]
522
  },
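  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Sketch: how chunk size and overlap determine the number of chunks.**  The cell below is a simplified stand-in, not Langchain's `RecursiveCharacterTextSplitter` (which additionally tries to break on separators such as paragraphs and spaces). The helper `sliding_window_chunks` is a hypothetical name introduced here for illustration. It cuts fixed-size character windows with a 10% overlap from a dummy string of 1113 characters, the length of the example review printed earlier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def sliding_window_chunks(text, size, overlap):\n",
    "    \"\"\"Split text into fixed-size character windows that overlap (hypothetical helper).\"\"\"\n",
    "    step = size - overlap\n",
    "    return [text[i:i + size] for i in range(0, len(text), step)]\n",
    "\n",
    "# Dummy text with the same length as the example review (1113 characters).\n",
    "review = \"x\" * 1113\n",
    "\n",
    "for size in (511, 256):\n",
    "    overlap = round(size * 0.10)  # 10% overlap\n",
    "    chunks = sliding_window_chunks(review, size, overlap)\n",
    "    print(f\"size={size}, overlap={overlap}, chunks={len(chunks)}\")\n",
    "# size=511, overlap=51, chunks=3\n",
    "# size=256, overlap=26, chunks=5"
   ]
  },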
523
  {
524
   "cell_type": "markdown",
525
   "id": "249e9c74",
526
   "metadata": {},
527
   "source": [
528
    "⚠️ **Demo batch size = 100 rows for demonstration purposes.**\n",
529
    "\n",
530
    "This means the question results could be better with more data!"
531
   ]
532
  },
533
  {
534
   "cell_type": "markdown",
535
   "id": "8c1b0045",
536
   "metadata": {},
537
   "source": [
538
    "### Exercise #2 (2 min):\n",
539
    "Change the chunk_size and see what happens?  Model default is 511.\n",
540
    "\n",
541
    "- What do your observations imply about changing the chunk_size and the number of vectors?\n",
542
    "- How many vectors are there with chunk_size=256?"
543
   ]
544
  },
545
  {
546
   "cell_type": "code",
547
   "execution_count": null,
548
   "id": "1954c96d",
549
   "metadata": {},
550
   "outputs": [],
551
   "source": [
552
    "###############\n",
553
    "## EXERCISE #1: Change chunk_size to 256 below.  How many chunks (vectors) does this create?\n",
554
    "## ANSWER:  542\n",
555
    "## BONUS:   Can you explain why the number of vectors changed from 290 to 542?  \n",
556
    "##          Hint:  What is the default chunk overlap?  290 * (2 - 0.10) approx. equals 542.\n",
557
    "###############\n",
558
    "# Default chunk_size and overlap are calculated from embedding model parameters.\n",
559
    "chunk_size = # TODO (exercise): code here\n",
560
    "chunk_overlap = np.round(chunk_size * 0.10, 0)\n",
561
    "BATCH_SIZE = 100\n",
562
    "\n",
563
    "# Chunk a batch of data from pandas DataFrame and inspect it.\n",
564
    "batch = _utils.imdb_chunk_text( # TODO (exercise): code here )"
565
   ]
566
  },
567
  {
568
   "cell_type": "code",
569
   "execution_count": 11,
570
   "id": "470a93c7",
571
   "metadata": {},
572
   "outputs": [
573
    {
574
     "name": "stdout",
575
     "output_type": "stream",
576
     "text": [
577
      "chunk size: 511\n",
578
      "original shape: (100, 4)\n",
579
      "new shape: (290, 5)\n",
580
      "Chunking + embedding time for 100 docs: 20.689467906951904 sec\n"
581
     ]
582
    },
583
    {
584
     "data": {
585
      "text/html": [
586
       "<div>\n",
587
       "<style scoped>\n",
588
       "    .dataframe tbody tr th:only-of-type {\n",
589
       "        vertical-align: middle;\n",
590
       "    }\n",
591
       "\n",
592
       "    .dataframe tbody tr th {\n",
593
       "        vertical-align: top;\n",
594
       "    }\n",
595
       "\n",
596
       "    .dataframe thead th {\n",
597
       "        text-align: right;\n",
598
       "    }\n",
599
       "</style>\n",
600
       "<table border=\"1\" class=\"dataframe\">\n",
601
       "  <thead>\n",
602
       "    <tr style=\"text-align: right;\">\n",
603
       "      <th></th>\n",
604
       "      <th>movie_index</th>\n",
605
       "      <th>text</th>\n",
606
       "      <th>chunk</th>\n",
607
       "      <th>vector</th>\n",
608
       "      <th>label_int</th>\n",
609
       "      <th>label</th>\n",
610
       "    </tr>\n",
611
       "  </thead>\n",
612
       "  <tbody>\n",
613
       "    <tr>\n",
614
       "      <th>0</th>\n",
615
       "      <td>80</td>\n",
616
       "      <td>The whole town of Blackstone is afraid, becaus...</td>\n",
617
       "      <td>The whole town of Blackstone is afraid, becaus...</td>\n",
618
       "      <td>[0.023260135, 0.03262592, 0.0071149827, 0.0475...</td>\n",
619
       "      <td>1</td>\n",
620
       "      <td>Positive</td>\n",
621
       "    </tr>\n",
622
       "    <tr>\n",
623
       "      <th>1</th>\n",
624
       "      <td>80</td>\n",
625
       "      <td>The whole town of Blackstone is afraid, becaus...</td>\n",
626
       "      <td>Mexican bandits (fighting the Gringos that too...</td>\n",
627
       "      <td>[0.024261247, 0.018350782, -0.005168957, 0.020...</td>\n",
628
       "      <td>1</td>\n",
629
       "      <td>Positive</td>\n",
630
       "    </tr>\n",
631
       "    <tr>\n",
632
       "      <th>2</th>\n",
633
       "      <td>80</td>\n",
634
       "      <td>The whole town of Blackstone is afraid, becaus...</td>\n",
635
       "      <td>and definitely everybody is bad to the bone......</td>\n",
636
       "      <td>[0.034700453, 0.011013481, -0.022261137, 0.003...</td>\n",
637
       "      <td>1</td>\n",
638
       "      <td>Positive</td>\n",
639
       "    </tr>\n",
640
       "    <tr>\n",
641
       "      <th>3</th>\n",
642
       "      <td>84</td>\n",
643
       "      <td>This Harold Lloyd short wasn't really much; no...</td>\n",
644
       "      <td>This Harold Lloyd short wasn't really much; no...</td>\n",
645
       "      <td>[0.01173156, 0.01819113, 0.03528512, 0.0179632...</td>\n",
646
       "      <td>0</td>\n",
647
       "      <td>Negative</td>\n",
648
       "    </tr>\n",
649
       "    <tr>\n",
650
       "      <th>4</th>\n",
651
       "      <td>84</td>\n",
652
       "      <td>This Harold Lloyd short wasn't really much; no...</td>\n",
653
       "      <td>part was the last four or five minutes when th...</td>\n",
654
       "      <td>[0.05225119, 0.033677388, 0.011586295, 0.00569...</td>\n",
655
       "      <td>0</td>\n",
656
       "      <td>Negative</td>\n",
657
       "    </tr>\n",
658
       "  </tbody>\n",
659
       "</table>\n",
660
       "</div>"
661
      ],
662
      "text/plain": [
663
       "  movie_index                                               text  \\\n",
664
       "0          80  The whole town of Blackstone is afraid, becaus...   \n",
665
       "1          80  The whole town of Blackstone is afraid, becaus...   \n",
666
       "2          80  The whole town of Blackstone is afraid, becaus...   \n",
667
       "3          84  This Harold Lloyd short wasn't really much; no...   \n",
668
       "4          84  This Harold Lloyd short wasn't really much; no...   \n",
669
       "\n",
670
       "                                               chunk  \\\n",
671
       "0  The whole town of Blackstone is afraid, becaus...   \n",
672
       "1  Mexican bandits (fighting the Gringos that too...   \n",
673
       "2  and definitely everybody is bad to the bone......   \n",
674
       "3  This Harold Lloyd short wasn't really much; no...   \n",
675
       "4  part was the last four or five minutes when th...   \n",
676
       "\n",
677
       "                                              vector  label_int     label  \n",
678
       "0  [0.023260135, 0.03262592, 0.0071149827, 0.0475...          1  Positive  \n",
679
       "1  [0.024261247, 0.018350782, -0.005168957, 0.020...          1  Positive  \n",
680
       "2  [0.034700453, 0.011013481, -0.022261137, 0.003...          1  Positive  \n",
681
       "3  [0.01173156, 0.01819113, 0.03528512, 0.0179632...          0  Negative  \n",
682
       "4  [0.05225119, 0.033677388, 0.011586295, 0.00569...          0  Negative  "
683
      ]
684
     },
685
     "metadata": {},
686
     "output_type": "display_data"
687
    },
688
    {
689
     "name": "stdout",
690
     "output_type": "stream",
691
     "text": [
692
      "type embeddings: <class 'pandas.core.series.Series'> of <class 'numpy.ndarray'>\n",
693
      "of numbers: <class 'numpy.float32'>\n"
694
     ]
695
    }
696
   ],
697
   "source": [
698
    "# Don't forget to re-run using the better batch size!  \n",
699
    "\n",
700
    "# Use the embedding model parameters to calculate chunk_size and overlap.\n",
701
    "chunk_size = MAX_SEQ_LENGTH - HF_EOS_TOKEN_LENGTH\n",
702
    "chunk_overlap = np.round(chunk_size * 0.10, 0)\n",
703
    "BATCH_SIZE = 100\n",
704
    "\n",
705
    "# Chunk a batch of data from pandas DataFrame and inspect it.\n",
706
    "batch = _utils.imdb_chunk_text(encoder, BATCH_SIZE, df, chunk_size, chunk_overlap)"
707
   ]
708
  },
709
  {
710
   "cell_type": "markdown",
711
   "id": "d9bd8153",
712
   "metadata": {},
713
   "source": [
714
    "## Insert data into Milvus\n",
715
    "\n",
716
    "For each original text chunk, we'll write the quadruplet (`vector, text, source, h1, h2`) into the database.\n",
717
    "\n",
718
    "<div>\n",
719
    "<img src=\"../../images/db_insert.png\" width=\"80%\"/>\n",
720
    "</div>\n",
721
    "\n",
722
    "**The Milvus Client wrapper can only handle loading data from a list of dictionaries.**\n",
723
    "\n",
724
    "Otherwise, in general, Milvus supports loading data from:\n",
725
    "- pandas dataframes \n",
726
    "- list of dictionaries\n",
727
    "\n",
    "The embeddings in `batch` were created earlier by the HuggingFace embedding model running locally as the encoder. Below, we convert the batch to a list of dictionaries and insert it into Milvus."
729
   ]
730
  },
731
  {
732
   "cell_type": "code",
733
   "execution_count": 12,
734
   "id": "b51ff139",
735
   "metadata": {},
736
   "outputs": [
737
    {
738
     "name": "stdout",
739
     "output_type": "stream",
740
     "text": [
741
      "Start inserting entities\n"
742
     ]
743
    },
744
    {
745
     "name": "stderr",
746
     "output_type": "stream",
747
     "text": [
748
      "100%|██████████| 1/1 [00:05<00:00,  5.90s/it]\n"
749
     ]
750
    },
751
    {
752
     "name": "stdout",
753
     "output_type": "stream",
754
     "text": [
755
      "Milvus insert time for 290 vectors: 5.9037556648254395 seconds\n"
756
     ]
757
    }
758
   ],
759
   "source": [
760
    "# STEP 5. INSERT CHUNKS AND EMBEDDINGS IN ZILLIZ.\n",
761
    "\n",
762
    "# Convert DataFrame to a list of dictionaries\n",
    "dict_list = batch.to_dict(orient=\"records\")\n",
767
    "\n",
768
    "print(\"Start inserting entities\")\n",
769
    "start_time = time.time()\n",
770
    "insert_result = mc.insert(\n",
771
    "    COLLECTION_NAME,\n",
772
    "    data=dict_list,\n",
773
    "    progress_bar=True)\n",
774
    "end_time = time.time()\n",
775
    "print(f\"Milvus insert time for {batch.shape[0]} vectors: {end_time - start_time} seconds\")\n",
776
    "\n",
    "# After the final entity is inserted, call flush to seal the growing segments still held in memory.\n",
778
    "mc.flush(COLLECTION_NAME)"
779
   ]
780
  },
781
  {
782
   "cell_type": "markdown",
783
   "id": "4ebfb115",
784
   "metadata": {},
785
   "source": [
786
    "## Run a Semantic Search\n",
787
    "\n",
    "Now we can run a very fast search over all the movie review embeddings to find the `TOP_K` reviews whose embeddings are closest to the embedding of a user's query.\n",
789
    "- In this example, we'll search for a movie recommendation for a medical doctor.\n",
790
    "\n",
    "💡 Always use the same embedding model for the stored data and for the query, so that all the embeddings are comparable."
792
   ]
793
  },
794
  {
795
   "cell_type": "markdown",
796
   "id": "02c589ff",
797
   "metadata": {},
798
   "source": [
799
    "## Ask a question about your data\n",
800
    "\n",
801
    "So far in this demo notebook: \n",
802
    "1. Your custom data has been mapped into a vector embedding space\n",
803
    "2. Those vector embeddings have been saved into a vector database\n",
804
    "\n",
805
    "Next, you can ask a question about your custom data!\n",
806
    "\n",
807
    "💡 With LLMs:\n",
808
    "> **Query** is the generic term for user questions.  \n",
    "A query is a list of one or more individual questions, possibly up to around 1000 different questions!\n",
810
    "\n",
811
    "> **Question** usually refers to a single user question.  \n",
812
    "In our example below, the user question is \"I'm a medical doctor, what movie should I watch?\""
813
   ]
814
  },
815
  {
816
   "cell_type": "code",
817
   "execution_count": 13,
818
   "id": "5e7f41f4",
819
   "metadata": {},
820
   "outputs": [
821
    {
822
     "name": "stdout",
823
     "output_type": "stream",
824
     "text": [
825
      "query length: 48\n"
826
     ]
827
    }
828
   ],
829
   "source": [
830
    "# Define a sample question about your data.\n",
831
    "question = \"I'm a medical doctor, what movie should I watch?\"\n",
832
    "query = [question]\n",
833
    "\n",
834
    "# Inspect the length of the query.\n",
835
    "QUERY_LENGTH = len(query[0])\n",
836
    "print(f\"query length: {QUERY_LENGTH}\")"
837
   ]
838
  },
839
  {
840
   "cell_type": "markdown",
841
   "id": "fa545611",
842
   "metadata": {},
843
   "source": [
844
    "**Embed the question using the same embedding model you used earlier**\n",
845
    "\n",
    "For vector search to work, the question itself must be embedded with the same model that was used to create the collection you want to search."
847
   ]
848
  },
849
  {
850
   "cell_type": "code",
851
   "execution_count": 14,
852
   "id": "a6863a32",
853
   "metadata": {},
854
   "outputs": [
855
    {
856
     "name": "stdout",
857
     "output_type": "stream",
858
     "text": [
859
      "<class 'list'> 1 <class 'numpy.ndarray'>\n",
860
      "<class 'numpy.float32'>\n"
861
     ]
862
    }
863
   ],
864
   "source": [
865
    "# Embed the query using same embedding model used to create the Milvus collection.\n",
866
    "query_embeddings = _utils.embed_query(encoder, query)\n",
867
    "\n",
868
    "# Inspect data.\n",
869
    "print(type(query_embeddings), len(query_embeddings), type(query_embeddings[0]))\n",
    "print(type(query_embeddings[0][0]))"
871
   ]
872
  },
873
  {
874
   "cell_type": "markdown",
875
   "id": "9ea29411",
876
   "metadata": {},
877
   "source": [
878
    "## Execute a vector search\n",
879
    "\n",
    "Search Milvus using the [PyMilvus API](https://milvus.io/docs/search.md).\n",
    "\n",
    "💡 By their nature, vector searches are \"semantic\" searches.  For example, if you were to search for \"leaky faucet\": \n",
    "> **Traditional keyword search** - one or both of the words \"leaky\" and \"faucet\" would have to match some text in a document for that web page or document link to be returned.\n",
    "\n",
    "> **Semantic search** - results containing the words \"drippy\" and \"taps\" would be returned as well, because these words mean the same thing even though they are different words."
886
   ]
887
  },
888
  {
889
   "cell_type": "markdown",
890
   "id": "e49e830c",
891
   "metadata": {},
892
   "source": [
893
    "### Exercise #3 (2 min):\n",
894
    "Search Milvus using the default search index.\n"
895
   ]
896
  },
897
  {
898
   "cell_type": "code",
899
   "execution_count": 15,
900
   "id": "2ace8d04",
901
   "metadata": {},
902
   "outputs": [
903
    {
904
     "name": "stdout",
905
     "output_type": "stream",
906
     "text": [
907
      "Search time: 0.17416715621948242 sec\n",
908
      "type: <class 'list'>, count: 10\n"
909
     ]
910
    }
911
   ],
912
   "source": [
913
    "# Run semantic vector search using your query and the vector database.\n",
914
    "\n",
915
    "# # Not needed with Milvus Client API.\n",
916
    "# mc.load()\n",
917
    "\n",
918
    "# Uses default search algorithm:  HNSW and top_k=10.\n",
919
    "start_time = time.time()\n",
920
    "results = mc.search(\n",
921
    "    COLLECTION_NAME,\n",
922
    "    data=query_embeddings, \n",
923
    "    )\n",
924
    "\n",
925
    "elapsed_time = time.time() - start_time\n",
926
    "print(f\"Search time: {elapsed_time} sec\")\n",
927
    "\n",
928
    "# Inspect search result.\n",
929
    "print(f\"type: {type(results)}, count: {len(results[0])}\")"
930
   ]
931
  },
932
  {
933
   "cell_type": "code",
934
   "execution_count": 16,
935
   "id": "c5d98e28",
936
   "metadata": {},
937
   "outputs": [
938
    {
939
     "name": "stdout",
940
     "output_type": "stream",
941
     "text": [
942
      "Milvus search time: 0.0590059757232666 sec\n",
943
      "type: <class 'list'>, count: 3\n"
944
     ]
945
    }
946
   ],
947
   "source": [
948
    "# Re-run the search using custom settings.\n",
949
    "\n",
950
    "# Return top k results with HNSW index.\n",
951
    "TOP_K = 3\n",
952
    "OUTPUT_FIELDS=[\"movie_index\", \"chunk\", \"label\"]\n",
953
    "SEARCH_PARAMS = dict({\n",
954
    "    # Re-use index param for num_candidate_nearest_neighbors.\n",
955
    "    \"ef\": INDEX_PARAMS['efConstruction']\n",
956
    "    })\n",
957
    "\n",
958
    "# Run the search and time it.\n",
959
    "start_time = time.time()\n",
960
    "results = mc.search(\n",
961
    "    COLLECTION_NAME,\n",
962
    "    data=query_embeddings, \n",
963
    "    search_params=SEARCH_PARAMS,\n",
964
    "    output_fields=OUTPUT_FIELDS, \n",
965
    "    # Milvus can utilize metadata in boolean expressions to filter search.\n",
    "    # filter=\"\",\n",
967
    "    limit=TOP_K,\n",
968
    "    consistency_level=\"Eventually\",\n",
969
    "    )\n",
970
    "\n",
971
    "elapsed_time = time.time() - start_time\n",
972
    "print(f\"Milvus search time: {elapsed_time} sec\")\n",
973
    "\n",
974
    "# Inspect search result.\n",
975
    "print(f\"type: {type(results)}, count: {len(results[0])}\")"
976
   ]
977
  },
978
  {
979
   "cell_type": "markdown",
980
   "id": "95f7e011",
981
   "metadata": {},
982
   "source": [
983
    "## Assemble and inspect the search result\n",
984
    "\n",
    "The search results are in the variable `results`, a Python list with one entry per query. Since we searched with a single question, `results[0]` holds its `TOP_K` hits."
986
   ]
987
  },
988
  {
989
   "cell_type": "code",
990
   "execution_count": 17,
991
   "id": "22d65363",
992
   "metadata": {},
993
   "outputs": [
994
    {
995
     "name": "stdout",
996
     "output_type": "stream",
997
     "text": [
998
      "Length context: 507, Number of contexts: 3\n",
999
      "Retrieved result #1\n",
1000
      "Context: Dr. K(David H Hickey)has been trying to master a formula that would end all disease and handicaps, but needs live donors to complete his work. His doc\n",
1001
      "Metadata: {'movie_index': '56', 'label': 'Negative'}\n",
1002
      "\n",
1003
      "Retrieved result #2\n",
1004
      "Context: is not a horror movie, although it does contain some violent scenes, but is rather a comedy. A satire to be precise. And it never runs out of steam! T\n",
1005
      "Metadata: {'movie_index': '44', 'label': 'Positive'}\n",
1006
      "\n",
1007
      "Retrieved result #3\n",
1008
      "Context: a good movie with a real good story. The fact that there are so many other big stars who all also had great performances is just an added BONUS! So do\n",
1009
      "Metadata: {'movie_index': '67', 'label': 'Positive'}\n",
1010
      "\n"
1011
     ]
1012
    }
1013
   ],
1014
   "source": [
    "# Assemble the top `num_shot_answers` retrieved contexts and their metadata.\n",
1016
    "METADATA_FIELDS = [f for f in OUTPUT_FIELDS if f != 'chunk']\n",
1017
    "formatted_results, context, context_metadata = _utils.client_assemble_retrieved_context(\n",
1018
    "    results, metadata_fields=METADATA_FIELDS, num_shot_answers=3)\n",
1019
    "print(f\"Length context: {len(context[0])}, Number of contexts: {len(context)}\")\n",
1020
    "\n",
    "# Loop through each context and its metadata and print them.\n",
1022
    "for i in range(len(context)):\n",
1023
    "    print(f\"Retrieved result #{i+1}\")\n",
1024
    "    print(f\"Context: {context[i][:150]}\")\n",
1025
    "    print(f\"Metadata: {context_metadata[i]}\")\n",
1026
    "    print()"
1027
   ]
1028
  },
1029
  {
1030
   "cell_type": "markdown",
1031
   "id": "309c5109",
1032
   "metadata": {},
1033
   "source": [
1034
    "## Same question, but add Metadata filter.\n",
1035
    "\n",
1036
    "Keeping the same question, add a SQL-like filter on metadata.\n",
1037
    "\n",
1038
    "We expect the same answers as above, but omitting any \"Negative\" labeled movies."
1039
   ]
1040
  },
1041
  {
1042
   "cell_type": "code",
1043
   "execution_count": 18,
1044
   "id": "230c2b44",
1045
   "metadata": {},
1046
   "outputs": [
1047
    {
1048
     "name": "stdout",
1049
     "output_type": "stream",
1050
     "text": [
1051
      "Milvus search time: 0.0647287368774414 sec\n",
1052
      "Length context: 457, Number of contexts: 3\n",
1053
      "Retrieved result #1\n",
1054
      "Context: is not a horror movie, although it does contain some violent scenes, but is rather a comedy. A satire to be precise. And it never runs out of steam! T\n",
1055
      "Metadata: {'movie_index': '44', 'label': 'Positive'}\n",
1056
      "\n",
1057
      "Retrieved result #2\n",
1058
      "Context: a good movie with a real good story. The fact that there are so many other big stars who all also had great performances is just an added BONUS! So do\n",
1059
      "Metadata: {'movie_index': '67', 'label': 'Positive'}\n",
1060
      "\n",
1061
      "Retrieved result #3\n",
1062
      "Context: This movie took the Jerry Springer approach to super-human power. \"Wilder Napalm\" is the kind of theme-based movie that I love, addressing the idea th\n",
1063
      "Metadata: {'movie_index': '88', 'label': 'Positive'}\n",
1064
      "\n"
1065
     ]
1066
    }
1067
   ],
1068
   "source": [
    "# Same question, but add a metadata filter to keep only 'Positive' movies.\n",
1070
    "metadata_filter = \"(label like 'Positive%')\"\n",
1071
    "\n",
1072
    "# Run the search and time it.\n",
1073
    "start_time = time.time()\n",
1074
    "new_results = mc.search(\n",
1075
    "    COLLECTION_NAME,\n",
1076
    "    data=query_embeddings, \n",
1077
    "    search_params=SEARCH_PARAMS,\n",
1078
    "    output_fields=OUTPUT_FIELDS, \n",
1079
    "    filter=metadata_filter,\n",
1080
    "    limit=TOP_K,\n",
1081
    "    consistency_level=\"Eventually\",\n",
1082
    "    )\n",
1083
    "\n",
1084
    "elapsed_time = time.time() - start_time\n",
1085
    "print(f\"Milvus search time: {elapsed_time} sec\")\n",
1086
    "\n",
    "# Assemble the top `num_shot_answers` retrieved contexts and their metadata.\n",
1088
    "METADATA_FIELDS = [f for f in OUTPUT_FIELDS if f != 'chunk']\n",
1089
    "formatted_results, context, context_metadata = _utils.client_assemble_retrieved_context(\n",
1090
    "    new_results, metadata_fields=METADATA_FIELDS, num_shot_answers=3)\n",
1091
    "print(f\"Length context: {len(context[0])}, Number of contexts: {len(context)}\")\n",
1092
    "\n",
    "# Loop through each context and its metadata and print them.\n",
1094
    "for i in range(len(context)):\n",
1095
    "    print(f\"Retrieved result #{i+1}\")\n",
1096
    "    print(f\"Context: {context[i][:150]}\")\n",
1097
    "    print(f\"Metadata: {context_metadata[i]}\")\n",
1098
    "    print()\n",
1099
    "\n",
1100
    "# As expected, same answers, except 'Negative' movies are omitted."
1101
   ]
1102
  },
1103
  {
1104
   "cell_type": "markdown",
1105
   "id": "9cf49a96",
1106
   "metadata": {},
1107
   "source": [
1108
    "## Try another question\n",
1109
    "\n",
    "This time, change the question to ask for a movie for a **computer scientist** instead of a medical doctor, and see whether the answers are any different.  \n",
1111
    "\n",
1112
    "For semantically different questions, we expect the answers to be different."
1113
   ]
1114
  },
1115
  {
1116
   "cell_type": "code",
1117
   "execution_count": 19,
1118
   "id": "922073f2",
1119
   "metadata": {},
1120
   "outputs": [
1121
    {
1122
     "name": "stdout",
1123
     "output_type": "stream",
1124
     "text": [
1125
      "Question: I'm a computer scientist, what movie should I watch?\n",
1126
      "Milvus search time: 0.14014911651611328 sec\n",
1127
      "Length context: 133, Number of contexts: 3\n",
1128
      "Retrieved result #1\n",
1129
      "Context: i would be curious what kids think of this movie. Maybe they would enjoy it? But as for adults, safe bet they wont, even if a CS fan.\n",
1130
      "Metadata: {'movie_index': '37', 'label': 'Negative'}\n",
1131
      "\n",
1132
      "Retrieved result #2\n",
1133
      "Context: Bears about as much resemblance to Dean Koontz's novel as Jessica Simpson does to a rocket scientist. If you've read the book, I suggest you put it as\n",
1134
      "Metadata: {'movie_index': '21', 'label': 'Positive'}\n",
1135
      "\n",
1136
      "Retrieved result #3\n",
1137
      "Context: a good movie with a real good story. The fact that there are so many other big stars who all also had great performances is just an added BONUS! So do\n",
1138
      "Metadata: {'movie_index': '67', 'label': 'Positive'}\n",
1139
      "\n"
1140
     ]
1141
    }
1142
   ],
1143
   "source": [
    "# Take a user question as input and run a semantic vector search with it.\n",
1145
    "question = \"I'm a medical doctor, what movie should I watch?\"\n",
1146
    "new_question = \"I'm a computer scientist, what movie should I watch?\"\n",
1147
    "print(f\"Question: {new_question}\")\n",
1148
    "# Embed the query using same embedding model used to create the Milvus collection.\n",
1149
    "new_query_embeddings = _utils.embed_query(encoder, [new_question])\n",
1150
    "\n",
1151
    "# Run the search and time it.\n",
1152
    "start_time = time.time()\n",
1153
    "new_results = mc.search(\n",
1154
    "    COLLECTION_NAME,\n",
1155
    "    data=new_query_embeddings, \n",
1156
    "    search_params=SEARCH_PARAMS,\n",
1157
    "    output_fields=OUTPUT_FIELDS, \n",
1158
    "    # Milvus can utilize metadata in boolean expressions to filter search.\n",
    "    # filter=\"\",\n",
1160
    "    limit=TOP_K,\n",
1161
    "    consistency_level=\"Eventually\",\n",
1162
    "    )\n",
1163
    "\n",
1164
    "elapsed_time = time.time() - start_time\n",
1165
    "print(f\"Milvus search time: {elapsed_time} sec\")\n",
1166
    "\n",
    "# Assemble the top `num_shot_answers` retrieved contexts and their metadata.\n",
1168
    "METADATA_FIELDS = [f for f in OUTPUT_FIELDS if f != 'chunk']\n",
1169
    "formatted_results, context, context_metadata = _utils.client_assemble_retrieved_context(\n",
1170
    "    new_results, metadata_fields=METADATA_FIELDS, num_shot_answers=3)\n",
1171
    "print(f\"Length context: {len(context[0])}, Number of contexts: {len(context)}\")\n",
1172
    "\n",
    "# Loop through each context and its metadata and print them.\n",
1174
    "for i in range(len(context)):\n",
1175
    "    print(f\"Retrieved result #{i+1}\")\n",
1176
    "    print(f\"Context: {context[i][:150]}\")\n",
1177
    "    print(f\"Metadata: {context_metadata[i]}\")\n",
1178
    "    print()"
1179
   ]
1180
  },
1181
  {
1182
   "cell_type": "code",
1183
   "execution_count": 20,
1184
   "id": "d0e81e68",
1185
   "metadata": {},
1186
   "outputs": [],
1187
   "source": [
1188
    "# Drop collection\n",
    "mc.drop_collection(COLLECTION_NAME)"
1190
   ]
1191
  },
1192
  {
1193
   "cell_type": "code",
1194
   "execution_count": 21,
1195
   "id": "c777937e",
1196
   "metadata": {},
1197
   "outputs": [
1198
    {
1199
     "name": "stdout",
1200
     "output_type": "stream",
1201
     "text": [
1202
      "Author: Christy Bergman\n",
1203
      "\n",
1204
      "Python implementation: CPython\n",
1205
      "Python version       : 3.11.6\n",
1206
      "IPython version      : 8.18.1\n",
1207
      "\n",
1208
      "torch       : 2.1.1\n",
1209
      "transformers: 4.35.2\n",
1210
      "milvus      : 2.3.3\n",
1211
      "pymilvus    : 2.3.4\n",
1212
      "langchain   : 0.0.322\n",
1213
      "\n",
1214
      "conda environment: py311\n",
1215
      "\n"
1216
     ]
1217
    }
1218
   ],
1219
   "source": [
1220
    "# Props to Sebastian Raschka for this handy watermark.\n",
1221
    "# !pip install watermark\n",
1222
    "\n",
1223
    "%load_ext watermark\n",
1224
    "%watermark -a 'Christy Bergman' -v -p torch,transformers,milvus,pymilvus,langchain --conda"
1225
   ]
1226
  },
1227
  {
1228
   "cell_type": "code",
1229
   "execution_count": null,
1230
   "id": "7c5de90f",
1231
   "metadata": {},
1232
   "outputs": [],
1233
   "source": []
1234
  }
1235
 ],
1236
 "metadata": {
1237
  "kernelspec": {
1238
   "display_name": "Python 3 (ipykernel)",
1239
   "language": "python",
1240
   "name": "python3"
1241
  },
1242
  "language_info": {
1243
   "codemirror_mode": {
1244
    "name": "ipython",
1245
    "version": 3
1246
   },
1247
   "file_extension": ".py",
1248
   "mimetype": "text/x-python",
1249
   "name": "python",
1250
   "nbconvert_exporter": "python",
1251
   "pygments_lexer": "ipython3",
1252
   "version": "3.11.6"
1253
  }
1254
 },
1255
 "nbformat": 4,
1256
 "nbformat_minor": 5
1257
}