{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# Using BGE M3-Embedding Model with Milvus\n",
    "\n",
    "As deep neural networks continue to advance rapidly, it is increasingly common to employ them for information representation and retrieval. Referred to as embedding models, these networks encode information into dense or sparse vector representations within a multi-dimensional space.\n",
    "\n",
    "\n",
    "On January 30, 2024, a new member of the BGE model series called BGE-M3 was released. The M3 refers to its capabilities: supporting more than 100 languages, accommodating input lengths of up to 8192 tokens, and unifying multiple retrieval functions (dense, lexical, and multi-vector/ColBERT) in a single model. BGE-M3 is the first embedding model to support all three retrieval methods, achieving state-of-the-art performance on the multi-lingual (MIRACL) and cross-lingual (MKQA) benchmarks.\n",
    "\n",
    "Milvus, the world's first open-source vector database, plays a vital role in semantic search, providing efficient storage and retrieval of vector embeddings. Its scalability and advanced features, such as metadata filtering, further contribute to its significance in this field.\n",
    "\n",
    "This tutorial shows how to use the **BGE M3 embedding model with Milvus** for semantic similarity search.\n",
    "\n",
    "![](../../images/bge_m3.png)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Preparations\n",
    "\n",
    "We will demonstrate with the `BAAI/bge-m3` model and Milvus in standalone mode. The text for searching comes from the [M3 paper](https://arxiv.org/pdf/2402.03216.pdf). For each sentence in the paper, we use the `BAAI/bge-m3` model to convert the text string into a 1024-dimensional vector embedding, and store each embedding in Milvus.\n",
    "\n",
    "We then search with a query by converting the query text into a vector embedding and performing an approximate nearest neighbor search to find the text strings with the closest semantics.\n",
    "\n",
    "To run this demo, be sure you have already [started up a Milvus instance](https://milvus.io/docs/install_standalone-docker.md) and installed the Python packages `pymilvus` (the Milvus client library) and `FlagEmbedding` (a library for BGE models)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! pip install pymilvus FlagEmbedding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Import packages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "from pymilvus import (\n",
    "    connections,\n",
    "    utility,\n",
    "    FieldSchema,\n",
    "    CollectionSchema,\n",
    "    DataType,\n",
    "    Collection,\n",
    ")\n",
    "from FlagEmbedding import BGEM3FlagModel\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Set up the options for Milvus, specify model name as `BAAI/bge-m3`."
84
   ]
85
  },
86
  {
87
   "cell_type": "code",
88
   "execution_count": 2,
89
   "metadata": {
90
    "pycharm": {
91
     "name": "#%%\n"
92
    }
93
   },
94
   "outputs": [],
95
   "source": [
96
    "MILVUS_HOST = \"localhost\"\n",
97
    "MILVUS_PORT = \"19530\"\n",
98
    "COLLECTION_NAME = \"bge_m3_doc_collection\"  # Milvus collection name\n",
99
    "EMBEDDING_MODEL = \"BAAI/bge-m3\""
100
   ]
101
  },
102
  {
103
   "cell_type": "markdown",
104
   "metadata": {
105
    "pycharm": {
106
     "name": "#%% md\n"
107
    }
108
   },
109
   "source": [
110
    "Let’s try the BGE M3 Embedding service with a text string, print the result vector embedding and get the dimensions of the model."
111
   ]
112
  },
113
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "0e74cb7d298d48b98dafb1e9b1dadee2",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Fetching 22 files:   0%|          | 0/22 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "loading existing colbert_linear and sparse_linear---------\n",
      "----------using 4*GPUs----------\n",
      "[-0.03415   -0.04712   -0.0009007 -0.04697    0.04025   -0.07654\n",
      " -0.001877   0.007637  -0.01312   -0.007435  -0.0712     0.0526\n",
      "  0.02162   -0.04178    0.000628  -0.05307    0.00796   -0.0431\n",
      "  0.01224   -0.006145 ] ...\n",
      "\n",
      "Dimensions of `BAAI/bge-m3` embedding model is: 1024\n"
     ]
    }
   ],
   "source": [
    "test_sentences = \"What is BGE M3?\"\n",
    "\n",
    "model = BGEM3FlagModel(EMBEDDING_MODEL, use_fp16=True)  # Setting use_fp16 to True speeds up computation with a slight performance degradation\n",
    "\n",
    "\n",
    "test_embedding = model.encode([test_sentences])['dense_vecs'][0]\n",
    "\n",
    "print(f'{test_embedding[:20]} ...')\n",
    "dimension = len(test_embedding)\n",
    "print(f'\\nDimensions of `{EMBEDDING_MODEL}` embedding model is: {dimension}')"
   ]
  },
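  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Besides dense vectors, BGE-M3 can also produce sparse (lexical) weights and multi-vector (ColBERT) representations from the same `encode` call. The cell below is a minimal sketch of the `FlagEmbedding` multi-output API; the rest of this tutorial uses only the dense vectors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Request all three representation types in a single call.\n",
    "output = model.encode(\n",
    "    [test_sentences],\n",
    "    return_dense=True,\n",
    "    return_sparse=True,\n",
    "    return_colbert_vecs=True,\n",
    ")\n",
    "\n",
    "print('dense dims:   ', len(output['dense_vecs'][0]))       # 1024-dim dense vector\n",
    "print('sparse terms: ', len(output['lexical_weights'][0]))  # token id -> weight map\n",
    "print('colbert shape:', output['colbert_vecs'][0].shape)    # one vector per token\n"
   ]
  },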
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Load vectors into Milvus\n",
    "\n",
    "We need to create a collection in Milvus and build an index so that we can efficiently search vectors. For more information on how to use Milvus, check out the [documentation](https://milvus.io/docs/example_code.md).\n"
   ]
   ]
176
  },
177
  {
178
   "cell_type": "code",
179
   "execution_count": 4,
180
   "metadata": {
181
    "pycharm": {
182
     "name": "#%%\n"
183
    }
184
   },
185
   "outputs": [
186
    {
187
     "data": {
188
      "text/plain": [
189
       "Status(code=0, message=)"
190
      ]
191
     },
192
     "execution_count": 4,
193
     "metadata": {},
194
     "output_type": "execute_result"
195
    }
196
   ],
197
   "source": [
198
    "# Connect to Milvus\n",
199
    "connections.connect(host=MILVUS_HOST, port=MILVUS_PORT)\n",
200
    "\n",
201
    "# Remove collection if it already exists\n",
202
    "if utility.has_collection(COLLECTION_NAME):\n",
203
    "    utility.drop_collection(COLLECTION_NAME)\n",
204
    "\n",
205
    "# Set scheme with 3 fields: id (int), text (string), and embedding (float array).\n",
206
    "fields = [\n",
207
    "    FieldSchema(name=\"id\", dtype=DataType.INT64, is_primary=True, auto_id=False),\n",
208
    "    FieldSchema(name=\"text\", dtype=DataType.VARCHAR, max_length=65_535),\n",
209
    "    FieldSchema(name=\"embedding\", dtype=DataType.FLOAT_VECTOR, dim=dimension)\n",
210
    "]\n",
211
    "schema = CollectionSchema(fields, \"Here is description of this collection.\")\n",
212
    "# Create a collection with above schema.\n",
213
    "doc_collection = Collection(COLLECTION_NAME, schema)\n",
214
    "\n",
215
    "# Create an index for the collection.\n",
216
    "index = {\n",
217
    "    \"index_type\": \"IVF_FLAT\",\n",
218
    "    \"metric_type\": \"L2\",\n",
219
    "    \"params\": {\"nlist\": 128},\n",
220
    "}\n",
221
    "doc_collection.create_index(\"embedding\", index)"
222
   ]
223
  },
224
  {
225
   "cell_type": "markdown",
226
   "metadata": {
227
    "pycharm": {
228
     "name": "#%% md\n"
229
    }
230
   },
231
   "source": [
232
    "Here we have prepared a data set of text strings from the [M3 paper](https://arxiv.org/pdf/2402.03216.pdf), named `m3_paper.txt`. It stores each sentence as a line, and we convert each line in the document into a dense vector embedding with `BAAI/bge-m3` and then insert these embeddings into Milvus collection."
233
   ]
234
  },
235
  {
236
   "cell_type": "code",
237
   "execution_count": 5,
238
   "metadata": {
239
    "pycharm": {
240
     "name": "#%%\n"
241
    }
242
   },
243
   "outputs": [],
244
   "source": [
245
    "with open('./docs/m3_paper.txt', 'r') as f:\n",
246
    "    lines = f.readlines()\n",
247
    "\n",
248
    "embeddings = model.encode(lines)['dense_vecs']\n",
249
    "entities = [\n",
250
    "    list(range(len(lines))),  # field id (primary key) \n",
251
    "    lines,  # field text\n",
252
    "    embeddings,  #field embedding\n",
253
    "]\n",
254
    "insert_result = doc_collection.insert(entities)\n",
255
    "\n",
256
    "# In Milvus, it's a best practice to call flush() after all vectors are inserted,\n",
257
    "# so that a more efficient index is built for the just inserted vectors.\n",
258
    "doc_collection.flush()"
259
   ]
260
  },
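  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check, we can confirm that every sentence was stored. This is a minimal sketch assuming pymilvus's `insert_count` attribute on the insert result and the `num_entities` property on the collection."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Both counts should equal the number of lines in m3_paper.txt.\n",
    "print(f'rows inserted: {insert_result.insert_count}')\n",
    "print(f'entities in collection: {doc_collection.num_entities}')\n"
   ]
  },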
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Query\n",
    "\n",
    "Here we build a `semantic_search` function, which retrieves the top-K most semantically similar documents from a Milvus collection.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# Load the collection into memory for searching\n",
    "doc_collection.load()\n",
    "\n",
    "\n",
    "def semantic_search(query, top_k=3):\n",
    "    vectors_to_search = model.encode([query])['dense_vecs']\n",
    "    search_params = {\n",
    "        \"metric_type\": \"L2\",\n",
    "        \"params\": {\"nprobe\": 10},\n",
    "    }\n",
    "    result = doc_collection.search(vectors_to_search, \"embedding\", search_params, limit=top_k, output_fields=[\"text\"])\n",
    "    return result[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Here we ask a question about the embedding model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "distance = 0.46\n",
      "Particularly, M3-Embedding is proficient in multilinguality, which is able to support more than 100 world languages.\n",
      "\n",
      "distance = 0.53\n",
      "1) We present M3-Embedding, which is the first model which supports multi-linguality, multifunctionality, and multi-granularity.\n",
      "\n",
      "distance = 0.63\n",
      "In this paper, we present M3-Embedding, which achieves notable versatility in supporting multilingual retrieval, handling input of diverse granularities, and unifying different retrieval functionalities.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "question = 'How many working languages does the M3-Embedding model support?'\n",
    "\n",
    "match_results = semantic_search(question, top_k=3)\n",
    "for match in match_results:\n",
    "    print(f\"distance = {match.distance:.2f}\\n{match.entity.text}\")"
   ]
  },
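  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sanity check, we can recompute the top result's distance with NumPy. This is a minimal sketch assuming Milvus's `L2` metric returns the squared Euclidean distance (the square root is skipped because it does not change the ranking); the value should roughly match the reported distance, with small deviations possible under `use_fp16=True`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Re-encode the query and the top matching sentence.\n",
    "q = model.encode([question])['dense_vecs'][0]\n",
    "d = model.encode([match_results[0].entity.text])['dense_vecs'][0]\n",
    "\n",
    "# Squared Euclidean distance, as used by Milvus's L2 metric.\n",
    "print('recomputed distance:', np.sum((q - d) ** 2))\n"
   ]
  },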
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "The smaller the distance, the closer the vectors are, i.e., the more semantically similar the texts. We can see that the top-1 result, *\"M3-Embedding...more than 100 world languages...\"*, directly answers the question.\n",
    "\n",
    "Let's try another question."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "distance = 0.61\n",
      "The three data sources complement to each other, which are applied to different stages of the training process.\n",
      "\n",
      "distance = 0.69\n",
      "Our dataset consists of three sources: 1) the extraction of unsupervised data from massive multi-lingual corpora, 2) the integration of closely related supervised data, 3) the synthesization of scarce training data.\n",
      "\n",
      "distance = 0.74\n",
      "In this   Table 1: Specification of training data.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "question = 'What are the sources of data used in the training dataset?'\n",
    "\n",
    "match_results = semantic_search(question, top_k=3)\n",
    "for match in match_results:\n",
    "    print(f\"distance = {match.distance:.2f}\\n{match.entity.text}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "In this example, the top-2 results contain enough information to answer the question. By selecting the top-K results, semantic search with an embedding model and vector retrieval can identify the meaning of a query and return the most semantically similar documents. Combining this retrieval step with a Large Language Model (a pattern referred to as Retrieval-Augmented Generation) can produce a more human-readable answer, as sketched in the next cell."
   ]
  },
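  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a hypothetical sketch of that RAG pattern (the prompt template below is illustrative and not tied to any specific LLM API), the retrieved sentences can be assembled into a prompt:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Assemble the retrieved context into an LLM prompt (illustrative only).\n",
    "context = '\\n'.join(match.entity.text.strip() for match in match_results)\n",
    "prompt = (\n",
    "    'Answer the question using only the context below.\\n\\n'\n",
    "    f'Context:\\n{context}\\n\\n'\n",
    "    f'Question: {question}\\n'\n",
    "    'Answer:'\n",
    ")\n",
    "print(prompt)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we can delete this collection to save resources."
   ]
  },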
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# Drop the collection\n",
    "utility.drop_collection(COLLECTION_NAME)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook, we showed how to generate dense vectors with the BGE M3 embedding model and use Milvus to perform semantic search. In upcoming releases, Milvus will support hybrid search over dense and sparse vectors, both of which the BGE M3 model can produce at the same time.\n",
    "\n",
    "Milvus integrates with all major model providers, including OpenAI, HuggingFace, and many more. You can learn more about Milvus at https://milvus.io/docs."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}