{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "kTas9ZQ7lXP7"
   },
   "source": [
    "# Tutorial: Creating a Hybrid Retrieval Pipeline\n",
    "\n",
    "- **Level**: Intermediate\n",
    "- **Time to complete**: 15 minutes\n",
    "- **Nodes Used**: `EmbeddingRetriever`, `BM25Retriever`, `JoinDocuments`, `SentenceTransformersRanker` and `InMemoryDocumentStore`\n",
    "- **Goal**: After completing this tutorial, you will have learned how to create your first hybrid retrieval pipeline and when it's useful."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> This tutorial is based on Haystack 1.x. If you're using Haystack 2.0-Beta and would like to follow the updated version of this tutorial, check out [Creating a Hybrid Pipeline](https://haystack.deepset.ai/tutorials/33_hybrid_retrieval).\n",
    ">\n",
    "> For more information on Haystack 2.0-Beta, you can also read the [announcement post](https://haystack.deepset.ai/blog/introducing-haystack-2-beta-and-advent)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "0hw_zoKolXQL"
   },
   "source": [
    "## Overview\n",
    "\n",
    "\n",
    "**Hybrid Retrieval** combines dense and sparse retrieval to deliver the best of both search methods. Generally speaking, dense vectors excel at capturing the context of the query, whereas sparse vectors excel at keyword matches.\n",
    "\n",
    "There are many cases where a simple sparse retriever like BM25 outperforms a dense retriever (for example, in a specialized domain like healthcare), because a dense encoder model only performs well on data similar to what it was trained on. For more details about Hybrid Retrieval, check out [Blog Post: Hybrid Document Retrieval](https://haystack.deepset.ai/blog/hybrid-retrieval)."
   ]
  },
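  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the keyword-matching side concrete, here is a minimal sketch of BM25-style scoring in plain Python. It is not the implementation Haystack uses; the toy corpus, the `bm25_score` helper, and the `k1` and `b` defaults are assumptions chosen for illustration. The document that shares the most query terms should receive the highest score, regardless of what the query means."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "from collections import Counter\n",
    "\n",
    "# Hypothetical toy corpus; the tutorial itself indexes PubMed abstracts.\n",
    "corpus = [\n",
    "    \"apnea of prematurity in preterm infants\",\n",
    "    \"sleep apnea treatment in adults\",\n",
    "    \"nutrition guidelines for infants\",\n",
    "]\n",
    "tokenized = [text.split() for text in corpus]\n",
    "avg_len = sum(len(tokens) for tokens in tokenized) / len(tokenized)\n",
    "\n",
    "\n",
    "def bm25_score(query, tokens, k1=1.5, b=0.75):\n",
    "    \"\"\"Okapi BM25 score of one tokenized document for a whitespace-split query.\"\"\"\n",
    "    counts = Counter(tokens)\n",
    "    score = 0.0\n",
    "    for term in query.split():\n",
    "        n_docs_with_term = sum(term in doc for doc in tokenized)\n",
    "        # Smoothed inverse document frequency; the +1 keeps it non-negative.\n",
    "        idf = math.log((len(tokenized) - n_docs_with_term + 0.5) / (n_docs_with_term + 0.5) + 1)\n",
    "        tf = counts[term]\n",
    "        # Term frequency saturates with k1 and is normalized by document length via b.\n",
    "        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))\n",
    "    return score\n",
    "\n",
    "\n",
    "for text, tokens in zip(corpus, tokenized):\n",
    "    print(f\"{bm25_score('apnea in infants', tokens):.3f}  {text}\")"
   ]
  },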
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ITs3WTT5lXQT"
   },
   "source": [
    "## Preparing the Colab Environment\n",
    "\n",
    "- [Enable GPU Runtime in Colab](https://docs.haystack.deepset.ai/docs/enabling-gpu-acceleration#enabling-the-gpu-in-colab)\n",
    "- [Set logging level to INFO](https://docs.haystack.deepset.ai/docs/log-level)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2g9fhjxDlXQb"
   },
   "source": [
    "## Installing Haystack\n",
    "\n",
    "To start, let's install the latest release of Haystack with `pip`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "L40ZxZW8lXQh"
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "pip install --upgrade pip\n",
    "pip install \"datasets>=2.6.1\"\n",
    "pip install farm-haystack[inference]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "CJBcPNbBlXQq"
   },
   "source": [
    "### Enabling Telemetry\n",
    "\n",
    "Knowing you're using this tutorial helps us decide where to invest our efforts to build a better product, but you can always opt out by commenting out the following line. See [Telemetry](https://docs.haystack.deepset.ai/docs/telemetry) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "lUbTGVo4lXQv"
   },
   "outputs": [],
   "source": [
    "from haystack.telemetry import tutorial_running\n",
    "\n",
    "tutorial_running(26)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "5HLBUYOplXQ1"
   },
   "source": [
    "## Creating a Hybrid Retrieval Pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "usdANiAGlXQ9"
   },
   "source": [
    "### 1) Initialize the DocumentStore and Clean Documents\n",
    "\n",
    "\n",
    "You'll start creating a hybrid pipeline by initializing a DocumentStore and preprocessing documents before storing them in the DocumentStore.\n",
    "\n",
    "You will use the PubMed Abstracts as Documents. There are a lot of datasets from PubMed on Hugging Face Hub; you will use [anakin87/medrag-pubmed-chunk](https://huggingface.co/datasets/anakin87/medrag-pubmed-chunk) in this tutorial.\n",
    "\n",
    "Initialize `InMemoryDocumentStore` and don't forget to set `use_bm25=True` and the dimension of your embeddings in `embedding_dim`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "cLbh-UtelXRL"
   },
   "outputs": [],
   "source": [
    "from datasets import load_dataset\n",
    "from haystack.document_stores import InMemoryDocumentStore\n",
    "\n",
    "dataset = load_dataset(\"anakin87/medrag-pubmed-chunk\", split=\"train\")\n",
    "\n",
    "document_store = InMemoryDocumentStore(use_bm25=True, embedding_dim=384)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "WgxFjbGgdQla"
   },
   "source": [
    "You can create your document list with a simple for loop.\n",
    "The data has 4 features:\n",
    "* *id*: the document identifier (stored as `pmid` in the metadata)\n",
    "* *title*\n",
    "* *content*: the abstract\n",
    "* *contents*: abstract + title\n",
    "\n",
    "For searching, you will use the *contents* feature. The other features will be stored as metadata, and you will use them to **pretty-print** the search results.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "RvrG_QzirSsq"
   },
   "outputs": [],
   "source": [
    "from haystack.schema import Document\n",
    "\n",
    "docs = []\n",
    "for doc in dataset:\n",
    "    docs.append(\n",
    "        Document(content=doc[\"contents\"], meta={\"title\": doc[\"title\"], \"abstract\": doc[\"content\"], \"pmid\": doc[\"id\"]})\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "tNJkztzWaWzZ"
   },
   "source": [
    "The `PreProcessor` class is designed to help you clean and split text into sensible units.\n",
    "\n",
    "> To learn about the preprocessing step, check out [Tutorial: Preprocessing Your Documents](https://haystack.deepset.ai/tutorials/08_preprocessing).\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "RrCCmLvGqhYw"
   },
   "outputs": [],
   "source": [
    "from haystack.nodes import PreProcessor\n",
    "\n",
    "preprocessor = PreProcessor(\n",
    "    clean_empty_lines=True,\n",
    "    clean_whitespace=True,\n",
    "    clean_header_footer=True,\n",
    "    split_by=\"word\",\n",
    "    split_length=512,\n",
    "    split_overlap=32,\n",
    "    split_respect_sentence_boundary=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "8PzBU_jnsBTZ"
   },
   "outputs": [],
   "source": [
    "docs_to_index = preprocessor.process(docs)"
   ]
  },
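  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `split_length` and `split_overlap` settings above describe a sliding window over words. The sketch below shows that idea in plain Python, assuming simple whitespace tokenization. `split_words` is a hypothetical helper, not part of Haystack, and unlike `PreProcessor` with `split_respect_sentence_boundary=True` it ignores sentence boundaries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def split_words(text, split_length=512, split_overlap=32):\n",
    "    \"\"\"Split text into windows of split_length words; neighbours share split_overlap words.\"\"\"\n",
    "    words = text.split()\n",
    "    step = split_length - split_overlap  # how far the window advances each time\n",
    "    chunks = []\n",
    "    for start in range(0, len(words), step):\n",
    "        chunks.append(\" \".join(words[start : start + split_length]))\n",
    "        if start + split_length >= len(words):\n",
    "            break  # the last window already reached the end of the text\n",
    "    return chunks\n",
    "\n",
    "\n",
    "# Small numbers make the overlap visible: each chunk repeats one word of the previous one.\n",
    "sample = \" \".join(f\"w{i}\" for i in range(10))\n",
    "print(split_words(sample, split_length=4, split_overlap=1))"
   ]
  },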
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ii9x0gr9lXRT"
   },
   "source": [
    "### 2) Initialize the Retrievers\n",
    "\n",
    "Initialize a sparse retriever using [BM25](https://docs.haystack.deepset.ai/docs/retriever#bm25-recommended) and a dense retriever using a [sentence-transformers model](https://docs.haystack.deepset.ai/docs/retriever#embedding-retrieval-recommended)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "rXHbHru0lXRY"
   },
   "outputs": [],
   "source": [
    "from haystack.nodes import EmbeddingRetriever, BM25Retriever\n",
    "\n",
    "sparse_retriever = BM25Retriever(document_store=document_store)\n",
    "dense_retriever = EmbeddingRetriever(\n",
    "    document_store=document_store,\n",
    "    embedding_model=\"sentence-transformers/all-MiniLM-L6-v2\",\n",
    "    use_gpu=True,\n",
    "    scale_score=False,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "cx8307ZglXRd"
   },
   "source": [
    "### 3) Write Documents and Update Embeddings\n",
    "\n",
    "First delete any documents left in the DocumentStore, then write the new ones with `write_documents()`. The `update_embeddings()` method uses the given retriever to create an embedding for each document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "7S-QdaDYlXRg"
   },
   "outputs": [],
   "source": [
    "document_store.delete_documents()\n",
    "document_store.write_documents(docs_to_index)\n",
    "document_store.update_embeddings(retriever=dense_retriever)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_gugk_k2lXRi"
   },
   "source": [
    "### 4) Initialize JoinDocuments and Ranker\n",
    "\n",
    "While exploring hybrid search, we needed a way to combine the results of BM25 and dense vector search into a single ranked list. It may not be obvious how to combine them:\n",
    "\n",
    "* Different retrievers use incompatible score types, like BM25 and cosine similarity.\n",
    "* The same document may be returned by one retriever or by several at once, so the final ranking needs a way to deal with duplicates.\n",
    "\n",
    "Merging and ranking documents from different retrievers is an open problem. However, Haystack offers several methods in [`JoinDocuments`](https://docs.haystack.deepset.ai/docs/join_documents). Here, you will use the simplest, `concatenate`, and pass the ranking task to the ranker.\n",
    "\n",
    "Use a [re-ranker based on a cross-encoder](https://docs.haystack.deepset.ai/docs/ranker#sentencetransformersranker) that scores the relevance of all candidates for the given search query.\n",
    "For more information about the `Ranker`, check the Haystack [docs](https://docs.haystack.deepset.ai/docs/ranker)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "d_RiKspTlXRl"
   },
   "outputs": [],
   "source": [
    "from haystack.nodes import JoinDocuments, SentenceTransformersRanker\n",
    "\n",
    "join_documents = JoinDocuments(join_mode=\"concatenate\")\n",
    "rerank = SentenceTransformersRanker(model_name_or_path=\"cross-encoder/ms-marco-MiniLM-L-6-v2\")"
   ]
  },
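  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what joining involves, the sketch below merges two ranked lists of hypothetical document IDs in plain Python. `concatenate_unique` keeps the first occurrence of each ID, and `reciprocal_rank_fusion` scores each ID by its rank positions alone, which sidesteps the incompatible score scales. Both functions and the constant `k=60` are illustrative assumptions, not Haystack's `JoinDocuments` implementation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import defaultdict\n",
    "\n",
    "# Hypothetical document IDs, ranked best-first by each retriever.\n",
    "sparse_ranking = [\"doc_a\", \"doc_b\", \"doc_c\"]\n",
    "dense_ranking = [\"doc_b\", \"doc_d\", \"doc_a\"]\n",
    "\n",
    "\n",
    "def concatenate_unique(*rankings):\n",
    "    \"\"\"Concatenate ranked lists, keeping only the first occurrence of each ID.\"\"\"\n",
    "    seen = set()\n",
    "    merged = []\n",
    "    for ranking in rankings:\n",
    "        for doc_id in ranking:\n",
    "            if doc_id not in seen:\n",
    "                seen.add(doc_id)\n",
    "                merged.append(doc_id)\n",
    "    return merged\n",
    "\n",
    "\n",
    "def reciprocal_rank_fusion(*rankings, k=60):\n",
    "    \"\"\"Score each ID by summing 1 / (k + rank) over the lists it appears in.\"\"\"\n",
    "    scores = defaultdict(float)\n",
    "    for ranking in rankings:\n",
    "        for rank, doc_id in enumerate(ranking, start=1):\n",
    "            scores[doc_id] += 1.0 / (k + rank)\n",
    "    # Highest fused score first.\n",
    "    return sorted(scores, key=scores.get, reverse=True)\n",
    "\n",
    "\n",
    "print(concatenate_unique(sparse_ranking, dense_ranking))\n",
    "print(reciprocal_rank_fusion(sparse_ranking, dense_ranking))"
   ]
  },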
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "PexSrsBLlXRp"
   },
   "source": [
    "### 5) Create the Hybrid Retrieval Pipeline\n",
    "\n",
    "With a Haystack `Pipeline`, you can connect your building blocks into a search pipeline. Under the hood, `Pipelines` are Directed Acyclic Graphs (DAGs) that you can easily customize for your own use cases.\n",
    "You can learn more about Pipelines in the [docs](https://docs.haystack.deepset.ai/docs/pipelines)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "i0XLbnAXlXRt"
   },
   "outputs": [],
   "source": [
    "from haystack.pipelines import Pipeline\n",
    "\n",
    "pipeline = Pipeline()\n",
    "pipeline.add_node(component=sparse_retriever, name=\"SparseRetriever\", inputs=[\"Query\"])\n",
    "pipeline.add_node(component=dense_retriever, name=\"DenseRetriever\", inputs=[\"Query\"])\n",
    "pipeline.add_node(component=join_documents, name=\"JoinDocuments\", inputs=[\"SparseRetriever\", \"DenseRetriever\"])\n",
    "pipeline.add_node(component=rerank, name=\"ReRanker\", inputs=[\"JoinDocuments\"])"
   ]
  },
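  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of the DAG idea, the same wiring can be written as a plain dependency mapping and ordered with Python's standard `graphlib` module (Python 3.9+). This only mirrors the `add_node()` calls above and is not how Haystack schedules nodes internally; the printed list is one valid execution order."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from graphlib import TopologicalSorter\n",
    "\n",
    "# Each node maps to the set of nodes it takes its input from.\n",
    "graph = {\n",
    "    \"SparseRetriever\": {\"Query\"},\n",
    "    \"DenseRetriever\": {\"Query\"},\n",
    "    \"JoinDocuments\": {\"SparseRetriever\", \"DenseRetriever\"},\n",
    "    \"ReRanker\": {\"JoinDocuments\"},\n",
    "}\n",
    "\n",
    "# A node is listed only after everything it depends on.\n",
    "print(list(TopologicalSorter(graph).static_order()))"
   ]
  },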
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "V3bsFkHuhHn4"
   },
   "source": [
    "### Generating a Pipeline Diagram\n",
    "\n",
    "With any Pipeline, whether prebuilt or custom constructed, you can save a diagram showing how all the components are connected. For example, the hybrid pipeline should look like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "oCIMtwmThQG4"
   },
   "outputs": [],
   "source": [
    "# Uncomment the following to generate the images\n",
    "# !apt install libgraphviz-dev\n",
    "# !pip install pygraphviz\n",
    "\n",
    "# pipeline.draw(\"pipeline_hybrid.png\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "sTTVLUJylXRx"
   },
   "source": [
    "## Trying Out the Hybrid Pipeline\n",
    "\n",
    "Search for an article with Hybrid Retrieval. If you want to see all the steps, enable `debug=True` in `JoinDocuments`'s `params`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "p-5WbeBulXR0"
   },
   "outputs": [],
   "source": [
    "prediction = pipeline.run(\n",
    "    query=\"apnea in infants\",\n",
    "    params={\n",
    "        \"SparseRetriever\": {\"top_k\": 10},\n",
    "        \"DenseRetriever\": {\"top_k\": 10},\n",
    "        \"JoinDocuments\": {\"top_k_join\": 15},  # comment out when debugging\n",
    "        # \"JoinDocuments\": {\"top_k_join\": 15, \"debug\": True},  # uncomment when debugging\n",
    "        \"ReRanker\": {\"top_k\": 5},\n",
    "    },\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "WvPv1cJ6gbBJ"
   },
   "source": [
    "Create a function that prints the results as a simple *search results page*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "raL_z_sByDoQ"
   },
   "outputs": [],
   "source": [
    "def pretty_print_results(prediction):\n",
    "    for doc in prediction[\"documents\"]:\n",
    "        print(doc.meta[\"title\"], \"\\t\", doc.score)\n",
    "        print(doc.meta[\"abstract\"])\n",
    "        print(\"\\n\", \"\\n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "mSUiizGNytwX"
   },
   "outputs": [],
   "source": [
    "pretty_print_results(prediction)"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "gpuType": "T4",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}