{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "id": "bEH-CRbeA6NU"
   },
   "source": [
    "# Generative QA with Seq2SeqGenerator"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> `Seq2SeqGenerator` was deprecated in Haystack v1.16 and removed entirely in v1.18. We recommend following the tutorial on [Creating a Generative QA Pipeline with Retrieval-Augmentation](https://haystack.deepset.ai/tutorials/22_pipeline_with_promptnode) instead. For more details about this deprecation, check out [our announcement](https://github.com/deepset-ai/haystack/discussions/4816) on GitHub."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Follow this tutorial to learn how to build and use a pipeline for Long-Form Question Answering (LFQA). LFQA is a variant of the generative question answering task. LFQA systems query large document stores for relevant information and then use this information to generate accurate, multi-sentence answers. In a regular question answering system, the retrieved documents related to the query (context passages) act as source tokens for extracted answers. In an LFQA system, the context passages instead provide the context that the system uses to generate original, abstractive, long-form answers."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "id": "3K27Y5FbA6NV"
   },
   "source": [
    "\n",
    "## Preparing the Colab Environment\n",
    "\n",
    "- [Enable GPU Runtime](https://docs.haystack.deepset.ai/docs/enabling-gpu-acceleration#enabling-the-gpu-in-colab)\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Installing Haystack\n",
    "\n",
    "To start, let's install Haystack with `pip`. We pin version 1.17.2 because `Seq2SeqGenerator` was removed in v1.18:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "NM36kbRFA6Nc"
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "pip install --upgrade pip\n",
    "pip install farm-haystack[colab,faiss]==1.17.2"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Enabling Telemetry\n",
    "Knowing you're using this tutorial helps us decide where to invest our efforts to build a better product, but you can always opt out by commenting out the following line. See [Telemetry](https://docs.haystack.deepset.ai/docs/telemetry) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from haystack.telemetry import tutorial_running\n",
    "\n",
    "tutorial_running(12)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Logging\n",
    "\n",
    "We configure how log messages are displayed and which log level is used before importing Haystack.\n",
    "\n",
    "Example log message:\n",
    "\n",
    "`INFO - haystack.utils.preprocessing -  Converting data/tutorial1/218_Olenna_Tyrell.txt`\n",
    "\n",
    "The default log level in `basicConfig` is WARNING, so the explicit `level` parameter is not strictly necessary, but it makes the level easy to change:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "import logging\n",
    "\n",
    "logging.basicConfig(format=\"%(levelname)s - %(name)s -  %(message)s\", level=logging.WARNING)\n",
    "logging.getLogger(\"haystack\").setLevel(logging.INFO)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "id": "q3dSo7ZtA6Nl"
   },
   "source": [
    "## Initializing the DocumentStore\n",
    "\n",
    "FAISS is a library for efficient similarity search and clustering of dense vectors.\n",
    "The `FAISSDocumentStore` uses a SQL database (SQLite by default) under the hood\n",
    "to store the document text and other metadata. The vector embeddings of the text are\n",
    "indexed in a FAISS index that is later queried when searching for answers.\n",
    "The default flavour of `FAISSDocumentStore` is \"Flat\", but it can also be set to \"HNSW\" for\n",
    "faster search at the expense of some accuracy. Just set the `faiss_index_factory_str` argument in the constructor.\n",
    "For more information on which index suits your use case, see the [FAISS guidelines for choosing an index](https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "1cYgDJmrA6Nv",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "from haystack.document_stores import FAISSDocumentStore\n",
    "\n",
    "document_store = FAISSDocumentStore(embedding_dim=128, faiss_index_factory_str=\"Flat\")"
   ]
  },
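  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To build some intuition for what a \"Flat\" index does, here is a minimal sketch of exhaustive dot-product search in plain NumPy. It is an illustration under stated assumptions, not the actual FAISS or Haystack implementation: it assumes dot-product similarity, and the random vectors and the names `doc_embeddings`, `query_embedding`, and `top_k` are made up for this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Hypothetical data: 1,000 random 128-dimensional vectors stand in for document embeddings.\n",
    "rng = np.random.default_rng(seed=0)\n",
    "doc_embeddings = rng.normal(size=(1000, 128)).astype(\"float32\")\n",
    "query_embedding = rng.normal(size=128).astype(\"float32\")\n",
    "\n",
    "# A flat index compares the query against every stored vector (exhaustive search).\n",
    "scores = doc_embeddings @ query_embedding\n",
    "\n",
    "# Keep the indices of the top_k highest-scoring vectors, best first.\n",
    "top_k = 3\n",
    "top_ids = np.argsort(scores)[::-1][:top_k]\n",
    "print(top_ids, scores[top_ids])"
   ]
  },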
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "id": "06LatTJBA6N0",
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Cleaning and Writing Documents\n",
    "\n",
    "Similarly to the previous tutorials, we download, convert and write some Game of Thrones articles to our DocumentStore."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "iqKnu6wxA6N1",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "from haystack.utils import convert_files_to_docs, fetch_archive_from_http, clean_wiki_text\n",
    "\n",
    "\n",
    "# Let's first get some files that we want to use\n",
    "doc_dir = \"data/tutorial12\"\n",
    "s3_url = \"https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt12.zip\"\n",
    "fetch_archive_from_http(url=s3_url, output_dir=doc_dir)\n",
    "\n",
    "# Convert the files to Haystack documents, cleaning the wiki text and splitting it into paragraphs\n",
    "docs = convert_files_to_docs(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)\n",
    "\n",
    "# Now, let's write the documents to our DocumentStore.\n",
    "document_store.write_documents(docs)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "id": "wgjedxx_A6N6"
   },
   "source": [
    "## Initializing the Retriever\n",
    "\n",
    "We use a `DensePassageRetriever` and we invoke `update_embeddings` to index the embeddings of documents in the `FAISSDocumentStore`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "kFwiPP60A6N7",
    "pycharm": {
     "is_executing": true
    }
   },
   "outputs": [],
   "source": [
    "from haystack.nodes import DensePassageRetriever\n",
    "\n",
    "retriever = DensePassageRetriever(\n",
    "    document_store=document_store,\n",
    "    query_embedding_model=\"vblagoje/dpr-question_encoder-single-lfqa-wiki\",\n",
    "    passage_embedding_model=\"vblagoje/dpr-ctx_encoder-single-lfqa-wiki\",\n",
    ")\n",
    "\n",
    "document_store.update_embeddings(retriever)"
   ]
  },
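  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check, we can compare the number of stored documents with the number of indexed embeddings. This is a minimal sketch that assumes the cells above ran successfully; the two counts would be expected to match once `update_embeddings` has finished."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Both counts are read from the DocumentStore that was populated in the cells above.\n",
    "print(f\"Documents:  {document_store.get_document_count()}\")\n",
    "print(f\"Embeddings: {document_store.get_embedding_count()}\")"
   ]
  },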
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "id": "sMlVEnJ2NkZZ"
   },
   "source": [
    "Before we rely on the `DensePassageRetriever`, let's test it empirically to make sure that a simple search does find the relevant documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "qpu-t9rndgpe"
   },
   "outputs": [],
   "source": [
    "from haystack.utils import print_documents\n",
    "from haystack.pipelines import DocumentSearchPipeline\n",
    "\n",
    "p_retrieval = DocumentSearchPipeline(retriever)\n",
    "res = p_retrieval.run(query=\"Tell me something about Arya Stark?\", params={\"Retriever\": {\"top_k\": 10}})\n",
    "print_documents(res, max_text_len=512)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "id": "rnVR28OXA6OA"
   },
   "source": [
    "## Initializing the Generator\n",
    "\n",
    "As in the previous tutorials, we now initialize our Generator.\n",
    "\n",
    "Here we use a `Seq2SeqGenerator` with the [*vblagoje/bart_lfqa*](https://huggingface.co/vblagoje/bart_lfqa) model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "fyIuWVwhA6OB"
   },
   "outputs": [],
   "source": [
    "from haystack.nodes import Seq2SeqGenerator\n",
    "\n",
    "\n",
    "generator = Seq2SeqGenerator(model_name_or_path=\"vblagoje/bart_lfqa\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "id": "unhLD18yA6OF"
   },
   "source": [
    "## Initializing the Pipeline\n",
    "\n",
    "With a Haystack `Pipeline`, you can combine your building blocks into a search pipeline.\n",
    "Under the hood, `Pipelines` are Directed Acyclic Graphs (DAGs) that you can easily customize for your own use cases.\n",
    "To speed things up, Haystack also comes with a few predefined Pipelines. One of them is the `GenerativeQAPipeline`, which combines a Retriever and a Generator to answer our questions.\n",
    "You can learn more about `Pipelines` in the [docs](https://docs.haystack.deepset.ai/docs/pipelines)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "TssPQyzWA6OG"
   },
   "outputs": [],
   "source": [
    "from haystack.pipelines import GenerativeQAPipeline\n",
    "\n",
    "pipe = GenerativeQAPipeline(generator, retriever)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "id": "bXlBBxKXA6OL"
   },
   "source": [
    "## Asking a Question\n",
    "We use the pipeline's `run()` method to ask a question."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Zi97Hif2A6OM"
   },
   "outputs": [],
   "source": [
    "pipe.run(\n",
    "    query=\"How did Arya Stark's character get portrayed in a television adaptation?\", params={\"Retriever\": {\"top_k\": 3}}\n",
    ")"
   ]
  },
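  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`run()` returns a dictionary. As a minimal sketch, assuming the generated answers are stored under the `answers` key as `Answer` objects, you could print only the generated text like this (the variable name `result` is our own choice for this example):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result = pipe.run(\n",
    "    query=\"How did Arya Stark's character get portrayed in a television adaptation?\", params={\"Retriever\": {\"top_k\": 3}}\n",
    ")\n",
    "\n",
    "# Assumption: each entry is an Answer object whose generated text is in the `answer` attribute.\n",
    "for answer in result[\"answers\"]:\n",
    "    print(answer.answer)"
   ]
  },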
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "IfTP9BfFGOo6"
   },
   "outputs": [],
   "source": [
    "pipe.run(query=\"Why is Arya Stark an unusual character?\", params={\"Retriever\": {\"top_k\": 3}})"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "collapsed_sections": [],
   "name": "LFQA_via_Haystack.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3.8.9 64-bit",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  },
  "vscode": {
   "interpreter": {
    "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}