{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tutorial: Build a Scalable Question Answering System\n",
    "\n",
    "- **Level**: Beginner\n",
    "- **Time to complete**: 20 minutes\n",
    "- **Nodes Used**: `ElasticsearchDocumentStore`, `BM25Retriever`, `FARMReader`\n",
    "- **Goal**: After completing this tutorial, you'll have built a scalable search system that runs on text files and can answer questions about Game of Thrones. You'll then be able to expand this system for your needs.\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## Overview\n",
    "\n",
    "Learn how to set up a question answering system that can search through complex knowledge bases and highlight answers to questions such as \"Who is the father of Arya Stark?\". In this tutorial, we'll work on a set of Wikipedia pages about Game of Thrones, but you can adapt it to search through internal wikis or a collection of financial reports, for example.\n",
    "\n",
    "This tutorial introduces you to all the concepts needed to build such a question answering system. It also uses Haystack components, such as indexing pipelines, querying pipelines, and DocumentStores backed by external database services.\n",
    "\n",
    "Let's learn how to build a question answering system and discover more about the marvelous seven kingdoms!"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## Preparing the Colab Environment\n",
    "\n",
    "- [Enable GPU Runtime](https://docs.haystack.deepset.ai/docs/enabling-gpu-acceleration#enabling-the-gpu-in-colab)\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Installing Haystack\n",
    "\n",
    "To start, let's install the latest release of Haystack with `pip`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "pip install --upgrade pip\n",
    "pip install farm-haystack[colab,preprocessing,elasticsearch,inference]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Enabling Telemetry \n",
71
    "Knowing you're using this tutorial helps us decide where to invest our efforts to build a better product but you can always opt out by commenting the following line. See [Telemetry](https://docs.haystack.deepset.ai/docs/telemetry) for more details."
72
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from haystack.telemetry import tutorial_running\n",
    "\n",
    "tutorial_running(3)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Set the logging level to INFO:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import logging\n",
    "\n",
    "logging.basicConfig(format=\"%(levelname)s - %(name)s -  %(message)s\", level=logging.WARNING)\n",
    "logging.getLogger(\"haystack\").setLevel(logging.INFO)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Initializing the ElasticsearchDocumentStore\n",
    "\n",
    "A DocumentStore stores the Documents that the question answering system uses to find answers to your questions. Here, we're using the [`ElasticsearchDocumentStore`](https://docs.haystack.deepset.ai/reference/document-store-api#module-elasticsearch) which connects to a running Elasticsearch service. It's a fast and scalable text-focused storage option. This service runs independently from Haystack and persists even after the Haystack program has finished running. To learn more about the DocumentStore and the different types of external databases that we support, see [DocumentStore](https://docs.haystack.deepset.ai/docs/document_store)."
113
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Download, extract, and set the permissions for the Elasticsearch installation image:"
121
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q\n",
    "tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz\n",
    "chown -R daemon:daemon elasticsearch-7.9.2"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. Start the server:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash --bg\n",
    "\n",
    "sudo -u daemon -- elasticsearch-7.9.2/bin/elasticsearch"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If Docker is available in your environment (Colab notebooks do not support Docker), you can also start Elasticsearch using Docker. You can do this manually, or using our [`launch_es()`](https://docs.haystack.deepset.ai/reference/utils-api#module-doc_store) utility function."
161
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# from haystack.utils import launch_es\n",
    "\n",
    "# launch_es()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3. Wait 30 seconds for the server to fully start up:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "\n",
    "time.sleep(30)"
   ]
  },
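  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Alternatively, instead of sleeping for a fixed time, you can poll the server until it responds. This optional sketch assumes Elasticsearch is listening on the default `localhost:9200`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: poll Elasticsearch instead of sleeping for a fixed time.\n",
    "# This sketch assumes the default host and port, localhost:9200.\n",
    "import time\n",
    "\n",
    "import requests\n",
    "\n",
    "for _ in range(30):\n",
    "    try:\n",
    "        if requests.get(\"http://localhost:9200\").status_code == 200:\n",
    "            print(\"Elasticsearch is up\")\n",
    "            break\n",
    "    except requests.exceptions.ConnectionError:\n",
    "        pass\n",
    "    time.sleep(1)"
   ]
  },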
193
  {
194
   "attachments": {},
195
   "cell_type": "markdown",
196
   "metadata": {},
197
   "source": [
198
    "4. Initialize the ElasticsearchDocumentStore:\n"
199
   ]
200
  },
201
  {
202
   "cell_type": "code",
203
   "execution_count": 12,
204
   "metadata": {},
205
   "outputs": [],
206
   "source": [
207
    "import os\n",
208
    "from haystack.document_stores import ElasticsearchDocumentStore\n",
209
    "\n",
210
    "# Get the host where Elasticsearch is running, default to localhost\n",
211
    "host = os.environ.get(\"ELASTICSEARCH_HOST\", \"localhost\")\n",
212
    "\n",
213
    "document_store = ElasticsearchDocumentStore(host=host, username=\"\", password=\"\", index=\"document\")"
214
   ]
215
  },
216
  {
217
   "attachments": {},
218
   "cell_type": "markdown",
219
   "metadata": {},
220
   "source": [
221
    "ElasticsearchDocumentStore is up and running and ready to store the Documents."
222
   ]
223
  },
224
  {
225
   "attachments": {},
226
   "cell_type": "markdown",
227
   "metadata": {},
228
   "source": [
229
    "## Indexing Documents with a Pipeline\n",
230
    "\n",
231
    "The next step is adding the files to the DocumentStore. The indexing pipeline turns your files into Document objects and writes them to the DocumentStore. Our indexing pipeline will have two nodes: `TextConverter`, which turns `.txt` files into Haystack `Document` objects, and `PreProcessor`, which cleans and splits the text within a `Document`.\n",
232
    "\n",
233
    "Once we combine these nodes into a pipeline, the pipeline will ingest `.txt` file paths, preprocess them, and write them into the DocumentStore.\n"
234
   ]
235
  },
236
  {
237
   "attachments": {},
238
   "cell_type": "markdown",
239
   "metadata": {},
240
   "source": [
241
    "1. Download 517 articles from the Game of Thrones Wikipedia. You can find them in *data/build_a_scalable_question_answering_system* as a set of *.txt* files."
242
   ]
243
  },
244
  {
245
   "cell_type": "code",
246
   "execution_count": null,
247
   "metadata": {},
248
   "outputs": [],
249
   "source": [
250
    "from haystack.utils import fetch_archive_from_http\n",
251
    "\n",
252
    "doc_dir = \"data/build_a_scalable_question_answering_system\"\n",
253
    "\n",
254
    "fetch_archive_from_http(\n",
255
    "    url=\"https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt3.zip\",\n",
256
    "    output_dir=doc_dir,\n",
257
    ")"
258
   ]
259
  },
260
  {
261
   "attachments": {},
262
   "cell_type": "markdown",
263
   "metadata": {},
264
   "source": [
265
    "2. Initialize the pipeline, TextConverter, and PreProcessor:"
266
   ]
267
  },
268
  {
269
   "cell_type": "code",
270
   "execution_count": 4,
271
   "metadata": {},
272
   "outputs": [],
273
   "source": [
274
    "from haystack import Pipeline\n",
275
    "from haystack.nodes import TextConverter, PreProcessor\n",
276
    "\n",
277
    "indexing_pipeline = Pipeline()\n",
278
    "text_converter = TextConverter()\n",
279
    "preprocessor = PreProcessor(\n",
280
    "    clean_whitespace=True,\n",
281
    "    clean_header_footer=True,\n",
282
    "    clean_empty_lines=True,\n",
283
    "    split_by=\"word\",\n",
284
    "    split_length=200,\n",
285
    "    split_overlap=20,\n",
286
    "    split_respect_sentence_boundary=True,\n",
287
    ")"
288
   ]
289
  },
290
  {
291
   "attachments": {},
292
   "cell_type": "markdown",
293
   "metadata": {},
294
   "source": [
295
    "To learn more about the parameters of the `PreProcessor`, see [Usage](https://docs.haystack.deepset.ai/docs/preprocessor#usage). To understand why document splitting is important for your question answering system's performance, see [Document Length](https://docs.haystack.deepset.ai/docs/optimization#document-length)."
296
   ]
297
  },
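  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To preview what the `PreProcessor` will do during indexing, you can run it on a single toy `Document`. This optional check uses a made-up sample text and shows how one long text is split into several smaller Documents:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: preview how the PreProcessor splits a Document.\n",
    "# The sample text below is made up purely for illustration.\n",
    "from haystack import Document\n",
    "\n",
    "sample = Document(content=\"Arya Stark is a daughter of Eddard Stark. \" * 100)\n",
    "split_docs = preprocessor.process([sample])\n",
    "print(f\"1 Document became {len(split_docs)} Documents\")"
   ]
  },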
298
  {
299
   "attachments": {},
300
   "cell_type": "markdown",
301
   "metadata": {},
302
   "source": [
303
    "2. Add the nodes into an indexing pipeline. You should provide the `name` or `name`s of preceding nodes as the `input` argument. Note that in an indexing pipeline, the input to the first node is `File`."
304
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "indexing_pipeline.add_node(component=text_converter, name=\"TextConverter\", inputs=[\"File\"])\n",
    "indexing_pipeline.add_node(component=preprocessor, name=\"PreProcessor\", inputs=[\"TextConverter\"])\n",
    "indexing_pipeline.add_node(component=document_store, name=\"DocumentStore\", inputs=[\"PreProcessor\"])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3. Run the indexing pipeline to write the text data into the DocumentStore:"
325
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "files_to_index = [doc_dir + \"/\" + f for f in os.listdir(doc_dir)]\n",
    "indexing_pipeline.run_batch(file_paths=files_to_index)"
   ]
  },
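  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To confirm that indexing worked, you can check the number of Documents in the DocumentStore. Because the `PreProcessor` splits each article into smaller Documents, the count should be larger than the number of files:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sanity check: each article is split into smaller Documents,\n",
    "# so the Document count should exceed the number of .txt files.\n",
    "print(f\"Files indexed: {len(files_to_index)}\")\n",
    "print(f\"Documents in store: {document_store.get_document_count()}\")"
   ]
  },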
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The code in this tutorial uses Game of Thrones data, but you can also supply your own `.txt` files and index them in the same way.\n",
    "\n",
    "As an alternative, you can cast you text data into [Document objects](https://docs.haystack.deepset.ai/docs/documents_answers_labels#document) and write them into the DocumentStore using [`DocumentStore.write_documents()`](https://docs.haystack.deepset.ai/reference/document-store-api#basedocumentstorewrite_documents)."
345
   ]
  },
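  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's a minimal sketch of that alternative. The example texts are made up for illustration, so the write call is commented out to keep the Game of Thrones index unchanged:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A minimal sketch of writing Documents directly, without an indexing pipeline.\n",
    "# The example texts below are made up for illustration.\n",
    "from haystack import Document\n",
    "\n",
    "docs = [\n",
    "    Document(content=\"Arya Stark is the third child of Eddard and Catelyn Stark.\"),\n",
    "    Document(content=\"The Dothraki language was created by David J. Peterson.\"),\n",
    "]\n",
    "\n",
    "# document_store.write_documents(docs)"
   ]
  },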
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that the Documents are in the DocumentStore, let's initialize the nodes we want to use in our query pipeline."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Initializing the Retriever\n",
    "\n",
    "Our query pipeline is going to use a Retriever, so we need to initialize it. A Retriever sifts through all the Documents and returns only those that are relevant to the question. This tutorial uses the BM25Retriever. This is the recommended Retriever for a question answering system like the one we're creating. For more Retriever options, see [Retriever](https://docs.haystack.deepset.ai/docs/retriever)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "from haystack.nodes import BM25Retriever\n",
    "\n",
    "retriever = BM25Retriever(document_store=document_store)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The BM25Retriever is initialized and ready for the pipeline."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Initializing the Reader\n",
    "\n",
    "Our query pipeline also needs a Reader, so we'll initialize it next. A Reader scans the texts it received from the Retriever and extracts the top answer candidates. Readers are based on powerful deep learning models but are much slower than Retrievers at processing the same amount of text. This tutorials uses a FARMReader with a base-sized RoBERTa question answering model called [`deepset/roberta-base-squad2`](https://huggingface.co/deepset/roberta-base-squad2). It's a good all-round model to start with. To find a model that's best for your use case, see [Models](https://docs.haystack.deepset.ai/docs/reader#models)."
392
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "from haystack.nodes import FARMReader\n",
    "\n",
    "reader = FARMReader(model_name_or_path=\"deepset/roberta-base-squad2\", use_gpu=True)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating the Retriever-Reader Pipeline\n",
    "\n",
    "You can combine the Reader and Retriever in a querying pipeline using the `Pipeline` class. The combination of the two speeds up processing because the Reader only processes the Documents that it received from the Retriever. "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Initialize the `Pipeline` object and add the Retriever and Reader as nodes. You should provide the `name` or `name`s of preceding nodes as the input argument. Note that in a querying pipeline, the input to the first node is `Query`."
421
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "from haystack import Pipeline\n",
    "\n",
    "querying_pipeline = Pipeline()\n",
    "querying_pipeline.add_node(component=retriever, name=\"Retriever\", inputs=[\"Query\"])\n",
    "querying_pipeline.add_node(component=reader, name=\"Reader\", inputs=[\"Retriever\"])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's it! Your pipeline's ready to answer your questions!"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Asking a Question\n",
    "\n",
    "1. Use the pipeline's `run()` method to ask a question. The query argument is where you type your question. Additionally, you can set the number of documents you want the Reader and Retriever to return using the `top-k` parameter. To learn more about setting arguments, see [Arguments](https://docs.haystack.deepset.ai/docs/pipelines#arguments). To understand the importance of the `top-k` parameter, see [Choosing the Right top-k Values](https://docs.haystack.deepset.ai/docs/optimization#choosing-the-right-top-k-values).\n"
452
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Inferencing Samples: 100%|██████████| 1/1 [00:23<00:00, 23.77s/ Batches]\n"
     ]
    }
   ],
   "source": [
    "prediction = querying_pipeline.run(\n",
    "    query=\"Who is the father of Arya Stark?\", params={\"Retriever\": {\"top_k\": 10}, \"Reader\": {\"top_k\": 5}}\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here are some questions you could try out:\n",
    "- Who is the father of Arya Stark?\n",
    "- Who created the Dothraki vocabulary?\n",
    "- Who is the sister of Sansa?"
   ]
  },
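  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you want to try them all, this optional cell loops over the questions with the same pipeline. It assumes each query returns at least one answer:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: run each sample question through the same pipeline.\n",
    "# Assumes every query returns at least one answer.\n",
    "questions = [\n",
    "    \"Who is the father of Arya Stark?\",\n",
    "    \"Who created the Dothraki vocabulary?\",\n",
    "    \"Who is the sister of Sansa?\",\n",
    "]\n",
    "\n",
    "for question in questions:\n",
    "    result = querying_pipeline.run(\n",
    "        query=question, params={\"Retriever\": {\"top_k\": 10}, \"Reader\": {\"top_k\": 5}}\n",
    "    )\n",
    "    print(question, \"->\", result[\"answers\"][0].answer)"
   ]
  },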
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. Print out the answers the pipeline returns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pprint import pprint\n",
    "\n",
    "pprint(prediction)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3. Simplify the printed answers:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from haystack.utils import print_answers\n",
    "\n",
    "print_answers(prediction, details=\"minimum\")  ## Choose from `minimum`, `medium` and `all`"
520
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And there you have it! Congratulations on building a scalable machine learning based question answering system!"
528
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "# Next Steps\n",
538
    "\n",
539
    "To learn how to improve the performance of the Reader, see [Fine-Tune a Reader](https://haystack.deepset.ai/tutorials/02_finetune_a_model_on_your_data)."
540
   ]
541
  }
542
 ],
543
 "metadata": {
544
  "kernelspec": {
545
   "display_name": "Python 3.8.12 ('haystack_py38')",
546
   "language": "python",
547
   "name": "python3"
548
  },
549
  "language_info": {
550
   "codemirror_mode": {
551
    "name": "ipython",
552
    "version": 3
553
   },
554
   "file_extension": ".py",
555
   "mimetype": "text/x-python",
556
   "name": "python",
557
   "nbconvert_exporter": "python",
558
   "pygments_lexer": "ipython3",
559
   "version": "3.8.12"
560
  },
561
  "vscode": {
562
   "interpreter": {
563
    "hash": "85ea2c107d7945555de8e73270cf8a4d668bafec7aac344fa62e3415dc7bf5ec"
564
   }
565
  }
566
 },
567
 "nbformat": 4,
568
 "nbformat_minor": 2
569
}
570
