{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "id": "DeAkZwDhufYA"
   },
   "source": [
    "# Open-Domain QA on Tables\n",
    "\n",
    "This tutorial shows you how to perform question answering on tables using the `EmbeddingRetriever` or `BM25Retriever` as the retriever node and the `TableReader` as the reader node."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "id": "vbR3bETlvi-3"
   },
   "source": [
    "\n",
    "## Preparing the Colab Environment\n",
    "\n",
    "- [Enable GPU Runtime](https://docs.haystack.deepset.ai/docs/enabling-gpu-acceleration#enabling-the-gpu-in-colab)\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Installing Haystack\n",
    "\n",
    "To start, let's install the latest release of Haystack with `pip`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "pip install --upgrade pip\n",
    "pip install farm-haystack[colab,elasticsearch,metrics,inference]\n",
    "\n",
    "# Install pygraphviz for visualization of Pipelines\n",
    "apt install libgraphviz-dev\n",
    "pip install pygraphviz"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Enabling Telemetry\n",
    "Knowing you're using this tutorial helps us decide where to invest our efforts to build a better product, but you can always opt out by commenting out the following line. See [Telemetry](https://docs.haystack.deepset.ai/docs/telemetry) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from haystack.telemetry import tutorial_running\n",
    "\n",
    "tutorial_running(15)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Logging\n",
    "\n",
    "We configure how logging messages should be displayed and which log level should be used before importing Haystack.\n",
    "Example log message:\n",
    "INFO - haystack.utils.preprocessing -  Converting data/tutorial1/218_Olenna_Tyrell.txt\n",
    "The default log level in `basicConfig` is WARNING, so the explicit parameter is not necessary, but it can be changed easily:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import logging\n",
    "\n",
    "logging.basicConfig(format=\"%(levelname)s - %(name)s -  %(message)s\", level=logging.WARNING)\n",
    "logging.getLogger(\"haystack\").setLevel(logging.INFO)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Start an Elasticsearch server\n",
    "You can start Elasticsearch on your local machine using Docker. If Docker is not readily available in your environment (e.g., in Colab notebooks), you can manually download and execute Elasticsearch from source."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Recommended: Start Elasticsearch using Docker via the Haystack utility function\n",
    "from haystack.utils import launch_es\n",
    "\n",
    "launch_es()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Start an Elasticsearch server in Colab\n",
    "\n",
    "If Docker is not readily available in your environment (e.g. in Colab notebooks), then you can manually download and execute Elasticsearch from source."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "shellscript"
    }
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q\n",
    "tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz\n",
    "chown -R daemon:daemon elasticsearch-7.9.2\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "shellscript"
    }
   },
   "outputs": [],
   "source": [
    "%%bash --bg\n",
    "\n",
    "sudo -u daemon -- elasticsearch-7.9.2/bin/elasticsearch"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "RmxepXZtwQ0E"
   },
   "outputs": [],
   "source": [
    "# Connect to Elasticsearch\n",
    "import os\n",
    "import time\n",
    "from haystack.document_stores import ElasticsearchDocumentStore\n",
    "\n",
    "\n",
    "# Wait 30 seconds only to be sure Elasticsearch is ready before continuing\n",
    "time.sleep(30)\n",
    "\n",
    "# Get the host where Elasticsearch is running, default to localhost\n",
    "host = os.environ.get(\"ELASTICSEARCH_HOST\", \"localhost\")\n",
    "\n",
    "document_index = \"document\"\n",
    "document_store = ElasticsearchDocumentStore(host=host, username=\"\", password=\"\", index=document_index)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "id": "fFh26LIlxldw"
   },
   "source": [
    "## Add Tables to DocumentStore\n",
    "To quickly demonstrate the capabilities of the `EmbeddingRetriever` and the `TableReader`, we use a subset of 1000 tables and text documents from a dataset we have published in [this paper](https://arxiv.org/abs/2108.04049).\n",
    "\n",
    "Just like text passages, tables are represented as `Document` objects in Haystack. The content field, though, is a pandas DataFrame instead of a string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "nM63uwbd8zd6"
   },
   "outputs": [],
   "source": [
    "# Let's first fetch some tables that we want to query\n",
    "# Here: 1000 tables from OTT-QA\n",
    "from haystack.utils import fetch_archive_from_http\n",
    "\n",
    "doc_dir = \"data/tutorial15\"\n",
    "s3_url = \"https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/table_text_dataset.zip\"\n",
    "fetch_archive_from_http(url=s3_url, output_dir=doc_dir)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "SKjw2LuXxlGh",
    "outputId": "92c67d24-d6fb-413e-8dd7-53075141d508"
   },
   "outputs": [],
   "source": [
    "# Add the tables to the DocumentStore\n",
    "import json\n",
    "from haystack import Document\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "def read_tables(filename):\n",
    "    processed_tables = []\n",
    "    with open(filename) as tables:\n",
    "        tables = json.load(tables)\n",
    "        for key, table in tables.items():\n",
    "            current_columns = table[\"header\"]\n",
    "            current_rows = table[\"data\"]\n",
    "            current_df = pd.DataFrame(columns=current_columns, data=current_rows)\n",
    "            document = Document(content=current_df, content_type=\"table\", id=key)\n",
    "            processed_tables.append(document)\n",
    "\n",
    "    return processed_tables\n",
    "\n",
    "\n",
    "tables = read_tables(f\"{doc_dir}/tables.json\")\n",
    "document_store.write_documents(tables, index=document_index)\n",
    "\n",
    "# Showing content field and meta field of one of the Documents of content_type 'table'\n",
    "print(tables[0].content)\n",
    "print(tables[0].meta)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "id": "hmQC1sDmw3d7"
   },
   "source": [
    "## Initialize Retriever, Reader & Pipeline\n",
    "\n",
    "### Retriever\n",
    "\n",
    "Retrievers help narrow down the scope for the Reader to a subset of tables where a given question could be answered.\n",
    "They use simple but fast algorithms.\n",
    "\n",
    "**Here:** We specify an embedding model that is fine-tuned so it can also generate embeddings for tables (instead of just text).\n",
    "\n",
    "**Alternatives:**\n",
    "\n",
    "- `BM25Retriever`, which uses the BM25 algorithm\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "EY_qvdV6wyK5"
   },
   "outputs": [],
   "source": [
    "from haystack.nodes.retriever import EmbeddingRetriever\n",
    "\n",
    "retriever = EmbeddingRetriever(document_store=document_store, embedding_model=\"deepset/all-mpnet-base-v2-table\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "jasi1RM2zIJ7"
   },
   "outputs": [],
   "source": [
    "# Add table embeddings to the tables in DocumentStore\n",
    "document_store.update_embeddings(retriever=retriever)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "XM-ijy6Zz11L"
   },
   "outputs": [],
   "source": [
    "## Alternative: BM25Retriever\n",
    "# from haystack.nodes.retriever import BM25Retriever\n",
    "# retriever = BM25Retriever(document_store=document_store)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "YHfQWxVI0N2e",
    "outputId": "1d8dc4d2-a184-489e-defa-d445d76c458f"
   },
   "outputs": [],
   "source": [
    "# Try the Retriever\n",
    "retrieved_tables = retriever.retrieve(\"Who won the Super Bowl?\", top_k=5)\n",
    "\n",
    "# Get highest scored table\n",
    "print(retrieved_tables[0].content)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "id": "zbwkXScm2-gy"
   },
   "source": [
    "### Reader\n",
    "The `TableReader` is based on TaPas, a transformer-based language model capable of grasping the two-dimensional structure of a table. It scans the tables returned by the retriever and extracts the answer. The available TableReader models can be found [here](https://huggingface.co/models?pipeline_tag=table-question-answering&sort=downloads).\n",
    "\n",
    "**Notice**: The `TableReader` will return an answer for each table, even if the query cannot be answered by the table. Furthermore, the confidence scores are not useful as of now, given that they will *always* be very high (i.e. 1 or close to 1)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "4APcRoio2RxG"
   },
   "outputs": [],
   "source": [
    "from haystack.nodes import TableReader\n",
    "\n",
    "reader = TableReader(model_name_or_path=\"google/tapas-base-finetuned-wtq\", max_seq_len=512)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "ILuAXkyN4F7x",
    "outputId": "4bd19dcb-df8e-4a4d-b9d2-d34650e9e5c2"
   },
   "outputs": [],
   "source": [
    "# Try the TableReader on one Table\n",
    "\n",
    "table_doc = document_store.get_document_by_id(\"36964e90-3735-4ba1-8e6a-bec236e88bb2\")\n",
    "print(table_doc.content)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "ilbsecgA4vfN",
    "outputId": "f845f43e-43e8-48fe-d0ef-91b17a5eff0e"
   },
   "outputs": [],
   "source": [
    "from haystack.utils import print_answers\n",
    "\n",
    "prediction = reader.predict(query=\"Who played Gregory House in the series House?\", documents=[table_doc])\n",
    "print_answers(prediction, details=\"all\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "id": "jkAYNMb7R9qu"
   },
   "source": [
    "The offsets in the `offsets_in_document` and `offsets_in_context` fields indicate the table cells that the model predicts to be part of the answer. They need to be interpreted on the linearized table, i.e., a flat list containing all of the table cells."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "It8XYT2ZTVJs",
    "outputId": "7d31af60-e04a-485d-f0ee-f29592b03928"
   },
   "outputs": [],
   "source": [
    "print(f\"Predicted answer: {prediction['answers'][0].answer}\")\n",
    "print(f\"Meta field: {prediction['answers'][0].meta}\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "id": "pgmG7pzL5ceh"
   },
   "source": [
    "### Pipeline\n",
    "The Retriever and the Reader can be combined into a pipeline that first retrieves relevant tables and then extracts the answer.\n",
    "\n",
    "**Notice**: Given that the `TableReader` does not provide useful confidence scores and returns an answer for each of the tables, the sorting of the answers might not be helpful."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "G-aZZvyv4-Mf"
   },
   "outputs": [],
   "source": [
    "# Initialize pipeline\n",
    "from haystack import Pipeline\n",
    "\n",
    "table_qa_pipeline = Pipeline()\n",
    "table_qa_pipeline.add_node(component=retriever, name=\"EmbeddingRetriever\", inputs=[\"Query\"])\n",
    "table_qa_pipeline.add_node(component=reader, name=\"TableReader\", inputs=[\"EmbeddingRetriever\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "m8evexnW6dev",
    "outputId": "40514084-f516-4f13-fb48-6a55cb578366"
   },
   "outputs": [],
   "source": [
    "prediction = table_qa_pipeline.run(\"When was Guilty Gear Xrd : Sign released?\", params={\"top_k\": 30})\n",
    "print_answers(prediction, details=\"minimum\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "4CBcIjIq_uFx"
   },
   "outputs": [],
   "source": [
    "# Add 500 text passages to our document store.\n",
    "\n",
    "\n",
    "def read_texts(filename):\n",
    "    processed_passages = []\n",
    "    with open(filename) as passages:\n",
    "        passages = json.load(passages)\n",
    "        for key, content in passages.items():\n",
    "            document = Document(content=content, content_type=\"text\", id=key)\n",
    "            processed_passages.append(document)\n",
    "\n",
    "    return processed_passages\n",
    "\n",
    "\n",
    "passages = read_texts(f\"{doc_dir}/texts.json\")\n",
    "document_store.write_documents(passages, index=document_index)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "j1TaNF7SiKgH"
   },
   "outputs": [],
   "source": [
    "document_store.update_embeddings(retriever=retriever, update_existing_embeddings=False)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "id": "c2sk_uNHj0DY"
   },
   "source": [
    "## Pipeline for QA on Combination of Text and Tables\n",
    "We are using one node for retrieving both texts and tables, the `EmbeddingRetriever`. In order to do question-answering on the Documents coming from the `EmbeddingRetriever`, we need to route Documents of type `\"text\"` to a `FARMReader` (or alternatively `TransformersReader`) and Documents of type `\"table\"` to a `TableReader`.\n",
    "\n",
    "To achieve this, we make use of two additional nodes:\n",
    "- `RouteDocuments`: Splits the List of Documents retrieved by the `EmbeddingRetriever` into two lists containing only Documents of type `\"text\"` or `\"table\"`, respectively.\n",
    "- `JoinAnswers`: Takes Answers coming from two different Readers (in this case `FARMReader` and `TableReader`) and joins them to a single list of Answers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Ej_j8Q3wlxXE"
   },
   "outputs": [],
   "source": [
    "from haystack.nodes import FARMReader, RouteDocuments, JoinAnswers\n",
    "\n",
    "text_reader = FARMReader(\"deepset/roberta-base-squad2\")\n",
    "# In order to get meaningful scores from the TableReader, use \"deepset/tapas-large-nq-hn-reader\" or\n",
    "# \"deepset/tapas-large-nq-reader\" as TableReader models. The disadvantage of these models is, however,\n",
    "# that they are not capable of doing aggregations over multiple table cells.\n",
    "table_reader = TableReader(\"deepset/tapas-large-nq-hn-reader\")\n",
    "route_documents = RouteDocuments()\n",
    "join_answers = JoinAnswers()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Zdq6JnF5m3aP"
   },
   "outputs": [],
   "source": [
    "text_table_qa_pipeline = Pipeline()\n",
    "text_table_qa_pipeline.add_node(component=retriever, name=\"EmbeddingRetriever\", inputs=[\"Query\"])\n",
    "text_table_qa_pipeline.add_node(component=route_documents, name=\"RouteDocuments\", inputs=[\"EmbeddingRetriever\"])\n",
    "text_table_qa_pipeline.add_node(component=text_reader, name=\"TextReader\", inputs=[\"RouteDocuments.output_1\"])\n",
    "text_table_qa_pipeline.add_node(component=table_reader, name=\"TableReader\", inputs=[\"RouteDocuments.output_2\"])\n",
    "text_table_qa_pipeline.add_node(component=join_answers, name=\"JoinAnswers\", inputs=[\"TextReader\", \"TableReader\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 540
    },
    "id": "K4vH1ZEnniut",
    "outputId": "85aa17a8-227d-40e4-c8c0-5d0532faa47a"
   },
   "outputs": [],
   "source": [
    "# Uncomment the following line to draw the structure of the combined table and text QA pipeline.\n",
    "# text_table_qa_pipeline.draw(\"pipeline.png\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![image](https://github.com/deepset-ai/haystack-tutorials/blob/main/tutorials/img/table-qa-pipeline.png?raw=true)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "strPNduPoBLe"
   },
   "outputs": [],
   "source": [
    "# Example query whose answer resides in a text passage\n",
    "predictions = text_table_qa_pipeline.run(query=\"Who was Thomas Alva Edison?\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "9YiK75tSoOGA",
    "outputId": "bd52f841-3846-441f-dd6f-53b02111691e"
   },
   "outputs": [],
   "source": [
    "# We can see both text passages and tables as contexts of the predicted answers.\n",
    "print_answers(predictions, details=\"minimum\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "QYOHDSmLpzEg"
   },
   "outputs": [],
   "source": [
    "# Example query whose answer resides in a table\n",
    "predictions = text_table_qa_pipeline.run(query=\"Which country does the film Macaroni come from?\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "4kw53uWep3zj",
    "outputId": "b332cc17-3cb8-4e20-d79d-bb4cf656f277"
   },
   "outputs": [],
   "source": [
    "# We can see both text passages and tables as contexts of the predicted answers.\n",
    "print_answers(predictions, details=\"minimum\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation\n",
    "To evaluate our pipeline, we can use Haystack's evaluation feature. We just need to convert our labels into `MultiLabel` objects, and the `eval` method will do the rest."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from haystack import Label, MultiLabel, Answer\n",
    "\n",
    "\n",
    "def read_labels(filename, tables):\n",
    "    processed_labels = []\n",
    "    with open(filename) as labels:\n",
    "        labels = json.load(labels)\n",
    "        for table in tables:\n",
    "            if table.id not in labels:\n",
    "                continue\n",
    "            label = labels[table.id]\n",
    "            label = Label(\n",
    "                query=label[\"query\"],\n",
    "                document=table,\n",
    "                is_correct_answer=True,\n",
    "                is_correct_document=True,\n",
    "                answer=Answer(answer=label[\"answer\"]),\n",
    "                origin=\"gold-label\",\n",
    "            )\n",
    "            processed_labels.append(MultiLabel(labels=[label]))\n",
    "    return processed_labels\n",
    "\n",
    "\n",
    "table_labels = read_labels(f\"{doc_dir}/labels.json\", tables)\n",
    "passage_labels = read_labels(f\"{doc_dir}/labels.json\", passages)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "eval_results = text_table_qa_pipeline.eval(table_labels + passage_labels, params={\"top_k\": 10})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Calculating and printing the evaluation metrics\n",
    "print(eval_results.calculate_metrics())"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Adding tables from PDFs\n",
    "It can sometimes be hard to provide your data in the form of a pandas DataFrame. For this case, we provide the `ParsrConverter` wrapper that can help you convert, for example, a PDF file into a document that you can index.\n",
    "\n",
    "**Attention: `parsr` needs a Docker environment for execution, but Colab doesn't support Docker.**\n",
    "**If you have a local Docker environment, you can uncomment and run the following cells.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import time\n",
    "\n",
    "# !docker run -d -p 3001:3001 axarev/parsr\n",
    "# time.sleep(30)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# !wget https://www.w3.org/WAI/WCAG21/working-examples/pdf-table/table.pdf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# from haystack.nodes import ParsrConverter\n",
    "\n",
    "# converter = ParsrConverter()\n",
    "\n",
    "# docs = converter.convert(\"table.pdf\")\n",
    "\n",
    "# tables = [doc for doc in docs if doc.content_type == \"table\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# print(tables)"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "name": "Tutorial15_TableQA.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  },
  "vscode": {
   "interpreter": {
    "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}