{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3",
      "language": "python"
    },
    "language_info": {
      "name": "python",
      "version": "3.7.6",
      "mimetype": "text/x-python",
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "pygments_lexer": "ipython3",
      "nbconvert_exporter": "python",
      "file_extension": ".py"
    },
    "colab": {
      "name": "34 - Build a QA database",
      "provenance": [],
      "collapsed_sections": []
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "POWZoSJR6XzK"
      },
      "source": [
        "# Build a QA database\n",
        "\n",
        "Conversational AI is a growing field that could automate much of the customer service industry. Full automation is still some way off (most of us have been on a call with an automated agent and just wanted to reach a person), but it can certainly be a solid first line before human intervention.\n",
        "\n",
        "This notebook presents a process for answering user questions with a txtai embeddings instance. It's not conversational AI; instead, it finds the closest existing question to a user question. This is useful when there is a list of frequently asked questions."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qa_PPKVX6XzN"
      },
      "source": [
        "# Install dependencies\n",
        "\n",
        "Install `txtai` and all dependencies."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5",
        "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
        "trusted": true,
        "_kg_hide-output": true,
        "id": "24q-1n5i6XzQ"
      },
      "source": [
        "%%capture\n",
        "!pip install git+https://github.com/neuml/txtai datasets"
      ],
      "execution_count": 75,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Load the dataset\n",
        "\n",
        "We'll use a Hugging Face dataset of web questions for this example. The dataset has a list of questions and answers. The code below loads the dataset and prints the first few examples to show how the data is formatted."
      ],
      "metadata": {
        "id": "QAEVXfOiVzEB"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from datasets import load_dataset\n",
        "\n",
        "ds = load_dataset(\"web_questions\", split=\"train\")\n",
        "\n",
        "for row in ds.select(range(5)):\n",
        "  print(row[\"question\"], row[\"answers\"])"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "koM4vYHXL82P",
        "outputId": "91926a1b-8b9c-46be-f450-f76a1e3aab82"
      },
      "execution_count": 76,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "what is the name of justin bieber brother? ['Jazmyn Bieber', 'Jaxon Bieber']\n",
            "what character did natalie portman play in star wars? ['Padmé Amidala']\n",
            "what state does selena gomez? ['New York City']\n",
            "what country is the grand bahama island in? ['Bahamas']\n",
            "what kind of money to take to bahamas? ['Bahamian dollar']\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Create index\n",
        "\n",
        "Next, we'll create a txtai index. The question will be the indexed text. We'll also store full content so we can access the answer at query time.\n"
      ],
      "metadata": {
        "id": "0p3WCDniUths"
      }
    },
    {
      "cell_type": "code",
      "metadata": {
        "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a",
        "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0",
        "trusted": true,
        "id": "2j_CFGDR6Xzp"
      },
      "source": [
        "from txtai.embeddings import Embeddings\n",
        "\n",
        "# Create embeddings index with content enabled. The default behavior is to only store indexed vectors.\n",
        "embeddings = Embeddings({\"path\": \"sentence-transformers/nli-mpnet-base-v2\", \"content\": True})\n",
        "\n",
        "# Index each question as the searchable text and store its url and comma-joined answers as content\n",
        "embeddings.index([(uid, {\"url\": row[\"url\"], \"text\": row[\"question\"], \"answer\": \", \".join(row[\"answers\"])}, None) for uid, row in enumerate(ds)])"
      ],
      "execution_count": 77,
      "outputs": []
    },
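    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a quick sanity check (a minimal sketch, not part of the original notebook), the number of indexed rows can be compared with the dataset size. This assumes the `embeddings` and `ds` objects created above."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# Each dataset row was indexed under its own id, so the two counts should match\n",
        "print(embeddings.count(), len(ds))"
      ],
      "execution_count": null,
      "outputs": []
    },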
    {
      "cell_type": "markdown",
      "source": [
        "# Asking questions\n",
        "\n",
        "Now that the index is built, let's ask some questions! We'll use txtai SQL to select the fields we want to return.\n",
        "\n",
        "Each cell below shows the question asked and the best-matching question-answer pair."
      ],
      "metadata": {
        "id": "dqx-PmpzWMez"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def question(text):\n",
        "  # Return the single best-matching stored question with its answer and similarity score.\n",
        "  # The text is interpolated directly into the SQL string, so it must not contain single quotes.\n",
        "  return embeddings.search(f\"select text, answer, score from txtai where similar('{text}') limit 1\")\n",
        "\n",
        "question(\"What is the timezone of NYC?\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Gatp4mDQPA_v",
        "outputId": "93efaec7-c41e-4865-f042-371c55170ffe"
      },
      "execution_count": 78,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'answer': 'North American Eastern Time Zone',\n",
              "  'score': 0.8904051184654236,\n",
              "  'text': 'what time zone is new york under?'}]"
            ]
          },
          "metadata": {},
          "execution_count": 78
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "question(\"Things to do in New York\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "WUvdAvvlSPDT",
        "outputId": "00ed87fe-b659-406e-d03a-97b7b5ea9773"
      },
      "execution_count": 79,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'answer': \"Chelsea Art Museum, Brooklyn Bridge, Empire State Building, The Broadway Theatre, American Museum of Natural History, Central Park, St. Patrick's Cathedral, Japan Society of New York, FusionArts Museum, American Folk Art Museum\",\n",
              "  'score': 0.8308358192443848,\n",
              "  'text': 'what are some places to visit in new york?'}]"
            ]
          },
          "metadata": {},
          "execution_count": 79
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "question(\"Microsoft founder\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "vvsSj2qKRrrY",
        "outputId": "1c4a94d7-3bf6-4cae-8b40-13fbfef35205"
      },
      "execution_count": 80,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'answer': 'Bill Gates',\n",
              "  'score': 0.6617322564125061,\n",
              "  'text': 'who created microsoft windows?'}]"
            ]
          },
          "metadata": {},
          "execution_count": 80
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "question(\"Apple founder university\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "UUD56XSRR1jc",
        "outputId": "fb94e89d-1580-471b-9e0e-541e5208d937"
      },
      "execution_count": 81,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'answer': 'Reed College',\n",
              "  'score': 0.5137897729873657,\n",
              "  'text': 'what college did steve jobs attend?'}]"
            ]
          },
          "metadata": {},
          "execution_count": 81
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "question(\"What country uses the Yen?\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "D6Ur0wZWSBfd",
        "outputId": "caf6e0aa-af62-4f33-fed1-6e1a111fed1a"
      },
      "execution_count": 82,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'answer': 'Japanese yen',\n",
              "  'score': 0.6663530468940735,\n",
              "  'text': 'what money do japanese use?'}]"
            ]
          },
          "metadata": {},
          "execution_count": 82
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "question(\"Show me a list of Pixar movies\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "2POCUWrqSKGP",
        "outputId": "f8069bf4-a135-47df-a571-198f168c03fb"
      },
      "execution_count": 83,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'answer': \"A Bug's Life, Toy Story 2, Ratatouille, Cars, Up, Toy Story, Monsters, Inc., The Incredibles, Finding Nemo, WALL-E\",\n",
              "  'score': 0.653051495552063,\n",
              "  'text': 'what does pixar produce?'}]"
            ]
          },
          "metadata": {},
          "execution_count": 83
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "question(\"What is the timezone of Florida?\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "xll4a1ChTaVg",
        "outputId": "00cb967f-753e-4d15-cb87-308c08dfde59"
      },
      "execution_count": 84,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'answer': 'North American Eastern Time Zone',\n",
              "  'score': 0.9672279357910156,\n",
              "  'text': 'where is the time zone in florida?'}]"
            ]
          },
          "metadata": {},
          "execution_count": 84
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "question(\"Tell me an animal found offshore in Florida\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "4EyAOhWXUcKe",
        "outputId": "ef3f83b1-9314-4458-843e-16afb3364cd9"
      },
      "execution_count": 85,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'answer': 'Largemouth bass',\n",
              "  'score': 0.6526554822921753,\n",
              "  'text': 'what kind of fish do you catch in florida?'}]"
            ]
          },
          "metadata": {},
          "execution_count": 85
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Not too bad! This database has only a few thousand question-answer pairs. To improve quality, a score filter could be added to the query so that only highly confident answers are returned. Even so, this gives an idea of what is possible."
      ],
      "metadata": {
        "id": "KFxsjtsnWgpe"
      }
    },
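    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The score filter mentioned above could look something like the sketch below. The `confident` helper and the `0.7` threshold are illustrative assumptions, not part of the original notebook; the threshold should be tuned against real queries. The sketch relies on txtai SQL accepting a `score` condition next to `similar`."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "def confident(text, minscore=0.7):\n",
        "  # Hypothetical helper: same query as question(), plus a minimum score condition (assumed threshold)\n",
        "  return embeddings.search(f\"select text, answer, score from txtai where similar('{text}') and score >= {minscore} limit 1\")\n",
        "\n",
        "# Returns an empty list when no stored question meets the threshold\n",
        "confident(\"What is the timezone of Florida?\")"
      ],
      "execution_count": null,
      "outputs": []
    },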
    {
      "cell_type": "markdown",
      "source": [
        "# Run as an application\n",
        "\n",
        "The same index can also be run as a txtai application. See below."
      ],
      "metadata": {
        "id": "2x9awoKNZfZg"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from txtai.app import Application\n",
        "\n",
        "# Save index\n",
        "embeddings.save(\"questions.tar.gz\")\n",
        "\n",
        "# Build application that loads the saved index\n",
        "app = Application(\"path: questions.tar.gz\")\n",
        "\n",
        "# Run search query and take the top result\n",
        "app.search(\"select text, answer, score from txtai where similar('Tell me an animal found offshore in Florida') limit 1\")[0]"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "0lH9cf1bZokt",
        "outputId": "77afcb03-c834-46c9-98f2-b1a3c34c0e3b"
      },
      "execution_count": 86,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "{'answer': 'Largemouth bass',\n",
              " 'score': 0.6526554822921753,\n",
              " 'text': 'what kind of fish do you catch in florida?'}"
            ]
          },
          "metadata": {},
          "execution_count": 86
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aDIF3tYt6X0O"
      },
      "source": [
        "# Wrapping up\n",
        "\n",
        "This notebook introduced a simple question matching service. This could be the foundation of an automated customer service agent and/or an online FAQ.\n",
        "\n",
        "For a full example, see [codequestion](https://github.com/neuml/codequestion), which is an application that matches user questions to Stack Overflow question-answer pairs."
      ]
    }
  ]
}