{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4Pjmz-RORV8E"
      },
      "source": [
        "# Run pipeline workflows\n",
        "\n",
        "txtai has a growing list of models available through its pipeline framework. Pipelines wrap a machine learning model and transform data. Currently, pipelines can wrap Hugging Face models, Hugging Face pipelines or PyTorch models (support for TensorFlow is in the backlog).\n",
        "\n",
        "The following is a list of the currently implemented pipelines.\n",
        "\n",
        "* **Questions** - Answer questions using a text context\n",
        "* **Labels** - Apply labels to text using a zero-shot classification model. Also supports similarity comparisons.\n",
        "* **Summary** - Abstractive text summarization\n",
        "* **Textractor** - Extract text from documents\n",
        "* **Transcription** - Transcribe audio to text\n",
        "* **Translation** - Machine translation\n",
        "\n",
        "Pipelines are great and make using a variety of machine learning models easier. But what if we want to glue the results of different pipelines together? For example, extract text, summarize it, translate it to English and load it into an embeddings index. That would require code to join those operations together in an efficient manner.\n",
        "\n",
        "Enter workflows. Workflows are a simple yet powerful construct that takes a callable and returns elements. Workflows don't know they are working with pipelines but enable efficient processing of pipeline data. Workflows are streaming by nature and work on data in batches, allowing large volumes of data to be processed efficiently."
      ]
    },
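    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The streaming, batch-oriented behavior described above can be sketched in plain Python. This is only an illustration of the idea, not txtai's implementation: `run_in_batches` and its `batch_size` parameter are hypothetical names introduced for this example."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "from itertools import islice\n",
        "\n",
        "def run_in_batches(action, elements, batch_size=2):\n",
        "    \"\"\"Hypothetical helper (not part of txtai): apply a callable one batch at a time.\"\"\"\n",
        "    iterator = iter(elements)\n",
        "    while True:\n",
        "        batch = list(islice(iterator, batch_size))\n",
        "        if not batch:\n",
        "            break\n",
        "\n",
        "        # The callable receives a whole batch and results are yielded one at a time\n",
        "        yield from action(batch)\n",
        "\n",
        "# Any callable that maps a list to a list can serve as the action\n",
        "list(run_in_batches(lambda batch: [x.upper() for x in batch], [\"a\", \"b\", \"c\"]))"
      ],
      "execution_count": null,
      "outputs": []
    },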
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Dk31rbYjSTYm"
      },
      "source": [
        "# Install dependencies\n",
        "\n",
        "Install `txtai` and all dependencies. Since this notebook uses optional pipelines and workflows, we need to install the pipeline and workflow extras packages."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "XMQuuun2R06J"
      },
      "source": [
        "%%capture\n",
        "!pip install git+https://github.com/neuml/txtai#egg=txtai[pipeline,workflow] sacremoses\n",
        "\n",
        "# Get test data\n",
        "!wget -N https://github.com/neuml/txtai/releases/download/v2.0.0/tests.tar.gz\n",
        "!tar -xvzf tests.tar.gz"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "I1dNQE7WT4kE"
      },
      "source": [
        "# Create a series of pipelines to use in this notebook"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "w4YqwBJaT4QD"
      },
      "source": [
        "%%capture\n",
        "from txtai.pipeline import Summary, Textractor, Transcription, Translation\n",
        "\n",
        "# Summary instance\n",
        "summary = Summary()\n",
        "\n",
        "# Text extraction\n",
        "textractor = Textractor()\n",
        "\n",
        "# Transcription instance\n",
        "transcribe = Transcription(\"facebook/wav2vec2-large-960h\")\n",
        "\n",
        "# Create a translation instance\n",
        "translate = Translation()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "PNPJ95cdTKSS"
      },
      "source": [
        "# Basic workflow\n",
        "\n",
        "The following shows a basic workflow in action!"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "nTDwXOUeTH2-",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "906d4354-cf29-4593-a790-8c175d981dee"
      },
      "source": [
        "from txtai.workflow import Workflow, Task\n",
        "\n",
        "# Workflow that translates text to French\n",
        "workflow = Workflow([Task(lambda x: translate(x, \"fr\"))])\n",
        "\n",
        "# Data to run through the workflow\n",
        "data = [\"The sky is blue\", \"Forest through the trees\"]\n",
        "\n",
        "# Workflows are generators for efficiency, read results to list for display\n",
        "list(workflow(data))"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "['Le ciel est bleu', 'Forêt à travers les arbres']"
            ]
          },
          "metadata": {},
          "execution_count": 13
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wicr0CAYRWZ0"
      },
      "source": [
        "This isn't too different from previous pipeline examples. The only difference is that data is fed through the workflow. In this example, the workflow calls the translation pipeline and translates text to French. Let's look at a more complex example."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0EeD8m6FR5cH"
      },
      "source": [
        "# Multistep workflow\n",
        "\n",
        "The following workflow reads a series of audio files, transcribes them to text and translates the text to French. This is based on the classic txtai example from [Introducing txtai](https://colab.research.google.com/github/neuml/txtai/blob/master/examples/01_Introducing_txtai.ipynb).\n",
        "\n",
        "Workflow tasks take two main parameters: the action to execute, which is a callable, and a pattern to filter data with. Data accepted by the filter is processed by the task. Everything else is passed through unchanged to the next task."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "OF2G5-OiSBzy",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "e5c74089-1916-4bbd-93d3-9e25b1fe4ee5"
      },
      "source": [
        "from txtai.workflow import FileTask\n",
        "\n",
        "tasks = [\n",
        "    FileTask(transcribe, r\"\\.wav$\"),\n",
        "    Task(lambda x: translate(x, \"fr\"))\n",
        "]\n",
        "\n",
        "# List of files to process\n",
        "data = [\n",
        "  \"txtai/US_tops_5_million.wav\",\n",
        "  \"txtai/Canadas_last_fully.wav\",\n",
        "  \"txtai/Beijing_mobilises.wav\",\n",
        "  \"txtai/The_National_Park.wav\",\n",
        "  \"txtai/Maine_man_wins_1_mil.wav\",\n",
        "  \"txtai/Make_huge_profits.wav\"\n",
        "]\n",
        "\n",
        "# Workflow that transcribes audio files and translates the text to French\n",
        "workflow = Workflow(tasks)\n",
        "\n",
        "# Run workflow\n",
        "list(workflow(data))"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[\"Les cas de virus U sont en tête d'un million\",\n",
              " \"La dernière plate-forme de glace entièrement intacte du Canada s'est soudainement effondrée en formant un berge de glace de taille manhatten\",\n",
              " \"Bagage mobilise les embarcations d'invasion le long des côtes à mesure que les tensions tiwaniennes s'intensifient\",\n",
              " \"Le service des parcs nationaux met en garde contre le sacrifice d'amis plus lents dans une attaque nue\",\n",
              " \"L'homme principal gagne du billet de loterie\",\n",
              " \"Faire d'énormes profits sans travailler faire jusqu'à cent mille dollars par jour\"]"
            ]
          },
          "metadata": {},
          "execution_count": 14
        }
      ]
    },
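    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The filter pattern can be illustrated without running any models. The sketch below assumes the pattern is applied to each element with `re.search` semantics, which is an assumption about the matching rather than a statement of txtai's internals. `split_by_pattern` is a hypothetical helper and not part of txtai."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import re\n",
        "\n",
        "def split_by_pattern(elements, pattern):\n",
        "    \"\"\"Hypothetical helper (not part of txtai): split elements into matched and unmatched.\"\"\"\n",
        "    accepted = [x for x in elements if re.search(pattern, x)]\n",
        "    passed = [x for x in elements if not re.search(pattern, x)]\n",
        "    return accepted, passed\n",
        "\n",
        "# Only the .wav path matches the pattern, the other elements would be passed through\n",
        "split_by_pattern([\"txtai/article.pdf\", \"txtai/US_tops_5_million.wav\", \"The sky is blue\"], r\"\\.wav$\")"
      ],
      "execution_count": null,
      "outputs": []
    },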
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "PN08rnrQU1hx"
      },
      "source": [
        "# Complex workflow\n",
        "\n",
        "Let's put this all together into a full-fledged workflow to build an embeddings index. This workflow will work with both documents and audio files. Documents will have text extracted and summarized. Audio files will be transcribed. Both results will be joined, translated into French and loaded into an Embeddings index."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "coZJw_1yU1Sq",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "213b34d5-157f-4548-8788-ac29cb4039dd"
      },
      "source": [
        "from txtai.embeddings import Embeddings, Documents\n",
        "from txtai.workflow import FileTask, WorkflowTask\n",
        "\n",
        "# Embeddings index\n",
        "embeddings = Embeddings({\"path\": \"sentence-transformers/paraphrase-multilingual-mpnet-base-v2\", \"content\": True})\n",
        "documents = Documents()\n",
        "\n",
        "# List of files to process\n",
        "files = [\n",
        "  \"txtai/article.pdf\",\n",
        "  \"txtai/US_tops_5_million.wav\",\n",
        "  \"txtai/Canadas_last_fully.wav\",\n",
        "  \"txtai/Beijing_mobilises.wav\",\n",
        "  \"txtai/The_National_Park.wav\",\n",
        "  \"txtai/Maine_man_wins_1_mil.wav\",\n",
        "  \"txtai/Make_huge_profits.wav\"\n",
        "]\n",
        "\n",
        "# Build (id, data, tags) tuples for indexing\n",
        "data = [(x, element, None) for x, element in enumerate(files)]\n",
        "\n",
        "# Workflow that extracts text and builds a summary\n",
        "articles = Workflow([\n",
        "    FileTask(textractor),\n",
        "    Task(summary)\n",
        "])\n",
        "\n",
        "# Define workflow tasks. Workflows can also be tasks!\n",
        "tasks = [\n",
        "    WorkflowTask(articles, r\".\\.pdf$\"),\n",
        "    FileTask(transcribe, r\"\\.wav$\"),\n",
        "    Task(lambda x: translate(x, \"fr\")),\n",
        "    Task(documents.add, unpack=False)\n",
        "]\n",
        "\n",
        "# Workflow that processes documents and audio, translates text to French and queues it for indexing\n",
        "workflow = Workflow(tasks)\n",
        "\n",
        "# Run workflow and show results to be indexed\n",
        "for x in workflow(data):\n",
        "  print(x)\n",
        "\n",
        "# Build the embeddings index\n",
        "embeddings.index(documents)\n",
        "\n",
        "# Cleanup temporary storage\n",
        "documents.close()"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "(0, \"Txtai, un moteur de recherche alimenté par l'IA construit sur Transformers, permet la recherche basée sur la compréhension du langage naturel (NLU) dans n'importe quelle application. Le champ de traitement du langage naturel (NLP) évolue rapidement avec un certain nombre de nouveaux développements. Le moteur de recherche open-source est open source et disponible sur GitHub.\", None)\n",
            "(1, \"Les cas de virus U sont en tête d'un million\", None)\n",
            "(2, \"La dernière plate-forme de glace entièrement intacte du Canada s'est soudainement effondrée en formant un berge de glace de taille manhatten\", None)\n",
            "(3, \"Bagage mobilise les embarcations d'invasion le long des côtes à mesure que les tensions tiwaniennes s'intensifient\", None)\n",
            "(4, \"Le service des parcs nationaux met en garde contre le sacrifice d'amis plus lents dans une attaque nue\", None)\n",
            "(5, \"L'homme principal gagne du billet de loterie\", None)\n",
            "(6, \"Faire d'énormes profits sans travailler faire jusqu'à cent mille dollars par jour\", None)\n"
          ]
        }
      ]
    },
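    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The `(id, data, tags)` tuples built above keep their ids while the text changes at each step. The following is a minimal plain-Python sketch of that idea, assuming each action only receives the data elements. `apply_to_rows` is a hypothetical helper introduced for illustration and is not a txtai function."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "def apply_to_rows(action, rows):\n",
        "    \"\"\"Hypothetical helper (not part of txtai): transform the data column, keep ids and tags.\"\"\"\n",
        "    ids, values, tags = zip(*rows)\n",
        "\n",
        "    # The action only sees the data elements\n",
        "    results = action(list(values))\n",
        "\n",
        "    # Re-attach the original ids and tags to the transformed data\n",
        "    return list(zip(ids, results, tags))\n",
        "\n",
        "rows = [(x, element, None) for x, element in enumerate([\"first text\", \"second text\"])]\n",
        "apply_to_rows(lambda values: [v.title() for v in values], rows)"
      ],
      "execution_count": null,
      "outputs": []
    },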
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "n6i-xhJya8o4"
      },
      "source": [
        "# Query for results in French"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "cHbjivUOaUGu",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "0da8d8cb-dac6-4cad-ef00-a096b44533cf"
      },
      "source": [
        "# Run a search query and show the result.\n",
        "embeddings.search(\"changement climatique\", 1)[0]"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "{'id': '2',\n",
              " 'score': 0.2982647716999054,\n",
              " 'text': \"La dernière plate-forme de glace entièrement intacte du Canada s'est soudainement effondrée en formant un berge de glace de taille manhatten\"}"
            ]
          },
          "metadata": {},
          "execution_count": 16
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "aNerHvNpaxD4",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "f3792220-4518-4388-c7e7-c38f38f19b20"
      },
      "source": [
        "# Run a search query and show the result.\n",
        "embeddings.search(\"traitement du langage naturel\", 1)[0]"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "{'id': '0',\n",
              " 'score': 0.47031939029693604,\n",
              " 'text': \"Txtai, un moteur de recherche alimenté par l'IA construit sur Transformers, permet la recherche basée sur la compréhension du langage naturel (NLU) dans n'importe quelle application. Le champ de traitement du langage naturel (NLP) évolue rapidement avec un certain nombre de nouveaux développements. Le moteur de recherche open-source est open source et disponible sur GitHub.\"}"
            ]
          },
          "metadata": {},
          "execution_count": 17
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Configuration-driven workflow\n",
        "\n",
        "Workflows can also be defined with YAML and run as an application. Applications can run standalone or as a FastAPI instance. More information can be [found here](https://neuml.github.io/txtai/api/)."
      ],
      "metadata": {
        "id": "Sz_f9qoOMC_m"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "workflow = \"\"\"\n",
        "writable: true\n",
        "embeddings:\n",
        "  path: sentence-transformers/paraphrase-multilingual-mpnet-base-v2\n",
        "  content: True\n",
        "\n",
        "# Summarize text\n",
        "summary:\n",
        "\n",
        "# Extract text from documents\n",
        "textractor:\n",
        "\n",
        "# Transcribe audio to text\n",
        "transcription:\n",
        "  path: facebook/wav2vec2-large-960h\n",
        "\n",
        "# Translate text between languages\n",
        "translation:\n",
        "\n",
        "workflow:\n",
        "  summarize:\n",
        "    tasks:\n",
        "      - action: textractor\n",
        "        task: file\n",
        "      - summary\n",
        "  index:\n",
        "    tasks:\n",
        "      - action: summarize\n",
        "        select: '\\\\.pdf$'\n",
        "      - action: transcription\n",
        "        select: '\\\\.wav$'\n",
        "        task: file\n",
        "      - action: translation\n",
        "        args: ['fr']\n",
        "      - action: index\n",
        "\"\"\"\n",
        "\n",
        "from txtai.app import Application\n",
        "\n",
        "# Create and run the workflow\n",
        "app = Application(workflow)\n",
        "list(app.workflow(\"index\", files))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "HoVlk_vNJKHY",
        "outputId": "34b68bcb-a6d5-4029-9bf2-f33e4381d1bc"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[\"Txtai, un moteur de recherche alimenté par l'IA construit sur Transformers, permet la recherche basée sur la compréhension du langage naturel (NLU) dans n'importe quelle application. Le champ de traitement du langage naturel (NLP) évolue rapidement avec un certain nombre de nouveaux développements. Le moteur de recherche open-source est open source et disponible sur GitHub.\",\n",
              " \"Les cas de virus U sont en tête d'un million\",\n",
              " \"La dernière plate-forme de glace entièrement intacte du Canada s'est soudainement effondrée en formant un berge de glace de taille manhatten\",\n",
              " \"Bagage mobilise les embarcations d'invasion le long des côtes à mesure que les tensions tiwaniennes s'intensifient\",\n",
              " \"Le service des parcs nationaux met en garde contre le sacrifice d'amis plus lents dans une attaque nue\",\n",
              " \"L'homme principal gagne du billet de loterie\",\n",
              " \"Faire d'énormes profits sans travailler faire jusqu'à cent mille dollars par jour\"]"
            ]
          },
          "metadata": {},
          "execution_count": 18
        }
      ]
    },
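    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To see how a YAML workflow definition maps onto tasks, a configuration can be parsed with PyYAML and inspected. This sketch assumes PyYAML is importable in the environment and uses a shortened copy of the configuration above. `sketch_config` is a name introduced only for this example, and the loop shows the structure of the YAML, not how txtai itself builds tasks."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import yaml\n",
        "\n",
        "# Shortened copy of the configuration above, for illustration only\n",
        "sketch_config = yaml.safe_load(\"\"\"\n",
        "workflow:\n",
        "  summarize:\n",
        "    tasks:\n",
        "      - action: textractor\n",
        "        task: file\n",
        "      - summary\n",
        "\"\"\")\n",
        "\n",
        "# Each task is either a bare action name or a mapping with an action key\n",
        "for name, definition in sketch_config[\"workflow\"].items():\n",
        "    for task in definition[\"tasks\"]:\n",
        "        print(name, task if isinstance(task, str) else task[\"action\"])"
      ],
      "execution_count": null,
      "outputs": []
    },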
    {
      "cell_type": "code",
      "source": [
        "# Run a search query and show the result.\n",
        "app.search(\"changement climatique\", 1)[0]"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "a_klVZAXHJcw",
        "outputId": "33229268-0f98-4ca1-af7d-212bcbde6482"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "{'id': '2',\n",
              " 'score': 0.2982647716999054,\n",
              " 'text': \"La dernière plate-forme de glace entièrement intacte du Canada s'est soudainement effondrée en formant un berge de glace de taille manhatten\"}"
            ]
          },
          "metadata": {},
          "execution_count": 19
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Run a search query and show the result.\n",
        "app.search(\"traitement du langage naturel\", 1)[0]"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "I5xin0VNHJOu",
        "outputId": "2fbb9a93-b860-437d-c361-ee21eed75b6b"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "{'id': '0',\n",
              " 'score': 0.47031939029693604,\n",
              " 'text': \"Txtai, un moteur de recherche alimenté par l'IA construit sur Transformers, permet la recherche basée sur la compréhension du langage naturel (NLU) dans n'importe quelle application. Le champ de traitement du langage naturel (NLP) évolue rapidement avec un certain nombre de nouveaux développements. Le moteur de recherche open-source est open source et disponible sur GitHub.\"}"
            ]
          },
          "metadata": {},
          "execution_count": 20
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7zG4AimucFJs"
      },
      "source": [
        "# Wrapping up\n",
        "\n",
        "Results are good! We can see the power of workflows and how they can join a series of pipelines together in an efficient manner. Workflows can work with any callable, not just pipelines. They transform data from one format to another. Workflows are an exciting and promising development for txtai."
      ]
    }
  ]
}