{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "e3wdiK5fGUoZ"
      },
      "source": [
        "# 💡 What's new in txtai 7.0\n",
        "\n",
        "txtai 7.0 brings a number of major feature enhancements. Highlights include:\n",
        "\n",
        "- Semantic Graph 2.0\n",
        "  - Graph search\n",
        "  - Advanced graph traversal\n",
        "  - Graph RAG\n",
        "\n",
        "- Embeddings\n",
        "  - Default configuration format now JSON\n",
        "  - Move ids storage outside of configuration when content is disabled\n",
        "\n",
        "- Pipelines\n",
        "  - Training support for LoRA / QLoRA\n",
        "\n",
        "- API\n",
        "  - Binary transport support\n",
        "\n",
        "These are just the big, high-level changes. There are also many improvements and bug fixes.\n",
        "\n",
        "This notebook will cover all the changes with examples.\n",
        "\n",
        "**Standard upgrade disclaimer below**\n",
        "\n",
        "While everything is backwards compatible, it's prudent to back up production indexes before upgrading and test before deploying."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "p8BbfjrhH-V2"
      },
      "source": [
        "# Install dependencies\n",
        "\n",
        "Install `txtai` and all dependencies."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "-OXsTQgaGQPM"
      },
      "outputs": [],
      "source": [
        "%%capture\n",
        "!pip install git+https://github.com/neuml/txtai#egg=txtai[api,graph,pipeline-train] datasets autoawq"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "n4EXrtcYIIYE"
      },
      "source": [
        "# Semantic Graph 2.0\n",
        "\n",
        "The biggest change, and the reason this is a major release, is the addition of a number of new graph-driven patterns. Let's jump right in.\n",
        "\n",
        "## Graph search\n",
        "\n",
        "The first feature we'll test out is running a search that returns results as a graph. With this change, not only do we get search results, we also see how these results relate to each other.\n",
        "\n",
        "We'll use a [prompt dataset on the Hugging Face Hub](https://huggingface.co/datasets/fka/awesome-chatgpt-prompts) for all examples."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "hzPD8_cQJNtN"
      },
      "outputs": [],
      "source": [
        "from datasets import load_dataset\n",
        "\n",
        "import txtai\n",
        "\n",
        "# Load dataset\n",
        "ds = load_dataset(\"fka/awesome-chatgpt-prompts\", split=\"train\")\n",
        "\n",
        "def stream():\n",
        "  for row in ds:\n",
        "    yield (row[\"act\"], f\"{row['act']} {row['prompt']}\")\n",
        "\n",
        "# Build embeddings index with content storage and a graph\n",
        "embeddings = txtai.Embeddings(content=True, graph={\"approximate\": False})\n",
        "embeddings.index(stream())\n",
        "\n",
        "# Run a search that returns the results as a graph\n",
        "graph = embeddings.search(\"Linux terminal\", 5, graph=True)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "1v4s_pT5k8ZX",
        "outputId": "adfc0161-a377-46af-bfa0-e157f0bdd64d"
      },
      "outputs": [
        {
          "data": {
            "image/png": "",
            "text/plain": [
              "<Figure size 1600x500 with 1 Axes>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "import matplotlib.pyplot as plt\n",
        "import networkx as nx\n",
        "\n",
        "def plot(graph):\n",
        "    labels = {x: f\"{graph.attribute(x, 'id')} ({x})\" for x in graph.scan()}\n",
        "    options = {\n",
        "        \"node_size\": 1500,\n",
        "        \"node_color\": \"#0277bd\",\n",
        "        \"edge_color\": \"#454545\",\n",
        "        \"font_color\": \"#fff\",\n",
        "        \"font_size\": 9,\n",
        "        \"alpha\": 1.0\n",
        "    }\n",
        "\n",
        "    fig, ax = plt.subplots(figsize=(16, 5))\n",
        "    pos = nx.spring_layout(graph.backend, seed=0, k=0.9, iterations=50)\n",
        "    nx.draw_networkx(graph.backend, pos=pos, labels=labels, **options)\n",
        "    ax.set_facecolor(\"#303030\")\n",
        "    ax.axis(\"off\")\n",
        "    fig.set_facecolor(\"#303030\")\n",
        "\n",
        "    plt.show()\n",
        "\n",
        "plot(graph)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "caDteWUVK5jI"
      },
      "source": [
        "We now see a graph of not only the search results but how they relate to each other!"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dPdgwyZhk8ZX"
      },
      "source": [
        "## Advanced graph traversal\n",
        "\n",
        "Before 7.0, the main way to traverse a graph was via the `showpath` method. This method finds the shortest path between two graph nodes. What if we want more control over how a path is traversed? Enter advanced graph traversal."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "B1KRPrcnk8ZX",
        "outputId": "2a2f7658-0f5b-4620-cc34-6b66e2496282"
      },
      "outputs": [
        {
          "data": {
            "image/png": "",
            "text/plain": [
              "<Figure size 1600x500 with 1 Axes>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "g = embeddings.graph.search(\"\"\"\n",
        "MATCH P=({id: \"Poet\"})-[*1..2]->({id: \"Rapper\"})\n",
        "RETURN P\n",
        "LIMIT 5\n",
        "\"\"\", graph=True)\n",
        "\n",
        "plot(g)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yJIHr_9zk8ZY"
      },
      "source": [
        "The query above finds the top 5 connections between a `Poet` and a `Rapper` using a graph query. Graph queries use the [openCypher](https://github.com/opencypher/openCypher) query standard."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MvlWX0yYk8ZY"
      },
      "source": [
        "## Graph RAG\n",
        "\n",
        "Graph path traversal opens up a different type of RAG process. A standard RAG process typically runs a single vector search query and returns the closest matches. Those matches are then passed into an LLM prompt, where they limit the context and help ensure more factually correct answers are generated. Graphs enable more complex analysis.\n",
        "\n",
        "We'll use the graph path from the previous example for a more complex RAG query."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "zC_yDuyKk8ZY",
        "outputId": "3f52332c-de63-4e94-b2ce-03c1c21fe597"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "The roles that are similar to both a rapper and poet are:\n",
            "\n",
            "1. Composer: A composer creates music for various forms of art, including songs and poems. They work with different instruments and tools to bring the lyrics to life, making the music harmonious and engaging.\n",
            "\n",
            "2. Novelist: A novelist creates captivating stories with engaging characters and plotlines. They can write in various genres, such as science fiction, romance, or historical fiction. The goal is to write a story that keeps readers engaged and entertained.\n",
            "\n",
            "3. Movie Critic: A movie critic evaluates and reviews movies, discussing aspects like plot, themes, acting, direction, and more. They aim to express their feelings about the movie and how it impacted them, while also providing constructive criticism.\n",
            "\n",
            "4. Motivational Speaker: A motivational speaker inspires and empowers their audience by sharing words of wisdom and encouragement. They can talk about various topics, but the goal is to make their audience feel motivated and inspired to achieve their goals.\n"
          ]
        }
      ],
      "source": [
        "from txtai import LLM\n",
        "\n",
        "llm = LLM(\"TheBloke/Mistral-7B-OpenOrca-AWQ\")\n",
        "\n",
        "def rag(question, text):\n",
        "    prompt = f\"\"\"<|im_start|>system\n",
        "    You are a friendly assistant. You answer questions from users.<|im_end|>\n",
        "    <|im_start|>user\n",
        "    Answer the following question using only the context below. Only include information specifically discussed.\n",
        "\n",
        "    question: {question}\n",
        "    context: {text} <|im_end|>\n",
        "    <|im_start|>assistant\n",
        "    \"\"\"\n",
        "\n",
        "    return llm(prompt, maxlength=4096)\n",
        "\n",
        "context = \"\\n\".join(g.attribute(node, \"text\") for node in list(g.scan()))\n",
        "\n",
        "print(rag(\"What roles are similar to both a rapper and poet?\", context))\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gb-7zteQk8ZZ"
      },
      "source": [
        "Let's compare these results with the results from a standard RAG query. We'll pull the 6 most similar rows so the context is the same size as the graph above."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "jYEUfWbFk8ZZ",
        "outputId": "f4bb6f16-f360-4991-d0ab-47eabebdfa0f"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "The roles most similar to both a rapper and poet are the Composer and the Song Recommender. Both roles involve creating music or recommending songs based on given lyrics or themes.\n"
          ]
        }
      ],
      "source": [
        "question = \"What roles are most similar to both a rapper and poet?\"\n",
        "context = \"\\n\".join(x[\"text\"] for x in embeddings.search(question, limit=6))\n",
        "print(rag(question, context))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "kQUDymlVk8ZZ"
      },
      "source": [
        "As we can see, the Graph RAG approach yields a more comprehensive answer. The standard RAG answer isn't bad; it's just not as complete."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TrMC9Seqk8ZZ"
      },
      "source": [
        "# Embeddings\n",
        "\n",
        "There are a couple of backwards-compatible changes to the embeddings database format. The default configuration format moving forward is `json`. While `pickle` configuration is still supported, txtai is moving towards a readable configuration format. This maximizes compatibility with the Hugging Face Hub when uploading models. The `pickle` format is generally not recommended when sharing indexes."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "_N1IwubTk8ZZ",
        "outputId": "2f91d501-6a70-4ff2-abf1-a39c8f5a78dd"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{\n",
            "  \"autoid\": 2,\n",
            "  \"backend\": \"faiss\",\n",
            "  \"build\": {\n",
            "    \"create\": \"2024-02-21T16:23:26Z\",\n",
            "    \"python\": \"3.8.18\",\n",
            "    \"settings\": {\n",
            "      \"components\": \"IDMap,Flat\"\n",
            "    },\n",
            "    \"system\": \"Linux (x86_64)\",\n",
            "    \"txtai\": \"7.0.0\"\n",
            "  },\n",
            "  \"dimensions\": 384,\n",
            "  \"offset\": 2,\n",
            "  \"path\": \"sentence-transformers/all-MiniLM-L6-v2\",\n",
            "  \"update\": \"2024-02-21T16:23:26Z\"\n",
            "}\n",
            "ID List: [0, 1]\n"
          ]
        }
      ],
      "source": [
        "import json\n",
        "import pickle\n",
        "\n",
        "from txtai import Embeddings\n",
        "\n",
        "# Create a default index\n",
        "embeddings = Embeddings()\n",
        "embeddings.index([\"test1\", \"test2\"])\n",
        "embeddings.save(\"index\")\n",
        "\n",
        "# Read standard configuration\n",
        "with open(\"index/config.json\") as f:\n",
        "    print(json.dumps(json.load(f), sort_keys=True, default=str, indent=2))\n",
        "\n",
        "# Read ids\n",
        "with open(\"index/ids\", \"rb\") as f:\n",
        "    print(\"ID List:\", pickle.load(f))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "a1X1_fHdk8ZZ"
      },
      "source": [
        "When no configuration is specified, notice that a `config.json` file is created along with an `ids` file. Ids are no longer stored within the configuration, for both `json` and `pickle` configuration. When an existing index is loaded, the ids are read automatically and moved out of the configuration when a new version is saved."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "mGpwVhuwk8Za"
      },
      "source": [
        "# LoRA / QLoRA support\n",
        "\n",
        "Two new parameters have been added to the `HFTrainer` pipeline, `lora` and `quantize`. When both of those are enabled, models are trained using QLoRA. Custom settings are also supported."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "isP662sdk8Za",
        "outputId": "e72b4f64-671d-42db-b388-c8a702d9fc04",
        "colab": {
          "referenced_widgets": [
            "b6e77e11948f42429b587fc0cb07faa5"
          ]
        }
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "trainable params: 8,355,840 || all params: 470,041,600 || trainable%: 1.7776809541963945\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "b6e77e11948f42429b587fc0cb07faa5",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "  0%|          | 0/3 [00:00<?, ?it/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'train_runtime': 0.3832, 'train_samples_per_second': 7.829, 'train_steps_per_second': 7.829, 'train_loss': 9.008923212687174, 'epoch': 3.0}\n"
          ]
        }
      ],
      "source": [
        "from txtai.pipeline import HFTrainer\n",
        "\n",
        "trainer = HFTrainer()\n",
        "model, _ = trainer(\n",
        "    \"ahxt/LiteLlama-460M-1T\",\n",
        "    [{\"label\": 0, \"text\": \"sample text\"}],\n",
        "    maxlength=16,\n",
        "    task=\"language-generation\",\n",
        "    quantize=True,\n",
        "    lora=True,\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "myQTbsHsk8Za"
      },
      "source": [
        "# Binary transport support\n",
        "\n",
        "The API now supports reading and writing binary content. These changes will be pushed to the API clients in a future release. They include:\n",
        "\n",
        "- Images and media content\n",
        "- Encoding binary JSON and using MessagePack"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "TkC2pJIvk8Za",
        "outputId": "6d4ef215-c914-4edd-b819-c3ec802db9d9"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Overwriting index.yml\n"
          ]
        }
      ],
      "source": [
        "%%writefile index.yml\n",
        "\n",
        "embeddings:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "xmoseDnbk8Za"
      },
      "outputs": [],
      "source": [
        "!CONFIG=index.yml nohup uvicorn \"txtai.api:app\" &> api.log &\n",
        "!sleep 90"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "F6IdEWsIk8Za",
        "outputId": "2dba53b1-c81a-418e-b2f6-411e3445a03e"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "b'0'\n",
            "b'\\x00'\n"
          ]
        }
      ],
      "source": [
        "import requests\n",
        "\n",
        "requests.post(\"http://localhost:8000/add\", json=[\"test\"])\n",
        "requests.get(\"http://localhost:8000/index\")\n",
        "\n",
        "print(requests.get(\"http://localhost:8000/count\", headers={\"Accept\": \"application/json\"}).content)\n",
        "print(requests.get(\"http://localhost:8000/count\", headers={\"Accept\": \"application/msgpack\"}).content)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Njc8wa09k8Zb"
      },
      "source": [
        "Notice the subtle but important difference between the two outputs. The first response is the character `0`, encoded as JSON. The second response is the single byte `\\x00`, which MessagePack decodes as the integer `0`. See below."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "w_KkrUEBk8Zb",
        "outputId": "dc3f180c-808a-4752-f8f1-bc7ca3158fc9"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "0\n"
          ]
        }
      ],
      "source": [
        "import msgpack\n",
        "print(msgpack.loads(requests.get(\"http://localhost:8000/count\", headers={\"Accept\": \"application/msgpack\"}).content))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tvnMO1Eai6Gy"
      },
      "source": [
        "# Wrapping up\n",
        "\n",
        "This notebook gave a quick overview of txtai 7.0. Updated documentation and more examples will be forthcoming. There is much to cover and much to build on!\n",
        "\n",
        "See the following links for more information.\n",
        "\n",
        "- [7.0 Release on GitHub](https://github.com/neuml/txtai/releases/tag/v7.0.0)\n",
        "- [Documentation site](https://neuml.github.io/txtai)"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "gpuType": "T4",
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8.18"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}