{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3",
      "language": "python"
    },
    "language_info": {
      "name": "python",
      "version": "3.7.6",
      "mimetype": "text/x-python",
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "pygments_lexer": "ipython3",
      "nbconvert_exporter": "python",
      "file_extension": ".py"
    },
    "colab": {
      "provenance": []
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "POWZoSJR6XzK"
      },
      "source": [
        "# Embeddings SQL custom functions\n",
        "\n",
        "txtai 4.0 added support for SQL-based embeddings queries. This feature combines natural language similarity queries with concrete filtering rules. txtai now also supports user-defined SQL functions, making this feature even more powerful."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qa_PPKVX6XzN"
      },
      "source": [
        "# Install dependencies\n",
        "\n",
        "Install `txtai` and all dependencies."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5",
        "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
        "trusted": true,
        "_kg_hide-output": true,
        "id": "24q-1n5i6XzQ"
      },
      "source": [
        "%%capture\n",
        "!pip install git+https://github.com/neuml/txtai#egg=txtai[pipeline]"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Create index\n",
        "\n",
        "Let's first recap how to create an index. We'll use the classic txtai example."
      ],
      "metadata": {
        "id": "0p3WCDniUths"
      }
    },
    {
      "cell_type": "code",
      "metadata": {
        "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a",
        "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0",
        "trusted": true,
        "id": "2j_CFGDR6Xzp",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "f2488a78-6cae-4c25-985e-fb2dd674a534"
      },
      "source": [
        "from txtai.embeddings import Embeddings\n",
        "\n",
        "data = [\"US tops 5 million confirmed virus cases\",\n",
        "        \"Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg\",\n",
        "        \"Beijing mobilises invasion craft along coast as Taiwan tensions escalate\",\n",
        "        \"The National Park Service warns against sacrificing slower friends in a bear attack\",\n",
        "        \"Maine man wins $1M from $25 lottery ticket\",\n",
        "        \"Make huge profits without work, earn up to $100,000 a day\"]\n",
        "\n",
        "# Create embeddings index with content enabled. The default behavior is to only store indexed vectors.\n",
        "embeddings = Embeddings({\"path\": \"sentence-transformers/nli-mpnet-base-v2\", \"content\": True})\n",
        "\n",
        "# Create an index for the list of text\n",
        "embeddings.index([(uid, text, None) for uid, text in enumerate(data)])\n",
        "\n",
        "# Run a search\n",
        "embeddings.search(\"feel good story\", 1)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'id': '4',\n",
              "  'score': 0.08329004049301147,\n",
              "  'text': 'Maine man wins $1M from $25 lottery ticket'}]"
            ]
          },
          "metadata": {},
          "execution_count": 14
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Custom SQL functions\n",
        "\n",
        "Next, we'll recreate the index, this time adding user-defined SQL functions. These functions are simply Python callable objects or functions that take an input and return values. Pipelines, workflows, custom tasks and any other callable objects are supported."
      ],
      "metadata": {
        "id": "QTee7YMNDD4R"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Custom SQL function: length of the text, or 0 when the value is empty or null\n",
        "def clength(text):\n",
        "    return len(text) if text else 0\n",
        "\n",
        "# Create embeddings index with content enabled and register the custom function\n",
        "embeddings = Embeddings({\"path\": \"sentence-transformers/nli-mpnet-base-v2\", \"content\": True, \"functions\": [clength]})\n",
        "\n",
        "# Create an index for the list of text\n",
        "embeddings.index([(uid, text, None) for uid, text in enumerate(data)])\n",
        "\n",
        "# Run a search using a custom SQL function\n",
        "embeddings.search(\"select clength(text) clength, length(text) length, text from txtai where similar('feel good story')\", 1)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "rbsEXtysDDNg",
        "outputId": "f966be17-086b-49b4-e1af-62b766f1c995"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'clength': 42,\n",
              "  'length': 42,\n",
              "  'text': 'Maine man wins $1M from $25 lottery ticket'}]"
            ]
          },
          "metadata": {},
          "execution_count": 15
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "The function itself is simple: it's just an alternate length function. But this example is only a warm-up for the more exciting things that are possible."
      ],
      "metadata": {
        "id": "epIV58P1DyZa"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Pipelines in SQL\n",
        "\n",
        "As mentioned above, any callable can be registered as a custom SQL function. Let's register a translation pipeline as a SQL function. The query below calls it as `translation` with three arguments: the text, the target language and the source language (null here)."
      ],
      "metadata": {
        "id": "1Iw1WKR6FW3S"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from txtai.pipeline import Translation\n",
        "\n",
        "# Translation pipeline\n",
        "translate = Translation()\n",
        "\n",
        "# Create embeddings index with content enabled and register the pipeline as a SQL function\n",
        "embeddings = Embeddings({\"path\": \"sentence-transformers/nli-mpnet-base-v2\", \"content\": True, \"functions\": [translate]})\n",
        "\n",
        "# Create an index for the list of text\n",
        "embeddings.index([(uid, text, None) for uid, text in enumerate(data)])\n",
        "\n",
        "query = \"\"\"\n",
        "select\n",
        "  text,\n",
        "  translation(text, 'de', null) 'text (DE)',\n",
        "  translation(text, 'es', null) 'text (ES)',\n",
        "  translation(text, 'fr', null) 'text (FR)'\n",
        "from txtai where similar('feel good story')\n",
        "limit 1\n",
        "\"\"\"\n",
        "\n",
        "# Run a search using a custom SQL function\n",
        "embeddings.search(query)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "83e8yXpXFh4F",
        "outputId": "0b17e9be-8983-418d-9903-b1e72efc5918"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'text': 'Maine man wins $1M from $25 lottery ticket',\n",
              "  'text (DE)': 'Maine Mann gewinnt $1M von $25 Lotterie-Ticket',\n",
              "  'text (ES)': 'Maine hombre gana $1M de billete de lotería de $25',\n",
              "  'text (FR)': 'Maine homme gagne $1M à partir de $25 billet de loterie'}]"
            ]
          },
          "metadata": {},
          "execution_count": 16
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "And just like that we have translations through SQL! This is pretty 🔥🔥🔥\n",
        "\n",
        "We can make this easier though. By default, every parameter of a registered function must be passed on each call, including parameters that have default values. Let's define a helper function that requires fewer parameters."
      ],
      "metadata": {
        "id": "Ck_XTyBEQtaW"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Helper that only takes the text and the target language\n",
        "def translation(text, lang):\n",
        "    return translate(text, lang)\n",
        "\n",
        "# Create embeddings index with content enabled and register the helper function\n",
        "embeddings = Embeddings({\"path\": \"sentence-transformers/nli-mpnet-base-v2\", \"content\": True, \"functions\": [translation]})\n",
        "\n",
        "# Create an index for the list of text\n",
        "embeddings.index([(uid, text, None) for uid, text in enumerate(data)])\n",
        "\n",
        "query = \"\"\"\n",
        "select\n",
        "  text,\n",
        "  translation(text, 'de') 'text (DE)',\n",
        "  translation(text, 'es') 'text (ES)',\n",
        "  translation(text, 'fr') 'text (FR)'\n",
        "from txtai where similar('feel good story')\n",
        "limit 1\n",
        "\"\"\"\n",
        "\n",
        "# Run a search using a custom SQL function\n",
        "embeddings.search(query)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "L2DDJrd0RAaN",
        "outputId": "0bb437ec-5c9b-4a0c-fe8a-07f641c94a49"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'text': 'Maine man wins $1M from $25 lottery ticket',\n",
              "  'text (DE)': 'Maine Mann gewinnt $1M von $25 Lotterie-Ticket',\n",
              "  'text (ES)': 'Maine hombre gana $1M de billete de lotería de $25',\n",
              "  'text (FR)': 'Maine homme gagne $1M à partir de $25 billet de loterie'}]"
            ]
          },
          "metadata": {},
          "execution_count": 17
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Custom SQL functions with applications\n",
        "\n",
        "All of this is also available through YAML-configured applications."
      ],
      "metadata": {
        "id": "mTT8nopiRdVH"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "config = \"\"\"\n",
        "translation:\n",
        "\n",
        "writable: true\n",
        "embeddings:\n",
        "  path: sentence-transformers/nli-mpnet-base-v2\n",
        "  content: true\n",
        "  functions:\n",
        "    - {name: translation, argcount: 2, function: translation}\n",
        "\"\"\"\n",
        "\n",
        "from txtai.app import Application\n",
        "\n",
        "# Build application and index data\n",
        "app = Application(config)\n",
        "app.add([{\"id\": x, \"text\": row} for x, row in enumerate(data)])\n",
        "app.index()\n",
        "\n",
        "# Run search with custom SQL, reusing the query defined in the previous cell\n",
        "app.search(query)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "FZ_7G6M4RUbz",
        "outputId": "4eca94f3-d2aa-4449-dc6f-f1091ad9dd67"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'text': 'Maine man wins $1M from $25 lottery ticket',\n",
              "  'text (DE)': 'Maine Mann gewinnt $1M von $25 Lotterie-Ticket',\n",
              "  'text (ES)': 'Maine hombre gana $1M de billete de lotería de $25',\n",
              "  'text (FR)': 'Maine homme gagne $1M à partir de $25 billet de loterie'}]"
            ]
          },
          "metadata": {},
          "execution_count": 18
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aDIF3tYt6X0O"
      },
      "source": [
        "# Wrapping up\n",
        "\n",
        "This notebook introduced user-defined custom SQL functions for embeddings SQL. This powerful feature works with any callable, including pipelines, tasks and workflows, in tandem with similarity queries and rule-based filters."
      ]
    }
  ]
}