{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# Question Answering on a Knowledge Graph\n",
    "\n",
    "> Starting from version 1.15, `BaseKnowledgeGraph`, `GraphDBKnowledgeGraph`, `InMemoryKnowledgeGraph`, and `Text2SparqlRetriever` are deprecated and will be removed from Haystack as of version 1.17. For more details about this deprecation, check out [our announcement](https://github.com/deepset-ai/haystack/discussions/4882) on GitHub.\n",
    "\n",
    "Haystack allows storing and querying knowledge graphs with the help of pre-trained models that translate text queries to SPARQL queries.\n",
    "This tutorial demonstrates how to load an existing knowledge graph into Haystack, load a pre-trained retriever, and execute text queries on the knowledge graph.\n",
    "Training models that translate text queries into SPARQL queries is currently not supported."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## Preparing the Colab Environment\n",
    "\n",
    "- [Enable GPU Runtime](https://docs.haystack.deepset.ai/docs/enabling-gpu-acceleration#enabling-the-gpu-in-colab)\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Installing Haystack\n",
    "\n",
    "To start, install Haystack with `pip`. The version is pinned to 1.16.1 because the knowledge graph components are removed from Haystack as of version 1.17:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "pip install --upgrade pip\n",
    "pip install farm-haystack[colab,inmemorygraph]==1.16.1"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Enabling Telemetry\n",
    "Knowing you're using this tutorial helps us decide where to invest our efforts to build a better product, but you can always opt out by commenting out the following line. See [Telemetry](https://docs.haystack.deepset.ai/docs/telemetry) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from haystack.telemetry import tutorial_running\n",
    "\n",
    "tutorial_running(10)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Logging\n",
    "\n",
    "We configure how logging messages should be displayed and which log level should be used before importing Haystack.\n",
    "Example log message:\n",
    "`INFO - haystack.utils.preprocessing -  Converting data/tutorial1/218_Olenna_Tyrell.txt`\n",
    "\n",
    "The default log level in `basicConfig` is WARNING, so the explicit `level` parameter is not necessary, but it makes the level easy to change:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "import logging\n",
    "\n",
    "logging.basicConfig(format=\"%(levelname)s - %(name)s -  %(message)s\", level=logging.WARNING)\n",
    "logging.getLogger(\"haystack\").setLevel(logging.INFO)"
   ]
  },
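  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The effect of the two logging calls above can be checked with a small sketch. It assumes the previous cell has run and that nothing else has changed the root logger's level. `some_other_library` is a made-up logger name used only for comparison."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import logging\n",
    "\n",
    "# A logger without its own level inherits the level of its nearest configured ancestor.\n",
    "# \"haystack.utils.preprocessing\" inherits from \"haystack\";\n",
    "# \"some_other_library\" (hypothetical name) falls back to the root logger.\n",
    "for name in (\"haystack.utils.preprocessing\", \"some_other_library\"):\n",
    "    level = logging.getLogger(name).getEffectiveLevel()\n",
    "    print(f\"{name}: {logging.getLevelName(level)}\")"
   ]
  },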
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Downloading Knowledge Graph and Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "from haystack.utils import fetch_archive_from_http\n",
    "\n",
    "\n",
    "# Let's first fetch some triples that we want to store in our knowledge graph\n",
    "# Here: example triples from the wizarding world\n",
    "graph_dir = \"data/tutorial10\"\n",
    "s3_url = \"https://fandom-qa.s3-eu-west-1.amazonaws.com/triples_and_config.zip\"\n",
    "fetch_archive_from_http(url=s3_url, output_dir=graph_dir)\n",
    "\n",
    "# Fetch a pre-trained BART model that translates text queries to SPARQL queries\n",
    "model_dir = \"../saved_models/tutorial10_knowledge_graph/\"\n",
    "s3_url = \"https://fandom-qa.s3-eu-west-1.amazonaws.com/saved_models/hp_v3.4.zip\"\n",
    "fetch_archive_from_http(url=s3_url, output_dir=model_dir)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Initialize a knowledge graph and load data"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Currently, Haystack supports two alternative implementations for knowledge graphs:\n",
    "* `InMemoryKnowledgeGraph`, a simple implementation based on an RDFLib in-memory store\n",
    "* `GraphDBKnowledgeGraph`, which runs on GraphDB"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### InMemoryKnowledgeGraph"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "from haystack.document_stores import InMemoryKnowledgeGraph\n",
    "\n",
    "\n",
    "# Initialize an in-memory knowledge graph and use \"tutorial_10_index\" as the name of the index\n",
    "kg = InMemoryKnowledgeGraph(index=\"tutorial_10_index\")\n",
    "\n",
    "# Delete the index as it might have been already created in previous runs\n",
    "kg.delete_index()\n",
    "\n",
    "# Create the index\n",
    "kg.create_index()\n",
    "\n",
    "# Import triples of subject, predicate, and object statements from a ttl file\n",
    "kg.import_from_ttl_file(index=\"tutorial_10_index\", path=Path(graph_dir) / \"triples.ttl\")\n",
    "print(f\"The last triple stored in the knowledge graph is: {kg.get_all_triples()[-1]}\")\n",
    "print(f\"There are {len(kg.get_all_triples())} triples stored in the knowledge graph.\")"
   ]
  },
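  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since `InMemoryKnowledgeGraph` is based on RDFLib, the following sketch uses RDFLib directly to illustrate what a triple and a SPARQL query look like. It assumes that `rdflib` is importable in this environment. The two triples are made up for illustration and are not read from `triples.ttl`, and `demo_graph` is independent of the `kg` object above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from rdflib import Graph\n",
    "\n",
    "\n",
    "# Two hypothetical triples in Turtle syntax (subject, predicate, object), not taken from triples.ttl\n",
    "ttl = \"\"\"\n",
    "@prefix hp: <https://deepset.ai/harry_potter/> .\n",
    "hp:Harry_potter hp:house hp:Gryffindor .\n",
    "hp:Hermione_granger hp:house hp:Gryffindor .\n",
    "\"\"\"\n",
    "\n",
    "demo_graph = Graph()\n",
    "demo_graph.parse(data=ttl, format=\"turtle\")\n",
    "print(f\"The demo graph holds {len(demo_graph)} triples.\")\n",
    "\n",
    "# Ask for every subject whose hp:house is hp:Gryffindor\n",
    "sparql = \"\"\"\n",
    "PREFIX hp: <https://deepset.ai/harry_potter/>\n",
    "SELECT ?sbj WHERE { ?sbj hp:house hp:Gryffindor . }\n",
    "\"\"\"\n",
    "for row in demo_graph.query(sparql):\n",
    "    print(row.sbj)"
   ]
  },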
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true,
    "tags": []
   },
   "source": [
    "### GraphDBKnowledgeGraph (alternative)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "#### Launching a GraphDB instance\n",
    "\n",
    "Unfortunately, there seems to be no good way to run GraphDB in Colab environments.\n",
    "In your local environment, you can start a GraphDB server with Docker. See GraphDB's website for the free version: https://www.ontotext.com/products/graphdb/graphdb-free/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# import os\n",
    "# import subprocess\n",
    "# import time\n",
    "\n",
    "# LAUNCH_GRAPHDB = os.environ.get(\"LAUNCH_GRAPHDB\", False)\n",
    "\n",
    "# if LAUNCH_GRAPHDB:\n",
    "#     print(\"Starting GraphDB ...\")\n",
    "#     status = subprocess.run(\n",
    "#         [\n",
    "#             \"docker run -d -p 7200:7200 --name graphdb-instance-tutorial docker-registry.ontotext.com/graphdb-free:9.4.1-adoptopenjdk11\"\n",
    "#         ],\n",
    "#         shell=True,\n",
    "#     )\n",
    "#     if status.returncode:\n",
    "#         raise Exception(\n",
    "#             \"Failed to launch GraphDB. Maybe it is already running or you already have a container with that name that you could start?\"\n",
    "#         )\n",
    "#     time.sleep(5)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "#### Creating a new GraphDB repository (also known as an index in Haystack's document stores)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# from haystack.document_stores import GraphDBKnowledgeGraph\n",
    "\n",
    "# # Initialize a knowledge graph connected to GraphDB and use \"tutorial_10_index\" as the name of the index\n",
    "# kg = GraphDBKnowledgeGraph(index=\"tutorial_10_index\")\n",
    "\n",
    "# # Delete the index as it might have been already created in previous runs\n",
    "# kg.delete_index()\n",
    "\n",
    "# # Create the index based on a configuration file\n",
    "# kg.create_index(config_path=Path(graph_dir) / \"repo-config.ttl\")\n",
    "\n",
    "# # Import triples of subject, predicate, and object statements from a ttl file\n",
    "# kg.import_from_ttl_file(index=\"tutorial_10_index\", path=Path(graph_dir) / \"triples.ttl\")\n",
    "# print(f\"The last triple stored in the knowledge graph is: {kg.get_all_triples()[-1]}\")\n",
    "# print(f\"There are {len(kg.get_all_triples())} triples stored in the knowledge graph.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# # Define prefixes for names of resources so that we can use shorter resource names in queries\n",
    "# prefixes = \"\"\"PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n",
    "# PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>\n",
    "# PREFIX hp: <https://deepset.ai/harry_potter/>\n",
    "# \"\"\"\n",
    "# kg.prefixes = prefixes"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the pre-trained retriever"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from haystack.nodes import Text2SparqlRetriever\n",
    "\n",
    "\n",
    "# Load a pre-trained model that translates text queries to SPARQL queries\n",
    "kgqa_retriever = Text2SparqlRetriever(knowledge_graph=kg, model_name_or_path=Path(model_dir) / \"hp_v3.4\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Query Execution\n",
    "\n",
    "We can now ask questions that will be answered by our knowledge graph!\n",
    "One limitation though: our pre-trained model can only generate queries about resources it has seen during training.\n",
    "Otherwise, it cannot translate the name of the resource to the identifier used in the knowledge graph.\n",
    "For example: \"Harry\" -> \"hp:Harry_potter\""
   ]
  },
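  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Identifiers such as `hp:Harry_potter` are prefixed names, that is, short forms of full IRIs. The sketch below shows the expansion in plain Python. The `expand` helper and the `PREFIXES` table are hypothetical and not part of Haystack; the `hp` namespace is the one that appears in this tutorial's queries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical prefix table; the hp namespace is the one used in this tutorial's queries\n",
    "PREFIXES = {\"hp\": \"https://deepset.ai/harry_potter/\"}\n",
    "\n",
    "\n",
    "def expand(prefixed_name: str) -> str:\n",
    "    \"\"\"Expand a prefixed name such as hp:Harry_potter to a full IRI in angle brackets.\"\"\"\n",
    "    prefix, local_name = prefixed_name.split(\":\", 1)\n",
    "    return f\"<{PREFIXES[prefix]}{local_name}>\"\n",
    "\n",
    "\n",
    "print(expand(\"hp:Hermione_granger\"))\n",
    "# <https://deepset.ai/harry_potter/Hermione_granger>"
   ]
  },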
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "query = \"In which house is Harry Potter?\"\n",
    "print(f'Translating the text query \"{query}\" to a SPARQL query and executing it on the knowledge graph...')\n",
    "result = kgqa_retriever.retrieve(query=query)\n",
    "print(result)\n",
    "# Correct SPARQL query: select ?a { hp:Harry_potter hp:house ?a . }\n",
    "# Correct answer: Gryffindor\n",
    "\n",
    "print(\"Executing a SPARQL query with prefixed names of resources...\")\n",
    "result = kgqa_retriever._query_kg(\n",
    "    sparql_query=\"select distinct ?sbj where { ?sbj hp:job hp:Keeper_of_keys_and_grounds . }\"\n",
    ")\n",
    "print(result)\n",
    "# Paraphrased question: Who is the keeper of keys and grounds?\n",
    "# Correct answer: Rubeus Hagrid\n",
    "\n",
    "print(\"Executing a SPARQL query with full names of resources...\")\n",
    "result = kgqa_retriever._query_kg(\n",
    "    sparql_query=\"select distinct ?obj where { <https://deepset.ai/harry_potter/Hermione_granger> <https://deepset.ai/harry_potter/patronus> ?obj . }\"\n",
    ")\n",
    "print(result)\n",
    "# Paraphrased question: What is the patronus of Hermione?\n",
    "# Correct answer: Otter"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.9.6 64-bit",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  },
  "vscode": {
   "interpreter": {
    "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}