{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "4afa980c-21be-44b8-807e-710b5de56198",
   "metadata": {},
   "source": [
    "## Notebook 2: Filling RAG Outputs for Evaluation\n",
    "\n",
    "In this notebook, we will use the example RAG pipeline to populate the RAG outputs: `contexts` (the retrieved relevant documents) and `answer` (generated by the RAG pipeline).\n",
    "\n",
    "The example RAG pipeline provided as part of this repository uses [LlamaIndex](https://gpt-index.readthedocs.io/en/stable/) to build a chatbot that references a custom knowledge base.\n",
    "\n",
    "If you want to learn more about how the example RAG pipeline works, please see [03_llama_index_simple.ipynb](../notebooks/03_llama_index_simple.ipynb).\n",
    "\n",
    "- **Steps 1-5**: Build the RAG pipeline.\n",
    "- **Step 6**: Build the Query Engine, exposing the Retriever and Generator outputs.\n",
    "- **Step 7**: Fill the RAG outputs."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "191e7b90-128e-4432-82ab-897426389d06",
   "metadata": {},
   "source": [
    "### Steps 1-5: Build the RAG pipeline\n",
    "\n",
    "#### Define the LLM\n",
    "Here we use a local LLM served by Triton, specifying the address and gRPC port on which the Triton server is available.\n",
    "\n",
    "***If you are using AI Playground (no local GPU), replace the code in the cell two cells below with the following:***\n",
    "\n",
    "```\n",
    "import os\n",
    "from nv_aiplay import GeneralLLM\n",
    "os.environ['NVAPI_KEY'] = \"REPLACE_WITH_YOUR_API_KEY\"\n",
    "\n",
    "llm = GeneralLLM(\n",
    "    model=\"llama2_70b\",\n",
    "    temperature=0.2,\n",
    "    max_tokens=300\n",
    ")\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "a18dfc7b",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "!test -d dataset || unzip dataset.zip"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8a80987e-1ddb-4248-b76c-f3ce16745ca3",
   "metadata": {},
   "outputs": [],
   "source": [
    "from triton_trt_llm import TensorRTLLM\n",
    "from llama_index.llms.langchain import LangChainLLM\n",
    "\n",
    "# Connect to the Triton server over gRPC and wrap the client for LlamaIndex\n",
    "trtllm = TensorRTLLM(server_url=\"llm:8001\", model_name=\"ensemble\", tokens=300)\n",
    "llm = LangChainLLM(llm=trtllm)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc57b68d-afd5-4a0c-832c-0ad8f3f475d5",
   "metadata": {},
   "source": [
    "#### Create a Prompt Template\n",
    "\n",
    "A [**prompt template**](https://gpt-index.readthedocs.io/en/latest/core_modules/model_modules/prompts.html) is a common paradigm in LLM development.\n",
    "\n",
    "A prompt template is a pre-defined set of instructions provided to the LLM that guides the output produced by the model. It can contain few-shot examples and guidance, and is a quick way to engineer the responses from the LLM. Llama 2 accepts the [prompt format](https://huggingface.co/blog/llama2#how-to-prompt-llama-2) shown in `LLAMA_PROMPT_TEMPLATE`, which we construct from:\n",
    "- The system prompt\n",
    "- The context\n",
    "- The user's question\n",
    "\n",
    "Much like LangChain's prompt abstractions, LlamaIndex provides similar abstractions for creating prompts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "682ec812-33be-430f-8bb1-ae3d68690198",
   "metadata": {},
   "outputs": [],
   "source": [
    "# import the relevant libraries\n",
    "from llama_index.core import Prompt\n",
    "\n",
    "# Llama 2 chat format: the system prompt sits inside <<SYS>> tags, followed by the user turn\n",
    "LLAMA_PROMPT_TEMPLATE = (\n",
    "    \"<s>[INST] <<SYS>>\"\n",
    "    \"Use the following context to answer the user's question. If you don't know the answer, just say that you don't know, don't try to make up an answer.\"\n",
    "    \"<</SYS>>\"\n",
    "    \"Context: {context_str} Question: {query_str} Only return the helpful answer below and nothing else. Helpful answer:[/INST]\"\n",
    ")\n",
    "\n",
    "qa_template = Prompt(LLAMA_PROMPT_TEMPLATE)"
   ]
  },
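  {
   "cell_type": "markdown",
   "id": "sketch-fill-prompt-template",
   "metadata": {},
   "source": [
    "The following is a minimal, library-free sketch of how the `{context_str}` and `{query_str}` placeholders are filled at query time. It assumes plain `str.format` substitution; `TEMPLATE`, `fill_prompt`, and the sample strings are hypothetical stand-ins, and LlamaIndex performs the equivalent substitution internally.\n",
    "\n",
    "```python\n",
    "# Shortened, hypothetical stand-in for LLAMA_PROMPT_TEMPLATE\n",
    "TEMPLATE = (\n",
    "    '<s>[INST] <<SYS>>'\n",
    "    'Use the following context to answer the question.'\n",
    "    '<</SYS>>'\n",
    "    'Context: {context_str} Question: {query_str} Helpful answer:[/INST]'\n",
    ")\n",
    "\n",
    "\n",
    "def fill_prompt(context_str: str, query_str: str) -> str:\n",
    "    \"\"\"Substitute the retrieved context and the user question into the template.\"\"\"\n",
    "    return TEMPLATE.format(context_str=context_str, query_str=query_str)\n",
    "\n",
    "\n",
    "prompt = fill_prompt('Triton serves models over gRPC.', 'How are models served?')\n",
    "print(prompt)\n",
    "```\n",
    "\n",
    "Keeping the system prompt fixed and substituting only these two fields means the same template can be reused for every question in the evaluation set."
   ]
  },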
  {
   "cell_type": "markdown",
   "id": "d0af7922",
   "metadata": {},
   "source": [
    "#### Load Documents\n",
    "Follow step 1 [defined here](../notebooks/05_dataloader.ipynb) to upload the PDFs to the Milvus server.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a7bb75ad",
   "metadata": {},
   "source": [
    "In the rest of this section, we will load and split the PDFs of NVIDIA blogs. We will use the `SentenceTransformersTokenTextSplitter`.\n",
    "Additionally, we use a LlamaIndex [``PromptHelper``](https://gpt-index.readthedocs.io/en/latest/api_reference/service_context/prompt_helper.html) to help deal with LLM context window token limitations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fa366250-108e-45a0-88ce-e6f7274da8e1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# import the relevant libraries\n",
    "from langchain.text_splitter import SentenceTransformersTokenTextSplitter\n",
    "from llama_index.core.node_parser import LangchainNodeParser\n",
    "from llama_index.core import PromptHelper\n",
    "\n",
    "# setup the text splitter\n",
    "TEXT_SPLITTER_MODEL = \"intfloat/e5-large-v2\"\n",
    "TEXT_SPLITTER_TOKENS_PER_CHUNK = 510\n",
    "TEXT_SPLITTER_CHUNK_OVERLAP = 200\n",
    "\n",
    "text_splitter = SentenceTransformersTokenTextSplitter(\n",
    "    model_name=TEXT_SPLITTER_MODEL,\n",
    "    tokens_per_chunk=TEXT_SPLITTER_TOKENS_PER_CHUNK,\n",
    "    chunk_overlap=TEXT_SPLITTER_CHUNK_OVERLAP,\n",
    ")\n",
    "\n",
    "node_parser = LangchainNodeParser(text_splitter)\n",
    "\n",
    "# Use the PromptHelper to keep prompts within the LLM context window\n",
    "prompt_helper = PromptHelper(\n",
    "    context_window=4096,\n",
    "    num_output=256,\n",
    "    chunk_overlap_ratio=0.1,\n",
    "    chunk_size_limit=None\n",
    ")"
   ]
  },
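  {
   "cell_type": "markdown",
   "id": "sketch-chunk-overlap",
   "metadata": {},
   "source": [
    "To illustrate what `tokens_per_chunk` and `chunk_overlap` control, here is a simplified sketch of overlapping chunking over a list of token ids. It assumes a fixed stride of `tokens_per_chunk - chunk_overlap`; `split_with_overlap` is a hypothetical helper for illustration and not the actual `SentenceTransformersTokenTextSplitter` implementation.\n",
    "\n",
    "```python\n",
    "from typing import List\n",
    "\n",
    "\n",
    "def split_with_overlap(tokens: List[int], tokens_per_chunk: int, chunk_overlap: int) -> List[List[int]]:\n",
    "    \"\"\"Split tokens into windows of tokens_per_chunk that share chunk_overlap tokens.\"\"\"\n",
    "    if not 0 <= chunk_overlap < tokens_per_chunk:\n",
    "        raise ValueError('chunk_overlap must be in [0, tokens_per_chunk)')\n",
    "    stride = tokens_per_chunk - chunk_overlap\n",
    "    chunks = []\n",
    "    for start in range(0, len(tokens), stride):\n",
    "        chunks.append(tokens[start:start + tokens_per_chunk])\n",
    "        if start + tokens_per_chunk >= len(tokens):\n",
    "            break  # the last window already reached the end of the input\n",
    "    return chunks\n",
    "\n",
    "\n",
    "chunks = split_with_overlap(list(range(10)), tokens_per_chunk=4, chunk_overlap=2)\n",
    "print(chunks)\n",
    "```\n",
    "\n",
    "With the values used in this notebook (510 tokens per chunk, 200 tokens of overlap), the same rule would advance the window by 310 tokens at a time."
   ]
  },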
  {
   "cell_type": "markdown",
   "id": "b8dab583-a12d-4fb1-a9eb-3a1b1f04075d",
   "metadata": {},
   "source": [
    "#### Generate and Store Embeddings\n",
    "##### a) Generate Embeddings\n",
    "[Embeddings](https://python.langchain.com/docs/modules/data_connection/text_embedding/) for documents are created by vectorizing the document text; this vectorization captures the semantic meaning of the text.\n",
    "\n",
    "We will use [intfloat/e5-large-v2](https://huggingface.co/intfloat/e5-large-v2) for the embeddings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e9011ba0-f3f6-41f0-8a15-48f264743545",
   "metadata": {},
   "outputs": [],
   "source": [
    "# import the relevant libraries\n",
    "from langchain.embeddings import HuggingFaceEmbeddings\n",
    "from llama_index.embeddings.langchain import LangchainEmbedding\n",
    "\n",
    "# Run the embedding model on the first GPU (cuda:0).\n",
    "# To conserve GPU memory, set \"device\" to \"cpu\" instead.\n",
    "# In the production deployment (the API server shown in the 5th notebook), the model runs on GPU.\n",
    "model_name = \"intfloat/e5-large-v2\"\n",
    "model_kwargs = {\"device\": \"cuda:0\"}\n",
    "encode_kwargs = {\"normalize_embeddings\": False}\n",
    "hf_embeddings = HuggingFaceEmbeddings(\n",
    "    model_name=model_name,\n",
    "    model_kwargs=model_kwargs,\n",
    "    encode_kwargs=encode_kwargs,\n",
    ")\n",
    "# Load in a specific embedding model\n",
    "embed_model = LangchainEmbedding(hf_embeddings)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8db99124-e438-406d-880d-557501a461d3",
   "metadata": {},
   "source": [
    "##### b) Store Embeddings\n",
    "\n",
    "We will use the LlamaIndex module [`Settings`](https://docs.llamaindex.ai/en/stable/module_guides/supporting_modules/settings/?h=settings) to bundle the resources commonly used during the indexing and querying stages.\n",
    "\n",
    "In this example, we bundle the build resources: the LLM, the embedding model, the node parser, and the prompt helper."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0e493f9d-589a-4820-902d-f68932bfb0d8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# import the relevant libraries\n",
    "from llama_index.core import Settings\n",
    "\n",
    "Settings.llm = llm\n",
    "Settings.embed_model = embed_model\n",
    "Settings.node_parser = node_parser\n",
    "Settings.prompt_helper = prompt_helper"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "44e10c13",
   "metadata": {},
   "source": [
    "Ingest the dataset using the `/documents` endpoint of the chain server."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "acdc51db",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import requests\n",
    "import mimetypes\n",
    "\n",
    "def upload_document(file_path, url):\n",
    "    \"\"\"POST a single file to the upload endpoint and return the response body.\"\"\"\n",
    "    headers = {\n",
    "        'accept': 'application/json'\n",
    "    }\n",
    "    mime_type, _ = mimetypes.guess_type(file_path)\n",
    "    # Use a context manager so the file handle is closed after the upload\n",
    "    with open(file_path, 'rb') as fh:\n",
    "        files = {\n",
    "            'file': (file_path, fh, mime_type)\n",
    "        }\n",
    "        response = requests.post(url, headers=headers, files=files)\n",
    "\n",
    "    return response.text\n",
    "\n",
    "def upload_pdf_files(folder_path, upload_url):\n",
    "    \"\"\"Upload every PDF file found directly inside folder_path.\"\"\"\n",
    "    for file_name in os.listdir(folder_path):\n",
    "        _, ext = os.path.splitext(file_name)\n",
    "        # Ingest only pdf files\n",
    "        if ext.lower() == \".pdf\":\n",
    "            file_path = os.path.join(folder_path, file_name)\n",
    "            print(upload_document(file_path, upload_url))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "823c89f9",
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "\n",
    "start_time = time.time()\n",
    "upload_pdf_files(\"dataset\", \"http://chain-server:8081/documents\")\n",
    "print(f\"--- {time.time() - start_time} seconds ---\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "79c7923c-d778-4f32-be37-4314063ecd2f",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-info\">\n",
    "\n",
    "⚠️ In the deployment of this workflow, [Milvus](https://milvus.io/) is running as a vector database microservice.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1e94e53e-41a9-47d3-a9d3-7c0af4c07f76",
   "metadata": {},
   "outputs": [],
   "source": [
    "# import the relevant libraries\n",
    "from llama_index.core import VectorStoreIndex\n",
    "from llama_index.core.storage.storage_context import StorageContext\n",
    "from llama_index.vector_stores.milvus import MilvusVectorStore\n",
    "\n",
    "# connect to the existing Milvus collection (overwrite=False keeps the ingested data)\n",
    "vector_store = MilvusVectorStore(\n",
    "    uri=\"http://milvus:19530\",\n",
    "    dim=1024,\n",
    "    collection_name=\"developer_rag\",\n",
    "    index_config={\"index_type\": \"GPU_IVF_FLAT\", \"nlist\": 64},\n",
    "    search_config={\"nprobe\": 16},\n",
    "    overwrite=False\n",
    ")\n",
    "storage_context = StorageContext.from_defaults(vector_store=vector_store)\n",
    "index = VectorStoreIndex.from_vector_store(vector_store)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3b58028-04fa-4050-9ec4-6526817fd9cf",
   "metadata": {},
   "source": [
    "### Step 6: Build the Query Engine, exposing the Retriever and Generator outputs\n",
    "\n",
    "#### a) Limit the Retriever Total Output Length\n",
    "\n",
    "First, we need to restrict the output of the Retriever to a reasonable length so that the prompt fits within the context length of the LLM.\n",
    "In this notebook, we will restrict it to 1000 tokens: retrieved nodes are kept in order until the cumulative token count would exceed 1000, and the remaining nodes are dropped.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6efc410c-f488-43aa-af65-c39376bd7ba5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# import the relevant libraries\n",
    "from typing import List, Optional\n",
    "\n",
    "from llama_index.core.postprocessor.types import BaseNodePostprocessor\n",
    "from llama_index.core.schema import MetadataMode, NodeWithScore, QueryBundle\n",
    "from llama_index.core.utils import get_tokenizer\n",
    "\n",
    "DEFAULT_MAX_CONTEXT = 1000\n",
    "\n",
    "# limit the Retriever total outputs length\n",
    "class LimitRetrievedNodesLength(BaseNodePostprocessor):\n",
    "    \"\"\"Llama Index chain filter to limit token lengths.\"\"\"\n",
    "\n",
    "    def _postprocess_nodes(\n",
    "        self, nodes: List[NodeWithScore], query_bundle: Optional[QueryBundle] = None\n",
    "    ) -> List[NodeWithScore]:\n",
    "        \"\"\"Filter function.\"\"\"\n",
    "        included_nodes = []\n",
    "        current_length = 0\n",
    "        limit = DEFAULT_MAX_CONTEXT\n",
    "\n",
    "        tokenizer = get_tokenizer()\n",
    "        for node in nodes:\n",
    "            # add this node's token count to the running total\n",
    "            current_length += len(\n",
    "                tokenizer(\n",
    "                    node.node.get_content(metadata_mode=MetadataMode.LLM)\n",
    "                )\n",
    "            )\n",
    "            # stop at the first node that pushes the total over the limit\n",
    "            if current_length > limit:\n",
    "                break\n",
    "            included_nodes.append(node)\n",
    "\n",
    "        return included_nodes"
   ]
  },
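  {
   "cell_type": "markdown",
   "id": "sketch-token-budget",
   "metadata": {},
   "source": [
    "The truncation rule can be tried out without a running pipeline. The sketch below applies the same cumulative-budget logic to plain strings, assuming whitespace splitting as a stand-in for the real tokenizer; `limit_texts` and the sample inputs are hypothetical.\n",
    "\n",
    "```python\n",
    "from typing import Callable, List\n",
    "\n",
    "\n",
    "def limit_texts(texts: List[str], limit: int, tokenizer: Callable[[str], list] = str.split) -> List[str]:\n",
    "    \"\"\"Keep texts in order until the cumulative token count would exceed limit.\"\"\"\n",
    "    included = []\n",
    "    current_length = 0\n",
    "    for text in texts:\n",
    "        current_length += len(tokenizer(text))\n",
    "        if current_length > limit:\n",
    "            break  # this text and all later ones are dropped\n",
    "        included.append(text)\n",
    "    return included\n",
    "\n",
    "\n",
    "kept = limit_texts(['a b c', 'd e', 'f g h i'], limit=5)\n",
    "print(kept)\n",
    "```\n",
    "\n",
    "Note that a text which exactly reaches the limit is still kept, and that a single oversized first text results in an empty list."
   ]
  },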
  {
   "cell_type": "markdown",
   "id": "e33cfed2-2a63-40be-8a7d-787ba04d2af9",
   "metadata": {},
   "source": [
    "#### b) Build the Query Engine\n",
    "\n",
    "Now, let's build the query engine that takes a query and returns a response. Each index has a corresponding default query engine; for example, the default query engine for a vector index performs a standard top-k retrieval over the vector store.\n",
    "We will use `RetrieverQueryEngine` to get the outputs of both the Retriever and the Generator. Learn more about the `RetrieverQueryEngine` in the [documentation](https://gpt-index.readthedocs.io/en/latest/examples/query_engine/CustomRetrievers.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f56f37e0-341e-4d7d-b282-f374a16f55b2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# import the relevant libraries\n",
    "from llama_index.core.query_engine import RetrieverQueryEngine\n",
    "from llama_index.core.schema import MetadataMode\n",
    "\n",
    "# Expose the retriever\n",
    "retriever = index.as_retriever(similarity_top_k=2)\n",
    "\n",
    "query_engine = RetrieverQueryEngine.from_args(\n",
    "    retriever,\n",
    "    text_qa_template=qa_template,\n",
    "    node_postprocessors=[LimitRetrievedNodesLength()]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c6a58983-2069-450e-adf9-24b0f8736498",
   "metadata": {},
   "source": [
    "### Step 7: Fill the RAG outputs\n",
    "\n",
    "Let's now query the RAG pipeline and fill the outputs `contexts` and `answer` in the evaluation JSON file.\n",
    "\n",
    "First, we need to load the previously generated dataset. So far, the RAG output fields are empty.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "82f0f304-3476-42e3-9be7-1ab38f9e14cd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# import the relevant libraries\n",
    "import json\n",
    "from IPython.display import JSON\n",
    "\n",
    "# load the evaluation data; the context manager closes the file afterwards\n",
    "with open('qa_generation.json') as f:\n",
    "    data = json.load(f)\n",
    "\n",
    "# show the first element\n",
    "JSON(data[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d4b4321b-dfce-4c72-a8f1-2e2264b3c59d",
   "metadata": {},
   "source": [
    "Let's now query the RAG pipeline and populate the `contexts` and `answer` fields."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6f238d58-071a-4bb9-956c-d014748c15ab",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# one postprocessor instance is enough; it keeps no state between calls\n",
    "limited_retrieval_length = LimitRetrievedNodesLength()\n",
    "\n",
    "for entry in data:\n",
    "    # generate the answer with the full RAG pipeline\n",
    "    response = query_engine.query(entry[\"question\"])\n",
    "    entry[\"answer\"] = response.response\n",
    "    print(entry[\"answer\"])\n",
    "\n",
    "    # retrieve the contexts and apply the same length limit as the query engine\n",
    "    nodes = retriever.retrieve(entry[\"question\"])\n",
    "    included_nodes = limited_retrieval_length.postprocess_nodes(nodes)\n",
    "    retrieved_text = \"\"\n",
    "    for node in included_nodes:\n",
    "        retrieved_text = retrieved_text + \" \" + node.text\n",
    "    entry[\"contexts\"] = [retrieved_text]"
   ]
  },
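  {
   "cell_type": "markdown",
   "id": "sketch-check-filled-entries",
   "metadata": {},
   "source": [
    "Before saving, it can help to verify that every entry was actually populated. The following is a minimal sketch, assuming each entry is a dict with the `question`, `answer`, and `contexts` keys used in the loop above; `find_unfilled` and the sample records are hypothetical.\n",
    "\n",
    "```python\n",
    "from typing import Dict, List\n",
    "\n",
    "\n",
    "def find_unfilled(entries: List[Dict]) -> List[int]:\n",
    "    \"\"\"Return the indices of entries whose answer or contexts are missing or empty.\"\"\"\n",
    "    bad = []\n",
    "    for i, entry in enumerate(entries):\n",
    "        answer = entry.get('answer')\n",
    "        contexts = entry.get('contexts')\n",
    "        has_context = bool(contexts) and any(c.strip() for c in contexts)\n",
    "        if not answer or not has_context:\n",
    "            bad.append(i)\n",
    "    return bad\n",
    "\n",
    "\n",
    "sample = [\n",
    "    {'question': 'q1', 'answer': 'a1', 'contexts': [' some text']},\n",
    "    {'question': 'q2', 'answer': '', 'contexts': ['']},\n",
    "]\n",
    "print(find_unfilled(sample))\n",
    "```\n",
    "\n",
    "Running `find_unfilled(data)` after the loop would list any questions for which the pipeline returned no answer or no context."
   ]
  },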
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "14407673-a8f1-4245-8748-d6885e08f06d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# json_list_string=json.dumps(data)\n",
    "\n",
    "# show again the first element\n",
    "JSON(data[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dfa9f140-5989-4c3c-98af-18ec63a954b9",
   "metadata": {},
   "source": [
    "Let's now save the new evaluation dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "958653ba-4228-4c81-8f65-81ead7c8254f",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "with open('eval.json', 'w') as f:\n",
    "    json.dump(data, f)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "248982b8-9f9e-4021-a326-657e2e82d43d",
   "metadata": {},
   "source": [
    "In the next notebook, we will evaluate the [Corp Comms Copilot](https://gitlab-master.nvidia.com/chat-labs/rag-demos/corp-comms-copilot) RAG pipeline."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}