Retrieval_with_FastEmbed.ipynb 
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ⚓️ Retrieval with FastEmbed\n",
    "\n",
    "This notebook demonstrates how to use FastEmbed to perform vector search and retrieval. It consists of the following sections:\n",
    "\n",
    "1. Setup: Installing the necessary packages.\n",
    "2. Importing Libraries: Importing FastEmbed and other libraries.\n",
    "3. Data Preparation: Example data and embedding generation.\n",
    "4. Querying: Defining a function to search documents based on a query.\n",
    "5. Running Queries: Running example queries.\n",
    "\n",
    "## Setup\n",
    "\n",
    "First, install the dependency: `fastembed`, which we use to create embeddings and perform retrieval."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# !pip install fastembed --quiet --upgrade"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Importing the necessary libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import List\n",
    "import numpy as np\n",
    "from fastembed import TextEmbedding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Preparation\n",
    "We initialize the embedding model and generate embeddings for the documents.\n",
    "\n",
    "### 💡 Tip: Prefer using `query_embed` for queries and `passage_embed` for documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(384,) 10\n"
     ]
    }
   ],
   "source": [
    "# Example list of documents\n",
    "documents: List[str] = [\n",
    "    \"Maharana Pratap was a Rajput warrior king from Mewar\",\n",
    "    \"He fought against the Mughal Empire led by Akbar\",\n",
    "    \"The Battle of Haldighati in 1576 was his most famous battle\",\n",
    "    \"He refused to submit to Akbar and continued guerrilla warfare\",\n",
    "    \"His capital was Chittorgarh, which he lost to the Mughals\",\n",
    "    \"He died in 1597 at the age of 57\",\n",
    "    \"Maharana Pratap is considered a symbol of Rajput resistance against foreign rule\",\n",
    "    \"His legacy is celebrated in Rajasthan through festivals and monuments\",\n",
    "    \"He had 11 wives and 17 sons, including Amar Singh I who succeeded him as ruler of Mewar\",\n",
    "    \"His life has been depicted in various films, TV shows, and books\",\n",
    "]\n",
    "# Initialize the TextEmbedding model with the desired parameters\n",
    "embedding_model = TextEmbedding(model_name=\"BAAI/bge-small-en\")\n",
    "\n",
    "# Use the passage_embed method to get the embeddings for the documents\n",
    "embeddings: List[np.ndarray] = list(\n",
    "    embedding_model.passage_embed(documents)\n",
    ")  # passage_embed returns a generator, so we materialize it as a list\n",
    "\n",
    "print(embeddings[0].shape, len(embeddings))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Querying\n",
    "\n",
    "We'll define a function to print the top k documents based on a query, and prepare a sample query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "query = \"Who was Maharana Pratap?\"\n",
    "query_embedding = list(embedding_model.query_embed(query))[0]\n",
    "plain_query_embedding = list(embedding_model.embed(query))[0]\n",
    "\n",
    "\n",
    "def print_top_k(query_embedding, embeddings, documents, k=5):\n",
    "    # the dot product equals the cosine similarity when the embeddings are unit-normalized\n",
    "    scores = np.dot(embeddings, query_embedding)\n",
    "    # sort the document indices by score in descending order\n",
    "    sorted_indices = np.argsort(scores)[::-1]\n",
    "    # print the top k (or fewer, if there are fewer than k documents)\n",
    "    for rank, idx in enumerate(sorted_indices[:k], start=1):\n",
    "        print(f\"Rank {rank}: {documents[idx]}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(array([-0.06002192,  0.04322132, -0.00545516, -0.04419701, -0.00542277],\n",
       "       dtype=float32),\n",
       " array([-0.06002192,  0.04322132, -0.00545516, -0.04419701, -0.00542277],\n",
       "       dtype=float32))"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query_embedding[:5], plain_query_embedding[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`query_embed` is intended for queries and `passage_embed` for documents, so that a model can apply query- or passage-specific handling where it defines one.\n",
    "\n",
    "In contrast, `embed` is the general-purpose method. In the output above, the first five components of `query_embedding` and `plain_query_embedding` are identical, so for `BAAI/bge-small-en` the two methods appear to produce the same vector here. With other models the vectors, and therefore the ranking of the retrieved documents, may differ.\n",
    "\n",
    "Conclusion: Prefer `query_embed` for queries and `passage_embed` for documents, so that retrieval stays correct for models that treat the two differently."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "fst",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
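
The notebook defines `print_top_k` in the Querying section but never runs it. The ranking step can be sketched as follows. This is a minimal, dependency-free illustration, not the notebook's own code: the helper names `normalize` and `top_k` and the toy vectors are hypothetical. It normalizes the vectors explicitly, so the dot product is a cosine similarity whether or not the embedding model already returns unit-length vectors.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def top_k(
    query_vector: Sequence[float],
    doc_vectors: Sequence[Sequence[float]],
    documents: Sequence[str],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Return up to k (document, cosine similarity) pairs, best match first."""
    q = normalize(query_vector)
    scored = []
    for doc, vec in zip(documents, doc_vectors):
        d = normalize(vec)
        # After normalization the dot product is the cosine similarity.
        scored.append((doc, sum(a * b for a, b in zip(q, d))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors (hypothetical values, not real embeddings).
toy_documents = ["doc A", "doc B", "doc C"]
toy_vectors = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [3.0, 3.0, 0.0]]
toy_query = [1.0, 0.2, 0.0]

for rank, (doc, score) in enumerate(
    top_k(toy_query, toy_vectors, toy_documents, k=2), start=1
):
    print(f"Rank {rank}: {doc} (score={score:.3f})")
```

With these toy values the query points mostly along the first axis, so "doc A" ranks first and "doc C" second. In the notebook, the same ranking would be obtained by passing `query_embedding`, `embeddings`, and `documents` to `print_top_k`.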