Getting Started.ipynb 
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "3f159fb4",
   "metadata": {},
   "source": [
    "# 🚶🏻‍♂️ Getting Started\n",
    "\n",
    "Here you will learn how to use the fastembed package to embed your data into a vector space. The package is designed to be easy to use and fast. It is built on top of the [ONNX](https://onnx.ai/) standard, which allows for fast inference on a variety of hardware (called Runtimes in ONNX).\n",
    "\n",
    "## Quick Start\n",
    "\n",
    "We'll use the `TextEmbedding` class. It takes a list of strings as input and returns a generator of vectors. If you're seeing generators for the first time, don't worry: you can convert one to a list using `list()`.\n",
    "\n",
    "> 💡 You can learn more about generators from the [Python Wiki](https://wiki.python.org/moin/Generators)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "ada95c6a",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -Uqq fastembed # Install fastembed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "b61c6552",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "890cc3b969354eec8d149d143e301a7a",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The model BAAI/bge-small-en-v1.5 is ready to use.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "384"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "from fastembed import TextEmbedding\n",
    "from typing import List\n",
    "\n",
    "# Example list of documents\n",
    "documents: List[str] = [\n",
    "    \"This is built to be faster and lighter than other embedding libraries e.g. Transformers, Sentence-Transformers, etc.\",\n",
    "    \"fastembed is supported by and maintained by Qdrant.\",\n",
    "]\n",
    "\n",
    "# This will trigger the model download and initialization\n",
    "embedding_model = TextEmbedding()\n",
    "print(\"The model BAAI/bge-small-en-v1.5 is ready to use.\")\n",
    "\n",
    "embeddings_generator = embedding_model.embed(documents)  # reminder: this is a generator\n",
    "embeddings_list = list(embeddings_generator)  # consume the generator into a list of vectors\n",
    "len(embeddings_list[0])  # Vector of 384 dimensions"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d772190b",
   "metadata": {},
   "source": [
    "> 💡 **Why do we use generators?**\n",
    "> \n",
    "> Mostly to save memory. Instead of holding all the vectors in memory at once, a generator yields them one by one. This is useful when you have a large dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "8a225cb8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Document: This is built to be faster and lighter than other embedding libraries e.g. Transformers, Sentence-Transformers, etc.\n",
      "Vector of type: <class 'numpy.ndarray'> with shape: (384,)\n",
      "Document: fastembed is supported by and maintained by Qdrant.\n",
      "Vector of type: <class 'numpy.ndarray'> with shape: (384,)\n"
     ]
    }
   ],
   "source": [
    "embeddings_generator = embedding_model.embed(documents)  # reminder this is a generator\n",
    "\n",
    "for doc, vector in zip(documents, embeddings_generator):\n",
    "    print(\"Document:\", doc)\n",
    "    print(f\"Vector of type: {type(vector)} with shape: {vector.shape}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "769a1be9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(2, 384)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "embeddings_list = np.array(\n",
    "    list(embedding_model.embed(documents))\n",
    ")  # you can also convert the generator to a list, and that to a numpy array\n",
    "embeddings_list.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8c49ae50",
   "metadata": {},
   "source": [
    "We're using [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5), a state-of-the-art Flag Embedding model. It is reported to outperform OpenAI's text-embedding-ada-002 on the MTEB benchmark. We've made it even faster by converting it to ONNX format and quantizing the model for you.\n",
    "\n",
    "#### Format of the Document List\n",
    "\n",
    "1. List of strings: your documents must be in a list, and each document must be a string.\n",
    "2. For retrieval tasks with the default model: if you're working with queries and passages, you can add special prefixes to them:\n",
    "   - **Queries**: Add \"query:\" at the beginning of each query string\n",
    "   - **Passages**: Add \"passage:\" at the beginning of each passage string\n",
    "\n",
    "## Beyond the default model\n",
    "\n",
    "The default model is built for speed and efficiency. If you need a more accurate model, pass its name to the `TextEmbedding` class. Any model returned by `TextEmbedding.list_supported_models()` can be loaded this way."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "2e9c8766",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "9470ec542f3c4400a42452c2489a1abc",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Fetching 8 files:   0%|          | 0/8 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "multilingual_large_model = TextEmbedding(\"intfloat/multilingual-e5-large\")  # This can take a few minutes to download"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "a9e70f0e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(4, 1024)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.array(\n",
    "    list(multilingual_large_model.embed([\"Hello, world!\", \"你好世界\", \"¡Hola Mundo!\", \"नमस्ते!\"]))\n",
    ").shape  # Vector of 1024 dimensions"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64fe20ed",
   "metadata": {},
   "source": [
    "Next: Check out how to use FastEmbed with Qdrant for similarity search: [FastEmbed with Qdrant](https://qdrant.github.io/fastembed/examples/Usage_With_Qdrant/)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
