"cell_type": "markdown",
"# 🚶🏻♂️ Getting Started\n",
"Here you will learn how to use the fastembed package to embed your data into a vector space. The package is designed to be easy to use and fast. It is built on the [ONNX](https://onnx.ai/) standard and ONNX Runtime, which allows fast inference on a variety of hardware (through what ONNX Runtime calls execution providers).\n",
"We'll be using the `TextEmbedding` class. It takes a list of strings as input and returns a generator of vectors. If you're seeing generators for the first time, don't worry: you can convert one to a list using `list()`.\n",
"> 💡 You can learn more about generators from the [Python Wiki](https://wiki.python.org/moin/Generators)"
"!pip install -Uqq fastembed # Install fastembed"
"application/vnd.jupyter.widget-view+json": {
"model_id": "890cc3b969354eec8d149d143e301a7a",
"Fetching 9 files: 0%| | 0/9 [00:00<?, ?it/s]"
"output_type": "display_data"
"output_type": "stream",
"The model BAAI/bge-small-en-v1.5 is ready to use.\n"
"output_type": "execute_result"
"import numpy as np\n",
"from fastembed import TextEmbedding\n",
"from typing import List\n",
"# Example list of documents\n",
"documents: List[str] = [\n",
" \"This is built to be faster and lighter than other embedding libraries e.g. Transformers, Sentence-Transformers, etc.\",\n",
" \"fastembed is supported by and maintained by Qdrant.\",\n",
"# This will trigger the model download and initialization\n",
"embedding_model = TextEmbedding()\n",
"print(\"The model BAAI/bge-small-en-v1.5 is ready to use.\")\n",
"embeddings_generator = embedding_model.embed(documents) # reminder this is a generator\n",
"embeddings_list = list(embeddings_generator)\n",
"# embeddings_list is now a plain list; np.array(embeddings_list) would turn it into a numpy array\n",
"len(embeddings_list[0]) # Vector of 384 dimensions"
"cell_type": "markdown",
"> 💡 **Why do we use generators?**\n",
"> We use them mostly to save memory. Instead of holding all the vectors in memory at once, we can produce and process them one by one, which matters when you have a large dataset."
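To make the memory argument concrete, here is a minimal sketch of how a generator hands out vectors one at a time. It does not call fastembed: `fake_embed` is a hypothetical stand-in written only for this illustration, and its toy vectors are derived from the text length rather than from a real model.

```python
from typing import Iterable, Iterator, List


def fake_embed(documents: Iterable[str], dim: int = 4) -> Iterator[List[float]]:
    """Hypothetical stand-in for an embedding model: yields one toy vector per document."""
    for doc in documents:
        # A real model would run inference here; we derive toy numbers from the text length.
        yield [float(len(doc) + i) for i in range(dim)]


documents = ["first document", "second, longer document", "third"]

vectors = fake_embed(documents)  # nothing has been computed yet

# Each vector is produced only when the loop asks for it,
# so a single vector needs to be in memory at any moment.
lengths = []
for doc, vector in zip(documents, vectors):
    lengths.append(len(vector))

# A generator can be consumed only once; a second pass finds it empty.
remaining = list(vectors)
```

Because a generator is exhausted after one pass, call `embed` again (or store the result with `list()`) whenever you need to iterate over the vectors a second time.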
"execution_count": 3,
"output_type": "stream",
"Document: This is built to be faster and lighter than other embedding libraries e.g. Transformers, Sentence-Transformers, etc.\n",
"Vector of type: <class 'numpy.ndarray'> with shape: (384,)\n",
"Document: fastembed is supported by and maintained by Qdrant.\n",
"Vector of type: <class 'numpy.ndarray'> with shape: (384,)\n"
"embeddings_generator = embedding_model.embed(documents) # reminder this is a generator\n",
"for doc, vector in zip(documents, embeddings_generator):\n",
" print(\"Document:\", doc)\n",
" print(f\"Vector of type: {type(vector)} with shape: {vector.shape}\")"
"execution_count": 4,
"execution_count": 4,
"output_type": "execute_result"
"embeddings_list = np.array(\n",
" list(embedding_model.embed(documents))\n",
") # you can also convert the generator to a list, and that to a numpy array\n",
"embeddings_list.shape"
"cell_type": "markdown",
"We're using [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5), a state-of-the-art Flag Embedding model. The model does better than OpenAI's text-embedding-ada-002. We've made it even faster by converting it to ONNX format and quantizing it for you.\n",
"#### Format of the Document List\n",
"1. List of Strings: Your documents must be in a list, and each document must be a string.\n",
"2. For Retrieval Tasks with our default model: If you're working with queries and passages, you can add special labels to them:\n",
"- **Queries**: Add \"query:\" at the beginning of each query string\n",
"- **Passages**: Add \"passage:\" at the beginning of each passage string\n",
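The labelling described above can be sketched with a small helper. `add_prefix` is a hypothetical function written for this illustration, not part of fastembed, and the single space after the colon is an assumption about the exact label format.

```python
from typing import List


def add_prefix(texts: List[str], label: str) -> List[str]:
    """Prepend a label such as "query" or "passage" to each string (hypothetical helper)."""
    # Assumption: the label is followed by a colon and one space.
    return [f"{label}: {text}" for text in texts]


queries = add_prefix(["Who maintains fastembed?"], "query")
passages = add_prefix(
    [
        "fastembed is supported by and maintained by Qdrant.",
        "This is built to be faster and lighter than other embedding libraries.",
    ],
    "passage",
)
```

Both results are still plain lists of strings, so they can be passed to `embed` in the same way as `documents` earlier in this notebook.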
"## Beyond the default model\n",
"The default model is built for speed and efficiency. If you need a more accurate model, you can use the `TextEmbedding` class to load any model from our list of available models. You can find the list of available models using `TextEmbedding.list_supported_models()`."
"execution_count": 5,
"application/vnd.jupyter.widget-view+json": {
"model_id": "9470ec542f3c4400a42452c2489a1abc",
"Fetching 8 files: 0%| | 0/8 [00:00<?, ?it/s]"
"output_type": "display_data"
"multilingual_large_model = TextEmbedding(\"intfloat/multilingual-e5-large\") # This can take a few minutes to download"
"execution_count": 6,
"execution_count": 6,
"output_type": "execute_result"
" list(multilingual_large_model.embed([\"Hello, world!\", \"你好世界\", \"¡Hola Mundo!\", \"नमस्ते!\"]))\n",
").shape # Vector of 1024 dimensions"
"cell_type": "markdown",
"Next: Check out how to use FastEmbed with Qdrant for similarity search: [FastEmbed with Qdrant](https://qdrant.github.io/fastembed/examples/Usage_With_Qdrant/)"
"display_name": "Python 3 (ipykernel)",
"language": "python",
"file_extension": ".py",
"mimetype": "text/x-python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",