{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Get embeddings from dataset\n",
"\n",
"This notebook shows how to get embeddings for a large dataset.\n",
"\n",
"\n",
"## 1. Load the dataset\n",
"\n",
"The dataset used in this example is [fine-food reviews](https://www.kaggle.com/snap/amazon-fine-food-reviews) from Amazon. It contains a total of 568,454 food reviews that Amazon users left up to October 2012. For illustration purposes, we will use a subset consisting of the 1,000 most recent reviews. The reviews are in English and tend to be positive or negative. Each review has a ProductId, UserId, Score, review title (Summary) and review body (Text).\n",
"\n",
"We will combine the review summary and review text into a single combined text. The model will encode this combined text and output a single vector embedding."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"To run this notebook, you will need to install: pandas, openai, tiktoken, transformers, plotly, matplotlib, scikit-learn, torch (a transformers dependency), torchvision, and scipy."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import tiktoken\n",
"\n",
"from utils.embeddings_utils import get_embedding"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"embedding_model = \"text-embedding-3-small\"\n",
"embedding_encoding = \"cl100k_base\"\n",
"max_tokens = 8000  # the maximum for text-embedding-3-small is 8191"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Time</th>\n",
" <th>ProductId</th>\n",
" <th>UserId</th>\n",
" <th>Score</th>\n",
" <th>Summary</th>\n",
" <th>Text</th>\n",
" <th>combined</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1351123200</td>\n",
" <td>B003XPF9BO</td>\n",
" <td>A3R7JR3FMEBXQB</td>\n",
" <td>5</td>\n",
" <td>where does one start...and stop... with a tre...</td>\n",
" <td>Wanted to save some to bring to my Chicago fam...</td>\n",
" <td>Title: where does one start...and stop... wit...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1351123200</td>\n",
" <td>B003JK537S</td>\n",
" <td>A3JBPC3WFUT5ZP</td>\n",
" <td>1</td>\n",
" <td>Arrived in pieces</td>\n",
" <td>Not pleased at all. When I opened the box, mos...</td>\n",
" <td>Title: Arrived in pieces; Content: Not pleased...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Time ProductId UserId Score \\\n",
"0 1351123200 B003XPF9BO A3R7JR3FMEBXQB 5 \n",
"1 1351123200 B003JK537S A3JBPC3WFUT5ZP 1 \n",
"\n",
" Summary \\\n",
"0 where does one start...and stop... with a tre... \n",
"1 Arrived in pieces \n",
"\n",
" Text \\\n",
"0 Wanted to save some to bring to my Chicago fam... \n",
"1 Not pleased at all. When I opened the box, mos... \n",
"\n",
" combined \n",
"0 Title: where does one start...and stop... wit... \n",
"1 Title: Arrived in pieces; Content: Not pleased... "
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# load & inspect dataset\n",
"input_datapath = \"data/fine_food_reviews_1k.csv\"  # to save space, we provide a pre-filtered dataset\n",
"df = pd.read_csv(input_datapath, index_col=0)\n",
"df = df[[\"Time\", \"ProductId\", \"UserId\", \"Score\", \"Summary\", \"Text\"]]\n",
"df = df.dropna()\n",
"df[\"combined\"] = (\n",
"    \"Title: \" + df.Summary.str.strip() + \"; Content: \" + df.Text.str.strip()\n",
")\n",
"df.head(2)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1000"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# subsample to 1k most recent reviews and remove samples that are too long\n",
"top_n = 1000\n",
"df = df.sort_values(\"Time\").tail(top_n * 2)  # first cut to the 2k most recent entries, assuming fewer than half will be filtered out\n",
"df.drop(\"Time\", axis=1, inplace=True)\n",
"\n",
"encoding = tiktoken.get_encoding(embedding_encoding)\n",
"\n",
"# omit reviews that are too long to embed\n",
"df[\"n_tokens\"] = df.combined.apply(lambda x: len(encoding.encode(x)))\n",
"df = df[df.n_tokens <= max_tokens].tail(top_n)\n",
"len(df)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Get embeddings and save them for future reuse"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"# Ensure you have your API key set in your environment per the README: https://github.com/openai/openai-python#usage\n",
"\n",
"# This may take a few minutes\n",
"df[\"embedding\"] = df.combined.apply(lambda x: get_embedding(x, model=embedding_model))\n",
"df.to_csv(\"data/fine_food_reviews_with_embeddings_1k.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"# quick check: embed a single short string\n",
"a = get_embedding(\"hi\", model=embedding_model)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "openai",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}
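
The subsampling step in section 1 (keep the most recent reviews and drop any whose combined text exceeds the token limit) can be sketched without pandas or tiktoken. This is only an illustration, not the notebook's code: `count_tokens` is a hypothetical whitespace-based stand-in for `len(encoding.encode(x))`, `most_recent_within_limit` is a made-up helper name, and the sample reviews are invented.

```python
def count_tokens(text):
    # Hypothetical stand-in for len(encoding.encode(text));
    # whitespace splitting only approximates real token counts.
    return len(text.split())


def combine(summary, text):
    # Same format the notebook uses for its "combined" column.
    return "Title: " + summary.strip() + "; Content: " + text.strip()


def most_recent_within_limit(reviews, top_n, max_tokens):
    """Return the top_n most recent reviews whose combined text fits in max_tokens."""
    ordered = sorted(reviews, key=lambda r: r["Time"])  # oldest first
    kept = []
    for r in ordered:
        combined = combine(r["Summary"], r["Text"])
        if count_tokens(combined) <= max_tokens:
            kept.append({**r, "combined": combined})
    return kept[-top_n:]  # the tail holds the most recent entries


# Made-up sample data for illustration only.
sample_reviews = [
    {"Time": 3, "Summary": "Great", "Text": "Tasty and fresh."},
    {"Time": 1, "Summary": "Old", "Text": "Fine."},
    {"Time": 2, "Summary": "Too long", "Text": "word " * 50},
    {"Time": 4, "Summary": " Newest ", "Text": " Arrived quickly. "},
]

selected = most_recent_within_limit(sample_reviews, top_n=2, max_tokens=20)
print([r["combined"] for r in selected])
```

Unlike the notebook, which first cuts to `top_n * 2` rows and assumes fewer than half are too long, this sketch filters every row before taking the tail. That is simpler to reason about but would tokenize the whole dataset, which is the cost the notebook's pre-cut avoids.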
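
Section 2 saves the embeddings to a CSV file for reuse. A CSV field can only hold text, so a list of floats is written as its string form and has to be parsed again when the file is read back. The sketch below shows that round trip using only the standard library. It assumes made-up three-dimensional vectors and an in-memory buffer in place of the notebook's file.

```python
import ast
import csv
import io

# Made-up rows standing in for the notebook's DataFrame; real embeddings are much longer.
rows = [
    {"combined": "Title: Great; Content: Tasty.", "embedding": [0.1, -0.2, 0.3]},
    {"combined": "Title: Bad; Content: Stale.", "embedding": [0.0, 0.5, -0.5]},
]

# Write: the csv module stores each list as its str() form, e.g. "[0.1, -0.2, 0.3]".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["combined", "embedding"])
writer.writeheader()
writer.writerows(rows)

# Read: every field comes back as a string, so parse the embedding explicitly.
buffer.seek(0)
loaded = []
for record in csv.DictReader(buffer):
    record["embedding"] = ast.literal_eval(record["embedding"])
    loaded.append(record)

print(loaded[0]["embedding"])
```

With pandas the same idea usually takes the form of applying `ast.literal_eval` to the `embedding` column after `pd.read_csv`. Treat that as a common pattern, not as something this notebook itself shows.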