{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Get embeddings from dataset\n",
    "\n",
    "This notebook shows how to get embeddings from a large dataset.\n",
    "\n",
    "\n",
    "## 1. Load the dataset\n",
    "\n",
    "The dataset used in this example is [fine-food reviews](https://www.kaggle.com/snap/amazon-fine-food-reviews) from Amazon. The dataset contains a total of 568,454 food reviews that Amazon users left up to October 2012. For illustration, we will use a subset consisting of the 1,000 most recent reviews. The reviews are in English and tend to be positive or negative. Each review has a ProductId, UserId, Score, review title (Summary) and review body (Text).\n",
    "\n",
    "We will combine the review summary and review text into a single combined text. The model will encode this combined text and output a single vector embedding."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To run this notebook, you will need to install: pandas, openai, tiktoken, transformers, plotly, matplotlib, scikit-learn, torch (a transformers dependency), torchvision, and scipy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import tiktoken\n",
    "\n",
    "from utils.embeddings_utils import get_embedding"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "embedding_model = \"text-embedding-3-small\"\n",
    "embedding_encoding = \"cl100k_base\"\n",
    "max_tokens = 8000  # the maximum for text-embedding-3-small is 8191"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Time</th>\n",
       "      <th>ProductId</th>\n",
       "      <th>UserId</th>\n",
       "      <th>Score</th>\n",
       "      <th>Summary</th>\n",
       "      <th>Text</th>\n",
       "      <th>combined</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1351123200</td>\n",
       "      <td>B003XPF9BO</td>\n",
       "      <td>A3R7JR3FMEBXQB</td>\n",
       "      <td>5</td>\n",
       "      <td>where does one  start...and stop... with a tre...</td>\n",
       "      <td>Wanted to save some to bring to my Chicago fam...</td>\n",
       "      <td>Title: where does one  start...and stop... wit...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1351123200</td>\n",
       "      <td>B003JK537S</td>\n",
       "      <td>A3JBPC3WFUT5ZP</td>\n",
       "      <td>1</td>\n",
       "      <td>Arrived in pieces</td>\n",
       "      <td>Not pleased at all. When I opened the box, mos...</td>\n",
       "      <td>Title: Arrived in pieces; Content: Not pleased...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         Time   ProductId          UserId  Score  \\\n",
       "0  1351123200  B003XPF9BO  A3R7JR3FMEBXQB      5   \n",
       "1  1351123200  B003JK537S  A3JBPC3WFUT5ZP      1   \n",
       "\n",
       "                                             Summary  \\\n",
       "0  where does one  start...and stop... with a tre...   \n",
       "1                                  Arrived in pieces   \n",
       "\n",
       "                                                Text  \\\n",
       "0  Wanted to save some to bring to my Chicago fam...   \n",
       "1  Not pleased at all. When I opened the box, mos...   \n",
       "\n",
       "                                            combined  \n",
       "0  Title: where does one  start...and stop... wit...  \n",
       "1  Title: Arrived in pieces; Content: Not pleased...  "
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# load & inspect dataset\n",
    "input_datapath = \"data/fine_food_reviews_1k.csv\"  # to save space, we provide a pre-filtered dataset\n",
    "df = pd.read_csv(input_datapath, index_col=0)\n",
    "df = df[[\"Time\", \"ProductId\", \"UserId\", \"Score\", \"Summary\", \"Text\"]]\n",
    "df = df.dropna()\n",
    "df[\"combined\"] = (\n",
    "    \"Title: \" + df.Summary.str.strip() + \"; Content: \" + df.Text.str.strip()\n",
    ")\n",
    "df.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1000"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# subsample to 1k most recent reviews and remove samples that are too long\n",
    "top_n = 1000\n",
    "df = df.sort_values(\"Time\").tail(top_n * 2)  # first cut to the 2k most recent entries, assuming less than half will be filtered out\n",
    "df.drop(\"Time\", axis=1, inplace=True)\n",
    "\n",
    "encoding = tiktoken.get_encoding(embedding_encoding)\n",
    "\n",
    "# omit reviews that are too long to embed\n",
    "df[\"n_tokens\"] = df.combined.apply(lambda x: len(encoding.encode(x)))\n",
    "df = df[df.n_tokens <= max_tokens].tail(top_n)\n",
    "len(df)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Get embeddings and save them for future reuse"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Ensure you have your API key set in your environment per the README: https://github.com/openai/openai-python#usage\n",
    "\n",
    "# This may take a few minutes\n",
    "df[\"embedding\"] = df.combined.apply(lambda x: get_embedding(x, model=embedding_model))\n",
    "df.to_csv(\"data/fine_food_reviews_with_embeddings_1k.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "a = get_embedding(\"hi\", model=embedding_model)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "openai",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
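
A note on reusing the saved file: the notebook stores each embedding (a list of floats) in a CSV column, and a CSV cell can only hold text, so the list is expected to come back as a string such as `"[0.25, -0.5, 0.125]"` and needs parsing before use. The sketch below illustrates that round trip with only the standard library rather than pandas; the sample rows, the three-element vectors, and the helper names `save_rows` and `load_rows` are hypothetical and exist only for illustration.

```python
import ast
import csv
import io

# Hypothetical stand-in rows; real embeddings are far longer than three values.
rows = [
    {"combined": "Title: Arrived in pieces; Content: Not pleased at all.",
     "embedding": [0.25, -0.5, 0.125]},
    {"combined": "Title: Great snack; Content: Would buy again.",
     "embedding": [0.0, 1.0, -0.75]},
]


def save_rows(rows, handle):
    """Write rows as CSV; the csv module stringifies each list with str()."""
    writer = csv.DictWriter(handle, fieldnames=["combined", "embedding"])
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


def load_rows(handle):
    """Read rows back and parse each embedding string into a list of floats."""
    loaded = []
    for row in csv.DictReader(handle):
        # literal_eval only accepts Python literals, so it is safer than eval.
        row["embedding"] = ast.literal_eval(row["embedding"])
        loaded.append(row)
    return loaded


# Use an in-memory buffer instead of a file on disk to keep the sketch self-contained.
buffer = io.StringIO()
save_rows(rows, buffer)
buffer.seek(0)
restored = load_rows(buffer)
print(type(restored[0]["embedding"]).__name__, restored[0]["embedding"])
```

With pandas, the analogous step after `pd.read_csv` would presumably be applying `ast.literal_eval` to the `embedding` column; treat that as an assumption to verify against your own saved file.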