{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "93de5736",
   "metadata": {},
   "source": [
    "# Lesson 4: Sentence Embeddings"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c9363203",
   "metadata": {},
   "source": [
    "- In the classroom, the libraries are already installed for you.\n",
    "- To run this code on your own machine, install the following package:\n",
    "```\n",
    "    !pip install sentence-transformers\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "632f7aac-6786-4ea6-8fa3-25a6cebbd2e5",
   "metadata": {},
   "source": [
    "- The following code suppresses warning messages from the `transformers` library by limiting its log output to errors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "058015a6-19cf-4f80-940d-f4af86cb589c",
   "metadata": {
    "height": 47
   },
   "outputs": [],
   "source": [
    "from transformers.utils import logging\n",
    "logging.set_verbosity_error()"
   ]
  },
44
  {
45
   "cell_type": "markdown",
46
   "id": "c35a8e72",
47
   "metadata": {},
48
   "source": [
49
    "### Build the `sentence embedding` pipeline using 🤗 Transformers Library"
50
   ]
51
  },
52
  {
53
   "cell_type": "code",
54
   "execution_count": 2,
55
   "id": "2ed9cec8-803a-4d7e-99d9-c4c84682901c",
56
   "metadata": {
57
    "height": 30
58
   },
59
   "outputs": [],
60
   "source": [
61
    "from sentence_transformers import SentenceTransformer"
62
   ]
63
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "dd5dbb50-8c36-456c-ac0e-f724429c4b7f",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "model = SentenceTransformer(\"all-MiniLM-L6-v2\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cad701ed",
   "metadata": {},
   "source": [
    "More info on [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "fe7d23e0-5e68-4537-8dd8-eb125e1a6820",
   "metadata": {
    "height": 64
   },
   "outputs": [],
   "source": [
    "sentences1 = ['The cat sits outside',\n",
    "              'A man is playing guitar',\n",
    "              'The movies are awesome']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "c33db645-edd8-4a28-a06f-e0fd8200f27f",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "embeddings1 = model.encode(sentences1, convert_to_tensor=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "75de3f4a-bd8e-41d6-847b-9a3a043adeeb",
   "metadata": {
    "height": 30
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[ 0.1392,  0.0030,  0.0470,  ...,  0.0641, -0.0163,  0.0636],\n",
       "        [ 0.0227, -0.0014, -0.0056,  ..., -0.0225,  0.0846, -0.0283],\n",
       "        [-0.1043, -0.0628,  0.0093,  ...,  0.0020,  0.0653, -0.0150]])"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "embeddings1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "5136886d-80d4-4a3a-a68e-692c25496b51",
   "metadata": {
    "height": 64
   },
   "outputs": [],
   "source": [
    "sentences2 = ['The dog plays in the garden',\n",
    "              'A woman watches TV',\n",
    "              'The new movie is so great']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "c7e0c68f",
   "metadata": {
    "height": 47
   },
   "outputs": [],
   "source": [
    "embeddings2 = model.encode(sentences2,\n",
    "                           convert_to_tensor=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "8a213124-ea97-4706-bbf4-737490e94244",
   "metadata": {
    "height": 30
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([[ 0.0163, -0.0700,  0.0384,  ...,  0.0447,  0.0254, -0.0023],\n",
      "        [ 0.0054, -0.0920,  0.0140,  ...,  0.0167, -0.0086, -0.0424],\n",
      "        [-0.0842, -0.0592, -0.0010,  ..., -0.0157,  0.0764,  0.0389]])\n"
     ]
    }
   ],
   "source": [
    "print(embeddings2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "41c3d585",
   "metadata": {},
   "source": [
    "* Calculate the cosine similarity between the two sets of sentence embeddings as a measure of how similar the sentences are. `util.cos_sim` returns a matrix in which entry `[i][j]` compares `sentences1[i]` with `sentences2[j]`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "9e3b38a5-9b35-49de-9f85-c62583d6287d",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "from sentence_transformers import util"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "39c1f4f3-94ad-4b5e-a40d-c4ba8277815b",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "cosine_scores = util.cos_sim(embeddings1, embeddings2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "b6859d46-15a7-4f61-8a9f-06a15baeff40",
   "metadata": {
    "height": 30
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([[ 0.2838,  0.1310, -0.0029],\n",
      "        [ 0.2277, -0.0327, -0.0136],\n",
      "        [-0.0124, -0.0465,  0.6571]])\n"
     ]
    }
   ],
   "source": [
    "print(cosine_scores)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "fae8571e-2dea-4872-b244-342731b949de",
   "metadata": {
    "height": 94
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The cat sits outside \t\t The dog plays in the garden \t\t Score: 0.2838\n",
      "A man is playing guitar \t\t A woman watches TV \t\t Score: -0.0327\n",
      "The movies are awesome \t\t The new movie is so great \t\t Score: 0.6571\n"
     ]
    }
   ],
   "source": [
    "for i in range(len(sentences1)):\n",
    "    print(\"{} \\t\\t {} \\t\\t Score: {:.4f}\".format(sentences1[i],\n",
    "                                                 sentences2[i],\n",
    "                                                 cosine_scores[i][i]))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49863f4c",
   "metadata": {},
   "source": [
    "### Try it yourself!\n",
    "- Try this model with your own sentences!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
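
The notebook relies on `util.cos_sim` to score sentence pairs. To make that step less of a black box, here is a minimal sketch of pairwise cosine similarity in plain Python. It is not the library's implementation: the function name `pairwise_cos_sim` and the toy three-dimensional vectors are made up for illustration, standing in for the much longer embeddings the model produces.

```python
import math


def pairwise_cos_sim(a, b):
    """Return a matrix whose entry [i][j] is the cosine similarity
    between row i of `a` and row j of `b` (hypothetical helper)."""

    def norm(v):
        # Euclidean length of a vector.
        return math.sqrt(sum(x * x for x in v))

    scores = []
    for u in a:
        row = []
        for v in b:
            dot = sum(x * y for x, y in zip(u, v))
            row.append(dot / (norm(u) * norm(v)))
        scores.append(row)
    return scores


# Toy vectors (assumed values, not real model output).
emb1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
emb2 = [[2.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

scores = pairwise_cos_sim(emb1, emb2)

# As in the notebook, the diagonal holds the score for each aligned pair.
for i in range(len(emb1)):
    print("pair {}: {:.4f}".format(i, scores[i][i]))
```

Vectors pointing in the same direction score 1 regardless of length, and orthogonal vectors score 0, which is why the notebook's unrelated sentence pairs land near zero.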