1{
2"cells": [
3{
4"cell_type": "markdown",
5"id": "93de5736",
6"metadata": {},
7"source": [
8"# Lesson 4: Sentence Embeddings"
9]
10},
11{
12"cell_type": "markdown",
13"id": "c9363203",
14"metadata": {},
15"source": [
16"- In the classroom, the libraries are already installed for you.\n",
17"- To run this code on your own machine, install the following package:\n",
18"```\n",
19" !pip install sentence-transformers\n",
20"```"
21]
22},
23{
24"cell_type": "markdown",
25"id": "632f7aac-6786-4ea6-8fa3-25a6cebbd2e5",
26"metadata": {},
27"source": [
28"- The code below suppresses warning messages from the 🤗 Transformers library, so that only errors are shown."
29]
30},
31{
32"cell_type": "code",
33"execution_count": 1,
34"id": "058015a6-19cf-4f80-940d-f4af86cb589c",
35"metadata": {
36"height": 47
37},
38"outputs": [],
39"source": [
40"from transformers.utils import logging\n",
41"logging.set_verbosity_error()"
42]
43},
44{
45"cell_type": "markdown",
46"id": "c35a8e72",
47"metadata": {},
48"source": [
49"### Build the sentence embedding pipeline using the 🤗 Sentence Transformers library"
50]
51},
52{
53"cell_type": "code",
54"execution_count": 2,
55"id": "2ed9cec8-803a-4d7e-99d9-c4c84682901c",
56"metadata": {
57"height": 30
58},
59"outputs": [],
60"source": [
61"from sentence_transformers import SentenceTransformer"
62]
63},
64{
65"cell_type": "code",
66"execution_count": 3,
67"id": "dd5dbb50-8c36-456c-ac0e-f724429c4b7f",
68"metadata": {
69"height": 30
70},
71"outputs": [],
227"source": [
228"model = SentenceTransformer(\"all-MiniLM-L6-v2\")"
229]
230},
231{
232"cell_type": "markdown",
233"id": "cad701ed",
234"metadata": {},
235"source": [
236"More info on [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)."
237]
238},
239{
240"cell_type": "code",
241"execution_count": 4,
242"id": "fe7d23e0-5e68-4537-8dd8-eb125e1a6820",
243"metadata": {
244"height": 64
245},
246"outputs": [],
247"source": [
248"sentences1 = ['The cat sits outside',\n",
249" 'A man is playing guitar',\n",
250" 'The movies are awesome']"
251]
252},
253{
254"cell_type": "code",
255"execution_count": 5,
256"id": "c33db645-edd8-4a28-a06f-e0fd8200f27f",
257"metadata": {
258"height": 30
259},
260"outputs": [],
261"source": [
262"embeddings1 = model.encode(sentences1, convert_to_tensor=True)"
263]
264},
265{
266"cell_type": "code",
267"execution_count": 6,
268"id": "75de3f4a-bd8e-41d6-847b-9a3a043adeeb",
269"metadata": {
270"height": 30
271},
272"outputs": [
273{
274"data": {
275"text/plain": [
276"tensor([[ 0.1392, 0.0030, 0.0470, ..., 0.0641, -0.0163, 0.0636],\n",
277" [ 0.0227, -0.0014, -0.0056, ..., -0.0225, 0.0846, -0.0283],\n",
278" [-0.1043, -0.0628, 0.0093, ..., 0.0020, 0.0653, -0.0150]])"
279]
280},
281"execution_count": 6,
282"metadata": {},
283"output_type": "execute_result"
284}
285],
286"source": [
287"embeddings1"
288]
289},
290{
291"cell_type": "code",
292"execution_count": 7,
293"id": "5136886d-80d4-4a3a-a68e-692c25496b51",
294"metadata": {
295"height": 64
296},
297"outputs": [],
298"source": [
299"sentences2 = ['The dog plays in the garden',\n",
300" 'A woman watches TV',\n",
301" 'The new movie is so great']"
302]
303},
304{
305"cell_type": "code",
306"execution_count": 8,
307"id": "c7e0c68f",
308"metadata": {
309"height": 47
310},
311"outputs": [],
312"source": [
313"embeddings2 = model.encode(sentences2, \n",
314" convert_to_tensor=True)"
315]
316},
317{
318"cell_type": "code",
319"execution_count": 9,
320"id": "8a213124-ea97-4706-bbf4-737490e94244",
321"metadata": {
322"height": 30
323},
324"outputs": [
325{
326"name": "stdout",
327"output_type": "stream",
328"text": [
329"tensor([[ 0.0163, -0.0700, 0.0384, ..., 0.0447, 0.0254, -0.0023],\n",
330" [ 0.0054, -0.0920, 0.0140, ..., 0.0167, -0.0086, -0.0424],\n",
331" [-0.0842, -0.0592, -0.0010, ..., -0.0157, 0.0764, 0.0389]])\n"
332]
333}
334],
335"source": [
336"print(embeddings2)"
337]
338},
339{
340"cell_type": "markdown",
341"id": "41c3d585",
342"metadata": {},
343"source": [
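344"- Calculate the cosine similarity between the sentence embeddings as a measure of how similar the sentences are. `util.cos_sim` returns a matrix in which entry `[i][j]` compares `sentences1[i]` with `sentences2[j]`."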
345]
346},
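{
"cell_type": "markdown",
"id": "a1c0517e",
"metadata": {},
"source": [
"- As a minimal sketch of what cosine similarity measures, the cell below computes it by hand in plain Python: the dot product of two vectors divided by the product of their lengths. The helper `cosine_similarity` and the three toy vectors are made up for illustration and are not part of `sentence_transformers`; real sentence embeddings have many more dimensions."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2d0517f",
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"def cosine_similarity(a, b):\n",
"    # Hypothetical helper: dot product divided by the product of the vector lengths.\n",
"    # Assumes both vectors have the same size and neither is all zeros.\n",
"    dot = sum(x * y for x, y in zip(a, b))\n",
"    norm_a = math.sqrt(sum(x * x for x in a))\n",
"    norm_b = math.sqrt(sum(y * y for y in b))\n",
"    return dot / (norm_a * norm_b)\n",
"\n",
"# Toy vectors, chosen only for illustration\n",
"v1 = [1.0, 0.0, 1.0]\n",
"v2 = [1.0, 0.0, 0.0]\n",
"v3 = [0.0, 1.0, 0.0]\n",
"\n",
"print(cosine_similarity(v1, v2))  # vectors pointing in partly the same direction\n",
"print(cosine_similarity(v2, v3))  # perpendicular vectors\n",
"print(cosine_similarity(v1, v1))  # a vector compared with itself"
]
},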
347{
348"cell_type": "code",
349"execution_count": 10,
350"id": "9e3b38a5-9b35-49de-9f85-c62583d6287d",
351"metadata": {
352"height": 30
353},
354"outputs": [],
355"source": [
356"from sentence_transformers import util"
357]
358},
359{
360"cell_type": "code",
361"execution_count": 11,
362"id": "39c1f4f3-94ad-4b5e-a40d-c4ba8277815b",
363"metadata": {
364"height": 30
365},
366"outputs": [],
367"source": [
368"cosine_scores = util.cos_sim(embeddings1,embeddings2)"
369]
370},
371{
372"cell_type": "code",
373"execution_count": 12,
374"id": "b6859d46-15a7-4f61-8a9f-06a15baeff40",
375"metadata": {
376"height": 30
377},
378"outputs": [
379{
380"name": "stdout",
381"output_type": "stream",
382"text": [
383"tensor([[ 0.2838, 0.1310, -0.0029],\n",
384" [ 0.2277, -0.0327, -0.0136],\n",
385" [-0.0124, -0.0465, 0.6571]])\n"
386]
387}
388],
389"source": [
390"print(cosine_scores)"
391]
392},
393{
394"cell_type": "code",
395"execution_count": 13,
396"id": "fae8571e-2dea-4872-b244-342731b949de",
397"metadata": {
398"height": 94
399},
400"outputs": [
401{
402"name": "stdout",
403"output_type": "stream",
404"text": [
405"The cat sits outside \t\t The dog plays in the garden \t\t Score: 0.2838\n",
406"A man is playing guitar \t\t A woman watches TV \t\t Score: -0.0327\n",
407"The movies are awesome \t\t The new movie is so great \t\t Score: 0.6571\n"
408]
409}
410],
411"source": [
412"for i in range(len(sentences1)):\n",
413" print(\"{} \\t\\t {} \\t\\t Score: {:.4f}\".format(sentences1[i],\n",
414" sentences2[i],\n",
415" cosine_scores[i][i]))"
416]
417},
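{
"cell_type": "markdown",
"id": "c3e0517a",
"metadata": {},
"source": [
"- The loop above reads only the diagonal of the score matrix, so each sentence is compared with the sentence at the same position in the other list. As a sketch, the full matrix can also be used to find the closest sentence in `sentences2` for every sentence in `sentences1`. The scores below are copied from the printed output above so that the cell runs without the model; the name `score_rows` is an assumption for illustration. With the model loaded, `cosine_scores.tolist()` gives a nested list of the same shape."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4f0517b",
"metadata": {},
"outputs": [],
"source": [
"# Same lists as above, repeated so that this cell runs on its own\n",
"sentences1 = ['The cat sits outside',\n",
"              'A man is playing guitar',\n",
"              'The movies are awesome']\n",
"sentences2 = ['The dog plays in the garden',\n",
"              'A woman watches TV',\n",
"              'The new movie is so great']\n",
"\n",
"# Scores copied from the printed cosine_scores tensor above\n",
"score_rows = [[0.2838, 0.1310, -0.0029],\n",
"              [0.2277, -0.0327, -0.0136],\n",
"              [-0.0124, -0.0465, 0.6571]]\n",
"\n",
"for i, row in enumerate(score_rows):\n",
"    # Index of the highest score in this row\n",
"    best = max(range(len(row)), key=lambda k: row[k])\n",
"    print('{} \\t\\t {} \\t\\t Score: {:.4f}'.format(sentences1[i],\n",
"                                               sentences2[best],\n",
"                                               row[best]))"
]
},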
418{
419"cell_type": "markdown",
420"id": "49863f4c",
421"metadata": {},
422"source": [
423"### Try it yourself!\n",
424"- Replace the sentences above with your own and compare the similarity scores."
425]
426}
427],
428"metadata": {
429"kernelspec": {
430"display_name": "Python 3 (ipykernel)",
431"language": "python",
432"name": "python3"
433},
434"language_info": {
435"codemirror_mode": {
436"name": "ipython",
437"version": 3
438},
439"file_extension": ".py",
440"mimetype": "text/x-python",
441"name": "python",
442"nbconvert_exporter": "python",
443"pygments_lexer": "ipython3",
444"version": "3.9.18"
445}
446},
447"nbformat": 4,
448"nbformat_minor": 5
449}