{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "view-in-github"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/huggingface/datasets/blob/main/notebooks/Overview.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "**⚠️ This notebook is deprecated in favor of the [Quickstart notebook](https://github.com/huggingface/notebooks/blob/main/datasets_doc/en/quickstart.ipynb)**"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "zNp6kK7OvSUg",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "# HuggingFace 🤗 Datasets library - Quick overview\n",
        "\n",
        "Models come and go (linear models, LSTMs, Transformers, ...), but two core elements have consistently been the beating heart of Natural Language Processing: datasets and metrics.\n",
        "\n",
        "🤗 Datasets is a fast and efficient library for easily sharing and loading datasets, and it already provides access to the public datasets on the [Hugging Face Hub](https://huggingface.co/datasets).\n",
        "\n",
        "The library has several interesting features (besides easy access to datasets):\n",
        "\n",
        "- Built-in interoperability with PyTorch, TensorFlow 2, pandas and NumPy\n",
        "- Lightweight and fast library with a transparent and Pythonic API\n",
        "- Thrives on large datasets: it frees you from RAM limits, since all datasets are memory-mapped on drive by default\n",
        "- Smart caching with an intelligent `tf.data`-like cache: never wait for your data to be processed several times\n",
        "\n",
        "🤗 Datasets originated from a fork of the awesome TensorFlow Datasets, and the HuggingFace team wants to deeply thank the team behind this amazing library and user API. We have tried to keep a layer of compatibility with `tfds` and can provide conversion from one format to the other.\n",
        "\n",
        "To learn more about how to use metrics, take a look at the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library! In addition to metrics, it offers more tools for evaluating models and datasets."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dzk9aEtIvSUh",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "# Main datasets API\n",
        "\n",
        "This notebook is a quick dive into the main user API for loading datasets with `datasets`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "my95uHbLyjwR",
        "outputId": "8db75d45-02b9-46ed-efc2-f8ff764fe3d7",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
            "Requirement already satisfied: datasets in /usr/local/lib/python3.10/dist-packages (2.12.0)\n",
            "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from datasets) (1.22.4)\n",
            "Requirement already satisfied: pyarrow>=8.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (9.0.0)\n",
            "Requirement already satisfied: dill<0.3.7,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.3.6)\n",
            "Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets) (1.5.3)\n",
            "Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (2.27.1)\n",
            "Requirement already satisfied: tqdm>=4.62.1 in /usr/local/lib/python3.10/dist-packages (from datasets) (4.65.0)\n",
            "Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets) (3.2.0)\n",
            "Requirement already satisfied: multiprocess in /usr/local/lib/python3.10/dist-packages (from datasets) (0.70.14)\n",
            "Requirement already satisfied: fsspec[http]>=2021.11.1 in /usr/local/lib/python3.10/dist-packages (from datasets) (2023.4.0)\n",
            "Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets) (3.8.4)\n",
            "Requirement already satisfied: huggingface-hub<1.0.0,>=0.11.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.14.1)\n",
            "Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from datasets) (23.1)\n",
            "Requirement already satisfied: responses<0.19 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.18.0)\n",
            "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from datasets) (6.0)\n",
            "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (23.1.0)\n",
            "Requirement already satisfied: charset-normalizer<4.0,>=2.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (2.0.12)\n",
            "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (6.0.4)\n",
            "Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (4.0.2)\n",
            "Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.9.2)\n",
            "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.3.3)\n",
            "Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.3.1)\n",
            "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0.0,>=0.11.0->datasets) (3.12.0)\n",
            "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0.0,>=0.11.0->datasets) (4.5.0)\n",
            "Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (1.26.15)\n",
            "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (2022.12.7)\n",
            "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (3.4)\n",
            "Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2.8.2)\n",
            "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2022.7.1)\n",
            "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas->datasets) (1.16.0)\n"
          ]
        }
      ],
      "source": [
        "# install datasets\n",
        "!pip install datasets"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "PVjXLiYxvSUl",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [],
      "source": [
        "# Let's import the library. We typically need at most two functions:\n",
        "from datasets import list_datasets, load_dataset\n",
        "\n",
        "from pprint import pprint"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TNloBBx-vSUo",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "## Listing the currently available datasets"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "d3RJisGLvSUp",
        "outputId": "1ece3326-6977-48c8-ba37-6b1753f1d029",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "🤩 Currently 36662 datasets are available on the hub:\n",
            "['acronym_identification', 'ade_corpus_v2', 'adversarial_qa', 'aeslc',\n",
            " 'afrikaans_ner_corpus', 'ag_news', 'ai2_arc', 'air_dialogue',\n",
            " 'ajgt_twitter_ar', 'allegro_reviews', 'allocine', 'alt', 'amazon_polarity',\n",
            " 'amazon_reviews_multi', 'amazon_us_reviews', 'ambig_qa', 'americas_nli', 'ami',\n",
            " 'amttl', 'anli', 'app_reviews', 'aqua_rat', 'aquamuse', 'ar_cov19',\n",
            " 'ar_res_reviews', 'ar_sarcasm', 'arabic_billion_words', 'arabic_pos_dialect',\n",
            " 'arabic_speech_corpus', 'arcd', 'arsentd_lev', 'art', 'arxiv_dataset',\n",
            " 'ascent_kb', 'aslg_pc12', 'asnq', 'asset', 'assin', 'assin2', 'atomic',\n",
            " 'autshumato', 'facebook/babi_qa', 'banking77', 'bbaw_egyptian',\n",
            " 'bbc_hindi_nli', 'bc2gm_corpus', 'beans', 'best2009', 'bianet', 'bible_para',\n",
            " 'big_patent', 'billsum', 'bing_coronavirus_query_set', 'biomrc', 'biosses',\n",
            " 'blbooks', 'blbooksgenre', 'blended_skill_talk', 'blimp',\n",
            " 'blog_authorship_corpus', 'bn_hate_speech', 'bnl_newspapers', 'bookcorpus',\n",
            " 'bookcorpusopen', 'boolq', 'bprec', 'break_data', 'brwac', 'bsd_ja_en',\n",
            " 'bswac', 'c3', 'c4', 'cail2018', 'caner', 'capes', 'casino',\n",
            " 'catalonia_independence', 'cats_vs_dogs', 'cawac', 'cbt', 'cc100', 'cc_news',\n",
            " 'ccaligned_multilingual', 'cdsc', 'cdt', 'cedr', 'cfq', 'chr_en', 'cifar10',\n",
            " 'cifar100', 'circa', 'civil_comments', 'clickbait_news_bg', 'climate_fever',\n",
            " 'clinc_oos', 'clue', 'cmrc2018', 'cmu_hinglish_dog', 'cnn_dailymail',\n",
            " 'coached_conv_pref', '36562 more...']\n"
          ]
        }
      ],
      "source": [
        "# Currently available datasets\n",
        "datasets = list_datasets()\n",
        "\n",
        "print(f\"🤩 Currently {len(datasets)} datasets are available on the hub:\")\n",
        "pprint(datasets[:100] + [f\"{len(datasets) - 100} more...\"], compact=True)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "7T5AG3BxvSUr",
        "outputId": "72b52fbd-2344-4802-f040-83d640cbf899",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'_id': '621ffdd236468d709f181f95',\n",
            " 'author': None,\n",
            " 'cardData': {'annotations_creators': ['crowdsourced'],\n",
            "              'dataset_info': {'config_name': 'plain_text',\n",
            "                               'dataset_size': 89789763,\n",
            "                               'download_size': 35142551,\n",
            "                               'features': [{'dtype': 'string', 'name': 'id'},\n",
            "                                            {'dtype': 'string',\n",
            "                                             'name': 'title'},\n",
            "                                            {'dtype': 'string',\n",
            "                                             'name': 'context'},\n",
            "                                            {'dtype': 'string',\n",
            "                                             'name': 'question'},\n",
            "                                            {'name': 'answers',\n",
            "                                             'sequence': [{'dtype': 'string',\n",
            "                                                           'name': 'text'},\n",
            "                                                          {'dtype': 'int32',\n",
            "                                                           'name': 'answer_start'}]}],\n",
            "                               'splits': [{'name': 'train',\n",
            "                                           'num_bytes': 79317110,\n",
            "                                           'num_examples': 87599},\n",
            "                                          {'name': 'validation',\n",
            "                                           'num_bytes': 10472653,\n",
            "                                           'num_examples': 10570}]},\n",
            "              'language': ['en'],\n",
            "              'language_creators': ['crowdsourced', 'found'],\n",
            "              'license': ['cc-by-4.0'],\n",
            "              'multilinguality': ['monolingual'],\n",
            "              'paperswithcode_id': 'squad',\n",
            "              'pretty_name': 'SQuAD',\n",
            "              'size_categories': ['10K<n<100K'],\n",
            "              'source_datasets': ['extended|wikipedia'],\n",
            "              'task_categories': ['question-answering'],\n",
            "              'task_ids': ['extractive-qa'],\n",
            "              'train-eval-index': [{'col_mapping': {'answers': {'answer_start': 'answer_start',\n",
            "                                                                'text': 'text'},\n",
            "                                                    'context': 'context',\n",
            "                                                    'question': 'question'},\n",
            "                                    'config': 'plain_text',\n",
            "                                    'metrics': [{'name': 'SQuAD',\n",
            "                                                 'type': 'squad'}],\n",
            "                                    'splits': {'eval_split': 'validation',\n",
            "                                               'train_split': 'train'},\n",
            "                                    'task': 'question-answering',\n",
            "                                    'task_id': 'extractive_question_answering'}]},\n",
            " 'citation': '@article{2016arXiv160605250R,\\n'\n",
            "             '       author = {{Rajpurkar}, Pranav and {Zhang}, Jian and '\n",
            "             '{Lopyrev},\\n'\n",
            "             '                 Konstantin and {Liang}, Percy},\\n'\n",
            "             '        title = \"{SQuAD: 100,000+ Questions for Machine '\n",
            "             'Comprehension of Text}\",\\n'\n",
            "             '      journal = {arXiv e-prints},\\n'\n",
            "             '         year = 2016,\\n'\n",
            "             '          eid = {arXiv:1606.05250},\\n'\n",
            "             '        pages = {arXiv:1606.05250},\\n'\n",
            "             'archivePrefix = {arXiv},\\n'\n",
            "             '       eprint = {1606.05250},\\n'\n",
            "             '}',\n",
            " 'description': 'Stanford Question Answering Dataset (SQuAD) is a reading '\n",
            "                'comprehension dataset, consisting of questions posed by '\n",
            "                'crowdworkers on a set of Wikipedia articles, where the answer '\n",
            "                'to every question is a segment of text, or span, from the '\n",
            "                'corresponding reading passage, or the question might be '\n",
            "                'unanswerable.',\n",
            " 'disabled': False,\n",
            " 'downloads': 190504,\n",
            " 'gated': False,\n",
            " 'id': 'squad',\n",
            " 'lastModified': '2023-04-05T13:40:31.000Z',\n",
            " 'likes': 88,\n",
            " 'paperswithcode_id': 'squad',\n",
            " 'private': False,\n",
            " 'sha': '5fe18c4c680f9922d794e3f4dd673a751c74ee37',\n",
            " 'siblings': [],\n",
            " 'tags': ['task_categories:question-answering',\n",
            "          'task_ids:extractive-qa',\n",
            "          'annotations_creators:crowdsourced',\n",
            "          'language_creators:crowdsourced',\n",
            "          'language_creators:found',\n",
            "          'multilinguality:monolingual',\n",
            "          'size_categories:10K<n<100K',\n",
            "          'source_datasets:extended|wikipedia',\n",
            "          'language:en',\n",
            "          'license:cc-by-4.0',\n",
            "          'arxiv:1606.05250']}\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_deprecation.py:229: FutureWarning: 'list_datasets' currently returns a list of objects but is planned to be a generator starting from version 0.14 in order to implement pagination. Please avoid to use `list_datasets(...).__getitem__` or explicitly convert the output to a list first with `list(iter(list_datasets)(...))`.\n",
            "  warnings.warn(self._deprecation_msg.format(attr_name=attr_name), FutureWarning)\n"
          ]
        }
      ],
      "source": [
        "# You can access various attributes of the datasets before downloading them\n",
        "squad_dataset = list_datasets(with_details=True)[datasets.index('squad')]\n",
        "\n",
        "pprint(squad_dataset.__dict__)  # It's a simple python dataclass"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9uqSkkSovSUt",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "## An example with SQuAD"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "aOXl6afcvSUu",
        "outputId": "b2e3d2cc-44b2-40c2-f566-8181fb4b8316",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "WARNING:datasets.builder:Found cached dataset squad (/root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)\n"
          ]
        }
      ],
      "source": [
        "# Downloading and loading a dataset\n",
        "dataset = load_dataset('squad', split='validation[:10%]')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "rQ0G-eK3vSUw",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "This call to `datasets.load_dataset()` performs the following steps under the hood:\n",
        "\n",
        "1. Downloads and imports into the library the **SQuAD Python processing script** from the HuggingFace AWS bucket, if it is not already stored in the library. You can find the SQuAD processing script [here](https://github.com/huggingface/datasets/tree/master/datasets/squad/squad.py), for instance.\n",
        "\n",
        "   Processing scripts are small Python scripts that define the info (citation, description) and format of the dataset, and contain the URL of the original SQuAD JSON files and the code to load examples from them.\n",
        "\n",
        "\n",
        "2. Runs the SQuAD Python processing script, which will:\n",
        "    - **Download the SQuAD dataset** from the original URL (see the script) if it is not already downloaded and cached.\n",
        "    - **Process and cache** all of SQuAD in a structured Arrow table for each standard split, stored on the drive.\n",
        "\n",
        "      Arrow tables are arbitrarily long tables, typed with types that can be mapped to NumPy/pandas/Python standard types, and they can store nested objects. They can be accessed directly from the drive, loaded into RAM or even streamed over the web.\n",
        "\n",
        "\n",
        "3. Returns a **dataset built from the splits** requested by the user (default: all); in the above example we create a dataset with the first 10% of the validation split."
      ]
    },
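    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The `split` string accepts other slices as well. The next cell is a minimal sketch rather than part of the original walkthrough: it assumes the same `squad` dataset is reachable, that the split-slicing syntax used above (`validation[:10%]`) also accepts absolute indices and `+` concatenation in your installed version, and the variable names are purely illustrative."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Sketch only: variable names are illustrative, not part of the original notebook\n",
        "from datasets import load_dataset\n",
        "\n",
        "# Absolute slice: the first 100 training examples\n",
        "train_head = load_dataset('squad', split='train[:100]')\n",
        "\n",
        "# Concatenation of slices taken from two different splits\n",
        "mixed = load_dataset('squad', split='train[:1%]+validation[:1%]')\n",
        "\n",
        "# No split argument: every available split is returned together\n",
        "all_splits = load_dataset('squad')\n",
        "\n",
        "print(train_head)\n",
        "print(mixed)\n",
        "print(all_splits)"
      ]
    },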
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "fercoFwLvSUx",
        "outputId": "4674bcb6-edfa-4163-845e-fde18083d1a2",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'builder_name': 'squad',\n",
            " 'citation': '@article{2016arXiv160605250R,\\n'\n",
            "             '       author = {{Rajpurkar}, Pranav and {Zhang}, Jian and '\n",
            "             '{Lopyrev},\\n'\n",
            "             '                 Konstantin and {Liang}, Percy},\\n'\n",
            "             '        title = \"{SQuAD: 100,000+ Questions for Machine '\n",
            "             'Comprehension of Text}\",\\n'\n",
            "             '      journal = {arXiv e-prints},\\n'\n",
            "             '         year = 2016,\\n'\n",
            "             '          eid = {arXiv:1606.05250},\\n'\n",
            "             '        pages = {arXiv:1606.05250},\\n'\n",
            "             'archivePrefix = {arXiv},\\n'\n",
            "             '       eprint = {1606.05250},\\n'\n",
            "             '}\\n',\n",
            " 'config_name': 'plain_text',\n",
            " 'dataset_size': 89819092,\n",
            " 'description': 'Stanford Question Answering Dataset (SQuAD) is a reading '\n",
            "                'comprehension dataset, consisting of questions posed by '\n",
            "                'crowdworkers on a set of Wikipedia articles, where the answer '\n",
            "                'to every question is a segment of text, or span, from the '\n",
            "                'corresponding reading passage, or the question might be '\n",
            "                'unanswerable.\\n',\n",
            " 'download_checksums': {'https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json': {'checksum': None,\n",
            "                                                                                             'num_bytes': 4854279},\n",
            "                        'https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json': {'checksum': None,\n",
            "                                                                                               'num_bytes': 30288272}},\n",
            " 'download_size': 35142551,\n",
            " 'features': {'answers': Sequence(feature={'answer_start': Value(dtype='int32',\n",
            "                                                                 id=None),\n",
            "                                           'text': Value(dtype='string',\n",
            "                                                         id=None)},\n",
            "                                  length=-1,\n",
            "                                  id=None),\n",
            "              'context': Value(dtype='string', id=None),\n",
            "              'id': Value(dtype='string', id=None),\n",
            "              'question': Value(dtype='string', id=None),\n",
            "              'title': Value(dtype='string', id=None)},\n",
            " 'homepage': 'https://rajpurkar.github.io/SQuAD-explorer/',\n",
            " 'license': '',\n",
            " 'post_processed': None,\n",
            " 'post_processing_size': None,\n",
            " 'size_in_bytes': 124961643,\n",
            " 'splits': {'train': SplitInfo(name='train',\n",
            "                               num_bytes=79346108,\n",
            "                               num_examples=87599,\n",
            "                               shard_lengths=None,\n",
            "                               dataset_name='squad'),\n",
            "            'validation': SplitInfo(name='validation',\n",
            "                                    num_bytes=10472984,\n",
            "                                    num_examples=10570,\n",
            "                                    shard_lengths=None,\n",
            "                                    dataset_name='squad')},\n",
            " 'supervised_keys': None,\n",
            " 'task_templates': [QuestionAnsweringExtractive(task='question-answering-extractive',\n",
            "                                                question_column='question',\n",
            "                                                context_column='context',\n",
            "                                                answers_column='answers')],\n",
            " 'version': 1.0.0}\n"
          ]
        }
      ],
      "source": [
        "# Information about the dataset (description, citation, size, splits, format...)\n",
        "# is provided in `dataset.info` (a simple Python dataclass) and also as direct attributes of the dataset object\n",
        "pprint(dataset.info.__dict__)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GE0E87zsvSUz",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "## Inspecting and using the dataset: elements, slices and columns"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "DKf4YFnevSU0",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "The returned `Dataset` object is a memory-mapped dataset that behaves similarly to a normal map-style dataset. It is backed by an Apache Arrow table, which enables many interesting features."
      ]
    },
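    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To make the memory-mapping claim more concrete, the next cell is a small sketch that looks at where the data lives without reading it all into RAM. It assumes the `dataset` object loaded above and that the `cache_files`, `num_rows`, `column_names` and `features` attributes are available in your installed version of the library."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Sketch only: assumes `dataset` was created by the load_dataset() call above\n",
        "\n",
        "# Arrow file(s) on the drive that back this dataset\n",
        "pprint(dataset.cache_files)\n",
        "\n",
        "# Basic shape information, read from the table metadata\n",
        "print(dataset.num_rows, dataset.column_names)\n",
        "\n",
        "# Typed schema of a nested column\n",
        "pprint(dataset.features['answers'])"
      ]
    },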
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "tP1xPqSyvSU0",
        "outputId": "74de9886-cf30-4f11-f6c2-5037571f49f6",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Dataset({\n",
            "    features: ['id', 'title', 'context', 'question', 'answers'],\n",
            "    num_rows: 1057\n",
            "})\n"
          ]
        }
      ],
      "source": [
        "print(dataset)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aiO3rC8yvSU2",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "You can query its length and get items or slices as you normally would with a Python sequence."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "xxLcdj2yvSU3",
        "outputId": "1c5c069c-962c-4d7d-fae0-8750b929fe31",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "👉 Dataset len(dataset): 1057\n",
            "\n",
            "👉 First item 'dataset[0]':\n",
            "{'answers': {'answer_start': [177, 177, 177],\n",
            "             'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos']},\n",
            " 'context': 'Super Bowl 50 was an American football game to determine the '\n",
            "            'champion of the National Football League (NFL) for the 2015 '\n",
            "            'season. The American Football Conference (AFC) champion Denver '\n",
            "            'Broncos defeated the National Football Conference (NFC) champion '\n",
            "            'Carolina Panthers 24–10 to earn their third Super Bowl title. The '\n",
            "            \"game was played on February 7, 2016, at Levi's Stadium in the San \"\n",
            "            'Francisco Bay Area at Santa Clara, California. As this was the '\n",
            "            '50th Super Bowl, the league emphasized the \"golden anniversary\" '\n",
            "            'with various gold-themed initiatives, as well as temporarily '\n",
            "            'suspending the tradition of naming each Super Bowl game with '\n",
            "            'Roman numerals (under which the game would have been known as '\n",
            "            '\"Super Bowl L\"), so that the logo could prominently feature the '\n",
            "            'Arabic numerals 50.',\n",
            " 'id': '56be4db0acb8001400a502ec',\n",
            " 'question': 'Which NFL team represented the AFC at Super Bowl 50?',\n",
            " 'title': 'Super_Bowl_50'}\n"
          ]
        }
      ],
      "source": [
        "print(f\"👉 Dataset len(dataset): {len(dataset)}\")\n",
        "print(\"\\n👉 First item 'dataset[0]':\")\n",
        "pprint(dataset[0])"
      ]
    },
580
    {
581
      "cell_type": "code",
582
      "execution_count": null,
583
      "metadata": {
584
        "colab": {
585
          "base_uri": "https://localhost:8080/"
586
        },
587
        "id": "zk1WQ_cczP5w",
588
        "outputId": "b5cb7002-1dc2-4ff8-8dfa-78019dec74ca",
589
        "pycharm": {
590
          "name": "#%%\n"
591
        }
592
      },
593
      "outputs": [
594
        {
595
          "name": "stdout",
596
          "output_type": "stream",
597
          "text": [
598
            "\n",
599
            "👉Slice of the two items 'dataset[10:12]':\n",
600
            "{'answers': [{'answer_start': [334, 334, 334],\n",
601
            "              'text': ['February 7, 2016', 'February 7', 'February 7, 2016']},\n",
602
            "             {'answer_start': [177, 177, 177],\n",
603
            "              'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos']}],\n",
604
            " 'context': ['Super Bowl 50 was an American football game to determine the '\n",
605
            "             'champion of the National Football League (NFL) for the 2015 '\n",
606
            "             'season. The American Football Conference (AFC) champion Denver '\n",
607
            "             'Broncos defeated the National Football Conference (NFC) champion '\n",
608
            "             'Carolina Panthers 24–10 to earn their third Super Bowl title. '\n",
609
            "             \"The game was played on February 7, 2016, at Levi's Stadium in \"\n",
610
            "             'the San Francisco Bay Area at Santa Clara, California. As this '\n",
611
            "             'was the 50th Super Bowl, the league emphasized the \"golden '\n",
612
            "             'anniversary\" with various gold-themed initiatives, as well as '\n",
613
            "             'temporarily suspending the tradition of naming each Super Bowl '\n",
614
            "             'game with Roman numerals (under which the game would have been '\n",
615
            "             'known as \"Super Bowl L\"), so that the logo could prominently '\n",
616
            "             'feature the Arabic numerals 50.',\n",
617
            "             'Super Bowl 50 was an American football game to determine the '\n",
618
            "             'champion of the National Football League (NFL) for the 2015 '\n",
619
            "             'season. The American Football Conference (AFC) champion Denver '\n",
620
            "             'Broncos defeated the National Football Conference (NFC) champion '\n",
621
            "             'Carolina Panthers 24–10 to earn their third Super Bowl title. '\n",
622
            "             \"The game was played on February 7, 2016, at Levi's Stadium in \"\n",
623
            "             'the San Francisco Bay Area at Santa Clara, California. As this '\n",
624
            "             'was the 50th Super Bowl, the league emphasized the \"golden '\n",
625
            "             'anniversary\" with various gold-themed initiatives, as well as '\n",
626
            "             'temporarily suspending the tradition of naming each Super Bowl '\n",
627
            "             'game with Roman numerals (under which the game would have been '\n",
628
            "             'known as \"Super Bowl L\"), so that the logo could prominently '\n",
629
            "             'feature the Arabic numerals 50.'],\n",
630
            " 'id': ['56bea9923aeaaa14008c91bb', '56beace93aeaaa14008c91df'],\n",
631
            " 'question': ['What day was the Super Bowl played on?',\n",
632
            "              'Who won Super Bowl 50?'],\n",
633
            " 'title': ['Super_Bowl_50', 'Super_Bowl_50']}\n"
634
          ]
635
        }
636
      ],
637
      "source": [
638
        "# Or get slices with several examples:\n",
639
        "print(\"\\n👉Slice of the two items 'dataset[10:12]':\")\n",
640
        "pprint(dataset[10:12])"
641
      ]
642
    },
643
    {
644
      "cell_type": "code",
645
      "execution_count": null,
646
      "metadata": {
647
        "colab": {
648
          "base_uri": "https://localhost:8080/"
649
        },
650
        "id": "QXj2Qr5KvSU5",
651
        "outputId": "3b620962-ad0a-4b9e-c291-dc4560431d18",
652
        "pycharm": {
653
          "name": "#%%\n"
654
        }
655
      },
656
      "outputs": [
657
        {
658
          "name": "stdout",
659
          "output_type": "stream",
660
          "text": [
661
            "['Which NFL team represented the AFC at Super Bowl 50?', 'Which NFL team represented the NFC at Super Bowl 50?', 'Where did Super Bowl 50 take place?', 'Which NFL team won Super Bowl 50?', 'What color was used to emphasize the 50th anniversary of the Super Bowl?', 'What was the theme of Super Bowl 50?', 'What day was the game played on?', 'What is the AFC short for?', 'What was the theme of Super Bowl 50?', 'What does AFC stand for?']\n"
662
          ]
663
        }
664
      ],
665
      "source": [
666
        "# You can get a full column of the dataset by indexing with its name as a string:\n",
667
        "print(dataset['question'][:10])"
668
      ]
669
    },
670
    {
671
      "cell_type": "markdown",
672
      "metadata": {
673
        "id": "6Au7rqPMvSU7",
674
        "pycharm": {
675
          "name": "#%% md\n"
676
        }
677
      },
678
      "source": [
        "The `__getitem__` method returns a different format depending on the type of query:\n",
        "\n",
        "- Items like `dataset[0]` are returned as a dict of elements.\n",
        "- Slices like `dataset[10:20]` are returned as a dict of lists of elements.\n",
        "- Columns like `dataset['question']` are returned as a list of elements.\n",
        "\n",
        "This may seem surprising at first, but in our experience it is actually a lot easier to use for data processing than returning the same format for each of these views on the dataset."
686
      ]
687
    },
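    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a minimal sketch of the three return formats listed above, the next cell builds a small in-memory dataset with `Dataset.from_dict` and indexes it by item, by slice and by column. The `toy` dataset and its values are made up for illustration and are not part of SQuAD. Depending on the installed version of 🤗 Datasets, column access may return a list-like object rather than a plain `list`, so the sketch converts it with `list()`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from datasets import Dataset\n",
        "\n",
        "# Hypothetical toy data, only used to illustrate the three access patterns\n",
        "toy = Dataset.from_dict({'question': ['q0', 'q1', 'q2'], 'title': ['a', 'a', 'b']})\n",
        "\n",
        "row = toy[0]                    # a single item: dict of elements\n",
        "batch = toy[0:2]                # a slice: dict of lists of elements\n",
        "column = list(toy['question'])  # a column: sequence of elements\n",
        "\n",
        "print(row)\n",
        "print(batch)\n",
        "print(column)"
      ]
    },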
688
    {
689
      "cell_type": "markdown",
690
      "metadata": {
691
        "id": "6DB_y79cvSU8",
692
        "pycharm": {
693
          "name": "#%% md\n"
694
        }
695
      },
696
      "source": [
        "In particular, you can easily iterate over columns in slices, and consecutive indexing operations commute: indexing by column and then by element or slice gives the same result as indexing in the reverse order, as shown here:"
698
      ]
699
    },
700
    {
701
      "cell_type": "code",
702
      "execution_count": null,
703
      "metadata": {
704
        "colab": {
705
          "base_uri": "https://localhost:8080/"
706
        },
707
        "id": "wjGocqArvSU9",
708
        "outputId": "27cd92aa-e4ef-4c61-c47d-fdc189fa19b7",
709
        "pycharm": {
710
          "name": "#%%\n"
711
        }
712
      },
713
      "outputs": [
714
        {
715
          "name": "stdout",
716
          "output_type": "stream",
717
          "text": [
718
            "True\n",
719
            "True\n"
720
          ]
721
        }
722
      ],
723
      "source": [
724
        "print(dataset[0]['question'] == dataset['question'][0])\n",
725
        "print(dataset[10:20]['context'] == dataset['context'][10:20])"
726
      ]
727
    },
728
    {
729
      "cell_type": "markdown",
730
      "metadata": {
731
        "id": "b1-Kj1xQvSU_",
732
        "pycharm": {
733
          "name": "#%% md\n"
734
        }
735
      },
736
      "source": [
        "### Datasets are internally typed and structured\n",
        "\n",
        "The dataset is backed by one (or several) Apache Arrow tables, which are typed and allow for fast retrieval and access as well as arbitrary-size memory mapping.\n",
        "\n",
        "This means, respectively, that the format of the dataset is clearly defined and that you can load datasets of arbitrary size without worrying about RAM limitations (the dataset takes almost no space in RAM; it is read directly from drive when needed, with fast IO access)."
742
      ]
743
    },
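    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To make the typing concrete, here is a hedged sketch that declares a schema explicitly with `Features` and `Value` when building a dataset from Python objects. The `toy` dataset, its column names and the chosen dtypes are assumptions made for illustration only."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from datasets import Dataset, Features, Value\n",
        "\n",
        "# Hypothetical schema: one string column and one 32-bit integer column\n",
        "features = Features({'title': Value('string'), 'answer_start': Value('int32')})\n",
        "\n",
        "toy = Dataset.from_dict(\n",
        "    {'title': ['Super_Bowl_50', 'Warsaw'], 'answer_start': [177, 334]},\n",
        "    features=features,\n",
        ")\n",
        "\n",
        "# The declared types are stored with the dataset\n",
        "print(toy.features)\n",
        "print(toy.features['answer_start'].dtype)"
      ]
    },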
744
    {
745
      "cell_type": "code",
746
      "execution_count": null,
747
      "metadata": {
748
        "colab": {
749
          "base_uri": "https://localhost:8080/"
750
        },
751
        "id": "rAnp_RyPvSVA",
752
        "outputId": "9d1c7fe6-796e-48f5-d8e9-1a1f8a0fbdc1",
753
        "pycharm": {
754
          "name": "#%%\n"
755
        }
756
      },
757
      "outputs": [
758
        {
759
          "name": "stdout",
760
          "output_type": "stream",
761
          "text": [
762
            "Column names:\n",
763
            "['id', 'title', 'context', 'question', 'answers']\n",
764
            "Features:\n",
765
            "{'answers': Sequence(feature={'answer_start': Value(dtype='int32', id=None),\n",
766
            "                              'text': Value(dtype='string', id=None)},\n",
767
            "                     length=-1,\n",
768
            "                     id=None),\n",
769
            " 'context': Value(dtype='string', id=None),\n",
770
            " 'id': Value(dtype='string', id=None),\n",
771
            " 'question': Value(dtype='string', id=None),\n",
772
            " 'title': Value(dtype='string', id=None)}\n"
773
          ]
774
        }
775
      ],
776
      "source": [
777
        "# You can inspect the dataset column names and types\n",
778
        "print(\"Column names:\")\n",
779
        "pprint(dataset.column_names)\n",
780
        "print(\"Features:\")\n",
781
        "pprint(dataset.features)"
782
      ]
783
    },
784
    {
785
      "cell_type": "markdown",
786
      "metadata": {
787
        "id": "au4v3mOQvSVC",
788
        "pycharm": {
789
          "name": "#%% md\n"
790
        }
791
      },
792
      "source": [
793
        "### Additional misc properties"
794
      ]
795
    },
796
    {
797
      "cell_type": "code",
798
      "execution_count": null,
799
      "metadata": {
800
        "colab": {
801
          "base_uri": "https://localhost:8080/"
802
        },
803
        "id": "efFhDWhlvSVC",
804
        "outputId": "7f8f7be9-e3ec-4825-83c8-9f228d32e5c8",
805
        "pycharm": {
806
          "name": "#%%\n"
807
        }
808
      },
809
      "outputs": [
810
        {
811
          "name": "stdout",
812
          "output_type": "stream",
813
          "text": [
814
            "The number of rows 1057 also available as len(dataset) 1057\n",
815
            "The number of columns 5\n",
816
            "The shape (rows, columns) (1057, 5)\n"
817
          ]
818
        }
819
      ],
820
      "source": [
        "# Datasets also expose shape information\n",
822
        "print(\"The number of rows\", dataset.num_rows, \"also available as len(dataset)\", len(dataset))\n",
823
        "print(\"The number of columns\", dataset.num_columns)\n",
824
        "print(\"The shape (rows, columns)\", dataset.shape)"
825
      ]
826
    },
827
    {
828
      "cell_type": "markdown",
829
      "metadata": {
830
        "id": "1Ox7ppKDvSVN",
831
        "pycharm": {
832
          "name": "#%% md\n"
833
        }
834
      },
835
      "source": [
        "## Modifying the dataset with `dataset.map`\n",
        "\n",
        "Now that we know how to inspect our dataset, we also want to update it. For that there is a powerful method, `.map()`, which is inspired by the `tf.data` map method and which you can use to apply a function to each example, independently or in batches.\n",
        "\n",
        "`.map()` takes a callable accepting a dict as argument (the same dict as the one returned by `dataset[i]`) and iterates over the dataset, calling the function on each example."
841
      ]
842
    },
843
    {
844
      "cell_type": "code",
845
      "execution_count": null,
846
      "metadata": {
847
        "colab": {
848
          "base_uri": "https://localhost:8080/",
849
          "height": 106,
850
          "referenced_widgets": [
851
            "7edfe69de64a4af18febff677b57ab65",
852
            "dc5418db9c3e49cd95b3f85f0dc562ab",
853
            "8db902b229e545649282c130c2a049b8",
854
            "0b3581ddec0b4cabb33593e272a50249",
855
            "d580bdf43d1e44b8afcfefc962410d73",
856
            "cbcbb3853ed544f8b946aab31eaa7f56",
857
            "e21ced63bda64379832735d5aa2e0178",
858
            "757dd94ac5e04ff09ee6fab419f1692d",
859
            "d46a381be01a460cb49cc838c5aa29c0",
860
            "b3d5c33915084f26b060e086138bf898",
861
            "429bdd21215f4ef38687daa6def128f8"
862
          ]
863
        },
864
        "id": "Yz2-27HevSVN",
865
        "outputId": "0159107b-edb2-4b5a-b9a7-d6999bb5245c",
866
        "pycharm": {
867
          "name": "#%%\n"
868
        }
869
      },
870
      "outputs": [
871
        {
872
          "data": {
873
            "application/vnd.jupyter.widget-view+json": {
874
              "model_id": "7edfe69de64a4af18febff677b57ab65",
875
              "version_major": 2,
876
              "version_minor": 0
877
            },
878
            "text/plain": [
879
              "Map:   0%|          | 0/1057 [00:00<?, ? examples/s]"
880
            ]
881
          },
882
          "metadata": {},
883
          "output_type": "display_data"
884
        },
885
        {
886
          "name": "stdout",
887
          "output_type": "stream",
888
          "text": [
889
            "775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,179,179,179,179,179,179,179,179,179,179,179,168,168,168,168,168,168,168,168,168,168,168,168,168,168,168,168,168,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,704,704,704,704,704,704,704,704,704,704,704,704,704,704,704,704,704,704,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1166,1166,1166,1166,1166,1166,1166,1166,1166,1166,1166,1166,1166,1166,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,929,929,929,929,929,929,929,929,929,929,929,929,929,929,929,929,929,929,929,704,704,704,704,704,704,704,704,704,704,704,704,704,704,353,353,353,353,353,353,353,353,353,353,353,353,353,353,353,464,464,464,464,464,464,464,464,464,464,464,464,464,464,464,464,306,306,306,306,306,306,306,306,306,306,306,306,372,372,372,372,372,372,372,372,372,372,372,372,372,372,372,372,372,496,496,496,496,496,496,496,496,496,496,496,496,496,496,496,260,260,260,260,260,260,260,260,260,874,874,874,874,874,874,874,874,874,874,874,874,874,874,1025,1025,1025,1025,1025,1025,1025,1025,1025,1025,1025,1025,1025,1025,1025,176,176,176,176,176,176,176,176,176,176,176,176,176,176,176,176,782,782,782,782,782,782,782,782,782,782,782,782,782,782,782,782,536,536,536,536,536,536,536,536,536,666,666,666,666,666,666,666,666,666,666,666,666,666,666,666,666,666,495,495,495,495,495,49
5,495,495,495,495,495,385,385,385,385,385,385,385,385,385,385,385,385,385,385,385,385,385,385,385,441,441,441,441,441,441,441,441,441,441,441,357,357,357,357,357,357,357,357,357,296,296,296,296,296,296,296,296,296,296,644,644,644,644,644,644,644,644,644,644,644,644,644,644,644,644,644,804,804,804,804,804,804,804,804,804,804,804,397,397,397,397,397,397,397,397,397,397,397,397,397,397,360,360,360,360,360,360,360,973,973,973,973,973,973,973,973,973,973,973,973,973,973,263,263,263,263,263,263,263,263,263,263,263,568,568,568,568,568,568,568,568,568,568,568,264,264,264,264,264,264,264,264,264,264,264,264,264,264,264,892,892,892,892,892,892,892,892,892,892,892,206,206,206,206,206,489,489,489,489,489,489,489,489,489,489,489,489,489,181,181,181,181,181,181,181,181,181,181,181,181,531,531,531,531,531,531,531,531,531,531,531,531,664,664,664,664,664,664,664,664,664,664,664,664,664,664,672,672,672,672,672,672,672,672,672,672,672,672,672,672,858,858,858,858,858,858,858,858,858,858,858,858,634,634,634,634,634,634,634,634,634,634,634,634,634,634,891,891,891,891,891,891,891,891,891,891,891,891,891,488,488,488,488,488,488,488,488,488,488,488,488,942,942,942,942,942,942,942,942,942,942,942,942,942,942,942,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,522,522,522,522,522,1643,1643,1643,1643,1643,628,628,628,628,628,758,758,758,758,758,883,883,883,883,883,559,559,559,559,559,603,603,603,603,631,631,631,631,631,626,626,626,626,626,541,541,541,541,541,795,795,795,795,795,591,591,591,591,591,568,568,568,568,568,536,536,536,536,536,575,575,575,575,575,571,571,571,571,571,641,641,641,641,641,665,665,665,665,665,1088,1088,1088,1088,1088,1619,1619,1619,1619,1619,939,939,939,939,939,865,865,865,865,865,711,711,711,711,711,831,831,831,831,831,501,501,501,501,501,676,676,676,676,676,854,854,854,854,854,784,784,784,784,784,641,641,641,641,641,544,544,544,544,544,918,918,918,918,918,763,763,763,763,76
3,906,906,906,906,906,632,632,632,632,632,869,869,869,869,869,1044,1044,1044,1044,1044,760,760,760,760,760,715,715,715,715,715,838,838,838,838,838,881,881,881,881,881,940,940,940,940,940,618,618,618,618,618,1205,1205,1205,534,534,534,534,534,757,757,757,757,757,1239,1239,1239,1239,1239,609,609,609,609,609,798,798,798,798,798,613,613,613,613,613,613,613,613,613,613,"
890
          ]
891
        },
892
        {
893
          "data": {
894
            "text/plain": [
895
              "Dataset({\n",
896
              "    features: ['id', 'title', 'context', 'question', 'answers'],\n",
897
              "    num_rows: 1057\n",
898
              "})"
899
            ]
900
          },
901
          "execution_count": 14,
902
          "metadata": {},
903
          "output_type": "execute_result"
904
        }
905
      ],
906
      "source": [
907
        "# Let's print the length of each `context` string in our subset of the dataset\n",
908
        "# (10% of the validation i.e. 1057 examples)\n",
909
        "\n",
910
        "dataset.map(lambda example: print(len(example['context']), end=','))"
911
      ]
912
    },
913
    {
914
      "cell_type": "markdown",
915
      "metadata": {
916
        "id": "Ta3celHnvSVP",
917
        "pycharm": {
918
          "name": "#%% md\n"
919
        }
920
      },
921
      "source": [
922
        "This is basically the same as doing\n",
923
        "\n",
924
        "```python\n",
925
        "for example in dataset:\n",
926
        "    function(example)\n",
927
        "```"
928
      ]
929
    },
930
    {
931
      "cell_type": "markdown",
932
      "metadata": {
933
        "id": "Z4Fjr0DJawuS",
934
        "pycharm": {
935
          "name": "#%% md\n"
936
        }
937
      },
938
      "source": [
        "The above example was a bit verbose. We can control the logging level of 🤗 Datasets with its logging module:\n"
940
      ]
941
    },
942
    {
943
      "cell_type": "code",
944
      "execution_count": null,
945
      "metadata": {
946
        "colab": {
947
          "base_uri": "https://localhost:8080/",
948
          "height": 106,
949
          "referenced_widgets": [
950
            "961929641bfc4b06b0603bd792c6d351",
951
            "c497e117ef7142338bd45e57b722616b",
952
            "60682d73f15b4020b57f87dabba5f320",
953
            "51f49669810a4b5f941c18e4b1896866",
954
            "f4da65dff9374ace9b92d341ec2793f1",
955
            "9b5b8acd984f44d696f8f83862f20bf1",
956
            "f9fdd11e8b6f411e818447528be333df",
957
            "d8edc4f0a0a44882a7beeca0321276d6",
958
            "09e1966daf9e481da118af73af218d88",
959
            "21a2deb93c614338a9944b5032220c8d",
960
            "d8494cdc5ce04f4690a9adadb921de4c"
961
          ]
962
        },
963
        "id": "qAgptXFYaquI",
964
        "outputId": "c42c9177-92d2-4c63-eeea-d69dbd877e54",
965
        "pycharm": {
966
          "name": "#%%\n"
967
        }
968
      },
969
      "outputs": [
970
        {
971
          "data": {
972
            "application/vnd.jupyter.widget-view+json": {
973
              "model_id": "961929641bfc4b06b0603bd792c6d351",
974
              "version_major": 2,
975
              "version_minor": 0
976
            },
977
            "text/plain": [
978
              "Map:   0%|          | 0/1057 [00:00<?, ? examples/s]"
979
            ]
980
          },
981
          "metadata": {},
982
          "output_type": "display_data"
983
        },
984
        {
985
          "name": "stdout",
986
          "output_type": "stream",
987
          "text": [
988
            "775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,179,179,179,179,179,179,179,179,179,179,179,168,168,168,168,168,168,168,168,168,168,168,168,168,168,168,168,168,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,704,704,704,704,704,704,704,704,704,704,704,704,704,704,704,704,704,704,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1166,1166,1166,1166,1166,1166,1166,1166,1166,1166,1166,1166,1166,1166,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,929,929,929,929,929,929,929,929,929,929,929,929,929,929,929,929,929,929,929,704,704,704,704,704,704,704,704,704,704,704,704,704,704,353,353,353,353,353,353,353,353,353,353,353,353,353,353,353,464,464,464,464,464,464,464,464,464,464,464,464,464,464,464,464,306,306,306,306,306,306,306,306,306,306,306,306,372,372,372,372,372,372,372,372,372,372,372,372,372,372,372,372,372,496,496,496,496,496,496,496,496,496,496,496,496,496,496,496,260,260,260,260,260,260,260,260,260,874,874,874,874,874,874,874,874,874,874,874,874,874,874,1025,1025,1025,1025,1025,1025,1025,1025,1025,1025,1025,1025,1025,1025,1025,176,176,176,176,176,176,176,176,176,176,176,176,176,176,176,176,782,782,782,782,782,782,782,782,782,782,782,782,782,782,782,782,536,536,536,536,536,536,536,536,536,666,666,666,666,666,666,666,666,666,666,666,666,666,666,666,666,666,495,495,495,495,495,49
5,495,495,495,495,495,385,385,385,385,385,385,385,385,385,385,385,385,385,385,385,385,385,385,385,441,441,441,441,441,441,441,441,441,441,441,357,357,357,357,357,357,357,357,357,296,296,296,296,296,296,296,296,296,296,644,644,644,644,644,644,644,644,644,644,644,644,644,644,644,644,644,804,804,804,804,804,804,804,804,804,804,804,397,397,397,397,397,397,397,397,397,397,397,397,397,397,360,360,360,360,360,360,360,973,973,973,973,973,973,973,973,973,973,973,973,973,973,263,263,263,263,263,263,263,263,263,263,263,568,568,568,568,568,568,568,568,568,568,568,264,264,264,264,264,264,264,264,264,264,264,264,264,264,264,892,892,892,892,892,892,892,892,892,892,892,206,206,206,206,206,489,489,489,489,489,489,489,489,489,489,489,489,489,181,181,181,181,181,181,181,181,181,181,181,181,531,531,531,531,531,531,531,531,531,531,531,531,664,664,664,664,664,664,664,664,664,664,664,664,664,664,672,672,672,672,672,672,672,672,672,672,672,672,672,672,858,858,858,858,858,858,858,858,858,858,858,858,634,634,634,634,634,634,634,634,634,634,634,634,634,634,891,891,891,891,891,891,891,891,891,891,891,891,891,488,488,488,488,488,488,488,488,488,488,488,488,942,942,942,942,942,942,942,942,942,942,942,942,942,942,942,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,522,522,522,522,522,1643,1643,1643,1643,1643,628,628,628,628,628,758,758,758,758,758,883,883,883,883,883,559,559,559,559,559,603,603,603,603,631,631,631,631,631,626,626,626,626,626,541,541,541,541,541,795,795,795,795,795,591,591,591,591,591,568,568,568,568,568,536,536,536,536,536,575,575,575,575,575,571,571,571,571,571,641,641,641,641,641,665,665,665,665,665,1088,1088,1088,1088,1088,1619,1619,1619,1619,1619,939,939,939,939,939,865,865,865,865,865,711,711,711,711,711,831,831,831,831,831,501,501,501,501,501,676,676,676,676,676,854,854,854,854,854,784,784,784,784,784,641,641,641,641,641,544,544,544,544,544,918,918,918,918,918,763,763,763,763,76
3,906,906,906,906,906,632,632,632,632,632,869,869,869,869,869,1044,1044,1044,1044,1044,760,760,760,760,760,715,715,715,715,715,838,838,838,838,838,881,881,881,881,881,940,940,940,940,940,618,618,618,618,618,1205,1205,1205,534,534,534,534,534,757,757,757,757,757,1239,1239,1239,1239,1239,609,609,609,609,609,798,798,798,798,798,613,613,613,613,613,613,613,613,613,613,"
989
          ]
990
        },
991
        {
992
          "data": {
993
            "text/plain": [
994
              "Dataset({\n",
995
              "    features: ['id', 'title', 'context', 'question', 'answers'],\n",
996
              "    num_rows: 1057\n",
997
              "})"
998
            ]
999
          },
1000
          "execution_count": 15,
1001
          "metadata": {},
1002
          "output_type": "execute_result"
1003
        }
1004
      ],
1005
      "source": [
1006
        "from datasets import logging\n",
1007
        "logging.set_verbosity_warning()\n",
1008
        "\n",
1009
        "dataset.map(lambda example: print(len(example['context']), end=','))"
1010
      ]
1011
    },
1012
    {
1013
      "cell_type": "code",
1014
      "execution_count": null,
1015
      "metadata": {
1016
        "id": "KfED6CEHa8J_",
1017
        "pycharm": {
1018
          "name": "#%%\n"
1019
        }
1020
      },
1021
      "outputs": [],
1022
      "source": [
1023
        "# Let's keep it verbose for our tutorial though\n",
1024
        "from datasets import logging\n",
1025
        "logging.set_verbosity_info()"
1026
      ]
1027
    },
1028
    {
1029
      "cell_type": "markdown",
1030
      "metadata": {
1031
        "id": "i_Ouw5gDvSVP",
1032
        "pycharm": {
1033
          "name": "#%% md\n"
1034
        }
1035
      },
1036
      "source": [
        "The above example had no effect on the dataset because the function we supplied to `.map()` didn't return a `dict` or a `collections.abc.Mapping` that could be used to update the examples in the dataset.\n",
        "\n",
        "In such a case, `.map()` will return the same dataset (`self`).\n",
        "\n",
        "Now let's see how we can use a function that actually modifies the dataset."
1042
      ]
1043
    },
1044
    {
1045
      "cell_type": "markdown",
1046
      "metadata": {
1047
        "id": "cEnCi9DFvSVQ",
1048
        "pycharm": {
1049
          "name": "#%% md\n"
1050
        }
1051
      },
1052
      "source": [
1053
        "### Modifying the dataset example by example"
1054
      ]
1055
    },
1056
    {
1057
      "cell_type": "markdown",
1058
      "metadata": {
1059
        "id": "kA37VgZhvSVQ",
1060
        "pycharm": {
1061
          "name": "#%% md\n"
1062
        }
1063
      },
1064
      "source": [
        "The main purpose of `.map()` is to update and modify the content of the table while leveraging smart caching and the fast backend.\n",
        "\n",
        "To use `.map()` to update elements in the table, you need to provide a function with the following signature: `function(example: dict) -> dict`."
1068
      ]
1069
    },
1070
    {
1071
      "cell_type": "code",
1072
      "execution_count": null,
1073
      "metadata": {
1074
        "colab": {
1075
          "base_uri": "https://localhost:8080/"
1076
        },
1077
        "id": "vUr65K-4vSVQ",
1078
        "outputId": "0d770257-f8d0-45fc-8ae2-7f387210f068",
1079
        "pycharm": {
1080
          "name": "#%%\n"
1081
        }
1082
      },
1083
      "outputs": [
1084
        {
1085
          "name": "stderr",
1086
          "output_type": "stream",
1087
          "text": [
1088
            "WARNING:datasets.arrow_dataset:Loading cached processed dataset at /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-242ccd893f32bdf9.arrow\n"
1089
          ]
1090
        },
1091
        {
1092
          "name": "stdout",
1093
          "output_type": "stream",
1094
          "text": [
1095
            "['My cute title: Super_Bowl_50', 'My cute title: Warsaw']\n"
1096
          ]
1097
        }
1098
      ],
1099
      "source": [
1100
        "# Let's add a prefix 'My cute title: ' to each of our titles\n",
1101
        "\n",
1102
        "def add_prefix_to_title(example):\n",
1103
        "    example['title'] = 'My cute title: ' + example['title']\n",
1104
        "    return example\n",
1105
        "\n",
1106
        "prefixed_dataset = dataset.map(add_prefix_to_title)\n",
1107
        "\n",
        "print(prefixed_dataset.unique('title'))  # `.unique()` is a very fast way to list the unique elements in a column (see the docs for all the methods)"
1109
      ]
1110
    },
1111
    {
1112
      "cell_type": "markdown",
1113
      "metadata": {
1114
        "id": "FcZ_amDAvSVS",
1115
        "pycharm": {
1116
          "name": "#%% md\n"
1117
        }
1118
      },
1119
      "source": [
        "This call to `.map()` computes and returns the updated table. It also stores the updated table in a cache file indexed by the current state and the mapped function.\n",
        "\n",
        "A subsequent call to `.map()` (even in another Python session) will reuse the cached file instead of recomputing the operation.\n",
        "\n",
        "You can test this by running the previous cell again: you will see that the results are loaded directly from the cache and not recomputed.\n",
        "\n",
        "The updated dataset returned by `.map()` is (again) directly memory-mapped from drive and not allocated in RAM."
1127
      ]
1128
    },
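    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "If you want to force a recomputation instead of reloading the cached result, `.map()` accepts a `load_from_cache_file` argument. The cell below is only a sketch: it assumes that the `dataset` and `add_prefix_to_title` objects defined in the cells above are still available in the current session, and the name `recomputed_dataset` is introduced here for illustration."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Sketch: bypass the cache and recompute the mapped dataset\n",
        "# (relies on `dataset` and `add_prefix_to_title` from the previous cells)\n",
        "recomputed_dataset = dataset.map(add_prefix_to_title, load_from_cache_file=False)\n",
        "\n",
        "print(recomputed_dataset.unique('title'))"
      ]
    },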
1129
    {
1130
      "cell_type": "markdown",
1131
      "metadata": {
1132
        "id": "Skbf8LUEvSVT",
1133
        "pycharm": {
1134
          "name": "#%% md\n"
1135
        }
1136
      },
1137
      "source": [
        "The function you provide to `.map()` should accept an input with the format of an item of the dataset, `function(dataset[0])`, and return a Python dict.\n",
        "\n",
        "The columns and types of the output can differ from those of the input dict. In this case the new keys will be added as additional columns in the dataset.\n",
        "\n",
        "Basically, each dataset example dict is updated with the dictionary returned by the function, like this: `example.update(function(example))`."
1143
      ]
1144
    },
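    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a small sketch of that update rule, the function in the next cell returns a key that does not exist yet, so `.map()` adds it as a new column and leaves the other columns unchanged. The `toy` dataset and the `context_length` column name are hypothetical and only serve the illustration."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from datasets import Dataset\n",
        "\n",
        "# Hypothetical toy data with two existing columns\n",
        "toy = Dataset.from_dict({\n",
        "    'title': ['Super_Bowl_50', 'Warsaw'],\n",
        "    'context': ['Super Bowl 50 was an American football game.', 'Warsaw is the capital of Poland.'],\n",
        "})\n",
        "\n",
        "# The returned dict only contains the new key; existing columns are kept as-is\n",
        "with_length = toy.map(lambda example: {'context_length': len(example['context'])})\n",
        "\n",
        "print(with_length.column_names)\n",
        "print(with_length[0])"
      ]
    },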
1145
    {
1146
      "cell_type": "code",
1147
      "execution_count": null,
1148
      "metadata": {
1149
        "colab": {
1150
          "base_uri": "https://localhost:8080/"
1151
        },
1152
        "id": "d5De0CfTvSVT",
1153
        "outputId": "e6282b6e-d9ce-4e8b-e0f4-6c9a34330bce",
1154
        "pycharm": {
1155
          "name": "#%%\n"
1156
        }
1157
      },
1158
      "outputs": [
1159
        {
1160
          "name": "stderr",
1161
          "output_type": "stream",
1162
          "text": [
1163
            "WARNING:datasets.arrow_dataset:Loading cached processed dataset at /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-4f3eee21db868c87.arrow\n"
1164
          ]
1165
        },
1166
        {
1167
          "name": "stdout",
1168
          "output_type": "stream",
1169
          "text": [
1170
            "['My cutest title: Super_Bowl_50', 'My cutest title: Warsaw']\n"
1171
          ]
1172
        }
1173
      ],
1174
      "source": [
1175
        "# Since the input example dict is updated with our function output dict,\n",
1176
        "# we can actually just return the updated 'title' field\n",
1177
        "titled_dataset = dataset.map(lambda example: {'title': 'My cutest title: ' + example['title']})\n",
1178
        "\n",
1179
        "print(titled_dataset.unique('title'))"
1180
      ]
1181
    },
1182
    {
1183
      "cell_type": "markdown",
1184
      "metadata": {
1185
        "id": "Q5vny56-vSVV",
1186
        "pycharm": {
1187
          "name": "#%% md\n"
1188
        }
1189
      },
1190
      "source": [
        "#### Removing columns\n",
        "You can also remove columns while running `.map()` by passing a list of column names to the `remove_columns` argument."
1193
      ]
1194
    },
1195
    {
1196
      "cell_type": "code",
1197
      "execution_count": null,
1198
      "metadata": {
1199
        "colab": {
1200
          "base_uri": "https://localhost:8080/"
1201
        },
1202
        "id": "-sPWnsz-vSVW",
1203
        "outputId": "c116e3cb-2fa4-4304-d6a5-d600b3bc4930",
1204
        "pycharm": {
1205
          "name": "#%%\n"
1206
        }
1207
      },
1208
      "outputs": [
1209
        {
1210
          "name": "stderr",
1211
          "output_type": "stream",
1212
          "text": [
1213
            "WARNING:datasets.arrow_dataset:Loading cached processed dataset at /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-2800c1727354fbe2.arrow\n"
1214
          ]
1215
        },
1216
        {
1217
          "name": "stdout",
1218
          "output_type": "stream",
1219
          "text": [
1220
            "['id', 'context', 'question', 'answers', 'new_title']\n",
1221
            "['Wouhahh: Super_Bowl_50', 'Wouhahh: Warsaw']\n"
1222
          ]
1223
        }
1224
      ],
1225
      "source": [
        "# This will remove the 'title' column while doing the update (after it has been passed to the mapped function, so you can still use it in your function!)\n",
1227
        "less_columns_dataset = dataset.map(lambda example: {'new_title': 'Wouhahh: ' + example['title']}, remove_columns=['title'])\n",
1228
        "\n",
1229
        "print(less_columns_dataset.column_names)\n",
1230
        "print(less_columns_dataset.unique('new_title'))"
1231
      ]
1232
    },
1233
    {
1234
      "cell_type": "markdown",
1235
      "metadata": {
1236
        "id": "G459HzD-vSVY",
1237
        "pycharm": {
1238
          "name": "#%% md\n"
1239
        }
1240
      },
1241
      "source": [
        "#### Using example indices\n",
        "With `with_indices=True`, dataset indices (from `0` to `len(dataset) - 1`) will be supplied to the function, which must therefore have the following signature: `function(example: dict, index: int) -> dict`"
1244
      ]
1245
    },
1246
    {
1247
      "cell_type": "code",
1248
      "execution_count": null,
1249
      "metadata": {
1250
        "colab": {
1251
          "base_uri": "https://localhost:8080/"
1252
        },
1253
        "id": "_kFL37R2vSVY",
1254
        "outputId": "16a436d2-6a2e-4526-8016-b47273116a71",
1255
        "pycharm": {
1256
          "name": "#%%\n"
1257
        }
1258
      },
1259
      "outputs": [
1260
        {
1261
          "name": "stderr",
1262
          "output_type": "stream",
1263
          "text": [
1264
            "WARNING:datasets.arrow_dataset:Loading cached processed dataset at /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-e23b98819de39aea.arrow\n"
1265
          ]
1266
        },
1267
        {
1268
          "name": "stdout",
1269
          "output_type": "stream",
1270
          "text": [
1271
            "['0: Which NFL team represented the AFC at Super Bowl 50?',\n",
1272
            " '1: Which NFL team represented the NFC at Super Bowl 50?',\n",
1273
            " '2: Where did Super Bowl 50 take place?',\n",
1274
            " '3: Which NFL team won Super Bowl 50?',\n",
1275
            " '4: What color was used to emphasize the 50th anniversary of the Super Bowl?']\n"
1276
          ]
1277
        }
1278
      ],
1279
      "source": [
1280
        "# This will add the index in the dataset to the 'question' field\n",
1281
        "with_indices_dataset = dataset.map(lambda example, idx: {'question': f'{idx}: ' + example['question']},\n",
1282
        "                                   with_indices=True)\n",
1283
        "\n",
1284
        "pprint(with_indices_dataset['question'][:5])"
1285
      ]
1286
    },
1287
    {
1288
      "cell_type": "markdown",
1289
      "metadata": {
1290
        "id": "xckhVEWFvSVb",
1291
        "pycharm": {
1292
          "name": "#%% md\n"
1293
        }
1294
      },
1295
      "source": [
1296
        "### Modifying the dataset with batched updates"
1297
      ]
1298
    },
1299
    {
1300
      "cell_type": "markdown",
1301
      "metadata": {
1302
        "id": "dzmicbSnvSVb",
1303
        "pycharm": {
1304
          "name": "#%% md\n"
1305
        }
1306
      },
1307
      "source": [
        "`.map()` can also work with batches of examples (slices of the dataset).\n",
        "\n",
        "This is particularly interesting if you have a function that can handle batches of inputs, like the tokenizers of the HuggingFace `tokenizers` library.\n",
        "\n",
        "To work on batched inputs, set `batched=True` when calling `.map()` and supply a function with the following signature: `function(examples: Dict[str, List]) -> Dict[str, List]` or, if you use indices, `function(examples: Dict[str, List], indices: List[int]) -> Dict[str, List]`.\n",
        "\n",
        "Basically, your function should accept an input with the format of a slice of the dataset: `function(dataset[:10])`."
1315
      ]
1316
    },
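    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a rough sketch of the batched signature with indices, the following toy example prefixes each question with its index. The toy data, the `number_questions` helper and the small `batch_size=2` are assumptions chosen for illustration.\n",
        "\n",
        "```python\n",
        "from datasets import Dataset\n",
        "\n",
        "# Hypothetical toy data\n",
        "toy = Dataset.from_dict({'question': ['Where?', 'Who?', 'When?']})\n",
        "\n",
        "def number_questions(examples, indices):\n",
        "    # `examples` is a dict of lists (one list per column);\n",
        "    # `indices` holds the dataset index of each example in the batch\n",
        "    return {'question': [f'{i}: {q}' for i, q in zip(indices, examples['question'])]}\n",
        "\n",
        "numbered = toy.map(number_questions, batched=True, with_indices=True, batch_size=2)\n",
        "print(numbered[:]['question'])\n",
        "```\n",
        "\n",
        "The function is called once per batch rather than once per example, which is what lets a batch-capable function such as a fast tokenizer do its work in one go."
      ]
    },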
1317
    {
1318
      "cell_type": "code",
1319
      "execution_count": null,
1320
      "metadata": {
1321
        "colab": {
1322
          "base_uri": "https://localhost:8080/"
1323
        },
1324
        "id": "pxHbgSTL0itj",
1325
        "outputId": "20471793-ca8e-4d06-80cd-4bf822eb0d40",
1326
        "pycharm": {
1327
          "name": "#%%\n"
1328
        }
1329
      },
1330
      "outputs": [
1331
        {
1332
          "name": "stdout",
1333
          "output_type": "stream",
1334
          "text": [
1335
            "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
1336
            "Requirement already satisfied: transformers in /usr/local/lib/python3.10/dist-packages (4.29.2)\n",
1337
            "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transformers) (3.12.0)\n",
1338
            "Requirement already satisfied: huggingface-hub<1.0,>=0.14.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.14.1)\n",
1339
            "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (1.22.4)\n",
1340
            "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers) (23.1)\n",
1341
            "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (6.0)\n",
1342
            "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (2022.10.31)\n",
1343
            "Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers) (2.27.1)\n",
1344
            "Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.13.3)\n",
1345
            "Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers) (4.65.0)\n",
1346
            "Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.14.1->transformers) (2023.4.0)\n",
1347
            "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.14.1->transformers) (4.5.0)\n",
1348
            "Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (1.26.15)\n",
1349
            "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2022.12.7)\n",
1350
            "Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2.0.12)\n",
1351
            "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.4)\n"
1352
          ]
1353
        }
1354
      ],
1355
      "source": [
1356
        "!pip install transformers"
1357
      ]
1358
    },
1359
    {
1360
      "cell_type": "code",
1361
      "execution_count": null,
1362
      "metadata": {
1363
        "id": "T7gpEg0yvSVc",
1364
        "pycharm": {
1365
          "name": "#%%\n"
1366
        }
1367
      },
1368
      "outputs": [],
1369
      "source": [
1370
        "# Let's import a fast tokenizer that can work on batched inputs\n",
1371
        "# (the 'Fast' tokenizers in HuggingFace)\n",
1372
        "from transformers import BertTokenizerFast, logging as transformers_logging\n",
1373
        "\n",
1374
        "transformers_logging.set_verbosity_warning()\n",
1375
        "\n",
1376
        "tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')"
1377
      ]
1378
    },
1379
    {
1380
      "cell_type": "code",
1381
      "execution_count": null,
1382
      "metadata": {
1383
        "colab": {
1384
          "base_uri": "https://localhost:8080/"
1385
        },
1386
        "id": "fAmLTPC9vSVe",
1387
        "outputId": "4388ecc8-049a-41cc-90cb-1d48fd05c8dd",
1388
        "pycharm": {
1389
          "name": "#%%\n"
1390
        }
1391
      },
1392
      "outputs": [
1393
        {
1394
          "name": "stderr",
1395
          "output_type": "stream",
1396
          "text": [
1397
            "WARNING:datasets.arrow_dataset:Loading cached processed dataset at /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-1d272c8f779fd409.arrow\n"
1398
          ]
1399
        },
1400
        {
1401
          "name": "stdout",
1402
          "output_type": "stream",
1403
          "text": [
1404
            "encoded_dataset[0]\n",
1405
            "{'answers': {'answer_start': [177, 177, 177],\n",
1406
            "             'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos']},\n",
1407
            " 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
1408
            "                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
1409
            "                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
1410
            "                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
1411
            "                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
1412
            "                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
1413
            "                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
1414
            "                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
1415
            "                    1],\n",
1416
            " 'context': 'Super Bowl 50 was an American football game to determine the '\n",
1417
            "            'champion of the National Football League (NFL) for the 2015 '\n",
1418
            "            'season. The American Football Conference (AFC) champion Denver '\n",
1419
            "            'Broncos defeated the National Football Conference (NFC) champion '\n",
1420
            "            'Carolina Panthers 24–10 to earn their third Super Bowl title. The '\n",
1421
            "            \"game was played on February 7, 2016, at Levi's Stadium in the San \"\n",
1422
            "            'Francisco Bay Area at Santa Clara, California. As this was the '\n",
1423
            "            '50th Super Bowl, the league emphasized the \"golden anniversary\" '\n",
1424
            "            'with various gold-themed initiatives, as well as temporarily '\n",
1425
            "            'suspending the tradition of naming each Super Bowl game with '\n",
1426
            "            'Roman numerals (under which the game would have been known as '\n",
1427
            "            '\"Super Bowl L\"), so that the logo could prominently feature the '\n",
1428
            "            'Arabic numerals 50.',\n",
1429
            " 'id': '56be4db0acb8001400a502ec',\n",
1430
            " 'input_ids': [101, 3198, 5308, 1851, 1108, 1126, 1237, 1709, 1342, 1106, 4959,\n",
1431
            "               1103, 3628, 1104, 1103, 1305, 2289, 1453, 113, 4279, 114, 1111,\n",
1432
            "               1103, 1410, 1265, 119, 1109, 1237, 2289, 3047, 113, 10402, 114,\n",
1433
            "               3628, 7068, 14722, 2378, 1103, 1305, 2289, 3047, 113, 24743, 114,\n",
1434
            "               3628, 2938, 13598, 1572, 782, 1275, 1106, 7379, 1147, 1503, 3198,\n",
1435
            "               5308, 1641, 119, 1109, 1342, 1108, 1307, 1113, 1428, 128, 117,\n",
1436
            "               1446, 117, 1120, 12388, 112, 188, 3339, 1107, 1103, 1727, 2948,\n",
1437
            "               2410, 3894, 1120, 3364, 10200, 117, 1756, 119, 1249, 1142, 1108,\n",
1438
            "               1103, 13163, 3198, 5308, 117, 1103, 2074, 13463, 1103, 107, 5404,\n",
1439
            "               5453, 107, 1114, 1672, 2284, 118, 12005, 11751, 117, 1112, 1218,\n",
1440
            "               1112, 7818, 28117, 20080, 16264, 1103, 3904, 1104, 10505, 1296,\n",
1441
            "               3198, 5308, 1342, 1114, 2264, 183, 15447, 16179, 113, 1223, 1134,\n",
1442
            "               1103, 1342, 1156, 1138, 1151, 1227, 1112, 107, 3198, 5308, 149,\n",
1443
            "               107, 114, 117, 1177, 1115, 1103, 7998, 1180, 15199, 2672, 1103,\n",
1444
            "               4944, 183, 15447, 16179, 1851, 119, 102],\n",
1445
            " 'question': 'Which NFL team represented the AFC at Super Bowl 50?',\n",
1446
            " 'title': 'Super_Bowl_50',\n",
1447
            " 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
1448
            "                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
1449
            "                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
1450
            "                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
1451
            "                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
1452
            "                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
1453
            "                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
1454
            "                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
1455
            "                    0]}\n"
1456
          ]
1457
        }
1458
      ],
1459
      "source": [
1460
        "# Now let's batch tokenize our dataset 'context'\n",
1461
        "encoded_dataset = dataset.map(lambda example: tokenizer(example['context']), batched=True)\n",
1462
        "\n",
1463
        "print(\"encoded_dataset[0]\")\n",
1464
        "pprint(encoded_dataset[0], compact=True)"
1465
      ]
1466
    },
1467
    {
1468
      "cell_type": "code",
1469
      "execution_count": null,
1470
      "metadata": {
1471
        "colab": {
1472
          "base_uri": "https://localhost:8080/"
1473
        },
1474
        "id": "kNaJdKskvSVf",
1475
        "outputId": "17855cc9-47d3-4060-840c-8e8cdd71290d",
1476
        "pycharm": {
1477
          "name": "#%%\n"
1478
        }
1479
      },
1480
      "outputs": [
1481
        {
1482
          "name": "stdout",
1483
          "output_type": "stream",
1484
          "text": [
1485
            "['id',\n",
1486
            " 'title',\n",
1487
            " 'context',\n",
1488
            " 'question',\n",
1489
            " 'answers',\n",
1490
            " 'input_ids',\n",
1491
            " 'token_type_ids',\n",
1492
            " 'attention_mask']\n"
1493
          ]
1494
        }
1495
      ],
1496
      "source": [
1497
        "# we have added additional columns\n",
1498
        "pprint(encoded_dataset.column_names)"
1499
      ]
1500
    },
1501
    {
1502
      "cell_type": "code",
1503
      "execution_count": null,
1504
      "metadata": {
1505
        "colab": {
1506
          "base_uri": "https://localhost:8080/"
1507
        },
1508
        "id": "m3To8ztMvSVj",
1509
        "outputId": "dc46c517-209f-4796-d70c-e99e6d42efe7",
1510
        "pycharm": {
1511
          "name": "#%%\n"
1512
        }
1513
      },
1514
      "outputs": [
1515
        {
1516
          "name": "stderr",
1517
          "output_type": "stream",
1518
          "text": [
1519
            "WARNING:datasets.arrow_dataset:Loading cached processed dataset at /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-a915f5d1f8009aff.arrow\n"
1520
          ]
1521
        }
1522
      ],
1523
      "source": [
        "# Let's show a more complex processing example: the full preparation of the SQuAD dataset\n",
        "# for training a model from Transformers\n",
1526
        "def convert_to_features(batch):\n",
1527
        "    # Tokenize contexts and questions (as pairs of inputs)\n",
1528
        "    encodings = tokenizer(batch['context'], batch['question'], truncation=True)\n",
1529
        "\n",
1530
        "    # Compute start and end tokens for labels\n",
1531
        "    start_positions, end_positions = [], []\n",
1532
        "    for i, answer in enumerate(batch['answers']):\n",
1533
        "        first_char = answer['answer_start'][0]\n",
1534
        "        last_char = first_char + len(answer['text'][0]) - 1\n",
1535
        "        start_positions.append(encodings.char_to_token(i, first_char))\n",
1536
        "        end_positions.append(encodings.char_to_token(i, last_char))\n",
1537
        "\n",
1538
        "    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})\n",
1539
        "    return encodings\n",
1540
        "\n",
1541
        "encoded_dataset = dataset.map(convert_to_features, batched=True)"
1542
      ]
1543
    },
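    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The character arithmetic in `convert_to_features` can be checked in isolation. The sketch below uses a made-up context and answer (not taken from SQuAD) to show that `last_char` is the inclusive index of the answer's last character, which is what `char_to_token` is then asked to locate.\n",
        "\n",
        "```python\n",
        "# Hypothetical example in the same layout as a SQuAD 'answers' field\n",
        "context = 'The game was won by the Denver Broncos in Santa Clara.'\n",
        "answer = {'text': ['Denver Broncos'], 'answer_start': [context.index('Denver Broncos')]}\n",
        "\n",
        "first_char = answer['answer_start'][0]\n",
        "# Inclusive index of the last character of the answer\n",
        "last_char = first_char + len(answer['text'][0]) - 1\n",
        "\n",
        "# Python slices are end-exclusive, hence the + 1\n",
        "assert context[first_char:last_char + 1] == answer['text'][0]\n",
        "```\n",
        "\n",
        "Note that with `truncation=True`, an answer lying in a truncated part of the context has no matching token, so `char_to_token` may return `None` for it; `convert_to_features` does not handle that case."
      ]
    },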
1544
    {
1545
      "cell_type": "code",
1546
      "execution_count": null,
1547
      "metadata": {
1548
        "colab": {
1549
          "base_uri": "https://localhost:8080/"
1550
        },
1551
        "id": "KBnmSa46vSVl",
1552
        "outputId": "17b8e72e-4434-4364-ec0c-67c5288037a4",
1553
        "pycharm": {
1554
          "name": "#%%\n"
1555
        }
1556
      },
1557
      "outputs": [
1558
        {
1559
          "name": "stdout",
1560
          "output_type": "stream",
1561
          "text": [
1562
            "column_names ['id', 'title', 'context', 'question', 'answers', 'input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions']\n",
1563
            "start_positions [34, 45, 80, 34, 98]\n"
1564
          ]
1565
        }
1566
      ],
1567
      "source": [
        "# Now our dataset comprises the labels for the start and end positions\n",
        "# (the token indices of each answer span in the encoded inputs)\n",
1571
        "print(\"column_names\", encoded_dataset.column_names)\n",
1572
        "print(\"start_positions\", encoded_dataset[:5]['start_positions'])"
1573
      ]
1574
    },
1575
    {
1576
      "cell_type": "markdown",
1577
      "metadata": {
1578
        "id": "J1utN8K4muDW"
1579
      },
1580
      "source": [
1581
        "### Image datasets"
1582
      ]
1583
    },
1584
    {
1585
      "cell_type": "markdown",
1586
      "metadata": {
1587
        "id": "vdYUjP60m-Ie"
1588
      },
1589
      "source": [
1590
        "Images are loaded using Pillow:"
1591
      ]
1592
    },
1593
    {
1594
      "cell_type": "code",
1595
      "execution_count": null,
1596
      "metadata": {
1597
        "colab": {
1598
          "base_uri": "https://localhost:8080/"
1599
        },
1600
        "id": "tAbviPxPm4Ce",
1601
        "outputId": "5c38e76e-ae1c-45c2-c20c-110a806cab49"
1602
      },
1603
      "outputs": [
1604
        {
1605
          "name": "stderr",
1606
          "output_type": "stream",
1607
          "text": [
1608
            "INFO:datasets.info:Loading Dataset Infos from /root/.cache/huggingface/modules/datasets_modules/datasets/cats_vs_dogs/d4fe9cf31b294ed8639aa58f7d8ee13fe189011837038ed9a774fde19a911fcb\n",
1609
            "INFO:datasets.builder:Overwrite dataset info from restored data version if exists.\n",
1610
            "INFO:datasets.info:Loading Dataset info from /root/.cache/huggingface/datasets/cats_vs_dogs/default/1.0.0/d4fe9cf31b294ed8639aa58f7d8ee13fe189011837038ed9a774fde19a911fcb\n",
1611
            "WARNING:datasets.builder:Found cached dataset cats_vs_dogs (/root/.cache/huggingface/datasets/cats_vs_dogs/default/1.0.0/d4fe9cf31b294ed8639aa58f7d8ee13fe189011837038ed9a774fde19a911fcb)\n",
1612
            "INFO:datasets.info:Loading Dataset info from /root/.cache/huggingface/datasets/cats_vs_dogs/default/1.0.0/d4fe9cf31b294ed8639aa58f7d8ee13fe189011837038ed9a774fde19a911fcb\n"
1613
          ]
1614
        },
1615
        {
1616
          "data": {
1617
            "text/plain": [
1618
              "{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=500x375 at 0x7FDE79A6AEC0>,\n",
1619
              " 'labels': 0}"
1620
            ]
1621
          },
1622
          "execution_count": 27,
1623
          "metadata": {},
1624
          "output_type": "execute_result"
1625
        }
1626
      ],
1627
      "source": [
1628
        "image_dataset = load_dataset(\"cats_vs_dogs\", split=\"train\")\n",
1629
        "image_dataset[0]"
1630
      ]
1631
    },
1632
    {
1633
      "cell_type": "code",
1634
      "execution_count": null,
1635
      "metadata": {
1636
        "colab": {
1637
          "base_uri": "https://localhost:8080/",
1638
          "height": 392
1639
        },
1640
        "id": "z0q3Do11npXd",
1641
        "outputId": "b545b95d-746f-4777-f233-a7851a44b72c"
1642
      },
1643
      "outputs": [
1644
        {
1645
          "data": {
1646
            "image/png": "",
1647
            "text/plain": [
1648
              "<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=500x375 at 0x7FDE7458BF10>"
1649
            ]
1650
          },
1651
          "execution_count": 28,
1652
          "metadata": {},
1653
          "output_type": "execute_result"
1654
        }
1655
      ],
1656
      "source": [
1657
        "image_dataset[0][\"image\"]"
1658
      ]
1659
    },
1660
    {
1661
      "cell_type": "markdown",
1662
      "metadata": {
1663
        "id": "3URV1v5Zntxb"
1664
      },
1665
      "source": [
1666
        "### Audio datasets"
1667
      ]
1668
    },
1669
    {
1670
      "cell_type": "markdown",
1671
      "metadata": {
1672
        "id": "Ry1dqcUunzEW"
1673
      },
1674
      "source": [
        "Audio files are decoded with torchaudio or librosa, at the sampling rate of your choice.\n",
        "\n",
        "To read mp3 files, you need to install ffmpeg and then restart your runtime."
1678
      ]
1679
    },
1680
    {
1681
      "cell_type": "code",
1682
      "execution_count": null,
1683
      "metadata": {
1684
        "colab": {
1685
          "base_uri": "https://localhost:8080/"
1686
        },
1687
        "id": "k6FSL7S3odEl",
1688
        "outputId": "13299935-e2ff-43b1-e622-33895c3426a7"
1689
      },
1690
      "outputs": [
1691
        {
1692
          "name": "stdout",
1693
          "output_type": "stream",
1694
          "text": [
1695
            "\r0% [Working]\r            \rHit:1 https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/ InRelease\n",
1696
            "Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  InRelease\n",
1697
            "Hit:3 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu focal InRelease\n",
1698
            "Hit:4 http://archive.ubuntu.com/ubuntu focal InRelease\n",
1699
            "Hit:5 http://security.ubuntu.com/ubuntu focal-security InRelease\n",
1700
            "Hit:6 http://archive.ubuntu.com/ubuntu focal-updates InRelease\n",
1701
            "Hit:7 http://ppa.launchpad.net/cran/libgit2/ubuntu focal InRelease\n",
1702
            "Hit:8 http://archive.ubuntu.com/ubuntu focal-backports InRelease\n",
1703
            "Hit:9 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu focal InRelease\n",
1704
            "Hit:10 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu focal InRelease\n",
1705
            "Ign:11 http://ppa.launchpad.net/jonathonf/ffmpeg-4/ubuntu focal InRelease\n",
1706
            "Hit:12 http://ppa.launchpad.net/ubuntugis/ppa/ubuntu focal InRelease\n",
1707
            "Err:13 http://ppa.launchpad.net/jonathonf/ffmpeg-4/ubuntu focal Release\n",
1708
            "  404  Not Found [IP: 185.125.190.52 80]\n",
1709
            "Reading package lists... Done\n",
1710
            "E: The repository 'http://ppa.launchpad.net/jonathonf/ffmpeg-4/ubuntu focal Release' does not have a Release file.\n",
1711
            "N: Updating from such a repository can't be done securely, and is therefore disabled by default.\n",
1712
            "N: See apt-secure(8) manpage for repository creation and user configuration details.\n"
1713
          ]
1714
        }
1715
      ],
1716
      "source": [
1717
        "!add-apt-repository -y ppa:jonathonf/ffmpeg-4 && apt update && apt install -y ffmpeg"
1718
      ]
1719
    },
1720
    {
1721
      "cell_type": "code",
1722
      "execution_count": null,
1723
      "metadata": {
1724
        "colab": {
1725
          "base_uri": "https://localhost:8080/"
1726
        },
1727
        "id": "lpKCz3CHnsre",
1728
        "outputId": "8bb79710-04a6-4563-c1db-d966296baa6b"
1729
      },
1730
      "outputs": [
1731
        {
1732
          "name": "stderr",
1733
          "output_type": "stream",
1734
          "text": [
1735
            "INFO:datasets.info:Loading Dataset Infos from /root/.cache/huggingface/modules/datasets_modules/datasets/common_voice/220833898d6a60c50f621126e51fb22eb2dfe5244392c70dccd8e6e2f055f4bf\n",
1736
            "/root/.cache/huggingface/modules/datasets_modules/datasets/common_voice/220833898d6a60c50f621126e51fb22eb2dfe5244392c70dccd8e6e2f055f4bf/common_voice.py:634: FutureWarning: \n",
1737
            "            This version of the Common Voice dataset is deprecated.\n",
1738
            "            You can download the latest one with\n",
1739
            "            >>> load_dataset(\"mozilla-foundation/common_voice_11_0\", \"en\")\n",
1740
            "            \n",
1741
            "  warnings.warn(\n",
1742
            "INFO:datasets.builder:Overwrite dataset info from restored data version if exists.\n",
1743
            "INFO:datasets.info:Loading Dataset info from /root/.cache/huggingface/datasets/common_voice/fi/6.1.0/220833898d6a60c50f621126e51fb22eb2dfe5244392c70dccd8e6e2f055f4bf\n",
1744
            "WARNING:datasets.builder:Found cached dataset common_voice (/root/.cache/huggingface/datasets/common_voice/fi/6.1.0/220833898d6a60c50f621126e51fb22eb2dfe5244392c70dccd8e6e2f055f4bf)\n",
1745
            "INFO:datasets.info:Loading Dataset info from /root/.cache/huggingface/datasets/common_voice/fi/6.1.0/220833898d6a60c50f621126e51fb22eb2dfe5244392c70dccd8e6e2f055f4bf\n"
1746
          ]
1747
        },
1748
        {
1749
          "data": {
1750
            "text/plain": [
1751
              "{'client_id': '4eeeb22a3bbb52e5215593a09a845f0f8c496e0a7c498c6d1e9e5e0f8730f79bf16b2b30483dfcc771d430918f27e3ce8b546d068017302109c5c76ca75b0944',\n",
1752
              " 'path': '/root/.cache/huggingface/datasets/downloads/extracted/cb1c332c2b5d74b2663eb9d5a6181c2972a0a069831f91fadaac8362eb7899fe/cv-corpus-6.1-2020-12-11/fi/clips/common_voice_fi_22986631.mp3',\n",
1753
              " 'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/cb1c332c2b5d74b2663eb9d5a6181c2972a0a069831f91fadaac8362eb7899fe/cv-corpus-6.1-2020-12-11/fi/clips/common_voice_fi_22986631.mp3',\n",
1754
              "  'array': array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,\n",
1755
              "         -1.04925891e-06,  4.06746835e-07,  8.70920871e-07]),\n",
1756
              "  'sampling_rate': 48000},\n",
1757
              " 'sentence': 'Mitä nyt tekisimme?',\n",
1758
              " 'up_votes': 2,\n",
1759
              " 'down_votes': 0,\n",
1760
              " 'age': 'thirties',\n",
1761
              " 'gender': 'male',\n",
1762
              " 'accent': '',\n",
1763
              " 'locale': 'fi',\n",
1764
              " 'segment': \"''\"}"
1765
            ]
1766
          },
1767
          "execution_count": 30,
1768
          "metadata": {},
1769
          "output_type": "execute_result"
1770
        }
1771
      ],
1772
      "source": [
1773
        "from datasets import load_dataset\n",
1774
        "audio_dataset = load_dataset(\"common_voice\", \"fi\", split=\"train\")\n",
1775
        "audio_dataset[0]"
1776
      ]
1777
    },
1778
    {
1779
      "cell_type": "code",
1780
      "execution_count": null,
1781
      "metadata": {
1782
        "colab": {
1783
          "base_uri": "https://localhost:8080/"
1784
        },
1785
        "id": "2Uw3iTdfo9mu",
1786
        "outputId": "9f6d13c9-7cbf-4f7b-f7ae-fcbfea1a5a11"
1787
      },
1788
      "outputs": [
1789
        {
1790
          "data": {
1791
            "text/plain": [
1792
              "(array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,\n",
1793
              "        -1.04925891e-06,  4.06746835e-07,  8.70920871e-07]),\n",
1794
              " 48000)"
1795
            ]
1796
          },
1797
          "execution_count": 31,
1798
          "metadata": {},
1799
          "output_type": "execute_result"
1800
        }
1801
      ],
1802
      "source": [
1803
        "audio_dataset[0][\"audio\"][\"array\"], audio_dataset[0][\"audio\"][\"sampling_rate\"]"
1804
      ]
1805
    },
1806
    {
1807
      "cell_type": "markdown",
1808
      "metadata": {
1809
        "id": "q6E2SnHupF5l"
1810
      },
1811
      "source": [
        "Audio decoding and resampling are done on the fly when accessing examples. You can change the sampling rate this way:"
1813
      ]
1814
    },
1815
    {
1816
      "cell_type": "code",
1817
      "execution_count": null,
1818
      "metadata": {
1819
        "colab": {
1820
          "base_uri": "https://localhost:8080/"
1821
        },
1822
        "id": "nuoyq-E2pJKf",
1823
        "outputId": "99fb9f52-00e0-462a-e772-e93b790e0009"
1824
      },
1825
      "outputs": [
1826
        {
1827
          "data": {
1828
            "text/plain": [
1829
              "(array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,\n",
1830
              "        -4.28493877e-07, -1.03890284e-06, -5.02728994e-07]),\n",
1831
              " 16000)"
1832
            ]
1833
          },
1834
          "execution_count": 32,
1835
          "metadata": {},
1836
          "output_type": "execute_result"
1837
        }
1838
      ],
1839
      "source": [
1840
        "from datasets import Audio\n",
1841
        "audio_dataset = audio_dataset.cast_column(\"audio\", Audio(sampling_rate=16_000))\n",
1842
        "audio_dataset[0][\"audio\"][\"array\"], audio_dataset[0][\"audio\"][\"sampling_rate\"]"
1843
      ]
1844
    },
1845
    {
1846
      "cell_type": "markdown",
1847
      "metadata": {
1848
        "id": "NzOXxNzQvSVo",
1849
        "pycharm": {
1850
          "name": "#%% md\n"
1851
        }
1852
      },
1853
      "source": [
1854
        "## Formatting outputs for PyTorch, Tensorflow, Numpy, Pandas\n",
1855
        "\n",
        "Now that we have tokenized our inputs, we probably want to use this dataset in a `torch.utils.data.DataLoader` or a `tf.data.Dataset`. There are various ways to approach this.\n",
        "\n",
        "Using the `set_format()` method, we can:\n",
        "\n",
        "- format the indexing (`__getitem__`) to return NumPy/PyTorch/TensorFlow tensors instead of Python objects, and\n",
        "- format the indexing (`__getitem__`) to return only the subset of the columns that we need for our model inputs.\n",
        "\n",
        "  We don't want the columns `id` or `title` as inputs to train our model, but we may still want to keep them in the dataset, for instance to evaluate the model.\n",
        "\n",
        "This is handled by the `.set_format(type: Union[None, str], columns: Union[None, str, List[str]])` method, where:\n",
        "\n",
        "- `type` defines the return type of our dataset's `__getitem__` method and is one of `[None, 'numpy', 'pandas', 'torch', 'tensorflow']` (`None` means return Python objects), and\n",
        "- `columns` defines the columns returned by `__getitem__` and takes the name of a column in the dataset or a list of columns to return (`None` means return all columns)."
1869
      ]
1870
    },
1871
    {
1872
      "cell_type": "code",
1873
      "execution_count": null,
1874
      "metadata": {
1875
        "colab": {
1876
          "base_uri": "https://localhost:8080/"
1877
        },
1878
        "id": "aU2h_qQDvSVo",
1879
        "outputId": "46af4ce3-d232-440a-d899-30d30c8b16f9",
1880
        "pycharm": {
1881
          "name": "#%%\n"
1882
        }
1883
      },
1884
      "outputs": [
1885
        {
1886
          "name": "stdout",
1887
          "output_type": "stream",
1888
          "text": [
1889
            "{'attention_mask': <tf.Tensor: shape=(172,), dtype=int64, numpy=\n",
1890
            "array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
1891
            "       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
1892
            "       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
1893
            "       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
1894
            "       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
1895
            "       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
1896
            "       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
1897
            "       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])>,\n",
1898
            " 'end_positions': <tf.Tensor: shape=(), dtype=int64, numpy=46>,\n",
1899
            " 'input_ids': <tf.Tensor: shape=(172,), dtype=int64, numpy=\n",
1900
            "array([  101,  3198,  5308,  1851,  1108,  1126,  1237,  1709,  1342,\n",
1901
            "        1106,  4959,  1103,  3628,  1104,  1103,  1305,  2289,  1453,\n",
1902
            "         113,  4279,   114,  1111,  1103,  1410,  1265,   119,  1109,\n",
1903
            "        1237,  2289,  3047,   113, 10402,   114,  3628,  7068, 14722,\n",
1904
            "        2378,  1103,  1305,  2289,  3047,   113, 24743,   114,  3628,\n",
1905
            "        2938, 13598,  1572,   782,  1275,  1106,  7379,  1147,  1503,\n",
1906
            "        3198,  5308,  1641,   119,  1109,  1342,  1108,  1307,  1113,\n",
1907
            "        1428,   128,   117,  1446,   117,  1120, 12388,   112,   188,\n",
1908
            "        3339,  1107,  1103,  1727,  2948,  2410,  3894,  1120,  3364,\n",
1909
            "       10200,   117,  1756,   119,  1249,  1142,  1108,  1103, 13163,\n",
1910
            "        3198,  5308,   117,  1103,  2074, 13463,  1103,   107,  5404,\n",
1911
            "        5453,   107,  1114,  1672,  2284,   118, 12005, 11751,   117,\n",
1912
            "        1112,  1218,  1112,  7818, 28117, 20080, 16264,  1103,  3904,\n",
1913
            "        1104, 10505,  1296,  3198,  5308,  1342,  1114,  2264,   183,\n",
1914
            "       15447, 16179,   113,  1223,  1134,  1103,  1342,  1156,  1138,\n",
1915
            "        1151,  1227,  1112,   107,  3198,  5308,   149,   107,   114,\n",
1916
            "         117,  1177,  1115,  1103,  7998,  1180, 15199,  2672,  1103,\n",
1917
            "        4944,   183, 15447, 16179,  1851,   119,   102,  5979,  4279,\n",
1918
            "        1264,  2533,  1103, 24743,  1120,  3198,  5308,  1851,   136,\n",
1919
            "         102])>,\n",
1920
            " 'start_positions': <tf.Tensor: shape=(), dtype=int64, numpy=45>,\n",
1921
            " 'token_type_ids': <tf.Tensor: shape=(172,), dtype=int64, numpy=\n",
1922
            "array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
1923
            "       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
1924
            "       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
1925
            "       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
1926
            "       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
1927
            "       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
1928
            "       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
1929
            "       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])>}\n"
1930
          ]
1931
        }
1932
      ],
1933
      "source": [
1934
        "columns_to_return = ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions']\n",
1935
        "\n",
1936
        "# Uncomment whichever one is appropriate for you\n",
1937
        "# encoded_dataset.set_format(type='torch', columns=columns_to_return)\n",
1938
        "encoded_dataset.set_format(type='tensorflow', columns=columns_to_return)\n",
1939
        "\n",
        "# Indexing the dataset now returns framework tensors (TensorFlow here) for the selected columns only\n",
1941
        "pprint(encoded_dataset[1], compact=True)"
1942
      ]
1943
    },
1944
    {
1945
      "cell_type": "code",
1946
      "execution_count": null,
1947
      "metadata": {
1948
        "colab": {
1949
          "base_uri": "https://localhost:8080/"
1950
        },
1951
        "id": "Wj1ukGIuvSVq",
1952
        "outputId": "f7c6014b-dfff-4885-b696-d93d812a04b3",
1953
        "pycharm": {
1954
          "name": "#%%\n"
1955
        }
1956
      },
1957
      "outputs": [
1958
        {
1959
          "name": "stdout",
1960
          "output_type": "stream",
1961
          "text": [
1962
            "['id', 'title', 'context', 'question', 'answers', 'input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions']\n"
1963
          ]
1964
        }
1965
      ],
1966
      "source": [
1967
        "# Note that the columns are not removed from the dataset, just not returned when calling __getitem__\n",
        "# Similarly, the inner type of the dataset is not changed to torch.Tensor or tf.Tensor; the conversion and filtering are done on-the-fly when querying the dataset\n",
1969
        "print(encoded_dataset.column_names)"
1970
      ]
1971
    },
1972
    {
1973
      "cell_type": "code",
1974
      "execution_count": null,
1975
      "metadata": {
1976
        "colab": {
1977
          "base_uri": "https://localhost:8080/"
1978
        },
1979
        "id": "pWmmUdatasetsvSVs",
1980
        "outputId": "bb959fb6-22cc-42fa-c93e-d0c924cc3ad0",
1981
        "pycharm": {
1982
          "name": "#%%\n"
1983
        }
1984
      },
1985
      "outputs": [
1986
        {
1987
          "name": "stdout",
1988
          "output_type": "stream",
1989
          "text": [
1990
            "{'answers': {'answer_start': [249, 249, 249],\n",
1991
            "             'text': ['Carolina Panthers', 'Carolina Panthers',\n",
1992
            "                      'Carolina Panthers']},\n",
1993
            " 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
1994
            "                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
1995
            "                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
1996
            "                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
1997
            "                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
1998
            "                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
1999
            "                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
2000
            "                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
2001
            "                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],\n",
2002
            " 'context': 'Super Bowl 50 was an American football game to determine the '\n",
2003
            "            'champion of the National Football League (NFL) for the 2015 '\n",
2004
            "            'season. The American Football Conference (AFC) champion Denver '\n",
2005
            "            'Broncos defeated the National Football Conference (NFC) champion '\n",
2006
            "            'Carolina Panthers 24–10 to earn their third Super Bowl title. The '\n",
2007
            "            \"game was played on February 7, 2016, at Levi's Stadium in the San \"\n",
2008
            "            'Francisco Bay Area at Santa Clara, California. As this was the '\n",
2009
            "            '50th Super Bowl, the league emphasized the \"golden anniversary\" '\n",
2010
            "            'with various gold-themed initiatives, as well as temporarily '\n",
2011
            "            'suspending the tradition of naming each Super Bowl game with '\n",
2012
            "            'Roman numerals (under which the game would have been known as '\n",
2013
            "            '\"Super Bowl L\"), so that the logo could prominently feature the '\n",
2014
            "            'Arabic numerals 50.',\n",
2015
            " 'end_positions': 46,\n",
2016
            " 'id': '56be4db0acb8001400a502ed',\n",
2017
            " 'input_ids': [101, 3198, 5308, 1851, 1108, 1126, 1237, 1709, 1342, 1106, 4959,\n",
2018
            "               1103, 3628, 1104, 1103, 1305, 2289, 1453, 113, 4279, 114, 1111,\n",
2019
            "               1103, 1410, 1265, 119, 1109, 1237, 2289, 3047, 113, 10402, 114,\n",
2020
            "               3628, 7068, 14722, 2378, 1103, 1305, 2289, 3047, 113, 24743, 114,\n",
2021
            "               3628, 2938, 13598, 1572, 782, 1275, 1106, 7379, 1147, 1503, 3198,\n",
2022
            "               5308, 1641, 119, 1109, 1342, 1108, 1307, 1113, 1428, 128, 117,\n",
2023
            "               1446, 117, 1120, 12388, 112, 188, 3339, 1107, 1103, 1727, 2948,\n",
2024
            "               2410, 3894, 1120, 3364, 10200, 117, 1756, 119, 1249, 1142, 1108,\n",
2025
            "               1103, 13163, 3198, 5308, 117, 1103, 2074, 13463, 1103, 107, 5404,\n",
2026
            "               5453, 107, 1114, 1672, 2284, 118, 12005, 11751, 117, 1112, 1218,\n",
2027
            "               1112, 7818, 28117, 20080, 16264, 1103, 3904, 1104, 10505, 1296,\n",
2028
            "               3198, 5308, 1342, 1114, 2264, 183, 15447, 16179, 113, 1223, 1134,\n",
2029
            "               1103, 1342, 1156, 1138, 1151, 1227, 1112, 107, 3198, 5308, 149,\n",
2030
            "               107, 114, 117, 1177, 1115, 1103, 7998, 1180, 15199, 2672, 1103,\n",
2031
            "               4944, 183, 15447, 16179, 1851, 119, 102, 5979, 4279, 1264, 2533,\n",
2032
            "               1103, 24743, 1120, 3198, 5308, 1851, 136, 102],\n",
2033
            " 'question': 'Which NFL team represented the NFC at Super Bowl 50?',\n",
2034
            " 'start_positions': 45,\n",
2035
            " 'title': 'Super_Bowl_50',\n",
2036
            " 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
2037
            "                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
2038
            "                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
2039
            "                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
2040
            "                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
2041
            "                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
2042
            "                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
2043
            "                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
2044
            "                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}\n"
2045
          ]
2046
        }
2047
      ],
2048
      "source": [
2049
        "# We can remove the formatting with `.reset_format()`\n",
2050
        "# or, identically, a call to `.set_format()` with no arguments\n",
2051
        "encoded_dataset.reset_format()\n",
2052
        "\n",
2053
        "pprint(encoded_dataset[1], compact=True)"
2054
      ]
2055
    },
2056
    {
2057
      "cell_type": "code",
2058
      "execution_count": null,
2059
      "metadata": {
2060
        "colab": {
2061
          "base_uri": "https://localhost:8080/"
2062
        },
2063
        "id": "VyUOA07svSVu",
2064
        "outputId": "343d5e56-2d7b-4c4b-db2d-e72282e9e377",
2065
        "pycharm": {
2066
          "name": "#%%\n"
2067
        }
2068
      },
2069
      "outputs": [
2070
        {
2071
          "name": "stdout",
2072
          "output_type": "stream",
2073
          "text": [
2074
            "{'columns': ['id',\n",
2075
            "             'title',\n",
2076
            "             'context',\n",
2077
            "             'question',\n",
2078
            "             'answers',\n",
2079
            "             'input_ids',\n",
2080
            "             'token_type_ids',\n",
2081
            "             'attention_mask',\n",
2082
            "             'start_positions',\n",
2083
            "             'end_positions'],\n",
2084
            " 'format_kwargs': {},\n",
2085
            " 'output_all_columns': False,\n",
2086
            " 'type': None}\n"
2087
          ]
2088
        }
2089
      ],
2090
      "source": [
2091
        "# The current format can be checked with `.format`,\n",
2092
        "# which is a dict of the type and formatting\n",
2093
        "pprint(encoded_dataset.format)"
2094
      ]
2095
    },
2096
    {
2097
      "cell_type": "markdown",
2098
      "metadata": {
2099
        "id": "Gpa2-z37lUGc",
2100
        "pycharm": {
2101
          "name": "#%% md\n"
2102
        }
2103
      },
2104
      "source": [
        "There is also a convenience method, `to_tf_dataset()`, for creating `tf.data.Dataset` objects directly from a Hugging Face `Dataset`. An example is shown below. When using this method, it is sufficient to pass the `columns` argument and your `DataCollator`. Make sure you set the `return_tensors` argument of your `DataCollator` to `tf` or `np`, though, because TensorFlow won't be happy if you start passing it PyTorch tensors!"
2106
      ]
2107
    },
2108
    {
2109
      "cell_type": "markdown",
2110
      "metadata": {
2111
        "id": "xyi2eMeSvSVv",
2112
        "pycharm": {
2113
          "name": "#%% md\n"
2114
        }
2115
      },
2116
      "source": [
2117
        "# Wrapping this all up\n",
2118
        "\n",
        "Let's wrap this all up with the full code to load and prepare SQuAD for training a PyTorch or TensorFlow model from the Hugging Face `transformers` library.\n",
2120
        "\n"
2121
      ]
2122
    },
2123
    {
2124
      "cell_type": "code",
2125
      "execution_count": null,
2126
      "metadata": {
2127
        "colab": {
2128
          "base_uri": "https://localhost:8080/"
2129
        },
2130
        "id": "l0j8BPLi6Qlv",
2131
        "outputId": "334e9749-6187-473b-e5f5-805d8fbc9e22",
2132
        "pycharm": {
2133
          "name": "#%%\n"
2134
        }
2135
      },
2136
      "outputs": [
2137
        {
2138
          "name": "stdout",
2139
          "output_type": "stream",
2140
          "text": [
2141
            "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
2142
            "Requirement already satisfied: transformers in /usr/local/lib/python3.10/dist-packages (4.29.2)\n",
2143
            "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transformers) (3.12.0)\n",
2144
            "Requirement already satisfied: huggingface-hub<1.0,>=0.14.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.14.1)\n",
2145
            "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (1.22.4)\n",
2146
            "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers) (23.1)\n",
2147
            "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (6.0)\n",
2148
            "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (2022.10.31)\n",
2149
            "Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers) (2.27.1)\n",
2150
            "Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.13.3)\n",
2151
            "Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers) (4.65.0)\n",
2152
            "Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.14.1->transformers) (2023.4.0)\n",
2153
            "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.14.1->transformers) (4.5.0)\n",
2154
            "Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (1.26.15)\n",
2155
            "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2022.12.7)\n",
2156
            "Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2.0.12)\n",
2157
            "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.4)\n"
2158
          ]
2159
        }
2160
      ],
2161
      "source": [
2162
        "!pip install transformers"
2163
      ]
2164
    },
2165
    {
2166
      "cell_type": "code",
2167
      "execution_count": null,
2168
      "metadata": {
2169
        "colab": {
2170
          "base_uri": "https://localhost:8080/",
2171
          "height": 208,
2172
          "referenced_widgets": [
2173
            "98a45b56fdb040418e42f7c59e28bc14",
2174
            "bf67657a3f5d47a79d078beb8589a098",
2175
            "f0fa32d1b256417db2850569674350d9",
2176
            "8f77a47ffc79400cbd84280e8bbc9979",
2177
            "defca41aeb5b4f8689930bfea05915f1",
2178
            "dbc2e3e6c2cb4c108d46430e132777a1",
2179
            "f3aa463526554d9da89c2fd0fe8efe2a",
2180
            "329b19be2aff486f8a737751ead4d79c",
2181
            "5a78f50d4f4742f08ee3abe4e9c38129",
2182
            "74ff87c33af14cc093694692397a9ee0",
2183
            "cf01a82f5de54ffb97af38ca88e170c2"
2184
          ]
2185
        },
2186
        "id": "QvExTIZWvSVw",
2187
        "outputId": "1ba5ebeb-d4ac-4a3f-b853-2800a3714913",
2188
        "pycharm": {
2189
          "name": "#%%\n"
2190
        }
2191
      },
2192
      "outputs": [
2193
        {
2194
          "name": "stderr",
2195
          "output_type": "stream",
2196
          "text": [
2197
            "INFO:datasets.builder:No config specified, defaulting to the single config: squad/plain_text\n",
2198
            "INFO:datasets.info:Loading Dataset Infos from /root/.cache/huggingface/modules/datasets_modules/datasets/squad/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453\n",
2199
            "INFO:datasets.builder:Overwrite dataset info from restored data version if exists.\n",
2200
            "INFO:datasets.info:Loading Dataset info from /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453\n",
2201
            "WARNING:datasets.builder:Found cached dataset squad (/root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)\n",
2202
            "INFO:datasets.info:Loading Dataset info from /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453\n"
2203
          ]
2204
        },
2205
        {
2206
          "data": {
2207
            "application/vnd.jupyter.widget-view+json": {
2208
              "model_id": "98a45b56fdb040418e42f7c59e28bc14",
2209
              "version_major": 2,
2210
              "version_minor": 0
2211
            },
2212
            "text/plain": [
2213
              "  0%|          | 0/2 [00:00<?, ?it/s]"
2214
            ]
2215
          },
2216
          "metadata": {},
2217
          "output_type": "display_data"
2218
        },
2219
        {
2220
          "name": "stderr",
2221
          "output_type": "stream",
2222
          "text": [
2223
            "WARNING:datasets.arrow_dataset:Loading cached processed dataset at /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-552174beded062cc.arrow\n",
2224
            "WARNING:datasets.arrow_dataset:Loading cached processed dataset at /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-cfb8e391f3306c89.arrow\n"
2225
          ]
2226
        }
2227
      ],
2228
      "source": [
2229
        "import torch\n",
2230
        "from datasets import load_dataset\n",
2231
        "from transformers import BertTokenizerFast\n",
2232
        "\n",
2233
        "# Load our training dataset and tokenizer\n",
2234
        "dataset = load_dataset('squad')\n",
2235
        "tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')\n",
2236
        "\n",
2237
        "def get_correct_alignement(context, answer):\n",
        "    \"\"\" Some original examples in SQuAD have indices that are off by 1 or 2 characters. We test for this and fix it here. \"\"\"\n",
2239
        "    gold_text = answer['text'][0]\n",
2240
        "    start_idx = answer['answer_start'][0]\n",
2241
        "    end_idx = start_idx + len(gold_text)\n",
2242
        "    if context[start_idx:end_idx] == gold_text:\n",
2243
        "        return start_idx, end_idx       # When the gold label position is good\n",
2244
        "    elif context[start_idx-1:end_idx-1] == gold_text:\n",
2245
        "        return start_idx-1, end_idx-1   # When the gold label is off by one character\n",
2246
        "    elif context[start_idx-2:end_idx-2] == gold_text:\n",
        "        return start_idx-2, end_idx-2   # When the gold label is off by two characters\n",
        "    else:\n",
        "        raise ValueError(f'Could not align answer {gold_text!r} with its context')\n",
2250
        "\n",
2251
        "# Tokenize our training dataset\n",
2252
        "def convert_to_features(example_batch):\n",
2253
        "    # Tokenize contexts and questions (as pairs of inputs)\n",
2254
        "    encodings = tokenizer(example_batch['context'], example_batch['question'], truncation=True)\n",
2255
        "\n",
        "    # Compute the start and end token positions for the labels using the alignment methods of the Transformers fast tokenizers.\n",
2257
        "    start_positions, end_positions = [], []\n",
2258
        "    for i, (context, answer) in enumerate(zip(example_batch['context'], example_batch['answers'])):\n",
2259
        "        start_idx, end_idx = get_correct_alignement(context, answer)\n",
2260
        "        start_positions.append(encodings.char_to_token(i, start_idx))\n",
2261
        "        end_positions.append(encodings.char_to_token(i, end_idx-1))\n",
2262
        "    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})\n",
2263
        "    return encodings\n",
2264
        "\n",
2265
        "encoded_dataset = dataset.map(convert_to_features, batched=True)\n"
2266
      ]
2267
    },
2268
    {
2269
      "cell_type": "markdown",
2270
      "metadata": {
2271
        "id": "tFfi22D9lUGc",
2272
        "pycharm": {
2273
          "name": "#%% md\n"
2274
        }
2275
      },
2276
      "source": [
        "That's the end of the shared preprocessing! Next, for PyTorch, we set our dataset format and create a `DataLoader`. If you're using TensorFlow, skip to the next block."
2278
      ]
2279
    },
2280
    {
2281
      "cell_type": "code",
2282
      "execution_count": null,
2283
      "metadata": {
2284
        "id": "-yhzlEoqlUGc",
2285
        "pycharm": {
2286
          "name": "#%%\n"
2287
        }
2288
      },
2289
      "outputs": [],
2290
      "source": [
        "# Format our dataset to output torch.Tensor objects to train a PyTorch model\n",
2292
        "columns = ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions']\n",
2293
        "encoded_dataset.set_format(type='torch', columns=columns)\n",
2294
        "\n",
        "# Instantiate a PyTorch DataLoader around our dataset\n",
2296
        "# Let's do dynamic batching (pad on the fly with our own collate_fn)\n",
2297
        "def collate_fn(examples):\n",
2298
        "    return tokenizer.pad(examples, return_tensors='pt')\n",
2299
        "dataloader = torch.utils.data.DataLoader(encoded_dataset['train'], collate_fn=collate_fn, batch_size=8)"
2300
      ]
2301
    },
2302
    {
2303
      "cell_type": "markdown",
2304
      "metadata": {
2305
        "id": "PfyT0VixlUGd",
2306
        "pycharm": {
2307
          "name": "#%% md\n"
2308
        }
2309
      },
2310
      "source": [
2311
        "For TensorFlow, we use the `to_tf_dataset()` method to get a `tf.data.Dataset`."
2312
      ]
2313
    },
2314
    {
2315
      "cell_type": "code",
2316
      "execution_count": null,
2317
      "metadata": {
2318
        "colab": {
2319
          "base_uri": "https://localhost:8080/"
2320
        },
2321
        "id": "XlVPT5PjlUGd",
2322
        "outputId": "7bd6aeb5-c080-4e35-b990-770a5d7f601d",
2323
        "pycharm": {
2324
          "name": "#%%\n"
2325
        }
2326
      },
2327
      "outputs": [
2328
        {
2329
          "name": "stderr",
2330
          "output_type": "stream",
2331
          "text": [
2332
            "You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.\n"
2333
          ]
2334
        }
2335
      ],
2336
      "source": [
2337
        "columns = ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions']\n",
2338
        "\n",
2339
        "# Let's do dynamic batching (pad on the fly with our own collate_fn)\n",
2340
        "def collate_fn(examples):\n",
2341
        "    return tokenizer.pad(examples, return_tensors='np')\n",
2342
        "\n",
2343
        "# to_tf_dataset() returns a tf.data.Dataset that we can pass straight to model.fit().\n",
2344
        "encoded_tf_dataset = encoded_dataset['train'].to_tf_dataset(\n",
2345
        "    columns=columns,\n",
2346
        "    collate_fn=collate_fn,\n",
2347
        "    batch_size=8,\n",
2348
        "    shuffle=True,\n",
2349
        ")"
2350
      ]
2351
    },
2352
    {
2353
      "cell_type": "markdown",
2354
      "metadata": {
2355
        "id": "gzxxvd3nlUGd",
2356
        "pycharm": {
2357
          "name": "#%% md\n"
2358
        }
2359
      },
2360
      "source": [
        "Next, we initialize our model. The next two blocks show model creation and training in PyTorch. For TensorFlow, skip ahead!"
2362
      ]
2363
    },
2364
    {
2365
      "cell_type": "code",
2366
      "execution_count": null,
2367
      "metadata": {
2368
        "colab": {
2369
          "base_uri": "https://localhost:8080/"
2370
        },
2371
        "id": "4mHnwMx2vSVx",
2372
        "outputId": "da56eb4d-abfd-487d-c665-3ddec1387b43",
2373
        "pycharm": {
2374
          "name": "#%%\n"
2375
        }
2376
      },
2377
      "outputs": [
2378
        {
2379
          "name": "stderr",
2380
          "output_type": "stream",
2381
          "text": [
2382
            "Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForQuestionAnswering: ['cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias']\n",
2383
            "- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
2384
            "- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
2385
            "Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']\n",
2386
            "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
2387
          ]
2388
        }
2389
      ],
2390
      "source": [
        "# Let's load a pretrained BERT model and a simple optimizer\n",
2392
        "from transformers import AutoModelForQuestionAnswering\n",
2393
        "\n",
2394
        "model = AutoModelForQuestionAnswering.from_pretrained('bert-base-cased', return_dict=True)\n",
2395
        "optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)"
2396
      ]
2397
    },
2398
    {
2399
      "cell_type": "code",
2400
      "execution_count": null,
2401
      "metadata": {
2402
        "colab": {
2403
          "base_uri": "https://localhost:8080/"
2404
        },
2405
        "id": "biqDH9vpvSVz",
2406
        "outputId": "130d9dd3-e822-4a7e-90a9-a2bfef66c2d9",
2407
        "pycharm": {
2408
          "name": "#%%\n"
2409
        }
2410
      },
2411
      "outputs": [
2412
        {
2413
          "name": "stdout",
2414
          "output_type": "stream",
2415
          "text": [
2416
            "Step 0 - loss: 5.65\n",
2417
            "Step 1 - loss: 5.63\n",
2418
            "Step 2 - loss: 5.18\n",
2419
            "Step 3 - loss: 5.6\n",
2420
            "Step 4 - loss: 5.29\n",
2421
            "Step 5 - loss: 5.51\n",
2422
            "Step 6 - loss: 5.49\n"
2423
          ]
2424
        }
2425
      ],
      "source": [
        "# Now let's train our model\n",
        "device = 'cuda' if torch.cuda.is_available() else 'cpu'\n",
        "\n",
        "model.train().to(device)\n",
        "for i, batch in enumerate(dataloader):\n",
        "    batch.to(device)\n",
        "    outputs = model(**batch)\n",
        "    loss = outputs.loss\n",
        "    loss.backward()\n",
        "    optimizer.step()\n",
        "    # Reset the gradients so they don't accumulate across steps\n",
        "    model.zero_grad()\n",
        "    print(f'Step {i} - loss: {loss:.3}')\n",
        "    # Stop after 7 steps (i = 0..6): this is only a quick demo\n",
        "    if i > 5:\n",
        "        break"
      ]
2442
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HmBZ6FZnlUGd",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "Next, we'll initialize and train our TensorFlow model. Note the lack of a loss argument when we `compile()` our model here! All Transformers models support computing loss internally. When no loss argument is provided, the model uses its internal loss, which is especially helpful for cases like QA models, where the loss can be quite complex."
      ]
    },
2455
    {
2456
      "cell_type": "code",
2457
      "execution_count": null,
2458
      "metadata": {
2459
        "colab": {
2460
          "base_uri": "https://localhost:8080/"
2461
        },
2462
        "id": "XnX5xPd9lUGd",
2463
        "outputId": "7b779ab0-9959-4f01-f724-6503039a6831",
2464
        "pycharm": {
2465
          "name": "#%%\n"
2466
        }
2467
      },
2468
      "outputs": [
2469
        {
2470
          "name": "stderr",
2471
          "output_type": "stream",
2472
          "text": [
2473
            "All model checkpoint layers were used when initializing TFBertForQuestionAnswering.\n",
2474
            "\n",
2475
            "Some layers of TFBertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['qa_outputs']\n",
2476
            "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n",
2477
            "No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.\n"
2478
          ]
2479
        }
2480
      ],
      "source": [
        "# Let's load a pretrained Bert model and a simple optimizer\n",
        "from transformers import TFAutoModelForQuestionAnswering\n",
        "import tensorflow as tf\n",
        "\n",
        "model = TFAutoModelForQuestionAnswering.from_pretrained('bert-base-cased')\n",
        "# No loss argument!\n",
        "model.compile(optimizer=tf.keras.optimizers.Adam(1e-5))"
      ]
2490
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NcOtZ86mlUGe",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "Now that all the preprocessing is done, training is an extremely comforting single line of Keras. We stop training early with the `steps_per_epoch` argument; you should probably leave that one out of your actual production code!"
      ]
    },
2503
    {
2504
      "cell_type": "code",
2505
      "execution_count": null,
2506
      "metadata": {
2507
        "colab": {
2508
          "base_uri": "https://localhost:8080/"
2509
        },
2510
        "id": "uJ4B9qU-lUGe",
2511
        "outputId": "1243a53e-e292-49eb-eb77-c54f35786510",
2512
        "pycharm": {
2513
          "name": "#%%\n"
2514
        }
2515
      },
2516
      "outputs": [
2517
        {
2518
          "name": "stdout",
2519
          "output_type": "stream",
2520
          "text": [
2521
            "3/3 [==============================] - 73s 927ms/step - loss: 5.5575\n"
2522
          ]
2523
        },
2524
        {
2525
          "data": {
2526
            "text/plain": [
2527
              "<keras.callbacks.History at 0x7fde0ab9e530>"
2528
            ]
2529
          },
2530
          "execution_count": 44,
2531
          "metadata": {},
2532
          "output_type": "execute_result"
2533
        }
2534
      ],
      "source": [
        "model.fit(encoded_tf_dataset, epochs=1, steps_per_epoch=3)"
      ]
2538
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ySL-vDadvSV8",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "Example with an NER metric: `seqeval`"
      ]
    },
2551
    {
2552
      "cell_type": "code",
2553
      "execution_count": null,
2554
      "metadata": {
2555
        "colab": {
2556
          "base_uri": "https://localhost:8080/"
2557
        },
2558
        "id": "f4uZym7MvSV9",
2559
        "outputId": "2ba24e81-9b35-4284-da34-38221885a4da",
2560
        "pycharm": {
2561
          "name": "#%%\n"
2562
        }
2563
      },
2564
      "outputs": [
2565
        {
2566
          "name": "stdout",
2567
          "output_type": "stream",
2568
          "text": [
2569
            "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
2570
            "Requirement already satisfied: evaluate in /usr/local/lib/python3.10/dist-packages (0.4.0)\n",
2571
            "Requirement already satisfied: seqeval in /usr/local/lib/python3.10/dist-packages (1.2.2)\n",
2572
            "Requirement already satisfied: datasets>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from evaluate) (2.12.0)\n",
2573
            "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from evaluate) (1.22.4)\n",
2574
            "Requirement already satisfied: dill in /usr/local/lib/python3.10/dist-packages (from evaluate) (0.3.6)\n",
2575
            "Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from evaluate) (1.5.3)\n",
2576
            "Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.10/dist-packages (from evaluate) (2.27.1)\n",
2577
            "Requirement already satisfied: tqdm>=4.62.1 in /usr/local/lib/python3.10/dist-packages (from evaluate) (4.65.0)\n",
2578
            "Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from evaluate) (3.2.0)\n",
2579
            "Requirement already satisfied: multiprocess in /usr/local/lib/python3.10/dist-packages (from evaluate) (0.70.14)\n",
2580
            "Requirement already satisfied: fsspec[http]>=2021.05.0 in /usr/local/lib/python3.10/dist-packages (from evaluate) (2023.4.0)\n",
2581
            "Requirement already satisfied: huggingface-hub>=0.7.0 in /usr/local/lib/python3.10/dist-packages (from evaluate) (0.14.1)\n",
2582
            "Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from evaluate) (23.1)\n",
2583
            "Requirement already satisfied: responses<0.19 in /usr/local/lib/python3.10/dist-packages (from evaluate) (0.18.0)\n",
2584
            "Requirement already satisfied: scikit-learn>=0.21.3 in /usr/local/lib/python3.10/dist-packages (from seqeval) (1.2.2)\n",
2585
            "Requirement already satisfied: pyarrow>=8.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->evaluate) (9.0.0)\n",
2586
            "Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->evaluate) (3.8.4)\n",
2587
            "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->evaluate) (6.0)\n",
2588
            "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.7.0->evaluate) (3.12.0)\n",
2589
            "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.7.0->evaluate) (4.5.0)\n",
2590
            "Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->evaluate) (1.26.15)\n",
2591
            "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->evaluate) (2022.12.7)\n",
2592
            "Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->evaluate) (2.0.12)\n",
2593
            "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->evaluate) (3.4)\n",
2594
            "Requirement already satisfied: scipy>=1.3.2 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.21.3->seqeval) (1.10.1)\n",
2595
            "Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.21.3->seqeval) (1.2.0)\n",
2596
            "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.21.3->seqeval) (3.1.0)\n",
2597
            "Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas->evaluate) (2.8.2)\n",
2598
            "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->evaluate) (2022.7.1)\n",
2599
            "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->evaluate) (23.1.0)\n",
2600
            "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->evaluate) (6.0.4)\n",
2601
            "Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->evaluate) (4.0.2)\n",
2602
            "Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->evaluate) (1.9.2)\n",
2603
            "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->evaluate) (1.3.3)\n",
2604
            "Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->evaluate) (1.3.1)\n",
2605
            "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas->evaluate) (1.16.0)\n"
2606
          ]
2607
        },
2608
        {
2609
          "data": {
2610
            "text/plain": [
2611
              "{'MISC': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1},\n",
2612
              " 'PER': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},\n",
2613
              " 'overall_precision': 0.5,\n",
2614
              " 'overall_recall': 0.5,\n",
2615
              " 'overall_f1': 0.5,\n",
2616
              " 'overall_accuracy': 0.8}"
2617
            ]
2618
          },
2619
          "execution_count": 45,
2620
          "metadata": {},
2621
          "output_type": "execute_result"
2622
        }
2623
      ],
      "source": [
        "!pip install evaluate seqeval\n",
        "import evaluate\n",
        "ner_metric = evaluate.load('seqeval')\n",
        "references = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]\n",
        "predictions = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]\n",
        "ner_metric.compute(predictions=predictions, references=references)"
      ]
2632
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ctY6AIAilLdH",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "# Adding a new dataset\n",
        "\n",
        "Datasets can be uploaded directly to a user or organization on the Hugging Face Hub with `my_dataset.push_to_hub('username/my_dataset_name')` (as with models in `transformers`). The dataset is then accessible under the given user or organization name, e.g. `datasets.load_dataset('thomwolf/squad')`.\n",
        "\n",
        "You can also upload your data files directly on the website (see the [step-by-step guide](https://huggingface.co/docs/datasets/upload_dataset)) or with git (see [how to do it using git](https://huggingface.co/docs/datasets/share))."
      ]
    }
2649
  ],
2650
  "metadata": {
2651
    "accelerator": "GPU",
2652
    "colab": {
2653
      "gpuType": "T4",
2654
      "name": "HuggingFace datasets library - Overview",
2655
      "provenance": [],
2656
      "toc_visible": true
2657
    },
2658
    "file_extension": ".py",
2659
    "gpuClass": "standard",
2660
    "kernelspec": {
2661
      "display_name": "Python 3 (ipykernel)",
2662
      "language": "python",
2663
      "name": "python3"
2664
    },
2665
    "language_info": {
2666
      "codemirror_mode": {
2667
        "name": "ipython",
2668
        "version": 3
2669
      },
2670
      "file_extension": ".py",
2671
      "mimetype": "text/x-python",
2672
      "name": "python",
2673
      "nbconvert_exporter": "python",
2674
      "pygments_lexer": "ipython3",
2675
      "version": "3.10.0"
2676
    },
2677
    "mimetype": "text/x-python",
2678
    "name": "python",
2679
    "npconvert_exporter": "python",
2680
    "pygments_lexer": "ipython3",
2681
    "version": 3,
2682
    "widgets": {
2683
      "application/vnd.jupyter.widget-state+json": {
2684
        "09e1966daf9e481da118af73af218d88": {
2685
          "model_module": "@jupyter-widgets/controls",
2686
          "model_module_version": "1.5.0",
2687
          "model_name": "ProgressStyleModel",
2688
          "state": {
2689
            "_model_module": "@jupyter-widgets/controls",
2690
            "_model_module_version": "1.5.0",
2691
            "_model_name": "ProgressStyleModel",
2692
            "_view_count": null,
2693
            "_view_module": "@jupyter-widgets/base",
2694
            "_view_module_version": "1.2.0",
2695
            "_view_name": "StyleView",
2696
            "bar_color": null,
2697
            "description_width": ""
2698
          }
2699
        },
2700
        "0b3581ddec0b4cabb33593e272a50249": {
2701
          "model_module": "@jupyter-widgets/controls",
2702
          "model_module_version": "1.5.0",
2703
          "model_name": "HTMLModel",
2704
          "state": {
2705
            "_dom_classes": [],
2706
            "_model_module": "@jupyter-widgets/controls",
2707
            "_model_module_version": "1.5.0",
2708
            "_model_name": "HTMLModel",
2709
            "_view_count": null,
2710
            "_view_module": "@jupyter-widgets/controls",
2711
            "_view_module_version": "1.5.0",
2712
            "_view_name": "HTMLView",
2713
            "description": "",
2714
            "description_tooltip": null,
2715
            "layout": "IPY_MODEL_b3d5c33915084f26b060e086138bf898",
2716
            "placeholder": "​",
2717
            "style": "IPY_MODEL_429bdd21215f4ef38687daa6def128f8",
2718
            "value": " 1057/1057 [00:00&lt;00:00, 2026.75 examples/s]"
2719
          }
2720
        },
2721
        "21a2deb93c614338a9944b5032220c8d": {
2722
          "model_module": "@jupyter-widgets/base",
2723
          "model_module_version": "1.2.0",
2724
          "model_name": "LayoutModel",
2725
          "state": {
2726
            "_model_module": "@jupyter-widgets/base",
2727
            "_model_module_version": "1.2.0",
2728
            "_model_name": "LayoutModel",
2729
            "_view_count": null,
2730
            "_view_module": "@jupyter-widgets/base",
2731
            "_view_module_version": "1.2.0",
2732
            "_view_name": "LayoutView",
2733
            "align_content": null,
2734
            "align_items": null,
2735
            "align_self": null,
2736
            "border": null,
2737
            "bottom": null,
2738
            "display": null,
2739
            "flex": null,
2740
            "flex_flow": null,
2741
            "grid_area": null,
2742
            "grid_auto_columns": null,
2743
            "grid_auto_flow": null,
2744
            "grid_auto_rows": null,
2745
            "grid_column": null,
2746
            "grid_gap": null,
2747
            "grid_row": null,
2748
            "grid_template_areas": null,
2749
            "grid_template_columns": null,
2750
            "grid_template_rows": null,
2751
            "height": null,
2752
            "justify_content": null,
2753
            "justify_items": null,
2754
            "left": null,
2755
            "margin": null,
2756
            "max_height": null,
2757
            "max_width": null,
2758
            "min_height": null,
2759
            "min_width": null,
2760
            "object_fit": null,
2761
            "object_position": null,
2762
            "order": null,
2763
            "overflow": null,
2764
            "overflow_x": null,
2765
            "overflow_y": null,
2766
            "padding": null,
2767
            "right": null,
2768
            "top": null,
2769
            "visibility": null,
2770
            "width": null
2771
          }
2772
        },
2773
        "329b19be2aff486f8a737751ead4d79c": {
2774
          "model_module": "@jupyter-widgets/base",
2775
          "model_module_version": "1.2.0",
2776
          "model_name": "LayoutModel",
2777
          "state": {
2778
            "_model_module": "@jupyter-widgets/base",
2779
            "_model_module_version": "1.2.0",
2780
            "_model_name": "LayoutModel",
2781
            "_view_count": null,
2782
            "_view_module": "@jupyter-widgets/base",
2783
            "_view_module_version": "1.2.0",
2784
            "_view_name": "LayoutView",
2785
            "align_content": null,
2786
            "align_items": null,
2787
            "align_self": null,
2788
            "border": null,
2789
            "bottom": null,
2790
            "display": null,
2791
            "flex": null,
2792
            "flex_flow": null,
2793
            "grid_area": null,
2794
            "grid_auto_columns": null,
2795
            "grid_auto_flow": null,
2796
            "grid_auto_rows": null,
2797
            "grid_column": null,
2798
            "grid_gap": null,
2799
            "grid_row": null,
2800
            "grid_template_areas": null,
2801
            "grid_template_columns": null,
2802
            "grid_template_rows": null,
2803
            "height": null,
2804
            "justify_content": null,
2805
            "justify_items": null,
2806
            "left": null,
2807
            "margin": null,
2808
            "max_height": null,
2809
            "max_width": null,
2810
            "min_height": null,
2811
            "min_width": null,
2812
            "object_fit": null,
2813
            "object_position": null,
2814
            "order": null,
2815
            "overflow": null,
2816
            "overflow_x": null,
2817
            "overflow_y": null,
2818
            "padding": null,
2819
            "right": null,
2820
            "top": null,
2821
            "visibility": null,
2822
            "width": null
2823
          }
2824
        },
2825
        "429bdd21215f4ef38687daa6def128f8": {
2826
          "model_module": "@jupyter-widgets/controls",
2827
          "model_module_version": "1.5.0",
2828
          "model_name": "DescriptionStyleModel",
2829
          "state": {
2830
            "_model_module": "@jupyter-widgets/controls",
2831
            "_model_module_version": "1.5.0",
2832
            "_model_name": "DescriptionStyleModel",
2833
            "_view_count": null,
2834
            "_view_module": "@jupyter-widgets/base",
2835
            "_view_module_version": "1.2.0",
2836
            "_view_name": "StyleView",
2837
            "description_width": ""
2838
          }
2839
        },
2840
        "51f49669810a4b5f941c18e4b1896866": {
2841
          "model_module": "@jupyter-widgets/controls",
2842
          "model_module_version": "1.5.0",
2843
          "model_name": "HTMLModel",
2844
          "state": {
2845
            "_dom_classes": [],
2846
            "_model_module": "@jupyter-widgets/controls",
2847
            "_model_module_version": "1.5.0",
2848
            "_model_name": "HTMLModel",
2849
            "_view_count": null,
2850
            "_view_module": "@jupyter-widgets/controls",
2851
            "_view_module_version": "1.5.0",
2852
            "_view_name": "HTMLView",
2853
            "description": "",
2854
            "description_tooltip": null,
2855
            "layout": "IPY_MODEL_21a2deb93c614338a9944b5032220c8d",
2856
            "placeholder": "​",
2857
            "style": "IPY_MODEL_d8494cdc5ce04f4690a9adadb921de4c",
2858
            "value": " 981/1057 [00:00&lt;00:00, 1238.52 examples/s]"
2859
          }
2860
        },
2861
        "5a78f50d4f4742f08ee3abe4e9c38129": {
2862
          "model_module": "@jupyter-widgets/controls",
2863
          "model_module_version": "1.5.0",
2864
          "model_name": "ProgressStyleModel",
2865
          "state": {
2866
            "_model_module": "@jupyter-widgets/controls",
2867
            "_model_module_version": "1.5.0",
2868
            "_model_name": "ProgressStyleModel",
2869
            "_view_count": null,
2870
            "_view_module": "@jupyter-widgets/base",
2871
            "_view_module_version": "1.2.0",
2872
            "_view_name": "StyleView",
2873
            "bar_color": null,
2874
            "description_width": ""
2875
          }
2876
        },
2877
        "60682d73f15b4020b57f87dabba5f320": {
2878
          "model_module": "@jupyter-widgets/controls",
2879
          "model_module_version": "1.5.0",
2880
          "model_name": "FloatProgressModel",
2881
          "state": {
2882
            "_dom_classes": [],
2883
            "_model_module": "@jupyter-widgets/controls",
2884
            "_model_module_version": "1.5.0",
2885
            "_model_name": "FloatProgressModel",
2886
            "_view_count": null,
2887
            "_view_module": "@jupyter-widgets/controls",
2888
            "_view_module_version": "1.5.0",
2889
            "_view_name": "ProgressView",
2890
            "bar_style": "",
2891
            "description": "",
2892
            "description_tooltip": null,
2893
            "layout": "IPY_MODEL_d8edc4f0a0a44882a7beeca0321276d6",
2894
            "max": 1057,
2895
            "min": 0,
2896
            "orientation": "horizontal",
2897
            "style": "IPY_MODEL_09e1966daf9e481da118af73af218d88",
2898
            "value": 1057
2899
          }
2900
        },
2901
        "74ff87c33af14cc093694692397a9ee0": {
2902
          "model_module": "@jupyter-widgets/base",
2903
          "model_module_version": "1.2.0",
2904
          "model_name": "LayoutModel",
2905
          "state": {
2906
            "_model_module": "@jupyter-widgets/base",
2907
            "_model_module_version": "1.2.0",
2908
            "_model_name": "LayoutModel",
2909
            "_view_count": null,
2910
            "_view_module": "@jupyter-widgets/base",
2911
            "_view_module_version": "1.2.0",
2912
            "_view_name": "LayoutView",
2913
            "align_content": null,
2914
            "align_items": null,
2915
            "align_self": null,
2916
            "border": null,
2917
            "bottom": null,
2918
            "display": null,
2919
            "flex": null,
2920
            "flex_flow": null,
2921
            "grid_area": null,
2922
            "grid_auto_columns": null,
2923
            "grid_auto_flow": null,
2924
            "grid_auto_rows": null,
2925
            "grid_column": null,
2926
            "grid_gap": null,
2927
            "grid_row": null,
2928
            "grid_template_areas": null,
2929
            "grid_template_columns": null,
2930
            "grid_template_rows": null,
2931
            "height": null,
2932
            "justify_content": null,
2933
            "justify_items": null,
2934
            "left": null,
2935
            "margin": null,
2936
            "max_height": null,
2937
            "max_width": null,
2938
            "min_height": null,
2939
            "min_width": null,
2940
            "object_fit": null,
2941
            "object_position": null,
2942
            "order": null,
2943
            "overflow": null,
2944
            "overflow_x": null,
2945
            "overflow_y": null,
2946
            "padding": null,
2947
            "right": null,
2948
            "top": null,
2949
            "visibility": null,
2950
            "width": null
2951
          }
2952
        },
2953
        "757dd94ac5e04ff09ee6fab419f1692d": {
2954
          "model_module": "@jupyter-widgets/base",
2955
          "model_module_version": "1.2.0",
2956
          "model_name": "LayoutModel",
2957
          "state": {
2958
            "_model_module": "@jupyter-widgets/base",
2959
            "_model_module_version": "1.2.0",
2960
            "_model_name": "LayoutModel",
2961
            "_view_count": null,
2962
            "_view_module": "@jupyter-widgets/base",
2963
            "_view_module_version": "1.2.0",
2964
            "_view_name": "LayoutView",
2965
            "align_content": null,
2966
            "align_items": null,
2967
            "align_self": null,
2968
            "border": null,
2969
            "bottom": null,
2970
            "display": null,
2971
            "flex": null,
2972
            "flex_flow": null,
2973
            "grid_area": null,
2974
            "grid_auto_columns": null,
2975
            "grid_auto_flow": null,
2976
            "grid_auto_rows": null,
2977
            "grid_column": null,
2978
            "grid_gap": null,
2979
            "grid_row": null,
2980
            "grid_template_areas": null,
2981
            "grid_template_columns": null,
2982
            "grid_template_rows": null,
2983
            "height": null,
2984
            "justify_content": null,
2985
            "justify_items": null,
2986
            "left": null,
2987
            "margin": null,
2988
            "max_height": null,
2989
            "max_width": null,
2990
            "min_height": null,
2991
            "min_width": null,
2992
            "object_fit": null,
2993
            "object_position": null,
2994
            "order": null,
2995
            "overflow": null,
2996
            "overflow_x": null,
2997
            "overflow_y": null,
2998
            "padding": null,
2999
            "right": null,
3000
            "top": null,
3001
            "visibility": null,
3002
            "width": null
3003
          }
3004
        },
3005
        "7edfe69de64a4af18febff677b57ab65": {
3006
          "model_module": "@jupyter-widgets/controls",
3007
          "model_module_version": "1.5.0",
3008
          "model_name": "HBoxModel",
3009
          "state": {
3010
            "_dom_classes": [],
3011
            "_model_module": "@jupyter-widgets/controls",
3012
            "_model_module_version": "1.5.0",
3013
            "_model_name": "HBoxModel",
3014
            "_view_count": null,
3015
            "_view_module": "@jupyter-widgets/controls",
3016
            "_view_module_version": "1.5.0",
3017
            "_view_name": "HBoxView",
3018
            "box_style": "",
3019
            "children": [
3020
              "IPY_MODEL_dc5418db9c3e49cd95b3f85f0dc562ab",
3021
              "IPY_MODEL_8db902b229e545649282c130c2a049b8",
3022
              "IPY_MODEL_0b3581ddec0b4cabb33593e272a50249"
3023
            ],
3024
            "layout": "IPY_MODEL_d580bdf43d1e44b8afcfefc962410d73"
3025
          }
3026
        },
3027
        "8db902b229e545649282c130c2a049b8": {
3028
          "model_module": "@jupyter-widgets/controls",
3029
          "model_module_version": "1.5.0",
3030
          "model_name": "FloatProgressModel",
3031
          "state": {
3032
            "_dom_classes": [],
3033
            "_model_module": "@jupyter-widgets/controls",
3034
            "_model_module_version": "1.5.0",
3035
            "_model_name": "FloatProgressModel",
3036
            "_view_count": null,
3037
            "_view_module": "@jupyter-widgets/controls",
3038
            "_view_module_version": "1.5.0",
3039
            "_view_name": "ProgressView",
3040
            "bar_style": "",
3041
            "description": "",
3042
            "description_tooltip": null,
3043
            "layout": "IPY_MODEL_757dd94ac5e04ff09ee6fab419f1692d",
3044
            "max": 1057,
3045
            "min": 0,
3046
            "orientation": "horizontal",
3047
            "style": "IPY_MODEL_d46a381be01a460cb49cc838c5aa29c0",
3048
            "value": 1057
3049
          }
3050
        },
3051
        "8f77a47ffc79400cbd84280e8bbc9979": {
3052
          "model_module": "@jupyter-widgets/controls",
3053
          "model_module_version": "1.5.0",
3054
          "model_name": "HTMLModel",
3055
          "state": {
3056
            "_dom_classes": [],
3057
            "_model_module": "@jupyter-widgets/controls",
3058
            "_model_module_version": "1.5.0",
3059
            "_model_name": "HTMLModel",
3060
            "_view_count": null,
3061
            "_view_module": "@jupyter-widgets/controls",
3062
            "_view_module_version": "1.5.0",
3063
            "_view_name": "HTMLView",
3064
            "description": "",
3065
            "description_tooltip": null,
3066
            "layout": "IPY_MODEL_74ff87c33af14cc093694692397a9ee0",
3067
            "placeholder": "​",
3068
            "style": "IPY_MODEL_cf01a82f5de54ffb97af38ca88e170c2",
3069
            "value": " 2/2 [00:00&lt;00:00, 33.51it/s]"
3070
          }
3071
        },
3072
        "961929641bfc4b06b0603bd792c6d351": {
3073
          "model_module": "@jupyter-widgets/controls",
3074
          "model_module_version": "1.5.0",
3075
          "model_name": "HBoxModel",
3076
          "state": {
3077
            "_dom_classes": [],
3078
            "_model_module": "@jupyter-widgets/controls",
3079
            "_model_module_version": "1.5.0",
3080
            "_model_name": "HBoxModel",
3081
            "_view_count": null,
3082
            "_view_module": "@jupyter-widgets/controls",
3083
            "_view_module_version": "1.5.0",
3084
            "_view_name": "HBoxView",
3085
            "box_style": "",
3086
            "children": [
3087
              "IPY_MODEL_c497e117ef7142338bd45e57b722616b",
3088
              "IPY_MODEL_60682d73f15b4020b57f87dabba5f320",
3089
              "IPY_MODEL_51f49669810a4b5f941c18e4b1896866"
3090
            ],
3091
            "layout": "IPY_MODEL_f4da65dff9374ace9b92d341ec2793f1"
3092
          }
3093
        },
        "98a45b56fdb040418e42f7c59e28bc14": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HBoxModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HBoxModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HBoxView",
            "box_style": "",
            "children": [
              "IPY_MODEL_bf67657a3f5d47a79d078beb8589a098",
              "IPY_MODEL_f0fa32d1b256417db2850569674350d9",
              "IPY_MODEL_8f77a47ffc79400cbd84280e8bbc9979"
            ],
            "layout": "IPY_MODEL_defca41aeb5b4f8689930bfea05915f1"
          }
        },
        "9b5b8acd984f44d696f8f83862f20bf1": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "b3d5c33915084f26b060e086138bf898": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "bf67657a3f5d47a79d078beb8589a098": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HTMLModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HTMLModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HTMLView",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_dbc2e3e6c2cb4c108d46430e132777a1",
            "placeholder": "​",
            "style": "IPY_MODEL_f3aa463526554d9da89c2fd0fe8efe2a",
            "value": "100%"
          }
        },
        "c497e117ef7142338bd45e57b722616b": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HTMLModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HTMLModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HTMLView",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_9b5b8acd984f44d696f8f83862f20bf1",
            "placeholder": "​",
            "style": "IPY_MODEL_f9fdd11e8b6f411e818447528be333df",
            "value": "Map:  93%"
          }
        },
        "cbcbb3853ed544f8b946aab31eaa7f56": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "cf01a82f5de54ffb97af38ca88e170c2": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "DescriptionStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "DescriptionStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "description_width": ""
          }
        },
        "d46a381be01a460cb49cc838c5aa29c0": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "ProgressStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "ProgressStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "bar_color": null,
            "description_width": ""
          }
        },
        "d580bdf43d1e44b8afcfefc962410d73": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": "hidden",
            "width": null
          }
        },
        "d8494cdc5ce04f4690a9adadb921de4c": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "DescriptionStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "DescriptionStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "description_width": ""
          }
        },
        "d8edc4f0a0a44882a7beeca0321276d6": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "dbc2e3e6c2cb4c108d46430e132777a1": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "dc5418db9c3e49cd95b3f85f0dc562ab": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HTMLModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HTMLModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HTMLView",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_cbcbb3853ed544f8b946aab31eaa7f56",
            "placeholder": "​",
            "style": "IPY_MODEL_e21ced63bda64379832735d5aa2e0178",
            "value": "Map: 100%"
          }
        },
        "defca41aeb5b4f8689930bfea05915f1": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "e21ced63bda64379832735d5aa2e0178": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "DescriptionStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "DescriptionStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "description_width": ""
          }
        },
        "f0fa32d1b256417db2850569674350d9": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "FloatProgressModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "FloatProgressModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "ProgressView",
            "bar_style": "success",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_329b19be2aff486f8a737751ead4d79c",
            "max": 2,
            "min": 0,
            "orientation": "horizontal",
            "style": "IPY_MODEL_5a78f50d4f4742f08ee3abe4e9c38129",
            "value": 2
          }
        },
        "f3aa463526554d9da89c2fd0fe8efe2a": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "DescriptionStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "DescriptionStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "description_width": ""
          }
        },
        "f4da65dff9374ace9b92d341ec2793f1": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": "hidden",
            "width": null
          }
        },
        "f9fdd11e8b6f411e818447528be333df": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "DescriptionStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "DescriptionStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "description_width": ""
          }
        }
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}