{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "V6iqZHPjN2S0"
      },
      "source": [
        "# License\n",
        "Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "you may not use this file except in compliance with the License.\n",
        "You may obtain a copy of the License at:\n",
        "\n",
        "https://www.apache.org/licenses/LICENSE-2.0\n",
        "\n",
        "Unless required by applicable law or agreed to in writing, software\n",
        "distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "See the License for the specific language governing permissions and\n",
        "limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SNwvSD5wN7Er"
      },
      "source": [
        "# Instructions\n",
        "\n",
        "This notebook reproduces the experiments reported in the publication:\n",
        "\n",
        "\"[*Multipath Agents for Modular Multitask ML Systems*](https://arxiv.org/abs/2302.02721)\" (2023)\n",
        "\n",
        "---\n",
        "To start an experiment:\n",
        "---\n",
        "1. Choose the agent type by setting the `AGENT` variable in the configuration below.\n",
        "Select `Vit` for the singlepath agent or `MultiVit` for the multipath agent.\n",
        "These agents use a ViT-Large root model.\n",
        "Agent types suffixed with `T3` use a ViT-Tiny root model capped to 3 layers. Agent types suffixed with `B` use a ViT-Base root model.\n",
        "\n",
        "1. Set `TASK_NAME` to the string id of the task assigned to this instance of the selected agent.\n",
        "Refer to `TFDS_IMAGE_CLASSIFCATON_DATASETS` and `VTAB_TASKS` (below) for lists of task ids that have been tested with the current code.\n",
        "These lists contain task ids from the [TensorFlow Datasets Catalog](https://www.tensorflow.org/datasets/catalog/overview).\n",
        "Note that some tasks require manual download; refer to the corresponding catalog page for instructions. **WARNING**: The system state must be populated with at least one root model before training an agent on any task. To generate the root model, set `TASK_NAME` to `\"root_model/checkpoint\"` to load a pretrained root model, or to `\"root_model/random_init\"` to generate a randomly initialized one.\n",
        "\n",
        "1. Set `NUM_CYCLES_MAX` to the desired number of evolutionary cycles. Additional configuration parameters can be modified in the agent definitions code below. Configurations default to the settings described in the publication.\n",
        "\n",
        "1. Set `SYSTEM_STATE_RELATIVE_DIR` to the relative path where the system state will be read and written.\n",
        "\n",
        "1. By default, the system state is stored in a temporary folder on the Virtual Machine (VM). This temporary folder is deleted when the VM is stopped or restarted.\n",
        "It is possible to store the system state folder on your Google Drive by activating the\n",
        "`SAVE_SYSTEM_STATE_ON_GOOGLE_DRIVE` option. In this case, you will be prompted for access approval and the system state folder will be saved in a folder named `\"munet_experiments\"` under your Google Drive root folder. It is also possible to store the system state in a Google Drive folder shared with multiple users by creating a link to the shared folder in your Google Drive and then setting `GDRIVE_ROOT_DIR` (below) to the path of the linked shared folder.\n",
        "\n",
        "1. To start the experiment, select \"Connect to a hosted runtime\" from the dropdown menu on the top right, and then select \"Run all\" from the \"Runtime\" menu. A free hosted CPU runtime is always available. Free access to GPU and TPU accelerators is occasionally provided by the service depending on availability.\n",
        "\n",
        "---\n",
        "During the experiment execution:\n",
        "---\n",
        "\n",
        "1. The print output is displayed after the last cell of this Colab.\n",
        "\n",
        "1. The system state folder is populated with a subfolder for each agent.\n",
        "The name of each agent folder is prefixed with the `AGENT` type string and suffixed with the `TASK_NAME`.\n",
        "Each agent directory is populated with incremental state subfolders containing the sharded state of the architectures and parameters generated by the agent during the corresponding evolutionary cycle.\n",
        "\n",
        "1. Any number of agents can be started asynchronously and run in parallel.\n",
        "It is possible to resume an interrupted agent training by restarting the execution with the same configuration.\n",
        "It is possible to continue a completed training by increasing `NUM_CYCLES_MAX`.\n",
        "\n",
        "1. To achieve a multi-agent execution, multiple Colabs need to be run in parallel, each set to the same configuration but a different `TASK_NAME`.\n",
        "\n",
        "1. To achieve heterogeneous hardware execution, parallel Colab Notebooks can be connected to runtimes of different types.\n",
        "It is possible to switch between CPU, GPU and TPU by selecting `Change runtime type` in the `Resources` tab in this Colab Notebook."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "M93tll7z29rX"
      },
      "outputs": [],
      "source": [
        "# @title Agent parameters\n",
        "AGENT = \"VitT3\" # @param [\"VitT3\", \"VitB\", \"Vit\", \"MultiVitT3\", \"MultiVitB\", \"MultiVit\"] { type: \"string\", isTemplate: true }\n",
        "# Set TASK_NAME to \"root_model/checkpoint\" or \"root_model/random_init\" to initialize the population.\n",
        "TASK_NAME = \"root_model/checkpoint\"  # @param { type: \"string\", isTemplate: true }\n",
        "NUM_CYCLES_MAX = 1 # @param { type: \"integer\", isTemplate: true }\n",
        "SYSTEM_STATE_RELATIVE_DIR = \"munet_system_state/\"  # @param { type: \"string\", isTemplate: true }"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "xcaI2KII_MH5"
      },
      "outputs": [],
      "source": [
        "# Save the system state on Google Drive instead of in a temporary VM folder.\n",
        "SAVE_SYSTEM_STATE_ON_GOOGLE_DRIVE = False  # @param { type: \"boolean\", isTemplate: true }\n",
        "if SAVE_SYSTEM_STATE_ON_GOOGLE_DRIVE:\n",
        "  from google.colab import drive\n",
        "  drive.mount('/content/gdrive')\n",
        "  GDRIVE_ROOT_DIR = \"/content/gdrive/My Drive/munet_experiments/\"\n",
        "  SYSTEM_STATE_DIR = GDRIVE_ROOT_DIR + SYSTEM_STATE_RELATIVE_DIR\n",
        "  print(\"Saving system state in Google Drive.\")\n",
        "else:\n",
        "  SYSTEM_STATE_DIR = \"/tmp/\" + SYSTEM_STATE_RELATIVE_DIR\n",
        "  print(\"WARNING: Saving system state in VM, state will be lost after reboot!\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "dsiDs_mgBZOx"
      },
      "outputs": [],
      "source": [
        "# Test immutability of published paths at the beginning of each cycle.\n",
        "# Tolerance may be increased if the system is run with any context difference, e.g. hardware, input preprocessing, libraries or datasets version.\n",
        "TEST_IMMUTABILITY = False\n",
        "IMMUTABILITY_RELATIVE_TOLLERANCE = 0.001  # 0.1%"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "KZ0njfC-XCBA"
      },
      "source": [
        "# Imports"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "G9CQIJwcYN2N"
      },
      "outputs": [],
      "source": [
        "!pip install --upgrade -q pip jax jaxlib\n",
        "!pip install --upgrade -q git+https://github.com/google/flax.git\n",
        "!pip install -q ml_collections\n",
        "!pip install -q tensorflow_addons\n",
        "![ -d task_adaptation ] || git clone --depth=1 https://github.com/google-research/task_adaptation\n",
        "![ -d vision_transformer ] || git clone --depth=1 https://github.com/google-research/vision_transformer\n",
        "\n",
        "import sys\n",
        "if './task_adaptation' not in sys.path:\n",
        "  sys.path.append('./task_adaptation')\n",
        "if './vision_transformer' not in sys.path:\n",
        "  sys.path.append('./vision_transformer')\n",
        "\n",
        "import jax.tools.colab_tpu\n",
        "try:\n",
        "  jax.tools.colab_tpu.setup_tpu()\n",
        "except Exception:\n",
        "  pass  # Not a TPU runtime."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "LzhEweKzMg6k"
      },
      "outputs": [],
      "source": [
        "import copy\n",
        "import flax\n",
        "import flax.linen as nn\n",
        "import gc\n",
        "import inspect\n",
        "import jax\n",
        "import jax.numpy as jnp\n",
        "import json\n",
        "import math\n",
        "import numpy as np\n",
        "import optax\n",
        "import os\n",
        "import pandas as pd\n",
        "import random\n",
        "import re\n",
        "import tensorflow as tf\n",
        "import tensorflow_datasets as tfds\n",
        "import time\n",
        "from collections import defaultdict\n",
        "from functools import partial\n",
        "from flax.training import checkpoints as flax_checkpoints\n",
        "from ml_collections import ConfigDict, FrozenConfigDict\n",
        "from tensorflow.io import gfile\n",
        "from threading import Thread, Lock\n",
        "from typing import Any, Type\n",
        "\n",
        "tf.compat.v1.enable_eager_execution()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "6QHBzcuUYeh5"
      },
      "outputs": [],
      "source": [
        "# ViT imports.\n",
        "from vision_transformer.vit_jax import input_pipeline\n",
        "from vision_transformer.vit_jax import checkpoint\n",
        "from vision_transformer.vit_jax.configs import models as models_config  # Model configurations.\n",
        "from vision_transformer.vit_jax import models_vit as models  # Actual model code.\n",
        "# VTAB imports.\n",
        "import task_adaptation.registry as task_adapt_registry\n",
        "import task_adaptation.data.caltech\n",
        "import task_adaptation.data.cifar\n",
        "import task_adaptation.data.dtd\n",
        "import task_adaptation.data.oxford_flowers102\n",
        "import task_adaptation.data.oxford_iiit_pet\n",
        "import task_adaptation.data.sun397\n",
        "import task_adaptation.data.svhn\n",
        "import task_adaptation.data.patch_camelyon\n",
        "import task_adaptation.data.eurosat\n",
        "import task_adaptation.data.resisc45\n",
        "import task_adaptation.data.diabetic_retinopathy\n",
        "import task_adaptation.data.clevr\n",
        "import task_adaptation.data.dmlab\n",
        "import task_adaptation.data.dsprites\n",
        "import task_adaptation.data.kitti\n",
        "import task_adaptation.data.smallnorb"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Zhl1L8ldXFJz"
      },
      "source": [
        "# Utils"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "SHBWj0JmpWDX"
      },
      "outputs": [],
      "source": [
        "# Ref. Tfds catalog: https://www.tensorflow.org/datasets/catalog/overview\n",
        "TFDS_IMAGE_CLASSIFCATON_DATASETS = set([\n",
        "    \"beans\",\n",
        "    \"binary_alpha_digits\",\n",
        "    \"caltech_birds2010\",\n",
        "    \"caltech_birds2011\",\n",
        "    \"cars196\",\n",
        "    \"cassava\",\n",
        "    \"cats_vs_dogs\",\n",
        "    \"cifar10\",\n",
        "    \"cifar100\",\n",
        "    \"citrus_leaves\",\n",
        "    \"cmaterdb/bangla\",\n",
        "    \"cmaterdb/devanagari\",\n",
        "    \"cmaterdb/telugu\",\n",
        "    \"colorectal_histology\",\n",
        "    \"controlled_noisy_web_labels/mini_imagenet_red\",\n",
        "    \"controlled_noisy_web_labels/mini_imagenet_blue\",\n",
        "    \"curated_breast_imaging_ddsm/patches\",\n",
        "    \"cycle_gan/apple2orange\",\n",
        "    \"cycle_gan/summer2winter_yosemite\",\n",
        "    \"cycle_gan/horse2zebra\",\n",
        "    \"cycle_gan/monet2photo\",\n",
        "    \"cycle_gan/cezanne2photo\",\n",
        "    \"cycle_gan/ukiyoe2photo\",\n",
        "    \"cycle_gan/vangogh2photo\",\n",
        "    \"cycle_gan/maps\",\n",
        "    \"cycle_gan/cityscapes\",\n",
        "    \"cycle_gan/facades\",\n",
        "    \"cycle_gan/iphone2dslr_flower\",\n",
        "    \"deep_weeds\",\n",
        "    \"domainnet/real\",\n",
        "    \"domainnet/painting\",\n",
        "    \"domainnet/clipart\",\n",
        "    \"domainnet/quickdraw\",\n",
        "    \"domainnet/infograph\",\n",
        "    \"domainnet/sketch\",\n",
        "    \"emnist/balanced\",\n",
        "    \"emnist/byclass\",\n",
        "    \"emnist/bymerge\",\n",
        "    \"emnist/digits\",\n",
        "    \"emnist/letters\",\n",
        "    \"emnist/mnist\",\n",
        "    \"fashion_mnist\",\n",
        "    \"food101\",\n",
        "    \"horses_or_humans\",\n",
        "    \"i_naturalist2017\",\n",
        "    \"i_naturalist2018\",\n",
        "    \"imagenet2012\",\n",
        "    \"imagenet_a\",\n",
        "    \"imagenet_lt\",\n",
        "    \"imagenet_r\",\n",
        "    \"imagenet_sketch\",\n",
        "    \"imagenette\",\n",
        "    \"imagewang\",\n",
        "    \"kmnist\",\n",
        "    \"malaria\",\n",
        "    \"mnist\",\n",
        "    \"omniglot\",\n",
        "    \"pet_finder\",\n",
        "    \"places365_small\",\n",
        "    \"plant_village\",\n",
        "    \"plantae_k\",\n",
        "    \"quickdraw_bitmap\",\n",
        "    \"rock_paper_scissors\",\n",
        "    \"siscore/rotation\",\n",
        "    \"siscore/size\",\n",
        "    \"siscore/location\",\n",
        "    \"stanford_dogs\",\n",
        "    \"stanford_online_products\",\n",
        "    \"stl10\",\n",
        "    \"tf_flowers\",\n",
        "    \"uc_merced\",\n",
        "    \"visual_domain_decathlon/aircraft\",\n",
        "    \"visual_domain_decathlon/cifar100\",\n",
        "    \"visual_domain_decathlon/daimlerpedcls\",\n",
        "    \"visual_domain_decathlon/dtd\",\n",
        "    \"visual_domain_decathlon/gtsrb\",\n",
        "    \"visual_domain_decathlon/imagenet12\",\n",
        "    \"visual_domain_decathlon/omniglot\",\n",
        "    \"visual_domain_decathlon/svhn\",\n",
        "    \"visual_domain_decathlon/ucf101\",\n",
        "    \"visual_domain_decathlon/vgg-flowers\",\n",
        "    ])"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "XI7xcfWLXx-G"
      },
      "outputs": [],
      "source": [
        "# Append suffix \"/1k\" to get the 1k version of each task.\n",
        "VTAB_TASKS = [\n",
        "    \"caltech101\",\n",
        "    # cifar100 and cifar10 are already listed above with a slightly different validation split but the same test set, so only the 1k versions are added here.\n",
        "    \"cifar100/1k\",\n",
        "    \"cifar10/1k\",\n",
        "    \"dtd\",\n",
        "    \"oxford_flowers102\",\n",
        "    \"oxford_iiit_pet\",\n",
        "    \"sun397\",\n",
        "    \"svhn_cropped\",\n",
        "    \"patch_camelyon\",\n",
        "    \"eurosat\",\n",
        "    \"resisc45\",\n",
        "    \"diabetic_retinopathy_detection/btgraham-300\",\n",
        "    \"clevr/count_cylinders\",  # Not in results table.\n",
        "    \"clevr/count_all\",  # Clevr-Count\n",
        "    \"clevr/closest_object_distance\",  # Clevr-Dist\n",
        "    \"dmlab\",\n",
        "    \"dsprites/label_x_position\",  # dSpr-Loc\n",
        "    \"dsprites/label_orientation\",  # dSpr-Ori\n",
        "    \"kitti/closest_object_distance\",  # Not in results table.\n",
        "    \"kitti/count_vehicles\",  # Not in results table.\n",
        "    \"kitti/closest_vehicle_distance\",  # Kitti-dist\n",
        "    \"smallnorb/label_category\",  # Not in results table.\n",
        "    \"smallnorb/label_lighting\",  # Not in results table.\n",
        "    \"smallnorb/label_azimuth\",  # Azim\n",
        "    \"smallnorb/label_elevation\",  # Elev\n",
        "    ]\n",
        "for tn in VTAB_TASKS:\n",
        "  assert tn not in TFDS_IMAGE_CLASSIFCATON_DATASETS, tn"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "OeLYCqhfXZyk"
      },
      "outputs": [],
      "source": [
        "def compute_flops_hlo(flax_module, *a, **kw):\n",
        "  # Compute flops on cpu for cross platform consistency.\n",
        "  analysis = jax.jit(flax_module, backend='cpu').lower(*a, **kw).cost_analysis()\n",
        "  return analysis[\"flops\"]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "IBFJAJe6XmH7"
      },
      "outputs": [],
      "source": [
        "class ObjectCache():\n",
        "  \"\"\"Caches objects built by `factory_fn`, keyed by their keyword arguments.\"\"\"\n",
        "  def __init__(self, factory_fn):\n",
        "    self.factory_fn = factory_fn\n",
        "    self.factory_fn_signature = inspect.signature(factory_fn)\n",
        "    self.cache = {}\n",
        "\n",
        "  def __call__(self, *args, **kwargs):\n",
        "    assert not args, \"No positional arguments allowed.\"\n",
        "    kw_params = {}\n",
        "    fn_name = self.factory_fn.__name__\n",
        "    fn_params = inspect.signature(self.factory_fn).parameters\n",
        "    for k_param, v_param in fn_params.items():\n",
        "      if k_param in kwargs:\n",
        "        kw_params[k_param] = kwargs[k_param]\n",
        "      elif v_param.default != v_param.empty:\n",
        "        # Fallback to declared default value.\n",
        "        kw_params[k_param] = fn_params[k_param].default\n",
        "      else:\n",
        "        assert False, (\n",
        "            f\"Missing value for argument {k_param} for function {fn_name}\")\n",
        "\n",
        "      if v_param.annotation != v_param.empty:\n",
        "        # Apply annotated type.\n",
        "        assert isinstance(type(v_param.annotation), type)\n",
        "        kw_params[k_param] = v_param.annotation(kw_params[k_param])\n",
        "\n",
        "    # Serialized keyword arguments identify the cached object.\n",
        "    key = json.dumps(kw_params, sort_keys=True)\n",
        "    if key not in self.cache:\n",
        "      self.cache[key] = self.factory_fn(**kw_params)\n",
        "      print(f\"Added to cache: {fn_name}({key})  [cache size {len(self.cache)}]\")\n",
        "    return self.cache[key]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AgTEwq24TYn8"
      },
      "source": [
        "# Models"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ueAYFssCUG78"
      },
      "outputs": [],
      "source": [
        "# Sample inputs\n",
        "def get_sample_images(image_size, batch_size):\n",
        "  return np.zeros((batch_size, image_size, image_size, 3))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "tkv4dVcQV4YB"
      },
      "outputs": [],
      "source": [
        "def get_num_params(params):\n",
        "  return sum(jax.tree_util.tree_flatten(\n",
        "      jax.tree_util.tree_map(lambda p: np.prod(p.shape), params)\n",
        "      )[0])"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "V6vIOQBEVWch"
      },
      "outputs": [],
      "source": [
        "def get_optimizer(\n",
        "    opt_lr: float,\n",
        "    opt_lr_schedule: str,\n",
        "    opt_lr_warmup_ratio: float,\n",
        "    opt_momentum: float,\n",
        "    opt_nesterov: bool,\n",
        "    num_train_batches_between_validations: int,\n",
        "    num_validations_per_path_training: int,\n",
        "    ):\n",
        "  min_lr = opt_lr / 1000.0\n",
        "  if opt_lr_schedule == \"constant\":\n",
        "    # Divide by 2 so that average lr is the same as other types.\n",
        "    learning_rate = 0.5 * opt_lr\n",
        "  elif opt_lr_schedule == \"linear\":\n",
        "    train_steps = int(num_train_batches_between_validations * num_validations_per_path_training)\n",
        "    warmup_steps = int(opt_lr_warmup_ratio * train_steps)\n",
        "    schedules = [\n",
        "        optax.linear_schedule(\n",
        "            init_value=min_lr,\n",
        "            end_value=opt_lr,\n",
        "            transition_steps=warmup_steps),\n",
        "        optax.linear_schedule(\n",
        "            init_value=opt_lr,\n",
        "            end_value=min_lr,\n",
        "            transition_steps=train_steps-warmup_steps)]\n",
        "    learning_rate = optax.join_schedules(schedules, [warmup_steps])\n",
        "  elif opt_lr_schedule == \"cosine\":\n",
        "    train_steps = int(num_train_batches_between_validations\n",
        "                      * num_validations_per_path_training)\n",
        "    learning_rate = optax.warmup_cosine_decay_schedule(\n",
        "        init_value=min_lr,\n",
        "        peak_value=opt_lr,\n",
        "        warmup_steps=int(opt_lr_warmup_ratio * train_steps),\n",
        "        decay_steps=train_steps)\n",
        "  elif opt_lr_schedule == \"restarts\":\n",
        "    train_steps = num_train_batches_between_validations\n",
        "    repeats = num_validations_per_path_training\n",
        "    kwargs = dict(\n",
        "        init_value=min_lr,\n",
        "        peak_value=opt_lr,\n",
        "        warmup_steps=int(opt_lr_warmup_ratio * train_steps),\n",
        "        decay_steps=train_steps,\n",
        "    )\n",
        "    kwargs = [kwargs] * repeats\n",
        "    learning_rate = optax.sgdr_schedule(kwargs)\n",
        "  else:\n",
        "    assert False, f\"Invalid lr schedule: {opt_lr_schedule}\"\n",
        "\n",
        "  return optax.chain(\n",
        "      optax.clip_by_global_norm(1.0),\n",
        "      optax.sgd(\n",
        "          learning_rate=learning_rate,\n",
        "          momentum=opt_momentum,\n",
        "          nesterov=opt_nesterov,\n",
        "          accumulator_dtype=jnp.bfloat16))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "2GpeOYxFi1Sm"
      },
      "outputs": [],
      "source": [
        "def merge_params(a, b):\n",
        "  params = a.copy(b)\n",
        "  assert len(params) == len(a) + len(b)\n",
        "  return params"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ExYTVeamDcPy"
      },
      "source": [
        "## ViT Model"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "43JOxcGVMJ4s"
      },
      "outputs": [],
      "source": [
        "class VitModelFactory():\n",
        "  @staticmethod\n",
        "  def get_model(hparams, config):\n",
        "    return get_vit_model(hparams, config)\n",
        "\n",
        "  @staticmethod\n",
        "  def get_init_comps(hparams, config):\n",
        "    return get_vit_init_comps(hparams, config)\n",
        "\n",
        "  @staticmethod\n",
        "  def get_comps2model_fn():\n",
        "    return vit_comps2model\n",
        "\n",
        "  @staticmethod\n",
        "  def get_sample_input(hparams):\n",
        "    return get_sample_images(image_size=hparams[\"ds_image_size\"], batch_size=1)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "vvZ_4-kJ9Pt3"
      },
      "outputs": [],
      "source": [
        "def get_vit_filename(query):\n",
        "  df = checkpoint.get_augreg_df()\n",
        "  res = df.query(query).filename.unique()\n",
        "  assert len(res) == 1\n",
        "  return res[0]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "0lvqd47g9ZsW"
      },
      "outputs": [],
      "source": [
        "VIT_CONFIG_CACHE = {}\n",
        "\n",
        "def get_vit_config(query):\n",
        "  global VIT_CONFIG_CACHE\n",
        "  if query not in VIT_CONFIG_CACHE:\n",
        "    filename = get_vit_filename(query)\n",
        "    config = models_config.AUGREG_CONFIGS[filename.split(\"-\")[0]].copy_and_resolve_references()\n",
        "    config.unlock()\n",
        "    # Disable dropout.\n",
        "    config.transformer.dropout_rate = 0.0\n",
        "    config.transformer.attention_dropout_rate = 0.0\n",
        "    config.lock()\n",
        "    VIT_CONFIG_CACHE[query] = config\n",
        "  return VIT_CONFIG_CACHE[query].copy_and_resolve_references()\n",
        "\n",
        "def get_set_vit_config(hparams, config):\n",
        "  path_config = get_vit_config(config.vit_checkpoint_query)\n",
        "  path_config.transformer.num_layers = int(hparams[\"num_layers\"])\n",
        "  path_config.unlock()\n",
        "  path_config.num_classes = int(hparams[\"num_classes\"])\n",
        "  if \"classifier\" in hparams:\n",
        "    path_config.classifier = hparams[\"classifier\"]\n",
        "  path_config.lock()\n",
        "  path_config = FrozenConfigDict(path_config)\n",
        "  return path_config\n",
        "\n",
        "def get_max_num_layers(query):\n",
        "  config = get_vit_config(query)\n",
        "  return config.transformer.num_layers"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "JOV3CPHMUQbG"
      },
      "outputs": [],
      "source": [
        "# Get params from ViT checkpoints.\n",
        "def get_vit_checkpoint_comps(image_size, query):\n",
        "  filename = get_vit_filename(query)\n",
        "  config = get_vit_config(query)\n",
        "  model = models.VisionTransformer(**config, num_classes=1)  # num_classes unused.\n",
        "  init_params = copy.deepcopy(jax.device_get(\n",
        "      model.init(jax.random.PRNGKey(random.randrange(int(1e10))),\n",
        "                 VitModelFactory.get_sample_input({\"ds_image_size\": image_size}),\n",
        "                 train=False  # Disables dropout, no effect on params.\n",
        "                 )[\"params\"]))\n",
        "  params = checkpoint.load_pretrained(\n",
        "    pretrained_path=f\"gs://vit_models/augreg/{filename}.npz\",\n",
        "    init_params=init_params,\n",
        "    model_config=config)\n",
        "  return vit_model2comps(params)\n",
        "\n",
        "def get_vit_checkpoint_reshaped_posembed_component(\n",
        "    agent_id: str, ds_image_size: int, query: str):\n",
        "  params = get_vit_checkpoint_comps(ds_image_size, query)[\"posembed_input\"]\n",
        "  return Component(name=\"posembed_input\",\n",
        "                   agent_id=agent_id,\n",
        "                   params=params,\n",
        "                   train_locks=[])"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "FjEWZgFIUt25"
      },
      "outputs": [],
      "source": [
        "# Get ViT model and init_params.\n",
        "def get_vit_model(hparams, config):\n",
        "  vit_config = get_set_vit_config(hparams, config)\n",
        "  return models.VisionTransformer(**vit_config)\n",
        "\n",
        "def get_vit_init_comps(hparams, config):\n",
        "  model = get_vit_model(hparams, config)\n",
        "  init_params = copy.deepcopy(jax.device_get(model.init(\n",
        "      jax.random.PRNGKey(random.randrange(int(1e10))),\n",
        "      VitModelFactory.get_sample_input(hparams),\n",
        "      train=False  # Disables dropout, no effect on params.\n",
        "      )[\"params\"]))\n",
        "  return vit_model2comps(init_params)"
      ]
    },
    {
696
      "cell_type": "code",
697
      "execution_count": null,
698
      "metadata": {
699
        "id": "txU4go5GUf-e"
700
      },
701
      "outputs": [],
702
      "source": [
703
        "# ViT parameters mapping to components.\n",
704
        "TRANSFORMER_KEYS = set(\n",
705
        "    [\"encoder_norm\", \"posembed_input\" ] + \\\n",
706
        "    [f\"encoderblock_{k}\" for k in range(30)])\n",
707
        "\n",
708
        "def vit_model2comps(params):\n",
709
        "  new_params = {}\n",
710
        "  for k in params.keys():\n",
711
        "    if k == \"Transformer\":\n",
712
        "      t_params = params[k]\n",
713
        "      for t_k in t_params.keys():\n",
714
        "        new_params[t_k] = t_params[t_k]\n",
715
        "    else:\n",
716
        "      new_params[k] = params[k]\n",
717
        "  return flax.core.freeze(new_params)\n",
718
        "\n",
719
        "def vit_comps2model(params):\n",
720
        "  new_params = params.unfreeze()\n",
721
        "  new_params[\"Transformer\"] = {}\n",
722
        "  for k in list(new_params.keys()):\n",
723
        "    if k in TRANSFORMER_KEYS:\n",
724
        "      new_params[\"Transformer\"][k] = new_params.pop(k)\n",
725
        "  assert len(new_params[\"Transformer\"]) != 0\n",
726
        "  return flax.core.freeze(new_params)"
727
      ]
728
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ShsZ14HoiuFD"
      },
      "source": [
        "## MultiVit Model"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Z0-D1trkMOor"
      },
      "outputs": [],
      "source": [
        "class MultiVitModelFactory():\n",
        "  @staticmethod\n",
        "  def get_model(hparams, config):\n",
        "    return get_multivit_model(hparams, config)\n",
        "\n",
        "  @staticmethod\n",
        "  def get_init_comps(hparams, config):\n",
        "    return get_multivit_init_comps(hparams, config)\n",
        "\n",
        "  @staticmethod\n",
        "  def get_comps2model_fn():\n",
        "    return multivit_comps2model\n",
        "\n",
        "  @staticmethod\n",
        "  def get_sample_input(hparams):\n",
        "    return {str(k): get_sample_images(image_size=k, batch_size=1) for k in hparams[\"ds_image_size\"]}"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "61WJRSygqedz"
      },
      "outputs": [],
      "source": [
        "def get_multivit_init_comps(hparams, config):\n",
        "  model = get_multivit_model(hparams, config)\n",
        "  init_params = copy.deepcopy(jax.device_get(\n",
        "      model.init(\n",
        "          jax.random.PRNGKey(random.randrange(int(1e10))),\n",
        "          MultiVitModelFactory.get_sample_input(hparams),\n",
        "          train=False  # Disables dropout, no effect on params.\n",
        "          )[\"params\"]))\n",
        "  return multivit_model2comps(init_params)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "YeB-D_HVdIDm"
      },
      "outputs": [],
      "source": [
        "def multivit_comps2model(params):\n",
        "  params = params.unfreeze()\n",
        "  for k in params:\n",
        "    if k.startswith(\"path_\"):\n",
        "      params[k] = vit_comps2model(flax.core.freeze(params[k]))\n",
        "  return flax.core.freeze(params)\n",
        "\n",
        "def multivit_model2comps(params):\n",
        "  # Mapping of the paths' components is skipped since they are never used from random init.\n",
        "  return params"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "waN2hnq7kbOC"
      },
      "outputs": [],
      "source": [
        "class MultipathRouter(nn.Module):\n",
        "  init_main_path_weight: float\n",
        "  num_paths: int\n",
        "  lr_mult: float\n",
        "\n",
        "  @nn.compact\n",
        "  def __call__(self, x):\n",
        "    assert self.num_paths \u003e 0\n",
        "    assert self.lr_mult \u003e= 0 and self.lr_mult \u003c= 1\n",
        "    # Bias for the main path such that, with a zero kernel and zero biases on\n",
        "    # the other paths, the softmax assigns init_main_path_weight to the main path.\n",
        "    init_bias = np.log((1/(1/self.init_main_path_weight -1))*(self.num_paths-1))\n",
        "    x = nn.LayerNorm()(x)\n",
        "    x = nn.Dense(self.num_paths,\n",
        "                 kernel_init=nn.initializers.zeros,\n",
        "                 bias_init=nn.initializers.constant(np.asarray([init_bias]+[0]*(self.num_paths-1)))\n",
        "                 )(x)\n",
        "    x = nn.softmax(x)\n",
        "    # Forward value is unchanged; the gradient is scaled by lr_mult.\n",
        "    x = self.lr_mult * x + (1-self.lr_mult) * jax.lax.stop_gradient(x)\n",
        "    return x\n",
        "\n",
        "class Connector(nn.Module):\n",
        "  out_dim: int\n",
        "\n",
        "  @nn.compact\n",
        "  def __call__(self, x):\n",
        "    x = nn.Dense(self.out_dim,\n",
        "                 kernel_init=nn.initializers.zeros,\n",
        "                 bias_init=nn.initializers.zeros,\n",
        "                 )(x)\n",
        "    return x\n",
        "\n",
        "class MultiVitModel(nn.Module):\n",
        "  config: Any\n",
        "  path_module: Type[nn.Module] = models.VisionTransformer\n",
        "  main_path_name: str = \"path_0\"\n",
        "  @nn.compact\n",
        "  def __call__(self, inputs, *, train):\n",
        "    logits_0 = self.path_module(\n",
        "        name=self.main_path_name,\n",
        "        **self.config.paths_configs[self.main_path_name])(\n",
        "            inputs[self.config.paths_image_size[self.main_path_name]], train=train)\n",
        "    out_dim = logits_0.shape[-1]\n",
        "    weights = MultipathRouter(\n",
        "        name=\"multipath_router\",\n",
        "        num_paths=len(self.config.paths_configs),\n",
        "        **self.config.router)(\n",
        "            logits_0)\n",
        "    all_logits = [logits_0]\n",
        "    for path_name in self.config.paths_configs:\n",
        "      if path_name == self.main_path_name:\n",
        "        continue\n",
        "      representation = self.path_module(\n",
        "          name=path_name,\n",
        "          **self.config.paths_configs[path_name])(\n",
        "              inputs[self.config.paths_image_size[path_name]], train=train)\n",
        "      path_logits = Connector(\n",
        "          name=f\"head_adapter_{path_name}\", out_dim=out_dim)(representation)\n",
        "      all_logits.append(path_logits)\n",
        "    stacked = jnp.stack(all_logits, axis=-1)\n",
        "    # Weighted combination; gradients reach only the router weights.\n",
        "    logits_comb = jnp.einsum(\"BLp,Bp-\u003eBL\", jax.lax.stop_gradient(stacked), weights)\n",
        "    # Unweighted sum; gives every path a gradient that is not scaled by its weight.\n",
        "    logits_sum = jnp.einsum(\"BLp,Bp-\u003eBL\", stacked, jnp.ones_like(weights))\n",
        "    # Forward value equals logits_comb; backward adds the gradient of logits_sum.\n",
        "    logits_out = logits_comb - jax.lax.stop_gradient(logits_sum) + logits_sum\n",
        "    return logits_out"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "CC4pF48eqDLL"
      },
      "outputs": [],
      "source": [
        "def get_multivit_model(hparams, config):\n",
        "  model_config = ConfigDict()\n",
        "  model_config.paths_configs = {\n",
        "      k: get_set_vit_config(hparams[\"paths\"][k][\"hparams\"], config) for k in hparams[\"paths\"]}\n",
        "  model_config.paths_image_size = {\n",
        "      k: str(hparams[\"paths\"][k][\"hparams\"][\"ds_image_size\"]) for k in hparams[\"paths\"]}\n",
        "  model_config.router = {\n",
        "      \"init_main_path_weight\": float(hparams[\"router_init_main_path_weight\"]),\n",
        "      \"lr_mult\": float(hparams[\"router_lr_mult\"]),\n",
        "  }\n",
        "  return MultiVitModel(config=FrozenConfigDict(model_config))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "lyX6WjaUDWOi"
      },
      "source": [
        "# Agents"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "FNAi2xK-fARf"
      },
      "outputs": [],
      "source": [
        "def format_agent_id(class_name, task_name):\n",
        "  agent_id = f\"{class_name}/{task_name}\"\n",
        "  assert \"~\" not in agent_id, f\"Invalid agent id: {agent_id}\"\n",
        "  return agent_id.replace(\"/\", \"~\")\n",
        "\n",
        "def get_agent_class(agent_id):\n",
        "  return globals()[agent_id.split(\"~\")[0]]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "TgGHyptb_BwQ"
      },
      "outputs": [],
      "source": [
        "def incremental_mutation(value, values_list):\n",
        "  # Move one step up or down the ordered list of values, clamped at both ends.\n",
        "  assert value in values_list, f\"{value} not in {values_list}\"\n",
        "  idx = values_list.index(value)\n",
        "  idx += 1 if np.random.uniform() \u003c 0.5 else -1\n",
        "  idx = max(0, min(len(values_list)-1, idx))\n",
        "  return values_list[idx]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "2AwvZWRxnao5"
      },
      "outputs": [],
      "source": [
        "DATASET_HPARAMS_KEYS_PRERFIX = \"ds_\""
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "jCNLTP1zGu_u"
      },
      "outputs": [],
      "source": [
        "class Agent():\n",
        "  @property\n",
        "  def class_name(self):\n",
        "    return self.__class__.__name__\n",
        "\n",
        "  @property\n",
        "  def id(self):\n",
        "    return self.config.agent_id\n",
        "\n",
        "  @staticmethod\n",
        "  def get_model_factory():\n",
        "    assert False, \"Not implemented\"\n",
        "\n",
        "  def run(self):\n",
        "    assert False, \"Not implemented\"\n",
        "\n",
        "  def complete_config(self, system_state_dir, task_name, num_cycles_max):\n",
        "    self.config.system_state_dir = system_state_dir\n",
        "    self.config.task_name = task_name\n",
        "    self.config.agent_id = format_agent_id(self.class_name, task_name)\n",
        "    self.config.agent_dir = os.path.join(system_state_dir, self.id)\n",
        "    self.config.num_cycles_max = num_cycles_max\n",
        "\n",
        "  def agent_classes_to_load(self):\n",
        "    # Defaults to loading only agents of the same class. Extend this list to\n",
        "    # access the structures and parameters produced by other agent types.\n",
        "    return [self.class_name]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "5j4LPnTA8EzH"
      },
      "outputs": [],
      "source": [
        "def run_cycles(agent):\n",
        "  config = agent.config\n",
        "  task_name = config.task_name\n",
        "  num_cycles = config.num_cycles_max\n",
        "  for _ in range(num_cycles):\n",
        "    agent.load_state()\n",
        "    if agent.cycle_id \u003e= num_cycles:\n",
        "      break\n",
        "    print(\"\\n\\n====\")\n",
        "    print(f\"CYCLE: [{agent.cycle_id+1}/{num_cycles}]\")\n",
        "    agent.pop.start_cycle()\n",
        "    agent_cycle(agent)\n",
        "    agent.pop.end_cycle()\n",
        "    agent.cycle_id += 1\n",
        "    agent.generation_id = 0\n",
        "    save_state(agent)\n",
        "    if agent.cycle_id \u003e= num_cycles:\n",
        "      break\n",
        "\n",
        "def run_root_model(agent):\n",
        "  agent.load_state()\n",
        "  save_state(agent)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "oMUf-hLFF4Wc"
      },
      "outputs": [],
      "source": [
        "# Run a full paths sampling iteration for a task.\n",
        "def agent_cycle(agent):\n",
        "  pop = agent.pop\n",
        "  config = agent.config\n",
        "  task = Path.cached_tasks(task_name=config.task_name)\n",
        "  best_path = pop.get_best_path()\n",
        "  if TEST_IMMUTABILITY and best_path:\n",
        "    run_test_eval(best_path, test_immutability=True)\n",
        "  devices = jax.local_devices()\n",
        "  print(\"DEVICE COUNT:\", len(devices))\n",
        "  num_gen_batches = math.ceil(config.num_samples_per_cycle/len(devices))\n",
        "  for _ in range(num_gen_batches):\n",
        "    if agent.generation_id \u003e= num_gen_batches:\n",
        "      break\n",
        "    print(f\"----\\nGENERATION: [{agent.generation_id+1}/{num_gen_batches}]\")\n",
        "    ds_hparams = agent.sample_ds_hparams()\n",
        "    ds_hparams[\"num_classes\"] = task.num_classes\n",
        "    paths = []\n",
        "    for i in range(len(devices)):\n",
        "      print(f\"Sampling path {Path.counter}\")\n",
        "      paths.append(agent.sample_path(ds_hparams))\n",
        "      gc.collect()\n",
        "    ds_hparams = agent.finalize_ds_hparams(ds_hparams, paths)\n",
        "    ds_train = task.get_ds(\"train\", ds_hparams)\n",
        "    ds_validation = task.get_ds(\"validation\", ds_hparams)\n",
        "    train_loop(paths, ds_train, ds_validation, devices, config)\n",
        "    for path in paths:\n",
        "      path.metrics[\"generation_id\"] = agent.generation_id\n",
        "      if path.metrics[\"improved\"]:\n",
        "        assert path not in pop.paths[config.agent_id]\n",
        "        pop.paths[config.agent_id].append(path)\n",
        "    pop.prune_population()\n",
        "    # Track best path.\n",
        "    curr_best_path = pop.get_best_path()\n",
        "    if curr_best_path != best_path:\n",
        "      if best_path:\n",
        "        assert curr_best_path.score() \u003e= best_path.score()\n",
        "      best_path = curr_best_path\n",
        "      best_path.metrics[\"new_best\"] = True\n",
        "      agent.print_best_path_summary()\n",
        "    agent.generation_id += 1\n",
        "    df_leaderboard(pop_to_df(pop))\n",
        "    if agent.generation_id \u003c num_gen_batches:\n",
        "      save_state(agent)\n",
        "  assert best_path in pop.paths[config.agent_id], best_path\n",
        "  run_test_eval(best_path)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9I56DcbI79M2"
      },
      "source": [
        "## Vit Agent"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "njMI52JMnqq1"
      },
      "outputs": [],
      "source": [
        "def get_common_config_vit():\n",
        "  config = ConfigDict()\n",
        "  config.num_train_examples_between_validations_max = 300_000\n",
        "  config.num_validations_per_path_training = 4\n",
        "  config.num_validation_examples_max = 10_000\n",
        "  config.num_samples_per_cycle = 16\n",
        "  config.max_task_population_size = 5\n",
        "  # Force finetune last layer norm that technically is part of the head.\n",
        "  config.force_mutations = [\"clone:encoder_norm\"]\n",
        "  config.scorer_kwargs = dict(\n",
        "      scale_factor=0.99,\n",
        "      base_accounted_params=2_200_000_000,\n",
        "      base_flops=3_800_000_000_000,\n",
        "      )\n",
        "  config.hparams_defaults = {\n",
        "      \"_mu_\": 0.2,\n",
        "      \"opt_lr\": 0.02,\n",
        "      \"opt_lr_schedule\": \"cosine\",\n",
        "      \"opt_lr_warmup_ratio\": 0.02,\n",
        "      \"opt_momentum\": 0.8,\n",
        "      \"opt_nesterov\": True,\n",
        "      \"ds_area_range_min\": 1.0,\n",
        "      \"ds_aspect_ratio_range_min\": 1.0,\n",
        "      \"ds_flip_left_right\": False,\n",
        "      \"ds_brightness_delta\": 0.0,\n",
        "      \"ds_contrast_delta\": 0.0,\n",
        "      \"ds_saturation_delta\": 0.0,\n",
        "      \"ds_hue_delta\": 0.0,\n",
        "      \"ds_quality_delta\": 0.0,\n",
        "  }\n",
        "  config.hparams_mutation_ranges = {\n",
        "      \"_mu_\": [0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.14, 0.16, 0.18, 0.20, 0.22, 0.24, 0.26, 0.28, 0.30],\n",
        "      \"opt_lr\": [0.0001, 0.0002, 0.0005, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0],\n",
        "      \"opt_lr_warmup_ratio\": [0.0, 0.01, 0.02, 0.05, 0.1, 0.2, 0.3],\n",
        "      \"opt_momentum\": [0.5, 0.6, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.98, 0.99],\n",
        "      \"opt_nesterov\": [True, False],\n",
        "      \"ds_area_range_min\": [0.05, 0.5, 0.95, 1.0],\n",
        "      \"ds_aspect_ratio_range_min\": [0.5, 0.75, 1.0],\n",
        "      \"ds_flip_left_right\": [True, False],\n",
        "      \"ds_brightness_delta\": [0.0, 0.01, 0.02, 0.05, 0.1, 0.2],\n",
        "      \"ds_contrast_delta\": [0.0, 0.01, 0.02, 0.05, 0.1, 0.2],\n",
        "      \"ds_saturation_delta\": [0.0, 0.01, 0.02, 0.05, 0.1, 0.2],\n",
        "      \"ds_hue_delta\": [0.0, 0.01, 0.02, 0.05, 0.1, 0.2],\n",
        "      \"ds_quality_delta\": [0.0, 0.01, 0.02, 0.05, 0.1, 0.2],\n",
        "  }\n",
        "  return config\n",
        "\n",
        "def get_config_vit_large():\n",
        "  config = get_common_config_vit()\n",
        "  config.batch_size = 16\n",
        "  # The query is used to get the model configs even if the checkpoint is not loaded.\n",
        "  config.vit_checkpoint_query = 'name==\"L/16\" and ds==\"i21k\" and aug==\"medium2\" and wd==0.03 and sd==0.1'\n",
        "  max_num_layers = get_max_num_layers(config.vit_checkpoint_query)\n",
        "  config.hparams_defaults[\"num_layers\"] = max_num_layers\n",
        "  config.hparams_mutation_ranges[\"num_layers\"] = list(\n",
        "      range(config.hparams_defaults[\"num_layers\"]+1\n",
        "            +1  # Allow exceeding the root model's number of layers by 1.\n",
        "            ))\n",
        "  config.hparams_defaults[\"ds_image_size\"] = 384\n",
        "  config.hparams_mutation_ranges[\"ds_image_size\"] = [224, 384]\n",
        "  return config\n",
        "\n",
        "def get_config_vit_ti3():\n",
        "  config = get_common_config_vit()\n",
        "  config.batch_size = 512\n",
        "  config.vit_checkpoint_query = 'name==\"Ti/16\" and ds==\"i21k\" and aug==\"light1\" and wd==0.1 and sd==0.0'\n",
        "  config.hparams_defaults[\"num_layers\"] = 3\n",
        "  config.hparams_mutation_ranges[\"num_layers\"] = list(range(config.hparams_defaults[\"num_layers\"]+1))\n",
        "  config.hparams_defaults[\"ds_image_size\"] = 32\n",
        "  config.hparams_mutation_ranges[\"ds_image_size\"] = [16*i for i in (range(1, 1+int(112/16)))]\n",
        "  return config\n",
        "\n",
        "def get_config_vit_base():\n",
        "  config = get_common_config_vit()\n",
        "  config.batch_size = 256\n",
        "  config.vit_checkpoint_query = 'name==\"B/16\" and ds==\"i21k\" and aug==\"medium1\" and wd==0.1 and sd==0'\n",
        "  max_num_layers = get_max_num_layers(config.vit_checkpoint_query)\n",
        "  config.hparams_defaults[\"num_layers\"] = max_num_layers\n",
        "  config.hparams_mutation_ranges[\"num_layers\"] = list(range(config.hparams_defaults[\"num_layers\"]+1))\n",
        "  config.hparams_defaults[\"ds_image_size\"] = 80\n",
        "  config.hparams_mutation_ranges[\"ds_image_size\"] = [16*i for i in (range(1, 1+int(112/16)))]\n",
        "  return config\n",
        "\n",
        "def config_validate(config):\n",
        "  # Every default with a mutation range must be one of the values in that range.\n",
        "  for khp in config.hparams_defaults:\n",
        "    if khp in config.hparams_mutation_ranges:\n",
        "      assert config.hparams_defaults[khp] in config.hparams_mutation_ranges[khp], khp\n",
        "  # Every mutable hparam must have a default.\n",
        "  for khp in config.hparams_mutation_ranges:\n",
        "    assert khp in config.hparams_defaults, khp"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "_wzXhNikODR8"
      },
      "outputs": [],
      "source": [
        "class Vit(Agent):\n",
        "  \"\"\"ViT large\"\"\"\n",
        "  def __init__(self, system_state_dir, task_name, num_cycles_max):\n",
        "    self.config = self.get_config()\n",
        "    self.complete_config(system_state_dir, task_name, num_cycles_max)\n",
        "    self.cached_posembed_components = ObjectCache(get_vit_checkpoint_reshaped_posembed_component)\n",
        "\n",
        "  @staticmethod\n",
        "  def get_model_factory():\n",
        "    return VitModelFactory\n",
        "\n",
        "  def load_state(self):\n",
        "    task_name = self.config.task_name\n",
        "    self.pop = Population(self.config)\n",
        "    self.cycle_id = 0\n",
        "    self.generation_id = 0\n",
        "    # Root models.\n",
        "    if task_name.startswith(\"root_model/\"):\n",
        "      hparams = self.config.hparams_defaults.as_configdict()\n",
        "      if task_name == \"root_model/random_init\":\n",
        "        hparams[\"num_classes\"] = 0  # Removes head layer.\n",
        "        path_params = self.get_model_factory().get_init_comps(hparams, self.config)\n",
        "      else:\n",
        "        assert task_name == \"root_model/checkpoint\", task_name\n",
        "        path_params = get_vit_checkpoint_comps(\n",
        "            hparams[\"ds_image_size\"],\n",
        "            self.config.vit_checkpoint_query)\n",
        "      path = Path(\n",
        "          hparams,\n",
        "          params2comps(path_params, train_locks=[self.id], agent_id=self.id),\n",
        "          parent=None,\n",
        "          agent_id=self.id,\n",
        "          task_name=task_name)\n",
        "      self.pop.paths[self.id].append(path)\n",
        "      return\n",
        "\n",
        "    # Load latest agent state.\n",
        "    def validate_df(df):\n",
        "      assert len(df[\"agent_id\"].unique()) == 1, len(df[\"agent_id\"].unique())\n",
        "      assert df[\"agent_id\"].unique()[0] == self.id, df[\"agent_id\"].unique()[0]\n",
        "    agent_checkpoint = latest_checkpoint(\n",
        "        os.path.join(self.config.agent_dir, \"state_*_*/\"))\n",
        "    if agent_checkpoint:\n",
        "      matched = re.findall(r\"checkpoint_([0-9]+)_([0-9]+)$\", agent_checkpoint)\n",
        "      assert len(matched) == 1\n",
        "      self.cycle_id = int(matched[0][0])\n",
        "      self.generation_id = int(matched[0][1])\n",
        "      state_dir = os.path.dirname(agent_checkpoint)\n",
        "      self.pop.paths_df = df_read_from_csv(state_dir, \"paths\")\n",
        "      self.pop.comps_df = df_read_from_csv(state_dir, \"components\")\n",
        "      validate_df(self.pop.paths_df)\n",
        "      validate_df(self.pop.comps_df)\n",
        "      # Set globals.\n",
        "      Path.paths = []\n",
        "      Path.counter = 1 + int(self.pop.paths_df.id.max())\n",
        "      Component.counter = 1 + int(self.pop.comps_df.id.max())\n",
        "      # Get the ids of the last path and component saved in a non-intermediate checkpoint.\n",
        "      non_intermediated_checkpoint = latest_checkpoint(\n",
        "          os.path.join(self.config.agent_dir, \"state_*_0/\"))\n",
        "      if non_intermediated_checkpoint:\n",
        "        ni_paths_df = df_read_from_csv(\n",
        "            os.path.dirname(non_intermediated_checkpoint), \"paths\")\n",
        "        validate_df(ni_paths_df)\n",
        "        Path.last_saved = int(ni_paths_df.id.max())\n",
        "        ni_comps_df = df_read_from_csv(\n",
        "            os.path.dirname(non_intermediated_checkpoint), \"components\")\n",
        "        validate_df(ni_comps_df)\n",
        "        Component.last_saved = int(ni_comps_df.id.max())\n",
        "      print(\"CONTINUING FROM STATE\", self.cycle_id, self.generation_id)\n",
        "\n",
        "    # Load all available paths.\n",
        "    all_agents_dirs = []\n",
        "    for agent_class_to_load in self.agent_classes_to_load():\n",
        "      all_agents_dirs.extend(\n",
        "          gfile.glob(os.path.join(self.config.system_state_dir,\n",
        "                                  agent_class_to_load+\"~*\")))\n",
        "    assert all_agents_dirs, f\"No state for agents: {self.agent_classes_to_load()}\"\n",
        "    state_dir = os.path.dirname(agent_checkpoint) if agent_checkpoint else None\n",
        "    load_paths(self.pop, state_dir, all_agents_dirs)\n",
        "\n",
        "    assert self.pop.paths, \"Population is empty, run an agent creating a \" \\\n",
        "        \"root model to initialize the population.\"\n",
        "    df_leaderboard(pop_to_df(self.pop))\n",
        "\n",
        "  def get_config(self):\n",
        "    return get_config_vit_large()\n",
        "\n",
        "  def run(self):\n",
        "    if self.config.task_name.startswith(\"root_model/\"):\n",
        "      run_root_model(self)\n",
        "      return\n",
        "    run_cycles(self)\n",
        "\n",
        "  def complete_config(self, system_state_dir, task_name, num_cycles_max):\n",
        "    super().complete_config(system_state_dir, task_name, num_cycles_max)\n",
        "    self.config = FrozenConfigDict(self.config)\n",
        "    config_validate(self.config)\n",
        "\n",
        "  def do_mutate(self, hparams, mutation_name):\n",
        "    \"\"\"Returns True if mutation is sampled to be applied.\"\"\"\n",
        "    if mutation_name in self.config.get(\"force_mutations\", []):\n",
        "      return True\n",
        "    mutation_prob_k = f\"_mu_|{mutation_name}\"\n",
        "    # Fallback is used for batch shared sampling.\n",
        "    mu = hparams.get(\"_mu_\", self.config.hparams_defaults[\"_mu_\"])\n",
        "    mutation_prob = hparams.get(mutation_prob_k, mu)\n",
        "    if \"_mu_\" in self.config.hparams_mutation_ranges:\n",
        "      if mu \u003e np.random.uniform():\n",
        "        mutation_prob = incremental_mutation(\n",
        "            mutation_prob, self.config.hparams_mutation_ranges[\"_mu_\"])\n",
        "      hparams[mutation_prob_k] = mutation_prob\n",
        "    return mutation_prob \u003e np.random.uniform()\n",
        "\n",
        "  def parent_decay_selection(self):\n",
        "    # Visit paths from best to worst score; each one is selected with a\n",
        "    # probability that halves for every offspring it has already produced.\n",
        "    for path in sorted(self.pop.paths[self.config.agent_id],\n",
        "                       key=lambda p: p.score(),\n",
        "                       reverse=True):\n",
        "      offsprings = path.metrics.get(\"offsprings\", 0)\n",
        "      assert not math.isnan(offsprings)\n",
        "      select_prob = 0.5 ** offsprings\n",
        "      print(f\" Candidate parent path {path.id},\",\n",
        "            f\"selection probability: 0.5^{offsprings} == {select_prob}\")\n",
        "      if np.random.uniform() \u003c select_prob:\n",
        "        path.metrics[\"offsprings\"] = path.metrics.get(\"offsprings\", 0) + 1\n",
        "        return path\n",
        "    return None\n",
        "\n",
        "  def sample_path(self, ds_hparams):\n",
        "    parent = self.parent_decay_selection()\n",
        "    if not parent:  # Random sample.\n",
        "      parent = random.choice([p for paths in self.pop.paths.values() for p in paths])\n",
        "      print(f\" Randomly selected parent {parent.agent_id}:{parent.id}\")\n",
        "    return self.mutate_parent(parent, ds_hparams)\n",
        "\n",
        "  def mutate_hparams(self, hparams):\n",
        "    for k in sorted(self.config.hparams_mutation_ranges):\n",
        "      if k in hparams and self.do_mutate(hparams, f\"hp:{k}\"):\n",
        "        hparams[k] = incremental_mutation(\n",
        "            hparams[k], self.config.hparams_mutation_ranges[k])\n",
        "    return hparams\n",
        "\n",
        "  def sample_ds_hparams(self):\n",
        "    \"\"\"Samples the hparams that must be shared by all paths in a generation.\"\"\"\n",
        "    ds_hparams = {}\n",
        "    # Initialize shared hparams with defaults.\n",
        "    for key in self.config.hparams_defaults:\n",
        "      if key.startswith(DATASET_HPARAMS_KEYS_PRERFIX):\n",
        "        ds_hparams[key] = self.config.hparams_defaults[key]\n",
        "    # Overwrite with values from best path if available.\n",
        "    best_path = self.pop.get_best_path()\n",
        "    if best_path:\n",
        "      ds_hparams.update(\n",
        "          {k : best_path.hparams[k] for k in ds_hparams if k in best_path.hparams})\n",
        "      ds_hparams.update(\n",
        "          {k : best_path.hparams[k] for k in best_path.hparams if k.startswith(\n",
        "              f\"_mu_|hp:{DATASET_HPARAMS_KEYS_PRERFIX}\")})\n",
        "      # Sample mutations.\n",
        "      ds_hparams = self.mutate_hparams(ds_hparams)\n",
        "    # Validate.\n",
        "    for k in ds_hparams:\n",
        "      assert (k.startswith(DATASET_HPARAMS_KEYS_PRERFIX) or\n",
        "              k.startswith(f\"_mu_|hp:{DATASET_HPARAMS_KEYS_PRERFIX}\"))\n",
        "    return ds_hparams\n",
        "\n",
        "  def finalize_ds_hparams(self, ds_hparams, paths):\n",
        "    # Validate shared params.\n",
        "    for k in ds_hparams:\n",
        "      if k.startswith(DATASET_HPARAMS_KEYS_PRERFIX):\n",
        "        for path in paths:\n",
        "          assert ds_hparams[k] == path.hparams[k]\n",
        "    return ds_hparams\n",
        "\n",
1363
        "  def mutate_parent(self, parent, ds_hparams):\n",
1364
        "    config = self.config\n",
1365
        "    agent_id = config.agent_id\n",
1366
        "    task_name = config.task_name\n",
1367
        "    comps = []\n",
1368
        "    new_hparams = copy.deepcopy(parent.hparams)\n",
1369
        "    new_hparams = self.mutate_hparams(new_hparams)\n",
1370
        "    # Overwrite dataset hparams with those sampled for the generation batch.\n",
1371
        "    new_hparams.update(ds_hparams)\n",
1372
        "\n",
1373
        "    def get_component_ref(c, clone):\n",
1374
        "      if c.is_trainable() or clone:\n",
1375
        "        # Clone trainable component.\n",
1376
        "        return c.clone(agent_id=agent_id)\n",
1377
        "      # Refer to frozen component.\n",
1378
        "      return c\n",
1379
        "\n",
1380
        "    init_params = self.get_model_factory().get_init_comps(new_hparams, config)\n",
1381
        "    for new_comp_name in init_params:\n",
1382
        "      comp = None\n",
        "      # Attempt to reuse the matching component from the closest ancestor.\n",
1384
        "      ancestor = parent\n",
1385
        "      while ancestor is not None:\n",
1386
        "        comps_lookup = {c.name:c for c in ancestor.components}\n",
1387
        "        if new_comp_name in comps_lookup:\n",
        "          # Head must be trainable: if no ancestor belongs to the same agent,\n",
        "          # fall back to random init of the correct shape.\n",
1390
        "          if new_comp_name == \"head\" and agent_id != ancestor.agent_id:\n",
1391
        "            assert agent_id != ancestor.agent_id, f\"{agent_id} != {ancestor.agent_id}\"\n",
1392
        "            ancestor = ancestor.parent\n",
1393
        "            continue\n",
1394
        "          # Check shapes match otherwise skip.\n",
1395
        "          if (jax.tree_util.tree_map(jnp.shape, init_params[new_comp_name]) !=\n",
1396
        "              jax.tree_util.tree_map(jnp.shape, comps_lookup[new_comp_name].params)):\n",
1397
        "            if new_comp_name == \"posembed_input\":\n",
        "              # A change of image size changes the shape of position embeddings,\n",
        "              # this can happen if ds_image_size is tuned.\n",
        "              # Continue searching through ancestors for a matching size.\n",
1401
        "              assert \"ds_image_size\" in config.hparams_mutation_ranges\n",
1402
        "              assert new_hparams[\"ds_image_size\"] != ancestor.hparams[\"ds_image_size\"]\n",
1403
        "              ancestor = ancestor.parent\n",
1404
        "              continue\n",
1405
        "\n",
1406
        "            print(f\"WARNING: Shapes do not match for component: {new_comp_name}  {ancestor.agent_id}-\u003e{agent_id}\")\n",
1407
        "            print(jax.tree_util.tree_map(jnp.shape, init_params[new_comp_name]))\n",
1408
        "            print(jax.tree_util.tree_map(jnp.shape, comps_lookup[new_comp_name].params))\n",
1409
        "            assert False  # Should not happen in current configuration.\n",
1410
        "\n",
1411
        "          ancestor_comp = comps_lookup[new_comp_name]\n",
1412
        "          comp = get_component_ref(\n",
1413
        "              ancestor_comp, clone=(\n",
1414
        "                  ancestor_comp.is_trainable() or self.do_mutate(\n",
1415
        "                      new_hparams, f\"clone:{new_comp_name}\")))\n",
1416
        "          break\n",
1417
        "        ancestor = ancestor.parent\n",
1418
        "      # Get reshaped posembed_input from checkpoint.\n",
1419
        "      if comp is None and new_comp_name == \"posembed_input\":\n",
1420
        "        pe_comp = self.cached_posembed_components(\n",
1421
        "            agent_id=agent_id,\n",
1422
        "            query=config.vit_checkpoint_query,\n",
1423
        "            **new_hparams)\n",
1424
        "        # Clone to make the component trainable.\n",
1425
        "        comp = get_component_ref(pe_comp, clone=True)\n",
1426
        "      # Otherwise create one from random init params.\n",
1427
        "      if comp is None:\n",
1428
        "        # Possible rand init triggering combinations in current configurations.\n",
1429
        "        assert (\n",
1430
        "            new_comp_name == \"head\"\n",
1431
        "            or (new_comp_name.startswith(\"encoderblock_\")\n",
1432
        "                and config.hparams_defaults[\"num_layers\"] \u003c max(\n",
1433
        "                config.hparams_mutation_ranges.get(\"num_layers\", [-1]))))\n",
1434
        "        comp = params2comps(\n",
1435
        "            init_params, train_locks=[],\n",
1436
        "            agent_id=agent_id, name=new_comp_name)[0]\n",
1437
        "      assert comp is not None\n",
1438
        "      comps.append(comp)\n",
1439
        "    return Path(new_hparams, comps, parent=parent, agent_id=agent_id, task_name=task_name)\n",
1440
        "\n",
1441
        "  def print_best_path_summary(self):\n",
1442
        "    best_path = self.pop.get_best_path()\n",
1443
        "    print(f\"Best id:{best_path.id}\",\n",
1444
        "          f\"score:{best_path.score():.4f}\",\n",
1445
        "          f\"quality:{best_path.metrics['quality']:.4f}\",\n",
1446
        "          f\"gen:{best_path.metrics['generation_id']}\",\n",
1447
        "          f\"\\n{best_path.hparams}\")\n",
1448
        "\n",
1449
        "  def get_paths_to_publish(self):\n",
1450
        "    return [p for p in self.pop.paths[self.config.agent_id]]\n",
1451
        "\n",
1452
        "class VitT3(Vit):\n",
1453
        "  \"\"\"ViT tiny 3 layers\"\"\"\n",
1454
        "  def get_config(self):\n",
1455
        "    return get_config_vit_ti3()\n",
1456
        "\n",
1457
        "class VitB(Vit):\n",
1458
        "  \"\"\"ViT base\"\"\"\n",
1459
        "  def get_config(self):\n",
1460
        "    return get_config_vit_base()"
1461
      ]
1462
    },
1463
    {
1464
      "cell_type": "markdown",
1465
      "metadata": {
1466
        "id": "-N57gaotPN55"
1467
      },
1468
      "source": [
1469
        "## MultiVit Agent"
1470
      ]
1471
    },
1472
    {
1473
      "cell_type": "code",
1474
      "execution_count": null,
1475
      "metadata": {
1476
        "id": "Cb3uj_kw9ED9"
1477
      },
1478
      "outputs": [],
1479
      "source": [
1480
        "def set_multivit_common_config(config):\n",
1481
        "  config.force_mutations = []\n",
1482
        "  config.scorer_kwargs = {}\n",
1483
        "  for rm_k in [\"num_layers\", \"ds_image_size\"]:\n",
1484
        "    del config.hparams_defaults[rm_k]\n",
1485
        "    del config.hparams_mutation_ranges[rm_k]\n",
1486
        "  config.hparams_defaults[\"router_init_main_path_weight\"] = 0.8\n",
1487
        "  config.hparams_defaults[\"router_lr_mult\"] = 0.05\n",
1488
        "  config.hparams_defaults[\"num_paths\"] = 2\n",
1489
        "  config.hparams_mutation_ranges[\"router_lr_mult\"] = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1]\n",
1490
        "  config.hparams_mutation_ranges[\"num_paths\"] = [2, 3]\n",
1491
        "  return config"
1492
      ]
1493
    },
1494
    {
1495
      "cell_type": "code",
1496
      "execution_count": null,
1497
      "metadata": {
1498
        "id": "3Id10GxBcyJi"
1499
      },
1500
      "outputs": [],
1501
      "source": [
1502
        "class MultiVit(Vit):\n",
1503
        "  def get_config(self):\n",
1504
        "    config = get_config_vit_large()\n",
1505
        "    return set_multivit_common_config(config)\n",
1506
        "\n",
1507
        "  @staticmethod\n",
1508
        "  def get_model_factory():\n",
1509
        "    return MultiVitModelFactory\n",
1510
        "\n",
1511
        "  def run(self):\n",
1512
        "    run_cycles(self)\n",
1513
        "\n",
1514
        "  def agent_classes_to_load(self):\n",
1515
        "    return [self.single_path_agent_class()]\n",
1516
        "\n",
1517
        "  def single_path_agent_class(self):\n",
1518
        "    return self.class_name.replace(\"Multi\", \"\")\n",
1519
        "\n",
1520
        "  def single_path_main_agent_id(self):\n",
1521
        "    return format_agent_id(self.single_path_agent_class(), self.config.task_name)\n",
1522
        "\n",
1523
        "  def load_state(self):\n",
1524
        "    if self.config.task_name.startswith(\"root_model/\"):\n",
1525
        "      assert False, (\n",
        "          \"Root models need to be generated with the corresponding \" \\\n",
1527
        "          f\"single path agent: '{self.single_path_agent_class()}'.\")\n",
1528
        "    super().load_state()\n",
1529
        "    assert self.single_path_main_agent_id() in self.pop.paths, (\n",
1530
        "        \"Missing state for the corresponding single path main agent. \" \\\n",
1531
        "        f\"Run agent '{self.single_path_agent_class()}' \" \\\n",
        "        f\"on the '{self.config.task_name}' task to generate it.\")\n",
1533
        "\n",
1534
        "  def sample_path(self, ds_hparams):\n",
1535
        "    selected_paths = {}\n",
1536
        "    selected_paths[\"path_0\"] = random.choice(self.pop.paths[self.single_path_main_agent_id()])\n",
1537
        "    parent = self.parent_decay_selection()\n",
1538
        "    if parent is not None:\n",
1539
        "      new_hparams = copy.deepcopy(parent.hparams)\n",
1540
        "      print(\" Selected\")\n",
1541
        "    else:\n",
1542
        "      parent = selected_paths[\"path_0\"]\n",
1543
        "      best_path = self.pop.get_best_path()\n",
1544
        "      if best_path:\n",
1545
        "        new_hparams = copy.deepcopy(best_path.hparams)\n",
1546
        "      else:\n",
1547
        "        new_hparams = self.config.hparams_defaults.to_dict()\n",
1548
        "      print(\" New random\")\n",
1549
        "    new_hparams[\"paths\"] = {}\n",
1550
        "    new_hparams = self.mutate_hparams(new_hparams)\n",
1551
        "    # Overwrite dataset hparams with those sampled for the generation batch.\n",
1552
        "    new_hparams.update(ds_hparams)\n",
1553
        "\n",
1554
        "    for path_name in [f\"path_{i}\" for i in range(int(new_hparams[\"num_paths\"]))]:\n",
1555
        "      if path_name in selected_paths:\n",
1556
        "        continue\n",
1557
        "      if path_name in parent.hparams.get(\"paths\", {}):\n",
1558
        "        selected_paths[path_name] = self.pop.get_path_from_full_id(parent.hparams[\"paths\"][path_name][\"agent_id\"], parent.hparams[\"paths\"][path_name][\"id\"])\n",
1559
        "        print(\" Subpath from parent: \", path_name, selected_paths[path_name].full_id)\n",
1560
        "        continue\n",
1561
        "      selected_paths[path_name] = random.choice([\n",
1562
        "          p for paths in self.pop.paths.values() for p in paths if (\n",
1563
        "              p.agent_id not in [sp.agent_id for sp in selected_paths.values()]\n",
1564
        "              and p.agent_id.startswith(f\"{self.single_path_agent_class()}~\")\n",
1565
        "              and not p.agent_id.endswith(\"~1k\")  # Excludes VTAB-1k tasks.\n",
1566
        "              # and (path_name != \"path_1\" or p.agent_id in ['Vit~i_naturalist2017'])  # Forces i_naturalist2017 selection.\n",
1567
        "              )])\n",
1568
        "      print(\" Subpath rand selected:\", path_name, selected_paths[path_name].full_id)\n",
1569
        "\n",
1570
        "    for (path_name, path) in selected_paths.items():\n",
1571
        "      new_hparams[\"paths\"][path_name] = {\n",
1572
        "          \"id\": path.id,\n",
1573
        "          \"agent_id\": path.agent_id,\n",
1574
        "          \"hparams\": copy.deepcopy(path.hparams)}\n",
1575
        "    print(\" Sampled subpaths:\",\n",
1576
        "          {k: v[\"agent_id\"] for k, v in new_hparams[\"paths\"].items()})\n",
1577
        "    # Set headless model config for models of different tasks.\n",
1578
        "    for k in new_hparams[\"paths\"]:\n",
1579
        "      if selected_paths[k].task_name != self.config.task_name:\n",
1580
        "        new_hparams[\"paths\"][k][\"hparams\"][\"num_classes\"] = 0\n",
1581
        "    # Collect image sizes needed.\n",
1582
        "    image_sizes = set()\n",
1583
        "    for k in new_hparams[\"paths\"]:\n",
1584
        "      image_sizes.add(int(new_hparams[\"paths\"][k][\"hparams\"][\"ds_image_size\"]))\n",
1585
        "    new_hparams[\"ds_image_size\"] = list(image_sizes)\n",
1586
        "    # Collect components.\n",
1587
        "    init_params = self.get_model_factory().get_init_comps(new_hparams, self.config)\n",
1588
        "    comps = []\n",
1589
        "    for new_comp_name in init_params:\n",
1590
        "      if new_comp_name in new_hparams[\"paths\"]:\n",
1591
        "        comps.append(ComponentPath(name=new_comp_name,\n",
1592
        "                                   path=selected_paths[new_comp_name]))\n",
1593
        "      else:\n",
1594
        "        comps_lookup = {c.name:c for c in parent.components}\n",
1595
        "        if new_comp_name in comps_lookup and (\n",
1596
        "          jax.tree_util.tree_map(jnp.shape, init_params[new_comp_name]) ==\n",
1597
        "          jax.tree_util.tree_map(jnp.shape, comps_lookup[new_comp_name].params)):\n",
1598
        "          print(\" COMP Reusing\", new_comp_name)\n",
1599
        "          comp = comps_lookup[new_comp_name].clone(agent_id=self.config.agent_id)\n",
1600
        "        else:\n",
1601
        "          print(\" COMP Init\", new_comp_name)\n",
1602
        "          comp = Component(name=new_comp_name,\n",
1603
        "                           agent_id=self.config.agent_id,\n",
1604
        "                           params=init_params[new_comp_name],\n",
1605
        "                           train_locks=[])\n",
1606
        "        comps.append(comp)\n",
1607
        "    return Path(new_hparams, comps, parent=parent,\n",
1608
        "                agent_id=self.config.agent_id, task_name=self.config.task_name)\n",
1609
        "\n",
1610
        "  def finalize_ds_hparams(self, ds_hparams, paths):\n",
1611
        "    # Validate shared params.\n",
1612
        "    for k in ds_hparams:\n",
1613
        "      if k.startswith(DATASET_HPARAMS_KEYS_PRERFIX):\n",
1614
        "        for path in paths:\n",
1615
        "          assert ds_hparams[k] == path.hparams[k], (k, ds_hparams[k], path.hparams[k])\n",
1616
        "    image_sizes = set()\n",
1617
        "    for path in paths:\n",
1618
        "      image_sizes.update(path.hparams[\"ds_image_size\"])\n",
1619
        "    ds_hparams[\"ds_image_size\"] = list(image_sizes)\n",
1620
        "    print(\"Image sizes:\", ds_hparams[\"ds_image_size\"])\n",
1621
        "    return ds_hparams\n",
1622
        "\n",
1623
        "  def print_best_path_summary(self):\n",
1624
        "    super().print_best_path_summary()\n",
1625
        "    best_path = self.pop.get_best_path()\n",
1626
        "    print(\"Paths used by best model:\",\n",
1627
        "          {k: best_path.hparams[\"paths\"][k][\"agent_id\"] for k in best_path.hparams[\"paths\"]})\n",
1628
        "\n",
1629
        "  def get_paths_to_publish(self):\n",
1630
        "    paths_to_publish = [p for p in self.pop.paths[self.config.agent_id]]\n",
1631
        "    for path in list(paths_to_publish):\n",
1632
        "      for k in path.hparams[\"paths\"]:\n",
1633
        "        paths_to_publish.append(self.pop.get_path_from_full_id(path.hparams[\"paths\"][k][\"agent_id\"], path.hparams[\"paths\"][k][\"id\"]))\n",
1634
        "    return paths_to_publish\n",
1635
        "\n",
1636
        "class MultiVitT3(MultiVit):\n",
1637
        "  def get_config(self):\n",
1638
        "    config = get_config_vit_ti3()\n",
1639
        "    return set_multivit_common_config(config)\n",
1640
        "\n",
1641
        "class MultiVitB(MultiVit):\n",
1642
        "  def get_config(self):\n",
1643
        "    config = get_config_vit_base()\n",
1644
        "    return set_multivit_common_config(config)"
1645
      ]
1646
    },
1647
    {
1648
      "cell_type": "markdown",
1649
      "metadata": {
1650
        "id": "gfGafBjhDpV7"
1651
      },
1652
      "source": [
1653
        "# Tasks"
1654
      ]
1655
    },
1656
    {
1657
      "cell_type": "code",
1658
      "execution_count": null,
1659
      "metadata": {
1660
        "id": "mBAsZx4IWWaS"
1661
      },
1662
      "outputs": [],
1663
      "source": [
1664
        "TFDS_BUILDERS_CACHE = {}\n",
1665
        "def get_tfds_builder(tfds_name):\n",
1666
        "  global TFDS_BUILDERS_CACHE\n",
1667
        "  if tfds_name not in TFDS_BUILDERS_CACHE:\n",
1668
        "    TFDS_BUILDERS_CACHE[tfds_name] = tfds.builder(tfds_name)\n",
1669
        "    TFDS_BUILDERS_CACHE[tfds_name].download_and_prepare()\n",
1670
        "  return TFDS_BUILDERS_CACHE[tfds_name]"
1671
      ]
1672
    },
1673
    {
1674
      "cell_type": "code",
1675
      "execution_count": null,
1676
      "metadata": {
1677
        "id": "kqnhOnk5jbnm"
1678
      },
1679
      "outputs": [],
1680
      "source": [
1681
        "def get_default_splits(tfds_name):\n",
1682
        "  info = get_tfds_builder(tfds_name).info\n",
1683
        "  splits = list(info.splits.keys())\n",
1684
        "  assert \"train\" in splits, splits\n",
1685
        "  splits.remove(\"train\")\n",
1686
        "  used_percent = 0\n",
1687
        "  slice_percent = 5\n",
1688
        "  pp = {}\n",
1689
        "  for k in [\"test\", \"validation\"]:\n",
1690
        "    if k in splits:\n",
1691
        "      pp[k] = k\n",
1692
        "      splits.remove(k)\n",
1693
        "    else:\n",
1694
        "      pp[k] = f\"train[{used_percent}%:{used_percent+slice_percent}%]\"\n",
1695
        "      used_percent += slice_percent\n",
1696
        "  pp[\"train\"] = f\"train[{used_percent}%:]\"\n",
1697
        "  return pp\n",
1698
        "\n",
1699
        "def get_dataset_and_splits(tfds_name: str):\n",
1700
        "  vtab_class = None\n",
1701
        "  if tfds_name in [\"imagenet_v2\", \"cifar10_1\"]:\n",
1702
        "    assert False,  f\"{tfds_name} used as validation set for other tasks.\"\n",
1703
        "\n",
1704
        "  if tfds_name == \"imagenet2012\":\n",
1705
        "    dataset = {\n",
1706
        "        \"train\":\"imagenet2012\", \"validation\":\"imagenet_v2\", \"test\":\"imagenet2012\"}\n",
1707
        "    splits = {\n",
1708
        "        \"train\":\"train\", \"validation\":\"test\", \"test\":\"validation\"}\n",
1709
        "  elif tfds_name == \"cifar100\":\n",
1710
        "    dataset = tfds_name\n",
1711
        "    splits = {\n",
1712
        "        \"train\":\"train[:98%]\", \"validation\":\"train[98%:]\", \"test\":\"test\"}\n",
1713
        "  elif tfds_name == \"cifar10\":\n",
1714
        "    dataset = {\n",
1715
        "        \"train\":\"cifar10\", \"validation\":\"cifar10_1\", \"test\":\"cifar10\"}\n",
1716
        "    splits = {\n",
1717
        "        \"train\":\"train\", \"validation\":\"test\", \"test\":\"test\"}\n",
1718
        "  elif (tfds_name.startswith(\"visual_domain_decathlon/\") or\n",
1719
        "        tfds_name in [\"i_naturalist2017\", \"i_naturalist2018\", \"places365_small\"]):\n",
1720
        "    dataset = tfds_name\n",
1721
        "    # Test has no labels, split validation in half.\n",
1722
        "    splits =  {\n",
1723
        "        \"train\":\"train\", \"validation\":\"validation[:50%]\", \"test\":\"validation[50%:]\"}\n",
1724
        "  elif tfds_name.startswith(\"cmaterdb/\"):\n",
1725
        "    dataset = tfds_name\n",
1726
        "    # Increase size of validation set due to small dataset size.\n",
1727
        "    splits =  {\n",
1728
        "        \"train\":\"train[20%:]\", \"validation\":\"train[:20%]\", \"test\":\"test\"}\n",
1729
        "  elif tfds_name == \"omniglot\":\n",
        "    # Test has no labels and there is no validation split; use additional splits.\n",
1731
        "    dataset = tfds_name\n",
1732
        "    splits = {\"train\":\"train\", \"validation\":\"small1\", \"test\":\"small2\"}\n",
1733
        "  elif tfds_name.startswith(\"controlled_noisy_web_labels/\"):\n",
1734
        "    dataset = tfds_name\n",
1735
        "    splits =  {\n",
1736
        "        \"train\":\"train_00\",\n",
1737
        "        \"validation\":\"validation[:50%]\",\n",
1738
        "        \"test\":\"validation[50%:]\"}\n",
1739
        "  elif tfds_name.startswith(\"cycle_gan/\"):\n",
1740
        "    dataset = tfds_name\n",
1741
        "    splits =  {\n",
1742
        "        \"train\":\"trainA[10%:]+trainB[10%:]\",\n",
1743
        "        \"validation\":\"trainA[:10%]+trainB[:10%]\",\n",
1744
        "        \"test\":\"testA+testB\"}\n",
1745
        "  elif tfds_name in [\"imagenet_a\", \"imagenet_r\", \"imagenet_sketch\",\n",
1746
        "                     \"siscore/rotation\", \"siscore/size\", \"siscore/location\",]:\n",
1747
        "    # Only test split.\n",
1748
        "    dataset = tfds_name\n",
1749
        "    splits =  {\n",
1750
        "        \"train\":\"test[10%:]\",\n",
1751
        "        \"validation\":\"test[5%:10%]\",\n",
1752
        "        \"test\":\"test[:5%]\"}\n",
1753
        "  elif tfds_name in [\"pet_finder\"]:\n",
1754
        "    # Explicitly use only train split. E.g. test has no labels.\n",
1755
        "    dataset = tfds_name\n",
1756
        "    splits =  {\n",
1757
        "        \"train\":\"train[10%:]\",\n",
1758
        "        \"validation\":\"train[5%:10%]\",\n",
1759
        "        \"test\":\"train[:5%]\"}\n",
1760
        "  elif tfds_name == \"quickdraw_bitmap\":\n",
1761
        "    dataset = tfds_name\n",
1762
        "    # Cap size of test and validation set.\n",
1763
        "    splits =  {\n",
1764
        "        \"train\":\"train[20000:]\", \"validation\":\"train[10000:20000]\", \"test\":\"train[:10000]\"}\n",
1765
        "  elif tfds_name == \"stanford_online_products\":\n",
1766
        "    dataset = tfds_name\n",
1767
        "    # Use the first 10k test samples as validation since test has 60k.\n",
1768
        "    splits =  {\n",
1769
        "        \"train\":\"train\", \"validation\":\"test[:10000]\", \"test\":\"test[10000:]\"}\n",
1770
        "  elif tfds_name in VTAB_TASKS or (\n",
1771
        "      tfds_name.endswith(\"/1k\") and tfds_name.replace(\"/1k\", \"\") in VTAB_TASKS):\n",
1772
        "    is_vtab_1k = tfds_name.endswith(\"/1k\")\n",
1773
        "    tfds_name = tfds_name.replace(\"/1k\", \"\")\n",
1774
        "    registry_name = {\n",
1775
        "        \"diabetic_retinopathy_detection/btgraham-300\": \"diabetic_retinopathy\",\n",
1776
        "        \"svhn_cropped\": \"svhn\",\n",
1777
        "        \"cifar100\": \"cifar\",\n",
1778
        "        \"cifar10\": \"cifar\",\n",
1779
        "    }.get(tfds_name, tfds_name.split(\"/\")[0])\n",
1780
        "    args = {\n",
1781
        "        \"clevr/count_all\": (\"count_all\",),\n",
1782
        "        \"clevr/count_cylinders\": (\"count_cylinders\",),\n",
1783
        "        \"clevr/closest_object_distance\": (\"closest_object_distance\",),\n",
1784
        "        \"dsprites/label_x_position\": (\"label_x_position\",),\n",
1785
        "        \"dsprites/label_orientation\": (\"label_orientation\",),\n",
1786
        "        \"kitti/closest_object_distance\": (\"closest_object_distance\",),\n",
1787
        "        \"kitti/count_vehicles\": (\"count_vehicles\",),\n",
1788
        "        \"kitti/closest_vehicle_distance\": (\"closest_vehicle_distance\",),\n",
1789
        "        \"smallnorb/label_category\": (\"label_category\",),\n",
1790
        "        \"smallnorb/label_lighting\": (\"label_lighting\",),\n",
1791
        "        \"smallnorb/label_azimuth\": (\"label_azimuth\",),\n",
1792
        "        \"smallnorb/label_elevation\": (\"label_elevation\",),\n",
1793
        "        \"cifar100\": (100,),\n",
1794
        "        \"cifar10\": (10,),\n",
1795
        "    }.get(tfds_name, ())\n",
1796
        "    vtab_class = task_adapt_registry.Registry.lookup(\n",
1797
        "        f\"data.{registry_name}\")(*args)\n",
1798
        "    vtab_splits = vtab_class._tfds_splits\n",
1799
        "    dataset = {\n",
1800
        "        \"caltech101\": \"caltech101:3.*.*\",\n",
1801
        "        \"dtd\": \"dtd:3.*.*\",\n",
1802
        "        \"oxford_flowers102\": \"oxford_flowers102:2.*.*\",\n",
1803
        "        \"oxford_iiit_pet\": \"oxford_iiit_pet:3.*.*\",\n",
1804
        "        \"sun397\": \"sun397/tfds:4.*.*\",\n",
1805
        "        \"svhn\": \"svhn_cropped:3.*.*\",\n",
1806
        "        \"patch_camelyon\": \"patch_camelyon:2.*.*\",\n",
1807
        "        \"eurosat\": \"eurosat/rgb:2.*.*\",\n",
1808
        "        \"resisc45\": \"resisc45:3.*.*\",\n",
1809
        "        \"diabetic_retinopathy\": \"diabetic_retinopathy_detection/btgraham-300:3.*.*\",\n",
1810
        "        \"clevr\": \"clevr:3.*.*\",\n",
1811
        "        \"dmlab\": \"dmlab:2.0.1\",\n",
1812
        "        \"dsprites\": \"dsprites:2.*.*\",\n",
1813
        "        \"kitti\": \"kitti:3.2.0\",\n",
1814
        "        \"smallnorb\": \"smallnorb:2.*.*\",\n",
1815
        "        \"cifar\" : \"cifar100:3.*.*\" if tfds_name == \"cifar100\" else \"cifar10:3.*.*\",\n",
1816
        "    }[registry_name]\n",
1817
        "    if is_vtab_1k:\n",
1818
        "      splits =  {\n",
1819
        "          \"train\": str(vtab_splits[\"train800\"]),\n",
1820
        "          \"validation\": str(vtab_splits[\"val200\"]),\n",
1821
        "          \"test\": str(vtab_splits[\"test\"]),\n",
1822
        "          }\n",
1823
        "    else:\n",
1824
        "      splits =  {\n",
1825
        "          \"train\": str(vtab_splits[\"train\"]),\n",
1826
        "          \"validation\": str(vtab_splits[\"val\"]),\n",
1827
        "          \"test\": str(vtab_splits[\"test\"]),\n",
1828
        "          }\n",
1829
        "  else:\n",
1830
        "    dataset = tfds_name\n",
1831
        "    splits = get_default_splits(tfds_name)\n",
1832
        "  return dataset, splits, vtab_class\n",
1833
        "\n",
1834
        "\n",
1835
        "class Task():\n",
1836
        "  def __init__(self, name, config):\n",
1837
        "    self.config = config\n",
1838
        "\n",
1839
        "    self.dataset, self.splits, self.vtab_class = get_dataset_and_splits(name)\n",
1840
        "    self.name = name\n",
1841
        "    if self.vtab_class:\n",
1842
        "      self.num_classes = self.vtab_class.get_num_classes()\n",
1843
        "    else:\n",
1844
        "      self.num_classes = self.get_builder(\n",
1845
        "          \"train\").info.features[self.get_label_key()].num_classes\n",
1846
        "    num_train_examples = self.get_builder(\n",
1847
        "        \"train\").info.splits[self.splits[\"train\"]].num_examples\n",
1848
        "    self.train_batch_size = config.batch_size\n",
1849
        "    self.num_train_batches_between_validations = math.ceil(\n",
1850
        "        min(num_train_examples,\n",
1851
        "            config.num_train_examples_between_validations_max)\n",
1852
        "        / self.train_batch_size)\n",
1853
        "\n",
1854
        "    num_validation_examples_tot = self.get_builder(\n",
1855
        "        \"validation\").info.splits[self.splits[\"validation\"]].num_examples\n",
1856
        "    if config.num_validation_examples_max \u003c= num_validation_examples_tot:\n",
1857
        "      self.validation_batch_size = config.batch_size\n",
1858
        "      self.num_validation_batches = math.floor(\n",
1859
        "          config.num_validation_examples_max / self.validation_batch_size)\n",
1860
        "    else:\n",
1861
        "      # Adjust batch_size and num_batches to cover the smaller validation sets.\n",
1862
        "      self.num_validation_batches = math.ceil(\n",
1863
        "          num_validation_examples_tot / config.batch_size)\n",
1864
        "      self.validation_batch_size = math.floor(\n",
1865
        "          num_validation_examples_tot / self.num_validation_batches)\n",
1866
        "      assert num_validation_examples_tot \u003e= (\n",
1867
        "          self.num_validation_batches*self.validation_batch_size)\n",
1868
        "    self.num_validation_examples = (\n",
1869
        "        self.num_validation_batches * self.validation_batch_size)\n",
1870
        "\n",
1871
        "    print(f\"Task: {self.name}\")\n",
1872
        "    print(f\"  Train batches between validations: {self.num_train_batches_between_validations}\")\n",
1873
        "    print(f\"  Validation batches: {self.num_validation_batches}\")\n",
1874
        "    print(f\"  Validation batch size: {self.validation_batch_size}\")\n",
1875
        "    print(f\"  Dataset {{\\n{self.dataset}}}\")\n",
1876
        "    print(f\"  Splits {{\\n{self.splits}}}\")\n",
1877
        "\n",
1878
        "  def get_label_key(self):\n",
1879
        "    return {\n",
1880
        "        \"stanford_online_products\": \"super_class_id\",\n",
1881
        "        }.get(self.name, \"label\")\n",
1882
        "\n",
1883
        "  def get_builder(self, mode):\n",
        "    if isinstance(self.dataset, str):\n",
1885
        "      return get_tfds_builder(self.dataset)\n",
1886
        "    return get_tfds_builder(self.dataset[mode])\n",
1887
        "\n",
1888
        "  def get_ds(self, mode, hparams):\n",
1889
        "    data = self.get_builder(mode).as_dataset(\n",
1890
        "        split=self.splits[mode],\n",
1891
        "        shuffle_files=mode==\"train\")\n",
1892
        "\n",
1893
        "    def _pp(data):\n",
1894
        "      im = data[\"image\"]\n",
1895
        "      tf.debugging.assert_type(im, tf.uint8)\n",
1896
        "\n",
1897
        "      if mode == \"train\":\n",
1898
        "        if hparams.get(\"ds_quality_delta\", 0.0) \u003e 0.0:\n",
1899
        "          im = tf.image.random_jpeg_quality(\n",
1900
        "              im,\n",
1901
        "              min_jpeg_quality=int(100 * (1 - hparams[\"ds_quality_delta\"])),\n",
1902
        "              max_jpeg_quality=100)\n",
1903
        "\n",
1904
        "      # Must have 3 channels.\n",
1905
        "      if im.shape[-1] == 1:\n",
1906
        "        im = tf.squeeze(tf.stack([im] * 3, -1), axis=-2)\n",
1907
        "      assert im.shape[-1] == 3\n",
1908
        "      im = tf.cast(im, tf.float32)\n",
1909
        "      if mode == \"train\":\n",
1910
        "        if hparams.get(\"ds_area_range_min\", 1.0) \u003c 1.0:\n",
1911
        "          channels = im.shape[-1]\n",
1912
        "          begin, size, _ = tf.image.sample_distorted_bounding_box(\n",
1913
        "              tf.shape(im),\n",
1914
        "              tf.zeros([0, 0, 4], tf.float32),\n",
1915
        "              aspect_ratio_range=[hparams[\"ds_aspect_ratio_range_min\"],\n",
1916
        "                                  1.0/hparams[\"ds_aspect_ratio_range_min\"]],\n",
1917
        "              area_range=[hparams[\"ds_area_range_min\"], 1.0],\n",
        "              # Minimum overlap with the bounding box; the bounding box should\n",
        "              # anyway default to the whole image in this case.\n",
1920
        "              min_object_covered=0,\n",
1921
        "              use_image_if_no_bounding_boxes=True)\n",
1922
        "          im = tf.slice(im, begin, size)\n",
1923
        "          # Restore the depth-dimension lost by the above operation.\n",
1924
        "          im.set_shape([None, None, channels])\n",
1925
        "        if hparams.get(\"ds_flip_left_right\", False):\n",
1926
        "          if tf.random.uniform(shape=[]) \u003e 0.5:\n",
1927
        "            im = tf.image.flip_left_right(im)\n",
1928
        "        if hparams.get(\"ds_brightness_delta\", 0.0) \u003e 0.0:\n",
1929
        "          im = tf.image.random_brightness(\n",
1930
        "              im, max_delta=hparams[\"ds_brightness_delta\"])\n",
1931
        "        if hparams.get(\"ds_contrast_delta\", 0.0) \u003e 0.0:\n",
1932
        "          im = tf.image.random_contrast(\n",
1933
        "              im, lower=1-hparams[\"ds_contrast_delta\"],\n",
1934
        "              upper=1+hparams[\"ds_contrast_delta\"])\n",
1935
        "        if hparams.get(\"ds_saturation_delta\", 0.0) \u003e 0.0:\n",
1936
        "          im = tf.image.random_saturation(\n",
1937
        "              im, lower=1-hparams[\"ds_saturation_delta\"],\n",
1938
        "              upper=1 + hparams[\"ds_saturation_delta\"])\n",
1939
        "        if hparams.get(\"ds_hue_delta\", 0.0) \u003e 0.0:\n",
1940
        "          im = tf.image.random_hue(im, max_delta=hparams[\"ds_hue_delta\"])\n",
1941
        "\n",
1942
        "      def get_formatted_image(image, image_size):\n",
1943
        "        image = tf.image.resize(image, [image_size, image_size])\n",
1944
        "        # Values in range [-1 , 1].\n",
1945
        "        image = image / 127.5 - 1\n",
1946
        "        image = tf.clip_by_value(image, -1, 1)\n",
1947
        "        return image\n",
1948
        "\n",
1949
        "      if type(hparams[\"ds_image_size\"]) is list:\n",
1950
        "        out_im = {}\n",
1951
        "        for im_size in hparams[\"ds_image_size\"]:\n",
1952
        "          out_im[str(im_size)] = get_formatted_image(im, int(im_size))\n",
1953
        "      else:\n",
1954
        "        out_im = get_formatted_image(im, int(hparams[\"ds_image_size\"]))\n",
1955
        "      return {\"image\": out_im,\n",
1956
        "              \"label\": data[self.get_label_key()]}\n",
1957
        "\n",
1958
        "    if mode == \"validation\":\n",
1959
        "      data = data.take(self.num_validation_examples).cache()\n",
1960
        "    if mode != \"test\":\n",
1961
        "      data = data.repeat()\n",
1962
        "    if self.vtab_class and self.vtab_class._base_preprocess_fn:\n",
1963
        "      data = data.map(self.vtab_class._base_preprocess_fn, tf.data.AUTOTUNE)\n",
1964
        "    data = data.map(_pp, tf.data.AUTOTUNE)\n",
1965
        "    if mode == \"train\":\n",
1966
        "      batch_size = self.train_batch_size\n",
1967
        "    else:\n",
1968
        "      batch_size = self.validation_batch_size\n",
1969
        "    data = data.batch(batch_size)\n",
1970
        "    if mode == \"train\":\n",
1971
        "      data = data.shuffle(10)\n",
1972
        "    return tfds.as_numpy(data.prefetch(tf.data.AUTOTUNE))\n",
1973
        "\n",
1974
        "def get_task_factory_fn(config):\n",
1975
        "  def get_task(task_name: str):\n",
1976
        "    return Task(name=task_name, config=config)\n",
1977
        "  return get_task"
1978
      ]
1979
    },
1980
    {
1981
      "cell_type": "markdown",
1982
      "metadata": {
1983
        "id": "GdzUzvhrWLer"
1984
      },
1985
      "source": [
1986
        "# Components"
1987
      ]
1988
    },
1989
    {
1990
      "cell_type": "code",
1991
      "execution_count": null,
1992
      "metadata": {
1993
        "id": "rzxoZ4rdQZcA"
1994
      },
1995
      "outputs": [],
1996
      "source": [
1997
        "def params2comps(params, train_locks, agent_id, name=None):\n",
        "  \"\"\"Convert frozen dict of params to a list of components.\"\"\"\n",
1999
        "  components = []\n",
2000
        "  for k in params:\n",
2001
        "    if name is None or name == k:\n",
2002
        "      c = Component(\n",
2003
        "          name=k, agent_id=agent_id,\n",
2004
        "          params=params[k], train_locks=train_locks)\n",
2005
        "      components.append(c)\n",
2006
        "  return components"
2007
      ]
2008
    },
2009
    {
2010
      "cell_type": "code",
2011
      "execution_count": null,
2012
      "metadata": {
2013
        "id": "DNxjDX13_dm_"
2014
      },
2015
      "outputs": [],
2016
      "source": [
2017
        "def fingerprint_params(params):\n",
2018
        "  return np.sum(np.array(jax.tree_util.tree_leaves(\n",
2019
        "      jax.tree_util.tree_map(jnp.sum, params))))\n",
2020
        "\n",
2021
        "class Component():\n",
2022
        "  counter = 0\n",
2023
        "  # Components of retained paths with id \u003c= last_saved are saved in checkpoint.\n",
2024
        "  last_saved = -1\n",
2025
        "\n",
        "  @staticmethod\n",
        "  def reset_globals():\n",
2027
        "    Component.counter = 0\n",
2028
        "    Component.last_saved = -1\n",
2029
        "\n",
2030
        "  def __init__(\n",
2031
        "      self, name: str, agent_id: str, params, train_locks):\n",
2032
        "    self.name = name\n",
2033
        "    self.agent_id = agent_id\n",
2034
        "    self.params = jax.device_get(params)\n",
2035
        "    self.num_params = None\n",
2036
        "    self.train_locks = set(train_locks)\n",
2037
        "    self.id = Component.counter\n",
2038
        "    Component.counter += 1\n",
2039
        "\n",
2040
        "  def get_num_params(self):\n",
2041
        "    if self.num_params is None:\n",
2042
        "      self.num_params = get_num_params(self.params)\n",
2043
        "    return self.num_params\n",
2044
        "\n",
2045
        "  def fingerprint(self):\n",
2046
        "    return fingerprint_params(self.params)\n",
2047
        "\n",
2048
        "  def is_trainable(self):\n",
2049
        "    return len(self.train_locks) == 0\n",
2050
        "\n",
2051
        "  def clone(self, agent_id):\n",
2052
        "    return Component(name=self.name,\n",
2053
        "                     agent_id=agent_id,\n",
2054
        "                     params=copy.deepcopy(jax.device_get(self.params)),\n",
2055
        "                     train_locks=set())"
2056
      ]
2057
    },
2058
    {
2059
      "cell_type": "code",
2060
      "execution_count": null,
2061
      "metadata": {
2062
        "id": "SdMf8WcYBbjt"
2063
      },
2064
      "outputs": [],
2065
      "source": [
2066
        "class ComponentPath():\n",
        "  \"\"\"Wraps a Path to be used as a Component.\"\"\"\n",
2068
        "  def __init__(self, name, path):\n",
2069
        "    self.name = name\n",
2070
        "    self.path = path\n",
2071
        "    self.train_locks = set([\"FROZEN\"])\n",
2072
        "\n",
2073
        "  def is_trainable(self):\n",
2074
        "    return False\n",
2075
        "\n",
2076
        "  @property\n",
2077
        "  def params(self):\n",
2078
        "    return flax.core.freeze(self.path.get_all_params())"
2079
      ]
2080
    },
2081
    {
2082
      "cell_type": "markdown",
2083
      "metadata": {
2084
        "id": "pCie1hQLUyP7"
2085
      },
2086
      "source": [
        "# Paths \u0026 Population"
2088
      ]
2089
    },
2090
    {
2091
      "cell_type": "code",
2092
      "execution_count": null,
2093
      "metadata": {
2094
        "id": "8PU5ffvd_gC9"
2095
      },
2096
      "outputs": [],
2097
      "source": [
2098
        "class Path():\n",
        "  @staticmethod\n",
        "  def reset_globals(config):\n",
2100
        "    Path.config = config\n",
2101
        "    Path.counter = 0\n",
2102
        "    Path.last_saved = -1\n",
2103
        "    Path.paths = []\n",
2104
        "    Path.scorer = globals()[config.get(\"scorer_class\", \"ScorerDecay\")](\n",
2105
        "        **config.get(\"scorer_kwargs\", {}))\n",
        "    # Cache the output of function calls made with the same args.\n",
2107
        "    Path.cached_tasks = ObjectCache(get_task_factory_fn(config))\n",
2108
        "    Path.cached_optimizers = ObjectCache(get_optimizer)\n",
2109
        "\n",
2110
        "  def __init__(self, hparams, components, parent, agent_id, task_name):\n",
2111
        "    self.components = components\n",
2112
        "    self.id = Path.counter\n",
2113
        "    Path.counter += 1\n",
2114
        "    self.agent_id = agent_id\n",
2115
        "    self.task_name = task_name\n",
2116
        "    self.parent = parent\n",
2117
        "    self.hparams = hparams\n",
2118
        "    self._model = None\n",
2119
        "    self.metrics = {\n",
2120
        "        \"generation\": 0 if parent is None else parent.metrics[\"generation\"] + 1,\n",
2121
        "    }\n",
2122
        "    Path.paths.append(self)\n",
2123
        "\n",
2124
        "  @property\n",
2125
        "  def task(self):\n",
2126
        "    return Path.cached_tasks(task_name=self.task_name)\n",
2127
        "\n",
2128
        "  @property\n",
2129
        "  def model_factory(self):\n",
2130
        "    return get_agent_class(self.agent_id).get_model_factory()\n",
2131
        "\n",
2132
        "  @property\n",
2133
        "  def model(self):\n",
        "    if self._model is None:\n",
2135
        "      self._model = self.model_factory.get_model(self.hparams, self.config)\n",
2136
        "    return self._model\n",
2137
        "\n",
2138
        "  @property\n",
2139
        "  def full_id(self):\n",
2140
        "    return f\"{self.agent_id}:{self.id}\"\n",
2141
        "\n",
2142
        "  def comps_only(self):  # Exclude wrapped paths.\n",
2143
        "    return [c for c in self.components if c.__class__ is Component]\n",
2144
        "\n",
2145
        "  def score(self):\n",
2146
        "    return Path.scorer.score(self)\n",
2147
        "\n",
2148
        "  def get_all_params(self):\n",
2149
        "    params = {}\n",
2150
        "    for c in self.components:\n",
2151
        "      assert c.name not in params, c.name\n",
2152
        "      params[c.name] = c.params\n",
2153
        "    return flax.core.freeze(params)\n",
2154
        "\n",
2155
        "  def get_trainable_params(self):\n",
2156
        "    params = {}\n",
2157
        "    for c in self.components:\n",
2158
        "      if c.is_trainable():\n",
2159
        "        assert c.name not in params, c.name\n",
2160
        "        params[c.name] = c.params\n",
2161
        "    return flax.core.freeze(params)\n",
2162
        "\n",
2163
        "  def get_fixed_params(self):\n",
2164
        "    params = {}\n",
2165
        "    for c in self.components:\n",
2166
        "      if not c.is_trainable():\n",
2167
        "        assert c.name not in params, c.name\n",
2168
        "        params[c.name] = c.params\n",
2169
        "    return flax.core.freeze(params)\n",
2170
        "\n",
2171
        "  def update_trainable(self, trained_params):\n",
2172
        "    trainable_count = 0\n",
2173
        "    for c in self.components:\n",
2174
        "      if c.is_trainable():\n",
2175
        "        trainable_count += 1\n",
2176
        "        assert c.name in trained_params.keys()\n",
2177
        "        c.params = trained_params[c.name]\n",
2178
        "    assert len(trained_params.keys()) == trainable_count, (\n",
2179
        "        f\"{len(trained_params.keys())} {trainable_count}\")\n",
2180
        "\n",
2181
        "  def get_num_accounted_params(self):\n",
2182
        "    rtn = 0\n",
2183
        "    for c in self.components:\n",
2184
        "      tl = copy.copy(c.train_locks)\n",
2185
        "      assert type(tl) is set\n",
2186
        "      tl.add(self.agent_id)\n",
2187
        "      assert tl\n",
2188
        "      rtn += c.get_num_params() / len(tl)\n",
2189
        "    return rtn\n",
2190
        "\n",
2191
        "  def get_flops(self):\n",
2192
        "    return compute_flops_hlo(\n",
2193
        "          partial(self.model.apply, train=False),\n",
2194
        "          {\"params\": self.model_factory.get_comps2model_fn()(merge_params(\n",
2195
        "              self.get_trainable_params(),\n",
2196
        "              self.get_fixed_params()))},\n",
2197
        "          self.model_factory.get_sample_input(self.hparams))\n",
2198
        "\n",
2199
        "  def get_optimizer(self):\n",
2200
        "    return Path.cached_optimizers(\n",
2201
        "        num_train_batches_between_validations=\n",
2202
        "            self.task.num_train_batches_between_validations,\n",
2203
        "        num_validations_per_path_training=\n",
2204
        "            self.task.config.num_validations_per_path_training,\n",
2205
        "        **self.hparams)"
2206
      ]
2207
    },
2208
    {
2209
      "cell_type": "code",
2210
      "execution_count": null,
2211
      "metadata": {
2212
        "id": "bkRmcJgzUbwN"
2213
      },
2214
      "outputs": [],
2215
      "source": [
2216
        "class ScorerDecay():\n",
2217
        "  def __init__(self, scale_factor=1, base_accounted_params=0, base_flops=0):\n",
2218
        "    assert 0.0 \u003c scale_factor \u003c= 1.0\n",
2219
        "    self.scale_factor = scale_factor\n",
2220
        "    self.base_accounted_params = base_accounted_params\n",
2221
        "    self.base_flops = base_flops\n",
2222
        "\n",
2223
        "  def score(self, path):\n",
2224
        "    if (\"quality\" not in path.metrics\n",
2225
        "        or math.isnan(path.metrics[\"quality\"])):\n",
2226
        "      return None\n",
2227
        "    assert path.metrics[\"quality\"] \u003e= 0, (\n",
2228
        "        f\"{path.task_name} {path.metrics['quality']}\")\n",
2229
        "    score = path.metrics[\"quality\"]\n",
2230
        "    if self.base_accounted_params \u003e 0:\n",
2231
        "      # Accounted params needs to be updated since it depends on the\n",
2232
        "      # changing structure of the system.\n",
2233
        "      path.metrics[\"accounted_params\"] = path.get_num_accounted_params()\n",
2234
        "      score *= self.scale_factor ** (\n",
2235
        "          path.metrics[\"accounted_params\"] / self.base_accounted_params)\n",
2236
        "    if self.base_flops \u003e 0:\n",
2237
        "      if \"flops\" not in path.metrics:\n",
2238
        "        path.metrics[\"flops\"] = path.get_flops()\n",
2239
        "      score *= self.scale_factor ** (path.metrics[\"flops\"] / self.base_flops)\n",
2240
        "    assert score \u003e= 0\n",
2241
        "    path.metrics[\"score\"] = score\n",
2242
        "    return score"
2243
      ]
2244
    },
2245
    {
2246
      "cell_type": "code",
2247
      "execution_count": null,
2248
      "metadata": {
2249
        "id": "YMbYgKd8_nyi"
2250
      },
2251
      "outputs": [],
2252
      "source": [
2253
        "class Population():\n",
2254
        "  def __init__(self, config):\n",
2255
        "    Path.reset_globals(config)\n",
2256
        "    Component.reset_globals()\n",
2257
        "    self.paths = defaultdict(list)\n",
2258
        "    self.config = config\n",
2259
        "    self.paths_df = pd.DataFrame()\n",
2260
        "    self.comps_df = pd.DataFrame()\n",
2261
        "\n",
2262
        "  def get_best_path(self):\n",
2263
        "    if len(self.paths[self.config.agent_id]) == 0:\n",
2264
        "      return None\n",
2265
        "    # Oldest path achieving max score.\n",
2266
        "    return max(sorted(self.paths[self.config.agent_id], key=lambda p: p.id, reverse=False), key=lambda p: p.score())\n",
2267
        "\n",
2268
        "  def prune_population(self):\n",
2269
        "    if self.config.get(\"max_task_population_size\", None) and (\n",
2270
        "        len(self.paths[self.config.agent_id]) \u003e self.config.max_task_population_size):\n",
2271
        "      self.paths[self.config.agent_id] = sorted(\n",
2272
        "          self.paths[self.config.agent_id], key=lambda p: p.score(), reverse=True\n",
2273
        "          )[:self.config.max_task_population_size]\n",
2274
        "\n",
2275
        "  def add_train_locks(self):\n",
2276
        "    # Check.\n",
2277
        "    for ps in self.paths.values():\n",
2278
        "      for p in ps:\n",
2279
        "        for c in p.components:\n",
2280
        "          assert self.config.agent_id not in c.train_locks\n",
2281
        "    # Add locks.\n",
2282
        "    paths = self.paths[self.config.agent_id]\n",
2283
        "    for p in paths:\n",
2284
        "      for c in p.components:\n",
2285
        "        c.train_locks.add(self.config.agent_id)\n",
2286
        "\n",
2287
        "  def rm_train_locks(self):\n",
2288
        "    # Remove locks.\n",
2289
        "    paths = self.paths[self.config.agent_id]\n",
2290
        "    for p in paths:\n",
2291
        "      for c in p.components:\n",
2292
        "        if self.config.agent_id in c.train_locks:\n",
2293
        "          c.train_locks.remove(self.config.agent_id)\n",
2294
        "    # Check.\n",
2295
        "    for ps in self.paths.values():\n",
2296
        "      for p in ps:\n",
2297
        "        for c in p.components:\n",
2298
        "          assert self.config.agent_id not in c.train_locks\n",
2299
        "\n",
2300
        "  def start_cycle(self):\n",
2301
        "    self.rm_train_locks()\n",
2302
        "\n",
2303
        "  def end_cycle(self):\n",
2304
        "    # Keep only best one.\n",
2305
        "    best_path = self.get_best_path()\n",
2306
        "    assert best_path is not None\n",
2307
        "    best_path.metrics[\"num_cycles\"] = best_path.metrics.get(\"num_cycles\", 0) + 1\n",
2308
        "    self.paths[self.config.agent_id] = [best_path]\n",
2309
        "    self.add_train_locks()\n",
2310
        "    self.garbage_collect_paths()\n",
2311
        "\n",
2312
        "  def garbage_collect_paths(self):\n",
2313
        "    # Store history before dropping references to unused paths to trigger\n",
2314
        "    # garbage collection of components and parameters.\n",
        "    self.paths_df = pd.concat(\n",
        "        [self.paths_df, paths_to_df(Path.paths)], ignore_index=True\n",
        "        ).query(f'agent_id==\"{self.config.agent_id}\" and id\u003e{Path.last_saved}'\n",
        "        # Drop duplicates generated by reloads, notice that some state-based\n",
        "        # metrics may vary (e.g. accounted parameters) so we match only id\n",
        "        # (agent_id is already matched from the preceding query).\n",
        "        ).drop_duplicates(\"id\")\n",
        "    self.comps_df = pd.concat(\n",
        "        [self.comps_df, components_to_df(Path.paths)], ignore_index=True\n",
        "        ).query(f'agent_id==\"{self.config.agent_id}\" and id\u003e{Component.last_saved}'\n",
        "        ).drop_duplicates()\n",
2326
        "    # Drop unused paths generated in this agent cycle for garbage collection.\n",
2327
        "    Path.paths = []\n",
2328
        "    # Simplify ancestor tree to contain only live paths.\n",
2329
        "    # Notice that the simplification is done also for paths of other tasks,\n",
2330
        "    # since they may be pointing to a path of this task that was discarded.\n",
2331
        "    live_paths_ids = [p.full_id for paths in self.paths.values() for p in paths]\n",
2332
        "    for path in [path for paths in self.paths.values() for path in paths]:\n",
2333
        "      ancestor = path.parent\n",
2334
        "      if ancestor is None:\n",
2335
        "        continue\n",
2336
        "      while True:\n",
2337
        "        if ancestor.full_id in live_paths_ids:\n",
2338
        "          path.parent = ancestor\n",
2339
        "          break\n",
2340
        "        ancestor = ancestor.parent\n",
2341
        "\n",
2342
        "  def get_path_from_full_id(self, agent_id, path_id):\n",
2343
        "    for p in self.paths[agent_id]:\n",
2344
        "      if p.id == path_id:\n",
2345
        "        return p\n",
2346
        "    assert False, f\"Path not found {agent_id}:{path_id}\""
2347
      ]
2348
    },
2349
    {
2350
      "cell_type": "code",
2351
      "execution_count": null,
2352
      "metadata": {
2353
        "id": "4EgQHbawpNcS"
2354
      },
2355
      "outputs": [],
2356
      "source": [
2357
        "pd.set_option(\"display.expand_frame_repr\", False)\n",
2358
        "pd.set_option(\"display.max_columns\", 100)\n",
2359
        "pd.set_option(\"display.max_rows\", 100)\n",
2360
        "\n",
2361
        "def pop_to_df(pop):\n",
2362
        "  return paths_to_df([p for paths in pop.paths.values() for p in paths])\n",
2363
        "\n",
2364
        "def paths_to_df(paths):\n",
2365
        "  # Collect all metrics names.\n",
2366
        "  metrics_keys = set()\n",
2367
        "  hparams_keys = set()\n",
2368
        "  for path in paths:\n",
2369
        "    path.score()  # Update scores.\n",
2370
        "    metrics_keys.update(path.metrics)\n",
2371
        "    hparams_keys.update(path.hparams)\n",
2372
        "\n",
2373
        "  def _format(x):\n",
2374
        "    if type(x) in [dict, list]:\n",
2375
        "      return json.dumps(x)\n",
2376
        "    return x\n",
2377
        "\n",
2378
        "  data = defaultdict(list)\n",
2379
        "  for path in paths:\n",
2380
        "    data[\"agent_id\"].append(path.agent_id)\n",
2381
        "    data[\"task_name\"].append(path.task_name)\n",
2382
        "    data[\"id\"].append(path.id)\n",
2383
        "    data[\"parent_id\"].append(path.parent.id if path.parent else -1)\n",
2384
        "    data[\"parent_agent_id\"].append(path.parent.agent_id if path.parent else None)\n",
2385
        "    data[\"components\"].append(\",\".join(\n",
2386
        "        [f\"{c.agent_id}:{c.id}\" for c in path.comps_only()]))\n",
2387
        "    for k in hparams_keys:\n",
2388
        "      data[f\"hparams.{k}\"].append(_format(path.hparams[k]) if k in path.hparams else None)\n",
2389
        "    for k in metrics_keys:\n",
2390
        "      data[f\"metrics.{k}\"].append(path.metrics[k] if k in path.metrics else None)\n",
2391
        "  return pd.DataFrame(data)\n",
2392
        "\n",
2393
        "def components_to_df(paths):\n",
2394
        "  # Collect all components.\n",
2395
        "  comps = set()\n",
2396
        "  for p in paths:\n",
2397
        "    comps.update(p.comps_only())\n",
2398
        "\n",
2399
        "  data = defaultdict(list)\n",
2400
        "  for c in comps:\n",
2401
        "    data[\"id\"].append(c.id)\n",
2402
        "    data[\"name\"].append(c.name)\n",
2403
        "    data[\"agent_id\"].append(c.agent_id)\n",
2404
        "    data[\"num_params\"].append(c.get_num_params())\n",
2405
        "  return pd.DataFrame(data)\n",
2406
        "\n",
2407
        "def print_df_segments(df, segment_length:int = 5):\n",
2408
        "  tot_length = df.shape[0]\n",
2409
        "  # Pad column title with spaces to keep alignment across segments.\n",
2410
        "  def prepend_spaces(original_str, pad_to_len):\n",
2411
        "    return \" \" * (pad_to_len-len(original_str)) + original_str\n",
2412
        "  pad_to_len = max([len(tn) for tn in set(df[\"agent_id\"].to_list())])+1\n",
2413
        "  df = df.rename(columns={\n",
2414
        "    \"agent_id\": prepend_spaces(\"agent_id\", pad_to_len),\n",
2415
        "    \"task_name\": prepend_spaces(\"task_name\", pad_to_len),\n",
2416
        "    \"parent_agent_id\": prepend_spaces(\"parent_agent_id\", pad_to_len),\n",
2417
        "    })\n",
2418
        "  for x in range(0, tot_length, segment_length):\n",
2419
        "    print(df[x:min(x+segment_length, tot_length)])\n",
2420
        "\n",
2421
        "def df_leaderboard(df):\n",
2422
        "  # Place columns on the left for readability.\n",
2423
        "  all_keys = sorted(df.columns.tolist())\n",
2424
        "  first_keys = [\"agent_id\", \"task_name\", \"metrics.test_quality\", \"metrics.score\",\n",
2425
        "                \"metrics.quality\", \"metrics.accounted_params\", \"metrics.flops\",\n",
2426
        "                \"id\", \"parent_id\", \"parent_agent_id\"]\n",
2427
        "  first_keys = [k for k in first_keys if k in all_keys]\n",
2428
        "  sorted_keys = first_keys + [k for k in all_keys if k not in first_keys]\n",
2429
        "  # Filter mu function parameters.\n",
2430
        "  sorted_keys = [k for k in sorted_keys if \"_mu_|\" not in k]\n",
2431
        "  df = df[sorted_keys]\n",
2432
        "  if \"metrics.score\" in df:\n",
2433
        "    df = df.sort_values([\"agent_id\", \"metrics.score\"], ascending=[True, False], ignore_index=True)\n",
2434
        "  else:\n",
2435
        "    df = df.sort_values(\"agent_id\", ignore_index=True)\n",
2436
        "  print_df_segments(df)\n",
2437
        "  for k in [\"metrics.score\", \"metrics.quality\", \"metrics.test_quality\"]:\n",
2438
        "    if k in df:\n",
2439
        "      print(f\"Avg {k}: {df[k].mean():.6f}\")"
2440
      ]
2441
    },
2442
    {
2443
      "cell_type": "markdown",
2444
      "metadata": {
2445
        "id": "rDU0butVlu2f"
2446
      },
2447
      "source": [
2448
        "# Checkpointing"
2449
      ]
2450
    },
2451
    {
2452
      "cell_type": "code",
2453
      "execution_count": null,
2454
      "metadata": {
2455
        "id": "104ReuGLspoJ"
2456
      },
2457
      "outputs": [],
2458
      "source": [
2459
        "def df_write_to_csv(df, dir_path, df_name):\n",
2460
        "  filename_df = os.path.join(dir_path, f\"{df_name}.csv\")\n",
2461
        "  with gfile.GFile(filename_df, \"w\") as outfile:\n",
2462
        "    df.to_csv(outfile, index=False)\n",
2463
        "\n",
2464
        "def df_read_from_csv(dir_path, df_name):\n",
2465
        "  filename_df = os.path.join(dir_path, f\"{df_name}.csv\")\n",
2466
        "  with gfile.GFile(filename_df, \"r\") as infile:\n",
2467
        "    df = pd.read_csv(infile)\n",
        "  # Pandas read_csv() reads empty strings as NaNs. Set NaNs to empty strings in\n",
        "  # columns with type string/object.\n",
        "  for c in df.columns:\n",
        "    if df[c].dtype == np.object_:\n",
        "      df[c] = df[c].fillna(\"\")\n",
2473
        "  return df\n",
2474
        "\n",
2475
        "def get_comps_params_to_save(pop):\n",
2476
        "  comps_params = {}\n",
2477
        "  # All components generated by this agent.\n",
2478
        "  all_comps = set(\n",
2479
        "      [c for p in pop.paths[pop.config.agent_id] for c in p.comps_only() if c.agent_id == pop.config.agent_id])\n",
2480
        "  # Check that there are not duplicate ids.\n",
2481
        "  assert len(all_comps) == len(set([c.id for c in all_comps])), (\n",
2482
        "      [f\"{c.name}:{c.agent_id}:{c.id}\" for c in all_comps])\n",
2483
        "  for c in all_comps:\n",
2484
        "    if c.id \u003c= Component.last_saved:\n",
2485
        "      continue\n",
2486
        "    assert c.agent_id == pop.config.agent_id\n",
2487
        "    c_id_string = f\"{c.name}:{c.agent_id}:{c.id}\"\n",
2488
        "    comps_params[c_id_string] = c.params\n",
2489
        "  return comps_params"
2490
      ]
2491
    },
2492
    {
2493
      "cell_type": "code",
2494
      "execution_count": null,
2495
      "metadata": {
2496
        "id": "Z_AmkV7hLeQp"
2497
      },
2498
      "outputs": [],
2499
      "source": [
2500
        "def latest_checkpoint(ckpt_dir, prefix = \"checkpoint_\"):\n",
2501
        "  ckpt_dir = os.fspath(ckpt_dir)\n",
2502
        "  glob_path = os.path.join(ckpt_dir, f\"{prefix}*\")\n",
2503
        "  checkpoint_files = flax_checkpoints.natural_sort(gfile.glob(glob_path))\n",
2504
        "  checkpoint_files = [f for f in checkpoint_files if not f.endswith(\"_tmp\")]\n",
2505
        "  return checkpoint_files[-1] if checkpoint_files else None"
2506
      ]
2507
    },
2508
    {
2509
      "cell_type": "code",
2510
      "execution_count": null,
2511
      "metadata": {
2512
        "id": "M_RTFjoOwfpM"
2513
      },
2514
      "outputs": [],
2515
      "source": [
2516
        "def save_checkpoint(ckpt_dir, comps_params, cycle_id, generation_id):\n",
2517
        "  print(\"SAVING\", cycle_id, generation_id, comps_params.keys())\n",
2518
        "  # Write checkpoint.\n",
2519
        "  flax_checkpoints.save_checkpoint(\n",
2520
        "      ckpt_dir=ckpt_dir,\n",
2521
        "      target=comps_params,\n",
2522
        "      step=generation_id,\n",
2523
        "      prefix=f\"checkpoint_{cycle_id}_\",\n",
2524
        "      overwrite=True)\n",
2525
        "  # Delete intermediate checkpoint directories.\n",
2526
        "  if generation_id == 0:\n",
2527
        "    intermediate_ckpt_dirs = gfile.glob(\n",
2528
        "        os.path.join(os.path.dirname(ckpt_dir), \"state_*_[^0]*\"))\n",
2529
        "    for d in intermediate_ckpt_dirs:\n",
2530
        "      print(\"Deleting intermediate checkpoint:\", d)\n",
2531
        "      gfile.rmtree(d)\n",
2532
        "\n",
2533
        "def save_state(agent):\n",
2534
        "  pop = agent.pop\n",
2535
        "  cycle_id = agent.cycle_id\n",
2536
        "  generation_id = agent.generation_id\n",
2537
        "  config = agent.config\n",
2538
        "  write_start = time.time()\n",
2539
        "  # Save data needed to resume exp.\n",
2540
        "  pop.garbage_collect_paths()\n",
2541
        "  state_dir = os.path.join(config.agent_dir, f\"state_{cycle_id}_{generation_id}\")\n",
2542
        "  gfile.makedirs(state_dir)\n",
        "  assert not latest_checkpoint(state_dir), f\"Checkpoint already present in folder: {state_dir}\"\n",
2544
        "  print(\"WRITING CHECKPOINT:\", cycle_id, generation_id)\n",
2545
        "  df_write_to_csv(paths_to_df(agent.get_paths_to_publish()), state_dir, \"published\")\n",
2546
        "  df_write_to_csv(paths_to_df([p for paths in pop.paths.values() for p in paths]), state_dir, \"population\")\n",
2547
        "  df_write_to_csv(pop.paths_df, state_dir, \"paths\")\n",
2548
        "  df_write_to_csv(pop.comps_df, state_dir, \"components\")\n",
        "  with gfile.GFile(os.path.join(state_dir, \"config.json\"), \"w\") as outfile:\n",
        "    json.dump(config.as_configdict().to_dict(), outfile, indent=2)\n",
2550
        "  save_checkpoint(state_dir, get_comps_params_to_save(pop), cycle_id, generation_id)\n",
2551
        "  # Update last saved.\n",
2552
        "  if generation_id == 0:\n",
2553
        "    Path.last_saved = pop.paths_df.id.max()\n",
2554
        "    Component.last_saved = pop.comps_df.id.max()\n",
2555
        "  print(f\"STATE WRITE TIME: {time.time() - write_start:.2f} s\")"
2556
      ]
2557
    },
2558
    {
2559
      "cell_type": "code",
2560
      "execution_count": null,
2561
      "metadata": {
2562
        "id": "7yWdy7DskBph"
2563
      },
2564
      "outputs": [],
2565
      "source": [
2566
        "def load_paths(pop, state_dir, all_agents_dirs):\n",
2567
        "  if state_dir:\n",
2568
        "    state_dir = state_dir.rstrip(\"/\")\n",
2569
        "  load_start = time.time()\n",
2570
        "\n",
2571
        "  # Load system state info.\n",
2572
        "  population_df = pd.DataFrame()\n",
2573
        "  skip_agent_dir = None\n",
2574
        "  if state_dir:\n",
2575
        "    # Load agent state, possibly intermediate.\n",
        "    population_df = pd.concat([population_df, df_read_from_csv(state_dir, \"published\")])\n",
2577
        "    skip_agent_dir = os.path.dirname(state_dir)\n",
2578
        "  for agent_dir in all_agents_dirs:\n",
2579
        "    if agent_dir == skip_agent_dir:\n",
2580
        "      continue\n",
2581
        "    agent_checkpoint = latest_checkpoint(os.path.join(agent_dir, \"state_*_0/\"))\n",
2582
        "    if agent_checkpoint:\n",
2583
        "      population_df = population_df.append(\n",
2584
        "          df_read_from_csv(os.path.dirname(agent_checkpoint), \"published\"))\n",
2585
        "\n",
        "  # Load parameters from sharded system checkpoint.\n",
        "  loaded_params = {}  # Dictionary to accumulate loaded parameters.\n",
        "  lock = Lock()\n",
        "  duplicate_keys = set()\n",
        "  def append_loaded_params(add_chkp_dir: str):\n",
        "    if latest_checkpoint(add_chkp_dir) is None:\n",
        "      return  # Skip folders without a completed checkpoint.\n",
        "    lp_add = flax_checkpoints.restore_checkpoint(\n",
        "        ckpt_dir=add_chkp_dir,\n",
        "        target=None)\n",
        "    if lp_add:\n",
        "      # Hold the lock while merging so that loader threads do not interleave.\n",
        "      with lock:\n",
        "        print(\"LOADED COMPONENTS\", add_chkp_dir, lp_add.keys())\n",
        "        duplicate_keys.update(loaded_params.keys() \u0026 lp_add.keys())\n",
        "        loaded_params.update(lp_add)\n",
        "  all_state_dirs = []\n",
        "  if state_dir:\n",
        "    # Include active agent state, possibly intermediate.\n",
        "    all_state_dirs.append(state_dir)\n",
        "    all_state_dirs.extend(gfile.glob(os.path.join(os.path.dirname(state_dir), \"state_*_0\")))\n",
        "  for agent_dir in all_agents_dirs:\n",
        "    all_state_dirs.extend(gfile.glob(os.path.join(agent_dir, \"state_*_0\")))\n",
        "  threads = []\n",
        "  for s_dir in set(all_state_dirs):\n",
        "    threads.append(Thread(target=append_loaded_params, args=(s_dir,)))\n",
        "    threads[-1].start()\n",
        "  for t in threads:\n",
        "    t.join()\n",
        "  assert not duplicate_keys, duplicate_keys\n",
        "  print(f\"LOAD TIME: {time.time() - load_start:.2f} s\")\n",
        "  frozen_params = flax.core.freeze(loaded_params)\n",
        "  sid_2_comp = {}\n",
        "  for k in frozen_params.keys():\n",
        "    assert len(k.split(\":\")) == 3, k\n",
        "    # Keys have the form name:agent_id:component_id.\n",
        "    name, agent_id, comp_id = k.split(\":\")\n",
        "    c = Component(\n",
        "        name=name, agent_id=agent_id, params=frozen_params[k], train_locks=[])\n",
        "    c.id = int(comp_id)\n",
        "    source_id = f\"{agent_id}:{comp_id}\"\n",
        "    assert source_id not in sid_2_comp, source_id\n",
        "    sid_2_comp[source_id] = c\n",
        "  # For parent assignment.\n",
        "  sid_2_path = {}\n",
        "  path_2_parent_sid = {}\n",
        "  for index, row in population_df.iterrows():\n",
        "    agent_id = row[\"agent_id\"]\n",
        "    path_id = int(row[\"id\"])\n",
        "    path_sid = f\"{agent_id}:{path_id}\"\n",
        "    if path_sid in sid_2_path:\n",
        "      continue\n",
        "    comps_sids = row[\"components\"].split(\",\")\n",
        "    comps = []\n",
        "    for sid in comps_sids:\n",
        "      comps.append(sid_2_comp[sid])\n",
        "    task_name = row[\"task_name\"]\n",
        "    # Retrieve hparams and metrics.\n",
        "    hparams = {}\n",
        "    metrics = {}\n",
        "    for k in row.keys():\n",
        "      v = row[k]\n",
        "      if type(v) is float and math.isnan(v):\n",
        "        continue\n",
        "      if k.startswith(\"hparams.\"):\n",
        "        if isinstance(v, str) and (v.startswith(\"{\") or v.startswith(\"[\")):\n",
        "          v = json.loads(v)\n",
        "        hparams[k[len(\"hparams.\"):]] = v\n",
        "      if k.startswith(\"metrics.\"):\n",
        "        metrics[k[len(\"metrics.\"):]] = v\n",
        "    # Create path.\n",
        "    path = Path(hparams, comps, parent=None, agent_id=agent_id, task_name=task_name)\n",
        "    path.metrics = metrics\n",
        "    path.id = path_id\n",
        "    # Add train locks.\n",
        "    for c in path.components:\n",
        "      c.train_locks.add(agent_id)\n",
        "    pop.paths[agent_id].append(path)\n",
        "    sid_2_path[path_sid] = path\n",
        "    if row[\"parent_id\"] \u003e= 0:\n",
        "      parent_sid = f'{row[\"parent_agent_id\"]}:{row[\"parent_id\"]}'\n",
        "      path_2_parent_sid[path] = parent_sid\n",
        "  # Set parents.\n",
        "  for path, parent_sid in path_2_parent_sid.items():\n",
        "    if parent_sid not in sid_2_path:\n",
        "      # This can happen if the parent was retired by a parallel agent.\n",
        "      # In this case, fall back to the root model.\n",
        "      for k in sid_2_path.keys():\n",
        "        if \"root_model\" in k:\n",
        "          parent_sid = k\n",
        "      print(f\"{path.agent_id}:{path.id} orphaned, fallback: {parent_sid}\")\n",
        "    path.parent = sid_2_path[parent_sid]\n",
        "  # Set references to components representing sub-paths.\n",
        "  for path in [p for paths in pop.paths.values() for p in paths]:\n",
        "    if \"paths\" in path.hparams:\n",
        "      for k in path.hparams[\"paths\"]:\n",
        "        sub_path = pop.get_path_from_full_id(path.hparams[\"paths\"][k][\"agent_id\"], path.hparams[\"paths\"][k][\"id\"])\n",
        "        path.components.append(ComponentPath(name=k, path=sub_path))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_0AyzNaTl002"
      },
      "source": [
        "# Training"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "m6vSjSIvNPq4"
      },
      "outputs": [],
      "source": [
        "@partial(jax.jit, static_argnames=\"model\")\n",
        "def eval_step(params, inputs, labels, model):\n",
        "  logits = model.apply({\"params\": params}, inputs, train=False)\n",
        "  # Avg accuracy on the batch.\n",
        "  return (logits.argmax(axis=-1) == labels).mean()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "_m2xl8XR7cWy"
      },
      "outputs": [],
      "source": [
        "@partial(jax.jit, static_argnames=[\"model\", \"optimizer\", \"format_params_fn\"], donate_argnums=[0, 2])\n",
        "def train_step(params, fixed_params, opt_state, inputs, labels, model, optimizer, format_params_fn):\n",
        "  def loss_fn(params, fixed_params, inputs, labels):\n",
        "    logits = model.apply(\n",
        "        {\"params\": format_params_fn(merge_params(params, fixed_params))},\n",
        "        inputs, train=True)\n",
        "    labels = jax.nn.one_hot(labels, logits.shape[-1])\n",
        "    return -jnp.mean(jnp.sum(labels * nn.log_softmax(logits), axis=-1))\n",
        "  grads = jax.grad(loss_fn)(params, fixed_params, inputs, labels)\n",
        "  updates, opt_state = optimizer.update(grads, opt_state, params=params)\n",
        "  params = optax.apply_updates(params, updates)\n",
        "  return params, opt_state"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "RrvyCDaw0KFZ"
      },
      "outputs": [],
      "source": [
        "def execute_train_step(path, train_batch):\n",
        "  path.params_device, path.opt_state_device = train_step(\n",
        "      path.params_device,\n",
        "      path.fixed_params_device,\n",
        "      path.opt_state_device,\n",
        "      train_batch[\"image\"],\n",
        "      train_batch[\"label\"],\n",
        "      path.model,\n",
        "      path.optimizer,\n",
        "      path.model_factory.get_comps2model_fn())\n",
        "\n",
        "def execute_eval_step(path, eval_batch):\n",
        "  path.accs.append(\n",
        "      eval_step(\n",
        "          path.model_factory.get_comps2model_fn()(merge_params(\n",
        "              path.params_device, path.fixed_params_device)),\n",
        "          eval_batch[\"image\"],\n",
        "          eval_batch[\"label\"],\n",
        "          path.model))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "lRjic_IpGYJU"
      },
      "outputs": [],
      "source": [
        "PREV_LOOP_END = time.time()\n",
        "\n",
        "def train_loop(paths, ds_train, ds_validation, devices, config):\n",
        "  global PREV_LOOP_END\n",
        "  timing = {}\n",
        "  task = paths[0].task\n",
        "  for path in paths:\n",
        "    assert task.name == path.task_name\n",
        "  for p_id, path in enumerate(paths):\n",
        "    path.device_id = p_id % len(devices)\n",
        "    path.device = devices[path.device_id]\n",
        "    path.optimizer = path.get_optimizer()\n",
        "    path.best_params_local = None\n",
        "    path.best_quality = None\n",
        "    path.best_score = path.parent.score() if path.agent_id == path.parent.agent_id else -np.inf\n",
        "    path.evals = []\n",
        "    path.exe_thread = None\n",
        "  gc.collect()\n",
        "  # Transfer parameters to devices.\n",
        "  for path in paths:\n",
        "    path.params_device = jax.device_put(path.get_trainable_params(), path.device)\n",
        "    path.fixed_params_device = jax.device_put(path.get_fixed_params(), path.device)\n",
        "    path.opt_state_device = jax.jit(path.optimizer.init, device=path.device)(path.params_device)\n",
        "  iter_ds_validation = iter(ds_validation)\n",
        "  # Train loop.\n",
        "  for t_step, train_batch in zip(\n",
        "      range(config.num_validations_per_path_training\n",
        "            * task.num_train_batches_between_validations),\n",
        "      ds_train):\n",
        "    if t_step == 0:\n",
        "      timing[\"start_train\"] = time.time()\n",
        "    for p_id, path in enumerate(paths):\n",
        "      train_batch = jax.device_put(train_batch, path.device)\n",
        "      if path.exe_thread is not None:\n",
        "        path.exe_thread.join()\n",
        "      path.exe_thread = Thread(target=execute_train_step, args=(path, train_batch))\n",
        "      path.exe_thread.start()\n",
        "    if t_step == 0:\n",
        "      # Wait for the first step on every path to measure compilation time.\n",
        "      for p in paths:\n",
        "        p.exe_thread.join()\n",
        "      timing[\"end_train_compile\"] = time.time()\n",
        "    # Evaluation on validation set.\n",
        "    if (t_step+1) % task.num_train_batches_between_validations == 0:\n",
        "      for path in paths:\n",
        "        path.accs = []\n",
        "      for e_step, eval_batch in zip(range(task.num_validation_batches), iter_ds_validation):\n",
        "        if e_step == 0:\n",
        "          start_eval_round = time.time()\n",
        "          if \"start_eval\" not in timing:\n",
        "            timing[\"start_eval\"] = start_eval_round\n",
        "        for p_id, path in enumerate(paths):\n",
        "          eval_batch = jax.device_put(eval_batch, path.device)\n",
        "          path.exe_thread.join()\n",
        "          path.exe_thread = Thread(target=execute_eval_step, args=(path, eval_batch))\n",
        "          path.exe_thread.start()\n",
        "        if e_step == 0 and \"end_eval_compile\" not in timing:\n",
        "          for p in paths:\n",
        "            p.exe_thread.join()\n",
        "          timing[\"end_eval_compile\"] = time.time()\n",
        "\n",
        "      # Get params of best models.\n",
        "      qs = []\n",
        "      eval_idx = (t_step+1) // task.num_train_batches_between_validations\n",
        "      for path in paths:\n",
        "        path.exe_thread.join()\n",
        "        quality = np.mean(path.accs)\n",
        "        del path.accs\n",
        "        qs.append(f\"{quality:.4f}\")\n",
        "        path.evals.append(quality)\n",
        "        # Set quality in metrics for current score computation.\n",
        "        path.metrics[\"quality\"] = quality\n",
        "        path_score = path.score()\n",
        "        if path_score \u003e path.best_score:\n",
        "          path.best_params_local = jax.device_get(path.params_device)\n",
        "          path.best_score = path_score\n",
        "          path.best_quality = quality\n",
        "          qs[-1] += \"*\"\n",
        "      time_train = time.time() - PREV_LOOP_END\n",
        "      avg_path_time = (time_train / eval_idx) / len(paths)\n",
        "      print((\"\\t\".join(qs) + f\"\\t\u003c Eval {eval_idx}\").expandtabs(8),\n",
        "            f\"tot:{time_train:.1f}s\", f\"avg/path:{avg_path_time:.1f}s\")\n",
        "      timing[\"time_eval\"] = timing.get(\"time_eval\", 0) + (time.time() - start_eval_round)\n",
        "      del eval_batch\n",
        "  del train_batch\n",
        "  for path in paths:\n",
        "    del path.params_device\n",
        "    del path.fixed_params_device\n",
        "    del path.opt_state_device\n",
        "    del path.optimizer\n",
        "    del path.exe_thread\n",
        "  gc.collect()\n",
        "\n",
        "  timing[\"end_train\"] = time.time()\n",
        "\n",
        "  time_init = timing[\"start_train\"] - PREV_LOOP_END\n",
        "  time_train_compile = timing[\"end_train_compile\"] - timing[\"start_train\"]\n",
        "  time_eval_compile = timing[\"end_eval_compile\"] - timing[\"start_eval\"]\n",
        "  time_eval = timing[\"time_eval\"] - time_eval_compile\n",
        "  time_train = timing[\"end_train\"] - timing[\"end_train_compile\"] - time_eval - time_eval_compile\n",
        "  PREV_LOOP_END = timing[\"end_train\"]\n",
        "\n",
        "  for path in paths:\n",
        "    path.metrics[\"time_init\"] = time_init\n",
        "    path.metrics[\"time_train_compile\"] = time_train_compile\n",
        "    path.metrics[\"time_eval_compile\"] = time_eval_compile\n",
        "    path.metrics[\"time_train\"] = time_train\n",
        "    path.metrics[\"time_eval\"] = time_eval\n",
        "    path.metrics[\"timestamp_end\"] = PREV_LOOP_END\n",
        "    path.metrics[\"num_params\"] = get_num_params(path.get_all_params())\n",
        "    path.metrics[\"num_trainable_params\"] = get_num_params(path.get_trainable_params())\n",
        "    path.metrics[\"quality\"] = max(path.evals)\n",
        "    path.metrics[\"evals\"] = json.dumps([float(v) for v in path.evals])\n",
        "\n",
        "    if path.best_params_local is not None:\n",
        "      path.metrics[\"improved\"] = True\n",
        "      path.update_trainable(path.best_params_local)\n",
        "      assert path.best_quality == path.metrics[\"quality\"]\n",
        "      assert path.best_score == path.score()\n",
        "    else:\n",
        "      path.metrics[\"improved\"] = False\n",
        "      # The sampled path will be dropped if not improved, so skip the parameter update.\n",
        "      assert path.best_quality is None\n",
        "\n",
        "    del path.best_params_local\n",
        "    del path.best_score\n",
        "    del path.best_quality\n",
        "    del path.evals\n",
        "\n",
        "  pqs = []\n",
        "  qs = []\n",
        "  psc = []\n",
        "  sc = []\n",
        "  for path in paths:\n",
        "    if path.task_name == path.parent.task_name:\n",
        "      metric_suffix = \"\" if path.agent_id == path.parent.agent_id else \"A\"\n",
        "      pqs.append(f\"{path.parent.metrics['quality']:.4f}{metric_suffix}\")\n",
        "      psc.append(f\"{path.parent.score():.4f}{metric_suffix}\")\n",
        "    else:\n",
        "      pqs.append(\"NEW\")\n",
        "      psc.append(\"NEW\")\n",
        "    qs.append(f\"{path.metrics['quality']:.4f}\")\n",
        "    sc.append(f\"{path.score():.4f}\")\n",
        "    if path.metrics[\"improved\"]:\n",
        "      sc[-1] += \"+\"\n",
        "\n",
        "  print((\"\\t\".join([f\"{path.parent.id}\" for path in paths]) + \"\\t\u003c Parent id\").expandtabs(8))\n",
        "  print((\"\\t\".join([f\"{path.id}\" for path in paths]) + \"\\t\u003c Path id\").expandtabs(8))\n",
        "  print((\"\\t\".join(pqs) + \"\\t\u003c Parent best quality\").expandtabs(8))\n",
        "  print((\"\\t\".join(qs) + \"\\t\u003c Path best quality\").expandtabs(8))\n",
        "  print((\"\\t\".join(psc) + \"\\t\u003c Parent score\").expandtabs(8))\n",
        "  print((\"\\t\".join(sc) + \"\\t\u003c Path score\").expandtabs(8))\n",
        "  print(\"time\\tINIT\\tCOMPtrn\\tCOMPevl\\tTRN\\tEVAL\".expandtabs(8))\n",
        "  print(f\"(s)\\t{time_init:.1f}\\t{time_train_compile:.1f}\\t{time_eval_compile:.1f}\\t{time_train:.1f}\\t{time_eval:.1f}\".expandtabs(8))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "JVT8nwIWMAVf"
      },
      "outputs": [],
      "source": [
        "def has_test_quality(path):\n",
        "  return (\"test_quality\" in path.metrics and not math.isnan(path.metrics[\"test_quality\"]))\n",
        "\n",
        "# Run final eval on test set.\n",
        "def run_test_eval(path, test_immutability=False):\n",
        "  # Skip if test_quality is already computed and no immutability test is required.\n",
        "  if not test_immutability and has_test_quality(path):\n",
        "    return\n",
        "  eval_st = time.time()\n",
        "  ds_test = path.task.get_ds(\"test\", path.hparams)\n",
        "  params = path.get_all_params()\n",
        "  # Running on the same device allows reusing the function compiled for validation.\n",
        "  if not hasattr(path, \"device\"):\n",
        "    path.device = random.choice(jax.local_devices())\n",
        "  params_device = jax.device_put(path.model_factory.get_comps2model_fn()(params), path.device)\n",
        "  acc_sum = []\n",
        "  tot_num_samples = 0\n",
        "  # Warning: if repeat() is called on this dataset, then this loop never ends.\n",
        "  for batch in ds_test:\n",
        "    acc_avg = eval_step(params_device, batch[\"image\"], batch[\"label\"], path.model)\n",
        "    batch_size = batch[\"label\"].shape[0]\n",
        "    # Recompute the per-batch sum because the last batch can have a different\n",
        "    # size, which keeps the eval on the test set exact.\n",
        "    acc_sum.append(acc_avg * batch_size)\n",
        "    tot_num_samples += batch_size\n",
        "  del params_device\n",
        "  acc_avg = np.sum(acc_sum) / tot_num_samples\n",
        "  # Assert test quality equivalence to test immutability.\n",
        "  if has_test_quality(path):\n",
        "    print(f\"Testing immutability of path {path.id} : {path.metrics['test_quality']} ~= {acc_avg}\")\n",
        "    assert test_immutability\n",
        "    if not np.isclose(path.metrics[\"test_quality\"], acc_avg, rtol=IMMUTABILITY_RELATIVE_TOLLERANCE):\n",
        "      print(\"WARNING IMMUTABILITY TEST FAILED, delta:\", acc_avg-path.metrics[\"test_quality\"])\n",
        "    assert np.isclose(path.metrics[\"test_quality\"], acc_avg), \\\n",
        "        f\"{path.task_name} {path.metrics['test_quality']} {acc_avg}\"\n",
        "  path.metrics[\"test_quality\"] = acc_avg\n",
        "  print(f\"TEST QUALITY: {acc_avg}\\nTEST TIME: {time.time()-eval_st:.2f}s\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ZubWF7rhmHju"
      },
      "source": [
        "# Main"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "wzO_0X5-S0yj"
      },
      "outputs": [],
      "source": [
        "agent = get_agent_class(AGENT)(system_state_dir=SYSTEM_STATE_DIR, task_name=TASK_NAME, num_cycles_max=NUM_CYCLES_MAX)\n",
        "agent.run()"
      ]
    }
  ],
  "metadata": {
    "accelerator": "TPU",
    "colab": {
      "name": "mu4Net: Multipath Multiagent Multitask Mutant Net",
      "private_outputs": true,
      "provenance": [],
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}