19.3. Asynchronous Random Search
As we have seen in the previous Section 19.2, we may have to wait hours or even days before random search returns a good hyperparameter configuration, because of the expensive evaluation of hyperparameter configurations. In practice, we often have access to a pool of resources, such as multiple GPUs on the same machine or multiple machines each with a single GPU. This begs the question: how do we efficiently distribute random search?
In general, we distinguish between synchronous and asynchronous parallel hyperparameter optimization (see Fig. 19.3.1). In the synchronous setting, we wait for all concurrently running trials to finish before we start the next batch. Consider configuration spaces that contain hyperparameters such as the number of filters or the number of layers of a deep neural network. Hyperparameter configurations that contain a larger number of layers or filters will naturally take more time to finish, and all other trials in the same batch will have to wait at the synchronization points (grey area in Fig. 19.3.1) before we can continue the optimization process.
In the asynchronous setting, we immediately schedule a new trial as soon as resources become available. This will optimally exploit our resources, since we avoid any synchronization overhead. For random search, each new hyperparameter configuration is chosen independently of all others, and in particular without exploiting observations from any prior evaluation. This means we can trivially parallelize random search asynchronously. This is not straightforward for more sophisticated methods that make decisions based on previous observations (see Section 19.5). While we need more resources than in the sequential setting, asynchronous random search exhibits a linear speed-up, in that a certain performance is reached \(K\) times faster if \(K\) trials can be run in parallel; the small simulation below illustrates this effect.
Fig. 19.3.1 Distributing the hyperparameter optimization process either synchronously or asynchronously. Compared to the sequential setting, we can reduce the overall wall-clock time while keeping the total compute constant. Synchronous scheduling might lead to idle workers in the case of stragglers.
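To make the scheduling difference concrete, here is a minimal sketch (plain Python, independent of Syne Tune; the per-trial runtimes are made up for illustration) that simulates the total wall-clock time of synchronous batches versus asynchronous scheduling with the same number of workers:

import heapq

# Hypothetical per-trial runtimes in seconds, with occasional stragglers
durations = [60, 62, 61, 95, 58, 63, 90, 59]
n_workers = 2

# Synchronous: trials run in fixed batches of size n_workers; each batch
# only ends when its slowest member (the straggler) finishes
sync_time = sum(
    max(durations[i:i + n_workers])
    for i in range(0, len(durations), n_workers)
)

# Asynchronous: a worker picks up the next trial as soon as it is free.
# We track the finish time of each worker in a min-heap.
finish_times = [0.0] * n_workers
heapq.heapify(finish_times)
for d in durations:
    earliest_free = heapq.heappop(finish_times)
    heapq.heappush(finish_times, earliest_free + d)
async_time = max(finish_times)

print(f"synchronous:  {sync_time:.0f}s")   # 310s: workers idle at sync points
print(f"asynchronous: {async_time:.0f}s")  # 279s: idle time minimized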
In this notebook, we will look at asynchronous random search, where trials are executed in multiple Python processes on the same machine. Distributed job scheduling and execution is difficult to implement from scratch. We will use Syne Tune (Salinas et al., 2022), which provides us with a simple interface for asynchronous HPO. Syne Tune is designed to be run with different execution backends, and the interested reader is invited to study its simple APIs in order to learn more about distributed HPO.
import logging
from d2l import torch as d2l
logging.basicConfig(level=logging.INFO)
from syne_tune import StoppingCriterion, Tuner
from syne_tune.backend.python_backend import PythonBackend
from syne_tune.config_space import loguniform, randint
from syne_tune.experiments import load_experiment
from syne_tune.optimizer.baselines import RandomSearch
INFO:root:SageMakerBackend is not imported since dependencies are missing. You can install them with
pip install 'syne-tune[extra]'
AWS dependencies are not imported since dependencies are missing. You can install them with
pip install 'syne-tune[aws]'
or (for everything)
pip install 'syne-tune[extra]'
INFO:root:Ray Tune schedulers and searchers are not imported since dependencies are missing. You can install them with
pip install 'syne-tune[raytune]'
or (for everything)
pip install 'syne-tune[extra]'
19.3.1. Objective Function
First, we have to define a new objective function such that it now returns the performance back to Syne Tune via the `report` callback.
def hpo_objective_lenet_synetune(learning_rate, batch_size, max_epochs):
    # PythonBackend runs this function in a separate process, so all
    # dependencies must be imported inside the function body
    from syne_tune import Reporter
    from d2l import torch as d2l

    model = d2l.LeNet(lr=learning_rate, num_classes=10)
    trainer = d2l.HPOTrainer(max_epochs=1, num_gpus=1)
    data = d2l.FashionMNIST(batch_size=batch_size)
    model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
    report = Reporter()
    for epoch in range(1, max_epochs + 1):
        if epoch == 1:
            # Initialize the state of Trainer
            trainer.fit(model=model, data=data)
        else:
            trainer.fit_epoch()
        validation_error = trainer.validation_error().cpu().detach().numpy()
        # Report the validation error back to Syne Tune after each epoch
        report(epoch=epoch, validation_error=float(validation_error))
Note that the `PythonBackend` of Syne Tune requires dependencies to be imported inside the function definition.
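As a minimal illustration of this convention (a sketch with a made-up one-dimensional objective, not part of this section's experiment), everything the function needs is imported in its body, so that the backend can serialize the function and execute it in a fresh subprocess:

def toy_objective(x):
    # Imports live inside the function so that PythonBackend can ship it
    # to a separate Python process
    import numpy as np
    from syne_tune import Reporter

    report = Reporter()
    # The keyword used here ("objective") is the name that the scheduler's
    # `metric` argument must refer to
    report(objective=float(np.abs(x - 0.5)))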
19.3.2. Asynchronous Scheduler
First, we define the number of workers that evaluate trials concurrently. We also need to specify how long we want to run random search, by defining an upper limit on the total wall-clock time.
n_workers = 2 # Needs to be <= the number of available GPUs
max_wallclock_time = 12 * 60 # 12 minutes
Next, we state which metric we want to optimize and whether we want to minimize or maximize this metric. Namely, `metric` needs to correspond to the argument name passed to the `report` callback.
mode = "min"
metric = "validation_error"
We use the configuration space from our previous example. In Syne Tune, this dictionary can also be used to pass constant attributes to the training script. We make use of this feature in order to pass `max_epochs`. Moreover, we specify the first configuration to be evaluated in `initial_config`.
config_space = {
    "learning_rate": loguniform(1e-2, 1),
    "batch_size": randint(32, 256),
    "max_epochs": 10,
}

initial_config = {
    "learning_rate": 0.1,
    "batch_size": 128,
}
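As a quick sanity check (a sketch; it assumes that Syne Tune domains expose a Ray-Tune-style `sample()` method, which holds for recent versions), we can draw random values from the search space before launching the tuning job:

# Draw a few random configurations to eyeball the search space
for _ in range(3):
    print({
        "learning_rate": config_space["learning_rate"].sample(),  # log-uniform in [1e-2, 1]
        "batch_size": config_space["batch_size"].sample(),        # integer in [32, 256]
    })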
Next, we need to specify the backend for job executions. Here we just consider the distribution on a local machine where parallel jobs are executed as sub-processes. However, for large-scale HPO, we could run this also on a cluster or cloud environment, where each trial consumes a full instance.
trial_backend = PythonBackend(
    tune_function=hpo_objective_lenet_synetune,
    config_space=config_space,
)
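As an aside (a hedged sketch, not needed for this experiment): if the training logic lived in a standalone script rather than a Python function, Syne Tune's `LocalBackend` could be used in place of `PythonBackend`. The script name below is a hypothetical placeholder; such a script would parse the hyperparameters from the command line and call `Reporter` itself.

from syne_tune.backend import LocalBackend

# trial_backend = LocalBackend(entry_point="train_script.py")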
We can now create the scheduler for asynchronous random search, which behaves similarly to our `BasicScheduler` from Section 19.2.
scheduler = RandomSearch(
    config_space,
    metric=metric,
    mode=mode,
    points_to_evaluate=[initial_config],
)
INFO:syne_tune.optimizer.schedulers.fifo:max_resource_level = 10, as inferred from config_space
INFO:syne_tune.optimizer.schedulers.fifo:Master random_seed = 2737092907
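For reproducible experiments, the sampling of configurations can also be seeded (a sketch; the `random_seed` keyword is an assumption based on the `Master random_seed` log line above and the Syne Tune baselines API):

# scheduler = RandomSearch(
#     config_space,
#     metric=metric,
#     mode=mode,
#     points_to_evaluate=[initial_config],
#     random_seed=42,  # assumed keyword; fixes the configuration sampling
# )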
Syne Tune also features a `Tuner`, where the main experiment loop and bookkeeping is centralized, and interactions between scheduler and backend are mediated.
stop_criterion = StoppingCriterion(max_wallclock_time=max_wallclock_time)

tuner = Tuner(
    trial_backend=trial_backend,
    scheduler=scheduler,
    stop_criterion=stop_criterion,
    n_workers=n_workers,
    print_update_interval=int(max_wallclock_time * 0.6),
)
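Budgets other than wall-clock time are possible as well (a hedged sketch; `max_num_trials_completed` is assumed from the `StoppingCriterion` signature of recent Syne Tune versions):

# Stop after a fixed number of completed trials instead of a time budget:
# stop_criterion = StoppingCriterion(max_num_trials_completed=20)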
Let us now run our distributed HPO experiment. According to our stopping criterion, it will run for about 12 minutes.
tuner.run()
INFO:syne_tune.tuner:results of trials will be saved on /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958
INFO:root:Detected 4 GPUs
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.1 --batch_size 128 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/0/checkpoints
INFO:syne_tune.tuner:(trial 0) - scheduled config {'learning_rate': 0.1, 'batch_size': 128, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.1702844732454753 --batch_size 114 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/1/checkpoints
INFO:syne_tune.tuner:(trial 1) - scheduled config {'learning_rate': 0.1702844732454753, 'batch_size': 114, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 0 completed.
INFO:syne_tune.tuner:Trial trial_id 1 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.34019846567238493 --batch_size 221 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/2/checkpoints
INFO:syne_tune.tuner:(trial 2) - scheduled config {'learning_rate': 0.34019846567238493, 'batch_size': 221, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.014628124155727769 --batch_size 88 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/3/checkpoints
INFO:syne_tune.tuner:(trial 3) - scheduled config {'learning_rate': 0.014628124155727769, 'batch_size': 88, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 2 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.1114831485450576 --batch_size 142 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/4/checkpoints
INFO:syne_tune.tuner:(trial 4) - scheduled config {'learning_rate': 0.1114831485450576, 'batch_size': 142, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 3 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.014076038679980779 --batch_size 223 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/5/checkpoints
INFO:syne_tune.tuner:(trial 5) - scheduled config {'learning_rate': 0.014076038679980779, 'batch_size': 223, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 4 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.02558173674804846 --batch_size 62 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/6/checkpoints
INFO:syne_tune.tuner:(trial 6) - scheduled config {'learning_rate': 0.02558173674804846, 'batch_size': 62, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 5 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.026035979388614055 --batch_size 139 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/7/checkpoints
INFO:syne_tune.tuner:(trial 7) - scheduled config {'learning_rate': 0.026035979388614055, 'batch_size': 139, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 6 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.24202494130424274 --batch_size 231 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/8/checkpoints
INFO:syne_tune.tuner:(trial 8) - scheduled config {'learning_rate': 0.24202494130424274, 'batch_size': 231, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 7 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.10483132064775551 --batch_size 145 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/9/checkpoints
INFO:syne_tune.tuner:(trial 9) - scheduled config {'learning_rate': 0.10483132064775551, 'batch_size': 145, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 8 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.017898854850751864 --batch_size 51 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/10/checkpoints
INFO:syne_tune.tuner:(trial 10) - scheduled config {'learning_rate': 0.017898854850751864, 'batch_size': 51, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 9 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.9645419978270817 --batch_size 200 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/11/checkpoints
INFO:syne_tune.tuner:(trial 11) - scheduled config {'learning_rate': 0.9645419978270817, 'batch_size': 200, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 11 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.10559888854748693 --batch_size 40 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/12/checkpoints
INFO:syne_tune.tuner:(trial 12) - scheduled config {'learning_rate': 0.10559888854748693, 'batch_size': 40, 'max_epochs': 10}
INFO:syne_tune.tuner:tuning status (last metric is reported)
trial_id status iter learning_rate batch_size max_epochs epoch validation_error worker-time
0 Completed 10 0.100000 128 10 10.0 0.277195 64.928907
1 Completed 10 0.170284 114 10 10.0 0.286225 65.434195
2 Completed 10 0.340198 221 10 10.0 0.218990 59.729758
3 Completed 10 0.014628 88 10 10.0 0.899920 81.001636
4 Completed 10 0.111483 142 10 10.0 0.268684 64.427400
5 Completed 10 0.014076 223 10 10.0 0.899922 61.264475
6 Completed 10 0.025582 62 10 10.0 0.399520 75.966186
7 Completed 10 0.026036 139 10 10.0 0.899988 62.261541
8 Completed 10 0.242025 231 10 10.0 0.257636 58.186485
9 Completed 10 0.104831 145 10 10.0 0.273898 59.771699
10 InProgress 8 0.017899 51 10 8.0 0.496118 66.999746
11 Completed 10 0.964542 200 10 10.0 0.181600 59.159662
12 InProgress 0 0.105599 40 10 - - -
2 trials running, 11 finished (11 until the end), 436.60s wallclock-time
INFO:syne_tune.tuner:Trial trial_id 10 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.5846051207380589 --batch_size 35 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/13/checkpoints
INFO:syne_tune.tuner:(trial 13) - scheduled config {'learning_rate': 0.5846051207380589, 'batch_size': 35, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 12 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.2468891379769198 --batch_size 146 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/14/checkpoints
INFO:syne_tune.tuner:(trial 14) - scheduled config {'learning_rate': 0.2468891379769198, 'batch_size': 146, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 13 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.12956867470224812 --batch_size 218 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/15/checkpoints
INFO:syne_tune.tuner:(trial 15) - scheduled config {'learning_rate': 0.12956867470224812, 'batch_size': 218, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 14 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.24900745354561854 --batch_size 103 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/16/checkpoints
INFO:syne_tune.tuner:(trial 16) - scheduled config {'learning_rate': 0.24900745354561854, 'batch_size': 103, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 15 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.03903577426988046 --batch_size 80 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/17/checkpoints
INFO:syne_tune.tuner:(trial 17) - scheduled config {'learning_rate': 0.03903577426988046, 'batch_size': 80, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 16 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.01846559300690354 --batch_size 183 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/18/checkpoints
INFO:syne_tune.tuner:(trial 18) - scheduled config {'learning_rate': 0.01846559300690354, 'batch_size': 183, 'max_epochs': 10}
INFO:syne_tune.stopping_criterion:reaching max wallclock time (720), stopping there.
INFO:syne_tune.tuner:Stopping trials that may still be running.
INFO:syne_tune.tuner:Tuning finished, results of trials can be found on /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958
--------------------
Resource summary (last result is reported):
trial_id status iter learning_rate batch_size max_epochs epoch validation_error worker-time
0 Completed 10 0.100000 128 10 10 0.277195 64.928907
1 Completed 10 0.170284 114 10 10 0.286225 65.434195
2 Completed 10 0.340198 221 10 10 0.218990 59.729758
3 Completed 10 0.014628 88 10 10 0.899920 81.001636
4 Completed 10 0.111483 142 10 10 0.268684 64.427400
5 Completed 10 0.014076 223 10 10 0.899922 61.264475
6 Completed 10 0.025582 62 10 10 0.399520 75.966186
7 Completed 10 0.026036 139 10 10 0.899988 62.261541
8 Completed 10 0.242025 231 10 10 0.257636 58.186485
9 Completed 10 0.104831 145 10 10 0.273898 59.771699
10 Completed 10 0.017899 51 10 10 0.405545 83.778503
11 Completed 10 0.964542 200 10 10 0.181600 59.159662
12 Completed 10 0.105599 40 10 10 0.182500 94.734384
13 Completed 10 0.584605 35 10 10 0.153846 110.965637
14 Completed 10 0.246889 146 10 10 0.215050 65.142847
15 Completed 10 0.129569 218 10 10 0.313873 61.310455
16 Completed 10 0.249007 103 10 10 0.196101 72.519127
17 InProgress 9 0.039036 80 10 9 0.369000 73.403000
18 InProgress 5 0.018466 183 10 5 0.900263 34.714568
2 trials running, 17 finished (17 until the end), 722.84s wallclock-time
validation_error: best 0.14451533555984497 for trial-id 13
--------------------
The logs of all evaluated hyperparameter configurations are stored for further analysis. At any time during the tuning job, we can easily get the results obtained so far and plot the incumbent trajectory.
d2l.set_figsize()
tuning_experiment = load_experiment(tuner.name)
tuning_experiment.plot()
WARNING:matplotlib.legend:No artists with labels found to put in legend. Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
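The experiment object also gives programmatic access to the results (a sketch assuming the `ExperimentResult` API of recent Syne Tune versions, which exposes `best_config()`):

# Best configuration found so far, according to `metric` and `mode`
print(tuning_experiment.best_config())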
19.3.3. Visualize the Asynchronous Optimization Process
Below we visualize how the learning curves of every trial (each color in the plot represents a trial) evolve during the asynchronous optimization process. At any point in time, there are as many trials running concurrently as we have workers. Once a trial finishes, we immediately start the next trial, without waiting for the other trials to finish. Idle time of workers is reduced to a minimum with asynchronous scheduling.
d2l.set_figsize([6, 2.5])
results = tuning_experiment.results

for trial_id in results.trial_id.unique():
    df = results[results["trial_id"] == trial_id]
    d2l.plt.plot(
        df["st_tuner_time"],
        df["validation_error"],
        marker="o"
    )

d2l.plt.xlabel("wall-clock time")
d2l.plt.ylabel("objective function")
Text(0, 0.5, 'objective function')
19.3.4. Summary
We can reduce the waiting time for random search substantially by distributing trials across parallel resources. In general, we distinguish between synchronous scheduling and asynchronous scheduling. Synchronous scheduling means that we sample a new batch of hyperparameter configurations only once the previous batch has finished. If we have stragglers, trials that take longer to finish than others, our workers need to wait at synchronization points. Asynchronous scheduling evaluates a new hyperparameter configuration as soon as resources become available, and hence ensures that all workers are busy at any point in time. While random search is easy to distribute asynchronously and does not require any change to the actual algorithm, other methods require some additional modifications.
19.3.5. Exercises
1. Consider the `DropoutMLP` model implemented in Section 5.6 and used in Exercise 1 of Section 19.2. Implement an objective function `hpo_objective_dropoutmlp_synetune` to be used with Syne Tune. Make sure that your function reports the validation error after every epoch.
2. Using the setup of Exercise 1 in Section 19.2, compare random search to Bayesian optimization. If you use SageMaker, feel free to use Syne Tune's benchmarking facilities in order to run experiments in parallel. Hint: Bayesian optimization is provided as `syne_tune.optimizer.baselines.BayesianOptimization`.
3. For this exercise, you need to run on an instance with at least 4 CPU cores. For one of the methods used above (random search, Bayesian optimization), run experiments with `n_workers=1`, `n_workers=2`, `n_workers=4`, and compare results (incumbent trajectories). At least for random search, you should observe linear scaling with respect to the number of workers. Hint: for robust results, you may have to average over several repetitions each.
4. Advanced. The goal of this exercise is to implement a new scheduler in Syne Tune.