19.3. Asynchronous Random Search
As we have seen in the previous Section 19.2, we may have to wait hours or even days before random search returns a good hyperparameter configuration, because of the expensive evaluation of hyperparameter configurations. In practice, we often have access to a pool of resources, such as multiple GPUs on the same machine or multiple machines each with a single GPU. This begs the question: how do we efficiently distribute random search?
In general, we distinguish between synchronous and asynchronous parallel hyperparameter optimization (see Fig. 19.3.1). In the synchronous setting, we wait for all concurrently running trials to finish before we start the next batch. Consider configuration spaces that contain hyperparameters such as the number of filters or the number of layers of a deep neural network. Hyperparameter configurations that contain a larger number of layers or filters will naturally take more time to finish, and all other trials in the same batch will have to wait at the synchronization points (grey area in Fig. 19.3.1) before we can continue the optimization process.
In the asynchronous setting, we immediately schedule a new trial as soon as resources become available. This will optimally exploit our resources, since we avoid any synchronization overhead. For random search, each new hyperparameter configuration is chosen independently of all others, and in particular without exploiting observations from any prior evaluation. This means we can trivially parallelize random search asynchronously. This is not straightforward for more sophisticated methods that make decisions based on previous observations (see Section 19.5). While we need more resources than in the sequential setting, asynchronous random search exhibits a linear speed-up, in that a certain performance is reached \(K\) times faster if \(K\) trials can be run in parallel; the small simulation below illustrates this effect.
Fig. 19.3.1 Distributing the hyperparameter optimization process either synchronously or asynchronously. Compared to the sequential setting, we can reduce the overall wall-clock time while keeping the total compute constant. Synchronous scheduling might lead to idle workers in the case of stragglers.
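To make the scheduling difference concrete, here is a minimal sketch (plain Python, independent of Syne Tune; the per-trial runtimes are made up for illustration) that simulates the total wall-clock time of synchronous batches versus asynchronous scheduling with the same number of workers:

import heapq

# Hypothetical per-trial runtimes in seconds, with occasional stragglers
durations = [60, 62, 61, 95, 58, 63, 90, 59]
n_workers = 2

# Synchronous: trials run in fixed batches of size n_workers; each batch
# only ends when its slowest member (the straggler) finishes
sync_time = sum(
    max(durations[i:i + n_workers])
    for i in range(0, len(durations), n_workers)
)

# Asynchronous: a worker picks up the next trial as soon as it is free.
# We track the finish time of each worker in a min-heap.
finish_times = [0.0] * n_workers
heapq.heapify(finish_times)
for d in durations:
    earliest_free = heapq.heappop(finish_times)
    heapq.heappush(finish_times, earliest_free + d)
async_time = max(finish_times)

print(f"synchronous:  {sync_time:.0f}s")   # 310s: workers idle at sync points
print(f"asynchronous: {async_time:.0f}s")  # 279s: idle time minimized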
In this notebook, we will look at asynchronous random search, where trials are executed in multiple Python processes on the same machine. Distributed job scheduling and execution is difficult to implement from scratch. We will use Syne Tune (Salinas et al., 2022), which provides us with a simple interface for asynchronous HPO. Syne Tune is designed to be run with different execution backends, and the interested reader is invited to study its simple APIs in order to learn more about distributed HPO.
import logging
from d2l import torch as d2l
logging.basicConfig(level=logging.INFO)
from syne_tune import StoppingCriterion, Tuner
from syne_tune.backend.python_backend import PythonBackend
from syne_tune.config_space import loguniform, randint
from syne_tune.experiments import load_experiment
from syne_tune.optimizer.baselines import RandomSearch
INFO:root:SageMakerBackend is not imported since dependencies are missing. You can install them with
pip install 'syne-tune[extra]'
AWS dependencies are not imported since dependencies are missing. You can install them with
pip install 'syne-tune[aws]'
or (for everything)
pip install 'syne-tune[extra]'
INFO:root:Ray Tune schedulers and searchers are not imported since dependencies are missing. You can install them with
pip install 'syne-tune[raytune]'
or (for everything)
pip install 'syne-tune[extra]'
19.3.1. Objective Function
First, we have to define a new objective function such that it now returns the performance back to Syne Tune via the `report` callback.
def hpo_objective_lenet_synetune(learning_rate, batch_size, max_epochs):
    # PythonBackend runs this function in a separate process, so all
    # dependencies must be imported inside the function body
    from syne_tune import Reporter
    from d2l import torch as d2l

    model = d2l.LeNet(lr=learning_rate, num_classes=10)
    trainer = d2l.HPOTrainer(max_epochs=1, num_gpus=1)
    data = d2l.FashionMNIST(batch_size=batch_size)
    model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
    report = Reporter()
    for epoch in range(1, max_epochs + 1):
        if epoch == 1:
            # Initialize the state of Trainer
            trainer.fit(model=model, data=data)
        else:
            trainer.fit_epoch()
        validation_error = trainer.validation_error().cpu().detach().numpy()
        # Report the validation error back to Syne Tune after each epoch
        report(epoch=epoch, validation_error=float(validation_error))
Note that the `PythonBackend` of Syne Tune requires dependencies to be imported inside the function definition.
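As a minimal illustration of this convention (a sketch with a made-up one-dimensional objective, not part of this section's experiment), everything the function needs is imported in its body, so that the backend can serialize the function and execute it in a fresh subprocess:

def toy_objective(x):
    # Imports live inside the function so that PythonBackend can ship it
    # to a separate Python process
    import numpy as np
    from syne_tune import Reporter

    report = Reporter()
    # The keyword used here ("objective") is the name that the scheduler's
    # `metric` argument must refer to
    report(objective=float(np.abs(x - 0.5)))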
19.3.2. Asynchronous Scheduler
First, we define the number of workers that evaluate trials concurrently. We also need to specify how long we want to run random search, by defining an upper limit on the total wall-clock time.
n_workers = 2 # Needs to be <= the number of available GPUs
max_wallclock_time = 12 * 60 # 12 minutes
Next, we state which metric we want to optimize and whether we want to minimize or maximize this metric. Namely, `metric` needs to correspond to the argument name passed to the `report` callback.
mode = "min"
metric = "validation_error"
We use the configuration space from our previous example. In Syne Tune, this dictionary can also be used to pass constant attributes to the training script. We make use of this feature in order to pass `max_epochs`. Moreover, we specify the first configuration to be evaluated in `initial_config`.
config_space = {
    "learning_rate": loguniform(1e-2, 1),
    "batch_size": randint(32, 256),
    "max_epochs": 10,
}

initial_config = {
    "learning_rate": 0.1,
    "batch_size": 128,
}
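As a quick sanity check (a sketch; it assumes that Syne Tune domains expose a Ray-Tune-style `sample()` method, which holds for recent versions), we can draw random values from the search space before launching the tuning job:

# Draw a few random configurations to eyeball the search space
for _ in range(3):
    print({
        "learning_rate": config_space["learning_rate"].sample(),  # log-uniform in [1e-2, 1]
        "batch_size": config_space["batch_size"].sample(),        # integer in [32, 256]
    })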
Next, we need to specify the backend for job executions. Here we just consider the distribution on a local machine where parallel jobs are executed as sub-processes. However, for large-scale HPO, we could run this also on a cluster or cloud environment, where each trial consumes a full instance.
trial_backend = PythonBackend(
    tune_function=hpo_objective_lenet_synetune,
    config_space=config_space,
)
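As an aside (a hedged sketch, not needed for this experiment): if the training logic lived in a standalone script rather than a Python function, Syne Tune's `LocalBackend` could be used in place of `PythonBackend`. The script name below is a hypothetical placeholder; such a script would parse the hyperparameters from the command line and call `Reporter` itself.

from syne_tune.backend import LocalBackend

# trial_backend = LocalBackend(entry_point="train_script.py")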
We can now create the scheduler for asynchronous random search, which behaves similarly to our `BasicScheduler` from Section 19.2.
scheduler = RandomSearch(
    config_space,
    metric=metric,
    mode=mode,
    points_to_evaluate=[initial_config],
)
INFO:syne_tune.optimizer.schedulers.fifo:max_resource_level = 10, as inferred from config_space
INFO:syne_tune.optimizer.schedulers.fifo:Master random_seed = 2737092907
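For reproducible experiments, the sampling of configurations can also be seeded (a sketch; the `random_seed` keyword is an assumption based on the `Master random_seed` log line above and the Syne Tune baselines API):

# scheduler = RandomSearch(
#     config_space,
#     metric=metric,
#     mode=mode,
#     points_to_evaluate=[initial_config],
#     random_seed=42,  # assumed keyword; fixes the configuration sampling
# )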
Syne Tune also features a `Tuner`, where the main experiment loop and bookkeeping is centralized, and interactions between scheduler and backend are mediated.
stop_criterion = StoppingCriterion(max_wallclock_time=max_wallclock_time)

tuner = Tuner(
    trial_backend=trial_backend,
    scheduler=scheduler,
    stop_criterion=stop_criterion,
    n_workers=n_workers,
    print_update_interval=int(max_wallclock_time * 0.6),
)
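Budgets other than wall-clock time are possible as well (a hedged sketch; `max_num_trials_completed` is assumed from the `StoppingCriterion` signature of recent Syne Tune versions):

# Stop after a fixed number of completed trials instead of a time budget:
# stop_criterion = StoppingCriterion(max_num_trials_completed=20)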
Let us now run our distributed HPO experiment. According to our stopping criterion, it will run for about 12 minutes.
tuner.run()
INFO:syne_tune.tuner:results of trials will be saved on /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958
INFO:root:Detected 4 GPUs
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.1 --batch_size 128 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/0/checkpoints
INFO:syne_tune.tuner:(trial 0) - scheduled config {'learning_rate': 0.1, 'batch_size': 128, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.1702844732454753 --batch_size 114 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/1/checkpoints
INFO:syne_tune.tuner:(trial 1) - scheduled config {'learning_rate': 0.1702844732454753, 'batch_size': 114, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 0 completed.
INFO:syne_tune.tuner:Trial trial_id 1 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.34019846567238493 --batch_size 221 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/2/checkpoints
INFO:syne_tune.tuner:(trial 2) - scheduled config {'learning_rate': 0.34019846567238493, 'batch_size': 221, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.014628124155727769 --batch_size 88 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/3/checkpoints
INFO:syne_tune.tuner:(trial 3) - scheduled config {'learning_rate': 0.014628124155727769, 'batch_size': 88, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 2 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.1114831485450576 --batch_size 142 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/4/checkpoints
INFO:syne_tune.tuner:(trial 4) - scheduled config {'learning_rate': 0.1114831485450576, 'batch_size': 142, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 3 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.014076038679980779 --batch_size 223 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/5/checkpoints
INFO:syne_tune.tuner:(trial 5) - scheduled config {'learning_rate': 0.014076038679980779, 'batch_size': 223, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 4 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.02558173674804846 --batch_size 62 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/6/checkpoints
INFO:syne_tune.tuner:(trial 6) - scheduled config {'learning_rate': 0.02558173674804846, 'batch_size': 62, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 5 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.026035979388614055 --batch_size 139 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/7/checkpoints
INFO:syne_tune.tuner:(trial 7) - scheduled config {'learning_rate': 0.026035979388614055, 'batch_size': 139, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 6 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.24202494130424274 --batch_size 231 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/8/checkpoints
INFO:syne_tune.tuner:(trial 8) - scheduled config {'learning_rate': 0.24202494130424274, 'batch_size': 231, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 7 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.10483132064775551 --batch_size 145 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/9/checkpoints
INFO:syne_tune.tuner:(trial 9) - scheduled config {'learning_rate': 0.10483132064775551, 'batch_size': 145, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 8 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.017898854850751864 --batch_size 51 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/10/checkpoints
INFO:syne_tune.tuner:(trial 10) - scheduled config {'learning_rate': 0.017898854850751864, 'batch_size': 51, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 9 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.9645419978270817 --batch_size 200 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/11/checkpoints
INFO:syne_tune.tuner:(trial 11) - scheduled config {'learning_rate': 0.9645419978270817, 'batch_size': 200, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 11 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.10559888854748693 --batch_size 40 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/12/checkpoints
INFO:syne_tune.tuner:(trial 12) - scheduled config {'learning_rate': 0.10559888854748693, 'batch_size': 40, 'max_epochs': 10}
INFO:syne_tune.tuner:tuning status (last metric is reported)
trial_id status iter learning_rate batch_size max_epochs epoch validation_error worker-time
0 Completed 10 0.100000 128 10 10.0 0.277195 64.928907
1 Completed 10 0.170284 114 10 10.0 0.286225 65.434195
2 Completed 10 0.340198 221 10 10.0 0.218990 59.729758
3 Completed 10 0.014628 88 10 10.0 0.899920 81.001636
4 Completed 10 0.111483 142 10 10.0 0.268684 64.427400
5 Completed 10 0.014076 223 10 10.0 0.899922 61.264475
6 Completed 10 0.025582 62 10 10.0 0.399520 75.966186
7 Completed 10 0.026036 139 10 10.0 0.899988 62.261541
8 Completed 10 0.242025 231 10 10.0 0.257636 58.186485
9 Completed 10 0.104831 145 10 10.0 0.273898 59.771699
10 InProgress 8 0.017899 51 10 8.0 0.496118 66.999746
11 Completed 10 0.964542 200 10 10.0 0.181600 59.159662
12 InProgress 0 0.105599 40 10 - - -
2 trials running, 11 finished (11 until the end), 436.60s wallclock-time
INFO:syne_tune.tuner:Trial trial_id 10 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.5846051207380589 --batch_size 35 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/13/checkpoints
INFO:syne_tune.tuner:(trial 13) - scheduled config {'learning_rate': 0.5846051207380589, 'batch_size': 35, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 12 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.2468891379769198 --batch_size 146 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/14/checkpoints
INFO:syne_tune.tuner:(trial 14) - scheduled config {'learning_rate': 0.2468891379769198, 'batch_size': 146, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 13 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.12956867470224812 --batch_size 218 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/15/checkpoints
INFO:syne_tune.tuner:(trial 15) - scheduled config {'learning_rate': 0.12956867470224812, 'batch_size': 218, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 14 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.24900745354561854 --batch_size 103 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/16/checkpoints
INFO:syne_tune.tuner:(trial 16) - scheduled config {'learning_rate': 0.24900745354561854, 'batch_size': 103, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 15 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.03903577426988046 --batch_size 80 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/17/checkpoints
INFO:syne_tune.tuner:(trial 17) - scheduled config {'learning_rate': 0.03903577426988046, 'batch_size': 80, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 16 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.01846559300690354 --batch_size 183 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/tune_function --tune_function_hash 4d7d5b85e4537ad0c5d0a202623dcec5 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958/18/checkpoints
INFO:syne_tune.tuner:(trial 18) - scheduled config {'learning_rate': 0.01846559300690354, 'batch_size': 183, 'max_epochs': 10}
INFO:syne_tune.stopping_criterion:reaching max wallclock time (720), stopping there.
INFO:syne_tune.tuner:Stopping trials that may still be running.
INFO:syne_tune.tuner:Tuning finished, results of trials can be found on /home/ci/syne-tune/python-entrypoint-2023-08-18-19-45-39-958
--------------------
Resource summary (last result is reported):
trial_id status iter learning_rate batch_size max_epochs epoch validation_error worker-time
0 Completed 10 0.100000 128 10 10 0.277195 64.928907
1 Completed 10 0.170284 114 10 10 0.286225 65.434195
2 Completed 10 0.340198 221 10 10 0.218990 59.729758
3 Completed 10 0.014628 88 10 10 0.899920 81.001636
4 Completed 10 0.111483 142 10 10 0.268684 64.427400
5 Completed 10 0.014076 223 10 10 0.899922 61.264475
6 Completed 10 0.025582 62 10 10 0.399520 75.966186
7 Completed 10 0.026036 139 10 10 0.899988 62.261541
8 Completed 10 0.242025 231 10 10 0.257636 58.186485
9 Completed 10 0.104831 145 10 10 0.273898 59.771699
10 Completed 10 0.017899 51 10 10 0.405545 83.778503
11 Completed 10 0.964542 200 10 10 0.181600 59.159662
12 Completed 10 0.105599 40 10 10 0.182500 94.734384
13 Completed 10 0.584605 35 10 10 0.153846 110.965637
14 Completed 10 0.246889 146 10 10 0.215050 65.142847
15 Completed 10 0.129569 218 10 10 0.313873 61.310455
16 Completed 10 0.249007 103 10 10 0.196101 72.519127
17 InProgress 9 0.039036 80 10 9 0.369000 73.403000
18 InProgress 5 0.018466 183 10 5 0.900263 34.714568
2 trials running, 17 finished (17 until the end), 722.84s wallclock-time
validation_error: best 0.14451533555984497 for trial-id 13
--------------------
The logs of all evaluated hyperparameter configurations are stored for further analysis. At any time during the tuning job, we can easily get the results obtained so far and plot the incumbent trajectory.
d2l.set_figsize()
tuning_experiment = load_experiment(tuner.name)
tuning_experiment.plot()
WARNING:matplotlib.legend:No artists with labels found to put in legend. Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
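The experiment object also gives programmatic access to the results (a sketch assuming the `ExperimentResult` API of recent Syne Tune versions, which exposes `best_config()`):

# Best configuration found so far, according to `metric` and `mode`
print(tuning_experiment.best_config())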
19.3.3. Visualize the Asynchronous Optimization Process
Below we visualize how the learning curves of every trial (each color in the plot represents a trial) evolve during the asynchronous optimization process. At any point in time, there are as many trials running concurrently as we have workers. Once a trial finishes, we immediately start the next trial, without waiting for the other trials to finish. Idle time of workers is reduced to a minimum with asynchronous scheduling.
d2l.set_figsize([6, 2.5])
results = tuning_experiment.results

for trial_id in results.trial_id.unique():
    df = results[results["trial_id"] == trial_id]
    d2l.plt.plot(
        df["st_tuner_time"],
        df["validation_error"],
        marker="o"
    )

d2l.plt.xlabel("wall-clock time")
d2l.plt.ylabel("objective function")
Text(0, 0.5, 'objective function')
19.3.4. Summary
We can reduce the waiting time for random search substantially by distributing trials across parallel resources. In general, we distinguish between synchronous scheduling and asynchronous scheduling. Synchronous scheduling means that we sample a new batch of hyperparameter configurations only once the previous batch has finished. If we have stragglers, trials that take longer to finish than others, our workers need to wait at synchronization points. Asynchronous scheduling evaluates a new hyperparameter configuration as soon as resources become available, and hence ensures that all workers are busy at any point in time. While random search is easy to distribute asynchronously and does not require any change to the actual algorithm, other methods require some additional modifications.
19.3.5. Exercises
1. Consider the `DropoutMLP` model implemented in Section 5.6 and used in Exercise 1 of Section 19.2. Implement an objective function `hpo_objective_dropoutmlp_synetune` to be used with Syne Tune. Make sure that your function reports the validation error after every epoch.
2. Using the setup of Exercise 1 in Section 19.2, compare random search to Bayesian optimization. If you use SageMaker, feel free to use Syne Tune's benchmarking facilities in order to run experiments in parallel. Hint: Bayesian optimization is provided as `syne_tune.optimizer.baselines.BayesianOptimization`.
3. For this exercise, you need to run on an instance with at least 4 CPU cores. For one of the methods used above (random search, Bayesian optimization), run experiments with `n_workers=1`, `n_workers=2`, `n_workers=4`, and compare results (incumbent trajectories). At least for random search, you should observe linear scaling with respect to the number of workers. Hint: for robust results, you may have to average over several repetitions each.
4. Advanced. The goal of this exercise is to implement a new scheduler in Syne Tune.