19.5. Asynchronous Successive Halving
As we have seen in Section 19.3, we can accelerate HPO by distributing the evaluation of hyperparameter configurations across either multiple instances or multiple CPUs/GPUs on a single instance. However, compared to random search, it is not straightforward to run successive halving (SH) asynchronously in a distributed setting. Before we can decide which configuration to run next, we first have to collect all observations at the current rung level. This requires synchronizing the workers at each rung level. For example, at the lowest rung level \(r_{\mathrm{min}}\), we first have to evaluate all \(N = \eta^K\) configurations before we can promote the best \(\frac{1}{\eta}\) fraction of them to the next rung level.
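To make this synchronization cost concrete, the following short sketch (illustrative only, not part of Syne Tune) prints the rung structure of synchronous SH for \(r_{\mathrm{min}} = 1\), \(\eta = 2\), and \(K = 3\): how many configurations must all finish at each rung before any promotion can take place.

# Illustrative sketch (not part of Syne Tune): rung structure of
# synchronous successive halving for r_min = 1, eta = 2, K = 3
eta, r_min, K = 2, 1, 3
num_configs = eta ** K  # N = eta^K configurations enter the lowest rung
for k in range(K + 1):
    rung_level = r_min * eta ** k        # resource (epochs) at rung k
    survivors = num_configs // eta ** k  # configs remaining after k halvings
    print(f"rung {k}: r = {rung_level} epochs, {survivors} configurations")
# All configurations at a rung must finish before the best 1 / eta
# fraction is promoted, which is where workers go idle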
In any distributed system, synchronization typically implies idle time for workers. First, we often observe large variation in training time across hyperparameter configurations. For example, if the number of filters per layer is a hyperparameter, then networks with fewer filters finish training faster than networks with more filters, so workers sit idle waiting for stragglers. Moreover, the number of slots in a rung level is not always a multiple of the number of workers, in which case some workers may even be idle for a full batch.
Figure 19.5.1 shows the scheduling of synchronous successive halving with \(\eta=2\) for four different trials on two workers. We start by evaluating Trial-0 and Trial-1 for one epoch and immediately continue with the next two trials once they are finished. Before we can promote the two best trials (namely Trial-0 and Trial-3) to the next rung level, we first have to wait for Trial-2, which takes longer than the other trials, leaving Worker-1 idle. We then continue with rung 1. Here, Trial-3 takes longer than Trial-0, which leads to additional idle time for Worker-0. Once we reach rung 2, only the best trial, Trial-0, remains, and it occupies just one worker. To avoid Worker-1 sitting idle during that time, most implementations of successive halving immediately start the next round and begin evaluating new trials (for example, Trial-4) on the first rung.
Fig. 19.5.1 Synchronous successive halving with two workers.
Asynchronous successive halving (ASHA) (Li et al., 2018) adapts SH to the asynchronous parallel setting. The main idea of ASHA is to promote a configuration to the next rung level as soon as we have collected at least \(\eta\) observations on the current rung level. This decision rule can lead to suboptimal promotions: a configuration may be promoted to the next rung level even though, in hindsight, it performs worse than most other configurations at the same rung. On the other hand, we get rid of all synchronization points this way. In practice, such suboptimal initial promotions have only a modest impact on performance, not only because the ranking of hyperparameter configurations is often fairly consistent across rung levels, but also because rungs grow over time and reflect the distribution of metric values at that level better and better. If a worker is free but no configuration can be promoted, we start a new configuration at \(r = r_{\mathrm{min}}\), i.e., the first rung level.
Figure 19.5.2 shows the scheduling of the same configurations under ASHA. Once Trial-1 finishes, we have collected the results of two trials (namely Trial-0 and Trial-1) and immediately promote the better of them (Trial-0) to the next rung level. After Trial-0 finishes on rung 1, there are too few trials there to support a further promotion. Hence, we continue on rung 0 and evaluate Trial-3. Once Trial-3 finishes, Trial-2 is still pending. At this point we have three trials evaluated on rung 0 and one trial already evaluated on rung 1. Since Trial-3 performs worse than Trial-0 at rung 0 and \(\eta=2\), we cannot promote any new trial yet, and Worker-1 instead starts Trial-4 from scratch. However, once Trial-2 finishes with a worse score than Trial-3, the latter is promoted to rung 1. Afterwards, we have collected two evaluations on rung 1, which means we can now promote Trial-0 to rung 2. At the same time, Worker-1 continues evaluating new trials (i.e., Trial-5) on rung 0.
Fig. 19.5.2 Asynchronous successive halving (ASHA) with two workers.
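The promotion rule itself can be stated in a few lines. The following is a simplified sketch of this decision logic (my own illustration, not Syne Tune's actual implementation): whenever a worker becomes free, scan the rungs from top to bottom and promote the first configuration that lies in the top \(\frac{1}{\eta}\) fraction of its rung and has not been promoted yet; otherwise, start a new configuration at the lowest rung.

# Simplified sketch of the ASHA promotion rule (illustration only, not
# Syne Tune's actual implementation)
def asha_decision(rungs, promoted, eta=2, mode="min"):
    # rungs[k]: list of (trial_id, metric) results recorded at rung k
    # promoted[k]: set of trial_ids already promoted away from rung k
    for k in reversed(range(len(rungs) - 1)):
        results = rungs[k]
        if len(results) < eta:
            continue  # need at least eta observations on this rung
        # Top 1/eta fraction of all trials seen so far on this rung
        top = sorted(results, key=lambda r: r[1],
                     reverse=(mode == "max"))[: len(results) // eta]
        for trial_id, _ in top:
            if trial_id not in promoted[k]:
                promoted[k].add(trial_id)
                return "promote", trial_id, k + 1
    # Nothing can be promoted: start a new trial at the lowest rung
    return "start_new", None, 0

# Example: once Trial-0 and Trial-1 have reported on rung 0, the better
# one is promoted immediately, with no synchronization point
rungs = [[("Trial-0", 0.21), ("Trial-1", 0.43)], [], []]
promoted = [set(), set(), set()]
print(asha_decision(rungs, promoted))  # ('promote', 'Trial-0', 1)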
import logging
from d2l import torch as d2l
logging.basicConfig(level=logging.INFO)
import matplotlib.pyplot as plt
from syne_tune import StoppingCriterion, Tuner
from syne_tune.backend.python_backend import PythonBackend
from syne_tune.config_space import loguniform, randint
from syne_tune.experiments import load_experiment
from syne_tune.optimizer.baselines import ASHA
INFO:root:SageMakerBackend is not imported since dependencies are missing. You can install them with
pip install 'syne-tune[extra]'
AWS dependencies are not imported since dependencies are missing. You can install them with
pip install 'syne-tune[aws]'
or (for everything)
pip install 'syne-tune[extra]'
INFO:root:Ray Tune schedulers and searchers are not imported since dependencies are missing. You can install them with
pip install 'syne-tune[raytune]'
or (for everything)
pip install 'syne-tune[extra]'
19.5.1. Objective Function
We will use *Syne Tune* with the same objective function as in Section 19.3.
def hpo_objective_lenet_synetune(learning_rate, batch_size, max_epochs):
    from syne_tune import Reporter
    from d2l import torch as d2l

    model = d2l.LeNet(lr=learning_rate, num_classes=10)
    trainer = d2l.HPOTrainer(max_epochs=1, num_gpus=1)
    data = d2l.FashionMNIST(batch_size=batch_size)
    model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
    report = Reporter()
    for epoch in range(1, max_epochs + 1):
        if epoch == 1:
            # Initialize the state of Trainer
            trainer.fit(model=model, data=data)
        else:
            trainer.fit_epoch()
        validation_error = trainer.validation_error().cpu().detach().numpy()
        # Report back after every epoch, so the scheduler can stop or
        # promote the trial at each rung level
        report(epoch=epoch, validation_error=float(validation_error))
We will also use the same configuration space as before:
min_number_of_epochs = 2
max_number_of_epochs = 10
eta = 2

config_space = {
    "learning_rate": loguniform(1e-2, 1),
    "batch_size": randint(32, 256),
    "max_epochs": max_number_of_epochs,
}
initial_config = {
    "learning_rate": 0.1,
    "batch_size": 128,
}
19.5.2. Asynchronous Scheduler
First, we define the number of workers that evaluate trials concurrently. We also need to specify how long we want the experiment to run, by defining an upper limit on the total wall-clock time.
n_workers = 2 # Needs to be <= the number of available GPUs
max_wallclock_time = 12 * 60 # 12 minutes
The code for running ASHA is a simple variant of what we did for asynchronous random search.
mode = "min"
metric = "validation_error"
resource_attr = "epoch"

scheduler = ASHA(
    config_space,
    metric=metric,
    mode=mode,
    points_to_evaluate=[initial_config],
    max_resource_attr="max_epochs",
    resource_attr=resource_attr,
    grace_period=min_number_of_epochs,
    reduction_factor=eta,
)
INFO:syne_tune.optimizer.schedulers.fifo:max_resource_level = 10, as inferred from config_space
INFO:syne_tune.optimizer.schedulers.fifo:Master random_seed = 3140976097
Here, metric and resource_attr specify the key names used with the report callback, and max_resource_attr denotes which input of the objective function corresponds to \(r_{\mathrm{max}}\). Moreover, grace_period provides \(r_{\mathrm{min}}\), and reduction_factor is \(\eta\). We can run Syne Tune as before (this will take about 12 minutes):
trial_backend = PythonBackend(
    tune_function=hpo_objective_lenet_synetune,
    config_space=config_space,
)

stop_criterion = StoppingCriterion(max_wallclock_time=max_wallclock_time)
tuner = Tuner(
    trial_backend=trial_backend,
    scheduler=scheduler,
    stop_criterion=stop_criterion,
    n_workers=n_workers,
    print_update_interval=int(max_wallclock_time * 0.6),
)
tuner.run()
INFO:syne_tune.tuner:results of trials will be saved on /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046
INFO:root:Detected 4 GPUs
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.1 --batch_size 128 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/0/checkpoints
INFO:syne_tune.tuner:(trial 0) - scheduled config {'learning_rate': 0.1, 'batch_size': 128, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.44639554136672527 --batch_size 196 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/1/checkpoints
INFO:syne_tune.tuner:(trial 1) - scheduled config {'learning_rate': 0.44639554136672527, 'batch_size': 196, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.011548051321691994 --batch_size 254 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/2/checkpoints
INFO:syne_tune.tuner:(trial 2) - scheduled config {'learning_rate': 0.011548051321691994, 'batch_size': 254, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.14942487313193167 --batch_size 132 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/3/checkpoints
INFO:syne_tune.tuner:(trial 3) - scheduled config {'learning_rate': 0.14942487313193167, 'batch_size': 132, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 1 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.06317157191455719 --batch_size 242 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/4/checkpoints
INFO:syne_tune.tuner:(trial 4) - scheduled config {'learning_rate': 0.06317157191455719, 'batch_size': 242, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.48801815412811467 --batch_size 41 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/5/checkpoints
INFO:syne_tune.tuner:(trial 5) - scheduled config {'learning_rate': 0.48801815412811467, 'batch_size': 41, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.5904067586747807 --batch_size 244 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/6/checkpoints
INFO:syne_tune.tuner:(trial 6) - scheduled config {'learning_rate': 0.5904067586747807, 'batch_size': 244, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.08812857364095393 --batch_size 148 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/7/checkpoints
INFO:syne_tune.tuner:(trial 7) - scheduled config {'learning_rate': 0.08812857364095393, 'batch_size': 148, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.012271314788363914 --batch_size 235 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/8/checkpoints
INFO:syne_tune.tuner:(trial 8) - scheduled config {'learning_rate': 0.012271314788363914, 'batch_size': 235, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 5 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.08845692598296777 --batch_size 236 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/9/checkpoints
INFO:syne_tune.tuner:(trial 9) - scheduled config {'learning_rate': 0.08845692598296777, 'batch_size': 236, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.0825770880068151 --batch_size 75 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/10/checkpoints
INFO:syne_tune.tuner:(trial 10) - scheduled config {'learning_rate': 0.0825770880068151, 'batch_size': 75, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.20235201406823256 --batch_size 65 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/11/checkpoints
INFO:syne_tune.tuner:(trial 11) - scheduled config {'learning_rate': 0.20235201406823256, 'batch_size': 65, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.3359885631737537 --batch_size 58 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/12/checkpoints
INFO:syne_tune.tuner:(trial 12) - scheduled config {'learning_rate': 0.3359885631737537, 'batch_size': 58, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.7892434579795236 --batch_size 89 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/13/checkpoints
INFO:syne_tune.tuner:(trial 13) - scheduled config {'learning_rate': 0.7892434579795236, 'batch_size': 89, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.1233786579597858 --batch_size 176 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/14/checkpoints
INFO:syne_tune.tuner:(trial 14) - scheduled config {'learning_rate': 0.1233786579597858, 'batch_size': 176, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 13 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.13707981127012328 --batch_size 141 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/15/checkpoints
INFO:syne_tune.tuner:(trial 15) - scheduled config {'learning_rate': 0.13707981127012328, 'batch_size': 141, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.02913976299993913 --batch_size 116 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/16/checkpoints
INFO:syne_tune.tuner:(trial 16) - scheduled config {'learning_rate': 0.02913976299993913, 'batch_size': 116, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.033362897489792855 --batch_size 154 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/17/checkpoints
INFO:syne_tune.tuner:(trial 17) - scheduled config {'learning_rate': 0.033362897489792855, 'batch_size': 154, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.29442952580755816 --batch_size 210 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/18/checkpoints
INFO:syne_tune.tuner:(trial 18) - scheduled config {'learning_rate': 0.29442952580755816, 'batch_size': 210, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.10214259921521483 --batch_size 239 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/19/checkpoints
INFO:syne_tune.tuner:(trial 19) - scheduled config {'learning_rate': 0.10214259921521483, 'batch_size': 239, 'max_epochs': 10}
INFO:syne_tune.tuner:tuning status (last metric is reported)
trial_id status iter learning_rate batch_size max_epochs epoch validation_error worker-time
0 Stopped 4 0.100000 128 10 4.0 0.430578 29.093798
1 Completed 10 0.446396 196 10 10.0 0.205652 72.747496
2 Stopped 2 0.011548 254 10 2.0 0.900570 13.729115
3 Stopped 8 0.149425 132 10 8.0 0.259171 58.980305
4 Stopped 4 0.063172 242 10 4.0 0.900579 27.773950
5 Completed 10 0.488018 41 10 10.0 0.140488 113.171314
6 Stopped 10 0.590407 244 10 10.0 0.193776 70.364757
7 Stopped 2 0.088129 148 10 2.0 0.899955 14.169738
8 Stopped 2 0.012271 235 10 2.0 0.899840 13.434274
9 Stopped 2 0.088457 236 10 2.0 0.899801 13.034437
10 Stopped 4 0.082577 75 10 4.0 0.385970 35.426524
11 Stopped 4 0.202352 65 10 4.0 0.543102 34.653495
12 Stopped 10 0.335989 58 10 10.0 0.149558 90.924182
13 Completed 10 0.789243 89 10 10.0 0.144887 77.365970
14 Stopped 2 0.123379 176 10 2.0 0.899987 12.422906
15 Stopped 2 0.137080 141 10 2.0 0.899983 13.395153
16 Stopped 4 0.029140 116 10 4.0 0.900532 27.834111
17 Stopped 2 0.033363 154 10 2.0 0.899996 13.407285
18 InProgress 1 0.294430 210 10 1.0 0.899878 6.126259
19 InProgress 0 0.102143 239 10 - - -
2 trials running, 18 finished (3 until the end), 437.07s wallclock-time
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.02846298236356246 --batch_size 115 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/20/checkpoints
INFO:syne_tune.tuner:(trial 20) - scheduled config {'learning_rate': 0.02846298236356246, 'batch_size': 115, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.037703019195187606 --batch_size 91 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/21/checkpoints
INFO:syne_tune.tuner:(trial 21) - scheduled config {'learning_rate': 0.037703019195187606, 'batch_size': 91, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.0741039859356903 --batch_size 192 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/22/checkpoints
INFO:syne_tune.tuner:(trial 22) - scheduled config {'learning_rate': 0.0741039859356903, 'batch_size': 192, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.3032613031191755 --batch_size 252 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/23/checkpoints
INFO:syne_tune.tuner:(trial 23) - scheduled config {'learning_rate': 0.3032613031191755, 'batch_size': 252, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.019823425532533637 --batch_size 252 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/24/checkpoints
INFO:syne_tune.tuner:(trial 24) - scheduled config {'learning_rate': 0.019823425532533637, 'batch_size': 252, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.8203370335228594 --batch_size 77 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/25/checkpoints
INFO:syne_tune.tuner:(trial 25) - scheduled config {'learning_rate': 0.8203370335228594, 'batch_size': 77, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.2960420911378594 --batch_size 104 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/26/checkpoints
INFO:syne_tune.tuner:(trial 26) - scheduled config {'learning_rate': 0.2960420911378594, 'batch_size': 104, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.2993874715754653 --batch_size 192 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/27/checkpoints
INFO:syne_tune.tuner:(trial 27) - scheduled config {'learning_rate': 0.2993874715754653, 'batch_size': 192, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.08056711961080017 --batch_size 36 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/28/checkpoints
INFO:syne_tune.tuner:(trial 28) - scheduled config {'learning_rate': 0.08056711961080017, 'batch_size': 36, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.26868380288030347 --batch_size 151 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/29/checkpoints
INFO:syne_tune.tuner:(trial 29) - scheduled config {'learning_rate': 0.26868380288030347, 'batch_size': 151, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 29 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.9197404791177789 --batch_size 66 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/30/checkpoints
INFO:syne_tune.tuner:(trial 30) - scheduled config {'learning_rate': 0.9197404791177789, 'batch_size': 66, 'max_epochs': 10}
INFO:syne_tune.stopping_criterion:reaching max wallclock time (720), stopping there.
INFO:syne_tune.tuner:Stopping trials that may still be running.
INFO:syne_tune.tuner:Tuning finished, results of trials can be found on /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046
--------------------
Resource summary (last result is reported):
trial_id status iter learning_rate batch_size max_epochs epoch validation_error worker-time
0 Stopped 4 0.100000 128 10 4 0.430578 29.093798
1 Completed 10 0.446396 196 10 10 0.205652 72.747496
2 Stopped 2 0.011548 254 10 2 0.900570 13.729115
3 Stopped 8 0.149425 132 10 8 0.259171 58.980305
4 Stopped 4 0.063172 242 10 4 0.900579 27.773950
5 Completed 10 0.488018 41 10 10 0.140488 113.171314
6 Stopped 10 0.590407 244 10 10 0.193776 70.364757
7 Stopped 2 0.088129 148 10 2 0.899955 14.169738
8 Stopped 2 0.012271 235 10 2 0.899840 13.434274
9 Stopped 2 0.088457 236 10 2 0.899801 13.034437
10 Stopped 4 0.082577 75 10 4 0.385970 35.426524
11 Stopped 4 0.202352 65 10 4 0.543102 34.653495
12 Stopped 10 0.335989 58 10 10 0.149558 90.924182
13 Completed 10 0.789243 89 10 10 0.144887 77.365970
14 Stopped 2 0.123379 176 10 2 0.899987 12.422906
15 Stopped 2 0.137080 141 10 2 0.899983 13.395153
16 Stopped 4 0.029140 116 10 4 0.900532 27.834111
17 Stopped 2 0.033363 154 10 2 0.899996 13.407285
18 Stopped 8 0.294430 210 10 8 0.241193 52.089688
19 Stopped 2 0.102143 239 10 2 0.900002 12.487762
20 Stopped 2 0.028463 115 10 2 0.899995 14.100359
21 Stopped 2 0.037703 91 10 2 0.900026 14.664848
22 Stopped 2 0.074104 192 10 2 0.901730 13.312770
23 Stopped 2 0.303261 252 10 2 0.900009 12.725821
24 Stopped 2 0.019823 252 10 2 0.899917 12.533380
25 Stopped 10 0.820337 77 10 10 0.196842 81.816103
26 Stopped 10 0.296042 104 10 10 0.198453 81.121330
27 Stopped 4 0.299387 192 10 4 0.336183 24.610689
28 InProgress 9 0.080567 36 10 9 0.203052 104.303746
29 Completed 10 0.268684 151 10 10 0.222814 68.217289
30 InProgress 1 0.919740 66 10 1 0.900037 10.070776
2 trials running, 29 finished (4 until the end), 723.70s wallclock-time
validation_error: best 0.1404876708984375 for trial-id 5
--------------------
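Looking at the iter column above, trials are stopped (or promoted) after 2, 4, or 8 epochs, while trials that survive all rungs run to the full 10. These are exactly the rung levels \(r_{\mathrm{min}} \eta^k\) implied by grace_period=2 and reduction_factor=2, capped at max_epochs=10. A quick check, assuming the geometric rung spacing described earlier and reusing the variables defined above:

# Rung levels implied by our settings (assuming geometric spacing):
# r_min * eta^k for k = 0, 1, ..., while below r_max
rung_levels, r = [], min_number_of_epochs
while r < max_number_of_epochs:
    rung_levels.append(r)
    r *= eta
print(rung_levels)  # [2, 4, 8]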
Note that we are running a variant of ASHA in which underperforming trials are stopped early. This differs from our implementation in Section 19.4.1, where each training job is started with a fixed max_epochs. In the latter case, a well-performing trial that reaches the full 10 epochs first needs to train for 1, then 2, then 4, then 8 epochs, each time starting from scratch. This type of pause-and-resume scheduling can be implemented efficiently by checkpointing the training state after each epoch, but we avoid this extra complexity here. After the experiment has finished, we can retrieve and plot the results.
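As an aside, here is a minimal sketch of what such per-epoch checkpointing could look like in plain PyTorch. The helper names (train_one_epoch, checkpoint_path) are hypothetical; Syne Tune does pass a checkpoint directory to each trial for this purpose (the st_checkpoint_dir argument visible in the logs above), but we do not use it in this section.

import os
import torch

# Minimal pause-and-resume sketch (hypothetical helpers, illustration only)
def train_with_checkpointing(model, optimizer, train_one_epoch,
                             max_epochs, checkpoint_path):
    start_epoch = 0
    if os.path.exists(checkpoint_path):
        # Resume from the last completed epoch instead of starting from scratch
        state = torch.load(checkpoint_path)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        start_epoch = state["epoch"]
    for epoch in range(start_epoch, max_epochs):
        train_one_epoch(model, optimizer)
        # Checkpoint after every epoch, so a paused trial can later be
        # resumed at the next rung level without repeating work
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "epoch": epoch + 1}, checkpoint_path)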
d2l.set_figsize()
e = load_experiment(tuner.name)
e.plot()
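Besides the plot, the results of the experiment are available as a pandas DataFrame (e.results, which we also use for the visualization below), so we can, for example, look up the best observed validation error directly. The snippet below only assumes the trial_id and validation_error columns, which also appear in the plotting code:

# Best observed validation error across all trials
best_row = e.results.loc[e.results["validation_error"].idxmin()]
print(best_row["trial_id"], best_row["validation_error"])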
WARNING:matplotlib.legend:No artists with labels found to put in legend. Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
19.5.3. Visualize the Optimization Process
Once more, we visualize the learning curves of every trial (each color in the plot represents a trial). Compare this to asynchronous random search in Section 19.3. As we have seen for successive halving in Section 19.4, most of the trials are stopped at 2 or 4 epochs (\(r_{\mathrm{min}}\) or \(\eta \cdot r_{\mathrm{min}}\)). However, trials do not stop at the same point, because they require different amounts of time per epoch. If we ran standard successive halving instead of ASHA, we would need to synchronize our workers before we could promote configurations to the next rung level.
d2l.set_figsize([6, 2.5])
results = e.results
for trial_id in results.trial_id.unique():
    df = results[results["trial_id"] == trial_id]
    d2l.plt.plot(
        df["st_tuner_time"],
        df["validation_error"],
        marker="o"
    )
d2l.plt.xlabel("wall-clock time")
d2l.plt.ylabel("objective function")
Text(0, 0.5, 'objective function')