19.5. Asynchronous Successive Halving

As we have seen in Section 19.3, we can accelerate HPO by distributing the evaluation of hyperparameter configurations across either multiple instances or multiple CPUs/GPUs on a single instance. However, compared to random search, it is not straightforward to run successive halving (SH) asynchronously in a distributed setting. Before we can decide which configuration to run next, we first have to collect all observations at the current rung level. This requires synchronizing the workers at each rung. For example, for the lowest rung \(r_{\mathrm{min}}\), we first have to evaluate all \(N = \eta^K\) configurations before we can promote the best \(\frac{1}{\eta}\) of them to the next rung.
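To make these synchronization points concrete, the following back-of-the-envelope sketch (plain Python; the values of eta, r_min, and K are chosen just for illustration) enumerates the rungs: each rung can only be formed once all configurations on the previous rung have been evaluated, keeps the best \(\frac{1}{\eta}\) of them, and multiplies their training budget by \(\eta\).

# Illustrative sketch of synchronous SH rung sizes (not Syne Tune code):
# start with N = eta**K configurations at budget r_min; at every
# synchronization point, keep the top 1/eta and train them eta times longer.
eta, r_min, K = 2, 1, 3
num_configs, budget = eta**K, r_min
while num_configs >= 1:
    print(f"rung at r = {budget}: {num_configs} configurations")
    num_configs //= eta  # promote only the best 1/eta
    budget *= eta        # survivors train eta times longer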

In any distributed system, synchronization typically implies idle time for workers. First, we often observe high variation in training time across hyperparameter configurations. For example, assuming the number of filters per layer is a hyperparameter, networks with fewer filters finish training faster than networks with more filters, which means idle worker time due to stragglers. Moreover, the number of slots in a rung is not always a multiple of the number of workers, in which case some workers may even sit idle for a full batch.

Fig. 19.5.1 shows the scheduling of synchronous SH with \(\eta=2\) for four different trials with two workers. We start with evaluating Trial-0 and Trial-1 for one epoch and immediately continue with the next two trials once they are finished. We first have to wait until Trial-2 finishes, which takes substantially more time than the other trials, before we can promote the best two trials, i.e., Trial-0 and Trial-3, to the next rung level. This causes idle time for Worker-1. Then, we continue with Rung 1. Here, too, Trial-3 takes longer than Trial-0, which leads to additional idle time for Worker-0. Once we reach Rung 2, only the best trial, Trial-0, remains, and it occupies only one worker. To avoid Worker-1 idling during this time, most implementations of SH immediately start the next round and begin evaluating new trials (e.g., Trial-4) on the first rung.

../_images/sync_sh.svg

Fig. 19.5.1 Synchronous successive halving with two workers.

Asynchronous successive halving (ASHA) (Li et al., 2018) adapts SH to the asynchronous parallel setting. The main idea of ASHA is to promote a configuration to the next rung as soon as we have collected at least \(\eta\) observations on the current rung. This decision rule may lead to suboptimal promotions: some configurations get promoted to the next rung that, in hindsight, do not compare favorably against most others at the same rung. On the other hand, we get rid of all synchronization points this way. In practice, such suboptimal initial promotions have only a modest impact on performance, not only because the ranking of hyperparameter configurations is often fairly consistent across rung levels, but also because rungs grow over time and reflect the distribution of metric values at that level better and better. If a worker is free but no configuration can be promoted, we start a new configuration with \(r = r_{\mathrm{min}}\), i.e., on the first rung level.
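The promotion rule fits in a few lines. The sketch below is illustrative only (the function and data structures are hypothetical, not Syne Tune internals): whenever a worker becomes free, we scan the rungs from top to bottom and promote the first configuration that lies in the top \(\frac{1}{\eta}\) of its rung and has not been promoted yet; otherwise we signal that a new configuration should be started at \(r_{\mathrm{min}}\).

# Illustrative sketch of ASHA's promotion rule (hypothetical helper, not
# Syne Tune's implementation). `rungs` maps a rung level r to a list of
# (trial, metric) results recorded at that level; `promoted` records
# (trial, r) pairs that were already promoted.
def asha_next_job(rungs, promoted, eta=2):
    for r in sorted(rungs, reverse=True):  # prefer higher rungs
        results = sorted(rungs[r], key=lambda tm: tm[1])  # smaller is better
        k = len(results) // eta  # size of the top-1/eta set seen so far
        for trial, _ in results[:k]:
            if (trial, r) not in promoted:
                promoted.add((trial, r))
                return trial, r * eta  # continue this trial on the next rung
    return None, None  # nothing promotable: start a new trial at r_min

With \(\eta = 2\), a trial becomes promotable as soon as two results exist at its rung and it is the better one, which is exactly the situation after Trial-1 finishes in Fig. 19.5.2 below.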

Fig. 19.5.2 shows the scheduling of ASHA for the same configurations. Once Trial-1 finishes, we have collected the results of two trials (i.e., Trial-0 and Trial-1) and immediately promote the better of them (Trial-0) to the next rung level. After Trial-0 finishes on Rung 1, there are too few trials on that rung to support a further promotion. Hence, we continue with Rung 0 and evaluate Trial-3. When Trial-3 finishes, Trial-2 is still running. At this point, we have three trials evaluated on Rung 0 and one trial already evaluated on Rung 1. Since Trial-3 performs worse than Trial-0 on Rung 0 and \(\eta=2\), we cannot promote any new trial yet, so Worker-1 starts Trial-4 from scratch instead. However, once Trial-2 finishes with a worse score than Trial-3, the latter is promoted to Rung 1. Afterwards, we have collected two evaluations on Rung 1, which means we can now promote Trial-0 to Rung 2. Meanwhile, Worker-1 continues evaluating new trials (i.e., Trial-5) on Rung 0.

../_images/asha.svg

Fig. 19.5.2 Asynchronous successive halving (ASHA) with two workers.

import logging
from d2l import torch as d2l

logging.basicConfig(level=logging.INFO)
import matplotlib.pyplot as plt
from syne_tune import StoppingCriterion, Tuner
from syne_tune.backend.python_backend import PythonBackend
from syne_tune.config_space import loguniform, randint
from syne_tune.experiments import load_experiment
from syne_tune.optimizer.baselines import ASHA
INFO:root:SageMakerBackend is not imported since dependencies are missing. You can install them with
   pip install 'syne-tune[extra]'
AWS dependencies are not imported since dependencies are missing. You can install them with
   pip install 'syne-tune[aws]'
or (for everything)
   pip install 'syne-tune[extra]'
INFO:root:Ray Tune schedulers and searchers are not imported since dependencies are missing. You can install them with
   pip install 'syne-tune[raytune]'
or (for everything)
   pip install 'syne-tune[extra]'

19.5.1. Objective Function

We will use *Syne Tune* with the same objective function as in Section 19.3.

def hpo_objective_lenet_synetune(learning_rate, batch_size, max_epochs):
    from syne_tune import Reporter
    from d2l import torch as d2l

    model = d2l.LeNet(lr=learning_rate, num_classes=10)
    trainer = d2l.HPOTrainer(max_epochs=1, num_gpus=1)
    data = d2l.FashionMNIST(batch_size=batch_size)
    model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
    report = Reporter()
    for epoch in range(1, max_epochs + 1):
        if epoch == 1:
            # Initialize the state of Trainer
            trainer.fit(model=model, data=data)
        else:
            trainer.fit_epoch()
        validation_error = trainer.validation_error().cpu().detach().numpy()
        # Report the validation error back to Syne Tune after every epoch
        report(epoch=epoch, validation_error=float(validation_error))

We will also use the same configuration space as before:

min_number_of_epochs = 2
max_number_of_epochs = 10
eta = 2

config_space = {
    "learning_rate": loguniform(1e-2, 1),
    "batch_size": randint(32, 256),
    "max_epochs": max_number_of_epochs,
}
initial_config = {
    "learning_rate": 0.1,
    "batch_size": 128,
}

19.5.2. Asynchronous Scheduler

First, we define the number of workers that evaluate trials concurrently. We also need to specify how long we want to run ASHA, by defining an upper limit on the total wall-clock time.

n_workers = 2  # Needs to be <= the number of available GPUs
max_wallclock_time = 12 * 60  # 12 minutes

The code for running ASHA is a simple variation of what we did for asynchronous random search.

mode = "min"
metric = "validation_error"
resource_attr = "epoch"

scheduler = ASHA(
    config_space,
    metric=metric,
    mode=mode,
    points_to_evaluate=[initial_config],
    max_resource_attr="max_epochs",
    resource_attr=resource_attr,
    grace_period=min_number_of_epochs,
    reduction_factor=eta,
)
INFO:syne_tune.optimizer.schedulers.fifo:max_resource_level = 10, as inferred from config_space
INFO:syne_tune.optimizer.schedulers.fifo:Master random_seed = 3140976097

Here, metric and resource_attr specify the key names used with the report callback, and max_resource_attr denotes which input of the objective function corresponds to \(r_{\mathrm{max}}\). Moreover, grace_period provides \(r_{\mathrm{min}}\), and reduction_factor is \(\eta\). We can run Syne Tune as before (this will take about 12 minutes):
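As a quick sanity check (ordinary Python, not a Syne Tune API call), we can enumerate the rung levels these arguments imply; trials can only be stopped or promoted at these epoch counts.

# Rung levels implied by grace_period=2, reduction_factor=2, max_epochs=10:
# start at r_min and multiply by eta until r_max is reached.
r, rung_levels = min_number_of_epochs, []
while r < max_number_of_epochs:
    rung_levels.append(r)
    r *= eta
print(rung_levels)  # [2, 4, 8]

This is consistent with the results further below: stopped trials report 2, 4, or 8 epochs, while trials that survive the final rung train for the full 10 epochs.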

trial_backend = PythonBackend(
    tune_function=hpo_objective_lenet_synetune,
    config_space=config_space,
)

stop_criterion = StoppingCriterion(max_wallclock_time=max_wallclock_time)
tuner = Tuner(
    trial_backend=trial_backend,
    scheduler=scheduler,
    stop_criterion=stop_criterion,
    n_workers=n_workers,
    print_update_interval=int(max_wallclock_time * 0.6),
)
tuner.run()
INFO:syne_tune.tuner:results of trials will be saved on /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046
INFO:root:Detected 4 GPUs
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.1 --batch_size 128 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/0/checkpoints
INFO:syne_tune.tuner:(trial 0) - scheduled config {'learning_rate': 0.1, 'batch_size': 128, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.44639554136672527 --batch_size 196 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/1/checkpoints
INFO:syne_tune.tuner:(trial 1) - scheduled config {'learning_rate': 0.44639554136672527, 'batch_size': 196, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.011548051321691994 --batch_size 254 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/2/checkpoints
INFO:syne_tune.tuner:(trial 2) - scheduled config {'learning_rate': 0.011548051321691994, 'batch_size': 254, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.14942487313193167 --batch_size 132 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/3/checkpoints
INFO:syne_tune.tuner:(trial 3) - scheduled config {'learning_rate': 0.14942487313193167, 'batch_size': 132, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 1 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.06317157191455719 --batch_size 242 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/4/checkpoints
INFO:syne_tune.tuner:(trial 4) - scheduled config {'learning_rate': 0.06317157191455719, 'batch_size': 242, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.48801815412811467 --batch_size 41 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/5/checkpoints
INFO:syne_tune.tuner:(trial 5) - scheduled config {'learning_rate': 0.48801815412811467, 'batch_size': 41, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.5904067586747807 --batch_size 244 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/6/checkpoints
INFO:syne_tune.tuner:(trial 6) - scheduled config {'learning_rate': 0.5904067586747807, 'batch_size': 244, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.08812857364095393 --batch_size 148 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/7/checkpoints
INFO:syne_tune.tuner:(trial 7) - scheduled config {'learning_rate': 0.08812857364095393, 'batch_size': 148, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.012271314788363914 --batch_size 235 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/8/checkpoints
INFO:syne_tune.tuner:(trial 8) - scheduled config {'learning_rate': 0.012271314788363914, 'batch_size': 235, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 5 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.08845692598296777 --batch_size 236 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/9/checkpoints
INFO:syne_tune.tuner:(trial 9) - scheduled config {'learning_rate': 0.08845692598296777, 'batch_size': 236, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.0825770880068151 --batch_size 75 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/10/checkpoints
INFO:syne_tune.tuner:(trial 10) - scheduled config {'learning_rate': 0.0825770880068151, 'batch_size': 75, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.20235201406823256 --batch_size 65 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/11/checkpoints
INFO:syne_tune.tuner:(trial 11) - scheduled config {'learning_rate': 0.20235201406823256, 'batch_size': 65, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.3359885631737537 --batch_size 58 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/12/checkpoints
INFO:syne_tune.tuner:(trial 12) - scheduled config {'learning_rate': 0.3359885631737537, 'batch_size': 58, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.7892434579795236 --batch_size 89 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/13/checkpoints
INFO:syne_tune.tuner:(trial 13) - scheduled config {'learning_rate': 0.7892434579795236, 'batch_size': 89, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.1233786579597858 --batch_size 176 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/14/checkpoints
INFO:syne_tune.tuner:(trial 14) - scheduled config {'learning_rate': 0.1233786579597858, 'batch_size': 176, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 13 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.13707981127012328 --batch_size 141 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/15/checkpoints
INFO:syne_tune.tuner:(trial 15) - scheduled config {'learning_rate': 0.13707981127012328, 'batch_size': 141, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.02913976299993913 --batch_size 116 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/16/checkpoints
INFO:syne_tune.tuner:(trial 16) - scheduled config {'learning_rate': 0.02913976299993913, 'batch_size': 116, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.033362897489792855 --batch_size 154 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/17/checkpoints
INFO:syne_tune.tuner:(trial 17) - scheduled config {'learning_rate': 0.033362897489792855, 'batch_size': 154, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.29442952580755816 --batch_size 210 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/18/checkpoints
INFO:syne_tune.tuner:(trial 18) - scheduled config {'learning_rate': 0.29442952580755816, 'batch_size': 210, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.10214259921521483 --batch_size 239 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/19/checkpoints
INFO:syne_tune.tuner:(trial 19) - scheduled config {'learning_rate': 0.10214259921521483, 'batch_size': 239, 'max_epochs': 10}
INFO:syne_tune.tuner:tuning status (last metric is reported)
 trial_id     status  iter  learning_rate  batch_size  max_epochs  epoch  validation_error  worker-time
        0    Stopped     4       0.100000         128          10    4.0          0.430578    29.093798
        1  Completed    10       0.446396         196          10   10.0          0.205652    72.747496
        2    Stopped     2       0.011548         254          10    2.0          0.900570    13.729115
        3    Stopped     8       0.149425         132          10    8.0          0.259171    58.980305
        4    Stopped     4       0.063172         242          10    4.0          0.900579    27.773950
        5  Completed    10       0.488018          41          10   10.0          0.140488   113.171314
        6    Stopped    10       0.590407         244          10   10.0          0.193776    70.364757
        7    Stopped     2       0.088129         148          10    2.0          0.899955    14.169738
        8    Stopped     2       0.012271         235          10    2.0          0.899840    13.434274
        9    Stopped     2       0.088457         236          10    2.0          0.899801    13.034437
       10    Stopped     4       0.082577          75          10    4.0          0.385970    35.426524
       11    Stopped     4       0.202352          65          10    4.0          0.543102    34.653495
       12    Stopped    10       0.335989          58          10   10.0          0.149558    90.924182
       13  Completed    10       0.789243          89          10   10.0          0.144887    77.365970
       14    Stopped     2       0.123379         176          10    2.0          0.899987    12.422906
       15    Stopped     2       0.137080         141          10    2.0          0.899983    13.395153
       16    Stopped     4       0.029140         116          10    4.0          0.900532    27.834111
       17    Stopped     2       0.033363         154          10    2.0          0.899996    13.407285
       18 InProgress     1       0.294430         210          10    1.0          0.899878     6.126259
       19 InProgress     0       0.102143         239          10      -                 -            -
2 trials running, 18 finished (3 until the end), 437.07s wallclock-time

INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.02846298236356246 --batch_size 115 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/20/checkpoints
INFO:syne_tune.tuner:(trial 20) - scheduled config {'learning_rate': 0.02846298236356246, 'batch_size': 115, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.037703019195187606 --batch_size 91 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/21/checkpoints
INFO:syne_tune.tuner:(trial 21) - scheduled config {'learning_rate': 0.037703019195187606, 'batch_size': 91, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.0741039859356903 --batch_size 192 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/22/checkpoints
INFO:syne_tune.tuner:(trial 22) - scheduled config {'learning_rate': 0.0741039859356903, 'batch_size': 192, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.3032613031191755 --batch_size 252 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/23/checkpoints
INFO:syne_tune.tuner:(trial 23) - scheduled config {'learning_rate': 0.3032613031191755, 'batch_size': 252, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.019823425532533637 --batch_size 252 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/24/checkpoints
INFO:syne_tune.tuner:(trial 24) - scheduled config {'learning_rate': 0.019823425532533637, 'batch_size': 252, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.8203370335228594 --batch_size 77 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/25/checkpoints
INFO:syne_tune.tuner:(trial 25) - scheduled config {'learning_rate': 0.8203370335228594, 'batch_size': 77, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.2960420911378594 --batch_size 104 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/26/checkpoints
INFO:syne_tune.tuner:(trial 26) - scheduled config {'learning_rate': 0.2960420911378594, 'batch_size': 104, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.2993874715754653 --batch_size 192 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/27/checkpoints
INFO:syne_tune.tuner:(trial 27) - scheduled config {'learning_rate': 0.2993874715754653, 'batch_size': 192, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.08056711961080017 --batch_size 36 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/28/checkpoints
INFO:syne_tune.tuner:(trial 28) - scheduled config {'learning_rate': 0.08056711961080017, 'batch_size': 36, 'max_epochs': 10}
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.26868380288030347 --batch_size 151 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/29/checkpoints
INFO:syne_tune.tuner:(trial 29) - scheduled config {'learning_rate': 0.26868380288030347, 'batch_size': 151, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 29 completed.
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.9197404791177789 --batch_size 66 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/30/checkpoints
INFO:syne_tune.tuner:(trial 30) - scheduled config {'learning_rate': 0.9197404791177789, 'batch_size': 66, 'max_epochs': 10}
INFO:syne_tune.stopping_criterion:reaching max wallclock time (720), stopping there.
INFO:syne_tune.tuner:Stopping trials that may still be running.
INFO:syne_tune.tuner:Tuning finished, results of trials can be found on /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046
--------------------
Resource summary (last result is reported):
 trial_id     status  iter  learning_rate  batch_size  max_epochs  epoch  validation_error  worker-time
        0    Stopped     4       0.100000         128          10      4          0.430578    29.093798
        1  Completed    10       0.446396         196          10     10          0.205652    72.747496
        2    Stopped     2       0.011548         254          10      2          0.900570    13.729115
        3    Stopped     8       0.149425         132          10      8          0.259171    58.980305
        4    Stopped     4       0.063172         242          10      4          0.900579    27.773950
        5  Completed    10       0.488018          41          10     10          0.140488   113.171314
        6    Stopped    10       0.590407         244          10     10          0.193776    70.364757
        7    Stopped     2       0.088129         148          10      2          0.899955    14.169738
        8    Stopped     2       0.012271         235          10      2          0.899840    13.434274
        9    Stopped     2       0.088457         236          10      2          0.899801    13.034437
       10    Stopped     4       0.082577          75          10      4          0.385970    35.426524
       11    Stopped     4       0.202352          65          10      4          0.543102    34.653495
       12    Stopped    10       0.335989          58          10     10          0.149558    90.924182
       13  Completed    10       0.789243          89          10     10          0.144887    77.365970
       14    Stopped     2       0.123379         176          10      2          0.899987    12.422906
       15    Stopped     2       0.137080         141          10      2          0.899983    13.395153
       16    Stopped     4       0.029140         116          10      4          0.900532    27.834111
       17    Stopped     2       0.033363         154          10      2          0.899996    13.407285
       18    Stopped     8       0.294430         210          10      8          0.241193    52.089688
       19    Stopped     2       0.102143         239          10      2          0.900002    12.487762
       20    Stopped     2       0.028463         115          10      2          0.899995    14.100359
       21    Stopped     2       0.037703          91          10      2          0.900026    14.664848
       22    Stopped     2       0.074104         192          10      2          0.901730    13.312770
       23    Stopped     2       0.303261         252          10      2          0.900009    12.725821
       24    Stopped     2       0.019823         252          10      2          0.899917    12.533380
       25    Stopped    10       0.820337          77          10     10          0.196842    81.816103
       26    Stopped    10       0.296042         104          10     10          0.198453    81.121330
       27    Stopped     4       0.299387         192          10      4          0.336183    24.610689
       28 InProgress     9       0.080567          36          10      9          0.203052   104.303746
       29  Completed    10       0.268684         151          10     10          0.222814    68.217289
       30 InProgress     1       0.919740          66          10      1          0.900037    10.070776
2 trials running, 29 finished (4 until the end), 723.70s wallclock-time

validation_error: best 0.1404876708984375 for trial-id 5
--------------------

Note that we are running a variant of ASHA where underperforming trials are stopped early. This is different from our implementation in Section 19.4.1, where each training job is started with a fixed max_epochs. In the latter case, a well-performing trial which reaches the full 10 epochs first needs to train for 1, then 2, then 4, then 8 epochs, each time starting from scratch. This kind of pause-and-resume scheduling can be implemented efficiently by checkpointing the training state after each epoch, but we avoid the extra complexity here.
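To switch from the stopping variant to pause-and-resume promotions, a sketch along the following lines could be used. The type argument is an assumption about the Syne Tune version used here, and efficient resumption additionally requires the backend to checkpoint the training state (compare the st_checkpoint_dir entries in the logs above).

# Hedged sketch: ASHA with pause-and-resume promotion instead of early
# stopping. The `type` argument is an assumption about this Syne Tune
# version; paused trials resume from checkpoints instead of restarting.
promotion_scheduler = ASHA(
    config_space,
    metric=metric,
    mode=mode,
    points_to_evaluate=[initial_config],
    max_resource_attr="max_epochs",
    resource_attr=resource_attr,
    grace_period=min_number_of_epochs,
    reduction_factor=eta,
    type="promotion",
)

After the experiment has finished, we can retrieve and plot the results.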

d2l.set_figsize()
e = load_experiment(tuner.name)
e.plot()
WARNING:matplotlib.legend:No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
../_images/output_sh-async_bb0ea6_13_1.svg

19.5.3. Visualize the Optimization Process

Once more, we plot the learning curves of every trial (each color in the plot represents a trial). Compare this to asynchronous random search in Section 19.3. As we have seen for successive halving in Section 19.4, most trials are stopped at 2 or 4 epochs (\(r_{\mathrm{min}}\) or \(\eta \cdot r_{\mathrm{min}}\)). However, trials do not stop at the same point in time, because they require different amounts of time per epoch. If we ran standard successive halving instead of ASHA, we would need to synchronize our workers before we could promote configurations to the next rung level.

d2l.set_figsize([6, 2.5])
results = e.results
for trial_id in results.trial_id.unique():
    df = results[results["trial_id"] == trial_id]
    d2l.plt.plot(
        df["st_tuner_time"],
        df["validation_error"],
        marker="o"
    )
d2l.plt.xlabel("wall-clock time")
d2l.plt.ylabel("objective function")
Text(0, 0.5, 'objective function')
../_images/output_sh-async_bb0ea6_15_1.svg

19.5.4. Summary

Compared to random search, successive halving is not quite as trivial to run in an asynchronous distributed setting. To avoid synchronization points, we promote configurations as quickly as possible to the next rung level, even if this means promoting some wrong ones. In practice, this usually does not hurt much, and the gains of asynchronous over synchronous scheduling are far larger than the losses from suboptimal promotion decisions.

Discussions