4.5. Concise Implementation of Softmax Regression

Just as high-level deep learning frameworks made it easier to implement linear regression (see Section 3.5), they are similarly convenient here.

import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

from mxnet import gluon, init, npx
from mxnet.gluon import nn
from d2l import mxnet as d2l

npx.set_np()

from functools import partial
import jax
import optax
from flax import linen as nn
from jax import numpy as jnp
from d2l import jax as d2l

import tensorflow as tf
from d2l import tensorflow as d2l

4.5.1. Defining the Model

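As in Section 3.5, we construct our fully connected layer using the built-in layer. The built-in __call__ method then invokes forward whenever we need to apply the network to some input.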

The model is defined in turn for PyTorch, MXNet, JAX, and TensorFlow.

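We use a Flatten layer to convert the fourth-order tensor X to a second-order tensor, keeping the dimensionality along the first axis unchanged.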

class SoftmaxRegression(d2l.Classifier):  #@save
    """The softmax regression model."""
    def __init__(self, num_outputs, lr):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(nn.Flatten(),
                                 nn.LazyLinear(num_outputs))

    def forward(self, X):
        return self.net(X)

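Even though the input X is a fourth-order tensor, the built-in Dense layer automatically converts X into a second-order tensor, keeping the dimensionality along the first axis unchanged.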

class SoftmaxRegression(d2l.Classifier):  #@save
    """The softmax regression model."""
    def __init__(self, num_outputs, lr):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Dense(num_outputs)
        self.net.initialize()
    def forward(self, X):
        return self.net(X)

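Flax allows users to write the network class in a more compact way using the @nn.compact decorator. With @nn.compact, one can simply write all network logic inside a single "forward pass" method, without needing to define the standard setup method in the dataclass.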

class SoftmaxRegression(d2l.Classifier):  #@save
    num_outputs: int
    lr: float

    @nn.compact
    def __call__(self, X):
        X = X.reshape((X.shape[0], -1))  # Flatten
        X = nn.Dense(self.num_outputs)(X)
        return X

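We use a Flatten layer to convert the fourth-order tensor X to a second-order tensor, keeping the dimension along the first axis unchanged.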

class SoftmaxRegression(d2l.Classifier):  #@save
    """The softmax regression model."""
    def __init__(self, num_outputs, lr):
        super().__init__()
        self.save_hyperparameters()
        self.net = tf.keras.models.Sequential()
        self.net.add(tf.keras.layers.Flatten())
        self.net.add(tf.keras.layers.Dense(num_outputs))

    def forward(self, X):
        return self.net(X)
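As a rough shape check for the models above, the following sketch works through what flattening and the dense layer do to one minibatch. It is plain Python rather than any of the four frameworks. The input shape (256, 1, 28, 28) is an assumption (single-channel 28x28 images in minibatches of 256), and the helper names flattened_shape and dense_param_count are hypothetical, introduced only for illustration.

```python
import math

def flattened_shape(shape):
    """Collapse every axis after the first into one, as a Flatten layer does."""
    return (shape[0], math.prod(shape[1:]))

def dense_param_count(num_inputs, num_outputs):
    """Number of weights plus biases in one fully connected layer."""
    return num_inputs * num_outputs + num_outputs

# Assumed input: a minibatch of 256 single-channel 28x28 images.
batch_shape = (256, 1, 28, 28)
flat_shape = flattened_shape(batch_shape)
num_params = dense_param_count(flat_shape[1], num_outputs=10)
print(flat_shape, num_params)
```

Under these assumptions each example becomes a 784-dimensional vector, and the single dense layer holds 784 * 10 weights plus 10 biases.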

4.5.2. Softmax Revisited

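In Section 4.4 we calculated our model's output and then applied the cross-entropy loss. While this is perfectly reasonable mathematically, it is risky computationally, because the exponentiation can underflow or overflow.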

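Recall that the softmax function computes probabilities via \(\hat y_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}\). If some of the \(o_k\) are very large, i.e., very positive, then \(\exp(o_k)\) might be larger than the largest number we can represent in certain data types. This is called overflow. Likewise, if every argument is a very large negative number, we will get underflow. For instance, single-precision floating-point numbers approximately cover the range of \(10^{-38}\) to \(10^{38}\). As such, if the largest term in \(\mathbf{o}\) lies outside the interval \([-90, 90]\), the result will not be stable. A way round this problem is to subtract \(\bar{o} \stackrel{\textrm{def}}{=} \max_k o_k\) from all entries: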

\[\hat y_j = \frac{\exp o_j}{\sum_k \exp o_k} = \frac{\exp(o_j - \bar{o}) \exp \bar{o}}{\sum_k \exp (o_k - \bar{o}) \exp \bar{o}} = \frac{\exp(o_j - \bar{o})}{\sum_k \exp (o_k - \bar{o})}. \tag{4.5.1}\]

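By construction we know that \(o_j - \bar{o} \leq 0\) for all \(j\). As such, for a \(q\)-class classification problem, the denominator is contained in the interval \([1, q]\). Moreover, the numerator never exceeds \(1\), thus preventing numerical overflow. Numerical underflow only occurs when \(\exp(o_j - \bar{o})\) numerically evaluates as \(0\). Nonetheless, a few steps down the road we might find ourselves in trouble when we want to compute \(\log \hat{y}_j\) as \(\log 0\). In particular, in backpropagation, we might find ourselves faced with a screenful of the dreaded NaN (Not a Number) results.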
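The behavior described above can be reproduced in a few lines. The sketch below is plain Python, not the implementation used by any of the frameworks, and the function names are hypothetical. Python floats are double precision, so the logits are chosen larger than the single-precision examples in the text in order to trigger the same effects.

```python
import math

def naive_softmax(o):
    """Direct transcription of the definition; exp may overflow."""
    exps = [math.exp(v) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(o):
    """Subtract the largest logit first, as in (4.5.1)."""
    o_bar = max(o)
    exps = [math.exp(v - o_bar) for v in o]
    total = sum(exps)  # lies in [1, len(o)]
    return [e / total for e in exps]

# The naive version overflows on large logits.
try:
    naive_softmax([1000.0, 1000.0])
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

# The shifted version handles the same input without overflow.
probs = stable_softmax([1000.0, 1000.0])

# Underflow is still possible: the second probability evaluates to 0.0,
# and taking its logarithm afterwards fails.
underflowed = stable_softmax([0.0, -1000.0])
try:
    math.log(underflowed[1])
    log_failed = False
except ValueError:
    log_failed = True

print(naive_overflowed, probs, underflowed, log_failed)
```

Subtracting the maximum removes the overflow, but the last step shows why computing the probabilities first and the logarithm second can still fail.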

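Fortunately, we are saved by the fact that even though we are computing exponential functions, we ultimately intend to take their logarithm (when calculating the cross-entropy loss). By combining softmax and cross-entropy, we can escape the numerical stability issues altogether. We have: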

\[\log \hat{y}_j = \log \frac{\exp(o_j - \bar{o})}{\sum_k \exp (o_k - \bar{o})} = o_j - \bar{o} - \log \sum_k \exp (o_k - \bar{o}). \tag{4.5.2}\]

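This avoids both overflow and underflow. We will want to keep the conventional softmax function handy in case we ever want to evaluate the output probabilities of our model. But instead of passing softmax probabilities into our new loss function, we pass the logits directly and compute the softmax and its logarithm all at once inside the cross-entropy loss function, which does smart things like the "LogSumExp trick".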
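To make the identity in (4.5.2) concrete, here is a minimal plain-Python sketch that computes the log-probabilities directly from the logits. It only illustrates the idea and is not the code the frameworks' loss functions actually run; the names log_softmax and cross_entropy_from_logits are hypothetical.

```python
import math

def log_softmax(o):
    """Compute log softmax via (4.5.2), never forming the probabilities."""
    o_bar = max(o)
    # The sum contains exp(0) = 1 for the largest logit, so the log is safe.
    log_sum_exp = math.log(sum(math.exp(v - o_bar) for v in o))
    return [v - o_bar - log_sum_exp for v in o]

def cross_entropy_from_logits(o, label):
    """Negative log-probability of the true class, computed from logits."""
    return -log_softmax(o)[label]

# Logits for which exp(o_j - o_bar) underflows to 0 for the second class.
logits = [0.0, -1000.0]
log_probs = log_softmax(logits)
loss_value = cross_entropy_from_logits(logits, label=1)
print(log_probs, loss_value)
```

Because the logarithm is only ever applied to a sum that is at least 1, the loss stays finite even when the probability of the true class would have underflowed to zero.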

The corresponding loss is defined in turn for PyTorch, MXNet, JAX, and TensorFlow.

@d2l.add_to_class(d2l.Classifier)  #@save
def loss(self, Y_hat, Y, averaged=True):
    Y_hat = Y_hat.reshape((-1, Y_hat.shape[-1]))
    Y = Y.reshape((-1,))
    return F.cross_entropy(
        Y_hat, Y, reduction='mean' if averaged else 'none')

@d2l.add_to_class(d2l.Classifier)  #@save
def loss(self, Y_hat, Y, averaged=True):
    Y_hat = Y_hat.reshape((-1, Y_hat.shape[-1]))
    Y = Y.reshape((-1,))
    fn = gluon.loss.SoftmaxCrossEntropyLoss()
    l = fn(Y_hat, Y)
    return l.mean() if averaged else l

@d2l.add_to_class(d2l.Classifier)  #@save
@partial(jax.jit, static_argnums=(0, 5))
def loss(self, params, X, Y, state, averaged=True):
    # To be used later (e.g., for batch norm)
    Y_hat = state.apply_fn({'params': params}, *X,
                           mutable=False, rngs=None)
    Y_hat = Y_hat.reshape((-1, Y_hat.shape[-1]))
    Y = Y.reshape((-1,))
    fn = optax.softmax_cross_entropy_with_integer_labels
    # The returned empty dictionary is a placeholder for auxiliary data,
    # which will be used later (e.g., for batch norm)
    return (fn(Y_hat, Y).mean(), {}) if averaged else (fn(Y_hat, Y), {})

@d2l.add_to_class(d2l.Classifier)  #@save
def loss(self, Y_hat, Y, averaged=True):
    Y_hat = tf.reshape(Y_hat, (-1, Y_hat.shape[-1]))
    Y = tf.reshape(Y, (-1,))
    fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    return fn(Y, Y_hat)

4.5.3. Training

Next we train our model. We use Fashion-MNIST images, flattened to 784-dimensional feature vectors.

The training code is the same for PyTorch, MXNet, JAX, and TensorFlow.

data = d2l.FashionMNIST(batch_size=256)
model = SoftmaxRegression(num_outputs=10, lr=0.1)
trainer = d2l.Trainer(max_epochs=10)
trainer.fit(model, data)

[Figure: plot of training progress produced by trainer.fit(model, data)]

As before, this algorithm converges to a fairly accurate solution, albeit this time with fewer lines of code than before.

4.5.4. Summary

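High-level APIs are very convenient at hiding potentially dangerous aspects from their users, such as numerical stability. Moreover, they allow users to design models concisely with very few lines of code. This is both a blessing and a curse. The obvious benefit is that it makes things highly accessible, even to engineers who never took a single class of statistics in their life (in fact, they are part of the target audience of this book). But hiding the sharp edges also comes with a price: a disincentive to add new and different components on your own, since there is little muscle memory for doing so. Moreover, it makes it more difficult to fix things whenever the protective padding of a framework fails to cover all the corner cases. Again, this is due to lack of familiarity.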

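As such, we strongly urge you to review both the bare-bones and the elegant versions of many of the implementations that follow. While we emphasize ease of understanding, the implementations are nonetheless usually quite performant (convolutions are the big exception here). It is our intention to allow you to build on these when you invent something new that no framework can give you.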

4.5.5. Exercises

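1. Deep learning uses many different number formats, including FP64 double precision (used extremely rarely), FP32 single precision, BFLOAT16 (good for compressed representations), FP16 (very unstable), TF32 (a new format from NVIDIA), and INT8. Compute the smallest and largest argument of the exponential function for which the result does not lead to numerical underflow or overflow.
2. INT8 is a very limited format consisting of nonzero numbers from \(1\) to \(255\). How could you extend its dynamic range without using more bits? Do standard multiplication and addition still work?
3. Increase the number of epochs for training. Why might the validation accuracy decrease after a while? How could we fix this?
4. What happens as you increase the learning rate? Compare the loss curves for several learning rates. Which one works better? When?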
