14.9. Semantic Segmentation and the Dataset


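When discussing object detection tasks in Sections 14.3–14.8, we used rectangular bounding boxes to label and predict objects in images. This section discusses the problem of *semantic segmentation*, which focuses on how to divide an image into regions belonging to different semantic classes. Unlike object detection, semantic segmentation recognizes and understands what is in an image at the pixel level: it labels and predicts semantic regions pixel by pixel. Fig. 14.9.1 shows the labels of the dog, cat, and background of an image in semantic segmentation. Compared with the bounding boxes of object detection, the pixel-level borders labeled in semantic segmentation are clearly more fine-grained.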

Fig. 14.9.1 Labels of the dog, cat, and background of the image in semantic segmentation.

14.9.1. Image Segmentation and Instance Segmentation

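There are also two important tasks in the field of computer vision that are similar to semantic segmentation, namely image segmentation and instance segmentation. Below, we briefly distinguish them from semantic segmentation.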

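*Image segmentation* divides an image into several constituent regions. Methods for this type of problem usually make use of the correlation between pixels in the image. They need no label information about image pixels during training, and they cannot guarantee that the segmented regions will have the semantics we hope to obtain during prediction. Taking the image in Fig. 14.9.1 as input, image segmentation may divide the dog into two regions: one covers the mouth and eyes, which are mainly black, and the other covers the rest of the body, which is mainly yellow.
*Instance segmentation* is also called *simultaneous detection and segmentation*. It studies how to recognize the pixel-level regions of each object instance in an image. Unlike semantic segmentation, instance segmentation needs to distinguish not only semantics but also different object instances. For example, if there are two dogs in the image, instance segmentation needs to distinguish which of the two dogs each pixel belongs to.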
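The difference between the two kinds of labels can be made concrete on a toy grid. The sketch below is only an illustration: the 4×6 scene is made up, and the value 12 is the position of 'dog' in the VOC class list defined later in this section.

```python
# A toy 4x6 scene with two dogs; 0 marks background pixels.
DOG = 12  # index of 'dog' in the VOC class list used later in this section

# Instance segmentation: each dog gets its own instance id (1 and 2).
instance_ids = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
]

# Semantic segmentation keeps only the class, so both dogs share one label.
semantic = [[DOG if v else 0 for v in row] for row in instance_ids]

num_instances = len({v for row in instance_ids for v in row} - {0})
num_classes = len({v for row in semantic for v in row} - {0})
print(num_instances, num_classes)  # 2 1
```

Collapsing the instance ids into a single class index is lossy: from the semantic map alone we can no longer tell how many dogs there are.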

14.9.2. The Pascal VOC2012 Semantic Segmentation Dataset

One of the most important semantic segmentation datasets is Pascal VOC2012. In the following, we take a look at this dataset.

pytorch mxnet

%matplotlib inline
import os
import torch
import torchvision
from d2l import torch as d2l

%matplotlib inline
import os
from mxnet import gluon, image, np, npx
from d2l import mxnet as d2l

npx.set_np()

The tar file of the dataset is about 2 GB, so it may take a while to download. The extracted dataset is located at ../data/VOCdevkit/VOC2012.

pytorch mxnet

#@save
d2l.DATA_HUB['voc2012'] = (d2l.DATA_URL + 'VOCtrainval_11-May-2012.tar',
                           '4e443f8a2eca6b1dac8a6c57641b67dd40621a49')

voc_dir = d2l.download_extract('voc2012', 'VOCdevkit/VOC2012')

Downloading ../data/VOCtrainval_11-May-2012.tar from http://d2l-data.s3-accelerate.amazonaws.com/VOCtrainval_11-May-2012.tar...

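After entering the path ../data/VOCdevkit/VOC2012, we can see the different components of the dataset. The ImageSets/Segmentation path contains text files that specify the training and test samples, while the JPEGImages and SegmentationClass paths store the input image and the label of each example, respectively. The label here is also in image format, with the same size as the input image it labels. Moreover, pixels with the same color in any label image belong to the same semantic class. The following defines the read_voc_images function to read all the input images and labels into memory.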

pytorch mxnet

#@save
def read_voc_images(voc_dir, is_train=True):
    """Read all VOC feature and label images."""
    txt_fname = os.path.join(voc_dir, 'ImageSets', 'Segmentation',
                             'train.txt' if is_train else 'val.txt')
    mode = torchvision.io.image.ImageReadMode.RGB
    with open(txt_fname, 'r') as f:
        images = f.read().split()
    features, labels = [], []
    for i, fname in enumerate(images):
        features.append(torchvision.io.read_image(os.path.join(
            voc_dir, 'JPEGImages', f'{fname}.jpg')))
        labels.append(torchvision.io.read_image(os.path.join(
            voc_dir, 'SegmentationClass', f'{fname}.png'), mode))
    return features, labels

train_features, train_labels = read_voc_images(voc_dir, True)

#@save
def read_voc_images(voc_dir, is_train=True):
    """Read all VOC feature and label images."""
    txt_fname = os.path.join(voc_dir, 'ImageSets', 'Segmentation',
                             'train.txt' if is_train else 'val.txt')
    with open(txt_fname, 'r') as f:
        images = f.read().split()
    features, labels = [], []
    for i, fname in enumerate(images):
        features.append(image.imread(os.path.join(
            voc_dir, 'JPEGImages', f'{fname}.jpg')))
        labels.append(image.imread(os.path.join(
            voc_dir, 'SegmentationClass', f'{fname}.png')))
    return features, labels

train_features, train_labels = read_voc_images(voc_dir, True)

[22:12:52] ../src/storage/storage.cc:196: Using Pooled (Naive) StorageManager for CPU
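read_voc_images loads every image into memory at once. As a lighter-weight way to inspect the directory layout described above, the following sketch only resolves the file paths of each (input image, label) pair. The helper name voc_sample_paths and the sample names in the demo are assumptions made for illustration; they are not part of d2l or of the dataset.

```python
import os
import tempfile

def voc_sample_paths(voc_dir, is_train=True):
    """Return (input image path, label path) pairs without reading pixels."""
    txt_fname = os.path.join(voc_dir, 'ImageSets', 'Segmentation',
                             'train.txt' if is_train else 'val.txt')
    with open(txt_fname, 'r') as f:
        names = f.read().split()
    return [(os.path.join(voc_dir, 'JPEGImages', f'{name}.jpg'),
             os.path.join(voc_dir, 'SegmentationClass', f'{name}.png'))
            for name in names]

# Demo on a throwaway directory that mimics the layout (hypothetical names).
demo_dir = tempfile.mkdtemp()
split_dir = os.path.join(demo_dir, 'ImageSets', 'Segmentation')
os.makedirs(split_dir)
with open(os.path.join(split_dir, 'train.txt'), 'w') as f:
    f.write('sample_a\nsample_b\n')

pairs = voc_sample_paths(demo_dir)
print(len(pairs))
print(os.path.basename(pairs[0][0]), os.path.basename(pairs[0][1]))
```

With the real dataset, passing voc_dir from the download step instead of demo_dir should list one pair per line of train.txt or val.txt.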

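We draw the first five input images and their labels. In the label images, white and black represent borders and background, respectively, while the other colors correspond to different classes.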

pytorch mxnet

n = 5
imgs = train_features[:n] + train_labels[:n]
imgs = [img.permute(1,2,0) for img in imgs]
d2l.show_images(imgs, 2, n);

../_images/output_semantic-segmentation-and-dataset_23ff18_30_0.png

n = 5
imgs = train_features[:n] + train_labels[:n]
d2l.show_images(imgs, 2, n);

../_images/output_semantic-segmentation-and-dataset_23ff18_33_0.png

Next, we enumerate the RGB color values and class names for all the labels in this dataset.

pytorch mxnet

#@save
VOC_COLORMAP = [[0, 0, 0], [128, 0, 0], [0, 128, 0], [128, 128, 0],
                [0, 0, 128], [128, 0, 128], [0, 128, 128], [128, 128, 128],
                [64, 0, 0], [192, 0, 0], [64, 128, 0], [192, 128, 0],
                [64, 0, 128], [192, 0, 128], [64, 128, 128], [192, 128, 128],
                [0, 64, 0], [128, 64, 0], [0, 192, 0], [128, 192, 0],
                [0, 64, 128]]

#@save
VOC_CLASSES = ['background', 'aeroplane', 'bicycle', 'bird', 'boat',
               'bottle', 'bus', 'car', 'cat', 'chair', 'cow',
               'diningtable', 'dog', 'horse', 'motorbike', 'person',
               'potted plant', 'sheep', 'sofa', 'train', 'tv/monitor']

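With the two constants defined above, we can conveniently find the class index of each pixel in a label. We define the voc_colormap2label function to build the mapping from the above RGB color values to class indices, and the voc_label_indices function to map any RGB values to their class indices in this Pascal VOC2012 dataset.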

pytorch mxnet

#@save
def voc_colormap2label():
    """Build the mapping from RGB to class indices for VOC labels."""
    colormap2label = torch.zeros(256 ** 3, dtype=torch.long)
    for i, colormap in enumerate(VOC_COLORMAP):
        colormap2label[
            (colormap[0] * 256 + colormap[1]) * 256 + colormap[2]] = i
    return colormap2label

#@save
def voc_label_indices(colormap, colormap2label):
    """Map any RGB values in VOC labels to their class indices."""
    colormap = colormap.permute(1, 2, 0).numpy().astype('int32')
    idx = ((colormap[:, :, 0] * 256 + colormap[:, :, 1]) * 256
           + colormap[:, :, 2])
    return colormap2label[idx]

#@save
def voc_colormap2label():
    """Build the mapping from RGB to class indices for VOC labels."""
    colormap2label = np.zeros(256 ** 3)
    for i, colormap in enumerate(VOC_COLORMAP):
        colormap2label[
            (colormap[0] * 256 + colormap[1]) * 256 + colormap[2]] = i
    return colormap2label

#@save
def voc_label_indices(colormap, colormap2label):
    """Map any RGB values in VOC labels to their class indices."""
    colormap = colormap.astype(np.int32)
    idx = ((colormap[:, :, 0] * 256 + colormap[:, :, 1]) * 256
           + colormap[:, :, 2])
    return colormap2label[idx]
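The lookup table above works because each RGB triple is packed into a single integer in base 256, which then serves as an index. The following sketch isolates that arithmetic in plain Python. A dict stands in for the 256 ** 3 element table, only the first three colormap entries are used, and the color (224, 224, 192) is an assumed example of a color that is not in the colormap.

```python
def rgb_to_key(r, g, b):
    """Pack an RGB triple into one integer, as done in voc_colormap2label."""
    return (r * 256 + g) * 256 + b

def key_to_rgb(key):
    """Invert rgb_to_key to show that the packing loses no information."""
    return key // 65536, (key // 256) % 256, key % 256

# First three entries of VOC_COLORMAP: background, aeroplane, bicycle.
colormap_head = [[0, 0, 0], [128, 0, 0], [0, 128, 0]]
lookup = {rgb_to_key(*color): i for i, color in enumerate(colormap_head)}

print(rgb_to_key(128, 0, 0))               # 8388608
print(key_to_rgb(rgb_to_key(64, 128, 0)))  # (64, 128, 0)
# A color missing from the colormap falls back to 0, mirroring the
# zero-initialized table in voc_colormap2label.
print(lookup.get(rgb_to_key(224, 224, 192), 0))  # 0
```

One consequence of the zero-initialized table is that any color not listed in VOC_COLORMAP is mapped to class index 0, the same index as the background.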

For example, in the first example image, the class index for the front part of the airplane is 1, while the background index is 0.

pytorch mxnet

y = voc_label_indices(train_labels[0], voc_colormap2label())
y[105:115, 130:140], VOC_CLASSES[1]

(tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
         [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
         [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
         [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]]),
 'aeroplane')

y = voc_label_indices(train_labels[0], voc_colormap2label())
y[105:115, 130:140], VOC_CLASSES[1]

(array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
        [0., 0., 0., 0., 0., 0., 0., 1., 1., 1.],
        [0., 0., 0., 0., 0., 0., 1., 1., 1., 1.],
        [0., 0., 0., 0., 0., 1., 1., 1., 1., 1.],
        [0., 0., 0., 0., 0., 1., 1., 1., 1., 1.],
        [0., 0., 0., 0., 1., 1., 1., 1., 1., 1.],
        [0., 0., 0., 0., 0., 1., 1., 1., 1., 1.],
        [0., 0., 0., 0., 0., 1., 1., 1., 1., 1.],
        [0., 0., 0., 0., 0., 0., 1., 1., 1., 1.],
        [0., 0., 0., 0., 0., 0., 0., 0., 1., 1.]]),
 'aeroplane')
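Once a label is a map of class indices, summarizing it is a matter of counting. As a small sketch, the code below copies the 10×10 patch printed above into a nested list and counts the pixels per class with the standard library. The names dictionary covers only the two classes that occur in this patch.

```python
from collections import Counter

# The 10x10 patch y[105:115, 130:140] printed above.
patch = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
]
names = {0: 'background', 1: 'aeroplane'}

# Count how many pixels carry each class index.
counts = Counter(v for row in patch for v in row)
for idx, n in sorted(counts.items()):
    print(names[idx], n)
# background 60
# aeroplane 40
```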

14.9.2.1. Data Preprocessing

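In previous experiments, such as those in Sections 8.1–8.4, images were rescaled to fit the input shape required by the model. In semantic segmentation, however, doing so requires rescaling the predicted pixel classes back to the original shape of the input image. Such rescaling may be inaccurate, especially for segmented regions with different classes. To avoid this issue, we crop the image to a *fixed* shape instead of rescaling it. Specifically, using random cropping from image augmentation, we crop the same area of the input image and the label.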

pytorch mxnet

#@save
def voc_rand_crop(feature, label, height, width):
    """Randomly crop both feature and label images."""
    rect = torchvision.transforms.RandomCrop.get_params(
        feature, (height, width))
    feature = torchvision.transforms.functional.crop(feature, *rect)
    label = torchvision.transforms.functional.crop(label, *rect)
    return feature, label

imgs = []
for _ in range(n):
    imgs += voc_rand_crop(train_features[0], train_labels[0], 200, 300)

imgs = [img.permute(1, 2, 0) for img in imgs]
d2l.show_images(imgs[::2] + imgs[1::2], 2, n);

../_images/output_semantic-segmentation-and-dataset_23ff18_66_0.png

#@save
def voc_rand_crop(feature, label, height, width):
    """Randomly crop both feature and label images."""
    feature, rect = image.random_crop(feature, (width, height))
    label = image.fixed_crop(label, *rect)
    return feature, label

imgs = []
for _ in range(n):
    imgs += voc_rand_crop(train_features[0], train_labels[0], 200, 300)
d2l.show_images(imgs[::2] + imgs[1::2], 2, n);

../_images/output_semantic-segmentation-and-dataset_23ff18_69_0.png
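One way to see why rescaling is risky for labels is to upsample a single row of class indices. The sketch below is a simplified one-dimensional illustration, not the resizing code of any library: linear interpolation blends neighboring indices and can produce a class that was never present, while nearest-neighbor copying keeps valid classes but can only place boundaries at coarse positions. Cropping avoids both effects because it copies pixels unchanged.

```python
row = [0, 0, 2, 2]  # class indices: 0 = background, 2 = bicycle

def upsample_linear(values, factor):
    """Linearly interpolate between neighboring samples."""
    out = []
    for i in range(len(values) - 1):
        for k in range(factor):
            t = k / factor
            out.append((1 - t) * values[i] + t * values[i + 1])
    out.append(float(values[-1]))
    return out

def upsample_nearest(values, factor):
    """Repeat every sample `factor` times."""
    return [v for v in values for _ in range(factor)]

blended = [round(v) for v in upsample_linear(row, 2)]
print(blended)                   # [0, 0, 0, 1, 2, 2, 2]
print(upsample_nearest(row, 2))  # [0, 0, 0, 0, 2, 2, 2, 2]
```

The blended row contains index 1 (aeroplane) at the boundary, although no aeroplane pixel existed in the original row.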

14.9.2.2. Custom Semantic Segmentation Dataset Class

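We define a custom semantic segmentation dataset class VOCSegDataset by inheriting the Dataset class provided by high-level APIs. By implementing the __getitem__ function, we can arbitrarily access the input image indexed as idx in the dataset and the class index of each pixel in this image. Since some images in the dataset are smaller than the output size of random cropping, these examples are filtered out by a custom filter function. In addition, we define the normalize_image function to standardize the values of the three RGB channels of the input images.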

pytorch mxnet

#@save
class VOCSegDataset(torch.utils.data.Dataset):
    """A customized dataset to load the VOC dataset."""

    def __init__(self, is_train, crop_size, voc_dir):
        self.transform = torchvision.transforms.Normalize(
            mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        self.crop_size = crop_size
        features, labels = read_voc_images(voc_dir, is_train=is_train)
        self.features = [self.normalize_image(feature)
                         for feature in self.filter(features)]
        self.labels = self.filter(labels)
        self.colormap2label = voc_colormap2label()
        print('read ' + str(len(self.features)) + ' examples')

    def normalize_image(self, img):
        return self.transform(img.float() / 255)

    def filter(self, imgs):
        return [img for img in imgs if (
            img.shape[1] >= self.crop_size[0] and
            img.shape[2] >= self.crop_size[1])]

    def __getitem__(self, idx):
        feature, label = voc_rand_crop(self.features[idx], self.labels[idx],
                                       *self.crop_size)
        return (feature, voc_label_indices(label, self.colormap2label))

    def __len__(self):
        return len(self.features)

#@save
class VOCSegDataset(gluon.data.Dataset):
    """A customized dataset to load the VOC dataset."""
    def __init__(self, is_train, crop_size, voc_dir):
        self.rgb_mean = np.array([0.485, 0.456, 0.406])
        self.rgb_std = np.array([0.229, 0.224, 0.225])
        self.crop_size = crop_size
        features, labels = read_voc_images(voc_dir, is_train=is_train)
        self.features = [self.normalize_image(feature)
                         for feature in self.filter(features)]
        self.labels = self.filter(labels)
        self.colormap2label = voc_colormap2label()
        print('read ' + str(len(self.features)) + ' examples')

    def normalize_image(self, img):
        return (img.astype('float32') / 255 - self.rgb_mean) / self.rgb_std

    def filter(self, imgs):
        return [img for img in imgs if (
            img.shape[0] >= self.crop_size[0] and
            img.shape[1] >= self.crop_size[1])]

    def __getitem__(self, idx):
        feature, label = voc_rand_crop(self.features[idx], self.labels[idx],
                                       *self.crop_size)
        return (feature.transpose(2, 0, 1),
                voc_label_indices(label, self.colormap2label))

    def __len__(self):
        return len(self.features)
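The two helper steps of the class, filtering by size and per-channel standardization, can be checked on plain numbers. The sketch below is framework-free. The function names fits and normalize_pixel and the four image sizes are hypothetical, while the mean and standard deviation values are the ones used in the class above.

```python
RGB_MEAN = [0.485, 0.456, 0.406]
RGB_STD = [0.229, 0.224, 0.225]

def fits(height, width, crop_size):
    """Keep an image only if a crop of crop_size (height, width) fits in it."""
    return height >= crop_size[0] and width >= crop_size[1]

def normalize_pixel(rgb):
    """Scale 0-255 values to [0, 1], then standardize each channel."""
    return [(v / 255 - m) / s for v, m, s in zip(rgb, RGB_MEAN, RGB_STD)]

# Hypothetical (height, width) sizes; two are too small for a 320x480 crop.
sizes = [(281, 500), (375, 500), (500, 334), (320, 480)]
kept = [size for size in sizes if fits(*size, (320, 480))]
print(kept)  # [(375, 500), (320, 480)]

# A black pixel maps to -mean / std in every channel.
print(normalize_pixel([0, 0, 0]))
```

Note that the PyTorch class reads the height and width from img.shape[1] and img.shape[2] because its tensors are channel-first, whereas the MXNet class uses img.shape[0] and img.shape[1] because its arrays are channel-last.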

14.9.2.3. Reading the Dataset

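We use the custom VOCSegDataset class to create instances of the training set and the test set, respectively. Suppose that we specify the output shape of randomly cropped images as \(320\times 480\). Below we can view the number of examples retained in the training set and the test set.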

pytorch mxnet

crop_size = (320, 480)
voc_train = VOCSegDataset(True, crop_size, voc_dir)
voc_test = VOCSegDataset(False, crop_size, voc_dir)

read 1114 examples
read 1078 examples

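Setting the batch size to 64, we define the data iterator for the training set. Let's print the shape of the first minibatch. Unlike in image classification or object detection, the labels here are three-dimensional tensors.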

pytorch mxnet

batch_size = 64
train_iter = torch.utils.data.DataLoader(voc_train, batch_size, shuffle=True,
                                    drop_last=True,
                                    num_workers=d2l.get_dataloader_workers())
for X, Y in train_iter:
    print(X.shape)
    print(Y.shape)
    break

torch.Size([64, 3, 320, 480])
torch.Size([64, 320, 480])

batch_size = 64
train_iter = gluon.data.DataLoader(voc_train, batch_size, shuffle=True,
                                   last_batch='discard',
                                   num_workers=d2l.get_dataloader_workers())
for X, Y in train_iter:
    print(X.shape)
    print(Y.shape)
    break

(64, 3, 320, 480)
(64, 320, 480)
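The shapes above also determine how much data one epoch covers. As a quick arithmetic sketch (the helper num_batches is hypothetical and only mirrors the effect of discarding the last incomplete batch), we can count the minibatches per epoch and the number of labeled pixels in each minibatch.

```python
def num_batches(num_examples, batch_size, drop_last=True):
    """Number of minibatches per epoch, with or without the partial batch."""
    full, remainder = divmod(num_examples, batch_size)
    return full if drop_last or remainder == 0 else full + 1

print(num_batches(1114, 64))                   # 17
print(num_batches(1114, 64, drop_last=False))  # 18
print(num_batches(1078, 64))                   # 16

# Each label in a minibatch holds one class index per pixel.
labels_per_batch = 64 * 320 * 480
print(labels_per_batch)  # 9830400
```

With 1114 retained training examples, 26 examples are left out of each epoch when the incomplete batch is discarded; shuffling changes which ones these are from epoch to epoch.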

14.9.2.4. Putting It All Together

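Finally, we define the following load_data_voc function to download and read the Pascal VOC2012 semantic segmentation dataset. It returns data iterators for both the training and test datasets.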

pytorch mxnet

#@save
def load_data_voc(batch_size, crop_size):
    """Load the VOC semantic segmentation dataset."""
    voc_dir = d2l.download_extract('voc2012', os.path.join(
        'VOCdevkit', 'VOC2012'))
    num_workers = d2l.get_dataloader_workers()
    train_iter = torch.utils.data.DataLoader(
        VOCSegDataset(True, crop_size, voc_dir), batch_size,
        shuffle=True, drop_last=True, num_workers=num_workers)
    test_iter = torch.utils.data.DataLoader(
        VOCSegDataset(False, crop_size, voc_dir), batch_size,
        drop_last=True, num_workers=num_workers)
    return train_iter, test_iter

#@save
def load_data_voc(batch_size, crop_size):
    """Load the VOC semantic segmentation dataset."""
    voc_dir = d2l.download_extract('voc2012', os.path.join(
        'VOCdevkit', 'VOC2012'))
    num_workers = d2l.get_dataloader_workers()
    train_iter = gluon.data.DataLoader(
        VOCSegDataset(True, crop_size, voc_dir), batch_size,
        shuffle=True, last_batch='discard', num_workers=num_workers)
    test_iter = gluon.data.DataLoader(
        VOCSegDataset(False, crop_size, voc_dir), batch_size,
        last_batch='discard', num_workers=num_workers)
    return train_iter, test_iter

14.9.3. Summary

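Semantic segmentation recognizes and understands what is in an image at the pixel level by dividing the image into regions belonging to different semantic classes.
One of the most important semantic segmentation datasets is Pascal VOC2012.
In semantic segmentation, since the input image and the label correspond one-to-one at the pixel level, the input image is randomly cropped to a fixed shape rather than rescaled.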

14.9.4. Exercises

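How can semantic segmentation be applied in autonomous driving and medical image diagnostics? Can you think of any other applications?
Recall the description of data augmentation in Section 14.1. Which of the image augmentation methods used in image classification would be infeasible in semantic segmentation?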
