15.3. The Dataset for Pretraining Word Embeddings

Now that we know the technical details of the word2vec models and approximate training methods, let us walk through their implementations. Specifically, we will take the skip-gram model in :numref:`sec_word2vec` and negative sampling in :numref:`sec_approx-train` as an example. In this section, we begin with the dataset for pretraining the word embedding model: the original format of the data will be transformed into minibatches that can be iterated over during training.

import collections
import math
import os
import random
import torch
from d2l import torch as d2l

15.3.1. Reading the Dataset

The dataset that we use here is the Penn Tree Bank (PTB). This corpus is sampled from Wall Street Journal articles and is split into training, validation, and test sets. In the original format, each line of the text file represents a sentence of words separated by spaces. Here we treat each word as a token.

#@save
d2l.DATA_HUB['ptb'] = (d2l.DATA_URL + 'ptb.zip',
                       '319d85e578af0cdc590547f26231e4e31cdf1e42')

#@save
def read_ptb():
    """Load the PTB dataset into a list of text lines."""
    data_dir = d2l.download_extract('ptb')
    # Read the training set
    with open(os.path.join(data_dir, 'ptb.train.txt')) as f:
        raw_text = f.read()
    return [line.split() for line in raw_text.split('\n')]

sentences = read_ptb()
f'# sentences: {len(sentences)}'
Downloading ../data/ptb.zip from http://d2l-data.s3-accelerate.amazonaws.com/ptb.zip...
'# sentences: 42069'
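
As a quick sanity check of the raw format (this peek is our own addition, not part of the original notebook), we can print one tokenized sentence; each element of sentences is simply a list of word strings.

# Peek at one tokenized line of the training set (output omitted)
print(sentences[5])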

After reading the training set, we build a vocabulary for the corpus, where any word that appears less than 10 times is replaced by the "<unk>" token. Note that the original dataset also contains "<unk>" tokens that represent rare (unknown) words.

vocab = d2l.Vocab(sentences, min_freq=10)
f'vocab size: {len(vocab)}'
'vocab size: 6719'
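
To get a feel for the token-index mapping, here is a small check of our own, relying only on the d2l.Vocab interface already used in this section (vocab[token], vocab.to_tokens, and vocab.unk); index 0 is reserved for the unknown token.

# Index 0 is the unknown token '<unk>'
print(vocab.unk, vocab.to_tokens(vocab.unk))
# A token maps to an index and back
idx = vocab['the']
print(idx, vocab.to_tokens(idx))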

15.3.2. Subsampling

Text data typically have high-frequency words such as "the", "a", and "in": they may even occur billions of times in very large corpora. However, these words often co-occur with many different words in context windows, providing little useful signal. For instance, consider the word "chip" in a context window: intuitively its co-occurrence with a low-frequency word "intel" is more useful in training than its co-occurrence with a high-frequency word "a". Moreover, training with vast amounts of (high-frequency) words is slow. Thus, when training word embedding models, high-frequency words can be *subsampled* :cite:`Mikolov.Sutskever.Chen.ea.2013`. Specifically, each indexed word \(w_i\) in the dataset will be discarded with probability

(15.3.1)\[P(w_i) = \max\left(1 - \sqrt{\frac{t}{f(w_i)}}, 0\right),\]

where \(f(w_i)\) is the ratio of the number of words \(w_i\) to the total number of words in the dataset, and the constant \(t\) is a hyperparameter (\(10^{-4}\) in the experiment). We can see that only when the relative frequency \(f(w_i) > t\) can the (high-frequency) word \(w_i\) be discarded, and the higher the relative frequency of the word, the greater the probability of being discarded.
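
To make the formula concrete, here is a small sketch of our own; the relative frequencies are illustrative values, with 0.057 chosen to roughly match "the" in the PTB training set:

import math

t = 1e-4
for f in (0.057, 0.001, 5e-5):  # hypothetical relative frequencies f(w)
    p_drop = max(1 - math.sqrt(t / f), 0)
    print(f'f(w)={f}: P(drop)={p_drop:.3f}, P(keep)={1 - p_drop:.3f}')

A word with relative frequency 0.057 is kept only about 4.2% of the time, one with frequency 0.001 about 32% of the time, and any word with \(f(w) \leq t\) is always kept. This agrees with the counts of "the" before and after subsampling shown below.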

#@save
def subsample(sentences, vocab):
    """Subsample high-frequency words."""
    # Exclude unknown tokens ('<unk>')
    sentences = [[token for token in line if vocab[token] != vocab.unk]
                 for line in sentences]
    counter = collections.Counter([
        token for line in sentences for token in line])
    num_tokens = sum(counter.values())

    # Return True if `token` is kept during subsampling
    def keep(token):
        return (random.uniform(0, 1) <
                math.sqrt(1e-4 / counter[token] * num_tokens))

    return ([[token for token in line if keep(token)] for line in sentences],
            counter)

subsampled, counter = subsample(sentences, vocab)

The following code snippet plots the histogram of the number of tokens per sentence before and after subsampling. As expected, subsampling significantly shortens sentences by dropping high-frequency words, which will lead to training speedup.

d2l.show_list_len_pair_hist(['origin', 'subsampled'], '# tokens per sentence',
                            'count', sentences, subsampled);

For individual tokens, the sampling rate of the high-frequency word "the" is less than 1/20.

def compare_counts(token):
    return (f'# of "{token}": '
            f'before={sum([l.count(token) for l in sentences])}, '
            f'after={sum([l.count(token) for l in subsampled])}')

compare_counts('the')
'# of "the": before=50770, after=2010'

In contrast, the low-frequency word "join" is completely kept.

compare_counts('join')
'# of "join": before=45, after=45'

After subsampling, we map tokens to their indices for the corpus. Note that a sentence may become empty (like the first one below) once all of its tokens have been dropped.

corpus = [vocab[line] for line in subsampled]
corpus[:3]
[[], [4127, 3228, 1773], [3922, 1922, 4743, 2696]]

15.3.3. Extracting Center Words and Context Words

The following get_centers_and_contexts function extracts all the center words and their context words from corpus. It uniformly samples an integer between 1 and max_window_size at random as the context window size. For any center word, those words whose distance from it does not exceed the sampled context window size are its context words.

#@save
def get_centers_and_contexts(corpus, max_window_size):
    """Return center words and context words in skip-gram."""
    centers, contexts = [], []
    for line in corpus:
        # To form a "center word--context word" pair, each sentence needs to
        # have at least 2 words
        if len(line) < 2:
            continue
        centers += line
        for i in range(len(line)):  # Context window centered at `i`
            window_size = random.randint(1, max_window_size)
            indices = list(range(max(0, i - window_size),
                                 min(len(line), i + 1 + window_size)))
            # Exclude the center word from the context words
            indices.remove(i)
            contexts.append([line[idx] for idx in indices])
    return centers, contexts

Next, we create an artificial dataset containing two sentences of 7 and 3 words, respectively. Let the maximum context window size be 2 and print all the center words and their context words.

tiny_dataset = [list(range(7)), list(range(7, 10))]
print('dataset', tiny_dataset)
for center, context in zip(*get_centers_and_contexts(tiny_dataset, 2)):
    print('center', center, 'has contexts', context)
dataset [[0, 1, 2, 3, 4, 5, 6], [7, 8, 9]]
center 0 has contexts [1]
center 1 has contexts [0, 2]
center 2 has contexts [0, 1, 3, 4]
center 3 has contexts [1, 2, 4, 5]
center 4 has contexts [2, 3, 5, 6]
center 5 has contexts [3, 4, 6]
center 6 has contexts [5]
center 7 has contexts [8, 9]
center 8 has contexts [7, 9]
center 9 has contexts [7, 8]

When training on the PTB dataset, we set the maximum context window size to 5. The following extracts all the center words and their context words in the dataset.

all_centers, all_contexts = get_centers_and_contexts(corpus, 5)
f'# center-context pairs: {sum([len(contexts) for contexts in all_contexts])}'
'# center-context pairs: 1503420'

15.3.4. Negative Sampling

We use negative sampling for approximate training. To sample noise words according to a predefined distribution, we define the following RandomGenerator class, where the (possibly unnormalized) sampling distribution is passed via the argument sampling_weights.

#@save
class RandomGenerator:
    """Randomly draw among {1, ..., n} according to n sampling weights."""
    def __init__(self, sampling_weights):
        # Exclude index 0 (reserved for the unknown token '<unk>')
        self.population = list(range(1, len(sampling_weights) + 1))
        self.sampling_weights = sampling_weights
        self.candidates = []
        self.i = 0

    def draw(self):
        if self.i == len(self.candidates):
            # Cache `k` random sampling results
            self.candidates = random.choices(
                self.population, self.sampling_weights, k=10000)
            self.i = 0
        self.i += 1
        return self.candidates[self.i - 1]

For example, we can draw 10 random variables \(X\) among the indices 1, 2, and 3 with sampling probabilities \(P(X=1)=2/9\), \(P(X=2)=3/9\), and \(P(X=3)=4/9\) as follows.

generator = RandomGenerator([2, 3, 4])
[generator.draw() for _ in range(10)]
[3, 3, 1, 3, 1, 2, 3, 3, 2, 1]
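
The point of caching 10000 draws at a time is that random.choices rebuilds its cumulative weight table on every call, which is wasteful when drawing one sample at a time. Below is a rough micro-benchmark sketch of our own (absolute timings will vary by machine):

import timeit

weights = [random.random() for _ in range(10000)]
population = list(range(1, len(weights) + 1))
# One weighted draw per call: cumulative weights are recomputed every time
naive = timeit.timeit(
    lambda: random.choices(population, weights, k=1), number=1000)
# Cached draws amortize that preprocessing over 10000 samples
gen = RandomGenerator(weights)
cached = timeit.timeit(gen.draw, number=1000)
print(f'per-call: {naive:.3f}s, cached: {cached:.3f}s')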

For a pair of center word and context word, we randomly sample K (5 in the experiment) noise words. According to the suggestions in the word2vec paper, the sampling probability \(P(w)\) of a noise word \(w\) is set to its relative frequency in the dictionary raised to the power of 0.75 :cite:`Mikolov.Sutskever.Chen.ea.2013`.

#@save
def get_negatives(all_contexts, vocab, counter, K):
    """Return noise words in negative sampling."""
    # Sampling weights for words with indices 1, 2, ... (index 0 is the
    # excluded unknown token) in the vocabulary
    sampling_weights = [counter[vocab.to_tokens(i)]**0.75
                        for i in range(1, len(vocab))]
    all_negatives, generator = [], RandomGenerator(sampling_weights)
    for contexts in all_contexts:
        negatives = []
        while len(negatives) < len(contexts) * K:
            neg = generator.draw()
            # Noise words cannot be context words
            if neg not in contexts:
                negatives.append(neg)
        all_negatives.append(negatives)
    return all_negatives

all_negatives = get_negatives(all_contexts, vocab, counter, 5)
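To see what raising counts to the power of 0.75 does, consider a tiny example of our own with made-up counts: the transformation flattens the distribution, shifting probability mass from frequent words to rare ones while preserving their ranking.

counts = [100, 10, 1]  # hypothetical word counts
for name, ws in (('raw', counts),
                 ('raw**0.75', [c ** 0.75 for c in counts])):
    total = sum(ws)
    print(name, [round(w / total, 3) for w in ws])

The raw counts yield probabilities [0.901, 0.09, 0.009], whereas the 0.75-powered counts yield [0.827, 0.147, 0.026]: the rarest word becomes almost three times more likely to be sampled.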

15.3.5. Loading Training Examples in Minibatches

After all the center words together with their context words and sampled noise words are extracted, they will be transformed into minibatches of examples that can be iteratively loaded during training.

In a minibatch, the \(i^\textrm{th}\) example includes a center word and its \(n_i\) context words and \(m_i\) noise words. Due to varying context window sizes, \(n_i+m_i\) varies for different \(i\). Thus, for each example we concatenate its context words and noise words in the contexts_negatives variable, and pad zeros until the concatenation length reaches \(\max_i (n_i+m_i)\) (max_len). To exclude paddings in the calculation of the loss, we define a mask variable masks. There is a one-to-one correspondence between elements in masks and elements in contexts_negatives, where zeros (otherwise ones) in masks correspond to paddings in contexts_negatives.

To distinguish between positive and negative examples, we separate context words from noise words in contexts_negatives via a labels variable. Similar to masks, there is also a one-to-one correspondence between elements in labels and elements in contexts_negatives, where ones (otherwise zeros) in labels correspond to context words (positive examples) in contexts_negatives.

The above idea is implemented in the following batchify function. Its input data is a list with length equal to the batch size, where each element is an example consisting of the center word center, its context words context, and its noise words negative. This function returns a minibatch that can be loaded for calculations during training, such as including the mask variable.

#@save
def batchify(data):
    """Return a minibatch of examples for skip-gram with negative sampling."""
    max_len = max(len(c) + len(n) for _, c, n in data)
    centers, contexts_negatives, masks, labels = [], [], [], []
    for center, context, negative in data:
        cur_len = len(context) + len(negative)
        centers += [center]
        contexts_negatives += [context + negative + [0] * (max_len - cur_len)]
        masks += [[1] * cur_len + [0] * (max_len - cur_len)]
        labels += [[1] * len(context) + [0] * (max_len - len(context))]
    return (torch.tensor(centers).reshape((-1, 1)), torch.tensor(
        contexts_negatives), torch.tensor(masks), torch.tensor(labels))

Let us test this function using a minibatch of two examples.

x_1 = (1, [2, 2], [3, 3, 3, 3])
x_2 = (1, [2, 2, 2], [3, 3])
batch = batchify((x_1, x_2))

names = ['centers', 'contexts_negatives', 'masks', 'labels']
for name, data in zip(names, batch):
    print(name, '=', data)
centers = tensor([[1],
        [1]])
contexts_negatives = tensor([[2, 2, 3, 3, 3, 3],
        [2, 2, 2, 3, 3, 0]])
masks = tensor([[1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 0]])
labels = tensor([[1, 1, 0, 0, 0, 0],
        [1, 1, 1, 0, 0, 0]])
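
To preview why masks matters, here is a minimal sketch of our own (the logits are random placeholders rather than real model scores) showing how padded positions can be excluded from a binary cross-entropy loss over this minibatch:

import torch.nn.functional as F

_, contexts_negatives, masks, labels = batch
# Stand-in scores for each (center, context-or-noise) pair
logits = torch.randn(contexts_negatives.shape)
# `weight=masks` zeroes out the loss at padded positions
loss = F.binary_cross_entropy_with_logits(
    logits, labels.float(), weight=masks.float(), reduction='none')
# Average only over the non-padded positions of each example
print(loss.sum(dim=1) / masks.sum(dim=1))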

15.3.6. Putting It All Together

Last, we define the load_data_ptb function, which reads the PTB dataset and returns the data iterator and the vocabulary.

#@save
def load_data_ptb(batch_size, max_window_size, num_noise_words):
    """Download the PTB dataset and then load it into memory."""
    num_workers = d2l.get_dataloader_workers()
    sentences = read_ptb()
    vocab = d2l.Vocab(sentences, min_freq=10)
    subsampled, counter = subsample(sentences, vocab)
    corpus = [vocab[line] for line in subsampled]
    all_centers, all_contexts = get_centers_and_contexts(
        corpus, max_window_size)
    all_negatives = get_negatives(
        all_contexts, vocab, counter, num_noise_words)

    class PTBDataset(torch.utils.data.Dataset):
        def __init__(self, centers, contexts, negatives):
            assert len(centers) == len(contexts) == len(negatives)
            self.centers = centers
            self.contexts = contexts
            self.negatives = negatives

        def __getitem__(self, index):
            return (self.centers[index], self.contexts[index],
                    self.negatives[index])

        def __len__(self):
            return len(self.centers)

    dataset = PTBDataset(all_centers, all_contexts, all_negatives)

    data_iter = torch.utils.data.DataLoader(
        dataset, batch_size, shuffle=True, collate_fn=batchify,
        num_workers=num_workers)
    return data_iter, vocab

Let us print the first minibatch of the data iterator. With a maximum context window size of 5 and 5 noise words per context word, each example has at most 10 context words and 50 noise words, so the concatenation length max_len is at most 60, which matches the second dimension below.

data_iter, vocab = load_data_ptb(512, 5, 5)
for batch in data_iter:
    for name, data in zip(names, batch):
        print(name, 'shape:', data.shape)
    break
centers shape: torch.Size([512, 1])
contexts_negatives shape: torch.Size([512, 60])
masks shape: torch.Size([512, 60])
labels shape: torch.Size([512, 60])

15.3.7. Summary

  • High-frequency words may not be so useful in training. We can subsample them to speed up training.

  • For computational efficiency, we load examples in minibatches. We can define other variables to distinguish paddings from non-paddings, and positive examples from negative ones.

15.3.8. Exercises

  1. How does the running time of the code in this section change if subsampling is not used?

  2. The RandomGenerator class caches k random sampling results. Set k to other values and see how it affects the data loading speed.

  3. What other hyperparameters in the code of this section may affect the data loading speed?