15.3. The Dataset for Pretraining Word Embeddings
Now that we have gone through the technical details of the word2vec models and approximate training techniques, let us walk through their implementations. Specifically, we will take the skip-gram model in :numref:`sec_word2vec` and negative sampling in :numref:`sec_approx-train` as an example. In this section, we begin with the dataset for pretraining the word embedding model: the original format of the data will be transformed into minibatches that can be iterated over during training.
import collections
import math
import os
import random
import torch
from d2l import torch as d2l
15.3.1. Reading the Dataset
The dataset we use here is the Penn Tree Bank (PTB). This corpus is sampled from Wall Street Journal articles and is split into training, validation, and test sets. In the original format, each line of the text file represents a sentence of words separated by spaces. Here we treat each word as a token.
#@save
d2l.DATA_HUB['ptb'] = (d2l.DATA_URL + 'ptb.zip',
                       '319d85e578af0cdc590547f26231e4e31cdf1e42')

#@save
def read_ptb():
    """Load the PTB dataset into a list of text lines."""
    data_dir = d2l.download_extract('ptb')
    # Read the training set
    with open(os.path.join(data_dir, 'ptb.train.txt')) as f:
        raw_text = f.read()
    return [line.split() for line in raw_text.split('\n')]
sentences = read_ptb()
f'# sentences: {len(sentences)}'
Downloading ../data/ptb.zip from http://d2l-data.s3-accelerate.amazonaws.com/ptb.zip...
'# sentences: 42069'
After reading the training set, we build a vocabulary for the corpus, where any word that appears fewer than 10 times is replaced by the "<unk>" token. Note that the original dataset also contains "<unk>" tokens that represent rare (unknown) words.
vocab = d2l.Vocab(sentences, min_freq=10)
f'vocab size: {len(vocab)}'
'vocab size: 6719'
15.3.2. Subsampling
Text data typically have high-frequency words such as "the", "a", and "in": they may even occur billions of times in very large corpora. However, these words often co-occur with many different words in context windows, providing little useful signal. For instance, consider the word "chip" in a context window: intuitively its co-occurrence with the low-frequency word "intel" is more useful in training than its co-occurrence with the high-frequency word "a". Moreover, training with vast amounts of (high-frequency) words is slow. Thus, when training word embedding models, high-frequency words can be *subsampled* :cite:`Mikolov.Sutskever.Chen.ea.2013`. Specifically, each indexed word \(w_i\) in the dataset will be discarded with probability

\[P(w_i) = \max\left(1 - \sqrt{\frac{t}{f(w_i)}},\ 0\right),\]

where \(f(w_i)\) is the ratio of the number of words \(w_i\) to the total number of words in the dataset, and the constant \(t\) is a hyperparameter (\(10^{-4}\) in the experiment). We can see that only when the relative frequency \(f(w_i) > t\) can the (high-frequency) word \(w_i\) be discarded, and the higher the relative frequency of the word, the greater the probability of being discarded.
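To make the formula concrete, here is a small numerical check (the relative frequencies below are made up for illustration, not taken from PTB): with \(t=10^{-4}\), a word with relative frequency 0.01 is dropped about 90% of the time, while a word whose relative frequency is at or below \(t\) is always kept.

t = 1e-4
for f in (1e-2, 1e-3, 1e-4):  # hypothetical relative frequencies
    p_drop = max(1 - math.sqrt(t / f), 0)
    print(f'f(w)={f:g}: P(drop)={p_drop:.2f}')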
#@save
def subsample(sentences, vocab):
    """Subsample high-frequency words."""
    # Exclude unknown tokens ('<unk>')
    sentences = [[token for token in line if vocab[token] != vocab.unk]
                 for line in sentences]
    counter = collections.Counter([
        token for line in sentences for token in line])
    num_tokens = sum(counter.values())

    # Return True if `token` is kept during subsampling
    def keep(token):
        return(random.uniform(0, 1) <
               math.sqrt(1e-4 / counter[token] * num_tokens))

    return ([[token for token in line if keep(token)] for line in sentences],
            counter)

subsampled, counter = subsample(sentences, vocab)
The following code snippet plots the histogram of the number of tokens per sentence before and after subsampling. As expected, subsampling significantly shortens sentences by dropping high-frequency words, which will lead to a training speedup.
d2l.show_list_len_pair_hist(['origin', 'subsampled'], '# tokens per sentence',
                            'count', sentences, subsampled);
For individual tokens, the sampling rate of the high-frequency word "the" is less than 1/20.
def compare_counts(token):
    return (f'# of "{token}": '
            f'before={sum([l.count(token) for l in sentences])}, '
            f'after={sum([l.count(token) for l in subsampled])}')

compare_counts('the')
'# of "the": before=50770, after=2010'
In contrast, the low-frequency word "join" is completely kept.
compare_counts('join')
'# of "join": before=45, after=45'
After subsampling, we map the tokens of the corpus to their indices.
corpus = [vocab[line] for line in subsampled]
corpus[:3]
[[], [4127, 3228, 1773], [3922, 1922, 4743, 2696]]
15.3.3. Extracting Center Words and Context Words

The following get_centers_and_contexts function extracts all the center words and their context words from corpus. It uniformly samples an integer between 1 and max_window_size at random as the context window size. For any center word, those words whose distance from it does not exceed the sampled context window size are its context words.
#@save
def get_centers_and_contexts(corpus, max_window_size):
    """Return center words and context words in skip-gram."""
    centers, contexts = [], []
    for line in corpus:
        # To form a "center word--context word" pair, each sentence needs to
        # have at least 2 words
        if len(line) < 2:
            continue
        centers += line
        for i in range(len(line)):  # Context window centered at `i`
            window_size = random.randint(1, max_window_size)
            indices = list(range(max(0, i - window_size),
                                 min(len(line), i + 1 + window_size)))
            # Exclude the center word from the context words
            indices.remove(i)
            contexts.append([line[idx] for idx in indices])
    return centers, contexts
Next, we create an artificial dataset containing two sentences of 7 and 3 words, respectively. Let the maximum context window size be 2 and print all the center words and their context words.
tiny_dataset = [list(range(7)), list(range(7, 10))]
print('dataset', tiny_dataset)
for center, context in zip(*get_centers_and_contexts(tiny_dataset, 2)):
    print('center', center, 'has contexts', context)
dataset [[0, 1, 2, 3, 4, 5, 6], [7, 8, 9]]
center 0 has contexts [1]
center 1 has contexts [0, 2]
center 2 has contexts [0, 1, 3, 4]
center 3 has contexts [1, 2, 4, 5]
center 4 has contexts [2, 3, 5, 6]
center 5 has contexts [3, 4, 6]
center 6 has contexts [5]
center 7 has contexts [8, 9]
center 8 has contexts [7, 9]
center 9 has contexts [7, 8]
When training on the PTB dataset, we set the maximum context window size to 5. The following extracts all the center words and their context words in the dataset.
all_centers, all_contexts = get_centers_and_contexts(corpus, 5)
f'# center-context pairs: {sum([len(contexts) for contexts in all_contexts])}'
'# center-context pairs: 1503420'
15.3.4. Negative Sampling
We use negative sampling for approximate training. To sample noise words according to a predefined distribution, we define the following RandomGenerator class, where the (possibly unnormalized) sampling distribution is passed via the argument sampling_weights.
#@save
class RandomGenerator:
    """Randomly draw among {1, ..., n} according to n sampling weights."""
    def __init__(self, sampling_weights):
        # Exclude index 0; candidates are drawn from {1, ..., n}
        self.population = list(range(1, len(sampling_weights) + 1))
        self.sampling_weights = sampling_weights
        self.candidates = []
        self.i = 0

    def draw(self):
        if self.i == len(self.candidates):
            # Cache `k` random sampling results to avoid calling
            # `random.choices` for every single draw
            self.candidates = random.choices(
                self.population, self.sampling_weights, k=10000)
            self.i = 0
        self.i += 1
        return self.candidates[self.i - 1]
For example, we can draw 10 random variables \(X\) among indices 1, 2, and 3 with sampling probabilities \(P(X=1)=2/9\), \(P(X=2)=3/9\), and \(P(X=3)=4/9\) as follows.
generator = RandomGenerator([2, 3, 4])
[generator.draw() for _ in range(10)]
[3, 3, 1, 3, 1, 2, 3, 3, 2, 1]
For a pair of center word and context word, we randomly sample K (5 in the experiment) noise words. According to the suggestions in the word2vec paper, the sampling probability \(P(w)\) of a noise word \(w\) is set to its relative frequency in the dictionary raised to the power of 0.75 :cite:`Mikolov.Sutskever.Chen.ea.2013`.
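As a quick illustration (with made-up counts, not taken from the PTB vocabulary), raising counts to the power of 0.75 flattens the distribution: a word that occurs 100 times more often than another becomes only about 32 times more likely to be drawn as a noise word.

# Hypothetical counts for two words; the 0.75 power narrows the gap between
# frequent and rare words in the noise distribution
counts = [1000, 10]
weights = [c ** 0.75 for c in counts]
print(counts[0] / counts[1])    # 100.0
print(weights[0] / weights[1])  # about 31.6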
#@save
def get_negatives(all_contexts, vocab, counter, K):
    """Return noise words in negative sampling."""
    # Sampling weights for words with indices 1, 2, ... (index 0 is the
    # excluded unknown token) in the vocabulary
    sampling_weights = [counter[vocab.to_tokens(i)]**0.75
                        for i in range(1, len(vocab))]
    all_negatives, generator = [], RandomGenerator(sampling_weights)
    for contexts in all_contexts:
        negatives = []
        while len(negatives) < len(contexts) * K:
            neg = generator.draw()
            # Noise words cannot be context words
            if neg not in contexts:
                negatives.append(neg)
        all_negatives.append(negatives)
    return all_negatives

all_negatives = get_negatives(all_contexts, vocab, counter, 5)
15.3.5. Loading Training Examples in Minibatches
After all the center words together with their context words and sampled noise words are extracted, they will be transformed into minibatches of examples that can be iteratively loaded during training.
In a minibatch, the \(i^\textrm{th}\) example includes a center word and its \(n_i\) context words and \(m_i\) noise words. Due to varying context window sizes, \(n_i+m_i\) varies for different \(i\). Thus, for each example we concatenate its context words and noise words in the contexts_negatives variable, and pad zeros until the concatenation length reaches \(\max_i n_i+m_i\) (max_len). To exclude paddings in the calculation of the loss, we define a mask variable masks. There is a one-to-one correspondence between elements in masks and elements in contexts_negatives, where zeros (otherwise ones) in masks correspond to paddings in contexts_negatives.

To distinguish between positive and negative examples, we separate context words from noise words in contexts_negatives via a labels variable. Similar to masks, there is also a one-to-one correspondence between elements in labels and elements in contexts_negatives, where ones (otherwise zeros) in labels correspond to context words (positive examples) in contexts_negatives.
The above ideas are implemented in the following batchify function. Its input data is a list with length equal to the batch size, where each element is an example consisting of the center word center, its context words context, and its noise words negative. This function returns a minibatch that can be loaded for calculations during training, such as including the mask variable.
#@save
def batchify(data):
    """Return a minibatch of examples for skip-gram with negative sampling."""
    max_len = max(len(c) + len(n) for _, c, n in data)
    centers, contexts_negatives, masks, labels = [], [], [], []
    for center, context, negative in data:
        cur_len = len(context) + len(negative)
        centers += [center]
        contexts_negatives += [context + negative + [0] * (max_len - cur_len)]
        masks += [[1] * cur_len + [0] * (max_len - cur_len)]
        labels += [[1] * len(context) + [0] * (max_len - len(context))]
    return (torch.tensor(centers).reshape((-1, 1)), torch.tensor(
        contexts_negatives), torch.tensor(masks), torch.tensor(labels))
Let us test this function with a minibatch of two examples.
x_1 = (1, [2, 2], [3, 3, 3, 3])
x_2 = (1, [2, 2, 2], [3, 3])
batch = batchify((x_1, x_2))

names = ['centers', 'contexts_negatives', 'masks', 'labels']
for name, data in zip(names, batch):
    print(name, '=', data)
centers = tensor([[1],
        [1]])
contexts_negatives = tensor([[2, 2, 3, 3, 3, 3],
        [2, 2, 2, 3, 3, 0]])
masks = tensor([[1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 0]])
labels = tensor([[1, 1, 0, 0, 0, 0],
        [1, 1, 1, 0, 0, 0]])
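The masks and labels produced by batchify are what allow the training loss in the next section to ignore padding. As a minimal sketch (pred below is a random placeholder standing in for model scores, not part of this section's code), masking zeroes out the loss at padded positions and averages only over the real entries of each example.

# Sketch only: `pred` stands in for the model's scores over `contexts_negatives`
pred = torch.randn(2, 6)
loss_fn = torch.nn.BCEWithLogitsLoss(reduction='none')
masks, labels = batch[2].float(), batch[3].float()
# Zero out padded positions, then average over non-padded positions per example
masked_loss = (loss_fn(pred, labels) * masks).sum(dim=1) / masks.sum(dim=1)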
15.3.6. Putting It All Together
Finally, we define the load_data_ptb function, which reads the PTB dataset and returns the data iterator and the vocabulary.
#@save
def load_data_ptb(batch_size, max_window_size, num_noise_words):
    """Download the PTB dataset and then load it into memory."""
    num_workers = d2l.get_dataloader_workers()
    sentences = read_ptb()
    vocab = d2l.Vocab(sentences, min_freq=10)
    subsampled, counter = subsample(sentences, vocab)
    corpus = [vocab[line] for line in subsampled]
    all_centers, all_contexts = get_centers_and_contexts(
        corpus, max_window_size)
    all_negatives = get_negatives(
        all_contexts, vocab, counter, num_noise_words)

    class PTBDataset(torch.utils.data.Dataset):
        def __init__(self, centers, contexts, negatives):
            assert len(centers) == len(contexts) == len(negatives)
            self.centers = centers
            self.contexts = contexts
            self.negatives = negatives

        def __getitem__(self, index):
            return (self.centers[index], self.contexts[index],
                    self.negatives[index])

        def __len__(self):
            return len(self.centers)

    dataset = PTBDataset(all_centers, all_contexts, all_negatives)

    data_iter = torch.utils.data.DataLoader(dataset, batch_size, shuffle=True,
                                            collate_fn=batchify,
                                            num_workers=num_workers)
    return data_iter, vocab
Let us print the first minibatch of the data iterator.
data_iter, vocab = load_data_ptb(512, 5, 5)
for batch in data_iter:
    for name, data in zip(names, batch):
        print(name, 'shape:', data.shape)
    break
centers shape: torch.Size([512, 1])
contexts_negatives shape: torch.Size([512, 60])
masks shape: torch.Size([512, 60])
labels shape: torch.Size([512, 60])
15.3.7. Summary
High-frequency words may not be so useful in training. We can subsample them to speed up training.
For computational efficiency, we load examples in minibatches. We can define other variables to distinguish paddings from non-paddings, and positive examples from negative ones.