16.1. Sentiment Analysis and the Dataset
With the proliferation of online social media and review platforms, a plethora of opinionated data has been logged, bearing great potential for supporting decision-making processes. Sentiment analysis studies people's sentiments in their produced text, such as product reviews, blog comments, and forum discussions. It enjoys wide applications in fields as diverse as politics (e.g., analysis of public sentiment towards policies), finance (e.g., analysis of market sentiment), and marketing (e.g., product research and brand management).
Since sentiments can be categorized as discrete polarities or scales (e.g., positive and negative), we can consider sentiment analysis as a text classification task, which transforms a varying-length text sequence into a fixed-length text category. In this chapter, we will use Stanford's large movie review dataset for sentiment analysis. It consists of a training set and a test set, each containing 25000 movie reviews downloaded from IMDb. In both datasets, there are equal numbers of "positive" and "negative" labels, indicating different sentiment polarities.
import os
import torch
from torch import nn
from d2l import torch as d2l
16.1.1. Reading the Dataset
First, download and extract this IMDb review dataset in the path ../data/aclImdb.
#@save
d2l.DATA_HUB['aclImdb'] = (d2l.DATA_URL + 'aclImdb_v1.tar.gz',
                           '01ada507287d82875905620988597833ad4e0903')

data_dir = d2l.download_extract('aclImdb', 'aclImdb')
Downloading ../data/aclImdb_v1.tar.gz from http://d2l-data.s3-accelerate.amazonaws.com/aclImdb_v1.tar.gz...
Next, read the training and test datasets. Each example is a review and its label: 1 for "positive" and 0 for "negative".
#@save
def read_imdb(data_dir, is_train):
    """Read the IMDb review dataset text sequences and labels."""
    data, labels = [], []
    for label in ('pos', 'neg'):
        folder_name = os.path.join(data_dir, 'train' if is_train else 'test',
                                   label)
        for file in os.listdir(folder_name):
            with open(os.path.join(folder_name, file), 'rb') as f:
                review = f.read().decode('utf-8').replace('\n', '')
                data.append(review)
                labels.append(1 if label == 'pos' else 0)
    return data, labels

train_data = read_imdb(data_dir, is_train=True)
print('# trainings:', len(train_data[0]))
for x, y in zip(train_data[0][:3], train_data[1][:3]):
    print('label:', y, 'review:', x[:60])
# trainings: 25000
label: 1 review: Zentropa has much in common with The Third Man, another noir
label: 1 review: Zentropa is the most original movie I've seen in years. If y
label: 1 review: Lars Von Trier is never backward in trying out new technique
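The directory-walking logic of read_imdb can be sanity-checked without downloading the real dataset. Below is a minimal sketch that builds a hypothetical two-file directory tree mimicking aclImdb's train/pos and train/neg layout, then reads it back with the same pattern; the review texts and file name are made up for illustration:

```python
import os
import tempfile

# Build a tiny 'train/pos' and 'train/neg' tree mimicking aclImdb's layout.
root = tempfile.mkdtemp()
for label, text in (('pos', 'great film'), ('neg', 'dull film')):
    folder = os.path.join(root, 'train', label)
    os.makedirs(folder)
    with open(os.path.join(folder, '0_10.txt'), 'w') as f:
        f.write(text)

def read_reviews(data_dir, is_train):
    """Same logic as read_imdb: label 1 for 'pos', 0 for 'neg'."""
    data, labels = [], []
    for label in ('pos', 'neg'):
        folder_name = os.path.join(data_dir, 'train' if is_train else 'test',
                                   label)
        for file in os.listdir(folder_name):
            with open(os.path.join(folder_name, file), 'rb') as f:
                data.append(f.read().decode('utf-8').replace('\n', ''))
                labels.append(1 if label == 'pos' else 0)
    return data, labels

reviews, labels = read_reviews(root, is_train=True)
print(labels)  # [1, 0]: positives are read first, then negatives
```

Because the positive folder is traversed first, the returned lists are grouped by label, which is why the first three training examples printed above are all labeled 1.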
16.1.2. Preprocessing the Dataset
Treating each word as a token and filtering out words that appear fewer than 5 times, we create a vocabulary from the training dataset.
train_tokens = d2l.tokenize(train_data[0], token='word')
vocab = d2l.Vocab(train_tokens, min_freq=5, reserved_tokens=['<pad>'])
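The effect of min_freq=5 is to drop rare tokens. The frequency-filtering idea can be sketched in plain Python (the real d2l.Vocab additionally reserves special tokens such as '&lt;unk&gt;' and '&lt;pad&gt;' and maps tokens to integer indices); the toy token lines below are made up for illustration:

```python
from collections import Counter

def build_vocab(token_lines, min_freq):
    """Keep only tokens appearing at least min_freq times, most frequent first."""
    counter = Counter(tok for line in token_lines for tok in line)
    return [tok for tok, freq in counter.most_common() if freq >= min_freq]

lines = [['good', 'movie'], ['good', 'plot'], ['good', 'movie']]
print(build_vocab(lines, min_freq=2))  # ['good', 'movie']; 'plot' is too rare
```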
After tokenization, let us plot the histogram of review lengths in tokens.
d2l.set_figsize()
d2l.plt.xlabel('# tokens per review')
d2l.plt.ylabel('count')
d2l.plt.hist([len(line) for line in train_tokens], bins=range(0, 1000, 50));
As we expected, the reviews have varying lengths. To process a minibatch of such reviews at each time, we set the length of each review to 500 with truncation and padding, which is similar to the preprocessing step for the machine translation dataset in Section 10.5.
num_steps = 500  # sequence length
train_features = torch.tensor([d2l.truncate_pad(
    vocab[line], num_steps, vocab['<pad>']) for line in train_tokens])
print(train_features.shape)
torch.Size([25000, 500])
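The truncate-and-pad idea can be sketched in a few lines of plain Python: sequences of token indices longer than num_steps are cut, shorter ones are right-padded with the padding token's index. This mirrors what d2l.truncate_pad does; the index values below are arbitrary:

```python
def truncate_pad(line, num_steps, padding_token):
    """Truncate or pad a token-index sequence to exactly num_steps entries."""
    if len(line) > num_steps:
        return line[:num_steps]  # truncate a long sequence
    return line + [padding_token] * (num_steps - len(line))  # pad a short one

print(truncate_pad([7, 8, 9], 5, 0))            # [7, 8, 9, 0, 0]
print(truncate_pad([1, 2, 3, 4, 5, 6], 5, 0))   # [1, 2, 3, 4, 5]
```

Applying this to every review is what turns the ragged list of token sequences into the rectangular (25000, 500) tensor printed above.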
16.1.3. Creating Data Iterators
Now we can create data iterators. At each iteration, a minibatch of examples is returned.
train_iter = d2l.load_array((train_features, torch.tensor(train_data[1])), 64)

for X, y in train_iter:
    print('X:', X.shape, ', y:', y.shape)
    break
print('# batches:', len(train_iter))
X: torch.Size([64, 500]) , y: torch.Size([64])
# batches: 391
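Minibatching itself reduces to shuffling indices and slicing. Below is a framework-free sketch of what one epoch of iteration does (d2l.load_array wraps the equivalent in a PyTorch DataLoader); the toy features and labels are made up for illustration:

```python
import random

def iterate_minibatches(features, labels, batch_size, shuffle=True):
    """Yield (features, labels) minibatches; the last batch may be smaller."""
    indices = list(range(len(features)))
    if shuffle:
        random.shuffle(indices)
    for i in range(0, len(indices), batch_size):
        batch = indices[i:i + batch_size]
        yield [features[j] for j in batch], [labels[j] for j in batch]

X = [[i] * 3 for i in range(10)]  # 10 toy 'reviews' of length 3
y = [i % 2 for i in range(10)]
batches = list(iterate_minibatches(X, y, batch_size=4, shuffle=False))
print(len(batches))  # 3 batches of sizes 4, 4, 2
```

With 25000 training examples and batch size 64, this scheme produces ceil(25000 / 64) = 391 minibatches, matching the # batches output above.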
16.1.4. Putting It All Together
Last, let us wrap up the above steps into the load_data_imdb function. It returns training and test data iterators and the vocabulary of the IMDb review dataset.
#@save
def load_data_imdb(batch_size, num_steps=500):
    """Return data iterators and the vocabulary of the IMDb review dataset."""
    data_dir = d2l.download_extract('aclImdb', 'aclImdb')
    train_data = read_imdb(data_dir, True)
    test_data = read_imdb(data_dir, False)
    train_tokens = d2l.tokenize(train_data[0], token='word')
    test_tokens = d2l.tokenize(test_data[0], token='word')
    vocab = d2l.Vocab(train_tokens, min_freq=5)
    train_features = torch.tensor([d2l.truncate_pad(
        vocab[line], num_steps, vocab['<pad>']) for line in train_tokens])
    test_features = torch.tensor([d2l.truncate_pad(
        vocab[line], num_steps, vocab['<pad>']) for line in test_tokens])
    train_iter = d2l.load_array((train_features, torch.tensor(train_data[1])),
                                batch_size)
    test_iter = d2l.load_array((test_features, torch.tensor(test_data[1])),
                               batch_size,
                               is_train=False)
    return train_iter, test_iter, vocab
16.1.5. Summary
Sentiment analysis studies people's sentiments in their produced text, which is considered as a text classification problem that transforms a varying-length text sequence into a fixed-length text category.
After preprocessing, we can load Stanford's large movie review dataset (IMDb review dataset) into data iterators with a vocabulary.