1. Overview
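This paper applies CNNs to sentence classification.
(1) A static vector + CNN already works well; => this suggests that pre-trained vectors are universal feature extractors that can be reused across many classification tasks.
(2) Vector + CNN with task-specific fine-tuning does even better.
(3) The architecture is modified so that both task-specific and static vectors can be used.
(4) State-of-the-art results on 4 of the 7 tasks.
Thoughts: the core idea of a CNN is to capture local features. Images are locally correlated, which makes a CNN a natural feature extractor for them. In NLP, an n-gram of text can be viewed as a window of tokens with related features, so a CNN is used to capture the local features inside this sliding window. The strength of a CNN is that it can combine and filter such n-gram features to obtain semantic information at different levels of abstraction.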
2. Model
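Three things are worth attention in this model:
1. How the CNN is applied, i.e. how a CNN is used on text
2. How static and fine-tuned vectors are combined in one architecture
3. The regularization strategy
The overall idea of the paper is fairly simple.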
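2.1 Applying the CNN
<1> Obtaining the feature map
Each word vector is k-dimensional and the sentence length is n (padded where necessary), so the sentence is represented as the simple concatenation of its word vectors, $x_{1:n} = x_1 \oplus x_2 \oplus \dots \oplus x_n$, which forms the leftmost rectangle in Figure 1.
A filter convolves over a window of h words. The words in such a window have an h * k dimensional representation, so a filter $w$ of the same size h * k is applied to it.
Adding a bias $b$ and applying a non-linearity $f$ then gives the extracted feature $c_i = f(w \cdot x_{i:i+h-1} + b)$.
This $c_i$ is the feature produced by one filter on one window. For a sentence of length n the window can take n - h + 1 positions, which yields the feature map $c = [c_1, c_2, \dots, c_{n-h+1}]$.
Clearly $c$ has dimension n - h + 1. This is the feature map of a single filter; to extract several kinds of features, multiple filters are used, and their window sizes h can differ.
This corresponds to the two leftmost parts of Figure 1.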
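The feature-map computation described above can be sketched in NumPy. This is only an illustration under stated assumptions: the function name `feature_map`, the toy sizes, and the choice of ReLU for $f$ are mine (ReLU matches the hyperparameters listed in 3.2), and a real implementation would use a batched convolution.

```python
import numpy as np

def feature_map(sentence, w, b):
    """Slide one filter over a sentence matrix and return c = [c_1, ..., c_{n-h+1}].

    sentence: (n, k) matrix, one k-dimensional word vector per row.
    w:        (h, k) filter covering a window of h words.
    b:        scalar bias.
    """
    n = sentence.shape[0]
    h = w.shape[0]
    c = np.empty(n - h + 1)
    for i in range(n - h + 1):
        window = sentence[i:i + h]               # x_{i:i+h-1}, shape (h, k)
        c[i] = max(0.0, np.sum(w * window) + b)  # f = ReLU (assumed here)
    return c

rng = np.random.default_rng(0)
n, k, h = 7, 5, 3                       # toy sizes, not the paper's
sentence = rng.normal(size=(n, k))      # stand-in for real word vectors
w = rng.normal(size=(h, k))
c = feature_map(sentence, w, b=0.1)
print(c.shape)  # (5,), i.e. n - h + 1 window positions
```

Each filter produces one such vector; filters with a larger h produce a shorter one.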
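<2> max pooling
The pooling used here is called max-over-time pooling. "Over time" refers to the following: as the figure shows, the largest value of each feature map is selected and placed into the pooled vector. Each feature map is produced by sliding the window along the text sequence, so for every filter the pooled value comes from one particular window, i.e. one sub-sequence of the sentence, whose response is larger than that of any other window. The sentence is ordered, and so are the window positions, hence "over time".
It is written $\hat{c} = \max\{c\}$, the maximum value for the corresponding filter.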
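A small sketch of max-over-time pooling, with made-up feature-map values. The helper name `max_over_time` is hypothetical. Because each feature map is reduced to one number, the output size depends only on the number of filters, not on the sentence length or the window size.

```python
import numpy as np

def max_over_time(feature_maps):
    """Keep the single largest value of each feature map.

    feature_maps: list of 1-D arrays; their lengths may differ because
                  n - h + 1 depends on the window size h.
    Returns a 1-D array with one entry per filter.
    """
    return np.array([fm.max() for fm in feature_maps])

# Made-up feature maps from three filters with different window sizes.
maps = [
    np.array([0.2, 1.5, 0.0, 0.7, 0.3]),
    np.array([0.9, 0.1, 0.4, 0.0]),
    np.array([0.0, 0.0, 2.1]),
]
pooled = max_over_time(maps)
print(pooled)  # [1.5 0.9 2.1]
```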
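<3> Fully connected layer
A fully connected layer then maps the features extracted by the previous layers onto the classes, and a softmax gives the probability distribution over the labels.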
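As a rough sketch of this last step, the pooled features can be mapped to class probabilities as below. The names `softmax` and `classify`, the random weights, and the sizes are illustrative assumptions, not values from the paper.

```python
import numpy as np

def softmax(scores):
    """Turn raw class scores into a probability distribution."""
    shifted = scores - scores.max()      # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def classify(z, W, b):
    """Fully connected layer followed by softmax.

    z: (m,) pooled features, one per filter.
    W: (num_classes, m) weight matrix.
    b: (num_classes,) bias.
    """
    return softmax(W @ z + b)

rng = np.random.default_rng(0)
m, num_classes = 6, 2                    # toy sizes
z = rng.random(m)                        # stand-in for max-pooled features
W = rng.normal(size=(num_classes, m))
b = np.zeros(num_classes)
probs = classify(z, W, b)
print(probs.shape)  # (2,)
```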
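2.2 Combining static and fine-tuned vectors
In the paper, the sentence is represented with static and with fine-tuned vectors as two separate channels. As the leftmost part of Figure 1 shows, there are two matrices: the sentence representation built from static vectors and the one built from fine-tuned vectors. For example, the first row of the front matrix is the static vector of the word "wait", and the first row of the back matrix is the fine-tuned vector of "wait".
How is the information from the two combined?
The paper's strategy is again simple: the same filter is applied to both channels, and the two results are added together to form the entry of the feature map. The feature map in Figure 1 therefore merges the information extracted from both kinds of vectors.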
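A minimal sketch of the two-channel combination as described above, assuming the same filter weights are shared by both channels and that the sum is taken before the bias and the non-linearity. The function name and the toy data are hypothetical; library implementations that use a 2-channel convolution typically learn separate weights per channel, which is a slightly different design.

```python
import numpy as np

def multichannel_feature_map(static, tuned, w, b):
    """Apply one filter to both channels and add the responses.

    static, tuned: (n, k) sentence matrices built from static and
                   fine-tuned word vectors respectively.
    w:             (h, k) filter shared by both channels (assumption).
    b:             scalar bias.
    """
    n, h = static.shape[0], w.shape[0]
    c = np.empty(n - h + 1)
    for i in range(n - h + 1):
        s = np.sum(w * static[i:i + h])   # response on the static channel
        t = np.sum(w * tuned[i:i + h])    # response on the fine-tuned channel
        c[i] = max(0.0, s + t + b)        # add, then bias and ReLU
    return c

rng = np.random.default_rng(0)
n, k, h = 6, 4, 2
static = rng.normal(size=(n, k))
tuned = static + 0.1 * rng.normal(size=(n, k))  # pretend fine-tuning shifted the vectors
w = rng.normal(size=(h, k))
c = multichannel_feature_map(static, tuned, w, b=0.0)
print(c.shape)  # (5,)
```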
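2.3 Regularization strategy
To avoid co-adaptation, Hinton et al. proposed dropout. This paper applies the same regularization to the penultimate layer, i.e. the output of max pooling.
Suppose there are m feature maps, and write $z = [\hat{c}_1, \dots, \hat{c}_m]$.
Without dropout, the linear mapping is:
$$y = w \cdot z + b$$
With dropout, it becomes:
$$y = w \cdot (z \circ r) + b$$
Here $\circ$ is element-wise multiplication and $r$ is an m-dimensional mask vector whose entries are independent Bernoulli random variables, each equal to 1 with probability p and 0 otherwise.
During backpropagation, gradients flow only through the units whose mask entry is 1.
At test time the weight vector w is scaled by p, i.e. $\hat{w} = pw$, and $\hat{w}$ is used without dropout to score data not seen during training.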
Optionally, an $l_2$-norm constraint is imposed on $w$: whenever $||w||_2 > s$ after a gradient descent step, $w$ is rescaled so that $||w||_2 = s$.
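The norm constraint amounts to a rescaling step applied after each update. A minimal sketch, with a hypothetical helper name `clip_l2_norm` and a made-up weight vector:

```python
import numpy as np

def clip_l2_norm(w, s):
    """Rescale w to have l2 norm s if its norm exceeds s; otherwise leave it alone."""
    norm = np.linalg.norm(w)
    if norm > s:
        return w * (s / norm)
    return w

w = np.array([3.0, 4.0])             # l2 norm is 5
clipped = clip_l2_norm(w, s=3.0)     # norm exceeds 3, so w is rescaled
untouched = clip_l2_norm(w, s=10.0)  # norm is below 10, so w is returned as is
print(np.linalg.norm(clipped))       # equals s up to floating-point rounding
```

Unlike an $l_2$ penalty added to the loss, this only changes the weights when the norm actually crosses the threshold, and it preserves their direction.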
3. Datasets and Experimental Setup
3.1 Datasets:
1. MR: Movie reviews with one sentence per review. Task is to classify reviews as positive/negative.
2. SST-1: Stanford Sentiment Treebank, an extension of MR but with train/dev/test splits provided and fine-grained labels (very positive, positive, neutral, negative, very negative), re-labeled by Socher et al. (2013).
3. SST-2: Same as SST-1 but with neutral reviews removed and binary labels.
4. Subj: Subjectivity dataset where the task is to classify a sentence as being subjective or objective (Pang and Lee, 2004)
5. TREC: TREC question dataset—task involves classifying a question into 6 question types (whether the question is about person, location, numeric information, etc.) (Li and Roth, 2002)
6. CR: Customer reviews of various products (cameras, MP3s etc.). Task is to predict positive/negative reviews (Hu and Liu, 2004)
7. MPQA: Opinion polarity detection subtask of the MPQA dataset (Wiebe et al., 2005).
3.2 Hyperparameters and Training
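Activation function: ReLU
Window sizes (h): 3, 4, 5, with 100 feature maps each
dropout rate p = 0.5
l2 constraint s = 3
mini-batch size = 50
These hyperparameters were chosen by grid search on the SST-2 dev set.
Training uses stochastic gradient descent over shuffled mini-batches
with the Adadelta update rule.
For datasets without a standard dev set, 10% of the training data is randomly selected as the dev set.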
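For the datasets without a standard dev set, the random 10% hold-out can be sketched like this. The function name `split_train_dev`, the fixed seed, and the integer stand-ins for examples are illustrative assumptions.

```python
import random

def split_train_dev(examples, dev_fraction=0.1, seed=0):
    """Shuffle the examples and hold out a fraction of them as a dev set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is reproducible
    n_dev = int(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]

# Integers stand in for (sentence, label) pairs.
train, dev = split_train_dev(range(1000))
print(len(train), len(dev))  # 900 100
```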
3.3 Pre-trained Word Vectors
word2vec vectors that were trained on 100 billion words from Google News
3.4 Model Variations
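The model variants in the paper mainly test how the initial word vectors affect performance.
CNN-rand: all word vectors are randomly initialized
CNN-static: initialized with pre-trained word2vec vectors, which are kept fixed
CNN-non-static: initialized with pre-trained vectors, which are then fine-tuned for the specific task
CNN-multichannel: combines static and fine-tuned vectors, each as one channel
Results: the latter three improve over the fully random variant on all 7 datasets.
Moreover, this simple CNN performs close to more complex models that rely on parse trees and similar structures, and it achieves state-of-the-art results on several datasets, including SST-2 and CR.
The multichannel approach was proposed in the hope that it would improve results by preventing overfitting, but the experiments do not show a clear advantage: on some datasets it does worse than the other variants.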
4. Code
Theano: 1. The paper's own implementation: yoonkim/CNN_sentence
Tensorflow: 2. dennybritz/cnn-text-classification-tf
Keras: 3. alexander-rakhlin/CNN-for-Sentence-Classification-in-Keras
Pytorch: 4. Shawn1993/cnn-text-classification-pytorch
I tried it on MR: the best eval accuracy was 73%, below the 77.5% reported on GitHub and the 76.1% in the paper;
I tried it on SST: the best eval accuracy was 37%, below the 37.2% reported on GitHub and the 45.0% in the paper.
The code of model.py is shown here:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CNN_Text(nn.Module):

    def __init__(self, args):
        super(CNN_Text, self).__init__()
        self.args = args

        V = args.embed_num      # vocabulary size
        D = args.embed_dim      # embedding dimension (k in the paper)
        C = args.class_num      # number of classes
        Ci = 1                  # input channels
        Co = args.kernel_num    # feature maps per window size
        Ks = args.kernel_sizes  # window sizes (h in the paper)

        self.embed = nn.Embedding(V, D)
        # nn.ModuleList registers every conv layer with the module; a plain
        # Python list would hide their parameters from the optimizer.
        # self.convs1 = [nn.Conv2d(Ci, Co, (K, D)) for K in Ks]
        self.convs1 = nn.ModuleList([nn.Conv2d(Ci, Co, (K, D)) for K in Ks])
        # Equivalent explicit form:
        # self.conv13 = nn.Conv2d(Ci, Co, (3, D))
        # self.conv14 = nn.Conv2d(Ci, Co, (4, D))
        # self.conv15 = nn.Conv2d(Ci, Co, (5, D))
        self.dropout = nn.Dropout(args.dropout)
        self.fc1 = nn.Linear(len(Ks) * Co, C)

    def conv_and_pool(self, x, conv):
        x = F.relu(conv(x)).squeeze(3)  # (N, Co, W-K+1)
        x = F.max_pool1d(x, x.size(2)).squeeze(2)  # (N, Co)
        return x

    def forward(self, x):
        x = self.embed(x)  # (N, W, D)

        if self.args.static:
            # Originally `x = Variable(x)`; detach() is the current way to
            # stop gradients from reaching the embedding weights.
            x = x.detach()

        x = x.unsqueeze(1)  # (N, Ci, W, D)

        # Convolution + ReLU for each window size
        x = [F.relu(conv(x)).squeeze(3) for conv in self.convs1]  # [(N, Co, W-K+1), ...]*len(Ks)

        # Max-over-time pooling
        x = [F.max_pool1d(i, i.size(2)).squeeze(2) for i in x]  # [(N, Co), ...]*len(Ks)

        x = torch.cat(x, 1)  # (N, len(Ks)*Co)

        # Equivalent explicit form:
        # x1 = self.conv_and_pool(x, self.conv13)  # (N, Co)
        # x2 = self.conv_and_pool(x, self.conv14)  # (N, Co)
        # x3 = self.conv_and_pool(x, self.conv15)  # (N, Co)
        # x = torch.cat((x1, x2, x3), 1)           # (N, len(Ks)*Co)
        x = self.dropout(x)  # (N, len(Ks)*Co)
        logit = self.fc1(x)  # (N, C)
        return logit
```
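Pytorch: 5. prakashpandey9/Text-Classification-Pytorch
Note that the CNN part of models in this repository is a simple implementation of the paper, but its main.py needs some changes.
It uses the IMDB dataset, whose labels are 1 and 2, whereas PyTorch requires the targets to satisfy 0 <= t < n_classes when computing the loss. The labels (1, 2) therefore have to be converted to (0, 1), otherwise the following error is raised: "Assertion `t >= 0 && t < n_classes` failed."
This can be done by changing label 2 to 0: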
```python
target = (target != 2)   # label 1 -> True, label 2 -> False
target = target.long()   # True/False -> 1/0
```
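Because the loss function used in this code is cross_entropy, the targets must be converted to the long type.
For convenience, the complete modified main.py is shown here; the hyperparameters in it can be changed as needed.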
```python
import load_data
import numpy as np
import torch
import torch.nn.functional as F
from models.LSTM import LSTMClassifier  # only needed for the commented-out LSTM alternative
from models.CNN import CNN

TEXT, vocab_size, word_embeddings, train_iter, valid_iter, test_iter = load_data.load_dataset()


def clip_gradient(model, clip_value):
    params = list(filter(lambda p: p.grad is not None, model.parameters()))
    for p in params:
        p.grad.data.clamp_(-clip_value, clip_value)


def train_model(model, train_iter, epoch):
    total_epoch_loss = 0
    total_epoch_acc = 0

    optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()))
    steps = 0
    model.train()
    for idx, batch in enumerate(train_iter):
        text = batch.text[0]
        target = batch.label
        # Map labels (1, 2) to (1, 0); otherwise cross_entropy fails with
        # "Assertion `t >= 0 && t < n_classes` failed."
        target = (target != 2)
        target = target.long()

        if torch.cuda.is_available():
            text = text.cuda()
            target = target.cuda()

        # One of the batches returned by BucketIterator has a length different from 32.
        # (`!=` replaces the original `is not`, which compares identity, not value.)
        if text.size(0) != 32:
            continue
        optimizer.zero_grad()
        prediction = model(text)
        loss = loss_fn(prediction, target)

        num_corrects = (torch.max(prediction, 1)[1].view(target.size()).data == target.data).float().sum()
        acc = 100.0 * num_corrects / len(batch)

        loss.backward()
        clip_gradient(model, 1e-1)
        optimizer.step()
        steps += 1

        if steps % 100 == 0:
            print(f'Epoch: {epoch+1}, Idx: {idx+1}, Training Loss: {loss.item():.4f}, Training Accuracy: {acc.item():.2f}%')

        total_epoch_loss += loss.item()
        total_epoch_acc += acc.item()

    return total_epoch_loss / len(train_iter), total_epoch_acc / len(train_iter)


def eval_model(model, val_iter):
    total_epoch_loss = 0
    total_epoch_acc = 0
    model.eval()
    with torch.no_grad():
        for idx, batch in enumerate(val_iter):
            text = batch.text[0]
            if text.size(0) != 32:
                continue
            target = batch.label
            target = (target != 2)
            target = target.long()

            if torch.cuda.is_available():
                text = text.cuda()
                target = target.cuda()

            prediction = model(text)
            loss = loss_fn(prediction, target)
            num_corrects = (torch.max(prediction, 1)[1].view(target.size()).data == target.data).float().sum()
            acc = 100.0 * num_corrects / len(batch)
            total_epoch_loss += loss.item()
            total_epoch_acc += acc.item()

    return total_epoch_loss / len(val_iter), total_epoch_acc / len(val_iter)


# Hyperparameters (learning_rate is kept for reference; Adam above uses its default)
learning_rate = 1e-3
batch_size = 32
output_size = 2
# hidden_size = 256
embedding_length = 300

# model = LSTMClassifier(batch_size, output_size, hidden_size, vocab_size, embedding_length, word_embeddings)

model = CNN(batch_size=batch_size, output_size=output_size, in_channels=1, out_channels=100,
            kernel_heights=[3, 4, 5], stride=1, padding=0, keep_probab=0.5,
            vocab_size=vocab_size, embedding_length=embedding_length, weights=word_embeddings)

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model.to(device)

loss_fn = F.cross_entropy

for epoch in range(1):
    train_loss, train_acc = train_model(model, train_iter, epoch)
    val_loss, val_acc = eval_model(model, valid_iter)

    print(f'Epoch: {epoch+1:02}, Train Loss: {train_loss:.3f}, Train Acc: {train_acc:.2f}%, Val. Loss: {val_loss:.3f}, Val. Acc: {val_acc:.2f}%')

test_loss, test_acc = eval_model(model, test_iter)
print(f'Test Loss: {test_loss:.3f}, Test Acc: {test_acc:.2f}%')

# Let us now predict the sentiment on a single sentence just for testing purposes.
test_sen1 = "This is one of the best creation of Nolan. I can say, it's his magnum opus. Loved the soundtrack and especially those creative dialogues."
test_sen2 = "Ohh, such a ridiculous movie. Not gonna recommend it to anyone. Complete waste of time and money."

test_sen1 = TEXT.preprocess(test_sen1)
test_sen1 = [[TEXT.vocab.stoi[x] for x in test_sen1]]

test_sen2 = TEXT.preprocess(test_sen2)
test_sen2 = [[TEXT.vocab.stoi[x] for x in test_sen2]]

test_sen = np.asarray(test_sen2)
# .to(device) replaces the original unconditional .cuda(), which fails on CPU-only machines
test_tensor = torch.LongTensor(test_sen).to(device)

model.eval()
with torch.no_grad():
    output = model(test_tensor, 1)
out = F.softmax(output, 1)

# Which index means "positive" depends on how load_data encodes the labels.
if torch.argmax(out[0]) == 0:
    print("Sentiment: Positive")
else:
    print("Sentiment: Negative")
```