
NLP Primer - Tianchi News Text Classification (5): Text Classification Based on Deep Learning, Part 2

    • Deep models

    • word2vec

      • The Skip-grams (SG) process
      • Skip-grams training
      • Training word vectors with Word2Vec
    • TextCNN

      • Datawhale's textCNN implementation

Deep models

As mentioned earlier, the news text classification task can be split into two steps: first represent the text as word vectors, then use a machine learning or deep learning model to classify that input (the word vectors). Improvements can therefore come from either step. The first route is to choose a better word-vector representation, for example moving from one-hot vectors to Word2Vec vectors; the second is to choose a more effective prediction model, for example switching from a linear regression model to ensemble tree models (GBDT, XGBoost, LightGBM). The fastText method from Part 4 can be viewed, for this task, as merging the two steps and performing them jointly.

word2vec

In this section we try Word2Vec to generate word vectors and then feed those vectors to a model for prediction. The code draws heavily on the Word2Vec reference code released by Datawhale, trained on this task's corpus. In the Datawhale reference code, the data preparation appears intended to counteract the unequal number of news articles per class, so that each class is spread relatively "evenly" across the folds, though it is unclear whether this actually helps. The code also appears set up for 10-fold cross-validation, splitting the data into 10 equal folds, yet the later code never actually runs 10-fold cross-validation. Overall the reference code is rather messy.

The following outlines how Word2Vec produces word vectors, and how TextCNN then uses those vectors to classify the news texts. Word2Vec learns word vectors from a large corpus; because the training objective exploits the context of each word within a sentence, the resulting vectors encode semantic information. In effect, Word2Vec turns sparse one-hot vectors into dense, low-dimensional vectors that carry meaning. There are two training schemes: CBOW (continuous bag of words) and Skip-gram. CBOW builds a fully connected neural network that uses the other n-1 words of a window to predict the remaining word, and the hidden representation learned for that word is taken as its word vector. Skip-gram works the other way around: it uses the center word to predict its surrounding context words. Either way, the resulting vectors capture the relationships between words; semantically similar words end up with similar vectors, so the vectors represent word meaning well.

Once Word2Vec vectors are available, one option is simply to sum (or average) the word vectors of a text to get a document representation and feed it to a machine learning model. This is similar in spirit to the bag-of-words approach in Part 3, but it inevitably discards a lot of information in the text. The alternative is to classify with a deep learning model. One choice is a recurrent network such as an LSTM, padding the sequences so the model input has a fixed length. Another is a task-specific architecture such as TextCNN; attention mechanisms can also be added. TextCNN is described in more detail below.
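As a rough sketch of the first option, averaged word vectors plus a simple classifier (the names `w2v`, `tokenized_texts` and `labels` are assumptions for illustration, not part of the Datawhale code):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # `w2v` is assumed to be a trained gensim Word2Vec model (see the training
    # sketch later in this post); `tokenized_texts` is a list of token lists and
    # `labels` the matching label ids -- both hypothetical names.
    def doc_vector(w2v, words):
        """Represent a document by the mean of its word vectors."""
        vecs = [w2v.wv[w] for w in words if w in w2v.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

    X = np.stack([doc_vector(w2v, words) for words in tokenized_texts])
    clf = LogisticRegression(max_iter=1000).fit(X, labels)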

The Skip-grams (SG) process

Trained on this data, the neural network outputs a probability distribution: for every word in the vocabulary, the probability that it is the output word for the given input word.
In other words, the output probabilities describe how likely each word in the vocabulary is to co-occur with the input word.

Both the input word and the output word are one-hot encoded, giving sparse vectors in which only a single position is 1.
To save computation, only the row of the weight matrix whose index corresponds to that 1 is actually used.
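The one-hot shortcut is easy to verify with a toy example (the sizes below are made up):

    import numpy as np

    vocab_size, embed_dim = 6, 4                # toy sizes, chosen arbitrarily
    W = np.random.rand(vocab_size, embed_dim)   # input-to-hidden weight matrix

    word_idx = 2                                # the single position that is 1
    one_hot = np.zeros(vocab_size)
    one_hot[word_idx] = 1.0

    # Multiplying by a one-hot vector just selects one row of W, so real
    # implementations replace the matrix product with a direct row lookup.
    assert np.allclose(one_hot @ W, W[word_idx])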

Skip-grams training

The Word2Vec model is a huge neural network (its weight matrices are very large).
Weight matrices with millions of entries combined with hundreds of millions of training samples would make naive training a disaster.

These problems are addressed as follows:

  1. Treat common word combinations or phrases as single 'words'.

  2. Subsample high-frequency words to reduce the number of training samples.

  3. Use 'negative sampling' in the objective, so that each training sample only updates a small fraction of the model weights, which lowers the computational cost.
    3.1 During negative sampling, a small set of negative words is chosen at random and only their weights are updated, together with the weights for the positive word.
    3.2 Negative words are drawn from a 'unigram distribution': the probability of a word being chosen as a negative sample depends on its frequency, and more frequent words are more likely to be picked.
    3.3 In the reference implementation there is a 100-million-element array called the 'unigram table', filled with word indices from the vocabulary. A word's negative-sampling probability × 100 million equals the number of times that word appears in the table; drawing a negative sample therefore only requires generating a random number between 0 and 100 million and taking the word at that index. Words with a higher sampling probability occupy more slots and are selected more often (a small sketch follows this list).

  4. Huffman tree: the input is n nodes with weights (w1, w2, ..., wn), and the output is the corresponding Huffman tree. The leaf nodes are then Huffman-coded: high-weight leaves sit close to the root and low-weight leaves sit far from it, so high-weight nodes get short codes and low-weight nodes get long codes. This minimises the weighted path length of the tree and matches the information-theoretic intuition that frequent words should have short codes. Construction:
    4.1 Treat (w1, w2, ..., wn) as a forest of n trees, each with a single node.
    4.2 Pick the two trees whose root weights are smallest and merge them into a new tree; they become its left and right subtrees, and the new root's weight is the sum of the two root weights.
    4.3 Remove the two merged trees from the forest and add the new tree.
    4.4 Repeat 4.2 and 4.3 until only one tree remains.
    4.5 In Word2Vec, a left branch is coded as 1 and a right branch as 0, and the left subtree's weight is required to be no smaller than the right subtree's.

  5. Hierarchical Softmax: to avoid computing a softmax over the whole vocabulary, Word2Vec replaces the mapping from the hidden layer to the output softmax layer with a Huffman tree. The tree is used as follows:
    5.1 Build the Huffman tree from the labels and their frequencies (the more frequent a label, the shorter its path).
    5.2 Each leaf of the Huffman tree corresponds to one label. For a word w:
    5.2.1. p - the path from the root to the leaf corresponding to w
    5.2.2. l - the number of nodes on path p
    5.2.3. p1, p2, ..., pl - the l nodes on path p, where p1 is the root and pl is the leaf corresponding to w
    5.2.4. d2, d3, ..., dl ∈ {0,1} - the Huffman code of w, made up of l-1 bits, where dj is the code of the j-th node on path p (the root has no code)
    5.2.5. θ1, θ2, ..., θ(l-1) ∈ R - the vectors attached to the non-leaf nodes on path p, where θj is the vector of the j-th non-leaf node
    5.3 A Huffman tree is a binary tree, so each step along the path is a binary classification. In Word2Vec, 1 denotes the negative class and 0 the positive class, and the classification uses the sigmoid function.
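To make item 3.3 concrete, here is a small sketch of a unigram table (far smaller than the real 10^8-entry table; the 0.75 exponent is the value used in the original word2vec implementation, and the word counts below are made up):

    import numpy as np

    def build_unigram_table(word_counts, table_size=1_000_000, power=0.75):
        """Fill a table with word indices in proportion to count ** power."""
        words = list(word_counts)
        weights = np.array([word_counts[w] for w in words], dtype=np.float64) ** power
        probs = weights / weights.sum()
        slots = np.maximum((probs * table_size).astype(int), 1)  # at least one slot each
        table = np.repeat(np.arange(len(words)), slots)
        return words, table

    # Drawing a negative sample is then just a random index into the table.
    word_counts = {'news': 50, 'stock': 30, 'sport': 15, 'rare': 1}
    words, table = build_unigram_table(word_counts, table_size=10_000)
    negative_word = words[table[np.random.randint(len(table))]]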

Training word vectors with Word2Vec

    '''
    model = Word2Vec(sentences, workers=num_workers, size=num_features)
    Parameters:
    sentences - the training corpus; it can be a plain list, but for large corpora BrownCorpus, Text8Corpus or LineSentence is recommended
    sg - training algorithm: 0 (the default) uses CBOW, 1 uses skip-gram
    size - dimensionality of the word vectors, default 100; a larger size needs more training data but can give better results
    window - maximum distance between the current word and the predicted word within a sentence
    alpha - learning rate
    seed - random seed
    min_count - vocabulary cutoff: words appearing fewer than min_count times are discarded, default 5
    max_vocab_size - RAM limit while building the vocabulary;
                        if the number of unique words exceeds it, the least frequent words are pruned. Roughly 1 GB of RAM per 10 million word types
    sample - threshold for randomly downsampling high-frequency words, default 1e-3, useful range (0, 1e-5)
    workers - number of worker threads used for training
    hs - hs=1 uses the hierarchical softmax trick, hs=0 uses negative sampling
    iter - number of iterations (epochs) over the corpus, default 5
    '''
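A minimal training sketch built on these parameters (assuming gensim 3.x, whose keyword is `size` rather than the newer `vector_size`, and assuming the competition file layout; the output path matches the word2vec.txt loaded by the TextCNN code below):

    import pandas as pd
    from gensim.models import Word2Vec

    # The anonymised news texts are space-separated word ids, so splitting on
    # whitespace is enough to tokenise them.
    df = pd.read_csv('../data/train_set.csv', sep='\t')
    sentences = [text.split() for text in df['text']]

    # sg=1 -> skip-gram; size / window / min_count follow the parameter list above.
    model = Word2Vec(sentences, size=100, window=5, min_count=5, sg=1, workers=8, iter=5)

    # Save in the plain-text word2vec format expected by load_pretrained_embs().
    model.wv.save_word2vec_format('../emb/word2vec.txt', binary=False)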

TextCNN

TextCNN is a model proposed in 2014. It applies a CNN to the word-vector input. The model architecture is shown below:
[Figure: TextCNN model architecture]
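For orientation, here is a minimal PyTorch sketch of the TextCNN idea: an embedding layer, parallel convolutions over several n-gram window sizes, max-over-time pooling, and a linear classifier (the sizes are illustrative rather than taken from the Datawhale code below):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TextCNN(nn.Module):
        def __init__(self, vocab_size, embed_dim=100, num_classes=14,
                     filter_sizes=(2, 3, 4), out_channels=100):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
            # One Conv2d per n-gram window size, each spanning the full embedding width.
            self.convs = nn.ModuleList([nn.Conv2d(1, out_channels, (fs, embed_dim))
                                        for fs in filter_sizes])
            self.fc = nn.Linear(out_channels * len(filter_sizes), num_classes)

        def forward(self, word_ids):                  # word_ids: batch x seq_len
            x = self.embed(word_ids).unsqueeze(1)     # batch x 1 x seq_len x embed_dim
            feats = [F.relu(conv(x)).squeeze(3) for conv in self.convs]       # batch x C x L'
            pooled = [F.max_pool1d(f, f.size(2)).squeeze(2) for f in feats]   # batch x C
            return self.fc(torch.cat(pooled, dim=1))  # batch x num_classes

    # Example: logits = TextCNN(vocab_size=5000)(torch.randint(1, 5000, (8, 50)))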

Datawhale's textCNN implementation

    import logging
    import random
    
    import numpy as np
    import torch
    
    logging.basicConfig(level=logging.INFO, format='%(asctime)-15s %(levelname)s: %(message)s')
    
    # set seed 
    seed = 666
    random.seed(seed)
    np.random.seed(seed)
    torch.cuda.manual_seed(seed)
    torch.manual_seed(seed)
    
    # set cuda
    gpu = 0
    use_cuda = gpu >= 0 and torch.cuda.is_available()
    if use_cuda:
        torch.cuda.set_device(gpu)
        device = torch.device("cuda", gpu)
    else:
        device = torch.device("cpu")
    logging.info("Use cuda: %s, gpu id: %d.", use_cuda, gpu)
Output:
    2020-07-17 11:37:20,835 INFO: Use cuda: True, gpu id: 0.
    # split data to 10 fold
    fold_num = 10
    data_file = '../data/train_set.csv'
    import pandas as pd
    
    
    
    def all_data2fold(fold_num, num=10000):
        fold_data = []
        f = pd.read_csv(data_file, sep='\t', encoding='UTF-8')
        texts = f['text'].tolist()[:num]
        labels = f['label'].tolist()[:num]

        total = len(labels)

        index = list(range(total))
        np.random.shuffle(index)

        all_texts = []
        all_labels = []
        for i in index:
            all_texts.append(texts[i])
            all_labels.append(labels[i])

        label2id = {}
        for i in range(total):
            label = str(all_labels[i])
            if label not in label2id:
                label2id[label] = [i]
            else:
                label2id[label].append(i)

        all_index = [[] for _ in range(fold_num)]
        for label, data in label2id.items():
            # print(label, len(data))
            batch_size = int(len(data) / fold_num)
            other = len(data) - batch_size * fold_num
            for i in range(fold_num):
                cur_batch_size = batch_size + 1 if i < other else batch_size
                # print(cur_batch_size)
                batch_data = [data[i * batch_size + b] for b in range(cur_batch_size)]
                all_index[i].extend(batch_data)

        batch_size = int(total / fold_num)
        other_texts = []
        other_labels = []
        other_num = 0
        start = 0
        for fold in range(fold_num):
            num = len(all_index[fold])
            texts = [all_texts[i] for i in all_index[fold]]
            labels = [all_labels[i] for i in all_index[fold]]

            if num > batch_size:
                fold_texts = texts[:batch_size]
                other_texts.extend(texts[batch_size:])
                fold_labels = labels[:batch_size]
                other_labels.extend(labels[batch_size:])
                other_num += num - batch_size
            elif num < batch_size:
                end = start + batch_size - num
                fold_texts = texts + other_texts[start: end]
                fold_labels = labels + other_labels[start: end]
                start = end
            else:
                fold_texts = texts
                fold_labels = labels

            assert batch_size == len(fold_labels)

            # shuffle
            index = list(range(batch_size))
            np.random.shuffle(index)

            shuffle_fold_texts = []
            shuffle_fold_labels = []
            for i in index:
                shuffle_fold_texts.append(fold_texts[i])
                shuffle_fold_labels.append(fold_labels[i])

            data = {'label': shuffle_fold_labels, 'text': shuffle_fold_texts}
            fold_data.append(data)

        logging.info("Fold lens %s", str([len(data['label']) for data in fold_data]))

        return fold_data
    
    
    fold_data = all_data2fold(10)
Output:
    2020-07-17 11:37:25,526 INFO: Fold lens [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]
    # build train, dev, test data
    fold_id = 9
    
    # dev
    dev_data = fold_data[fold_id]
    
    # train
    train_texts = []
    train_labels = []
    for i in range(0, fold_id):
        data = fold_data[i]
        train_texts.extend(data['text'])
        train_labels.extend(data['label'])
    
    train_data = {'label': train_labels, 'text': train_texts}
    
    # test
    test_data_file = '../data/test_a.csv'
    f = pd.read_csv(test_data_file, sep='\t', encoding='UTF-8')
    texts = f['text'].tolist()
    test_data = {'label': [0] * len(texts), 'text': texts}
    # build vocab
    from collections import Counter
    from transformers import BasicTokenizer
    
    basic_tokenizer = BasicTokenizer()
    
    
    class Vocab():
        def __init__(self, train_data):
            self.min_count = 5
            self.pad = 0
            self.unk = 1
            self._id2word = ['[PAD]', '[UNK]']
            self._id2extword = ['[PAD]', '[UNK]']

            self._id2label = []
            self.target_names = []

            self.build_vocab(train_data)

            reverse = lambda x: dict(zip(x, range(len(x))))
            self._word2id = reverse(self._id2word)
            self._label2id = reverse(self._id2label)

            logging.info("Build vocab: words %d, labels %d." % (self.word_size, self.label_size))

        def build_vocab(self, data):
            self.word_counter = Counter()

            for text in data['text']:
                words = text.split()
                for word in words:
                    self.word_counter[word] += 1

            for word, count in self.word_counter.most_common():
                if count >= self.min_count:
                    self._id2word.append(word)

            label2name = {0: '科技', 1: '股票', 2: '体育', 3: '娱乐', 4: '时政', 5: '社会', 6: '教育', 7: '财经',
                          8: '家居', 9: '游戏', 10: '房产', 11: '时尚', 12: '彩票', 13: '星座'}

            self.label_counter = Counter(data['label'])

            for label in range(len(self.label_counter)):
                count = self.label_counter[label]
                self._id2label.append(label)
                self.target_names.append(label2name[label])

        def load_pretrained_embs(self, embfile):
            with open(embfile, encoding='utf-8') as f:
                lines = f.readlines()
                items = lines[0].split()
                word_count, embedding_dim = int(items[0]), int(items[1])

            index = len(self._id2extword)
            embeddings = np.zeros((word_count + index, embedding_dim))
            for line in lines[1:]:
                values = line.split()
                self._id2extword.append(values[0])
                vector = np.array(values[1:], dtype='float64')
                embeddings[self.unk] += vector
                embeddings[index] = vector
                index += 1

            embeddings[self.unk] = embeddings[self.unk] / word_count
            embeddings = embeddings / np.std(embeddings)

            reverse = lambda x: dict(zip(x, range(len(x))))
            self._extword2id = reverse(self._id2extword)

            assert len(set(self._id2extword)) == len(self._id2extword)

            return embeddings

        def word2id(self, xs):
            if isinstance(xs, list):
                return [self._word2id.get(x, self.unk) for x in xs]
            return self._word2id.get(xs, self.unk)

        def extword2id(self, xs):
            if isinstance(xs, list):
                return [self._extword2id.get(x, self.unk) for x in xs]
            return self._extword2id.get(xs, self.unk)

        def label2id(self, xs):
            if isinstance(xs, list):
                return [self._label2id.get(x, self.unk) for x in xs]
            return self._label2id.get(xs, self.unk)

        @property
        def word_size(self):
            return len(self._id2word)

        @property
        def extword_size(self):
            return len(self._id2extword)

        @property
        def label_size(self):
            return len(self._id2label)
    
    
    vocab = Vocab(train_data)
Output:
    2020-07-17 11:37:26,603 INFO: PyTorch version 1.2.0 available.
    
    2020-07-17 11:37:29,673 INFO: Build vocab: words 4337, labels 14.
    # build module
    import torch.nn as nn
    import torch.nn.functional as F
    
    
    class Attention(nn.Module):
        def __init__(self, hidden_size):
            super(Attention, self).__init__()
            self.weight = nn.Parameter(torch.Tensor(hidden_size, hidden_size))
            self.weight.data.normal_(mean=0.0, std=0.05)

            self.bias = nn.Parameter(torch.Tensor(hidden_size))
            b = np.zeros(hidden_size, dtype=np.float32)
            self.bias.data.copy_(torch.from_numpy(b))

            self.query = nn.Parameter(torch.Tensor(hidden_size))
            self.query.data.normal_(mean=0.0, std=0.05)

        def forward(self, batch_hidden, batch_masks):
            # batch_hidden: b x len x hidden_size (2 * hidden_size of lstm)
            # batch_masks:  b x len

            # linear
            key = torch.matmul(batch_hidden, self.weight) + self.bias  # b x len x hidden

            # compute attention
            outputs = torch.matmul(key, self.query)  # b x len

            masked_outputs = outputs.masked_fill((1 - batch_masks).bool(), float(-1e32))

            attn_scores = F.softmax(masked_outputs, dim=1)  # b x len

            # For all-zero mask rows, -1e32 yields uniform 1/len scores after softmax and -inf would yield NaN, so explicitly reset those scores to 0
            masked_attn_scores = attn_scores.masked_fill((1 - batch_masks).bool(), 0.0)

            # sum weighted sources
            batch_outputs = torch.bmm(masked_attn_scores.unsqueeze(1), key).squeeze(1)  # b x hidden

            return batch_outputs, attn_scores
    
    
    # build word encoder
    word2vec_path = '../emb/word2vec.txt'
    dropout = 0.15
    
    
    class WordCNNEncoder(nn.Module):
        def __init__(self, vocab):
            super(WordCNNEncoder, self).__init__()
            self.dropout = nn.Dropout(dropout)
            self.word_dims = 100

            self.word_embed = nn.Embedding(vocab.word_size, self.word_dims, padding_idx=0)

            extword_embed = vocab.load_pretrained_embs(word2vec_path)
            extword_size, word_dims = extword_embed.shape
            logging.info("Load extword embed: words %d, dims %d." % (extword_size, word_dims))

            self.extword_embed = nn.Embedding(extword_size, word_dims, padding_idx=0)
            self.extword_embed.weight.data.copy_(torch.from_numpy(extword_embed))
            self.extword_embed.weight.requires_grad = False

            input_size = self.word_dims

            self.filter_sizes = [2, 3, 4]  # n-gram window
            self.out_channel = 100
            self.convs = nn.ModuleList([nn.Conv2d(1, self.out_channel, (filter_size, input_size), bias=True)
                                        for filter_size in self.filter_sizes])

        def forward(self, word_ids, extword_ids):
            # word_ids: sen_num x sent_len
            # extword_ids: sen_num x sent_len
            # batch_masks: sen_num x sent_len
            sen_num, sent_len = word_ids.shape

            word_embed = self.word_embed(word_ids)  # sen_num x sent_len x 100
            extword_embed = self.extword_embed(extword_ids)
            batch_embed = word_embed + extword_embed

            if self.training:
                batch_embed = self.dropout(batch_embed)

            batch_embed.unsqueeze_(1)  # sen_num x 1 x sent_len x 100

            pooled_outputs = []
            for i in range(len(self.filter_sizes)):
                filter_height = sent_len - self.filter_sizes[i] + 1
                conv = self.convs[i](batch_embed)
                hidden = F.relu(conv)  # sen_num x out_channel x filter_height x 1

                mp = nn.MaxPool2d((filter_height, 1))  # (filter_height, filter_width)
                pooled = mp(hidden).reshape(sen_num,
                                            self.out_channel)  # sen_num x out_channel x 1 x 1 -> sen_num x out_channel

                pooled_outputs.append(pooled)

            reps = torch.cat(pooled_outputs, dim=1)  # sen_num x total_out_channel

            if self.training:
                reps = self.dropout(reps)

            return reps
    
    
    # build sent encoder
    sent_hidden_size = 256
    sent_num_layers = 2
    
    
    class SentEncoder(nn.Module):
        def __init__(self, sent_rep_size):
            super(SentEncoder, self).__init__()
            self.dropout = nn.Dropout(dropout)

            self.sent_lstm = nn.LSTM(
                input_size=sent_rep_size,
                hidden_size=sent_hidden_size,
                num_layers=sent_num_layers,
                batch_first=True,
                bidirectional=True
            )

        def forward(self, sent_reps, sent_masks):
            # sent_reps:  b x doc_len x sent_rep_size
            # sent_masks: b x doc_len

            sent_hiddens, _ = self.sent_lstm(sent_reps)  # b x doc_len x hidden*2
            sent_hiddens = sent_hiddens * sent_masks.unsqueeze(2)

            if self.training:
                sent_hiddens = self.dropout(sent_hiddens)

            return sent_hiddens
    # build model
    class Model(nn.Module):
        def __init__(self, vocab):
            super(Model, self).__init__()
            self.sent_rep_size = 300
            self.doc_rep_size = sent_hidden_size * 2  # bidirectional LSTM doubles the hidden size
            self.all_parameters = {}
            parameters = []
            self.word_encoder = WordCNNEncoder(vocab)
            parameters.extend(list(filter(lambda p: p.requires_grad, self.word_encoder.parameters())))

            self.sent_encoder = SentEncoder(self.sent_rep_size)
            self.sent_attention = Attention(self.doc_rep_size)
            parameters.extend(list(filter(lambda p: p.requires_grad, self.sent_encoder.parameters())))
            parameters.extend(list(filter(lambda p: p.requires_grad, self.sent_attention.parameters())))

            self.out = nn.Linear(self.doc_rep_size, vocab.label_size, bias=True)
            parameters.extend(list(filter(lambda p: p.requires_grad, self.out.parameters())))

            if use_cuda:
                self.to(device)

            if len(parameters) > 0:
                self.all_parameters["basic_parameters"] = parameters

            logging.info('Build model with cnn word encoder, lstm sent encoder.')

            para_num = sum([np.prod(list(p.size())) for p in self.parameters()])
            logging.info('Model param num: %.2f M.' % (para_num / 1e6))

        def forward(self, batch_inputs):
            # batch_inputs(batch_inputs1, batch_inputs2): b x doc_len x sent_len
            # batch_masks : b x doc_len x sent_len
            batch_inputs1, batch_inputs2, batch_masks = batch_inputs
            batch_size, max_doc_len, max_sent_len = batch_inputs1.shape[0], batch_inputs1.shape[1], batch_inputs1.shape[2]
            batch_inputs1 = batch_inputs1.view(batch_size * max_doc_len, max_sent_len)  # sen_num x sent_len
            batch_inputs2 = batch_inputs2.view(batch_size * max_doc_len, max_sent_len)  # sen_num x sent_len
            batch_masks = batch_masks.view(batch_size * max_doc_len, max_sent_len)  # sen_num x sent_len

            sent_reps = self.word_encoder(batch_inputs1, batch_inputs2)  # sen_num x sent_rep_size

            sent_reps = sent_reps.view(batch_size, max_doc_len, self.sent_rep_size)  # b x doc_len x sent_rep_size
            batch_masks = batch_masks.view(batch_size, max_doc_len, max_sent_len)  # b x doc_len x max_sent_len
            sent_masks = batch_masks.bool().any(2).float()  # b x doc_len

            sent_hiddens = self.sent_encoder(sent_reps, sent_masks)  # b x doc_len x doc_rep_size
            doc_reps, atten_scores = self.sent_attention(sent_hiddens, sent_masks)  # b x doc_rep_size

            batch_outputs = self.out(doc_reps)  # b x num_labels

            return batch_outputs
    
    
    model = Model(vocab)
    # build optimizer
    learning_rate = 2e-4
    decay = .75
    decay_step = 1000
    
    
    class Optimizer:
        def __init__(self, model_parameters):
            self.all_params = []
            self.optims = []
            self.schedulers = []

            for name, parameters in model_parameters.items():
                if name.startswith("basic"):
                    optim = torch.optim.Adam(parameters, lr=learning_rate)
                    self.optims.append(optim)

                    l = lambda step: decay ** (step // decay_step)
                    scheduler = torch.optim.lr_scheduler.LambdaLR(optim, lr_lambda=l)
                    self.schedulers.append(scheduler)
                    self.all_params.extend(parameters)

                else:
                    raise Exception("no named parameters.")

            self.num = len(self.optims)

        def step(self):
            for optim, scheduler in zip(self.optims, self.schedulers):
                optim.step()
                scheduler.step()
                optim.zero_grad()

        def zero_grad(self):
            for optim in self.optims:
                optim.zero_grad()

        def get_lr(self):
            lrs = tuple(map(lambda x: x.get_lr()[-1], self.schedulers))
            lr = ' %.5f' * self.num
            res = lr % lrs
            return res
    # build dataset
    def sentence_split(text, vocab, max_sent_len=256, max_segment=16):
        words = text.strip().split()
        document_len = len(words)

        index = list(range(0, document_len, max_sent_len))
        index.append(document_len)

        segments = []
        for i in range(len(index) - 1):
            segment = words[index[i]: index[i + 1]]
            assert len(segment) > 0
            segment = [word if word in vocab._id2word else '<UNK>' for word in segment]
            segments.append([len(segment), segment])

        assert len(segments) > 0
        if len(segments) > max_segment:
            segment_ = int(max_segment / 2)
            return segments[:segment_] + segments[-segment_:]
        else:
            return segments
    
    
    def get_examples(data, vocab, max_sent_len=256, max_segment=8):
        label2id = vocab.label2id
        examples = []

        for text, label in zip(data['text'], data['label']):
            # label
            id = label2id(label)

            # words
            sents_words = sentence_split(text, vocab, max_sent_len, max_segment)
            doc = []
            for sent_len, sent_words in sents_words:
                word_ids = vocab.word2id(sent_words)
                extword_ids = vocab.extword2id(sent_words)
                doc.append([sent_len, word_ids, extword_ids])
            examples.append([id, len(doc), doc])

        logging.info('Total %d docs.' % len(examples))
        return examples
    # build loader
    
    def batch_slice(data, batch_size):
        batch_num = int(np.ceil(len(data) / float(batch_size)))
        for i in range(batch_num):
            cur_batch_size = batch_size if i < batch_num - 1 else len(data) - batch_size * i
            docs = [data[i * batch_size + b] for b in range(cur_batch_size)]

            yield docs
    
    
    def data_iter(data, batch_size, shuffle=True, noise=1.0):
    """
    randomly permute data, then sort by source length, and partition into batches
    ensure that the length of  sentences in each batch
    """
    
    batched_data = []
    if shuffle:
        np.random.shuffle(data)
    
        lengths = [example[1] for example in data]
        noisy_lengths = [- (l + np.random.uniform(- noise, noise)) for l in lengths]
        sorted_indices = np.argsort(noisy_lengths).tolist()
        sorted_data = [data[i] for i in sorted_indices]
    else:
        sorted_data = data
        
    batched_data.extend(list(batch_slice(sorted_data, batch_size)))
    
    if shuffle:
        np.random.shuffle(batched_data)
    
    for batch in batched_data:
        yield batch
    # some function
    from sklearn.metrics import f1_score, precision_score, recall_score
    
    
    def get_score(y_true, y_pred):
        y_true = np.array(y_true)
        y_pred = np.array(y_pred)
        f1 = f1_score(y_true, y_pred, average='macro')
        p = precision_score(y_true, y_pred, average='macro')
        r = recall_score(y_true, y_pred, average='macro')

        return str((reformat(p, 2), reformat(r, 2), reformat(f1, 2))), reformat(f1, 2)


    def reformat(num, n):
        return float(format(num, '0.' + str(n) + 'f'))
    # build trainer
    
    import time
    from sklearn.metrics import classification_report
    
    clip = 5.0
    epochs = 1
    early_stops = 3
    log_interval = 50
    
    test_batch_size = 128
    train_batch_size = 128
    
    save_model = './cnn.bin'
    save_test = './cnn.csv'
    
    class Trainer():
        def __init__(self, model, vocab):
            self.model = model
            self.report = True

            self.train_data = get_examples(train_data, vocab)
            self.batch_num = int(np.ceil(len(self.train_data) / float(train_batch_size)))
            self.dev_data = get_examples(dev_data, vocab)
            self.test_data = get_examples(test_data, vocab)  # test-set examples, used later by test()

            # criterion
            self.criterion = nn.CrossEntropyLoss()

            # label name
            self.target_names = vocab.target_names

            # optimizer
            self.optimizer = Optimizer(model.all_parameters)

            # count
            self.step = 0
            self.early_stop = -1
            self.best_train_f1, self.best_dev_f1 = 0, 0
            self.last_epoch = epochs

        def train(self):
            logging.info('Start training...')
            for epoch in range(1, epochs + 1):
                train_f1 = self._train(epoch)

                dev_f1 = self._eval(epoch)

                if self.best_dev_f1 <= dev_f1:
                    logging.info(
                        "Exceed history dev = %.2f, current dev = %.2f" % (self.best_dev_f1, dev_f1))
                    torch.save(self.model.state_dict(), save_model)

                    self.best_train_f1 = train_f1
                    self.best_dev_f1 = dev_f1
                    self.early_stop = 0
                else:
                    self.early_stop += 1
                    if self.early_stop == early_stops:
                        logging.info(
                            "Early stop in epoch %d, best train: %.2f, dev: %.2f" % (
                                epoch - early_stops, self.best_train_f1, self.best_dev_f1))
                        self.last_epoch = epoch
                        break

        def test(self):
            self.model.load_state_dict(torch.load(save_model))
            self._eval(self.last_epoch + 1, test=True)

        def _train(self, epoch):
            self.optimizer.zero_grad()
            self.model.train()

            start_time = time.time()
            epoch_start_time = time.time()
            overall_losses = 0
            losses = 0
            batch_idx = 1
            y_pred = []
            y_true = []
            for batch_data in data_iter(self.train_data, train_batch_size, shuffle=True):
                torch.cuda.empty_cache()
                batch_inputs, batch_labels = self.batch2tensor(batch_data)
                batch_outputs = self.model(batch_inputs)
                loss = self.criterion(batch_outputs, batch_labels)
                loss.backward()

                loss_value = loss.detach().cpu().item()
                losses += loss_value
                overall_losses += loss_value

                y_pred.extend(torch.max(batch_outputs, dim=1)[1].cpu().numpy().tolist())
                y_true.extend(batch_labels.cpu().numpy().tolist())

                nn.utils.clip_grad_norm_(self.optimizer.all_params, max_norm=clip)
                for optimizer, scheduler in zip(self.optimizer.optims, self.optimizer.schedulers):
                    optimizer.step()
                    scheduler.step()
                self.optimizer.zero_grad()

                self.step += 1

                if batch_idx % log_interval == 0:
                    elapsed = time.time() - start_time

                    lrs = self.optimizer.get_lr()
                    logging.info(
                        '| epoch {:3d} | step {:3d} | batch {:3d}/{:3d} | lr{} | loss {:.4f} | s/batch {:.2f}'.format(
                            epoch, self.step, batch_idx, self.batch_num, lrs,
                            losses / log_interval,
                            elapsed / log_interval))

                    losses = 0
                    start_time = time.time()

                batch_idx += 1

            overall_losses /= self.batch_num
            during_time = time.time() - epoch_start_time

            # reformat
            overall_losses = reformat(overall_losses, 4)
            score, f1 = get_score(y_true, y_pred)

            logging.info(
                '| epoch {:3d} | score {} | f1 {} | loss {:.4f} | time {:.2f}'.format(epoch, score, f1,
                                                                                      overall_losses,
                                                                                      during_time))
            if set(y_true) == set(y_pred) and self.report:
                report = classification_report(y_true, y_pred, digits=4, target_names=self.target_names)
                logging.info('\n' + report)

            return f1

        def _eval(self, epoch, test=False):
            self.model.eval()
            start_time = time.time()
            data = self.test_data if test else self.dev_data
            y_pred = []
            y_true = []
            with torch.no_grad():
                for batch_data in data_iter(data, test_batch_size, shuffle=False):
                    torch.cuda.empty_cache()
                    batch_inputs, batch_labels = self.batch2tensor(batch_data)
                    batch_outputs = self.model(batch_inputs)
                    y_pred.extend(torch.max(batch_outputs, dim=1)[1].cpu().numpy().tolist())
                    y_true.extend(batch_labels.cpu().numpy().tolist())

                score, f1 = get_score(y_true, y_pred)

                during_time = time.time() - start_time

                if test:
                    df = pd.DataFrame({'label': y_pred})
                    df.to_csv(save_test, index=False, sep=',')
                else:
                    logging.info(
                        '| epoch {:3d} | dev | score {} | f1 {} | time {:.2f}'.format(epoch, score, f1,
                                                                                      during_time))
                    if set(y_true) == set(y_pred) and self.report:
                        report = classification_report(y_true, y_pred, digits=4, target_names=self.target_names)
                        logging.info('\n' + report)

            return f1

        def batch2tensor(self, batch_data):
            '''
                [[label, doc_len, [[sent_len, [sent_id0, ...], [sent_id1, ...]], ...]]
            '''
            batch_size = len(batch_data)
            doc_labels = []
            doc_lens = []
            doc_max_sent_len = []
            for doc_data in batch_data:
                doc_labels.append(doc_data[0])
                doc_lens.append(doc_data[1])
                sent_lens = [sent_data[0] for sent_data in doc_data[2]]
                max_sent_len = max(sent_lens)
                doc_max_sent_len.append(max_sent_len)

            max_doc_len = max(doc_lens)
            max_sent_len = max(doc_max_sent_len)

            batch_inputs1 = torch.zeros((batch_size, max_doc_len, max_sent_len), dtype=torch.int64)
            batch_inputs2 = torch.zeros((batch_size, max_doc_len, max_sent_len), dtype=torch.int64)
            batch_masks = torch.zeros((batch_size, max_doc_len, max_sent_len), dtype=torch.float32)
            batch_labels = torch.LongTensor(doc_labels)

            for b in range(batch_size):
                for sent_idx in range(doc_lens[b]):
                    sent_data = batch_data[b][2][sent_idx]
                    for word_idx in range(sent_data[0]):
                        batch_inputs1[b, sent_idx, word_idx] = sent_data[1][word_idx]
                        batch_inputs2[b, sent_idx, word_idx] = sent_data[2][word_idx]
                        batch_masks[b, sent_idx, word_idx] = 1

            if use_cuda:
                batch_inputs1 = batch_inputs1.to(device)
                batch_inputs2 = batch_inputs2.to(device)
                batch_masks = batch_masks.to(device)
                batch_labels = batch_labels.to(device)

            return (batch_inputs1, batch_inputs2, batch_masks), batch_labels
    # train
    trainer = Trainer(model, vocab)
    trainer.train()
    # test
    trainer.test()
