
Text Classification (5): Text Classification with TextCNN


This post trains a TextCNN model to classify the IMDB Review data, with the full code annotated below. Dataset link: https://pan.baidu.com/s/1EYoqAcW238saKy3uQCfC3w; extraction code: ilze

    import re
    import warnings

    import pandas as pd
    import matplotlib.pyplot as plt

    import keras
    from keras import Input
    from keras.layers import (Conv1D, MaxPool1D, Dense, Flatten, Dropout,
                              Embedding, concatenate)
    from keras.models import Model
    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences
    from keras.utils import np_utils
    # from keras.utils import plot_model          # newer Keras
    from keras.utils.vis_utils import plot_model  # older Keras
    from nltk.stem import WordNetLemmatizer
    from nltk.corpus import stopwords

    warnings.filterwarnings('ignore')
    
    # Load the data: the labeled Kaggle training set, the larger imdb_master
    # dump, and the unlabeled test set. (error_bad_lines is deprecated in
    # newer pandas; use on_bad_lines='skip' there.)
    df1 = pd.read_csv('word2vec-nlp-tutorial/labeledTrainData.tsv', sep='\t', error_bad_lines=False)
    df2 = pd.read_csv('word2vec-nlp-tutorial/imdb_master.csv', encoding="latin-1")
    df3 = pd.read_csv('word2vec-nlp-tutorial/testData.tsv', sep='\t', error_bad_lines=False)

    # Keep only the review text and label, drop the unlabeled ('unsup') rows,
    # and map the string labels to 0/1.
    df2 = df2.drop(['Unnamed: 0', 'type', 'file'], axis=1)
    df2.columns = ["review", "sentiment"]
    df2 = df2[df2.sentiment != 'unsup']
    df2['sentiment'] = df2['sentiment'].map({'pos': 1, 'neg': 0})

    # Merge both labeled sources into one training frame.
    df = pd.concat([df1, df2]).reset_index(drop=True)
    
    train_texts = df.review
    train_labels = df.sentiment
    
    test_texts = df3.review
    
    def replace_abbreviations(text):
        """Lowercase each review and expand common English contractions."""
        texts = []
        for item in text:
            item = item.lower().replace("it's", "it is").replace("i'm", "i am").replace("he's", "he is").replace("she's", "she is")\
                .replace("we're", "we are").replace("they're", "they are").replace("you're", "you are").replace("that's", "that is")\
                .replace("this's", "this is").replace("can't", "can not").replace("don't", "do not").replace("doesn't", "does not")\
                .replace("we've", "we have").replace("i've", "i have").replace("isn't", "is not").replace("won't", "will not")\
                .replace("hasn't", "has not").replace("wasn't", "was not").replace("weren't", "were not").replace("let's", "let us")\
                .replace("didn't", "did not").replace("hadn't", "had not").replace("what's", "what is").replace("couldn't", "could not")\
                .replace("you'll", "you will").replace("you've", "you have")

            # Drop any remaining possessive/contraction suffixes.
            item = item.replace("'s", "")
            texts.append(item)

        return texts
    
    
    def clear_review(text):
        """Strip HTML line breaks and everything that is not a letter."""
        texts = []
        for item in text:
            item = item.replace("<br /><br />", "")
            item = re.sub("[^a-zA-Z]", " ", item.lower())
            texts.append(" ".join(item.split()))
        return texts

    def lemmatize_words(text):
        """Remove stop words and lemmatize the remaining words as verbs."""
        stop_words = stopwords.words("english")
        lemma = WordNetLemmatizer()
        texts = []
        for item in text:
            words = [lemma.lemmatize(w, pos='v') for w in item.split() if w not in stop_words]
            texts.append(" ".join(words))
        return texts

    def preprocess(text):
        text = replace_abbreviations(text)
        text = clear_review(text)
        text = lemmatize_words(text)
        return text

    train_texts = preprocess(train_texts)
    test_texts = preprocess(test_texts)
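
    # A quick illustration (hypothetical input) of the full pipeline:
    #   preprocess(["It's a great movie!<br /><br />He doesn't act well."])
    #   -> ["great movie act well"]
    # (contractions expanded, HTML and punctuation stripped, stop words
    # removed, verbs lemmatized)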
    
    # Fit one tokenizer on train + test so both share the same vocabulary,
    # keep the 6000 most frequent words, and pad/truncate every review to
    # 130 tokens.
    max_features = 6000
    texts = train_texts + test_texts
    tok = Tokenizer(num_words=max_features)
    tok.fit_on_texts(texts)
    list_tok = tok.texts_to_sequences(texts)

    maxlen = 130

    seq_tok = pad_sequences(list_tok, maxlen=maxlen)

    # The first len(train_texts) rows are the training sequences; one-hot the
    # 0/1 labels for the two-unit output layer.
    x_train = seq_tok[:len(train_texts)]
    y_train = train_labels
    y_train = np_utils.to_categorical(y_train, num_classes=2)
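
    # For example (indices are hypothetical -- they depend on corpus word
    # frequencies), tok.texts_to_sequences(["great movie"]) might return
    # [[53, 18]], and pad_sequences left-pads it with zeros to maxlen=130.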
    
    # Plot the training curves.
    def show_history(history):
        plt.figure(figsize=(10, 5))

        # Note: newer Keras versions log 'accuracy'/'val_accuracy' instead of
        # 'acc'/'val_acc'.
        plt.subplot(121)
        plt.plot(history.history['acc'], c='b', label='train')
        plt.plot(history.history['val_acc'], c='g', label='validation')
        plt.legend()
        plt.xlabel('epoch')
        plt.ylabel('accuracy')
        plt.title('Model accuracy')

        plt.subplot(122)
        plt.plot(history.history['loss'], c='b', label='train')
        plt.plot(history.history['val_loss'], c='g', label='validation')
        plt.legend()
        plt.xlabel('epoch')
        plt.ylabel('loss')
        plt.title('Model loss')

        plt.show()
    
    
    def test_cnn(y, maxlen, max_features, embedding_dims, filters=250):
        # Input: a sequence of maxlen word indices
        seq = Input(shape=[maxlen], name='x_seq')

        # Embedding layer
        emb = Embedding(max_features, embedding_dims)(seq)

        # TextCNN core: parallel Conv1D branches with window sizes 2/3/4; each
        # branch is max-pooled over the whole sequence (pool size maxlen-fsz+1,
        # i.e. max-over-time), flattened, and the branches are concatenated.
        convs = []
        filter_sizes = [2, 3, 4]
        for fsz in filter_sizes:
            conv1 = Conv1D(filters, kernel_size=fsz, activation='tanh')(emb)
            pool1 = MaxPool1D(maxlen - fsz + 1)(conv1)
            pool1 = Flatten()(pool1)
            convs.append(pool1)
        merge = concatenate(convs, axis=1)

        out = Dropout(0.5)(merge)
        output = Dense(32, activation='relu')(out)

        # softmax (not sigmoid) so the two-unit output is a proper probability
        # distribution for categorical_crossentropy
        output = Dense(units=y.shape[1], activation='softmax')(output)

        model = Model([seq], output)
        return model
    
    def model_train(model, x_train, y_train):
        # Stop as soon as validation loss stops improving (the original code
        # created this callback but never passed it to fit).
        early_stopping = keras.callbacks.EarlyStopping(monitor='val_loss', patience=0, verbose=0, mode='auto')
        history = model.fit(x_train, y_train, validation_split=0.2, batch_size=100,
                            epochs=20, callbacks=[early_stopping])
        return history
    
    model = test_cnn(y_train, maxlen, max_features, embedding_dims=128, filters=250)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
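
    # Optional sanity check: print the layer shapes. plot_model (imported
    # above) can also render the graph to an image, but it additionally
    # requires pydot and graphviz to be installed.
    model.summary()
    # plot_model(model, to_file='textcnn.png', show_shapes=True)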
    
    history = model_train(model, x_train, y_train)
    show_history(history)
(Figure: training/validation accuracy and loss curves produced by show_history.)
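
The post stops after training, but the tokenized test reviews are already sitting in `seq_tok`, so inference is one more step. A minimal sketch (the `argmax` over the two softmax units recovers the 0/1 sentiment label):

    # The test sequences are the tail of seq_tok (train sequences come first).
    x_test = seq_tok[len(train_texts):]
    probs = model.predict(x_test)        # shape (n_test, 2)
    pred_labels = probs.argmax(axis=1)   # 0 = negative, 1 = positive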
