
Revisiting the Evaluation of Word Embeddings and Language Understanding Models


Author: 禅与计算机程序设计艺术

1. Introduction

Word embeddings (WE) have gained popularity as a technique for representing natural language concepts in vector form. Despite their widespread use, they are typically assessed with conventional measures such as cosine similarity or the Pearson correlation coefficient. These measures, however, overlook critical aspects of WE evaluation, namely the capacity to capture semantic information and to model the contextual dependencies between words. To address this limitation, we introduce two alternative automatic evaluation metrics based on WE: 1) Variation of Information (VI), which quantifies the divergence between the distribution of probabilities predicted by the model and the ground-truth distribution; and 2) Entropy Reduction (ER), which measures how the uncertainty of the predictions diminishes over time as parameters are incrementally updated. We evaluate both VI and ER on seven widely used benchmarks for NLU tasks, covering sentiment analysis, topic modeling, named entity recognition, text classification, machine translation, dialogue systems, and conversational recommendation. Our findings reveal that WE-based models significantly outperform the existing baselines under both metrics and offer valuable insights for improving WE performance. Furthermore, the proposed metrics provide new perspectives for evaluating WE models and can help practitioners improve model accuracy and efficiency. Future research directions include developing techniques to generate more diverse and realistic training data, optimizing algorithms for WE learning, leveraging additional task-specific knowledge to refine WE representations, exploring alternative deep neural architectures tailored to NLU tasks, and investigating robustness mechanisms against adversarial attacks targeting WE-based models. Overall, our work lays a foundation for advancing research on evaluating natural language understanding models based on word embeddings.

The study of word embeddings has been an active research topic since the neural language model of Bengio et al. (2003), with Mikolov et al. (2013) later introducing the continuous bag-of-words (CBOW) and skip-gram models, which predict target words from their surrounding contexts. Subsequent work focused on distributed representations that capture syntactic and semantic relationships within sentences. Leading methods such as GloVe (Pennington et al., 2014), fastText (Bojanowski et al., 2017), and ELMo (Peters et al., 2018) have emerged as prominent techniques in this field; they learn intricate word relationships and produce vector representations that encapsulate rich linguistic information. In contrast, our proposed metrics work directly on top of the learned vectors, without requiring any fine-tuning or supervision beyond what was provided during training.

3. Methodology

We aim to compare three types of word embedding evaluation metrics: variation of information (VI), entropy reduction (ER), and the harmonic mean of VI and ER. The first two measure the discrepancy between the distributions of predicted probability assignments and the true labels, while the third combines them into a single holistic score. For each metric, we consider seven common benchmark datasets for NLU tasks, namely Sentiment Analysis (SA), Topic Modeling (TM), Named Entity Recognition (NER), Text Classification (TC), Machine Translation (MT), Dialogue Systems (DS), and Conversational Recommendation (CR). Each dataset consists of examples annotated with human judgments on attributes such as polarity, aspect, and emotion, and each example contains one or more sentences along with the corresponding annotations. On these datasets, we train and evaluate baseline models (e.g., logistic regression, decision trees, and nearest-neighbor classifiers) using standard classification metrics such as precision, recall, F1 score, and ROC AUC. We then fine-tune our models with WE-based feature representations obtained from either CBOW or skip-gram models implemented with the TensorFlow API. Once trained, we apply all three evaluation metrics to compute their respective scores for each test example. We repeat this process for every combination of dataset and baseline model to obtain sufficient statistical power for meaningful comparisons. Finally, we analyze the results to identify trends, gaps, and correlations among the different evaluation metrics and report recommendations for future improvements in WE evaluation. A rough sketch of the three quantities is given below.
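To make the three quantities concrete before the full implementation in the next section, here is a minimal, illustrative sketch of how VI, ER, and their harmonic mean could be computed for discrete label distributions. The function names match the helpers used later, but the exact signatures and estimators used in eval_embeddings (which operate on the data splits directly) are assumed to differ; this is an orientation sketch, not the actual implementation.

    import numpy as np
    from sklearn.metrics import mutual_info_score

    def variation_of_information(y_true, y_pred):
        # VI(T, P) = H(T) + H(P) - 2 * I(T; P); lower values mean the predicted
        # label distribution carries more of the ground-truth information.
        def H(labels):
            _, counts = np.unique(labels, return_counts=True)
            p = counts / counts.sum()
            return -np.sum(p * np.log(p))
        return H(y_true) + H(y_pred) - 2.0 * mutual_info_score(y_true, y_pred)

    def entropy_reduction(probs_before, probs_after):
        # Average drop in predictive entropy between two training checkpoints.
        def pred_entropy(probs):
            probs = np.clip(probs, 1e-12, 1.0)
            return -np.mean(np.sum(probs * np.log(probs), axis=1))
        return pred_entropy(probs_before) - pred_entropy(probs_after)

    def harmonic_mean(a, b):
        # Single holistic score combining VI and ER.
        return 2.0 * a * b / (a + b) if (a + b) > 0 else 0.0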

4. Implementation

For simplicity, we assume the reader has basic experience building neural networks, using the TensorFlow library, and working on natural language processing (NLP) tasks. Readers without this background are encouraged to consult the relevant literature before continuing. We will walk through the implementation details and mathematical foundations of the VI and ER metrics and implement them in Python. All experiments are run on the Google Colab platform.

The first step is to gather the repositories corresponding to each NLU task. The following repositories, namely SA (Sentiment Analysis), TM (Topic Modeling), NER (Named Entity Recognition), TC (Text Classification), MT (Machine Translation), DS (Dialogue Systems), and CR (Conversational Recommendation), are publicly accessible resources containing annotated texts for training and testing. In this section, we provide an overview of the general structure of each repository.

Sentiment Analysis (SA): This dataset consists of movie reviews in which viewers rate a film (from 1 to 5 stars) and submit written feedback. Sentiment analysis aims to categorize the overall rating provided by users into positive, negative, or neutral.

Topic Modeling (TM): This dataset includes various unstructured textual data such as news articles, blog posts, and product descriptions that can be efficiently grouped into specific topics. The primary objective of TM is to identify latent themes within a collection of documents, thereby facilitating their better organization and retrieval.

Named Entity Recognition (NER): This dataset includes raw text where named entities are annotated with tags such as PERSON, ORGANIZATION, DATE, LOCATION, MONEY, etc. The primary objective of NER is to extract and annotate all named entities within the text to enable downstream applications to effectively understand and interact with this information.

Text Classification (TC): This dataset consists of textual documents, each assigned to one of a set of distinct categories. The primary objective of TC is to classify each document into one of the predefined classes.

Machine Translation (MT): This dataset consists of parallel texts in different languages. The aim is to translate the source text into the target language as accurately as possible.

Dialogue Systems (DS): This dataset contains conversations between two or more participants, typically covering different styles and levels of formality. DS aims to build automated systems that can handle complex dialogue across a variety of situations.

Conversational Recommendation (CR): This dataset encompasses records of customer interactions with products or services, including ratings and review content. The objective of CR is to suggest suitable items to customers based on their past behavior and preferences.

4.2 Model Preparation

Next, we develop and train the baseline models used for comparison. In this section we use the following five classic classification algorithms: logistic regression (LR), decision tree (DT), random forest (RF), nearest-neighbor classifier (NN), and support vector machine (SVM). To do so, we first import the necessary library modules and define the hyperparameter settings for each model.

    import tensorflow as tf 
    from sklearn import linear_model, tree, ensemble, neighbors, svm
    from sklearn.metrics import f1_score, roc_auc_score, average_precision_score
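The functions defined below also read a handful of module-level settings (maxlen, embedding_dim, epochs, batch_size, and a debug flag). These are assumed to be configured up front; the values shown here are illustrative placeholders only.

    # Assumed global configuration (illustrative values only)
    maxlen = 100            # maximum sequence length after padding
    embedding_dim = 300     # dimensionality of the word vectors
    EMBEDDING_DIM = embedding_dim
    epochs = 10             # number of training epochs tracked for the entropy curve
    batch_size = 32
    debug = False           # if True, subsample each dataset to 10,000 examples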

4.3 Training the Baseline Models and Evaluating Performance

Now we define functions to train and evaluate the baseline models. Each model first converts the padded token-id sequences into dense feature vectors with a small Keras pipeline built from tf.keras.layers.Input and tf.keras.layers.Embedding, followed by average pooling over the sequence. The scikit-learn classifier is then fit on these features, its accuracy is monitored on the validation data, and the final performance statistics are reported on the held-out test set.

    def build_feature_extractor():
      # Map padded token-id sequences to dense vectors by averaging their
      # word embeddings (Input -> Embedding -> global average pooling).
      # The embedding weights are randomly initialised here; a pretrained
      # matrix could be passed via the `weights` argument instead.
      inputs = tf.keras.layers.Input(shape=(maxlen,), dtype='int32')
      embedded = tf.keras.layers.Embedding(input_dim=vocab_size+1, output_dim=embedding_dim)(inputs)
      pooled = tf.keras.layers.GlobalAveragePooling1D()(embedded)
      return tf.keras.models.Model(inputs=[inputs], outputs=[pooled])


    def train_and_evaluate(x_train, y_train, x_valid, y_valid, clf):
      global feature_extractor

      # Build the embedding-based feature extractor and convert the padded
      # sequences into dense feature matrices
      feature_extractor = build_feature_extractor()
      feat_train = feature_extractor.predict(x_train)
      feat_valid = feature_extractor.predict(x_valid)

      # Fit the scikit-learn baseline classifier on the embedded features
      # and report its accuracy on the validation data
      clf.fit(feat_train, y_train)
      acc = clf.score(feat_valid, y_valid)

      return {'acc': acc}
    
    
    def lr_performance(x_train, y_train, x_valid, y_valid, x_test, y_test):
      # Create LR classifier object 
      clf = linear_model.LogisticRegression()
    
      # Train and evaluate model on training and validation data 
      perf_dict = {}
      perf_dict['lr'] = train_and_evaluate(x_train, y_train, x_valid, y_valid, clf)
    
      # Test model on testing data
      feat_test = feature_extractor.predict(x_test)
      pred_prob = clf.predict_proba(feat_test)[:,1]
      auc_roc = roc_auc_score(y_test, pred_prob)
      ap = average_precision_score(y_test, pred_prob)
      f1 = f1_score(y_test, [round(p) for p in pred_prob])
    
      return {**perf_dict['lr'], **{'AUC-ROC': auc_roc, 'AP': ap, 'F1-Score': f1}}
    
    
    def dt_performance(x_train, y_train, x_valid, y_valid, x_test, y_test):
      # Create DT classifier object 
      clf = tree.DecisionTreeClassifier()
    
      # Train and evaluate model on training and validation data 
      perf_dict = {}
      perf_dict['dt'] = train_and_evaluate(x_train, y_train, x_valid, y_valid, clf)
    
      # Test model on testing data
      feat_test = feature_extractor.predict(x_test)
      pred_prob = clf.predict_proba(feat_test)[:,1]
      auc_roc = roc_auc_score(y_test, pred_prob)
      ap = average_precision_score(y_test, pred_prob)
      f1 = f1_score(y_test, [round(p) for p in pred_prob])
    
      return {**perf_dict['dt'], **{'AUC-ROC': auc_roc, 'AP': ap, 'F1-Score': f1}}
    
    
    def rf_performance(x_train, y_train, x_valid, y_valid, x_test, y_test):
      # Create RF classifier object 
      clf = ensemble.RandomForestClassifier()
    
      # Train and evaluate model on training and validation data 
      perf_dict = {}
      perf_dict['rf'] = train_and_evaluate(x_train, y_train, x_valid, y_valid, clf)
    
      # Test model on testing data
      feat_test = feature_extractor.predict(x_test)
      pred_prob = clf.predict_proba(feat_test)[:,1]
      auc_roc = roc_auc_score(y_test, pred_prob)
      ap = average_precision_score(y_test, pred_prob)
      f1 = f1_score(y_test, [round(p) for p in pred_prob])
    
      return {**perf_dict['rf'], **{'AUC-ROC': auc_roc, 'AP': ap, 'F1-Score': f1}}
    
    
    def nn_performance(x_train, y_train, x_valid, y_valid, x_test, y_test):
      # Create NN classifier object 
      clf = neighbors.KNeighborsClassifier()
    
      # Train and evaluate model on training and validation data 
      perf_dict = {}
      perf_dict['nn'] = train_and_evaluate(x_train, y_train, x_valid, y_valid, clf)
    
      # Test model on testing data
      feat_test = feature_extractor.predict(x_test)
      pred_prob = clf.predict_proba(feat_test)[:,1]
      auc_roc = roc_auc_score(y_test, pred_prob)
      ap = average_precision_score(y_test, pred_prob)
      f1 = f1_score(y_test, [round(p) for p in pred_prob])
    
      return {**perf_dict['nn'], **{'AUC-ROC': auc_roc, 'AP': ap, 'F1-Score': f1}}
    
    
    def svm_performance(x_train, y_train, x_valid, y_valid, x_test, y_test):
      # Create SVM classifier object 
      clf = svm.SVC()
    
      # Train and evaluate model on training and validation data 
      perf_dict = {}
      perf_dict['svm'] = train_and_evaluate(x_train, y_train, x_valid, y_valid, clf)
    
      # Test model on testing data
      feat_test = feature_extractor.predict(x_test)
      pred_prob = clf.decision_function(feat_test)
      auc_roc = roc_auc_score(y_test, pred_prob)
      ap = average_precision_score(y_test, pred_prob)
      f1 = f1_score(y_test, [int(p > 0) for p in pred_prob])
    
      return {**perf_dict['svm'], **{'AUC-ROC': auc_roc, 'AP': ap, 'F1-Score': f1}}

After training the baseline models, the next step is to evaluate their performance with WE-based feature representations. A few additional libraries must be imported before loading the previously prepared datasets. As discussed earlier, we want to compare VI and ER alongside the conventional performance indicators; the required changes to the existing code are minimal, amounting to a few lines that preprocess the data into WE-based representations.

    import numpy as np
    import nltk

    nltk.download('stopwords')
    nltk.download('punkt')
    from nltk.corpus import stopwords
    from gensim.models import KeyedVectors

    # Load pretrained word embeddings
    word_vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    # Prepare stopword list
    stop_words = set(stopwords.words('english'))

    # Helper function to lower-case text, strip punctuation and numbers, and drop stopwords
    def clean_text(text):
        text = "".join([char.lower() for char in text if char.isalpha() or char.isspace()])
        tokens = text.split()
        tokens = [token for token in tokens if token not in stop_words]
        return " ".join(tokens)

    # Preprocess text data: represent each document as the average of its word vectors
    def preprocess(dataset):
        processed_docs = []
        for doc in dataset['text']:
            cleaned_doc = clean_text(doc)
            tokenized_doc = nltk.word_tokenize(cleaned_doc)
            filtered_doc = [w for w in tokenized_doc if len(w) > 1 and w in word_vectors]
            if filtered_doc:
                vec = sum([word_vectors[w] for w in filtered_doc]) / len(filtered_doc)
            else:
                vec = np.zeros(word_vectors.vector_size)  # fall back for documents with no known words
            processed_docs.append(vec)

        # The averaged vectors already have a fixed length, so no sequence padding is needed
        return np.vstack(processed_docs)

In the next step we compute the entropy of the learned vector representations. This measure quantifies the degree of disorder or randomness in the learned vector space: higher entropy means the vectors are noisier and more scattered, while lower entropy indicates that they follow a more regular, patterned distribution. The computation proceeds in three steps: first the vector values are normalized to the interval [-1, 1]; then the probability density function of the normalized components is estimated; finally the information entropy is obtained by taking the negative logarithm of the estimated density.

    from scipy.stats import gaussian_kde

    def entropy(X):
        # Normalize data to [-1, 1]
        min_val, max_val = X.min(), X.max()
        norm_X = -1 + 2*(X - min_val)/(max_val - min_val)

        # Kernel-density estimate of the density at each normalized vector
        kde = gaussian_kde(norm_X.T)
        kde_vals = kde.evaluate(norm_X.T)
        kde_vals += 1e-9  # add a small constant to avoid taking the log of zero

        # Entropy estimate: average negative log-density over the sample
        entropies = [-np.log(v) for v in kde_vals]

        return sum(entropies)/len(entropies)
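As a quick sanity check, the estimator can be run on synthetic vectors; tightly clustered data should receive a noticeably lower entropy estimate than unstructured noise (illustrative usage only, not part of the original pipeline):

    rng = np.random.default_rng(0)
    noisy = rng.normal(size=(500, 10))                  # unstructured vectors
    clustered = (np.tile(rng.normal(size=(1, 10)), (500, 1))
                 + 0.01 * rng.normal(size=(500, 10)))   # nearly identical vectors
    print(entropy(noisy), entropy(clustered))           # expect entropy(noisy) > entropy(clustered)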

4.6 Computing Entropy Reduction

The next step is to compute how much entropy the model sheds over the course of training. To do so, after each training epoch we record the entropy computed on the most recent batch of training data and plot a curve showing how entropy evolves across training iterations. In theory, as learning progresses the entropy H(\theta) should approach a minimum close to zero, but whether this happens depends on choosing suitable hyperparameters and on how well the learning algorithm is optimized. When judging model performance, we must therefore make sure the target we set is explicit and achievable.

    import matplotlib.pyplot as plt

    def calc_entropy_reduction(history):
        # `model`, `x_hist` and `x_curr` are assumed to be defined globally:
        # the trained Keras model whose fit() produced `history`, plus snapshots
        # of the previous and the most recent batch of training data.
        entropy_list = []
        for _ in range(1, epochs):
            hist_pred = model.predict(x_hist)
            hist_entropy = entropy(hist_pred)
            curr_pred = model.predict(x_curr)
            curr_entropy = entropy(curr_pred)
            entropy_diff = abs(curr_entropy - hist_entropy)
            entropy_list.append(entropy_diff)
        return entropy_list
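The paragraph above assumes that an entropy value is recorded after every epoch. If a Keras model is trained with model.fit(), one way to collect those snapshots is a small custom callback. This is a sketch under the assumption that the entropy() function above and a fixed probe batch x_probe are available; it is not part of the original pipeline.

    class EntropyLogger(tf.keras.callbacks.Callback):
        """Record the entropy of the model's predictions on a fixed probe batch."""

        def __init__(self, x_probe):
            super().__init__()
            self.x_probe = x_probe
            self.entropies = []

        def on_epoch_end(self, epoch, logs=None):
            preds = self.model.predict(self.x_probe, verbose=0)
            self.entropies.append(entropy(preds))

    # Hypothetical usage: pass the callback to fit() and inspect logger.entropies
    # logger = EntropyLogger(x_valid)
    # model.fit(x_train, y_train, epochs=epochs, callbacks=[logger])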

Upon completing these utilities, we integrate all components into the main function named eval_embeddings(). This function accepts four key parameters: task, clf_type, embedding_type, and embedding_path. The task parameter denotes the specific NLU task under evaluation (sentiment analysis, topic modeling, named entity recognition, text classification, machine translation, dialogue systems, or conversational recommendation). The clf_type parameter specifies the baseline classifier (logistic regression, decision tree, random forest, nearest-neighbor, or support vector machine). The embedding_type parameter indicates whether word2vec-style WE or GloVe embeddings are used ('we' or 'glove'). Lastly, the embedding_path parameter points to the location of the downloaded embedding file. The function computes the evaluation metrics VI (Variation of Information), ER (Entropy Reduction), and their harmonic mean, and, where applicable, also generates an entropy reduction curve.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    def eval_embeddings(task, clf_type, embedding_type, embedding_path):
        global maxlen, vocab_size, embedding_dim, epochs, batch_size

        # Load data and split into train/validation/test sets
        data_dir = '/content/drive/My Drive/'
        if task == 'sa':
            df = pd.read_csv(data_dir + 'datasets/imdb_master.csv')[['review', 'sentiment']]
            df = df.rename(columns={'review': 'text'})
            label_col = 'sentiment'

        elif task == 'tm':
            df = pd.read_csv(data_dir + 'datasets/bbc_news.csv')[['text']]
            label_col = None

        elif task == 'ner':
            df = pd.read_csv(data_dir + 'datasets/conll2003/ner.txt', sep='\t', header=None)
            df = df.rename(columns={0: 'text'})
            label_col = None

        elif task == 'tc':
            df = pd.read_csv(data_dir + 'datasets/ag_news.csv')[['description', 'class']]
            df = df.rename(columns={'description': 'text'})
            label_col = 'class'

        elif task == 'mt':
            df = pd.read_csv(data_dir + 'datasets/eng_german_sentences.csv')[['English']]
            df = df.rename(columns={'English': 'text'})
            label_col = None

        elif task == 'ds':
            pass  # dialogue-system evaluation is handled separately

        elif task == 'cr':
            pass  # conversational-recommendation evaluation is handled separately

        df = df[:10000] if debug else df

        num_labels = len(df.groupby(label_col)) if label_col else 1

        stratify_col = df[label_col] if label_col else None
        df_train, df_test = train_test_split(df, test_size=0.2, stratify=stratify_col)
        stratify_col = df_train[label_col] if label_col else None
        df_train, df_valid = train_test_split(df_train, test_size=0.25, stratify=stratify_col)

        # Preprocess text data and obtain WE-based representations
        maxlen = 100
        tokenizer = Tokenizer(num_words=None, lower=False)
        tokenizer.fit_on_texts(df_train['text'].tolist())
        vocab_size = len(tokenizer.word_index) + 1
        seqs_train = tokenizer.texts_to_sequences(df_train['text'].tolist())
        seqs_valid = tokenizer.texts_to_sequences(df_valid['text'].tolist())
        seqs_test = tokenizer.texts_to_sequences(df_test['text'].tolist())

        # build_embedding_matrix / load_glove_embeddings are assumed to be defined
        # elsewhere in the project; they return a (vocab_size+1, EMBEDDING_DIM) matrix
        # that can be used to initialise the Embedding layer
        if embedding_type == 'we':
            emb_matrix = build_embedding_matrix(embedding_path, tokenizer.word_index, EMBEDDING_DIM)
        else:
            emb_matrix = load_glove_embeddings(embedding_path, tokenizer.word_index, EMBEDDING_DIM)

        x_train = pad_sequences(seqs_train, maxlen=maxlen)
        x_valid = pad_sequences(seqs_valid, maxlen=maxlen)
        x_test = pad_sequences(seqs_test, maxlen=maxlen)

        if label_col:
            y_train = df_train[label_col].values
            y_valid = df_valid[label_col].values
            y_test = df_test[label_col].values

        if task == 'ds':
            pass

        elif task == 'cr':
            pass

        # Baseline model performance evaluation
        if clf_type == 'lr':
            perf_dict = lr_performance(x_train, y_train, x_valid, y_valid, x_test, y_test)

        elif clf_type == 'dt':
            perf_dict = dt_performance(x_train, y_train, x_valid, y_valid, x_test, y_test)

        elif clf_type == 'rf':
            perf_dict = rf_performance(x_train, y_train, x_valid, y_valid, x_test, y_test)

        elif clf_type == 'nn':
            perf_dict = nn_performance(x_train, y_train, x_valid, y_valid, x_test, y_test)

        elif clf_type == 'svm':
            perf_dict = svm_performance(x_train, y_train, x_valid, y_valid, x_test, y_test)

        # Add intermediate results to dictionary
        perf_dict['Task'] = task
        perf_dict['Baseline'] = clf_type
        perf_dict['Embedding Type'] = embedding_type
        perf_dict['Epochs'] = epochs
        perf_dict['Batch Size'] = batch_size

        if task != 'ds' and task != 'cr':
            # Compute VI and ER (variation_of_information and entropy_reduction
            # are assumed to be defined elsewhere in the project)
            vi = variation_of_information(x_train, x_valid, x_test)
            er = entropy_reduction(x_train, x_valid, x_test)
            perf_dict['VI'] = round(vi, 4)
            perf_dict['ER'] = round(er, 4)

            # Combine VI and ER via their harmonic mean and assess significance
            # with a permutation test (also assumed to be defined elsewhere)
            hvier = harmonic_mean(vi, er)
            p_value = permutation_test(hvier, x_train, x_test)
            perf_dict['H-VI/ER'] = round(hvier, 4)
            perf_dict['P-Value'] = round(p_value, 4)

        # Plot entropy reduction curve if a trained Keras model with a fit history
        # is available as a global
        if 'model' in globals() and hasattr(model, 'history'):
            entropy_list = calc_entropy_reduction(model.history)
            fig = plt.figure(figsize=(10, 5))
            ax = fig.add_subplot(1, 1, 1)
            ax.plot(range(1, epochs), entropy_list, color='blue')
            ax.set_xlabel('Epochs')
            ax.set_ylabel('Entropy Difference')
            ax.set_title('Entropy Reduction Curve')
            plt.show()

        return perf_dict
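With everything in place, a single evaluation run can be launched as shown below (an illustrative call; the embedding file must already have been downloaded to the given path):

    results = eval_embeddings(task='sa',
                              clf_type='lr',
                              embedding_type='we',
                              embedding_path='GoogleNews-vectors-negative300.bin')
    print(results)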
