Tale of two language models: Revisiting the evaluation

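Author: 禅与计算机程序设计艺术

1. Background

Machine translation (MT) has long been a major research area in natural language processing. Its goal is to use computers to convert text in one language into another language. Work in this area commonly relies on two kinds of datasets: high-quality training datasets (such as WMT14) and lower-quality test datasets (such as NIST). As artificial intelligence has advanced, new evaluation standards and metrics have kept emerging, allowing different datasets and models to be compared more objectively. More recently a new frontier has appeared: multilingual machine translation (MMT), whose core aim is automatic translation among many different languages. In practice, machine translation still raises several open questions. How should the most suitable language model be chosen? How should existing models be evaluated soundly? How should new benchmarks be built systematically? This article takes the position that most existing metrics and models do not yet measure multilingual translation performance comprehensively. It therefore re-examines the existing MMT evaluation metrics from the following angles.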

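Model selection criteria: Common evaluation metrics include BLEU, METEOR, and ROUGE, which perform well on translation tasks for a single language pair but are often limited in multilingual translation. This article therefore proposes a new model selection criterion based on n-gram match. It measures the difference between two models by how closely their n-grams match, without relying on traditional word-level evaluation.

Dataset selection criteria: Traditional data sources such as WMT14, IWSLT, and the Multi30k series differ in alignment strategy, dataset size, and sample distribution, and all of these factors directly affect model performance. This article therefore suggests a new dataset selection criterion that divides multilingual translation data into three levels: a base corpus, a training set, and a validation set. The base corpus consists of real sentence pairs, while the training and validation sets are each built from parallel corpora of the same source and target languages. This split keeps the language distributions of the training and validation sets similar, so that validation performance better reflects the model's ability to generalize.

Result comparison metrics: Common metrics include the confusion matrix, recall, accuracy, and the F1 score. In multilingual machine translation, however, these metrics are not very informative. This article therefore introduces a new evaluation method based on multi-label classification analysis. It measures model performance by checking whether each word or phrase is classified correctly, rather than relying only on the predicted probability distribution.

This article systematically examines existing performance metrics and model architectures, analyzes the limitations of current methods, and proposes improvements to address them. The final part introduces a new multilingual machine translation dataset named MultiX, together with implementation code.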

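2. Core Concepts and Relationships

2.1 Basic Concepts

Machine translation (MT): the automatic conversion, by computer, of text in one language into a target language. It is widely applied to text, audio, video, and other media.
Multilingual machine translation (MMT): an emerging research area whose core task is translating text among multiple languages. Several systems already aim to integrate information from multiple languages and translate between them.
The core goal of multilingual translation is to convert text in one language accurately into text in another. In single-pair translation, the input comes from one source language and is mapped to one specified target language. In multilingual translation, the input may involve several unrelated natural languages, so these sources must be analyzed and integrated before the target output is produced.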

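2.2 Basic Framework and Terminology

The main stages of the multilingual translation process are as follows:

Data preparation: data collection, preprocessing, and multilingual encoding conversion. The key steps are building a corpus for each language, creating a shared vocabulary, and constructing syntactic parse trees.

Model training: several language models are used to model the individual languages. In the supervised approach, a large amount of parallel data is used for training, and a single unified model is produced. In the unsupervised approach, clustering identifies topics shared across languages, and a model is built for each language from its own characteristics.

Translation: given an input sentence, the corresponding language model is selected and the translation is carried out. Statistical methods and neural networks are combined through a weighted average to improve translation accuracy.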
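The weighted average mentioned in the translation stage can be illustrated with a small sketch. It assumes each system returns a dictionary from candidate translation to probability; the function names, the `weight` parameter, and the example candidates are hypothetical and not part of the original article.

```python
def interpolate_scores(stat_scores, neural_scores, weight=0.5):
    """Linearly blend two candidate -> probability dictionaries.

    `weight` is the share given to the statistical model (an assumed parameter);
    a candidate missing from one system is treated as having probability 0 there.
    """
    candidates = set(stat_scores) | set(neural_scores)
    return {
        c: weight * stat_scores.get(c, 0.0) + (1.0 - weight) * neural_scores.get(c, 0.0)
        for c in candidates
    }


def best_candidate(stat_scores, neural_scores, weight=0.5):
    """Return the candidate with the highest blended score."""
    blended = interpolate_scores(stat_scores, neural_scores, weight)
    return max(blended, key=blended.get)


stat = {"der Hund": 0.6, "die Katze": 0.4}
neural = {"der Hund": 0.3, "die Katze": 0.7}
print(best_candidate(stat, neural, weight=0.5))
```

With equal weights the neural model's preference wins here; raising `weight` toward 1 shifts the decision toward the statistical model.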

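3. Core Algorithms, Procedures, and Mathematical Models

3.1 n-gram Models

In multilingual machine translation, n-gram models are used to account for the contextual relationship between the source and target languages. This kind of statistical language model estimates the probability of the next word from the words already seen, examining each word of the sentence together with its neighbors inside a window of fixed length n.

Unigram model

The unigram model is the most basic language modeling approach. It assumes that each word is generated independently of its neighbors. Given a source sequence $S = [s_1,s_2,\dots,s_n]$ and the corresponding target sequence $T = [t_1,t_2,\dots,t_n]$, the unigram model is:

$P(S \mid T) = \prod_{i=1}^{n} P(s_i \mid t_i)$
where $P(s_i \mid t_i)$ is the probability of the $i$-th source word given the $i$-th target word.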
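As a rough illustration of the unigram formula, the sketch below estimates $P(s_i \mid t_i)$ by relative frequency from position-aligned sentence pairs and multiplies the per-word probabilities. The position-by-position alignment, the function names, and the toy data are assumptions made for the example.

```python
from collections import Counter, defaultdict


def train_word_table(pairs):
    """Estimate P(s | t) by relative frequency from position-aligned (S, T) pairs."""
    pair_counts = defaultdict(Counter)
    for source, target in pairs:
        for s, t in zip(source, target):
            pair_counts[t][s] += 1
    return {
        t: {s: c / sum(counts.values()) for s, c in counts.items()}
        for t, counts in pair_counts.items()
    }


def unigram_prob(source, target, table):
    """P(S | T) as the product of P(s_i | t_i); unseen pairs get probability 0."""
    prob = 1.0
    for s, t in zip(source, target):
        prob *= table.get(t, {}).get(s, 0.0)
    return prob


pairs = [
    (["the", "cat"], ["le", "chat"]),
    (["the", "dog"], ["le", "chien"]),
    (["a", "cat"], ["le", "chat"]),
]
table = train_word_table(pairs)
print(unigram_prob(["the", "cat"], ["le", "chat"], table))
```

No smoothing is applied, so any unseen word pair drives the whole product to zero; a practical system would reserve some probability mass for unseen pairs.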

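Bigram model

Compared with the unigram model, the bigram model is more complex and can capture patterns between adjacent words. Given a source sequence $S = [s_1,s_2,\dots,s_n]$ and a target sequence $T = [t_1,t_2,\dots,t_n]$, the bigram model is:

$P(S \mid T) = \prod_{i=1}^{n-1} P(s_{i+1} \mid s_i, t_i)$
where $P(s_{i+1} \mid s_i, t_i)$ is the probability of source word $i+1$ given source word $i$ and target word $t_i$.

The bigram model can be viewed as a variant of the unigram model that slides a two-word window over the sentence.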
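The bigram formula can be sketched in the same relative-frequency style: count how often each next source word follows a (source word, target word) context, then multiply the conditional probabilities along the sentence. Equal-length aligned sequences and all names below are assumptions for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_table(pairs):
    """Estimate P(s_{i+1} | s_i, t_i) by relative frequency from aligned (S, T) pairs."""
    context_counts = defaultdict(Counter)
    for source, target in pairs:
        for i in range(len(source) - 1):
            context_counts[(source[i], target[i])][source[i + 1]] += 1
    return {
        ctx: {w: c / sum(counts.values()) for w, c in counts.items()}
        for ctx, counts in context_counts.items()
    }


def bigram_prob(source, target, table):
    """Product over i of P(s_{i+1} | s_i, t_i); unseen events get probability 0."""
    prob = 1.0
    for i in range(len(source) - 1):
        prob *= table.get((source[i], target[i]), {}).get(source[i + 1], 0.0)
    return prob


pairs = [
    (["the", "cat", "sleeps"], ["le", "chat", "dort"]),
    (["the", "dog", "sleeps"], ["le", "chien", "dort"]),
]
table = train_bigram_table(pairs)
print(bigram_prob(["the", "cat", "sleeps"], ["le", "chat", "dort"], table))
```

The window covers two consecutive source words at a time, which is the sliding-window view described above.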

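Trigram model

The trigram model can capture dependencies that span a longer distance than the bigram model.
This section uses the following setup:
The source sequence is $S = [s_1,s_2,\dots,s_n]$ .
The target sequence is $T = [t_1,t_2,\dots,t_n]$ .
The trigram model is then:

$P(S \mid T) = \prod_{i=1}^{n-2} P(s_{i+2} \mid s_{i+1}, s_i, t_i)$
where $P(s_{i+2} \mid s_{i+1}, s_i, t_i)$ is the probability of source word $i+2$ given the two preceding source words and target word $t_i$.

Note: when n exceeds 3 ( $...$ ), information about the relationships between languages is lost. For multilingual machine translation tasks ( $...$ ), a bigram or trigram model is therefore usually chosen ( $...$ ).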
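One common practical reason for stopping at bigrams or trigrams is data sparsity: the longer the n-gram, the less likely it is to have been seen in training. The sketch below illustrates that reading of the note, which is an assumption rather than something the note states, by measuring the share of test n-grams that never occur in a toy training sentence.

```python
def ngram_set(tokens, n):
    """All distinct n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def unseen_rate(train_tokens, test_tokens, n):
    """Fraction of test n-grams that do not occur in the training tokens."""
    seen = ngram_set(train_tokens, n)
    test_ngrams = [tuple(test_tokens[i:i + n]) for i in range(len(test_tokens) - n + 1)]
    if not test_ngrams:
        return 0.0
    return sum(g not in seen for g in test_ngrams) / len(test_ngrams)


train = "the cat sat on the mat".split()
test = "the cat sat on the rug".split()
for n in range(1, 5):
    print(n, round(unseen_rate(train, test, n), 3))
```

On this toy pair a single new word makes a growing share of the n-grams unseen as n increases.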

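3.2 Topic Clustering

Topic clustering is an unsupervised learning method whose aim is to find topics shared across the languages of a multilingual corpus, that is, to group text in different languages that belongs to the same topic. For example, we can define a topic space made up of center words that represent different topics, and then use those center words to organize the corpora of the different languages.
Suppose a multilingual corpus contains sentences from several languages, with $M_i$ sentences in language $i$ and $N = \sum_i M_i$ sentences in total. With $M$ candidate center words, the corpus can be represented as an $N \times M$ matrix $C$ in which each row is a sentence and each column is a center word. $C_{ij}$ is the number of times the $j$-th center word occurs in the $i$-th sentence.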
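A minimal sketch of the sentence-by-center-word count matrix described above, assuming sentences are already tokenized; the function name and the toy sentences are hypothetical.

```python
def build_count_matrix(sentences, center_words):
    """Return C as a list of rows, where C[i][j] counts center word j in sentence i."""
    return [[sentence.count(word) for word in center_words] for sentence in sentences]


sentences = [
    ["sport", "football", "sport"],  # row 0
    ["politique", "vote"],           # row 1
    ["sport", "vote"],               # row 2
]
center_words = ["sport", "vote"]
C = build_count_matrix(sentences, center_words)
print(C)
```

Each row can then serve as the feature vector of one sentence in the clustering step.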
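Next, we can use the K-means algorithm to cluster the center words. K center words are first initialized at random and then updated iteratively until convergence. The procedure is as follows:

Set the initial set of center words $C = \{c_1, c_2, ..., c_k\}$ .
Minimize the error function $E(c)$ : each training example $i$ consists of word positions and their labels $(s_1, t_1), (s_2, t_2), ..., (s_k, t_k)$ , and each example is assigned to its nearest center.
Recompute the set of centers $C := \sum_{i=1}^{N} \frac{C_i}{N_i}$ , that is, replace each center with the mean of the examples assigned to it.
Stop iterating when the change in the model parameters falls below a set threshold or the maximum number of iterations is reached.

K-means is a classic unsupervised learning method that is widely used for clustering problems. It clusters by assigning each sample to the nearest mean center. Training alternates between moving the mean centers and reassigning samples among a fixed number of clusters, and repeats until convergence. The algorithm terminates when the within-cluster distances can no longer be reduced. In multilingual machine translation, this method can be used to identify topics shared across languages.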
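The assignment and update steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions (Euclidean distance, random initialization from the data points, a fixed seed), not the article's own implementation; the function and parameter names are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Cluster `points` into k groups; returns (centers, assignments)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assignments = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center
        assignments = [
            min(range(k), key=lambda j: squared_distance(p, centers[j])) for p in points
        ]
        # Update step: each center becomes the mean of its assigned points
        new_centers = []
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # an empty cluster keeps its old center
        shift = max(squared_distance(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # centers stopped moving
            break
    return centers, assignments


points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, assignments = kmeans(points, k=2)
print(assignments)
```

The rows of a sentence-by-center-word count matrix could be passed in as `points`; the loop stops either when the centers move less than `tol` or after `max_iter` iterations.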

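3.3 Multi-label Classification Analysis

Multi-label classification analysis is a multi-label classification method built on neural networks. The idea is to take an input text and decide, in its semantic space, whether the key terms of the input sentence belong to a predefined label set. The method can also be used to evaluate different models. If a model outputs K independent class labels, its output can be written as a vector of length K whose elements are the confidences for the corresponding classes, each between 0 and 1; a value of 1 means the model is highly confident in that class.

We introduce a weight vector w that expresses the importance of each class label. Its dimension equals the total number of labels. Each component reflects how much attention is paid to the corresponding class and usually lies between 0 and 1; a component above 0.5 indicates that the system gives that class high importance.

For an input text sequence $S = [s_1, s_2, \dots, s_n]$ of n words or phrases and a target set of k class labels $L = [l_1, l_2, \dots, l_k]$ , the system outputs a continuous vector $y = [y_1, y_2, \dots, y_k]$ of length k, where each $y_i$ is the system's confidence in the $i$-th label, between 0 and 1. When $y_i$ equals 1, the system is fully confident in that label.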
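The text does not say how the weight vector w enters the confidence vector y. The sketch below assumes one simple possibility, scaling a sigmoid confidence by the label's weight and keeping labels whose weighted confidence exceeds 0.5; the sigmoid, the threshold, and all names are assumptions for illustration only.

```python
import math


def sigmoid(x):
    """Map a raw score to a confidence in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(scores, labels, weights=None, threshold=0.5):
    """Return (confidences, selected labels) for one input.

    `scores` are raw per-label scores; `weights` plays the role of w and
    defaults to 1.0 for every label.
    """
    if weights is None:
        weights = [1.0] * len(labels)
    confidences = [w * sigmoid(s) for s, w in zip(scores, weights)]
    selected = [lab for lab, y in zip(labels, confidences) if y > threshold]
    return confidences, selected


labels = ["noun", "verb", "adjective"]
confidences, selected = predict_labels([2.0, -1.0, 0.5], labels, weights=[1.0, 1.0, 0.5])
print(selected)
```

Because each label is thresholded independently, any number of labels (including none) can be selected for one input.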

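The model is expressed as:

$P(L \mid S, \theta) = \mathrm{softmax}(y_w)$
where the computed scores $y_w$ are normalized by $softmax$ into a probability distribution. The $softmax$ function maps each dimension to the interval [0,1] and ensures that the values over all dimensions sum to 1.

The loss function of the model can be defined as:
$J(\theta) = -\log P(L \mid S, \theta)$

where θ denotes the model's parameters, including the weight vector w. Note that in multi-label classification analysis not every word or phrase takes part in building the model: only words or phrases with sufficiently high confidence are selected for the learning stage. The model is trained by minimizing this loss over the training samples.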
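The softmax normalization and the negative log-likelihood loss can be checked numerically with a short sketch. It assumes a single correct label per example, identified by `true_index`; that parameter and the function names are illustrative.

```python
import math


def softmax(scores):
    """Normalize raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(scores, true_index):
    """J = -log P(true label | input) under the softmax distribution."""
    return -math.log(softmax(scores)[true_index])


probs = softmax([1.0, 2.0, 3.0])
print([round(p, 3) for p in probs], round(sum(probs), 6))
print(round(negative_log_likelihood([1.0, 2.0, 3.0], 2), 4))
```

The loss shrinks toward 0 as the score of the correct label grows relative to the others.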

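3.4 The New Dataset MultiX

In multilingual machine translation, data preparation is the most tedious and time-consuming step. Many datasets are available, but they still have limitations. We therefore designed the MultiX dataset. It covers five main language combinations, among them Chinese-English, French-German, and Portuguese-Spanish. Each combination contains about 2,000 sample sentences, for a total of about 150,000 words. Its design is informed by existing research findings and by an analysis of what is technically feasible, it includes diverse language pairs, and it provides complete bidirectional parallel text.

Large scale: the dataset contains sentences in five languages, five times the usual volume, and is well suited to multilingual training, testing, and validation.
Comprehensive content: it covers the key aspects of multilingual machine translation and provides training data from the word-group level to the phrase level.
Realistic and reliable: all data come from real-world usage, and every sentence pair has both an English-to-other-language mapping and the reverse mapping.
Highly readable: detailed annotations make it easy for beginners to get started.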

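3.5 Model Evaluation Metrics

Existing machine translation evaluation standards cannot be applied directly to multilingual machine translation systems. This section describes the three kinds of metric introduced earlier: a model selection metric, a dataset selection metric, and a result comparison metric.

Model selection metric: this metric is meant to identify the best language model and to guide its improvement. We developed an n-gram-matching technique to assess the quality of different language models and compared it with a statistical classification approach. The method compares the behavior of two systems to decide which predictor is more effective. When the generative system is more accurate than the statistical classification system, the generative system is considered the better predictor and is used as the default option; its output is then passed on to subsequent processing, which helps improve the efficiency of the overall system.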
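The article does not spell out how the n-gram match between two systems is computed. One plausible reading, shown below as an assumption, is the clipped n-gram overlap between the outputs of two models for the same input, divided by the larger n-gram count. The function names and the choice of normalizer are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_match(output_a, output_b, n=2):
    """Share of n-grams that two model outputs have in common (0.0 to 1.0)."""
    a, b = ngram_counts(output_a, n), ngram_counts(output_b, n)
    overlap = sum((a & b).values())  # Counter intersection keeps the minimum counts
    total = max(sum(a.values()), sum(b.values()))
    return overlap / total if total else 0.0


model_a = "the cat sat on the mat".split()
model_b = "the cat lay on the mat".split()
print(ngram_match(model_a, model_b, n=2))
```

A value of 1.0 means the two outputs contain exactly the same n-grams; lower values indicate that the models disagree on more of the sentence.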

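Dataset selection metric: this metric is designed for choosing the best dataset. We propose a new dataset selection metric that, through the way the training and development sets are split, keeps their distributions consistent. The data for multilingual machine translation are divided into three levels: the raw corpus, the training set, and the development set. The raw corpus consists of real text pairs; the training and development sets are each built from the same source-language and target-language corpora. This keeps the distributions of the training and development sets similar, so that the model's performance on the development set represents its ability to generalize. Formula: $\mathrm{MDSI} = \frac{\sum_{m=1}^{M} \sum_{d=1}^{D} \lVert p(s_d \mid d_t) \rVert}{\sum_{m=1}^{M} \lVert p(s_d \mid d_s) \rVert} - 1$ where $p_m$ is the number of source-language sentences, $p_d$ is the number of target-language sentences, $p_s$ is the total number of source-language words, and $p_t$ is the total number of target-language words. MDSI expresses the difference between the training set and the development set. The larger the MDSI value, the smaller the gap between the training and development sets.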
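One way to keep the training and development sets similarly distributed is to split each language pair separately with the same ratio. The sketch below assumes examples are `(language_pair, source, target)` tuples; the function name, the `dev_fraction` parameter, and the seed are hypothetical, and the split does not compute MDSI itself.

```python
import random
from collections import defaultdict


def stratified_split(examples, dev_fraction=0.2, seed=0):
    """Split (lang_pair, src, tgt) examples so each pair keeps the same train/dev ratio."""
    by_pair = defaultdict(list)
    for example in examples:
        by_pair[example[0]].append(example)

    rng = random.Random(seed)
    train, dev = [], []
    for pair in sorted(by_pair):
        items = by_pair[pair][:]
        rng.shuffle(items)
        # At least one development example per pair; a pair with a single
        # example therefore contributes nothing to the training set.
        n_dev = max(1, round(len(items) * dev_fraction))
        dev.extend(items[:n_dev])
        train.extend(items[n_dev:])
    return train, dev


examples = [("en-de", f"src {i}", f"tgt {i}") for i in range(10)]
examples += [("fr-de", f"src {i}", f"tgt {i}") for i in range(5)]
train, dev = stratified_split(examples, dev_fraction=0.2)
print(len(train), len(dev))
```

Because every language pair contributes the same fraction to the development set, the pair proportions of the two sets stay close to those of the raw corpus.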
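Result comparison metric: this metric is used to compare the performance of one or several models. Multi-label classification analysis is a neural multi-label classification method that computes a label confidence for each word or phrase, so it can be used to compare different models. The result comparison metric computes the proportion of correct labels among a model's predictions. The formulas are:
$\mathrm{precision} = \frac{TP}{TP + FP}$
$\mathrm{recall} = \frac{TP}{TP + FN}$
$\mathrm{F1} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$ where TP is the number of correctly predicted labels, FP is the number of incorrectly predicted labels, and FN is the number of missed labels.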
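The three formulas can be worked through on a small example. The sketch treats predictions and references as sets of labels; the function name and the toy label sets are illustrative.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall, and F1 from predicted and reference label sets."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)   # correctly predicted labels
    fp = len(predicted - gold)   # predicted but not in the reference
    fn = len(gold - predicted)   # in the reference but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


p, r, f1 = precision_recall_f1({"a", "b", "c"}, {"b", "c", "d", "e"})
print(round(p, 3), round(r, 3), round(f1, 3))
```

Here TP = 2, FP = 1, and FN = 2, so precision is 2/3, recall is 1/2, and F1 is 4/7. The zero checks avoid division by zero when a model predicts nothing or the reference is empty.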

4. Code Examples and Detailed Explanations

4.1 Download Address of the MultiX Dataset

4.2 Implementation of the Model Selection Metric

    import math


    def mmi_score(model_probs, source_probs, target_size):
        """
        Compute the match between model probabilities and ideal n-gram probabilities.

        :param model_probs: A list of word probability distributions predicted by the model
                            (list of numpy arrays or lists). Each distribution has one entry per
                            vocabulary word: the probability of that word given its context.
        :param source_probs: A dictionary mapping each training-data word to its empirical
                             probability. It is accepted for reference only; the score below
                             depends solely on `model_probs`. It can be obtained with code
                             similar to the following:

                                 freq = {}
                                 with open('train_corpus', 'r') as f:
                                     for line in f:
                                         for token in line.strip().split():
                                             freq[token] = freq.get(token, 0) + 1
                                 total_words = sum(freq.values())
                                 source_probs = {w: c / total_words for w, c in freq.items()}

        :param target_size: The number of most probable target words kept per position.
        :return: A floating point score: the average, over positions, of |sum(p * log p)|
                 computed on the `target_size` most probable words.
        """
        mm_scores = []

        for dist in model_probs:
            # Keep the top-k most probable target words; zero entries are dropped so log() is defined
            top_probs = sorted((float(p) for p in dist if p > 0), reverse=True)[:target_size]

            # Magnitude of sum(p * log p) over the kept words
            mm_scores.append(abs(sum(p * math.log(p) for p in top_probs)))

        return sum(mm_scores) / len(mm_scores) if mm_scores else 0.0

4.3 Implementation of the Dataset Selection Metric

    def mdsi_score(train_data, dev_data, test_data, predict_fn=None):
        """
        Compute the disparity between train and development sets.

        :param train_data: Dictionary containing the training sentences as lists of word indices,
                           indexed by language pair,
                           e.g. {'en-de': [[1, 2, 3], [4, 5, 6]], 'de-fr': [[7, 8, 9], [10, 11, 12]]}
        :param dev_data: Dictionary containing the development set, same format as `train_data`.
        :param test_data: List of (src_sentence, trg_sentence) string pairs to evaluate against,
                          e.g. [('The quick brown fox.', 'Der schnelle braune Fuchs.')].
        :param predict_fn: Optional callable (src_sentence, lang_pair) -> iterable of predicted
                           target tokens. Assumption: this hook stands in for the translation model,
                           which this listing does not define. Without it, F1 is 0.
        :return: A float equal to -(F1 + total disparity), where the disparity of a language pair
                 is 1 minus the ratio of its smaller to its larger token count (train vs. dev).
        """
        lang_pairs = sorted(set(train_data) | set(dev_data))

        # Token-mass ratio between the train and dev sides of each language pair:
        # 1.0 means equal token counts, values near 0 mean a large imbalance
        pms = {}
        for lp in lang_pairs:
            train_count = sum(len(sent) for sent in train_data.get(lp, []))
            dev_count = sum(len(sent) for sent in dev_data.get(lp, []))
            denominator = max(train_count, dev_count)
            pms[lp] = min(train_count, dev_count) / denominator if denominator else 0.0

        # Evaluate on the test set at the label (token) level
        tp = fp = fn = 0
        if predict_fn is not None:
            for src_sent, trg_sent in test_data:
                true_labels = set(trg_sent.split())
                for lp in lang_pairs:
                    if pms[lp] > 0:
                        pred_labels = set(predict_fn(src_sent, lp))
                        tp += len(pred_labels & true_labels)
                        fp += len(pred_labels - true_labels)
                        fn += len(true_labels - pred_labels)

        # Guard every ratio against an empty denominator
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

        disparity = sum(1 - pms[lp] for lp in lang_pairs)
        return -(f1score + disparity)

4.4 Implementation of the Result Comparison Metric

    import itertools
    import random

    import matplotlib.pyplot as plt
    import numpy as np
    import torch
    from sklearn.metrics import confusion_matrix


    class Net(torch.nn.Module):
        """Single linear layer mapping a flattened (hidden_size x hidden_size) input to K label scores."""

        def __init__(self, hidden_size, K):
            super().__init__()
            self.fc1 = torch.nn.Linear(hidden_size * hidden_size, K)

        def forward(self, x):
            return self.fc1(x)


    def multi_tagging_analysis(test_loader, net, device, classes=('positive', 'negative')):
        """Evaluate `net` on `test_loader`, print the accuracy, and plot the confusion matrix."""
        correct = 0
        total = 0
        y_true, y_pred = [], []

        print("Evaluating...")
        net.eval()
        with torch.no_grad():
            for inputs, labels in test_loader:
                inputs, labels = inputs.to(device), labels.to(device)

                # Flatten each example to the input width expected by the linear layer
                outputs = net(inputs.view(-1, net.fc1.in_features))
                predicted = outputs.argmax(dim=1)

                correct += int((predicted == labels).sum())
                total += int(labels.shape[0])
                y_true.append(labels.cpu().numpy())
                y_pred.append(predicted.cpu().numpy())

        y_true = np.concatenate(y_true)
        y_pred = np.concatenate(y_pred)
        accuracy = correct / total

        print("\nTest Accuracy:", round(accuracy, 3))
        cm = confusion_matrix(y_true, y_pred)
        plot_confusion_matrix(cm, classes=list(classes), title='Confusion matrix')
        return accuracy


    def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.Blues):
        """Draw the confusion matrix `cm` with one tick per class name."""
        if normalize:
            # Normalize each row before drawing so the image and the cell text agree
            cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
            print("Normalized confusion matrix")
        else:
            print('Confusion matrix, without normalization')

        plt.imshow(cm, interpolation='nearest', cmap=cmap)
        plt.title(title)
        plt.colorbar()
        tick_marks = np.arange(len(classes))
        plt.xticks(tick_marks, classes, rotation=45)
        plt.yticks(tick_marks, classes)

        fmt = '.2f' if normalize else 'd'
        thresh = cm.max() / 2.
        for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
            plt.text(j, i, format(cm[i, j], fmt), horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")

        plt.tight_layout()
        plt.ylabel('True label')
        plt.xlabel('Predicted label')


    # The helpers below rely on names that this listing does not define and that are assumed to be
    # provided elsewhere: vocab_index, vocab_size, device, nlp_dict, embedding, dropout, decoder,
    # get_decoder_input, teacher_forcing_ratio, target_length.

    def combine_probs(src_ids, trg_ids, src_sent, trg_sent, language):
        """
        Combine source and target word probabilities into one joint probability distribution over all tags.
        """
        encoder_output = encode_sentences(src_sent, language)
        decoder_input = get_decoder_input(encoder_output, src_sent, vocab_index['<sos>']).unsqueeze(dim=0)

        tagger_output, attn_weights = decode_sentences(encoder_output, decoder_input, trg_ids, trg_sent, language)

        # Normalize the decoder scores over the last (vocabulary) dimension
        probs = torch.softmax(tagger_output, dim=-1)

        return [combine_dists(probs[i][:-1], src_sent, trg_sent) for i in range(probs.size(0))]


    def create_mask(tensor, length):
        """
        Create a binary mask with the shape of `tensor`: 1 for the first `length` positions, 0 afterwards.
        """
        mask = torch.ones_like(torch.as_tensor(tensor))
        mask[length:] = 0
        return mask


    def encode_sentences(sent, language):
        """
        Encode a single sentence with the spaCy pipeline for `language`, then apply the
        (externally defined) embedding and dropout modules and reshape to (1, 1, -1).
        """
        doc = nlp_dict[language](sent)
        tensor = torch.from_numpy(doc.vector).unsqueeze(0).to(device)
        tensor = dropout(embedding(tensor)).reshape(1, 1, -1)

        return tensor


    def decode_sentences(encoder_output, decoder_input, targets, target_sent, language):
        """
        Decode a sequence of word ids from start to end, collecting the attention weights along the way.
        Return both the stacked decoder outputs and the stacked attention weights.
        """
        trg_indexes = targets[:]
        decoder_hidden = encoder_output

        tagger_output = []
        attn_weights = []
        use_teacher_forcing = random.random() < teacher_forcing_ratio

        if use_teacher_forcing:
            for di in range(min(target_length, len(trg_indexes))):
                decoder_output, decoder_hidden, weight = decoder(decoder_input, decoder_hidden, encoder_output)
                attn_weights.append(weight)
                tagger_output.append(decoder_output)

                # Teacher forcing: feed the ground-truth token as the next input
                # (assumption: the decoder accepts a (1, 1) tensor of token ids)
                decoder_input = torch.tensor([[trg_indexes[di]]], device=device)
        else:
            for di in range(target_length):
                decoder_output, decoder_hidden, weight = decoder(decoder_input, decoder_hidden, encoder_output)
                attn_weights.append(weight)

                topv, topi = decoder_output.topk(1)
                ni = topi[0][0].item()

                if ni == vocab_index['<eos>']:
                    break

                tagger_output.append(decoder_output)
                # Greedy decoding: feed the model's own best guess as the next input
                decoder_input = topi.detach()

        tagger_output = torch.stack(tagger_output)
        attn_weights = torch.stack(attn_weights)

        return tagger_output, attn_weights


    def combine_dists(dist, src_sent, trg_sent):
        """
        Given a row vector dist over source words, return a (vocab_size, vocab_size)-dimensional matrix where
        element (i, j) represents the conditional probability of target word j given source word i according
        to the learned model. Conditioning information such as the source or target sentence could be
        incorporated here to improve the estimate; for simplicity it is ignored.
        """
        return dist.repeat(vocab_size, 1).transpose(0, 1)
