
Neural machine translation with attention: building and training a translation model


This notebook trains a sequence-to-sequence (seq2seq) model to translate Spanish text into English. It is an advanced example that assumes some knowledge of seq2seq models.

After training the model in this notebook, you will be able to input a Spanish sentence, such as "¿todavia estan en casa?", and get back its English translation: "are you still at home?"

The translation quality is reasonable for a toy example, but the generated attention plot is perhaps more interesting. It shows which parts of the input sentence the model attends to while translating:

spanish-english attention plot

Note: This example takes approximately 10 minutes to run on a single P100 GPU.

    from __future__ import absolute_import, division, print_function
    
    !pip install tensorflow-gpu==2.0.0-alpha0
    import tensorflow as tf
    
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    
    import unicodedata
    import re
    import numpy as np
    import os
    import io
    import time
    
Download and prepare the dataset

We'll use a language dataset provided by http://www.manythings.org/anki/. This dataset contains language translation pairs in the format:

May I borrow this book? ¿Puedo tomar prestado este libro?
There are a variety of languages available, but we'll use the English-Spanish dataset. For convenience, a copy of this dataset is hosted on Google Cloud, but you can also download your own copy. After downloading the dataset, here are the steps we'll take to prepare the data:

1. Add a start and end token to each sentence.
2. Clean the sentences by removing special characters.
3. Create a word index and a reverse word index (dictionaries mapping word → id and id → word).
4. Pad each sentence to a maximum length.

    # Download the file
    path_to_zip = tf.keras.utils.get_file(
        'spa-eng.zip',
        origin='http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip',
        extract=True)

    path_to_file = os.path.dirname(path_to_zip) + "/spa-eng/spa.txt"
    
    # Converts the unicode file to ascii
    def unicode_to_ascii(s):
      return ''.join(c for c in unicodedata.normalize('NFD', s)
                     if unicodedata.category(c) != 'Mn')


    def preprocess_sentence(w):
      w = unicode_to_ascii(w.lower().strip())

      # creating a space between a word and the punctuation following it
      # eg: "he is a boy." => "he is a boy ."
      # Reference:- https://stackoverflow.com/questions/3645931/python-padding-punctuation-with-white-spaces-keeping-punctuation
      w = re.sub(r"([?.!,¿])", r" \1 ", w)
      w = re.sub(r'[" "]+', " ", w)

      # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",")
      w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)

      w = w.rstrip().strip()

      # adding a start and an end token to the sentence
      # so that the model knows when to start and stop predicting.
      w = '<start> ' + w + ' <end>'
      return w
    
    en_sentence = u"May I borrow this book?"
    sp_sentence = u"¿Puedo tomar prestado este libro?"
    print(preprocess_sentence(en_sentence))
    print(preprocess_sentence(sp_sentence).encode('utf-8'))
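
If preprocessing works as intended, the printed output should look roughly like this (the Spanish line appears as a byte string because of the .encode('utf-8') call):

    <start> may i borrow this book ? <end>
    b'<start> \xc2\xbf puedo tomar prestado este libro ? <end>'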
    
    # 1. Remove the accents
    # 2. Clean the sentences
    # 3. Return word pairs in the format: [ENGLISH, SPANISH]
    def create_dataset(path, num_examples):
      lines = io.open(path, encoding='UTF-8').read().strip().split('\n')

      word_pairs = [[preprocess_sentence(w) for w in l.split('\t')] for l in lines[:num_examples]]

      return zip(*word_pairs)
    
    en, sp = create_dataset(path_to_file, None)
    print(en[-1])
    print(sp[-1])
    
    def max_length(tensor):
      return max(len(t) for t in tensor)
    
    def tokenize(lang):
      lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(
      filters='')
      lang_tokenizer.fit_on_texts(lang)
      
      tensor = lang_tokenizer.texts_to_sequences(lang)
      
      tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor,
                                                         padding='post')
      
      return tensor, lang_tokenizer
    
    def load_dataset(path, num_examples=None):
      # creating cleaned input, output pairs
      targ_lang, inp_lang = create_dataset(path, num_examples)

      input_tensor, inp_lang_tokenizer = tokenize(inp_lang)
      target_tensor, targ_lang_tokenizer = tokenize(targ_lang)

      return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer
    

Limit the size of the dataset to experiment faster (optional)

Training on the complete dataset of sentence pairs would take a long time. To train faster, we can limit the dataset to 30,000 sentence pairs (of course, translation quality degrades with less data):

    # Try experimenting with the size of that dataset
    num_examples = 30000
    input_tensor, target_tensor, inp_lang, targ_lang = load_dataset(path_to_file, num_examples)
    
    # Calculate max_length of the target tensors
    max_length_targ, max_length_inp = max_length(target_tensor), max_length(input_tensor)
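
If you later want to experiment with the full dataset again, the same loader can simply be called without a limit; a one-line sketch using the load_dataset defined above:

    # num_examples=None keeps every sentence pair in spa.txt
    input_tensor, target_tensor, inp_lang, targ_lang = load_dataset(path_to_file, None)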
    
    # Creating training and validation sets using an 80-20 split
    input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.2)
    
    # Show length
    len(input_tensor_train), len(target_tensor_train), len(input_tensor_val), len(target_tensor_val)
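
With num_examples = 30000 and an 80-20 split, this should report 24000 training examples and 6000 validation examples for both the input and target tensors.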
    
    def convert(lang, tensor):
      for t in tensor:
        if t != 0:
          print("%d ----> %s" % (t, lang.index_word[t]))
    
    print ("Input Language; index to word mapping")
    convert(inp_lang, input_tensor_train[0])
    print ()
    print ("Target Language; index to word mapping")
    convert(targ_lang, target_tensor_train[0])
    

Create a tf.data dataset

    BUFFER_SIZE = len(input_tensor_train)
    BATCH_SIZE = 64
    steps_per_epoch = len(input_tensor_train)//BATCH_SIZE
    embedding_dim = 256
    units = 1024
    vocab_inp_size = len(inp_lang.word_index)+1
    vocab_tar_size = len(targ_lang.word_index)+1
    
    dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
    dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
    
    example_input_batch, example_target_batch = next(iter(dataset))
    example_input_batch.shape, example_target_batch.shape
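
Both shapes should come back as (BATCH_SIZE, sequence_length), i.e. (64, max_length_inp) for the input batch and (64, max_length_targ) for the target batch.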
    

Write the encoder and decoder model

Here we'll implement an encoder-decoder model with attention, which you can read about in TensorFlow's Neural Machine Translation (seq2seq) tutorial. This example uses a more recent set of APIs and implements the attention equations from that tutorial. The diagram below shows that each input word is assigned a weight by the attention mechanism, which the decoder then uses to predict the next word in the sentence.

attention mechanism

The input is passed through an encoder model, which gives us the encoder output of shape (batch_size, max_length, hidden_size) and the encoder hidden state of shape (batch_size, hidden_size).

Here are the equations that are implemented:

attention equation 0
attention equation 1
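
Written out explicitly (a reconstruction that matches the pseudo-code and the BahdanauAttention layer below, with \bar{h}_s the encoder outputs and h_t the decoder hidden state), the equations are:

    \mathrm{score}(h_t, \bar{h}_s) = v_a^{\top} \tanh\big(W_1 \bar{h}_s + W_2 h_t\big)

    \alpha_{ts} = \frac{\exp\big(\mathrm{score}(h_t, \bar{h}_s)\big)}{\sum_{s'} \exp\big(\mathrm{score}(h_t, \bar{h}_{s'})\big)}

    c_t = \sum_{s} \alpha_{ts}\, \bar{h}_s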

We're using Bahdanau attention. Let's decide on notation before writing the simplified form:

  • FC = fully connected (dense) layer
  • EO = encoder output
  • H = hidden state
  • X = input to the decoder

And the pseudo-code:

  • score = FC(tanh(FC(EO) + FC(H)))
  • attention weights = softmax(score, axis=1). Softmax is applied to the last axis by default, but here we apply it to the 1st axis, since score has shape (batch_size, max_length, 1); max_length is the length of the input, and because we want to assign a weight to each input position, softmax is taken over that axis.
  • context vector = sum(attention weights * EO, axis=1). Axis 1 is chosen for the same reason as above.
  • embedding output = the input to the decoder X, passed through an embedding layer.
  • merged vector = concat(embedding output, context vector)
  • This merged vector is then given to the GRU.

The shapes of all the vectors at each step have been specified in the comments in the code:

    class Encoder(tf.keras.Model):
      def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.enc_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')

      def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state=hidden)
        return output, state

      def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.enc_units))
    
    encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)
    
    # sample input
    sample_hidden = encoder.initialize_hidden_state()
    sample_output, sample_hidden = encoder(example_input_batch, sample_hidden)
    print ('Encoder output shape: (batch size, sequence length, units) {}'.format(sample_output.shape))
    print ('Encoder Hidden state shape: (batch size, units) {}'.format(sample_hidden.shape))
    
    class BahdanauAttention(tf.keras.Model):
      def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

      def call(self, query, values):
        # hidden shape == (batch_size, hidden size)
        # hidden_with_time_axis shape == (batch_size, 1, hidden size)
        # we are doing this to perform addition to calculate the score
        hidden_with_time_axis = tf.expand_dims(query, 1)

        # score shape == (batch_size, max_length, 1)
        score = self.V(tf.nn.tanh(
            self.W1(values) + self.W2(hidden_with_time_axis)))

        # attention_weights shape == (batch_size, max_length, 1)
        # we get 1 at the last axis because we are applying score to self.V
        attention_weights = tf.nn.softmax(score, axis=1)

        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights
    
    attention_layer = BahdanauAttention(10)
    attention_result, attention_weights = attention_layer(sample_hidden, sample_output)
    
    print("Attention result shape: (batch size, units) {}".format(attention_result.shape))
    print("Attention weights shape: (batch_size, sequence_length, 1) {}".format(attention_weights.shape))
    
    class Decoder(tf.keras.Model):
      def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.dec_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc = tf.keras.layers.Dense(vocab_size)

        # used for attention
        self.attention = BahdanauAttention(self.dec_units)

      def call(self, x, hidden, enc_output):
        # enc_output shape == (batch_size, max_length, hidden_size)
        context_vector, attention_weights = self.attention(hidden, enc_output)

        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)

        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

        # passing the concatenated vector to the GRU
        output, state = self.gru(x)

        # output shape == (batch_size * 1, hidden_size)
        output = tf.reshape(output, (-1, output.shape[2]))

        # output shape == (batch_size, vocab)
        x = self.fc(output)

        return x, state, attention_weights
    
    decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)
    
    sample_decoder_output, _, _ = decoder(tf.random.uniform((64, 1)), 
                                      sample_hidden, sample_output)
    
    print ('Decoder output shape: (batch_size, vocab size) {}'.format(sample_decoder_output.shape))
    

Define the optimizer and the loss function

    optimizer = tf.keras.optimizers.Adam()
    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction='none')  # keep per-example losses so the padding mask below works
    
    def loss_function(real, pred):
      mask = tf.math.logical_not(tf.math.equal(real, 0))
      loss_ = loss_object(real, pred)
    
      mask = tf.cast(mask, dtype=loss_.dtype)
      loss_ *= mask
      
      return tf.reduce_mean(loss_)
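
As a quick sanity check of the padding mask, here is a minimal sketch (the batch below is hypothetical) showing that positions whose target id is 0, i.e. padding, contribute nothing to the loss:

    # Hypothetical toy batch: one sentence whose last two positions are padding (id 0)
    toy_real = tf.constant([[5, 3, 0, 0]])                # (1, 4) target ids
    toy_pred = tf.random.uniform((1, 4, vocab_tar_size))  # (1, 4, vocab) logits

    print(loss_function(toy_real[:, 1], toy_pred[:, 1]))  # real token  -> non-zero loss
    print(loss_function(toy_real[:, 2], toy_pred[:, 2]))  # padding (0) -> 0.0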
    

Checkpoints (object-based saving)

    checkpoint_dir = './training_checkpoints'
    checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
    checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)
    

Training

  1. Pass the input through the encoder, which returns the encoder output and the encoder hidden state.
  2. The encoder output, the encoder hidden state and the decoder input (which is the start token) are passed to the decoder.
  3. The decoder returns the predictions and the decoder hidden state.
  4. The decoder hidden state is then passed back into the model, and the predictions are used to calculate the loss.
  5. Use teacher forcing to decide the next input to the decoder.
  6. Teacher forcing is the technique in which the target word is passed as the next input to the decoder.
  7. The final step is to calculate the gradients and apply them with the optimizer to update the model.
    @tf.function
    def train_step(inp, targ, enc_hidden):
      loss = 0

      with tf.GradientTape() as tape:
        enc_output, enc_hidden = encoder(inp, enc_hidden)

        dec_hidden = enc_hidden

        dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)

        # Teacher forcing - feeding the target as the next input
        for t in range(1, targ.shape[1]):
          # passing enc_output to the decoder
          predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)

          loss += loss_function(targ[:, t], predictions)

          # using teacher forcing
          dec_input = tf.expand_dims(targ[:, t], 1)

      batch_loss = (loss / int(targ.shape[1]))

      variables = encoder.trainable_variables + decoder.trainable_variables

      gradients = tape.gradient(loss, variables)

      optimizer.apply_gradients(zip(gradients, variables))

      return batch_loss
    
    EPOCHS = 10

    for epoch in range(EPOCHS):
      start = time.time()

      enc_hidden = encoder.initialize_hidden_state()
      total_loss = 0

      for (batch, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
        batch_loss = train_step(inp, targ, enc_hidden)
        total_loss += batch_loss

        if batch % 100 == 0:
          print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                       batch,
                                                       batch_loss.numpy()))
      # saving (checkpoint) the model every 2 epochs
      if (epoch + 1) % 2 == 0:
        checkpoint.save(file_prefix=checkpoint_prefix)

      print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                          total_loss / steps_per_epoch))
      print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))
    

Translate

  • The evaluate function is similar to the training loop, except that we don't use teacher forcing here. The input to the decoder at each time step is its previous prediction, together with the hidden state and the encoder output.
  • Stop predicting when the model predicts the end token.
  • And store the attention weights for every time step.

Note: The encoder output is calculated only once for one input.

    def evaluate(sentence):
      attention_plot = np.zeros((max_length_targ, max_length_inp))

      sentence = preprocess_sentence(sentence)

      inputs = [inp_lang.word_index[i] for i in sentence.split(' ')]
      inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs],
                                                             maxlen=max_length_inp,
                                                             padding='post')
      inputs = tf.convert_to_tensor(inputs)

      result = ''

      hidden = [tf.zeros((1, units))]
      enc_out, enc_hidden = encoder(inputs, hidden)

      dec_hidden = enc_hidden
      dec_input = tf.expand_dims([targ_lang.word_index['<start>']], 0)

      for t in range(max_length_targ):
        predictions, dec_hidden, attention_weights = decoder(dec_input,
                                                             dec_hidden,
                                                             enc_out)

        # storing the attention weights to plot later on
        attention_weights = tf.reshape(attention_weights, (-1, ))
        attention_plot[t] = attention_weights.numpy()

        predicted_id = tf.argmax(predictions[0]).numpy()

        result += targ_lang.index_word[predicted_id] + ' '

        if targ_lang.index_word[predicted_id] == '<end>':
          return result, sentence, attention_plot

        # the predicted ID is fed back into the model
        dec_input = tf.expand_dims([predicted_id], 0)

      return result, sentence, attention_plot
    
    # function for plotting the attention weights
    def plot_attention(attention, sentence, predicted_sentence):
      fig = plt.figure(figsize=(10, 10))
      ax = fig.add_subplot(1, 1, 1)
      ax.matshow(attention, cmap='viridis')

      fontdict = {'fontsize': 14}

      ax.set_xticklabels([''] + sentence, fontdict=fontdict, rotation=90)
      ax.set_yticklabels([''] + predicted_sentence, fontdict=fontdict)

      plt.show()
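
If the word labels appear shifted relative to the attention cells, it usually helps to force one tick per position; a small, optional addition using matplotlib's ticker module, placed inside plot_attention before plt.show():

    from matplotlib import ticker

    # one tick per row/column so the labels line up with the matrix cells
    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))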
    
    def translate(sentence):
      result, sentence, attention_plot = evaluate(sentence)

      print('Input: %s' % (sentence).encode('utf-8'))
      print('Predicted translation: {}'.format(result))

      attention_plot = attention_plot[:len(result.split(' ')), :len(sentence.split(' '))]
      plot_attention(attention_plot, sentence.split(' '), result.split(' '))
    

Restore the latest checkpoint and test

    # restoring the latest checkpoint in checkpoint_dir
    checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))
    
    translate(u'hace mucho frio aqui.')
    
    translate(u'esta es mi vida.')
    
    translate(u'¿todavia estan en casa?')
    
    # wrong translation
    translate(u'trata de averiguarlo.')
    

Next steps

  • Download a different dataset to experiment with translations, for example English to German or English to French (see the sketch below).
  • Experiment with training on a larger dataset.
  • Train for more epochs to improve the model.
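
For example, a minimal sketch of pointing the same pipeline at an English-German pair. The archive name, URL and extracted file name are assumptions based on the site's naming convention, and note that preprocess_sentence strips characters outside a-z, so umlauts (ä, ö, ü) and ß would be dropped unless you adapt the regex:

    # Assumed English-German archive from the same site; the rest of the
    # pipeline (preprocessing, tokenizing, training) stays unchanged.
    path_to_zip = tf.keras.utils.get_file(
        'deu-eng.zip', origin='http://www.manythings.org/anki/deu-eng.zip',
        extract=True)

    # The extracted file name/location may differ; adjust path_to_file accordingly.
    path_to_file = os.path.dirname(path_to_zip) + "/deu.txt"

    input_tensor, target_tensor, inp_lang, targ_lang = load_dataset(path_to_file, num_examples)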
