Neural machine translation with attention: building and training a translation model
This notebook trains a sequence-to-sequence (seq2seq) model for Spanish-to-English translation. This is an advanced example that assumes some knowledge of seq2seq models.
After training the model in this notebook, you will be able to input a Spanish sentence, such as "¿todavia estan en casa?", and get back its English translation: "are you still at home?"
The translation quality is reasonable for a toy example, but the generated attention plot is perhaps more interesting. It shows which parts of the input sentence had the model's attention while translating:

Note: This example takes approximately 10 minutes to run on a single P100 GPU.
from __future__ import absolute_import, division, print_function
!pip install tensorflow-gpu==2.0.0-alpha0
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import unicodedata
import re
import numpy as np
import os
import io
import time
Download and prepare the dataset
We'll use a language dataset provided by http://www.manythings.org/anki/. This dataset contains language translation pairs in the format:
May I borrow this book? ¿Puedo tomar prestado este libro?
There are a variety of languages available, but we'll use the English-Spanish dataset. For convenience, we've hosted a copy of this dataset on Google Cloud, but you can also download your own copy. After downloading the dataset, here are the steps we'll take to prepare the data:
1. Add a start and end token to each sentence.
2. Clean the sentences by removing special characters.
3. Create a word index and reverse word index (dictionaries mapping from word → id and id → word).
4. Pad each sentence to a maximum length (a toy illustration of steps 3 and 4 follows this list).
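As a minimal sketch, not part of the original notebook, here is what steps 3 and 4 look like for a single toy sentence (the vocabulary and the padded length of 11 are made up for illustration):
# Toy illustration of a word index / reverse word index and padding (hypothetical values)
sentence = '<start> may i borrow this book ? <end>'
words = sentence.split(' ')
word_index = {w: i + 1 for i, w in enumerate(dict.fromkeys(words))}  # word -> id, 0 reserved for padding
index_word = {i: w for w, i in word_index.items()}                   # id -> word
ids = [word_index[w] for w in words]
padded = ids + [0] * (11 - len(ids))                                 # pad to an assumed max length of 11
print(word_index, padded)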
# Download the file
path_to_zip = tf.keras.utils.get_file(
    'spa-eng.zip', origin='http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip',
    extract=True)
path_to_file = os.path.dirname(path_to_zip)+"/spa-eng/spa.txt"
# Converts the unicode file to ascii
def unicode_to_ascii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
def preprocess_sentence(w):
    w = unicode_to_ascii(w.lower().strip())

    # creating a space between a word and the punctuation following it
    # eg: "he is a boy." => "he is a boy ."
    # Reference:- https://stackoverflow.com/questions/3645931/python-padding-punctuation-with-white-spaces-keeping-punctuation
    w = re.sub(r"([?.!,¿])", r" \1 ", w)
    w = re.sub(r'[" "]+', " ", w)

    # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",")
    w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)

    w = w.rstrip().strip()

    # adding a start and an end token to the sentence
    # so that the model knows when to start and stop predicting.
    w = '<start> ' + w + ' <end>'
    return w
en_sentence = u"May I borrow this book?"
sp_sentence = u"¿Puedo tomar prestado este libro?"
print(preprocess_sentence(en_sentence))
print(preprocess_sentence(sp_sentence).encode('utf-8'))
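For reference, the expected output of the cell above (derived from the preprocessing code, not copied from a saved run) is:
<start> may i borrow this book ? <end>
b'<start> \xc2\xbf puedo tomar prestado este libro ? <end>'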
# 1. Remove the accents
# 2. Clean the sentences
# 3. Return word pairs in the format: [ENGLISH, SPANISH]
def create_dataset(path, num_examples):
    lines = io.open(path, encoding='UTF-8').read().strip().split('\n')

    word_pairs = [[preprocess_sentence(w) for w in l.split('\t')] for l in lines[:num_examples]]

    return zip(*word_pairs)
en, sp = create_dataset(path_to_file, None)
print(en[-1])
print(sp[-1])
def max_length(tensor):
    return max(len(t) for t in tensor)
def tokenize(lang):
    lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(
        filters='')
    lang_tokenizer.fit_on_texts(lang)

    tensor = lang_tokenizer.texts_to_sequences(lang)

    tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor,
                                                           padding='post')

    return tensor, lang_tokenizer
def load_dataset(path, num_examples=None):
    # creating cleaned input, output pairs
    targ_lang, inp_lang = create_dataset(path, num_examples)

    input_tensor, inp_lang_tokenizer = tokenize(inp_lang)
    target_tensor, targ_lang_tokenizer = tokenize(targ_lang)

    return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer
Limit the size of the dataset to experiment faster (optional)
Training on the complete dataset takes a long time. To train faster, we can limit the dataset to 30,000 sentence pairs (of course, translation quality degrades with less data):
# Try experimenting with the size of that dataset
num_examples = 30000
input_tensor, target_tensor, inp_lang, targ_lang = load_dataset(path_to_file, num_examples)
# Calculate max_length of the target tensors
max_length_targ, max_length_inp = max_length(target_tensor), max_length(input_tensor)
# Creating training and validation sets using an 80-20 split
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.2)
# Show length
len(input_tensor_train), len(target_tensor_train), len(input_tensor_val), len(target_tensor_val)
def convert(lang, tensor):
    for t in tensor:
        if t != 0:
            print("%d ----> %s" % (t, lang.index_word[t]))
print ("Input Language; index to word mapping")
convert(inp_lang, input_tensor_train[0])
print ()
print ("Target Language; index to word mapping")
convert(targ_lang, target_tensor_train[0])
Create a tf.data dataset
BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
steps_per_epoch = len(input_tensor_train)//BATCH_SIZE
embedding_dim = 256
units = 1024
vocab_inp_size = len(inp_lang.word_index)+1
vocab_tar_size = len(targ_lang.word_index)+1
dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
example_input_batch, example_target_batch = next(iter(dataset))
example_input_batch.shape, example_target_batch.shape
Write the encoder and decoder model
Here we'll implement an encoder-decoder model with attention, which you can read about in the TensorFlow Neural Machine Translation (seq2seq) tutorial. This example uses a more recent set of APIs and implements the attention equations from that tutorial. The diagram below shows that each input word is assigned a weight by the attention mechanism, which is then used by the decoder to predict the next word in the sentence.

The input is put through an encoder model which gives us the encoder output of shape (batch_size, max_length, hidden_size) and the encoder hidden state of shape (batch_size, hidden_size).
Here are the equations that are used:
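The equation images embedded in the original notebook are not reproduced here; as a reference, the following is a reconstruction of the attention equations from the referenced seq2seq tutorial (Bahdanau's additive attention), where $h_t$ plays the role of the decoder hidden state H and $\bar h_s$ the encoder output EO at position $s$:
$$\mathrm{score}(h_t, \bar h_s) = v_a^{\top} \tanh\left(W_1 h_t + W_2 \bar h_s\right)$$
$$\alpha_{ts} = \frac{\exp\big(\mathrm{score}(h_t, \bar h_s)\big)}{\sum_{s'=1}^{S} \exp\big(\mathrm{score}(h_t, \bar h_{s'})\big)} \quad \text{(attention weights)}$$
$$c_t = \sum_{s} \alpha_{ts}\, \bar h_s \quad \text{(context vector)}$$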


We're using Bahdanau attention. Let's decide on some notation before writing the simplified form:
- FC = fully connected (dense) layer
- EO = encoder output
- H = hidden state
- X = input to the decoder
And the pseudo-code:
- score = FC(tanh(FC(EO) + FC(H)))
- attention weights = softmax(score, axis = 1). Softmax by default is applied on the last axis, but here we want to apply it on the 1st axis, since the shape of score is (batch_size, max_length, hidden_size). max_length is the length of our input, and since we are trying to assign a weight to each input position, softmax should be applied on that axis (see the short shape illustration after this list).
- context vector = sum(attention weights * EO, axis = 1). Same reason as above for choosing axis 1.
- embedding output = the input to the decoder X is passed through an embedding layer.
- merged vector = concat(embedding output, context vector)
- This merged vector is then given to the GRU.
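A minimal, self-contained sketch (not part of the original notebook; shapes and values are made up) showing why the softmax runs over axis 1 and how the context vector shape comes out:
import tensorflow as tf

# Illustrative shapes only; values are random.
batch_size, max_length, hidden_size = 4, 7, 16
score = tf.random.normal((batch_size, max_length, 1))                 # stand-in for FC(tanh(FC(EO) + FC(H)))
attention_weights = tf.nn.softmax(score, axis=1)                      # normalize across the max_length axis
print(tf.reduce_sum(attention_weights, axis=1))                       # ~1.0 for every example in the batch

enc_output = tf.random.normal((batch_size, max_length, hidden_size))  # stand-in for EO
context_vector = tf.reduce_sum(attention_weights * enc_output, axis=1)
print(context_vector.shape)                                           # (4, 16) == (batch_size, hidden_size)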
The shapes of all the vectors at each step have been specified in the comments in the code:
class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.enc_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')

    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state=hidden)
        return output, state

    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.enc_units))
encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)
# sample input
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_hidden = encoder(example_input_batch, sample_hidden)
print ('Encoder output shape: (batch size, sequence length, units) {}'.format(sample_output.shape))
print ('Encoder Hidden state shape: (batch size, units) {}'.format(sample_hidden.shape))
class BahdanauAttention(tf.keras.Model):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # hidden shape == (batch_size, hidden size)
        # hidden_with_time_axis shape == (batch_size, 1, hidden size)
        # we are doing this to perform addition to calculate the score
        hidden_with_time_axis = tf.expand_dims(query, 1)

        # score shape == (batch_size, max_length, 1)
        # the tensor before applying self.V has shape (batch_size, max_length, units)
        score = self.V(tf.nn.tanh(
            self.W1(values) + self.W2(hidden_with_time_axis)))

        # attention_weights shape == (batch_size, max_length, 1)
        # we get 1 at the last axis because we are applying score to self.V
        attention_weights = tf.nn.softmax(score, axis=1)

        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights
attention_layer = BahdanauAttention(10)
attention_result, attention_weights = attention_layer(sample_hidden, sample_output)
print("Attention result shape: (batch size, units) {}".format(attention_result.shape))
print("Attention weights shape: (batch_size, sequence_length, 1) {}".format(attention_weights.shape))
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.dec_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc = tf.keras.layers.Dense(vocab_size)

        # used for attention
        self.attention = BahdanauAttention(self.dec_units)

    def call(self, x, hidden, enc_output):
        # enc_output shape == (batch_size, max_length, hidden_size)
        context_vector, attention_weights = self.attention(hidden, enc_output)

        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)

        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

        # passing the concatenated vector to the GRU
        output, state = self.gru(x)

        # output shape == (batch_size * 1, hidden_size)
        output = tf.reshape(output, (-1, output.shape[2]))

        # output shape == (batch_size, vocab)
        x = self.fc(output)

        return x, state, attention_weights
decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)
sample_decoder_output, _, _ = decoder(tf.random.uniform((64, 1)),
                                      sample_hidden, sample_output)
print ('Decoder output shape: (batch_size, vocab size) {}'.format(sample_decoder_output.shape))
Define the optimizer and the loss function
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
def loss_function(real, pred):
    # mask out the padding tokens (id 0) so they do not contribute to the loss
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_mean(loss_)
Checkpoints (object-based saving)
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)
Training
- Pass the input through the encoder, which returns the encoder output and the encoder hidden state.
- The encoder output, the encoder hidden state and the decoder input (which is the start token) are passed to the decoder.
- The decoder returns the predictions and the decoder hidden state.
- The decoder hidden state is then passed back into the model, and the predictions are used to calculate the loss.
- Use teacher forcing to decide the next input to the decoder (a toy illustration follows this list).
- Teacher forcing is the technique where the target word is passed as the next input to the decoder.
- The final step is to calculate the gradients, apply them with the optimizer, and backpropagate.
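A toy, self-contained illustration (not part of the original notebook; shapes and values are made up) of the difference between teacher forcing and feeding back the model's own prediction:
import tensorflow as tf

# At decoding step t, the next decoder input can come from two places:
batch_size, vocab_size, t = 4, 10, 1
targ = tf.constant([[1, 5, 2]] * batch_size)              # ground-truth target ids
predictions = tf.random.normal((batch_size, vocab_size))  # decoder logits at step t

teacher_forced_input = tf.expand_dims(targ[:, t], 1)                    # training: feed the true token
free_running_input = tf.expand_dims(tf.argmax(predictions, axis=1), 1)  # evaluation: feed own prediction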
@tf.function
def train_step(inp, targ, enc_hidden):
    loss = 0

    with tf.GradientTape() as tape:
        enc_output, enc_hidden = encoder(inp, enc_hidden)

        dec_hidden = enc_hidden

        dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)

        # Teacher forcing - feeding the target as the next input
        for t in range(1, targ.shape[1]):
            # passing enc_output to the decoder
            predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)

            loss += loss_function(targ[:, t], predictions)

            # using teacher forcing
            dec_input = tf.expand_dims(targ[:, t], 1)

    batch_loss = (loss / int(targ.shape[1]))

    variables = encoder.trainable_variables + decoder.trainable_variables

    gradients = tape.gradient(loss, variables)

    optimizer.apply_gradients(zip(gradients, variables))

    return batch_loss
EPOCHS = 10
for epoch in range(EPOCHS):
    start = time.time()

    enc_hidden = encoder.initialize_hidden_state()
    total_loss = 0

    for (batch, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
        batch_loss = train_step(inp, targ, enc_hidden)
        total_loss += batch_loss

        if batch % 100 == 0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                         batch,
                                                         batch_loss.numpy()))
    # saving (checkpoint) the model every 2 epochs
    if (epoch + 1) % 2 == 0:
        checkpoint.save(file_prefix=checkpoint_prefix)

    print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                        total_loss / steps_per_epoch))
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))
Translate
- The evaluate function is similar to the training loop, except we don't use teacher forcing here.
- The input to the decoder at each time step is its previous prediction, together with the hidden state and the encoder output.
- Stop predicting when the model predicts the end token.
- Store the attention weights for every time step.
Note: The encoder output is calculated only once for one input.
def evaluate(sentence):
    attention_plot = np.zeros((max_length_targ, max_length_inp))

    sentence = preprocess_sentence(sentence)

    inputs = [inp_lang.word_index[i] for i in sentence.split(' ')]
    inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs],
                                                           maxlen=max_length_inp,
                                                           padding='post')
    inputs = tf.convert_to_tensor(inputs)

    result = ''

    hidden = [tf.zeros((1, units))]
    enc_out, enc_hidden = encoder(inputs, hidden)

    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([targ_lang.word_index['<start>']], 0)

    for t in range(max_length_targ):
        predictions, dec_hidden, attention_weights = decoder(dec_input,
                                                             dec_hidden,
                                                             enc_out)

        # storing the attention weights to plot later on
        attention_weights = tf.reshape(attention_weights, (-1, ))
        attention_plot[t] = attention_weights.numpy()

        predicted_id = tf.argmax(predictions[0]).numpy()

        result += targ_lang.index_word[predicted_id] + ' '

        if targ_lang.index_word[predicted_id] == '<end>':
            return result, sentence, attention_plot

        # the predicted ID is fed back into the model
        dec_input = tf.expand_dims([predicted_id], 0)

    return result, sentence, attention_plot
# function for plotting the attention weights
def plot_attention(attention, sentence, predicted_sentence):
    fig = plt.figure(figsize=(10, 10))
    ax = fig.add_subplot(1, 1, 1)
    ax.matshow(attention, cmap='viridis')

    fontdict = {'fontsize': 14}

    ax.set_xticklabels([''] + sentence, fontdict=fontdict, rotation=90)
    ax.set_yticklabels([''] + predicted_sentence, fontdict=fontdict)

    plt.show()
def translate(sentence):
    result, sentence, attention_plot = evaluate(sentence)

    print('Input: %s' % (sentence).encode('utf-8'))
    print('Predicted translation: {}'.format(result))

    attention_plot = attention_plot[:len(result.split(' ')), :len(sentence.split(' '))]
    plot_attention(attention_plot, sentence.split(' '), result.split(' '))
Restore the latest checkpoint and test
# restoring the latest checkpoint in checkpoint_dir
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))
translate(u'hace mucho frio aqui.')
translate(u'esta es mi vida.')
translate(u'¿todavia estan en casa?')
# wrong translation
translate(u'trata de averiguarlo.')
Next steps
- Download a different dataset to experiment with translations, for example English to German or English to French (a sketch of switching datasets follows this list).
- Experiment with training on a larger dataset.
- Or train for more epochs.
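For instance, switching to another language pair mostly amounts to pointing the download at a different archive. A minimal sketch, assuming an English-German file in the same tab-separated format is available from the Anki page above; the archive name deu-eng.zip and the extracted file name are assumptions, so adjust them to the actual archive contents:
# Hypothetical: download an English-German dataset in the same TSV format.
path_to_zip = tf.keras.utils.get_file(
    'deu-eng.zip', origin='http://www.manythings.org/anki/deu-eng.zip',
    extract=True)
path_to_file = os.path.dirname(path_to_zip) + "/deu.txt"  # inspect the archive for the actual file name

input_tensor, target_tensor, inp_lang, targ_lang = load_dataset(path_to_file, num_examples)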
