Code notes from studying Knowledge Graph Embedding Based Question Answering

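Preface

My advisor recently asked me to study the paper "Knowledge Graph Embedding Based Question Answering". The key point of the paper is that it uses a knowledge graph as the dataset and solves the Question Answering task in natural language processing without needing to know the structure of the data. This note only records code snippets that I, a very inexperienced undergraduate, found potentially useful while reading the paper's GitHub code. Notes on the paper itself will go into a separate post.

I hope we can all learn and improve together! Keep it up!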


Paper link:

delivery.acm.org/10.1145/330…acm =1564312374_9607150c0f9e4d7029cba11e69cb8903 (please copy the whole link)

GitHub link:

github.com/xhuang31/KE…


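The notes below will be updated gradually.

Main text begins!

  1. If the question starts with specific words, delete them

For example, to remove "what is" from "what is your name" and get "your name", the following code can be used: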

    # Question prefixes grouped by length: index 0 holds 1-word prefixes,
    # index 1 holds 2-word prefixes, index 2 holds 3-word prefixes.
    whhowset = [{'what', 'how', 'where', 'who', 'which', 'whom'},
                {'in which', 'what is', "what 's", 'what are', 'what was', 'what were', 'where is', 'where are',
                 'where was', 'where were', 'who is', 'who was', 'who are', 'how is', 'what did'},
                {'what kind of', 'what kinds of', 'what type of', 'what types of', 'what sort of'}]

    question = ["what", "is", "your", "name"]

    # Try the longest prefix first (3 words), then 2 words, then 1 word.
    for j in range(3, 0, -1):
        if ' '.join(question[0:j]) in whhowset[j - 1]:
            del question[0:j]  # drop the matched prefix in place

    print(question)

output: ['your', 'name']
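
The same idea can be wrapped in a small function. This is only a sketch: the names WH_PREFIXES and strip_question_prefix are made up for illustration and do not come from the repository, and the prefix sets are shortened. Unlike the loop above, it returns a new list instead of modifying the input, and it stops after the longest matching prefix.

```python
# Hypothetical helper, not taken from the repository.
# Shortened prefix sets, grouped by number of words (1, 2, 3).
WH_PREFIXES = [
    {'what', 'how', 'where', 'who', 'which', 'whom'},
    {'in which', 'what is', 'what are', 'where is', 'who is', 'how is'},
    {'what kind of', 'what kinds of', 'what type of', 'what sort of'},
]


def strip_question_prefix(tokens):
    """Return a copy of tokens without the longest leading question phrase."""
    for n in range(len(WH_PREFIXES), 0, -1):
        if ' '.join(tokens[:n]) in WH_PREFIXES[n - 1]:
            return tokens[n:]  # longest match found, stop here
    return list(tokens)  # no prefix matched, return an unchanged copy


print(strip_question_prefix(['what', 'is', 'your', 'name']))
# ['your', 'name']
print(strip_question_prefix(['what', 'kind', 'of', 'music', 'is', 'this']))
# ['music', 'is', 'this']
print(strip_question_prefix(['tell', 'me', 'a', 'joke']))
# ['tell', 'me', 'a', 'joke']
```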

  1. Create an n-gram list from a sentence's word list

The following is quoted from Wikipedia's explanation of n-grams: an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus.

n can be chosen freely, e.g. unigram (n = 1) or bigram (n = 2). Concrete examples of n-grams:

  • Word: apple, n-gram list: ['a','p','p','l','e','ap','pp','pl','le','app','ppl','ple','appl','pple','apple']
  • Sentence: 'how are u', n-gram list: ['how', 'are', 'u', 'how are', 'are u', 'how are u']

    question = ["how", "are", "u"]

    grams = []
    maxlen = len(question)

    # Unigrams: every single token.
    for token in question:
        grams.append(token)

    # n-grams for n = 2 .. maxlen: slide a window of j tokens over the question.
    for j in range(2, maxlen + 1):
        for token in [question[idx:idx + j] for idx in range(maxlen - j + 1)]:
            grams.append(' '.join(token))

    print(grams)

output: ['how', 'are', 'u', 'how are', 'are u', 'how are u']
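
As a reusable variant, the two loops can be merged into one function with an optional upper bound on n. This is a sketch under my own assumptions: the function name word_ngrams and the max_n parameter are illustrative and do not appear in the repository.

```python
def word_ngrams(tokens, max_n=None):
    """Return all contiguous word n-grams for n = 1..max_n, shortest first."""
    if max_n is None:
        max_n = len(tokens)
    grams = []
    for n in range(1, max_n + 1):
        # Slide a window of n tokens over the list.
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams


print(word_ngrams(['how', 'are', 'u']))
# ['how', 'are', 'u', 'how are', 'are u', 'how are u']
print(word_ngrams(['how', 'are', 'u'], max_n=2))
# ['how', 'are', 'u', 'how are', 'are u']
```

Capping n can matter for long questions, because the number of n-grams grows quadratically with the number of tokens when every length is kept.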

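  1. Write the output into a file

    import os

    mids = ["I", "I", "am", "a", "human"]

    # set() removes the duplicate "I". A set has no guaranteed iteration order,
    # so the index assigned to each entity can differ between runs.
    with open(os.path.join('output.txt'), 'w') as outfile:
        for i, entity in enumerate(set(mids)):
            outfile.write("{}\t{}\n".format(entity, i))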

output: a file named output.txt. One possible content is shown below (the order depends on how the set is iterated):

    human	0
    a	1
    am	2
    I	3

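
If a stable mapping is needed, one option is to sort the distinct entities before numbering them, and to read the file back into a dictionary. The sketch below assumes that approach; the helper names write_entity_index and read_entity_index are hypothetical, and a temporary directory is used so that nothing is left on disk.

```python
import os
import tempfile


def write_entity_index(entities, path):
    """Write one 'entity<TAB>index' line per distinct entity, in sorted order."""
    with open(path, 'w', encoding='utf-8') as outfile:
        for index, entity in enumerate(sorted(set(entities))):
            outfile.write('{}\t{}\n'.format(entity, index))


def read_entity_index(path):
    """Read the file back into a dict mapping entity -> index."""
    mapping = {}
    with open(path, encoding='utf-8') as infile:
        for line in infile:
            entity, index = line.rstrip('\n').split('\t')
            mapping[entity] = int(index)
    return mapping


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'output.txt')
    write_entity_index(['I', 'I', 'am', 'a', 'human'], path)
    entity_to_index = read_entity_index(path)

print(entity_to_index)
# {'I': 0, 'a': 1, 'am': 2, 'human': 3}
```

Uppercase letters sort before lowercase ones in Python's default string ordering, which is why 'I' gets index 0 here.
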
  1. argparse in the Python standard library: makes it easy to write user-friendly command-line interfaces. add_argument() defines how a single command-line argument should be parsed.

function: parser.add_argument(name or flags...[, action][, nargs][, const][, default][, type][, choices][, required][, help][, metavar][, dest])

parameters (quoted from the Python documentation):

  • const - A constant value required by some action and nargs selections.
  • dest - The name of the attribute to be added to the object returned by parse_args().
  • action - The basic type of action to be taken when this argument is encountered at the command line.

    import argparse

    parser = argparse.ArgumentParser(description='Process some integers.')
    # Positional argument: one or more integers, collected into a list.
    parser.add_argument('integers', metavar='N', type=int, nargs='+',
                        help='an integer for the accumulator')
    # Optional flag: stores the built-in sum in args.accumulate when given, max otherwise.
    parser.add_argument('--sum', dest='accumulate', action='store_const',
                        const=sum, default=max,
                        help='sum the integers (default: find the max)')

    args = parser.parse_args()
    print(args.accumulate(args.integers))

output: python prog.py 1 2 3 4 prints 4 (the maximum); python prog.py 1 2 3 4 --sum prints 10 (the sum)
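
The parser can also be tried without a terminal, because parse_args() accepts an explicit list of strings in place of the real command line. The sketch below reuses the same arguments. The build_parser wrapper is only a convenience I added for illustration.

```python
import argparse


def build_parser():
    """Build the same parser as above so it can be reused in tests."""
    parser = argparse.ArgumentParser(description='Process some integers.')
    parser.add_argument('integers', metavar='N', type=int, nargs='+',
                        help='an integer for the accumulator')
    parser.add_argument('--sum', dest='accumulate', action='store_const',
                        const=sum, default=max,
                        help='sum the integers (default: find the max)')
    return parser


parser = build_parser()

# Without --sum, args.accumulate is the default (the built-in max).
args = parser.parse_args(['1', '2', '3', '4'])
print(args.accumulate(args.integers))  # 4

# With --sum, store_const puts the const value (the built-in sum) in args.accumulate.
args = parser.parse_args(['1', '2', '3', '4', '--sum'])
print(args.accumulate(args.integers))  # 10
```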

  1. Counter object: a counter tool is provided to support convenient and rapid tallies

    from collections import Counter

    cnt = Counter()
    # Missing keys count as 0, so each word can be incremented directly.
    for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
        cnt[word] += 1
    print(cnt)

output: Counter({'blue': 3, 'red': 2, 'green': 1})
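
A Counter can also be built directly from an iterable, which avoids the explicit loop. The short sketch below uses the same word list and shows most_common() as well as the zero default for unseen keys.

```python
from collections import Counter

words = ['red', 'blue', 'red', 'green', 'blue', 'blue']

# Passing an iterable counts every element in one step.
cnt = Counter(words)

print(cnt.most_common(2))  # [('blue', 3), ('red', 2)]
print(cnt['yellow'])       # 0, unseen keys do not raise KeyError
```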

  1. PyTorch manual_seed

    import torch

    # Seed PyTorch's random number generator so the values below are reproducible.
    torch.manual_seed(3)
    print(torch.rand(3))

output: tensor([0.0043, 0.1056, 0.2858]). This tensor is the same on every run. Without the manual_seed call, the output differs from run to run.
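
The seeding principle can also be tried without installing PyTorch. The sketch below uses Python's built-in random module purely as a stand-in: it is not PyTorch's generator and produces different numbers, but it shows the same behavior (same seed, same sequence). The helper name draw is made up for illustration.

```python
import random


def draw(seed, count=3):
    """Return `count` pseudo-random floats from a generator seeded with `seed`."""
    rng = random.Random(seed)  # independent generator, does not touch global state
    return [rng.random() for _ in range(count)]


first = draw(3)
second = draw(3)
print(first == second)     # True: the same seed reproduces the same sequence
print(first == draw(4))    # a different seed gives a different sequence
```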

  1. cuDNN deterministic: in some circumstances when using the CUDA backend with cuDNN, an operator may select a nondeterministic algorithm to increase performance. If this is undesirable, you can try to make the operation deterministic (potentially at a performance cost) by setting torch.backends.cudnn.deterministic = True

Example:

    import torch

    torch.backends.cudnn.deterministic = True

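  1. torchtext. Note: the following part comes from the Zhihu article "torchtext入门教程,轻松玩转文本数据处理" (an introductory torchtext tutorial) by Zhihu author Lee. It is organized here only as study notes and will be removed if it infringes.

torchtext components:

  • Field: holds the configuration for data preprocessing, such as the tokenization method, whether to convert to lowercase, the start token, the end token, the padding token, and the vocabulary
  • Dataset: inherits from PyTorch's Dataset and is used to load data. TabularDataset loads data conveniently once a path, a format, and the Field information are specified. torchtext also provides prebuilt Dataset objects for common datasets that can be loaded directly, and the splits method can load the training, validation, and test sets at the same time.
  • Iterator: the iterator that feeds data to the model. It supports customized batching
  1. Field:

    # Classic torchtext API (moved to torchtext.legacy in torchtext 0.9 and later removed).
    from torchtext import data

    TEXT = data.Field(lower=True)

This sets the preprocessing to convert all text to lowercase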

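  1. Dataset

torchtext's Dataset inherits from PyTorch's Dataset and provides a method that downloads and extracts compressed data (.zip, .gz, and .tgz are supported).

The splits method can read the training, validation, and test sets at the same time.

TabularDataset makes it easy to read files in CSV, TSV, or JSON format.

    # TEXT, ED, field, and args are assumed to be defined elsewhere (not shown in this snippet).
    train = data.TabularDataset(path=os.path.join(args.output, 'dete_train.txt'), format='tsv',
                                fields=[('text', TEXT), ('ed', ED)])
    dev, test = data.TabularDataset.splits(path=args.output, validation='valid.txt', test='test.txt',
                                           format='tsv', fields=field)

After loading the data you can build the vocabulary, and pretrained word vectors can be used when building it:

    TEXT.build_vocab(train, vectors="glove.6B.100d")
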
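
To get a feeling for what lowercasing and vocabulary building do, here is a rough pure-Python picture. It is not torchtext's implementation: the function names, the whitespace tokenization, and the choice and order of the special tokens '<unk>' and '<pad>' are assumptions made for illustration, and no pretrained vectors are involved.

```python
from collections import Counter


def build_vocab(sentences, specials=('<unk>', '<pad>')):
    """Lowercase and whitespace-tokenize sentences, then index tokens by frequency."""
    counts = Counter(token
                     for sentence in sentences
                     for token in sentence.lower().split())
    # Most frequent tokens first; ties are broken alphabetically.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + ordered                   # index -> string
    stoi = {tok: idx for idx, tok in enumerate(itos)}  # string -> index
    return itos, stoi


def numericalize(sentence, stoi):
    """Map a sentence to indices; unknown tokens fall back to '<unk>'."""
    unk = stoi['<unk>']
    return [stoi.get(tok, unk) for tok in sentence.lower().split()]


itos, stoi = build_vocab(['Who wrote Hamlet', 'who directed Titanic'])
print(itos)
# ['<unk>', '<pad>', 'who', 'directed', 'hamlet', 'titanic', 'wrote']
print(numericalize('Who wrote Macbeth', stoi))
# [2, 6, 0]
```
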
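  1. Iterator

Iterator is torchtext's output to the model. It provides the usual ways of handling data, such as shuffling and sorting, and the batch size can be changed dynamically. It also has a splits method that can output the training, validation, and test sets at the same time.

    # Training batches are shuffled; validation batches keep their original order.
    train_iter = data.Iterator(train, batch_size=args.batch_size, device=torch.device('cuda', args.gpu), train=True,
                               repeat=False, sort=False, shuffle=True, sort_within_batch=False)
    dev_iter = data.Iterator(dev, batch_size=args.batch_size, device=torch.device('cuda', args.gpu), train=False,
                             repeat=False, sort=False, shuffle=False, sort_within_batch=False)
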
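
Conceptually, the shuffle and batch_size options amount to reordering the examples and cutting them into chunks. The sketch below shows only that idea in plain Python. The function iterate_batches and its seed parameter are hypothetical, and it does none of the padding, tensor conversion, or device placement that a real iterator handles.

```python
import random


def iterate_batches(examples, batch_size, shuffle=False, seed=None):
    """Yield lists of at most batch_size examples, optionally in shuffled order."""
    indices = list(range(len(examples)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [examples[i] for i in indices[start:start + batch_size]]


examples = ['q{}'.format(i) for i in range(5)]

# Validation-style iteration: fixed order, last batch may be smaller.
dev_batches = list(iterate_batches(examples, batch_size=2))
print(dev_batches)
# [['q0', 'q1'], ['q2', 'q3'], ['q4']]

# Training-style iteration: shuffled, but every example still appears exactly once.
train_batches = list(iterate_batches(examples, batch_size=2, shuffle=True, seed=0))
print(sorted(x for batch in train_batches for x in batch) == examples)  # True
```
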
  1. Floor division: Python arithmetic operator //. The division of operands where the result is the quotient in which the digits after the decimal point are removed. But if one of the operands is negative, the result is floored, i.e., rounded away from zero (towards negative infinity).

    print(9 // 4)    # 2
    print(-11 // 3)  # -4, because -3.67 is floored to -4

output: 2 and -4
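
The difference between flooring and truncating only shows up when the operands have opposite signs. The sketch below compares // with math.trunc and checks the identity a == b * (a // b) + (a % b) that ties floor division to the remainder operator. The helper name floor_vs_trunc is made up for illustration.

```python
import math


def floor_vs_trunc(a, b):
    """Return (floor quotient, remainder, truncated quotient) for a / b."""
    quotient, remainder = divmod(a, b)  # same as (a // b, a % b)
    assert a == b * quotient + remainder
    return quotient, remainder, math.trunc(a / b)


print(floor_vs_trunc(9, 4))     # (2, 1, 2)
print(floor_vs_trunc(-11, 3))   # (-4, 1, -3)
print(floor_vs_trunc(11, -3))   # (-4, -1, -3)
print(floor_vs_trunc(-11, -3))  # (3, -2, 3)
```

Note that the remainder takes the sign of the divisor, which is what keeps the identity true after flooring.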

Reposted from: https://juejin.im/post/5d3d8157f265da1ba84ada19
