Using Stanford CoreNLP

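Stanford CoreNLP, developed at Stanford University, is a widely used natural language processing toolkit that supports multiple languages. It runs on Java, so make sure a Java environment is installed before setting it up. At the time of writing, the latest release is 3.9.1. This article focuses on the toolkit's overall architecture and its basic usage.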

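The Annotation class and Annotator instances are the two basic building blocks of CoreNLP. Annotation is the data model: every input to and output from CoreNLP is expressed in it. The text you want to process usually starts out as a plain string, so it must first be wrapped in an Annotation before it can be passed to CoreNLP. Annotators are the functional components; you choose the ones that provide the functionality you need.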
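To make the split between the data model and the processing components more concrete, here is a minimal toy sketch in plain Java. It does not use CoreNLP at all: `ToyAnnotator`, `TOKENIZE`, `COUNT`, and the string keys are hypothetical names invented for illustration, and a plain `Map` stands in for the Annotation object.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the Annotation/Annotator split; none of these types are CoreNLP classes.
class ToyPipelineSketch {

    // Hypothetical stand-in for an Annotator: it reads the shared document and adds results to it.
    interface ToyAnnotator {
        void annotate(Map<String, Object> document);
    }

    // Splits the raw text on whitespace and stores the tokens under "tokens".
    static final ToyAnnotator TOKENIZE = doc -> {
        String text = (String) doc.get("text");
        doc.put("tokens", Arrays.asList(text.trim().split("\\s+")));
    };

    // Counts the tokens; it depends on the output of TOKENIZE, so it must run after it.
    static final ToyAnnotator COUNT = doc -> {
        List<?> tokens = (List<?>) doc.get("tokens");
        doc.put("tokenCount", tokens.size());
    };

    // Wraps the text in a document and runs each annotator over it in order.
    static Map<String, Object> run(String text, List<ToyAnnotator> annotators) {
        Map<String, Object> document = new LinkedHashMap<>();
        document.put("text", text);
        for (ToyAnnotator annotator : annotators) {
            annotator.annotate(document);
        }
        return document;
    }

    public static void main(String[] args) {
        Map<String, Object> doc =
            run("Stanford CoreNLP is a toolkit", Arrays.asList(TOKENIZE, COUNT));
        System.out.println(doc.get("tokens"));     // [Stanford, CoreNLP, is, a, toolkit]
        System.out.println(doc.get("tokenCount")); // 5
    }
}
```

The point of the sketch is only the shape: one shared document object, and a list of components that each add their own results to it in sequence.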

When a text needs several kinds of processing, first decide which Annotators are required, then configure them as follows:

    // set up pipeline properties
    Properties props = new Properties();
    // set the list of annotators to run
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,depparse,coref,kbp,quote");

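As the snippet shows, both the choice of annotators and their individual options are configured through properties. CoreNLP provides many annotators for different processing needs (they are listed on the CoreNLP annotators page), and you can pick whichever ones fit your application.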
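As a small sketch of what "individual options" looks like in practice, the snippet below sets annotator-specific keys next to the annotator list. It uses only `java.util.Properties`; the option names `ner.useSUTime` and `ssplit.boundaryTokenRegex` are taken from the Chinese configuration file quoted later in this article, and the class name `PipelinePropsSketch` is made up for illustration.

```java
import java.util.Properties;

// Builds a Properties object of the kind passed to the StanfordCoreNLP constructor.
class PipelinePropsSketch {

    static Properties buildProps() {
        Properties props = new Properties();
        // which annotators to run, in order
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        // annotator-specific options are keyed as "<annotator>.<option>"
        props.setProperty("ner.useSUTime", "false");
        props.setProperty("ssplit.boundaryTokenRegex", "[.。]|[!?！？]+");
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildProps();
        for (String key : props.stringPropertyNames()) {
            System.out.println(key + " = " + props.getProperty(key));
        }
    }
}
```

Whether a given option is honoured depends on the annotator and the CoreNLP version, so treat the values here as placeholders rather than recommended settings.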

3. Creating the Annotation

    // read some text in the text variable
    String text = "...";
    
    // create an empty Annotation just with the given text
    Annotation document = new Annotation(text);

The Annotation class acts much like a base class: each annotator that processes it contributes its own kind of annotation to it.
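One way to picture how a single object can hold results of many different types is a map keyed by class objects, where the key class also fixes the type of the value. The following is a toy sketch of that idea in plain Java, not CoreNLP's actual implementation; `TypedMapSketch`, `Key`, `TextKey`, and `TokensKey` are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy class-keyed container: the key class determines the type of the stored value.
class TypedMapSketch {

    // Hypothetical marker interface; V is the value type associated with the key.
    interface Key<V> {}

    static final class TextKey implements Key<String> {}
    static final class TokensKey implements Key<List<String>> {}

    private final Map<Class<?>, Object> values = new HashMap<>();

    // Stores a value under the given key class; the compiler checks the value type.
    <V> void set(Class<? extends Key<V>> key, V value) {
        values.put(key, value);
    }

    // Returns the value stored under the key class, or null if none was set.
    @SuppressWarnings("unchecked")
    <V> V get(Class<? extends Key<V>> key) {
        return (V) values.get(key);
    }

    public static void main(String[] args) {
        TypedMapSketch doc = new TypedMapSketch();
        doc.set(TextKey.class, "Hello world");
        doc.set(TokensKey.class, Arrays.asList("Hello", "world"));

        String text = doc.get(TextKey.class);            // no cast needed
        List<String> tokens = doc.get(TokensKey.class);  // no cast needed
        System.out.println(text + " -> " + tokens);      // Hello world -> [Hello, world]
    }
}
```

The benefit of this design is that callers ask for a result by naming its key class and get back a correctly typed value, which is the same calling pattern as the `document.get(SomeAnnotation.class)` lookups used later in this article.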

4. Processing with the StanfordCoreNLP class

Code example:

    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    
    pipeline.annotate(document);

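The StanfordCoreNLP class is initialized from a Properties instance, and its annotate method takes an Annotation as its argument and does the processing. Putting the snippets above together gives the complete program: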

    import edu.stanford.nlp.pipeline.*;
    import java.util.*;
    
    public class BasicPipelineExample {

        public static void main(String[] args) {

            // creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

            // read some text in the text variable
            String text = "...";

            // create an empty Annotation just with the given text
            Annotation document = new Annotation(text);

            // run all Annotators on this text
            pipeline.annotate(document);
        }
    }

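As you can see, using CoreNLP is fairly simple. The StanfordCoreNLP class serves as the entry point: once the properties are set and the Annotation is created, processing can begin. How do you then get at the results? Everything is available through the document variable.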

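5. Extracting results

After the annotators have run, the results are returned in the Annotation data structure. As noted in the earlier discussion of annotator types, each annotator produces annotations of specific types. The annotator table in the official documentation lists what each annotator generates.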

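The classes in the GENERATED ANNOTATIONS column of that table are exactly the key types used in CoreMap, which makes extracting results straightforward: all you need to know is the annotation type that a given annotator generates. The ssplit annotator serves as the example here.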

    // these are all the sentences in this document
    // a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
    List<CoreMap> sentences = document.get(SentencesAnnotation.class);
    
    for(CoreMap sentence: sentences) {
      // traversing the words in the current sentence
      // a CoreLabel is a CoreMap with additional token-specific methods
      for (CoreLabel token: sentence.get(TokensAnnotation.class)) {
        // this is the text of the token
        String word = token.get(TextAnnotation.class);
        // this is the POS tag of the token
        String pos = token.get(PartOfSpeechAnnotation.class);
        // this is the NER label of the token
        String ne = token.get(NamedEntityTagAnnotation.class);
      }
    
      // this is the parse tree of the current sentence
      Tree tree = sentence.get(TreeAnnotation.class);
    
      // this is the Stanford dependency graph of the current sentence
      SemanticGraph dependencies = sentence.get(CollapsedCCProcessedDependenciesAnnotation.class);
    }
    
    // This is the coreference link graph
    // Each chain stores a set of mentions that link to each other,
    // along with a method for getting the most representative mention
    // Both sentence and token offsets start at 1!
    Map<Integer, CorefChain> graph = 
      document.get(CorefChainAnnotation.class);

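SentencesAnnotation is the annotation class generated by the ssplit (sentence splitting) annotator, so passing it to get returns the sentence-splitting result. CoreLabel is a special kind of CoreMap with some extra conveniences (for example, a token's text can be read directly as a String). From a sentence's CoreMap we can obtain the parse tree or the dependency structure: Tree represents the constituency parse tree and SemanticGraph represents the dependency parse. Both can be printed directly: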

    for (CoreMap sentence : sentences) {
        // this is the parse tree of the current sentence
        Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
        System.out.println("Parse tree:");
        System.out.println(tree.toString());

        // this is the Stanford dependency graph of the current sentence
        SemanticGraph dependencies = sentence.get(SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
        System.out.println("Dependency parse:");
        System.out.println(dependencies.toString());
    }

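Some annotators depend on others, so when setting the annotators property, make sure every annotator that the chosen ones require is included as well. The dependencies between annotators are listed in the official documentation.
The steps above cover the complete processing flow for English. CoreNLP also offers a simpler way to call it, the Simple API. It is more intuitive but less flexible, so it is not covered here; see the Simple API documentation if you need it.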
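As a rough illustration of the dependency requirement mentioned above, the sketch below checks an annotator list against a small hand-written dependency table before a pipeline is built. The table is an abbreviated assumption written for this example (it covers only a few annotators and should be checked against the official dependency list for your CoreNLP version), and `AnnotatorDependencySketch` and `missing` are hypothetical names, not part of CoreNLP.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Checks that each annotator in a list is preceded by the annotators it requires.
class AnnotatorDependencySketch {

    // ASSUMPTION: abbreviated, hand-written table; verify against the official docs.
    static final Map<String, List<String>> REQUIRES = new LinkedHashMap<>();
    static {
        REQUIRES.put("tokenize", Collections.<String>emptyList());
        REQUIRES.put("ssplit", Arrays.asList("tokenize"));
        REQUIRES.put("pos", Arrays.asList("tokenize", "ssplit"));
        REQUIRES.put("lemma", Arrays.asList("tokenize", "ssplit", "pos"));
        REQUIRES.put("ner", Arrays.asList("tokenize", "ssplit", "pos", "lemma"));
    }

    // Returns one message per requirement that is absent or listed too late.
    static List<String> missing(String annotators) {
        List<String> problems = new ArrayList<>();
        Set<String> seen = new LinkedHashSet<>();
        for (String raw : annotators.split(",")) {
            String name = raw.trim();
            if (name.isEmpty()) {
                continue;
            }
            // annotators not in the table are treated as having no requirements
            for (String dep : REQUIRES.getOrDefault(name, Collections.<String>emptyList())) {
                if (!seen.contains(dep)) {
                    problems.add(name + " requires " + dep);
                }
            }
            seen.add(name);
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(missing("tokenize, ssplit, pos, lemma, ner")); // []
        System.out.println(missing("tokenize, ner"));
        // [ner requires ssplit, ner requires pos, ner requires lemma]
    }
}
```

A check like this only catches omissions that the table knows about; CoreNLP itself reports unmet requirements when the pipeline is constructed.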

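The process above targets English text, but it works for Chinese as well. Chinese processing needs an additional models package, available at http://nlp.stanford.edu/software/stanford-chinese-corenlp-2018-02-27-models.jar. Add this jar to the project alongside the CoreNLP jars and the setup is complete.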

The code for Chinese is the same. The annotators property is usually left at its default, which is defined in the StanfordCoreNLP-chinese.properties file inside the Chinese models jar:

    # Pipeline options - lemma is no-op for Chinese but currently needed because coref demands it (bad old requirements system)
    annotators = tokenize, ssplit, pos, lemma, ner, parse, coref
    
    # segment
    tokenize.language = zh
    segment.model = edu/stanford/nlp/models/segmenter/chinese/ctb.gz
    segment.sighanCorporaDict = edu/stanford/nlp/models/segmenter/chinese
    segment.serDictionary = edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
    segment.sighanPostProcessing = true
    
    # sentence split
    ssplit.boundaryTokenRegex = [.。]|[!?！？]+
    
    # pos
    pos.model = edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger
    
    # ner
    ner.language = chinese
    ner.model = edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz
    ner.applyNumericClassifiers = true
    ner.useSUTime = false
    
    # regexner
    ner.fine.regexner.mapping = edu/stanford/nlp/models/kbp/chinese/cn_regexner_mapping.tab
    ner.fine.regexner.noDefaultOverwriteLabels = CITY,COUNTRY,STATE_OR_PROVINCE
    
    # parse
    parse.model = edu/stanford/nlp/models/srparser/chineseSR.ser.gz
    
    # depparse
    depparse.model    = edu/stanford/nlp/models/parser/nndep/UD_Chinese.gz
    depparse.language = chinese
    
    # coref
    coref.sieves = ChineseHeadMatch, ExactStringMatch, PreciseConstructs, StrictHeadMatch1, StrictHeadMatch2, StrictHeadMatch3, StrictHeadMatch4, PronounMatch
    coref.input.type = raw
    coref.postprocessing = true
    coref.calculateFeatureImportance = false
    coref.useConstituencyTree = true
    coref.useSemantics = false
    coref.algorithm = hybrid
    coref.path.word2vec =
    coref.language = zh
    coref.defaultPronounAgreement = true
    coref.zh.dict = edu/stanford/nlp/models/dcoref/zh-attributes.txt.gz
    coref.print.md.log = false
    coref.md.type = RULE
    coref.md.liberalChineseMD = false
    
    # kbp
    kbp.semgrex = edu/stanford/nlp/models/kbp/chinese/semgrex
    kbp.tokensregex = edu/stanford/nlp/models/kbp/chinese/tokensregex
    kbp.language = zh
    kbp.model = none
    
    # entitylink
    entitylink.wikidict = edu/stanford/nlp/models/kbp/chinese/wikidict_chinese.tsv.gz

The default configuration is usually sufficient. It is used as follows:

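    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;

    public class nlp_Chinese_demo {
        public static void main(String[] args) {
            // load the default Chinese configuration shipped in the Chinese models jar
            String props = "StanfordCoreNLP-chinese.properties";
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
            Annotation document;
            // alternatively, read the text from a file
            // document = new Annotation(IOUtils.slurpFileNoExceptions(file));
            document = new Annotation("欢迎使用斯坦福大学自然语言处理工具包！");

            // run the pipeline and print the results
            pipeline.annotate(document);
            pipeline.prettyPrint(document, System.out);
        }
    }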
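If the defaults need adjusting, one option is to load the properties into a `java.util.Properties` object, override individual keys, and pass that object to the `StanfordCoreNLP(Properties)` constructor used earlier. The sketch below shows only the loading and overriding step with the standard library; the inline text is a three-line excerpt of the configuration above standing in for the real file, and `ChinesePropsOverrideSketch` is a hypothetical name.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Loads properties text and overrides the annotator list before building a pipeline.
class ChinesePropsOverrideSketch {

    // Parses text in .properties format into a Properties object.
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Starts from the (excerpted) defaults and keeps only the first three annotators.
    static Properties trimmed() {
        // ASSUMPTION: excerpt only; in real use the text would come from
        // StanfordCoreNLP-chinese.properties in the Chinese models jar.
        String defaults =
            "annotators = tokenize, ssplit, pos, lemma, ner, parse, coref\n"
            + "tokenize.language = zh\n"
            + "ner.useSUTime = false\n";
        Properties props = load(defaults);
        props.setProperty("annotators", "tokenize, ssplit, pos");
        return props;
    }

    public static void main(String[] args) {
        Properties props = trimmed();
        System.out.println(props.getProperty("annotators"));        // tokenize, ssplit, pos
        System.out.println(props.getProperty("tokenize.language")); // zh
    }
}
```

Running fewer annotators this way skips the work of the ones left out, at the cost of not having their annotations in the result.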

I am new to natural language processing and many concepts are still unclear to me, so corrections are very welcome. Thank you!

Reference: https://stanfordnlp.github.io/CoreNLP/api.html
