
Using Stanford CoreNLP in Python


1. Make sure a Java environment is installed: download and install JDK 1.8 or later.

2. Download the Stanford CoreNLP release and unzip it.

3. Stanford CoreNLP handles English by default. To process another language, download the corresponding model jar.

4. After downloading a model jar for another language, be sure to move it into the directory unzipped in step 2, and name it in the format stanford-<language>-corenlp-yyyy-mm-dd-models.jar (the yyyy-mm-dd part can be chosen freely).

For example: stanford-chinese-corenlp-2024-04-22-models.jar

    mv /path/to/stanford-corenlp-4.5.6-models-french.jar /path/to/stanford-corenlp-4.5.6

Tip: only jars whose names contain "models" are language model jars.
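As a quick sanity check (a minimal sketch; the path below is a placeholder), you can list the model jars from Python to confirm the move and rename worked:

    import glob

    # Placeholder: point this at the directory unzipped in step 2.
    corenlp_dir = '/path/to/stanford-corenlp-4.5.6'
    # Only jars whose names contain "models" are language model jars.
    print(glob.glob(corenlp_dir + '/*models*.jar'))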

5. Configure the path to the Stanford CoreNLP jars (required). There are two options:

1) Add all the jars to the CLASSPATH:

    export CLASSPATH=$CLASSPATH:/path/to/stanford-corenlp-4.5.6/*

2) Add only the jars you actually use:

    export CLASSPATH=/path/to/stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar
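If you prefer to stay in Python, the same configuration can be applied to the environment before any Java process is launched. A minimal sketch, assuming the placeholder path below; child processes inherit the variable:

    import os

    # Placeholder: adjust to your CoreNLP directory.
    corenlp_dir = '/path/to/stanford-corenlp-4.5.6'
    # Append the wildcard entry, mirroring option 1 above.
    existing = os.environ.get('CLASSPATH', '')
    os.environ['CLASSPATH'] = (existing + os.pathsep if existing else '') + corenlp_dir + '/*'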
    

6. Check that Stanford CoreNLP works properly.

Method 1:

    java edu.stanford.nlp.pipeline.StanfordCoreNLP -file input.txt
Method 2:

    echo "Please tokenize this text." | java edu.stanford.nlp.process.PTBTokenizer

    # Output
    Please
    tokenize
    this
    text
    .
    PTBTokenizer tokenized 5 tokens at 68.97 tokens per second.
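The same check can be scripted with the Python standard library. A minimal sketch, assuming java is on the PATH and the CLASSPATH from step 5 is set:

    import subprocess

    # Pipe a sentence into PTBTokenizer, like the echo pipeline above.
    result = subprocess.run(
        ['java', 'edu.stanford.nlp.process.PTBTokenizer'],
        input='Please tokenize this text.',
        capture_output=True, text=True,
    )
    print(result.stdout)  # one token per line, as in the sample output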

Method 3: call it from Python.

1) Install stanfordcorenlp:

    pip install stanfordcorenlp

2) Call stanfordcorenlp from Python to tokenize text:

    # English example
    from stanfordcorenlp import StanfordCoreNLP

    nlp = StanfordCoreNLP(r'D:\python3\stanford-corenlp-full-2018-02-27\stanford-corenlp-full-2018-02-27')

    sentence = 'Guangdong University of Foreign Studies is located in Guangzhou.'
    print('Tokenize:', nlp.word_tokenize(sentence))
    print('Part of Speech:', nlp.pos_tag(sentence))
    print('Named Entities:', nlp.ner(sentence))
    print('Constituency Parsing:', nlp.parse(sentence))           # parse tree
    print('Dependency Parsing:', nlp.dependency_parse(sentence))  # dependency relations
    nlp.close()  # Do not forget to close! The backend server consumes a lot of memory.

    # Chinese example: remember to download the Chinese model jar, and pass lang='zh'
    nlp = StanfordCoreNLP(r'D:\python3\stanford-corenlp-full-2018-02-27\stanford-corenlp-full-2018-02-27', lang='zh')
    sentence = '清华大学位于北京。'
    print(nlp.word_tokenize(sentence))
    print(nlp.pos_tag(sentence))
    print(nlp.ner(sentence))
    print(nlp.parse(sentence))
    print(nlp.dependency_parse(sentence))
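The wrapper also exposes a general annotate interface, and it can attach to an already-running CoreNLP server instead of spawning its own. A sketch based on the stanfordcorenlp README; the port and annotator list here are illustrative:

    from stanfordcorenlp import StanfordCoreNLP

    # Reuse a server started separately (e.g. StanfordCoreNLPServer on port 9000),
    # which avoids launching a fresh JVM for every script; the server keeps running.
    nlp = StanfordCoreNLP('http://localhost', port=9000)

    props = {'annotators': 'tokenize,ssplit,pos', 'outputFormat': 'json'}
    print(nlp.annotate('Guangdong University of Foreign Studies is located in Guangzhou.',
                       properties=props))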

3) A tokenizer method used in CNN/DailyMail dataset preprocessing (GitHub - abisee/cnn-dailymail: Code to obtain the CNN / Daily Mail dataset (non-anonymized) for summarization):

    import os
    import subprocess

    def tokenize_stories(stories_dir, tokenized_stories_dir):
      """Maps a whole directory of .story files to a tokenized version using Stanford CoreNLP Tokenizer"""
      print("Preparing to tokenize %s to %s..." % (stories_dir, tokenized_stories_dir))
      stories = os.listdir(stories_dir)
      # make IO list file
      print("Making list of files to tokenize...")
      with open("mapping.txt", "w") as f:
        for s in stories:
          f.write("%s \t %s\n" % (os.path.join(stories_dir, s), os.path.join(tokenized_stories_dir, s)))
      command = ['java', 'edu.stanford.nlp.process.PTBTokenizer', '-ioFileList', '-preserveLines', 'mapping.txt']
      print("Tokenizing %i files in %s and saving in %s..." % (len(stories), stories_dir, tokenized_stories_dir))
      subprocess.run(command)
      print("Stanford CoreNLP Tokenizer has finished.")
      os.remove("mapping.txt")

      # Check that the tokenized stories directory contains the same number of files as the original directory
      num_orig = len(os.listdir(stories_dir))
      num_tokenized = len(os.listdir(tokenized_stories_dir))
      if num_orig != num_tokenized:
        raise Exception("The tokenized stories directory %s contains %i files, but it should contain the same number as %s (which has %i files). Was there an error during tokenization?" % (tokenized_stories_dir, num_tokenized, stories_dir, num_orig))
      print("Successfully finished tokenizing %s to %s.\n" % (stories_dir, tokenized_stories_dir))

    # java launches the JVM.
    # edu.stanford.nlp.process.PTBTokenizer is the fully qualified name of the PTBTokenizer class in Stanford CoreNLP, which implements the tokenizer.
    # -ioFileList tells PTBTokenizer to read the input/output file mapping from the given file.
    # -preserveLines is an optional flag that keeps the line structure of the original text.
    # mapping.txt is the file created in the code above; it maps each original file to its desired tokenized output file.
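A hypothetical invocation of the function above; the directory names are placeholders, not paths from the original repo, and the CoreNLP jars must be on the CLASSPATH as in step 5:

    import os

    stories_dir = 'cnn/stories'              # .story files from the dataset download
    tokenized_dir = 'cnn_stories_tokenized'  # destination for tokenized copies
    os.makedirs(tokenized_dir, exist_ok=True)
    tokenize_stories(stories_dir, tokenized_dir)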

References: Tutorial on using stanfordCorenlp in Python (blog)

Overview - CoreNLP (stanfordnlp.github.io)
