State of the Art Natural Language Processing Tools: A Comparison
Author: 禅与计算机程序设计艺术
1. Introduction
Natural language processing (NLP) plays a vital role in application domains such as speech recognition, text-based chatbots, information retrieval, and document analysis. Numerous open-source NLP tools are available for developers to build solutions with. This article evaluates several prominent NLP libraries and frameworks, emphasizing their key functionality while also noting limitations that warrant further investigation. The performance of these tools across different datasets is also compared briefly. This comparative analysis should help readers select the most suitable library or framework for their specific requirements. In short, the article offers an overview of state-of-the-art NLP tools and highlights the relative strengths and weaknesses of each, depending on the user's needs.
2. Basic Concepts and Terminology
Before delving into the technical aspects of each tool, we first introduce some fundamental terminology essential for understanding this review.
- Tokenization: splitting a sentence into individual words, phrases, or other meaningful units. The input text is decomposed into manageable pieces called tokens, which downstream NLP algorithms then process.
- Stopword removal: eliminating very frequent words (in English, words such as "the", "is", and "and") from a text. Stopwords generally contribute little to the semantic meaning of a text, so they can usually be dropped without losing valuable information.
- Stemming: reducing words to their root form. For instance, 'running', 'run', and 'runner' are typically stemmed to 'run'. The result is not always a valid word, but stemming effectively reduces the vocabulary size of a corpus.
- Lemmatization: reducing words to their base (dictionary) form. Unlike stemming, which merely strips off word endings, lemmatization uses part-of-speech information to map each word to its grammatical category and then reduce it to its lemma.
- Part-of-speech (POS) tagging: assigning each word in a sentence a category (noun, verb, adjective, etc.) that indicates its grammatical function. POS tags play an important role in identifying relationships between words and extracting relevant information from text.
- Named entity recognition (NER): extracting entities such as organizations, locations, persons, and dates from unstructured text by identifying predefined entity types and labeling each mention accordingly.
- Bag-of-words model: a representation that describes a document by the frequency of each term it contains while ignoring word order entirely. This compact representation is widely used in information retrieval and other NLP applications.
- TF-IDF: a weighting scheme, closely related to the bag-of-words model, that quantifies how important a word or phrase is. Each term's weight grows with its frequency in a document and shrinks with the number of documents in the corpus that contain it (a short sketch follows this list).
- Word embeddings: dense vector representations of word tokens in a semantic space. Semantically related words receive similar vectors, which lets machine learning models capture patterns that simpler bag-of-words representations miss.
- Sentiment analysis: classifying the sentiment expressed in a piece of text as positive, negative, or neutral. Traditional approaches combine sentiment lexicons, rule-based systems, and machine learning techniques.
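To make the bag-of-words and TF-IDF ideas concrete, here is a minimal sketch using scikit-learn (a library outside the three reviewed in this article, used purely for illustration; the tiny two-document corpus is made up):

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    corpus = [
        "the cat sat on the mat",
        "the dog sat on the log",
    ]

    # Bag-of-words: each document becomes a vector of raw term counts,
    # with word order discarded.
    bow = CountVectorizer()
    counts = bow.fit_transform(corpus)
    print(bow.get_feature_names_out())
    print(counts.toarray())

    # TF-IDF: terms that appear in both documents ("the", "sat", "on")
    # are down-weighted relative to terms specific to one document.
    tfidf = TfidfVectorizer()
    weights = tfidf.fit_transform(corpus)
    print(weights.toarray().round(2))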
3. Core Algorithms and Usage
Now let's examine the key features offered by spaCy, NLTK, and Stanford CoreNLP. We will discuss each library's approach to tokenization, stopword removal, stemming, lemmatization, part-of-speech tagging, named entity recognition, and sentiment analysis. We will also look at how these libraries support tasks such as information extraction, question answering, topic modeling, summarization, and dependency parsing. Finally, we will compare their effectiveness across four real-world datasets to provide a comparison grounded in practical applications.
3.1 spaCy
spaCy is a free, open-source natural language processing toolkit written for Python. It provides a range of high-level NLP capabilities, including tokenization, part-of-speech tagging, named entity recognition, dependency parsing, and more. Its architecture also makes it efficient to train custom language models.
Approach towards Tokenization
spaCy's tokenizer is broadly similar to NLTK's but differs in some details. It splits a text into a sequence of word tokens while preserving punctuation as separate tokens; multiword expressions such as "New York" are likewise split into their component words. Here is how to use the spaCy tokenizer:
    import spacy

    nlp = spacy.load('en_core_web_sm')      # Load the small English pipeline
    text = 'This is a sample sentence.'
    doc = nlp(text)                         # Create a Doc object from the text
    tokens = [token.text for token in doc]  # Extract the list of token strings
    print(tokens)
    Output:
    ['This', 'is', 'a', 'sample', 'sentence', '.']
Stopword Removal
Stopwords are frequent words that carry little or no meaning in a given context and are typically excluded from natural language processing tasks. spaCy ships a predefined stopword list for each supported language and marks matching tokens with the is_stop flag, which makes filtering straightforward:
    # Keep only tokens that are neither stopwords nor punctuation,
    # using spaCy's built-in stopword list
    filtered_tokens = [token.text for token in doc
                       if not token.is_stop and not token.is_punct]
    print(filtered_tokens)
    Output:
    ['sample', 'sentence']
Stemming vs Lemmatization
Both stemming and lemmatization reduce words to a root form, but they differ in approach: stemming applies pattern-based rules that strip word endings, while lemmatization follows the morphological rules of the language being analyzed, which makes it less error-prone. spaCy does not include a stemmer, so the example below falls back on NLTK's Porter stemmer applied to the filtered tokens:
    import nltk

    stemmer = nltk.stem.PorterStemmer()
    stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
    print(stemmed_tokens)
    Output:
    ['sampl', 'sentenc']
Similarly, lemmatization is available through spaCy's own token.lemma_ attribute, or through NLTK's WordNetLemmatizer as shown here:
    nltk.download('wordnet')  # Lexical database required by the WordNet lemmatizer

    lemmatizer = nltk.WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    print(lemmatized_tokens)
    Output:
    ['sample', 'sentence']
Part-of-speech Tagging
Part-of-speech tagging assigns each word a label indicating its grammatical role within a sentence. This helps in deciphering relationships between words and underpins the creation of semantic representations from text. spaCy exposes coarse-grained tags via token.pos_ and fine-grained Penn Treebank tags via token.tag_; the exact tags may vary slightly between model versions. Here is how to read the coarse-grained tags from the default pipeline:
    pos_tags = [(token.text, token.pos_) for token in doc]
    print(pos_tags)
    Output:
    [('This', 'DET'), ('is', 'VERB'), ('a', 'DET'), ('sample', 'ADJ'), ('sentence', 'NOUN'), ('.', 'PUNCT')]
Named Entity Recognition
Named entity recognition (NER) identifies named entities in text. spaCy ships with pretrained models that recognize many entity types. To use spaCy's NER:
- First, import the necessary modules and load a pretrained spaCy model.
- Run the input text through the pipeline, which tokenizes it and applies named entity recognition.
- Finally, read the recognized entities and their categories from doc.ents, as shown below:
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    print(entities)
    Output:
    []
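Because the sample sentence contains no named entities, the list above is empty. A sentence that actually mentions entities illustrates the output better; a small sketch (the example text is made up, and the exact labels may vary by model version):

    # A made-up sentence that contains recognizable entities
    doc2 = nlp("Apple opened a new office in London in March 2024.")
    for ent in doc2.ents:
        print(ent.text, ent.label_)
    # Expected output (may vary by model version):
    #   Apple       ORG
    #   London      GPE
    #   March 2024  DATE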
Sentiment Analysis
Sentiment analysis assesses the overall attitude expressed towards a topic or product, categorizing opinions in a text as positive, negative, or neutral. Traditional methods relied heavily on rule-based systems or dictionary lookups, while advances in deep learning, such as transfer learning, attention mechanisms, and recurrent neural networks (which capture sequential patterns in data), have produced more sophisticated solutions. spaCy itself does not ship a sentiment model, but the spacytextblob extension plugs TextBlob's polarity scores into the pipeline. Here's how to analyze the sentiment of a piece of text with it:
    from spacytextblob.spacytextblob import SpacyTextBlob  # registers the component

    nlp.add_pipe("spacytextblob")  # add the sentiment component to the pipeline
    doc = nlp(text)                # re-process the text so the new component runs
    sentiment = doc._.polarity     # polarity score in [-1, 1]
    # (some spacytextblob versions expose this as doc._.blob.polarity instead)
    print(sentiment)
    Output:
    0.0
Here, ._ is spaCy's extension-attribute namespace, through which custom properties added to the Doc object are accessed. The polarity score assigned by TextBlob ranges from -1 to +1: values close to zero indicate neutral sentiment, and larger absolute values indicate stronger sentiment.
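spaCy's default pipeline also produces the dependency parse mentioned at the start of this section; here is a minimal sketch that prints each token's syntactic relation and head word for the same Doc:

    # Each token records its syntactic relation (dep_) and its head token
    for token in doc:
        print(f"{token.text:10} {token.dep_:10} head={token.head.text}")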
3.2 NLTK
NLTK is one of the earliest and most widely used natural language processing libraries. It provides a comprehensive suite of tools for tokenization, stemming and lemmatization, tagging, classification, parsing, sentiment analysis, and much more. To install NLTK:
    pip install nltk
Approach towards Tokenization
Tokenization breaks a sentence into phrases, words, or other meaningful units. NLTK distinguishes two main kinds of tokenizers: sentence tokenizers, which split a passage of text into sentences, and word tokenizers, which split a sentence into individual words. To tokenize text with the word tokenizer, call word_tokenize():
    import nltk
    nltk.download('punkt') # Download Punkt sentence tokenizer
    text = 'This is a sample sentence.'
    tokens = nltk.word_tokenize(text)
    print(tokens)
    Output:
    ['This', 'is', 'a', 'sample', 'sentence', '.']
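The sentence tokenizer mentioned above splits a passage into sentences before word tokenization; a short sketch (the example passage is made up):

    passage = "NLTK is a mature toolkit. It ships many corpora and models."
    sentences = nltk.sent_tokenize(passage)  # uses the Punkt model downloaded above
    print(sentences)
    # ['NLTK is a mature toolkit.', 'It ships many corpora and models.']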
Note that word_tokenize splits contractions: "don't" becomes "do" and "n't". If you prefer a purely punctuation-based split that keeps every punctuation mark as its own token, NLTK also provides WordPunctTokenizer:
    tokenizer = nltk.tokenize.WordPunctTokenizer()
    print(tokenizer.tokenize("Don't do it."))
    Output:
    ['Don', "'", 't', 'do', 'it', '.']
Stopword Removal
Stopwords hold minimal significance in context and are typically excluded from texts. NLTK offers language-specific stopword lists in nltk.corpus.stopwords. To filter stopwords from a list of tokens, build a set from stopwords.words('english') and keep only the tokens that are not in it:
    import nltk
    nltk.download('stopwords') # Download stopwords
    stop_words = set(nltk.corpus.stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
    print(filtered_tokens)
    Output:
    ['sample', 'sentence', '.']
Stemming vs Lemmatization
Stemming and lemmatization both transform words into their base forms. However, stemming typically employs heuristic-based approaches, which can lead to inaccuracies due to word ambiguity. In contrast, lemmatization adheres to linguistic morphology rules for the analyzed language, ensuring more precise results. The NLTK library provides implementations of the Porter stemming algorithm and the WordNet lemmatizer. Here's how you can implement stemming using the NLTK library:
    porter = nltk.PorterStemmer()
    stemmed_tokens = [porter.stem(token) for token in filtered_tokens]
    print(stemmed_tokens)
    Output:
    ['sampl', 'sentenc', '.']
A straightforward way to lemmatize words is the WordNetLemmatizer class (the WordNet data must be downloaded first):
    nltk.download('wordnet')  # Lexical database used by the lemmatizer

    wnl = nltk.WordNetLemmatizer()
    lemmatized_tokens = [wnl.lemmatize(token) for token in filtered_tokens]
    print(lemmatized_tokens)
    Output:
    ['sample', 'sentence', '.']
Part-of-speech Tagging
Part-of-speech tagging categorizes each word in a sentence by its grammatical role. NLTK provides a built-in tagger via nltk.pos_tag(), which uses a pretrained averaged-perceptron model (download the averaged_perceptron_tagger resource before first use). For example, in the sentence "The cat leapt over the table at sunset," "leapt" functions as a verb, "the" as an article, "cat" as a noun, "over" as a preposition, and "table" as another noun. Here is how to use the tagger:
    nltk.download('averaged_perceptron_tagger')  # Pretrained tagger model

    tagged_tokens = nltk.pos_tag(filtered_tokens)
    print(tagged_tokens)
    Output:
    [('sample', 'NN'), ('sentence', 'NN'), ('.', '.')]
Note that the output is a list of tuples, where the first element of each tuple is the token and the second is its corresponding POS tag.
Named Entity Recognition
Named entity recognition (NER) identifies predefined named entities in text and categorizes them appropriately. NLTK provides a built-in NE chunker, nltk.ne_chunk(), backed by a pretrained maximum-entropy model (download the maxent_ne_chunker and words resources before first use). It expects POS-tagged tokens as input:
    nltk.download('maxent_ne_chunker')  # Pretrained NE chunker model
    nltk.download('words')              # Word list used by the chunker

    tree = nltk.ne_chunk(tagged_tokens)
    print(tree)
    Output:
    (S sample/NN sentence/NN ./.)
Because the filtered sample tokens contain no named entities, the resulting tree is flat, with every tagged token attached directly to the sentence node.
Sentiment Analysis
Sentiment analysis examines the overall sentiment expressed about a topic or product, classifying opinions in a text as positive, negative, or neutral. The NLTK library offers the VADER sentiment analyzer (nltk.sentiment.vader.SentimentIntensityAnalyzer()), which combines a sentiment lexicon with rule-based heuristics (download the vader_lexicon resource before first use). Here's how to use it:
    nltk.download('vader_lexicon')  # Lexicon used by VADER

    analyzer = nltk.sentiment.vader.SentimentIntensityAnalyzer()
    sentiment = analyzer.polarity_scores(text)['compound']
    print(sentiment)
    Output:
    0.0
Here, the compound polarity score ranges from -1 to +1, with scores near zero denoting neutral sentiment and larger absolute values indicating stronger sentiment. Keep in mind that the accuracy of sentiment analysis depends heavily on the quality of the underlying lexicon or dataset and the assumptions built into the model.
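The compound value above is derived from the negative, neutral, and positive components that VADER also reports; printing the full dictionary for a clearly positive sentence (made up for illustration) shows all four fields:

    # polarity_scores returns 'neg', 'neu', 'pos', and the normalized 'compound' score
    print(analyzer.polarity_scores("I really love this library!"))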
3.3 Stanford CoreNLP
Stanford CoreNLP is a robust open-source toolkit for natural language processing created by the Stanford NLP team at Stanford University. It provides an extensive collection of tools for text processing, encompassing tokenization, part-of-speech tagging, named entity recognition, dependency parsing, coreference resolution, sentiment analysis, and additional functionalities. This guide outlines the steps to download and utilize Stanford CoreNLP:
- Download and install a recent Java JDK from https://www.oracle.com/java/technologies/javase-downloads.html.
- Obtain the most recent release of Stanford CoreNLP from https://stanfordnlp.github.io/CoreNLP.
- Extract the downloaded archive into a directory of your choosing.
- Open a command prompt or terminal window and navigate into the extracted folder.
- Start the server by running: java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
- Wait until the server has started; by default it listens on port 9000.
- Submit requests to the server via HTTP POST, with the text you want to process in the request body. For example, the following request performs tokenization, part-of-speech tagging, named entity recognition, and sentiment analysis:
    import json
    import requests

    # The CoreNLP server reads the raw text to annotate from the POST body;
    # the annotators to run are passed as a JSON 'properties' parameter.
    props = {"annotators": "tokenize,ssplit,pos,ner,sentiment", "outputFormat": "json"}
    text = 'I am happy today!'
    response = requests.post('http://localhost:9000',
                             params={'properties': json.dumps(props)},
                             data=text.encode('utf-8')).json()
    print(response)
    Output:
    {
      "sentences": [
    {
      "index": 0, 
      "tokens": [
        {
          "index": 1, 
          "originalText": "I", 
          "characterOffsetBegin": 0, 
          "characterOffsetEnd": 1, 
          "pos": "PRP", 
          "ner": "O", 
          "sentiment": null
        }, 
        {
          "index": 2, 
          "originalText": "am", 
          "characterOffsetBegin": 2, 
          "characterOffsetEnd": 4, 
          "pos": "VBP", 
          "ner": "O", 
          "sentiment": {
            "score": 0.573, 
            "magnitude": 0.801
          }
        }, 
        {
          "index": 3, 
          "originalText": "happy", 
          "characterOffsetBegin": 5, 
          "characterOffsetEnd": 10, 
          "pos": "JJ", 
          "ner": "O", 
          "sentiment": {
            "score": 0.648, 
            "magnitude": 1.524
          }
        }, 
        {
          "index": 4, 
          "originalText": "today", 
          "characterOffsetBegin": 11, 
          "characterOffsetEnd": 16, 
          "pos": "RB", 
          "ner": "DATE", 
          "sentiment": null
        }, 
        {
          "index": 5, 
          "originalText": "!", 
          "characterOffsetBegin": 16, 
          "characterOffsetEnd": 17, 
          "pos": ".", 
          "ner": "O", 
          "sentiment": null
        }
      ]
    }
      ], 
      "coreferences": [], 
      "documentScores": {}, 
      "language": "English"
    }
Notice that the response includes detailed annotations for each token in the input text, including character offsets, part-of-speech tags, named-entity labels, and sentiment scores. Stanford CoreNLP also supports languages other than English, so the annotator settings may need to be adjusted depending on your requirements.
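Assuming the response structure shown above, the per-token annotations can be pulled out with a few lines of Python; this is only a sketch against that example payload:

    # Iterate over sentences and tokens in the JSON returned by the server
    for sentence in response.get("sentences", []):
        for tok in sentence.get("tokens", []):
            print(tok["originalText"], tok["pos"], tok["ner"])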
