
Natural Language Processing in Python – Building a Chatbot


Author: 禅与计算机程序设计艺术

1. Introduction

Chatbots are becoming increasingly popular, offering an efficient means of engaging users by processing user inquiries and delivering responses or recommendations based on their behavior and preferences. These chatbots can also streamline tasks that traditionally require human intervention, thereby saving time and effort. Constructing effective chatbots necessitates a solid understanding of NLP techniques such as tokenization, stemming, part-of-speech tagging, sentiment analysis, and named entity recognition. In this article, we will illustrate the process of building a simple yet potent chatbot using the NLTK library in Python. This tool is extensively utilized across academic and industrial sectors for various NLP applications, including text classification, information extraction, machine translation, and speech recognition.

This tutorial requires that the reader possesses elementary knowledge of Python programming and can effectively install libraries using pip. The subsequent sections address key aspects of the subject matter.

Course overview: an introduction to natural language processing techniques
Text tokenization: splitting continuous text into meaningful units
Stemming: stripping word endings to extract the core stem
Part-of-speech tagging: identifying the grammatical category of each word in a text
Sentiment analysis: assessing the emotional polarity expressed by a text
Entity recognition: identifying entities with specific meanings in a text
Model training: building machine learning models for classification tasks
Chatbot: designing an intelligent system capable of simple conversation
Course summary: reviewing and consolidating what was learned

Before proceeding, ensure that the nltk package has been installed on your system. From the command line, you can install it by typing 'pip install nltk'. If you are working inside a Jupyter Notebook or JupyterLab, you can run the same pip command from a notebook cell. Note that if you're new to NLTK, we recommend starting with its official documentation available at https://www.nltk.org.
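If you prefer to install from code (for example, from inside a notebook cell), the snippet below is one minimal way to do it; it simply invokes pip for the interpreter that is currently running.

    import subprocess
    import sys
    
    # Install nltk into the currently running Python environment; this is the
    # programmatic equivalent of typing "pip install nltk" at the command line.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "nltk"])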

    import nltk
    nltk.download('punkt')                        # Punkt tokenizer models
    nltk.download('stopwords')                    # stop word lists
    nltk.download('averaged_perceptron_tagger')   # default POS tagger used by pos_tag()
    nltk.download('wordnet')                      # WordNet data used by the lemmatizer

Once done, let's begin!

2. Introduction to NLP

Natural language processing (NLP) represents a specialized area within artificial intelligence focused on analyzing and interpreting human languages. Such tasks include word segmentation, sentence parsing, and sentiment analysis, which are essential components of NLP systems. For these tasks, NLP algorithms must be able to identify the various linguistic elements, including words, phrases, and sentences. These elements form linguistic structures known as tokens, which are analyzed using established rules and models. One of the most widely used techniques in NLP is statistical language modeling, which estimates how likely word sequences are and plays a crucial role in processing and understanding human language.
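As a minimal illustration of the idea behind statistical language modeling — counting which words tend to follow which — the sketch below builds bigram statistics over a toy corpus; the corpus string is made up purely for demonstration.

    from nltk import ConditionalFreqDist, bigrams
    from nltk.tokenize import word_tokenize
    
    # Toy corpus; a real language model would be estimated from far more text.
    corpus = "the cat sat on the mat . the cat ran ."
    tokens = word_tokenize(corpus)
    
    # For each word, count which words follow it (simple bigram statistics).
    cfd = ConditionalFreqDist(bigrams(tokens))
    print(cfd["the"].most_common())   # words most likely to follow "the"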

In this section, we will present some fundamental concepts related to NLP and provide an overview of the objectives of each task. Prior to delving deeper, it is crucial to grasp the underlying aspects of computer science terminology.

Tokens and Types

A token is a sequence of characters that corresponds to a basic unit of the text, such as a word or a punctuation mark. A type is the class or grouping to which all instances of a particular form belong: every occurrence of the word 'person' in a document is a separate token, but all of those occurrences belong to the single type 'person'. Similarly, to develop a chatbot, we must first define what constitutes a token and a type in our input data.
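To make the distinction concrete, the short snippet below counts tokens (every occurrence) and types (distinct forms) in an arbitrary sample sentence, using a plain whitespace split; a proper tokenizer is introduced in the next subsection.

    # Arbitrary sample text for illustration.
    sample = "the cat chased the dog and the dog ran"
    tokens = sample.split()          # every occurrence is a token
    types = set(tokens)              # distinct forms are the types
    
    print(len(tokens), "tokens")     # 9 tokens
    print(len(types), "types")       # 6 types: the, cat, chased, dog, and, ran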

Tokenization

The process of splitting a text into individual tokens is referred to as tokenization. Depending on the specific application requirements, there are various methods available for tokenizing text. A typical approach is to divide the text into individual words or terms. We can utilize the word_tokenize() function from the NLTK library to perform tokenization. Here’s an example:

    from nltk.tokenize import word_tokenize
    
    text = "Hello world! How are you doing today?"
    tokens = word_tokenize(text)
    print(tokens)

Output: ['Hello', 'world', '!', 'How', 'are', 'you', 'doing', 'today', '?']

Tokenization of a sentence typically yields multiple tokens, and punctuation marks often become tokens of their own. Consequently, when handling specific NLP tasks, it is important to keep track of where punctuation sits in the original text so that it does not interfere with the model's output.
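One simple way to keep track of where each token, punctuation included, sits in the original string is to record character offsets while scanning the text. The helper below is a minimal sketch written for this tutorial, not an NLTK API.

    from nltk.tokenize import word_tokenize
    
    def token_spans(text):
        # Return (token, start, end) character offsets into the original text.
        spans, cursor = [], 0
        for tok in word_tokenize(text):
            start = text.find(tok, cursor)
            spans.append((tok, start, start + len(tok)))
            cursor = start + len(tok)
        return spans
    
    for tok, start, end in token_spans("Hello world! How are you doing today?"):
        print(tok, start, end)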

Types

Another aspect of natural language processing encompasses the identification of various types within a text. Types typically encompass elements such as nouns, verbs, adjectives, and adverbs. Recognizing these types enables the extraction of meaningful text features, which can subsequently be incorporated into a machine learning framework for analysis.

We can classify different types through part-of-speech (POS) tagging. A POS tagger assigns a label to each word in a sentence based on its grammatical role. The common parts of speech include nouns, verbs, adjectives, adverbs, pronouns, conjunctions, and so on. Utilizing the pos_tag() function provided by NLTK, we can apply POS tagging to the tokens within a given text. For instance, consider the sentence:

    from nltk.tokenize import word_tokenize
    from nltk.tag import pos_tag
    
    text = "Hello world! How are you doing today?"
    tokens = word_tokenize(text)
    tags = pos_tag(tokens)
    for tag in tags:
        print(tag[0], tag[1])

Output:

    Hello NNP
    world NN
    ! .
    How WRB
    are VBP
    you PRP
    doing VBG
    today NN
    ? .

Notice how a POS tag is assigned to each token. These are Penn Treebank tags, the default tagset used by NLTK's pos_tag(): NNP marks a proper noun, VBP a present-tense verb, VBG a gerund, PRP a personal pronoun, and '.' sentence-final punctuation.
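If coarser categories such as NOUN, VERB, and PRON are easier to work with than Penn Treebank codes, NLTK can map its output to the Universal tagset; the extra download below fetches the mapping tables and may be unnecessary if you already have them.

    import nltk
    from nltk import pos_tag
    from nltk.tokenize import word_tokenize
    
    nltk.download('universal_tagset')   # mapping from Penn Treebank to coarse tags
    tokens = word_tokenize("Hello world! How are you doing today?")
    print(pos_tag(tokens, tagset='universal'))   # e.g. ('are', 'VERB'), ('you', 'PRON'), ...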

Having completed the discussion on tokenization and part-of-speech tagging, we now proceed to explore more advanced NLP techniques, including stemming, lemmatization, and named entity recognition tasks.

3. Stemming and Lemmatization

Stemming and lemmatization are both preprocessing steps that reduce words to a base or root form. The two approaches pursue a similar objective but differ in how they get there, as previewed in the snippet below.
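As a quick preview of the difference, the snippet below runs the same words through NLTK's Porter stemmer and WordNet lemmatizer, both of which are introduced in the subsections that follow; the word list is arbitrary.

    from nltk.stem import PorterStemmer, WordNetLemmatizer
    
    ps = PorterStemmer()
    wnl = WordNetLemmatizer()
    for w in ["studies", "studying", "running"]:
        # Stems may be truncated non-words; lemmas are dictionary forms.
        print(w, ps.stem(w), wnl.lemmatize(w, pos="v"))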

Stemming

Stemming removes affixes from a word to obtain its stem. For example, the stem of 'running', 'run', and 'runs' is 'run'. However, stemming can produce incorrect stems, largely because of the irregularities of English. In addition, stemming often produces strings that are not real words at all, such as 'conveni' for 'convenience'.

The Porter stemmer, a stemming algorithm included in NLTK, offers an effective way to perform stemming. Let's consider an example: first, initialize a PorterStemmer instance, then pass a few related words through it. The code snippet below does this for 'convenience', 'conveniences', and 'convenient', and all three are reduced to the same stem, 'conveni'.

    from nltk.stem import PorterStemmer
    
    ps = PorterStemmer()
    words = ["convenience", "conveniences", "convenient"]
    for w in words:
        print(w, ps.stem(w))

Output:

    convenience conveni
    conveniences conveni
    convenient conveni

Here, we created three test cases and applied the Porter stemmer to each one. The resulting stems are not always accurate, as the Porter stemmer does not account for the context in which a word appears. However, stemming can still be beneficial in reducing the number of unique words in a corpus while retaining sufficient information for downstream tasks.
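To see the vocabulary-reduction effect on a slightly larger input, the snippet below compares the number of distinct surface forms with the number of distinct stems; the sample sentence is arbitrary.

    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize
    
    ps = PorterStemmer()
    text = "He runs daily, she ran yesterday, and they will be running tomorrow."
    tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]
    stems = [ps.stem(t) for t in tokens]
    
    print("unique words:", len(set(tokens)))   # distinct surface forms
    print("unique stems:", len(set(stems)))    # smaller (or equal) after stemming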

Lemmatization

Lemmatization entails choosing the canonical or dictionary form of a word (its lemma) rather than a truncated stem. It is generally more precise than stemming, particularly when a surface form could map to several base forms, but it is computationally more expensive and requires additional resources such as a dictionary.

The WordNetLemmatizer included in NLTK is designed for lemmatization; it relies on the WordNet corpus, which can be fetched with nltk.download('wordnet') as shown in the setup step above.

    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize
    
    lemmatizer = WordNetLemmatizer()
    text = "they ran quickly during the race"
    tokens = word_tokenize(text)
    # Lemmatize each token as a verb; without pos='v' the lemmatizer assumes nouns
    # and would leave 'ran' unchanged.
    lemmas = [lemmatizer.lemmatize(t, pos='v') for t in tokens]
    print(" ".join(lemmas))

Output:

    they run quickly during the race

Here, lemmatization maps the past-tense form 'ran' to its dictionary form 'run', while genuine words such as 'quickly', 'during', and 'race' are left untouched; a stemmer, by contrast, might truncate some of them into non-words. While stemming and lemmatization often yield comparable results, they should not be considered synonymous. Depending on the task, factors such as vocabulary size, speed, and whether real dictionary forms are needed may influence the choice between them.
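How precise lemmatization is depends on telling the lemmatizer the part of speech: by default, WordNetLemmatizer treats every word as a noun, which is why the example above passes pos='v'. A quick comparison:

    from nltk.stem import WordNetLemmatizer
    
    lemmatizer = WordNetLemmatizer()
    print(lemmatizer.lemmatize("ran"))              # 'ran'  (treated as a noun by default)
    print(lemmatizer.lemmatize("ran", pos="v"))     # 'run'
    print(lemmatizer.lemmatize("better", pos="a"))  # 'good'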

4. Part-of-Speech Tagging

Part-of-speech tagging, commonly referred to as POS tagging, is a crucial element of natural language processing (NLP) pipelines. It assigns a syntactic role to each word in a sentence, which helps machines decipher the contextual relationships between words and improves the precision of downstream language processing tasks. The effectiveness of POS tagging depends on the methodology chosen; approaches can be broadly categorized as rule-based, statistical, or neural. This article provides a brief overview of two of them: rule-based tagging and unsupervised (HMM-based) tagging.

Rule-Based Approach

Rule-based approaches rely on sets of patterns and transformations that map input sequences to output sequences according to predefined rules. A typical pattern uses regular expressions that match combinations of letters and symbols associated with particular parts of speech. Such rules can be formulated to address specific scenarios, which makes them practical for narrow use cases; however, they tend to struggle in larger-scale applications, which leads to scalability and maintenance issues. Additionally, these approaches do not inherently account for semantic aspects of language, such as coreference resolution or word-sense disambiguation.

The following illustrates a rule-based approach to POS tagging that uses regular expressions over the word forms themselves: a small pronoun list, a few common suffix patterns, and a fallback rule that tags everything else as a noun.

    import re
    
    def pos_tag_regex(sentence):
        # Word-level rules: a small pronoun list plus common suffix patterns.
        # The final catch-all rule tags anything unmatched as a noun.
        regex_patterns = [
            (r'^(I|you|he|she|it|we|they)$', 'pronoun'),   # personal pronouns
            (r'.*ing$', 'verb'),                           # present participles
            (r'.*(ive|ous|ful|able)$', 'adjective'),       # common adjective suffixes
            (r'.*ly$', 'adverb'),                          # common adverb suffix
            (r'.*', 'noun')]                               # default rule
    
        tagged_sentence = []
        for token in sentence.split():
            for pattern, tag in regex_patterns:
                if re.match(pattern, token):
                    tagged_sentence.append((token, tag))
                    break
    
        return tagged_sentence

Let's apply this function to the sample sentence "I love playing soccer":

    text = "I love playing soccer"
    tagged_sent = pos_tag_regex(text)
    print(tagged_sent)

Output: [('I', 'pronoun'), ('love', 'noun'), ('playing', 'verb'), ('soccer', 'noun')]

Note that 'love' is mis-tagged as a noun because it matches none of the verb rules. Applying regular expressions directly to raw text inevitably produces such false positives and negatives, leading to inconsistent results. Moreover, regular-expression-based POS taggers focus exclusively on the surface (morphological) properties of words and disregard lexical and contextual cues.

Unsupervised Approach

Unsupervised learning techniques aim to uncover the inherent structure and patterns of natural language without annotated or labeled data. Among the widely recognized approaches, the Hidden Markov Model (HMM) stands out: a probabilistic generative model that scores observed word sequences in terms of hidden variables (the tags). The model assumes that each hidden tag depends only on the tag immediately before it, and that each word depends only on its own tag.
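To make the Markov assumption concrete, the joint probability of a tagged sentence factorizes into one transition term and one emission term per position. The sketch below evaluates that product for a tiny hand-made model; the tags and probability values are invented purely for illustration.

    # P(words, tags) = P(tag_1) * P(word_1 | tag_1)
    #                  * product over i>1 of P(tag_i | tag_{i-1}) * P(word_i | tag_i)
    def hmm_joint_probability(words, tags, init_p, trans_p, emit_p):
        prob = init_p[tags[0]] * emit_p[tags[0]][words[0]]
        for i in range(1, len(words)):
            prob *= trans_p[tags[i - 1]][tags[i]] * emit_p[tags[i]][words[i]]
        return prob
    
    # Hypothetical two-tag model with made-up probabilities.
    init_p = {'PRON': 0.5, 'VERB': 0.1}
    trans_p = {'PRON': {'PRON': 0.1, 'VERB': 0.8}, 'VERB': {'PRON': 0.3, 'VERB': 0.2}}
    emit_p = {'PRON': {'they': 0.4}, 'VERB': {'run': 0.3}}
    print(hmm_joint_probability(['they', 'run'], ['PRON', 'VERB'], init_p, trans_p, emit_p))
    # 0.5 * 0.4 * 0.8 * 0.3 = 0.048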

An HMM tagger uses a sequence of observations to estimate the parameters of a model that captures the joint probability distribution over word and tag sequences. Once trained, the model can label new text by finding the most likely tag sequence given the observed words. Two algorithms are central here: the Forward-Backward (Baum-Welch) algorithm, typically used to estimate the model's parameters, and the Viterbi algorithm, which recovers the most likely tag sequence for a new sentence.
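NLTK itself does not expose a standalone viterbi() helper, so the decoding step later in this section assumes one is available. The sketch below is one minimal way to write it; the dictionary-based probability tables, the toy example, and the 1e-6 floor for unseen words are choices made here for illustration.

    def viterbi(obs, init_probability, transition_probabilities, emission_probabilities):
        states = list(init_probability)
        # V[i][s]: probability of the best path that ends in state s after observation i.
        V = [{s: init_probability[s] * emission_probabilities[s].get(obs[0], 1e-6)
              for s in states}]
        back = [{}]
        for i in range(1, len(obs)):
            V.append({})
            back.append({})
            for s in states:
                prob, prev = max(
                    (V[i - 1][p] * transition_probabilities[p][s]
                     * emission_probabilities[s].get(obs[i], 1e-6), p)
                    for p in states)
                V[i][s] = prob
                back[i][s] = prev
        # Trace back the most likely state sequence.
        last = max(V[-1], key=V[-1].get)
        max_prob = V[-1][last]
        best_path = [last]
        for i in range(len(obs) - 1, 0, -1):
            best_path.insert(0, back[i][best_path[0]])
        return best_path, max_prob
    
    # Toy decoding example with two hypothetical tags.
    init_p = {'noun': 0.6, 'verb': 0.4}
    trans_p = {'noun': {'noun': 0.3, 'verb': 0.7}, 'verb': {'noun': 0.8, 'verb': 0.2}}
    emit_p = {'noun': {'dogs': 0.5, 'cats': 0.4}, 'verb': {'bark': 0.6, 'meow': 0.3}}
    print(viterbi(['dogs', 'bark'], init_p, trans_p, emit_p))   # (['noun', 'verb'], 0.126)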

First, we will preprocess the textual data to remove special characters, numbers, and punctuation marks. Following this, we will tokenize the text into individual words and build a list of (word, tag) tuples. Finally, we will use the scikit-learn LabelEncoder to convert the labels into integer form, so that each label is uniquely represented for the subsequent processing steps.

    import string
    from sklearn.preprocessing import LabelEncoder
    from nltk import pos_tag
    from nltk.tokenize import word_tokenize
    
    def preprocess(text):
        # Strip punctuation, tokenize, and pair each token with a POS label.
        translator = str.maketrans('', '', string.punctuation)
        stripped_text = text.translate(translator)
        tokens = word_tokenize(stripped_text)
        return [(t, get_pos(t)) for t in tokens]
    
    def get_pos(word):
        # Placeholder labeling logic: tag each word in isolation with NLTK's
        # default tagger. Replace with custom rules or gold annotations as needed.
        return pos_tag([word])[0][1]
    
    data = preprocess("I love playing soccer.")
    labels = [label for _, label in data]
    
    # Convert string labels to integers for downstream modeling.
    encoder = LabelEncoder()
    encoded_labels = encoder.fit_transform(labels)

Next, we can set up the HMM model and estimate the initial, transition, and emission probabilities from corpus statistics, then apply the Viterbi algorithm to decode tag sequences for the held-out data. Given that POS tagging is a sequence labeling task, we cannot simply evaluate performance on a single document. Rather, we should measure overall accuracy, precision, recall, and F1 score for a reliable evaluation, comparing predicted labels against actual labels across the whole dataset.

    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, precision_recall_fscore_support
    from collections import Counter, defaultdict
    
    # Split the (token, tag) pairs and their encoded labels into train and test
    # portions. With a realistic corpus you would split by sentence, not by token.
    X_train, X_test, y_train, y_test = train_test_split(
        data, encoded_labels, test_size=0.2, random_state=42)
    
    states = sorted(set(y_train))
    
    # Estimate initial, transition, and emission probabilities from training counts.
    # Add-one smoothing keeps unseen transitions from having zero probability.
    init_counts = Counter(y_train)
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for i, ((token, _), label) in enumerate(zip(X_train, y_train)):
        emit_counts[label][token.lower()] += 1
        if i > 0:
            trans_counts[y_train[i - 1]][label] += 1
    
    init_probability = {s: (init_counts[s] + 1) / (len(y_train) + len(states))
                        for s in states}
    transition_probabilities = {
        s: {t: (trans_counts[s][t] + 1) / (sum(trans_counts[s].values()) + len(states))
            for t in states}
        for s in states}
    emission_probabilities = {
        s: {w: c / sum(emit_counts[s].values()) for w, c in emit_counts[s].items()}
        for s in states}
    
    # Decode the held-out tokens with the Viterbi algorithm (viterbi() here refers to
    # an implementation such as the minimal sketch shown earlier).
    obs = [token.lower() for token, _ in X_test]
    y_pred, max_prob = viterbi(obs, init_probability, transition_probabilities,
                               emission_probabilities)
    
    accuracy = accuracy_score(y_test, y_pred)
    precision, recall, f1, support = precision_recall_fscore_support(
        y_test, y_pred, average='weighted')
    print("Accuracy:", accuracy)
    print("Precision:", precision)
    print("Recall:", recall)
    print("F1 Score:", f1)

Finally, the performance of the HMM classifier can be inspected by comparing the actual POS tags against the predicted ones.

    actual_tags = [encoder.inverse_transform([t])[0] for t in y_test]
    predicted_tags = [encoder.inverse_transform([t])[0] for t in y_pred]
    
    for act_tag, pred_tag in zip(actual_tags, predicted_tags):
        print(act_tag + "\t-->\t" + pred_tag)
