
Natural Language Processing in Python – Building a Chatbot


Author: 禅与计算机程序设计艺术

1. Introduction

Chatbots are becoming increasingly popular, offering an efficient means of engaging users by processing user inquiries and delivering responses or recommendations based on their behavior and preferences. These chatbots can also streamline tasks that traditionally require human intervention, thereby saving time and effort. Constructing effective chatbots necessitates a solid understanding of NLP techniques such as tokenization, stemming, part-of-speech tagging, sentiment analysis, and named entity recognition. In this article, we will illustrate the process of building a simple yet potent chatbot using the NLTK library in Python. This tool is extensively utilized across academic and industrial sectors for various NLP applications, including text classification, information extraction, machine translation, and speech recognition.

This tutorial requires that the reader possesses elementary knowledge of Python programming and can effectively install libraries using pip. The subsequent sections address key aspects of the subject matter.

Course overview: an introduction to natural language processing techniques
Text tokenization: splitting continuous text into meaningful units
Stemming: stripping word endings to extract the core stem
Part-of-speech tagging: identifying the grammatical category of each word in a text
Sentiment analysis: assessing the emotional polarity expressed by a text
Entity recognition: identifying entities with specific meanings in a text
Model training: building machine learning models for classification tasks
Chatbot: designing an intelligent system capable of simple conversation
Course summary: reviewing and consolidating what was learned

Before proceeding, ensure that the nltk package has been installed on your system. From the command line, you can install it by typing 'pip install nltk'. If you are working inside a Jupyter Notebook or JupyterLab, you can run the same pip command from a notebook cell. Note that if you're new to NLTK, we recommend starting with its official documentation available at https://www.nltk.org.
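If you prefer to install from code (for example, from inside a notebook cell), the snippet below is one minimal way to do it; it simply invokes pip for the interpreter that is currently running.

    import subprocess
    import sys
    
    # Install nltk into the currently running Python environment; this is the
    # programmatic equivalent of typing "pip install nltk" at the command line.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "nltk"])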

    import nltk
    nltk.download('punkt')                        # Punkt tokenizer models
    nltk.download('stopwords')                    # stop word lists
    nltk.download('averaged_perceptron_tagger')   # default POS tagger used by pos_tag()
    nltk.download('wordnet')                      # WordNet data used by the lemmatizer

Once done, let's begin!

2. Introduction to NLP

Natural language processing (NLP) represents a specialized area within artificial intelligence focused on analyzing and interpreting human languages. Such tasks include word segmentation, sentence parsing, and sentiment analysis, which are essential components of NLP systems. For these tasks, NLP algorithms must be able to identify the various linguistic elements, including words, phrases, and sentences. These elements form linguistic structures known as tokens, which are analyzed using established rules and models. One of the most widely used techniques in NLP is statistical language modeling, which estimates how likely word sequences are and plays a crucial role in processing and understanding human language.
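As a minimal illustration of the idea behind statistical language modeling — counting which words tend to follow which — the sketch below builds bigram statistics over a toy corpus; the corpus string is made up purely for demonstration.

    from nltk import ConditionalFreqDist, bigrams
    from nltk.tokenize import word_tokenize
    
    # Toy corpus; a real language model would be estimated from far more text.
    corpus = "the cat sat on the mat . the cat ran ."
    tokens = word_tokenize(corpus)
    
    # For each word, count which words follow it (simple bigram statistics).
    cfd = ConditionalFreqDist(bigrams(tokens))
    print(cfd["the"].most_common())   # words most likely to follow "the"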

In this section, we will present some fundamental concepts related to NLP and provide an overview of the objectives of each task. Prior to delving deeper, it is crucial to grasp the underlying aspects of computer science terminology.

Tokens and Types

A token is a sequence of characters that corresponds to a basic unit of the text, such as a word or a punctuation mark. A type is the class or grouping to which all instances of a particular form belong: every occurrence of the word 'person' in a document is a separate token, but all of those occurrences belong to the single type 'person'. Similarly, to develop a chatbot, we must first define what constitutes a token and a type in our input data.
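To make the distinction concrete, the short snippet below counts tokens (every occurrence) and types (distinct forms) in an arbitrary sample sentence, using a plain whitespace split; a proper tokenizer is introduced in the next subsection.

    # Arbitrary sample text for illustration.
    sample = "the cat chased the dog and the dog ran"
    tokens = sample.split()          # every occurrence is a token
    types = set(tokens)              # distinct forms are the types
    
    print(len(tokens), "tokens")     # 9 tokens
    print(len(types), "types")       # 6 types: the, cat, chased, dog, and, ran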

Tokenization

The process of splitting a text into individual tokens is referred to as tokenization. Depending on the specific application requirements, there are various methods available for tokenizing text. A typical approach is to divide the text into individual words or terms. We can utilize the word_tokenize() function from the NLTK library to perform tokenization. Here’s an example:

    from nltk.tokenize import word_tokenize
    
    text = "Hello world! How are you doing today?"
    tokens = word_tokenize(text)
    print(tokens)

Output: ['Hello', 'world', '!', 'How', 'are', 'you', 'doing', 'today', '?']

Tokenization of a sentence typically yields multiple tokens, and punctuation marks often become tokens of their own. Consequently, when handling specific NLP tasks, it is important to keep track of where punctuation sits in the original text so that it does not interfere with the model's output.
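One simple way to keep track of where each token, punctuation included, sits in the original string is to record character offsets while scanning the text. The helper below is a minimal sketch written for this tutorial, not an NLTK API.

    from nltk.tokenize import word_tokenize
    
    def token_spans(text):
        # Return (token, start, end) character offsets into the original text.
        spans, cursor = [], 0
        for tok in word_tokenize(text):
            start = text.find(tok, cursor)
            spans.append((tok, start, start + len(tok)))
            cursor = start + len(tok)
        return spans
    
    for tok, start, end in token_spans("Hello world! How are you doing today?"):
        print(tok, start, end)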

Types

Another aspect of natural language processing encompasses the identification of various types within a text. Types typically encompass elements such as nouns, verbs, adjectives, and adverbs. Recognizing these types enables the extraction of meaningful text features, which can subsequently be incorporated into a machine learning framework for analysis.

We can classify different types through part-of-speech (POS) tagging. A POS tagger assigns a label to each word in a sentence based on its grammatical role. The common parts of speech include nouns, verbs, adjectives, adverbs, pronouns, conjunctions, and so on. Utilizing the pos_tag() function provided by NLTK, we can apply POS tagging to the tokens within a given text. For instance, consider the sentence:

    from nltk.tokenize import word_tokenize
    from nltk.tag import pos_tag
    
    text = "Hello world! How are you doing today?"
    tokens = word_tokenize(text)
    tags = pos_tag(tokens)
    for tag in tags:
        print(tag[0], tag[1])

Output:

    Hello NNP
    world NN
    ! .
    How WRB
    are VBP
    you PRP
    doing VBG
    today NN
    ? .

Notice how a POS tag is assigned to each token. These are Penn Treebank tags, the default tagset used by NLTK's pos_tag(): NNP marks a proper noun, VBP a present-tense verb, VBG a gerund, PRP a personal pronoun, and '.' sentence-final punctuation.
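If coarser categories such as NOUN, VERB, and PRON are easier to work with than Penn Treebank codes, NLTK can map its output to the Universal tagset; the extra download below fetches the mapping tables and may be unnecessary if you already have them.

    import nltk
    from nltk import pos_tag
    from nltk.tokenize import word_tokenize
    
    nltk.download('universal_tagset')   # mapping from Penn Treebank to coarse tags
    tokens = word_tokenize("Hello world! How are you doing today?")
    print(pos_tag(tokens, tagset='universal'))   # e.g. ('are', 'VERB'), ('you', 'PRON'), ...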

Having completed the discussion on tokenization and part-of-speech tagging, we now proceed to explore more advanced NLP techniques, including stemming, lemmatization, and named entity recognition tasks.

3. Stemming and Lemmatization

Stemming and lemmatization are both preprocessing steps that reduce words to a base or root form. The two approaches pursue a similar objective but differ in how they get there, as previewed in the snippet below.
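As a quick preview of the difference, the snippet below runs the same words through NLTK's Porter stemmer and WordNet lemmatizer, both of which are introduced in the subsections that follow; the word list is arbitrary.

    from nltk.stem import PorterStemmer, WordNetLemmatizer
    
    ps = PorterStemmer()
    wnl = WordNetLemmatizer()
    for w in ["studies", "studying", "running"]:
        # Stems may be truncated non-words; lemmas are dictionary forms.
        print(w, ps.stem(w), wnl.lemmatize(w, pos="v"))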

Stemming

Stemming removes affixes from a word to obtain its stem. For example, the stem of 'running', 'run', and 'runs' is 'run'. However, stemming can produce incorrect stems, largely because of the irregularities of English. In addition, stemming often produces strings that are not real words at all, such as 'conveni' for 'convenience'.

The Porter stemmer, a stemming algorithm included in NLTK, offers an effective way to perform stemming. Let's consider an example: first, initialize a PorterStemmer instance, then pass a few related words through it. The code snippet below does this for 'convenience', 'conveniences', and 'convenient', and all three are reduced to the same stem, 'conveni'.

    from nltk.stem import PorterStemmer
    
    ps = PorterStemmer()
    words = ["convenience", "conveniences", "convenient"]
    for w in words:
        print(w, ps.stem(w))

Output:

    convenience conveni
    conveniences conveni
    convenient conveni

Here, we created three test cases and applied the Porter stemmer to each one. The resulting stems are not always accurate, as the Porter stemmer does not account for the context in which a word appears. However, stemming can still be beneficial in reducing the number of unique words in a corpus while retaining sufficient information for downstream tasks.
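To see the vocabulary-reduction effect on a slightly larger input, the snippet below compares the number of distinct surface forms with the number of distinct stems; the sample sentence is arbitrary.

    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize
    
    ps = PorterStemmer()
    text = "He runs daily, she ran yesterday, and they will be running tomorrow."
    tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]
    stems = [ps.stem(t) for t in tokens]
    
    print("unique words:", len(set(tokens)))   # distinct surface forms
    print("unique stems:", len(set(stems)))    # smaller (or equal) after stemming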

Lemmatization

Lemmatization entails choosing the canonical or dictionary form of a word (its lemma) rather than a truncated stem. It is generally more precise than stemming, particularly when a surface form could map to several base forms, but it is computationally more expensive and requires additional resources such as a dictionary.

The WordNetLemmatizer included in NLTK is designed for lemmatization; it relies on the WordNet corpus, which can be fetched with nltk.download('wordnet') as shown in the setup step above.

    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize
    
    lemmatizer = WordNetLemmatizer()
    text = "they ran quickly during the race"
    tokens = word_tokenize(text)
    # Lemmatize each token as a verb; without pos='v' the lemmatizer assumes nouns
    # and would leave 'ran' unchanged.
    lemmas = [lemmatizer.lemmatize(t, pos='v') for t in tokens]
    print(" ".join(lemmas))

Output:

    they run quickly during the race

Here, lemmatization maps the past-tense form 'ran' to its dictionary form 'run', while genuine words such as 'quickly', 'during', and 'race' are left untouched; a stemmer, by contrast, might truncate some of them into non-words. While stemming and lemmatization often yield comparable results, they should not be considered synonymous. Depending on the task, factors such as vocabulary size, speed, and whether real dictionary forms are needed may influence the choice between them.
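How precise lemmatization is depends on telling the lemmatizer the part of speech: by default, WordNetLemmatizer treats every word as a noun, which is why the example above passes pos='v'. A quick comparison:

    from nltk.stem import WordNetLemmatizer
    
    lemmatizer = WordNetLemmatizer()
    print(lemmatizer.lemmatize("ran"))              # 'ran'  (treated as a noun by default)
    print(lemmatizer.lemmatize("ran", pos="v"))     # 'run'
    print(lemmatizer.lemmatize("better", pos="a"))  # 'good'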

4. Part-of-Speech Tagging

Part-of-speech tagging, commonly referred to as POS tagging, is a crucial element of natural language processing (NLP) pipelines. It assigns a syntactic role to each word in a sentence, which helps machines decipher the contextual relationships between words and improves the precision of downstream language processing tasks. The effectiveness of POS tagging depends on the methodology chosen; approaches can be broadly categorized as rule-based, statistical, or neural. This article provides a brief overview of two of them: rule-based tagging and unsupervised (HMM-based) tagging.

Rule-Based Approach

Rule-based approaches rely on sets of patterns and transformations that map input sequences to output sequences according to predefined rules. A typical pattern uses regular expressions that match combinations of letters and symbols associated with particular parts of speech. Such rules can be formulated to address specific scenarios, which makes them practical for narrow use cases; however, they tend to struggle in larger-scale applications, which leads to scalability and maintenance issues. Additionally, these approaches do not inherently account for semantic aspects of language, such as coreference resolution or word-sense disambiguation.

The following illustrates a rule-based approach to POS tagging that uses regular expressions over the word forms themselves: a small pronoun list, a few common suffix patterns, and a fallback rule that tags everything else as a noun.

    import re
    
    def pos_tag_regex(sentence):
        # Word-level rules: a small pronoun list plus common suffix patterns.
        # The final catch-all rule tags anything unmatched as a noun.
        regex_patterns = [
            (r'^(I|you|he|she|it|we|they)$', 'pronoun'),   # personal pronouns
            (r'.*ing$', 'verb'),                           # present participles
            (r'.*(ive|ous|ful|able)$', 'adjective'),       # common adjective suffixes
            (r'.*ly$', 'adverb'),                          # common adverb suffix
            (r'.*', 'noun')]                               # default rule
    
        tagged_sentence = []
        for token in sentence.split():
            for pattern, tag in regex_patterns:
                if re.match(pattern, token):
                    tagged_sentence.append((token, tag))
                    break
    
        return tagged_sentence

Let's apply this function to the sample sentence "I love playing soccer":

    text = "I love playing soccer"
    tagged_sent = pos_tag_regex(text)
    print(tagged_sent)

Output: [('I', 'pronoun'), ('love', 'noun'), ('playing', 'verb'), ('soccer', 'noun')]

Note that 'love' is mis-tagged as a noun because it matches none of the verb rules. Applying regular expressions directly to raw text inevitably produces such false positives and negatives, leading to inconsistent results. Moreover, regular-expression-based POS taggers focus exclusively on the surface (morphological) properties of words and disregard lexical and contextual cues.

Unsupervised Approach

Unsupervised learning techniques aim to uncover the inherent structure and patterns of natural language without annotated or labeled data. Among the widely recognized approaches, the Hidden Markov Model (HMM) stands out: a probabilistic generative model that scores observed word sequences in terms of hidden variables (the tags). The model assumes that each hidden tag depends only on the tag immediately before it, and that each word depends only on its own tag.
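To make the Markov assumption concrete, the joint probability of a tagged sentence factorizes into one transition term and one emission term per position. The sketch below evaluates that product for a tiny hand-made model; the tags and probability values are invented purely for illustration.

    # P(words, tags) = P(tag_1) * P(word_1 | tag_1)
    #                  * product over i>1 of P(tag_i | tag_{i-1}) * P(word_i | tag_i)
    def hmm_joint_probability(words, tags, init_p, trans_p, emit_p):
        prob = init_p[tags[0]] * emit_p[tags[0]][words[0]]
        for i in range(1, len(words)):
            prob *= trans_p[tags[i - 1]][tags[i]] * emit_p[tags[i]][words[i]]
        return prob
    
    # Hypothetical two-tag model with made-up probabilities.
    init_p = {'PRON': 0.5, 'VERB': 0.1}
    trans_p = {'PRON': {'PRON': 0.1, 'VERB': 0.8}, 'VERB': {'PRON': 0.3, 'VERB': 0.2}}
    emit_p = {'PRON': {'they': 0.4}, 'VERB': {'run': 0.3}}
    print(hmm_joint_probability(['they', 'run'], ['PRON', 'VERB'], init_p, trans_p, emit_p))
    # 0.5 * 0.4 * 0.8 * 0.3 = 0.048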

An HMM tagger uses a sequence of observations to estimate the parameters of a model that captures the joint probability distribution over word and tag sequences. Once trained, the model can label new text by finding the most likely tag sequence given the observed words. Two algorithms are central here: the Forward-Backward (Baum-Welch) algorithm, typically used to estimate the model's parameters, and the Viterbi algorithm, which recovers the most likely tag sequence for a new sentence.
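NLTK itself does not expose a standalone viterbi() helper, so the decoding step later in this section assumes one is available. The sketch below is one minimal way to write it; the dictionary-based probability tables, the toy example, and the 1e-6 floor for unseen words are choices made here for illustration.

    def viterbi(obs, init_probability, transition_probabilities, emission_probabilities):
        states = list(init_probability)
        # V[i][s]: probability of the best path that ends in state s after observation i.
        V = [{s: init_probability[s] * emission_probabilities[s].get(obs[0], 1e-6)
              for s in states}]
        back = [{}]
        for i in range(1, len(obs)):
            V.append({})
            back.append({})
            for s in states:
                prob, prev = max(
                    (V[i - 1][p] * transition_probabilities[p][s]
                     * emission_probabilities[s].get(obs[i], 1e-6), p)
                    for p in states)
                V[i][s] = prob
                back[i][s] = prev
        # Trace back the most likely state sequence.
        last = max(V[-1], key=V[-1].get)
        max_prob = V[-1][last]
        best_path = [last]
        for i in range(len(obs) - 1, 0, -1):
            best_path.insert(0, back[i][best_path[0]])
        return best_path, max_prob
    
    # Toy decoding example with two hypothetical tags.
    init_p = {'noun': 0.6, 'verb': 0.4}
    trans_p = {'noun': {'noun': 0.3, 'verb': 0.7}, 'verb': {'noun': 0.8, 'verb': 0.2}}
    emit_p = {'noun': {'dogs': 0.5, 'cats': 0.4}, 'verb': {'bark': 0.6, 'meow': 0.3}}
    print(viterbi(['dogs', 'bark'], init_p, trans_p, emit_p))   # (['noun', 'verb'], 0.126)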

First, we will preprocess the textual data to remove special characters, numbers, and punctuation marks. Following this, we will tokenize the text into individual words and build a list of (word, tag) tuples. Finally, we will use the scikit-learn LabelEncoder to convert the labels into integer form, so that each label is uniquely represented for the subsequent processing steps.

    import string
    from sklearn.preprocessing import LabelEncoder
    from nltk import pos_tag
    from nltk.tokenize import word_tokenize
    
    def preprocess(text):
        # Strip punctuation, tokenize, and pair each token with a POS label.
        translator = str.maketrans('', '', string.punctuation)
        stripped_text = text.translate(translator)
        tokens = word_tokenize(stripped_text)
        return [(t, get_pos(t)) for t in tokens]
    
    def get_pos(word):
        # Placeholder labeling logic: tag each word in isolation with NLTK's
        # default tagger. Replace with custom rules or gold annotations as needed.
        return pos_tag([word])[0][1]
    
    data = preprocess("I love playing soccer.")
    labels = [label for _, label in data]
    
    # Convert string labels to integers for downstream modeling.
    encoder = LabelEncoder()
    encoded_labels = encoder.fit_transform(labels)

Next, we can set up the HMM model and estimate the initial, transition, and emission probabilities from corpus statistics, then apply the Viterbi algorithm to decode tag sequences for the held-out data. Given that POS tagging is a sequence labeling task, we cannot simply evaluate performance on a single document. Rather, we should measure overall accuracy, precision, recall, and F1 score for a reliable evaluation, comparing predicted labels against actual labels across the whole dataset.

    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, precision_recall_fscore_support
    from collections import Counter, defaultdict
    
    # Split the (token, tag) pairs and their encoded labels into train and test
    # portions. With a realistic corpus you would split by sentence, not by token.
    X_train, X_test, y_train, y_test = train_test_split(
        data, encoded_labels, test_size=0.2, random_state=42)
    
    states = sorted(set(y_train))
    
    # Estimate initial, transition, and emission probabilities from training counts.
    # Add-one smoothing keeps unseen transitions from having zero probability.
    init_counts = Counter(y_train)
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for i, ((token, _), label) in enumerate(zip(X_train, y_train)):
        emit_counts[label][token.lower()] += 1
        if i > 0:
            trans_counts[y_train[i - 1]][label] += 1
    
    init_probability = {s: (init_counts[s] + 1) / (len(y_train) + len(states))
                        for s in states}
    transition_probabilities = {
        s: {t: (trans_counts[s][t] + 1) / (sum(trans_counts[s].values()) + len(states))
            for t in states}
        for s in states}
    emission_probabilities = {
        s: {w: c / sum(emit_counts[s].values()) for w, c in emit_counts[s].items()}
        for s in states}
    
    # Decode the held-out tokens with the Viterbi algorithm (viterbi() here refers to
    # an implementation such as the minimal sketch shown earlier).
    obs = [token.lower() for token, _ in X_test]
    y_pred, max_prob = viterbi(obs, init_probability, transition_probabilities,
                               emission_probabilities)
    
    accuracy = accuracy_score(y_test, y_pred)
    precision, recall, f1, support = precision_recall_fscore_support(
        y_test, y_pred, average='weighted')
    print("Accuracy:", accuracy)
    print("Precision:", precision)
    print("Recall:", recall)
    print("F1 Score:", f1)

Finally, the performance of the HMM classifier can be inspected by comparing the actual POS tags against the predicted ones.

    actual_tags = [encoder.inverse_transform([t])[0] for t in y_test]
    predicted_tags = [encoder.inverse_transform([t])[0] for t in y_pred]
    
    for act_tag, pred_tag in zip(actual_tags, predicted_tags):
        print(act_tag + "\t-->\t" + pred_tag)
