How to create your own Question-Answering system easily with Python

The history of Machine Comprehension (MC) dates back to the birth of the first concepts in Artificial Intelligence (AI). In his seminal paper "Computing Machinery and Intelligence," Alan Turing proposed what is now known as the Turing test as a benchmark for artificial intelligence. Nearly 70 years later, Question Answering (QA), a sub-domain of MC, remains one of the most difficult tasks in AI.

Thanks to advances in Deep Learning research and the advent of Transfer Learning techniques, Natural Language Processing (NLP) has evolved rapidly since last year. As a result, powerful pre-trained NLP models such as OpenAI-GPT, ELMo, BERT, and XLNet have been made available by leading researchers in the field.

With such advances, several improved systems and applications for NLP tasks are expected to appear. One such system is the cdQA-suite, developed as part of a collaboration between Telecom ParisTech, an engineering school in France, and BNP Paribas Personal Finance, a European leader in personal finance.

Open-domain QA vs. closed-domain QA

When thinking about QA systems, it helps to distinguish two kinds: open-domain QA (ODQA) systems and closed-domain QA (CDQA) systems.

Open-domain systems deal with questions about nearly anything and can rely only on general ontologies and world knowledge. One example is DrQA, an ODQA system developed by Facebook Research that uses a large collection of Wikipedia articles as its sole source of knowledge. Because these documents cover many different topics and subjects, it is easy to see why the system is considered an ODQA.

On the other hand, closed-domain systems handle questions within a particular domain (for instance, medicine or automotive maintenance), and make use of domain-specific knowledge by employing models trained on domain-specific datasets. The cdQA-suite was designed to enable anyone interested in creating closed-domain QA systems to do so easily.

cdQA-suite

cdQA: an end-to-end system covering the complete process from question understanding to answer generation. (cdQA, github.com)

The cdQA-suite consists of three blocks:

  • cdQA: an easy-to-use Python package for building QA pipelines.
  • cdQA-annotator: a tool for annotating QA datasets for model evaluation and fine-tuning.
  • cdQA-ui: a user interface that can be embedded in a website and connected to the back-end system.

I will explain how each module works and how you can use it to build a custom QA system on your own data.

cdQA

The cdQA architecture is based on two main components: the Retriever and the Reader. The diagram below shows how the system works.

[Figure: Mechanism of the cdQA pipeline]

When a question is sent to the system, the Retriever selects the documents in the database that are most likely to contain the answer. It is based on the same retriever as DrQA, which builds TF-IDF features from uni-grams and bi-grams and computes the cosine similarity between the question and each document in the database.
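To make the retrieval step concrete, here is a minimal sketch in plain Python of TF-IDF over uni-grams and bi-grams followed by cosine similarity. It is not the code used by cdQA or DrQA: the function names, the smoothed IDF formula, and the toy documents are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def ngrams(text):
    """Return the uni-grams and bi-grams of a lowercased text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def fit_idf(documents):
    """Compute a smoothed inverse document frequency for every n-gram."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(ngrams(doc)))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0 for term, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Map a text to a sparse TF-IDF vector (n-grams unseen at fit time are dropped)."""
    counts = Counter(term for term in ngrams(text) if term in idf)
    return {term: tf * idf[term] for term, tf in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query, documents, top_n=2):
    """Return the indices of the top_n documents most similar to the query."""
    idf = fit_idf(documents)
    doc_vectors = [tfidf_vector(doc, idf) for doc in documents]
    query_vector = tfidf_vector(query, idf)
    scores = [cosine(query_vector, vec) for vec in doc_vectors]
    ranked = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_n]


# Toy documents (made up for this example)
documents = [
    "Brake pads wear out and should be inspected at every service.",
    "Engine oil lubricates the moving parts of the engine.",
    "Winter tyres improve grip on snow and ice.",
]
print(retrieve("How often should brake pads be inspected?", documents, top_n=1))
```

Only the first document shares any uni-gram or bi-gram with the question, so it is the one ranked first. A real retriever would fit the vocabulary once and reuse it for every query instead of recomputing it on each call.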

After selecting the most probable documents, the system splits each document into paragraphs and sends them, together with the question, to the Reader. The Reader is a pre-trained deep learning model: the PyTorch version of BERT made available by HuggingFace (https://huggingface.co/). The Reader then outputs the most probable answer it can find in each paragraph. After the Reader, a final layer compares these answers using an internal scoring function and outputs the most likely one according to the scores.
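The flow from paragraphs to a single answer can be sketched as follows. This is only an illustration of the "score every paragraph, keep the best" idea: toy_reader is a hypothetical stand-in that scores sentences by word overlap, which is not what BERT does, and answer_question is not a cdQA function.

```python
def toy_reader(question, paragraph):
    """Hypothetical stand-in for the Reader: returns (answer, score).

    It picks the sentence sharing the most words with the question.
    A real Reader would be a model such as BERT; this toy version only
    provides scores so that the aggregation step can be shown.
    """
    q_words = set(question.lower().replace("?", "").split())
    best_sentence, best_score = "", 0
    for sentence in paragraph.split(". "):
        overlap = len(q_words & set(sentence.lower().rstrip(".").split()))
        if overlap > best_score:
            best_sentence, best_score = sentence.strip(), overlap
    return best_sentence, best_score


def answer_question(question, documents, reader=toy_reader):
    """Run the reader on every paragraph and keep the best-scoring answer."""
    candidates = []
    for doc in documents:
        for paragraph in doc["paragraphs"]:
            answer, score = reader(question, paragraph)
            candidates.append((score, answer, doc["title"], paragraph))
    # Final layer: compare all candidate answers by score and return the best one.
    score, answer, title, paragraph = max(candidates, key=lambda c: c[0])
    return answer, title, paragraph


# Toy documents (made up for this example), already split into paragraphs
documents = [
    {"title": "Brakes", "paragraphs": ["Brake pads wear out over time. Inspect the brake pads at every service."]},
    {"title": "Tyres", "paragraphs": ["Winter tyres improve grip on snow."]},
]
answer, title, paragraph = answer_question("When should I inspect the brake pads?", documents)
print(answer)
print(title)
```

The return value is ordered as (answer, title, paragraph) to mirror the three elements of the prediction printed in the usage example later in this article.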

Using the cdQA Python package

First, install the package, either with pip or from source:

    # Installing cdQA package with pip
    pip install cdqa
    # From source
    git clone https://github.com/cdqa-suite/cdQA.git &&
    cd cdQA &&
    pip install .

Open a Jupyter notebook and follow the steps below to see how cdQA operates.

    import pandas as pd
    from ast import literal_eval
    
    from cdqa.utils.filters import filter_paragraphs
    from cdqa.utils.download import download_model, download_bnpp_data
    from cdqa.pipeline.cdqa_sklearn import QAPipeline
    
    # Download data and models
    download_bnpp_data(dir='./data/bnpp_newsroom_v1.1/')
    download_model(model='bert-squad_1.1', dir='./models')
    
    # Loading data and filtering / preprocessing the documents
    df = pd.read_csv('data/bnpp_newsroom_v1.1/bnpp_newsroom-v1.1.csv', converters={'paragraphs': literal_eval})
    df = filter_paragraphs(df)
    
    # Loading QAPipeline with CPU version of BERT Reader pretrained on SQuAD 1.1
    cdqa_pipeline = QAPipeline(reader='models/bert_qa_vCPU-sklearn.joblib')
    
    # Fitting the retriever to the list of documents in the dataframe
    cdqa_pipeline.fit_retriever(X=df)
    
    # Sending a question to the pipeline and getting prediction
    query = 'Since when does the Excellence Program of BNP Paribas exist?'
    prediction = cdqa_pipeline.predict(X=query)
    
    print('query: {}\n'.format(query))
    print('answer: {}\n'.format(prediction[0]))
    print('title: {}\n'.format(prediction[1]))
    print('paragraph: {}\n'.format(prediction[2]))

You should have something like the following as output:

[Image: sample output of the prediction]

Note that the system outputs not only an answer, but also the paragraph where the answer was found and the title of the document or article.

In the snippet above, the preprocessing and filtering steps were needed to transform the BNP Paribas dataframe into the following structure:

title                paragraphs
The Article Title    [Paragraph 1 of Article, …, Paragraph N of Article]

If you use your own dataset, please make sure that your dataframe has this structure.
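As an illustration of that structure, the sketch below builds one row per article (a title plus a list of paragraphs), stores the rows as CSV, and reads them back. It uses only the standard library, and the articles are made up. The paragraph list is written as a Python literal so that it can be parsed again with literal_eval, which is the same converter passed to pd.read_csv in the snippet above.

```python
import csv
import io
from ast import literal_eval

# Hypothetical raw articles: each one is a title plus its full text.
raw_articles = {
    "Brake maintenance": "Brake pads wear out over time.\n\nInspect them at every service.",
    "Winter tyres": "Winter tyres improve grip on snow.",
}

# One row per article; 'paragraphs' holds the list of its paragraphs.
rows = [
    {"title": title, "paragraphs": [p.strip() for p in text.split("\n\n") if p.strip()]}
    for title, text in raw_articles.items()
]

# Write the rows as CSV, storing each paragraph list as its Python literal.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "paragraphs"])
writer.writeheader()
for row in rows:
    writer.writerow({"title": row["title"], "paragraphs": repr(row["paragraphs"])})

# Read the CSV back, turning the 'paragraphs' string into a list again.
buffer.seek(0)
loaded = [
    {"title": record["title"], "paragraphs": literal_eval(record["paragraphs"])}
    for record in csv.DictReader(buffer)
]
print(loaded[0]["paragraphs"])
```

Splitting on blank lines is just one possible choice here; use whatever paragraph boundary suits your documents.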

The CPU version of the model takes about 10 to 20 seconds per prediction. This moderate execution time is mainly due to the BERT Reader, a very large deep learning model with approximately 110 million parameters. If you have a GPU, you can directly use the GPU version of the model, models/bert_qa_vGPU-sklearn.joblib. The pre-trained models are also available on the releases page of cdQA's GitHub repository: https://github.com/cdqa-suite/cdQA/releases.

Training / Fine-tuning the reader

You can also improve the performance of the pre-trained Reader, which was originally trained on the SQuAD 1.1 dataset. If you have an annotated dataset (which can be generated with the help of cdQA-annotator) in the same format as the SQuAD dataset, you can fine-tune the Reader on it:

    # Put the path to your json file in SQuAD format here
    path_to_data = './data/SQuAD_1.1/train-v1.1.json'
    cdqa_pipeline.fit_reader(path_to_data)

Note that fine-tuning should be done on a GPU: the BERT model is too large to be trained on a CPU in a practical amount of time.

You can also find other ways to run the same steps in the official examples: https://github.com/cdqa-suite/cdQA/tree/master/examples
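For readers who have not seen it before, the SQuAD 1.1 layout mentioned in this section looks roughly like the structure below. This is a hand-written sketch with made-up content and a hypothetical question id. The helper check_answer_offsets is not part of cdQA; it simply verifies that each answer_start offset really points at the answer text, which is a common source of errors in annotated data.

```python
import json

context = "Brake pads wear out over time. Inspect them at every service."
answer_text = "at every service"

squad_like = {
    "version": "v1.1",
    "data": [
        {
            "title": "Brake maintenance",
            "paragraphs": [
                {
                    "context": context,
                    "qas": [
                        {
                            "id": "example-0001",  # hypothetical identifier
                            "question": "When should brake pads be inspected?",
                            "answers": [
                                {
                                    "text": answer_text,
                                    # Character offset of the answer inside the context
                                    "answer_start": context.index(answer_text),
                                }
                            ],
                        }
                    ],
                }
            ],
        }
    ],
}


def check_answer_offsets(dataset):
    """Return the ids of questions whose answer text does not match its offset."""
    bad_ids = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    end = start + len(answer["text"])
                    if paragraph["context"][start:end] != answer["text"]:
                        bad_ids.append(qa["id"])
    return bad_ids


print(check_answer_offsets(squad_like))

# The structure is plain JSON, so it can be serialized as-is.
serialized = json.dumps(squad_like, indent=2)
```

Running a check like this before fine-tuning is cheap compared with discovering a misaligned annotation after a long training run.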

cdQA-annotator

To make data annotation easier, the team has built a web-based application, cdQA-annotator. To use it, first convert your dataset into a JSON file in SQuAD-like format:

    from cdqa.utils.converters import df2squad
    # Converting dataframe to SQuAD format
    json_data = df2squad(df=df, squad_version='v1.1', output_dir='.', filename='dataset-name.json')

Now you can install the annotator and run it:

    # Clone the repo
    git clone https://github.com/cdqa-suite/cdQA-annotator
    # Install dependencies
    cd cdQA-annotator
    npm install
    # Start development server
    cd src
    vue serve

Now go to http://localhost:8080/. After loading your JSON file, you will see something like this:

[Image: cdQA-annotator after loading a JSON file]

To start annotating question-answer pairs, you only need to write a question and highlight the answer with the mouse cursor (the answer is then filled in automatically). Finally, click the Add annotation button.

[Image: adding an annotation in cdQA-annotator]

Once the annotation is done, you can use the annotated data to fine-tune the BERT Reader on your own dataset, as explained in the previous section.

cdQA-ui

The team has also built a web interface for cdQA. In this section, I will describe how to use the UI connected to the cdQA back-end.

First, start the cdQA REST API by running the following in your terminal (make sure you run it from the cdQA folder):

    # No spaces around "=": the shell rejects "export name = value"
    export dataset_path='path-to-dataset.csv'
    export reader_path='path-to-reader-model'
    
    FLASK_APP=api.py flask run -h 0.0.0.0

Second, install the cdQA-ui package:

    git clone https://github.com/cdqa-suite/cdQA-ui &&
    cd cdQA-ui &&
    npm install

Then, start the development server:

    npm run serve

Open the web application at http://localhost:8080/. You will see something like the following:

[Image: the cdQA-ui web interface]

Because the application is connected to the back-end through the REST API, you can ask a question and it will display the answer, along with the paragraph in which the answer was found and the title of the corresponding article.

[Image: cdQA-ui displaying an answer with its paragraph and article title]

Inserting the interface in a website

If you want to integrate the interface into your website, import the following in your Vue application:

    import Vue from 'vue'
    import CdqaUI from 'cdqa-ui'
    import BootstrapVue from 'bootstrap-vue'
    
    // Bootstrap and BootstrapVue stylesheets
    import 'bootstrap/dist/css/bootstrap.css'
    import 'bootstrap-vue/dist/bootstrap-vue.css'
    
    // Register both plugins (Vue is imported only once)
    Vue.use(CdqaUI)
    Vue.use(BootstrapVue)

Then you insert the cdQA interface component:

    <CdqaUI api_endpoint_cpu="http://localhost:5000/api" :queries_examples="['What is Artificial Intelligence?', 'What is Blockchain?']">
    </CdqaUI>

Demo

You can also explore a demo of the application on its official website: https://cdqa-suite.github.io/cdQA-website/#demo

Conclusion

In this article, I presented cdQA-suite, a software suite for deploying an end-to-end closed-domain QA system.

If you want to learn more about the project, feel free to check out our official GitHub repository at [link]. Do not hesitate to star and follow the repositories if you find the project useful for your work or personal applications.

We recently released version 1.0.2 of the cdQA package, which performs well and shows promising results. However, there is still room for improvement. If you would like to contribute to the project, feel free to pick one of the open issues: https://github.com/cdqa-suite/cdQA/issues. You are welcome to open a Pull Request once you have made your changes.

Sources:

cdQA-suite repositories on GitHub: https://github.com/cdqa-suite
Official BERT implementation from Google Research: https://github.com/google-research/bert
PyTorch version of BERT by HuggingFace: https://github.com/huggingface/pytorch-pretrained-BERT
SQuAD dataset: https://rajpurkar.github.io/SQuAD-explorer/
DrQA by Facebook Research: https://github.com/facebookresearch/DrQA/
DeepPavlov, an open-source library with an open-domain QA system: medium.com/deeppavlov/open-domain-question-answering-with-deeppavlov-c665d2ee4d65
OpenAI's post on its improved GPT model: openai.com/blog/better-language-models/
ELMo by AllenNLP: allennlp.org/elmo
XLNet paper: arxiv.org/abs/1906.08237
