How to create your own Question-Answering system easily with Python

The history of Machine Comprehension (MC) dates back to the birth of the first concepts in Artificial Intelligence (AI). In his seminal paper "Computing Machinery and Intelligence," Alan Turing proposed what is now known as the Turing test as a benchmark for artificial intelligence. Nearly 70 years later, Question Answering (QA), a sub-domain of MC, remains one of the most difficult tasks in AI.

Thanks to advances in Deep Learning research and the emergence of Transfer Learning techniques, Natural Language Processing (NLP) has evolved rapidly since last year. Powerful pre-trained NLP models such as OpenAI-GPT, ELMo, BERT and XLNet have been released by leading research groups in the field.

With such advances, more capable systems and applications for NLP tasks are expected to follow. One such system is the cdQA-suite, developed in a collaboration between Telecom ParisTech, a French engineering school, and BNP Paribas Personal Finance, a European leader in personal finance.

Open-domain QA vs. closed-domain QA

When thinking about QA systems, it helps to distinguish two kinds: open-domain QA (ODQA) systems and closed-domain QA (CDQA) systems.

Open-domain systems deal with questions about nearly anything and can rely only on general ontologies and world knowledge. One example is DrQA, an ODQA system developed by Facebook Research that uses a large collection of Wikipedia articles as its only source of knowledge. Because these documents cover many different topics and subjects, it is clear why the system is considered an ODQA.

On the other hand, closed-domain systems handle questions within a particular domain (for instance, medicine or automotive maintenance), and make use of domain-specific knowledge by employing models trained on domain-specific datasets. The cdQA-suite was designed to enable anyone interested in creating closed-domain QA systems to do so easily.

cdQA-suite

cdQA: an end-to-end closed-domain question answering system. - cdQA (github.com)

The cdQA-suite consists of three blocks:

cdQA: an easy-to-use Python package for building QA pipelines.
cdQA-annotator: a tool for annotating QA datasets for model evaluation and fine-tuning.
cdQA-ui: a user interface that can be plugged into any website and connected to the back-end system.

I will explain how each module works and how you can use it to build a QA system on your own data.

cdQA

The cdQA architecture is based on two main components: the Retriever and the Reader. The schema below shows how the system works.

Figure: Mechanism of the cdQA pipeline

When a question is sent to the system, the Retriever selects the documents in the database that are most likely to contain the answer. It is based on the same retriever as DrQA, which builds TF-IDF features from uni-grams and bi-grams and computes the cosine similarity between the question and each document in the database.
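The ranking step described above can be sketched in plain Python. This is an illustration under stated assumptions, not the DrQA or cdQA implementation: the whitespace tokenizer, the smoothed IDF formula, the function names and the three sample documents are all hypothetical.

```python
import math
from collections import Counter


def ngrams(text):
    """Lowercased uni-grams and bi-grams of a whitespace-tokenized text."""
    tokens = text.lower().split()
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def tfidf_vectors(docs):
    """Return the IDF table and one sparse TF-IDF vector (a dict) per document."""
    counts = [Counter(ngrams(doc)) for doc in docs]
    doc_freq = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    # Smoothed IDF (an assumption made for this sketch).
    idf = {t: math.log((1 + n_docs) / (1 + n)) + 1.0 for t, n in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in c.items()} for c in counts]
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, top_n=2):
    """Rank documents by cosine similarity to the query and keep the top_n."""
    idf, vectors = tfidf_vectors(docs)
    # Terms unseen in the documents carry no weight in the query vector.
    q_counts = Counter(t for t in ngrams(query) if t in idf)
    q_vec = {t: tf * idf[t] for t, tf in q_counts.items()}
    scores = [cosine(q_vec, v) for v in vectors]
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_n]]


# Hypothetical mini-corpus, for illustration only.
docs = [
    "The excellence program supports students in finance",
    "Electric cars need regular battery maintenance",
    "The bank opened a new branch in Paris",
]
ranked = retrieve("When did the excellence program start", docs)
```

Here `ranked` holds `(document index, similarity)` pairs, best first. Only these top documents would be passed on to the next stage of the pipeline.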

After selecting the most probable documents, the system divides each document into paragraphs and sends them, together with the question, to the Reader. The Reader is a pre-trained deep learning model: the PyTorch version of BERT made available by HuggingFace (https://huggingface.co/). The Reader then outputs the most probable answer it can find in each paragraph. After the Reader, a final layer compares these answers using an internal score function and outputs the most likely one according to the scores.
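The final comparison step can be pictured with a small sketch. The score function in cdQA is internal to the package, so the candidate answers, the `score` field and the `pick_best` helper below are hypothetical stand-ins that only show the idea of keeping the highest-scoring answer.

```python
def pick_best(candidates):
    """Return the candidate answer with the highest score."""
    return max(candidates, key=lambda c: c["score"])


# Hypothetical Reader outputs: one candidate answer per paragraph.
candidates = [
    {"answer": "every two years", "title": "Brake maintenance", "score": 4.1},
    {"answer": "every 10,000 km", "title": "Brake maintenance", "score": 7.8},
    {"answer": "once a month", "title": "Tyre care", "score": 2.3},
]

best = pick_best(candidates)
print("answer: {} (from '{}')".format(best["answer"], best["title"]))
```

Keeping the title (and, in the real pipeline, the paragraph) next to each candidate is what lets the system report where the winning answer was found.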

Using the cdQA Python package

    # Installing cdQA package with pip
    pip install cdqa
    # From source
    git clone https://github.com/cdqa-suite/cdQA.git &&
    cd cdQA &&
    pip install .

Open a Jupyter notebook and follow the steps below to see how cdQA operates.

    import pandas as pd
    from ast import literal_eval
    
    from cdqa.utils.filters import filter_paragraphs
    from cdqa.utils.download import download_model, download_bnpp_data
    from cdqa.pipeline.cdqa_sklearn import QAPipeline
    
    # Download data and models
    download_bnpp_data(dir='./data/bnpp_newsroom_v1.1/')
    download_model(model='bert-squad_1.1', dir='./models')
    
    # Loading data and filtering / preprocessing the documents
    df = pd.read_csv('data/bnpp_newsroom_v1.1/bnpp_newsroom-v1.1.csv', converters={'paragraphs': literal_eval})
    df = filter_paragraphs(df)
    
    # Loading QAPipeline with CPU version of BERT Reader pretrained on SQuAD 1.1
    cdqa_pipeline = QAPipeline(reader='models/bert_qa_vCPU-sklearn.joblib')
    
    # Fitting the retriever to the list of documents in the dataframe
    cdqa_pipeline.fit_retriever(X=df)
    
    # Sending a question to the pipeline and getting prediction
    query = 'Since when does the Excellence Program of BNP Paribas exist?'
    prediction = cdqa_pipeline.predict(X=query)
    
    print('query: {}\n'.format(query))
    print('answer: {}\n'.format(prediction[0]))
    print('title: {}\n'.format(prediction[1]))
    print('paragraph: {}\n'.format(prediction[2]))

The output shows the answer to the query. Note that the system returns not only an answer but also the paragraph where it was found and the title of the document or article.

In the code snippet above, the preprocessing and filtering steps were needed to transform the BNP Paribas dataframe into the following structure:

title	paragraphs
The Article Title	[Paragraph 1 of Article,…,Paragraph N of Article]

If you use your own dataset, please make sure that your dataframe has this structure.
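As a minimal sketch of preparing your own data in that shape, the helper below turns raw article texts into rows with a title and a list of paragraphs. The `to_row` name, the blank-line splitting rule and the sample articles are assumptions made for illustration; they are not part of cdQA.

```python
def to_row(title, text):
    """Build one row: the article title plus its list of paragraphs."""
    # Assumption: paragraphs are separated by blank lines in the raw text.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return {"title": title, "paragraphs": paragraphs}


# Hypothetical articles, for illustration only.
articles = {
    "Brake maintenance": "Check the brake pads regularly.\n\nReplace worn pads promptly.",
    "Tyre care": "Keep tyres at the recommended pressure.",
}

rows = [to_row(title, text) for title, text in articles.items()]
```

Passing `rows` to `pd.DataFrame(rows)` then gives a dataframe with one `title` column and one `paragraphs` column, matching the structure shown above.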

The CPU version of the model takes between 10 and 20 seconds per prediction. This moderate execution time is due to the BERT Reader, a very large deep learning model with roughly 110 million parameters. If you have a GPU, you can directly use the GPU version of the model, models/bert_qa_vGPU-sklearn.joblib. The pre-trained models are also available on the releases page of the cdQA GitHub repository: https://github.com/cdqa-suite/cdQA/releases.
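One way to choose between the two model files automatically is sketched below. The `pick_reader_path` helper is hypothetical, and the check assumes PyTorch is installed; if it is not, the sketch simply falls back to the CPU model.

```python
def pick_reader_path(has_gpu):
    """Return the GPU model file when a GPU is available, else the CPU one."""
    if has_gpu:
        return "models/bert_qa_vGPU-sklearn.joblib"
    return "models/bert_qa_vCPU-sklearn.joblib"


try:
    import torch
    has_gpu = torch.cuda.is_available()
except ImportError:
    # Assumption for this sketch: without PyTorch, treat the machine as CPU-only.
    has_gpu = False

reader_path = pick_reader_path(has_gpu)
print("Using reader:", reader_path)
```

The resulting `reader_path` can then be passed as the `reader` argument of `QAPipeline`, as in the snippet above.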

Training / Fine-tuning the reader

You can also improve the performance of the pre-trained Reader, which was trained on the SQuAD 1.1 dataset. If you have an annotated dataset in the same format as SQuAD (one can be generated with the cdQA-annotator), you can fine-tune the Reader on it:

    # Put the path to your json file in SQuAD format here
    path_to_data = './data/SQuAD_1.1/train-v1.1.json'
    cdqa_pipeline.fit_reader(path_to_data)

Note that fine-tuning should be done on a GPU: BERT is too computationally expensive to train on a CPU in practice.
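For readers who have not seen the SQuAD layout, here is a minimal sketch of a one-question dataset in SQuAD v1.1 style, together with a small sanity check. The sample text, the question id and the `misaligned_answers` helper are hypothetical; only the nesting of `data`, `paragraphs`, `qas` and `answers` follows the SQuAD files.

```python
import json

# Hypothetical context and answer, for illustration only.
context = "The cdQA-suite has three blocks: cdQA, cdQA-annotator and cdQA-ui."
answer_text = "three"

squad_like = {
    "version": "1.1",
    "data": [
        {
            "title": "cdQA-suite overview",
            "paragraphs": [
                {
                    "context": context,
                    "qas": [
                        {
                            "id": "example-0001",
                            "question": "How many blocks does the cdQA-suite have?",
                            "answers": [
                                {
                                    "text": answer_text,
                                    # Character offset of the answer in the context
                                    "answer_start": context.index(answer_text),
                                }
                            ],
                        }
                    ],
                }
            ],
        }
    ],
}


def misaligned_answers(dataset):
    """Return ids of questions whose answer text is not found at answer_start."""
    bad = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            ctx = paragraph["context"]
            for qa in paragraph["qas"]:
                for ans in qa["answers"]:
                    start = ans["answer_start"]
                    if ctx[start:start + len(ans["text"])] != ans["text"]:
                        bad.append(qa["id"])
    return bad


problems = misaligned_answers(squad_like)
serialized = json.dumps(squad_like, indent=2)
```

Running such a check before fine-tuning is a cheap way to catch annotations whose offsets no longer match the text.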

You can also find other ways to run the same steps in the official examples: https://github.com/cdqa-suite/cdQA/tree/master/examples

cdQA-annotator

To make data annotation easier, the team has built a web-based application, the cdQA-annotator. To use it, first convert your dataset to a JSON file in SQuAD-like format:

    from cdqa.utils.converters import df2squad
    # Converting dataframe to SQuAD format
    json_data = df2squad(df=df, squad_version='v1.1', output_dir='.', filename='dataset-name.json')

Now you can install the annotator and run it:

    # Clone the repo
    git clone https://github.com/cdqa-suite/cdQA-annotator
    # Install dependencies
    cd cdQA-annotator
    npm install
    # Start development server
    cd src
    vue serve

Now open http://localhost:8080/ and load your JSON file to reach the annotation interface.

To start annotating question-answer pairs, you only need to write a question and highlight the answer with the mouse cursor (the answer is filled in automatically). Then click the Add annotation button.

After annotating, you can use the annotated dataset to fine-tune the BERT Reader on your own data, as explained in the previous section.

cdQA-ui

The team has also built a web interface on top of cdQA. In this section, I explain how to use the UI connected to the cdQA back end.

First, deploy the cdQA REST API by running the following in your terminal (make sure you run it inside the cdQA folder):

    # No spaces around "=" in shell assignments
    export dataset_path='path-to-dataset.csv'
    export reader_path='path-to-reader-model'
    
    FLASK_APP=api.py flask run -h 0.0.0.0

Second, install the cdQA-ui package:

    git clone https://github.com/cdqa-suite/cdQA-ui &&
    cd cdQA-ui &&
    npm install

Then start the development server:

    npm run serve

You can now access the web application at http://localhost:8080/.

Because the application is connected to the back end through the REST API, you can ask a question and get the answer displayed quickly, along with the passage where the answer was found and the title of the article.
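The same back end can also be queried without the UI. The sketch below uses only the Python standard library and the endpoint `http://localhost:5000/api` that appears in the component snippet of this article. The `query` parameter name and the JSON response are assumptions about the API, so check `api.py` in the cdQA repository before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ENDPOINT = "http://localhost:5000/api"


def build_url(question, endpoint=API_ENDPOINT):
    """Build the request URL for a question."""
    # ASSUMPTION: the API reads the question from a `query` URL parameter.
    return endpoint + "?" + urlencode({"query": question})


def ask(question, endpoint=API_ENDPOINT, timeout=60):
    """Send the question to a running API and decode the reply."""
    # ASSUMPTION: the API answers with a JSON body.
    with urlopen(build_url(question, endpoint), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


url = build_url("What is Blockchain?")
print(url)
```

Calling `ask("What is Blockchain?")` only works while the Flask server started above is running; the generous timeout reflects the 10 to 20 seconds a CPU prediction can take.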

Inserting the interface in a website

If you wish to integrate the interface onto your website, simply import the necessary components into your Vue application.

    // Register the cdQA interface component
    import Vue from 'vue'
    import CdqaUI from 'cdqa-ui'
    
    Vue.use(CdqaUI)
    
    // Register BootstrapVue and load the Bootstrap styles
    import BootstrapVue from "bootstrap-vue"
    
    Vue.use(BootstrapVue)
    
    import "bootstrap/dist/css/bootstrap.css"
    import "bootstrap-vue/dist/bootstrap-vue.css"

Then you insert the cdQA interface component:

    <CdqaUI api_endpoint_cpu="http://localhost:5000/api" :queries_examples="['What is Artificial Intelligence?', 'What is Blockchain?']">
    </CdqaUI>

Demo

You can also explore a demo of the application on its official website: https://cdqa-suite.github.io/cdQA-website/#demo

Conclusion

In this article, I presented the cdQA-suite, a software suite for deploying an end-to-end closed-domain question answering system.

If you want to learn more about the project, feel free to check out our official repository on GitHub (listed under Sources). Do not hesitate to star and follow the repositories if you find the project useful for your work or personal applications.

We recently released version 1.0.2 of the cdQA package, which performs well and shows great promise. However, there is still room for improvement. If you would like to contribute to the project, feel free to pick one of our current issues: https://github.com/cdqa-suite/cdQA/issues. Please open a Pull Request once you have made your changes.

Sources:

cdQA-suite repositories on GitHub: https://github.com/cdqa-suite
Official BERT version from Google Research: https://github.com/google-research/bert
PyTorch version of BERT by HuggingFace: https://github.com/huggingface/pytorch-pretrained-BERT
SQuAD dataset: https://rajpurkar.github.io/SQuAD-explorer/
DrQA by Facebook Research: https://github.com/facebookresearch/DrQA/
DeepPavlov, an open-source library with an open-domain QA system: medium.com/deeppavlov/open-domain-question-answering-with-deeppavlov-c665d2ee4d65
OpenAI GPT: openai.com/blog/better-language-models/
ELMo by AllenNLP: allennlp.org/elmo
XLNet: arxiv.org/abs/1906.08237
