
How to Chat with My Data in Python Using Microsoft Azure OpenAI and Azure Cognitive Search


Background:

I have written code that extracts text from a PDF document and encodes it into vectors using the text-embedding-ada-002 model from Microsoft's Azure OpenAI service. The vectors are stored in an Azure Cognitive Search index, where they can be retrieved. Now I want to use Azure OpenAI to chat with this data and get an AI-generated answer. My existing code works, but I am unsure how to build, in Python, an interaction through Azure OpenAI that uses my custom data stored in Azure Cognitive Search.


This is my code:

    import os

    import fitz  # PyMuPDF
    import openai
    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents import SearchClient
    from azure.search.documents.models import VectorizedQuery
    from google.colab import drive
    from langchain.chat_models import AzureChatOpenAI
    from langchain.docstore.document import Document
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.vectorstores.azuresearch import AzureSearch

    OPENAI_API_BASE = "https://xxxxx.openai.azure.com"
    OPENAI_API_KEY = "xxxxxx"
    OPENAI_API_VERSION = "2023-05-15"

    openai.api_type = "azure"
    openai.api_key = OPENAI_API_KEY
    openai.api_base = OPENAI_API_BASE
    openai.api_version = OPENAI_API_VERSION

    AZURE_COGNITIVE_SEARCH_SERVICE_ENDPOINT = "https://xxxxxx.search.windows.net"
    AZURE_COGNITIVE_SEARCH_API_KEY = "xxxxxxx"
    AZURE_COGNITIVE_SEARCH_INDEX_NAME = "test"
    AZURE_COGNITIVE_SEARCH_CREDENTIAL = AzureKeyCredential(AZURE_COGNITIVE_SEARCH_API_KEY)

    llm = AzureChatOpenAI(deployment_name="gpt35", openai_api_key=OPENAI_API_KEY, openai_api_base=OPENAI_API_BASE, openai_api_version=OPENAI_API_VERSION)

    embeddings = OpenAIEmbeddings(deployment_id="ada002", chunk_size=1, openai_api_key=OPENAI_API_KEY, openai_api_base=OPENAI_API_BASE, openai_api_version=OPENAI_API_VERSION)

    acs = AzureSearch(
        azure_search_endpoint=AZURE_COGNITIVE_SEARCH_SERVICE_ENDPOINT,
        azure_search_key=AZURE_COGNITIVE_SEARCH_API_KEY,
        index_name=AZURE_COGNITIVE_SEARCH_INDEX_NAME,
        embedding_function=embeddings.embed_query,
    )


    def generate_embeddings(s):
        # Important: "engine" must be the name of my deployment!
        response = openai.Embedding.create(
            input=s,
            engine="ada002",
        )
        return response["data"][0]["embedding"]


    def generate_tokens(s, f):
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
        splits = text_splitter.split_text(s)

        documents = []
        for i, split in enumerate(splits):
            metadata = {"index": i, "file_source": f}
            documents.append(Document(page_content=split, metadata=metadata))

        return documents


    drive.mount('/content/drive')
    folder = "/content/drive/docs/pdf/"

    for filename in os.listdir(folder):
        file_path = os.path.join(folder, filename)
        if not os.path.isfile(file_path):
            continue
        print(f"Processing file: {file_path}")

        doc = fitz.open(file_path)
        doc_content = ""  # reset per file (the original accumulated text across files)
        for page in doc:  # iterate over the document pages
            doc_content += page.get_text()  # plain text encoded as UTF-8

        d = generate_tokens(doc_content, file_path)
        print(d)

        acs.add_documents(documents=d)
        print("Done.")


    query = "What are the advantages of an open-source ai model?"
    search_client = SearchClient(AZURE_COGNITIVE_SEARCH_SERVICE_ENDPOINT, AZURE_COGNITIVE_SEARCH_INDEX_NAME, credential=AZURE_COGNITIVE_SEARCH_CREDENTIAL)

    # vector_query was missing in the original snippet; VectorizedQuery
    # (azure-search-documents >= 11.4) builds one from the embedded query text.
    vector_query = VectorizedQuery(
        vector=generate_embeddings(query),
        k_nearest_neighbors=3,
        fields="content_vector",
    )

    results = search_client.search(
        search_text=None,
        vector_queries=[vector_query],
        select=["content_vector", "metadata"],
    )

    print(results)

    for result in results:
        print(result)

My Azure Cognitive Search index uses two fields to organize the data: content_vector, which holds the vector representation, and content, which stores the raw text. From reading a fair amount of Microsoft's sample code on GitHub, I know this scenario has been implemented, but I am still unclear about how to implement this particular part.
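For reference, an index with those two fields can be defined through the Cognitive Search REST API. This is a minimal sketch, not the asker's actual schema: it assumes the 2023-07-01-Preview API version (later GA versions rename `vectorSearchConfiguration` to `vectorSearchProfile`), 1536-dimensional ada-002 embeddings, and placeholder names such as `my-hnsw`:

```python
import json

# Hypothetical minimal index definition; "my-hnsw" and the field list
# are illustrative placeholders, not the schema from the question.
index_definition = {
    "name": "test",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True, "filterable": True},
        {"name": "content", "type": "Edm.String", "searchable": True},
        {
            "name": "content_vector",
            "type": "Collection(Edm.Single)",
            "searchable": True,
            "dimensions": 1536,  # output size of text-embedding-ada-002
            "vectorSearchConfiguration": "my-hnsw",
        },
        {"name": "metadata", "type": "Edm.String", "filterable": True},
    ],
    "vectorSearch": {
        "algorithmConfigurations": [{"name": "my-hnsw", "kind": "hnsw"}]
    },
}

# The payload would be sent as:
#   PUT {endpoint}/indexes/test?api-version=2023-07-01-Preview
print(json.dumps(index_definition, indent=2))
```

The key detail is the pairing: `content` is a plain searchable string, while `content_vector` is a float collection whose `dimensions` must match the embedding model's output size.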


I am looking for some guidance/suggestions on how to extend this code to interact with the content in Azure Cognitive Search via Azure OpenAI Chat.


Solution:

What your code does so far is perform a similarity search in Azure Cognitive Search and retrieve the data relevant to your query.

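Under the hood, that similarity search compares the query vector against each stored content_vector. As an illustration of the metric involved, here is cosine similarity in plain Python (a sketch with made-up three-dimensional vectors; the real service uses 1536-dimensional embeddings and an HNSW index rather than brute force):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy example: a query vector and two "document" vectors.
query_vec = [0.1, 0.9, 0.2]
doc_vecs = {
    "doc-about-open-source": [0.12, 0.88, 0.25],
    "doc-about-cooking": [0.9, 0.05, 0.1],
}

# Rank documents by similarity to the query vector, most similar first.
ranked = sorted(doc_vecs, key=lambda k: cosine_similarity(query_vec, doc_vecs[k]), reverse=True)
print(ranked[0])  # the open-source document is the nearest neighbour
```

Documents whose embeddings point in nearly the same direction as the query embedding score close to 1.0, which is why semantically related text is retrieved even without keyword overlap.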

The next step is to pass both the query and this retrieved data to an LLM so that it can generate an answer grounded in that data. The method is to build a prompt that embeds the retrieved context, then feed that prompt to the LLM to generate the response.


Here's some code to do the same:


    from langchain.prompts import PromptTemplate
    from langchain.schema import HumanMessage

    # The "content" field contains the text content of your data.
    # Make sure it is included in "select" so it is retrieved.
    results = search_client.search(
        search_text=None,
        vector_queries=[vector_query],
        select=["content", "content_vector", "metadata"],
    )

    # Search results are dict-like, so index by field name.
    context = ""
    for result in results:
        context += result["content"] + "\n\n"

    # set up prompt template
    template = """
    Use the following pieces of context to answer the question at the end. Question is enclosed in <question></question>.
    Do keep the following things in mind when answering the question:
    - If you don't know the answer, just say that you don't know, don't try to make up an answer.
    - Keep the answer as concise as possible.
    - Use only the context to answer the question. Context is enclosed in <context></context>
    - If the answer is not found in context, simply output "I'm sorry but I do not know the answer to your question.".

    <context>{context}</context>
    <question>{question}</question>
    """
    prompt_template = PromptTemplate.from_template(template)

    # initialize LLM
    llm = AzureChatOpenAI(deployment_name="gpt35", openai_api_key=OPENAI_API_KEY, openai_api_base=OPENAI_API_BASE, openai_api_version=OPENAI_API_VERSION, temperature=0)

    prompt = prompt_template.format(context=context, question=query)
    message = HumanMessage(content=prompt)
    result = llm([message])
    print(result.content)
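Stripped of the SDKs, the generation step above is mostly string assembly: join the retrieved chunks into a context block, substitute context and question into the template, and send the result as a single chat message. A minimal framework-free sketch of that assembly (the chunk texts here are made up for illustration):

```python
# Simplified version of the prompt template used above.
TEMPLATE = """Use only the following context, enclosed in <context></context>,
to answer the question enclosed in <question></question>. If the answer is not
in the context, say you do not know.

<context>{context}</context>
<question>{question}</question>"""

def build_prompt(chunks, question):
    """Join retrieved chunks and fill the RAG prompt template."""
    context = "\n\n".join(chunks)
    return TEMPLATE.format(context=context, question=question)

# Made-up retrieved chunks standing in for Cognitive Search results.
chunks = [
    "Open-source AI models can be audited and self-hosted.",
    "Licensing costs for open-source models are typically zero.",
]
prompt = build_prompt(chunks, "What are the advantages of an open-source ai model?")
print(prompt)
```

The string returned by `build_prompt` is what would go into the `HumanMessage` above; everything else in the pipeline exists to produce good `chunks`.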

This is the classic Retrieval-Augmented Generation (RAG) technique. I built a simple application that uses RAG to query the Azure documentation in natural language, and the code above comes from that application. You can read more about the application and see the source code here:

GitHub - gmantri/azure-docs-copilot
