How to Chat with my Data in Python using Microsoft Azure OpenAI and Azure Cognitive Search

Background:

I have written code that extracts text from a PDF document and converts it into vectors using the text-embeddings-ada-002 model from Azure OpenAI. These vectors are then stored in a Microsoft Azure Cognitive Search index and can be queried. My code so far works fine, but I don't know how to implement the interaction with my custom data in Azure Cognitive Search through Azure OpenAI in Python.

This is my code:

```python
import os

import fitz  # PyMuPDF
import openai
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from google.colab import drive
from langchain.chat_models import AzureChatOpenAI
from langchain.docstore.document import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores.azuresearch import AzureSearch

OPENAI_API_BASE = "https://xxxxx.openai.azure.com"
OPENAI_API_KEY = "xxxxxx"
OPENAI_API_VERSION = "2023-05-15"

openai.api_type = "azure"
openai.api_key = OPENAI_API_KEY
openai.api_base = OPENAI_API_BASE
openai.api_version = OPENAI_API_VERSION

AZURE_COGNITIVE_SEARCH_SERVICE_ENDPOINT = "https://xxxxxx.search.windows.net"
AZURE_COGNITIVE_SEARCH_API_KEY = "xxxxxxx"
AZURE_COGNITIVE_SEARCH_INDEX_NAME = "test"
AZURE_COGNITIVE_SEARCH_CREDENTIAL = AzureKeyCredential(AZURE_COGNITIVE_SEARCH_API_KEY)

llm = AzureChatOpenAI(
    deployment_name="gpt35",
    openai_api_key=OPENAI_API_KEY,
    openai_api_base=OPENAI_API_BASE,
    openai_api_version=OPENAI_API_VERSION,
)
embeddings = OpenAIEmbeddings(
    deployment_id="ada002",
    chunk_size=1,
    openai_api_key=OPENAI_API_KEY,
    openai_api_base=OPENAI_API_BASE,
    openai_api_version=OPENAI_API_VERSION,
)

acs = AzureSearch(
    azure_search_endpoint=AZURE_COGNITIVE_SEARCH_SERVICE_ENDPOINT,
    azure_search_key=AZURE_COGNITIVE_SEARCH_API_KEY,
    index_name=AZURE_COGNITIVE_SEARCH_INDEX_NAME,
    embedding_function=embeddings.embed_query,
)


def generate_embeddings(s):
    # Important: "engine" must be the name of my deployment!
    response = openai.Embedding.create(input=s, engine="ada002")
    return response["data"][0]["embedding"]


def generate_tokens(s, f):
    # Split the text into overlapping chunks and wrap each in a Document.
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    splits = text_splitter.split_text(s)

    documents = []
    for i, split in enumerate(splits):
        metadata = {"index": i, "file_source": f}
        documents.append(Document(page_content=split, metadata=metadata))

    return documents


drive.mount('/content/drive')
folder = "/content/drive/docs/pdf/"

for filename in os.listdir(folder):
    file_path = os.path.join(folder, filename)
    if not os.path.isfile(file_path):
        continue
    print(f"Processing file: {file_path}")

    page_content = ''  # reset per file so content is not indexed twice
    doc = fitz.open(file_path)
    for page in doc:  # iterate over the document pages
        page_content += page.get_text()  # plain text encoded as UTF-8

    d = generate_tokens(page_content, file_path)
    print(d)

    acs.add_documents(documents=d)
    print("Done.")


query = "What are the advantages of an open-source ai model?"
search_client = SearchClient(
    AZURE_COGNITIVE_SEARCH_SERVICE_ENDPOINT,
    AZURE_COGNITIVE_SEARCH_INDEX_NAME,
    credential=AZURE_COGNITIVE_SEARCH_CREDENTIAL,
)

# vector_query was undefined in the original snippet; with
# azure-search-documents 11.4+ it can be built from the embedded query:
vector_query = VectorizedQuery(
    vector=generate_embeddings(query),
    k_nearest_neighbors=3,
    fields="content_vector",
)

results = search_client.search(
    search_text=None,
    vector_queries=[vector_query],
    select=["content_vector", "metadata"],
)

print(results)

for result in results:
    print(result)
```

The fields in Azure Cognitive Search are content_vector for the vectors and content for the plain-text content. I have looked at a lot of GitHub code, including some published by Microsoft, so I know this is implemented, but I obviously have some trouble understanding how this particular piece is implemented.

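For reference, a vector field like content_vector is declared in the index definition alongside the plain-text content field. A hedged sketch of what the relevant part of the index JSON might look like (field names taken from the question; the key field, dimensions, and profile name are illustrative, and the exact vector attributes vary by API version):

```json
{
  "name": "test",
  "fields": [
    { "name": "id", "type": "Edm.String", "key": true },
    { "name": "content", "type": "Edm.String", "searchable": true },
    { "name": "metadata", "type": "Edm.String", "filterable": true },
    {
      "name": "content_vector",
      "type": "Collection(Edm.Single)",
      "searchable": true,
      "dimensions": 1536,
      "vectorSearchProfile": "my-vector-profile"
    }
  ]
}
```

The 1536 dimensions match the output size of text-embedding-ada-002.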
So I am looking for some help or a hint on how to extend this code to interact with the content in Azure Cognitive Search via Azure OpenAI chat.

Solution:

What your code has done so far is a similarity search in Azure Cognitive Search, which found the relevant data related to your question.

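Conceptually, that similarity search just compares the query vector against the stored chunk vectors by cosine similarity and ranks the chunks. A minimal, self-contained sketch, with made-up 3-dimensional "embeddings" standing in for the 1536-dimensional ada-002 vectors:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for stored chunk embeddings (real ones have 1536 dimensions).
index = {
    "chunk about open-source models": [0.9, 0.1, 0.0],
    "chunk about pricing":            [0.0, 0.8, 0.2],
    "chunk about deployment":         [0.1, 0.2, 0.9],
}

query_vector = [0.8, 0.2, 0.1]  # pretend embedding of the user's question

# Rank chunks by similarity to the query, highest first -- this is what
# the vector query against content_vector does on the service side.
ranked = sorted(index.items(),
                key=lambda kv: cosine_similarity(query_vector, kv[1]),
                reverse=True)
print(ranked[0][0])  # → chunk about open-source models
```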
The next step is to pass the query and this relevant data to an LLM so that it can create an answer to the query from that data. The way to do it is to create a prompt, populate it with this information, and send it to an LLM to answer the query.

Here's some code to do the same:

```python
from langchain.prompts import PromptTemplate
from langchain.schema import HumanMessage

# The "content" field contains the text content of your data;
# make sure that it is retrieved.
results = search_client.search(
    search_text=None,
    vector_queries=[vector_query],
    select=["content", "content_vector", "metadata"],
)

# Concatenate the retrieved chunks into a single context string.
context = ""
for result in results:
    context += result["content"] + "\n\n"


# Set up the prompt template.
template = """
Use the following pieces of context to answer the question at the end. Question is enclosed in <question></question>.
Do keep the following things in mind when answering the question:
- If you don't know the answer, just say that you don't know, don't try to make up an answer.
- Keep the answer as concise as possible.
- Use only the context to answer the question. Context is enclosed in <context></context>
- If the answer is not found in context, simply output "I'm sorry but I do not know the answer to your question.".

<context>{context}</context>
<question>{question}</question>
"""
prompt_template = PromptTemplate.from_template(template)

# Initialize the LLM.
llm = AzureChatOpenAI(
    deployment_name="gpt35",
    openai_api_key=OPENAI_API_KEY,
    openai_api_base=OPENAI_API_BASE,
    openai_api_version=OPENAI_API_VERSION,
    temperature=0,
)

prompt = prompt_template.format(context=context, question=query)
message = HumanMessage(content=prompt)
result = llm([message])
print(result.content)
```

This is a classic Retrieval-Augmented Generation (RAG) technique. I created a simple application using this approach to query the Azure documentation in natural language. The code above is based on the code I wrote for that application. You can read more about the application and see the source code here: GitHub - gmantri/azure-docs-copilot.
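Stripped of the Azure specifics, the whole RAG flow is three steps: retrieve, build a prompt, generate. A minimal skeleton of that flow, with `retrieve_chunks` and `ask_llm` as hypothetical stand-ins for the `search_client.search` call and the `AzureChatOpenAI` invocation above:

```python
def retrieve_chunks(question):
    # Placeholder for the vector search against Azure Cognitive Search;
    # a real implementation would embed the question and run a vector query.
    return ["Open-source models can be self-hosted.",
            "Open-source models allow code inspection."]

def build_prompt(context, question):
    # Same shape as the PromptTemplate above: tagged context and question.
    return (f"Use only this context to answer.\n"
            f"<context>{context}</context>\n"
            f"<question>{question}</question>")

def ask_llm(prompt):
    # Placeholder for llm([HumanMessage(content=prompt)]).
    return "stub answer based on: " + prompt[:40]

def rag_answer(question):
    chunks = retrieve_chunks(question)
    context = "\n\n".join(chunks)
    prompt = build_prompt(context, question)
    return ask_llm(prompt)

print(rag_answer("What are the advantages of an open-source ai model?"))
```

Each placeholder can be swapped for the real component independently, which is what makes the pattern easy to test.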
