Fetch Data from Azure Blob Storage to Azure AI Search - RAG

Question:

I am developing a RAG-based chatbot to search through my content. I have uploaded all the contents to blob storage and imported the blob storage into the Azure AI Search service. I attempted to use an index/indexer, but the index operates based on fields only, whereas I want to search through everything. Is there an alternative way to achieve this?

I also want any new data added to my blob container to be automatically included in Azure AI Search.

Answer:

"I attempted to use an index/indexer, but the index operates based on fields only, whereas I want to search through everything. Is there an alternative way to achieve this?"

Based on your screenshot, which shows a single index field called 'id', I think you have misunderstood how an index is used. You first need to define your index structure with the fields that match what you want to achieve:

  • if you want to index documents to search them later, you need to have your documents' content in one of your index fields (probably of string type), and set this field as 'searchable'
  • if you also want to be able to search using vectors (or hybrid mode, combining keyword + vector), you will also need a separate field containing those vectors
  • if you want to easily get your document title, add another field
  • if you want to easily get your document name / filename, add another field
  • and so on
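As a rough illustration of the list above, here is a minimal sketch of what such an index definition could look like, written as the JSON body you would send to the Azure AI Search REST API. The index name, field names, and profile names are my own illustrative choices, not something the service requires; the vector size of 1536 assumes embeddings from 'text-embedding-ada-002'.

```python
import json

# Assumption: embeddings come from text-embedding-ada-002, which outputs 1536 values.
EMBEDDING_DIMENSIONS = 1536


def build_index_definition(index_name: str) -> dict:
    """Return an index definition in the shape of the REST API request body."""
    return {
        "name": index_name,
        "fields": [
            # Exactly one field must be the document key.
            {"name": "id", "type": "Edm.String", "key": True, "filterable": True},
            # Full text of the document (or chunk), used for keyword search.
            {"name": "content", "type": "Edm.String", "searchable": True},
            # Embedding of 'content', used for vector or hybrid search.
            {
                "name": "contentVector",
                "type": "Collection(Edm.Single)",
                "searchable": True,
                "dimensions": EMBEDDING_DIMENSIONS,
                "vectorSearchProfile": "default-profile",
            },
            {"name": "title", "type": "Edm.String", "searchable": True},
            {"name": "filepath", "type": "Edm.String", "filterable": True},
        ],
        "vectorSearch": {
            "algorithms": [{"name": "default-hnsw", "kind": "hnsw"}],
            "profiles": [{"name": "default-profile", "algorithm": "default-hnsw"}],
        },
    }


# "rag-docs" is a hypothetical index name.
index_definition = build_index_definition("rag-docs")
print(json.dumps(index_definition, indent=2))
```

The point is that "searching through everything" is achieved by putting the whole document text into a searchable field, not by avoiding fields.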

Once you have listed what you need from your files and created the matching index structure, you use an 'indexer' to fill the index with your data.
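To make the indexer step more concrete, here is a hedged sketch of the two REST API bodies involved: a blob data source and an indexer that points it at the index. All names and the connection string placeholder are hypothetical, and the target fields ('id', 'filepath', 'title') assume an index that actually defines them. The blob indexer puts the extracted text into a field named 'content' when the index has one, so no mapping is needed for it.

```python
import json


def build_data_source(name: str, connection_string: str, container: str) -> dict:
    """Describe the blob container that the indexer will read from."""
    return {
        "name": name,
        "type": "azureblob",
        "credentials": {"connectionString": connection_string},
        "container": {"name": container},
    }


def build_indexer(name: str, data_source: str, index: str) -> dict:
    """Describe an indexer that copies blob text and metadata into the index."""
    return {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": index,
        "fieldMappings": [
            # Blob paths contain characters that are not valid in a document key,
            # so the path is base64-encoded before being used as 'id'.
            {
                "sourceFieldName": "metadata_storage_path",
                "targetFieldName": "id",
                "mappingFunction": {"name": "base64Encode"},
            },
            {"sourceFieldName": "metadata_storage_path", "targetFieldName": "filepath"},
            {"sourceFieldName": "metadata_storage_name", "targetFieldName": "title"},
        ],
    }


# Hypothetical names; replace the placeholder with a real connection string.
data_source = build_data_source("rag-blobs", "<storage-connection-string>", "my-container")
indexer = build_indexer("rag-indexer", data_source["name"], "rag-docs")
print(json.dumps(indexer, indent=2))
```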

For example, when you use the "Add your data" tab in Azure OpenAI Studio, it automatically creates all of that for you. You end up with an index containing the following fields:

  • 'content' holds the processed document's main content
  • 'contentVector' stores the embeddings of this main content (the output of a call to a 'text-embedding-ada-002' model with the same content that is stored in the 'content' field)
  • 'filepath' and 'url' have been filled with my blob file details
  • 'title' has been created using the beginning of the document chunk... it's sometimes not a real title :D
  • 'id' is a technical identifier in the search service
  • 'chunk_id' is a zero-based index indicating which part of the document it is, because the input documents are split into smaller chunks (see more details here)
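To show how these fields relate to each other, here is a simplified sketch that splits one document into chunk records. It is not the logic Azure OpenAI Studio runs: the character-based splitting, the chunk size and overlap, the hashed 'id', and the 60-character title are all assumptions made for illustration. The 'contentVector' field is left out because it would come from a separate embedding call.

```python
import hashlib


def split_into_chunks(text: str, chunk_size: int = 1000, overlap: int = 100) -> list:
    """Split text into overlapping character windows (a stand-in for real chunking)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += chunk_size - overlap
    return chunks


def build_chunk_records(filepath: str, url: str, text: str,
                        chunk_size: int = 1000, overlap: int = 100) -> list:
    """Build one index document per chunk, with a zero-based chunk_id."""
    file_hash = hashlib.sha1(filepath.encode("utf-8")).hexdigest()
    records = []
    for chunk_id, chunk in enumerate(split_into_chunks(text, chunk_size, overlap)):
        stripped = chunk.strip()
        records.append({
            "id": f"{file_hash}_{chunk_id}",  # hypothetical key scheme
            "chunk_id": chunk_id,
            # The title is just the start of the chunk, so it may not be a real title.
            "title": stripped.splitlines()[0][:60] if stripped else "",
            "filepath": filepath,
            "url": url,
            "content": chunk,
        })
    return records


# Hypothetical file details, small sizes so the result is easy to inspect.
records = build_chunk_records(
    "docs/a.txt", "https://example.invalid/docs/a.txt", "x" * 25,
    chunk_size=10, overlap=2,
)
for record in records:
    print(record["chunk_id"], len(record["content"]))
```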

"I also want any new data added to my blob container to be automatically included in Azure AI Search."

You need to give your indexer a schedule so that it runs regularly. Or you can use some code, for example an Azure Function with a "Blob trigger": the function fires whenever a blob is created or modified, and it indexes that document when it runs.
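Both options can be sketched as REST API payloads. The first helper adds a schedule to an indexer definition; the interval is an ISO 8601 duration, and five minutes ("PT5M") is the shortest one the service accepts. The second builds the body that a blob-triggered function could post to the index's document endpoint. The function and field names are hypothetical, and the Azure Function binding itself is not shown.

```python
import base64
import json


def with_schedule(indexer: dict, interval: str = "PT5M") -> dict:
    """Return a copy of an indexer definition that runs on a recurring schedule."""
    return {**indexer, "schedule": {"interval": interval}}


def make_document_key(blob_path: str) -> str:
    """Encode a blob path so it only contains characters allowed in a document key."""
    return base64.urlsafe_b64encode(blob_path.encode("utf-8")).decode("ascii")


def build_push_batch(blob_path: str, text: str) -> dict:
    """Build the request body for pushing one document into the index.

    'mergeOrUpload' updates the document if the key already exists and
    creates it otherwise, which suits a created/modified blob event.
    """
    return {
        "value": [
            {
                "@search.action": "mergeOrUpload",
                "id": make_document_key(blob_path),
                "filepath": blob_path,
                "content": text,
            }
        ]
    }


# Hypothetical indexer and blob, for illustration only.
scheduled_indexer = with_schedule(
    {"name": "rag-indexer", "dataSourceName": "rag-blobs", "targetIndexName": "rag-docs"}
)
batch = build_push_batch("my-container/docs/a.txt", "Text extracted from the blob.")
print(json.dumps(scheduled_indexer, indent=2))
print(json.dumps(batch, indent=2))
```

The scheduled indexer is the simpler choice when a delay of a few minutes is acceptable; the push approach needs more code but reacts as soon as the blob changes.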
