How to deal with different vector-dimensions for embeddings and search with pgvector?
Background:
I use Postgres with the pgvector extension to store OpenAI embeddings, which serve as the data source for my RAG pipeline.
Until now, the best practice has been to use the embedding model text-embedding-ada-002, which outputs 1536-dimensional vectors.
Today, OpenAI released two new models, text-embedding-3-small and text-embedding-3-large. They offer several embedding dimensions: 512 and 1536 for the small model, and 256, 1024, and 3072 for the large model.
My question is: how can I handle multiple vector dimensions in the same table while still offering the same querying method?
Currently, my table looks like this:
create table if not exists public.embeddings
(
    id        serial primary key,
    embedding vector(1536) not null
    -- ... some more columns, but irrelevant for the given context
);

create index if not exists embeddings_embedding_idx
    on public.embeddings using ivfflat (embedding public.vector_cosine_ops);
For querying, I use a stored function:
create or replace function match_embeddings(
    query_embedding vector(1536),
    match_threshold float,
    match_count int
)
    returns table(j json)
as
$$
begin
    return query
        select row_to_json(r)
        from (select e.id,
                     1 - (e.embedding <=> query_embedding) as similarity
              from embeddings e
              where 1 - (e.embedding <=> query_embedding) > match_threshold
              order by similarity desc
              limit match_count) r;
end
$$
language plpgsql;
Currently, the dimension is fixed at 1536. As far as I understand, pgvector does not allow a variable dimension here: a column of type vector(1536) requires an array of exactly 1536 elements, and supplying fewer raises an error.
I have also considered simply increasing the vector size to the current maximum of 3072. In addition, I would like to record in the table which vector size was used for each row, so that I can tell at any time which dimension is stored.
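The main questions are:
- Should I right-pad (append zeros to) my existing 1536-dimensional vectors so that they fit a 3072-dimensional column? Would that affect how the query works?
- Or should I add one column per required dimension, e.g. embedding_1536 vector(1536), embedding_3072 vector(3072)?
- Or should I use multiple embedding tables, each with a vector column of the corresponding dimension?
- Or, if none of these works, is there another approach that I have not thought of yet?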
Any helpful ideas, hints, comments, or solutions are very welcome. Thanks!
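On the zero-padding option raised above, a quick check in SQL may help. This is a minimal sketch that uses toy 3-dimensional vectors as stand-ins for real embeddings (the values are made up for illustration). Right-padding both operands with zeros leaves the dot product and both norms unchanged, so the two columns should show the same cosine distance.

```sql
-- Toy vectors only; real embeddings would have 1536 or 3072 elements.
-- Column 1: cosine distance of the original vectors.
-- Column 2: cosine distance after right-padding both vectors with zeros.
SELECT
    '[1,2,3]'::vector     <=> '[3,1,2]'::vector     AS cosine_distance_original,
    '[1,2,3,0,0]'::vector <=> '[3,1,2,0,0]'::vector AS cosine_distance_padded;
```

Padding by itself therefore should not distort comparisons between vectors from the same model. Two caveats apply. Vectors produced by different models live in different embedding spaces, so comparing them is not meaningful even when their lengths match. Also, pgvector only builds ivfflat and hnsw indexes on vector columns of up to 2,000 dimensions, so a padded vector(3072) column could not carry the index shown in the question.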
Answer:
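Have you tried an embedding column without a fixed dimension? A column declared as plain vector accepts vectors of different lengths, and vector_dims lets a query restrict itself to rows of a single dimension: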
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector);
INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5]');
SELECT * FROM items WHERE vector_dims(embedding) = 3 ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
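One gap in the approach above is indexing: pgvector can only index rows that share a dimension, so a dimensionless column needs an expression index (with a cast) combined with a partial index. The following is a sketch along the lines of the pattern in the pgvector README, not a drop-in solution. The table name items_by_model, the model column, the index name, and the prepared statement match_ada are all hypothetical, and hnsw assumes pgvector 0.5.0 or later (ivfflat could be used instead).

```sql
-- Hypothetical table: one dimensionless vector column plus the model that produced it.
CREATE TABLE IF NOT EXISTS items_by_model (
    id        bigserial PRIMARY KEY,
    model     text   NOT NULL,   -- e.g. 'text-embedding-ada-002'
    embedding vector NOT NULL    -- no fixed dimension
);

-- One partial expression index per model.
-- The cast gives the indexed expression a fixed dimension;
-- the WHERE clause limits the index to rows from that model.
CREATE INDEX IF NOT EXISTS items_by_model_ada_idx
    ON items_by_model
    USING hnsw ((embedding::vector(1536)) vector_cosine_ops)
    WHERE (model = 'text-embedding-ada-002');

-- The query repeats the same filter and the same cast expression,
-- and orders by distance ascending so that the planner can use the index.
PREPARE match_ada(vector(1536), int) AS
    SELECT id,
           1 - (embedding::vector(1536) <=> $1) AS similarity
    FROM items_by_model
    WHERE model = 'text-embedding-ada-002'
    ORDER BY embedding::vector(1536) <=> $1
    LIMIT $2;
```

Storing the model name also answers the wish to know which dimension each row holds, since vector_dims(embedding) can be checked per model. For 3072-dimensional embeddings the 2,000-dimension index limit on the vector type still applies; newer pgvector releases offer a halfvec type with a higher limit, which may be worth evaluating.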

