ClickHouse Vector Search

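ClickHouse is a fast, resource-efficient open-source database for real-time applications and analytics. It offers full SQL support and a wide range of functions that help users write analytical queries. Recently added data structures and distance functions (such as L2Distance), together with approximate nearest neighbor search indexes, allow ClickHouse to serve as a high-performance, scalable vector database that stores and searches vectors with SQL.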
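
To make the distance-search idea concrete, here is a minimal pure-Python sketch of what an exact (brute-force) nearest-neighbor search under Euclidean distance computes. The helper names `l2_distance` and `nearest` are hypothetical and exist only for illustration. ClickHouse evaluates L2Distance inside SQL, and an approximate index generally trades some accuracy for speed instead of scanning every vector as this sketch does.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(
    query: Sequence[float],
    vectors: Sequence[Sequence[float]],
    k: int = 1,
) -> List[Tuple[int, float]]:
    """Return (index, distance) pairs for the k closest vectors, closest first."""
    scored = [(i, l2_distance(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smaller distance means more similar
    return scored[:k]
```

For example, `nearest([0, 0], [[3, 4], [1, 0], [0, 2]], k=2)` ranks the second vector first, because it lies closest to the query.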

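This document shows how to use the functionality related to ClickHouse vector search.

Setting up the environment

Set up a local ClickHouse server with Docker (optional)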

! docker run -d -p 8123:8123 -p9000:9000 --name langchain-clickhouse-server --ulimit nofile=262144:262144 clickhouse/clickhouse-server:23.4.2.11

Set up the ClickHouse client driver

!pip install clickhouse-connect

We want to use OpenAIEmbeddings, so we need an OpenAI API key.

import os
import getpass

# Use .get() so a missing variable prompts for the key instead of raising KeyError
if not os.environ.get('OPENAI_API_KEY'):
    os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Clickhouse, ClickhouseSettings
from langchain.document_loaders import TextLoader
loader = TextLoader('../../../state_of_the_union.txt')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
for d in docs:
    d.metadata = {'some': 'metadata'}
settings = ClickhouseSettings(table="clickhouse_vector_search_example")
docsearch = Clickhouse.from_documents(docs, embeddings, config=settings)

query = "What did the president say about Ketanji Brown Jackson"
docs = docsearch.similarity_search(query)

Inserting data...: 100%|██████████| 42/42 [00:00<00:00, 2801.49it/s]

print(docs[0].page_content)
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

Get connection info and data schema

print(str(docsearch))
default.clickhouse_vector_search_example @ localhost:8123

username: None

Table Schema:
---------------------------------------------------
|id                      |Nullable(String)        |
|document                |Nullable(String)        |
|embedding               |Array(Float32)          |
|metadata                |Object('json')          |
|uuid                    |UUID                    |
---------------------------------------------------

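ClickHouse table schema

By default, the ClickHouse table is created automatically if it does not exist. Advanced users can create the table in advance with optimized settings. For a distributed ClickHouse cluster with sharding, the table engine should be configured as Distributed.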

print(f"Clickhouse Table DDL:\n\n{docsearch.schema}")
    Clickhouse Table DDL:
    
    CREATE TABLE IF NOT EXISTS default.clickhouse_vector_search_example(
        id Nullable(String),
        document Nullable(String),
        embedding Array(Float32),
        metadata JSON,
        uuid UUID DEFAULT generateUUIDv4(),
        CONSTRAINT cons_vec_len CHECK length(embedding) = 1536,
        INDEX vec_idx embedding TYPE annoy(100,'L2Distance') GRANULARITY 1000
    ) ENGINE = MergeTree ORDER BY uuid SETTINGS index_granularity = 8192
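
For readers who want to create the table in advance, the following is a minimal sketch that assembles a DDL string shaped like the one printed above. The function `build_table_ddl` and its parameters are hypothetical, not part of LangChain or ClickHouse. The column layout and index clause are copied from the output above, and the index syntax may differ on other server versions.

```python
import re

# Conservative identifier pattern so names can be interpolated into SQL safely.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_table_ddl(database: str, table: str, dim: int) -> str:
    """Build a CREATE TABLE statement mirroring the schema shown above.

    Hypothetical helper: `dim` is the embedding length enforced by the
    CHECK constraint (1536 in the output above).
    """
    for name in (database, table):
        if not _IDENT.fullmatch(name):
            raise ValueError(f"invalid identifier: {name!r}")
    if not isinstance(dim, int) or dim <= 0:
        raise ValueError("dim must be a positive integer")
    return (
        f"CREATE TABLE IF NOT EXISTS {database}.{table}(\n"
        "    id Nullable(String),\n"
        "    document Nullable(String),\n"
        "    embedding Array(Float32),\n"
        "    metadata JSON,\n"
        "    uuid UUID DEFAULT generateUUIDv4(),\n"
        f"    CONSTRAINT cons_vec_len CHECK length(embedding) = {dim},\n"
        "    INDEX vec_idx embedding TYPE annoy(100,'L2Distance') GRANULARITY 1000\n"
        ") ENGINE = MergeTree ORDER BY uuid SETTINGS index_granularity = 8192"
    )
```

The resulting string could then be tuned (for example, a different engine or ORDER BY key) and sent to the server with a ClickHouse client before the vector store is first used.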

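Filtering

You have direct access to ClickHouse SQL and can write a WHERE clause in standard SQL to filter your data.

NOTE: Be aware of SQL injection. This interface must not be called directly by end users.

If you customized column_map in your settings, you can search with a filter like this: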

from langchain.vectorstores import Clickhouse, ClickhouseSettings
from langchain.document_loaders import TextLoader

loader = TextLoader('../../../state_of_the_union.txt')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()

for i, d in enumerate(docs):
    d.metadata = {'doc_id': i}

docsearch = Clickhouse.from_documents(docs, embeddings)

Inserting data...: 100%|██████████| 42/42 [00:00<00:00, 6939.56it/s]

meta = docsearch.metadata_column
output = docsearch.similarity_search_with_relevance_scores('What did the president say about Ketanji Brown Jackson?', 
                                                           k=4, where_str=f"{meta}.doc_id<10")
for d, dist in output:
    print(dist, d.metadata, d.page_content[:20] + '...')
0.6779101415357189 {'doc_id': 0} Madam Speaker, Madam...
0.6997970363474885 {'doc_id': 8} And so many families...
0.7044504914336727 {'doc_id': 1} Groups of citizens b...
0.7053558702165094 {'doc_id': 6} And I’m taking robus...
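
Because the `where_str` argument is passed through as raw SQL, any value that originates from a user should be validated before it is interpolated. The following is a minimal sketch of one way to do that for the numeric `doc_id` filter used above. The helper `doc_id_below` is hypothetical and only covers this one filter shape; it is not a general-purpose SQL sanitizer.

```python
import re

# Only plain identifiers are accepted as the metadata column name.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def doc_id_below(meta_column: str, limit) -> str:
    """Build a `<meta_column>.doc_id<N` fragment from untrusted input.

    Hypothetical helper: the limit is coerced to int, so strings such as
    "10 OR 1=1" raise ValueError instead of reaching the database.
    """
    if not _IDENT.fullmatch(meta_column):
        raise ValueError(f"invalid column name: {meta_column!r}")
    if isinstance(limit, bool):
        raise ValueError("limit must be an integer, not a bool")
    limit_int = int(limit)  # raises ValueError for non-numeric text
    return f"{meta_column}.doc_id<{limit_int}"
```

With this helper, the filter above could be written as `where_str=doc_id_below(meta, 10)`, and a malicious value fails fast on the Python side.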

Deleting your data

docsearch.drop()
Contributors: 刘强