Chroma

Chroma is a database for building AI applications with embeddings.

This notebook shows how to use functionality related to the Chroma vector database.

!pip install chromadb

# get a token: https://platform.openai.com/account/api-keys

from getpass import getpass

OPENAI_API_KEY = getpass()

········

import os

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader

loader = TextLoader('../../../state_of_the_union.txt')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()

db = Chroma.from_documents(docs, embeddings)

query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)

Using embedded DuckDB without persistence: data will be transient

print(docs[0].page_content)

    Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 
    
    Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 
    
    One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 
    
    And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

Similarity search with score

The returned score is a cosine distance, so a lower score means a closer match.

docs = db.similarity_search_with_score(query)

docs[0]

(Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../../state_of_the_union.txt'}), 0.3949805498123169)
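
To make the "lower is better" reading concrete, here is a minimal pure-Python sketch of cosine distance, computed as 1 minus cosine similarity. The `cosine_distance` helper and the toy vectors are illustrative assumptions, not part of Chroma or LangChain, and the metric a given collection actually uses depends on how it is configured.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy 2-dimensional "embeddings" (hypothetical values)
query_vec = [1.0, 0.0]
close_vec = [0.9, 0.1]  # points almost the same way as the query
far_vec = [0.0, 1.0]    # orthogonal to the query

close = cosine_distance(query_vec, close_vec)
far = cosine_distance(query_vec, far_vec)

print(f"close: {close:.4f}, far: {far:.4f}")
```

The vector pointing in nearly the same direction as the query gets a distance near 0, while the orthogonal one gets a distance of 1, which is why the best match is the one with the smallest score.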

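Persistence

The steps below cover how to persist a ChromaDB instance.

Initialize PersistedChromaDB

Create embeddings for each chunk and insert them into the Chroma vector database. The persist_directory argument tells ChromaDB where to store the database when it is persisted.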

# Embed and store the texts
# Supplying a persist_directory will store the embeddings on disk
persist_directory = 'db'

embedding = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=docs, embedding=embedding, persist_directory=persist_directory)

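Running Chroma using direct local API.
No existing DB found in db, skipping load
No existing DB found in db, skipping load

Persist the database

We should call persist() to ensure the embeddings are written to disk.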

vectordb.persist()
vectordb = None

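Persisting DB to disk, putting it in the save folder db
PersistentDuckDB del, about to run persist
Persisting DB to disk, putting it in the save folder db

Load the database from disk and create the chain

Be sure to pass the same persist_directory and embedding_function that you used when instantiating the database. Initialize the chain to be used for question answering.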

# Now we can load the persisted database from disk, and use it as normal. 
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)

Running Chroma using direct local API.
loaded in 4 embeddings
loaded in 1 collections
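
The persist, discard, and reload round trip above can be pictured with a toy store. This is only a sketch: `ToyStore`, `toy_embed`, and the `store.json` file layout are hypothetical names invented for illustration, and Chroma's real on-disk format is different. It shows why the same directory and the same embedding function are needed at load time: the stored vectors are only comparable to query vectors produced by the function that created them.

```python
import json
import os
import tempfile

def toy_embed(text):
    # Hypothetical stand-in for an embedding model: vowel counts.
    return [float(text.lower().count(ch)) for ch in "aeiou"]

class ToyStore:
    """Hypothetical in-memory vector store that can save itself to a directory."""

    def __init__(self, persist_directory, embedding_function):
        self.path = os.path.join(persist_directory, "store.json")
        self.embed = embedding_function
        self.records = []
        if os.path.exists(self.path):  # load existing data, if any
            with open(self.path, encoding="utf-8") as f:
                self.records = json.load(f)

    def add_texts(self, texts):
        for text in texts:
            self.records.append({"text": text, "vector": self.embed(text)})

    def persist(self):
        # Write every record (text plus vector) to disk as JSON.
        os.makedirs(os.path.dirname(self.path), exist_ok=True)
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.records, f)

    def search(self, query):
        # Return the stored text nearest to the query (squared Euclidean distance).
        q = self.embed(query)
        return min(
            self.records,
            key=lambda r: sum((a - b) ** 2 for a, b in zip(q, r["vector"])),
        )["text"]

persist_directory = tempfile.mkdtemp()
store = ToyStore(persist_directory, toy_embed)
store.add_texts(["banana", "sky"])
store.persist()
store = None  # drop the in-memory copy, as in the walkthrough above

# Reload from the same directory with the same embedding function
reloaded = ToyStore(persist_directory, toy_embed)
result = reloaded.search("bandana")
print(result)
```

Reloading with a different embedding function would still read the file, but the query vectors would no longer live in the same space as the stored ones, so the search results would be meaningless.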

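Retriever options

This section goes over different options for how to use Chroma as a retriever.

Maximal marginal relevance (MMR)

In addition to using similarity search in the retriever object, you can also use mmr.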

retriever = db.as_retriever(search_type="mmr")

retriever.get_relevant_documents(query)[0]

Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../../state_of_the_union.txt'})
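
For intuition about what MMR changes, the following pure-Python sketch implements the general technique: greedily pick the document that best balances relevance to the query against similarity to documents already picked. The `mmr_select` helper, the weighting of 0.5, and the toy vectors are assumptions made for illustration; this is not LangChain's or Chroma's implementation.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine_similarity(query_vec, doc_vecs[i])
            # Highest similarity to anything already selected (0 if nothing is)
            redundancy = max(
                (cosine_similarity(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy 2-dimensional "embeddings" (hypothetical values)
query_vec = [1.0, 0.0]
doc_vecs = [
    [0.9, 0.3],    # most relevant
    [0.9, 0.35],   # near-duplicate of document 0
    [0.8, -0.6],   # less relevant, but covers a different direction
]

picked = mmr_select(query_vec, doc_vecs, k=2)
top_by_similarity = sorted(
    range(len(doc_vecs)),
    key=lambda i: -cosine_similarity(query_vec, doc_vecs[i]),
)[:2]

print("MMR:", picked, "similarity only:", top_by_similarity)
```

With these toy vectors, plain similarity ranking returns documents 0 and 1, which are near-duplicates, while MMR returns documents 0 and 2, swapping the duplicate for a more diverse result.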

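Updating a document

The update_document function lets you modify the content of a document after it has been added to a Chroma instance. Let's look at an example of how to use it.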

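# Import the Document class
from langchain.docstore.document import Document

# Initial document content and ID
initial_content = "This is the initial document content"
document_id = "doc1"

# Create a Document instance with the initial content and metadata
original_doc = Document(page_content=initial_content, metadata={"page": "0"})

# Initialize a Chroma instance with the original document
new_db = Chroma.from_documents(
    collection_name="test_collection",
    documents=[original_doc],
    embedding=OpenAIEmbeddings(),  # using the same embeddings as before
    ids=[document_id],
)

At this point, we have a new Chroma instance containing a single document with the ID "doc1" and the content "This is the initial document content". Now, let's update the content of the document.

# Updated document content
updated_content = "This is the updated document content"

# Create a new Document instance with the updated content
updated_doc = Document(page_content=updated_content, metadata={"page": "1"})

# Update the document in the Chroma instance by passing the document ID and the updated document
new_db.update_document(document_id=document_id, document=updated_doc)

# Now, let's retrieve the updated document using similarity search
output = new_db.similarity_search(updated_content, k=1)

# Print the content of the retrieved document
print(output[0].page_content, output[0].metadata)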

This is the updated document content {'page': '1'}
