FAISS
Facebook AI Similarity Search (Faiss) is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning.
This notebook shows how to use functionality related to the FAISS vector database.
#!pip install faiss
# OR
!pip install faiss-cpu
We want to use OpenAIEmbeddings, so we have to get the OpenAI API Key.
import os
import getpass
os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')
# Uncomment the following line if you need to initialize FAISS with no AVX2 optimization
# os.environ['FAISS_NO_AVX2'] = '1'
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.document_loaders import TextLoader
loader = TextLoader('../../../state_of_the_union.txt')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(docs, embeddings)
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
print(docs[0].page_content)
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.
Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
Similarity Search with score
There are some FAISS-specific methods. One of them is similarity_search_with_score, which allows you to return not only the documents but also the distance score of the query to them. The returned distance score is L2 distance; therefore, a lower score means a more similar result.
docs_and_scores = db.similarity_search_with_score(query)
docs_and_scores[0]
(Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../../state_of_the_union.txt'}), 0.36913747)
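Because the scores are raw L2 distances, you can post-filter the results yourself with a simple threshold. A minimal sketch, where the 0.4 cutoff is just an illustrative value you would tune for your own embeddings and data:
# Keep only documents whose L2 distance to the query is below an (illustrative) cutoff
threshold = 0.4
close_docs = [doc for doc, score in docs_and_scores if score < threshold]
print(len(close_docs))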
It is also possible to search for documents similar to a given embedding vector using similarity_search_by_vector, which accepts an embedding vector as a parameter instead of a string.
embedding_vector = embeddings.embed_query(query)
docs = db.similarity_search_by_vector(embedding_vector)
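The vector does not have to be the embedding of the original query string; any vector produced by the same embedding model works. A small sketch (the alternative query text below is made up for illustration):
# Embed a different piece of text and search with its vector directly
other_vector = embeddings.embed_query("nomination to the Supreme Court")
docs = db.similarity_search_by_vector(other_vector)
print(docs[0].page_content[:100])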
Saving and loading
You can also save and load a FAISS index. This is useful so you don't have to recreate it every time you use it.
db.save_local("faiss_index")
new_db = FAISS.load_local("faiss_index", embeddings)
docs = new_db.similarity_search(query)
docs[0]
Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../../state_of_the_union.txt'})
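As a quick sanity check after loading, you can compare the number of stored vectors. This assumes the raw FAISS index is reachable through the vector store's index attribute, whose ntotal field is the vector count:
# The reloaded store should contain the same number of vectors as the original
assert new_db.index.ntotal == db.index.ntotal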
Merging
You can also merge two FAISS vector stores.
db1 = FAISS.from_texts(["foo"], embeddings)
db2 = FAISS.from_texts(["bar"], embeddings)
db1.docstore._dict
{'068c473b-d420-487a-806b-fb0ccea7f711': Document(page_content='foo', metadata={})}
db2.docstore._dict
{'807e0c63-13f6-4070-9774-5c6f0fbb9866': Document(page_content='bar', metadata={})}
db1.merge_from(db2)
db1.docstore._dict
{'068c473b-d420-487a-806b-fb0ccea7f711': Document(page_content='foo', metadata={}),
 '807e0c63-13f6-4070-9774-5c6f0fbb9866': Document(page_content='bar', metadata={})}
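After merging, searches against db1 cover the documents from both stores. A minimal check:
# The "bar" document merged in from db2 is now searchable from db1
print(db1.similarity_search("bar", k=1)[0].page_content)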
Similarity Search with filtering
The FAISS vector store can also support filtering. Since FAISS does not natively support filtering, it has to be done manually: more results than k are fetched first and then filtered. You can filter the documents based on metadata. You can also set the fetch_k parameter when calling any search method to control how many documents you want to fetch before filtering. Here is a small example:
from langchain.schema import Document
list_of_documents = [
    Document(page_content="foo", metadata=dict(page=1)),
    Document(page_content="bar", metadata=dict(page=1)),
    Document(page_content="foo", metadata=dict(page=2)),
    Document(page_content="barbar", metadata=dict(page=2)),
    Document(page_content="foo", metadata=dict(page=3)),
    Document(page_content="bar burr", metadata=dict(page=3)),
    Document(page_content="foo", metadata=dict(page=4)),
    Document(page_content="bar bruh", metadata=dict(page=4)),
]
db = FAISS.from_documents(list_of_documents, embeddings)
results_with_scores = db.similarity_search_with_score("foo")
for doc, score in results_with_scores:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}, Score: {score}")
Content: foo, Metadata: {'page': 1}, Score: 5.159960813797904e-15
Content: foo, Metadata: {'page': 2}, Score: 5.159960813797904e-15
Content: foo, Metadata: {'page': 3}, Score: 5.159960813797904e-15
Content: foo, Metadata: {'page': 4}, Score: 5.159960813797904e-15
Now we make the same query call, but we filter for only page = 1.
results_with_scores = db.similarity_search_with_score("foo", filter=dict(page=1))
for doc, score in results_with_scores:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}, Score: {score}")
Content: foo, Metadata: {'page': 1}, Score: 5.159960813797904e-15
Content: bar, Metadata: {'page': 1}, Score: 0.3131446838378906
The same thing can be done with max_marginal_relevance_search as well.
results = db.max_marginal_relevance_search("foo", filter=dict(page=1))
for doc in results:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}")
Content: foo, Metadata: {'page': 1}
Content: bar, Metadata: {'page': 1}
Here is an example of how to set the fetch_k parameter when calling similarity_search. Usually you want fetch_k to be much larger than k, because fetch_k is the number of documents that will be fetched before filtering. If you set fetch_k too low, you might not end up with enough documents to filter over.
results = db.similarity_search("foo", filter=dict(page=1), k=1, fetch_k=4)
for doc in results:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}")
Content: foo, Metadata: {'page': 1}
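Conversely, a sketch of what can happen when fetch_k is too small: with fetch_k=1 only a single candidate is retrieved before the metadata filter runs, so the filtered result set may well be empty (the page=3 filter here is just for illustration):
# Only one candidate is fetched, so the filter may remove everything
results = db.similarity_search("foo", filter=dict(page=3), k=1, fetch_k=1)
print(len(results))  # may print 0 if the single fetched document is not on page 3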