ElasticSearch

Elasticsearch is a distributed, RESTful search and analytics engine. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.

This document shows how to use functionality related to the Elasticsearch database.

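ElasticVectorSearch class

Installation

See the Elasticsearch installation instructions.

To connect to an Elasticsearch instance that does not require login credentials, pass the Elasticsearch URL and index name along with the embedding object to the constructor.

Example: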

        from langchain import ElasticVectorSearch
        from langchain.embeddings import OpenAIEmbeddings

        embedding = OpenAIEmbeddings()
        elastic_vector_search = ElasticVectorSearch(
            elasticsearch_url="http://localhost:9200",
            index_name="test_index",
            embedding=embedding
        )

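To connect to an Elasticsearch instance that requires login credentials, including Elastic Cloud, use the Elasticsearch URL format https://username:password@es_host:9243. For example, to connect to Elastic Cloud, create the Elasticsearch URL with the required authentication details and pass it to the ElasticVectorSearch constructor as the named parameter elasticsearch_url.

You can obtain your Elastic Cloud URL and login credentials by logging in to the Elastic Cloud console at https://cloud.elastic.co, selecting your deployment, and navigating to the "Deployments" page.

To obtain your Elastic Cloud password for the default "elastic" user:

  1. Log in to the Elastic Cloud console at https://cloud.elastic.co
  2. Go to "Security" > "Users"
  3. Locate the "elastic" user and click "Edit"
  4. Click "Reset password"
  5. Follow the prompts to reset the password

The format for Elastic Cloud URLs is https://username:password@cluster_id.region_id.gcp.cloud.es.io:9243.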
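Credentials embedded in a URL generally need percent-encoding when they contain characters such as `@`, `:` or `/`, otherwise the URL may be parsed incorrectly. The following is a minimal sketch of one way to build such a URL with the standard library only. The helper name `build_elasticsearch_url` and the sample password are illustrative assumptions, not part of LangChain or Elasticsearch.

```python
from urllib.parse import quote


def build_elasticsearch_url(username, password, host, port=9243):
    """Build an https URL with percent-encoded credentials (hypothetical helper)."""
    # safe="" forces encoding of every reserved character, including "/" and ":"
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"https://{user}:{pwd}@{host}:{port}"


# The password below is made up to show the encoding of "@", ":" and "/".
elasticsearch_url = build_elasticsearch_url(
    "elastic", "p@ss:w/rd", "cluster_id.region_id.gcp.cloud.es.io"
)
print(elasticsearch_url)
```

The resulting string can then be passed as `elasticsearch_url` to the constructor.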

Example:

        from langchain import ElasticVectorSearch
        from langchain.embeddings import OpenAIEmbeddings

        embedding = OpenAIEmbeddings()

        elastic_host = "cluster_id.region_id.gcp.cloud.es.io"
        elasticsearch_url = f"https://username:password@{elastic_host}:9243"
        elastic_vector_search = ElasticVectorSearch(
            elasticsearch_url=elasticsearch_url,
            index_name="test_index",
            embedding=embedding
        )
!pip install elasticsearch
import os
import getpass

os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')

OpenAI API Key: ········
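When the notebook is re-run, prompting for the key every time can be avoided by checking the environment first. The sketch below shows one possible pattern; `ensure_env` is a hypothetical helper name, and `prompt_fn` is an assumption added so the prompt can be swapped out.

```python
import getpass
import os


def ensure_env(name, prompt_fn=getpass.getpass):
    """Return os.environ[name], prompting only when it is missing or empty."""
    if not os.environ.get(name):
        os.environ[name] = prompt_fn(f"{name}: ")
    return os.environ[name]


# Typical notebook usage (left commented out because it prompts for input):
# ensure_env("OPENAI_API_KEY")
```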

Example

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import ElasticVectorSearch
from langchain.document_loaders import TextLoader
loader = TextLoader('../../../state_of_the_union.txt')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
db = ElasticVectorSearch.from_documents(docs, embeddings, elasticsearch_url="http://localhost:9200")

query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
print(docs[0].page_content)
In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections. 

We cannot let this happen. 

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

ElasticKnnSearch class

The ElasticKnnSearch class implements storing vectors and documents in Elasticsearch for use with approximate kNN search.
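To make the idea of kNN search concrete, here is a small exact (brute-force) nearest-neighbour sketch in plain Python. It is only an illustration: Elasticsearch's approximate kNN relies on an index structure rather than scanning every vector, the similarity metric depends on the index mapping (cosine is assumed here), and the two-dimensional toy vectors are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_knn(query_vector, indexed, k=2):
    """Return the k most similar (score, text) pairs, best first.

    `indexed` is a list of (text, vector) tuples.
    """
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "index": the vectors are invented for illustration, not real embeddings.
toy_index = [
    ("Hello, world!", [1.0, 0.0]),
    ("Machine learning is fun.", [0.0, 1.0]),
    ("I love Python.", [0.6, 0.8]),
]
top = exact_knn([1.0, 0.1], toy_index, k=2)
print([text for _, text in top])
```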

!pip install langchain elasticsearch
from langchain.vectorstores.elastic_vector_search import ElasticKnnSearch
from langchain.embeddings import ElasticsearchEmbeddings
import elasticsearch
# Initialize ElasticsearchEmbeddings
model_id = "<model_id_from_es>" 
dims = dim_count  # placeholder: replace dim_count with the embedding dimension of your model
es_cloud_id = "ESS_CLOUD_ID"
es_user = "es_user"
es_password = "es_pass"
test_index = "<index_name>"
#input_field = "your_input_field" # if different from 'text_field'
# Generate embedding object
embeddings = ElasticsearchEmbeddings.from_credentials(
    model_id,
    #input_field=input_field,
    es_cloud_id=es_cloud_id,
    es_user=es_user,
    es_password=es_password,
)
# Initialize ElasticKnnSearch
knn_search = ElasticKnnSearch(
    es_cloud_id=es_cloud_id,
    es_user=es_user,
    es_password=es_password,
    index_name=test_index,
    embedding=embeddings
)

Test adding vectors

# Test the `add_texts` method
texts = ["Hello, world!", "Machine learning is fun.", "I love Python."]
knn_search.add_texts(texts)

# Test the `from_texts` method
new_texts = ["This is a new text.", "Elasticsearch is powerful.", "Python is great for data analysis."]
knn_search.from_texts(new_texts, dims=dims)

Test kNN search using a query vector builder

# Test the `knn_search` method with a model_id and query text
query = "Hello"
knn_result = knn_search.knn_search(query=query, model_id=model_id, k=2)
print(f"kNN search results for query '{query}': {knn_result}")
print(f"The 'text' field value from the top hit is: '{knn_result['hits']['hits'][0]['_source']['text']}'")

# Test the `knn_hybrid_search` method
query = "Hello"
hybrid_result = knn_search.knn_hybrid_search(query=query, model_id=model_id, k=2)
print(f"Hybrid search results for query '{query}': {hybrid_result}")
print(f"The 'text' field value from the top hit is: '{hybrid_result['hits']['hits'][0]['_source']['text']}'")
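Indexing `['hits']['hits'][0]` directly raises an IndexError when a search returns no hits. A small helper like the one below, a sketch assuming the result can be indexed like a nested dict as in the cells above, collects scores and texts without that risk. The name `top_texts` and the sample response are made up for illustration.

```python
def top_texts(response, field="text"):
    """Return (score, text) pairs from a search response, best hit first."""
    hits = response["hits"]["hits"]
    return [(hit.get("_score"), hit.get("_source", {}).get(field)) for hit in hits]


# Hand-written stand-in for a search response; scores are invented.
sample_response = {
    "hits": {
        "hits": [
            {"_score": 0.93, "_source": {"text": "Hello, world!"}},
            {"_score": 0.61, "_source": {"text": "I love Python."}},
        ]
    }
}
print(top_texts(sample_response))
print(top_texts({"hits": {"hits": []}}))  # an empty result yields an empty list
```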

Test kNN search using a pre-generated vector

# Generate an embedding to test with
query_text = 'Hello'
query_embedding = embeddings.embed_query(query_text)
print(f"Length of embedding: {len(query_embedding)}\nFirst two items in embedding: {query_embedding[:2]}")

# Test kNN search
knn_result = knn_search.knn_search(query_vector=query_embedding, k=2)
print(f"The 'text' field value from the top hit is: '{knn_result['hits']['hits'][0]['_source']['text']}'")

# Test hybrid search - requires both a query text and a query vector
knn_result = knn_search.knn_hybrid_search(query_vector=query_embedding, query=query_text, k=2)
print(f"The 'text' field value from the top hit is: '{knn_result['hits']['hits'][0]['_source']['text']}'")
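The difference between the two kNN modes, and the reason hybrid search always needs the query text, is easier to see in the shape of the search request. The sketch below builds request bodies as plain dicts following the Elasticsearch kNN search options (`query_vector` for a pre-generated vector, `query_vector_builder` with `text_embedding` for a model deployed in the cluster). It is not ElasticKnnSearch's actual source; the helper names, the field names `vector` and `text`, and `num_candidates=50` are assumptions.

```python
def build_knn_clause(k, query_vector=None, query=None, model_id=None,
                     vector_field="vector", num_candidates=50):
    """Build the `knn` part of a search body (illustrative, hypothetical helper)."""
    knn = {"field": vector_field, "k": k, "num_candidates": num_candidates}
    if query_vector is not None:
        # A pre-generated vector is sent as-is.
        knn["query_vector"] = query_vector
    elif query is not None and model_id is not None:
        # Otherwise Elasticsearch embeds the text with the deployed model.
        knn["query_vector_builder"] = {
            "text_embedding": {"model_id": model_id, "model_text": query}
        }
    else:
        raise ValueError("Provide query_vector, or both query and model_id.")
    return knn


def build_hybrid_body(query, k, text_field="text", **knn_kwargs):
    """Combine a kNN clause with a lexical match query on the text field."""
    return {
        "knn": build_knn_clause(k, query=query, **knn_kwargs),
        "query": {"match": {text_field: {"query": query}}},
    }


print(build_knn_clause(2, query="Hello", model_id="my_model"))
print(build_hybrid_body("Hello", 2, query_vector=[0.1, 0.2]))
```

The lexical `match` part is why the query text is still required even when a vector is supplied.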

Test the source option

# Test the `knn_search` method with a model_id and query text
query = "Hello"
knn_result = knn_search.knn_search(query=query, model_id=model_id, k=2, source=False)
assert '_source' not in knn_result['hits']['hits'][0].keys()

# Test the `knn_hybrid_search` method
query = "Hello"
hybrid_result = knn_search.knn_hybrid_search(query=query, model_id=model_id, k=2, source=False)
assert '_source' not in hybrid_result['hits']['hits'][0].keys()

Test the fields option

# Test the `knn_search` method with a model_id and query text
query = "Hello"
knn_result = knn_search.knn_search(query=query, model_id=model_id, k=2, fields=['text'])
assert 'text' in knn_result['hits']['hits'][0]['fields'].keys()

# Test the `knn_hybrid_search` method
query = "Hello"
hybrid_result = knn_search.knn_hybrid_search(query=query, model_id=model_id, k=2, fields=['text'])
assert 'text' in hybrid_result['hits']['hits'][0]['fields'].keys()
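Because a hit may carry its values under `fields` (where Elasticsearch returns each value as a list), under `_source`, or under neither when `source=False` and no fields are requested, code that reads hits may want to handle all three cases. The following is a minimal sketch; `hit_field` is a hypothetical helper and the sample hits are hand-written.

```python
def hit_field(hit, name):
    """Read a field from a single hit, preferring `fields` over `_source`."""
    if name in hit.get("fields", {}):
        values = hit["fields"][name]  # values under `fields` are lists
        return values[0] if values else None
    return hit.get("_source", {}).get(name)


# Hand-written sample hits covering the three response shapes.
hit_with_fields = {"_score": 0.9, "fields": {"text": ["Hello, world!"]}}
hit_with_source = {"_score": 0.9, "_source": {"text": "Hello, world!"}}
hit_with_neither = {"_score": 0.9}

print(hit_field(hit_with_fields, "text"))
print(hit_field(hit_with_source, "text"))
print(hit_field(hit_with_neither, "text"))
```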

Test with an es client connection rather than cloud_id

from elasticsearch import Elasticsearch

# Create the Elasticsearch connection
es_connection = Elasticsearch(
    hosts=['https://es_cluster_url:port'],
    basic_auth=('user', 'password')
)
# Instantiate ElasticsearchEmbeddings using es_connection
embeddings = ElasticsearchEmbeddings.from_es_connection(
    model_id,
    es_connection,
)
# Initialize ElasticKnnSearch
knn_search = ElasticKnnSearch(
    es_connection=es_connection,
    index_name=test_index,
    embedding=embeddings
)
# Test the `knn_search` method with a model_id and query text
query = "Hello"
knn_result = knn_search.knn_search(query=query, model_id=model_id, k=2)
print(f"kNN search results for query '{query}': {knn_result}")
print(f"The 'text' field value from the top hit is: '{knn_result['hits']['hits'][0]['_source']['text']}'")
Contributors: 刘强