Deep Lake

Deep Lakeopen in new window 是一个多模态向量存储库,用于存储嵌入向量及其包括文本、JSON、图像、音频、视频等在内的元数据。它可以将数据保存在本地、云端或 Activeloop 存储中。它支持混合搜索,包括嵌入向量及其属性。

这个笔记本展示了与 Deep Lake 相关的基本功能。虽然 Deep Lake 可以存储嵌入向量,但它也能存储任何类型的数据。它是一个功能完善的无服务器数据湖,具备版本控制、查询引擎和流式数据加载器,可用于深度学习框架。

如需更多信息,请参阅 Deep Lake 文档open in new windowAPI 参考open in new window

!pip install openai deeplake tiktoken
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import DeepLake
import os
import getpass

os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')
embeddings = OpenAIEmbeddings()

OpenAI API Key: ········

from langchain.document_loaders import TextLoader

loader = TextLoader('../../../state_of_the_union.txt')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()

请问您希望在 ./deeplake/ 目录下创建一个本地数据集,然后运行相似性搜索是吗?Deeplake+LangChain 集成在幕后使用 Deep Lake 数据集,因此 datasetvector store 可以互换使用。如果您想在自己的云端或 Deep Lake 存储中创建数据集,请根据需要调整路径open in new window

db = DeepLake(dataset_path="./my_deeplake/", embedding_function=embeddings)
db.add_documents(docs)
# or shorter
# db = DeepLake.from_documents(docs, dataset_path="./my_deeplake/", embedding=embeddings, overwrite=True)
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
/home/leo/.local/lib/python3.10/site-packages/deeplake/util/check_latest_version.py:32: UserWarning: A newer version of deeplake (3.3.2) is available. It's recommended that you update to the latest version using `pip install -U deeplake`.
  warnings.warn(


./my_deeplake/ loaded successfully.


Evaluating ingest: 100%|██████████████████████████████████████| 1/1 [00:07<00:00


Dataset(path='./my_deeplake/', tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype     shape      dtype  compression
  -------   -------   -------    -------  ------- 
  embedding  generic  (42, 1536)  float32   None   
    ids      text     (42, 1)      str     None   
  metadata    json     (42, 1)      str     None   
    text      text     (42, 1)      str     None   
print(docs[0].page_content)
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

稍后,您可以重新加载数据集而无需重新计算嵌入

db = DeepLake(dataset_path="./my_deeplake/", embedding_function=embeddings, read_only=True)
docs = db.similarity_search(query)
./my_deeplake/ loaded successfully.



Deep Lake Dataset in ./my_deeplake/ already exists, loading from the storage


Dataset(path='./my_deeplake/', read_only=True, tensors=['embedding', 'ids', 'metadata', 'text'])

tensor     htype     shape      dtype  compression
-------   -------   -------    -------  ------- 
embedding  generic  (42, 1536)  float32   None   
  ids      text     (42, 1)      str     None   
metadata    json     (42, 1)      str     None   
  text      text     (42, 1)      str     None   

Deep Lake 目前是单写多读的。设置 read_only=True 可以避免获取写入锁。

检索与问答

Deep Lake 提供了检索和问答的功能。您可以使用 Deep Lake 进行以下操作:

from langchain.chains import RetrievalQA
from langchain.llms import OpenAIChat

qa = RetrievalQA.from_chain_type(llm=OpenAIChat(model='gpt-3.5-turbo'), chain_type='stuff', retriever=db.as_retriever())
  /home/leo/.local/lib/python3.10/site-packages/langchain/llms/openai.py:624: UserWarning: You are trying to use a chat model. This way of initializing it is no longer supported. Instead, please use: `from langchain.chat_models import ChatOpenAI`
    warnings.warn(
query = 'What did the president say about Ketanji Brown Jackson'
qa.run(query)

'The president nominated Ketanji Brown Jackson to serve on the United States Supreme Court. He described her as a former top litigator in private practice, a former federal public defender, a consensus builder, and from a family of public school educators and police officers. He also mentioned that she has received broad support from various groups since being nominated.'

基于属性的元数据筛选

import random

for d in docs:
    d.metadata['year'] = random.randint(2012, 2014)

db = DeepLake.from_documents(docs, embeddings, dataset_path="./my_deeplake/", overwrite=True)
./my_deeplake/ loaded successfully.
Evaluating ingest: 100%|██████████| 1/1 [00:04<00:00
Dataset(path='./my_deeplake/', tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype     shape     dtype  compression
  -------   -------   -------   -------  ------- 
  embedding  generic  (4, 1536)  float32   None   
    ids      text     (4, 1)      str     None   
  metadata    json     (4, 1)      str     None   
    text      text     (4, 1)      str     None   
db.similarity_search('What did the president say about Ketanji Brown Jackson', filter={'year': 2013})
100%|██████████| 4/4 [00:00<00:00, 1080.24it/s]

[Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../../state_of_the_union.txt', 'year': 2013}), Document(page_content='And for our LGBTQ+ Americans, let’s finally get the bipartisan Equality Act to my desk. The onslaught of state laws targeting transgender Americans and their families is wrong. \n\nAs I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential. \n\nWhile it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice. \n\nAnd soon, we’ll strengthen the Violence Against Women Act that I first wrote three decades ago. It is important for us to show the nation that we can come together and do big things. \n\nSo tonight I’m offering a Unity Agenda for the Nation. Four big things we can do together. \n\nFirst, beat the opioid epidemic.', metadata={'source': '../../../state_of_the_union.txt', 'year': 2013})]

选择距离函数

在 Deep Lake 中,您可以根据需求选择不同的距离函数来计算相似性或距离。以下是一些常用的距离函数选项:

db.similarity_search('What did the president say about Ketanji Brown Jackson?', distance_metric='cos')

[Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../../state_of_the_union.txt', 'year': 2013}), Document(page_content='A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n\nAnd if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. \n\nWe can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling. \n\nWe’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers. \n\nWe’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster. \n\nWe’re securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.', metadata={'source': '../../../state_of_the_union.txt', 'year': 2012}), Document(page_content='And for our LGBTQ+ Americans, let’s finally get the bipartisan Equality Act to my desk. The onslaught of state laws targeting transgender Americans and their families is wrong. \n\nAs I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential. \n\nWhile it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice. \n\nAnd soon, we’ll strengthen the Violence Against Women Act that I first wrote three decades ago. It is important for us to show the nation that we can come together and do big things. \n\nSo tonight I’m offering a Unity Agenda for the Nation. Four big things we can do together. \n\nFirst, beat the opioid epidemic.', metadata={'source': '../../../state_of_the_union.txt', 'year': 2013}), Document(page_content='Tonight, I’m announcing a crackdown on these companies overcharging American businesses and consumers. \n\nAnd as Wall Street firms take over more nursing homes, quality in those homes has gone down and costs have gone up. \n\nThat ends on my watch. \n\nMedicare is going to set higher standards for nursing homes and make sure your loved ones get the care they deserve and expect. \n\nWe’ll also cut costs and keep the economy going strong by giving workers a fair shot, provide more training and apprenticeships, hire them based on their skills not degrees. \n\nLet’s pass the Paycheck Fairness Act and paid leave. \n\nRaise the minimum wage to $15 an hour and extend the Child Tax Credit, so no one has to raise a family in poverty. \n\nLet’s increase Pell Grants and increase our historic support of HBCUs, and invest in what Jill—our First Lady who teaches full-time—calls America’s best-kept secret: community colleges.', metadata={'source': '../../../state_of_the_union.txt', 'year': 2012})]

最大边际相关性(Maximal Marginal Relevance,MMR)的应用

使用最大边际相关性(MMR)

db.max_marginal_relevance_search('What did the president say about Ketanji Brown Jackson?')

[Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../../state_of_the_union.txt', 'year': 2013}), Document(page_content='Tonight, I’m announcing a crackdown on these companies overcharging American businesses and consumers. \n\nAnd as Wall Street firms take over more nursing homes, quality in those homes has gone down and costs have gone up. \n\nThat ends on my watch. \n\nMedicare is going to set higher standards for nursing homes and make sure your loved ones get the care they deserve and expect. \n\nWe’ll also cut costs and keep the economy going strong by giving workers a fair shot, provide more training and apprenticeships, hire them based on their skills not degrees. \n\nLet’s pass the Paycheck Fairness Act and paid leave. \n\nRaise the minimum wage to $15 an hour and extend the Child Tax Credit, so no one has to raise a family in poverty. \n\nLet’s increase Pell Grants and increase our historic support of HBCUs, and invest in what Jill—our First Lady who teaches full-time—calls America’s best-kept secret: community colleges.', metadata={'source': '../../../state_of_the_union.txt', 'year': 2012}), Document(page_content='A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n\nAnd if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. \n\nWe can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling. \n\nWe’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers. \n\nWe’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster. \n\nWe’re securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.', metadata={'source': '../../../state_of_the_union.txt', 'year': 2012}), Document(page_content='And for our LGBTQ+ Americans, let’s finally get the bipartisan Equality Act to my desk. The onslaught of state laws targeting transgender Americans and their families is wrong. \n\nAs I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential. \n\nWhile it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice. \n\nAnd soon, we’ll strengthen the Violence Against Women Act that I first wrote three decades ago. It is important for us to show the nation that we can come together and do big things. \n\nSo tonight I’m offering a Unity Agenda for the Nation. Four big things we can do together. \n\nFirst, beat the opioid epidemic.', metadata={'source': '../../../state_of_the_union.txt', 'year': 2013})]

删除数据集

db.delete_dataset()

如果删除失败,您还可以使用强制删除

DeepLake.force_delete_by_path("./my_deeplake")

云端(Activeloop、AWS、GCS 等)或内存中的 Deep Lake 数据集

默认情况下,Deep Lake 数据集存储在本地。如果您想将它们存储在内存中、Deep Lake 托管的数据库中或任何对象存储中,您可以提供相应的数据集路径open in new window。您可以从app.activeloop.aiopen in new window获取用户令牌。

os.environ['ACTIVELOOP_TOKEN'] = getpass.getpass('Activeloop Token:')
# Embed and store the texts
username = "<username>" # your username on app.activeloop.ai   
dataset_path = f"hub://{username}/langchain_test" # could be also ./local/path (much faster locally), s3://bucket/path/to/dataset, gcs://path/to/dataset, etc.

embedding = OpenAIEmbeddings()
db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings, overwrite=True)
db.add_documents(docs)
Your Deep Lake dataset has been successfully created!
The dataset is private so make sure you are logged in!
This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/davitbun/langchain_test
hub://davitbun/langchain_test loaded successfully.
Evaluating ingest: 100%|██████████| 1/1 [00:14<00:00
Dataset(path='hub://davitbun/langchain_test', tensors=['embedding', 'ids', 'metadata', 'text'])

tensor     htype     shape     dtype  compression
-------   -------   -------   -------  ------- 
embedding  generic  (4, 1536)  float32   None   
  ids      text     (4, 1)      str     None   
metadata    json     (4, 1)      str     None   
  text      text     (4, 1)      str     None   
['d6d6ccb4-e187-11ed-b66d-41c5f7b85421',
'd6d6ccb5-e187-11ed-b66d-41c5f7b85421',
'd6d6ccb6-e187-11ed-b66d-41c5f7b85421',
'd6d6ccb7-e187-11ed-b66d-41c5f7b85421']
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
print(docs[0].page_content)
    Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 
    
    Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 
    
    One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 
    
    And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

在 AWS S3 上创建数据集

dataset_path = f"s3://BUCKET/langchain_test" # could be also ./local/path (much faster locally), hub://bucket/path/to/dataset, gcs://path/to/dataset, etc.

embedding = OpenAIEmbeddings()
db = DeepLake.from_documents(docs, dataset_path=dataset_path, embedding=embeddings, overwrite=True, creds = {
   'aws_access_key_id': os.environ['AWS_ACCESS_KEY_ID'], 
   'aws_secret_access_key':  os.environ['AWS_SECRET_ACCESS_KEY'], 
   'aws_session_token': os.environ['AWS_SESSION_TOKEN'], # Optional
})
s3://hub-2.0-datasets-n/langchain_test loaded successfully.
Evaluating ingest: 100%|██████████| 1/1 [00:10<00:00
\
Dataset(path='s3://hub-2.0-datasets-n/langchain_test', tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype     shape     dtype  compression
  -------   -------   -------   -------  ------- 
  embedding  generic  (4, 1536)  float32   None   
    ids      text     (4, 1)      str     None   
  metadata    json     (4, 1)      str     None   
    text      text     (4, 1)      str     None   

Deep Lake API

您可以通过 db.ds 访问 Deep Lake 数据集。

# get structure of the dataset
db.ds.summary()
Dataset(path='hub://davitbun/langchain_test', tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype     shape     dtype  compression
  -------   -------   -------   -------  ------- 
  embedding  generic  (4, 1536)  float32   None   
    ids      text     (4, 1)      str     None   
  metadata    json     (4, 1)      str     None   
    text      text     (4, 1)      str     None   
# get embeddings numpy array
embeds = db.ds.embedding.numpy()

将本地数据集传输到云端

复制已创建的数据集到云端。您也可以从云端传输到本地。

import deeplake
username = "davitbun" # your username on app.activeloop.ai   
source = f"hub://{username}/langchain_test" # could be local, s3, gcs, etc.
destination = f"hub://{username}/langchain_test_copy" # could be local, s3, gcs, etc.

deeplake.deepcopy(src=source, dest=destination, overwrite=True)

Copying dataset: 100%|██████████| 56/56 [00:38<00:00
This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/davitbun/langchain_test_copy
Your Deep Lake dataset has been successfully created!
The dataset is private so make sure you are logged in!
Dataset(path='hub://davitbun/langchain_test_copy', tensors=['embedding', 'ids', 'metadata', 'text'])
db = DeepLake(dataset_path=destination, embedding_function=embeddings)
db.add_documents(docs)
This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/davitbun/langchain_test_copy
/
hub://davitbun/langchain_test_copy loaded successfully.
Deep Lake Dataset in hub://davitbun/langchain_test_copy already exists, loading from the storage
Dataset(path='hub://davitbun/langchain_test_copy', tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype     shape     dtype  compression
  -------   -------   -------   -------  ------- 
  embedding  generic  (4, 1536)  float32   None   
    ids      text     (4, 1)      str     None   
  metadata    json     (4, 1)      str     None   
    text      text     (4, 1)      str     None   
Evaluating ingest: 100%|██████████| 1/1 [00:31<00:00
-
Dataset(path='hub://davitbun/langchain_test_copy', tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype     shape     dtype  compression
  -------   -------   -------   -------  ------- 
  embedding  generic  (8, 1536)  float32   None   
    ids      text     (8, 1)      str     None   
  metadata    json     (8, 1)      str     None   
    text      text     (8, 1)      str     None   
['ad42f3fe-e188-11ed-b66d-41c5f7b85421',
'ad42f3ff-e188-11ed-b66d-41c5f7b85421',
'ad42f400-e188-11ed-b66d-41c5f7b85421',
'ad42f401-e188-11ed-b66d-41c5f7b85421']
Last Updated:
Contributors: 刘强