Tiktoken

tiktoken is a fast BPE tokenizer created by OpenAI.

  1. How the text is split: by tiktoken tokens.
  2. How the chunk size is measured: by the number of tiktoken tokens.
#!pip install tiktoken
# This is a long document we can split up.
with open('../../../state_of_the_union.txt') as f:
    state_of_the_union = f.read()
from langchain.text_splitter import TokenTextSplitter

# chunk_size and chunk_overlap are counted in tiktoken tokens, not characters.
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
    Madam Speaker, Madam Vice President, our
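To make the two points above concrete, here is a minimal sketch of token-window chunking: encode the text into tokens, slice the token list into windows of `chunk_size`, and decode each window back into text. It is an illustration only, not LangChain's or tiktoken's actual code. The function `split_by_tokens` and the whitespace-based `whitespace_encode` / `whitespace_decode` helpers are hypothetical names introduced here; the whitespace tokenizer merely stands in for a real BPE encoding so that the example runs without extra dependencies.

```python
from typing import Callable, List


def split_by_tokens(
    text: str,
    chunk_size: int,
    chunk_overlap: int,
    encode: Callable[[str], List[str]],
    decode: Callable[[List[str]], str],
) -> List[str]:
    """Split text into chunks of at most chunk_size tokens (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= overlap < chunk_size")

    tokens = encode(text)
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(decode(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the token list
    return chunks


# Assumption: a whitespace tokenizer stands in for a BPE encoding.
def whitespace_encode(text: str) -> List[str]:
    return text.split()


def whitespace_decode(tokens: List[str]) -> str:
    return " ".join(tokens)


sample = "one two three four five six seven eight"
no_overlap = split_by_tokens(sample, 3, 0, whitespace_encode, whitespace_decode)
with_overlap = split_by_tokens(sample, 3, 1, whitespace_encode, whitespace_decode)
print(no_overlap)
print(with_overlap)
```

With `chunk_overlap=1`, the last token of each chunk is repeated as the first token of the next one. A real BPE tokenizer would produce subword tokens, so the chunk boundaries could fall inside words rather than at spaces.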
Contributors: 刘强