Hugging Face tokenizers

Hugging Face provides many tokenizers.

We use GPT2TokenizerFast, one of the Hugging Face tokenizers, to count the length of text in tokens.

  1. How the text is split: on the character passed in.
  2. How the chunk size is measured: by the number of tokens, as counted by the Hugging Face tokenizer.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
# This is a long document we can split up.
with open('../../../state_of_the_union.txt') as f:
    state_of_the_union = f.read()
from langchain.text_splitter import CharacterTextSplitter

# Build a splitter whose chunk_size is measured in GPT-2 tokens, not characters.
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(tokenizer, chunk_size=100, chunk_overlap=0)
texts = text_splitter.split_text(state_of_the_union)
# Print the first chunk.
print(texts[0])
    Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  
    
    Last year COVID-19 kept us apart. This year we are finally together again. 
    
    Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 
    
    With a duty to one another to the American people to the Constitution.
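
To make the two points above concrete (split on a character, measure chunks in tokens), here is a simplified sketch of the idea. It is not LangChain's actual implementation: the function names `split_by_separator` and `whitespace_token_count` are hypothetical, and the token counter is a whitespace-based stand-in so the example runs without downloading a model.

```python
from typing import Callable, List


def whitespace_token_count(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words."""
    return len(text.split())


def split_by_separator(
    text: str,
    separator: str,
    chunk_size: int,
    length_function: Callable[[str], int],
) -> List[str]:
    """Split on `separator`, then greedily merge pieces into chunks.

    A chunk is closed as soon as adding the next piece would push its
    measured length above `chunk_size`. A single piece that is already
    longer than `chunk_size` is kept whole, because text is only ever
    cut at the separator.
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and length_function(candidate) > chunk_size:
            chunks.append(separator.join(current))
            current = [piece]
        else:
            current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "one two three\n\nfour five\n\nsix seven eight nine\n\nten"
chunks = split_by_separator(
    sample, "\n\n", chunk_size=5, length_function=whitespace_token_count
)
for chunk in chunks:
    print(whitespace_token_count(chunk), repr(chunk))
```

With a real tokenizer, the length function could plausibly be `lambda t: len(tokenizer.encode(t))`, where `tokenizer` is the `GPT2TokenizerFast` instance loaded earlier; the chunk boundaries would then depend on GPT-2 token counts rather than word counts.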
Contributors: 刘强