Tokenization | LM Studio Docs - LM Studio App

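Models use a tokenizer to convert text internally into "tokens" that they can process more easily. LM Studio exposes this tokenizer so you can use it directly.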

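Tokenize

You can use the SDK to tokenize a string with a loaded LLM or embedding model. In the example below, the LLM reference can be replaced with an embedding model reference with no other changes.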

import lmstudio as lms

model = lms.llm()

tokens = model.tokenize("Hello, world!")

print(tokens) # Array of token IDs.

Count tokens

If you only care about the number of tokens, check the length of the resulting array.

token_count = len(model.tokenize("Hello, world!"))
print("Token count:", token_count)
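The same length check extends to several strings at once. The following is a minimal sketch, not part of the SDK: `SupportsTokenize`, `WhitespaceTokenizer`, and `count_tokens` are hypothetical names introduced here for illustration. The stand-in tokenizer only splits on whitespace, so the sketch runs without a loaded model. Real token counts from a model will differ.

```python
from typing import Protocol, Sequence


class SupportsTokenize(Protocol):
    """Anything exposing tokenize(text) that returns a sized sequence of IDs."""

    def tokenize(self, text: str) -> Sequence[int]: ...


class WhitespaceTokenizer:
    """Hypothetical stand-in for a loaded model; NOT a real tokenizer."""

    def __init__(self) -> None:
        self._ids: dict[str, int] = {}

    def tokenize(self, text: str) -> list[int]:
        # Give each distinct whitespace-separated word a stable integer ID.
        return [self._ids.setdefault(word, len(self._ids)) for word in text.split()]


def count_tokens(tokenizer: SupportsTokenize, texts: Sequence[str]) -> dict[str, int]:
    """Map each text to its token count (the length of the tokenize() result)."""
    return {text: len(tokenizer.tokenize(text)) for text in texts}


counts = count_tokens(WhitespaceTokenizer(), ["Hello, world!", "One two three four"])
print(counts)
```

Assuming a loaded model's `tokenize` returns a sized sequence for a single string, as the `len(...)` call above suggests, the model could be passed in place of the stand-in.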

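Example: count context

You can determine whether a given conversation fits into a model's context as follows:

Convert the conversation to a string using the prompt template.
Count the number of tokens in the string.
Compare the token count to the model's context length.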

import lmstudio as lms

def does_chat_fit_in_context(model: lms.LLM, chat: lms.Chat) -> bool:
    # Convert the conversation to a string using the prompt template.
    formatted = model.apply_prompt_template(chat)
    # Count the number of tokens in the string.
    token_count = len(model.tokenize(formatted))
    # Get the current loaded context length of the model
    context_length = model.get_context_length()
    return token_count < context_length

model = lms.llm()

chat = lms.Chat.from_history({
    "messages": [
        { "role": "user", "content": "What is the meaning of life." },
        { "role": "assistant", "content": "The meaning of life is..." },
        # ... More messages
    ]
})

print("Fits in context:", does_chat_fit_in_context(model, chat))
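When a conversation does not fit, one common response is to drop the oldest messages until it does. The sketch below illustrates that idea under stated assumptions. `trim_to_fit` and `rough_count` are hypothetical helpers, not SDK functions, and `rough_count` merely counts whitespace-separated words so that the example runs without a loaded model.

```python
from typing import Callable

Message = dict[str, str]


def trim_to_fit(
    messages: list[Message],
    count_tokens: Callable[[list[Message]], int],
    context_length: int,
) -> list[Message]:
    """Drop the oldest messages until the token count is below context_length."""
    trimmed = list(messages)  # Copy so the caller's list is left untouched.
    while trimmed and count_tokens(trimmed) >= context_length:
        trimmed.pop(0)  # Remove the oldest message first.
    return trimmed


def rough_count(messages: list[Message]) -> int:
    """Hypothetical counter: one token per whitespace-separated word."""
    return sum(len(m["content"].split()) for m in messages)


history = [
    {"role": "user", "content": "What is the meaning of life."},
    {"role": "assistant", "content": "The meaning of life is..."},
    {"role": "user", "content": "Can you say more?"},
]

kept = trim_to_fit(history, rough_count, context_length=10)
print([m["content"] for m in kept])
```

With a loaded model, the counter could instead apply the prompt template and tokenize the result, as `does_chat_fit_in_context` does. The comparison mirrors that function's strict `<` check, so trimming stops as soon as the count is below the context length.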
