lmstudio.js Code Examples - SDK (TypeScript) | LM Studio Docs

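Note

The content below has not yet been updated to reflect the changes in [email protected] and still references the 0.0.12 API.

We are updating the docs and headings in the open. Thank you for your patience 🐻👾🙏.

Below are examples of using LM Studio's TypeScript client SDK to perform operations such as loading models, unloading models, and generating text with them.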

Loading an LLM and Generating with It

This example loads the model lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF and uses it to predict text.

import { LMStudioClient } from "@lmstudio/sdk";

async function main() {
  const client = new LMStudioClient();

  // Load a model
  const llama3 = await client.llm.load("lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF", {
    config: { gpuOffload: "max" },
  });

  // Create a text completion prediction
  const prediction = llama3.complete("The meaning of life is");

  // Stream the response
  for await (const text of prediction) {
    process.stdout.write(text);
  }
}

main();

Pro Tip

process.stdout.write is a Node.js-specific function that prints text without appending a newline.

In the browser, you might do something like this instead:

// Get the element where you want to display the output
const outputElement = document.getElementById("output");

for await (const text of prediction) {
  outputElement.textContent += text;
}
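Both loops above consume the prediction as an async iterable of text fragments. As a minimal sketch of that pattern, the helper below gathers the fragments into one string while still letting you display each one as it arrives. The names collectText, onFragment, and fakePrediction are hypothetical and not part of the SDK; the fake generator only stands in for a prediction so the sketch runs without LM Studio.

```typescript
// Collects every fragment from an async iterable of strings into one string.
// `onFragment` is a hypothetical hook for displaying text as it arrives.
async function collectText(
  stream: AsyncIterable<string>,
  onFragment: (fragment: string) => void = () => {},
): Promise<string> {
  let fullText = "";
  for await (const fragment of stream) {
    onFragment(fragment);
    fullText += fragment;
  }
  return fullText;
}

// Stand-in for a prediction stream (assumption: not an SDK object).
async function* fakePrediction(): AsyncGenerator<string> {
  yield "Hello";
  yield ", ";
  yield "world";
}
```

In Node.js you could pass a callback that calls process.stdout.write; in the browser, one that appends to an element's textContent.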

Using a Non-Default LM Studio Server Port

This example shows how to connect to LM Studio running on a different port (for example, 8080).

import { LMStudioClient } from "@lmstudio/sdk";

async function main() {
  const client = new LMStudioClient({
    baseUrl: "ws://127.0.0.1:8080",
  });

  // client.llm.load(...);
}

main();
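If the port comes from configuration or user input, it can help to validate it before building the URL. The following is a small sketch under that assumption; buildBaseUrl is a hypothetical helper, not an SDK function, and it only reproduces the ws://host:port shape used in the example above.

```typescript
// Builds a WebSocket base URL of the form used above, e.g. "ws://127.0.0.1:8080".
// Hypothetical helper: it only formats and validates, it does not connect.
function buildBaseUrl(port: number, host: string = "127.0.0.1"): string {
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new RangeError(`Invalid port: ${port}`);
  }
  return `ws://${host}:${port}`;
}
```

The result could then be passed as the baseUrl option when constructing the client.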

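Loading a Model and Keeping It Loaded After the Client Exits (Daemon Mode)

By default, when your client disconnects from LM Studio, all models loaded by that client are unloaded. You can prevent this by setting the noHup option to true.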

await client.llm.load("lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF", {
  config: { gpuOffload: "max" },
  noHup: true,
});

// The model stays loaded even after the client disconnects

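Giving a Loaded Model a Friendly Name

When loading a model, you can set an identifier for it. You can use this identifier to refer to the model later.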

await client.llm.load("lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF", {
  config: { gpuOffload: "max" },
  identifier: "my-model",
});

// You can refer to the model later using the identifier
const myModel = await client.llm.get("my-model");
// myModel.complete(...);

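Loading a Model with a Custom Configuration

By default, a model's load configuration comes from the preset associated with that model (which can be changed on the "My Models" page in LM Studio). Settings passed in the config option override the preset.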

const llama3 = await client.llm.load("lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF", {
  config: {
    contextLength: 1024,
    gpuOffload: 0.5, // Offloads 50% of the computation to the GPU
  },
});

// llama3.complete(...);

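Loading a Model with a Specific Preset

A preset determines a model's default load configuration and default inference configuration. By default, the preset associated with the model is used (this can be changed on the "My Models" page in LM Studio). To use a different preset, specify the preset option.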

const llama3 = await client.llm.load("lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF", {
  config: { gpuOffload: "max" }, // Overrides the preset
  preset: "My ChatML",
});

Custom Loading Progress

You can track a model's loading progress by providing an onProgress callback.

const llama3 = await client.llm.load("lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF", {
  config: { gpuOffload: "max" },
  verbose: false, // Disables the default progress logging
  onProgress: (progress) => {
    console.log(`Progress: ${(progress * 100).toFixed(1)}%`);
  },
});
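If the callback fires often, logging every update can flood the console. The sketch below shows one way to report only when progress has advanced by a chosen step. It assumes, as in the example above, that progress is a number between 0 and 1; makeProgressReporter and its parameters are hypothetical names, not part of the SDK.

```typescript
// Returns a callback that reports progress only after it has advanced by at
// least `stepPercent` since the last report, plus once on reaching 100%.
// Assumption: `progress` is a fraction in [0, 1], as in the example above.
function makeProgressReporter(
  stepPercent: number,
  report: (line: string) => void,
): (progress: number) => void {
  let lastReported = -Infinity;
  return (progress: number) => {
    const percent = Math.min(Math.max(progress, 0), 1) * 100;
    const advancedEnough = percent - lastReported >= stepPercent;
    const justFinished = percent === 100 && lastReported < 100;
    if (advancedEnough || justFinished) {
      lastReported = percent;
      report(`Progress: ${percent.toFixed(1)}%`);
    }
  };
}
```

A reporter created with, say, makeProgressReporter(10, console.log) could be passed as the onProgress option.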

Listing All Models That Can Be Loaded

To find all models that are available to load, use the listDownloadedModels method on the system object.

const downloadedModels = await client.system.listDownloadedModels();
const downloadedLLMs = downloadedModels.filter((model) => model.type === "llm");

// Load the first model
const model = await client.llm.load(downloadedLLMs[0].path);
// model.complete(...);
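The snippet above indexes downloadedLLMs[0] directly, which fails if no LLM has been downloaded. A minimal sketch of a guarded version follows. The DownloadedModelInfo shape is an assumption that includes only the type and path fields used above, and firstLLMPath is a hypothetical helper.

```typescript
// Assumed minimal shape: only the two fields the example above relies on.
interface DownloadedModelInfo {
  type: string;
  path: string;
}

// Returns the path of the first downloaded LLM, or undefined if there is none.
function firstLLMPath(models: readonly DownloadedModelInfo[]): string | undefined {
  return models.find((model) => model.type === "llm")?.path;
}
```

Checking the result for undefined before calling client.llm.load lets you show a clear message instead of hitting a TypeError.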

Canceling a Load

You can cancel a load by using an AbortController.

const controller = new AbortController();

try {
  const llama3 = await client.llm.load("lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF", {
    signal: controller.signal,
  });
  // llama3.complete(...);
} catch (error) {
  console.error(error);
}

// Somewhere else in your code:
controller.abort();

Info

AbortController is a standard JavaScript API that lets you cancel asynchronous operations. It is supported in modern browsers and in Node.js. For more information, see the MDN Web Docs.
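To illustrate how an AbortSignal cancels an asynchronous operation in general, here is a self-contained sketch that uses only standard APIs and does not involve LM Studio. abortableDelay is a hypothetical function written for this illustration; the SDK's own handling of the signal option may differ.

```typescript
// Resolves with "done" after `ms` milliseconds, or rejects if `signal` aborts first.
function abortableDelay(ms: number, signal: AbortSignal): Promise<string> {
  return new Promise((resolve, reject) => {
    if (signal.aborted) {
      reject(new Error("aborted"));
      return;
    }
    const onAbort = () => {
      clearTimeout(timer);
      reject(new Error("aborted"));
    };
    const timer = setTimeout(() => {
      // Finished normally: stop listening so the handler cannot fire later.
      signal.removeEventListener("abort", onAbort);
      resolve("done");
    }, ms);
    signal.addEventListener("abort", onAbort, { once: true });
  });
}
```

Calling controller.abort() while the delay is pending rejects the promise, which is the same try/catch flow shown in the load example above.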

Unloading a Model

You can unload a model by calling the unload method.

const llama3 = await client.llm.load("lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF", {
  identifier: "my-model",
});

// ...Do stuff...

await client.llm.unload("my-model");

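Note that by default, all models loaded by a client are unloaded when that client disconnects. You therefore do not need to unload them manually unless you want precise control over a model's lifecycle.

Pro Tip

If you want a model to stay loaded after the client disconnects, set the noHup option to true when loading it.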

Using a Specific Loaded Model

To look up a loaded model by its identifier, use the following:

const myModel = await client.llm.get({ identifier: "my-model" });
// Or just
const myModel = await client.llm.get("my-model");

// myModel.complete(...);

To look up a loaded model by its path, use the following:

// Matches any quantization
const llama3 = await client.llm.get({ path: "lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF" });

// Or if a specific quantization is desired:
const llama3 = await client.llm.get({
  path: "lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf",
});

// llama3.complete(...);

Using Any Loaded Model

If you do not need a specific model and just want to use any loaded model, pass an empty object to client.llm.get.

const anyModel = await client.llm.get({});
// anyModel.complete(...);

Listing All Loaded Models

To list all loaded models, use the client.llm.listLoaded method.

const loadedModels = await client.llm.listLoaded();

if (loadedModels.length === 0) {
  throw new Error("No models loaded");
}

// Use the first one
const firstModel = await client.llm.get({ identifier: loadedModels[0].identifier });
// firstModel.complete(...);

Text Completion

To perform text completion, use the complete method.

const prediction = model.complete("The meaning of life is");

for await (const text of prediction) {
  process.stdout.write(text);
}

By default, predictions use the inference parameters from the preset. You can override them like this:

const prediction = anyModel.complete("Meaning of life is", {
  contextOverflowPolicy: "stopAtLimit",
  maxPredictedTokens: 100,
  prePrompt: "Some pre-prompt",
  stopStrings: ["\n"],
  temperature: 0.7,
});

// ...Do stuff with the prediction...

Conversation

To hold a conversation, use the respond method.

const prediction = anyModel.respond([
  { role: "system", content: "Answer the following questions." },
  { role: "user", content: "What is the meaning of life?" },
]);

for await (const text of prediction) {
  process.stdout.write(text);
}

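Likewise, you can override the inference parameters for a conversation (note that the available options differ from those for text completion).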

const prediction = anyModel.respond(
  [
    { role: "system", content: "Answer the following questions." },
    { role: "user", content: "What is the meaning of life?" },
  ],
  {
    contextOverflowPolicy: "stopAtLimit",
    maxPredictedTokens: 100,
    stopStrings: ["\n"],
    temperature: 0.7,
    inputPrefix: "Q: ",
    inputSuffix: "\nA:",
  },
);

// ...Do stuff with the prediction...

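Note

Large language models (LLMs) are *stateless*. They do not remember or retain information from previous inputs. Therefore, always provide the full history/context when making a prediction with an LLM.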
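Because the model is stateless, a multi-turn conversation means resending an ever-growing message array. The sketch below shows one way to keep that history. The ChatMessage shape mirrors the objects passed to respond above; the "assistant" role and the withTurn helper are assumptions introduced for illustration.

```typescript
// Message shape mirroring the respond examples above.
// Assumption: replies from the model are stored under the "assistant" role.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Returns a new history with one completed turn appended; the input is not mutated.
function withTurn(
  history: readonly ChatMessage[],
  userText: string,
  assistantText: string,
): ChatMessage[] {
  return [
    ...history,
    { role: "user", content: userText },
    { role: "assistant", content: assistantText },
  ];
}
```

After each reply, you would append the turn and pass the whole array, followed by the next user message, to the next respond call.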

Getting Prediction Statistics

To get prediction statistics, await the prediction object to obtain a PredictionResult. The statistics are available through its stats property.

const prediction = model.complete("The meaning of life is");

for await (const text of prediction) {
  process.stdout.write(text);
}

const { stats } = await prediction;
console.log(stats);

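Info

Once you have consumed the prediction stream, awaiting the prediction object does not cause any additional waiting, because the result is cached inside the prediction object.

On the other hand, if you only care about the final result, you do not need to iterate over the stream. Instead, you can await the prediction object directly to get the final result.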

const prediction = model.complete("The meaning of life is");
const result = await prediction;
const content = result.content;
const stats = result.stats;

// Or just:

const { content, stats } = await model.complete("The meaning of life is");

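Generating JSON (Structured Output)

LM Studio supports structured prediction, which forces the model to generate content that conforms to a specific structure. To enable it, set the structured field. It works with both the complete and respond methods.

Here is an example of how to use structured prediction: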

const prediction = model.complete("Here is a joke in JSON:", {
  maxPredictedTokens: 100,
  structured: { type: "json" },
});

const result = await prediction;
try {
  // Although the LLM is guaranteed to only produce valid JSON, when it is interrupted, the
  // partial result might not be. Always check for errors. (See below)
  const parsed = JSON.parse(result.content);
  console.info(parsed);
} catch (e) {
  console.error(e);
}

Sometimes, just any JSON is not enough. You may want to enforce a specific JSON schema. To do so, provide a JSON schema in the structured field. For more information about JSON Schema, see json-schema.org.

const schema = {
  type: "object",
  properties: {
    setup: { type: "string" },
    punchline: { type: "string" },
  },
  required: ["setup", "punchline"],
};

const prediction = llama3.complete("Here is a joke in JSON:", {
  maxPredictedTokens: 100,
  structured: { type: "json", jsonSchema: schema },
});

const result = await prediction;
try {
  const parsed = JSON.parse(result.content);
  console.info("The setup is", parsed.setup);
  console.info("The punchline is", parsed.punchline);
} catch (e) {
  console.error(e);
}

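Note

Although the model is forced to generate predictions that conform to the specified structure, a prediction may be interrupted (for example, if the user stops it). When that happens, the partial result may not conform to the specified structure. Always check the prediction result before using it, for example by wrapping JSON.parse in a try-catch block.
In some cases, the model may get stuck. For example, when forced to generate valid JSON, it may produce an opening brace { but never the closing brace }. The prediction then continues until the context length limit is reached, which may take a long time. It is therefore recommended to always set a maxPredictedTokens limit. Hitting that limit can itself cut the output short, which is another reason to check the result as described in the point above.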
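As a sketch of the checking recommended above, the helper below parses the prediction content and also verifies the two fields from the joke schema, returning null for anything truncated or malformed. parseJoke and the Joke interface are hypothetical names written for this illustration.

```typescript
// Shape matching the joke schema used above.
interface Joke {
  setup: string;
  punchline: string;
}

// Parses `content` and checks it against the Joke shape.
// Returns null for invalid JSON (e.g. a truncated result) or missing fields.
function parseJoke(content: string): Joke | null {
  let value: unknown;
  try {
    value = JSON.parse(content);
  } catch (_error) {
    return null;
  }
  if (typeof value !== "object" || value === null) {
    return null;
  }
  const candidate = value as Record<string, unknown>;
  if (typeof candidate.setup !== "string" || typeof candidate.punchline !== "string") {
    return null;
  }
  return { setup: candidate.setup, punchline: candidate.punchline };
}
```

You could call parseJoke(result.content) and treat a null return as a signal to retry or report the failure.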

Canceling a Prediction

You can cancel a prediction by calling the cancel method on the prediction object.

const prediction = model.complete("The meaning of life is");

// ...Do stuff...

prediction.cancel();

When a prediction is canceled, it stops normally, but stopReason is set to "userStopped". You can detect cancellation like this:

for await (const text of prediction) {
  process.stdout.write(text);
}
const { stats } = await prediction;
if (stats.stopReason === "userStopped") {
  console.log("Prediction was canceled by the user");
}
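One common use of cancel is a time limit on a prediction. The following sketch assumes only that the prediction object exposes the cancel method shown above. cancelAfter and the Cancellable interface are hypothetical names, not part of the SDK.

```typescript
// Minimal assumed interface: anything with a cancel() method.
interface Cancellable {
  cancel(): void;
}

// Schedules prediction.cancel() after `ms` milliseconds.
// Returns a function that disarms the timer, for use when the prediction finishes in time.
function cancelAfter(prediction: Cancellable, ms: number): () => void {
  const timer = setTimeout(() => prediction.cancel(), ms);
  return () => clearTimeout(timer);
}
```

You would call the returned function once the stream has been fully consumed, and then check stats.stopReason as shown above to see whether the time limit was hit.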