Migrating from OpenAI to Open LLMs Using TGI's Messages API
Authored by: Andrew Reed
This notebook demonstrates how you can easily transition from OpenAI models to Open LLMs without refactoring any existing code.
Text Generation Inference (TGI) now offers a Messages API, making it directly compatible with the OpenAI Chat Completion API. This means that any existing script that uses OpenAI models (via the OpenAI client library, or third-party tools like LangChain or LlamaIndex) can be swapped out to use any open LLM running on a TGI endpoint!
This allows you to quickly test out, and benefit from, the many advantages that open-source models offer. For example:
Complete control and transparency over models and data
No more worrying about rate limits
The ability to fully customize systems for your specific needs
In this notebook, we'll show you how to:
1. Create an Inference Endpoint to deploy a model with TGI
2. Query the Inference Endpoint with the OpenAI client libraries
3. Integrate the endpoint with LangChain and LlamaIndex workflows
Let's dive in!
Setup
First, we need to install the dependencies and set an HF API key.
!pip install --upgrade -q huggingface_hub langchain langchain-community langchainhub langchain-openai llama-index chromadb bs4 sentence_transformers torch
import os
import getpass
# enter API key
os.environ["HUGGINGFACEHUB_API_TOKEN"] = HF_API_KEY = getpass.getpass()
1. Create an Inference Endpoint
To get started, let's deploy Nous-Hermes-2-Mixtral-8x7B-DPO, a fine-tuned Mixtral model, to Inference Endpoints using TGI.
We can deploy the model in just a few clicks from the UI, or take advantage of the huggingface_hub Python library to programmatically create and manage Inference Endpoints.
Here we'll use the Hub library, specifying an endpoint name and model repository, along with the text-generation task. In this example we use the protected type, so access to the deployed model will require a valid Hugging Face token. We also need to configure the hardware requirements: vendor, region, accelerator, instance type, and size. You can check the list of available resource options using this API call, and view recommended configurations for selected models in the catalog here.
>>> from huggingface_hub import create_inference_endpoint
>>> endpoint = create_inference_endpoint(
... "nous-hermes-2-mixtral-8x7b-demo",
... repository="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
... framework="pytorch",
... task="text-generation",
... accelerator="gpu",
... vendor="aws",
... region="us-east-1",
... type="protected",
... instance_type="p4de",
... instance_size="2xlarge",
... custom_image={
... "health_route": "/health",
... "env": {
... "MAX_INPUT_LENGTH": "4096",
... "MAX_BATCH_PREFILL_TOKENS": "4096",
... "MAX_TOTAL_TOKENS": "32000",
... "MAX_BATCH_TOTAL_TOKENS": "1024000",
... "MODEL_ID": "/repository",
... },
... "url": "ghcr.io/huggingface/text-generation-inference:sha-1734540", # must be >= 1.4.0
... },
... )
>>> endpoint.wait()
>>> print(endpoint.status)
running
The deployment can take a few minutes to spin up. We can use the .wait() utility to block the running thread until the endpoint reaches its final "running" state. Once running, we can confirm its status and take it for a spin in the UI Playground:
Great, we now have a working endpoint!
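If you'd rather sanity-check it from code than from the Playground, the endpoint object also exposes a ready-made InferenceClient. A minimal sketch (the prompt is just an arbitrary example):
# quick programmatic smoke test against the freshly deployed endpoint
print(endpoint.url)
print(endpoint.client.text_generation("The capital of France is", max_new_tokens=5))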
Note: When deploying with huggingface_hub, your endpoint will by default scale to zero after 15 minutes of idle time to optimize cost during periods of inactivity. Check out the Hub Python Library documentation to see all the functionality available for managing the endpoint lifecycle.
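For example, an existing endpoint can be fetched by name and brought back up programmatically. A sketch using the endpoint name we chose above:
from huggingface_hub import get_inference_endpoint

# fetch the endpoint we created earlier by its name
endpoint = get_inference_endpoint("nous-hermes-2-mixtral-8x7b-demo")

# resume it if it was paused, then block until it is running again
endpoint.resume()
endpoint.wait()
print(endpoint.status)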
2. Query the Inference Endpoint with the OpenAI client libraries
As mentioned above, since our model is hosted with TGI it now supports the Messages API, which means we can query it directly with the familiar OpenAI client libraries.
With the Python client
The example below shows how to make this transition using the OpenAI Python library. Simply replace <ENDPOINT_URL> with your endpoint URL (be sure to include the v1/ suffix) and populate the <HF_API_KEY> field with a valid Hugging Face user token. The <ENDPOINT_URL> can be gathered from the Inference Endpoint's UI, or from the endpoint object we created above via endpoint.url.
We can then use the client as usual, passing a list of messages to stream responses from our Inference Endpoint.
>>> from openai import OpenAI
>>> BASE_URL = endpoint.url
>>> # init the client but point it to TGI
>>> client = OpenAI(
... base_url=os.path.join(BASE_URL, "v1/"),
... api_key=HF_API_KEY,
... )
>>> chat_completion = client.chat.completions.create(
... model="tgi",
... messages=[
... {"role": "system", "content": "You are a helpful assistant."},
... {"role": "user", "content": "Why is open-source software important?"},
... ],
... stream=True,
... max_tokens=500,
... )
>>> # iterate and print stream
>>> for message in chat_completion:
... print(message.choices[0].delta.content, end="")
Open-source software is important due to a number of reasons, including: 1. Collaboration: The collaborative nature of open-source software allows developers from around the world to work together, share their ideas and improve the code. This often results in faster progress and better software. 2. Transparency: With open-source software, the code is publicly available, making it easy to see exactly how the software functions, and allowing users to determine if there are any security vulnerabilities. 3. Customization: Being able to access the code also allows users to customize the software to better suit their needs. This makes open-source software incredibly versatile, as users can tweak it to suit their specific use case. 4. Quality: Open-source software is often developed by large communities of dedicated developers, who work together to improve the software. This results in a higher level of quality than might be found in proprietary software. 5. Cost: Open-source software is often provided free of charge, which makes it accessible to a wider range of users. This can be especially important for organizations with limited budgets for software. 6. Shared Benefit: By sharing the code of open-source software, everyone can benefit from the hard work of the developers. This contributes to the overall advancement of technology, as users and developers work together to improve and build upon the software. In summary, open-source software provides a collaborative platform that leads to high-quality, customizable, and transparent software, all available at little or no cost, benefiting both individuals and the technology community as a whole.<|im_end|>
Behind the scenes, TGI's Messages API automatically converts the list of messages into the model's required instruct format using its chat template.
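For intuition, you can reproduce that formatting step locally with the model's tokenizer. This is purely illustrative, since TGI performs it server-side for you:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Why is open-source software important?"},
]

# render the messages into the model's instruct format (ChatML for this model)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)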
Note: Certain OpenAI features, like function calling, are not compatible with TGI. Currently, the Messages API supports the following chat completion parameters: stream, max_new_tokens, frequency_penalty, logprobs, seed, temperature, and top_p.
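As a quick illustration, here is a non-streaming call that exercises a few of those parameters (the values are arbitrary examples):
# non-streaming request using several of the supported parameters
chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the benefits of open-source software in one sentence."},
    ],
    max_tokens=200,
    temperature=0.7,
    top_p=0.95,
    seed=42,
)
print(chat_completion.choices[0].message.content)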
With the JavaScript client
Here is the same streaming example as above, but using the OpenAI JavaScript/TypeScript library.
import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "<ENDPOINT_URL>" + "/v1/", // replace with your endpoint url
  apiKey: "<HF_API_TOKEN>", // replace with your token
});

async function main() {
  const stream = await openai.chat.completions.create({
    model: "tgi",
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "Why is open-source software important?" },
    ],
    stream: true,
    max_tokens: 500,
  });
  for await (const chunk of stream) {
    process.stdout.write(chunk.choices[0]?.delta?.content || "");
  }
}

main();
3. Integrate with LangChain and LlamaIndex
Now let's see how to use this newly created endpoint with popular RAG frameworks like LangChain and LlamaIndex.
How to use with LangChain
To use it in LangChain, simply create an instance of ChatOpenAI and pass your <ENDPOINT_URL> and <HF_API_TOKEN> as follows:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model_name="tgi",
    openai_api_key=HF_API_KEY,
    openai_api_base=os.path.join(BASE_URL, "v1/"),
)
llm.invoke("Why is open-source software important?")
We're able to directly leverage the same ChatOpenAI class that we would have used with the OpenAI models. This allows all our previous code to work with our endpoint by changing just a single line, as shown in the comparison below.
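For comparison, a typical OpenAI-backed setup (shown purely for illustration; it requires an OpenAI API key) differs only in the constructor arguments:
# pointing the same ChatOpenAI class at OpenAI instead of our TGI endpoint
llm_openai = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    openai_api_key="<OPENAI_API_KEY>",
)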
Now let's use our Mixtral model in a simple RAG pipeline to answer a question about the contents of an HF blog post.
from langchain import hub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.runnables import RunnableParallel
from langchain_community.embeddings import HuggingFaceEmbeddings
# Load, chunk and index the contents of the blog
loader = WebBaseLoader(
    web_paths=("https://huggingface.co./blog/open-source-llms-as-agents",),
)
docs = loader.load()
# declare an HF embedding model
hf_embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=hf_embeddings)
# Retrieve and generate using the relevant snippets of the blog
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
rag_chain_from_docs = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"]))) | prompt | llm | StrOutputParser()
)
rag_chain_with_source = RunnableParallel({"context": retriever, "question": RunnablePassthrough()}).assign(
    answer=rag_chain_from_docs
)
rag_chain_with_source.invoke("According to this article which open-source model is the best for an agent behaviour?")
How to use with LlamaIndex
Similarly, you can also use a TGI endpoint in LlamaIndex. We'll use the OpenAILike class, and instantiate it with some additional arguments (i.e. is_local, is_function_calling_model, is_chat_model, context_window).
Note: The context window argument should match the value previously set for MAX_TOTAL_TOKENS on your endpoint.
from llama_index.llms import OpenAILike
llm = OpenAILike(
    model="tgi",
    api_key=HF_API_KEY,
    api_base=BASE_URL + "/v1/",
    is_chat_model=True,
    is_local=False,
    is_function_calling_model=False,
    context_window=4096,
)
llm.complete("Why is open-source software important?")
Now we can use it in a similar RAG pipeline. Keep in mind that the MAX_INPUT_LENGTH chosen earlier for the Inference Endpoint directly limits the number of retrieved chunks (similarity_top_k) the model can process.
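As a rough, back-of-the-envelope illustration of that constraint (the numbers below are assumptions; actual counts depend on the tokenizer, chunking strategy, and prompt template):
# hypothetical budget estimate: how many retrieved chunks fit in the input window?
max_input_length = 4096   # MAX_INPUT_LENGTH configured on the endpoint above
chunk_size = 512          # assumed chunk size, in tokens
prompt_overhead = 512     # rough allowance for the question and prompt template (assumption)

max_chunks = (max_input_length - prompt_overhead) // chunk_size
print(f"similarity_top_k should stay at or below ~{max_chunks} chunks")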
from llama_index import (
    ServiceContext,
    VectorStoreIndex,
)
from llama_index import download_loader
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.query_engine import CitationQueryEngine
SimpleWebPageReader = download_loader("SimpleWebPageReader")
documents = SimpleWebPageReader(html_to_text=True).load_data(
    ["https://huggingface.co./blog/open-source-llms-as-agents"]
)
# Load embedding model
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")
# Pass LLM to pipeline
service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=llm)
index = VectorStoreIndex.from_documents(documents, service_context=service_context, show_progress=True)
# Query the index
query_engine = CitationQueryEngine.from_args(
    index,
    similarity_top_k=2,
)
response = query_engine.query("According to this article which open-source model is the best for an agent behaviour?")
response.response
Wrap up
Once you're done with your endpoint, you can pause or delete it. This step can be completed via the UI, or programmatically like below.
# pause our running endpoint
endpoint.pause()
# optionally delete
# endpoint.delete()