
Creating an LLM Chat Module for CodeProject.AI Server

Matthew Dennis


April 4, 2024

CPOL


Create a ChatGPT-like AI module for CodeProject.AI Server that handles a long-running process.

The full code is available in our GitHub repository at CodeProject.AI-Server/src/modules/LlamaChat at main · codeproject/CodeProject.AI-Server (github.com)

Introduction

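This article shows you how to run a ChatGPT-like large language model (LLM) on your local desktop machine. There are plenty of reasons to do this: keeping your sensitive data inside your own network, avoiding hosting bills that are hard to make sense of, and having access to an LLM when there is no internet connection. Our reason: they're cool and we wanted to play with them.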

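We will use CodeProject.AI Server to handle all the annoying setup, deployment, and lifecycle management so that we can focus on the code. There really isn't much code, since the enormous models already contain most of the magic. What is of interest (apart from the rather important point of the LLM itself) is how CodeProject.AI Server handles long-running inference calls.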

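Traditionally, modules in CodeProject.AI Server have assumed that any inference (feeding data into an AI model) is fast enough to return a result within the usual timeout of an HTTP API call to the server. With LLMs that is not the case. You send a prompt, the model does some processing on it, and the continuation or chat response then comes back piece by piece. This can take a while, so we will introduce long-running processes to CodeProject.AI Server.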

Getting Started

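We assume you have already read CodeProject.AI Module creation: A full walkthrough in Python. We will create a module in exactly the same way, with one addition: we will show how to handle long-running processes.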

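We also assume that you have set up your development environment by following the guide to installing the CodeProject.AI development environment.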

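The library we will use is Llama-cpp, wrapped for Python (llama-cpp-python), and the model will be Mistral 7B Instruct v0.2 - GGUF, a model with 7.3 billion parameters and a 32K context window that performs very well on desktop-class hardware. GGUF is the model file format used by llama.cpp, the C++ library originally written to run Llama, the LLM released by Meta AI in 2023, and llama-cpp-python is a Python wrapper around that C++ interface. It all runs remarkably smoothly, thanks to the Mistral 7B model and Georgi Gerganov's work on llama-cpp.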

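If you are looking for other Llama-compatible models to experiment with, I suggest trying Hugging Face, in particular the models quantized by TheBloke (Tom Jobbins). With new and improved models being released almost daily, you can use an LLM leaderboard to pick an LLM that suits your use case and hardware.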

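Writing any CodeProject.AI Server module is a simple process. Our SDK does most of the heavy lifting, letting you concentrate on implementing the functionality for your use case. The steps we will take on our module-creation adventure are: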

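Create a project in the CodeProject.AI Server solution.
Create a `modulesettings.json` file that defines the metadata needed to run and communicate with the module.
Write the module's Python code.
1. Write a Python file that wraps the `llama-cpp-python` package, typically exposing the required functionality as a class.
2. Write a Python file that implements an adapter connecting CodeProject.AI Server to the wrapper above. This is where the magic happens.
Create the files required to install the module.
1. Create the installation scripts.
2. Create the PIP requirements.txt files.
Create the UI for the CodeProject.AI Dashboard.

Create the packaging scripts that allow the module to be deployed to the CodeProject.AI module repository. This is covered in detail in Adding your own Python module to CodeProject.AI and is not discussed here.

Testing of the module is not discussed in this article either, but suggestions can be found in Adding your own Python module to CodeProject.AI.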

Create the project

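CodeProject.AI Server already includes this module in the current codebase, but we will pretend it doesn't and walk through its creation. To do that, we first need to create a folder named "Llama" under `/src/modules` in the CodeProject.AI Server solution.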

Create the modulesettings.json file

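Again, make sure you have reviewed the Python walkthrough and The ModuleSettings files. Our modulesettings file is very basic. The interesting parts are:

The adapter we will use to launch the module is `llama_chat_adapter.py`.
We will run under python3.8.
We will set some environment variables that specify the folder containing the model and the model's filename.
We will define a route "text/chat" that takes a command named "prompt", which accepts a string "prompt" and returns a string "reply".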

{
  "Modules": {
 
    "LlamaChat": {
      "Name": "LlamaChat",
      "Version": "1.0.0",
 
      "PublishingInfo" : {
		 ... 
      },
 
      "LaunchSettings": {
        "FilePath":    "llama_chat_adapter.py",
        "Runtime":     "python3.8",
      },
 
      "EnvironmentVariables": {
        "CPAI_MODULE_LLAMA_MODEL_DIR":      "./models",
        "CPAI_MODULE_LLAMA_MODEL_FILENAME": "mistral-7b-instruct-v0.2.Q4_K_M.gguf",

        // fallback to loading pretrained
        "CPAI_MODULE_LLAMA_MODEL_REPO":     "@TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
        "CPAI_MODULE_LLAMA_MODEL_FILEGLOB": "*.Q4_K_M.gguf",
      },
 
      "GpuOptions" : {
	     ...
      },
      
      "InstallOptions" : {
	     ...
      },
  
      "RouteMaps": [
        {
          "Name": "LlamaChat",
          "Route": "text/chat",
          "Method": "POST",
          "Command": "prompt",
          "MeshEnabled": false,
          "Description": "Uses the Llama LLM to answer simple wiki-based questions.",
          "Inputs": [
            {
              "Name": "prompt",
              "Type": "Text",
              "Description": "The prompt to send to the LLM"
            }
          ],
          "Outputs": [
            {
              "Name": "success",
              "Type": "Boolean",
              "Description": "True if successful."
            },
            {
              "Name": "reply",
              "Type": "Text",
              "Description": "The reply from the model."
            },
			...
          ]
        }
      ]
    }
  }
}
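To make the route definition concrete, here is a minimal Python sketch of how a client might build the request for it. It is only an illustration: it assumes the server is listening on `http://localhost:32168` and accepts a form-encoded `prompt` field, and the names `SERVER_URL`, `build_chat_request`, and `send_chat_request` are made up for this example rather than taken from the SDK.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the server is reachable on this local address and port.
SERVER_URL = "http://localhost:32168"


def build_chat_request(prompt: str, base_url: str = SERVER_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for the 'text/chat' route."""
    # The route map declares a single text input named 'prompt'.
    body = urllib.parse.urlencode({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(f"{base_url}/v1/text/chat", data=body, method="POST")


def send_chat_request(prompt: str) -> dict:
    """Send the prompt and decode the JSON reply. Needs a running server."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Keeping the request construction separate from the network call makes it possible to inspect the URL and payload without a server running.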

Write the module code

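The whole point of creating a module for CodeProject.AI Server is to take existing code, wrap it, and allow it to be exposed to all users by the API server. This is done with two files: one that wraps the package or sample code, and another (the adapter) that connects CodeProject.AI Server to that wrapper.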

Wrap the llama-cpp-python package

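The code we are wrapping lives in `llama_chat.py`. This simple Python module has two methods: the `__init__` constructor, which creates a `Llama` object, and `do_chat`, which takes a prompt and returns the model's response. The value returned by `do_chat` is either a `CreateChatCompletionResponse` object or, when streaming, an `Iterator[CreateChatCompletionStreamResponse]` object. The `**kwargs` parameter allows arbitrary additional arguments to be passed to the LLM's `create_chat_completion` function. See the `llama-cpp-python` documentation for details of the available parameters and what they do.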

# This module uses the llama-cpp-python library to interact with the LLM.
# See https://llama-cpp-python.readthedocs.io/en/latest/ for more information.
 
import os
from typing import Iterator, Union
 
from llama_cpp import ChatCompletionRequestSystemMessage, \
                      ChatCompletionRequestUserMessage,   \
                      CreateCompletionResponse,           \
                      CreateCompletionStreamResponse,     \
                      CreateChatCompletionResponse,       \
                      CreateChatCompletionStreamResponse, \
                      Llama
 
class LlamaChat:
 
    def __init__(self, repo_id: str, fileglob:str, filename:str, model_dir:str, n_ctx: int = 0,
                 verbose: bool = True) -> None:
 
        try:
            # This will use the model we've already downloaded and cached locally
            self.model_path = os.path.join(model_dir, filename)
            self.llm = Llama(model_path=self.model_path, 
                             n_ctx=n_ctx,
                             n_gpu_layers=-1,
                             verbose=verbose)
        except Exception:
            try:
                # This will download the model from the repo and cache it locally
                # Handy if we didn't download during install
                self.model_path = os.path.join(model_dir, fileglob)
                self.llm = Llama.from_pretrained(repo_id=repo_id,
                                                 filename=fileglob,
                                                 n_ctx=n_ctx,
                                                 n_gpu_layers=-1,
                                                 verbose=verbose,
                                                 cache_dir=model_dir,
                                                 chat_format="llama-2")
            except Exception:
                self.llm        = None
                self.model_path = None
 
 
    def do_chat(self, prompt: str, system_prompt: str=None, **kwargs) -> \
            Union[CreateChatCompletionResponse, Iterator[CreateChatCompletionStreamResponse]]:
        """ 
        Generates a response from a chat / conversation prompt
        params:
            prompt:str	                    The prompt to generate text from.
            system_prompt: str=None         The description of the assistant
            max_tokens: int = 128           The maximum number of tokens to generate.
            temperature: float = 0.8        The temperature to use for sampling.
            grammar: Optional[LlamaGrammar] = None
            functions: Optional[List[ChatCompletionFunction]] = None,
            function_call: Optional[Union[str, ChatCompletionFunctionCall]] = None,
            stream: bool = False            Whether to stream the results.
            stop: [Union[str, List[str]]] = [] A list of strings to stop generation when encountered.
        """
 
        if not system_prompt:
            system_prompt = "You're a helpful assistant who answers questions the user asks of you concisely and accurately."
 
        completion = self.llm.create_chat_completion(
                        messages=[
                            ChatCompletionRequestSystemMessage(role="system", content=system_prompt),
                            ChatCompletionRequestUserMessage(role="user", content=prompt),
                        ],
                        **kwargs) if self.llm else None
 
        return completion

As you can see, it doesn't take much code to implement this file.
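Because `do_chat` can return two different shapes, it may help to see how a caller could read the text out of either one. The sketch below assumes the OpenAI-style dictionaries that `create_chat_completion` produces (a `message` for a single response, a `delta` per streamed chunk). The helper name `extract_reply` is hypothetical and is not part of `llama_chat.py`.

```python
from typing import Any, Dict, Iterable, Optional, Union


def extract_reply(completion: Optional[Union[Dict[str, Any], Iterable[Dict[str, Any]]]]) -> str:
    """Return the generated text from a single or streamed chat completion."""
    if completion is None:
        # do_chat returns None when no model could be loaded
        return ""

    if isinstance(completion, dict):
        # Non-streaming: the full message arrives in one response
        return completion["choices"][0]["message"]["content"] or ""

    # Streaming: each chunk carries a small 'delta'; some have no content
    parts = []
    for chunk in completion:
        delta = chunk["choices"][0]["delta"]
        if delta.get("content"):
            parts.append(delta["content"])
    return "".join(parts)
```

With a loaded model, this could be called as `extract_reply(chat.do_chat("Hello", stream=True))`, where `chat` is a `LlamaChat` instance.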

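Create the adapter

The adapter is a class derived from the `ModuleRunner` class. `ModuleRunner` does the heavy lifting for:

Retrieving commands from the server.
Calling the appropriate overridden function on the derived class.
Returning the response to the server.
Logging.
Sending periodic module status updates to the server.

Our adapter lives in the module `llama_chat_adapter.py` and contains the overridden methods discussed in the Python walkthrough.

Important: the response to the `llm.create_chat_completion` call (that is, the call to the LLM) can be either a single response or a streamed response. Both take time, but we will choose to have the response returned as a stream, which lets us build up the reply incrementally. We will do this through the long process mechanism in CodeProject.AI Server. This means we will use the code in `llama_chat.py` to issue the request to the LLM and iterate over the returned values, accumulating the reply as the LLM generates it. To display the accumulating reply after the initial request to CodeProject.AI Server, the client can poll the command status.

We will go through each part of the file and explain what it does. The complete file can be viewed in the source code in the GitHub repository.

Preamble

The preamble sets up the package search path to the SDK, includes the imports the file needs, and defines the `LlamaChat_adapter` class.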

import sys
import time

# Import the CodeProject.AI SDK. This will add to the PATH for future imports
sys.path.append("../../SDK/Python")
from common import JSON
from request_data import RequestData
from module_runner import ModuleRunner
from module_options import ModuleOptions
from module_logging import LogMethod, LogVerbosity
 
# Import the method of the module we're wrapping
from llama_chat import LlamaChat
 
class LlamaChat_adapter(ModuleRunner):

initialise()

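The `initialise()` function is overridden from the `ModuleRunner` base class and initializes the adapter when it starts. In this module, it:

Reads the environment variables that define the LLM model that will be used to process prompts.
Creates an instance of the `LlamaChat` class using the specified model.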

def initialise(self) -> None:
 
    self.models_dir      = ModuleOptions.getEnvVariable("CPAI_MODULE_LLAMA_MODEL_DIR",      "./models")
    
    # For using llama-cpp.from_pretrained
    self.model_repo      = ModuleOptions.getEnvVariable("CPAI_MODULE_LLAMA_MODEL_REPO",     "TheBloke/Llama-2-7B-Chat-GGUF")
    self.models_fileglob = ModuleOptions.getEnvVariable("CPAI_MODULE_LLAMA_MODEL_FILEGLOB", "*.Q4_K_M.gguf")
    
    # fallback loading via Llama()
    self.model_filename  = ModuleOptions.getEnvVariable("CPAI_MODULE_LLAMA_MODEL_FILENAME", "mistral-7b-instruct-v0.2.Q4_K_M.gguf")
 
    verbose = self.log_verbosity != LogVerbosity.Quiet
    self.llama_chat = LlamaChat(repo_id=self.model_repo,
                                fileglob=self.models_fileglob,
                                filename=self.model_filename,
                                model_dir=self.models_dir,
                                n_ctx=0,
                                verbose=verbose)
    
    if self.llama_chat.model_path:
        self.log(LogMethod.Info|LogMethod.Server, {
            "message": f"Using model from '{self.llama_chat.model_path}'",
            "loglevel": "information"
        })
    else:
        self.log(LogMethod.Error|LogMethod.Server, {
            "message": f"Unable to load Llama model",
            "loglevel": "error"
        })
 
    self.reply_text  = ""
    self.cancelled   = False
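The environment-variable lookups above follow a simple "value or default" pattern. As a rough illustration of that pattern outside the SDK, the sketch below uses only `os.environ`. The functions `get_env_variable` and `resolve_model_settings` are hypothetical stand-ins and do not claim to match how `ModuleOptions.getEnvVariable` is implemented.

```python
import os
from typing import Dict


def get_env_variable(name: str, default: str) -> str:
    """Return the environment variable's value, or the default if unset or empty."""
    value = os.environ.get(name)
    return value if value else default


def resolve_model_settings() -> Dict[str, str]:
    """Collect the model settings the adapter reads, using the article's defaults."""
    return {
        "model_dir": get_env_variable("CPAI_MODULE_LLAMA_MODEL_DIR", "./models"),
        "filename":  get_env_variable("CPAI_MODULE_LLAMA_MODEL_FILENAME",
                                      "mistral-7b-instruct-v0.2.Q4_K_M.gguf"),
        "fileglob":  get_env_variable("CPAI_MODULE_LLAMA_MODEL_FILEGLOB", "*.Q4_K_M.gguf"),
    }
```

Setting `CPAI_MODULE_LLAMA_MODEL_FILENAME` before launch would then swap in a different model file without touching the code.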

process()

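The `process()` function is called whenever a command is received that is not one of the common module commands. It does one of two things:

Process the request and return a response. This is the mode for short-running work such as object detection.
Return a callable that will be executed in the background to process the request's command and create the response. This is the mode we will use.

For this module, we simply return the `LlamaChat_adapter.long_process` function to signal that this is a long-running process. The name is a convention.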

def process(self, data: RequestData) -> JSON:
    return self.long_process

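Fun note: when you return a callable from `process`, the client that made the request to CodeProject.AI Server does not actually receive a callable as the response. That would be weird and unhelpful. `ModuleRunner` notices that a callable is being returned and passes the command ID and module ID for the current request back to the client, which can then use them to make status calls related to this request.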
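The idea of "return a callable and get an ID back" can be sketched in a few lines of plain Python. The `MiniRunner` class below is a toy written for this article's explanation: it is not the SDK's `ModuleRunner`, and its use of a thread, its method names, and its result store are all assumptions made for illustration.

```python
import threading
import uuid
from typing import Any, Callable, Dict


class MiniRunner:
    """Toy dispatcher showing the callable-means-long-process idea."""

    def __init__(self, module_id: str = "LlamaChat") -> None:
        self.module_id = module_id
        self.results: Dict[str, Any] = {}
        self.tasks: Dict[str, threading.Thread] = {}

    def dispatch(self, process: Callable[[dict], Any], data: dict) -> dict:
        result = process(data)
        if not callable(result):
            # Short job: the response is ready, so hand it straight back
            return result

        # Long job: run the returned callable in the background and give
        # the caller IDs it can use to ask about progress later
        command_id = str(uuid.uuid4())

        def run() -> None:
            self.results[command_id] = result(data)

        task = threading.Thread(target=run, daemon=True)
        self.tasks[command_id] = task
        task.start()
        return {"success": True, "commandId": command_id, "moduleId": self.module_id}
```

The caller never sees the callable itself, only the IDs, which mirrors the behaviour described above.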

long_process()

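This is where the work actually gets done, using the functionality in the `llama_chat.py` file. The `long_process` method calls the `llama_chat.py` code, passing `stream=True` to `do_chat`. This returns an iterator of responses, each of which is processed in a loop and added to our final result. On each iteration we check whether we have been asked to cancel the operation. The cancel signal is the `self.cancelled` variable, which is set in the `cancel_command_task` method (described below).

The client can poll for the accumulating result by sending the `get_command_status` command to the server (described below) and displaying the `reply` property of the response.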

def long_process(self, data: RequestData) -> JSON:
 
    self.reply_text = ""
    stop_reason = None
 
    prompt: str        = data.get_value("prompt")
    max_tokens: int    = data.get_int("max_tokens", 0) #0 means model default
    temperature: float = data.get_float("temperature", 0.4)
 
    try:
        start_time = time.perf_counter()
 
        completion = self.llama_chat.do_chat(prompt=prompt, max_tokens=max_tokens,
                                             temperature=temperature, stream=True)
        if completion:
            try:
                for output in completion:
                    if self.cancelled:
                        self.cancelled = False
                        stop_reason = "cancelled"
                        break
 
                    # Using the raw result from the llama_chat module. In
                    # building modules we don't try and rewrite the code we
                    # are wrapping. Rather, we wrap the code so we can take
                    # advantage of updates to the original code more easily
                    # rather than having to re-apply fixes.
                    delta = output["choices"][0]["delta"]
                    if "content" in delta:
                        self.reply_text += delta["content"]
            except StopIteration:
                pass
            
        inferenceMs : int = int((time.perf_counter() - start_time) * 1000)
 
        if stop_reason is None:
            stop_reason = "completed"
 
        response = {
            "success": True, 
            "reply": self.reply_text,
            "stop_reason": stop_reason,
            "processMs" : inferenceMs,
            "inferenceMs" : inferenceMs
        }
 
    except Exception as ex:
        self.report_error(ex, __file__)
        response = { "success": False, "error": "Unable to generate text" }
 
    return response

command_status()

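We have a `long_process` method that gets called once it is returned from `process`, but we need a way to see the results of this long process. Remember that we accumulate the result of the chat completion in the `self.reply_text` variable, so in our `command_status()` function we return whatever we have collected so far.

Calling `command_status()` is something the client application that sent the original chat command should do after sending it. The call is made through the `/v1/LlamaChat/get_command_status` endpoint, which causes the server to send a message to the module, which in turn results in `command_status()` being called and the result returned to the client.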

def command_status(self) -> JSON:
    return {
        "success": True, 
        "reply":   self.reply_text
    }

The client should (or could) then display "reply", and each subsequent call will (hopefully) show a little more of the LLM's response.
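A client-side polling loop for this could look roughly like the sketch below. To keep it runnable without a server, the status call is passed in as a function. `poll_until_done`, `fetch_status`, and `on_update` are names invented for this example, and the `commandStatus == "completed"` check follows what the Explorer page shown later in this article tests for.

```python
import time
from typing import Callable, Dict


def poll_until_done(fetch_status: Callable[[], Dict],
                    on_update: Callable[[str], None],
                    interval_s: float = 0.25,
                    max_polls: int = 1000) -> str:
    """Poll a status function, reporting the growing reply until it completes."""
    reply = ""
    for _ in range(max_polls):
        status = fetch_status()
        if not status or not status.get("success"):
            break                              # give up on a failed status call

        reply = status.get("reply", "")
        on_update(reply)                       # e.g. redraw the answer box

        if status.get("commandStatus") == "completed":
            break
        time.sleep(interval_s)
    return reply
```

In a real client, `fetch_status` would wrap the HTTP call to the `get_command_status` endpoint with the command ID and module ID returned by the original request.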

cancel_command_task()

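`cancel_command_task()` is called when the module receives a `cancel_command` command from the server. This happens whenever the server receives a `v1/LlamaChat/cancel_command` request. The function sets a flag that tells the long process to terminate. It also sets `self.force_shutdown` to `False` to tell the `ModuleRunner` base class that this module will terminate the long process gracefully itself, so there is no need to force the background task to terminate.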

def cancel_command_task(self):
    self.cancelled      = True   # We will cancel this long process ourselves
    self.force_shutdown = False  # Tell ModuleRunner not to go ballistic
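The flag-based cancellation used here is a cooperative pattern: the loop itself checks the flag between chunks and stops cleanly. The standalone sketch below shows the same idea with plain strings instead of LLM output. The class `CancellableAccumulator` and its method names are made up for illustration and are not part of the module.

```python
from typing import Iterable


class CancellableAccumulator:
    """Accumulates text pieces until finished or asked to cancel."""

    def __init__(self) -> None:
        self.reply_text = ""
        self.cancelled = False

    def cancel(self) -> None:
        # Only sets a flag; the running loop decides when to stop
        self.cancelled = True

    def run(self, pieces: Iterable[str]) -> str:
        """Consume the pieces and return the stop reason."""
        self.reply_text = ""
        for piece in pieces:
            if self.cancelled:
                self.cancelled = False     # reset so the next run starts clean
                return "cancelled"
            self.reply_text += piece
        return "completed"
```

Whatever was accumulated before the cancel request stays in `reply_text`, so a client polling for status still sees the partial answer.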

Main

Finally, we need to start the asyncio loop for the LlamaChat_adapter if this file is executed from the Python command line.

if __name__ == "__main__":
    LlamaChat_adapter().start_loop()

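That is all the Python code needed to implement the module. The installation process requires a few standard files, which are discussed in the following sections.

Write the installation and setup

Installing and setting up the module requires a few files. These files are used to build the execution environment and run the module. This section describes them.

Create the installation scripts

You need two installation scripts for the module: `install.bat` for Windows and `install.sh` for Linux and macOS. For this module, all these files do is download the LLM model file during installation, to ensure the module will work without an internet connection. You can view the contents of these files in the source code in the GitHub repository.

For details on creating these files, see CodeProject.AI Module creation: A full walkthrough in Python and Writing install scripts for CodeProject.AI Server.

Create the requirements.txt file

The module setup process uses the `requirements.txt` file to install the Python packages the module needs. This file can have variants if different operating systems, architectures, and hardware need different packages or package versions. For details of the variants, see Python requirements.txt files and the source code. For this module, the main `requirements.txt` file is: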

#! Python3.7
 
huggingface_hub     # Installing the huggingface hub
 
diskcache>=5.6.1    # Installing diskcache for Disk and file backed persistent cache
numpy>=1.20.0       # Installing NumPy, a package for scientific computing
 
# --extra-index-url https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/basic/cpu
# --extra-index-url https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/AVX/cpu
# --extra-index-url https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/AVX512/cpu
--extra-index-url=https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/AVX2/cpu
--prefer-binary
llama-cpp-python    # Installing simple Python bindings for the llama.cpp library
 
# last line empty

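Create the CodeProject.AI test page (and the Explorer UI)

The UI displayed in the CodeProject.AI Explorer is defined in the `explore.html` file. Below is a stripped-down version of what is in the repository so that you can see the important parts.

When `_MID_queryBtn` is clicked, `_MID_onLlamaChat` is called, which takes the prompt supplied by the user and posts it to the `/v1/text/chat` endpoint. The data returned from that call includes a message along the lines of "thanks, we've started a long process", together with the IDs of the command and the module the request was sent to.

We then immediately start a loop that polls the module status every 250 ms. We do this by calling `/v1/LlamaChat/get_command_status`, passing the command ID and module ID we received from the `process` call. With each response, we display `results.reply`.

The end result is that you type a prompt, click Send, and a few seconds later the response starts accumulating in the results box. Pure magic.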

<!DOCTYPE html>
<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
    ... 
</head>
<body class="dark-mode">

        <form method="post" action="" enctype="multipart/form-data" id="myform">

            <div class="form-group">
                <label class="form-label text-end">How can I help you?</label>
                <div class="input-group mt-1">
                    <textarea id="_MID_promptText"></textarea>
                    <input id="_MID_queryBtn" type="button" value="Send" 
                           onclick="_MID_onLlamaChat(_MID_promptText.value, _MID_maxTokens.value, _MID_temperature.value)">
                    <input type="button" value="Stop" id="_MID_stopBtn" onclick="_MID_onLlamaStop()" />
                </div>
            </div>

            <div class="mt-2">
                <div id="_MID_answerText"></div>
            </div>
 
            <div class="mt-3">
                <div id="results" name="results"></div>
            </div>
 
        </form>
 
        <script type="text/javascript">
            let chat           = '';
            let commandId      = '';
            let moduleId       = '';
 
            async function _MID_onLlamaChat(prompt, maxTokens, temperature) {
 
                if (!prompt) {
                    alert("No text was provided for Llama chat");
                    return;
                }
 
                let params = [
                    ['prompt',      prompt],
                    ['max_tokens',  maxTokens],
                    ['temperature', temperature]
                ];
 
                setResultsHtml("Sending prompt...");
                let data = await submitRequest('text', 'chat', null, params)
                if (data) {
                    _MID_answerText.innerHTML = "<div class='text-muted'>Answer will appear here...</div>";
 
                    // get the commandId to so we can poll for the results
                    commandId = data.commandId;
                    moduleId  = data.moduleId;
 
                    params   = [['commandId', commandId], ['moduleId', moduleId]];
                    let done = false;
 
                    while (!done) {
 
                        await delay(250);
 
                        let results = await submitRequest('LlamaChat', 'get_command_status', null, params);
                        if (results) {
                            if (results.success) {

                                done = results.commandStatus == "completed";
                                let html = "<b>You</b>: " + prompt + "<br><br><b>Assistant</b>: "
                                         + results.reply.replace(/[\u00A0-\u9999<>\&]/g, function(i) {
                                               return '&#'+i.charCodeAt(0)+';';
                                           });

                                _MID_answerText.innerHTML = html;
                            }
                            else {
                                done = true;
                            }
                        }
                    }
                }
            }
 
            async function _MID_onLlamaStop() {
                let params = [['commandId', commandId], ['moduleId', 'LlamaChat']];
                let result = await submitRequest('LlamaChat', 'cancel_command', null, params);
            }

        </script>
</body>
</html>

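Conclusion

We have shown that by wrapping existing sample or library code and creating an adapter, it is easy to create modules that perform complex and long-running processes.

Your challenge is now to create or modify a module to support your particular needs. If you do, we encourage you to share it. CodeProject.AI Server is committed to helping build an AI community, and we would be honored to have you be a part of it.

History

April 4, 2024: Initial release.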