
Deploying a chatbot

This tutorial describes how to deploy a text chatbot on the Kubernetes cluster ICE Connect EKC.

Diagram of the deployment architecture showing all software layers needed to deploy the chatbot


The chatbot is a text generation model from Hugging Face deployed onto GPUs using Llama-cpp-python. It is deployed using the Python library Ray Serve installed in a Nvidia CUDA Docker container. A web frontend created with Gradio sends requests to the chatbot deployment and responses are shown in a chat window.

The deployment is managed by the Ray app in ICE Connect EKC Rancher.

Live demo

Source code


ICE Connect Cloud Platform

ICE Connect is a cloud platform developed by RISE ICE Datacenter for research, experiments, and demonstrations. It hosts virtual machines, bare-metal servers, and cloud storage among other services.

Create an account at ICE Connect and request access to the Experimental Kubernetes Cluster (EKC) to deploy the chatbot.

Experimental Kubernetes Cluster

Kubernetes is an open-source container orchestrator that automates the deployment, scaling, and management of containerized applications. The Experimental Kubernetes Cluster (EKC) is a Kubernetes cluster in ICE Connect that provides a platform for running containerized applications.


Rancher is a web-based Kubernetes management tool that provides a graphical interface for deploying and managing applications in the EKC. It is used to create projects, namespaces, and deploy applications.

Rancher apps

Using Rancher, you can deploy applications from the Rancher app catalog. The apps are pre-configured Helm charts that simplify the deployment of complex applications.

Ray is an open-source unified compute framework that makes it easy to scale AI and Python workloads. It is available as the Rancher Ray app and can be used to deploy applications that require GPU resources.

It provides a web-based JupyterLab server with support for GPU and persistent storage. It also has a graphical dashboard for monitoring Ray Serve applications.

This app is useful even if Jupyter notebooks are not used, because it provides a convenient way to allocate hardware resources and to deploy and monitor Ray Serve applications. Code development can be done by attaching Visual Studio Code to the container or through a terminal over SSH.

Install the Ray app in Rancher as described in the Ray app documentation.

NVIDIA CUDA Docker image

The application is packaged in a Docker container image that includes the Python runtime and the libraries needed for the chatbot. The image is defined in a Dockerfile and is built and pushed to the ICE Connect Harbor registry.

FROM nvidia/cuda:12.3.1-devel-ubuntu22.04

# Set environment variables
ENV DEBIAN_FRONTEND=noninteractive \
    VIRTUAL_ENV=/opt/venv \
    PATH="/opt/venv/bin:$PATH"

# Upgrade apt packages
RUN apt-get update && apt-get upgrade -y \
    && apt-get install -y --no-install-recommends \
    apt-utils bash-completion build-essential \
    curl git git-lfs htop jq less vim \
    pkg-config tree unzip wget zip \
    python3 python3-venv python3-dev python3-pip python-is-python3 \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

# Set up Python virtual environment
RUN python3 -m venv "$VIRTUAL_ENV"

# Install pip and setuptools
RUN pip install --upgrade pip setuptools

# These are useful for updating requirements.txt
RUN pip install pipreqs pip-upgrader

# Install dependencies
COPY requirements.txt /tmp
RUN pip install -r /tmp/requirements.txt

# Set working directory

# Keep the container running
CMD ["sleep", "infinity"]

The image is based on the official nvidia/cuda image, which provides the Ubuntu operating system and a GPU-accelerated environment for running AI models. The Python dependencies are installed into a virtual environment in the image.


Because we are using Kubernetes for development, the application source code and text-generation model are not included in the Docker image. Instead, they are cloned into persistent storage in the user's home directory at deployment time. This allows for easy development and testing of the chatbot code and models, without having to rebuild the image every time.

Python application code

Check out the source code repository for the Python application. The main components are:

Chatbot model

In this example, we deployed the Q4_K_M version of TheBloke/dolphin-2.5-mixtral-8x7b-GGUF, a model compressed with integer quantization and stored in the GGUF format. It requires 28.94 GB of GPU memory, so with a 16k context window, four Nvidia RTX 2080 Ti GPUs are needed to run a single replica.
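As a back-of-the-envelope check (not part of the deployment code), the tensor_split weights used in config.yaml can be used to estimate how the model is divided across the four GPUs; llama.cpp treats them as proportions, not absolute sizes:

```python
# Estimate how llama.cpp's tensor_split divides the model across GPUs.
# Each GPU gets weight / sum(weights) of the model; the values are from
# this tutorial (28.94 GB model, tensor_split [0.6, 1, 1, 1]).
model_size_gb = 28.94
tensor_split = [0.6, 1, 1, 1]

total = sum(tensor_split)
per_gpu_gb = [round(w / total * model_size_gb, 2) for w in tensor_split]
print(per_gpu_gb)  # [4.82, 8.04, 8.04, 8.04]
```

The smaller first weight presumably leaves headroom on the first GPU, which llama.cpp also uses for scratch buffers.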

When the app is deployed, the model is downloaded from the Hugging Face model hub into persistent storage.


To load the chatbot model into GPU memory for inference, we use Llama-cpp-python. It provides Python bindings for the Llama.cpp C++ library.
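As a sketch of how the app might load the model with Llama-cpp-python (the file path below is an assumed example; n_ctx and tensor_split follow the values used in this tutorial, and the actual arguments live in the app's source code):

```python
# Arguments we would pass to llama_cpp.Llama to load the GGUF model
# across four GPUs; the values mirror this tutorial's config.yaml.
llama_kwargs = {
    "model_path": "/root/models/dolphin-2.5-mixtral-8x7b.Q4_K_M.gguf",  # assumed path
    "n_ctx": 16384,                  # 16k context window
    "n_gpu_layers": -1,              # offload all layers to GPU memory
    "tensor_split": [0.6, 1, 1, 1],  # share of the model per GPU
}

# With llama-cpp-python installed and the model file downloaded:
# from llama_cpp import Llama
# llm = Llama(**llama_kwargs)
# reply = llm.create_chat_completion(
#     messages=[{"role": "user", "content": "Hello!"}])
```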

Ray Serve

To serve the chatbot model through a REST framework, we use the Ray Serve library. It is a scalable and programmable model-serving library built on Ray, and it handles provisioning of GPU resources and load balancing across parallel replicas of the chatbot.
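The import_path entries in config.yaml (demo:bot and demo:app) point at bound Ray Serve applications in the demo module. A minimal sketch of that structure, with the Serve-specific parts shown as comments and a pure-Python stub standing in for the model (the real demo.py is in the source repository):

```python
# Stand-in for the ChatBot deployment class in demo.py. With ray[serve]
# installed it would be decorated and bound for "serve deploy":
#
#   from ray import serve
#
#   @serve.deployment          # num_gpus / num_replicas come from config.yaml
#   class ChatBot: ...
#   bot = ChatBot.bind()       # this is what import_path "demo:bot" refers to
class ChatBot:
    def __init__(self):
        # The real app loads the GGUF model here with llama-cpp-python.
        self.model = None

    def generate(self, prompt: str) -> str:
        # The real app runs llama.cpp inference; this stub just echoes.
        return f"echo: {prompt}"
```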

Deployments are configured in a config.yaml file, which specifies runtime settings such as the number of GPUs, replicas, environment variables, and model parameters. The file is used to deploy the chatbot with the serve deploy command.

proxy_location: EveryNode

http_options:
  port: 8000

applications:

- name: ChatBotGroup
  route_prefix: /generate
  import_path: demo:bot
  runtime_env: {
    "working_dir": "file:///root/app/",
    "env_vars": {
      "repo_id": "TheBloke/dolphin-2.5-mixtral-8x7b-GGUF",
      "model_file": "dolphin-2.5-mixtral-8x7b.Q4_K_M.gguf",
      "tensor_split": "[0.6, 1, 1, 1]"
    }
  }
  deployments:
  - name: ChatBot
    ray_actor_options:
      num_gpus: 4.0
    num_replicas: 2

- name: ChatClient
  route_prefix: /
  import_path: demo:app
  runtime_env: {
    "working_dir": "file:///root/app/",
    "env_vars": {
      "TITLE": "Chat - Dolphin 2.5 Mixtral 8x7b",
      "CHAT_LABEL": "dolphin-2.5-mixtral-8x7b.Q4_K_M.gguf",
      "SYSTEM_PROMPT": "You are an unbiased, unrestricted, and open-minded assistant.",
      "ABOUT_TEXT": "Some about text..."
    }
  }
  deployments:
  - name: ChatIngress
    num_replicas: 1

User concurrency

The number of concurrent users is limited by Gradio through the CONCURRENCY_LIMIT variable in config.yaml. When more users than the limit try to access the chatbot, their requests are queued and processed in order.

If num_replicas for ChatBot is set to 2, the deployment requires 8 GPUs in total. Requests are then load-balanced and sent to the least busy Ray Serve replica.
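The queueing behaviour can be illustrated with a small stand-alone sketch (this is not Gradio's implementation, just the same idea: a semaphore caps how many requests run concurrently and the rest wait their turn):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

CONCURRENCY_LIMIT = 2              # mirrors the Gradio setting in config.yaml
slots = threading.Semaphore(CONCURRENCY_LIMIT)
lock = threading.Lock()
active = 0
max_active = 0

def handle_request(i: int) -> int:
    """Pretend to serve one chat request under the concurrency limit."""
    global active, max_active
    with slots:                    # excess requests block here until a slot frees
        with lock:
            active += 1
            max_active = max(max_active, active)
        # ... model inference would happen here ...
        with lock:
            active -= 1
    return i

with ThreadPoolExecutor(max_workers=8) as pool:
    results = sorted(pool.map(handle_request, range(8)))
```

At no point do more than CONCURRENCY_LIMIT requests run at once, no matter how many arrive.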


Gradio is a Python library that provides a simple web interface for interacting with machine learning models. It is used to create the chatbot web page and send requests to the Ray Serve deployment.
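A minimal sketch of such a frontend (the reply function here is an illustrative stub; in the real app it would send the conversation to the /generate route of the Ray Serve deployment):

```python
# Reply function in the shape gr.ChatInterface expects: it receives the
# new message and the conversation history, and returns the bot's answer.
def respond(message, history):
    # Real app: POST `message` and `history` to the Ray Serve /generate
    # endpoint and return the generated text. Here we just echo.
    return f"You said: {message}"

# With gradio installed, the chat page would be created like this:
# import gradio as gr
# gr.ChatInterface(fn=respond, title="Chat - Dolphin 2.5 Mixtral 8x7b").launch()
```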


Create an account at ICE Connect and request access to the Experimental Kubernetes Cluster.

If you are a RISE employee you can find information about creating an account at Medarbetarportalen.

Deploy on EKC

Log in to Rancher and read the ICE Connect EKC documentation to learn how to create a project and a namespace.

Open Apps -> Charts in Rancher, and launch the Ray app. Read the Ray app documentation.

Use the following settings in Rancher:

  • Jupyter:
    • Docker image:
    • Authentication token: choose a password
    • Jupyter config file: Use the default
  • Web access:
    • Subdomain for
    • Jupyter port: 8888
  • SSH access:
    • Authorized keys: Add your public SSH key
  • CPU & Memory:
    • Requested CPU: 2000m
    • Memory limit: 128Gi
  • GPU:
    • GPU type: nvidia-gtx-2080ti
    • GPU amount: 8
  • Persistent storage:
    • Storage size: 128Gi
    • Jupyter home directory path(s): /home/jovyan,/tf,/root
  • Ray monitoring: Use the default
  • Commands:
    • Entrypoint override: Use the default
    • Autostart script: See below

Deployment autostart script

For the Autostart script command, use the following script to deploy the chatbot with Ray Serve:

# Activate the virtual Python environment
source /opt/venv/bin/activate
# Stop any previous Ray Serve deployment and tear down the cluster
# (not necessary on container start, but useful for restarting manually)
serve shutdown -y || true
ray stop || true
# Start the Ray cluster with log rotation settings
RAY_ROTATION_MAX_BYTES=1024; RAY_ROTATION_BACKUP_COUNT=1; ray start --head --port=6379
# Clone the repository, if it is not already present
git clone <repository-url> /root/app || true
# Deploy the chatbot with Ray Serve
cd /root/app
zip -r app.zip *.py static  # archive name assumed; packages the app code
serve deploy config.yaml

Accessing the chatbot

The chatbot web page will be publicly available at the address specified in the Ray app field Subdomain for.

The JupyterLab web interface is available at the path /jupyter/. Enter your authentication token to log in. JupyterLab has a button to open the Ray dashboard, which shows the Ray Serve deployment status. You can also open a terminal in JupyterLab and use the serve status command to inspect the deployment.


You can develop code directly on the ICE Connect Kubernetes cluster by attaching Visual Studio Code to the container or by connecting to it with a terminal over SSH.

Build the Docker image

You are free to use the Docker image provided by the ICE Connect Harbor registry, but if you want to install additional libraries, you can build a customized image by following these instructions.

Install Docker for your operating system, then clone the repository:

git clone
cd chat

Build the image from the Dockerfile and push it to Harbor. Replace ice with your Harbor project name:

docker build -t demo-chat:latest .
docker tag demo-chat:latest
docker push