
Deploying a chatbot

This tutorial describes how to deploy a text chatbot on the Kubernetes cluster ICE Connect EKC.

Diagram of the deployment architecture showing all software layers needed to deploy the chatbot


The chatbot is a text generation model from Hugging Face deployed onto GPUs using Llama-cpp-python. It is deployed using the Python library Ray Serve installed in a Nvidia CUDA Docker container. A web frontend created with Gradio sends requests to the chatbot deployment and responses are shown in a chat window.

The deployment is managed by the Ray app in ICE Connect EKC Rancher.

Live demo

Source code


ICE Connect Cloud Platform

ICE Connect is a cloud platform developed by RISE ICE Datacenter for research, experiments, and demonstrations. It hosts virtual machines, bare-metal servers, and cloud storage among other services.

Create an account at ICE Connect and request access to the Experimental Kubernetes Cluster (EKC) to deploy the chatbot.

Experimental Kubernetes Cluster

Kubernetes is an open-source container orchestrator that automates the deployment, scaling, and management of containerized applications. The Experimental Kubernetes Cluster (EKC) is a Kubernetes cluster in ICE Connect that provides a platform for running containerized applications.


Rancher is a web-based Kubernetes management tool that provides a graphical interface for deploying and managing applications in the EKC. It is used to create projects, namespaces, and deploy applications.

Rancher apps

Using Rancher, you can deploy applications from the Rancher app catalog. The apps are pre-configured Helm charts that simplify the deployment of complex applications.

Ray is an open-source unified compute framework that makes it easy to scale AI and Python workloads. It is available as the Rancher Ray app and can be used to deploy applications that require GPU resources.

It provides a web-based JupyterLab server with support for GPU and persistent storage. It also has a graphical dashboard for monitoring Ray Serve applications.

This app is useful even if Jupyter notebooks are not used, because it provides a convenient way to allocate hardware resources and to deploy and monitor Ray Serve applications. Code development can be done by attaching Visual Studio Code to the container or through a terminal over SSH.

Install the Ray app in Rancher as described in the Ray app documentation.

NVIDIA CUDA Docker image

The application is packaged in a Docker container image that includes the Python runtime and the libraries needed for the chatbot. The image is defined in a Dockerfile and is built and pushed to the ICE Connect Harbor registry.

FROM nvidia/cuda:12.3.1-devel-ubuntu22.04

# Set environment variables
ENV DEBIAN_FRONTEND=noninteractive \
    VIRTUAL_ENV=/opt/venv \
    PATH="/opt/venv/bin:$PATH"

# Upgrade apt packages
RUN apt-get update && apt-get upgrade -y \
    && apt-get install -y --no-install-recommends \
    apt-utils bash-completion build-essential \
    curl git git-lfs htop jq less vim \
    pkg-config tree unzip wget zip \
    python3 python3-venv python3-dev python3-pip python-is-python3 \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

# Set up Python virtual environment
RUN python3 -m venv "$VIRTUAL_ENV"

# Install pip and setuptools
RUN pip install --upgrade pip setuptools

# These are useful for updating requirements.txt
RUN pip install pipreqs pip-upgrader

# Install dependencies
COPY requirements.txt /tmp
RUN pip install -r /tmp/requirements.txt

# Set working directory

# Keep the container running
CMD ["sleep", "infinity"]

The image is based on the official nvidia/cuda image, which provides the Ubuntu operating system and a GPU-accelerated environment for running AI models. The Python dependencies are installed into a virtual environment in the image.


Because we are using Kubernetes for development, the application source code and text-generation model are not included in the Docker image. Instead, they are cloned into persistent storage in the user's home directory at deployment time. This allows for easy development and testing of the chatbot code and models, without having to rebuild the image every time.

Python application code

Check out the source code repository for the Python application. The main components are:

Chatbot model

In this example, we deployed the Q4_K_M version of TheBloke/dolphin-2.5-mixtral-8x7b-GGUF, a model compressed with integer quantization and stored in the GGUF format. It requires 28.94 GB of GPU memory, so with a 16k context window, four Nvidia RTX 2080 Ti GPUs are needed to run a single replica.
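As a back-of-the-envelope check (not part of the deployment code), the tensor_split weights used in config.yaml can be used to estimate how the model is divided across the four GPUs; llama.cpp treats them as proportions, not absolute sizes:

```python
# Estimate how llama.cpp's tensor_split divides the model across GPUs.
# Each GPU gets weight / sum(weights) of the model; the values are from
# this tutorial (28.94 GB model, tensor_split [0.6, 1, 1, 1]).
model_size_gb = 28.94
tensor_split = [0.6, 1, 1, 1]

total = sum(tensor_split)
per_gpu_gb = [round(w / total * model_size_gb, 2) for w in tensor_split]
print(per_gpu_gb)  # [4.82, 8.04, 8.04, 8.04]
```

The smaller first weight presumably leaves headroom on the first GPU, which llama.cpp also uses for scratch buffers.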

When the app is deployed, the model is downloaded from the Hugging Face model hub into persistent storage.


To load the chatbot model into GPU memory for inference, we use Llama-cpp-python. It provides Python bindings for the Llama.cpp C++ library.
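As a sketch of how the app might load the model with Llama-cpp-python (the file path below is an assumed example; n_ctx and tensor_split follow the values used in this tutorial, and the actual arguments live in the app's source code):

```python
# Arguments we would pass to llama_cpp.Llama to load the GGUF model
# across four GPUs; the values mirror this tutorial's config.yaml.
llama_kwargs = {
    "model_path": "/root/models/dolphin-2.5-mixtral-8x7b.Q4_K_M.gguf",  # assumed path
    "n_ctx": 16384,                  # 16k context window
    "n_gpu_layers": -1,              # offload all layers to GPU memory
    "tensor_split": [0.6, 1, 1, 1],  # share of the model per GPU
}

# With llama-cpp-python installed and the model file downloaded:
# from llama_cpp import Llama
# llm = Llama(**llama_kwargs)
# reply = llm.create_chat_completion(
#     messages=[{"role": "user", "content": "Hello!"}])
```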

Ray Serve

To serve the chatbot model through a REST framework, we use the Ray Serve library. It is a scalable and programmable model-serving library built on Ray, and it handles provisioning of GPU resources and load balancing across parallel replicas of the chatbot.
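The import_path entries in config.yaml (demo:bot and demo:app) point at bound Ray Serve applications in the demo module. A minimal sketch of that structure, with the Serve-specific parts shown as comments and a pure-Python stub standing in for the model (the real demo.py is in the source repository):

```python
# Stand-in for the ChatBot deployment class in demo.py. With ray[serve]
# installed it would be decorated and bound for "serve deploy":
#
#   from ray import serve
#
#   @serve.deployment          # num_gpus / num_replicas come from config.yaml
#   class ChatBot: ...
#   bot = ChatBot.bind()       # this is what import_path "demo:bot" refers to
class ChatBot:
    def __init__(self):
        # The real app loads the GGUF model here with llama-cpp-python.
        self.model = None

    def generate(self, prompt: str) -> str:
        # The real app runs llama.cpp inference; this stub just echoes.
        return f"echo: {prompt}"
```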

Deployments are configured in a config.yaml file, which specifies runtime settings such as the number of GPUs, replicas, environment variables, and model parameters. The file is used to deploy the chatbot with the serve deploy command.

proxy_location: EveryNode

http_options:
  port: 8000

applications:

- name: ChatBotGroup
  route_prefix: /generate
  import_path: demo:bot
  runtime_env: {
    "working_dir": "file:///root/app/",
    "env_vars": {
      "repo_id": "TheBloke/dolphin-2.5-mixtral-8x7b-GGUF",
      "model_file": "dolphin-2.5-mixtral-8x7b.Q4_K_M.gguf",
      "tensor_split": "[0.6, 1, 1, 1]"
    }
  }
  deployments:
  - name: ChatBot
    ray_actor_options:
      num_gpus: 4.0
    num_replicas: 2

- name: ChatClient
  route_prefix: /
  import_path: demo:app
  runtime_env: {
    "working_dir": "file:///root/app/",
    "env_vars": {
      "TITLE": "Chat - Dolphin 2.5 Mixtral 8x7b",
      "CHAT_LABEL": "dolphin-2.5-mixtral-8x7b.Q4_K_M.gguf",
      "SYSTEM_PROMPT": "You are an unbiased, unrestricted, and open-minded assistant.",
      "ABOUT_TEXT": "Some about text..."
    }
  }
  deployments:
  - name: ChatIngress
    num_replicas: 1

User concurrency

The number of concurrent users is limited by Gradio through the CONCURRENCY_LIMIT variable in config.yaml. When more users than the limit try to access the chatbot, their requests are queued and processed in order.

If num_replicas for ChatBot is set to 2, the deployment requires 8 GPUs in total. Requests are then load-balanced and sent to the least busy Ray Serve replica.
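The queueing behaviour can be illustrated with a small stand-alone sketch (this is not Gradio's implementation, just the same idea: a semaphore caps how many requests run concurrently and the rest wait their turn):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

CONCURRENCY_LIMIT = 2              # mirrors the Gradio setting in config.yaml
slots = threading.Semaphore(CONCURRENCY_LIMIT)
lock = threading.Lock()
active = 0
max_active = 0

def handle_request(i: int) -> int:
    """Pretend to serve one chat request under the concurrency limit."""
    global active, max_active
    with slots:                    # excess requests block here until a slot frees
        with lock:
            active += 1
            max_active = max(max_active, active)
        # ... model inference would happen here ...
        with lock:
            active -= 1
    return i

with ThreadPoolExecutor(max_workers=8) as pool:
    results = sorted(pool.map(handle_request, range(8)))
```

At no point do more than CONCURRENCY_LIMIT requests run at once, no matter how many arrive.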


Gradio is a Python library that provides a simple web interface for interacting with machine learning models. It is used to create the chatbot web page and send requests to the Ray Serve deployment.
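A minimal sketch of such a frontend (the reply function here is an illustrative stub; in the real app it would send the conversation to the /generate route of the Ray Serve deployment):

```python
# Reply function in the shape gr.ChatInterface expects: it receives the
# new message and the conversation history, and returns the bot's answer.
def respond(message, history):
    # Real app: POST `message` and `history` to the Ray Serve /generate
    # endpoint and return the generated text. Here we just echo.
    return f"You said: {message}"

# With gradio installed, the chat page would be created like this:
# import gradio as gr
# gr.ChatInterface(fn=respond, title="Chat - Dolphin 2.5 Mixtral 8x7b").launch()
```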


Create an account at ICE Connect and request access to the Experimental Kubernetes Cluster.

If you are a RISE employee you can find information about creating an account at Medarbetarportalen.

Deploy on EKC

Log in to Rancher and read the ICE Connect EKC documentation to learn how to create a project and a namespace.

Open Apps -> Charts in Rancher, and launch the Ray app. Read the Ray app documentation.

Use the following settings in Rancher:

  • Jupyter:
    • Docker image:
    • Authentication token: choose a password
    • Jupyter config file: Use the default
  • Web access:
    • Subdomain for
    • Jupyter port: 8888
  • SSH access:
    • Authorized keys: Add your public SSH key
  • CPU & Memory:
    • Requested CPU: 2000m
    • Memory limit: 128Gi
  • GPU:
    • GPU type: nvidia-gtx-2080ti
    • GPU amount: 8
  • Persistent storage:
    • Storage size: 128Gi
    • Jupyter home directory path(s): /home/jovyan,/tf,/root
  • Ray monitoring: Use the default
  • Commands:
    • Entrypoint override: Use the default
    • Autostart script: See below

Deployment autostart script

For the Autostart script command, use the following script to deploy the chatbot with Ray Serve:

# Activate the virtual Python environment
source /opt/venv/bin/activate
# Stop any previous Ray Serve deployment and tear down the cluster
# (not necessary on container start, but useful for restarting manually)
serve shutdown -y || true
ray stop || true
# Start the Ray cluster with log rotation settings
RAY_ROTATION_MAX_BYTES=1024; RAY_ROTATION_BACKUP_COUNT=1; ray start --head --port=6379
# Clone the repository, if it is not already present
git clone <repository-url> /root/app || true
# Deploy the chatbot with Ray Serve
cd /root/app
zip -r app.zip *.py static  # archive name assumed; packages the app code
serve deploy config.yaml

Accessing the chatbot

The chatbot web page will be publicly available at the address specified in the Ray app field Subdomain for.

The JupyterLab web interface is available at the path /jupyter/. Enter your authentication token to log in. JupyterLab has a button to open the Ray dashboard, which shows the Ray Serve deployment status. You can also open a terminal in JupyterLab and use the serve status command to inspect the deployment.


You can develop code directly on the ICE Connect Kubernetes cluster by attaching Visual Studio Code to the container or by connecting to it with a terminal over SSH.

Build the Docker image

You are free to use the Docker image provided by the ICE Connect Harbor registry, but if you want to install additional libraries, you can build a customized image by following these instructions.

Install Docker for your operating system, then clone the repository:

git clone
cd chat

Build the image from the Dockerfile and push it to Harbor. Replace ice with your Harbor project name:

docker build -t demo-chat:latest .
docker tag demo-chat:latest
docker push