Lab 1: Online Installation of HAMi
This lab walks you through building a Kubernetes cluster from scratch on a Google Cloud GPU virtual machine and installing HAMi online, resulting in a complete GPU virtualization runtime environment.
What You'll Get
After completing this lab, you will have a fully functional GPU-virtualized Kubernetes cluster. For a detailed explanation of the cluster architecture and component responsibilities, see HAMi Cluster Architecture.
Installation Overview
The entire installation process is divided into 6 steps, each solving a specific problem:
%% title: HAMi Installation Overview
flowchart LR
Step1["Step 1<br/>Create GCP VM"] --> Step2["Step 2<br/>Install Helm"]
Step2 --> Step3["Step 3<br/>Install Kubernetes"]
Step3 --> Step4["Step 4<br/>Install Prometheus"]
Step4 --> Step5["Step 5<br/>Install GPU Operator"]
Step5 --> Step6["Step 6<br/>Install HAMi"]
| Step | Purpose | What Problem It Solves |
|---|---|---|
| Create GCP VM | Provision a Linux server with a GPU | Kubernetes needs GPU hardware to schedule GPU workloads |
| Install Helm | Kubernetes package manager | All subsequent components are installed via Helm, similar to apt/yum |
| Install Kubernetes | Container orchestration platform | HAMi runs on top of Kubernetes; all GPU resources are managed by K8s |
| Install Prometheus | Monitoring system | HAMi and GPU Operator depend on Prometheus to collect and store metrics |
| Install GPU Operator | Automated NVIDIA GPU software stack management | Automatically installs GPU drivers, container toolkit, metrics collectors, and other components |
| Install HAMi | GPU virtualization and sharing | Allows multiple Pods to share the same GPU, enabling VRAM partitioning and compute allocation |
Prerequisites
- Google Cloud account with Compute Engine API enabled
- gcloud CLI installed and authenticated (gcloud auth login)
- NVIDIA T4 GPU quota available in your GCP project
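A quick preflight check can catch a missing tool before Step 1. The sketch below is an illustration rather than part of the lab; `require_cmd` is a hypothetical helper name, and it only confirms that the binary is on your PATH, not that you are authenticated or have quota.

```shell
# Hypothetical helper: succeed only if the named command is on PATH.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Check the one CLI this lab needs on your workstation.
require_cmd gcloud || echo "Install the Google Cloud CLI, then run: gcloud auth login"
```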
Step 1: Create a GCP Virtual Machine
Purpose
Create a virtual machine with a GPU to serve as the foundation for the entire lab. HAMi requires physical GPU hardware (or a pass-through virtual GPU) to function. It does not emulate GPUs; it partitions and shares real ones.
Instructions
Set environment variables:
export PROJECT_ID=$(gcloud config get-value project)
export ZONE=us-central1-a
export VM_NAME=hami-workshop
export MACHINE_TYPE=n1-standard-4
export GPU_TYPE=nvidia-tesla-t4
export IMAGE_FAMILY=ubuntu-2204-lts
export IMAGE_PROJECT=ubuntu-os-cloud
export DISK_SIZE=100
Create the virtual machine:
gcloud compute instances create ${VM_NAME} \
--project=${PROJECT_ID} \
--zone=${ZONE} \
--machine-type=${MACHINE_TYPE} \
--accelerator=type=${GPU_TYPE},count=1 \
--maintenance-policy=TERMINATE \
--image-family=${IMAGE_FAMILY} \
--image-project=${IMAGE_PROJECT} \
--boot-disk-size=${DISK_SIZE}GB \
--boot-disk-type=pd-ssd
--maintenance-policy=TERMINATE is required because GPU instances do not support live migration. If GCP needs to perform maintenance on the host, the VM is terminated rather than migrated.
SSH into the VM:
gcloud compute ssh ${VM_NAME} --zone=${ZONE}
After logging in, switch to root:
sudo su -
Step 2: Install Helm
Purpose
Helm is the package manager for Kubernetes. All subsequent installations (Prometheus, GPU Operator, and HAMi) are done through Helm. You can think of it as the apt or yum of the Kubernetes world.
Instructions
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
Verify:
helm version
Step 3: Install Kubernetes
Purpose
HAMi is a GPU scheduling enhancement layer for Kubernetes. It runs as Pods within a Kubernetes cluster. Without Kubernetes, HAMi has no runtime foundation.
This step uses kubeadm to set up a single-node cluster. On a single node, the node serves as both the Master (control plane) and the Worker (running workloads).
Instructions
3.1 Disable Swap
Disable swap before installing Kubernetes. By default the kubelet refuses to start when swap is enabled, because its resource accounting assumes that memory limits map to physical memory; swapping makes Pod performance unpredictable.
swapoff -a
sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab
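To confirm that swap is really off, you can inspect /proc/swaps, whose first line is a header and whose remaining lines list active swap devices. The following is a minimal sketch; `swap_is_off` is a hypothetical helper, and its optional file argument exists only so the function can be exercised against a sample file.

```shell
# Succeeds (exit 0) if the swaps table lists no active swap devices.
# Line 1 of /proc/swaps is a header, so any further non-empty line is a device.
swap_is_off() {
  swaps_file="${1:-/proc/swaps}"
  [ "$(tail -n +2 "$swaps_file" 2>/dev/null | grep -c .)" -eq 0 ]
}

if swap_is_off; then
  echo "swap is off"
else
  echo "swap is still active"
fi
```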
3.2 Load Kernel Modules
Container networking requires the overlay and br_netfilter kernel modules. overlay is used for container filesystem layering, and br_netfilter enables iptables to correctly handle bridged traffic.
cat <<EOF | tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF
modprobe overlay
modprobe br_netfilter
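One way to confirm that both modules are loaded is to look for them in /proc/modules, where each line starts with the module name followed by a space. This is a sketch under that assumption; `module_loaded` is a hypothetical helper, and a module compiled directly into the kernel would not appear in this listing.

```shell
# Succeeds if the named module appears in a /proc/modules-style listing.
# The second argument defaults to /proc/modules and exists only for testing.
module_loaded() {
  grep -qs "^$1 " "${2:-/proc/modules}"
}

for mod in overlay br_netfilter; do
  if module_loaded "$mod"; then
    echo "loaded: $mod"
  else
    echo "not listed: $mod"
  fi
done
```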
3.3 Configure Kernel Network Parameters
These parameters ensure that network traffic between containers is properly routed and forwarded.
cat <<EOF | tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF
sysctl --system
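After `sysctl --system` runs, each of the three parameters should read back as 1. The sketch below reads the values from /proc/sys, where dots in a key map to directory separators; `check_sysctl` and its optional root-directory argument are hypothetical, added only to make the check easy to exercise.

```shell
# Compare one kernel parameter against the expected value and report PASS/FAIL.
# The root defaults to /proc/sys; dots in the key become path separators.
check_sysctl() {
  key="$1"; expected="$2"; root="${3:-/proc/sys}"
  path="$root/$(printf '%s' "$key" | tr . /)"
  actual="$(cat "$path" 2>/dev/null)"
  if [ "$actual" = "$expected" ]; then
    echo "PASS $key = $actual"
  else
    echo "FAIL $key = ${actual:-<unset>} (expected $expected)"
    return 1
  fi
}

for key in net.bridge.bridge-nf-call-iptables \
           net.bridge.bridge-nf-call-ip6tables \
           net.ipv4.ip_forward; do
  check_sysctl "$key" 1 || true
done
```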
3.4 Install containerd
containerd is the container runtime used in this lab; it is responsible for actually creating and running containers. Kubernetes 1.24 removed the built-in dockershim, so Docker Engine can no longer serve as the runtime without an extra adapter.
apt-get update
apt-get install -y containerd
mkdir -p /etc/containerd
containerd config default | tee /etc/containerd/config.toml
# Enable the systemd cgroup driver; the runtime and the kubelet must use the same cgroup driver
sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
systemctl restart containerd
systemctl enable containerd
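Because the cgroup driver change is a one-line text substitution, it is worth confirming that it took effect before moving on. The sketch below greps the config for an uncommented `SystemdCgroup = true` line; `uses_systemd_cgroup` is a hypothetical helper, and the path default matches the file written above.

```shell
# Succeeds if the config sets SystemdCgroup = true on an uncommented line.
uses_systemd_cgroup() {
  grep -Eqs '^[[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*true' \
    "${1:-/etc/containerd/config.toml}"
}

if uses_systemd_cgroup; then
  echo "containerd uses the systemd cgroup driver"
else
  echo "SystemdCgroup is not set to true (or the config file is missing)"
fi
```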
3.5 Install kubeadm, kubelet, and kubectl
The relationship between these three tools:
%% title: kubeadm, kubelet, kubectl Relationship
flowchart LR
kubeadm["kubeadm<br/>Cluster initialization tool"] --> cluster["Kubernetes Cluster"]
kubelet["kubelet<br/>Node agent<br/>Runs on each node"] --> pods["Manages Pod lifecycle"]
kubectl["kubectl<br/>Command-line client"] --> api["Interacts with API Server"]
- kubeadm: A one-time tool used to initialize the cluster
- kubelet: A daemon process responsible for creating and destroying Pods on the local node
- kubectl: The command-line tool used for day-to-day operations
apt-get install -y apt-transport-https ca-certificates curl gpg
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.31/deb/Release.key | \
gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.31/deb/ /' | \
tee /etc/apt/sources.list.d/kubernetes.list
apt-get update
apt-get install -y kubelet kubeadm kubectl
apt-mark hold kubelet kubeadm kubectl
apt-mark hold prevents these packages from being automatically upgraded. Kubernetes component versions need to be managed manually.
3.6 Initialize the Cluster
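Initialize the control plane. The Pod CIDR must contain the IP pool defined in the Calico manifest applied in step 3.7, whose default is 192.168.0.0/16:
kubeadm init --pod-network-cidr=192.168.0.0/16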
After initialization completes, configure kubectl access:
mkdir -p $HOME/.kube
cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
chown $(id -u):$(id -g) $HOME/.kube/config
3.7 Install Network Plugin (Calico)
Pods need network connectivity to communicate with each other. Calico is a CNI (Container Network Interface) plugin responsible for assigning IP addresses to Pods and handling network routing. Without a CNI plugin, Pods cannot communicate with each other.
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/tigera-operator.yaml
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/custom-resources.yaml
3.8 Allow Master Node to Schedule Pods
In a single-node cluster, this node serves as both the control plane and the worker node. By default, kubeadm taints control-plane nodes so that regular workloads are not scheduled on them. You need to remove this taint manually:
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
3.9 Verify Cluster Status
kubectl get nodes
Expected output (STATUS of Ready indicates the cluster is ready):
NAME STATUS ROLES AGE VERSION
hami-workshop Ready control-plane 2m v1.31.x
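The node can stay NotReady for a minute or two while the Calico Pods start. If you want to script the check instead of reading the table, the sketch below parses `kubectl get nodes --no-headers` output from stdin; `all_nodes_ready` is a hypothetical helper name.

```shell
# Reads `kubectl get nodes --no-headers` output on stdin.
# Exits 0 only if at least one node is listed and every STATUS is exactly "Ready".
all_nodes_ready() {
  awk '$2 != "Ready" { bad = 1 } END { exit (NR == 0 || bad) ? 1 : 0 }'
}

# Typical use on the VM:
#   kubectl get nodes --no-headers | all_nodes_ready && echo "cluster is ready"
```

kubectl also has a built-in blocking form, `kubectl wait --for=condition=Ready nodes --all --timeout=300s`, which is usually enough on its own.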
Step 4: Install Prometheus
Purpose
Prometheus is the cluster monitoring system, responsible for collecting and storing metrics from all components. Both HAMi and GPU Operator depend on Prometheus: HAMi's scheduler metrics, device plugin metrics, and GPU utilization metrics are all collected through it.
Why Install Prometheus First
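The GPU Operator and HAMi installed in the following steps create ServiceMonitor resources, which tell Prometheus which metrics endpoints to scrape. ServiceMonitor is a custom resource type installed by the Prometheus stack, so if Prometheus is not in place first, these resources either cannot be created or have no consumer.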
Instructions
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
-n monitoring --create-namespace \
--set grafana.enabled=false \
--version=75.15.1
--set grafana.enabled=false disables Grafana because the HAMi WebUI installed later will provide GPU visualization.
Verify Prometheus component status:
kubectl get po -n monitoring
All Pods should have a status of Running:
NAME READY STATUS RESTARTS AGE
prometheus-kube-prometheus-operator-xxxxxxxxxx-xxxxx 1/1 Running 0 2m
prometheus-kube-state-metrics-xxxxxxxxxx-xxxxx 1/1 Running 0 2m
prometheus-prometheus-kube-prometheus-prometheus-0 2/2 Running 0 2m
prometheus-prometheus-node-exporter-xxxxx 1/1 Running 0 2m
If the installation fails, uninstall first before retrying:
helm uninstall -n monitoring prometheus
Step 5: Install GPU Operator
Purpose
The NVIDIA GPU Operator automates the management of the GPU software stack (drivers, container toolkit, metrics collection, feature discovery). For a detailed explanation of each GPU Operator component, see HAMi Cluster Architecture.
Important: You must disable the GPU Operator's built-in device-plugin (--set devicePlugin.enabled=false) because HAMi provides its own enhanced device-plugin that supports VRAM partitioning and GPU sharing. The two cannot coexist.
Instructions
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set devicePlugin.enabled=false \
--set dcgmExporter.serviceMonitor.enabled=true \
--version=v25.3.0
The --wait flag waits for all Pods to be ready before returning. The first installation may take a few minutes to download NVIDIA driver images.
Wait for all Pods to be ready:
kubectl get pods -n gpu-operator
Expected output:
NAME READY STATUS RESTARTS AGE
gpu-operator-xxxxxxxxxx-xxxxx 1/1 Running 0 5m
nvidia-container-toolkit-daemonset-xxxxx 1/1 Running 0 4m
nvidia-cuda-validator-xxxxx 0/1 Completed 0 3m
nvidia-dcgm-exporter-xxxxx 1/1 Running 0 4m
nvidia-driver-daemonset-xxxxx 1/1 Running 0 5m
nvidia-gpu-feature-discovery-xxxxx 1/1 Running 0 4m
nvidia-operator-validator-xxxxx 1/1 Running 0 3m
The nvidia-cuda-validator status of Completed is normal. It is a one-time Job that exits after verifying CUDA availability.
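With this many Pods, it can help to print only the ones that need attention. The sketch below filters `kubectl get pods --no-headers` output and treats both Running and Completed as healthy, matching the note above; `unhealthy_pods` is a hypothetical helper, and it does not inspect the READY column.

```shell
# Reads `kubectl get pods --no-headers` output on stdin and prints the name and
# status of every Pod whose STATUS is neither Running nor Completed.
unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Typical use on the VM (empty output means nothing needs attention):
#   kubectl get pods -n gpu-operator --no-headers | unhealthy_pods
```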
Verify GPU Driver
Enter the nvidia-driver-daemonset Pod to verify the GPU driver is loaded correctly (for details on the call chain behind nvidia-smi, see Understanding GPU Drivers):
kubectl -n gpu-operator exec -it $(kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset -o name | head -1) -- nvidia-smi
The expected output includes GPU information (driver version, CUDA version, GPU model):
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03 Driver Version: 550.144.03 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
|=========================================+========================+======================|
| 0 Tesla T4 On | 00000000:00:04.0 Off | Off |
| N/A 35C P8 10W / 70W | 2MiB / 15360MiB | 0% Default |
+-----------------------------------------------------------------------------------------+
Step 6: Install HAMi
Purpose
Install the HAMi GPU virtualization platform to allow multiple Pods to share the same GPU. For HAMi's architecture and component details, see HAMi Cluster Architecture.
Instructions
Install HAMi open-source edition via the Helm repository:
# Add the HAMi Helm repository
helm repo add hami-charts https://project-hami.github.io/HAMi/
# Install HAMi
helm install hami hami-charts/hami -n kube-system
The HAMi open-source edition is installed in the kube-system namespace. After installation, HAMi automatically detects nodes with GPUs and starts the device-plugin.
Verify:
kubectl get pods -n kube-system | grep hami
Expected output:
hami-device-plugin-xxxxx 2/2 Running 0 2m
hami-scheduler-xxxxxxxxxx-xxxxx 1/1 Running 0 2m
Enable GPU Node
HAMi does not automatically take over all GPU nodes. You need to label the nodes that HAMi should manage. This design allows HAMi and non-HAMi nodes to coexist within the same cluster.
# Get the node name
NODE_NAME=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
# Label the node to be managed by HAMi
kubectl label nodes ${NODE_NAME} gpu=on
Verify GPU registration information:
kubectl get node ${NODE_NAME} -o jsonpath='{.metadata.annotations.hami\.io/node-nvidia-register}'
Expected output is similar to:
GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx,10,15360,100,NVIDIA-Tesla T4,0,true,0:
The format of this annotation is:
{Device UUID},{Number of partitions},{VRAM limit in MB},{Compute limit %},{GPU model},{NUMA},{Health status},{Index}
Here, Number of partitions = 10 means this GPU is virtualized into 10 vGPUs, which can be shared by up to 10 Pods.
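To make the annotation easier to read, the record can be split into labelled fields. The sketch below follows the field order in the format line above; `parse_hami_device` and the output key names are hypothetical, and treating the trailing colon as a record terminator is an assumption based on the sample output.

```shell
# Split one device record from the hami.io/node-nvidia-register annotation
# into labelled fields, one per line.
parse_hami_device() {
  record="${1%:}"   # drop the trailing ":" (assumed record terminator)
  printf '%s\n' "$record" | {
    IFS=, read -r uuid partitions vram_mb compute_pct model numa healthy index
    echo "uuid=$uuid"
    echo "partitions=$partitions"
    echo "vram_mb=$vram_mb"
    echo "compute_pct=$compute_pct"
    echo "model=$model"
    echo "numa=$numa"
    echo "healthy=$healthy"
    echo "index=$index"
  }
}

parse_hami_device 'GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx,10,15360,100,NVIDIA-Tesla T4,0,true,0:'
```

On the VM you could pass the live value instead, by wrapping the kubectl command from the verification step in `$(...)` as the argument.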
(Optional) Install HAMi WebUI
HAMi WebUI provides a visual management interface for GPU resources:
helm repo add hami-webui https://project-hami.github.io/HAMi-WebUI
helm install my-hami-webui hami-webui/hami-webui \
--set externalPrometheus.enabled=true \
--set externalPrometheus.address="http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090" \
--set dcgm-exporter.enabled=false \
-n kube-system
--set dcgm-exporter.enabled=false skips the chart's bundled dcgm-exporter because the GPU Operator already installed one, avoiding a duplicate deployment.
Access the WebUI via port forwarding:
kubectl port-forward service/my-hami-webui 3000:3000 --namespace=kube-system
Visit http://localhost:3000 to open the HAMi WebUI.