HAMi Cluster Architecture After Installation
After completing the HAMi installation, the cluster is no longer an ordinary Kubernetes cluster; it becomes an AI infrastructure platform with GPU virtualization capabilities. This document breaks down the responsibilities and dependencies of every layer and every component in the cluster after installation.
5-Layer Architecture Overview
The cluster after installation consists of 5 layers, each providing services to the layer above:
%% title: HAMi Cluster 5-Layer Architecture
flowchart TB
subgraph Infra["Kubernetes Infrastructure Layer"]
K8s["Kubernetes Control Plane"]
Cilium["Cilium Network"]
end
subgraph GPUStack["NVIDIA GPU Runtime Stack"]
Driver["NVIDIA Driver"]
Toolkit["Container Toolkit"]
DCGM["DCGM Exporter"]
GFD["GPU Feature Discovery"]
end
subgraph Scheduler["GPU Scheduling Layer"]
DevicePlugin["HAMi Device Plugin"]
SchedulerExt["HAMi Scheduler"]
end
subgraph Observe["Monitoring Layer"]
Prometheus["Prometheus"]
end
subgraph UI["Visualization Layer"]
WebUI["HAMi WebUI"]
end
GPUStack --> Scheduler
Scheduler --> K8s
GPUStack --> Observe
Observe --> UI
Looking from bottom to top:
| Layer | Role | Analogy |
|---|---|---|
| Kubernetes Infrastructure Layer | Container orchestration, network communication, resource management | Operating system |
| NVIDIA GPU Runtime Stack | Enables containers to access GPU hardware | GPU driver |
| GPU Scheduling Layer | GPU resource partitioning, sharing, and scheduling decisions | Resource manager |
| Monitoring Layer | Collects and stores metrics from all components | Monitoring system |
| Visualization Layer | Visual management interface for GPU resources | Dashboard |
The relationship between these 5 layers is a strict dependency: upper layers depend on lower layers, but lower layers do not depend on upper layers. For example, HAMi Scheduler needs Kubernetes to provide scheduling framework extension points, but Kubernetes itself does not care about HAMi's existence.
Helm Releases
Multiple Releases are deployed through Helm during installation. What is a Helm Release? You can think of it as a running instance of an application package, similar to how apt install installs a software package, or docker compose up starts a set of services. Each Release contains a set of Kubernetes resources (Pods, Services, ConfigMaps, etc.) that can be installed, upgraded, and uninstalled together.
Run helm list -A to see all Releases:
NAMESPACE NAME CHART STATUS
gpu-operator gpu-operator-xxxxxxxxxx gpu-operator-v25.3.0 deployed
kube-system hami hami-2.x.x deployed
monitoring prometheus kube-prometheus-stack-75.15.1 deployed
kube-system my-hami-webui hami-webui-x.x.x deployed
kube-system cilium cilium-x.x.x deployed
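The same check can be scripted. The sketch below assumes the listing is fetched as JSON with helm list -A -o json and that each entry carries name, namespace, and status fields. The sample data is hypothetical, and gpu-operator is left out of the expected set because its release name carries a generated suffix.

```python
import json

# Release names and namespaces taken from the listing above.
# gpu-operator is omitted: its release name has a generated suffix.
EXPECTED_RELEASES = {
    ("kube-system", "cilium"),
    ("kube-system", "hami"),
    ("monitoring", "prometheus"),
    ("kube-system", "my-hami-webui"),
}


def undeployed_releases(helm_json, expected=EXPECTED_RELEASES):
    """Return expected (namespace, name) pairs that are missing or not 'deployed'."""
    releases = json.loads(helm_json)
    deployed = {
        (r["namespace"], r["name"])
        for r in releases
        if r.get("status") == "deployed"
    }
    return sorted(set(expected) - deployed)


# Hypothetical sample shaped like `helm list -A -o json` output.
sample = json.dumps([
    {"namespace": "kube-system", "name": "cilium", "status": "deployed"},
    {"namespace": "kube-system", "name": "hami", "status": "deployed"},
    {"namespace": "monitoring", "name": "prometheus", "status": "failed"},
])

print(undeployed_releases(sample))
```

With this sample, the function reports prometheus (status failed) and my-hami-webui (absent) as needing attention.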
Responsibilities of Each Release
| Release | Namespace | Responsibility | Installation Timing |
|---|---|---|---|
| cilium | kube-system | CNI network plugin, provides network connectivity for Pods | After K8s installation, replacing Calico |
| gpu-operator | gpu-operator | Automated management of the NVIDIA GPU software stack, automatically deploys drivers, toolkits, and metrics collectors | After Prometheus installation |
| hami | kube-system | GPU virtualization and scheduling enhancement, supports VRAM partitioning and multi-Pod GPU sharing | After GPU Operator installation |
| prometheus | monitoring | Cluster monitoring, collects and stores metrics data from all components | After K8s installation, the first monitoring component installed |
| my-hami-webui | kube-system | GPU resource visualization interface, displays GPU usage and scheduling status | After HAMi installation (optional) |
The installation order follows these dependencies:
%% title: Helm Release Installation Dependencies
flowchart LR
Cilium["cilium<br/>CNI Network"] --> GPUOp["gpu-operator<br/>GPU Runtime"]
GPUOp --> HAMi["hami<br/>GPU Scheduling"]
HAMi --> WebUI["my-hami-webui<br/>Visualization"]
Prometheus["prometheus<br/>Monitoring"] --> GPUOp
Prometheus --> WebUI
Prometheus must be installed before GPU Operator and HAMi WebUI because both depend on it: GPU Operator exposes GPU metrics for Prometheus to collect, and HAMi WebUI reads the data it displays from Prometheus.
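One way to turn the dependency diagram into a concrete installation order is a topological sort. The following is a minimal sketch using Python's standard graphlib module (Python 3.9+); the PREREQUISITES mapping is transcribed from the arrows in the diagram above, and the function name is illustrative.

```python
from graphlib import TopologicalSorter

# Each release maps to the releases that must be installed before it,
# following the arrows in the dependency diagram above.
PREREQUISITES = {
    "cilium": set(),
    "prometheus": set(),
    "gpu-operator": {"cilium", "prometheus"},
    "hami": {"gpu-operator"},
    "my-hami-webui": {"hami", "prometheus"},
}


def install_order(prereqs):
    """Return one valid installation order (prerequisites come first)."""
    return list(TopologicalSorter(prereqs).static_order())


order = install_order(PREREQUISITES)
print(order)
```

Several orders are valid (cilium and prometheus can be swapped, for example), so the sketch guarantees only that every release appears after its prerequisites.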
Pod Details
After installation, a large number of Pods are running in the cluster; running kubectl get pods -A lists them all. This section explains the role of each Pod by category.
K8s Core Components
These are Kubernetes control plane components, created by kubeadm during cluster initialization.
| Pod | Role |
|---|---|
| kube-apiserver | Kubernetes API entry point; all components (kubectl, scheduler, controller-manager) communicate through it |
| kube-scheduler | Determines which node a Pod runs on; HAMi Scheduler participates as an extension |
| kube-controller-manager | Runs various controllers (Deployment, ReplicaSet, Node, etc.) to maintain the cluster's desired state |
| etcd | Distributed key-value store, stores all cluster state data (Pods, Services, ConfigMaps, etc.) |
| kube-proxy | Runs on each Node, maintains Service network forwarding rules (iptables/IPVS) |
| coredns | In-cluster DNS service, provides name resolution for Services |
Without them: The entire Kubernetes cluster cannot run. kube-apiserver is the only component that reads and writes etcd directly; if it goes down, the entire control plane is paralyzed.
Significance for AI Infrastructure: Kubernetes is the orchestration foundation for GPU workloads. Without K8s, GPU tasks can only be manually assigned to machines, making automatic scheduling, elastic scaling, and fault recovery impossible.
Network Components
Created by the Cilium Helm Release.
| Pod | Role |
|---|---|
| cilium | DaemonSet, one per node, responsible for Pod-to-Pod network connectivity, network policies, and load balancing |
| cilium-operator | Deployment, manages Cilium's control plane logic (IP address allocation, inter-node routing, garbage collection) |
Without them: Pods cannot communicate with each other. Kubernetes' Service mechanism completely fails, DNS resolution doesn't work, and cross-Pod distributed training cannot proceed.
Significance for AI Infrastructure: AI training often involves distributed workloads (multi-Pod collaborative training), and network performance directly affects training efficiency. Cilium is based on eBPF technology, providing high-performance network forwarding.
NVIDIA GPU Runtime Stack
Created by the gpu-operator Helm Release. The core problem this layer solves is: enabling containers to access GPU hardware.
| Pod | Role |
|---|---|
| nvidia-driver-daemonset | DaemonSet, installs the NVIDIA kernel driver on each GPU node. It manages the driver in a containerized manner, avoiding the tedious process of manually compiling and installing drivers on the host |
| nvidia-container-toolkit-daemonset | DaemonSet, configures containerd on each node so it knows how to mount GPU devices and libraries into containers. Modifies containerd configuration to register the nvidia runtime |
| gpu-feature-discovery | DaemonSet, detects the GPU model, VRAM, compute capability, and other information on the local node, and writes it as Labels and Annotations to the Node object for scheduler decision-making |
| nvidia-dcgm-exporter | DaemonSet, collects GPU utilization, VRAM usage, temperature, power consumption, and other metrics through DCGM (Data Center GPU Manager), exposing them in Prometheus format |
| nvidia-operator-validator | DaemonSet, validates whether the GPU software stack is functioning properly (driver loaded, Toolkit configured, node ready) |
| nvidia-cuda-validator | Job, runs once and exits, verifies that the CUDA runtime is available |
Without them:
- Without nvidia-driver-daemonset: The GPU hardware cannot be recognized by the operating system, and the nvidia-smi command does not exist
- Without nvidia-container-toolkit-daemonset: GPUs cannot be used inside containers; even if the driver is installed, running nvidia-smi in a Pod will result in an error
- Without gpu-feature-discovery: The scheduler does not know what GPUs the nodes have, and cannot make scheduling decisions based on GPU model or VRAM
- Without nvidia-dcgm-exporter: Prometheus has no GPU metrics data, making it impossible to monitor GPU utilization
- Without validator: There is no way to automatically detect GPU software stack installation issues
Significance for AI Infrastructure: The GPU Runtime Stack is the infrastructure layer for AI workloads. It transforms GPUs from "bare-metal devices" into "Kubernetes-manageable resources". Without this layer, AI training and inference tasks cannot run in a containerized manner.
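To see what gpu-feature-discovery has written to a node, you can filter the Node object's labels. The sketch below assumes the node is fetched with kubectl get node NAME -o json and that the GPU labels share the nvidia.com/ prefix. The sample node and its label values are illustrative, not output from a real cluster.

```python
import json

GPU_LABEL_PREFIX = "nvidia.com/"


def gpu_labels(node_json):
    """Return only the GPU-related labels from a Node object in JSON form."""
    node = json.loads(node_json)
    labels = node.get("metadata", {}).get("labels", {})
    return {k: v for k, v in labels.items() if k.startswith(GPU_LABEL_PREFIX)}


# Hypothetical sample shaped like `kubectl get node NAME -o json` output.
sample_node = json.dumps({
    "kind": "Node",
    "metadata": {
        "name": "gpu-node-1",
        "labels": {
            "kubernetes.io/hostname": "gpu-node-1",
            "nvidia.com/gpu.product": "Tesla-T4",
            "nvidia.com/gpu.memory": "15360",
            "nvidia.com/gpu.count": "1",
        },
    },
})

for key, value in sorted(gpu_labels(sample_node).items()):
    print(f"{key} = {value}")
```

An empty result on a GPU node suggests that gpu-feature-discovery is not running there or has not yet completed.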
HAMi Scheduling Components
Created by the hami Helm Release. The core problem this layer solves is: enabling GPUs to go from whole-card allocation to partitionable and shareable.
| Pod | Role |
|---|---|
| hami-scheduler | Deployment, the HAMi scheduler. It registers as a Kubernetes Scheduler Extender and participates in GPU scheduling decisions for Pods. Supports advanced features such as binpack/spread strategies, priority scheduling, and GPU resource quotas |
| hami-device-plugin | DaemonSet (one per GPU node), replaces the native NVIDIA device-plugin. It registers custom GPU resources (VRAM, compute) with kubelet and performs VRAM partitioning and device mounting when Pods are created |
%% title: HAMi GPU Scheduling Flow
flowchart TB
subgraph User["User Submits Workload"]
Pod["Pod<br/>request: 2GB VRAM"]
end
subgraph K8sScheduler["Kubernetes Scheduling Flow"]
KS["kube-scheduler<br/>Filter + Score"]
HS["hami-scheduler<br/>GPU Scheduling Enhancement"]
end
subgraph Node["On Node"]
Kubelet["kubelet"]
HDP["hami-device-plugin"]
GPU["GPU Hardware"]
end
Pod --> KS
KS -->|"Calls Extender"| HS
HS -->|"Returns scheduling decision"| KS
KS -->|"Binds to node"| Kubelet
Kubelet -->|"Creates container"| HDP
HDP -->|"Partitions VRAM<br/>Mounts device"| GPU
Without them:
- Without hami-scheduler: GPU scheduling falls back to Kubernetes' default behavior, allowing only whole-card allocation; VRAM partitioning and multi-Pod sharing become impossible
- Without hami-device-plugin: kubelet does not recognize HAMi's custom resources (nvidia.com/gpumem, nvidia.com/gpucores), and Pod GPU resource requests cannot be fulfilled
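To show how a workload might ask for these custom resources, the sketch below builds a Pod manifest as a Python dictionary and prints it as JSON, which kubectl apply -f accepts. Several details are assumptions to verify against the HAMi documentation for your version: that nvidia.com/gpu is requested alongside the HAMi resources, that nvidia.com/gpumem is expressed in MiB, and that nvidia.com/gpucores is a percentage. The Pod name and image are hypothetical.

```python
import json


def inference_pod_manifest(name, image, gpumem_mib, gpucores_percent=None):
    """Build a Pod manifest that requests a slice of one GPU.

    Assumptions (check against your HAMi version): gpumem is in MiB,
    gpucores is a percentage of one GPU's compute.
    """
    limits = {
        "nvidia.com/gpu": 1,              # one (virtual) GPU
        "nvidia.com/gpumem": gpumem_mib,  # VRAM slice, assumed MiB
    }
    if gpucores_percent is not None:
        limits["nvidia.com/gpucores"] = gpucores_percent
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {"name": name, "image": image, "resources": {"limits": limits}}
            ]
        },
    }


# Hypothetical name and image; 2048 MiB corresponds to the 2GB example.
manifest = inference_pod_manifest("demo-inference", "example.com/inference:latest", 2048)
print(json.dumps(manifest, indent=2))
```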
Significance for AI Infrastructure: HAMi is key to GPU utilization. Without HAMi, an inference service that only needs 2GB of VRAM would occupy an entire 16GB GPU, wasting 87.5% of its VRAM. HAMi lets multiple workloads share the same GPU, which can raise GPU utilization several times over.
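The 87.5% figure follows directly from the ratio of requested to total VRAM. A small sketch of the arithmetic, with illustrative function names, also gives a rough upper bound on how many such services could share one card if VRAM were the only constraint:

```python
def whole_card_waste(requested_gb, card_gb):
    """Fraction of VRAM left idle when one workload holds the whole card."""
    return 1 - requested_gb / card_gb


def max_sharing_pods(requested_gb, card_gb):
    """Upper bound on equal-sized workloads per card, counting VRAM only."""
    return int(card_gb // requested_gb)


print(f"{whole_card_waste(2, 16):.1%}")  # 87.5%
print(max_sharing_pods(2, 16))           # 8
```

The bound ignores compute contention and any per-process memory overhead, so the practical number is likely lower.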
Monitoring Components
Created by the prometheus Helm Release.
| Pod | Role |
|---|---|
| prometheus-prometheus-kube-prometheus-prometheus-0 | Main Prometheus Server instance, responsible for collecting and storing all metrics data. Automatically discovers scrape targets through ServiceMonitor |
| prometheus-kube-prometheus-operator | Prometheus Operator, manages the lifecycle of Prometheus and Alertmanager, automatically generates configuration |
| prometheus-kube-state-metrics | Listens to the Kubernetes API, converts cluster state (Deployments, Pods, Nodes, etc.) into Prometheus metrics |
| prometheus-prometheus-node-exporter | DaemonSet, one per node, collects node-level CPU, memory, disk, network, and other hardware metrics |
| alertmanager-prometheus-kube-prometheus-alertmanager-0 | Alertmanager, processes alerts from Prometheus, performing deduplication, grouping, routing, and sending notifications |
Without them:
- Without Prometheus Server: All metrics data has nowhere to be stored, and HAMi WebUI cannot display GPU usage charts
- Without Prometheus Operator: Every time a new scrape target is added, the Prometheus configuration must be manually modified and restarted
- Without kube-state-metrics: It is not possible to understand cluster state through metrics (Pod restart counts, Deployment replica count deviations, etc.)
- Without node-exporter: It is not possible to monitor node-level hardware resource usage
- Without Alertmanager: Abnormal conditions such as GPU overheating or insufficient VRAM cannot trigger automatic notifications
Significance for AI Infrastructure: AI workloads are typically long-running training tasks or high-throughput inference services. Monitoring serves not only observation but also early warning: excessively high GPU temperatures can interrupt training, and VRAM leaks can cause OOM errors. Without monitoring, operating AI infrastructure is like fumbling in the dark.
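As an illustration of the early-warning idea, the sketch below builds a Prometheus instant-query URL for the DCGM temperature metric and picks out GPUs above a threshold from a response. It assumes the standard /api/v1/query endpoint and the DCGM_FI_DEV_GPU_TEMP metric name. The base URL, the threshold, the Hostname and gpu label names, and the sample response are assumptions for the example. In production this logic would normally live in an alerting rule handled by Alertmanager rather than in a script.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build a Prometheus instant-query URL for the given PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def hot_gpus(response_json, limit_celsius):
    """Return (host, gpu, temperature) for every series above the limit."""
    body = json.loads(response_json)
    if body.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    hot = []
    for series in body["data"]["result"]:
        _timestamp, value = series["value"]
        temperature = float(value)
        if temperature > limit_celsius:
            labels = series["metric"]
            # Label names are assumptions; adjust to your exporter's labels.
            hot.append((labels.get("Hostname", "?"), labels.get("gpu", "?"), temperature))
    return hot


# Hypothetical response in the Prometheus instant-query format.
sample = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"Hostname": "node-1", "gpu": "0"}, "value": [1700000000, "85"]},
            {"metric": {"Hostname": "node-2", "gpu": "0"}, "value": [1700000000, "60"]},
        ],
    },
})

print(instant_query_url("http://prometheus.example:9090", "DCGM_FI_DEV_GPU_TEMP"))
print(hot_gpus(sample, 80.0))
```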
Complete System Architecture
The following shows the complete connection relationships between all components after installation:
%% title: Complete System Architecture
flowchart TB
subgraph Hardware["Hardware Layer"]
GPU["Tesla T4 GPU"]
CPU["CPU / Memory / Disk"]
NIC["Network Card"]
end
subgraph K8sCore["Kubernetes Control Plane"]
API["kube-apiserver"]
KS["kube-scheduler"]
CM["kube-controller-manager"]
ETCD["etcd"]
DNS["CoreDNS"]
Proxy["kube-proxy"]
end
subgraph Network["Network Layer"]
Cilium["cilium DaemonSet"]
CiliumOp["cilium-operator"]
end
subgraph GPURuntime["NVIDIA GPU Runtime Stack"]
Driver["nvidia-driver-daemonset"]
Toolkit["nvidia-container-toolkit"]
DCGM["dcgm-exporter"]
GFD["gpu-feature-discovery"]
Validator["validator"]
end
subgraph HAMiLayer["HAMi GPU Scheduling Layer"]
HAMiDP["hami-device-plugin"]
HAMiSched["hami-scheduler"]
end
subgraph Monitor["Monitoring Layer"]
Prom["Prometheus Server"]
PromOp["Prometheus Operator"]
KSM["kube-state-metrics"]
NodeExp["node-exporter"]
Alert["Alertmanager"]
end
subgraph Visualization["Visualization Layer"]
WebUI["HAMi WebUI"]
end
subgraph Workloads["AI Workloads"]
Train["Training Tasks"]
Infer["Inference Services"]
end
%% Hardware to Runtime
GPU -->|"Kernel driver"| Driver
CPU --> NodeExp
NIC --> Cilium
%% Runtime internal
Driver --> Toolkit
Driver --> DCGM
Driver --> GFD
Driver --> Validator
%% Runtime to K8s
Toolkit -->|"Registers nvidia runtime"| API
GFD -->|"Applies GPU Labels"| API
%% HAMi to K8s
HAMiSched -->|"Scheduler Extender"| KS
HAMiDP -->|"Registers GPU resources"| API
KS --> API
CM --> API
%% Monitoring collection
DCGM -->|"GPU metrics"| Prom
NodeExp -->|"Node metrics"| Prom
KSM -->|"Cluster state metrics"| Prom
Prom --> Alert
PromOp --> Prom
%% Visualization
Prom -->|"Queries metrics"| WebUI
%% Workloads
Workloads -->|"Requests GPU resources"| API
Workloads -->|"Uses via Toolkit"| GPU
Cross-Layer Collaboration Flow
Take a typical scenario as an example: submitting an inference Pod that requires 2GB of VRAM. The layers collaborate as follows:
%% title: Inference Pod Cross-Layer Collaboration
sequenceDiagram
participant User as User
participant API as kube-apiserver
participant KS as kube-scheduler
participant HS as hami-scheduler
participant Kubelet as kubelet
participant HDP as hami-device-plugin
participant Driver as NVIDIA Driver
participant Prom as Prometheus
User->>API: Submit Pod (request: 2GB VRAM)
API->>KS: Trigger scheduling
KS->>HS: Call Extender, query GPU scheduling recommendation
HS->>HS: Query node GPU resource annotations, calculate optimal node
HS-->>KS: Return recommended node
KS->>API: Bind Pod to target node
API->>Kubelet: Notify to create Pod
Kubelet->>HDP: Request GPU resource allocation
HDP->>HDP: Execute VRAM partitioning (allocate 2GB from physical GPU)
HDP->>Driver: Complete device mounting via NVIDIA libraries
HDP-->>Kubelet: Return allocation result
Kubelet->>Kubelet: Start container
Note over Prom: DCGM Exporter continuously collects GPU metrics
Prom->>Prom: Scrape GPU utilization, VRAM usage, and other metrics
Throughout this process, each layer fulfills its role: Kubernetes provides the scheduling framework and Pod lifecycle management, the GPU Runtime Stack provides hardware access capabilities, HAMi makes GPU-level scheduling decisions and resource partitioning in the middle, and Prometheus continuously collects metrics in the background.
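The "calculate optimal node" step can be pictured with a toy model of the binpack and spread strategies mentioned earlier. This is a simplified illustration, not HAMi's actual scoring logic: the GpuSlot type, the pick_gpu function, and the sample nodes are all hypothetical, and only free VRAM is considered.

```python
from dataclasses import dataclass


@dataclass
class GpuSlot:
    node: str
    total_mib: int
    used_mib: int

    @property
    def free_mib(self):
        return self.total_mib - self.used_mib


def pick_gpu(slots, request_mib, strategy="binpack"):
    """Pick a GPU with enough free VRAM, or None if the request does not fit.

    binpack: prefer the fullest GPU that still fits (keeps other GPUs empty).
    spread:  prefer the emptiest GPU (balances load across GPUs).
    """
    candidates = [s for s in slots if s.free_mib >= request_mib]
    if not candidates:
        return None
    if strategy == "binpack":
        return min(candidates, key=lambda s: s.free_mib)
    if strategy == "spread":
        return max(candidates, key=lambda s: s.free_mib)
    raise ValueError(f"unknown strategy: {strategy}")


slots = [
    GpuSlot("node-a", 16384, 12288),  # 4096 MiB free
    GpuSlot("node-b", 16384, 2048),   # 14336 MiB free
    GpuSlot("node-c", 16384, 15360),  # 1024 MiB free, too small for 2048
]

print(pick_gpu(slots, 2048, "binpack").node)  # node-a
print(pick_gpu(slots, 2048, "spread").node)   # node-b
```

Binpack concentrates small workloads so that whole GPUs stay free for large requests, while spread reduces contention between neighbours; which one suits a cluster depends on its workload mix.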
Summary
| Layer | Core Components | Core Problem Solved |
|---|---|---|
| Kubernetes Infrastructure Layer | kube-apiserver, kube-scheduler, etcd, cilium | Container orchestration and network communication |
| NVIDIA GPU Runtime Stack | driver, toolkit, dcgm-exporter, gfd | Enabling containers to access GPUs |
| GPU Scheduling Layer | hami-scheduler, hami-device-plugin | GPU resource partitioning and sharing |
| Monitoring Layer | prometheus, node-exporter, kube-state-metrics | Metrics collection and alerting |
| Visualization Layer | HAMi WebUI | GPU resource visualization |
Once you understand the responsibilities and dependencies of these components, you can quickly identify which layer has a problem when issues arise in the cluster: if Pods cannot be scheduled, check the scheduling layer; if GPUs are unavailable, check the Runtime Stack; if metrics are missing, check the monitoring layer.
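This layer-first triage can be sketched as a small script that groups unhealthy Pods by layer. It assumes the input comes from kubectl get pods -A -o json (items with metadata.name and status.phase). The prefix table is derived from the Pod names in this document and would need adjusting for other release names. A Pod in the Running phase is treated as healthy, which is a simplification since a running Pod may still not be ready.

```python
import json

# Pod name prefixes mapped to layers, based on the Pod names in this document.
LAYER_BY_PREFIX = [
    ("hami-scheduler", "GPU Scheduling Layer"),
    ("hami-device-plugin", "GPU Scheduling Layer"),
    ("my-hami-webui", "Visualization Layer"),
    ("nvidia-", "NVIDIA GPU Runtime Stack"),
    ("gpu-feature-discovery", "NVIDIA GPU Runtime Stack"),
    ("prometheus-", "Monitoring Layer"),
    ("alertmanager-", "Monitoring Layer"),
    ("cilium", "Kubernetes Infrastructure Layer"),
    ("kube-", "Kubernetes Infrastructure Layer"),
    ("etcd", "Kubernetes Infrastructure Layer"),
    ("coredns", "Kubernetes Infrastructure Layer"),
]

HEALTHY_PHASES = {"Running", "Succeeded"}  # Succeeded covers one-shot Jobs


def layer_of(pod_name):
    for prefix, layer in LAYER_BY_PREFIX:
        if pod_name.startswith(prefix):
            return layer
    return "Other"


def unhealthy_by_layer(pods_json):
    """Group Pods whose phase is not Running/Succeeded by architecture layer."""
    report = {}
    for pod in json.loads(pods_json).get("items", []):
        name = pod["metadata"]["name"]
        phase = pod.get("status", {}).get("phase", "Unknown")
        if phase not in HEALTHY_PHASES:
            report.setdefault(layer_of(name), []).append(name)
    return report


# Hypothetical sample shaped like `kubectl get pods -A -o json` output.
sample = json.dumps({"items": [
    {"metadata": {"name": "hami-scheduler-abc"}, "status": {"phase": "Pending"}},
    {"metadata": {"name": "nvidia-cuda-validator-x"}, "status": {"phase": "Succeeded"}},
    {"metadata": {"name": "prometheus-kube-state-metrics-y"}, "status": {"phase": "Running"}},
]})

print(unhealthy_by_layer(sample))
```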