MIG-Kubernetes
Overview
This document outlines how to enable MIG on GPU nodes within a Kubernetes cluster.
Instructions
Follow the CAPI docs (Cluster API Setup - CloudKB - Confluence (atlassian.net)) to spin up a Kubernetes cluster.
These instructions apply to A100 machines ONLY.
Define the GPU node group in the flavors.yaml file:
    # The following node groups are optional and can be enabled by uncommenting them
    # - name: md-l3-small
    #   machineFlavor: l3.small
    #   machineCount: 1
    - name: md-a100
      machineFlavor: g-a100.x1
      machineCount: 1
(Note that the “- name:” value cannot contain “.”; otherwise it breaks.)
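Since a dot in a node group name causes a failure, it may help to check names before editing flavors.yaml. The following is a minimal sketch only: check_group_name is a hypothetical helper, not part of CAPI or any tooling referenced on this page.

```shell
#!/bin/sh
# Hypothetical helper (assumption, not an existing tool):
# succeeds if the node group name contains no ".", fails otherwise.
check_group_name() {
  case "$1" in
    *.*) echo "invalid: $1 contains a dot"; return 1 ;;
    *)   echo "ok: $1"; return 0 ;;
  esac
}

# A valid name, as used in the flavors.yaml example on this page.
check_group_name "md-a100"
# An invalid name; "|| true" keeps the script going after the failure.
check_group_name "md-a100.x1" || true
```

Note that the restriction applies to the name only; the machineFlavor value (for example g-a100.x1) does contain a dot in the example above.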
Check that the GPU node is active by running:
kubectl get pods -n gpu-operator
Results will look similar to this:

    NAME                                       READY   STATUS                  RESTARTS           AGE
    gpu-feature-discovery-mccr2                1/1     Running                 0                  3d21h
    gpu-feature-discovery-stvzv                1/1     Running                 0                  3d21h
    gpu-operator-8587489d6d-v778r              1/1     Running                 1 (3d21h ago)      3d21h
    nvidia-container-toolkit-daemonset-4jcs4   1/1     Running                 0                  3d21h
    nvidia-container-toolkit-daemonset-wk7zp   1/1     Running                 0                  3d21h
    nvidia-cuda-validator-4n8b8                0/1     Init:CrashLoopBackOff   6 (73s ago)        7m12s
    nvidia-cuda-validator-7rh5d                0/1     Init:Error              5 (97s ago)        3m20s
    nvidia-dcgm-exporter-rllmn                 1/1     Running                 0                  3d21h
    nvidia-dcgm-exporter-w564r                 1/1     Running                 0                  3d21h
    nvidia-device-plugin-daemonset-8mcxp       0/1     CrashLoopBackOff        1101 (5m8s ago)    3d21h
    nvidia-device-plugin-daemonset-mv6pp       0/1     CrashLoopBackOff        1102 (4m51s ago)   3d21h
    nvidia-driver-daemonset-b7hvm              1/1     Running                 0                  3d21h
    nvidia-driver-daemonset-ntk77              1/1     Running                 0                  3d21h
    nvidia-mig-manager-ltr6n                   1/1     Running                 0                  3d21h
    nvidia-mig-manager-s2fvs                   1/1     Running                 0                  3d21h
    nvidia-node-status-exporter-gwzfz          1/1     Running                 0                  3d21h
    nvidia-node-status-exporter-mc558          1/1     Running                 0                  3d21h
    nvidia-operator-validator-4xzt5            0/1     Init:CrashLoopBackOff   824 (2m11s ago)    3d21h
    nvidia-operator-validator-pbj59            0/1     Init:2/4                823 (3m48s ago)    3d21h
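With this many pods, the failing ones can be easy to miss. As a rough sketch, the pod table can be piped through a small filter that prints only pods whose STATUS is neither Running nor Completed. The function name unhealthy_pods is hypothetical, and the filter assumes the default five-column layout of kubectl get pods.

```shell
#!/bin/sh
# Hypothetical filter (assumption): reads `kubectl get pods` table output on
# stdin and prints the NAME of every pod whose STATUS (third column) is
# neither Running nor Completed. The header row is skipped.
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 }'
}

# Usage against a live cluster (not executed here):
#   kubectl get pods -n gpu-operator | unhealthy_pods
```

Applied to the sample output, this would list the nvidia-cuda-validator, nvidia-device-plugin-daemonset and nvidia-operator-validator pods.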
If the pod list looks similar to the sample above, run:
kubectl get nodes
to find your GPU node name. It should contain the node group name defined in flavors.yaml (md-a100 in this example):

    NAME                                             STATUS   ROLES           AGE     VERSION
    thanostwo-cluster-control-plane-1c924ed9-jlx4s   Ready    control-plane   3d21h   v1.24.11
    thanostwo-cluster-control-plane-1c924ed9-vvmrv   Ready    control-plane   3d21h   v1.24.11
    thanostwo-cluster-control-plane-1c924ed9-wqqjr   Ready    control-plane   3d21h   v1.24.11
    thanostwo-cluster-default-md-0-1c924ed9-gcsns    Ready    <none>          3d21h   v1.24.11
    thanostwo-cluster-default-md-0-1c924ed9-s6qr8    Ready    <none>          3d21h   v1.24.11
    thanostwo-cluster-default-md-0-1c924ed9-ttclc    Ready    <none>          3d21h   v1.24.11
    thanostwo-cluster-md-a100-e31d1485-h8vfr         Ready    <none>          3d21h   v1.24.11
    thanostwo-cluster-md-a100-e31d1485-r7r5g         Ready    <none>          3d21h   v1.24.11
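On larger clusters, the GPU nodes can be picked out of the node list by matching on the node group name. The sketch below assumes the node name contains the group name from flavors.yaml, as it does in the sample output; gpu_nodes is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper (assumption): reads `kubectl get nodes` table output on
# stdin and prints the names of nodes whose NAME contains the given node
# group name (a plain substring match, not a regular expression).
gpu_nodes() {
  group="$1"
  awk -v g="$group" 'NR > 1 && index($1, g) > 0 { print $1 }'
}

# Usage against a live cluster (not executed here):
#   kubectl get nodes | gpu_nodes md-a100
```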
Lastly, label the desired node with the following command:
kubectl label nodes NODE_NAME nvidia.com/mig.config=PROFILE_NAME
Where:
NODE_NAME is from the previous step.
PROFILE_NAME can be selected from the table below.
PROFILE_NAME | Number of MIG devices |
---|---|
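The labeling step can be wrapped in a small function so that the node and profile are checked before kubectl is called. This is a sketch under stated assumptions: set_mig_profile is a hypothetical name, and the --overwrite flag is an addition not present in the command above; it lets kubectl replace an nvidia.com/mig.config label that is already set on the node.

```shell
#!/bin/sh
# Hypothetical wrapper (assumption) around the label command on this page.
# Arguments: NODE_NAME PROFILE_NAME (both are placeholders to be filled in).
set_mig_profile() {
  node="$1"
  profile="$2"
  if [ -z "$node" ] || [ -z "$profile" ]; then
    echo "usage: set_mig_profile NODE_NAME PROFILE_NAME" >&2
    return 2
  fi
  # --overwrite is an added assumption: it replaces an existing label value.
  kubectl label nodes "$node" "nvidia.com/mig.config=$profile" --overwrite
}

# Usage against a live cluster (not executed here):
#   set_mig_profile NODE_NAME PROFILE_NAME
#   kubectl get node NODE_NAME --show-labels
```

After labeling, the node labels can be inspected with the second command to confirm that nvidia.com/mig.config carries the chosen profile.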
References
NVIDIA Multi-Instance GPU User Guide :: NVIDIA Data Center GPU Driver Documentation
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#a100-profiles
Reviewer | Review period |
---|---|
Reviewed by @Ramzi Jalili May 6, 2024 | 6 Months |