MIG-Kubernetes

Overview

This document outlines how to enable MIG (Multi-Instance GPU) on GPU nodes within a Kubernetes cluster.

Instructions

  1. Follow the CAPI docs (Cluster API Setup, CloudKB on Confluence) to spin up a Kubernetes cluster.

The following steps apply to A100 machines ONLY.

  1. Define the GPU node within the flavors.yaml file.

    # The following node groups are optional and can be enabled by uncommenting them
    # - name: md-l3-small
    #   machineFlavor: l3.small
    #   machineCount: 1
    - name: md-a100
      machineFlavor: g-a100.x1
      machineCount: 1

    (Note: the value of "- name:" must not contain a "."; otherwise it breaks.)
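
The naming constraint in the note above can be checked before deploying. The following is a minimal sketch, assuming node groups are declared in flavors.yaml on lines of the form "- name: <value>". The function name check_group_names is hypothetical and not part of any CAPI tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report node group names that contain a ".".
# Assumes each group is declared on a line of the form "- name: <value>".
# Commented-out groups (lines starting with "#") are ignored.
check_group_names() {
  local file="$1" bad
  # Extract the value after "- name:" and keep only values containing a dot.
  bad=$(sed -nE 's/^[[:space:]]*-[[:space:]]*name:[[:space:]]*//p' "$file" \
        | grep -F '.' || true)
  if [ -n "$bad" ]; then
    printf 'invalid node group name (contains "."): %s\n' $bad
    return 1
  fi
  return 0
}
```

Running check_group_names flavors.yaml returns 0 when every active node group name is free of dots and returns 1, listing the offending names, otherwise.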

  2. Check that the GPU node is active by running:
    kubectl get pods -n gpu-operator
    The output will look similar to this:

    NAME                                       READY   STATUS                  RESTARTS           AGE
    gpu-feature-discovery-mccr2                1/1     Running                 0                  3d21h
    gpu-feature-discovery-stvzv                1/1     Running                 0                  3d21h
    gpu-operator-8587489d6d-v778r              1/1     Running                 1 (3d21h ago)      3d21h
    nvidia-container-toolkit-daemonset-4jcs4   1/1     Running                 0                  3d21h
    nvidia-container-toolkit-daemonset-wk7zp   1/1     Running                 0                  3d21h
    nvidia-cuda-validator-4n8b8                0/1     Init:CrashLoopBackOff   6 (73s ago)        7m12s
    nvidia-cuda-validator-7rh5d                0/1     Init:Error              5 (97s ago)        3m20s
    nvidia-dcgm-exporter-rllmn                 1/1     Running                 0                  3d21h
    nvidia-dcgm-exporter-w564r                 1/1     Running                 0                  3d21h
    nvidia-device-plugin-daemonset-8mcxp       0/1     CrashLoopBackOff        1101 (5m8s ago)    3d21h
    nvidia-device-plugin-daemonset-mv6pp       0/1     CrashLoopBackOff        1102 (4m51s ago)   3d21h
    nvidia-driver-daemonset-b7hvm              1/1     Running                 0                  3d21h
    nvidia-driver-daemonset-ntk77              1/1     Running                 0                  3d21h
    nvidia-mig-manager-ltr6n                   1/1     Running                 0                  3d21h
    nvidia-mig-manager-s2fvs                   1/1     Running                 0                  3d21h
    nvidia-node-status-exporter-gwzfz          1/1     Running                 0                  3d21h
    nvidia-node-status-exporter-mc558          1/1     Running                 0                  3d21h
    nvidia-operator-validator-4xzt5            0/1     Init:CrashLoopBackOff   824 (2m11s ago)    3d21h
    nvidia-operator-validator-pbj59            0/1     Init:2/4                823 (3m48s ago)    3d21h
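
With this many pods, the ones that are not healthy are easy to miss. The following is a minimal sketch of a filter over the table output of kubectl get pods. The function name unhealthy_pods is hypothetical, and it assumes the default column layout where STATUS is the third column.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name and status of every pod whose STATUS
# column is neither "Running" nor "Completed".
# Reads the table output of `kubectl get pods` on stdin; NR > 1 skips the header.
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}
```

It can be used as kubectl get pods -n gpu-operator | unhealthy_pods, which prints one line per pod that needs attention and nothing when all pods are healthy.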

     

  3. If your pods look similar to this, run:
    kubectl get nodes
    to find your GPU node name (it should contain the node group name defined in flavors.yaml, e.g. md-a100).

    NAME                                             STATUS   ROLES           AGE     VERSION
    thanostwo-cluster-control-plane-1c924ed9-jlx4s   Ready    control-plane   3d21h   v1.24.11
    thanostwo-cluster-control-plane-1c924ed9-vvmrv   Ready    control-plane   3d21h   v1.24.11
    thanostwo-cluster-control-plane-1c924ed9-wqqjr   Ready    control-plane   3d21h   v1.24.11
    thanostwo-cluster-default-md-0-1c924ed9-gcsns    Ready    <none>          3d21h   v1.24.11
    thanostwo-cluster-default-md-0-1c924ed9-s6qr8    Ready    <none>          3d21h   v1.24.11
    thanostwo-cluster-default-md-0-1c924ed9-ttclc    Ready    <none>          3d21h   v1.24.11
    thanostwo-cluster-md-a100-e31d1485-h8vfr         Ready    <none>          3d21h   v1.24.11
    thanostwo-cluster-md-a100-e31d1485-r7r5g         Ready    <none>          3d21h   v1.24.11
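
Picking the GPU nodes out of this list can be scripted. The following is a minimal sketch that keeps only the node names containing the node group name from flavors.yaml (md-a100 in this example). The function name gpu_nodes is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names of nodes whose name contains the given
# node group name. Reads the table output of `kubectl get nodes` on stdin.
gpu_nodes() {
  local group="$1"
  # NR > 1 skips the header row; index() does a plain substring match.
  awk -v g="$group" 'NR > 1 && index($1, g) > 0 { print $1 }'
}
```

It can be used as kubectl get nodes | gpu_nodes md-a100. With the output above, this would list the two md-a100 nodes.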

     

  4. Lastly, label the desired node with the following command:
    kubectl label nodes NODE_NAME nvidia.com/mig.config=PROFILE_NAME
    Where:
    NODE_NAME is from the previous step.
    PROFILE_NAME can be selected from the table below.

    PROFILE_NAME    Number of MIG devices
    all-1g.5gb      7
    all-2g.10gb     3
    all-3g.20gb     2
    all-7g.40gb     1
    all-balanced    "1g.5gb": 2, "2g.10gb": 1, "3g.20gb": 1
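
A mistyped profile name is easy to apply by accident, so it can help to validate it first. The following is a minimal sketch that builds the label command only for the profiles listed in the table above. The function name mig_label_cmd is hypothetical, and the --overwrite flag is an addition not in the original command. It lets the label be changed on a node that already has a profile set.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: print the kubectl label command for a node and profile.
# Rejects profile names that are not in the table above. The command is only
# printed (dry run), not executed.
mig_label_cmd() {
  local node="$1" profile="$2"
  case "$profile" in
    all-1g.5gb|all-2g.10gb|all-3g.20gb|all-7g.40gb|all-balanced) ;;
    *) echo "unknown MIG profile: $profile" >&2; return 1 ;;
  esac
  # --overwrite (an assumption, see above) replaces an existing mig.config label.
  echo kubectl label nodes "$node" "nvidia.com/mig.config=$profile" --overwrite
}
```

For example, mig_label_cmd thanostwo-cluster-md-a100-e31d1485-h8vfr all-1g.5gb prints the command to run. Afterwards, kubectl get node NODE_NAME --show-labels shows whether the label was applied. The GPU Operator's MIG manager is also expected to report progress in a nvidia.com/mig.config.state label, though that should be confirmed against the operator version in use.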

References

    Reviewer                                  Review period
    Reviewed by @Ramzi Jalili, May 6, 2024    6 Months