...

  1. Follow the CAPI docs (Cluster API Setup - CloudKB - Confluence (atlassian.net)) to spin up a Kubernetes cluster.

Note

These instructions are for A100 machines ONLY.

  1. Define the GPU node within the flavors.yaml file.

    Code Block
      # The following node groups are optional and can be enabled by uncommenting them
      # - name: md-l3-small
      #   machineFlavor: l3.small
      #   machineCount: 1
      - name: md-a100
        machineFlavor: g-a100.x1
        machineCount: 1

    (Note that the value of “- name:” cannot contain a “.”; otherwise it breaks.)
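
The naming rule above can be checked before editing flavors.yaml. The following is a minimal sketch; valid_group_name is a hypothetical helper, not part of CAPI or any existing tooling, and it only tests for the “.” restriction mentioned in the note.

```shell
# Hypothetical helper: succeeds only if the node group name is non-empty
# and contains no "." character.
valid_group_name() {
  case "$1" in
    ""|*.*) return 1 ;;  # empty, or contains a dot
    *) return 0 ;;
  esac
}

# Example: check candidate names before adding them to flavors.yaml.
for name in md-a100 md-a100.x1; do
  if valid_group_name "$name"; then
    echo "$name: ok"
  else
    echo "$name: contains '.', rename it"
  fi
done
```

Note that the restriction applies to the name only; machineFlavor values such as g-a100.x1 do contain a “.”.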

  2. Check that the GPU node is active by running:

...

  1. kubectl get pods -n gpu-operator
    The results will look similar to this:

    Code Block
    NAME                                       READY   STATUS                  RESTARTS           AGE
    gpu-feature-discovery-mccr2                1/1     Running                 0                  3d21h
    gpu-feature-discovery-stvzv                1/1     Running                 0                  3d21h
    gpu-operator-8587489d6d-v778r              1/1     Running                 1 (3d21h ago)      3d21h
    nvidia-container-toolkit-daemonset-4jcs4   1/1     Running                 0                  3d21h
    nvidia-container-toolkit-daemonset-wk7zp   1/1     Running                 0                  3d21h
    nvidia-cuda-validator-4n8b8                0/1     Init:CrashLoopBackOff   6 (73s ago)        7m12s
    nvidia-cuda-validator-7rh5d                0/1     Init:Error              5 (97s ago)        3m20s
    nvidia-dcgm-exporter-rllmn                 1/1     Running                 0                  3d21h
    nvidia-dcgm-exporter-w564r                 1/1     Running                 0                  3d21h
    nvidia-device-plugin-daemonset-8mcxp       0/1     CrashLoopBackOff        1101 (5m8s ago)    3d21h
    nvidia-device-plugin-daemonset-mv6pp       0/1     CrashLoopBackOff        1102 (4m51s ago)   3d21h
    nvidia-driver-daemonset-b7hvm              1/1     Running                 0                  3d21h
    nvidia-driver-daemonset-ntk77              1/1     Running                 0                  3d21h
    nvidia-mig-manager-ltr6n                   1/1     Running                 0                  3d21h
    nvidia-mig-manager-s2fvs                   1/1     Running                 0                  3d21h
    nvidia-node-status-exporter-gwzfz          1/1     Running                 0                  3d21h
    nvidia-node-status-exporter-mc558          1/1     Running                 0                  3d21h
    nvidia-operator-validator-4xzt5            0/1     Init:CrashLoopBackOff   824 (2m11s ago)    3d21h
    nvidia-operator-validator-pbj59            0/1     Init:2/4                823 (3m48s ago)    3d21h
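
In a long listing like the one above, the failing pods can be hard to spot. The sketch below filters the output down to pods that are not fully ready. The function name not_ready_pods is hypothetical, and it assumes the default column layout of kubectl get pods (NAME, READY, STATUS, ...). Pods in the Completed state are skipped, since they are expected to show 0/1.

```shell
# Hypothetical helper: reads `kubectl get pods` output on stdin and prints
# the name and status of every pod whose READY column is not n/n,
# ignoring pods that have simply run to completion.
not_ready_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")
    if (ready[1] != ready[2] && $3 != "Completed") print $1, $3
  }'
}

# Usage against a live cluster:
#   kubectl get pods -n gpu-operator | not_ready_pods
```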

  2. If some of your pods are failing as in the output above (for example, the validator and device-plugin pods in CrashLoopBackOff), then run:

...

  1. kubectl get nodes
    This lists the nodes so you can find your GPU node name (it should contain the node group name defined in flavors.yaml, for example md-a100).

    Code Block
    NAME                                             STATUS   ROLES           AGE     VERSION
    thanostwo-cluster-control-plane-1c924ed9-jlx4s   Ready    control-plane   3d21h   v1.24.11
    thanostwo-cluster-control-plane-1c924ed9-vvmrv   Ready    control-plane   3d21h   v1.24.11
    thanostwo-cluster-control-plane-1c924ed9-wqqjr   Ready    control-plane   3d21h   v1.24.11
    thanostwo-cluster-default-md-0-1c924ed9-gcsns    Ready    <none>          3d21h   v1.24.11
    thanostwo-cluster-default-md-0-1c924ed9-s6qr8    Ready    <none>          3d21h   v1.24.11
    thanostwo-cluster-default-md-0-1c924ed9-ttclc    Ready    <none>          3d21h   v1.24.11
    thanostwo-cluster-md-a100-e31d1485-h8vfr         Ready    <none>          3d21h   v1.24.11
    thanostwo-cluster-md-a100-e31d1485-r7r5g         Ready    <none>          3d21h   v1.24.11
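
To pick out the GPU nodes from the listing above without reading it by eye, the output can be filtered by node group name. This is a sketch only: nodes_in_group is a hypothetical helper, and it assumes node names embed the group name between dashes, as in the sample output above.

```shell
# Hypothetical helper: reads `kubectl get nodes` output on stdin and prints
# the names of nodes whose name contains "-<group>-".
nodes_in_group() {
  awk -v group="$1" 'NR > 1 && index($1, "-" group "-") { print $1 }'
}

# Usage against a live cluster:
#   kubectl get nodes | nodes_in_group md-a100
```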

  2. Lastly, label the desired node with the following command:

...

  1. kubectl label nodes NODE_NAME nvidia.com/mig.config=PROFILE_NAME
    Where:
    NODE_NAME is the GPU node name from the previous step.
    PROFILE_NAME is one of the profiles in the table below.

...

PROFILE_NAME    Number of MIG devices
all-1g.5gb      7
all-2g.10gb     3
all-3g.20gb     2
all-7g.40gb     1
all-balanced    "1g.5gb": 2, "2g.10gb": 1, "3g.20gb": 1
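
A mistyped profile name is easy to miss, so it can help to validate it against the table above before labelling the node. The following is a minimal sketch: valid_mig_profile and label_mig_node are hypothetical helper names, and the list of accepted profiles is just the five from the table, not an exhaustive list of what the GPU operator supports.

```shell
# Succeeds only for the profile names listed in the table above.
valid_mig_profile() {
  case "$1" in
    all-1g.5gb|all-2g.10gb|all-3g.20gb|all-7g.40gb|all-balanced) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical wrapper: validate the profile, then label the node.
label_mig_node() {
  node="$1"
  profile="$2"
  if ! valid_mig_profile "$profile"; then
    echo "unknown MIG profile: $profile" >&2
    return 1
  fi
  # --overwrite lets the command succeed if the label is already set.
  kubectl label nodes "$node" "nvidia.com/mig.config=$profile" --overwrite
}

# Usage against a live cluster (NODE_NAME as found with kubectl get nodes):
#   label_mig_node NODE_NAME all-1g.5gb
#   kubectl get nodes NODE_NAME -L nvidia.com/mig.config
```

The second usage line shows the label as an extra column, which confirms it was applied. Re-running kubectl get pods -n gpu-operator afterwards shows whether the failing pods have recovered.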

...

Reviewer

Review period

6 Months