How-To: Install The NVIDIA® Data Center GPU Manager (DCGM)

Alexander Hill

Overview

The NVIDIA® Data Center GPU Manager (DCGM) simplifies administration of NVIDIA Datacenter (previously “Tesla”) GPUs in cluster and datacenter environments. At its heart, DCGM is an intelligent, lightweight user space library/agent that performs a variety of functions on each host system:

  • GPU behavior monitoring
  • GPU configuration management
  • GPU policy oversight
  • GPU health and diagnostics
  • GPU accounting and process statistics
  • NVSwitch configuration and monitoring

 

Supported Linux Distributions

Linux Distributions and Architectures
Linux Distribution           x86 (x86_64)   Arm64 (aarch64)
Debian 12                    X              -
RHEL 8.y / Rocky Linux 8.y   X              X
RHEL 9.y / Rocky Linux 9.y   X              X
SLES / OpenSUSE 15.y         X              X
Ubuntu 24.04 LTS             X              X
Ubuntu 22.04 LTS             X              X
Ubuntu 20.04 LTS             X              X

 

Installation

System Requirements

Note

These requirements apply equally to container, virtual machine, and bare-metal deployments. Attempting to run DCGM in an environment that does not meet them is unlikely to succeed.

Resource                           Requirement
Minimum System Memory (Host RAM)   >= 16 GB
Minimum CPU Cores                  >= number of GPUs
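A quick way to check a host against these requirements is to compare total memory and CPU-core count with the number of GPUs that nvidia-smi reports. The pipeline below is just one illustrative approach:

$ free -g | awk '/^Mem:/ {print $2 " GiB total RAM"}'
$ echo "$(nproc) CPU cores, $(nvidia-smi --list-gpus | wc -l) GPUs"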

 

 

Prerequisites

1. The system package manager has been configured to use an NVIDIA package registry for the system’s Linux distribution. If a local CUDA package repository on disk is used instead, it is recommended to update it to the latest available version.

Please refer to the CUDA installation guide for detailed steps.
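As one illustration, enabling the NVIDIA network repository on Ubuntu 24.04 (x86_64) involves installing the cuda-keyring package; the exact URL and keyring version depend on your distribution and architecture, so confirm them against the CUDA installation guide:

$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
$ sudo dpkg -i cuda-keyring_1.1-1_all.deb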

2. Installations of the following NVIDIA software must be present on the system:

    1. A supported NVIDIA Datacenter Driver

Warning

DCGM is tested and designed to run with NVIDIA Datacenter Drivers. Attempting to run on other drivers, such as a developer driver, could result in missing functionality.

Please refer to the documentation on the various types of branches and support timelines.
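To confirm which driver is currently installed, query its version with nvidia-smi (the version shown here is only an example; compare yours against the datacenter driver branch documentation):

$ nvidia-smi --query-gpu=driver_version --format=csv,noheader
550.90.07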

 

    2. On systems with NVSwitch™ hardware, such as NVIDIA DGX™ systems and NVIDIA HGX™ systems, the following packages must also be installed:

        • the Fabric Manager package
        • the NVSwitch™ Configuration & Query (NSCQ) package
        • the NVIDIA Switch Device Monitoring (NVSDM) package

For more information, please refer to the Fabric Manager User Guide (Fabric Manager package), the HGX Software Guide (NSCQ package), and the Driver Installation Guide (NVSDM package).
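On such systems, a quick presence check is to grep the package database. The exact package names vary by driver branch (for example, nvidia-fabricmanager-<branch> and libnvidia-nscq-<branch>), so treat the patterns below as illustrative:

$ dpkg -l | grep -E 'fabricmanager|nscq|nvsdm'    # Ubuntu / Debian
$ rpm -qa | grep -E 'fabricmanager|nscq|nvsdm'    # RHEL / Rocky Linux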

 

Installation

Ubuntu LTS and Debian

1. Remove any installations of the datacenter-gpu-manager and datacenter-gpu-manager-config packages.

$ sudo dpkg --list datacenter-gpu-manager &> /dev/null && \
    sudo apt purge --yes datacenter-gpu-manager

$ sudo dpkg --list datacenter-gpu-manager-config &> /dev/null && \
    sudo apt purge --yes datacenter-gpu-manager-config

 

2. Update the package registry cache.

$ sudo apt-get update

 

3. Install the datacenter-gpu-manager-4 package corresponding to the system CUDA version.

$ CUDA_VERSION=$(nvidia-smi | sed -E -n 's/.*CUDA Version: ([0-9]+)[.].*/\1/p')
$ sudo apt-get install --yes \
                       --install-recommends \
                       datacenter-gpu-manager-4-cuda${CUDA_VERSION}
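The sed expression extracts the major CUDA version reported in the nvidia-smi header. You can confirm the detected value before installing; on a system with a CUDA 12.x driver, for example, it prints:

$ echo "${CUDA_VERSION}"
12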

Installing the recommended packages provides additional DCGM functionality that is not present in the DCGM open-source product. To opt out of these packages and the associated functionality, replace --install-recommends with --no-install-recommends.
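For example, a minimal installation without the recommended packages would look like:

$ sudo apt-get install --yes \
                       --no-install-recommends \
                       datacenter-gpu-manager-4-cuda${CUDA_VERSION}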

 

4. (Optional) Install the datacenter-gpu-manager-4 development files.

$ sudo apt install --yes datacenter-gpu-manager-4-dev
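To confirm that the development headers are in place, you can list the files the package installed (header locations may vary by distribution):

$ dpkg -L datacenter-gpu-manager-4-dev | grep '\.h$'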

 

RHEL / CentOS / Rocky Linux

1. Remove any installations of the datacenter-gpu-manager and datacenter-gpu-manager-config packages.

$ sudo dnf list --installed datacenter-gpu-manager &> /dev/null && \
  sudo dnf remove --assumeyes datacenter-gpu-manager

$ sudo dnf list --installed datacenter-gpu-manager-config &> /dev/null && \
  sudo dnf remove --assumeyes datacenter-gpu-manager-config

 

2. Update the package registry cache.

$ sudo dnf clean expire-cache

 

3. Install the datacenter-gpu-manager-4 package corresponding to the system CUDA version, along with its dependencies and the associated recommended packages.

$ CUDA_VERSION=$(nvidia-smi | sed -E -n 's/.*CUDA Version: ([0-9]+)[.].*/\1/p')
$ sudo dnf install --assumeyes \
                   --setopt=install_weak_deps=True \
                   datacenter-gpu-manager-4-cuda${CUDA_VERSION}

Installing the recommended packages provides additional DCGM functionality that is not present in the DCGM open-source product. To opt out of these packages and the associated functionality, replace --setopt=install_weak_deps=True with --setopt=install_weak_deps=False.
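For example, a minimal installation without the recommended packages would look like:

$ sudo dnf install --assumeyes \
                   --setopt=install_weak_deps=False \
                   datacenter-gpu-manager-4-cuda${CUDA_VERSION}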

 

4. (Optional) Install the datacenter-gpu-manager-4 development files.

$ sudo dnf install --assumeyes datacenter-gpu-manager-4-devel

 

 

Post-Install

Note

The default nvidia-dcgm.service file included in the installation package uses the systemd format. On distributions that use the init.d format, this file will need to be adapted.

 

Enable the DCGM systemd service so that it starts on boot, and start it immediately:

 

$ sudo systemctl --now enable nvidia-dcgm

Then confirm that the service is active:

$ sudo systemctl status nvidia-dcgm
● nvidia-dcgm.service - DCGM service
   Loaded: loaded (/usr/lib/systemd/system/nvidia-dcgm.service; enabled; vendor preset: enabled)
   Active: active (running) since Sat 2024-10-12 12:18:57 EDT; 14s ago
 Main PID: 32847 (nv-hostengine)
    Tasks: 7 (limit: 39321)
   CGroup: /system.slice/nvidia-dcgm.service
           └─32847 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm

Oct 12 12:18:57 ubuntu1804 systemd[1]: Started DCGM service.
Oct 12 12:18:58 ubuntu1804 nv-hostengine[32847]: DCGM initialized
Oct 12 12:18:58 ubuntu1804 nv-hostengine[32847]: Host Engine Listener Started

 

To verify installation, use dcgmi to query the system. You should see a listing of all supported GPUs (and any NVSwitches) found in the system:

 

$ dcgmi discovery -l
8 GPUs found.

+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:07:00.0                                         |
|        | Device UUID: GPU-1d82f4df-3cf9-150d-088b-52f18f8654e1                |
+--------+----------------------------------------------------------------------+
| 1      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:0F:00.0                                         |
|        | Device UUID: GPU-94168100-c5d5-1c05-9005-26953dd598e7                |
+--------+----------------------------------------------------------------------+
| 2      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:47:00.0                                         |
|        | Device UUID: GPU-9387e4b3-3640-0064-6b80-5ace1ee535f6                |
+--------+----------------------------------------------------------------------+
| 3      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:4E:00.0                                         |
|        | Device UUID: GPU-cefd0e59-c486-c12f-418c-84ccd7a12bb2                |
+--------+----------------------------------------------------------------------+
| 4      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:87:00.0                                         |
|        | Device UUID: GPU-1501b26d-f3e4-8501-421d-5a444b17eda8                |
+--------+----------------------------------------------------------------------+
| 5      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:90:00.0                                         |
|        | Device UUID: GPU-f4180a63-1978-6c56-9903-ca5aac8af020                |
+--------+----------------------------------------------------------------------+
| 6      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:B7:00.0                                         |
|        | Device UUID: GPU-8b354e3e-0145-6cfc-aec6-db2c28dae134                |
+--------+----------------------------------------------------------------------+
| 7      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:BD:00.0                                         |
|        | Device UUID: GPU-a16e3b98-8be2-6a0c-7fac-9cb024dbc2df                |
+--------+----------------------------------------------------------------------+
6 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
| 11        |
| 10        |
| 13        |
| 9         |
| 12        |
| 8         |
+-----------+
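With discovery working, an optional next step is to run DCGM's built-in level-1 diagnostics as a quick health check (levels 2 and above run progressively longer test suites):

$ dcgmi diag -r 1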


 
