How to Check, View, and Monitor Your GPU Metrics

Alexander Hill

Overview

Graphics Processing Units (GPUs) are critical components for tasks like machine learning, data processing, and gaming. Monitoring GPU performance metrics helps you understand resource utilization, identify bottlenecks, and ensure optimal performance. This guide explains how to check and monitor key GPU metrics using NVIDIA's System Management Interface (nvidia-smi) tool.

Prerequisites

  • A system with one or more NVIDIA GPUs installed
  • NVIDIA GPU drivers properly installed
  • Terminal or command-line access to your system

Steps

Basic GPU Information

  1. Check basic GPU information using the simple nvidia-smi command:
    bash
     
    nvidia-smi
    This displays a summary of all GPUs including model, driver version, and basic utilization metrics.
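    If you just need to confirm which devices the driver can see, nvidia-smi also accepts an -L flag that lists each GPU with its index and UUID (the exact wording of the output varies by driver version):
    bash
     
    # List every detected GPU with its index and UUID
    nvidia-smi -L
    The index printed here is the same index used by the -i option described later in this article.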

Detailed GPU Metrics

  1. View comprehensive GPU metrics with a custom query:
    bash
     
    nvidia-smi --query-gpu=timestamp,index,name,temperature.gpu,power.draw,clocks.gr,clocks.mem,utilization.gpu,utilization.memory,memory.used,memory.total,pstate,fan.speed --format=csv,noheader,nounits

    This command displays the following metrics:
    • timestamp: Current system time
    • index: GPU device index (useful for multi-GPU systems)
    • name: GPU model name
    • temperature.gpu: GPU temperature in Celsius
    • power.draw: Current power consumption in Watts
    • clocks.gr: Graphics clock speed in MHz
    • clocks.mem: Memory clock speed in MHz
    • utilization.gpu: GPU utilization percentage
    • utilization.memory: GPU memory utilization percentage
    • memory.used: Used GPU memory in MiB
    • memory.total: Total GPU memory in MiB
    • pstate: Current performance state (P0 to P12, with P0 being highest performance)
    • fan.speed: Fan speed percentage
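    Because the csv,noheader,nounits output is plain comma-separated text, it is easy to post-process in the shell. The sketch below (an illustration, not an nvidia-smi feature; the 80°C threshold is an arbitrary example) reads a few of the fields for each GPU and prints a warning when the temperature gets high:
    bash
     
    # Read selected fields per GPU and warn when the temperature is high
    nvidia-smi --query-gpu=index,temperature.gpu,utilization.gpu,memory.used \
               --format=csv,noheader,nounits |
    while IFS=', ' read -r idx temp util mem_used; do
        echo "GPU $idx: ${temp}C, ${util}% utilization, ${mem_used} MiB used"
        if [ "$temp" -ge 80 ]; then
            echo "WARNING: GPU $idx is running hot (${temp}C)" >&2
        fi
    done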

Continuous Monitoring

    1. Set up continuous monitoring by combining with the watch command:
      bash
       
      watch -n 1 "nvidia-smi --query-gpu=timestamp,index,name,temperature.gpu,power.draw,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits"
      This updates the display every second with a simplified set of metrics for easier viewing.
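      As an alternative to watch, nvidia-smi has a built-in -l (--loop) option that reruns the query itself at a fixed interval. Unlike watch, it appends one line per sample instead of refreshing the screen, which makes it convenient to redirect straight into a log file:
      bash
       
      # Re-run the query every second and keep appending samples
      nvidia-smi --query-gpu=timestamp,index,temperature.gpu,utilization.gpu,memory.used --format=csv,noheader,nounits -l 1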

Custom Output Formats

  1. Format output as needed using nvidia-smi options:
    • For CSV format (good for logging):
      bash
       
      nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu --format=csv > gpu_log.csv
    • For XML output (the --format option only accepts CSV, so use the full query flags instead; this dumps the complete device report rather than selected fields):
      bash
       
      nvidia-smi -q -x

Specific GPU Information

    1. Target a specific GPU in a multi-GPU system:
      bash
       
      nvidia-smi -i 0 --query-gpu=temperature.gpu,power.draw --format=csv
       
      Replace 0 with the index of the GPU you want to monitor.
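      To run the same query for several specific GPUs, one simple approach is to loop over their indices in the shell (a sketch assuming the system has GPUs at indices 0 and 1):
      bash
       
      # Query two specific GPUs by index (adjust the list to match your system)
      for i in 0 1; do
          nvidia-smi -i "$i" --query-gpu=index,temperature.gpu,power.draw --format=csv,noheader
      done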

Process-Level Monitoring

    1. Monitor processes using GPU resources:
      bash
       
      nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv

      This shows which processes are using GPU resources and how much memory they're consuming.
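      Because this output is also CSV, standard shell tools can post-process it. For instance, the following sketch sorts the compute processes by how much GPU memory they hold, largest first:
      bash
       
      # Sort GPU compute processes by memory used (third field), largest first
      nvidia-smi --query-compute-apps=pid,process_name,used_memory \
                 --format=csv,noheader,nounits | sort -t, -k3 -rn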

Additional Tips

  • Set up automated monitoring by creating a script that runs nvidia-smi commands and logs the output to a file (see the example script after this list)
  • Consider using dedicated monitoring tools such as nvtop, GPU-Z (Windows), or NVIDIA's DCGM for more advanced visualizations and fleet-level monitoring
  • Critical thresholds to watch for:
    • Temperature: Generally keep below 85°C for most GPUs
    • Memory utilization: Near 100% can indicate memory bottlenecks
    • Power consumption: Consistently high values may indicate inefficient usage or potential thermal issues
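
  A minimal logging script along those lines might look like the following sketch (the log file path, field list, and 60-second interval are arbitrary example values to adapt to your environment):
  bash
   
  #!/usr/bin/env bash
  # Append a timestamped CSV sample of key GPU metrics at a fixed interval.
  # LOGFILE and INTERVAL are example values; adjust them for your setup.
  LOGFILE="$HOME/gpu_metrics.csv"
  INTERVAL=60
  FIELDS="timestamp,index,name,temperature.gpu,power.draw,utilization.gpu,memory.used,memory.total"

  # Write the CSV header once, when the log file does not exist yet
  if [ ! -f "$LOGFILE" ]; then
      nvidia-smi --query-gpu="$FIELDS" --format=csv | head -n 1 > "$LOGFILE"
  fi

  # Append one row per GPU per interval until interrupted
  while true; do
      nvidia-smi --query-gpu="$FIELDS" --format=csv,noheader >> "$LOGFILE"
      sleep "$INTERVAL"
  done
  You can keep a script like this running in the background with nohup, tmux, or a systemd service so it continues collecting after you log out.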
