Overview
Graphics Processing Units (GPUs) are critical components for tasks like machine learning, data processing, and gaming. Monitoring GPU performance metrics helps you understand resource utilization, identify bottlenecks, and ensure optimal performance. This guide explains how to check and monitor key GPU metrics using NVIDIA's System Management Interface (nvidia-smi) tool.
Prerequisites
- A system with one or more NVIDIA GPUs installed
- NVIDIA GPU drivers properly installed
- Terminal or command-line access to your system
Steps
Basic GPU Information
- Check basic GPU information by running nvidia-smi with no arguments:
bash
nvidia-smi
Detailed GPU Metrics
- View comprehensive GPU metrics with a custom query:
bash
nvidia-smi --query-gpu=timestamp,index,name,temperature.gpu,power.draw,clocks.gr,clocks.mem,utilization.gpu,utilization.memory,memory.used,memory.total,pstate,fan.speed --format=csv,noheader,nounits
This command displays the following metrics:
- timestamp: Current system time
- index: GPU device index (useful for multi-GPU systems)
- name: GPU model name
- temperature.gpu: GPU temperature in Celsius
- power.draw: Current power consumption in Watts
- clocks.gr: Graphics clock speed in MHz
- clocks.mem: Memory clock speed in MHz
- utilization.gpu: GPU utilization percentage
- utilization.memory: GPU memory utilization percentage
- memory.used: Used GPU memory in MiB
- memory.total: Total GPU memory in MiB
- pstate: Current performance state (P0 to P12, with P0 being highest performance)
- fan.speed: Fan speed percentage
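The fields above are only a subset of what the driver exposes. On most driver versions, nvidia-smi can print the full list of queryable properties itself:
bash
# Show every field accepted by --query-gpu (see also --help-query-compute-apps for process queries)
nvidia-smi --help-query-gpu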
Continuous Monitoring
- Set up continuous monitoring by combining nvidia-smi with the watch command:
bash
watch -n 1 "nvidia-smi --query-gpu=timestamp,index,name,temperature.gpu,power.draw,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits"
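As an alternative to watch, most driver versions also support a built-in loop mode; a minimal sketch using a 5-second sampling interval:
bash
# Re-run the query every 5 seconds until interrupted with Ctrl+C
nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,utilization.gpu,memory.used --format=csv,noheader,nounits -l 5
Because loop mode writes each sample to standard output as a new row, the stream is easy to redirect into a log file.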
Custom Output Formats
- Format output as needed using nvidia-smi options:
- For CSV format (good for logging):
bash
nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu --format=csv > gpu_log.csv
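Raw CSV can be hard to read in a terminal. One option, assuming the column utility from util-linux is available, is to align the fields before viewing them:
bash
# Align the comma-separated fields into readable columns
nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu --format=csv | column -t -s,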
- For XML output (the --query-gpu interface only supports CSV, so use the full device query instead):
bash
nvidia-smi -q -x
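If you only need a single value from that XML report, one approach, assuming xmllint from libxml2 is installed, is to select it with an XPath expression; the element names reflect current driver output and may vary between driver versions:
bash
# Extract the first GPU's temperature from the XML device query (requires xmllint)
nvidia-smi -q -x | xmllint --xpath 'string(//gpu/temperature/gpu_temp)' -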
Specific GPU Information
- Target a specific GPU in a multi-GPU system:
bash
nvidia-smi -i 0 --query-gpu=temperature.gpu,power.draw --format=csv
Replace 0 with the index of the GPU you want to monitor.
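To find the right index (along with each card's model name and UUID), list the installed GPUs first:
bash
# List all GPUs with their index, model name, and UUID
nvidia-smi -L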
Process-Level Monitoring
- Monitor processes using GPU resources:
bash
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
This shows which processes are using GPU resources and how much memory they're consuming.
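Recent drivers also ship a dedicated process-monitoring subcommand; the exact columns available depend on the driver version:
bash
# One-shot snapshot of per-process GPU utilization (u) and memory usage (m)
nvidia-smi pmon -c 1 -s um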
Additional Tips
- Set up automated monitoring with a script that runs nvidia-smi on a schedule and appends the output to a log file (a sketch follows this list)
- Consider graphical or higher-level monitoring tools such as GPU-Z or NVIDIA's DCGM for more advanced visualizations and fleet-level monitoring
- Critical thresholds to watch for:
- Temperature: Generally keep below 85°C for most GPUs
- Memory utilization: Near 100% can indicate memory bottlenecks
- Power draw: sustained values at or near the GPU's power limit can indicate throttling or insufficient cooling headroom
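A minimal sketch of the logging script mentioned above; the file name gpu_log.sh, the default log path, and the 60-second interval are arbitrary placeholders, so adapt them to your environment:
bash
#!/usr/bin/env bash
# gpu_log.sh - append GPU metrics to a CSV file at a fixed interval (hypothetical helper)
LOGFILE="${1:-gpu_metrics.csv}"   # assumed default log path
INTERVAL="${2:-60}"               # seconds between samples

# Write a header once so the CSV file is self-describing
if [ ! -s "$LOGFILE" ]; then
  echo "timestamp,index,name,temperature.gpu,power.draw,utilization.gpu,memory.used,memory.total" >> "$LOGFILE"
fi

# Sample indefinitely; stop with Ctrl+C or run under a process manager
while true; do
  nvidia-smi \
    --query-gpu=timestamp,index,name,temperature.gpu,power.draw,utilization.gpu,memory.used,memory.total \
    --format=csv,noheader,nounits >> "$LOGFILE"
  sleep "$INTERVAL"
done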