Overview
This document describes how to install and run GPU Burn, a tool Exxact uses to stress test GPUs. GPU Burn keeps your GPUs under heavy load for a defined period of time, which can help you confirm or rule out potential GPU issues.
Affected Systems
Any Exxact Workstation or Server with GPUs
To identify your OS, run cat /etc/os-release
Prerequisites
- Ubuntu or Rocky Linux operating system
- You have a CUDA-capable GPU
- You have CUDA already installed
- You are running all commands below as root
- You have the compilers and tools needed to build and run code (for example git and make, which are used in the steps below)
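The prerequisites above can be checked before starting. The following is a minimal sketch and is not part of GPU Burn itself: the helper names check_tool and check_prereqs are hypothetical, and the tool list (git, make, g++, nvcc, nvidia-smi) is an assumption based on the steps in this article.

```shell
# Hypothetical helper: report whether a command is available on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1"
        return 0
    fi
    echo "missing: $1"
    return 1
}

# Hypothetical helper: check each assumed prerequisite and
# return non-zero if any of them is missing.
check_prereqs() {
    status=0
    # Assumed tool list; adjust to match your system.
    for tool in git make g++ nvcc nvidia-smi; do
        check_tool "$tool" || status=1
    done
    # This article assumes all commands are run as root.
    if [ "$(id -u)" -ne 0 ]; then
        echo "warning: not running as root"
    fi
    return "$status"
}
```

Run check_prereqs and review any line marked missing. A missing nvcc may only mean that the CUDA bin directory is not on your PATH, so treat the result as a hint and not as proof that CUDA is absent.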
Step-by-step Instructions
Step 1: Download GPU Burn
GPU Burn can be downloaded by running the following:
git clone https://github.com/wilicc/gpu-burn
Step 2: Install GPU Burn
You will need to change into the gpu-burn directory and then run make to build the package.
cd gpu-burn
make
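A quick way to confirm that the build succeeded is to check for the gpu_burn binary and the compare.ptx file, both of which appear in the example run in Step 3. This is a minimal sketch; build_ok is a hypothetical helper, and it assumes make places both files in the gpu-burn directory.

```shell
# Hypothetical helper: succeed only if the build directory contains an
# executable gpu_burn and the compare.ptx file it loads at startup.
#   $1 = build directory (defaults to the current directory)
build_ok() {
    dir=${1:-.}
    [ -x "$dir/gpu_burn" ] && [ -f "$dir/compare.ptx" ]
}
```

For example, from inside the gpu-burn directory: build_ok && echo "build looks complete"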
Step 3: Run GPU Burn
Usage: gpu_burn [OPTIONS] [TIME]
-m X Use X MB of memory
-m N% Use N% of the available GPU memory
-d Use doubles
-tc Try to use Tensor cores (if available)
-l List all GPUs in the system
-i N Execute only on GPU N
-h Show this help message
Example:
gpu_burn -d 3600
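The options in the usage text can be combined. As a sketch, the hypothetical helper burn_cmd below only prints a command line and does not start a burn; it uses the -i and -m N% options listed above and assumes the binary is run from the build directory as ./gpu_burn.

```shell
# Hypothetical helper: print (do not run) a gpu_burn command line.
#   $1 = GPU index for -i
#   $2 = percent of available GPU memory for -m
#   $3 = run time in seconds
burn_cmd() {
    printf './gpu_burn -i %s -m %s%% %s\n' "$1" "$2" "$3"
}

# Print the command for a one-hour burn on GPU 0 using 90% of its memory.
burn_cmd 0 90 3600
```

Printing the command first makes it easy to review the options before committing to a long run.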
GPU Burn is a fairly straightforward tool and runs for ten seconds by default. Below is example output from a 30-second run on a system with two NVIDIA RTX A6000 GPUs.
exx@tt19163:~/gpu-burn$ ./gpu_burn 30
Using compare file: compare.ptx
Burning for 30 seconds.
GPU 0: NVIDIA RTX A6000 (UUID: GPU-29a27c7b-cd4b-9728-9cdc-7102f77d4548)
GPU 1: NVIDIA RTX A6000 (UUID: GPU-a34186de-ecc3-56c6-0e8f-ff8cfa0cc7b2)
(Removed initialization and testing summary outputs for brevity)
Tested 2 GPUs:
GPU 0: OK
GPU 1: OK
exx@tt19163:~/gpu-burn$
Step 4: Check for PASS/FAIL
$ ./gpu_burn 600
GPU 0: NVIDIA RTX 5090, 80 GB memory, ECC enabled
GPU 1: NVIDIA RTX 5090, 80 GB memory, ECC enabled
[00:03:00] GPU 0: OK (checksum matched)
[00:03:00] GPU 1: ERROR - checksum mismatch
[00:03:00] GPU 1: Possible hardware or memory instability detected
[00:10:00] GPU 0: OK (checksum matched)
[00:10:00] GPU 1: ERROR - checksum mismatch
========================================================================
Burn-in test completed (600 seconds)
Summary:
GPU 0: PASSED - No errors detected
GPU 1: FAILED - Multiple checksum mismatches detected
========================================================================
Recommendation: Investigate GPU 1 for potential hardware instability.
Possible causes: overheating, unstable power delivery, or defective memory.
Once the GPU burn completes, it shows “All GPUs successfully passed” or indicates which GPUs had errors and recommends additional troubleshooting.
If a GPU failed and additional troubleshooting is needed, please refer to our GPU troubleshooting article for further information.
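On systems with many GPUs it can help to extract the failing GPUs from a saved log. The sketch below assumes the summary format shown in the Step 3 example ("Tested N GPUs:" followed by one "GPU N: STATUS" line per GPU) and treats any status other than OK as a failure. The helper name failed_gpus is hypothetical, and the parser would need adjusting for output in a different format.

```shell
# Hypothetical helper: read GPU Burn output on stdin and print the
# summary lines of GPUs that are not reported as OK.
failed_gpus() {
    awk '
        # The per-GPU summary starts after the "Tested N GPUs:" line.
        /Tested [0-9]+ GPUs:/ { in_summary = 1; next }
        # Inside the summary, keep GPU lines whose status is not OK.
        in_summary && /GPU [0-9]+: / && !/GPU [0-9]+: OK/ {
            sub(/^[ \t]+/, "")   # trim leading whitespace
            print
        }
    '
}
```

A possible workflow is to save the run with ./gpu_burn 600 | tee burn.log and then run failed_gpus < burn.log. Empty output means every GPU in the summary was reported as OK.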
Tips and Best Practice
- GPU Burn accepts the run time in seconds. Convert hours to seconds for extended testing (e.g., 1 hour = 3600 seconds)
- Pair with ipmitool sensor to monitor component temperatures and nvidia-smi to monitor GPU temperatures during GPU burns
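The two tips above can be combined into a small wrapper. This is a minimal sketch under stated assumptions: the helper names hours_to_seconds and burn_with_monitoring, the 5-second sampling interval, and the default log file name gpu_temps.csv are all hypothetical choices, and the wrapper assumes it is run from the gpu-burn directory.

```shell
# Hypothetical helper: convert whole hours to the seconds value gpu_burn expects.
hours_to_seconds() {
    echo $(( $1 * 3600 ))
}

# Hypothetical helper: run a burn while logging GPU temperature and
# utilization every 5 seconds.
#   $1 = run time in seconds
#   $2 = log file (defaults to gpu_temps.csv)
burn_with_monitoring() {
    log=${2:-gpu_temps.csv}
    # Start nvidia-smi in the background, sampling every 5 seconds.
    nvidia-smi --query-gpu=timestamp,index,temperature.gpu,utilization.gpu \
               --format=csv -l 5 > "$log" &
    monitor_pid=$!
    ./gpu_burn "$1"
    burn_status=$?
    # Stop the background logger once the burn has finished.
    kill "$monitor_pid" 2>/dev/null
    return "$burn_status"
}
```

For a one-hour burn: burn_with_monitoring "$(hours_to_seconds 1)". Component temperatures can be watched separately by running ipmitool sensor in a second terminal while the burn is in progress.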