How to Run GPU Burn

Matthew Estes

Overview

This document briefly describes how to install and run GPU Burn, a tool Exxact uses to stress test GPUs. GPU Burn drives your GPUs at high utilization for a defined period of time, which can help you confirm or rule out potential GPU issues.

Affected Systems

Any Exxact Workstation or Server with GPUs

To identify your OS, run cat /etc/os-release.
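As a minimal sketch of that check, the helper below reads an os-release file and prints the distribution ID and version. The function name os_id is hypothetical, and the sketch assumes the file defines the standard ID and VERSION_ID fields.

```shell
# Print the distribution ID and version from an os-release file.
# os_id is a hypothetical helper name; the path defaults to /etc/os-release.
os_id() {
    local file="${1:-/etc/os-release}"
    # Source the file in a subshell so its variables do not leak into the caller.
    ( . "$file" && echo "${ID} ${VERSION_ID}" )
}
```

Calling os_id with no arguments reads /etc/os-release, which should report an ID such as ubuntu or rocky on the supported systems.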

Prerequisites

  • An Ubuntu or Rocky Linux operating system
  • A CUDA-capable GPU
  • CUDA already installed
  • Root access (all commands below are run as root)
  • The compilers and tools needed to compile and run code
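The prerequisites above can be checked quickly before starting. The following is a sketch only: check_tool and check_prereqs are hypothetical helper names, and the tool list in the example comment is an assumption about what a typical build needs, not a list taken from GPU Burn itself.

```shell
# Report whether a required command is on PATH.
# Returns 0 if found, 1 if missing.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Check every tool passed as an argument; return non-zero if any is missing.
check_prereqs() {
    local missing=0
    local tool
    for tool in "$@"; do
        check_tool "$tool" || missing=1
    done
    return "$missing"
}

# Example (assumed tool list; adjust for your system):
#   check_prereqs git make g++ nvcc nvidia-smi
```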

Step-by-step Instructions

Step 1: Download GPU Burn

GPU Burn can be downloaded by running the following:

git clone https://github.com/wilicc/gpu-burn

Step 2: Install GPU Burn

You will need to change into the gpu-burn directory and then run make to build the package.

cd gpu-burn
make
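If CUDA is installed somewhere other than the default location, the build may need to be pointed at it. The sketch below locates a CUDA directory by looking for bin/nvcc. The helper name find_cuda_path and the candidate paths are assumptions, and the CUDAPATH make variable is taken from the GPU Burn README as best recalled, so verify it against the Makefile in your checkout.

```shell
# Print the first candidate directory that contains an executable bin/nvcc.
# Returns 1 if none of the candidates qualifies.
find_cuda_path() {
    local dir
    for dir in "$@"; do
        if [ -x "$dir/bin/nvcc" ]; then
            echo "$dir"
            return 0
        fi
    done
    return 1
}

# Example (candidate paths are assumptions):
#   make CUDAPATH="$(find_cuda_path /usr/local/cuda /opt/cuda)"
```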

Step 3: Run GPU Burn

Usage: gpu_burn [OPTIONS] [TIME]

-m X Use X MB of memory
-m N% Use N% of the available GPU memory
-d Use doubles
-tc Try to use Tensor cores (if available)
-l List all GPUs in the system
-i N Execute only on GPU N
-h Show this help message

Example:
gpu_burn -d 3600
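The options above can be combined. As an illustration, the hypothetical helper below assembles a command line from a memory percentage, a GPU selection, and a duration, using only the -m and -i options listed in the usage text. It prints the command rather than running it, so the result can be reviewed first.

```shell
# Build (but do not run) a gpu_burn command line.
# Arguments: memory percentage, GPU index ("all" for every GPU), seconds.
# burn_cmd is a hypothetical helper, not part of GPU Burn.
burn_cmd() {
    local mem_pct="$1"
    local gpu="$2"
    local seconds="$3"
    local cmd="./gpu_burn -m ${mem_pct}%"
    # Restrict the run to a single GPU only when an index is given.
    if [ "$gpu" != "all" ]; then
        cmd="$cmd -i $gpu"
    fi
    echo "$cmd $seconds"
}
```

For example, burn_cmd 90 all 600 prints a command that uses 90% of GPU memory on every GPU for ten minutes.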

GPU Burn is a fairly straightforward tool and runs for ten seconds by default. Below is example output from a 30-second run on a system with two NVIDIA RTX A6000 GPUs.

exx@tt19163:~/gpu-burn$ ./gpu_burn 30
Using compare file: compare.ptx
Burning for 30 seconds.
GPU 0: NVIDIA RTX A6000 (UUID: GPU-29a27c7b-cd4b-9728-9cdc-7102f77d4548)
GPU 1: NVIDIA RTX A6000 (UUID: GPU-a34186de-ecc3-56c6-0e8f-ff8cfa0cc7b2)

(Removed initialization and testing summary outputs for brevity)

Tested 2 GPUs:
GPU 0: OK
GPU 1: OK
exx@tt19163:~/gpu-burn$
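For unattended runs, the summary at the end of the output can be checked with a script. The sketch below assumes the summary format shown above (a "Tested N GPUs:" line followed by one "GPU N: STATUS" line per GPU) and treats any status other than OK as a failure. The function name failed_gpus is hypothetical.

```shell
# Read gpu_burn output on stdin and print the index of every GPU whose
# summary line is not "OK". Returns 1 if any such GPU is found, else 0.
failed_gpus() {
    awk '
        BEGIN { bad = 0 }
        # Only lines after the "Tested N GPUs:" header belong to the summary.
        /^Tested [0-9]+ GPUs:/ { in_summary = 1; next }
        in_summary && /^[ \t]*GPU [0-9]+: / {
            idx = $2
            sub(/:$/, "", idx)
            if ($3 != "OK") { print idx; bad = 1 }
        }
        END { exit bad }
    '
}

# Example: save the output to a log, then check it:
#   ./gpu_burn 30 | tee burn.log
#   failed_gpus < burn.log || echo "at least one GPU reported a problem"
```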

Step 4: Check for PASS/FAIL

$ ./gpu_burn 600
GPU 0: NVIDIA RTX 5090, 80 GB memory, ECC enabled
GPU 1: NVIDIA RTX 5090, 80 GB memory, ECC enabled

[00:03:00] GPU 0: OK (checksum matched)
[00:03:00] GPU 1: ERROR - checksum mismatch
[00:03:00] GPU 1: Possible hardware or memory instability detected

[00:10:00] GPU 0: OK (checksum matched)
[00:10:00] GPU 1: ERROR - checksum mismatch

========================================================================
Burn-in test completed (600 seconds)
Summary:
GPU 0: PASSED - No errors detected
GPU 1: FAILED - Multiple checksum mismatches detected
========================================================================

Recommendation: Investigate GPU 1 for potential hardware instability.
Possible causes: overheating, unstable power delivery, or defective memory.

Once the GPU burn completes, it will either show “All GPUs successfully passed” or indicate which GPUs had errors and recommend additional troubleshooting.

If a GPU failed and additional troubleshooting is needed, please refer to our GPU troubleshooting article for further information.

Tips and Best Practices

  • GPU Burn accepts its time parameter in seconds. Convert hours to seconds for extended testing (e.g., 1 hour = 3600 seconds).
  • Pair GPU Burn with ipmitool sensor to monitor component temperatures and with nvidia-smi to monitor GPU temperatures during a burn.
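The tips above can be sketched as follows. The helper name hours_to_seconds is hypothetical, and the 5-second polling interval in the comments is an arbitrary choice.

```shell
# Convert a whole number of hours to seconds for the gpu_burn TIME argument.
hours_to_seconds() {
    echo $(( $1 * 3600 ))
}

# Example: start a 2-hour burn with
#   ./gpu_burn "$(hours_to_seconds 2)"
# While it runs, poll GPU temperature and utilization every 5 seconds
# from a second terminal:
#   nvidia-smi --query-gpu=index,temperature.gpu,utilization.gpu --format=csv -l 5
# Chassis and component sensors can be read with:
#   ipmitool sensor
```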

