GPU Validation Test

Marketing

Document Scope

This article covers how to use the GPU validation test with NVIDIA cards.

Pre-requisites

NVIDIA drivers must be installed. (Run the 'nvidia-smi' command to check that it correctly lists the NVIDIA hardware devices.)
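
The check above can also be scripted. The following is a minimal sketch, assuming a Bash shell; the function name check_nvidia_driver is hypothetical and not part of the validation package. It relies only on 'nvidia-smi -L', which prints one line per detected GPU.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the validation package).
# Reports whether nvidia-smi is installed and runs successfully.
check_nvidia_driver() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "NVIDIA driver check failed: nvidia-smi not found in PATH"
    return 1
  fi
  # 'nvidia-smi -L' prints one line per detected GPU.
  if ! nvidia-smi -L; then
    echo "NVIDIA driver check failed: nvidia-smi could not query the GPUs"
    return 1
  fi
  echo "NVIDIA driver check passed"
}

check_nvidia_driver || true
```

If the check fails, resolve the driver installation before continuing with the steps below.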

Instructions

1. Download and unpack the files into the root directory.

wget https://exxact-support.s3.us-west-1.amazonaws.com/Test+Folder/Stand_Alone_Validation_v4.2.1.tar.gz --no-check-certificate
tar -xvzf Stand_Alone_Validation_v4.2.1.tar.gz

2. Change directory to unpacked folder.

cd Stand_Alone_Validation

3. Set the number of GPUs and test cycles by editing the 'run_test.x' file in your preferred editor (such as nano or vi).

#How many GPUs in node

gpu_count=4

#How many tests to run of each type

#Large test requires 5GB memory

#Xlarge test requires 11GB memory

small_test_count=20

large_test_count=10

xlarge_test_count=5

NOTE: The duration of the tests varies depending on the GPUs being used. If you are using a smaller GPU specifically for display, you need to remove that GPU and run the test from a text-only terminal or over SSH.

4. Save the changes you just made. We typically like to start with 5/5/2 for the number of small/large/xlarge tests. The default number of cycles (20/10/5) is typically meant for overnight or long-duration testing.
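
For repeated runs, the counts can also be changed without opening an editor. The following is a minimal sketch, assuming each setting sits on its own line as 'key=number' in 'run_test.x', as in the excerpt above. The function name set_test_value and the backup file name run_test.x.bak are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite a 'key=value' line in place.
# Assumes each setting sits on its own line as 'key=number'.
set_test_value() {
  local file=$1 key=$2 value=$3 tmp
  tmp=$(mktemp) || return 1
  # Write the edited copy to a temp file, then copy the contents back so the
  # original file keeps its executable permission.
  if sed "s/^${key}=.*/${key}=${value}/" "$file" > "$tmp"; then
    cat "$tmp" > "$file"
  fi
  rm -f "$tmp"
}

# Example: a shorter 5/5/2 run, applied only if run_test.x is in the
# current directory. A backup copy is kept first.
if [ -f run_test.x ]; then
  cp -p run_test.x run_test.x.bak
  set_test_value run_test.x small_test_count 5
  set_test_value run_test.x large_test_count 5
  set_test_value run_test.x xlarge_test_count 2
fi
```

Lines that do not start with the given key are left untouched, so comments in the file are preserved.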

5. Run the test in the background as root:

nohup ./run_test.x &

6. Monitor GPU temperatures by opening another terminal and running 'nvidia-smi -l'. Once the 'standalone-test.bin' process no longer appears in the 'nvidia-smi' output, check the logs to confirm that your set number of cycles has completed.

exx@ubuntu:~/Stand_Alone_Validation$ nvidia-smi -l
Tue Jan 15 17:35:14 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    On   | 00000000:05:00.0  On |                  N/A |
| 78%   86C    P2   149W / 180W |   4767MiB /  8118MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    On   | 00000000:06:00.0 Off |                  N/A |
| 77%   86C    P2   155W / 180W |   4569MiB /  8119MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    On   | 00000000:09:00.0 Off |                  N/A |
| 72%   86C    P2   124W / 180W |   4569MiB /  8119MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 1080    On   | 00000000:0A:00.0 Off |                  N/A |
| 59%   83C    P2   134W / 180W |   4569MiB /  8119MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1910      G   /usr/lib/xorg/Xorg                           157MiB |
|    0      2889      G   compiz                                        40MiB |
|    0      5848      C   ../standalone-test.bin                      4557MiB |
|    1      5849      C   ../standalone-test.bin                      4557MiB |
|    2      5850      C   ../standalone-test.bin                      4557MiB |
|    3      5851      C   ../standalone-test.bin                      4557MiB |
+-----------------------------------------------------------------------------+

The approximate time to complete a 5/5/2 cycle is 6-8 hours.
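
For unattended runs, temperature readings can be filtered instead of watched. The following is a minimal sketch using the 'nvidia-smi --query-gpu' CSV output. The function name hot_gpus is hypothetical, and the 85 C threshold is only an illustrative value, not a vendor limit.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "index, temperature" CSV rows on stdin and print
# the rows whose temperature is at or above the given threshold (degrees C).
hot_gpus() {
  local limit=$1
  awk -F', *' -v limit="$limit" \
    '$2 + 0 >= limit + 0 { print "GPU " $1 ": " $2 "C" }'
}

# Take one reading if nvidia-smi is available. The 85 C threshold is an
# assumption chosen for illustration only.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,temperature.gpu \
    --format=csv,noheader,nounits | hot_gpus 85 || true
fi
```

Adding '-l 60' to the nvidia-smi command repeats the query every 60 seconds, in the same way as the 'nvidia-smi -l' command used above.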

Checking results

View the output logs in the 'Stand_Alone_Validation' directory and make sure the results match for each cycle. In this example, only 5 small tests were run on 4 GPUs. The large and xlarge tests write their own files for each GPU_x.

Example:

exx@ubuntu:~/Stand_Alone_Validation$ ls 

clean.x GPU_1.log GPU_3.log lib nohup.out output_files_large run_test.x standalone-test_v3.bin
GPU_0.log GPU_2.log input LICENSE output_files README standalone-test.bin standalone-test_v3_p2p.bin
exx@ubuntu:~/Stand_Alone_Validation$ cat *.log
0.0: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
0.1: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
0.2: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
0.3: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
0.4: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
1.0: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
1.1: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
1.2: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
1.3: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
1.4: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
2.0: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
2.1: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
2.2: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
2.3: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
2.4: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
3.0: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
3.1: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
3.2: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
3.3: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
3.4: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430

In each entry above, the prefix is the GPU index and cycle number (for example, 0.0 is GPU 0, cycle 0), followed by the Etot, EKtot, and EPtot values. Here, all 4 GPUs have passed 5 cycles of the small test with matching results.
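
Comparing many entries by eye is error-prone, so the match can be checked mechanically. The following is a minimal sketch, assuming the log entries follow the 'Etot = ... EKtot = ... EPtot = ...' layout shown above and that all supplied logs come from the same test size. The function name check_energies is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if every energy triple in the given log
# files is identical.
check_energies() {
  local distinct
  # Pull out each "Etot ... EKtot ... EPtot ..." triple, drop the GPU.cycle
  # prefix, and count how many different triples remain.
  distinct=$(cat "$@" \
    | grep -oE 'Etot = *-?[0-9.]+ +EKtot = *-?[0-9.]+ +EPtot = *-?[0-9.]+' \
    | sort -u | wc -l | tr -d ' ')
  if [ "$distinct" -eq 1 ]; then
    echo "PASS: all entries match"
  else
    echo "FAIL: found $distinct distinct results"
    return 1
  fi
}

# Check the per-GPU logs if any exist in the current directory.
if ls GPU_*.log >/dev/null 2>&1; then
  check_energies GPU_*.log || true
fi
```

A result of zero distinct entries means no energies were found at all, which is also reported as a failure.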

FAQ

Q: Is there any way of running the test on select cards without going through the trouble of opening the case and yanking out power cables/PCIe cards?

A: Yes. Set the 'CUDA_VISIBLE_DEVICES' environment variable manually and comment out the 'CUDA_VISIBLE_DEVICES' assignment in the script, so that the script does not overwrite the UUID of the single GPU you want to test.

 

This solution suits a system administrator who is comfortable working in the shell, particularly when the Exxact GPU server or HPC system is in a rack or data center.

 

To run the GPU Stand Alone Validation tests against a single card, we customize the behavior of the script instead of pulling out the cards and rotating them manually.
This requires a manual change to the GPU validation script, but I tested it in my lab and it worked as expected.

 

To run the test against one specific card, you will need to perform the following actions:

  1. Back up the existing "run_test.x" shell script (just to be safe; you can always re-download the entire tgz archive).
  2. Edit "run_test.x" using your favorite text editor (nano, vim, etc.).
  3. Comment out "CUDA_VISIBLE_DEVICES=$j".

This appears three times in the run_test.x script. We remove it here because we will define the variable directly in the bash shell, so the file does not need to be edited for every run.

  4. Run the command "nvidia-smi -L" to get a list of all GPU UUIDs.
  5. Before each run, set CUDA_VISIBLE_DEVICES to the UUID of the card you wish to test.
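
The backup and the search for the lines to edit can be sketched as follows. The source does not show how the assignment appears on each line, so this sketch only backs up the file and locates the matches; the edit itself is still done by hand. The function name find_device_assignments and the '.bak' suffix are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: back up a script and list the lines that assign
# CUDA_VISIBLE_DEVICES=$j, so they can be commented out by hand.
find_device_assignments() {
  local file=$1
  cp -p "$file" "$file.bak" || return 1
  # -n prefixes each match with its line number; -F treats the pattern
  # as a literal string, so '$j' is not interpreted as a regex.
  grep -nF 'CUDA_VISIBLE_DEVICES=$j' "$file"
}

if [ -f run_test.x ]; then
  find_device_assignments run_test.x || echo "no matching lines found"
fi
```

If the output does not show three matching lines, the script may differ from the version described here and should be reviewed before editing.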

Example:

export CUDA_VISIBLE_DEVICES=GPU-99135ce

nohup ./run_test.x &

    ... [Test completes]

 

export CUDA_VISIBLE_DEVICES=GPU-13599aa

{{ repeat as needed to isolate faulty GPU }}
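
The repetition above can be sketched as a loop. This is a minimal sketch under two assumptions: 'run_test.x' has already been edited as described so that it no longer sets CUDA_VISIBLE_DEVICES itself, and 'nvidia-smi -L' prints each UUID in the form '(UUID: GPU-...)'. The function name extract_gpu_uuids is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read 'nvidia-smi -L' output on stdin and print one
# GPU UUID per line.
extract_gpu_uuids() {
  sed -n 's/.*(UUID: \(GPU-[^)]*\)).*/\1/p'
}

# Test each GPU in turn, in the foreground so the runs do not overlap.
# Runs only when both nvidia-smi and an executable run_test.x are present.
# Log files from an earlier run may be overwritten by the next one, so copy
# them aside between runs if they are needed.
if command -v nvidia-smi >/dev/null 2>&1 && [ -x ./run_test.x ]; then
  for uuid in $(nvidia-smi -L | extract_gpu_uuids); do
    echo "Testing $uuid"
    CUDA_VISIBLE_DEVICES="$uuid" ./run_test.x
  done
fi
```

Setting the variable on the same line as the command limits it to that one run, so nothing needs to be unset afterwards.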

 

About Exxact's Standalone Validation Suite

Exxact's Standalone Validation Suite is a proprietary test adapted from the GPU engine within the AMBER Molecular Dynamics Software Suite. Developed by Ross Walker, the principal developer of the AMBER GPU software, the test works by repeatedly running all-atom molecular dynamics (MD) simulations of varying sizes. There are 3 different sizes of tests designed to stress both the GPU itself and the GPU memory.

For each test size, a simulation consists of millions of MD steps, each comprising a large combination of single and double precision floating point calculations as well as fixed precision integer arithmetic. The calculation includes pairwise electrostatic and van der Waals interactions, Fourier transforms, inverse R squared calculations, pair list sorts, and integration. This computation pattern uses all parts of the GPU and stresses the GPU memory.

At the end of a fixed number of steps for each run, which takes between 15 and 30 minutes on average, the final coordinates, energies, and velocities of the atoms are recorded. The calculation is then repeated from the same input parameters, and again, after a fixed number of steps, the final coordinates, energies, and velocities of the atoms are recorded. The AMBER GPU engine is designed to be bitwise reproducible, meaning a simulation started from identical conditions should give identical results. Any variation in the final results is thus an indication of either a bad GPU or bad GPU memory.

The test is run for 24 hours and is very effective at identifying faulty GPUs. It is so effective that it is credited with identifying design flaws and insufficient frequency margins on 5 different NVIDIA GPU models, and NVIDIA now includes a variation of this code as part of its chip design testing process.

In addition to checking that all GPUs give consistent results, the performance of each GPU is tested using the same code. Performance between repeat runs and between GPUs is compared and determined to be within acceptable tolerances before a system is shipped. This approach effectively identifies both faulty GPUs (for example, those with faulty power and temperature regulators) and any GPUs that might have insufficient cooling due to airflow restrictions, fan issues, etc.

 

 
