Troubleshooting Apparent Performance Issues on NVIDIA RTX 6000 ADA GPUs

Matthew Estes

Document Scope

This article describes how normal NVIDIA GPU behavior can look like a performance problem, and outlines the basic steps for finding the actual source of the issue. It is based on a recent real-world case handled by our support team.

Issue

A customer reported that a new workstation with four NVIDIA RTX 6000 ADA GPUs performed slower than an older system with 3090s. They were also concerned about what they believed to be much higher operating temperatures when running workloads such as PyTorch. The customer noted that their 3090s would get no hotter than 78C, while the RTX 6000 ADA GPUs were regularly hitting 87C-90C.

Troubleshooting

The initial suspicion was thermal throttling. However, we noted that the ADA architecture naturally runs hotter than Ampere and is designed with higher thermal limits, so 87C was still within the expected range.

Running the command "nvidia-smi -a" showed that the GPUs were actually power constrained, not thermally limited. The relevant lines of the output for each GPU were:

GPU 00000000:01:00.0
SW Power Cap : Active
SM : 1515 MHz
Memory : 9500 MHz
GPU 00000000:2E:00.0
SW Power Cap : Active
SM : 690 MHz
Memory : 9500 MHz
GPU 00000000:41:00.0
SW Power Cap : Active
SM : 1215 MHz
Memory : 9500 MHz
GPU 00000000:61:00.0
SW Power Cap : Active
SM : 1245 MHz
Memory : 9500 MHz
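
With four GPUs, scanning this output by eye gets tedious. The following is a minimal Python sketch, not part of the original support case, that picks out the GPUs reporting an active software power cap along with their SM clock. It assumes the trimmed excerpt format shown above (one "GPU" line, one "SW Power Cap" line, and one "SM" line per GPU); the full "nvidia-smi -a" output has more sections and repeated field names, so the parsing would need adjusting. The function name find_power_capped and the EXCERPT constant are hypothetical.

```python
import re

# Excerpt in the same trimmed format as shown in this article (assumption:
# the full nvidia-smi report has been reduced to these three fields per GPU).
EXCERPT = """\
GPU 00000000:01:00.0
SW Power Cap : Active
SM : 1515 MHz
GPU 00000000:2E:00.0
SW Power Cap : Active
SM : 690 MHz
GPU 00000000:41:00.0
SW Power Cap : Active
SM : 1215 MHz
GPU 00000000:61:00.0
SW Power Cap : Active
SM : 1245 MHz
"""


def find_power_capped(report):
    """Return {bus_id: sm_clock_mhz} for GPUs whose SW Power Cap is Active."""
    capped = {}
    current = None            # bus ID of the GPU block being read
    power_cap_active = False  # state of the SW Power Cap line for that GPU
    for raw in report.splitlines():
        line = raw.strip()
        if line.startswith("GPU "):
            current = line.split(None, 1)[1]
            power_cap_active = False
        elif current and line.startswith("SW Power Cap"):
            power_cap_active = line.split(":", 1)[1].strip() == "Active"
        elif current and power_cap_active and re.match(r"SM\s*:", line):
            # Keep only the first SM line seen for each GPU.
            if current not in capped:
                capped[current] = int(line.split(":", 1)[1].split()[0])
    return capped


if __name__ == "__main__":
    capped = find_power_capped(EXCERPT)
    for bus_id, mhz in capped.items():
        print(f"{bus_id}: power capped, SM clock {mhz} MHz")
    # The GPU with the lowest SM clock is the one held back the most.
    print("Lowest SM clock:", min(capped, key=capped.get))
```

Run against the excerpt above, this flags all four GPUs and identifies 00000000:2E:00.0 (690 MHz) as the one with the lowest SM clock.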

Further diagnosis revealed that the customer was running the same code they had used on their 3090 (Ampere) system, with no adjustments to take advantage of the newer hardware. Additionally, they were using an older version of PyTorch built against an older version of CUDA (11.7). Code in this configuration will still execute and complete, but performance is measurably slower.
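
A quick way to spot this kind of mismatch is to check which CUDA version the installed PyTorch build was compiled against. The sketch below is illustrative only and was not part of the original diagnosis. It assumes that CUDA 11.8 is the first toolkit release with support for the ADA architecture, so any build based on an earlier version (such as 11.7) is flagged. The names parse_version, cuda_predates_ada, and report_pytorch_build are hypothetical helpers; torch.version.cuda, torch.cuda.is_available(), torch.cuda.get_device_name(), and torch.cuda.get_device_capability() are existing PyTorch calls.

```python
# Assumption: CUDA 11.8 is the first toolkit release that supports ADA GPUs.
ADA_MIN_CUDA = (11, 8)


def parse_version(text):
    """Turn a string such as '11.7' into (11, 7) for numeric comparison."""
    return tuple(int(part) for part in text.split(".")[:2])


def cuda_predates_ada(cuda_version):
    """Return True if the given CUDA version is older than ADA_MIN_CUDA."""
    return parse_version(cuda_version) < ADA_MIN_CUDA


def report_pytorch_build():
    """Print the installed PyTorch build details and flag an old CUDA base."""
    try:
        import torch  # imported here so the helpers above work without PyTorch
    except ImportError:
        print("PyTorch is not installed in this environment")
        return
    built_with = torch.version.cuda  # None on CPU-only builds
    print("PyTorch", torch.__version__, "built with CUDA", built_with)
    if built_with and cuda_predates_ada(built_with):
        print("This build predates CUDA 11.8; consider a newer PyTorch build")
    if torch.cuda.is_available():
        # Device name and compute capability of the first visible GPU.
        print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))
```

Calling report_pytorch_build() inside the same Python environment the workload uses shows whether the build predates the assumed minimum. A result of True from cuda_predates_ada("11.7") matches the situation described in this article.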

Resolution

After identifying the software/hardware mismatch, we encouraged the customer to update their software on the ADA workstation to take advantage of the newer architecture. Once this was done, they reported that the new system not only matched the performance of the older cards but outperformed them.

Summary

When software is involved, performance issues in complex systems can be tricky to track down, even when the resolution (updating your software to match the new hardware) seems obvious in hindsight. While the instinct may be to blame the hardware, making sure the software properly utilizes that hardware is just as important.
