Fan Speed Issues
Summary
Fan speed issues cover fans that run too fast at idle, fluctuate, fail to spin, report wrong telemetry or GPU fan ERR, or make abnormal noise. Causes include failed fan modules, BMC/BIOS control faults, chassis/fan-board path faults, seating/contact issues, thermal-control mismatch, and normal-but-loud airflow.
Frequency
- 192 tickets mention fan speed, fan noise, fan telemetry, or fan-control faults.
Common Causes
- BMC/BIOS fan-control or telemetry faults. Fans ran high, oscillated, reacted backwards to temperature, or appeared mislabeled because control firmware or sensor interpretation was wrong (#22547, #24162, #31541, #42596, #41986, …and 60+ more).
- Single failed or noisy fan module. Cases often narrowed to one chassis, CPU-adjacent, or liquid-cooler radiator fan with grinding, humming, non-spin, or repeated warnings (#21703, #32480, #38267, #41804, #43535, #44453, …and 50+ more).
- Chassis, fan-board, or harness faults. Some systems needed chassis/fan-board repair or broader depot work because fan swaps did not clear the path fault (#31541, #32471, #34438, #35680, #40699).
- Thermal-control mismatch after service/configuration change. Fan behavior sometimes changed after RMA, firmware, or platform updates while the system otherwise ran (#24162, #24164, #30085, #40557, #40811).
- Expected or unconfirmed acoustics. Some reports were normal high airflow, separate CPU/GPU fan zones, load-dependent behavior, or intermittent noise closed before root cause confirmation; one 4x RTX PRO 6000 Blackwell Max-Q TS4 server was described as expected to run at roughly 11,000 RPM at idle and nearly double that under load (#11548, #17481, #23091, #37709, #43894, #44674).
Diagnostic Steps
- Classify the symptom. Separate constant high RPM, oscillation/reversed response, one non-spinning/noisy fan, telemetry-only alarms, and loud-but-normal cooling (#21703, #22547, #31541, #37709).
- Check management evidence. Review BMC/IPMI readings, fan mode/profile, BIOS/BMC versions, SEL/event logs, sensor-to-temperature consistency, and physical label mapping; for normal-but-loud servers, confirm whether IPMI Power Saving mode is available before deeper firmware changes (#22547, #24164, #31541, #39977, #42596, #44674).
- Isolate the physical path. Swap/reseat suspect fans, GPUs, and cables as appropriate, inspect fan boards/chassis harnesses, and check dust or foreign-object obstruction before assuming firmware or chassis failure (#32471, #34438, #38267, #39878, #43894, #44357).
- Reproduce under controlled load/temperature. Compare idle and load behavior when inverted or unstable fan response is suspected (#31541, #36410, #40557).
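The symptom classes in the first step can be sketched as a small triage helper. This is a minimal illustration, not a support tool: the function name `classify_fan`, the label strings, and both thresholds are hypothetical. Thresholds in particular are platform-specific, since some servers are expected to idle near 11,000 RPM, so a "constant-high" result is a prompt to check expected behavior rather than proof of a defect.

```python
from statistics import mean, pstdev


def classify_fan(samples, idle_high_rpm=8000, oscillation_ratio=0.15):
    """Classify one fan from idle samples of (temp_c, rpm).

    idle_high_rpm and oscillation_ratio are assumed placeholder
    thresholds; tune them per platform before relying on the result.
    """
    rpms = [rpm for _, rpm in samples]
    if not rpms:
        return "no-data"
    if all(rpm == 0 for rpm in rpms):
        # Either a non-spinning fan or a telemetry-only fault:
        # confirm physically before ordering a replacement.
        return "zero-rpm"
    average = mean(rpms)
    # Relative spread of readings; a large value means the fan is hunting.
    if pstdev(rpms) / average > oscillation_ratio:
        return "oscillating"
    if average >= idle_high_rpm:
        return "constant-high"
    return "nominal"


if __name__ == "__main__":
    idle_readings = [(40, 3000), (40, 9000), (41, 3000), (40, 9000)]
    print(classify_fan(idle_readings))
```

A "zero-rpm" result deliberately does not distinguish a dead fan from bad telemetry; that split still needs a physical check.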
Solutions
- Replace failed fan/module. Clean fix for noise, non-spin, or single-fan alerts, including liquid-cooler radiator fans (#21703, #32480, #38267, #41804, #43535, #44453, …and 50+ more).
- Update/reset BMC or BIOS fan control. Firmware/settings remediation resolves false telemetry, unstable curves, or control issues when hardware is healthy (#22547, #24164, #31541, #37005, #41268).
- Repair chassis-side control hardware. Use chassis replacement, fan-board work, or depot repair when fan swaps fail (#31541, #32471, #34438, #35680, #40699).
- Validate before return. Burn-in and thermal checks confirm corrected fan behavior after reproduction/repair (#22547, #31541, #32471, #36410).
- Clarify expected behavior or monitor after reseat. Explain normal load acoustics, separate CPU/GPU fan banks, or platform-specific server airflow when no defect is found; IPMI Power Saving mode may reduce idle noise somewhat but may not make a large difference, and fans can still scale under workload. Intermittent GPU fan ERR can be monitored after GPU reseat/swap when symptoms clear (#11548, #17481, #23091, #42596, #44357, #44674).
Edge Cases
- Fan symptoms can recur after a prior RMA or repair (#24162, #24164, #40699).
- Reversed control logic can make fans speed up as temperatures drop or otherwise react opposite expectation (#31541, #36410).
- Fan tickets may co-occur with overheating, GPU instability, software corruption, or boot failures, complicating intake; one follow-up loud-fan/high-temperature case ultimately recovered after university IT reinstalled the OS rather than after confirmed cooling hardware repair (#30085, #32471, #40557, #41684, #43116).
- Fan inoperability can be secondary during no-POST: a Tensor system's fans recovered after DIMM reseat/CMOS reset, but POST 00/no-video still required platform RMA (#43675).
- IPMI/part-label ambiguity may look like mixed PSU/fan telemetry failure, while comparison testing shows normal CPU/GPU fan-zone behavior (#42596).
- 0-RPM IPMI readings with physically spinning GPU fans can indicate telemetry/firmware interpretation rather than failed fan modules (#41986).
- Intermittent GPU fan ERR can resolve after reseating/swapping GPU positions, so not every GPU fan alert requires immediate fan or GPU replacement when the symptom clears under follow-up testing (#44357).
- False high CPU temperature telemetry can drive full-speed fans at idle; support requested IPMI sensor/SEL, GPU, and workload evidence before repair disposition (#39977).
- Simple fan replacement or acoustic inspection can still be delayed by part identification, shipment timing, follow-up gaps, return logistics, or the need to clarify whether only the fan or the full workstation must be returned (#21703, #32480, #41804, #43535, #43894, #44453).
- A dead fan inside a PSU should be treated as a PSU module failure rather than a standalone chassis-fan replacement; customer photos of a non-spinning PSU fan and orange fault LED can be sufficient RMA evidence when logs are unavailable (#44958).
- Reducing the number of active power supplies is not a meaningful acoustic fix for normal loud server airflow when system fans dominate the sound, and BIOS/firmware power-management changes such as ASPM should be avoided unless validated because they can cause GPUs to fall off the bus (#44674).
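The reversed-control edge case above can be screened for by comparing temperature and fan speed across idle and load samples. The sketch below computes a plain Pearson correlation and flags a strongly negative value. The function names and the -0.7 cutoff are assumptions for illustration, and a flag is only a reason to reproduce under controlled load, not a diagnosis.

```python
from math import sqrt


def temp_rpm_correlation(samples):
    """Pearson correlation between temperature and RPM.

    samples is a list of (temp_c, rpm) pairs. Returns None when there
    are fewer than two samples or when either series is constant.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    cov = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    var_r = sum((r - mean_r) ** 2 for _, r in samples)
    if var_t == 0 or var_r == 0:
        return None
    return cov / sqrt(var_t * var_r)


def looks_reversed(samples, threshold=-0.7):
    """True when RPM falls as temperature rises (assumed cutoff)."""
    r = temp_rpm_correlation(samples)
    return r is not None and r <= threshold


if __name__ == "__main__":
    # Hypothetical readings: the fan slows down as the system heats up.
    readings = [(40, 6000), (50, 5000), (60, 4000), (70, 3000)]
    print(looks_reversed(readings))
```

Samples should span both idle and load; a narrow temperature range gives a correlation too noisy to act on.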