Document Scope:
This article covers how to run CPU and memory stress tests on Linux using mprime, and how to isolate a bad individual memory DIMM or provoke a motherboard's DIMM slot into reporting errors.
*For Windows, use Prime95; the instructions below are the same.
Download mprime:
It is usually best to wget the archive into a new directory so the extracted files stay in one place; otherwise logs and readmes will accumulate in whatever directory you are currently in.
mkdir ~/mprime
cd ~/mprime
wget http://www.mersenne.org/ftp_root/gimps/p95v303b6.linux64.tar.gz
tar -xf p95v303b6.linux64.tar.gz
Do NOT use this on non-server single-processor systems with more than 40 cores/threads.
mprime/Prime95 uses hand-optimized assembly routines that push system resources significantly harder than most programs. Running it may be too extreme for your system, and you may mistake the result (e.g. an abrupt shutdown) for a hardware fault when it is actually a limitation of the system's ability to keep hardware temperatures under control.
Using mprime:
If mprime is only needed for stress testing, typically only option 16 (Options/Torture Test) is used. It will max out the detected threads across all CPUs. Hit 'enter' when it asks 'Number of torture test threads to run (x):'.
If you are following this article in one go, change into the mprime directory you created earlier (skip the cd if you are already there):
cd ~/mprime
./mprime
If this is the first time you have extracted/run it, it will ask whether you want to join GIMPS. Since this is for troubleshooting only, you can hit 'n'.
[root@c103454 mprime]# ./mprime
Main Menu
1. Test/Primenet
2. Test/Worker threads
3. Test/Status
4. Test/Continue
5. Test/Exit
6. Advanced/Test
7. Advanced/Time
8. Advanced/P-1
9. Advanced/ECM
10. Advanced/Manual Communication
11. Advanced/Unreserve Exponent
12. Advanced/Quit Gimps
13. Options/CPU
14. Options/Resource Limits
15. Options/Preferences
16. Options/Torture Test
17. Options/Benchmark
18. Help/About
19. Help/About PrimeNet Server
Your choice:
Usually run test (2) for CPU only, (3) for memory only, or Blend (4) to exercise both.
Your choice: 16
Number of torture test threads to run (80):
Choose a type of torture test to run.
1 = Smallest FFTs (tests L1/L2 caches, high power/heat/CPU stress).
2 = Small FFTs (tests L1/L2/L3 caches, maximum power/heat/CPU stress).
3 = Large FFTs (stresses memory controller and RAM).
4 = Blend (tests all of the above).
Blend is the default. NOTE: if you fail the blend test but pass the
smaller FFT tests then your problem is likely bad memory or bad memory
controller.
Type of torture test to run (4):
Customize settings (N): y
Min FFT size (in K) (4):
Max FFT size (in K) (8192):
Memory to use (in MB, 0 = in-place FFTs) (382415): 360000
Time to run each FFT size (in minutes) (6):
Run a weaker torture test (not recommended) (N):
Accept the answers above? (Y):
With default settings, mprime will use ALL resources on the small FFTs, which maxes out TDP sharply and slows the system heavily during the early/smaller tests; it may become near-unusable if everything is left on default. To remedy this, set the Min FFT size to 128 or 256, and leave 1-2 GB of memory free for system processes. Below is an example of the 'Memory to use' setting that keeps mprime from freezing the system when the test starts.
Customize settings (N): y
Min FFT size (in K) (4): 128
Max FFT size (in K) (8192):
Memory to use (in MB, 0 = in-place FFTs) (382415): 360000
Time to run each FFT size (in minutes) (6):
Run a weaker torture test (not recommended) (N):
Accept the answers above? (Y):
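As a rough guide for the 'Memory to use' prompt, compute total RAM minus some headroom. A minimal sketch reading /proc/meminfo; the 2048 MB cushion for system processes is an illustrative choice matching the 1-2 GB guidance above, not an mprime default:

```shell
# Suggest a "Memory to use" value: total system RAM minus ~2 GB headroom
# (the 2048 MB cushion is an illustrative choice, not an mprime default)
total_mb=$(awk '/^MemTotal:/ {print int($2/1024)}' /proc/meminfo)
suggest_mb=$((total_mb - 2048))
echo "Suggested 'Memory to use' (in MB): ${suggest_mb}"
```

Enter the printed value at the 'Memory to use (in MB, ...)' prompt.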
From here, it is best to use whatever GUI/display or command-line tools you have available to monitor resources and temperatures.
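If no GUI monitoring tool is handy, standard Linux interfaces work too. A minimal sketch; thermal zone paths vary by platform, and on servers the BMC sensor readings are often more accurate than these:

```shell
# One sampling pass of CPU load and any exposed thermal zones;
# wrap with `watch -n 5` or a shell loop for continuous monitoring
cat /proc/loadavg
for tz in /sys/class/thermal/thermal_zone*/temp; do
  if [ -f "$tz" ]; then
    echo "$tz: $(cat "$tz") millidegrees C"
  fi
done
```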
Expected outcomes (Pass/Fail):
Fail:
If the system crashes or reboots during the memory test (3) or blend (4), it can mean a couple of things (or more, as we continue using this tool to troubleshoot systems of varying hardware). If you are not sure, the system uptime or the IPMI event logs will usually help reconstruct the system's history.
- Memory is bad
- You would need to check IPMI/BMC logs to see if the system pinged an error from one of the DIMM slots
- 'ipmitool sel list' is another output to check; it requires 'ipmitool' to be installed. Also view the web GUI, as it may contain additional information such as temperature sensors/flags, since BMC event logs may only report critical errors.
- You can also run 'edac-util -v' to report errors from the system's EDAC counters. It can point out which memory DIMM slot is reporting correctable/uncorrectable errors.
- Ideally you want 0 across all DIMMs; a count under 100 over a week is somewhat negligible. If it reports THOUSANDS within a few hours (it may even overflow the edac folders), that is definitely a sign of a bad memory DIMM that needs to be replaced.
- Re-seat the memory first; improperly seated memory, whatever the cause, can produce correctable/negligible errors.
- If the errors are uncorrectable, they most likely already triggered a reboot/shutdown, since those are critical memory errors that affect the whole system.
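If edac-util is not installed, the same counters can be read straight from sysfs. A minimal sketch using the common per-memory-controller ce_count/ue_count layout; where the kernel exposes per-DIMM counters, they live in dimm*/ subdirectories under each mc*:

```shell
# Print correctable (CE) and uncorrectable (UE) error counts
# per memory controller from the EDAC sysfs tree
for mc in /sys/devices/system/edac/mc/mc*; do
  if [ -d "$mc" ]; then
    echo "$(basename "$mc"): CE=$(cat "$mc/ce_count") UE=$(cat "$mc/ue_count")"
  fi
done
```

Nonzero CE counts climbing during the test point at the DIMMs behind that controller.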
- CPU or Motherboard is bad
- Both are highly unlikely since the release of scalable processors. Damage to contacts/capacitors on the CPU, or bent pins in the motherboard's CPU socket, can cause misreads of memory or any installed hardware.
- The likely sign is memory errors reported on NUMEROUS DIFFERENT slots rather than just one. It may report a whole row as bad, i.e. DIMM slots DIMM_D1, DIMM_E2, and DIMM_F1 all reporting errors, with the same event logged again after swapping the DIMMs in that row.
Pass:
The system should still have mprime running after 24 hours. Tests run over weekends are still going/reporting on Monday morning. You can set your own duration depending on your site's expectations, but a system can typically run a blend test for 2-3 days straight.
Ongoing discovery:
The procedure for running the mprime test above will not change much, but the outcomes will. For example, try not to have any other processes running while using mprime. That was fine before scalable processors, when CPU cache sizes were significantly smaller, but other memory-heavy processes may now kill mprime without any specific reason appearing in 'results.txt'. An example is running the GPU standalone test from the GPU Validation Test article at the same time as mprime. To test high wattage/TDP, run mprime test 2 to max out the CPUs, then the GPU Validation Test to (almost) max TDP across ALL installed CUDA-capable GPUs (i.e. mprime will run two CPUs at 120-140 W each, and GPU Validation will run ~250-300 W on each GPU). Do not run the memory/blend test alongside GPU Validation, since it will cause memory stability issues.