1. Document Scope
This guide is intended for High Performance Computing (HPC) users and administrators who need to diagnose and resolve memory-related problems in their applications. Whether you are encountering out-of-memory (OOM) errors, memory leaks, or unexpected performance degradation, this document outlines a systematic approach to identifying, analyzing, and fixing common memory issues in HPC environments.
Target Audience:
- HPC application developers
- HPC system administrators
- Researchers running compute-intensive jobs on HPC clusters
Prerequisites:
- Basic understanding of Linux/Unix commands
- Familiarity with HPC job schedulers (e.g., Slurm, PBS, LSF)
- Access to HPC debugging and profiling tools (e.g., Valgrind, Intel Inspector, ARM DDT, CUDA memcheck)
2. Steps to Troubleshoot and Debug Memory Issues
Step 1: Verify HPC Job Resource Allocation
- Check Allocated vs. Used Memory
  - Slurm Example:
    sacct -j <job_id> --format=JobID,MaxRSS,Elapsed,State
    MaxRSS shows the peak memory used by your job.
  - PBS Example:
    qstat -fx <job_id> | grep resources_used.mem
  - Interpretation: Compare the peak memory usage to the requested memory. If MaxRSS (or resources_used.mem) is close to or exceeds the allocated memory, you may need to request more memory or optimize your application. A minimal script that pulls both values with sacct is sketched after this list.
- Monitor System-Wide Memory Usage
  - Common Commands:
    free -h      # Displays total, used, and free system memory
    top / htop   # Live view of the processes consuming the most memory
  - Interpretation: A system running close to its memory capacity may trigger OOM-killer events.
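To make the allocated-vs.-used comparison concrete, here is a minimal sketch that pulls both the requested memory (ReqMem) and the peak usage (MaxRSS) for each step of a finished job. It assumes Slurm's sacct is available; the script name and usage are illustrative, and field formats differ between Slurm versions, so adjust the parsing to your site.

    #!/bin/bash
    # Sketch: compare a finished job's peak memory (MaxRSS) against its request (ReqMem).
    # Usage: ./check_job_mem.sh <job_id>   (script name and usage are illustrative)
    JOBID="$1"

    # --parsable2 gives pipe-separated fields; --noheader drops the title row.
    sacct -j "${JOBID}" --noheader --parsable2 \
          --format=JobID,ReqMem,MaxRSS,Elapsed,State |
    while IFS='|' read -r stepid reqmem maxrss elapsed state; do
        echo "step=${stepid} requested=${reqmem} peak=${maxrss} elapsed=${elapsed} state=${state}"
    done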
Step 2: Check System Logs for OOM-Killer Events
- Linux Kernel Logs
  - Commands:
    dmesg | grep -i oom
    journalctl -k | grep -i oom
  - Interpretation: If your application triggers the OOM-killer, the kernel log typically indicates which process was killed and why. A sketch that keeps a little more context around each hit follows this list.
- Scheduler Logs
  - Check: Scheduler-specific logs (e.g., Slurm’s slurmd.log, PBS’s mom_logs) may show whether the job was terminated for exceeding its memory limit.
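The grep commands above can be narrowed down further. The sketch below pulls OOM-killer evidence with a few lines of context around each match; it assumes you can read the kernel ring buffer or the journal on the node where the job ran, which on some clusters requires administrator help.

    #!/bin/bash
    # Sketch: extract OOM-killer evidence with surrounding context.
    # Assumes access to the kernel log on the node where the job ran.

    # -T prints human-readable timestamps; -B/-A keep context lines around each match.
    dmesg -T | grep -i -B 2 -A 8 "out of memory"

    # On systemd-based nodes, the journal retains kernel messages (often across reboots).
    journalctl -k --since "today" | grep -i -A 8 "killed process"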
Step 3: Pinpoint Memory Leaks and Corruption
- Valgrind (CPU-focused)
  - Command:
    valgrind --leak-check=full --track-origins=yes ./my_hpc_app
  - Interpretation: Valgrind reports memory leaks, invalid reads/writes, and other issues. It is especially useful for single-node debugging before scaling up; a batch wrapper that captures one Valgrind log per MPI rank is sketched after this list.
- CUDA memcheck (GPU-focused)
  - Command:
    cuda-memcheck ./my_cuda_app
  - Interpretation: Detects memory violations in GPU kernels. Look for out-of-bounds accesses or illegal memory operations. (On recent CUDA toolkits, cuda-memcheck has been superseded by compute-sanitizer.)
- Intel Inspector
  - Command:
    inspxe-cl -collect mi1 -- ./my_hpc_app
  - Interpretation: Intel Inspector can identify memory errors, threading issues, and other correctness problems. It is particularly well suited to applications built with Intel compilers.
- ARM DDT / ARM MAP
  - Commands:
    # Launch via GUI or CLI on the HPC cluster
    map --profile ./my_hpc_app
    ddt ./my_hpc_app
  - Interpretation: Useful for debugging at scale on ARM and x86 systems. The visual interface helps pinpoint memory hotspots.
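When a leak only appears in a parallel run, it can help to wrap the whole MPI job in Valgrind. The batch script below is a sketch, assuming Slurm with srun and a debug build of ./my_hpc_app (a placeholder name); it writes one Valgrind log per rank and then greps the logs for the most telling messages. Expect a 10-50x slowdown, so keep the problem size small.

    #!/bin/bash
    #SBATCH --job-name=valgrind-check
    #SBATCH --nodes=1
    #SBATCH --ntasks=4
    #SBATCH --mem=16G
    #SBATCH --time=01:00:00

    # One Valgrind log per MPI rank: %q{SLURM_PROCID} expands the rank set by srun.
    srun valgrind --leak-check=full --track-origins=yes \
         --log-file=valgrind_rank_%q{SLURM_PROCID}.log \
         ./my_hpc_app

    # Summarize the findings across ranks.
    grep -H -E "definitely lost|Invalid (read|write)" valgrind_rank_*.log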
Step 4: Profile Memory Usage and Performance
- Performance Profilers
  - HPCToolkit: Collects call-path profiles to identify which functions consume the most memory.
    hpcrun -e MEM_UOPS_RETIRED -o my_profile ./my_hpc_app
    hpcviewer my_profile
  - Intel VTune Profiler: Offers advanced memory bandwidth analysis (e.g., cache misses, NUMA usage). Note that amplxe-cl is the legacy command name; newer VTune releases use vtune -collect memory-access.
    amplxe-cl -collect memory-access -- ./my_hpc_app
- Identify Hotspots
  - Focus on functions or loops with high memory allocation.
  - Optimize data structures or reduce temporary buffer usage if memory-intensive operations occur repeatedly.
  - For a quick first pass at peak memory without a full profiler, see the sketch after this list.
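Before committing to a full profiler run, two lighter-weight checks can confirm where the peak lands. The sketch below assumes a Linux node with GNU time installed at /usr/bin/time; ./my_hpc_app is a placeholder for your executable, and the two approaches are alternatives rather than a single pipeline.

    #!/bin/bash
    # 1) GNU time (-v) reports "Maximum resident set size" once the run finishes.
    /usr/bin/time -v ./my_hpc_app 2> time_report.txt
    grep "Maximum resident set size" time_report.txt

    # 2) Sample the resident set size of a running process every 10 seconds.
    ./my_hpc_app &
    APP_PID=$!
    while kill -0 "${APP_PID}" 2>/dev/null; do
        # VmRSS (current) and VmHWM (high-water mark) live in /proc/<pid>/status.
        grep -E "VmRSS|VmHWM" "/proc/${APP_PID}/status" 2>/dev/null
        sleep 10
    done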
Step 5: Adjust HPC Environment Settings
- NUMA Settings
  - Commands:
    numactl --hardware
    numactl --cpunodebind=0 --membind=0 ./my_hpc_app
  - Interpretation: Ensuring memory is allocated on the local NUMA node can reduce latency and avoid remote memory accesses.
- Huge Pages
  - Large page sizes can reduce TLB (Translation Lookaside Buffer) misses and improve performance for memory-intensive workloads.
  - Check your HPC center’s documentation on how to enable huge pages (e.g., module load craype-hugepages2M on Cray systems or module load hugepages on some clusters).
- Scheduler Resource Requests
  - Adjust memory requests in job scripts:
    - Slurm: #SBATCH --mem=32G
    - PBS: #PBS -l mem=32gb
  - Over-requesting memory can lead to inefficient cluster usage; under-requesting can cause OOM errors.
  - A job script that combines a memory request, a NUMA binding, and a huge-page check is sketched after this list.
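Putting the Step 5 settings together, the job script below is a sketch of one way to combine a memory request, a record of the NUMA and huge-page state, and a NUMA binding in a single Slurm job. ./my_hpc_app and the commented-out module name are placeholders for your site's values.

    #!/bin/bash
    #SBATCH --job-name=numa-membind
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --mem=32G
    #SBATCH --time=00:30:00

    # Record the NUMA topology and current huge-page availability for later comparison.
    numactl --hardware         > numa_layout.txt
    grep -i huge /proc/meminfo > hugepage_state.txt

    # Optional, site-specific huge-page module (name varies between clusters):
    # module load craype-hugepages2M

    # Bind CPUs and memory to NUMA node 0 to avoid remote memory accesses.
    numactl --cpunodebind=0 --membind=0 ./my_hpc_app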
Step 6: Perform Small-Scale Tests Before Full-Scale Runs
- Unit Testing
  - Run your application on a single node or a small number of nodes with debugging tools attached.
  - Validate memory usage and correctness in a controlled environment.
- Incremental Scaling
  - Increase the node count gradually.
  - Monitor memory usage to detect any unexpected growth in per-node memory consumption.
- Automated Testing
  - Integrate memory checks into Continuous Integration (CI) pipelines if possible; a minimal regression guard is sketched after this list.
  - This helps catch regressions in memory usage early in development.
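As a concrete version of the CI idea above, here is a minimal sketch of a memory-regression guard. The 2 GB budget, the run_small_test.sh wrapper, and the use of GNU time are all assumptions to adapt to your own test case and pipeline.

    #!/bin/bash
    # Fail the CI job if the small test case's peak RSS exceeds a fixed budget.
    set -euo pipefail

    MAX_RSS_KB=2000000   # budget in kilobytes (about 2 GB), chosen per test case

    /usr/bin/time -v ./run_small_test.sh 2> time_report.txt

    # GNU time prints "Maximum resident set size (kbytes): <value>".
    PEAK_KB=$(grep "Maximum resident set size" time_report.txt | awk '{print $NF}')

    echo "Peak RSS: ${PEAK_KB} kB (budget: ${MAX_RSS_KB} kB)"
    if [ "${PEAK_KB}" -gt "${MAX_RSS_KB}" ]; then
        echo "Memory regression: peak RSS exceeds budget" >&2
        exit 1
    fi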
3. Conclusion
Memory issues in HPC applications can significantly impact performance, resource utilization, and overall stability. By systematically checking resource allocation, reviewing system logs, and using specialized debugging tools (Valgrind, CUDA memcheck, Intel Inspector, ARM DDT, etc.), you can pinpoint and resolve memory leaks, corruption, and configuration missteps.
Key Takeaways
- Start with Basic Checks: Verify that your job allocation matches your application’s peak memory usage.
- Use the Right Tools: Different HPC ecosystems (CPU vs. GPU, Intel vs. ARM) have specialized profilers and debuggers.
- Optimize Iteratively: Begin with small-scale tests to validate fixes before running on the entire cluster.
- Leverage Scheduler Settings: Properly request memory in job scripts and tune NUMA/huge pages when appropriate.
With a clear methodology in place, you’ll reduce downtime, lower failure rates, and ensure your HPC applications make efficient use of available memory resources. Happy debugging!