1. Document Scope
This guide is designed for researchers, developers, and system administrators who manage or utilize supercomputers (large-scale HPC systems). It provides practical methods and commands for diagnosing memory-related problems—such as leaks, out-of-memory errors, and performance bottlenecks—across hundreds or even thousands of nodes.
Target Audience
- HPC administrators ensuring optimal cluster performance
- Researchers running large-scale simulations
- Developers debugging memory issues in parallelized applications
Prerequisites
- Basic command-line experience on Linux-based supercomputers
- Familiarity with job scheduling (Slurm, PBS, or similar)
- Some knowledge of debugging tools (e.g., Valgrind, ARM DDT, Intel Inspector)
2. Steps to Detect and Troubleshoot Memory Issues
Step 1: Verify Job Resource Allocation
- Check Your Job's Requested vs. Used Memory
  - Slurm Example:
    sacct -j <job_id> --format=JobID,Partition,MaxRSS,Elapsed,State
    MaxRSS reports the peak memory used by your job.
  - PBS Example:
    qstat -fx <job_id> | grep resources_used.mem
  - Interpretation: If the maximum memory usage (MaxRSS or resources_used.mem) is close to or exceeds your requested memory, your job may fail or be killed by the OOM (Out-Of-Memory) killer.
- Monitor Live Memory Usage on Compute Nodes
  - Slurm:
    sstat -j <job_id> --format=AveRSS,AveVMSize,MaxRSS
  - Top or htop (if node access is allowed):
    top -u <username>
  - Interpretation: Observing ongoing memory consumption can signal whether your job is trending toward an out-of-memory scenario. A small polling script is sketched below.
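As a rough illustration, the sketch below polls sstat at a fixed interval while a job is running and appends the output to a log file. The 60-second interval, the log-file name, and the use of the .batch step are assumptions to adapt to your site, and sstat only reports on steps for which Slurm's accounting plugin gathers data.

    #!/bin/bash
    # Sketch: record memory usage of a running Slurm job at a fixed interval.
    # Usage: ./memwatch.sh <job_id>   (memwatch.sh is a hypothetical name)
    JOBID="${1:?usage: $0 <job_id>}"
    LOG="memwatch_${JOBID}.log"

    # Loop for as long as the job still appears in the queue.
    while [ -n "$(squeue -h -j "$JOBID" 2>/dev/null)" ]; do
        {
            date '+%Y-%m-%d %H:%M:%S'
            # The .batch step covers work run directly in the batch script;
            # use the relevant step ID (e.g. <job_id>.0) for srun-launched steps.
            sstat -j "${JOBID}.batch" --format=JobID,MaxRSS,AveRSS,AveVMSize
        } >> "$LOG"
        sleep 60
    done
    echo "Job $JOBID finished; memory samples saved in $LOG"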
Step 2: Check System and Scheduler Logs
- Kernel Logs for OOM Events
  - Commands:
    dmesg | grep -i oom
    journalctl -k | grep -i oom
  - Interpretation: If the OOM killer is invoked, these logs typically show which process was terminated and why. A sketch for scanning multiple nodes follows this list.
- Scheduler-Specific Logs
  - Slurm: Look in slurmd.log on the compute nodes for memory-related errors.
  - PBS: Check mom_logs on the node where the job ran for signs of memory constraint violations.
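Where direct node access is permitted, a loop like the following can check each node that hosted a job for OOM-killer messages. The sacct call to recover the node list and plain ssh access are assumptions; many centers restrict compute-node logins or provide parallel shells such as pdsh instead, and reading the kernel log may require elevated privileges on some systems.

    #!/bin/bash
    # Sketch: search the kernel log on every node a completed Slurm job used.
    # Assumes SSH access to compute nodes is allowed at your site.
    JOBID="${1:?usage: $0 <job_id>}"

    # Recover the (compressed) node list for the allocation, e.g. node[001-004].
    NODELIST=$(sacct -j "$JOBID" -X --noheader --format=NodeList%200 | awk 'NR==1 {print $1}')

    # Expand it into individual hostnames and grep each node's kernel log.
    for node in $(scontrol show hostnames "$NODELIST"); do
        echo "=== $node ==="
        ssh "$node" "dmesg -T 2>/dev/null | grep -iE 'out of memory|oom-killer' | tail -n 5"
    done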
Step 3: Identify Memory Leaks or Corruption
- Valgrind (Serial or Small Parallel Debugging)
  - Command:
    valgrind --leak-check=full --show-leak-kinds=all ./your_app
  - Interpretation: Valgrind locates invalid reads/writes, uninitialized variables, and memory leaks. For multi-node jobs, test at a small scale or on a single node first (see the batch-script sketch after this list).
- ARM DDT (Scalable Debugger)
  - Command:
    ddt mpirun -n <num_processes> ./your_app
  - Interpretation: Provides a graphical interface to debug MPI applications in parallel. Can detect memory errors, race conditions, and deadlocks across many ranks.
- Intel Inspector
  - Command:
    inspxe-cl -collect mi1 -- ./your_app
  - Interpretation: Ideal for code compiled with Intel compilers; detects memory leaks, buffer overflows, and threading issues.
- CUDA memcheck (GPU-Focused)
  - Command:
    cuda-memcheck ./your_app
  - Interpretation: Essential for GPU-accelerated workloads. Finds out-of-bounds accesses, illegal memory accesses, and synchronization errors in CUDA kernels. (Recent CUDA toolkits provide compute-sanitizer as the successor to cuda-memcheck.)
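To exercise Valgrind on a deliberately small MPI run, a batch script along these lines can help; it is a sketch only. The resource requests, the valgrind module name, and the per-rank log-file naming are assumptions to adapt to your site, and Valgrind will slow the run down considerably.

    #!/bin/bash
    #SBATCH --job-name=memcheck-small
    #SBATCH --nodes=1
    #SBATCH --ntasks=4
    #SBATCH --time=01:00:00
    #SBATCH --mem=16G
    # Sketch: run a reduced-size MPI job under Valgrind, one log per rank.
    # Site-specific directives (partition, account) are omitted.

    module load valgrind   # assumption: your site provides a Valgrind module

    # %q{SLURM_PROCID} expands to each task's rank, giving one log file per rank.
    srun valgrind --leak-check=full --show-leak-kinds=all \
         --log-file=valgrind_rank_%q{SLURM_PROCID}.log \
         ./your_app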
Step 4: Profile Memory Usage and Bottlenecks
- Profiling Tools
  - HPCToolkit:
    hpcrun -o hpctoolkit-out ./your_app
    hpcviewer hpctoolkit-out
    Maps function calls to memory usage, helping locate hotspots.
  - Intel VTune Profiler:
    amplxe-cl -collect memory-access -- ./your_app
    Analyzes memory bandwidth, cache misses, and NUMA usage to highlight bottlenecks. (Recent VTune releases use the vtune command in place of amplxe-cl.)
- Look for Frequent Allocations
  - Large, repeated allocations (e.g., in time-step loops) can cause fragmentation or degrade performance.
  - Consolidating allocations or reusing buffers may cut overhead. A heap-profiling sketch follows this list.
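One way to see whether a time-step loop is allocating repeatedly or steadily growing the heap is Valgrind's massif heap profiler, sketched here; the exact output filename contains the process ID, so it will differ from run to run.

    # Sketch: profile heap usage over time with Valgrind's massif tool.
    # --time-unit=ms plots memory against wall-clock time, which makes
    # per-time-step allocation patterns easier to spot.
    valgrind --tool=massif --time-unit=ms ./your_app

    # massif writes massif.out.<pid>; summarize it in the terminal:
    ms_print massif.out.<pid>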
Step 5: Tune System-Level and Application-Level Settings
- NUMA Awareness
  - Commands:
    numactl --hardware
    numactl --cpunodebind=0 --membind=0 ./your_app
  - Interpretation: Ensures memory is allocated on the same NUMA node as the CPU, reducing remote memory access.
- Huge Pages
  - Why: Large page sizes (2 MB or more) can reduce TLB misses, which benefits memory-intensive workloads.
  - Command (system-dependent):
    cat /proc/meminfo | grep HugePages
  - Interpretation: Check whether huge pages are available and sufficient; some supercomputing centers provide modules such as:
    module load craype-hugepages2M
- Job Scheduler Memory Directives
  - Slurm:
    #SBATCH --mem=64G
  - PBS:
    #PBS -l mem=64GB
  - Interpretation: Request enough memory to handle peak usage while avoiding unnecessary overhead on the cluster.
- MPI Environment Tuning
  - For Open MPI, environment variables like OMPI_MCA_btl_openib_eager_limit or OMPI_MCA_btl_vader_single_copy_mechanism might impact memory usage.
  - Check your MPI documentation for buffer size parameters and adjust as needed; a combined batch-script sketch follows this list.
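The settings above can be combined in a single batch script, sketched below under several assumptions: the craype-hugepages2M module is Cray-specific and may not exist on your system, the Open MPI variable is shown only as an example of the kind of knob to review, and the NUMA binding assumes the job fits on NUMA node 0.

    #!/bin/bash
    #SBATCH --job-name=numa-tuned
    #SBATCH --nodes=1
    #SBATCH --ntasks=2
    #SBATCH --mem=64G              # explicit memory request (Slurm directive above)
    #SBATCH --time=02:00:00

    # Assumption: a Cray-style huge-page module is available; skip if it is not.
    module load craype-hugepages2M

    # Illustrative Open MPI knob only; consult your MPI documentation before
    # changing transport or buffer parameters.
    export OMPI_MCA_btl_vader_single_copy_mechanism=none

    # Bind CPU and memory to NUMA node 0 (single-NUMA-node example).
    srun numactl --cpunodebind=0 --membind=0 ./your_app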
Step 6: Validate Fixes at Scale
- Test on a Single Node or Reduced Node Count
  - Fix the issues discovered in previous steps.
  - Re-run debug and profiling tools on a small subset of nodes to confirm improvements.
- Incremental Scaling
  - Submit larger jobs in stages, monitoring memory usage via commands like:
    sstat -j <job_id> --format=JobID,MaxRSS,AveRSS,AveVMSize
  - Watch for any unexpected spikes as the node count increases.
- Continuous Integration (CI) & Regression Testing
  - Integrate memory checks (e.g., Valgrind tests) into your CI system to catch new leaks or inefficiencies early; a minimal sketch follows this list.
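A minimal CI step might look like the sketch below, which makes Valgrind return a non-zero exit code when it finds memory errors or definite/indirect leaks; ./run_tests is a placeholder for your project's test binary.

    #!/bin/bash
    # Sketch: CI memory check that fails the build on leaks or memory errors.
    # ./run_tests is a hypothetical test binary; substitute your own.
    set -euo pipefail

    valgrind --leak-check=full \
             --error-exitcode=1 \
             --errors-for-leak-kinds=definite,indirect \
             ./run_tests
    echo "No definite/indirect leaks or memory errors detected."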
3. Conclusion
Detecting and troubleshooting memory issues on supercomputers requires a multifaceted approach—verifying resource requests, analyzing system logs, employing specialized debugging tools, and tuning both system-level and application-level parameters. By following these steps, you can significantly reduce the risk of out-of-memory errors, maximize application performance, and avoid derailing large-scale workloads.
Key Takeaways
- Match Requests to Usage: Keep job memory requests realistic to prevent OOM kills and cluster resource waste.
- Use the Right Tools: Tools like Valgrind, ARM DDT, and Intel Inspector can isolate memory leaks and corruption across large-scale HPC environments.
- Optimize Data Placement: Leverage NUMA binding and huge pages for improved performance on complex supercomputer architectures.
- Scale Iteratively: Debug on a small scale, then gradually increase node counts to verify that fixes hold up in production.
Armed with these strategies, HPC practitioners can maintain stable, high-performing applications on even the largest supercomputers.