How-To Article: Detecting and Troubleshooting Memory Issues on Supercomputers

Russell Smith

1. Document Scope

This guide is for researchers, developers, and system administrators who manage or use supercomputers (large-scale HPC systems). It provides practical methods and commands for diagnosing memory-related problems (leaks, out-of-memory errors, and performance bottlenecks) across hundreds or even thousands of nodes.

Target Audience

  • HPC administrators ensuring optimal cluster performance
  • Researchers running large-scale simulations
  • Developers debugging memory issues in parallelized applications

Prerequisites

  • Basic command-line experience on Linux-based supercomputers
  • Familiarity with job scheduling (Slurm, PBS, or similar)
  • Some knowledge of debugging tools (e.g., Valgrind, ARM DDT, Intel Inspector)

2. Steps to Detect and Troubleshoot Memory Issues

Step 1: Verify Job Resource Allocation

  1. Check Your Job’s Requested vs. Used Memory

    • Slurm Example:
      sacct -j <job_id> --format=JobID,Partition,MaxRSS,Elapsed,State
      
      • MaxRSS reports the peak memory used by your job.
    • PBS Example:
      qstat -fx <job_id> | grep resources_used.mem
      
    • Interpretation: If the maximum memory usage (MaxRSS or resources_used.mem) is close to or exceeds your requested memory, your job may fail or be killed by the OOM (Out-Of-Memory) killer.
  2. Monitor Live Memory Usage on Compute Nodes

    • Slurm:
      sstat -j <job_id> --format=AveRSS,AveVMSize,MaxRSS
      
    • Top or htop (if node access is allowed):
      top -u <username>
      
    • Interpretation: Observing ongoing memory consumption can signal whether your job is trending toward an out-of-memory scenario.
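
For a running record rather than a one-off snapshot, a small polling loop can append sstat output to a log file while the job runs. The sketch below is a minimal example, assuming Slurm and a placeholder job ID of 123456; the 60-second interval and log file name are arbitrary choices.

  #!/bin/bash
  # Poll a running Slurm job's memory usage once per minute and log it.
  JOBID=123456                       # replace with your job ID
  LOG=memory_usage_${JOBID}.log
  while squeue -j "$JOBID" -h -o %T | grep -q RUNNING; do
      date >> "$LOG"
      sstat -j "$JOBID" --format=JobID,MaxRSS,AveRSS,AveVMSize >> "$LOG"
      sleep 60
  done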

Step 2: Check System and Scheduler Logs

  1. Kernel Logs for OOM Events

    • Commands:
      dmesg | grep -i oom
      journalctl -k | grep -i oom
      
    • Interpretation: If the OOM killer was invoked, these logs typically show which process was terminated and why (a multi-node sweep of the kernel logs is sketched after this list).
  2. Scheduler-Specific Logs

    • Slurm: Look in slurmd.log on the compute nodes for memory-related errors.
    • PBS: Check mom_logs on the node where the job ran for signs of memory constraint violations.
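
When a job ran on many nodes, it can help to sweep the kernel log on every node it used. The sketch below is one way to do that under Slurm; it assumes your site allows ssh to compute nodes and lets regular users read dmesg (many sites restrict one or both), and 123456 is a placeholder job ID.

  #!/bin/bash
  # Grep the kernel log for OOM events on every node a finished job used.
  JOBID=123456                       # replace with your job ID
  NODES=$(sacct -j "$JOBID" -X -n --format=NodeList%100 | head -1 | tr -d ' ')
  for node in $(scontrol show hostnames "$NODES"); do
      echo "=== $node ==="
      ssh "$node" 'dmesg -T | grep -i "out of memory\|oom"'
  done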

Step 3: Identify Memory Leaks or Corruption

  1. Valgrind (Serial or Small Parallel Debugging)

    • Command:
      valgrind --leak-check=full --show-leak-kinds=all ./your_app
      
    • Interpretation: Valgrind reports invalid reads/writes, use of uninitialised values, and memory leaks. For multi-node jobs, test at small scale or on a single node first (a per-rank logging sketch follows this list).
  2. ARM DDT (Scalable Debugger, now Linaro DDT)

    • Command:
      ddt mpirun -n <num_processes> ./your_app
      
    • Interpretation: Provides a graphical interface to debug MPI applications in parallel. Can detect memory errors, race conditions, and deadlocks across many ranks.
  3. Intel Inspector

    • Command:
      inspxe-cl -collect mi1 -- ./your_app
      
    • Interpretation: Detects memory leaks, invalid memory accesses, and threading issues; it integrates naturally with the Intel compilers and oneAPI toolchain but also works with binaries built by other compilers.
  4. CUDA memcheck (GPU-Focused)

    • Command:
      cuda-memcheck ./your_app
      
    • Interpretation: Essential for GPU-accelerated workloads. Finds out-of-bounds accesses, illegal memory accesses, and synchronization errors in CUDA kernels. On recent CUDA toolkits, cuda-memcheck has been superseded by compute-sanitizer, which performs the same checks.
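
For MPI codes, it is often convenient to run a handful of ranks under Valgrind and write one log per rank. The sketch below is a small Slurm batch script that relies on Valgrind's %q{SLURM_PROCID} log-file expansion to produce one log per rank; building with -g and reduced optimization makes the reports far more readable.

  #!/bin/bash
  #SBATCH --ntasks=4
  #SBATCH --mem=16G
  #SBATCH --time=00:30:00

  # Run a small number of ranks under Valgrind, one log file per rank.
  srun valgrind --leak-check=full --show-leak-kinds=all \
       --log-file=valgrind.rank%q{SLURM_PROCID}.log ./your_app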

Step 4: Profile Memory Usage and Bottlenecks

  1. Profiling Tools

    • HPCToolkit:
      hpcrun -o hpctoolkit-out ./your_app
      hpcviewer hpctoolkit-out
      
      • Maps function calls and calling contexts to memory usage, helping locate hotspots (a fuller measurement-to-viewer workflow is sketched after this list).
    • Intel VTune Profiler:
      amplxe-cl -collect memory-access -- ./your_app
      
      • Analyzes memory bandwidth, cache misses, and NUMA behavior to highlight bottlenecks. In recent oneAPI releases the command-line tool is named vtune (vtune -collect memory-access -- ./your_app); amplxe-cl is the older name.
  2. Look for Frequent Allocations

    • Large, repeated allocations (e.g., in time-step loops) can cause fragmentation or degrade performance.
    • Consolidating allocations or reusing buffers may cut overhead.
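
HPCToolkit measurements are normally post-processed before viewing. A typical workflow is sketched below, assuming HPCToolkit is installed (often via module load hpctoolkit); the exact steps vary somewhat between HPCToolkit versions, so check your site's documentation.

  # Collect measurements (prefix with srun -n <N> for an MPI run).
  hpcrun -o hpctoolkit-measurements ./your_app

  # Recover program structure from the binary, then build the database.
  hpcstruct ./your_app
  hpcprof -S your_app.hpcstruct -o hpctoolkit-database hpctoolkit-measurements

  # Browse the results in the GUI.
  hpcviewer hpctoolkit-database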

Step 5: Tune System-Level and Application-Level Settings

  1. NUMA Awareness

    • Commands:
      numactl --hardware
      numactl --cpunodebind=0 --membind=0 ./your_app
      
    • Interpretation: Ensures memory is allocated on the same NUMA node as the CPU, reducing remote memory access.
  2. Huge Pages

    • Why: Larger page sizes (2 MB or 1 GB instead of the default 4 KB) reduce TLB misses, which benefits memory-intensive workloads.
    • Command (system-dependent):
      cat /proc/meminfo | grep HugePages
      
    • Interpretation: Check whether huge pages are available and sufficient; some supercomputing centers provide environment modules for this (for example, module load craype-hugepages2M on Cray systems).
  3. Job Scheduler Memory Directives

    • Slurm:
      #SBATCH --mem=64G
      
    • PBS:
      #PBS -l mem=64GB
      
    • Interpretation: Request enough memory to handle peak usage while avoiding unnecessary overhead on the cluster.
  4. MPI Environment Tuning

    • For Open MPI, environment variables like OMPI_MCA_btl_openib_eager_limit or OMPI_MCA_btl_vader_single_copy_mechanism might impact memory usage.
    • Check your MPI documentation for buffer size parameters and adjust as needed.
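
Putting several of these settings together, a batch script might look like the sketch below. It assumes a Slurm system; the craype-hugepages2M module is Cray-specific, and the memory, task, and time values are placeholders to adapt to your application and site.

  #!/bin/bash
  #SBATCH --job-name=mem_tuned_run
  #SBATCH --nodes=1
  #SBATCH --ntasks-per-node=32
  #SBATCH --mem=64G
  #SBATCH --time=02:00:00

  # Site-specific: huge-page support on Cray systems (skip if unavailable).
  module load craype-hugepages2M

  # Bind tasks to cores so first-touch allocations stay on the local NUMA node.
  srun --cpu-bind=cores ./your_app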

Step 6: Validate Fixes at Scale

  1. Test on a Single Node or Reduced Node Count

    • Fix the issues discovered in previous steps.
    • Re-run debug and profiling tools on a small subset of nodes to confirm improvements.
  2. Incremental Scaling

    • Submit larger jobs in stages, monitoring memory usage via commands like:
      sstat -j <job_id> --format=JobID,MaxRSS,AveRSS,AveVMSize
      
    • Watch for any unexpected spikes as node count increases.
  3. Continuous Integration (CI) & Regression Testing

    • Integrate memory checks (e.g., Valgrind tests) into your CI system to catch new leaks or inefficiencies early.
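
A memory regression check can be as simple as running a short test case under Valgrind and failing the build when errors or leaks are reported. The sketch below assumes a quick-running test binary named ./your_test; --error-exitcode makes Valgrind's exit status usable by a CI system.

  #!/bin/bash
  # CI memory check: fail if Valgrind reports errors or definite/possible leaks.
  set -e
  valgrind --leak-check=full --error-exitcode=1 ./your_test
  echo "Memory check passed."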

3. Conclusion

Detecting and troubleshooting memory issues on supercomputers requires a multifaceted approach—verifying resource requests, analyzing system logs, employing specialized debugging tools, and tuning both system-level and application-level parameters. By following these steps, you can significantly reduce the risk of out-of-memory errors, maximize application performance, and avoid derailing large-scale workloads.

Key Takeaways

  • Match Requests to Usage: Keep job memory requests realistic to prevent OOM kills and cluster resource waste.
  • Use the Right Tools: Tools like Valgrind, ARM DDT, and Intel Inspector can isolate memory leaks and corruption across large-scale HPC environments.
  • Optimize Data Placement: Leverage NUMA binding and huge pages for improved performance on complex supercomputer architectures.
  • Scale Iteratively: Debug on a small scale, then gradually increase node counts to verify that fixes hold up in production.

Armed with these strategies, HPC practitioners can maintain stable, high-performing applications on even the largest supercomputers.
