How-To Article: Memory Profiling and Debugging on HPC Clusters

Russell Smith

1. Document Scope

This guide is intended for scientists, software developers, and HPC administrators who need to identify and resolve memory-related performance issues or errors on High Performance Computing (HPC) clusters. It focuses on practical methods to profile memory usage, detect memory leaks, and optimize HPC applications to fully leverage cluster resources.

Target Audience

  • HPC application developers
  • Researchers running large-scale simulations or data processing
  • System administrators ensuring cluster efficiency

Prerequisites

  • Familiarity with Linux commands and job schedulers (e.g., Slurm, PBS, LSF)
  • Basic knowledge of debugging or profiling tools (Valgrind, Intel Inspector, etc.)
  • Access to a multi-node HPC cluster

2. Steps to Perform Memory Profiling and Debugging

Step 1: Verify Resource Requests and Usage

  1. Check Allocated vs. Used Memory

    • Slurm Example:
      sacct -j <job_id> --format=JobID,ReqMem,MaxRSS,Elapsed,State
      
      • MaxRSS reports the peak resident set size (RSS) across your job's steps; ReqMem shows the memory you requested.
    • PBS Example:
      qstat -fx <job_id> | grep resources_used.mem
      
    • Interpretation: Compare your application’s peak usage with the memory you requested (--mem in Slurm or -l mem= in PBS). If usage approaches the limit, you risk out-of-memory (OOM) errors; a sample batch script showing an explicit memory request appears after this step.
  2. Monitoring Memory in Real-Time

    • Common Commands:
      free -h      # Shows system memory (human-readable)
      top / htop   # Displays processes sorted by CPU/memory usage
      
    • Interpretation: Spot-check whether a single process or node is becoming a memory bottleneck. On most clusters you need to attach to the compute node first, for example via ssh or srun --jobid=<job_id>.
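
As a concrete illustration, the sketch below shows a minimal Slurm batch script that requests memory explicitly and then checks peak usage after the run. The job name, task counts, and memory and time limits are placeholders to adapt to your cluster.

  #!/bin/bash
  #SBATCH --job-name=mem_check      # placeholder job name
  #SBATCH --nodes=1
  #SBATCH --ntasks=4
  #SBATCH --mem=16G                 # per-node memory request; adjust to your workload
  #SBATCH --time=00:30:00

  srun ./my_hpc_app                 # application binary from the examples above

  # After the job completes, compare requested vs. peak memory:
  #   sacct -j <job_id> --format=JobID,ReqMem,MaxRSS,Elapsed,State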

Step 2: Collect System and Scheduler Logs

  1. Detect OOM Killer Events

    • Commands:
      dmesg | grep -i oom
      journalctl -k | grep -i oom
      
    • Interpretation: If the kernel’s Out-of-Memory (OOM) killer terminated your process, you’ll see details on which process was killed and why; a small script for sweeping a job’s nodes for OOM messages follows this step.
  2. Scheduler Logs

    • Slurm: Look into slurmd.log on the compute node for memory-related errors.
    • PBS: Check the mom_logs on the node to see if memory limits were exceeded.
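
If you suspect the OOM killer but are unsure which node was affected, a helper along the following lines can sweep the nodes of a finished Slurm job. This is a hedged sketch: it assumes your site permits ssh to compute nodes and that dmesg is readable by ordinary users, and the job ID is a placeholder.

  JOBID=123456                                # placeholder job ID
  NODES=$(sacct -j "$JOBID" -X --noheader --format=NodeList%200 | tr -d ' ')
  for node in $(scontrol show hostnames "$NODES"); do
      echo "=== $node ==="
      ssh "$node" "dmesg -T | grep -i 'out of memory'" || true
  done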

Step 3: Debug Memory Issues

  1. Valgrind (CPU-Focused)

    • Command:
      valgrind --leak-check=full --show-leak-kinds=all ./my_hpc_app
      
    • Interpretation: Identifies invalid reads/writes, memory leaks, and use of uninitialized values. Expect a substantial slowdown, so it is particularly useful for a single-node or small test before scaling; an example batch script wrapping an MPI run in Valgrind appears after this step.
  2. Intel Inspector

    • Command:
      inspxe-cl -collect mi1 -- ./my_hpc_app
      
    • Interpretation: Locates memory errors, threading races, and more. It integrates most smoothly with Intel compiler toolchains, though it can analyze binaries built with other compilers.
  3. Arm DDT (now Linaro DDT)

    • Command:
      ddt mpirun -n <num_procs> ./my_hpc_app
      
    • Interpretation: A scalable parallel debugger that detects memory issues in MPI-based applications. Has a user-friendly GUI for HPC clusters.
  4. CUDA memcheck (GPU-Focused)

    • Command:
      cuda-memcheck ./my_cuda_app
      
    • Interpretation: Useful for GPU-accelerated HPC tasks. Flags out-of-bounds and misaligned memory operations in CUDA kernels. Note that cuda-memcheck is deprecated in recent CUDA toolkits in favor of compute-sanitizer.
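
To run these debuggers at small scale under the batch system, a Slurm script like the sketch below can wrap each MPI rank in Valgrind and write one log per rank. Treat it as a sketch rather than a site-specific recipe: the module name, task counts, and time and memory limits are assumptions.

  #!/bin/bash
  #SBATCH --nodes=1
  #SBATCH --ntasks=4
  #SBATCH --mem=32G          # Valgrind inflates memory use, so request headroom
  #SBATCH --time=02:00:00    # expect a large slowdown under Valgrind

  module load valgrind       # module name varies by site

  # One Valgrind log per MPI rank; %p expands to the process ID.
  srun valgrind --leak-check=full --show-leak-kinds=all \
       --log-file=valgrind_%p.log ./my_hpc_app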

Step 4: Profile Memory Usage

  1. HPCToolkit

    • Command:
      hpcrun -e MEM_UOPS_RETIRED -o hpctoolkit-data ./my_hpc_app
      hpcviewer hpctoolkit-data
      
    • Interpretation: Produces call-path profiles indicating where (in which functions) memory operations occur most frequently. Available event names depend on your CPU (hpcrun -L lists them), and the raw measurements are typically post-processed with hpcstruct and hpcprof before viewing; see the workflow sketch after this step.
  2. Intel VTune Profiler

    • Command:
      amplxe-cl -collect memory-access -- ./my_hpc_app
      
    • Interpretation: Offers advanced insights into cache misses, memory bandwidth, and NUMA usage. In recent oneAPI releases the command-line tool is invoked as vtune rather than amplxe-cl.
  3. Arm MAP (now Linaro MAP)

    • Command:
      map --profile ./my_hpc_app
      
    • Interpretation: Provides a graphical timeline of memory usage alongside CPU load, helping you correlate code regions with memory spikes.
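
Because hpcrun measurement data normally goes through a post-processing stage before it can be browsed, a typical HPCToolkit workflow looks roughly like the sketch below. Exact command-line details vary between HPCToolkit versions, so treat this as an outline rather than a literal recipe.

  hpcrun -o hpctoolkit-measurements ./my_hpc_app        # collect call-path samples
  hpcstruct hpctoolkit-measurements                      # recover program structure
  hpcprof -o hpctoolkit-database hpctoolkit-measurements # build the profile database
  hpcviewer hpctoolkit-database                          # browse the resulting profile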

Step 5: Optimize Memory Usage

  1. Improve Data Structures

    • Why: Large arrays or complex data structures may not be aligned or blocked efficiently, causing excessive cache misses.
    • Action: Use techniques like loop tiling/blocking, padding arrays for alignment, and pooling memory allocations.
  2. NUMA Settings

    • Commands:
      numactl --hardware
      numactl --cpunodebind=0 --membind=0 ./my_hpc_app
      
    • Interpretation: Ensures memory is allocated on the same NUMA node as the CPU that uses it, reducing remote-access overhead; a combined NUMA and huge-page example follows this step.
  3. Enable Huge Pages

    • Why: Larger page sizes (e.g., 2MB) can reduce TLB (Translation Lookaside Buffer) misses for memory-intensive codes.
    • Example: Some clusters provide modules like module load craype-hugepages2M or module load hugepages.
  4. Adjust MPI and Network Settings

    • Inspect environment variables for your MPI stack (e.g., Open MPI’s eager limits or shared memory transports) that may impact memory usage.
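
The snippet below ties these runtime settings together: it inspects the NUMA layout and the transparent huge page state, then launches an MPI run with per-NUMA-domain binding. The mpirun flags shown are Open MPI style and differ for other MPI stacks, so check your own stack’s documentation.

  numactl --hardware                                   # NUMA nodes and free memory per node
  cat /sys/kernel/mm/transparent_hugepage/enabled      # current transparent huge page policy

  # Open MPI example: map and bind each rank to a NUMA domain to keep memory local.
  mpirun -n 8 --map-by numa --bind-to numa ./my_hpc_app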

Step 6: Validate on Scaled Runs

  1. Run Small Tests

    • Fix issues uncovered by debuggers and profilers on a single or small number of nodes.
    • Validate the correctness and memory usage improvements.
  2. Incremental Scaling

    • Slowly increase node count (and possibly problem size) to ensure memory usage scales appropriately.
    • Slurm Example:
      sstat -j <job_id> --format=JobID,AveRSS,MaxRSS,AveVMSize
      
    • Interpretation: Look for unexpected spikes in memory usage per node or rank (sstat reports statistics for running job steps; use sacct once the job has completed).
  3. Continuous Monitoring & CI

    • Integrate memory checks (Valgrind or others) into a Continuous Integration (CI) pipeline to catch regressions quickly; see the sketch below.
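
As a minimal sketch of such a check, the snippet below runs a test binary under Valgrind and makes the pipeline fail when memory errors are found. The build target and binary name are placeholders for your own test suite.

  #!/bin/bash
  set -euo pipefail

  make test_binary                                     # placeholder build step
  # --error-exitcode makes Valgrind return non-zero when it detects errors,
  # which in turn fails the CI job.
  valgrind --leak-check=full --error-exitcode=1 ./test_binary
  echo "Memory check passed"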

3. Conclusion

Memory profiling and debugging on HPC clusters is crucial for preventing costly job failures and ensuring efficient resource utilization. By systematically verifying resource requests, using specialized debugging tools, and performing targeted optimizations, you can resolve memory leaks, reduce fragmentation, and optimize performance for your HPC workloads.

Key Takeaways

  • Match Requests to Usage: Request enough memory to cover your application’s peak demand without grossly over-allocating, so jobs neither fail nor waste cluster resources.
  • Deploy the Right Tools: Tools like Valgrind, Intel Inspector, ARM DDT, and profiler suites (VTune, HPCToolkit, MAP) address specific environments (CPU, GPU, parallel).
  • Optimize Data and Runtime Settings: Data structure alignment, NUMA binding, and huge pages can collectively yield significant performance gains.
  • Scale in Stages: Validate fixes on a single node, then expand to multiple nodes, carefully monitoring memory usage to catch any surprises.

Armed with these best practices, HPC practitioners can effectively diagnose memory bottlenecks, improve cluster throughput, and achieve faster time-to-solution for their large-scale computations.
