Cray Debugging Support Tools

Implementation

CDST is available on HPE Cray EX and HPE Cray supercomputer systems, HPE Apollo 2000 Gen10Plus systems, HPE Apollo 80 systems, and Cray XC and CS systems; however, not all tools are supported on all platforms. See the specific platform user guides for details.

Introduction

Cray Debugging Support Tools (CDST) is a collection of tools for debugging parallel applications.

ATP - Abnormal Termination Processing

Abnormal Termination Processing (ATP) is a tool that monitors user applications on Cray systems. If an application encounters a fatal signal, ATP handles the signal and analyzes the dying application.

Overview

  • Enabling ATP

  • Required Slurm configuration for CS clusters

Operation of ATP

  • Load ATP Plugin

  • About Backtrace Trees

  • About Core Dumps

  • About the Core Selection Algorithm

  • About Hold Time

  • About Signals

  • About GPU Support

  • About Node Free Space Checks

  • About Custom Runtime Checks

  • Performing a manual dump

Environment variables

User-configurable settings that modify ATP’s behavior at runtime

Compiler-specific details

  • Intel Fortran

  • GNU Fortran

Examples

STAT - Stack Trace Analysis Tool

The Stack Trace Analysis Tool (STAT) is a scalable, lightweight debugger for parallel applications. STAT gathers stack traces from all of a parallel application’s processes and merges them into a compact form. The resulting output shows where in the code each application process is executing, which can help locate a bug. The STAT package includes three commands that invoke and control STAT and analyze its output.
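To illustrate the idea of merging per-process stack traces into a compact form, here is a minimal Python sketch that builds a prefix tree from a handful of traces. It is not STAT's implementation or output format; the function names `merge_stack_traces` and `render`, the frame names, and the rank numbers are all hypothetical and chosen only for illustration.

```python
def merge_stack_traces(traces):
    """Merge per-rank stack traces into a prefix tree.

    `traces` maps a rank number to its list of frame names, outermost
    frame first. Each tree node records the set of ranks whose trace
    passes through that frame.
    """
    root = {"ranks": set(), "children": {}}
    for rank, frames in traces.items():
        node = root
        node["ranks"].add(rank)
        for frame in frames:
            # Ranks that share a call path share the same tree nodes.
            node = node["children"].setdefault(
                frame, {"ranks": set(), "children": {}}
            )
            node["ranks"].add(rank)
    return root


def render(node, name="<root>", depth=0):
    """Return the tree as indented text lines, one line per frame."""
    lines = [f"{'  ' * depth}{name} [{len(node['ranks'])} ranks]"]
    for child_name in sorted(node["children"]):
        lines.extend(render(node["children"][child_name], child_name, depth + 1))
    return lines


if __name__ == "__main__":
    # Hypothetical traces: ranks 0 and 1 wait in a barrier, rank 2 is still computing.
    sample = {
        0: ["main", "solve", "MPI_Barrier"],
        1: ["main", "solve", "MPI_Barrier"],
        2: ["main", "solve", "compute"],
    }
    print("\n".join(render(merge_stack_traces(sample))))
```

In a merged view like this, a branch reached by only a few ranks stands out immediately, which is the kind of clue that helps narrow down a hang.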

man pages

CCDB - Cray Comparative Debugger

The Cray Comparative Debugger (CCDB) is Cray’s next-generation debugging tool. CCDB provides a graphical user interface that extends the comparative debugging capabilities of gdb4hpc, so that users can easily compare data structures between two executing applications.
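The core idea of comparative debugging, checking the same data structure in two runs and reporting where they diverge, can be sketched in a few lines of Python. This is only a conceptual illustration, not CCDB or gdb4hpc code; the function name `compare_arrays`, its tolerance parameters, and the sample values are assumptions made for this example.

```python
import math


def compare_arrays(a, b, rel_tol=1e-9, abs_tol=0.0):
    """Return (index, value_a, value_b) for every element that differs.

    `a` and `b` stand in for the same array captured from two runs of
    an application. Values are compared with a tolerance because
    floating-point results rarely match bit for bit.
    """
    if len(a) != len(b):
        raise ValueError("arrays differ in length")
    return [
        (i, x, y)
        for i, (x, y) in enumerate(zip(a, b))
        if not math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
    ]


if __name__ == "__main__":
    # Hypothetical data: a reference run and a run under investigation.
    reference = [1.0, 2.0, 3.0]
    suspect = [1.0, 2.5, 3.0]
    for index, expected, actual in compare_arrays(reference, suspect):
        print(f"element {index}: {expected} != {actual}")
```

A tolerance-based comparison is used here because two correct runs can still differ in the last bits of a floating-point result.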

User Guide

CCDB User Guide

man pages

CTI - Common Tools Interface

The Common Tools Interface (CTI) is an infrastructure framework to enable tools to launch, interact with, and run utilities alongside applications on HPC systems.

man pages

gdb4hpc

gdb4hpc is a GDB-based parallel debugger for applications built with the Fortran, C, and C++ compilers from CCE, PGI, GNU, and Intel.

Guides and Tutorials

Getting Started Guide

The getting started guide covers the following topics and more:

  • The help system

  • Debugging basics

  • Procsets

  • Focusing on a subset of an application’s ranks

Getting Started Guide

HPC Features Tutorial

The tutorial covers the following unique HPC-centric features of gdb4hpc and more:

  • Comparative debugging

  • Assertion scripts

  • Decompositions

  • Shell commands and output piping

  • Array slicing

HPC Features Tutorial

Debugging an MPI/CUDA GPU Application Tutorial

This tutorial shows how to use gdb4hpc to debug a multinode MPI application that uses CUDA compute kernels. The tutorial is written with CUDA/NVIDIA GPUs in mind, but the concepts apply to HIP/AMD GPUs as well.

Debugging an MPI/CUDA GPU Application Tutorial

Handling Arrays

This guide covers gdb4hpc’s enhancements to GDB’s array-handling tools.

Handling Arrays

Parallel Programming Library Support

gdb4hpc has extra support for some popular parallel programming libraries.

VSCode Extension Guide

A guide to gdb4hpc’s VS Code extension.

VSCode Extension Guide

Python Debugging

gdb4hpc has extra support for debugging Python applications.

gdb4hpc man Pages and Reference Material

sanitizers4hpc

Sanitizers4hpc is an aggregation tool to collect and analyze LLVM Sanitizers output at scale.

Guides

User Guide

man pages

sanitizers4hpc

valgrind4hpc

Valgrind4hpc is a Valgrind-based debugging tool to aid in the detection of memory leaks and errors in parallel applications. Valgrind4hpc aggregates any duplicate messages across ranks to help provide an understandable picture of program behavior.
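To show what aggregating duplicate messages across ranks can look like, the following Python sketch groups identical messages and lists the ranks that reported each one. It does not reproduce Valgrind4hpc's algorithm or output format; the names `aggregate_messages` and `format_ranks` and the sample messages are hypothetical.

```python
def aggregate_messages(reports):
    """Group identical messages and collect the ranks that emitted them.

    `reports` is an iterable of (rank, message) pairs. The result is a
    list of (message, sorted_ranks) tuples in first-seen order.
    """
    ranks_by_message = {}
    for rank, message in reports:
        ranks_by_message.setdefault(message, set()).add(rank)
    return [(msg, sorted(ranks)) for msg, ranks in ranks_by_message.items()]


def format_ranks(ranks):
    """Compress a sorted, non-empty rank list into ranges: [0, 1, 2, 5] -> '0-2,5'."""
    parts = []
    start = prev = ranks[0]
    for rank in ranks[1:]:
        if rank == prev + 1:
            prev = rank
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = rank
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


if __name__ == "__main__":
    # Hypothetical messages, not real tool output.
    reports = [
        (0, "invalid read of size 4 in compute()"),
        (1, "invalid read of size 4 in compute()"),
        (2, "invalid read of size 4 in compute()"),
        (5, "invalid read of size 4 in compute()"),
        (3, "40 bytes definitely lost in init()"),
    ]
    for message, ranks in aggregate_messages(reports):
        print(f"RANKS: <{format_ranks(ranks)}>  {message}")
```

Collapsing repeats this way keeps the report length tied to the number of distinct problems rather than to the number of ranks.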