Cray Performance and Analysis Tools

Author:

Hewlett Packard Enterprise Development LP.

Copyright:

Copyright 2023-2024 Hewlett Packard Enterprise Development LP.

Overview

The Performance Analysis Tools (Perftools) are a suite of utilities that enable users to capture and analyze performance data generated during program execution, thereby reducing the time to port and tune applications. These tools provide an integrated infrastructure for measurement, analysis, and visualization of computation, communication, I/O, and memory utilization to help users optimize programs for faster execution and more efficient computing resource usage. The data collected and analyzed by these tools help users answer two fundamental developer questions: "What is the performance of my program?" and "How can I make it perform better?"

The toolset allows developers to perform profiling, sampling, and tracing experiments on executables, extracting information at the program, function, loop, and line level. Programs that use Fortran, C/C++ (including UPC), MPI, SHMEM, OpenMP, CUDA, HIP, OpenACC, or a combination of these languages and models are supported. Profiling of applications built with the HPE Cray Compiling Environment (CCE), AMD, AOCC, GNU, Intel, Intel OneAPI, or Nvidia HPC SDK compilers is supported. However, not all combinations of programming models are supported, and not all compilers are supported on all platforms.

Use performance tools to:

  • Identify bottlenecks

  • Find load-balance and synchronization issues

  • Find communication overhead issues

  • Identify loops for parallelization

  • Map memory bandwidth utilization

  • Optimize vectorization

  • Collect application energy consumption information

  • Collect scaling information

  • Interpret performance data

Introduction

Performance analysis consists of three basic steps:

  1. Instrument the program to specify what kind of data to collect under what conditions.

  2. Execute the instrumented executable to generate and capture designated data.

  3. Analyze the data.
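The three steps above can be sketched as a short shell function. This is a minimal sketch under stated assumptions, not a definitive recipe: the program name my_app, the srun launcher with four ranks, and the module names in the comment are illustrative, and the sketch assumes that the instrumented executable takes the default +pat suffix and that the experiment data directory name begins with the instrumented program name.

```shell
# Hypothetical names: my_app and the launcher settings are assumptions.
APP=${APP:-./my_app}
LAUNCH=${LAUNCH:-srun -n 4}

analyze_app() {
    # Guard: the Perftools commands exist only after the modules are loaded,
    # for example with "module load perftools-base perftools" (assumed names).
    if ! command -v pat_build >/dev/null 2>&1; then
        echo "pat_build not found; load the Perftools modules first" >&2
        return 1
    fi

    # 1. Instrument: pat_build writes a new executable, assumed to be $APP+pat.
    pat_build "$APP" || return 1

    # 2. Execute the instrumented program to capture data.
    #    $LAUNCH is left unquoted on purpose so it splits into command and options.
    $LAUNCH "${APP}+pat" || return 1

    # 3. Analyze: pick the newest experiment data directory (assumed naming)
    #    and generate a text report from it.
    exp_dir=$(ls -dt "${APP}+pat+"* 2>/dev/null | head -n 1)
    [ -n "$exp_dir" ] || return 1
    pat_report "$exp_dir"
}
```

After loading the modules, call analyze_app from the directory that contains the program. Set APP or LAUNCH in the environment to match the local system.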

There are three programming interfaces available:

Perftools-lite

Perftools-lite: Simple interface that produces reports to stdout. There are five Perftools-lite submodules:

  • perftools-lite - Lowest-overhead sampling experiment that identifies key program bottlenecks.

  • perftools-lite-events - Produces a summarized trace; a good tool for detailed MPI statistics, including synchronization overhead.

  • perftools-lite-loops - Provides loop work estimates (must be used with CCE).

  • perftools-lite-gpu - Focuses on the program’s use of GPU accelerators.

  • perftools-lite-hbm - Reports memory traffic information (must be used with CCE and only for Intel processors).

See the perftools-lite man page for details.
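A Perftools-lite session might look like the following. This is a hedged sketch: the perftools-base module name, the cc compiler wrapper, the source file my_app.c, and the srun launcher are assumptions that depend on the local installation, and any of the submodules listed above could be loaded in place of perftools-lite.

```shell
# Hypothetical names: my_app.c, my_app, and the srun launcher are assumptions.
lite_experiment() {
    # "module" is normally a shell function provided by the site's module system.
    if ! command -v module >/dev/null 2>&1; then
        echo "module command not found; this sketch needs a modules environment" >&2
        return 1
    fi

    # Load the base module (assumed name), then one Perftools-lite submodule.
    module load perftools-base perftools-lite || return 1

    # Rebuild the program while the module is loaded so that it is instrumented.
    # "cc" is assumed to be the compiler wrapper on the system.
    cc -o my_app my_app.c || return 1

    # Run as usual; the report is written to stdout.
    srun -n 4 ./my_app
}
```

The function returns a nonzero status if any step fails, so it can be used safely in a batch script.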

Perftools

Perftools: Advanced interface that provides full-featured data collection and analysis capability, including full traces with timeline displays. It includes the following components:

  • pat_build - Utility that instruments programs for performance data collection.

  • pat_report - Utility that generates text reports from the collected data and exports the data to other applications. Use it after instrumenting the program with pat_build, setting the runtime environment variables, and executing the program. See the pat_report man page for details.

  • CrayPat runtime library - Collects specified performance data during program execution. See the intro_craypat man page for details.

Perftools-preload

Perftools-preload: Runtime instrumentation version of the performance analysis tools, which eliminates the need to instrument an executable program with pat_build. perftools-preload acquires performance data about the program, providing access to nearly all of the performance analysis features available when executing a program instrumented with pat_build. See the perftools-preload man page for more details.

  • pat_run - An option for programs built with or without perftools-preload. The program is instrumented at run time, and the collected data can be explored further with pat_report and Apprentice2. See the pat_run man page for details.
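As a rough illustration of run-time instrumentation, pat_run can be placed between the job launcher and the program. This is a sketch under assumptions: the program name my_app, the srun launcher, and the rank count are hypothetical, and the exact invocation on a given system should be checked against the pat_run man page.

```shell
# Hypothetical names: my_app and the srun launcher with four ranks are assumptions.
run_with_pat_run() {
    # Guard: pat_run is available only after the Perftools modules are loaded.
    if ! command -v pat_run >/dev/null 2>&1; then
        echo "pat_run not found; load the Perftools modules first" >&2
        return 1
    fi

    # No pat_build step: the unmodified program is instrumented as it runs.
    srun -n 4 pat_run ./my_app
}
```

The data collected by the run can then be passed to pat_report in the same way as data from a program instrumented with pat_build.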

Experiments available include:

  • Sampling experiment - A lightweight experiment that interrupts the program at specific intervals to gather data.

  • Profiling experiment - A tracing experiment that summarizes collected data.

  • Tracing experiment - A full-trace experiment that provides detailed information.

Also included:

  • PAPI - The PAPI library, from the Innovative Computing Laboratory at the University of Tennessee in Knoxville, is distributed with HPE Performance Analysis Tools. PAPI allows applications or custom tools to interface with hardware performance counters made available by the processor, network, or accelerator vendor. Perftools components use PAPI internally for CPU, GPU, network, power, and energy performance counter collection for derived metrics, observations, and performance reporting. A simplified user interface for accessing counters is provided, which avoids the source code modifications that direct use of PAPI requires.

  • Apprentice2 - An interactive X Window System tool for visualizing and manipulating performance analysis data captured during program execution. Mac and Windows clients are also available.

  • pat_view - Aggregates and presents multiple sampling experiments for program scaling analysis. See the pat_view man page for more information.

  • Reveal - Extends existing performance analysis technology by combining performance statistics, program source code visualization, and CCE compiler optimization feedback to better identify and exploit parallelism, and to pinpoint memory bandwidth sensitivities in an application. Reveal enables navigation through source code to highlighted dependencies or bottlenecks discovered during optimization. Using the program library provided by CCE and collected performance data, users can understand which high-level loops benefit from loop-level optimizations such as exposing vector parallelism. Reveal provides dependency and variable scoping information for those loops and assists users with creating parallel directives. A Mac client is available for Reveal.

  • pat_info - Generates a quick summary statement of the contents of a CrayPat experiment data directory.

  • pat_opts - Displays compile and link options used to prepare files for performance instrumentation.

Overview of Apprentice2

Apprentice2 is a GUI tool for visualizing and manipulating performance analysis data captured during program execution. It can display a wide variety of reports and graphs. The number and appearance of the reports in Apprentice2 are determined by the kind and quantity of data captured during program execution, the type of program being analyzed, the way in which the program is instrumented, and the environment variables in effect at the time of program execution.

Apprentice2 is not integrated with performance tools. Users cannot set up or run performance analysis experiments from within Apprentice2, nor can they launch Apprentice2 from within performance tools. First use pat_build to instrument the program and capture performance data, then use pat_report to process the raw data (saved in .xf format) and convert it to .ap2 format. Perftools-lite modules, when loaded, automatically carry out these steps and generate .ap2 files. Use Apprentice2 to visualize and explore the resulting .ap2 data files.
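The hand-off from pat_report to Apprentice2 described above can be sketched as follows. The sketch makes several assumptions: EXP_DIR is a placeholder for an experiment data directory produced by an earlier run, report.txt is a hypothetical output file, Apprentice2 is started with the app2 command, and app2 accepts the experiment directory as its argument (if it does not, pass an .ap2 file instead).

```shell
# Hypothetical: EXP_DIR names an experiment data directory from an earlier run.
EXP_DIR=${EXP_DIR:-my_app+pat+experiment}

view_experiment() {
    # Guard: pat_report is available only after the Perftools modules are loaded.
    if ! command -v pat_report >/dev/null 2>&1; then
        echo "pat_report not found; load the Perftools modules first" >&2
        return 1
    fi

    # Process the raw .xf data. This writes a text report and, as described
    # above, converts the data to .ap2 format.
    pat_report "$EXP_DIR" > report.txt || return 1

    # Open the processed data in Apprentice2 (assumed command name: app2).
    app2 "$EXP_DIR"
}
```

Set EXP_DIR to the actual directory name before calling view_experiment. With Perftools-lite the pat_report step is already done, so only the app2 call is needed.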

Feel free to experiment with the Apprentice2 user interface, and to left- or right-click on any area that looks like it might be interesting. Because Apprentice2 does not write any data files, it cannot corrupt, truncate, or otherwise damage the original experiment data. However, under some circumstances, it is possible to use the Apprentice2 text report to overwrite generated MPICH_RANK_ORDER files. If this happens, use pat_report to regenerate the rank order files from the original .ap2 data files. For more information, see MPI Automatic Rank Order Analysis.

Both Windows and Mac clients are available for Apprentice2.

Overview of Apprentice3

Apprentice3 is the next-generation GUI tool, which will ultimately replace Apprentice2 entirely. Currently, it supplements Apprentice2 by showcasing new or updated features:

  • interactive performance reports

  • flame graph visualization

  • improved timeline view

New features and improvements will be rolled out in Apprentice3 in subsequent releases.

The user guide for Apprentice3 is here:

Apprentice 3

Man Pages

These man pages introduce and explain various components of the Performance Analysis Tools (Perftools):