About the HPE Cray Clang C and C++ Quick Reference
HPE Cray Compiling Environment (CCE) provides Fortran, C, and C++ compilers for HPE Cray EX supercomputer systems and HPE Apollo 2000 Gen10Plus systems. The HPE Cray Clang C and C++ Quick Reference includes basic reference information for the Cray Clang C and C++ compilers that are included in CCE. This guide is intended for users and application programmers.
Release Information and Record of Revision
The CCE 17.0.1 C and C++ compilers are based on Clang/LLVM 17.0.6. The latest and full version of Clang compiler documentation is located here. This document focuses on the ways in which the CCE implementation of Clang differs from the LLVM source.
Publication Title | Date | Release |
---|---|---|
HPE Cray Clang C and C++ Quick Reference (17.0.1) S-2179 | May 2024 | CCE 17.0.1 |
HPE Cray Clang C and C++ Quick Reference (17.0.0) S-2179 | December 2023 | CCE 17.0.0 |
HPE Cray Clang C and C++ Quick Reference (16.0.1) S-2179 | September 2023 | CCE 16.0.1 |
HPE Cray Clang C and C++ Quick Reference (16.0) S-2179 | June 2023 | CCE 16.0 |
HPE Cray Clang C and C++ Quick Reference (15.0) S-2179 | November 2022 | CCE 15.0 |
HPE Cray Clang C and C++ Quick Reference (14.0) S-2179 | May 2022 | CCE 14.0 |
HPE Cray Clang C and C++ Quick Reference (13.0) S-2179 | November 2021 | CCE 13.0 |
HPE Cray Clang C and C++ Quick Reference (12.0) S-2179 | June 2021 | CCE 12.0 |
HPE Cray Clang C and C++ Quick Reference (11.0) S-2179 | November 2020 | CCE 11.0 |
Cray® Classic C and C++ Reference Manual (9.1) S-2179 | November 2019 | CCE 9.1 |
Cray® Classic C and C++ Reference Manual (9.0) S-2179 | June 2019 | CCE 9.0 |
Cray® C and C++ Reference Manual (8.7) S-2179 | April 2018 | CCE 8.7 |
Cray® C and C++ Reference Manual (8.6) S-2179 | August 2017 | CCE 8.6 |
Cray® C and C++ Reference Manual (8.5) S-2179 | June 2016 | CCE 8.5 |
Additional Information Resources
Online help is available after the CCE module is loaded through:
man clang
- Returns the Clang man page.
man craycc or man crayCC
- Redirects you to the Clang man page. (Note that the craycc and crayCC man pages used in earlier versions of CCE are replaced by aliases.)
clang --help
- Returns a summary of the command line options and arguments. Because this list is lengthy, piping the output through a pager (for example, clang --help | less) may be more useful.
The man page is presumed to be more current if content differences exist between this guide and the clang man page. Note also that the complete Clang reference manual is included in HTML format at the /opt/cray/pe/cce/<version>.0.0/doc/html/index.html filesystem location.
Typographic Conventions
This style indicates program code, reserved words, library functions, command-line prompts, screen output, file/path names, variables, and other software constructs.
\ (backslash) at the end of a command line indicates the Linux shell line continuation character (lines joined by a backslash are parsed as a single line).
Copying and Pasting from a PDF
Using the copy and paste functions from within a PDF file can be unreliable. Although copying and pasting a command line typically works, copying and pasting formatted file content (for example, JSON, YAML) typically fails. To ensure that PDF file contents are copied and pasted correctly while performing procedures in the PDF version of this guide:
Copy the content from the PDF file.
Paste it to a neutral editing form and add the necessary formatting.
Copy the content from the neutral form and paste it into the console.
Double-check copied and pasted commands for correctness, as some commands may not render correctly from the PDF.
Introduction to CCE Clang
HPE Cray Compiling Environment (CCE) Clang supports compiling the C, C++, and UPC languages and the OpenMP parallel programming model for targets available on supported systems. Using this compiler for other languages, models, or targets is not supported; any documentation related to such features is provided as-is for reference purposes only.
Invoking Clang
The CCE Clang C and C++ compilers should be invoked using cc and CC as usual. This method sets the target based on the loaded craype-arch module and links with the usual HPE Cray libraries, including HPE Cray-optimized math functions, memcpy, and the OpenMP runtime. Use of the native clang or clang++ commands is discouraged, as doing so may not find necessary paths and will not link automatically with Cray libraries.
To invoke Clang:
For C programs
cc [options] <filename> ...
For C++ programs
CC [options] <filename> ...
For UPC programs
cc -hupc [options] <filename> ...
For HIP programs
CC [options] -x hip <filename> ...
Compilation Stages
clang is a C, C++, and Objective-C compiler that encompasses preprocessing, parsing, optimization, code generation, assembly, and linking. Depending on which high-level mode setting is passed, Clang will stop before doing a full link. While Clang is highly integrated, it is important to understand the stages of compilation in order to understand how to invoke it. These stages are:
Driver
The clang executable is actually a small driver that controls the overall execution of other tools, such as the compiler, assembler, and linker. Typically, you do not need to interact with the driver, but you transparently use it to run the other tools.
Preprocessing
This stage handles the tokenization of the input source file, macro expansion, #include expansion, and handling of other preprocessor directives. The output of this stage is typically called a .i (for C), .ii (for C++), .mi (for Objective-C), or .mii (for Objective-C++) file.
Parsing and Semantic Analysis
This stage parses the input file, translating preprocessor tokens into a parse tree. Once the input is in the form of a parse tree, this stage applies semantic analysis to compute types for expressions and to determine whether the code is well formed. This stage is responsible for generating most of the compiler warnings as well as parse errors. The output of this stage is an Abstract Syntax Tree (AST).
Code Generation and Optimization
This stage translates an AST into low-level intermediate representation (known as “LLVM IR”) and ultimately to machine code. This phase is responsible for optimizing the generated code and handling target-specific code generation. The output of this stage is typically called a .s file or assembly file.
Clang also supports the use of an integrated assembler, where the code generator produces object files directly. This operation avoids the overhead of generating the .s file and calling the target assembler.
Assembler
This stage runs the target assembler to translate the output of the compiler into a target object file. The output of this stage is typically called a .o file or object file.
Linker
This stage runs the target linker to merge multiple object files into an executable or dynamic library. The output of this stage is typically called an a.out, .dylib, or .so file.
Static Analyzer
The Clang Static Analyzer is a tool that scans source code to try to find bugs through code analysis. This tool uses many parts of Clang and is built into the same driver. See the Clang Static Analyzer website for more details on how to use the static analyzer.
General Enhancements
Clang/LLVM provides improved performance of generated code and includes additional features. In general, performance improvements are enabled by default at appropriate optimization levels. (Features must be requested by an option.) The compiler predefines the __cray__ macro in addition to the usual Clang predefined macros.
-fcray, -fno-cray
Select the compiler’s default behavior, which provides the basis for customization by other options. The default is -fcray, which enables Cray enhancements, whereas -fno-cray disables Cray enhancements. The last instance of -fcray and -fno-cray applies. The position of -fcray or -fno-cray relative to other options does not matter. For example, with -fcray, other options that disable specific Cray enhancements are honored, and with -fno-cray, other options that enable specific Cray enhancements are honored.
Note that -fno-cray is intended to help diagnose whether a problem is caused by a Cray enhancement or is present in the base Clang/LLVM distribution. Either way, the problem should be reported to Cray to receive the fastest response.
-fenhanced-asm=<verbosity>
Emit descriptive comments in assembly code output. The default is -fenhanced-asm=1. Greater levels of verbosity will include more provenance information for inlined code. Use -fenhanced-asm=0 to disable.
-fenhanced-ir=<verbosity>
Emit descriptive comments in IR output. The default is -fenhanced-ir=1. Greater levels of verbosity will include more provenance information for inlined code. Use -fenhanced-ir=0 to disable.
Performance Options
Clang does not apply optimizations unless they are requested. For best performance, -Ofast with -flto is recommended. For applications that are sensitive to floating-point optimizations, it may be necessary to adjust the floating-point optimization level using one of the options below. For applications that require bit reproducibility (that is, applications that are designed to calculate the same result no matter how the work is distributed among a constant product of MPI ranks and OpenMP threads), it may be necessary to forgo floating-point optimization by using -O3 instead of -Ofast.
-fast
Enables -Ofast and link-time optimization.
-ffp=level
Selects a level for HPE Cray floating-point math optimizations and math library functions. Requesting the lowest level, -ffp=0, generates code with the highest precision and grants the compiler minimal freedom to optimize floating-point operations, whereas requesting the highest level, -ffp=4, grants the compiler maximal freedom to aggressively optimize but likely results in lower precision.
Requesting levels 1 through 4 flushes denormals to zero and implies -funsafe-math-optimizations and -fno-math-errno; if those options are subsequently changed, then this option may not work as expected. With -fcray, -ffp=3 is implied by -ffast-math or -Ofast. Using -ffp=0 prevents the use of HPE Cray math libraries and disables all HPE Cray floating-point optimizations.
Supported values for level are 0, 1, 2, 3, 4.
-fcray-mallopt, -fno-cray-mallopt
Optimize malloc by using HPE Cray custom mallopt parameters, which for most programs improves performance but may cause higher memory usage. This is a link-time option. The default is -fcray-mallopt.
-fivdep, -fno-ivdep
Enables or disables #pragma ivdep handling. The default is -fivdep.
-flocal-restrict, -fno-local-restrict
Honors restrict-qualified pointers declared in a block scope by assuming that they do not alias with other restrict-qualified pointers declared in the same block scope. The default is -flocal-restrict.
-floop-trips=scale
Optimizes under the assumption that loops with statically unknown trip counts have trip counts at the scale of scale.
At this time, the only valid value for scale is huge, which assumes that loops have trip counts large enough that referenced data will not fit in the cache.
Feature Options
Clang options supporting CCE features include:
-fsave-decompile
Generates decompile (.dc) and IR (.ll) files before optimization, vectorization, and code generation, as well as after LTO. A decompile is a higher-level presentation of the IR that looks similar to C source code but cannot be compiled. Use the decompile to gain insight about restructuring and optimization changes made by the compiler.
-fsave-loopmark
Generates a loopmark listing file (.lst) that shows which optimizations were applied to which parts of the source code.
-floopmark-style
Controls the style of the loopmark listing file produced when -fsave-loopmark is used. Allowed values are grouped (all messages placed at the end of the listing) and interspersed (each message placed after the relevant source code line). The default is grouped.
-finstrument-loops
Instruments loops to gather profile data to use with CrayPAT.
-finstrument-openmp
Turns the insertion of the CrayPat OpenMP and accelerator tracing calls on and off.
-fcray-program-library-path=<directory>
Creates and uses a persistent repository of compiler information specified by <directory>.
The program library repository is implemented as a directory, and the information contained in the program library is built up with each compiler invocation. Any compilation that does not have the -fcray-program-library-path option will not add information to this repository.
Because the program library is persistent, the user is responsible for managing it. Because the program library is a directory, use rm -r to remove it. For example, rm -r <directory> might be added to the “make clean” target in an application Makefile.
If an application Makefile works by creating files in multiple directories during a single build, then <directory> must be an absolute path. Otherwise, multiple and incomplete program library repositories are created. For example, avoid -fcray-program-library-path=./pl and instead use -fcray-program-library-path=/fullpath/builddir/pl.
This option may be specified with either an equal sign or a space before directory.
-fcray-trapping-math
Generates optimized trap-safe floating point code. This option disables any optimization which would introduce a trap where one did not exist in the source code. The default is -fno-cray-trapping-math.
Linker Options
-ffpe-trap=list
Enable traps at runtime for the specified exceptions. This option accepts a comma separated list of values. If the specified values contradict each other, the last value specified has priority.
This option does not affect compile time optimizations; it detects runtime exceptions. This option is processed only at link time and affects the entire program; it is not processed when compiling subprograms. Therefore, traps may be set using this command line option at the beginning of execution of the main program only. The program may subsequently change these settings by calling intrinsic or library procedures.
The default is -ffpe-trap=none, which means no exceptions are trapped. Possible values include:
none
Disables all traps
invalid
Trap on invalid operation
zero
Trap on divide-by-zero
fp
Trap on zero, invalid, or overflow
inexact
Trap on inexact result (or rounded result). Enabling traps for inexact results is not recommended.
overflow
Trap on overflow (or the result of an operation is too large to be represented)
underflow
Trap on underflow (or the result of an operation is too small to be represented)
denormal
Trap on denormalized operands
Uninitialized Variable Policy Control
Uninitialized variables can be a source of programming errors. Options listed below provide control over how the compiler treats these variables. Separate options for integer and floating-point types exist so that integer variables may be initialized to zero and floating-point variables may be initialized to NaN. Many bit patterns qualify as a NaN; these options use a quiet NaN of all ones because using a repeated byte pattern makes it possible to initialize large arrays using memset. These options apply only to integer and floating-point variables that are not part of structures, because structures could require an arbitrarily complex initialization sequence.
-funinitialized-heap-ints=<uninitialized | zero>
Initializes integer memory allocated by malloc or new to zero. For this option to have any effect, the void pointer returned by malloc must be typecast immediately to a pointer to an integer type because otherwise the compiler does not know how the memory will be used. For example, (int*)malloc(...).
-funinitialized-heap-floats=<uninitialized | nan>
Initializes floating-point memory allocated by malloc or new to a quiet NaN of all ones. For this option to have any effect, the void pointer returned by malloc must be typecast immediately to a pointer to a floating-point type because otherwise the compiler does not know how the memory will be used. For example, (double*)malloc(...).
-funinitialized-stack-ints=<uninitialized | zero>
Initializes stack integer variables to zero. If the -ftrivial-auto-var-init option is present, then it has precedence, and this option does nothing.
-funinitialized-stack-floats=<uninitialized | nan>
Initializes stack floating-point variables to NaN. If the -ftrivial-auto-var-init option is present, then it has precedence, and this option does nothing.
-funinitialized-static-floats=<zero | nan>
Initializes static floating-point variables to NaN.
Unified Parallel C (UPC) Options
CCE Clang options that support UPC include:
-hupc, -hdefault
-hupc configures the compiler driver to expect UPC source code. Source files with a .upc extension are automatically treated as UPC code, but this option permits a file with any other extension (typically .c) to be understood as UPC code. -hdefault cancels this behavior; if both -hupc and -hdefault appear in a command line, whichever appears last takes precedence and applies to all source files in the command line.
-fupc-auto-amo, -fno-upc-auto-amo
Automatically uses network atomics for remote updates to reduce latency. For example, x += 1 can be performed as a remote atomic add. If an update is recognized as local to the current thread, then no atomic is used. These atomics are intended as a performance optimization only and should not be relied upon to prevent race conditions. Enabled at -O1 and above.
-fupc-buffered-async, -fno-upc-buffered-async
Sets aside memory in the UPC runtime library for aggregating random remote accesses designated with #pragma pgas buffered_async. Disabled by default.
-fupc-pattern, -fno-upc-pattern
Identifies simple communication loops and aggregates the remote accesses into a single function call which replaces the loop. Enabled at -O1 and above.
-fupc-threads=<N>
Sets the number of threads for a static THREADS translation. This option causes __UPC_STATIC_THREADS__ to be defined instead of __UPC_DYNAMIC_THREADS__ and replaces all uses of the UPC keyword THREADS with the value <N>.
HIP Support and Options
HIP is supported only for AMD GPU targets and requires an AMD ROCm install for HIP header files and runtime libraries.
Several flags must be specified explicitly to compile and link HIP source files. For example, the following command lines will compile and link a HIP source file targeting an AMD MI250X GPU:
CC --offload-arch=gfx90a --rocm-path=<ROCM-INSTALL-PATH> -c -x hip [options] <filename> ...
CC --rocm-path=<ROCM-INSTALL-PATH> [options] <filename> ...
The following compiler options are relevant for compiling and linking HIP source files:
-x hip
Enable HIP compilation for any input files that appear after this option on the command line. This option should not be used on a link line with object files as input, since CCE will treat the object files as HIP source. The -x none flag can be used to cancel a prior -x hip flag on the link line.
--rocm-path=<ROCM-INSTALL-PATH>
Specifies the location of a ROCm install; used to locate HIP header files and device runtime libraries.
--offload-arch=[gfx908|gfx90a|gfx942]
Specifies the HIP offload target architecture. CCE currently supports gfx908 (AMD MI100), gfx90a (AMD MI250X), and gfx942 (AMD MI300A). This flag can be specified multiple times to produce a fat binary that contains device code for multiple GPUs.
This flag also accepts the LLVM target ID syntax, which is a target processor followed by a colon-delimited list of processor features. Each feature is a predefined string, xnack or sramecc, followed by a plus or minus sign to enable or disable the setting (for example, gfx90a:xnack+ or gfx90a:xnack-). Any unspecified processor features receive a default value of any, which ensures the resulting executable runs correctly on a processor with or without that feature. The xnack processor feature is needed to run with unified memory for AMD GPUs.
--cuda-offload-arch=[gfx908|gfx90a|gfx942]
A synonym for --offload-arch.
-fgpu-rdc, -fno-gpu-rdc
Generates relocatable device code, allowing separate compilation of HIP source files with cross-file references. Compiling with -fgpu-rdc will produce a bundled HIP offload object file that requires linking with --hip-link. Compiling with -fno-gpu-rdc will produce ordinary host object files that do not need to be linked with --hip-link. However, -fno-gpu-rdc requires that all HIP device code in a HIP source file be completely self-contained, without referencing any external user-defined symbols. The default is -fno-gpu-rdc.
--hip-link
Enables device linking for bundled HIP offload object files. This option is required when linking object files compiled with -fgpu-rdc.
-munsafe-fp-atomics, -mno-unsafe-fp-atomics
Enables the use of native floating-point atomic instructions, which are not used by default for AMD MI250X GPUs because they are only safe for coarse-grained memory; floating-point atomic instructions operating on fine-grained memory are silently ignored. In general, memory granularity cannot be determined statically, so by default, the compiler always generates atomic compare-and-swap loops for floating-point atomic operations. (Integer atomic instructions, including atomic compare-and-swap, are safe for any memory granularity.) The -munsafe-fp-atomics compiler flag may be used to enable the generation of native floating-point atomic instructions, but you must ensure that atomic operations do not target fine-grained memory. The default is -mno-unsafe-fp-atomics, which prevents the compiler from generating native floating-point atomic instructions for operations that may target fine-grained memory at runtime.
C and C++ Language Extensions
This chapter describes the language extensions provided by CCE Clang. Some of these extensions are widely implemented in other compilers, while others are unique and specific to HPE Cray systems. Note also that CCE Clang supports regular Clang language extensions.
Performance Extensions
#pragma ivdep
If placed before a for, while, or do while loop, #pragma ivdep causes the compiler to ignore vector dependencies in the loop (including explicit dependencies) when attempting to vectorize the loop. This allows the compiler to vectorize many loops that are potentially unsafe to vectorize.
Reductions within the loop are allowed, except for reductions into global arrays. For example, a[0] += 3 is not allowed if a is a global array.
Even with #pragma ivdep, conditions other than vector dependencies can still inhibit vectorization.
Interoperability
Mixed-language programs that exchange long double data between Fortran and C or C++ object files do not work correctly on x86 targets. CCE Fortran assumes a 64-bit C_LONG_DOUBLE type, whereas Clang uses an 80-bit long double type padded to 128 bits of storage. To assist in making such programs work, the following options are available.
Note that if you are using a non-default long double format, avoid passing the long double data to library functions which expect the default format.
-mlong-double-64
Make the x86 “long double” type equivalent to the “double” type. This type matches CCE Fortran C_DOUBLE or C_LONG_DOUBLE.
-mlong-double-128
Make the x86 “long double” type equivalent to the “__float128” type. This type matches CCE Fortran C_FLOAT128.
-mlong-double-80
Make the x86 “long double” type equivalent to an 80-bit floating-point type that is padded to 128 bits of storage. This option is the default.
Additionally, the following Fortran-related option is relevant to interoperability.
-ffortran-byte-swap-io
Tell the Fortran runtime I/O subsystem to byte-swap input and output files for direct and sequential unformatted I/O. This is a link-time option to be used when linking with CCE Fortran object files.