ATP - Selecting and reading core dumps
After displaying a summary of the job status, ATP writes selected core files and a graph visualization of the complete stack trace tree. Why only selected core files? Core files can be quite large and take a long time to write to disk. If two ranks have exactly the same stack trace, ATP selects only the first of the two to produce a core file. By default, ATP writes at most 20 core files; this limit can be changed by setting ATP_MAX_CORES.
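The selection rule described above can be illustrated with a small sketch. This is not ATP's actual implementation: the function name select_core_ranks, the representation of each rank's stack trace as a string, and the max_cores parameter (standing in for the ATP_MAX_CORES limit) are all assumptions made for illustration.

```c
#include <string.h>

/* Hypothetical sketch, not ATP source code.
 * traces[r] is an assumed string form of rank r's stack trace.
 * For each distinct trace, the first rank that has it is stored in
 * selected[]; selection stops once max_cores ranks have been chosen.
 * Returns the number of ranks selected. */
int select_core_ranks(const char *const traces[], int nranks,
                      int max_cores, int selected[])
{
    int count = 0;

    for (int rank = 0; rank < nranks && count < max_cores; rank++) {
        int seen = 0;

        /* Has an already-selected rank produced the same trace? */
        for (int j = 0; j < count; j++) {
            if (strcmp(traces[selected[j]], traces[rank]) == 0) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            selected[count++] = rank;
    }
    return count;
}
```

With two ranks that share one trace, this sketch selects only rank 0, which is consistent with the single core file written for our two-rank job.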
Referring back to our crashing job’s ATP output, we can see that rank 0 was selected to dump core, and that its core file was written to /home/users/adangelo.
Producing core dumps for ranks 0
1 core written in /home/users/adangelo
That core file is named core.atp.1034636.0.0.9550. We will start GDB in core-analysis mode, passing it the executable and the core file, to examine the program at the time of its crash.
gdb a.out core.atp.1034636.0.0.9550
Next, in GDB, let’s take a look at the backtrace:
(gdb) bt
#0 0x00007f689916acb9 in raise () from /lib64/libc.so.6
#1 0x00007f689ccecb4c in (anonymous namespace)::dumpCallback (descriptor=...,
succeeded=<optimized out>) at libAtpSigHandler.cpp:922
#2 0x00007f689cceffa9 in google_breakpad::ExceptionHandler::GenerateDump (
context=<optimized out>, this=0x223680)
at src/client/linux/handler/exception_handler.cc:602
#3 google_breakpad::ExceptionHandler::GenerateDump (this=0x223680,
context=<optimized out>) at src/client/linux/handler/exception_handler.cc:534
#4 0x00007f689ccf02d8 in google_breakpad::ExceptionHandler::SignalHandler (sig=11,
info=0x2282f0, uc=0x2281c0) at src/client/linux/handler/exception_handler.cc:412
#5 <signal handler called>
#6 0x00007f689929dd67 in __strlen_avx2 () from /lib64/libc.so.6
#7 0x00007f6899199b1b in printf_positional () from /lib64/libc.so.6
#8 0x00007f689919c16d in __vfprintf_internal () from /lib64/libc.so.6
#9 0x00007f6899187608 in printf () from /lib64/libc.so.6
#10 0x0000000000201b38 in main (argc=1, argv=0x7ffd1b562398) at crash.c:17
Because of the way ATP generates core files, the backtrace here differs slightly from the aggregated backtrace reported initially: the frames above <signal handler called> come from ATP’s signal handler. However, we can clearly see the frame from our own code, frame 10, at the same location, crash.c:17. Let’s select it.
(gdb) f 10
#10 0x0000000000201b38 in main (argc=1, argv=0x7ffd1b562398) at crash.c:17
17 printf("%s ", argv[i]);
Now that we are working inside frame 10, we can view the state of local variables at the time of the crash.
(gdb) p argv
$1 = (char **) 0x7ffd1b562398
The raw memory address of argv doesn’t tell us much. Let’s look at its first element.
(gdb) p argv[0]
$2 = 0x7ffd1b564f66 "/home/users/adangelo/./crash"
argv seems valid. What about i?
(gdb) p i
$3 = 368
(gdb) p argc
$4 = 1
Well, we certainly didn’t pass 368 arguments to our program! argc is only 1, so i has run far past the last valid index by the time line 17 crashes. Let’s refer back to our original source code.
for (int i = 0; i < argv; i++) {
    printf("%s ", argv[i]);
}
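Now we can see the problem: the loop is set to run while i < argv. Because argv is a pointer, i is being compared against a memory address (a very large number) rather than the argument count, so the loop keeps running long after the last argument. The condition should be changed to i < argc, so that all indices into argv are valid.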
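As a minimal sketch of the fix, the corrected loop might look like the following. The full source of crash.c is not shown here, so the helper name print_args and its return value are assumptions made to keep the example self-contained; only the loop condition reflects the change described above.

```c
#include <stdio.h>

/* Hypothetical helper, not the original crash.c.
 * Prints each argument followed by a space, then a newline.
 * The loop bound is argc (the argument count), not argv (a pointer),
 * so every index used with argv is valid.
 * Returns the number of arguments printed. */
int print_args(FILE *out, int argc, char *argv[])
{
    int printed = 0;

    for (int i = 0; i < argc; i++) {
        fprintf(out, "%s ", argv[i]);
        printed++;
    }
    fputc('\n', out);
    return printed;
}
```

Calling print_args(stdout, argc, argv) from main would print the arguments of the current process. Building with compiler warnings enabled is also worthwhile here, since compilers typically warn about a comparison between an integer and a pointer such as i < argv.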
After fixing our source code, rebuild and retest the application.
ATP_ENABLED=1 srun -n2 ./crash arg1 arg2
Arguments for rank 0: /home/users/adangelo/./crash arg1 arg2
Arguments for rank 1: /home/users/adangelo/./crash arg1 arg2