I’ve tried several traditional linker optimizations on cperl recently. The easiest is LTO, or “Link Time Optimization”, via gcc -flto=4 or clang -flto=thin. This requires the gold linker, and enables multi-threaded link-time optimizations.
For gcc my configure script does
CC=${CC:-ccache gcc}
./Configure -sder -Dcc="$CC" \
-Dld="$CC -fuse-linker-plugin" \
-Accflags="-flto=4 -fuse-ld=gold -msse4.2 -march=native" ...
and for clang
CC=${CC:-ccache clang-7}
./Configure -sder -Dcc="$CC" \
-Dranlib=llvm-ranlib-7 -Dar=llvm-ar-7 -Dfull_ar=/usr/bin/llvm-ar-7 \
-Accflags="-DLTO -flto=thin -msse4.2 -march=native" ...
LTO is efficient because it sees all symbols at once, not separated into objects, so it can do whole-program optimizations, especially inlining and reordering. The impact is about 10%. The goal for v5.29.1c is to make LTO the default for non-DEBUGGING builds.
PGO
PGO, or “Profile Guided Optimization”, is mostly used on non-linux systems without perf, and works with most compilers: ICC, GCC, clang, Oracle Solaris Studio, MSVC, IBM XL C/C++.
First you need to compile your binaries with profiling enabled, on gcc with -fprofile-generate, which generates a .gcno file for each object file (the same file that is used for gcov coverage reports). Then you run a few tests, which record coverage data into .gcda files. Then recompile with -fprofile-use: it compiles with the gathered .gcda coverage data and infers from the branch coverage whether a branch is LIKELY or UNLIKELY.
This is not as efficient as LTO, but does detect most branch misses and is sensitive to your local workflow.
AutoFDO
AutoFDO is the sampling-based profile-guided optimization method on linux via perf, as opposed to -pg-style instrumented profiling without perf.
On most recent kernels you need to compile autofdo yourself:
git clone --recursive https://github.com/google/autofdo
cd autofdo
aclocal -I .; autoheader; autoconf; automake --add-missing -c
./configure --with-llvm=/usr/bin/llvm-config-7
make -s
make -s install
The perl/cperl Makefile already contains two targets for gcc: miniperl.autofdo and perl.autofdo. For clang just replace create_gcov with create_llvm_gcov. Beware that the perl.autofdo target only optimizes a static perl executable, not the shared libperl.so. miniperl is always static, so the impact is easier to measure there. For me it’s about 20%.
The main benefit of AutoFDO over PGO is that you don’t need two binaries: an instrumented profiling one and the resulting optimized one.
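The cycle behind the miniperl.autofdo target looks roughly like this. The perf event and file names are assumptions for illustration, and perf record needs hardware counters, so the steps are only defined here, not executed:

```shell
# Sketch of the AutoFDO cycle (assumed event/file names; requires a
# machine with hardware counters, perf and create_gcov installed).
autofdo_cycle() {
  # 1. sample the already optimized binary under a real workload (LBR records)
  perf record -b -e br_inst_retired.near_taken:u -o perf.data -- ./minibench.sh
  # 2. convert the samples into a gcov-style profile (create_llvm_gcov for clang)
  create_gcov --binary=./miniperl --profile=perf.data --gcov=miniperl.afdo
  # 3. rebuild with the sampled profile
  make miniperl CCFLAGS="-O2 -fauto-profile=miniperl.afdo"
}
echo "autofdo cycle defined"
```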
I haven’t checked yet how far the efforts on other systems have come to get sampling gcov data from an optimized binary. There’s a hwpmc driver effort on FreeBSD, but you still need to link the -lpmc library for hardware-counter support. DTrace now has cpc support, but a DTrace-to-gcov script would be needed; DTrace is far better than the linux-only perf.
prelink
prelink is a post-link optimizer: it fixes up startup relocations. But with PIE, ASLR and constantly updated libraries it’s a bit of a hassle and is not recommended, especially not on 32-bit systems.
BOLT
llvm-bolt is also a post-link optimizer. It analyzes perf traces from a running executable or service, similar to autofdo, and then rewrites the binary to be faster.
I’ve compiled bolt at /usr/src/llvm/llvm-bolt/build, added its bin path to my PATH with export PATH=/usr/src/llvm/llvm-bolt/build/bin:$PATH, and added the -Wl,-q flag to ldflags and lddlflags:
sed -i -e"s/ldflags='/ldflags='-Wl,-q /" config.sh
sed -i -e"s/lddlflags='/lddlflags='-Wl,-q /" config.sh
Sorry, -Alddlflags is still unusable, hence the sed edits of config.sh.
minibench.sh is the script created by the miniperl.autofdo target from above. Using bolt:
perf record -e cycles:u -j any,u -o bolt.data -- ./minibench.sh
perf2bolt -p bolt.data -o bolt.fdata miniperl
llvm-bolt miniperl -o miniperl.bolt -data=bolt.fdata -reorder-blocks=cache+ \
  -reorder-functions=hfsort+ -split-functions=3 -split-all-cold -split-eh -dyno-stats
=>
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: first alloc address is 0x400000
BOLT-INFO: creating new program header table at address 0x800000, offset 0x400000
BOLT-INFO: enabling relocation mode
BOLT-INFO: 640 functions out of 1602 simple functions (40.0%) have non-empty execution profile.
BOLT-INFO: 60 non-simple function(s) have profile.
BOLT-INFO: profile for 1 objects was ignored
BOLT-INFO: the input contains 702 (dynamic count : 1471) missed opportunities for macro-fusion
optimization. Will fix instances on a hot path.
BOLT-INFO: removed 2 'repz' prefixes with estimated execution count of 0 times.
BOLT-INFO: basic block reordering modified layout of 589 (32.05%) functions
BOLT-INFO: UCE removed 0 blocks and 0 bytes of code.
BOLT-INFO: running hfsort+ for 642 functions
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:
195150 : executed forward branches
74347 : taken forward branches
30966 : executed backward branches
20082 : taken backward branches
29596 : executed unconditional branches
32611 : all function calls
9659 : indirect calls
5177 : PLT calls
1501423 : executed instructions
401528 : executed load instructions
235761 : executed store instructions
7071 : taken jump table branches
255712 : total branches
124025 : taken branches
131687 : non-taken conditional branches
94429 : taken conditional branches
226116 : all conditional branches
185751 : executed forward branches (-4.8%)
13694 : taken forward branches (-81.6%)
40365 : executed backward branches (+30.4%)
20453 : taken backward branches (+1.8%)
10158 : executed unconditional branches (-65.7%)
32611 : all function calls (=)
9659 : indirect calls (=)
5177 : PLT calls (=)
1478793 : executed instructions (-1.5%)
401528 : executed load instructions (=)
235761 : executed store instructions (=)
7071 : taken jump table branches (=)
236274 : total branches (-7.6%)
44305 : taken branches (-64.3%)
191969 : non-taken conditional branches (+45.8%)
34147 : taken conditional branches (-63.8%)
226116 : all conditional branches (=)
BOLT-INFO: SCTC: patched 27 tail calls (27 forward)
tail calls (0 backward) from a total of 27 while removing 0 double jumps and removing
14 basic blocks totalling 70 bytes of code. CTCs total execution count is 2 and the
number of times CTCs are taken is 0.
BOLT-INFO: setting _end to 0xe040b8
Before - After: perf stat -r2 ./minibench.sh
=>
# before bolt
# started on Mon Oct 22 17:53:16 2018
Performance counter stats for './minibench.sh' (2 runs):
3538.885152 task-clock (msec) # 0.777 CPUs utilized ( +- 0.07% )
2,024 context-switches # 0.572 K/sec ( +- 0.47% )
301 cpu-migrations # 0.085 K/sec ( +- 5.65% )
278,208 page-faults # 0.079 M/sec ( +- 0.00% )
10,721,866,060 cycles # 3.030 GHz ( +- 0.06% )
5,875,656,960 stalled-cycles-frontend # 54.80% frontend cycles idle ( +- 0.10% )
4,442,046,984 stalled-cycles-backend # 41.43% backend cycles idle ( +- 0.09% )
10,623,986,253 instructions # 0.99 insn per cycle
# 0.55 stalled cycles per insn ( +- 0.00% )
2,225,877,429 branches # 628.977 M/sec ( +- 0.00% )
86,592,270 branch-misses # 3.89% of all branches ( +- 0.07% )
4.55333 +- 0.00405 seconds time elapsed ( +- 0.09% )
# after bolt
# started on Mon Oct 22 17:55:46 2018
Performance counter stats for './minibench.sh' (2 runs):
3167.229126 task-clock (msec) # 0.757 CPUs utilized ( +- 0.12% )
2,006 context-switches # 0.633 K/sec ( +- 0.42% )
293 cpu-migrations # 0.093 K/sec ( +- 3.75% )
279,450 page-faults # 0.088 M/sec ( +- 0.03% )
9,566,795,439 cycles # 3.021 GHz ( +- 0.00% )
4,770,308,584 stalled-cycles-frontend # 49.86% frontend cycles idle ( +- 0.04% )
3,558,581,947 stalled-cycles-backend # 37.20% backend cycles idle ( +- 0.01% )
10,580,856,372 instructions # 1.11 insn per cycle
# 0.45 stalled cycles per insn ( +- 0.03% )
2,143,510,267 branches # 676.778 M/sec ( +- 0.04% )
92,102,204 branch-misses # 4.30% of all branches ( +- 0.06% )
4.18226 +- 0.00375 seconds time elapsed ( +- 0.09% )
That’s an impact of 9% on the already link-time-optimized executable. Nice!
Note that the generated bolt.fdata is readable and can point you to missed manual optimizations in the source code.
Static linker optimizations
Now you may ask how to get all these linker optimizations into the source code and the build environment, to avoid all the extra compiler and linker optimization steps.
The first problem is adding missing or wrong LIKELY/UNLIKELY hints to the branches in the sources. I’m not aware of any automatic tool yet which can display, from the profile data or a gcov file, the location of a branch hit or miss. But the source code is pretty well annotated already. perf itself gives a few hints in its UI.
The next problem is how to get the LTO optimizations into the Makefiles. We don’t need a linker script: re-arranging the order of the objects, or simply appending all sources into one big C file and compiling only that, would accomplish it. But perl is a bit too large, the compiler would consume too much memory, so I went for optimizing the object link order first.
There are two problems: first permute the objects themselves, and second rearrange the functions inside the objects. Measuring all permutations of the 38 objects would take several years; perl cannot even represent 38! as a 64-bit integer for a loop counter. So I tried a guided search over the most promising permutations, from back to front, stopping subsequent permutations at the back when the generated file is slower than before. Something like this:
use Time::HiRes qw(gettimeofday tv_interval);

my @o = <*.o>;
my $best = 9e9;
compile(\@o);
my $curr = bench(0);
$best = $curr;
my $p = $#o;
for my $i (0..5000) {
    $p = permute(\@o, $p, $curr);
}

sub permute { # adapted from perlfaq4
    my ($a, $p, $curr) = @_;
    my @idx = 0..$#{$a};
    my $new = $curr;
    while ($new <= $curr) { # faster?
        $curr = $new;
        --$p while $idx[$p-1] > $idx[$p];
        my $q = $p or return $p;
        push @idx, reverse splice @idx, $p;
        ++$q while $idx[$p-1] > $idx[$q];
        @idx[$p-1,$q] = @idx[$q,$p-1];
        my $tmp = $a->[$p-1];
        $a->[$p-1] = $a->[$q];
        $a->[$q] = $tmp;
        compile($a);
        $new = bench($p);
        print "# ", join(" ", map { substr($_,0,-2) } @$a), "\n";
    }
    $p;
}

sub compile {
    my @o = @{$_[0]};
    if ($^O eq 'linux') {
        system("gcc -fstack-protector -L/usr/local/lib -L/opt/local/lib ".
               "-o miniperl @o -lpthread -lnsl -ldl -lm -lcrypt -lutil -lc");
    } elsif ($^O eq 'darwin') {
        system("cc -mmacosx-version-min=10.11 -fstack-protector -L/usr/local/lib ".
               "-L/opt/local/lib -force_flat_namespace -o miniperl @o -lpthread -ldl -lm -lutil -lc");
    }
}

sub bench {
    my ($i) = @_;
    system("./miniperl t/opbasic/arith.t >/dev/null 2>/dev/null"); # warmup
    my $t0 = [gettimeofday];
    system("./minibench.sh");
    my $elapsed = tv_interval($t0);
    print "$i $elapsed\n";
    if ($elapsed < $best) {
        $best = $elapsed;
        print "# RECORD:\n";
    }
    $elapsed;
}
This revealed that the current object layout is suboptimal: several miniperl-specific objects need to be spliced into the common_objs.
The best order on linux is:
gv | perlmini miniperlmain | perly toke | opmini | av pad sv hv pp_hot run
pp_ctl pp_type | ppmini | scope pp_sys regcomp mg doop util doio keywords
| xsutilsmini | utf8 regexec universal perlapi globals perlio pp_sort pp_pack
numeric reentr mathoms locale dquote mro_core taint time64 caretx dump deb
The old order was:
miniperlmain opmini perlmini xsutilsmini ppmini |
gv toke perly pad regcomp dump util mg reentr mro_core keywords
hv av run pp_hot sv pp_type scope pp_ctl pp_sys
doop doio regexec utf8 taint deb universal globals perlio perlapi
numeric mathoms locale pp_pack pp_sort caretx dquote time64
You see the pattern: the earliest and most important objects come first, gv being the slowest and most-used part, then the startup, the parser, the compiler, the data, then the run-time, and finally the helpers. Calls between nearby functions are faster, since static relative offsets can be shorter. Note that cperl also separated xsutils and pp into mini parts, because of the FFI.
The performance of the various orders varied wildly: from 3.031298s to 3.990619s, i.e. up to 31%. See https://gist.github.com/rurban/67cb4b4046a3538837b6c2aade18ba2f for the linkorder script and the result log.
Further reading
Also relevant is the size of the environment, for proper stack alignment, but we cannot control that. It’s over the 2% noise rate though. “Producing Wrong Data Without Doing Anything Obviously Wrong!” has more info about that. It also discusses why -O2 was sometimes faster than -O3, and presents surprising results with perlbench, with -O3, LTO (the linkorder) and the env size.
Martin Liška’s thesis Optimizing large applications discusses several of those strategies.