I'm trying to understand some very strange behavior in one of my C programs. Apparently, adding or removing a seemingly inconsequential line at the end of it drastically affects the performance of the rest of the program.
My program looks a bit like this:
    int large_buffer[10000];

    void compute(FILE * input) {
        for (int i = 0; i < 100; i++) {
            do_lots_of_stuff();
            printf(".");
            fflush(stdout);
        }
    }

    int main() {
        FILE *input = fopen("input.txt", "r");
        compute(input);
        fclose(input); // <--- gets 2x slower if I comment this out (!)
        return 0;
    }
In theory, the fclose(input) line at the end of the main function should not matter, since the OS automatically closes the file at the end of the program anyway. However, I observed that my program took 2.5 seconds to run when I included the fclose statement and 5 seconds when I commented it out. A factor of 2 difference! And it is not due to latency at the start or end of the program: the speed at which the . characters are printed out is visibly faster in the version with the fclose statement.
I suspect this might be a memory alignment or cache miss issue. If I replace fclose with another function such as ftell, it also takes 5 seconds to run, and if I reduce the size of large_buffer to <= 8000 elements then it always runs in 2.5 seconds, regardless of the fclose statement being there or not.
But I would like to be 100% sure of the culprit behind this strange behavior. Would it be possible to run my program under some sort of profiler or other tool that would give me that information? So far I tried running both versions under valgrind --tool=cachegrind, but it reported the same amount of cache misses (0%) for both versions of the program.
Edit 1: After running both versions of my program under perf stat -d -d -d I got the following results:
     Performance counter stats for './no-fclose examples/bench.o':

           5625.535086      task-clock (msec)         #    1.000 CPUs utilized
                    38      context-switches          #    0.007 K/sec
                     0      cpu-migrations            #    0.000 K/sec
                    54      page-faults               #    0.010 K/sec
        17,851,853,580      cycles                    #    3.173 GHz                      (53.23%)
         6,421,955,412      stalled-cycles-frontend   #   35.97% frontend cycles idle     (53.23%)
         4,919,383,925      stalled-cycles-backend    #   27.56% backend cycles idle      (53.23%)
        13,294,878,129      instructions              #    0.74  insn per cycle
                                                      #    0.48  stalled cycles per insn  (59.91%)
         3,178,485,061      branches                  #  565.010 M/sec                    (59.91%)
           440,171,927      branch-misses             #   13.85% of all branches          (59.92%)
         4,778,577,556      L1-dcache-loads           #  849.444 M/sec                    (60.19%)
               125,313      L1-dcache-load-misses     #    0.00% of all L1-dcache hits    (60.22%)
                12,110      LLC-loads                 #    0.002 M/sec                    (60.25%)
       <not supported>      LLC-load-misses
       <not supported>      L1-icache-loads
            20,196,491      L1-icache-load-misses                                         (60.22%)
         4,793,012,927      dTLB-loads                #  852.010 M/sec                    (60.18%)
                   683      dTLB-load-misses          #    0.00% of all dTLB cache hits   (60.13%)
                 3,443      iTLB-loads                #    0.612 K/sec                    (53.38%)
                    90      iTLB-load-misses          #    2.61% of all iTLB cache hits   (53.31%)
       <not supported>      L1-dcache-prefetches
                51,382      L1-dcache-prefetch-misses #    0.009 M/sec                    (53.24%)

           5.627225926 seconds time elapsed
     Performance counter stats for './yes-fclose examples/bench.o':

           2652.609254      task-clock (msec)         #    1.000 CPUs utilized
                    15      context-switches          #    0.006 K/sec
                     0      cpu-migrations            #    0.000 K/sec
                    57      page-faults               #    0.021 K/sec
         8,277,447,108      cycles                    #    3.120 GHz                      (53.39%)
         2,453,171,903      stalled-cycles-frontend   #   29.64% frontend cycles idle     (53.46%)
         1,235,728,409      stalled-cycles-backend    #   14.93% backend cycles idle      (53.53%)
        13,296,127,857      instructions              #    1.61  insn per cycle
                                                      #    0.18  stalled cycles per insn  (60.20%)
         3,177,698,785      branches                  # 1197.952 M/sec                    (60.20%)
            71,034,122      branch-misses             #    2.24% of all branches          (60.20%)
         4,790,733,157      L1-dcache-loads           # 1806.046 M/sec                    (60.20%)
                74,908      L1-dcache-load-misses     #    0.00% of all L1-dcache hits    (60.20%)
                15,289      LLC-loads                 #    0.006 M/sec                    (60.19%)
       <not supported>      LLC-load-misses
       <not supported>      L1-icache-loads
               140,750      L1-icache-load-misses                                         (60.08%)
         4,792,716,217      dTLB-loads                # 1806.793 M/sec                    (59.93%)
                 1,010      dTLB-load-misses          #    0.00% of all dTLB cache hits   (59.78%)
                   113      iTLB-loads                #    0.043 K/sec                    (53.12%)
                   167      iTLB-load-misses          #  147.79% of all iTLB cache hits   (53.44%)
       <not supported>      L1-dcache-prefetches
                29,744      L1-dcache-prefetch-misses #    0.011 M/sec                    (53.36%)

           2.653584624 seconds time elapsed
It looks like there were few data-cache misses in both cases, just like kcachegrind had reported, but the slower version of the program had worse branch prediction and more instruction cache misses and iTLB loads. Which of these differences would be the most likely to be responsible for the 2x difference in runtime between the test cases?
Edit 2: Fun fact: apparently I can still keep the strange behavior if I replace the "fclose" call by a single NOP instruction.
Edit 3: My processor is an Intel i5-2310 (Sandy Bridge).
Edit 4: It turns out that if I resize the arrays by editing the assembly file it does not get faster. Apparently the reason it got faster when I changed their sizes in the C code is that gcc decided to rearrange the order of things in the binary.
Edit 5: More evidence that what matters is the precise addresses of the jmp instructions: if I add a single NOP (instead of a printf) at the start of the code it gets faster. Similarly, if I remove a useless instruction from the start of the code it also gets faster. And when I compiled my code on a different version of gcc it also got faster, despite the generated assembly code being the same. The only differences were the debug info at the start and that the sections of the binary file were in a different order.
The key metric
Your culprit is branch misses:
440,171,927 branch-misses # 13.85% of branches
vs.
71,034,122 branch-misses # 2.24% of branches
I'm not sure which processor you're running, but if you were hypothetically running a 2.5 GHz processor on Haswell, you'd see that each branch prediction miss costs around 15 cycles (usually a bit more because other stuff stalls too), and each cycle is 0.4 ns. So, 0.4 ns/cycle * 15 cycles/missed branch * (440,171,927 - 71,034,122) branch misses = 2.2 seconds. It will depend on your exact processor and the machine code, but this explains the bulk of the difference right there.
Cause
The branch prediction algorithms of the different chips are proprietary, but if you do some research here (http://www.agner.org/optimize/microarchitecture.pdf) you can learn more about the different processors and their limitations. Essentially, you'll find that certain processors do a better job of avoiding collisions in the branch prediction tables that they store in order to predict whether branches are taken or not.
So, why is that relevant? Well, what has happened (99% chance) is that by rearranging your program very slightly, you change exactly where different branches sit in terms of memory locations. There are too many that map to the same bucket in the branch prediction tables for the processor. By changing the executable slightly, this problem goes away. You have to have a very specific distance between the two branch points to trigger this issue. You have unluckily managed to do just that.
Simple solution
As you found, many changes will in fact cause the program not to hit this degraded performance. Essentially, anything that changes the distance between the two critical branch points will fix the problem. You can accomplish this by literally inserting 16 bytes (or enough to move the branch points to different buckets) of machine code nops somewhere between the two places. You can also do as you did and change something that disrupts this distance to a non-pathological case.
Digging deeper
If you want to truly understand what causes it in this situation, you'll need to get your hands dirty. Fun! You will need to select one of the many tools that can find the specific branch being mispredicted. Here is one way: How to measure mispredictions for a single branch on Linux?
After you identify the mispredicted branch, you can figure out whether there is a way to remove the branch (almost always a good idea for performance anyway). If not, you can place a hint for it or, worst case, just move things around to ensure that the same entry isn't shared, as suggested above.
Broader lessons
Programmers underestimate the cost of branching (when the compiler is unable to remove the branches at compile time). As you've discovered, each branch puts more strain on the branch prediction buffers of the processor and increases the chance of mispredictions. So, even branches that are 100% predictable to the processor can degrade performance by reducing the resources available for predicting other branches. Further, when a branch is mispredicted, it costs a minimum of 12-15 cycles and can be much more if the needed instructions aren't in the L1 cache, L2 cache, L3 cache, or, heaven help you, main memory. Additionally, the compiler can't make all optimizations across branches.