How can I pinpoint if the slowness in my program is a CPU cache issue (on Linux)?


I'm trying to understand some very strange behavior in one of my C programs. Apparently, adding or removing a seemingly inconsequential line at the end of it drastically affects the performance of the rest of the program.

My program looks a bit like this:

    #include <stdio.h>

    int large_buffer[10000];

    void compute(FILE *input) {
        for (int i = 0; i < 100; i++) {
            do_lots_of_stuff();
            printf(".");
            fflush(stdout);
        }
    }

    int main() {
        FILE *input = fopen("input.txt", "r");
        compute(input);
        fclose(input); // <--- gets 2x slower if I comment this out (!)
        return 0;
    }

In theory, the fclose(input) line at the end of the main function should not matter, since the OS automatically closes the file at the end of the program anyway. However, I observed that my program took 2.5 seconds to run when I included the fclose statement and 5 seconds when I commented it out. A factor of 2 difference! And it is not due to latency at the start or end of the program: the speed at which the . is printed out is visibly faster in the version with the fclose statement.

I suspect this might be a memory-alignment or cache-miss issue. If I replace fclose with another function such as ftell it also takes 5 s to run, and if I reduce the size of large_buffer to <= 8000 elements it always runs in 2.5 seconds, regardless of whether the fclose statement is there or not.

But I would like to be 100% sure of the culprit behind this strange behavior. Is it possible to run my program under some sort of profiler or other tool that would give me that information? So far I have tried running both versions under valgrind --tool=cachegrind, but it reported the same amount of cache misses (0%) for both versions of the program.


Edit 1: After running both versions of my program under perf stat -d -d -d I got the following results:

     Performance counter stats for './no-fclose examples/bench.o':

           5625.535086      task-clock (msec)         #    1.000 CPUs utilized
                     38      context-switches          #    0.007 K/sec
                      0      cpu-migrations            #    0.000 K/sec
                     54      page-faults               #    0.010 K/sec
         17,851,853,580      cycles                    #    3.173 GHz                      (53.23%)
          6,421,955,412      stalled-cycles-frontend   #   35.97% frontend cycles idle     (53.23%)
          4,919,383,925      stalled-cycles-backend    #   27.56% backend cycles idle      (53.23%)
         13,294,878,129      instructions              #    0.74  insn per cycle
                                                       #    0.48  stalled cycles per insn  (59.91%)
          3,178,485,061      branches                  #  565.010 M/sec                    (59.91%)
            440,171,927      branch-misses             #   13.85% of all branches          (59.92%)
          4,778,577,556      L1-dcache-loads           #  849.444 M/sec                    (60.19%)
                125,313      L1-dcache-load-misses     #    0.00% of all L1-dcache hits    (60.22%)
                 12,110      LLC-loads                 #    0.002 M/sec                    (60.25%)
        <not supported>      LLC-load-misses
        <not supported>      L1-icache-loads
             20,196,491      L1-icache-load-misses                                         (60.22%)
          4,793,012,927      dTLB-loads                #  852.010 M/sec                    (60.18%)
                    683      dTLB-load-misses          #    0.00% of all dTLB cache hits   (60.13%)
                  3,443      iTLB-loads                #    0.612 K/sec                    (53.38%)
                     90      iTLB-load-misses          #    2.61% of all iTLB cache hits   (53.31%)
        <not supported>      L1-dcache-prefetches
                 51,382      L1-dcache-prefetch-misses #    0.009 M/sec                    (53.24%)

            5.627225926 seconds time elapsed

     Performance counter stats for './yes-fclose examples/bench.o':

           2652.609254      task-clock (msec)         #    1.000 CPUs utilized
                     15      context-switches          #    0.006 K/sec
                      0      cpu-migrations            #    0.000 K/sec
                     57      page-faults               #    0.021 K/sec
          8,277,447,108      cycles                    #    3.120 GHz                      (53.39%)
          2,453,171,903      stalled-cycles-frontend   #   29.64% frontend cycles idle     (53.46%)
          1,235,728,409      stalled-cycles-backend    #   14.93% backend cycles idle      (53.53%)
         13,296,127,857      instructions              #    1.61  insn per cycle
                                                       #    0.18  stalled cycles per insn  (60.20%)
          3,177,698,785      branches                  # 1197.952 M/sec                    (60.20%)
             71,034,122      branch-misses             #    2.24% of all branches          (60.20%)
          4,790,733,157      L1-dcache-loads           # 1806.046 M/sec                    (60.20%)
                 74,908      L1-dcache-load-misses     #    0.00% of all L1-dcache hits    (60.20%)
                 15,289      LLC-loads                 #    0.006 M/sec                    (60.19%)
        <not supported>      LLC-load-misses
        <not supported>      L1-icache-loads
                140,750      L1-icache-load-misses                                         (60.08%)
          4,792,716,217      dTLB-loads                # 1806.793 M/sec                    (59.93%)
                  1,010      dTLB-load-misses          #    0.00% of all dTLB cache hits   (59.78%)
                    113      iTLB-loads                #    0.043 K/sec                    (53.12%)
                    167      iTLB-load-misses          #  147.79% of all iTLB cache hits   (53.44%)
        <not supported>      L1-dcache-prefetches
                 29,744      L1-dcache-prefetch-misses #    0.011 M/sec                    (53.36%)

            2.653584624 seconds time elapsed

It looks like there were few data-cache misses in both cases, just as kcachegrind had reported, but the slower version of the program had worse branch prediction and more instruction cache misses and iTLB loads. Which of these differences could be responsible for the 2x difference in runtime between the test cases?


Edit 2: Fun fact, apparently I can still keep the strange behavior if I replace the "fclose" call with a single nop instruction.


Edit 3: My processor is an Intel i5-2310 (Sandy Bridge).


Edit 4: It turns out that if I resize the arrays by editing the assembly file, it does not get faster. Apparently the reason it got faster when I changed their sizes in the C code is that gcc decided to rearrange the order of stuff in the binary.


Edit 5: More evidence that what matters is the precise addresses of the jmp instructions: if I add a single nop (instead of a printf) at the start of the code it gets faster. Similarly, if I remove a useless instruction from the start of the code it gets faster. And when I compiled the code on a different version of gcc it also got faster, despite the generated assembly code being the same. The only differences were debug info at the start and that the sections of the binary file were in a different order.

The key metric

Your culprit is branch misses:

440,171,927      branch-misses             #   13.85% of branches 

vs.

71,034,122      branch-misses             #    2.24% of branches 

I'm not sure which processor you're running, but if you were hypothetically running a 2.5 GHz processor on Haswell, you'd see that each branch prediction miss costs around 15 cycles (usually a bit more because other stuff stalls too), and each cycle is 0.4 ns. So, 0.4 ns/cycle * 15 cycles/missed branch * (440,171,927 - 71,034,122) branch misses = 2.2 seconds. It will depend on your exact processor and the machine code, but this explains the bulk of the difference right there.
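To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch. The 15-cycle penalty and 2.5 GHz clock are assumptions; plug in the numbers for your own CPU. The branch-miss counts are taken from the two perf runs above.

    #include <stdio.h>

    int main(void) {
        double ns_per_cycle    = 1.0 / 2.5;   /* 2.5 GHz -> 0.4 ns per cycle (assumed) */
        double cycles_per_miss = 15.0;        /* typical misprediction penalty (assumed) */
        double extra_misses    = 440171927.0 - 71034122.0;

        double seconds_lost = ns_per_cycle * cycles_per_miss * extra_misses * 1e-9;
        printf("estimated time lost to extra mispredictions: %.2f s\n", seconds_lost);  /* ~2.2 s */
        return 0;
    }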

Cause

The branch prediction algorithms of different chips are proprietary, but if you do some research here ( http://www.agner.org/optimize/microarchitecture.pdf ) you can learn more about the different processors and their limitations. Essentially, what you'll find is that certain processors do a better job of avoiding collisions in the branch prediction tables they store to predict whether branches are taken or not.

So, why is that relevant? Well, what has happened (99% chance) is that by rearranging your program very slightly, you change where exactly the different branches are in terms of memory locations. There are too many that map to the same bucket in the branch prediction tables of the processor. By changing the executable slightly, this problem goes away. You have to have a very specific distance between the two branch points to trigger this issue. You have unluckily managed to do exactly that.
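To illustrate the idea, here is a purely hypothetical sketch: the real indexing function and table size are proprietary, but the point is that only some low-order address bits select the table entry, so two branches at an unlucky distance apart share (and fight over) the same slot.

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical branch-prediction table: indexed by a handful of low-order
       address bits, so branches whose addresses differ by a multiple of
       16 * BTB_ENTRIES bytes alias to the same entry. */
    #define BTB_ENTRIES 4096

    static unsigned btb_index(uintptr_t branch_addr) {
        return (unsigned)((branch_addr >> 4) & (BTB_ENTRIES - 1));
    }

    int main(void) {
        uintptr_t branch_a = 0x400d20;                     /* made-up address  */
        uintptr_t branch_b = branch_a + 16 * BTB_ENTRIES;  /* aliases branch_a */
        uintptr_t branch_c = branch_a + 16;                /* does not alias   */

        printf("a -> %u, b -> %u, c -> %u\n",
               btb_index(branch_a), btb_index(branch_b), btb_index(branch_c));
        return 0;
    }

Shifting either branch by a few bytes changes its index and the collision disappears, which is exactly the kind of sensitivity to code addresses you observed in edit 5.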

Simple solution

As you found, many changes will in fact cause the program to not hit this degraded performance. Essentially, anything that changes the distance between the two critical branch points will fix the problem. You can accomplish this by literally inserting 16 bytes (or enough to move the branch points to different buckets) of machine code nops somewhere between the two places. You can also do what you did and put in some change that disrupts the distance to a non-pathological case.
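A minimal sketch of the nop-padding trick with GCC/Clang inline assembly (the count of 16 is arbitrary; the right amount depends on where the colliding branches sit):

    #include <stdio.h>

    /* Emits n one-byte nops, shifting the addresses of everything the compiler
       places after this point. Behavior is unchanged; only the code layout moves. */
    #define PAD_NOPS(n) __asm__ __volatile__ (".rept " #n "\n\tnop\n\t.endr")

    int main(void) {
        PAD_NOPS(16);
        puts("same program, different branch addresses");
        return 0;
    }

GCC's alignment flags (-falign-functions, -falign-jumps, -falign-labels) are another, less manual way to nudge branch addresses around.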

Digging deeper

If you want to truly understand what causes this in your situation, you'll need to get your hands dirty. Fun! You will need to pick one of many tools to find the specific branch that is being mispredicted. Here is one way: How to measure mispredictions for a single branch on Linux?
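For example, with the binaries from your perf runs above, something along these lines (exact event availability depends on your kernel and CPU) will point at the offending functions and instructions:

    perf record -e branch-misses ./no-fclose examples/bench.o
    perf report     # which functions the mispredictions are attributed to
    perf annotate   # drill down to the individual branch instructions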

After you identify the mispredicted branch, you can figure out if there is a way to remove the branch (almost always a good idea for performance anyway). If not, you can place a hint or, worst case, just move things around to ensure that the same entry isn't shared, as suggested above.
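As one example of a hint, GCC's __builtin_expect tells the compiler which way a branch usually goes, so it can lay out the likely path as the fall-through (the condition below is made up for illustration):

    #include <stdio.h>

    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    static int process(int value) {
        if (unlikely(value < 0)) {   /* rare error path, kept off the hot layout */
            return -1;
        }
        return value * 2;
    }

    int main(void) {
        printf("%d\n", process(21));
        return 0;
    }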

Broader lessons

Programmers underestimate the cost of branching (when the compiler is unable to remove the branches at compile time). As you've discovered, each branch puts more strain on the branch prediction buffers of the processor and increases the chance of mispredictions. So, even branches that are 100% predictable to the processor can degrade performance by reducing the resources available for predicting other branches. Further, when a branch is mispredicted, it costs a minimum of 12-15 cycles and can be many more if the needed instructions aren't in the L1 cache, L2 cache, L3 cache, or heaven help you, main memory. Additionally, the compiler can't make optimizations across branches.
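For instance, a data-dependent branch that mispredicts on random input can often be rewritten so the compiler emits a conditional move or plain arithmetic instead of a conditional jump. A sketch (the threshold and data are made up; whether the branch disappears depends on the compiler and optimization level):

    #include <stddef.h>

    /* Mispredicts often when the data is random. */
    long sum_branchy(const int *a, size_t n) {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (a[i] >= 128)
                sum += a[i];
        }
        return sum;
    }

    /* Same result, but usually compiled without a conditional jump. */
    long sum_branchless(const int *a, size_t n) {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            sum += (a[i] >= 128) ? a[i] : 0;
        }
        return sum;
    }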

