C - About the combination of OpenMP and -Ofast


I implemented OpenMP parallelization in a loop that computes a sum and is the principal cause of the code's slowness. When I did so, I found that the final results were not the same as those obtained with the non-parallelized code (which is written in C). At first one might think "well, you just didn't implement the parallelization correctly", but the curious thing is that when I run the parallelized code with -Ofast optimization, the results are correct.

That is:

  • -O0: correct
  • -Ofast: correct
  • OMP + -O0: wrong
  • OMP + -O1: wrong
  • OMP + -O2: wrong
  • OMP + -O3: wrong
  • OMP + -Ofast: correct!

What is -Ofast doing that solves the error that appears when I implement OpenMP? Any recommendation of things to check or test? Thanks!

Edit: here I include the smallest version of the code that still reproduces the problem.

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>

#define length 100
#define R 50.0
#define urd 1.0/sqrt(2.0)
#define pi (4.0*atan(1.0)) //pi

const gsl_rng_type * type;
gsl_rng * item;

double calcdeltaenergy(double **m, int sx, int sy){
    double denergy, r, zz;
    int k, j;
    double rrx, rry;
    int rx, ry;
    double energy, cpm, cmm, cmp, cpp;
    denergy = 0;

    //openmp parallelization:
    #pragma omp parallel for reduction (+:denergy)
    for(int index = 0; index < length*length; index++){
        k = index % length;
        j = index / length;

        zz = 0.5*(1.0 - pow(-1.0, k + j + sx + sy));
        for(rx = -1; rx <= 1; rx++){
            for(ry = -1; ry <= 1; ry++){
                rrx = (sx - k - rx*length)*urd;
                rry = (sy - j - ry*length)*urd;

                r = sqrt(rrx*rrx + rry*rry + zz);
                if(r != 0 && r <= R){
                    cpm = sqrt((rrx+0.5*(0.702*cos(m[k][j])-0.702*cos(m[sx][sy])))*(rrx+0.5*(0.702*cos(m[k][j])-0.702*cos(m[sx][sy]))) + (rry+0.5*(0.702*sin(m[k][j])-0.702*sin(m[sx][sy])))*(rry+0.5*(0.702*sin(m[k][j])-0.702*sin(m[sx][sy]))) + zz);
                    cmm = sqrt((rrx-0.5*(0.702*cos(m[k][j])-0.702*cos(m[sx][sy])))*(rrx-0.5*(0.702*cos(m[k][j])-0.702*cos(m[sx][sy]))) + (rry-0.5*(0.702*sin(m[k][j])-0.702*sin(m[sx][sy])))*(rry-0.5*(0.702*sin(m[k][j])-0.702*sin(m[sx][sy]))) + zz);
                    cpp = sqrt((rrx+0.5*(0.702*cos(m[k][j])+0.702*cos(m[sx][sy])))*(rrx+0.5*(0.702*cos(m[k][j])+0.702*cos(m[sx][sy]))) + (rry+0.5*(0.702*sin(m[k][j])+0.702*sin(m[sx][sy])))*(rry+0.5*(0.702*sin(m[k][j])+0.702*sin(m[sx][sy]))) + zz);
                    cmp = sqrt((rrx-0.5*(0.702*cos(m[k][j])+0.702*cos(m[sx][sy])))*(rrx-0.5*(0.702*cos(m[k][j])+0.702*cos(m[sx][sy]))) + (rry-0.5*(0.702*sin(m[k][j])+0.702*sin(m[sx][sy])))*(rry-0.5*(0.702*sin(m[k][j])+0.702*sin(m[sx][sy]))) + zz);
                    cpm = 1.0/cpm;
                    cmm = 1.0/cmm;
                    cpp = 1.0/cpp;
                    cmp = 1.0/cmp;
                    energy = (cpm + cmm - cpp - cmp)/(0.702*0.702); // s=cte=1

                    denergy -= 2.0*energy;
                }
            }
        }
    }
    return denergy;
}

void initialize(double **m){
    double random;
    for(int i = 0; i < (length-1); i = i+2){
        for(int j = 0; j < (length-1); j = j+2){
            random = gsl_rng_uniform(item);
            if (random < 0.5) m[i][j] = pi/4.0;
            else m[i][j] = 5.0*pi/4.0;

            random = gsl_rng_uniform(item);
            if (random < 0.5) m[i][j+1] = 3.0*pi/4.0;
            else m[i][j+1] = 7.0*pi/4.0;

            random = gsl_rng_uniform(item);
            if (random < 0.5) m[i+1][j] = 3.0*pi/4.0;
            else m[i+1][j] = 7.0*pi/4.0;

            random = gsl_rng_uniform(item);
            if (random < 0.5) m[i+1][j+1] = pi/4.0;
            else m[i+1][j+1] = 5.0*pi/4.0;
        }
    }
}

int main(){
    //choose and initialize the random number generator
    gsl_rng_env_setup();
    type = gsl_rng_default; //default=mt19937, ran2, lxs0
    item = gsl_rng_alloc(type);

    double **s; //site matrix
    s = (double **) malloc(length*sizeof(double *));
    for(int i = 0; i < length; i++)
        s[i] = (double *) malloc(length*sizeof(double));

    //initialization
    initialize(s);

    int l, m;
    for(int cl = 0; cl < length*length; cl++){
        l = gsl_rng_uniform_int(item, length); // rng[0, length-1]
        m = gsl_rng_uniform_int(item, length); // rng[0, length-1]
        printf("%lf\n", calcdeltaenergy(s, l, m));
    }

    //free memory
    for(int i = 0; i < length; i++)
        free(s[i]);
    free(s);
    return 0;
}

I compile with:

g++ [optimization] -lm test.c -o test.x -lgsl -lgslcblas -fopenmp 

and run with:

GSL_RNG_SEED=123; ./test.x > test.dat 

Comparing the outputs for the different optimizations, one can see what I stated before.

Disclaimer: I have little to no experience with OpenMP.

It's a race condition you run into when using OpenMP.

You'll need to declare the variables used inside the OpenMP loop as private. One thread may calculate their values for its value of index, and they promptly get recalculated with different values on another thread that uses a different value of index: variables such as k, j, rrx, rry etc. are shared between the threads.
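To see the effect in isolation, here is a minimal toy sketch (not taken from your code, just an assumed example) of the same kind of race: a temporary declared outside the loop is shared by all threads unless you make it private.

#include <stdio.h>
#include <omp.h>

int main(void){
    double sum_racy = 0.0, sum_ok = 0.0;
    double tmp; /* declared outside the loop -> shared by default */

    /* racy: every thread writes to the same tmp, so another thread may
       overwrite it between the assignment and the use below */
    #pragma omp parallel for reduction(+:sum_racy)
    for(int i = 0; i < 1000000; i++){
        tmp = i*0.5;
        sum_racy += tmp;
    }

    /* correct: private(tmp) gives each thread its own copy of tmp */
    #pragma omp parallel for private(tmp) reduction(+:sum_ok)
    for(int i = 0; i < 1000000; i++){
        tmp = i*0.5;
        sum_ok += tmp;
    }

    printf("racy: %f   private: %f\n", sum_racy, sum_ok);
    return 0;
}

Compiled with something like g++ -O0 -fopenmp and run with more than one thread, the racy sum typically changes from run to run, while the private one stays correct.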

So instead of the pragma you have now, use

#pragma omp parallel for private(k,j,zz,rx,ry,rrx,rry,r,cpm,cmm,cpp,cmp,energy) reduction (+:denergy)

(Credit to Zulan's comment below:) you can also declare the variables inside the parallel region, as locally as possible. That makes them private implicitly, less prone to initialization issues and easier to reason about.
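As a rough sketch (assuming the rest of the loop body stays exactly as in your question), the start of the parallel loop in calcdeltaenergy could then look like this, with no private(...) clause needed:

//sketch: work variables declared inside the loop are implicitly private
#pragma omp parallel for reduction(+:denergy)
for(int index = 0; index < length*length; index++){
    int k = index % length;
    int j = index / length;
    double zz = 0.5*(1.0 - pow(-1.0, k + j + sx + sy));

    for(int rx = -1; rx <= 1; rx++){
        for(int ry = -1; ry <= 1; ry++){
            double rrx = (sx - k - rx*length)*urd;
            double rry = (sy - j - ry*length)*urd;
            double r = sqrt(rrx*rrx + rry*rry + zz);
            if(r != 0 && r <= R){
                /* cpm, cmm, cpp, cmp and energy declared and computed
                   here, exactly as in the original body */
            }
        }
    }
}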

(You could also consider putting what is inside the outer for-loop (over index) into a function of its own: the function call overhead is minimal compared to the calculations.)
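A sketch of that idea, assuming the macros and loop body from your question; the helper name cell_contribution is made up here:

/* hypothetical helper, name invented for this sketch; its locals are
   automatically private to whichever thread calls it */
static double cell_contribution(double **m, int sx, int sy, int index){
    int k = index % length;
    int j = index / length;
    double contribution = 0.0;
    /* ... the body of the original loop over rx and ry goes here,
       subtracting 2.0*energy from 'contribution' ... */
    return contribution;
}

double calcdeltaenergy(double **m, int sx, int sy){
    double denergy = 0.0;
    #pragma omp parallel for reduction(+:denergy)
    for(int index = 0; index < length*length; index++)
        denergy += cell_contribution(m, sx, sy, index);
    return denergy;
}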

As to why -Ofast with OpenMP produces correct output:

My guess is: luck. Here's what the GCC manual says about -Ofast:

Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math [...]

And here's the section on -ffast-math:

This option is never turned on by any -O option besides -Ofast since it can result in incorrect output for programs that depend on an exact implementation of IEEE or ISO rules/specifications for math functions. It may, however, yield faster code for programs that do not require the guarantees of these specifications.

Thus, sqrt, cos and sin will be a lot speedier. My guess is that, in your case, the calculations of the variables inside the outer loop happen not to bite each other, because the individual threads are so fast that they don't conflict. But that is a hand-waving explanation and a guess.

