I implemented OpenMP parallelization in a loop containing a sum, which is the principal cause of my code slowing down. When I did so, I found that the final results were not the same as those obtained with the non-parallelized code (which is written in C). At first one might think "well, you just didn't implement the parallelization correctly", but the curious thing is that when I run the parallelized code with -Ofast optimization the results are correct.
That is:
- -O0: correct
- -Ofast: correct
- OMP -O0: wrong
- OMP -O1: wrong
- OMP -O2: wrong
- OMP -O3: wrong
- OMP -Ofast: correct!
What is -Ofast doing that solves the error that appears when I use OpenMP? Any recommendation of things to check or test? Thanks!
Edit: here I include the smallest version of the code that still reproduces the problem.
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>

#define length 100
#define R 50.0
#define urd 1.0/sqrt(2.0)
#define pi (4.0*atan(1.0)) //pi

const gsl_rng_type * type;
gsl_rng * item;

double calcdeltaenergy(double **m, int sx, int sy){
    double denergy, r, zz;
    int k, j;
    double rrx, rry;
    int rx, ry;
    double energy, cpm, cmm, cmp, cpp;
    denergy = 0;
    //OpenMP parallelization:
    #pragma omp parallel for reduction (+:denergy)
    for (int index = 0; index < length*length; index++){
        k = index % length;
        j = index / length;
        zz = 0.5*(1.0 - pow(-1.0, k + j + sx + sy));
        for (rx = -1; rx <= 1; rx++){
            for (ry = -1; ry <= 1; ry++){
                rrx = (sx - k - rx*length)*urd;
                rry = (sy - j - ry*length)*urd;
                r = sqrt(rrx*rrx + rry*rry + zz);
                if (r != 0 && r <= R){
                    cpm = sqrt((rrx+0.5*(0.702*cos(m[k][j])-0.702*cos(m[sx][sy])))*(rrx+0.5*(0.702*cos(m[k][j])-0.702*cos(m[sx][sy]))) + (rry+0.5*(0.702*sin(m[k][j])-0.702*sin(m[sx][sy])))*(rry+0.5*(0.702*sin(m[k][j])-0.702*sin(m[sx][sy]))) + zz);
                    cmm = sqrt((rrx-0.5*(0.702*cos(m[k][j])-0.702*cos(m[sx][sy])))*(rrx-0.5*(0.702*cos(m[k][j])-0.702*cos(m[sx][sy]))) + (rry-0.5*(0.702*sin(m[k][j])-0.702*sin(m[sx][sy])))*(rry-0.5*(0.702*sin(m[k][j])-0.702*sin(m[sx][sy]))) + zz);
                    cpp = sqrt((rrx+0.5*(0.702*cos(m[k][j])+0.702*cos(m[sx][sy])))*(rrx+0.5*(0.702*cos(m[k][j])+0.702*cos(m[sx][sy]))) + (rry+0.5*(0.702*sin(m[k][j])+0.702*sin(m[sx][sy])))*(rry+0.5*(0.702*sin(m[k][j])+0.702*sin(m[sx][sy]))) + zz);
                    cmp = sqrt((rrx-0.5*(0.702*cos(m[k][j])+0.702*cos(m[sx][sy])))*(rrx-0.5*(0.702*cos(m[k][j])+0.702*cos(m[sx][sy]))) + (rry-0.5*(0.702*sin(m[k][j])+0.702*sin(m[sx][sy])))*(rry-0.5*(0.702*sin(m[k][j])+0.702*sin(m[sx][sy]))) + zz);
                    cpm = 1.0/cpm;
                    cmm = 1.0/cmm;
                    cpp = 1.0/cpp;
                    cmp = 1.0/cmp;
                    energy = (cpm + cmm - cpp - cmp)/(0.702*0.702); // s=cte=1
                    denergy -= 2.0*energy;
                }
            }
        }
    }
    return denergy;
}

void initialize(double **m){
    double random;
    for (int i = 0; i < (length-1); i = i+2){
        for (int j = 0; j < (length-1); j = j+2){
            random = gsl_rng_uniform(item);
            if (random < 0.5) m[i][j] = pi/4.0;
            else m[i][j] = 5.0*pi/4.0;
            random = gsl_rng_uniform(item);
            if (random < 0.5) m[i][j+1] = 3.0*pi/4.0;
            else m[i][j+1] = 7.0*pi/4.0;
            random = gsl_rng_uniform(item);
            if (random < 0.5) m[i+1][j] = 3.0*pi/4.0;
            else m[i+1][j] = 7.0*pi/4.0;
            random = gsl_rng_uniform(item);
            if (random < 0.5) m[i+1][j+1] = pi/4.0;
            else m[i+1][j+1] = 5.0*pi/4.0;
        }
    }
}

int main(){
    // Choose and initialize the random number generator
    gsl_rng_env_setup();
    type = gsl_rng_default; //default=mt19937, ran2, lxs0
    item = gsl_rng_alloc(type);

    double **s; // site matrix
    s = (double **) malloc(length*sizeof(double *));
    for (int i = 0; i < length; i++)
        s[i] = (double *) malloc(length*sizeof(double));

    // Initialization
    initialize(s);

    int l, m;
    for (int cl = 0; cl < length*length; cl++){
        l = gsl_rng_uniform_int(item, length); // rng[0, length-1]
        m = gsl_rng_uniform_int(item, length); // rng[0, length-1]
        printf("%lf\n", calcdeltaenergy(s, l, m));
    }

    // Free memory
    for (int i = 0; i < length; i++)
        free(s[i]);
    free(s);

    return 0;
}
I compile with:
g++ [optimization] -lm test.c -o test.x -lgsl -lgslcblas -fopenmp
and run with:
GSL_RNG_SEED=123 ./test.x > test.dat
Comparing the outputs for the different optimization levels, one can see what I stated before.
Disclaimer: I have little to no experience with OpenMP.
It's a race condition you run into when using OpenMP.
You'll need to declare the variables inside the OpenMP loop private. One thread may calculate their values for one value of index, and they then get promptly recalculated with different values by another thread that is working on a different value of index: variables such as k, j, rrx, rry etc. are shared between the threads.
So instead of the pragma you are using, use:

#pragma omp parallel for private(k,j,zz,rx,ry,rrx,rry,r,cpm,cmm,cpp,cmp,energy) reduction (+:denergy)
(Credit to Zulan's comment below:) You can also declare the variables inside the parallel region, as locally as possible. That makes them private implicitly, is less prone to initialization issues, and makes the code easier to reason about.
(You could even consider putting everything inside the outer for-loop (over index) into a function: the function-call overhead is minimal compared to the calculations.)
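For illustration only, here is a rough sketch of what calcdeltaenergy could look like with everything declared inside the parallel loop. It reuses the length, urd and R macros from the question; the repeated sub-expressions are factored into helper variables dxm, dym, dxp, dyp purely for brevity, so this is the pattern rather than the asker's exact code:

double calcdeltaenergy(double **m, int sx, int sy){
    double denergy = 0.0;

    #pragma omp parallel for reduction(+:denergy)
    for (int index = 0; index < length*length; index++){
        // Everything declared inside the loop body is private to the thread.
        int k = index % length;
        int j = index / length;
        double zz = 0.5*(1.0 - pow(-1.0, k + j + sx + sy));

        // Helper variables (illustrative factoring of the repeated terms).
        double dxm = 0.5*(0.702*cos(m[k][j]) - 0.702*cos(m[sx][sy]));
        double dym = 0.5*(0.702*sin(m[k][j]) - 0.702*sin(m[sx][sy]));
        double dxp = 0.5*(0.702*cos(m[k][j]) + 0.702*cos(m[sx][sy]));
        double dyp = 0.5*(0.702*sin(m[k][j]) + 0.702*sin(m[sx][sy]));

        for (int rx = -1; rx <= 1; rx++){
            for (int ry = -1; ry <= 1; ry++){
                double rrx = (sx - k - rx*length)*urd;
                double rry = (sy - j - ry*length)*urd;
                double r = sqrt(rrx*rrx + rry*rry + zz);
                if (r != 0 && r <= R){
                    double cpm = 1.0/sqrt((rrx+dxm)*(rrx+dxm) + (rry+dym)*(rry+dym) + zz);
                    double cmm = 1.0/sqrt((rrx-dxm)*(rrx-dxm) + (rry-dym)*(rry-dym) + zz);
                    double cpp = 1.0/sqrt((rrx+dxp)*(rrx+dxp) + (rry+dyp)*(rry+dyp) + zz);
                    double cmp = 1.0/sqrt((rrx-dxp)*(rrx-dxp) + (rry-dyp)*(rry-dyp) + zz);
                    double energy = (cpm + cmm - cpp - cmp)/(0.702*0.702); // s=cte=1
                    denergy -= 2.0*energy;
                }
            }
        }
    }
    return denergy;
}

Because every temporary now lives inside the loop body, each thread automatically gets its own copy and no private(...) clause is needed; only denergy is shared, and the reduction clause takes care of that.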
As to why -Ofast with OpenMP does produce correct output: my guess is luck. Here's what the GCC manual says about -Ofast:
Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math [...]
And here's the section on -ffast-math:

This option is not turned on by any -O option besides -Ofast since it can result in incorrect output for programs that depend on an exact implementation of IEEE or ISO rules/specifications for math functions. It may, however, yield faster code for programs that do not require the guarantees of these specifications.
Thus, sqrt, cos and sin get to be a lot speedier. My guess is that, in this case, the calculations of the variables inside the outer loop don't bite each other, since the individual threads are so fast that they don't conflict. But that is a handwaving explanation and a guess.
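To see that the race is real and independent of the physics, here is a small toy program (not from the question) that reproduces the same class of bug: a temporary shared between threads gives a wrong sum, while the same temporary declared inside the loop gives the correct one. Whether the wrong result actually shows up depends on timing, thread count and optimization level, which fits the "luck" explanation above.

#include <stdio.h>
#include <math.h>
#include <omp.h>

int main(void){
    const int n = 1000000;
    double sum_shared = 0.0, sum_local = 0.0;
    double t;  // one t, shared by all threads -> race

    #pragma omp parallel for reduction(+:sum_shared)
    for (int i = 0; i < n; i++){
        t = sin(0.001*i);    // another thread may overwrite t ...
        sum_shared += t*t;   // ... before this thread reads it back
    }

    #pragma omp parallel for reduction(+:sum_local)
    for (int i = 0; i < n; i++){
        double u = sin(0.001*i); // declared inside the loop -> private
        sum_local += u*u;
    }

    printf("shared temporary: %.10f\n", sum_shared);
    printf("local temporary : %.10f\n", sum_local);
    return 0;
}

Compiled with, e.g., g++ -O2 -fopenmp toy.c -lm -o toy and run with more than one thread, the two sums typically differ; with OMP_NUM_THREADS=1 they agree, which is also a quick way to check for a race in the original code.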