Sunday, February 11, 2007

Experiments with g95 on OSX/PPC and Linux/x86

I have been trying to get WRF nesting working properly on a Mac PPC cluster. Eventually, this will expand to a Mac Intel machine/cluster. This effort will be successful when the nesting/MPI code produces: (1) the same results on a given machine, independent of the number of nodes requested; and (2) results that do not diverge significantly from those obtained on other platforms. The first efforts involved the IBM xlf compiler. I have been able to get WRFV212 and WRFV22 to compile and run, but the results vary with the number of nodes requested -- at least for runs involving more than one domain. Single-domain runs have been fine, independent of the number of processors employed.

Frustration with this led me to consider the g95 compiler. For comparison and context, I'm running WRFV22 with g95 on the Mac PPC cluster ("donar") as well as on an Athlon cluster ("cascade") running Mandriva Linux. Both g95 compilers are version 0.91, the most recent at this writing. Cascade also has Intel Fortran (version 9.1.036). Builds of WRF with g95 use netcdf (version 3.6.0-p1) and mpich (version 1.2.5) compiled using g95; ifort builds use ifort-made versions of the same software.

The test case is a two-domain run (D1 is 40x40 at 60 km resolution; D2 is 28x28 at 20 km; 31 vertical grid levels), run for 12 hours over the midwestern US.

Status as of 11 February 2007: Results vary with the number of cpus requested for all combinations tested thus far, but the point of divergence depends on the compiler and optimization level. At some extreme, it is possible that dividing up a small domain too finely might provoke rounding errors; this is speculation. Divergence is determined by two criteria, both of which must be present: a difference in the cksum of the wrfout_d02 files, and differences in fields visualized using GrADS, created using wrf_to_grads. In some cases, wrfout_d02 files have differed only in a few apparently unimportant lines (ascertained by examining dumps with ncdump) for a variable called snow density.
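To make the rounding speculation concrete, here is a minimal Python sketch of how splitting the same values into different pieces can change a floating-point sum. It is only an illustration of non-associative floating-point addition; the function names (naive_sum, split_sum) are hypothetical, and nothing here is taken from the WRF code or shown to be the cause of the divergence described above.

```python
def naive_sum(values):
    """Add values left to right, the way a simple loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


def split_sum(values, split):
    """Sum two pieces separately, then combine the partial sums.

    This loosely mimics two processes each reducing their own piece
    of a domain before the results are combined (an assumption made
    for illustration only).
    """
    left = naive_sum(values[:split])
    right = naive_sum(values[split:])
    return left + right


values = [0.1, 0.2, 0.3]

# Same numbers, different partitioning, different last bits.
print(repr(split_sum(values, 1)))  # 0.1 + (0.2 + 0.3)
print(repr(split_sum(values, 2)))  # (0.1 + 0.2) + 0.3
print(split_sum(values, 1) == split_sum(values, 2))
```

The two results differ only in the last bit, but in a long model integration such differences could in principle grow. Whether this is what happens here remains untested.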

Summary:

cascade/ifort -- results diverge after 6 cpus (results for 1-6 cpus identical)
cascade/g95 standard -- results diverge after 4 cpus (results for 1-4 cpus identical)
cascade/g95 debug -- results diverge after 3 cpus (results for 1-3 cpus identical)
donar/g95 debug -- results diverge after 2 cpus (results for 1 and 2 cpus identical)
donar/g95 standard -- results diverge after 6 cpus (results for 1-6 cpus identical)
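
The "identical for 1-N cpus" entries above come from comparing checksums of the wrfout_d02 files by hand. A rough Python sketch of that bookkeeping follows. It uses SHA-256 from hashlib as a stand-in for cksum (not the same algorithm, but it serves the same "are these files byte-identical" purpose), and the function names are hypothetical. It covers only the checksum criterion; the GrADS comparison still has to be done by eye.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def last_matching_cpu_count(digests):
    """Given {cpu_count: digest}, return the largest tested cpu count N
    such that every tested run up to N matches the lowest-count run.
    """
    counts = sorted(digests)
    reference = digests[counts[0]]
    last = None
    for n in counts:
        if digests[n] != reference:
            break  # first run that differs from the reference
        last = n
    return last


# Example with made-up digests standing in for real files:
example = {1: "aa", 2: "aa", 3: "aa", 4: "bb", 5: "aa"}
print(last_matching_cpu_count(example))
```

In practice the dictionary would be filled by calling file_digest on the wrfout_d02 file from each run, wherever those happen to be stored.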

Results differ for g95 runs between the two platforms even at the same optimization level.

One next step may be to work with a larger domain, to see if it delays divergence as the number of processors requested increases.
