Wednesday, February 21, 2007

WRFV22 g95 update 2/21/07

Here is an update on g95 with WRF version 2.2 on a Mac OS X (PPC G5) box and a Linux (Athlon) box. As noted below, WRFV22 produces usable (but not yet verified correct) runs with nesting on PPC with g95, whereas xlf still does not; this is likely an xlf compiler bug specific to OS X. The test case for this particular experiment consists of a single 150 x 150 grid point domain. Model configurations, independent of compiler, are: RSL_LITE, mpich-1.2.5, nesting enabled. The executables represent:


  • g95 on OS X/PPC (donar_g95)
  • xlf on OS X/PPC (donar_xlf)
  • g95 on Mandriva Linux/Athlon (cascade_g95)
  • ifort on Mandriva Linux/Athlon (cascade_ifort)


The first thing I check is whether the model produces the same results, as verified by checksums and visual inspection of plotted fields, independent of the number of cpus -- at least until the domain becomes too finely subdivided. So far in this experiment, only g95 on OS X/PPC passes that test. (Ifort will likely pass as well, but that portion of the experiment is not yet finished.) Its checksums are identical for runs with 1 to 24 cpus. For xlf on OS X/PPC, runs with 1, 2, 12 and 18 cpus produce one checksum; 4, 6, 8, 10 and 14 cpus produce another; 16 and 20 cpus yield a third; and 24 cpus yields a fourth. The 12 hour forecast fields differ in some respects, with patterns that suggest roundoff errors.
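For the curious, the grouping step of that check can be sketched in a few lines of Python. This is a hypothetical helper, not part of WRF or of my actual workflow (I use the Unix cksum command); any stable file checksum serves the same purpose.

```python
import hashlib


def file_checksum(path):
    """Return an MD5 hex digest of a file. Any stable checksum
    (e.g. Unix cksum) works equally well for this comparison."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def group_runs_by_checksum(run_paths):
    """Map checksum -> list of run labels, so bit-identical runs cluster.

    run_paths: dict of run label (e.g. cpu count) -> path to a wrfout file.
    """
    groups = {}
    for label, path in run_paths.items():
        groups.setdefault(file_checksum(path), []).append(label)
    return groups
```

If every run lands in a single group, the reproducibility test passes; the xlf behavior described above would show up as four separate groups.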

Unsurprisingly, g95 is slower than a commercial compiler on the same hardware. The plot below presents the timing results obtained thus far. There are gaps in the data, and some inconsistency in how the time function behaves on OS X versus Linux. I will redo these statistics using Brian Jewett's scripts for extracting timings from the rsl.out.0000 files. Also, for g95 on OS X/PPC, I/O is a bottleneck; running with nio_tasks_per_group > 0 appears to help a lot.
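A minimal sketch of the kind of extraction such scripts perform is below. The line format shown in the comment is my assumption about typical "Timing for main" lines in WRF's rsl.out.0000; Brian Jewett's actual scripts may well do more.

```python
import re

# WRF writes per-step timing lines to rsl.out.0000 that look roughly like:
#   Timing for main: time 2007-02-21_00:03:00 on domain 1: 12.34560 elapsed seconds.
TIMING_RE = re.compile(
    r"Timing for main: time \S+ on domain\s+(\d+):\s+([\d.]+) elapsed seconds"
)


def total_main_seconds(lines):
    """Sum the per-step 'Timing for main' elapsed seconds, per domain."""
    totals = {}
    for line in lines:
        m = TIMING_RE.search(line)
        if m:
            domain, secs = int(m.group(1)), float(m.group(2))
            totals[domain] = totals.get(domain, 0.0) + secs
    return totals
```

Summing these per-step times gives a wall-clock measure of the integration itself, independent of how the time function behaves on each platform.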


[Plot: timing results for the four executables. Click the image to open a larger version in a new window.]

Here is the latest configure.wrf segment for my g95 OS X/PPC runs.

Sunday, February 11, 2007

Experiments with g95 on OS X/PPC and Linux/x86

I have been trying to get WRF nesting working properly on a Mac PPC cluster. Eventually, this will expand to a Mac Intel machine/cluster. This effort will be successful when the nesting/MPI code produces: (1) the same results on a given machine, independent of the number of nodes requested; and (2) results that do not diverge significantly from those obtained on other platforms. The first efforts involved the IBM xlf compiler. I have been able to get WRFV212 and WRFV22 to compile and run, but the results vary with the number of nodes requested -- at least for runs involving more than one domain. Single domain runs have been fine, independent of the number of processors employed.

Frustration with this led me to consider the g95 compiler. For comparison and context, I'm running WRFV22 with g95 on the Mac PPC cluster ("donar") as well as on an Athlon cluster ("cascade") running Mandriva Linux. Both g95 compilers are version 0.91, the most recent at this writing. The latter also has Intel Fortran (version 9.1.036). Builds of WRF with g95 use netcdf (version 3.6.0-p1) and mpich (version 1.2.5) compiled using g95; ifort builds use ifort-made versions of the same software.

The test case is a two domain run (D1 is 40x40 grid points at 60 km resolution; D2 is 28x28 at 20 km; 31 vertical levels), run for 12 hours over the midwestern US.
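For reference, that domain setup would correspond to a namelist.input &domains block along these lines. This is a sketch from memory of WRF's namelist format, not my actual file; dx/dy are assumed to be in meters, the grid ratio is inferred from 60 km / 20 km, and time-step and start/end entries are omitted.

```
&domains
 max_dom             = 2,
 e_we                = 40, 28,
 e_sn                = 40, 28,
 e_vert              = 31, 31,
 dx                  = 60000, 20000,
 dy                  = 60000, 20000,
 parent_grid_ratio   = 1, 3,
 /
```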

Status as of 11 February 2007: Results vary with the number of cpus requested for all combinations tested thus far, but the point of divergence depends on the compiler and optimization level. At some extreme, dividing a small domain too finely might itself provoke rounding errors, but that is speculation. Divergence is determined by two criteria, both of which must be present: differing cksum values on the wrfout_d02 files, and differences in fields visualized using GrADS (created using wrf_to_grads). In some cases, the wrfout_d02 files have differed only in a few apparently unimportant lines (ascertained by examining dumps made with ncdump) for a snow density variable.
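That last comparison can be sketched as follows, assuming two text dumps have already been produced (e.g. ncdump wrfout_d02_... > a.txt for each run). The variable-tracking logic here is a hypothetical simplification of what I actually do by eye, and the header pattern is an assumption about the shape of ncdump's data section.

```python
import re

# In an ncdump data section, each variable's values follow a header
# line shaped roughly like " SNOW =" (name, equals sign, nothing else).
VAR_HEADER_RE = re.compile(r"^\s*([A-Za-z_]\w*)\s*=$")


def diff_dump_lines(dump_a, dump_b):
    """Compare two ncdump-style text dumps line by line.

    Returns (variable, line_a, line_b) for each differing line, where
    'variable' is the most recent header seen in the first dump -- a
    quick way to see which fields account for a checksum mismatch.
    """
    current_var = None
    diffs = []
    for line_a, line_b in zip(dump_a, dump_b):
        m = VAR_HEADER_RE.match(line_a)
        if m:
            current_var = m.group(1)
        if line_a != line_b:
            diffs.append((current_var, line_a, line_b))
    return diffs
```

A run whose only differences fall under one or two variables, as in the snow density case above, looks quite different from a run whose fields have genuinely diverged.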

Summary:

cascade/ifort -- results diverge after 6 cpus (results for 1-6 cpus identical)
cascade/g95 standard -- results diverge after 4 cpus (results for 1-4 cpus identical)
cascade/g95 debug -- results diverge after 3 cpus (results for 1-3 cpus identical)
donar/g95 debug -- results diverge after 2 cpus (results for 1 and 2 cpus identical)
donar/g95 standard -- results diverge after 6 cpus (results for 1-6 cpus identical)

Results differ for g95 runs between the two platforms even at the same optimization level.

One next step may be to work with a larger domain, to see whether it delays the point of divergence as the number of processors requested increases.