Thursday, December 20, 2007

WRFV221 Mac PowerPC w/ IBM xlf and MPI

My recent tests of the latest WRF version as of this writing, v2.2.1, on a PowerPC-based Mac with IBM xlf and MPI all look good, even with nesting. Two files are needed for this:

Additions for arch/configure.defaults.

Replacement for external/RSL_LITE/rsl_malloc.c. Rename file after downloading.

Sunday, September 16, 2007

WRFV22 on a dual quad-core Intel Mac

Here are some notes on a recent attempt to run WRF on a dual quad-core Intel Mac, which I received as a loaner from Apple. In the two weeks I had the machine I made progress, but it's clear from the notes below that I didn't reach the finish line. For one thing, I was unable to get even medium-sized jobs to run in the 64-bit environment, and the tricks for accessing memory that worked with 32 bits fail there. In 32-bit land there is still a limit on job size, apparently due to Apple's internal memory-allocation restrictions.

Executive summary: I was able to get all 8 cpus working for me, though the scaling wasn't the best, and the key turned out to be (a) moving to mpich-2 AND (b) configuring it with --with-comm=shared. An important goal for me is getting results that do not vary with the number of processors used. With OMP, that required the ifort flag '-fp-model precise'. I also used this flag when compiling the WRF code and mpich-2.
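
As an illustration of the kind of consistency check I mean, here is a minimal sketch (directory names are made up, and the exact launch command depends on how your MPI is set up):

# run the identical case with several processor counts, each in its own directory
for n in 1 2 4 8; do
  ( cd run_${n}cpu && mpiexec -n ${n} ./wrf.exe )
done
# identical checksums across the runs mean bitwise identical output
cksum run_*cpu/wrfout_d01_*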

** These notes assume the 32-bit ifort compiler (I used ifort 10.0.017) and presume the WRF model changes documented in previous posts. Compilation and execution took place on an HFSX-formatted disk.

(1) netcdf-3.6.2

export CC=/usr/bin/gcc
export CPPFLAGS="-O -DNDEBUG -DpgiFortran"
export CFLAGS="-O"
export CXX=/usr/bin/c++
export CXXFLAGS="-O"

export FC=ifort
export F77=ifort
export F90=ifort
export FFLAGS="-O3"
export F90FLAGS=


./configure --prefix=/usr/local/netcdf
make
make test
make check
sudo mkdir /usr/local/netcdf
sudo make install
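
One note for later: WRF's own configure script locates this installation through the NETCDF environment variable, so with the prefix used above that means:

export NETCDF=/usr/local/netcdf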

(2) MPICH-2 (mpich2-1.0.5p4)

setenv FC ifort
setenv F90 ifort
setenv CC "gcc"
setenv RSHCOMMAND "/usr/bin/ssh"
setenv CXX "/usr/bin/c++"
setenv FFLAGS "-xP -vec- -fp-model precise"
setenv F90FLAGS "-xP -vec- -fp-model precise"
./configure --with-comm=shared
make
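
For reference, installing and then launching under mpich-2 on a single 8-cpu box looks roughly like this (install prefix, process count, and paths are illustrative; mpich-2's mpd process manager also needs a ~/.mpd.conf set up as described in its installation guide):

sudo make install
# mpd must be running before mpiexec will work
mpd &
mpiexec -n 8 ./wrf.exe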

(3) Modify external/RSL_LITE/buf_for_proc.c to add "extern" before "char mess" (this resolves a problem that crops up specifically with mpich-2).
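
If you would rather script that edit, something along these lines should do it, though check the actual declaration in your copy of buf_for_proc.c first -- the assumption here is that the definition starts a line with "char mess":

# prepend "extern" to the global mess buffer definition (verify before running)
perl -pi -e 's/^char mess/extern char mess/' external/RSL_LITE/buf_for_proc.c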

(4) configure.wrf files I used for these tests: APPLE_LOANER_WRF.zip.

Thursday, July 5, 2007

WRFV22 on Mac Intel update... ifort and g95

The ZIP file linked below has been updated to reflect further progress in getting WRFV22 to work on a Core Duo MacBook with Intel ifort and g95. As before, I am using mpich-1.2.5 for the MPI builds. My latest tests employed ifort 10.0 and the most recent g95 as of this writing. The g95 sections are also usable on PPC Macs.

For the most part, these configurations pass my "consistency tests", with some reservations. Consistency means that the MPI executables generate bitwise identical output, irrespective of the number of processors selected. However, if the domain is too finely subdivided, the results start varying. I also encountered a test case in which identical results were generated for 2 through 24 CPUs, but the results using only 1 CPU were different, which I cannot yet explain.

Results are not bitwise identical among compilers, or between g95 builds on the Intel and PPC architectures. However, I have not noted anything particularly amiss in the output. g95 executables built on PPC can be executed on Intel (albeit slowly), and I've found this yields the same results as on PPC.

WRFV22_MacIntel_mods.zip (as of July 5, 2007).

Thursday, April 19, 2007

WRFV22 on Mac Intel w/ ifort

The zip archive below contains WRFV22 code modifications for an Intel-based Mac with the Intel Fortran compiler (ifort). I have been able to run the single-threaded and OMP versions without problem, but my tests of the MPI version (RSL and RSL_LITE) with mpich-1.2.5 have not produced identical results when one processor is requested versus two. This has only been tested on a MacBook with a Core Duo processor, so use at your own risk. Feedback appreciated.

WRFV22_MacIntel_mods.zip

Sunday, April 15, 2007

WRFV22 g95 on Mac supporting programs

Previously, I reported that I had to use tcsh to run WRFV22. It turns out I can run it in bash if I explicitly set the stack size and data size as follows:

ulimit -s 65536
ulimit -d unlimited
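
Since those limits apply only to the shell session that launches the model, one option is a small wrapper script along these lines (the executable name and location are whatever your run directory uses):

#!/bin/bash
# raise the stack and data limits, then launch WRF from the run directory
ulimit -s 65536
ulimit -d unlimited
exec ./wrf.exe "$@"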

To go along with the configure.defaults additions for g95 on Mac PPC, here are the flags I used for NetCDF and MPICH to get everything working. The specific versions I used were NetCDF 3.6.0-p1 and MPICH 1.2.5.

netcdf-3.6.0-p1

* I used tcsh for the compilation

setenv CC /usr/bin/gcc
setenv CPPFLAGS "-O -DNDEBUG -DpgiFortran"
setenv CFLAGS "-O"
setenv CXX /usr/bin/c++
setenv CXXFLAGS "-O"

setenv FC g95
setenv F77 g95
setenv F90 g95
setenv FFLAGS "-O3 -fno-second-underscore"
setenv FCFLAGS "-O3 -fno-second-underscore"
setenv F90FLAGS "-O3 -fno-second-underscore"

./configure
* I then edited "macros.make", adding:
FLIBS = -lSystemStubs
F90LIBS = -lSystemStubs

make
make test
[all tests were found to work]
make install


mpich-1.2.5

* I used tcsh for the compilation

setenv FC g95
setenv F90 g95
setenv RSHCOMMAND "/usr/bin/ssh"
setenv CC gcc
setenv CXX gcc
setenv LIBS "-lSystemStubs"
setenv FFLAGS "-fno-second-underscore"
setenv F90FLAGS "-fno-second-underscore"
setenv CFLAGS "-fno-common -DFORTRANUNDERSCORE"

./configure --with-device=ch_p4 --without-romio
make

* On one Mac PPC machine, I had to add -lstdc++ to LIBS, but another machine failed to build if that was done

* Although mpich compiles fine, the example programs would not compile without manually adding the "-fno-second-underscore" flag. Example:


mpif77 -fno-second-underscore -c fpi.f
mpif77 -o fpi fpi.o

Wednesday, February 21, 2007

WRFV22 g95 update 2/21/07

Here is an update on g95 with WRF version 2.2 on Mac OS X (PPC G5) and a Linux (Athlon) box. As noted below, WRFV22 with g95 produces usable (but not yet verified correct) runs with nesting on PPC, whereas xlf still does not; this is likely an xlf compiler bug specific to OS X. The test case for this experiment is a single 150 x 150 grid point domain. The model configuration, independent of compiler, is RSL_LITE with mpich-1.2.5 and nesting enabled. The executables are:


  • g95 on OS X/PPC (donar_g95)
  • xlf on OSX/PPC (donar_xlf)
  • g95 on Mandriva Linux/Athlon (cascade_g95)
  • ifort on Mandriva Linux/Athlon (cascade_ifort)


The first thing I check is whether the model produces the same results, as verified by checksums and visual inspection of plotted fields, independent of the number of cpus -- at least until the domain becomes too finely subdivided. So far in this experiment, only g95 on OS X/PPC passes that test: its checksums are the same for runs with 1 to 24 cpus. (It is likely that ifort will also pass, but that portion of the experiment is not yet finished.) For xlf on OS X/PPC, runs with 1, 2, 12 and 18 cpus produce one particular checksum, while 4, 6, 8, 10 and 14 cpus result in another; 16 and 20 cpus yield a third checksum, and 24 cpus produces a fourth. The 12 hour forecast fields differ in some respects, with patterns that suggest roundoff errors.

Unsurprisingly, g95 is slower than a commercial compiler on the same hardware. The plot below presents the timing results obtained thus far. There are gaps in the data, and some inconsistency in how the time function behaves on OS X versus Linux. I will redo these statistics using Brian Jewett's scripts for extracting timings from the rsl.out.0000 files. Also, for g95 on OS X/PPC, I/O is a bottleneck; running with nio_tasks_per_group > 0 appears to help a lot.
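
In the meantime, a quick way to pull per-timestep timings out of rsl.out.0000 by hand is something like the following (the exact wording of WRF's timing lines can vary a bit between versions, so treat this as a sketch):

# sum the elapsed seconds reported for each model step in rsl.out.0000
grep "Timing for main" rsl.out.0000 | awk '{sum += $(NF-2)} END {printf "%d steps, %.2f seconds\n", NR, sum}'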




[Timing plot for the executables listed above omitted.]

Here is the latest configure.wrf segment for my g95 OS X PPC runs.

Sunday, February 11, 2007

Experiments with g95 on OSX/PPC and Linux/x86

I have been trying to get WRF nesting working properly on a Mac PPC cluster. Eventually, this will expand to a Mac Intel machine/cluster. This effort will be successful when the nesting/MPI code produces: (1) the same results on a given machine, independent of the number of nodes requested; and (2) results that do not diverge significantly from those obtained on other platforms. The first efforts involved the IBM xlf compiler. I have been able to get WRFV212 and WRFV22 to compile and run, but the results vary with the number of nodes requested -- at least for runs involving more than one domain. Single domain runs have been fine, independent of the number of processors employed.

Frustration with this led me to consider the g95 compiler. For comparison and context, I'm running WRFV22 with g95 on the Mac PPC cluster ("donar") as well as on an Athlon cluster ("cascade") running Mandriva Linux. Both g95 compilers are version 0.91, the most recent at this writing. The latter also has Intel Fortran (version 9.1.036). Builds of WRF with g95 use netcdf (version 3.6.0-p1) and mpich (version 1.2.5) compiled using g95; ifort builds use ifort-made versions of the same software.

The test case is a two-domain run (D1 is 40x40 grid points at 60 km resolution; D2 is 28x28 at 20 km; 31 vertical levels), run for 12 hours over the midwestern US.

Status as of 11 February 2007: Results vary with the number of cpus requested for all combinations tested thus far, but the point of divergence depends on the compiler and optimization level. At some extreme, it is possible that dividing a small domain too finely might provoke rounding errors, but that is speculation. Divergence is determined by two criteria, both of which must be present: a difference in cksum on the wrfout_d02 files, and differences in fields visualized with GrADS (created using wrf_to_grads). In some cases, wrfout_d02 files have differed only in a few apparently unimportant lines, involving the snow density variable, as ascertained by examining dumps with ncdump.
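
To make those two checks concrete, the file-level and variable-level comparisons look roughly like this (run directories and output file names are hypothetical; the GrADS visualization step is not shown):

# quick pass/fail: are the nest output files bitwise identical?
cksum run_1cpu/wrfout_d02_* run_2cpu/wrfout_d02_*
# if not, dump to text and diff to see which variables actually changed
ncdump run_1cpu/wrfout_d02_* > d02_1cpu.txt
ncdump run_2cpu/wrfout_d02_* > d02_2cpu.txt
diff d02_1cpu.txt d02_2cpu.txt | head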

Summary:

cascade/ifort -- results diverge after 6 cpus (results for 1-6 cpus identical)
cascade/g95 standard -- results diverge after 4 cpus (results for 1-4 cpus identical)
cascade/g95 debug -- results diverge after 3 cpus (results for 1-3 cpus identical)
donar/g95 debug -- results diverge after 2 cpus (results for 1 and 2 cpus identical)
donar/g95 standard -- results diverge after 6 cpus (results for 1-6 cpus identical)

Results differ for g95 runs between the two platforms even at the same optimization level.

A next step may be to work with a larger domain, to see whether that delays the divergence as the number of processors increases.