

Presentation:

DYNAMICO is a new atmospheric dynamical core on an icosahedral grid, developed at IPSL.

DYNAMICO can use XIOS, a library dedicated to I/O. XIOS can be run either as a standard library linked with the model (attached mode) or with processes dedicated exclusively to I/O (server mode).

http://forge.ipsl.jussieu.fr/ioserver

Technical information:

  • website : http://forge.ipsl.fr/dynamico
  • Scientific domain : climate
  • Language : Fortran
  • Parallelism : MPI + OpenMP/OpenACC
  • GPU acceleration : Yes
  • Scalability : high
  • Vectorization : high

Compilation and simulation:

Download:

Sources are available at: http://forge.ipsl.fr/dynamico

For the test, we will use a specific release. To download this release, run:

./download.sh

Compile:

Compile the code using, for instance:

./compile.sh jean-zay-cpu

The file machines/jean-zay-cpu/env contains the environment needed for compilation (module load gcc openmpi lapack hdf5 ...).

You can create your own machine directory under machines to define the appropriate environment, as sketched below.
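As an illustration, a minimal machines/my-machine/env file could look like the following sketch (the directory name my-machine and the exact module names are assumptions; load the modules actually available on your cluster):

# machines/my-machine/env -- hypothetical environment file, adapt to your cluster
module purge
module load gcc
module load openmpi
module load hdf5
module load netcdf
module load lapack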

Run and validate the simulation:

Each test case is provided in a separate folder (e.g. testcase_small), in which you will find three scripts:

  • prepare.sh : prepare the simulation (move data to the right location, recompile after minor changes, ...)
  • run.sh : run the application and print out the evaluated metric
  • validate.sh : validate the simulation from a scientific point of view

For running and validating the simulation, one should be able to do:

cd testcase_XXX
./prepare.sh jean-zay-cpu
./run.sh
./validate.sh

No error code should be returned. These steps can also be used in a batch file to run the simulation through a job scheduler, as sketched below.
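For example, assuming a SLURM scheduler (as on Jean Zay), a minimal batch script could look like the following sketch; the job geometry, scheduler options and test case name are assumptions to adapt to your machine:

#!/bin/bash
#SBATCH --job-name=dynamico_test      # hypothetical job name
#SBATCH --nodes=1                     # adapt the geometry to your test case
#SBATCH --ntasks-per-node=40          # MPI processes per node
#SBATCH --cpus-per-task=1             # OpenMP threads per MPI process
#SBATCH --time=01:00:00

set -e                                # stop at the first failing step
cd testcase_small                     # hypothetical test case folder
./prepare.sh jean-zay-cpu
./run.sh
./validate.sh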

Understanding DYNAMICO parallel data decomposition:

DYNAMICO's grid is composed of 10 main rhombus tiles. Each tile can be decomposed into smaller rhombus domains by splitting the main rhombus in the two directions.

This is defined by parameters in the run.def file:

  • nsplit_i = 4 splits a main rhombus into 4 in the i direction
  • nsplit_j = 2 splits a main rhombus into 2 in the j direction

In this case you will get 4x2=8 sub-rhombi per tile, for a total of 4x2x10=80 sub-rhombi over the 10 tiles covering the sphere. The number of domains (sub-rhombi) is therefore always a multiple of 10.

Domains are assigned to the available resources. For example, if you run with only 1 MPI process (without threads), that process will compute all 80 domains. If you launch your run with 80 MPI processes, each process will compute only 1 domain.

With OpenMP, threads are also assigned to domains. For example, if you launch this configuration with 10 MPI processes and 4 OpenMP threads each (OMP_NUM_THREADS=4), each MPI process is assigned 8 domains, so each thread computes 2 domains (see the sketch below).
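As an illustration of this bookkeeping only (not part of DYNAMICO), the shell sketch below reproduces the counts of the example above:

# Domain bookkeeping for the example above (illustration only).
nsplit_i=4
nsplit_j=2
n_domains=$(( 10 * nsplit_i * nsplit_j ))                          # 10 tiles x 4 x 2 = 80 domains

n_mpi=10
OMP_NUM_THREADS=4
domains_per_process=$(( n_domains / n_mpi ))                       # 80 / 10 = 8 domains per process
domains_per_thread=$(( domains_per_process / OMP_NUM_THREADS ))    # 8 / 4 = 2 domains per thread
echo "$n_domains domains, $domains_per_process per process, $domains_per_thread per thread"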

You can also decide to use OpenMP threads to parallelize over the vertical levels using shared memory. For this you need to define teams of threads that share the iterations over the vertical levels. The size of a team is given by the parameter "omp_level_size" in the run.def file.

omp_level_size = 8 means a team of 8 threads working together on the vertical levels.

This parallelization level can be combined with the others described above.

You therefore have 3 levels of parallelism:

  • MPI : distribution of domains to MPI processes (inter-node)
  • OpenMP : distribution of domains to teams of threads
  • OpenMP : each team sharing the iterations over the vertical levels

Example configurations:

  • nsplit_i = 4
  • nsplit_j = 2
  • omp_level_size = 8
  • #mpi=80, #threads=8 => 1 domain per process, 8 threads on the vertical
  • #mpi=10, #threads=64 => 8 domains per process, 8 teams of 8 threads, each team assigned to 1 domain and working on the vertical
  • #mpi=10, #threads=32 => 8 domains per process, 4 teams of 8 threads, each team assigned to 2 domains and working on the vertical

IMPORTANT RESTRICTIONS:

#teams = #threads / omp_level_size

Each team inside a process must be assigned the same number of domains.

  • => 10*nsplit_i*nsplit_j must be an integer multiple of #mpi * #teams, otherwise the run crashes (see the check below)
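A minimal sketch of this consistency check, using plain shell arithmetic and the example values from above (not part of DYNAMICO):

# Check that 10*nsplit_i*nsplit_j is a multiple of #mpi * #teams.
nsplit_i=4
nsplit_j=2
omp_level_size=8
n_mpi=10
n_threads=64

n_domains=$(( 10 * nsplit_i * nsplit_j ))
n_teams=$(( n_threads / omp_level_size ))
if (( n_domains % (n_mpi * n_teams) == 0 )); then
    echo "OK: $(( n_domains / (n_mpi * n_teams) )) domain(s) per team"
else
    echo "Invalid decomposition: the run would crash"
fi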

Advice for setting decomposition and core placement:

Model resolution: 10 tiles of nbp*nbp points per horizontal layer and llm vertical levels

  • nbp=640 (for the big test case)
  • llm=79

First decide the number of threads you want to run on the vertical. It is very important that such a team of threads runs on cores with uniform memory access, ideally inside the same multicore processor. Working with teams of threads on the vertical is very important to achieve good strong scalability: there is no overcomputation on ghost cells, and for the same number of cores the horizontal domains are larger. The team size on the vertical can be tuned for best performance; it can be the number of cores of a multicore processor, or a size of which that core count is an integer multiple.

For example, omp_level_size=18 on an 18-core processor may be a good idea (although 64/18 ~ 3-4 levels per thread is not optimal); omp_level_size=9 or omp_level_size=6 are other options, etc.
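As a rough helper for this choice, the sketch below prints how many vertical levels each thread would handle for a few candidate team sizes (the values of llm and of the candidate sizes are just the ones discussed in this section):

# Levels handled per thread for a few candidate team sizes (rounded up).
llm=79
for omp_level_size in 18 9 6; do
    levels_per_thread=$(( (llm + omp_level_size - 1) / omp_level_size ))
    echo "team of $omp_level_size threads -> about $levels_per_thread level(s) per thread"
done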

There may be just one thread team or several per node, depending on the number of cores in the node and on the gain obtained. Between teams of threads, theoretically, there is no false sharing since each team writes into its own private memory zone.