Remote Mini-Workshop Series

Quite a few interesting mini-projects are being undertaken at the moment. It is worthwhile to advertise these to the larger community to invite participation. In our weekly calls we decided that we should set aside a few hours or half a day for one of these. I now suggest that we turn this into a mini-series, where we pick from the list below until we run out of interest. Maybe this will keep us busy until Christmas.

We picked Wednesday 9:00 EST as the meeting time. We'll meet on Google Hangouts (probably); details TBA here.

  1. Spack: installing external packages https://github.com/LLNL/spack [Erik]
  2. SimulationIO: a new file format that's easy to read https://github.com/eschnett/SimulationIO
  3. FunHPC (multi-threading with futures): overview https://bitbucket.org/eschnett/funhpc.cxx [Erik, Christian, Ian]
  4. FunHPC (multi-threading with futures): shoehorning this into Cactus [Erik, Christian, Ian]
  5. StencilOps: more efficient finite differencing stencils in Kranc [Ian]
  6. DG: Jonah's and my new DG formulation that can replace FD methods https://arxiv.org/abs/1604.00075 [Federico]
  7. The "distribute" script: testing the Einstein Toolkit on HPC systems
  8. Towards a Kranc implementation of a hydro formulation [Ian, Federico]

If you are interested in one of these topics, then add your name in square brackets after the topic.

If you are interested in presenting a topic yourself, then add a new item to the list.

Mini-Workshop #1: Wed, Dec 7, 2016, 9:00 EST

Topic: FunHPC (multi-threading with futures): overview https://bitbucket.org/eschnett/funhpc.cxx

Venue: Google Hangouts https://hangouts.google.com/call/jjkffrrvmnbhrooiyjxhfeb2ume

Agenda:

  • FunHPC design overview
  • Comparison to OpenMP
  • CPU vs. memory performance
  • Cache and multi-threading, loop tiling
  • How to parallelize an application via FunHPC
  • Building and installing
  • Examples
  • Benchmarks

Building and Installing

FunHPC is available on BitBucket https://bitbucket.org/eschnett/funhpc.cxx . It requires several other packages to be installed as well, namely Cereal, hwloc, jemalloc, and Qthreads.

To install FunHPC from scratch, you need to install these other libraries first, and then edit FunHPC's Makefile. Google Test is also required, but will be downloaded automatically. Apologies for this unprofessional setup. In the future, FunHPC should be converted to use CMake, and Google Test should be packaged as part of it.

The Cereal package requires a patch. This patch makes it distinguish between regular pointers and function pointers. Regular pointers cannot be serialized since it is unclear whether they are valid, and if so, how the target should be allocated or freed. Function pointers, however, can be serialized -- we assume they point to functions, which are constants, so that no memory management issues arise. You need to apply the following patch:

 --- old/include/cereal/types/common.hpp
 +++ new/include/cereal/types/common.hpp
 @@ -106,14 +106,16 @@
      t = reinterpret_cast<typename common_detail::is_enum<T>::type const &>( value );
    }
 
 +#ifndef CEREAL_ENABLE_RAW_POINTER_SERIALIZATION
    //! Serialization for raw pointers
    /*! This exists only to throw a static_assert to let users know we don't support raw pointers. */
    template <class Archive, class T> inline
    void CEREAL_SERIALIZE_FUNCTION_NAME( Archive &, T * & )
    {
      static_assert(cereal::traits::detail::delay_static_assert<T>::value,
        "Cereal does not support serializing raw pointers - please use a smart pointer");
    }
 +#endif
 
    //! Serialization for C style arrays
    template <class Archive, class T> inline


When you "make", you need to pass certain environment variables:

  • CEREAL_DIR (has to be set in the Makefile)
  • HWLOC_DIR
  • JEMALLOC_DIR
  • QTHREADS_DIR
  • CXX
  • MPICXX
  • MPIRUN

For example:

 make CEREAL_DIR=... HWLOC_DIR=... JEMALLOC_DIR=... QTHREADS_DIR=... CXX=c++ MPICXX=mpicxx MPIRUN=mpirun

I have installed FunHPC and all its dependencies on Wheeler (Caltech) into the directory /home/eschnett/src/spack-view . This includes a recent version of GCC that was used to build these libraries. If you want to use this, then I highly recommend using this version of GCC as well as all the other software installed in this directory (e.g. HDF5, PAPI, and many more) instead of combining these with system libraries.

As a side note, Roland Haas says that the Simfactory configuration for Wheeler is using this directory. This is not really relevant yet since we won't be using Cactus in the beginning.

Running FunHPC Applications

FunHPC is an MPI application, but we are not interested in using MPI today. We might still need to use mpirun, but only in a trivial way.

Qthreads etc. use environment variables to change certain settings. Some settings are necessary to prevent problems; these "problems" are usually resource exhaustion (e.g. not enough stack space), which Unix helpfully translates into "Segmentation fault". I usually set these environment variables:

 export QTHREAD_NUM_SHEPHERDS="${nshep}"
 export QTHREAD_NUM_WORKERS_PER_SHEPHERD="${nwork}"
 export QTHREAD_STACK_SIZE=8388608 # Byte 
 export QTHREAD_GUARD_PAGES=0      # 0, 1
 export QTHREAD_INFO=1

Here "nshep" is the number of sockets (aka NUMA nodes), and "nwork" the number of cores per socket. You can find these e.g. via "hwloc-info". On Wheeler:

 $ ~/src/spack-view/bin/hwloc-info
 depth 0:        1 Machine (type #1)
  depth 1:       2 NUMANode (type #2)
   depth 2:      2 Package (type #3)
    depth 3:     2 L3Cache (type #4)
     depth 4:    24 L2Cache (type #4)
      depth 5:   24 L1dCache (type #4)
       depth 6:  24 L1iCache (type #4)
        depth 7: 24 Core (type #5)
         depth 8:        24 PU (type #6)

Thus I choose "nshep=2" and "nwork=12" on Wheeler (2 NUMA nodes with 24 cores in total, i.e. 12 cores per node).

By default, Qthreads chooses a rather small stack size of 8 kByte per thread. If a thread uses more stack space than that, random memory will be overwritten. You can enable guard pages, which is good for debugging: this catches many cases where the stack overflows. Finally, Qthreads can produce info output at startup that might be helpful.

On Wheeler:

 ~eschnett/src/spack-view/bin/mpirun -np 1 -x QTHREAD_NUM_SHEPHERDS=2 -x QTHREAD_NUM_WORKERS_PER_SHEPHERD=12 -x QTHREAD_STACK_SIZE=1000000 ~eschnett/src/spack-view/bin/fibonacci

Loop Example

Let us look at a simple loop. We are going to parallelize it once with OpenMP, and once with FunHPC.

 #include <funhpc/async.hpp>
 #include <funhpc/main.hpp>
 
 #include <algorithm>
 #include <cassert>
 #include <vector>
 
 // Synchronize the ghost zones (the outermost points in each direction)
 void sync(double *y, int n) {
   // we just assume periodic boundaries
   assert(n >= 2);
   y[0] = y[n - 2];
   y[n - 1] = y[1];
 }
 
 // A basic loop, calculating a first derivative
 void vdiff(double *y, const double *x, int n) {
   for (int i = 1; i < n - 1; ++i) {
     y[i] = (x[i + 1] - x[i - 1]) / 2;
   }
   sync(y, n);
 }
 
 // The same loop, parallelized via OpenMP: The number of iterations is
 // split over the available number of cores (e.g. 12 on a NUMA node of
 // Wheeler). The disadvantages of this are:
 // - There is an implicit barrier at the end of the loop, so that the
 //   sync afterwards cannot overlap with the loop
 // - A single thread might handle too much work, overflowing the cache
 // - A single thread might not have enough work, so that the thread
 //   management overhead is too large
 void vdiff_openmp(double *y, const double *x, int n) {
 #pragma omp parallel for
   for (int i = 1; i < n - 1; ++i) {
     y[i] = (x[i + 1] - x[i - 1]) / 2;
   }
   sync(y, n);
 }
 
 // The same loop, this time parallelized via FunHPC. Each thread
 // handles a well-defined amount of work, to be chosen based on the
 // complexity of the loop kernel.
 // - Each thread runs until its result is needed. The interior of the
 //   domain can be calculated overlapping with the synchronization.
 // - If the number of iterations is small, only a single thread is
 //   used. Other cores are free to do other work, e.g. analysis or
 //   I/O.
 // - If the number of iterations is large, then many threads will be
 //   created. The threads will be executed in some arbitrary order.
 //   The cost of creating a thread is small (but not negligible) --
 //   there is no problem if thousands of threads are created.
 void vdiff_funhpc(double *y, const double *x, int n) {
   // number of points per thread (depending on architecture and cache size)
   // (the number here is much too small; this is just for testing)
   const int blocksize = 8;
 
   // loop over blocks, starting one thread for each
   std::vector<qthread::future<void>> fs;
   for (int i0 = 1; i0 < n - 1; i0 += blocksize) {
     fs.push_back(qthread::async(qthread::launch::async, [=]() {
 
       // loop over the work of a single thread
       const int imin = i0;
       const int imax = std::min(i0 + blocksize, n - 1);
       for (int i = imin; i < imax; ++i) {
         y[i] = (x[i + 1] - x[i - 1]) / 2;
       }
     }));
   }
 
   // synchronize as soon as the boundary results are available
   assert(!fs.empty());
   fs[0].wait();
   fs[fs.size() - 1].wait();
   sync(y, n);
 
   // wait for all threads to finish
   for (const auto &f : fs)
     f.wait();
 }
 
 int funhpc_main(int argc, char **argv) {
   const int n = 1000000;
   std::vector<double> x(n), y(n);
   vdiff(&y[0], &x[0], n);
   vdiff_openmp(&y[0], &x[0], n);
   vdiff_funhpc(&y[0], &x[0], n);
   return 0;
 }

To-Do

This is a wiki -- everybody should add missing items here

  • Maybe: Make FunHPC compile with Clang on Darwin
  • Maybe: Set up FunHPC on Bethe or Fermi (if Frank can't get access to Wheeler)
  • Add pointers to http://cppreference.com to wiki (for async, future)
  • Describe future, shared_future; async's launch:: options
  • Make sure all FunHPC examples run on Wheeler
  • If possible: look at weird performance numbers (350 ms vs. 3500 ms on Wheeler's head node); run on compute node instead?

Done:

  • Correct broken FunHPC grid self-test
  • Provide make wrapper for Wheeler
  • Describe Cereal patch
  • Add pointers to package web sites to build instructions
  • Put loop parallelization example onto wiki (and make it compile)
  • Announce next meeting (Wed Dec. 14, 12:00 EST)

Mini-Workshop #2: Wed, Dec 14, 2016, 12:00 EST

(By popular request, later in the day.)

Topic: FunHPC in Cactus

Venue: Google Hangouts https://hangouts.google.com/call/jjkffrrvmnbhrooiyjxhfeb2ume

Agenda:

  • FunHPC vs. OpenMP (recap from last week)
  • Overview of current proof-of-concept code
  • Look at current benchmark results
  • How to download, build, run, benchmark
  • Next steps

Source code

  • FunHPC and its dependencies, as described last week (see above)
  • FunHPC arrangement, basic thorns and examples (get it at https://bitbucket.org/eschnett/funhpc.cactus.git)
  • Cactus flesh, branch eschnett/funhpc (mostly for startup)
  • Carpet, branch eschnett/funhpc (mostly to disable OpenMP)
  • Kranc, branch eschnett/funhpc (to generate FunHPC-parallelized code)
  • McLachlan, branch eschnett/funhpc (contains new FunHPC-parallelized thorns with "_FH" suffix)

I recommend using Wheeler to build and run, as I've tested this.

To Do

  • Put example parameter file online
  • Check whether a vanilla checkout builds
  • Update benchmark results
  • Next meeting: Next Wed, 12:00 EST (tentatively); continue with FunHPC in Cactus

Mini-Workshop #3 at Perimeter: Dec 21-22, 2016

Roland Haas and Erik Schnetter met at Perimeter to design a parallelization scheme for a relativistic hydro code with FunHPC. Here is a rough description of the outcome of our discussions.

In principle, this follows the scheme outlined above:

  • the code is written in a mostly point-wise manner
  • we determined that the code accesses about 200 bytes of memory per grid point, and thus a per-thread block size of 8x4x4 should work well (8x4x4 = 128 points, i.e. roughly 25 kB per block)
  • each such block is scheduled as a separate FunHPC thread
  • each scheduled routine then loops over the blocks and starts the threads (see the sketch below)
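
A minimal sketch of this blocking scheme, written against the same qthread::async / qthread::future API as the loop example above (Mini-Workshop #1); the GridData type, the rhs_kernel function, and the block contents are hypothetical placeholders, not the actual Zelmani/Cactus interfaces:

 #include <funhpc/async.hpp>
 
 #include <algorithm>
 #include <vector>
 
 // Hypothetical stand-in for the actual grid variables
 struct GridData {
   int ni, nj, nk;                   // grid dimensions
   std::vector<double> rho, rhs_rho; // stand-ins for the hydro state vector
 };
 
 // Point-wise RHS kernel for one block (trivial placeholder)
 void rhs_kernel(GridData &g, int imin, int imax, int jmin, int jmax, int kmin,
                 int kmax) {
   for (int k = kmin; k < kmax; ++k)
     for (int j = jmin; j < jmax; ++j)
       for (int i = imin; i < imax; ++i) {
         const int idx = i + g.ni * (j + g.nj * k);
         g.rhs_rho[idx] = -g.rho[idx];
       }
 }
 
 // Loop over 8x4x4 blocks, scheduling each block as a separate FunHPC thread
 void schedule_rhs_blocks(GridData &g) {
   const int bi = 8, bj = 4, bk = 4; // per-thread block size
 
   std::vector<qthread::future<void>> fs;
   for (int k0 = 0; k0 < g.nk; k0 += bk)
     for (int j0 = 0; j0 < g.nj; j0 += bj)
       for (int i0 = 0; i0 < g.ni; i0 += bi)
         fs.push_back(qthread::async(qthread::launch::async, [=, &g]() {
           rhs_kernel(g, i0, std::min(i0 + bi, g.ni),
                      j0, std::min(j0 + bj, g.nj),
                      k0, std::min(k0 + bk, g.nk));
         }));
 
   // wait for all block threads to finish
   for (const auto &f : fs)
     f.wait();
 }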

To overlap communication and computation, one needs to ensure that communication and computation occur "at the same time", i.e. are scheduled next to each other. This means that the hydro RHS needs to be scheduled in MoL_PostStep, where the hydro state vector is synchronized. The details of how this can happen are slightly involved, since it also involves McLachlan setting the ADMBase variables, setting Tmunu, calling con2prim, etc., and this part has not been implemented yet.

The general design is:

  1. Schedule the RHS calculations for each block. The blocks that need boundary data are scheduled, but wait for a trigger. Interior blocks can start right away.
  2. Schedule the state vector synchronization.
  3. Toggle the trigger that allows the boundary block threads to start.
  4. Possibly schedule some other, independent routines.
  5. Wait for all RHS threads to finish.
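
A minimal sketch of this design, written with std:: futures for illustration (the real code would use the FunHPC/qthread equivalents); Block, calc_rhs, and sync_state_vector are hypothetical placeholders rather than the actual Cactus/Zelmani interfaces:

 #include <future>
 #include <vector>
 
 // Hypothetical block descriptor; the real code would carry block extents etc.
 struct Block {
   bool needs_boundary; // does this block read ghost-zone data?
 };
 
 void calc_rhs(const Block &) { /* point-wise RHS kernel for one block */ }
 void sync_state_vector() { /* ghost-zone exchange of the hydro state vector */ }
 
 void schedule_rhs(const std::vector<Block> &blocks) {
   std::promise<void> trigger;
   std::shared_future<void> boundary_ok = trigger.get_future().share();
 
   // 1. Schedule the RHS calculation for each block. Blocks that need boundary
   //    data wait for the trigger; interior blocks start right away.
   std::vector<std::future<void>> rhs;
   for (const Block &b : blocks)
     rhs.push_back(std::async(std::launch::async, [&b, boundary_ok]() {
       if (b.needs_boundary)
         boundary_ok.wait();
       calc_rhs(b);
     }));
 
   // 2. Synchronize the state vector (in Cactus this would be a scheduled
   //    SYNC); the interior blocks keep running on other cores meanwhile.
   sync_state_vector();
 
   // 3. Toggle the trigger so that the boundary block threads may start.
   trigger.set_value();
 
   // 4. Other, independent routines (analysis, I/O, ...) could run here.
 
   // 5. Wait for all RHS threads to finish.
   for (auto &f : rhs)
     f.wait();
 }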

As described above, this would not yet work, since the RHS requires con2prim to have run, and con2prim runs pointwise, so it needs to run after the state vector boundary conditions have been applied. Our current rough idea is to run con2prim only in the interior (and thus before the state vector boundary conditions are applied), and to synchronize the primitive variables together with the hydro state vector.

A snapshot of the hydro code is in the rhaas/hydrofuntoy branch of Zelmani (git clone -b rhaas/hydrofuntoy git@bitbucket.org:zelmani/zelmani.git).

The Zelmani snapshot at commit 138ef6bb is able to evolve static_tov without crashing.