Remote Mini-Workshop Series
Quite a few interesting mini-projects are being undertaken at the moment. It is worthwhile to advertise these to the larger community to invite participation. In our weekly calls we decided that we should set aside a few hours or half a day for one of these. I now suggest that we turn this into a mini-series, where we pick from the list below until we run out of interest. Maybe this will keep us busy until Christmas.
We picked Wednesday 9:00 EST as the meeting time. We will (probably) meet on Google Hangouts; details TBA here.
- Spack: installing external package https://github.com/LLNL/spack [Erik]
- SimulationIO: a new file format that's easy to read https://github.com/eschnett/SimulationIO
- FunHPC (multi-threading with futures): overview https://bitbucket.org/eschnett/funhpc.cxx [Erik, Christian, Ian]
- FunHPC (multi-threading with futures): shoehorning this into Cactus [Erik, Christian, Ian]
- StencilOps: more efficient finite differencing stencils in Kranc [Ian]
- DG: Jonah's and my new DG formulation that can replace FD methods https://arxiv.org/abs/1604.00075 [Federico]
- The "distribute" script: testing the Einstein Toolkit on HPC systems
- Towards a Kranc implementation of a hydro formulation [Ian, Federico]
If you are interested in one of these topics, then add your name in square brackets after the topic.
If you are interested in presenting a topic yourself, then add a new item to the list.
Mini-Workshop #1: Wed, Dec 7, 2016, 9:00 EST
Topic: FunHPC (multi-threading with futures): overview https://bitbucket.org/eschnett/funhpc.cxx
Venue: Google Hangouts https://hangouts.google.com/call/jjkffrrvmnbhrooiyjxhfeb2ume
Agenda:
- FunHPC design overview
- Comparison to OpenMP
- CPU vs. memory performance
- Cache and multi-threading, loop tiling
- How to parallelize an application via FunHPC
- Building and installing
- Examples
- Benchmarks
Building and Installing
FunHPC is available on BitBucket https://bitbucket.org/eschnett/funhpc.cxx . It requires several other packages to be installed as well, namely
- Cereal: Serializing C++ objects http://uscilab.github.io/cereal
- hwloc: Determining the hardware (core, cache) layout http://www.open-mpi.org/projects/hwloc
- jemalloc: Fast multi-threaded memory manager (malloc replacement) http://www.canonware.com/jemalloc
- OpenMPI: FunHPC prefers this MPI library http://www.open-mpi.org
- Qthreads: Fine-grained multi-threading (providing a C interface) http://www.cs.sandia.gov/qthreads
To install FunHPC from scratch, you need to install these other libraries first, and then edit FunHPC's Makefile. Google Test is also required, but will be downloaded automatically. Apologies for this unprofessional setup. In the future, FunHPC should be converted to use CMake, and Google Test should be packaged as part of it.
The Cereal package requires a patch. This patch makes it distinguish between regular pointers and function pointers. Regular pointers cannot be serialized since it is unclear whether they are valid, and if so, how the target should be allocated or freed. Function pointers, however, can be serialized -- we assume they point to functions, which are constants, so that no memory management issues arise. You need to apply the following patch:
 --- old/include/cereal/types/common.hpp
 +++ new/include/cereal/types/common.hpp
 @@ -106,14 +106,16 @@
      t = reinterpret_cast<typename common_detail::is_enum<T>::type const &>( value );
    }
 
 +#ifndef CEREAL_ENABLE_RAW_POINTER_SERIALIZATION
    //! Serialization for raw pointers
    /*! This exists only to throw a static_assert to let users know we don't support raw pointers. */
    template <class Archive, class T> inline
    void CEREAL_SERIALIZE_FUNCTION_NAME( Archive &, T * & )
    {
      static_assert(cereal::traits::detail::delay_static_assert<T>::value,
        "Cereal does not support serializing raw pointers - please use a smart pointer");
    }
 +#endif
 
    //! Serialization for C style arrays
    template <class Archive, class T> inline
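The distinction between regular pointers and function pointers can be made at compile time with standard type traits. The following is a minimal sketch of that idea only; it is neither Cereal's nor FunHPC's actual code, and the names is_function_pointer, can_serialize_pointer, and add_one are hypothetical helpers introduced for illustration.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical trait (not part of Cereal): true only for pointers to functions.
template <typename T>
struct is_function_pointer
    : std::integral_constant<
          bool, std::is_pointer<T>::value &&
                    std::is_function<
                        typename std::remove_pointer<T>::type>::value> {};

// Illustrative overload for function pointers: these point to constant
// code, so no ownership or allocation questions arise.
template <typename T>
constexpr typename std::enable_if<is_function_pointer<T>::value, bool>::type
can_serialize_pointer(T) {
  return true;
}

// Illustrative overload for all other raw pointers: it is unclear whether
// the target is valid or who owns it, so this sketch rejects them.
template <typename T>
constexpr typename std::enable_if<
    std::is_pointer<T>::value && !is_function_pointer<T>::value, bool>::type
can_serialize_pointer(T) {
  return false;
}

// A small function whose address serves as an example function pointer.
inline int add_one(int x) { return x + 1; }
```

With these definitions, can_serialize_pointer(&add_one) yields true, while passing the address of an int variable yields false. A real serializer would still have to decide how to encode the function's address; this sketch does not address that.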
When you run "make", you need to pass certain variables:
- CEREAL_DIR (has to be set in the Makefile)
- HWLOC_DIR
- JEMALLOC_DIR
- QTHREADS_DIR
- CXX
- MPICXX
- MPIRUN
For example:
make CEREAL_DIR=... HWLOC_DIR=... JEMALLOC_DIR=... QTHREADS_DIR=... CXX=c++ MPICXX=mpicxx MPIRUN=mpirun
I have installed FunHPC and all its dependencies on Wheeler (Caltech) into the directory /home/eschnett/src/spack-view . This includes a recent version of GCC that was used to build these libraries. If you want to use this, then I highly recommend using this version of GCC as well as all the other software installed in this directory (e.g. HDF5, PAPI, and many more) instead of combining these with system libraries.
As a side note, Roland Haas says that the Simfactory configuration for Wheeler is using this directory. This is not really relevant yet since we won't be using Cactus in the beginning.
Running FunHPC Applications
FunHPC is an MPI application, but we are not interested in using MPI today. We might still need to use mpirun, but only in a trivial way.
Qthreads and the other libraries use environment variables to change certain settings. Some settings are necessary to prevent problems. These "problems" are usually resource exhaustion (e.g. not enough stack space), all of which Unix helpfully reports as "Segmentation fault". I usually set these environment variables:
 export QTHREAD_NUM_SHEPHERDS="${nshep}"
 export QTHREAD_NUM_WORKERS_PER_SHEPHERD="${nwork}"
 export QTHREAD_STACK_SIZE=8388608 # bytes (8 MiB)
 export QTHREAD_GUARD_PAGES=0      # 0, 1
 export QTHREAD_INFO=1
Here "nshep" is the number of sockets (aka NUMA nodes), and "nwork" the number of cores per socket. You can find these e.g. via "hwloc-info". On Wheeler:
 $ ~/src/spack-view/bin/hwloc-info
 depth 0:        1 Machine (type #1)
  depth 1:       2 NUMANode (type #2)
   depth 2:      2 Package (type #3)
    depth 3:     2 L3Cache (type #4)
     depth 4:    24 L2Cache (type #4)
      depth 5:   24 L1dCache (type #4)
       depth 6:  24 L1iCache (type #4)
        depth 7: 24 Core (type #5)
         depth 8:        24 PU (type #6)
Thus I choose "nshep=2" and "nwork=12" on Wheeler.
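The arithmetic behind this choice can be written down as a small helper. This is only a sketch under the assumption that cores are distributed evenly over the NUMA nodes; the names QthreadLayout, choose_layout, and export_lines are hypothetical and not part of Qthreads or FunHPC.

```cpp
#include <cassert>
#include <string>

// Hypothetical container for the two values called "nshep" and "nwork" above.
struct QthreadLayout {
  int num_shepherds;        // one shepherd per NUMA node
  int workers_per_shepherd; // one worker per core of that NUMA node
};

// Derive the layout from the NUMANode and Core counts reported by hwloc-info.
// Assumes the cores divide evenly among the NUMA nodes.
inline QthreadLayout choose_layout(int numa_nodes, int cores) {
  assert(numa_nodes > 0 && cores >= numa_nodes && cores % numa_nodes == 0);
  return QthreadLayout{numa_nodes, cores / numa_nodes};
}

// Format the corresponding shell settings.
inline std::string export_lines(const QthreadLayout &layout) {
  return "export QTHREAD_NUM_SHEPHERDS=" +
         std::to_string(layout.num_shepherds) + "\n" +
         "export QTHREAD_NUM_WORKERS_PER_SHEPHERD=" +
         std::to_string(layout.workers_per_shepherd) + "\n";
}
```

For the Wheeler numbers shown above (2 NUMA nodes, 24 cores), choose_layout(2, 24) gives 2 shepherds with 12 workers each.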
By default, Qthreads chooses a rather small stack size of 8 kByte per thread. If a thread uses more stack space, random memory will be overwritten. You can enable guard pages (QTHREAD_GUARD_PAGES=1), which is good for debugging since it catches many cases where the stack overflows. Finally, Qthreads can produce info output at startup (QTHREAD_INFO=1) that might be helpful.
On Wheeler:
~eschnett/src/spack-view/bin/mpirun -np 1 -x QTHREAD_NUM_SHEPHERDS=2 -x QTHREAD_NUM_WORKERS_PER_SHEPHERD=12 -x QTHREAD_STACK_SIZE=1000000 ~eschnett/src/spack-view/bin/fibonacci
Loop Example
Let us look at a simple loop. We are going to parallelize it once with OpenMP, and once with FunHPC.
 #include <funhpc/async.hpp>
 #include <funhpc/main.hpp>
 
 #include <algorithm>
 #include <cassert>
 #include <vector>
 
 // Synchronize the ghost zones (the outermost points in each direction)
 void sync(double *y, int n) {
   // we just assume periodic boundaries
   assert(n >= 2);
   y[0] = y[n - 2];
   y[n - 1] = y[1];
 }
 
 // A basic loop, calculating a first derivative
 void vdiff(double *y, const double *x, int n) {
   for (int i = 1; i < n - 1; ++i) {
     y[i] = (x[i + 1] - x[i - 1]) / 2;
   }
   sync(y, n);
 }
 
 // The same loop, parallelized via OpenMP: The number of iterations is
 // split over the available number of cores (e.g. 12 on a NUMA node of
 // Wheeler). The disadvantages of this are:
 // - There is an implicit barrier at the end of the loop, so that the
 //   sync afterwards cannot overlap with the loop
 // - A single thread might handle too much work, overflowing the cache
 // - A single thread might not have enough work, so that the thread
 //   management overhead is too large
 void vdiff_openmp(double *y, const double *x, int n) {
 #pragma omp parallel for
   for (int i = 1; i < n - 1; ++i) {
     y[i] = (x[i + 1] - x[i - 1]) / 2;
   }
   sync(y, n);
 }
 
 // The same loop, this time parallelized via FunHPC. Each thread
 // handles a well-defined amount of work, to be chosen based on the
 // complexity of the loop kernel.
 // - Each thread runs until its result is needed. The interior of the
 //   domain can be calculated overlapping with the synchronization.
 // - If the number of iterations is small, only a single thread is
 //   used. Other cores are free to do other work, e.g. analysis or
 //   I/O.
 // - If the number of iterations is large, then many threads will be
 //   created. The threads will be executed in some arbitrary order.
 //   The cost of creating a thread is small (but not negligible) --
 //   there is no problem if thousands of threads are created.
 void vdiff_funhpc(double *y, const double *x, int n) {
   // number of points per thread (depending on architecture and cache size)
   // (the number here is much too small; this is just for testing)
   const int blocksize = 8;
 
   // loop over blocks, starting one thread for each
   std::vector<qthread::future<void>> fs;
   for (int i0 = 1; i0 < n - 1; i0 += blocksize) {
     fs.push_back(qthread::async(qthread::launch::async, [=]() {
 
       // loop over the work of a single thread
       const int imin = i0;
       const int imax = std::min(i0 + blocksize, n - 1);
       for (int i = imin; i < imax; ++i) {
         y[i] = (x[i + 1] - x[i - 1]) / 2;
       }
     }));
   }
 
   // synchronize as soon as the boundary results are available
   assert(!fs.empty());
   fs[0].wait();
   fs[fs.size() - 1].wait();
   sync(y, n);
 
   // wait for all threads to finish
   for (const auto &f : fs)
     f.wait();
 }
 
 int funhpc_main(int argc, char **argv) {
   const int n = 1000000;
   std::vector<double> x(n), y(n);
   vdiff(&y[0], &x[0], n);
   vdiff_openmp(&y[0], &x[0], n);
   vdiff_funhpc(&y[0], &x[0], n);
   return 0;
 }
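For readers without a FunHPC installation, the same blocking pattern can be tried with the C++ standard library alone. The following is a sketch, not the FunHPC implementation: it swaps qthread::async and qthread::future for std::async and std::future, and the namespace sketch, the function names, and the blocksize parameter are introduced here for illustration. std::async with std::launch::async typically starts one operating-system thread per call, which is much more expensive than a Qthreads thread, so the block size should be chosen far larger than in the FunHPC version.

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <vector>

namespace sketch {

// Periodic ghost-zone synchronization, as in the example above
inline void sync_ghosts(double *y, int n) {
  assert(n >= 2);
  y[0] = y[n - 2];
  y[n - 1] = y[1];
}

// Serial reference version
inline void vdiff_serial(double *y, const double *x, int n) {
  for (int i = 1; i < n - 1; ++i) {
    y[i] = (x[i + 1] - x[i - 1]) / 2;
  }
  sync_ghosts(y, n);
}

// Blocked version using std::async: one task per block of points
inline void vdiff_std_async(double *y, const double *x, int n, int blocksize) {
  assert(n >= 3 && blocksize > 0);

  // loop over blocks, starting one task for each
  std::vector<std::future<void>> fs;
  for (int i0 = 1; i0 < n - 1; i0 += blocksize) {
    fs.push_back(std::async(std::launch::async, [=]() {
      // loop over the work of a single task
      const int imax = std::min(i0 + blocksize, n - 1);
      for (int i = i0; i < imax; ++i) {
        y[i] = (x[i + 1] - x[i - 1]) / 2;
      }
    }));
  }

  // the first and last blocks hold the points the ghost zones depend on
  fs.front().wait();
  fs.back().wait();
  sync_ghosts(y, n);

  // wait for all remaining tasks to finish
  for (const auto &f : fs) {
    f.wait();
  }
}

} // namespace sketch
```

Calling sketch::vdiff_std_async(y, x, n, blocksize) on the same input as sketch::vdiff_serial should produce identical results, since each point is computed with the same arithmetic and no two tasks write to the same element.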
To-Do
This is a wiki -- everybody should add missing items here
- Put loop parallelization example onto wiki (and make it compile)
- Maybe: Make FunHPC compile with Clang on Darwin
- Announce next meeting (Wed Dec. 14, 12:00 EST)
- Maybe: Set up FunHPC on Bethe or Fermi (if Frank can't get access to Wheeler)
- Add pointers to http://cppreference.com to wiki (for async, future)
- Describe future, shared_future; async's launch:: options
- Make sure all FunHPC examples run on Wheeler
- If possible: look at weird performance numbers (350 ms vs. 3500 ms on Wheeler's head node); run on compute node instead?
Done:
- Correct broken FunHPC grid self-test
- Provide make wrapper for Wheeler
- Describe Cereal patch
- Add pointers to package web sites to build instructions
