Remote Mini-Workshop Series
Quite a few interesting mini-projects are being undertaken at the moment. It is worthwhile to advertise these to the larger community to invite participation. In our weekly calls we decided that we should set aside a few hours or half a day for one of these. I now suggest that we turn this into a mini-series, where we pick from the list below until we run out of interest. Maybe this will keep us busy until Christmas.
We picked Wednesday, 9:00 EST, as the meeting time. We'll meet on Google Hangouts (probably); details TBA here.
- Spack: installing external packages https://github.com/LLNL/spack [Erik]
- SimulationIO: a new file format that's easy to read https://github.com/eschnett/SimulationIO
- FunHPC (multi-threading with futures): overview https://bitbucket.org/eschnett/funhpc.cxx [Erik, Christian, Ian]
- FunHPC (multi-threading with futures): shoehorning this into Cactus [Erik, Christian, Ian]
- StencilOps: more efficient finite differencing stencils in Kranc [Ian]
- DG: Jonah's and my new DG formulation that can replace FD methods https://arxiv.org/abs/1604.00075 [Federico]
- The "distribute" script: testing the Einstein Toolkit on HPC systems
- Towards a Kranc implementation of a hydro formulation [Ian, Federico]
If you are interested in one of these topics, then add your name in square brackets after the topic.
If you are interested in presenting a topic yourself, then add a new item to the list.
Mini-Workshop #1: Wed, Dec 7, 2016, 9:00 EST
Topic: FunHPC (multi-threading with futures): overview https://bitbucket.org/eschnett/funhpc.cxx
Agenda:
- FunHPC design overview
- Comparison to OpenMP
- CPU vs. memory performance
- Cache and multi-threading, loop tiling
- How to parallelize an application via FunHPC
- Building and installing
- Examples
- Benchmarks
Building and Installing
FunHPC is available on BitBucket https://bitbucket.org/eschnett/funhpc.cxx . It requires several other packages to be installed as well, namely
- cereal: Serializing C++ objects
- hwloc: Determining the hardware (core, cache) layout
- jemalloc: Fast multi-threaded memory manager (malloc replacement)
- OpenMPI: FunHPC prefers this MPI library
- Qthreads: Fine-grained multi-threading (providing a C interface)
To install FunHPC from scratch, you need to install these other libraries first, and then edit FunHPC's Makefile. Google Test is also required, but will be downloaded automatically. Apologies for this unprofessional setup. In the future, FunHPC should be converted to use CMake, and Google Test should be packaged as part of it.
I have installed FunHPC and all its dependencies on Wheeler (Caltech) into the directory /home/eschnett/src/spack-view . This includes a recent version of GCC that was used to build these libraries. If you want to use this, then I highly recommend using this version of GCC as well as all the other software installed in this directory (e.g. HDF5, PAPI, and many more) instead of combining these with system libraries.
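One way to pick up this software stack in a shell session is to put the directory's bin and lib subdirectories at the front of the search paths. This is a minimal sketch, not necessarily how the installation is meant to be used: the helper name use_view is hypothetical, and the lib subdirectory is an assumption based on the usual layout of a Spack view.

```shell
# Hypothetical helper: prepend a Spack view's bin/ and lib/ directories to the
# executable and shared-library search paths of the current shell.
# Assumption: the view has the usual bin/ and lib/ subdirectories.
use_view() {
    view=$1
    PATH="$view/bin${PATH:+:$PATH}"
    LD_LIBRARY_PATH="$view/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export PATH LD_LIBRARY_PATH
}

# Example call on Wheeler (left commented out so the snippet is safe to source):
# use_view /home/eschnett/src/spack-view
```

After calling it, "which gcc" should report the compiler from this directory rather than the system one. If the libraries were built with RPATHs, the LD_LIBRARY_PATH part may be unnecessary.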
As a side note, Roland Haas says that the Simfactory configuration for Wheeler is using this directory. This is not really relevant yet since we won't be using Cactus in the beginning.
Running FunHPC Applications
FunHPC is an MPI application, but we are not interested in using MPI today. We might still need to use mpirun, but only in a trivial way.
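Using mpirun "in a trivial way" could mean starting exactly one MPI process. The sketch below wraps this in a small function. The name run_single and the MPIRUN override variable are hypothetical, and ./hello stands in for whatever FunHPC program you built. The only launcher option used is "-np 1" (number of processes).

```shell
# Hypothetical wrapper: start a program as a single MPI process.
# MPIRUN may be set to choose a different launcher; it defaults to "mpirun".
run_single() {
    "${MPIRUN:-mpirun}" -np 1 "$@"
}

# Example (./hello is a placeholder program name):
# run_single ./hello
```

All multi-threading then happens inside this one process, controlled by the Qthreads settings described next.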
Qthreads etc. use environment variables to change certain settings. Some settings are necessary to prevent problems. These "problems" are usually resource exhaustion (e.g. not enough stack space), all of which Unix helpfully translates into "Segmentation fault". I usually set these environment variables:
    export QTHREAD_NUM_SHEPHERDS="${nshep}"
    export QTHREAD_NUM_WORKERS_PER_SHEPHERD="${nwork}"
    export QTHREAD_STACK_SIZE=8388608   # bytes
    export QTHREAD_GUARD_PAGES=0        # 0, 1
    export QTHREAD_INFO=1
    export FUNHPC_MAIN_EVERYWHERE=1
Here "nshep" is the number of sockets (aka NUMA nodes), and "nwork" the number of cores per socket. You can find these e.g. via "hwloc-info". On Wheeler:
    $ ~/src/spack-view/bin/hwloc-info
    depth 0: 1 Machine (type #1)
    depth 1: 2 NUMANode (type #2)
    depth 2: 2 Package (type #3)
    depth 3: 2 L3Cache (type #4)
    depth 4: 24 L2Cache (type #4)
    depth 5: 24 L1dCache (type #4)
    depth 6: 24 L1iCache (type #4)
    depth 7: 24 Core (type #5)
    depth 8: 24 PU (type #6)
Thus I choose "nshep=2" and "nwork=12" on Wheeler.
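The same choice can be derived mechanically from the hwloc-info summary. The following is a rough sketch under stated assumptions: the function name shepherd_layout is hypothetical, it assumes each summary line contains a count immediately followed by the object type (as in the output above), and it assumes the cores are evenly divided among the NUMA nodes.

```shell
# Hypothetical helper: read the `hwloc-info` summary on stdin and print
# "nshep nwork", i.e. the number of NUMA nodes and the cores per NUMA node.
# Exits with a non-zero status if either count is missing or the cores do not
# divide evenly among the NUMA nodes.
shepherd_layout() {
    awk '
        {
            # The count is the field directly before the object type name.
            for (i = 2; i <= NF; i++) {
                if ($i == "NUMANode") numa = $(i - 1)
                if ($i == "Core")     core = $(i - 1)
            }
        }
        END {
            if (numa > 0 && core > 0 && core % numa == 0)
                print numa, core / numa
            else
                exit 1
        }
    '
}

# Example use (commented out because it needs hwloc-info on the PATH):
# set -- $(hwloc-info | shepherd_layout)
# nshep=$1 nwork=$2
```

For the Wheeler output above this prints "2 12", matching the values chosen by hand.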