Vectorisation

The goal of vectorisation is to improve single-core performance by making use of advanced CPU features such as vector and cache-management instructions. On a modern CPU, using vector instructions can theoretically improve performance by a factor of two or four (the number of elements per vector), and improving cache management can improve performance by up to an order of magnitude. In practice, the observed improvements are much smaller, but may still be significant.

In principle, it is the compiler's task to use these instructions. In practice, our code is often too complex for the compiler to analyse, and these optimisations therefore do not occur. Below, we take a two-level approach:

  • Create an API that hides machine-specific details
  • Use automated code generation that targets this API instead of C/C++ directly

Example

For example, Intel/AMD CPUs offer vectors consisting of two double-precision numbers; this datatype is called __m128d. To add two such vectors, one calls the built-in function _mm_add_pd(x,y) (which the compiler translates into a single machine instruction, avoiding an actual function call). To store such a vector to memory while bypassing the cache (useful if the value will not be needed again in the near future), one calls the built-in function _mm_stream_pd(p,x). However, these datatypes and built-in functions (if they exist at all) look very different on other architectures, e.g. on Power 7.
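
As a standalone illustration (not code from the toolkit), a minimal SSE2 program using these built-in functions might look as follows; the alignment attribute is GCC/Clang syntax, used here to satisfy the 16-byte alignment that the streaming store requires:

 #include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_add_pd, _mm_stream_pd */
 #include <stdio.h>
 
 int main(void)
 {
   __m128d x = _mm_set_pd(2.0, 1.0);    /* vector (1.0, 2.0); high element first */
   __m128d y = _mm_set_pd(20.0, 10.0);  /* vector (10.0, 20.0) */
   __m128d z = _mm_add_pd(x, y);        /* a single add instruction: (11.0, 22.0) */
 
   /* _mm_stream_pd requires a 16-byte aligned destination */
   double out[2] __attribute__((aligned(16)));
   _mm_stream_pd(out, z);               /* store, bypassing the cache */
 
   printf("%g %g\n", out[0], out[1]);
   return 0;
 }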

Our vectorisation API offers a hardware-independent abstraction for this. For the example above, this would be:

 #define CCTK_REAL8_VEC __m128d
 #define CCTK_REAL8_VEC_SIZE 2
 #define k8add(x,y) (_mm_add_pd(x,y))
 #define vec8_store_nta(p,x) (_mm_stream_pd(&(p),x))
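
Code written against these macros is independent of the actual architecture. Here is a schematic kernel (a sketch, not code taken from the thorn); it assumes a companion load macro vec8_load(p) from the same API, arrays whose length is a multiple of CCTK_REAL8_VEC_SIZE, and suitable alignment:

 /* Schematic kernel written against the vector API: u = a + b */
 void add_arrays(double *restrict u,
                 const double *restrict a,
                 const double *restrict b,
                 int n)
 {
   /* Process CCTK_REAL8_VEC_SIZE grid points per iteration */
   for (int i = 0; i < n; i += CCTK_REAL8_VEC_SIZE) {
     CCTK_REAL8_VEC av = vec8_load(a[i]);   /* assumed load macro */
     CCTK_REAL8_VEC bv = vec8_load(b[i]);
     /* k8add maps to one vector add; the store bypasses the cache */
     vec8_store_nta(u[i], k8add(av, bv));
   }
 }

On SSE2 this loop handles two grid points per iteration; recompiled for another architecture, the same source maps onto that architecture's instructions.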

The Code

The API

The API is defined in thorn LSUThorns/Vectors, and supports Intel/AMD SSE2 (available on all current x86 processors), Intel/AMD AVX (Intel's new Sandy Bridge architecture), Power 7 VSX, Blue Gene/P "Double Hummer", and – of course – a default pseudo-vector implementation that uses "vectors" of size 1 and only regular C/C++ operators.

Automated Code Generation

Kranc generates code for this API when one sets UseVectors -> True.

Configuring Cactus

By default, LSUThorns/Vectors uses the pseudo-vector implementation, which generates scalar code even if the source code has been vectorised. Vectorisation needs to be enabled via the configuration option VECTORISE = yes.

On systems with a 32kB instruction cache (i.e. currently everywhere except for AMD processors), one should add VECTORISE_INLINE = no to reduce code size.
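
For example, the relevant fragment of a Cactus configuration option list might read (machine-specific compiler and library settings omitted):

 VECTORISE = yes
 VECTORISE_INLINE = no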

Results

The aim of the project is to improve performance. Here are timing numbers for a sample parameter file evolving a Minkowski spacetime for 20 RK4 time steps on a 99^3 grid with ML_BSSN:

 System     CPU      Compiler  scalar [s]  vectorised [s]  speedup
 Bethe      Intel    Intel        338.926         287.248     1.18
 Bethe      Intel    Intel        284.610         265.771     1.07
 Blue Drop  Power 7  IBM          447.937         317.282     1.41
 Curry      Intel    GNU          448.354         436.726     1.03
 Curry      Intel    GNU          949.560         432.314     2.20
 Hopper     AMD      PGI          439.572             ---      ---
 Hopper     AMD      GNU          483.976         329.013     1.47
 Kraken     AMD      Intel        584.770         268.879     2.17
 Kraken     AMD      Intel            ---         272.993      ---
 Queen Bee  Intel    Intel            ---             ---      ---
 Ranger     AMD      Intel        576.062         408.303     1.41
 Surveyor   BG/P     IBM              ---             ---      ---

All times are in seconds; they measure the sum of the execution times of the ML_BSSN_RHS1 and ML_BSSN_RHS2 routines of ML_BSSN, using a single OpenMP thread and a single MPI process.