Vectorisation

The goal of vectorisation is to improve single-core performance by making use of advanced CPU features such as vector and cache-management instructions. On a modern CPU, using vector instructions can theoretically improve performance by a factor of two or four (the number of elements per vector), and improving cache management can improve performance by up to an order of magnitude. In practice, the observed improvements are much smaller, but may still be significant.

In principle, it is the compiler's task to use these instructions. In practice, our code is often too complex for the compiler to analyse, and these optimisations therefore do not occur. Below, we take a two-level approach:

  • Create an API that hides machine-specific details
  • Use automated code generation that targets this API instead of C/C++ directly

Example

For example, Intel/AMD CPUs offer vectors consisting of two double-precision numbers; this datatype is called __m128d. To add two such vectors, one calls the built-in function _mm_add_pd(x,y) (which the compiler translates into a single machine instruction, avoiding an actual function call). To store such a vector to memory while bypassing the cache (useful if the value will not be needed again in the near future), one calls the built-in function _mm_stream_pd(p,x). However, these datatypes and built-in functions (if they exist at all) look very different on other architectures, e.g. on Power 7.
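
As a standalone illustration (not code from the toolkit), a minimal SSE2 program using these built-in functions might look as follows; the alignment attribute is GCC/Clang syntax, used here to satisfy the 16-byte alignment that the streaming store requires:

 #include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_add_pd, _mm_stream_pd */
 #include <stdio.h>
 
 int main(void)
 {
   __m128d x = _mm_set_pd(2.0, 1.0);    /* vector (1.0, 2.0); high element first */
   __m128d y = _mm_set_pd(20.0, 10.0);  /* vector (10.0, 20.0) */
   __m128d z = _mm_add_pd(x, y);        /* a single add instruction: (11.0, 22.0) */
 
   /* _mm_stream_pd requires a 16-byte aligned destination */
   double out[2] __attribute__((aligned(16)));
   _mm_stream_pd(out, z);               /* store, bypassing the cache */
 
   printf("%g %g\n", out[0], out[1]);
   return 0;
 }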

Our vectorisation API offers a hardware-independent abstraction for this. For the example above, this would be:

 #define CCTK_REAL8_VEC __m128d
 #define CCTK_REAL8_VEC_SIZE 2
 #define k8add(x,y) (_mm_add_pd(x,y))
 #define vec8_store_nta(p,x) (_mm_stream_pd(&(p),x))
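
Code written against these macros is independent of the actual architecture. Here is a schematic kernel (a sketch, not code taken from the thorn); it assumes a companion load macro vec8_load(p) from the same API, arrays whose length is a multiple of CCTK_REAL8_VEC_SIZE, and suitable alignment:

 /* Schematic kernel written against the vector API: u = a + b */
 void add_arrays(double *restrict u,
                 const double *restrict a,
                 const double *restrict b,
                 int n)
 {
   /* Process CCTK_REAL8_VEC_SIZE grid points per iteration */
   for (int i = 0; i < n; i += CCTK_REAL8_VEC_SIZE) {
     CCTK_REAL8_VEC av = vec8_load(a[i]);   /* assumed load macro */
     CCTK_REAL8_VEC bv = vec8_load(b[i]);
     /* k8add maps to one vector add; the store bypasses the cache */
     vec8_store_nta(u[i], k8add(av, bv));
   }
 }

On SSE2 this loop handles two grid points per iteration; recompiled for another architecture, the same source maps onto that architecture's instructions.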

The Code

The API

The API is defined in thorn LSUThorns/Vectors, and supports Intel/AMD SSE2 (available on all current x86 processors), Intel/AMD AVX (Intel's new Sandy Bridge architecture), Power 7 VSX, Blue Gene/P "Double Hummer", and – of course – a default pseudo-vector implementation that uses "vectors" of size 1 and only regular C/C++ operators.

Automated Code Generation

Kranc generates code for this API when one sets UseVectors -> True.

Configuring Cactus

By default, LSUThorns/Vectors uses the pseudo-vector implementation, which generates scalar code even if the source code has been vectorised. Vectorisation needs to be enabled via the configuration option VECTORISE = yes.

On systems with a 32kB instruction cache (i.e. currently everywhere except for AMD processors), one should add VECTORISE_INLINE = no to reduce code size.
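
For example, the relevant fragment of a Cactus configuration option list might read (machine-specific compiler and library settings omitted):

 VECTORISE = yes
 VECTORISE_INLINE = no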

Results

The aim of the project is to improve performance. Here are timing numbers for a sample parameter file evolving a Minkowski spacetime for 20 RK4 time steps on a 99^3 grid with ML_BSSN:

 System     CPU      Compiler  scalar [s]  vectorised [s]  speedup
 Bethe      Intel    Intel        338.926         287.248     1.18
 Bethe      Intel    Intel        284.610         265.771     1.07
 Blue Drop  Power 7  IBM          447.937         317.282     1.41
 Curry      Intel    GNU          448.354         436.726     1.03
 Curry      Intel    GNU          949.560         432.314     2.20
 Hopper     AMD      PGI          439.572             ---      ---
 Hopper     AMD      GNU          483.976         329.013     1.47
 Kraken     AMD      Intel        584.770         268.879     2.17
 Kraken     AMD      Intel            ---         272.993      ---
 Queen Bee  Intel    Intel            ---             ---      ---
 Ranger     AMD      Intel        576.062         408.303     1.41
 Surveyor   BG/P     IBM              ---             ---      ---

All times are in seconds; they measure the sum of the execution times of the ML_BSSN_RHS1 and ML_BSSN_RHS2 routines of ML_BSSN, using a single OpenMP thread and a single MPI process.