Difference between revisions of "Vectorisation"

From Einstein Toolkit Documentation

Revision as of 15:46, 7 April 2011

The goal of vectorisation is to improve single-core performance by making use of advanced CPU features, such as vector or cache-management instructions. On a modern CPU, vector instructions can theoretically improve performance by a factor of two or four, and better cache management can improve performance by an order of magnitude. In practice, the observed improvements are much smaller, but may still be significant.

In principle, it is the compiler's task to use these instructions. In practice, our code is often too complex for the compiler to understand, and these optimisations therefore do not occur. Below, we take a two-level approach:

  • Create an API that hides machine-specific details
  • Use automated code generation that targets this API instead of C/C++ directly

Example

For example, Intel/AMD CPUs offer vectors consisting of two double-precision numbers; this datatype is called __m128d. To add two such vectors, one calls the builtin function _mm_add_pd(x,y) (which the compiler translates into a single machine instruction, avoiding the function call). To store such a vector to memory while bypassing the cache (useful if the value won't be needed again in the near future), one calls the builtin function _mm_stream_pd(p,x). However, these data types and builtin functions (if they exist) look very different on other architectures, e.g. on Power 7.

Our vectorisation API offers a hardware-independent abstraction for this. For the example above, this would be:

 #define CCTK_REAL8_VEC __m128d
 #define CCTK_REAL8_VEC_SIZE 2
 #define k8add(x,y) (_mm_add_pd(x,y))
 #define vec8_store_nta(p,x) (_mm_stream_pd(&(p),x))
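
As a rough illustration of how a kernel might be written against these four definitions, the sketch below adds two arrays element-wise. It is not taken from the thorn itself. The load macro vec8_load, the function add_arrays, and the scalar fallback branch are assumptions introduced here so that the example is complete and compiles on machines without SSE2. The arrays are assumed to be 16-byte aligned and their length a multiple of CCTK_REAL8_VEC_SIZE.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>
/* The four definitions from the text above. */
#define CCTK_REAL8_VEC __m128d
#define CCTK_REAL8_VEC_SIZE 2
#define k8add(x,y) (_mm_add_pd(x,y))
#define vec8_store_nta(p,x) (_mm_stream_pd(&(p),x))
/* Assumed helper (not from the text): aligned load of one vector. */
#define vec8_load(p) (_mm_load_pd(&(p)))
#else
/* Assumed scalar fallback: "vectors" of size 1, plain C operators. */
#define CCTK_REAL8_VEC double
#define CCTK_REAL8_VEC_SIZE 1
#define k8add(x,y) ((x)+(y))
#define vec8_store_nta(p,x) ((p)=(x))
#define vec8_load(p) (p)
#endif

/* Hypothetical kernel: c[i] = a[i] + b[i] for 0 <= i < n.
   Requires 16-byte-aligned arrays and n divisible by the vector size. */
void add_arrays(const double *a, const double *b, double *c, size_t n)
{
  assert(n % CCTK_REAL8_VEC_SIZE == 0);
  for (size_t i = 0; i < n; i += CCTK_REAL8_VEC_SIZE) {
    CCTK_REAL8_VEC x = vec8_load(a[i]);
    CCTK_REAL8_VEC y = vec8_load(b[i]);
    /* Add one vector's worth of elements and store it, bypassing the
       cache in the SSE2 branch. */
    vec8_store_nta(c[i], k8add(x, y));
  }
}
```

The loop body mentions no SSE2 names, so retargeting it to another architecture would only require a different set of macro definitions.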

The Code

The API

The API is defined in thorn LSUThorns/Vectors, and supports Intel/AMD SSE2 (available on all Intel/AMD processors sold today), Intel/AMD AVX (Intel's new Sandy Bridge architecture), Power 7 VSX, Blue Gene/P "Double Hummer", and, of course, a pseudo-vector default implementation using "vectors" of size 1 and only regular C/C++ operators.

Automated Code Generation

Kranc generates code for this API when one sets UseVectors -> True.