Difference between revisions of "Vectorisation"
Revision as of 02:21, 8 April 2011
The goal of vectorisation is to improve single-core performance by making use of advanced CPU features, such as vector and cache-management instructions. On a modern CPU, vector instructions can theoretically improve performance by a factor of two or four (the number of double-precision values that fit into one vector), and better cache management can improve performance by an order of magnitude. In practice, the observed improvements are much smaller, but may still be significant.
In principle, it is the compiler's task to use these instructions. In practice, our code is often too complex for the compiler to understand, and these optimisations therefore do not occur. Below, we take a two-level approach:
- Create an API that hides machine-specific details
- Use automated code generation that targets this API instead of C/C++ directly
Example
For example, Intel/AMD CPUs offer vectors consisting of two double-precision numbers; this datatype is called __m128d. To add two such vectors, one calls the builtin function _mm_add_pd(x,y) (which the compiler translates into a single machine instruction, avoiding the function call). To store such a vector to memory while bypassing the cache (if we won't need the value again in the near future), one calls the builtin function _mm_stream_pd(p,x). However, these data types and builtin functions (if they exist) look very different on other architectures, e.g. on Power 7.
Our vectorisation API offers a hardware-independent abstraction for this. For the example above, this would be:
    #define CCTK_REAL8_VEC __m128d
    #define CCTK_REAL8_VEC_SIZE 2
    #define k8add(x,y) (_mm_add_pd(x,y))
    #define vec8_store_nta(p,x) (_mm_stream_pd(&(p),x))
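As a rough illustration of how a loop might be written against these definitions, the following sketch adds two arrays element-wise. It is not taken from the thorn: the macro vec8_load and the functions add_arrays and add_arrays_selfcheck are assumptions introduced here for the example, and the plain loop in the #else branch exists only so that the sketch still compiles on machines without SSE2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>

/* The four definitions quoted in the text (SSE2 variant). */
#define CCTK_REAL8_VEC __m128d
#define CCTK_REAL8_VEC_SIZE 2
#define k8add(x, y) (_mm_add_pd(x, y))
#define vec8_store_nta(p, x) (_mm_stream_pd(&(p), x))

/* Assumption: an aligned-load macro; it is not quoted in the text. */
#define vec8_load(p) (_mm_load_pd(&(p)))
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n). */
void add_arrays(double *out, const double *a, const double *b, size_t n)
{
#if defined(__SSE2__)
    /* Aligned loads and streaming stores need 16-byte aligned addresses. */
    assert((uintptr_t)out % 16 == 0);
    assert((uintptr_t)a % 16 == 0);
    assert((uintptr_t)b % 16 == 0);
    /* This sketch has no remainder loop, so n must be a multiple of 2. */
    assert(n % CCTK_REAL8_VEC_SIZE == 0);

    for (size_t i = 0; i < n; i += CCTK_REAL8_VEC_SIZE) {
        CCTK_REAL8_VEC x = vec8_load(a[i]);
        CCTK_REAL8_VEC y = vec8_load(b[i]);
        vec8_store_nta(out[i], k8add(x, y)); /* bypasses the cache */
    }
    _mm_sfence(); /* order the streaming stores before later stores */
#else
    /* Plain scalar loop for machines without SSE2. */
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
#endif
}

/* Hypothetical self-check on aligned arrays; returns 1 on success. */
int add_arrays_selfcheck(void)
{
    _Alignas(16) double a[4] = {1.0, 2.0, 3.0, 4.0};
    _Alignas(16) double b[4] = {10.0, 20.0, 30.0, 40.0};
    _Alignas(16) double out[4] = {0.0, 0.0, 0.0, 0.0};

    add_arrays(out, a, b, 4);
    return out[0] == 11.0 && out[1] == 22.0 &&
           out[2] == 33.0 && out[3] == 44.0;
}
```

Note that the macros take an lvalue such as a[i] rather than a pointer, which is why the definition of vec8_store_nta applies & to its first argument.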
The Code
The API
The API is defined in thorn LSUThorns/Vectors, and supports Intel/AMD SSE2 (all processors available today), Intel/AMD AVX (Intel's new Sandy Bridge architecture), Power 7 VSX, Blue Gene/P "Double Hummer", and, of course, a pseudo-vector default implementation using "vectors" of size 1, using only regular C/C++ operators.
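To give an idea of what a pseudo-vector implementation with "vectors" of size 1 could look like, here is a minimal sketch built from regular C operators only. The definitions below are written for this example and are not copied from the thorn; in particular k8mul, vec8_load, vec8_set1 and the kernel axpy are assumed names.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed size-1 definitions: a "vector" is just one double. */
#define CCTK_REAL8_VEC double
#define CCTK_REAL8_VEC_SIZE 1
#define k8add(x, y) ((x) + (y))
#define k8mul(x, y) ((x) * (y))          /* assumption: multiply */
#define vec8_set1(a) (a)                 /* assumption: broadcast a scalar */
#define vec8_load(p) (p)                 /* assumption: load from an lvalue */
#define vec8_store_nta(p, x) ((p) = (x)) /* ordinary store, no cache bypass */

/* Hypothetical kernel written once against the API:
   y[i] = alpha * x[i] + y[i] for i in [0, n). */
void axpy(double *y, const double *x, double alpha, size_t n)
{
    /* Trivially true for size 1; kept so the loop stays valid
       if the definitions above are swapped for wider vectors. */
    assert(n % CCTK_REAL8_VEC_SIZE == 0);

    const CCTK_REAL8_VEC valpha = vec8_set1(alpha);
    for (size_t i = 0; i < n; i += CCTK_REAL8_VEC_SIZE) {
        CCTK_REAL8_VEC xi = vec8_load(x[i]);
        CCTK_REAL8_VEC yi = vec8_load(y[i]);
        vec8_store_nta(y[i], k8add(k8mul(valpha, xi), yi));
    }
}
```

With these definitions the preprocessor turns the loop body into an ordinary scalar expression, so the same kernel source can be compiled with or without hardware vector support.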
Automated Code Generation
Kranc generates code for this API when one sets UseVectors -> True.
Configuring Cactus
By default, LSUThorns/Vectors uses the pseudo-vector implementation, which generates scalar code (even if the source code has been vectorised). Vectorisation needs to be enabled via the configuration option VECTORISE = yes.
On systems with a 32 kB instruction cache (i.e. currently everywhere except for AMD processors), one should add VECTORISE_INLINE = yes to reduce code size.
Results
The aim of the project is to improve performance. Here are some timing numbers for a sample parameter file evolving a Minkowski spacetime for 20 RK4 time steps in a 99^3 box with ML_BSSN:
(Note: Help with improving the layout of this table is appreciated.)
System | CPU | Compiler | scalar (s) | vectorised (s) | speedup |
---|---|---|---|---|---|
Bethe | Intel | Intel | 338.926 | 287.248 | 1.18 |
Bethe | Intel | Intel | 284.610 | 265.771 | 1.07 |
Blue Drop | Power 7 | IBM | 447.937 | 317.282 | 1.41 |
Hopper | AMD | PGI | | | |
Kraken | AMD | Intel | 584.770 | 268.879 | 2.17 |
Queen Bee | Intel | Intel | | | |
Ranger | AMD | Intel | | | |
Surveyor | BG/P | IBM | | | |
All times are in seconds, measuring the sum of the execution times of ML_BSSN_RHS1 and ML_BSSN_RHS2 in ML_BSSN, using a single OpenMP thread and a single MPI process.
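The speedup column is the scalar time divided by the vectorised time. The following sketch recomputes it for the rows of the table that have both timings; the struct timing and the functions speedup and print_speedups are names made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One measured row of the table; times are in seconds. */
struct timing {
    const char *system;
    double scalar_s;
    double vectorised_s;
};

/* Speedup as used in the table: scalar time / vectorised time. */
double speedup(double scalar_s, double vectorised_s)
{
    assert(vectorised_s > 0.0);
    return scalar_s / vectorised_s;
}

/* Print the speedup for every row that has both timings. */
void print_speedups(void)
{
    static const struct timing rows[] = {
        {"Bethe", 338.926, 287.248},
        {"Bethe", 284.610, 265.771},
        {"Blue Drop", 447.937, 317.282},
        {"Kraken", 584.770, 268.879},
    };

    for (size_t i = 0; i < sizeof rows / sizeof rows[0]; ++i) {
        printf("%-10s %.2f\n", rows[i].system,
               speedup(rows[i].scalar_s, rows[i].vectorised_s));
    }
}
```

Calling print_speedups() from a main function prints one line per measured system, each value rounded to two decimal places.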