Difference between revisions of "Working Group on Performance Optimization"
Revision as of 19:16, 27 April 2018
Organization
Type: Working group
Leads
- Roland Haas
- Erik Schnetter
Members
- Zach Etienne
- Ian Hinder
- Roland Haas
- Erik Schnetter
- Helvi Witek
Funding
- NSF OAC-1550514
Background
Activities
The working group researches, develops, implements and promotes performance optimization for codes included in the Einstein Toolkit. This includes optimizations for architectures the Einstein Toolkit currently supports (for example CPUs and GPUs) as well as for new architectures that are not yet well supported (e.g. Intel Phi accelerators, modern GPUs).
The group interacts with Data_Dependant_Task_Scheduler to coordinate optimization efforts.
The group defines the targets of interest and meets regularly via online media as well as in person in small workshops to push forward specific optimization projects.
Members
We welcome new members to the working group! If you are working on performance optimization in some way (e.g. supporting accelerators, SIMD vectorization, new AMR schemes, improving convergence, fine-tuning parameters), then we look forward to hearing from you. We expect that this working group will help us share experience and expertise, and will allow us to have technical discussions that might be outside the range of general interest.
Milestones
- review existing optimization efforts currently in private branches: <DEADLINE> to be added by the bracketed persons by 2018-04
  - [Erik] Carpet/eschnett/funhpc
    - by 2018-05-31: review the features, decide which features to include in Cactus, extract those features into a new branch, and start a discussion on the ET mailing list
  - [Ian] CactusNumerical/ianhinder/rkprol
  - [Roland, Erik] CactusExamples/eschnett/hydro
  - [Zach] NRPy+, a “Kranc-like” but Python/SymPy-based code capable of creating the mathematical “guts” of ETK thorns (as C code, supporting AVX256/AVX512 intrinsics). (public git repo, code announcement paper). NRPy+ already provides
    - RHSs for the SphericalBSSN thorn (code announcement paper), and
    - an ETK GRMHD initial data thorn (magnetized BH accretion disk) (public git repo)
- import identified optimization efforts into master branches: <DEADLINE> date TBD by 2018-04
- review the discussion in "Breakout Discussion on Scalability" in Notes from ET 2017 meeting at NCSA: 2018-04 (next call)
- advertise efforts and bring in more developers: 2018-04 (next call)
List of projects of interest
- FunHPC
  - Code is on hold. It led to the OpenMP tiling project and the OpenMP tasks project. It contains thread-safe timers which should be pushed to master.
  - Overlap of computation and communication. This should be implemented outside of FunHPC.
- FOCUS: Runge-Kutta prolongation (continuous RK)
  - Exists as more than a proof of concept and could be close to production-ready for vacuum. Needs a bit more work, which Ian is going to push in the next 2 months.
- OpenMP with tiling in hydro toy code
  - Project is complete; it serves as an example of how to efficiently apply OpenMP and SIMD to a hydro code.
  - Philip is adding MHD to the code and moving it towards production mode.
  - Erik has a version of this in Kranc which could be used for BSSN. Erik will report on whether there was a speed benefit in vacuum.
- FOCUS: OpenMP Tasks in Carpet::commstate
  - Tasks for individual parts of the work that needs to be done during prolongation.
  - Proof of principle.
  - Code already exists in Carpet.
  - Tested by Jim. Roland to dig up data for BW.
  - Need to decide whether this is worth pushing to master and using in production. Look at the effect on fine-grained timers to identify whether there is a benefit, even if it is just to a “small” part of the run time.
- Overlap of computation and communication
  - Needs to be reimplemented from FunHPC.
  - Need to make use of Sam and Steve’s scheduling work for dependencies.
  - Erik needs to look at this, or at least help.
  - Proof of concept; still a bit far from getting anything into production.
  - When we have a benchmark, assess how this would affect the benchmark.
  - Put on hold for now.
- FOCUS: NRPy+
  - “Kranc-like”, but Python-based: generates (SIMD / CSE-enabled) code fragments from a high-level description (i.e., Einstein-like notation). Supports arbitrary-order finite difference operators.
  - Zach is working on a rewrite of NRPy+ (ETA: beginning of June, 2018), with a focus on improving modularity and implementing an interactive (ipython) mode, to
    - enable construction of powerful Jupyter user tutorial notebooks (lower learning curve)
    - make the code more easily extensible for other purposes (e.g., beyond NR)
  - Provides
    - RHSs for the SphericalBSSN thorn proposed for inclusion in ETK. Vacuum production runs are already possible; production simulations of the kind we normally do are still in the future (an MHD-in-spherical-coords thorn is now being written by V. Mewes at RIT).
    - “guts” of magnetized BH accretion disk (Fishbone-Moncrief) initial data thorn
  - Speed comparison (gridpoint updates/second/RHS eval) between NRPy+ and McLachlan? Zach will do this if he has time. Zach’s notes-to-self: experiment with both GCC & ICC, and be careful that vectorization is enabled in McLachlan as well. Contact Ian if McLachlan results seem slow. Loop tiling in SENR?
- David Koppelman’s prolongation optimisation
  - 3 passes of a 1D stencil rather than 1 pass of a 3D stencil
  - Nice standalone project
  - Should do this
- FOCUS: Benchmarking
  - Ian has started looking at the GW150914 example and adding better timing/sync barriers etc. See https://bitbucket.org/einsteintoolkit/performanceoptimisationwg/src/83160044e2297bccf6f44088cb8814820c7b71ab/bbhbench/?at=master.
  - Made a plot of time spent in different timers as a pie chart.
  - Erik will run this benchmark, looking at different numbers of OpenMP threads.
  - Benchmarking analysis tools: something very simple to parse the results. See http://www.serpentine.com/criterion/report.html for inspiration. Ian has a Mathematica notebook for analysing the XML timer tree files produced by Carpet. Want something in Python to do the same. Generate reduced data files, then plot them, then make nice web pages.
  - Main point of this project is to make something STANDARD that will continue to be developed and used.
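The "3 passes of 1D stencil rather than 1 pass of 3D stencil" idea listed under David Koppelman’s prolongation optimisation can be illustrated with a small sketch. This is a generic illustration only, not Carpet's prolongation code: it assumes a tensor-product interpolation stencil, uses the 4-point cubic Lagrange midpoint weights as an example, and all function names are hypothetical. With a stencil of width 4, the single 3D pass needs 4³ = 64 multiply-adds per output point, while the three 1D passes need roughly 3 × 4 = 12, at the cost of intermediate arrays.

```python
from itertools import product

# Example weights (an assumption for this sketch): 4-point cubic Lagrange
# interpolation to the midpoint between the two central samples.
W = (-1.0 / 16, 9.0 / 16, 9.0 / 16, -1.0 / 16)


def interp_3d_single_pass(u):
    """One pass with the full 3D stencil: len(W)**3 terms per output point."""
    n, s = len(u), len(W)
    m = n - s + 1  # number of output points per direction
    out = [[[0.0] * m for _ in range(m)] for _ in range(m)]
    for i, j, k in product(range(m), repeat=3):
        acc = 0.0
        for a, b, c in product(range(s), repeat=3):
            acc += W[a] * W[b] * W[c] * u[i + a][j + b][k + c]
        out[i][j][k] = acc
    return out


def interp_1d_along(u, axis):
    """Apply the 1D stencil along one axis of a 3D nested list."""
    s = len(W)
    shape = [len(u), len(u[0]), len(u[0][0])]
    shape[axis] -= s - 1  # only this axis shrinks in this pass
    out = [[[0.0] * shape[2] for _ in range(shape[1])] for _ in range(shape[0])]
    for i, j, k in product(range(shape[0]), range(shape[1]), range(shape[2])):
        acc = 0.0
        for a in range(s):
            idx = [i, j, k]
            idx[axis] += a
            acc += W[a] * u[idx[0]][idx[1]][idx[2]]
        out[i][j][k] = acc
    return out


def interp_3d_three_passes(u):
    """Three passes of the 1D stencil, one per direction."""
    for axis in range(3):
        u = interp_1d_along(u, axis)
    return u


if __name__ == "__main__":
    n = 6
    u = [[[float(i * i + j * k) for k in range(n)] for j in range(n)] for i in range(n)]
    a = interp_3d_single_pass(u)
    b = interp_3d_three_passes(u)
    diff = max(abs(a[i][j][k] - b[i][j][k]) for i, j, k in product(range(3), repeat=3))
    print("max difference between the two variants:", diff)
```

Both variants compute the same interpolant up to floating-point rounding; whether the three-pass form is faster in practice depends on memory traffic for the intermediate arrays, which is what a benchmark would need to establish.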
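As a starting point for the Python benchmarking analysis tool mentioned above, the "parse, then generate reduced data files" step might look roughly like the sketch below. The XML schema used here is purely an assumption for illustration (nested `timer` elements with `name` and inclusive `seconds` attributes); the timer tree files actually written by Carpet will differ and the parsing code would need to be adapted. All function names are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# ASSUMED schema (hypothetical, not taken from Carpet): nested <timer>
# elements carrying a name and an inclusive wall time in seconds.
SAMPLE_XML = """
<timer name="main" seconds="100.0">
  <timer name="Evolve" seconds="90.0">
    <timer name="RHS" seconds="50.0"/>
    <timer name="Prolongation" seconds="25.0"/>
  </timer>
  <timer name="Output" seconds="6.0"/>
</timer>
""".strip()


def flatten(node, prefix=""):
    """Yield (path, inclusive, exclusive) for a timer and all its descendants."""
    name = node.get("name")
    path = f"{prefix}/{name}" if prefix else name
    inclusive = float(node.get("seconds"))
    children = node.findall("timer")
    # Exclusive time: time spent in this timer but in none of its children.
    exclusive = inclusive - sum(float(c.get("seconds")) for c in children)
    yield path, inclusive, exclusive
    for child in children:
        yield from flatten(child, path)


def reduce_timer_tree(xml_text):
    """Return flat rows sorted by exclusive time, largest first."""
    rows = list(flatten(ET.fromstring(xml_text)))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


def to_csv(rows):
    """Serialise the reduced rows as CSV text, ready for plotting tools."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["timer", "inclusive_s", "exclusive_s"])
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(to_csv(reduce_timer_tree(SAMPLE_XML)))
```

The exclusive times sum to the root's inclusive time, so they are the natural input for a pie chart of time spent in different timers; the CSV output stands in for the "reduced data files" that plotting and web page generation would consume.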
Deliverables
- the identified optimization options listed above
- graphs and data to back up the observed performance improvements
- code to include in the Einstein Toolkit
Engagement
The working group communicates via personal email, the Einstein Toolkit performance optimization mailing list, and periodic video conferences.
People who are working on performance optimization and are interested in joining the working group are encouraged to contact the leads at rhaas@illinois.edu or eschnetter@perimeterinstitute.ca.
Agenda and minutes of calls
We keep all files in a shared Google drive folder which is publicly readable but only editable by group members.
We will have monthly calls using Google Hangouts on the last Friday of each month at 12:00 pm CT.
- Agenda and minutes of Kick-off call on Friday March 23rd
- Agenda for April meeting