Working Group on Performance Optimization

From Einstein Toolkit Documentation
Revision as of 17:03, 28 February 2019 by Noncct zachetienne (talk | contribs) (Update NRPy+ link to latest version)

Organization

Type: Working group

Leads

  • Roland Haas
  • Erik Schnetter

Members

  • Zach Etienne
  • Ian Hinder
  • Roland Haas
  • Erik Schnetter
  • Helvi Vitek
  • Eric West

Funding

  • NSF OAC-1550514

Approval

Approved by ET call on 2018-04-02

Background

Activities

The working group researches, develops, implements and promotes performance optimizations for codes included in the Einstein Toolkit. This covers architectures that the Einstein Toolkit already supports (for example CPUs and GPUs) as well as architectures that are not yet well supported (e.g. Intel Phi accelerators, modern GPUs).

The group interacts with Data_Dependant_Task_Scheduler to coordinate optimization efforts.

The group defines the targets of interest and meets regularly via online media as well as in person in small workshops to push forward specific optimization projects.

Members

We welcome new members to the working group! If you are working on performance optimization in some way (e.g. supporting accelerators, SIMD vectorization, new AMR schemes, improving convergence, fine-tuning parameters), then we look forward to hearing from you. We expect that this working group will help us share experience and expertise, and will give us room for technical discussions that might be too specialized to be of general interest.

Milestones

  1. review existing optimization efforts currently in private branches: <DEADLINE> to be added by bracketed persons by 2018-04
    1. [Erik] Carpet/eschnett/funhpc
      1. by 2018-05-31: review of features, decision which features to include into Cactus, extract features into new branch, start discussion on ET mailing list
    2. [Ian] CactusNumerical/ianhinder/rkprol <DEADLINE>
    3. [Roland, Erik] CactusExamples/eschnett/hydro <DEADLINE>
    4. [Roland, Erik] Carpet/rhaas/openmp-task by 2018-09
      1. by 2018-05-31: have pull request for ET
      2. by 2018-07-31: find out whether the memory footprint can be reduced with reasonable effort (by avoiding combine_send)
      3. by 2018-09: reduce memory footprint if possible
    5. [Zach] NRPy+ 2018-07
      1. enable construction of powerful Jupyter user tutorial notebooks (lower learning curve)
      2. make the code more easily extensible for other purposes (e.g., beyond NR)
      3. Speed comparison (gridpoint updates/second/RHS eval) between NRPy+ and McLachlan
    6. [Ian, Erik] Benchmarking <DEADLINE>
      1. construct standard benchmark run to be used for BBH benchmarks <DEADLINE>
      2. provide scripts to do standardized analysis of benchmark <DEADLINE>
  2. import identified optimization efforts into master branches: <DEADLINE> date TBD by 2018-04
  3. review the discussion in "Breakout Discussion on Scalability" in Notes from ET 2017 meeting at NCSA: 2018-04 (next call)
  4. advertise efforts and bring in more developers: 2018-04 (next call)
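
The speed comparison between NRPy+ and McLachlan listed under milestone 1.5 is quoted in gridpoint updates per second. The page does not define this figure precisely; the sketch below assumes it means the number of grid points multiplied by the number of RHS evaluations, divided by the wall time spent in the RHS. The function name and its arguments are hypothetical and do not belong to either code.

```python
def gridpoint_updates_per_second(num_gridpoints, num_rhs_evals, wall_seconds):
    """Throughput of an RHS kernel (assumed definition, see text).

    num_gridpoints: grid points updated in each RHS evaluation
    num_rhs_evals:  RHS evaluations timed (e.g. 4 per RK4 step)
    wall_seconds:   wall time spent in those evaluations
    """
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return num_gridpoints * num_rhs_evals / wall_seconds


# Illustrative numbers only: a 128^3 grid, 100 RK4 steps, 50 s in the RHS.
rate = gridpoint_updates_per_second(128**3, 4 * 100, 50.0)
print(f"{rate:.3e} gridpoint updates/second")
```

Normalizing by both the grid size and the number of RHS evaluations makes runs with different resolutions or time integrators comparable, provided the same compiler and vectorization settings are used for both codes.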

List of projects of interest

  1. FunHPC
    1. Code is on hold. Led to OpenMP tiling project and OpenMP tasks. Contains thread-safe timers which should be pushed to master.
    2. Overlap of computation and communication. This should be implemented outside of FunHPC.
  2. FOCUS: Runge-Kutta prolongation (continuous RK)
    1. Exists as more than proof-of-concept, could be close to production ready for vacuum. Needs a bit more work, which Ian is going to push in the next 2 months.
  3. OpenMP with tiling in hydro toy code
    1. Project is complete; it serves as an example of how to efficiently parallelize (OpenMP) and vectorize (SIMD) a hydro code.
    2. Philip is adding MHD to the code and moving it towards production mode
    3. Erik has a version of this in Kranc which could be used for BSSN. Erik will report on whether there was a speed benefit in vacuum.
  4. FOCUS: OpenMP Tasks in Carpet::commstate
    1. Tasks for individual parts of the work that needs to be done during prolongation
    2. Proof of principle.
    3. Code is in Carpet
    4. There is code already
    5. Tested by Jim. Roland to dig up data for BW.
    6. Need to decide whether this is something worth pushing to master and using in production. Look at the effect on fine-grained timers to identify whether there is a benefit, even if it is just to a “small” part of the run time.
  5. Overlap of computation and communication
    1. needs to be reimplemented from FunHPC
    2. Need to make use of Sam and Steve’s scheduling work for dependencies
    3. Erik needs to look at this, or at least help
    4. Proof of concept; a bit far from getting anything in production
    5. When we have a benchmark, assess how this would affect the benchmark
    6. Put on hold for now
  6. FOCUS: NRPy+
    1. git repo for latest version, with tutorial module, code announcement paper
    2. “Kranc-like”, but python-based: Generates (SIMD / CSE-enabled) code fragments from a high-level description (i.e., Einstein-like notation). Supports arbitrary-order finite difference operators.
    3. Zach is working on a rewrite of NRPy+ (ETA: beginning of June 2018), with a focus on improving modularity and implementing an interactive (IPython) mode, to
      1. enable construction of powerful Jupyter user tutorial notebooks (lower learning curve)
      2. make the code more easily extensible for other purposes (e.g., beyond NR)
    4. Provides
      1. RHSs for the SphericalBSSN thorn proposed for inclusion in ETK. Vacuum production runs are already possible; production simulations of the kind we normally run are still in the future (an MHD-in-spherical-coordinates thorn is now being written by V. Mewes at RIT)
      2. “guts” of magnetized BH accretion disk (Fishbone-Moncrief) initial data thorn
    5. Speed comparison (gridpoint updates/second/RHS eval) between NRPy+ and McLachlan? Zach will do this if he has time. Zach’s notes-to-self: experiment with both GCC & ICC, be careful that vectorization is enabled in McLachlan as well. Contact Ian if McLachlan results seem slow. Loop tiling in SENR?
  7. David Koppelman’s prolongation optimisation
    1. 3 passes of 1D stencil rather than 1 pass of 3D stencil
    2. Nice standalone project
    3. Should do this
  8. FOCUS: Benchmarking
    1. Ian has started looking at the GW150914 example and adding better timing/sync barriers etc. See https://bitbucket.org/einsteintoolkit/performanceoptimisationwg/src/83160044e2297bccf6f44088cb8814820c7b71ab/bbhbench/?at=master.
    2. Made a plot of time spent in different timers as a pie chart.
    3. Erik will run this benchmark, looking at different numbers of OpenMP threads.
    4. Benchmarking analysis tools - something very simple to parse the results. See http://www.serpentine.com/criterion/report.html for inspiration. Ian has a Mathematica notebook for analysing the XML timer tree files produced by Carpet. Want something in Python to do the same. Generate reduced data files, then plot them, then make nice web pages.
    5. Main point of this project is to make something STANDARD that will continue to be developed and used.
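
The idea behind project 7 (three passes of a 1D stencil instead of one pass of a 3D stencil) can be illustrated with a small sketch. It assumes a tensor-product interpolation stencil, here the 4-point cubic Lagrange weights for the midpoint of a uniform grid. The weights and all function names are illustrative choices and are not Carpet's prolongation operators.

```python
import random

# Cubic (4-point) Lagrange weights for the midpoint of a uniform 1D stencil.
W = (-1 / 16, 9 / 16, 9 / 16, -1 / 16)
N = len(W)


def shape(u):
    """Extents (nx, ny, nz) of a nested-list array u[i][j][k]."""
    return len(u), len(u[0]), len(u[0][0])


def interp_direct(u):
    """One pass of the full N x N x N tensor-product stencil."""
    nx, ny, nz = shape(u)
    return [[[sum(W[a] * W[b] * W[c] * u[i + a][j + b][k + c]
                  for a in range(N) for b in range(N) for c in range(N))
              for k in range(nz - N + 1)]
             for j in range(ny - N + 1)]
            for i in range(nx - N + 1)]


def pass_1d(u, axis):
    """One pass of the 1D stencil along one axis (0 = x, 1 = y, 2 = z)."""
    di, dj, dk = (int(axis == d) for d in range(3))
    nx, ny, nz = shape(u)
    # Only the interpolated axis is shortened by the stencil width.
    return [[[sum(W[a] * u[i + a * di][j + a * dj][k + a * dk]
                  for a in range(N))
              for k in range(nz - (N - 1) * dk)]
             for j in range(ny - (N - 1) * dj)]
            for i in range(nx - (N - 1) * di)]


def interp_separable(u):
    """Three passes of the 1D stencil: along z, then y, then x."""
    for axis in (2, 1, 0):
        u = pass_1d(u, axis)
    return u


# Compare both variants on random data.
random.seed(0)
u = [[[random.random() for _ in range(7)] for _ in range(6)] for _ in range(5)]
direct, separable = interp_direct(u), interp_separable(u)
max_diff = max(abs(d - s)
               for pd, ps in zip(direct, separable)
               for rd, rs in zip(pd, ps)
               for d, s in zip(rd, rs))
print(shape(direct), max_diff)
```

Both variants agree up to floating-point rounding. The direct form sums 4^3 = 64 weighted terms per output point, while each 1D pass sums only 4 terms per point, at the price of intermediate arrays between the passes. Whether this pays off in practice depends on memory traffic and cache behaviour, which this sketch does not model.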

Deliverables

  1. the identified optimization options listed above
  2. graphs and data to back up the observed performance improvements
  3. code to include in the Einstein Toolkit

Engagement

The working group communicates via personal email, the Einstein Toolkit performance optimization mailing list, and periodic video conferences.

Persons that are themselves working on performance optimization and that are interested in joining the working group are encouraged to contact the leads at rhaas@illinois.edu or eschnetter@perimeterinstitute.ca.

Agenda and minutes of calls

We keep all files in a shared Google drive folder which is publicly readable but only editable by group members.

We have a repository on Bitbucket for parameter files and similar material.

We hold monthly calls using Google Hangouts on the fourth Thursday of the month at 11:00am CT.