Data Dependent Task Scheduler
Development Team: Data Dependent Task Scheduler
Date Approved: August 13th, 2017
Lead: Steve Brandt
Team: Roland Haas, Erik Schnetter, Frank Loeffler, Samuel Cupp
Funding: NSF 1550514, NSF 1550551, Perimeter Institute
Background:
Writing code with the Einstein Toolkit is becoming more challenging, both because simulations must handle increasingly complex multi-physics and because even desktop-size hardware is becoming more parallel and more heterogeneous. At the heart of every simulation is the orchestration of the tasks implied by the different science and computational components, making sure that available computational and memory resources are used efficiently. This is the job of the scheduler in the Einstein Toolkit.
The scheduler is provided by Cactus. It provides time bins for starting up, evolving a step, and shutting down. Cactus programming modules (i.e. thorns) can schedule functions inside any of these bins, insert new scheduling bins inside this basic skeleton, and arrange them before or after others. Thorns modify this schedule through a domain-specific language (DSL). Halo exchanges for variables, or groups of variables, can be triggered by annotating these schedule bins with a sync directive.
Identifying the correct places to perform synchronization has proven difficult in many cases. This has sometimes resulted in bugs, but more often in unnecessary synchronization that costs performance. The problem also manifests inside the adaptive mesh refinement component of the Einstein Toolkit. While initial applications followed the time-stepping requirements of the overall Berger and Oliger method, some quantities needed to be evaluated in ways that did not fit into this structure, such as global reductions for analysis purposes, or multigrid cycles. Special scheduling annotations were created to meet the needs of these calculations. Science was served, but the programmability and overall usability of the Cactus Framework suffered.
On execution, Cactus traverses the schedule tree in order and tells the driver to execute the routines scheduled by the active thorns, in the order requested by the thorn writers, and to communicate ghost zone data when necessary. Every requested data synchronization constitutes an explicit barrier, since execution does not continue until the communication has completed. This is the case even if the next scheduled routine does not depend on the data being communicated. Thus, it is currently not possible to interleave computation and communication, and the explicit barriers make it impossible for one part of the grid to get ahead of another. To improve this situation, the present scheduler needs to be replaced with a much more flexible approach that relies explicitly on data dependencies.
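To make the idea of interleaving computation and communication concrete, here is a minimal C++ sketch. It is not Cactus or driver code: the names Ghosts, exchange_ghosts, and smooth_step are hypothetical, the "exchange" is a local stand-in for a real ghost-zone communication, and the three-point average is only a placeholder for a real update. The point is that the interior of a patch can be updated while the exchange is in flight, and only the edge points wait for the ghost data.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical ghost values for a 1D patch (one ghost point per side).
struct Ghosts {
  double left;
  double right;
};

// Hypothetical stand-in for a ghost-zone exchange with neighbouring patches.
// A real driver would communicate here; this sketch just returns the values.
Ghosts exchange_ghosts(double left_neighbour, double right_neighbour) {
  return Ghosts{left_neighbour, right_neighbour};
}

// One three-point averaging step on a 1D patch.
std::vector<double> smooth_step(const std::vector<double>& u,
                                double left_neighbour,
                                double right_neighbour) {
  assert(!u.empty());  // the sketch assumes a non-empty patch
  const std::size_t n = u.size();
  std::vector<double> out(n);

  // Start the exchange asynchronously instead of blocking on it.
  std::future<Ghosts> ghosts = std::async(std::launch::async, exchange_ghosts,
                                          left_neighbour, right_neighbour);

  // Interior points do not depend on ghost data, so they can be
  // computed while the exchange is still in progress.
  for (std::size_t i = 1; i + 1 < n; ++i) {
    out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0;
  }

  // Only the edge points have to wait for the communication to finish.
  const Ghosts g = ghosts.get();
  if (n == 1) {
    out[0] = (g.left + u[0] + g.right) / 3.0;
  } else {
    out[0] = (g.left + u[0] + u[1]) / 3.0;
    out[n - 1] = (u[n - 2] + u[n - 1] + g.right) / 3.0;
  }
  return out;
}
```

With an explicit barrier, the call to ghosts.get() would sit before the interior loop, and the whole patch would idle until the exchange completed.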
Other libraries and frameworks of relevance include Charm++, HPX, FunHPC, SpECTRE (which uses Charm++; Jonah Miller wrote the first library to use Charm++ for SpECTRE), ExaHyPE, and GAMER (a fully GPU/AMR library at UIUC written by Justin Shrive).
Development:
We will modify the flesh of Cactus to allow a thorn writer to annotate the schedule description with information about which variables a routine reads and writes. These annotations are an alternative to the existing method of specifying which functions run before or after each other and which variables are to be synchronized. This information will be used to construct a dependency graph using standard C++ futures.
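As an illustration of how read/write annotations could be turned into a dependency graph with standard C++ futures, here is a minimal sketch assuming C++14. It is not the planned Cactus implementation: Routine, run_schedule, and demo are hypothetical names, variables are identified by plain strings, and the routine names in demo are made up. Each routine becomes an asynchronous task that waits only on the earlier routines that wrote a variable it touches, or that read a variable it is about to overwrite.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <map>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical description of one scheduled routine (not a Cactus type).
struct Routine {
  std::string name;
  std::vector<std::string> reads;   // variables the routine reads
  std::vector<std::string> writes;  // variables the routine writes
  std::function<void()> body;
};

using Done = std::shared_future<void>;

// Launch every routine as an asynchronous task whose only ordering
// constraints come from the declared reads and writes.
std::vector<Done> run_schedule(const std::vector<Routine>& schedule) {
  std::map<std::string, Done> last_writer;           // latest writer per variable
  std::map<std::string, std::vector<Done>> readers;  // readers since that write
  std::vector<Done> all;
  for (const Routine& r : schedule) {
    assert(r.body != nullptr);  // every routine must supply a body
    std::vector<Done> deps;
    for (const std::string& v : r.reads) {  // read-after-write
      auto w = last_writer.find(v);
      if (w != last_writer.end()) deps.push_back(w->second);
    }
    for (const std::string& v : r.writes) {
      auto w = last_writer.find(v);  // write-after-write
      if (w != last_writer.end()) deps.push_back(w->second);
      for (const Done& d : readers[v]) deps.push_back(d);  // write-after-read
    }
    Done done = std::async(std::launch::async, [deps, body = r.body] {
                  for (const Done& d : deps) d.wait();  // wait for inputs only
                  body();
                }).share();
    for (const std::string& v : r.reads) readers[v].push_back(done);
    for (const std::string& v : r.writes) {
      last_writer[v] = done;
      readers[v].clear();
    }
    all.push_back(done);
  }
  return all;
}

// Runs three made-up routines and returns the order in which they executed.
std::vector<std::string> demo() {
  std::mutex m;
  std::vector<std::string> log;
  auto record = [&m, &log](const std::string& n) {
    return [&m, &log, n] {
      std::lock_guard<std::mutex> lock(m);
      log.push_back(n);
    };
  };
  std::vector<Routine> schedule = {
      {"CalcRHS", {"phi"}, {"rhs"}, record("CalcRHS")},
      {"Analysis", {"phi"}, {"norm"}, record("Analysis")},
      {"Update", {"rhs"}, {"phi"}, record("Update")},
  };
  for (auto& f : run_schedule(schedule)) f.wait();
  return log;
}
```

In demo, CalcRHS and Analysis both only read phi, so they may run in either order or concurrently, while Update must run last because it reads rhs and overwrites phi. The sketch uses one thread per routine for simplicity; a production scheduler would more likely use a task pool and would also have to account for ghost-zone exchanges and grid regions.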
The main wiki page for development of the standard, along with the current status of the project and working group, may be found here:
https://docs.einsteintoolkit.org/et-docs/Adding_requirements_to_the_Cactus_scheduler
Milestones:
- October 2017: Identify methods to correctly specify READS and WRITES clauses in Cactus thorns, including procedures to modify existing thorns and make them compliant. In addition, modify the behavior of synchronization so that SYNC statements are no longer necessary.
- January 2018: Replace the static scheduler with a data-dependent scheduler, and adapt PUGH to work with this scheduler. Begin usage of C++ futures for parallelism.
- September 2018: Adapt Carpet to the new data-dependent scheduler with a simpler, but working PUGH reference implementation.
- September 2019: Complete adapting Carpet internals to use the new scheduler. Hardening, testing, and performance optimization. Adapt key thorns to make better use of the new infrastructure. Cactus user’s workshop to demonstrate the new scheduler.
- September 2020: Continue adaptation of science modules. Development of an online tutorial.
Deliverables:
- September 2018: PUGH reference implementation released.
- September 2019: New scheduler available in regular release in Carpet.
- September 2020: Online tutorial.
Engagement:
If you wish to contribute, please add your name to the working group on https://docs.einsteintoolkit.org/et-docs/Adding_requirements_to_the_Cactus_scheduler and contact Steven R. Brandt (sbrandt@cct.lsu.edu) to let him know of your interest.