Difference between revisions of "GitSuperRepoRationale"
Revision as of 04:33, 24 June 2011
Rationale
This section describes the motivations for this project and how our solution addresses problems with the existing systems.
Background
- The Einstein Toolkit is built from many different components, each living in its own repository
- End user must check out each component and compile them together into an executable which is then run to produce output
- End user is often also a developer of some of the components (public or private)
- GetComponents (URL) is a tool to simplify this process by collecting component repository information into a single "CRL" file (CRL = Component Retrieval Language).
- GetComponents allows you to check out the latest versions from a CRL file, or to update an existing set of checkouts to the latest version
Problems
- Upstream projects use different version control systems (SVN, Git, Mercurial, ...) leading to a nonuniform experience for the end user/developer. Multiple tools must be learned for merging/branching/committing etc.
- It is not easy to see at a glance exactly what version of the code is in use. One could iterate over all the different repositories, of different types, and print the revision information, and any local differences. This could be added to GetComponents, but this has not been done yet and we argue that this is not the best solution to the problem.
- Knowing what version of the code has been used to produce a given scientific result is essential for the scientific process, where results must be repeatable. The current best solution to this problem is the Formaline thorn, which stores a complete copy of the source code of all thorns in the simulation output directory. We argue that this is only a partial solution to the problem. While all the source code is present, the version control metadata has been entirely stripped. When comparing different simulations, at best one obtains a large diff of all the source changes between them, without information about why they were made or who made them. There is also no convenient method for using the Formaline output for a new simulation.
- Updating a Cactus source tree is currently an irreversible and dangerous process. There is no guarantee that the "current" trunk branch of all the components will function correctly, and there is no way, short of a manual backup beforehand, of reverting to the previous state if they don't. It is not possible to see, at a glance, exactly what will be updated when you run "GetComponents -u".
- It is desirable for different members of a scientific research group to be using the same version of the code for production simulations, or at least for this to be possible/easy. In the current setup, each user is responsible for managing their own Cactus tree and will likely have completely different versions of the code, depending on when they last updated. It is not even guaranteed that the code can be described by a single "checkout date", since different components could be checked out at different times. Users may also have applied patches or altered behaviour, fixing bugs or adding features, to any of the components.
- SVN does not allow distributed version control. Many components of the ET are in SVN, which means that users cannot use version control locally: they cannot commit frequently to a local repository, go back to previous versions when there are problems, and then bundle up the changes as a coherent commit to upstream.
- Managing branches is difficult. Suppose a user wants to implement a new feature. First, a new branch is created in the corresponding repository (we ignore here the fact that most of the components are in SVN, which does not encourage this mode of development), then the feature is committed bit by bit to that branch. In the course of development, the user might want to run a production simulation. So they would switch that repository back to the production branch temporarily. However, there is no global record of which branch each repository is currently on, and it might be that some repositories are not on "production" branches. The best solution at the moment is to have a separate production tree, but users rarely have the discipline to do this.
Proposed solution
We propose a solution based on the idea of a "Git Super-Repository" with each of the components (thorn repositories) linked into it as submodules (repository pointers).
- A single Git repository represents the complete state of the code
- Each group can set up their own super-repo on their server containing pointers to both the public repositories and their own private ones.
- Any non-Git repositories are mirrored on the group's server (using git-svn, for example) as Git repositories so that users can interact with a single version control system.
- Standard Git tools can be used to determine the current revision of the code and any local changes, so that both can be stored in the simulation output directory for reproducibility.
- It is easy to see what branch each repository is on, and to use branches with components which are hosted in SVN
- We ensure that committing back to the upstream repositories is easy
- A "tested" branch can be created in the super-repo. This branch will only be advanced when the corresponding revision has passed the automated build and test process. This means that a user can know beforehand that if they use this branch, they will always get a working version of the code.
- Updating is now safe and reversible. Since the current state is encapsulated in a revision ID, one can easily revert back to it after an update if the code no longer works.
- A group can have a "production" branch in the super-repo (possibly one per project). Switching to this branch switches all components at the same time, and it can be easily verified that the code is now all on the production branch. The production branch can be advanced in a controlled way when it has been tested, and all users can update to the new tested version.
- Comparing the codes used to produce two simulations is now easier. Assuming that the code revision information is present in the output directory, users can simply do a "git diff <rev1> <rev2>" between the two revisions. They can use standard graphical tools to see the log messages and authors of the commits, or look in full detail at the patches themselves. The question "did you run with XX fix?" can be trivially answered. The use of a production branch would also help in this.
- Users have all the benefits of distributed revision control, and the entire version history of all components (including SVN ones) at their fingertips in their local repository copies.
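The basic structure proposed above can be sketched with a throwaway example. This is only an illustration under stated assumptions: the component name "thornA", the path "arrangements/thornA" and the temporary directory layout are hypothetical, and the protocol.file.allow setting is there only so that the demo can use a local path as the "upstream" repository.

```shell
#!/bin/sh
# Sketch: build a throwaway super-repo containing one component as a submodule.
# "thornA", "arrangements/thornA" and the temp paths are hypothetical names.
set -e
work=$(mktemp -d)
cd "$work"
# A fixed identity so that commits work on a machine without Git configuration.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org

# A stand-in for an upstream component repository.
git init -q thornA
(cd thornA && echo "v1" > src.txt && git add src.txt && git commit -q -m "Initial thornA")

# The super-repo records the component as a submodule (a repository pointer).
git init -q super
cd super
# Recent Git versions refuse local-path submodule clones unless this is allowed.
git -c protocol.file.allow=always submodule --quiet add "$work/thornA" arrangements/thornA
git commit -q -m "Add thornA as a submodule"

# One super-repo revision ID now pins the exact revision of every component.
super_rev=$(git rev-parse HEAD)
sub_rev=$(git -C arrangements/thornA rev-parse HEAD)
git submodule status
```

The super-repo commit stores only the component's commit hash (a "gitlink" entry), which is what makes a single revision ID sufficient to describe the whole tree.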
Scenarios
We now describe several scenarios where a standard Cactus checkout provides a poor user experience that is improved by using a Git super-repo.
Updating a source tree
A user has a standard Cactus checkout, for example of the Einstein Toolkit. They last updated it a few months ago, and would like to update it to be "current". The current method for doing this is to run GetComponents --update with the corresponding thornlist (assuming that the user has kept it). This will update all the repositories in the tree to the current latest version. There is no way to see what will be updated (e.g. commit messages of the changes) and no way to revert to the previous state after the update has occurred. There is no easy way to know whether the latest version of the code passes its test suites.
With a Git super-repository, a user can see the changes that would be obtained on an update:
 git fetch --recurse-submodules
 git log --oneline HEAD..FETCH_HEAD
 578f986 Fix indexing in 7th order prolongation
 b8ab9dc fix testsuites
 7f10f4b handle the case of only one point in a dimention in the code for the diagonal as well, in a similar way as for the other directi
 3fc7a5b LoopControl: provide Fortran interface to vectorized loops
 4f1cbfb CarpetLib: remove superfluous OMP PARALLEL section in (W)ENO prolongation
 10a7e54 Revert "remove superfluous OMP PARALLEL section in (W)ENO prolongation"
Is HEAD..FETCH_HEAD the simplest way to do this? How does this work with submodules?
The user can then do the update,
git pull
git submodule update
try out the code, and if it doesn't work, or they decide for some other reason that they want to go back to the old version, they can find that version from
git reflog
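and reset to it:
git reset --hard <commit>
git submodule update
Here git reset --hard moves the super-repo back to the old commit (discarding any uncommitted changes in the super-repo itself), and git submodule update then checks out the submodule revisions recorded in that commit.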
In the fetch and pull commands above, the user could specify the "tested" branch, which always points to the most recent revision that passes all the test suites. This way, the user knows that the version they are getting will compile and pass all tests.
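The whole update cycle described in this scenario (preview, update, return to the old state) can be tried out on a throwaway super-repo. The following is a sketch rather than a recommended procedure: the repository names "thorn", "upstream" and "user", the helper g, and the commit messages are all hypothetical, and the upstream of the current branch (written @{u}) stands in for whatever branch the user follows, for example a "tested" branch.

```shell
#!/bin/sh
# Sketch: previewed, reversible update of a super-repo with one submodule.
# All repository names and commit messages are hypothetical.
set -e
work=$(mktemp -d); cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
# Allow local-path submodule transport, needed only because the demo is local.
g() { git -c protocol.file.allow=always "$@"; }

# An upstream component, an upstream super-repo pointing at it, and a user clone.
git init -q thorn
(cd thorn && echo v1 > src.txt && git add src.txt && git commit -q -m "thorn v1")
git init -q upstream
(cd upstream && g submodule --quiet add "$work/thorn" thorn && git commit -q -m "Add thorn")
g clone -q --recurse-submodules upstream user

# Upstream moves on: a new component commit, then recorded in the super-repo.
(cd thorn && echo v2 > src.txt && git commit -q -am "Fix indexing in prolongation")
(cd upstream/thorn && g pull -q --ff-only)
(cd upstream && git commit -q -am "Update thorn")

cd user
old=$(git rev-parse HEAD)                      # remember the current state
g fetch -q --recurse-submodules                # download, change nothing yet
pending=$(git rev-list --count 'HEAD..@{u}')   # number of incoming super-repo commits
git log --oneline 'HEAD..@{u}'                 # incoming super-repo commits
git diff --submodule=log HEAD '@{u}'           # incoming commits inside each component

# Perform the update.
g merge -q --ff-only '@{u}'
g submodule --quiet update
after_update=$(cat thorn/src.txt)

# Go back to the remembered state (it could also be looked up with git reflog).
git reset -q --hard "$old"
g submodule --quiet update
after_revert=$(cat thorn/src.txt)
echo "after update: $after_update, after going back: $after_revert"
```

Saving the old revision in a variable is just a convenience for the script; interactively, the same hash can be found with git reflog as described above.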
Source consistency in a project
Several people are collaborating on a project. With a standard checkout, there is currently no convenient way to ensure that all the people have the same version of the code when they run simulations for the project. One of the project members does some development work to try to get something to work, and wants to share that with the other members of the project (but not to commit it upstream, as it is still experimental). Currently this must be handled with patch files. Very soon, even if all the project members started off with the same source tree, they will likely have different versions due to local modifications, and no easy way to see what those changes are.
With a Git super-repo, a branch can be easily created in the group's local repository for the project (one global branch, not one per sub-repository). Each user can then use
git log --oneline projectbranch..
to see what commits they have locally that are not on the project branch, and
git diff
to see their local uncommitted changes across all repositories. Once the project members agree that certain features are required for the project, these can be pushed to the project branch and all the users can update. At all times, it is easy to see the difference between each user's trees.
How much of this actually works with submodules?
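On the question of uncommitted changes, a small experiment suggests the following picture: in the super-repo, a submodule with local edits is reported only as modified, and the per-file details come from running commands inside each submodule or from the --submodule=diff option of git diff (available in Git 2.11 and later). The sketch below assumes a hypothetical component called "thorn" in a throwaway directory.

```shell
#!/bin/sh
# Sketch: finding uncommitted local changes across submodules.
# The names "thorn-upstream", "thorn" and "super" are hypothetical.
set -e
work=$(mktemp -d); cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org

git init -q thorn-upstream
(cd thorn-upstream && echo v1 > src.txt && git add src.txt && git commit -q -m "thorn v1")
git init -q super
cd super
# Allow local-path submodule transport, needed only because the demo is local.
git -c protocol.file.allow=always submodule --quiet add "$work/thorn-upstream" thorn
git commit -q -m "Add thorn"

# A local, uncommitted edit inside the component.
echo "local hack" >> thorn/src.txt

# The super-repo only reports that the submodule as a whole is modified...
super_status=$(git status --porcelain)
# ...while running a command in every submodule lists the changed files,
dirty_files=$(git submodule --quiet foreach 'git status --porcelain')
# and --submodule=diff asks git diff to include the submodule's own patch.
git diff --submodule=diff

echo "super-repo: $super_status"
echo "component:  $dirty_files"
```

The git submodule foreach form works for any per-repository command, so the same pattern could be used to list unpushed commits in each component.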
Identifying whether a fix has been applied for a given simulation
One member of a group has a problem with their simulation. It is crashing and they don't know why. They ask around, and someone thinks they may know the cause, and asks what version of the code is being used, specifically whether a certain thorn has been updated to include a particular fix. Unfortunately, the simulation had been sitting in the queue on a supercomputer for quite a while, and the user has since updated their Cactus tree, so they don't know what version of the code was used to produce the simulation. With a standard checkout, the user must first find the corresponding fix (e.g. using svn log and svn diff), then untar the corresponding Formaline tarball from the simulation output directory, and see by eye if it looks like the fix has been applied. If there have been subsequent changes to the file, it might look completely different to when it was fixed, and it might not be easy to tell if the version of the code is from before or after the fix.
With a Git super-repo, the user can look in the simulation output directory where Formaline has stored the Git hash of the version of the code that was used, along with any local changes as patch files. They can then find the hash of the commit that contains the fix, and look for both hashes in the "git log" output. If the simulation's version comes after the fix, then they know that they were already running with the fix.
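Instead of searching the log by eye, Git can be asked directly whether one commit is an ancestor of another with git merge-base --is-ancestor. The sketch below uses a single throwaway repository with hypothetical commits; the helper name has_fix is made up for the example. If the fix lives inside a component, the same check would presumably be run inside that submodule, using the component hash that the super-repo revision records (git rev-parse <rev>:<path>).

```shell
#!/bin/sh
# Sketch: was a given fix already part of the code a simulation ran with?
# The repository, file and commit messages are hypothetical.
set -e
work=$(mktemp -d); cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org

git init -q repo
cd repo
echo a > f; git add f; git commit -q -m "Initial"
before_rev=$(git rev-parse HEAD)    # hash stored by a run made before the fix
echo b > f; git commit -q -am "Fix the bug"
fix_rev=$(git rev-parse HEAD)       # the commit containing the fix
echo c > f; git commit -q -am "Later work"
after_rev=$(git rev-parse HEAD)     # hash stored by a run made after the fix

# has_fix <fix-commit> <simulation-commit>: prints yes if the simulation's
# revision contains the fix, i.e. the fix is an ancestor of that revision.
has_fix() {
    if git merge-base --is-ancestor "$1" "$2"; then echo yes; else echo no; fi
}

old_run=$(has_fix "$fix_rev" "$before_rev")
new_run=$(has_fix "$fix_rev" "$after_rev")
echo "old run has fix: $old_run, new run has fix: $new_run"
```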
Determining the cause of differences in a simulation
A scientist has a problem. They have run a new simulation and are getting results they don't understand. To track this down, they try to run an old simulation that worked, and find that they get different results. They now need to determine what has changed in the code between the two simulations. With a standard Cactus checkout, their only option is to untar all the Formaline tarballs and run a "diff" between the two. They will see all the source code changes in raw form. They will not know who has made the changes, or why. There is no realistic way to unapply a particular suspect commit and test the new version.
With a Git super-repo, the user will look in the simulation output directory and identify the Git hash of each version of the code. They can then run
git log --oneline <rev1>..<rev2>
to see all the commits which are different between the two versions. This will show the first line of the log message. The user can then select one which looks promising and view the patch in detail:
git show <rev>
This will show the full text of the diff, as well as the full description provided by the author, who is also identified. Perhaps after some further explanations by the patch author, the user could then revert this single patch
git revert <rev>
(this creates a new local commit which reverts the old patch) and try the simulation again, and determine if this patch was responsible for the problem or not.
How much of this actually works with submodules?
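One point that a quick experiment makes visible: git log between two super-repo revisions lists only the super-repo's own commits, whereas git diff --submodule=log between the same revisions lists the component commits that separate them. The sketch below assumes a hypothetical component "thorn" and invented commit messages.

```shell
#!/bin/sh
# Sketch: which component commits lie between two super-repo revisions?
# The names "thorn-upstream", "thorn", "super" and the messages are hypothetical.
set -e
work=$(mktemp -d); cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org

git init -q thorn-upstream
(cd thorn-upstream && echo v1 > src.txt && git add src.txt && git commit -q -m "thorn v1")
git init -q super
cd super
# Allow local-path submodule transport, needed only because the demo is local.
git -c protocol.file.allow=always submodule --quiet add "$work/thorn-upstream" thorn
git commit -q -m "Add thorn"
rev1=$(git rev-parse HEAD)    # revision recorded by the old simulation

# A change is committed in the component, then recorded in the super-repo.
(cd thorn && echo v2 > src.txt && git commit -q -am "Change prolongation order")
git commit -q -am "Update thorn"
rev2=$(git rev-parse HEAD)    # revision recorded by the new simulation

# Super-repo history between the two revisions: only the pointer update.
super_log=$(git log --oneline "$rev1..$rev2")
# Component history between the two revisions, one line per component commit.
component_log=$(git diff --submodule=log "$rev1" "$rev2")
echo "$super_log"
echo "$component_log"
```

A suspect component commit found this way would then be inspected and reverted inside the corresponding submodule, since that is the repository the commit belongs to.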
