A Resilient Git Dependency Algorithm

posted 26 Mar 2018

Crates

In the last couple of months, there has been significant development around dependency management in the two languages I use the most these days: Go and Clojure. Go, with its publication of vgo and additional texts around it, and Clojure, with its announcement of Git support in deps.edn.

What I found curious about these developments is that they both can work on top of Git repositories. Both also challenge the assumption that we need version ranges. Instead, they assume that newer versions are always backwards compatible. To ensure repeatable builds, they pick the highest version found by transitively walking the dependencies in the dependency lists. This also means that they don’t need a lockfile.

One of the good benefits of using Git repositories as dependencies is that you can fork a library, fix a bug and then depend on your own fork until the fix has been merged in upstream. Instead of waiting for a new release, you can go back to what you were doing right away and verify that the fix works in your staging/test environment. I have done this several times in Go, and although the $GOPATH makes it a bit painful to fork a library in Go, it’s straightforward to specify the revision and location of a fork with Glide/Dep.

In the introduction to vgo, however, we’re met with this sentence:

Authors will be expected to tag releases with semantic versions, and vgo encourages using tagged versions, not arbitrary commits.

It is clear that tools.deps.alpha needs to properly handle commits as versions, but vgo may decide to not consider this as very important. For now, I will assume that they do, mostly because the Go community already utilises the benefits of forking dependencies and fixing bugs immediately.

Using Git commits as versions can cause some interesting problems – at least one which seems to be not properly considered by any of these tools: Git versions are only partially ordered, not totally ordered as many other version schemes are. Let’s consider the use case I described above: Someone working on module foo depends on module X and finds a bug in it. They want to fix the bug, so they send in a pull request, depend on it and push the change.

Meanwhile, some other module bar depends on X as well. The owner also submits a pull request with a fix for some other bug and depends on it1.

If we wanted to depend on both foo and bar in this situation, tools.deps.alpha rightfully complains that this is not a manageable situation. In order to fix this, we need to merge both commits down into the same branch and depend on the newer commit instead. So if we just wait until these branches are merged down, things should be just fine, right? Unfortunately, that’s not necessarily true. If I wanted to merge in a commit, I have not one, but three options to choose between:

GitHub's Merge options

That is the second issue with Git repositories: Commits may disappear, and a merge of a branch may not be traced back to the original commit.

In fact, the git repository may not even exist! Go had its own leftpad incident when bindata went away for some time. From experience, this is not a one-off issue either: I have experienced issues with libraries being rebased and another incident where a library I transitively depended on was deleted.

It is very tempting to not consider these issues as things we have to handle. After all, if a commit or repository that we depend on is gone, it is impossible to get them back during dependency resolution. And if neither of two commits can be considered “newer” than the other, we cannot pick one. Yet in many realistic situations, we can recover and actually pick the newest version, even if repositories and commits are lost.

To explain how we can improve over tools.deps.alpha and vgo, we first need to understand how they work. The full set of dependency algorithms for vgo is explained in Minimal Version Selection, and I’ll explain briefly how its resolution algorithm works. We also need to go through what seems like a long detour, but it is necessary to solve the problems explained above.

Minimal Version Selection (MSV) uses the concept of a rough list and a final list. The rough list is the set of all dependencies reachable from our project root: The set of all potential dependencies we will use. The final list is all the latest versions of all dependencies found in the rough list.

It is a very straightforward algorithm to implement if we only consider happy paths, and here is a Clojure version:

(def dep-name first)
(def dep-vsn second)

(defn rough-list
  [visited dep]
  (if (contains? visited dep)
    visited
    (reduce rough-list
            (conj visited dep)
            (get-deps dep))))

(defn latest-dependency
  [deps]
  (apply max-key dep-vsn deps))

(defn final-list
  [candidates]
  (->> candidates
       (group-by dep-name)
       (vals)
       (map latest-dependency)
       set))

(defn resolve-dependencies
  [top-level-deps]
  (let [cands (reduce rough-list
                      #{} top-level-deps)]
    (final-list cands)))

this one uses Lein/Boot-like vectors for brevity: The first element in a dependency vector is the dependency name/identifier, and the second its version. Using the example from the vgo MVS text

Version Select Example of a dependency set from Minimal Version Selection

we can make a static get-deps “function” and see that we get the same result as presented in that text:

(def get-deps
  {[:a 1] #{[:b 2] [:c 2]}
   [:b 1] #{[:d 1]}
   [:b 2] #{[:d 2]}
   [:c 1] #{}
   [:c 2] #{[:d 4]}
   [:c 3] #{[:f 1]}
   [:d 1] #{[:e 1]}
   [:d 2] #{[:e 1]}
   [:d 3] #{[:e 2]}
   [:d 4] #{[:e 2]}
   [:e 1] #{}
   [:e 2] #{}
   [:e 3] #{}
   [:f 1] #{[:g 1]}
   [:g 1] #{[:f 1]}})

;; code

user=> (resolve-dependencies #{[:a 1]})
#{[:c 2] [:d 4] [:b 2] [:e 2] [:a 1]}

tools.deps.alpha uses a similar algorithm to resolve dependencies, but it is not identical.

It is very easy to understand how MSV works, but its output can be surprising in many cases. Consider this example:

B 1B 2A 1X 1C 1D 1

Here, the dependencies we will depend on are X1, B2, C1 and D1. Since we expand B1 regardless of whether we have expanded B2, we will pick up C1. The same happens in tools.deps.alpha, but only if you indirectly depend on both B1 and B2, and B1 is visited first.

Even though this is not our original problem, we can see that it makes no sense to depend on C1: It is not used by anyone! In Go, this “only” leads to longer resolution times. For Clojure and Java applications, it can cause more hassle. Not only will uberjars be filled with unnecessary dependencies and possibly aot-compile more code than necessary, they can also alter the functionality of programs that look into the uberjar contents or attempt to find namespaces on the classpath (for example through bultitude)

There are a couple of ways to fix this. The most straightforward solution is to do a walk down the dependencies you originally depended on, and remove the ones you are unable to reach.

If we remove unused dependencies, it is tempting to optimise the MSV algorithm a bit. If we see an older version of a dependency we have already expanded, we could try to skip walking down it. In that way, we don’t have to fetch too many unused dependencies. Tempting as it is, it will, unfortunately, make the algorithm nondeterministic. Consider this example:

B 1B 2A 1X 1C 1D 1E 1E 2

If you apply this optimisation and run it on this input, you may end up with two different set of dependencies, depending on which order you expand them:

  1. B2, D1, E2, X1
  2. B2, D1, E1, X1

This set of dependencies also shows us a different problem with MVS: If B2 decided to go back to E1 (presumably because of bugs), and only B1 depended on E2, then it makes no sense to use E22!

Fixed-Point Dependency Walking

If you’re like me, you would be slightly confused if the dependency resolution algorithm presents you with a dependency or version that is not used by any other dependency. Perhaps we can avoid this issue with a different algorithm?

If the goal is to have “traceable” dependencies, then it makes sense that all dependencies are used by other dependencies you use. In fact, that’s what Maven/Aether does with its nearest-first algorithm. Why aren’t we using that one, then?

The nearest-first algorithm ignores version numbers altogether, which means it picks the first version of a dependency it finds. It will often pick a version that’s lower than what some other library may depend upon. If the newer version adds new functionality that this library uses, your project won’t even build! This is partly why :pedantic in Leiningen exists: It detects these things and suggests exclusions to fix this so that you use the latest versions.

It would be great if we could utilise the ideas in tools.deps.alpha and vgo while still having these “traceable” dependencies. If we attempt to encode this into rules, it looks something like this:

One of the cool things about this set of rules is that it is very easy to code it into a fixed-point algorithm. A naïve implementation works like this3:

Make two sets of dependencies: used and candidates. Put all the top-level dependencies into the candidate set. Make a new set new-used by taking all the latest versions of all dependencies in the candidate set. If this set is different from used, replace used with new-used and start over, otherwise return used.

If we do this on the set of dependencies we’ve used above

B 1B 2A 1X 1C 1D 1

then we can trace this by hand like so:

Oh no, Javascript is not enabled (or the js file couldn't be loaded)! No worries though, this was only supposed to be an interactive step-by-step visualisation. You'll just have to imagine the steps yourself (or enable JavaScript if you're curious).

What’s great about this algorithm is that an initial implementation is very short:

(require '[clojure.set :as set])

(defn resolve-dependencies
  [top-level-deps]
  (loop [used #{}]
    (let [cands (apply set/union
                       top-level-deps
                       (map get-deps used))
          new-used (final-list cands)]
      (if-not (= used new-used)
        (recur new-used)
        used))))

I don’t really know what to call this algorithm, but it is a fixed-point dependency walking algorithm, so I’ll just call it that (FPDW for short).

Ignorance is Bliss

So far, there seems to be no difference between MVS and FPDW. They both compare versions, and when they do, they will crash if they are incomparable:

(resolve-dependencies #{[:e 1] [:e :??]})
;; ClassCastException:
;; clojure.lang.Keyword cannot be cast to java.lang.Number

However, if we modify the dependencies to include this one, we see that it is harder to get such a crash to begin with:

B 1B 2A 1X 1C 1D 1E 2E ??

The naïve FPDW algorithm will not expand two versions of the same dependency at the same time. This is great for situations where you maintain a library B and depend on a branch of the library E. If the branch is squashed or rebased into another branch, you can update E without any worries that a previous version of B will cause conflicts for consumers.

However, it is still possible to get into this situation. Imagine a dependency tree like this:

C 1C 2A 1X 1B 1E 2E ??Z 1

Here, both MVS, tools.alpha.deps and the initial version of FPDW will crash, as E ?? and E 2 will eventually be compared. However, we can tweak FPDW to not care about this and eventually resolve E to version 2.

As a human, this may seem obvious: We know that C 2 will be picked over C 1, so we can avoid comparing E 2 with E ?? altogether by fully expanding Z 1 first, then walk down B 1.

Unfortunately, it’s not possible for the algorithm to know which dependencies to expand first. We can, however, employ another technique: Ignore errors as long as we have progress.

Instead of throwing an error when we hit a dependency we can’t resolve, we will first check if we will have progress without it. In this case, progress means that the used set is still changing. If the used set is not changing, we have ended up in a situation where we don’t have any progress, and we’re forced to throw an error.

What do we do with dependencies we’ve already included in the used set, but we now aren’t sure which version to include? Without going into too much detail, it turns out it is both safe to throw them out and to keep them in4. As it’s easier to throw them out in the naïve algorithm, we will just do that:

(defn try-latest-dependency
  [deps]
  (try
    (latest-dependency deps)
    (catch Exception e
      #{})))

(defn try-final-list
  [candidates]
  (->> candidates
       (group-by dep-name)
       (vals)
       (map try-latest-dependency)
       set))

(defn resolve-dependencies
  [top-level-deps]
  (loop [used #{}]
    (let [cands (apply set/union
                       top-level-deps
                       (map get-deps used))
          new-used (try-final-list cands)]
      (if (not= used new-used)
        (recur new-used)
        ;; No progress, so resolve normally.
        ;; If it fails, this throws an error.
        (final-list cands)))))

We now have an algorithm that can resolve dependencies, even if some Git repositories/commits are lost or if some commits cannot be compared!

Conclusion

tools.deps.alpha and vgo introduces some new interesting dependency resolution algorithms to the table, but neither seems to consider what to do if two versions aren’t comparable. This is particularly important for dependency algorithms that want to embrace Git and its branching capabilities. If we instead use a fixed-point algorithm, we can avoid most of those issues as well.

As seen in this post, using Git sources as dependencies introduce new challenges dependency algorithms should at least have an opinion about and preferably fix if possible. But do we really need to solve these problems? I think it is worth reflecting on one question I’ve not discussed here: Since Git repositories aren’t truly immutable, should we use them as a source for our dependencies?

  1. Right now, if you’re making a library for others to consume, you should prefer to depend on the “canonical” branch only. This example should hopefully only appear in some internal libraries within a corporation, and not for public exposure. With the more resilient dependency resolution algorithm described, it’s possible to relax this constraint. 

  2. From black box testing, this seems to be what tools.deps.alpha’s algorithm does, although I am not 100% sure. 

  3. This implementation relies on an assumption that may not always be true: That a dependency A will never depend on a dependency B where this or an older version of B depends on a newer version of A. If it does, then we may end up with some funny results and possible infinite loops. It is possible to handle those situations with a more sophisticated algorithm, but that’s not something I’ve done in this implementation. 

  4. You have to be careful with fixed-point algorithms, as they can quickly end up in a cycle without any progress. This change doesn’t add any.