Lock Files Are Not the Only Option

posted 26 Aug 2017

So you want to write a package manager is a good read if you want to understand the idea behind package managers (or – as referred in the text – project dependency managers) such as Go’s Glide or Rust’s Cargo. One of the parts it explains in detail is how to resolve a manifest file into a lock file.

It describes a manifest file as a file “that lists the depended-upon packages to be managed” (among other things), whereas a lock file is “a machine-written file with all the information necessary to [re]produce the full dependency source tree”.

The part which is important to realise here is that the lock file is used to ensure we can build the project with the exact same set of dependencies. Unfortunately, Hyrum’s law says that any change in your library will be a breaking change for someone out there. Did you fix a bug by adding some additional checks? Someone out there may have their system fall over because of the additional computation time you introduced. Did you fix a function that returned some buggy output? You can be sure someone out there assumed it was like that by design.

In this context, it applies to you as a consumer of libraries. You certainly want to be sure that what you built and deployed to your staging environment will be identical to what you ship to production. And if you need to build and deploy a hotfix, you certainly don’t want some new library version to unintentionally change any part of your program.

Clearly, lock files are great to ensure that we only need to handle changing dependencies when we want to. We should keep it that way. However – as the title of this blog post suggests – they are not the only way to ensure a reproducible build.

Reproducible Algorithms

Lock files were made because manifest files, combined with the algorithm to resolve dependencies, were not reproducible. It is the “easy” way to ensure that dependency resolution is reproducible; just store the result in a file and reuse it. But we can also attempt to ensure reproducibility by creating a dependency resolution algorithm that is reproducible with just the manifest file.

Dependency range support is the most common reason why dependency resolution algorithms are not reproducible. If you depend on foo [1.1.4, 2.0.0), then the version of foo you actually end up with will depend on what versions of foo were available when you resolved this version range.

If we instead say you can only specify one version of a dependency in the manifest file, we can use a very straightforward algorithm to resolve dependencies transitively:

def resolveDependencies(manifestDependencies):
  seenDeps = DependencySet([])
  currentDeps = manifestDependencies
  nextDeps = []
  while currentDeps:
    nextDeps = []
    for dep in currentDeps:
      if dep not in seenDeps:
        seenDeps.add(dep)
        # Read additional dependency info
        # from cache/dependency repository:
        dep.fetch()
        nextDeps.extend(dep.dependencies)
    currentDeps = nextDeps
  return seenDeps

This is a plain breadth-first search for dependencies, where we pick the first version of a dependency we see. Typically it is called “nearest-first” because the dependency version picked is nearest the root of the dependency tree you’re walking. And while it may sound like a naïve algorithm, it is effectively the core of the dependency resolution algorithm used by Maven and Leiningen. It is by no means perfect; there are good reasons why the :pedantic configuration parameter in Leiningen and other tools exists.

For fans of dependency ranges and their corresponding NP-complete problems, “single-dependency” algorithms are obviously not preferable. To support algorithms using version ranges, one option may be to embed the essential information from the lock file into the manifest file itself. This may sound silly at first, but the “only” essential information we need is which dependency versions were available at the time the lock file was generated. Assuming that the publication time for a dependency version is available to the package manager¹, we can provide a timestamp in the manifest file and ignore all releases that were released after it. Now, we can use the exact same algorithm that is used to compute the lock file contents. Provided the algorithm is deterministic, we end up with the exact same result as using a lock file generated at the time specified in the manifest file.

Interestingly, this is even more powerful than a lock file, because you can pick a timestamp in the past. And it may be interesting to utilise this inside the package manager itself: If you want to upgrade all your dependencies, but the build or your test fails after the upgrade, your package manager can do a binary search à la git bisect to find the exact time (and thus the exact dependency/dependencies) which caused the build/tests to fail.

Properties, Not Implementation Details

The point of lock files isn’t the lock files themselves, but what they provide: Reproducibility. I think package manager developers should at least know that reproducibility can, in some cases, be provided without lock files. Thinking outside the box can, in turn, give these package managers some interesting properties, as mentioned at the end of the previous section.

With the standard caveats which come with timestamps and distributed systems. For example will this idea not work with git repositories for several reasons. The biggest reason is that commit times are not the time commits were published to the repository. ↩