Module cargo::sources::registry::index::cache


A layer of on-disk index cache for performance.

One important aspect of the index is that we want to optimize the “happy path” as much as possible. Whenever you type cargo build, Cargo always reparses the registry index to learn about dependency information. This is done because Cargo needs to learn about the upstream crates.io crates that you’re using and ensure that the preexisting Cargo.lock still matches the current state of the world.

Consequently, Cargo “null builds” (builds where nothing needs recompiling, so the time spent is overhead that Cargo itself adds to each build) need to be fast when accessing the index. The primary performance optimization here is to avoid parsing JSON blobs from the registry if we don’t need them. Most secondary optimizations are centered around removing allocations and such, but avoiding parsing JSON is the #1 optimization.

When we get queries from the resolver we’re given a Dependency. This dependency in turn has a version requirement, and with lock files that already exist these version requirements are exact version requirements =a.b.c. This means that we in theory only need to parse one line of JSON per query in the registry, the one that matches version a.b.c.
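As a minimal sketch of the idea above, the following shows how an exact requirement of the form =a.b.c pins a single version. The Version struct and the parse_version and exact_version functions are hypothetical names for illustration only; Cargo itself relies on real semver types, and this simplified parser ignores pre-release and build metadata.

```rust
/// A simplified version with three numeric components.
/// Hypothetical stand-in for a real semver type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Version {
    pub major: u64,
    pub minor: u64,
    pub patch: u64,
}

/// Parses `a.b.c`. Pre-release and build metadata are not handled here.
pub fn parse_version(s: &str) -> Option<Version> {
    let mut parts = s.split('.');
    let major: u64 = parts.next()?.parse().ok()?;
    let minor: u64 = parts.next()?.parse().ok()?;
    let patch: u64 = parts.next()?.parse().ok()?;
    // Reject trailing components such as `1.2.3.4`.
    if parts.next().is_some() {
        return None;
    }
    Some(Version { major, minor, patch })
}

/// Returns the pinned version if `req` is an exact requirement (`=a.b.c`),
/// and `None` for any other kind of requirement.
pub fn exact_version(req: &str) -> Option<Version> {
    parse_version(req.trim().strip_prefix('=')?.trim())
}
```

With a lock file in place, a query such as exact_version("=1.2.3") names exactly one version, so only the one index line for that version is of interest.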

The crates.io index, however, is not amenable to this form of query. Instead, the crates.io index stores each crate as a file where every line is a JSON blob, aka IndexPackage. To learn which version each JSON blob describes we would need to parse the JSON via IndexSummary::parse, defeating the purpose of trying to parse as little as possible.
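To illustrate why the line-per-blob layout is costly, here is a rough sketch of an uncached lookup that counts how many lines it has to inspect. The parse_vers helper is a deliberately naive, hypothetical stand-in for a full JSON parse such as IndexSummary::parse: it assumes compact JSON with a "vers" string field and no escaped quotes, and the sample lines used with it are illustrative rather than complete index entries.

```rust
/// Naive stand-in for a real JSON parse: extracts the string value of the
/// `"vers"` field. Assumes compact JSON and no escaped quotes in the value.
pub fn parse_vers(line: &str) -> Option<&str> {
    let key = "\"vers\":\"";
    let start = line.find(key)? + key.len();
    let end = line[start..].find('"')? + start;
    Some(&line[start..end])
}

/// Scans an index file line by line looking for `wanted`.
/// Returns the matching line (if any) and the number of lines that had to
/// be parsed before the answer was known.
pub fn find_uncached<'a>(index_file: &'a str, wanted: &str) -> (Option<&'a str>, usize) {
    let mut parsed = 0;
    for line in index_file.lines().filter(|l| !l.trim().is_empty()) {
        // Every line must be parsed just to discover which version it holds.
        parsed += 1;
        if parse_vers(line) == Some(wanted) {
            return (Some(line), parsed);
        }
    }
    (None, parsed)
}
```

In this sketch, a query for the version on the last line of a three-line file parses all three lines, and the cost grows with every newly published version.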

Note that as a small aside even loading the JSON from the registry is actually pretty slow. For crates.io and RemoteRegistry we don’t actually check out the git index on disk because that takes quite some time and is quite large. Instead we use libgit2 to read the JSON from the raw git objects. This in turn can be slow (aka show up high in profiles) because libgit2 has to do deflate decompression and such.

To solve all these issues a strategy is employed here where Cargo basically creates an index into the index. The first time a package is queried (first time being for an entire computer) Cargo will load the contents (slowly via libgit2) from the registry. It will then (slowly) parse every single line to learn about its versions. Afterwards, however, Cargo will emit a new file (a cache, represented as SummariesCache) which can be parsed quickly in future invocations.
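The load-once-then-reuse flow described above can be sketched roughly as follows. The load_or_build function and its slow_build callback are hypothetical names; this is not Cargo's actual code path, and it leaves out cache invalidation entirely.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Returns the cache contents and whether the cache was hit.
/// On a miss, runs `slow_build` (standing in for "load from git and parse
/// every line"), writes the result to `cache_path`, and returns it.
pub fn load_or_build<F>(cache_path: &Path, slow_build: F) -> io::Result<(Vec<u8>, bool)>
where
    F: FnOnce() -> Vec<u8>,
{
    match fs::read(cache_path) {
        // Fast path: a previous invocation already wrote the cache.
        Ok(bytes) => Ok((bytes, true)),
        // Slow path: first query for this package on this machine.
        Err(e) if e.kind() == io::ErrorKind::NotFound => {
            let bytes = slow_build();
            if let Some(parent) = cache_path.parent() {
                fs::create_dir_all(parent)?;
            }
            fs::write(cache_path, &bytes)?;
            Ok((bytes, false))
        }
        Err(e) => Err(e),
    }
}
```

The first call for a given path pays for slow_build; later calls only read the cache file and never invoke the callback.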

This cache file is currently organized by basically having the semver version extracted from each JSON blob. That way Cargo can quickly and easily read all the versions contained in the file and which JSON blob each one is associated with. A JSON blob then doesn’t actually need to get parsed unless its version is the one being queried.
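One way to picture such a layout is the sketch below, which stores each version next to its still-unparsed JSON blob, separated by NUL bytes. The layout, the Entry struct, and the encode, decode, and find_blob functions are assumptions made for illustration and are not claimed to match the real SummariesCache format. Versions are compared as plain strings here, whereas a real implementation would compare semver values.

```rust
/// One cache record: the version plus the raw, unparsed JSON blob.
#[derive(Debug, PartialEq, Eq)]
pub struct Entry<'a> {
    pub version: &'a str,
    pub blob: &'a [u8],
}

/// Serializes `(version, blob)` pairs as `version NUL blob NUL ...`.
/// Hypothetical layout; relies on JSON text never containing a raw NUL byte.
pub fn encode(entries: &[(&str, &str)]) -> Vec<u8> {
    let mut out = Vec::new();
    for (version, blob) in entries {
        out.extend_from_slice(version.as_bytes());
        out.push(0);
        out.extend_from_slice(blob.as_bytes());
        out.push(0);
    }
    out
}

/// Splits `data` at the first NUL byte, dropping the NUL itself.
fn split_at_nul(data: &[u8]) -> Option<(&[u8], &[u8])> {
    let pos = data.iter().position(|b| *b == 0)?;
    Some((&data[..pos], &data[pos + 1..]))
}

/// Recovers all entries without parsing any JSON.
/// Returns `None` if the data is truncated or a version is not UTF-8.
pub fn decode(data: &[u8]) -> Option<Vec<Entry<'_>>> {
    let mut entries = Vec::new();
    let mut rest = data;
    while !rest.is_empty() {
        let (version, after_version) = split_at_nul(rest)?;
        let (blob, after_blob) = split_at_nul(after_version)?;
        entries.push(Entry {
            version: std::str::from_utf8(version).ok()?,
            blob,
        });
        rest = after_blob;
    }
    Some(entries)
}

/// Returns the raw blob for `wanted`; only this blob would then need a
/// full JSON parse.
pub fn find_blob<'a>(entries: &[Entry<'a>], wanted: &str) -> Option<&'a [u8]> {
    entries.iter().find(|e| e.version == wanted).map(|e| e.blob)
}
```

Because decode only splits on delimiters, listing every version costs a linear scan over bytes, and the expensive JSON parse is deferred to the single blob that find_blob returns.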

Altogether, the initial measurements show a massive improvement for Cargo null build performance. The improvement is expected to keep growing over time, because the previous implementation (parse all lines each time) gets slower as new versions of a crate are published. When this was first implemented, a null build of Cargo itself previously parsed 3700 JSON blobs from the registry and loaded 150 blobs from git. With the cache it parses 150 JSON blobs and loads 0 files from git. Removing 200ms or more from Cargo’s startup time is certainly nothing to sneeze at!

Note that this is just a high-level overview. There are of course lots of details, like invalidating caches, which are handled below, but hopefully those are more obvious inline in the code itself.

Structs§

  • Manages the on-disk index caches.
  • A representation of the cache on disk that Cargo maintains of summaries.

Constants§