A layer of on-disk index cache for performance.
One important aspect of the index is that we want to optimize the “happy path” as much as possible. Whenever you type cargo build, Cargo will always reparse the registry and learn about dependency information. This is done because Cargo needs to learn about the upstream crates.io crates that you’re using and ensure that the preexisting Cargo.lock still matches the current state of the world.
Consequently, Cargo “null builds” (the overhead that Cargo itself adds to each build) need to be fast when accessing the index. The primary performance optimization here is to avoid parsing JSON blobs from the registry if we don’t need them. Most secondary optimizations are centered around removing allocations and such, but avoiding parsing JSON is the #1 optimization.
When we get queries from the resolver we’re given a Dependency. This dependency in turn has a version requirement, and with lock files that already exist these version requirements are exact version requirements =a.b.c. This means that, in theory, we only need to parse one line of JSON per query in the registry: the one that matches version a.b.c.
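As a rough illustration of why exact requirements are so cheap to serve, here is a small sketch using the semver crate (the version list below is made up, and this is not Cargo's actual query code): only the line whose version satisfies the =a.b.c requirement would ever need its full JSON blob parsed.

```rust
use semver::{Version, VersionReq};

fn main() {
    // An exact requirement as it appears in a lock file.
    let req = VersionReq::parse("=1.0.110").unwrap();

    // Hypothetical versions listed for a crate in the index.
    let published = ["1.0.108", "1.0.109", "1.0.110", "1.0.111"];

    // Only the line whose version satisfies the requirement would need
    // its full JSON blob parsed; every other line can be skipped.
    for v in published {
        let version = Version::parse(v).unwrap();
        if req.matches(&version) {
            println!("parse the JSON blob for {v}");
        } else {
            println!("skip {v}");
        }
    }
}
```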
The crates.io index, however, is not amenable to this form of query. Instead the crates.io index is simply a file where each line is a JSON blob, aka IndexPackage. To learn about the versions in each JSON blob we would need to parse the JSON via IndexSummary::parse, defeating the purpose of trying to parse as little as possible.
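To give a concrete feel for the format, here is a hedged sketch of parsing one such line with serde and serde_json. The MiniIndexPackage struct below is an invented, heavily trimmed stand-in for the real IndexPackage, and the example line is fabricated.

```rust
use serde::Deserialize;

// A trimmed-down stand-in for the real `IndexPackage`; the actual type
// carries more fields (dependencies, features, checksum, yanked flag, ...).
#[derive(Deserialize)]
struct MiniIndexPackage {
    name: String,
    vers: String,
}

fn main() -> Result<(), serde_json::Error> {
    // One line of an index file (contents elided for brevity).
    let line = r#"{"name":"example","vers":"0.1.0","deps":[],"features":{},"cksum":"","yanked":false}"#;

    // Parsing every such line just to discover `vers` is exactly the cost
    // the cache described below is designed to avoid.
    let pkg: MiniIndexPackage = serde_json::from_str(line)?;
    println!("{} {}", pkg.name, pkg.vers);
    Ok(())
}
```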
Note that, as a small aside, even loading the JSON from the registry is actually pretty slow. For crates.io and RemoteRegistry we don’t actually check out the git index on disk, because that takes quite some time and is quite large. Instead we use libgit2 to read the JSON from the raw git objects. This in turn can be slow (i.e., show up high in profiles) because libgit2 has to do deflate decompression and such.
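As a sketch of what reading JSON straight out of the raw git objects can look like, here is a hedged example using the git2 crate. The repository path, the origin/HEAD ref, and the se/rd/serde index path are assumptions for illustration only; this is not the code Cargo itself uses.

```rust
use std::error::Error;
use std::path::Path;

fn main() -> Result<(), Box<dyn Error>> {
    // Hypothetical location of a bare clone of the crates.io index.
    let repo = git2::Repository::open("/path/to/registry/index")?;

    // Resolve the remote HEAD to a tree without checking anything out.
    let tree = repo.revparse_single("origin/HEAD")?.peel_to_tree()?;

    // Index files are laid out as e.g. "se/rd/serde" for the "serde" crate.
    let entry = tree.get_path(Path::new("se/rd/serde"))?;
    let blob = entry.to_object(&repo)?.peel_to_blob()?;

    // Each line of the blob is one JSON `IndexPackage`; libgit2 has to
    // inflate the stored object to hand us these bytes, which is the slow part.
    let contents = std::str::from_utf8(blob.content())?;
    println!("{} lines of JSON", contents.lines().count());
    Ok(())
}
```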
To solve all these issues a strategy is employed here where Cargo basically creates an index into the index. The first time a package is queried about (the first time on a given machine) Cargo will load the contents (slowly, via libgit2) from the registry. It will then (slowly) parse every single line to learn about its versions. Afterwards, however, Cargo will emit a new file (a cache, represented as SummariesCache) which is amenable to speedy parsing in future invocations.
This cache file is currently organized by extracting the semver version from each JSON blob. That way Cargo can quickly and easily learn which versions are present and which JSON blob each one is associated with. The JSON blob itself then doesn’t actually need to get parsed unless that particular version is requested.
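To make the idea concrete, here is a toy sketch of an “index into the index”. It is not the real SummariesCache on-disk format, just an in-memory illustration of mapping each extracted version string to its untouched JSON line so that a later lookup parses only the one blob it needs.

```rust
// A toy illustration of the "index into the index" idea, NOT the real
// `SummariesCache` on-disk layout: keep each version string next to its
// untouched JSON line so a lookup only parses the blob it actually needs.
use std::collections::HashMap;

struct ToyCache {
    // version string -> raw, unparsed JSON line from the registry.
    entries: HashMap<String, String>,
}

impl ToyCache {
    fn from_index_file(contents: &str) -> Self {
        // Building the cache is the slow, one-time pass: every line is
        // parsed once just to pull out its `vers` field.
        let entries = contents
            .lines()
            .filter_map(|line| {
                let value: serde_json::Value = serde_json::from_str(line).ok()?;
                let vers = value.get("vers")?.as_str()?.to_string();
                Some((vers, line.to_string()))
            })
            .collect();
        ToyCache { entries }
    }

    fn raw_blob_for(&self, version: &str) -> Option<&str> {
        // Later queries (e.g. from an exact `=a.b.c` requirement) grab the
        // matching blob without touching any of the others.
        self.entries.get(version).map(|s| s.as_str())
    }
}

fn main() {
    let index = concat!(
        r#"{"name":"example","vers":"0.1.0","deps":[],"features":{},"cksum":"","yanked":false}"#,
        "\n",
        r#"{"name":"example","vers":"0.2.0","deps":[],"features":{},"cksum":"","yanked":false}"#,
    );
    let cache = ToyCache::from_index_file(index);
    assert!(cache.raw_blob_for("0.2.0").is_some());
}
```

The real cache is a compact file on disk rather than a HashMap, but the lookup it enables is the same in spirit: find the version first, parse its JSON only if needed.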
Altogether, initial measurements of this show a massive improvement for Cargo null build performance. It’s expected that the improvements earned here will continue to grow over time, in the sense that the previous implementation (parse all lines each time) actually continues to slow down as new versions of a crate are published. In any case, when first implemented, a null build of Cargo itself would parse 3700 JSON blobs from the registry and load 150 blobs from git. Afterwards it parses 150 JSON blobs and loads 0 files from git. Removing 200ms or more from Cargo’s startup time is certainly nothing to sneeze at!
Note that this is just a high-level overview; there are of course lots of details, like cache invalidation and whatnot, which are handled below, but hopefully those are more obvious inline in the code itself.
Structs§
- CacheManager: Manages the on-disk index caches.
- SummariesCache: A representation of the cache on disk that Cargo maintains of summaries.
Constants§
- CURRENT_CACHE_VERSION: The current version of SummariesCache.