cargo/sources/registry/index/cache.rs
//! A layer of on-disk index cache for performance.
//!
//! One important aspect of the index is that we want to optimize the "happy
//! path" as much as possible. Whenever you type `cargo build` Cargo will
//! *always* reparse the registry and learn about dependency information. This
//! is done because Cargo needs to learn about the upstream crates.io crates
//! that you're using and ensure that the preexisting `Cargo.lock` still matches
//! the current state of the world.
//!
//! Consequently, Cargo "null builds" (builds where nothing has changed, which
//! measure the overhead Cargo itself adds to each build) need to be fast when
//! accessing the index. The primary performance
//! optimization here is to avoid parsing JSON blobs from the registry if we
//! don't need them. Most secondary optimizations are centered around removing
//! allocations and such, but avoiding parsing JSON is the #1 optimization.
//!
//! When we get queries from the resolver we're given a [`Dependency`]. This
//! dependency in turn has a version requirement, and when a lock file already
//! exists that requirement is an exact one, `=a.b.c`. This means that in
//! theory we only need to parse one line of JSON per query in the registry:
//! the one that matches version `a.b.c`.
//!
//! The crates.io index, however, is not amenable to this form of query. Instead
//! the crates.io index simply is a file where each line is a JSON blob, aka
//! [`IndexPackage`]. To learn about the versions in each JSON blob we would
//! need to parse the JSON via [`IndexSummary::parse`], defeating the purpose
//! of trying to parse as little as possible.
//!
//! > Note that as a small aside even *loading* the JSON from the registry is
//! > actually pretty slow. For crates.io and [`RemoteRegistry`] we don't
//! > actually check out the git index on disk because that takes quite some
//! > time and is quite large. Instead we use `libgit2` to read the JSON from
//! > the raw git objects. This in turn can be slow (aka show up high in
//! > profiles) because libgit2 has to do deflate decompression and such.
//!
//! To solve all these issues a strategy is employed here where Cargo basically
//! creates an index into the index. The first time a package is queried
//! (the first time ever on a given machine) Cargo will load the contents
//! (slowly via libgit2) from the registry. It will then (slowly) parse every
//! single line to learn about its versions. Afterwards, however, Cargo will
//! emit a new file (a cache, represented as [`SummariesCache`]) which is
//! amenable to speedy parsing in future invocations.
//!
//! This cache file is currently organized by basically having the semver
//! version extracted from each JSON blob. That way Cargo can quickly and
//! easily learn all of the versions contained and which JSON blob each is
//! associated with. A JSON blob then doesn't need to be parsed unless its
//! version is actually needed.
//!
//! Altogether the initial measurements of this show a massive improvement for
//! Cargo null build performance. It's expected that the improvements earned
//! here will continue to grow over time, since the previous implementation
//! (parse all lines each time) actually continues to slow down as new
//! versions of a crate are published. In any case, when first implemented a
//! null build of Cargo itself would parse 3700 JSON blobs from the registry
//! and load 150 blobs from git. Afterwards it parses 150 JSON blobs and loads
//! 0 files from git. Removing 200ms or more from Cargo's startup time is
//! certainly nothing to sneeze at!
//!
//! Note that this is just a high-level overview, there's of course lots of
//! details like invalidating caches and whatnot which are handled below, but
//! hopefully those are more obvious inline in the code itself.
//!
//! [`Dependency`]: crate::core::Dependency
//! [`IndexPackage`]: super::IndexPackage
//! [`IndexSummary::parse`]: super::IndexSummary::parse
//! [`RemoteRegistry`]: crate::sources::registry::remote::RemoteRegistry

use std::fs;
use std::io;
use std::path::PathBuf;
use std::str;

use anyhow::bail;
use cargo_util::registry::make_dep_path;
use semver::Version;

use crate::util::cache_lock::CacheLockMode;
use crate::util::Filesystem;
use crate::CargoResult;
use crate::GlobalContext;

use super::split;
use super::INDEX_V_MAX;

/// The current version of [`SummariesCache`].
const CURRENT_CACHE_VERSION: u8 = 3;

/// A representation of the cache on disk that Cargo maintains of summaries.
///
/// Cargo will initially parse all summaries in the registry and will then
/// serialize that into this form and place it in a new location on disk,
/// ensuring that access in the future is much speedier.
///
/// For serialization and deserialization of this on-disk index cache of
/// summaries, see [`SummariesCache::serialize`] and [`SummariesCache::parse`].
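///
/// As a rough sketch of how the two halves fit together (not a doctest, since
/// these types are internal to Cargo, and the JSON blob below is a made-up
/// stand-in for a real index line):
///
/// ```ignore
/// let blob: &[u8] = br#"{"name":"foo","vers":"1.2.3"}"#;
/// let mut cache = SummariesCache::default();
/// cache.versions.push((Version::parse("1.2.3")?, blob));
/// let bytes = cache.serialize("some-index-file-version");
/// let roundtrip = SummariesCache::parse(&bytes)?;
/// assert_eq!(roundtrip.index_version, "some-index-file-version");
/// assert_eq!(roundtrip.versions.len(), 1);
/// ```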
///
/// # The format of the index cache
///
/// The idea of this format is that it's a very easy file for Cargo to parse in
/// future invocations. The read from disk should be fast and then afterwards
/// all we need to know is what versions correspond to which JSON blob.
///
/// Currently the format looks like:
///
/// ```text
/// +---------------+----------------------+--------------------+---+
/// | cache version | index schema version | index file version | 0 |
/// +---------------+----------------------+--------------------+---+
/// ```
///
/// followed by one or more (version + JSON blob) pairs...
///
/// ```text
/// +----------------+---+-----------+---+
/// | semver version | 0 | JSON blob | 0 | ...
/// +----------------+---+-----------+---+
/// ```
///
/// Each field represents:
///
/// * _cache version_ --- Intended to ensure that there's some level of
/// future compatibility against changes to this cache format so if different
/// versions of Cargo share the same cache they don't get too confused.
/// * _index schema version_ --- The schema version of the raw index file.
/// See [`IndexPackage::v`] for the detail.
/// * _index file version_ --- Tracks when a cache needs to be regenerated.
/// A cache regeneration is required whenever the index file itself updates.
/// * _semver version_ --- The version for each JSON blob. Extracted from the
/// blob for fast queries without parsing the entire blob.
/// * _JSON blob_ --- The actual metadata for each version of the package. It
/// has the same representation as [`IndexPackage`].
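///
/// As a concrete (hypothetical) illustration, a cache holding one version
/// `1.2.3` of a package would be laid out roughly like this, where `\0` is a
/// zero byte, the first byte is [`CURRENT_CACHE_VERSION`], and the next four
/// bytes are the little-endian index schema version:
///
/// ```text
/// 0x03 [4-byte schema version] <index file version> \0 1.2.3 \0 <JSON blob> \0
/// ```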
///
/// # Changes between each cache version
///
/// * `1`: The original version.
/// * `2`: Added the "index schema version" field so that if the index schema
/// changes, different versions of cargo won't get confused reading each
/// other's caches.
/// * `3`: Bumped the version to work around an issue where multiple versions of
/// a package were published that differ only by semver metadata. For
/// example, openssl-src 110.0.0 and 110.0.0+1.1.0f. Previously, the cache
/// would be incorrectly populated with two entries, both 110.0.0. After
/// this, the metadata will be correctly included. This isn't really a format
/// change, just a version bump to clear the incorrect cache entries. Note:
/// the index shouldn't allow these, but unfortunately crates.io doesn't
/// check it.
///
/// See [`CURRENT_CACHE_VERSION`] for the current cache version.
///
/// [`IndexPackage::v`]: super::IndexPackage::v
/// [`IndexPackage`]: super::IndexPackage
#[derive(Default)]
pub struct SummariesCache<'a> {
    /// JSON blobs of the summaries. Each JSON blob is stored beside its
    /// [`Version`], so that Cargo can query a version without parsing the
    /// full JSON.
pub versions: Vec<(Version, &'a [u8])>,
    /// For cache invalidation, we track the index file version to determine
/// when to regenerate the cache itself.
pub index_version: &'a str,
}

impl<'a> SummariesCache<'a> {
/// Deserializes an on-disk cache.
pub fn parse(data: &'a [u8]) -> CargoResult<SummariesCache<'a>> {
// NB: keep this method in sync with `serialize` below
let (first_byte, rest) = data
.split_first()
.ok_or_else(|| anyhow::format_err!("malformed cache"))?;
if *first_byte != CURRENT_CACHE_VERSION {
bail!("looks like a different Cargo's cache, bailing out");
}
let index_v_bytes = rest
.get(..4)
.ok_or_else(|| anyhow::anyhow!("cache expected 4 bytes for index schema version"))?;
let index_v = u32::from_le_bytes(index_v_bytes.try_into().unwrap());
if index_v != INDEX_V_MAX {
bail!(
"index schema version {index_v} doesn't match the version I know ({INDEX_V_MAX})",
);
}
let rest = &rest[4..];
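        // Everything after the header is a sequence of zero-byte-terminated
        // fields: first the index file version, then alternating (semver
        // version, JSON blob) pairs.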
let mut iter = split(rest, 0);
let last_index_update = if let Some(update) = iter.next() {
str::from_utf8(update)?
} else {
bail!("malformed file");
};
let mut ret = SummariesCache::default();
ret.index_version = last_index_update;
while let Some(version) = iter.next() {
let version = str::from_utf8(version)?;
let version = Version::parse(version)?;
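            // `serialize` always writes a JSON blob after each version, so a
            // missing blob here means a corrupt cache; the `unwrap` turns
            // that into a loud panic rather than silently truncating.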
let summary = iter.next().unwrap();
ret.versions.push((version, summary));
}
Ok(ret)
    }

/// Serializes itself with a given `index_version`.
pub fn serialize(&self, index_version: &str) -> Vec<u8> {
// NB: keep this method in sync with `parse` above
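        // Rough capacity estimate: ~10 bytes per entry approximates the
        // version string plus its zero-byte separators; the blob length is
        // exact. The small header is ignored here.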
let size = self
.versions
.iter()
.map(|(_version, data)| (10 + data.len()))
.sum();
let mut contents = Vec::with_capacity(size);
contents.push(CURRENT_CACHE_VERSION);
contents.extend(&u32::to_le_bytes(INDEX_V_MAX));
contents.extend_from_slice(index_version.as_bytes());
contents.push(0);
for (version, data) in self.versions.iter() {
contents.extend_from_slice(version.to_string().as_bytes());
contents.push(0);
contents.extend_from_slice(data);
contents.push(0);
}
contents
}
}

/// Manages the on-disk index caches.
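///
/// A hypothetical usage sketch (the real call sites live in the surrounding
/// index code; `cache_root`, `gctx`, and the serialized bytes all come from
/// there, and the key is the value the index passes to [`make_dep_path`],
/// e.g. a package name):
///
/// ```ignore
/// let manager = CacheManager::new(cache_root, gctx);
/// match manager.get("serde") {
///     Some(bytes) => { /* parse with `SummariesCache::parse` */ }
///     None => manager.put("serde", &freshly_serialized_cache),
/// }
/// ```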
pub struct CacheManager<'gctx> {
/// The root path where caches are located.
cache_root: Filesystem,
/// [`GlobalContext`] reference for convenience.
gctx: &'gctx GlobalContext,
}

impl<'gctx> CacheManager<'gctx> {
/// Creates a new instance of the on-disk index cache manager.
///
    /// `cache_root` --- The root path where caches are located.
pub fn new(cache_root: Filesystem, gctx: &'gctx GlobalContext) -> CacheManager<'gctx> {
CacheManager { cache_root, gctx }
    }

/// Gets the cache associated with the key.
pub fn get(&self, key: &str) -> Option<Vec<u8>> {
let cache_path = &self.cache_path(key);
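        // Any read failure (most commonly the file simply not existing yet)
        // is treated as a cache miss; the caller will reparse the index and
        // repopulate the cache.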
match fs::read(cache_path) {
Ok(contents) => Some(contents),
Err(e) => {
tracing::debug!(?cache_path, "cache missing: {e}");
None
}
}
    }

/// Associates the value with the key.
pub fn put(&self, key: &str, value: &[u8]) {
let cache_path = &self.cache_path(key);
if fs::create_dir_all(cache_path.parent().unwrap()).is_ok() {
let path = Filesystem::new(cache_path.clone());
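            // The caller is expected to already hold the package cache lock;
            // this call only asserts that, it does not acquire the lock.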
self.gctx
.assert_package_cache_locked(CacheLockMode::DownloadExclusive, &path);
if let Err(e) = fs::write(cache_path, value) {
tracing::info!(?cache_path, "failed to write cache: {e}");
}
}
    }

/// Invalidates the cache associated with the key.
pub fn invalidate(&self, key: &str) {
let cache_path = &self.cache_path(key);
if let Err(e) = fs::remove_file(cache_path) {
if e.kind() != io::ErrorKind::NotFound {
tracing::debug!(?cache_path, "failed to remove from cache: {e}");
}
}
    }

fn cache_path(&self, key: &str) -> PathBuf {
let relative = make_dep_path(key, false);
        // This is the file we load the cached index data from.
// See module comment in `registry/mod.rs` for why this is structured
// the way it is.
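        // For example, a key like "serde" maps to "se/rd/serde" under the
        // cache root, mirroring the layout of the index itself.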
self.cache_root.join(relative).into_path_unlocked()
}
}