cargo/sources/registry/index/
cache.rs

//! A layer of on-disk index cache for performance.
//!
//! One important aspect of the index is that we want to optimize the "happy
//! path" as much as possible. Whenever you type `cargo build` Cargo will
//! *always* reparse the registry and learn about dependency information. This
//! is done because Cargo needs to learn about the upstream crates.io crates
//! that you're using and ensure that the preexisting `Cargo.lock` still matches
//! the current state of the world.
//!
//! Consequently, Cargo "null builds" (the overhead that Cargo adds to each
//! build itself) need to be fast when accessing the index. The primary
//! performance optimization here is to avoid parsing JSON blobs from the
//! registry if we don't need them. Most secondary optimizations are centered
//! around removing allocations and such, but avoiding parsing JSON is the #1
//! optimization.
//!
//! When we get queries from the resolver we're given a [`Dependency`]. This
//! dependency in turn has a version requirement, and with lock files that
//! already exist these version requirements are exact version requirements
//! `=a.b.c`. This means that we in theory only need to parse one line of JSON
//! per query in the registry, the one that matches version `a.b.c`.
//!
//! The crates.io index, however, is not amenable to this form of query. Instead
//! the crates.io index simply is a file where each line is a JSON blob, aka
//! [`IndexPackage`]. To learn about the versions in each JSON blob we would
//! need to parse the JSON via [`IndexSummary::parse`], defeating the purpose
//! of trying to parse as little as possible.
//!
//! > Note that as a small aside even *loading* the JSON from the registry is
//! > actually pretty slow. For crates.io and [`RemoteRegistry`] we don't
//! > actually check out the git index on disk because that takes quite some
//! > time and is quite large. Instead we use `libgit2` to read the JSON from
//! > the raw git objects. This in turn can be slow (aka show up high in
//! > profiles) because libgit2 has to do deflate decompression and such.
//!
//! To solve all these issues a strategy is employed here where Cargo basically
//! creates an index into the index. The first time a package is queried about
//! (first time being for an entire computer) Cargo will load the contents
//! (slowly via libgit2) from the registry. It will then (slowly) parse every
//! single line to learn about its versions. Afterwards, however, Cargo will
//! emit a new file (a cache, represented as [`SummariesCache`]) which is
//! amenable for speedily parsing in future invocations.
//!
//! This cache file is currently organized by basically having the semver
//! version extracted from each JSON blob. That way Cargo can quickly and
//! easily parse all versions contained and which JSON blob they're associated
//! with. The JSON blob then doesn't actually need to get parsed unless its
//! version matches the query.
//!
//! Altogether the initial measurements of this show a massive improvement for
//! Cargo null build performance. It's expected that the improvements earned
//! here will continue to grow over time in the sense that the previous
//! implementation (parse all lines each time) actually continues to slow down
//! over time as new versions of a crate are published. In any case when first
//! implemented a null build of Cargo itself would parse 3700 JSON blobs from
//! the registry and load 150 blobs from git. Afterwards it parses 150 JSON
//! blobs and loads 0 files from git. Removing 200ms or more from Cargo's
//! startup time is certainly nothing to sneeze at!
//!
//! Note that this is just a high-level overview. There are of course lots of
//! details like invalidating caches and whatnot which are handled below, but
//! hopefully those are more obvious inline in the code itself.
//!
//! [`Dependency`]: crate::core::Dependency
//! [`IndexPackage`]: super::IndexPackage
//! [`IndexSummary::parse`]: super::IndexSummary::parse
//! [`RemoteRegistry`]: crate::sources::registry::remote::RemoteRegistry

use std::cell::RefCell;
use std::fs;
use std::io;
use std::path::PathBuf;
use std::str;

use anyhow::bail;
use cargo_util::registry::make_dep_path;
use semver::Version;

use crate::util::cache_lock::CacheLockMode;
use crate::util::Filesystem;
use crate::CargoResult;
use crate::GlobalContext;

use super::split;
use super::INDEX_V_MAX;

/// The current version of [`SummariesCache`].
const CURRENT_CACHE_VERSION: u8 = 3;

/// A representation of the cache on disk that Cargo maintains of summaries.
///
/// Cargo will initially parse all summaries in the registry and will then
/// serialize that into this form and place it in a new location on disk,
/// ensuring that access in the future is much speedier.
///
/// For serialization and deserialization of this on-disk index cache of
/// summaries, see [`SummariesCache::serialize`] and [`SummariesCache::parse`].
///
/// # The format of the index cache
///
/// The idea of this format is that it's a very easy file for Cargo to parse in
/// future invocations. The read from disk should be fast and then afterwards
/// all we need to know is what versions correspond to which JSON blob.
///
/// Currently the format looks like:
///
/// ```text
/// +---------------+----------------------+--------------------+---+
/// | cache version | index schema version | index file version | 0 |
/// +---------------+----------------------+--------------------+---+
/// ```
///
/// followed by one or more (version + JSON blob) pairs...
///
/// ```text
/// +----------------+---+-----------+---+
/// | semver version | 0 | JSON blob | 0 | ...
/// +----------------+---+-----------+---+
/// ```
///
/// Each field represents:
///
/// * _cache version_ --- Intended to ensure that there's some level of
///   future compatibility against changes to this cache format so if different
///   versions of Cargo share the same cache they don't get too confused.
/// * _index schema version_ --- The schema version of the raw index file.
///   See [`IndexPackage::v`] for details.
/// * _index file version_ --- Tracks when a cache needs to be regenerated.
///   A cache regeneration is required whenever the index file itself updates.
/// * _semver version_ --- The version for each JSON blob. Extracted from the
///   blob for fast queries without parsing the entire blob.
/// * _JSON blob_ --- The actual metadata for each version of the package. It
///   has the same representation as [`IndexPackage`].
///
/// # Changes between each cache version
///
/// * `1`: The original version.
/// * `2`: Added the "index schema version" field so that if the index schema
///   changes, different versions of cargo won't get confused reading each
///   other's caches.
/// * `3`: Bumped the version to work around an issue where multiple versions of
///   a package were published that differ only by semver metadata. For
///   example, openssl-src 110.0.0 and 110.0.0+1.1.0f. Previously, the cache
///   would be incorrectly populated with two entries, both 110.0.0. After
///   this, the metadata will be correctly included. This isn't really a format
///   change, just a version bump to clear the incorrect cache entries. Note:
///   the index shouldn't allow these, but unfortunately crates.io doesn't
///   check it.
///
/// See [`CURRENT_CACHE_VERSION`] for the current cache version.
///
/// [`IndexPackage::v`]: super::IndexPackage::v
/// [`IndexPackage`]: super::IndexPackage
#[derive(Default)]
pub struct SummariesCache<'a> {
    /// JSON blobs of the summaries. Each JSON blob has a [`Version`] beside it,
    /// so that Cargo can query a version without full JSON parsing.
    pub versions: Vec<(Version, &'a [u8])>,
    /// For cache invalidation, we track the index file version to determine
    /// when to regenerate the cache itself.
    pub index_version: &'a str,
}

impl<'a> SummariesCache<'a> {
    /// Deserializes an on-disk cache.
    pub fn parse(data: &'a [u8]) -> CargoResult<SummariesCache<'a>> {
        // NB: keep this method in sync with `serialize` below
        let (first_byte, rest) = data
            .split_first()
            .ok_or_else(|| anyhow::format_err!("malformed cache"))?;
        if *first_byte != CURRENT_CACHE_VERSION {
            bail!("looks like a different Cargo's cache, bailing out");
        }
        let index_v_bytes = rest
            .get(..4)
            .ok_or_else(|| anyhow::anyhow!("cache expected 4 bytes for index schema version"))?;
        let index_v = u32::from_le_bytes(index_v_bytes.try_into().unwrap());
        if index_v != INDEX_V_MAX {
            bail!(
                "index schema version {index_v} doesn't match the version I know ({INDEX_V_MAX})",
            );
        }
        let rest = &rest[4..];

        let mut iter = split(rest, 0);
        let last_index_update = if let Some(update) = iter.next() {
            str::from_utf8(update)?
        } else {
            bail!("malformed file");
        };
        let mut ret = SummariesCache::default();
        ret.index_version = last_index_update;
        while let Some(version) = iter.next() {
            let version = str::from_utf8(version)?;
            let version = Version::parse(version)?;
            // A version without a following blob means the file is truncated
            // or corrupt; report an error instead of panicking.
            let summary = iter
                .next()
                .ok_or_else(|| anyhow::format_err!("malformed cache: version without a JSON blob"))?;
            ret.versions.push((version, summary));
        }
        Ok(ret)
    }

    /// Serializes itself with a given `index_version`.
    pub fn serialize(&self, index_version: &str) -> Vec<u8> {
        // NB: keep this method in sync with `parse` above
        //
        // Pre-size the buffer: each entry contributes its blob plus a small
        // allowance (10 bytes) for the version string and separators. This is
        // only an estimate; the header is not counted.
        let size = self
            .versions
            .iter()
            .map(|(_version, data)| (10 + data.len()))
            .sum();
        let mut contents = Vec::with_capacity(size);
        contents.push(CURRENT_CACHE_VERSION);
        contents.extend(&u32::to_le_bytes(INDEX_V_MAX));
        contents.extend_from_slice(index_version.as_bytes());
        contents.push(0);
        for (version, data) in self.versions.iter() {
            contents.extend_from_slice(version.to_string().as_bytes());
            contents.push(0);
            contents.extend_from_slice(data);
            contents.push(0);
        }
        contents
    }
}

/// Manages the on-disk index caches.
pub struct CacheManager<'gctx> {
    /// The root path where caches are located.
    cache_root: Filesystem,
    /// [`GlobalContext`] reference for convenience.
    gctx: &'gctx GlobalContext,
    /// Tracks whether a warning has already been emitted for a failed cache update.
    /// The motivation is to avoid warning spam if the cache is not writable.
    has_warned: RefCell<bool>,
}

impl<'gctx> CacheManager<'gctx> {
    /// Creates a new instance of the on-disk index cache manager.
    ///
    /// `cache_root` --- The root path where caches are located.
    pub fn new(cache_root: Filesystem, gctx: &'gctx GlobalContext) -> CacheManager<'gctx> {
        CacheManager {
            cache_root,
            gctx,
            has_warned: Default::default(),
        }
    }

    /// Gets the cache associated with the key.
    pub fn get(&self, key: &str) -> Option<Vec<u8>> {
        let cache_path = &self.cache_path(key);
        match fs::read(cache_path) {
            Ok(contents) => Some(contents),
            Err(e) => {
                tracing::debug!(?cache_path, "cache missing: {e}");
                None
            }
        }
    }

    /// Associates the value with the key.
    pub fn put(&self, key: &str, value: &[u8]) {
        let cache_path = &self.cache_path(key);
        if let Err(e) = self.put_inner(cache_path, value) {
            tracing::info!(?cache_path, "failed to write cache: {e}");

            if !*self.has_warned.borrow() {
                let _ = self.gctx.shell().warn(format!(
                    "failed to write cache, path: {}, error: {e}",
                    cache_path.to_str().unwrap_or_default()
                ));
                *self.has_warned.borrow_mut() = true;
            }
        }
    }

    fn put_inner(&self, cache_path: &PathBuf, value: &[u8]) -> std::io::Result<()> {
        fs::create_dir_all(cache_path.parent().unwrap())?;
        let path = Filesystem::new(cache_path.clone());
        self.gctx
            .assert_package_cache_locked(CacheLockMode::DownloadExclusive, &path);
        fs::write(cache_path, value)?;
        Ok(())
    }

    /// Invalidates the cache associated with the key.
    pub fn invalidate(&self, key: &str) {
        let cache_path = &self.cache_path(key);
        if let Err(e) = fs::remove_file(cache_path) {
            if e.kind() != io::ErrorKind::NotFound {
                tracing::debug!(?cache_path, "failed to remove from cache: {e}");
            }
        }
    }

    fn cache_path(&self, key: &str) -> PathBuf {
        let relative = make_dep_path(key, false);
        // This is the file we're loading from cache or the index data.
        // See module comment in `registry/mod.rs` for why this is structured
        // the way it is.
        self.cache_root.join(relative).into_path_unlocked()
    }
}
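
The byte layout documented on `SummariesCache` can be tried out in isolation. The following is a minimal, standard-library-only sketch of that layout, not Cargo's implementation: `encode`, `decode`, `utf8`, `CACHE_VERSION` and `SCHEMA_VERSION` are hypothetical names, the schema version value is a placeholder, and versions are kept as plain strings instead of `semver::Version`. It assumes, as the format itself does, that neither version strings nor JSON blobs contain a NUL byte.

```rust
// Hypothetical stand-ins for `CURRENT_CACHE_VERSION` and `INDEX_V_MAX`.
const CACHE_VERSION: u8 = 3;
const SCHEMA_VERSION: u32 = 2; // placeholder value, assumed for illustration

/// Writes the header, then NUL-terminated (version, blob) pairs.
fn encode(index_version: &str, entries: &[(&str, &[u8])]) -> Vec<u8> {
    let mut out = vec![CACHE_VERSION];
    out.extend_from_slice(&SCHEMA_VERSION.to_le_bytes());
    out.extend_from_slice(index_version.as_bytes());
    out.push(0);
    for (version, blob) in entries {
        out.extend_from_slice(version.as_bytes());
        out.push(0);
        out.extend_from_slice(blob);
        out.push(0);
    }
    out
}

/// Interprets a field as UTF-8, converting the error into a `String`.
fn utf8(bytes: &[u8]) -> Result<&str, String> {
    std::str::from_utf8(bytes).map_err(|e| e.to_string())
}

/// Reads the layout produced by `encode`, borrowing everything from `data`.
/// Blobs are returned as raw bytes: no JSON is parsed here.
fn decode(data: &[u8]) -> Result<(&str, Vec<(&str, &[u8])>), String> {
    let (&cache_version, rest) = data.split_first().ok_or("empty cache")?;
    if cache_version != CACHE_VERSION {
        return Err(format!("unexpected cache version {cache_version}"));
    }
    if rest.len() < 4 {
        return Err("truncated schema version".to_string());
    }
    let (schema, rest) = rest.split_at(4);
    let schema = u32::from_le_bytes([schema[0], schema[1], schema[2], schema[3]]);
    if schema != SCHEMA_VERSION {
        return Err(format!("unexpected schema version {schema}"));
    }
    // Drop the final terminator so that splitting on NUL does not yield a
    // trailing empty field.
    let body = rest.strip_suffix(&[0u8]).unwrap_or(rest);
    let mut fields = body.split(|&byte| byte == 0);
    let index_version = utf8(fields.next().ok_or("missing index file version")?)?;
    let mut entries = Vec::new();
    while let Some(version) = fields.next() {
        // Every version must be followed by its blob.
        let blob = fields.next().ok_or("version without a JSON blob")?;
        entries.push((utf8(version)?, blob));
    }
    Ok((index_version, entries))
}
```

Calling `decode(&encode("abc123", &[("1.0.0", blob)]))` should hand back the same index file version and the same (version, blob) pair. The point of the layout is visible in `decode`: a reader only has to compare the short version strings and can defer parsing a blob until a query actually selects it.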