cargo/sources/registry/index/cache.rs
//! A layer of on-disk index cache for performance.
//!
//! One important aspect of the index is that we want to optimize the "happy
//! path" as much as possible. Whenever you type `cargo build` Cargo will
//! *always* reparse the registry and learn about dependency information. This
//! is done because Cargo needs to learn about the upstream crates.io crates
//! that you're using and ensure that the preexisting `Cargo.lock` still matches
//! the current state of the world.
//!
//! Consequently, Cargo "null builds" (the overhead that Cargo adds to each
//! build itself) need to be fast when accessing the index. The primary
//! performance optimization here is to avoid parsing JSON blobs from the
//! registry if we don't need them. Most secondary optimizations are centered
//! around removing allocations and such, but avoiding parsing JSON is the #1
//! optimization.
//!
//! When we get queries from the resolver we're given a [`Dependency`]. This
//! dependency in turn has a version requirement, and with lock files that
//! already exist these version requirements are exact version requirements
//! `=a.b.c`. This means that we in theory only need to parse one line of JSON
//! per query in the registry, the one that matches version `a.b.c`.
//!
//! The crates.io index, however, is not amenable to this form of query. Instead
//! the crates.io index simply is a file where each line is a JSON blob, aka
//! [`IndexPackage`]. To learn about the versions in each JSON blob we would
//! need to parse the JSON via [`IndexSummary::parse`], defeating the purpose
//! of trying to parse as little as possible.
//!
//! > Note that as a small aside even *loading* the JSON from the registry is
//! > actually pretty slow. For crates.io and [`RemoteRegistry`] we don't
//! > actually check out the git index on disk because that takes quite some
//! > time and is quite large. Instead we use `libgit2` to read the JSON from
//! > the raw git objects. This in turn can be slow (aka show up high in
//! > profiles) because libgit2 has to do deflate decompression and such.
//!
//! To solve all these issues a strategy is employed here where Cargo basically
//! creates an index into the index. The first time a package is queried about
//! (first time being for an entire computer) Cargo will load the contents
//! (slowly via libgit2) from the registry. It will then (slowly) parse every
//! single line to learn about its versions. Afterwards, however, Cargo will
//! emit a new file (a cache, represented as [`SummariesCache`]) which is
//! amenable for speedily parsing in future invocations.
//!
//! This cache file is currently organized by basically having the semver
//! version extracted from each JSON blob. That way Cargo can quickly and
//! easily parse all versions contained and which JSON blob they're associated
//! with. The JSON blob then doesn't actually need to get parsed unless its
//! version is actually needed.
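//!
//! As a rough illustration of that idea (not the code Cargo uses), the sketch
//! below keeps each version next to its unparsed blob and only "parses" the
//! blob whose version matches the query. The names `parse_blob` and `lookup`
//! are hypothetical, versions are compared as plain strings rather than as
//! `semver::Version`, and the parse step is a stand-in that merely counts how
//! often it is called.
//!
//! ```rust
//! /// Hypothetical stand-in for an expensive JSON parse; it only counts calls.
//! fn parse_blob(blob: &[u8], parsed: &mut usize) -> String {
//!     *parsed += 1;
//!     String::from_utf8_lossy(blob).into_owned()
//! }
//!
//! /// Returns the parsed blob for `wanted`, leaving every other blob untouched.
//! fn lookup(entries: &[(&str, &[u8])], wanted: &str, parsed: &mut usize) -> Option<String> {
//!     entries
//!         .iter()
//!         .find(|(version, _)| *version == wanted)
//!         .map(|(_, blob)| parse_blob(blob, parsed))
//! }
//!
//! fn main() {
//!     let entries: Vec<(&str, &[u8])> = vec![
//!         ("1.0.0", r#"{"vers":"1.0.0"}"#.as_bytes()),
//!         ("1.0.1", r#"{"vers":"1.0.1"}"#.as_bytes()),
//!         ("1.1.0", r#"{"vers":"1.1.0"}"#.as_bytes()),
//!     ];
//!     let mut parsed = 0;
//!     let hit = lookup(&entries, "1.0.1", &mut parsed);
//!     assert_eq!(hit.as_deref(), Some(r#"{"vers":"1.0.1"}"#));
//!     // Only the matching blob went through the (pretend) parser.
//!     assert_eq!(parsed, 1);
//! }
//! ```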
//!
//! Altogether the initial measurements of this show a massive improvement for
//! Cargo null build performance. It's expected that the improvements earned
//! here will continue to grow over time in the sense that the previous
//! implementation (parse all lines each time) actually continues to slow down
//! over time as new versions of a crate are published. In any case when first
//! implemented a null build of Cargo itself would parse 3700 JSON blobs from
//! the registry and load 150 blobs from git. Afterwards it parses 150 JSON
//! blobs and loads 0 files from git. Removing 200ms or more from Cargo's
//! startup time is certainly nothing to sneeze at!
//!
//! Note that this is just a high-level overview. There are of course lots of
//! details, like invalidating caches, which are handled below, but hopefully
//! those are more obvious inline in the code itself.
//!
//! [`Dependency`]: crate::core::Dependency
//! [`IndexPackage`]: super::IndexPackage
//! [`IndexSummary::parse`]: super::IndexSummary::parse
//! [`RemoteRegistry`]: crate::sources::registry::remote::RemoteRegistry

use std::cell::RefCell;
use std::fs;
use std::io;
use std::path::PathBuf;
use std::str;

use anyhow::bail;
use cargo_util::registry::make_dep_path;
use semver::Version;

use crate::util::cache_lock::CacheLockMode;
use crate::util::Filesystem;
use crate::CargoResult;
use crate::GlobalContext;

use super::split;
use super::INDEX_V_MAX;

/// The current version of [`SummariesCache`].
const CURRENT_CACHE_VERSION: u8 = 3;

/// A representation of the cache on disk that Cargo maintains of summaries.
///
/// Cargo will initially parse all summaries in the registry and will then
/// serialize that into this form and place it in a new location on disk,
/// ensuring that access in the future is much speedier.
///
/// For serialization and deserialization of this on-disk index cache of
/// summaries, see [`SummariesCache::serialize`] and [`SummariesCache::parse`].
///
/// # The format of the index cache
///
/// The idea of this format is that it's a very easy file for Cargo to parse in
/// future invocations. The read from disk should be fast and then afterwards
/// all we need to know is what versions correspond to which JSON blob.
///
/// Currently the format looks like:
///
/// ```text
/// +---------------+----------------------+--------------------+---+
/// | cache version | index schema version | index file version | 0 |
/// +---------------+----------------------+--------------------+---+
/// ```
///
/// followed by one or more (version + JSON blob) pairs...
///
/// ```text
/// +----------------+---+-----------+---+
/// | semver version | 0 | JSON blob | 0 | ...
/// +----------------+---+-----------+---+
/// ```
///
/// Each field represents:
///
/// * _cache version_ --- Intended to ensure that there's some level of
///   future compatibility against changes to this cache format so if different
///   versions of Cargo share the same cache they don't get too confused.
/// * _index schema version_ --- The schema version of the raw index file.
///   See [`IndexPackage::v`] for the detail.
/// * _index file version_ --- Tracks when a cache needs to be regenerated.
///   A cache regeneration is required whenever the index file itself updates.
/// * _semver version_ --- The version for each JSON blob. Extracted from the
///   blob for fast queries without parsing the entire blob.
/// * _JSON blob_ --- The actual metadata for each version of the package. It
///   has the same representation as [`IndexPackage`].
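///
/// The layout can be exercised with a small standalone round trip. This is
/// only a sketch under stated assumptions, not the implementation in this
/// file: `encode` and `decode` are hypothetical helpers, the two constants
/// hold assumed values, versions are kept as plain strings instead of
/// [`Version`], and errors collapse into `None`.
///
/// ```rust
/// // Assumed values for illustration only; the real constants live in Cargo.
/// const CACHE_VERSION: u8 = 3;
/// const INDEX_SCHEMA_VERSION: u32 = 2;
///
/// /// Writes the header, then each (version, blob) pair, all NUL-terminated.
/// fn encode(index_version: &str, entries: &[(&str, &[u8])]) -> Vec<u8> {
///     let mut out = vec![CACHE_VERSION];
///     out.extend_from_slice(&INDEX_SCHEMA_VERSION.to_le_bytes());
///     out.extend_from_slice(index_version.as_bytes());
///     out.push(0);
///     for (version, blob) in entries {
///         out.extend_from_slice(version.as_bytes());
///         out.push(0);
///         out.extend_from_slice(blob);
///         out.push(0);
///     }
///     out
/// }
///
/// /// Reads the layout back, borrowing every field from `data`.
/// fn decode(data: &[u8]) -> Option<(&str, Vec<(&str, &[u8])>)> {
///     let (&cache_version, rest) = data.split_first()?;
///     if cache_version != CACHE_VERSION {
///         return None;
///     }
///     let header = rest.get(..4)?;
///     let schema = u32::from_le_bytes([header[0], header[1], header[2], header[3]]);
///     if schema != INDEX_SCHEMA_VERSION {
///         return None;
///     }
///     // Drop the trailing NUL so that `split` does not yield an empty tail.
///     let body = rest[4..].strip_suffix(&[0])?;
///     let mut fields = body.split(|&b| b == 0);
///     let index_version = std::str::from_utf8(fields.next()?).ok()?;
///     let mut entries = Vec::new();
///     while let Some(version) = fields.next() {
///         let version = std::str::from_utf8(version).ok()?;
///         // A version without a following blob means the data is truncated.
///         entries.push((version, fields.next()?));
///     }
///     Some((index_version, entries))
/// }
///
/// fn main() {
///     let blob: &[u8] = br#"{"name":"demo","vers":"1.0.0"}"#;
///     let bytes = encode("abc123", &[("1.0.0", blob)]);
///     let (index_version, entries) = decode(&bytes).expect("round trip");
///     assert_eq!(index_version, "abc123");
///     assert_eq!(entries, vec![("1.0.0", blob)]);
/// }
/// ```
///
/// Using NUL as the separator works here because neither a semver string nor
/// a line of JSON text is expected to contain a zero byte.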
///
/// # Changes between each cache version
///
/// * `1`: The original version.
/// * `2`: Added the "index schema version" field so that if the index schema
///   changes, different versions of cargo won't get confused reading each
///   other's caches.
/// * `3`: Bumped the version to work around an issue where multiple versions of
///   a package were published that differ only by semver metadata. For
///   example, openssl-src 110.0.0 and 110.0.0+1.1.0f. Previously, the cache
///   would be incorrectly populated with two entries, both 110.0.0. After
///   this, the metadata will be correctly included. This isn't really a format
///   change, just a version bump to clear the incorrect cache entries. Note:
///   the index shouldn't allow these, but unfortunately crates.io doesn't
///   check it.
///
/// See [`CURRENT_CACHE_VERSION`] for the current cache version.
///
/// [`IndexPackage::v`]: super::IndexPackage::v
/// [`IndexPackage`]: super::IndexPackage
#[derive(Default)]
pub struct SummariesCache<'a> {
    /// JSON blobs of the summaries. Each JSON blob has a [`Version`] beside it,
    /// so that Cargo can query a version without full JSON parsing.
    pub versions: Vec<(Version, &'a [u8])>,
    /// For cache invalidation, we track the index file version to determine
    /// when to regenerate the cache itself.
    pub index_version: &'a str,
}

impl<'a> SummariesCache<'a> {
    /// Deserializes an on-disk cache.
    pub fn parse(data: &'a [u8]) -> CargoResult<SummariesCache<'a>> {
        // NB: keep this method in sync with `serialize` below
        let (first_byte, rest) = data
            .split_first()
            .ok_or_else(|| anyhow::format_err!("malformed cache"))?;
        if *first_byte != CURRENT_CACHE_VERSION {
            bail!("looks like a different Cargo's cache, bailing out");
        }
        let index_v_bytes = rest
            .get(..4)
            .ok_or_else(|| anyhow::anyhow!("cache expected 4 bytes for index schema version"))?;
        let index_v = u32::from_le_bytes(index_v_bytes.try_into().unwrap());
        if index_v != INDEX_V_MAX {
            bail!(
                "index schema version {index_v} doesn't match the version I know ({INDEX_V_MAX})",
            );
        }
        let rest = &rest[4..];

        let mut iter = split(rest, 0);
        let last_index_update = if let Some(update) = iter.next() {
            str::from_utf8(update)?
        } else {
            bail!("malformed file");
        };
        let mut ret = SummariesCache::default();
        ret.index_version = last_index_update;
        while let Some(version) = iter.next() {
            let version = str::from_utf8(version)?;
            let version = Version::parse(version)?;
            let summary = iter.next().unwrap();
            ret.versions.push((version, summary));
        }
        Ok(ret)
    }

    /// Serializes itself with a given `index_version`.
    pub fn serialize(&self, index_version: &str) -> Vec<u8> {
        // NB: keep this method in sync with `parse` above
        let size = self
            .versions
            .iter()
            .map(|(_version, data)| (10 + data.len()))
            .sum();
        let mut contents = Vec::with_capacity(size);
        contents.push(CURRENT_CACHE_VERSION);
        contents.extend(&u32::to_le_bytes(INDEX_V_MAX));
        contents.extend_from_slice(index_version.as_bytes());
        contents.push(0);
        for (version, data) in self.versions.iter() {
            contents.extend_from_slice(version.to_string().as_bytes());
            contents.push(0);
            contents.extend_from_slice(data);
            contents.push(0);
        }
        contents
    }
}

/// Manages the on-disk index caches.
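///
/// The get/put/invalidate contract can be sketched with the standard library
/// alone. The `TinyCache` type below is hypothetical and deliberately
/// simplified: it stores one file per key directly under `root` instead of
/// using the `make_dep_path` layout, takes no cache lock, and returns write
/// errors to the caller rather than warning once and carrying on.
///
/// ```rust
/// use std::fs;
/// use std::io;
/// use std::path::PathBuf;
///
/// /// Hypothetical, simplified cache: one file per key directly under `root`.
/// struct TinyCache {
///     root: PathBuf,
/// }
///
/// impl TinyCache {
///     /// A read failure of any kind is treated as a cache miss.
///     fn get(&self, key: &str) -> Option<Vec<u8>> {
///         fs::read(self.root.join(key)).ok()
///     }
///
///     /// Creates the cache directory on demand before writing.
///     fn put(&self, key: &str, value: &[u8]) -> io::Result<()> {
///         fs::create_dir_all(&self.root)?;
///         fs::write(self.root.join(key), value)
///     }
///
///     /// Removing an entry that does not exist is not an error.
///     fn invalidate(&self, key: &str) -> io::Result<()> {
///         match fs::remove_file(self.root.join(key)) {
///             Err(e) if e.kind() != io::ErrorKind::NotFound => Err(e),
///             _ => Ok(()),
///         }
///     }
/// }
///
/// fn main() -> io::Result<()> {
///     let root = std::env::temp_dir().join(format!("tiny-cache-{}", std::process::id()));
///     let cache = TinyCache { root: root.clone() };
///     assert_eq!(cache.get("demo"), None);
///     cache.put("demo", b"payload")?;
///     assert_eq!(cache.get("demo"), Some(b"payload".to_vec()));
///     cache.invalidate("demo")?;
///     cache.invalidate("demo")?; // second call hits NotFound and still succeeds
///     assert_eq!(cache.get("demo"), None);
///     fs::remove_dir_all(root)
/// }
/// ```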
pub struct CacheManager<'gctx> {
    /// The root path where caches are located.
    cache_root: Filesystem,
    /// [`GlobalContext`] reference for convenience.
    gctx: &'gctx GlobalContext,
    /// Keeps track of whether we have already sent a warning message about an
    /// error updating the cache. The motivation is to avoid warning spam if
    /// the cache is not writable.
    has_warned: RefCell<bool>,
}

impl<'gctx> CacheManager<'gctx> {
    /// Creates a new instance of the on-disk index cache manager.
    ///
    /// `cache_root` --- The root path where caches are located.
    pub fn new(cache_root: Filesystem, gctx: &'gctx GlobalContext) -> CacheManager<'gctx> {
        CacheManager {
            cache_root,
            gctx,
            has_warned: Default::default(),
        }
    }

    /// Gets the cache associated with the key.
    pub fn get(&self, key: &str) -> Option<Vec<u8>> {
        let cache_path = &self.cache_path(key);
        match fs::read(cache_path) {
            Ok(contents) => Some(contents),
            Err(e) => {
                tracing::debug!(?cache_path, "cache missing: {e}");
                None
            }
        }
    }

    /// Associates the value with the key.
    pub fn put(&self, key: &str, value: &[u8]) {
        let cache_path = &self.cache_path(key);
        if let Err(e) = self.put_inner(cache_path, value) {
            tracing::info!(?cache_path, "failed to write cache: {e}");

            if !*self.has_warned.borrow() {
                let _ = self.gctx.shell().warn(format!(
                    "failed to write cache, path: {}, error: {e}",
                    cache_path.to_str().unwrap_or_default()
                ));
                *self.has_warned.borrow_mut() = true;
            }
        }
    }

    fn put_inner(&self, cache_path: &PathBuf, value: &[u8]) -> std::io::Result<()> {
        fs::create_dir_all(cache_path.parent().unwrap())?;
        let path = Filesystem::new(cache_path.clone());
        self.gctx
            .assert_package_cache_locked(CacheLockMode::DownloadExclusive, &path);
        fs::write(cache_path, value)?;
        Ok(())
    }

    /// Invalidates the cache associated with the key.
    pub fn invalidate(&self, key: &str) {
        let cache_path = &self.cache_path(key);
        if let Err(e) = fs::remove_file(cache_path) {
            if e.kind() != io::ErrorKind::NotFound {
                tracing::debug!(?cache_path, "failed to remove from cache: {e}");
            }
        }
    }

    fn cache_path(&self, key: &str) -> PathBuf {
        let relative = make_dep_path(key, false);
        // This is the file we're loading from cache or the index data.
        // See module comment in `registry/mod.rs` for why this is structured
        // the way it is.
        self.cache_root.join(relative).into_path_unlocked()
    }
}
301}