cargo/core/
global_cache_tracker.rs

//! Support for tracking the last time files were used to assist with cleaning
//! up those files if they haven't been used in a while.
//!
//! Tracking of cache files is stored in a sqlite database which contains a
//! timestamp of the last time the file was used, as well as the size of the
//! file.
//!
//! While cargo is running, when it detects a use of a cache file, it adds a
//! timestamp to [`DeferredGlobalLastUse`]. This batches up a set of changes
//! that are then flushed to the database all at once (via
//! [`DeferredGlobalLastUse::save`]). Ideally saving would only be done once
//! for performance reasons, but that is not really possible due to the way
//! cargo works, since there are different ways cargo can be used (`cargo
//! generate-lockfile`, `cargo fetch`, and `cargo build` all exercise this
//! code in very different ways).
//!
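/*!
As a rough illustration of the batching described above, here is a minimal,
std-only sketch. The names `DeferredSketch`, `mark_used`, and `flush` are
hypothetical and are not cargo's API; the real [`DeferredGlobalLastUse`]
flushes to the sqlite database rather than to an in-memory map.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a deferred last-use tracker.
#[derive(Default)]
struct DeferredSketch {
    /// Pending updates: cache entry name -> newest timestamp seen.
    pending: HashMap<String, u64>,
}

impl DeferredSketch {
    /// Records a use without touching the backing store.
    fn mark_used(&mut self, name: &str, timestamp: u64) {
        let slot = self.pending.entry(name.to_string()).or_insert(timestamp);
        // Keep only the newest timestamp for each entry.
        *slot = (*slot).max(timestamp);
    }

    /// Applies all pending updates to `store` in one pass and returns how
    /// many entries were written.
    fn flush(&mut self, store: &mut HashMap<String, u64>) -> usize {
        let count = self.pending.len();
        for (name, timestamp) in self.pending.drain() {
            store.insert(name, timestamp);
        }
        count
    }
}
```

In this sketch a second `flush` with nothing pending writes nothing, so
flushing more than once costs little beyond the call itself.

*/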
//! All of the database interaction is done through the [`GlobalCacheTracker`]
//! type.
//!
//! There is a single global [`GlobalCacheTracker`] and
//! [`DeferredGlobalLastUse`] stored in [`GlobalContext`].
//!
//! The high-level interface for performing garbage collection is defined in
//! the [`crate::core::gc`] module. The functions there are responsible for
//! interacting with the [`GlobalCacheTracker`] to handle cleaning of global
//! cache data.
//!
//! ## Automatic gc
//!
//! Some commands (primarily the build commands) will trigger an automatic
//! deletion of files that haven't been used in a while. The high-level
//! interface for this is the [`crate::core::gc::auto_gc`] function.
//!
//! The [`GlobalCacheTracker`] database tracks the last time an automatic gc
//! was performed so that it is only done once per day for performance
//! reasons.
//!
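/*!
The frequency check can be sketched as a pure function. The comparison
follows the one used by `should_run_auto_gc` in this module; the free
function `auto_gc_is_due` and the saturating addition are illustrative
assumptions, not cargo's code.

```rust
use std::time::Duration;

/// Returns true when more than `frequency` has elapsed since `last_auto_gc`.
/// Both timestamps are seconds since the Unix epoch.
fn auto_gc_is_due(last_auto_gc: u64, frequency: Duration, now: u64) -> bool {
    // saturating_add guards against overflow for very large frequencies.
    last_auto_gc.saturating_add(frequency.as_secs()) < now
}
```

With a one-day frequency, a gc recorded at time `t` is due again only once
the clock passes `t + 86_400` seconds.

*/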
//! ## Manual gc
//!
//! The user can perform a manual garbage collection with the `cargo clean`
//! command. That command has a variety of options to specify what to delete.
//! Manual gc supports deleting based on age or size or both. At a high
//! level, this is done by the [`crate::core::gc::Gc::gc`] method, which
//! calls into [`GlobalCacheTracker`] to handle all the cleaning.
//!
//! ## Locking
//!
//! Usage of the database requires that the package cache is locked to prevent
//! concurrent access. Although sqlite has built-in locking support, we want
//! to use cargo's locking so that the "Blocking" message gets displayed, and
//! so that locks can block indefinitely for long-running build commands.
//! [`rusqlite`] has a default timeout of 5 seconds, though that is
//! configurable.
//!
//! When garbage collection is being performed, the package cache lock must be
//! in [`CacheLockMode::MutateExclusive`] to ensure no other cargo process is
//! running. See [`crate::util::cache_lock`] for more detail on locking.
//!
//! When performing automatic gc, [`crate::core::gc::auto_gc`] will skip the
//! GC if the package cache lock is already held by anything else. Automatic
//! GC is intended to be opportunistic, and should impose as little disruption
//! to the user as possible.
//!
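/*!
The opportunistic behavior can be pictured with an in-process analogy. This
sketch uses `std::sync::Mutex::try_lock` purely for illustration; cargo's
package cache lock is a cross-process lock, and the function name
`try_opportunistic_gc` is hypothetical.

```rust
use std::sync::Mutex;

/// Runs `gc` only if `lock` can be taken without waiting.
/// Returns `true` if `gc` ran.
fn try_opportunistic_gc(lock: &Mutex<()>, gc: impl FnOnce()) -> bool {
    match lock.try_lock() {
        Ok(_guard) => {
            gc();
            true
        }
        // Someone else holds the lock (or it is poisoned): skip quietly.
        Err(_) => false,
    }
}
```

The caller never blocks: if the lock is busy, the work is simply skipped and
can be retried on a later run.

*/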
//! ## Compatibility
//!
//! The database must retain both forwards and backwards compatibility between
//! different versions of cargo. For the most part, this shouldn't be too
//! difficult to maintain. Generally sqlite doesn't change on-disk formats
//! between versions (the introduction of WAL is one of the few examples where
//! version 3 had a format change, but we wouldn't use it anyway since it has
//! shared-memory requirements cargo can't depend on due to things like
//! network mounts).
//!
//! Schema changes must be managed through [`migrations`] by adding new
//! entries that make a change to the database. Changes must not break older
//! versions of cargo. Generally, adding columns should be fine (either with a
//! default value, or NULL). Adding tables should also be fine. Just don't do
//! destructive things like removing a column, or changing the semantics of an
//! existing column.
//!
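/*!
To make the append-only rule concrete, here is a toy model of an ordered
migration list. It assumes the database remembers how many migrations have
been applied; the names `MigrationSketch`, `migrate_sketch`, and the
`example_flag` column are hypothetical, and the "schema" is just a list of
SQL strings rather than a real sqlite connection.

```rust
/// A hypothetical migration: a function that mutates a toy "schema".
type MigrationSketch = fn(&mut Vec<String>);

/// Applies every migration whose index is at or beyond `applied`, and
/// returns the new count of applied migrations.
fn migrate_sketch(
    schema: &mut Vec<String>,
    applied: usize,
    migrations: &[MigrationSketch],
) -> usize {
    for migration in migrations.iter().skip(applied) {
        migration(schema);
    }
    // A binary with fewer migrations than `applied` must leave the count alone.
    applied.max(migrations.len())
}

fn create_table(schema: &mut Vec<String>) {
    schema.push("CREATE TABLE global_data (last_auto_gc INTEGER NOT NULL)".to_string());
}

fn add_column(schema: &mut Vec<String>) {
    // Additive change: older readers that ignore this column keep working.
    schema.push("ALTER TABLE global_data ADD COLUMN example_flag INTEGER".to_string());
}
```

Because entries are only ever appended, an older binary that knows fewer
migrations applies nothing to a newer database, which is why reordering or
removing entries would be unsafe.

*/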
//! Since users may run older versions of cargo that do not do cache tracking,
//! the [`GlobalCacheTracker::sync_db_with_files`] method helps keep the
//! database in sync in the presence of older versions of cargo touching the
//! cache directories.
//!
//! ## Performance
//!
//! Much of the design of this system focuses on minimizing the performance
//! impact. Every build command needs to save updates, and we try to keep that
//! from having a noticeable impact on build times. Systems like Windows,
//! particularly with a magnetic hard disk, can experience a fairly large
//! impact from cargo's overhead. Cargo's benchsuite has some benchmarks to
//! help compare different environments, or changes to the code here. Please
//! try to keep performance in mind if making any major changes.
//!
//! Performance of `cargo clean` is not quite as important since it is not
//! expected to be run often. However, it is still courteous to the user to
//! try to not impact it too much. One part that has a performance concern is
//! that the clean command will synchronize the database with whatever is on
//! disk if needed (in case files were added by older versions of cargo that
//! don't do cache tracking, or if the user manually deleted some files). This
//! can potentially be very slow, especially if the two are very out of sync.
//!
//! ## Filesystems
//!
//! Everything here is sensitive to the kind of filesystem it is running on.
//! People tend to run cargo in all sorts of strange environments that have
//! limited capabilities, or on things like read-only mounts. The code here
//! needs to gracefully handle as many situations as possible.
//!
//! See also the information in the [Performance](#performance) and
//! [Locking](#locking) sections when considering different filesystems and
//! their impact on performance and locking.
//!
//! There are checks for read-only filesystems, which are generally ignored.

use crate::core::gc::GcOpts;
use crate::ops::CleanContext;
use crate::util::cache_lock::CacheLockMode;
use crate::util::interning::InternedString;
use crate::util::sqlite::{self, Migration, basic_migration};
use crate::util::{Filesystem, Progress, ProgressStyle};
use crate::{CargoResult, GlobalContext};
use anyhow::{Context as _, bail};
use cargo_util::paths;
use cargo_util_terminal::Verbosity;
use rusqlite::{Connection, ErrorCode, params};
use std::collections::{HashMap, hash_map};
use std::path::{Path, PathBuf};
use std::time::{Duration, SystemTime};
use tracing::{debug, trace};

/// The filename of the database.
const GLOBAL_CACHE_FILENAME: &str = ".global-cache";

const REGISTRY_INDEX_TABLE: &str = "registry_index";
const REGISTRY_CRATE_TABLE: &str = "registry_crate";
const REGISTRY_SRC_TABLE: &str = "registry_src";
const GIT_DB_TABLE: &str = "git_db";
const GIT_CO_TABLE: &str = "git_checkout";

/// How often timestamps will be updated.
///
/// As an optimization timestamps are not updated unless they are older than
/// the given number of seconds. This helps reduce the amount of disk I/O when
/// running cargo multiple times within a short window.
const UPDATE_RESOLUTION: u64 = 60 * 5;

/// Type for timestamps as stored in the database.
///
/// These are seconds since the Unix epoch.
type Timestamp = u64;

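/*
A minimal sketch of how a seconds-since-epoch timestamp and the update
resolution described above could interact. The names `now_sketch`,
`needs_update`, and `UPDATE_RESOLUTION_SKETCH` are hypothetical, and the
exact comparison cargo uses when saving is an assumption here.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Same value as `UPDATE_RESOLUTION` above: five minutes, in seconds.
const UPDATE_RESOLUTION_SKETCH: u64 = 60 * 5;

/// Current time as whole seconds since the Unix epoch.
fn now_sketch() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        // A clock set before 1970 is treated as time zero in this sketch.
        .unwrap_or(0)
}

/// Returns true if a stored timestamp is stale enough to be worth rewriting.
fn needs_update(stored: u64, new: u64) -> bool {
    new > stored.saturating_add(UPDATE_RESOLUTION_SKETCH)
}
```

Under this rule, repeated cargo invocations within the same five-minute
window would not rewrite an entry's timestamp.
*/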
/// The key for a registry index entry stored in the database.
#[derive(Clone, Debug, Hash, Eq, PartialEq)]
pub struct RegistryIndex {
    /// A unique name of the registry source.
    pub encoded_registry_name: InternedString,
}

/// The key for a registry `.crate` entry stored in the database.
#[derive(Clone, Debug, Hash, Eq, PartialEq)]
pub struct RegistryCrate {
    /// A unique name of the registry source.
    pub encoded_registry_name: InternedString,
    /// The filename of the compressed crate, like `foo-1.2.3.crate`.
    pub crate_filename: InternedString,
    /// The size of the `.crate` file.
    pub size: u64,
}

/// The key for a registry src directory entry stored in the database.
#[derive(Clone, Debug, Hash, Eq, PartialEq)]
pub struct RegistrySrc {
    /// A unique name of the registry source.
    pub encoded_registry_name: InternedString,
    /// The directory name of the extracted source, like `foo-1.2.3`.
    pub package_dir: InternedString,
    /// Total size of the src directory in bytes.
    ///
    /// This can be None when the size is unknown. For example, when the src
    /// directory already exists on disk, and we just want to update the
    /// last-use timestamp. We don't want to take the expense of computing disk
    /// usage unless necessary. [`GlobalCacheTracker::populate_untracked`]
    /// will handle any actual NULL values in the database, which can happen
    /// when the src directory is created by an older version of cargo that
    /// did not track sizes.
    pub size: Option<u64>,
}

/// The key for a git db entry stored in the database.
#[derive(Clone, Debug, Hash, Eq, PartialEq)]
pub struct GitDb {
    /// A unique name of the git database.
    pub encoded_git_name: InternedString,
}

/// The key for a git checkout entry stored in the database.
#[derive(Clone, Debug, Hash, Eq, PartialEq)]
pub struct GitCheckout {
    /// A unique name of the git database.
    pub encoded_git_name: InternedString,
    /// A unique name of the checkout without the database.
    pub short_name: InternedString,
    /// Total size of the checkout directory.
    ///
    /// This can be None when the size is unknown. See [`RegistrySrc::size`]
    /// for an explanation.
    pub size: Option<u64>,
}

/// Filesystem paths in the global cache.
///
/// Accessing these assumes a lock has already been acquired.
struct BasePaths {
    /// Root path to the index caches.
    index: PathBuf,
    /// Root path to the git DBs.
    git_db: PathBuf,
    /// Root path to the git checkouts.
    git_co: PathBuf,
    /// Root path to the `.crate` files.
    crate_dir: PathBuf,
    /// Root path to the `src` directories.
    src: PathBuf,
}

/// Migrations which initialize the database, and can be used to evolve it over time.
///
/// See [`Migration`] for more detail.
///
/// **Be sure to not change the order or entries here!**
fn migrations() -> Vec<Migration> {
    vec![
        // registry_index tracks the overall usage of an index cache, and tracks a
        // numeric ID to refer to that index that is used in other tables.
        basic_migration(
            "CREATE TABLE registry_index (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                name TEXT UNIQUE NOT NULL,
                timestamp INTEGER NOT NULL
            )",
        ),
        // .crate files
        basic_migration(
            "CREATE TABLE registry_crate (
                registry_id INTEGER NOT NULL,
                name TEXT NOT NULL,
                size INTEGER NOT NULL,
                timestamp INTEGER NOT NULL,
                PRIMARY KEY (registry_id, name),
                FOREIGN KEY (registry_id) REFERENCES registry_index (id) ON DELETE CASCADE
             )",
        ),
        // Extracted src directories
        //
        // Note that `size` can be NULL. This will happen when marking a src
        // directory as used that was created by an older version of cargo
        // that didn't do size tracking.
        basic_migration(
            "CREATE TABLE registry_src (
                registry_id INTEGER NOT NULL,
                name TEXT NOT NULL,
                size INTEGER,
                timestamp INTEGER NOT NULL,
                PRIMARY KEY (registry_id, name),
                FOREIGN KEY (registry_id) REFERENCES registry_index (id) ON DELETE CASCADE
             )",
        ),
        // Git db directories
        basic_migration(
            "CREATE TABLE git_db (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                name TEXT UNIQUE NOT NULL,
                timestamp INTEGER NOT NULL
             )",
        ),
        // Git checkout directories
        basic_migration(
            "CREATE TABLE git_checkout (
                git_id INTEGER NOT NULL,
                name TEXT UNIQUE NOT NULL,
                size INTEGER,
                timestamp INTEGER NOT NULL,
                PRIMARY KEY (git_id, name),
                FOREIGN KEY (git_id) REFERENCES git_db (id) ON DELETE CASCADE
             )",
        ),
        // This is a general-purpose single-row table that can store arbitrary
        // data. Feel free to add columns (with ALTER TABLE) if necessary.
        basic_migration(
            "CREATE TABLE global_data (
                last_auto_gc INTEGER NOT NULL
            )",
        ),
        // last_auto_gc tracks the last time auto-gc was run (so that it only
        // runs roughly once a day for performance reasons). Prime it with the
        // current time to establish a baseline.
        Box::new(|conn| {
            conn.execute(
                "INSERT INTO global_data (last_auto_gc) VALUES (?1)",
                [now()],
            )?;
            Ok(())
        }),
    ]
}

/// Type for SQL columns that refer to the primary key of their parent table.
///
/// For example, `registry_crate.registry_id` refers to its parent `registry_index.id`.
#[derive(Copy, Clone, Debug, PartialEq)]
struct ParentId(i64);

impl rusqlite::types::FromSql for ParentId {
    fn column_result(value: rusqlite::types::ValueRef<'_>) -> rusqlite::types::FromSqlResult<Self> {
        let i = i64::column_result(value)?;
        Ok(ParentId(i))
    }
}

impl rusqlite::types::ToSql for ParentId {
    fn to_sql(&self) -> rusqlite::Result<rusqlite::types::ToSqlOutput<'_>> {
        Ok(rusqlite::types::ToSqlOutput::from(self.0))
    }
}

/// Tracking for the global shared cache (registry files, etc.).
///
/// This is the interface to the global cache database, used for tracking and
/// cleaning. See the [`crate::core::global_cache_tracker`] module docs for
/// details.
#[derive(Debug)]
pub struct GlobalCacheTracker {
    /// Connection to the SQLite database.
    conn: Connection,
    /// This is an optimization used to make sure cargo only checks if gc
    /// needs to run once per session. This starts as `false`, and then the
    /// first time it checks if automatic gc needs to run, it will be set to
    /// `true`.
    auto_gc_checked_this_session: bool,
}

impl GlobalCacheTracker {
    /// Creates a new [`GlobalCacheTracker`].
    ///
    /// The caller is responsible for locking the package cache with
    /// [`CacheLockMode::DownloadExclusive`] before calling this.
    pub fn new(gctx: &GlobalContext) -> CargoResult<GlobalCacheTracker> {
        let db_path = Self::db_path(gctx);
        // A package cache lock is required to ensure only one cargo is
        // accessing at the same time. If there is concurrent access, we
        // want to rely on cargo's own "Blocking" system (which can
        // provide user feedback) rather than blocking inside sqlite
        // (which by default has a short timeout).
        let db_path = gctx.assert_package_cache_locked(CacheLockMode::DownloadExclusive, &db_path);
        let mut conn = Connection::open(db_path)?;
        conn.pragma_update(None, "foreign_keys", true)?;
        sqlite::migrate(&mut conn, &migrations())?;
        Ok(GlobalCacheTracker {
            conn,
            auto_gc_checked_this_session: false,
        })
    }

    /// The path to the database.
    pub fn db_path(gctx: &GlobalContext) -> Filesystem {
        gctx.home().join(GLOBAL_CACHE_FILENAME)
    }

    /// Given an encoded registry name, returns its ID.
    ///
    /// Returns None if the given name isn't in the database.
    fn id_from_name(
        conn: &Connection,
        table_name: &str,
        encoded_name: &str,
    ) -> CargoResult<Option<ParentId>> {
        let mut stmt =
            conn.prepare_cached(&format!("SELECT id FROM {table_name} WHERE name = ?"))?;
        match stmt.query_row([encoded_name], |row| row.get(0)) {
            Ok(id) => Ok(Some(id)),
            Err(rusqlite::Error::QueryReturnedNoRows) => Ok(None),
            Err(e) => Err(e.into()),
        }
    }

    /// Returns a map of ID to path for the given ids in the given table.
    ///
    /// For example, given `registry_index` IDs, it returns filenames of the
    /// form "index.crates.io-6f17d22bba15001f".
    fn get_id_map(
        conn: &Connection,
        table_name: &str,
        ids: &[i64],
    ) -> CargoResult<HashMap<i64, PathBuf>> {
        let mut stmt =
            conn.prepare_cached(&format!("SELECT name FROM {table_name} WHERE id = ?1"))?;
        ids.iter()
            .map(|id| {
                let name = stmt.query_row(params![id], |row| {
                    Ok(PathBuf::from(row.get::<_, String>(0)?))
                })?;
                Ok((*id, name))
            })
            .collect()
    }

    /// Returns all index cache timestamps.
    pub fn registry_index_all(&self) -> CargoResult<Vec<(RegistryIndex, Timestamp)>> {
        let mut stmt = self
            .conn
            .prepare_cached("SELECT name, timestamp FROM registry_index")?;
        let rows = stmt
            .query_map([], |row| {
                let encoded_registry_name = row.get_unwrap(0);
                let timestamp = row.get_unwrap(1);
                let kind = RegistryIndex {
                    encoded_registry_name,
                };
                Ok((kind, timestamp))
            })?
            .collect::<Result<Vec<_>, _>>()?;
        Ok(rows)
    }

    /// Returns all registry crate cache timestamps.
    pub fn registry_crate_all(&self) -> CargoResult<Vec<(RegistryCrate, Timestamp)>> {
        let mut stmt = self.conn.prepare_cached(
            "SELECT registry_index.name, registry_crate.name, registry_crate.size, registry_crate.timestamp
             FROM registry_index, registry_crate
             WHERE registry_crate.registry_id = registry_index.id",
        )?;
        let rows = stmt
            .query_map([], |row| {
                let encoded_registry_name = row.get_unwrap(0);
                let crate_filename = row.get_unwrap(1);
                let size = row.get_unwrap(2);
                let timestamp = row.get_unwrap(3);
                let kind = RegistryCrate {
                    encoded_registry_name,
                    crate_filename,
                    size,
                };
                Ok((kind, timestamp))
            })?
            .collect::<Result<Vec<_>, _>>()?;
        Ok(rows)
    }

    /// Returns all registry source cache timestamps.
    pub fn registry_src_all(&self) -> CargoResult<Vec<(RegistrySrc, Timestamp)>> {
        let mut stmt = self.conn.prepare_cached(
            "SELECT registry_index.name, registry_src.name, registry_src.size, registry_src.timestamp
             FROM registry_index, registry_src
             WHERE registry_src.registry_id = registry_index.id",
        )?;
        let rows = stmt
            .query_map([], |row| {
                let encoded_registry_name = row.get_unwrap(0);
                let package_dir = row.get_unwrap(1);
                let size = row.get_unwrap(2);
                let timestamp = row.get_unwrap(3);
                let kind = RegistrySrc {
                    encoded_registry_name,
                    package_dir,
                    size,
                };
                Ok((kind, timestamp))
            })?
            .collect::<Result<Vec<_>, _>>()?;
        Ok(rows)
    }

    /// Returns all git db timestamps.
    pub fn git_db_all(&self) -> CargoResult<Vec<(GitDb, Timestamp)>> {
        let mut stmt = self
            .conn
            .prepare_cached("SELECT name, timestamp FROM git_db")?;
        let rows = stmt
            .query_map([], |row| {
                let encoded_git_name = row.get_unwrap(0);
                let timestamp = row.get_unwrap(1);
                let kind = GitDb { encoded_git_name };
                Ok((kind, timestamp))
            })?
            .collect::<Result<Vec<_>, _>>()?;
        Ok(rows)
    }

    /// Returns all git checkout timestamps.
    pub fn git_checkout_all(&self) -> CargoResult<Vec<(GitCheckout, Timestamp)>> {
        let mut stmt = self.conn.prepare_cached(
            "SELECT git_db.name, git_checkout.name, git_checkout.size, git_checkout.timestamp
             FROM git_db, git_checkout
             WHERE git_checkout.git_id = git_db.id",
        )?;
        let rows = stmt
            .query_map([], |row| {
                let encoded_git_name = row.get_unwrap(0);
                let short_name = row.get_unwrap(1);
                let size = row.get_unwrap(2);
                let timestamp = row.get_unwrap(3);
                let kind = GitCheckout {
                    encoded_git_name,
                    short_name,
                    size,
                };
                Ok((kind, timestamp))
            })?
            .collect::<Result<Vec<_>, _>>()?;
        Ok(rows)
    }

    /// Returns whether or not an auto GC should be performed, compared to the
    /// last time it was recorded in the database.
    pub fn should_run_auto_gc(&mut self, frequency: Duration) -> CargoResult<bool> {
        trace!(target: "gc", "should_run_auto_gc");
        if self.auto_gc_checked_this_session {
            return Ok(false);
        }
        let last_auto_gc: Timestamp =
            self.conn
                .query_row("SELECT last_auto_gc FROM global_data", [], |row| row.get(0))?;
        let should_run = last_auto_gc + frequency.as_secs() < now();
        trace!(target: "gc",
            "last auto gc was {}, {}",
            last_auto_gc,
            if should_run { "running" } else { "skipping" }
        );
        self.auto_gc_checked_this_session = true;
        Ok(should_run)
    }

    /// Writes to the database to indicate that an automatic GC has just been
    /// completed.
    pub fn set_last_auto_gc(&self) -> CargoResult<()> {
        self.conn
            .execute("UPDATE global_data SET last_auto_gc = ?1", [now()])?;
        Ok(())
    }

    /// Deletes files from the global cache based on the given options.
    pub fn clean(&mut self, clean_ctx: &mut CleanContext<'_>, gc_opts: &GcOpts) -> CargoResult<()> {
        self.clean_inner(clean_ctx, gc_opts)
            .context("failed to clean entries from the global cache")
    }

    #[tracing::instrument(skip_all)]
    fn clean_inner(
        &mut self,
        clean_ctx: &mut CleanContext<'_>,
        gc_opts: &GcOpts,
    ) -> CargoResult<()> {
        let gctx = clean_ctx.gctx;
        let base = BasePaths {
            index: gctx.registry_index_path().into_path_unlocked(),
            git_db: gctx.git_db_path().into_path_unlocked(),
            git_co: gctx.git_checkouts_path().into_path_unlocked(),
            crate_dir: gctx.registry_cache_path().into_path_unlocked(),
            src: gctx.registry_source_path().into_path_unlocked(),
        };
        let now = now();
        trace!(target: "gc", "cleaning {gc_opts:?}");
        let tx = self.conn.transaction()?;
        let mut delete_paths = Vec::new();
        // This can be an expensive operation, so only perform it if necessary.
        if gc_opts.is_download_cache_opt_set() {
            // TODO: Investigate how slow this might be.
            Self::sync_db_with_files(
                &tx,
                now,
                gctx,
                &base,
                gc_opts.is_download_cache_size_set(),
                &mut delete_paths,
            )
            .context("failed to sync tracking database")?
        }
        if let Some(max_age) = gc_opts.max_index_age {
            let max_age = now - max_age.as_secs();
            Self::get_registry_index_to_clean(&tx, max_age, &base, &mut delete_paths)?;
        }
        if let Some(max_age) = gc_opts.max_src_age {
            let max_age = now - max_age.as_secs();
            Self::get_registry_items_to_clean_age(
                &tx,
                max_age,
                REGISTRY_SRC_TABLE,
                &base.src,
                &mut delete_paths,
            )?;
        }
        if let Some(max_age) = gc_opts.max_crate_age {
            let max_age = now - max_age.as_secs();
            Self::get_registry_items_to_clean_age(
                &tx,
                max_age,
                REGISTRY_CRATE_TABLE,
                &base.crate_dir,
                &mut delete_paths,
            )?;
        }
        if let Some(max_age) = gc_opts.max_git_db_age {
            let max_age = now - max_age.as_secs();
            Self::get_git_db_items_to_clean(&tx, max_age, &base, &mut delete_paths)?;
        }
        if let Some(max_age) = gc_opts.max_git_co_age {
            let max_age = now - max_age.as_secs();
            Self::get_git_co_items_to_clean(&tx, max_age, &base.git_co, &mut delete_paths)?;
        }
        // Size collection must happen after date collection so that dates
        // have precedence, since size constraints are a more blunt
        // instrument.
        //
        // These are also complicated by the `--max-download-size` option
        // overlapping with `--max-crate-size` and `--max-src-size`, which
        // requires some coordination between those options which isn't
        // necessary with the age-based options. An item's age is either older
        // or it isn't, but contrast that with size which is based on the sum
        // of all tracked items. Also, `--max-download-size` is summed against
        // both the crate and src tracking, which requires combining them to
        // compute the size, and then separating them to calculate the correct
        // paths.
        if let Some(max_size) = gc_opts.max_crate_size {
            Self::get_registry_items_to_clean_size(
                &tx,
                max_size,
                REGISTRY_CRATE_TABLE,
                &base.crate_dir,
                &mut delete_paths,
            )?;
        }
        if let Some(max_size) = gc_opts.max_src_size {
            Self::get_registry_items_to_clean_size(
                &tx,
                max_size,
                REGISTRY_SRC_TABLE,
                &base.src,
                &mut delete_paths,
            )?;
        }
        if let Some(max_size) = gc_opts.max_git_size {
            Self::get_git_items_to_clean_size(&tx, max_size, &base, &mut delete_paths)?;
        }
        if let Some(max_size) = gc_opts.max_download_size {
            Self::get_registry_items_to_clean_size_both(&tx, max_size, &base, &mut delete_paths)?;
        }

        clean_ctx.remove_paths(&delete_paths)?;

        if clean_ctx.dry_run {
            tx.rollback()?;
        } else {
            tx.commit()?;
        }
        Ok(())
    }

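/*
The comment in `clean_inner` notes that size limits depend on the sum of all
tracked items rather than on each item alone. A minimal sketch of one way
such a selection could work follows. It assumes entries are walked from most
to least recently used; the names `EntrySketch` and `select_for_size_limit`
are hypothetical, and the real queries run against the sqlite tables.

```rust
/// Hypothetical tracked entry: (name, size in bytes, last-use timestamp).
type EntrySketch = (&'static str, u64, u64);

/// Returns the names to delete so the kept entries total at most `max_size`.
/// Most recently used entries are kept first.
fn select_for_size_limit(entries: &[EntrySketch], max_size: u64) -> Vec<&'static str> {
    let mut sorted = entries.to_vec();
    // Newest first, so older entries are the ones pushed over the limit.
    sorted.sort_by(|a, b| b.2.cmp(&a.2));
    let mut total: u64 = 0;
    let mut delete = Vec::new();
    for (name, size, _timestamp) in sorted {
        total = total.saturating_add(size);
        if total > max_size {
            delete.push(name);
        }
    }
    delete
}
```

Once the running total passes the limit, every older entry is selected too,
which is part of why size limits are the blunter instrument.
*/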
    /// Returns a list of directory entries in the given path that are
    /// themselves directories.
    fn list_dir_names(path: &Path) -> CargoResult<Vec<String>> {
        Self::read_dir_with_filter(path, &|entry| {
            entry.file_type().map_or(false, |ty| ty.is_dir())
        })
    }

    /// Returns a list of names in a directory, filtered by the given callback.
    fn read_dir_with_filter(
        path: &Path,
        filter: &dyn Fn(&std::fs::DirEntry) -> bool,
    ) -> CargoResult<Vec<String>> {
        let entries = match path.read_dir() {
            Ok(e) => e,
            Err(e) => {
                if e.kind() == std::io::ErrorKind::NotFound {
                    return Ok(Vec::new());
                } else {
                    return Err(
                        anyhow::Error::new(e).context(format!("failed to read path `{path:?}`"))
                    );
                }
            }
        };
        let names = entries
            .filter_map(|entry| entry.ok())
            .filter(|entry| filter(entry))
            .filter_map(|entry| entry.file_name().into_string().ok())
            .collect();
        Ok(names)
    }

    /// Synchronizes the database to match the files on disk.
    ///
    /// This performs the following cleanups:
    ///
    /// 1. Remove entries from the database that are missing on disk.
    /// 2. Adds missing entries to the database that are on disk (such as when
    ///    files are added by older versions of cargo).
    /// 3. Fills in the `size` column where it is NULL (such as when something
    ///    is added to disk by an older version of cargo, and one of the mark
    ///    functions marked it without knowing the size).
    ///
    ///    Size computations are only done if `sync_size` is set since it can
    ///    be a very expensive operation. This should only be set if the user
    ///    requested to clean based on the cache size.
    /// 4. Checks for orphaned files. For example, if there are `.crate` files
    ///    associated with an index that does not exist.
    ///
    ///    These orphaned files will be added to `delete_paths` so that the
    ///    caller can delete them.
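/*
Steps 1 and 2 of the list above amount to two set differences between the
names tracked in the database and the names found on disk. The sketch below
shows only that idea; `SyncPlanSketch` and `plan_sync` are hypothetical
names, and size filling and orphan detection (steps 3 and 4) are left out.

```rust
use std::collections::BTreeSet;

/// Result of comparing tracked names against names found on disk.
#[derive(Debug, PartialEq)]
struct SyncPlanSketch {
    /// In the database but missing on disk: remove from the database.
    remove_from_db: Vec<String>,
    /// On disk but missing from the database: start tracking.
    add_to_db: Vec<String>,
}

/// Computes which names to drop from and add to the database.
fn plan_sync(in_db: &BTreeSet<String>, on_disk: &BTreeSet<String>) -> SyncPlanSketch {
    SyncPlanSketch {
        remove_from_db: in_db.difference(on_disk).cloned().collect(),
        add_to_db: on_disk.difference(in_db).cloned().collect(),
    }
}
```

Using ordered sets keeps the resulting lists sorted, which makes the plan
deterministic for a given pair of inputs.
*/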
    #[tracing::instrument(skip(conn, gctx, base, delete_paths))]
    fn sync_db_with_files(
        conn: &Connection,
        now: Timestamp,
        gctx: &GlobalContext,
        base: &BasePaths,
        sync_size: bool,
        delete_paths: &mut Vec<PathBuf>,
    ) -> CargoResult<()> {
        debug!(target: "gc", "starting db sync");
        // For registry_index and git_db, add anything that is missing in the db.
        Self::update_parent_for_missing_from_db(conn, now, REGISTRY_INDEX_TABLE, &base.index)?;
        Self::update_parent_for_missing_from_db(conn, now, GIT_DB_TABLE, &base.git_db)?;

        // For registry_crate, registry_src, and git_checkout, remove anything
        // from the db that isn't on disk.
        Self::update_db_for_removed(
            conn,
            REGISTRY_INDEX_TABLE,
            "registry_id",
            REGISTRY_CRATE_TABLE,
            &base.crate_dir,
        )?;
        Self::update_db_for_removed(
            conn,
            REGISTRY_INDEX_TABLE,
            "registry_id",
            REGISTRY_SRC_TABLE,
            &base.src,
        )?;
        Self::update_db_for_removed(conn, GIT_DB_TABLE, "git_id", GIT_CO_TABLE, &base.git_co)?;

        // For registry_index and git_db, remove anything from the db that
        // isn't on disk.
        //
        // This also collects paths for any child files that don't have their
        // respective parent on disk.
        Self::update_db_parent_for_removed_from_disk(
            conn,
            REGISTRY_INDEX_TABLE,
            &base.index,
            &[&base.crate_dir, &base.src],
            delete_paths,
        )?;
        Self::update_db_parent_for_removed_from_disk(
            conn,
            GIT_DB_TABLE,
            &base.git_db,
            &[&base.git_co],
            delete_paths,
        )?;

        // For registry_crate, registry_src, and git_checkout, add anything
        // that is missing in the db.
        Self::populate_untracked_crate(conn, now, &base.crate_dir)?;
        Self::populate_untracked(
            conn,
            now,
            gctx,
            REGISTRY_INDEX_TABLE,
            "registry_id",
            REGISTRY_SRC_TABLE,
            &base.src,
            sync_size,
        )?;
        Self::populate_untracked(
            conn,
            now,
            gctx,
            GIT_DB_TABLE,
            "git_id",
            GIT_CO_TABLE,
            &base.git_co,
            sync_size,
        )?;

        // Update any NULL sizes if needed.
        if sync_size {
            Self::update_null_sizes(
                conn,
                gctx,
                REGISTRY_INDEX_TABLE,
                "registry_id",
                REGISTRY_SRC_TABLE,
                &base.src,
            )?;
            Self::update_null_sizes(
                conn,
                gctx,
                GIT_DB_TABLE,
                "git_id",
                GIT_CO_TABLE,
                &base.git_co,
            )?;
        }
        Ok(())
    }

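/*
Illustrative note (not part of cargo): the first two steps of
`sync_db_with_files` amount to a set comparison between the names tracked in
the database and the names found on disk. The sketch below shows only that
idea. `Reconciled`, `reconcile`, and the sample names are hypothetical; the
real code walks SQLite tables and the filesystem rather than in-memory sets.

```rust
use std::collections::BTreeSet;

/// Hypothetical result of comparing tracked names with names on disk.
#[derive(Debug, PartialEq)]
struct Reconciled {
    /// Tracked in the database but gone from disk: rows to delete.
    stale_rows: Vec<String>,
    /// Present on disk but not tracked: rows to insert.
    untracked: Vec<String>,
}

/// Compares the two sets and reports what each side is missing.
fn reconcile(in_db: &BTreeSet<String>, on_disk: &BTreeSet<String>) -> Reconciled {
    Reconciled {
        stale_rows: in_db.difference(on_disk).cloned().collect(),
        untracked: on_disk.difference(in_db).cloned().collect(),
    }
}

fn main() {
    let in_db: BTreeSet<String> = ["a", "b"].iter().map(|s| s.to_string()).collect();
    let on_disk: BTreeSet<String> = ["b", "c"].iter().map(|s| s.to_string()).collect();
    let result = reconcile(&in_db, &on_disk);
    println!("{result:?}");
}
```

In this picture, `stale_rows` corresponds to the rows removed by the
`update_db_*_removed*` helpers and `untracked` to the rows added by the
`populate_untracked*` helpers.
*/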
    /// For parent tables, add any entries that are on disk but aren't tracked in the db.
    #[tracing::instrument(skip(conn, now, base_path))]
    fn update_parent_for_missing_from_db(
        conn: &Connection,
        now: Timestamp,
        parent_table_name: &str,
        base_path: &Path,
    ) -> CargoResult<()> {
        trace!(target: "gc", "checking for untracked parent to add to {parent_table_name}");
        let names = Self::list_dir_names(base_path)?;

        let mut stmt = conn.prepare_cached(&format!(
            "INSERT INTO {parent_table_name} (name, timestamp)
                VALUES (?1, ?2)
                ON CONFLICT DO NOTHING",
        ))?;
        for name in names {
            stmt.execute(params![name, now])?;
        }
        Ok(())
    }

    /// Removes database entries for any files that are not on disk for the child tables.
    ///
    /// This could happen for example if the user manually deleted the file or
    /// any such scenario where the filesystem and db are out of sync.
    #[tracing::instrument(skip(conn, base_path))]
    fn update_db_for_removed(
        conn: &Connection,
        parent_table_name: &str,
        id_column_name: &str,
        table_name: &str,
        base_path: &Path,
    ) -> CargoResult<()> {
        trace!(target: "gc", "checking for db entries to remove from {table_name}");
        let mut select_stmt = conn.prepare_cached(&format!(
            "SELECT {table_name}.rowid, {parent_table_name}.name, {table_name}.name
             FROM {parent_table_name}, {table_name}
             WHERE {table_name}.{id_column_name} = {parent_table_name}.id",
        ))?;
        let mut delete_stmt =
            conn.prepare_cached(&format!("DELETE FROM {table_name} WHERE rowid = ?1"))?;
        let mut rows = select_stmt.query([])?;
        while let Some(row) = rows.next()? {
            let rowid: i64 = row.get_unwrap(0);
            let id_name: String = row.get_unwrap(1);
            let name: String = row.get_unwrap(2);
            if !base_path.join(id_name).join(name).exists() {
                delete_stmt.execute([rowid])?;
            }
        }
        Ok(())
    }

    /// Removes database entries for any files that are not on disk for the parent tables.
    #[tracing::instrument(skip(conn, base_path, child_base_paths, delete_paths))]
    fn update_db_parent_for_removed_from_disk(
        conn: &Connection,
        parent_table_name: &str,
        base_path: &Path,
        child_base_paths: &[&Path],
        delete_paths: &mut Vec<PathBuf>,
    ) -> CargoResult<()> {
        trace!(target: "gc", "checking for db entries to remove from {parent_table_name}");
        let mut select_stmt =
            conn.prepare_cached(&format!("SELECT rowid, name FROM {parent_table_name}"))?;
        let mut delete_stmt =
            conn.prepare_cached(&format!("DELETE FROM {parent_table_name} WHERE rowid = ?1"))?;
        let mut rows = select_stmt.query([])?;
        while let Some(row) = rows.next()? {
            let rowid: i64 = row.get_unwrap(0);
            let id_name: String = row.get_unwrap(1);
            if !base_path.join(&id_name).exists() {
                delete_stmt.execute([rowid])?;
                // Make sure any child data is also cleaned up.
                for child_base in child_base_paths {
                    let child_path = child_base.join(&id_name);
                    if child_path.exists() {
                        debug!(target: "gc", "removing orphaned path {child_path:?}");
                        delete_paths.push(child_path);
                    }
                }
            }
        }
        Ok(())
    }

    /// Updates the database to add any `.crate` files that are currently
    /// not tracked (such as when they are downloaded by an older version of
    /// cargo).
    #[tracing::instrument(skip(conn, now, base_path))]
    fn populate_untracked_crate(
        conn: &Connection,
        now: Timestamp,
        base_path: &Path,
    ) -> CargoResult<()> {
        trace!(target: "gc", "populating untracked crate files");
        let mut insert_stmt = conn.prepare_cached(
            "INSERT INTO registry_crate (registry_id, name, size, timestamp)
             VALUES (?1, ?2, ?3, ?4)
             ON CONFLICT DO NOTHING",
        )?;
        let index_names = Self::list_dir_names(&base_path)?;
        for index_name in index_names {
            let Some(id) = Self::id_from_name(conn, REGISTRY_INDEX_TABLE, &index_name)? else {
                // The id is missing from the database. This should be resolved
                // via update_db_parent_for_removed_from_disk.
                continue;
            };
            let index_path = base_path.join(index_name);
            let crates = Self::read_dir_with_filter(&index_path, &|entry| {
                entry.file_type().map_or(false, |ty| ty.is_file())
                    && entry
                        .file_name()
                        .to_str()
                        .map_or(false, |name| name.ends_with(".crate"))
            })?;
            for crate_name in crates {
                // Missing files should have already been taken care of by
                // update_db_for_removed.
                let size = paths::metadata(index_path.join(&crate_name))?.len();
                insert_stmt.execute(params![id, crate_name, size, now])?;
            }
        }
        Ok(())
    }

    /// Updates the database to add any files that are currently not tracked
    /// (such as when they are downloaded by an older version of cargo).
    #[tracing::instrument(skip(conn, now, gctx, base_path, populate_size))]
    fn populate_untracked(
        conn: &Connection,
        now: Timestamp,
        gctx: &GlobalContext,
        id_table_name: &str,
        id_column_name: &str,
        table_name: &str,
        base_path: &Path,
        populate_size: bool,
    ) -> CargoResult<()> {
        trace!(target: "gc", "populating untracked files for {table_name}");
        // Gather names (and make sure they are in the database).
        let id_names = Self::list_dir_names(&base_path)?;

        // This SELECT is used to determine if the directory is already
        // tracked. We don't want to do the expensive size computation unless
        // necessary.
        let mut select_stmt = conn.prepare_cached(&format!(
            "SELECT 1 FROM {table_name}
             WHERE {id_column_name} = ?1 AND name = ?2",
        ))?;
        let mut insert_stmt = conn.prepare_cached(&format!(
            "INSERT INTO {table_name} ({id_column_name}, name, size, timestamp)
             VALUES (?1, ?2, ?3, ?4)
             ON CONFLICT DO NOTHING",
        ))?;
        let mut progress = Progress::with_style("Scanning", ProgressStyle::Ratio, gctx);
        // Compute the size of any directory not in the database.
        for id_name in id_names {
            let Some(id) = Self::id_from_name(conn, id_table_name, &id_name)? else {
                // The id is missing from the database. This should be resolved
                // via update_db_parent_for_removed_from_disk.
                continue;
            };
            let index_path = base_path.join(id_name);
            let names = Self::list_dir_names(&index_path)?;
            let max = names.len();
            for (i, name) in names.iter().enumerate() {
                if select_stmt.exists(params![id, name])? {
                    continue;
                }
                let dir_path = index_path.join(name);
                if !dir_path.is_dir() {
                    continue;
                }
                progress.tick(i, max, "")?;
                let size = if populate_size {
                    Some(du(&dir_path, table_name)?)
                } else {
                    None
                };
                insert_stmt.execute(params![id, name, size, now])?;
            }
        }
        Ok(())
    }

    /// Fills in the `size` column where it is NULL.
    ///
    /// This can happen when something is added to disk by an older version of
    /// cargo, and one of the mark functions marked it without knowing the
    /// size.
    ///
    /// `update_db_for_removed` should be called before this is called.
    #[tracing::instrument(skip(conn, gctx, base_path))]
    fn update_null_sizes(
        conn: &Connection,
        gctx: &GlobalContext,
        parent_table_name: &str,
        id_column_name: &str,
        table_name: &str,
        base_path: &Path,
    ) -> CargoResult<()> {
        trace!(target: "gc", "updating NULL size information in {table_name}");
        let mut null_stmt = conn.prepare_cached(&format!(
            "SELECT {table_name}.rowid, {table_name}.name, {parent_table_name}.name
             FROM {table_name}, {parent_table_name}
             WHERE {table_name}.size IS NULL AND {table_name}.{id_column_name} = {parent_table_name}.id",
        ))?;
        let mut update_stmt = conn.prepare_cached(&format!(
            "UPDATE {table_name} SET size = ?1 WHERE rowid = ?2"
        ))?;
        let mut progress = Progress::with_style("Scanning", ProgressStyle::Ratio, gctx);
        let rows: Vec<_> = null_stmt
            .query_map([], |row| {
                Ok((row.get_unwrap(0), row.get_unwrap(1), row.get_unwrap(2)))
            })?
            .collect();
        let max = rows.len();
        for (i, row) in rows.into_iter().enumerate() {
            let (rowid, name, id_name): (i64, String, String) = row?;
            let path = base_path.join(id_name).join(name);
            progress.tick(i, max, "")?;
            // Missing files should have already been taken care of by
            // update_db_for_removed.
            let size = du(&path, table_name)?;
            update_stmt.execute(params![size, rowid])?;
        }
        Ok(())
    }

    /// Adds paths to delete from either `registry_crate` or `registry_src` whose
    /// last use is older than the given timestamp.
    fn get_registry_items_to_clean_age(
        conn: &Connection,
        max_age: Timestamp,
        table_name: &str,
        base_path: &Path,
        delete_paths: &mut Vec<PathBuf>,
    ) -> CargoResult<()> {
        debug!(target: "gc", "cleaning {table_name} since {max_age:?}");
        let mut stmt = conn.prepare_cached(&format!(
            "DELETE FROM {table_name} WHERE timestamp < ?1
                RETURNING registry_id, name"
        ))?;
        let rows = stmt
            .query_map(params![max_age], |row| {
                let registry_id = row.get_unwrap(0);
                let name: String = row.get_unwrap(1);
                Ok((registry_id, name))
            })?
            .collect::<Result<Vec<_>, _>>()?;
        let ids: Vec<_> = rows.iter().map(|r| r.0).collect();
        let id_map = Self::get_id_map(conn, REGISTRY_INDEX_TABLE, &ids)?;
        for (id, name) in rows {
            let encoded_registry_name = &id_map[&id];
            delete_paths.push(base_path.join(encoded_registry_name).join(name));
        }
        Ok(())
    }

    /// Adds paths to delete from either `registry_crate` or `registry_src` in
    /// order to keep the total size under the given max size.
    fn get_registry_items_to_clean_size(
        conn: &Connection,
        max_size: u64,
        table_name: &str,
        base_path: &Path,
        delete_paths: &mut Vec<PathBuf>,
    ) -> CargoResult<()> {
        debug!(target: "gc", "cleaning {table_name} till under {max_size:?}");
        let total_size: u64 = conn.query_row(
            &format!("SELECT coalesce(SUM(size), 0) FROM {table_name}"),
            [],
            |row| row.get(0),
        )?;
        if total_size <= max_size {
            return Ok(());
        }
        // This SQL statement selects all of the rows ordered by timestamp,
        // and then uses a window function to keep a running total of the
        // size. It selects all rows until the running total exceeds the
        // threshold of the total number of bytes that we want to delete.
        //
        // The window function essentially computes an aggregate over all
        // previous rows as it goes along. As long as the running size is
        // below the total amount that we need to delete, it keeps picking
        // more rows.
        //
        // The ORDER BY includes `name` mainly for test purposes so that
        // entries with the same timestamp have deterministic behavior.
        //
        // The coalesce helps convert NULL to 0.
        let mut stmt = conn.prepare(&format!(
            "DELETE FROM {table_name} WHERE rowid IN \
                (SELECT x.rowid FROM \
                    (SELECT rowid, size, SUM(size) OVER \
                        (ORDER BY timestamp, name ROWS UNBOUNDED PRECEDING) AS running_amount \
                        FROM {table_name}) x \
                    WHERE coalesce(x.running_amount, 0) - x.size < ?1) \
                RETURNING registry_id, name;"
        ))?;
        let rows = stmt
            .query_map(params![total_size - max_size], |row| {
                let id = row.get_unwrap(0);
                let name: String = row.get_unwrap(1);
                Ok((id, name))
            })?
            .collect::<Result<Vec<_>, _>>()?;
        // Convert registry_id to the encoded registry name, and join those.
        let ids: Vec<_> = rows.iter().map(|r| r.0).collect();
        let id_map = Self::get_id_map(conn, REGISTRY_INDEX_TABLE, &ids)?;
        for (id, name) in rows {
            let encoded_name = &id_map[&id];
            delete_paths.push(base_path.join(encoded_name).join(name));
        }
        Ok(())
    }

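/*
Illustrative note (not part of cargo): the running-total selection used by
`get_registry_items_to_clean_size` can be mirrored in plain Rust. This is a
minimal sketch under stated assumptions: `Entry` and
`pick_oldest_until_under` are hypothetical names, timestamps and sizes are
bare `u64` values, and NULL sizes are not modeled.

```rust
/// Hypothetical stand-in for a row of `registry_crate` or `registry_src`.
#[derive(Debug, Clone)]
struct Entry {
    name: &'static str,
    timestamp: u64,
    size: u64,
}

/// Returns the names of the oldest entries to delete so that the total size
/// drops to `max_size` or below.
fn pick_oldest_until_under(mut entries: Vec<Entry>, max_size: u64) -> Vec<&'static str> {
    let total: u64 = entries.iter().map(|e| e.size).sum();
    if total <= max_size {
        return Vec::new();
    }
    let to_free = total - max_size;
    // Same ordering as the SQL: oldest first, ties broken by name.
    entries.sort_by(|a, b| (a.timestamp, a.name).cmp(&(b.timestamp, b.name)));
    let mut freed_so_far = 0u64;
    let mut picked = Vec::new();
    for entry in entries {
        // `freed_so_far` is the sum over all earlier rows, which is what
        // `running_amount - size` evaluates to in the SQL statement.
        if freed_so_far >= to_free {
            break;
        }
        freed_so_far += entry.size;
        picked.push(entry.name);
    }
    picked
}

fn main() {
    let entries = vec![
        Entry { name: "c", timestamp: 3, size: 30 },
        Entry { name: "a", timestamp: 1, size: 10 },
        Entry { name: "b", timestamp: 2, size: 20 },
    ];
    println!("{:?}", pick_oldest_until_under(entries, 35));
}
```

A row is picked whenever the bytes freed by the rows before it are still
short of the target, so the last picked row may free more than strictly
needed.
*/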
    /// Adds paths to delete from both `registry_crate` and `registry_src` in
    /// order to keep the total size under the given max size.
    fn get_registry_items_to_clean_size_both(
        conn: &Connection,
        max_size: u64,
        base: &BasePaths,
        delete_paths: &mut Vec<PathBuf>,
    ) -> CargoResult<()> {
        debug!(target: "gc", "cleaning download till under {max_size:?}");

        // This SQL statement selects from both registry_src and
        // registry_crate so that sorting of timestamps incorporates both of
        // them at the same time. It uses a const value of 1 or 2 as the first
        // column so that the code below can determine which table the value
        // came from.
        let mut stmt = conn.prepare_cached(
            "SELECT 1, registry_src.rowid, registry_src.name AS name, registry_index.name,
                    registry_src.size, registry_src.timestamp AS timestamp
             FROM registry_src, registry_index
             WHERE registry_src.registry_id = registry_index.id AND registry_src.size NOT NULL

             UNION

             SELECT 2, registry_crate.rowid, registry_crate.name AS name, registry_index.name,
                    registry_crate.size, registry_crate.timestamp AS timestamp
             FROM registry_crate, registry_index
             WHERE registry_crate.registry_id = registry_index.id

             ORDER BY timestamp, name",
        )?;
        let mut delete_src_stmt =
            conn.prepare_cached("DELETE FROM registry_src WHERE rowid = ?1")?;
        let mut delete_crate_stmt =
            conn.prepare_cached("DELETE FROM registry_crate WHERE rowid = ?1")?;
        let rows = stmt
            .query_map([], |row| {
                Ok((
                    row.get_unwrap(0),
                    row.get_unwrap(1),
                    row.get_unwrap(2),
                    row.get_unwrap(3),
                    row.get_unwrap(4),
                ))
            })?
            .collect::<Result<Vec<(i64, i64, String, String, u64)>, _>>()?;
        let mut total_size: u64 = rows.iter().map(|r| r.4).sum();
        debug!(target: "gc", "total download cache size appears to be {total_size}");
        for (table, rowid, name, index_name, size) in rows {
            if total_size <= max_size {
                break;
            }
            if table == 1 {
                delete_paths.push(base.src.join(index_name).join(name));
                delete_src_stmt.execute([rowid])?;
            } else {
                delete_paths.push(base.crate_dir.join(index_name).join(name));
                delete_crate_stmt.execute([rowid])?;
            }
            // TODO: If delete crate, ensure src is also deleted.
            total_size -= size;
        }
        Ok(())
    }

    /// Adds paths to delete from the git cache, keeping the total size under
    /// the given value.
    ///
    /// Paths are relative to the `git` directory in the cache directory.
    fn get_git_items_to_clean_size(
        conn: &Connection,
        max_size: u64,
        base: &BasePaths,
        delete_paths: &mut Vec<PathBuf>,
    ) -> CargoResult<()> {
        debug!(target: "gc", "cleaning git till under {max_size:?}");

        // Collect all the sizes from git_db and git_checkouts, and then sort them by timestamp.
        let mut stmt = conn.prepare_cached("SELECT rowid, name, timestamp FROM git_db")?;
        let mut git_info = stmt
            .query_map([], |row| {
                let rowid: i64 = row.get_unwrap(0);
                let name: String = row.get_unwrap(1);
                let timestamp: Timestamp = row.get_unwrap(2);
                // Size is added below so that the error doesn't need to be
                // converted to a rusqlite error.
                Ok((timestamp, rowid, None, name, 0))
            })?
            .collect::<Result<Vec<_>, _>>()?;
        for info in &mut git_info {
            let size = cargo_util::du(&base.git_db.join(&info.3), &[])?;
            info.4 = size;
        }

        let mut stmt = conn.prepare_cached(
            "SELECT git_checkout.rowid, git_db.name, git_checkout.name,
                git_checkout.size, git_checkout.timestamp
                FROM git_checkout, git_db
                WHERE git_checkout.git_id = git_db.id AND git_checkout.size NOT NULL",
        )?;
        let git_co_rows = stmt
            .query_map([], |row| {
                let rowid = row.get_unwrap(0);
                let db_name: String = row.get_unwrap(1);
                let name = row.get_unwrap(2);
                let size = row.get_unwrap(3);
                let timestamp = row.get_unwrap(4);
                Ok((timestamp, rowid, Some(db_name), name, size))
            })?
            .collect::<Result<Vec<_>, _>>()?;
        git_info.extend(git_co_rows);

        // Sort in descending order by timestamp and name so that `pop` below
        // yields the oldest entry first. The name is included mostly for test
        // purposes so that entries with the same timestamp have deterministic
        // behavior.
        git_info.sort_by(|a, b| (b.0, &b.3).cmp(&(a.0, &a.3)));

        // Collect paths to delete.
        let mut delete_db_stmt = conn.prepare_cached("DELETE FROM git_db WHERE rowid = ?1")?;
        let mut delete_co_stmt =
            conn.prepare_cached("DELETE FROM git_checkout WHERE rowid = ?1")?;
        let mut total_size: u64 = git_info.iter().map(|r| r.4).sum();
        debug!(target: "gc", "total git cache size appears to be {total_size}");
        while let Some((_timestamp, rowid, db_name, name, size)) = git_info.pop() {
            if total_size <= max_size {
                break;
            }
            if let Some(db_name) = db_name {
                delete_paths.push(base.git_co.join(db_name).join(name));
                delete_co_stmt.execute([rowid])?;
                total_size -= size;
            } else {
                total_size -= size;
                delete_paths.push(base.git_db.join(&name));
                delete_db_stmt.execute([rowid])?;
                // If the db is deleted, then all the checkouts must be deleted.
                let mut i = 0;
                while i < git_info.len() {
                    if git_info[i].2.as_deref() == Some(name.as_ref()) {
                        let (_, rowid, db_name, name, size) = git_info.remove(i);
                        delete_paths.push(base.git_co.join(db_name.unwrap()).join(name));
                        delete_co_stmt.execute([rowid])?;
                        total_size -= size;
                    } else {
                        i += 1;
                    }
                }
            }
        }
        Ok(())
    }

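/*
Illustrative note (not part of cargo): the eviction order in
`get_git_items_to_clean_size` is oldest-first, and deleting a db also deletes
every checkout that belongs to it. The sketch below models that with
in-memory values only. `GitEntry`, `evict`, and the `db/` and `checkouts/`
path prefixes are hypothetical and chosen just for this example.

```rust
/// Hypothetical cache entry: `parent` is `None` for a git db and
/// `Some(db_name)` for a checkout of that db.
#[derive(Debug, Clone)]
struct GitEntry {
    timestamp: u64,
    parent: Option<&'static str>,
    name: &'static str,
    size: u64,
}

/// Returns the paths to delete, oldest first, until the total is within
/// `max_size`.
fn evict(mut entries: Vec<GitEntry>, max_size: u64) -> Vec<String> {
    // Newest first, so that `pop` hands back the oldest entry.
    entries.sort_by(|a, b| (b.timestamp, b.name).cmp(&(a.timestamp, a.name)));
    let mut total: u64 = entries.iter().map(|e| e.size).sum();
    let mut deleted = Vec::new();
    while total > max_size {
        let Some(entry) = entries.pop() else { break };
        total -= entry.size;
        match entry.parent {
            Some(db) => deleted.push(format!("checkouts/{db}/{}", entry.name)),
            None => {
                deleted.push(format!("db/{}", entry.name));
                // A checkout cannot be used without its db, so drop those too.
                let (orphans, kept): (Vec<_>, Vec<_>) = entries
                    .into_iter()
                    .partition(|c| c.parent == Some(entry.name));
                for orphan in &orphans {
                    total -= orphan.size;
                    deleted.push(format!("checkouts/{}/{}", entry.name, orphan.name));
                }
                entries = kept;
            }
        }
    }
    deleted
}

fn main() {
    let entries = vec![
        GitEntry { timestamp: 1, parent: None, name: "repo", size: 100 },
        GitEntry { timestamp: 5, parent: Some("repo"), name: "abc", size: 50 },
        GitEntry { timestamp: 3, parent: None, name: "other", size: 40 },
    ];
    println!("{:?}", evict(entries, 60));
}
```

Here the recently used checkout `abc` is removed even though it is the newest
entry, because its db `repo` is the oldest entry and is evicted first.
*/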
    /// Adds paths to delete from `registry_index` whose last use is older
    /// than the given timestamp.
    fn get_registry_index_to_clean(
        conn: &Connection,
        max_age: Timestamp,
        base: &BasePaths,
        delete_paths: &mut Vec<PathBuf>,
    ) -> CargoResult<()> {
        debug!(target: "gc", "cleaning index since {max_age:?}");
        let mut stmt = conn.prepare_cached(
            "DELETE FROM registry_index WHERE timestamp < ?1
                RETURNING name",
        )?;
        let mut rows = stmt.query([max_age])?;
        while let Some(row) = rows.next()? {
            let name: String = row.get_unwrap(0);
            delete_paths.push(base.index.join(&name));
            // Also delete .crate and src directories, since by definition
            // they cannot be used without their index.
            delete_paths.push(base.src.join(&name));
            delete_paths.push(base.crate_dir.join(&name));
        }
        Ok(())
    }

    /// Adds paths to delete from `git_checkout` whose last use is
    /// older than the given timestamp.
    fn get_git_co_items_to_clean(
        conn: &Connection,
        max_age: Timestamp,
        base_path: &Path,
        delete_paths: &mut Vec<PathBuf>,
    ) -> CargoResult<()> {
        debug!(target: "gc", "cleaning git co since {max_age:?}");
        let mut stmt = conn.prepare_cached(
            "DELETE FROM git_checkout WHERE timestamp < ?1
                RETURNING git_id, name",
        )?;
        let rows = stmt
            .query_map(params![max_age], |row| {
                let git_id = row.get_unwrap(0);
                let name: String = row.get_unwrap(1);
                Ok((git_id, name))
            })?
            .collect::<Result<Vec<_>, _>>()?;
        let ids: Vec<_> = rows.iter().map(|r| r.0).collect();
        let id_map = Self::get_id_map(conn, GIT_DB_TABLE, &ids)?;
        for (id, name) in rows {
            let encoded_git_name = &id_map[&id];
            delete_paths.push(base_path.join(encoded_git_name).join(name));
        }
        Ok(())
    }

    /// Adds paths to delete from `git_db` whose last use is older than the
    /// given timestamp.
    fn get_git_db_items_to_clean(
        conn: &Connection,
        max_age: Timestamp,
        base: &BasePaths,
        delete_paths: &mut Vec<PathBuf>,
    ) -> CargoResult<()> {
        debug!(target: "gc", "cleaning git db since {max_age:?}");
        let mut stmt = conn.prepare_cached(
            "DELETE FROM git_db WHERE timestamp < ?1
                RETURNING name",
        )?;
        let mut rows = stmt.query([max_age])?;
        while let Some(row) = rows.next()? {
            let name: String = row.get_unwrap(0);
            delete_paths.push(base.git_db.join(&name));
            // Also delete checkout directories, since by definition they
            // cannot be used without their db.
            delete_paths.push(base.git_co.join(&name));
        }
        Ok(())
    }
}

/// Helper to generate the upsert for the parent tables.
///
/// This handles checking if the row already exists, and only updates the
/// timestamp if it hasn't been updated recently. This also handles keeping
/// a cached map of the `id` value.
///
/// Unfortunately it is a bit tricky to share this code without a macro.
macro_rules! insert_or_update_parent {
    ($self:expr, $conn:expr, $table_name:expr, $timestamps_field:ident, $keys_field:ident, $encoded_name:ident) => {
        let mut select_stmt = $conn.prepare_cached(concat!(
            "SELECT id, timestamp FROM ",
            $table_name,
            " WHERE name = ?1"
        ))?;
        let mut insert_stmt = $conn.prepare_cached(concat!(
            "INSERT INTO ",
            $table_name,
            " (name, timestamp)
                VALUES (?1, ?2)
                ON CONFLICT DO UPDATE SET timestamp=excluded.timestamp
                RETURNING id",
        ))?;
        let mut update_stmt = $conn.prepare_cached(concat!(
            "UPDATE ",
            $table_name,
            " SET timestamp = ?1 WHERE id = ?2"
        ))?;
        for (parent, new_timestamp) in std::mem::take(&mut $self.$timestamps_field) {
            trace!(target: "gc",
                concat!("insert ", $table_name, " {:?} {}"),
                parent,
                new_timestamp
            );
            let mut rows = select_stmt.query([parent.$encoded_name])?;
            let id = if let Some(row) = rows.next()? {
                let id: ParentId = row.get_unwrap(0);
                let timestamp: Timestamp = row.get_unwrap(1);
                if timestamp < new_timestamp - UPDATE_RESOLUTION {
                    update_stmt.execute(params![new_timestamp, id])?;
                }
                id
            } else {
                insert_stmt.query_row(params![parent.$encoded_name, new_timestamp], |row| {
                    row.get(0)
                })?
            };
            match $self.$keys_field.entry(parent.$encoded_name) {
                hash_map::Entry::Occupied(o) => {
                    assert_eq!(*o.get(), id);
                }
                hash_map::Entry::Vacant(v) => {
                    v.insert(id);
                }
            }
        }
        return Ok(());
    };
}

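/*
Illustrative note (not part of cargo): the macro above skips the UPDATE when
the stored timestamp is already recent, which avoids rewriting rows on every
run. The sketch below isolates that comparison. `RESOLUTION_SECS` and
`needs_update` are hypothetical; the value 300 is a placeholder and is not
claimed to match the real `UPDATE_RESOLUTION` constant, which is defined
elsewhere in this file.

```rust
/// Hypothetical resolution in seconds (placeholder value).
const RESOLUTION_SECS: u64 = 300;

/// Returns `true` if the stored timestamp is old enough to be rewritten.
fn needs_update(stored: u64, new: u64) -> bool {
    // Algebraically the same test as `stored < new - RESOLUTION_SECS`,
    // arranged as an addition so that small `new` values cannot underflow.
    stored.saturating_add(RESOLUTION_SECS) < new
}

fn main() {
    // 100 seconds later: within the resolution window, so no write.
    println!("{}", needs_update(1_000, 1_100));
    // 301 seconds later: outside the window, so the row is rewritten.
    println!("{}", needs_update(1_000, 1_301));
}
```

The trade-off is that a stored timestamp may lag the true last use by up to
the resolution, which is acceptable when entries are only cleaned after much
longer periods of disuse.
*/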
/// This is a cache of modifications that will be saved to disk all at once
/// via the [`DeferredGlobalLastUse::save`] method.
///
/// This is here to improve performance.
#[derive(Debug)]
pub struct DeferredGlobalLastUse {
    /// Cache of registry keys, used for faster fetching.
    ///
    /// The key is the registry name (which is its directory name) and the
    /// value is the `id` in the `registry_index` table.
    registry_keys: HashMap<InternedString, ParentId>,
    /// Cache of git keys, used for faster fetching.
    ///
    /// The key is the git db name (which is its directory name) and the value
    /// is the `id` in the `git_db` table.
    git_keys: HashMap<InternedString, ParentId>,

    /// New registry index entries to insert.
    registry_index_timestamps: HashMap<RegistryIndex, Timestamp>,
    /// New registry `.crate` entries to insert.
    registry_crate_timestamps: HashMap<RegistryCrate, Timestamp>,
    /// New registry src directory entries to insert.
    registry_src_timestamps: HashMap<RegistrySrc, Timestamp>,
    /// New git db entries to insert.
    git_db_timestamps: HashMap<GitDb, Timestamp>,
    /// New git checkout entries to insert.
    git_checkout_timestamps: HashMap<GitCheckout, Timestamp>,
    /// This is used so that a warning about failing to update the database is
    /// only displayed once.
    save_err_has_warned: bool,
    /// The current time, cached to avoid reading the clock hundreds of
    /// times.
    now: Timestamp,
}

impl DeferredGlobalLastUse {
1454    pub fn new() -> DeferredGlobalLastUse {
1455        DeferredGlobalLastUse {
1456            registry_keys: HashMap::new(),
1457            git_keys: HashMap::new(),
1458            registry_index_timestamps: HashMap::new(),
1459            registry_crate_timestamps: HashMap::new(),
1460            registry_src_timestamps: HashMap::new(),
1461            git_db_timestamps: HashMap::new(),
1462            git_checkout_timestamps: HashMap::new(),
1463            save_err_has_warned: false,
1464            now: now(),
1465        }
1466    }
1467
1468    pub fn is_empty(&self) -> bool {
1469        self.registry_index_timestamps.is_empty()
1470            && self.registry_crate_timestamps.is_empty()
1471            && self.registry_src_timestamps.is_empty()
1472            && self.git_db_timestamps.is_empty()
1473            && self.git_checkout_timestamps.is_empty()
1474    }
1475
1476    fn clear(&mut self) {
1477        self.registry_index_timestamps.clear();
1478        self.registry_crate_timestamps.clear();
1479        self.registry_src_timestamps.clear();
1480        self.git_db_timestamps.clear();
1481        self.git_checkout_timestamps.clear();
1482    }
1483
1484    /// Indicates the given [`RegistryIndex`] has been used right now.
    pub fn mark_registry_index_used(&mut self, registry_index: RegistryIndex) {
        self.mark_registry_index_used_stamp(registry_index, None);
    }

    /// Indicates the given [`RegistryCrate`] has been used right now.
    ///
    /// Also implicitly marks the index as used.
    pub fn mark_registry_crate_used(&mut self, registry_crate: RegistryCrate) {
        self.mark_registry_crate_used_stamp(registry_crate, None);
    }

    /// Indicates the given [`RegistrySrc`] has been used right now.
    ///
    /// Also implicitly marks the index as used.
    pub fn mark_registry_src_used(&mut self, registry_src: RegistrySrc) {
        self.mark_registry_src_used_stamp(registry_src, None);
    }

    /// Indicates the given [`GitCheckout`] has been used right now.
    ///
    /// Also implicitly marks the git db as used.
    pub fn mark_git_checkout_used(&mut self, git_checkout: GitCheckout) {
        self.mark_git_checkout_used_stamp(git_checkout, None);
    }

    /// Indicates the given [`RegistryIndex`] has been used with the given
    /// time (or "now" if `None`).
    pub fn mark_registry_index_used_stamp(
        &mut self,
        registry_index: RegistryIndex,
        timestamp: Option<&SystemTime>,
    ) {
        let timestamp = timestamp.map_or(self.now, to_timestamp);
        self.registry_index_timestamps
            .insert(registry_index, timestamp);
    }

    /// Indicates the given [`RegistryCrate`] has been used with the given
    /// time (or "now" if `None`).
    ///
    /// Also implicitly marks the index as used.
    pub fn mark_registry_crate_used_stamp(
        &mut self,
        registry_crate: RegistryCrate,
        timestamp: Option<&SystemTime>,
    ) {
        let timestamp = timestamp.map_or(self.now, to_timestamp);
        let index = RegistryIndex {
            encoded_registry_name: registry_crate.encoded_registry_name,
        };
        self.registry_index_timestamps.insert(index, timestamp);
        self.registry_crate_timestamps
            .insert(registry_crate, timestamp);
    }

    /// Indicates the given [`RegistrySrc`] has been used with the given
    /// time (or "now" if `None`).
    ///
    /// Also implicitly marks the index as used.
    pub fn mark_registry_src_used_stamp(
        &mut self,
        registry_src: RegistrySrc,
        timestamp: Option<&SystemTime>,
    ) {
        let timestamp = timestamp.map_or(self.now, to_timestamp);
        let index = RegistryIndex {
            encoded_registry_name: registry_src.encoded_registry_name,
        };
        self.registry_index_timestamps.insert(index, timestamp);
        self.registry_src_timestamps.insert(registry_src, timestamp);
    }

    /// Indicates the given [`GitCheckout`] has been used with the given
    /// time (or "now" if `None`).
    ///
    /// Also implicitly marks the git db as used.
    pub fn mark_git_checkout_used_stamp(
        &mut self,
        git_checkout: GitCheckout,
        timestamp: Option<&SystemTime>,
    ) {
        let timestamp = timestamp.map_or(self.now, to_timestamp);
        let db = GitDb {
            encoded_git_name: git_checkout.encoded_git_name,
        };
        self.git_db_timestamps.insert(db, timestamp);
        self.git_checkout_timestamps.insert(git_checkout, timestamp);
    }

    /// Saves all of the deferred information to the database.
    ///
    /// This will also clear the state of `self`.
    #[tracing::instrument(skip_all)]
    pub fn save(&mut self, tracker: &mut GlobalCacheTracker) -> CargoResult<()> {
        trace!(target: "gc", "saving last-use data");
        if self.is_empty() {
            return Ok(());
        }
        let tx = tracker.conn.transaction()?;
        // These must run before the ones that refer to their IDs.
        self.insert_registry_index_from_cache(&tx)?;
        self.insert_git_db_from_cache(&tx)?;
        self.insert_registry_crate_from_cache(&tx)?;
        self.insert_registry_src_from_cache(&tx)?;
        self.insert_git_checkout_from_cache(&tx)?;
        tx.commit()?;
        trace!(target: "gc", "last-use save complete");
        Ok(())
    }

    /// Variant of [`DeferredGlobalLastUse::save`] that does not return an
    /// error.
    ///
    /// This will log or display a warning to the user.
    pub fn save_no_error(&mut self, gctx: &GlobalContext) {
        if let Err(e) = self.save_with_gctx(gctx) {
            // Because there is an assertion in auto-gc that checks if this is
            // empty, be sure to clear it so that assertion doesn't fail.
            self.clear();
            if !self.save_err_has_warned {
                if is_silent_error(&e) && gctx.shell().verbosity() != Verbosity::Verbose {
                    tracing::warn!("failed to save last-use data: {e:?}");
                } else {
                    crate::display_warning_with_error(
                        "failed to save last-use data\n\
                        This may prevent cargo from accurately tracking what is being \
                        used in its global cache. This information is used for \
                        automatically removing unused data in the cache.",
                        &e,
                        &mut gctx.shell(),
                    );
                    self.save_err_has_warned = true;
                }
            }
        }
    }

    fn save_with_gctx(&mut self, gctx: &GlobalContext) -> CargoResult<()> {
        let mut tracker = gctx.global_cache_tracker()?;
        self.save(&mut tracker)
    }

    /// Flushes all of the `registry_index_timestamps` to the database,
    /// clearing `registry_index_timestamps`.
    fn insert_registry_index_from_cache(&mut self, conn: &Connection) -> CargoResult<()> {
        insert_or_update_parent!(
            self,
            conn,
            "registry_index",
            registry_index_timestamps,
            registry_keys,
            encoded_registry_name
        );
    }

    /// Flushes all of the `git_db_timestamps` to the database,
    /// clearing `git_db_timestamps`.
    fn insert_git_db_from_cache(&mut self, conn: &Connection) -> CargoResult<()> {
        insert_or_update_parent!(
            self,
            conn,
            "git_db",
            git_db_timestamps,
            git_keys,
            encoded_git_name
        );
    }

    /// Flushes all of the `registry_crate_timestamps` to the database,
    /// clearing `registry_crate_timestamps`.
    fn insert_registry_crate_from_cache(&mut self, conn: &Connection) -> CargoResult<()> {
        let registry_crate_timestamps = std::mem::take(&mut self.registry_crate_timestamps);
        for (registry_crate, timestamp) in registry_crate_timestamps {
            trace!(target: "gc", "insert registry crate {registry_crate:?} {timestamp}");
            let registry_id = self.registry_id(conn, registry_crate.encoded_registry_name)?;
            let mut stmt = conn.prepare_cached(
                "INSERT INTO registry_crate (registry_id, name, size, timestamp)
                 VALUES (?1, ?2, ?3, ?4)
                 ON CONFLICT DO UPDATE SET timestamp=excluded.timestamp
                    WHERE timestamp < ?5
                 ",
            )?;
            stmt.execute(params![
                registry_id,
                registry_crate.crate_filename,
                registry_crate.size,
                timestamp,
                timestamp - UPDATE_RESOLUTION
            ])?;
        }
        Ok(())
    }

    /// Flushes all of the `registry_src_timestamps` to the database,
    /// clearing `registry_src_timestamps`.
    fn insert_registry_src_from_cache(&mut self, conn: &Connection) -> CargoResult<()> {
        let registry_src_timestamps = std::mem::take(&mut self.registry_src_timestamps);
        for (registry_src, timestamp) in registry_src_timestamps {
            trace!(target: "gc", "insert registry src {registry_src:?} {timestamp}");
            let registry_id = self.registry_id(conn, registry_src.encoded_registry_name)?;
            let mut stmt = conn.prepare_cached(
                "INSERT INTO registry_src (registry_id, name, size, timestamp)
                 VALUES (?1, ?2, ?3, ?4)
                 ON CONFLICT DO UPDATE SET timestamp=excluded.timestamp
                    WHERE timestamp < ?5
                 ",
            )?;
            stmt.execute(params![
                registry_id,
                registry_src.package_dir,
                registry_src.size,
                timestamp,
                timestamp - UPDATE_RESOLUTION
            ])?;
        }

        Ok(())
    }

    /// Flushes all of the `git_checkout_timestamps` to the database,
    /// clearing `git_checkout_timestamps`.
    fn insert_git_checkout_from_cache(&mut self, conn: &Connection) -> CargoResult<()> {
        let git_checkout_timestamps = std::mem::take(&mut self.git_checkout_timestamps);
        for (git_checkout, timestamp) in git_checkout_timestamps {
            let git_id = self.git_id(conn, git_checkout.encoded_git_name)?;
            let mut stmt = conn.prepare_cached(
                "INSERT INTO git_checkout (git_id, name, size, timestamp)
                 VALUES (?1, ?2, ?3, ?4)
                 ON CONFLICT DO UPDATE SET timestamp=excluded.timestamp
                    WHERE timestamp < ?5",
            )?;
            stmt.execute(params![
                git_id,
                git_checkout.short_name,
                git_checkout.size,
                timestamp,
                timestamp - UPDATE_RESOLUTION
            ])?;
        }

        Ok(())
    }

    /// Returns the numeric ID of the registry, either fetching from the local
    /// cache, or getting it from the database.
    ///
    /// It is an error if the registry does not exist.
    fn registry_id(
        &mut self,
        conn: &Connection,
        encoded_registry_name: InternedString,
    ) -> CargoResult<ParentId> {
        match self.registry_keys.get(&encoded_registry_name) {
            Some(i) => Ok(*i),
            None => {
                let Some(id) = GlobalCacheTracker::id_from_name(
                    conn,
                    REGISTRY_INDEX_TABLE,
                    &encoded_registry_name,
                )?
                else {
                    bail!(
                        "expected registry_index {encoded_registry_name} to exist, but wasn't found"
                    );
                };
                self.registry_keys.insert(encoded_registry_name, id);
                Ok(id)
            }
        }
    }

    /// Returns the numeric ID of the git db, either fetching from the local
    /// cache, or getting it from the database.
    ///
    /// It is an error if the git db does not exist.
    fn git_id(
        &mut self,
        conn: &Connection,
        encoded_git_name: InternedString,
    ) -> CargoResult<ParentId> {
        match self.git_keys.get(&encoded_git_name) {
            Some(i) => Ok(*i),
            None => {
                let Some(id) =
                    GlobalCacheTracker::id_from_name(conn, GIT_DB_TABLE, &encoded_git_name)?
                else {
                    bail!("expected git_db {encoded_git_name} to exist, but wasn't found")
                };
                self.git_keys.insert(encoded_git_name, id);
                Ok(id)
            }
        }
    }
}
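/*
Illustrative sketch only, not part of cargo: an in-memory model of the
`INSERT ... ON CONFLICT DO UPDATE ... WHERE timestamp < ?5` statements used by
the `insert_*_from_cache` methods above. A `HashMap` stands in for a table.
`upsert_last_use` is a hypothetical name, and `RESOLUTION_SECS` is an assumed
value chosen for the example, since `UPDATE_RESOLUTION` is defined elsewhere
in this module.

```rust
use std::collections::HashMap;

/// Assumed value for illustration only.
const RESOLUTION_SECS: u64 = 300;

/// Inserts `name` with `timestamp`, or updates an existing row only when the
/// stored timestamp is older than `timestamp - RESOLUTION_SECS`.
/// Returns `true` if the table was modified.
pub fn upsert_last_use(table: &mut HashMap<String, u64>, name: &str, timestamp: u64) -> bool {
    // `saturating_sub` keeps the sketch from underflowing on tiny timestamps.
    let threshold = timestamp.saturating_sub(RESOLUTION_SECS);
    match table.get(name).copied() {
        // Conflict, and the stored value is recent enough: the
        // `WHERE timestamp < ?5` guard rejects the update.
        Some(stored) if stored >= threshold => false,
        // Either there is no row yet (plain insert) or the stored value is
        // stale (update).
        _ => {
            table.insert(name.to_string(), timestamp);
            true
        }
    }
}
```

The effect of the guard is that an entry used many times in quick succession
is rewritten at most once per resolution window.
*/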

/// Converts a [`SystemTime`] to a [`Timestamp`] which can be stored in the database.
fn to_timestamp(t: &SystemTime) -> Timestamp {
    t.duration_since(SystemTime::UNIX_EPOCH)
        .expect("invalid clock")
        .as_secs()
}
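/*
Illustrative sketch only, not part of cargo: how the `*_used_stamp` methods
combine `to_timestamp` with `Option::map_or` to fall back to the cached
current time. `secs_since_epoch` and `resolve_stamp` are hypothetical names,
and `u64` is used directly in place of `Timestamp`.

```rust
use std::time::SystemTime;

/// Whole seconds since the Unix epoch, mirroring `to_timestamp` above.
/// Panics if `t` is earlier than the epoch.
pub fn secs_since_epoch(t: &SystemTime) -> u64 {
    t.duration_since(SystemTime::UNIX_EPOCH)
        .expect("invalid clock")
        .as_secs()
}

/// Uses the explicit time when one is given, otherwise the cached `now`.
pub fn resolve_stamp(explicit: Option<&SystemTime>, cached_now: u64) -> u64 {
    explicit.map_or(cached_now, secs_since_epoch)
}
```

Sub-second precision is dropped by `as_secs`, so two uses within the same
second produce the same stamp.
*/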

/// Returns the current time.
///
/// This supports pretending that the time is different for testing using an
/// environment variable.
///
/// If possible, try to avoid calling this too often since accessing clocks
/// can be a little slow on some systems.
#[expect(
    clippy::disallowed_methods,
    reason = "testing only, no reason for config support"
)]
fn now() -> Timestamp {
    match std::env::var("__CARGO_TEST_LAST_USE_NOW") {
        Ok(now) => now.parse().unwrap(),
        Err(_) => to_timestamp(&SystemTime::now()),
    }
}
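/*
Illustrative sketch only, not part of cargo: the override-or-clock pattern
used by `now` above, with the override passed in as a parameter so it can be
exercised without touching the process environment. `now_with_override` is a
hypothetical name. Unlike `now`, this sketch reports a malformed override as
an error instead of panicking.

```rust
use std::time::SystemTime;

/// Returns the parsed override if `override_value` is set, otherwise the
/// system clock in whole seconds since the Unix epoch.
pub fn now_with_override(override_value: Option<&str>) -> Result<u64, std::num::ParseIntError> {
    match override_value {
        Some(value) => value.parse(),
        None => Ok(SystemTime::now()
            .duration_since(SystemTime::UNIX_EPOCH)
            .expect("invalid clock")
            .as_secs()),
    }
}
```

A caller that wants the environment-driven behavior could pass
`std::env::var("__CARGO_TEST_LAST_USE_NOW").ok().as_deref()` as the argument.
*/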

/// Returns whether the given error should be logged quietly rather than
/// displayed to the user as a warning.
///
/// In some situations, like a read-only global cache, we don't want to spam
/// the user with a warning. Once cargo has controllable lints, we should
/// consider changing this to always warn, but give the user an option to
/// silence the warning.
pub fn is_silent_error(e: &anyhow::Error) -> bool {
    if let Some(e) = e.downcast_ref::<rusqlite::Error>() {
        if matches!(
            e.sqlite_error_code(),
            Some(ErrorCode::CannotOpen | ErrorCode::ReadOnly)
        ) {
            return true;
        }
    }
    false
}
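/*
Illustrative sketch only, not part of cargo: the downcast-and-match shape of
`is_silent_error`, written against the standard library so it stands alone.
`std::io::Error` stands in for `rusqlite::Error`, the two `ErrorKind` values
are arbitrary stand-ins for the SQLite error codes, and `is_quiet_io_error` is
a hypothetical name. It uses `downcast_ref` on `dyn Error`, which is a
different method from the one on `anyhow::Error`.

```rust
use std::error::Error;
use std::io;

/// Returns `true` when `e` is an `io::Error` of a kind this sketch treats as
/// expected, such as a read-only or missing cache directory.
pub fn is_quiet_io_error(e: &(dyn Error + 'static)) -> bool {
    match e.downcast_ref::<io::Error>() {
        Some(io_err) => matches!(
            io_err.kind(),
            io::ErrorKind::PermissionDenied | io::ErrorKind::NotFound
        ),
        // Any other error type is never treated as quiet.
        None => false,
    }
}
```

Errors of any other type, or of any other kind, fall through to `false` and
would therefore be reported to the user.
*/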

/// Returns the disk usage for a git checkout directory.
#[tracing::instrument]
fn du_git_checkout(path: &Path) -> CargoResult<u64> {
    // !.git is used because clones typically use hardlinks for the git
    // contents. TODO: Verify behavior on Windows.
    // TODO: Or even better, switch to worktrees, and remove this.
    cargo_util::du(&path, &["!.git"])
}

fn du(path: &Path, table_name: &str) -> CargoResult<u64> {
    if table_name == GIT_CO_TABLE {
        du_git_checkout(path)
    } else {
        cargo_util::du(&path, &[])
    }
}