// cargo/sources/registry/mod.rs

//! A `Source` for registry-based packages.
//!
//! # What's a Registry?
//!
//! [Registries] are central locations where packages can be uploaded,
//! discovered, and searched for. The purpose of a registry is to have a
//! location that serves as permanent storage for versions of a crate over time.
//!
//! Compared to git sources (see [`GitSource`]), a registry provides many
//! packages as well as many versions simultaneously. Git sources can also
//! have commits deleted through rebasing, whereas registries cannot have
//! their versions deleted.
//!
//! In Cargo, [`RegistryData`] is an abstraction over each kind of actual
//! registry, and [`RegistrySource`] connects those implementations to the
//! [`Source`] trait. Two prominent features these abstractions provide are
//!
//! * A way to query the metadata of a package from a registry. The metadata
//!   comes from the index.
//! * A way to download package contents (a.k.a source files) that are required
//!   when building the package itself.
//!
//! Both features are covered in the sections below.
//!
//! [Registries]: https://doc.rust-lang.org/nightly/cargo/reference/registries.html
//! [`GitSource`]: super::GitSource
//!
//! # Different Kinds of Registries
//!
//! Cargo provides multiple kinds of registries. Each of them serves the index
//! and package contents in a slightly different way. Namely,
//!
//! * [`LocalRegistry`] --- Serves the index and package contents entirely on
//!   a local filesystem.
//! * [`RemoteRegistry`] --- Serves the index ahead of time from a Git
//!   repository, and package contents are downloaded as needed.
//! * [`HttpRegistry`] --- Serves both the index and package contents on demand
//!   over an HTTP-based registry API. This is the default starting from 1.70.0.
//!
//! Each registry has its own [`RegistryData`] implementation, and can be
//! created from either [`RegistrySource::local`] or [`RegistrySource::remote`].
//!
//! [`LocalRegistry`]: local::LocalRegistry
//! [`RemoteRegistry`]: remote::RemoteRegistry
//! [`HttpRegistry`]: http_remote::HttpRegistry
//!
//! # The Index of a Registry
//!
//! One of the major difficulties with a registry is that hosting so many
//! packages may quickly run into performance problems when dealing with
//! dependency graphs. It's infeasible for cargo to download the entire contents
//! of the registry just to resolve one package's dependencies, for example. As
//! a result, cargo needs some efficient method of querying what packages are
//! available on a registry, what versions are available, and what the
//! dependencies for each version are.
//!
//! To solve the problem, a registry must provide an index of package metadata.
//! The index of a registry is essentially an easily queryable version of the
//! registry's database for a list of versions of a package as well as a list
//! of dependencies for each version. The exact format of the index is
//! described later.
//!
//! See the [`index`] module for topics about the management, parsing, caching,
//! and versioning for the on-disk index.
//!
//! ## The Format of The Index
//!
//! The index is a store for the list of versions for all packages known, so its
//! format on disk is optimized slightly to ensure that `ls registry` doesn't
//! produce a list of all packages ever known. The index also wants to ensure
//! that there aren't a million files in one place, which may actually end up
//! hitting filesystem limits at some point. To this end, a few decisions were
//! made about the format of the registry:
//!
//! 1. Each crate will have one file corresponding to it. Each version for a
//!    crate will just be a line in this file (see [`cargo_util_schemas::index::IndexPackage`] for its
//!    representation).
//! 2. There will be two tiers of directories for crate names, under which
//!    crates corresponding to those tiers will be located.
//!    (See [`cargo_util::registry::make_dep_path`] for the implementation of
//!    this layout hierarchy.)
//!
//! For example, the hierarchy of an index might look like this:
//!
//! ```notrust
//! .
//! ├── 3
//! │   └── u
//! │       └── url
//! ├── bz
//! │   └── ip
//! │       └── bzip2
//! ├── config.json
//! ├── en
//! │   └── co
//! │       └── encoding
//! └── li
//!     ├── bg
//!     │   └── libgit2
//!     └── nk
//!         └── link-config
//! ```
//!
//! The root of the index contains a `config.json` file with a few entries
//! corresponding to the registry (see [`RegistryConfig`] below).
//!
//! Otherwise, there are three numbered directories (1, 2, 3) for crates with
//! names 1, 2, and 3 characters in length. The 1/2 directories simply have the
//! crate files underneath them, while the 3 directory is sharded by the first
//! letter of the crate name.
//!
//! For longer names, the top-level directory contains many two-letter
//! directories (the first two letters of the crate name), each of which has
//! many two-letter sub-folders (the next two letters). At the end of all these
//! are the actual crate files themselves.
//!
//! The purpose of this layout is to cut down on `ls` sizes and to allow
//! efficient lookup based on the crate name itself.
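/*!
The layout described above can be sketched as a small function. This is a
minimal illustration, not Cargo's implementation (that lives in
[`cargo_util::registry::make_dep_path`]). `index_rel_path` is a hypothetical
helper, and it assumes a non-empty ASCII crate name.

```rust
/// Returns the path of a crate's index file relative to the index root.
///
/// Hypothetical helper for illustration only. Assumes `name` is a
/// non-empty ASCII string, so byte slicing is safe.
fn index_rel_path(name: &str) -> String {
    assert!(!name.is_empty() && name.is_ascii());
    match name.len() {
        // 1- and 2-character names sit directly under `1/` and `2/`.
        1 => format!("1/{name}"),
        2 => format!("2/{name}"),
        // 3-character names are sharded by their first letter.
        3 => format!("3/{}/{name}", &name[..1]),
        // Longer names use two tiers of two-letter directories.
        _ => format!("{}/{}/{name}", &name[..2], &name[2..4]),
    }
}
```

Under these assumptions, `index_rel_path("libgit2")` gives `li/bg/libgit2` and
`index_rel_path("url")` gives `3/u/url`, matching the example hierarchy.
*/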
//!
//! See [The Cargo Book: Registry Index][registry-index] for the public
//! interface on the index format.
//!
//! [registry-index]: https://doc.rust-lang.org/nightly/cargo/reference/registry-index.html
//!
//! ## The Index Files
//!
//! Each file in the index is the history of one crate over time. Each line in
//! the file corresponds to one version of a crate, stored in JSON format (see
//! the [`cargo_util_schemas::index::IndexPackage`] structure).
//!
//! As new versions are published, new lines are appended to this file. **The
//! only modifications to this file that should happen over time are yanks of a
//! particular version.**
//!
//! # Downloading Packages
//!
//! The purpose of the index was to provide an efficient method to resolve the
//! dependency graph for a package. After resolution has been performed, we need
//! to download the contents of packages so we can read the full manifest and
//! build the source code.
//!
//! To accomplish this, [`RegistryData::download`] will "make" an HTTP request
//! for each requested package to download its tarball into a local cache. These
//! tarballs will then be unpacked into a destination folder.
//!
//! Note that because versions uploaded to the registry are frozen forever, the
//! HTTP download and unpacking can both be skipped if the version has
//! already been downloaded and unpacked. This caching allows us to only
//! download a package when absolutely necessary.
//!
//! # Filesystem Hierarchy
//!
//! Overall, `$HOME/.cargo` looks like this when talking about the registry
//! (remote registries, specifically):
//!
//! ```notrust
//! # A folder under which all registry metadata is hosted (similar to
//! # $HOME/.cargo/git)
//! $HOME/.cargo/registry/
//!
//!     # For each registry that cargo knows about (keyed by hostname + hash)
//!     # there is a folder which is the checked out version of the index for
//!     # the registry in this location. Note that this is done so cargo can
//!     # support multiple registries simultaneously
//!     index/
//!         registry1-<hash>/
//!         registry2-<hash>/
//!         ...
//!
//!     # This folder is a cache for all downloaded tarballs (`.crate` file)
//!     # from a registry. Once downloaded and verified, a tarball never changes.
//!     cache/
//!         registry1-<hash>/<pkg>-<version>.crate
//!         ...
//!
//!     # Location in which all tarballs are unpacked. Each tarball is known to
//!     # be frozen after downloading, so transitively this folder is also
//!     # frozen once it's unpacked (it's never unpacked again)
//!     # CAVEAT: They are not read-only. See rust-lang/cargo#9455.
//!     src/
//!         registry1-<hash>/<pkg>-<version>/...
//!         ...
//! ```
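/*!
As a rough sketch of how these locations relate, the three paths for one
package can be derived from the Cargo home and the per-registry directory
name. The names below (`RegistryPaths`, `registry_paths`) are hypothetical and
exist only for this example. Cargo itself obtains these paths through
[`GlobalContext`] and [`Filesystem`].

```rust
use std::path::{Path, PathBuf};

/// Locations used for one package of one registry (hypothetical type).
struct RegistryPaths {
    /// Checked-out index of the registry.
    index: PathBuf,
    /// Cached `.crate` tarball of the package.
    tarball: PathBuf,
    /// Directory the tarball is unpacked into.
    src: PathBuf,
}

/// Builds the paths from the hierarchy above.
///
/// `registry_dir` is the `<hostname>-<hash>` directory name; how the hash is
/// computed is outside the scope of this sketch.
fn registry_paths(
    cargo_home: &Path,
    registry_dir: &str,
    pkg: &str,
    version: &str,
) -> RegistryPaths {
    let base = cargo_home.join("registry");
    let pkg_dir = format!("{pkg}-{version}");
    RegistryPaths {
        index: base.join("index").join(registry_dir),
        tarball: base
            .join("cache")
            .join(registry_dir)
            .join(format!("{pkg_dir}.crate")),
        src: base.join("src").join(registry_dir).join(pkg_dir),
    }
}
```

Keeping the per-registry directory name identical under `index/`, `cache/`,
and `src/` is what lets several registries coexist without colliding.
*/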
//!

use std::cell::RefCell;
use std::collections::HashSet;
use std::fs;
use std::fs::{File, OpenOptions};
use std::io;
use std::io::Read;
use std::io::Write;
use std::path::{Path, PathBuf};

use anyhow::Context as _;
use cargo_util::paths;
use cargo_util_terminal::report::Level;
use flate2::read::GzDecoder;
use futures::FutureExt as _;
use serde::Deserialize;
use serde::Serialize;
use tar::{Archive, EntryType};
use tracing::debug;

use crate::core::dependency::Dependency;
use crate::core::global_cache_tracker;
use crate::core::{Package, PackageId, SourceId};
use crate::sources::PathSource;
use crate::sources::source::MaybePackage;
use crate::sources::source::QueryKind;
use crate::sources::source::Source;
use crate::util::cache_lock::CacheLockMode;
use crate::util::interning::InternedString;
use crate::util::{CargoResult, Filesystem, GlobalContext, LimitErrorReader, restricted_names};
use crate::util::{VersionExt, hex};

pub use cargo_util_schemas::index::RegistryConfig;

/// The `.cargo-ok` file is used to track if the source is already unpacked.
/// See [`RegistrySource::unpack_package`] for more.
///
/// Not to be confused with the `.cargo-ok` file in git sources.
const PACKAGE_SOURCE_LOCK: &str = ".cargo-ok";

pub const CRATES_IO_INDEX: &str = "https://github.com/rust-lang/crates.io-index";
pub const CRATES_IO_HTTP_INDEX: &str = "sparse+https://index.crates.io/";
pub const CRATES_IO_REGISTRY: &str = "crates-io";
pub const CRATES_IO_DOMAIN: &str = "crates.io";

/// The content inside `.cargo-ok`.
/// See [`RegistrySource::unpack_package`] for more.
#[derive(Deserialize, Serialize)]
#[serde(rename_all = "kebab-case")]
struct LockMetadata {
    /// The version of the `.cargo-ok` file.
    v: u32,
}

/// A [`Source`] implementation for a local or a remote registry.
///
/// This contains common functionality that is shared between each registry
/// kind, with the registry-specific logic implemented as part of the
/// [`RegistryData`] trait referenced via the `ops` field.
///
/// For general concepts of registries, see the [module-level documentation](crate::sources::registry).
pub struct RegistrySource<'gctx> {
    /// A unique name of the source (typically used as the directory name
    /// where its cached content is stored).
    name: InternedString,
    /// The unique identifier of this source.
    source_id: SourceId,
    /// The path where crate files are extracted (`$CARGO_HOME/registry/src/$REG-HASH`).
    src_path: Filesystem,
    /// Local reference to [`GlobalContext`] for convenience.
    gctx: &'gctx GlobalContext,
    /// Abstraction for interfacing to the different registry kinds.
    ops: Box<dyn RegistryData + 'gctx>,
    /// Interface for managing the on-disk index.
    index: index::RegistryIndex<'gctx>,
    /// Yanked versions that have already been selected during queries.
    ///
    /// As of this writing, this is for not emitting the `--precise <yanked>`
    /// warning twice, with the assumption of (`dep.package_name()` + `--precise`
    /// version) being sufficient to uniquely identify the same query result.
    selected_precise_yanked: RefCell<HashSet<(InternedString, semver::Version)>>,
}

/// Result from loading data from a registry.
#[derive(Debug, Clone)]
pub enum LoadResponse {
    /// The cache is valid. The cached data should be used.
    CacheValid,

    /// The cache is out of date. Returned data should be used.
    Data {
        raw_data: Vec<u8>,
        /// Version of this data to determine whether it is out of date.
        index_version: Option<String>,
    },

    /// The requested crate was not found.
    NotFound,
}

/// An abstract interface to handle both a local and remote registry.
///
/// This allows [`RegistrySource`] to abstractly handle each registry kind.
///
/// For general concepts of registries, see the [module-level documentation](crate::sources::registry).
#[async_trait::async_trait(?Send)]
pub trait RegistryData {
    /// Performs initialization for the registry.
    ///
    /// This should be safe to call multiple times; the implementation is
    /// expected to not do any work if it is already prepared.
    fn prepare(&self) -> CargoResult<()>;

    /// Returns the path to the index.
    ///
    /// Note that different registries store the index in different formats
    /// (remote = git, http & local = files).
    fn index_path(&self) -> &Filesystem;

    /// Returns the path of the directory that stores the cache of `.crate` files.
    ///
    /// The directory is currently expected to contain a flat list of all `.crate` files,
    /// named `<package-name>-<version>.crate`.
    fn cache_path(&self) -> &Filesystem;

    /// Loads the JSON for a specific named package from the index.
    ///
    /// * `root` is the root path to the index.
    /// * `path` is the relative path to the package to load (like `ca/rg/cargo`).
    /// * `index_version` is the version of the requested crate data currently
    ///    in cache. This is useful for checking if a local cache is outdated.
    async fn load(
        &self,
        root: &Path,
        path: &Path,
        index_version: Option<&str>,
    ) -> CargoResult<LoadResponse>;

    /// Loads the `config.json` file and returns it.
    ///
    /// Local registries don't have a config, and return `None`.
    async fn config(&self) -> CargoResult<Option<RegistryConfig>>;

    /// Invalidates locally cached data.
    fn invalidate_cache(&self);

    /// If quiet, the source should not display any progress or status messages.
    fn set_quiet(&mut self, quiet: bool);

    /// Is the local cached data up-to-date?
    fn is_updated(&self) -> bool;

    /// Prepare to start downloading a `.crate` file.
    ///
    /// Despite the name, this doesn't actually download anything. If the
    /// `.crate` is already downloaded, then it returns [`MaybeLock::Ready`].
    /// If it hasn't been downloaded, then it returns [`MaybeLock::Download`]
    /// which contains the URL to download. The [`crate::core::package::Downloads`]
    /// system handles the actual download process. After downloading, it
    /// calls [`Self::finish_download`] to save the downloaded file.
    ///
    /// `checksum` is currently only used by local registries to verify the
    /// file contents (because local registries never actually download
    /// anything). Remote registries will validate the checksum in
    /// `finish_download`. For already downloaded `.crate` files, it does not
    /// validate the checksum, assuming the filesystem does not suffer from
    /// corruption or manipulation.
    async fn download(&self, pkg: PackageId, checksum: &str) -> CargoResult<MaybeLock>;

    /// Finish a download by saving a `.crate` file to disk.
    ///
    /// After [`crate::core::package::Downloads`] has finished a download,
    /// it will call this to save the `.crate` file. This is only relevant
    /// for remote registries. This should validate the checksum and save
    /// the given data to the on-disk cache.
    ///
    /// Returns a [`File`] handle to the `.crate` file, positioned at the start.
    async fn finish_download(
        &self,
        pkg: PackageId,
        checksum: &str,
        data: &[u8],
    ) -> CargoResult<File>;

    /// Returns whether or not the `.crate` file is already downloaded.
    fn is_crate_downloaded(&self, _pkg: PackageId) -> bool {
        true
    }

    /// Validates that the global package cache lock is held.
    ///
    /// Given the [`Filesystem`], this will make sure that the package cache
    /// lock is held. If not, it will panic. See
    /// [`GlobalContext::acquire_package_cache_lock`] for acquiring the global lock.
    ///
    /// Returns the [`Path`] to the [`Filesystem`].
    fn assert_index_locked<'a>(&self, path: &'a Filesystem) -> &'a Path;
}

/// The status of [`RegistryData::download`] which indicates if a `.crate`
/// file has already been downloaded, or if not then the URL to download.
pub enum MaybeLock {
    /// The `.crate` file is already downloaded. [`File`] is a handle to the
    /// opened `.crate` file on the filesystem.
    Ready(File),
    /// The `.crate` file is not downloaded, here's the URL to download it from.
    ///
    /// `descriptor` is just a text string to display to the user of what is
    /// being downloaded.
    Download {
        url: String,
        descriptor: String,
        authorization: Option<String>,
    },
}

mod download;
mod http_remote;
pub(crate) mod index;
pub use index::IndexSummary;
mod local;
mod remote;

/// Generates a unique name for [`SourceId`] to have a unique path to put their
/// index files.
fn short_name(id: SourceId, is_shallow: bool) -> String {
    // CAUTION: This should not change between versions. If you change how
    // this is computed, it will orphan previously cached data, forcing the
    // cache to be rebuilt and potentially wasting significant disk space. If
    // you change it, be cautious of the impact. See `test_cratesio_hash` for
    // a similar discussion.
    let hash = hex::short_hash(&id);
    let ident = id.url().host_str().unwrap_or("").to_string();
    let mut name = format!("{}-{}", ident, hash);
    if is_shallow {
        name.push_str("-shallow");
    }
    name
}

impl<'gctx> RegistrySource<'gctx> {
    /// Creates a [`Source`] of a "remote" registry.
    /// It could be either an HTTP-based [`http_remote::HttpRegistry`] or
    /// a Git-based [`remote::RemoteRegistry`].
    pub fn remote(
        source_id: SourceId,
        gctx: &'gctx GlobalContext,
    ) -> CargoResult<RegistrySource<'gctx>> {
        assert!(source_id.is_remote_registry());
        let name = short_name(
            source_id,
            gctx.cli_unstable()
                .git
                .map_or(false, |features| features.shallow_index)
                && !source_id.is_sparse(),
        );
        let ops = if source_id.is_sparse() {
            Box::new(http_remote::HttpRegistry::new(source_id, gctx, &name)?) as Box<_>
        } else {
            Box::new(remote::RemoteRegistry::new(source_id, gctx, &name)) as Box<_>
        };

        Ok(RegistrySource::new(source_id, gctx, &name, ops))
    }

    /// Creates a [`Source`] of a local registry, with [`local::LocalRegistry`] under the hood.
    ///
    /// * `path` --- The root path of a local registry on the file system.
    pub fn local(
        source_id: SourceId,
        path: &Path,
        gctx: &'gctx GlobalContext,
    ) -> RegistrySource<'gctx> {
        let name = short_name(source_id, false);
        let ops = local::LocalRegistry::new(path, gctx, &name);
        RegistrySource::new(source_id, gctx, &name, Box::new(ops))
    }

    /// Creates a source of a registry. This is an inner helper function.
    ///
    /// * `name` --- Name of a path segment which may affect where `.crate`
    ///   tarballs, the registry index and cache are stored. Expected to be unique.
    /// * `ops` --- The underlying [`RegistryData`] type.
    fn new(
        source_id: SourceId,
        gctx: &'gctx GlobalContext,
        name: &str,
        ops: Box<dyn RegistryData + 'gctx>,
    ) -> RegistrySource<'gctx> {
        // Before starting to work on the registry, make sure that
        // `<cargo_home>/registry` is marked as excluded from indexing and
        // backups. Older versions of Cargo didn't do this, so we do it here
        // regardless of whether `<cargo_home>` exists.
        //
        // This does not use `create_dir_all_excluded_from_backups_atomic` for
        // the same reason: we want to exclude it even if the directory already
        // exists.
        //
        // IO errors in creating and marking it are ignored, e.g. in case we're on a
        // read-only filesystem.
        let registry_base = gctx.registry_base_path();
        let _ = registry_base.create_dir();
        cargo_util::paths::exclude_from_backups_and_indexing(&registry_base.into_path_unlocked());

        RegistrySource {
            name: name.into(),
            src_path: gctx.registry_source_path().join(name),
            gctx,
            source_id,
            index: index::RegistryIndex::new(source_id, ops.index_path(), gctx),
            ops,
            selected_precise_yanked: RefCell::new(HashSet::new()),
        }
    }

    /// Decode the [configuration](RegistryConfig) stored within the registry.
    ///
    /// This requires that the index has been at least checked out.
    pub async fn config(&self) -> CargoResult<Option<RegistryConfig>> {
        self.ops.config().await
    }

    /// Unpacks a downloaded package into a location where it's ready to be
    /// compiled.
    ///
    /// No action is taken if the source looks like it's already unpacked.
    ///
    /// # History of interruption detection with `.cargo-ok` file
    ///
    /// Cargo has always included a `.cargo-ok` file ([`PACKAGE_SOURCE_LOCK`])
    /// to detect if extraction was interrupted, but it was originally empty.
    ///
    /// In 1.34, Cargo was changed to create the `.cargo-ok` file before it
    /// started extraction to implement fine-grained locking. After it was
    /// finished extracting, it wrote two bytes to indicate it was complete.
    /// It would use the length check to detect if it was possibly interrupted.
    ///
    /// In 1.36, Cargo changed to not use fine-grained locking, and instead used
    /// a global lock. The use of `.cargo-ok` was no longer needed for locking
    /// purposes, but was kept to detect when extraction was interrupted.
    ///
    /// In 1.49, Cargo changed to not create the `.cargo-ok` file before it
    /// started extraction to deal with `.crate` files that inexplicably had
    /// a `.cargo-ok` file in them.
    ///
    /// In 1.64, Cargo changed to detect `.crate` files with `.cargo-ok` files
    /// in them in response to [CVE-2022-36113], which dealt with malicious
    /// `.crate` files making `.cargo-ok` a symlink causing cargo to write "ok"
    /// to any arbitrary file on the filesystem it has permission to.
    ///
    /// In 1.71, `.cargo-ok` changed to contain a JSON `{ v: 1 }` to indicate
    /// the version of it. A failure of parsing will result in a heavy-hammer
    /// approach that unpacks the `.crate` file again. This is in response to a
    /// security issue that the unpacking didn't respect umask on Unix systems.
    ///
    /// This is all a long-winded way of explaining the circumstances that might
    /// cause a directory to contain a `.cargo-ok` file that is empty or
    /// otherwise corrupted. It may have been extracted by a version of Rust
    /// before 1.34, in which case everything should be fine. However, an empty
    /// file created by versions 1.36 to 1.49 indicates that the extraction was
    /// interrupted and that we need to start again.
    ///
    /// Another possibility is that the filesystem is simply corrupted, in
    /// which case deleting the directory might be the safe thing to do. That
    /// is probably unlikely, though.
    ///
    /// To be safe, we delete the directory and start over again if an empty
    /// `.cargo-ok` file is found.
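    /**
The decision described above can be summarised in a small sketch.
`UnpackState` and `classify_cargo_ok` are hypothetical names used only here,
and the exact string comparison is a simplified stand-in for the JSON parsing
that the real function performs.

```rust
/// What to do with an unpack directory, judged by `.cargo-ok` (hypothetical).
#[derive(Debug, PartialEq)]
enum UnpackState {
    /// The marker holds the current version; the directory can be reused.
    Complete,
    /// No marker exists yet; the package still has to be unpacked.
    Missing,
    /// Empty, legacy `ok`, or unrecognized content: delete and unpack again.
    Stale,
}

/// Classifies the content of a `.cargo-ok` file, or `None` if it is absent.
fn classify_cargo_ok(content: Option<&str>) -> UnpackState {
    match content {
        None => UnpackState::Missing,
        // Simplification: compare against the serialized version-1 marker
        // instead of deserializing it.
        Some(s) if s.trim() == r#"{"v":1}"# => UnpackState::Complete,
        Some(_) => UnpackState::Stale,
    }
}
```

Treating everything other than the current marker as stale is the
"heavy-hammer" choice: it costs a re-extraction but never trusts a directory
whose state is unknown.
    */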
    ///
    /// [CVE-2022-36113]: https://blog.rust-lang.org/2022/09/14/cargo-cves.html#arbitrary-file-corruption-cve-2022-36113
    fn unpack_package(&self, pkg: PackageId, tarball: &File) -> CargoResult<PathBuf> {
        let package_dir = format!("{}-{}", pkg.name(), pkg.version());
        let dst = self.src_path.join(&package_dir);
        let path = dst.join(PACKAGE_SOURCE_LOCK);
        let path = self
            .gctx
            .assert_package_cache_locked(CacheLockMode::DownloadExclusive, &path);
        let unpack_dir = path.parent().unwrap();
        match fs::read_to_string(path) {
            Ok(ok) => match serde_json::from_str::<LockMetadata>(&ok) {
                Ok(lock_meta) if lock_meta.v == 1 => {
                    self.gctx
                        .deferred_global_last_use()?
                        .mark_registry_src_used(global_cache_tracker::RegistrySrc {
                            encoded_registry_name: self.name,
                            package_dir: package_dir.into(),
                            size: None,
                        });
                    return Ok(unpack_dir.to_path_buf());
                }
                _ => {
                    if ok == "ok" {
                        tracing::debug!("old `ok` content found, clearing cache");
                    } else {
                        tracing::warn!("unrecognized .cargo-ok content, clearing cache: {ok}");
                    }
                    // See comment of `unpack_package` about why removing all stuff.
                    paths::remove_dir_all(dst.as_path_unlocked())?;
                }
            },
            Err(e) if e.kind() == io::ErrorKind::NotFound => {}
            Err(e) => anyhow::bail!("unable to read .cargo-ok file at {path:?}: {e}"),
        }
        dst.create_dir()?;

        let bytes_written = unpack(self.gctx, tarball, unpack_dir, &|_| true)?;
        update_mtime_for_generated_files(unpack_dir);

        // Now that we've finished unpacking, create and write to the lock file to indicate that
        // unpacking was successful.
        let mut ok = OpenOptions::new()
            .create_new(true)
            .read(true)
            .write(true)
            .open(&path)
            .with_context(|| format!("failed to open `{}`", path.display()))?;

        let lock_meta = LockMetadata { v: 1 };
        write!(ok, "{}", serde_json::to_string(&lock_meta).unwrap())?;

        self.gctx
            .deferred_global_last_use()?
            .mark_registry_src_used(global_cache_tracker::RegistrySrc {
                encoded_registry_name: self.name,
                package_dir: package_dir.into(),
                size: Some(bytes_written),
            });

        Ok(unpack_dir.to_path_buf())
    }

    /// Unpacks the `.crate` tarball of the package in a given directory.
    ///
    /// Returns the path to the crate tarball directory,
    /// which is always `<unpack_dir>/<pkg>-<version>`.
    ///
    /// This relies on some assumptions:
    ///
    /// * The associated tarball already exists.
    /// * If this is a local registry,
    ///   the package cache lock must be externally synchronized.
    ///   Cargo does not take care of it being locked or not.
    pub fn unpack_package_in(
        &self,
        pkg: &PackageId,
        unpack_dir: &Path,
        include: &dyn Fn(&Path) -> bool,
    ) -> CargoResult<PathBuf> {
        let path = self.ops.cache_path().join(pkg.tarball_name());
        let path = self.ops.assert_index_locked(&path);
        let dst = unpack_dir.join(format!("{}-{}", pkg.name(), pkg.version()));
        let tarball =
            File::open(path).with_context(|| format!("failed to open {}", path.display()))?;
        unpack(self.gctx, &tarball, &dst, include)?;
        update_mtime_for_generated_files(&dst);
        Ok(dst)
    }

    /// Turns the downloaded `.crate` tarball file into a [`Package`].
    ///
    /// This unconditionally sets the checksum for the returned package, so it
    /// should only be called after doing an integrity check. That is to say,
    /// you need to call either [`RegistryData::download`] or
    /// [`RegistryData::finish_download`] before calling this method.
    async fn get_pkg(&self, package: PackageId, path: &File) -> CargoResult<Package> {
        let path = self
            .unpack_package(package, path)
            .with_context(|| format!("failed to unpack package `{}`", package))?;
        let src = PathSource::new(&path, self.source_id, self.gctx);
        src.load()?;
        let mut pkg = match src.download(package).await? {
            MaybePackage::Ready(pkg) => pkg,
            MaybePackage::Download { .. } => unreachable!(),
        };

        // After we've loaded the package, configure its summary's `checksum`
        // field with the checksum we know for this `PackageId`.
        let cksum = self
            .index
            .hash(package, &*self.ops)
            .now_or_never()
            .expect("a downloaded dep now pending!?")
            .expect("summary not found");
        pkg.manifest_mut()
            .summary_mut()
            .set_checksum(cksum.to_string());

        Ok(pkg)
    }
}

#[async_trait::async_trait(?Send)]
impl<'gctx> Source for RegistrySource<'gctx> {
    async fn query(
        &self,
        dep: &Dependency,
        kind: QueryKind,
        f: &mut dyn FnMut(IndexSummary),
    ) -> CargoResult<()> {
        let mut req = dep.version_req().clone();

        // Handle `cargo update --precise` here.
        if let Some((_, requested)) = self
            .source_id
            .precise_registry_version(dep.package_name().as_str())
            .filter(|(c, to)| {
                if to.is_prerelease() && self.gctx.cli_unstable().unstable_options {
                    req.matches_prerelease(c)
                } else {
                    req.matches(c)
                }
            })
        {
            req.precise_to(&requested);
        }

        let mut called = false;
        let callback = &mut |s| {
            called = true;
            f(s);
        };

        // If this is a locked dependency, then it came from a lock file and in
        // theory the registry is known to contain this version. If, however, we
        // come back with no summaries, then our registry may need to be
        // updated, so we fall back to performing a lazy update.
        if kind == QueryKind::Exact && req.is_locked() && !self.ops.is_updated() {
            debug!("attempting query without update");
            self.index
                .query_inner(dep.package_name(), &req, &*self.ops, &mut |is| {
                    match &is {
                        IndexSummary::Candidate(s) | IndexSummary::Yanked(s) if dep.matches(&s) => {
                            // We are looking for a package from a lock file so we do not care about yank
                            callback(is)
                        }
                        _ => {}
                    }
                })
                .await?;
            if called {
                return Ok(());
            } else {
                debug!("falling back to an update");
                self.invalidate_cache();
            }
        }

        let mut called = false;
732        let callback = &mut |s| {
733            called = true;
734            f(s);
735        };
736
737        let mut precise_yanked_in_use = false;
738        self.index
739            .query_inner(dep.package_name(), &req, &*self.ops, &mut |s| {
740                let matched = match kind {
741                    QueryKind::Exact | QueryKind::RejectedVersions => {
742                        let s = match &s {
743                            IndexSummary::Candidate(s)
744                            | IndexSummary::Yanked(s)
745                            | IndexSummary::Offline(s)
746                            | IndexSummary::Unsupported(s, _)
747                            | IndexSummary::Invalid(s) => s,
748                        };
749                        if req.is_precise() && self.gctx.cli_unstable().unstable_options {
750                            dep.matches_prerelease(&s)
751                        } else {
752                            dep.matches(&s)
753                        }
754                    }
755                    QueryKind::AlternativeNames => true,
756                    QueryKind::Normalized => true,
757                };
758                if !matched {
759                    return;
760                }
761                match s {
762                    s if kind == QueryKind::RejectedVersions => callback(s),
763                    s @ IndexSummary::Candidate(_) => callback(s),
764                    s @ IndexSummary::Yanked(_) => {
765                        // HACK: While the source knows nothing about yank policy,
766                        // we still detect `cargo update --precise <yanked>`
767                        // so we can warn about the user-visible selection.
768                        //
769                        // We should also consider moving this out of the source query.
770                        if req.is_precise() {
771                            precise_yanked_in_use = true;
772                        }
773                        callback(s);
774                    }
775                    IndexSummary::Unsupported(summary, v) => {
776                        tracing::debug!(
777                            "unsupported schema version {} ({} {})",
778                            v,
779                            summary.name(),
780                            summary.version()
781                        );
782                    }
783                    IndexSummary::Invalid(summary) => {
784                        tracing::debug!("invalid ({} {})", summary.name(), summary.version());
785                    }
786                    IndexSummary::Offline(summary) => {
787                        tracing::debug!("offline ({} {})", summary.name(), summary.version());
788                    }
789                }
790            })
791            .await?;
792        if precise_yanked_in_use {
793            let name = dep.package_name();
794            let version = req
795                .precise_version()
796                .expect("--precise <yanked-version> in use");
797            if self
798                .selected_precise_yanked
799                .borrow_mut()
800                .insert((name, version.clone()))
801            {
802                let mut shell = self.gctx.shell();
803                shell.print_report(
804                    &[Level::WARNING
805                        .secondary_title(format!(
806                            "selected package `{name}@{version}` was yanked by the author"
807                        ))
808                        .element(
809                            Level::HELP.message("if possible, try a compatible non-yanked version"),
810                        )],
811                    false,
812                )?;
813            }
814        }
815        if called {
816            return Ok(());
817        }
818        if kind == QueryKind::AlternativeNames || kind == QueryKind::Normalized {
819            // Attempt to handle misspellings by searching for names related to
820            // the original name. The resolver will later reject any candidates
821            // that have the wrong name, and with this it'll have enough
822            // information to offer "a similar crate exists" suggestions.
823            // For now we only try canonicalizing `-` to `_` and vice versa.
824            // More advanced fuzzy searching may be added in the future.
825            for name_permutation in [
826                dep.package_name().replace('-', "_"),
827                dep.package_name().replace('_', "-"),
828            ] {
829                let name_permutation = name_permutation.into();
830                if name_permutation == dep.package_name() {
831                    continue;
832                }
833                self.index
834                    .query_inner(name_permutation, &req, &*self.ops, &mut |s| f(s))
835                    .await?;
836            }
837        }
838        Ok(())
839    }
840
841    fn supports_checksums(&self) -> bool {
842        true
843    }
844
845    fn requires_precise(&self) -> bool {
846        false
847    }
848
849    fn source_id(&self) -> SourceId {
850        self.source_id
851    }
852
853    fn invalidate_cache(&self) {
854        self.index.clear_summaries_cache();
855        self.ops.invalidate_cache();
856    }
857
858    fn set_quiet(&mut self, quiet: bool) {
859        self.ops.set_quiet(quiet);
860    }
861
862    async fn download(&self, package: PackageId) -> CargoResult<MaybePackage> {
863        let hash = self.index.hash(package, &*self.ops).await?;
864        match self.ops.download(package, &hash).await? {
865            MaybeLock::Ready(file) => self.get_pkg(package, &file).await.map(MaybePackage::Ready),
866            MaybeLock::Download {
867                url,
868                descriptor,
869                authorization,
870            } => Ok(MaybePackage::Download {
871                url,
872                descriptor,
873                authorization,
874            }),
875        }
876    }
877
878    async fn finish_download(&self, package: PackageId, data: Vec<u8>) -> CargoResult<Package> {
879        let hash = self.index.hash(package, &*self.ops).await?;
880        let file = self.ops.finish_download(package, &hash, &data).await?;
881        self.get_pkg(package, &file).await
882    }
883
884    fn fingerprint(&self, pkg: &Package) -> CargoResult<String> {
885        Ok(pkg.package_id().version().to_string())
886    }
887
888    fn describe(&self) -> String {
889        self.source_id.display_index()
890    }
891}
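
// The `-`/`_` fallback at the end of `query` above can be sketched in
// isolation. This is a minimal illustration, not Cargo's implementation:
// `name_permutations` is a hypothetical helper that works on plain strings
// rather than on Cargo's interned package names.

```rust
/// Hypothetical helper: returns the alternative spellings of `name` obtained
/// by replacing every `-` with `_` and every `_` with `-`, skipping any
/// spelling that is identical to the original name.
fn name_permutations(name: &str) -> Vec<String> {
    let mut alternatives = Vec::new();
    for candidate in [name.replace('-', "_"), name.replace('_', "-")] {
        // A name without the replaced character is unchanged, so there is
        // nothing new to look up for it.
        if candidate.as_str() != name {
            alternatives.push(candidate);
        }
    }
    alternatives
}
```

// For example, `name_permutations("serde-json")` yields only `serde_json`,
// while a name that mixes both separators yields two alternatives.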
892
893/// Gets the maximum unpack size that Cargo permits
894/// for a compressed file of the given `size` in bytes.
895///
896/// Returns the larger of `size * max compression ratio`
897/// and a fixed max unpacked size.
898///
899/// In reality, the compression ratio usually falls in the range of 2:1 to 10:1.
900/// We choose 20:1 in the hope of covering almost all realistic cases.
901/// Any ratio higher than this is considered a zip bomb.
902///
903/// In the future we might want to introduce a configurable size.
904///
905/// Some of the real world data from common compression algorithms:
906///
907/// * <https://www.zlib.net/zlib_tech.html>
908/// * <https://cran.r-project.org/web/packages/brotli/vignettes/brotli-2015-09-22.pdf>
909/// * <https://blog.cloudflare.com/results-experimenting-brotli/>
910/// * <https://tukaani.org/lzma/benchmarks.html>
911fn max_unpack_size(gctx: &GlobalContext, size: u64) -> u64 {
912    const SIZE_VAR: &str = "__CARGO_TEST_MAX_UNPACK_SIZE";
913    const RATIO_VAR: &str = "__CARGO_TEST_MAX_UNPACK_RATIO";
914    const MAX_UNPACK_SIZE: u64 = 512 * 1024 * 1024; // 512 MiB
915    const MAX_COMPRESSION_RATIO: usize = 20; // 20:1
916
917    let max_unpack_size = if cfg!(debug_assertions) && gctx.get_env(SIZE_VAR).is_ok() {
918        // For integration tests only.
919        gctx.get_env(SIZE_VAR)
920            .unwrap()
921            .parse()
922            .expect("a max unpack size in bytes")
923    } else {
924        MAX_UNPACK_SIZE
925    };
926    let max_compression_ratio = if cfg!(debug_assertions) && gctx.get_env(RATIO_VAR).is_ok() {
927        // For integration tests only.
928        gctx.get_env(RATIO_VAR)
929            .unwrap()
930            .parse()
931            .expect("a max compression ratio")
932    } else {
933        MAX_COMPRESSION_RATIO
934    };
935
936    u64::max(max_unpack_size, size * max_compression_ratio as u64)
937}
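
// The arithmetic in `max_unpack_size` above can be sketched without a
// `GlobalContext`. This is an illustration under stated assumptions:
// `unpack_limit` is a hypothetical name, the test-only environment overrides
// are left out, and `saturating_mul` is used here (unlike the plain
// multiplication above) so that the sketch cannot overflow.

```rust
/// Fixed unpacked-size limit, copied from the constant above (512 MiB).
const MAX_UNPACK_SIZE: u64 = 512 * 1024 * 1024;
/// Maximum tolerated compression ratio, copied from the constant above (20:1).
const MAX_COMPRESSION_RATIO: u64 = 20;

/// Hypothetical stand-in for the limit computation: the larger of the fixed
/// limit and `compressed_size * MAX_COMPRESSION_RATIO`.
fn unpack_limit(compressed_size: u64) -> u64 {
    u64::max(
        MAX_UNPACK_SIZE,
        compressed_size.saturating_mul(MAX_COMPRESSION_RATIO),
    )
}
```

// With these constants, a 1 MiB tarball is still allowed to unpack up to
// 512 MiB, because 20 MiB is below the fixed limit, whereas a 100 MiB tarball
// is allowed up to 2000 MiB.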
938
939/// Set the current [`umask`] value for the given tarball. No-op on non-Unix
940/// platforms.
941///
942/// On Windows, tar only looks at user permissions and tries to set the "read
943/// only" attribute, so no-op as well.
944///
945/// [`umask`]: https://man7.org/linux/man-pages/man2/umask.2.html
946#[allow(unused_variables)]
947fn set_mask<R: Read>(tar: &mut Archive<R>) {
948    #[cfg(unix)]
949    tar.set_mask(crate::util::get_umask());
950}
951
952/// Unpack a tarball with zip bomb and overwrite protections.
953fn unpack(
954    gctx: &GlobalContext,
955    tarball: &File,
956    unpack_dir: &Path,
957    include: &dyn Fn(&Path) -> bool,
958) -> CargoResult<u64> {
959    let mut tar = {
960        let size_limit = max_unpack_size(gctx, tarball.metadata()?.len());
961        let gz = GzDecoder::new(tarball);
962        let gz = LimitErrorReader::new(gz, size_limit);
963        let mut tar = Archive::new(gz);
964        set_mask(&mut tar);
965        tar
966    };
967    let mut bytes_written = 0;
968    let prefix = unpack_dir.file_name().unwrap();
969    let parent = unpack_dir.parent().unwrap();
970    for entry in tar.entries()? {
971        let mut entry = entry.context("failed to iterate over archive")?;
972        let entry_path = entry
973            .path()
974            .context("failed to read entry path")?
975            .into_owned();
976
977        if let Ok(path) = entry_path.strip_prefix(prefix) {
978            if !include(path) {
979                continue;
980            }
981        } else {
982            // We're going to unpack this tarball into the global source
983            // directory, but we want to make sure that it doesn't accidentally
984            // (or maliciously) overwrite source code from other crates. Cargo
985            // itself should never generate a tarball that hits this error, and
986            // crates.io should also block uploads with these sorts of tarballs,
987            // but be extra sure by adding a check here as well.
988            anyhow::bail!(
989                "invalid tarball downloaded, contains \
990                     a file at {entry_path:?} which isn't under {prefix:?}",
991            )
992        }
993
994        // Prevent unpacking symlinks and other unexpected entry types
995        match entry.header().entry_type() {
996            EntryType::Regular | EntryType::Directory => {}
997            t => anyhow::bail!(
998                "invalid tarball downloaded, contains an entry at {entry_path:?} with invalid type {t:?}",
999            ),
1000        }
1001
1002        // Prevent unpacking the lockfile from the crate itself.
1003        if entry_path
1004            .file_name()
1005            .map_or(false, |p| p == PACKAGE_SOURCE_LOCK)
1006        {
1007            continue;
1008        }
1009        // Unpack the entry, adding extra context below if unpacking fails.
1010        bytes_written += entry.size();
1011        let mut result = entry.unpack_in(parent).map_err(anyhow::Error::from);
1012        if cfg!(windows) && restricted_names::is_windows_reserved_path(&entry_path) {
1013            result = result.with_context(|| {
1014                format!(
1015                    "`{}` appears to contain a reserved Windows path, \
1016                        it cannot be extracted on Windows",
1017                    entry_path.display()
1018                )
1019            });
1020        }
1021        result.with_context(|| format!("failed to unpack entry at `{}`", entry_path.display()))?;
1022    }
1023
1024    Ok(bytes_written)
1025}
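
// The overwrite protection in `unpack` above relies on `Path::strip_prefix`,
// which compares whole path components rather than raw string prefixes. A
// minimal sketch of that check follows. `path_under_prefix` is a hypothetical
// helper, and it returns a `String` error instead of an `anyhow` error to
// stay dependency-free.

```rust
use std::ffi::OsStr;
use std::path::Path;

/// Hypothetical helper: returns `entry_path` relative to the package
/// directory name `prefix`, or an error message if the entry is not located
/// under that directory.
fn path_under_prefix<'a>(prefix: &OsStr, entry_path: &'a Path) -> Result<&'a Path, String> {
    entry_path.strip_prefix(prefix).map_err(|_| {
        format!(
            "invalid tarball, contains a file at {:?} which isn't under {:?}",
            entry_path, prefix
        )
    })
}
```

// Because the comparison is component-wise, an entry such as
// `foo-1.0.0-evil/src/lib.rs` is rejected for the prefix `foo-1.0.0`, even
// though its string form starts with the prefix.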
1026
1027/// Workaround for rust-lang/cargo#16237.
1028///
1029/// Generated files should have the same deterministic mtime as other files.
1030/// However, since we forgot to set the mtime for those files when uploading, they
1031/// always have an older mtime (1973-11-29), which prevents zip from packing them (zip requires >1980).
1032///
1033/// This workaround updates the mtime after we unpack the tarball at the destination.
1034fn update_mtime_for_generated_files(pkg_root: &Path) {
1035    const GENERATED_FILES: &[&str] = &["Cargo.lock", "Cargo.toml", ".cargo_vcs_info.json"];
1036    // Hardcoded value to be removed once alexcrichton/tar-rs#420 is merged and released.
1037    // See also rust-lang/cargo#16237
1038    const DETERMINISTIC_TIMESTAMP: i64 = 1153704088;
1039
1040    for file in GENERATED_FILES {
1041        let path = pkg_root.join(file);
1042        let mtime = filetime::FileTime::from_unix_time(DETERMINISTIC_TIMESTAMP, 0);
1043        if let Err(e) = filetime::set_file_mtime(&path, mtime) {
1044            tracing::trace!("failed to set deterministic mtime for {path:?}: {e}");
1045        }
1046    }
1047}