Expand description
Propagates assignment destinations backwards in the CFG to eliminate redundant assignments.
§Motivation
MIR building can insert a lot of redundant copies, and Rust code in general often tends to move
values around a lot. The result is a lot of assignments of the form dest = {move} src;
in MIR.
MIR building for constants in particular tends to create additional locals that are only used
inside a single block to shuffle a value around unnecessarily.
LLVM by itself is not good enough at eliminating these redundant copies (eg. see https://github.com/rust-lang/rust/issues/32966), so this leaves some performance on the table that we can regain by implementing an optimization for removing these assign statements in rustc itself. When this optimization runs fast enough, it can also speed up the constant evaluation and code generation phases of rustc due to the reduced number of statements and locals.
§The Optimization
Conceptually, this optimization is “destination propagation”. It is similar to the Named Return
Value Optimization, or NRVO, known from the C++ world, except that it isn’t limited to return
values or the return place _0
. On a very high level, independent of the actual implementation
details, it does the following:
- Identify
dest = src;
statements with values fordest
andsrc
whose storage can soundly be merged. - Replace all mentions of
src
withdest
(“unifying” them and propagating the destination backwards). - Delete the
dest = src;
statement (by making it anop
).
Step 1) is by far the hardest, so it is explained in more detail below.
§Soundness
We have a pair of places p
and q
, whose memory we would like to merge. In order for this to
be sound, we need to check a number of conditions:
-
p
andq
must both be constant - it does not make much sense to talk about merging them if they do not consistently refer to the same place in memory. This is satisfied if they do not contain any indirection through a pointer or any indexing projections. -
p
andq
must have the same type. If we replace a local with a subtype or supertype, we may end up with a different vtable for that local. See thesubtyping-impacts-selection
tests for an example where that causes issues. -
We need to make sure that the goal of “merging the memory” is actually structurally possible in MIR. For example, even if all the other conditions are satisfied, there is no way to “merge”
_5.foo
and_6.bar
. For now, we ensure this by requiring that bothp
andq
are locals with no further projections. Future iterations of this pass should improve on this. -
Finally, we want
p
andq
to use the same memory - however, we still need to make sure that each of them has enough “ownership” of that memory to continue “doing its job.” More precisely, what we will check is that whenever the program performs a write top
, then it does not currently care about what the value inq
is (and vice versa). We formalize the notion of “does not care what the value inq
is” by checking the liveness ofq
.Because of the difficulty of computing liveness of places that have their address taken, we do not even attempt to do it. Any places that are in a local that has its address taken is excluded from the optimization.
The first two conditions are simple structural requirements on the Assign
statements that can
be trivially checked. The third requirement however is more difficult and costly to check.
§Future Improvements
There are a number of ways in which this pass could be improved in the future:
-
Merging storage liveness ranges instead of removing storage statements completely. This may improve stack usage.
-
Allow merging locals into places with projections, eg
_5
into_6.foo
. -
Liveness analysis with more precision than whole locals at a time. The smaller benefit of this is that it would allow us to dest prop at “sub-local” levels in some cases. The bigger benefit of this is that such liveness analysis can report more accurate results about whole locals at a time. For example, consider:
ⓘ_1 = u; // unrelated code _1.f1 = v; _2 = _1.f1;
Because the current analysis only thinks in terms of locals, it does not have enough information to report that
_1
is dead in the “unrelated code” section. -
Liveness analysis enabled by alias analysis. This would allow us to not just bail on locals that ever have their address taken. Of course that requires actually having alias analysis (and a model to build it on), so this might be a bit of a ways off.
-
Various perf improvements. There are a bunch of comments in here marked
PERF
with ideas for how to do things more efficiently. However, the complexity of the pass as a whole should be kept in mind.
§Previous Work
A previous attempt at implementing an optimization like this turned out to be a significant regression in compiler performance. Fixing the regressions introduced a lot of undesirable complexity to the implementation.
A subsequent approach tried to avoid the costly computation by limiting itself to
acyclic CFGs, but still turned out to be far too costly to run due to suboptimal performance
within individual basic blocks, requiring a walk across the entire block for every assignment
found within the block. For the tuple-stress
benchmark, which has 458745 statements in a
single block, this proved to be far too costly.
Another approach after that was much closer to correct, but had some soundness
issues - it was failing to consider stores outside live ranges, and failed to uphold some of the
requirements that MIR has for non-overlapping places within statements. However, it also had
performance issues caused by O(l² * s)
runtime, where l
is the number of locals and s
is
the number of statements and terminators.
Since the first attempt at this, the compiler has improved dramatically, and new analysis frameworks have been added that should make this approach viable without requiring a limited approach that only works for some classes of CFGs:
- rustc now has a powerful dataflow analysis framework that can handle forwards and backwards analyses efficiently.
- Layout optimizations for coroutines have been added to improve code generation for async/await, which are very similar in spirit to what this optimization does.
Also, rustc now has a simple NRVO pass (see nrvo.rs
), which handles a subset of the cases that
this destination propagation pass handles, proving that similar optimizations can be performed
on MIR.
§Pre/Post Optimization
It is recommended to run SimplifyCfg
and then SimplifyLocals
some time after this pass, as
it replaces the eliminated assign statements with nop
s and leaves unused locals behind.
Structs§
- Merger 🔒
- Describes where a statement/terminator writes to
Enums§
Functions§
- Some locals are part of the function’s interface and can not be removed.
- If the pair of places is being considered for merging, returns the candidate which would be merged in order to accomplish this.