This is a nice idea, and one I also advocate for, but it's important to keep in mind that reproducibility relies on determinism. Much of what goes into a build pipeline is inherently nondeterministic, because decisions made at compile time can differ from one compilation run to the next, even with identical flags. In fact, that's the point of an optimizing compiler; as many reproducible-build projects have discovered, turning on optimizations pretty much guarantees the output won't be reproducible.
As long as the compiler is not optimizing along the lines of "give this 3 seconds of solving time, then continue if no better solution is found", optimizing is not inherently nondeterministic.
Counterpoint: Arch Linux is 89% reproducible with optimizations enabled. The only thing I see that is difficult to make reproducible is optimization with a timeout.
Instead of using a timeout, an optimization that must be cut off when its cost becomes excessive can keep some kind of operation or size count, where the count is strictly a function of the input. For example, an optimization based on binary decision diagrams (BDDs) can put a ceiling on the number of nodes in the BDD.
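Roughly, the difference looks like this in a Python sketch (the function names and the cost model are made up, purely to illustrate): a wall-clock cutoff depends on the machine, while a step budget depends only on the input.

```python
import time

def cost(candidate):
    # Placeholder cost model, just for the sketch.
    return len(str(candidate))

def optimize_with_timeout(candidates, seconds=3.0):
    # Nondeterministic: how far we get depends on wall-clock speed and load.
    deadline = time.monotonic() + seconds
    best = None
    for c in candidates:
        if time.monotonic() > deadline:
            break
        if best is None or cost(c) < cost(best):
            best = c
    return best

def optimize_with_budget(candidates, max_steps=10_000):
    # Deterministic: the cutoff is strictly a function of the input,
    # analogous to capping the number of nodes in a BDD.
    best = None
    for steps, c in enumerate(candidates, start=1):
        if steps > max_steps:
            break
        if best is None or cost(c) < cost(best):
            best = c
    return best
```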
This is defeatist: compilers do not usually use the system RNG to make decisions, so what's happening is an entirely accidental introduction of differences which then propagate.
There is "intentional input" (contents of the source files), and "accidental input" (source file full paths, timestamps, layout of memory given to you by the OS, and so on). A reproducible build system should give the same output for the same "intentional input".
(The only place where you do see RNG-driven optimization is things like FPGA routing, which is a mess of closed toolchains anyway. It has no place in regular software compilers.)
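As a hedged sketch of what "same output for the same intentional input" means in practice (this is not any particular build tool's API, just an illustration): package a source tree so the result depends only on relative paths and file contents, not on timestamps, ownership, or filesystem enumeration order.

```python
import os
import tarfile

def reproducible_tar(src_dir: str, out_path: str) -> None:
    paths = []
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            paths.append(os.path.join(root, name))
    paths.sort()  # fixed order, independent of how the filesystem enumerates entries

    # Plain tar on purpose: gzip would stamp its own creation time into the header.
    with tarfile.open(out_path, "w") as tar:
        for path in paths:
            info = tar.gettarinfo(path, arcname=os.path.relpath(path, src_dir))
            # Scrub the "accidental input": timestamps, ownership, umask-dependent mode.
            info.mtime = 0
            info.uid = info.gid = 0
            info.uname = info.gname = ""
            info.mode = 0o644
            with open(path, "rb") as f:
                tar.addfile(info, f)
```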
Well, a lot of things can have an influence here. A multithreaded build, PGO, or even a different iteration order of a hash table inside the code optimizer can be a factor. Things become probabilistic and thus somewhat nondeterministic: the build itself is nondeterministic, even though the runtime/final execution is deterministic.
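The hash-table point is easy to demonstrate even outside a compiler: Python randomizes string hashing per process, so set iteration order alone can differ between otherwise identical runs, and sorting restores determinism.

```python
symbols = {"init", "main", "parse", "emit", "link"}

# Nondeterministic: set iteration order can change from run to run
# (Python randomizes string hashing per process), so anything derived
# from it -- e.g. symbol numbering in an output file -- changes too.
unstable_order = list(symbols)

# Deterministic: impose an order that is a function of the input only.
stable_order = sorted(symbols)

print(unstable_order)
print(stable_order)
```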
"Reproducible" isn't necessary for "not modified from what everyone else gets", and that still makes some attacks FAR harder (and easier to identify, as you know what the "normal" one is). And a published Merkle tree just makes it easier to verify "none of this has changed", as opposed to SHAs on a website that could change any time.
For sure, which is one of the big benefits of git + git tagging, but the issue is that even if you know you received the same binary as someone else, without reproducible and auditable builds you have no idea whether that binary originated from the same code, in the case of a targeted attack.
> For sure, which is one of the big benefits of git + git tagging
That's not enough for serious security though, because git is (still) using SHA-1 instead of SHA-256. You would need something extra, like a signed commit.
There's also the much simpler pitfall of an attacker just creating a branch named the same as a commit hash, in the hope that people will accidentally check out the branch instead of the commit.
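A cheap defense, sketched below in Python (the repo path and hash handling are made up; the git commands themselves are standard), is to check out in detached-HEAD mode and then verify that HEAD really resolves to the expected full commit hash, so a lookalike branch can't silently win:

```python
import subprocess

def checkout_exact_commit(repo_dir: str, expected_sha: str) -> None:
    """expected_sha is the full 40-character commit hash we want."""
    # Check out in detached-HEAD mode at the given name...
    subprocess.run(
        ["git", "-C", repo_dir, "checkout", "--detach", expected_sha],
        check=True,
    )
    # ...then confirm HEAD resolves to exactly that commit, which catches
    # the case where a similarly named branch was resolved instead.
    head = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    if head != expected_sha:
        raise RuntimeError(f"checked out {head}, expected {expected_sha}")
```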