I suspect, but haven't properly measured, that pointer tagging upsets speculativ...

celeritascelery · on Sept 11, 2024

Why would that impact speculative loads/branch prediction? The pointers are untagged before they are accessed so it should not impact the loads.

JonChesterfield · on Sept 11, 2024

You want the address to be visible to the CPU somewhat early so that the target is (or at least might be) in the cache before you use it. I'd expect pointer tagging to obstruct that mechanism - in the worst case codegen might mask out the bits immediately before the memory operation. I don't know how transparent this sort of thing is to the core in practice and haven't found anyone else measuring it.
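
For concreteness, here is a minimal sketch of the pattern being discussed, assuming a 3-bit tag stored in the low bits of an 8-byte-aligned pointer. The names `TAG_MASK`, `tag`, `tag_of`, and `load_tagged` are hypothetical and not taken from any particular runtime.

```rust
/// Assumption: pointees are 8-byte aligned, so the low three bits are free.
const TAG_MASK: usize = 0b111;

/// Pack a small tag into the low bits of an aligned pointer.
fn tag(ptr: *const u64, t: usize) -> usize {
    debug_assert!(((ptr as usize) & TAG_MASK) == 0, "pointer must be 8-byte aligned");
    debug_assert!(t <= TAG_MASK, "tag must fit in the low three bits");
    (ptr as usize) | t
}

/// Recover the tag without touching memory.
fn tag_of(word: usize) -> usize {
    word & TAG_MASK
}

/// Mask the tag off, then load. The AND feeds the load directly, which is
/// the "mask immediately before the memory operation" case.
///
/// Safety: `word` must come from `tag` applied to a live, aligned `u64`.
unsafe fn load_tagged(word: usize) -> u64 {
    let ptr = (word & !TAG_MASK) as *const u64;
    unsafe { *ptr }
}
```

Whether that extra AND is visible to a prefetcher is exactly the open question; the sketch only shows where it sits in the dependency chain.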

chc4 · on Sept 11, 2024

That's not really how out-of-order execution in CPUs works. The address doesn't have to be fully computed X cycles before a load in order to be filled. Loads are filled as their dependencies are computed: requiring an additional operation to compute the address means your address is essentially 1 cycle delayed - but that's delay, not throughput, and only actually makes your code slower if your pipeline stalls.

dzaima · on Sept 11, 2024

Data memory-dependent prefetchers are a thing (with the expected side-channel potential), and tagging would conceivably make them non-functional. Though, realistically, I wouldn't expect it to make much difference.

Joker_vD · on Sept 11, 2024

I'm fairly certain that the lower bits are masked away on memory reads by pretty much everything that has an on-board cache anyhow, so they're fair game. Some ISAs even mandate this masking-away for larger-than-byte loads.

JonChesterfield · on Sept 11, 2024

My guesswork for x64 would be that all is good if dereferencing the tagged value would hit in the same cache line as dereferencing the untagged value. Though I could also be persuaded that x64 completely ignores the top 16 bits until the last moment (to check consistency with the 17th bit) in which case high tagging would be free. It seems relatively likely to be something that is different across different x64 implementations. But so far I'm just running with "it's probably fine, should benchmark later".
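
The two conditions above can be written down as a rough sketch, assuming 64-byte cache lines and 48-bit virtual addresses (4-level paging; 5-level paging would move the boundary). The function names are hypothetical, and this only expresses the checks. It says nothing about what a given core actually does.

```rust
/// Assumption: 64-byte cache lines, typical for current x64 parts.
const CACHE_LINE_BYTES: u64 = 64;

/// True if both addresses fall in the same cache line, e.g. a low-bit
/// tagged pointer and its untagged form.
fn same_cache_line(a: u64, b: u64) -> bool {
    a / CACHE_LINE_BYTES == b / CACHE_LINE_BYTES
}

/// True if `addr` is canonical for 48-bit virtual addresses, meaning
/// bits 63..=48 are all copies of bit 47 (the 17th bit from the top).
fn is_canonical_48(addr: u64) -> bool {
    // Sign-extend from bit 47 and compare with the original value.
    ((((addr << 16) as i64) >> 16) as u64) == addr
}
```

A 3-bit low tag always passes the first check for an 8-byte-aligned pointer, while a tag in the top 16 bits fails both unless it is stripped (or sign-extended back) before the access.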