I suspect, but haven't properly measured, that pointer tagging upsets speculative loads and branch prediction (when you're loading an address) to varying extents across different tagging schemes and different CPU implementations. I'd hope setting low bits is cheaper than setting high bits, but I really should write the microbenchmark to find out someday. Does anyone know of existing attempts to characterise that?
You want the address to be visible to the CPU somewhat early so that the target might already be in the cache before you use it. I'd expect pointer tagging to obstruct that mechanism: in the worst case, codegen might mask out the bits immediately before the memory operation. I don't know how transparent this sort of thing is to the core in practice and haven't found anyone else measuring it.
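To make the "mask immediately before the memory operation" worry concrete, here is a minimal sketch of low-bit tagging in C. The names (`tag_ptr`, `untag_ptr`, `load_masked`, `load_known_tag`) and the 3-bit tag width are assumptions for illustration, not taken from any particular runtime; it assumes objects are at least 8-byte aligned.

```c
#include <stdint.h>

/* Assumed scheme: objects are at least 8-byte aligned, so the low
   3 bits of a valid pointer are zero and can carry a tag. */
#define TAG_MASK ((uintptr_t)0x7)

/* Pack a small tag (0..7) into the low bits of an aligned pointer. */
static inline uintptr_t tag_ptr(void *p, unsigned tag) {
    return (uintptr_t)p | ((uintptr_t)tag & TAG_MASK);
}

/* Recover the tag and the original pointer. */
static inline unsigned get_tag(uintptr_t t) { return (unsigned)(t & TAG_MASK); }
static inline void *untag_ptr(uintptr_t t) { return (void *)(t & ~TAG_MASK); }

/* Worst case from the comment above: the AND sits directly on the
   load's address dependency chain. */
static inline int64_t load_masked(uintptr_t t) {
    return *(int64_t *)untag_ptr(t);
}

/* If the tag is known at the use site, subtracting it gives the same
   address; after inlining with a constant tag, a compiler can typically
   fold the subtraction into the load's displacement. */
static inline int64_t load_known_tag(uintptr_t t, unsigned tag) {
    return *(int64_t *)(t - tag);
}
```

Whether the second form is actually cheaper on a given core is exactly the thing that would need measuring; the sketch only shows that the masking instruction is not always unavoidable.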
That's not really how out-of-order execution in CPUs works. The address doesn't have to be fully computed X cycles before a load in order for the load to be filled. Loads are filled as their dependencies are computed: requiring an additional operation to compute the address means your address is essentially delayed by 1 cycle. But that's added latency, not lost throughput, and it only actually makes your code slower if your pipeline stalls.
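A rough way to see the latency-versus-throughput distinction is to compare a dependent pointer chase with a batch of independent loads. The sketch below is hypothetical (the `node` layout, `build_ring`, `chase`, and `sum_independent` are made up for illustration) and contains no timing harness; it only shows the two dependency shapes one would time.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node whose next pointer may carry a low-bit tag. */
typedef struct node { _Alignas(8) uintptr_t next; } node;

/* Link n nodes into a ring, storing each next pointer ORed with tag (0..7). */
static void build_ring(node *nodes, size_t n, unsigned tag) {
    for (size_t i = 0; i < n; i++)
        nodes[i].next = (uintptr_t)&nodes[(i + 1) % n] | tag;
}

/* Dependent chain: each hop's address comes from the previous load, so the
   AND adds to the latency of every hop and nothing can overlap it. */
static node *chase(node *start, size_t hops, uintptr_t mask) {
    node *p = start;
    while (hops--)
        p = (node *)(p->next & mask);
    return p;
}

/* Independent loads: no dereference feeds the next address, so an
   out-of-order core can overlap the masks with other work. */
static uintptr_t sum_independent(const uintptr_t *tagged, size_t n,
                                 uintptr_t mask) {
    uintptr_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += *(const uintptr_t *)(tagged[i] & mask);
    return s;
}
```

Timing `chase` with `mask = ~(uintptr_t)0` on an untagged ring against `mask = ~(uintptr_t)7` on a tagged one would, if the reasoning above holds, show roughly one extra cycle per hop, while `sum_independent` should show little or no difference. I haven't run that measurement.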
Data memory-dependent prefetchers are a thing (with the expected side-channel potential), and tagging could conceivably make them non-functional. Realistically, though, I wouldn't expect it to make much difference.
I'm fairly certain that the lower bits are masked away on memory reads by pretty much everything that has an on-board cache anyhow, so they're fair game. Some ISAs even mandate this masking-away for larger-than-byte loads.
My guesswork for x64 would be that all is good if dereferencing the tagged value would hit the same cache line as dereferencing the untagged value. Though I could also be persuaded that x64 completely ignores the top 16 bits until the last moment (to check consistency with the 17th bit), in which case high tagging would be free. It seems relatively likely to differ across x64 implementations. But so far I'm just running with "it's probably fine, should benchmark later".
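For reference, the two conditions in that guess can be written down as small C helpers. This is a sketch under stated assumptions: 64-byte cache lines and 48-bit virtual addresses (4-level paging), where the top 16 bits must all match bit 47, the "17th bit" counting from the top. The helper names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* assumption: 64-byte lines */

/* True if both addresses fall within the same cache line. */
static inline bool same_cache_line(uint64_t a, uint64_t b) {
    return (a / CACHE_LINE) == (b / CACHE_LINE);
}

/* Canonical form under the 48-bit assumption: bits 63..47 are
   either all zero or all one. */
static inline bool is_canonical48(uint64_t addr) {
    uint64_t top = addr >> 47;  /* the 17 bits 63..47 */
    return top == 0 || top == 0x1FFFF;
}

/* Remove a tag stored in bits 63..48 by copying bit 47 back into them. */
static inline uint64_t strip_high_tag(uint64_t tagged) {
    uint64_t low = tagged & UINT64_C(0x0000FFFFFFFFFFFF);
    return (low & (UINT64_C(1) << 47)) ? (low | UINT64_C(0xFFFF000000000000))
                                       : low;
}
```

Note that a tag of up to 3 low bits on an 8-byte-aligned pointer always satisfies `same_cache_line`, whereas a high-tagged value generally fails `is_canonical48` and has to go through something like `strip_high_tag` before it can be dereferenced, unless the hardware provides an address-masking feature.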