> they should smoke BLAKE2 and probably even BLAKE3 in dedicated hardware
That's certainly possible, but there are subtleties to watch out for. To really take advantage of the tree structure in K12, you need a vectorized implementation of the Keccak permutation. For comparison, there are vectorized implementations of the AES block cipher, and these are very useful for optimizing AES-CTR. This ends up being one of the strengths of CTR mode compared to some other modes like CBC, which can't process multiple blocks in parallel, at least not for a single input.
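To make the CTR vs. CBC difference concrete, here's a toy sketch. Everything in it is made up for illustration: `prf` is a truncated SHA-256 standing in for a block cipher call (it is not AES, and the CBC-style chain isn't even decryptable), and the function names are hypothetical. The only point is the shape of the data dependencies.

```python
import hashlib

BLOCK = 16  # bytes per block, same size as an AES block

def prf(key: bytes, block: bytes) -> bytes:
    # Stand-in for one block cipher call (NOT AES): keyed hash cut to one block.
    return hashlib.sha256(key + block).digest()[:BLOCK]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def ctr_block(key: bytes, nonce: bytes, i: int, pt_block: bytes) -> bytes:
    # Output block i depends only on (key, nonce, i) and its own plaintext,
    # so any number of blocks can be computed side by side.
    counter = nonce + i.to_bytes(8, "big")
    return xor(pt_block, prf(key, counter))

def cbc_like_chain(key: bytes, iv: bytes, blocks: list) -> list:
    # Each output is fed into the next call, so call i cannot start
    # until call i-1 has finished.
    out, prev = [], iv
    for b in blocks:
        prev = prf(key, xor(b, prev))
        out.append(prev)
    return out

key, nonce = b"k" * 16, b"n" * 8
blocks = [bytes([i]) * BLOCK for i in range(4)]

# CTR: processing the blocks in reverse order gives the same result.
forward = [ctr_block(key, nonce, i, b) for i, b in enumerate(blocks)]
backward = [ctr_block(key, nonce, i, blocks[i]) for i in reversed(range(4))][::-1]
assert forward == backward
```

A vectorized block cipher can fill all of its lanes with `ctr_block` calls for a single message. With `cbc_like_chain` it would need several independent messages to keep the lanes busy.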
So one subtlety we have to think about is that the sponge construction inside SHA-3 looks more like CBC mode than like CTR mode: the blocks form a chain, with each call to the permutation depending on the output of the previous one. That means a vectorized implementation of Keccak can't benefit SHA-3 itself, again at least not for hashing a single input. So if this is going to be provided in hardware, it will have to be specifically with other constructions like K12 in mind. That could happen, but it might be harder to justify the cost in chip area. (At this point I'm out of my depth. I have no idea what Intel or ARM are planning. Or maybe vectorized hardware implementations of Keccak already exist and I'm just writing nonsense.)
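Here's a rough sketch of the same contrast for hashing. To be clear about the assumptions: this is not real SHA-3 internals and not real K12. I'm using `hashlib.sha3_256` as a stand-in for both the per-block step and the leaf hash, the 8192-byte chunk size is just an illustrative constant, the function names are hypothetical, and the thread pool is only playing the role of SIMD lanes.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 8192  # illustrative chunk size in bytes (an assumption of this sketch)

def split_chunks(data: bytes, chunk: int = CHUNK) -> list:
    return [data[i:i + chunk] for i in range(0, len(data), chunk)] or [b""]

def chained_hash(blocks: list) -> bytes:
    # Sponge-like shape: the state after block i is needed before block i+1
    # can be processed, so the loop is strictly sequential.
    state = bytes(32)
    for b in blocks:
        state = hashlib.sha3_256(state + b).digest()
    return state

def leaf_hashes_parallel(data: bytes) -> list:
    # Tree shape: every leaf depends only on its own chunk, so the leaves
    # can be hashed in any order or all at once.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda c: hashlib.sha3_256(c).digest(),
                             split_chunks(data)))

def tree_hash(data: bytes) -> bytes:
    # One final call combines the independent leaf hashes.
    return hashlib.sha3_256(b"".join(leaf_hashes_parallel(data))).digest()

data = bytes(range(256)) * 100  # 25600 bytes, i.e. 4 chunks
serial_leaves = [hashlib.sha3_256(c).digest() for c in split_chunks(data)]
assert leaf_hashes_parallel(data) == serial_leaves
```

The leaf loop is the part a vectorized permutation could speed up for a single input. `chained_hash` has nothing comparable to hand out to extra lanes.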