The important question to benchmark for small buffer sizes isn't actually cycles...

The important question to benchmark for small buffer sizes isn't actually cycles per byte, it's micro-ops per byte. You are sort of getting there by adding threads, but if you want to measure micro-ops per byte by measuring cycles per byte, your benchmarking code should run a number of parallel implementations of the generator on each thread (and stick to 1 thread per physical core, too). Each generator will have a "roof" where more generators per thread doesn't cost any more in terms of cycles/byte. I am assuming that for PCG, that roof is around 4-ish on an old CPU, and may be as high as 6 or 8 on your macbook, while the roof for ChaCha will be close to 1.

ChaCha exploits instruction-level parallelism to get speed. PCG doesn't - it has a chain of instructions that must be executed sequentially. That means that when the PCG generator executes, it leaves gaps in the instruction stream that can be filled with other instructions for the game. That means a slowdown in the game that is more significant than what your benchmark suggests.

I'm going to do this and write blog about it (although I don't have a macbook), so we may be able to compare results.