> It is tempting to build a math library around SIMD hoping to get some performance gains. However, it often has no proven benefit ... For example, game play programmers often do a lot of piecemeal vector math. They are not chopping 8 carrots at once
Her point is well taken; however, we beat the odds on the PlayStation 3: I don't trust my memory to give a frame-time percentage, but switching our "one carrot at a time" libraries from scalar to AltiVec made a measurable impact for not a lot of work.
We originally ported it all to SSE2 so that we'd hit GPFs on misaligned accesses when testing on PC, but whenever I compare it with the scalar version it's marginally faster too, so it has held up over time.
Conversely, we've recently found on the Nintendo Switch that NEON isn't a clear win; I suspect that, in addition to the shuffling overhead, you don't quite get "4 for the price of 1" like you seem to elsewhere, i.e. if you're doing 3D vectors or matrices padded into 4-float registers, the unused calculations in the fourth component have a cost.
So she's right -- chop 8 carrots at once if you can -- but sometimes (though not always) you can chop just 1 carrot faster with SIMD.
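To make the padded-vector point concrete, here is a minimal sketch (not the library described above) of a 3D dot product done both ways. `Vec3`, `dot_scalar`, and `dot_simd` are hypothetical names; the SSE2 path is assumed only where the compiler defines `__SSE2__` or `_M_X64`, and everything else falls back to the scalar version.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE/SSE2 intrinsics
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

// Hypothetical 3D vector padded to four floats so it fills one 128-bit register.
struct alignas(16) Vec3 {
    float x, y, z, pad;  // pad rides along in every SIMD op but is never used
};

// Scalar version: three multiplies, two adds, no alignment requirement.
inline float dot_scalar(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// SIMD version: one multiply covers all four lanes (including pad),
// then two shuffles and two adds collapse x, y, z into a single float.
inline float dot_simd(const Vec3& a, const Vec3& b) {
#if HAVE_SSE2
    // _mm_load_ps expects 16-byte-aligned addresses and may fault otherwise;
    // the asserts make that requirement explicit in debug builds.
    assert(reinterpret_cast<std::uintptr_t>(&a) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(&b) % 16 == 0);
    __m128 va = _mm_load_ps(reinterpret_cast<const float*>(&a));
    __m128 vb = _mm_load_ps(reinterpret_cast<const float*>(&b));
    __m128 m  = _mm_mul_ps(va, vb);                            // x, y, z, pad products
    __m128 my = _mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 1, 1, 1)); // broadcast y product
    __m128 mz = _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 2, 2, 2)); // broadcast z product
    return _mm_cvtss_f32(_mm_add_ss(_mm_add_ss(m, my), mz));   // (x + y) + z in lane 0
#else
    return dot_scalar(a, b);  // no SSE2 available: scalar fallback
#endif
}
```

The SIMD path multiplies four lanes to use three, then needs two shuffles to sum them, which is the kind of overhead that can eat the gain when you only have one vector to process.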
Not sure I fully understood all that. Still a wonderful read. Angry Birds 29 will be even crazier! (If they even use Box2D anymore ... and if micro transactions and loot hadn't ruined the series.)