Note that the hardware is not particularly recent (Q3 2017), but we tend to rent servers which are not exactly on the bleeding edge, so that was my platform.
Thanks for the interesting writeup. I wonder if it is because glibc has a less-optimized atan2f (for floats). The double version seems quite involved in glibc anyway:
Without benchmarking, I would expect atan2f to be around 20-30 cycles per element or less with either Intel's or Apple's scalar math library, and proportionally faster for their vector libs.
You're right, I should have specified -- it is glibc 2.32-48 . This the source specifying how glibc is built: https://github.com/NixOS/nixpkgs/blob/97c5d0cbe76901da0135b0... .
I've amended the article so that it says `glibc` rather than `libc`, and added a sidenote specifying the version.
I link to it statically as indicated in the gist, although I believe that shouldn't matter. Also see https://gist.github.com/bitonic/d0f5a0a44e37d4f0be03d34d47ac... .
Note that the hardware is not particularly recent (Q3 2017), but we tend to rent servers which are not exactly on the bleeding edge, so that was my platform.