The problem is writing out bytes, which is probably one of the more expensive parts of this. In fact, if you read the winning solution description, IO is the primary problem here.
> doesn't immediately jump out as SIMD.
IDK that I agree with this assessment. Very naively, I see no reason you wouldn't take the 512-bit SIMD registers and split them into 16 32-bit lanes. From there, it's a relatively simple matter of using 2 registers for the divisors, pulling out the results, and transforming them into the text to print. In other words, you'd be chunking this up into 16 iterations per loop (vs 1 with the naive assembly).
This is the sort of thing that jumps out as easily vectorizable.
Now, the fastest answer very obviously does not take this approach, because I'm certain they realized the same thing: the difficult part here isn't the actual division, but pumping out the correct text at the highest speed possible. If you read through it, most of the code is dedicated to converting binary numbers into ASCII :D
Maybe I should be more clear; no data needs to be fetched from disk/network, and the "writes" don't need to go past memory.
As for the second point, you might have a different definition of "naive" and "relatively simple", as my brain has rotted too much from only thinking about SIMD for numerical computation. While determining divisibility would be relatively clear, it wasn't clear to me how the printing would be easily vectorizable, since the output per number is variable in length.
I think this nails it. Vectorizing the math problem is "easy" (send batches of numbers to cores, do the division) but then you have to re-order it for printing (not to mention actually print it), so paradoxically, it probably makes more sense for the program to be single-threaded.
Yes the patterns repeat every 15 integers.
So you only need to do one division operation per number: getting the index modulo 15.
I was bored once at work and figured out a way to compute this without doing much arithmetic. It only requires 1 modulo operation.
for (int i = 1; i < 100; i++)
{
    // Make a bit vector with a 1 in the (i mod 15)th bit.
    unsigned int i_as_bitvector = 1 << (i % 15);
    // Check it against the valid positions for FizzBuzz:
    // Octal 011111 is binary 1001001001001 (bits 0,3,6,9,12: multiples of 3)
    // Octal 02041  is binary 10000100001   (bits 0,5,10: multiples of 5)
    printf("%d ", i);
    if (i_as_bitvector & 011111) printf("Fizz");
    if (i_as_bitvector & 02041)  printf("Buzz");
    printf("\n");
}
I also have a version which has NO magic constants anywhere in the program, except 3 and 5. I'll post it if someone is interested.
Also, after printing 1 to 9, you have a single pattern of 30 numbers that repeats exactly 3 times from 10 to 99, then 30 times from 100 to 999, then 300 times from 1000 to 9999, and so on. That lets you extract lots of code from the loop and run it roughly once every 10^n numbers.
Why would you think in sets of ten, when there's really just one pattern of 15? Then it just becomes a challenge of arranging your code to work on these blocks of 15.
We could probably code 16 versions of the block of 15 code that repeat and are nicely aligned for SIMD.
In my mind, a SIMD version would work by predefining a string buffer 15 elements long, with Fizz in positions 0,3,6,... and Buzz in positions 0,5,10. These are constants that never change. Vectorized code would then only write to the positions that do change, the numbers: 1,2,4,7,8,11,13,14. Most of the time those positions have fixed width too (the lengths of large numbers don't change often), so you can use constant write addresses. So 8 SIMD lanes could handle a block, and write everything blockwise.
I was thinking blocks of 10 because I can itoa(the number of tens) and then concat with the subset of 0, 1, 2, 3, 4, etc. that I care about. I guess with blocks of 15 you just need to do two itoas, and worry about two types of block instead of 3.
AVX-512 is not suitable for programs that take a very short time to run. There is a penalty in the ms range to "warm up" the CPU (it's really more of a cool-down, actually).
As OP stated, the limiting factor is memory access. That's why he kept saying 64B every four cycles.
But OP likely didn't use it because most CPUs lack support for AVX-512.
The new Intel CPUs introduced many changes to the frontend. This will likely improve the speed.
It might also be possible to make the CPU operate at a higher clock speed.
Code looks a bit long. Not sure if the unrolling actually helps.
EDIT: Just looked at Agner Fog's microarchitecture doc. Ice Lake and Tiger Lake can do 64 bytes/cycle.
In theory, it can run 4x faster (on bare metal, maybe).