The problem is writing out bytes, which is probably one of the more expensive parts of this. In fact, if you read the winning solution description, IO is the primary problem here.
> doesn't immediately jump out as SIMD.
IDK that I agree with this assessment. Very naively, I see no reason you wouldn't take the 512-bit SIMD registers and split them into 16 32-bit lanes. From there, it's a relatively simple matter of using 2 registers for the divisors, pulling out the results, and transforming them into the text to print. In other words, you'd be chunking this up into 16 iterations per loop (vs 1 with the naive assembly).
This is the sort of thing that jumps out as easily vectorizable.
Now, the fastest answer very obviously does not take this approach, because I'm certain they realized the same thing: the difficult part here isn't the actual division, but pumping out the correct text at the highest speed possible. If you read through it, most of the code is dedicated to converting binary numbers into ASCII :D
Maybe I should be more clear; no data needs to be fetched from disk/network, and the "writes" don't need to go past memory.
As for the second point, you might have a different definition of "naive" and "relatively simple", as my brain has rotted too much from only thinking about SIMD for numerical computation. While determining divisibility would be relatively clear, it wasn't clear to me how the printing would be easily vectorizable, since the output per number is variable in length.
I think this nails it. Vectorizing the math problem is "easy" (send batches of numbers to cores, do the division) but then you have to re-order it for printing (not to mention actually print it), so paradoxically, it probably makes more sense for the program to be single-threaded.
Yes the patterns repeat every 15 integers.
So you only need to do one division operation per number: getting the index modulo 15.
I was bored once at work and figured out a way to compute this without doing much arithmetic. It only requires 1 modulo operation.
for (int i = 1; i < 100; i++)
{
    // Make a bit vector with a 1 in the (i mod 15)th bit.
    unsigned int i_as_bitvector = 1 << (i % 15);
    // Check it against the valid positions for FizzBuzz:
    // Octal 011111 is binary 1001001001001 (bits 0,3,6,9,12: multiples of 3)
    // Octal 02041  is binary 10000100001   (bits 0,5,10: multiples of 5)
    printf("%d ", i);
    if (i_as_bitvector & 011111) printf("Fizz");
    if (i_as_bitvector & 02041)  printf("Buzz");
    printf("\n");
}
I also have a version which has NO magic constants anywhere in the program, except 3 and 5. I'll post it if someone is interested.
Also, after printing 1 to 9, you have a single pattern of 30 numbers that repeats exactly 3 times from 10 to 99, then 30 times from 100 to 999, then 300 times from 1000 to 9999, and so on. That lets you extract lots of code from the loop and run it roughly once every 10^n numbers.
Why would you think in sets of ten, when there's really just one pattern of 15? Then it just becomes a challenge of arranging your code to work on these blocks of 15.
We could probably code 16 versions of the block of 15 code that repeat and are nicely aligned for SIMD.
In my mind, a SIMD version would work by predefining a string buffer 15 elements long, with Fizz in positions 0,3,6,... and Buzz in positions 0,5,10. These are constants that never change. Vectorized code would then only write to the positions that do change, the numbers: 1,2,4,7,8,11,13,14. Most of the time those positions have fixed width too (the lengths of large numbers don't change often), so you can use constant write addresses. So 8 SIMD lanes could handle a block, and write everything blockwise.
I was thinking blocks of 10 because I can itoa(the number of tens) and then concat with the subset of 0, 1, 2, 3, 4, etc. that I care about. I guess with blocks of 15 you just need to do two itoas, and worry about two types of block instead of 3.
AVX-512 is not suitable for programs that take a very short time to run. There is a penalty in the ms range to "warm up" the CPU (it's really more of a cool-down, actually).
As OP stated, the limiting factor is memory access. That's why he kept saying 64B every four cycles.
But OP likely didn't use it because most CPUs lack support for AVX-512.
The new Intel CPUs introduced many changes to the frontend. This will likely improve the speed.
It might also be possible to make the CPU operate at a higher clock speed.
Code looks a bit long. Not sure if the unrolling actually helps.
EDIT: Just looked at Agner Fog's microarchitecture doc. Ice Lake and Tiger Lake can do 64 bytes/cycle.
In theory, it can run 4x faster (on bare metal, maybe).