FizzBuzz has many properties that make it very suitable for these kinds of optimizations that might not be applicable to general purpose code:
+ extremely small working set (a few registers worth of state)
+ extremely predictable branching behavior
+ no I/O
These properties, however, don't diminish the achievement of leveraging AVX2 (or any vectorization) for a problem that doesn't immediately jump out as SIMD.
The problem as described is writing out bytes, which is probably the most expensive part of this. In fact, if you read the winning solution's description, I/O is the primary problem here.
> doesn't immediately jump out as SIMD.
IDK that I agree with this assessment. Very naively, I see no reason you wouldn't take the 512-bit SIMD registers and split them into sixteen 32-bit lanes. From there, it's a relatively simple matter of using 2 registers for the divisors, pulling out the results, and transforming them into the text to print. In other words, you'd be chunking this up into 16 iterations per loop (vs. 1 with the naive assembly).
This is the sort of thing that jumps out as easily vectorizable.
Now, the fastest answer very obviously does not take this approach, because I'm certain they realized the same thing: the difficult part here isn't the actual division, but pumping out the correct text at the highest speed possible. If you read through it, most of the code is dedicated to converting binary numbers into ASCII :D
Maybe I should be more clear; no data needs to be fetched from disk/network, and the "writes" don't need to go past memory.
As for the second point, you might have a different definition of "naive" and "relatively simple" than I do, as my brain has rotted too much from only thinking about SIMD for numerical computation. While determining divisibility may be relatively clear, it wasn't clear how the printing would be easily vectorizable, as the output per number is variable in length.
I think this nails it. Vectorizing the math problem is "easy" (send batches of numbers to cores, do the division) but then you have to re-order it for printing (not to mention actually print it), so paradoxically, it probably makes more sense for the program to be single-threaded.
Yes, the pattern repeats every 15 integers.
So you only need one division operation: getting the index modulo 15.
I was bored once at work and figured out a way to compute this without doing much arithmetic. It only requires 1 modulo operation.
    for (int i = 1; i < 100; i++)
    {
        // make a bit vector with a 1 in the (i mod 15)th bit.
        unsigned int i_as_bitvector = 1 << (i % 15);
        // Check it against the valid positions for FizzBuzz.
        // Octal 011111 is binary 1001001001001 (bits 0,3,6,9,12: multiples of 3 mod 15)
        // Octal 02041 is binary 10000100001 (bits 0,5,10: multiples of 5 mod 15)
        printf("%d ", i);
        if (i_as_bitvector & 011111) printf("Fizz");
        if (i_as_bitvector & 02041) printf("Buzz");
        printf("\n");
    }
I also have a version which has NO magic constants anywhere in the program, except 3 and 5. I'll post it if someone is interested.
Also, after printing 1 to 9, you have a single pattern of 30 numbers that repeats exactly 3 times from 10 to 99, then 30 times from 100 to 999, then 300 times from 1000 to 9999, and so on. That lets you hoist lots of code out of the loop and run it roughly once every 10^n numbers.
Why would you think in sets of ten, when there should actually just be one pattern in 15? Then it just becomes a challenge to arrange your code to work on these blocks of 15.
We could probably code 16 versions of the block of 15 code that repeat and are nicely aligned for SIMD.
In my mind a SIMD version would work by predefining a string buffer 15 elements long with Fizz in positions 0,3,6,... and Buzz in positions 0,5,10. These are constants that don't ever change. Then vectorized code would only write to positions that change, the numbers: 1,2,4,7,8,11,13,14. Most of the time these positions would have fixed width too (lengths of large numbers don't change often) so you can use constant write addresses. So 8 SIMD threads could handle a block, and write everything blockwise.
I was thinking blocks of 10 because I can itoa(the number of tens) and then concat with the subset of 0, 1, 2, 3, 4, etc. that I care about. I guess with blocks of 15 you just need to do two itoas, and worry about two types of block vs 3.
AVX-512 is not suitable for programs that take very little time to run. There is a penalty in the ms range to "warm up" the CPU (it's more of a cool-down, actually).
As OP stated, the limiting factor is memory access. That's why he kept saying 64 bytes every four cycles.
But OP likely didn't use it because most CPUs lack support for AVX-512.
The new Intel CPUs introduced many changes to the frontend. This will likely improve the speed.
It might also be possible to make the CPU operate at a higher clock speed.
The code looks a bit long. Not sure if the unrolling actually helps.
EDIT: Just looked at the Agner microarchitecture doc. Ice Lake and Tiger Lake can do 64 bytes/cycle.
In theory, it could run 4x faster (on bare metal, maybe).
I think you're missing the point. The issue is that with the advent of higher level languages, starting from Java, Javascript and on to Python and so on, most people have forgotten or have never learnt the skills to optimize code.
I'll argue that, as a percent, the number of people who can write proper multi-threaded code has only diminished over the years.
And we see the result: despite the massive increase in computing power, software in general has become slower and bloated.
The overwhelming majority of slow code isn't slow because it's failed to do this kind of microoptimization, it's slow because it used the wrong algorithm.
If you have "months to write fizzbuzz" levels of resources available, sure, you can microoptimize everything. Except in practice you can't, because the amount of effort needed for this kind of microoptimization is quadratic or worse in the complexity of the problem you're actually solving.
For a realistic-sized problem, if you write solutions in C and Python with the same amount of effort from equally skilled programmers, the Python version will almost certainly be faster, because they'll have had a lot more time available to spend prototyping and picking good algorithms rather than debugging undefined behaviour because they used the wrong combination of integer bit lengths.
> The overwhelming majority of slow code isn't slow because it's failed to do this kind of microoptimization, it's slow because it used the wrong algorithm.
I disagree. I believe the majority of code is slow because it's written without any consideration at all for performance. I like Casey Muratori's philosophy of non-pessimization, where true optimization (measuring, working on hot spots) is rare and rarely necessary, but massive speedups compared to the general state of the art are achievable by simply not writing code using patterns that are inherently slow. This isn't deep algorithmic stuff; it's just avoiding copies and/or pointer chasing.
> For a realistic-sized problem, if you write solutions in C and Python with the same amount of effort from equally skilled programmers, the Python version will almost certainly be faster
The Advent of Code runs every year and I'm not sure about C (not something I track) but there are plenty of Rust submissions and while the Rust submissions take longer to come in, it's not THAT much longer. From memory it's like 2x on a timescale of minutes. The Python programs are not faster.
Not disagreeing or anything, but an Advent of Code sized example will not be anywhere close in complexity to your typical CRUD application. The thing is, we possibly could not even write most of these applications in a low-level language; we should not forget that while Rust indeed has good abstractions, low-level languages simply leak memory-related information to the higher levels as well, making it necessary to deal with it at a high level too. And in these cases, time to market, speed of refactors, and maintainability are much more important than raw speed (which again, may not even be that significant at all; e.g., often most work is done on the db side).
> an Advent of Code sized example will not be anywhere close in complexity to your typical CRUD application
Advent of Code sized problems are the best-case scenario for Python. The difference in implementation time goes down as program size increases, because you spend a larger fraction of the time figuring out what you're implementing.
The reason it's not done is mainly social and not technical. Ecosystems matter and this is particularly the case in graphical desktop environments where massive amounts of time and effort are required to re-implement features matching users' expectations (e.g. Flutter).
If we're talking a server side http app server then the library requirements are significantly lower and the libraries are generally there in almost any language. To keep the language comparison the same, it's not significantly slower to implement a CRUD backend in Rust than it is in Python. Depending on your specific setup and how much pre-prep is involved it can be faster. I'm not aware of any framework in a dynamic language that can produce production grade endpoint handlers that are as terse as a Rocket app heavily leveraging request guards. The guards handle validation, conversion, session context and db connection based on the endpoint parameter types so you just write the happy path which is 1-2 lines of code for CRUD.
> speed of refactors and maintainability
Dynamic languages are significantly worse for maintainability and refactoring. I say this as someone who generally gets paid to be a frontend developer and spent years writing both Python and Clojure professionally. Despite my arguments, I do not write server side Rust for work because team coherence is more important for doing my actual job (solving a customer's problem) than programming language and I'm the only person in my company who writes Rust. I've been doing this long enough that I accept the worse solution as the cost of doing business. My personal projects don't have this constraint so perf is orders of magnitude better in exchange for a lot more knowledge and a bit more design work.
> Advent of Code size problems is the best case scenario for Python. The difference in implementation time goes down as program size increases because you spend a larger fraction of the time figuring out what you're implementing.
The difference in code size between a good and a bad language goes up as program size increases, because a large codebase makes it harder to keep it all in your head and forces you to add more layers and duplication.
>And in these cases, time to market, speed of refactors and maintainability is much more important than the row speed (which again, may not even be that significant at all. Eg, often most work is done on the db side)
You don't have to ditch Java or C# for your app and use C or Rust instead. But you can write Java and C# in a more performant way while not sacrificing time to market, speed of refactors and maintainability.
Also, if performance, throughput, and stability weren't a thing, we have to wonder why Twitter switched from RoR to Java. Couldn't they just have bought more servers instead?
> you can write Java and C# in a more performant way while not sacrificing time to market, speed of refactors and maintainability.
In practice you only have so much attention and there are better places to spend it.
> Also, if performance, throughput, and stability weren't a thing, we have to wonder why Twitter switched from RoR to Java?
If raw speed matters we have to wonder why Twitter started with RoR, and only switched (mostly to Scala, not Java, AIUI) once they had tens of millions of users.
In a tiny handful of outlier cases where a given piece of code is run literally billions of times a day it eventually becomes worth doing performance microoptimisations. But only after you've confirmed that there's absolutely massive demand for this particular code and already done your iterative refactors and figured out what the code should do.
> But you can write Java and C# in a more performant way while not sacrificing time to market, speed of refactors and maintainability.
Writing Java in a more performant way often means ditching OOP and using primitive types, primitive arrays everywhere instead, and also avoiding allocations up to the point where all objects are reused. Such code is just as slow to write and as error-prone as C, and way worse than C++ where at least you have RAII, generics, STL algorithms and zero-cost abstractions.
You only have to do that in hot loops (in most problems), that you can easily profile with the very good observability tools that java provides. The rest can be maintainable, good quality OOP code.
I would even go as far and say that OOP’s encapsulation pretty much exists just for this: the “ugly” implementation can reside inside a class and you only have to care about its public API from the outside.
> I believe the majority of code is slow because it's written without any consideration at all for performance. I like Casey Muratori's philosophy of non-pessimization, where true optimization (measuring, working on hot spots) is rare and rarely necessary, but massive speedups compared to the general state of the art are achievable by simply not writing code using patterns that are inherently slow. This isn't deep algorithmic stuff; it's just avoiding copies and/or pointer chasing.
I don't/can't watch videos; I would be interested to see a written argument, but generally my view is that copies and pointer chasing are irrelevant because code is so far away from optimal already. You're right that code is mainly written without any consideration for performance, but that manifests itself first in poor choices of algorithm and/or data structure. I don't think there's anything "deep" about, say, using a hashmap rather than searching linearly through a list; indeed I'd consider it easier to understand than avoiding copying.
> The Advent of Code runs every year and I'm not sure about C (not something I track) but there are plenty of Rust submissions and while the Rust submissions take longer to come in, it's not THAT much longer. From memory it's like 2x on a timescale of minutes. The Python programs are not faster.
Advent of Code is still extremely tiny programs, and Rust is a much much better language than C for thinking in.
True. Most business apps I've seen seemed like people did serious research into how to write the worst code from a performance point of view. Gazillions of wrappers upon wrappers, tons of indirection, data being copied and recopied thousands of times, creating FactoryFactoryFactories just for the sake of it.
If people would pay more attention on how the data flows instead of Uncle Bob and GoF, they would not only end up with a cleaner software architecture but also with more performance.
> I believe the majority of code is slow because it's written without any consideration at all to performance.
Those can both be correct. I use std. data structures for many things even if some prefix tree might be better for the operations I perform on a set. I don't think about that in most cases, since computing power is cheaper than the additional work. If performance becomes a problem, I can optimize later.
Advent of Code isn't a realistic performance metric. Working software is for most software projects.
> For a realistic-sized problem, if you write solutions in C and Python with the same amount of effort from equally skilled programmers, the Python version will almost certainly be faster, because they'll have had a lot more time available to spend prototyping and picking good algorithms rather than debugging undefined behaviour because they used the wrong combination of integer bit lengths.
I think this is more nuanced than that. Competitive programming actually gives fairly good insight here, because skilled programmers are placed under time constraints and obviously care about performance.
What you see, usually, is that programs that don't need maximum performance are often written in Python, which allows people to quickly write code that needs less debugging. But double the allotted time and the C programmer finally has their code finished and free of bugs, and they're beating the pants off of your Python code right out of the gate. You try and come up with clever ways to avoid copies and drive down the Big-O, but the C programmer comes up with the same thing just a bit behind you and leads most of the time because they get a free 10x speedup just because of the language they're using.
It's not surprising that the highest levels of competitive programming are dominated by C (and to a greater extent, C++) programmers, with Python use extremely rare, almost always reserved for rare, complicated algorithms that happen to bring a significant big-O improvement that justifies the use of Python to write them.
The overwhelming majority of code isn't slow because it used the wrong algorithm, it's because the Product team can't demo or show off an increase in efficiency. It's because they aren't measured by their ability to reduce or control the growth of the AWS bill. And they probably wouldn't have the knowledge to ascertain a reasonable bill. So they push deadlines which pushes devs to take ugly but fast to write solutions using high level languages.
Also, performance improvements pay for themselves over months and years while new features usually pay for themselves faster (at least they do in forecasts).
And finally, touching working code to make a performance improvement necessarily has some risk and that risk might be worth more than a year of cost savings from the improvement.
>The overwhelming majority of slow code isn't slow because it's failed to do this kind of microoptimization, it's slow because it used the wrong algorithm.
Take a look at The Computer Language Benchmarks Game, where all implementations use the same algorithm. An optimized C implementation can be 300 times faster than Python.
That's exactly the kind of missing the point that I'm talking about: tiny programs where all implementations are using the same algorithm and people have spent months microoptimising and are utterly unreflective of realistic business problems.
There's a law (induced demand) that says that as resources become easier to consume, people consume more of them. Like when increasing the capacity of a road doesn't reduce the time spent in traffic. Wouldn't that apply to software? I feel like it's a bit unfair to call software "slower and bloated" when you wouldn't call a road "bloated" because it has a new lane. People can do more things now.
It's more like I start manufacturing three-lane-wide cars because the material I'm using is cheaper than cutting and engineering it down to normal size. That's how I see the modern-software analogy in this context.
It feels like the frustration comes because I see the analogy this way:
The roads have been given extra lanes, and the total possible throughput of the road has increased. As a result, car manufacturers have decided that they can use a cheaper manufacturing process which makes slower cars. But when anyone complains, they point to the additional lanes and say ‘but the number of cars that get from point A to point B has increased, even if each trip is slower.’
The end user isn’t getting a choice when software developers make slower code each year. They have to buy more expensive hardware to keep up with the slowdowns of software developers.
I get it. But what exactly do you want to happen here?
In terms of your analogy, do you want more expensive cars that fewer people can have?
Software developers could spend more time optimizing their code. But that means they'd be spending less time elsewhere: fixing bugs, adding features, developing other products.
It should be noted that many times features come in that people do not want.
“The new format bar” for Slack and “huddles” (perhaps a less antagonizing example) are things Slack has done seemingly to prove they're still working on the program.
Despite the features not being needed in the slightest.
I actually disagree, at least about huddles. Huddles are a direct response to the low-friction nature of Discord voice channels (and arguably Teamspeak/Mumble/etc before Discord, although nothing else married text and voice channels quite as well before Discord.) It's almost silly how small of a difference there is between starting a huddle and a channel-wide "call" - the only difference is that it doesn't start out by "ringing" for everyone in the channel. But that tiny difference completely shifts the feeling from "this is a formal communication" to "we're all just hanging out and you can join if you want but no pressure."
IMO huddles happened because the pandemic moved so many more people remote that the need for a water-cooler replacement became much more obvious.
That would imply that we do less or as much things as before with computers which I think is false. Maybe you don't agree with what everyone does, but that doesn't mean that it's worthless.
Sometimes bloaty design decisions transcend the question of programming language. Electron probably has (I didn't check) a faster JavaScript engine than Qt, but I'd expect that similarly sophisticated implementations of the same program in Qt or Electron show an advantage in performance to the former. But people use Electron because it's familiar and it's got a lot of active development, and more cynically, Google likes it that way because it means that people keep coding for Chrome (cf. Microsoft's VS ecosystem, Apple's walled garden, etc).
The problem with modern software is that vendors encourage new developers to learn using heavyweight toolkits because lock-in is profitable in the long run. One day Electron will work with Go, but it will still be a plague.
>FizzBuzz has many properties that make it very suitable for these kinds of optimizations that might not be applicable to general purpose code: + extremely small working set (a few registers worth of state) + extremely predictable branching behavior + no I/O
The working set depends on how you organize the data.
I/O can be done in larger chunks most of the time. You have to think about what data you need and when you need it.
As for branching, that most of the time depends on the programmer rather than on the problem being solved. Situations where you need very complicated branching to solve a problem are rare. And I've seen O(n log n) algorithms beat clever O(n) algorithms because they could be better optimized for the CPU.
> for a problem that doesn't immediately jump out as SIMD.
It's a single transformation applied to the elements of a large array. Parallelization is obvious, and even if the exact SIMDization is not obvious, one is certainly motivated to formulate it. And a scatter-like approach does spring to mind right away, I believe.
The parent comment asserted that FizzBuzz should never branch. This would require it to not have any backwards control flow, and thus require full unrolling of an infinite sequence. Hence the infeasibility.