I very much don't understand why Intel didn't disable AVX-512 on the P cores, unless and until the OS writes to a new MSR that means "I understand that the P cores can do AVX-512, while the E cores cannot", and then enable AVX-512 on the P cores.
Old OS versions continue to work fine, and newer OS versions can opt into the new world and benefit.
My guess would be that this is because of thread migration.
(After reading the TFA: that's what Agner says right there in the 4th paragraph.)
(After re-reading the comment: I guess that the OS changes would need to be extensive with little to no benefit: running AVX2 on all cores will likely be faster than running 2 P cores with AVX512. The only thing that is really affected is the code that could use AVX512_FP16, but I doubt there's a lot of it outside of Intel.)
> I guess that the OS changes would need to be extensive
I don't think that is true. In the simplest case, you could modify the #UD handler to notice when the fault is caused by an AVX512 instruction running on an E-core, then simply pin the process to the P-cores, migrate it, and continue. All existing scheduler functionality.
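Not the kernel change itself, but a user-space analogue of the same mechanism, just to show how little machinery is involved. This is a sketch: it assumes CPUs 0-7 are the P-cores (that numbering is made up) and it relies on Linux re-executing the faulting instruction when the SIGILL handler returns, which the in-kernel #UD path would get for free.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <string.h>

/* Assumption for the sketch: CPUs 0-7 are the P-cores on this machine.
 * A real implementation would read that from the CPU topology. */
#define FIRST_PCORE 0
#define NUM_PCORES  8

static void on_sigill(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)si; (void)ctx;
    /* Pin the faulting thread to the P-cores, then return; the
     * AVX-512 instruction that trapped gets re-executed there. */
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int c = FIRST_PCORE; c < FIRST_PCORE + NUM_PCORES; c++)
        CPU_SET(c, &mask);
    sched_setaffinity(0, sizeof(mask), &mask);
}

int main(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigill;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGILL, &sa, NULL);
    /* ... run code that may use AVX-512; a trap on an E-core migrates us ... */
    return 0;
}
```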
> The only thing that is really affected is the code that could use AVX512_FP16, but I doubt there's a lot of it outside of Intel.
AVX512 is a lot more than just extending the vector width, and that extended functionality can be very useful for quickly emulating other CPU's vector instruction sets.
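As one concrete example of "more than width": with AVX512_VBMI, a full 64-byte table lookup is a single instruction, which is roughly what Arm NEON's TBL gives you, and which AVX2 has to emulate with a multi-instruction vpshufb-plus-blend dance because vpshufb only shuffles within 16-byte lanes. A minimal sketch; out-of-range index semantics differ from TBL, so a faithful emulator would still need a mask fix-up:

```c
#include <immintrin.h>

/* Requires AVX512_VBMI (compile with -mavx512vbmi): one vpermb does a
 * byte-granularity lookup across the whole 64-byte table. */
__m512i tbl64(__m512i table, __m512i indices) {
    return _mm512_permutexvar_epi8(indices, table);
}
```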
I understand even less why Intel has been creating numerous AVX extensions (1, 2, and 512) instead of focusing on a single programmable variable-vector-length extension that would not require adding new opcodes every time the maximum supported vector length increased.
It has been clear for a long while that the vector length was going to gradually increase every few years to support new data processing requirements, so the writing has been on the wall all along.
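For contrast, this is roughly what that looks like where it already exists (Arm SVE): the loop asks the hardware how wide it is at runtime, so the same binary runs unchanged whether the vector unit is 128 or 2048 bits. A minimal sketch using the SVE ACLE intrinsics; the function and variable names are mine:

```c
#include <arm_sve.h>
#include <stddef.h>

/* Vector-length-agnostic add: svcntw() reports how many 32-bit lanes
 * the hardware has, and the whilelt predicate masks off the ragged
 * tail, so there is no separate scalar cleanup loop. */
void vla_add(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32((uint64_t)i, (uint64_t)n);
        svfloat32_t va = svld1_f32(pg, &a[i]);
        svfloat32_t vb = svld1_f32(pg, &b[i]);
        svst1_f32(pg, &c[i], svadd_f32_m(pg, va, vb));
    }
}
```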
Variable-width is the new hotness but it doesn’t work well on all tasks. If your task is fixed-width, or the algorithm changes based on task width, you kind of need to know the hardware width. And shuffling, permuting, swizzling, and bit-shifting don’t translate very well to these approaches. Nor is there really an easy way to operate these in a fixed-width mode.
It’s not all bad but the people who just say “just treat it like an n-element array and let the hardware handle it bro!” are hand-waving away an enormous number of algorithms that may not actually be translatable to that model.
And broadly speaking, having a couple of code paths and choosing at runtime is not that bad. HPC is used to working “close to the metal” and tuning code to run optimally. You can’t make that complexity go away - a given code path will always run better on X hardware while Y hardware runs that same path way worse. In this domain all that clever auto-programming magic does is obscure the problem - you are now writing for the optimizer instead of writing for the hardware, but different hardware still runs differently and you still need multiple code paths to hit it optimally. So now you have two problems - keeping the hardware running optimally and keeping the optimizer from messing you up.
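Concretely, the dispatch side of this is small. A minimal sketch: the three kernels are placeholders you’d write per target, and __builtin_cpu_supports is GCC/Clang-specific (MSVC would need a manual CPUID check):

```c
#include <stddef.h>

/* Hand-tuned kernels, one per target, written and maintained separately. */
void sum_avx512(const float *a, const float *b, float *c, size_t n);
void sum_avx2  (const float *a, const float *b, float *c, size_t n);
void sum_scalar(const float *a, const float *b, float *c, size_t n);

typedef void (*sum_fn)(const float *, const float *, float *, size_t);

/* Pick the widest path the CPU actually has, once, at startup. */
sum_fn select_sum(void) {
    if (__builtin_cpu_supports("avx512f")) return sum_avx512;
    if (__builtin_cpu_supports("avx2"))    return sum_avx2;
    return sum_scalar;
}
```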
Since you are now dispatching large operations that will not complete atomically in one cycle (they simply won’t - you can’t do 2M elements on a 128-bit vector), it doesn’t seem likely that execution stays strictly in-order, and the next step is extracting parallelism from the instruction stream, which is exactly where the problems start. Now you have an optimizer running a SIMT program (opmasks are basically SIMT) extracted from the instruction stream, and you need to not just write code that runs fast - you can’t - you need to write code that the optimizer turns into code that runs fast, and that seems like absolute hell on an architecture where you can’t even know the number of registers per thread in advance. Maybe I am reading this all wrong, but that’s where you’d really end up going with a “run this op on an N-element array” model. If you run one vector instruction at a time you are in cache/register hell, constantly spilling to memory and back, so you want to optimize that to keep things in-register/in-cache, and that means extracting parallelism from the stream to run more stuff while it’s still somewhere reasonably hot.
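On the “opmasks are basically SIMT” point: a masked AVX-512 operation really is per-lane predication - every lane executes and a mask decides which lanes commit, the same model a GPU warp uses for divergent branches. A tiny sketch (AVX-512F, and it assumes exactly 16 floats just to keep it short):

```c
#include <immintrin.h>

/* Lanes where a[i] > 0 get a[i] + b[i]; the other lanes keep the old
 * value of c[i]. The __mmask16 plays the role of a SIMT execution mask. */
void masked_add16(const float *a, const float *b, float *c) {
    __m512 va = _mm512_loadu_ps(a);
    __m512 vb = _mm512_loadu_ps(b);
    __m512 vc = _mm512_loadu_ps(c);
    __mmask16 m = _mm512_cmp_ps_mask(va, _mm512_setzero_ps(), _CMP_GT_OQ);
    vc = _mm512_mask_add_ps(vc, m, va, vb);
    _mm512_storeu_ps(c, vc);
}
```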
This is the exact garden-path that GPU drivers went down and it was a huge mistake with OpenGL/DX9/DX11 that had to be walked back with the DX12/Vulkan APIs to get back closer to the metal, because the optimizer stage became completely inscrutable and managing it to keep it from doing something stupid became impossible.
There is a level of irreducible complexity in HPC and exposing the hardware is the best way to avoid it. Leaky abstractions make things worse.
Writing for a phone is not HPC but if you care about energy efficiency you still need to make sure you’re running reasonably optimally and not wasting cycles, especially with a super wide vector unit. That means you’re leaning heavily on the “sensible” code path having decent efficiency, which means you’re leaning on the hardware to behave sanely. So it’s the same thing.
Again, not saying it’s a completely bad idea, but you can already see the abstraction starting to leak with the various types of operations that don’t really work on a variable-width hardware concept, and that’s worrying. There is a lot more opportunity for this to go bad than I think people realize at first glance.
Hardly new. The VAX 6000 Series Vector Processor supported variable-length vectors with up to 64 vector elements:
> With the VAX, the vector register has a maximum length of 64 elements. Each element can contain up to 64 bits. The elements used can be enabled or disabled by setting bits in a Vector Mask Register (VMR). The programmer usually determines the range, or limits the number of elements used through a Vector Length Register (VLR) (see Figure 1–2). This range can vary, for example, from 0 to 64 elements. Of course, if the vector length = 0, no vector elements will be processed.
I am pretty sure the concept predates the VAX days and might go as far back as the CDC STAR-100 from 1974, and potentially also the Cray-2, but I have had trouble locating authoritative sources on the details of the ISA of the former.
Variable length vectors have been available much earlier than any VAX.
The most well-known vector computer is the Cray-1 from 1976, and it already had variable-length vector operations using a vector length register (and also 64-element vector registers).
All later vector computers, as well as the vector extension of the RISC-V ISA and the SVE extension of ARM, have been strongly inspired by the ISA of the Cray-1.
The CDC STAR-100 introduced a few influential ideas, but it had low performance because it did not have vector registers (the operands of its vector instructions were arrays stored in main memory).
They have, haven't they? And officially it was never supported. The article even mentions it: some BIOSes had the ability to override it, but it sounds like that's been disabled in the microcode?