I can't help but think most commentators haven't actually read the article or the patent. This isn't about having an FPGA embedded into the CPU or near the CPU, it's about having a programmable FPGA-like execution unit that can be programmed to be, say, a 4-bit floating-point adder, or any other weird execution unit one might need.
Why is this important? Have a program that does a lot of integer multiplications? Let's program all of these programmable execution units to multiply integers on the fly, etc. Now your integer multiply throughput is higher, as per the current program's needs.
Have lots of weird old x86 instructions you are forced to support but no one actually uses? Don't waste transistors on them; just program an execution unit to execute that instruction on the fly, etc.
I think it's great, and that most people are missing the point.
> Have lots of weird old x86 instructions you are forced to support but no one actually uses? Don't waste transistors on them
That's been the role of microcode for like three decades now. Why does it matter if the instruction no one uses is implemented with FPGA gates or uops? No one uses them.
Theoretically, now you can create "microcode" in the CPU for your specific needs - e.g. scientists do a lot of calculation and would like a processor optimised for that. Now they can use the FPGA to do it. You want a CPU instruction that is optimised for something else? You can program the FPGA for that.
The FPGA can't be reprogrammed fast enough for the JIT approach to work, unless you're running computations that take many minutes or hours. I suppose that does apply to some workloads, but it would be tough to ask your JIT to solve the halting problem and guess whether a workload will last 10 seconds or take longer than that.
I would say that programs that run and exit immediately are a minority of what is produced with these languages.
A lot of web servers, services, and GUIs are produced in JIT languages and have lifespans of multiple minutes.
AFAIK FPGA reconfiguration time depends on the size of your edit and your hardware; the article says they expect to reprogram it on program load, so I don't think it will be that slow.
Why can't a JIT program the FPGA based upon previous runs through the same set of code? IIRC the JVM won't rest on its final optimizations until it runs a chunk of code hundreds (or thousands? I forget) of times.
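To make that concrete, here's a rough sketch of the counter-based approach a JIT could take; the threshold value and the fpga_program_bitstream() call are made-up placeholders, not any real API:

    /* Sketch only: threshold-based offload in the spirit of a JIT's
     * invocation counters.  fpga_program_bitstream() and HOT_THRESHOLD
     * are hypothetical placeholders, not a real API. */
    #include <stdbool.h>
    #include <stdint.h>

    #define HOT_THRESHOLD 10000   /* roughly the order of a JIT compile threshold */

    struct kernel {
        uint64_t invocations;
        bool offloaded;
        const void *bitstream;    /* pre-compiled FPGA image for this kernel */
    };

    static void fpga_program_bitstream(const void *bs) { (void)bs; /* stub */ }

    static void maybe_offload(struct kernel *k)
    {
        if (!k->offloaded && ++k->invocations >= HOT_THRESHOLD) {
            /* Pay the slow reconfiguration cost only once the code has
             * proven itself hot; later calls use the custom unit. */
            fpga_program_bitstream(k->bitstream);
            k->offloaded = true;
        }
    }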
I've been doing that for years. I have created custom microcoded CPUs in FPGAs for tasks where it would provide an advantage. One example I remember was a microcoded real-time image warping engine.
> Don't waste transistors on them; just program an execution unit to execute that instruction on the fly, etc.
Possible, but the "x86" part is already a big decoder in front of a murky processor underneath so this is already what the CPU does - if you removed the reference to an FPGA, rewriting old x86 instructions in terms of "new" ones is microcode.
Wonder how that'd work - in practical terms - with modern systems?
eg, you could be running (say) 3 or 4 primary applications at the same time. Which one gets to use the FPGA pieces, or are they re-written every time, on every context switch? ;)
Re-writing them on every context switch sounds extremely unlikely, so it'd be more some kind of resource locking thing instead. Which could mean that FPGA-using applications at least start out being fairly niche, as only one could run "per core" or something.
Maybe dedicated cores per application instead or something?
The work is already being put in with modern NUMA (non-uniform memory access) systems to pin apps to specific cores. This seems like it would overlap if this ended up being used in production.
How about having FPGA execution units in addition to the normal units, with the OS deciding how and by whom these new EUs are used, based on the most CPU-intensive apps currently running?
I think the point is that what you're describing existed for years. Any Xilinx Zynq chip or Altera SoC chip can do this already. Just because the data doesn't travel through the AXI/AMBA bus does not make this novel.
Of course it does because you get access to the CPU as well so you can hop from an instruction you built on the FPGA to another “silicon” instruction with the same registers and processor state. This is extremely clever and doesn’t involve shuffling code from the main processor over a slow bus, executing some stuff all on the fpga and shuffling it back.
This won't be a question of what the user wants to do with these parts. I bet it won't even be accessible to common programmers. Applications will simply be constantly racing against each other, reprogramming the field-programmable part of my CPU on every startup.
An alarm goes out: our company has fewer patents than company X! In fact, we have the smallest patent hoard of our competitor group. If they sue us we might not have enough patents to sue them back! We must have more patents! Everyone who gets a patent gets a bonus! ( Exit CEO, trailing exclamation marks. All the engineers file their pet idea as a patent, hoping management will be interested in building it).
(Some years later) Okay, some of those patents we filed are a bit silly. But at least we now have a huge, intimidating patent pile! No one will dare sue us now! Mua ha ha! But let's be a bit more careful what we give those patent bonuses for. (Meanwhile at company X: our company has fewer patents than company Y!...)
The above is a true story, happened to me. Well, apart from the moustache twirling. My name is on some not very practical patents. So I'm not very convinced by stories which read the tea leaves from patents as to what a company intends ( Or economists trying to infer innovation rate from patent filing rate). Another problem is that the patent office is slow. Unless the company is General Fusion, most probably the product will be out before the patent.
I almost got my name on a patent that way for sharing an idea in an internal forum. I refused (well, I asked politely) and they took my name off it. (There was someone else in the conversation as well.)
Similar story with a patent that a friend applied for at National Instruments some years ago for his work on LabView software. As far as I could tell from his description it was more of an implementation detail rather than a patentable product. NI was pushing for patents though, and he obliged. Ended up getting a nice little bonus for his young family!
Everything described in the article sounds exactly like some of the Virtex*-FX products from more than 10 years ago.
For instance, the Virtex4-FX had either one or two 450MHz PowerPC cores embedded in it, where you could implement 8 of your own additional instructions in the FPGA. This is effectively now a CPU where you can extend the instruction set, and design your own instructions specific to your application. For example, you might make special instructions using the onboard logic to accelerate video compression, or math operations; I know of one application that was designed to do a 4x4 matrix multiply per cycle.
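From the software side, such an extension usually surfaces as an intrinsic or inline assembly. A rough illustration of what calling a 4x4 matrix-multiply unit might look like - the intrinsic name and the build flag below are invented for this sketch, not the actual Virtex-FX interface:

    /* Illustration only: invoking a hypothetical custom "4x4 matrix
     * multiply" instruction added to the fabric.  The intrinsic name and
     * HAVE_CUSTOM_MATMUL_UNIT are made up for this sketch. */
    #include <stdint.h>

    void mat4_mul(const float a[16], const float b[16], float out[16])
    {
    #ifdef HAVE_CUSTOM_MATMUL_UNIT            /* hypothetical build flag */
        __custom_mat4_mul(a, b, out);         /* one fabric instruction  */
    #else
        /* Plain-C fallback that the custom unit would replace. */
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++) {
                float acc = 0.0f;
                for (int k = 0; k < 4; k++)
                    acc += a[i * 4 + k] * b[k * 4 + j];
                out[i * 4 + j] = acc;
            }
    #endif
    }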
For those curious, Xtensa is a similar embeddable architecture (known especially for its use in the ESP32 microcontroller) that allows broad latitude to the designer to customize its instruction set with custom acceleration. The integration is very good, the compiler recognizes the new intrinsics and the designer has control over how the instruction is pipelined into the main processor.
Unfortunately it's very proprietary, and as far as I know there isn't an at-home version you can play with on FPGAs. But this kind of thing does exist if you can afford it - you don't have to roll your own RTL.
I am very familiar with the new Zynq family, embedding ARM cores on the same die together with FPGA fabric. I didn't know that the PowerPC version allowed such a tight coupling as handing off an instruction to programmable logic, the current Zynq models are much more lightly coupled, using AXI buses to connect the ARM cores with the PL (and many other components on the same SoC).
What was the latency like to actually get data into your shiny new instruction e.g. do I get a 14 stage pipeline stall to actually use the instruction?
I hate to be that bucket of cold water, but there's multiple reasons FPGAs haven't been successful in package with CPUs. Firstly, the costs of embedding the FPGA - FPGAs are relatively large and power hungry (for what they can do), if you're sticking one on a CPU die, you're seriously talking about trading that against other extremely useful logic. You really need to make a judgement at purchase time whether you want that dark piece of silicon instead of CPU cores for day to day use.
Secondly, whilst they're reconfigurable, they're not reconfigurable in the time scales it takes to spawn a thread; it's more like the same scale of time it takes to compile a program (this is getting a little better over time). Which makes it a difficult system design problem to make sure your FPGA is programmed with the right image to run the software programme you want. If you're at that level of optimization, why not just design your system to use a PCI-E board? It'll give you more CPU, and way more FPGA compute, and both will be cheaper because you get a stock CPU and stock FPGA, not some super custom FPGA-CPU hybrid chip.
Thirdly, the programming model for FPGAs is fundamentally very different from CPUs: it's dataflow, and generally the FPGA is completely deterministic. We really don't have a good answer for writing FPGA logic to handle the sort of cache hierarchy and out-of-order execution that CPUs do. So you're not getting the same sort of advantage that you'd expect from that data locality. It's very difficult to write CPU/FPGA programs that run concurrently; almost all solutions today run in parallel - you package up your work, send it off to the FPGA and wait for it to finish.
Finally, as others have said - the tools are bad. That's relatively solvable.
For me, it boils down to this, if you have an application that you think would be good on the same package as a CPU, it's probably worth hardening it into ASIC (see: error correction, Apple's AI stuff). If you have an application that isn't, then a PCI-E card is probably a better bet - you get more FPGA, more CPU and you're not trading the two off.
ASICs only make sense if you have high volume. PCI-e takes a lot of resources/space. The sweet spot for FPGA-CPU hybrid chips are embedded devices that are latency sensitive. For example, time-of-flight sensors and specialty cameras.
I guess to overcome the reconfiguration latency others have mentioned, the use case would be systems that configure their custom instructions once on boot and then the software just sees a cpu as normal, just with these custom instructions. Ie not intended for reconfiguration on context switch.
I definitely agree that a PCI-E card is preferable. Hell even if you have it in CPU, you probably want it sat on the PCI-E bus anyways so it can P2P DMA with other hardware.
Also (not disagreeing but I'm curious), last time I checked FPGAs could pull off some level of partial reconfiguration in the millisecond and sub millisecond ranges. I may be a bit off on these times but I saw them in a research paper a few years back. What types of speed would be necessary for CPUs to actually be able to benefit from a small FPGA onboard (rather than on an expansion card) with all the context switching.
High end FPGAs are theoretically capable of millisecond fast partial reconfigurations but doing so requires making a lot of tradeoffs that just highlight the impedance mismatch between the generic nature of CPUs and the purgatory that is FPGAs. The more of the FPGA you want to reconfigure, the longer it takes (stop the world, depending on which parts its touches) and unless the reconfigured portion is limited to a standard bus, the reconfiguration won't work (or you have to reconfigure more of the design to deal with different interfaces, timings, etc. blowing up reconfiguration time and defeating the purpose). All of the bitstreams have to be compiled ahead of time as well.
Unless latency is so critical that the speed of light is the limiting factor, partial reconfiguration just replaces PCIe with a much harder-to-work-with AXI interconnect (or similar, but it always ends up being AXI...).
It's easier to provide "custom instructions" and only accelerate CPU bottlenecks if you don't have PCIe as a massive bottleneck. If you are using an accelerator behind a bus you always have to make sure there is enough work for the accelerator to justify a data transfer. GPUs are built around the idea of batching a lot of work and running it in parallel. You can make an FPGA work like that but you are throwing away the low latency benefits of FPGAs.
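A quick back-of-the-envelope shows the break-even point for a bus-attached accelerator; all the latency numbers below are assumptions picked for illustration, not measurements:

    /* Back-of-the-envelope break-even for a bus-attached accelerator.
     * All numbers are illustrative assumptions. */
    #include <stdio.h>

    int main(void)
    {
        double round_trip_ns   = 1000.0; /* assumed PCIe round-trip latency */
        double cpu_ns_per_item = 50.0;   /* assumed CPU cost per work item  */
        double acc_ns_per_item = 5.0;    /* assumed accelerator cost/item   */

        /* The accelerator only wins once the per-item savings amortize
         * the fixed transfer cost: n * (cpu - acc) > round_trip. */
        double breakeven = round_trip_ns / (cpu_ns_per_item - acc_ns_per_item);
        printf("need a batch of more than %.0f items to win\n", breakeven);
        return 0;
    }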
Even the best-case scenarios for integrating a FPGA onto the same die as CPU cores would still have the FPGA separate from the CPU cores. It's really not possible to make an open-ended high bandwidth low latency interface to a huge chunk of FPGA silicon part of the regular CPU core's tightly-optimized pipeline, without drastically slowing down that CPU. The sane way to use an FPGA is as a coprocessor, not grafted onto the processor core itself. Then, you're interacting with the FPGA through interfaces like memory-mapped IO whether it's on-die, on-package, or on an add-in card.
> It's really not possible to make an open-ended high bandwidth low latency interface to a huge chunk of FPGA silicon part of the regular CPU core's tightly-optimized pipeline
That's what's interesting about the article, because that's what the patent is about: "implementing as part of a processor pipeline a reprogrammable execution unit capable of executing specialized instructions".
Yeah, worth mentioning highly optimized FPGA designs run at up to 600MHz (or to put it another way, 400MHz lower than what Intel advertised 4 years ago). So at a minimum, you're going to clock cross, have a >10 cycle pipeline at CPU speeds (variable clock) and clock cross back.
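Rough arithmetic behind that estimate (the CPU clock and synchronizer depth below are assumptions):

    /* Rough arithmetic behind the ">10 cycle" estimate; clock speeds and
     * synchronizer depth are assumptions. */
    #include <stdio.h>

    int main(void)
    {
        double cpu_ghz = 4.0;             /* assumed CPU clock          */
        double fpga_mhz = 600.0;          /* optimistic fabric clock    */
        double cdc_cycles_each_way = 3.0; /* assumed synchronizer depth */

        double cpu_cycles_per_fabric_cycle = (cpu_ghz * 1000.0) / fpga_mhz;
        double total = 2.0 * cdc_cycles_each_way + cpu_cycles_per_fabric_cycle;
        printf("one fabric cycle ~ %.1f CPU cycles; with clock crossing ~ %.0f\n",
               cpu_cycles_per_fabric_cycle, total);
        return 0;
    }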
The downside of PCIe is PCIe is very complex. And the tools make interfacing with it bewildering. I really want a PCIe FPGA that looks to me like data magically appears on an AXI bus.
Everyone seems to be talking about accelerated instructions but how about I/O?
FPGAs are awesome at asynchronous I/O and low latency. We could implement network stacks, sound and video processing, etc... It can start a TLS handshake as soon as the electrical signal hits the ethernet port, while the CPU is not even aware of it happening. It can timestamp MIDI input down to the microsecond and replay with the same precision. It can process position data from a VR headset at the very last moment in the graphics pipeline. Maybe even do something like a software defined radio.
Basically any simple but latency-critical operation. Of course, embedded/realtime systems are a prime target.
That's pretty much what Xilinx's Zynq product lines are already targeting, including embedded. They're comparatively nice boards to work on, as long as you can swallow the BOM cost.
The network example is wasteful. There are already NICs that have FPGAs and support a whole set of Linux kernel features. You wouldn't want that to be that far away from the NIC itself.
PTP works just like that - it timestamps incoming and outgoing packets right after/before the packet hits the wire. There is eXpress Data Path (XDP), which can offload eBPF programs to NICs and deal with packets without them ever entering the kernel at all.
High Frequency Traders do exactly that IIRC today.
As for video processing codecs today are way too complex to be run there. Well, no one will stop you from running something like an integer DCT part on FPGA.
VR thing... Generally, aside from Nvidia companies don't want to ship entire FPGA to end customers (guess why Nvidia G-Sync monitors used to be so expensive). Something like Snapdragon XR2 "solves" VR. Also, in order to render a picture you need to know headset position early, not at the last moment. How would you know what to render?
How useful this is depends entirely on the FPGA's capability and size. I bet it will be more useful for things like implementing some hash function there or something like that.
IMO this will be a very niche product inside already niche market.
A killer tech for this would be a framework that automatically reprograms the FPGA and offloads the work if it makes sense.
For example - running k-means? Have your FPGA automatically (with minimal dev effort) flash to be a Nearest Neighbor accelerator.
The problem is finding a way to make that translation happen with minimal dev effort, as software is written rather differently from hardware.
Their web site is very sparse on what programming models the tool supports. Traditionally, the things you can easily accelerate automatically are algorithms you can write naturally in Fortran 77 (lots of arrays, no pointers), and that's one limit on the applicability of these automatic tools. (Other limits that other posters have pointed out are compilation+place+route runtime, and reconfiguration time.)
They are claiming you can use malloc and make "extensive" use of pointers in C programs and still have them automatically compiled for the FPGA. That's where details are needed and they are mostly missing.
I watched their 30 minute demo film. The speedups are impressive, and on the small example it's impressive that it does the partitioning automatically. However, the program contains only a single call to malloc, and all pointers are derived from that address, so it doesn't do much to convince us that the memory model and alias analysis give you more flexibility than the F77 model.
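For contrast, here's the difference in miniature - the first function is the F77-style code these tools handle well, the second is the kind of pointer code where the alias analysis has to earn its keep (both are purely illustrative):

    /* The first loop has a fixed trip count and provably non-aliasing
     * buffers, so it maps cleanly to a pipelined/unrolled datapath.  In
     * the second, the tool can't prove dst and src don't overlap or how
     * many iterations there are, which blocks those transformations. */
    #include <stddef.h>

    void scale_easy(float *restrict dst, const float *restrict src, float k)
    {
        for (int i = 0; i < 1024; i++)   /* fixed bound, no aliasing */
            dst[i] = k * src[i];
    }

    void scale_hard(float *dst, float *src, size_t n, float k)
    {
        for (size_t i = 0; i < n; i++)   /* unknown bound, dst/src may alias */
            dst[i] = k * src[i];
    }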
You might want to check the "Warp Processing" project out: http://www.cs.ucr.edu/~vahid/warp/. It is probably exactly what you are thinking about. Transparent analysis of the instruction stream at runtime and synthesis and offloading of hot spots to the FPGA.
Why is that surprising? A LOT of statements in a program can be executed in parallel. It's just not worth it to actually make threads for them, since the overhead of threads is larger than just executing the set of instructions sequentially. In fact, all modern processors track data dependencies and execute independent instructions in parallel where possible.
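A tiny example of the kind of independent work a superscalar core (or a wide custom unit) can overlap, versus a serial dependency chain:

    /* Both functions do four multiplications, but only the first gives
     * the hardware independent work to overlap; the second is a serial
     * dependency chain where each multiply waits on the previous one. */
    float independent(float a, float b, float c, float d)
    {
        float w = a * a;   /* these four products have no  */
        float x = b * b;   /* dependencies on each other,  */
        float y = c * c;   /* so they can issue in the     */
        float z = d * d;   /* same cycle(s)                */
        return w + x + y + z;
    }

    float dependent(float a, float b, float c, float d)
    {
        return ((a * b) * c) * d;   /* strictly serial chain */
    }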
I recall reading papers about doing this by profiling Java apps a decade or so ago, but I would have to dig pretty deep in my HN comment history to find them.
The approach seems conceptually similar to the optimizations available via the enterprise version of GraalVM.
For decades, the FPGA vendors have had this fever dream of "an FPGA in every PC" -- either as an add-on card, or as part of the chipset on a motherboard -- that would enable a compiler or operating system to seamlessly accelerate arbitrary tasks on demand.
In my opinion, the problem has always been their software: the FPGA vendor tools are slow, bloated monstrosities. The core of these tools are written by the big three EDA vendors (Cadence, Synopsys, and Mentor Graphics) rather than the FPGA vendors themselves. The licenses include ridiculous, paranoid restrictions [1] and force the FPGA vendors to keep their bitstream formats and timing databases secret [2] in order to prevent competition from other tool vendors. Most FPGA vendors didn't see this as a problem, but even the ones that did didn't have much of a choice, because the tool market is a cartel.
Thankfully, we now have an open source toolchain [3] with support for a growing number of FPGA architectures [4], and using it vs. the vendor tools is like using gcc or llvm vs. a '90s era, non-compliant C++ compiler. It even has a real IR that isn't Verilog, which has made it easier to design new HDLs [5].
I don't see how a dynamic FPGA accelerator platform can be even remotely viable without this. It's the difference between a developer getting to choose between one of a few dozen pre-baked designs that lock up the entire FPGA (and needing to learn how to shovel data into it), vs. a compiler flag that can give you the option of unrolling any loop directly into any inactive region of FPGA fabric.
It would be quite the cherry on top to see AMD build something interesting in this space. But unless they're willing to fully unencumber at least this one design, I think the effort is likely to fail. The open source guys are chomping at the bit to make this work, and have been making real progress lately. Meanwhile, the EDA vendors have been making promises, failing, and throwing tantrums for the last 20 years. It's time to write them off.
[2] Imagine trying to write an assembler without being allowed to see the manual that tells you how instructions are encoded. It's like that, but the state-space is hundreds to thousands of bytes in multiple configurations rather than a few dozen bits.
I would love to hack on FPGAs but always run into the issue of closed toolchains. The recent open source work is a breath of fresh air, but we need to see an FPGA vendor that embraces and sponsors this work.
I think/hope it's an unstable equilibrium -- if either Altera/Intel or Xilinx/AMD give a nod to the open source tools, the others will follow.
Lattice is seemingly at "wink wink, nudge nudge" levels of support -- their lawyers won't allow them to say anything because they're afraid of pissing off Synopsys, but they also know that they're currently the best supported platform, and don't seem interested in deliberately making things difficult.
On paper at least, it could be a good idea for a company in Lattice's position; at the very least, academics would probably switch.
I would like to see a FAANG try and support some open tools - it doesn't have to be anything legally sketchy like reverse engineering bitstreams - for example, Yosys only has limited SystemVerilog support
Firstly (for the uninitiated), Bluespec is both a Haskell DSL (Bluespec Classic) and a Verilog-like language (Bluespec SystemVerilog).
It compiles to Verilog, but the stack is much more integrated than other similar compile-to-verilog HDLs - the simulator is similar to verilator and much easier to get started with.
I'm kind of beginning to feel that Haskell isn't a good medium for HDL code - Verilog already encourages unreadable names like "mem_chk_sig_state" and Haskell code is almost unstructured to my eye (I like functional programming but it seems hard to keep it readable because of the style it imposes - the flow is there but the names are usually way too short for my taste)
I'm pretty sure Bluespec and SpinalHDL compile to Verilog. Chisel uses its own IR (FIRRTL). I think Migen used to target Verilog, but now targets (one of?) the IR(s) that Yosys supports (RTLIL?).
This looks a bit like the old (2000s) work of Leopard Logic or Tensilica. Exciting stuff.
One important note (based on some comments here): generally, these in-CPU FPGAs have very fast reconfiguration. Not sure if it's 1, 10 or 100 cycles but it's not milliseconds. Actually, (in past examples) configuration might take milliseconds but it would load a number of planes of configurations: plane 0 might be MP3 audio device; plane 1 might be MPEG2 video device. Then reconfiguration is: switch to plane 1.
This AMD proposal looks like it's much more tightly integrated into the CPU so it's got to be even faster. Combine that with the deep knowledge of processor internals you'll have to have to code for this thing and I'm having a hard time seeing you and me having much luck tinkering. This is probably 99.99% data center with gnarly NDAs and field support.
I think the right way isn't "learn a HDL", it's "learn digital electronics design". Hardware description languages enable succinct hardware description, but it's still necessary to keep an image of the actual hardware in mind.
You're going to need to commit a lot more time than that. HDLs and the surrounding concepts have key fundamental differences from software that a lot of developers have a hard time stomaching. That's why high-level synthesis is the FPGA industry's City of El Dorado; software developers would be able to create acceleration designs without having to build up a fairly large new skillset.
I've never understood this argument. The change in mindset is extremely small. It's merely a matter of awareness. High-level synthesis can work just fine if you don't go overboard with constructs that are hard to synthesize. There is no fundamental reason why a math equation in C should be harder to synthesize than the Verilog or VHDL equivalent.
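For instance, something like the following is exactly the kind of C that maps straight onto a datapath, and there's no obvious reason a tool should synthesize it any worse than the equivalent Verilog (the function itself is just an illustrative example):

    /* The sort of "math equation in C" that maps directly to hardware:
     * fixed-width integers, no pointers, no unbounded loops.  An HLS tool
     * can turn this into the same multiplier/adder chain a hand-written
     * RTL module would describe. */
    #include <stdint.h>

    static inline int32_t poly3(int32_t x, int32_t c0, int32_t c1,
                                int32_t c2, int32_t c3)
    {
        /* Horner form: three multiplies and three adds. */
        return ((c3 * x + c2) * x + c1) * x + c0;
    }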
I think HLS is oversold. It's not that hard for software guys to learn some digital logic and write an accelerator. One or two weeks and it shouldn't be a problem. Where the real problems lies is in the tooling. You can't learn that in one or two weeks. You first need to damage your brain to be able to handle it.
I'm assuming that if public knowledge of AMD's efforts are at the patent level, it will be a few years before there's much to work with, by which point you'd have a solid foundation from which to accelerate your learning.
While sibling comments mention that it is probably wiser to learn digital logic before an HDL (and I agree with them), I think it is important to also consider that there is now High-Level Synthesis (HLS), where programming languages similar to C (e.g., OpenCL) can compile to VHDL. HLS may lower the barrier for programmers to take advantage of FPGAs. However, whether the design can compile to fit the constraints of the available FPGA is another question to which I do not know the answer.
This approach is not new and has been toyed with since the 1960s (!); see G. Estrin's work on adaptive architectures, for example.
I got to know about this as part of the PRISM (Processor Reconfiguration Through Instruction Set Metamorphosis) work in the early '90s. There is a very cool paper by the same name. Check it out!
My guess: because FPGAs are slow compared to mainstream desktop CPUs and only make sense if you have massive parallelism. But then you'd need a massive FPGA, which would be crazy expensive, plus you'd need a good way to handle throughput.
That, plus programming FPGA kind of sucks. The software tool chains are somewhere between 20 and 30 years behind the state of the art for software development.
Also, FPGAs can't be reasonably context-switched. Flashing them takes a significant amount of time, so forget about time-multiplexing access to the FPGA among different applications.
I could imagine some sort of API-based queuing - say you have 2 "slots" you can program stuff onto, so if you play 8K video you can have one flashed as a video decoder while the other one speeds up your kernel compilation. If you then also want to use FPGA-accelerated denoising on some video you recently recorded, the OS will politely tell you to wait for one of the other apps using the available slots to terminate first.
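Purely as an illustration of what such a slot interface could look like - every name here is invented, none of these calls exist anywhere today:

    /* Hypothetical slot-based FPGA interface; all names are invented
     * and the functions are stubs for the sake of a compilable sketch. */
    #include <stdio.h>

    typedef int fpga_slot_t;            /* negative = no slot granted */

    static fpga_slot_t fpga_request_slot(const char *bitstream_path)
    {
        (void)bitstream_path;
        return -1;                      /* pretend both slots are busy */
    }

    static void fpga_release_slot(fpga_slot_t slot) { (void)slot; }

    static void denoise_video(void)
    {
        fpga_slot_t s = fpga_request_slot("denoise.bit");
        if (s < 0) {
            fprintf(stderr, "FPGA slots busy - falling back to software\n");
            return;                     /* the OS made us wait our turn */
        }
        /* ... submit work to the programmed slot ... */
        fpga_release_slot(s);
    }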
Since applications do all their rendering via the GPU these days, desktop multi-tasking requires reasonably time-sliced access to the GPU. GPUs have proper memory protection these days (GPU-side page tables for each process). That's big progress over 10 years ago.
True. But it's still far away from a unified approach you'd expect (as someone outside the field) in a modern OS. After all, one of the jobs of an OS is to abstract away access to underlying hardware as much as possible. Until we get some improvements here, my hopes are not very high for the FPGA domain.
Once partial reconfiguration works and the FPGA can access main memory directly I see a lot of use cases. Imagine applications reconfiguring the FPGA in the blink of an eye to optimize their own algorithms.
Why? To make an FPGA do what you want you need to be able to reconfigure it. If you have reconfiguration capability you need to have remote code execution. And in that case you have already lost.
As in, the FPGA would have to be carefully segmented so the accelerator couldn't be used to access memory it shouldn't have access to.
I don't think it would happen in a general-purpose chip, but I could see it happening in a smaller one, like the exploits Christopher Domas demonstrated against some embedded x86 cores.
Why though? Your Integrated Intel or AMD GPU can also access all of your memory. I don't see how an FPGA provides any additional attack vector. As I said you'd need code execution privileges anyway and once you have that your system is already owned.
Yes, through the PCI bus, not directly. You don't want to have that latency. You want a unified model, like Intel GPUs that can access main memory, or the FPGA being another endpoint in AMD's Infinity Fabric architecture. That exists as well in SoC FPGA boards, but not in the mid or high performance segments.
Back when AMD released the first Opteron CPUs there was a vendor selling an FPGA that would plug into an Opteron socket along with the IP to implement HyperTransport in the FPGA.
Apart from the tooling woes others have mentioned (it's hard to get across how much FPGA tooling sucks compared to software tooling), FPGAs occupy a strange set of niches, but it's not clear how many are ones which would benefit the average PC or server. If you want raw memory bandwidth and FLOPs, a GPU is better (on speed, power, and price). If you want to evaluate complicated conditions and control flow, a CPU is better (in the same categories). Because FPGAs are very power hungry and generally slower than the raw silicon, they only really give you more compute if what you want to compute is either far enough off the beaten path that some custom logic is much more efficient than the available instructions, and/or it benefits from a specific parallel datapath which doesn't easily fit on a GPU or CPU (GPUs being the best at embarrassingly parallel systems and modern CPUs being good at parallelising general purpose code reasonably well).
In practice you see FPGAs mostly in two areas: specialised embedded applications which benefit from heavy custom I/O and/or some efficient specific DSP but don't have enough volume to justify an ASIC design, or in accelerators for simulating ASIC design.
Because adding a layer of abstraction to silicon transistors makes for terrible performance and energy usage.
Every other year or so someone "rediscovers" FPGA and thinks this niche architecture is poised for a total revolution of how computing works, think drag&drop hardware and super fast custom everything. It never happens and it will never happen because customization, much like premature optimization, is the root of all evil and also just.. see the first paragraph.
Existing FPGA vendors made sure their products remained in a lucrative niche by maintaining full control over the development process for FPGA designs.
How fast can an FPGA be reprogrammed? If I close my FPGA-accelerated machine learning training algorithm and then open a PC game, would it be feasible to load the new gaming-oriented instructions in the ~10-30 seconds that a PC game takes to open?
What sort of gaming-related workload do you think an FPGA would be suitable for? I don't know much about the gaming world, but isn't the majority of the computational workload graphics-rendering related, in which case, the GPU architecture is the best candidate to iterate on?
Lisa Su is a fantastic CEO. Time will tell what the impact of AMD’s acquisition of Xilinx will be (should it close), but this shows the strategy and execution behind Su and team.
While a lot of acquisitions don’t pan out, this seems great.
They're going to need good leadership to pull this off. AMD doesn't have a great track record when it comes to these integrations.
AMD bought ATI while promising the same integration "synergies". GPU style compute was going to be completely woven into the CPU - "AMD Fusion". Sounds great - but they ended up being beaten to the CPU-with-integrated-GPU market by Intel by over a year (Intel Clarkdale launched January 2010, AMD Llano midway 2011). 14 years after the acquisition, AMD's iGPU integration is not much different compared to any other iGPU integration, their raw performance lead is shrinking compared to Intel and they're beaten by Apple. Radeon Technologies Group functionally operates independently within the company, and AMD won't use their more performant new RDNA architecture in iGPUs for two years after its launch for some reason - even their 2021 APUs still use their 2017 Vega architecture (fundamentally based on 2012 GCN technology). In the intervening years they've screwed up their processor architecture and market share by going all in on the terrible Bulldozer architecture that was designed around the broken promises of far-reaching GPU integration.
Given all that, the ATI acquisition might still have been worth it - in hindsight AMD needed a competent GPU architecture one way or another - but the mismanagement of this acquisition nearly killed the company. I hope better leadership can do something here but I'm not really holding my breath.
Agreed. Now to be fair, the acquisition is also what helped the company survive because it got them the console business. So it's not like it was completely botched.
They screwed up majorly with software, and they may have the same problem with an FPGA acquisition as well. AMD failed big time to capitalize on GPUs the way Nvidia did, and that's really almost entirely down to lack of good software solutions. There's ROCm now and it seems plausible that the gap is going to narrow further with AMD GPUs deployed to big HPC clusters, but a gap remains.
The consoles use AMD SoCs that include CPU and GPU cores, but there's nothing special about how the CPU and GPU are connected. The only remotely unusual aspect there is that many of the console SoCs connect GDDR5/6 to the SoC's shared memory controller, while other consumer devices using similar chips (marketed by AMD as APUs) tend to use DDR4 or LPDDR.
AMD purchasing Xilinx is a reaction to Intel purchasing Altera five years ago. Dr. Su might be a good CEO for other reasons, but this isn't something that illustrates brilliant strategy on her part.
I think it's more a reaction to the decreasing importance of CPUs in the datacenter in favor of interconnect technology. FPGAs are one of the directions in which the "smart nic" or "DPU" tech has been moving, which is critical to the trend of datacenter disaggregation. Xilinx has a very strong offering in that regard.
If you look at market data, you can see that this market did not exist a few years ago and is now estimated to be worth billions, with major players releasing products in the space. Unless the dynamics pushing this forward change overnight, I think it's pretty safe to call it a trend.
A large reason for the deal with Altera was that Altera already used Intel for fabrication. I understand Intel's 10nm and 7nm failure has hurt them a lot in that regard, quite the opposite of the expected synergy. Unlike Xilinx for AMD, they didn't really have any other technologies Intel needed either; the biggest advantage was fabrication and that fell through.
Xilinx had laid off a good chunk right before their sale to AMD. Xilinx was having some financial troubles; when that happens, investors want out before a company craters. So selling themselves was one possible solution.
The industry doesn't move overnight. AMD might have seen where Intel was going and didn't want to be caught off guard, or that might be the alternative to Apple approach of dozens of coprocessors on a chip.
About *!$% time! I was hoping Intel would do something like this when they acquired Altera a few years back. Does anyone know why Intel acquired Altera?
chuckle This is true. As far as I can see (as a hardware engineer frequently doing FPGA stuff) The Intel/Altera combo has not produced any new products nor yielded any customer benefit beyond what would have happened if the two companies had remained independent. But I'll bet the "business strategists" at each company who thought this one up made a pile of money from the deal.
You are right, it's the ARM cores, mostly. Xilinx Zynq devices have ARM Cortex-A cores built into them as "hard" cores. That is, the ARMs are instantiated directly in silicon, not as "soft" cores which take LUTs (gates) from the FPGA fabric. The Cortex-A is a microprocessor (not a microcontroller), powerful enough to run Linux.
The ARM connects to the FPGA fabric using a so-called AXI bus, which is a local bus defined by ARM. Xilinx supplies a bunch of "soft" cores which you can instantiate in the FPGA and integrate with the ARM. Of course, you can write your own logic for the FPGA too, as long as you can figure out how to interface to it using one of the AXI bus variants.
Several vendors offer experimenter platforms that are affordable enough for hobbyists and folks making engineering prototypes. Examples are Avnet's ZedBoard and Digilent's Zybo board.
The biggest problem with the Zynq ecosystem is that the Xilinx tools -- Vivado/SDK and whatever they renamed it to last year -- are steaming piles of smelly brown stuff. Vivado is buggy, poorly supported, has bad documentation, and the supplied examples typically don't work in the latest version of Vivado since they were written long ago and have been made obsolete via version skew. An absolute disgrace compared to what software engineers are used to. The SDK is basically Eclipse, which has its own problems, but is not as bad as Vivado. Ask me how I know.
I think AMD and Xilinx have a long way to go before they can satisfy the hype and speculation I see in all the posts here. I suppose one could shell out $20K for a seat of Synopsys if one wanted a decent set of dev tools, but that's not the direction most software engineers are going nowadays.
Also, assuming NVidia completes its acquisition of ARM, the whole Zynq ecosystem is imperiled since it pits ARM against NVidia.
Not sure how realistic it would be, but I would like to see a RISC-V base core, and the FPGA implementing the extensions. Why? Because it would be cool! Also, I don't really see a use case except for debugging compilers supporting multiple RISC-V extensions and what not.
Yet another patent that should never have been granted.
SoCs have been a thing for a long time. SoC = CPU + FPGA on a single chip.
Looking at the patent, the list of 20 claims is absurd. The title says it all "... PROGRAMMABLE INSTRUCTIONS IN COMPUTER SYSTEMS", they're trying to patent anything that can run or dispatch instructions.
Every claim is almost a patent on its own. Submit 20 claims that are progressively more specific, so if one claim is denied during the patent application or afterwards, the other claims can still stand.
Typical strategy is to claim as many things as you can imagine, like inventing CPU and anything that can evaluate an instruction and instructions themselves, then remove any claim that the patent office refuses to grant.
No they're not. Claims 1, 8, and 15 are the only independent claims in the patent (if you're not infringing the independent claims you're not infringing any of the dependent claims: the tradeoff is the more general independent claims are generally easier to invalidate). All of them depend on having a dispatch unit which can be programmed to dispatch instructions into some logic which is configured by a bitfile (and also having the code and bitfile alongside each other and loaded by the same system). Most FPGA SoCs don't have a programmable instruction dispatch unit (which seems to me to be the core of the patent), and they generally do not have the software and bitfile side-by-side and loaded by the same loader, though that is probably an element which is quite vague and could be argued either way.
I don't like patents in general (and especially in software), but this patent is not as general as you claim.
That's how the industry works. You gather and hoard as many frivolous patents as you can in a cold war arms race. If a new company threatens your business, you search your portfolio for a patent they violated and sue them.
Companies who grow to a certain size look to be acquired by larger firms with bigger war chests.
Sometimes companies recognize patents are stifling progress and engage in cross licensing or pooling of patents. Sometimes they do it to gang up on a new rival.