I can't help but think most commentators haven't actually read the article or the patent. This isn't about having an FPGA embedded into the CPU or near the CPU, it's about having a programmable FPGA-like execution unit that can be programmed to be, say, a 4-bit floating-point adder, or any other weird execution unit one might need.
Why is this important? Have a program that does a lot of integer multiplications? Let's program all of these programmable execution units to multiply integers on the fly, etc. Now your integer multiply throughput is higher, as per the current program's needs.
Have lots of weird old x86 instructions you are forced to support but no one actually uses? Don't waste transistors on them; just program an execution unit to execute that instruction on the fly, etc.
I think it's great, and that most people are missing the point.
> Have lots of weird old x86 instructions you are forced to support but no one actually uses? Don't waste transistors on them
That's been the role of microcode for like three decades now. Why does it matter if the instruction no one uses is implemented with FPGA gates or uops? No one uses them.
Theoretically, now you can create "microcode" in the CPU for your specific needs - e.g. scientists do a lot of calculation and would like a processor optimised for that. Now they can use the FPGA to do it. You want a CPU instruction that is optimised for something else? You can program the FPGA for that.
The FPGA can't be reprogrammed fast enough for the JIT approach to work, unless you're running computations that take many minutes or hours. I suppose that does apply to some workloads, but it would be tough to ask your JIT to solve the halting problem and guess whether a workload will last 10 seconds or take longer than that.
I would say that programs that run and exit immediately are a minority of what is produced with these languages.
A lot of web servers, services, and GUIs are produced in JIT languages and have lifespans of multiple minutes.
AFAIK FPGA reconfiguration time depends on the size of your edit and your hardware; the article says they expect to reprogram it on program load, so I don't think it will be that slow.
Why can't a JIT program the FPGA based upon previous runs through the same set of code? IIRC the JVM won't rest on its final optimizations until it runs a chunk of code hundreds (or thousands? I forget) of times.
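To make that concrete, here's a rough sketch of the counter-based approach a JIT could take; the threshold value and the fpga_program_bitstream() call are made-up placeholders, not any real API:

    /* Sketch only: threshold-based offload in the spirit of a JIT's
     * invocation counters.  fpga_program_bitstream() and HOT_THRESHOLD
     * are hypothetical placeholders, not a real API. */
    #include <stdbool.h>
    #include <stdint.h>

    #define HOT_THRESHOLD 10000   /* roughly the order of a JIT compile threshold */

    struct kernel {
        uint64_t invocations;
        bool offloaded;
        const void *bitstream;    /* pre-compiled FPGA image for this kernel */
    };

    static void fpga_program_bitstream(const void *bs) { (void)bs; /* stub */ }

    static void maybe_offload(struct kernel *k)
    {
        if (!k->offloaded && ++k->invocations >= HOT_THRESHOLD) {
            /* Pay the slow reconfiguration cost only once the code has
             * proven itself hot; later calls use the custom unit. */
            fpga_program_bitstream(k->bitstream);
            k->offloaded = true;
        }
    }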
I've been doing that for years. I have created custom microcoded CPUs in FPGAs for tasks where it would provide an advantage. One example I remember was a microcoded real-time image warping engine.
> Don't waste transistors on them; just program an execution unit to execute that instruction on the fly, etc.
Possible, but the "x86" part is already a big decoder in front of a murky processor underneath so this is already what the CPU does - if you removed the reference to an FPGA, rewriting old x86 instructions in terms of "new" ones is microcode.
Wonder how that'd work - in practical terms - with modern systems?
eg, you could be running (say) 3 or 4 primary applications at the same time. Which one gets to use the FPGA pieces, or are they re-written every time, on every context switch? ;)
Re-writing them on every context switch sounds extremely unlikely, so it'd be more some kind of resource locking thing instead. Which could mean that FPGA-using applications at least start out being fairly niche, as only one could run "per core" or something.
Maybe dedicated cores per application instead or something?
The work is already being put in with modern NUMA (non-uniform memory access) systems to pin apps to specific cores. This seems like it would overlap if this ended up being used in production.
How about having FPGA execution units in addition to the normal units, with the OS deciding how and by whom these new EUs are used, based on the most CPU-intensive apps currently running?
I think the point is that what you're describing existed for years. Any Xilinx Zynq chip or Altera SoC chip can do this already. Just because the data doesn't travel through the AXI/AMBA bus does not make this novel.
Of course it does because you get access to the CPU as well so you can hop from an instruction you built on the FPGA to another “silicon” instruction with the same registers and processor state. This is extremely clever and doesn’t involve shuffling code from the main processor over a slow bus, executing some stuff all on the fpga and shuffling it back.
This won't be a question of what the user wants to do with these parts. I bet it won't even be accessible to common programmers. Applications will simply be constantly racing against each other, reprogramming the field-programmable part of my CPU on every startup.
An alarm goes out: our company has fewer patents than company X! In fact, we have the smallest patent hoard of our competitor group. If they sue us we might not have enough patents to sue them back! We must have more patents! Everyone who gets a patent gets a bonus! ( Exit CEO, trailing exclamation marks. All the engineers file their pet idea as a patent, hoping management will be interested in building it).
(Some years later) Okay, some of those patents we filed are a bit silly. But at least we now have a huge, intimidating patent pile! No one will dare sue us now! Mua ha ha! But let's be a bit more careful what we give those patent bonuses for. (Meanwhile at company X: our company has fewer patents than company Y!...)
The above is a true story, happened to me. Well, apart from the moustache twirling. My name is on some not very practical patents. So I'm not very convinced by stories which read the tea leaves from patents as to what a company intends ( Or economists trying to infer innovation rate from patent filing rate). Another problem is that the patent office is slow. Unless the company is General Fusion, most probably the product will be out before the patent.
I almost got my name on a patent that way for sharing an idea in an internal forum. I refused (well, I asked politely) and they took my name off it. (There was someone else in the conversation as well.)
Similar story with a patent that a friend applied for at National Instruments some years ago for his work on LabView software. As far as I could tell from his description it was more of an implementation detail rather than a patentable product. NI was pushing for patents though, and he obliged. Ended up getting a nice little bonus for his young family!
Everything described in the article sounds exactly like some of the Virtex*-FX products from more than 10 years ago.
For instance, the Virtex4-FX had either one or two 450MHz PowerPC cores embedded in it, where you could implement 8 of your own additional instructions in the FPGA. This is effectively now a CPU where you can extend the instruction set, and design your own instructions specific to your application. For example, you might make special instructions using the onboard logic to accelerate video compression, or math operations; I know of one application that was designed to do a 4x4 matrix multiply per cycle.
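From the software side, such an extension usually surfaces as an intrinsic or inline assembly. A rough illustration of what calling a 4x4 matrix-multiply unit might look like - the intrinsic name and the build flag below are invented for this sketch, not the actual Virtex-FX interface:

    /* Illustration only: invoking a hypothetical custom "4x4 matrix
     * multiply" instruction added to the fabric.  The intrinsic name and
     * HAVE_CUSTOM_MATMUL_UNIT are made up for this sketch. */
    #include <stdint.h>

    void mat4_mul(const float a[16], const float b[16], float out[16])
    {
    #ifdef HAVE_CUSTOM_MATMUL_UNIT            /* hypothetical build flag */
        __custom_mat4_mul(a, b, out);         /* one fabric instruction  */
    #else
        /* Plain-C fallback that the custom unit would replace. */
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++) {
                float acc = 0.0f;
                for (int k = 0; k < 4; k++)
                    acc += a[i * 4 + k] * b[k * 4 + j];
                out[i * 4 + j] = acc;
            }
    #endif
    }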
For those curious, Xtensa is a similar embeddable architecture (known especially for its use in the ESP32 microcontroller) that allows broad latitude to the designer to customize its instruction set with custom acceleration. The integration is very good, the compiler recognizes the new intrinsics and the designer has control over how the instruction is pipelined into the main processor.
Unfortunately it's very proprietary, and as far as I know there isn't an at-home version you can play with on FPGAs. But this kind of thing does exist if you can afford it - you don't have to roll your own RTL.
I am very familiar with the new Zynq family, embedding ARM cores on the same die together with FPGA fabric. I didn't know that the PowerPC version allowed such a tight coupling as handing off an instruction to programmable logic, the current Zynq models are much more lightly coupled, using AXI buses to connect the ARM cores with the PL (and many other components on the same SoC).
What was the latency like to actually get data into your shiny new instruction e.g. do I get a 14 stage pipeline stall to actually use the instruction?
I hate to be that bucket of cold water, but there's multiple reasons FPGAs haven't been successful in package with CPUs. Firstly, the costs of embedding the FPGA - FPGAs are relatively large and power hungry (for what they can do), if you're sticking one on a CPU die, you're seriously talking about trading that against other extremely useful logic. You really need to make a judgement at purchase time whether you want that dark piece of silicon instead of CPU cores for day to day use.
Secondly, whilst they're reconfigurable, they're not reconfigurable in the time scales it takes to spawn a thread; it's more like the same scale of time it takes to compile a program (this is getting a little better over time). Which makes it a difficult system design problem to make sure your FPGA is programmed with the right image to run the software programme you want. If you're at that level of optimization, why not just design your system to use a PCI-E board? It'll give you more CPU, and way more FPGA compute, and both will be cheaper because you get a stock CPU and stock FPGA, not some super custom FPGA-CPU hybrid chip.
Thirdly, the programming model for FPGAs is fundamentally very different from CPUs: it's dataflow, and generally the FPGA is completely deterministic. We really don't have a good answer for writing FPGA logic to handle the sort of cache hierarchy and out-of-order execution that CPUs do. So you're not getting the same sort of advantage that you'd expect from that data locality. It's very difficult to write CPU/FPGA programs that run concurrently; almost all solutions today run in parallel - you package up your work, send it off to the FPGA and wait for it to finish.
Finally, as others have said - the tools are bad. That's relatively solvable.
For me, it boils down to this, if you have an application that you think would be good on the same package as a CPU, it's probably worth hardening it into ASIC (see: error correction, Apple's AI stuff). If you have an application that isn't, then a PCI-E card is probably a better bet - you get more FPGA, more CPU and you're not trading the two off.
ASICs only make sense if you have high volume. PCI-e takes a lot of resources/space. The sweet spot for FPGA-CPU hybrid chips are embedded devices that are latency sensitive. For example, time-of-flight sensors and specialty cameras.
I guess to overcome the reconfiguration latency others have mentioned, the use case would be systems that configure their custom instructions once on boot and then the software just sees a cpu as normal, just with these custom instructions. Ie not intended for reconfiguration on context switch.
I definitely agree that a PCI-E card is preferable. Hell even if you have it in CPU, you probably want it sat on the PCI-E bus anyways so it can P2P DMA with other hardware.
Also (not disagreeing but I'm curious), last time I checked FPGAs could pull off some level of partial reconfiguration in the millisecond and sub millisecond ranges. I may be a bit off on these times but I saw them in a research paper a few years back. What types of speed would be necessary for CPUs to actually be able to benefit from a small FPGA onboard (rather than on an expansion card) with all the context switching.
High end FPGAs are theoretically capable of millisecond fast partial reconfigurations but doing so requires making a lot of tradeoffs that just highlight the impedance mismatch between the generic nature of CPUs and the purgatory that is FPGAs. The more of the FPGA you want to reconfigure, the longer it takes (stop the world, depending on which parts its touches) and unless the reconfigured portion is limited to a standard bus, the reconfiguration won't work (or you have to reconfigure more of the design to deal with different interfaces, timings, etc. blowing up reconfiguration time and defeating the purpose). All of the bitstreams have to be compiled ahead of time as well.
Unless latency is so critical that the speed of light is the limiting factor, partial reconfiguration just replaces PCIe with a much harder-to-work-with AXI interconnect (or similar, but it always ends up being AXI...).
It's easier to provide "custom instructions" and only accelerate CPU bottlenecks if you don't have PCIe as a massive bottleneck. If you are using an accelerator behind a bus you always have to make sure there is enough work for the accelerator to justify a data transfer. GPUs are built around the idea of batching a lot of work and running it in parallel. You can make an FPGA work like that but you are throwing away the low latency benefits of FPGAs.
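A quick back-of-the-envelope shows the break-even point for a bus-attached accelerator; all the latency numbers below are assumptions picked for illustration, not measurements:

    /* Back-of-the-envelope break-even for a bus-attached accelerator.
     * All numbers are illustrative assumptions. */
    #include <stdio.h>

    int main(void)
    {
        double round_trip_ns   = 1000.0; /* assumed PCIe round-trip latency */
        double cpu_ns_per_item = 50.0;   /* assumed CPU cost per work item  */
        double acc_ns_per_item = 5.0;    /* assumed accelerator cost/item   */

        /* The accelerator only wins once the per-item savings amortize
         * the fixed transfer cost: n * (cpu - acc) > round_trip. */
        double breakeven = round_trip_ns / (cpu_ns_per_item - acc_ns_per_item);
        printf("need a batch of more than %.0f items to win\n", breakeven);
        return 0;
    }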
Even the best-case scenarios for integrating a FPGA onto the same die as CPU cores would still have the FPGA separate from the CPU cores. It's really not possible to make an open-ended high bandwidth low latency interface to a huge chunk of FPGA silicon part of the regular CPU core's tightly-optimized pipeline, without drastically slowing down that CPU. The sane way to use an FPGA is as a coprocessor, not grafted onto the processor core itself. Then, you're interacting with the FPGA through interfaces like memory-mapped IO whether it's on-die, on-package, or on an add-in card.
> It's really not possible to make an open-ended high bandwidth low latency interface to a huge chunk of FPGA silicon part of the regular CPU core's tightly-optimized pipeline
That's what's interesting about the article, because that's what the patent is about: "implementing as part of a processor pipeline a reprogrammable execution unit capable of executing specialized instructions".
Yeah, worth mentioning highly optimized FPGA designs run at up to 600MHz (or to put it another way, 400MHz lower than what Intel advertised 4 years ago). So at a minimum, you're going to clock cross, have a >10 cycle pipeline at CPU speeds (variable clock) and clock cross back.
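Rough arithmetic behind that estimate (the CPU clock and synchronizer depth below are assumptions):

    /* Rough arithmetic behind the ">10 cycle" estimate; clock speeds and
     * synchronizer depth are assumptions. */
    #include <stdio.h>

    int main(void)
    {
        double cpu_ghz = 4.0;             /* assumed CPU clock          */
        double fpga_mhz = 600.0;          /* optimistic fabric clock    */
        double cdc_cycles_each_way = 3.0; /* assumed synchronizer depth */

        double cpu_cycles_per_fabric_cycle = (cpu_ghz * 1000.0) / fpga_mhz;
        double total = 2.0 * cdc_cycles_each_way + cpu_cycles_per_fabric_cycle;
        printf("one fabric cycle ~ %.1f CPU cycles; with clock crossing ~ %.0f\n",
               cpu_cycles_per_fabric_cycle, total);
        return 0;
    }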
The downside of PCIe is PCIe is very complex. And the tools make interfacing with it bewildering. I really want a PCIe FPGA that looks to me like data magically appears on an AXI bus.
Everyone seems to be talking about accelerated instructions but how about I/O?
FPGAs are awesome at asynchronous I/O and low latency. We could implement network stacks, sound and video processing, etc... It can start a TLS handshake as soon as the electrical signal hits the ethernet port, while the CPU is not even aware of it happening. It can timestamp MIDI input down to the microsecond and replay with the same precision. It can process position data from a VR headset at the very last moment in the graphics pipeline. Maybe even do something like a software defined radio.
Basically any simple but latency-critical operation. Of course, embedded/realtime systems are a prime target.
That's pretty much what Xilinx's Zynq product lines are already targeting, including embedded. They're comparatively nice boards to work on, as long as you can swallow the BOM cost.
The network example is wasteful. There are already NICs that have FPGAs and support a whole set of Linux kernel features. You wouldn't want that to be that far away from the NIC itself.
PTP works just like that - it timestamps incoming and outgoing packets right after/before the packet hits the wire. There is eXpress Data Path (XDP), which can offload eBPF programs to NICs and deal with packets without them ever entering the kernel at all.
High Frequency Traders do exactly that IIRC today.
As for video processing codecs today are way too complex to be run there. Well, no one will stop you from running something like an integer DCT part on FPGA.
VR thing... Generally, aside from Nvidia companies don't want to ship entire FPGA to end customers (guess why Nvidia G-Sync monitors used to be so expensive). Something like Snapdragon XR2 "solves" VR. Also, in order to render a picture you need to know headset position early, not at the last moment. How would you know what to render?
How useful this is depends entirely on the FPGA's capability and size. I bet it will be more useful for things like implementing some hash function there or something like that.
IMO this will be a very niche product inside already niche market.
A killer tech for this would be a framework that automatically reprograms the FPGA and offloads the work if it makes sense.
For example - running k-means? Have your FPGA automatically (with minimal dev effort) flash to be a Nearest Neighbor accelerator.
The problem is finding a way to make that translation happen with minimal dev effort, as software is written rather differently from hardware.
Their web site is very sparse on what programming models the tool supports. Traditionally, the things you can easily accelerate automatically are algorithms you can write naturally in Fortran 77 (lots of arrays, no pointers), and that's one limit on the applicability of these automatic tools. (Other limits that other posters have pointed out are compilation+place+route runtime, and reconfiguration time.)
They are claiming you can use malloc and make "extensive" use of pointers in C programs and still have them automatically compiled for the FPGA. That's where details are needed and they are mostly missing.
I watched their 30 minute demo film. The speedups are impressive, and on the small example it's impressive that it does the partitioning automatically. However, the program contains only a single call to malloc, and all pointers are derived from that address, so it doesn't do much to convince us that the memory model and alias analysis give you more flexibility than the F77 model.
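For contrast, here's the difference in miniature - the first function is the F77-style code these tools handle well, the second is the kind of pointer code where the alias analysis has to earn its keep (both are purely illustrative):

    /* The first loop has a fixed trip count and provably non-aliasing
     * buffers, so it maps cleanly to a pipelined/unrolled datapath.  In
     * the second, the tool can't prove dst and src don't overlap or how
     * many iterations there are, which blocks those transformations. */
    #include <stddef.h>

    void scale_easy(float *restrict dst, const float *restrict src, float k)
    {
        for (int i = 0; i < 1024; i++)   /* fixed bound, no aliasing */
            dst[i] = k * src[i];
    }

    void scale_hard(float *dst, float *src, size_t n, float k)
    {
        for (size_t i = 0; i < n; i++)   /* unknown bound, dst/src may alias */
            dst[i] = k * src[i];
    }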
You might want to check the "Warp Processing" project out: http://www.cs.ucr.edu/~vahid/warp/. It is probably exactly what you are thinking about. Transparent analysis of the instruction stream at runtime and synthesis and offloading of hot spots to the FPGA.
Why is that surprising? A LOT of statements in a program can be executed in parallel. It's just not worth it to actually make threads for them, since the overhead of threads is larger than just executing the set of instructions sequentially. In fact, all modern processors track data dependencies and execute independent instructions in parallel where possible.
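A tiny example of the kind of independent work a superscalar core (or a wide custom unit) can overlap, versus a serial dependency chain:

    /* Both functions do four multiplications, but only the first gives
     * the hardware independent work to overlap; the second is a serial
     * dependency chain where each multiply waits on the previous one. */
    float independent(float a, float b, float c, float d)
    {
        float w = a * a;   /* these four products have no  */
        float x = b * b;   /* dependencies on each other,  */
        float y = c * c;   /* so they can issue in the     */
        float z = d * d;   /* same cycle(s)                */
        return w + x + y + z;
    }

    float dependent(float a, float b, float c, float d)
    {
        return ((a * b) * c) * d;   /* strictly serial chain */
    }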
I recall reading papers about doing this by profiling Java apps a decade or so ago, but I would have to dig pretty deep in my HN comment history to find them.
The approach seems conceptually similar to the optimizations available via the enterprise version of GraalVM.
For decades, the FPGA vendors have had this fever dream of "an FPGA in every PC" -- either as an add-on card, or as part of the chipset on a motherboard -- that would enable a compiler or operating system to seamlessly accelerate arbitrary tasks on demand.
In my opinion, the problem has always been their software: the FPGA vendor tools are slow, bloated monstrosities. The core of these tools are written by the big three EDA vendors (Cadence, Synopsys, and Mentor Graphics) rather than the FPGA vendors themselves. The licenses include ridiculous, paranoid restrictions [1] and force the FPGA vendors to keep their bitstream formats and timing databases secret [2] in order to prevent competition from other tool vendors. Most FPGA vendors didn't see this as a problem, but even the ones that did didn't have much of a choice, because the tool market is a cartel.
Thankfully, we now have an open source toolchain [3] with support for a growing number of FPGA architectures [4], and using it vs. the vendor tools is like using gcc or llvm vs. a '90s era, non-compliant C++ compiler. It even has a real IR that isn't Verilog, which has made it easier to design new HDLs [5].
I don't see how a dynamic FPGA accelerator platform can be even remotely viable without this. It's the difference between a developer getting to choose between one of a few dozen pre-baked designs that lock up the entire FPGA (and needing to learn how to shovel data into it), vs. a compiler flag that can give you the option of unrolling any loop directly into any inactive region of FPGA fabric.
It would be quite the cherry on top to see AMD build something interesting in this space. But unless they're willing to fully unencumber at least this one design, I think the effort is likely to fail. The open source guys are chomping at the bit to make this work, and have been making real progress lately. Meanwhile, the EDA vendors have been making promises, failing, and throwing tantrums for the last 20 years. It's time to write them off.
[2] Imagine trying to write an assembler without being allowed to see the manual that tells you how instructions are encoded. It's like that, but the state-space is hundreds to thousands of bytes in multiple configurations rather than a few dozen bits.
I would love to hack on FPGAs but always run into the issue of closed toolchains. The recent open source work is a breath of fresh air, but we need to see an FPGA vendor that embraces and sponsors this work.
I think/hope it's an unstable equilibrium -- if either Altera/Intel or Xilinx/AMD give a nod to the open source tools, the others will follow.
Lattice is seemingly at "wink wink, nudge nudge" levels of support -- their lawyers won't allow them to say anything because they're afraid of pissing off Synopsys, but they also know that they're currently the best supported platform, and don't seem interested in deliberately making things difficult.
On paper at least, it could be a good idea for a company in Lattice's position; at the very least, academics would probably switch.
I would like to see a FAANG try and support some open tools - it doesn't have to be anything legally sketchy like reverse engineering bitstreams - for example, Yosys only has limited SystemVerilog support
Firstly (for the uninitiated), Bluespec is both a Haskell DSL (Bluespec Classic) and a Verilog-like language (Bluespec SystemVerilog).
It compiles to Verilog, but the stack is much more integrated than other similar compile-to-verilog HDLs - the simulator is similar to verilator and much easier to get started with.
I'm kind of beginning to feel that Haskell isn't a good medium for HDL code - Verilog already encourages unreadable names like "mem_chk_sig_state" and Haskell code is almost unstructured to my eye (I like functional programming but it seems hard to keep it readable because of the style it imposes - the flow is there but the names are usually way too short for my taste)
I'm pretty sure Bluespec and SpinalHDL compile to Verilog. Chisel uses its own IR (FIRRTL). I think Migen used to target Verilog, but now targets (one of?) the IR(s) that Yosys supports (RTLIL?).
This looks a bit like the old (2000s) work of Leopard Logic or Tensilica. Exciting stuff.
One important note (based on some comments here): generally, these in-CPU FPGAs have very fast reconfiguration. Not sure if it's 1, 10 or 100 cycles but it's not milliseconds. Actually, (in past examples) configuration might take milliseconds but it would load a number of planes of configurations: plane 0 might be MP3 audio device; plane 1 might be MPEG2 video device. Then reconfiguration is: switch to plane 1.
This AMD proposal looks like it's much more tightly integrated into the CPU so it's got to be even faster. Combine that with the deep knowledge of processor internals you'll have to have to code for this thing and I'm having a hard time seeing you and me having much luck tinkering. This is probably 99.99% data center with gnarly NDAs and field support.
I think the right way isn't "learn a HDL", it's "learn digital electronics design". Hardware description languages enable succinct hardware description, but it's still necessary to keep an image of the actual hardware in mind.
You're going to need to commit a lot more time than that. HDLs and the surrounding concepts have key fundamental differences from software that a lot of developers have a hard time stomaching. That's why high-level synthesis is the FPGA industry's City of El Dorado; software developers would be able to create acceleration designs without having to build up a fairly large new skillset.
I've never understood this argument. The change in mindset is extremely small. It's merely a matter of awareness. High-level synthesis can work just fine if you don't go overboard with constructs that are hard to synthesize. There is no fundamental reason why a math equation in C should be harder to synthesize than the Verilog or VHDL equivalent.
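For instance, something like the following is exactly the kind of C that maps straight onto a datapath, and there's no obvious reason a tool should synthesize it any worse than the equivalent Verilog (the function itself is just an illustrative example):

    /* The sort of "math equation in C" that maps directly to hardware:
     * fixed-width integers, no pointers, no unbounded loops.  An HLS tool
     * can turn this into the same multiplier/adder chain a hand-written
     * RTL module would describe. */
    #include <stdint.h>

    static inline int32_t poly3(int32_t x, int32_t c0, int32_t c1,
                                int32_t c2, int32_t c3)
    {
        /* Horner form: three multiplies and three adds. */
        return ((c3 * x + c2) * x + c1) * x + c0;
    }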
I think HLS is oversold. It's not that hard for software guys to learn some digital logic and write an accelerator. One or two weeks and it shouldn't be a problem. Where the real problems lies is in the tooling. You can't learn that in one or two weeks. You first need to damage your brain to be able to handle it.
I'm assuming that if public knowledge of AMD's efforts are at the patent level, it will be a few years before there's much to work with, by which point you'd have a solid foundation from which to accelerate your learning.
While sibling comments mention that it is probably wiser to learn digital logic before an HDL (and I agree with them), I think it is important to also consider that there is now High-Level Synthesis (HLS), where programming languages similar to C (e.g., OpenCL) can compile to VHDL. HLS may lower the barrier for programmers to take advantage of FPGAs. However, whether the design can compile to fit the constraints of the available FPGA is another question to which I do not know the answer.
This approach is not new and has been toyed with since the 1960s (!); see G. Estrin's work on adaptive architectures, for example.
I got to know about this as part of the PRISM (Processor Reconfiguration Through Instruction Set Metamorphosis) work in the early '90s. There is a very cool paper by the same name. Check it out!
My guess: because FPGAs are slow compared to mainstream desktop CPUs and only make sense if you have massive parallelism. But then you'd need a massive FPGA, which would be crazy expensive, plus you'd need a good way to handle throughput.
That, plus programming FPGA kind of sucks. The software tool chains are somewhere between 20 and 30 years behind the state of the art for software development.
Also, FPGAs can't be reasonably context-switched. Flashing them takes a significant amount of time, so forget about time-multiplexing access to the FPGA among different applications.
I could imagine some sort of API-based queuing - say you have 2 "slots" you can program stuff onto, so if you play 8K video you can have one flashed as a video decoder while the other one speeds up your kernel compilation. If you then also want to use FPGA-accelerated denoising on some video you recently recorded, the OS will politely tell you to wait for one of the other apps using the available slots to terminate first.
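Purely as an illustration of what such a slot interface could look like - every name here is invented, none of these calls exist anywhere today:

    /* Hypothetical slot-based FPGA interface; all names are invented
     * and the functions are stubs for the sake of a compilable sketch. */
    #include <stdio.h>

    typedef int fpga_slot_t;            /* negative = no slot granted */

    static fpga_slot_t fpga_request_slot(const char *bitstream_path)
    {
        (void)bitstream_path;
        return -1;                      /* pretend both slots are busy */
    }

    static void fpga_release_slot(fpga_slot_t slot) { (void)slot; }

    static void denoise_video(void)
    {
        fpga_slot_t s = fpga_request_slot("denoise.bit");
        if (s < 0) {
            fprintf(stderr, "FPGA slots busy - falling back to software\n");
            return;                     /* the OS made us wait our turn */
        }
        /* ... submit work to the programmed slot ... */
        fpga_release_slot(s);
    }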
Since applications do all their rendering via the GPU these days, desktop multi-tasking requires reasonably time-sliced access to the GPU. GPUs have proper memory protection these days (GPU-side page tables for each process). That's big progress over 10 years ago.
True. But it's still far away from a unified approach you'd expect (as someone outside the field) in a modern OS. After all, one of the jobs of an OS is to abstract away access to underlying hardware as much as possible. Until we get some improvements here, my hopes are not very high for the FPGA domain.
Once partial reconfiguration works and the FPGA can access main memory directly I see a lot of use cases. Imagine applications reconfiguring the FPGA in the blink of an eye to optimize their own algorithms.
Why? To make an FPGA do what you want you need to be able to reconfigure it. If you have reconfiguration capability you need to have remote code execution. And in that case you have already lost.
As in, the FPGA would have to be carefully segmented so the accelerator couldn't be used to access memory it shouldn't have access to.
I don't think it would happen in a general-purpose chip, but I could see it happening in a smaller one, like the exploits Christopher Domas demonstrated against some embedded x86 cores.
Why though? Your Integrated Intel or AMD GPU can also access all of your memory. I don't see how an FPGA provides any additional attack vector. As I said you'd need code execution privileges anyway and once you have that your system is already owned.
Yes, through the PCI bus, not directly. You don't want to have that latency. You want a unified model, like Intel GPUs that can access main memory, or the FPGA being another endpoint in AMD's Infinity Fabric architecture. That exists as well in SoC FPGA boards, but not in the mid or high performance segments.
Back when AMD released the first Opteron CPUs there was a vendor selling an FPGA that would plug into an Opteron socket along with the IP to implement HyperTransport in the FPGA.
Apart from the tooling woes others have mentioned (it's hard to get across how much FPGA tooling sucks compared to software tooling), FPGAs occupy a strange set of niches, but it's not clear how many are ones which would benefit the average PC or server. If you want raw memory bandwidth and FLOPs, a GPU is better (on speed, power, and price). If you want to evaluate complicated conditions and control flow, a CPU is better (in the same categories). Because FPGAs are very power hungry and generally slower than the raw silicon, they only really give you more compute if what you want to compute is either far enough off the beaten path that some custom logic is much more efficient than the available instructions, and/or it benefits from a specific parallel datapath which doesn't easily fit on a GPU or CPU (GPUs being the best at embarrassingly parallel systems and modern CPUs being good at parallelising general purpose code reasonably well).
In practice you see FPGAs mostly in two areas: specialised embedded applications which benefit from heavy custom I/O and/or some efficient specific DSP but don't have enough volume to justify an ASIC design, or in accelerators for simulating ASIC design.
Because adding a layer of abstraction to silicon transistors makes for terrible performance and energy usage.
Every other year or so someone "rediscovers" FPGA and thinks this niche architecture is poised for a total revolution of how computing works, think drag&drop hardware and super fast custom everything. It never happens and it will never happen because customization, much like premature optimization, is the root of all evil and also just.. see the first paragraph.
Existing FPGA vendors made sure their products remained in a lucrative niche by maintaining full control over the development process for FPGA designs.
How fast can an FPGA be reprogrammed? If I close my FPGA-accelerated machine learning training algorithm and then open a PC game, would it be feasible to load the new gaming-oriented instructions in the ~10-30 seconds that a PC game takes to open?
What sort of gaming-related workload do you think an FPGA would be suitable for? I don't know much about the gaming world, but isn't the majority of the computational workload graphics-rendering related, in which case, the GPU architecture is the best candidate to iterate on?
Lisa Su is a fantastic CEO. Time will tell what the impact of AMD’s acquisition of Xilinx will be (should it close), but this shows the strategy and execution behind Su and team.
While a lot of acquisitions don’t pan out, this seems great.
They're going to need good leadership to pull this off. AMD doesn't have a great track record when it comes to these integrations.
AMD bought ATI while promising the same integration "synergies". GPU style compute was going to be completely woven into the CPU - "AMD Fusion". Sounds great - but they ended up being beaten to the CPU-with-integrated-GPU market by Intel by over a year (Intel Clarkdale launched January 2010, AMD Llano midway 2011). 14 years after the acquisition, AMD's iGPU integration is not much different compared to any other iGPU integration, their raw performance lead is shrinking compared to Intel and they're beaten by Apple. Radeon Technologies Group functionally operates independently within the company, and AMD won't use their more performant new RDNA architecture in iGPUs for two years after its launch for some reason - even their 2021 APUs still use their 2017 Vega architecture (fundamentally based on 2012 GCN technology). In the intervening years they've screwed up their processor architecture and market share by going all in on the terrible Bulldozer architecture that was designed around the broken promises of far-reaching GPU integration.
Given all that, the ATI acquisition might still have been worth it - in hindsight AMD needed a competent GPU architecture one way or another - but the mismanagement of this acquisition nearly killed the company. I hope better leadership can do something here but I'm not really holding my breath.
Agreed. Now to be fair, the acquisition is also what helped the company survive because it got them the console business. So it's not like it was completely botched.
They screwed up majorly with software, and they may have the same problem with an FPGA acquisition as well. AMD failed big time to capitalize on GPUs the way Nvidia did, and that's really almost entirely down to lack of good software solutions. There's ROCm now and it seems plausible that the gap is going to narrow further with AMD GPUs deployed to big HPC clusters, but a gap remains.
The consoles use AMD SoCs that include CPU and GPU cores, but there's nothing special about how the CPU and GPU are connected. The only remotely unusual aspect there is that many of the console SoCs connect GDDR5/6 to the SoC's shared memory controller, while other consumer devices using similar chips (marketed by AMD as APUs) tend to use DDR4 or LPDDR.
AMD purchasing Xilinx is a reaction to Intel purchasing Altera five years ago. Dr. Su might be a good CEO for other reasons, but this isn't something that illustrates brilliant strategy on her part.
I think it's more a reaction to the decreasing importance of CPUs in the datacenter in favor of interconnect technology. FPGAs are one of the directions in which the "smart nic" or "DPU" tech has been moving, which is critical to the trend of datacenter disaggregation. Xilinx has a very strong offering in that regard.
If you look at market data, you can see that this market did not exist a few years ago and is now estimated to be worth billions, with major players releasing products in the space. Unless the dynamics pushing this forward change overnight, I think it's pretty safe to call it a trend.
A large reason for the deal with Altera was that Altera already used Intel for fabrication. I understand Intel's 10nm and 7nm failure has hurt them a lot in that regard, quite the opposite of the expected synergy. Unlike Xilinx for AMD, they didn't really have any other technologies Intel needed either; the biggest advantage was fabrication and that fell through.
Xilinx had laid off a good chunk right before their sale to AMD. Xilinx was having some financial troubles; when that happens, investors want out before a company craters. So selling themselves was one possible solution.
The industry doesn't move overnight. AMD might have seen where Intel was going and didn't want to be caught off guard, or that might be the alternative to Apple approach of dozens of coprocessors on a chip.
About *!$% time! I was hoping Intel would do something like this when they acquired Altera a few years back. Does anyone know why Intel acquired Altera?
chuckle This is true. As far as I can see (as a hardware engineer frequently doing FPGA stuff) The Intel/Altera combo has not produced any new products nor yielded any customer benefit beyond what would have happened if the two companies had remained independent. But I'll bet the "business strategists" at each company who thought this one up made a pile of money from the deal.
You are right, it's the ARM cores, mostly. Xilinx Zynq devices have ARM Cortex-A cores built into them as "hard" cores. That is, the ARMs are instantiated directly in silicon, not as "soft" cores which take LUTs (gates) from the FPGA fabric. The Cortex-A is a microprocessor (not a microcontroller), powerful enough to run Linux.
The ARM connects to the FPGA fabric using a so-called AXI bus, which is a local bus defined by ARM. Xilinx supplies a bunch of "soft" cores which you can instantiate in the FPGA and integrate with the ARM. Of course, you can write your own logic for the FPGA too, as long as you can figure out how to interface to it using one of the AXI bus variants.
Several vendors offer experimenter platforms that are affordable enough for hobbyists and folks making engineering prototypes. Examples are Avnet's ZedBoard and Digilent's Zybo board.
The biggest problem with the Zynq ecosystem is that the Xilinx tools -- Vivado/SDK and whatever they renamed it to last year -- are steaming piles of smelly brown stuff. Vivado is buggy, poorly supported, has bad documentation, and the supplied examples typically don't work in the latest version of Vivado since they were written long ago and have been made obsolete via version skew. An absolute disgrace compared to what software engineers are used to. The SDK is basically Eclipse, which has its own problems, but is not as bad as Vivado. Ask me how I know.
I think AMD and Xilinx have a long way to go before they can satisfy the hype and speculation I see in all the posts here. I suppose one could shell out $20K for a seat of Synopsys if one wanted a decent set of dev tools, but that's not the direction most software engineers are going nowadays.
Also, assuming NVidia completes its acquisition of ARM, the whole Zynq ecosystem is imperiled since it pits ARM against NVidia.
Not sure how realistic it would be, but I would like to see a RISC-V base core, and the FPGA implementing the extensions. Why? Because it would be cool! Also, I don't really see a use case except for debugging compilers supporting multiple RISC-V extensions and what not.
Yet another patent that should never have been granted.
SoCs have been a thing for a long time. SoC = CPU + FPGA on a single chip.
Looking at the patent, the list of 20 claims is absurd. The title says it all "... PROGRAMMABLE INSTRUCTIONS IN COMPUTER SYSTEMS", they're trying to patent anything that can run or dispatch instructions.
Every claim is almost a patent on its own. Submit 20 claims that are progressively more specific, so if one claim is denied during the patent application or afterwards, the other claims can still stand.
Typical strategy is to claim as many things as you can imagine, like inventing CPU and anything that can evaluate an instruction and instructions themselves, then remove any claim that the patent office refuses to grant.
No they're not. Claims 1, 8, and 15 are the only independent claims in the patent (if you're not infringing the independent claims you're not infringing any of the dependent claims: the tradeoff is the more general independent claims are generally easier to invalidate). All of them depend on having a dispatch unit which can be programmed to dispatch instructions into some logic which is configured by a bitfile (and also having the code and bitfile alongside each other and loaded by the same system). Most FPGA SoCs don't have a programmable instruction dispatch unit (which seems to me to be the core of the patent), and they generally do not have the software and bitfile side-by-side and loaded by the same loader, though that is probably an element which is quite vague and could be argued either way.
I don't like patents in general (and especially in software), but this patent is not as general as you claim.
That's how the industry works. You gather and hoard as many frivolous patents as you can in a cold war arms race. If a new company threatens your business, you search your portfolio for a patent they violated and sue them.
Companies who grow to a certain size look to be acquired by larger firms with bigger war chests.
Sometimes companies recognize patents are stifling progress and engage in cross licensing or pooling of patents. Sometimes they do it to gang up on a new rival.