
Isn’t the Intel 6208U a strong competitor to the AMD 7302? At the same price and TDP it has higher clock speeds and a unified memory domain, compared to AMD's 4-way NUMA architecture. It seems like you can make a case for either, depending on your workload.


The AMD Rome chips (including the 7302) behave as one NUMA node, I thought (and that is what I can find online). You also get quite a lot of PCIe 4.0 lanes as a bonus and a higher all-core base frequency. Though your mileage may vary depending on workload, as you already stated.


AMD's latest CPUs are one NUMA node per socket.

Here's the 64-core Threadripper's core to core communication latency: https://pbs.twimg.com/media/EQXru3WU8AAV3JC?format=png&name=...

Communication within a CCX is quicker, but everything else goes through the central IO die that has all the DRAM controllers.


What is core-to-core communication?

Cache is shared by the cores, but may be temporarily "assigned" to a core that recently wrote to it. Is latency(x, y) the number of cycles needed to reassign to x a cache page owned by y?


> Cache is shared by the cores,

Not really. All three levels of cache are split on Rome. L1 and L2 are per-core, and L3 is per-CCX (4 cores). If you have 1 thread with a working set larger than the 16MB L3 slice that each CCX gets, then you'll be hitting DRAM rather than spilling over into the L3 of another CCX. But if you have cores on separate CCXs that are using the same region of memory, then the usual cache coherency semantics for separate chips apply.

The next version of AMD's Zen architecture is expected to increase the CCX size to 8 cores, so all 32MB of L3 on an 8-core chiplet will be unified and shared between all 8 cores, rather than being partitioned into two 16MB per-CCX chunks. I don't think it's practical for them to unify the L3 cache across multiple chiplets given the performance of their inter-die connections, and I don't think they have the die space on the central IO die for a fully unified L4 cache. (Shrinking the IO die to 7nm may make it possible to have some L4, but probably not enough to really help many workloads.)


> L3 is per-CCX (4 cores). If you have 1 thread with a working set larger than the 16MB L3 slice

Still, 4MB per core is a lot more than the paltry 1.3MB Intel's 9282 offers.
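
For anyone who wants to check the per-core L3 arithmetic, here is a minimal sketch. The Rome figures (16MB per 4-core CCX) come from the comment above; the 77MB / 56-core figures for the 9282 are my assumption from the public spec sheet, not something stated in this thread.

```python
def l3_per_core_mb(l3_mb, cores):
    """L3 capacity divided evenly across the cores that share it."""
    return l3_mb / cores

# Rome: one CCX = 4 cores sharing a 16 MB L3 slice (from the thread).
rome_ccx = l3_per_core_mb(16, 4)

# Xeon Platinum 9282: assumed 77 MB L3 across 56 cores (spec-sheet assumption).
xeon_9282 = l3_per_core_mb(77, 56)

print(f"Rome CCX:  {rome_ccx} MB/core")
print(f"Xeon 9282: {xeon_9282} MB/core")
```

Note this is only capacity per core. A single Rome thread can use its whole 16MB CCX slice but nothing beyond it, as explained above.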


That’s an incredibly useful table! Do you know where the data in that table came from?



It's more complicated than that. There are still die-local memory controllers, but the penalty for remote access is vastly lower than earlier Epyc models — so much so that you plausibly could run your workload with naive UMA memory access and be just fine. AMD's ad copy says it's UMA, but really that's just marketing spin on improved remote memory perf.


FWIW, the latest Xeons (Cascade Lake) have the option of two NUMA nodes per socket available in the BIOS.


You can configure it in different ways in the BIOS, but the physical reality remains that it is NUMA and some accesses take longer than others.


You're either talking about cache latency, or still talking about first-gen EPYC/Threadripper rather than the current generation codenamed Rome. On a cache miss, all chiplets on a single-socket Rome system have roughly equal latency for a DRAM fetch, regardless of which physical address is being fetched. Any differences are insignificant compared to inter-socket memory access or fetching from a different chiplet's DRAM on first-gen EPYC. And even if you wanted to treat each chiplet as a separate NUMA node, 4 isn't the right number for Rome.


"And even if you wanted to treat each chiplet as a separate NUMA node, 4 isn't the right number for Rome."

You can configure Rome systems with 1, 2, or 4 NUMA domains per socket (NPS1, NPS2, or NPS4, where NPS == "NUMA nodes per socket"). Memory bandwidth is higher if you configure as NPS4, but it exposes different latencies to memory based on its location.

It's really impressive that you can get uniform latency to memory for 64 cores on the 7702 chips (when configured as NPS1).

https://www.dell.com/support/article/en-us/sln319015/amd-rom...
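
If you want to see which NPS setting your OS actually ended up with, a rough sketch for Linux is to count the `nodeN` directories under `/sys/devices/system/node`. The function name and the `sysfs_root` parameter are mine, for illustration only; on a single-socket Rome box you would expect 1, 2, or 4 entries depending on the BIOS setting.

```python
import os
import re

def numa_node_ids(sysfs_root="/sys/devices/system/node"):
    """Return the sorted NUMA node ids found under sysfs_root.

    Returns an empty list if the directory is missing (non-Linux
    systems, or kernels without NUMA support).
    """
    try:
        entries = os.listdir(sysfs_root)
    except OSError:
        return []
    ids = []
    for entry in entries:
        match = re.fullmatch(r"node(\d+)", entry)
        if match:
            ids.append(int(match.group(1)))
    return sorted(ids)

nodes = numa_node_ids()
print(f"{len(nodes)} NUMA node(s) visible to the OS: {nodes}")
```

This only tells you what the firmware enumerated, not what the hardware latencies really are.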


The underlying hardware reality is that the IO die is organized into quadrants instead of being a full crossbar switch between 8 CCXs and an 8-channel DRAM controller. Whether to enumerate it as 1, 2 or 4 NUMA domains per socket depends very much on what kind of software you plan to run.

Saying that memory bandwidth is higher when configured as NPS4 probably isn't universally true, because that setting will constrain the bandwidth a single core can use to what is effectively just dual-channel. For a benchmark with the appropriate thread count and sufficiently low core-to-core communication, NPS4 probably makes it easiest to maximize aggregate memory bandwidth utilization (this seems to be what Dell's STREAM Triad results show, with NPS4 and 1 thread per CCX as optimal settings for that benchmark).
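
To put rough numbers on the "effectively dual-channel" point, here is a back-of-the-envelope sketch. It assumes DDR4-3200 on 64-bit channels (my assumption, not stated in the thread) and computes theoretical peaks only; real STREAM numbers will be lower.

```python
TOTAL_CHANNELS = 8       # 8-channel DRAM controller per socket (from the thread)
MEGATRANSFERS_PER_S = 3200  # assumption: DDR4-3200
BYTES_PER_TRANSFER = 8   # assumption: 64-bit wide channel

def channels_per_domain(nps):
    """Memory channels interleaved into each NUMA domain for a given NPS mode."""
    if nps not in (1, 2, 4):
        raise ValueError("Rome exposes NPS1, NPS2 or NPS4")
    return TOTAL_CHANNELS // nps

def peak_gb_per_s(channels):
    """Theoretical peak bandwidth in GB/s for the given number of channels."""
    return channels * MEGATRANSFERS_PER_S * BYTES_PER_TRANSFER / 1000

for nps in (1, 2, 4):
    ch = channels_per_domain(nps)
    print(f"NPS{nps}: {ch} channels per domain, "
          f"{peak_gb_per_s(ch):.1f} GB/s local peak, "
          f"{peak_gb_per_s(TOTAL_CHANNELS):.1f} GB/s socket peak")
```

The socket-wide peak is the same in every mode; what changes is how many channels sit behind the memory local to one domain.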


I was responding to your claim that "And even if you wanted to treat each chiplet as a separate NUMA node, 4 isn't the right number for Rome", which was incorrect. 4 is one of the three possible options for the number of NUMA domains.


How does 4 nodes let you treat each of the 8 chiplets as a separate NUMA node?


Your comments about Rome are completely incorrect. There are four main memory controllers in this architecture and some of them are further from some CCDs than others. In the worst case, accessing the furthest-away controller adds 25ns to main memory latency.

You can put this part in "NPS1" mode which interleaves all channels into an apparently uniform memory region, however it is still the case that 1/4 of memory takes an extra 25ns to access and 1/2 of it takes an extra 10ns, compared to the remainder. Putting the part into NPS1 mode just zeroes out the SRAT tables so the OS isn't aware of the difference.

But don't take it from me. AMD's developer docs clearly state, and I am quoting, "The EPYC 7002 Series processors use a Non-Uniform Memory Access (NUMA) Micro-architecture."
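
Taking the NPS1 figures in this comment at face value (they are the commenter's numbers, not something I measured), the average penalty for an access spread uniformly over interleaved memory works out as a simple weighted sum. The uniform-access assumption is mine.

```python
# (fraction of memory, extra latency in ns) as claimed in the comment above.
claimed_penalties = [
    (0.25, 25.0),  # furthest quadrant
    (0.50, 10.0),  # intermediate quadrants
    (0.25, 0.0),   # nearest quadrant, the baseline
]

def expected_extra_ns(penalties):
    """Weighted mean of extra latency, assuming uniformly distributed accesses."""
    total_share = sum(share for share, _ in penalties)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("memory shares must sum to 1")
    return sum(share * extra for share, extra in penalties)

average_extra = expected_extra_ns(claimed_penalties)
print(f"average extra latency under NPS1: {average_extra} ns")
```

So under those claims a NUMA-unaware workload pays a bit over 11ns on average, versus 0 to 25ns per access if placement were controlled.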


> AMD's developer docs clearly state, and I am quoting,

Please quote something that's unambiguously supporting your claims. What you've quoted is insufficient.

What I said about a single-socket Rome processor is not "completely incorrect" under any reasonable interpretation. The latency and bandwidth penalties for moving data from one side of the IO die to the other are much smaller than those of the inter-socket connections that were traditionally implied by NUMA, or of the inter-chiplet communication in first-gen EPYC/Threadripper.

If you want to insist that NUMA apply to even the slightest measurable memory performance asymmetry between cores, please say so, so that we may know ahead of time whether the discussion is also going to lead to esoteric details like the ring vs mesh interconnects on Intel's processors.


If you're not sensitive to main memory latency, just say that. Don't try to tell me that 25ns is not relevant. It's ~100 CPU cycles and it's also about a 25% swing from fastest to slowest.
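
The cycle count depends on the clock you assume, so here is a small sketch of the conversion. The 3.0 and 4.0 GHz clocks are illustrative assumptions, and the 100ns baseline is just what a "25% swing" from 25ns would imply, not a measured value.

```python
def ns_to_cycles(latency_ns, clock_ghz):
    """Core clock cycles elapsed during latency_ns at clock_ghz (GHz = cycles/ns)."""
    return latency_ns * clock_ghz

def implied_baseline_ns(extra_ns, swing_fraction):
    """Fastest-case latency implied by an extra latency and a relative swing."""
    return extra_ns / swing_fraction

for ghz in (3.0, 4.0):  # assumed clocks, for illustration only
    print(f"25 ns at {ghz} GHz = {ns_to_cycles(25, ghz):.0f} cycles")

print(f"implied fastest-case latency: {implied_baseline_ns(25, 0.25):.0f} ns")
```

So "~100 cycles" holds at roughly 4 GHz; at 3 GHz it is closer to 75.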


Intel's server/workstation CPUs have had 2 memory controllers during the last several generations, so even if the whole CPU is seen as a single NUMA node by the software, the actual memory access latency has always been different from core to core, depending on the core position on the intercommunication mesh or ring.

So what?

The initial posting was about whether the CPU is seen by the software as a single NUMA node or as multiple NUMA nodes, not about having an equal memory access latency for all cores, which has not been true for any server/workstation CPU, from any vendor, for many, many years.


You are referring to the last-generation chips...


It would be for me, but it's a single-socket-only processor. I like the 7302 specifically for the non-P variant. If I was going to stick to just one socket, I'd probably spend a bit more and go with the entry-level Threadripper 3960X...

It's a nice looking processor though and probably the only one worth a damn in that lineup.


I haven't seen a single review of the 6208U/6209U/6210U or anywhere that has them in stock; they might as well not exist.


Launch-day reviews are pretty uncommon for server processors, especially mid-cycle refreshes that don't introduce any fundamentally new tech. And retail stock the same week as the announcement is also not how this market segment usually works.



