Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

"And even if you wanted to treat each chiplet as a separate NUMA node, 4 isn't the right number for Rome."

You can configure Rome systems with 1, 2, or 4 NUMA domains per socket (NPS1, NPS2, or NPS4, where NPS == "NUMA per socket".) Memory bandwidth is higher if you configure as NPS4, but it exposes different latencies to memory based on its location.

It's really impressive that you can get uniform latency to memory for 64 cores on the 7702 chips (when configured as NPS1).

https://www.dell.com/support/article/en-us/sln319015/amd-rom...



The underlying hardware reality is that the IO die is organized into quadrants instead of being a full crossbar switch between 8 CCXs and an 8-channel DRAM controller. Whether to enumerate it as 1, 2 or 4 NUMA domains per socket depends very much on what kind of software you plan to run.

Saying that memory bandwidth is higher when configured as NPS4 probably isn't universally true, because that setting will constrain the bandwidth a single core can use to just effectively dual-channel. For a benchmark with the appropriate thread count and sufficiently low core-to-core communication, NPS4 probably makes it easiest to maximize aggregate memory bandwidth utilization (this seems to be what Dell's STREAM Triad results show, with NPS4 and 1 thread per CCX as optimal settings for that benchmark).


I was responding to your claim that "And even if you wanted to treat each chiplet as a separate NUMA node, 4 isn't the right number for Rome", which was incorrect. 4 is one of the three possible options for the number of NUMA domains.


How does 4 nodes let you treat each of the 8 chiplets as a separate NUMA node?




Consider applying for YC's Summer 2026 batch! Applications are open till May 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: