"And even if you wanted to treat each chiplet as a separate NUMA node, 4 isn't t...

wtallis · on March 1, 2020

The underlying hardware reality is that the IO die is organized into quadrants instead of being a full crossbar switch between 8 CCXs and an 8-channel DRAM controller. Whether to enumerate it as 1, 2 or 4 NUMA domains per socket depends very much on what kind of software you plan to run.

Saying that memory bandwidth is higher when configured as NPS4 probably isn't universally true, because that setting will constrain the bandwidth a single core can use to just effectively dual-channel. For a benchmark with the appropriate thread count and sufficiently low core-to-core communication, NPS4 probably makes it easiest to maximize aggregate memory bandwidth utilization (this seems to be what Dell's STREAM Triad results show, with NPS4 and 1 thread per CCX as optimal settings for that benchmark).

therealcamino · on March 1, 2020

I was responding to your claim that "And even if you wanted to treat each chiplet as a separate NUMA node, 4 isn't the right number for Rome", which was incorrect. 4 is one of the three possible options for the number of NUMA domains.

detaro · on March 1, 2020

How does 4 nodes let you treat each of the 8 chiplets as a separate NUMA node?