
Also sort of interested in this comment. It can be difficult to make ECC useful. There's chipkill vs SECDED, for starters. On paper, EPYC Rome has chipkill. More important than paper features is integration with the board firmware and the OS kernel ... Linux RAS features are quite useless if the kernel fails to notice the errors. Whether this stuff is well-integrated depends a lot on your vendors.


Occasional 1-bit corrections are far more common than chipkill events, so there is a huge benefit to ECC even without chipkill. In fact, across 1000s of servers, I've never had chipkill give me any benefit. I guess I'm too small to see the effect from chipkill. But yes, I do see 1-bit corrections.


Yeah, I'm not advocating for chipkill. All I was getting at is that the OS has to know how to interpret the machine check syndromes. This has been a problem for me on Skylake-SP with Linux, to name one.


I've always had to go out of my way to find single-bit correction numbers in Linux. I suspect that once you find that, noticing a chipkill event is pretty easy. But I've never seen a chipkill event, despite having a lot of DRAM for a long time.
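
For anyone hunting for those numbers: when a Linux EDAC driver is loaded for the memory controller, corrected and uncorrected error counters are typically exposed under /sys/devices/system/edac/mc/, one mcN directory per controller, with ce_count and ue_count files inside. A minimal sketch that reads them, assuming that sysfs layout is present (the function name read_edac_counts is made up for illustration, and whether the counters get populated at all depends on the platform and firmware, as discussed above):

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int|None, "ue": int|None}} per mcN directory.

    "ce" is the corrected-error count, "ue" the uncorrected-error count.
    The default root is the usual EDAC sysfs location; it is a parameter so
    the function can be pointed at a different tree.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        # No EDAC driver loaded, or the platform does not expose this tree.
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for key, fname in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / fname).read_text().strip())
            except (OSError, ValueError):
                # File missing, unreadable, or not a plain integer.
                entry[key] = None
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    for name, c in read_edac_counts().items():
        print(f"{name}: corrected={c['ce']} uncorrected={c['ue']}")
```

An empty result means the tree was not found, which is not the same as "no errors": it may just mean the kernel is not seeing them.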



