Hosting a 7B model is completely different from hosting a 150B+ model. I thought this would be obvious, but I should have been explicit.


It's not, really. And 8x7B is not a 7B model; it's a MoE with roughly 47B total parameters that all have to be kept in memory, and it routes each token through 2 experts, so it runs at roughly 13B-model speeds.
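The top-2 routing described above can be sketched in a few lines of numpy. This is a toy illustration, not Mixtral's actual implementation: the sizes are tiny and each "expert" is a single random linear layer standing in for a full FFN block. The key point it demonstrates is that all 8 experts sit in memory, but each token only does the compute of 2.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2  # toy sizes, hypothetical

# Router: a linear layer that scores each expert per token.
router_w = rng.standard_normal((d_model, n_experts))
# Experts: single linear layers standing in for real FFN blocks.
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x):
    """Top-2 mixture-of-experts: every expert is resident in memory,
    but each token is processed by only top_k of them."""
    logits = x @ router_w                             # (tokens, n_experts)
    top2 = np.argsort(logits, axis=-1)[:, -top_k:]    # 2 highest-scoring experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, top2[t]]
        gates = np.exp(sel - sel.max())
        gates /= gates.sum()                          # softmax over chosen experts
        for gate, e in zip(gates, top2[t]):
            out[t] += gate * (x[t] @ experts[e])      # compute only 2 of 8 experts
    return out

tokens = rng.standard_normal((4, d_model))
y = moe_layer(tokens)
print(y.shape)  # (4, 16)
```

Memory cost scales with all 8 expert weight matrices, while per-token FLOPs scale with only the 2 selected ones, which is why an 8x7B MoE needs far more memory than a dense 7B model yet runs much faster than a dense model of its full size.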

All of the current frameworks support MoE and sharding across GPUs, so I don't see what the issue is.