Yeah, I’ve often wondered why folks aren’t training two-tier MoEs for VRAM + RAM. We already have designs for shared experts, so it can’t be hard to implement a router that routes to the “core” experts 10x or 100x as often as to the “nice to have” experts. I suppose balancing during training is tricky, but some sort of custom loss on the router layers should work.
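Just to sketch the kind of custom router loss I have in mind (this is entirely my own made-up illustration, not any particular paper's load-balancing loss; the names and the 10x split are arbitrary):

    import torch
    import torch.nn.functional as F

    def tiered_balance_loss(router_logits, num_core=8, core_weight=10.0):
        # router_logits: [num_tokens, num_experts]; experts 0..num_core-1 are
        # the "core" tier we want to keep hot in VRAM.
        probs = F.softmax(router_logits, dim=-1)
        mean_probs = probs.mean(dim=0)              # average routing mass per expert

        # Target distribution: each core expert should see ~core_weight x the
        # traffic of each long-tail expert.
        target = torch.ones_like(mean_probs)
        target[:num_core] = core_weight
        target = target / target.sum()

        # KL(mean routing || target) goes to zero when traffic matches the split.
        return torch.sum(mean_probs * (mean_probs.clamp_min(1e-9).log() - target.log()))

You'd add this to the task loss with a small coefficient, the same way the usual load-balancing losses are applied.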
I’ve also wondered why the routers aren’t trained to be serially consistent, so you can predict layers to swap into VRAM a few layers ahead and make full use of the available bandwidth.
I think part of the issue is that in production deployments, batch sizes are high enough that you'll be paging in those long-tail experts constantly.
Unless you're handling that in some kind of fancy way, you'll be holding up the batch while waiting on host memory, which will kill your throughput.
It makes much more sense for non-batched local inference, especially if you can keep the MoE routing stable like you say, but most folks aren't optimising for that.
Ideally, you should rearrange batches so that inference steps that rely on the same experts get batched together, then inferences that would "hold up" a batch simply wait for that one "long tail" expert to be loaded, whereupon they can progress. This might require checkpointing partial inference steps more often, but that ought to be doable.
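Something like this toy scheduler is what I'm picturing (all names are hypothetical; it assumes you can predict or cache which experts a request's next step will route to):

    def schedule_step(pending, resident_experts, max_batch=32):
        # pending: list of (request_id, set of expert ids needed for the next step)
        # resident_experts: set of expert ids currently in VRAM
        batch, deferred = [], []
        for req_id, needed in pending:
            if needed <= resident_experts and len(batch) < max_batch:
                batch.append(req_id)                                  # run now
            else:
                deferred.append((req_id, needed - resident_experts))  # wait for page-in
        # kick off transfers for whatever the deferred steps are missing
        to_load = set().union(*(missing for _, missing in deferred)) if deferred else set()
        return batch, deferred, to_load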
I think this is doable for very long tail experts that get swapped in for specialised topics - say, orbital mechanics.
But for experts that light up at, say, 1% frequency per batch, you're doing an awful lot of transfers from DRAM which you amortize over a single token, instead of reads from HBM which you amortize over 32 tokens.
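Rough numbers to show the gap (all figures are assumptions; plug in your own hardware):

    expert_bytes = 50e6      # ~50 MB of weights for one expert
    batch = 32               # tokens per decode step
    hbm_bw = 2e12            # ~2 TB/s GPU memory bandwidth
    pcie_bw = 25e9           # ~25 GB/s effective host-to-GPU transfer bandwidth

    hbm_per_token = expert_bytes / hbm_bw / batch   # resident expert, read shared by the batch
    dram_per_token = expert_bytes / pcie_bw         # long-tail expert paged in for one token

    print(f"{hbm_per_token*1e6:.2f} us vs {dram_per_token*1e6:.0f} us per token")
    # roughly 0.8 us vs 2000 us: about three orders of magnitude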
I think your analysis is right: this would make sense mostly for the 30B-3A style models aimed at edge / hobbyist use, where context length is precious, so nobody is batching.
Given that experts live per layer, I don’t think it makes sense to have orbital mechanics experts, but … I have wondered about swapping out the bottom 10% of layers per topic, given that that is likely where the highest-order concepts live. I’ve always wondered why people bother with LoRA on all layers, given that the early layers are more likely to be topic-agnostic and focused on more basic pattern assembly (see the recent papers on how LLMs count on a manifold).
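For what it's worth, restricting the adapters to the later blocks is easy to sketch; this is a hypothetical minimal LoRA wrapper, not any particular library's API, and the module path in the last function is made up:

    import torch.nn as nn

    class LoRALinear(nn.Module):
        # frozen base Linear plus a trainable low-rank update
        def __init__(self, base, rank=8, alpha=16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False
            self.A = nn.Linear(base.in_features, rank, bias=False)
            self.B = nn.Linear(rank, base.out_features, bias=False)
            nn.init.zeros_(self.B.weight)       # adapter starts as a no-op
            self.scale = alpha / rank

        def forward(self, x):
            return self.base(x) + self.scale * self.B(self.A(x))

    def add_lora_to_late_layers(blocks, frac=0.25, rank=8):
        # only adapt the last `frac` of blocks; early, more topic-agnostic
        # layers stay untouched (blocks[i].attn.q_proj is a hypothetical path)
        start = int(len(blocks) * (1 - frac))
        for block in blocks[start:]:
            block.attn.q_proj = LoRALinear(block.attn.q_proj, rank=rank)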
1) This is basically the intention of several recent MoE models: keep a handful of generally useful experts hot in VRAM.
2) Unless you can swap layers in faster than you consume them, there is no point in predicting layers (what does that even really mean? did you mean predicting experts?).
It seems that, at the moment, the best you can do is keep the experts and layers most likely to be used for a given query in VRAM and offload the rest, but that is workload-dependent.
So llama.cpp currently puts overflow MoE experts statically in RAM and runs them on the CPU, so you get a mix of GPU + CPU inference. You are rooflined by RAM->CPU bandwidth + CPU compute.
With good predictability of MoE routing, you might see a world where it's more efficient to spend PCIe bandwidth (slower than RAM->CPU) on loading the MoE experts for the next ~3 layers from RAM into VRAM, so you are not rooflined by CPU compute.
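A rough sketch of that overlap in PyTorch (names are made up; it assumes a CUDA device and that the host-side expert weights sit in pinned memory so the copies are actually asynchronous):

    import torch

    copy_stream = torch.cuda.Stream()

    def prefetch_experts(cpu_experts, vram_slots, predicted_ids):
        # cpu_experts: expert_id -> pinned host tensor
        # vram_slots: preallocated GPU tensors to copy into
        # predicted_ids: experts we expect the layer ~3 steps ahead to need
        with torch.cuda.stream(copy_stream):
            for slot, eid in zip(vram_slots, predicted_ids):
                slot.copy_(cpu_experts[eid], non_blocking=True)   # async H2D over PCIe

    # the current layers keep computing on the default stream in the meantime;
    # just before the prefetched layer runs:
    # torch.cuda.current_stream().wait_stream(copy_stream)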
vLLM / SGLang (AFAIK) just assume you have enough VRAM to fit all the experts (but will page KV cache to RAM).
Most of the work I'm aware of starts from the perspective of optimizing inference, but the idea of pushing the lessons upstream into training gets mentioned here and there.
Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models (https://arxiv.org/abs/2505.16056)
I can imagine a couple of scenarios in which a large, high-quality model would be much preferred over lower-latency models, primarily when you need the quality.
Sometimes you may need to deal with odd proprietary processors that models have only limited and flawed knowledge about.
For example, in the 37c3 talk "Breaking "DRM" in Polish trains" by the folks from Dragon Sector (I highly recommend watching it), they needed to reverse-engineer Tricore binaries, but they found that the Ghidra implementation had bugs.
As for the PLCs, the IEC 61131-3 function block diagrams transpile to C code, which is then compiled to Tricore binaries by an obscure GCC fork. Not saying that anyone would want to write Rust code for PLCs, but this is not uncommon in the world of embedded.
If I had to wager a guess, bootc might get more actual use now that it's supported in RHEL 9.6 and 10 as "image mode". It's an exciting piece of technology, especially from the perspective of a platform engineer.
Also, bootc is a basis for the Universal Blue family of distros, especially Bazzite, which is very popular with gamers.
Holy hell, this is such a depressing, dehumanizing, out-of-touch article, especially regarding its fascination with AI in what is, admit it or not, a field of art. The article even makes mistakes in the most basic factual areas, such as calling light novels "visual novels." It's deeply saddening.
For a more positive outlook on the animation industry, check out "Keep Your Hands Off Eizouken!"—an incredibly creative love letter to the joy and passion of creating art through the lens of animation.
Hey thanks for this recommendation! I'm a few episodes in and loving it. Your description is spot on.
One thing that is striking to me is just how much budget Japanese high schools have for clubs and stuff. Meanwhile in America we have teachers buying basic school supplies from their meager wages...
While YMMV, some smaller companies and manufacturers offer boards based on ARM or RISC-V cores. Consider checking out the MNT Reform - while a bit impractical, it's likely one of the computers that comes closest to meeting the openness criteria.
Modern ARM designs destined for general-purpose computing are also coming with management-engine-like features, just like Intel and AMD, plus other obscure silicon features and proprietary FW blobs, even more so than Intel and AMD: DSPs, ISPs, etc.
If you want non-obscure HW, you'll have to go back to the pre-microcode days of processors or roll your own.
The article doesn't seem to mention AWS, really. I also feel like the primary issue is the lack of communication and support, even for a large corporate partner.
Seems like they're moving to bare-metal, which has an obvious benefit of being able to tell your on-call engineer to fix the issue or die trying.
But in this case the answer from AWS would have been that this is within their SLA, and you just need to be ready to handle an instance getting messed up from time to time, because it's guaranteed to happen.