> Yeah so MoE doesn't really come into play for production serving -- once you are batching your requests you hit every expert at a large enough batch size
In their V3 paper DeepSeek talk about having redundant copies of some "experts" when deploying with expert parallelism in order to account for the different amounts of load they get. I imagine it only makes a difference at very high loads, but I thought it was a pretty interesting technique.
In their V3 paper DeepSeek talk about having redundant copies of some "experts" when deploying with expert parallelism in order to account for the different amounts of load they get. I imagine it only makes a difference at very high loads, but I thought it was a pretty interesting technique.