From this point of view, I don't understand the gap between the actual SOTA models in practice and the academic ones. The former have all been MoEs at this point, starting with GPT-4. But the open models, apart from DeepSeek V3 and Mixtral, are almost always dense.
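Just to make the dense-vs-MoE distinction concrete, here is a minimal sketch (not the implementation of any of the models mentioned above; the names `DenseFFN`, `MoEFFN`, and the expert/top-k counts are purely illustrative): a dense block runs every token through all of its parameters, while an MoE block routes each token to only a few experts, so the active parameter count per token is a fraction of the total.

```python
# Minimal illustrative sketch: dense FFN vs top-k routed MoE FFN.
# Hyperparameters and class names are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseFFN(nn.Module):
    """Standard transformer feed-forward block: every token uses all parameters."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))

class MoEFFN(nn.Module):
    """Sparse MoE block: a router sends each token to top_k of num_experts FFNs,
    so only a fraction of the parameters are active per token."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(DenseFFN(d_model, d_ff) for _ in range(num_experts))
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):
        # x: (tokens, d_model); flatten batch/sequence dims before calling
        logits = self.router(x)                         # (tokens, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)  # per-token expert choices
        weights = F.softmax(weights, dim=-1)            # normalize the chosen gates
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                   # tokens routed to expert e at slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 512)
print(DenseFFN(512, 2048)(tokens).shape, MoEFFN(512, 2048)(tokens).shape)
```

With 8 experts and top-2 routing, the MoE layer holds roughly 8x the FFN parameters of the dense layer but only runs about 2 experts' worth of compute per token, which is the trade-off the big closed models are apparently exploiting.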