A Mixture of Experts (MoE) is a machine learning technique introduced in the early 1990s by Robert A. Jacobs, Michael I. Jordan, and colleagues. It is a form of ensemble learning built from multiple learning components, termed “experts,” each of which makes predictions or decisions about a subset of the data. The central idea is a division of labour: each expert specializes in a particular region or aspect of the data, so the combined model can be more accurate than a single model asked to cover everything.
In an MoE model, the allocation of incoming data to experts is typically handled by a “gating” network. The gating network receives the same input as the experts and outputs one probability per expert, indicating how relevant that expert is for the given input. The final output is a weighted sum of the expert outputs, with the gating probabilities as the weights.
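The weighted sum described above can be illustrated with a minimal sketch. It assumes two linear experts and a linear gating layer followed by a softmax; the function names (`softmax`, `dot`, `moe_predict`) and the hand-picked weights are hypothetical and chosen only for illustration, not taken from any particular library or paper.

```python
import math


def softmax(scores):
    """Turn raw gate scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(w, x):
    """Dot product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def moe_predict(x, expert_weights, gate_weights):
    """Return (output, gate_probs) for one input vector x.

    Assumption: each expert is a linear model y = w . x, and the gate
    is a linear layer followed by a softmax over the experts.
    """
    gate_probs = softmax([dot(g, x) for g in gate_weights])
    expert_outputs = [dot(w, x) for w in expert_weights]
    # Final output: expert outputs weighted by the gating probabilities.
    output = sum(p * y for p, y in zip(gate_probs, expert_outputs))
    return output, gate_probs


# Toy parameters (hypothetical): expert 0 reads feature 0, expert 1 reads feature 1,
# and the gate favours whichever feature is larger.
expert_weights = [[1.0, 0.0], [0.0, 1.0]]
gate_weights = [[2.0, 0.0], [0.0, 2.0]]

output, gate_probs = moe_predict([3.0, 1.0], expert_weights, gate_weights)
print("gate probabilities:", gate_probs)
print("mixture output:", output)
```

For the input `[3.0, 1.0]` the gate assigns most of the weight to the first expert, so the mixture output lies close to that expert's prediction. In a trained model the expert and gate weights would be learned rather than set by hand.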
This architecture is particularly well suited to complex, high-dimensional data. Because of the division of labour, each expert only needs to learn the characteristics of a smaller region of the data, ideally the region where it performs best, and it does not need to perform well outside that region. Training MoE models can be harder than training a single model, because the experts and the gating network must be optimized jointly: which inputs an expert is trained on depends on the gate, and what the gate should learn depends on how well each expert already performs. Despite this, Mixtures of Experts remain a valuable technique for diverse and complex tasks.