Mixtures and Hierarchical Mixtures of Experts:

A Theoretical Overview




 

Both the mixtures of experts (ME) model, introduced by Jacobs, Jordan, Nowlan and Hinton (1991), and the hierarchical mixtures of experts (HME) model,  introduced by Jordan and Jacobs (1994), provide important paradigms for learning from data, and are of mutual interest to researchers in neural networks, in machine learning and in statistics.   The fundamental problem is to learn a mapping in which the structure of the mapping varies for different regions of the input space.  The ME and HME approach is to combine simple models to become more complex and more powerful ones. In mixtures of experts, the `experts' are typically regression models that are relatively simple, which are combined by a set of local mixing weights called the `gating functions' which can depend on the input or predictor.  In hierarchical mixtures of experts, this mixing / combining process is done recursively. Consequently, the HME model has a tree-structure and can summarize the data at multiple scales of resolution due to its use of nested input regions. Learning with ME and HME requires one to discover how to assign experts to the various regions and how to train the experts to adapt to their assigned task. Commonly used learning methods include least squares,  maximum likelihood and Bayesian computation.

ME and HME are conditional mixtures of regression models --- they may be regarded as generalizations to the standard mixture models (Titterington, Smith and Makov 1985) which mix density functions with constant probabilities.  Two major advantages of ME and HME are:

(i) They are very flexible --- many different types of models can be mixed to become more powerful: classifiers, regression models, time series models ... ...

(ii) Like ordinary mixture models, ME and HME naturally accommodate convenient computing algorithms [such as the EM (Expectation-Maximization; Jordan and Jacobs 1994) and the Gibbs sampler (Peng, Jacobs and Tanner 1996)], which can decompose a complex numerical problem into many simpler ones.

References.



Examples of Applications.

Both ME and HME have been empirically shown to be powerful and general frameworks for examining relationships among variables in a variety of settings. Some examples of such applications include:

References.



Despite the fact that ME and HME have had wide applications and have been incorporated into neural network textbooks [e.g., Bishop (1995) and Haykin (1994) which features an HME design on the cover], there has been very little formal statistical justification of the methodology  until very recently. The following is an overview of some recent theoretical developments on ME and HME.  A more complete online resource is maintained by S. Waterhouse, who provides  a bibliography and computing software.

Most theoretical results are obtained for a class of hierarchical mixtures-of-experts (HME) models where exponential family regression models with generalized linear mean functions of the form y(a+xtb) are mixed. Here y is the inverse link function. Suppose the true response y follows an exponential family regression model with mean function belonging to a class of smooth functions of the form y (h(x)) where h is a smooth function from a Sobolev class.

The theoretical results below, roughly speaking, show many desirable features of ME and HME. For example:

1. There are convenient algorithms for learning a probabilistic relationship using ME and HME, and the algorithms converge to choose the optimal model in some sense (e. g., maximizing the likelihood function). (See the discussion on computing algorithms.)

2. The ME and HME are `rich' enough to include models that can approximate a wide class of smooth relationships with arbitrary accuracy. (These are related to the results on approximation rates.)

3. Given that we train an ME or HME network using noisy data, the inferences and predictions based on this system are valid, in the sense that the fitted relationship will be close to the actual one given sufficient amount of data, and a probability-based precision assessment is available. (These are related to the results on consistency and asymptotic normality.)
 

References.



Computing Algorithms.

The ME and HME can be regarded as mixtures of regression models, which inherit some pleasant features of the standard mixture models (Titterington, Smith and Makov 1985), namely, there are convenient computation algorithms associated with them. The EM (expectation-maximization) algorithm for maximizing the likelihood function was introduced in the original paper of Jordan and Jacobs (1994) and is still one of the most commonly used methods. Its convergence property is studied by Jordan and Xu (1995), who show a linear convergence rate. Bayesian algorithms such as the Gibbs sampler and the Metropolis algorithm (see, e.g., Tanner 1996) are applied in Peng et al. (1996).

Rosen, Jiang and Tanner (2000) generalize the EM algorithm to the ES (expectation-solution) algorithm for situations that are common in biostatistics, where no joint probability model or likelihood function is assumed for the data, and a set of estimating equations are solved to perform valid inference. (They also consider the convergence properties of the ES algorithm in the appendix.)  This ES algorithm was shown to be useful for the estimating equation approach with incomplete data, and applied to real data problems with normal-type and Poisson-type responses.

References.



Approximation Rate of the Mean Functions.

It is demonstrated by Zeevi, Meir and Maiorov (1998) that one-layer mixtures of linear model experts can be used to approximate a class of smooth functions as the number of experts increases, and the least-squares method can be used to estimate the mean response consistently when the sample size increases.

Using a different method, Jiang and Tanner (1999a) generalize the results of Zeevi et al. to HME with more one one layers, when nonlinear (generalized linear) regression models are mixed, where they show that the HME mean functions can approximate the true mean function, at a rate of O(m-2/s) in Lp norm, where m is the number of experts, and s is the dimension of the predictor x.

References.



Approximation Rate of the Density Functions and Consistency.

It is also shown that the HME probability density functions can approximate the data generating density  (Jiang and Tanner 1999b), at a rate O(m-2/s) in Hellinger distance, and at a rate of O(m-4/s) in Kullback-Leibler divergence, under a regularity condition, where m is the number of experts, and s is the dimension of the predictor x.  The regularity condition essentially requires the mixing weights to be able to approximate a fine partition of the predictor space. They also provide an extra condition under which the mean-square error of the predicted mean response obtained from the maximum likelihood method converges to zero, as the sample size and the number of experts both increase. This extra condition requires the searching scope of the maximum likelihood to be large enough to include an HME model that is closest to the data generating probability density.

References.



Asymptotic Normality under Correct Model Specification.

Jiang and Tanner (2000) provide regularity conditions on the experts and on the gating functions under which the maximum likelihood method under the correct model specification produces a consistent and asymptotically normal estimator of the mean response. The regularity conditions are validated for Poisson, gamma, normal and binomial experts.

References.



Identifiability of ME.

It is shown that the  group of invariant transformations of the ME probability density functions is a direct product of the group of permutations of the expert labels and the group of translations of the parameters in the gating functions (Jiang and Tanner 1999c). The ME systems are identifiable, therefore, if the experts are ordered and the gating parameters are initialized. The conditions under which these results hold are validated for Poisson, gamma, normal and binomial experts.

References.



VC Dimension of ME.

The Vapnik-Chervonenkis (VC) dimension is a central concept in risk minimization (Vapnik 1992) and PAC (probably approximately correct) learning (Valiant 1984), and is useful for planning for the sample size, and estimating the computational efficiency. The role of the VC dimension in computational learning (see Anthony and Biggs 1992) is similar to that of the degree of freedom of a model in statistics. However, the VC dimension is typically much harder to estimate. One can consider the mixtures-of-experts (ME) methodology as a tool of classification,  where experts of logistic regression models or Bernoulli models are mixed. It is shown that the VC dimension of the ME architecture is bounded below by m+s-1, and is bounded above by O(m4 s2), where s is the dimension of the input and m is the number of experts (Jiang 2000) . For mixtures of Bernoulli experts with a scalar input, they show that the lower bound m is attained, in which case one obtains the exact result that the VC dimension is equal to the number of experts.

References.



Selecting the Number of Experts.

Selecting umber of experts in mixture models is a well-known complicated problem (Titterington et al. 1985). There, however, have been various attempts for ME and HME model selection. Jacobs, Tanner and Peng (1996) consider a Bayesian method based on the posterior probabilities assigned to the expert components and define the worth index for model selection. Waterhouse and Robinson (1996) apply a likelihood splitting criteria to each expert in the HME and grow the tree adaptively. Fritsch, Finke and Waibel (1997) also consider automatic growing and pruning of HME which enable large hierarchies consisting of several hundred experts to be trained effectively.

References.



Extensions and Other Related Methods.

There are extensions or analogs of ME and HME which also utilize the idea of local mixture or `divide and conquer'. For example, the method of `local experts combination' by Rida, Labbi and Pellegrini (1999);  the method of gated experts by Weigend, Mangeas and Srivastava (1995) which incorporates neural networks to form `nonparametric' gating functions and is applied to time series data; the hidden Markov chain versions of ME and HME discussed by Jordan, Ghahramani, and Saul (1996) and Shi and Weigend (1997).

An important methodology that is related to ME and HME is the boosting algorithms (Schapire 1990, Freund and Schapire 1997), which also combines simple learners to be more complex and more powerful one; but this combination is realized by linear combination rather than probabilistic mixing. There are important applications of this methodology (see the boosting homepage maintained by Schapire). There have also been efforts of combining the two approaches of ME / HME and boosting, by Waterhouse and Cook (1997), Avnimelech and Intrator (1999), and Moerland and Mayoraz (1999), for example.
 



References.

(last updated on 1/31/2000.)

(back to the referring page)