
Ensembles and model-averaging

Another approach to regularization involves creating multiple models (an ensemble) and combining their predictions, for example by model-averaging or some other rule for pooling the individual results. Ensemble techniques have a rich history in machine learning, including bagging, boosting, and random forests. The general idea is that, if you build different models from the training data, each model makes different errors in its predicted values. Where one model predicts too high a value, another may predict too low a value, and when the predictions are averaged, some of these errors cancel out, resulting in a more accurate prediction than would otherwise have been obtained.

The key to ensemble methods is that the different models must show some variability in their predictions. If the predictions from the different models are highly correlated, combining them gains little. If the predictions have low correlations with each other, the average can be far more accurate, because it draws on the strengths of each model while their errors partially cancel. The following code illustrates the point with simulated data and just three models:

## simulated data: y depends non-linearly on x
set.seed(1234)
d <- data.frame(x = rnorm(400))
d$y <- with(d, rnorm(400, 2 + ifelse(x < 0, x + x^2, x + x^2.5), 1))

## split into training and testing halves
d.train <- d[1:200, ]
d.test <- d[201:400, ]

## three different models: linear, quadratic, and piecewise linear in x
m1 <- lm(y ~ x, data = d.train)
m2 <- lm(y ~ I(x^2), data = d.train)
m3 <- lm(y ~ pmax(x, 0) + pmin(x, 0), data = d.train)

## in-sample R2 for each model
cbind(M1 = summary(m1)$r.squared,
      M2 = summary(m2)$r.squared,
      M3 = summary(m3)$r.squared)
       M1  M2   M3
[1,] 0.33 0.6 0.76

We can see that the predictive value of each model, at least in the training data, varies quite a bit. Evaluating the correlations among fitted values in the training data can also help to indicate how much overlap there is among the model predictions:

cor(cbind(M1 = fitted(m1),
          M2 = fitted(m2),
          M3 = fitted(m3)))
     M1   M2   M3
M1 1.00 0.11 0.65
M2 0.11 1.00 0.78
M3 0.65 0.78 1.00

Next, we generate predicted values for the testing data, average them, and correlate the individual and averaged predictions with the observed outcome in the testing data:

## generate predictions and the average prediction
d.test$yhat1 <- predict(m1, newdata = d.test)
d.test$yhat2 <- predict(m2, newdata = d.test)
d.test$yhat3 <- predict(m3, newdata = d.test)
d.test$yhatavg <- rowMeans(d.test[, paste0("yhat", 1:3)])

## correlation in the testing data
cor(d.test)
              x    y  yhat1  yhat2 yhat3 yhatavg
x         1.000 0.44  1.000 -0.098  0.60    0.55
y         0.442 1.00  0.442  0.753  0.87    0.91
yhat1     1.000 0.44  1.000 -0.098  0.60    0.55
yhat2    -0.098 0.75 -0.098  1.000  0.69    0.76
yhat3     0.596 0.87  0.596  0.687  1.00    0.98
yhatavg   0.552 0.91  0.552  0.765  0.98    1.00

From the results, we can see that the average of the three models' predictions correlates more strongly with the observed y (0.91) than the predictions of any individual model (at most 0.87). However, this is not always the case; a single strong model can outperform the average. The first lesson, then, is to check that the models being averaged perform similarly, at least in the training data. The second lesson is that, given models with similar performance, lower correlations between their predictions are desirable, because the average then benefits most from combining them.
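Correlation with the outcome is only one way to score these predictions. If you prefer an explicit error metric, a quick check along the following lines (a sketch, not part of the original output; the rmse helper is defined here just for illustration) computes the test-set root mean squared error for each model and for their average:

## sketch: test-set root mean squared error for each model and the average
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
with(d.test, cbind(M1  = rmse(y, yhat1),
                   M2  = rmse(y, yhat2),
                   M3  = rmse(y, yhat3),
                   Avg = rmse(y, yhatavg)))

If the three models are roughly comparable and not too strongly correlated, the Avg column should show the lowest error, mirroring the correlation results above.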

Other ensemble methods are built into machine learning algorithms themselves, for example, bagging and boosting. Bagging (bootstrap aggregating) is used in random forests: many models are trained, each on a different random sample of the data. The individual models are deliberately kept small and incomplete, and by averaging the predictions of many such undertrained models, each of which sees only a portion of the data, we should obtain a more powerful overall model. An example of boosting is gradient-boosted models (GBMs), which also combine multiple models, but here each successive model focuses on the instances that the previous models predicted poorly. Both random forests and GBMs have proven very successful with structured data because they reduce variance, that is, they help avoid overfitting the data.
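To make this concrete on the same simulated data, the following sketch fits a bagged ensemble (a random forest) and a boosted ensemble (a GBM); it assumes the randomForest and gbm packages are installed, neither of which is used in the original example, and the tuning values are arbitrary:

## sketch only: bagging (random forest) and boosting (GBM) on the same data
library(randomForest)
library(gbm)

## bagging: each tree is grown on a bootstrap sample of the training data
rf <- randomForest(y ~ x, data = d.train, ntree = 500)

## boosting: each new tree focuses on what the earlier trees got wrong
gb <- gbm(y ~ x, data = d.train, distribution = "gaussian",
          n.trees = 1000, interaction.depth = 2, shrinkage = 0.01)

## test-set correlations with the observed outcome
cor(d.test$y, predict(rf, newdata = d.test))
cor(d.test$y, predict(gb, newdata = d.test, n.trees = 1000))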

Bagging and model-averaging are used less frequently with deep neural networks because the computational cost of training each model can be quite high, so repeating the process many times becomes prohibitively expensive in time and compute resources. Nevertheless, it is still possible to use model averaging with deep neural networks, even if only over a handful of models rather than the hundreds that are common in random forests and some other approaches.
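As a small illustration of averaging a handful of networks, the following sketch uses the nnet package (not part of the original example) to fit five single-hidden-layer networks that differ only in their random starting weights, and then averages their test-set predictions; the size, maxit, and seed values here are arbitrary choices:

## sketch: average the predictions of a handful of small neural networks
library(nnet)

set.seed(42)
fits <- lapply(1:5, function(i) {
  ## each fit starts from different random weights, so predictions vary
  nnet(y ~ x, data = d.train, size = 5, linout = TRUE,
       maxit = 500, trace = FALSE)
})

## collect the five sets of test-set predictions and average them
preds <- sapply(fits, predict, newdata = d.test)
cor(d.test$y, rowMeans(preds))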