Impact of correlated predictions on the variance of an ensemble model

Let X_1 and X_2 be the prediction errors of two statistical/machine learning algorithms. X_1 and X_2 have relatively low bias, and high variances \sigma^2_1 and \sigma^2_2. They are also correlated, having a Pearson correlation coefficient equal to \rho_{1, 2}.

Aggregating models 1 and 2 might result in a predictive model with lower prediction error variance than 1 and 2. But not all the times. For those who attended statistics/probability/portfolio optimization classes, this may probably sound obvious; you can directly jump to the illustrative R part, below.

Let Z := \alpha X_1 + (1-\alpha) X_2, with \alpha \in \left[0, 1\right], be the prediction error of the ensemble model built with 1 and 2. We have :

Var(Z) = \alpha^2 \sigma^2_1 + (1 - \alpha)^2\sigma^2_2 + 2\alpha(1-\alpha)Cov(X_1, X_2)

And from the fact that :

Cov(X_1, X_2) = \rho_{1, 2} \sigma_1\sigma_2

We get :

Var(Z) = \alpha^2 \sigma^2_1 + (1 - \alpha)^2\sigma^2_2 + 2\alpha(1-\alpha) \rho_{1, 2} \sigma_1\sigma_2

Now, let’s see how Var(Z) changes with an increase of \alpha, the ensemble’s allocation for model 1 :

\frac{\partial Var(Z)}{\partial \alpha}= 2\alpha \sigma^2_1 - 2 (1 - \alpha) \sigma^2_2 + 2(1-2\alpha) \rho_{1, 2} \sigma_1\sigma_2

When \alpha is close to 0, that is, when the ensemble contains almost only model 2, we have :

\frac{\partial Var(Z)}{\partial \alpha}_{|\alpha = 0}= 2 \left( \rho_{1, 2} \sigma_1\sigma_2 - \sigma^2_2 \right) = 2 \sigma^2_2\left( \rho_{1, 2} \frac{\sigma_1}{\sigma_2} - 1 \right)

That’s the relative change in the variance of the ensemble prediction error, induced by introducing model 1 in an ensemble containing almost only 2. Hence, if \rho_{1, 2} = \frac{\sigma_2}{\sigma_1}, increasing the allocation of model 1 won’t increase or decrease the variance of ensemble prediction at all. The variance will decrease if we introduce in the ensemble a model 1, so that \rho_{1, 2} \leq \frac{\sigma_2}{\sigma_1}. If \rho_{1, 2} \geq \frac{\sigma_2}{\sigma_1}, it won’t decrease, no matter the combination of X_1 and X_2.

For a simple illustrative example in R, I create simulated data observations :

  # number of observations
  n <- 100
  u <- 1:n

  # Simulated observed data
  intercept <- 5 
  slope <- 0.2
  set.seed(2)

  ## data
  y <-  intercept + slope*u + rnorm(n)
  plot(u, y, type = 'l', main = "Simulated data observations")
  points(u, y, pch = 19)

graph1

I fit a linear regression model to the data, as a benchmark model :

  # Defining training and test sets
  train <- 1:(0.7*n)
  test <- -train
  u.train <- u[(train)]
  y.train <- y[(train)]
  u.test <- u[(test)]
  y.test <- y[(test)]

  # Fitting the benchmark model to the training set 
  fit.lm <- lm(y.train ~ u.train)

  (slope.coeff <- fit.lm$coefficients[2])
  (intercept.coeff <- fit.lm$coefficients[1])

## u.train
## 0.1925
## (Intercept)
## 5.292

Obtaining the predicted values from the benchmark model, and prediction error

  # Predicted values from benchmark model on the test set
  y.pred <- intercept.coeff + slope.coeff*u.test   
  
  # prediction error from linear regression
  pred.err <- y.pred - y.test
  (mean.pred.err <- mean(pred.err))
  (var.pred.err <- var(pred.err))

## [1] -0.1802
## [1] 1.338

Now, I consider two other models, 1 and 2 :

y = a + b \times u + sin(u) + \epsilon_1

and

y = a + b  \times u - 0.35  \times sin(u) + \epsilon_2

with \epsilon_1 and \epsilon_2 being 2 correlated gaussians with zero mean and constant variances (well, not really “models” that i fit, but these help me to build fictitious various predictions with different correlations for the illustration). The slope and intercept are those obtained from the benchmark model.

 # Alternative model 1 (low bias, high variance, oscillating)
  m <- length(y.pred)
  eps1 <- rnorm(m, mean = 0, sd = 1.5)
  y.pred1 <- intercept.coeff + slope.coeff*u.test + sin(u.test) + eps1

 # prediction error for model 1
  pred.err1 <- y.pred1 - y.test

We can visualize the predictions of model 1, 2 and the benchmark, with different prediction errors correlations between model 1 and 2, to get an intuition of the possible ensemble predictions :

  # Different prediction errors correlations for 1 and 2
  rho.vec <- c(-1, -0.8, 0.6, 1)  

  # Independent random gaussian numbers defining model 2 errors
  eps <- rnorm(m, mean = 0, sd = 2)

  # Plotting the predictions with different correlations
  par(mfrow=c(2, 2))  
  for (i in 1:4)
  {
    rho <- rho.vec[i]

    # Correlated gaussian numbers (Cholesky decomposition)
    eps2 <- rho*eps1 + sqrt(1 - rho^2)*eps

    # Alternative  model 2 (low bias, higher variance than 1, oscillating)
    y.pred2 <- intercept.coeff + slope.coeff*u.test - 0.35*sin(u.test) + eps2    

    # prediction error for model 2
    pred.err2 <- y.pred2 - y.test    

    # predictions from 1 & 2 correlation
    corr.pred12 <- round(cor(pred.err1, pred.err2), 2)

    # Plot
    plot(u.test, y.test, type = "p", 
         xlab = "test set values", ylab = "predicted values",
         main = paste("models 1 & 2 pred. values \n correlation :", 
                      corr.pred12))
    points(u.test, y.test, pch = 19)
    lines(u.test, y.pred, lwd = 2)
    lines(u.test, y.pred1, 
          col = "blue", lwd = 2)
    lines(u.test, y.pred2, 
          col = "red", lwd = 2)
  }

graph2v2

Now including allocations of models 1 and 2, we can see how the ensemble variance evolves as a function of allocation and correlation :

  # Allocation for model 1 in the ensemble
  alpha.vec <- seq(from = 0, to = 1, by = 0.05)  

  # Correlations between model 1 and model 2
  rho.vec <- seq(from = -1, to = 1, by = 0.05)  

  # Results matrices
  nb.row <- length(alpha.vec)
  nb.col <- length(rho.vec)  

  ## Average prediction errors of the ensemble
  mean.pred.err.ens <- matrix(0, nrow = nb.row, ncol = nb.col)
  rownames(mean.pred.err.ens) <- paste0("pct. 1 : ", alpha.vec*100, "%")
  colnames(mean.pred.err.ens) <- paste0("corr(1, 2) : ", rho.vec)

  ## Variance of prediction error of the ensemble
  var.pred.err.ens <- matrix(0, nrow = nb.row, ncol = nb.col)
  rownames(var.pred.err.ens) <- paste0("pct. 1 : ", alpha.vec*100, "%")
  colnames(var.pred.err.ens) <- paste0("corr(1, 2) : ", rho.vec)

  # loop on correlations and allocations 
  for (i in 1:nb.row)
  {
    for (j in 1:nb.col)
    {
      alpha <- alpha.vec[i]
      rho <- rho.vec[j]

      # Alternative model 2 (low bias, higher variance, oscillating)
      eps2 <- rho*eps1 + sqrt(1 - rho^2)*eps
      y.pred2 <- intercept.coeff + slope.coeff*u.test - 0.35*sin(u.test) + eps2
      pred.err2 <- y.pred2 - y.test

      # Ensemble prediction error
      z <- alpha*pred.err1 + (1-alpha)*pred.err2
      mean.pred.err.ens[i, j] <- mean(z)      
      var.pred.err.ens[i, j] <-  var(z)
    }
  }
  res.var <- var.pred.err.ens
 
  # Heat map for the variance of the ensemble
  filled.contour(alpha.vec, rho.vec, res.var, color = terrain.colors, 
                 main = "prediction error variance for the ensemble", 
                 xlab = "allocation in x1", ylab = "correlation between 1 and 2")

graph3v2

Hence, the lower the correlation between 1 and 2, the lower the variance of the ensemble model prediction. This, combined with an allocation of 1 comprised between 35\% and 50\% seem to be the building blocks for the final model. The ensemble models biases helps in making a choice of allocation (this is actually an optimization problem that can be directly solved with portfolio theory).

Finding the final ensemble, with lower variance and lower bias :

# Final ensemble

## Allocation
alpha.vec[which.min(as.vector(res.var))]
## [1] 0.45

## Correlation
res.bias <- abs(mean.pred.err.ens)
which.min(res.bias[which.min(as.vector(res.var)), ])
 ## corr(1, 2) : 
-0.7
 ## 7

Creating the final model with these parameters :

    rho <- -0.7
    eps2 <- rho*eps1 + sqrt(1 - rho^2)*eps

    # Alternative model 2 (low bias, higher variance, oscillating)
    y.pred2 <- intercept.coeff + slope.coeff*u.test - 0.35*sin(u.test) + eps2        

    # Final ensemble prediction
    y.pred.ens <- 0.45*y.pred1 + 0.55*y.pred2
  
    # Plot
    plot(u.test, y.test, type = "p", 
         xlab = "test set", ylab = "predicted values",
         main = "Final ensemble model (green)")
    points(u.test, y.test, pch = 19)
    # benchmark 
    lines(u.test, y.pred, lwd = 2)
    # model 1
    lines(u.test, y.pred1, col = "blue", lwd = 2)
    # model 2
    lines(u.test, y.pred2, col = "red", lwd = 2)
    # ensemble model with 1 and 2
    points(u.test, y.pred.ens, col = "green", pch = 19)
    lines(u.test, y.pred.ens, col = "green", lwd = 2)

graph4v2

Performance of the final model :

    # Benchmark
    pred.err <- y.pred - y.test
    # Model 1 
    pred1.err <- y.pred1 - y.test
    # Model 2
    pred2.err <- y.pred2 - y.test
    # Ensemble model
    pred.ens.err <- y.pred.ens - y.test
    
    # Bias comparison 
    bias.ens <- mean(y.pred.ens - y.test)  
    bias.ens_vsbench <- (bias.ens/mean(y.pred - y.test) - 1)*100
    bias.ens_vs1 <- (bias.ens/mean(y.pred1 - y.test) - 1)*100
    bias.ens_vs2 <- (bias.ens/mean(y.pred2 - y.test) - 1)*100
  
    # Variance comparison
    var.ens <- var(y.pred.ens - y.test)
    var.ens_vsbench <- (var.ens/var(y.pred - y.test) - 1)*100
    var.ens_vs1 <- (var.ens/var(y.pred1 - y.test) - 1)*100
    var.ens_vs2 <- (var.ens/var(y.pred2 - y.test) - 1)*100
  
   cbind(c(bias.ens_vsbench, bias.ens_vs1, bias.ens_vs2), c(var.ens_vsbench, var.ens_vs1, var.ens_vs2))
##        [,1]  [,2]
## [1,] -95.31 46.27
## [2,] -112.12 -62.93
## [3,] -88.33 -45.53
Advertisements

3 Comments

Filed under R stuff

3 responses to “Impact of correlated predictions on the variance of an ensemble model

  1. Lukas Drap

    Interesting read. I am a bit confused about Your reasoning with the derivative. It seems that You are just comparing alpha = 0 and alpha very small going to zero. Unless the derivative function holds certain properties, this does not imply much about case where i.e. alpha =1/2. The derivative at 0 could indciate a wrong step but it could be outperformed at higher values. Or did I missed something?

    Like

  2. Lukas Drap

    Thanks for Your answer. I am not entirely conviced, though. 🙂 Could You reason why it possible could not happen the following case? Alpha = 0 is a good solution, alpha = 0.0001 is a worse solution than alpha = 0, but the best solution is alpha = 1/2. In order for this not to happen we would need convexity of the function…

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s