
Impact of correlated predictions on the variance of an ensemble model

Let X_1 and X_2 be the prediction errors of two statistical/machine learning algorithms. X_1 and X_2 have relatively low bias and high variances \sigma^2_1 and \sigma^2_2. They are also correlated, with a Pearson correlation coefficient equal to \rho_{1, 2}.

Aggregating models 1 and 2 might result in a predictive model with a lower prediction error variance than either 1 or 2. But not always. For those who attended statistics/probability/portfolio optimization classes, this will probably sound obvious; you can jump directly to the illustrative R part below.

Let Z := \alpha X_1 + (1-\alpha) X_2, with \alpha \in \left[0, 1\right], be the prediction error of the ensemble model built with 1 and 2. We have :

Var(Z) = \alpha^2 \sigma^2_1 + (1 - \alpha)^2\sigma^2_2 + 2\alpha(1-\alpha)Cov(X_1, X_2)

And from the fact that :

Cov(X_1, X_2) = \rho_{1, 2} \sigma_1\sigma_2

We get :

Var(Z) = \alpha^2 \sigma^2_1 + (1 - \alpha)^2\sigma^2_2 + 2\alpha(1-\alpha) \rho_{1, 2} \sigma_1\sigma_2

Now, let’s see how Var(Z) changes with an increase of \alpha, the ensemble’s allocation for model 1 :

\frac{\partial Var(Z)}{\partial \alpha}= 2\alpha \sigma^2_1 - 2 (1 - \alpha) \sigma^2_2 + 2(1-2\alpha) \rho_{1, 2} \sigma_1\sigma_2

When \alpha is close to 0, that is, when the ensemble contains almost only model 2, we have :

\frac{\partial Var(Z)}{\partial \alpha}_{|\alpha = 0}= 2 \left( \rho_{1, 2} \sigma_1\sigma_2 - \sigma^2_2 \right) = 2 \sigma^2_2\left( \rho_{1, 2} \frac{\sigma_1}{\sigma_2} - 1 \right)

That’s the marginal change in the variance of the ensemble prediction error, induced by introducing model 1 into an ensemble containing almost only model 2. Hence, if \rho_{1, 2} = \frac{\sigma_2}{\sigma_1}, a small allocation to model 1 won’t increase or decrease the variance of the ensemble prediction error at all. The variance will decrease if we introduce into the ensemble a model 1 such that \rho_{1, 2} < \frac{\sigma_2}{\sigma_1}. And since Var(Z) is convex in \alpha (the coefficient of \alpha^2 is \sigma^2_1 + \sigma^2_2 - 2\rho_{1, 2}\sigma_1\sigma_2 \geq 0), if \rho_{1, 2} \geq \frac{\sigma_2}{\sigma_1} the variance never drops below \sigma^2_2, no matter how X_1 and X_2 are combined.
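To make the threshold \frac{\sigma_2}{\sigma_1} concrete, here is a minimal R sketch (the values of sigma1, sigma2 and rho below are arbitrary, chosen only for illustration) that evaluates Var(Z) on a grid of allocations :

  # Variance of the ensemble prediction error, as a function of alpha
  var.Z <- function(alpha, sigma1, sigma2, rho)
  {
    alpha^2*sigma1^2 + (1 - alpha)^2*sigma2^2 + 2*alpha*(1 - alpha)*rho*sigma1*sigma2
  }

  alpha.grid <- seq(from = 0, to = 1, by = 0.01)
  sigma1 <- 1.5   # model 1 has the higher variance here
  sigma2 <- 1     # threshold: sigma2/sigma1 = 2/3

  # Below the threshold, some allocation to model 1 pushes the variance below sigma2^2
  min(var.Z(alpha.grid, sigma1, sigma2, rho = 0.2)) < sigma2^2   # TRUE

  # At or above the threshold, no combination does better than model 2 alone
  min(var.Z(alpha.grid, sigma1, sigma2, rho = 0.9)) < sigma2^2   # FALSE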

For a simple illustrative example in R, I create simulated data observations :

  # number of observations
  n <- 100
  u <- 1:n

  # Simulated observed data
  intercept <- 5 
  slope <- 0.2
  set.seed(2)

  ## data
  y <-  intercept + slope*u + rnorm(n)
  plot(u, y, type = 'l', main = "Simulated data observations")
  points(u, y, pch = 19)

(Figure: simulated data observations)

I fit a linear regression model to the data, as a benchmark model :

  # Defining training and test sets
  train <- 1:(0.7*n)
  test <- -train
  u.train <- u[(train)]
  y.train <- y[(train)]
  u.test <- u[(test)]
  y.test <- y[(test)]

  # Fitting the benchmark model to the training set 
  fit.lm <- lm(y.train ~ u.train)

  (slope.coeff <- fit.lm$coefficients[2])
  (intercept.coeff <- fit.lm$coefficients[1])

## u.train
## 0.1925
## (Intercept)
## 5.292

Obtaining the predicted values from the benchmark model, and the prediction error :

  # Predicted values from benchmark model on the test set
  y.pred <- intercept.coeff + slope.coeff*u.test   
  
  # prediction error from linear regression
  pred.err <- y.pred - y.test
  (mean.pred.err <- mean(pred.err))
  (var.pred.err <- var(pred.err))

## [1] -0.1802
## [1] 1.338

Now, I consider two other models, 1 and 2 :

y = a + b \times u + \sin(u) + \epsilon_1

and

y = a + b \times u - 0.35 \times \sin(u) + \epsilon_2

with \epsilon_1 and \epsilon_2 being two correlated Gaussian noises with zero mean and constant variances (well, not really “models” that I fit, but they help me build various fictitious predictions with different correlations, for the illustration). The slope and intercept are those obtained from the benchmark model.

 # Alternative model 1 (low bias, high variance, oscillating)
  m <- length(y.pred)
  eps1 <- rnorm(m, mean = 0, sd = 1.5)
  y.pred1 <- intercept.coeff + slope.coeff*u.test + sin(u.test) + eps1

 # prediction error for model 1
  pred.err1 <- y.pred1 - y.test

We can visualize the predictions of models 1, 2 and the benchmark, for different correlations between the prediction errors of models 1 and 2, to get an intuition of the possible ensemble predictions :

  # Different prediction errors correlations for 1 and 2
  rho.vec <- c(-1, -0.8, 0.6, 1)  

  # Independent random gaussian numbers defining model 2 errors
  eps <- rnorm(m, mean = 0, sd = 2)

  # Plotting the predictions with different correlations
  par(mfrow=c(2, 2))  
  for (i in 1:4)
  {
    rho <- rho.vec[i]

    # Correlated gaussian numbers (Cholesky decomposition)
    eps2 <- rho*eps1 + sqrt(1 - rho^2)*eps

    # Alternative  model 2 (low bias, higher variance than 1, oscillating)
    y.pred2 <- intercept.coeff + slope.coeff*u.test - 0.35*sin(u.test) + eps2    

    # prediction error for model 2
    pred.err2 <- y.pred2 - y.test    

    # realized correlation between the prediction errors of models 1 and 2
    corr.pred12 <- round(cor(pred.err1, pred.err2), 2)

    # Plot
    plot(u.test, y.test, type = "p", 
         xlab = "test set values", ylab = "predicted values",
         main = paste("models 1 & 2 pred. values \n correlation :", 
                      corr.pred12))
    points(u.test, y.test, pch = 19)
    lines(u.test, y.pred, lwd = 2)
    lines(u.test, y.pred1, 
          col = "blue", lwd = 2)
    lines(u.test, y.pred2, 
          col = "red", lwd = 2)
  }

(Figure: predictions of model 1 (blue), model 2 (red) and the benchmark (black) on the test set, for four different prediction error correlations)
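An aside on the construction of eps2 above : since eps1 (sd = 1.5) and eps (sd = 2) don’t have the same standard deviation, the target rho only approximately controls the resulting correlation (it is exact only for rho = -1 or 1), which is why the realized correlation of the prediction errors is computed and shown in each panel title. A quick check of the population value, restating the standard deviations used above :

  # Population correlation between eps1 and eps2, for a target rho of 0.6
  rho0 <- 0.6
  sd1 <- 1.5
  sd2 <- 2
  rho0*sd1/sqrt(rho0^2*sd1^2 + (1 - rho0^2)*sd2^2)   # about 0.49, not 0.6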

Now including allocations of models 1 and 2, we can see how the ensemble variance evolves as a function of allocation and correlation :

  # Allocation for model 1 in the ensemble
  alpha.vec <- seq(from = 0, to = 1, by = 0.05)  

  # Correlations between model 1 and model 2
  rho.vec <- seq(from = -1, to = 1, by = 0.05)  

  # Results matrices
  nb.row <- length(alpha.vec)
  nb.col <- length(rho.vec)  

  ## Average prediction errors of the ensemble
  mean.pred.err.ens <- matrix(0, nrow = nb.row, ncol = nb.col)
  rownames(mean.pred.err.ens) <- paste0("pct. 1 : ", alpha.vec*100, "%")
  colnames(mean.pred.err.ens) <- paste0("corr(1, 2) : ", rho.vec)

  ## Variance of prediction error of the ensemble
  var.pred.err.ens <- matrix(0, nrow = nb.row, ncol = nb.col)
  rownames(var.pred.err.ens) <- paste0("pct. 1 : ", alpha.vec*100, "%")
  colnames(var.pred.err.ens) <- paste0("corr(1, 2) : ", rho.vec)

  # loop on correlations and allocations 
  for (i in 1:nb.row)
  {
    for (j in 1:nb.col)
    {
      alpha <- alpha.vec[i]
      rho <- rho.vec[j]

      # Alternative model 2 (low bias, higher variance, oscillating)
      eps2 <- rho*eps1 + sqrt(1 - rho^2)*eps
      y.pred2 <- intercept.coeff + slope.coeff*u.test - 0.35*sin(u.test) + eps2
      pred.err2 <- y.pred2 - y.test

      # Ensemble prediction error
      z <- alpha*pred.err1 + (1-alpha)*pred.err2
      mean.pred.err.ens[i, j] <- mean(z)      
      var.pred.err.ens[i, j] <-  var(z)
    }
  }
  res.var <- var.pred.err.ens
 
  # Heat map for the variance of the ensemble
  filled.contour(alpha.vec, rho.vec, res.var, color = terrain.colors, 
                 main = "prediction error variance for the ensemble", 
                 xlab = "allocation in x1", ylab = "correlation between 1 and 2")

(Figure: heat map of the ensemble prediction error variance, by allocation for model 1 and correlation between 1 and 2)

Hence, the lower the correlation between 1 and 2, the lower the variance of the ensemble model’s prediction error. This, combined with an allocation for model 1 of between 35\% and 50\%, seems to give the building blocks for the final model. The ensemble’s biases then help in making the final choice of allocation (this is actually an optimization problem that can be directly solved with portfolio theory).
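For a given correlation \rho_{1, 2}, the allocation minimizing Var(Z) actually has a closed form, the classical two-asset minimum-variance weight obtained by setting the derivative above to zero : \alpha^* = \frac{\sigma^2_2 - \rho_{1, 2}\sigma_1\sigma_2}{\sigma^2_1 + \sigma^2_2 - 2\rho_{1, 2}\sigma_1\sigma_2}, truncated to \left[0, 1\right] if needed. The which.min selection below finds the same thing numerically, on the grid of empirical variances; here is a small R sketch of the closed form (the function name is mine, for illustration only) :

  # Closed-form minimum-variance allocation for model 1 (truncated to [0, 1])
  alpha.star <- function(sigma1, sigma2, rho)
  {
    a <- (sigma2^2 - rho*sigma1*sigma2)/(sigma1^2 + sigma2^2 - 2*rho*sigma1*sigma2)
    min(max(a, 0), 1)
  }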

Finding the final ensemble, with lower variance and lower bias :

# Final ensemble

## Allocation (row index of the minimum of res.var)
best <- which(res.var == min(res.var), arr.ind = TRUE)
alpha.vec[best[1]]
## [1] 0.45

## Correlation (column minimizing the absolute bias, for this allocation)
res.bias <- abs(mean.pred.err.ens)
which.min(res.bias[best[1], ])
## corr(1, 2) : -0.7
##            7

Creating the final model with these parameters :

    rho <- -0.7
    eps2 <- rho*eps1 + sqrt(1 - rho^2)*eps

    # Alternative model 2 (low bias, higher variance, oscillating)
    y.pred2 <- intercept.coeff + slope.coeff*u.test - 0.35*sin(u.test) + eps2        

    # Final ensemble prediction
    y.pred.ens <- 0.45*y.pred1 + 0.55*y.pred2
  
    # Plot
    plot(u.test, y.test, type = "p", 
         xlab = "test set", ylab = "predicted values",
         main = "Final ensemble model (green)")
    points(u.test, y.test, pch = 19)
    # benchmark 
    lines(u.test, y.pred, lwd = 2)
    # model 1
    lines(u.test, y.pred1, col = "blue", lwd = 2)
    # model 2
    lines(u.test, y.pred2, col = "red", lwd = 2)
    # ensemble model with 1 and 2
    points(u.test, y.pred.ens, col = "green", pch = 19)
    lines(u.test, y.pred.ens, col = "green", lwd = 2)

(Figure: final ensemble model (green) vs. benchmark (black), model 1 (blue) and model 2 (red))

Performance of the final model :

    # Benchmark
    pred.err <- y.pred - y.test
    # Model 1 
    pred1.err <- y.pred1 - y.test
    # Model 2
    pred2.err <- y.pred2 - y.test
    # Ensemble model
    pred.ens.err <- y.pred.ens - y.test
    
    # Bias comparison (relative change in %, ensemble vs. each model)
    bias.ens <- mean(pred.ens.err)
    bias.ens_vsbench <- (bias.ens/mean(pred.err) - 1)*100
    bias.ens_vs1 <- (bias.ens/mean(pred1.err) - 1)*100
    bias.ens_vs2 <- (bias.ens/mean(pred2.err) - 1)*100
  
    # Variance comparison (relative change in %, ensemble vs. each model)
    var.ens <- var(pred.ens.err)
    var.ens_vsbench <- (var.ens/var(pred.err) - 1)*100
    var.ens_vs1 <- (var.ens/var(pred1.err) - 1)*100
    var.ens_vs2 <- (var.ens/var(pred2.err) - 1)*100
  
   cbind(c(bias.ens_vsbench, bias.ens_vs1, bias.ens_vs2), c(var.ens_vsbench, var.ens_vs1, var.ens_vs2))
##        [,1]  [,2]
## [1,] -95.31 46.27
## [2,] -112.12 -62.93
## [3,] -88.33 -45.53
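Read row-wise, the matrix above compares the ensemble to the benchmark, to model 1 and to model 2, the columns being the relative change in bias and in variance (in \%). For readability, the same comparison can be rebuilt with explicit labels :

    # Same comparison, with explicit row and column labels
    res.comp <- cbind(bias.change.pct = c(bias.ens_vsbench, bias.ens_vs1, bias.ens_vs2),
                      var.change.pct = c(var.ens_vsbench, var.ens_vs1, var.ens_vs2))
    rownames(res.comp) <- c("vs benchmark", "vs model 1", "vs model 2")
    round(res.comp, 2)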


Filed under R stuff

Solvency II yield curve ‘fits exactly’. Too tightly to explain ?

For the QIS5 and the LTGA (Quantitative Impact Studies), the European prudential authority (EIOPA) suggested the use of the Smith-Wilson method for discounting future cash flows. A description of this method can be found here. The QIS5 technical specifications stated about this method that : “the term structure is fitted exactly to all observed zero coupon bond prices”. The method was subsequently used for the LTGA.

But is fitting exactly a desirable feature for a yield curve ? I won’t give a definitive answer here, only some hints, as the question could be answered in a lot of different ways (well… this post is mainly an excuse to show you ycinterextra in action, and for you to tell me how to improve the package).

I’ll fit the Nelson-Siegel (NS), Svensson (SV) and Smith-Wilson (SW) models (all implemented in ycinterextra) to the data from Diebold, F. X., & Li, C. (2006), Forecasting the term structure of government bond yields, Journal of Econometrics, 130(2), 337-364, and we’ll see how accurate the values they produce are on out-of-sample data. I mixed ideas from Diebold and Li (2006) with ideas from Gilli, Manfred and Schumann, Enrico, A Note on ‘Good Starting Values’ in Numerical Optimisation (June 3, 2010), available at SSRN: http://ssrn.com/abstract=1620083.

The data can be found here : http://www.ssc.upenn.edu/~fdiebold/papers/paper49/FBFITTED.txt.

# I put the data in a text file.
data.diebold.li <- scan("FBFITTED.txt", skip = 14)

Now, loading the package and formatting the data :

library(ycinterextra)

# Number of observation dates (monthly, from January 1970 through December 2000)
nbdates <- 372

# Time to maturities for each observation date
mats <- round(c(1, 3, 6, 9, 12, 15, 18, 21, 24, 30, 36, 48, 60, 72, 84, 96, 108, 120)/12, 3)
nbmats <- length(mats)
nbcol <- nbmats + 1

# Formatting the data
data.diebold.li.final <- NULL
for(i in seq_len(nbdates))
{
 data.diebold.li.final <- rbind(data.diebold.li.final, data.diebold.li[(nbcol*(i -1)+1):(nbcol*(i -1)+nbcol)])
}
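A quick aside : since scan() returns one long vector, the same reshaping can be done without the loop (this should produce exactly the same matrix, filled row by row) :

# Equivalent, loop-free reshaping of the scanned vector
data.diebold.li.final <- matrix(data.diebold.li, ncol = nbcol, byrow = TRUE)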

A 3D plot of the observed data can be obtained with the following function :

x <- mats
y <- data.diebold.li.final[,1]
z <- data.diebold.li.final[,-1]

graph3D <- function(x, y, z)
{
 par(bg = "white")
 nrz <- nrow(z)
 ncz <- ncol(z)
 # Create a function interpolating colors in the range of specified colors
 jet.colors <- colorRampPalette( c("blue", "green") )
 # Generate the desired number of colors from this palette

 nbcol <- 100
 color <- jet.colors(nbcol)
 # Compute the z-value at the facet centres
 zfacet <- z[-1, -1] + z[-1, -ncz] + z[-nrz, -1] + z[-nrz, -ncz]
 # Recode facet z-values into color indices
 facetcol <- cut(zfacet, nbcol)
 persp(y, x, z, col = color[facetcol], phi = 30, theta = -30, ticktype= 'detailed', nticks = 3,
 xlab = "observation date", ylab = "maturity", zlab = "zero rates in pct.")
}

graph3D(x, y, z)

(Figure: 3D plot of the observed zero rates, by observation date and maturity)

Now, let’s calibrate the 3 models, NS, SV and SW to the observed yield curves (this takes a lot of time, as there are 372 curves, each with 18 maturities) :

nrz <- nrow(z)
ncz <- ncol(z)
yM.NS <- matrix(NA, nrow = nrz, ncol = ncz)
yM.SV <- matrix(NA, nrow = nrz, ncol = ncz)
yM.SW <- matrix(NA, nrow = nrz, ncol = ncz)

# Loop over the observation dates (January 1970 through December 2000)
for (t in seq_len(nbdates))
{
 # market yields
 yM <- as.numeric(data.diebold.li.final[t, -1])

 yM.NS[t, ] <- fitted(ycinter(yM = yM, matsin = mats, matsout = mats,
 method = "NS", typeres = "rates"))

 yM.SV[t, ] <- fitted(ycinter(yM = yM, matsin = mats, matsout = mats,
 method = "SV", typeres = "rates"))

 yM.SW[t, ] <- fitted(ycinter(yM = yM, matsin = mats, matsout = mats,
 method = "SW", typeres = "rates"))
}

After a little while… the SW curve does indeed seem to fit very well, and the plot below looks pretty similar to the previous one :

graph3D(x, y, yM.SW)

(Figure: 3D plot of the Smith-Wilson fitted rates)
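To put a number on the visual impression, here is a quick check of the largest in-sample fitting error of each model, using the matrices just computed (the SW value should be essentially zero, since the method interpolates the input rates) :

# Maximum absolute in-sample fitting error for each model, in pct. points
c(NS = max(abs(yM.NS - z)), SV = max(abs(yM.SV - z)), SW = max(abs(yM.SW - z)))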

With a plot of the average yield curve (the average of the 372 observed curves, as in Diebold and Li (2006)), we can see that all three models seem to provide fairly good fits to the data :

# Average yield curves
plot(mats, apply(z, 2, mean), pch = 3, xlab = "time to maturity",
 ylab = "yield to maturity",
 main = paste("Actual and fitted (model-based)", "\n", "average yield curves"))
lines(mats, apply(yM.NS, 2, mean), col = "blue", lwd= 2)
lines(mats, apply(yM.SV, 2, mean), col = "red", lwd= 2)
lines(mats, apply(yM.SW, 2, mean), col = "green", lwd=2, lty = 2)
legend("bottomright", c("Actual", "NS", "SV", "SW"),
 col=c("black", "blue", "red", "green"), text.col = c("black", "blue", "red", "green"),
 lty=c(NA, 1, 1, 2), lwd = c(1, 2, 2, 2), pch=c(3, NA, NA, NA))

(Figure: actual and fitted (model-based) average yield curves)

Here are plots of the residuals for NS and SW (for SV, the plot is pretty similar to the plot for NS) :

# residuals
graph3D(x, y, yM.NS-z)
graph3D(x, y, yM.SW-z)

(Figures: fit residuals for NS and for SW)

Now, taking a closer look at some given cross-sections (here, 3/31/1989), we can see what’s actually going on :

t <- 231

par(mfrow=c(2,2))

plot(mats, data.diebold.li.final[t, -1], xlab = "time to maturity",
 ylab = "yield to maturity", col="red", main = "Observed data")
points(mats, data.diebold.li.final[t, -1], col="red")

plot(mats, yM.NS [t,], type='l', xlab = "time to maturity",
 ylab = "yield to maturity", col = "blue", main = "Nelson-Siegel fit")
points(mats, data.diebold.li.final[t, -1], col="red")

plot(mats, yM.SV[t,], type='l', xlab = "time to maturity",
 ylab = "yield to maturity", col = "blue", main = "Svensson fit")
points(mats, data.diebold.li.final[t, -1], col="red")

plot(mats, yM.SW[t,], type='l', xlab = "time to maturity",
 ylab = "yield to maturity", col = "blue", main = "Smith-Wilson fit")
points(mats, data.diebold.li.final[t, -1], col="red")

(Figure: observed data and the NS, SV and SW fits on 3/31/1989)

And for 7/31/1989 (with t == 235) :

(Figure: observed data and the NS, SV and SW fits on 7/31/1989)

NS fits well. SV fits a little bit better, thanks to its extended number of parameters. SW fits each point exactly, due to its laaarge number of parameters (effectively one per observed maturity, 18 here).

Let’s see how accurate the values produced by the 3 methods are on out-of-sample data, using a simple train/test split (k-fold cross-validation could be used as well; it’s computationally more expensive, but just as simple). I divide the sample data into 2 sets :

  •  A learning set with the zero rates observed at time to maturities : c(1, 6, 9, 15, 18, 21, 24, 30, 48, 72, 84, 96, 120)/12
  • A test set with the zero rates observed at time to maturities : c(3, 12, 36, 60, 108)/12

The models’ parameters are estimated on the learning set, and the test set is used to assess how well each model predicts the rates it has not seen :

# Maturities for test set
matsoutsample <- c(3, 12, 36, 60, 108)/12
matchoutsample <- match(matsoutsample, mats)
m <- length(matsoutsample)

# Maturities for learning set
matsinsample <- mats[-matchoutsample]

 

# Residuals matrices
res.NS <- matrix(NA, nrow = nrz, ncol = length(matchoutsample))
res.SV <- matrix(NA, nrow = nrz, ncol = length(matchoutsample))
res.SW <- matrix(NA, nrow = nrz, ncol = length(matchoutsample))

# Loop over the observation dates (January 1970 through December 2000)
for (t in seq_len(nbdates))
{
 # learning-set yields: drop the date column first, then drop the test-set maturities
 yM.t <- as.numeric(data.diebold.li.final[t, -1])[-matchoutsample]

 yM.t.bechm <- as.numeric(data.diebold.li.final[t, -1])

 res.NS[t, ] <- (fitted(ycinter(yM = yM.t, matsin = matsinsample, matsout = mats,
 method = "NS", typeres = "rates")) - yM.t.bechm)[matchoutsample]

 res.SV[t, ] <- (fitted(ycinter(yM = yM.t, matsin = matsinsample, matsout = mats,
 method = "SV", typeres = "rates")) - yM.t.bechm)[matchoutsample]

 res.SW[t, ] <- (fitted(ycinter(yM = yM.t, matsin = matsinsample, matsout = mats,
 method = "SW", typeres = "rates")) - yM.t.bechm)[matchoutsample]
}

For each model, we get the 372 mean squared errors as follows :

mse.ns <- apply((res.NS), 1, function(x) crossprod(x))/m
mse.sv <- apply((res.SV), 1, function(x) crossprod(x))/m
mse.sw <- apply((res.SW), 1, function(x) crossprod(x))/m
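As an aside, apply(res, 1, crossprod)/m is just the mean of the squared residuals of each row, so rowMeans gives the same vectors more directly :

# Equivalent, more direct computation of the same mean squared errors
mse.ns <- rowMeans(res.NS^2)
mse.sv <- rowMeans(res.SV^2)
mse.sw <- rowMeans(res.SW^2)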

A summary of these mean squared errors shows that SW actually overfits, and fails to predict the out-of-sample data :

summary(mse.ns)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## 0.0005895 0.0054050 0.0117700 0.0227100 0.0198400 0.5957000

summary(mse.sv)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## 0.0003694 0.0075310 0.0154200 0.0329400 0.0319700 0.6079000

summary(mse.sw)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
##     0.057     0.556     1.478   431.200    10.820 29040.000
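Given the orders-of-magnitude difference between SW and the two parametric models, a boxplot of the 372 MSEs per model on a log scale makes the comparison easier to read (a small visualization sketch, using the vectors computed above) :

# Out-of-sample MSEs per model, on a log scale
boxplot(list(NS = mse.ns, SV = mse.sv, SW = mse.sw), log = "y",
        ylab = "out-of-sample MSE (log scale)")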


Filed under R stuff, Yield curve