
Model Residuals in Time Series Data

Residuals are a key indicator of model quality. Following Rob J Hyndman's book "Forecasting: Principles & Practice", a residual in forecasting is the difference between an observed value and its forecast based on all previous observations. Residuals are useful for checking whether a model has adequately captured the information in the data. All the patterns should be in the model; only randomness should remain in the residuals.
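For the naive method, for instance, the forecast of each observation is simply the previous one, so the residual is just the first difference of the series. A minimal base-R sketch (with a made-up toy series, not the dowjones data):

```r
# toy series (made-up values for illustration only)
y <- c(110, 112, 111, 115, 114)

# the naive forecast of y[t] is y[t-1], so the first residual is undefined (NA)
fitted_naive <- c(NA, y[-length(y)])
resid_naive  <- y - fitted_naive

# after the leading NA, the residuals equal the first differences of y
resid_naive
```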

Therefore the residuals of an ideal model should be:

  • uncorrelated
  • zero mean

Two further properties are useful (but not essential):

  • constant variance
  • normally distributed
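As a quick illustration, these properties can be checked on simulated white noise, which is what ideal residuals should look like (a sketch using base R only; the data and object names are made up):

```r
set.seed(42)
e <- rnorm(200)                  # simulated "ideal" residuals

mean(e)                          # should be close to 0
var(e)                           # should be stable over time
acf(e, plot = FALSE)$acf[-1]     # autocorrelations should be near 0
shapiro.test(e)$p.value          # large p-value: consistent with normality
```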

First, let's load the libraries we will be using.

library(fpp)
library(forecast)

For our example I will use the dowjones index as the data set. The idea is to set up the already well-known simple models: the Mean model, the Naive model and the Drift model. In a previous post I described them in more detail. Then, knowing what attributes an ideal model should have, we can check which of these three are reasonably good and which are definitely not. Let's look at the dowjones index and plot it.

dowjones
plot(dowjones,main="Dowjones index") 

As we can see, the data show a trend: the series is moving in one direction.

Now time to set up our simple models:

driftdjmodel<-rwf(dowjones,h=20,drift=T) # extrapolates the average change between the first and last observation into the future
meandjmodel<-meanf(dowjones,h=20) # returns the mean as the forecast value
naivedjmodel<-naive(dowjones,h=20) # returns the last observation as forecast value

Due to the nature of the naive and drift methods, the residual series starts with an NA value (there is no forecast for the first observation). We need to delete it.

naivedjwithoutNA<-na.omit(naivedjmodel$residuals) # drop only the leading NA, keep all remaining residuals
driftdjwithoutNA<-na.omit(driftdjmodel$residuals)

Let's plot our dowjones index with our 3 simple methods.

plot(meandjmodel,main="Forecasting dowjones index with 3 simple methods")
lines(naivedjmodel$mean,col=123,lwd=2)
lines(driftdjmodel$mean,col=22,lwd=2)
legend("topleft",lty=1,cex=1,col=c(4,123,22),legend=c("Mean method","Naive method","Drift method"))

We have now built three very simple models; thanks to the residuals we will be able to assess their quality.

First, we will check the Mean model.

  • Variance and mean

var(meandjmodel$residuals)
[1] 30.31672
plot(meandjmodel$residuals,main="Residuals from Mean model") # time plot of the Mean model residuals
mean(meandjmodel$residuals) 
[1] -3.464321e-15

Conclusion: The relatively large variance doesn't look good; the mean, however, is very close to zero.

  • A histogram of the residuals will help us check for normal distribution

hist(meandjmodel$residuals,main="Histogram of residuals") 

Conclusion: the residuals are not normally distributed

A Q-Q plot is also useful for determining whether the residuals follow a normal distribution. If the data values in the plot fall along a straight line at a 45-degree angle, the data are normally distributed.

qqnorm(meandjmodel$residuals)

qqline(meandjmodel$residuals)

This also confirms that the residuals are not normally distributed.

  • Autocorrelation
A standard residual diagnostic is to check the ACF of the residuals of a forecasting method. It should look like white noise (uncorrelated, zero mean, constant variance).

Acf(meandjmodel$residuals,main="ACF of residuals")

Conclusion: There is autocorrelation between lags; several bars are clearly above the threshold lines.

To check whether the residuals are white noise we can use the Box-Pierce test or the Ljung-Box test. If the p-value is relatively large, we can conclude that the residuals are not distinguishable from a white noise series.

Box.test(meandjmodel$residuals,lag = 20,fitdf=0,type="Lj")

Box-Ljung test

data:  meandjmodel$residuals
X-squared = 884.47, df = 20, p-value < 2.2e-16

Conclusion: It doesn't look like white noise; the p-value is extremely small.

It is useful to know that all of these methods for checking residuals are packaged into one R function checkresiduals(). It will produce a time plot, ACF plot and histogram of the residuals (with an overlaid normal distribution for comparison), and do a Ljung-Box test with the correct degrees of freedom.

checkresiduals(meandjmodel) 


Conclusion: The Mean model seems very weak: autocorrelation is present, the residuals are not normally distributed, and the variance is large.

Checking the Naive model forecast:

  • Variance and mean

var(naivedjwithoutNA)
[1] 0.07025088
plot(naivedjwithoutNA)
mean(naivedjwithoutNA) 
[1] -0.08210526

Conclusion: The variance is relatively small and the mean is close to zero, although not exactly zero.

  • Histogram of distribution and Q-Q plot

hist(naivedjwithoutNA)
qqnorm(naivedjwithoutNA)
qqline(naivedjwithoutNA)

Conclusion: Not normally distributed

  • Autocorrelation

Acf(naivedjwithoutNA)

Conclusion: There seems to be no autocorrelation present. For uncorrelated data, we would expect each autocorrelation to be close to zero.
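The dashed threshold lines that Acf() draws are approximately ±1.96/√T, where T is the number of observations used; spikes within those bounds are consistent with white noise. A quick sketch of the rule (the sample size here is just an example value, not taken from the dowjones data):

```r
T_obs <- 100                 # example sample size
bound <- 1.96 / sqrt(T_obs)  # approximate 95% significance bound
bound                        # ACF spikes beyond +/- this value suggest autocorrelation
```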

Let's test whether the residuals are white noise, again with the Box-Pierce or Ljung-Box test.

Box.test(naivedjwithoutNA,lag = 18,fitdf=0,type="Lj") 

Box-Ljung test
data:  naivedjwithoutNA
X-squared = 21.648, df = 18, p-value = 0.248

checkresiduals(naivedjwithoutNA)

Conclusion: The Naive model seems better, although still not ideal: the mean is not zero, but the variance is small and no autocorrelation is present; unfortunately the residuals are not normally distributed.

Checking the Drift model

  • Variance and mean

var(driftdjwithoutNA)
[1] 0.07025088
plot(driftdjwithoutNA)

mean(driftdjwithoutNA)
[1] -0.2157416

Conclusion: The variance is relatively small, but the mean is not very close to zero.

  • Histogram of distribution and Q-Q plot

hist(driftdjwithoutNA)
qqnorm(driftdjwithoutNA)
qqline(driftdjwithoutNA)



Conclusion: Not normally distributed

  • Autocorrelation

Acf(driftdjwithoutNA)

Conclusion: There seems to be no autocorrelation present.

Box.test(driftdjwithoutNA,lag = 18,fitdf=0,type="Lj") 

Box-Ljung test

data:  driftdjwithoutNA
X-squared = 21.648, df = 18, p-value = 0.248

checkresiduals(driftdjwithoutNA)

Conclusion: The Drift model is similar to the Naive model, with a mean further away from zero. None of the models is ideal, but of these three I would choose the Naive one.
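To summarise the comparison, the three checks can be collected side by side. The sketch below re-implements the mean, naive and drift residuals in base R on a simulated trending series (the series and all names are made up, not the dowjones results above). It also shows why the naive and drift residuals always have the same variance: the drift residuals are just the naive residuals shifted by a constant, and that constant makes their in-sample mean exactly zero.

```r
set.seed(123)
y <- 100 + cumsum(rnorm(50, mean = 0.5))   # made-up trending series
n <- length(y)

res <- list(
  mean  = y - mean(y),                        # residuals of the mean model
  naive = diff(y),                            # residuals of the naive model
  drift = diff(y) - (y[n] - y[1]) / (n - 1)   # naive residuals minus the drift
)

# mean, variance and Ljung-Box p-value for each model's residuals
sapply(res, function(e)
  c(mean = mean(e),
    var  = var(e),
    lb_p = Box.test(e, lag = 10, type = "Ljung-Box")$p.value))
```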

