
Model Residuals in Time Series Data

Residuals are a key indicator of model quality. Following Rob J Hyndman's book "Forecasting: Principles & Practice", a residual in forecasting is the difference between an observed value and its forecast based on all previous observations. Residuals are useful for checking whether a model has adequately captured the information in the data. All the patterns should be in the model; only randomness should remain in the residuals.
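For the naive method, for instance, the forecast of each observation is simply the previous one, so the residual is just the first difference of the series. A minimal base-R sketch (with a made-up toy series, not the dowjones data):

```r
# toy series (made-up values for illustration only)
y <- c(110, 112, 111, 115, 114)

# the naive forecast of y[t] is y[t-1], so the first residual is undefined (NA)
fitted_naive <- c(NA, y[-length(y)])
resid_naive  <- y - fitted_naive

# after the leading NA, the residuals equal the first differences of y
resid_naive
```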

Therefore the residuals of an ideal model should be:

  • uncorrelated
  • zero mean

Two further properties are useful (but not essential):

  • constant variance
  • normally distributed
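As a quick illustration, these properties can be checked on simulated white noise, which is what ideal residuals should look like (a sketch using base R only; the data and object names are made up):

```r
set.seed(42)
e <- rnorm(200)                  # simulated "ideal" residuals

mean(e)                          # should be close to 0
var(e)                           # should be stable over time
acf(e, plot = FALSE)$acf[-1]     # autocorrelations should be near 0
shapiro.test(e)$p.value          # large p-value: consistent with normality
```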

First, let's load the libraries we will be using.

library(fpp)
library(forecast)

For our example I will use the dowjones index as the data set. The idea is to set up the already well-known simple models: the Mean model, the Naive model and the Drift model. In a previous post I described them in more detail. Then, knowing what attributes an ideal model should have, we can check which of these three are reasonably good and which are definitely not. Let's look at the dowjones index and plot it.

dowjones
plot(dowjones,main="Dowjones index") 

As we can see, the data show a trend: the series is moving in one direction.

Now time to set up our simple models:

driftdjmodel<-rwf(dowjones,h=20,drift=T) # extrapolates the average change between the first and last observation into the future
meandjmodel<-meanf(dowjones,h=20) # returns the mean as the forecast value
naivedjmodel<-naive(dowjones,h=20) # returns the last observation as forecast value

Due to the nature of the naive and drift methods, the residual series starts with an NA value (there is no forecast for the first observation). We need to delete it.

naivedjwithoutNA<-na.omit(naivedjmodel$residuals) # drop only the leading NA, keep all remaining residuals
driftdjwithoutNA<-na.omit(driftdjmodel$residuals)

Let's plot our dowjones index with our 3 simple methods.

plot(meandjmodel,main="Forecasting dowjones index with 3 simple methods")
lines(naivedjmodel$mean,col=123,lwd=2)
lines(driftdjmodel$mean,col=22,lwd=2)
legend("topleft",lty=1,cex=1,col=c(4,123,22),legend=c("Mean method","Naive method","Drift method"))

We have now built three very simple models; thanks to the residuals we will be able to assess their quality.

First, we will check the Mean model.

  • Variance and mean

var(meandjmodel$residuals)
[1] 30.31672
plot(meandjmodel$residuals,main="Residuals from Mean model") # time plot of the Mean model residuals
mean(meandjmodel$residuals) 
[1] -3.464321e-15

Conclusion: The relatively large variance doesn't look good; the mean, however, is very close to zero.

  • A histogram of the residuals will help us check for normal distribution

hist(meandjmodel$residuals,main="Histogram of residuals") 

Conclusion: the residuals are not normally distributed

A Q-Q plot is also useful for determining whether the residuals follow a normal distribution. If the data values in the plot fall along a straight line at a 45-degree angle, the data are normally distributed.

qqnorm(meandjmodel$residuals)

qqline(meandjmodel$residuals)

This also confirms that the residuals are not normally distributed.

  • Autocorrelation
A standard residual diagnostic is to check the ACF of the residuals of a forecasting method. It should look like white noise (uncorrelated, zero mean, constant variance).

Acf(meandjmodel$residuals,main="ACF of residuals")

Conclusion: There is autocorrelation between lags; several bars are clearly above the threshold lines.

To check whether the residuals are white noise we can use the Box-Pierce test or the Ljung-Box test. If the p-value is relatively large, we can conclude that the residuals are not distinguishable from a white noise series.

Box.test(meandjmodel$residuals,lag = 20,fitdf=0,type="Lj")

Box-Ljung test

data:  meandjmodel$residuals
X-squared = 884.47, df = 20, p-value < 2.2e-16

Conclusion: It doesn't look like white noise; the p-value is extremely small.

It is useful to know that all of these methods for checking residuals are packaged into one R function checkresiduals(). It will produce a time plot, ACF plot and histogram of the residuals (with an overlaid normal distribution for comparison), and do a Ljung-Box test with the correct degrees of freedom.

checkresiduals(meandjmodel) 


Conclusion: The Mean model seems very weak: autocorrelation is present, the residuals are not normally distributed, and the variance is large.

Checking the Naive model forecast:

  • Variance and mean

var(naivedjwithoutNA)
[1] 0.07025088
plot(naivedjwithoutNA)
mean(naivedjwithoutNA) 
[1] -0.08210526

Conclusion: The variance is relatively small and the mean is close to zero, although not exactly zero.

  • Histogram of distribution and Q-Q plot

hist(naivedjwithoutNA)
qqnorm(naivedjwithoutNA)
qqline(naivedjwithoutNA)

Conclusion: Not normally distributed

  • Autocorrelation

Acf(naivedjwithoutNA)

Conclusion: There seems to be no autocorrelation present. For uncorrelated data, we would expect each autocorrelation to be close to zero.
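The dashed threshold lines that Acf() draws are approximately ±1.96/√T, where T is the number of observations used; spikes within those bounds are consistent with white noise. A quick sketch of the rule (the sample size here is just an example value, not taken from the dowjones data):

```r
T_obs <- 100                 # example sample size
bound <- 1.96 / sqrt(T_obs)  # approximate 95% significance bound
bound                        # ACF spikes beyond +/- this value suggest autocorrelation
```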

Let's test whether the residuals are white noise, again with the Box-Pierce or Ljung-Box test.

Box.test(naivedjwithoutNA,lag = 18,fitdf=0,type="Lj") 

Box-Ljung test
data:  naivedjwithoutNA
X-squared = 21.648, df = 18, p-value = 0.248

checkresiduals(naivedjwithoutNA)

Conclusion: The Naive model seems better, although still not ideal: the mean is not zero, but the variance is small and no autocorrelation is present; unfortunately the residuals are not normally distributed.

Checking the Drift model

  • Variance and mean

var(driftdjwithoutNA)
[1] 0.07025088
plot(driftdjwithoutNA)

mean(driftdjwithoutNA)
[1] -0.2157416

Conclusion: The variance is relatively small, but the mean is not very close to zero.

  • Histogram of distribution and Q-Q plot

hist(driftdjwithoutNA)
qqnorm(driftdjwithoutNA)
qqline(driftdjwithoutNA)



Conclusion: Not normally distributed

  • Autocorrelation

Acf(driftdjwithoutNA)

Conclusion: There seems to be no autocorrelation present.

Box.test(driftdjwithoutNA,lag = 18,fitdf=0,type="Lj") 

Box-Ljung test

data:  driftdjwithoutNA
X-squared = 21.648, df = 18, p-value = 0.248

checkresiduals(driftdjwithoutNA)

Conclusion: The Drift model is similar to the Naive model, with a mean further away from zero. None of the models is ideal, but of these three I would choose the Naive one.
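To summarise the comparison, the three checks can be collected side by side. The sketch below re-implements the mean, naive and drift residuals in base R on a simulated trending series (the series and all names are made up, not the dowjones results above). It also shows why the naive and drift residuals always have the same variance: the drift residuals are just the naive residuals shifted by a constant, and that constant makes their in-sample mean exactly zero.

```r
set.seed(123)
y <- 100 + cumsum(rnorm(50, mean = 0.5))   # made-up trending series
n <- length(y)

res <- list(
  mean  = y - mean(y),                        # residuals of the mean model
  naive = diff(y),                            # residuals of the naive model
  drift = diff(y) - (y[n] - y[1]) / (n - 1)   # naive residuals minus the drift
)

# mean, variance and Ljung-Box p-value for each model's residuals
sapply(res, function(e)
  c(mean = mean(e),
    var  = var(e),
    lb_p = Box.test(e, lag = 10, type = "Ljung-Box")$p.value))
```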

