What can we say about time series data at the beginning? How can we describe it, and which elements determine the method we will use to forecast the data? For my own personal use I have prepared some notes which help me answer the questions above. I used some definitions from the book "Forecasting: Principles and Practice" by Rob J Hyndman, as well as other blog articles such as: https://towardsdatascience.com/descriptive-statistics-in-time-series-modelling
Basic Statistics for Time Series
Once you make sure that your data has the time series class, you can check it with the basic functions available in R.
ts() builds a time series object from scratch.
mean() shows the average of a set of data.
median() shows the middle value of the arranged set of data.
plot() shows what the time series looks like on a graph.
sort() sorts the data.
quantile() returns quantiles, which are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way.
var() returns the variance. Variance is the average squared distance from the mean and measures how far the data points spread around it.
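A minimal sketch of these functions in action; the sales vector below is made up purely for illustration:

# Made-up monthly values, used only to illustrate the functions above
sales <- c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118)

# Build a monthly time series starting in January 2020
sales_ts <- ts(sales, start = c(2020, 1), frequency = 12)

mean(sales_ts)      # average of the observations
median(sales_ts)    # middle value of the sorted observations
sort(sales)         # observations in ascending order
quantile(sales_ts)  # quartiles: 0%, 25%, 50%, 75%, 100%
var(sales_ts)       # average squared distance from the mean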
Trend
Trend is present when the dataset moves in a particular direction over time. We can see it by plotting the data with plot(), or with plot.ts() if the data has not yet been classified as a time series.
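For example, the built-in AirPassengers dataset (monthly airline passenger numbers) shows a clear upward trend when plotted:

# AirPassengers is a built-in monthly ts object with an obvious upward trend
plot(AirPassengers)

# plot.ts() can be used on a plain numeric vector that is not yet a ts object
plot.ts(as.numeric(AirPassengers))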
Stationarity
Stationarity is present when we have a constant mean and variance over time. The series doesn't increase or decrease with time, linearly or exponentially (no trend), and it doesn't show any kind of repeating pattern (no seasonality). In other words, the data has the same statistical properties throughout the time series: constant mean, constant variance, and an autocorrelation structure that does not change over time.
Many models, such as ARIMA, need the data to be stationary, but how do we identify stationarity?
We can plot the data, but it is often difficult for the human eye to tell.
The ACF of stationary data drops to zero relatively quickly, while for non-stationary data it decreases slowly.
For non-stationary data, the value of the first lag is often large and positive.
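A quick way to see this difference, sketched here with the built-in AirPassengers series (trending, hence non-stationary) and its first difference:

# The ACF of a trending series decays slowly, with a large positive first lag
acf(AirPassengers)

# After differencing, the ACF drops towards zero much more quickly
acf(diff(AirPassengers))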
Although the previous observations are useful, the most efficient way is to perform the Augmented Dickey-Fuller test. This test accounts for autocorrelation while testing for stationarity, and is available in R as the function adf.test() (from the tseries package).
How do we interpret the results of this test? We get a p-value. If the p-value is less than 0.05 (low), we can assume that the data is stationary. If the p-value is greater than 0.05 (high), we assume the data is non-stationary.
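A minimal sketch, assuming the tseries package is installed:

# install.packages("tseries")   # provides adf.test()
library(tseries)

adf.test(AirPassengers)
# p-value < 0.05 -> treat the series as stationary
# p-value > 0.05 -> treat the series as non-stationary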
What to do with Non-stationary Data?
There are three ways to handle it, depending on the time series (a combined sketch follows the list):
Transformations => If the data show different variation at different levels of the series, transformations can be useful, particularly logarithms. Transformations help to stabilize the variance. We can use Box-Cox transformations or automated Box-Cox transformations.
Differencing => The differenced series is the change between consecutive observations in the original series. It helps to stabilize the mean.
De-trending => Removing the trend, for example with time series decomposition.
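A combined sketch of the three approaches, assuming the forecast package for the Box-Cox helpers:

# install.packages("forecast")   # provides BoxCox() and BoxCox.lambda()
library(forecast)

# 1. Transformation: stabilize the variance
lambda      <- BoxCox.lambda(AirPassengers)   # automated choice of lambda
transformed <- BoxCox(AirPassengers, lambda)

# 2. Differencing: stabilize the mean
differenced <- diff(AirPassengers)            # change between consecutive observations

# 3. De-trending: subtract the trend estimated by decomposition
decomposed <- decompose(AirPassengers)
detrended  <- AirPassengers - decomposed$trend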
Autocorrelation
Autocorrelation measures the linear relationship between lagged values of a time series; a lag is the gap between two observations. In other words, it is the correlation coefficient between the time series and its lagged values. It shows whether previous observations influence later ones.
To identify autocorrelation we can use the following functions:
acf() shows the autocorrelation
pacf() shows the partial autocorrelation, which is the correlation at a given lag after adjusting for the effects of shorter lags.
We can choose, for example, the maximum number of lags:
acf(OurTimeSeries, lag.max = 20)
pacf(OurTimeSeries, lag.max = 20)
Setting plot = FALSE suppresses the plot and returns only the coefficient values.
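For example, using the same placeholder series as above:

# Return the autocorrelation coefficients without drawing a plot
acf(OurTimeSeries, lag.max = 20, plot = FALSE)
pacf(OurTimeSeries, lag.max = 20, plot = FALSE)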
Seasonality
It is a repeating pattern over a fixed interval (e.g. the quarter of the year, the month, or the day of the week). A seasonal pattern has a constant length.
We can also have a cyclic pattern, when the data exhibit rises and falls that are not of a fixed period, so the pattern has a variable length.
To identify seasonality, we can always start by plotting the series, because sometimes we can spot it easily. There are, however, other functions that can be helpful:
ggseasonplot() plots the data against the individual "seasons" in which the data were observed, so the underlying seasonal pattern can be seen more clearly.
monthplot() collects the data for each season together and plots them as separate sub-series. It visualizes changes in the seasonal pattern over time.
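A short sketch, assuming the forecast package for ggseasonplot() (monthplot() is in base R):

# install.packages("forecast")   # provides ggseasonplot()
library(forecast)

ggseasonplot(AirPassengers)   # one line per year, plotted over the months
monthplot(AirPassengers)      # one sub-series per month, shown side by side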
It is also worth mentioning that by plotting a correlogram with the ACF we can recognize seasonality in a time series.
If there is seasonality, the ACF at the seasonal lag (e.g. 12 for monthly data) will be large and positive.
For seasonal monthly data, a large ACF value will be seen at lags 12, 24, 36, ...
For seasonal quarterly data, a large ACF value will be seen at lags 4, 8, 12, ...
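For example (note that when the input is a monthly ts object, R labels the lag axis in years, so the seasonal lag of 12 months appears at 1.0):

# Large positive spikes every 12 months suggest yearly seasonality in monthly data
acf(AirPassengers, lag.max = 36)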
We can also use the decompose() function, which splits the data into the original series, a trend component, a seasonal component, and a remainder (white noise).
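For example:

components <- decompose(AirPassengers)
plot(components)           # observed series, trend, seasonal and random components

head(components$trend)     # extracted trend
head(components$seasonal)  # repeating seasonal pattern
head(components$random)    # remainder, ideally close to white noise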
White noise data is uncorrelated across time, with zero mean and constant variance. Rob J Hyndman advises in his book to think of it as completely uninteresting, with no predictable patterns.
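Simulated white noise makes this easy to see:

set.seed(42)
wn <- ts(rnorm(100))   # white noise: zero mean, constant variance, no correlation

plot(wn)               # no trend, no seasonality, no pattern
acf(wn)                # nearly all spikes should stay within the significance bounds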