Time Series and Fitting Regression on Time Series Data
Time series is one of the easiest topics in Data Science if understood conceptually. It is closely related to standard regression. One difference from standard linear regression is that the observations are not necessarily independent and not necessarily identically distributed. A defining characteristic of a time series is that it is a list of observations where the ordering matters. Ordering is important because there is dependency between observations, and changing the order could change the meaning of the data.
A univariate time series is a sequence of measurements of the same variable collected over time. Most often, the measurements are made at regular time intervals. Any time series can be thought of as being made up of four components:
- Secular Trend (or General Trend)
- Seasonal Movements
- Cyclical Movements
- Irregular Fluctuations

The general trend is the long-term effect of some external factor; a trend may show growth or decline over a very long period of time. Seasonality is a short-term movement due to seasonal factors, such as the demand for ice cream in summer. Cyclicity (the business cycle) is a repeating long-term oscillation. Irregularity (randomness) refers to sudden changes that are hard to predict.

The image below shows a time series of quarterly beer production in Australia over 18 years.
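To make these components concrete, here is a minimal sketch (not from the original post) that decomposes a quarterly series into trend, seasonal, and residual parts using statsmodels' seasonal_decompose. The series below is synthetic and purely for illustration, not the actual Australian beer data.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Build a made-up quarterly series: 18 years = 72 quarters.
rng = np.random.default_rng(0)
quarters = pd.period_range("1956Q1", periods=72, freq="Q").to_timestamp()
trend = np.linspace(250, 500, 72)                 # long-term upward trend
seasonal = np.tile([20, -10, -30, 25], 18)        # repeating quarterly pattern
noise = rng.normal(0, 10, 72)                     # irregular fluctuations
series = pd.Series(trend + seasonal + noise, index=quarters)

# period=4 because the seasonal pattern repeats every four quarters.
result = seasonal_decompose(series, model="additive", period=4)
print(result.trend.dropna().head())    # estimated trend component
print(result.seasonal.head())          # estimated seasonal component
```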
Let's look at the components of the above series. (The image is taken from the "Applied Time Series Analysis" course by the Eberly College of Science.) There is an upward trend, which is fairly obvious from the image. Another way of looking for a trend is to draw the mean production line and check whether new data points keep returning to the mean; if they do, there is no trend. There is also seasonality, since we can see a regularly repeating pattern within every year.
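As a quick illustration of that mean-line check, the sketch below (again on a made-up quarterly series, not the actual beer data) plots the series against its overall mean; points drifting ever further from the mean line suggest a trend.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic quarterly series with the same shape as in the previous sketch.
rng = np.random.default_rng(0)
idx = pd.period_range("1956Q1", periods=72, freq="Q").to_timestamp()
series = pd.Series(np.linspace(250, 500, 72)           # upward trend
                   + np.tile([20, -10, -30, 25], 18)   # quarterly pattern
                   + rng.normal(0, 10, 72), index=idx)

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(series.index, series.values, label="quarterly production")
ax.axhline(series.mean(), color="red", linestyle="--", label="overall mean")
ax.set_ylabel("production")
ax.legend()
plt.show()
```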
Fitting a regression model to the above series:
- For a linear trend, use t (the time index) as a predictor variable in a regression.
- For a quadratic trend, we might consider using both t and t².
- For quarterly data with possible seasonal (quarterly) effects, we can define indicator variables such as Sj = 1 if the observation is in quarter j of the year and 0 otherwise. There are four such indicators.
Putting these together, the fitted model has the form x_t = A·t + B·S1 + C·S2 + D·S3 + E·S4 + error, where A through E are the coefficients of the predictors t, S1, S2, S3, S4. RMSE or R-squared can be used for model validation. Note that R-squared is a comparison measure: it tells you how much better the regression fits the data points than a simple average would. More predictors, such as a quadratic trend term t², can be added, and R-squared can then be used to pick the best model; a sketch of such a fit follows.
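Here is a hedged sketch of that fit, using ordinary least squares from statsmodels on a synthetic quarterly series (the data and numbers are assumptions for illustration, not the Australian beer figures). It fits the trend-plus-seasonal-indicator model above, then adds t² and compares the two fits by R-squared.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic quarterly data purely for illustration: 18 years = 72 quarters.
rng = np.random.default_rng(1)
n = 72
t = np.arange(1, n + 1)
quarter = ((t - 1) % 4) + 1
y = 3.0 * t + np.tile([20.0, -10.0, -30.0, 25.0], n // 4) + rng.normal(0, 10, n)

df = pd.DataFrame({"y": y, "t": t})
for j in range(1, 5):
    df[f"S{j}"] = (quarter == j).astype(int)   # quarterly indicator variables

# Linear trend + seasonal indicators: y = A*t + B*S1 + C*S2 + D*S3 + E*S4 + error
# (no separate intercept, since S1..S4 already play that role).
model1 = sm.OLS(df["y"], df[["t", "S1", "S2", "S3", "S4"]]).fit()

# Add a quadratic trend term and compare the two fits by R-squared.
df["t2"] = df["t"] ** 2
model2 = sm.OLS(df["y"], df[["t", "t2", "S1", "S2", "S3", "S4"]]).fit()

print("Fitted coefficients A to E:\n", model1.params)
print("R^2, linear trend:   ", round(model1.rsquared, 3))
print("R^2, quadratic trend:", round(model2.rsquared, 3))
```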
Another beautiful article on anomaly detection is available here:
Anomaly Detection and its types
Read this post to learn about Market Basket Analysis algorithms:
http://machinelearningstories.blogspot.in/2016_11_01_archive.html
And this post to learn about text classification algorithms:
http://machinelearningstories.blogspot.in/2016_08_01_archive.html