**Assumptions of Linear Regression-**

1. Linear: the mean of the response, at each value of the predictor x, is a linear function of x.
2. Independent: the error terms should be independent of each other.
3. Normal: the error terms should be normally distributed.
4. Equal variance: the error terms should have equal variance.

These can be remembered as the LINE assumptions.

A **residuals v/s fitted values** graph can be used to test the assumptions of a linear relationship between Y and x and of equal variance of the error terms.

- The residuals "bounce randomly" around the 0 line. This suggests that the assumption that the relationship is linear is reasonable.
- The residuals roughly form a "horizontal band" around the 0 line. This suggests that the variances of the error terms are equal.
- No one residual "stands out" from the basic random pattern of residuals. This suggests that there are no outliers.
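
The checks above can be sketched numerically. Below is a minimal NumPy example on synthetic data (the data and numbers are purely illustrative): with an intercept in the model, least-squares residuals average to zero, which is the baseline the "bounce randomly around 0" check is built on.

```python
import numpy as np

# Hypothetical data: a roughly linear relationship with noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

# Fit a simple linear regression by least squares.
slope, intercept = np.polyfit(x, y, deg=1)
fitted = slope * x + intercept
residuals = y - fitted

# With an intercept, least-squares residuals sum to ~0;
# plotting residuals vs. fitted should show no pattern here.
print(round(residuals.mean(), 6))
```

Plotting `fitted` against `residuals` (e.g. with matplotlib) gives the residuals-v/s-fitted graph described above.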

A residuals v/s fitted values graph also reveals when the relationship is not linear or the variance of the error terms is not constant.

Relationship is not linear- most of the residuals will not lie near y=0, showing deviation from a linear relationship. If the relationship were perfectly linear, all residuals would be 0.

Error terms don't have constant variance- the residuals grow (or shrink) as the fitted values increase.

Heteroscedastic residuals-

- The plot has a "**fanning**" effect. That is, the residuals are close to 0 for small *x* values and are more spread out for large *x* values.
- The plot has a "**funneling**" effect. That is, the residuals are spread out for small *x* values and close to 0 for large *x* values.
- Or, the spread of the residuals in the residuals vs. fits plot varies in some complex fashion.
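
A crude numerical version of the "fanning" check: compare the spread of residuals at small vs. large fitted values. This is only a sketch on synthetic heteroscedastic data (the noise model and threshold are illustrative assumptions, not a formal test such as Breusch-Pagan).

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 200)
# Noise scale grows with x: this produces a "fanning" pattern.
y = 3.0 * x + rng.normal(scale=0.2 * x)

slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Compare residual spread for the small-x half vs. the large-x half.
low_spread = residuals[:100].std()
high_spread = residuals[100:].std()
print(high_spread > low_spread)  # fanning: spread grows with x
```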

**Residuals v/s order plot-** The error terms should be independent of each other; there should be no autocorrelation or trend in the residuals. If the data are obtained in a time (or space) sequence, a residuals vs. order plot helps to see whether there is any correlation between error terms that are near each other in the sequence.

If we plot the residuals in sequence in a residuals v/s order plot, a regular alternation of signs indicates negative autocorrelation. If we find such correlation, it's time to move from regression to time-series methods.

We can simply draw the distribution of the residuals. A normal distribution has a bell-shaped density curve described by its mean and standard deviation. The density curve is symmetrical, centered about its mean, with its spread determined by its standard deviation. If the residuals follow a normal distribution with mean *µ* and variance *σ*^{2}, then a plot of the theoretical percentiles of the normal distribution versus the observed sample percentiles should be approximately linear.

**Handling Outliers and Multi-collinearity-**

Outliers or influential data points can be identified by Cook's distance or by DFFITS (difference in fits). Both use the same idea for identifying influential points: refit the regression without the i-th observation and see how much the fitted values change. The larger the change, the more influential the i-th observation.

More information - https://onlinecourses.science.psu.edu/stat501/node/340
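
The leave-one-out idea behind Cook's distance can be sketched directly. This is a minimal NumPy illustration on synthetic data with one planted outlier (the data, the +12 offset, and the index are all illustrative assumptions): for each point, refit without it and measure the total change in fitted values, scaled by p·s².

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 30)
y = 1.5 * x + rng.normal(size=30)
y[15] += 12.0  # plant an influential outlier at index 15

def fit_line(xs, ys, grid):
    m, c = np.polyfit(xs, ys, deg=1)
    return m * grid + c

full = fit_line(x, y, x)
p = 2                                         # parameters: slope + intercept
s2 = np.sum((y - full) ** 2) / (len(x) - p)   # residual variance estimate

# Cook's distance for point i: refit without i and measure the
# change in all fitted values, scaled by p * s^2.
cooks = []
for i in range(len(x)):
    mask = np.arange(len(x)) != i
    loo = fit_line(x[mask], y[mask], x)
    cooks.append(np.sum((full - loo) ** 2) / (p * s2))

print(int(np.argmax(cooks)))  # the planted outlier is the most influential
```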

**VIF** (variance inflation factor) is used to identify multi-collinearity in regression. Let's say y = a·x1 + b·x2 + c is the regression equation; then we need to check whether x1 and x2 are correlated. We can build a regression like x1 = p·x2 + q; the R² of this regression measures the strength of the correlation. VIF is nothing but 1/(1 − R²) of the x1 = p·x2 + q equation. If R² is more than 0.8 (equivalently, VIF > 5), we say there is strong correlation.
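
This calculation is straightforward to do by hand. Below is a minimal sketch on synthetic data (the coefficients 0.95 and 0.3 are illustrative assumptions chosen to make x1 and x2 strongly correlated): regress x1 on x2, compute R², and then VIF = 1/(1 − R²).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x2 = rng.normal(size=n)
x1 = 0.95 * x2 + rng.normal(scale=0.3, size=n)  # strongly tied to x2

# Regress x1 on x2 (x1 = p*x2 + q), then compute R².
p, q = np.polyfit(x2, x1, deg=1)
resid = x1 - (p * x2 + q)
ss_res = resid @ resid
ss_tot = (x1 - x1.mean()) @ (x1 - x1.mean())
r2 = 1.0 - ss_res / ss_tot

vif = 1.0 / (1.0 - r2)
print(vif > 5)  # strong collinearity flagged by the VIF > 5 rule
```

Note that R² = 0.8 corresponds exactly to VIF = 1/(1 − 0.8) = 5, which is why the two thresholds are equivalent.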
