Linear regression is a method to predict the value of an outcome variable (Y) depending on one or more input predictor variables (X). The objective of linear regression is to model a continuous variable (Y) as a mathematical function of one or more variable(s) (X), in order to use this regression model to predict Y when only X is known.

This mathematical equation can be generalized as follows:

Y = β1 + β2X + ϵ

where β1 is the intercept and β2 is the slope. Collectively, they are called the regression coefficients. ϵ is the error term, the part of Y that the regression model is unable to explain.

## Simple Linear Regression

In Simple Linear Regression, there is only one predictor variable (X). The predictions of Y when plotted as a function of X form a straight line.

### A Simple Example

Let us take a simple example to understand the concept of regression. Consider the following data:

| X | Y |
|---|---|
| 1 | 4 |
| 3 | 5 |
| 4 | 3 |
| 2 | 2 |
| 5 | 5 |

If we plot the above data, we get the following:

In Linear Regression, we try to find the best-fitting straight line through the points. The best-fitting line is called a regression line.

In the above plot, the black line is the regression line. The most frequently used criterion for the best fitting line is the line which minimizes the sum of the squared errors of prediction. The error of prediction for a point is the value of the point minus the predicted value. The predicted value is simply the value on the regression line.

For example, for the point where x=3, y=5, the predicted value is 3.8, that is, the point on the line corresponding to x=3 is 3.8, but the actual y value is 5. So, the error of prediction is 1.2.

The regression line is the line that minimizes the sum of the squared errors of prediction over all the points. The mean squared error (MSE) is the mean of the squared errors of prediction, and the root mean squared error (RMSE) is the square root of the MSE. In the above example, the MSE is 1.18 and the RMSE is about 1.086.
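The fit and error computation above can be sketched in plain Python (standard library only; the variable names are my own):

```python
# Fit the least-squares line to the example data and report MSE and RMSE.
import math

X = [1, 3, 4, 2, 5]
Y = [4, 5, 3, 2, 5]
n = len(X)

mean_x = sum(X) / n
mean_y = sum(Y) / n

# Slope that minimizes the sum of squared errors: Sxy / Sxx
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
sxx = sum((x - mean_x) ** 2 for x in X)
b = sxy / sxx               # slope
a = mean_y - b * mean_x     # intercept

predicted = [a + b * x for x in X]
errors = [y - p for y, p in zip(Y, predicted)]
mse = sum(e ** 2 for e in errors) / n   # mean of squared errors
rmse = math.sqrt(mse)                   # root of MSE

print(f"y = {a:.1f} + {b:.1f}x, MSE = {mse:.3f}, RMSE = {rmse:.3f}")
```

For this data the fitted line works out to y = 2.9 + 0.3x.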

To compute the regression line, a few concepts of statistics are used. Let MX be the mean of X, MY be the mean of Y, sX be the standard deviation of X, sY be the standard deviation of Y, and r be the correlation between X and Y.

The slope of the regression line is calculated as:

b = r*(sY/sX)

And the intercept of the regression line is calculated as:

A = MY - b*MX
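These two formulas can be sketched with Python's standard `statistics` module (the correlation is computed by hand from the deviations to keep the sketch dependency-free; the names are my own):

```python
import statistics

X = [1, 3, 4, 2, 5]
Y = [4, 5, 3, 2, 5]
n = len(X)

MX, MY = statistics.mean(X), statistics.mean(Y)
sX, sY = statistics.stdev(X), statistics.stdev(Y)   # sample standard deviations

# Pearson correlation r, from the sum of products of deviations
r = sum((x - MX) * (y - MY) for x, y in zip(X, Y)) / ((n - 1) * sX * sY)

b = r * (sY / sX)   # slope
A = MY - b * MX     # intercept
print(f"b = {b:.2f}, A = {A:.2f}")
```

This reproduces the same line as the direct least-squares computation: b = 0.30, A = 2.90.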

Nowadays, many statistical software packages can compute the regression line.

## Correlation

Correlation is a statistical measure of the degree of linear dependence between two variables. It takes values between -1 and +1. A correlation close to +1 or -1 suggests a strong linear relationship between the variables, whereas a correlation close to 0 indicates a weak one. A low correlation (-0.2 < r < 0.2) usually suggests that most of the variation in the response variable (Y) is not explained by the predictor (X).
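For the small example data used earlier, the Pearson correlation can be sketched as follows (plain Python; the names are my own):

```python
import math

X = [1, 3, 4, 2, 5]
Y = [4, 5, 3, 2, 5]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n

# Pearson correlation: sum of products of deviations over the
# geometric mean of the two sums of squared deviations
sxy = sum((x - mx) * (y - my) for x, y in zip(X, Y))
sxx = sum((x - mx) ** 2 for x in X)
syy = sum((y - my) ** 2 for y in Y)
r = sxy / math.sqrt(sxx * syy)

print(f"r = {r:.3f}")   # a weak-to-moderate positive relationship
```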

## Statistical Significance

We assess our linear regression model's statistical significance by looking at the p-value and the t-value. Usually, we consider a linear model statistically significant when these p-values are less than a pre-determined significance level, commonly 0.05.

Every p-value has a null and an alternative hypothesis associated with it, which help to analyze the model. In linear regression, the null hypothesis is that the coefficients associated with the variables are equal to zero. The alternative hypothesis is that the coefficients are not equal to zero (i.e., there exists a relationship between the independent variable in question and the dependent variable).

A larger t-value provides stronger evidence that the true coefficient is not zero, i.e., that the observed relationship is unlikely to be due to chance alone. So, the higher the t-value, the better. Pr(>|t|), the p-value, is the probability of obtaining a t-value as large as or larger than the observed one when the null hypothesis (that the β coefficient is zero, meaning there is no relationship) is true. So, if Pr(>|t|) is low, the coefficient is significant (significantly different from zero); if Pr(>|t|) is high, the coefficient is not significant.

When the p-value is less than the significance level (< 0.05), we can safely reject the null hypothesis that the coefficient β of the predictor is zero.
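As an illustration, the t-value for the slope in the earlier toy example can be sketched by hand: it is the estimated slope divided by its standard error. (Converting the t-value into a p-value requires the t distribution with n - 2 degrees of freedom, e.g. from a statistics package, so it is omitted here; all names are my own.)

```python
import math

X = [1, 3, 4, 2, 5]
Y = [4, 5, 3, 2, 5]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n

sxx = sum((x - mx) ** 2 for x in X)
b = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sxx  # slope
a = my - b * mx                                           # intercept

# Residual sum of squares, then the standard error of the slope
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))
se_b = math.sqrt((sse / (n - 2)) / sxx)

t = b / se_b
print(f"t = {t:.3f}")
```

A t-value this small would not let us reject the null hypothesis for such a tiny dataset.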

## Real Life Example

Let us consider an example where we have the class tenth and class twelfth marks of 30 students. We want to check whether there is any relation between the class tenth and class twelfth marks. The scatterplot of our data looks like:

Applying Linear regression, we get the plot as:

For the above regression model, the R² value is 0.631, which indicates a moderate relationship between class tenth and class twelfth marks.

The values of MSE and RMSE are 84.132 and 9.172 respectively.

## Applications of Linear Regression

Linear regression is an important tool for anticipating possible relationships between variables in fields such as the biological, social, and behavioral sciences, where it is one of the most widely used techniques. Linear regression is also used in finance and economics: it is used to predict consumer spending, inventory investment, spending on imports, and numerous other financial statistics, and it is widely used in the capital asset pricing model. It is also used in trend line analysis, epidemiology, and other areas.

The applications of linear regression are ever increasing as data availability grows in every field. With more data being collected across disciplines, regression can be applied to assess the significance of various variables.
