Linear Regression

Introduction to Simple Linear Regression

Regression analysis is most often used for prediction. The goal in regression analysis is to create a mathematical model that can be used to predict the values of a dependent variable based upon the values of an independent variable. In other words, we use the model to predict the value of Y when we know the value of X. (The dependent variable is the one to be predicted.) Correlation analysis is often used alongside regression analysis because it measures the strength of association between the two variables X and Y.

In regression analysis involving one independent variable and one dependent variable, the values are frequently plotted in two dimensions as a scatter plot. The scatter plot allows us to visually inspect the data prior to running a regression analysis. This step often shows whether the relationship between the two variables is increasing or decreasing, but it gives only a rough idea of the relationship. The simplest relationship between two variables is a straight-line, or linear, relationship. Of course the data may well be curvilinear, and in that case we would have to use a different model to describe the relationship (we will deal only with linear relationships for now). Simple linear regression analysis finds the straight line that best fits the data.

Suppose you have the following data for a particular bird species and you want to use this data to predict the age of a newly caught bird that has a wing length of 4.0 centimeters.

 Wing length (cm)    Age (days)
       1.5               4.0
       2.2               5.0
       3.1               8.0
       3.2               9.0
       3.2              10.0
       3.9              11.0
       4.1              12.0
       4.7              14.0
       5.2              16.0

Enter the data above into the regression applet below to see a scatter plot of the data. To do this, click on a cell in the table and enter the appropriate number. Once all the pairs of numbers have been correctly added to the table, click the "Plot the Data" button to see the scatter plot. The applet also draws the best fit regression line through the data and prints the slope and Y-intercept across the top of the table.

Scatter Plot and Regression Applet

Since we now know the values for "slope" and "Y-intercept" and we have the value of wing length (x) = 4.0 cm, we can now use the equation for the best fit line to predict the age of the bird.

Y = (3.33)(4.0) + (-1.613) = 11.7 days
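The same arithmetic can be written as a short Python sketch. The slope and Y-intercept are the values quoted above; the function name predict_age is only an illustrative choice, not something taken from the applet.

```python
# Slope and Y-intercept as reported above for the bird data.
slope = 3.33
intercept = -1.613


def predict_age(wing_length_cm):
    """Predicted age in days from wing length: Yhat = slope * X + intercept."""
    return slope * wing_length_cm + intercept


predicted = predict_age(4.0)
print(f"{predicted:.1f} days")  # prints "11.7 days"
```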

Our regression model predicts the age of our bird to be 11.7 days based on our best fit line. So how is the best fit line produced? Generally, the concept of least squares is used to draw the best fit line.

The Least Squares Method

Each of our data points is represented by an X and a Y. Each point on the best fit line is represented by an X and a Yhat, the value of Y that the line predicts for that X. Our data points don't always fall exactly on the best fit line and, therefore, Y does not always equal Yhat. The least squares method uses the vertical deviation of each data point from the best fit line (i.e., the deviation Y - Yhat). The best fit line is the one that gives the smallest value for the sum of the squares of the deviations between Y and Yhat. In other words, we want to minimize the quantity

$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

where the sum runs over all n data points.

The sum of the squares of these deviations is called the residual sum of squares or sometimes the error sum of squares. This method produces the best line possible given the fact that we have only a subset of all possible data.
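To make the idea concrete, the sketch below computes the residual sum of squares for the bird data under three candidate lines: the fitted line quoted earlier (slope 3.33, Y-intercept -1.613) and two arbitrary alternatives chosen here purely for comparison. The function and variable names are illustrative assumptions.

```python
# Bird data from the table above.
wing_length = [1.5, 2.2, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 5.2]
age = [4.0, 5.0, 8.0, 9.0, 10.0, 11.0, 12.0, 14.0, 16.0]


def residual_sum_of_squares(slope, intercept, xs, ys):
    """Sum of squared vertical deviations (Y - Yhat) from the given line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# The fitted line versus two arbitrary alternatives (assumed for illustration).
fitted = residual_sum_of_squares(3.33, -1.613, wing_length, age)
steeper = residual_sum_of_squares(4.0, -1.613, wing_length, age)
shifted = residual_sum_of_squares(3.33, 0.0, wing_length, age)

print(f"fitted line:  {fitted:.2f}")
print(f"steeper line: {steeper:.2f}")
print(f"shifted line: {shifted:.2f}")
```

The fitted line should give the smallest of the three sums; changing either the slope or the Y-intercept moves the line away from the points and the sum grows.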

Now that we have the best fit line, we need to find the equation for this line. To do that we need to know the slope and the Y-intercept. These parameters have to be estimated from the subset of the data that we have (the more complete the data set, the better the estimates). To estimate these parameters, follow the steps below:

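1. Arrange the data into X, Y pairs (as in the bird data table above) and count the number of pairs, n.
2. Compute the mean of all of the X values and the mean of all of the Y values.
3. Compute the sum of the X² values by squaring each value of X and adding the squares.
4. Compute the sum of the Y² values in the same manner (this sum is not needed for the line itself, but it is used for the correlation coefficient).
5. Compute the sum of each X value multiplied by its corresponding Y value.
6. Calculate the slope (b) of the line as:

$$b = \frac{\sum XY - n\,\bar{X}\,\bar{Y}}{\sum X^2 - n\,\bar{X}^2}$$

7. Calculate the Y-intercept (a), the value of Yhat where X = 0, by the formula:

$$a = \bar{Y} - b\,\bar{X}$$

8. Place the slope and Y-intercept values into the equation for a line: Yhat = bX + a.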
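The steps above can be sketched in Python as follows. This is a minimal illustration, assuming the standard least squares formulas b = (sum of XY - n * mean X * mean Y) / (sum of X² - n * (mean X)²) and a = mean Y - b * mean X; the function name fit_line is hypothetical.

```python
# Step 1: the bird data arranged as X, Y pairs.
wing_length = [1.5, 2.2, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 5.2]
age = [4.0, 5.0, 8.0, 9.0, 10.0, 11.0, 12.0, 14.0, 16.0]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n                           # step 2
    mean_y = sum(ys) / n
    sum_x2 = sum(x * x for x in xs)                # step 3
    sum_xy = sum(x * y for x, y in zip(xs, ys))    # step 5
    # Step 6: slope (assumed standard least squares formula).
    slope = (sum_xy - n * mean_x * mean_y) / (sum_x2 - n * mean_x ** 2)
    # Step 7: Y-intercept.
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(wing_length, age)
# Step 8: the equation of the line is Yhat = slope * X + intercept.
print(f"slope = {slope:.2f}, intercept = {intercept:.3f}")
# prints "slope = 3.33, intercept = -1.613"
```

Step 4 (the sum of the squared Y values) does not appear here because it is not needed for the slope or the Y-intercept.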

Correlation

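Recall that regression analysis tells us about the linear dependence of one variable on a second variable. Correlation analysis, in contrast, measures the degree of association between two variables. The correlation coefficient r (sometimes also called the product-moment correlation coefficient) is calculated by:

$$r = \frac{\sum XY - n\,\bar{X}\,\bar{Y}}{\sqrt{\left(\sum X^2 - n\,\bar{X}^2\right)\left(\sum Y^2 - n\,\bar{Y}^2\right)}}$$

where n is the number of data pairs and $\bar{X}$ and $\bar{Y}$ are the means of the X and Y values.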

The correlation coefficient is unitless and lies between -1 and +1. In general, the closer the correlation coefficient is to +1 or -1, the stronger the linear association between the two variables X and Y; a value near 0 indicates little or no linear association.
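As a rough illustration, the correlation coefficient for the bird data can be computed with the sketch below. It assumes the usual product-moment formula built from the sums of X², Y², and XY; the function name correlation is an illustrative choice.

```python
import math

# Bird data from the table above.
wing_length = [1.5, 2.2, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 5.2]
age = [4.0, 5.0, 8.0, 9.0, 10.0, 11.0, 12.0, 14.0, 16.0]


def correlation(xs, ys):
    """Product-moment correlation coefficient of paired values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Corrected sums of products and squares.
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * mean_x * mean_y
    s_xx = sum(x * x for x in xs) - n * mean_x ** 2
    s_yy = sum(y * y for y in ys) - n * mean_y ** 2
    return s_xy / math.sqrt(s_xx * s_yy)


r = correlation(wing_length, age)
print(f"r = {r:.3f}")
```

For the bird data this gives a value close to +1 (roughly 0.99), which agrees with the tight upward trend of wing length against age.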

This page was created as part of the Mathbeans Project. The java applets were created by David Eck and modified by Jim Ryan. The Mathbeans Project is funded by a grant from the National Science Foundation DUE-9950473.