Wednesday, November 11, 2015

Regression and Regression Constants....

Regression: It is prediction, deriving an equation for predicting one variable from the other.
Algebraically, error of prediction = (Y − Ŷ), the difference between the observed and the predicted score.

Least squares linear regression: In a cause and effect relationship, the independent variable (X) is the cause, and the dependent variable (Y) is the effect. Least squares linear regression is a method for predicting the value of a dependent variable Y, based on the value of an independent variable X.

The Least Squares Regression Line: Linear regression finds the straight line, called the least squares regression line, that best represents the observations in a bivariate data set and tells us how much better our prediction is. Suppose Y is a dependent variable and X is an independent variable; the regression line can be written as:
Y = bX + a
In this linear equation, ‘b’ is the slope (the regression coefficient) and ‘a’ is the Y-intercept of the regression line.

Line of Best Fit (Least Squares Method): Minimizes the squared differences (squared deviations) between Y and Ŷ (because we can have more than one observed Y value for a single X value). This method is a more accurate way of finding the line of best fit. A line of best fit is a straight line that is the best approximation of the given set of data. It is used to study the nature of the relation between two variables.
A line of best fit can be roughly determined using an eyeball method by drawing a straight line on a scatterplot so that the number of points above the line and below the line is about equal (and the line passes through as many points as possible).

Steps to find the equation of line of best fit:
1.      Calculate the mean of the x-values and the mean of the y-values.
2.      Find the slope (b) of the line of best fit.
3.      Compute the Y-intercept of the line using the formula a = ȳ − b·x̄.
4.      Write the equation of the line (Y = bX + a); see the sketch below.
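A minimal sketch of these four steps in plain Python. The data values are made up for illustration, and the deviation-score formula used for the slope is algebraically equivalent to b = rxy (SDy / SDx) given later:

```python
from statistics import mean

x = [1, 2, 3, 4, 5]   # predictor (X) scores, made up
y = [2, 4, 5, 4, 5]   # outcome (Y) scores, made up

# Step 1: the mean of the x-values and the mean of the y-values
x_bar, y_bar = mean(x), mean(y)

# Step 2: the slope, b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
    sum((xi - x_bar) ** 2 for xi in x)

# Step 3: the Y-intercept, a = y_bar - b * x_bar
a = y_bar - b * x_bar

# Step 4: write the equation of the line
print(f"Y = {b:.2f}X + {a:.2f}")   # Y = 0.60X + 2.20 for this data
```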

Properties of Regression Line: When the regression parameters (a & b) are defined by the equation above, the regression line has the following properties:
  1. The difference between the obtained and predicted values (Y − Ŷ) is called an error of prediction, or residual. We want to find the line that minimizes the squared differences between Y and Ŷ; it is known as the least squares regression line, and the approach is called least squares regression.
  2. Two important measures of the size of an effect in regression are r2 and r.
  3. The regression line passes through the mean of the X values (x̄) and through the mean of the Y values (ȳ); that is, it passes through the centroid of the data.
  4. The regression constant (a) = Y-intercept of the regression line.
  5. To use the regression equation technique described in the text, we must have a logical pairing of the scores on the two variables and a linear relationship between them.

The intersection of the two means (x̄, ȳ) falls on the regression line.

Regression coefficient (b, slope, nonstandardized): The amount of change in Y for a one-unit change in X, or the rate at which Y changes with a change in X. The larger the value (size) of the regression coefficient, the steeper the slope. It is ‘b’ which is a measure of how strongly each predictor variable influences the criterion (outcome) variable.
byx = rxy (SDy / SDx); when Y is the outcome and X the predictor
And,           bxy = rxy (SDx / SDy); when X is the outcome and Y the predictor

The standardized coefficient (beta, described further below), by contrast, is measured in units of standard deviations. For example, a beta value of 2.5 indicates that a change of one standard deviation in the predictor variable will result in a change of 2.5 standard deviations in the outcome (criterion) variable.

1.     ‘b’ = 1.08 means that when X increases by 1 point, the outcome increases by 1.08 points.
2.     Multiplying the two slopes b y·x and b x·y leaves the square of the correlation coefficient (b y·x × b x·y = r²), which tells us the percentage of variance the two variables share, i.e., the percentage accuracy in predicting Y. It is better to know the correlation coefficient before predicting outcomes. (A sketch checking this identity follows the list.)
3.     If b coefficient is positive, the relationship of predictor variable with dependent variable is positive (e.g., the greater the IQ the better the grade point average) and if b coefficient is negative then the relationship is negative (e.g., the lower the class size the better the average test scores).
4.     If b coefficient is equal to 0 then there is no relationship between the variables.
5.     ‘b’ can be of any size, but its sign always matches that of r (when b = +, r = +, and when b = −, r = −; b = + with r = −, or vice versa, is not possible).
6.     Many lines may have the same slope (b), but they cannot also share the same intercept ‘a’ (‘a’ uniquely identifies a line among those with the same slope).
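A small Python sketch (made-up data) checking point 2, that the product of the two slopes equals r²:

```python
from statistics import mean, stdev

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar, y_bar = mean(x), mean(y)
cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
r = cov / (stdev(x) * stdev(y))    # Pearson correlation coefficient

b_yx = r * stdev(y) / stdev(x)     # slope when predicting Y from X
b_xy = r * stdev(x) / stdev(y)     # slope when predicting X from Y

# The product of the two slopes equals the squared correlation
print(round(b_yx * b_xy, 4), round(r ** 2, 4))   # 0.6 0.6
```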

Beta coefficient (β, standardized regression coefficient): It is the change in Y, in standard deviations, for a one standard deviation change in X. It is the slope of the regression line when both the X and Y variables have been converted to standardized z-scores. Thus, the higher the beta value, the greater the impact of the predictor variable on the criterion variable.
1.     When we have only one predictor variable in our model, then
Beta (β) = rxy (a sketch checking this follows the list).
2.     When we have more than one predictor variable, we cannot compare the contribution of each predictor variable by simply comparing the correlation coefficients. The beta coefficients allow us to make such comparisons and to assess the strength of the relationship between each predictor variable and the criterion variable.
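A short Python sketch (made-up data) checking point 1: with a single predictor, the slope of the least squares line fitted to z-scores (the beta) equals rxy:

```python
from statistics import mean, stdev

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def z_scores(values):
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

zx, zy = z_scores(x), z_scores(y)

# Slope of the least squares line fitted to the z-scores (the beta)
beta = sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)

# Pearson r computed directly as the average product of z-scores
r = sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)

print(round(beta, 4), round(r, 4))   # both come out to about 0.7746
```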

Interpreting regression constants:
The regression coefficient (b) is the average change (increase or decrease, depending on a positive or negative b) in the outcome variable (Y) for a 1-unit change in the predictor variable (X). The slope (b coefficient) is a measure of how strongly each predictor variable influences the criterion (outcome) variable: the higher the value, the greater the impact of the predictor variable on the outcome variable. The relationship is positive if the b coefficient is positive, and vice versa.

Regression constant ‘a’ (Y-intercept): It anchors our line.
The constant term ‘a’ (regression constant) is the value at which the fitted line (line of best fit) crosses the Y-axis. It is used as a ‘correction factor’ when using particular values of the X's to predict Y. If we don’t include the constant, the regression line is forced to go through the origin, which means the predicted outcome must be zero when all of the predictors are zero.



Scatterplot (what is it and why is it required?)..

Scatterplot: It is a graphical (pictorial) representation of the linear correlation between two continuous (interval / ratio) variables, predictor and outcome, plotted on the X and Y axes respectively.

Why Scatterplot?
1.     Because describing a relationship through a single number is not enough, we need to look at the relationship in a scatterplot, i.e., how points vary (bunch up) around the regression line.
2.     The graph (scatterplot) also helps in detecting heteroscedasticity (unequal variability of points around the regression line) in a relationship, which can be rectified through transformation.
3.     The graph (scatterplot) tells us about outliers in the data, which are threats to the interpretation and may need to be omitted.
4.     The scatterplot reveals situations where the correlation is curvilinear. Simply reporting r = 0 might be misleading because a relationship may still exist which is not a linear one.

Two Characteristics of a scatter-plot:
1.     The slope of the scatter-plot, and

2.     The degree to which the points in the scatter-plot cluster around an imaginary line representing the slope.
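A minimal plotting sketch, assuming matplotlib is installed; the data are made up, and the red line is the least squares regression line computed with the formulas given earlier:

```python
import matplotlib.pyplot as plt
from statistics import mean

x = [1, 2, 3, 4, 5, 6, 7, 8]   # predictor scores, made up
y = [2, 3, 3, 5, 4, 6, 7, 6]   # outcome scores, made up

# Least squares slope and intercept, as in the earlier sketch
x_bar, y_bar = mean(x), mean(y)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
    sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

plt.scatter(x, y)                                   # the bivariate points
plt.plot(x, [b * xi + a for xi in x], color="red")  # the regression line
plt.xlabel("X (predictor)")
plt.ylabel("Y (outcome)")
plt.title("Scatterplot with least squares regression line")
plt.show()
```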


Covariance: It is a measure of the degree to which two random variables (X, Y) change together.
Covxy = Σ(X − X̄)(Y − Ȳ) / (n − 1), i.e., the sum of the products of deviations divided by (n − 1).
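A minimal sketch of this formula in plain Python (made-up data):

```python
from statistics import mean

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar, y_bar = mean(x), mean(y)
# Sum of the products of deviations, divided by (n - 1)
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
print(cov_xy)   # 1.5 for this data set
```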

r2 (Coefficient of determination):
Indicates how well data points fit a line or curve. It is mainly used in models to predict future outcomes or test hypotheses on the basis of other related information.

The squared correlation coefficient (r2) is the proportion of variance in Y that can be accounted for by knowing X; conversely, it is the proportion of variance in X that can be accounted for by knowing Y. Put differently, it is the statistic that indicates what percentage of the variation in the outcome (dependent) variable is ‘explained by’ changes in the predictor (independent) variable. This shared / common variance tells us how much of the relationship can be explained by the regression (predicted), with the rest left unexplained; it is thus an indicator of the accuracy of a prediction. We can also call it the proportion of the variance in the outcome variable (Y) that is predictable from the predictor variable (X), or the fraction of the variation in Y that is explained by the least squares regression of Y on X.

1.     The coefficient of determination ranges from 0 to 1 (proportion or percentage, cannot be more than 1 or 100% respectively).
2.     It is important to note that a high coefficient of determination does not guarantee that a cause-and-effect relationship exists. However, a strong cause-and-effect relationship between the independent variable and the dependent variable will typically result in a high coefficient of determination.
3.     An R2 of 0 means that the dependent variable cannot be predicted from the independent variable.
4. An R2 of 1 means the dependent variable can be predicted without error from the independent variable.
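A short Python sketch (reusing the earlier made-up data) computing r and r², illustrating the proportion of shared variance:

```python
from statistics import mean, stdev

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar, y_bar = mean(x), mean(y)
cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
r = cov / (stdev(x) * stdev(y))

# r^2 is the proportion of variance in Y accounted for by knowing X
print(f"r = {r:.4f}, r^2 = {r ** 2:.4f}")   # r = 0.7746, r^2 = 0.6000
```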


Threats to Correlation Coefficient...

Threats to PPMCC (Factors affecting correlation):
1.     Outliers: data points that stand out in the scatterplot do not contribute to the relationship and may need to be omitted (in a small group, an outlier has a large effect). Outliers do not go along with the linear relationship (e.g., a student who performed poorly on exam 3 but outstandingly on exam 4). The value of r will typically increase if the outliers are removed, giving a more accurate depiction of the relationship between the predictor and the outcome measures, because outliers are threats to the PPMCC and decrease the r value. So, on removing outliers, the data show a better linear relationship and thus an increase in the r value.
2.     Combined groups: Two groups may not show any relationship individually but together may show a strong relationship.
3.     Extreme groups: There may be no relationship (r = 0) between two variables within the low performing group and within the high performing group individually, but together they show a relationship.

4.     Range restriction (truncated range): If we restrict the range (e.g., accept only high SAT scorers), we will not be able to see how low SAT scorers would do in college, so the observed relationship is going to be lower. A truncated range lowers the correlation (see the simulation sketch after this list).

5.     Nonlinear relationship (curvilinear relationship): is not captured by the PPMCC.
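A small simulation sketch of threat 4 in Python; the synthetic data, the seed, and the cutoff at x > 0 are arbitrary choices for illustration:

```python
import random
from statistics import mean, stdev

def pearson_r(x, y):
    n = len(x)
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    return cov / (stdev(x) * stdev(y))

random.seed(1)                               # reproducible draw
x = [random.gauss(0, 1) for _ in range(1000)]
y = [xi + random.gauss(0, 1) for xi in x]    # a genuine linear relation

full_r = pearson_r(x, y)

# Keep only the upper half of the X range (e.g., only high SAT scorers)
kept = [(xi, yi) for xi, yi in zip(x, y) if xi > 0]
trunc_r = pearson_r([p[0] for p in kept], [p[1] for p in kept])

# The truncated-range r comes out noticeably lower than the full-range r
print(f"full range r = {full_r:.2f}, truncated range r = {trunc_r:.2f}")
```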


Assumptions Underlying the Pearson Product Moment Correlation Coefficient....

Assumptions Underlying the PPMCC: The PPMCC is appropriate when three conditions exist: the underlying measurement scales for the variables being correlated must be interval or ratio level, the scores for each variable should be normally distributed (no skewness should be there), and the relationship between the two variables should be fundamentally linear.

1.     The underlying measurement scales for the variables being correlated must be interval or ratio level (i.e., they are continuous).
Examples of variables that meet this criterion include revision time (measured in hours), intelligence (measured using IQ score), exam performance (measured from 0 to 100), weight (measured in kg), and so forth.
2.     Bivariate normality (bivariate normal distribution): Scores for each variable should be normally distributed; there should be no skewness, neither positive nor negative.
3.     Relationship between the two variables should be fundamentally linear.
4.     There should be no significant outliers (single data points within the data that do not follow the usual pattern).
5.     Homoscedasticity: It means the variance around the regression line should be the same for all values of the predictor variable (X). A relation is called heteroscedastic when the spread of points around the regression line differs across values of X. Homoscedasticity is violated when there is much more variability (points scattered away from the regression line) around the regression line for some values of X than for others.

Serious violations of homoscedasticity (assuming a distribution of data is homoscedastic when in actuality it is heteroscedastic) result in underestimating the Pearson coefficient. Assuming homoscedasticity means assuming that the variance is fixed throughout the distribution.

Heteroscedasticity is caused by non-normality of one of the variables, by an indirect relationship between the variables, or by the effect of a data transformation. Heteroscedasticity is not fatal to an analysis: the analysis is weakened, not invalidated. Heteroscedasticity is detected with scatterplots and is rectified through transformation (a minimal check is sketched below).
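A minimal Python check, using synthetic data where the noise is made to grow with X on purpose: comparing the residual spread in the lower and upper halves of X gives a rough, informal test of homoscedasticity:

```python
import random
from statistics import mean, stdev

random.seed(2)
x = [random.uniform(0, 10) for _ in range(500)]
# The noise grows with X, so the data are heteroscedastic by construction
y = [2 * xi + random.gauss(0, 0.5 + 0.5 * xi) for xi in x]

# Residuals from the least squares line
x_bar, y_bar = mean(x), mean(y)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
    sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar
resid = [yi - (b * xi + a) for xi, yi in zip(x, y)]

# Compare residual spread in the lower and upper halves of X;
# clearly unequal spreads indicate heteroscedasticity
low = [r for xi, r in zip(x, resid) if xi < 5]
high = [r for xi, r in zip(x, resid) if xi >= 5]
print(f"residual SD, low X: {stdev(low):.2f}; high X: {stdev(high):.2f}")
```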


Pearson Product Moment Correlation Coefficient....

PPMCC (only describes linear relation between two continuous (interval / ratio scale) variables):
[Correlation coefficient (rxy) is a statistic]

The correlation coefficient (rxy) is a measure of the strength of association between two continuous (interval / ratio scale) variables.  It reflects how closely scores on two continuous variables go together.  The more closely two variables go together, the stronger the association between them and the more extreme the correlation coefficient.
Mathematically, the PPMCC (rxy) is defined as the ratio of the covariance of two continuous variables to the product of their standard deviations. It measures the strength of the linear relationship between normally distributed variables. When the variables are not normally distributed or the relationship between the variables is not linear, it may not be the appropriate method (the Spearman rank correlation would be more appropriate).

rxy = Covxy / (Sx Sy) = [Σ(X − X̄)(Y − Ȳ) / (n − 1)] / (Sx Sy)
The sum of the products of deviations is divided by n − 1, and dividing by the standard deviations standardizes the two variables against their variability (equalizing the contribution of both).
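A small Python sketch of this formula, which also shows that rxy is unit-free (rescaling a variable, e.g., inches to centimeters, leaves r unchanged, as point 8 below notes); the height/weight numbers are made up:

```python
from statistics import mean, stdev

def pearson_r(x, y):
    n = len(x)
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    return cov / (stdev(x) * stdev(y))

height_in = [60, 62, 65, 70, 72]       # heights in inches, made up
weight_lb = [110, 120, 150, 160, 170]  # weights in pounds, made up

print(round(pearson_r(height_in, weight_lb), 4))
# Rescaling X (inches -> centimeters) leaves r unchanged: r has no units
height_cm = [h * 2.54 for h in height_in]
print(round(pearson_r(height_cm, weight_lb), 4))   # same value as above
```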

Important points about PPMCC:
1.     rxy is a scaled (normalized) measure of the covariance.
2.     It shows the correspondence between the rank orders and is used to establish the validity and reliability of an instrument. A high rxy means the rank orders of the two variables are close to each other, and vice versa (disruption in rank means a low r).
3.     Covariance (Covxy, the variance of both variables together) is different from the coefficient of determination (r2, the proportion of variance in Y that can be accounted for by knowing X).
4.     rxy gives a combined ranking of two variables (interval / ratio scale); it effectively compares the rank orders of the two variables.
5.     The correlation coefficient (rxy) indicates magnitude (0 to 1), or intensity, and direction (negative or positive). If the data points fall in a random pattern, the correlation is equal to zero.
6.     The outcome variable is called the response or dependent variable (Y), and risk factors and confounders are called the predictors, or explanatory or independent variables (X).
7.     It does not make any difference which variable is plotted on which axis as long as no prediction is to be made. But if a prediction is to be made, then, by convention, the predictor variable is plotted on the x-axis and the outcome variable on the y-axis.
8.     X and Y can be measured on entirely different scales. A change in scale does not hamper the correlation because the PPMCC does not depend on scales (rxy has no unit of its own; it is a ratio), as the sketch above demonstrates.
9.     The Pearson correlation coefficient, r, does not represent the slope of the line of best fit; it only shows the direction of the relationship, uphill or downhill.
10.     rxy has nothing to do with mean differences. The same means for two sets of scores tell us nothing about the relationship (rxy).
11.     rxy = ryx (the scatterplot will be the same, but the slope ‘b’ will change).
12.     Every correlation (rxy) has two slopes and two intercepts (one when Y is a function of X and one when X is a function of Y).
13.     A correlation of 0 does not mean zero relationship between two variables; rather, it means zero linear relationship. (It is possible for two variables to have zero linear relationship and a strong curvilinear relationship at the same time; see the sketch after this list.)
14.     Correlation does not imply causation: two variables may be related to each other, but this doesn’t mean that one variable causes the other.
15.     The two variables must be logically paired, and related through a linear equation, for them to show a linear correlation between X and Y.
16.     A larger sample size makes the correlation more stable and is a good reason to trust the correlation. A small sample size does not provide an accurate picture of the correlation; a single outlier can make a huge difference in the correlation.
17.     Correlation can be understood by various means: scatterplots, the slope of the regression line, and the variance interpretation (the squared correlation coefficient, r2, is the proportion of variance in Y that can be accounted for by knowing X; conversely, it is the proportion of variance in X that can be accounted for by knowing Y).
18.     The correlation coefficient is the slope (b) of the regression line when both the X and Y variables have been converted to z-scores. The larger the size of the correlation coefficient, the steeper this slope.
19.     A linear relationship is described by statements such as: for every one-point increase in one variable, you get a four-point increase in the other variable.
20.     A PPMCC is appropriate to describe a relationship such as: when X increases, Y decreases by the same amount.
21.     The Pearson Product Moment Correlation can be used to express the degree of relationship for statements such as:
1.     For every extra year of growth in a pine forest, you can expect an increase of 10,000 board feet; and
2.     Strenuous exercise results in large weight loss, moderate exercise maintains weight at current levels, and no exercise produces gains in weight.
22.     The higher the correlation between X and Y, the more accurate the resulting predictions are.
23.     We can have a strong relationship between two variables but still a low correlation coefficient when the relationship is non-linear or the range is truncated (cut off).
24.     We cannot say that the correlation coefficient is improper for data where r = 0, as r alone does not tell us; we have to inspect the scatterplot.
25.     Potential problems with the Pearson correlation: the PPMCC is not able to tell the difference between dependent and independent variables. For example, if we are trying to find the correlation between a high calorie diet and diabetes, we might find a high correlation of .8. However, we could also work out the correlation coefficient with the variables switched around; in other words, we could say that diabetes causes a high calorie diet, which obviously makes no sense.
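A tiny Python sketch of point 13: a perfect curvilinear relationship (a parabola over a symmetric range of X) still yields a linear correlation of exactly zero:

```python
from statistics import mean, stdev

x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]    # Y is completely determined by X (a parabola)

n = len(x)
x_bar, y_bar = mean(x), mean(y)
cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
r = cov / (stdev(x) * stdev(y))
print(r)   # 0.0 -- zero linear relationship despite a perfect curve
```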

Guilford’s Interpretation:
< 0.20 – Slight, almost negligible relationship
0.20 – 0.40 – Low (weak) correlation, definite but small relationship
0.40 – 0.70 – moderate correlation, substantial relationship
0.70 – 1.00 – very high (strong) correlation, very dependable relationship (the rank orders are close to each other; scores on one variable grow with the other)
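As an illustration only, a hypothetical helper function encoding Guilford's rule of thumb (the wording follows the list above; assigning boundary values to the higher band is my own choice):

```python
def guilford_label(r: float) -> str:
    """Map |r| to Guilford's verbal label. Boundary values are assigned
    to the higher band here; the labels are rough guides, not exact cutoffs."""
    strength = abs(r)
    if strength < 0.20:
        return "slight, almost negligible relationship"
    if strength < 0.40:
        return "low (weak) correlation, definite but small relationship"
    if strength < 0.70:
        return "moderate correlation, substantial relationship"
    return "very high (strong) correlation, very dependable relationship"

print(guilford_label(0.77))   # very high (strong) correlation, ...
```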

These notes are written by S C Joshi during EPSY 635 Course, Fall 2015, Texas A&M University. Acknowledgements to Dr. Bob Hall, Professor, EPSY, Texas A&M University for his assistance in understanding these terms during the course