Wednesday, November 11, 2015

Pearson Product Moment Correlation Coefficient....

PPMCC (only describes linear relation between two continuous (interval / ratio scale) variables):
[Correlation coefficient (rxy) is a statistic]

The correlation coefficient (rxy) is a measure of the strength of association between two continuous (interval / ratio scale) variables.  It reflects how closely scores on two continuous variables go together.  The more closely two variables go together, the stronger the association between them and the more extreme the correlation coefficient.
Mathematically, PPMCC (rxy) is defined as the ratio of the covariance of two continuous variables to the product of their standard deviations. It measures the strength of the linear relationship between normally distributed variables. When the variables are not normally distributed or the relationship between them is not linear, it may not be the appropriate method (the Spearman rank correlation method would be more appropriate).

rxy = Covxy / (SxSy) = [Sum of products of deviations / (n – 1)] / (SxSy)
The sum of products of deviations from the means is divided by n – 1 to give the covariance; dividing the covariance by SxSy then standardizes the two variables against their variability (it equalizes the contribution of both).
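
The formula above can be sketched in a few lines of Python. This is only an illustration, assuming paired lists of numbers; the function name `pearson_r` and the sample scores are made up for the example and are not part of the course notes.

```python
import math

def pearson_r(x, y):
    """Pearson r computed directly from the definition Cov_xy / (S_x * S_y)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least two values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the two means
    sum_products = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    cov_xy = sum_products / (n - 1)
    # Sample standard deviations (n - 1 in the denominator)
    s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    if s_x == 0 or s_y == 0:
        raise ValueError("r is undefined when a variable has zero variance")
    return cov_xy / (s_x * s_y)

# Hypothetical paired scores, invented for illustration
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]
r = pearson_r(hours, score)
print(round(r, 3))  # 0.775
```

Here the sum of products of deviations is 6, the covariance is 6 / 4 = 1.5, and SxSy is about 1.936, which gives r of about 0.775.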

Important points about PPMCC:
1.     rxy is a scaled (normalized) measure of the covariance.
2.     It shows correspondence between the rank orders and is used to establish the validity and reliability of an instrument. A high rxy means the rank orders of the two variables are close to each other, and vice versa (disruption in rank means low r).
3.     Covariance (Covxy, the variance of both variables together) is different from the coefficient of determination (r2, the proportion of variance in Y that can be accounted for by knowing X).
4.     rxy summarizes the two variables (interval / ratio scale) jointly; in effect, it compares the rank orders of the two variables.
5.     The correlation coefficient (rxy) indicates magnitude or intensity (0 to 1) and direction (negative or positive). If the data points fall in a random pattern, the correlation is close to zero.
6.     The outcome variable is called the response or dependent variable (Y); risk factors and confounders are called predictors, or explanatory or independent variables (X).
7.     It does not make any difference which variable is plotted on which axis as long as no prediction is to be made. But if a prediction is to be made, then, by convention, predictor variables are plotted on the x-axis and outcome variables on the y-axis.
8.     X and Y variables can be measured on entirely different scales. A change in scale does not affect the correlation because PPMCC does not depend on the scales (rxy has no unit of its own; it is a ratio).
9.     The Pearson correlation coefficient, r, does not represent the slope of the line of best fit; its sign only shows the direction of the relationship, uphill or downhill.
10.    rxy has nothing to do with mean differences. Two sets of scores having the same mean tells us nothing about their relationship (rxy).
11.    rxy = ryx (the scatterplot shows the same pattern, but the slope ‘b’ changes).
12.    Every correlation (rxy) has two slopes and two intercepts (one set for Y as a function of X and one for X as a function of Y).
13.    A correlation of 0 does not mean zero relationship between two variables; rather, it means zero linear relationship. (It is possible for two variables to have zero linear relationship and a strong curvilinear relationship at the same time.)
14.    Correlation does not imply causation: Two variables may be related to each other, but this doesn’t mean that one variable causes the other.
15.    This is because a linear correlation only pairs the two variables through a linear equation, which is a logical relation between X and Y rather than a causal one.
16.    A larger sample size makes the correlation more stable, so a large sample is a good reason to trust the correlation. A small sample does not give an accurate picture of the correlation, because a single outlier can make a huge difference.
17.    Correlation can be understood by various means: scatterplots, the slope of the regression line, and variance interpretation (the squared correlation coefficient (r2) is the proportion of variance in Y that can be accounted for by knowing X; conversely, it is the proportion of variance in X that can be accounted for by knowing Y).
18.    The correlation coefficient is the slope (b) of the regression line when both the X and Y variables have been converted to z-scores. The larger the size of the correlation coefficient, the steeper that slope.
19.    A linear relationship is one in which a one-point increase in one variable always goes with a fixed change in the other, for example a four-point increase.
20.    A PPMCC is appropriate for describing a relationship in which, as X increases, Y decreases by the same amount.
21.    Pearson Product Moment Correlation can be used to express the degree of relationship in cases such as:
1.     For every extra year of growth in a pine forest, you can expect an increase of 10,000 board feet.
2.     Strenuous exercise results in large weight loss, moderate exercise maintains weight at current levels, and no exercise produces gains in weight.
22.    The higher the correlation between X and Y, the more accurate the resulting predictions are.
23.    We can have a strong relationship between two variables but still get a low correlation coefficient when the relationship is non-linear or the variances are truncated (cut off).
24.    When r = 0, we can’t conclude that the correlation coefficient is inappropriate for those data; r alone does not tell us that.
25.    Potential problems with Pearson correlation: The PPMCC is not able to tell the difference between dependent and independent variables. For example, if we are trying to find the correlation between a high-calorie diet and diabetes, we might find a high correlation of .8. However, we would get the same correlation coefficient with the variables switched around. In other words, we could just as well say that diabetes causes a high-calorie diet, which obviously makes no sense.
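
Several of the points in the list above (8, 11, 12, 13 and 18) can be checked numerically. The following is a rough sketch in plain Python, not a definitive implementation; all function names and the small data sets are assumptions made up for illustration.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(x, y):
    """Sample covariance: sum of products of deviations / (n - 1)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def sd(values):
    """Sample standard deviation (square root of the variance)."""
    return math.sqrt(covariance(values, values))

def pearson_r(x, y):
    return covariance(x, y) / (sd(x) * sd(y))

def slope(x, y):
    """Least-squares slope b for predicting y from x."""
    return covariance(x, y) / covariance(x, x)

def z_scores(values):
    m, s = mean(values), sd(values)
    return [(v - m) / s for v in values]

# Hypothetical paired scores, invented for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)                 # point 11: rxy = ryx

# Point 8: re-express X on a different scale (a positive linear rescaling)
x_rescaled = [2.54 * v + 10 for v in x]
r_rescaled = pearson_r(x_rescaled, y)

# Point 12: two slopes, one per direction of prediction
b_yx = slope(x, y)                     # Y as a function of X
b_xy = slope(y, x)                     # X as a function of Y

# Point 18: slope of the regression line on z-scores equals r
b_z = slope(z_scores(x), z_scores(y))

# Point 13: a perfect curvilinear relationship (w = u squared) with r = 0
u = [-2, -1, 0, 1, 2]
w = [v ** 2 for v in u]
r_curve = pearson_r(u, w)

print("rxy, ryx:", round(r_xy, 3), round(r_yx, 3))
print("r after rescaling X:", round(r_rescaled, 3))
print("slopes b_yx, b_xy:", round(b_yx, 3), round(b_xy, 3))
print("slope on z-scores:", round(b_z, 3))
print("r for w = u^2:", round(r_curve, 3))
```

In this example the two slopes differ (0.6 and 1.0) while the correlation is the same in both directions, and the product of the two slopes equals r2.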

Guilford’s Interpretation:
< 0.20 – Slight, almost negligible relationship
0.20 – 0.40 – Low (weak) correlation, definite but small relationship
0.40 – 0.70 – Moderate correlation, substantial relationship
0.70 – 1.00 – Very high (strong) correlation, very dependable relationship (the rank orders might be close to each other; scores on one variable grow with the other)
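
As a small illustration, the bands above can be turned into a lookup function. This is a sketch under two assumptions that the table itself does not state: the bands are applied to the absolute value of r, and a value that falls exactly on a boundary (0.20, 0.40, 0.70) is placed in the higher band. The function name `guilford_label` is made up for the example.

```python
def guilford_label(r):
    """Return the interpretation band for a correlation coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)  # assumption: the sign gives direction only, not strength
    if size < 0.20:
        return "slight, almost negligible"
    if size < 0.40:
        return "low (weak)"
    if size < 0.70:
        return "moderate"
    return "very high (strong)"

for value in (0.10, -0.35, 0.40, 0.80):
    print(value, "->", guilford_label(value))
```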

These notes were written by S C Joshi during the EPSY 635 course, Fall 2015, Texas A&M University. Acknowledgements to Dr. Bob Hall, Professor, EPSY, Texas A&M University, for his assistance in understanding these terms during the course.
