Correlation (Calculating Pearson’s r)

Correlation refers to the idea that two variables (x and y) vary together. For instance, the grades in a statistics class may be related to, or correlated with, the amount of time those students study. As study time goes up, grades go up. This would be a positive correlation. On the other hand, as time spent partying goes up, grades go down. This is called a negative correlation.

A positive correlation doesn’t strictly refer to good things, though. As the percent of poverty in a community goes up, the amount of crime may also go up. This is a positive correlation, but certainly not a good thing!

Correlations are expressed on a scale from -1 (perfectly negative) to +1 (perfectly positive). The size of the number shows the strength, and the sign (positive or negative) shows the direction. Therefore, -0.75 is a stronger correlation (or connection) than 0.25.

One common expression is “Correlation is not causation”; this refers to the idea that items can be correlated without one actually causing the other. For instance, there is a close connection between the rate of ice-cream consumption in the summer and the drowning rate, even though neither really affects the other (hot weather drives both).

How to Calculate Correlation
Pearson’s r (also known as the correlation coefficient) is a simple correlation tool to work with. (Technically, r is used for samples and ρ (rho) is used for populations, but we’ll be working with samples, a limited subset of the total, so we will simply refer to it as Pearson’s r, or r.)

The formula is:

r = ∑[(X-Mx)(Y-My)] / Sqrt[∑(X-Mx)² × ∑(Y-My)²]

This formula may look complicated, but let’s work through it one piece at a time.

In words: the sum, over each pair, of the value of X minus the mean of X multiplied by the value of Y minus the mean of Y, divided by the square root of the sum of the squared deviations of X multiplied by the sum of the squared deviations of Y.

Let’s look at the following set of data of student absences and their final grades:

Student #    Absences    Exam Grade
1            4           82
2            2           98
3            2           76
4            3           68
5            1           84
6            0           99
7            4           67
8            8           58
9            7           50
10           3           78

The first step is to create a scatterplot of the data to see if any patterns stick out:

This shows a moderately negative correlation: as absences go up, grades go down.

Moving to the equation, let’s look at the top part first:

∑[(X-Mx)(Y-My)]

We have to calculate the mean of X and the mean of Y:

The mean of X is (4 + 2 + 2 + 3 + 1 + 0 + 4 + 8 + 7 + 3) / 10 = 34 / 10 = 3.4.
The mean of Y is (82 + 98 + 76 + 68 + 84 + 99 + 67 + 58 + 50 + 78) / 10 = 760 / 10 = 76.
Next, we calculate X - Mx and Y - My for each pair.

X    X – Mx    Y     Y – My
4    0.6       82    6
2    -1.4      98    22
2    -1.4      76    0
3    -0.4      68    -8
1    -2.4      84    8
0    -3.4      99    23
4    0.6       67    -9
8    4.6       58    -18
7    3.6       50    -26
3    -0.4      78    2

Next, we must multiply the values of each of these together:

X – Mx    Y – My    (X-Mx) × (Y-My)
0.6       6         3.6
-1.4      22        -30.8
-1.4      0         0
-0.4      -8        3.2
-2.4      8         -19.2
-3.4      23        -78.2
0.6       -9        -5.4
4.6       -18       -82.8
3.6       -26       -93.6
-0.4      2         -0.8

And the sum of these (3.6 + -30.8 + 0 + 3.2 and so on) is -304. Here’s our equation so far:

r = -304 / Sqrt[∑(X-Mx)² × ∑(Y-My)²]

Next, let’s look at the bottom part of the equation:

X    X – Mx    (X-Mx)²    Y     Y – My    (Y-My)²
4    0.6       0.36       82    6         36
2    -1.4      1.96       98    22        484
2    -1.4      1.96       76    0         0
3    -0.4      0.16       68    -8        64
1    -2.4      5.76       84    8         64
0    -3.4      11.56      99    23        529
4    0.6       0.36       67    -9        81
8    4.6       21.16      58    -18       324
7    3.6       12.96      50    -26       676
3    -0.4      0.16       78    2         4
∑              56.4       ∑               2262

We take the square of each of the X deviations (X - Mx) and sum them up. We do the same for the Y deviations.

This results in: -304 / Sqrt(56.4*2262)

Next, we multiply the two bottoms together. 56.4 x 2262 = 127,576.8.

Taking the square root yields 357.179.

Our final calculation is -304 / 357.179 which equals -0.85.
-0.85 is our final correlation, which we can confirm using Excel’s CORREL function.
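If you’d rather verify the result in code than in Excel, the whole calculation fits in a few lines of Python. This sketch is mine, not part of the original walkthrough, and the function name pearson_r is arbitrary:

```python
# Pearson's r for the absences/grades data, following the same steps
# as the article: deviation products on top, root of the summed
# squared deviations on the bottom.
import math

absences = [4, 2, 2, 3, 1, 0, 4, 8, 7, 3]
grades = [82, 98, 76, 68, 84, 99, 67, 58, 50, 78]

def pearson_r(xs, ys):
    n = len(xs)
    mx = sum(xs) / n  # mean of X (3.4 here)
    my = sum(ys) / n  # mean of Y (76 here)
    # Numerator: sum of the products of the paired deviations
    top = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the summed squared deviations
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return top / math.sqrt(ssx * ssy)

r = pearson_r(absences, grades)
print(round(r, 2))  # -0.85
```

The intermediate sums match the tables above: the numerator comes out to -304, and the two squared-deviation sums are 56.4 and 2262.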

Cite this article as: MacDonald, D.K., (2015), "Correlation (Calculating Pearson’s r)," retrieved on May 27, 2019 from http://dustinkmacdonald.com/correlation-calculating-pearsons-r/.

Z-Scores

Z-scores were a concept I had trouble with in university. They’re actually not as difficult as they’re made out to be. I’ll spare you the complicated introduction (as I’m sure you got one from both your textbook and your Professor), but remember that a z-score shows you the distance between your score and the mean, in standard deviations.

So a z-score of 1 is one standard deviation above the mean; that puts you above approximately 84% of the population, with 34% of scores falling between the mean and your score. Typically you’ll be asked to do a few things:

• What percent above the mean is a particular z-score
• What percent below the mean is a particular z-score
• What percent is between two scores
• How do you convert a raw score into a z-score
• How do you convert a z-score into a raw score

So, let’s get to it. Remember that you’ll need a z-table (usually provided by your Professor or available in your textbook) for these exercises.

Raw Score into Z-Score

The formula for converting a raw score into a z-score is Z = (X – M) / SD, or Z Score = (Value – Mean) / Standard Deviation.

So, if you have a score of 80, and the mean is 75, with a Standard Deviation of 5, your equation will be:

(80 – 75) / 5 = 1. Therefore your Z score is 1.0

If you instead scored 73, it would be (73 – 75) / 5 = -0.4.
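As a quick check, the conversion is one line of Python. This is a sketch I’ve added, using the article’s example numbers:

```python
# z = (value - mean) / standard deviation
def z_score(value, mean, sd):
    return (value - mean) / sd

print(z_score(80, 75, 5))  # 1.0
print(z_score(73, 75, 5))  # -0.4
```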

Percent Below a Score

You’ll be given a z-score like 0.66, and you’ll need to find out what percent of scores fall below it. Simply go to your z-table, and find 0.66. Some tables list all the values sequentially (0.5, 0.51, 0.52) while others use a table like Wikipedia’s.

If your table includes both “% mean to z” and “% in tail” (like my textbook), just look at the “% mean to z.” If your table uses decimals (like 0.7454), multiply them by 100 to get percentages.

When I look up 0.66 in my textbook’s z-table, I see 24.54%. When I look up the same value in Wikipedia’s table, I see 74.54%. What gives? The % mean to z is only half of the picture. To get the correct percent below, you take your 24.54% and add 50 to it, because it’s a positive z-score (and half of all scores lie below the mean).

If you have a negative z-score, like -0.85, you take the % mean to z you’re given (30.23%) and subtract it from 50, which gives you 19.77%.

Percent Above a Score

To find the percent above a score, you perform the same calculation for percent below, but you subtract from 100. For instance, 19.77% is percent below, so when you subtract that value from 100, you get 80.23% above.

Percent Between Scores

To calculate the percent between two scores, you simply take the percent below each score and subtract one from the other. For instance, if you want to know the percent between z-scores of 0.6 and 0.7:

The % mean to z of 0.6 is 22.57 and of 0.7 is 25.80.

Because both numbers are above the mean, we add 50 to each, giving us percentages of 72.57 and 75.80.

Subtracting 72.57 from 75.80 gives 3.23% between the two values.
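If you don’t have a z-table handy, all three of these percentages can be computed in Python with the standard normal CDF, which the math module’s erf function gives us. This sketch replaces the table lookup, and the function names are mine:

```python
import math

def percent_below(z):
    # Standard normal CDF, scaled to a percentage:
    # CDF(z) = 0.5 * (1 + erf(z / sqrt(2)))
    return 100 * 0.5 * (1 + math.erf(z / math.sqrt(2)))

def percent_above(z):
    return 100 - percent_below(z)

def percent_between(z1, z2):
    return abs(percent_below(z2) - percent_below(z1))

print(round(percent_below(-0.85), 2))       # 19.77
print(round(percent_above(-0.85), 2))       # 80.23
print(round(percent_between(0.6, 0.7), 2))  # 3.23
```

The results agree with the table-based answers worked out above, without any rounding to two table digits.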

Z-Score Back to Raw Score

The formula for converting a z-score back to a raw score is R = Z*SD + M. So if your z score is 0.8, the mean is 75 and the Standard Deviation is 5, your equation looks like:

Raw Score = 0.8*5 + 75

Raw Score = 4 + 75

Raw Score = 79
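The reverse conversion is just as short in Python (again a sketch of mine, with the article’s numbers):

```python
# raw = z * sd + mean
def raw_score(z, mean, sd):
    return z * sd + mean

print(raw_score(0.8, 75, 5))  # 79.0
```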

Cite this article as: MacDonald, D.K., (2015), "Z-Scores," retrieved on May 27, 2019 from http://dustinkmacdonald.com/z-scores/.


Dispersion and Variability (Standard Deviation)

The terms dispersion and variability (or variance) describe the “spread” of data in a distribution. This article explains how to compute the variance and the standard deviation.

The first measure of dispersion to look at is the variance. Let’s look at the data set below:

X Values
4
5
2
7

Steps to Calculate Variance:

1. Calculate mean
2. Subtract the mean from each value in the set
3. Square each number from 2)
4. Sum the values from 3)
5. Divide by the number of values in the set

Let’s work through these steps. First, let’s calculate the mean:

M = ∑X / n (the sum of X divided by n)
M = (4 + 5 + 2 + 7) / 4
M = 18 / 4
M = 4.5

Second, we subtract the mean from each value in the set.

X Values    X – M
4           -0.5
5           0.5
2           -2.5
7           2.5

Third, we square each of these deviations.

X Values    X – M    (X – M)²
4           -0.5     0.25
5           0.5      0.25
2           -2.5     6.25
7           2.5      6.25

Fourth, we sum the values from step three.

X Values    X – M    (X – M)²
4           -0.5     0.25
5           0.5      0.25
2           -2.5     6.25
7           2.5      6.25
∑                    13

Finally, we divide by the number of values in the set:

Variance is 13 / 4 = 3.25

To calculate the standard deviation, you simply take the square root of the variance.

Sqrt(3.25) = 1.80

So, the standard deviation is 1.80. You can confirm this by going into Excel and using the STDEV.P function, which computes the population standard deviation (dividing by n, as we did here).
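The same five steps can be checked in Python instead of Excel. This is a sketch I’ve added; population_variance is my own name for it:

```python
import math

def population_variance(values):
    m = sum(values) / len(values)             # step 1: the mean
    squared = [(x - m) ** 2 for x in values]  # steps 2-3: squared deviations
    return sum(squared) / len(values)         # steps 4-5: sum, divide by n

data = [4, 5, 2, 7]
var = population_variance(data)
sd = math.sqrt(var)   # the standard deviation is the square root
print(var)            # 3.25
print(round(sd, 2))   # 1.8
```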

Cite this article as: MacDonald, D.K., (2015), "Dispersion and Variability (Standard Deviation)," retrieved on May 27, 2019 from http://dustinkmacdonald.com/dispersion-and-variability-standard-deviation/.


Measures of Central Tendency

The measures of central tendency are ways of determining the central value in a dataset. The most common is the arithmetic average, or mean, so this value has come to be known simply as the average.

The three measures of central tendency are mean, median and mode.

Mean

To calculate the mean (also known as the arithmetic mean or arithmetic average), you take all of the scores, add up their values and divide by the number of scores you have. Let’s look at the following student scores out of 10:

 4 4 4 5 5 5 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 8 9 9 9 9 9 10

There are 40 values here. If we add them all up, we get a total of 280. Dividing by the number of values, we get an average of:

280 / 40 = 7

Median

The mean is a very common measure but can be affected by extreme scores. If any values are very high or very low compared to the majority, the mean can be distorted. In situations like this, we use the median. The median is the middle value in the set of scores.

For instance, let’s look at a limited set of numbers from the above data set:

 4 4 4 5 5 5 6

There are 7 values here, so the middle (fourth) value, 5, becomes the median. In a situation like our full chart above, where we have 40 values, we instead have two middle values.

Position    17  18  19  20  21  22  23  24
Value       7   7   7   7   7   7   7   7

Taking the values at positions 20 and 21 (7 and 7), adding them together and dividing by 2 gives us the median, 7.

Mode

Finally, the mode is simply the most common score occurring in a distribution. In the full data set above, we have the following values and frequencies:

Value    Frequency
4        3
5        3
6        7
7        12
8        9
9        5
10       1

In this case, 7 appears twelve times, so it becomes our mode.
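All three measures are available in Python’s built-in statistics module, which lets us confirm the values for the 40 scores listed in the Mean section (a sketch I’ve added, not part of the original article):

```python
import statistics

# The 40 student scores, built from their frequencies
scores = [4]*3 + [5]*3 + [6]*7 + [7]*12 + [8]*9 + [9]*5 + [10]

print(statistics.mean(scores))    # 7
print(statistics.median(scores))  # 7.0
print(statistics.mode(scores))    # 7
```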

Choosing a Measure of Central Tendency

The mean is most commonly used; it is best for symmetric distributions (distributions without major outliers). The median is best for a skewed distribution or one with outliers, while the mode is used in 3 cases:

• One particular score dominates a distribution
• Distribution is bimodal or multimodal
• Data are nominal

Weighted Mean

One special case of the mean is the “weighted mean”, where some values are “weighted” or contribute more to the total value than others. The data set from above is presented here:

Value    Frequency
4        3
5        3
6        7
7        12
8        9
9        5
10       1

To calculate the weighted mean, we multiply each value by its frequency, sum the results, and divide by the sum of the frequencies. As you’ll see, this gives the same result as the ordinary mean:

• 3×4 + 3×5 + 7×6 + 12×7 + 9×8 + 5×9 + 1×10
= 12 + 15 + 42 + 84 + 72 + 45 + 10
= 280
• We divide by the sum of the frequencies:

3 + 3 + 7 + 12 + 9 + 5 + 1
= 40

• And now we’ll divide the top by the bottom: 280 / 40 = 7
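The weighted mean calculation is short enough to sketch in Python as well, with the frequencies taken directly from the 40 scores listed in the Mean section:

```python
# value -> frequency, from the full 40-score data set
freq = {4: 3, 5: 3, 6: 7, 7: 12, 8: 9, 9: 5, 10: 1}

total = sum(value * count for value, count in freq.items())  # weighted sum
n = sum(freq.values())                                       # total frequency
print(total, n, total / n)  # 280 40 7.0
```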

Cite this article as: MacDonald, D.K., (2015), "Measures of Central Tendency," retrieved on May 27, 2019 from http://dustinkmacdonald.com/measures-of-central-tendency/.


Frequency Distributions

Frequency distributions are a simple way of organizing data based on how many times each value occurs. They can be used for individual values or for ranges (such as age ranges).

The steps for making a frequency distribution are pretty simple. Let’s take a set of 20 students and their ages:

 18 17 17 18 15 15 17 14 15 16 16 16 17 14 17 14 15 14 16 17

Creating a Frequency Distribution:

1. Determine the highest and lowest scores
2. Create two columns, label the first with the variable name, label the second frequency
3. List the full range of values that encompass all the scores in the data set from highest to lowest. Include all values in the range, even those for which the frequency is 0
4. Count the number of scores at each value and write those numbers in the frequency column.

Let’s work through these steps one by one.

1. The first step is to establish the highest and lowest scores. In this example, the highest score is 18 and the lowest score is 14.
2. Creating two columns, we’ll call the first column “Student Age” and the second column “Frequency”.
3. Next, we’ll add each of the ages, 14-18, to our chart and count how many times each one occurs:

Student Age    Frequency
14             4
15             4
16             4
17             6
18             2

As you can see, most of the values occur four times, except for the age 17, which occurs six times, and the age 18, which only occurs twice. From this data we can begin drawing rudimentary conclusions about the individuals in the sample. For instance, this group is relatively evenly distributed, except for 17 year olds, who are over-represented, and 18 year olds, who are under-represented.
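Python’s collections.Counter builds this kind of frequency table directly from the raw list of ages (a quick sketch, not part of the original article):

```python
from collections import Counter

ages = [18, 17, 17, 18, 15, 15, 17, 14, 15, 16, 16, 16,
        17, 14, 17, 14, 15, 14, 16, 17]

freq = Counter(ages)
for age in sorted(freq):  # print one "age frequency" pair per line
    print(age, freq[age])
```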

Group Frequency Table

If your data exists in a range, you can also create a grouped frequency table. This is similar to a regular frequency table but is often used for data where there can be many specific values (for instance, recording the speed at which a person performs a task can result in values that go into the millisecond.) In cases like these, grouped frequency tables are helpful.

It is slightly more complicated to put together. Before we start, let’s go over some definitions.

• Lower Class Limits – the smallest numbers that can actually belong to the different classes
• Upper Class Limits – the largest numbers that can actually belong to the different classes
• Class Boundaries – the numbers used to separate classes, without the gaps created by the class limits

Let’s use the following dataset, which records how long it took each of 16 people to perform a task, to demonstrate these terms:

Lower Class Limit    Upper Class Limit    Frequency
0                    Under 1.5            4
1.5                  Under 3.0            5
3.0                  Under 4.5            3
4.5                  Under 6.0            3
6.0                  Under 7.5            0
7.5                  Under 9.0            1

What “under 1.5” means is that any value from 0 up to, but not including, 1.5 would qualify; it is worded this way to simplify things.

How did we decide on the intervals here (e.g. 0 to 1.5, 1.5 to 3, and so on)? We used something called the 2^k guideline.

The 2^k guideline says to choose the number of intervals k so that 2 raised to the power of k is greater than or equal to the number of items in the dataset. For instance, in our example we have 16 people, so let’s work through the powers of 2:

• 2^2 = 4
• 2^3 = 8
• 2^4 = 16
• 2^5 = 32

So, in this example we could use only 4 intervals if we needed to, but the authors of the set (which came from a statistics textbook) chose to use 6 to make the data easier to see. Ideally you want the number to be between 5 and 10.

To determine the best distance between the intervals (in this case they’ve chosen 1.5), you can take the range (which is the highest value minus the lowest value) and divide it by the number of classes.

Let’s assume the highest value was 8 and the lowest value was 0.1. Our “distance calculation” would thus be: (8 – 0.1) / 6 ≈ 1.32. Again, the authors chose a simpler value (1.5) to make interpreting the data easier.
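Both rules of thumb are easy to compute directly (a sketch of mine; the variable names are arbitrary):

```python
n = 16              # items in the dataset
high, low = 8, 0.1  # highest and lowest values

# 2^k guideline: the smallest k with 2**k >= n
k = 1
while 2 ** k < n:
    k += 1
print(k)  # 4

# class width: the range divided by the chosen number of classes
width = (high - low) / 6
print(round(width, 2))  # 1.32
```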

Histogram

Once you have your frequency distribution, you can turn it into a visual display with a histogram, which looks much like a bar chart and enables us to see the data at a glance.

Cite this article as: MacDonald, D.K., (2015), "Frequency Distributions," retrieved on May 27, 2019 from http://dustinkmacdonald.com/frequency-distributions/.
