# Z-Test Hypothesis Testing

## Introduction

The Z-test is a simple tool for hypothesis testing. It can be used to determine whether a sample mean differs significantly from the mean of a larger population, provided that population is normally distributed.

Many datasets (for instance population height, test scores, etc.) have normal distributions. If you’re unsure whether your dataset has a normal distribution, a common rule of thumb is that you can assume it is approximately normal if you have at least 30 items to draw on (e.g. 30 students’ heights or 30 test scores).

You will need to know the population mean and the standard deviation in order to perform the one-sample z-test. If you don’t know the standard deviation you should use a t-test instead.

## Hypothesis Testing

The following six steps are for a Z-test:

1. Identify our population, comparison distribution, hypothesis and assumptions. Choose an appropriate test.
2. State the null and research hypotheses.
3. Determine the characteristics of the comparison distribution.
4. Determine the cutoffs that indicate the points beyond which we reject the null hypothesis.
5. Calculate the test statistic.
6. Decide whether to reject or fail to reject the null hypothesis.

## One-Sample Z-Test Formula

The following is the formula for the one-sample z-test:

$$z = \frac{\bar{x} - \Delta}{\sigma / \sqrt{n}}$$

Where x̄ (x-bar) is the sample mean, ∆ (delta) is the value you are comparing it with (the population mean), σ (sigma) is the population standard deviation and n is the number of values in the sample.

## Z-Test Example

The example we will work with for our one-sample z-test is a set of students who received 4 hours of study strategies tutoring before beginning a statistics course and another set of students who did not. You can compare the grades of these students to find out if the tutoring has impacted their grades.

There are 150 students in the class. The mean score in the class (our comparison value, delta) is 72, with a SD of 6. The mean score of the 20 students who received tutoring (our sample mean, x-bar) is 75, with a SD of 5. Because the z-test uses the population standard deviation, we use the class SD of 6 for sigma, and n is the 20 students in the sample. These values seem too close for us to judge by eye, so we will use our formula. Plugging these into the formula, we get:

z = (75 – 72) / (6 / sqrt(20))
z = 3 / (6 / 4.47)
z = 3 / 1.34
z = 2.24

Looking up 2.24 in our Z-Table gives us 48.75 (the percentage of scores between the mean and our z value).

Subtracting 48.75 from 50 (the percentage of scores on each side of the mean) gives us 1.25, the % in tail value in the z-chart, or 0.0125 as a proportion. Because we are performing a two-tailed test (we want to know whether our value is significantly above or significantly below the mean), we multiply 0.0125 by 2 to get a p value of 0.025.

In order to reject the null hypothesis, our p value must be under 0.05. Because our p value is below 0.05, we reject the null hypothesis. This means that the grades of the students who received 4 hours of tutoring were significantly different from (in this case, higher than) those of the class as a whole.
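The one-sample z-test for this example can be sketched in Python. This is a minimal sketch, assuming the 20 tutored students are the sample (x-bar = 75, n = 20) and the class figures (mean 72, SD 6) stand in for the population. The function name `one_sample_z_test` is our own, and `NormalDist` from Python's standard library takes the place of the Z-table lookup.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, pop_mean, pop_sd, n):
    """Return (z, two-tailed p value) for a one-sample z-test."""
    standard_error = pop_sd / sqrt(n)
    z = (sample_mean - pop_mean) / standard_error
    # Proportion of the standard normal curve in one tail beyond |z|.
    tail = 1 - NormalDist().cdf(abs(z))
    # Two-tailed test: count both tails.
    return z, 2 * tail


# Assumed reading of the example: tutored group vs. the whole class.
z, p = one_sample_z_test(sample_mean=75, pop_mean=72, pop_sd=6, n=20)
print(f"z = {z:.2f}, p = {p:.4f}")
if p < 0.05:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```

Because the code uses the exact z value rather than one rounded to two decimals for a table lookup, its p value may differ slightly from a hand calculation.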

Cite this article as: MacDonald, D.K., (2016), "Z-Test Hypothesis Testing," retrieved on November 23, 2017 from http://dustinkmacdonald.com/z-test-hypothesis-testing/.

# Statistical Coding and Classification

## Introduction to Classification

Oftentimes when performing research or intelligence analysis, the first step is to classify the available data. Classification provides a number of benefits that make later analysis easier. For one, it allows you to infer other qualities of an item, because all items in a class share similar properties.

For instance, knowing that mammals have fur and give birth to live young (as opposed to laying eggs), you can predict these properties for any creature identified as a mammal.

Another benefit to classification is that it allows you to see relationships among classes that you may not have been aware of before. The classic Periodic Table is a good example of this: elements along the right-hand side of the periodic table (so-called Noble Gases) all hold similar properties, while other columns ordered together also appear to have similar properties. It is not simply that the elements were organized this way after they were found to match, but in fact “holes” in the periodic table indicated where elements must exist but haven’t been discovered yet.

This brings us to the next benefit of classification, the ability to uncover missing information. Although this is sometimes exploited in military and diplomatic circles (for instance, SEAL Team Six is actually the 4th SEAL team – the number was incremented in order to mislead enemies about how many SEAL Teams there are), this is still a very useful technique for discovering what you don’t know.

Finally, classification allows you to focus on the properties of group items rather than of individual ones, which can make analyzing large amounts of information much easier than it otherwise would be. We’re sometimes overwhelmed by information and these preliminary steps can help us drill down. This is also accomplished through coding, below.

## Statistical Coding

Statistical coding is the form of classification that is perhaps most familiar to researchers. Coding is the task of taking data and assigning it to categories. This allows us to turn normally qualitative data into quantitative or numerical data. If you look at the example of Gender, assigning Male a value of “0” and Female a value of “1” is a form of coding that allows you to perform statistical analysis.

Coding is often used to group responses together. If asking someone what their first emotion is after a sudden loss or grief, you may have to translate disparate responses like, “I was overwhelmed”, “I didn’t know what to do”, and “I felt numb” into simple categories (“Overwhelmed”, “Confused/Shocked”, “Numb”) and later into numerical values (1, 2, 3.)

Make sure to store the results of your coding in a “codebook” so that later you can remember what variable was turned into what coding.
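As a rough illustration, a codebook can be as simple as a lookup table. The sketch below uses the grief-response example from above; the dictionary names and the choice of which response maps to which category are assumptions made for illustration, not a fixed scheme.

```python
# Hypothetical codebook: category label -> numeric code.
CODEBOOK = {"Overwhelmed": 1, "Confused/Shocked": 2, "Numb": 3}

# Hypothetical researcher decisions: raw response -> category label.
RESPONSE_CATEGORIES = {
    "I was overwhelmed": "Overwhelmed",
    "I didn't know what to do": "Confused/Shocked",
    "I felt numb": "Numb",
}

# Reverse lookup, so a number can be turned back into its label later.
DECODE = {code: label for label, code in CODEBOOK.items()}


def code_response(response):
    """Translate one raw response into its numeric code."""
    category = RESPONSE_CATEGORIES[response]
    return CODEBOOK[category]


responses = ["I felt numb", "I was overwhelmed", "I didn't know what to do"]
coded = [code_response(r) for r in responses]
print(coded)
print([DECODE[c] for c in coded])
```

Keeping the codebook in one place means the same mapping is used for coding and for reading the results back.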

There are a few advantages of statistical coding. For one, it allows you to perform statistical analyses that are not possible on qualitative data. It also allows you to perform “blind” analyses, in which the analyst doesn’t know which variable corresponds to which value.

Cite this article as: MacDonald, D.K., (2016), "Statistical Coding and Classification," retrieved on November 23, 2017 from http://dustinkmacdonald.com/statistical-coding-classification/.

# Predicting Your Helpline Call Answer Rate

One role of helpline managers is to manage their workers so that they can answer the most calls possible within the available resources. Even helplines that run 24-hours and have 100% coverage can’t answer 100% of the calls that come in if they have more callers calling in than workers available.

Using a system like Chronicall can give you real-time information on the calls that you do and don’t answer, and can prepare more detailed results (for instance, noting where calls are not answered because the worker is already on a call).

Given a series of values that are related to each other, regression allows us to predict values where we either don’t have the data or where we want to know the “average” of a piece of data.

For this task, we assume all you have is the data about how many hours your helpline is covered (either in hours or percentages) and the percentage of calls that you answer.

| Hours Covered (out of 24) | Call Answer Percentage |
|---|---|
| 24 | 80 |
| 24 | 78 |
| 24 | 82 |
| 24 | 76 |
| 24 | 79 |
| 22 | 75 |
| 22 | 85 |
| 22 | 76 |
| 20 | 82 |
| 20 | 80 |
| 19 | 70 |
| 18 | 74 |

While we can apply the regression formulas by hand, Excel provides a simpler way to find the equation. For the purpose of this article, we will first do the calculations by hand to demonstrate. See the regression article for full details on how to do this.

## Regression By Hand

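| Hours Covered (out of 24) [X] | Call Answer Percentage [Y] | X² | Y² | XY |
|---|---|---|---|---|
| 24 | 80 | 576 | 6400 | 1920 |
| 24 | 78 | 576 | 6084 | 1872 |
| 24 | 82 | 576 | 6724 | 1968 |
| 24 | 76 | 576 | 5776 | 1824 |
| 24 | 79 | 576 | 6241 | 1896 |
| 22 | 75 | 484 | 5625 | 1650 |
| 22 | 85 | 484 | 7225 | 1870 |
| 22 | 76 | 484 | 5776 | 1672 |
| 20 | 82 | 400 | 6724 | 1640 |
| 20 | 80 | 400 | 6400 | 1600 |
| 19 | 70 | 361 | 4900 | 1330 |
| 18 | 74 | 324 | 5476 | 1332 |
| **263** | **937** | **5817** | **73351** | **20574** |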

b = (12*20574 – 263*937) / (12*5817 – 263^2)
b = 0.71969

a = 937 / 12 – 0.71969 * (263/12)
a = 62.3101

So our final equation is:

Y’ = a + bX
Y’ = 62.3101 + (0.71969)X
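The by-hand arithmetic can be checked with a few lines of Python. This is only a sketch that plugs the column totals from the table into the formulas for b and a; the variable names are our own.

```python
# Column totals from the by-hand table (12 rows of data).
n = 12
sum_x = 263      # total of Hours Covered
sum_y = 937      # total of Call Answer Percentage
sum_x2 = 5817    # total of the X-squared column
sum_xy = 20574   # total of the XY column

# Slope: (n * sum(XY) - sum(X) * sum(Y)) / (n * sum(X^2) - sum(X)^2)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Intercept: mean of Y minus slope times mean of X
a = sum_y / n - b * (sum_x / n)

print(f"b = {b:.5f}")
print(f"a = {a:.4f}")
```

Because the code keeps b at full precision instead of rounding it to five decimals first, the last digit of a can differ slightly from the hand calculation.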

## Using Excel

We can use Excel to simplify this calculation. We start with an Excel spreadsheet containing our X values in cells B2 through B13 and our Y values in cells C2 through C13.

Next, we use Excel’s LINEST function. This requires you to select TWO side-by-side cells at once before typing the formula. The first required value (called an “argument” in Excel) is the known Y values. In this case, it is C2 through C13. The next value is the known X values (B2 through B13).

The third argument controls how the intercept is handled: TRUE calculates it normally, while FALSE forces it to zero. Be careful with the letters here: Excel writes the line as Y = mx + b, so Excel’s b (the intercept) is our a, and Excel’s m (the slope) is our b. Since we want the intercept calculated normally, we’ll set this argument to TRUE. The final argument asks whether we want additional statistical information included, so we set this to FALSE.

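So our final formula is (depending on your regional settings, Excel may expect commas rather than semicolons between the arguments):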

=LINEST(C2:C13;B2:B13;TRUE;FALSE)

After we’re done typing this, instead of hitting Enter like normal, we hit Ctrl-Shift-Enter. This is very important! If we neglect to do this, Excel will only give us part of the information we need. If we’ve done this correctly, Excel will put curly brackets around the formula, like this:

{=LINEST(C2:C13;B2:B13;TRUE;FALSE)}

And you’ll notice that both cells you selected are filled in. The first cell holds the b value and the second cell holds the a value. Putting them into the formula, we have:

Y’ = 62.31024 + (0.719685)X

So, if we want to calculate what our answer percentage will be if we have 21 hours of coverage:

Y = 62.31024 + (0.719685)21 = 77.42

This falls right in line with our expected values, and this technique can be used with any other data where you need to predict values in a linear fashion.

Cite this article as: MacDonald, D.K., (2015), "Predicting Your Helpline Call Answer Rate," retrieved on November 23, 2017 from http://dustinkmacdonald.com/predicting-your-helpline-call-answer-rate/.


# Least-Squares Regression

Regression is a technique used to predict future values based on known values. For instance, linear regression allows us to predict what an unknown Y value will be, given a series of known X and Y’s, and a given X value.

Given the following, it’s easy to see the pattern. But assuming no obvious pattern exists, regression can help us determine what the Y value will be given our known X values.

| X | Y |
|---|---|
| 2 | 3 |
| 4 | 6 |
| 6 | 9 |
| 8 | 12 |
| 10 | 15 |
| 12 | 14 |

The X value is known as the independent variable or “predictor variable”, while the Y value, known as the dependent variable, is the value being predicted.

The linear regression (or “least squares regression”) equation is Y’ = a + bX

• Y’ (Y-prime) is the predicted Y value for the X value
• a is the estimated value of Y when X is 0
• b is the slope (the average change in Y’ for each one-unit change in X)
• X is any value of the independent variable

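There are additional formulas for both a and b, where n is the number of X and Y pairs:

$$b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - \left(\sum X\right)^2}$$

$$a = \frac{\sum Y}{n} - b\,\frac{\sum X}{n}$$

Let’s take a look at the following data set, which compares the number of calls made for a product against the number of sales: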

| Calls (X) | Sales (Y) |
|---|---|
| 20 | 30 |
| 40 | 60 |
| 20 | 40 |
| 30 | 60 |
| 10 | 30 |
| 10 | 40 |
| 20 | 40 |
| 20 | 50 |
| 20 | 30 |
| 30 | 70 |
| **220** | **450** |

First we need to calculate the sum of X-squared, Y-squared and X*Y:

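| | Calls (X) | Sales (Y) | X² | Y² | XY |
|---|---|---|---|---|---|
| | 20 | 30 | 400 | 900 | 600 |
| | 40 | 60 | 1600 | 3600 | 2400 |
| | 20 | 40 | 400 | 1600 | 800 |
| | 30 | 60 | 900 | 3600 | 1800 |
| | 10 | 30 | 100 | 900 | 300 |
| | 10 | 40 | 100 | 1600 | 400 |
| | 20 | 40 | 400 | 1600 | 800 |
| | 20 | 50 | 400 | 2500 | 1000 |
| | 20 | 30 | 400 | 900 | 600 |
| | 30 | 70 | 900 | 4900 | 2100 |
| Total | 220 | 450 | 5600 | 22100 | 10800 |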

Returning to our formula, let’s start with b first:

The top of the equation is n(∑XY) – (∑X)(∑Y). Filling in the values from our chart (n = 10, ∑XY = 10800, ∑X = 220, ∑Y = 450):

n(∑XY) – (∑X)(∑Y)
= 10(10800) – 220 * 450
= 108,000 – 99,000
= 9,000

So far, then, b = 9,000 / (n(∑X²) – (∑X)²).

Now we have to do the bottom half of the equation:

n(∑X²) – (∑X)²

= 10(5600) – (220)²
= 56,000 – 48,400
= 7,600

Returning to our equation:

b = 9,000 / 7,600
b = 1.1842

Now let’s move on to a:

a = 450 / 10 – 1.1842 * (220 / 10)
a = 45 – (1.1842 * 22)
a = 45 – 26.0524
a = 18.9476

So, going back to our original regression equation, Y’ = a + bX, and plugging in our numbers, we get:

Y’ = 18.9476 + (1.1842)X

To use this equation, we now put our desired value in for X. With an estimated 20 calls:

Y’ = 18.9476 + (1.1842)*20
Y’ = 18.9476 + 23.684
Y’ = 42.63

So, a salesperson who makes 20 calls can expect to make about 43 sales (42.63, rounded).
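The whole worked example can be sketched as a small Python function that builds the same sums as the chart and then applies the formulas for b and a. The function name `least_squares` is our own choice, and this is a minimal sketch rather than a full regression routine (it does no error checking, for example when every X value is the same).

```python
def least_squares(xs, ys):
    """Return (a, b) for the line Y' = a + bX fitted by least squares."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)               # the X-squared column
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # the XY column
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)
    return a, b


calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

a, b = least_squares(calls, sales)
predicted = a + b * 20  # expected sales for 20 calls

print(f"Y' = {a:.4f} + ({b:.4f})X")
print(f"Predicted sales for 20 calls: {predicted:.2f}")
```

The code does not round b before computing a, so the fourth decimal of a can differ slightly from the hand calculation; the prediction for 20 calls is unaffected at two decimals.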

Cite this article as: MacDonald, D.K., (2015), "Least-Squares Regression," retrieved on November 23, 2017 from http://dustinkmacdonald.com/least-squares-regression/.


# Correlation (Calculating Pearson’s r)

Correlation refers to the idea that two variables (x and y) are related to each other. For instance, the grades in a statistics class may be related to, or correlated with, the amount of time those students study. As study time goes up, grades go up. This would be a positive correlation. On the other hand, as time spent partying goes up, grades go down. This is called a negative correlation.

A positive correlation doesn’t strictly refer to good things, though. As the percent of poverty in a community goes up, the amount of crime may also go up. This is a positive correlation, but certainly not a good thing!

Correlations are expressed on a scale from -1 (which is perfectly negative) to +1 (which is perfectly positive). The size of the number shows the strength, and the sign (positive or negative) shows the direction. Therefore, -0.75 is a stronger correlation (or connection) than 0.25.

One common expression is “Correlation is not causation”; this refers to the idea that items can be correlated without one causing the other. For instance, there is a close connection between ice-cream consumption and the drowning rate, because both rise in hot weather, even though one really doesn’t affect the other.

## How to Calculate Correlation

Pearson’s r (also known as the correlation coefficient) is a simple correlation tool to work with. (Technically r is used for samples and ρ (rho) is used for populations, but we’ll be working with samples, a limited portion of the total, so we will simply refer to it as Pearson’s r or r.)

The formula is here:

$$r = \frac{\sum (X - M_X)(Y - M_Y)}{\sqrt{\sum (X - M_X)^2 \, \sum (Y - M_Y)^2}}$$

This formula may look complicated, but let’s go through it step by step.

In words: for each pair of values, subtract the mean of X from X and the mean of Y from Y, multiply the two results together, and sum these products. Then divide that sum by the square root of the sum of the squared X differences multiplied by the sum of the squared Y differences.

Let’s look at the following set of data of student absences and their final grades:

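| Student # | Absences | Exam Grade |
|---|---|---|
| 1 | 4 | 82 |
| 2 | 2 | 98 |
| 3 | 2 | 76 |
| 4 | 3 | 68 |
| 5 | 1 | 84 |
| 6 | 0 | 99 |
| 7 | 4 | 67 |
| 8 | 8 | 58 |
| 9 | 7 | 50 |
| 10 | 3 | 78 |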

The first step is to create a scatterplot of the data to see if any patterns stick out.

The scatterplot shows a moderately negative correlation: as absences go up, grades go down.

Moving to the equation, let’s look at the top part first:

∑[(X-MX)(Y-MY)]

We have to calculate the mean of X and the mean of Y:

4 + 2 + 2 + 3 + 1 + 0 + 4 + 8 + 7 + 3 = 34, and 34 / 10 gives a mean of X of 3.4.
82 + 98 + 76 + 68 + 84 + 99 + 67 + 58 + 50 + 78 = 760, and 760 / 10 gives a mean of Y of 76.

Next, we calculate X-Mx and Y-My for each student:

| X | X – Mx | Y | Y – My |
|---|---|---|---|
| 4 | 0.6 | 82 | 6 |
| 2 | -1.4 | 98 | 22 |
| 2 | -1.4 | 76 | 0 |
| 3 | -0.4 | 68 | -8 |
| 1 | -2.4 | 84 | 8 |
| 0 | -3.4 | 99 | 23 |
| 4 | 0.6 | 67 | -9 |
| 8 | 4.6 | 58 | -18 |
| 7 | 3.6 | 50 | -26 |
| 3 | -0.4 | 78 | 2 |

Next, we must multiply the values of each of these together:

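| X – Mx | Y – My | (X – Mx) * (Y – My) |
|---|---|---|
| 0.6 | 6 | 3.6 |
| -1.4 | 22 | -30.8 |
| -1.4 | 0 | 0 |
| -0.4 | -8 | 3.2 |
| -2.4 | 8 | -19.2 |
| -3.4 | 23 | -78.2 |
| 0.6 | -9 | -5.4 |
| 4.6 | -18 | -82.8 |
| 3.6 | -26 | -93.6 |
| -0.4 | 2 | -0.8 |

And the sum of these (3.6 + -30.8 + 0 + 3.2 and so on) is -304. Here’s our equation so far:

$$r = \frac{-304}{\sqrt{\sum (X - M_X)^2 \, \sum (Y - M_Y)^2}}$$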

Next, let’s look at the bottom part of the equation:

| X | X – Mx | (X – Mx)² | Y | Y – My | (Y – My)² |
|---|---|---|---|---|---|
| 4 | 0.6 | 0.36 | 82 | 6 | 36 |
| 2 | -1.4 | 1.96 | 98 | 22 | 484 |
| 2 | -1.4 | 1.96 | 76 | 0 | 0 |
| 3 | -0.4 | 0.16 | 68 | -8 | 64 |
| 1 | -2.4 | 5.76 | 84 | 8 | 64 |
| 0 | -3.4 | 11.56 | 99 | 23 | 529 |
| 4 | 0.6 | 0.36 | 67 | -9 | 81 |
| 8 | 4.6 | 21.16 | 58 | -18 | 324 |
| 7 | 3.6 | 12.96 | 50 | -26 | 676 |
| 3 | -0.4 | 0.16 | 78 | 2 | 4 |
| | | ∑ 56.4 | | | ∑ 2262 |

We take the square of each X – Mx value and sum them up. We do the same for the Y – My values.

This results in: r = -304 / Sqrt(56.4 * 2262)

Next, we multiply the two sums on the bottom together. 56.4 x 2262 = 127,576.8.

Taking the square root yields 357.179.

Our final calculation is -304 / 357.179, which equals -0.85. This is our final correlation, which we can confirm using Excel’s CORREL function.
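The same steps can also be sketched in Python as a cross-check. The function name `pearson_r` is our own, and the code simply follows the formula used in this article: sum the products of the differences from each mean, then divide by the square root of the product of the summed squared differences.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r for two equal-length lists of numbers."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Top of the equation: sum of (X - Mx) * (Y - My)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Bottom of the equation: the two sums of squared differences
    sum_sq_x = sum((x - mean_x) ** 2 for x in xs)
    sum_sq_y = sum((y - mean_y) ** 2 for y in ys)
    return numerator / sqrt(sum_sq_x * sum_sq_y)


absences = [4, 2, 2, 3, 1, 0, 4, 8, 7, 3]
grades = [82, 98, 76, 68, 84, 99, 67, 58, 50, 78]

r = pearson_r(absences, grades)
print(f"r = {r:.2f}")
```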

Cite this article as: MacDonald, D.K., (2015), "Correlation (Calculating Pearson’s r)," retrieved on November 23, 2017 from http://dustinkmacdonald.com/correlation-calculating-pearsons-r/.
