# Scales of Measurement

## Introduction

This post is part of a series I’ve been chipping away at, where I teach basic statistics and probability.

Variables are the outcomes of a psychological measurement. As the Australian Bureau of Statistics notes, a variable is “any characteristics, number, or quantity that can be measured or counted.” Also called data items, they are called variables because the “value may vary” and may change over time.

There are four scales of measurement used to distinguish variables:

• Nominal/Categorical
• Ordinal
• Interval
• Ratio

## Nominal Variables

Nominal variables are those that separate a value into different categories. Examples of nominal variables are gender (male, female, other) or type of transportation (car, bus, train). These are categories and have no intrinsic value that allows them to be compared on their own.

## Ordinal Variables

Ordinal variables are similar to nominal variables, but their categories can be ranked. One example of an ordinal variable is educational achievement. A scale might look like this:

• Less than a high school diploma
• High school or GED
• Bachelor’s degree
• Graduate or first professional degree
• Doctorate degree

These can be ranked from least education to most education, but there is no way to tell necessarily how much “more” education a Bachelor’s degree is when compared to a graduate or professional degree.

## Interval Variables

An interval variable is an ordinal variable where the different items are evenly spaced. For example, income level:

• \$0-4,999
• \$5,000-9,999
• \$10,000-14,999
• \$15,000-19,999

Each one of these is evenly spaced. There must be a continuum to measure an interval variable.

## Ratio Variables

Ratio variables are like interval variables, with the notable exception that “0” indicates an absence of the value. For example, in our previous example an income of \$0 means no money. If we look at temperature, however, 0 degrees Celsius does not mean there is no temperature. This makes Celsius an interval variable.

On the other hand, Kelvin is a ratio variable because 0 Kelvin really means no heat or temperature at all (what we call absolute zero).
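To make the Celsius and Kelvin distinction concrete, here is a minimal Python sketch. The function name `celsius_to_kelvin` and the two sample temperatures are my own illustrative choices, not part of the original example. It shows that differences behave the same on both scales, while ratios only carry meaning on the scale with a true zero.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert a Celsius reading to Kelvin (0 K is absolute zero)."""
    return celsius + 273.15


# Two illustrative readings (assumed values, chosen only for the demo).
cold_c, warm_c = 10.0, 20.0
cold_k, warm_k = celsius_to_kelvin(cold_c), celsius_to_kelvin(warm_c)

# Differences are meaningful on both scales (the interval property).
diff_c = warm_c - cold_c
diff_k = warm_k - cold_k

# Ratios only make sense on Kelvin, where 0 is a true zero.
ratio_c = warm_c / cold_c  # 2.0, but 20 degrees C is not "twice as hot" as 10
ratio_k = warm_k / cold_k  # about 1.035

print(diff_c, round(diff_k, 2), ratio_c, round(ratio_k, 3))
```

The Celsius ratio of 2.0 is an artifact of where the scale puts its zero; the Kelvin ratio is the one that reflects a physical quantity.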

## Continuous vs. Discrete Variables

One more distinction is the difference between continuous and discrete variables. Continuous variables are those that can take on any value. For example, a variable that can have any number between 10 and 11 (10.48938, 10.74982, 10.9999) is continuous.

If the variable can only take certain values with nothing in between (like 10 or 11), then it is a discrete variable. Counts recorded as integers are a common example.

## Why Separate Variables into Categories

It’s important to understand whether the variables we are working with are nominal, ordinal, interval or ratio, because we’ll use different statistical tests when working with different data. Coding and other manipulation and processing of the data may also differ depending on the variable.


# Z-Test Hypothesis Testing

## Introduction

The Z-test is a simple tool for hypothesis testing. It can be used to identify whether the mean of a sample differs by a statistically significant amount from the mean of the larger set it is compared with, provided the larger set follows a normal distribution.

Many datasets (for instance population height, test scores, etc.) have normal distributions. If you’re unsure whether your dataset has a normal distribution, a common rule of thumb is that you can treat it as approximately normal if you have at least 30 items to draw on (e.g. 30 students’ heights, 30 test scores).

You will need to know the population mean and the population standard deviation in order to perform the one-sample z-test. If you don’t know the population standard deviation, you should use a t-test instead.

## Hypothesis Testing

The following six steps are for a Z-test:

1. Identify our population, comparison distribution, hypothesis and assumptions. Choose an appropriate test.
2. State the null and research hypotheses.
3. Determine the characteristics of the comparison distribution.
4. Determine the cutoffs that indicate the points beyond which we reject the null hypothesis.
5. Calculate the test statistic.
6. Decide whether to reject or fail to reject the null hypothesis.

## One-Sample Z-Test Formula

The following is the formula for the z-test:

$$z = \frac{\bar{x} - \Delta}{\sigma / \sqrt{n}}$$

Where x̄ (x-bar) is the sample mean, ∆ (delta) is the value you are comparing it with (the population mean), σ (sigma) is the population standard deviation and n is the number of values in the sample.

## Z-Test Example

The example we will work with for our one-sample z-test is a set of students who received 4 hours of study strategies tutoring before beginning a statistics course. You can compare the mean grade of these students with the mean grade of the class as a whole to find out if the tutoring has impacted their grades.

There are 150 students in the class. The mean score in the class (delta) is 72, with an SD of 6 (sigma). The mean score of the 20 students who received tutoring (x-bar) is 75, with an SD of 5. Because we know the class SD, we use it rather than the SD of the tutored group. These values seem too close for us to judge purely by eye, so we will use our formula. Plugging this into the formula, we get:

z = (75 – 72) / (6 / sqrt(20))
z = 3 / (6 / 4.47)
z = 3 / 1.34
z = 2.24

Looking up 2.24 in our Z-Table gives us 48.75, the percentage of scores that fall between the mean and our z value.

Subtracting 48.75 from 50 (the percentage of scores above the mean) gives us 1.25%, or 0.0125 as a proportion (the % in tail value in the z-chart). Because we are performing a two-tailed test (we want to know whether our value is significantly above or significantly below the mean), we multiply 0.0125 by 2 to get a p value of 0.025.

In order to reject the null hypothesis, our p value must be under 0.05. Because our p value is below 0.05, we reject the null hypothesis. This means that the students who received 4 hours of tutoring scored significantly differently from (in this case, higher than) the class as a whole.
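The calculation can also be sketched in Python. This is a minimal sketch under stated assumptions: the helper name `one_sample_z_test` is my own, and I am treating the tutored group’s mean of 75 as the sample mean, the class mean of 72 and class SD of 6 as the population values, and the 20 tutored students as the sample size. The standard library’s `statistics.NormalDist` stands in for the printed Z-Table.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, population_mean, population_sd, n):
    """Return (z, two-tailed p value) for a one-sample z-test."""
    standard_error = population_sd / sqrt(n)
    z = (sample_mean - population_mean) / standard_error
    # Area in one tail beyond |z|, doubled because the test is two-tailed.
    p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_tailed


# Assumed inputs taken from the tutoring example.
z, p = one_sample_z_test(sample_mean=75, population_mean=72,
                         population_sd=6, n=20)
print(f"z = {z:.2f}, p = {p:.4f}")
print("reject the null hypothesis" if p < 0.05 else "fail to reject the null hypothesis")
```

Because the code does not round intermediate steps, its p value may differ slightly in the later decimal places from one read off a printed table.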

Cite this article as: MacDonald, D.K., (2016), "Z-Test Hypothesis Testing," retrieved on June 26, 2019 from http://dustinkmacdonald.com/z-test-hypothesis-testing/.

# Statistical Coding and Classification

## Introduction to Classification

Oftentimes when performing research or intelligence analysis, the first step is to classify the available data. Classification provides a number of benefits that make later analysis easier. For one, it allows you to infer other qualities based on all items in a class sharing similar properties.

For instance, knowing that mammals have fur and give birth to live young (as opposed to laying eggs), you can predict these properties for any creature identified as a mammal.

Another benefit to classification is that it allows you to see relationships among classes that you may not have been aware of before. The classic Periodic Table is a good example of this: elements along the right-hand side of the periodic table (so-called Noble Gases) all hold similar properties, while other columns ordered together also appear to have similar properties. It is not simply that the elements were organized this way after they were found to match, but in fact “holes” in the periodic table indicated where elements must exist but haven’t been discovered yet.

This brings us to the next benefit of classification, the ability to uncover missing information. Although this is sometimes exploited in military and diplomatic circles (for instance, SEAL Team Six is actually the 4th SEAL team – the number was incremented in order to mislead enemies about how many SEAL Teams there are), this is still a very useful technique for discovering what you don’t know.

Finally, classification allows you to focus on the properties of group items rather than of individual ones, which can make analyzing large amounts of information much easier than it otherwise would be. We’re sometimes overwhelmed by information and these preliminary steps can help us drill down. This is also accomplished through coding, below.

## Statistical Coding

Statistical coding is the form of classification that is perhaps most familiar to researchers. Coding is the task of taking data and assigning it to categories. This allows us to turn normally qualitative data into quantitative or numerical data. If you look at the example of Gender, assigning Male a value of “0” and Female a value of “1” is a form of coding that allows you to perform statistical analysis.

Coding is often used to group responses together. If asking someone what their first emotion is after a sudden loss or grief, you may have to translate disparate responses like, “I was overwhelmed”, “I didn’t know what to do”, and “I felt numb” into simple categories (“Overwhelmed”, “Confused/Shocked”, “Numb”) and later into numerical values (1, 2, 3.)

Make sure to store the results of your coding in a “codebook” so that later you can remember what variable was turned into what coding.
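As a rough illustration of coding with a codebook, here is a small Python sketch. The dictionary names (`codebook`, `response_to_category`) and the helper `code_response` are hypothetical; the categories and codes are the ones from the grief example above. In practice, assigning a response to a category is a judgment made by the person doing the coding; the lookup table here only records those decisions.

```python
# Hypothetical codebook: category label -> numeric code.
codebook = {"Overwhelmed": 1, "Confused/Shocked": 2, "Numb": 3}

# Hypothetical record of the coder's decisions: raw response -> category.
response_to_category = {
    "I was overwhelmed": "Overwhelmed",
    "I didn't know what to do": "Confused/Shocked",
    "I felt numb": "Numb",
}


def code_response(response: str) -> int:
    """Translate a raw response into its numeric code via the codebook."""
    category = response_to_category[response]
    return codebook[category]


responses = ["I felt numb", "I was overwhelmed", "I felt numb"]
coded = [code_response(r) for r in responses]
print(coded)  # [3, 1, 3]

# Reversing the codebook lets you recover the label from a code later on.
decode = {code: label for label, code in codebook.items()}
print(decode[2])  # Confused/Shocked
```

Keeping the codebook as data, separate from the analysis, is what lets you look up later which value was turned into which code.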

There are a few advantages of statistical coding. For one, it allows you to perform statistical analyses not possible on qualitative data. It also allows you to perform “blind” analyses, without knowing which variable corresponds to which value.

Cite this article as: MacDonald, D.K., (2016), "Statistical Coding and Classification," retrieved on June 26, 2019 from http://dustinkmacdonald.com/statistical-coding-classification/.

One role of helpline managers is to manage their workers so that they can answer the most calls possible within the available resources. Even helplines that run 24-hours and have 100% coverage can’t answer 100% of the calls that come in if they have more callers calling in than workers available.

Using a system like Chronicall can give you real-time information on the calls that you answer and don’t and prepare more detailed results (for instance, noting where calls are not answered because the worker is already on a call.)

Given a series of values that are related to each other, regression allows us to predict values where we either don’t have the data or where we want to know the “average” of a piece of data.

For this task, we assume all you have is the data about how many hours your helpline is covered (either in hours or percentages) and the percentage of calls that you answer.

| Hours Covered (out of 24) | Call Answer Percentage |
|---|---|
| 24 | 80 |
| 24 | 78 |
| 24 | 82 |
| 24 | 76 |
| 24 | 79 |
| 22 | 75 |
| 22 | 85 |
| 22 | 76 |
| 20 | 82 |
| 20 | 80 |
| 19 | 70 |
| 18 | 74 |

While we can use the regression formulas by hand, Excel provides simple techniques for deducing the formula. The first step (for the purpose of this article) is to do the calculations by hand to demonstrate. You can see the regression article for full details on how to do this.

## Regression By Hand

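| Hours Covered (out of 24) [X] | Call Answer Percentage [Y] | X² | Y² | XY |
|---|---|---|---|---|
| 24 | 80 | 576 | 6400 | 1920 |
| 24 | 78 | 576 | 6084 | 1872 |
| 24 | 82 | 576 | 6724 | 1968 |
| 24 | 76 | 576 | 5776 | 1824 |
| 24 | 79 | 576 | 6241 | 1896 |
| 22 | 75 | 484 | 5625 | 1650 |
| 22 | 85 | 484 | 7225 | 1870 |
| 22 | 76 | 484 | 5776 | 1672 |
| 20 | 82 | 400 | 6724 | 1640 |
| 20 | 80 | 400 | 6400 | 1600 |
| 19 | 70 | 361 | 4900 | 1330 |
| 18 | 74 | 324 | 5476 | 1332 |
| **263** | **937** | **5817** | **73351** | **20574** |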

b = (12*20574 – 263*937) / (12*5817 – 263^2)
b = 0.71969

a = 937 / 12 – 0.71969 * (263/12)
a = 62.3101

So our final equation is:

Y’ = a + bX
Y’ = 62.3101 + (0.71969)X
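The by-hand arithmetic can be checked with a short Python sketch. The variable names (`hours`, `answered`, and the `sum_` totals) are my own; the data and formulas are the ones from the table and equations above.

```python
# Helpline data from the table: hours covered (X) and answer percentage (Y).
hours = [24, 24, 24, 24, 24, 22, 22, 22, 20, 20, 19, 18]
answered = [80, 78, 82, 76, 79, 75, 85, 76, 82, 80, 70, 74]

n = len(hours)
sum_x = sum(hours)
sum_y = sum(answered)
sum_x2 = sum(x * x for x in hours)
sum_xy = sum(x * y for x, y in zip(hours, answered))

# Slope: b = (n*SumXY - SumX*SumY) / (n*SumX2 - SumX^2)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
# Intercept: a = SumY/n - b * (SumX/n)
a = sum_y / n - b * (sum_x / n)

print(f"totals: X={sum_x}, Y={sum_y}, X2={sum_x2}, XY={sum_xy}")
print(f"b = {b:.5f}, a = {a:.4f}")
```

Since the code carries b at full precision, the intercept it reports can differ from the hand-calculated one in the last decimal place.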

## Using Excel

We can use Excel to simplify this calculation. Starting with an Excel spreadsheet containing our X and Y values:

Next, we use Excel’s LINEST function. This requires you to select TWO cells at once. The first required value (called an “argument” in Excel) is the known Y values. In this case, it is C2 through C13. The next value is the known X values (B2 through B13.)

The third argument controls whether the intercept is calculated normally or forced to zero. Excel’s documentation writes the line as Y = mx + b, so its “b” is the intercept, which is the a in our equation Y’ = a + bX. We want the intercept calculated normally, so we set this to TRUE. The final argument asks whether we want additional statistical information included, so we set this to FALSE.

So our final formula is:

=LINEST(C2:C13;B2:B13;TRUE;FALSE)

After we’re done typing this, instead of hitting enter like normal, we hit Ctrl-Shift-Enter. This is very important! If we neglect to do this, Excel will only give us part of the information we need. If we’ve done this correctly, Excel will put curly brackets around the formula, like this:

{=LINEST(C2:C13;B2:B13;TRUE;FALSE)}

And you’ll notice that both cells you selected are filled in. The first cell holds the b value and the second cell holds the a value. Putting them into the formula, we have:

Y’ = 62.31024 + (0.719685)X

So, if we want to calculate what our answer percentage will be if we have 21 hours of coverage:

Y’ = 62.31024 + (0.719685)21 = 77.42

This falls right in line with our expected values, and this technique can be used with any other data where you need to predict values in a linear fashion.


# Least-Squares Regression

Regression is a technique used to predict future values based on known values. For instance, linear regression allows us to predict what an unknown Y value will be, given a series of known X and Y’s, and a given X value.

Given the following, it’s easy to see the pattern. But assuming no obvious pattern exists, regression can help us determine what the Y value will be given our known X values.

| X | Y |
|---|---|
| 2 | 3 |
| 4 | 6 |
| 6 | 9 |
| 8 | 12 |
| 10 | 15 |
| 12 | 14 |

The X value is known as the independent variable, or the “predictor variable”, while the Y value is the dependent variable, the value being predicted.

The linear regression (or “least squares regression”) equation is Y’ = a + bX

• Y’ (Y-prime) is the predicted Y value for the X value
• a is the estimated value of Y when X is 0
• b is the slope (the average change in Y’ for each one-unit change in X)
• X is any value of the independent variable

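There are additional formulas for both a and b:

$$b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - \left(\sum X\right)^2}$$

$$a = \frac{\sum Y}{n} - b\,\frac{\sum X}{n}$$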

Let’s take a look at the following data-set, that compares the number of calls made for a product against the number of sales:

| Calls (X) | Sales (Y) |
|---|---|
| 20 | 30 |
| 40 | 60 |
| 20 | 40 |
| 30 | 60 |
| 10 | 30 |
| 10 | 40 |
| 20 | 40 |
| 20 | 50 |
| 20 | 30 |
| 30 | 70 |
| **220** | **450** |

First we need to calculate the sum of X-squared, Y-squared and X*Y:

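| | Calls (X) | Sales (Y) | X² | Y² | XY |
|---|---|---|---|---|---|
| | 20 | 30 | 400 | 900 | 600 |
| | 40 | 60 | 1600 | 3600 | 2400 |
| | 20 | 40 | 400 | 1600 | 800 |
| | 30 | 60 | 900 | 3600 | 1800 |
| | 10 | 30 | 100 | 900 | 300 |
| | 10 | 40 | 100 | 1600 | 400 |
| | 20 | 40 | 400 | 1600 | 800 |
| | 20 | 50 | 400 | 2500 | 1000 |
| | 20 | 30 | 400 | 900 | 600 |
| | 30 | 70 | 900 | 4900 | 2100 |
| **Total** | **220** | **450** | **5600** | **22100** | **10800** |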

The top half of the equation for b is n(∑XY) – (∑X)(∑Y). We simply fill in the values from our chart:

n(∑XY) – (∑X)(∑Y) = 10(10800) – (220 * 450)
= 108,000 – 99,000
= 9,000

Now we have to do the bottom half of the equation:

n(∑X²) – (∑X)²

= 10(5600) – (220)²
= 56,000 – 48,400
= 7,600

Returning to our equation:

b = 9,000 / 7,600
b = 1.1842

Now let’s move on to a:

a = 450 / 10 – 1.1842 * (220 / 10)
a = 45 – (1.1842 * 22)
a = 45 – 26.0524
a = 18.9476

So, going back to our original regression equation, Y’ = a + bX and plugging our numbers, we get:

Y’ = 18.9476 + (1.1842)X

To use this equation, we now put our desired value in for X. With an estimated 20 calls:

Y’ = 18.9476 + (1.1842)*20
Y’ = 18.9476 + 23.684
Y’ = 42.63

So, a salesperson who makes 20 calls can expect to make about 43 sales (42.63, rounded).
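The whole worked example can be condensed into a short Python sketch. The function name `least_squares` and the variable names are my own; the formulas and data are the ones used above.

```python
def least_squares(xs, ys):
    """Return (a, b) for the line Y' = a + bX using the sums-based formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)
    return a, b


# Calls (X) and sales (Y) from the table.
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

a, b = least_squares(calls, sales)
predicted_sales = a + b * 20  # prediction for a salesperson making 20 calls
print(f"Y' = {a:.4f} + ({b:.4f})X; 20 calls -> {predicted_sales:.2f} sales")
```

Because the code does not round b before computing a, its intercept comes out as 18.9474 rather than the hand-calculated 18.9476; the prediction for 20 calls is unaffected at two decimal places.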

Cite this article as: MacDonald, D.K., (2015), "Least-Squares Regression," retrieved on June 26, 2019 from http://dustinkmacdonald.com/least-squares-regression/.
