Statistical Coding and Classification

Introduction to Classification

Oftentimes when performing research or intelligence analysis, the first step is to classify the available data. Classification provides a number of benefits that make later analysis easier. For one, it allows you to infer other qualities of an item, because all items in a class share similar properties.

For instance, knowing that mammals have fur and that nearly all mammals give birth to live young (as opposed to laying eggs), you can predict these properties for any creature identified as a mammal.

Another benefit of classification is that it allows you to see relationships among classes that you may not have been aware of before. The classic Periodic Table is a good example of this: elements along the right-hand side of the periodic table (the so-called Noble Gases) all hold similar properties, and the elements in each of the other columns also have similar properties to one another. It is not simply that the elements were organized this way after they were found to match; in fact, “holes” in the periodic table indicated where elements must exist but had not yet been discovered.

This brings us to the next benefit of classification: the ability to uncover missing information. Although this is sometimes exploited in military and diplomatic circles (for instance, SEAL Team Six is actually the 4th SEAL team; the number was incremented in order to mislead enemies about how many SEAL Teams there are), it is still a very useful technique for discovering what you don’t know.

Finally, classification allows you to focus on the properties of groups of items rather than of individual ones, which can make analyzing large amounts of information much easier than it otherwise would be. We’re sometimes overwhelmed by information, and these preliminary steps can help us drill down. This is also accomplished through coding, below.

Statistical Coding

Statistical coding is the form of classification that is perhaps most familiar to researchers. Coding is the task of taking data and assigning it to categories. This allows us to turn normally qualitative data into quantitative or numerical data. If you look at the example of Gender, assigning Male a value of “0” and Female a value of “1” is a form of coding that allows you to perform statistical analysis.

Coding is often used to group responses together. If you ask people what their first emotion is after a sudden loss or grief, you may have to translate disparate responses like “I was overwhelmed”, “I didn’t know what to do”, and “I felt numb” into simple categories (“Overwhelmed”, “Confused/Shocked”, “Numb”) and later into numerical values (1, 2, 3).

Make sure to store the results of your coding in a “codebook” so that later you can remember which value was assigned which code.
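As a minimal sketch of what a codebook might look like in practice, the Python below maps the example responses above to categories and then to numeric codes. The names `CODEBOOK`, `RESPONSE_CATEGORIES` and `code_response` are hypothetical, and the response-to-category mapping is assumed to have been decided by a human coder.

```python
# Hypothetical codebook: category label -> numeric code
CODEBOOK = {"Overwhelmed": 1, "Confused/Shocked": 2, "Numb": 3}

# Hypothetical mapping from raw responses to categories,
# assumed to have been assigned by a human coder
RESPONSE_CATEGORIES = {
    "I was overwhelmed": "Overwhelmed",
    "I didn't know what to do": "Confused/Shocked",
    "I felt numb": "Numb",
}


def code_response(response):
    """Return the numeric code for a raw response, or None if it is uncoded."""
    category = RESPONSE_CATEGORIES.get(response)
    return CODEBOOK.get(category)


responses = ["I felt numb", "I was overwhelmed", "I didn't know what to do"]
coded = [code_response(r) for r in responses]
print(coded)  # [3, 1, 2]
```

Keeping the codebook as data, separate from the analysis, means the same mapping can be reused or reversed later.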

There are a few advantages of statistical coding. For one, it allows you to perform statistical analyses not possible on qualitative data. It also allows you to perform “blind” analyses, in which the analyst does not know which variable corresponds to which value.

Cite this article as: MacDonald, D.K., (2016), "Statistical Coding and Classification," retrieved on January 23, 2018 from

Understanding and Preventing Male Suicide


Suicide is a significant public health issue in most countries. Suicide rates have been constant in the US and Canada, with some age and risk categories experiencing reduced suicide rates while increased suicide rates in other age groups and risk categories have made up the difference.

Male suicide has commonly been overlooked because suicide has not been seen as a gendered issue. However, because more men than women die by suicide in virtually every country for which the World Health Organization publishes data (2012), there exists the potential for significant reductions in the suicide rate through interventions targeted specifically at men.

Suicide Statistics: A Comparison

Suicide counts are presented here for Canada, broken down by age range and gender.

Age Range       Male   Female   Total   Male % of Total
10 to 14          12       17      29   41.38%
15 to 19         140       58     198   70.71%
20 to 24         224       77     301   74.42%
25 to 29         198       63     261   75.86%
30 to 34         212       71     283   74.91%
35 to 39         220       68     288   76.39%
40 to 44         267       87     354   75.42%
45 to 49         318      114     432   73.61%
50 to 54         322      121     443   72.69%
55 to 59         273      102     375   72.80%
60 to 64         186       59     245   75.92%
65 to 69         117       33     150   78.00%
70 to 74         107       21     128   83.59%
75 to 79          78       23     101   77.23%
80 to 84          60       16      76   78.95%
85 to 89          36       13      49   73.47%
90 and older      10        3      13   76.92%
Total           2780      946    3726

As you can see, male suicides make up the majority of suicides in every age range except 10 to 14, where girls outnumbered boys. That is certainly worthy of further research by child suicide prevention specialists.

In Canada, suicide rates peak for men around ages 45-54, which contrasts with other countries where suicide rates increase with age after 30 and the elderly are the group with the fastest-growing suicide rate.

Suicide Methods

The most common method of suicide in the United States is firearms, accounting for 51% of suicides in the US (Barber & Miller, 2014), followed by suffocation/hanging (25%), overdose/poisoning (17%) and other methods at 7.6% (Centers for Disease Control and Prevention, 2013).

Because 85% of firearm suicide attempts result in death while only 2% of overdoses do (Vyrostek, Annest, & Ryan, 2004), and because men most often choose methods like firearms and hanging over overdosing (Callanan & Davis, 2012), reducing access to firearms can significantly reduce the number of male suicides.

Theories of Suicidal Behaviour

There are a number of theories that attempt to explain suicidal behaviour. These include the Interpersonal Theory of Suicide, the Stress-Diathesis Model, and the Integrated Motivational-Volitional Model. The interpersonal theory is detailed below.

The Interpersonal Theory of Suicide suggests that you need three elements for suicide to take place:

  • Thwarted Belongingness
  • Perceived Burdensomeness
  • Acquired Suicide Capability

Thwarted belongingness involves feeling like you have no social support or that you do not belong in your peer group. This can also be called “alienation.” Men are known to have smaller social circles (McPherson, Smith-Lovin & Brashears, 2006) and less access to social support when they are distressed.

Perceived burdensomeness refers to the idea that you feel like a burden on those around you. For men, this can present as being unable to be a provider or support their family.

Finally, acquired suicide capability refers to events that give you the capability to die by suicide. This includes exposure to war, physical abuse, fighting, self-injurious behaviour (cutting, etc.), or other elements that desensitize you to painful or fear-inducing experiences.

Men are more likely than women to be victims and perpetrators of violence (Statistics Canada, 2006), they make up the majority of occupational injuries (Bureau of Labour Statistics, 2013) and sufferers of substance abuse (Cotto, 2010). All of these items can increase men’s suicidality.

Additionally, suicidal intent (the desire to die) has been associated with the use of more lethal suicide methods. This was originally taken to mean that although women attempt suicide at three times the rate men do, they do not intend to die, and that the goal of the attempt is to accomplish other ends.

Update (Nov. 1, 2015): This is in fact incorrect; there is research support for the idea that women have similar levels of suicidal intent to men (Denning, Conwell, King & Cox, 2000).

Player et al. (2015) suggest that male coping strategies are responsible. While women increase their social support and look outward when they are feeling suicidal, men often wall themselves off from others to avoid being a burden. This only amplifies their symptoms and increases their distress, and it can prevent the kind of interruption in the suicidal process that may happen with women.

Clinical Interventions to Reduce Male Suicide

Interventions for suicide that can help individual men include:

Counseling on Access to Lethal Means. By reducing access to lethal means like firearms you can reduce an individual’s chance of dying by suicide. Many suicide attempts are made impulsively and having a gun makes a suicide attempt much more lethal.

Treatment for substance abuse.  Many suicides involve drugs and alcohol and so getting off drugs and alcohol can reduce a person’s reason and ability to attempt suicide, both because of the impact of substance abuse on a person’s ability to function in their day-to-day life (especially as it relates to relationships) but also because drugs and alcohol can make people — young men especially — more impulsive.

Increasing social circles. The average man has a smaller social circle than the average woman. This lack of close friends means that men are often not able to express themselves emotionally.

Self-esteem training. This can be a part of counselling or therapy or an initiative on its own. Group environments in particular provide an opportunity to build both a man’s social skills and his self-esteem. The benefit of high self-esteem is that it can reduce a man’s perception that he is a burden, one of the key elements for suicide.

Public Health Strategies to Reduce Male Suicide

From a public health perspective, there are a few interventions that can help reduce male suicide.

Getting more men in front of family doctors. Men have poor records of going to the doctor when they need to, or even for regular checkups. Because physical health issues can prevent men from working or otherwise providing for themselves (creating the feeling of burdensomeness), physical health care is an important element to reducing suicidal ideation.

Screening for suicide and substance abuse by family doctors. Once men are in front of their physician, it’s important that the physician is able to recognize the signs and symptoms of suicidal ideation and substance abuse. It has been noted that mental health professionals are less likely to diagnose depression in men, and this is also an area for exploration.

Improved services for sexual violence. With as many as 1 in 6 men experiencing sexual abuse/assault in their lifetime (Dube, Anda & Whitfield, 2005) and a lack of services like rape crisis centres that provide service to men, suicide as a result of the after-effects of abuse will continue to be a devastating issue.

Areas for Additional Research

Areas for additional research include whether men respond differently to standard treatments for depression or substance abuse, or if there are any ways to intervene with men experiencing suicidal ideation that are particularly effective.


Barber, C.W., Miller, M.J. (2014) Reducing a Suicidal Person’s Access to Lethal Means of Suicide: A Research Agenda. American Journal of Preventive Medicine. 47(3S2):S264–S272

Centers for Disease Control and Prevention. (2013) Web-based Injury Statistics Query and Reporting System (WISQARS). Accessed Jun 21 2015 from

Denning, D.G., Conwell, Y., King, D., Cox, C. (2000) Method choice, intent, and gender in completed suicide. Journal of Suicide and Life Threatening Behaviour. 30(3). 282-288

Dube, S.R., Anda, R.F. & Whitfield, C.L., et al. (2005). Long-term consequences of childhood sexual abuse by gender of victim. American Journal of Preventive Medicine, 28, 430-438.

Callanan, V.J., Davis, M.S. Gender differences in suicide methods. (2012). Social Psychiatry and Psychiatric Epidemiology. 47:857–869 DOI 10.1007/s00127-011-0393-5

Cotto, J.H. et al. (2010) Gender effects on drug use, abuse, and dependence: An analysis of results from the National Survey on Drug Use and Health. Gender Medicine. 7(5):402-413

“Fatal occupational injuries in 2013.” Bureau of Labour Statistics. (2013). Accessed from on Sep 5 2015.

Global Health Observatory Data Repository. (2012) World Health Organization. Accessed from on Sep 1 2015.

McPherson, M., Smith-Lovin, L., Brashears, M.E. (2006) Social Isolation in America: Changes in Core Discussion Networks Over Two Decades. American Sociological Review. 71(3).

Player MJ, Proudfoot J, Fogarty A, Whittle E, Spurrier M, Shand F, et al. (2015) What Interrupts Suicide Attempts in Men: A Qualitative Study. PLoS ONE 10(6): e0128180. doi:10.1371/journal.pone.0128180

Vaillancourt, R. 2010. Gender differences in police-reported violent crime in Canada, 2008. Catalogue no. 85F0033M, no. 24. Ottawa: Statistics Canada.

Vyrostek S.B., Annest, J.L, & Ryan, G.W. Surveillance for fatal and nonfatal injuries–United States, 2001. Morbidity and Mortality Weekly Report. 2004:53(SS07);1-57.

Cite this article as: MacDonald, D.K., (2015), "Understanding and Preventing Male Suicide," retrieved on January 23, 2018 from

Predicting Your Helpline Call Answer Rate

One role of helpline managers is to manage their workers so that they can answer the most calls possible within the available resources. Even helplines that run 24-hours and have 100% coverage can’t answer 100% of the calls that come in if they have more callers calling in than workers available.

Using a system like Chronicall can give you real-time information on the calls that you do and don’t answer, and can produce more detailed results (for instance, noting where calls are not answered because the worker is already on a call).

Given a series of values that are related to each other, regression allows us to predict values where we either don’t have the data or where we want to know the “average” of a piece of data.

For this task, we assume all you have is the data about how many hours your helpline is covered (either in hours or percentages) and the percentage of calls that you answer.

Hours Covered (out of 24) Call Answer Percentage
24 80
24 78
24 82
24 76
24 79
22 75
22 85
22 76
20 82
20 80
19 70
18 74

While we can apply the regression formulas by hand, Excel provides simple techniques for deriving the equation. For the purpose of this article, the first step is to do the calculations by hand to demonstrate. You can see the regression article for full details on how to do this.

Regression By Hand

Hours Covered (out of 24) [X] Call Answer Percentage [Y] X² Y² XY
24 80 576 6400 1920
24 78 576 6084 1872
24 82 576 6724 1968
24 76 576 5776 1824
24 79 576 6241 1896
22 75 484 5625 1650
22 85 484 7225 1870
22 76 484 5776 1672
20 82 400 6724 1640
20 80 400 6400 1600
19 70 361 4900 1330
18 74 324 5476 1332
Total 263 937 5817 73351 20574

b = (12*20574 – 263*937) / (12*5817 – 263^2)
b = 0.71969

a = 937 / 12 – 0.71969 * (263/12)
a = 62.3101

So our final equation is:

Y’ = a + bX
Y’ = 62.3101 + (0.71969)X
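The hand calculation above can be double-checked with a few lines of Python. This is only a sketch that plugs the column totals from the table into the same formulas; the variable names are illustrative rather than taken from any library.

```python
# Column totals taken from the "Regression By Hand" table
n = 12          # number of observations
sum_x = 263     # sum of hours covered (X)
sum_y = 937     # sum of call answer percentages (Y)
sum_x2 = 5817   # sum of X squared
sum_xy = 20574  # sum of X * Y

# Slope: b = (n*SUM(XY) - SUM(X)*SUM(Y)) / (n*SUM(X^2) - SUM(X)^2)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Intercept: a = SUM(Y)/n - b * SUM(X)/n
a = sum_y / n - b * (sum_x / n)

print(round(b, 5), round(a, 4))
```

Because the script carries b at full precision rather than rounding it to 0.71969 first, its value of a can differ from the hand-calculated 62.3101 in the last decimal place.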

Using Excel

We can use Excel to simplify this calculation. Starting with an Excel spreadsheet containing our X and Y values:


Next, we use Excel’s LINEST function. This requires you to select TWO cells at once. The first required value (called an “argument” in Excel) is the known Y values. In this case, it is C2 through C13. The next value is the known X values (B2 through B13.)


The third argument controls whether the intercept is forced to zero (FALSE) or calculated normally (TRUE). Excel’s documentation writes the line as y = mx + b, so Excel’s “b” is the intercept, which is the a in our equation Y’ = a + bX. Since we want the intercept calculated, we’ll set it to TRUE. The final argument asks whether we want additional statistical information included, so we set this to FALSE.


So our final formula is:

=LINEST(C2:C13, B2:B13, TRUE, FALSE)


After we’re done typing this, instead of hitting Enter like normal, we hit Ctrl-Shift-Enter. This is very important! If we neglect to do this, Excel will only give us part of the information we need. If we’ve done this correctly, Excel will put curly braces around the formula.

And you’ll notice that both cells you selected are filled in. The first cell holds the b value and the second cell holds the a value. Putting them into the formula, we have:

Y’ = 62.31024 + (0.719685)X

So, if we want to calculate what our answer percentage will be if we have 21 hours of coverage:

Y’ = 62.31024 + (0.719685)*21 = 77.42

This falls right in line with our expected values, and this technique can be used with any other data where you need to predict values in a linear fashion.

Cite this article as: MacDonald, D.K., (2015), "Predicting Your Helpline Call Answer Rate," retrieved on January 23, 2018 from


Least-Squares Regression

Regression is a technique used to predict future values based on known values. For instance, linear regression allows us to predict what an unknown Y value will be, given a series of known X and Y’s, and a given X value.

Given the following, it’s easy to see the pattern. But assuming no obvious pattern exists, regression can help us determine what the Y value will be given our known X values.

2 3
4 6
6 9
8 12
10 15


The X value is known as the independent variable or “predictor variable”, while the Y value is the dependent variable, the value being predicted.

The linear regression (or “least squares regression”) equation is Y’ = a + bX

  • Y’ (Y-prime) is the predicted Y value for the X value
  • a is the estimated value of Y when X is 0
  • b is the slope (the average change in Y’ for each one-unit change in X)
  • X is any value of the independent variable

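There are additional formulas for both a and b, where n is the number of data points:

$$ a = \frac{\sum Y}{n} - b\,\frac{\sum X}{n} $$

$$ b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - \left(\sum X\right)^2} $$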

Let’s take a look at the following data-set, that compares the number of calls made for a product against the number of sales:

Calls (X) Sales (Y)
20 30
40 60
20 40
30 60
10 30
10 40
20 40
20 50
20 30
30 70
Total 220 450


First we need to calculate the sum of X-squared, Y-squared and X*Y:

Calls (X) Sales (Y) X² Y² XY
20 30 400 900 600
40 60 1600 3600 2400
20 40 400 1600 800
30 60 900 3600 1800
10 30 100 900 300
10 40 100 1600 400
20 40 400 1600 800
20 50 400 2500 1000
20 30 400 900 600
30 70 900 4900 2100
Total 220 450 5600 22100 10800


Returning to our formula, let’s start with b first:


The top of the equation (the numerator) is n(∑XY) – (∑X)(∑Y). We simply fill in the values from our chart:

numerator = 10(10800) – (220)(450)
numerator = 108,000 – 99,000
numerator = 9,000

Now we have to do the bottom half of the equation (the denominator), n(∑X²) – (∑X)²:

denominator = 10(5600) – (220)²
denominator = 56,000 – 48,400
denominator = 7,600

Returning to our equation:

b = 9,000 / 7,600
b = 1.1842

Now let’s move on to a:


a = 450 / 10 – 1.1842 * (220 / 10)
a = 45 – (1.1842 * 22)
a = 45 – 26.0524
a = 18.9476

So, going back to our original regression equation, Y’ = a + bX and plugging our numbers, we get:

Y’ = 18.9476 + (1.1842)X

To use this equation, we now put our desired value in for X. With an estimated 20 calls:

Y’ = 18.9476 + (1.1842)*20
Y’ = 18.9476 + 23.684
Y’ = 42.63

So, a salesperson who makes 20 calls can expect to make about 43 sales.
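The whole calculation can also be sketched as a small Python function that works from the raw data rather than from pre-computed totals. The function name `least_squares` is hypothetical; it applies the same formulas for a and b used in the steps above.

```python
def least_squares(xs, ys):
    """Return (a, b) for the line Y' = a + bX fitted by least squares."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)
    return a, b


# Calls (X) and sales (Y) from the table
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

a, b = least_squares(calls, sales)
predicted = a + b * 20  # predicted sales for 20 calls
print(round(b, 4), round(a, 4), round(predicted, 2))
```

The script keeps b at full precision, so its intercept comes out near 18.9474 rather than the hand-rounded 18.9476; the prediction for 20 calls still rounds to 42.63.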

Cite this article as: MacDonald, D.K., (2015), "Least-Squares Regression," retrieved on January 23, 2018 from


Correlation (Calculating Pearson’s r)

Correlation refers to the idea that two variables (x and y) are related to each other. For instance, the grades in a statistics class may be related to, or correlated with, the amount of time those students study. As study time goes up, grades go up. This would be a positive correlation. On the other hand, as time spent partying goes up, grades go down. This is called a negative correlation.
A positive correlation doesn’t strictly refer to good things, though. As the percent of poverty in a community goes up, the amount of crime may also go up. This is a positive correlation, but certainly not a good thing!

Correlations are expressed on a scale from -1 (which is perfectly negative) to +1 (which is perfectly positive). The number shows the strength, and the sign (positive or negative) shows the direction. Therefore, -0.75 is a stronger correlation (or connection) than 0.25.

One common expression is “Correlation is not causation”; this refers to the idea that items can be correlated without one causing the other. For instance, there is a close connection between the rate of ice-cream consumption and the drowning rate, even though one really doesn’t affect the other; both simply tend to rise in the summer.

How to Calculate Correlation
Pearson’s r (also known as the correlation coefficient) is a simple correlation tool to work with. (Technically, r is used for samples and ρ (rho) is used for populations, but we’ll be working with samples, a limited subset of the total, so we will simply refer to it as Pearson’s r or r.)

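The formula, where Mx is the mean of X and My is the mean of Y, is:

$$ r = \frac{\sum (X - M_x)(Y - M_y)}{\sqrt{\sum (X - M_x)^2 \, \sum (Y - M_y)^2}} $$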


This formula may look complicated, but let’s walk through it step by step.

The top of the formula is the sum, over all data points, of (X minus the mean of X) multiplied by (Y minus the mean of Y). The bottom is the square root of the sum of the squared (X minus the mean of X) values multiplied by the sum of the squared (Y minus the mean of Y) values.

Let’s look at the following set of data of student absences and their final grades:

Student # Absences Exam Grade
1 4 82
2 2 98
3 2 76
4 3 68
5 1 84
6 0 99
7 4 67
8 8 58
9 7 50
10 3 78

The first step is to create a scatterplot of the data to see if any patterns stick out:

This shows a moderately negative correlation: as absences go up, grades go down.

Moving to the equation, let’s look at the top part first:


We have to calculate the mean of X and the mean of Y:

Sum of X: 4 + 2 + 2 + 3 + 1 + 0 + 4 + 8 + 7 + 3 = 34, so the mean of X (Mx) is 34 / 10 = 3.4.
Sum of Y: 82 + 98 + 76 + 68 + 84 + 99 + 67 + 58 + 50 + 78 = 760, so the mean of Y (My) is 760 / 10 = 76.

Next, we calculate X – Mx and Y – My for each student.

X X – Mx Y Y – My
4 0.6 82 6
2 -1.4 98 22
2 -1.4 76 0
3 -0.4 68 -8
1 -2.4 84 8
0 -3.4 99 23
4 0.6 67 -9
8 4.6 58 -18
7 3.6 50 -26
3 -0.4 78 2

Next, we must multiply the values of each of these together:

X – Mx Y – My (X – Mx) * (Y – My)
0.6 6 3.6
-1.4 22 -30.8
-1.4 0 0
-0.4 -8 3.2
-2.4 8 -19.2
-3.4 23 -78.2
0.6 -9 -5.4
4.6 -18 -82.8
3.6 -26 -93.6
-0.4 2 -0.8

And the sum of these (3.6 + -30.8 + 0 + 3.2 and so on) is -304. Here’s our equation so far:

r = -304 / Sqrt(∑(X – Mx)² * ∑(Y – My)²)


Next, let’s look at the bottom part of the equation:

X X – Mx (X – Mx)² Y Y – My (Y – My)²
4 0.6 0.36 82 6 36
2 -1.4 1.96 98 22 484
2 -1.4 1.96 76 0 0
3 -0.4 0.16 68 -8 64
1 -2.4 5.76 84 8 64
0 -3.4 11.56 99 23 529
4 0.6 0.36 67 -9 81
8 4.6 21.16 58 -18 324
7 3.6 12.96 50 -26 676
3 -0.4 0.16 78 2 4



We take the square of each of the X – Mx values and sum them up, which gives 56.4. We do the same for the Y – My values, which gives 2262.

This results in: -304 / Sqrt(56.4*2262)

Next, we multiply the two sums in the denominator together. 56.4 x 2262 = 127,576.8.

Taking the square root yields 357.179.

Our final calculation is -304 / 357.179 which equals -0.85.
-0.85 is our final correlation, which we can confirm using Excel’s CORREL function.
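If Excel isn't available, the result can also be checked with a short Python sketch. The function name `pearson_r` is hypothetical; it follows the same steps as the walkthrough above (deviations from the means, their products, and the square root of the product of the squared deviations).

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Top of the formula: sum of (X - Mx) * (Y - My)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Bottom: square root of SUM((X - Mx)^2) * SUM((Y - My)^2)
    sum_sq_x = sum((x - mean_x) ** 2 for x in xs)
    sum_sq_y = sum((y - mean_y) ** 2 for y in ys)
    return numerator / sqrt(sum_sq_x * sum_sq_y)


# Absences (X) and exam grades (Y) from the student table
absences = [4, 2, 2, 3, 1, 0, 4, 8, 7, 3]
grades = [82, 98, 76, 68, 84, 99, 67, 58, 50, 78]

r = pearson_r(absences, grades)
print(round(r, 2))  # -0.85
```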

Cite this article as: MacDonald, D.K., (2015), "Correlation (Calculating Pearson’s r)," retrieved on January 23, 2018 from
