DTSC-650 Data Analytics in R Tips

Posted on October 22, 2021 by Dustin

Introduction

As a student in Eastern University’s Master of Science in Data Science program, I recently completed a course in basic data analytics with R. I had never worked with R before, but this was probably my favorite course in the whole program.

Tidyverse

The tidyverse is a collection of R packages for data science, including dplyr for data manipulation, ggplot2 for graphing, and several others. The key to DTSC-650 is understanding how to perform data manipulation with basic dplyr commands.
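
If you want to follow along, here is a minimal sketch of loading these packages (assuming the tidyverse is already installed):

# Loads dplyr, ggplot2, the %>% pipe, and the rest of the tidyverse
library(tidyverse)

# Or load only the individual packages you need
library(dplyr)
library(ggplot2)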

CodeGrade

This course includes several assignments using the automated grading platform CodeGrade. You’re given directions and asked to work with a dataset to achieve the described results. CodeGrade compares your output to the correct output and assigns you points for each question you answer successfully.

The Pipe Command (%>%)

One of the most critical tools for success in this class is understanding the pipe. It’s part of the magrittr package and is loaded with dplyr. The pipe carries the output of one operation into the next, making a series of operations much more readable.

So, for example, if you need to select name, age, and location from a worker_data dataset, filter it to only those in the East location, and calculate their mean age, you could write it like this:

mean_east <- select(worker_data, name, age, location)
mean_east <- filter(mean_east, location == 'East')
mean_east <- summarize(mean_east, mean_worker_data = mean(age))

However, that is clunky and includes a lot of duplication. By using the pipe, you can instead rewrite this code as:

mean_east <- worker_data %>%
select(name, age, location) %>%
filter(location == 'East') %>%
summarize(mean_worker_data = mean(age))

Much simpler and easier to read, and your data still ends up in the same place.

Important R Commands

In addition to understanding the pipe, the other thing that’s helpful to keep handy is a dplyr cheat sheet. It’s helpful to understand how each command works so that you can use them successfully.

Select

What it’s for: Select keeps only the columns you specify from your dataframe so that you can use them in your analysis.

When to use it: I always start each question by selecting only the relevant columns so that I don’t run into any issues down the road.

Code example:

To select only the name, age and location columns from a dataframe called worker_data (without using the pipe in this example):

worker_data_select <- select(worker_data, name, age, location)

Filter

What it’s for: Filter lets you keep only the rows of your dataset that meet specific conditions: rows where a column equals a particular value, where a column is above or below a threshold, or where a column matches a list of values (see the %in% sketch at the end of this section).

When to use it: Use filter when you need to only examine data meeting certain criteria.

Code example:

To select only the individuals with age above 40 and see their names:

worker_data_filter <- worker_data %>%
select(name, age) %>%
filter(age > 40)

Or, to select individuals in either the East or West location, you can use the | (OR) symbol (not to be confused with the R pipe %>%!):

worker_data_filter <- worker_data %>% select(name, location) %>% filter(location == 'East' | location == 'West')

To combine conditions that must both be true for the same row, use & instead. (Note that filtering for location == 'East' & location == 'West' would return nothing, since no single row can have both locations.) For example, to select workers in the East who are over 40:

worker_data_filter <- worker_data %>% select(name, age, location) %>% filter(location == 'East' & age > 40)
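
Filter also works with a list of values, as mentioned above, using the %in% operator. A quick sketch with the same hypothetical worker_data:

worker_data_filter <- worker_data %>% select(name, location) %>% filter(location %in% c('East', 'West'))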

Arrange

What it’s for: Arrange is used to sort your dataframe ascending (the top result is the lowest value) or descending (the top result is the highest value). It defaults to ascending, but if you put a minus symbol in front of the column it will flip to descending.

When to use it: Whenever you need to pull the maximum or minimum value in a column.

Code example: To find the youngest workers (ascending by age):

worker_data_filter <- worker_data %>%
select(name, age) %>%
arrange(age)

To find the oldest workers, just put a minus symbol in front of the age column:

worker_data_filter <- worker_data %>%
select(name, age) %>%
arrange(-age)

Head

What it’s for: Head returns only the first rows of your dataframe, up to the number you specify.

When to use it: Whenever you need to return just the top result (or top few results), usually after sorting with arrange.

Code example: In the previous example, you sorted workers from youngest to oldest. Maybe you only want the 5 youngest workers. That would look like this:

worker_data_filter <- worker_data %>%
select(name, age) %>%
arrange(age) %>%
head(5)

Print

What it’s for: To print data to the screen

When to use it: This should be self-explanatory!

Code example:

print('Hello world')
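
You can print a dataframe the same way, for example at the end of a pipe (a trivial sketch with the hypothetical worker_data):

worker_data %>% filter(age > 40) %>% print()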

Summarize

What it’s for: Summarize collapses your dataframe down to a single row (or one row per group) containing the summary values you calculate; any columns you don’t summarize are dropped.

When to use it: When you want to calculate a summary statistic such as the mean or median of a column.

Code example:

To create a new column called mean_age that is the mean (average) of the ages of your employees you could do this:

worker_data_filter <- worker_data %>%
select(name, age) %>%
summarize(mean_age = mean(age))

After the summarize statement, your new df will have a single row and a single column (the mean_age you just created); the name and age columns you selected are dropped.

See the assignment instructions for useful functions you can use with summarize, including mean(), n() (or the count() verb), sd(), IQR(), and more.
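
As a sketch of what that looks like, you can calculate several of these summaries in one summarize() call (again using the hypothetical worker_data):

worker_data_summary <- worker_data %>%
summarize(mean_age = mean(age), sd_age = sd(age), iqr_age = IQR(age), n_workers = n())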

Transmute

What it’s for: Transmute works almost identically to mutate (below), except that it drops any column not mentioned.

When to use it: Same as mutate, but when you don’t need the other columns.

Code example:

Let’s say you want the mean age, but you don’t need anything else. Not the name, or any other columns. You could pull just the column you need, but by transmuting you’ll be finished with just your single value:

worker_data_mean_age <- worker_data %>%
transmute(mean_age = mean(age)) %>%
head(1)

Your output is just the single value of the mean.

Mutate

What it’s for: Mutate allows you to add new columns calculated from existing ones, while keeping the existing columns.

When to use it: Pretty self-explanatory!

Code example:

worker_data_mutate <- worker_data %>%
mutate(squared_age = age^2)

This will take each row of the dataframe and square the age, depositing it in a new column.

As.data.frame

What it’s for: Converting other data types into dataframes

When to use it: I mostly used this when I was trying to get my output properly rounded.

Code example:

as.data.frame(worker_csv)
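
For example, one way to round a summarized value before submitting it to CodeGrade (a sketch, assuming the hypothetical worker_data and an all-numeric result) is:

worker_mean_rounded <- worker_data %>%
summarize(mean_age = mean(age)) %>%
as.data.frame() %>%
round(2)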

Count

What it’s for: When you want to return the count (the number of results) rather than the results themselves.

When to use it: Pretty self-explanatory!

Code example:

Suppose you wanted to find the number of distinct ages represented in your dataset:

worker_data_ages <- worker_data %>%
select(age) %>%
distinct() %>%
count()

Distinct

What it’s for: Distinct eliminates duplicates.

When to use it: Whenever you have multiple duplicated values and you want to eliminate those duplicates.

Code example:

Suppose you want to find the different ages represented in a dataset.

worker_data_ages <- worker_data %>%
select(age) %>%
distinct()

Na.omit

What it’s for: This removes rows that contain NA (missing) values from your dataframe.

When to use it: When commands fail because of the presence of NA values.

Code example:

worker_data_nomissing <- na.omit(worker_data)

4 thoughts on “DTSC-650 Data Analytics in R Tips”

  1. KC says:
    February 4, 2022 at 2:47 pm

    Hi Dustin, great post and those are pretty much all the commands needed to finish all 8 assignments! I am taking DTSC 575 right now and working on the Behavioral Risk Factor project until the end of term in early March. For the final section, the wording of those 5 questions becomes very ambiguous in my opinion. Does the professor give a lot of leeway in what and how we present our output? Is it true that as long as we comment on the logic of how we came up with the chosen variables and interpret the summary statistics and plots, we should be good to do well in the class?

    P.S. I saw you’re taking DTSC 670 Machine Learning this term. Do your CodeGrade assignments allow multiple submissions like DTSC 575?

    1. Dustin says:
      February 4, 2022 at 5:14 pm

      Hi KC,

      I actually get a lot of spam comments on my blog so all comments are approved manually. The professor does grade leniently on those questions where you have to pick your own variables and do the analysis. It’s about making sure that you are making an effort, rather than scrutinizing your output closely.

      When I wrote this post, I was in 670. I’ve since finished 670 and 680. The 670 project is 85% assignments but they are guided, so they tell you what to do and frequently what the output should look like to make sure you know you’re doing it correctly. I’m actually writing up a tips and tricks post for 670 right now which I should finish in a few days.

      Dustin

  2. KC says:
    February 4, 2022 at 5:58 pm

    Thank you Dustin! Can’t wait for your 670 review

  3. Ezichi says:
    February 12, 2022 at 9:27 pm

    Thank you for your wonderful blogs and resources! I have read this one a few times before but it is finally making sense now that I am completing my DTSC 550 labs! I have been trying to wrap my mind around the ‘why’ of the pipe command and your explanation just helped me understand it. 🙂 I am registered to take DTSC 650 next term and I am looking forward to it.

