Introduction
As a student in Eastern University’s Master of Science in Data Science program, I recently completed a course in basic data analytics with R. I had never worked with R before, but this was probably my favorite course in the whole program.
Tidyverse
The tidyverse is a set of packages used for data science, including dplyr for data manipulation, ggplot2 for graphing, and several others. The key to DTSC-650 is understanding how to perform data manipulation with basic dplyr commands.
CodeGrade
This course includes several assignments using the automated grading platform CodeGrade. You’re given directions and asked to work with a dataset to achieve the specified results. CodeGrade compares your output to the correct output and assigns you points for each question you answered successfully.
The Pipe Command (%>%)
One of the most critical tools to be successful in this class is understanding the pipe. It’s part of the magrittr package and is included in dplyr. The pipe allows your output to be carried across a series of operations to make your code more readable.
So, for example, if you need to select name, age, and location from a worker_data dataset, filter it to only those in the East location, and calculate their mean age, you could write it like this:
mean_east <- select(worker_data, name, age, location)
mean_east <- filter(mean_east, location == 'East')
mean_east <- summarize(mean_east, mean_worker_data = mean(age))
However, that is clunky and includes a lot of duplication. By using the pipe, you can instead rewrite this code as:
mean_east <- worker_data %>%
select(name, age, location) %>%
filter(location == 'East') %>%
summarize(mean_worker_data = mean(age))
Much simpler and easier to read, and your data still ends up in the same place.
Important R Commands
In addition to understanding the pipe, the other thing that’s helpful to keep handy is a dplyr cheat sheet. It’s helpful to understand how each command works so that you can use them successfully.
Select
What it’s for: Select keeps only the columns you specify from your dataframe so that you can use them in your analysis.
When to use it: I always start each question by selecting only the relevant columns so that I don’t run into any issues down the road.
Code example:
To select only the name, age and location columns from a dataframe called worker_data (without using the pipe in this example):
worker_data_select <- select(worker_data, name, age, location)
Filter
What it’s for: Filter keeps only the rows of your dataset that meet specific conditions. You can filter where a column equals a value, where a column is above or below a threshold, or where a column matches a list of values.
When to use it: Use filter when you need to only examine data meeting certain criteria.
Code example:
To select only the individuals with age above 40 and see their names:
worker_data_filter <- worker_data %>% select(name, age) %>% filter(age > 40)
Or to select individuals in locations East or West you can use the pipe symbol | (not to be confused with the R pipe %>%!):
worker_data_filter <- worker_data %>% select(name, location) %>% filter(location == 'East' | location == 'West')
To require multiple conditions at once, use & (and). Note that filter(location == 'East' & location == 'West') would return no rows, because no single row can have both values; use | for either/or. A sensible use of & combines conditions on different columns:
worker_data_filter <- worker_data %>% select(name, age, location) %>% filter(location == 'East' & age > 40)
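For the “list of values” case, the %in% operator is often cleaner than chaining several | conditions. A minimal sketch, using made-up worker_data for illustration:

```r
library(dplyr)

# Hypothetical sample data for illustration
worker_data <- data.frame(name = c('Ann', 'Bob', 'Cal'),
                          location = c('East', 'South', 'West'))

# Keep rows whose location matches any value in the vector
worker_data_filter <- worker_data %>%
  select(name, location) %>%
  filter(location %in% c('East', 'West'))
```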
Arrange
What it’s for: Arrange is used to sort your dataframe ascending (the top result is the lowest value) or descending (the top result is the highest value). It defaults to ascending, but if you put a minus symbol in front of a numeric column it will flip to descending.
When to use it: Whenever you need to pull the maximum or minimum value in a column.
Code example: To find the youngest workers:
worker_data_filter <- worker_data %>% select(name, age) %>% arrange(age)
To find the oldest workers, just put a minus symbol in front of the age column:
worker_data_filter <- worker_data %>% select(name, age) %>% arrange(-age)
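The minus trick only works for numeric columns; dplyr’s desc() does the same thing and also handles character columns. A sketch with made-up data:

```r
library(dplyr)

# Hypothetical sample data for illustration
worker_data <- data.frame(name = c('Ann', 'Bob', 'Cal'),
                          age = c(30, 55, 42))

# desc() sorts descending: the oldest worker ends up first
worker_data_filter <- worker_data %>%
  select(name, age) %>%
  arrange(desc(age))
```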
Head
What it’s for: Head returns just the first rows of your dataframe, up to the number you specify.
When to use it: Whenever you only need the top few rows, such as a single result after sorting.
Code example: In our previous example, you sorted the workers from oldest to youngest. Maybe you only want the 5 oldest workers. That would look like this:
worker_data_filter <- worker_data %>% select(name, age) %>% arrange(-age) %>% head(5)
Print
What it’s for: To print data to the screen
When to use it: This should be self-explanatory!
Code example:
print('Hello world')
Summarize
What it’s for: Summarize collapses your dataframe down to a single row of summary values, applying a function such as mean() to a column.
When to use it: When you want to calculate a summary statistic such as the mean or median of a column.
Code example:
To create a new column called mean_age that is the mean (average) of the ages of your employees you could do this:
worker_data_filter <- worker_data %>% select(name, age) %>% summarize(mean_age = mean(age))
After the summarize statement, your new df will have a single row and a single column (the mean_age you just created); the name and age columns you selected are dropped.
See the instructions for useful functions you can use with summarize, including mean(), median(), n(), sd(), IQR(), and more.
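As a sketch with made-up data, several of those functions can be combined in a single summarize call:

```r
library(dplyr)

# Hypothetical sample data for illustration
worker_data <- data.frame(name = c('Ann', 'Bob', 'Cal', 'Dee'),
                          age = c(30, 40, 50, 60))

# One summarize call can produce several statistics at once
worker_stats <- worker_data %>%
  summarize(mean_age = mean(age),  # average age
            sd_age = sd(age),      # standard deviation of age
            n_rows = n())          # number of rows summarized
```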
Transmute
What it’s for: Transmute works almost identically to mutate (below), except that it drops any column you don’t mention.
When to use it: The same situations as mutate, but when you don’t need the other columns.
Code example:
Let’s say you want the mean age and nothing else: not the name, not any other columns. Transmute computes the mean but still returns one row per original row (each holding the same value), so head(1) trims the result down to a single value:
worker_data_mean_age <- worker_data %>% transmute(mean_age = mean(age)) %>% head(1)
Your output is just the single value of the mean.
Mutate
What it’s for: Mutate adds new columns to your dataframe, computed from existing ones, while keeping all the existing columns.
When to use it: Pretty self explanatory!
Code example:
worker_data_mutate <- worker_data %>% mutate(squared_age = age ** 2)
This will take each row of the dataframe and square the age, depositing it in a new column.
As.data.frame
What it’s for: Converting other data types into dataframes
When to use it: I mostly used this when I was trying to get my output properly rounded.
Code example:
as.data.frame(worker_csv)
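As a minimal sketch (the matrix here is made up for illustration), as.data.frame turns a matrix into a dataframe so dplyr verbs can be applied to it:

```r
library(dplyr)

# Hypothetical matrix of ages, e.g. as returned by a non-tidyverse function
age_matrix <- matrix(c(25, 40, 33, 51), ncol = 1,
                     dimnames = list(NULL, 'age'))

age_df <- as.data.frame(age_matrix)        # now a regular dataframe
oldest <- age_df %>% arrange(-age) %>% head(1)
```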
Count
What it’s for: When you want to return the count (the number of results) rather than the results themselves.
When to use it: Pretty self explanatory!
Code example:
Suppose you wanted to find the number of ages represented in your dataset:
worker_data_ages <- worker_data %>% select(age) %>% distinct() %>% count()
Distinct
What it’s for: Distinct eliminates duplicates.
When to use it: Whenever you have multiple duplicated values and you want to eliminate those duplicates.
Code example:
Suppose you want to find the different ages represented in a dataset.
worker_data_ages <- worker_data %>% select(age) %>% distinct()
Na.omit
What it’s for: This is used to drop rows containing NA (missing) values from your dataframe.
When to use it: When commands fail because of the presence of NA values (missing values.)
Code example:
worker_data_nomissing <- na.omit(worker_data)
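To see why this matters, here is a sketch with made-up data: mean() returns NA when any value is missing, and na.omit removes the offending rows first:

```r
# Hypothetical sample data with one missing age
worker_data <- data.frame(name = c('Ann', 'Bob', 'Cal'),
                          age = c(30, NA, 50))

mean(worker_data$age)   # NA: a single missing value poisons the mean

worker_data_nomissing <- na.omit(worker_data)  # drops Bob's row
mean(worker_data_nomissing$age)                # 40
```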
Hi Dustin, great post and those are pretty much all the commands needed to finish all 8 assignments! I am taking DTSC 575 right now and working on the Behavioral Risk Factor project until the end of term early March. For the final section, the wording of those 5 questions becomes very ambiguous in my opinion. Does the professor give a lot of leeway on what and how we present our output? Is it true that as long as we comment on the logic of how we came up with the chosen variables and interpret the summary statistics and plots we should be good to do well in the class?
P.S. I saw you’re taking DTSC 670 Machine Learning this term. Do your CodeGrade assignments allow multiple submission like DTSC 575?
Hi KC,
I actually get a lot of spam comments on my blog so all comments are approved manually. The professor does grade leniently on those questions where you have to pick your own variables and do the analysis. It’s about making sure that you are making an effort, rather than scrutinizing your output closely.
When I wrote this post, I was in 670. I’ve since finished 670 and 680. The 670 project is 85% assignments but they are guided, so they tell you what to do and frequently what the output should look like to make sure you know you’re doing it correctly. I’m actually writing up a tips and tricks post for 670 right now which I should finish in a few days.
Dustin
Thank you Dustin! Can’t wait for your 670 review
Thank you for your wonderful blogs and resources! I have read this one a few times before but it is finally making sense now that I am completing my DTSC 550 labs! I have been trying to wrap my mind around the ‘why’ of the pipe command and your explanation just helped me understand it. 🙂 I am registered to take DTSC 650 next term and I am looking forward to it.