Introduction
DTSC-670 is the Foundations of Machine Learning course at Eastern University. It's a good course that definitely stretches your knowledge of Python, matplotlib, and scikit-learn as you learn how to build and evaluate machine learning models.
The purpose of this article is to explore some of the concepts in the course and give you some tips. As a Graduate Assistant (GA) for this course, I've had the opportunity to work closely with many students and see where they struggle. This article won't use any actual code from the assignments; instead it discusses the higher-level principles you'll need to know.
Assignment 1: Johnny Likes Pie
Description of Assignment: This assignment involves the Johnny Likes Pie dataset. You’ll learn how to perform one-hot encoding to encode the categorical data into a format suitable for machine learning algorithms.
General Strategy:
- Read the data into a dataframe (df)
- Drop the unnecessary example column
- One-hot encode the data (it’s all categorical)
- Create a features and response df
- Create and fit a linear regression model
- Use sklearn’s accuracy_score function to calculate the model’s accuracy
Most Common Error: The major error I see students make here happens when they create the features and response dfs: they forget to drop the column that became the response from the features df. This means their model effectively memorizes the output (a form of target leakage). A minimal sketch of the intended flow follows.
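Here's what that flow might look like. The file name and the example and likes_pie column names are placeholders for whatever the assignment actually uses, and rounding the predictions before scoring is my assumption about how continuous LinearRegression output gets turned into labels for accuracy_score.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score

# File and column names below are hypothetical placeholders.
df = pd.read_csv("johnny_likes_pie.csv")
df = df.drop(columns=["example"])                    # drop the unnecessary example column

# Split off the response FIRST, then one-hot encode only the features.
y = df["likes_pie"]                                  # response (assumed to be 0/1)
X = pd.get_dummies(df.drop(columns=["likes_pie"]))   # features without the response

model = LinearRegression().fit(X, y)

# LinearRegression outputs continuous values, so round to 0/1 before
# accuracy_score, which expects discrete class labels.
preds = model.predict(X).round()
print(accuracy_score(y, preds))
```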
Assignment 2: Brazil COVID Data
Description of Assignment: This assignment involves cleaning a large (500,000 row) COVID dataset used in a published study (Temperature significantly changes COVID-19 transmission in (sub)tropical cities of Brazil). You’ll learn how to read in Excel sheets, merge data, and otherwise flex your pandas skills.
General Strategy:
- Read the data into a series of dataframes
- Filter the data so you only have the data from the 27 capital cities
- Create a days column that counts from 0 to 150 for each city (that's 151 values, since the count starts at 0 and runs through 150)
- Create days_sq and days_cube columns by squaring and cubing the days column
- Create pop (the city’s population), pop_sq and pop_density
- Create temp by filtering the temperature df down to the 27 capital cities and merging it in
Most Common Error: A huge amount of effort gets spent trying to match the days to the cities, with all sorts of unnecessary complexity. This is where knowing the dataset comes in handy: because the data is already sorted by date, you can just create a list using list(range()) and multiply it by 27 to produce a list with the correct count before adding it to your df (see the sketch below).
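A minimal sketch of that trick, assuming the filtered df ends up ordered so that each capital's 151 rows sit together in date order (if instead every city appears once per date, you'd repeat each day value 27 times rather than repeating the whole range):

```python
# 27 capital cities x 151 days = 4,077 values, matching the filtered df.
days = list(range(151)) * 27
df["days"] = days

# The derived columns then follow directly from days.
df["days_sq"] = df["days"] ** 2
df["days_cube"] = df["days"] ** 3
```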
Some Helpful Links:
- Filtering a dataset based on values: https://cmdlinetips.com/2018/02/how-to-subset-pandas-dataframe-based-on-values-of-a-column/
- Dropping rows that don't meet certain values: https://stackoverflow.com/questions/52456874/drop-rows-on-multiple-conditions-in-pandas-dataframe
- To see every row of your df, use pd.set_option('display.max_rows', None). To go back to the limited display pandas gives you by default, use pd.reset_option (https://pandas.pydata.org/docs/reference/api/pandas.reset_option.html)
- Creating new columns (like days_sq) derived from others (like days): https://pandas.pydata.org/docs/getting_started/intro_tutorials/05_add_columns.html
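For example, a quick sketch of toggling that display option while inspecting a df:

```python
import pandas as pd

pd.set_option("display.max_rows", None)   # show every row when the df is printed
# ... inspect the full df here ...
pd.reset_option("display.max_rows")       # restore pandas' default truncated display
```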
Assignment 3: Multiple Linear Regression
Description of Assignment: This assignment involves creating a multiple linear regression model to recover the intercept and coefficients used to plot some data.
General Strategy:
- Read the data into a df
- Create the 4 plots using scatter3D
- Train your linear regression model
- Take x_fit and y_fit and put them into a df
- Use your trained model to predict z_fit
- Plot another set of 4 figures with plot3D (x_fit, y_fit, z_fit) and scatter3D (x, y, z) to add the line of best fit
Most Common Error: Some students try to “math their way out” of the programming in this assignment. You need to use the linear regression model to generate the line of best fit. Any other attempt will lead to errors, lost time and lost points.
Another important thing: you need to make sure your new df (with x_fit and y_fit) has the right number of dimensions, since the model expects 2D input for prediction. The easiest way to get this is to build the df with labeled columns, so each of x_fit and y_fit stays its own column (see the sketch below).
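A minimal sketch of that middle stretch, assuming x, y, z are the training arrays and x_fit, y_fit are the values you're given (all names are placeholders, and only one of the four panels is shown):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# x, y, z, x_fit, y_fit come from the assignment's data.
# Train on the original data; labeled columns keep the input 2D.
model = LinearRegression().fit(pd.DataFrame({"x": x, "y": y}), z)

# Put x_fit and y_fit into a labeled df and predict z_fit with the trained model.
fit_df = pd.DataFrame({"x": x_fit, "y": y_fit})
z_fit = model.predict(fit_df)

# One panel: scatter the raw points, then overlay the fitted line.
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter3D(x, y, z)
ax.plot3D(x_fit, y_fit, z_fit)
plt.show()
```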
Some Helpful Links:
- Using GridSpec to create the subplots (https://matplotlib.org/3.1.1/tutorials/intermediate/gridspec.html)
- Reshaping arrays (https://www.w3schools.com/python/numpy/numpy_array_reshape.asp)
Assignment 4: Custom Transformer
Description of Assignment: This assignment has you write a custom scikit-learn transformer and assemble it into a full preprocessing pipeline. It's probably one of the hardest assignments for students to wrap their heads around because it's so different from the others.
General Strategy:
- Read the data into a df
- Create your transformer
- Create your numerical pipeline, which includes a simple mean imputer, the transformer you wrote, and StandardScaler
- One-hot encode your categorical data
- Put your numerical pipeline and one-hot-encoded data into a ColumnTransformer to create a full pipeline
Most Common Error: This assignment uses old-fashioned array manipulation, which students haven't worked with for a while. It's easy to get complacent when working with dataframes and forget that NumPy arrays don't have column labels, so inside the transformer you index by position rather than by name.
Your transformer should have an init method, a fit method, and a transform method. The fit method does essentially nothing (it just returns self), but the bulk of the work happens in your transform method. A minimal sketch follows.
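Here's the general shape of the thing, assuming the conventional scikit-learn transformer pattern; the class name, the logic inside transform, and the num_cols / cat_cols lists are placeholders:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

class MyTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass                      # store any hyperparameters here

    def fit(self, X, y=None):
        return self               # nothing to learn, so just return self

    def transform(self, X):
        X = np.asarray(X)         # a plain NumPy array: index by position, not label
        # ... the assignment's real work happens here ...
        return X

# num_cols / cat_cols stand in for the dataset's actual column names.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("custom", MyTransformer()),
    ("scaler", StandardScaler()),
])

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", OneHotEncoder(), cat_cols),
])
```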
Some Helpful Links:
- Creating a Custom Transformer in Python (https://www.section.io/engineering-education/custom-transformer/) – this article is useful because it includes the structure of the 3 functions in Python
- How to One Hot Encode data in Python (https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/)
Assignment 5: Scikit-learn Principles
Description of Assignment: This is a short writing assignment on the principles of scikit-learn, using references provided by the professor.
General Strategy: Read the articles, write the summary!
Most Common Error: N/A. Most students do fine with this one.
Some Helpful Links:
- A review of the design principles of scikit-learn (https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html#Scikit-Learn’s-Estimator-API)
Assignment 6: Classification System Metrics
Description of Assignment: This assignment has you write functions for accuracy, model error rate, precision and recall, F-beta, True Positive Rate (TPR) and False Positive Rate (FPR), and then create a function that produces a Receiver Operating Characteristic (ROC) curve.
General Strategy:
- This one is straightforward. When you're told to build the accuracy function, you build it as given! This tests your Python skills, especially function writing and understanding loops (a minimal sketch follows).
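For instance, a couple of these metric functions written from scratch; the names and signatures here are my own, so follow whatever the assignment actually specifies:

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels.
    correct = 0
    for true, pred in zip(y_true, y_pred):
        if true == pred:
            correct += 1
    return correct / len(y_true)

def precision(y_true, y_pred):
    # TP / (TP + FP), counting with explicit comparisons.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp)
```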
Most Common Error: The biggest issue students have is with the ROC_curve_computer function. Your previous functions will be used inside this one; you don't need to duplicate them. Remember that for each threshold you make a prediction, determine the TPR and FPR, and then output the lists of TPRs and FPRs. It's those TPR and FPR lists that you will be plotting.
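A minimal sketch of that structure, assuming you already have TPR/FPR helpers from earlier in the assignment (the function and helper names here are placeholders, not the assignment's actual signature):

```python
import matplotlib.pyplot as plt

def roc_curve_computer(y_true, y_scores, thresholds):
    # For each threshold: predict, compute TPR/FPR with the earlier helpers,
    # and collect the results in two lists.
    tprs, fprs = [], []
    for t in thresholds:
        y_pred = [1 if score >= t else 0 for score in y_scores]
        tprs.append(true_positive_rate(y_true, y_pred))
        fprs.append(false_positive_rate(y_true, y_pred))
    return tprs, fprs

tprs, fprs = roc_curve_computer(y_true, y_scores, thresholds)
plt.plot(fprs, tprs)   # FPR on the x-axis, TPR on the y-axis
plt.show()
```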
Some Helpful Links:
- Understanding the ROC (https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc)
- Writing Functions in Python (https://www.w3schools.com/python/python_functions.asp)
Assignment 7: Polynomial Regression I
Description of Assignment: This assignment is very similar to Assignment 3, except you'll be plotting a polynomial regression instead of a linear regression.
General Strategy:
- Reuse your Assignment 3 code! And add in the PolynomialFeatures Transformation
Most Common Error: The biggest issue comes after you plot the initial x, y, and z data: you do a PolynomialFeatures fit and transform, then train your LinearRegression model on that transformed df. When you get to creating the second set of figures (with the line of best fit), you're given x_fit and y_fit and need to predict z_fit. You have to put x_fit and y_fit through the same transformation you applied to x and y (see the sketch below).
Otherwise, this is very similar to Assignment 3.
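A minimal sketch of that hand-off, with placeholder names and an assumed degree (use whatever the assignment specifies):

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# x, y, z are the training data; x_fit, y_fit come from the assignment.
# Fit the polynomial expansion on the training features and train on it.
poly = PolynomialFeatures(degree=2)                        # degree is a placeholder
X_poly = poly.fit_transform(pd.DataFrame({"x": x, "y": y}))
model = LinearRegression().fit(X_poly, z)

# The SAME fitted transformer must be applied to x_fit and y_fit.
fit_poly = poly.transform(pd.DataFrame({"x": x_fit, "y": y_fit}))
z_fit = model.predict(fit_poly)
```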
Some Helpful Links:
- How to use Polynomial Features (https://machinelearningmastery.com/polynomial-features-transforms-for-machine-learning/)
Assignment 8: Polynomial Regression II
Description of Assignment: This is very similar to assignment 7. So similar that you’ll be fine as long as you reviewed assignment 7.
General Strategy: Exactly the same as assignment 7
Most Common Error: N/A