Introduction
DTSC-670 is the Foundations of Machine Learning course at Eastern University. It's a good course that definitely stretches your knowledge of Python, matplotlib, and scikit-learn as you learn how to build and evaluate machine learning models.
The purpose of this article is to explore some of the concepts in the course and give you some tips. As a Graduate Assistant (GA) for this course, I've had the opportunity to work closely with many students and see where they struggle. This article won't use any actual code from the assignments; instead it discusses the higher-level principles you'll need to know.
Assignment 1: Johnny Likes Pie
Description of Assignment: This assignment involves the Johnny Likes Pie dataset. You’ll learn how to perform one-hot encoding to encode the categorical data into a format suitable for machine learning algorithms.
General Strategy:
- Read the data into a dataframe (df)
- Drop the unnecessary example column
- One-hot encode the data (it’s all categorical)
- Create a features and response df
- Create and fit a linear regression model
- Use sklearn’s accuracy_score function to calculate the model’s accuracy
Most Common Error: The major error I see students make here happens when they create the features and response dfs: they forget to drop the column that became the response from the features df. This means their model effectively memorizes the output (a form of target leakage). A minimal sketch of the intended flow follows.
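Here's what that flow might look like. The file name and the example and likes_pie column names are placeholders for whatever the assignment actually uses, and rounding the predictions before scoring is my assumption about how continuous LinearRegression output gets turned into labels for accuracy_score.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score

# File and column names below are hypothetical placeholders.
df = pd.read_csv("johnny_likes_pie.csv")
df = df.drop(columns=["example"])                    # drop the unnecessary example column

# Split off the response FIRST, then one-hot encode only the features.
y = df["likes_pie"]                                  # response (assumed to be 0/1)
X = pd.get_dummies(df.drop(columns=["likes_pie"]))   # features without the response

model = LinearRegression().fit(X, y)

# LinearRegression outputs continuous values, so round to 0/1 before
# accuracy_score, which expects discrete class labels.
preds = model.predict(X).round()
print(accuracy_score(y, preds))
```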
Assignment 2: Brazil COVID Data
Description of Assignment: This assignment involves cleaning a large (500,000 row) COVID dataset used in a published study (Temperature significantly changes COVID-19 transmission in (sub)tropical cities of Brazil). You’ll learn how to read in Excel sheets, merge data, and otherwise flex your pandas skills.
General Strategy:
- Read the data into a series of dataframes
- Filter the data so you only have the data from the 27 capital cities
- Create a days column that counts from 0 to 150 for each city (that's 151 values, since the count starts at 0 and runs through 150)
- Create days_sq and days_cube columns by squaring and cubing the days column
- Create pop (the city’s population), pop_sq and pop_density
- Create temp by filtering the temperature df down to the 27 capital cities and merging it in
Most Common Error: A huge amount of effort gets spent trying to match the days to the cities, with all sorts of unnecessary complexity. This is where knowing the dataset comes in handy: because the data is already sorted by date, you can just create a list using list(range()) and multiply it by 27 to produce a list with the correct count before adding it to your df (see the sketch below).
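A minimal sketch of that trick, assuming the filtered df ends up ordered so that each capital's 151 rows sit together in date order (if instead every city appears once per date, you'd repeat each day value 27 times rather than repeating the whole range):

```python
# 27 capital cities x 151 days = 4,077 values, matching the filtered df.
days = list(range(151)) * 27
df["days"] = days

# The derived columns then follow directly from days.
df["days_sq"] = df["days"] ** 2
df["days_cube"] = df["days"] ** 3
```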
Some Helpful Links:
- Filtering a dataset based on values: https://cmdlinetips.com/2018/02/how-to-subset-pandas-dataframe-based-on-values-of-a-column/
- Dropping rows that don't meet certain values: https://stackoverflow.com/questions/52456874/drop-rows-on-multiple-conditions-in-pandas-dataframe
- To see every row of your df, use pd.set_option('display.max_rows', None). To go back to the limited display pandas gives you by default, use pd.reset_option (https://pandas.pydata.org/docs/reference/api/pandas.reset_option.html)
- Creating new columns (like days_sq) derived from others (like days): https://pandas.pydata.org/docs/getting_started/intro_tutorials/05_add_columns.html
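For example, a quick sketch of toggling that display option while inspecting a df:

```python
import pandas as pd

pd.set_option("display.max_rows", None)   # show every row when the df is printed
# ... inspect the full df here ...
pd.reset_option("display.max_rows")       # restore pandas' default truncated display
```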
Assignment 3: Multiple Linear Regression
Description of Assignment: This assignment involves creating a multiple linear regression model to recover the intercept and coefficients used to plot some data.
General Strategy:
- Read the data into a df
- Create the 4 plots using scatter3D
- Train your linear regression model
- Take x_fit and y_fit and put them into a df
- Use your trained model to predict z_fit
- Plot another set of 4 figures with plot3D (x_fit, y_fit, z_fit) and scatter3D (x, y, z) to add the line of best fit
Most Common Error: Some students try to “math their way out” of the programming in this assignment. You need to use the linear regression model to generate the line of best fit. Any other attempt will lead to errors, lost time and lost points.
Another important thing: you need to make sure your new df (with x_fit and y_fit) has the right number of dimensions, since the model expects 2D input for prediction. The easiest way to get this is to build the df with labeled columns, so each of x_fit and y_fit stays its own column (see the sketch below).
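A minimal sketch of that middle stretch, assuming x, y, z are the training arrays and x_fit, y_fit are the values you're given (all names are placeholders, and only one of the four panels is shown):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# x, y, z, x_fit, y_fit come from the assignment's data.
# Train on the original data; labeled columns keep the input 2D.
model = LinearRegression().fit(pd.DataFrame({"x": x, "y": y}), z)

# Put x_fit and y_fit into a labeled df and predict z_fit with the trained model.
fit_df = pd.DataFrame({"x": x_fit, "y": y_fit})
z_fit = model.predict(fit_df)

# One panel: scatter the raw points, then overlay the fitted line.
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter3D(x, y, z)
ax.plot3D(x_fit, y_fit, z_fit)
plt.show()
```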
Some Helpful Links:
- Using GridSpec to create the subplots (https://matplotlib.org/3.1.1/tutorials/intermediate/gridspec.html)
- Reshaping arrays (https://www.w3schools.com/python/numpy/numpy_array_reshape.asp)
Assignment 4: Custom Transformer
Description of Assignment: This assignment has you write a custom scikit-learn transformer and assemble it into a full preprocessing pipeline. It's probably one of the hardest assignments for students to wrap their heads around because it's so different from the others.
General Strategy:
- Read the data into a df
- Create your transformer
- Create your numerical pipeline, which includes a simple mean imputer, the transformer you wrote, and StandardScaler
- One-hot encode your categorical data
- Put your numerical pipeline and one-hot-encoded data into a ColumnTransformer to create a full pipeline
Most Common Error: This assignment uses old-fashioned array manipulation, which students haven't worked with for a while. It's easy to get complacent when working with dataframes and forget that NumPy arrays don't have column labels, so inside the transformer you index by position rather than by name.
Your transformer should have an init method, a fit method, and a transform method. The fit method does essentially nothing (it just returns self), but the bulk of the work happens in your transform method. A minimal sketch follows.
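Here's the general shape of the thing, assuming the conventional scikit-learn transformer pattern; the class name, the logic inside transform, and the num_cols / cat_cols lists are placeholders:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

class MyTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass                      # store any hyperparameters here

    def fit(self, X, y=None):
        return self               # nothing to learn, so just return self

    def transform(self, X):
        X = np.asarray(X)         # a plain NumPy array: index by position, not label
        # ... the assignment's real work happens here ...
        return X

# num_cols / cat_cols stand in for the dataset's actual column names.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("custom", MyTransformer()),
    ("scaler", StandardScaler()),
])

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", OneHotEncoder(), cat_cols),
])
```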
Some Helpful Links:
- Creating a Custom Transformer in Python (https://www.section.io/engineering-education/custom-transformer/) – this article is useful because it includes the structure of the 3 functions in Python
- How to One Hot Encode data in Python (https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/)
Assignment 5: Scikit-learn Principles
Description of Assignment: This is a short writing assignment on the principles of scikit-learn, using references provided by the professor.
General Strategy: Read the articles, write the summary!
Most Common Error: N/A. Most students do fine with this one.
Some Helpful Links:
- A review of the design principles of scikit-learn (https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html#Scikit-Learn’s-Estimator-API)
Assignment 6: Classification System Metrics
Description of Assignment: This assignment has you write functions for accuracy, model error rate, precision and recall, F-beta, True Positive Rate (TPR) and False Positive Rate (FPR), and then create a function that produces a Receiver Operating Characteristic (ROC) curve.
General Strategy:
- This one is straightforward. When you're told to build the accuracy function, you build it as given! This tests your Python skills, especially function writing and understanding loops (a minimal sketch follows).
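For instance, a couple of these metric functions written from scratch; the names and signatures here are my own, so follow whatever the assignment actually specifies:

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels.
    correct = 0
    for true, pred in zip(y_true, y_pred):
        if true == pred:
            correct += 1
    return correct / len(y_true)

def precision(y_true, y_pred):
    # TP / (TP + FP), counting with explicit comparisons.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp)
```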
Most Common Error: The biggest issue students have is with the ROC_curve_computer function. Your previous functions will be used inside this one; you don't need to duplicate them. Remember that for each threshold you make a prediction, determine the TPR and FPR, and then output the lists of TPRs and FPRs. It's those TPR and FPR lists that you will be plotting.
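A minimal sketch of that structure, assuming you already have TPR/FPR helpers from earlier in the assignment (the function and helper names here are placeholders, not the assignment's actual signature):

```python
import matplotlib.pyplot as plt

def roc_curve_computer(y_true, y_scores, thresholds):
    # For each threshold: predict, compute TPR/FPR with the earlier helpers,
    # and collect the results in two lists.
    tprs, fprs = [], []
    for t in thresholds:
        y_pred = [1 if score >= t else 0 for score in y_scores]
        tprs.append(true_positive_rate(y_true, y_pred))
        fprs.append(false_positive_rate(y_true, y_pred))
    return tprs, fprs

tprs, fprs = roc_curve_computer(y_true, y_scores, thresholds)
plt.plot(fprs, tprs)   # FPR on the x-axis, TPR on the y-axis
plt.show()
```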
Some Helpful Links:
- Understanding the ROC (https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc)
- Writing Functions in Python (https://www.w3schools.com/python/python_functions.asp)
Assignment 7: Polynomial Regression I
Description of Assignment: This assignment is very similar to Assignment 3, except you'll be plotting a polynomial regression instead of a linear regression.
General Strategy:
- Reuse your Assignment 3 code! And add in the PolynomialFeatures Transformation
Most Common Error: The biggest issue comes after you plot the initial x, y, and z data: you do a PolynomialFeatures fit and transform, then train your LinearRegression model on that transformed df. When you get to creating the second set of figures (with the line of best fit), you're given x_fit and y_fit and need to predict z_fit. You have to put x_fit and y_fit through the same transformation you applied to x and y (see the sketch below).
Otherwise, this is very similar to Assignment 3.
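A minimal sketch of that hand-off, with placeholder names and an assumed degree (use whatever the assignment specifies):

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# x, y, z are the training data; x_fit, y_fit come from the assignment.
# Fit the polynomial expansion on the training features and train on it.
poly = PolynomialFeatures(degree=2)                        # degree is a placeholder
X_poly = poly.fit_transform(pd.DataFrame({"x": x, "y": y}))
model = LinearRegression().fit(X_poly, z)

# The SAME fitted transformer must be applied to x_fit and y_fit.
fit_poly = poly.transform(pd.DataFrame({"x": x_fit, "y": y_fit}))
z_fit = model.predict(fit_poly)
```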
Some Helpful Links:
- How to use Polynomial Features (https://machinelearningmastery.com/polynomial-features-transforms-for-machine-learning/)
Assignment 8: Polynomial Regression II
Description of Assignment: This is very similar to assignment 7. So similar that you’ll be fine as long as you reviewed assignment 7.
General Strategy: Exactly the same as assignment 7
Most Common Error: N/A