Dustin K MacDonald

DTSC-670 Tips and Tricks

Posted on February 10, 2022 (updated February 11, 2022) by Dustin

Introduction

DTSC-670 is the Foundations of Machine Learning course at Eastern University. It’s a good course that definitely stretches your knowledge of matplotlib, Python, and scikit-learn as you learn how to build machine learning models and evaluate them.

The purpose of this article is to explore some of the concepts in the course and give you some tips. As a Graduate Assistant (GA) for this course, I’ve had the opportunity to work closely with many students and see where they struggle. This article won’t use any actual code from the assignments; instead it discusses the higher-level principles that you’ll need to know.

Assignment 1: Johnny Likes Pie

Description of Assignment: This assignment involves the Johnny Likes Pie dataset. You’ll learn how to perform one-hot encoding to encode the categorical data into a format suitable for machine learning algorithms.

General Strategy:

  • Read the data into a dataframe (df)
  • Drop the unnecessary example column
  • One-hot encode the data (it’s all categorical)
  • Create a features and response df
  • Create and fit a linear regression model
  • Use sklearn’s accuracy_score function to calculate the model’s accuracy

Most Common Error: The major error I see students make in this course is that when they create the features and response dfs, they forget to drop the column that became the response df from the features df. This hands the answer to the model as an input, so it effectively memorizes the output instead of learning from the other columns (often called target leakage).
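
To make the features/response split concrete, here is a minimal sketch on a made-up toy dataframe. The column names (example, crust, filling, likes_pie) and values are my own invention, not the assignment’s, and the model-fitting step is left out. The only point is that every column derived from the response has to leave the features df.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative, not the assignment's.
df = pd.DataFrame({
    "example": [1, 2, 3, 4],
    "crust": ["flaky", "graham", "flaky", "graham"],
    "filling": ["apple", "cherry", "cherry", "apple"],
    "likes_pie": ["yes", "no", "yes", "no"],
})

# Drop the identifier column; it carries no predictive information.
df = df.drop(columns=["example"])

# One-hot encode everything (all remaining columns are categorical).
encoded = pd.get_dummies(df)

# The response was encoded too, so it now spans two dummy columns.
response_cols = [c for c in encoded.columns if c.startswith("likes_pie_")]

# Response: one of the dummy columns. Features: everything EXCEPT the response.
y = encoded["likes_pie_yes"].astype(int)
X = encoded.drop(columns=response_cols)

print(X.columns.tolist())
print(y.tolist())
```

If any likes_pie column still shows up in X.columns, the model is being given the answer as an input.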

Assignment 2: Brazil COVID Data

Description of Assignment: This assignment involves cleaning a large (500,000 row) COVID dataset used in a published study (Temperature significantly changes COVID-19 transmission in (sub)tropical cities of Brazil). You’ll learn how to read in Excel sheets, merge data, and otherwise flex your pandas skills.

General Strategy:

  • Read the data into a series of dataframes
  • Filter the data so you only have the data from the 27 capital cities
  • Create a days column that counts from 0 to 150 (151 rows per city, because the count starts at 0)
  • Create days_sq and days_cube columns by squaring and cubing the days column
  • Create pop (the city’s population), pop_sq and pop_density
  • Create temp by filtering the temp df by the data from the 27 capital cities and merging it
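
The filter-then-merge steps above can be sketched roughly like this. The dataframes, column names (city, date, cases, temp) and the two-item capitals list are placeholders I made up; the real sheets will look different, so treat this as a pattern rather than a solution.

```python
import pandas as pd

# Hypothetical stand-ins for the case data and the temperature sheet.
cases = pd.DataFrame({
    "city": ["A", "A", "B", "C"],
    "date": ["2020-03-01", "2020-03-02", "2020-03-01", "2020-03-01"],
    "cases": [1, 3, 2, 5],
})
temps = pd.DataFrame({
    "city": ["A", "A", "B", "C"],
    "date": ["2020-03-01", "2020-03-02", "2020-03-01", "2020-03-01"],
    "temp": [27.1, 26.8, 24.3, 30.0],
})
capitals = ["A", "B"]  # placeholder for the list of 27 capital cities

# Keep only the rows whose city appears in the list of capitals.
cases_cap = cases[cases["city"].isin(capitals)]
temps_cap = temps[temps["city"].isin(capitals)]

# Merge on the shared keys so each case row picks up its temperature.
merged = cases_cap.merge(temps_cap, on=["city", "date"], how="left")
print(merged)
```

Filtering both dataframes before merging keeps the merge small and makes it easy to check that the row count is what you expect.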

Most Common Error: A huge amount of effort is spent on trying to match the days to the cities, which adds all sorts of unnecessary complexity. This is where knowing the dataset comes in handy. Because the data is already sorted by date, you can build the day counter with list(range(151)), multiply that list by 27 to repeat it once per city, and then add the result to your df as the days column.
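
Here is a small sketch of that trick on a stand-in dataframe. It assumes the rows are grouped by city, sorted by date within each city, and that every city has exactly 151 rows; the city labels are placeholders. If any of those assumptions fails on your data, the counter will not line up.

```python
import pandas as pd

N_CITIES = 27   # capital cities, per the assignment description
N_DAYS = 151    # days 0 through 150

# Stand-in dataframe: 27 cities x 151 rows, grouped by city and sorted by date.
# The "city" labels are placeholders, not the real city names.
df = pd.DataFrame({
    "city": [f"city_{i}" for i in range(N_CITIES) for _ in range(N_DAYS)],
})

# Multiplying a list repeats it, so this yields 0..150 once per city.
df["days"] = list(range(N_DAYS)) * N_CITIES

# Derived columns are plain element-wise arithmetic on the days column.
df["days_sq"] = df["days"] ** 2
df["days_cube"] = df["days"] ** 3

print(df.iloc[[0, 150, 151]])
```

A quick check that the last row of one city shows 150 and the first row of the next city shows 0 confirms the counter restarted in the right place.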

Some Helpful Links:

  • Filtering a dataset based on values:
    https://cmdlinetips.com/2018/02/how-to-subset-pandas-dataframe-based-on-values-of-a-column/
  • Dropping rows that don’t meet certain values:
    https://stackoverflow.com/questions/52456874/drop-rows-on-multiple-conditions-in-pandas-dataframe
  • To see every row of your df, use pd.set_option('display.max_rows', None). To reset that and go back to the limited display of rows that pandas gives you by default, use pd.reset_option('display.max_rows')
    (https://pandas.pydata.org/docs/reference/api/pandas.reset_option.html)
  • Creating new columns (like days_sq) derived from others (like days):
    https://pandas.pydata.org/docs/getting_started/intro_tutorials/05_add_columns.html

Assignment 3: Multiple Linear Regression

Description of Assignment: This assignment involves creating a multiple linear regression model to recover the intercept and coefficients used to plot some data.

General Strategy:

  • Read the data into a df
  • Create the 4 plots using scatter3D
  • Train your linear regression model
  • Take x_fit and y_fit and put them into a df
  • Use your trained model to predict z_fit
  • Plot another set of 4 figures with plot3D (x_fit, y_fit, z_fit) and scatter3D (x, y, z) to add the line of best fit

Most Common Error: Some students try to “math their way out” of the programming in this assignment. You need to use the linear regression model to generate the line of best fit. Any other attempt will lead to errors, lost time and lost points.

Another important thing: you need to make sure your new df (with x_fit and y_fit) has the right number of dimensions, meaning two columns with one row per point. The easiest way to do this is by building it with labeled columns.
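
As a rough illustration of the shape issue, the sketch below trains on synthetic data from a plane I made up (z = 2 + 3x - 1.5y) and then predicts z_fit from a labeled two-column df. The data, the column names x, y, z and the ranges are all assumptions for the example, and the plotting is left out.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data from a known plane z = 2 + 3x - 1.5y (made-up values).
rng = np.random.default_rng(0)
train = pd.DataFrame({"x": rng.uniform(0, 10, 50), "y": rng.uniform(0, 10, 50)})
train["z"] = 2 + 3 * train["x"] - 1.5 * train["y"]

# Train on a two-column features df and a one-column response.
model = LinearRegression().fit(train[["x", "y"]], train["z"])

# x_fit and y_fit are 1-D arrays; the model expects a 2-D input with the
# same two columns (and the same labels) it was trained on.
x_fit = np.linspace(0, 10, 20)
y_fit = np.linspace(0, 10, 20)
fit_df = pd.DataFrame({"x": x_fit, "y": y_fit})

# The model, not hand-written algebra, produces the fitted z values.
z_fit = model.predict(fit_df)

print(fit_df.shape, z_fit.shape)
print(model.intercept_, model.coef_)
```

In this sketch, x_fit, y_fit and z_fit are the three arrays you would hand to plot3D for the line of best fit.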

Some Helpful Links:

  • Using GridSpec to create the subplots (https://matplotlib.org/3.1.1/tutorials/intermediate/gridspec.html)
  • Reshaping arrays (https://www.w3schools.com/python/numpy/numpy_array_reshape.asp)

Assignment 4: Custom Transformer

Description of Assignment: This is probably one of the hardest assignments for students to wrap their heads around because it’s so different from the others.

General Strategy:

  • Read the data into a df
  • Create your transformer
  • Create your numerical pipeline, which includes a simple mean imputer, the transformer you wrote, and StandardScaler
  • One-hot encode your categorical data
  • Put your numerical pipeline and one-hot-encoded data into a ColumnTransformer to create a full pipeline

Most Common Error: This assignment uses old-fashioned array data manipulation, which students haven’t worked with for a while. It’s easy to get complacent when working with dataframes, and it’s hard to remember that numpy arrays don’t have column labels, so you have to select columns by position.

Your transformer should have an init method, a fit method, and a transform method. Your fit method is essentially empty (it only needs to return self), and the bulk of the work will happen in your transform method.
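
To show how the pieces fit together, here is a minimal sketch with a transformer I invented for the purpose (ProductAdder, which appends the product of the first two columns). It is not the transformer from the assignment, and the toy columns height, width and color are hypothetical. What carries over is the three-method structure and the way the numerical pipeline and the one-hot encoder sit inside a ColumnTransformer.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class ProductAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: appends the product of the first two columns."""

    def __init__(self, add_product=True):
        self.add_product = add_product

    def fit(self, X, y=None):
        # Nothing to learn; return self so the pipeline can chain calls.
        return self

    def transform(self, X):
        # X arrives as a numpy array here (the imputer outputs one),
        # so columns are selected by position, not by label.
        if not self.add_product:
            return X
        product = X[:, 0] * X[:, 1]
        return np.c_[X, product]


# Toy data with missing values; column names are made up.
df = pd.DataFrame({
    "height": [1.0, 2.0, np.nan, 4.0],
    "width": [2.0, np.nan, 3.0, 5.0],
    "color": ["red", "blue", "red", "green"],
})

# Numerical pipeline: mean imputer, then the custom transformer, then scaling.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("adder", ProductAdder()),
    ("scaler", StandardScaler()),
])

# Full pipeline: numerical columns and one-hot-encoded categorical columns.
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, ["height", "width"]),
    ("cat", OneHotEncoder(), ["color"]),
])

prepared = full_pipeline.fit_transform(df)
print(prepared.shape)
```

Inheriting from BaseEstimator and TransformerMixin is one common way to get fit_transform and parameter handling for free; your course materials may ask for a different structure.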

Some Helpful Links:

  • Creating a Custom Transformer in Python (https://www.section.io/engineering-education/custom-transformer/) – this article is useful because it includes the structure of the 3 functions in Python
  • How to One Hot Encode data in Python (https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/)

Assignment 5: Scikit-learn Principles

Description of Assignment: This is a short writing assignment on the principles of scikit-learn, using references provided by the professor.

General Strategy: Read the articles, write the summary!

Most Common Error: N/A. Most students do fine with this one.

Some Helpful Links:

  • A review of the design principles of scikit-learn (https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html#Scikit-Learn’s-Estimator-API)

Assignment 6: Classification System Metrics

Description of Assignment: This assignment has you learn the functions for accuracy, model error rate, precision and recall, Fbeta, True Positive Rate (TPR) and False Positive Rate (FPR), and then create a function that creates a Receiver Operating Characteristic (ROC) curve.

General Strategy:

  • This one is straightforward. When you’re told to build the accuracy function, you build it as given! This tests your Python skills, especially function writing and understanding loops.

Most Common Error: The biggest issue students have is around the ROC_curve_computer function. Your previous functions will be used in this one, so you don’t need to duplicate them. Remember that you’re making a prediction, determining the TPR/FPR, and then outputting the list of TPRs and FPRs. It’s the TPR and FPR lists that you will be plotting.
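
The flow described above (predict, compute TPR/FPR, collect the lists) can be sketched as follows. The function names, signatures, scores and thresholds here are all hypothetical and will not match the assignment’s ROC_curve_computer; the sketch only shows how the earlier metric functions get reused inside the loop.

```python
def true_positive_rate(y_true, y_pred):
    """TPR = TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


def false_positive_rate(y_true, y_pred):
    """FPR = FP / (FP + TN)."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn)


def roc_points(y_true, scores, thresholds):
    """Return parallel lists of TPRs and FPRs, one pair per threshold."""
    tprs, fprs = [], []
    for threshold in thresholds:
        # Step 1: turn scores into hard predictions at this threshold.
        y_pred = [1 if s >= threshold else 0 for s in scores]
        # Step 2: reuse the metric functions instead of rewriting them.
        tprs.append(true_positive_rate(y_true, y_pred))
        fprs.append(false_positive_rate(y_true, y_pred))
    return tprs, fprs


# Made-up labels and scores, just to exercise the functions.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
tprs, fprs = roc_points(y_true, scores, thresholds=[0.0, 0.5, 1.0])
print(tprs)  # [1.0, 0.5, 0.0]
print(fprs)  # [1.0, 0.0, 0.0]
```

To draw the curve you would plot fprs on the x-axis against tprs on the y-axis.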

Some Helpful Links:

  • Understanding the ROC (https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc)
  • Writing Functions in Python (https://www.w3schools.com/python/python_functions.asp)

Assignment 7: Polynomial Regression I

Description of Assignment: This assignment is very similar to Assignment 3, except you’ll be plotting a polynomial regression instead of a linear regression.

General Strategy:

  • Reuse your Assignment 3 code, and add in the PolynomialFeatures transformation!

Most Common Error: The biggest issue here comes up in the second plot. You plot the initial x, y and z data, run a PolynomialFeatures fit and transform on x and y, and train your LinearRegression model on the transformed data. When you get to creating the second image (to draw the line of best fit), you’re given x_fit and y_fit and need to predict z_fit. You have to put x_fit and y_fit through the same transformation you applied to x and y before calling predict; otherwise the model receives the wrong number of features.
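
Here is a minimal sketch of that pattern. The synthetic surface (z = 1 + 2x² - xy), the degree of 2, the include_bias=False setting and the column names are all my own assumptions rather than details from the assignment. The part that matters is that the already-fitted transformer is reused on the fit data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data from a made-up quadratic surface z = 1 + 2x^2 - xy.
rng = np.random.default_rng(0)
train = pd.DataFrame({"x": rng.uniform(-3, 3, 60), "y": rng.uniform(-3, 3, 60)})
train["z"] = 1 + 2 * train["x"] ** 2 - train["x"] * train["y"]

# Fit the transformer on the training features, then train on its output.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(train[["x", "y"]])
model = LinearRegression().fit(X_poly, train["z"])

# Build the fit df with the same column labels as the training features.
x_fit = np.linspace(-3, 3, 25)
y_fit = np.linspace(-3, 3, 25)
fit_df = pd.DataFrame({"x": x_fit, "y": y_fit})

# transform (not fit_transform): reuse the transformer fitted above so the
# fit data gets exactly the same polynomial columns as the training data.
fit_poly = poly.transform(fit_df)
z_fit = model.predict(fit_poly)

print(X_poly.shape, fit_poly.shape, z_fit.shape)
```

Passing fit_df straight to model.predict would fail here, because the model was trained on five polynomial columns and fit_df only has two.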

Otherwise, this is very similar to assignment 3.

Some Helpful Links:

  • How to use Polynomial Features (https://machinelearningmastery.com/polynomial-features-transforms-for-machine-learning/)

Assignment 8: Polynomial Regression II

Description of Assignment: This is very similar to Assignment 7. So similar that you’ll be fine as long as you reviewed Assignment 7.

General Strategy: Exactly the same as Assignment 7.

Most Common Error: N/A
