Introduction
Data science and data analysis have been interests of mine for a while. Combining statistics, computer programming, and domain knowledge into one applied field ticks a lot of boxes for me. I learn best in a structured environment, though, and I'd been looking for something to help me build my skills.
I previously earned my Data Analyst certificate from DataCamp, but I found that DataCamp provided too much scaffolding. When I returned to Python a few months after finishing it, I found I had forgotten a lot of the syntax and even basic methods, because DataCamp had provided 80% of the sample code and had me fill in just small bits, even during the advanced parts of the course.
I recently stumbled upon UpLevel, which promised a different approach.
What is UpLevel?
UpLevel is a data science course provider similar to DataCamp, DataQuest, and numerous boot camps: you complete projects based on verified datasets, and in doing so build your skills.
But UpLevel has a major difference: they don't give you scaffolding. Instead, each course provides pseudocode descriptions of the tasks you need to accomplish and has you consult reference documentation to figure out the syntax yourself. If you get stuck, there are references along the way.
Getting Started
After reviewing the material on UpLevel's website I was pretty excited. (Note: UpLevel appears very new, and the data science projects part of the company was so hard to find when I returned to it to write this article that I had to dig the link out of my email. Make sure you bookmark it.) The other main components are a recruitment arm for companies and a data science blog.
This was an opportunity for me to continue building my data science skills in a way that I felt would fix the major gap in my skills so far.
UpLevel has a sale going on right now: each course is normally $30, and a monthly subscription is a slight discount at $25, but I got a major deal at $14.99 per month. Each month I get a code to access a new course, of which there are currently 17, including one that became available for pre-order between the time I subscribed and the time I wrote this article.
Because I already have a working knowledge of Python and some basic data science principles, I opted for an Intermediate project. The one that really caught my eye was Identifying Mental Health Factors and Predicting Depression. As I spent years volunteering and then working for a crisis line, I've always been intrigued by the possibility of using analytics to identify people at risk.
About 30 minutes after I subscribed and my payment went through, I was emailed a discount code. To claim my lesson, I went through checkout for the course I wanted and entered that code, which deducted 100% of the price. Simple and elegant. After checkout, the materials were immediately available to download.
Getting Into the Lessons
I already had Anaconda3 installed on my computer (which includes Spyder, a Python IDE, as well as Jupyter and several other tools I'm not familiar with). This is important, because UpLevel courses come in the form of Jupyter notebooks.
If you don’t have this software already, you’ll need it.
These notebook files are used by professional data scientists as well as hobbyists, and can contain entire data projects, so they can be easily emailed, uploaded to GitHub, or otherwise moved around.
Part I. Data Cleaning
The first Jupyter notebook (Part I) looks like this:
Expanding on the brief project scenario presented on the website, the notebook frames the work with a more "applied," job-like scenario:
The notebook's first steps recommend opening up the questionnaires and provide the actual study the data come from. Nice touch! Opening up the questionnaire, I immediately recognized the questions. The depression questions are part of the PHQ-9, a standardized depression and suicide screener that I used at Morneau Shepell. Even though the students in the study are Japanese, it was a good choice because the tools it uses are standard.
Continuing through the notebook, I reached the pseudocode pieces I talked about previously:
I worked through them fairly quickly because I was familiar with these pieces, mostly stopping just to refresh myself on specific syntax or steps (how do I use iloc to slice data again?).
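For anyone else who is rusty on that point, here's a quick reminder of how iloc slices a dataframe by integer position; the dataframe below is made up purely for illustration:

```python
import pandas as pd

# A small made-up dataframe for illustration
df = pd.DataFrame({"age": [21, 24, 19, 30], "score": [8, 12, 5, 9]})

first_three_rows = df.iloc[0:3]   # rows 0-2, all columns
first_column = df.iloc[:, 0]      # all rows, first column only
corner = df.iloc[0:2, 0:2]        # top-left 2x2 block
single_value = df.iloc[1, 1]      # row 1, column 1 -> 12
```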
Part II. EDA and Hypothesis Testing
I finished Part I in perhaps 90 minutes, so I decided to get started on Part II. By the end of Part I, I had imported the dataframe into Python, dropped rows that were in the original data set by error, imputed missing values in one of the remaining columns with the median, and then exported the dataframe back to CSV format to confirm that it all looked right.
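A rough sketch of that cleaning workflow might look like the following; the filename, column names, and drop condition are all placeholders, not the course's actual code:

```python
import pandas as pd

# Load the raw survey data (filename is a placeholder)
df = pd.read_csv("survey_data.csv")

# Drop rows that were included in error (condition is hypothetical)
df = df[df["inter_dom"].notna()]

# Impute missing values in one column with that column's median
df["age"] = df["age"].fillna(df["age"].median())

# Export back to CSV to eyeball the result
df.to_csv("survey_data_clean.csv", index=False)
```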
Part II involved splitting the dataframe in two, one for the numerical values and one for the categorical (string) data, then plotting them as histograms and countplots and looking at the correlations between them.
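As a sketch of what that split might look like in pandas and seaborn (the filename and the column name passed to the countplot are hypothetical):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("survey_data_clean.csv")  # placeholder filename

# Split by dtype: numbers in one frame, strings in the other
numeric_df = df.select_dtypes(include="number")
categorical_df = df.select_dtypes(include="object")

# Histograms for every numerical column
numeric_df.hist(figsize=(10, 8))
plt.show()

# A countplot for one categorical column (column name is hypothetical)
sns.countplot(x="gender", data=categorical_df)
plt.show()

# Correlations between the numerical columns
print(numeric_df.corr())
```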
Part II is more challenging because you're asked to make, for example, a boxplot, but you're never shown what a "successful" plot looks like. I only realized I was selecting the wrong dataframe by accident. Screenshots of the expected output (instead of a black box with a question mark indicating where the output will go) would help significantly.
I also had this issue with the t-test. It wasn't obvious whether I had calculated the value successfully, because I wasn't given the "answer" of the right p-value. Providing the end results, without the code used to generate them, would really help.
Another note: when instructing you to do a t-test, the notebook says that "If you do this right, you'll see that the pvalue is larger than 0.05, which means the means of the two groups are the same". That is definitely not how p-values work!
If our hypothesis is that there is a statistically significant difference between the level of depression in domestic versus international students, then the null hypothesis is that there is no difference.
The mean of the total depression score for international students is 8.04. The mean of the total depression score for domestic students is 8.61. If we want to know if there is a statistically significant difference, we can run a t-test.
The t-test tells us how likely it is that we would see a difference this large (or larger) just by chance if the null hypothesis were true. Usually we set the significance threshold at 0.05 or 0.01 (a 5% or 1% chance). If our p-value is below 0.05, we reject the null hypothesis. If it's above 0.05, we fail to reject the null hypothesis, which means the results are not statistically significant.
It sounds like a nitpick, but it's very important that we get it right. In this course, the p-value is 0.41, so we fail to reject the null hypothesis: we can't conclude the groups differ, but that is not the same as proving their means are equal.
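For reference, here's roughly how such a two-sample t-test can be run with scipy; the filename, column names, and group labels are my guesses at the data's structure, not the course's actual code:

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("survey_data_clean.csv")  # placeholder filename

# Split total depression scores by student type (column names hypothetical)
international = df.loc[df["inter_dom"] == "Inter", "todep"]
domestic = df.loc[df["inter_dom"] == "Dom", "todep"]

t_stat, p_value = stats.ttest_ind(international, domestic)

if p_value < 0.05:
    print(f"p = {p_value:.2f}: reject the null hypothesis")
else:
    print(f"p = {p_value:.2f}: fail to reject the null hypothesis")
```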
Later, we do more boxplots and more t-tests, and calculate a chi-square test of the relationship between suicidality and religiosity. In this step, it was helpful to have some of the output to compare against, so I could confirm my calculation was correct.
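A chi-square test of independence between two categorical columns can be sketched like this (again, the column names are hypothetical):

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("survey_data_clean.csv")  # placeholder filename

# Build a contingency table of suicidality vs. religiosity
# (column names are hypothetical)
table = pd.crosstab(df["suicide"], df["religion"])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")
```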
Part III. Data Coding
Part III is shorter than the previous two parts, and focuses primarily on preparing the data for machine learning. In fact, since it repeated work from the earlier lessons (and I still had my dataframes from those steps available), it went very quickly.
The end result of Part III is a "coded" dataframe, where the different factors (for example, gender) are replaced with numerical values, such as male = 0 and female = 1.
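A minimal sketch of that kind of coding in pandas, with hypothetical column names and labels:

```python
import pandas as pd

df = pd.read_csv("survey_data_clean.csv")  # placeholder filename

# Replace categorical labels with numbers, e.g. male = 0, female = 1
# (column name and label spellings are hypothetical)
df["gender"] = df["gender"].map({"Male": 0, "Female": 1})

# The same idea applies to any other categorical column; pandas can
# also one-hot encode a column in a single call:
# df = pd.get_dummies(df, columns=["religion"])
```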
Part IV. Machine Learning
Part IV focuses on machine learning. While I do have an okay grasp of other statistical and data science concepts, machine learning (which some people consider to be the only true data science, not that I agree with that) is something I haven’t been exposed to much.
Part IV is by far the longest part of the workbook: Part I consists of 9 steps, Part II of 16, and Part III of another 9, but Part IV has 27 steps!
Part IV involves the scikit-learn library, which I was familiar with by name but, in contrast to the others, hadn't actually used at all. Luckily, the notebook includes a few resources.
Continuing with the theme of providing almost enough help, I found Step 6 to contain far too little detail:
Perhaps the solution here would be to rate each lesson's difficulty on two axes: data science difficulty and machine learning difficulty. It's possible for the hypothesis testing, EDA, and programming to be easy while the machine learning is more difficult.
Essentially, the lesson has you split your data into training and test groups, and then set up six different machine learning algorithms, including:
- DummyRegressor (which does nothing more than predict one constant Y output for all of your X inputs, and is used to establish a baseline)
- Linear Regression
- Decision Tree
- Random Forest
It then has you do a bit of manipulation of your variables. Unfortunately, the lack of detail here means this part of the workbook is not very useful, despite being the longest. With no sample code, specific steps, or even examples of the outputs, it's impossible to know if you're on the right track.
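For readers who want a starting point anyway, here is a minimal sketch of that workflow in scikit-learn. The filename, feature names, and target column are invented, and I've only included four of the six models the lesson sets up:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

df = pd.read_csv("survey_data_coded.csv")  # placeholder filename

# X holds the coded features, y the total depression score
# (column names are hypothetical)
X = df.drop(columns=["todep"])
y = df["todep"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "baseline": DummyRegressor(strategy="mean"),  # predicts the mean for every input
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=42),
    "forest": RandomForestRegressor(random_state=42),
}

# Fit each model and compare test error against the dummy baseline
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(name, mean_squared_error(y_test, preds))
```

The point of the DummyRegressor is exactly what the list above describes: any real model should at least beat the error of a constant prediction, or something has gone wrong.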
This part of the course was the most disappointing of the bunch. Finding out that the promised Telegram group didn't exist, and that the Facebook page is just a regular business page with no chat feature, didn't help.
Conclusion
I'm fairly satisfied with UpLevel, with two suggestions. First, a little additional detail to confirm that the calculations I've made are correct (or perhaps the option to unlock the solution) would be a great addition. Second, it's important to have opportunities for students to connect with each other.
Despite these minor issues, I’m looking forward to next month, and might even buy a course early to keep my momentum going.
When searching for a way to describe it, I'd say it's almost like an asynchronous internship. You have an expert "looking over your shoulder," suggesting different ways to accomplish each task toward an agreed-upon goal, but also giving you the freedom to have a go at it yourself.
I hope UpLevel continues to improve their subscriptions.
Happy learning!