I am now ABC (all but the capstone) on Johns Hopkins Coursera Data Science Specialization. I now only need to complete the capstone project to finish the specialization. That should happen sometime this fall, depending on when they offer the capstone for the first time.
I am not sure how typical a student I was for this program. I currently work as a data scientist, have a decent background in AI, have a number of publications, and am currently completing a PhD in computer science. So, a logical question is: what did I want from this program?
- I have not done a great deal of R programming, and this program focused heavily on R. I view this as both a strength and a weakness. I am mostly a Java/Python/C#/C++ guy, and I found the R instruction very useful.
- I've focused mainly on AI/machine learning, and I hoped this program would fill in some gaps.
In a nutshell, here are my opinions.
Pros: Very practical, real-world data sets. Experience with both black-box (machine learning) and more explainable (regression model) systems. An introduction to Slidify and Shiny; I've already used both in my day-to-day job. It takes some real work and understanding to make it through this program. The last three courses rocked!
Cons: Peer review is really hit or miss; more on this later. Some lecture material (statistical inference) was sub-par compared to Khan Academy. Only reinvent the wheel if you are going to make a better wheel.
Here are my quick opinions on some of these courses. I've posted more on some of them as I took them (see past blog posts here, here, and here).
- The Data Scientist’s Toolbox: Basically: can you install R and RStudio, and use GitHub? I had already done all three, so I got little from this course. If you have not dealt with R, RStudio, and GitHub, this class will be a nice, slow intro to the program.
- R Programming: I enjoyed this course! It was hard, considering I was taking courses #1 and #3 at the same time. If you have no programming experience, this course will be really hard! Be ready to supplement the instruction with lots of Google searching.
- Getting and Cleaning Data: Data wrangling is an important part of data science, and getting data into a tidy format matters. This course used quite a bit of R programming. For me, not being an R programmer and taking course #2 at the same time meant extra work. If you are NOT already an advanced programmer, DO NOT take #2 and #3 at the same time.
- Exploratory Data Analysis: This was a valuable class; it taught you all about the R graphing packages.
- Reproducible Research: Valuable course! Learning about R Markdown was very useful. I am already using it in one of my books to make sure that several of my examples are reproducible, by providing an RMD script that produces all the charts from the book.
- Statistical Inference: This was an odd class. I already knew statistical inference and did quite well despite watching hardly any of the lectures. I don't believe this course made many people happy. Either you already knew the topic and were bored, or you were completely lost trying to learn statistics for the first time. There are several Khan Academy videos that cover all the material in this course. Why does Hopkins need to reproduce this? Is this not the point of a MOOC? Why not link to the Khan Academy videos and test the students? Best of both worlds! Also, 90% of the material was not used in the rest of the program, so I suspect many students were left wondering what this course is for.
- Regression Models: Great course; this is the explainable counterpart to machine learning. You are introduced to linear regression and GLMs. This course was set up as the perfect counterpart to #8. My only beef with this course was that I got screwed by peer review. More on this later.
- Practical Machine Learning: Great course. It showed some of the most current model types in data science: Gradient Boosting Machines (GBMs) and Random Forests. There was also a great description of boosting, and an awesome Kaggle-like assignment where you submitted results from your model to see if you could match a "hidden" data set.
- Developing Data Products: Great course. I really enjoyed playing with Shiny, and even used it for one of the examples in my upcoming book. You can see my Shiny project here: https://jeffheaton.shinyapps.io/shiny/ and a writeup here: http://slidify.org/publish.html. They encouraged you to post these to GitHub and public sites, so I assume I am not violating anything by posting here. Don't plagiarize me!!!
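A few quick tastes of the tools the courses above cover, for anyone who has not seen them. First, R Markdown (from Reproducible Research): a minimal RMD file looks something like this, and knitting it re-runs the code and regenerates the output, which is what makes an analysis reproducible. This is just a sketch using R's built-in mtcars data, not anything from the course assignments:

````
---
title: "Reproducible Example"
output: html_document
---

The mean MPG in R's built-in mtcars data:

```{r}
mean(mtcars$mpg)
plot(mtcars$wt, mtcars$mpg)
```
````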
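Second, the kind of model the Regression Models course teaches. A linear model and a GLM in R are only a few lines each; the variable choices here are just for illustration, again using the built-in mtcars data:

```r
# Linear regression: predict MPG from weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)

# Logistic regression (a GLM): predict transmission type from the same inputs
logit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(logit)
```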
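Finally, the caret package that Practical Machine Learning is built around. A random forest fit looks roughly like this (assuming the caret and randomForest packages are installed; the built-in iris data stands in for the course data sets):

```r
library(caret)

# Hold out 30% of iris as a test set
set.seed(42)
inTrain  <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
training <- iris[inTrain, ]
testing  <- iris[-inTrain, ]

# Fit a random forest; method = "gbm" would give a gradient boosting machine instead
fit <- train(Species ~ ., data = training, method = "rf")

# Check accuracy on the held-out data
confusionMatrix(predict(fit, testing), testing$Species)
```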
Peer Reviewed Grading
If you are not familiar with peer-review grading, here is how it works. For each project you are given four peers who review and grade your assignment. This is mostly double-blind, as neither the student nor the reviewer knows the other. However, I used my regular GitHub account on all assignments, so it was pretty obvious who I was. I was even emailed by a grader once who recognized me from my open source projects. Your grade is an average of what those four people gave you. At $49 a course, maybe this is the only way they can afford grading. I currently spend nearly 100 times that for each of my PhD courses.
Overall, peer-review grading worked well for me in all courses but one. Here are some of my concerns about peer grading.
- You probably have many graders who are pressed for time and just give high marks without much thought (just a guess/opinion).
- You are going to be graded by people who may not have gotten the question right in the first place.
- You are instructed NOT to run the R program. So now I am being graded on someone's ability to mentally compile and execute my program?
- Each peer is going to apply different standards. You could get radically different marks depending on who your four peers were.
So here is my story of the one case where peer review did not work for me. I scored in the upper 98-99% range on most of these courses; the exception was Regression Models. I had good scores going into the final project. However, my peers knocked me down for these reasons:
- Two of my peers could not download my file from Coursera, yet the other two had no problem. Fine, so I get a zero because someone's ISP was flaking out.
- Two of my peers did not give me credit because they felt I had not used RMD for my report (which I had). Fine, so I lose a fair amount of points because two random peers did not know what RMD output looks like.
This took a toll on my grade, but I still passed. It is, however, the one course I did not get "with distinction" on. Yeah, big deal. In the grand scheme of things I don't really care; it was just mildly annoying. However, if you are hovering near 70% and you get one or two bad reviewers, you are probably toast.
Great program. It won't make you a star data scientist, but it will give you a great foundation to build from. Kaggle might be a good next step. Another might be starting a blog and doing some real, interesting data science to showcase your skills! This is somewhat how I got into data science.