Machine Learning 101 Presentation

Today I gave a presentation for the St. Louis Microsoft Business Intelligence Users Group (BI). This post contains all of the links that I discussed in the presentation.  Even if you did not attend the presentation you can download a PDF of my slides.

My Thoughts on Courses 1-9 of the Johns Hopkins Coursera Data Science Specialization

I am now ABC (all but the capstone) on the Johns Hopkins Coursera Data Science Specialization.  I now only need to complete the capstone project to finish the specialization.  That should happen sometime this fall, depending on when they offer the capstone for the first time.

I am not sure how typical of a student I was for this program.  I currently work as a data scientist, have a decent background in AI, have a number of publications, and am currently completing a PhD in computer science.  So, a logical question is what did I want from this program?

  1. I have not done a great deal of R programming.  This program focused heavily on R.  I view this as both a strength and weakness of the program.  I am mostly a Java/Python/C#/C++ guy.  I found the R instruction very useful.
  2. I've focused mainly on AI/machine learning, and I hoped this program would fill in some gaps.
  3. Curiosity.

In a nutshell, here are my opinions.

Pros: Very practical real-world data sets.  Experience with both black-box (machine learning) and more explainable (regression models) systems.  Introduction to Slidify and Shiny; I've already used both in my day-to-day job.  It takes some real work and understanding to make it through this program.  The last three courses rocked!

Cons: Peer review is really hit or miss.  More on this later.  Some lecture material was sub-par (statistical inference) compared to Khan Academy.  Only reinvent the wheel if you are going to make a better wheel.

Course Breakdown

Here are my quick opinions on some of these courses.  I've posted more on some of these as I took them (see past blog posts here here and here).

  1. The Data Scientist’s Toolbox: Basically, can you install R, RStudio and use GitHub.  I've already done all three so I got little from this course.  If you have not dealt with R, RStudio and GitHub this class will be a nice slow intro to the program.
  2. R Programming: I enjoyed this course!  It was hard, considering I was taking classes #1 and #3 at the same time.  If you have no programming experience, this course will be really hard!  Be ready to supplement the instruction with lots of Google searching.
  3. Getting and Cleaning Data: Data wrangling is an important part of data science.  Getting data into a tidy format is important.  This course used quite a bit of R programming.  For me, not being an R programmer and taking course #2 at the same time meant extra work.  If you are NOT already an advanced programmer, DO NOT take #2 and #3 at the same time.
  4. Exploratory Data Analysis: This was a valuable class.  It taught you all about the R graphing packages.
  5. Reproducible Research:  Valuable course!  Learning about R Markdown was very useful.  I am already using it in one of my books to make sure that several of my examples are reproducible, by providing an RMD script that produces all the charts from my book.
  6. Statistical Inference:  This was an odd class.  I already knew statistical inference and did quite well despite not watching any lectures (hardly).  I don't believe this course made anyone happy (hardly).  Either you already knew the topic and were bored, or you were completely lost trying to learn statistics for the first time.  There are several Khan Academy videos that cover all the material in this course.  Why does Hopkins need to reproduce this?  Is this not the point of a MOOC?  Why not link to the Khan Academy videos and test the students?  Best of both worlds!  Also, 90% of the material was not used in the rest of the program, so I suspect many students might have been left wondering what this course is for.
  7. Regression Models: Great course.  This is the explainable counterpart to machine learning.  You are introduced to linear regression and GLMs.  This course was set up as the perfect counterpart to #8. My only beef with this course was that I got screwed by peer review.  More on this later.
  8. Practical Machine Learning: Great course.  This course showed some of the most current model types in data science: Gradient Boosting Machines (GBMs) and Random Forests.  Also a great description of boosting and an awesome Kaggle-like assignment where you submitted results from your model to see if you could match a "hidden data set".
  9. Developing Data Products: Great course.  I really enjoyed playing with shiny, and even used it for one of the examples in my upcoming book.  You can see my shiny project here. https://jeffheaton.shinyapps.io/shiny/  and writeup here.  http://slidify.org/publish.html  They encouraged you to post these to GitHub and public sites, so I assume I am not violating anything by posting here.  Don't plagiarize me!!!

Peer Reviewed Grading

If you are not familiar with peer review grading, here is how it works.  For each project you are given four counterparts that review and grade your assignment.  This is mostly double-blind, as neither the student nor the reviewer knows the other.  I used my regular GitHub account on all assignments.  So it was pretty obvious who I was.  I was even emailed by a grader once who recognized me from my open source projects.  Your grade is an average of what those four people gave you.  At $49 a course maybe this is the only way they can afford to grade.  I currently spend nearly 100 times that for each of my PhD courses. :(

Overall, peer review grading worked well for me in all courses but one.  Here are some of my concerns on peer grading.

  1. You probably have many graders who are pressed for time and just give high marks without much thought.  (just a guess/opinion)
  2. You are going to be graded by people who may not have gotten the question right in the first place.
  3. You are instructed NOT to run the R program.  So now I am being graded on someone's ability to mentally compile and execute my program?
  4. Each peer is going to apply different standards.  You could get radically different marks depending on who your four peers were.

So here is my story in the one case where peer review did not work for me.  I was in the 98-99% range on most of these courses.  Except for course #8.  I had good scores going into the final project.  However, my peers knocked me for these reasons:

  • Two of my peers could not download my file from Coursera.  Yet the other two had no problem.  Fine, so I get a zero because someone's ISP was flaking out.
  • Two of my peers did not give me credit because they felt I had not used RMD for my report?? (which I had) Fine, so I lose a fair amount of points because two random peers did not know what RMD output looks like.

This took a toll on my grade, though I still passed.  But this is the one course I did not get "with distinction" on.  Yeah, big deal.  In the grand scheme of things I don't really care.  Just mildly annoying.  However, if you are hovering near a 70%, and you get one or two bad reviewers, you are probably toast.
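To put rough numbers on that last point, here is a minimal sketch of the averaging.  The scores are made up for illustration, and I am assuming the project score is a plain mean of the four peer scores and looking at it in isolation (the real course grade also included other work).  The function name and the 70 pass mark constant are just for this example.

```python
from statistics import mean

PASS_MARK = 70  # the "hovering near a 70%" threshold mentioned above


def peer_grade(scores):
    """Average the peer scores (assumed: a simple unweighted mean)."""
    return mean(scores)


# Hypothetical student whose work is honestly worth about 75%.
fair_reviews = [75, 75, 75, 75]
# Same work, but one reviewer could not open the file and gave a zero.
one_bad_review = [75, 75, 75, 0]

print(peer_grade(fair_reviews))                # 75
print(peer_grade(one_bad_review))              # 56.25
print(peer_grade(one_bad_review) < PASS_MARK)  # True
```

With only four graders, a single zero costs a quarter of the score, so a borderline student has very little margin.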

Conclusions

Great program.  It won't make you a star data scientist, but it will give you a great foundation to go from.  Kaggle might be a good next step.  Another might be starting a blog and doing some real, interesting data science to showcase your skills!  This is somewhat how I got into data science.

Spoke at the 2014 Health Actuaries Conference in San Francisco


I was an invited speaker at this year's Society of Actuaries (SOA) Health 2014 conference in San Francisco.  I presented on the topic of Deep Learning and Unstructured Data.  I've been working with both topics for nearly a year now. I make use of deep learning through the Theano framework for Python.  I hope to add deep learning support to Encog at some point.  It was fun to meet with other data scientists in the insurance industry.  I left with a greater appreciation for several model types that I've not previously worked with, such as Gradient Boosting Machines (GBM).  I also wrote an article on deep learning for the SOA Forecasting and Futurism Section.  You can find the PDF here, the July 2014 edition.

 

50% Done with Johns Hopkins COURSERA Data Science Specialization, My Review of Classes 4&5

I am 50% done with the Johns Hopkins COURSERA Data Science Specialization, and continuing onward.  I took courses 1 through 3 in a single batch, and found them useful.  You can read my review of courses 1 through 3 here.  I decided not to take three classes in parallel again.  I am learning the R programming language as I go, so these courses can take some time.  I plan on sticking to two courses at a time until I complete the 9 courses.  This will allow me to tackle the final capstone project sometime this fall.

I am taking the signature track.  This means that COURSERA issues me a verification certificate for each course.  You can see my certificates for these two courses here.

My grades were high enough that I earned the "with distinction" note.  This has been the case with all five classes so far.  I hope that I am not giving the impression that these courses are easy.  Short of loading data (that I had preprocessed elsewhere) into R for modeling, I had not really dealt with R much.  My goals for this class are twofold.  First, to finally sit down and really learn R.  My second goal is to see more application of data science in new domains.  As a data scientist, I work primarily with life insurance industry data.

Both of the current two classes really fit the bill for this.  Both give me some insights into how data science is used in academic, peer-reviewed research.  Given that I am entering a distance learning Computer Science PhD program, this is very good exposure for me.

Review of Exploratory Data Analysis

Exploratory data analysis taught me all about charting in R.  There are three central charting packages for R.  Each has its uses, and this class taught all three.  In the world of R charting there is the base charting, lattice charting and grammar of graphics.  These packages all have their place.  The ggplot2 package (grammar of graphics) attempts to address some of the issues in earlier charting packages.

The entire point of exploratory data analysis is to produce quick charts to allow you to get a handle on what the data look like.  These are not formal graphics that you will include in your report, though they might become that.

Review of Reproducible Research

If you've read many academic articles, you know how hard it can be to reproduce exactly what the researcher is talking about.  The goal of this course is to automate all numerical analysis into a simple R script.  This allows the reader to know exactly where numbers came from, and what the process is.  Many academic papers lack any sort of direction.

Conclusions

The next course for me is Practical Machine Learning.  I am taking this course by itself.  I only have two courses to finish before the capstone.  This would allow me to complete the program (but not the capstone) by September.  In September, I start a PhD in computer science.  Because of this I want to keep the schedule clear for the fall.

Signed up for fall 2014 semester at NSU for CS PhD

I am now signed up for my fall classes at Nova Southeastern University.  This is my first semester in the Computer Science Ph.D. program.  I am taking 8 credit hours.  This consists of two 4 credit hour courses.  In August I will travel to their campus in Ft. Lauderdale, Florida. The distance format requires me to travel to their campus twice a semester. I am looking forward to my first semester, and learning how their distance PhD program works.

I signed up for the following courses.

These are the books that I will use for the first semester.

[Image: NSU books, fall 2014]

I already owned Stuart Russell and Peter Norvig's AIMA book that will be used in CISD 760, and am somewhat familiar with it.  This book is absolutely a classic in the AI field, and I am very glad that it will be the course book for my class.

I am also looking forward to the database systems class.  I am not familiar with the textbook, but it seems to cover quite a few topics.

I am looking forward to the semester.  I am sure both courses will be a fair amount of work.  I timed the writing for volume 2 of AIFH to complete just before courses begin.  I will post more about the semester as it begins and I see exactly what I've gotten myself into. :)

 

Review of Traveling Salesman (the Movie)

This weekend I watched Traveling Salesman the Movie.  Overall, I was not that impressed.  I came in with very high hopes. The name of the movie comes from the fact that the Traveling Salesman is a prime example of an NP-hard problem.  I found it interesting that nowhere in the movie was the Traveling Salesman Problem (TSP) actually explained, let alone much of a definition of NP-hard. I am not entirely sure the words "traveling salesman" even came up once in the movie.  There was a brief mention of the knapsack problem.  This makes it a bit difficult to follow for someone without a computer science background.  I watched the movie with my wife, and needed to explain TSP and NP-hard.
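For anyone in the same boat, here is the short version of TSP: given a set of cities, find the shortest round trip that visits every city exactly once and returns to the start.  The sketch below is a brute-force solver that simply tries every ordering.  It is only an illustration, not anything from the movie; the function names and the four sample cities are made up for this example.

```python
from itertools import permutations
from math import dist


def tour_length(order, cities):
    """Length of the closed tour that visits cities in the given order."""
    # Pair each city with the next one, wrapping around to the start.
    legs = zip(order, order[1:] + order[:1])
    return sum(dist(cities[a], cities[b]) for a, b in legs)


def brute_force_tsp(cities):
    """Try every tour that starts at city 0 and keep the shortest."""
    start, *rest = range(len(cities))
    candidates = ((start, *p) for p in permutations(rest))
    best = min(candidates, key=lambda order: tour_length(order, cities))
    return list(best), tour_length(best, cities)


# Hypothetical cities on the corners of a 4-by-3 rectangle.
cities = [(0, 0), (0, 3), (4, 3), (4, 0)]
order, length = brute_force_tsp(cities)
print(order, length)  # [0, 1, 2, 3] 14.0
```

The catch is the number of orderings: with n cities there are (n-1)! tours to check, so this approach is hopeless well before n reaches 20.  No algorithm is known that solves the general problem in polynomial time, which is where the P vs NP question comes in.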

What the movie is actually about is not evident from the title.  "P vs NP" is not the centerpiece of the plot.  P vs NP is simply a stand-in for any great technical advance that a government might want to control.  Strong AI, Cold Fusion, Human Cloning or FTL travel could have filled this role.  Any technical advance so great that it will alter the course of history, most likely to the advantage of the nation that wields it.  Many obvious comparisons to the Manhattan Project were made by the movie.

I really had a hard time seeing "P vs NP" as being on the same level as the Manhattan Project.  If they wanted this level of comparison, they could at least pick a technology that might actually happen.  P is almost certainly not equal to NP. In a 2012 poll, only 9% of computer scientists believed that P might equal NP.  If we could solve NP problems in polynomial time this might be roughly equivalent to a nearly infinitely fast computer.  Such a machine could decrypt nearly anything, allowing cyber warfare on an epic scale.

Movies, like other art forms, are subject to interpretation.  What a movie is actually "about" is typically a personal interpretation.  For me, TSP the Movie was a very lengthy dialog between several scientists and a government that wants to lock away their discovery.  Additionally, who is "accountable" for the use or misuse of the technology: the scientists or the government? Should the United States have published their research on the Manhattan Project? Additionally, what will the government ultimately do to protect its secrets?  If you want to watch a talky hour-long debate on these topics, this movie is for you!  Traveling Salesman (the movie) has little to do with the Traveling Salesman (the problem). Overall, I feel I was conned into a political debate.  And, for me, not that compelling of an argument for liberation of information.

What are my favorite eLearning opportunities?

In a twitter conversation, someone asked me what my favorite eLearning classes were.  An interesting question!  An even more interesting topic for a blog post.

Due to a limitation of time, I am forced to prioritize.  Here are the top ones for me.

  • Johns Hopkins Coursera Data Science Specialization - I am currently enrolled in this, about half way through.  It is really good! You can read my review of the first three courses here.
  • AI Class and Machine Learning Class - Both originally out of Stanford, now on Udacity and Coursera respectively - These were great.  I took these a few years ago when they first came out. They teach you a wide range of ML and AI algorithms, at the lowest level.
  • Khan Academy - Any time I find a hole in my math knowledge, I can usually find a video here to fill it.
  • Statistics @ Udacity by San Jose State University - I took this course for college credit, even though it was probably the 3rd undergrad stats class I took.  I wanted to see what the "for credit" approach was like.  It was a great class.  Anyone who is interested in AI/ML/Data Science and is NOT familiar with stats REALLY should take a stats class.
  • Nova Southeastern University - This is actually more distance learning than eLearning.  I am a student in their computer science PhD program.  You can read more about that here. I do have to go to Ft. Lauderdale, Florida four times a year.  I will be writing more about this as I progress through the program.  I liked this program because I could earn a research-oriented PhD in computer science, not one of the management degrees that many other distance/online/non-traditional PhD programs offer.  Not that there is anything wrong with such programs, I am just an uber-techie. :)

If I were in the market for a master's degree, I would seriously consider one of these two.

There are even search engines for online courses, such as http://www.skilledup.com/