Review of the First Three Johns Hopkins Coursera Data Science Courses

I am currently working towards the Johns Hopkins Data Science Specialization at Coursera, and I have now completed the first three courses.  I posted my initial, and very positive, impressions when I was about halfway through the first four-week block.  My impressions are still very favorable at completion. Now that the first block is finished, I can post my full thoughts on the first three courses.

There are a total of nine courses and a capstone.  After completing all 10 requirements you earn the "specialization".  After you complete a course you are given an "online certificate".  You can link this to your LinkedIn page, or other social media.  You can see my first three certificates here.

Some of my courses show a 101%.  I am not entirely sure what that means.  I did not lose any points, but I did not think I had extra credit.  But I won't complain. :)

The instructors are releasing these courses in batches of three. Each course lasts for four weeks. I decided to take the first three in parallel.  This is not necessarily a good idea, depending on your experience level. I did scale back to two courses for the second month.  The capstone project will not be released until this fall, so I decided to slow my progress so as to finish in conjunction with the capstone project.

Data Science Specialization (overall)

Overall, I feel this is a very solid program.  The three instructors all have very solid academic credentials and real-world experience in data science.  If I saw this program listed on a resume, I would consider it a plus.  The program is challenging and far from trivial.  As a note, I currently work as a data scientist and have a number of publications in the area of Artificial Intelligence.  See the conclusions section for my motivations for earning this "specialization".

This specialization is very much geared towards the R programming language.  Python and R seem to be the two major players in the field of data science. C is often used to create high-performance models to be used with Python & R.  More traditional languages, such as Java/C# sometimes have their place, particularly when a data product needs to be "weaponized" for production.

Currently, I am much more in the Python & C/C++ camp than R.  Gaining familiarity with R was one of the major motivators for me.  The R programming language WILL be difficult for someone coming out of a Java/C# background.  Python is a bit closer to R.  The value of R is its community and third-party add-ons that all work together.  Simply as a programming language, R is not that good.  It is a domain-specific language (DSL) for statistics.  You would not write a word processor or video game in R!  Python is a general-purpose language that also has much of the community and statistical models of R.  If you are going to be a data scientist, you should know both R and Python.

All of the courses are organized somewhat similarly. You are provided with video lectures and HTML/PDF versions of the slides.  All course material was available on "day 1" of the course. The following assessments are used to calculate your grade.

  • Quizzes: These were a combination of multiple choice and fill-in-the-blank.  You will sometimes use R to answer the quizzes. Most quizzes can be attempted twice, and you are given the higher of the two scores.
  • Programming Assignments: Programming assignments are essentially "unit tests" that you use a submit script to transmit.  Your answer, but not your code, is evaluated.  You can retry programming assignments as many times as you like.
  • Peer Reviewed Assignments: These are essentially documents, with attached files, that you submit for review. They are graded somewhat "subjectively" by your classmates. You are typically submitting program code, result data and screen shots.

All of the above assessments have "fixed deadlines".  The deadlines occur at the end of each of the four weeks of the course. After the deadline, you get partial credit for two days.  Beyond that, it is a zero.

The "peer reviewed assignments" did cause some concerns among the students.  You are simply given a final grade in the peer reviewed section.  It is a black box.  You do not know how many students evaluated you, what they gave you, or why they graded you how they did. Graders are instructed NOT to run your R program.  (Good advice, from a security point of view.)  But this brought up concerns about students' ability to grade a program they cannot run.  I ran into no issues with peer review.  I felt the grades I received were fair.

  • Evaluating your understanding: Really good!  I felt the programming assignments were a great mix of projects, documentation and quizzes.  If you can complete the assignments for these three classes, you have a good understanding of the subject matter.
  • Providing Experience with the technology: Really good!  You get hands-on experience with R, RStudio and GitHub. The fact that they include GitHub and exercises in collaboration is great!
  • Teaching the subject material: Acceptable. I must commend the instructors for providing HTML and PDF versions of their slides.  Trying to flip through video and find something is just painful.  However, not everything from the assignments is in the course material.  This does not bother me, and I think it is a good thing, but it was a stumbling block for some students. You need to know how to go out and find answers, especially if you aspire to be a data scientist. As a data scientist, my specifications are often quite vague!  You need to be a researcher/hacker and figure it out.

The Data Scientist's Toolbox

This is the first course in the series. This course was really easy for me, and I did not learn that much.  However, this is mainly because I already had R and RStudio installed.  I also already had a GitHub account.  If you are already at this level, then this class is just something you need to check off the list to get the specialization.

However, this class does set a great foundation!  It is absolutely awesome that the instructors expose you to GitHub for collaboration and sharing.  They also include how to ask questions, and where to ask them.  As an open source maintainer I can say that "question asking etiquette" is NOT common knowledge.

This course lays a great foundation!  But, you may already have said foundation.

R Programming

This is the second course in the series, and it was very good. It teaches you some of the fundamentals of the R programming language.  You are given assignments that test your knowledge of the topics. Some of the topics include:

  • Reading data files
  • Output to files
  • How to use R's looping functions
  • How to NOT use R's looping functions and use the various apply functions
  • Data frames

R is vector-based, with all sorts of "helper functions" to perform various tasks quickly.  This can be a bit frustrating, until you get used to it.  Python has some of the same issues.  When I first learned Python, I would frequently lament that you can reduce anything to 2 lines of code if you know the correct "pythonic magic" to invoke.  "Pythonic" refers to a program that is expressed in Python's unique style.
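To make the loop-versus-helper contrast concrete, here is a small Python sketch. It is my own illustration, not something from the course: the `readings` data and both function names are made up. Both functions compute the mean of the non-missing values. The first spells out the loop; the second leans on a built-in helper, much as idiomatic R leans on its apply functions.

```python
from statistics import mean

# Hypothetical readings; None marks a missing value (similar to NA in R).
readings = [12.5, None, 14.0, 13.5, None, 16.0]


def mean_with_loop(values):
    """Explicit loop: accumulate a total and a count by hand."""
    total = 0.0
    count = 0
    for v in values:
        if v is not None:
            total += v
            count += 1
    return total / count


def mean_pythonic(values):
    """Same calculation using a generator expression and a library helper."""
    return mean(v for v in values if v is not None)


print(mean_with_loop(readings))
print(mean_pythonic(readings))
```

Neither version is wrong. The second is simply what you end up writing once you know which helper to reach for, and that is the part that takes time to learn.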

Getting and Cleaning Data

This is a very practical course. Data are rarely in the form you want them!  Wrangling data is a critical skill.  This course introduces the concept of tidy data. Tidy data are data that have been transformed in some way to make them easier to model.  You almost always need to do this. This is why it is critical to document your "tidy steps".  I've run into a number of peer-reviewed academic articles where not enough instructions are given to realistically reproduce their research. You should always provide scripts to help reproduce your research.  One of the instructors wrote a book on this topic.

This course also provides details on obtaining data from the web, CSV, XML, JSON and APIs.  The final project asks you to merge a public data set and provide a tidy data set made up of several files.  You also create a codebook to document your data.  This gives a clear indication of your data format, as well as how your file was created from the raw input data.
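As a rough sketch of what a scripted "tidy step" can look like, here is a short Python example. It is not the course project: the column names (`subject`, `activity_id`, `mean_accel`), the sample values, and the function names are all hypothetical, and the raw "files" are in-memory strings so the script runs as-is. It merges two raw files, swaps numeric activity codes for readable labels, and writes out one tidy table.

```python
import csv
import io

# Hypothetical raw inputs, held in memory so the sketch is self-contained.
TRAIN_CSV = "subject,activity_id,mean_accel\n1,1,0.28\n2,2,0.31\n"
TEST_CSV = "subject,activity_id,mean_accel\n3,1,0.27\n"
LABELS_CSV = "activity_id,activity\n1,walking\n2,sitting\n"


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def tidy(train_text, test_text, labels_text):
    """Merge both raw files and replace activity codes with readable names."""
    labels = {row["activity_id"]: row["activity"] for row in read_rows(labels_text)}
    merged = read_rows(train_text) + read_rows(test_text)
    return [
        {
            "subject": int(row["subject"]),
            "activity": labels[row["activity_id"]],
            "mean_accel": float(row["mean_accel"]),
        }
        for row in merged
    ]


def to_csv(rows):
    """Serialize the tidy rows back to CSV text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["subject", "activity", "mean_accel"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


tidy_rows = tidy(TRAIN_CSV, TEST_CSV, LABELS_CSV)
print(to_csv(tidy_rows))
```

Because every transformation lives in the script, someone else can rerun it against the raw files and get the same tidy output. A codebook would then describe each of the three output columns.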

Conclusions

I am probably not the typical student for these courses.  I already work as a data scientist. Additionally, I am a Ph.D. student in Computer Science, and the author of several Artificial Intelligence (AI) books. Why am I taking this course?  First, to deepen my knowledge of the R programming language. Second, to round out my knowledge of data science and see how others are using it.  My background is more in information technology and artificial intelligence.

So far I am quite impressed with the courses.  I passed the first three courses "with distinction", and am looking forward to the next two. I am not planning on taking three in parallel again.  These courses are a decent amount of work. I am also hopeful that the upcoming course on reproducible research will be helpful for my Ph.D. dissertation.

18 thoughts on “Review of the First Three Johns Hopkins Coursera Data Science Courses”

  1. Bernie Keim

    Hi Jeff,
    I'm taking these courses from Johns Hopkins now and am planning to do the specialization. I generally agree with everything you've said although I do have some issues with the courses. I find many of the examples used during lectures seem superficial. For example, in R programming the examples are often very trivial (data frames with col names like foo and bar) ... trivial examples are stunningly dull.

    In the edX MIT Analytics Edge course (mentioned by louiedinh), the lectures, examples and recitations were top notch -- they were practical with clear explanations--and importantly, interesting. Unfortunately the course is just wrapping up and the rumor mill says it won't be offered again until next spring (definitely one to keep a watch for) ... one of the best courses I've taken on the subject.

  2. Reinaldo

    I am on the same path myself. It has been quite challenging and exciting. R is the first programming language I've ever learned. I had to take the R Programming and Getting and Cleaning Data courses twice hahaha. Not enough time to finish the assignments, watch the lectures and work on my startup at the same time. The courses are demanding for novice programmers. I have to say the forums of the courses are amazing. Also, Stack Overflow provides lots of useful insights. People are always willing to help and explain everything. MOOCs are definitely a powerful learning tool for anyone who lives outside of major learning centers (like myself, I live in a small city in Colombia near the border with Venezuela) or can't afford to travel.

    Keep up the good work Jeff, congratulations.

    There is something that's been quite difficult for me, and it's understanding control structures, mainly loops. Does anyone have advice on how to learn them faster?

  3. Pingback: Washington University MiniMed 2 | Jeff Heaton

  4. Pingback: Impressed by the Johns Hopkins Data Science Certification at Coursera | Jeff Heaton

  5. Sunil Agarwal (@sunil_agarwal)

    I took it for the first time and believe all 3 initial introductory courses were good. Although I am taking the next batch of three in parallel, I would highly recommend the EDA and Reproducible Research courses. On Statistical Inference, I had difficulty following; a solid understanding of statistics always helps.

  6. Appan Ponnappanpan

    I am also taking these courses and they are quite good. It should be a good start for people getting their feet wet but I am not sure whether even an experienced data scientist will learn anything new from these.

    There are a few areas which are not covered by any of the nine courses - big data processing (MR, Hadoop, etc.) and Graph & Text/Web analytics. But there are other specialized courses in Udacity, Coursera & EdX which address these topics & which should be taken to complement this certification. The Univ. of Washington's Intro. to Data Science course (https://www.coursera.org/course/datasci) also looks like a good introductory course.

    But I am yet to see a course which does cover the 'Velocity' aspect of big data - how to deal with data-in-motion? - Stream processing, CEP, etc. Yes, most of the courses cover the 'Volume' & 'Variety' aspects of big data & of course the analytics part which is the core.

  7. Pingback: My review of classes 4 & 5 of Johns Hopkins COURSERA Data Science Specialization | Jeff Heaton

  8. Lena

    Hi Jeff,
    This is a great review - very helpful!

    I want to start the coursera specialization in Data Science and have a question for you.

    How work intensive are the classes? (I know you are saying that you don't recommend taking three at once but...)
    I have a strong background in probability, am a bit familiar with R (did a tutorial on it) and would have 2.5 days to dedicate per week. Are 3 or even 4 courses manageable in this scenario?

    Many Thanks!

    L.

  9. C

    I've taken several data science courses with topics ranging from statistics to programming. I feel that in the Coursera specialization you are essentially expected to teach yourself most of the material. The video lectures seem to act as a supplement to your independent study. Don't take this to learn 'data science'. Only take it for the pedigree.

  10. Pingback: Johns Hopkins’ Online Data Science Certifiation Via Coursera - d-Bo Lab

  11. Pingback: Review of the First Three Johns Hopkins Coursera Data Science Courses | bigolddata

    1. Jeff Heaton Post author

      I am going to give the classic, "it depends." If I were interviewing someone and saw they had completed the JH data science program it would mean something to me. If this were a candidate that had a strong IT background and was transitioning to data science, this would show me that they know at least the basics. That they know their way around R. Other positive factors would be experience with the other V's: Volume, Variety and Velocity. The JH program gives you some degree of practice with variety, but does not address volume or velocity.
