I am done with the Coursera Johns Hopkins Data Science specialization. This is my first specialization earned from Coursera. The final step for me was the capstone project. Prior to the capstone project there were 9 other courses I needed to take. The whole process took about 8-9 months. This post is primarily about the capstone project. You can read my opinions on the individual courses from the following blog posts:
- Impressed by the Johns Hopkins Data Science Certification at Coursera
- Review of the First Three Johns Hopkins Coursera Data Science Courses
- 50% Done with Johns Hopkins COURSERA Data Science Specialization, My Review of Classes 3&4
- My Thoughts on Courses 1-9 of the Johns Hopkins Coursera Data Science Specialization
Once you complete all ten courses, including the capstone, you are issued a certificate of completion. This certificate is publicly shareable. You can see my certificate here.
I am probably not the typical student for this program. I am a part-time computer science PhD student, and a full-time data scientist for a large insurance company. While many of the concepts were review, this program forced me to use the R programming language. Left to my own devices I typically use Java and Python for data science. I also learned to use RPubs, Shiny and R Markdown, and I learned about reproducible research. Some of the topics covered in reproducible research were useful to me in my PhD program.
I really liked this program. Courses 1-9 provide a great introduction to the predictive modelling side of data science. Both machine learning and traditional regression models were covered. R can be a slow and painful language at times, but I was able to get through. It is my opinion that R is primarily useful for ferrying data between models and visualization graphs. It is not good for heavy lifting and data wrangling. The syntax of R is somewhat appalling. However, it is a domain-specific language (DSL), not a general-purpose language like Python. Don't get me wrong. I like R for setting up models and graphics, just not for performing tasks better suited to a general-purpose language.
The capstone project was to produce a program similar to SwiftKey, the product of the company that was the partner/sponsor for the capstone. If you are not familiar with SwiftKey, it attempts to speed up mobile text input by predicting the next word you are going to type. For example, you might type "to be or not to ____". The application should fill in "be". The end program had to be written in R and deployed to a Shiny Server.
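To make the task concrete, here is a minimal sketch of next-word prediction using n-gram counts with a simple backoff to shorter contexts. It is written in Python purely for illustration. It is not the model I submitted, and it is not how SwiftKey works. The function names `build_ngram_model` and `predict_next` and the two-sentence toy corpus are made up for this example.

```python
from collections import Counter, defaultdict


def build_ngram_model(sentences, max_order=3):
    """Count which word follows each context of 1..max_order-1 preceding words."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for order in range(2, max_order + 1):
            for i in range(len(tokens) - order + 1):
                context = tuple(tokens[i:i + order - 1])
                next_word = tokens[i + order - 1]
                model[context][next_word] += 1
    return model


def predict_next(model, text, max_order=3):
    """Return the most frequent follower of the longest matching context, or None."""
    tokens = text.lower().split()
    # Try the longest context first, then back off to shorter ones.
    for context_len in range(max_order - 1, 0, -1):
        if len(tokens) < context_len:
            continue
        followers = model.get(tuple(tokens[-context_len:]))
        if followers:
            return followers.most_common(1)[0][0]
    return None


# Toy corpus (hypothetical); the capstone corpus was far larger.
corpus = [
    "to be or not to be that is the question",
    "it is better to be safe than sorry",
]
model = build_ngram_model(corpus)
print(predict_next(model, "to be or not to"))  # prints: be
```

A real model would need proper tokenization, smoothing, and pruning to fit in memory, which is where most of the effort in the capstone went.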
This project was somewhat flawed in several regards.
- Natural Language Processing was not covered in the course. Neither was unstructured data. The only material provided on NLP was a handful of links to sites such as Wikipedia.
- The first 9 courses had a clear direction. However, less than half of them had anything to do with the capstone.
- The project is not typical of what you would see in most businesses as a data scientist. It would have been better to do something similar to Kaggle or one of the KDD Cups.
- In my opinion, R is a bad choice for this sort of project. During the meetup with SwiftKey, they were asked what tools they used. R was not among them. R is so cool for many things, so why not pick a project that showcases its abilities?
- Student peer review is bad... bad... bad... But it might be the only choice. The problem with peer review is you have three random reviewers. They might be easy, they might be hard. They might penalize you for the fact that they don't know how to load your program! (This happened to me on a previous Coursera course.)
- Perfect scores on the quizzes were really not possible. We were given several sample sentences to predict. The sentences were very specialized and no model would predict them correctly. SwiftKey's own app surely did not. Using my own human intuition and several text mining apps I wrote in Java, I did get 100% on the quizzes, even though the instructions clearly said to use your final model. Knowing I might draw a short straw on peer review, I opted to do what I could to get max points. I don't care about my grade, but falling below the cutoff because of a bad peer review would not be cool!
- Marketing-based rubric for the final project. One of the grading criteria posed the question, "Would you hire this person?" Seriously? I do participate in the hiring process for data scientists. I would never hire someone without meeting them, performing a tech interview, and giving a small coding challenge. I hope this stat is not used in marketing material, as in "xx% of our graduates produced programs that might land them a job."
After spending several days writing very slow model-building code in R, I eventually dropped it and used Java and OpenNLP to write code that would build my model in under 20 minutes. Others ran into the same issues. There are somewhat kludgy interfaces from R to OpenNLP and Weka, but these are native Java apps. I just skipped the kludge, built my model in Java, and wrote a Shiny app in R to use the model. This was enough to pass the program. I was not alone in this approach, based on forum comments.
Okay, I will just say it. I thought this was a bad capstone. The rest of the program was really good! If I could make a suggestion, I would say to let the students choose a Kaggle competition to compete in. The Kaggle competitions are closer to the sort of data real data scientists will see. I am proud of the certificate that I earned. If I were interviewing someone who had this certificate I would consider it a positive. The candidate would still need to go through a standard interview/evaluation process.