Done with Coursera Johns Hopkins Data Science Specialization

I am done with the Coursera Johns Hopkins Data Science specialization.  This is my first specialization earned from Coursera.  The final step for me was the capstone project.  Prior to the capstone project there were 9 other courses I needed to take.  The whole process took about 8-9 months.  This post is primarily about the capstone project.  You can read my opinions on the individual courses from the following blog posts:

Once you complete all ten courses, including the capstone, you are issued a certificate of completion.  This certificate is publicly shareable.  You can see my certificate here.

I am probably not the typical student for this program.  I am a part-time computer science PhD student, and a full-time data scientist for a large insurance company.  While many of the concepts were review, this course forced me to use the R programming language.  Left to my own devices I typically use Java and Python for data science.  I also learned to use R Publish, Shiny and R Markdown.  I also learned about reproducible research.  Some of the topics covered in reproducible research were useful to me in my PhD program.

I really liked this program.  Courses 1-9 provide a great introduction to the predictive modelling side of data science.  Both machine learning and traditional regression models were covered.  R can be a slow and painful language, at times, but I was able to get through.  It is my opinion that R is primarily useful for ferrying data between models and visualization graphs.  It is not good for heavy-lifting and data wrangling.  The syntax of R is somewhat appalling.  However, it is a domain specific language (DSL), not a general purpose language like Python.  Don't get me wrong.  I like R for setting up models and graphics, just not for performing tasks better suited to a general purpose language.

The capstone project was to produce a program similar to Swiftkey, the company that was the partner/sponsor for the capstone.  If you are not familiar with Swiftkey, it attempts to speed mobile text input by predicting the next word you are going to type.  For example, you might type "to be or not to ____".  The application should fill in "be".  The end program had to be written in R and deployed to a Shiny Server.
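To make the task concrete, here is a minimal sketch of next-word prediction with a trigram count model.  It is written in Python rather than R, and it is not the model I submitted.  The function names and the two-sentence toy corpus are made up for illustration; a real model would need a large corpus, smoothing, and back-off to shorter contexts.

```python
from collections import Counter, defaultdict


def build_ngram_model(sentences, n=3):
    """Count which word follows each (n-1)-word context."""
    model = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - n + 1):
            context = tuple(words[i:i + n - 1])
            next_word = words[i + n - 1]
            model[context][next_word] += 1
    return model


def predict_next(model, text, n=3):
    """Return the most frequent follower of the last n-1 words, or None."""
    context = tuple(text.lower().split()[-(n - 1):])
    followers = model.get(context)
    if not followers:
        return None  # unseen context; a real model would back off here
    return followers.most_common(1)[0][0]


# Toy corpus, invented for this example only.
corpus = [
    "to be or not to be that is the question",
    "i want to be a data scientist",
]
model = build_ngram_model(corpus)
print(predict_next(model, "to be or not to"))  # prints "be"
```

With only raw counts, any context the model has never seen returns nothing, which is one reason very specialized sentences are so hard to predict.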

This project was somewhat flawed in several regards.

  • Natural Language Processing was not covered in the course.  Neither was unstructured data.  The only material provided on NLP was a handful of links to sites such as Wikipedia.
  • The first 9 courses had a clear direction.  However, less than half of them had anything to do with the capstone.
  • The project is not typical of what you would see in most businesses as a data scientist.  It would have been better to do something similar to Kaggle or one of the KDD cups.
  • In my opinion, R is a bad choice for this sort of project.  During the meetup with Swiftkey, they were asked what tools they used.  R was not among them. R is so cool for many things, why not showcase its abilities?
  • Student peer review is bad... bad... bad... But it might be the only choice.  The problem with peer review is you have three random reviewers.  They might be easy, they might be hard.  They might penalize you for the fact that they don't know how to load your program!  (This happened to me on a previous Coursera course.)
  • Perfect scores on the quizzes were really not possible.  We were given several sample sentences to predict.  The sentences were very specialized and no model would predict them correctly.  The Swiftkey app surely did not.  Using my own human intuition and several text mining apps I wrote in Java, I did get 100% on the quizzes, even though the instructions clearly said to use your final model.  Knowing I might draw a short straw on peer review, I opted to do what I could to get max points.  I don't care about my grade, but falling below the cutoff because of a bad peer review would not be cool!
  • Marketing based rubric for final project.  One of the grading criteria posed the question, "Would you hire this person?"  Seriously?  I do participate in the hiring process for data scientists.  I would never hire someone without meeting them, performing a tech interview, and giving a small coding challenge.  I hope this stat is not used in marketing material, along the lines of "xx% of our graduates produced programs that might land them a job."

After spending several days writing very slow model building code in R, I eventually dropped it and used Java and OpenNLP to write code that would build my model in under 20 minutes.  Others ran into the same issues.  There are somewhat kludgy interfaces between R and OpenNLP, Weka and OpenNLP.  But these are native Java apps.  I just skipped the kludge and built my model in Java and wrote a Shiny app to use the model in R.  This was enough to pass the program.  I was not alone in this approach, based on forum comments.

Final Thoughts

Okay, I will just say it.  I thought this was a bad capstone.  The rest of the program was really good!  If I could make a suggestion, I would say to let the students choose a Kaggle competition to compete in.  The Kaggle competitions are closer to the sort of data real data scientists will see.  I am proud of the certificate that I earned.  If I were interviewing someone who had this certificate I would consider it a positive.  The candidate would still need to go through a standard interview/evaluation process.

Quick and very dirty data wrangling example

Data science is often described as the intersection of statistics, domain knowledge and hacking skills.  One important part of hacking skills is data wrangling.  Data are rarely in the exact form that you need them.  I am currently working on an example for AIFH Vol 3 that will use a SOM and compare nations based on several statistics.  I could not find a dataset that fit exactly what I was looking for.  So I decided to create my own dataset.

I wanted a list of countries with three different data points that somehow indicate that nation's prosperity.  I chose GDP, lifespan and literacy rate.  Remember, this is a computer science experiment, not a sociology experiment.  I am sure others could come up with a much better set of data points to compare countries.  However, for my example program these will work just fine.

I could not find a data set that was already completed.  However, all of this data is contained in Wikipedia.  To wrangle the data I created a simple Python script.  I am really starting to like Python for quick scripting projects.  I could have also used R, Groovy, Perl or a host of others.  The end result looks something like this:

code,country,gdp,lifespan,literacy
AFG,Afghanistan,20650,60,0.431
ALB,Albania,12800,74,0.98
DZA,Algeria,215700,73.12,0.918
AND,Andorra,4800,84.2,1.0
AGO,Angola,124000,52,0.826
ATG,Antigua and Barbuda,1220,75.8,0.984

[Full File]

You can download the entire contents of Wikipedia into a data file.  This is usually how you should deal with Wikipedia data.  Do not use HTTP to pull large volumes of data from Wikipedia.  This is a good way to get blocked from Wikipedia.  Also, the datafile for Wikipedia is not HTML encoded and much easier to parse.  I simply pulled the nation codes page, GDP, literacy, and lifespan pages into text files that my Python script could parse.

I linked the files together (joined) using the nation name as a key.  If a nation's name did not appear in all lists I discarded that nation.
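The join itself can be sketched in a few lines of Python.  This is not my actual script, just a minimal illustration that assumes each source has already been parsed into a dictionary keyed by nation name.  The function name is hypothetical, and "Atlantis" is a made-up entry that shows a nation being discarded because it is missing from one list.

```python
def join_by_country(*tables):
    """Inner-join dicts keyed by country name.

    Only names present in every table are kept; each result value is a
    tuple holding that country's entry from each table, in order.
    """
    common = set(tables[0])
    for table in tables[1:]:
        common &= set(table)
    return {name: tuple(table[name] for table in tables)
            for name in sorted(common)}


# Tiny hand-made inputs.  The Afghanistan and Albania values are copied
# from the sample rows above; "Atlantis" is invented for illustration.
codes = {"Afghanistan": "AFG", "Albania": "ALB", "Atlantis": "ATL"}
gdp = {"Afghanistan": 20650, "Albania": 12800, "Atlantis": 1}
lifespan = {"Afghanistan": 60, "Albania": 74}  # no Atlantis here
literacy = {"Afghanistan": 0.431, "Albania": 0.98, "Atlantis": 1.0}

rows = join_by_country(codes, gdp, lifespan, literacy)
print("code,country,gdp,lifespan,literacy")
for name, (code, g, life, lit) in rows.items():
    print(f"{code},{name},{g},{life},{lit}")
```

Joining on the name only works when every page spells a nation's name the same way, so a name that varies between pages is silently dropped by this approach.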

You can see my Python code here.  This code could be more readable.  But it gets the job done.  It is a quick data wrangling hack.  If I needed to re-pull the data on a frequent basis, particularly if it were high-velocity data, I would do something more formal.

Intellivision, Imitation and Minecraft

In my part of the world this is the holiday season.  Between Thanksgiving, Christmas and New Year's, there is quite a bit going on in December.  Having just finished my first semester in a new degree program, as well as the capstone to the Johns Hopkins Data Science specialization, I was feeling particularly festive!  I hope everyone (who celebrated it) had a wonderful Christmas and holiday season.  Here is how I spent the holiday!

Intellivision

My primary gift this Christmas was an Intellivision video game system.  Back in 1981 I received the original Intellivision system for Christmas.  That Christmas, back in the early eighties, is probably my most memorable.  I could finally play video games, in my own house!  No need for a pocket full of quarters for the local arcade.  My father and I spent hours playing games with Intellivision.  Some of our favorites were Utopia, Sea Battle, Dungeons and Dragons, and Space Battle.

However, my absolute favorite was Utopia!  I think Utopia was the original nation-level strategy game.  There was no AI component.  Two players each commanded an island and controlled the military and economy of the island.  I would often play the game by myself and try to control both islands, at the same time.  My basic strategy was to earn as much money as possible fishing.  Then buy as many factories as I could.  Factories would earn money just sitting there.  Eventually I had enough money rolling in from the factories that I no longer needed to fish.  The factories also polluted and kept population down, so I did not have to waste as much of my small islands on non-income producing tiles to keep my population happy.  Who needs schools, hospitals and houses when you can have income producing factories?  Eventually I could even quit producing crops and feed the population with only fish, so long as I had one fishing boat for each 500 people.  Eventually, I ended up with tons of money and about 1100 people working in my factories on a heavily polluted island.  It did not matter what your population was, as the factories would still produce just fine.  I would have really hated to live in my nation!

I tried many other strategies as well.  Some actually led to a more utopian island for my population.

The new Intellivision system came with 60 games, and cost around $40.  How is that for deflation?  Back in the eighties I recall sometimes shelling out 40 bucks for a single game!  Most games I could no longer beat.  It has been too many years, and the muscle memory I built up to beat these games is long gone.  However, I could still build my polluted despotic nation in Utopia just fine!  I can't imagine how many hours I spent playing that game as a child, given that I still remembered it exactly.  Here is a picture of the TV screen with my two islands that I built up rather quickly.

intellivision_utopia

I was quite happy with the new Intellivision system.  The gameplay is just as I remembered it from Intellivision.  Some of the colors and sounds seemed a little different.

Imitation

My wife and I both went to see The Imitation Game at the Wehrenberg Theaters 5-star lounge.  I might owe my wife several movie selections for agreeing to see a movie about Alan Turing, the man who many consider to be the father of computer science and artificial intelligence.  The movie focused on the efforts of his team to help the British government break the Enigma code used by the Nazi armed forces.

I enjoyed the movie.  So did my wife.  Clearly some parts of reality were adapted to make a more compelling movie.  The movie stayed clear of most of the mathematics involved in cracking Enigma.  They did mention the flaw in always starting an encrypted message with the same word.  This is not an issue with modern cryptology; however, for a system such as Enigma, it could be a fatal flaw.  They also brushed over some other computer science topics, such as big-O analysis, the Turing test, universal Turing machines, and elements of cryptography.  Nothing really deep.  They only partially explained a Caesar cipher.

Unfortunately, they fairly directly attributed the invention of the computer to Turing.  Turing did a fantastic job of classifying what a computer is.  With all of the advances in computing his definition still holds.  ACM discusses this some in this article.

Our local theatre has a 5-star lounge.  My wife and I enjoy this!  You get to sit in nice recliners and can order food and wine!

5star_1 5star_2

 

It is fun!  I know some of my friends have this setup essentially recreated at home!  However, our entertainment room is not nearly so cool!

I do not know a great deal about cryptography.  However, I am taking a graduate class on it this semester.  So I will likely be learning much more soon!

Minecraft

Another part of my Christmas holiday was playing Minecraft with my niece and nephew.  My little nephew really likes Minecraft!  I am sure I would have been fascinated by it at his age.  However, all I had was Utopia.  I created a large arena for us to battle in.  This included working redstone lights to control the number of monsters that would spawn around us.  I set up a computer to run a Minecraft server so that we could both be in the world at the same time.

minecraft_arena

It was a good Christmas!

First Semester Done: Artificial Intelligence and Database Management Systems

I survived my first semester in the PhD in Computer Science program at Nova Southeastern University in Ft. Lauderdale, FL.  The two classes that I took this semester were CISD 750: Database Management Systems and CISD 760: Artificial Intelligence.  Both classes were a challenging mix of research, exams and papers.  I ended up with an A in AI and an A- in DBMS.  Both required considerable work.  I gained valuable knowledge and insights for my ultimate dissertation.  Both instructors were helpful and very knowledgeable about their topics.

Journal articles were a major part of both classes.  There were over 20 assigned papers for the database class and I reviewed nearly that many for the AI class.  Reading journal articles was something that I expected from a doctorate program.  While this is a distance degree program, both classes also had challenging in-class mid-term examinations.  The program requires me to spend four days, in two trips, on-campus per semester.  I traveled to Ft. Lauderdale both in August and October.

For this semester, the AI professor was Sumitra Mukherjee and the database professor was Junping Sun.  Both were great!  I would very much recommend classes taught by either.

CISD 750: Database Management Systems

The database management class covered many aspects of DBMS design.  I had not previously taken an academic class on database systems, so there was quite a bit of new material for me.  I had not worked with relational algebra prior to this class.  I also learned about functional dependencies, schema normalization, the chase algorithm, hashing, indexing, dynamic programming, query optimization, and other topics.  Most of the focus of the class was on relational databases.  However, some time was spent looking into some NoSQL topics.  The final and mid-term both tested our abilities to work through these algorithms.

I chose to do my research paper on frequent itemset mining.  This is a Knowledge Discovery in Databases (KDD) topic.  I compared the Apriori, Eclat and FP-Growth algorithms.  I did an empirical study of what datasets are conducive to each algorithm.  Most papers that I read about these algorithms studied the performance effects of varying the support threshold using known datasets.  I wanted to take a different approach, so I looked at how I could generate a dataset with a specified frequent itemset density and number of frequent items.  I created a Python application to generate this simulated data.  I was happy with how the paper turned out.  I plan on posting the results from this research in a separate post here in the future.
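For readers new to frequent itemset mining, here is a minimal sketch of the Apriori idea in Python.  It is a textbook-style illustration, not the code from my paper, and the five shopping baskets are invented.  Support is taken here as an absolute count of transactions, which is an assumption of this sketch; many papers express it as a fraction instead.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset contained in at least
    min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = {frozenset([item]) for item in items}
    frequent = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-item candidates, then prune
        # any candidate with an infrequent k-item subset.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Made-up transactions for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
frequent = apriori(baskets, min_support=3)
for itemset, count in sorted(frequent.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Lowering `min_support` makes many more candidates survive each level, which is why the support threshold dominates running time in most published comparisons.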

I really liked the textbook that was chosen for this class.  Database Systems: The Complete Book (2nd Edition) appears to be a classic in the field of DBMS.  The examples and explanations were really clear.  The material was very different than what I am used to for databases.  I've worked with databases, such as Oracle and MySQL, for years.  It was very interesting to see DBMS at a more academic level.  I feel that I gained a deeper understanding of database topics.

CISD 760: Artificial Intelligence

The AI class covered a wide range of topics that provided a very good foundation in AI.  The optional textbook was Artificial Intelligence: A Modern Approach.  I feel the textbook was a great choice, and I already owned a copy before the course began.  Topics covered in this course included A*, Neural Networks, Decision Trees, Bayesian Inference, and Genetic Algorithms.  We were given an assignment where we had to make use of several of these algorithms.  I used Encog for the Genetic Algorithm and A* portions of the programming assignment.

Another assignment asked us to choose a peer reviewed article and perform a critique.  For my article, I chose:

Larochelle, H., Bengio, Y., Louradour, J., & Lamblin, P. (2009, June). Exploring strategies for training deep neural networks. J. Mach. Learn. Res., 10, 1-40.

I am interested in deep learning and wanted to research, and understand, some of the issues with applying it to continuous inputs.  I ended up implementing a deep belief neural network in Java.  The source code to this is on GitHub.  I will make use of this code for Volume 3 of my AIFH series.

The major paper for this class was to write an idea paper.  Idea papers are used in this program to capture potential ideas for a dissertation.  I wrote an idea paper that detailed research that I might like to perform to use continuous input with deep learning.  I believe that I would like to do my dissertation in the area of AI.  I am not sure I will choose deep learning.  Nevertheless, the paper was a great exercise.  There are several areas where I gently nudged the boundary of human understanding while writing parts of Encog.  I plan to explore several of these for a potential dissertation.  I also still have one more semester before I need to really get serious about a dissertation topic.

Next Semester

Next semester I am taking two courses again, and then two more in the fall of 2015.  Once I am through these courses I have two more semesters that I will split between a single course and research hours.  After that I will be a PhD candidate and on to dissertation work.

The two courses that I will take next semester (Winter 2015) are:

  • CISD 792  Computer Graphics
  • ISEC 730 / DCIS 730 Network Security and Cryptography

The textbooks for my computer graphics class are shown here.  There is no assigned book for the security class.  My guess is there will be a number of assigned papers for the security class.

books_winter_2015

 

I am looking forward to the next semester!  It looks like I will be learning about three.js in the graphics class.  I am really looking forward to that.

Handling Multiple Java Versions on a Mac

I would like to make use of Java 8 in Encog now, as well as the upcoming volume 3 of AIFH.  I primarily use a Mac at home and a Microsoft Surface while traveling.  In this post, I will describe how I set up the Mac to use Java 8.

The first step is to install Java 8, or whatever version of Java you wish to use.  These can be found at the following page.

http://www.oracle.com/technetwork/java/javase/downloads/index.html

This will download a Mac package to install.  Once you install this package, you want to get to the point that you can go to a command line and issue the command java -version and see the correct version.  For example:

jeffs-mbp:~ jheaton$ java -version
java version "1.8.0_25"
Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)
jeffs-mbp:~ jheaton$ 

To do this you must first find the exact version number of Java that you wish to use.  The java_home command can do that.

jeffs-mbp:~ jheaton$ /usr/libexec/java_home  -V
Matching Java Virtual Machines (5):
    1.8.0_25, x86_64:	"Java SE 8"	/Library/Java/JavaVirtualMachines/jdk1.8.0_25.jdk/Contents/Home
    1.7.0_25, x86_64:	"Java SE 7"	/Library/Java/JavaVirtualMachines/jdk1.7.0_25.jdk/Contents/Home
    1.7.0_13, x86_64:	"Java SE 7"	/Library/Java/JavaVirtualMachines/jdk1.7.0_13.jdk/Contents/Home
    1.6.0_65-b14-462, x86_64:	"Java SE 6"	/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
    1.6.0_65-b14-462, i386:	"Java SE 6"	/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home

/Library/Java/JavaVirtualMachines/jdk1.8.0_25.jdk/Contents/Home
jeffs-mbp:~ jheaton$

Now that you have the version number, you must add the following line to your ~/.bash_profile file.

export JAVA_HOME=`/usr/libexec/java_home -v 1.8.0_25`

Make sure to remove any other JAVA_HOME directives.

 

The 18 Kickstarter Projects that I've Backed

As of today I've backed 20 Kickstarter projects, have run two successful projects and am planning my third.  Of these 20 backed projects, 18 were funded.  In this post I will give a summary of my experiences, as a backer, with Kickstarter.  Here are a few of the items I've received.

jheaton_backer

I have not yet experienced a Kickstarter project that has gone completely AWOL.  There are certainly cases of this.  It is not something that has occurred, yet, in a project that I've backed.  Most Kickstarter projects do run late.  The initial project estimate is just that, an estimate.  Few projects meet it.  These projects are often in completely uncharted waters of design, production and international order fulfillment.  Once a project is late, communication is the key.  At this point you need your backers to keep faith in your ability to fulfill the project.  I've seen projects run late by over a year, and still pull through.

Another issue that I've observed is that backers often treat Kickstarter as a sort of online mall.  You should not be buying Kickstarter projects as birthday or Christmas gifts.  Often backers do exactly that, seeing the expected delivery date falls a few months before the date that the backer needs the gift.  Then a project runs behind, Christmas is missed, and emotions run high.  I've actually seen a few Kickstarter projects miss two Christmases in a row, and still deliver well over two years behind.  This is the nature of backing a product that is not yet complete, much less in production.  I believe that Kickstarter is about being involved in an immature product that you believe in!

Here is a summary of the Kickstarter projects that I've backed (in the order I backed them):

  1. Code Monkey Save World: Major fan of Jonathan Coulton!  The project ran late, but was everything I expected and more.
  2. Sparki - The Easy Robot for Everyone: Cool robotics platform!  Just a few months late, and everything I expected.
  3. Dog Sled Saga: Cute little platformer game.  Delivered an alpha on-time and continues to evolve.
  4. LightUp: Learn by Making: Really neat electronics kit concept, but the project has had a number of issues.  I did get the kit just a few days ago.  But there seem to be some discrepancies between the described contents of the kit and what people were actually shipped.  The production quality looks to be good, and I was able to use mine.  Just no instructions, at this point.
  5. The Stage at KDHX: A local Kickstarter project for a radio station in my area.  Everything went great!
  6. Supertoy - World's First Natural Talking Teddy Bear: An AI-enabled teddy bear.  How could I pass this one up?  The video showed a teddy bear that could converse as well as Data from Star Trek.  Clearly that did not happen, or you would have read about the dawn of strong AI in all major media.  However, after running over a year behind, I did get a bear capable of looking up things in Wikipedia.  It is actually kind of cool, just a wild ride of 80 somewhat strange updates in which the project creator shared YouTube videos that he found interesting.
  7. Dataflow & Reactive Programming Systems: A book project by a local author.  Project delivered what it promised in the
  8. Star Trek Continues Webseries: This is a really cool project if you are a fan of the original Star Trek.  Delivered exactly what I expected.
  9. Chaos Drift - A Nostalgic RPG Experience: A classic RPG style game from my hometown.  The project is somewhat behind but seems to be progressing well.
  10. KUNG FURY: A movie in the classic 1980's action style.  The project is a bit behind at this point, but seems to be progressing.
  11. Hello Ruby: A computer programming book for kids.  Huge Kickstarter success; however, it is now a few months behind.  Looks promising.
  12. iOS App Development for Teens by Teens: A book about iOS app development targeted at teens, and written by a teen.  Project delivered on-time and to expectations!
  13. JUMP Cable: A small device that can be used to recharge iPhones and other devices.  Currently shipping.
  14. Mini Museum: Totally amazing project that lets you own a miniature museum in an acrylic case.  This was one of my favorite Kickstarter projects.  Small delay, but I got mine yesterday.  Great communication with backers.
  15. A History of the Great Empires of Eve Online: A history of the MMOG Eve.  Seems to be progressing well.
  16. The Universim: Looks like the ultimate successor to Civilization, seems to be progressing well.
  17. Bring Reading Rainbow Back for Every Child, Everywhere!: I watched Reading Rainbow when younger, and am a major fan of the chief engineer of the Enterprise.  How could I not back LeVar?  Project was epic and seems to be on track.  Butterfly in the sky, this thing went twice as high, take a look, it hit 5 million!
  18. The PHD Movie 2: Still in Grad School: Since entering PhD grad school myself I've seen many references to these comics.  Project seems to be doing well!

There you have it!  These are the projects I backed on Kickstarter.  This is a total of $1041 in backing dollars.  So far it has been a great experience!

Multi Agent Modeling Presentation & Midterms

I co-presented on the topic of Agent Based Modeling (ABM) at the 2014 Society of Actuaries annual meeting in Orlando, FL during October 26-29, 2014.  October was a busy month for me.  I had to fly to Ft. Lauderdale and Orlando both in October.  I know! There are worse places to fly to, but it was still a somewhat hectic schedule.  I flew to Ft. Lauderdale first to attend my midterms for the computer science PhD program in which I am a student.

october-travel-2014

 

One of the booths at the SOA meeting was set up to take green-screen photos and place the attendee on the cover of The Actuary.  Here is my photo:

jheaton_actuary

It was an interesting meeting.  There were keynote presentations by both Madeleine Albright, former U.S. Secretary of State, and Dr. Adam Steltzner, Lead Landing Engineer of NASA’s Mars Science Laboratory Curiosity Rover Project.  Both keynote speakers were fascinating.  There were quite a few data science related sessions during the conference.

I presented on ABM along with Dr. Anand S. Rao.  I began the talk with an overview of what ABM is and introduced the open source utility Repast.  Dr. Rao continued the talk and showed how PricewaterhouseCoopers (PwC) makes use of agent modeling.  You can see the slides for the presentation here (link to SOA site).

I also had to get ready for midterms for my first semester.  I am enrolled in two courses, and had to study for Artificial Intelligence and Database Management Systems.

study-midterm-2014

I will post more about the semester once it concludes in a few weeks.  I am already signed up for Winter 2015, and will be taking Computer Graphics and Cryptography.