My First Kaggle Competition

I placed in the top 10% of my first Kaggle competition.  If you are not familiar with it, Kaggle is an ongoing forum for competitive data science. Individuals and teams compete to create the best models for data sets provided by industry and, sometimes, academia.  Competitors are ranked as Novice, Kaggler, or Kaggle Master.  To become a Kaggle Master, one must place in the top 10% of two competitions and in one of the top 10 slots of a third competition.

I've talked about Kaggle in many of my presentations, and I've used Kaggle data in my books. Until now, however, I had never actually entered a Kaggle competition.  I decided it was finally time to try it for myself. I competed in the Otto Group Product Classification Challenge, which ended on May 18th, 2015.  My score was sufficient to land in the top 10%, so I've completed one of the requirements for Kaggle Master.  My Kaggle profile can be seen here.

My goals for entering were:

  • See how hard Kaggle actually is, and move towards a Kaggle master designation.
  • Learn from the other Kagglers and forums.
  • Build a basic toolkit that I will use for future Kaggle competitions.
  • Gain an example (from my entry) for the Artificial Intelligence for Humans series.
  • Maybe get an idea or two for my future dissertation (I am a PhD student at Nova Southeastern University).

The Otto Classification Challenge

First, I will give a brief introduction to the exact nature of the Otto Classification Challenge.  For a complete description, refer to the Kaggle description (found here).  The challenge was introduced by the Otto Group, the world's largest mail order company and currently one of the biggest e-commerce companies, based mainly in Germany and France but operating in more than 20 countries.  They sell many products across numerous countries, and they would like to classify these products into 9 categories using 93 features (columns).  These 93 columns represent counts and are often zero.

The data are completely redacted.  You do not know what the 9 categories are, nor do you know the meaning behind the 93 features.  You only know that the features are integer counts. Most Kaggle competitions provide you with a test and training dataset.  For the training dataset you are given the outcomes, or correct answers.  For the test set, you are only given the 93 features, and you must provide the outcome.  The test and training sets are divided as follows:

  • Test Data: 144K rows
  • Training Data: 61K rows

You do not actually submit your model to Kaggle.  Rather, you submit your predictions based on the test data.  This allows you to use any platform to make these predictions.  The actual format of a submission for this competition is the probability of each of the 9 categories being the outcome.  This is not like a university multiple choice test where you must submit your answer as A, B, C, or D.  Rather, you would submit your answer as:

  • A: 80% probability
  • B: 16% probability
  • C: 2% probability
  • D: 2% probability

I wish college exams were graded like this!  Often I am very confident about two of the answers, and can eliminate the other two.  Simply assign a probability to each, and you get a partial score.  If A were the correct answer for the above, I would get 80% of the points.

The actual Kaggle score is slightly more complex than that.  You are graded on a logarithmic scale (a multi-class log loss) and are very heavily penalized for assigning a low probability to the correct answer. The following are a few lines from my submission:

1,0.0003,0.2132,0.2340,0.5468,6.2998e-05,0.0001,0.0050,0.0001,4.3826e-05
2,0.0011,0.0029,0.0010,0.0003,0.0001,0.5207,0.0013,0.4711,0.0011
3,3.2977e-06,4.1419e-06,7.4524e-06,2.6550e-06,5.0014e-07,0.9998,5.2621e-06,0.0001,6.6447e-06
4,0.0001,0.6786,0.3162,0.0039,3.3378e-05,4.1196e-05,0.0001,0.0001,0.0006
5,0.1403,0.0002,0.0002,6.734e-05,0.0001,0.0027,0.0009,0.0297,0.8255

Each line starts with a number that specifies the data item that is being answered.  The sample above shows the answers for items 1-5.  The next 9 values are the probabilities for each of the product classes.  These probabilities must add up to 1.0 (100%).
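To make the scoring concrete, here is a minimal sketch of how a multi-class log loss of this sort can be computed with NumPy.  The function name, the clipping value, and the array layout are my own assumptions; Kaggle's exact implementation may differ slightly.

import numpy as np

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    # y_true: integer class labels, shape (n,)
    # y_prob: predicted probabilities, shape (n, 9) for this competition
    y_prob = np.clip(y_prob, eps, 1 - eps)
    # Renormalize each row so the 9 probabilities sum to 1.0
    y_prob = y_prob / y_prob.sum(axis=1, keepdims=True)
    n = y_prob.shape[0]
    # Average negative log of the probability assigned to the correct class
    return -np.mean(np.log(y_prob[np.arange(n), y_true]))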

What I Learned from Kaggle

If you want to do well in Kaggle, the following topics are very important.  I've listed each along with the tools I used.

  • Deep Learning - Using H2O and Lasagne
  • Gradient Boosting Machines (GBM) - Using XGBOOST
  • Ensemble Learning - Using NumPy
  • Feature Engineering - Using NumPy and Scikit-Learn

The two areas where I learned the most during this challenge were GBM parameter tuning and ensemble learning.  I got pretty good at tuning a GBM.  The individual scores for my GBMs were in line with those achieved by the top teams.

Before Kaggle I typically used only one model at a time: if I was using neural networks, I just used neural networks; if I was using an SVM, random forest, or gradient boosting, I stuck to just that model.  With Kaggle, it is critical to use multiple models, ensembled to produce better results than any of the models could produce independently.

Some of my main takeaways from the competition:

  • A GPU is really important for deep learning.  It is best to use a deep learning package that supports it, such as H2O, Theano or Lasagne.
  • t-SNE is awesome for visualizing high-dimensional data and for creating features.
  • I need to learn to ensemble better!

This competition was the first time I used t-SNE.  Like PCA, it is capable of reducing dimensions; however, the data points separate in such a way that the visualization is often clearer than with PCA. This is done using a stochastic nearest-neighbor process. I plan to learn more about how t-SNE actually performs the reduction, compared to PCA.
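As a rough illustration of the kind of plot I mean, scikit-learn's TSNE can embed the 93 count features into two dimensions.  The placeholder data and plotting details below are assumptions for the sketch, not my actual competition code.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder data standing in for the real training set:
# X would be the (n_rows, 93) count features, y the 9 class labels.
X = np.random.poisson(1.0, size=(1000, 93))
y = np.random.randint(0, 9, size=1000)

# Reduce the 93 dimensions to 2 with t-SNE and color the points by class
embedding = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)
plt.scatter(embedding[:, 0], embedding[:, 1], c=y, s=5, cmap='tab10')
plt.title("t-SNE of the 93 features (illustrative)")
plt.show()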


My Approach to the Otto Challenge

So far I've only worked with single-model systems.  I've used models that contain "built in" ensembles, such as random forests and gradient boosting machines.  However, it is possible to create higher-level ensembles of these models.  I used a total of 20 models: 10 deep neural networks and 10 gradient boosting machines.  My deep neural network system provided one prediction and my gradient boosting machines provided the other.  These two predictions were blended together using a simple ratio.  The resulting prediction vector was then normalized so that the sum equaled 1.0 (100%).
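A minimal sketch of that blend, assuming the two prediction sets have already been loaded as (n_rows, 9) NumPy arrays; the weight shown is the 0.65 I mention below.

import numpy as np

def blend(nn_pred, gbm_pred, w=0.65):
    # Weighted ratio of the two probability matrices
    combined = w * nn_pred + (1.0 - w) * gbm_pred
    # Renormalize each row so the 9 probabilities again sum to 1.0
    return combined / combined.sum(axis=1, keepdims=True)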

[Figure: diagram of my Otto Group ensemble of 10 deep neural networks and 10 GBMs]

I did not remove or engineer any fields.  For both model types I converted all 93 attributes into z-scores.  For the neural network I additionally normalized all values into a specific range.
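A sketch of that preprocessing with scikit-learn; the array names are placeholders, and the exact range used for the neural networks is an assumption.

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# X_train, X_test: NumPy arrays holding the 93 raw count columns (assumed loaded elsewhere)
z = StandardScaler().fit(X_train)
X_train_z, X_test_z = z.transform(X_train), z.transform(X_test)   # z-scores, used by both model types

# The neural networks additionally squashed the z-scores into a fixed range, e.g. 0..1
r = MinMaxScaler(feature_range=(0, 1)).fit(X_train_z)
X_train_nn, X_test_nn = r.transform(X_train_z), r.transform(X_test_z)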

My 10 deep learning neural networks used a simple bagging method: I averaged the predictions from the individual networks.  Each of these neural networks was created by choosing a different 80/20 split between training and validation.  Each network was trained on its training portion until the validation score did not improve for 25 epochs.  Once training stopped I used the weights from the epoch that produced the highest training score. This is a simple form of bagging (bootstrap aggregation).
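In outline, that bagging loop looks something like the sketch below.  I actually used deep networks in H2O and Lasagne; the scikit-learn MLPClassifier here is only a stand-in, and the layer sizes are made up.

import numpy as np
from sklearn.neural_network import MLPClassifier

def bagged_nn_predictions(X, y, X_test, n_models=10):
    preds = []
    for seed in range(n_models):
        # Each network gets its own 80/20 train/validation split (driven by random_state)
        # and stops once the validation score fails to improve for 25 epochs.
        net = MLPClassifier(hidden_layer_sizes=(512, 256),
                            early_stopping=True,
                            validation_fraction=0.2,
                            n_iter_no_change=25,
                            max_iter=1000,
                            random_state=seed)
        net.fit(X, y)
        preds.append(net.predict_proba(X_test))
    # Average the probability matrices from all of the networks
    return np.mean(preds, axis=0)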

My 10 gradient boosting machines (GBMs) were each components of a 10-fold cross-validation.  I essentially broke the Kaggle training data into 10 folds and used each of these folds as a validation set, with the others as training.  This produced 10 gradient boosting machines.  I then used an NxM coefficient matrix to blend them together, where N is the number of models and M is the number of categories.  In this case it was a 10x9 grid.  This matrix weighted each of the 10 models' predictive power in each of the 9 categories.  These coefficients were a straight probability calculation from the confusion matrix of each of the 10 models.  This allowed each model to potentially specialize in each of the 9 categories.
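A rough sketch of that idea: from each fold model's validation confusion matrix, derive a per-class weight (here, the probability the model is correct when it predicts that class), then use those weights to blend the test-set predictions.  The function names and the exact probability used are my own interpretation, not my original code.

import numpy as np
from sklearn.metrics import confusion_matrix

def class_weights_from_confusion(val_preds, val_labels, n_classes=9):
    # Build an (n_models, n_classes) weight matrix from each model's confusion matrix
    weights = []
    for preds, y_val in zip(val_preds, val_labels):
        cm = confusion_matrix(y_val, preds.argmax(axis=1), labels=list(range(n_classes)))
        col_totals = cm.sum(axis=0).astype(float)
        col_totals[col_totals == 0] = 1.0
        # Column j of cm: how often the model was right when it predicted class j
        weights.append(np.diag(cm) / col_totals)
    return np.array(weights)

def weighted_blend(test_preds, weights):
    # Blend the 10 test-set prediction matrices, weighting each model per category
    blended = sum(w * p for w, p in zip(weights, test_preds))
    return blended / blended.sum(axis=1, keepdims=True)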

I spent considerable time tuning my GBM.  I used Nelder-Mead searches to optimize my hyper-parameter vector.  I ultimately settled on the following parameters:

params = {'max_depth': 13, 'min_child_weight': 4, 'subsample': 0.78, 'gamma': 0,
          'colsample_bytree': 0.5, 'eta': 0.005, 'threads': 24}
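For reference, a simplified sketch of how such a Nelder-Mead search can be wired up with SciPy and XGBoost.  My real objective function used full cross-validation and more parameters; the data variables, round counts, and bounds handling here are placeholder assumptions.

import numpy as np
import xgboost as xgb
from scipy.optimize import minimize

# X_train, y_train, X_valid, y_valid are assumed to be prepared elsewhere
dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)

def objective(v):
    max_depth, min_child_weight, subsample, colsample = v
    params = {'objective': 'multi:softprob', 'num_class': 9, 'eval_metric': 'mlogloss',
              'max_depth': int(round(max_depth)),
              'min_child_weight': max(float(min_child_weight), 0.0),
              'subsample': float(np.clip(subsample, 0.1, 1.0)),
              'colsample_bytree': float(np.clip(colsample, 0.1, 1.0)),
              'eta': 0.05}
    booster = xgb.train(params, dtrain, num_boost_round=200)
    # Return the validation multi-class log loss, parsed from the evaluation string
    return float(booster.eval(dvalid).split(':')[-1])

result = minimize(objective, x0=[10, 4, 0.8, 0.5], method='Nelder-Mead')
print(result.x)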

Each of these two approaches (GBM and neural network) produced a separate submission file. I then blended these together, weighting each.  I found that 0.65 gave me the best blend with my deep neural network.

What Worked Well for Top Teams

The top Kaggle teams made use of more sophisticated ensemble techniques than I did.  This will be my primary learning area for the next competition.  You can read about some of the top models here:

The write-ups linked above are very useful; I've already started examining their approaches.

Some of the top techniques discussed were:

  • Feature Engineering
  • Input Transformation - good write up here (a quick sketch of these transforms follows this list)
    • log transforms
    • sqrt(x + 3/8) - I am not sure what this one is called (it looks like an Anscombe-style variance-stabilizing transform for counts), but I saw it used a few times
    • z-score transforms
    • ranged transformation
  • Hyperparameter Optimization
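A quick NumPy sketch of what those input transformations look like on a single column of counts (the values are illustrative):

import numpy as np

x = np.array([0, 1, 3, 12, 120], dtype=float)    # a raw count column

log_x    = np.log1p(x)                           # log transform: log(1 + x) handles zeros
sqrt_x   = np.sqrt(x + 3.0 / 8.0)                # the sqrt(x + 3/8) transform mentioned above
zscore_x = (x - x.mean()) / x.std()              # z-score transform
ranged_x = (x - x.min()) / (x.max() - x.min())   # ranged (0..1) transformation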

I will probably not enter another Kaggle competition until the fall of this year.  This blog post will be updated with my notes as I investigate other techniques used for this competition.

End of 2nd Semester in Computer Science PhD Program

I am done with the second semester of my PhD program and am now halfway through the required coursework.  I am not taking a class this summer; I will be busy with AIFH Volume 3.

The two classes that I took this semester were:

  • CISD 792: Computer Graphics, taught by Dr. Laszlo
  • ISEC 730: Computer Security and Cryptography, taught by Dr. Cannady

Both classes were really good.  The computer graphics class introduced Three.JS, which we used for most of our assignments.  I like Three.JS and will be using it for some of the Javascript examples in my books.  My class project was a 3D flocking algorithm that I've already included in my book examples; you can see it here.

The security class focused mainly on reproducing the research of a security paper and then writing a dissertation proposal in a similar topic area.  I used Encog to reproduce a paper's research that produced a neural-network-based intrusion detection system.  You can see that report and my source code here.  I am not posting my dissertation idea paper, in case it becomes the direction I choose for my own dissertation.

It was a good semester, and it took quite a bit of work.  I was able to get an A in both classes.  I have two more semesters of classes, then I will focus on research and preparing for the dissertation.

Quick R Tutorial: The Big-O Chart

I needed an original Big-O chart for a publication.  This tutorial does not cover what Big-O actually is, just how to chart it.  If you want more information on Big-O, I recommend reading this and this (I know, go figure, for a computer science student). I really like using R for all things chart and visualization, so I turned to R for this task.  While this is a very simple chart, it demonstrates several very common tasks in R:

  • Overlaying several plots
  • Adding a legend to a plot
  • Using "math symbols" inside of the text on a chart
  • Stripping off all the extra whitespace that R likes to generate

You can see the end-result here:

[Figure: the resulting Big-O chart]


This was produced with the following R code:

# Remove white space on chart
par(mar=c(4,4,1,1)+0.1)

# The x & y limits for the plot
xl <- c(0,100)
yl <- c(0,1000)

# Just use R's standard list of colors for the lines (I am too tired to be creative this morning)
pal <- palette()

# Plot each of the equations that we are interested in. Note the add=TRUE
# it causes them to overlay each other.
plot(function(n) (rep(1,length(n))),xlim=xl,ylim=yl,col=pal[1],xlab="Elements(N)",ylab="Operations")
plot(function(n) (log(n)),xlim=xl,ylim=yl,col=pal[2],add=TRUE)
plot(function(n) (n),xlim=xl,ylim=yl,col=pal[3],add=TRUE)
plot(function(n) (n*log(n)),xlim=xl,ylim=yl,col=pal[4],add=TRUE)
plot(function(n) (n^2),xlim=xl,ylim=yl,col=pal[5],add=TRUE)
plot(function(n) (2^n),xlim=xl,ylim=yl,col=pal[6],add=TRUE)
plot(function(n) (factorial(n)),xlim=xl,ylim=yl,col=pal[7],add=TRUE)

# Generate the legend, note the use of expression to handle n to the power of 2.
legend('topright',
 legend=expression(O(1), O(log(n)), O(n), O(n*log(n)),
 O(n^2), O(2^n), O("n!")),
 lty=1, col=pal, bty='n', cex=.75)


Using Encog to Replicate Research

For one of my PhD courses at Nova Southeastern University (NSU), it was necessary to reproduce the research of the following paper:

I. Ahmad, A. Abdullah, and A. Alghamdi, “Application of artificial neural network in detection of probing attacks,” in IEEE Symposium on Industrial Electronics Applications, 2009. ISIEA 2009., vol. 2, Oct 2009, pp. 557–562.

This paper demonstrated how to use a neural network to build a basic intrusion detection system (IDS) for the KDD99 dataset.  Reproducing research is important in an academic setting: it means that you were able to obtain the same results as the original researchers, using the same techniques.  I do this often when I write books or implement parts of Encog; it allows me to convince myself that I have implemented an algorithm correctly, and as the researchers intended.  I don't always agree with what the original researchers did.  If I change something when I implement it in Encog, I am now in the area of "original research," and my changes must be labeled as such.

Some researchers are more helpful than others for replication of research.  Additionally, neural networks are stochastic (they use random numbers), and basing recommendations on a small number of runs is usually a bad idea when dealing with a stochastic system.  A small number of runs led the above researchers to conclude that two hidden layers were optimal for their dataset.  Unless you are dealing with deep learning, this is almost never the case.  The universal approximation theorem shows that a single hidden layer is sufficient for the old-school sort of perceptron neural network used in this paper.  Additionally, the vanishing gradient problem prevents the RPROP training that the researchers used from fitting well with larger numbers of hidden layers.  The researchers tried up to 4 hidden layers.

For my own research replication I used the same dataset, with many training runs, to make sure that their results fell within my high-low range.  To show that a single hidden layer does better, I used ANOVA and Tukey's HSD to demonstrate that the differences among the neural network architectures were statistically significant, and my box-and-whisker plot shows that training runs with a single hidden layer converged more consistently to a better mean RMSE.
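My actual analysis was done in R, but the same comparison can be sketched in Python with SciPy and statsmodels.  The architecture names and RMSE values below are placeholders, not my experimental results.

import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Final RMSE from repeated training runs, grouped by architecture (placeholder values)
rmse_by_arch = {
    '1-hidden-layer': [0.110, 0.112, 0.109, 0.111],
    '2-hidden-layer': [0.118, 0.125, 0.116, 0.121],
    '3-hidden-layer': [0.130, 0.127, 0.135, 0.131],
}

# One-way ANOVA: do the architectures differ in mean RMSE?
print(f_oneway(*rmse_by_arch.values()))

# Tukey's HSD: which pairs of architectures differ significantly?
values = np.concatenate([np.array(v) for v in rmse_by_arch.values()])
groups = np.concatenate([[k] * len(v) for k, v in rmse_by_arch.items()])
print(pairwise_tukeyhsd(values, groups))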

I am attaching both my paper and code in case they are useful.  This makes a decent tutorial on using the latest Encog code to normalize and fit a data set.

The class also required us to write up the results in IEEE conference format.  I am a fan of LaTex, so that is what I used.

  • Source code includes: [Source Code Link]
    • Python data prep script
    • R code used to produce graphics and stat analysis
    • Java code to run the training
  • My report for download (PDF): [Paper Link]
  • My report on ResearchGate: [Link]

The code is under LGPL, so feel free to reuse.

How I Got Into Data Science from IT Programming

Drew Conway describes a data scientist as the combination of domain expertise, statistics and hacker skills.  If you are an IT programmer, you likely already have the last requirement.  If you are a good IT programmer, you probably already understand something about the business data, and so meet the domain expertise requirement.  In this post I describe how I gained knowledge of statistics and machine learning through a path of open source involvement and publication.

There are quite a few articles that discuss how to become a data scientist, and some of them are even quite good!  Most speak in very general terms.  I wrote such a summary a while back that provides a very general description of what a data scientist is. In this post, I will describe my own path to becoming a data scientist.  I started out as a Java programmer in a typical IT job.

Publications

My publications were some of my earliest credentials.  I started publishing before I had my bachelor's degree.  My publications and side programming jobs were the major factors that helped me obtain my first "real" programming job, working for a Fortune 500 manufacturing company, back in 1995.  I did not have my first degree at that point; in 1995 I was working on a bachelor's degree part-time.

Back in the day, I wrote for publications such as C/C++ Users Journal, Java Developers Journal, and Windows/DOS Developer's Journal.  These were all paper-based magazines, often on the racks at book stores.  The world has really changed since then!  These days I publish code on sites like GitHub and CodeProject.  A great way to gain experience is to find interesting projects to work on, using open source tools, and then post your projects to GitHub, CodeProject and other such sites.

I've always enjoyed programming and have applied it to many individual projects.  Back in the 80's I was writing BBS software so that I could run a board on a C64, since my high school jobs did not pay enough to buy a RAM expander.  In the 90's I was hooking up web cams and writing CGI, and later ASP/JSP, code to build websites.  I wrote web servers and spiders from the socket up in C++.  Around that time I wrote my first neural network.  Always publish!  A hard drive full of cool project code sitting in your desk does not tell the world what you've done.  Support open source; a nice set of independent projects on GitHub looks really good.

Starting with AI

Artificial intelligence is closely related to data science.  In many ways data science is the application of certain AI techniques to potentially large amounts of data.  AI is also closely linked with statistics, an integral part of data science.  I started with AI because it was fun.  I never envisioned using it in my "day job".  As soon as I got my first neural network done I wrote an article for Java Users Journal.  I quickly discovered that AI had a coolness factor that could help me convince editors to publish my software.  I also published my first book on AI.

Writing code for a book is very different from writing code for a corporate or open source project.

  • Book code: Readability and understandability are paramount, second to nothing else.
  • Corporate/Open Source Code: Readability and understandability are still critical.  However, real-world necessity often forces scalability and performance to take the front seat.

For example, if my book's main goal is to show how to use JSP to build a simple blog, do I really care if the blog can scale to the traffic seen by a top-100 website?  Likewise, if my goal is to show how a backpropagation neural network trains, do I really want to muddy the water with concurrency?

The neural network code in my books is meant to be example code.  A clear starting point for something.  But this code is not meant to be "industrial strength".  However, when people start asking you questions that indicate that they are using your example code for "real projects", it is now time to start (or join) an open source project!  This is why I started the Encog project.  This might be a path to an open source project for you!

Deepening my Understanding of AI

I've often heard that neural networks are the gateway drug to greater artificial intelligence.  Neural networks are an interesting creature.  They have risen and fallen from grace several times.  Currently they are back, and with a vengeance.   Most implementations of deep learning are based on neural networks.  If you would like to learn more about deep learning, I am currently running a Kickstarter campaign on that very topic.  [More info here]

I took several really good classes from Udacity, just as they were introduced.  These classes have since been somewhat re-branded; however, Udacity still has several great AI and machine learning courses.  I also recommend (and have taken) the Johns Hopkins Coursera Data Science specialization.  It's not perfect, but it will expose you to many concepts in AI.  You can read my summary of it here.

Also, learn statistics, at least the basics of classical statistics.  You should understand concepts like mean, mode, median, linear regression, ANOVA, MANOVA, Tukey's HSD, p-values, etc.  A simple undergraduate course in statistics will give you the foundation.  You can build on more complex topics, such as Bayesian networks and belief networks, later.  Udacity has a nice intro to statistics course.

Kickstarter Projects

Public projects are always a good thing.  My projects have brought me speaking opportunities and book opportunities (though I mostly self publish now).  Kickstarter has been great for this.  I launched my Artificial Intelligence for Humans series of books through Kickstarter.  I currently have volume three running as a Kickstarter.   [More info here]

From AI to Data Science

When data science first started to enter the public scene, I was working as a Java programmer and writing AI books as a hobby.  A job opportunity in data science later opened up at my current company.  I did not even realize that the opportunity was available; I really was not looking.  However, during the recruiting process they discovered that someone with knowledge of the needed areas lived right in town: they had found my project pages.  This led to some good opportunities right at my current company.

The point is, get your projects out there!  If you don't have an idea for a project, then enter Kaggle.  You probably won't win, and trying to become a Kaggle master will be hard, but you will learn quite a bit trying.  Write about your efforts.  Post code to GitHub.  If you use open source tools, write to their creators and send links to your efforts; open source creators love to post links to people who are actually using their code.  For bigger projects (with many or institutional creators), post to their communities.  Kaggle gives you a problem to solve.  You don't have to win; it will give you something to talk about during an interview.

Deepening my Knowledge as a Data Scientist

I try to always be learning.  You will always hear terminology that you feel you should know, but do not; this happens to me every day.  Keep a list of what you don't know, and keep prioritizing and tackling that list (dare I say backlog grooming).  Keep learning!  Get involved in projects like Kaggle and read the discussion boards.  This will show you what you do not know really quickly. Write tutorials on your efforts.  If something was hard for you, it was hard for others, who will appreciate a tutorial.

I've seen a number of articles that ask, "Do you need a PhD to work as a data scientist?"  The answer is that it helps, but it is not necessary.  I know numerous data scientists with varying levels of academic credentials.  A PhD demonstrates that someone can follow the rigors of formal academic research and extend human knowledge.  When I became a data scientist I was not a PhD student.

At this point, I am a PhD student in computer science; you can read more about that here.  I want to learn the process of academic research because I am starting to look at algorithms and techniques that would qualify as original research.  Additionally, I've given advice to several other PhD students who were using my open source projects in their dissertations.  It was time for me to take the leap.

Conclusions

Data science is described by Drew Conway as the intersection of hacker skills, statistics and domain knowledge.  As an "IT programmer" you most likely already have two of these skills.  Hacker skills are the ability to write programs that can wrangle data into many different formats and automate processes.  Domain knowledge is knowing something about the business that you are programming for.  Is your business data just a bunch of columns to you?  An effective IT programmer learns about the business and its data.  So does an effective data scientist.

That leaves only statistics (and machine learning/AI).  You can learn that from books, MOOCs, and other sources; some were mentioned earlier in this article.  I have a list of some of my favorites here.  I also have a few books to teach you about AI.

Most importantly, tinker and learn.  Build and publish projects, blog, and contribute to open source.  When you talk to someone interested in hiring you as a data scientist, you will have experience to talk about.  Also have a GitHub profile, linked from LinkedIn, that shows you do in fact have something to talk about.

PhD Update: Second Cluster Visit for Winter 2015 Semester at NSU

I just got back to St. Louis from my second cluster visit to NSU for the PhD in computer science.  I was feeling pretty good about choosing a distance-learning program at a university in Ft. Lauderdale, Florida, given the really cold weather we've been seeing in St. Louis lately.  There was a fresh blanket of snow on my driveway the morning I left for the airport; I just drove over it and headed out, and the snow had melted by the time I returned.  The regular (non-distance-learning) students were all on spring break, so the campus was unusually empty.  I never did take a spring break trip as an undergrad, so it works out that I now have to go to Florida four times a year for my doctoral program.

I am taking two classes: CISD 792: Computer Graphics, taught by Dr. Laszlo, and ISEC 730: Computer Security and Cryptography, taught by Dr. Cannady. The computer graphics class focuses on Three.JS and OpenGL and consists of numerous programming assignments.  I am learning about 3D programming, which I think will be very useful for some visualizations that I might want to do for my books.  The security class is more focused on writing.  I did a decent amount of programming to replicate the research of a paper that applied neural networks to intrusion detection; I will post my Java code for this later, as it is a decent Encog tutorial.  One of several reasons that I entered a PhD program was to learn academic writing, so the security class is working out well.  I am finding both classes very beneficial and interesting.

This is my fourth trip to Ft. Lauderdale for the program.  My wife has come along with me each time so far, and we usually try to do at least one "tourist activity" each trip.  This time we went to see a spring training baseball game featuring our home team, the St. Louis Cardinals, against the Miami Marlins.  (Note to everyone, especially lawyers: both of those names are trademarks of MLB, and MLB is also a trademark of MLB.)  The St. Louis team won, which made for a particularly enjoyable game!

I also kept tabs on my Kickstarter campaign while traveling.  So far the deep learning and neural networks campaign is going well!  If you would like to back it, and obtain my latest book, click here.

I took this picture at the student center at NSU.  They have a really cool Dr. Who police box.  You can also see the palm trees out the window.  They have a beautiful campus!  And now they have a police box!

[Photo: the Dr. Who police box at the NSU student center]

Kickstarter started for Artificial Intelligence for Humans Volume 3: Deep Learning and Neural Networks

I just started my third Kickstarter project, for the third volume in my Artificial Intelligence for Humans series (AIFH).  This book will cover the same sort of neural network material as some of my earlier books, but it will also include newer topics in neural networks such as deep learning, convolutional neural networks, NEAT, and HyperNEAT.  The book will be published by December 31, 2015. The planned table of contents is given here:

  • Chapter 1: Neural Network Basics
  • Chapter 2: Classic Neural Networks
  • Chapter 3: Feedforward Neural Networks
  • Chapter 4: Propagation Training
  • Chapter 5: Recurrent Neural Networks
  • Chapter 6: Radial Basis Function Networks
  • Chapter 7: NEAT and HyperNEAT Neural Networks
  • Chapter 8: Pruning and Model Selection
  • Chapter 9: Dropout and Regularization
  • Chapter 10: Architecting Neural Networks
  • Chapter 11: Deep Belief Neural Networks
  • Chapter 12: Deep Convolutional Network
  • Chapter 13: Modeling with Neural Networks

If you think this sort of book might interest you, I would really appreciate your support in the Kickstarter project!