Simple Real Life Example of Overfitting

Overfitting is a common problem for Artificial Intelligence and Data Science practitioners.  It gives practitioners a false sense of security in their models.  In this blog post I want to illustrate overfitting with a real-world example.  I will also introduce the concepts of cross-validation, hyperparameters, and parameters.

Suppose you are studying for an industry certification exam.  Most of these exams have practice tests.  Most likely you would study the relevant material and then take the practice exam.  But what if you score fairly low on the practice exam?  You might then use the practice exam itself to guide further study.  You are now “overfitting” to the practice exam.

Is the practice exam still a good indication of what score you will get on the real exam?  Probably not.  If you took the same practice exam again, you would most likely score nearly 100%!  Studying it very likely improved your eventual certification score, but by how much?  It would be naïve to expect a 100% score on the real certification exam.

What can be done about overfitting?  There are many answers; for this post I will focus on cross-validation.  What went wrong in the first place?  Why did you score so low on the practice exam?  Most likely, your “study plan” for the certification test was ineffective.  In machine learning, your study plan corresponds to your hyperparameters.  The hyperparameters specify your model and any high-level settings of that model.

If you are using a neural network, your hyperparameters describe the structure of the network (hidden layers, activation function, neuron counts, learning rate, etc.).  If you are using a support vector machine, your hyperparameters are the kernel type, gamma, and the regularization constant C.  If you are using a random forest, your main hyperparameter is the number of trees.
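As a concrete illustration, here is roughly how such hyperparameters might be specified in Python with scikit-learn.  This is only a sketch; scikit-learn is my choice for illustration (the post names no particular library), and the specific values are arbitrary.

    # A minimal sketch of specifying hyperparameters in scikit-learn.
    # The particular values used here are arbitrary illustrations.
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier

    # Neural network: hidden layer structure, activation function, learning rate
    nn = MLPClassifier(hidden_layer_sizes=(10, 5), activation="relu",
                       learning_rate_init=0.01)

    # Support vector machine: kernel type, gamma, and the constant C
    svm = SVC(kernel="rbf", gamma=0.1, C=1.0)

    # Random forest: the number of trees
    rf = RandomForestClassifier(n_estimators=100)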

This is what it looks like if we apply cross-validation to your certification exam “study plan.”  We first randomize the order of the practice questions; we do not want any bias from their original ordering.  We then break the questions into five groups.  This is called k-fold cross-validation, here with k=5.  For each potential study plan we study using 4/5 of the questions and evaluate with the remaining 1/5, rotating through all five groups.  Our anticipated score on the actual certification test is the average of the five evaluation runs (each using its held-out 1/5).
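A minimal sketch of this procedure in Python with scikit-learn follows.  The library, the toy data set, and the placeholder model are my own additions for illustration.

    # A minimal sketch of 5-fold cross-validation. The toy data set simply
    # stands in for the pool of practice questions and their answers.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import KFold
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=100, random_state=42)

    def cross_validate(model, X, y, k=5):
        # Shuffle first, so that fold membership carries no ordering bias.
        folds = KFold(n_splits=k, shuffle=True, random_state=42)
        scores = []
        for train_idx, test_idx in folds.split(X):
            model.fit(X[train_idx], y[train_idx])                 # study with 4/5
            scores.append(model.score(X[test_idx], y[test_idx]))  # evaluate on 1/5
        return np.mean(scores)  # the anticipated "real exam" score

    print(cross_validate(SVC(kernel="rbf", gamma=0.1, C=1.0), X, y))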

That average evaluates just one study plan (one set of hyperparameters).  Now we try a different study plan.  We clear our memory and repeat.  Of course, we humans cannot actually clear our memories, so this works better for machines.  Eventually we arrive at the best-performing set of hyperparameters.  We then execute this best “study plan” on all five subsets together.  We can now expect a real score similar to the cross-validated average that caused us to pick this “study plan.”  The “learning” that we ultimately achieve with this plan is the parameters.  For a model, the parameters are the weights or coefficients that fitting (training) produces.  This is the difference between parameters and hyperparameters.
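Putting the two ideas together, a hyperparameter search might look like the sketch below.  GridSearchCV is simply scikit-learn’s built-in way of cross-validating each candidate “study plan” and refitting the winner on all of the data; the candidate grid and toy data are arbitrary.

    # A minimal sketch of choosing among "study plans" (hyperparameter sets).
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=100, random_state=42)

    # Each combination of gamma and C is one candidate study plan.
    param_grid = {"gamma": [0.01, 0.1, 1.0], "C": [0.1, 1.0, 10.0]}

    # Score every candidate with 5-fold cross-validation, then refit the
    # winner on the full data set (all five subsets together).
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X, y)

    print(search.best_params_)  # the winning hyperparameters ("study plan")
    print(search.best_score_)   # the cross-validated average that chose it
    # search.best_estimator_ now holds the fitted parameters (the "learning")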

Impressed by the Johns Hopkins Data Science Certification at Coursera

I decided to try the Johns Hopkins Data Science Specialization at Coursera.  My background is more in Artificial Intelligence than in Data Science.  The two fields are closely related.  However, because I now work full-time as a data scientist, I am looking to broaden my knowledge of data science.  New ideas in data science will also be helpful, as I am starting a PhD in Computer Science in the fall.

The specialization includes nine four-week courses and a capstone project.  The capstone project will apparently be somewhat competitive, as they will profile the top ten students on the Simply Statistics blog. Perhaps sort of a mini-Kaggle? Since I am somewhat masochistic, I decided to take the first three classes simultaneously.  I am currently enrolled in the following:

  • The Data Scientist’s Toolbox
  • R Programming
  • Getting and Cleaning Data

These courses are a decent amount of work.  I am not sure I will take the next three in parallel.  I will write more about my impressions of each of the above three courses when I complete them in a few weeks.

The specialization uses the R programming language.  My experience in AI has been with C++, Java, and .Net, writing low-level models for the Encog Project.  For data science projects I primarily use Python.  Though I have done some work in R, overall R is somewhat new to me.  R is very different from C++ or Python.  R is not a general-purpose programming language; it is a DSL for statistical computing.  I am very pleased that this specialization will deepen my understanding of the R programming language.  Python and R seem to be the two leading languages for data science.

So far I am impressed with the assignments.  We are using data from CSV files, web pages, and even web service APIs.  The assignments are challenging, especially if you have not done a great deal of R programming.  I do not recommend taking “R Programming” simultaneously with “Getting and Cleaning Data,” as “R Programming” is a solid prerequisite to “Getting and Cleaning Data.”  I did take these two together, but it required completing the assignments for “R Programming” in the first week!  This is okay if you have time and previous programming experience.

I will write more about each class once I complete them on May 5th.

When is a Model’s Training Error Calculated?

Neural networks are typically trained by adjusting their weights to lower an error function.  The same process also applies to other models, such as Support Vector Machines.  There are a wide variety of ways to train a neural network.  Most of these trainers report the overall error of the neural network over the training data.  The error is typically the difference between the actual output of the model and the ideal expected output.  These ideal output values are always included in a supervised training set.  The overall training error is the average error, across all training elements, for a single iteration of training.
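As a rough sketch, assuming mean squared error (one common choice of error function, though not the only one), the overall training error for a single iteration might be computed like this:

    # A minimal sketch: overall training error as the mean squared difference
    # between the model's actual outputs and the ideal outputs, averaged
    # across every element of the training set.
    import numpy as np

    def training_error(actual_outputs, ideal_outputs):
        actual = np.asarray(actual_outputs, dtype=float)
        ideal = np.asarray(ideal_outputs, dtype=float)
        return np.mean((actual - ideal) ** 2)

    # For example: training_error([0.9, 0.2, 0.8], [1.0, 0.0, 1.0]) -> 0.03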


Mapping Data to a Machine Learning Model’s Inputs and Outputs

How many inputs does my model have?  How many outputs does my model have?  These two related questions often lead to great confusion when setting up a model such as a neural network or a support vector machine.  These models work by accepting a fixed number of inputs and returning a fixed number of outputs based on those inputs.  If you need to review how models work, this article may help.

Ideally your model would have the same number of inputs and outputs as your collected data.  However, this is rarely the case, because data are rarely presented to the model in exactly the form in which you originally received them.
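For example, a raw record with one numeric field and one categorical field does not map directly onto a model’s inputs; the categorical field usually must be expanded (one-hot encoded) into several inputs.  A minimal sketch, with hypothetical field names:

    # A minimal sketch of mapping a raw record to a fixed-length input vector.
    # The "age" and "color" fields are hypothetical examples.
    import numpy as np

    COLORS = ["red", "green", "blue"]

    def to_model_inputs(age, color):
        # One numeric input plus three one-hot inputs = 4 model inputs total.
        one_hot = [1.0 if color == c else 0.0 for c in COLORS]
        return np.array([float(age)] + one_hot)

    # to_model_inputs(42, "green") -> array([42., 0., 1., 0.])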


So You Want to be a Data Scientist?

Harvard Business Review calls it the sexiest job of the 21st century.  But what skills are needed to become a data scientist, and how can you get these skills?  I began as an advanced computer programmer with business knowledge.  Open-source involvement in Artificial Intelligence gave me the foundation to move into a data science role.

There are really three critical skills that a data scientist must possess.  A data scientist must be a statistician, a domain expert, and a hacker – not necessarily in that order.  There are different types of data scientists, and each type will be stronger in one of these three skills.  Let’s take a look at each of these three skills and see how you might build up your knowledge.


Press Release for Artificial Intelligence for Humans Volume 2: Nature Inspired Algorithms

FOR IMMEDIATE RELEASE

February 20, 2014

Jeff Heaton, jheaton at heatonresearch.com

Nature-inspired algorithm book teaches programmers basics of Artificial Intelligence

Second volume of the popular Artificial Intelligence for Humans series launches on Kickstarter

St. Louis, MO – The birds and the bees serve as the muse behind the mathematical formulas featured in the latest book from data scientist Jeff Heaton that releases today on Kickstarter.  This new volume of the Artificial Intelligence for Humans series introduces algorithms inspired by elements of nature to teach programmers the fundamentals of AI.

“Artificial Intelligence for Humans is a series of books that presents the topic in a mathematically gentle manner,” said Heaton.  “Computer programmers are not necessarily wizards of all the Calculus, Linear Algebra and Statistical concepts that are required to work with AI.  This series will help programmers apply the ideas of AI to data analysis by fully explaining all the relevant math techniques and providing real-life examples.”

As an important component of the fields of Data Science and Big Data, Artificial Intelligence allows businesses to capitalize on vast amounts of collected data so they can tailor their products to customer needs.  Personalizing products for customers through data mining offers businesses the ability to enhance their services and profitability.

Heaton’s latest volume on AI explores how genomes, cells, ants, birds, and evolution as well as other natural processes influence programming and provides useful applications for the IT professional interested in delving into this dynamic field of computer science.

Programming examples are provided in Java, C# and Python.  Additional languages may be added as stretch goals during the Kickstarter campaign.

Heaton will seek Kickstarter pledges to support this book, prior to its August 2014 publication date, at levels between $5 and $250.

The first volume, Fundamental Algorithms (ISBN: 978-1493682225), attracted 818 backers and achieved 755% of its Kickstarter funding goal on July 10, 2013.  It was delivered on time to project supporters in December 2013.

About Jeff Heaton: Data Scientist, computer programmer and indie publisher specializing in Artificial Intelligence, Jeff is an active technology blogger, open source contributor, and author of more than ten books. Having worked fifteen years in the life insurance industry, Jeff is a Fellow of the Life Management Institute and a senior member of the IEEE.

For more information about the Artificial Intelligence for Humans series, please contact Jeff Heaton (jheaton@heatonresearch.com) or visit the following sites:

Artificial Intelligence for Humans, Volume 2: Nature Inspired Algorithms


-###-

Overfitting happens to humans too

Does overfitting happen to human beings?  Or is this a phenomenon associated only with machine learning?  Just a random thought that I had today.

Overfitting is a common problem that most Artificial Intelligence practitioners face.  Wikipedia defines it as occurring “when a statistical model describes random error or noise instead of the underlying relationship.”  Basically, the model becomes fixated on outliers.  This is often the result of overtraining (obsession) on a relatively small data set.

Does the human brain do this too?  Of course it does!  Stereotypes are essentially the “human brain” equivalent of overfitting.  Maybe your parents told you that everyone from the Republic of Elbonia acts in a particular way.  (A potential outlier data point!)  You then watch a movie that also portrays this peculiar trait of Elbonians.  (Another potential outlier!)  Maybe your parents saw that same movie!  Danger: we are now counting the same outlier multiple times, and treating dependent observations as if they were independent!

However, if we actually sampled 1,000 Elbonians and 1,000 members of the general population, we might well find that the occurrence of this trait shows no statistically significant difference between the two groups.  Such a person’s perception of Elbonians is overfit.
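To make that concrete, here is a sketch of such a comparison in Python; the trait counts are invented purely for illustration.

    # A minimal sketch: testing whether a trait occurs at a different rate in
    # two samples of 1,000 people each. The counts below are hypothetical.
    from scipy.stats import chi2_contingency

    elbonians_with_trait = 108  # hypothetical count out of 1,000
    general_with_trait = 97     # hypothetical count out of 1,000
    n = 1000

    # 2x2 contingency table: rows = group, columns = trait present / absent
    table = [
        [elbonians_with_trait, n - elbonians_with_trait],
        [general_with_trait, n - general_with_trait],
    ]

    chi2, p_value, dof, expected = chi2_contingency(table)
    print(f"chi-square = {chi2:.3f}, p-value = {p_value:.3f}")
    # A large p-value (e.g. > 0.05) means we cannot conclude the trait is
    # any more common among Elbonians than in the general population.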