Overfitting is a very common problem for Artificial Intelligence and Data Science practitioners. It gives them a false sense of security in their models. In this blog post I want to illustrate overfitting with a real-world example, and I will also introduce the concepts of cross-validation, hyperparameters, and parameters.
Suppose you are studying for an industry certification exam. Most of these exams have practice tests. Most likely you would study the relevant material and then take the practice exam. But what if you score fairly low on it? You might then use the practice exam itself to guide further study. At that point you are “overfitting” to the practice exam.
Is the practice exam still a good indication of what score you will get on the real exam? Most likely, if you took the practice exam a second time, you would get nearly 100%! You very likely improved your eventual certification score, but by how much? It would be naïve to expect a 100% score on the real certification exam.
What can be done about overfitting? There are many answers; for this post I will focus on cross-validation. First, what went wrong? Why did you score so low on the practice exam? Most likely your “study plan” for the certification test was ineffective. In machine learning, your study plan corresponds to your hyperparameters. The hyperparameters specify your model and any high-level settings of that model.
If you are using a neural network, your hyperparameters are the structure of the network (hidden layers, activation function, neuron counts, learning rate, etc.). If you are using a support vector machine, your hyperparameters are the kernel type, gamma, and regularization constant. If you are using a random forest, your main hyperparameter is the number of trees.
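To make this concrete, here is a minimal sketch (using scikit-learn, with illustrative, untuned values) of how hyperparameters like these are specified for each of the model types above:

```python
# Illustrative hyperparameter choices only; these values are not tuned.
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Neural network: hidden layer sizes, activation function, and learning rate are hyperparameters.
nn = MLPClassifier(hidden_layer_sizes=(50, 25), activation="relu", learning_rate_init=0.001)

# Support vector machine: kernel type, gamma, and the regularization constant C.
svm = SVC(kernel="rbf", gamma=0.1, C=1.0)

# Random forest: the number of trees (n_estimators).
rf = RandomForestClassifier(n_estimators=100)
```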
Here is what it looks like if we apply cross-validation to your certification exam “study plan”. We first randomize the order of the practice-test questions; we do not want any stratification bias. We then break the questions into five groups. This is called k-fold cross-validation (here, k = 5). For each potential study plan we study using 4/5 of the questions and evaluate with the remaining 1/5, rotating which fifth is held out. Our anticipated score on the actual certification exam is the average of all five trial runs (each scored on its held-out 1/5).
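A minimal sketch of this five-fold procedure with scikit-learn; the synthetic data below is just a stand-in for the “practice exam” questions:

```python
# 5-fold cross-validation sketch; synthetic data stands in for the practice exam.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=42)   # stand-in "practice questions"
model = RandomForestClassifier(n_estimators=100, random_state=42)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)     # shuffle = randomize question order
scores = cross_val_score(model, X, y, cv=kfold)              # train on 4/5, score on the held-out 1/5
print("Fold scores:", scores)
print("Anticipated exam score:", scores.mean())              # average of the five trial runs
```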
This average evaluates just one study plan (one set of hyperparameters). Now we try a different study plan: we clear our memory and repeat. Of course, we humans cannot really clear our memories, so this works better for machines. Eventually we find the best-performing set of hyperparameters. We then execute that best “study plan” with all five subsets, i.e. the full practice exam. We can now expect a real score similar to the average score that led us to pick this “study plan”. The “learning” that we ultimately achieve with this plan is the parameters: for a model, the parameters are the weights or coefficients that fitting (training) produces. That is the difference between parameters and hyperparameters.
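Here is a minimal sketch of that whole loop using scikit-learn's GridSearchCV (again with synthetic stand-in data): each candidate hyperparameter set is cross-validated, the winning set is refit on all of the data, and the fitted coefficients are the resulting parameters:

```python
# Picking the best "study plan" (hyperparameters), then reading off the learned parameters.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=42)   # stand-in "practice questions"

param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}                   # candidate hyperparameter values
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X, y)                                             # cross-validates each candidate,
                                                             # then refits the winner on all the data

print("Best hyperparameters:", search.best_params_)          # the chosen "study plan"
print("Expected real score:", search.best_score_)            # average held-out score for that plan
print("Learned parameters:", search.best_estimator_.coef_)   # the weights that fitting produced
```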