I think the time is approaching for a major upgrade to Encog, spread over several versions. I am beginning a PhD in computer science in a few days. I would like to use Encog for some of my research, and there are some gaps I need to fill. When I first started Encog, I was only a few years into my own AI journey. Now, six years into Encog, I have a clearer idea of how I need to structure some things. I am very impressed by the caret package for R and the scikit-learn package for Python. Both of these packages allow you to experiment with a wide variety of models to find what best fits your data. I hope to bring ideas from both of them to Encog.
What I Like about Encog
Encog is fast and efficient. Encog makes use of multi-core technology quite well. I feel the low-level models (neural networks, SVMs, and the training algorithms) are quite solid. There is also a decent amount of unit-test coverage built into these core models. The foundation really is strong. You can always add MORE models, but my goal is not to add every model to Encog, only those that I care about or that the community convinces me to care about. Additionally, contributors sometimes provide me with working, Encog-compatible models, and these are added as well.
What Could Use Some Work in Encog
The weakest part of Encog is the infrastructure that gets data in and out of these models. Many successful projects that use Encog (some of my own included) simply write code to wrangle the data directly into the format that a model can accept. Models only accept fixed-length vectors of numbers, yet data occurs in a wide range of formats, such as time series and strings. Ultimately, the programmer does have to take most of the responsibility for wrangling their data. This is rarely completely automatic; if it were, there would be little need for data scientists.
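To make this concrete, here is a minimal sketch (in Python, with made-up field names) of the kind of glue code a programmer writes by hand today: one-hot encoding a categorical field and scaling a numeric field to produce the fixed-length vector a model expects.

```python
def encode_record(record, categories, num_min, num_max):
    """Turn a mixed-type record into the fixed-length numeric vector a
    model accepts: one-hot encode the categorical field and min-max
    normalize the numeric field."""
    one_hot = [1.0 if record["color"] == c else 0.0 for c in categories]
    scaled = (record["size"] - num_min) / (num_max - num_min)
    return one_hot + [scaled]

# "color" and "size" are hypothetical fields used only for illustration.
vec = encode_record({"color": "green", "size": 7.5},
                    ["red", "green", "blue"], 0.0, 10.0)
# vec == [0.0, 1.0, 0.0, 0.75]
```

Nearly every project ends up rewriting some variant of this, which is exactly the gap better infrastructure should fill.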
My first attempt to make data wrangling easier was "Encog Analyst". It worked to some degree, but not well. Encog Analyst is fairly cool in that you just point it at a CSV file, pick a model type (e.g., a particular type of neural network), and it generates an EGA file that tells Encog how to map your CSV file to the model's inputs. You click run, your data is split into training and test sets, and a model is built. Sounds good, right? I thought so initially. However, it is very limiting, and it does nothing for model selection.
Model selection is the process of adjusting your hyper-parameters to find the best fit for your model. Hyper-parameters include the number of hidden layers, the type of kernel your SVM uses, and so on. It's hard to pick these, yet they can make all the difference in the accuracy of your models. There are many ways to select them, and often it comes down to automated trial and error. Encog Analyst does nothing to automate this trial and error. You have to continually modify your EGA file, and the file is not easy to work with. If you want to switch from an SVM to a neural network, you must change large blocks of the file. If you want to try various combinations of hidden layers and activation functions, you must do that manually as well. If switching models or activation functions means you now need to normalize differently, you must change that too. Model selection is just not easy with Encog Analyst.
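The automated trial and error described above amounts to a grid search over hyper-parameter combinations. Here is an illustrative Python sketch, not Encog code; the toy scoring function stands in for actually training and evaluating a model.

```python
from itertools import product

def grid_search(configs, evaluate):
    """Try every combination of hyper-parameter values and keep the one
    that scores best -- the trial and error currently done by hand."""
    best_score, best_params = float("-inf"), None
    keys = list(configs)
    for values in product(*configs.values()):
        params = dict(zip(keys, values))
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Toy stand-in for "train a model and measure its accuracy": it simply
# prefers two hidden layers and a tanh activation.
def toy_eval(p):
    return -abs(p["hidden_layers"] - 2) + (1 if p["activation"] == "tanh" else 0)

best, score = grid_search(
    {"hidden_layers": [1, 2, 3], "activation": ["sigmoid", "tanh"]},
    toy_eval,
)
# best == {"hidden_layers": 2, "activation": "tanh"}
```

A framework that owns this loop can also re-normalize the data for each candidate model automatically, which is the part Encog Analyst currently forces onto the user.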
I want to change Encog Analyst so that you define the input data in terms of what each column is: a string (categorical), a number (continuous), a time (time series), and so on. Then you give it a list of models to attempt. Encog will know how the data must be normalized for each model type and will do this for you. You can override these mappings of how each model wants its data, but usually you will not need to.
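A hypothetical sketch of what this could look like (the names and the per-model ranges are illustrative, not the actual Encog API): once the framework knows a column's declared type and the target model's type, it can pick the normalization itself.

```python
# Illustrative only: which range each (hypothetical) model type wants
# its continuous columns scaled into.
NORMALIZERS = {
    "svm": lambda x, lo, hi: (x - lo) / (hi - lo),                  # [0, 1]
    "feedforward": lambda x, lo, hi: 2 * (x - lo) / (hi - lo) - 1,  # [-1, 1]
}

def normalize_column(values, column_type, model_type):
    """Normalize one declared column for one model type: categorical
    columns become one-hot vectors, continuous columns are rescaled."""
    if column_type == "categorical":
        cats = sorted(set(values))
        return [[1.0 if v == c else 0.0 for c in cats] for v in values]
    lo, hi = min(values), max(values)
    f = NORMALIZERS[model_type]
    return [[f(v, lo, hi)] for v in values]

normalize_column([0.0, 5.0, 10.0], "continuous", "feedforward")
# -> [[-1.0], [0.0], [1.0]]
```

The point of the design is that switching the model list from an SVM to a feedforward network changes the normalization automatically, with no hand-editing of an EGA file.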
This will mainly be accessed through APIs. I will also give Encog Workbench the ability to interact with the new Encog Analyst.
What Will Be Added?
For my own research, I will specifically be adding the following, not necessarily in this order, over several versions:
- Support for ARFF files.
- Support for PMML files.
- Faster linear algebra using BLAS packages, and support for GPU-based versions of BLAS.
- New models: CART trees, random forests, and gradient boosting machines.
- Better support for cross-validation techniques.
- The new Encog Analyst that will be designed to facilitate model selection.
- Better random number support.
- More error-calculation methods, and better support for them.
- Addition of the code from volumes 1 & 2 of my Artificial Intelligence for Humans series.
- Update of Encog documentation to support these changes.
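As an illustration of the cross-validation item above: the core of k-fold cross-validation is just an index split, with each fold held out once for validation while the rest trains the model. This is a generic Python sketch, unrelated to whatever form Encog's implementation will eventually take.

```python
def k_fold_indices(n, k):
    """Split n sample indices into k roughly equal folds; each fold
    serves once as the validation set while the others train."""
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for i, val in enumerate(folds):
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        splits.append((sorted(train), sorted(val)))
    return splits

splits = k_fold_indices(6, 3)
# 3 splits; the first holds out indices [0, 3] and trains on [1, 2, 4, 5]
```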
I will release more information on how I plan to stage these changes. I really intend for this to happen as a series of small version releases, each fairly short. I also plan to minimize breaking changes to existing code.
More information will follow soon.