Creating a Python 3.x Data Science Working Environment

The purpose of this post is to demonstrate how to get an effective "data science" environment up and running with Python 3.  It also serves as a common set of setup instructions for my books, articles, and other material I provide.  Even if you are not reading one of my books or articles, you might find this information useful.

I feel that Python 3.x made some great strides toward source code clarity and binary efficiency.  Debating the relative pros and cons of Python 3.x vs. 2.x is not the purpose of this post; the purpose is to document how to install what I consider to be a decent "data scientist" working environment with Python 3.x.  The primary reason I wrote this is so that I do not forget how to do it!  However, the rest of the world might benefit as well.

First of all, if you do not need Python 3.x, then just install Anaconda and call it a day. Anaconda is a scientific distribution of Python 2.7 that will give you all that you need.

https://store.continuum.io/cshop/anaconda/

What do I mean by a "data scientist" environment?  In particular, I make use of the following packages:

  • Numpy - For numerical processing.
  • Scipy - For scientific processing not covered by Numpy.
  • Scikit-Learn - For machine learning.
  • Theano - For deep learning and numerical processing not covered by Numpy.
  • Matplotlib - For charting.
  • Pygame - For visualization.
  • Oracle - For database access.
  • (anything else needed by the above)

The first thing to realize about installing anything in Python is that you are dealing with two kinds of packages: "pure Python" and "binary."  Pure Python packages can be installed with the "pip" command.  Most serious models are written in C, C++, or Fortran (yes, I said Fortran; it is a serious data science language, even in 2014).  They would simply be too slow in pure Python.  All of the above packages are "binary" and must be installed in their own unique ways.

Another consideration is 64-bit or 32-bit.  This document assumes 64-bit!
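
If you are not sure which one you have, a quick check from the Python interpreter will tell you.  This is just a sanity-check snippet of mine, nothing specific to the packages below:

    import platform
    import struct

    # Reports the interpreter's build, e.g. ('64bit', 'WindowsPE')
    print(platform.architecture())

    # Pointer size in bits: 64 on a 64-bit build, 32 on a 32-bit build
    print(struct.calcsize("P") * 8)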

Installing Python 3.x on Windows, Mac, and Linux each presents its own unique challenges.  I will eventually describe all three; for now, this post is a work in progress.

Windows Installation

The first step is to install the latest version of Python, which can be found here. Make sure to download the latest 64-bit version of 3.x. Normally you must go through an intricate process to compile a binary Python package yourself.  Fortunately, the University of California at Irvine provides a repository of these packages with Windows installers.  This will save you a great deal of time!

Install these packages, in this order!  (A quick import check, shown after the list, verifies the installation.)

  • Python 3.x, get it here
  • pygame, get it at UCI
  • numpy, get it at UCI
  • scipy, get it at UCI
  • six, Pure Python, install with "pip install six"
  • dateutil, Pure Python, install with "pip install python-dateutil"
  • pytz, Pure Python, install with "pip install pytz"
  • pyparsing, Pure Python, install with "pip install pyparsing"
  • matplotlib, get it at UCI
  • theano, get it at UCI
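
Once the list above is installed, I suggest a quick import check.  This is just my own sanity-check snippet, not part of any installer; it confirms the binary packages load and prints their versions.  Theano may print a warning if no C compiler is configured, which is expected.

    # Confirm the binary packages import correctly and report their versions.
    import numpy
    import scipy
    import matplotlib
    import pygame
    import theano

    print("numpy", numpy.__version__)
    print("scipy", scipy.__version__)
    print("matplotlib", matplotlib.__version__)
    print("pygame", pygame.ver)
    print("theano", theano.__version__)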

Using Oracle in 64-bit has its own set of issues; however, once it's installed, it works fine.  I will add Oracle instructions soon.

Mac Installation

More to come

Linux Installation

More to come

Beginning a Computer Science PhD at Nova Southeastern University

I just returned from my first cluster meeting for the PhD program in computer science at Nova Southeastern University.  Because it was Labor Day weekend, my wife went with me, and we made it a mini Florida vacation.  We flew into Ft. Lauderdale on Wednesday, August 27, 2014.

  • Wednesday: Travel day, and check into hotel.
  • Thursday: Program orientation, library tour, and kickoff reception.
  • Friday: Two class sessions (4-hours each), with a 1.5 hr break for lunch.
  • Saturday: Two class sessions (4-hours each), with a 1.5 hr break for lunch.
  • Sunday: Vacation day with my wife.  We checked out the beach at Ft. Lauderdale and took a canal tour.
  • Monday (Labor day, USA holiday): Flew back to St. Louis.

I stayed at the closest hotel to the NSU campus, the Holiday Inn Airport.  The hotel is a good value; they offer a free breakfast and a shuttle to/from the airport.  It is nearly a 1.2-mile walk to the campus, and the roads are walkable and have crosswalks.  In October I might walk it and shower on campus before class, which would save the rental car expense.  I did walk to the university a few times this trip, though more for exercise.  After a 1.2-mile walk in the hot Florida summer sun, I would not be very popular!

Program Orientation

The program orientation was led by Dr. Seagull, the Associate Dean of Academic Affairs at GSCIS. He presented an overview of the program and the school.  NSU was founded in 1964 and is currently celebrating its 50th anniversary; there were many 50th anniversary banners and signs throughout the campus.  For my program I must complete the following:

  • 32 credit hours of course work.  This will amount to 8 individual four-credit hour courses.  I am currently enrolled in my first two. I will need to fly to Ft. Lauderdale twice a semester for these courses.
  • 8 credit hours of directed research.  It will be at least a year before I start this part.  However, my understanding is that I will work on a research problem with one of the professors.  This should help prepare me for my own dissertation.
  • 24 (or more) dissertation hours.

The faculty is composed of both full-time and adjunct members.  For my first two classes, both instructors are full-time, live in the Florida area, and have been with NSU for over a decade.

Artificial Intelligence Class

The artificial intelligence class (CISD 760) is taught by Sumitra Mukherjee and uses the textbook Artificial Intelligence: A Modern Approach.  For the first class sessions, the professor lectured on path finding, modeling, optimization, and Bayesian inference. We saw neural networks, genetic algorithms, decision trees, Bayesian belief networks, and other algorithms.

The professor also covered the assignments for the semester.

  • Assignment 1: Select a peer reviewed paper in your research area of interest.
  • Assignment 2: Complete programs for four AI problems: path finding, vector optimization, data science/predictive modeling (neural net vs. decision tree), and Bayesian inference.
  • Assignment 3: Critique the paper from assignment 1, and write an "idea paper" describing further research you might like to pursue.
  • In-class mid-term at the next cluster meeting.
  • Final examination assignment completed over the last several weeks of the semester.

Most PhD programs have qualifying exams; NSU's CS PhD program spreads this requirement over eight in-class mid-semester examinations.  For the AI class, this is accomplished with the mid-term assignment.

I've already started the programming assignment and am making use of Python with DEAP, Numpy, and scikit-learn.  I think this will be a great class.  The assignments give a good chance to try out some of the AI algorithms, and they also let us start thinking about dissertation topics in AI.  I plan to conduct my dissertation research in the field of AI.
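
As a rough illustration of the predictive modeling piece, here is the kind of decision tree vs. neural network comparison involved.  This is my own toy sketch against the built-in Iris data, not the actual assignment data, and it assumes a recent version of scikit-learn:

    # Compare a decision tree to a small neural network with cross-validation.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neural_network import MLPClassifier

    X, y = load_iris(return_X_y=True)

    models = [
        ("decision tree", DecisionTreeClassifier(random_state=42)),
        ("neural network", MLPClassifier(max_iter=2000, random_state=42)),
    ]

    for name, model in models:
        scores = cross_val_score(model, X, y, cv=5)
        print(name, round(scores.mean(), 3))

The actual assignment will of course use its own data and evaluation criteria; the point is simply comparing two very different model types on the same problem.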

Data Base Management Systems Class

The data base management systems class (CISD 750) is taught by Junping Sun, using the textbook Database Systems: The Complete Book (2nd Edition).  For the first class sessions, the professor lectured on a variety of topics in database theory.  This class is different from your typical "IT SQL" class; it is really about the design and implementation of an actual database system.  I was familiar with the topics discussed, but I will have quite a bit of studying to do for this class.  Both professors seemed very knowledgeable about their respective areas and very current on the latest research.

The assignments for this course are:

  • Research Proposal.
  • Research Report.
  • In-class mid-term at the next cluster meeting.
  • Final examination assignment completed over the last several weeks of the semester.

We are supposed to select a research topic for this course.  KDD (Knowledge Discovery in Databases) is one of the topics, and it is the area I plan to research.  KDD is essentially what computer science groups, such as the ACM, call data science.

So far it looks like a good program.  I wanted to enter the world of academic publishing, but could not fit a traditional PhD program into my life.  This program will be quite a bit of work, but it looks like it will be a good fit for me.

Around the NSU Campus

Here are some pictures that my wife and I took around the NSU campus.

The campus is large, and it took some walking to learn my way around!

Jeff Heaton at NSU

Walking to class at NSU.

You can see the main entrance to the campus below.

Jeff in front of NSU

This is the Carl DeSantis building at NSU.  The Graduate School of Computer and Information Sciences is located here, and I spent most of my time in this building.

Of course, this being Ft. Lauderdale, Florida, I had to stop at the beach!

Ft. Lauderdale is home to many interesting canals, and my wife and I took a canal tour.

Next Steps for Encog

I think the time is approaching for a major upgrade to Encog, spread over several versions.  I am beginning a PhD in computer science in a few days.  I would like to use Encog for some of my research, and there are some gaps I need to fill.  When I first started Encog, I was only a few years into my own AI journey.  Now, six years into Encog, I have a clearer idea of how I need to structure some things.  I am very impressed by the CARET package for R and the scikit-learn package for Python.  Both of these packages allow you to experiment with a wide variety of models to find what best fits your data.  I hope to apply inspiration from both of these to Encog.

What I Like about Encog

Encog is fast and efficient, and it makes use of multi-core technology quite well.  I feel the low-level models (neural networks, SVMs, and the training algorithms) are quite solid, and there is a decent amount of unit test coverage built into these core models.  The foundation really is strong.  You can always add more models, but my goal is not to add every model to Encog, only those that I care about or that the community convinces me to care about.  Additionally, contributors sometimes provide me with working Encog-compatible models, and these are added as well.

What Could Use Some Work in Encog

The weakest part of Encog is the infrastructure that gets data in and out of the models.  Many successful projects that use Encog (some of my own included) simply write code to directly wrangle the data into the format that a model can accept.  Models only accept fixed-length vectors of numbers, yet data occurs in a wide range of formats, such as time series, strings, and others.  Ultimately, the programmer does have to take most of the responsibility for wrangling their data.  This stuff is rarely completely automatic; if it were, there would be little need for data scientists.

My first attempt to make data wrangling easier was "Encog Analyst."  This worked to some degree, but not well.  Encog Analyst is fairly cool in that you just point it at a CSV file, pick a model type (e.g., a type of neural network), and it generates an EGA file that tells Encog how to map your CSV data to the model.  You click run, your data is broken into test/training sets, and a model builds.  Sounds good, right?  I thought so initially.  However, this is very limiting, and it does nothing for model selection.

Model selection is the process of adjusting your hyper-parameters to find the best fit for your model.  Hyper-parameters are things like how many hidden layers you have, the type of kernel your SVM uses, and so on.  It's hard to pick these, yet they can make all the difference in the accuracy of your models.  There are many different means for selecting them, and often it is automated trial and error.  Encog Analyst does nothing to automate this trial and error.  You have to continually modify your EGA file, and the file is not that easy to work with.  If you want to switch from an SVM to a neural network, you must change large blocks of your file.  If you want to try various combinations of hidden layers and activation functions, you must do that manually as well.  If you switch the model or activation function and now need to normalize in a different way, you need to change that too.  Model selection is just not easy with Encog Analyst.
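
For comparison, this is roughly what automated hyper-parameter search looks like in scikit-learn (a minimal sketch on a bundled data set, not Encog code).  This is the kind of trial and error I would like the new Encog Analyst to take off your hands:

    # Grid search over SVM hyper-parameters with cross-validation.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)

    pipeline = make_pipeline(StandardScaler(), SVC())
    param_grid = {"svc__kernel": ["linear", "rbf"], "svc__C": [0.1, 1.0, 10.0]}

    search = GridSearchCV(pipeline, param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)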

I want to change Encog Analyst so that you define the input data in terms of what each column is: a string (categorical), a number (continuous), a time (time-series), and so on.  Then you give it a list of models to attempt.  Encog will know how the data must be normalized for each model type and will do this for you.  You can override these mappings of how each model wants its data, but usually you will not.
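
scikit-learn's ColumnTransformer gives a feel for what I am after (a sketch with made-up example data; the eventual Encog API will not look like this).  You declare which columns are categorical and which are continuous, and the appropriate encoding or normalization is applied for you:

    # Declare what each column is; the transformer handles the encoding.
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    df = pd.DataFrame({
        "product": ["a", "b", "a", "c"],   # string -> categorical
        "amount": [10.0, 3.5, 7.2, 1.1],   # number -> continuous
    })

    preprocess = ColumnTransformer([
        ("categorical", OneHotEncoder(), ["product"]),
        ("continuous", StandardScaler(), ["amount"]),
    ])

    print(preprocess.fit_transform(df))

The point is only the workflow: the column types drive the preprocessing, and the programmer does not hand-code the normalization for each model.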

This will mainly be accessed through APIs.  I will also give Encog Workbench the ability to interact with the new Encog Analyst.

What Will Be Added?

For my own research, I will specifically be adding the following, not necessarily in this order; this will happen over several versions.

  • Support for ARFF files.
  • Support for PMML files.
  • Faster linear algebra using BLAS packages, and support for GPU based versions of BLAS.
  • New models: CART trees, random forests, and gradient boosting machines.
  • Better support for cross-validation techniques.
  • The new Encog Analyst that will be designed to facilitate model selection.
  • Better random number support.
  • More error calculation methods, and better support for them.
  • Addition of the code from volumes 1 & 2 of my Artificial Intelligence for Humans series.
  • Update of Encog documentation to support these changes.

I will release more information on how I plan to stage these changes.  I intend for this to happen in a series of small, fairly quick version releases, and I plan on minimizing breaking changes to existing code.

More information will follow soon.

Machine Learning 101 Presentation

Today I gave a presentation for the St. Louis Microsoft Business Intelligence (BI) Users Group. This post contains all of the links that I discussed in the presentation.  Even if you did not attend, you can download a PDF of my slides.

My Thoughts on Courses 1-9 of the Johns Hopkins Coursera Data Science Specialization

I am now ABC (all but the capstone) on the Johns Hopkins Coursera Data Science Specialization.  I only need to complete the capstone project to finish the specialization.  That should happen sometime this fall, depending on when they offer the capstone for the first time.

I am not sure how typical a student I was for this program.  I currently work as a data scientist, have a decent background in AI, have a number of publications, and am currently completing a PhD in computer science.  So, a logical question is: what did I want from this program?

  1. I have not done a great deal of R programming, and this program focused heavily on R.  I view this as both a strength and a weakness of the program.  I am mostly a Java/Python/C#/C++ guy, so I found the R instruction very useful.
  2. I've focused mainly on AI/machine learning, and I hoped this program would fill in some gaps.
  3. Curiosity.

In a nutshell, here are my opinions.

Pros: Very practical, real-world data sets.  Experience with both black-box (machine learning) and more explainable (regression) models.  Introduction to Slidify and Shiny; I've already used both in my day-to-day job.  It takes some real work and understanding to make it through this program.  The last three courses rocked!

Cons: Peer review is really hit or miss.  More on this later.  Some lecture material was sub-par (statistical inference) compared to Khan Academy.  Only reinvent the wheel if you are going to make a better wheel.

Course Breakdown

Here are my quick opinions on some of these courses.  I've posted more on some of them as I took them (see past blog posts here, here, and here).

  1. The Data Scientist’s Toolbox: Basically, can you install R and RStudio, and use GitHub?  I've already done all three, so I got little from this course.  If you have not dealt with R, RStudio, and GitHub, this class will be a nice, slow intro to the program.
  2. R Programming: I enjoyed this course!  It was hard, considering I was taking classes #1 and #3 at the same time.  If you have no programming experience, this course will be really hard!  Be ready to supplement the instruction with lots of Google searching.
  3. Getting and Cleaning Data: Data wrangling is an important part of data science, and getting data into a tidy format is important.  This course used quite a bit of R programming.  For me, not being an R programmer and taking course #2 at the same time meant extra work.  If you are NOT already an advanced programmer, DO NOT take #2 and #3 at the same time.
  4. Exploratory Data Analysis: This was a valuable class; it taught you all about the R graphing packages.
  5. Reproducible Research: Valuable course!  Learning about R Markdown was very useful; I am already using it in one of my books to make sure that several of my examples are reproducible, by providing an RMD script that produces all of the charts from my book.
  6. Statistical Inference: This was an odd class.  I already knew statistical inference and did quite well despite watching hardly any lectures.  I don't believe this course made many people happy: either you already knew the topic and were bored, or you were completely lost trying to learn statistics for the first time.  There are several Khan Academy videos that cover all the material in this course.  Why does Hopkins need to reproduce this?  Is this not the point of a MOOC?  Why not link to the Khan Academy videos and test the students?  Best of both worlds!  Also, 90% of the material was not used in the rest of the specialization, so I suspect many students were left wondering what this course is for.
  7. Regression Models: Great course; this is the explainable counterpart to machine learning.  You are introduced to linear regression and GLMs.  This course was set up as the perfect counterpart to #8.  My only beef with this course was that I got screwed by peer review.  More on this later.
  8. Practical Machine Learning: Great course.  This course showed some of the most current model types in data science: Gradient Boosting Machines (GBMs) and Random Forests.  It also gave a great description of boosting and an awesome Kaggle-like assignment where you submitted results from your model to see if you could match a "hidden" data set.
  9. Developing Data Products: Great course.  I really enjoyed playing with Shiny, and even used it for one of the examples in my upcoming book.  You can see my Shiny project here: https://jeffheaton.shinyapps.io/shiny/ and my writeup here: http://slidify.org/publish.html.  They encouraged you to post these to GitHub and public sites, so I assume I am not violating anything by posting here.  Don't plagiarize me!!!

Peer Reviewed Grading

If you are not familiar with peer review grading, here is how it works: for each project you are assigned four counterparts who review and grade your assignment.  This is mostly double-blind, as neither the student nor the reviewer knows the other.  I used my regular GitHub account on all assignments, so it was pretty obvious who I was; I was even emailed once by a grader who recognized me from my open source projects.  Your grade is an average of what those four people gave you.  At $49 a course, maybe this is the only way they can afford to grade.  I currently spend nearly 100 times that for each of my PhD courses. :(

Overall, peer review grading worked well for me in all courses but one.  Here are some of my concerns about peer grading.

  1. You probably have many graders who are pressed for time and just give high marks without much thought.  (Just a guess/opinion.)
  2. You are going to be graded by people who may not have gotten the question correct in the first place.
  3. You are instructed NOT to run the R program.  So now I am being graded on someone's ability to mentally compile and execute my program?
  4. Each peer is going to apply different standards.  You could get radically different marks depending on who your four peers were.

So here is my story about the one case where peer review did not work for me.  I was in the upper 98-99% range on most of these courses, except for course #8.  I had good scores going into the final project.  However, two of my peers knocked me for these reasons:

  • Two of my peers could not download my file from Coursera, yet the other two had no problem.  Fine, so I get a zero because someone's ISP was flaking out.
  • Two of my peers did not give me credit because they felt I had not used RMD for my report (which I had).  Fine, so I lose a fair amount of points because two random peers did not know what RMD output looks like.

This took a toll on my grade, but I still passed.  However, this is the one course I did not get "with distinction" on.  Yeah, big deal; in the grand scheme of things I don't really care.  Just mildly annoying.  However, if you are hovering near 70% and you get one or two bad reviewers, you are probably toast.

Conclusions

Great program.  It won't make you a star data scientist, but it will give you a great foundation to build from.  Kaggle might be a good next step.  Another might be starting a blog and doing some real, interesting data science to showcase your skills!  This is somewhat how I got into data science.

Spoke at the 2014 Health Actuaries Conference in San Francisco


I was an invited speaker at this year's Society of Actuaries (SOA) Health 2014 conference in San Francisco.  I presented on the topic of deep learning and unstructured data; I've been working with both topics for nearly a year now.  I make use of deep learning through the Theano framework for Python, and I hope to add deep learning support to Encog at some point.  It was fun to meet with other data scientists in the insurance industry.  I left with a greater appreciation for several model types that I had not previously worked with, such as Gradient Boosting Machines (GBM).  I also wrote an article on deep learning for the SOA Forecasting and Futurism Section.  You can find the PDF here, in the July 2014 edition.

50% Done with Johns Hopkins COURSERA Data Science Specialization, My Review of Classes 3&4

I am 50% done with the Johns Hopkins COURSERA Data Science Specialization, and continuing onward.  I took courses 1 through 3 in a single batch, and found them useful.  You can read my review of courses 1 through 3 here.  I decided not to take three classes in parallel again.  I am learning the R programming language as I go, so these courses can take some time.  I plan on sticking to two courses at a time until I complete the 9 courses.  This will allow me to tackle the final capstone project sometime this fall.

I am taking the signature track.  This means that COURSERA issues me a verification certificate for each course.  You can see my certificates for these two courses here.

My grades were high enough that I earned the "with distinction" note; this has been the case with all five classes so far.  I hope that I am not giving the impression that these courses are easy.  Short of loading data (that I had preprocessed elsewhere) into R for modeling, I had not really dealt with R much.  My goals for this program are twofold: first, to finally sit down and really learn R; second, to see more applications of data science in new domains.  As a data scientist, I work primarily with life insurance industry data.

Both of the current two classes really fit the bill.  Both give me some insight into how data science is used in academic, peer-reviewed research.  Given that I am entering a distance-learning computer science PhD program, this is very good exposure for me.

Review of Exploratory Data Analysis

Exploratory data analysis taught me all about charting in R.  There are three central charting packages for R: base graphics, lattice, and the grammar of graphics (ggplot2).  Each has its uses, and this class taught all three.  The ggplot2 (grammar of graphics) package attempts to address some of the issues in the earlier charting packages.

The entire point of exploratory data analysis is to produce quick charts to allow you to get a handle on what the data look like.  These are not formal graphics that you will include in your report, though they might become that.

Review of Reproducible Research

If you've read many academic articles, you know how hard it can be to reproduce exactly what the researcher is talking about.  The goal of this course is to automate all of the numerical analysis with a simple R script.  This allows the reader to know exactly where the numbers came from and what the process was.  Many academic papers lack any sort of directions for reproducing their results.

Conclusions

The next course for me is Practical Machine Learning, which I am taking by itself.  I only have two courses to finish before the capstone, which would allow me to complete the coursework (but not the capstone) by September.  In September, I start a PhD in computer science, so I want to keep my schedule clear for the fall.