The purpose of this post is to demonstrate how to get an effective "data science" environment up and running with Python 3. This blog post will give a common set of instructions for my books, articles and other information provided by me. Even if you are not reading one of my books or articles, you might find this information useful.
I feel that Python 3.x made some great strides towards source code clarity and binary efficiency. Debating the relative pros and cons of Python 3.x vs 2.x is not the purpose of this post. Instead, this post documents how to install what I consider to be a decent "data scientist" working environment with Python 3.x. The primary purpose is so that I do not forget how to do this! However, the rest of the world might benefit as well.
First of all, if you do not need Python 3.x, then just install Anaconda and call it a day. Anaconda is a scientific distribution of Python 2.7 that will give you all that you need.
What do I mean by a "data scientist" environment? In particular, I make use of the following packages:
- Numpy - For numerical processing.
- Scipy - For scientific processing not covered by Numpy.
- Scikit-Learn - For machine learning.
- Theano - For numerical processing not covered by Numpy and deep learning.
- Matplotlib - For charting.
- Pygame - For visualization.
- Oracle - For database access.
- (anything else needed by the above)
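As a quick sanity check, the list above can be turned into a small script that reports which packages the current interpreter can find. This is only a sketch: the import names are my assumption of how each package is imported (Scikit-Learn, for example, is imported as "sklearn"), the function name is made up for illustration, and the Oracle driver is left out because no specific module is named here.

```python
import importlib.util

# Import names for the packages listed above (an assumption for illustration).
# The Oracle driver is omitted because the post does not name a module.
PACKAGES = ["numpy", "scipy", "sklearn", "theano", "matplotlib", "pygame"]


def find_missing(names):
    """Return the names that cannot be located on the current import path."""
    missing = []
    for name in names:
        # find_spec returns None when a top-level module cannot be found.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


missing = find_missing(PACKAGES)
if missing:
    print("Not installed yet:", ", ".join(missing))
else:
    print("All listed packages were found.")
```

Running this before and after the installation steps gives a simple picture of what is still left to install.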
The first thing to realize about installing anything in Python is that you are dealing with "pure Python" and "binary" packages. Pure Python packages can be installed with the "pip" command. Most serious models are written in C, C++ or Fortran (yes, I said Fortran; it is a serious data science language, even in 2014). They would simply be too slow in pure Python. All of the above packages are "binary", and must be installed in their own unique ways.
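One rough way to see the pure Python versus binary distinction on an installed package is to look for compiled extension files among the files it installed. The sketch below assumes a newer Python (3.8 or later, where `importlib.metadata` is available), and the helper names are hypothetical. It is a heuristic, not a definitive test.

```python
from importlib import metadata
from importlib.machinery import EXTENSION_SUFFIXES


def has_compiled_files(file_names):
    """Return True if any file name ends with a compiled-extension suffix."""
    suffixes = tuple(EXTENSION_SUFFIXES)  # e.g. ".pyd" on Windows, ".so" on Linux
    return any(str(name).endswith(suffixes) for name in file_names)


def is_binary_distribution(dist_name):
    """Return True/False for an installed distribution, or None if unknown."""
    try:
        files = metadata.files(dist_name)
    except metadata.PackageNotFoundError:
        return None  # the distribution is not installed
    if files is None:
        return None  # the installer did not record a file list
    return has_compiled_files(files)


for name in ["numpy", "six"]:
    print(name, "->", is_binary_distribution(name))
```

A result of None simply means the package is not installed, or that its file list was not recorded.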
Another consideration is 64-bit or 32-bit. This document assumes 64-bit!
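If you are not sure which kind of Python you ended up with, the interpreter can tell you. The following is a minimal sketch using only the standard library; the function names are my own.

```python
import struct
import sys


def pointer_bits():
    """Return the pointer size of the running interpreter, in bits."""
    # "P" is the struct format code for a C pointer (void *).
    return struct.calcsize("P") * 8


def is_64bit():
    """Return True when the interpreter is a 64-bit build."""
    return sys.maxsize > 2**32


print("This Python is a %d-bit build." % pointer_bits())
```

Note that this reports the build of the Python interpreter itself, not the operating system. A 32-bit Python on 64-bit Windows will still report 32.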
Installing Python 3.x on Windows, Mac and Linux each presents its own unique challenges. I will eventually describe all three; for now, however, this post is a "work in progress."
The first step is to install the latest version of Python, which can be found here. Make sure to download the latest 64-bit version of 3.x. You must normally go through an intricate process to compile a binary Python package. The University of California at Irvine provides a repository of these packages, with Windows installers. This will save you a great deal of time!!!
- UCI Python Package Library (Make sure to use the 64-bit, aka AMD64, versions. Yes, even if you have an Intel chip.)
Install these packages, in this order!
- Python 3.x, get it here
- pygame, get it at UCI
- numpy, get it at UCI
- scipy, get it at UCI
- six, Pure Python, install with "pip install six"
- dateutil, Pure Python, install with "pip install python-dateutil" (the package is published under the name python-dateutil)
- pytz, Pure Python, install with "pip install pytz"
- pyparsing, Pure Python, install with "pip install pyparsing"
- matplotlib, get it at UCI
- theano, get it at UCI
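The pure Python steps in the list above can also be scripted. Here is a minimal sketch, assuming pip is available for the interpreter you are running; the function names and the dry-run default are my own choices. Note that dateutil is published on PyPI under the name python-dateutil. The binary packages still need their own installers, so this only covers the four pure Python ones.

```python
import subprocess
import sys

# Pure Python dependencies from the list above, in the order given.
PURE_PYTHON = ["six", "python-dateutil", "pytz", "pyparsing"]


def pip_install_command(package):
    """Build a pip command that targets the interpreter running this script."""
    return [sys.executable, "-m", "pip", "install", package]


def install_all(packages, dry_run=True):
    """Print each pip command; actually run it only when dry_run is False."""
    commands = []
    for package in packages:
        command = pip_install_command(package)
        commands.append(command)
        print(" ".join(command))
        if not dry_run:
            # check=True stops at the first failed install.
            subprocess.run(command, check=True)
    return commands


# Dry run: prints the commands without installing anything.
install_all(PURE_PYTHON)
```

Calling pip through `sys.executable -m pip` helps make sure the packages land in the Python 3.x install you just set up, rather than in some other Python on the machine. Pass `dry_run=False` to perform the installs.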
Using Oracle in 64-bit has its own set of issues; however, once it's installed, it works fine. I will add Oracle instructions soon.
More to come