Quick and very dirty data wrangling example

Data science is often described as the intersection of statistics, domain knowledge, and hacking skills.  One important part of hacking skills is data wrangling: data are rarely in the exact form that you need.  I am currently working on an example for AIFH Vol 3 that will use a self-organizing map (SOM) to compare nations based on several statistics.  I could not find a dataset that fit exactly what I was looking for, so I decided to create my own.

I wanted a list of countries with three different data points that somehow indicate that nation's prosperity.  I chose GDP, lifespan and literacy rate.  Remember, this is a computer science experiment, not a sociology experiment.  I am sure others could come up with a much better set of data points to compare countries.  However, for my example program these will work just fine.

I could not find a dataset that was already complete.  However, all of this data is contained in Wikipedia.  To wrangle the data I wrote a simple Python script.  I am really starting to like Python for quick scripting projects; I could have also used R, Groovy, Perl or a host of others.  The end result looks something like this:

ATG,Antigua and Barbuda,1220,75.8,0.984

[Full File]

Wikipedia makes its entire contents available as downloadable data files, and this is usually how you should deal with Wikipedia data in bulk.  Do not use HTTP to pull large volumes of data from Wikipedia; that is a good way to get blocked.  Also, the data files are wikitext rather than HTML, and much easier to parse.  I simply pulled the nation codes page, plus the GDP, literacy, and lifespan pages, into text files that my Python script could parse.
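The parsing step can be sketched roughly as follows.  The table layout and regular expression below are illustrative assumptions, not Wikipedia's exact markup, which varies from page to page:

```python
import re

# Hypothetical wikitext table rows of the form:
#   | [[Nation Name]] || 1,220
# The pattern is an assumption for illustration; real pages vary.
ROW = re.compile(r"\|\s*\[\[([^\]|]+)(?:\|[^\]]*)?\]\]\s*\|\|\s*([\d,.]+)")

def parse_rows(wikitext):
    """Return (nation, value) pairs found in a chunk of wikitext."""
    pairs = []
    for match in ROW.finditer(wikitext):
        name = match.group(1).strip()
        value = float(match.group(2).replace(",", ""))  # strip 1,000 separators
        pairs.append((name, value))
    return pairs

sample = "| [[Antigua and Barbuda]] || 1,220\n| [[Barbados]] || 75.8"
print(parse_rows(sample))  # [('Antigua and Barbuda', 1220.0), ('Barbados', 75.8)]
```

In practice each Wikipedia page needed its own small tweaks to the pattern, which is exactly the sort of one-off fiddling data wrangling involves.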

I linked the files together (joined) using the nation name as a key.  If a nation's name did not appear in all lists I discarded that nation.
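The join itself is only a few lines of Python.  The values below are illustrative stand-ins for the parsed Wikipedia tables, not the real data:

```python
# Each parsed Wikipedia page becomes a dict keyed by nation name.
# These values are stand-ins for illustration only.
gdp = {"Antigua and Barbuda": 1220, "Freedonia": 500}
lifespan = {"Antigua and Barbuda": 75.8}
literacy = {"Antigua and Barbuda": 0.984, "Freedonia": 0.9}

# Inner join on the nation name: keep only nations present in every list.
common = sorted(set(gdp) & set(lifespan) & set(literacy))

rows = [(name, gdp[name], lifespan[name], literacy[name]) for name in common]
for name, g, life, lit in rows:
    print(f"{name},{g},{life},{lit}")  # Freedonia is discarded: no lifespan entry
```

Intersecting the key sets gives the inner-join behavior described above: any nation missing from one of the source lists simply drops out.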

You can see my Python code here.  This code could be more readable.  But it gets the job done.  It is a quick data wrangling hack.  If I needed to re-pull the data on a frequent basis, particularly if it were high-velocity data, I would do something more formal.

Intellivision, Imitation and Minecraft

In my part of the world this is the holiday season.  Between Thanksgiving, Christmas and New Year's, there is quite a bit going on in December.  Having just finished my first semester in a new degree program, as well as the capstone to the Johns Hopkins Data Science specialization, I was feeling particularly festive!  I hope everyone (who celebrated it) had a wonderful Christmas and holiday season.  Here is how I spent the holiday!


My primary gift this Christmas was an Intellivision video game system.  Back in 1981 I received the original Intellivision system for Christmas.  That Christmas, back in the early eighties, is probably my most memorable.  I could finally play video games, in my own house!  No need for a pocket full of quarters for the local arcade.  My father and I spent hours playing games with Intellivision.  Some of our favorites were Utopia, Sea Battle, Dungeons and Dragons, and Space Battle.

However, my absolute favorite was Utopia!  I think Utopia was the original nation-level strategy game.  There was no AI component; two players each commanded an island and controlled its military and economy.  I would often play the game by myself and try to control both islands at the same time.  My basic strategy was to earn as much money as possible from fishing, then buy as many factories as I could.  Factories would earn money just sitting there, and eventually I had enough money rolling in from them that I no longer needed to fish.  The factories also polluted and kept the population down, so I did not have to waste as much of my small island on non-income-producing tiles to keep my population happy.  Who needs schools, hospitals and houses when you can have income-producing factories?  I could even quit producing crops and feed the population with only fish, so long as I had one fishing boat for every 500 people.  In the end I had tons of money and about 1100 people working in my factories on a heavily polluted island.  It did not matter what state the population was in, as the factories would still produce just fine.  I would have really hated to live in my nation!

I tried many other strategies as well.  Some actually led to a more utopian island for my population.

The new Intellivision system came with 60 games, and cost around $40.  How is that for deflation?  Back in the eighties I recall sometimes shelling out 40 bucks for a single game!  Most games I could no longer beat.  It has been too many years, and the muscle memory I built up to beat these games is long gone.  However, I could still build my polluted despotic nation in Utopia just fine!  I can't imagine how many hours I must have spent playing that game as a child for me to still remember it exactly.  Here is a picture of the TV screen with my two islands that I built up rather quickly.


I was quite happy with the new Intellivision system.  The gameplay is just as I remembered it, though some of the colors and sounds seemed a little different.


My wife and I went to see The Imitation Game at the Wehrenberg Theaters 5-star lounge.  I might owe my wife several movie selections for agreeing to see a movie about Alan Turing, the man who many consider to be the father of computer science and artificial intelligence.  The movie focused on the efforts of his team to help the British government break the Enigma code used by the Nazi armed forces.

I enjoyed the movie, and so did my wife.  Clearly some parts of reality were adapted to make a more compelling movie.  The film steered clear of most of the mathematics involved in cracking Enigma.  It did mention the flaw of always starting an encrypted message with the same word; this is not an issue with modern cryptography, but for a system such as Enigma it could be a fatal flaw.  The movie also brushed over some other computer science topics, such as big-O analysis, the Turing test, universal Turing machines, and elements of cryptography.  Nothing really deep.  They only partially explained a Caesar cipher.
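For readers who have not seen one, the Caesar cipher the movie touches on is simply a fixed rotation of the alphabet.  A quick sketch in Python:

```python
def caesar(text, shift):
    """Shift each letter by `shift` positions, wrapping around the alphabet."""
    result = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            result.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            result.append(ch)  # leave spaces and punctuation alone
    return ''.join(result)

ciphertext = caesar("ATTACK AT DAWN", 3)
print(ciphertext)              # DWWDFN DW GDZQ
print(caesar(ciphertext, -3))  # ATTACK AT DAWN
```

With only 25 possible shifts, a Caesar cipher falls to trial and error in seconds, which is why it makes such a good classroom contrast to a machine like Enigma.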

Unfortunately, the movie fairly directly attributed the invention of the computer to Turing.  Turing's real accomplishment was formally defining what a computer is, and with all of the advances in computing his definition still holds.  The ACM discusses this in this article.

Our local theatre has a 5-star lounge.  My wife and I enjoy this!  You get to sit in nice recliners and can order food and wine!



It is fun!  I know some of my friends have essentially recreated this setup at home!  However, our entertainment room is not nearly so cool!

I do not know a great deal about cryptography.  However, I am taking a graduate class on it this semester, so I will likely be learning much more soon!


Another part of my Christmas holiday was playing Minecraft with my niece and nephew.  My little nephew really likes Minecraft!  I am sure I would have been fascinated by it at his age; however, all I had was Utopia.  I created a large arena for us to battle in, including working redstone lights to control the number of monsters that would spawn around us.  I also set up a computer running a Minecraft server so that we could both be in the world at the same time.


It was a good Christmas!

First Semester Done: Artificial Intelligence and Database Management Systems

I survived my first semester in the PhD in Computer Science program at Nova Southeastern University in Ft. Lauderdale, FL.  The two classes that I took this semester were CISD 750: Database Management Systems and CISD 760: Artificial Intelligence.  Both classes were a challenging mix of research, exams and papers, and both required considerable work.  I ended up with an A in AI and an A- in DBMS.  I gained valuable knowledge and insights for my eventual dissertation.  Both instructors were helpful and very knowledgeable about their topics.

Journal articles were a major part of both classes.  There were over 20 assigned papers for the database class and I reviewed nearly that many for the AI class.  Reading journal articles was something that I expected from a doctorate program.  While this is a distance degree program, both classes also had challenging in-class mid-term examinations.  The program requires me to spend four days, in two trips, on-campus per semester.  I traveled to Ft. Lauderdale both in August and October.

For this semester, the AI professor was Sumitra Mukherjee and the database professor was Junping Sun.  Both were great!  I would very much recommend classes taught by either.

CISD 750: Database Management Systems

The database management class covered many aspects of DBMS design.  I had not previously taken an academic class on database systems, so there was quite a bit of new material for me.  I had not worked with relational algebra prior to this class.  I also learned about functional dependencies, schema normalization, the chase algorithm, hashing, indexing, dynamic programming, query optimization, and other topics.  Most of the focus of the class was on relational databases; however, some time was spent looking into NoSQL topics.  The final and mid-term both tested our ability to work through these algorithms.

I chose to do my research paper on frequent itemset mining, a Knowledge Discovery in Databases (KDD) topic.  I compared the Apriori, Eclat and FP-Growth algorithms, and did an empirical study of which datasets are conducive to each algorithm.  Most papers that I read about these algorithms studied the performance effects of varying the support threshold on known datasets.  I wanted to take a different approach, so I looked at how I could generate a dataset with a specified frequent itemset density and number of frequent items.  I created a Python application to generate this data.  I was happy with how the paper turned out, and I plan on posting the results from this research in a separate post here in the future.
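The idea can be sketched as follows.  This is a simplified illustration of the approach, not the actual code from my paper: plant a known frequent itemset into a controllable fraction of otherwise random transactions, so the planted set's support is roughly the requested density.

```python
import random

def generate_transactions(n, num_items, planted, density, seed=42):
    """Generate n transactions over items 0..num_items-1, planting the
    given itemset into roughly `density` of them.  Other items appear
    independently at a low noise rate."""
    rng = random.Random(seed)
    transactions = []
    for _ in range(n):
        txn = {i for i in range(num_items) if rng.random() < 0.1}  # background noise
        if rng.random() < density:
            txn |= planted  # plant the frequent itemset in this transaction
        transactions.append(txn)
    return transactions

planted = {1, 2, 3}
txns = generate_transactions(5000, 50, planted, density=0.4)
support = sum(planted <= t for t in txns) / len(txns)
print(round(support, 2))  # close to the requested density of 0.4
```

Varying `density` and the size of `planted` (both hypothetical parameter names here) gives datasets whose frequent-itemset structure is known in advance, which is what makes the algorithm comparison controlled.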

I really liked the textbook chosen for this class.  Database Systems: The Complete Book (2nd Edition) appears to be a classic in the field of DBMS.  The examples and explanations were really clear, and the material was very different from what I am used to for databases.  I've worked with databases, such as Oracle and MySQL, for years, and it was very interesting to see DBMS at a more academic level.  I feel that I gained a deeper understanding of database topics.

CISD 760: Artificial Intelligence

The AI class covered a wide range of topics that provided a very good foundation in AI.  The optional textbook was Artificial Intelligence: A Modern Approach.  I feel the textbook was a great choice, and I already owned a copy before the course began.  Topics covered in this course included A*, neural networks, decision trees, Bayesian inference, and genetic algorithms.  We were given an assignment where we had to make use of several of these algorithms.  I used Encog for the genetic algorithm and A* portions of the programming assignment.

Another assignment asked us to choose a peer-reviewed article and perform a critique.  For my article, I chose:

Larochelle, H., Bengio, Y., Louradour, J., & Lamblin, P. (2009, June). Exploring strategies for training deep neural networks. J. Mach. Learn. Res., 10, 1-40.

I am interested in deep learning and wanted to research, and understand, some of the issues with applying it to continuous inputs.  I ended up implementing a deep belief neural network in Java.  The source code to this is on GitHub.  I will make use of this code for Volume 3 of my AIFH series.

The major paper for this class was an idea paper.  Idea papers are used in this program to capture potential ideas for a dissertation.  I wrote an idea paper detailing research that I might like to perform on using continuous input with deep learning.  I believe that I would like to do my dissertation in the area of AI, though I am not sure I will choose deep learning.  Nevertheless, the paper was a great exercise.  There are several areas where I gently nudged the boundary of human understanding while writing parts of Encog, and I plan to explore several of these for a potential dissertation.  I also still have one more semester before I need to get really serious about a dissertation topic.

Next Semester

Next semester I am taking two courses again, and then two more in the fall of 2015.  Once I am through these courses I have two more semesters where I will split my time between a single course and research hours.  After that I will be a PhD candidate and on to dissertation work.

The two courses that I will take next semester (Winter 2015) are:

  • CISD 792  Computer Graphics
  • ISEC 730 / DCIS 730 Network Security and Cryptography

The textbooks for my computer graphics class are shown here.  There is no assigned book for the security class.  My guess is there will be a number of assigned papers for the security class.



I am looking forward to the next semester!  It looks like I will be learning about three.js in the graphics class.  I am really looking forward to that.

Handling Multiple Java Versions on a Mac

I would like to make use of Java 8 in Encog now, as well as in the upcoming volume 3 of AIFH.  I primarily use a Mac at home and a Microsoft Surface while traveling.  In this post, I will describe how I set up the Mac to use Java 8.

The first step is to install Java 8, or whatever version of Java you wish to use.  The installer can be downloaded from the following location.


This will download a Mac package to install.  Once you install this package, you want to get to the point where you can go to a command line, issue the command java -version, and see the correct version.  For example:

jeffs-mbp:~ jheaton$ java -version
java version "1.8.0_25"
Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)
jeffs-mbp:~ jheaton$ 

To do this, you must first find the exact version number of the Java installation that you wish to use.  The java_home command can list the installed versions:

jeffs-mbp:~ jheaton$ /usr/libexec/java_home  -V
Matching Java Virtual Machines (5):
    1.8.0_25, x86_64:	"Java SE 8"	/Library/Java/JavaVirtualMachines/jdk1.8.0_25.jdk/Contents/Home
    1.7.0_25, x86_64:	"Java SE 7"	/Library/Java/JavaVirtualMachines/jdk1.7.0_25.jdk/Contents/Home
    1.7.0_13, x86_64:	"Java SE 7"	/Library/Java/JavaVirtualMachines/jdk1.7.0_13.jdk/Contents/Home
    1.6.0_65-b14-462, x86_64:	"Java SE 6"	/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
    1.6.0_65-b14-462, i386:	"Java SE 6"	/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home

jeffs-mbp:~ jheaton$

Now that you have the version number, you must add the following line to your ~/.bash_profile file.

export JAVA_HOME=`/usr/libexec/java_home -v 1.8.0_25`

Make sure to remove any other JAVA_HOME directives.


The 18 Kickstarter Projects that I've Backed

As of today I've backed 20 Kickstarter projects, run two successful projects of my own, and am planning my third.  Of these 20 backed projects, 18 were funded.  In this post I will give a summary of my experiences, as a backer, with Kickstarter.  Here are a few of the items I've received.


I have not yet experienced a Kickstarter project that has gone completely AWOL.  There are certainly cases of this; it just has not occurred, yet, in a project that I've backed.  Most Kickstarter projects do run late.  The initial project estimate is just that, an estimate, and few projects meet it.  These projects are often in completely uncharted waters of design, production and international order fulfillment.  Once a project is late, communication is key; at that point you need your backers to keep faith in your ability to fulfill the project.  I've seen projects run late by over a year and still pull through.

Another issue that I've observed is that backers often treat Kickstarter as a sort of online mall.  You should not be buying Kickstarter projects as birthday or Christmas gifts.  Often backers do exactly that, seeing that the expected delivery date falls a few months before the date that the backer needs the gift.  Then the project runs behind, Christmas is missed, and emotions run high.  I've actually seen a few Kickstarter projects miss two Christmases in a row, and still deliver well over two years behind schedule.  This is the nature of backing a product that is not yet complete, much less in production.  I believe that Kickstarter is about being involved in an immature product that you believe in!

Here is a summary of the Kickstarter projects that I've backed (in the order I backed them):

  1. Code Monkey Save World: Major fan of Jonathan Coulton!  The project ran late, but was everything I expected and more.
  2. Sparki - The Easy Robot for Everyone: Cool robotics platform!  Just a few months late, and everything I expected.
  3. Dog Sled Saga: Cute little platformer game.  Delivered an alpha on-time and continues to evolve.
  4. LightUp: Learn by Making: Really neat electronics kit concept, but the project has had a number of issues.  I did get the kit just a few days ago, but there seem to be some discrepancies between the described contents of the kit and what people were actually shipped.  The production quality looks good, and I was able to use mine.  Just no instructions, at this point.
  5. The Stage at KDHX: A local Kickstarter project for a radio station in  my area.  Everything went great!
  6. Supertoy - World's First Natural Talking Teddy Bear: An AI-enabled teddy bear.  How could I pass this one up?  The video showed a teddy bear that could converse as well as Data from Star Trek.  Clearly that did not happen, or you would have read about the dawn of strong AI in all major media.  However, after running over a year behind, I did get a bear capable of looking up things in Wikipedia.  It is actually kind of cool; just a wild ride of some 80 strange updates from the project creator sharing YouTube videos that he found interesting.
  7. Dataflow & Reactive Programming Systems: A book project by a local author.  The project delivered what it promised.
  8. Star Trek Continues Webseries: This is a really cool project if you are a fan of the original Star Trek.  Delivered exactly what I expected.
  9. Chaos Drift - A Nostalgic RPG Experience: A classic RPG style game from my hometown.  The project is somewhat behind but seems to be progressing well.
  10. KUNG FURY: A movie in the classic 1980's action style.  The project is a bit behind at this point, but seems to be progressing.
  11. Hello Ruby: A computer programming book for kids.  Huge Kickstarter success; however, is now behind a few months.  Looks promising.
  12. iOS App Development for Teens by Teens: A book about iOS app development targeted at teens, and written by a teen.  Project delivered on-time and to expectations!
  13. JUMP Cable: A small device that can be used to recharge iPhones and other devices.  Currently shipping.
  14. Mini Museum: Totally amazing project that lets you own a miniature museum in an acrylic case.  This was one of my favorite Kickstarter projects.  Small delay, but I got mine yesterday.  Great communication with backers.
  15. A History of the Great Empires of Eve Online: A history of the MMOG Eve.  Seems to be progressing well.
  16. The Universim: Looks like the ultimate successor to Civilization; seems to be progressing well.
  17. Bring Reading Rainbow Back for Every Child, Everywhere!: I watched Reading Rainbow when I was younger, and am a major fan of the chief engineer of the Enterprise.  How could I not back LeVar?  The project was epic and seems to be on track.  Butterfly in the sky, this thing went twice as high; take a look, it hit 5 million!
  18. The PHD Movie 2: Still in Grad School: Since entering a PhD program myself I've seen many references to these comics.  Project seems to be doing well!

There you have it!  These are the projects I backed on Kickstarter.  This is a total of $1041 in backing dollars.  So far it has been a great experience!

Multi Agent Modeling Presentation & Midterms

I co-presented on the topic of Agent Based Modeling (ABM) at the 2014 Society of Actuaries annual meeting in Orlando, FL, during October 26-29, 2014.  October was a busy month for me; I had to fly to both Ft. Lauderdale and Orlando.  I know!  There are worse places to fly to, but it was still a somewhat hectic schedule.  I flew to Ft. Lauderdale first to attend the midterms for the computer science PhD program in which I am a student.



One of the booths at the SOA meeting was setup to take green-screen photos and place the attendee on the cover of The Actuary.  Here is my photo:


It was an interesting meeting.  There were keynote presentations by both Madeleine Albright, former USA secretary of state, and Dr. Adam Steltzner, Lead Landing Engineer of NASA’s Mars Science Laboratory Curiosity Rover Project.  Both keynote speakers were fascinating.  There were quite a few data science related sessions during the conference.

I presented, along with Dr. Anand S. Rao, on ABM.  I began the talk with an overview of what ABM is and introduced the open source toolkit Repast.  Dr. Rao continued the talk and showed how PricewaterhouseCoopers (PwC) makes use of agent modeling.  You can see the slides for the presentation here (link to SOA site).

I also had to get ready for midterms for my first semester.  I am enrolled in two courses, and had to study for Artificial Intelligence and Database Management Systems.


I will post more about the semester once it concludes in a few weeks.  I am already signed up for Winter 2015, and will be taking Computer Graphics and Cryptography.

Roasting and Brining a Turkey


There are many different ways to cook a turkey.  I've prepared turkeys for our family Thanksgiving meals for the last several years.  The results have been good enough that I am often asked about the exact process, so I will cover it in this post.  Like many of my posts about artificial intelligence, mathematics and computer programming, this one also serves to remind me of the process that I used.

I brine and then roast the turkey.  Brining is a process that I first learned about from a wine/food class at  Balaban's restaurant.  If you are ever in the St. Louis, MO area, I highly recommend Balaban's!  The first year we tried brining it was a hit.  Brining produces a very juicy, flavorful turkey!

Preparing the Turkey

I usually prepare two turkeys each Thanksgiving holiday.  My wife and I usually celebrate Thanksgiving with her side of the family and mine on two consecutive days.  Because of this, I have two turkeys to thaw and brine, which requires some planning.  I usually have a turkey schedule on the refrigerator leading up to the big day.


I always use a frozen turkey, usually a Butterball.  We did try a fresh turkey one year; however, it did not seem to have a significantly different taste, at least to us.  Also, because I am cooking two consecutive turkeys, it is difficult to acquire a fresh turkey for the day after Thanksgiving.  I prefer buying two frozen turkeys a week or two before Thanksgiving.


I usually buy a 20 pound bird, and therefore need to thaw it for 5 days.  Butterball has a great calculator for this.  I find that at the end of 4 complete days my 20 pound turkey is nearly completely thawed.  At the beginning of the fifth day, I begin the brining process.  There are certain safety procedures you should always follow when cooking a turkey; for precise instructions refer to the USDA.
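The arithmetic behind the schedule is simple: the usual refrigerator-thawing guideline is roughly one day for every four pounds of turkey, which is where the five days for a 20 pound bird comes from.  A trivial sketch:

```python
import math

def thaw_days(pounds, pounds_per_day=4):
    """Refrigerator thawing: roughly one day per four pounds of turkey."""
    return math.ceil(pounds / pounds_per_day)

print(thaw_days(20))  # 5 days for a 20 pound bird
```

Rounding up with `math.ceil` errs on the side of more thawing time, which is the safe direction.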


The brine is based on my own experimentation, drawn from several recipes.  I use the following ingredients:

  • Salt
  • Half of a cup of brown sugar
  • Half of a cup of regular sugar
  • One onion (chopped)
  • 4 stalks of celery (chopped)
  • 4 carrots (chopped)
  • 1 egg (only used to measure)

I've always used the floating egg method to determine how much salt to use for the brine solution.  The amount of brine solution you will need will depend on the method that you use to submerge your turkey.  I always brine in the refrigerator, so I need several gallons of brine to fill my brining bag and place it inside the refrigerator.  I have a large vegetable drawer in my refrigerator that holds the brining bag perfectly.  I've also considered purchasing a large vat that would fit inside my refrigerator.  Whatever method you use, the turkey must be submerged in the brine solution for 24 hours.  Doing this outside of a refrigerator is not recommended.

  • Place several gallons of water in a large stockpot, as much brine as you need.
  • Add salt to the water until an egg floats.  (There are other ways to determine the amount of salt.  This method has worked well for me.)
  • Add half a cup of brown sugar, half a cup of regular sugar, and the chopped vegetables.
  • Bring the entire solution to a boil for 20 minutes.
  • Let the solution cool overnight.  It takes awhile for this much boiling water to cool, so plan accordingly.
  • Do not submerge the turkey in a boiling/near boiling brine solution, as this would unsafely begin to cook the turkey.  Chill the brine to room temperature or below.
  • Remove the vegetables from the brine and discard them.  Be careful handling any hot liquid.
  • Unwrap turkey and remove any giblets/bags that are in the neck and/or body cavities.
  • Submerge the turkey in the brine for 24 hours.


Once the brining of the turkey is complete cooking can begin.  I use the following ingredients for the actual cooking.

  • Brined turkey
  • Olive Oil (Extra Virgin)
  • Poultry Spice
  • Pepper
  • Apple
  • 3 Carrots
  • Rosemary

I do not like to cook stuffing inside the turkey; this practice is considered unsafe by the USDA.  However, I do like to stuff the turkey with an apple, carrots and some rosemary to add to the taste and aroma of the cooked turkey.

  • Remove turkey from brine solution and rinse.
  • Add quartered apple, carrots and rosemary to the turkey's body cavity.
  • Brush olive oil on the outside skin of the entire turkey.
  • Sprinkle pepper and poultry spice over the outside of the turkey.
  • Insert a thermometer into the thickest part of the turkey's thigh (there are YouTube videos that show where this is).
  • Cover turkey in foil for the first 1.5 hours of cooking.
  • Cook the turkey until the thigh reaches 165 degrees Fahrenheit.  For precise instructions refer to the USDA.
  • Basting is not necessary but will result in more even browning of the turkey.
  • Once the thigh reaches 165 degrees, allow the turkey to rest for 30 minutes before serving.

This is what my turkey looked like just before cooking.


I use an electronic thermometer.  You can see the cord in the picture above and the actual thermometer in the picture below.



Here are some other recipes that I like for the Thanksgiving day meal.