End of 2nd Semester in Computer Science PhD Program

I am done with the second semester of my PhD program. I am now half done with the required coursework. I am not taking a class this summer; I will be busy with AIFH Volume 3.

The two classes that I took this semester were:

  • CISD 792: Computer Graphics, taught by Dr. Laszlo
  • ISEC 730: Computer Security and Cryptography, taught by Dr. Cannady

Both classes were really good. The computer graphics class introduced Three.JS, which we used for most of our assignments. I like Three.JS and will be using it for some of the JavaScript examples in my books. My class project was a 3D flocking algorithm that I have already included in my book examples; you can see it here.

The security class focused mainly on reproducing the research of a security paper and then writing a dissertation proposal in a similar topic area. I used Encog to reproduce the research of a paper that built a neural network based intrusion detection system. You can see the report and my source code here. I am not posting my dissertation idea paper, in case that is a direction I choose to pursue for my own dissertation.

It was a good semester, and it took quite a bit of work. I was able to get an A in both classes. I have two more semesters of classes; then I will focus on research and preparing for the dissertation.

Quick R Tutorial: The Big-O Chart

I needed an original Big-O chart for a publication. This tutorial does not cover what Big-O actually is, just how to chart it. If you want more information on Big-O, I recommend reading this and this. I really like using R for all things chart and visualization; I know, go figure, for a computer science student. So, I turned to R for this task. While this is a very simple chart, it does demonstrate several very common tasks in R:

  • Overlaying several plots
  • Adding a legend to a plot
  • Using "math symbols" inside of the text on a chart
  • Stripping off all the extra whitespace that R likes to generate

You can see the end-result here:

[Figure: big-oh — the finished Big-O chart]

This was produced with the following R code:

# Remove white space on chart
par(mar=c(4,4,1,1)+0.1)

# The x & y limits for the plot
xl <- c(0,100)
yl <- c(0,1000)

# Just use R's standard list of colors for the lines (I am too tired to be creative this morning)
pal <- palette()

# Plot each of the equations that we are interested in. Note the add=TRUE
# it causes them to overlay each other.
plot(function(n) (rep(1,length(n))),xlim=xl,ylim=yl,col=pal[1],xlab="Elements(N)",ylab="Operations")
plot(function(n) (log(n)),xlim=xl,ylim=yl,col=pal[2],add=TRUE)
plot(function(n) (n),xlim=xl,ylim=yl,col=pal[3],add=TRUE)
plot(function(n) (n*log(n)),xlim=xl,ylim=yl,col=pal[4],add=TRUE)
plot(function(n) (n^2),xlim=xl,ylim=yl,col=pal[5],add=TRUE)
plot(function(n) (2^n),xlim=xl,ylim=yl,col=pal[6],add=TRUE)
plot(function(n) (factorial(n)),xlim=xl,ylim=yl,col=pal[7],add=TRUE)

# Generate the legend. Note the use of expression() to handle n raised to a power.
legend('topright', c("O(1)","O(log(n))","O(n)","O(n log(n))",
 expression(O(n^2)), expression(O(2^n)), "O(n!)"),
 lty=1, col=pal, bty='n', cex=.75)

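Since the chart was destined for a publication, it also helps to render it straight to an image file rather than to the screen. Here is a minimal sketch of that, wrapping the same style of plotting code in a graphics device; the file name and dimensions are just placeholders:

# Render the chart to a PNG file instead of the screen (file name is a placeholder)
png("big-o.png", width = 800, height = 600)
par(mar = c(4, 4, 1, 1) + 0.1)   # same whitespace trick as above
plot(function(n) n, xlim = c(0, 100), ylim = c(0, 1000),
     xlab = "Elements(N)", ylab = "Operations")
# ... the remaining plot() and legend() calls from above go here ...
dev.off()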

Using Encog to Replicate Research

For one of my PhD courses at Nova Southeastern University (NSU), it was necessary to reproduce the research of the following paper:

I. Ahmad, A. Abdullah, and A. Alghamdi, “Application of artificial neural network in detection of probing attacks,” in IEEE Symposium on Industrial Electronics Applications, 2009. ISIEA 2009., vol. 2, Oct 2009, pp. 557–562.

This paper demonstrated how to use a neural network to build a basic intrusion detection system (IDS) for the KDD99 dataset. Reproducing research is important in an academic setting. It means that you were able to obtain the same results as the original researchers, using the same techniques. I do this often when I write books or implement parts of Encog. It allows me to convince myself that I have implemented an algorithm correctly, and as the researchers intended. I don't always agree with what the original researchers did. If I change something when I implement it in Encog, I am now in the area of "original research," and my changes must be labeled as such.

Some researchers are more helpful than others when it comes to replicating their work. Additionally, neural networks are stochastic (they use random numbers), and basing conclusions on a small number of runs is usually a bad idea when dealing with a stochastic system. A small number of runs led the above researchers to conclude that two hidden layers were optimal for their dataset. Unless you are dealing with deep learning, this is almost never the case. The universal approximation theorem shows that a single hidden layer is sufficient for the old-school sort of perceptron neural network used in this paper. Additionally, the vanishing gradient problem prevents the RPROP training used by the researchers from fitting well as the number of hidden layers grows. The researchers tried up to 4 hidden layers.

For my own replication I used the same dataset, with many training runs to make sure that their results fell within my high-low range. To show that a single hidden layer does better, I used ANOVA and Tukey's HSD to confirm that the differences among the neural network architectures were statistically significant. My box-and-whisker plot shows that training runs with a single hidden layer converged more consistently to a better mean RMSE.
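If you have not run this kind of analysis before, the R sketch below shows the general pattern I followed. The data frame, architecture labels, and numbers are hypothetical placeholders, not my actual results.

# Hypothetical example: final RMSE from 30 training runs of each architecture
set.seed(42)
results <- data.frame(
  architecture = factor(rep(c("1-hidden", "2-hidden", "3-hidden", "4-hidden"), each = 30)),
  rmse = c(rnorm(30, 0.08, 0.01), rnorm(30, 0.10, 0.02),
           rnorm(30, 0.12, 0.03), rnorm(30, 0.15, 0.04))
)

# One-way ANOVA: do the architectures differ in mean RMSE?
fit <- aov(rmse ~ architecture, data = results)
summary(fit)

# Tukey's HSD: which pairs of architectures differ significantly?
TukeyHSD(fit)

# Box-and-whisker plot of RMSE by architecture
boxplot(rmse ~ architecture, data = results,
        xlab = "Architecture", ylab = "RMSE")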

I am attaching both my paper and code in case they are useful. The code is also a decent tutorial on using the latest Encog code to normalize and fit a model to a data set.

The class also required us to write up the results in IEEE conference format. I am a fan of LaTeX, so that is what I used.

  • Source code: [Source Code Link], which includes:
    • Python data prep script
    • R code used to produce graphics and stat analysis
    • Java code to run the training
  • My report for download (PDF): [Paper Link]
  • My report on ResearchGate: [Link]

The code is under LGPL, so feel free to reuse.

How I Got Into Data Science from IT Programming

Drew Conway describes a data scientist as the combination of domain expertise, statistics, and hacker skills. If you are an IT programmer, you likely already have the last requirement. If you are a good IT programmer, you probably already understand something about the business data, and so meet the domain expertise requirement. In this post I describe how I gained knowledge of statistics and machine learning through a path of open source involvement and publication.

There are quite a few articles that discuss how to become a data scientist. Some of them are even quite good! Most speak in very general terms. I wrote such a summary a while back that provides a very general description of what a data scientist is. In this post, I will describe my own path to becoming a data scientist. I started out as a Java programmer in a typical IT job.

Publications

My publications were some of my earliest credentials. I started publishing before I had my bachelor's degree. My publications and side programming jobs were the major factors that helped me obtain my first "real" programming job, working for a Fortune 500 manufacturing company, back in 1995. I did not have my first degree at that point; in 1995 I was working on a bachelor's degree part-time.

Back in the day, I wrote for publications such as C/C++ Users Journal, Java Developers Journal, and Windows/DOS Developer's Journal. These were all paper-based magazines, often on the racks at bookstores. The world has really changed since then! These days I publish code on sites like GitHub and CodeProject. A great way to gain experience is to find interesting projects to work on, using open source tools. Then post your projects to GitHub, CodeProject, and other such sites.

I've always enjoyed programming and have applied it to many individual projects. Back in the 80's I was writing BBS software so that I could run a board on a C64, despite insufficient funds from high school jobs to purchase a RAM expander. In the 90's I was hooking up web cams and writing CGI, and later ASP/JSP, code to build websites. I wrote web servers and spiders from the socket up in C++. Around that time I wrote my first neural network. Always publish! A hard drive full of cool project code sitting in your desk drawer tells the world nothing about what you've done. Support open source; a nice set of independent projects on GitHub looks really good.

Starting with AI

Artificial intelligence is closely related to data science. In many ways data science is the application of certain AI techniques to potentially large amounts of data. AI is also closely linked with statistics, an integral part of data science. I started with AI because it was fun; I never envisioned using it in my "day job". As soon as I got my first neural network done, I wrote an article about it for Java Users Journal. I quickly discovered that AI had a coolness factor that could help me convince editors to publish my articles. I also published my first book on AI.

Writing code for a book is very different than writing code for a corporate project/open source project.

  • Book code: Readability and understandability are paramount, second to none.
  • Corporate/Open Source Code: Readability and understandability are critical. However, real-world necessity often forces scalability and performance to take the front seat.

For example, if my book's main goal is to show how to use JSP to build a simple blog, do I really care if the blog can scale to the traffic seen by a top-100 website?  Likewise, if my goal is to show how a backpropagation neural network trains, do I really want to muddy the water with concurrency?

The neural network code in my books is meant to be example code: a clear starting point for something. It is not meant to be "industrial strength". However, when people start asking questions that indicate they are using your example code for "real projects", it is time to start (or join) an open source project! This is why I started the Encog project. This might be a path to an open source project for you as well!

Deepening my Understanding of AI

I've often heard that neural networks are the gateway drug to greater artificial intelligence. Neural networks are interesting creatures. They have risen and fallen from grace several times. Currently they are back, with a vengeance. Most implementations of deep learning are based on neural networks. If you would like to learn more about deep learning, I am currently running a Kickstarter campaign on that very topic. [More info here]

I took several really good classes from Udacity, just as they were introduced. These classes have since been somewhat re-branded. However, Udacity still has several great AI and machine learning courses. I also recommend (and have taken) the Johns Hopkins Coursera Data Science specialization. It's not perfect, but it will expose you to many concepts in AI. You can read my summary of it here.

Also, learn statistics, at least the basics of classical statistics. You should understand concepts like mean, mode, median, linear regression, ANOVA, MANOVA, Tukey's HSD, p-values, etc. A simple undergraduate course in statistics will give you the foundation. You can build on more complex topics such as Bayesian networks, belief networks, and others later. Udacity has a nice intro to statistics course. A few of these basics appear in the short R sketch below.
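Here is a quick taste of what a few of these look like in practice, using R's built-in mtcars data set; the data set and examples are my own illustration, not from any particular course.

# A few of the basics, using R's built-in mtcars data set
data(mtcars)

mean(mtcars$mpg)      # mean
median(mtcars$mpg)    # median

# Simple linear regression: fuel economy as a function of weight
model <- lm(mpg ~ wt, data = mtcars)
summary(model)        # coefficients, p-values, R-squared

# One-way ANOVA: does mpg differ by number of cylinders?
fit <- aov(mpg ~ factor(cyl), data = mtcars)
summary(fit)
TukeyHSD(fit)         # pairwise comparisons with Tukey's HSD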

Kickstarter Projects

Public projects are always a good thing. My projects have brought me speaking opportunities and book opportunities (though I mostly self-publish now). Kickstarter has been great for this. I launched my Artificial Intelligence for Humans series of books through Kickstarter. I currently have volume three running as a Kickstarter. [More info here]

From AI to Data Science

When data science first started to enter the public scene, I was working as a Java programmer and writing AI books as a hobby. A job opportunity in data science later opened up at my current company. I did not even realize that the opportunity was available; I really was not looking. However, during the recruiting process they discovered that someone with knowledge of the needed areas lived right here in town. They had found my project pages. This led to some good opportunities right in my current company.

The point is, get your projects out there! If you don't have an idea for a project, then enter Kaggle. You probably won't win. Try to become a Kaggle master; that will be hard, but you will learn quite a bit trying. Write about your efforts. Post code to GitHub. If you use open source tools, write to their creators and send links to your efforts. Open source creators love to post links to people who are actually using their code. For bigger projects (with many or institutional creators), post to their communities. Kaggle gives you a problem to solve; you don't have to win, and it will give you something to talk about during an interview.

Deepening my Knowledge as a Data Scientist

I try to always be learning. You will always hear terminology that you feel like you should know, but do not. This happens to me every day. Keep a list of what you don't know, and keep prioritizing and tackling the list (dare I say backlog grooming). Keep learning! Get involved in projects like Kaggle and read the discussion boards. This will show you what you do not know really quickly. Write tutorials on your efforts. If something was hard for you, it was hard for others, who will appreciate a tutorial.

I've seen a number of articles that ask, "Do you need a PhD to work as a data scientist?" The answer is that it helps, but it is not necessary. I know numerous data scientists with varying levels of academic credentials. A PhD demonstrates that someone can follow the rigors of formal academic research and extend human knowledge. When I became a data scientist, I was not a PhD student.

At this point, I am a PhD student in computer science; you can read more about that here. I want to learn the process of academic research because I am starting to look at algorithms and techniques that would qualify as original research. Additionally, I've given advice to several other PhD students who were using my open source projects in their dissertations. It was time for me to take the leap.

Conclusions

Data science is described, by Drew Conway, as the intersection of hacker skills, statistics, and domain knowledge. As an "IT programmer" you most likely already have two of these skills. Hacker skills are the ability to write programs that can wrangle data into many different formats and automate processes. Domain knowledge is knowing something about the business that you are programming for. Is your business data just a bunch of columns to you? An effective IT programmer learns about the business and its data. So does an effective data scientist.

That leaves only statistics (and machine learning/AI). You can learn that from books, MOOCs, and other sources, some of which were mentioned earlier in this article. I have a list of some of my favorites here. I also have a few books to teach you about AI.

Most importantly, tinker and learn. Build and publish projects, blog, and contribute to open source. When you talk to someone interested in hiring you as a data scientist, you will have experience to talk about. Also have a GitHub profile, linked to your LinkedIn, that shows you do in fact have something to talk about.

PhD Update: Second Cluster Visit for Winter 2015 Semester at NSU

I just got back to St. Louis from my second cluster visit to NSU for the PhD in computer science. I was feeling pretty good about choosing a distance learning program at a university in Ft. Lauderdale, Florida, after the really cold weather we've been seeing in St. Louis lately. There was a new blanket of snow on my driveway the morning that I left for the airport. I just drove over it and headed out; the snow had melted by the time I returned. The regular (non-distance-learning) students were all on spring break, so the campus was unusually empty. I never did take a spring break trip as an undergrad, so it works out that I have to go to Florida 4 times a year for my doctoral program.

I am taking two classes: CISD 792: Computer Graphics, taught by Dr. Laszlo, and ISEC 730: Computer Security and Cryptography, taught by Dr. Cannady. The computer graphics class focuses on Three.JS and OpenGL, and consists of numerous programming assignments. I am learning about 3D programming, which I think will be very useful for some visualizations that I might want to do for my books. The security class is more focused on writing. I did a decent amount of programming to replicate the research of a paper that applied neural networks to intrusion detection. I will post my Java code for this later, as it is a decent Encog tutorial. One of several reasons that I entered a PhD program was to learn academic writing, so the security class is working out well. I am finding both classes very beneficial and interesting.

This is my fourth trip to Ft. Lauderdale for the program. My wife has come along with me each time so far. We usually try to do at least one "tourist activity" each time. This time we went to see a spring training baseball game that featured our home team, the St. Louis Cardinals, against the Miami Marlins (note to everyone, especially lawyers: both of those names are trademarks of the MLB, and MLB is also a trademark of the MLB). The St. Louis team won, so this made for a particularly enjoyable game!

I also kept tabs on my Kickstarter campaign while traveling. So far the deep learning and neural network campaign is going well! If you would like to back it, and obtain my latest book, click here.

I took this picture at the student center at NSU.  They have a really cool Dr. Who police box.  You can also see the palm trees out the window.  They have a beautiful campus!  And now they have a police box!

[Photo: jheaton_nsu_drwho — the Dr. Who police box at the NSU student center]

Kickstarter started for Artificial Intelligence for Humans Volume 3: Deep Learning and Neural Networks

I just started my third Kickstarter project. This is the third volume in my Artificial Intelligence for Humans (AIFH) series. This book will cover the same sort of neural network material as some of my earlier books, but it will also include newer topics in neural networks, such as deep learning, convolutional neural networks, NEAT, and HyperNEAT. The book will be published by December 31, 2015. The planned table of contents is given here:

  • Chapter 1: Neural Network Basics
  • Chapter 2: Classic Neural Networks
  • Chapter 3: Feedforward Neural Networks
  • Chapter 4: Propagation Training
  • Chapter 5: Recurrent Neural Networks
  • Chapter 6: Radial Basis Function Networks
  • Chapter 7: NEAT and HyperNEAT Neural Networks
  • Chapter 8: Pruning and Model Selection
  • Chapter 9: Dropout and Regularization
  • Chapter 10: Architecting Neural Networks
  • Chapter 11: Deep Belief Neural Networks
  • Chapter 12: Deep Convolutional Network
  • Chapter 13: Modeling with Neural Networks

If you think this sort of book might interest you, I would really appreciate your support in the Kickstarter project!


Done with Coursera Johns Hopkins Data Science Specialization

I am done with the Coursera Johns Hopkins Data Science specialization. This is my first specialization earned from Coursera. The final step for me was the capstone project; prior to the capstone there were 9 other courses I needed to take. The whole process took about 8-9 months. This post is primarily about the capstone project. You can read my opinions on the individual courses in the following blog posts:

Once you complete all ten courses, including the capstone, you are issued a certificate of completion. This certificate is publicly shareable. You can see my certificate here.

I am probably not the typical student for this program. I am a part-time computer science PhD student and a full-time data scientist for a large insurance company. While many of the concepts were review, this program forced me to use the R programming language. Left to my own devices, I typically use Java and Python for data science. I also learned to use R Publish, Shiny, and R Markdown, and I learned about reproducible research. Some of the topics covered in reproducible research were useful to me in my PhD program.

I really liked this program. Courses 1-9 provide a great introduction to the predictive modelling side of data science; both machine learning and traditional regression models are covered. R can be a slow and painful language at times, but I was able to get through. It is my opinion that R is primarily useful for ferrying data between models and visualizations; it is not good for heavy lifting and data wrangling. The syntax of R is somewhat appalling. However, it is a domain-specific language (DSL), not a general-purpose language like Python. Don't get me wrong: I like R for setting up models and graphics, just not for performing tasks better suited to a general-purpose language.

The capstone project was to produce a program similar to Swiftkey, the company that was the partner/sponsor for the capstone. If you are not familiar with Swiftkey, it attempts to speed up mobile text input by predicting the next word you are going to type. For example, you might type "to be or not to ____", and the application should fill in "be". The final program had to be written in R and deployed to a Shiny server.

This project was somewhat flawed in several regards.

  • Natural Language Processing was not covered in the course.  Neither was unstructured data.  The only material provided on NLP was a handful of links to sites such as Wikipedia.
  • The first 9 courses had a clear direction.  However, less than half of them had anything to do with the capstone.
  • The project is not typical of what you would see in most businesses as a data scientist.  It would have been better to do something similar to Kaggle or one of the KDD cups.
  • In my opinion, R is a bad choice for this sort of project. During the meetup with Swiftkey, they were asked what tools they used; R was not among them. R is great for many things, so why not pick a project that showcases those abilities?
  • Student peer review is bad... bad... bad... but it might be the only choice. The problem with peer review is that you have three random reviewers. They might be easy, they might be hard. They might even penalize you for the fact that they don't know how to load your program! (This happened to me in a previous Coursera course.)
  • Perfect scores on the quizzes were not really possible. We were given several sample sentences to predict. The sentences were very specialized, and no model would predict them correctly; Swiftkey surely did not. Using my own human intuition and several text mining apps I wrote in Java, I did get 100% on the quizzes, even though the instructions clearly said to use your final model. Knowing I might draw a short straw on peer review, I opted to do what I could to get maximum points. I don't care about my grade, but falling below the cutoff because of a bad peer review would not be cool!
  • The final project had a marketing-based rubric. One of the grading criteria posed the question, "Would you hire this person?" Seriously? I do participate in the hiring process for data scientists. I would never hire someone without meeting them, performing a tech interview, and giving a small coding challenge. I hope this stat is not used in marketing material: "xx% of our graduates produced programs that might land them a job."

After spending several days writing very slow model-building code in R, I eventually dropped it and used Java and OpenNLP to write code that would build my model in under 20 minutes. Others ran into the same issues. There are somewhat kludgy interfaces between R and OpenNLP, and between R and Weka, but these are native Java libraries. I just skipped the kludge, built my model in Java, and wrote a Shiny app to use the model in R. This was enough to pass the program. I was not alone in this approach, based on forum comments. A rough sketch of what the R side of such a lookup can look like is below.
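For a rough idea of how thin the R side can be, here is a minimal sketch of the next-word lookup; the tiny trigram table is a hypothetical stand-in for the counts a real model-building step would produce.

# Minimal sketch of the R side: next-word lookup against a precomputed
# trigram table. The table here is a tiny hypothetical stand-in; a real app
# would load counts produced by the Java/OpenNLP step.
ngrams <- data.frame(
  w1 = c("not", "or", "to"),
  w2 = c("to", "not", "be"),
  prediction = c("be", "to", "or"),
  count = c(42L, 17L, 13L),
  stringsAsFactors = FALSE
)

predict_next <- function(phrase) {
  words <- tolower(unlist(strsplit(phrase, "\\s+")))
  n <- length(words)
  if (n < 2) return(NA_character_)

  # Match the last two words typed and return the most frequent continuation
  matches <- ngrams[ngrams$w1 == words[n - 1] & ngrams$w2 == words[n], ]
  if (nrow(matches) == 0) return(NA_character_)
  matches$prediction[which.max(matches$count)]
}

predict_next("to be or not to")  # returns "be" with this toy table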

Final Thoughts

Okay, I will just say it: I thought this was a bad capstone. The rest of the program was really good! If I could make a suggestion, I would let the students choose a Kaggle competition to compete in. The Kaggle competitions are closer to the sort of data that real data scientists will see. I am proud of the certificate that I earned. If I were interviewing someone who had this certificate, I would consider it a positive; the candidate would still need to go through a standard interview/evaluation process.