Drew Conway describes data science as the combination of domain expertise, statistics, and hacking skills. If you are an IT programmer, you likely already have the last requirement. If you are a good IT programmer, you probably also understand something about the business data, which covers the domain expertise requirement. In this post I describe how I gained the statistics/machine learning knowledge through a path of open source involvement and publication.
There are quite a few articles that discuss how to become a data scientist. Some of them are even quite good! Most speak in very general terms. I wrote such a summary a while back that provides a very general description of what a data scientist is. Here, I will describe my own path to becoming a data scientist. I started out as a Java programmer in a typical IT job.
My publications were some of my earliest credentials. I started publishing before I had my bachelor's degree. My publications and side programming jobs were the major factors that helped me obtain my first "real" programming job, working for a Fortune 500 manufacturing company, back in 1995. At that point I did not yet have my first degree; I was working on a bachelor's degree part-time.
Back in the day, I wrote for publications such as C/C++ Users Journal, Java Developer's Journal, and Windows/DOS Developer's Journal. These were all paper-based magazines, often found on the racks at bookstores. The world has really changed since then! These days I publish code on sites like GitHub and CodeProject. A great way to gain experience is to find interesting projects to work on, using open source tools, and then post your projects to GitHub, CodeProject, and others.
I've always enjoyed programming and have applied it to many individual projects. Back in the 80's I was writing BBS software so that I could run a board on a C64, despite insufficient funds from high school jobs to purchase a RAM expander. In the 90's I was hooking up web cams and writing CGI, and later ASP/JSP code, to build websites. I wrote web servers and spiders from the socket up in C++. Around that time I wrote my first neural network. Always publish! A hard drive full of cool project code sitting in your desk is not telling the world what you've done. Support open source; a nice set of independent projects on GitHub looks really good.
Starting with AI
Artificial intelligence is closely related to data science. In many ways data science is the application of certain AI techniques to potentially large amounts of data. AI is also closely linked with statistics, an integral part of data science. I started with AI because it was fun. I never envisioned using it in my "day job". As soon as I got my first neural network working, I wrote an article for Java Developer's Journal. I quickly discovered that AI had a coolness factor that could help me convince editors to publish my software. I also published my first book on AI.
Writing code for a book is very different from writing code for a corporate or open source project.
- Book code: Readability and understandability are paramount; nothing else comes close.
- Corporate/open source code: Readability and understandability are still critical. However, real-world necessity often forces scalability and performance to take the front seat.
For example, if my book's main goal is to show how to use JSP to build a simple blog, do I really care if the blog can scale to the traffic seen by a top-100 website? Likewise, if my goal is to show how a backpropagation neural network trains, do I really want to muddy the water with concurrency?
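In that spirit, here is a minimal sketch of what book-style example code might look like: a single sigmoid neuron learning logical AND by gradient descent, with no concurrency or performance tricks. This is an illustrative sketch, not code from my books; the data, learning rate, and epoch count are arbitrary choices for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Tiny training set: logical AND
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

w1, w2, b = 0.0, 0.0, 0.0  # weights and bias
lr = 0.5                   # learning rate

for epoch in range(5000):
    for (x1, x2), target in data:
        out = sigmoid(w1 * x1 + w2 * x2 + b)
        err = out - target            # derivative of squared error w.r.t. output
        grad = err * out * (1 - out)  # chain rule through the sigmoid
        w1 -= lr * grad * x1
        w2 -= lr * grad * x2
        b  -= lr * grad

print(round(sigmoid(w1 * 1 + w2 * 1 + b)))  # → 1
```

Clear, sequential, and obviously not "industrial strength" — which is exactly the point of example code.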
The neural network code in my books is meant to be example code: a clear starting point for something. It is not meant to be "industrial strength". However, when people start asking you questions that indicate they are using your example code for "real projects", it is time to start (or join) an open source project! This is why I started the Encog project. It might be a path to an open source project for you, too!
Deepening my Understanding of AI
I've often heard that neural networks are the gateway drug to greater artificial intelligence. Neural networks are interesting creatures. They have risen and fallen from grace several times. Currently they are back, and with a vengeance: most implementations of deep learning are based on neural networks. If you would like to learn more about deep learning, I am currently running a Kickstarter campaign on that very topic. [More info here]
I took several really good classes from Udacity, just as they were introduced. These classes have since been somewhat re-branded, but Udacity still offers several great AI and machine learning courses. I also recommend (and have taken) the Johns Hopkins Coursera Data Science specialization. It's not perfect, but it will expose you to many concepts in AI. You can read my summary of it here.
Also, learn statistics, at least the basics of classical statistics. You should understand concepts like mean, mode, median, linear regression, ANOVA, MANOVA, Tukey's HSD, and p-values. A simple undergraduate course in statistics will give you the foundation. You can build on more complex topics, such as Bayesian networks and belief networks, later. Udacity has a nice intro to statistics course.
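Several of those basics can be checked with nothing more than the Python standard library. The sample values below are invented purely for illustration; the regression is fit by ordinary least squares.

```python
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]

print(statistics.mean(sample))    # → 5
print(statistics.median(sample))  # → 4.5
print(statistics.mode(sample))    # → 4
print(statistics.stdev(sample))   # sample standard deviation

# Simple linear regression by least squares: y ≈ slope * x + intercept
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.2, 5.9, 8.1, 9.8]
mx, my = statistics.mean(xs), statistics.mean(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx
print(round(slope, 2), round(intercept, 2))  # → 1.93 0.23
```

If you can explain what each of those numbers means, and when it misleads, you have the classical foundation covered.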
Public projects are always a good thing. My projects have brought me speaking opportunities and book opportunities (though I mostly self-publish now). Kickstarter has been great for this. I launched my Artificial Intelligence for Humans series of books through Kickstarter, and I currently have volume three running as a Kickstarter. [More info here]
From AI to Data Science
When data science first started to enter the public scene, I was working as a Java programmer and writing AI books as a hobby. A job opportunity at my current company later opened up in data science. I did not even realize that the opportunity was available; I really was not looking. However, during the recruiting process they discovered that someone with knowledge of the needed areas lived right here in town: they had found my project pages. This led to some good opportunities right in my current company.
The point is, get your projects out there! If you don't have an idea for a project, then enter Kaggle. Try to become a Kaggle master; that will be hard, and you probably won't win, but you will learn quite a bit trying. Write about your efforts and post your code to GitHub. If you use open source tools, write to their creators and send links to your efforts; open source creators love to post links to people who are actually using their code. For bigger projects (with many or institutional creators), post to their communities. Kaggle gives you a problem to solve, and even if you do not win, it gives you something to talk about during an interview.
Deepening my Knowledge as a Data Scientist
I try to always be learning. You will always hear terminology that you feel you should know, but do not. This happens to me every day. Keep a list of what you don't know, and keep prioritizing and tackling the list (dare I say backlog grooming). Keep learning! Get involved in projects like Kaggle and read the discussion boards. This will show you what you do not know really quickly. Write tutorials on your efforts. If something was hard for you, it was hard for others, who will appreciate a tutorial.
I've seen a number of articles that question "Do you need a PhD to work as a data scientist?" The answer is that it will help, but is not necessary. I know numerous data scientists with varying levels of academic credentials. A PhD demonstrates that someone can follow the rigors of formal academic research and extend human knowledge. When I became a data scientist I was not a PhD student.
At this point, I am a PhD student in computer science; you can read more about that here. I want to learn the process of academic research because I am starting to look at algorithms and techniques that would qualify as original research. Additionally, I've given advice to several other PhD students who were using my open source projects in their dissertations. It was time for me to take the leap.
Data science is described, by Drew Conway, as the intersection of hacking skills, statistics, and domain knowledge. As an IT programmer, you most likely already have two of these skills. Hacking skills are the ability to write programs that can wrangle data into many different formats and automate processes. Domain knowledge is knowing something about the business that you are programming for. Is your business data just a bunch of columns to you? An effective IT programmer learns about the business and its data. So does an effective data scientist.
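The "hacking skills" part is mostly this kind of everyday data wrangling. As a small sketch, here is a group-and-aggregate over a CSV export using only the standard library; the column names and records are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw export: columns and values invented for illustration.
raw = """region,product,amount
East,widget,120.50
West,widget,80.00
East,gadget,45.25
West,gadget,60.00
East,widget,30.00
"""

# Total the amount column per region.
totals = defaultdict(float)
for row in csv.DictReader(io.StringIO(raw)):
    totals[row["region"]] += float(row["amount"])

for region in sorted(totals):
    print(f"{region}: {totals[region]:.2f}")
# → East: 195.75
#   West: 140.00
```

Nothing glamorous, but being able to reshape data like this quickly, and to automate it, is exactly the skill Conway's diagram is pointing at.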
This really leaves only statistics (and machine learning/AI). You can learn that from books, MOOCs, and other sources; some were mentioned earlier in this article. I have a list of some of my favorites here. I also have a few books to teach you about AI.
Most importantly, tinker and learn. Build and publish projects, blog, and contribute to open source. When you talk to someone interested in hiring you as a data scientist, you will have experience to talk about. Also maintain a GitHub profile, linked from LinkedIn, that shows you do in fact have something to talk about.