So you Want to be a Data Scientist?

Harvard Business Review calls it the sexiest job of the 21st century. But what skills does a data scientist need, and how can you acquire them? I began as an advanced computer programmer with business knowledge. Open source involvement in Artificial Intelligence gave me the foundation to move into a data science role.

There are really three critical skills that a data scientist must possess. A data scientist must be a statistician, domain expert and hacker, not necessarily in that order. There are different types of data scientists, and each type will be stronger in one of these three skills. Let's take a look at each skill and see how you might build up your knowledge.

Statistician

Data scientists examine data to find insights and patterns. Seeing patterns in data is nothing new. Sir Ronald Fisher was doing this back in 1936. Fisher was interested in determining what sort of flower he was looking at. You might think it is easy to determine a flower type, and for humans it is. However, researchers are still determining exactly how humans perform this seemingly simple task.

How does a human likely recognize a flower? Most likely we are considering features about the flower. What color is it? How big is it? What does it smell like? Fisher wanted to determine the iris species using a specified set of features about each iris. To do this he collected four numeric measurements for 150 iris flowers. This included 50 flowers each from three different iris species. Using these irises he was able to make a statistical model that would tell him the iris species for a new flower using just these four measurements.
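To make the idea concrete, here is a minimal sketch of classifying a flower from its four measurements. Fisher's own model was a linear discriminant; this sketch swaps in a simpler nearest-centroid rule, and it uses only a handful of illustrative rows rather than all 150 flowers. The names `SAMPLES`, `centroids` and `classify` are my own for this example.

```python
from math import dist
from statistics import mean

# A few (sepal length, sepal width, petal length, petal width) rows in cm.
# Illustrative sample only; the real data set has 50 flowers per species.
SAMPLES = {
    "setosa": [(5.1, 3.5, 1.4, 0.2), (4.9, 3.0, 1.4, 0.2)],
    "versicolor": [(7.0, 3.2, 4.7, 1.4), (6.4, 3.2, 4.5, 1.5)],
    "virginica": [(6.3, 3.3, 6.0, 2.5), (5.8, 2.7, 5.1, 1.9)],
}


def centroids(samples):
    """Average each of the four measurements per species."""
    return {
        species: tuple(mean(column) for column in zip(*rows))
        for species, rows in samples.items()
    }


def classify(flower, species_centroids):
    """Return the species whose centroid is closest to the flower."""
    return min(
        species_centroids,
        key=lambda species: dist(flower, species_centroids[species]),
    )


CENTROIDS = centroids(SAMPLES)
print(classify((5.0, 3.4, 1.5, 0.2), CENTROIDS))
```

With so few rows this is only a toy, but the shape is the same as the real problem: learn something from labeled flowers, then apply it to a new one.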

There are many different ways to acquire statistical skills. If you are not familiar with basic statistics terms such as z-score, p-value, ANOVA and normal distribution, you should start with a Statistics 101-type class. Udacity offers several good choices for this. As your statistical skills grow, you will find Khan Academy to be indispensable. I’ve learned a great deal poring over Khan Academy and Wikipedia pages for several statistical models. As you advance, Artificial Intelligence and specifically Machine Learning will also become important. I’ve written several books in this space that might be useful for you.
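If a term like z-score is new to you, a few lines of code can help it click. The sketch below assumes the usual definition, (value - mean) / standard deviation, and uses the sample standard deviation from Python's `statistics` module. The petal lengths are illustrative values, and `z_scores` is a helper name made up for this example.

```python
from statistics import mean, stdev


def z_scores(values):
    """Return how many sample standard deviations each value is from the mean."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(value - mu) / sigma for value in values]


petal_lengths = [1.4, 1.4, 4.7, 4.5, 6.0, 5.1]  # illustrative values in cm
for value, z in zip(petal_lengths, z_scores(petal_lengths)):
    print(f"{value:4.1f} cm -> z = {z:+.2f}")
```

A z-score near zero means the value is typical for the sample; a large positive or negative z-score marks a value that sits far from the mean.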

Domain Expert

A domain expert is someone who has real-world knowledge about the data that they are analyzing. While Fisher was a statistician, he was also an evolutionary biologist and a geneticist. Fisher was a domain expert. The collected iris data was not just a sheet of numbers to him. He knew something about the iris flowers he was analyzing.

This domain knowledge allowed him to know what flower measurements to consider. Fisher measured the length and width of both the petal and sepal of each iris flower. He did not measure the roots, stem thickness or chemical makeup of each flower. Because Fisher knew something about flowers, he had an idea of which measurements to consider. He also knew if his results made any sense.

Being able to determine if your model makes any sense is critical. Dogs of the Dow was a popular investing strategy from the early 1990s that would seemingly pick winning portfolios based on very simple data. Just plug in the dividend yields of the 30 stocks in the Dow Jones Industrial Average (DJIA) to get a portfolio that supposedly beats the overall stock market average. Analysis of historic data created this model. The problem is that this model fit the historic data much better than it did the future data. Dogs of the Dow found a mostly coincidental pattern in the historic data. While there are some holdouts, Dogs of the Dow is now a largely discredited model.
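You can see how a coincidental pattern arises with a small simulation. This is a sketch under stated assumptions, not a model of Dogs of the Dow or of any real market: every "strategy" here is pure random noise with no edge at all, and the function name `simulate` and its parameters are invented for the example. We pick the strategy that looked best on the historic data and then check how it does on fresh data.

```python
import random


def average(values):
    """Plain arithmetic mean."""
    return sum(values) / len(values)


def simulate(n_strategies=2000, n_periods=50, seed=42):
    """Pick the best of many random strategies on past data, then test it on new data."""
    rng = random.Random(seed)
    # Each strategy is pure noise: its returns have no real edge (true mean is 0).
    historic = [[rng.gauss(0, 1) for _ in range(n_periods)] for _ in range(n_strategies)]
    future = [[rng.gauss(0, 1) for _ in range(n_periods)] for _ in range(n_strategies)]

    # "Discover" the strategy that looked best on the historic data.
    best = max(range(n_strategies), key=lambda i: average(historic[i]))
    return average(historic[best]), average(future[best])


in_sample, out_of_sample = simulate()
print(f"historic average return: {in_sample:+.3f}")
print(f"future average return:   {out_of_sample:+.3f}")
```

The winner looks impressive on the historic data only because it was chosen from thousands of candidates. Nothing about it carries forward, which is why domain knowledge matters when you ask whether a pattern makes sense.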

Becoming a domain expert is somewhat more elusive. The first question you should be asking is “what domain?” You might choose a domain such as finance, marketing, biology, or any other common business field. It will be helpful if you already have experience in a particular industry. If not, try to take some courses that will expose you to the data of a particular industry. Economics, marketing and finance classes are always good choices.

Hacker

Finally, a data scientist must be a hacker. At first this idea might seem strange. By hacker, I do not mean someone who attempts to circumvent computer security. For this definition, a hacker is a programmer. However, not every programmer is a hacker. A hacker is a programmer who will hack at a problem until that problem is solved. The hacker is not intimidated by hitting a brick wall. The hacker will come up with a very creative way around the brick wall, even if earlier attempts have all failed.

Fisher did not need to be a hacker. Fisher had 150 flowers to analyze. He measured each one by hand and made sure his data was clean and accurate. Now suppose Fisher had 150 million flowers, each measured by a mechanical process with a 90% accuracy rate. We would have a huge amount of somewhat inaccurate data. We would have a “Big Data” problem. Big Data is any data set that is so large that it is difficult to work with. Typically “Big Data” starts at the point where a data set can no longer fit in the memory of a single computer. Not long ago everything over 640K (the original usable memory size of a PC) was “Big Data”.
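One common way to cope with data that will not fit in memory is to process it as a stream, one record at a time. The sketch below is only an illustration of that idea: `measurements` is a hypothetical stand-in for reading rows from disk or a network, and `streaming_mean` keeps just a count and a running mean instead of the whole data set.

```python
def measurements(n):
    """Yield one hypothetical measurement at a time instead of building a list."""
    for i in range(n):
        yield 4.0 + (i % 5) * 0.5  # stand-in for reading a row from disk or a network


def streaming_mean(values):
    """Compute the mean in one pass, holding only a count and a running mean."""
    count = 0
    running = 0.0
    for value in values:
        count += 1
        running += (value - running) / count  # incremental mean update
    return running  # 0.0 if the stream was empty


print(streaming_mean(measurements(1_000_000)))
```

Memory use stays constant no matter how many flowers flow through, so the same code works for 150 rows or 150 million.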

The hacker can wrangle “Big Data” and get it into a form that a statistical model can handle. This wrangling might mean merging data from many sources, or writing automated programs to harvest data from the Internet. The hacker might need to clean the data in some way. The data might need further wrangling to even get it into a statistical model.
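Here is a small sketch of what that wrangling can look like: cleaning one source and merging it with another. Everything in it is hypothetical, including the field names `id` and `petal_length`, the sample rows, and the plausible range of 0 to 20 cm used to reject bad readings.

```python
def clean(rows):
    """Keep rows with a usable id and a plausible numeric petal length."""
    cleaned = {}
    for row in rows:
        try:
            flower_id = row["id"]
            length = float(row["petal_length"])
        except (KeyError, TypeError, ValueError):
            continue  # missing field or non-numeric measurement
        if 0.0 < length < 20.0:  # assumed plausible range, in cm
            cleaned[flower_id] = length
    return cleaned


def merge(measurements, labels):
    """Join cleaned measurements with species labels on the shared id."""
    return [
        {"id": fid, "petal_length": length, "species": labels[fid]}
        for fid, length in measurements.items()
        if fid in labels
    ]


# Hypothetical output of an automated scanner, with some bad readings.
scanner_rows = [
    {"id": 1, "petal_length": "1.4"},
    {"id": 2, "petal_length": "n/a"},
    {"id": 3, "petal_length": "-4.7"},
    {"id": 4, "petal_length": "5.1"},
]
# Hypothetical second source that maps flower ids to species.
species_by_id = {1: "setosa", 3: "versicolor", 4: "virginica"}

for row in merge(clean(scanner_rows), species_by_id):
    print(row)
```

Only the rows that survive cleaning and appear in both sources reach the statistical model. Real projects have messier rules, but the steps are the same: parse, validate, join.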

If you are already a computer programmer, then you have some of the hacker skills. If you are a computer programmer who spends their free time learning new programming skills and perhaps contributing to open source, then you might be a hacker! If not, try something new. Two of the most widely used data science languages are Python and R. Java, C# and C/C++ are also choices. Udacity and Coursera both offer several courses that let you sharpen your hacker skills for data science. The best way to learn to be a hacker is to hack. Practice examples and then experiment with data that interests you.

Acquiring Data Science Skills

To summarize, a data scientist must have three primary skills.

  • Mathematics & Statistics (statistician)
  • Business Domain Knowledge (real world knowledge)
  • Hacker (creative computer programmer)

There are also a number of programs available to teach data science.

There are also many great data science blogs.  I personally read the following.

For me, the road to data science started with programming. I worked for many years as a computer programmer in the life insurance industry. As a result, I learned quite a bit about life insurance data. I also learned how to crunch data and build reports to present it. Long before I had ever heard the term “data science,” I developed an interest in Artificial Intelligence. What does a hacker do when they are interested in something? I started experimenting and programming. I learned more about AI. This knowledge ultimately opened doors and I moved into more of a data science role.

If you are interested in AI, you might find some of my projects interesting. I maintain open source machine learning projects, write a blog and write books. I am also currently funding a book on nature-inspired algorithms on Kickstarter.

5 thoughts on “So you Want to be a Data Scientist?”

  1. WS

    Hi Jeff,

    Just wanted to let you know you've got a great site here :) and thank you for Encog! I'm looking forward to run the workbench and program with the DLL.

  2. Katie K.

    Great stuff! Agree with your emphasis on statistics. For another perspective, check out this highly-trafficked Quora post answering the same question: quora.com/how-do-i-become-a-data-scientist

    And for fellow readers interested in becoming Data Scientists, I suggest the Zipfian Academy 12-week immersive program. Our curriculum spans machine learning, statistics, big data and software engineering, and is a great fit for people who have solid background but need to take a leap forward in their skills: zipfianacademy.com
