50% Done with Johns Hopkins COURSERA Data Science Specialization, My Review of Classes 3&4

I am 50% done with the Johns Hopkins COURSERA Data Science Specialization, and continuing onward.  I took courses 1 through 3 in a single batch, and found them useful.  You can read my review of courses 1 through 3 here.  I decided not to take three classes in parallel again.  I am learning the R programming language as I go, so these courses can take some time.  I plan on sticking to two courses at a time until I complete the 9 courses.  This will allow me to tackle the final capstone project sometime this fall.

I am taking the signature track.  This means that COURSERA issues me a verification certificate for each course.  You can see my certificates for these two courses here.

My grades were high enough that I earned the "with distinction" note.  This has been the case with all five classes so far.  I hope that I am not giving the impression that these courses are easy.  Short of loading data (that I had preprocessed elsewhere) into R for modeling, I had not really dealt with R much.  My goals for these classes are twofold: first, to finally sit down and really learn R; second, to see more applications of data science in new domains.  As a data scientist, I work primarily with life insurance industry data.

Both of these classes really fit the bill.  Each gives me insight into how data science is used in academic, peer-reviewed research.  Given that I am entering a distance-learning Computer Science Ph.D. program, this is very good exposure for me.

Review of Exploratory Data Analysis

Exploratory Data Analysis taught me all about charting in R.  There are three central charting systems for R, each with its own uses, and this class taught all three: base graphics, lattice, and ggplot2.  These packages all have their place.  The ggplot2 package (an implementation of the "grammar of graphics") attempts to address some of the shortcomings of the earlier charting systems.

The entire point of exploratory data analysis is to produce quick charts to allow you to get a handle on what the data look like.  These are not formal graphics that you will include in your report, though they might become that.
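
To give a flavor of the three systems, here is a minimal sketch that draws the same quick exploratory histogram in each, using the built-in mtcars data set (and assuming the lattice and ggplot2 packages are installed):

```r
# The same quick exploratory chart in all three R charting systems.
library(lattice)
library(ggplot2)

hist(mtcars$mpg)                      # base graphics: one quick call
histogram(~ mpg, data = mtcars)       # lattice: formula interface
print(ggplot(mtcars, aes(x = mpg)) +  # ggplot2: built up layer by layer
        geom_histogram(bins = 10))
```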

Review of Reproducible Research

If you've read many academic articles, you know how hard it can be to reproduce exactly what the researcher did.  The goal of this course is to automate the entire numerical analysis as a simple R script.  This lets the reader know exactly where every number came from and what the process was.  Many academic papers lack any such direction.
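
As a minimal sketch of the idea (the file name below is hypothetical), every number in the final write-up is computed by the script rather than typed in by hand:

```r
# Hypothetical example: the report's numbers come straight from the script,
# so a reader can rerun the analysis and see where each figure came from.
data <- read.csv("measurements.csv")   # hypothetical input file

m <- mean(data$value)
s <- sd(data$value)

cat(sprintf("The mean measurement was %.2f (SD = %.2f), n = %d.\n",
            m, s, nrow(data)))
```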

Conclusions

The next course for me is Practical Machine Learning, which I am taking by itself.  I only have two courses to finish before the capstone.  This should allow me to complete the coursework (but not the capstone) by September.  In September, I start a Ph.D. in computer science, so I want to keep my fall schedule clear.

Signed up for fall 2014 semester at NSU for CS PhD

I am now signed up for my fall classes at Nova Southeastern University.  This is my first semester in the Computer Science Ph.D. program.  I am taking 8 credit hours, consisting of two 4-credit-hour courses.  In August I will travel to their campus in Ft. Lauderdale, Florida; the distance format requires me to travel to campus twice a semester.  I am looking forward to my first semester, and to learning how their distance Ph.D. program works.

I signed up for two courses: CISD 760, which covers artificial intelligence, and a database systems course.

These are the books that I will use for the first semester.


I already owned the Russell and Norvig AIMA book that will be used in CISD 760, and am somewhat familiar with it.  This book is an absolute classic in the AI field, and I am very glad that it will be the course book for my class.

I am also looking forward to the database systems class.  I am not familiar with the textbook, but it seems to cover quite a few topics.

I am looking forward to the semester.  I am sure both courses will be a fair amount of work.  I timed the writing of volume 2 of AIFH to finish just before courses begin.  I will post more about the semester as it begins and I see exactly what I've gotten myself into. :)


Review of Traveling Salesman (the Movie)

This weekend I watched Traveling Salesman (the movie).  Overall, I was not that impressed, though I came in with very high hopes.  The name of the movie comes from the fact that the Traveling Salesman Problem is a prime example of an NP-hard problem.  I found it interesting that nowhere in the movie was the Traveling Salesman Problem (TSP) actually explained, let alone much of a definition of NP-hard; I am not entirely sure the words "traveling salesman" even came up once.  There was a brief mention of the knapsack problem.  This makes the movie a bit difficult to follow for someone without a computer science background.  I watched it with my wife, and needed to explain TSP and NP-hard.
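
For readers in the same position as my wife, here is a minimal sketch of the problem the movie never defines: given distances between cities, find the shortest tour that visits every city once and returns home.  (This uses permn() from the combinat package, which I assume is installed; brute force like this is exactly what becomes hopeless as the city count grows.)

```r
# Brute-force TSP on a toy 4-city distance matrix: try every tour and keep
# the shortest. The factorial growth in tours is why TSP is hard in general.
library(combinat)  # assumed installed; provides permn()

dist_m <- matrix(c(0, 2, 9, 10,
                   2, 0, 6,  4,
                   9, 6, 0,  8,
                  10, 4, 8,  0), nrow = 4, byrow = TRUE)

tour_len <- function(perm) {
  path <- c(1, perm, 1)                      # start and end at city 1
  sum(dist_m[cbind(path[-length(path)], path[-1])])
}

tours <- permn(2:4)                          # all tours starting at city 1
lens <- sapply(tours, tour_len)
best <- tours[[which.min(lens)]]
cat("Best tour: 1 ->", paste(best, collapse = " -> "), "-> 1, length",
    min(lens), "\n")
```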

What the movie is actually about is not evident from the title.  "P vs NP" is not the centerpiece of the plot; it is simply a stand-in for any great technical advance that a government might want to control.  Strong AI, cold fusion, human cloning, or FTL travel could have filled this role: any advance so great that it will alter the course of history, most likely to the advantage of the nation that wields it.  The movie makes many obvious comparisons to the Manhattan Project.

I really had a hard time seeing "P vs NP" as being on the same level as the Manhattan Project.  If they wanted this level of comparison, they could at least have picked a technology that might actually happen.  P is almost certainly not equal to NP; in a 2012 poll, only 9% of computer scientists believed that P might equal NP.  If we could solve NP problems in polynomial time, it would be roughly equivalent to having a nearly infinitely fast computer.  Such a machine could decrypt nearly anything, allowing cyber warfare on an epic scale.

Movies, like other art forms, are subject to interpretation.  What a movie is actually "about" is typically a personal interpretation.  For me, TSP the Movie was a very lengthy dialog between several scientists and a government that wants to lock away their discovery.  Who is "accountable" for the use or misuse of the technology: the scientists or the government?  Should the United States have published its research on the Manhattan Project?  And what will a government ultimately do to protect its secrets?  If you want to watch a talky, hour-long debate on these topics, this movie is for you!  Traveling Salesman (the movie) has little to do with the Traveling Salesman (the problem).  Overall, I feel I was conned into a political debate, and, for me, not a very compelling argument for the liberation of information.

What are my favorite eLearning opportunities?

In a Twitter conversation, someone asked me what my favorite eLearning classes were.  An interesting question, and an even more interesting topic for a blog post!

Time is limited, so I am forced to prioritize.  Here are the top eLearning opportunities for me.

  • Johns Hopkins Coursera Data Science Specialization - I am currently enrolled in this, about halfway through.  It is really good!  You can read my review of the first three courses here.
  • AI Class and Machine Learning Class - Both originally out of Stanford, now on Coursera and Udacity respectively - These were great.  I took these a few years ago when they first came out. They teach you a wide range of ML and AI algorithms, at the lowest level.
  • Khan Academy - Any time I find a hole in my math knowledge, I can usually find a video here to fill it.
  • Statistics @ Udacity by San Jose State University - I took this course for college credit, even though it was probably the 3rd undergrad stats class I had taken; I wanted to see what the "for credit" approach was like.  It was a great class.  Anyone who is interested in AI/ML/Data Science and is NOT familiar with stats REALLY should take a stats class.
  • Nova Southeastern University - This is actually more distance learning than eLearning.  I am a student in their computer science Ph.D. program.  You can read more about that here.  I do have to go to Ft. Lauderdale, Florida four times a year.  I will be writing more about this as I progress through the program.  I liked this program because I could earn a research-oriented Ph.D. in computer science, not the management degrees that many other distance/online/non-traditional Ph.D. programs offer.  Not that there is anything wrong with such programs; I am just an uber-techie. :)

If I were in the market for a master's degree, I would seriously consider one of these two.

There are even search engines for online courses, such as http://www.skilledup.com/

Washington University Mini-Med 2

My wife and I just finished Mini-Med School 2 at Washington University in St. Louis.  Mini-Med is a program offered by WashU that allows laypeople to learn about medicine.  Each night is taught by a world-class expert in their field.  The biographies of these lecturers are absolutely amazing!  These are some of the doctors who are pushing the boundaries of human understanding in medicine!

There are three different courses offered in Mini-Med.

  • Mini-Med 1: Lectures and some hands-on labs.  I learned to suture and used a laparoscopy simulator. I also got to tour the Washington University Genome Institute.
  • Mini-Med 2: Lectures and more hands-on labs.  I saw specimens of human organs from cadavers, became CPR certified, learned to examine patients, attended a posture lab, and took a walking tour of the medical school.
  • Mini-Med 3: I have not taken Mini-Med 3 yet; however, it continues with lectures, meetings with patient/doctor teams, and a tour of the Goldfarb School of Nursing.  I am looking forward to Mini-Med 3 this fall.

The Mini-Med courses are taught around the typical semester schedule; there are no classes during the summer.  At the end of each course you are given a certificate if you attended the required number of class sessions.  This is me at the Mini-Med 2 graduation ceremony!

Jeff graduating Mini-Med 2

I really enjoyed the hands-on labs.  I particularly enjoyed trying my hand at microsurgery.  My wife and I also earned our CPR certification.  Most of the labs were led by medical students, residents, and postdocs.  They were very knowledgeable, and it was interesting talking with some of them about their medical school journeys.  We really enjoyed learning from them.

Jeff trying his hand at microsurgery

One night we saw human organ specimens from cadavers.  It was fascinating to be able to hold and examine vital organs.  We saw specimens of hearts, kidneys, bladders, the GI tract, and even the brain!  Seeing two human brains was a highlight; as an AI/machine learning programmer, it was remarkable to see the "real thing".

Jeff holding a human brain

My wife, Tracy, and I are both life-long learners.  We are both involved in graduate programs, so this is right up our alley.  Tracy is earning a master's degree in Spanish, so she found it fascinating to see the actual "voice box" that she has seen many times in her linguistics texts.  It was valuable for us both to learn from outside our fields of study.

I work as a data scientist for a life insurance company and am also working on a doctorate in Computer Science.  I use predictive modeling for insurance underwriting, and life underwriting has much affinity with medicine.  The two Mini-Med classes have given me valuable information for my job; as a data scientist, it is very useful to learn about the knowledge domain that you are attempting to analyze.  In addition to Mini-Med, I also worked on the Johns Hopkins Coursera Data Science Specialization this semester.


Review of the First Three Johns Hopkins Coursera Data Science Courses

I am currently working towards the Johns Hopkins Data Science Specialization at Coursera, and have now completed the first three courses.  I posted my initial, and very positive, impressions when I was about halfway through the first four-week block.  My impressions are still very favorable at completion.  Now that the courses are complete, I can post my full thoughts on the first three.

There are a total of nine courses, plus a capstone.  After completing all 10 requirements you earn the "specialization".  After you complete a course you are given an online certificate, which you can link to your LinkedIn page or other social media.  You can see my first three certificates here.

Some of my courses show a 101%.  I am not entirely sure what that means.  I did not lose any points, but I did not think I had extra credit.  But I won't complain. :)

The instructors are releasing these courses in batches of three. Each course lasts for four weeks. I decided to take the first three in parallel.  This is not necessarily a good idea, depending on your experience level. I did scale back to two courses for the second month.  The capstone project will not be released until this fall, so I decided to slow my progress so as to finish in conjunction with the capstone project.

Data Science Specialization (overall)

Overall, I feel this is a very solid program.  The three instructors all have very solid academic credentials and real-world experience in data science.  If I saw this program listed on a resume, I would consider it a plus.  The program is challenging and far from trivial.  As a note, I currently work as a data scientist and have a number of publications in the area of Artificial Intelligence.  See the conclusions section for my motivations for earning this "specialization".

This specialization is very much geared towards the R programming language.  Python and R seem to be the two major players in the field of data science. C is often used to create high-performance models to be used with Python & R.  More traditional languages, such as Java/C# sometimes have their place, particularly when a data product needs to be "weaponized" for production.

Currently, I am much more in the Python & C/C++ camp than R.  Gaining familiarity with R was one of my major motivators.  The R programming language WILL be difficult for someone coming from a Java/C# background; Python is a bit closer to R.  The value of R is its community and the third-party add-ons that all work together.  Simply as a programming language, R is not that good: it is a domain-specific language (DSL) for statistics.  You would not write a word processor or video game in R!  Python is a general-purpose language that also has much of the community and statistical modeling of R.  If you are going to be a data scientist, you should know both R and Python.
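
A small example of what I mean by a statistics DSL: in base R, a full regression with coefficients, standard errors, and p-values is a couple of expressions, no libraries required.

```r
# Fit a linear regression on the built-in mtcars data and print the full
# statistical summary.
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)
```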

All of the courses are organized somewhat similarly.  You are provided with video lectures and HTML/PDF versions of the slides.  All course material was available on "day 1" of the course.  The following assessments are used to calculate your grade.

  • Quizzes: These were a combination of multiple choice and fill-in-the-blank.  You will sometimes use R to answer the quizzes. Most quizzes can be attempted twice, and you are given the higher of the two scores.
  • Programming Assignments: These are essentially "unit tests" whose results you transmit with a submit script.  Your answer, but not your code, is evaluated.  You can retry programming assignments as many times as you like.
  • Peer Reviewed Assignments: These are essentially documents, with attached files, that you submit for review. They are graded somewhat "subjectively" by your classmates. You are typically submitting program code, result data and screen shots.

All of the above assessments have "fixed deadlines", which occur at the end of each of the four weeks of the course.  After the deadline, you get partial credit for two days; beyond that, it is a zero.

The "peer reviewed assignments" did cause some concern among the students.  You are simply given a final grade on the peer-reviewed section; it is a black box.  You do not know how many students evaluated you, what scores they gave you, or why they graded you the way they did.  Graders are instructed NOT to run your R program (good advice, from a security standpoint), but this raised concerns about students' ability to grade a program they cannot run.  I ran into no issues with peer review, and I felt the grades I received were fair.

Here is how I would rate the courses in three areas:

  • Evaluating your understanding: Really good!  I felt the programming assignments were a great mix of projects, documentation and quizzes.  If you can complete the assignments for these three classes, you have a good understanding of the subject matter.
  • Providing Experience with the technology: Really good!  You get hands-on experience with R, Rstudio and GitHub. The fact that they include GitHub and exercises in collaboration is great!
  • Teaching the subject material: Acceptable.  I must commend the instructors for providing HTML and PDF versions of their slides; trying to flip through video to find something is just painful.  However, not everything in the assignments is covered in the course material.  This does not bother me, and I think it is a good thing, but it was a stumbling block for some students.  You need to know how to go out and find answers, especially if you aspire to be a data scientist.  As a data scientist, my specifications are often quite vague!  You need to be a researcher/hacker and figure it out.

The Data Scientist's Toolbox

This is the first course in the series. This course was really easy for me, and I did not learn that much.  However, this is mainly because I already had R and R Studio installed.  I also already had a GitHub account.  If you are already at this level, then this class is just something you need to check off the list to get the specialization.

However, this class does set a great foundation!  It is absolutely awesome that the instructors expose you to GitHub for collaboration and sharing.  They also cover how to ask questions, and where to ask them.  As an open source maintainer, I can say that "question asking etiquette" is NOT common knowledge.

This course lays a great foundation!  But, you may already have said foundation.

R Programming

This is the second course in the series, and it was very good. It teaches you some of the fundamentals of the R programming language.  You are given assignments that test your knowledge of the topics. Some of the topics include:

  • Reading data files
  • Output to files
  • How to use R's looping functions
  • How to NOT use R's looping functions and use the various apply functions
  • Data frames

R is vector-based, with all sorts of "helper functions" to perform various tasks quickly.  This can be a bit frustrating until you get used to it.  Python has some of the same issues: when I first learned Python, I would frequently lament that you can reduce anything to two lines of code if you know the correct "pythonic magic" to invoke.  ("Pythonic" refers to code expressed in Python's idiomatic style.)
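
As a small sketch of the difference, here is the same computation written three ways: an explicit loop, one of the apply-family functions, and R's vectorized arithmetic.

```r
# Squares of 1..10, three ways. The loop works, but idiomatic R prefers
# sapply() or, better still, plain vectorized arithmetic.
squares_loop <- numeric(10)
for (i in 1:10) {
  squares_loop[i] <- i^2
}

squares_apply <- sapply(1:10, function(i) i^2)  # apply-family version

squares_vec <- (1:10)^2                         # fully vectorized

identical(squares_loop, squares_vec)            # TRUE
```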

Getting and Cleaning Data

This is a very practical course.  Data are rarely in the form you want them!  Wrangling data is a critical skill.  This course introduces the concept of tidy data: data that have been transformed in some way to make them easier to model.  You almost always need to do this, which is why it is critical to document your "tidy steps".  I've run into a number of peer-reviewed academic articles that do not give enough instruction to realistically reproduce their research.  You should always provide scripts to help reproduce your research.  One of the instructors wrote a book on this topic.

This course also provides details on obtaining data from the web, CSV, XML, JSON, and APIs.  The final project asks you to merge a public data set and provide a tidy data set made up of several files.  You also create a codebook to document your data.  This gives a clear indication of your data format, as well as how your files were created from the raw input data.
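
As a minimal sketch of what "tidying" means in practice (using the reshape2 package, which I assume is installed, and made-up numbers), here is a wide table reshaped so that each row is one observation:

```r
# Reshape a wide table (one column per year) into tidy long form.
library(reshape2)  # assumed installed; base reshape() would also work

wide <- data.frame(country = c("US", "CA"),
                   y2012   = c(10, 7),
                   y2013   = c(12, 8))

tidy <- melt(wide, id.vars = "country",
             variable.name = "year", value.name = "count")
tidy
#   country  year count
# 1      US y2012    10
# 2      CA y2012     7
# 3      US y2013    12
# 4      CA y2013     8
```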

Conclusions

I am probably not the typical student for these courses.  I already work as a data scientist.  Additionally, I am a Ph.D. student in Computer Science, and the author of several Artificial Intelligence (AI) books.  Why am I taking this course?  First, to deepen my knowledge of the R programming language.  Second, to round out my knowledge of data science and see how others are using it.  My background is more in information technology and artificial intelligence.

So far I am quite impressed with the courses.  I passed the first three "with distinction", and am looking forward to the next two.  I am not planning on taking three in parallel again; these courses are a decent amount of work.  I am also hopeful that the upcoming course on reproducible research will be helpful for my Ph.D. dissertation.

Reservoir Sampling (how to pick a random tree node/leaf)

Reservoir Sampling is a really cool algorithm that I've used a number of times in the last few weeks.  It also illustrates a simple, yet common, "Big Data" issue.  Often I want to randomly sample (usually uniformly) from either a tree or a list.  A tree and a list use approximately the same algorithm.  For a list, count the number of elements, say 10, then generate a random number in this range.  If you get 4, then choose the 4th element.  Same deal for a tree: if there are a total of 10 nodes and leaves, then pick a random number in the range 1 to 10.  If you get 4, then traverse the tree any way you like, and choose the 4th element.

A problem occurs when you do not know the size of the tree or list.  This can be solved with brute force: just loop/traverse the structure and determine the length, then pick a random number in that range and loop again up to the index that matches that random number.  The problem is that you are traversing your data TWICE.  If you have 1000 elements, the first pass visits all 1000, and the second pass visits up to 1000 more; on average it visits about 500, because your chosen index will usually not be at the very end.  In the worst case you visit each element twice, or 2000 visits in all.

The beauty of Reservoir Sampling is that it makes a single pass: each element is visited exactly once, even in the worst case, and you never need to know the length in advance.  For our 1000-element list, we process exactly 1000 elements.  This is great for Big Data, where you do everything you can to visit each datum as few times as possible.

Algorithm Description

A nice description of this algorithm, from Wikipedia, is given below.  Note that we are using 1-based arrays, not 0-based.

The algorithm creates a "reservoir" array of size k and populates it with the first k items of S.  It then iterates through the remaining elements of S until S is exhausted.  At the ith element of S, the algorithm generates a random number j between 1 and i.  If j is at most k, the jth element of the reservoir array is replaced with the ith element of S.  In effect, for all i, the ith element of S is chosen to be included in the reservoir with probability k/i.  Similarly, at each iteration the jth element of the reservoir array is chosen to be replaced with probability 1/k * k/i, which simplifies to 1/i.  It can be shown that when the algorithm has finished executing, each item in S has equal probability (i.e. k/length(S)) of being chosen for the reservoir.
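
Here is my own direct R translation of that description, a minimal sketch assuming S is a vector with more than k elements:

```r
# Reservoir sampling: uniformly sample k items from S in a single pass,
# without knowing length(S) in advance. 1-based, per the description above.
reservoir_sample <- function(S, k) {
  R <- S[1:k]                      # fill the reservoir with the first k items
  for (i in (k + 1):length(S)) {   # assumes length(S) > k
    j <- sample.int(i, 1)          # random integer between 1 and i
    if (j <= k) {
      R[j] <- S[i]                 # replace the jth reservoir slot
    }
  }
  R
}
```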

Reservoir Sampling in Action

Let's look at a few cases and see how we would do this.  Consider a list of length 5 from which we want to sample a single element (reservoir size k = 1).  We will assume that S is made up of a count from 1 to 5: {1,2,3,4,5}.

We initialize the reservoir (R) with the first value in S, which is 1.  Now we look at the second value in S, which is 2.  We need to decide if 2 should replace the value (1) already in the reservoir.  We've considered 2 values at this point, so the probability of replacing the value already in the reservoir is 1/2, or 0.5.  For the next element the probability is 1/3, then 1/4, and finally 1/5 for the final element.
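
Using the sketch above, a quick simulation confirms the math for this k = 1 case: each of the five elements ends up in the reservoir about 20% of the time.

```r
# Each element of {1,...,5} should be kept with probability 1/5.
# Simulate many runs and check the frequencies.
set.seed(42)
picks <- replicate(10000, reservoir_sample(1:5, 1))
table(picks) / length(picks)   # all five frequencies near 0.20
```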