Basic Classification in R: Neural Networks and Support Vector Machines

In this article I will introduce you to classification in R, using the Iris data set.  Iris is a classic data set that is often used to demonstrate machine learning.  It provides four measurements (sepal length, sepal width, petal length and petal width) for each of 150 flowers drawn from three iris species.  Data such as this typically comes in a CSV file.  The iris CSV file looks something like this.

"sepal_l","sepal_w","petal_l","petal_w","species"
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa

You can download the above file from http://www.heatonresearch.com/dload/data/iris.csv.

Reading a CSV File in R

By default R looks for files in the current working directory, which you can check with getwd() and change with setwd().  You can also specify a full path.  R already has the iris data built in as the variables iris and iris3.  However, you will usually want to use your own data set, so I will demonstrate how to load the iris.csv file.  The following command loads the Iris data set.  The stringsAsFactors=TRUE argument makes sure the species column is read as a factor, which the classifiers below expect; since R 4.0, strings are otherwise read as plain character vectors.

irisdata <- read.csv(file="iris.csv",header=TRUE,sep=",",stringsAsFactors=TRUE)

You can also load the data right over the web.

irisdata <- read.csv("http://www.heatonresearch.com/dload/data/iris.csv",header=TRUE,sep=",",stringsAsFactors=TRUE)
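To see how paths and column types behave when reading a CSV file, here is a minimal sketch in base R.  It writes the seven sample rows shown earlier to a temporary file and reads them back.  The temporary path and the names csvPath and sampleData are only for illustration and are not part of the original example.

```r
# Write the sample rows from this article to a temporary CSV file.
# tempfile() is used only so the sketch runs anywhere; substitute your own path.
csvPath <- tempfile(fileext = ".csv")
writeLines(c(
  '"sepal_l","sepal_w","petal_l","petal_w","species"',
  "5.1,3.5,1.4,0.2,Iris-setosa",
  "4.9,3.0,1.4,0.2,Iris-setosa",
  "4.7,3.2,1.3,0.2,Iris-setosa",
  "4.6,3.1,1.5,0.2,Iris-setosa",
  "5.0,3.6,1.4,0.2,Iris-setosa",
  "5.4,3.9,1.7,0.4,Iris-setosa",
  "4.6,3.4,1.4,0.3,Iris-setosa"
), csvPath)

# Relative file names are resolved against this directory.
print(getwd())

# A full path works regardless of the working directory.
sampleData <- read.csv(file = csvPath, header = TRUE, sep = ",",
                       stringsAsFactors = TRUE)

# str() lists each column with its type: four numeric columns and one factor.
str(sampleData)
```

If species shows up as chr rather than Factor in the str() output, the file was read without stringsAsFactors=TRUE.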

Now that the iris data set is loaded, you can display the entire data set just by entering the variable name.

> irisdata
sepal_l sepal_w petal_l petal_w species
1 5.1 3.5 1.4 0.2 Iris-setosa
2 4.9 3.0 1.4 0.2 Iris-setosa
3 4.7 3.2 1.3 0.2 Iris-setosa
4 4.6 3.1 1.5 0.2 Iris-setosa
5 5.0 3.6 1.4 0.2 Iris-setosa
6 5.4 3.9 1.7 0.4 Iris-setosa
7 4.6 3.4 1.4 0.3 Iris-setosa
...

You can also use the summary function to get per-column statistics for the iris data.

> summary(irisdata)
 sepal_l sepal_w petal_l petal_w 
 Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 
 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 
 Median :5.800 Median :3.000 Median :4.350 Median :1.300 
 Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199 
 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800 
 Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500 
 species 
 Iris-setosa :50 
 Iris-versicolor:50 
 Iris-virginica :50

Training and Validation Data

It is often useful to break the data into training and validation sets.  This allows you to validate the support vector machine (SVM) or artificial neural network (ANN) on data that it was never trained with.  The Iris data set has 150 rows.  For our training set we will randomly sample 100 row numbers from 1 to 150; the remaining 50 form the validation set.  This is done with the following commands.

irisTrainData <- sample(1:150,100)
irisValData <- setdiff(1:150,irisTrainData)

It is very important to note that the above vectors hold only row indexes, not the actual data.  To obtain the actual data, index the data frame with them as follows.

irisdata[irisTrainData,]
irisdata[irisValData,]
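Because sample draws at random, every run produces a different split.  The sketch below shows one way to make the split repeatable and to check it.  It assumes the built-in iris data frame as a stand-in for irisdata (its species column is named Species), and the seed value and the names trainIdx, valIdx, trainSet and valSet are arbitrary choices for illustration.

```r
# set.seed() makes the random sample repeatable; 42 is an arbitrary choice.
set.seed(42)

# The built-in iris data frame stands in for irisdata here.
n <- nrow(iris)
trainIdx <- sample(1:n, 100)
valIdx <- setdiff(1:n, trainIdx)

trainSet <- iris[trainIdx, ]
valSet <- iris[valIdx, ]

# The two index sets should not overlap and should cover every row.
stopifnot(length(intersect(trainIdx, valIdx)) == 0)
stopifnot(nrow(trainSet) + nrow(valSet) == n)

# The sample is not stratified, so the species counts may be uneven.
print(table(trainSet$Species))
```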

Using a Support Vector Machine (SVM)

I will now show you how to train a support vector machine on the Iris data set.  First, we load the kernlab package, which provides the SVM implementation.

library(kernlab)

Next, we create the radial basis function (RBF) kernel that the SVM will use during training.  The sigma parameter controls the width of the kernel.

rbf <- rbfdot(sigma=0.1)
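To get a feel for what the kernel computes, here is a small base R sketch.  It assumes the Gaussian RBF definition given in the kernlab documentation, k(x, y) = exp(-sigma * ||x - y||^2).  The helper rbfValue is hypothetical and written only for this illustration; it is not part of kernlab.

```r
# Two flowers taken from the first two sample rows shown earlier.
x <- c(5.1, 3.5, 1.4, 0.2)
y <- c(4.9, 3.0, 1.4, 0.2)

# rbfValue is a hypothetical helper, not a kernlab function.
# It returns exp(-sigma * squared Euclidean distance between a and b).
rbfValue <- function(a, b, sigma) {
  exp(-sigma * sum((a - b)^2))
}

# Similar flowers give a value close to 1.
k <- rbfValue(x, y, sigma = 0.1)
print(k)

# Identical points give exactly 1.
print(rbfValue(x, x, sigma = 0.1))
```

A larger sigma makes the value fall off faster with distance, so the kernel treats fewer points as similar.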

Next we train the SVM.

irisSVM <- ksvm(species~.,data=irisdata[irisTrainData,],type="C-bsvc",kernel=rbf,C=10,prob.model=TRUE)

Next we get the fitted values, that is, the SVM's predictions for the training data.

fitted(irisSVM)

Now we test on the validation set, with probabilities as output.  The -5 removes the 5th column, species, because that is what we are trying to predict.

predict(irisSVM, irisdata[irisValData,-5], type="probabilities")

This produces output similar to the following.

 Iris-setosa Iris-versicolor Iris-virginica
 [1,] 0.964182671 0.022183652 0.013633677
 [2,] 0.952685528 0.032202528 0.015111944
 [3,] 0.966094194 0.021206723 0.012699083
 [4,] 0.965805632 0.020603214 0.013591154
 [5,] 0.962410318 0.024487673 0.013102009
 [6,] 0.964783325 0.022303353 0.012913322
 [7,] 0.975483475 0.012628443 0.011888082
 [8,] 0.918612644 0.060459572 0.020927784
 [9,] 0.953575715 0.030428791 0.015995494
[10,] 0.948050721 0.035563597 0.016385682
...

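The above shows the predictions for the first 10 elements of the validation set.  Each number is the estimated probability that the flower belongs to the species named in the column heading, and the predicted species for a row is the column with the highest probability.  These samples are all Iris-setosa: setdiff returns the validation indexes in ascending order, and the file lists the setosa flowers first.  I only show ten rows, so there is not much variety.  If you run the above command in R, you will see the other species as well.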
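Picking the most probable species for each row can be done with max.col.  The sketch below is base R only and uses the first three rows of the output above, typed in by hand; the names probs, predicted, actual and accuracy are illustrative.

```r
# Probabilities copied from the first three rows of the output above.
probs <- matrix(c(0.964182671, 0.022183652, 0.013633677,
                  0.952685528, 0.032202528, 0.015111944,
                  0.966094194, 0.021206723, 0.012699083),
                nrow = 3, byrow = TRUE,
                dimnames = list(NULL, c("Iris-setosa",
                                        "Iris-versicolor",
                                        "Iris-virginica")))

# max.col() returns, for each row, the index of the largest column.
predicted <- colnames(probs)[max.col(probs, ties.method = "first")]
print(predicted)

# The article states that these rows are all Iris-setosa.
actual <- rep("Iris-setosa", 3)

# Fraction of rows where the prediction matches the known species.
accuracy <- mean(predicted == actual)
print(accuracy)
```

With the real model, the same two steps would apply to the matrix returned by predict, and table(predicted, irisdata$species[irisValData]) would give a confusion matrix for the whole validation set.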

Using a Neural Network (ANN)

I will now show you how to do exactly the same thing using an Artificial Neural Network (ANN).  First, we load the nnet package.

library(nnet)

The neural network requires that the species be encoded using one-of-n (one-hot) encoding: each species gets its own column, which holds 1 for rows of that species and 0 otherwise.  This can be done with the following command.

ideal <- class.ind(irisdata$species)
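To see what class.ind produces, here is a short sketch on a hand-made label vector.  The four labels and the names species and encoded are made up for illustration; only class.ind itself comes from nnet.

```r
library(nnet)

# A small hypothetical label vector, one entry per flower.
species <- factor(c("Iris-setosa", "Iris-versicolor",
                    "Iris-virginica", "Iris-setosa"))

# class.ind() builds a 0/1 matrix with one column per factor level.
encoded <- class.ind(species)
print(encoded)

# Every row contains exactly one 1, in the column of its species.
print(rowSums(encoded))
```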

We can now train a neural network with 10 hidden units on the training data.

irisANN <- nnet(irisdata[irisTrainData,-5], ideal[irisTrainData,], size=10, softmax=TRUE)

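Now we can test the output from the neural network.  Because the network was trained on matrices rather than with a formula, type="class" is not available for it.  Instead, predict returns the raw outputs, one column per species, and the predicted species is the column with the largest value.

irisPred <- predict(irisANN, irisdata[irisValData,-5], type="raw")
colnames(ideal)[max.col(irisPred)]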
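To judge how well the network does, the predictions can be compared against the known species.  The following is a minimal end-to-end sketch rather than a definitive recipe.  It assumes the built-in iris data frame in place of irisdata, an arbitrary seed, and illustrative names (net, raw, predicted, confusion, accuracy).

```r
library(nnet)

# Arbitrary seed so the split and the initial weights repeat between runs.
set.seed(1)
trainIdx <- sample(1:150, 100)
valIdx <- setdiff(1:150, trainIdx)

# One-of-n encoding of the species column.
ideal <- class.ind(iris$Species)

# Train on the 100 training rows; trace = FALSE hides the iteration log.
net <- nnet(iris[trainIdx, -5], ideal[trainIdx, ], size = 10,
            softmax = TRUE, trace = FALSE)

# Raw outputs: one row per validation flower, one column per species.
raw <- predict(net, iris[valIdx, -5])

# The predicted species is the column with the largest output.
predicted <- colnames(ideal)[max.col(raw, ties.method = "first")]

# Confusion matrix and overall accuracy on the validation set.
actual <- as.character(iris$Species[valIdx])
confusion <- table(predicted = predicted, actual = actual)
print(confusion)
accuracy <- mean(predicted == actual)
print(accuracy)
```

The exact counts depend on the seed and on the random starting weights, so expect them to vary from run to run if you remove set.seed.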


6 thoughts on “Basic Classification in R: Neural Networks and Support Vector Machines”

  1. Miley

    Wow, that's pretty cool. So R does not abstract the difference in the way the class (species) is represented for you? It's too bad that the code is so different between the ANN and SVM. It's not really plug and play. I've always found that aspect of the Encog analyst to be quite useful!

  2. Vincenzo

    Instead it seems to be very similar

    predict(irisANN, irisdata[irisValData,-5], type="class")
    predict(irisSVM, irisdata[irisValData,-5], type="probabilities")

    Jeff where can I find this example with ANN for Java (not workbench)?

    Thank you

  3. Tôm

    Hi Jeff,
    That's great! Thanks for your useful tips.
    Currently I have a classification problem:
    - There are C classes
    - For each class, there are M training images, N test images.
    - For each image, I computed distances to each class by 4 methods (4 different metric distances from each image to each class).
    Format of data is as below
    (class1, class2, image, dist1, dist2, dist3, dist4)
    dist1..4 are distances from image (belongs to class2) to class1.

    Size of training data: (C x M ) x M rows
    Size of test data: (C x M ) x N rows
    C=101,
    M=15
    N=16

    Do you know what method is good for this classification problem?
    (My target is classify images into classes)

    If you can help me, I will send my datasheet,
    Thank you very much
