In this article I will introduce you to classification in R. We will use the Iris data set to perform this classification. The Iris data set is a classic data set that is often used to demonstrate machine learning. It provides four measurements (sepal length, sepal width, petal length and petal width) for each of three iris species. Data such as this typically comes in a CSV file. The first few lines of the iris CSV file look like this.
"sepal_l","sepal_w","petal_l","petal_w","species"
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
You can download the above file here.
Reading a CSV File in R
By default, R looks for files in its current working directory, which may be your home directory when R first starts. You can also specify a full path. We will now load the iris dataset. Of course, R has the iris dataset built into the variables iris and iris3. However, we will assume that you might want to use your own dataset. Therefore I will demonstrate how to load the iris.csv file. The following command is used to load the Iris data set. The stringsAsFactors=TRUE argument makes sure that species is read as a factor, which has not been the default since R 4.0.
irisdata <- read.csv(file="iris.csv",header=TRUE,sep=",",stringsAsFactors=TRUE)
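To see where a relative file name such as iris.csv is looked up, and to check that species really arrives as a factor, something like the following sketch can help. The temporary file iris_sample.csv and the variables csvPath and sampleData are hypothetical stand-ins so that the snippet runs without the real file.

```r
# Where R resolves relative file names
getwd()

# Hypothetical stand-in for iris.csv: three rows written to a temporary file
csvPath <- file.path(tempdir(), "iris_sample.csv")
writeLines(c(
  '"sepal_l","sepal_w","petal_l","petal_w","species"',
  '5.1,3.5,1.4,0.2,Iris-setosa',
  '7.0,3.2,4.7,1.4,Iris-versicolor',
  '6.3,3.3,6.0,2.5,Iris-virginica'
), csvPath)

# Read with a full path; stringsAsFactors=TRUE turns species into a factor
sampleData <- read.csv(file=csvPath, header=TRUE, sep=",", stringsAsFactors=TRUE)

# Inspect column types: four numeric columns and one factor
str(sampleData)
```

If a relative name is not found, comparing getwd() with the folder that holds the file is usually the quickest check.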
You can also load the data right over the web.
irisdata <- read.csv("http://www.heatonresearch.com/dload/data/iris.csv",header=TRUE,sep=",",stringsAsFactors=TRUE)
Now that the iris data set is loaded, you can display the entire data set just by entering the variable name.
> irisdata
  sepal_l sepal_w petal_l petal_w     species
1     5.1     3.5     1.4     0.2 Iris-setosa
2     4.9     3.0     1.4     0.2 Iris-setosa
3     4.7     3.2     1.3     0.2 Iris-setosa
4     4.6     3.1     1.5     0.2 Iris-setosa
5     5.0     3.6     1.4     0.2 Iris-setosa
6     5.4     3.9     1.7     0.4 Iris-setosa
7     4.6     3.4     1.4     0.3 Iris-setosa
...
You can also use the summary function to get a quick statistical summary of the iris data.
> summary(irisdata)
    sepal_l         sepal_w         petal_l         petal_w                 species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   Iris-setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   Iris-versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   Iris-virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
Training and Validation Data
It is often useful to break the data into training and validation sets. This allows you to validate the SVM or ANN on data that it was never trained on. The Iris dataset has 150 elements in it. For our training set we will sample 100 elements from this 150-element set, leaving the other 50 for validation. This is done with the following commands.
irisTrainData = sample(1:150,100)
irisValData = setdiff(1:150,irisTrainData)
It is very important to note that the above vectors are only indexes, and not the actual data. To obtain the actual data you must use them to index the rows of irisdata, as in irisdata[irisTrainData,] or irisdata[irisValData,].
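A minimal sketch of turning the index vectors into data follows. It uses R's built-in iris data frame, renamed to match the CSV columns, as a stand-in for the loaded file so that it runs on its own. The set.seed call and the names irisTrain and irisVal are my additions for illustration, not part of the original commands.

```r
# Stand-in for the CSV loaded earlier (assumption): built-in iris, renamed columns
irisdata <- setNames(iris, c("sepal_l", "sepal_w", "petal_l", "petal_w", "species"))

set.seed(42)                                    # optional: makes the split repeatable
irisTrainData <- sample(1:150, 100)             # 100 row indexes for training
irisValData   <- setdiff(1:150, irisTrainData)  # the 50 indexes that were not sampled

# Row indexing turns the index vectors into actual data frames
irisTrain <- irisdata[irisTrainData, ]  # hypothetical name
irisVal   <- irisdata[irisValData, ]    # hypothetical name

dim(irisTrain)
dim(irisVal)
```

Because setdiff keeps only the indexes that were not sampled, no row can end up in both sets.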
Using a Support Vector Machine (SVM)
I will now show you how to train a support vector machine for the Iris data set. First, we must tell R that we are using SVMs by loading the kernlab package, which provides the functions used below.
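The rbfdot and ksvm functions used in this section are provided by the kernlab package. A minimal sketch of loading it is shown here. The install line is commented out because it is only needed once, and only if the package is missing.

```r
# install.packages("kernlab")  # one-time install, if the package is missing
library(kernlab)
```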
Next, we create a radial basis function (RBF) that will be used during training. This will be used as the kernel function.
rbf <- rbfdot(sigma=0.1)
Next we train the SVM.
irisSVM <- ksvm(species~.,data=irisdata[irisTrainData,],type="C-bsvc",kernel=rbf,C=10,prob.model=TRUE)
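Next we can get the fitted values for this iris SVM by calling fitted(irisSVM). These are the species that the SVM assigns to the rows it was trained on.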
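As a self-contained sketch of this step, the following trains the same kind of SVM and cross-tabulates its fitted values against the true species of the training rows. The built-in iris data frame stands in for the loaded CSV, and the seed and the name fittedSpecies are assumptions made for illustration.

```r
library(kernlab)

# Stand-in for the CSV loaded earlier (assumption): built-in iris, renamed columns
irisdata <- setNames(iris, c("sepal_l", "sepal_w", "petal_l", "petal_w", "species"))

set.seed(42)  # optional: repeatable split
irisTrainData <- sample(1:150, 100)

# Same settings as in the article: RBF kernel, C-bsvc, probability model enabled
rbf <- rbfdot(sigma=0.1)
irisSVM <- ksvm(species~., data=irisdata[irisTrainData,], type="C-bsvc",
                kernel=rbf, C=10, prob.model=TRUE)

# Fitted values: the class the SVM assigns to each training row
fittedSpecies <- fitted(irisSVM)

# Compare fitted and actual species on the training rows
table(fitted=fittedSpecies, actual=irisdata$species[irisTrainData])
```

A table that is not perfectly diagonal is no cause for alarm. It only shows that the SVM did not memorize every training row.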
Next, we test on the validation set, with probabilities as output. The -5 removes the 5th column, which is species, because species is what we are trying to predict.
predict(irisSVM, irisdata[irisValData,-5], type="probabilities")
This produces output similar to the following.
      Iris-setosa Iris-versicolor Iris-virginica
 [1,] 0.964182671     0.022183652    0.013633677
 [2,] 0.952685528     0.032202528    0.015111944
 [3,] 0.966094194     0.021206723    0.012699083
 [4,] 0.965805632     0.020603214    0.013591154
 [5,] 0.962410318     0.024487673    0.013102009
 [6,] 0.964783325     0.022303353    0.012913322
 [7,] 0.975483475     0.012628443    0.011888082
 [8,] 0.918612644     0.060459572    0.020927784
 [9,] 0.953575715     0.030428791    0.015995494
[10,] 0.948050721     0.035563597    0.016385682
...
The above shows the predictions for the first 10 elements of the validation set. The numbers you see are probabilities. As you can see, each row has one column that holds the maximum probability, and that column is the predicted species. These samples are all Iris-setosa. I only show ten rows, so there is not much variety. If you run the above command in R, you will see the other species as well.
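Picking the column with the maximum probability can also be done in code. The sketch below is one way to do it, under the assumption that the built-in iris data frame stands in for the loaded CSV. The names probs, predicted and actual, and the seed, are mine.

```r
library(kernlab)

# Stand-in for the CSV loaded earlier (assumption): built-in iris, renamed columns
irisdata <- setNames(iris, c("sepal_l", "sepal_w", "petal_l", "petal_w", "species"))

set.seed(42)  # optional: repeatable split
irisTrainData <- sample(1:150, 100)
irisValData   <- setdiff(1:150, irisTrainData)

irisSVM <- ksvm(species~., data=irisdata[irisTrainData,], type="C-bsvc",
                kernel=rbfdot(sigma=0.1), C=10, prob.model=TRUE)

# One row per validation sample, one probability column per species
probs <- predict(irisSVM, irisdata[irisValData,-5], type="probabilities")

# For each row, take the name of the column with the largest probability
predicted <- colnames(probs)[max.col(probs, ties.method="first")]
actual    <- as.character(irisdata$species[irisValData])

table(predicted, actual)   # confusion table on the validation set
mean(predicted == actual)  # fraction of validation rows classified correctly
```

The confusion table shows at a glance which species get mixed up with each other.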
Using a Neural Network (ANN)
I will now show you how to do exactly the same thing using an Artificial Neural Network. First, we must tell R that we are using ANNs by loading the nnet package, which provides the functions used below.
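The class.ind and nnet functions used in this section are provided by the nnet package. A minimal sketch of loading it is shown here. The package normally ships with standard R installations, so the commented install line is rarely needed.

```r
# install.packages("nnet")  # only if the package is missing
library(nnet)
```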
The neural network requires that the species be encoded using one-of-n encoding. Each species gets its own column, which holds 1 for rows of that species and 0 otherwise. This can be done with the following command.
ideal <- class.ind(irisdata$species)
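To see what class.ind produces, it can help to run it on a tiny example first. The labels in toy below are hypothetical and only serve to show the shape of the result.

```r
library(nnet)

# Hypothetical toy labels, just to show the shape of the encoding
toy <- factor(c("setosa", "virginica", "setosa", "versicolor"))

# One column per factor level; each row has a single 1 in its own class column
encoded <- class.ind(toy)
encoded
```

The columns follow the order of the factor levels, and every row contains exactly one 1.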
We can now train a neural network for the training data.
irisANN = nnet(irisdata[irisTrainData,-5], ideal[irisTrainData,], size=10, softmax=TRUE)
Now we can test the output from the neural network.
predict(irisANN, irisdata[irisValData,-5], type="class")
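Depending on your version of nnet, type="class" may be rejected for a network that was trained from matrices rather than from a formula, because the class levels are only stored in the formula case. A hedged alternative is to ask for the raw outputs and pick the largest one per row, as sketched below. The built-in iris data frame stands in for the loaded CSV, and the seed and the names rawOut, predicted and actual are my additions.

```r
library(nnet)

# Stand-in for the CSV loaded earlier (assumption): built-in iris, renamed columns
irisdata <- setNames(iris, c("sepal_l", "sepal_w", "petal_l", "petal_w", "species"))

set.seed(42)  # optional: repeatable split and initial weights
irisTrainData <- sample(1:150, 100)
irisValData   <- setdiff(1:150, irisTrainData)

# One-of-n encoding of the species, then training as in the article
ideal   <- class.ind(irisdata$species)
irisANN <- nnet(irisdata[irisTrainData,-5], ideal[irisTrainData,],
                size=10, softmax=TRUE)

# Raw softmax outputs: one row per validation sample, one column per species
rawOut <- predict(irisANN, irisdata[irisValData,-5], type="raw")

# Map each row to the species whose output is largest
predicted <- colnames(ideal)[max.col(rawOut, ties.method="first")]
actual    <- as.character(irisdata$species[irisValData])

table(predicted, actual)   # confusion table on the validation set
mean(predicted == actual)  # fraction of validation rows classified correctly
```

Neural network training starts from random weights, so results vary from run to run unless a seed is set.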