The following are steps I took to build a machine learning model.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
library(gridExtra);
## Loading required package: grid
set.seed(042966)
training = read.csv("~/Downloads/pml-training.csv")
testing = read.csv("~/Downloads/pml-testing.csv")
training$classe = as.factor(training$classe)
I looked at the number of variables and saw that some of the columns were mostly NAs. I also observed that some variables were derived quantities such as std, var, and skew. I decided to eliminate all columns with NAs as well as the derived quantities. I was less sure about the latter, but I was hoping that the ML algorithm would be able to deduce any dependence on the derived quantities by itself.
dim(training)
## [1] 19622 160
# Drop the NA-heavy and derived columns by name pattern, along with the
# row index X, the timestamps, and the window bookkeeping columns
cleanTraining = training[, -grep("kurtosis|skew|min|max|total|stddev|var|avg|amplitude|window|X|timestamp", colnames(training))]
dim(cleanTraining)
## [1] 19622 50
colnames(cleanTraining)
## [1] "user_name" "roll_belt" "pitch_belt"
## [4] "yaw_belt" "gyros_belt_x" "gyros_belt_y"
## [7] "gyros_belt_z" "accel_belt_x" "accel_belt_y"
## [10] "accel_belt_z" "magnet_belt_x" "magnet_belt_y"
## [13] "magnet_belt_z" "roll_arm" "pitch_arm"
## [16] "yaw_arm" "gyros_arm_x" "gyros_arm_y"
## [19] "gyros_arm_z" "accel_arm_x" "accel_arm_y"
## [22] "accel_arm_z" "magnet_arm_x" "magnet_arm_y"
## [25] "magnet_arm_z" "roll_dumbbell" "pitch_dumbbell"
## [28] "yaw_dumbbell" "gyros_dumbbell_x" "gyros_dumbbell_y"
## [31] "gyros_dumbbell_z" "accel_dumbbell_x" "accel_dumbbell_y"
## [34] "accel_dumbbell_z" "magnet_dumbbell_x" "magnet_dumbbell_y"
## [37] "magnet_dumbbell_z" "roll_forearm" "pitch_forearm"
## [40] "yaw_forearm" "gyros_forearm_x" "gyros_forearm_y"
## [43] "gyros_forearm_z" "accel_forearm_x" "accel_forearm_y"
## [46] "accel_forearm_z" "magnet_forearm_x" "magnet_forearm_y"
## [49] "magnet_forearm_z" "classe"
Looking at the original study, I did not think there would be any strong dependency among the data coming from the four sensors. Even if there were such a dependency, I wanted the algorithm to deduce it on its own, so I did not do any preprocessing such as PCA.
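Had I wanted to test that assumption, caret's findCorrelation gives a quick count of highly correlated predictors. A sketch only, not something I ran; the 0.9 cutoff is an arbitrary choice:
# Hypothetical check: how many predictors are strongly correlated with another?
numericCols = sapply(cleanTraining, is.numeric)
highCor = findCorrelation(cor(cleanTraining[, numericCols]), cutoff = 0.9)
length(highCor)   # number of predictors one might consider dropping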
Given that this was a classification problem and the lecture notes indicated that Random Forest (RF) and boosting were the most successful algorithms, I decided to start with RF.
Since RF trains each tree on a bootstrap sample of roughly two thirds of the data, the remaining third is out-of-bag (OOB) for that tree, and the OOB error gives a built-in estimate of out-of-sample performance. I therefore did not split the training set into two parts for cross validation.
Since I chose RF, I needed to set the number of trees. The default is 500, and I did not know whether that was too low or too high.
# Fit the RF model
numTree = 500
modelRf <- randomForest(classe ~ ., data = cleanTraining, importance = TRUE, ntree = numTree)
print(modelRf) # view results
##
## Call:
## randomForest(formula = classe ~ ., data = cleanTraining, importance = TRUE, ntree = numTree)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.24%
## Confusion matrix:
## A B C D E class.error
## A 5579 1 0 0 0 0.0001792115
## B 8 3786 3 0 0 0.0028970240
## C 0 8 3411 3 0 0.0032144944
## D 0 0 16 3198 2 0.0055970149
## E 0 0 2 5 3600 0.0019406709
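The 7 variables tried at each split is simply the classification default (the floor of the square root of the 49 predictors). I did not tune it, but randomForest includes a helper for doing so. A sketch only; ntreeTry and stepFactor here are arbitrary choices of mine:
# Hypothetical mtry search around the default; I kept the default of 7.
predictorCols = setdiff(colnames(cleanTraining), "classe")
bestMtry = tuneRF(cleanTraining[, predictorCols], cleanTraining$classe,
                  ntreeTry = 100, stepFactor = 1.5, improve = 0.01)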
modelRf$confusion # confusion matrix
## A B C D E class.error
## A 5579 1 0 0 0 0.0001792115
## B 8 3786 3 0 0 0.0028970240
## C 0 8 3411 3 0 0.0032144944
## D 0 0 16 3198 2 0.0055970149
## E 0 0 2 5 3600 0.0019406709
# Compute the OOB error manually; drop the class.error column first so
# that only the counts enter the sums
cm50 = modelRf$confusion[, 1:5]
outofSampleError50 = 1 - sum(diag(cm50))/sum(cm50)
outofSampleError50
## [1] 0.002446234
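The same figure can be read directly off the fitted model: err.rate has one row per tree, and its "OOB" column holds the cumulative out-of-bag error. A one-line sketch:
# Final OOB error straight from the model object.
modelRf$err.rate[numTree, "OOB"]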
plot(1:numTree, modelRf$err.rate[,1], type="l", xlab="number of trees", ylab="OOB error", main="OOB error versus #Trees")
#importance(modelRf) # importance of each predictor
# Order the predictor names by decreasing importance (mean decrease in
# accuracy); classe is the last column, so it never enters the ordering
ind = colnames(cleanTraining[order(importance(modelRf, type = 1), decreasing = TRUE)])
ind[1:15]
## [1] "yaw_belt" "roll_belt" "magnet_dumbbell_z"
## [4] "pitch_belt" "magnet_dumbbell_y" "gyros_arm_y"
## [7] "pitch_forearm" "magnet_forearm_z" "accel_dumbbell_z"
## [10] "gyros_forearm_z" "accel_dumbbell_y" "gyros_dumbbell_z"
## [13] "roll_arm" "gyros_forearm_y" "magnet_belt_x"
It turns out I did not need 500 trees: after about 100 trees, the OOB error no longer decreases significantly.
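A quick numerical check of that claim (a sketch, not something I ran originally):
# OOB error after 100, 250, and 500 trees; the values should be nearly flat.
modelRf$err.rate[c(100, 250, 500), "OOB"]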
Next I tried to determine whether I needed all of the remaining variables or could reduce them to a smaller set, using the rfcv function. Since I knew that I did not need 500 trees, I used a smaller number of trees for the cross validation; running it with 500 trees took a long time.
indexClasse = grep("classe",colnames(cleanTraining))
rf.cv = rfcv(cleanTraining[,-indexClasse],cleanTraining[,indexClasse],
cv.fold=10,ntree=100)
with(rf.cv, plot(n.var, error.cv, log="x", type="o", lwd=2,
                 main="error versus #variables\n(based on Cross Validation)"))
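The exact numbers behind the plot can be printed as well; a sketch (I do not reproduce the values here, since they vary with the random CV folds):
# Cross-validated error at each candidate number of variables.
with(rf.cv, cbind(n.var, error.cv))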
As the above plot shows, about 12 variables are all you need. The 12 most important variables are:
colnames(cleanTraining[ind[1:12]])
## [1] "yaw_belt" "roll_belt" "magnet_dumbbell_z"
## [4] "pitch_belt" "magnet_dumbbell_y" "gyros_arm_y"
## [7] "pitch_forearm" "magnet_forearm_z" "accel_dumbbell_z"
## [10] "gyros_forearm_z" "accel_dumbbell_y" "gyros_dumbbell_z"
Next I ran the randomForest model with the 12 most important variables.
# Refit using only the top 12 predictors
modelRf2 = randomForest(cleanTraining$classe ~ ., cleanTraining[ind[1:12]], importance = TRUE, ntree = 500)
modelRf2$confusion # confusion matrix
## A B C D E class.error
## A 5564 4 2 9 1 0.002867384
## B 23 3751 20 3 0 0.012114827
## C 0 16 3397 9 0 0.007305669
## D 2 1 21 3189 3 0.008395522
## E 0 7 8 8 3584 0.006376490
# Error with 12 variables, again dropping the class.error column
cm12 = modelRf2$confusion[, 1:5]
outofSampleError12 = 1 - sum(diag(cm12))/sum(cm12)
Comparing these results with the original 50-variable run: while the 50-variable model has the lower error, both are well under one percent.
c(outofSampleError50, outofSampleError12)
## [1] 0.002446234 0.006981959
Finally, I ran both models (50 variables and 12 variables) on the test data.
predictions50 = predict(modelRf,testing)
predictions12 = predict(modelRf2,testing)
predictions50 == predictions12
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE TRUE TRUE TRUE TRUE
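For completeness, here is a minimal sketch (not part of my original write-up) of saving each test prediction to its own file, as the course submission page expected; the helper name and the file format are my assumptions:
# Hypothetical helper: write one prediction per problem_id_N.txt file.
writePredictions = function(preds) {
  for (i in seq_along(preds)) {
    write.table(preds[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
writePredictions(predictions50)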
I also tried boosting, since it was mentioned as the other widely used algorithm, but I could not get any results even after waiting for 30 minutes. Besides, I was getting such low errors with RF that I decided not to pursue other algorithms further.