Course Project Submission

The following are the steps I took to build a machine learning model.

Summary

  • load the data
  • inspect the predictors for anomalies and plot them
  • clean the data
  • choose an algorithm
  • determine the algorithm's parameters
  • cross-validate to see whether the number of predictors can be reduced
  • compute the out-of-bag (OOB) error estimate
  • predict on the test set

Details

Load the data

library(caret);
## Loading required package: lattice
## Loading required package: ggplot2
library(randomForest);
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
library(gridExtra);
## Loading required package: grid
set.seed(042966)

training = read.csv("~/Downloads/pml-training.csv");
testing = read.csv("~/Downloads/pml-testing.csv")

training$classe = as.factor(training$classe)

Inspect the data

I looked at the number of variables and saw that some of them had NAs. I also observed that some variables were derived quantities such as std, var, and skew. I decided to eliminate all columns with NAs as well as the derived quantities. I was less sure about the latter, but I was hoping that the ML algorithm would be able to deduce any dependence on the derived quantities by itself.
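
For reference, a minimal sketch of this inspection step (base R only; the actual NA pattern lives in the raw training data):

# Sketch: count NAs per column to find the mostly-empty columns
naCounts = colSums(is.na(training))
table(naCounts)                      # columns tend to be either complete or mostly NA
head(names(naCounts[naCounts > 0]))  # a few of the affected column names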

dim(training)
## [1] 19622   160
cleanTraining = training[,-(grep("kurtosis|skew|min|max|total|stddev|var|avg|amplitude|window|X|timestamp",colnames(training)))]
dim(cleanTraining)
## [1] 19622    50
colnames(cleanTraining)
##  [1] "user_name"         "roll_belt"         "pitch_belt"       
##  [4] "yaw_belt"          "gyros_belt_x"      "gyros_belt_y"     
##  [7] "gyros_belt_z"      "accel_belt_x"      "accel_belt_y"     
## [10] "accel_belt_z"      "magnet_belt_x"     "magnet_belt_y"    
## [13] "magnet_belt_z"     "roll_arm"          "pitch_arm"        
## [16] "yaw_arm"           "gyros_arm_x"       "gyros_arm_y"      
## [19] "gyros_arm_z"       "accel_arm_x"       "accel_arm_y"      
## [22] "accel_arm_z"       "magnet_arm_x"      "magnet_arm_y"     
## [25] "magnet_arm_z"      "roll_dumbbell"     "pitch_dumbbell"   
## [28] "yaw_dumbbell"      "gyros_dumbbell_x"  "gyros_dumbbell_y" 
## [31] "gyros_dumbbell_z"  "accel_dumbbell_x"  "accel_dumbbell_y" 
## [34] "accel_dumbbell_z"  "magnet_dumbbell_x" "magnet_dumbbell_y"
## [37] "magnet_dumbbell_z" "roll_forearm"      "pitch_forearm"    
## [40] "yaw_forearm"       "gyros_forearm_x"   "gyros_forearm_y"  
## [43] "gyros_forearm_z"   "accel_forearm_x"   "accel_forearm_y"  
## [46] "accel_forearm_z"   "magnet_forearm_x"  "magnet_forearm_y" 
## [49] "magnet_forearm_z"  "classe"

Preprocessing

Looking at the study, I did not think there would be any dependency among the data coming from the four sensors. Even if there was a dependency, I wanted the algorithm to deduce it. So I did not do any preprocessing such as PCA.
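
Had I wanted PCA, a minimal sketch with caret would look like this (thresh = 0.95 is just an illustrative variance threshold, not something I ran):

# Not used in this project: PCA on the numeric predictors only
numericCols = sapply(cleanTraining, is.numeric)
prePca = preProcess(cleanTraining[, numericCols], method = "pca", thresh = 0.95)
pcaTraining = predict(prePca, cleanTraining[, numericCols])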

Selecting a model

Given that this was a classification problem and the lecture notes indicated that Random Forest (RF) and boosting were the most successful algorithms, I decided to start with RF.

Since RF trains each tree on a bootstrap sample, leaving roughly one third of the observations out of each tree (the out-of-bag, or OOB, sample), I did not split the training set into two parts for cross-validation.
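
For comparison, a conventional hold-out split with caret would look like the sketch below (the 70/30 split is just a typical choice; I did not run this):

# Not needed here, since RF's OOB estimate serves the same purpose
inTrain  = createDataPartition(cleanTraining$classe, p = 0.7, list = FALSE)
trainSet = cleanTraining[inTrain, ]
validSet = cleanTraining[-inTrain, ]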

Fit the model

Since I chose RF, I needed to set the number of trees. The default is 500, and I did not know whether that was too low or too high.

# Fit the RF model 
numTree = 500;
modelRf <- randomForest(classe ~., cleanTraining, importance=TRUE, ntree=numTree)
print(modelRf) # view results 
## 
## Call:
##  randomForest(formula = classe ~ ., data = cleanTraining, importance = TRUE,      ntree = numTree) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0.24%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 5579    1    0    0    0 0.0001792115
## B    8 3786    3    0    0 0.0028970240
## C    0    8 3411    3    0 0.0032144944
## D    0    0   16 3198    2 0.0055970149
## E    0    0    2    5 3600 0.0019406709
modelRf$confusion # confusion matrix
##      A    B    C    D    E  class.error
## A 5579    1    0    0    0 0.0001792115
## B    8 3786    3    0    0 0.0028970240
## C    0    8 3411    3    0 0.0032144944
## D    0    0   16 3198    2 0.0055970149
## E    0    0    2    5 3600 0.0019406709
# OOB-based out-of-sample error (the class.error column's contribution to the sums is negligible)
outofSampleError50 = 1-sum(diag(modelRf$confusion))/sum(modelRf$confusion)
outofSampleError50
## [1] 0.002446937
plot(1:numTree, modelRf$err.rate[,1], xlab="number of trees", ylab="OOB error", main="OOB error versus #Tree")

#importance(modelRf) # importance of each predictor
# Predictor names ordered by importance (type = 1: mean decrease in accuracy)
ind = colnames(cleanTraining[order(importance(modelRf,type = 1),decreasing=TRUE)])
ind[1:15]
##  [1] "yaw_belt"          "roll_belt"         "magnet_dumbbell_z"
##  [4] "pitch_belt"        "magnet_dumbbell_y" "gyros_arm_y"      
##  [7] "pitch_forearm"     "magnet_forearm_z"  "accel_dumbbell_z" 
## [10] "gyros_forearm_z"   "accel_dumbbell_y"  "gyros_dumbbell_z" 
## [13] "roll_arm"          "gyros_forearm_y"   "magnet_belt_x"
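
The same ranking can be inspected visually with varImpPlot, which ships with randomForest (the n.var = 15 cutoff simply mirrors the listing above):

# Optional visual check of the importance ranking
varImpPlot(modelRf, type = 1, n.var = 15, main = "Top 15 predictors")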

It turns out I did not need 500 trees: after about 100 trees, the OOB error no longer decreases significantly.
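
A quick way to confirm this would be to refit with 100 trees and read off the final OOB error (a sketch; I did not rerun this for the report):

# Sketch: confirm the plateau seen in the plot by refitting with fewer trees
modelRf100 = randomForest(classe ~ ., cleanTraining, ntree = 100)
modelRf100$err.rate[100, "OOB"]   # final OOB error with 100 trees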

Cross Validation

Next I tried to determine whether I needed all of the remaining predictors or whether a smaller set would suffice. I used the rfcv function for this. Since I knew I did not need 500 trees, I used a smaller number of trees for the cross-validation; running it with 500 trees took a long time.

indexClasse = grep("classe",colnames(cleanTraining))
# 10-fold cross-validated error over successively smaller predictor subsets
rf.cv = rfcv(cleanTraining[,-indexClasse],cleanTraining[,indexClasse],
             cv.fold=10,ntree=100)
with(rf.cv, plot(n.var, error.cv, log="x", type="o", lwd=2, main="error versus #variables\n(based on Cross Validation)"))

As the plot above shows, about 12 variables are all that is needed. The 12 most important variables are:

colnames(cleanTraining[ind[1:12]])
##  [1] "yaw_belt"          "roll_belt"         "magnet_dumbbell_z"
##  [4] "pitch_belt"        "magnet_dumbbell_y" "gyros_arm_y"      
##  [7] "pitch_forearm"     "magnet_forearm_z"  "accel_dumbbell_z" 
## [10] "gyros_forearm_z"   "accel_dumbbell_y"  "gyros_dumbbell_z"

Next I ran the randomForest model with the 12 most important variables.

# Refit using only the 12 most important predictors
modelRf2 = randomForest(cleanTraining$classe ~ ., cleanTraining[ind[1:12]], importance=TRUE, ntree=500)
modelRf2$confusion # confusion matrix
##      A    B    C    D    E class.error
## A 5564    4    2    9    1 0.002867384
## B   23 3751   20    3    0 0.012114827
## C    0   16 3397    9    0 0.007305669
## D    2    1   21 3189    3 0.008395522
## E    0    7    8    8 3584 0.006376490
# Out-of-sample error with 12 variables (same class.error caveat as above)
outofSampleError12= 1-sum(diag(modelRf2$confusion))/sum(modelRf2$confusion)

Comparing the results with the original 50-variable run: while the 50-variable model has a lower error, both are under one percent.

c(outofSampleError50,outofSampleError12)
## [1] 0.002446937 0.006983835

Predicting

Finally, I ran both models (50 variables and 12 variables) on the test data.

predictions50 = predict(modelRf,testing)
predictions12 = predict(modelRf2,testing)
predictions50 == predictions12
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE TRUE TRUE TRUE TRUE
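
Both models agree on all 20 test cases. If the submission requires one file per prediction, a helper along these lines would work (the problem_id_N.txt naming is an assumption about the expected format):

# Assumed submission format: one file per test case
writePredictions = function(preds) {
  for (i in seq_along(preds)) {
    write.table(preds[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
# writePredictions(predictions50)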

What about other models?

I tried boosting, since it was mentioned as the other widely used algorithm. However, I could not get any results even after waiting 30 minutes. Besides, I was getting such low errors with RF that I decided not to pursue other algorithms further.
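
For the record, the attempt looked roughly like the sketch below (assuming caret's gbm method; left commented out because it is the call that did not finish in a reasonable time):

# Attempted boosting via caret; too slow on the full training set
# modelGbm = train(classe ~ ., data = cleanTraining, method = "gbm", verbose = FALSE)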