In this project, ensemble tree-based predictive models that determine the manner in which an exercise is performed are built. The models used are Random Forest, Adaptive Boosting and Bagged Adaptive Boosting. Data from accelerometers on the belt, forearm, arm and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways, are used to predict whether a particular exercise is performed correctly.
The dataset used for building the ensemble tree-based models can be downloaded from here. First, let's download the dataset from the provided link and divide it into two subsets: one for training, building and evaluating the models, and a second for estimating the out-of-sample error.
Following the rule of thumb for prediction study design, I partitioned the data into 60% training and 40% testing. I then set the testing data aside and carried out the entire model-building process on the training data. Before building the models, I explored the training data and tried to understand the covariates (features) and their associations. The paper by Velloso et al. (2013) describes the covariates. The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.
# Load required libraries
library(ggplot2)
library(caret)
library(nnet)
library(NeuralNetTools)
library(plotmo)
library(ROCR)
library(dplyr)
# Read the training data, treating "NA" and empty strings as missing values
data1 <- read.csv("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
                  na.strings = c("NA", ""))
Now, let's see the dimensions of the dataset and the variables.
dim(data1)
Then, I divided the dataset into training and testing subsets.
for_training<-createDataPartition(y=data1$classe,p=0.6,list=FALSE)
training <-data1[for_training,]
testing <-data1[-for_training,]
dim(training)
tail(names(training)) # to see some of the features
The training data has 160 columns (variables) and 11776 rows (observations). The predictand (dependent) variable is “classe”, which is the rightmost column.
Then, I removed the covariates that are not useful for developing the machine learning algorithms. These are the row number (X), user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, new_window and num_window.
It is vital to keep in mind that the testing data is not used in any way for model development. The model building is done entirely based on the training data.
Remove the columns that are not important for model development (the first seven columns):
training<-training[,8:160]
Many of the covariates have lots of missing data. Those with missing data are removed.
# Keep only the columns that contain no missing values
non_missing_training <- apply(!is.na(training), 2, sum) >= dim(training)[1]
training <- training[, non_missing_training]
dim(training)
After removing the variables with missing data, I am left with 52 features and the dependent variable (53 columns in total).
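As part of the exploration mentioned earlier, it is helpful to plot a pair of features against each other, coloured by the outcome. Below is a minimal sketch using an arbitrarily chosen pair of belt-sensor features (roll_belt and pitch_belt); any other pair could be substituted.

# Exploratory sketch: two (arbitrarily chosen) belt-sensor features coloured by classe
ggplot(training, aes(x = roll_belt, y = pitch_belt, colour = classe)) +
  geom_point(alpha = 0.3) +
  labs(title = "Example: two belt-sensor features by classe")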
Random Forest, along with Boosting, is one of the most accurate classifiers. Random Forest is designed to improve the prediction accuracy of CART. It works by building a large number of CART trees. To make a prediction for a new observation, each tree “votes” on the outcome, and we pick the outcome that receives the majority of the votes. More information can be found on Wikipedia. On many problems, the performance of random forests is very similar to that of boosting (Hastie et al., 2005).
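To make the voting idea concrete, here is a minimal, self-contained sketch using the randomForest package on the built-in iris data (an illustration only, not the project model): each tree predicts a class for a new observation, and the majority vote becomes the forest's prediction.

# Toy illustration of majority voting in a random forest (not the project model)
library(randomForest)
set.seed(1)
toy_rf <- randomForest(Species ~ ., data = iris, ntree = 25)
# predict.all = TRUE returns the class predicted by every individual tree
votes <- predict(toy_rf, newdata = iris[1, ], predict.all = TRUE)
table(votes$individual)                   # votes cast by the 25 trees
names(which.max(table(votes$individual))) # the majority-vote prediction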
Ensemble tree techniques are suitable for this project's dataset because: (a) they handle non-linearity, (b) they handle a large number of inputs whose interactions are not well understood, (c) they handle unscaled inputs and categorical data, and (d) they make it possible to visually examine how the predictors contribute to the final model by extracting the fitted trees.
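As a small illustration of point (d), individual trees can be pulled out of a fitted forest and inspected; the sketch below uses getTree() from the randomForest package on the toy forest built above.

# Inspect the splits of the first tree in the toy forest (illustration of point (d))
head(getTree(toy_rf, k = 1, labelVar = TRUE))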
Here, we use Random Forest, Adaptive Boosting and Bagged Adaptive Boosting.
The caret package is employed to build and evaluate the various models considered.
Let's use 10-fold cross-validation.
set.seed(3333) # this is set to ensure the reproducibility of the results
rf_fit<-train(classe~.,data=training,method="rf", allowParallel=TRUE,prox=TRUE,
trControl=trainControl(method="cv",number=10))
Let's see the importance of the covariates.
importance_rf <- varImp(rf_fit)
plot(importance_rf, main = "Variable Importance of features from Random Forest")
Or we can look at the top 20 most important features:
importance_rf
Then let us see the confusion matrix to understand how the model performs:
options(digits=2)
rf_fit$finalModel$confusion
The confusion matrix shows that the model is doing a good job in fitting the training data set. The summary of the model is below:
rf_fit
As was done with the training data, the first seven columns and the covariates with missing values are removed from the testing data, and the remaining covariates are used for prediction.
testing <-testing[,8:160]
As was done with the training data, remove the covariates with NAs:
non_missing_testing<-apply(!is.na(testing),2,sum)==dim(testing)[1]
testing<-testing[,non_missing_testing]
dim(testing)
Predict using the model built with the training data:
prediction_rf<-predict(rf_fit,testing)
accuracy_rf = sum(prediction_rf==testing$classe)/length(testing$classe)
cat("The accuracy from Random Forest Model is ",round(accuracy_rf,3))
We see that Random Forest has very good accuracy, with an out-of-sample error of less than 1%.
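For a more detailed view than the single accuracy number, caret's confusionMatrix() can be applied to the held-out predictions; it also reports per-class sensitivity and specificity. A minimal sketch (classe is coerced to a factor in case it was read in as character):

# Detailed hold-out evaluation: per-class statistics in addition to overall accuracy
confusionMatrix(prediction_rf, factor(testing$classe))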
Now, let's download the final test dataset and make predictions using the model built above.
Remove the first seven columns which are not covariates for the model and also remove covariates with missing values
test_given<-read.csv("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",na.strings=c("NA",""))
test_given<-test_given[,8:160]
non_missing_test<-apply(!is.na(test_given),2,sum)>=dim(test_given)[1]
test_given<-test_given[,non_missing_test]
Predict for the test data and save the result
prediction = predict(rf_fit, test_given)

pml_write_files = function(x) {
  # Write each prediction to its own text file, one file per test case
  n = length(x)
  for (i in 1:n) {
    filename = paste0("problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE,
                row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(prediction)
The model predicts all of the test cases correctly (100% accuracy). Next, let's build an Adaptive Boosting model on the same training data, again using 10-fold cross-validation.
cctrl <- trainControl(method = "cv", number = 10, returnResamp = "all",
classProbs = TRUE)
grid <- expand.grid(mfinal = (1:20), maxdepth = c(1:10),
coeflearn = c("Breiman", "Freund", "Zhu"))
set.seed(100)
my_adaboost <- train(classe~., data=training,
method = "AdaBoost.M1",
trControl = cctrl,
tuneGrid = grid,
metric = "Accuracy")
importance_adaboost <- varImp(my_adaboost)
plot(importance_adaboost, main = "Variable Importance of features from Adaptive Boosting")
importance_adaboost
my_adaboost$bestTune
prediction_adaboost<-predict(my_adaboost,testing)
accuracy_adaboost = sum(prediction_adaboost==testing$classe)/length(testing$classe)
cat("The accuracy from Adaptive Boosting Model is ",round(accuracy_adaboost,3))
We see that Adaptive Boosting has accuracy similar to that of Random Forest. Finally, let's fit a Bagged Adaptive Boosting (AdaBag) model, again with 10-fold cross-validation.
cctrl <- trainControl(method = "cv", number = 10, returnResamp = "all",
classProbs = TRUE)
grid <- expand.grid(mfinal = (1:20), maxdepth = c(1:10))
set.seed(100)
my_adabag <- train(classe~., data=training,
method = "AdaBag",
trControl = cctrl,
tuneGrid = grid,
metric = "Accuracy",
preProc = c("center", "scale"))
importance_adabag <- varImp(my_adabag)
plot(importance_adabag, main = "Variable Importance of features from\n Bagged Adaptive Boosting")
importance_adabag
my_adabag$bestTune
prediction_adabag<-predict(my_adabag,testing)
accuracy_adabag = sum(prediction_adabag==testing$classe)/length(testing$classe)
cat("The accuracy from Bagged Adaptive Boosting Model is ",round(accuracy_adabag,3))
In this project, we examined the performance of several ensemble tree-based machine learning techniques for recognizing human activity. We considered Random Forest, Adaptive Boosting and Bagged Adaptive Boosting, and used 10-fold cross-validation to evaluate the models. All models have very high accuracy, with errors of about 1% or less.
Hastie, T., R. Tibshirani, J. Friedman, and J. Franklin, 2005: The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer, 27, 83-85.
Velloso, E., A. Bulling, H. Gellersen, W. Ugulino, and H. Fuks, 2013: Qualitative activity recognition of weight lifting exercises. Proceedings of the 4th Augmented Human International Conference, ACM, 116-123.