In this post, I used different linear (generalized linear model, ridge regression, partial least squares, lasso and elastic net) and non-linear (support vector machines, classification and regression trees, random forest, neural network and boosting) regression models to predict the global average temperature anomaly. First, I did exploratory data analysis to understand the features and how they may be associated with global temperature, then built predictive models using various machine learning techniques, assessed the importance of the independent variables in predicting the average global temperature anomaly, and compared the performance of the different models considered. The data are from the MITx Analytics Edge edX course website.
Let's download the data and see the variables in the data set.
# Load required libraries
library(downloader)
require(ggplot2)
library(caret)
library(nnet)
library(NeuralNetTools)
# Download data
url<-"https://courses.edx.org/asset-v1:[email protected]+block/climate_change.csv"
download(url,dest="climate.csv")
clim = read.csv("climate.csv")
str(clim)
Let's use the data up to 2006 for training and predict for 2007 and 2008. We will not use the variables 'Year' and 'Month' in the predictive model building process.
# Training set: observations through 2006; test set: 2007 and 2008
# Columns 3:10 hold the predictors and column 11 holds the response (Temp)
training<-subset(clim,clim$Year<=2006)
testing<-subset(clim,clim$Year > 2006)
testingX<-subset(clim[,3:10],clim$Year> 2006)
testingY<-subset(clim[,11],clim$Year> 2006)
ggplot(training, aes(Year, Temp))+geom_point(color="purple")+
ggtitle('Fig.1. Temperature anomaly over time')+
xlab('')+ylab('Temperature')+
stat_smooth(method=lm, colour='blue',span=0.2)
From Fig. 1, we see a clear increasing trend in temperature over the data period.
One of the main causes of the increase in global temperature is increased carbon dioxide emission to the atmosphere from different sources. Let's see the relationship between temperature and carbon dioxide.
ggplot(training, aes(CO2, Temp))+geom_point()+
ggtitle('Fig.2. Temperature anomaly vs carbon dioxide concentration')+
xlab('Carbon dioxide')+ylab('Temperature')+stat_smooth(method=lm,colour='blue', span=0.2)
As shown in Fig. 2, carbon dioxide and temperature are strongly related.
Global temperature also fluctuates with an oscillation in the tropical Pacific Ocean (the El Niño/Southern Oscillation). This oscillation is represented by the variable MEI (Multivariate ENSO Index) in the dataset. Let's plot the association of the temperature anomaly with the MEI index.
ggplot(training, aes(MEI, Temp))+geom_point()+
ggtitle('Fig.3. Temperature anomaly vs MEI')+xlab('MEI')+
ylab('Temperature')+stat_smooth(method=lm,colour='blue', span=0.2)
The figure does not reveal a strong linear association between MEI and average global temperature; however, MEI can have a systematic influence on global temperature, and it could be an important predictor, as we will see later.
Let's fit a generalized linear model with caret's train function and the "glm" method, generate a variable importance figure, predict on the test data, and plot observed versus predicted values using the ggplot2 package. To assess accuracy, let's use repeated k-fold cross-validation (specifically, 10-fold cross-validation repeated 5 times).
control <- trainControl(method="repeatedcv", number=10, repeats=5)
my_glm <- train(training[,3:10], training[,11],
method = "glm",
preProc = c("center", "scale"),
trControl = control)
# Predict using the test data
pred<-predict(my_glm,testingX)
my_data=as.data.frame(cbind(predicted=pred,observed=testingY))
ggplot(my_data,aes(predicted,observed))+geom_point()+geom_smooth(method=lm)+ggtitle('GLM')
# Print, plot variable importance
print(varImp(my_glm, scale = FALSE))
plot(varImp(my_glm, scale = FALSE), main="Variable Importance using GLM")
We can see the summary of the model and understand which features are significant.
summary(my_glm)
From the summary of the linear model, we see that all features except CH4 are important predictors.
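For a quick numeric check of test-set accuracy, caret's postResample function computes RMSE and R-squared (and, in newer caret versions, MAE) from predicted and observed values; a minimal sketch using the GLM predictions above:
# Test-set RMSE and R-squared for the GLM predictions
postResample(pred = pred, obs = testingY)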
Now, let's use the glmStepAIC method to select important features automatically via the Akaike Information Criterion (AIC).
control <- trainControl(method="repeatedcv", number=10, repeats=5)
my_stepglm <- train(training[,3:10], training[,11],
method = "glmStepAIC",
preProc = c("center", "scale"),
trControl = control)
# Predict using the test data
pred<-predict(my_stepglm,testingX)
my_data=as.data.frame(cbind(predicted=pred,observed=testingY))
ggplot(my_data,aes(predicted,observed))+geom_point()+geom_smooth(method=lm)+ggtitle('GLMStepAIC')
# Print, plot variable importance
print(varImp(my_stepglm, scale = FALSE))
plot(varImp(my_stepglm, scale = FALSE), main="Variable Importance using GLMStepAIC")
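To see which predictors the stepwise procedure retained, we can summarize the final model stored in the train object (assuming the usual caret structure, where finalModel is the selected glm fit):
# Show the predictors retained by stepwise AIC selection
summary(my_stepglm$finalModel)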
Next, let's fit a partial least squares (PLS) model, tuning the number of components. Prior to performing PLS, the predictors should be centered and scaled, especially if they are on scales of differing magnitude. PLS seeks directions of maximum variation in the predictors while simultaneously considering correlation with the response.
control <- trainControl(method="repeatedcv", number=10, repeats=5)
my_tunedpls <- train(training[,3:10], training[,11],
method = "pls",
## The default tuning grid evaluates
## components 1... tuneLength
tuneLength = 20,
trControl = control,
preProc = c("center", "scale"))
# Predict using test data
pred<-predict(my_tunedpls,testingX)
my_data=as.data.frame(cbind(predicted=pred,observed=testingY))
ggplot(my_data,aes(predicted,observed))+geom_point()+geom_smooth(method=lm)+ggtitle('PLS')
# Print, plot variable importance
print(varImp(my_tunedpls, scale = FALSE))
plot(varImp(my_tunedpls, scale = FALSE), main="Variable Importance using PLS")
summary(my_tunedpls)
plot(my_tunedpls)
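The tuning profile above shows how the cross-validated error changes with the number of components; the selected value is stored in the train object (assuming the usual caret structure):
# Number of PLS components selected by repeated cross-validation
my_tunedpls$bestTune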
Next, let's fit a ridge-regression model, tuning the penalty parameter lambda over a grid of values.
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))
control <- trainControl(method="repeatedcv", number=10, repeats=5)
set.seed(100)
my_ridgeReg <- train(training[,3:10], training[,11],
method = "ridge",
## Fit the model over many penalty values
tuneGrid = ridgeGrid,
trControl = control,
## put the predictors on the same scale
preProc = c("center", "scale"))
# Predict using the test data
pred<-predict(my_ridgeReg,testingX)
my_data=as.data.frame(cbind(predicted=pred,observed=testingY))
ggplot(my_data,aes(predicted,observed))+geom_point()+geom_smooth(method=lm)+ggtitle('Ridge Regression')
# Print, plot variable importance
print(varImp(my_ridgeReg, scale = FALSE))
plot(varImp(my_ridgeReg, scale = FALSE), main="Variable Importance using Ridge Regression")
plot(my_ridgeReg)
Now, let's fit a lasso model, tuning the fraction of the full L1 solution over a grid of values.
lassoGrid <- expand.grid(.fraction = seq(.05, 1, length = 20))
control <- trainControl(method="repeatedcv", number=10, repeats=5)
set.seed(100)
my_lasso <- train(training[,3:10], training[,11],
method = "lasso",
## Fit the model over many penalty values
tuneGrid = lassoGrid,
trControl = control,
## put the predictors on the same scale
preProc = c("center", "scale"))
# Predict using test data
pred<-predict(my_lasso,testingX)
my_data=as.data.frame(cbind(predicted=pred,observed=testingY))
ggplot(my_data,aes(predicted,observed))+geom_point()+geom_smooth(method=lm)+ggtitle('lasso')
# Print, plot variable importance
print(varImp(my_lasso, scale = FALSE))
plot(varImp(my_lasso, scale = FALSE), main="Variable Importance using lasso")
plot(my_lasso)
Lastly among the linear models, let's fit an elastic net, which combines the ridge penalty (lambda) with the lasso fraction.
enetGrid <- expand.grid(.lambda = c(0, 0.01, .1),
.fraction = seq(.05, 1, length = 20))
control <- trainControl(method="repeatedcv", number=10, repeats=5)
set.seed(100)
my_enet <- train(training[,3:10], training[,11],
method = "enet",
## Fit the model over many penalty values
tuneGrid = enetGrid,
trControl = control,
## put the predictors on the same scale
preProc = c("center", "scale"))
# Predict using the test data
pred<-predict(my_enet,testingX)
my_data=as.data.frame(cbind(predicted=pred,observed=testingY))
ggplot(my_data,aes(predicted,observed))+geom_point()+geom_smooth(method=lm)+ggtitle('Elastic net')
# Print, plot variable importance
print(varImp(my_enet, scale = FALSE))
plot(varImp(my_enet, scale = FALSE), main="Variable Importance using elastic net model")
plot(my_enet)
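Since all of the linear models above were fitted with the same 10-fold cross-validation repeated 5 times, their resampling performance can be compared directly with caret's resamples function; a minimal sketch (the list names are just labels chosen here):
# Collect and compare the cross-validated performance of the linear models
linear_models <- resamples(list(GLM = my_glm, StepGLM = my_stepglm, PLS = my_tunedpls,
                                Ridge = my_ridgeReg, Lasso = my_lasso, ENet = my_enet))
summary(linear_models)
bwplot(linear_models, metric = "RMSE")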
Now, let's work with non-linear regression models: neural network, support vector machines, random forest, boosting, and classification and regression trees.
The train function can also be used to choose the number of hidden units and the amount of weight decay via resampling.
First, we remove predictors to ensure that the maximum absolute pairwise correlation between the predictors is less than 0.75.
# The findCorrelation takes a correlation matrix and determines the
# column numbers that should be removed to keep all pair-wise correlations below a threshold
tooHigh <- findCorrelation(cor(training[,3:10]), cutoff = .75)
trainXnnet <- training[,3:10][, -tooHigh]
testXnnet <- testing[,3:10][, -tooHigh]
## Create a specific candidate set of models to evaluate:
nnetGrid <- expand.grid(.decay = seq(0, 0.1, .01),
.size = c(3:10),
## The next option is to use bagging
## instead of different random
## seeds.
.bag = FALSE)
set.seed(100)
control <- trainControl(method="repeatedcv", number=10, repeats=5)
my_avnnet <- train(trainXnnet, training[,11],
method = "avNNet",repeats = 10,
tuneGrid = nnetGrid,
trControl = control,
## Automatically standardize data prior to modeling and prediction
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,
MaxNWts = 10 * (ncol(trainXnnet) + 1) + 10 + 1,
maxit = 500)
# Predict using the test data
pred<-predict(my_avnnet,testXnnet)
my_data=as.data.frame(cbind(predicted=pred,observed=testingY))
ggplot(my_data,aes(predicted,observed))+geom_point()+geom_smooth(method=lm)+ggtitle('ANN')
# Print, plot variable importance
print(varImp(my_avnnet, scale = FALSE))
plot(varImp(my_avnnet, scale = FALSE), main="Variable Importance using AvNNet")
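The tuning parameters chosen by resampling (number of hidden units and amount of weight decay) are stored in the train object (assuming the usual caret structure):
# Selected size (hidden units) and decay for the averaged neural network
my_avnnet$bestTune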
Now let's visualize the individual neural networks that are averaged by avNNet.
# Extract the component networks from the avNNet ensemble
allmods <- my_avnnet$finalModel$model
for (i in 1:length(allmods))
{
mod <- allmods[[i]]   # i-th component network
wts <- mod$wts        # fitted weights
decay <- mod$decay    # weight decay used
struct <- mod$n       # network structure: inputs, hidden units, outputs
# Recreate the nnet object with the stored weights (maxit = 0 so it is not refitted)
recmod <- nnet(trainXnnet, training[,11], Wts = wts, decay = decay, size = struct[2], maxit = 0)
# Plot the individual network with NeuralNetTools
plotnet(recmod)
}
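The remaining non-linear models mentioned above (support vector machines, random forest, boosting, and classification and regression trees) can be fitted with the same train() workflow. A minimal sketch using caret's default tuning grids; the object names (my_svm, my_rf, my_gbm, my_cart) are illustrative, not tuned settings:
control <- trainControl(method="repeatedcv", number=10, repeats=5)
set.seed(100)
my_svm  <- train(training[,3:10], training[,11], method = "svmRadial",
                 preProc = c("center", "scale"), tuneLength = 10, trControl = control)
set.seed(100)
my_rf   <- train(training[,3:10], training[,11], method = "rf",
                 importance = TRUE, trControl = control)
set.seed(100)
my_gbm  <- train(training[,3:10], training[,11], method = "gbm",
                 verbose = FALSE, trControl = control)
set.seed(100)
my_cart <- train(training[,3:10], training[,11], method = "rpart",
                 tuneLength = 10, trControl = control)
# Predictions on the test data follow the same pattern as before, e.g.:
pred <- predict(my_svm, testingX)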