Here I show how to build various machine learning models in Python and R. The models include linear regression, logistic regression, tree-based models (bagging and random forest) and support vector machines (SVM).
The data set used for the linear regression models is from here.
setwd("C:/Fish/Python/Python_vs_R")
options(jupyter.plot_mimetypes = 'image/png')
options(repr.plot.width = 6)
options(repr.plot.height = 4)
clim<-read.csv("climate_change.csv")
names(clim)
dim(clim)
training<-subset(clim,clim$Year<=2006)
testing<-subset(clim,clim$Year> 2006)
model1<-lm(Temp~MEI+CO2+CH4+N2O+CFC.11+CFC.12+TSI+Aerosols,data=training)
summary(model1)
In this preliminary model, we see that the multiple R-squared is 0.7509 and that all covariates are significant except CH4.
options(repr.plot.width = 6)
options(repr.plot.height = 4)
pred_train<-predict(model1,newdata=training)
plot(training$Temp,pred_train,col='darkblue',xlab='Observed temperature',ylab='fitted values',pch=16)
pred_test<-predict(model1, newdata=testing)
plot(testing$Temp,pred_test,col='green',pch=16,xlab='Observed temperature',ylab='Predictions')
Now, let's see how to build a linear regression model in Python.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import metrics
from sklearn.model_selection import train_test_split
clim = pd.read_csv(r'https://courses.edx.org/asset-v1:MITx+15.071x_2a+2T2015+type@asset+block/climate_change.csv')
Let's look at the covariates.
clim.head(3)
clim.shape
We can also plot the data. Let's visualize the relationship between the features and the response (temperature) using scatterplots with seaborn.
sns.pairplot(clim, x_vars=['MEI','CO2','CH4','N2O'], y_vars='Temp', height=2.5, aspect=1, kind='reg')
sns.pairplot(clim, x_vars=['CFC-11','CFC-12','TSI','Aerosols'], y_vars='Temp', height=2.5, aspect=1, kind='reg')
We can also plot the scatter plots using pandas.
pd.plotting.scatter_matrix(clim[['Year','MEI','CO2','CH4','N2O','CFC-11','CFC-12','TSI','Aerosols','Temp']], figsize=(15, 15),alpha=0.2,diagonal='kde')
plt.show()
training=clim[clim['Year']<=2006]
testing=clim[clim['Year']>2006]
training.shape, testing.shape
np.unique(training['Year'])
The training period covers 1983 to 2006.
np.unique(testing['Year'])
The testing data covers 2007 and 2008.
### SCIKIT-LEARN ###
from sklearn import linear_model
linear = linear_model.LinearRegression()
# create X and y
feature_cols = ['MEI','CO2','CH4','N2O','CFC-11','CFC-12','TSI','Aerosols']
X = training[feature_cols]
y = training.Temp
# fit
linear.fit(X, y)
print('R-squared: %0.2f' % linear.score(X, y))
Alternatively, we can calculate the R-squared value using the metrics.r2_score function.
y_pred = linear.predict(X)
round(metrics.r2_score(y, y_pred),2)
The R-squared value is 0.75, the same as what R gives us.
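For reference, both linear.score and metrics.r2_score report the coefficient of determination, 1 - SS_res / SS_tot. The following is a minimal pure-Python sketch of that formula; the function name r_squared and the toy numbers are made up for illustration and are not the climate data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: what the model fails to explain.
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    # Total sum of squares: spread of the observations around their mean.
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values, invented for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
fitted = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(observed, fitted), 2))  # prints 0.98
```

A value of 1 means the fitted values reproduce the observations exactly, while 0 means the model does no better than always predicting the mean.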
Let's print the coefficients and the intercept.
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)
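The intercept and coefficients fully determine the fitted model: a prediction is the intercept plus the sum of each coefficient times its covariate. As a rough sketch of what least squares does, here is the one-covariate case in pure Python. The helper names fit_simple_ols and predict_linear and the toy data are assumptions for illustration; scikit-learn solves the multi-covariate problem with a numerical least-squares routine rather than this closed form.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) minimizing squared error for y ~ a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_linear(intercept, coefs, row):
    """Prediction = intercept + sum of coefficient * covariate value."""
    return intercept + sum(c * v for c, v in zip(coefs, row))


# Toy data lying exactly on y = 1 + 2x (invented for illustration).
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_simple_ols(x, y)
print(intercept, slope)                           # prints 1.0 2.0
print(predict_linear(intercept, [slope], [4.0]))  # prints 9.0
```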
feature_cols = ['MEI','CO2','CH4','N2O','CFC-11','CFC-12','TSI','Aerosols']
X_test = testing[feature_cols]
predicted= linear.predict(X_test)
obs_pred=pd.DataFrame({"Observation":testing.Temp,"Predicted":predicted})
obs_pred.plot(kind='scatter', x='Observation', y='Predicted',color='DarkBlue')
plt.show()
Or using seaborn:
sns.pairplot(obs_pred, x_vars=['Observation'], y_vars='Predicted', height=10, aspect=1, kind='reg')
plt.show()
Let's predict the risk of a borrower being unable to repay a loan. The data set used here is from LendingClub.com.
Our response is the 'not.fully.paid' variable, which indicates that the loan was not paid back in full. The data used here can be downloaded from here.
setwd("C:/Fish/Python/Python_vs_R")
loans<-read.csv("loans_imputed.csv")
Let's explore the dataset.
str(loans)
set.seed(144)
library(caTools)
split=sample.split(loans$not.fully.paid,SplitRatio = 0.7)
train<-loans[split==TRUE, ]
test<-loans[split==FALSE, ]
mod1<-glm(not.fully.paid~., data=train, family=binomial)
Let's see the significant features.
summary(mod1)
Now, we can predict using the test data.
predicted.risk=predict(mod1,newdata=test, type="response")
test$predicted.risk=predicted.risk
Let's plot the receiver operating characteristic (ROC) curve.
# load ROCR package
library(ROCR)
# Prediction function
ROCRpred = prediction(test$predicted.risk,test$not.fully.paid)
# Performance function
ROCRperf = performance(ROCRpred, "tpr", "fpr")
# Plot ROC curve
options(repr.plot.width = 8)
options(repr.plot.height = 5)
plot(ROCRperf, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))
Now, let's calculate the area under the curve (AUC) value. AUC is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming 'positive' ranks higher than 'negative').
# AUC
round(as.numeric(performance(ROCRpred, "auc")@y.values),2)
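The ranking interpretation of AUC can be checked directly by comparing every positive instance with every negative one. The sketch below does exactly that in pure Python. The function name auc_by_ranking and the toy scores are made up for illustration; ROCR computes the area under the ROC curve instead, which gives the same number.

```python
def auc_by_ranking(scores, labels):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win.
    """
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy predicted risks and outcomes (invented, not the loan data).
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
print(round(auc_by_ranking(scores, labels), 3))  # 5 of 6 pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.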
We can also calculate the confusion matrix and the metrics derived from it.
# The confusion matrix can be computed with the following commands:
test$predicted.risk = predict(mod1, newdata=test, type="response")
table(test$not.fully.paid, test$predicted.risk > 0.5) # using 0.5 as a threshold
Now, let's calculate accuracy, sensitivity and specificity.
cm = table(test$not.fully.paid, test$predicted.risk > 0.5) # rows: actual class, columns: predicted class
accuracy=round(sum(diag(cm))/sum(cm),2)
sensitivity=round(cm["1","TRUE"]/sum(cm["1",]),2) # TP/(TP+FN)
specificity=round(cm["0","FALSE"]/sum(cm["0",]),2) # TN/(TN+FP)
cat("\tAccuracy = ",accuracy ,"\n", "\tSensitivity = ",sensitivity, "\n","\tSpecificity = ",specificity ,"\n")
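To make the three definitions explicit, here is a small pure-Python sketch that counts the four confusion-matrix cells and derives the metrics from them. The function names and the toy labels are assumptions for illustration; they are not the output of the model above.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, tn, fn) for binary labels coded as 0/1."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 1:
            fp += 1
        elif a == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # share of actual positives detected
        "specificity": tn / (tn + fp),  # share of actual negatives detected
    }


# Toy labels (invented): 3 unpaid loans, 5 paid loans.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 0, 0, 0, 0, 0, 0, 1]
metrics_out = classification_metrics(*confusion_counts(actual, predicted))
print(metrics_out)
```

Note that sensitivity divides by the number of actual positives and specificity by the number of actual negatives, not by the number of predicted ones.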
loan = pd.read_csv(r'http://courses.edx.org/asset-v1:MITx+15.071x_2a+2T2015+type@asset+block/loans_imputed.csv')
Let's have a look at the dataset.
loan.head(3)
loan.shape
Let's create a Python list of feature names and use the list to select a subset of the original DataFrame.
# create a Python list of feature names
feature_cols = ['credit.policy', 'int.rate','installment','log.annual.inc','dti','fico','days.with.cr.line','revol.bal','revol.util'
,'inq.last.6mths','delinq.2yrs','pub.rec']
# use the list to select a subset of the original DataFrame
X = loan[feature_cols]
# print the first 5 rows of the features.
X.head()
Then, let's prepare the dependent variable (response).
y = loan['not.fully.paid']
# print the first 5 values of the predictand
y.head()
#Import Library
from sklearn.linear_model import LogisticRegression
# Create logistic regression object
model = LogisticRegression()
Split the data into training and testing sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=144,test_size=0.3)
# test_size=0.3 overrides the default split of 75% for training and 25% for testing
Check that the training data is 70% of the original data and the remaining 30% is allocated for testing.
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
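One difference from the R code is worth noting: sample.split from caTools keeps the proportion of each outcome class the same in both parts, whereas train_test_split as called here splits purely at random unless its stratify argument is set (for example stratify=y). The following pure-Python sketch shows the idea of a stratified split. The function stratified_split and the toy labels are hypothetical and only illustrate the technique.

```python
import random


def stratified_split(labels, test_size=0.3, seed=144):
    """Split row indices so each class keeps roughly the same train/test ratio."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random order within each class
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy outcome vector (invented): 80 paid loans and 20 unpaid loans.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))     # prints 70 30
print(sum(labels[i] for i in test_idx))  # prints 6
```

With an imbalanced outcome such as loan default, stratifying keeps the share of unpaid loans comparable between the training and testing sets.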
# Create logistic regression object
model = LogisticRegression()
# Train the model using the training sets and check score
model.fit(X_train, y_train)
print('The accuracy of this model is %0.2f' % round(model.score(X_train, y_train), 2))
The accuracy (here measured on the training set) is similar in Python and R.
# Equation coefficient and Intercept
print('Coefficient: \n', model.coef_)
print('Intercept: \n', model.intercept_)
#Predict Output
predicted= model.predict(X_test)
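Behind predict, a logistic regression model turns the linear score (intercept plus coefficients times features) into a probability with the sigmoid function and assigns class 1 when that probability exceeds 0.5. Below is a minimal sketch of that step in pure Python. The coefficient values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(intercept, coefs, row):
    """P(class = 1) for one row of features."""
    z = intercept + sum(c * v for c, v in zip(coefs, row))
    return sigmoid(z)


def predict_label(intercept, coefs, row, threshold=0.5):
    return 1 if predict_probability(intercept, coefs, row) > threshold else 0


# Hypothetical one-feature model: intercept -1, coefficient 2.
print(round(predict_probability(-1.0, [2.0], [1.0]), 3))  # score +1 gives about 0.731
print(predict_label(-1.0, [2.0], [1.0]))                  # prints 1
print(predict_label(-1.0, [2.0], [0.0]))                  # prints 0
```

Lowering the threshold below 0.5 flags more borrowers as risky, which raises sensitivity at the cost of specificity.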
Let's predict the risk of a borrower being unable to repay a loan, this time with a decision tree. The data set used here is again from LendingClub.com.
Our response is the 'not.fully.paid' variable, which indicates that the loan was not paid back in full. The data used here can be downloaded from here.
setwd("C:/Fish/Python/Python_vs_R")
library(caTools)
loans<-read.csv("loans_imputed.csv")
set.seed(144)
split=sample.split(loans$not.fully.paid,SplitRatio = 0.7)
train<-loans[split==TRUE, ]
test<-loans[split==FALSE, ]
library(rpart)
library(rpart.plot)
CARTmodel = rpart(not.fully.paid~., data=train, method="class",cp=0.002)
prp(CARTmodel)
# use names that do not mask the built-in table() and ROCR's prediction()
cart.pred<-predict(CARTmodel, type="class",newdata=test)
conf.mat<-table(cart.pred,test$not.fully.paid)
accuracy<-sum(diag(conf.mat))/(sum(conf.mat))
cat("The accuracy is ",round(accuracy,2))
The accuracy is similar to what we found with logistic regression in both R and Python.
loan = pd.read_csv(r'http://courses.edx.org/asset-v1:MITx+15.071x_2a+2T2015+type@asset+block/loans_imputed.csv')
# create a Python list of feature names
feature_cols = ['credit.policy', 'int.rate','installment','log.annual.inc','dti','fico','days.with.cr.line','revol.bal','revol.util'
,'inq.last.6mths','delinq.2yrs','pub.rec']
# use the list to select a subset of the original DataFrame
X = loan[feature_cols]
# print the first 5 rows
X.head()
# select a Series from the DataFrame
y = loan['not.fully.paid']
# print the first 5 values
y.head()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=144,test_size=0.3)
from sklearn import tree
# Create tree object
model = tree.DecisionTreeClassifier(criterion='gini') # the split criterion can be 'gini' (the default) or 'entropy' (information gain)
# for regression, use tree.DecisionTreeRegressor() instead
# Train the model using the training sets and check score
model.fit(X_train, y_train)
from sklearn.metrics import accuracy_score
clf = tree.DecisionTreeClassifier(min_samples_split=60) # default value of min_samples_split is 2
clf.fit(X_train, y_train)
print(round(accuracy_score(y_test, clf.predict(X_test)), 2))
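The criterion option of the decision tree chooses how the impurity of a node is measured when candidate splits are compared. As a small illustration, here are the two measures written out in pure Python; the function names gini and entropy are chosen for this sketch and the label lists are toy data.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum of p * log2(p) over the classes."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


mixed = [0, 0, 1, 1]  # evenly mixed node: most impure for two classes
print(gini(mixed), entropy(mixed))  # prints 0.5 1.0
print(gini([1, 1, 1]))              # a pure node has impurity 0.0
```

Both measures are zero for a pure node and largest for an evenly mixed one, so in practice they often lead to similar trees.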
Again, let's predict the risk of a borrower being unable to repay a loan, this time with a support vector machine. The data set used here is from LendingClub.com.
Our response is the 'not.fully.paid' variable, which indicates that the loan was not paid back in full. The data used here can be downloaded from here.
setwd("C:/Fish/Python/Python_vs_R")
library(caTools)
loans<-read.csv("loans_imputed.csv")
set.seed(144)
split=sample.split(loans$not.fully.paid,SplitRatio = 0.7)
train<-loans[split==TRUE, ]
test<-loans[split==FALSE, ]
library(e1071)
fit = svm(not.fully.paid~., data=train)
summary(fit)
Now, let's predict using the test data.
predicted= predict(fit,test)
We can calculate the confusion matrix, using 0.5 as a threshold.
table(test$not.fully.paid, predicted>0.5)
cat('Accuracy =',sum(diag(table(test$not.fully.paid, predicted>0.5)))/sum(table(test$not.fully.paid, predicted>0.5)))
loan = pd.read_csv(r'http://courses.edx.org/asset-v1:MITx+15.071x_2a+2T2015+type@asset+block/loans_imputed.csv')
# create a Python list of feature names
feature_cols = ['credit.policy', 'int.rate','installment','log.annual.inc','dti','fico','days.with.cr.line','revol.bal','revol.util'
,'inq.last.6mths','delinq.2yrs','pub.rec']
X = loan[feature_cols]
y = loan['not.fully.paid']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=144,test_size=0.3)
#Import Library
from sklearn import svm
model = svm.SVC(kernel='rbf',C=10000.0) # check with different kernels and C
# fit on the training set only, so the test set stays unseen
model.fit(X_train, y_train)
model.score(X_train, y_train) # accuracy on the training set
#Predict Output
predicted= model.predict(X_test)
Calculate accuracy.
from sklearn.metrics import accuracy_score
print('Accuracy is', round(accuracy_score(y_test, predicted), 2))
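The 'rbf' kernel used above measures the similarity of two rows as exp(-gamma * squared distance). The sketch below writes this out in pure Python to show why the scale of the features matters for an SVM. The function name rbf_kernel, the gamma value, and the toy rows are assumptions for illustration only.

```python
import math


def rbf_kernel(x, z, gamma):
    """Radial basis function kernel: exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


# Identical rows have similarity 1.
print(rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5))  # prints 1.0

# Rows one unit apart in each of two features.
print(round(rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5), 4))  # exp(-1), about 0.3679

# A feature on a large scale (for example a balance in dollars) dominates
# the distance and drives the similarity to essentially zero.
print(rbf_kernel([0.0, 0.0], [1000.0, 0.0], gamma=0.5))
```

Because the kernel depends on distances, features measured on very different scales are usually standardized before fitting an SVM.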
setwd("C:/Fish/Python/Python_vs_R")
library(caTools)
loans<-read.csv("loans_imputed.csv")
set.seed(144)
split=sample.split(loans$not.fully.paid,SplitRatio = 0.7)
train<-loans[split==TRUE, ]
test<-loans[split==FALSE, ]
library(randomForest)
fit = randomForest(as.factor(not.fully.paid)~., data=train)
fit
Let's see variable importance.
importance(fit)
We can plot variable importance from the Random Forest model.
options(repr.plot.width = 8)
options(repr.plot.height = 5)
varImpPlot(fit,main ="Variable Importance of features")
prediction=predict(fit, test)
accuracy = sum(prediction==test$not.fully.paid)/length(test$not.fully.paid)
round(accuracy,2)
loan = pd.read_csv(r'http://courses.edx.org/asset-v1:MITx+15.071x_2a+2T2015+type@asset+block/loans_imputed.csv')
# create a Python list of feature names
feature_cols = ['credit.policy', 'int.rate','installment','log.annual.inc','dti','fico','days.with.cr.line','revol.bal','revol.util'
,'inq.last.6mths','delinq.2yrs','pub.rec']
X = loan[feature_cols]
y = loan['not.fully.paid']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=144,test_size=0.3)
#Import Library
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
# fit on the training set only, so the test set stays unseen
model.fit(X_train, y_train)
model.score(X_train, y_train) # accuracy on the training set
#Predict Output
predicted= model.predict(X_test)
from sklearn.metrics import accuracy_score
print('Accuracy is', round(accuracy_score(y_test, predicted), 2))
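A random forest is a bagging ensemble of decision trees: each tree is trained on a bootstrap resample of the training rows, and the forest predicts by majority vote. The pure-Python sketch below shows only those two ingredients. The helper names and the toy votes are hypothetical, and the random feature subsetting that a random forest performs at each split is left out.

```python
import random
from collections import Counter


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap resample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def majority_vote(predictions_per_model):
    """Combine per-model predictions row by row, keeping the most common label."""
    n_rows = len(predictions_per_model[0])
    combined = []
    for i in range(n_rows):
        votes = [preds[i] for preds in predictions_per_model]
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


rng = random.Random(144)
sample = bootstrap_indices(5, rng)
print(sample)  # some rows typically repeat, others are left out

# Toy predictions from three trees for three test rows (invented).
tree_predictions = [[0, 1, 1],
                    [1, 1, 0],
                    [0, 1, 0]]
print(majority_vote(tree_predictions))  # prints [0, 1, 0]
```

Averaging over many resampled trees reduces the variance of a single deep tree, which is the main motivation for bagging.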
Let's standardize the data and see how the accuracy changes.
from sklearn import preprocessing
std_scale = preprocessing.StandardScaler().fit(X_train)
X_train = std_scale.transform(X_train)
X_test = std_scale.transform(X_test)
model = RandomForestClassifier()
# fit on the standardized training set (X itself was not standardized)
model.fit(X_train, y_train)
model.score(X_train, y_train) # accuracy on the training set
#Predict Output
predicted= model.predict(X_test)
from sklearn.metrics import accuracy_score
print('Accuracy is', round(accuracy_score(y_test, predicted), 2))
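The standardization step used above learns the mean and standard deviation from the training set only and then applies the same transformation to the test set. Here is a minimal pure-Python sketch of that pattern for a single feature. The function names fit_scaler and transform and the toy values are made up for illustration; like StandardScaler, the sketch uses the population standard deviation.

```python
import math


def fit_scaler(values):
    """Learn the mean and (population) standard deviation of one feature."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, math.sqrt(variance)


def transform(values, mean, std):
    """Center and scale values with statistics learned elsewhere."""
    return [(v - mean) / std for v in values]


# Toy feature values (invented).
train_values = [2.0, 4.0, 6.0, 8.0]
test_values = [5.0, 10.0]

mean, std = fit_scaler(train_values)         # statistics come from training data only
train_scaled = transform(train_values, mean, std)
test_scaled = transform(test_values, mean, std)
print(mean, round(std, 3))                   # prints 5.0 2.236
print([round(v, 3) for v in test_scaled])    # prints [0.0, 2.236]
```

Reusing the training statistics for the test set avoids leaking information about the test data into the preprocessing step.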
In this post, we saw how to implement various machine learning techniques (including linear regression, logistic regression, bagging, random forest, and support vector machines) using R and Python, particularly the scikit-learn Python library.