This blog post series covers machine learning with Python and R. We will use the scikit-learn library in Python and the caret package in R. In this part, we first perform exploratory data analysis (EDA) on a real-world dataset, and then apply non-regularized linear regression to a supervised regression problem on that dataset: predicting power output from a set of environmental readings taken by sensors in a natural gas-fired power generation plant. In the second part, we will work with regularized linear regression models (ridge, lasso and elastic net). After that, we will look at non-linear regression models.
The real-world data used in this post consists of 9,568 data points, each with 4 environmental attributes, collected from a Combined Cycle Power Plant over 6 years (2006-2011). It is provided by the University of California, Irvine, in the UCI Machine Learning Repository as the Combined Cycle Power Plant Data Set. You can find more details about the dataset on the UCI page. The task is a regression problem because the label (or target) we are trying to predict is numeric.
Import libraries to perform Extract-Transform-Load (ETL) and Exploratory Data Analysis (EDA)
The caret package (short for _C_lassification _A_nd _RE_gression _T_raining) is a set of functions that attempt to streamline the process for creating predictive models.
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import matplotlib.pyplot as plt
plt.style.use('ggplot') # if you are an R user and want to feel at home
library(readxl)
library(ggplot2)
library(corrplot)
library(tidyverse)
Load Data
power_plant = pd.read_excel("Folds5x2_pp.xlsx")
power_plant = read_excel("Folds5x2_pp.xlsx")
Exploratory Data Analysis (EDA)
This is a step that we should always perform before trying to fit a model to the data, as this step will often lead to important insights about our data.
type(power_plant)
class(power_plant)
# See first few rows
power_plant.head()
# Caret faces problems working with tbl,
# so let's change the data to simple data frame
power_plant = data.frame(power_plant)
message("The class is now ", class(power_plant))
# See first few rows
head(power_plant)
The columns in the DataFrame are:
AT = Atmospheric Temperature in C
V = Exhaust Vacuum Speed
AP = Atmospheric Pressure
RH = Relative Humidity
PE = Power Output
Power Output is the value we are trying to predict given the measurements above.
# Size of DataFrame
power_plant.shape # we have 9568 rows and 5 columns
dim(power_plant)
# we have 9568 rows and 5 columns
# class of each column in the DataFrame
power_plant.dtypes # all columns are numeric
map(power_plant, class)
# all columns are numeric
# Are there any missing values in any of the columns?
power_plant.info() # no column has missing data
# Check for missing values
map(power_plant, ~sum(is.na(.))) # using purrr package
# No column has missing data
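On the Python side, `info()` reports non-null counts, so missing values have to be inferred by comparing those counts with the number of rows. A more direct check, mirroring the R `map` call, might look like the sketch below. The small `demo` frame is a hypothetical stand-in for `power_plant` (made-up round numbers, with one gap inserted on purpose) so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for power_plant: same column names, made-up values,
# with one missing entry placed in AT on purpose
demo = pd.DataFrame({
    "AT": [10.0, 20.0, np.nan, 30.0],
    "V":  [40.0, 50.0, 60.0, 70.0],
    "AP": [1010.0, 1015.0, 1020.0, 1025.0],
    "RH": [60.0, 70.0, 80.0, 90.0],
    "PE": [480.0, 460.0, 450.0, 440.0],
})

# Count missing values per column
# (analogous to map(power_plant, ~sum(is.na(.))) in R)
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# True only if no column contains a missing value
has_no_missing = not demo.isnull().values.any()
print(has_no_missing)
```

Applied to the real data, the same two lines would be run on `power_plant` instead of `demo`.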
Visualize relationship between variables
Before we perform any modeling, it is a good idea to explore the correlations between the predictors and the predictand (the label). This step matters because it helps us select appropriate models. If the features and the outcome are linearly related, we can start with linear regression models. If the relationships between the label and the features are non-linear, however, non-linear models such as random forests may be a better choice.
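Alongside the scatter plots, the strength of each linear relationship can be summarized numerically with a correlation matrix. The following is a minimal sketch using `DataFrame.corr()`. The data is synthetic, not the real plant readings: `PE` is constructed to depend linearly on `AT`, and `V` is generated independently, purely so the snippet is self-contained and has something to detect.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for power_plant (NOT the real data).
# Assumption of this sketch: PE falls linearly with AT, plus noise.
rng = np.random.default_rng(0)
n = 500
at = rng.uniform(2, 37, n)
v = rng.uniform(25, 81, n)                  # unrelated to PE by construction
pe = 500 - 2.0 * at + rng.normal(0, 5, n)
demo = pd.DataFrame({"AT": at, "V": v, "PE": pe})

# Pearson correlation of every feature with the label
corr_with_label = demo.corr()["PE"].drop("PE")

# Rank features by the strength (absolute value) of the linear relationship
ranked = corr_with_label.abs().sort_values(ascending=False)

print(corr_with_label)
print(ranked.index.tolist())
```

With the real data, `demo` would be replaced by `power_plant`. Features with a correlation close to +1 or -1 are good candidates for a linear model, while values near 0 only rule out a linear relationship, not a non-linear one.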
# Correlation between power output and temperature
power_plant.plot(x ='AT', y = 'PE', kind ="scatter",
figsize = [10,10],
color ="b", alpha = 0.3,
fontsize = 14)
plt.title("Temperature vs Power Output",
fontsize = 24, color="darkred")
plt.xlabel("Atmospheric Temperature", fontsize = 18)
plt.ylabel("Power Output", fontsize = 18)
plt.show()
# Correlation between atmospheric temperature and
# power output
power_plant %>% ggplot(aes(AT, PE)) +
geom_point(color= "blue", alpha = 0.3) +
ggtitle("Temperature vs Power Output") +
xlab("Atmospheric Temperature") +
ylab("Power Output") +
theme(plot.title = element_text(color="darkred",
size=18,hjust = 0.5),
axis.text.y = element_text(size=12),
axis.text.x = element_text(size=12,hjust=.5),
axis.title.x = element_text(size=14),
axis.title.y = element_text(size=14))
As shown in the figure above, there is a strong linear correlation between Atmospheric Temperature and Power Output.
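A scatter plot shows the trend visually. One way to put a number on it is to fit a straight line by least squares and report the Pearson correlation coefficient. The sketch below does this with NumPy on synthetic (temperature, power) pairs. The slope, intercept and noise level used to generate them are assumptions chosen for illustration, not values estimated from the plant data.

```python
import numpy as np

# Synthetic (temperature, power) pairs, NOT the real plant readings.
# The generating line 495 - 2.0 * temperature is an illustrative assumption.
rng = np.random.default_rng(1)
temperature = rng.uniform(2, 37, 300)
power = 495 - 2.0 * temperature + rng.normal(0, 4, 300)

# Degree-1 least-squares fit: power ~ slope * temperature + intercept
slope, intercept = np.polyfit(temperature, power, 1)

# Pearson correlation coefficient between the two variables
r = np.corrcoef(temperature, power)[0, 1]

print(f"slope={slope:.2f}, intercept={intercept:.1f}, r={r:.3f}")
```

To apply this to the real data, `power_plant["AT"]` and `power_plant["PE"]` would take the place of `temperature` and `power`.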
# Correlation between Exhaust Vacuum Speed and power output
power_plant.plot(x ='V', y = 'PE',kind ="scatter",
figsize = [10,10],
color ="g", alpha = 0.3,
fontsize = 14)
plt.title("Exhaust Vacuum Speed vs Power Output", fontsize = 24, color="darkred")
plt.xlabel("Exhaust Vacuum Speed", fontsize = 18)
plt.ylabel("Power Output", fontsize = 18)
plt.show()
# Correlation between Exhaust Vacuum Speed
# and power output
power_plant %>% ggplot(aes(V, PE)) +
geom_point(color= "darkgreen", alpha = 0.3) +
ggtitle("Exhaust Vacuum Speed vs Power Output") +
xlab("Exhaust Vacuum Speed") +
ylab("Power Output") +
theme(plot.title = element_text(color="darkred",size=18,hjust = 0.5),
axis.text.y = element_text(size=12),
axis.text.x = element_text(size=12,hjust=.5),
axis.title.x = element_text(size=14),
axis.title.y = element_text(size=14))