Principal component analysis (PCA) is a statistical technique that transforms a set of possibly correlated variables into a set of linearly uncorrelated (orthogonal) variables called principal components. We can use it for applications such as data compression, identifying dominant physical processes, and extracting important features for machine learning. It reduces the number of features in the original data without losing much information.
Recently, I was applying PCA in Spark to a sparse matrix with millions of rows. To make sure everything was right, I first started with a small dataset and compared the results from Spark, Python, and R, which led to this blog post.
In this post, I will cover the data preprocessing required, how to implement PCA in R, Python, and Spark, and how to interpret and compare the results.
The calculation is done using singular value decomposition (SVD). The SVD of an m × n matrix A is
$A = U \Sigma V^{T}$
where U is an m × m orthogonal matrix whose columns are the eigenvectors of $AA^{T}$, V is an n × n orthogonal matrix whose columns are the eigenvectors of $A^{T}A$, and $\Sigma$ is an m × n rectangular diagonal matrix whose entries are zero except along the diagonal, which holds the singular values of A.
When applying PCA, we have to center our data, that is, subtract the mean of each column. Next, depending on the nature of the data, we may need to standardize it (make each feature have zero mean and unit variance). If the columns are on very different scales, such as year, temperature, and carbon dioxide concentration, we have to standardize the data. If the columns are in the same unit and on similar scales, on the other hand, standardizing them can result in a loss of important information. In the first case (same unit, similar scale) the SVD effectively works on the covariance matrix, while in the second case (different units, standardized data) it works on the correlation matrix.
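In R, for example, both choices can be made with the scale() function; a quick sketch (X here stands for any numeric data matrix, not an object used later in this post):
X_centered = scale(X, center = TRUE, scale = FALSE)      # centering only: PCA on the covariance matrix
X_standardized = scale(X, center = TRUE, scale = TRUE)   # centering and scaling: PCA on the correlation matrix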
The principal components (PCs) are the matrix product of the (centered) original data and the matrix V, which is equal to the product of the matrices U and $\Sigma$.
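To see why these two products give the same result, multiply both sides of the SVD by V and use the fact that V is orthogonal ($V^{T}V = I$):
$AV = U \Sigma V^{T} V = U \Sigma$
so projecting the centered data onto the columns of V yields the same scores as scaling the columns of U by the singular values.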
For this example, we will use the iris data. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Each sample has four features: the length and the width of the sepals and petals, in centimeters. You can get the data, including details, from the UCI Machine Learning Repository. I downloaded the data from here.
options(repr.plot.width=8, repr.plot.height=5)  # plot size for the notebook
library(tidyverse)
options(readr.num_columns = 0)  # suppress readr column-specification messages
library(corrplot)
iris = read_csv("https://raw.githubusercontent.com/venky14/Machine-Learning-with-Iris-Dataset/master/Iris.csv")
head(iris) # show sample records
All the units are in cm. Since we do not need the Id and Species columns for our analysis, let's drop them.
species = iris$Species # creating a vector to use it later
iris = iris %>% select(-Id,-Species)
Since the columns are on the same scale, we do not need to standardize the data: centering the columns is enough (the SVD works on the covariance matrix in this case). As shown below, we can specify whether to center and/or scale the data in the prcomp call.
correlations = cor(iris)
corrplot(correlations, method="circle", addCoef.col = "white",
title = 'Fig.1. Correlation', mar=c(0,0,1,0))
As shown in Fig.1. above, some of the variables are strongly correlated. This means we can reduce the dimensionality without losing much information. When the variables are strongly correlated, the first few PCs explain a large portion of the variability in the data. When applying PCA with R, Python or Spark, we have to make sure that the rows are samples and the columns are variables.
pcp = prcomp(iris, center = TRUE, scale. = FALSE) # center it but do not scale it
The prcomp function uses SVD and returns, among other things, the standard deviations of the PCs (sdev), the loadings (rotation), and the principal component scores (x).
The variance explained by each PC is shown below. The PCs are ranked based on the variance they explain: the first PC explains the highest variance, followed by the second PC and so on.
So, we see that the first PC explains almost all the variance (92.4616%) while the fourth PC explains only 0.5183%. In scenarios where we have many columns and the first few PCs account for most of the variability in the original data (say, 95% of the variance), we can use those first few PCs for data exploration and for machine learning. In data exploration, we can visualize the PCs and get a better understanding of the underlying processes. In machine learning, by using the PCs we work with fewer columns, which can significantly reduce computational cost and help reduce over-fitting. It is important to mention that when you use PCA for machine learning, the test data should be transformed using the loading vectors found from the training data; otherwise your model will have difficulty generalizing to unseen data (a short sketch of this is shown after the variance plot below).
variance = data_frame(PC = paste0('PC', seq(1,4)), variance = round(100.00*pcp$sdev^2/sum(pcp$sdev^2),4))
variance %>% ggplot(aes(x = PC, y = variance)) + geom_bar(stat = 'identity') + ggtitle('Fig.2. Variance Explained') +
theme(axis.title = element_text(size=14), plot.title = element_text(size = 16,colour="darkblue", hjust = 0.5),
axis.text = element_text(size=14)) + geom_text(aes(label = variance), vjust = -0.4) + xlab('') +
ylab('Variance Explained (%)')
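As a minimal sketch of the point above about transforming test data with the training loadings (the split here is hypothetical and purely for illustration; in practice you would split the rows randomly):
train = iris[1:100, ]    # hypothetical training rows
test  = iris[101:150, ]  # hypothetical test rows
pca_train = prcomp(train, center = TRUE, scale. = FALSE)
# project the test rows with the training means and training loadings;
# predict() on a prcomp object centers the new data with the training means and multiplies by the training rotation
test_pcs = predict(pca_train, newdata = test)
head(test_pcs)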
We can see which variables have the largest effects in each PC from the rotations (loadings). The features with the highest magnitude of loading in a PC have the largest effect on it. From the loadings of the iris data below, we observe that petal length has the highest effect on the first PC, while sepal length and sepal width are the most important for the second PC. If two features have similar loadings in the first and the second PCs, the one in the first PC is more important. Generally, for similar loadings, features in the $(n-1)^{th}$ PC are more important than those in the $n^{th}$ PC.
round(pcp$rotation, 4)
rotation = as.data.frame(pcp$rotation)
rotation$dimension = row.names(rotation)
rotation = rotation %>% gather(key = 'variable', value = 'value', -dimension) %>% mutate(magnitude = abs(value))
rotation %>% ggplot(aes(x = dimension, y = variable , fill = magnitude)) + geom_tile()+xlab('') + ylab('') +
scale_fill_gradient2(low = "sky blue", high = "red") + ggtitle('Fig.3. Loadings of PCs') +
theme(axis.title = element_text(size=14), plot.title = element_text(size = 18,colour="darkblue", hjust = 0.5),
axis.text = element_text(size=14)) + geom_text(aes(label = round(magnitude,3)))
Sample values from the PCs are shown below to help us compare with the results from Python and Spark.
round(head(pcp$x,10),5)
As shown below, the PCs are orthogonal to each other (their correlations are zero).
round(cor(pcp$x), 5)
We can also plot the PCs and color them by the species. Let's take the first two PCs, which account for about 98% of the variance, and see the pattern. We can observe from the figure below that using the first two PCs, we are able to identify two distinct clusters: one for setosa and another one for versicolor and virginica.
df = as.data.frame(pcp$x)
df$Species = species
df %>% ggplot(aes(x = PC1, y = PC2, color = Species)) + geom_point() +
ggtitle('Fig.4. The first two PCs') + theme(axis.title = element_text(size=14),
plot.title = element_text(size = 18, hjust = 0.5),
axis.text = element_text(size=14))
We can also compute the same decomposition directly with svd() on the centered data and verify that it matches the prcomp results.
scaled_data = scale(iris, center = TRUE, scale = FALSE)
my_svd = svd(scaled_data)
all.equal(unname(pcp$rotation), my_svd$v)                    # loadings match V
all.equal(unname(pcp$x), unname(scaled_data %*% my_svd$v))   # PC scores match AV
all.equal(pcp$sdev^2 * (nrow(scaled_data) - 1), my_svd$d^2)  # variances relate to the squared singular values
In Python, scikit-learn's PCA centers the input data automatically, but unlike R's prcomp it will not standardize it: there is no parameter that controls scaling. So if the data needs to be standardized, we have to preprocess it before applying PCA. To keep every step explicit and comparable with the R workflow, we will center the data ourselves before fitting.
from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('ggplot')
from sklearn.preprocessing import scale
%matplotlib inline
df = pd.read_csv("https://raw.githubusercontent.com/venky14/Machine-Learning-with-Iris-Dataset/master/Iris.csv")
df.head()
df = df.drop(['Id','Species'], axis = 1)
Just to see how to do it in Python as well, let's create a correlation matrix heatmap. As mentioned above, strong correlation among the variables means the first few PCs can explain most of the variation in the data, so PCA is a good fit for this kind of dataset.
corr = df.corr()
plt.figure(figsize = (14, 8))
sns.heatmap(corr, cmap="RdBu",annot = True,
xticklabels=corr.columns.values,
yticklabels=corr.columns.values)
plt.title('Fig.5. Correlation matrix')
plt.xticks(size = 16)
plt.yticks(size = 16)
plt.show()
Now, let's center the data using the scale function. As noted above, since we are dealing with variables that have the same unit and similar scales, we do not need to standardize them.
scaled = scale(df, with_mean=True, with_std=False)
What does the data look like when the mean of each column is subtracted from each value of each variable?
scaled[:10,:]
Now, let's apply PCA using the centered data.
model = PCA()
model.fit(scaled)
transformed = model.transform(scaled)
As shown below, the variance explained values we got from R and Python are the same.
np.round(model.explained_variance_ratio_ * 100,4)
var = pd.DataFrame ({'variance':np.round(model.explained_variance_ratio_ * 100,4),
'PC':['p' + str(i) for i in range(1,5)]})
ax = var.plot(kind='bar', figsize=(10,7), color="coral", fontsize=13, legend = False)
ax.set_alpha(0.8)
ax.set_title("Fig. 6. Variance Explained", fontsize=18)
ax.set_xlabel("PC", fontsize=16)
ax.set_ylabel("Variance Explained (%)", fontsize=16)
rects = ax.patches
labels = list(var['variance'])
labels = [str(variance) +'%' for variance in labels]
for rect, label in zip(rects, labels):
height = rect.get_height()
ax.text(rect.get_x() + rect.get_width() / 2, height + 1, label,
ha ='center', va='bottom')
The loadings are shown below. The result is a NumPy array and, compared to R and Spark, the array is transposed. It is important to note how to access each PC's loadings.
loadings = np.round(model.components_,4)
loadings
To access the loadings of the first PC, we extract the first row and all the columns as shown below. Note: rows and columns of the loadings are switched between the results from R and Python.
loadings[0,]
A sample of the principal components is shown below. They are the same as the results from R and Spark. As expected, they are uncorrelated.
np.set_printoptions(suppress=True)
transformed = np.round(transformed, 5)
transformed[:10,:]
np.round(np.corrcoef(transformed, rowvar=False), 5)  # rowvar=False: each column is a variable and each row an observation
import findspark
findspark.init()
In Spark, the input data format for PCA is different from that of R and Python. We have to assemble the features into a single vector column using VectorAssembler.
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils
import numpy as np
from pyspark.ml.feature import StandardScaler
import pyspark.sql.functions as f
import pyspark.sql.types
import pandas as pd
from pyspark.sql import Row
spark = SparkSession.builder.getOrCreate()
sc = SparkContext.getOrCreate()
from pyspark.ml.feature import VectorAssembler
df = pd.read_csv("https://raw.githubusercontent.com/venky14/Machine-Learning-with-Iris-Dataset/master/Iris.csv")
df = df.drop(['Species'], axis = 1)
spark_df = spark.createDataFrame(df)
spark_df.show(10)
We can use the VectorAssembler transformer to combine a given list of columns into a single vector column.
cols = spark_df.drop('Id').columns
cols
assembler = VectorAssembler(inputCols=cols, outputCol = 'features')
output_dat = assembler.transform(spark_df).select('Id', 'features')
output_dat.show(5, truncate = False)
In Spark, unlike R, there is no parameter in the PCA transformer to center or standardize the input data, so we have to preprocess the data before applying PCA. Now, let's center the data with StandardScaler; we do not need to standardize it (withStd=False).
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=False, withMean=True)
# Compute summary statistics by fitting the StandardScaler
scalerModel = scaler.fit(output_dat)
# Center each feature by subtracting its column mean (no scaling, since withStd=False)
scaledData = scalerModel.transform(output_dat)
scaledData.select(['Id', 'scaledFeatures']).show(5, truncate = False) # sample centered data
pca = PCA(k=4, inputCol = scaler.getOutputCol(), outputCol="pcaFeatures")
model = pca.fit(scaledData)
transformed_feature = model.transform(scaledData)
np.round(100.00*model.explainedVariance.toArray(),4)
pcs = np.round(model.pc.toArray(),4)
pcs
Note: the loadings have the same structure as in R, whereas in Python they are transposed.
df_pc = pd.DataFrame(pcs, columns = ['PC1','PC2','PC3','PC4'], index = cols)
df_pc
df_pc['PC1']
The principal components are shown below. They are the same as the results from R and Python. Note: if you run PCA in Spark, the rows may not be in the same order as the results you get from R and Python on a single machine. This happens because, on a cluster, the results are collected from different worker machines and Spark does not guarantee row order.
transformed_feature.select('pcaFeatures').show(10, truncate = False)
We can round the PCs above to compare the results with those from R and Python.
transformed_feature.select('pcaFeatures').rdd \
    .map(lambda row: Row(pcaFeatures = str(list(np.round(row[0].toArray(), 4))))) \
    .toDF().show(truncate = False)
The results so far are the same across all three tools.
As mentioned above, when the units are different and the scales vary significantly, we have to standardize our data. For this example, let's use climate change data. The data is available here.
df = read_csv('https://courses.edx.org/asset-v1:[email protected]+block/climate_change.csv')
head(df)
Just as a side note, how has the global temperature been changing in the last 30+ years?
df %>% ggplot(aes(x = Year, y = Temp)) + geom_bar(stat = 'identity') +
ggtitle('Fig. 7. Global Temperature Anomaly') +
theme(axis.title = element_text(size=14), plot.title = element_text(size = 18, hjust = 0.5),
axis.text = element_text(size=14))
If we look at the correlation matrix below, we observe that some of the features are strongly correlated. If we want to build a model that predicts the global temperature anomaly, rather than using all the columns, it would make sense to first apply PCA and consider only the PCs that explain a large portion of the variability in the data (for example 99% of the variance; this depends on the application, and in some cases 90% or 95% of the variance could be enough).
correlations = cor(df %>% select(-Temp))
corrplot(correlations, method="number", diag=FALSE, title = 'Fig.8. Correlation', mar=c(0,0,1,0))
As shown below, the last four PCs explain less than 1% of the variance, so we can safely exclude them from our modeling efforts.
Now, since the variables have different units, I am also standardizing the data by setting scale. = TRUE.
pca1 = prcomp(df %>% select(-Temp), center = TRUE, scale. = TRUE)
scaled = scale(df %>% select(-Temp), center = TRUE, scale = TRUE)
my_svd = svd(scaled)
all.equal(unname(pca1$rotation), my_svd$v)               # loadings match V
all.equal(unname(pca1$x), unname(scaled %*% my_svd$v))   # PC scores match AV
variance = data_frame(PC = paste0('PC', seq(1,10)), variance = round(100.00*pca1$sdev^2/sum(pca1$sdev^2),4)) %>%
mutate(PC = factor(PC,levels = PC[order(variance,decreasing =F)]))
variance %>% ggplot(aes(x = PC, y = variance)) + geom_bar(stat = 'identity', fill = 'skyblue') + ggtitle('Fig.9. Variance Explained') +
theme(axis.title = element_text(size=14), plot.title = element_text(size = 16,colour="darkblue", hjust = 0.5),
axis.text = element_text(size=14)) + geom_text(aes(label = variance), hjust =0.5) + xlab('') +
ylab('Variance Explained (%)') + coord_flip()
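To make the earlier point concrete, here is a minimal sketch of picking the number of PCs from the cumulative explained variance and then using only those PCs to model Temp (the 99% cutoff and the regression are illustrative choices, not part of the analysis above):
cum_var = cumsum(pca1$sdev^2) / sum(pca1$sdev^2)
n_pcs = which(cum_var >= 0.99)[1]   # smallest number of PCs explaining at least 99% of the variance
n_pcs
pcr_data = data.frame(Temp = df$Temp, pca1$x[, 1:n_pcs])   # Temp plus the retained PCs
pcr_model = lm(Temp ~ ., data = pcr_data)                  # principal-component regression sketch
summary(pcr_model)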