We can use unsupervised machine learning to identify which drugs are associated with which adverse events. Specifically, machine learning can help us create clusters based on gender, age, outcome of the adverse event, route of administration, the purpose the drug was used for, body mass index, and so on. These clusters can help us quickly discover hidden associations between drugs and adverse events.
Clustering is an unsupervised learning technique with wide applications; common examples include market segmentation, social network analysis, and astronomical data analysis. Clustering groups data into sub-groups so that objects within a cluster are highly similar to one another but dissimilar to objects in other clusters.
In this post, we will see how we can use hierarchical clustering to identify drug adverse events.
You can read more about hierarchical clustering on Wikipedia.
For clustering, each observation is represented as a vector in a multidimensional space, and a distance measure is used to quantify the dissimilarity between instances.
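As a minimal illustration (using two made-up numeric vectors, not the real data), this is how a Euclidean distance between two observations is computed in R:
x = c(1, 0, 0, 62)   # hypothetical encoded observation
y = c(0, 1, 0, 35)   # hypothetical encoded observation
sqrt(sum((x - y)^2))                      # Euclidean distance by hand
dist(rbind(x, y), method = "euclidean")   # the same distance via dist()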
library(dplyr)       # data manipulation
library(caret)       # preProcess() for centering and scaling
library(dendextend)  # coloring and formatting dendrograms
library(stringi)     # string utilities (title-casing column names)
Let's create fake drug adverse event data in which we can visually identify the clusters, and see whether our machine learning algorithm can find them. If we had millions of rows of adverse event data, clustering could help us summarize the data and get insights quickly.
Let's assume drug AAA results in the adverse events shown below. We will see in which group (cluster) the drug results in which kinds of reactions (adverse events).
In the table shown below, I have created four clusters:
mydata = read.csv("my_data.csv", stringsAsFactors = FALSE)
names(mydata) = stri_trans_totitle(names(mydata))  # title-case the column names
mydata
To perform hierarchical clustering, we need to convert the text columns to numeric values so that we can calculate distances.
age = mydata$Age              # set the numeric Age column aside
mydata = select(mydata, -Age)
# One-hot encode each categorical column: table(1:n, x) builds an
# indicator matrix with one column per level of x
my_matrix = as.data.frame(do.call(cbind, lapply(mydata, function(x) table(1:nrow(mydata), x))))
my_matrix$Age = age           # add Age back as a numeric column
head(my_matrix)
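To see what the table() trick above does, here is a toy categorical column (hypothetical values); each level becomes an indicator column:
toy = c("Oral", "Intravenous", "Oral")
table(1:3, toy)
# returns an indicator matrix: row 1 has a 1 under Oral,
# row 2 has a 1 under Intravenous, and row 3 has a 1 under Oral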
# Center and scale all columns so that no single variable dominates the distances
preproc = preProcess(my_matrix)
my_matrixNorm = as.matrix(predict(preproc, my_matrix))
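For reference, caret's preProcess() defaults to centering and scaling, which should match base R's scale(); a quick sanity check under that assumption:
all.equal(as.numeric(my_matrixNorm), as.numeric(scale(my_matrix)))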
distances = dist(my_matrixNorm, method = "euclidean")  # pairwise Euclidean distances
clusterdrug = hclust(distances, method = "ward.D")     # Ward's minimum-variance linkage
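Before cutting the tree, one heuristic for choosing the number of clusters is to inspect the last few merge heights; a large jump suggests a natural separation (a rule of thumb, not a guarantee, and assuming the data has more than ten rows):
tail(clusterdrug$height, 10)   # heights of the final 10 merges
barplot(tail(clusterdrug$height, 10), names.arg = 10:1,
        xlab = "Clusters remaining", ylab = "Merge height")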
plot(clusterdrug, labels = FALSE, cex = 0.5, xlab = "", sub = "")
From the dendrogram shown above, we see that four distinct clusters can be created from the fake data we created.
dend <- as.dendrogram(clusterdrug)
# Color the branches based on the clusters:
dend <- color_branches(dend, k = 4)
# Hang the dendrogram a bit:
dend <- hang.dendrogram(dend, hang_height = 0.1)
# Reduce the size of the labels:
dend <- set(dend, "labels_cex", 0.5)
plot(dend)
clusterGroups = cutree(clusterdrug, k = 4)  # assign each observation to one of 4 clusters
mydata$cluster = clusterGroups
mydata
# Count the number of observations in each cluster
observationsH = c()
for (i in seq(1, 4)) {
  observationsH = c(observationsH, sum(clusterGroups == i))
}
observationsH = as.data.frame(list(cluster = 1:4, Number_of_observations = observationsH))
observationsH
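For reference, the same counts are available in one line:
table(clusterGroups)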
Let's repeat the clustering with a different distance measure (Canberra) and a different linkage method (McQuitty, i.e. WPGMA) to see how sensitive the results are to these choices.
distances = dist(my_matrixNorm, method = "canberra")
clusterdrug = hclust(distances, method = "mcquitty")
plot(clusterdrug, labels = FALSE, cex = 0.5, xlab = "", sub = "")
dend <- as.dendrogram(clusterdrug)
# Color the branches based on the clusters:
dend <- color_branches(dend, k = 4)
# Hang the dendrogram a bit:
dend <- hang.dendrogram(dend, hang_height = 0.1)
# Reduce the size of the labels:
dend <- set(dend, "labels_cex", 0.5)
plot(dend)
clusterGroups2 = cutree(clusterdrug, k = 4)
mydata$cluster2 = clusterGroups2
mydata
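To check whether the two runs recover the same grouping (cluster labels may be permuted between runs), we can cross-tabulate the two assignments; a single nonzero cell in each row and column means the partitions match:
table(clusterGroups, clusterGroups2)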
# Per-cluster column means of the one-hot matrix: for indicator columns these are
# the proportion of observations in that cluster taking each attribute value
z = do.call(cbind, lapply(1:4, function(i) round(colMeans(subset(my_matrix, clusterGroups == i)), 2)))
colnames(z) = paste0('cluster', seq(1, 4))
z
Age = z[nrow(z), ]         # mean age per cluster (the last row)
z = z[1:(nrow(z) - 1), ]   # keep only the indicator proportions
# For each cluster, report the most common value of each original attribute.
# Exclude the cluster assignment columns we added to mydata above.
mydata2 = select(mydata, -cluster, -cluster2)
my_result = matrix(0, ncol = 4, nrow = ncol(mydata2))
for (i in seq(1, 4)) {
  for (j in seq(1, ncol(mydata2))) {
    q = names(mydata2)[j]
    q = as.vector(as.matrix(unique(mydata2[q])))
    my_result[j, i] = names(sort(z[q, i], decreasing = TRUE)[1])  # most frequent level
  }
}
colnames(my_result) = paste0('Cluster', seq(1, 4))
rownames(my_result) = names(mydata2)
my_result = rbind(Age, my_result)
my_result <- cbind(Attribute = c("Age", "Route", "Sex", "Outcome Code", "Indication preferred term", "Adverse event"), my_result)
rownames(my_result) <- NULL
my_result
We see that we have created the clusters using hierarchical clustering. In cluster 1, for males in their 60s, the drug results in a vasculitic rash when taken for rheumatoid arthritis. The other clusters can be interpreted similarly. Remember, this is not real data; it is fake data made up to demonstrate the application of clustering to drug adverse event studies.
From this short post, we see that clustering can be used for knowledge discovery in drug adverse event data. Especially when the data has millions of observations and we cannot get any insight visually, clustering becomes handy for summarizing the data, extracting statistical insights, and discovering new knowledge.
Though both distance/linkage combinations give the same cluster assignments here, it is important to note that one could be more appropriate than the other depending on the application. We also see that the two dendrograms differ somewhat, which could lead us to choose a different number of clusters.
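As a sketch of how to compare the two hierarchies more rigorously, dendextend offers a cophenetic correlation measure and a tanglegram plot (both trees are rebuilt here for clarity; dend_ward and dend_mcq are names introduced just for this example):
dend_ward = as.dendrogram(hclust(dist(my_matrixNorm, method = "euclidean"), method = "ward.D"))
dend_mcq = as.dendrogram(hclust(dist(my_matrixNorm, method = "canberra"), method = "mcquitty"))
cor_cophenetic(dend_ward, dend_mcq)   # closer to 1 means more similar hierarchies
tanglegram(dend_ward, dend_mcq)       # side-by-side visual comparison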