Fisseha Berhane, PhD

Data Scientist



Machine Learning for Drug Adverse Event Discovery

We can use unsupervised machine learning to identify which drugs are associated with which adverse events. Specifically, machine learning can help us create clusters based on gender, age, outcome of the adverse event, route by which the drug was administered, purpose the drug was used for, body mass index, and so on. This can help us quickly discover hidden associations between drugs and adverse events.

Clustering is an unsupervised learning technique with wide applications. Common examples include market segmentation, social network analytics, and astronomical data analysis. Clustering groups data into sub-groups so that objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters.

In this post, we will see how we can use hierarchical clustering to identify drug adverse events.

You can read about hierarchical clustering on Wikipedia.

For clustering, each pattern is represented as a vector in multidimensional space and a distance measure is used to find the dissimilarity between the instances.
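
As a small illustration of the idea above, the sketch below computes Euclidean distances between three made-up observations using base R's `dist()`. The vectors `a`, `b`, and `c` are hypothetical and are not taken from the data used later in this post.

```r
# Three hypothetical observations, each a vector of (indicator, indicator, age)
x <- rbind(a = c(1, 0, 63),
           b = c(1, 0, 66),
           c = c(0, 1, 5))

# Pairwise Euclidean distances between the rows
d <- dist(x, method = "euclidean")

# View the distances as a full symmetric matrix
d_mat <- as.matrix(d)
d_mat
```

Rows `a` and `b` differ only in the third coordinate, so they are close; row `c` is far from both. Hierarchical clustering works from exactly this kind of distance table.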

Load Packages

In [8]:
library(dplyr)
library(caret)
library(dendextend)
library(stringi)

Data

Let's create fake drug adverse event data where we can visually identify the clusters and see if our machine learning algorithm can identify the clusters. If we have millions of rows of adverse event data, clustering can help us to summarize the data and get insights quickly.

Let's assume a drug AAA results in adverse events shown below. We will see in which group (cluster) the drug results in what kind of reactions (adverse events).

In the table shown below, I have created four clusters:

  • Route=ORAL, Age=60s, Sex=M, Outc_cod=OT, Indi_pt=RHEUMATOID ARTHRITIS and Pt=VASCULITIC RASH + some noise
  • Route=TOPICAL, Age=late teens, Sex=F, Outc_cod=HO, Indi_pt=URINARY TRACT INFECTION and Pt=VOMITING + some noise
  • Route=INTRAVENOUS, Age=about 5, Sex=F, Outc_cod=LT, Indi_pt=TONSILLITIS and Pt=VOMITING + some noise
  • Route=OPHTHALMIC, Age=mid 40s, Sex=F, Outc_cod=DE, Indi_pt=Senile osteoporosis and Pt=Sepsis + some noise
In [9]:
mydata=read.csv("my_data.csv",stringsAsFactors = F)
names(mydata)=stri_trans_totitle(names(mydata))
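
For readers who do not have my_data.csv, data of a similar shape can be simulated. The sketch below is only one possible way to do that: the helper `make_cluster()`, the seed, the age spread (sd = 3), and the 10% noise rate are all assumptions made for illustration, so the result will not reproduce the exact table shown in this post.

```r
set.seed(42)  # arbitrary seed, used only for reproducibility

# Hypothetical helper: n rows built around one "typical" profile
make_cluster <- function(n, route, age_mean, sex, outc, indi, pt) {
  data.frame(
    Route    = rep(route, n),
    Age      = pmax(0, round(rnorm(n, mean = age_mean, sd = 3))),
    Sex      = rep(sex, n),
    Outc_cod = rep(outc, n),
    Indi_pt  = rep(indi, n),
    Pt       = rep(pt, n),
    stringsAsFactors = FALSE
  )
}

# Four groups with the same sizes and profiles as described above
fake <- rbind(
  make_cluster(20, "ORAL",        62, "M", "OT", "RHEUMATOID ARTHRITIS",    "VASCULITIC RASH"),
  make_cluster(13, "TOPICAL",     17, "F", "HO", "URINARY TRACT INFECTION", "VOMITING"),
  make_cluster(15, "INTRAVENOUS",  6, "F", "LT", "TONSILLITIS",             "VOMITING"),
  make_cluster(24, "OPHTHALMIC",  45, "F", "DE", "Senile osteoporosis",     "Sepsis")
)

# Add noise: flip Sex in roughly 10% of the rows (assumed noise rate)
flip <- sample(nrow(fake), size = round(0.1 * nrow(fake)))
fake$Sex[flip] <- ifelse(fake$Sex[flip] == "M", "F", "M")

head(fake)
```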
In [64]:
mydata
Out[64]:
    Route        Age  Sex  Outc_cod  Indi_pt                  Pt
1   ORAL         63   M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH
2   ORAL         66   F    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH
3   ORAL         66   M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH
4   ORAL         57   M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH
5   ORAL         66   M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH
6   ORAL         66   M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH
7   ORAL         64   M    HO        RHEUMATOID ARTHRITIS     VASCULITIC RASH
8   ORAL         56   M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH
9   ORAL         66   M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH
10  ORAL         66   M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH
11  ORAL         52   M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH
12  ORAL         66   F    LT        RHEUMATOID ARTHRITIS     VASCULITIC RASH
13  ORAL         59   M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH
14  ORAL         61   M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH
15  ORAL         66   M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH
16  ORAL         66   M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH
17  ORAL         48   M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH
18  ORAL         60   M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH
19  ORAL         62   M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH
20  ORAL         60   M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH
21  ORAL         18   F    HO        URINARY TRACT INFECTION  VOMITING
22  TOPICAL      15   F    OT        URINARY TRACT INFECTION  VOMITING
23  TOPICAL      14   F    HO        URINARY TRACT INFECTION  VOMITING
24  TOPICAL      17   F    HO        URINARY TRACT INFECTION  VOMITING
25  TOPICAL      17   M    HO        URINARY TRACT INFECTION  VOMITING
26  TOPICAL      17   F    LT        URINARY TRACT INFECTION  VOMITING
27  TOPICAL      18   F    HO        URINARY TRACT INFECTION  VOMITING
28  TOPICAL      17   F    HO        URINARY TRACT INFECTION  VOMITING
29  TOPICAL      24   M    HO        URINARY TRACT INFECTION  VOMITING
30  TOPICAL      17   F    HO        URINARY TRACT INFECTION  VOMITING
31  TOPICAL      20   F    OT        URINARY TRACT INFECTION  VOMITING
32  TOPICAL      17   F    HO        URINARY TRACT INFECTION  VOMITING
33  TOPICAL      17   F    HO        URINARY TRACT INFECTION  VOMITING
34  INTRAVENOUS  5    F    LT        TONSILLITIS              VOMITING
35  INTRAVENOUS  6    M    LT        TONSILLITIS              VOMITING
36  INTRAVENOUS  7    F    LT        TONSILLITIS              VOMITING
37  INTRAVENOUS  8    F    HO        TONSILLITIS              VOMITING
38  INTRAVENOUS  5    F    LT        TONSILLITIS              VOMITING
39  INTRAVENOUS  4    M    LT        TONSILLITIS              VOMITING
40  INTRAVENOUS  6    F    LT        TONSILLITIS              VOMITING
41  INTRAVENOUS  7    F    LT        TONSILLITIS              VOMITING
42  INTRAVENOUS  8    F    OT        TONSILLITIS              VOMITING
43  INTRAVENOUS  5    F    LT        TONSILLITIS              VOMITING
44  INTRAVENOUS  4    M    LT        TONSILLITIS              VOMITING
45  INTRAVENOUS  7    F    OT        TONSILLITIS              VOMITING
46  INTRAVENOUS  5    F    LT        TONSILLITIS              VOMITING
47  INTRAVENOUS  6    F    LT        TONSILLITIS              VOMITING
48  INTRAVENOUS  4    F    LT        TONSILLITIS              VOMITING
49  OPHTHALMIC   45   F    DE        Senile osteoporosis      Sepsis
50  OPHTHALMIC   43   F    DE        Senile osteoporosis      Sepsis
51  OPHTHALMIC   44   F    LT        Senile osteoporosis      Sepsis
52  OPHTHALMIC   42   F    DE        Senile osteoporosis      Sepsis
53  ORAL         40   F    DE        Senile osteoporosis      Sepsis
54  OPHTHALMIC   45   F    DE        Senile osteoporosis      Sepsis
55  OPHTHALMIC   48   F    DE        Senile osteoporosis      Sepsis
56  OPHTHALMIC   40   F    HO        Senile osteoporosis      Sepsis
57  OPHTHALMIC   42   F    DE        Senile osteoporosis      Sepsis
58  ORAL         45   F    DE        Senile osteoporosis      Sepsis
59  OPHTHALMIC   44   F    DE        Senile osteoporosis      Sepsis
60  OPHTHALMIC   43   F    DE        Senile osteoporosis      Sepsis
61  OPHTHALMIC   45   F    OT        Senile osteoporosis      Sepsis
62  OPHTHALMIC   46   F    DE        Senile osteoporosis      Sepsis
63  OPHTHALMIC   47   F    DE        Senile osteoporosis      Sepsis
64  OPHTHALMIC   45   F    DE        Senile osteoporosis      Sepsis
65  OPHTHALMIC   48   F    DE        Senile osteoporosis      Sepsis
66  OPHTHALMIC   49   F    DE        Senile osteoporosis      Sepsis
67  OPHTHALMIC   45   F    DE        Senile osteoporosis      Sepsis
68  OPHTHALMIC   44   F    DE        Senile osteoporosis      Sepsis
69  OPHTHALMIC   43   F    DE        Senile osteoporosis      Sepsis
70  OPHTHALMIC   45   F    OT        Senile osteoporosis      Sepsis
71  OPHTHALMIC   46   F    DE        Senile osteoporosis      Sepsis
72  OPHTHALMIC   47   F    DE        Senile osteoporosis      Sepsis


Hierarchical Clustering

To perform hierarchical clustering, we need to convert the text to numeric values so that we can calculate distances.

  • Since age is already numeric, we set it aside and convert the remaining character variables into numeric indicator columns, one per category.
In [10]:
age=mydata$Age
mydata=select(mydata,-Age)

Create a Matrix

In [11]:
my_matrix = as.data.frame(do.call(cbind, lapply(mydata, function(x) table(1:nrow(mydata), x))))
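
To see what the line above does, here is a minimal sketch on a made-up three-row data frame (`toy` is hypothetical and not part of the post's data). Cross-tabulating the row index against a character column gives one 0/1 indicator column per category, and binding these together gives the numeric matrix.

```r
# Hypothetical toy data with two character variables
toy <- data.frame(Route = c("ORAL", "TOPICAL", "ORAL"),
                  Sex   = c("M", "F", "F"),
                  stringsAsFactors = FALSE)

# For a single column: one row per observation, one column per category
one_col <- table(1:nrow(toy), toy$Route)

# Same construction as in the post, applied to every column at once
indicators <- as.data.frame(
  do.call(cbind, lapply(toy, function(x) table(1:nrow(toy), x)))
)
indicators
```

Each row of `indicators` contains exactly one 1 per original variable, so the row sums equal the number of character variables.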
In [13]:
my_matrix$Age=age
head(my_matrix)
Out[13]:
   INTRAVENOUS  OPHTHALMIC  ORAL  TOPICAL  F  M  DE  HO  LT  OT  RHEUMATOID ARTHRITIS  Senile osteoporosis  TONSILLITIS  URINARY TRACT INFECTION  Sepsis  VASCULITIC RASH  VOMITING  Age
1  0            0           1     0        0  1  0   0   0   1   1                     0                    0            0                        0       1                0         63
2  0            0           1     0        1  0  0   0   0   1   1                     0                    0            0                        0       1                0         66
3  0            0           1     0        0  1  0   0   0   1   1                     0                    0            0                        0       1                0         66
4  0            0           1     0        0  1  0   0   0   1   1                     0                    0            0                        0       1                0         57
5  0            0           1     0        0  1  0   0   0   1   1                     0                    0            0                        0       1                0         66
6  0            0           1     0        0  1  0   0   0   1   1                     0                    0            0                        0       1                0         66
  • We need to normalize our variables using the caret package.
In [14]:
preproc = preProcess(my_matrix)
my_matrixNorm = as.matrix(predict(preproc, my_matrix))
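
By default, caret's `preProcess()` centers and scales each column. The sketch below shows the same transformation on a single hypothetical vector of ages using only base R, so the arithmetic is visible. Without this step, Age (tens of years) would dominate the 0/1 indicator columns when distances are computed.

```r
# Hypothetical ages, for illustration only
ages <- c(63, 66, 5, 45)

# Center and scale by hand: subtract the mean, divide by the standard deviation
z_manual <- (ages - mean(ages)) / sd(ages)

# Base R's scale() performs the same centering and scaling
z_scale <- as.vector(scale(ages))

round(z_manual, 2)
```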
Next, let's calculate the distances, apply hierarchical clustering, and plot the dendrogram.
In [15]:
distances = dist(my_matrixNorm, method = "euclidean")

clusterdrug = hclust(distances, method = "ward.D") 

# Plot the dendrogram without leaf labels (cex was previously passed three times)
plot(clusterdrug, labels = FALSE, xlab = "", sub = "", cex = 0.5)

From the dendrogram shown above, we see that four distinct clusters can be created from the fake data we created.

In [16]:
dend <- as.dendrogram(clusterdrug) 

# Color the branches based on the clusters:
dend <- color_branches(dend, k=4)

# We hang the dendrogram a bit:
dend <- hang.dendrogram(dend,hang_height=0.1)
# reduce the size of the labels:
dend <- set(dend, "labels_cex", 0.5)

plot(dend)

Now, let's create cluster groups with four clusters.

In [17]:
clusterGroups = cutree(clusterdrug, k = 4)
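
To make the role of `cutree()` concrete, here is a minimal sketch on four made-up 2-D points (the matrix `toy` is hypothetical). Two points sit near the origin and two sit near (10, 10), so cutting the tree at k = 2 should separate them.

```r
# Four hypothetical points forming two obvious groups
toy <- rbind(c(0, 0), c(0, 1), c(10, 10), c(10, 11))

# Same distance and linkage choices as in the post
hc <- hclust(dist(toy, method = "euclidean"), method = "ward.D")

# Cut the tree into two groups; the result is one cluster label per row
groups <- cutree(hc, k = 2)
groups
```

`cutree()` returns an integer vector with one entry per observation, which is why it can be attached to the data frame as a new column.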
  • add the clusterGroups column
In [19]:
mydata$cluster=clusterGroups
mydata
Out[19]:
    Route        Sex  Outc_cod  Indi_pt                  Pt               cluster
1   ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1
2   ORAL         F    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1
3   ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1
4   ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1
5   ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1
6   ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1
7   ORAL         M    HO        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1
8   ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1
9   ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1
10  ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1
11  ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1
12  ORAL         F    LT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1
13  ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1
14  ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1
15  ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1
16  ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1
17  ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1
18  ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1
19  ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1
20  ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1
21  ORAL         F    HO        URINARY TRACT INFECTION  VOMITING         2
22  TOPICAL      F    OT        URINARY TRACT INFECTION  VOMITING         2
23  TOPICAL      F    HO        URINARY TRACT INFECTION  VOMITING         2
24  TOPICAL      F    HO        URINARY TRACT INFECTION  VOMITING         2
25  TOPICAL      M    HO        URINARY TRACT INFECTION  VOMITING         2
26  TOPICAL      F    LT        URINARY TRACT INFECTION  VOMITING         2
27  TOPICAL      F    HO        URINARY TRACT INFECTION  VOMITING         2
28  TOPICAL      F    HO        URINARY TRACT INFECTION  VOMITING         2
29  TOPICAL      M    HO        URINARY TRACT INFECTION  VOMITING         2
30  TOPICAL      F    HO        URINARY TRACT INFECTION  VOMITING         2
31  TOPICAL      F    OT        URINARY TRACT INFECTION  VOMITING         2
32  TOPICAL      F    HO        URINARY TRACT INFECTION  VOMITING         2
33  TOPICAL      F    HO        URINARY TRACT INFECTION  VOMITING         2
34  INTRAVENOUS  F    LT        TONSILLITIS              VOMITING         3
35  INTRAVENOUS  M    LT        TONSILLITIS              VOMITING         3
36  INTRAVENOUS  F    LT        TONSILLITIS              VOMITING         3
37  INTRAVENOUS  F    HO        TONSILLITIS              VOMITING         3
38  INTRAVENOUS  F    LT        TONSILLITIS              VOMITING         3
39  INTRAVENOUS  M    LT        TONSILLITIS              VOMITING         3
40  INTRAVENOUS  F    LT        TONSILLITIS              VOMITING         3
41  INTRAVENOUS  F    LT        TONSILLITIS              VOMITING         3
42  INTRAVENOUS  F    OT        TONSILLITIS              VOMITING         3
43  INTRAVENOUS  F    LT        TONSILLITIS              VOMITING         3
44  INTRAVENOUS  M    LT        TONSILLITIS              VOMITING         3
45  INTRAVENOUS  F    OT        TONSILLITIS              VOMITING         3
46  INTRAVENOUS  F    LT        TONSILLITIS              VOMITING         3
47  INTRAVENOUS  F    LT        TONSILLITIS              VOMITING         3
48  INTRAVENOUS  F    LT        TONSILLITIS              VOMITING         3
49  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4
50  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4
51  OPHTHALMIC   F    LT        Senile osteoporosis      Sepsis           4
52  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4
53  ORAL         F    DE        Senile osteoporosis      Sepsis           4
54  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4
55  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4
56  OPHTHALMIC   F    HO        Senile osteoporosis      Sepsis           4
57  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4
58  ORAL         F    DE        Senile osteoporosis      Sepsis           4
59  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4
60  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4
61  OPHTHALMIC   F    OT        Senile osteoporosis      Sepsis           4
62  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4
63  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4
64  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4
65  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4
66  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4
67  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4
68  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4
69  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4
70  OPHTHALMIC   F    OT        Senile osteoporosis      Sepsis           4
71  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4
72  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4

Number of Observations in Each Cluster

In [20]:
# Count how many observations were assigned to each cluster
counts = table(clusterGroups)

observationsH = data.frame(cluster = as.integer(names(counts)),
                           Number_of_observations = as.vector(counts))
observationsH
Out[20]:
   cluster  Number_of_observations
1  1        20
2  2        13
3  3        15
4  4        24

Try other clustering techniques

In [23]:
distances = dist(my_matrixNorm, method = "canberra")

clusterdrug = hclust(distances, method = "mcquitty") 

# Plot the dendrogram without leaf labels (cex was previously passed three times)
plot(clusterdrug, labels = FALSE, xlab = "", sub = "", cex = 0.5)
In [24]:
dend <- as.dendrogram(clusterdrug) 

# Color the branches based on the clusters:
dend <- color_branches(dend, k=4)

# We hang the dendrogram a bit:
dend <- hang.dendrogram(dend,hang_height=0.1)
# reduce the size of the labels:
dend <- set(dend, "labels_cex", 0.5)

plot(dend)
In [26]:
clusterGroups2 = cutree(clusterdrug, k = 4)
mydata$cluster2=clusterGroups2
mydata
Out[26]:
    Route        Sex  Outc_cod  Indi_pt                  Pt               cluster  cluster2
1   ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1        1
2   ORAL         F    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1        1
3   ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1        1
4   ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1        1
5   ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1        1
6   ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1        1
7   ORAL         M    HO        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1        1
8   ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1        1
9   ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1        1
10  ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1        1
11  ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1        1
12  ORAL         F    LT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1        1
13  ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1        1
14  ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1        1
15  ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1        1
16  ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1        1
17  ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1        1
18  ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1        1
19  ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1        1
20  ORAL         M    OT        RHEUMATOID ARTHRITIS     VASCULITIC RASH  1        1
21  ORAL         F    HO        URINARY TRACT INFECTION  VOMITING         2        2
22  TOPICAL      F    OT        URINARY TRACT INFECTION  VOMITING         2        2
23  TOPICAL      F    HO        URINARY TRACT INFECTION  VOMITING         2        2
24  TOPICAL      F    HO        URINARY TRACT INFECTION  VOMITING         2        2
25  TOPICAL      M    HO        URINARY TRACT INFECTION  VOMITING         2        2
26  TOPICAL      F    LT        URINARY TRACT INFECTION  VOMITING         2        2
27  TOPICAL      F    HO        URINARY TRACT INFECTION  VOMITING         2        2
28  TOPICAL      F    HO        URINARY TRACT INFECTION  VOMITING         2        2
29  TOPICAL      M    HO        URINARY TRACT INFECTION  VOMITING         2        2
30  TOPICAL      F    HO        URINARY TRACT INFECTION  VOMITING         2        2
31  TOPICAL      F    OT        URINARY TRACT INFECTION  VOMITING         2        2
32  TOPICAL      F    HO        URINARY TRACT INFECTION  VOMITING         2        2
33  TOPICAL      F    HO        URINARY TRACT INFECTION  VOMITING         2        2
34  INTRAVENOUS  F    LT        TONSILLITIS              VOMITING         3        3
35  INTRAVENOUS  M    LT        TONSILLITIS              VOMITING         3        3
36  INTRAVENOUS  F    LT        TONSILLITIS              VOMITING         3        3
37  INTRAVENOUS  F    HO        TONSILLITIS              VOMITING         3        3
38  INTRAVENOUS  F    LT        TONSILLITIS              VOMITING         3        3
39  INTRAVENOUS  M    LT        TONSILLITIS              VOMITING         3        3
40  INTRAVENOUS  F    LT        TONSILLITIS              VOMITING         3        3
41  INTRAVENOUS  F    LT        TONSILLITIS              VOMITING         3        3
42  INTRAVENOUS  F    OT        TONSILLITIS              VOMITING         3        3
43  INTRAVENOUS  F    LT        TONSILLITIS              VOMITING         3        3
44  INTRAVENOUS  M    LT        TONSILLITIS              VOMITING         3        3
45  INTRAVENOUS  F    OT        TONSILLITIS              VOMITING         3        3
46  INTRAVENOUS  F    LT        TONSILLITIS              VOMITING         3        3
47  INTRAVENOUS  F    LT        TONSILLITIS              VOMITING         3        3
48  INTRAVENOUS  F    LT        TONSILLITIS              VOMITING         3        3
49  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4        4
50  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4        4
51  OPHTHALMIC   F    LT        Senile osteoporosis      Sepsis           4        4
52  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4        4
53  ORAL         F    DE        Senile osteoporosis      Sepsis           4        4
54  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4        4
55  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4        4
56  OPHTHALMIC   F    HO        Senile osteoporosis      Sepsis           4        4
57  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4        4
58  ORAL         F    DE        Senile osteoporosis      Sepsis           4        4
59  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4        4
60  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4        4
61  OPHTHALMIC   F    OT        Senile osteoporosis      Sepsis           4        4
62  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4        4
63  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4        4
64  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4        4
65  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4        4
66  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4        4
67  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4        4
68  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4        4
69  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4        4
70  OPHTHALMIC   F    OT        Senile osteoporosis      Sepsis           4        4
71  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4        4
72  OPHTHALMIC   F    DE        Senile osteoporosis      Sepsis           4        4
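
Scanning two label columns by eye does not scale beyond a small table. One way to compare two clusterings is to cross-tabulate the labels. The sketch below uses two short hypothetical label vectors (`cluster_a` and `cluster_b`, not the post's results) whose groups are the same but numbered differently, which is a common situation because cluster numbers are arbitrary.

```r
# Two hypothetical labelings of six observations
cluster_a <- c(1, 1, 2, 2, 3, 3)
cluster_b <- c(2, 2, 1, 1, 3, 3)

# Cross-tabulate: rows are labels from the first method, columns from the second
agreement <- table(cluster_a, cluster_b)
agreement

# The partitions match (up to relabeling) when every row and every column
# of the table has exactly one non-zero cell
same_partition <- all(rowSums(agreement > 0) == 1) &&
                  all(colSums(agreement > 0) == 1)
same_partition
```

With the data in this post, the same check could be run as `table(mydata$cluster, mydata$cluster2)`.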

What is the most common observation in each cluster?

Calculate column average for each cluster

In [21]:
z=do.call(cbind,lapply(1:4, function(i) round(colMeans(subset(my_matrix,clusterGroups==i)),2)))
colnames(z)=paste0('cluster',seq(1,4))
z
Out[21]:
                         cluster1  cluster2  cluster3  cluster4
INTRAVENOUS              0         0         1         0
OPHTHALMIC               0.00      0.00      0.00      0.92
ORAL                     1.00      0.08      0.00      0.08
TOPICAL                  0.00      0.92      0.00      0.00
F                        0.10      0.85      0.80      1.00
M                        0.90      0.15      0.20      0.00
DE                       0.00      0.00      0.00      0.83
HO                       0.05      0.77      0.07      0.04
LT                       0.05      0.08      0.80      0.04
OT                       0.90      0.15      0.13      0.08
RHEUMATOID ARTHRITIS     1         0         0         0
Senile osteoporosis      0         0         0         1
TONSILLITIS              0         0         1         0
URINARY TRACT INFECTION  0         1         0         0
Sepsis                   0         0         0         1
VASCULITIC RASH          1         0         0         0
VOMITING                 0         1         1         0
Age                      61.80     17.54     5.80      44.62
In [22]:
Age=z[nrow(z),]
z=z[1:(nrow(z)-1),]
In [75]:
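# Use only the categorical attributes (skip the cluster columns added to mydata above)
cat_cols = c("Route", "Sex", "Outc_cod", "Indi_pt", "Pt")

my_result = matrix("", ncol = 4, nrow = length(cat_cols))
for (i in seq(1, 4)) {
    for (j in seq_along(cat_cols)) {
        # Categories of this attribute; these are also row names of z
        q = unique(mydata[[cat_cols[j]]])
        # Keep the category with the highest proportion in cluster i
        my_result[j, i] = names(sort(z[q, i], decreasing = TRUE)[1])
    }
}

colnames(my_result) = paste0('cluster', seq(1, 4))
# Put the mean age of each cluster in the first row
my_result = rbind(Age, my_result)
my_result <- cbind(Attribute = c("Age", "Route", "Sex", "Outcome Code", "Indication preferred term", "Adverse event"), my_result)
rownames(my_result) <- NULL

my_result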
Out[75]:
Attribute                  cluster1              cluster2                 cluster3     cluster4
Age                        61.8                  17.54                    5.8          44.62
Route                      ORAL                  TOPICAL                  INTRAVENOUS  OPHTHALMIC
Sex                        M                     F                        F            F
Outcome Code               OT                    HO                       LT           DE
Indication preferred term  RHEUMATOID ARTHRITIS  URINARY TRACT INFECTION  TONSILLITIS  Senile osteoporosis
Adverse event              VASCULITIC RASH       VOMITING                 VOMITING     Sepsis
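
The "most common value per cluster" idea can also be expressed directly with `table()` and `tapply()`. The sketch below is an alternative formulation on a small hypothetical data frame, not the code used to produce the table above; `most_common()` is a helper name introduced here for illustration.

```r
# Hypothetical data: one categorical attribute and a cluster label
toy <- data.frame(Route   = c("ORAL", "ORAL", "TOPICAL", "TOPICAL", "TOPICAL", "ORAL"),
                  cluster = c(1, 1, 1, 2, 2, 2),
                  stringsAsFactors = FALSE)

# Hypothetical helper: the most frequent value in a vector
# (which.max returns the first maximum, so ties go to the alphabetically first level)
most_common <- function(x) names(which.max(table(x)))

# Apply the helper within each cluster
modes <- tapply(toy$Route, toy$cluster, most_common)
modes
```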

We see that hierarchical clustering has recovered the four clusters. In cluster 1, for males in their 60s, the drug results in vasculitic rash when taken for rheumatoid arthritis. We can interpret the other clusters similarly. Remember, this is not real data. It is fake data made to show how clustering can be applied to the study of drug adverse events.

From this short post, we see that clustering can be used for knowledge discovery in drug adverse event reactions. Especially when the data has millions of observations and we cannot get any insight visually, clustering becomes handy for summarizing our data, for getting statistical insights, and for discovering new knowledge.

Although both configurations (Euclidean distance with Ward linkage, and Canberra distance with McQuitty linkage) give the same cluster assignments here, one could be more appropriate than the other depending on the application. We also see that the dendrograms are somewhat different, which could lead us to decide on a different number of clusters.
