The dplyr package, which is one of my favorite R packages, works with in-memory data and with data stored in databases. In this post, I will share my experience using dplyr to work with databases.
Using dplyr with databases has a huge advantage when our data is big: loading all of it into R may be impossible or may slow down our analytics. If our data resides in a database, rather than loading the whole dataset into R, we can work with just the subsets or aggregates we are interested in. Further, if we have many data files, putting our data in a database, rather than in CSV or another format, offers better security and is easier to manage.
dplyr is a powerful package for data manipulation, data exploration, and feature engineering in R. If you do not know SQL, it lets you work with databases entirely from within R. Further, dplyr code is easy to write and read. dplyr treats database tables as data frames and uses lazy evaluation: it delays the actual operation until necessary and loads data into R from the database only when we need it. For anyone who knows Spark, the processes, and even the functions, will feel familiar.
dplyr supports several databases, including SQLite, MySQL, and PostgreSQL. In this post, we will see how to work with a SQLite database. You can get more information from the dplyr database vignette here.
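For reference, connecting to the other back ends looks very similar. Here is a minimal sketch (not run in this post); the database name, host, user, and password values are placeholders, not real credentials:
# Not run: sketches of connecting to other back ends with dplyr.
# The dbname, host, user, and password values below are placeholders.
# my_db <- src_mysql(dbname = "mydb", host = "localhost", user = "me", password = "pass")
# my_db <- src_postgres(dbname = "mydb", host = "localhost", user = "me", password = "pass")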
When people take drugs and experience adverse events, they can report them to the FDA. These data are in the public domain, and anyone can download and analyze them. In this post, we will download the patients' demography information, the drugs they used and the indications for which they used them, and the reactions and outcomes. Then we will put all the datasets in a database and use dplyr to work with them.
You can read more about the adverse events data in my previous post here.
You can simply run the code below: it downloads the adverse events data and creates one large dataset for each category by merging the quarterly files. For demonstration purposes, let's use the adverse event reports from 2013-2015. The adverse events are released in quarterly data files (a single data file per category every three months).
library(dplyr)
library(ggplot2)
library(data.table)
library(plotly)
library(compare)
year_start <- 2013
year_last <- 2015
# Download and unzip the quarterly files for each category:
# demo = demography, drug = drug, reac = reaction,
# outc = outcome, indi = indication for use
for (i in year_start:year_last) {
  for (m in 1:4) {
    for (category in c("demo", "drug", "reac", "outc", "indi")) {
      url <- paste0("http://www.nber.org/fda/faers/", i, "/",
                    category, i, "q", m, ".csv.zip")
      download.file(url, destfile = "data.zip")
      unzip("data.zip")
    }
  }
}
filenames <- list.files(pattern="^demo.*.csv", full.names=TRUE)
demography = rbindlist(lapply(filenames, fread,
select=c("primaryid","caseid","age","age_cod","event_dt",
"sex","wt","wt_cod","occr_country"),data.table=FALSE))
str(demography)
We see that our demography data has more than 3 million rows and the variables are age, age code, date the event happened, sex, weight, weight code and country where the event happened.
filenames <- list.files(pattern="^drug.*.csv", full.names=TRUE)
drug = rbindlist(lapply(filenames, fread,
select=c("primaryid","drug_seq","drugname","route"
),data.table=FALSE))
str(drug)
We can see that the drug data has about ten million rows and among the variables are drug name and route.
filenames <- list.files(pattern="^indi.*.csv", full.names=TRUE)
indication = rbindlist(lapply(filenames, fread,
select=c("primaryid","indi_drug_seq","indi_pt"
),data.table=FALSE))
str(indication)
The indication data has more than six million rows and the variables are primaryid, drug sequence and indication (the indication preferred term).
filenames <- list.files(pattern="^outc.*.csv", full.names=TRUE)
outcome = rbindlist(lapply(filenames, fread,
select=c("primaryid","outc_cod"),data.table=FALSE))
str(outcome)
The outcome data has more than two million rows and the variables are primaryid and outcome code (outc_cod).
filenames <- list.files(pattern="^reac.*.csv", full.names=TRUE)
reaction = rbindlist(lapply(filenames, fread,
select=c("primaryid","pt"),data.table=FALSE))
str(reaction)
So, we see that the adverse events (reaction) data has more than nine million rows and the variables are primaryid and the preferred term for the adverse event (pt).
To create a SQLite database in R, we do not need anything installed separately; we just specify a path. We use src_sqlite to connect to an existing SQLite database, and tbl to connect to tables within that database. We can also use src_sqlite to create a new SQLite database at the specified path. If we do not specify a path, it will be created in our working directory.
my_database <- src_sqlite("adverse_events", create = TRUE) # create = TRUE creates a new database
To upload data to the database, we use the dplyr function copy_to. According to the documentation, wherever possible, the new object will be temporary, limited to the current connection to the source. So, we have to set temporary = FALSE to make it permanent.
copy_to(my_database, demography, temporary = FALSE) # uploading demography data
copy_to(my_database, drug, temporary = FALSE)       # uploading drug data
copy_to(my_database, indication, temporary = FALSE) # uploading indication data
copy_to(my_database, reaction, temporary = FALSE)   # uploading reaction data
copy_to(my_database, outcome, temporary = FALSE)    # uploading outcome data
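As an aside, copy_to can also build indexes while uploading, which can speed up later filters and joins on those columns. A minimal sketch (not run here, since the tables are already uploaded):
# copy_to(my_database, demography, temporary = FALSE,
#         indexes = list("primaryid")) # index primaryid for faster joins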
Now, I have put all the datasets in the "adverse_events" database. I can query it and perform any analytics I want.
my_db <- src_sqlite("adverse_events", create = FALSE)
# create is false now because I am connecting to an existing database
src_tbls(my_db)
We use the same dplyr verbs that we use to manipulate local data frames to work with databases. dplyr translates the R code we write into SQL. We use tbl to connect to tables within the database.
demography = tbl(my_db, "demography")
class(demography)
head(demography, 3)
US = filter(demography, occr_country == 'US') # filtering the demography of patients from the US
We can see the query dplyr has generated:
US$query
We can also see how the database plans to execute the query:
explain(US)
Let's similarly connect to the other tables in the database.
drug = tbl(my_db, "drug")
indication = tbl(my_db, "indication")
outcome = tbl(my_db, "outcome")
reaction = tbl(my_db, "reaction")
It is very interesting to note that dplyr delays the actual operation until necessary and loads data into R from the database only when we need it. The commands are executed only when we use actions such as collect(), head(), or count() (there is a small sketch of this after the next example).
While we can use head() on database tbls, we can’t find the last rows without executing the whole query.
head(indication, 3)
tail(indication, 3) # this fails: tail() would require executing the whole query
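To see the laziness in action, here is a small sketch (the us_reports name is just for illustration): nothing touches the database until we print the result or call collect().
us_reports <- demography %>%
  filter(occr_country == 'US') %>% # lazy: no query is sent yet
  summarize(Total = n())           # still lazy
us_reports            # printing forces the query to run
collect(us_reports)   # pulls the result into R as a local data frame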
We can pipe dplyr operations together with %>% from the magrittr R package. The pipe takes the output of the expression on its left-hand side and passes it as the first argument to the function on its right-hand side.
Find the top ten countries with the highest number of adverse events
demography %>% group_by(Country = occr_country) %>% # group by country
  summarize(Total = n()) %>%                        # count reports per country
  arrange(desc(Total)) %>%                          # sort in descending order
  filter(Country != '') %>%                         # drop reports that do not have country information
  head(10)                                          # take the top ten
We can also include ggplot in the chain:
demography %>% group_by(Country = occr_country) %>% # group by country
  summarize(Total = n()) %>%                        # count reports per country
  arrange(desc(Total)) %>%                          # sort in descending order
  filter(Country != '') %>%                         # drop reports that do not have country information
  head(10) %>%                                      # take the top ten
  collect() %>%                                     # bring the result into R so we can use factor()
  mutate(Country = factor(Country, levels = Country[order(Total, decreasing = FALSE)])) %>%
  ggplot(aes(x = Country, y = Total)) +
  geom_bar(stat = 'identity', color = 'skyblue', fill = '#b35900') +
  xlab("") + ylab('Total number of reports') +
  ggtitle('Top ten countries with highest number of adverse event reports') +
  coord_flip()
Find the most common drug
drug %>% group_by(drug_name = drugname) %>% # group by drug name
  summarize(Total = n()) %>%                # count per drug name
  arrange(desc(Total)) %>%                  # sort in descending order
  head(1)                                   # take the most frequent drug
What are the top 5 most common outcomes?
head(outcome,3) # to see the variable names
outcome %>% group_by(Outcome_code = outc_cod) %>% # group by outcome code
  summarize(Total = n()) %>%                      # count per outcome code
  arrange(desc(Total)) %>%                        # sort in descending order
  head(5)                                         # take the top five
What are the top ten reactions?
head(reaction,3) # to see the variable names
reaction %>% group_by(reactions = pt) %>% # group by reaction
  summarize(Total = n()) %>%              # count per reaction type
  arrange(desc(Total)) %>%                # sort in descending order
  head(10)                                # take the top ten
We can also join tables from the database. Let's join the demography, outcome, and reaction tables on primaryid.
inner_joined = demography %>% inner_join(outcome, by = 'primaryid', copy = TRUE) %>%
  inner_join(reaction, by = 'primaryid', copy = TRUE)
head(inner_joined)
Let's join drug and indication using two keys (primaryid and the drug sequence number). We first rename indi_drug_seq to drug_seq so the key names match.
drug_indication = indication %>% rename(drug_seq = indi_drug_seq) %>%
  inner_join(drug, by = c("primaryid", "drug_seq"))
head(drug_indication)
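The other two-table verbs work the same way on database tbls. As a small sketch on the same tables: semi_join keeps the demography reports that have a matching outcome record, while anti_join keeps those that do not.
# demography %>% semi_join(outcome, by = 'primaryid') # reports with a recorded outcome
# demography %>% anti_join(outcome, by = 'primaryid') # reports without one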
In this post, we saw how to use the dplyr package to create a database and upload data to it. We also saw how to perform various analytics by querying data from the database. Working with databases in R has a huge advantage when our data is big: loading all of it into R may be impossible or may slow down our analytics. If our data resides in a database, rather than loading the whole dataset into R, we can work with just the subsets or aggregates we are interested in. Further, if we have many data files, putting our data in a database, rather than in CSV or another format, offers better security and is easier to manage.
This is enough for this post. See you in my next post. You can read about dplyr two-table verbs here.