# Fisseha Berhane, PhD

#### Data Scientist


## Working with databases in R

The dplyr package, which is one of my favorite R packages, works with in-memory data and with data stored in databases. In this post, I will share my experience using dplyr to work with databases.

Using dplyr with databases is a huge advantage when our data is too big to load into R, or when loading it would slow down our analytics. If our data resides in a database, rather than loading the whole dataset into R, we can work with just the subsets or aggregates we are interested in. Further, if we have many data files, keeping our data in a database, rather than in CSV or other formats, gives better security and is easier to manage.

dplyr is a really powerful package for data manipulation, data exploration and feature engineering in R, and if you do not know SQL, it lets you work with databases directly from R. Further, dplyr functions are easy to write and read. dplyr treats database tables as data frames and uses lazy evaluation: it delays the actual operation until necessary, and loads data into R from the database only when we need it. For someone who knows Spark, the processes, and even the functions, have similarities.
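As a minimal sketch of this lazy behavior, using the same src_sqlite() interface as the rest of this post (this assumes the dplyr and RSQLite packages are installed; newer dplyr versions route this through DBI/dbplyr instead):

```r
library(dplyr)

# An in-memory SQLite database with a built-in data frame copied into it
db      <- src_sqlite(":memory:", create = TRUE)
cars_db <- copy_to(db, mtcars, name = "cars", temporary = FALSE)

# Building the query does not touch the database yet...
q <- cars_db %>% filter(cyl == 6) %>% select(mpg, cyl)

# ...the SQL runs only when we ask for the result
result <- collect(q)
nrow(result)   # 7 six-cylinder cars in mtcars
```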

dplyr supports several databases, such as SQLite, MySQL and PostgreSQL. In this post, we will see how to work with an SQLite database. You can get more information from the dplyr database vignette here.

When people take drugs and experience any adverse events, they can report them to the FDA. These data are in the public domain, and anyone can download and analyze them. In this post, we will download the patients' demography information, the drugs they used and the indications they were used for, as well as the reactions and outcomes. Then, we will put all the datasets in a database and use dplyr to work with it.

You can simply run the code below: it downloads the adverse events data and creates one large dataset for each category by merging the various quarterly files. For demonstration purposes, let's use the adverse event reports from 2013-2015. The adverse events are released in quarterly data files (a single data file for every category every three months).

In [4]:
library(dplyr)
library(ggplot2)
library(data.table)
library(plotly)
library(compare)


In [3]:
year_start = 2013
year_last  = 2015

for (i in year_start:year_last){
    for (m in 1:4){
        # each category (demography, drug, reaction, outcome, indication)
        # has one file per quarter
        for (category in c("demo", "drug", "reac", "outc", "indi")){
            url <- paste0("http://www.nber.org/fda/faers/", i, "/",
                          category, i, "q", m, ".csv.zip")
            download.file(url, destfile = "data.zip")  # fetch the zip first
            unzip("data.zip")                          # then extract it
        }
    }
}


#### Concatenate the quarterly data files and create a single dataset for each category

• Demography
In [ ]:
filenames <- list.files(pattern="^demo.*.csv", full.names=TRUE)

demography = rbindlist(lapply(filenames, fread,
             select=c("primaryid","caseid","age","age_cod","event_dt",
                      "sex","wt","wt_cod","occr_country"), data.table=FALSE))

In [19]:
str(demography)

Classes 'data.table' and 'data.frame':	3037542 obs. of  9 variables:
 $ primaryid   : int  30375293 30936912 32481334 35865322 37005182 37108102 37820163 38283002 38346784 40096383 ...
 $ caseid      : int  3037529 3093691 3248133 3586532 3700518 3710810 3782016 3828300 3834678 4009638 ...
 $ event_dt    : int  199706 199610 1996 20000627 200101 20010810 20120409 NA 20020615 20030619 ...
 $ age         : chr  "44" "38" "28" "45" ...
 $ age_cod     : chr  "YR" "YR" "YR" "YR" ...
 $ sex         : chr  "F" "F" "F" "M" ...
 $ wt          : num  56 56 54 NA NA 80 102 NA NA 87.3 ...
 $ wt_cod      : chr  "KG" "KG" "KG" "" ...
 $ occr_country: chr  "US" "US" "US" "AR" ...
 - attr(*, ".internal.selfref")=<externalptr>

We see that our demography data has more than 3 million rows and that the variables are age, age code, date the event happened, sex, weight, weight code and the country where the event happened.

• Drug
In [ ]:
filenames <- list.files(pattern="^drug.*.csv", full.names=TRUE)

drug = rbindlist(lapply(filenames, fread,
       select=c("primaryid","drug_seq","drugname","route"), data.table=FALSE))

In [10]:
str(drug)

'data.frame':	10026718 obs. of  4 variables:
 $ primaryid: int  30375293 30375293 30375293 30375293 30375293 30375293 30375293 30375293 30936912 30936912 ...
 $ drug_seq : int  1 2 3 4 5 6 7 8 1 2 ...
 $ drugname : chr  "AVONEX" "AVONEX" "ZANAFLEX" "STEROID (NOS)" ...
 $ route    : chr  "INTRAMUSCULAR" "INTRAMUSCULAR" "" "" ...

We can see that the drug data has about ten million rows, and that among the variables are drug name and route.

• Diagnoses/Indications
In [ ]:
filenames <- list.files(pattern="^indi.*.csv", full.names=TRUE)

indication = rbindlist(lapply(filenames, fread,
             select=c("primaryid","indi_drug_seq","indi_pt"), data.table=FALSE))

In [12]:
str(indication)

'data.frame':	6395157 obs. of  3 variables:
 $ primaryid    : int  30375293 30375293 30375293 30375293 30375293 30375293 30375293 30375293 30936912 30936912 ...
 $ indi_drug_seq: int  1 2 3 4 5 6 7 8 1 2 ...
 $ indi_pt      : chr  "Multiple sclerosis" "Multiple sclerosis" "Muscle spasticity" "Arthritis" ...


The indication data has more than six million rows and the variables are primaryid, drug sequence and indication (indication preferred term).

• Outcomes
In [ ]:
filenames <- list.files(pattern="^outc.*.csv", full.names=TRUE)

outcome = rbindlist(lapply(filenames, fread,
          select=c("primaryid","outc_cod"), data.table=FALSE))

In [14]:
str(outcome)

'data.frame':	2138932 obs. of  2 variables:
 $ primaryid: int  30375293 30936912 32481334 35865322 35865322 37005182 37108102 37820163 38283002 38346784 ...
 $ outc_cod : chr  "HO" "HO" "HO" "DE" ...


The outcome data has more than two million rows and the variables are primaryid and outcome code (outc_cod).

• Reactions
In [62]:
filenames <- list.files(pattern="^reac.*.csv", full.names=TRUE)

reaction = rbindlist(lapply(filenames, fread,
           select=c("primaryid","pt"), data.table=FALSE))

In [16]:
str(reaction)

'data.frame':	9014747 obs. of  2 variables:
 $ primaryid: int  30375293 30375293 30375293 30375293 30375293 30375293 30375293 30375293 30375293 30375293 ...
 $ pt       : chr  "Amenorrhoea" "Asthenia" "Bladder disorder" "Blood pressure increased" ...


So, we see that the adverse events (reaction) data has more than nine million rows and the variables are primaryid and the preferred term for the adverse event (pt).

### Create a database

To create an SQLite database in R, we do not need anything beyond R itself; we only have to specify a path. We use src_sqlite to connect to an existing SQLite database, and tbl to connect to tables within that database. We can also use src_sqlite to create a new SQLite database at a specified path; if we do not give a path, the database is created in our working directory.

In [41]:
my_database<- src_sqlite("adverse_events", create = TRUE) # create =TRUE creates a new database


### Put data in the database

To upload data to the database, we use the dplyr function copy_to. According to the documentation, wherever possible, the new object will be temporary, limited to the current connection to the source. So, we have to set temporary to FALSE to make it permanent.

In [ ]:
copy_to(my_database,demography,temporary = FALSE) # uploading demography data
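
The other four tables can be uploaded the same way (a sketch, assuming the drug, indication, outcome and reaction data frames created above are still in memory):

```r
# each call creates a permanent table of the same name in the database
copy_to(my_database, drug,       temporary = FALSE)
copy_to(my_database, indication, temporary = FALSE)
copy_to(my_database, outcome,    temporary = FALSE)
copy_to(my_database, reaction,   temporary = FALSE)
```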


Now, I have put all the datasets in the "adverse_events" database. I can query it and do any analytics I want.

### Connect to database

In [16]:
my_db <- src_sqlite("adverse_events", create = FALSE)
# create is false now because I am connecting to an existing database

• List the tables in the database
In [17]:
src_tbls(my_db)

Out[17]:
1. "demography"
2. "drug"
3. "indication"
4. "outcome"
5. "reaction"
6. "sqlite_stat1"

### Querying the database

We use the same dplyr verbs that we use in data manipulation to work with databases. dplyr translates the R code we write to SQL code. We use tbl to connect to tables within the database.

In [18]:
demography = tbl(my_db,"demography" )

In [19]:
class(demography)

Out[19]:
1. "tbl_sqlite"
2. "tbl_sql"
3. "tbl"
In [20]:
head(demography,3)

Out[20]:
  primaryid  caseid age age_cod event_dt sex wt wt_cod occr_country
1  30375293 3037529  44      YR   199706   F 56     KG           US
2  30936912 3093691  38      YR   199610   F 56     KG           US
3  32481334 3248133  28      YR     1996   F 54     KG           US
In [21]:
US = filter(demography, occr_country=='US')  # Filtering demography of patients from the US


We can see the query dplyr has generated:

In [22]:
US$query

Out[22]:
<Query> SELECT "primaryid", "caseid", "age", "age_cod", "event_dt", "sex", "wt", "wt_cod", "occr_country"
FROM "demography"
WHERE "occr_country" = 'US'
<SQLiteConnection>

We can also see how the database plans to execute the query:

In [23]:
explain(US)

<SQL>
SELECT "primaryid", "caseid", "age", "age_cod", "event_dt", "sex", "wt", "wt_cod", "occr_country"
FROM "demography"
WHERE "occr_country" = 'US'

<PLAN>
  selectid order from                detail
1        0     0    0 SCAN TABLE demography


Let's similarly connect to the other tables in the database.

In [24]:
drug = tbl(my_db,"drug" )
indication = tbl(my_db,"indication" )
outcome = tbl(my_db,"outcome" )
reaction = tbl(my_db,"reaction" )


It is very interesting to note that dplyr delays the actual operation until necessary and loads data into R from the database only when we need it. The commands are executed only when we use actions such as collect(), head(), count(), etc.

While we can use head() on database tbls, we can’t find the last rows without executing the whole query.

In [25]:
head(indication,3)

Out[25]:
  primaryid indi_drug_seq       indi_pt
1   8480348    1020135312 CONTRACEPTION
2   8480354    1020135329 SCHIZOPHRENIA
3   8480355    1020135331       ANXIETY
In [26]:
tail(indication,3)

Error: tail is not supported by sql sources
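If the last rows are really needed, one option (a sketch) is to first pull the query result into R with collect(), after which ordinary data-frame functions such as tail() work — keeping in mind that collect() executes the whole query and loads the result into memory:

```r
indication %>%
    collect() %>%   # executes the query and brings the data into R
    tail(3)         # now an ordinary data frame, so tail() works
```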

• Now, let's use the dplyr verbs (select, arrange, filter, mutate, summarize, rename) on the tables from the database.

We can pipe dplyr operations together with %>% from the magrittr R package. The pipe %>% takes the output of the left-hand side and passes it as the first argument to the function on the right-hand side.
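As a toy illustration of the pipe, independent of any database:

```r
library(magrittr)  # %>% is also loaded with dplyr

smallest_two <- c(3, 1, 2) %>% sort() %>% head(2)
# the same as head(sort(c(3, 1, 2)), 2), i.e. c(1, 2)
```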

Find the top ten countries with the highest number of adverse events

In [27]:
demography%>%group_by(Country= occr_country)%>% #grouped by country
summarize(Total=n())%>%    # found the count for each country
arrange(desc(Total))%>%    # sorted them in descending order
filter(Country!='')%>%     # removed reports that do not have country information
head(10)                   # took the top ten

Out[27]:
   Country   Total
1       US 2048277
2       GB   99110
3       JP   85637
4       CA   66550
5       FR   64356
6       DE   57362
7       IT   40124
8       BR   38051
9       AU   23818
10      NL   23528

We can also include ggplot in the chain:

In [28]:
demography%>%group_by(Country= occr_country)%>% #grouped by country
summarize(Total=n())%>%    # found the count for each country
arrange(desc(Total))%>%    # sorted them in descending order
filter(Country!='')%>%     # removed reports that do not have country information
head(10)%>%                  # took the top ten
mutate(Country = factor(Country,levels = Country[order(Total,decreasing =F)]))%>%
ggplot(aes(x=Country,y=Total))+geom_bar(stat='identity',color='skyblue',fill='#b35900')+
xlab("")+ggtitle('Top ten countries with highest number of adverse event reports')+
coord_flip()+ylab('Total number of reports')


Find the most common drug

In [127]:
drug%>%group_by(drug_name= drugname)%>% #grouped by drug_name
summarize(Total=n())%>%    # found the count for each drug name
arrange(desc(Total))%>%    # sorted them in descending order
head(1)                   # took the most frequent drug

Out[127]:
  drug_name  Total
1    HUMIRA 170206

What are the top 5 most common outcomes?

In [123]:
head(outcome,3)  # to see the variable names

Out[123]:
  primaryid outc_cod
1   8480347       OT
2   8480348       HO
3   8480350       HO
In [122]:
outcome%>%group_by(Outcome_code= outc_cod)%>% #grouped by Outcome_code
summarize(Total=n())%>%    # found the count for each Outcome_code
arrange(desc(Total))%>%    # sorted them in descending order
head(5)                   # took the top five

Out[122]:
  Outcome_code   Total
1           OT 1143177
2           HO  789956
3           DE  342277
4           LT   82464
5           DS   70071

What are the top ten reactions?

In [124]:
head(reaction,3)  # to see the variable names

Out[124]:
  primaryid                            pt
1   8480347 ANAEMIA HAEMOLYTIC AUTOIMMUNE
2   8480348              OPTIC NEUROPATHY
3   8480349                      DYSPNOEA
In [125]:
reaction%>%group_by(reactions= pt)%>% # grouped by reactions
summarize(Total=n())%>%    # found the count for each reaction type
arrange(desc(Total))%>%    # sorted them in descending order
head(10)                   # took the top ten

Out[125]:
          reactions  Total
1  Drug ineffective 194615
2             Death 146317
3            Nausea 131899
4           Fatigue 126673
6              Pain 104215
7          Dyspnoea 100962
8         Diarrhoea  97440
9           Malaise  88018
10        Dizziness  87133

### Joins

##### Let's join the demography data with the outcome data on primaryid, and then with the reaction data
In [133]:
inner_joined = demography%>%inner_join(outcome, by='primaryid',copy = TRUE)%>%
inner_join(reaction, by='primaryid',copy = TRUE)

In [134]:
head(inner_joined)

Out[134]:
  primaryid  caseid age age_cod event_dt sex wt wt_cod occr_country outc_cod                       pt
1  30375293 3037529  44      YR   199706   F 56     KG           US       HO              Amenorrhoea
2  30375293 3037529  44      YR   199706   F 56     KG           US       HO                 Asthenia
4  30375293 3037529  44      YR   199706   F 56     KG           US       HO Blood pressure increased
5  30375293 3037529  44      YR   199706   F 56     KG           US       HO              Dehydration
6  30375293 3037529  44      YR   199706   F 56     KG           US       HO               Depression

#### We can also use a primary key and a secondary key in our joins

Let's join drug and indication using two keys (primary and secondary keys).

In [137]:
drug_indication= indication%>%rename(drug_seq=indi_drug_seq)%>%
inner_join(drug, by=c("primaryid","drug_seq"))

In [138]:
head(drug_indication)

Out[138]:
  primaryid   drug_seq        indi_pt             drugname         route
1   8480348 1020135312  CONTRACEPTION       LEVONORGESTREL          ORAL
2   8480354 1020135329  SCHIZOPHRENIA           OLANZAPINE INTRAMUSCULAR
3   8480355 1020135331        ANXIETY              CYMBALTA
4   8480357 1020135333  SCHIZOPHRENIA           OLANZAPINE INTRAMUSCULAR
5   8480358 1020135334   OSTEOPOROSIS               PROLIA  SUBCUTANEOUS
6   8480358 1020135337 HYPOTHYROIDISM LEVOTHYROXINE SODIUM          ORAL

In this post, we saw how to use the dplyr package to create a database, upload data to it, and perform various analytics by querying it. Working with databases from R has huge advantages when our data is too big to load into R: we can work with just the subsets or aggregates we are interested in, and keeping many data files in a database, rather than in CSV or other formats, gives better security and is easier to manage.

This is enough for this post. See you in my next post. You can read about dplyr two-table verbs here.