Fisseha Berhane, PhD

Data Scientist

443-970-2353 fisseha@jhu.edu CV Resume Linkedin GitHub twitter twitter

Working with databases in R






The dplyr package, which is one of my favorite R packages, works with in-memory data and with data stored in databases. In this post, I will share my experience on using dplyr to work with databases.

Using dplyr with databases has huge advantage when our data is big where loading it to R is impossible or slows down our analytics. If our data resides in a database, rather than loading the whole data to R, we can work with subsets or aggregates of the data that we are interested in. Further, if we have many data files, putting our data in a database, rather than in csv or other format, has better security and is easier to manage.

dplyr is a really powerful package for data manipulation, data exploration and feature engineering in R and if you do not know SQL, it provides the ability to work with databases just within R. Further, dplyr functions are easy to write and read. dplyr considers database tables as data frames and it uses lazy evaluation (it delays the actual operation until necessary and loads data onto R from the database only when we need it) and for someone who knows Spark, the processes and even the functions have similarities.

dplyr supports a couple of databases such as sqlite, mysql and postgresql. In this post, we will see how to work with sqlite database. You can get more information from the dplyr database vignette here.

When people take drugs, if they experience any adverse events, they can report it to the FDA. These data are in public domain and anyone can download them and analyze them. In this post, we will download demography information of the patients, drug they used and for what indication they used it, reaction and outcome. Then, we will put all the datasets in a database and use dplyr to work with the databases.

You can read more about the advesre events data in my previous post here.

You can simply run the code below and it will download the adverse events data and create one large dataset, for each category, by merging the various datasets. For demonstration purposes, let's use the adverse event reports from 2013-2015. The adverse events are released in quarterly data files (a single data file for every category every three months).

Load R Packages

In [4]:
library(dplyr)
library(ggplot2)
library(data.table)
library(plotly)
library(compare)

Download adverse events data

In [3]:
year_start=2013
year_last=2015

for (i in year_start:year_last){
            j=c(1:4)
            
            for (m in j){
            
            url1<-paste0("http://www.nber.org/fda/faers/",i,"/demo",i,"q",m,".csv.zip")
                download.file(url1,dest="data.zip") # Demography
                unzip ("data.zip")
                
            url2<-paste0("http://www.nber.org/fda/faers/",i,"/drug",i,"q",m,".csv.zip")
                download.file(url2,dest="data.zip")   # Drug 
                unzip ("data.zip")
                
            url3<-paste0("http://www.nber.org/fda/faers/",i,"/reac",i,"q",m,".csv.zip")
                download.file(url3,dest="data.zip") # Reaction
                unzip ("data.zip")
                
                
            url4<-paste0("http://www.nber.org/fda/faers/",i,"/outc",i,"q",m,".csv.zip")
                download.file(url4,dest="data.zip") # Outcome
                unzip ("data.zip")
                
                
            url5<-paste0("http://www.nber.org/fda/faers/",i,"/indi",i,"q",m,".csv.zip")
                download.file(url5,dest="data.zip") # Indication for use
                unzip ("data.zip")
            }
        }

Concatenate the quarterly data files and create single dataset for each category

  • Demography
In [ ]:
filenames <- list.files(pattern="^demo.*.csv", full.names=TRUE)

demography = rbindlist(lapply(filenames, fread,
        select=c("primaryid","caseid","age","age_cod","event_dt",
         "sex","wt","wt_cod","occr_country"),data.table=FALSE))
In [19]:
str(demography)
Classes 'data.table' and 'data.frame':	3037542 obs. of  9 variables:
 $ primaryid   : int  30375293 30936912 32481334 35865322 37005182 37108102 37820163 38283002 38346784 40096383 ...
 $ caseid      : int  3037529 3093691 3248133 3586532 3700518 3710810 3782016 3828300 3834678 4009638 ...
 $ event_dt    : int  199706 199610 1996 20000627 200101 20010810 20120409 NA 20020615 20030619 ...
 $ age         : chr  "44" "38" "28" "45" ...
 $ age_cod     : chr  "YR" "YR" "YR" "YR" ...
 $ sex         : chr  "F" "F" "F" "M" ...
 $ wt          : num  56 56 54 NA NA 80 102 NA NA 87.3 ...
 $ wt_cod      : chr  "KG" "KG" "KG" "" ...
 $ occr_country: chr  "US" "US" "US" "AR" ...
 - attr(*, ".internal.selfref")=<externalptr> 

We see that our demography data has more than 3 million rows and the variables are age, age code, date the event happened, sex, weight, weight code and country where the event happened.

  • Drug
In [ ]:
filenames <- list.files(pattern="^drug.*.csv", full.names=TRUE)
drug = rbindlist(lapply(filenames, fread,
        select=c("primaryid","drug_seq","drugname","route"
         ),data.table=FALSE))
In [10]:
str(drug)
'data.frame':	10026718 obs. of  4 variables:
 $ primaryid: int  30375293 30375293 30375293 30375293 30375293 30375293 30375293 30375293 30936912 30936912 ...
 $ drug_seq : int  1 2 3 4 5 6 7 8 1 2 ...
 $ drugname : chr  "AVONEX" "AVONEX" "ZANAFLEX" "STEROID (NOS)" ...
 $ route    : chr  "INTRAMUSCULAR" "INTRAMUSCULAR" "" "" ...

We can see that the drug data has about ten million rows and among the variables are drug name and route.

  • Diagnoses/Indications
In [ ]:
filenames <- list.files(pattern="^indi.*.csv", full.names=TRUE)
indication = rbindlist(lapply(filenames, fread,
        select=c("primaryid","indi_drug_seq","indi_pt"
         ),data.table=FALSE))
In [12]:
str(indication)
'data.frame':	6395157 obs. of  3 variables:
 $ primaryid    : int  30375293 30375293 30375293 30375293 30375293 30375293 30375293 30375293 30936912 30936912 ...
 $ indi_drug_seq: int  1 2 3 4 5 6 7 8 1 2 ...
 $ indi_pt      : chr  "Multiple sclerosis" "Multiple sclerosis" "Muscle spasticity" "Arthritis" ...

The indication data has more than six million rows and the variables are primaryid, drug sequence and indication (indication prefered term).

  • Outcomes
In [ ]:
filenames <- list.files(pattern="^outc.*.csv", full.names=TRUE)
outcome = rbindlist(lapply(filenames, fread,
        select=c("primaryid","outc_cod"),data.table=FALSE))
In [14]:
str(outcome)
'data.frame':	2138932 obs. of  2 variables:
 $ primaryid: int  30375293 30936912 32481334 35865322 35865322 37005182 37108102 37820163 38283002 38346784 ...
 $ outc_cod : chr  "HO" "HO" "HO" "DE" ...

The outcome data has more than two million rows and the variables are primaryid and outcome code (outc_cod).

  • Reaction (Adverse Event)
In [62]:
filenames <- list.files(pattern="^reac.*.csv", full.names=TRUE)
reaction = rbindlist(lapply(filenames, fread,
           select=c("primaryid","pt"),data.table=FALSE))
In [16]:
str(reaction)
'data.frame':	9014747 obs. of  2 variables:
 $ primaryid: int  30375293 30375293 30375293 30375293 30375293 30375293 30375293 30375293 30375293 30375293 ...
 $ pt       : chr  "Amenorrhoea" "Asthenia" "Bladder disorder" "Blood pressure increased" ...

So, we see that the adverse events (reaction) data has more than nine million rows and the variables are primaryid and prefered term for adverse event (pt).

Creat a database

To create SQLite database in R, we do not need anything, just specifying the path only. We use src_sqlite to connect to an existing sqlite database, and tbl to connect to tables within that database. We can also use src_sqlite to create new SQlite database at the specified path. If we do not specify a path, it will be created in our working directory.

In [41]:
my_database<- src_sqlite("adverse_events", create = TRUE) # create =TRUE creates a new database

Put data in the database

To upload data to the database, we use the dplyr function copy_to. According to the documentation, wherever possible, the new object will be temporary, limited to the current connection to the source. So, we have to change temporary to false to make it permament.

In [ ]:
copy_to(my_database,demography,temporary = FALSE) # uploading demography data
copy_to(my_database,drug,temporary = FALSE)       # uploading drug data
copy_to(my_database,indication,temporary = FALSE) # uploading indication data
copy_to(my_database,reaction,temporary = FALSE)   # uploading reaction data
copy_to(my_database,outcome,temporary = FALSE)     #uploading outcome data

Now, I have put all the datasets in the "adverse_events" database. I can query it and do analytics I want.

Connect to database

In [16]:
my_db <- src_sqlite("adverse_events", create = FALSE)
              # create is false now because I am connecting to an existing database
  • List the tables in the database
In [17]:
src_tbls(my_db)
Out[17]:
  1. "demography"
  2. "drug"
  3. "indication"
  4. "outcome"
  5. "reaction"
  6. "sqlite_stat1"

Querying the database

We use the same dplyr verbs that we use in data manipulation to work with databases. dplyr translates the R code we write to SQL code. We use tbl to connect to tables within the database.

In [18]:
demography = tbl(my_db,"demography" )
In [19]:
class(demography)
Out[19]:
  1. "tbl_sqlite"
  2. "tbl_sql"
  3. "tbl"
In [20]:
head(demography,3)
Out[20]:
primaryidcaseidageage_codevent_dtsexwtwt_codoccr_country
130375293303752944YR199706F56KGUS
230936912309369138YR199610F56KGUS
332481334324813328YR1996F54KGUS
In [21]:
US = filter(demography, occr_country=='US')  # Filtering demography of patients from the US

We can see the query dplyr has generated:

In [22]:
US$query
Out[22]:
<Query> SELECT "primaryid", "caseid", "age", "age_cod", "event_dt", "sex", "wt", "wt_cod", "occr_country"
FROM "demography"
WHERE "occr_country" = 'US'
<SQLiteConnection>

We can also see how the database plans to excute the query:

In [23]:
explain(US)
<SQL>
SELECT "primaryid", "caseid", "age", "age_cod", "event_dt", "sex", "wt", "wt_cod", "occr_country"
FROM "demography"
WHERE "occr_country" = 'US'


<PLAN>
  selectid order from                detail
1        0     0    0 SCAN TABLE demography

Let's similarly connect to the other tables in the database.

In [24]:
drug = tbl(my_db,"drug" )
indication = tbl(my_db,"indication" )
outcome = tbl(my_db,"outcome" )
reaction = tbl(my_db,"reaction" )

It is very interesting to note that dplyr delays the actual operation until necessary and loads data onto R from the database only when we need it. When we use actions such as collect(), head(), count(), etc, the commands are excuted.

While we can use head() on database tbls, we can’t find the last rows without executing the whole query.

In [25]:
head(indication,3)
Out[25]:
primaryidindi_drug_seqindi_pt
184803481020135312CONTRACEPTION
284803541020135329SCHIZOPHRENIA
384803551020135331ANXIETY
In [26]:
tail(indication,3)
Error: tail is not supported by sql sources
  • Now, let's use the dplyr verbs (select, arrange, filter, mutate, summarize, rename) on the tables from the database.

We can pipe dplyr operations together with %>% from the magrittr R package. The pipeline %>% takes the output from the left-hand side of the pipe as the first argument to the function on the right hand side.

Find the top ten countries with the highest number of adverse events

In [27]:
demography%>%group_by(Country= occr_country)%>% #grouped by country
           summarize(Total=n())%>%    # found the count for each country
           arrange(desc(Total))%>%    # sorted them in descending order
           filter(Country!='')%>%     # removed reports that does not have country information
           head(10)                   # took the top ten
Out[27]:
CountryTotal
1US2048277
2GB99110
3JP85637
4CA66550
5FR64356
6DE57362
7IT40124
8BR38051
9AU23818
10NL23528

We can also include ggplot in the chain:

In [28]:
demography%>%group_by(Country= occr_country)%>% #grouped by country
           summarize(Total=n())%>%    # found the count for each country
           arrange(desc(Total))%>%    # sorted them in descending order
           filter(Country!='')%>%     # removed reports that does not have country information
           head(10)%>%                  # took the top ten
           mutate(Country = factor(Country,levels = Country[order(Total,decreasing =F)]))%>%
          ggplot(aes(x=Country,y=Total))+geom_bar(stat='identity',color='skyblue',fill='#b35900')+
          xlab("")+ggtitle('Top ten countries with highest number of adverse event reports')+
          coord_flip()+ylab('Total number of reports')

Find the most common drug

In [127]:
drug%>%group_by(drug_name= drugname)%>% #grouped by drug_name
           summarize(Total=n())%>%    # found the count for each drug name
           arrange(desc(Total))%>%    # sorted them in descending order
           head(1)                   # took the most frequent drug
Out[127]:
drug_nameTotal
1HUMIRA170206

What are the top 5 most common outcome?

In [123]:
head(outcome,3)  # to see the variable names
Out[123]:
primaryidoutc_cod
18480347OT
28480348HO
38480350HO
In [122]:
outcome%>%group_by(Outcome_code= outc_cod)%>% #grouped by Outcome_code
           summarize(Total=n())%>%    # found the count for each Outcome_code
           arrange(desc(Total))%>%    # sorted them in descending order
           head(5)                   # took the top five
Out[122]:
Outcome_codeTotal
1OT1143177
2HO789956
3DE342277
4LT82464
5DS70071

What are the top ten reactions?

In [124]:
head(reaction,3)  # to see the variable names
Out[124]:
primaryidpt
18480347ANAEMIA HAEMOLYTIC AUTOIMMUNE
28480348OPTIC NEUROPATHY
38480349DYSPNOEA
In [125]:
reaction%>%group_by(reactions= pt)%>% # grouped by reactions
           summarize(Total=n())%>%    # found the count for each reaction type
           arrange(desc(Total))%>%    # sorted them in descending order
           head(10)                   # took the top ten
Out[125]:
reactionsTotal
1Drug ineffective194615
2Death146317
3Nausea131899
4Fatigue126673
5Headache109574
6Pain104215
7Dyspnoea100962
8Diarrhoea97440
9Malaise88018
10Dizziness87133

Joins

Let's join demography data with outcome based on primaryid and with reaction data
In [133]:
inner_joined = demography%>%inner_join(outcome, by='primaryid',copy = TRUE)%>%
               inner_join(reaction, by='primaryid',copy = TRUE)
In [134]:
head(inner_joined)
Out[134]:
primaryidcaseidageage_codevent_dtsexwtwt_codoccr_countryoutc_codpt
130375293303752944YR199706F56KGUSHOAmenorrhoea
230375293303752944YR199706F56KGUSHOAsthenia
330375293303752944YR199706F56KGUSHOBladder disorder
430375293303752944YR199706F56KGUSHOBlood pressure increased
530375293303752944YR199706F56KGUSHODehydration
630375293303752944YR199706F56KGUSHODepression

We can also use primary key and secondary key in our joins

Let's join drug and indication using two keys (primary and secondary keys).

In [137]:
drug_indication= indication%>%rename(drug_seq=indi_drug_seq)%>%
   inner_join(drug, by=c("primaryid","drug_seq"))
In [138]:
head(drug_indication)
Out[138]:
primaryiddrug_seqindi_ptdrugnameroute
184803481020135312CONTRACEPTIONLEVONORGESTRELORAL
284803541020135329SCHIZOPHRENIAOLANZAPINEINTRAMUSCULAR
384803551020135331ANXIETYCYMBALTA
484803571020135333SCHIZOPHRENIAOLANZAPINEINTRAMUSCULAR
584803581020135334OSTEOPOROSISPROLIASUBCUTANEOUS
684803581020135337HYPOTHYROIDISMLEVOTHYROXINE SODIUMORAL

In this post, we saw how to use the dplyr package to create a database and upload data to the database. We also saw how to perform various analytics by querying data from the database. Working with databases in R has huge advantage when our data is big where loading it to R is impossible or slows down our analytics. If our data resides in a database, rather than loading the whole data to R, we can work with subsets or aggregates of the data that we are interested in. Further, if we have many data files, putting our data in a database, rather than in csv or other format, has better security and is easier to manage.

This is enough for this post. See you in my next post. You can read about dplyr two-table verbs here.

comments powered by Disqus