Fisseha Berhane, PhD

Data Scientist

443-970-2353 fisseha@jhu.edu CV Resume Linkedin GitHub twitter twitter

Performing SQL selects on R data frames

For anyone who has SQL background and who wants to learn R, I guess the sqldf package is very useful because it enables us to use SQL commands in R. One who has basic SQL skills can manipulate data frames in R using their SQL skills.

You can read more about sqldf package from cran.

In this post, we will see how to peform joins and other queries to retrieve, sort and filter data in R using SQL. We will also see how we can achieve the same tasks using non-SQL R commnads.

You can download the data here.

Currently, I am working with the FDA adverse events data. These data sets are ideal for a data scientist or for any data enthusiast to work with because they contain almost any kind of data messiness: missing values, duplicates, compatibility problems of data sets created during different time periods, variable names and number of variables changing over time (e.g., sex in one dataset, gender in other dataset, if_cod in one dataset and if_code in other dataset), missing errors, etc.

In this post, we will use FDA adverse events data, which are publically available.

The FDA adverse events datasets can also be downloaded in csv format from the National Bureau of Economic Research. Since, it is easier to automate the data download in R from the National Bureau of Economic Research, in this post, we will download various datasets from this website. I encourage you to try the R codes and download data from the website and explore the adverse events datasets.

The adverse events datasets are created in quarterly temporal resolution and each quarter data includes demography information, drug/biologic information, adverse event, outcome and diagnosis, etc.

Let's download some datasets and use SQL queries to join, sort and fiter the various data frames.

Load R Packages

In [145]:
require(downloader)
library(dplyr)
library(sqldf)
library(data.table)
library(ggplot2)
library(compare)
library(plotrix)

Basic error Handing with tryCatch()

We will use the function below to download data. Since the data sets have quarterly time resolution, there are four data sets per year for each observation. So, the function below will automate the data dowlnoad. If the quartely data is not available on the website (not yet released), the function will give an error message that the dataset was not found.

  • We are downloading zipped data and unzipping it.
In [3]:
try.error = function(url)

{
  try_error = tryCatch(download(url,dest="data.zip"), error=function(e) e)

  if (!inherits(try_error, "error")){

      download(url,dest="data.zip")
        unzip ("data.zip")
      }
      
    else if (inherits(try_error, "error")){

    cat(url,"not found\n")
      }
      }

Download adverse events data

The FDA adverse events data are available since 2004. In this post, let's download data since 2013. We will check if there are data upto the present time and download what we can get.

  • Sys.time() gives the current date and time
  • the function year() from the data.table package gets the year from the current date and time data that we obtain from the Sys.time() function
  • we are downloading demography, drug, diagnosis/indication, outcome and reaction (adeverse event) data.
In [4]:
year_start=2013
year_last=year(Sys.time())

for (i in year_start:year_last){
            j=c(1:4)
            
            for (m in j){
            
            url1<-paste0("http://www.nber.org/fda/faers/",i,"/demo",i,"q",m,".csv.zip")
            url2<-paste0("http://www.nber.org/fda/faers/",i,"/drug",i,"q",m,".csv.zip")
            url3<-paste0("http://www.nber.org/fda/faers/",i,"/reac",i,"q",m,".csv.zip")
            url4<-paste0("http://www.nber.org/fda/faers/",i,"/outc",i,"q",m,".csv.zip")
            url5<-paste0("http://www.nber.org/fda/faers/",i,"/indi",i,"q",m,".csv.zip")
     
           try.error(url1)
           try.error(url2)
           try.error(url3)
           try.error(url4)
           try.error(url5)     
            }
        }
http://www.nber.org/fda/faers/2015/demo2015q4.csv.zip not found
http://www.nber.org/fda/faers/2015/drug2015q4.csv.zip not found
http://www.nber.org/fda/faers/2015/reac2015q4.csv.zip not found
http://www.nber.org/fda/faers/2015/outc2015q4.csv.zip not found
http://www.nber.org/fda/faers/2015/indi2015q4.csv.zip not found
http://www.nber.org/fda/faers/2016/demo2016q1.csv.zip not found
http://www.nber.org/fda/faers/2016/drug2016q1.csv.zip not found
http://www.nber.org/fda/faers/2016/reac2016q1.csv.zip not found
http://www.nber.org/fda/faers/2016/outc2016q1.csv.zip not found
http://www.nber.org/fda/faers/2016/indi2016q1.csv.zip not found
http://www.nber.org/fda/faers/2016/demo2016q2.csv.zip not found
http://www.nber.org/fda/faers/2016/drug2016q2.csv.zip not found
http://www.nber.org/fda/faers/2016/reac2016q2.csv.zip not found
http://www.nber.org/fda/faers/2016/outc2016q2.csv.zip not found
http://www.nber.org/fda/faers/2016/indi2016q2.csv.zip not found
http://www.nber.org/fda/faers/2016/demo2016q3.csv.zip not found
http://www.nber.org/fda/faers/2016/drug2016q3.csv.zip not found
http://www.nber.org/fda/faers/2016/reac2016q3.csv.zip not found
http://www.nber.org/fda/faers/2016/outc2016q3.csv.zip not found
http://www.nber.org/fda/faers/2016/indi2016q3.csv.zip not found
http://www.nber.org/fda/faers/2016/demo2016q4.csv.zip not found
http://www.nber.org/fda/faers/2016/drug2016q4.csv.zip not found
http://www.nber.org/fda/faers/2016/reac2016q4.csv.zip not found
http://www.nber.org/fda/faers/2016/outc2016q4.csv.zip not found
http://www.nber.org/fda/faers/2016/indi2016q4.csv.zip not found

From the error messages shown above, in the time of writing this post, we see that the database has adverse events data up to the third quarter of 2015.

  • list.files() produce a character vector of the names of files in my working directory
  • I am using regular expressions to select each categoy of dateset. Example, ^demo.*.csv means all files that start with the word demo and end with .csv.
In [18]:
filenames <- list.files(pattern="^demo.*.csv", full.names=TRUE)

cat('We have downloaded the following quarterly demography datasets')
filenames
We have downloaded the following quarterly demography datasets
Out[18]:
  1. "./demo2012q1.csv"
  2. "./demo2012q2.csv"
  3. "./demo2012q3.csv"
  4. "./demo2012q4.csv"
  5. "./demo2013q1.csv"
  6. "./demo2013q2.csv"
  7. "./demo2013q3.csv"
  8. "./demo2013q4.csv"
  9. "./demo2014q1.csv"
  10. "./demo2014q2.csv"
  11. "./demo2014q3.csv"
  12. "./demo2014q4.csv"
  13. "./demo2015q1.csv"
  14. "./demo2015q2.csv"
  15. "./demo2015q3.csv"
  • Now let's use fread function from the data.table Package and let's suppress any warnings to make our page pretty.
In [6]:
options(warn=-1) # Disable any warnings for this session
  • In the line below, we are using fread to read all demography files uing lapply function.
In [ ]:
demo=lapply(filenames,fread)

Now, let's change them to data frames and concatenate them to create one demography data

In [8]:
demo_all=do.call(rbind,lapply(1:length(demo), function(i) select(as.data.frame(demo[i]),primaryid,caseid, age,age_cod,event_dt,sex,reporter_country)))
In [9]:
dim(demo_all)
Out[9]:
  1. 3554979
  2. 7
  • We see that our demography data has more than 3.5 million rows.

Let's see the variable names:

In [10]:
names(demo_all)
Out[10]:
  1. "primaryid"
  2. "caseid"
  3. "age"
  4. "age_cod"
  5. "event_dt"
  6. "sex"
  7. "reporter_country"
Now, lets' merge all drug files
In [21]:
filenames <- list.files(pattern="^drug.*.csv", full.names=TRUE)

cat('We have downloaded the following quarterly drug datasets:\n')
filenames

drug=lapply(filenames,fread)

cat('\n')
cat('Variable names:\n')
names(drug[[1]])

drug_all=do.call(rbind,lapply(1:length(drug), function(i) select(as.data.frame(drug[i]),primaryid,caseid, drug_seq,drugname,route)))
We have downloaded the following quarterly drug datasets:
Out[21]:
  1. "./drug2012q1.csv"
  2. "./drug2012q2.csv"
  3. "./drug2012q3.csv"
  4. "./drug2012q4.csv"
  5. "./drug2013q1.csv"
  6. "./drug2013q2.csv"
  7. "./drug2013q3.csv"
  8. "./drug2013q4.csv"
  9. "./drug2014q1.csv"
  10. "./drug2014q2.csv"
  11. "./drug2014q3.csv"
  12. "./drug2014q4.csv"
  13. "./drug2015q1.csv"
  14. "./drug2015q2.csv"
  15. "./drug2015q3.csv"

Variable names:
Out[21]:
  1. "primaryid"
  2. "drug_seq"
  3. "role_cod"
  4. "drugname"
  5. "val_vbm"
  6. "route"
  7. "dose_vbm"
  8. "dechal"
  9. "rechal"
  10. "lot_num"
  11. "exp_dt"
  12. "exp_dt_num"
  13. "nda_num"
Merge all diagnoses/indications datasets
In [22]:
filenames <- list.files(pattern="^indi.*.csv", full.names=TRUE)


cat('We have downloaded the following quarterly diagnoses/indications datasets:\n')

filenames

indi=lapply(filenames,fread)

cat('\n')
cat('Variable names:\n')

names(indi[[15]])

indi_all=do.call(rbind,lapply(1:length(indi), function(i) select(as.data.frame(indi[i]),primaryid,caseid, indi_drug_seq,indi_pt)))
We have downloaded the following quarterly diagnoses/indications datasets:
Out[22]:
  1. "./indi2012q1.csv"
  2. "./indi2012q2.csv"
  3. "./indi2012q3.csv"
  4. "./indi2012q4.csv"
  5. "./indi2013q1.csv"
  6. "./indi2013q2.csv"
  7. "./indi2013q3.csv"
  8. "./indi2013q4.csv"
  9. "./indi2014q1.csv"
  10. "./indi2014q2.csv"
  11. "./indi2014q3.csv"
  12. "./indi2014q4.csv"
  13. "./indi2015q1.csv"
  14. "./indi2015q2.csv"
  15. "./indi2015q3.csv"
Variable names:
Out[22]:
  1. "primaryid"
  2. "caseid"
  3. "indi_drug_seq"
  4. "indi_pt"
Merge patient outcomes
In [137]:
filenames <- list.files(pattern="^outc.*.csv", full.names=TRUE)


cat('We have downloaded the following quarterly patient outcome datasets:\n')


filenames

outc_all=lapply(filenames,fread)


cat('\n')
cat('Variable names\n')

names(outc_all[[1]])

names(outc_all[[4]])

colnames(outc_all[[4]])=c("primaryid", "caseid", "outc_cod")
outc_all=do.call(rbind,lapply(1:length(outc_all), function(i) select(as.data.frame(outc_all[i]),primaryid,outc_cod)))
We have downloaded the following quarterly patient outcome datasets:
Out[137]:
  1. "./outc2012q1.csv"
  2. "./outc2012q2.csv"
  3. "./outc2012q3.csv"
  4. "./outc2012q4.csv"
  5. "./outc2013q1.csv"
  6. "./outc2013q2.csv"
  7. "./outc2013q3.csv"
  8. "./outc2013q4.csv"
  9. "./outc2014q1.csv"
  10. "./outc2014q2.csv"
  11. "./outc2014q3.csv"
  12. "./outc2014q4.csv"
  13. "./outc2015q1.csv"
  14. "./outc2015q2.csv"
  15. "./outc2015q3.csv"
Variable names
Out[137]:
  1. "primaryid"
  2. "outc_cod"
Out[137]:
  1. "primaryid"
  2. "caseid"
  3. "outc_code"
Finally, merge reaction (adverse event) datasets
In [142]:
filenames <- list.files(pattern="^reac.*.csv", full.names=TRUE)


cat('We have downloaded the following quarterly reaction (adverse event)  datasets:\n')


filenames

reac=lapply(filenames,fread)

cat('\n')
cat('Variable names:\n')
names(reac[[3]])

reac_all=do.call(rbind,lapply(1:length(indi), function(i) select(as.data.frame(reac[i]),primaryid,pt)))
We have downloaded the following quarterly reaction (adverse event)  datasets:
Out[142]:
  1. "./reac2012q1.csv"
  2. "./reac2012q2.csv"
  3. "./reac2012q3.csv"
  4. "./reac2012q4.csv"
  5. "./reac2013q1.csv"
  6. "./reac2013q2.csv"
  7. "./reac2013q3.csv"
  8. "./reac2013q4.csv"
  9. "./reac2014q1.csv"
  10. "./reac2014q2.csv"
  11. "./reac2014q3.csv"
  12. "./reac2014q4.csv"
  13. "./reac2015q1.csv"
  14. "./reac2015q2.csv"
  15. "./reac2015q3.csv"
Variable names:
Out[142]:
  1. "primaryid"
  2. "pt"

Let's see number of rows from each data type

In [25]:
all=as.data.frame(list(Demography=nrow(demo_all),Drug=nrow(drug_all),
                   Indications=nrow(indi_all),Outcomes=nrow(outc_all),
                   Reactions=nrow(reac_all)))

row.names(all)='Number of rows'

all
Out[25]:
DemographyDrugIndicationsOutcomesReactions
Number of rows3554979116741797232323265839510962367


SQL queries

Keep in mind that sqldf uses SQLite

COUNT

In [50]:
#SQL

sqldf("SELECT COUNT(primaryid)as 'Number of rows of Demography data'
FROM demo_all;")
Out[50]:
Number of rows of Demography data
13554979
In [51]:
# R

nrow(demo_all)
Out[51]:
3554979


LIMIT

In [52]:
#  SQL
sqldf("SELECT *
FROM demo_all 
LIMIT 6;")
Out[52]:
primaryidcaseidageage_codevent_dtsexreporter_country
17805388858679284YR20100622FUNITED STATES
27810716853141365YR20110101FUNITED STATES
37811037819103080YR20110101MBRAZIL
47817874817073968YR20101002MFRANCE
5782235785333424YRNAMUNITED KINGDOM
67831542853695484YRNAFBRAZIL
In [53]:
#R

head(demo_all,6)
Out[53]:
primaryidcaseidageage_codevent_dtsexreporter_country
17805388858679284YR20100622FUNITED STATES
27810716853141365YR20110101FUNITED STATES
37811037819103080YR20110101MBRAZIL
47817874817073968YR20101002MFRANCE
5782235785333424YRNAMUNITED KINGDOM
67831542853695484YRNAFBRAZIL
In [45]:
R1=head(demo_all,6)

SQL1 =sqldf("SELECT *
FROM demo_all 
LIMIT 6;")

all.equal(R1,SQL1)
Out[45]:
TRUE


Where

In [68]:
SQL2=sqldf("SELECT *
FROM demo_all WHERE sex ='F';")

R2 = filter(demo_all, sex=="F")


identical(SQL2, R2)
Out[68]:
TRUE


In [96]:
SQL3=sqldf("SELECT *
FROM demo_all WHERE age BETWEEN 20 AND 25;")

R3 = filter(demo_all, age >= 20 & age <= 25)

identical(SQL3, R3)
Out[96]:
TRUE


GROUP and ORDER BY

In [102]:
#SQL

sqldf("SELECT sex, COUNT(primaryid) as Total
FROM demo_all
WHERE sex IN ('F','M','NS','UNK')
GROUP BY sex
ORDER BY Total DESC ;")
Out[102]:
sexTotal
1F1993938
2M1229345
3NS107342
4UNK1331
In [104]:
# R

demo_all%>%filter(sex %in%c('F','M','NS','UNK'))%>%group_by(sex) %>%
        summarise(Total = n())%>%arrange(desc(Total))
Out[104]:
sexTotal
1F1993938
2M1229345
3NS107342
4UNK1331
In [48]:
SQL3 = sqldf("SELECT sex, COUNT(primaryid) as Total
FROM demo_all
GROUP BY sex
ORDER BY Total DESC ;")

R3 = demo_all%>%group_by(sex) %>%
        summarise(Total = n())%>%arrange(desc(Total))

compare(SQL3,R3, allowAll=TRUE)
Out[48]:
TRUE
  dropped attributes
In [146]:
SQL=sqldf("SELECT sex, COUNT(primaryid) as Total
FROM demo_all
WHERE sex IN ('F','M','NS','UNK')
GROUP BY sex
ORDER BY Total DESC ;")
In [147]:
SQL$Total=as.numeric(SQL$Total)
In [153]:
pie3D(SQL$Total, labels = SQL$sex,explode=0.1,col=rainbow(4),
   main="Pie Chart of adverse event reports by gender",cex.lab=0.5, cex.axis=0.5, cex.main=1,labelcex=1)


Inner Join
Let's join drug data and indication data based on primary id and drug sequence

First, let's check the variable names to see how to merge the two datasets.

In [58]:
names(indi_all)
names(drug_all)
Out[58]:
  1. "primaryid"
  2. "indi_drug_seq"
  3. "indi_pt"
Out[58]:
  1. "primaryid"
  2. "drug_seq"
  3. "drugname"
  4. "route"
In [60]:
names(indi_all)=c("primaryid", "drug_seq", "indi_pt" ) # so as to have the same name (drug_seq)
Inner join
In [64]:
R4= merge(drug_all,indi_all, by = intersect(names(drug_all), names(indi_all)))

R4=arrange(R3, primaryid,drug_seq,drugname,indi_pt)


SQL4= sqldf("SELECT d.primaryid as primaryid, d.drug_seq as drug_seq, d.drugname as drugname,
                       d.route as route,i.indi_pt as indi_pt
                       FROM drug_all d
                       INNER JOIN indi_all i
                      ON d.primaryid= i.primaryid AND d.drug_seq=i.drug_seq
                      ORDER BY primaryid,drug_seq,drugname, i.indi_pt")


compare(R4,SQL4,allowAll=TRUE)
Out[64]:
TRUE
  sorted
  renamed rows
  dropped row names
In [139]:
R5 = merge(reac_all,outc_all,by=intersect(names(reac_all), names(outc_all)))



SQL5 =reac_outc_new4=sqldf("SELECT r.*, o.outc_cod as outc_cod
                     FROM reac_all r 
                     INNER JOIN outc_all o
                     ON r.primaryid=o.primaryid
                     ORDER BY r.primaryid,r.pt,o.outc_cod")


compare(R5,SQL5,allowAll = TRUE)  #TRUE
Out[139]:
TRUE
  sorted
  renamed rows
  dropped row names



In [187]:
ggplot(sqldf('SELECT age, sex
             FROM demo_all
             WHERE age between 0 AND 100 AND sex IN ("F","M")
             LIMIT 10000;'), aes(x=age, fill = sex))+ geom_density(alpha = 0.6)
In [167]:
ggplot(sqldf("SELECT d.age as age, o.outc_cod as outcome
                     FROM demo_all d
                     INNER JOIN outc_all o
                     ON d.primaryid=o.primaryid
                     WHERE d.age BETWEEN 20 AND 100
                     LIMIT 20000;"),aes(x=age, fill = outcome))+ geom_density(alpha = 0.6)
In [190]:
ggplot(sqldf("SELECT de.sex as sex, dr.route as route
                     FROM demo_all de
                     INNER JOIN drug_all dr
                     ON de.primaryid=dr.primaryid
                     WHERE de.sex IN ('M','F') AND dr.route IN ('ORAL','INTRAVENOUS','TOPICAL')
                     LIMIT 200000;"),aes(x=route, fill = sex))+ geom_bar(alpha=0.6)
In [205]:
ggplot(sqldf("SELECT d.sex as sex, o.outc_cod as outcome
                     FROM demo_all d
                     INNER JOIN outc_all o
                     ON d.primaryid=o.primaryid
                     WHERE d.age BETWEEN 20 AND 100 AND sex IN ('F','M')
                     LIMIT 20000;"),aes(x=outcome,fill=sex))+ geom_bar(alpha = 0.6)


In [193]:
demo1= demo_all[1:20000,]
demo2=demo_all[20001:40000,]

UNION ALL

In [196]:
R6 <- rbind(demo1, demo2)
SQL6 <- sqldf("SELECT  * FROM demo1 UNION ALL SELECT * FROM demo2;")
compare(R6,SQL6, allowAll = TRUE)
Out[196]:
TRUE

INTERSECT

In [200]:
R7 <- semi_join(demo1, demo2)
SQL7 <- sqldf("SELECT  * FROM demo1 INTERSECT SELECT * FROM demo2;")
compare(R7,SQL7, allowAll = TRUE)
Out[200]:
TRUE


EXCEPT

In [201]:
R8 <- anti_join(demo1, demo2)
SQL8 <- sqldf("SELECT  * FROM demo1 EXCEPT SELECT * FROM demo2;")
compare(R8,SQL8, allowAll = TRUE)
Out[201]:
TRUE
  
comments powered by Disqus