443-970-2353
[email protected]
CV Resume
JavaScript Object Notation (JSON) is the most common data format used for asynchronous browser/server communication and knowing how to work with JSON data is important as we get various datasets out there in this format. In this blog post, we will see how to analyse JSON data with R. In the next blog post, we will perform the same tasks using Python. The data we will work with is drug labeling data from the Food and Drug Administration.
library(jsonlite) # flexible, robust, high performance tools for working with JSON in R
library(dplyr) # A fast, consistent tool for working with data frame like objects
library(stringi) # Character String Processing Facilities
library(ggplot2) # An Implementation of the Grammar of Graphics
library(downloader) # Download Files over HTTP and HTTPS
library(lubridate) # To work with date-times
Let's download the prescription and over-the-counter (OTC) drug labeling data from the openFDA website. The openFDA drug product labeling API provides data for prescription and over-the-counter (OTC) drug labeling.
There are five data files and we will use a for loop to download all of them. Then, we will concatenate them.
for(i in 1:5){
url=paste0("http://download.open.fda.gov/drug/label/drug-label-000",i,"-of-0005.json.zip")
download(url,dest="json.zip")
unzip ("json.zip")
}
Let's see if we have downloaded all the data files.
dir()
Let's create a character vector of the names of the json files above using list.files().
filenames <- list.files(pattern="*.json", full.names=TRUE)
filenames
To read JSON data, we are using the fromJSON function from the jsonlite package.
Let's read in one of the data files and understand the content.
drug1 = fromJSON(filenames[1])
names(drug1)
See what meta contains
drug1$meta
As we can see above, meta is metadata about the data, including a disclaimer, link to data license, last-updated date, and total matching records
Next, let's see what results contains
results=drug1$results
class(results)
glimpse(results)
As we can see above, results has 129 variables. The other important point is the variable openfda is a data frame and most other variables are lists.
Let's see the variables of the data frame openfda.
glimpse(results$openfda)
It has 21 variables and all of them are lists.
Now, let's select certain variables from each data file and merge them and see the number of submissions across time.
As a starting point, I will select the variables "effective_time" and "boxed_warning" to help me see the trend of submissions with time and how many of them contain boxed warning.
spl = fromJSON(filenames[1])$results # read drug-label-0001-of-0005.json
spl=select(spl,effective_time,boxed_warning) # select effective_time and boxed_warning
for(i in 2:length(filenames)){ # read all the rest
tmp =fromJSON(filenames[i])$results
spl=rbind(spl, select(tmp,effective_time,boxed_warning)) # concatenate all
}
glimpse(spl)
We see that the variable effective_time is character but boxed_warning is list. So, we will use unlist() to change the list to character.
But first, let's understand effective_time.
cat("Maximum number of characters in effective_time:")
max(nchar(spl$effective_time))
cat("Minimum number of characters in effective_time:")
min(nchar(spl$effective_time))
spl%>%group_by(nchar(effective_time))%>%summarize(count=n())
As shown above the variable effective_time has different number of characters. This suggests that the format is not consistent. Let's first see those that have 10 characters.
tmp=select(spl,effective_time)
filter(tmp, nchar(effective_time)==10)
What about those that contain nine characters?
filter(tmp, nchar(effective_time)==9)
Finally, let's have a look at entries with eight characters.
head(filter(tmp, nchar(effective_time)==8))
One option is to use yyyy/mm/dd format and drop the additional information. This will help us to have a consistent date format and we can do analysis on daily, monthly or annual basis. The function stri_sub from the stringi package helps us to extact fixed number of characters. So, let's extact the first eight characters only from the variable effective_time.
spl=mutate(spl,effective_time=stri_sub(effective_time,1,8))
Now, let's check that all have eight characters only.
spl%>%group_by(nchar(effective_time))%>%summarize(count=n())
Now, since all have eight characters in yyyymmdd format, we can use the ymd function from the lubridate package to change the values from character to date.
spl=mutate(spl,effective_time =ymd(effective_time))
glimpse(spl) # checking if effective_time has been changed to date format.
spl%>%group_by(effective_time)%>%summarize(Total=n())%>%
ggplot(aes(x=effective_time,y=Total))+geom_line()
There are very small number of reports prior to 2009, so let's exclude them from our figure.
spl%>%filter(effective_time>ymd("20081231"))%>%
group_by(effective_time)%>%summarize(Total=n())%>%
ggplot(aes(x=effective_time,y=Total))+geom_line()
This figure is not that helpful, so let's see number of reports per year rather than per day.
spl=mutate(spl, year = format(effective_time, "%Y"))
glimpse(spl)
Since there is only one entry for 2017, we will consider up to the end of 2016.
spl%>%filter(effective_time>ymd("20081231") & effective_time < ymd("20170101"))%>%group_by(year)%>%summarize(Total=n())%>%
ggplot(aes(x=year,y=Total, fill=Total))+geom_bar(stat="identity",color='white')+
ggtitle('Annual drug labeling submissions')+xlab('')+ylab('Total number of submission')
Now, let's see boxed warning, which is the strongest warning that the FDA requires, and signifies that medical studies indicate that the drug carries a significant risk of serious or even life-threatening adverse effects.
select(spl,boxed_warning)[25:30,]
When we change lists to characters using unlist, NULL is dropped, so let's change NULL to NA before unlisting
spl$boxed_warning[spl$boxed_warning=="NULL"] =NA
glimpse(spl) # check that NULL is changed to NA
Now, let's see number of boxed warnings per years. I will remove 2017 as it has only one entry.
spl%>%filter(effective_time>ymd("20081231") & effective_time < ymd("20170101"))%>%
group_by(year)%>%summarize(total_boxed_warning=sum(!is.na(boxed_warning)))%>%
ggplot(aes(x=year,y=total_boxed_warning))+geom_bar(stat="identity",fill='light green',color='dark blue')+
ggtitle('Annual drug labeling submissions \n with boxed warning ')+xlab('')+
ylab('Total number of submission')
head(unique(spl$boxed_warning),3)
We can use unlist to change list to character. We are using the function paste0 to concatenate the lists for each row, with the elements being separated by the value of collapse.
for(i in 1:nrow(spl)){
spl$boxed_warning[i]=paste0(unlist(spl$boxed_warning[i]),collapse=",")
}
spl$boxed_warning=unlist(spl$boxed_warning)
glimpse(spl) # check if boxed_warning has been changed to character
As we have seen above, the variable openfda is a data frame and contains important variables. So, let's extract some variables and play with them.
spl_openfda= fromJSON(filenames[1])$results$openfda
spl_openfda=select(spl_openfda,manufacturer_name,product_type,route,generic_name,brand_name,substance_name)
for(i in 2:length(filenames)){
tmp = fromJSON(filenames[i])$results$openfda
tmp = select(tmp,manufacturer_name,product_type,route,generic_name,brand_name,substance_name)
spl_openfda=rbind(spl_openfda, tmp)
}
glimpse(spl_openfda) # we have selected 6 variables and we will unlist the variables which are lists.
"NULL"%in%(spl_openfda$product_type)
To unlist, NULL must be changed to NA.
spl_openfda$product_type[spl_openfda$product_type=="NULL"] =NA
unique(spl_openfda$product_type)
spl_openfda = mutate(spl_openfda,product_type=unlist(product_type))
Now, let's unlist route.
"NULL"%in%(spl_openfda$route)
head(unique(spl_openfda$route),10)
spl_openfda$route[spl_openfda$route=="NULL"] =NA
for(i in 1:nrow(spl_openfda)){
spl_openfda$route[i]=paste0(unlist(spl_openfda$route[i]),collapse=",")
}
spl_openfda = mutate(spl_openfda, route=unlist(route))
head(unique(spl_openfda$route),10)
glimpse(spl_openfda)
Let's see count of submissions of the most frequently reported ones by product type and by route.
spl_openfda%>%group_by(product_type,route)%>%summarize(Total=n())%>%arrange(desc(Total))%>%slice(1:10)
Change NA to "unknown" as that is more convinient to work with in ggplot2.
spl_openfda$route[is.na(spl_openfda$route)] <- "Unknown"
spl_openfda$product_type[is.na(spl_openfda$product_type)] <- "Unknown"
Now, let's see number of reports by product type.
spl_openfda%>%group_by(product_type)%>%summarize(Total=n())%>%
ggplot(aes(x=product_type,y=Total))+geom_bar(stat='identity',alpha=0.5)+
xlab('')+ylab('Total number of submission')
As we can see from the figure above, most reports are on over-the-counter (OTC) drug labeling.
We can also see most frequent routes.
spl_openfda%>%group_by(route)%>%summarize(Total=n())%>%arrange(desc(Total))%>%slice(1:20)%>%
mutate(route = factor(route,levels = route[order(Total,decreasing =F)]))%>%
ggplot(aes(x=route,y=Total))+geom_bar(stat='identity',color='skyblue',fill='#b35900')+coord_flip()+
xlab("")+ggtitle("Number of reports by route")+ylab("")
So, most are administered orally followed by topical.
Now, let's change the variables manufacturer_name, generic_name, brand_name and substance_name to character using unlist.
for(i in 1:nrow(spl_openfda)){
spl_openfda$generic_name[i]=paste0(unlist(spl_openfda$generic_name[i]),collapse=",")
spl_openfda$brand_name[i]=paste0(unlist(spl_openfda$brand_name[i]),collapse=",")
spl_openfda$substance_name[i]=paste0(unlist(spl_openfda$substance_name[i]),collapse=",")
spl_openfda$manufacturer_name[i]=paste0(unlist(spl_openfda$manufacturer_name[i]),collapse=",")
}
spl_openfda = mutate(spl_openfda, generic_name=unlist(generic_name),
brand_name=unlist(brand_name),
substance_name = unlist(substance_name),
manufacturer_name =unlist(manufacturer_name))
glimpse(spl_openfda)
All lists have been changed to characters. Now, let's merge the two data frames to use the time dimension.
spl_final = cbind(spl,spl_openfda)
glimpse(spl_final)
Now, let's see number of reports per year by product type.
spl_final%>%filter(effective_time>ymd("20081231") & effective_time < ymd("20170101"))%>%group_by(year,product_type)%>%summarize(Total=n())%>%
ggplot(aes(x=year,y=Total, fill=product_type))+geom_bar(stat="identity",color='white')+
ggtitle('Annual drug labeling submissions\n by Product type')+xlab('')+ylab('Total number of submission')
In every year, which manufacturers have the highest number of submissions?
spl_final%>%filter(effective_time>ymd("20081231")& effective_time < ymd("20170101")
,nchar(manufacturer_name)>0)%>%
group_by(year, manufacturer_name)%>%summarize(Total=n())%>%top_n(1,wt=Total)
What are the most common brand names in every year?
spl_final%>%filter(effective_time>ymd("20081231") & effective_time < ymd("20170101") ,
nchar(brand_name)>0)%>%
group_by(year, brand_name)%>%summarize(Total=n())%>%top_n(1,wt=Total)
In this blog post, we downloaded JSON data from the openFDA drug product labeling API, which provides data for prescription and over-the-counter (OTC) drug labeling. Then, we analyzed the data using various R packaged after understanding the contents of the data. In the next post, we will see how to perform similar tasks using Python.