Fisseha Berhane, PhD

Data Scientist

443-970-2353 [email protected] CV Resume Linkedin GitHub twitter twitter

JSON data manipulation: the R way vs the Python way



JavaScript Object Notation (JSON) is the most common data format used for asynchronous browser/server communication and knowing how to work with JSON data is important as we get various datasets out there in this format. In this blog post, we will see how to analyse JSON data with R. In the next blog post, we will perform the same tasks using Python. The data we will work with is drug labeling data from the Food and Drug Administration.

Load required libraries

In [319]:
library(jsonlite)     # flexible, robust, high performance tools for working with JSON in R
library(dplyr)        # A fast, consistent tool for working with data frame like objects
library(stringi)      # Character String Processing Facilities
library(ggplot2)      # An Implementation of the Grammar of Graphics
library(downloader)   # Download Files over HTTP and HTTPS
library(lubridate)    # To work with date-times

Downloading JSON data

Let's download the prescription and over-the-counter (OTC) drug labeling data from the openFDA website. The openFDA drug product labeling API provides data for prescription and over-the-counter (OTC) drug labeling.

There are five data files and we will use a for loop to download all of them. Then, we will concatenate them.

In [ ]:
for(i in 1:5){
    url=paste0("http://download.open.fda.gov/drug/label/drug-label-000",i,"-of-0005.json.zip")
    download(url,dest="json.zip")
    unzip ("json.zip")
}

Let's see if we have downloaded all the data files.

In [321]:
dir()
  1. "drug-label-0001-of-0005.json"
  2. "drug-label-0002-of-0005.json"
  3. "drug-label-0003-of-0005.json"
  4. "drug-label-0004-of-0005.json"
  5. "drug-label-0005-of-0005.json"
  6. "json.zip"

Let's create a character vector of the names of the json files above using list.files().

In [322]:
filenames <- list.files(pattern="*.json", full.names=TRUE)
filenames
  1. "./drug-label-0001-of-0005.json"
  2. "./drug-label-0002-of-0005.json"
  3. "./drug-label-0003-of-0005.json"
  4. "./drug-label-0004-of-0005.json"
  5. "./drug-label-0005-of-0005.json"

Understanding the data

To read JSON data, we are using the fromJSON function from the jsonlite package.

Let's read in one of the data files and understand the content.

In [323]:
drug1 = fromJSON(filenames[1])
In [324]:
names(drug1)
  1. "meta"
  2. "results"

See what meta contains

In [325]:
drug1$meta
$last_updated
"2016-08-11"
$terms
"https://open.fda.gov/terms/"
$results
$skip
0
$total
89207
$limit
20000
$license
"https://open.fda.gov/license/"
$disclaimer
"Do not rely on openFDA to make decisions regarding medical care. While we make every effort to ensure that data is accurate, you should assume all results are unvalidated. We may limit or otherwise restrict your access to the API in line with our Terms of Service."

As we can see above, meta is metadata about the data, including a disclaimer, link to data license, last-updated date, and total matching records

Next, let's see what results contains

In [310]:
results=drug1$results
class(results)
"data.frame"
In [311]:
glimpse(results)
Observations: 20,000
Variables: 129
$ laboratory_tests                                                 <list> [...
$ package_label_principal_display_panel                            <list> [...
$ carcinogenesis_and_mutagenesis_and_impairment_of_fertility       <list> [...
$ references                                                       <list> [...
$ precautions                                                      <list> [...
$ drug_interactions                                                <list> [...
$ id                                                               <chr> "4...
$ indications_and_usage                                            <list> [...
$ pediatric_use                                                    <list> [...
$ spl_unclassified_section                                         <list> [...
$ contraindications                                                <list> [...
$ how_supplied_table                                               <list> [...
$ dosage_and_administration                                        <list> [...
$ openfda                                                          <data.frame> ...
$ labor_and_delivery                                               <list> [...
$ version                                                          <chr> "1...
$ adverse_reactions                                                <list> [...
$ information_for_patients                                         <list> [...
$ description_table                                                <list> [...
$ description                                                      <list> [...
$ warnings                                                         <list> [...
$ clinical_pharmacology_table                                      <list> [...
$ effective_time                                                   <chr> "2...
$ spl_product_data_elements                                        <list> [...
$ clinical_pharmacology                                            <list> [...
$ drug_and_or_laboratory_test_interactions                         <list> [...
$ overdosage                                                       <list> [...
$ set_id                                                           <chr> "4...
$ how_supplied                                                     <list> [...
$ nursing_mothers                                                  <list> [...
$ animal_pharmacology_and_or_toxicology                            <list> [...
$ pregnancy                                                        <list> [...
$ pharmacokinetics                                                 <list> [...
$ pharmacokinetics_table                                           <list> [...
$ dosage_forms_and_strengths                                       <list> [...
$ clinical_studies_table                                           <list> [...
$ drug_interactions_table                                          <list> [...
$ recent_major_changes                                             <list> [...
$ mechanism_of_action                                              <list> [...
$ geriatric_use                                                    <list> [...
$ clinical_studies                                                 <list> [...
$ use_in_specific_populations                                      <list> [...
$ nonclinical_toxicology                                           <list> [...
$ warnings_and_cautions_table                                      <list> [...
$ adverse_reactions_table                                          <list> [...
$ pharmacodynamics                                                 <list> [...
$ warnings_and_cautions                                            <list> [...
$ active_ingredient                                                <list> [...
$ storage_and_handling                                             <list> [...
$ inactive_ingredient                                              <list> [...
$ do_not_use                                                       <list> [...
$ keep_out_of_reach_of_children                                    <list> [...
$ purpose                                                          <list> [...
$ questions                                                        <list> [...
$ pregnancy_or_breast_feeding                                      <list> [...
$ stop_use                                                         <list> [...
$ when_using                                                       <list> [...
$ ask_doctor_or_pharmacist                                         <list> [...
$ ask_doctor                                                       <list> [...
$ other_safety_information                                         <list> [...
$ spl_medguide_table                                               <list> [...
$ warnings_table                                                   <list> [...
$ controlled_substance                                             <list> [...
$ drug_abuse_and_dependence                                        <list> [...
$ spl_medguide                                                     <list> [...
$ general_precautions                                              <list> [...
$ boxed_warning                                                    <list> [...
$ dependence                                                       <list> [...
$ dosage_and_administration_table                                  <list> [...
$ spl_unclassified_section_table                                   <list> [...
$ overdosage_table                                                 <list> [...
$ teratogenic_effects                                              <list> [...
$ precautions_table                                                <list> [...
$ spl_patient_package_insert                                       <list> [...
$ recent_major_changes_table                                       <list> [...
$ nonteratogenic_effects                                           <list> [...
$ microbiology                                                     <list> [...
$ pharmacodynamics_table                                           <list> [...
$ indications_and_usage_table                                      <list> [...
$ spl_patient_package_insert_table                                 <list> [...
$ abuse                                                            <list> [...
$ purpose_table                                                    <list> [...
$ active_ingredient_table                                          <list> [...
$ package_label_principal_display_panel_table                      <list> [...
$ dosage_forms_and_strengths_table                                 <list> [...
$ patient_medication_information_table                             <list> [...
$ patient_medication_information                                   <list> [...
$ safe_handling_warning                                            <list> [...
$ information_for_patients_table                                   <list> [...
$ user_safety_warnings                                             <list> [...
$ instructions_for_use                                             <list> [...
$ guaranteed_analysis_of_feed                                      <list> [...
$ pediatric_use_table                                              <list> [...
$ questions_table                                                  <list> [...
$ use_in_specific_populations_table                                <list> [...
$ general_precautions_table                                        <list> [...
$ microbiology_table                                               <list> [...
$ contraindications_table                                          <list> [...
$ inactive_ingredient_table                                        <list> [...
$ geriatric_use_table                                              <list> [...
$ information_for_owners_or_caregivers                             <list> [...
$ nonclinical_toxicology_table                                     <list> [...
$ carcinogenesis_and_mutagenesis_and_impairment_of_fertility_table <list> [...
$ references_table                                                 <list> [...
$ storage_and_handling_table                                       <list> [...
$ stop_use_table                                                   <list> [...
$ instructions_for_use_table                                       <list> [...
$ teratogenic_effects_table                                        <list> [...
$ keep_out_of_reach_of_children_table                              <list> [...
$ boxed_warning_table                                              <list> [...
$ accessories                                                      <list> [...
$ troubleshooting                                                  <list> [...
$ health_care_provider_letter_table                                <list> [...
$ health_care_provider_letter                                      <list> [...
$ dependence_table                                                 <list> [...
$ pregnancy_table                                                  <list> [...
$ nonteratogenic_effects_table                                     <list> [...
$ mechanism_of_action_table                                        <list> [...
$ intended_use_of_the_device                                       <list> [...
$ animal_pharmacology_and_or_toxicology_table                      <list> [...
$ environmental_warning                                            <list> [...
$ veterinary_indications                                           <list> [...
$ laboratory_tests_table                                           <list> [...
$ drug_and_or_laboratory_test_interactions_table                   <list> [...
$ ask_doctor_or_pharmacist_table                                   <list> [...
$ cleaning                                                         <list> [...
$ components                                                       <list> [...
$ diagram_of_device                                                <list> [...
$ statement_of_identity                                            <list> [...

As we can see above, results has 129 variables. The other important point is the variable openfda is a data frame and most other variables are lists.

Let's see the variables of the data frame openfda.

In [312]:
glimpse(results$openfda)
Observations: 20,000
Variables: 21
$ manufacturer_name             <list> ["Cardinal Health", "Carilion Materi...
$ unii                          <list> ["N12000U13O", "A0JWA85V8F", NULL, N...
$ product_type                  <list> ["HUMAN PRESCRIPTION DRUG", "HUMAN P...
$ rxcui                         <list> ["1649990", "617310", NULL, NULL, <"...
$ spl_set_id                    <list> ["4958c12f-c6b3-4e79-8669-2fe2eed0f6...
$ route                         <list> ["ORAL", "ORAL", "ORAL", "TOPICAL", ...
$ generic_name                  <list> ["DOXYCYCLINE", "ATORVASTATIN CALCIU...
$ upc                           <list> ["0055154586801", NULL, "03645780602...
$ pharm_class_cs                <list> ["Tetracyclines [Chemical/Ingredient...
$ brand_name                    <list> ["Doxycycline", "ATORVASTATIN CALCIU...
$ product_ndc                   <list> ["55154-5868", "68151-1013", "64578-...
$ original_packager_product_ndc <list> ["49884-727", "60505-2579", NULL, NU...
$ substance_name                <list> ["DOXYCYCLINE", "ATORVASTATIN CALCIU...
$ spl_id                        <list> ["4958c12f-c6b3-4e79-8669-2fe2eed0f6...
$ application_number            <list> ["ANDA065055", "ANDA090548", NULL, "...
$ nui                           <list> [<"N0000007948", "N0000175882">, <"N...
$ pharm_class_epc               <list> ["Tetracycline-class Drug [EPC]", "H...
$ package_ndc                   <list> ["55154-5868-0", "68151-1013-9", "64...
$ pharm_class_moa               <list> [NULL, "Hydroxymethylglutaryl-CoA Re...
$ is_original_packager          <list> [NULL, NULL, TRUE, TRUE, TRUE, TRUE,...
$ pharm_class_pe                <list> [NULL, NULL, NULL, NULL, NULL, NULL,...

It has 21 variables and all of them are lists.

Merging all the data files

Now, let's select certain variables from each data file and merge them and see the number of submissions across time.

As a starting point, I will select the variables "effective_time" and "boxed_warning" to help me see the trend of submissions with time and how many of them contain boxed warning.

In [327]:
spl = fromJSON(filenames[1])$results                  # read drug-label-0001-of-0005.json
spl=select(spl,effective_time,boxed_warning)          # select effective_time and boxed_warning

  for(i in 2:length(filenames)){                      # read all the rest
      
      tmp =fromJSON(filenames[i])$results
      
   spl=rbind(spl, select(tmp,effective_time,boxed_warning))  # concatenate all
}
In [329]:
glimpse(spl)
Observations: 89,207
Variables: 2
$ effective_time <chr> "20140826", "20130620", "20140825", "20140903", "201...
$ boxed_warning  <list> [NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NU...

We see that the variable effective_time is character but boxed_warning is list. So, we will use unlist() to change the list to character.

But first, let's understand effective_time.

In [330]:
cat("Maximum number of characters in effective_time:")
max(nchar(spl$effective_time))
Maximum number of characters in effective_time:
10
In [331]:
cat("Minimum number of characters in effective_time:")
min(nchar(spl$effective_time))
Minimum number of characters in effective_time:
8
In [332]:
spl%>%group_by(nchar(effective_time))%>%summarize(count=n())
nchar(effective_time)count
1 889196
299
310 2

As shown above the variable effective_time has different number of characters. This suggests that the format is not consistent. Let's first see those that have 10 characters.

In [333]:
tmp=select(spl,effective_time)
filter(tmp, nchar(effective_time)==10)
effective_time
12010110126
22009103009

What about those that contain nine characters?

In [334]:
filter(tmp, nchar(effective_time)==9)
effective_time
1201011118
2200911115
3201108178
4201101113
5201101113
6201004119
7200912131
8201010112
9201001018

Finally, let's have a look at entries with eight characters.

In [335]:
head(filter(tmp, nchar(effective_time)==8))
effective_time
120140826
220130620
320140825
420140903
520140905
620140902

One option is to use yyyy/mm/dd format and drop the additional information. This will help us to have a consistent date format and we can do analysis on daily, monthly or annual basis. The function stri_sub from the stringi package helps us to extact fixed number of characters. So, let's extact the first eight characters only from the variable effective_time.

In [336]:
spl=mutate(spl,effective_time=stri_sub(effective_time,1,8))

Now, let's check that all have eight characters only.

In [337]:
spl%>%group_by(nchar(effective_time))%>%summarize(count=n())
nchar(effective_time)count
1 889207

Now, since all have eight characters in yyyymmdd format, we can use the ymd function from the lubridate package to change the values from character to date.

In [338]:
spl=mutate(spl,effective_time =ymd(effective_time))
In [339]:
glimpse(spl)  # checking if effective_time has been changed to date format.
Observations: 89,207
Variables: 2
$ effective_time <date> 2014-08-26, 2013-06-20, 2014-08-25, 2014-09-03, 201...
$ boxed_warning  <list> [NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NU...
Number of submissions per day
In [340]:
spl%>%group_by(effective_time)%>%summarize(Total=n())%>%
ggplot(aes(x=effective_time,y=Total))+geom_line()

There are very small number of reports prior to 2009, so let's exclude them from our figure.

In [341]:
spl%>%filter(effective_time>ymd("20081231"))%>%
              group_by(effective_time)%>%summarize(Total=n())%>%
ggplot(aes(x=effective_time,y=Total))+geom_line()

This figure is not that helpful, so let's see number of reports per year rather than per day.

In [342]:
spl=mutate(spl, year = format(effective_time, "%Y"))
In [344]:
glimpse(spl)
Observations: 89,207
Variables: 3
$ effective_time <date> 2014-08-26, 2013-06-20, 2014-08-25, 2014-09-03, 201...
$ boxed_warning  <list> [NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NU...
$ year           <chr> "2014", "2013", "2014", "2014", "2014", "2014", "201...

Since there is only one entry for 2017, we will consider up to the end of 2016.

In [432]:
spl%>%filter(effective_time>ymd("20081231") & effective_time < ymd("20170101"))%>%group_by(year)%>%summarize(Total=n())%>%
ggplot(aes(x=year,y=Total, fill=Total))+geom_bar(stat="identity",color='white')+
ggtitle('Annual drug labeling submissions')+xlab('')+ylab('Total number of submission')

Now, let's see boxed warning, which is the strongest warning that the FDA requires, and signifies that medical studies indicate that the drug carries a significant risk of serious or even life-threatening adverse effects.

In [346]:
select(spl,boxed_warning)[25:30,]
  1. NULL
  2. NULL
  3. "WARNING: FETAL TOXICITY See full prescribing information for complete boxed warning. When pregnancy is detected, discontinue lisinopril and hydrochlorothiazide as soon as possible. Drugs that act directly on the renin-angiotensin system can cause injury and death to the developing fetus. See WARNINGS; Fetal Toxicity."
  4. NULL
  5. NULL
  6. NULL

When we change lists to characters using unlist, NULL is dropped, so let's change NULL to NA before unlisting

In [347]:
spl$boxed_warning[spl$boxed_warning=="NULL"] =NA
In [348]:
glimpse(spl)  # check that NULL is changed to NA
Observations: 89,207
Variables: 3
$ effective_time <date> 2014-08-26, 2013-06-20, 2014-08-25, 2014-09-03, 201...
$ boxed_warning  <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, "Suicidality a...
$ year           <chr> "2014", "2013", "2014", "2014", "2014", "2014", "201...

Now, let's see number of boxed warnings per years. I will remove 2017 as it has only one entry.

In [430]:
spl%>%filter(effective_time>ymd("20081231") & effective_time < ymd("20170101"))%>%
group_by(year)%>%summarize(total_boxed_warning=sum(!is.na(boxed_warning)))%>%
ggplot(aes(x=year,y=total_boxed_warning))+geom_bar(stat="identity",fill='light green',color='dark blue')+
ggtitle('Annual drug labeling submissions \n with boxed warning ')+xlab('')+
ylab('Total number of submission')
In [350]:
head(unique(spl$boxed_warning),3)
  1. NA
  2. "Suicidality and Antidepressant Drugs (See : , : , and : ) Antidepressants increased the risk compared to placebo of suicidal thinking and behavior (suicidality) in children, adolescents, and young adults in short-term studies of major depressive disorder (MDD) and other psychiatric disorders. Anyone considering the use of mirtazapine tablets or any other antidepressant in a child, adolescent, or young adult must balance this risk with the clinical need. Short-term studies did not show an increase in the risk of suicidality with antidepressants compared to placebo in adults beyond age 24; there was a reduction in risk with antidepressants compared to placebo in adults aged 65 and older. Depression and certain other psychiatric disorders are themselves associated with increases in the risk of suicide. Patients of all ages who are started on antidepressant therapy should be monitored appropriately and observed closely for clinical worsening, suicidality, or unusual changes in behavior. Families and caregivers should be advised of the need for close observation and communication with the prescriber. Mirtazapine tablets are not approved for use in pediatric patients. Warnings Clinical Worsening and Suicide Risk PRECAUTIONS Information for Patients PRECAUTIONS Pediatric Use"
  3. "WARNING: Thyroid hormones, including Levothyroxine Sodium Tablets, USP, either alone or with other therapeutic agents, should not be used for the treatment of obesity for weight loss. In euthyroid patients, doses within the range of daily hormonal requirements are ineffective for weight reduction. Larger doses may produce serious or even life threatening manifestations of toxicity, particularly when given in association with sympathomimetic amines such as those used for their anorectic effects."

We can use unlist to change list to character. We are using the function paste0 to concatenate the lists for each row, with the elements being separated by the value of collapse.

In [351]:
for(i in 1:nrow(spl)){
      spl$boxed_warning[i]=paste0(unlist(spl$boxed_warning[i]),collapse=",")
    }
spl$boxed_warning=unlist(spl$boxed_warning)
In [352]:
glimpse(spl)  # check if boxed_warning has been changed to character
Observations: 89,207
Variables: 3
$ effective_time <date> 2014-08-26, 2013-06-20, 2014-08-25, 2014-09-03, 201...
$ boxed_warning  <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA"...
$ year           <chr> "2014", "2013", "2014", "2014", "2014", "2014", "201...
Get additional variables

As we have seen above, the variable openfda is a data frame and contains important variables. So, let's extract some variables and play with them.

In [ ]:
spl_openfda= fromJSON(filenames[1])$results$openfda

spl_openfda=select(spl_openfda,manufacturer_name,product_type,route,generic_name,brand_name,substance_name)

  for(i in 2:length(filenames)){
      tmp = fromJSON(filenames[i])$results$openfda
      tmp = select(tmp,manufacturer_name,product_type,route,generic_name,brand_name,substance_name)
      spl_openfda=rbind(spl_openfda, tmp)
}
In [353]:
glimpse(spl_openfda)  # we have selected 6 variables and we will unlist the variables which are lists.
Observations: 89,207
Variables: 6
$ manufacturer_name <list> ["Cardinal Health", "Carilion Materials Manageme...
$ product_type      <list> ["HUMAN PRESCRIPTION DRUG", "HUMAN PRESCRIPTION ...
$ route             <list> ["ORAL", "ORAL", "ORAL", "TOPICAL", "ORAL", "TOP...
$ generic_name      <list> ["DOXYCYCLINE", "ATORVASTATIN CALCIUM", "HOMEOPA...
$ brand_name        <list> ["Doxycycline", "ATORVASTATIN CALCIUM", "Rescue ...
$ substance_name    <list> ["DOXYCYCLINE", "ATORVASTATIN CALCIUM PROPYLENE ...
In [355]:
"NULL"%in%(spl_openfda$product_type)
TRUE

To unlist, NULL must be changed to NA.

In [ ]:
spl_openfda$product_type[spl_openfda$product_type=="NULL"] =NA
In [356]:
unique(spl_openfda$product_type)
  1. "HUMAN PRESCRIPTION DRUG"
  2. "HUMAN OTC DRUG"
  3. NA
In [ ]:
spl_openfda = mutate(spl_openfda,product_type=unlist(product_type))

Now, let's unlist route.

In [357]:
"NULL"%in%(spl_openfda$route)
TRUE
In [358]:
head(unique(spl_openfda$route),10)
  1. "ORAL"
  2. "TOPICAL"
  3. "INTRAVENOUS"
    1. "INTRAMUSCULAR"
    2. "INTRAVENOUS"
  4. "DENTAL"
  5. "OPHTHALMIC"
  6. NULL
    1. "SOFT TISSUE"
    2. "TOPICAL"
    3. "SUBCUTANEOUS"
    4. "INTRALESIONAL"
    5. "EPIDURAL"
    6. "INTRA-ARTICULAR"
    7. "INTRAMUSCULAR"
    8. "INFILTRATION"
    9. "INTRAVENOUS"
    1. "TOPICAL"
    2. "INTRAMUSCULAR"
    3. "INTRAVENOUS"
  7. "AURICULAR (OTIC)"
In [ ]:
spl_openfda$route[spl_openfda$route=="NULL"] =NA
In [ ]:
for(i in 1:nrow(spl_openfda)){
      spl_openfda$route[i]=paste0(unlist(spl_openfda$route[i]),collapse=",")
    }
spl_openfda = mutate(spl_openfda, route=unlist(route))
In [359]:
head(unique(spl_openfda$route),10)
  1. "ORAL"
  2. "TOPICAL"
  3. "INTRAVENOUS"
  4. "INTRAMUSCULAR,INTRAVENOUS"
  5. "DENTAL"
  6. "OPHTHALMIC"
  7. "NA"
  8. "SOFT TISSUE,TOPICAL,SUBCUTANEOUS,INTRALESIONAL,EPIDURAL,INTRA-ARTICULAR,INTRAMUSCULAR,INFILTRATION,INTRAVENOUS"
  9. "TOPICAL,INTRAMUSCULAR,INTRAVENOUS"
  10. "AURICULAR (OTIC)"
In [360]:
glimpse(spl_openfda)
Observations: 89,207
Variables: 6
$ manufacturer_name <chr> "Cardinal Health", "Carilion Materials Management...
$ product_type      <chr> "HUMAN PRESCRIPTION DRUG", "HUMAN PRESCRIPTION DR...
$ route             <chr> "ORAL", "ORAL", "ORAL", "TOPICAL", "ORAL", "TOPIC...
$ generic_name      <chr> "DOXYCYCLINE", "ATORVASTATIN CALCIUM", "HOMEOPATH...
$ brand_name        <chr> "Doxycycline", "ATORVASTATIN CALCIUM", "Rescue Ca...
$ substance_name    <chr> "DOXYCYCLINE", "ATORVASTATIN CALCIUM PROPYLENE GL...

Let's see count of submissions of the most frequently reported ones by product type and by route.

In [361]:
spl_openfda%>%group_by(product_type,route)%>%summarize(Total=n())%>%arrange(desc(Total))%>%slice(1:10)
product_typerouteTotal
1HUMAN OTC DRUGTOPICAL 22711
2HUMAN OTC DRUGORAL 20142
3HUMAN OTC DRUGDENTAL 865
4HUMAN OTC DRUGNA 716
5HUMAN OTC DRUGOPHTHALMIC 545
6HUMAN OTC DRUGRECTAL 374
7HUMAN OTC DRUGNASAL 352
8HUMAN OTC DRUGSUBLINGUAL 227
9HUMAN OTC DRUGCUTANEOUS 200
10HUMAN OTC DRUGVAGINAL 136
11HUMAN PRESCRIPTION DRUGORAL 21489
12HUMAN PRESCRIPTION DRUGINTRAVENOUS 2026
13HUMAN PRESCRIPTION DRUGTOPICAL 1704
14HUMAN PRESCRIPTION DRUG RESPIRATORY (INHALATION)1289
15HUMAN PRESCRIPTION DRUGOPHTHALMIC 621
16HUMAN PRESCRIPTION DRUGNA 594
17HUMAN PRESCRIPTION DRUG INTRAMUSCULAR,INTRAVENOUS580
18HUMAN PRESCRIPTION DRUGDENTAL 306
19HUMAN PRESCRIPTION DRUGINTRAMUSCULAR 229
20HUMAN PRESCRIPTION DRUGSUBCUTANEOUS 223
21UnknownNA 11724

Change NA to "unknown" as that is more convinient to work with in ggplot2.

In [362]:
spl_openfda$route[is.na(spl_openfda$route)] <- "Unknown"
spl_openfda$product_type[is.na(spl_openfda$product_type)] <- "Unknown"

Now, let's see number of reports by product type.

In [363]:
spl_openfda%>%group_by(product_type)%>%summarize(Total=n())%>%
ggplot(aes(x=product_type,y=Total))+geom_bar(stat='identity',alpha=0.5)+
xlab('')+ylab('Total number of submission')

As we can see from the figure above, most reports are on over-the-counter (OTC) drug labeling.

We can also see most frequent routes.

In [364]:
spl_openfda%>%group_by(route)%>%summarize(Total=n())%>%arrange(desc(Total))%>%slice(1:20)%>%
mutate(route = factor(route,levels = route[order(Total,decreasing =F)]))%>%
ggplot(aes(x=route,y=Total))+geom_bar(stat='identity',color='skyblue',fill='#b35900')+coord_flip()+
xlab("")+ggtitle("Number of reports by route")+ylab("")

So, most are administered orally followed by topical.

Now, let's change the variables manufacturer_name, generic_name, brand_name and substance_name to character using unlist.

In [ ]:
for(i in 1:nrow(spl_openfda)){
      spl_openfda$generic_name[i]=paste0(unlist(spl_openfda$generic_name[i]),collapse=",")
      spl_openfda$brand_name[i]=paste0(unlist(spl_openfda$brand_name[i]),collapse=",")
      spl_openfda$substance_name[i]=paste0(unlist(spl_openfda$substance_name[i]),collapse=",")
      spl_openfda$manufacturer_name[i]=paste0(unlist(spl_openfda$manufacturer_name[i]),collapse=",")
    }
spl_openfda = mutate(spl_openfda, generic_name=unlist(generic_name),
                    brand_name=unlist(brand_name),
                    substance_name = unlist(substance_name),
                    manufacturer_name =unlist(manufacturer_name))
In [365]:
glimpse(spl_openfda)
Observations: 89,207
Variables: 6
$ manufacturer_name <chr> "Cardinal Health", "Carilion Materials Management...
$ product_type      <chr> "HUMAN PRESCRIPTION DRUG", "HUMAN PRESCRIPTION DR...
$ route             <chr> "ORAL", "ORAL", "ORAL", "TOPICAL", "ORAL", "TOPIC...
$ generic_name      <chr> "DOXYCYCLINE", "ATORVASTATIN CALCIUM", "HOMEOPATH...
$ brand_name        <chr> "Doxycycline", "ATORVASTATIN CALCIUM", "Rescue Ca...
$ substance_name    <chr> "DOXYCYCLINE", "ATORVASTATIN CALCIUM PROPYLENE GL...

All lists have been changed to characters. Now, let's merge the two data frames to use the time dimension.

In [366]:
spl_final = cbind(spl,spl_openfda)
In [367]:
glimpse(spl_final)
Observations: 89,207
Variables: 9
$ effective_time    <date> 2014-08-26, 2013-06-20, 2014-08-25, 2014-09-03, ...
$ boxed_warning     <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "...
$ year              <chr> "2014", "2013", "2014", "2014", "2014", "2014", "...
$ manufacturer_name <chr> "Cardinal Health", "Carilion Materials Management...
$ product_type      <chr> "HUMAN PRESCRIPTION DRUG", "HUMAN PRESCRIPTION DR...
$ route             <chr> "ORAL", "ORAL", "ORAL", "TOPICAL", "ORAL", "TOPIC...
$ generic_name      <chr> "DOXYCYCLINE", "ATORVASTATIN CALCIUM", "HOMEOPATH...
$ brand_name        <chr> "Doxycycline", "ATORVASTATIN CALCIUM", "Rescue Ca...
$ substance_name    <chr> "DOXYCYCLINE", "ATORVASTATIN CALCIUM PROPYLENE GL...

Now, let's see number of reports per year by product type.

In [429]:
spl_final%>%filter(effective_time>ymd("20081231") & effective_time < ymd("20170101"))%>%group_by(year,product_type)%>%summarize(Total=n())%>%
ggplot(aes(x=year,y=Total, fill=product_type))+geom_bar(stat="identity",color='white')+
ggtitle('Annual drug labeling submissions\n by Product type')+xlab('')+ylab('Total number of submission')

In every year, which manufacturers have the highest number of submissions?

In [428]:
spl_final%>%filter(effective_time>ymd("20081231")& effective_time < ymd("20170101")
                   ,nchar(manufacturer_name)>0)%>%
group_by(year, manufacturer_name)%>%summarize(Total=n())%>%top_n(1,wt=Total)
yearmanufacturer_nameTotal
12009 Physicians Total Care, Inc.119
22010 Physicians Total Care, Inc.326
32011 REMEDYREPACK INC.542
42012 Physicians Total Care, Inc.377
52013 REMEDYREPACK INC.1149
62014 Uriel Pharmacy Inc.1046
72015 Rxhomeo Private Limited d.b.a. Rxhomeo, Inc512
82016 REMEDYREPACK INC.328

What are the most common brand names in every year?

In [427]:
spl_final%>%filter(effective_time>ymd("20081231") & effective_time < ymd("20170101") ,
                   nchar(brand_name)>0)%>%
group_by(year, brand_name)%>%summarize(Total=n())%>%top_n(1,wt=Total)
yearbrand_nameTotal
12009 Oxygen129
22010 Oxygen116
32011 Oxygen45
42012 Ibuprofen56
52013 Ibuprofen77
62014 Oxygen46
72015 Ibuprofen104
82016 Ibuprofen32

Summary

In this blog post, we downloaded JSON data from the openFDA drug product labeling API, which provides data for prescription and over-the-counter (OTC) drug labeling. Then, we analyzed the data using various R packaged after understanding the contents of the data. In the next post, we will see how to perform similar tasks using Python.

comments powered by Disqus