Fisseha Berhane, PhD

Data Scientist


Merging DataFrames with Pandas

Python's popularity across the IT industry keeps growing year over year, so mastering Python opens more options in the marketplace. Python is also one of the most popular data science tools, and one of the main reasons for that popularity is the Pandas package.

Pandas is a Python library that provides high-performance, easy-to-use data structures and data analysis tools. While data.table and dplyr are the leading data manipulation packages in R, Pandas is the go-to package in Python for intuitive and easy data analysis. In this blog post, we will see how to use Pandas to merge adverse events data, which are publicly available.
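
Before diving into the data, here is a minimal sketch of what merging means in Pandas: pd.merge() joins two DataFrames on a shared key column, much like a SQL join. The two small DataFrames below are made up purely for illustration:

import pandas as pd

# Two made-up DataFrames, for illustration only
patients = pd.DataFrame({"primaryid": [1, 2, 3],
                         "age": [44, 38, 28]})
drugs = pd.DataFrame({"primaryid": [1, 1, 3],
                      "drugname": ["AVONEX", "ZANAFLEX", "LASIX"]})

# An inner join keeps only the primaryids present in both DataFrames
print(pd.merge(patients, drugs, on="primaryid", how="inner"))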

When people take drugs and experience adverse events, they can report them to the Food and Drug Administration (FDA). In this post, we will download the patients' demography information, the drugs they used and the indications they were used for, the reactions they experienced, and the outcomes of those adverse events. Then, we will merge the different data files. In the next part of this series, we will put the data in a database and use Pandas to clean, explore and visualize it. I will organize the tutorials more clearly as I work on them.

We will download the datasets in CSV format from the National Bureau of Economic Research (NBER). The adverse events datasets are released at quarterly resolution, and each quarter's data includes demography information, drug/biologic information, adverse events, indications, outcomes, etc. At the time of writing, the data cover 2004 quarter one through 2016 quarter three. For this tutorial, let's download the data from 2013 to 2016.

Let's first import the libraries that will help us download the zipped CSV files and unzip them.

In [6]:
import requests, zipfile, io

Now, let's create a list of the URLs of the various datasets we are going to download. Remember that for 2016 the data go up to quarter three at the time of this writing. To see the URLs, browse http://www.nber.org/fda/faers/ and right-click any CSV file to copy its link address. For example, here are the URLs for the demography, drug, reaction, outcome and indication files for 2015 quarter three:
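
http://www.nber.org/fda/faers/2015/demo2015q3.csv.zip
http://www.nber.org/fda/faers/2015/drug2015q3.csv.zip
http://www.nber.org/fda/faers/2015/reac2015q3.csv.zip
http://www.nber.org/fda/faers/2015/outc2015q3.csv.zip
http://www.nber.org/fda/faers/2015/indi2015q3.csv.zip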

Now, let's write a loop that builds the list of URLs of all the data files we want to download.

In [2]:
year_start = 2013
year_last = 2016

base_url = "http://www.nber.org/fda/faers/{y}/{prefix}{y}q{q}.csv.zip"
prefixes = ["demo", "drug", "reac", "outc", "indi"]

urls = []
for year in range(year_start, year_last + 1):
    # 2016 data go up to quarter three only
    quarters = [1, 2, 3] if year == 2016 else [1, 2, 3, 4]
    for quarter in quarters:
        for prefix in prefixes:
            urls.append(base_url.format(y=year, prefix=prefix, q=quarter))

How many data files are we downloading? The built-in len() function gives us the length of a list.

In [3]:
len(urls)
Out[3]:
75

Let's display a sample from the 75 URLs of the datasets we are going to download.

In [4]:
urls[20:30]
Out[4]:
['http://www.nber.org/fda/faers/2014/demo2014q1.csv.zip',
 'http://www.nber.org/fda/faers/2014/drug2014q1.csv.zip',
 'http://www.nber.org/fda/faers/2014/reac2014q1.csv.zip',
 'http://www.nber.org/fda/faers/2014/outc2014q1.csv.zip',
 'http://www.nber.org/fda/faers/2014/indi2014q1.csv.zip',
 'http://www.nber.org/fda/faers/2014/demo2014q2.csv.zip',
 'http://www.nber.org/fda/faers/2014/drug2014q2.csv.zip',
 'http://www.nber.org/fda/faers/2014/reac2014q2.csv.zip',
 'http://www.nber.org/fda/faers/2014/outc2014q2.csv.zip',
 'http://www.nber.org/fda/faers/2014/indi2014q2.csv.zip']

Now, let's iterate over the list of URLs and download the datasets.

In [5]:
for url in urls:
    r = requests.get(url)
    z = zipfile.ZipFile(io.BytesIO(r.content))  # wrap the downloaded bytes as an in-memory file for zipfile
    z.extractall()
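
Downloading 75 files takes a while, and a single failed request would stop the loop above midway. Optionally, here is a slightly more defensive variant (a sketch, not part of the original loop) that skips any file that fails to download or unzip:

for url in urls:
    r = requests.get(url)
    try:
        r.raise_for_status()  # raise on HTTP errors such as a 404
        zipfile.ZipFile(io.BytesIO(r.content)).extractall()
    except (requests.HTTPError, zipfile.BadZipFile) as e:
        print("Skipping {}: {}".format(url, e))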

As noted above, we are working with drug, demography, reaction, indication and outcome data files. To get a list of the files that belong to each category, we will use pattern matching with the glob library.

In [25]:
import glob

Now, let's see how many data files we have from each category.

In [2]:
drug_files = glob.glob("drug*.csv")
demography_files = glob.glob("demo*.csv")
reaction_files = glob.glob("reac*.csv")
indication_files = glob.glob("indi*.csv")
outcome_files = glob.glob("outc*.csv")
In [3]:
print("Number of drug files: {}".format(len(drug_files)))
print("Number of drug files: {}".format(len(demography_files)))
print("Number of drug files: {}".format(len(reaction_files)))
print("Number of drug files: {}".format(len(indication_files)))
print("Number of drug files: {}".format(len(outcome_files)))
Number of drug files: 15
Number of drug files: 15
Number of drug files: 15
Number of drug files: 15
Number of drug files: 15

We see that we have 15 datasets from each category.

Let's display the outcome data files, just to see how the files are named and how they correspond to the URLs we built above.

In [4]:
outcome_files
Out[4]:
['outc2013q1.csv',
 'outc2013q2.csv',
 'outc2013q3.csv',
 'outc2013q4.csv',
 'outc2014q1.csv',
 'outc2014q2.csv',
 'outc2014q3.csv',
 'outc2014q4.csv',
 'outc2015q1.csv',
 'outc2015q2.csv',
 'outc2015q3.csv',
 'outc2015q4.csv',
 'outc2016q1.csv',
 'outc2016q2.csv',
 'outc2016q3.csv']

Next, let's import Pandas and concatenate the various data files.

In [5]:
import pandas as pd

Drug Data

Let's use the head command from the command line to see the first few lines of one of the drug files.

In [13]:
! head -4 drug2015q2.csv
"primaryid","caseid","drug_seq","role_cod","drugname","prod_ai","val_vbm","route","dose_vbm","cum_dose_chr","cum_dose_unit","dechal","rechal","lot_num","exp_dt","exp_dtstr","exp_dt_mult","nda_num","dose_amt","dose_amtstr","dose_amt_mult","dose_unit","dose_form","dose_freq"
36357913,3635791,1,"PS","DIOVAN","VALSARTAN",1,"Oral","80 MG ONCE ONLY",,,,"D",,,,0,,,,0,,"TABLET",
36357913,3635791,2,"SS","DIOVAN","VALSARTAN",1,"Oral",,,,,"D",,,,0,,80,"80",0,"MG","TABLET",
36357913,3635791,3,"C","CARDENALIN","DOXAZOSIN MESYLATE",1,,,,,,"D",,,,0,,,,0,,,

The first row shows the columns (variables) the drug data files contain. As you can see, the drug files have many variables; for this tutorial, let's select primaryid, drug_seq, drugname and route from each data file and concatenate all the drug files.

The pd.read_csv() function imports CSV files. We will read each file into a DataFrame, collect the DataFrames in a list, and finally concatenate them along the rows to create a single DataFrame per category.

In [14]:
frames =  []

for csv in drug_files:
    df = pd.read_csv(csv)
    df = df[["primaryid","drug_seq","drugname","route"]]
    frames.append(df)

drug = pd.concat(frames)
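
As a side note, pd.read_csv() accepts a usecols parameter, so we could skip the unneeded columns at read time and save memory on files this large. Also, pd.concat() keeps each input file's own row index (you can see the per-file index values in drug.tail() below); passing ignore_index=True renumbers the rows instead. A sketch combining both:

frames = []
for csv in drug_files:
    # read only the four columns we need, instead of all 24
    frames.append(pd.read_csv(csv, usecols=["primaryid", "drug_seq", "drugname", "route"]))

drug = pd.concat(frames, ignore_index=True)  # renumber rows 0..n-1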

Now, let's see how many drug records we have downloaded.

In [8]:
drug.shape
Out[8]:
(13638196, 4)

As we can see above, we have downloaded about 13.6 million drug records.

We can see the columns in our dataframe as below:

In [15]:
drug.columns
Out[15]:
Index(['primaryid', 'drug_seq', 'drugname', 'route'], dtype='object')

We can also see the data type of each column.

In [13]:
drug.dtypes
Out[13]:
primaryid     int64
drug_seq      int64
drugname     object
route        object
dtype: object

We can also see the first and the last few rows as below:

In [16]:
drug.head()
Out[16]:
primaryid drug_seq drugname route
0 30375293 1 AVONEX INTRAMUSCULAR
1 30375293 2 AVONEX INTRAMUSCULAR
2 30375293 3 ZANAFLEX NaN
3 30375293 4 STEROID (NOS) NaN
4 30375293 5 AMANTADINE NaN
In [17]:
drug.tail()
Out[17]:
primaryid drug_seq drugname route
1260115 1260852811 24 ASPIRIN. Unknown
1260116 1260852811 25 Pegfilgrastim Unknown
1260117 1260852811 26 PYRIDIUM Unknown
1260118 1260852811 27 LASIX Unknown
1260119 1260852811 28 LOVENOX Unknown

Demography Data

Now, let's have a look at the first few lines of the demography data.

In [19]:
! head -4 demo2015q4.csv
"primaryid","caseid","caseversion","i_f_code","i_f_code_num","event_dt","event_dt_num","mfr_dt","mfr_dt_num","init_fda_dt","init_fda_dt_num","fda_dt","fda_dt_num","rept_cod","rept_cod_num","auth_num","mfr_num","mfr_sndr","lit_ref","age","age_cod","age_grp","age_grp_num","sex","e_sub","wt","wt_cod","rept_dt","rept_dt_num","to_mfr","occp_cod","reporter_country","occr_country","occp_cod_num"
37909993,3790999,3,"F",70,20020504,5/4/2002,20151210,12/10/2015,20020516,5/16/2002,20151217,12/17/2015,"EXP",698880,,"US-BRISTOL-MYERS SQUIBB COMPANY-11854981","BRISTOL MYERS SQUIBB",,32,"YR",,,"F","Y",,,20151217,12/17/2015,,"OT","US","US",7984
38053673,3805367,3,"F",70,20020807,8/7/2002,20151208,12/8/2015,20020617,6/17/2002,20151227,12/27/2015,"EXP",698880,,"US-BRISTOL-MYERS SQUIBB COMPANY-11896446","BRISTOL MYERS SQUIBB",,24,"YR",,,"F","Y",,,20151227,12/27/2015,,"OT","US","US",7984
38127553,3812755,3,"F",70,20020212,2/12/2002,20151224,12/24/2015,20020711,7/11/2002,20151228,12/28/2015,"EXP",698880,,"US-BRISTOL-MYERS SQUIBB COMPANY-11936044","BRISTOL MYERS SQUIBB",,27,"YR",,,"F","Y",,,20151228,12/28/2015,,"OT","US","US",7984

As you can see, the demography data files also have many columns. Here too, let's select only some of the variables: primaryid, caseid, age, age_cod, event_dt, sex, wt, wt_cod and occr_country are enough for our tutorial. Then, let's create a demography dataset by concatenating all the quarterly demography data files from 2013 to 2016.

In [20]:
frames =  []

for csv in demography_files:
    df = pd.read_csv(csv)
    df = df[["primaryid","caseid","age","age_cod","event_dt",
         "sex","wt","wt_cod","occr_country"]]
    frames.append(df)

demography = pd.concat(frames)
In [22]:
demography.shape
Out[22]:
(4033608, 9)

The demography data has more than four million records.

In [23]:
demography.head()
Out[23]:
primaryid caseid age age_cod event_dt sex wt wt_cod occr_country
0 30375293 3037529 44.0 YR 199706.0 F 56.0 KG US
1 30936912 3093691 38.0 YR 199610.0 F 56.0 KG US
2 32481334 3248133 28.0 YR 1996.0 F 54.0 KG US
3 35865322 3586532 45.0 YR 20000627.0 M NaN NaN AR
4 37005182 3700518 NaN NaN 200101.0 F NaN NaN NaN

Notice that event_dt comes in mixed precisions (e.g. 1996.0, 199706.0, 20000627.0); this is the kind of cleaning we will tackle in part two.

Reaction Data

In [6]:
! head -4 reac2015q3.csv
"primaryid","caseid","pt","drug_rec_act"
36450622,3645062,"Bone disorder",
36450622,3645062,"Dermatitis exfoliative",
36450622,3645062,"Diplopia",

From the reaction datasets, we will use primaryid and pt (the preferred term for the reaction).

In [7]:
frames =  []

for csv in reaction_files:
    df = pd.read_csv(csv)
    df = df[["primaryid","pt"]]
    frames.append(df)

reaction = pd.concat(frames)
In [8]:
reaction.shape
Out[8]:
(11881650, 2)

The reaction data contains about 12 million records. Let's see a sample from the reaction data.

In [24]:
reaction.sample(n = 10)
Out[24]:
primaryid pt
450876 93391121 Syncope
147554 93802881 Paralysis
598598 123855581 Cardio-respiratory arrest
30404 92429827 Hydrocele operation
218221 111907153 Death
341646 98775691 Peroneal nerve palsy
502266 113349721 Injection site pain
435316 113083651 Visual impairment
13194 88246135 Fungal infection
574277 113643381 Drug ineffective

Outcome Data

In [11]:
! head outc2015q3.csv
"primaryid","caseid","outc_cod"
36450622,3645062,"DS"
36450622,3645062,"HO"
36450622,3645062,"LT"
36450622,3645062,"OT"
36464363,3646436,"HO"
36464363,3646436,"OT"
37751404,3775140,"OT"
38624314,3862431,"HO"
38956283,3895628,"OT"

From the outcome data files, let's take primaryid and outc_cod (the outcome code) and concatenate all the quarterly outcome data files.

In [12]:
frames =  []

for csv in outcome_files:
    df = pd.read_csv(csv)
    df = df[["primaryid","outc_cod"]]
    frames.append(df)

outcome = pd.concat(frames)
In [14]:
outcome.shape
Out[14]:
(2790232, 2)

The outcome data has about 2.8 million records. The outc_cod values are FDA outcome codes such as DE (death), LT (life-threatening), HO (hospitalization), DS (disability) and OT (other). To display a sample, we can use the sample() method and specify the fraction of rows we want, as below.

In [17]:
outcome.sample(frac = 3*10**(-6))
Out[17]:
primaryid outc_cod
172198 102629332 DE
112766 107245271 OT
116171 93436811 HO
155378 117961221 OT
36008 90063941 OT
4501 96172293 HO
185471 127328101 HO
158983 100423281 OT

Indication Data

Finally, let's concatenate the indication data files.

In [19]:
! head indi2015q3.csv
"primaryid","caseid","indi_drug_seq","indi_pt"
36450622,3645062,1,"Tinea cruris"
36450622,3645062,2,"Product used for unknown indication"
36450622,3645062,3,"Product used for unknown indication"
36464363,3646436,1,"Hypertension"
37751404,3775140,1,"Hypertension"
37751404,3775140,3,"Product used for unknown indication"
37751404,3775140,4,"Product used for unknown indication"
37751404,3775140,5,"Product used for unknown indication"
38624314,3862431,1,"Relapsing-remitting multiple sclerosis"
In [21]:
frames =  []

for csv in indication_files:
    df = pd.read_csv(csv)
    df = df[["primaryid","indi_drug_seq","indi_pt"]]
    frames.append(df)

indication = pd.concat(frames)
In [22]:
indication.shape
Out[22]:
(8894544, 3)

The indication data has about nine million rows. Let's display a sample from it.

In [23]:
indication.sample(n = 10)
Out[23]:
primaryid indi_drug_seq indi_pt
367426 119656971 20 Product used for unknown indication
177869 122268742 1 Product used for unknown indication
356484 102272041 9 Cytomegalovirus test positive
107556 122693903 5 Rheumatoid arthritis
313251 93671881 1 Hypertonic bladder
107248 105346093 7 Chronic lymphocytic leukaemia
554279 124046401 5 Hypertension
208058 112447761 2 Product used for unknown indication
211070 94466491 1 Rheumatoid arthritis
491613 99903291 5 Pain in extremity
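
As a quick preview of the merging we will do in part two, note that all of these DataFrames share the primaryid key, so, for example, we can attach patient demographics to outcomes with pd.merge(). A minimal sketch:

# An inner join keeps only the primaryids present in both DataFrames
demo_outcome = pd.merge(demography, outcome, on="primaryid", how="inner")
demo_outcome.head()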

This is enough for today. In part two of this series, we will save our data in a database and use Pandas to clean, explore and visualize our data.
