Python's popularity across the trending technologies in IT keeps increasing from year to year, so mastering Python opens more options in the job market. Python is also one of the most popular data science tools, and one of the reasons for that popularity is the Pandas package.
Pandas is a Python library that provides high-performance, easy-to-use data structures and data analysis tools. While data.table and dplyr are among the best data manipulation packages in R, in Python, Pandas is the go-to package for intuitive and easy data analysis. In this blog post, we will see how to use Pandas to merge adverse event data, which are publicly available.
When people take drugs and experience adverse events, they can report them to the Food and Drug Administration (FDA). In this post, we will download the patients' demography information, the drugs they used and the indications they were used for, the reactions they experienced, and the outcomes of the adverse events. Then, we will merge the different data files. In the next part of this series, we will put the data in a database and use Pandas to clean, explore and visualize it. I will organize the tutorials more clearly as the series progresses.
We will download the datasets in csv format from the National Bureau of Economic Research (NBER). The adverse event datasets are released quarterly, and each quarter's data includes demography information, drug/biologic information, adverse events, indications, outcomes, etc. At the time of writing this post, the data covers 2004 quarter one through 2016 quarter three. For our tutorials, let's download the data from 2013 to 2016.
Let's first import the libraries we need to download the zipped csv files and unzip them in memory.
import requests, zipfile, io
Next, we need the urls for the various datasets we are going to download. Remember that for 2016, the data goes only up to quarter three at the time of this writing. To see the urls, browse the NBER FAERS data page (http://www.nber.org/fda/faers/) and copy the link address of each csv file by right-clicking on it. For example, let's see the urls for demography, drug, indication, reaction and outcome for 2015 quarter three.
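Following the naming pattern the loop below builds, the 2015 quarter three urls are:
http://www.nber.org/fda/faers/2015/demo2015q3.csv.zip
http://www.nber.org/fda/faers/2015/drug2015q3.csv.zip
http://www.nber.org/fda/faers/2015/indi2015q3.csv.zip
http://www.nber.org/fda/faers/2015/reac2015q3.csv.zip
http://www.nber.org/fda/faers/2015/outc2015q3.csv.zip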
Now, let's create a for loop that creates a list of the urls of all the data files we want to download.
year_start = 2013
year_last = 2016
urls = []
for year in range(year_start, year_last + 1):
    # at the time of this writing, 2016 only goes up to quarter three
    if year < 2016:
        quarters = [1, 2, 3, 4]
    else:
        quarters = [1, 2, 3]
    for quarter in quarters:
        url_demography = "http://www.nber.org/fda/faers/" + str(year) + "/demo" + str(year) + "q" + str(quarter) + ".csv.zip"
        url_drug = "http://www.nber.org/fda/faers/" + str(year) + "/drug" + str(year) + "q" + str(quarter) + ".csv.zip"
        url_reaction = "http://www.nber.org/fda/faers/" + str(year) + "/reac" + str(year) + "q" + str(quarter) + ".csv.zip"
        url_outcome = "http://www.nber.org/fda/faers/" + str(year) + "/outc" + str(year) + "q" + str(quarter) + ".csv.zip"
        url_indication = "http://www.nber.org/fda/faers/" + str(year) + "/indi" + str(year) + "q" + str(quarter) + ".csv.zip"
        urls += [url_demography, url_drug, url_reaction, url_outcome, url_indication]
How many data files are we downloading? The built-in len() function gives us the length of a list.
len(urls)
Let's display a sample from the 75 urls of the datasets we are going to download.
urls[20:30]
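Since the loop appends five urls per quarter starting from 2013 quarter one, indices 20 through 29 should correspond to the 2014 quarter one and quarter two files.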
Now, let's iterate over the list of the urls and download the datasets.
for url in urls:
    r = requests.get(url)  # r.content holds the zipped csv in memory
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall()
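Downloading 15 quarters of zipped files in one go can fail midway; a slightly more defensive variant (a hedged sketch, using the same urls list) checks the HTTP status before unzipping, so a bad url fails loudly instead of producing a corrupt archive:
for url in urls:
    r = requests.get(url)
    r.raise_for_status()  # stop on a 404/500 instead of trying to unzip an error page
    with zipfile.ZipFile(io.BytesIO(r.content)) as z:
        z.extractall()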
As noted above, we are working with drug, demography, reaction, indication and outcome data files. To get a list of the data files that belong to each category, we will use filename pattern matching with the glob library.
import glob
Now, let's see how many data files we have from each category.
drug_files = glob.glob("drug*.csv")
demography_files = glob.glob("demo*.csv")
reaction_files = glob.glob("reac*.csv")
indication_files = glob.glob("indi*.csv")
outcome_files = glob.glob("outc*.csv")
print("Number of drug files: {}".format(len(drug_files)))
print("Number of demography files: {}".format(len(demography_files)))
print("Number of reaction files: {}".format(len(reaction_files)))
print("Number of indication files: {}".format(len(indication_files)))
print("Number of outcome files: {}".format(len(outcome_files)))
We see that we have 15 datasets from each category.
Let's display the outcome data files, just to see how the files are named and why we built the urls above the way we did.
outcome_files
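Since we downloaded one file per quarter, the list should hold fifteen names of the form outc2013q1.csv through outc2016q3.csv; note that glob does not guarantee any particular order.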
Next, let's import Pandas and concatenate the various data files.
import pandas as pd
Let's use the head command from the command line to see the first few lines of one of the drug files.
! head -4 drug2015q2.csv
The first row shows the columns (variables) the drug data files contain. As you can see, the drug files have many variables; for this tutorial, let's select primaryid, drug_seq, drugname and route from each data file and concatenate all the drug files.
The pd.read_csv() function imports csv files. We will create a list of the dataframes read with pd.read_csv(), one per quarterly file, and then concatenate them along the rows into a single dataframe for each category.
frames = []
for csv in drug_files:
    df = pd.read_csv(csv)
    df = df[["primaryid", "drug_seq", "drugname", "route"]]
    frames.append(df)
drug = pd.concat(frames)
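As an aside, pd.read_csv() also accepts a usecols parameter, so we could keep only the columns we need while parsing rather than dropping the rest afterwards. A minimal sketch of the same loop, assuming every quarterly file carries the same header:
frames = []
for csv in drug_files:
    # read only the four columns we care about; the others are never loaded
    # (column order follows the file, not the list passed to usecols)
    df = pd.read_csv(csv, usecols = ["primaryid", "drug_seq", "drugname", "route"])
    frames.append(df)
drug = pd.concat(frames)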
Now, let's see how many drug records we have downloaded.
drug.shape
As we can see above, we have downloaded 13.7 million drug records.
We can see the columns in our dataframe as below:
drug.columns
We can also see the data type of each column.
drug.dtypes
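String columns like drugname and route typically come in as the generic object dtype. As an optional aside (not part of the original workflow), columns with heavily repeated values can be stored as the pandas categorical dtype to cut memory use, something like:
drug["route"] = drug["route"].astype("category")
drug["drugname"] = drug["drugname"].astype("category")
drug.memory_usage(deep = True)  # per-column memory footprint in bytes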
We can also see the first and the last few rows as below:
drug.head()
drug.tail()
Now, let's have a look at the first few lines of the demography data.
! head -4 demo2015q4.csv
As you can see, the demography data files also have many columns. Here too, let's select only some of the variables: primaryid, caseid, age, age_cod, event_dt, sex, wt, wt_cod and occr_country are enough for our tutorial. Then, let's create the demography dataset by concatenating all the quarterly demography data files from 2013 to 2016.
frames = []
for csv in demography_files:
    df = pd.read_csv(csv)
    df = df[["primaryid", "caseid", "age", "age_cod", "event_dt",
             "sex", "wt", "wt_cod", "occr_country"]]
    frames.append(df)
demography = pd.concat(frames)
demography.shape
The demography data has more than four million records.
demography.head()
Next, let's have a look at the first few lines of one of the reaction data files.
! head -4 reac2015q3.csv
From the reaction datasets, we will use primaryid and pt (the preferred term for the reaction).
frames = []
for csv in reaction_files:
    df = pd.read_csv(csv)
    df = df[["primaryid", "pt"]]
    frames.append(df)
reaction = pd.concat(frames)
reaction.shape
The reaction data contains about 12 million records. Let's see a sample from the reaction data.
reaction.sample(n = 10)
Similarly, let's inspect one of the outcome data files.
! head outc2015q3.csv
From the outcome data files, let's take primaryid and outc_cod (the outcome code) and concatenate all the quarterly outcome data files.
frames = []
for csv in outcome_files:
    df = pd.read_csv(csv)
    df = df[["primaryid", "outc_cod"]]
    frames.append(df)
outcome = pd.concat(frames)
outcome.shape
The outcome data has about three million records. Instead of a fixed number of rows, we can also pass the sample command the fraction of the data we want to display; with about three million records, a fraction of 3*10**(-6) gives us roughly nine rows.
outcome.sample(frac = 3*10**(-6))
Finally, let's have a quick look at one of the indication data files and then concatenate them.
! head indi2015q3.csv
frames = []
for csv in indication_files:
    df = pd.read_csv(csv)
    df = df[["primaryid", "indi_drug_seq", "indi_pt"]]
    frames.append(df)
indication = pd.concat(frames)
indication.shape
The indication data has about nine million rows. Let's display a sample from it.
indication.sample(n = 10)
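As a preview of the merging to come later in the series, here is a hedged sketch of how the indication records could be linked back to the drug records: both carry primaryid, and the drug sequence number appears as indi_drug_seq on one side and drug_seq on the other (assuming the two sequence columns share a dtype):
drug_indication = indication.merge(
    drug,
    left_on = ["primaryid", "indi_drug_seq"],
    right_on = ["primaryid", "drug_seq"],
    how = "inner")  # keep only indications that match a drug record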
This is enough for today. In part two of this series, we will save our data in a database and use Pandas to clean, explore and visualize it.