Since I am using an IPython (Jupyter) notebook, I am setting the option below to make my figures appear inline.
options(jupyter.plot_mimetypes = 'image/png')
There are different ways that we can download and read data into R. Some examples are shown below.
The American Community Survey distributes downloadable data about United States communities. Let's download the 2006 microdata survey about housing for the state of Idaho using download.file() from here:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv
and load the data into R. The code book, describing the variable names is here:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FPUMSDataDict06.pdf
Setting the working directory
setwd("C:/Fish/classes/summer_2015/getting_cleaning_data/quizes")
The code below checks whether the data has already been downloaded; if not, it downloads it.
if(!file.exists('q1.csv')){
url<-"https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
download.file(url,destfile = "q1.csv")
}
Then let's read the data into R
data<-read.csv("q1.csv")
We can now perform calculations using the data. For example, let's calculate how many properties are worth $1,000,000 or more.
From the code book, we can see that the variable 'VAL' is the property value and that the code 24 represents properties worth $1,000,000 or more. So let's use the table command to see how many properties fall into each value category.
x<-data$VAL
table(x)
hist(x,breaks=24,xlab='class',main='Histogram of property value',col='darkblue')
So, we can see from category 24 that there are 53 properties worth $1,000,000 or more.
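As a sanity check, the same count can also be computed directly with sum() instead of reading it off the table. A minimal sketch on a toy vector (the values below are made up, not the survey data):

```r
# Toy property-value codes (hypothetical data, not the ACS survey)
val <- c(24, 1, 24, 17, 24, NA, 5)

# table() tabulates every category; sum() counts a single one.
# na.rm = TRUE matters because VAL contains missing values.
table(val)
sum(val == 24, na.rm = TRUE)  # counts entries equal to 24; here, 3
```

On the real data the equivalent call would be sum(data$VAL == 24, na.rm = TRUE).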
Now let's see how to read Extensible Markup Language (XML) data, which is frequently used to store structured data and is widely used in internet applications. Extracting XML is the basis for most web scraping.
Let's read the XML data on Baltimore restaurants from here: https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml.
Then let's see how many restaurants have zipcode 21218, where my office is located.
Check if the XML package is installed; if not, install it.
if(!require(XML)){
install.packages('XML')}
Load the XML package and download the data. I have replaced https with http to make it downloadable.
library(XML)
fileURL<-"https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
fileURL2 <- sub('https', 'http', fileURL)
doc <- xmlTreeParse(fileURL2, useInternalNodes = TRUE)
Now, let's get the root node and explore it.
rootNode<-xmlRoot(doc)
xmlName(rootNode)
names(rootNode)
rootNode[[1]][[1]];
Now, let's see the number of restaurants with zipcode 21218
zipcode<-xpathSApply(rootNode,"//zipcode",xmlValue)
sum(zipcode=='21218')
So, there are 69 restaurants with zipcode 21218.
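The xpathSApply pattern used above generalizes to any node name. A minimal self-contained sketch on an inline XML string (toy data, assuming the XML package is installed):

```r
library(XML)

# Toy document mimicking the restaurant feed's structure (made-up values)
txt <- "<rows>
          <row><name>A</name><zipcode>21218</zipcode></row>
          <row><name>B</name><zipcode>21230</zipcode></row>
          <row><name>C</name><zipcode>21218</zipcode></row>
        </rows>"

doc  <- xmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)
root <- xmlRoot(doc)

# "//zipcode" selects every zipcode node anywhere in the tree
zips <- xpathSApply(root, "//zipcode", xmlValue)
sum(zips == "21218")  # 2 in this toy document
```

The same two lines, applied to the downloaded restaurants file, give the count of 69 reported above.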
We can also look at a histogram of restaurants by zipcode.
zipcode<-as.numeric(zipcode)
x<-unique(zipcode)
x
Let's remove the negative value.
zipcode<-zipcode[zipcode>0]
Let's see a histogram of restaurants by zipcode
hist(as.numeric(zipcode),breaks=32,xlab='zipcode',
ylab='Number of restaurants',
main='Histogram of number of restaurants in Baltimore, MD, USA',
col='skyblue',border='red')
Let's download xlsx binary data by setting the download mode to binary.
The Excel spreadsheet for this example is from the Natural Gas Acquisition Program and can be downloaded from https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FDATA.gov_NGAP.xlsx
Check whether the data has already been downloaded; if not, download it.
if(!file.exists('q3.xlsx')){
url<-"https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FDATA.gov_NGAP.xlsx"
download.file(url,destfile = "q3.xlsx", mode='wb')
}
Check whether the xlsx package is installed; if not, install it. Then load the xlsx package.
if(!require(xlsx)){
install.packages('xlsx')
}
library(xlsx)
We can read only the rows and columns of interest and then perform calculations on them, as shown below.
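A minimal sketch of reading a sub-range of the sheet with read.xlsx (the row and column indices below are illustrative assumptions, not values taken from the data dictionary):

```r
library(xlsx)

# read.xlsx can restrict the read to specific rows and columns.
# rowIndex/colIndex are 1-based; sheetIndex = 1 is the first sheet.
dat <- read.xlsx("q3.xlsx", sheetIndex = 1,
                 rowIndex = 18:23,   # hypothetical rows of interest
                 colIndex = 7:15)    # hypothetical columns of interest

# Any calculation can then be applied to the extracted block,
# e.g. summing each numeric column:
colSums(dat[sapply(dat, is.numeric)], na.rm = TRUE)
```

Reading only the needed block avoids loading the whole workbook into memory, which is useful for large spreadsheets.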