Fisseha Berhane, PhD

Data Scientist

443-970-2353 fisseha@jhu.edu CV Resume Linkedin GitHub twitter twitter

Downloading data from the web using R

Since I am using Ipython notebook (Jupyter), I am setting the option below so as to make my figures inline.

In [35]:
options(jupyter.plot_mimetypes = 'image/png')


There are different ways that we can download and read data into R. Some examples are shown below.

Example 1

The American Community Survey distributes downloadable data about United States communities. Let's download the 2006 microdata survey about housing for the state of Idaho using download.file() from here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv

and load the data into R. The code book, describing the variable names is here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FPUMSDataDict06.pdf

Setting the working directory

In [36]:
setwd("C:/Fish/classes/summer_2015/getting_cleaning_data/quizes")

The code below checks if the data is already downloaded, if not it downloads it.

In [37]:
if(!file.exists('q1.csv')){
    url<-"https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"

    download.file(url,destfile = "q1.csv")
}

Then let's read the data into R

In [38]:
data<-read.csv("q1.csv")

We can perform calculations on the using the data. For example, let's calculate how many properties are worth $1,000,000 or more.

From the code book, we can see that the variable 'VAL' is property value and 24 represents properies that worth 1000000+. So let's use the table command to see how many properties are worth what.

In [39]:
x<-data$VAL
table(x) 
hist(x,breaks=24,xlab='class',main='Histogram of property value',col='darkblue')
Out[39]:
x
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
 75  42  33  30  26  29  23  70  99 119 152 199 233 495 483 486 357 502 232 312 
 21  22  23  24 
164 159  47  53 

So, we can see from column 24 that there are 53 properties worth 1000000+ or more


Example 2

Now let's see how to read extensible markup language(XML) data which is frequently used to store structured data, and widely used in internet apps. It is known that extracting XML is basis for most of web scraping

Let's read the XML data on Baltimore restaurants from here: https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml.

Then let's see how many restaurants have zipcode 21218, where my office is located.

Check if XML package is present, else download it.

In [40]:
if(!require(XML)){
    install.packages('XML')}

Load XML package and download data. I have replaced https by httP to make it downloadable.

In [41]:
library(XML)

fileURL<-"https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"

fileURL2 <- sub('https', 'http', fileURL)
doc <- xmlTreeParse(fileURL2, useInternal = TRUE)

Now, let's get the rootnode and explore it.

In [42]:
rootNode<-xmlRoot(doc)
xmlName(rootNode)

names(rootNode)

rootNode[[1]][[1]];
Out[42]:
"response"
Out[42]:
row: "row"
Out[42]:
<row _id="1" _uuid="93CACF6F-C8C2-4B87-95A8-8177806D5A6F" _position="1" _address="http://data.baltimorecity.gov/resource/k5ry-ef3g/1">
  <name>410</name>
  <zipcode>21206</zipcode>
  <neighborhood>Frankford</neighborhood>
  <councildistrict>2</councildistrict>
  <policedistrict>NORTHEASTERN</policedistrict>
  <location_1 human_address="{&quot;address&quot;:&quot;4509 BELAIR ROAD&quot;,&quot;city&quot;:&quot;Baltimore&quot;,&quot;state&quot;:&quot;MD&quot;,&quot;zip&quot;:&quot;&quot;}" needs_recoding="true"/>
</row> 

Now, let's see the number of restaurants with zipcode 21218

In [43]:
zipcode<-xpathSApply(rootNode,"//zipcode",xmlValue)

sum(zipcode=='21218')
Out[43]:
69

So, there are 69 restaurants with zipcode 21218.


We can also see histogram of restaurants by zipcode.

In [44]:
zipcode<-as.numeric(zipcode)

x<-unique(zipcode)
x
Out[44]:
  1. 21206
  2. 21231
  3. 21224
  4. 21211
  5. 21223
  6. 21218
  7. 21205
  8. 21226
  9. 21215
  10. 21202
  11. 21201
  12. 21230
  13. 21222
  14. 21212
  15. 21210
  16. 21213
  17. 21214
  18. 21225
  19. 21217
  20. 21234
  21. 21209
  22. 21227
  23. 21239
  24. 21216
  25. 21208
  26. 21207
  27. 21229
  28. 21251
  29. 21220
  30. 21287
  31. -21226
  32. 21237

Let's remove the negative value.

In [45]:
zipcode<-zipcode[zipcode>0]

Let's see a histogram of restaurants by zipcode

In [47]:
hist(as.numeric(zipcode),breaks=32,xlab='zipcode',
     ylab='Number of restautants',
     main='Histogram of number of restaurants in Baltimore, MD, USA',
     col='skyblue',border='red')

Example 3

Reading rows and columns of our choice.

Let's download xlsx binary data by setting the download mode to binary.

The Excel spreadsheet for this example is from Natural Gas Aquisition Program which can be downloaded from https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FDATA.gov_NGAP.xlsx

Check if the data is already downloaded, else download it.

In [73]:
if(!file.exists('q3.xlsx')){
    url<-"https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FDATA.gov_NGAP.xlsx "
    
    download.file(url,destfile = "q3.xlsx", mode='wb')
}

Check if the package xlsx is installed, else install it. Then, load xlsx package.

In [74]:
if(!require(xlsx)){
    install.packages('xlsx')
}

library(xlsx)

We can read column and row numbers of our interest and do any calculations as shown below.

In [75]:
rowIndex<-18:23
colIndex<- 7:15

dat<-read.xlsx('q3.xlsx',sheetIndex=1,rowIndex = rowIndex,colIndex = colIndex)

sum(dat$Zip*dat$Ext,na.rm=T)
Out[75]:
36534720


comments powered by Disqus