Fisseha Berhane, PhD

Data Scientist

443-970-2353 [email protected] CV Resume

Since I am using Ipython notebook (Jupyter), I am setting the option below so as to make my figures inline.

In [35]:
options(jupyter.plot_mimetypes = 'image/png')


There are different ways that we can download and read data into R. Some examples are shown below.

Example 1¶

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv

and load the data into R. The code book, describing the variable names is here:

Setting the working directory

In [36]:
setwd("C:/Fish/classes/summer_2015/getting_cleaning_data/quizes")


In [37]:
if(!file.exists('q1.csv')){
url<-"https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"

}


Then let's read the data into R

In [38]:
data<-read.csv("q1.csv")


We can perform calculations on the using the data. For example, let's calculate how many properties are worth $1,000,000 or more. From the code book, we can see that the variable 'VAL' is property value and 24 represents properies that worth 1000000+. So let's use the table command to see how many properties are worth what. In [39]: x<-data$VAL
table(x)
hist(x,breaks=24,xlab='class',main='Histogram of property value',col='darkblue')

Out[39]:
x
1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
75  42  33  30  26  29  23  70  99 119 152 199 233 495 483 486 357 502 232 312
21  22  23  24
164 159  47  53 

So, we can see from column 24 that there are 53 properties worth 1000000+ or more

Example 2¶

Now let's see how to read extensible markup language(XML) data which is frequently used to store structured data, and widely used in internet apps. It is known that extracting XML is basis for most of web scraping

Let's read the XML data on Baltimore restaurants from here: https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml.

Then let's see how many restaurants have zipcode 21218, where my office is located.

In [40]:
if(!require(XML)){
install.packages('XML')}


In [41]:
library(XML)

fileURL<-"https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"

fileURL2 <- sub('https', 'http', fileURL)
doc <- xmlTreeParse(fileURL2, useInternal = TRUE)


Now, let's get the rootnode and explore it.

In [42]:
rootNode<-xmlRoot(doc)
xmlName(rootNode)

names(rootNode)

rootNode[[1]][[1]];

Out[42]:
"response"
Out[42]:
row: "row"
Out[42]:
<row _id="1" _uuid="93CACF6F-C8C2-4B87-95A8-8177806D5A6F" _position="1" _address="http://data.baltimorecity.gov/resource/k5ry-ef3g/1">
<name>410</name>
<zipcode>21206</zipcode>
<neighborhood>Frankford</neighborhood>
<councildistrict>2</councildistrict>
<policedistrict>NORTHEASTERN</policedistrict>
</row> 

Now, let's see the number of restaurants with zipcode 21218

In [43]:
zipcode<-xpathSApply(rootNode,"//zipcode",xmlValue)

sum(zipcode=='21218')

Out[43]:
69

So, there are 69 restaurants with zipcode 21218.

We can also see histogram of restaurants by zipcode.

In [44]:
zipcode<-as.numeric(zipcode)

x<-unique(zipcode)
x

Out[44]:
1. 21206
2. 21231
3. 21224
4. 21211
5. 21223
6. 21218
7. 21205
8. 21226
9. 21215
10. 21202
11. 21201
12. 21230
13. 21222
14. 21212
15. 21210
16. 21213
17. 21214
18. 21225
19. 21217
20. 21234
21. 21209
22. 21227
23. 21239
24. 21216
25. 21208
26. 21207
27. 21229
28. 21251
29. 21220
30. 21287
31. -21226
32. 21237

Let's remove the negative value.

In [45]:
zipcode<-zipcode[zipcode>0]


Let's see a histogram of restaurants by zipcode

In [47]:
hist(as.numeric(zipcode),breaks=32,xlab='zipcode',
ylab='Number of restautants',
main='Histogram of number of restaurants in Baltimore, MD, USA',
col='skyblue',border='red')


Example 3¶

Reading rows and columns of our choice.¶

The Excel spreadsheet for this example is from Natural Gas Aquisition Program which can be downloaded from https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FDATA.gov_NGAP.xlsx

In [73]:
if(!file.exists('q3.xlsx')){
url<-"https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FDATA.gov_NGAP.xlsx "

}


Check if the package xlsx is installed, else install it. Then, load xlsx package.

In [74]:
if(!require(xlsx)){
install.packages('xlsx')
}

library(xlsx)


We can read column and row numbers of our interest and do any calculations as shown below.

In [75]:
rowIndex<-18:23
colIndex<- 7:15

sum(dat$Zip*dat$Ext,na.rm=T)