# Fisseha Berhane, PhD

#### Data Scientist


### Web scraping with R using rvest: Population of U.S. states and territories

In this post, we use the rvest package to scrape U.S. population data from Wikipedia and then use ggplot2 to map the population by state.

In [172]:
library(rvest)
library(dplyr)
library(calibrate)
library(stringi)
library(ggplot2)
library(maps)
library(ggmap)

In [173]:
wiki= read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population")

In [174]:
states=wiki %>%
html_nodes("table") %>%
.[[1]]%>%
html_table(fill=T)
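
To see what this pipeline does without touching the network, here is a minimal sketch that parses a small inline HTML string and picks one table by position. The HTML is made up for illustration and is not Wikipedia's markup; `page`, `tables`, and `demo` are hypothetical names.

```r
library(rvest)

# Hypothetical stand-in for a downloaded page: two tables in one document.
page <- read_html('
<html><body>
  <table><tr><th>Note</th></tr><tr><td>navigation box</td></tr></table>
  <table>
    <tr><th>State</th><th>Population</th></tr>
    <tr><td>California</td><td>38,802,500</td></tr>
    <tr><td>Texas</td><td>26,956,958</td></tr>
  </table>
</body></html>')

# html_nodes() returns every <table> on the page; index the result to choose one.
tables <- html_nodes(page, "table")
length(tables)

# html_table() converts the chosen node into a data frame.
demo <- html_table(tables[[2]])
demo
```

Selecting by position, as the post does with `.[[1]]`, is fragile if the page layout changes, so it is worth checking `length(tables)` and the column names after scraping.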

In [175]:
head(states)

Out[175]:
|  | Rank in the fifty states, 2014 | Rank in all states & terri- tories, 2010 | State or territory | Population estimate for July 1, 2014 | Census population, April 1, 2010 | Census population, April 1, 2000 | Total number of seats in U.S. Congress (2013 - 2023) | 2014 Estimated pop. per Congress-ional seat | 2010 Census pop. per House seat[4] | 2000 Census pop. per House seat | Percent of total U.S. pop., 2014[5] | Comparable country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 7000100000000000000♠ 1 | 7000100000000000000♠ 1 | California | 38,802,500 | 37,253,956 | 33,871,648 | 7001550000000000000♠ 55 | 705,500 | 702,905 | 639,088 | 12.17% | Poland |
| 2 | 7000200000000000000♠ 2 | 7000200000000000000♠ 2 | Texas | 26,956,958 | 25,145,561 | 20,851,820 | 7001380000000000000♠ 38 | 709,394 | 698,487 | 651,619 | 8.45% | Afghanistan |
| 3 | 7000300000000000000♠ 3 | 7000400000000000000♠ 4 | Florida | 19,893,297 | 18,801,310 | 15,982,378 | 7001290000000000000♠ 29 | 685,976 | 696,345 | 639,295 | 6.24% | Romania |
| 4 | 7000400000000000000♠ 4 | 7000300000000000000♠ 3 | New York | 19,746,227 | 19,378,102 | 18,976,457 | 7001290000000000000♠ 29 | 680,904 | 717,707 | 654,361 | 6.19% | Romania |
| 5 | 7000500000000000000♠ 5 | 7000500000000000000♠ 5 | Illinois | 12,880,580 | 12,830,632 | 12,419,293 | 7001200000000000000♠ 20 | 644,029 | 712,813 | 653,647 | 4.04% | Zimbabwe |
| 6 | 7000600000000000000♠ 6 | 7000600000000000000♠ 6 | Pennsylvania | 12,787,209 | 12,702,379 | 12,281,054 | 7001200000000000000♠ 20 | 639,360 | 705,688 | 646,371 | 4.01% | Zimbabwe |

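Now we can use the stringi package to fix some minor problems: the rank and seat columns carry a long sort-key prefix, and the state and country names begin with stray whitespace characters.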

In [176]:
states[,1]=stri_sub(states[,1],22)
states[,2]=stri_sub(states[,2],22)
states[,7]=stri_sub(states[,7],22)
states[,3]=stri_sub(states[,3],3)
states[,12]=stri_sub(states[,12],3)
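
The fixed offsets above depend on every cell having a prefix of exactly the same length. As a hedged alternative (a swapped-in technique, not what the post does), the same cleanup can be sketched with base R patterns. The sample strings below are made up to mimic the scraped cells.

```r
# Made-up sample values that mimic the scraped rank cells: a long numeric
# sort key, a spade separator, then the visible rank.
raw_rank <- c("7000100000000000000\u2660 1",
              "7000200000000000000\u2660 2",
              "7001230000000000000\u2660 23")

# Greedily drop everything up to the last non-digit character,
# leaving only the trailing digits, then convert to integer.
rank <- as.integer(sub(".*[^0-9]", "", raw_rank))
rank

# Made-up state cells with a leading non-breaking space and a regular space.
raw_state <- c("\u00a0 California", "\u00a0 New York")

# Drop any leading characters that are not ASCII letters.
state <- sub("^[^A-Za-z]+", "", raw_state)
state
```

A pattern keeps working if the prefix length varies, at the cost of assuming that the value of interest is the trailing run of digits (or starts at the first letter).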

In [177]:
head(states)

Out[177]:
|  | Rank in the fifty states, 2014 | Rank in all states & terri- tories, 2010 | State or territory | Population estimate for July 1, 2014 | Census population, April 1, 2010 | Census population, April 1, 2000 | Total number of seats in U.S. Congress (2013 - 2023) | 2014 Estimated pop. per Congress-ional seat | 2010 Census pop. per House seat[4] | 2000 Census pop. per House seat | Percent of total U.S. pop., 2014[5] | Comparable country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | California | 38,802,500 | 37,253,956 | 33,871,648 | 55 | 705,500 | 702,905 | 639,088 | 12.17% | Poland |
| 2 | 2 | 2 | Texas | 26,956,958 | 25,145,561 | 20,851,820 | 38 | 709,394 | 698,487 | 651,619 | 8.45% | Afghanistan |
| 3 | 3 | 4 | Florida | 19,893,297 | 18,801,310 | 15,982,378 | 29 | 685,976 | 696,345 | 639,295 | 6.24% | Romania |
| 4 | 4 | 3 | New York | 19,746,227 | 19,378,102 | 18,976,457 | 29 | 680,904 | 717,707 | 654,361 | 6.19% | Romania |
| 5 | 5 | 5 | Illinois | 12,880,580 | 12,830,632 | 12,419,293 | 20 | 644,029 | 712,813 | 653,647 | 4.04% | Zimbabwe |
| 6 | 6 | 6 | Pennsylvania | 12,787,209 | 12,702,379 | 12,281,054 | 20 | 639,360 | 705,688 | 646,371 | 4.01% | Zimbabwe |

#### Now, let's apply regular expressions to remove commas.

In [178]:
for(i in 4:10){
states[,i] = gsub(",","",states[,i])
}
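
The loop above can also be written without explicit indexing. The following is a minimal sketch on a toy data frame; the column names `pop_2014` and `pop_2010` are hypothetical, and the values are taken from the table shown earlier.

```r
# Toy data frame with thousands separators, standing in for columns 4 to 10.
demo <- data.frame(state    = c("California", "Texas"),
                   pop_2014 = c("38,802,500", "26,956,958"),
                   pop_2010 = c("37,253,956", "25,145,561"),
                   stringsAsFactors = FALSE)

num_cols <- c("pop_2014", "pop_2010")

# fixed = TRUE treats "," as a literal string rather than a regular expression.
demo[num_cols] <- lapply(demo[num_cols],
                         function(x) gsub(",", "", x, fixed = TRUE))

# Confirm that no commas remain before converting to numbers later on.
any(grepl(",", unlist(demo[num_cols]), fixed = TRUE))
demo
```

Because the comma is a plain character here, `fixed = TRUE` is enough; a real regular expression is only needed when the separator varies.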

In [179]:
head(states)

Out[179]:
|  | Rank in the fifty states, 2014 | Rank in all states & terri- tories, 2010 | State or territory | Population estimate for July 1, 2014 | Census population, April 1, 2010 | Census population, April 1, 2000 | Total number of seats in U.S. Congress (2013 - 2023) | 2014 Estimated pop. per Congress-ional seat | 2010 Census pop. per House seat[4] | 2000 Census pop. per House seat | Percent of total U.S. pop., 2014[5] | Comparable country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | California | 38802500 | 37253956 | 33871648 | 55 | 705500 | 702905 | 639088 | 12.17% | Poland |
| 2 | 2 | 2 | Texas | 26956958 | 25145561 | 20851820 | 38 | 709394 | 698487 | 651619 | 8.45% | Afghanistan |
| 3 | 3 | 4 | Florida | 19893297 | 18801310 | 15982378 | 29 | 685976 | 696345 | 639295 | 6.24% | Romania |
| 4 | 4 | 3 | New York | 19746227 | 19378102 | 18976457 | 29 | 680904 | 717707 | 654361 | 6.19% | Romania |
| 5 | 5 | 5 | Illinois | 12880580 | 12830632 | 12419293 | 20 | 644029 | 712813 | 653647 | 4.04% | Zimbabwe |
| 6 | 6 | 6 | Pennsylvania | 12787209 | 12702379 | 12281054 | 20 | 639360 | 705688 | 646371 | 4.01% | Zimbabwe |

Let's convert the column names into syntactically valid R names, so that columns can be referenced with `$` without quoting.

In [180]:
names(states)= make.names(names(states))
names(states)

Out[180]:
1. "Rank.in.the.fifty.states..2014"
2. "Rank.in.all.states...terri..tories..2010"
3. "State.or.territory"
4. "Population.estimate.for.July.1..2014"
5. "Census.population..April.1..2010"
6. "Census.population..April.1..2000"
7. "Total.number.of.seats.in.U.S..Congress..2013...2023."
8. "X2014.Estimated.pop..per.Congress.ional.seat"
9. "X2010.Census.pop..per.House.seat.4."
10. "X2000.Census.pop..per.House.seat"
11. "Percent.of.total.U.S..pop...2014.5."
12. "Comparable.country"
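
To make the renaming rules concrete, here is a small sketch that runs `make.names()` on three of the original headers. The vector names `original` and `valid` are hypothetical.

```r
# Three of the scraped column headers, copied from the table above.
original <- c("State or territory",
              "Census population, April 1, 2010",
              "2014 Estimated pop. per Congress-ional seat")

# Invalid characters (spaces, commas, hyphens) become dots, and a name that
# starts with a digit gets an "X" prefix.
valid <- make.names(original)
valid
```

The doubled dots in names such as `Census.population..April.1..2010` come from a comma followed by a space, each replaced by its own dot.
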
In [181]:
statesMap = map_data("state")
str(statesMap)

'data.frame':	15537 obs. of  6 variables:
 $ long     : num  -87.5 -87.5 -87.5 -87.5 -87.6 ...
 $ lat      : num  30.4 30.4 30.4 30.3 30.3 ...
 $ group    : num  1 1 1 1 1 1 1 1 1 1 ...
 $ order    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ region   : chr  "alabama" "alabama" "alabama" "alabama" ...
 $ subregion: chr  NA NA NA NA ...

• Now, let’s create a new variable called region with lowercase names to match the statesMap.
In [182]:
states$region = tolower(states$State.or.territory)

• We have to join the statesMap data and the population data into one data frame to use ggplot2.
In [183]:
statesMap = merge(statesMap, states, by="region",all.x=T)
str(statesMap)

'data.frame':	15537 obs. of  18 variables:
 $ region                                              : chr  "alabama" "alabama" "alabama" "alabama" ...
 $ long                                                : num  -88.2 -88.2 -88.2 -88.2 -88.1 ...
 $ lat                                                 : num  35 35 34.3 34.5 34.6 ...
 $ group                                               : num  1 1 1 1 1 1 1 1 1 1 ...
 $ order                                               : int  88 89 82 83 84 85 86 87 74 75 ...
 $ subregion                                           : chr  NA NA NA NA ...
 $ Rank.in.the.fifty.states..2014                      : chr  " 23" " 23" " 23" " 23" ...
 $ Rank.in.all.states...terri..tories..2010            : chr  " 23" " 23" " 23" " 23" ...
 $ State.or.territory                                  : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
 $ Population.estimate.for.July.1..2014                : chr  "4849377" "4849377" "4849377" "4849377" ...
 $ Census.population..April.1..2010                    : chr  "4779736" "4779736" "4779736" "4779736" ...
 $ Census.population..April.1..2000                    : chr  "4447100" "4447100" "4447100" "4447100" ...
 $ Total.number.of.seats.in.U.S..Congress..2013...2023.: chr  " 9" " 9" " 9" " 9" ...
 $ X2014.Estimated.pop..per.Congress.ional.seat        : chr  "538820" "538820" "538820" "538820" ...
 $ X2010.Census.pop..per.House.seat.4.                 : chr  "682819" "682819" "682819" "682819" ...
 $ X2000.Census.pop..per.House.seat                    : chr  "635300" "635300" "635300" "635300" ...
 $ Percent.of.total.U.S..pop...2014.5.                 : chr  "1.52%" "1.52%" "1.52%" "1.52%" ...
 $ Comparable.country                                  : chr  "Central African Republic" "Central African Republic" "Central African Republic" "Central African Republic" ...
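
The join can be checked on a toy example. The sketch below uses two small made-up data frames (`map_df` standing in for statesMap, `pop_df` for the scraped table) to show how `all.x = TRUE` behaves and how to spot keys that fail to match; the population figures for Alabama and Texas are taken from the output above.

```r
# Toy stand-ins: map_df plays the role of statesMap, pop_df the scraped table.
map_df <- data.frame(region = c("texas", "texas", "alabama", "wyoming"),
                     order  = c(1, 2, 3, 4),
                     stringsAsFactors = FALSE)
pop_df <- data.frame(State = c("Alabama", "Texas"),
                     pop   = c(4849377, 26956958),
                     stringsAsFactors = FALSE)

# Lowercase key so that it matches the map's region names.
pop_df$region <- tolower(pop_df$State)

# Keys present in the map but missing from the (deliberately incomplete) toy table.
missing_keys <- setdiff(unique(map_df$region), pop_df$region)
missing_keys

# all.x = TRUE keeps every map row; unmatched regions get NA in the new columns.
merged <- merge(map_df, pop_df, by = "region", all.x = TRUE)
merged
```

Note that `merge()` sorts the result by the key column, which is why the row order of the map data changes after the join.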

In [190]:
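# Columns 10-12 (population counts) and 14-16 (population per seat) are still
# character vectors after the merge; convert them to numeric for plotting.
x=c(10,11,12,14,15,16)
for (i in x){
statesMap[,i]=as.numeric(statesMap[,i])
}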
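
Numeric positions break silently if the column order changes, so converting by name can be more robust. This is a minimal sketch on a toy data frame with hypothetical column names; it also shows how the percent column, which the post leaves as text, could be converted.

```r
# Toy data frame with values copied from the table above.
demo <- data.frame(pop_2014 = c("38802500", "26956958"),
                   pct_2014 = c("12.17%", "8.45%"),
                   stringsAsFactors = FALSE)

# Convert by column name so the code survives a change in column order.
demo$pop_2014 <- as.numeric(demo$pop_2014)

# The percent column needs its "%" sign removed before conversion.
demo$pct_2014 <- as.numeric(sub("%", "", demo$pct_2014, fixed = TRUE))

str(demo)
```

`as.numeric()` returns `NA` with a warning for strings it cannot parse, so a warning at this step usually points to leftover formatting characters.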

• Now, let's reorder the data
In [185]:
statesMap = statesMap[order(statesMap$group, statesMap$order),]
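
Row order matters because `geom_polygon()` connects vertices in the order in which they appear in the data. The sketch below shows the two-key sort on a few made-up vertices; `pts` is a hypothetical name.

```r
# Toy polygon vertices in scrambled order, as they might be after merge().
pts <- data.frame(group = c(2, 1, 1, 2, 1),
                  order = c(2, 3, 1, 1, 2))

# Sort by polygon id first, then by vertex sequence within each polygon.
pts <- pts[order(pts$group, pts$order), ]
pts
```

Without this step, the state outlines would be drawn with lines jumping between unrelated vertices.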

• Now, we can map the population values by state
In [186]:
ggplot(statesMap, aes(x = long, y = lat, group = group, fill = Census.population..April.1..2010)) +
geom_polygon(color = "black") + scale_fill_gradient(name = "Population 2010",low = "#B8E6E6", high = "darkblue", guide = "colorbar",na.value="white")

Out[186]:
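
The maps in this post use ggplot2's default Cartesian coordinates, which can make the country look stretched. One possible refinement, sketched here on two made-up rectangular "states" with invented fill values, is to add `coord_quickmap()` and a blank theme. This is an optional variation, not part of the original analysis.

```r
library(ggplot2)

# Two made-up rectangular "states" with invented fill values.
toy <- data.frame(
  long  = c(-100, -95, -95, -100,  -95, -90, -90, -95),
  lat   = c(  30,  30,  35,   35,   30,  30,  35,  35),
  group = rep(c(1, 2), each = 4),
  value = rep(c(10, 40), each = 4)
)

p <- ggplot(toy, aes(x = long, y = lat, group = group, fill = value)) +
  geom_polygon(color = "black") +
  scale_fill_gradient(low = "#B8E6E6", high = "darkblue", na.value = "white") +
  coord_quickmap() +   # approximate map aspect ratio instead of a stretched grid
  theme_void()         # hide axes, which carry little meaning on a map
p
```

The same two layers can be appended to any of the plots in this post.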

In [187]:
ggplot(statesMap, aes(x = long, y = lat, group = group, fill = Population.estimate.for.July.1..2014)) +
geom_polygon(color = "black") + scale_fill_gradient(name = "Population 2014",low = "#E6E6B8", high = "#1A4C1A", guide = "colorbar",na.value="white")

Out[187]:

In [191]:
ggplot(statesMap, aes(x = long, y = lat, group = group, fill = (Population.estimate.for.July.1..2014/Census.population..April.1..2000-1)*100)) +
geom_polygon(color = "black") + scale_fill_gradient(name = "% change of census population \n between 2000 and 2014",low = "white", high = "red", guide = "colorbar",na.value="white")

Out[191]:
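
The fill expression in this plot computes the percent change from the 2000 census to the 2014 estimate. As a worked sketch, the same formula can be applied to two rows of the table shown earlier; the column names `est_2014` and `cen_2000` are hypothetical shorthand.

```r
# Values for two states taken from the table shown earlier in the post.
demo <- data.frame(state    = c("California", "Texas"),
                   est_2014 = c(38802500, 26956958),
                   cen_2000 = c(33871648, 20851820))

# Percent change from the 2000 census to the 2014 estimate.
demo$pct_change <- (demo$est_2014 / demo$cen_2000 - 1) * 100
round(demo$pct_change, 1)
```

Storing the result as its own column before plotting keeps the `aes()` call short and makes the values easy to inspect.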

In [189]:
ggplot(statesMap, aes(x = long, y = lat, group = group, fill = X2010.Census.pop..per.House.seat.4.)) +
geom_polygon(color = "black") + scale_fill_gradient(name = "Census population \n per house seat 2010",low = "white", high = "black", guide = "colorbar",na.value="white")

Out[189]: