Fisseha Berhane, PhD

Data Scientist

443-970-2353 fisseha@jhu.edu

Web scraping with R using rvest: Population of U.S. states and territories

In this post, we will use the rvest web scraping package for R to scrape U.S. population data from Wikipedia, and then use ggplot2 to visualize the population by state.

In [172]:
library(rvest)
library(dplyr)
library(calibrate)
library(stringi)
library(ggplot2)
library(maps)
library(ggmap)
In [173]:
wiki= read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population")
In [174]:
# Parse every <table> node on the page and keep the first one, which holds the population data
states=wiki %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table(fill=TRUE)
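Picking the table by position works only as long as the page layout stays the same. As a minimal sketch of a more defensive selection, the snippet below picks the table by CSS class instead. It assumes rvest 1.0 or later (where html_element() is available) and assumes the target table carries the wikitable class. To stay runnable offline it parses a small made-up HTML string rather than the live page; only the two population figures come from the Wikipedia table used in this post.

```r
library(rvest)

# Made-up page with a decoy table first, to show why position can be fragile
page <- read_html('<html><body>
<table class="infobox"><tr><th>k</th></tr><tr><td>v</td></tr></table>
<table class="wikitable sortable">
<tr><th>State</th><th>Population</th></tr>
<tr><td>California</td><td>38,802,500</td></tr>
<tr><td>Texas</td><td>26,956,958</td></tr>
</table></body></html>')

# Select the first table with class "wikitable" and convert it to a data frame
tbl <- page %>%
  html_element("table.wikitable") %>%
  html_table()

print(tbl)
```

On the real page, the selector would need to be checked against the current markup before relying on it.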
In [175]:
head(states)
Out[175]:
|   | Rank in the fifty states, 2014 | Rank in all states & terri- tories, 2010 | State or territory | Population estimate for July 1, 2014 | Census population, April 1, 2010 | Census population, April 1, 2000 | Total number of seats in U.S. Congress (2013 - 2023) | 2014 Estimated pop. per Congress-ional seat | 2010 Census pop. per House seat[4] | 2000 Census pop. per House seat | Percent of total U.S. pop., 2014[5] | Comparable country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 7000100000000000000♠1 | 7000100000000000000♠1 | California | 38,802,500 | 37,253,956 | 33,871,648 | 7001550000000000000♠55 | 705,500 | 702,905 | 639,088 | 12.17% | Poland |
| 2 | 7000200000000000000♠2 | 7000200000000000000♠2 | Texas | 26,956,958 | 25,145,561 | 20,851,820 | 7001380000000000000♠38 | 709,394 | 698,487 | 651,619 | 8.45% | Afghanistan |
| 3 | 7000300000000000000♠3 | 7000400000000000000♠4 | Florida | 19,893,297 | 18,801,310 | 15,982,378 | 7001290000000000000♠29 | 685,976 | 696,345 | 639,295 | 6.24% | Romania |
| 4 | 7000400000000000000♠4 | 7000300000000000000♠3 | New York | 19,746,227 | 19,378,102 | 18,976,457 | 7001290000000000000♠29 | 680,904 | 717,707 | 654,361 | 6.19% | Romania |
| 5 | 7000500000000000000♠5 | 7000500000000000000♠5 | Illinois | 12,880,580 | 12,830,632 | 12,419,293 | 7001200000000000000♠20 | 644,029 | 712,813 | 653,647 | 4.04% | Zimbabwe |
| 6 | 7000600000000000000♠6 | 7000600000000000000♠6 | Pennsylvania | 12,787,209 | 12,702,379 | 12,281,054 | 7001200000000000000♠20 | 639,360 | 705,688 | 646,371 | 4.01% | Zimbabwe |

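Now, we can use the stringi package to fix some minor problems. Columns 1, 2 and 7 start with a hidden Wikipedia sort key (such as 7000100000000000000♠), and columns 3 and 12 start with two stray characters. stri_sub keeps each string from a given position onward, which drops these prefixes.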

In [176]:
states[,1]=stri_sub(states[,1],22)
states[,2]=stri_sub(states[,2],22)
states[,7]=stri_sub(states[,7],22)
states[,3]=stri_sub(states[,3],3)
states[,12]=stri_sub(states[,12],3)
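Fixed character offsets depend on every sort key having exactly the same length. As a hedged alternative (not the approach used in this post), a regular expression can drop everything up to and including the spade character, whatever the key length. The sketch below uses base R only; the sample values are shaped like the scraped rank and seat columns.

```r
# Example values shaped like the scraped rank and seat columns
raw <- c("7000100000000000000\u26601", "7001550000000000000\u266055")

# Remove everything up to and including the spade, then trim stray whitespace
clean <- trimws(sub("^.*\u2660", "", raw))

# The cleaned strings can now be converted to integers
ranks <- as.integer(clean)
print(ranks)
```

Strings that contain no spade are returned unchanged by sub(), so the same call can be applied to a whole column safely.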
In [177]:
head(states)
Out[177]:
|   | Rank in the fifty states, 2014 | Rank in all states & terri- tories, 2010 | State or territory | Population estimate for July 1, 2014 | Census population, April 1, 2010 | Census population, April 1, 2000 | Total number of seats in U.S. Congress (2013 - 2023) | 2014 Estimated pop. per Congress-ional seat | 2010 Census pop. per House seat[4] | 2000 Census pop. per House seat | Percent of total U.S. pop., 2014[5] | Comparable country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | California | 38,802,500 | 37,253,956 | 33,871,648 | 55 | 705,500 | 702,905 | 639,088 | 12.17% | Poland |
| 2 | 2 | 2 | Texas | 26,956,958 | 25,145,561 | 20,851,820 | 38 | 709,394 | 698,487 | 651,619 | 8.45% | Afghanistan |
| 3 | 3 | 4 | Florida | 19,893,297 | 18,801,310 | 15,982,378 | 29 | 685,976 | 696,345 | 639,295 | 6.24% | Romania |
| 4 | 4 | 3 | New York | 19,746,227 | 19,378,102 | 18,976,457 | 29 | 680,904 | 717,707 | 654,361 | 6.19% | Romania |
| 5 | 5 | 5 | Illinois | 12,880,580 | 12,830,632 | 12,419,293 | 20 | 644,029 | 712,813 | 653,647 | 4.04% | Zimbabwe |
| 6 | 6 | 6 | Pennsylvania | 12,787,209 | 12,702,379 | 12,281,054 | 20 | 639,360 | 705,688 | 646,371 | 4.01% | Zimbabwe |

Now, let's apply regular expressions to remove commas.

In [178]:
for(i in 4:10){
    states[,i] = gsub(",","",states[,i])
}
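The same cleanup can be written without an explicit loop and can convert the result to numeric in one pass. This is a minimal sketch on a toy data frame; the column names est_2014 and census_2010 are made up for illustration, and the figures are the California and Texas values from the table above.

```r
# Toy data frame with comma-formatted counts stored as character
pop <- data.frame(
  state       = c("California", "Texas"),
  est_2014    = c("38,802,500", "26,956,958"),
  census_2010 = c("37,253,956", "25,145,561"),
  stringsAsFactors = FALSE
)

# Columns to clean, chosen by name rather than by position
num_cols <- c("est_2014", "census_2010")

# Strip the commas (fixed = TRUE treats "," as a literal) and convert to numeric
pop[num_cols] <- lapply(pop[num_cols], function(x) {
  as.numeric(gsub(",", "", x, fixed = TRUE))
})

str(pop)
```

Selecting columns by name keeps the code working if the column order of the scraped table changes.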
In [179]:
head(states)
Out[179]:
|   | Rank in the fifty states, 2014 | Rank in all states & terri- tories, 2010 | State or territory | Population estimate for July 1, 2014 | Census population, April 1, 2010 | Census population, April 1, 2000 | Total number of seats in U.S. Congress (2013 - 2023) | 2014 Estimated pop. per Congress-ional seat | 2010 Census pop. per House seat[4] | 2000 Census pop. per House seat | Percent of total U.S. pop., 2014[5] | Comparable country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | California | 38802500 | 37253956 | 33871648 | 55 | 705500 | 702905 | 639088 | 12.17% | Poland |
| 2 | 2 | 2 | Texas | 26956958 | 25145561 | 20851820 | 38 | 709394 | 698487 | 651619 | 8.45% | Afghanistan |
| 3 | 3 | 4 | Florida | 19893297 | 18801310 | 15982378 | 29 | 685976 | 696345 | 639295 | 6.24% | Romania |
| 4 | 4 | 3 | New York | 19746227 | 19378102 | 18976457 | 29 | 680904 | 717707 | 654361 | 6.19% | Romania |
| 5 | 5 | 5 | Illinois | 12880580 | 12830632 | 12419293 | 20 | 644029 | 712813 | 653647 | 4.04% | Zimbabwe |
| 6 | 6 | 6 | Pennsylvania | 12787209 | 12702379 | 12281054 | 20 | 639360 | 705688 | 646371 | 4.01% | Zimbabwe |

The scraped column names contain spaces, commas and brackets, so let's convert them into syntactically valid R names with make.names.

In [180]:
names(states)= make.names(names(states))
names(states)
Out[180]:
  1. "Rank.in.the.fifty.states..2014"
  2. "Rank.in.all.states...terri..tories..2010"
  3. "State.or.territory"
  4. "Population.estimate.for.July.1..2014"
  5. "Census.population..April.1..2010"
  6. "Census.population..April.1..2000"
  7. "Total.number.of.seats.in.U.S..Congress..2013...2023."
  8. "X2014.Estimated.pop..per.Congress.ional.seat"
  9. "X2010.Census.pop..per.House.seat.4."
  10. "X2000.Census.pop..per.House.seat"
  11. "Percent.of.total.U.S..pop...2014.5."
  12. "Comparable.country"
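To see what make.names does to individual headers, here is a small base R sketch using three of the scraped column names. Invalid characters become dots, and a name that starts with a digit gets an X prefix. The unique = TRUE argument is an optional extra that guards against duplicate names.

```r
# Three headers as they appear in the scraped table
raw_names <- c(
  "State or territory",
  "2000 Census pop. per House seat",
  "2010 Census pop. per House seat[4]"
)

# Convert to syntactically valid, unique R names
valid <- make.names(raw_names, unique = TRUE)
print(valid)
```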
In [181]:
statesMap = map_data("state")  
str(statesMap)
'data.frame':	15537 obs. of  6 variables:
 $ long     : num  -87.5 -87.5 -87.5 -87.5 -87.6 ...
 $ lat      : num  30.4 30.4 30.4 30.3 30.3 ...
 $ group    : num  1 1 1 1 1 1 1 1 1 1 ...
 $ order    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ region   : chr  "alabama" "alabama" "alabama" "alabama" ...
 $ subregion: chr  NA NA NA NA ...
  • Now, let’s create a new variable called region with lowercase names to match the statesMap.
In [182]:
states$region = tolower(states$State.or.territory)
  • We have to join the statesMap data and the population data into one data frame to use ggplot2.
In [183]:
statesMap = merge(statesMap, states, by="region",all.x=T)
str(statesMap)
'data.frame':	15537 obs. of  18 variables:
 $ region                                              : chr  "alabama" "alabama" "alabama" "alabama" ...
 $ long                                                : num  -88.2 -88.2 -88.2 -88.2 -88.1 ...
 $ lat                                                 : num  35 35 34.3 34.5 34.6 ...
 $ group                                               : num  1 1 1 1 1 1 1 1 1 1 ...
 $ order                                               : int  88 89 82 83 84 85 86 87 74 75 ...
 $ subregion                                           : chr  NA NA NA NA ...
 $ Rank.in.the.fifty.states..2014                      : chr  " 23" " 23" " 23" " 23" ...
 $ Rank.in.all.states...terri..tories..2010            : chr  " 23" " 23" " 23" " 23" ...
 $ State.or.territory                                  : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
 $ Population.estimate.for.July.1..2014                : chr  "4849377" "4849377" "4849377" "4849377" ...
 $ Census.population..April.1..2010                    : chr  "4779736" "4779736" "4779736" "4779736" ...
 $ Census.population..April.1..2000                    : chr  "4447100" "4447100" "4447100" "4447100" ...
 $ Total.number.of.seats.in.U.S..Congress..2013...2023.: chr  " 9" " 9" " 9" " 9" ...
 $ X2014.Estimated.pop..per.Congress.ional.seat        : chr  "538820" "538820" "538820" "538820" ...
 $ X2010.Census.pop..per.House.seat.4.                 : chr  "682819" "682819" "682819" "682819" ...
 $ X2000.Census.pop..per.House.seat                    : chr  "635300" "635300" "635300" "635300" ...
 $ Percent.of.total.U.S..pop...2014.5.                 : chr  "1.52%" "1.52%" "1.52%" "1.52%" ...
 $ Comparable.country                                  : chr  "Central African Republic" "Central African Republic" "Central African Republic" "Central African Republic" ...
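With all.x=T, map regions that have no match in the population table are kept and filled with NA, so it can be worth checking which regions failed to match. The sketch below shows the idea on toy data in base R. The data frames and the unmatched "ohio" row are made up for illustration; the two population figures are the 2010 census values from this post.

```r
# Toy stand-in for the map outline data
map_pts <- data.frame(
  region = c("alabama", "alabama", "texas", "ohio"),
  order  = 1:4,
  stringsAsFactors = FALSE
)

# Toy stand-in for the scraped population table (no row for Ohio)
pop <- data.frame(
  State    = c("Alabama", "Texas"),
  pop_2010 = c(4779736, 25145561),
  stringsAsFactors = FALSE
)

# Lowercase key so it matches the map data
pop$region <- tolower(pop$State)

# Left join: keep every map row, fill unmatched regions with NA
joined <- merge(map_pts, pop, by = "region", all.x = TRUE)

# Regions that found no population row
unmatched <- unique(joined$region[is.na(joined$pop_2010)])
print(unmatched)
```

In the maps drawn later in this post, unmatched regions would show up in the na.value color.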
In [190]:
# Columns 10-12 and 14-16 hold the population counts, which are still stored as character
x=c(10,11,12,14,15,16)
for (i in x){
   statesMap[,i]=as.numeric(statesMap[,i])
}
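  • merge() does not preserve the row order of statesMap (note the scrambled order column above), so let's sort the data by group and order again. Otherwise the polygon outlines would be drawn out of sequence.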
In [185]:
statesMap = statesMap[order(statesMap$group, statesMap$order),]
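The effect of this sort can be illustrated on a tiny made-up outline in base R. The row shuffle below is only a stand-in for what merge() does to the row order.

```r
# Toy outline: two polygons (groups) with two vertices each
outline <- data.frame(
  region = c("b", "b", "a", "a"),
  group  = c(2, 2, 1, 1),
  order  = c(3, 4, 1, 2)
)

# Simulate a join that scrambles the rows
shuffled <- outline[c(2, 4, 1, 3), ]

# Sort by polygon, then by vertex order within the polygon
restored <- shuffled[order(shuffled$group, shuffled$order), ]
print(restored)
```

After sorting, the vertices of each group are contiguous and in drawing order, which is what geom_polygon relies on.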
  • Now, we can map the population values by state
In [186]:
ggplot(statesMap, aes(x = long, y = lat, group = group, fill = Census.population..April.1..2010)) + 
geom_polygon(color = "black") + scale_fill_gradient(name = "Population 2010",low = "#B8E6E6", high = "darkblue", guide = "colorbar",na.value="white")
Out[186]:

In [187]:
ggplot(statesMap, aes(x = long, y = lat, group = group, fill = Population.estimate.for.July.1..2014)) + 
geom_polygon(color = "black") + scale_fill_gradient(name = "Population 2014",low = "#E6E6B8", high = "#1A4C1A", guide = "colorbar",na.value="white")
Out[187]:

In [191]:
ggplot(statesMap, aes(x = long, y = lat, group = group, fill = (Population.estimate.for.July.1..2014/Census.population..April.1..2000-1)*100)) + 
geom_polygon(color = "black") + scale_fill_gradient(name = "% change in population \n between 2000 (census) and 2014 (estimate)",low = "white", high = "red", guide = "colorbar",na.value="white")
Out[191]:
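The fill value in this map is the percentage change from the 2000 census count to the 2014 estimate. As a small worked example in base R, here is the same formula applied to the California and Texas figures from the table earlier in this post. The short column names are made up for readability.

```r
# 2000 census counts and 2014 estimates taken from the scraped table
pop <- data.frame(
  state         = c("California", "Texas"),
  census_2000   = c(33871648, 20851820),
  estimate_2014 = c(38802500, 26956958)
)

# Same formula as the fill aesthetic: (new / old - 1) * 100
pop$pct_change <- (pop$estimate_2014 / pop$census_2000 - 1) * 100
print(round(pop$pct_change, 1))
```

Computing the value as a named column before plotting, rather than inside aes(), also makes it easier to inspect or reuse.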

In [189]:
ggplot(statesMap, aes(x = long, y = lat, group = group, fill = X2010.Census.pop..per.House.seat.4.)) + 
geom_polygon(color = "black") + scale_fill_gradient(name = "Census population \n per house seat 2010",low = "white", high = "black", guide = "colorbar",na.value="white")
Out[189]:


