In this post, let's see how to scrape information from the web using the rvest R package. Specifically, we will scrape the "List of countries by proven oil reserves" table from Wikipedia.
library(rvest)
library(dplyr)
library(calibrate)
library(stringi)
wiki = read_html("https://en.wikipedia.org/wiki/List_of_countries_by_proven_oil_reserves")
oil = wiki %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table()
oil[1:10,]
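To see what html_nodes() and html_table() return without depending on the live page, here is a minimal sketch that runs the same pipeline on an inline HTML string. The table rows are placeholders invented for illustration, not Wikipedia's figures, and the object names `page` and `toy` are hypothetical.

```r
library(rvest)

# A tiny inline page; the rows are made-up placeholders, not real data.
page <- read_html('
  <table>
    <tr><th>Rank</th><th>Country</th><th>Reserves</th></tr>
    <tr><td>1</td><td>Alpha (see note)</td><td>1,000[1]</td></tr>
    <tr><td>2</td><td>Beta</td><td>500-700</td></tr>
  </table>')

# html_nodes() returns every <table> on the page; [[1]] picks the first one,
# and html_table() turns it into a data frame with one row per <tr>.
toy <- page %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table()

toy
```

If the page layout changes and the table of interest is no longer the first one, the index in `.[[1]]` would need to change accordingly.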
The table above has a few minor problems, so let's fix them step by step. First, let's keep only the second and third columns, which hold the country names and the reserves.
oil=oil[,2:3]
Renaming columns:
names(oil)=c("country","reserves")
Let's remove the OPEC and World rows so that we work only with country-level data.
oil[1:10,]
oil[103,]
oil =slice(oil,2:102)
oil[1:10,]
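The row positions passed to slice() above are tied to the table as it looked when it was scraped. As a minimal sketch of the same idea on a toy data frame (placeholder names and values, not real data), the last position can be computed from nrow() instead of being typed in:

```r
library(dplyr)

# Toy data frame: the first and last rows play the role of summary rows.
toy <- data.frame(
  country  = c("Summary A", "Alpha", "Beta", "Summary B"),
  reserves = c("9", "1", "2", "3"),
  stringsAsFactors = FALSE
)

# slice() keeps rows by position: here everything except the first and last row.
kept <- slice(toy, 2:(nrow(toy) - 1))
kept
```

It is still worth inspecting the first and last rows by eye, as the post does, before deciding which positions to drop.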
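We just want country names and the corresponding oil reserves, so let's drop everything from the opening parenthesis onward in the country column.
oil[[1]] = gsub("\\((.*)","",oil[[1]]) # [[1]] extracts the column as a vector, which also works if html_table() returns a tibble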
oil[1:10,]
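Here is a small sketch of what that pattern does, on a made-up character vector. The pattern matches the first opening parenthesis and everything after it, so the space before the parenthesis is left behind; trimws() is one way to remove it.

```r
# Placeholder country strings, invented for illustration.
x <- c("Alpha (see: note)", "Beta")

# Delete the first "(" and everything that follows it.
cleaned <- gsub("\\((.*)", "", x)
cleaned

# The space before the parenthesis survives; trimws() strips it.
trimmed <- trimws(cleaned)
trimmed
```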
Now, we can remove the first two characters of the country names using the stringi package; stri_sub(x, 3) keeps everything from the third character onward.
oil[[1]]=stri_sub(oil[[1]],3)
oil[1:10,]
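As a quick illustration on toy strings (the "##" prefix is a made-up stand-in for whatever leading characters the scraped names carry), stri_sub() with a start position of 3 drops the first two characters:

```r
library(stringi)

# Placeholder names with two unwanted leading characters each.
x <- c("##Alpha", "##Beta")

# Keep the substring that starts at the third character.
stripped <- stri_sub(x, 3)
stripped

# Base R's substring() gives the same result here.
same <- substring(x, 3)
```

This only works if every name has the same number of unwanted leading characters, so it is worth checking a few rows after running it.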
Now, let's work with the "reserves" column and remove the square brackets and the commas.
oil[[2]] = gsub("\\[(.*)","",oil[[2]]) # remove everything from the opening square bracket onward
oil[1:10,]
oil[[2]] = gsub(",","",oil[[2]]) # remove the thousands separators
oil[1:10,]
Where a value in the 'reserves' column is a range, we can, for simplicity, keep only the first value.
oil[[2]] = gsub("-(.*)","",oil[[2]]) # remove the hyphen and everything after it
oil
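The three gsub() calls leave the reserves as text, so one more step is needed before doing arithmetic on them. The sketch below chains the same patterns on made-up strings (not real figures) and then converts the result with as.numeric(). It assumes ranges are written with a plain hyphen; if the page uses an en dash instead, the last pattern would need adjusting.

```r
# Placeholder values invented for illustration, not real reserves.
reserves <- c("1,234[5]", "2,000-3,000", "750")

step1 <- gsub("\\[(.*)", "", reserves)  # drop footnote markers such as [5]
step2 <- gsub(",", "", step1)           # drop thousands separators
step3 <- gsub("-(.*)", "", step2)       # keep only the first value of a range

# gsub() always returns character, so convert explicitly.
nums <- as.numeric(step3)
nums
```

Any value that still contains non-numeric characters after cleaning becomes NA with a warning, which is a handy way to spot rows the patterns missed.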
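Now we can work with this data frame. For example, we can produce a world map that shows oil reserves using ggplot2, or do other types of analysis. Note that the reserves column is still stored as text, so convert it with as.numeric() before plotting or computing on it.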