Fisseha Berhane, PhD

Data Scientist

443-970-2353 fisseha@jhu.edu

Web scraping with R using rvest: List of countries by proven oil reserves

In this post, let's see how to scrape information from the web using the rvest R package. Specifically, we will scrape the "List of countries by proven oil reserves" table from Wikipedia and clean it into a two-column data frame of country names and reserves.

In [77]:
library(rvest)
library(dplyr)
library(calibrate)
library(stringi)
In [48]:
wiki= read_html("https://en.wikipedia.org/wiki/List_of_countries_by_proven_oil_reserves")
In [49]:
oil=wiki %>%
  html_nodes("table") %>%
  .[[1]]%>%
  html_table()
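
Picking the table by position (.[[1]]) works only as long as the target is the first table on the page. As a hedged sketch, the same rvest calls can select a table by CSS class instead. The inline HTML below is a made-up stand-in for the real page, and the class names in it ("infobox", "wikitable") are assumptions about the markup, so check the actual page source before relying on them.

```r
library(rvest)

# Hypothetical miniature page: two tables, only the second one is of interest.
html <- '<html><body>
  <table class="infobox"><tr><th>x</th></tr><tr><td>ignore me</td></tr></table>
  <table class="wikitable">
    <tr><th>Country</th><th>Reserves (MMbbl)</th></tr>
    <tr><td>Venezuela</td><td>297,740[2]</td></tr>
    <tr><td>Saudi Arabia</td><td>268,350[2]</td></tr>
  </table>
</body></html>'

page <- read_html(html)  # read_html() also accepts literal HTML, not only URLs

# Select by class rather than by position, then parse the first match.
demo <- page %>%
  html_nodes("table.wikitable") %>%
  .[[1]] %>%
  html_table()

print(demo)
```

Selecting by class tends to survive page edits that add or reorder tables, which a positional index does not.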
In [50]:
oil[1:10,]
Out[50]:
        Country                                              Reserves (MMbbl)
1   —   OPEC                                                 1,112,448 - 1,199,707
2   1   Venezuela (see: Oil reserves in Venezuela)           297,740[2]
3   2   Saudi Arabia (see: Oil reserves in Saudi Arabia)     268,350[2]
4   3   Canada (see: Oil reserves in Canada)                 173,625 - 175,200
5   4   Iran (see: Oil reserves in Iran)                     157,300[3]
6   5   Iraq (see: Oil reserves in Iraq)                     140,300[3]
7   6   Kuwait (see: Oil reserves in Kuwait)                 104,000[2]
8   7   UAE (see: Oil reserves in the United Arab Emirates)  97,800
9   8   Russia (see: Oil reserves in Russia)                 80,000[2]
10  9   Libya (see: Oil reserves in Libya)                   48,014

The table above has a few minor problems; let's fix them step by step. First, let's drop the first column (the rank) and keep only the country and reserves columns.

In [51]:
oil=oil[,2:3]

Renaming columns:

In [52]:
names(oil)=c("country","reserves")

Let's remove the OPEC and world-total rows so that we work only with country-level data. They sit in the first and last rows (1 and 103) of the table.

In [54]:
oil[1:10,]
oil[103,]
Out[54]:
    country                                              reserves
1   OPEC                                                 1,112,448 - 1,199,707
2   Venezuela (see: Oil reserves in Venezuela)           297,740[2]
3   Saudi Arabia (see: Oil reserves in Saudi Arabia)     268,350[2]
4   Canada (see: Oil reserves in Canada)                 173,625 - 175,200
5   Iran (see: Oil reserves in Iran)                     157,300[3]
6   Iraq (see: Oil reserves in Iraq)                     140,300[3]
7   Kuwait (see: Oil reserves in Kuwait)                 104,000[2]
8   UAE (see: Oil reserves in the United Arab Emirates)  97,800
9   Russia (see: Oil reserves in Russia)                 80,000[2]
10  Libya (see: Oil reserves in Libya)                   48,014
Out[54]:
     country                  reserves
103  Total World (2011)[11]   1,481,526
In [55]:
oil =slice(oil,2:102)
oil[1:10,]
Out[55]:
    country                                              reserves
1   Venezuela (see: Oil reserves in Venezuela)           297,740[2]
2   Saudi Arabia (see: Oil reserves in Saudi Arabia)     268,350[2]
3   Canada (see: Oil reserves in Canada)                 173,625 - 175,200
4   Iran (see: Oil reserves in Iran)                     157,300[3]
5   Iraq (see: Oil reserves in Iraq)                     140,300[3]
6   Kuwait (see: Oil reserves in Kuwait)                 104,000[2]
7   UAE (see: Oil reserves in the United Arab Emirates)  97,800
8   Russia (see: Oil reserves in Russia)                 80,000[2]
9   Libya (see: Oil reserves in Libya)                   48,014
10  Nigeria (see: Oil reserves in Nigeria)               37,200
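
The row range 2:102 is tied to the table as it looked when it was scraped; if Wikipedia adds or removes a row, the slice silently keeps or drops the wrong lines. A possible alternative, sketched here on a small hand-made data frame (oil_demo is an illustrative name, with values copied from the output above), is to filter the aggregate rows by their labels.

```r
# Toy data mimicking the scraped table, including the two aggregate rows.
oil_demo <- data.frame(
  country  = c("OPEC",
               "Venezuela (see: Oil reserves in Venezuela)",
               "Saudi Arabia (see: Oil reserves in Saudi Arabia)",
               "Total World (2011)[11]"),
  reserves = c("1,112,448 - 1,199,707", "297,740[2]", "268,350[2]", "1,481,526"),
  stringsAsFactors = FALSE
)

# Flag aggregate rows by label rather than by row number.
is_aggregate <- grepl("OPEC|Total World", oil_demo$country)

# Keep only country-level rows.
countries_only <- oil_demo[!is_aggregate, ]
print(countries_only)
```

The pattern "OPEC|Total World" is an assumption based on the two labels visible in this post; other aggregate rows, if any, would need their own entries.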

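Now, let's apply some regular expressions to tidy the data.

We want only the country names and the corresponding oil reserves, so we first strip the parenthetical "(see: ...)" notes from the country column. The pattern below deletes everything from the first opening parenthesis to the end of the string.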

In [84]:
oil[,1] = gsub("\\((.*)","",oil[,1])
In [57]:
oil[1:10,]
Out[57]:
    country        reserves
1   Venezuela      297,740[2]
2   Saudi Arabia   268,350[2]
3   Canada         173,625 - 175,200
4   Iran           157,300[3]
5   Iraq           140,300[3]
6   Kuwait         104,000[2]
7   UAE            97,800
8   Russia         80,000[2]
9   Libya          48,014
10  Nigeria        37,200

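Now we can remove the leading characters from the country names using the stringi package. stri_sub(x, 3) keeps everything from the third character onward, so the first two characters of each name are dropped.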

In [58]:
oil[,1]=stri_sub(oil[,1],3)
In [59]:
oil[1:10,]
Out[59]:
    country        reserves
1   Venezuela      297,740[2]
2   Saudi Arabia   268,350[2]
3   Canada         173,625 - 175,200
4   Iran           157,300[3]
5   Iraq           140,300[3]
6   Kuwait         104,000[2]
7   UAE            97,800
8   Russia         80,000[2]
9   Libya          48,014
10  Nigeria        37,200
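
Dropping a fixed number of characters assumes every name carries exactly the same prefix. As a minimal base-R sketch, assuming the stray characters are ordinary spaces, the parenthetical note and the surrounding whitespace can be removed without counting positions. The sample strings are made up for illustration.

```r
# Sample country strings with and without notes and padding.
x <- c(" Venezuela (see: Oil reserves in Venezuela)",
       " Saudi Arabia ",
       "UAE")

# Remove an optional run of spaces plus everything from "(" to the end.
no_paren <- sub("[[:space:]]*\\(.*$", "", x)

# Trim whatever ordinary whitespace is left on either side.
clean <- trimws(no_paren)
print(clean)
# "Venezuela" "Saudi Arabia" "UAE"
```

If the leading characters on the real page are non-breaking spaces rather than plain spaces, the default trimws() will not remove them, and a pattern that names that character explicitly would be needed.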

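Now, let's work with the "reserves" column and remove the bracketed footnote markers (such as [2]) and the commas. Note that the first cell below was re-run at the end of the session, as its number In [85] shows, so its output already reflects the comma and range clean-up done in the later steps.

In [85]:
oil[,2] = gsub("\\[(.*)","",oil[,2])  # remove footnote markers such as [2]
oil[1:10,]
Out[85]:
    country        reserves
1   Venezuela      297740
2   Saudi Arabia   268350
3   Canada         173625
4   Iran           157300
5   Iraq           140300
6   Kuwait         104000
7   UAE            97800
8   Russia         80000
9   Libya          48014
10  Nigeria        37200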
In [61]:
oil[,2] = gsub(",","",oil[,2]) # removing commas
In [62]:
oil[1:10,]
Out[62]:
    country        reserves
1   Venezuela      297740
2   Saudi Arabia   268350
3   Canada         173625 - 175200
4   Iran           157300
5   Iraq           140300
6   Kuwait         104000
7   UAE            97800
8   Russia         80000
9   Libya          48014
10  Nigeria        37200

Where a value in the 'reserves' column is a range, we can, for simplicity, keep only the first (lower) value by deleting everything from the hyphen onward.

In [63]:
oil[,2] = gsub("-(.*)","",oil[,2])
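
Keeping only the first value discards the upper bound of each range. If both ends are of interest, one possible variation (not what this post does) is to split on the hyphen and keep a low and a high column. parse_reserve() below is a hypothetical helper written for illustration, applied to a few sample strings taken from the outputs above.

```r
# Sample raw values: footnote marker, range, plain number, decimal.
reserves_raw <- c("297,740[2]", "173,625 - 175,200", "97,800", "0.7")

# Hypothetical helper: returns the lower and upper bound of each entry.
parse_reserve <- function(x) {
  x <- gsub("\\[[^]]*\\]", "", x)       # drop footnote markers such as [2]
  x <- gsub(",", "", x, fixed = TRUE)   # drop thousands separators
  parts <- strsplit(x, "-", fixed = TRUE)
  # First piece is the lower bound; last piece is the upper bound.
  low  <- vapply(parts, function(p) as.numeric(trimws(p[1])), numeric(1))
  high <- vapply(parts, function(p) as.numeric(trimws(p[length(p)])), numeric(1))
  data.frame(low = low, high = high)
}

res <- parse_reserve(reserves_raw)
print(res)
```

For entries that are not ranges, low and high are simply equal, so the low column reproduces the "take the first value" choice made above.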
In [64]:
oil
Out[64]:
     country                            reserves
1    Venezuela                          297740
2    Saudi Arabia                       268350
3    Canada                             173625
4    Iran                               157300
5    Iraq                               140300
6    Kuwait                             104000
7    UAE                                97800
8    Russia                             80000
9    Libya                              48014
10   Nigeria                            37200
11   United States                      36420
12   Kazakhstan                         30002
13   China                              25585
14   Qatar                              25382
15   Brazil                             13986
16   Angola                             10470
17   Mexico                             10364
18   Algeria                            9940
19   Azerbaijan                         7000
20   Ecuador                            7000
21   Norway                             6900
22   United Kingdom                     6900
23   European Union                     6700
24   Malaysia                           5800
25   India                              5650
26   Oman                               5500
27   Ghana                              5000
28   Egypt                              4500
29   Vietnam                            4400
30   Australia                          4158
31   Indonesia                          3990
32   Gabon                              3700
33   Yemen                              3000
34   Sudan                              2800
35   Syria                              2500
36   Mongolia                           2493
37   Colombia                           2377
38   Congo, Republic of the             1940
39   Equatorial Guinea                  1705
40   Chad                               1500
41   Peru                               1240
42   Brunei                             1200
43   Uganda                             1000
44   Denmark                            900
45   Trinidad and Tobago                830
46   Romania                            650
47   Turkmenistan                       600
48   Uzbekistan                         594
49   East Timor                         554
50   Argentina                          2330
51   Thailand                           442
52   Tunisia                            425
53   Italy                              400
54   Ukraine                            395
55   Pakistan                           313
56   Netherlands                        310
57   Germany                            276
58   Turkey                             262
59   Cameroon                           200
60   Bolivia                            200
61   Albania                            199
62   Belarus                            198
63   Congo, Democratic Republic of the  180
64   Cuba                               124
65   Papua New Guinea                   170
66   Philippines                        168
67   New Zealand                        166
68   Chile                              150
69   Spain                              150
70   Bahrain                            125
71   France                             101
72   Ivory Coast                        100
73   Mauritania                         100
74   Poland                             96
75   Austria                            89
76   Guatemala                          83
77   Afghanistan                        80
78   Suriname                           79
79   Serbia                             77
80   Croatia                            66
81   Burma                              50
82   Japan                              44
83   Kyrgyzstan                         40
84   Georgia                            35
85   Hungary                            26
86   Bangladesh                         28
87   Bulgaria                           15
88   South Africa                       15
89   Czech Republic                     15
90   Lithuania                          12
91   Tajikistan                         12
92   Greece                             10
93   Slovakia                           9
94   Benin                              8
95   Belize                             7
96   Taiwan                             2
97   Israel                             2
98   Barbados                           2
99   Jordan                             1
100  Morocco                            0.7
101  Ethiopia                           0.4

Now we can work with this data frame. We can produce a world map that shows oil reserves using ggplot2, or do other types of analysis.
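
One caveat before any analysis: every step above used gsub(), so the reserves column is still text, and some entries keep stray spaces. A minimal sketch of the final conversion, on a small made-up data frame (oil_demo and the share column are illustrative names, not part of the original code), could look like this.

```r
# Toy version of the cleaned table; note the trailing spaces and text values.
oil_demo <- data.frame(
  country  = c("Saudi Arabia ", "Venezuela ", "Morocco", "Canada "),
  reserves = c("268350", "297740", "0.7", "173625 "),
  stringsAsFactors = FALSE
)

# Trim whitespace and convert the reserves to numbers.
oil_demo$country  <- trimws(oil_demo$country)
oil_demo$reserves <- as.numeric(trimws(oil_demo$reserves))

# Share of the toy total, in percent, and a descending sort.
oil_demo$share <- round(100 * oil_demo$reserves / sum(oil_demo$reserves), 1)
oil_demo <- oil_demo[order(-oil_demo$reserves), ]
print(oil_demo)
```

With a numeric column, sorting, summing, and mapping the values to a colour scale in ggplot2 all behave as expected; on the character column they would sort alphabetically or fail.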

See my previous post for how to produce a global map similar to the one shown below.



Source: Wikipedia

