Fisseha Berhane, PhD

Data Scientist


Top searches associated with each nation

In this post, we will find the top search term associated with each nation. First, we will scrape the list of countries from Wikipedia, including all member states of the United Nations. Then we will use the gtrendsR package to get the data from the Google Trends API.

Load all required packages

In [2]:
library(rvest)
library(dplyr)
library(calibrate)
library(stringi)
library(ggplot2)
library(maps)
library(ggmap)
library(stringr)
library(gtrendsR)

Scrape UN member states from Wikipedia

In [26]:
wiki = read_html("https://en.wikipedia.org/wiki/Member_states_of_the_United_Nations")

# Take the second table on the page and convert it to a data frame
countries = wiki %>%
  html_nodes("table") %>%
  .[[2]] %>%
  html_table(fill = TRUE)

Apply text extraction and regular expressions to remove unnecessary characters

In [ ]:
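countries[,1] = stri_sub(countries[,1], 3)             # drop the first two characters of each name
countries[,1] = gsub("\\(.*\\)", "", countries[,1])    # remove parenthesised text
countries[,1] = gsub("\\[.*\\]", "", countries[,1])    # remove bracketed footnote markers
countries = countries[,1]                              # keep only the column of names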
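
To make the cleaning step concrete, here is a minimal sketch on made-up strings. The raw values are hypothetical (the real Wikipedia cells may look different), `clean_names` is an illustrative helper that is not part of the original code, and base R's `substring()` stands in for `stri_sub()`.

```r
# Hypothetical raw cells, invented for illustration only
raw <- c("  Afghanistan[note 1]", "  Bolivia (Plurinational State of)", "  Chad")

clean_names <- function(x) {
  x <- substring(x, 3)            # drop the first two characters
  x <- gsub("\\(.*\\)", "", x)    # remove parenthesised text
  x <- gsub("\\[.*\\]", "", x)    # remove bracketed footnote markers
  trimws(x)                       # strip any whitespace left behind
}

clean_names(raw)
# [1] "Afghanistan" "Bolivia"     "Chad"
```

The final `trimws()` is an addition: removing "(...)" leaves a trailing space that would otherwise end up in the search query.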

How many countries are members of the UN?

In [59]:
length(countries)
Out[59]:
193

So, there are 193 countries.

Let's see the first 10 countries.

In [61]:
countries[1:10]
Out[61]:
  1. "Afghanistan"
  2. "Albania"
  3. "Algeria"
  4. "Andorra"
  5. "Angola"
  6. "Antigua and Barbuda"
  7. "Argentina"
  8. "Armenia"
  9. "Australia"
  10. "Austria"

We will use the gtrendsR package to get data from the Google Trends API. It needs our Gmail user name and password to connect.

In [6]:
gconnect(username, password)   # username and password are strings holding the Gmail credentials

Top searches for India

Let's first visualize the top searched terms associated with a single country. Let's take India.

In [13]:
data = gtrends('india')
z = as.data.frame(data$searches)
names(z) = c('searches', 'hits')
z = filter(z, hits > 10)   # keep only terms with more than 10 hits

# re-order the terms by hits so that the bars are sorted
z$searches <- factor(z$searches, levels = z$searches[order(z$hits, decreasing = FALSE)])

ggplot(z, aes(searches,hits))+ 
  geom_bar(stat='identity',fill="skyblue",color='black')+ylab('Hits')+
  theme(axis.title.y=element_blank(), axis.text.y = element_text(colour="black",size=10),
        axis.title.x = element_text(colour="blue",size=14),
        axis.text.x = element_text(colour="black",size=14,angle=90,hjust=.5,
                                   vjust=.5))+coord_flip()+
  ggtitle('Related top searches for India')+theme(plot.title = element_text(size = 14,colour="blue"))
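
The re-ordering step is what makes the bars appear sorted: ggplot2 draws a factor in the order of its levels, not in the order of the rows. A minimal sketch with made-up numbers (the terms and hit counts below are invented, not Google Trends output):

```r
# Made-up data for illustration only
z <- data.frame(searches = c("india news", "india map", "india time"),
                hits = c(100, 40, 65),
                stringsAsFactors = FALSE)

# Set the factor levels in increasing order of hits
z$searches <- factor(z$searches, levels = z$searches[order(z$hits)])

levels(z$searches)
# [1] "india map"  "india time" "india news"
```

With `coord_flip()`, the first level is drawn at the bottom, so the term with the most hits ends up at the top of the chart.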

Spatial Distribution of the Searches

We can also look at the spatial distribution of the searches by mapping the hits for each country. The same approach could be extended to animations that show how the searches vary over space and time.

In [25]:
regions = as.data.frame(data$regions)

names(regions)=c('region','hits')

regions$region[regions$region=="United States"] = "USA"

world_map = map_data("world")

world_map =merge(world_map, regions, by="region",all.x = TRUE)

world_map = world_map[order(world_map$group, world_map$order),]

g=ggplot(world_map, aes(x=long, y=lat, group=group))+
  geom_polygon(aes(fill=hits), color="gray70") 

g+theme(axis.text.y   = element_blank(),
        axis.text.x   = element_blank(),
        axis.title.y  = element_blank(),
        axis.title.x  = element_blank(),
        panel.background = element_blank(),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank())+
scale_fill_gradient(low = "skyblue", high = "blue", guide = "colorbar",na.value="white")+
 theme(legend.key.size = unit(1, "cm"),
         legend.title = element_text(size = 12, colour = "blue"),
         legend.title.align=0.3,legend.text = element_text(size = 10))+
         theme(panel.border = element_rect(colour = "gray70", fill=NA, size=0.5))
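
The re-sorting by `group` and `order` after the merge matters because `merge()` does not keep the original row order, and `geom_polygon()` connects points in the order it receives them. A minimal sketch on a toy table (the regions and values are invented for illustration):

```r
# Toy stand-ins for the map data and the hits per region
toy_map <- data.frame(region = c("B", "B", "A", "A"),
                      group  = c(1, 1, 2, 2),
                      order  = c(1, 2, 3, 4),
                      stringsAsFactors = FALSE)
toy_hits <- data.frame(region = "A", hits = 50, stringsAsFactors = FALSE)

# Left join: every map row is kept, regions without hits get NA
merged <- merge(toy_map, toy_hits, by = "region", all.x = TRUE)

# Restore the drawing order
merged <- merged[order(merged$group, merged$order), ]

merged$order
# [1] 1 2 3 4
merged$hits
# [1] NA NA 50 50
```

The `NA` values are the regions with no search data, which is why the plot sets `na.value="white"`.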

Top searched terms associated with each nation

In [ ]:
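searches = c()

# countries is a character vector, so loop over its elements directly
for(i in seq_along(countries)){
    data = gtrends(as.character(countries[i]))
    z = data$searches
    top_1 = paste(as.data.frame(z)[,1][1], collapse=", ")   # top related search term
    searches = c(searches, top_1)
}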
In [9]:
Data=as.data.frame(list(Country=countries, Top_searches=searches))

The data frame Data above contains the top search term associated with each country.
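
With close to 200 queries, a single failed request would stop a plain loop. The sketch below shows one possible way to make the collection step more forgiving. It is not the code used in this post: `collect_top_terms`, `fetch_top` and `fake_fetch` are hypothetical names, and the fetch function is a stand-in so that the example runs without a connection.

```r
collect_top_terms <- function(countries, fetch_top, pause = 0) {
  top <- rep(NA_character_, length(countries))
  for (i in seq_along(countries)) {
    top[i] <- tryCatch(
      fetch_top(countries[i]),
      error = function(e) NA_character_   # record NA and carry on if a query fails
    )
    Sys.sleep(pause)                      # optional delay between requests
  }
  data.frame(Country = countries, Top_searches = top, stringsAsFactors = FALSE)
}

# Stand-in for a real query; it fails for one made-up country
fake_fetch <- function(country) {
  if (country == "Atlantis") stop("no data")
  paste(tolower(country), "map")
}

result <- collect_top_terms(c("Italy", "Atlantis", "Spain"), fake_fetch)
result$Top_searches
# [1] "italy map" NA          "spain map"
```

In real use, `fetch_top` would wrap the gtrends() call from the loop above and return the first search term as a string.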

Top search terms by category

Now, let's group the countries based on the top most searched term.

Countries with top most search terms having "war" in them

In [81]:
Data[sapply(searches, function(x) "war"%in%unlist(str_split(x,' '))),]
Out[81]:
     Country      Top_searches
1    Afghanistan  afghanistan war
79   Iraq         iraq war
189  Vietnam      vietnam war
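
The filter splits each term into words and tests for an exact word match, rather than searching for a substring. A minimal sketch of why that matters, using made-up terms; `has_word` is an illustrative helper, and base R's `strsplit()` is used here in place of `str_split()`:

```r
# TRUE when `word` appears as a whole word in a term
has_word <- function(terms, word) {
  vapply(strsplit(terms, " ", fixed = TRUE),
         function(tokens) word %in% tokens,
         logical(1))
}

# Made-up terms for illustration only
terms <- c("iraq war", "warsaw poland", "star wars")

has_word(terms, "war")
# [1]  TRUE FALSE FALSE
grepl("war", terms)
# [1] TRUE TRUE TRUE
```

A substring search would also pick up "warsaw" and "wars", which is usually not what we want when grouping by keyword.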

Countries with top most search terms referring to an airline

In [82]:
Data[sapply(searches, function(x) "air"%in%unlist(str_split(x,' '))| 
    "airways"%in%unlist(str_split(x,' '))),]
Out[82]:
     Country      Top_searches
13   Bahrain      bahrain air
32   Canada       air canada
89   Kuwait       kuwait airways
109  Mauritius    air mauritius
122  New Zealand  new zealand air
127  Oman         oman air
137  Qatar        qatar airways
151  Serbia       serbia air
186  Uzbekistan   uzbekistan airways

Countries with top most search terms having "hotel" in them

In [43]:
Data[sapply(searches, function(x) "hotel"%in%unlist(str_split(x,' '))),]
Out[43]:
     Country     Top_searches
4    Andorra     hotel andorra
106  Malta       hotel malta
142  Rwanda      hotel rwanda
147  San Marino  san marino hotel
156  Slovenia    hotel slovenia

Countries with top most search terms referring to their ancient history

In [83]:
Data[sapply(searches, function(x) "ancient"%in%unlist(str_split(x,' '))),]
Out[83]:
    Country  Top_searches
52  Egypt    ancient egypt
66  Greece   ancient greece

Countries with top most search terms having "weather" in them

In [84]:
Data[sapply(searches, function(x) "weather"%in%unlist(str_split(x,' '))),]
Out[84]:
    Country   Top_searches
15  Barbados  barbados weather
43  Cyprus    weather cyprus

Countries with top most search terms having "map" in them

In [86]:
Data[sapply(searches, function(x) "map"%in%unlist(str_split(x,' '))),]
Out[86]:
     Country              Top_searches
6    Antigua and Barbuda  antigua map
20   Bhutan               bhutan map
82   Italy                italy map
86   Kazakhstan           kazakhstan map
90   Kyrgyzstan           map kyrgyzstan
115  Morocco              map morocco
131  Papua New Guinea     new guinea map
161  Spain                map spain

What about Africa?

Many people think of Africa as a single country. Others, when they search for information about an African country, add the word "Africa" to the country name. Let's see the countries whose top search term is the name of the country plus "Africa".

In [87]:
Data[sapply(searches, function(x) "africa"%in%unlist(str_split(x,' '))),]
Out[87]:
     Country                   Top_searches
23   Botswana                  botswana africa
33   Central African Republic  central africa republic
48   Djibouti                  africa djibouti
54   Equatorial Guinea         equatorial guinea africa
94   Lesotho                   lesotho africa
108  Mauritania                mauritania africa
116  Mozambique                mozambique africa
118  Namibia                   namibia africa
159  South Africa              india south africa
165  Swaziland                 swaziland africa
183  Tanzania                  tanzania africa
191  Zambia                    zambia africa

As shown in the table above, the top search for many African countries is the country name plus "africa". Of course, "central africa republic" matches only because it is a variant of the country's own name. I do not know why "india south africa" is the top search term associated with South Africa.

Summary

In this post, we looked up the top search term associated with each country. We scraped the list of UN member states from Wikipedia and then grouped the top searches by the words they have in common. The countries linked with war are Iraq, Vietnam and Afghanistan. For several countries, the top search term is an airline, such as air canada or qatar airways. One would expect Egypt and Greece to be searched most for their ancient history, and our results show the same. Eight countries (Italy, Spain, Morocco, etc.) are searched most for their maps.
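
The grouping can also be tallied in one pass instead of one filter per keyword. The sketch below is only an illustration: it uses a handful of terms copied from the tables above rather than the full data frame, and it uses base R's `strsplit()` in place of `str_split()`.

```r
# A few top search terms taken from the tables above
top_terms <- c("afghanistan war", "iraq war", "hotel malta",
               "italy map", "map spain", "air canada")
keywords <- c("war", "hotel", "map")

# Split every term into words once, then count whole-word matches per keyword
tokens <- strsplit(top_terms, " ", fixed = TRUE)
counts <- sapply(keywords, function(k) {
  sum(vapply(tokens, function(tk) k %in% tk, logical(1)))
})

counts
#   war hotel   map 
#     2     1     2 
```

Run on the full vector of top searches, the same tally would give the number of countries in each category.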
