In this post, I will show how to scrape Google Scholar. In particular, we will use the 'rvest' R package to scrape the Google Scholar account of my PhD advisor. We will see his coauthors, how many times they have been cited, and their affiliations.
"rvest, inspired by libraries like beautiful soup, makes it easy to scrape (or harvest) data from html web pages", wrote Hadley Wickham on RStudio Blog. Since it is designed to work with magrittr, we can express complex operations as elegant pipelines composed of simple and easily understood pieces of code.
We will use ggplot2 to create plots.
library(rvest)
library(ggplot2)
Let's use SelectorGadget to find out which css selector matches the "cited by" column.
Use read_html() to parse the html page.
page <- read_html("https://scholar.google.com/citations?user=sTR9SIQAAAAJ&hl=en&oi=ao")
Specify the css selector in html_nodes() and extract the text with html_text(). Finally, change the string to numeric using as.numeric().
citations <- page %>% html_nodes("#gsc_a_b .gsc_a_c") %>% html_text() %>% as.numeric()
See the number of citations.
citations
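To see what each step of this pipeline does without hitting Google Scholar, here is a minimal sketch on a made-up HTML fragment. The table below is an assumption that only mimics the `#gsc_a_b` and `.gsc_a_c` selectors used above; the numbers and the `toy_` names are invented for illustration.

```r
library(rvest)

# Hypothetical fragment: three "cited by" cells, the last one empty
# (as it might be for a paper with no citations yet).
toy_html <- '<table><tbody id="gsc_a_b">
  <tr><td class="gsc_a_c">120</td></tr>
  <tr><td class="gsc_a_c">45</td></tr>
  <tr><td class="gsc_a_c"></td></tr>
</tbody></table>'

toy_page <- read_html(toy_html)

# Same three steps as above: select nodes, extract text, convert to numeric.
toy_citations <- toy_page %>%
  html_nodes("#gsc_a_b .gsc_a_c") %>%
  html_text() %>%
  as.numeric()

toy_citations
```

The empty cell comes back as `NA`, so it may be worth checking for missing values before plotting or summing.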
A plot is worth more than a thousand words.
barplot(citations,main="How many times has each paper been cited?",
ylab='Number of citations',col="skyblue",xlab="")
My PhD advisor, Ben Zaitchik, is a really smart scientist. He has not only the skills to build networks and collaborate with other scientists, but also intelligence and patience.
Next, let's see his coauthors, their affiliations, and how many times they have been cited.
Similarly, we will use SelectorGadget to find out which css selector matches the Co-authors.
page <- read_html("https://scholar.google.com/citations?view_op=list_colleagues&hl=en&user=sTR9SIQAAAAJ")
Coauthors = page%>% html_nodes(css = ".gsc_1usr_name a")%>%html_text()
Coauthors=as.data.frame(Coauthors)
names(Coauthors)='Coauthors'
Exploring Coauthors
head(Coauthors)
dim(Coauthors)
As of today, Google Scholar lists 27 coauthors for him.
page <- read_html("https://scholar.google.com/citations?view_op=list_colleagues&hl=en&user=sTR9SIQAAAAJ")
citations = page%>% html_nodes(css = ".gsc_1usr_cby")%>%html_text()
citations
citations = gsub('Cited by','',citations)
citations
Convert the strings to numeric and then to a data frame, which makes them easy to use with ggplot2.
citations=as.numeric(citations)
citations=as.data.frame(citations)
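The cleaning step above can be tried on a toy vector. This is a minimal sketch: the two strings are invented, and `trimws()` is an extra step I am assuming is harmless here, since `as.numeric()` already tolerates surrounding whitespace.

```r
# Hypothetical raw strings, shaped like the text scraped from ".gsc_1usr_cby".
toy_raw <- c("Cited by 12034", "Cited by 87")

# Drop the literal prefix, trim whitespace, then convert to numeric.
toy_clean <- gsub("Cited by", "", toy_raw, fixed = TRUE)
toy_clean <- as.numeric(trimws(toy_clean))

toy_clean
```

Using `fixed = TRUE` treats the pattern as a plain string rather than a regular expression, which is all that is needed for a constant prefix.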
page <- read_html("https://scholar.google.com/citations?view_op=list_colleagues&hl=en&user=sTR9SIQAAAAJ")
affilation = page%>% html_nodes(css = ".gsc_1usr_aff")%>%html_text()
affilation=as.data.frame(affilation)
names(affilation)='Affiliation'
cauthors=cbind(Coauthors,citations,affilation)
cauthors
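Note that cbind() only lines up correctly if every coauthor has all three fields. If one card were missing, say, a citation count, the three vectors would have different lengths. One possible way around this is to extract the fields card by card with html_node() (singular), which returns one result per card and NA where a field is absent. The sketch below uses made-up HTML, and the `.toy_card` container class is hypothetical; the real container on the page would have to be found with SelectorGadget.

```r
library(rvest)

# Hypothetical coauthor cards; the second one has no citation count.
toy_html <- '<div>
  <div class="toy_card">
    <h3 class="gsc_1usr_name"><a>Author One</a></h3>
    <div class="gsc_1usr_aff">University A</div>
    <div class="gsc_1usr_cby">Cited by 500</div>
  </div>
  <div class="toy_card">
    <h3 class="gsc_1usr_name"><a>Author Two</a></h3>
    <div class="gsc_1usr_aff">University B</div>
  </div>
</div>'

cards <- read_html(toy_html) %>% html_nodes(".toy_card")

# html_node() keeps one entry per card, so the columns stay aligned.
cited_text <- cards %>% html_node(".gsc_1usr_cby") %>% html_text()

toy_coauthors <- data.frame(
  Coauthors   = cards %>% html_node(".gsc_1usr_name a") %>% html_text(),
  citations   = as.numeric(gsub("Cited by", "", cited_text, fixed = TRUE)),
  Affiliation = cards %>% html_node(".gsc_1usr_aff") %>% html_text(),
  stringsAsFactors = FALSE
)

toy_coauthors
```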
Let's reorder the coauthors by their citation counts so that the bars in our plot appear in decreasing order.
cauthors$Coauthors <- factor(cauthors$Coauthors, levels =
cauthors$Coauthors[order(cauthors$citations, decreasing = FALSE)])
ggplot(cauthors,aes(Coauthors,citations))+geom_bar(stat="identity", fill="#ff8c1a",size=5)+
theme(axis.title.y = element_blank())+ylab("# of citations")+
theme(plot.title=element_text(size = 18,colour="blue"), axis.text.y = element_text(colour="grey20",size=12))+
ggtitle('Citations of his coauthors')+coord_flip()
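The reordering trick used before the plot can be checked on a toy data frame. This is a small sketch with invented names and counts: the factor levels are set in increasing order of citations, and because coord_flip() draws the first level at the bottom, the most cited coauthor ends up at the top.

```r
# Hypothetical data: three coauthors with made-up citation counts.
toy <- data.frame(
  name  = c("A", "B", "C"),
  cites = c(50, 10, 30),
  stringsAsFactors = FALSE
)

# Order the factor levels by citation count, smallest first.
toy$name <- factor(toy$name,
                   levels = toy$name[order(toy$cites, decreasing = FALSE)])

levels(toy$name)  # "B" "C" "A"
```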
He has published with scientists who have been cited more than 12,000 times and with students like me who are just toddling.
In this post, we saw how to scrape Google Scholar. We scraped the account of my advisor and got data on the citations of his papers, as well as his coauthors, their affiliations, and how many times they have been cited.
As we have seen in this post, it is easy to scrape an HTML page using the rvest R package. It is also worth noting that SelectorGadget is useful for finding out which CSS selector matches the data we are interested in.
Update: My advisor told me that Google Scholar picks up only a minority of his coauthors. Some of the scientists who published with him, and whom he would expect to be the most cited, don’t show up. Further, the results for some others are counterintuitive (e.g., senior scientists who have more publications have fewer Google Scholar citations than their juniors). So, Google Scholar data should be used with caution.
If you have any questions, feel free to post a comment below.