In this post, I will show how to scrape Google Scholar. In particular, we will use the 'rvest' R package to scrape the Google Scholar profile of my PhD advisor. We will see his coauthors, how many times they have been cited, and their affiliations.
"rvest, inspired by libraries like beautiful soup, makes it easy to scrape (or harvest) data from html web pages", wrote Hadley Wickham on RStudio Blog. Since it is designed to work with magrittr, we can express complex operations as elegant pipelines composed of simple and easily understood pieces of code.
We will use ggplot2 to create plots.
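To make the pipeline idea mentioned above concrete, here is a minimal sketch that compares nested calls with the same steps written as a magrittr pipeline. The input strings are made up for illustration and are not scraped data.

```r
library(magrittr)  # provides the %>% pipe used throughout this post

# Made-up strings in the "Cited by N" form (an assumption for illustration)
x <- c("Cited by 12", "Cited by 7", "Cited by 30")

# Nested calls have to be read from the inside out
nested <- sort(as.numeric(gsub("Cited by", "", x)), decreasing = TRUE)

# The same steps as a pipeline read left to right;
# the dot marks where the piped value goes in gsub()
piped <- x %>%
  gsub("Cited by", "", .) %>%
  as.numeric() %>%
  sort(decreasing = TRUE)

identical(nested, piped)
```

Both versions compute the same vector; the pipeline simply puts the steps in the order in which they happen.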
Use read_html() to parse the html page.
page <- read_html("https://scholar.google.com/citations?user=sTR9SIQAAAAJ&hl=en&oi=ao")
Specify the css selector in html_nodes() and extract the text with html_text(). Finally, change the string to numeric using as.numeric().
citations <- page %>% html_nodes("#gsc_a_b .gsc_a_c") %>% html_text() %>% as.numeric()
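Since the live page can change or be unavailable, the same three steps can be tried offline on a small, hand-written HTML snippet. The snippet below is hypothetical: it only mimics the id and class names used in the selector, and the real Google Scholar markup may differ. It also shows what happens when a citation cell is empty.

```r
library(rvest)

# Hypothetical HTML that imitates the structure targeted by the selector
html <- '
<table><tbody id="gsc_a_b">
  <tr><td class="gsc_a_t">Paper A</td><td class="gsc_a_c">120</td></tr>
  <tr><td class="gsc_a_t">Paper B</td><td class="gsc_a_c">45</td></tr>
  <tr><td class="gsc_a_t">Paper C</td><td class="gsc_a_c"></td></tr>
</tbody></table>'

toy_citations <- read_html(html) %>%   # read_html() also accepts an HTML string
  html_nodes("#gsc_a_b .gsc_a_c") %>%  # select the citation cells
  html_text() %>%                      # extract their text
  as.numeric()                         # an empty cell becomes NA

toy_citations
sum(toy_citations, na.rm = TRUE)       # ignore the NA when summarising
```

If a paper has no citation count in the page, the corresponding element is NA, so summaries need `na.rm = TRUE`.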
Printing citations shows the number of citations for each paper.
A plot is worth more than a thousand words.
barplot(citations, main = "How many times has each paper been cited?", ylab = "Number of citations", col = "skyblue", xlab = "")
My PhD advisor, Ben Zaitchik, is a really smart scientist. He has not only the skills to build networks and collaborate with other scientists, but also intelligence and patience.
Next, let's see his coauthors, their affiliations, and how many times they have been cited.
Similarly, we will use SelectorGadget to find out which CSS selector matches the coauthors.
page <- read_html("https://scholar.google.com/citations?view_op=list_colleagues&hl=en&user=sTR9SIQAAAAJ")
Coauthors = page %>% html_nodes(css = ".gsc_1usr_name a") %>% html_text()
Coauthors = as.data.frame(Coauthors)
names(Coauthors) = 'Coauthors'
| | Coauthors |
|---|---|
| 4 | M. Tugrul Yilmaz |
| 5 | Joseph A. Santanello, Jr. |
As of today, he has published with 27 people.
page <- read_html("https://scholar.google.com/citations?view_op=list_colleagues&hl=en&user=sTR9SIQAAAAJ")
citations = page %>% html_nodes(css = ".gsc_1usr_cby") %>% html_text()
citations = gsub('Cited by','',citations)
Change the strings to numeric and then to a data frame to make them easy to use with ggplot2.
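The conversion described above might look like the following sketch. The input strings are illustrative values in the "Cited by N" form, and the names `raw`, `cited` and `cited_df` are hypothetical names chosen for this example, not ones from the original code.

```r
# Illustrative strings in the form returned by the scraping step
raw <- c("Cited by 2228", "Cited by 1272", "Cited by 816")

cited <- gsub("Cited by", "", raw)         # strip the label, leaving " 2228" etc.
cited <- as.numeric(cited)                 # as.numeric() ignores the leading space
cited_df <- data.frame(citations = cited)  # one numeric column for ggplot2

str(cited_df)
```

Keeping the column numeric matters later on, because ordering character strings would sort "816" after "2228".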
page <- read_html("https://scholar.google.com/citations?view_op=list_colleagues&hl=en&user=sTR9SIQAAAAJ")
affilation = page %>% html_nodes(css = ".gsc_1usr_aff") %>% html_text()
affilation = as.data.frame(affilation)
names(affilation) = 'Affilation'
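The plotting code further down uses a data frame called cauthors with Coauthors and citations columns. One possible way to build it is sketched below on hypothetical, hand-written HTML that only imitates the class names used in the selectors above; the wrapper markup and the author names are assumptions. The sketch also parses the page once instead of downloading it three times.

```r
library(rvest)

# Hypothetical HTML imitating the class names used by the selectors above
html <- '
<div class="gsc_1usr">
  <h3 class="gsc_1usr_name"><a href="#">Author One</a></h3>
  <div class="gsc_1usr_aff">Example University</div>
  <div class="gsc_1usr_cby">Cited by 100</div>
</div>
<div class="gsc_1usr">
  <h3 class="gsc_1usr_name"><a href="#">Author Two</a></h3>
  <div class="gsc_1usr_aff">Example Lab</div>
  <div class="gsc_1usr_cby">Cited by 25</div>
</div>'

page <- read_html(html)  # parse once, reuse for every selector

cauthors <- data.frame(
  Coauthors  = page %>% html_nodes(".gsc_1usr_name a") %>% html_text(),
  citations  = page %>% html_nodes(".gsc_1usr_cby") %>% html_text() %>%
    gsub("Cited by", "", .) %>% as.numeric(),
  Affilation = page %>% html_nodes(".gsc_1usr_aff") %>% html_text(),
  stringsAsFactors = FALSE
)

cauthors
```

Extracting each column separately assumes that every coauthor entry has all three fields. If one entry lacked a citation count, the columns would have different lengths and `data.frame()` would stop with an error, which is at least a visible failure rather than a silent misalignment.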
| | Coauthors | Citations | Affiliation |
|---|---|---|---|
| 1 | Jason Evans | 2228 | University of New South Wales |
| 2 | Mutlu Ozdogan | 1272 | Assistant Professor of Environmental Science and Forest Ecology, University of Wisconsin |
| 3 | Rasmus Houborg | 816 | Research Scientist at King Abdullah University of Science and Technology |
| 4 | M. Tugrul Yilmaz | 395 | Assistant Professor, Civil Engineering Department, Middle East Technical University, Turkey |
| 5 | Joseph A. Santanello, Jr. | 652 | NASA-GSFC Hydrological Sciences Laboratory |
| 6 | Seth Guikema | 1531 | Johns Hopkins University |
| 7 | Christopher Hain | 673 | Assistant Research Scientist, University of Maryland |
| 8 | John D. Bolten | 467 | NASA |
| 9 | Sujay Kumar | 7967 | Hydrological sciences laboratory, NASA Goddard Space Flight Center |
| 10 | Wade Crow | 3970 | USDA Hydrology and Remote Sensing Laboratory |
| 11 | Tonie van Dam | 2602 | Professor University of Luxemburg |
| 12 | Christa Peters-Lidard | 3468 | Deputy Director for Hydrospheric and Biospheric Sciences, NASA Goddard Space Flight … |
| 13 | HM van Es | 3175 | cornell university |
| 14 | Amin Dezfuli | 121 | Johns Hopkins University |
| 15 | Erin Urquhart | 32 | ORISE Postdoctoral Fellow |
| 16 | Matthew Hoffman | 469 | Assistant Professor of Mathematical Sciences, Rochester Institute of Technology |
| 17 | Weston Buckley Anderson | 50 | Columbia University |
| 18 | Hamada Badr | 11 | PhD Candidate at Johns Hopkins University |
| 19 | Jeremy Foltz | 1187 | Professor of Ag. & Applied Economics, University of Wisconsin-Madison |
| 20 | Francisco Olivera | 1451 | Texas A&M University |
| 21 | Stan D. Wullschleger | 12411 | Oak Ridge National Laboratory |
| 22 | William Pan | 1937 | Duke University, Nicholas School of Environment & Global Health Institute |
| 23 | Fisseha Berhane | 9 | PhD Candidate, Johns Hopkins University |
| 24 | Tsegaye Tadesse | 705 | Associate Professor/ Climatologist-Remote Sensing Expert, National Drought Mitigation … |
| 25 | Denis Valle | 336 | Assistant Professor, University of Florida |
| 26 | Sauleh Siddiqui | 186 | Assistant Professor of Civil Engineering, Johns Hopkins University |
| 27 | Partha Sarathi Bhattacharjee, Ph.D. | 192 | Support Scientist |
Let's re-order the coauthors by their citation counts so that the bars in our plot appear in decreasing order.
cauthors$Coauthors <- factor(cauthors$Coauthors, levels = cauthors$Coauthors[order(cauthors$citations, decreasing = FALSE)])
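To see what this re-ordering does, here is a small sketch on a made-up data frame (the name `toy` and its values are hypothetical). ggplot2 draws a discrete axis in the order of the factor levels, so sorting the levels by citations controls the order of the bars.

```r
# Made-up data for illustration
toy <- data.frame(
  Coauthors = c("A", "B", "C"),
  citations = c(50, 300, 10),
  stringsAsFactors = FALSE
)

# Levels sorted by ascending citations
toy$Coauthors <- factor(
  toy$Coauthors,
  levels = toy$Coauthors[order(toy$citations, decreasing = FALSE)]
)

levels(toy$Coauthors)  # "C" "A" "B"
```

The levels are ascending on purpose: with coord_flip(), the first level is drawn at the bottom and the last at the top, so the most cited coauthor ends up at the top of the chart.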
ggplot(cauthors, aes(Coauthors, citations)) +
  geom_bar(stat = "identity", fill = "#ff8c1a", size = 5) +
  theme(axis.title.y = element_blank()) +
  ylab("# of citations") +
  theme(plot.title = element_text(size = 18, colour = "blue"),
        axis.text.y = element_text(colour = "grey20", size = 12)) +
  ggtitle('Citations of his coauthors') +
  coord_flip()
He has published with scientists who have been cited more than 12000 times and with students like me who are just toddling.
In this post, we saw how to scrape Google Scholar. We scraped the profile of my advisor and got data on the citations of his papers, as well as his coauthors, their affiliations, and how many times they have been cited.
As we have seen in this post, it is easy to scrape an HTML page using the rvest R package. It is also worth noting that SelectorGadget is useful for finding out which CSS selector matches the data we are interested in.
Update: My advisor told me that Google Scholar picks up only a minority of his coauthors. Some of the scientists who have published with him, and whom he would expect to be the most cited, don’t show up. Further, the results for some others are counterintuitive (e.g., senior scientists who have more publications have fewer Google Scholar citations than their juniors). So, Google Scholar data should be used with caution.
If you have any questions, feel free to post a comment below.