Fisseha Berhane, PhD

Data Scientist


Google Scholar Scraping with R



In this post, I will show how to scrape Google Scholar. In particular, we will use the 'rvest' R package to scrape the Google Scholar account of my PhD advisor. We will see his coauthors, how many times they have been cited, and their affiliations.

"rvest, inspired by libraries like beautiful soup, makes it easy to scrape (or harvest) data from html web pages", wrote Hadley Wickham on the RStudio Blog. Since it is designed to work with magrittr, we can express complex operations as elegant pipelines composed of simple, easily understood pieces of code.

Load required libraries

We will use ggplot2 to create plots.

In [169]:
library(rvest)
library(ggplot2)

How many times have his papers been cited?

Let's use SelectorGadget to find out which css selector matches the "cited by" column.

Use read_html() to parse the html page.

In [170]:
page <- read_html("https://scholar.google.com/citations?user=sTR9SIQAAAAJ&hl=en&oi=ao")

Specify the css selector in html_nodes() and extract the text with html_text(). Finally, change the string to numeric using as.numeric().

In [171]:
citations <- page %>%
  html_nodes("#gsc_a_b .gsc_a_c") %>%
  html_text() %>%
  as.numeric()
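To make the pipeline idea concrete, here is a minimal sketch on toy numbers (the vector `x` and the variable names are made up for illustration). It shows that a `%>%` pipeline is just another way of writing nested function calls, read left to right instead of inside-out.

```r
library(magrittr)  # provides %>%; rvest re-exports the same operator

x <- c(1, 4, 9)    # toy data, not from Google Scholar

# Nested calls are read from the inside out ...
nested <- sum(sqrt(x))

# ... while the pipeline reads left to right, one step at a time.
piped <- x %>% sqrt() %>% sum()

identical(nested, piped)
```

The scraping pipeline in this post follows the same pattern: the parsed page is passed to html_nodes(), its result to html_text(), and so on.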

See the number of citations.

In [172]:
citations
Out[172]:
  1. 148
  2. 96
  3. 79
  4. 64
  5. 57
  6. 57
  7. 57
  8. 55
  9. 52
  10. 50
  11. 48
  12. 37
  13. 34
  14. 33
  15. 30
  16. 28
  17. 26
  18. 25
  19. 23
  20. 22
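To experiment with the selector without querying Google Scholar, one option is to run the same steps on a small hand-written HTML snippet. The markup below is hypothetical: it only imitates the id and class names used by the selector above, and the real page is more complex and may change. The third row is deliberately left empty to show what as.numeric() does with a blank cell.

```r
library(rvest)

# Hypothetical markup that imitates only the id/class names used in this post.
mock_page <- read_html('
  <html><body><table><tbody id="gsc_a_b">
    <tr><td class="gsc_a_t">Paper A</td><td class="gsc_a_c">148</td></tr>
    <tr><td class="gsc_a_t">Paper B</td><td class="gsc_a_c">96</td></tr>
    <tr><td class="gsc_a_t">Paper C</td><td class="gsc_a_c"></td></tr>
  </tbody></table></body></html>
')

mock_citations <- mock_page %>%
  html_nodes("#gsc_a_b .gsc_a_c") %>%
  html_text() %>%
  as.numeric()

mock_citations  # the empty cell is converted to NA
```

If NA values show up in real data, they need to be handled (for example with na.rm = TRUE) before summing or plotting.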

Create a barplot of the number of citations

A plot is worth more than a thousand words.

In [173]:
barplot(citations,main="How many times has each paper been cited?",
        ylab='Number of citations',col="skyblue",xlab="")

Coauthors, their affiliations and how many times they have been cited

My PhD advisor, Ben Zaitchik, is a really smart scientist. He not only has the skills to build networks and collaborate with other scientists, but also intelligence and patience.

Next, let's see his coauthors, their affiliations and how many times they have been cited.

Similarly, we will use SelectorGadget to find out which CSS selector matches the coauthors.

In [174]:
page <- read_html("https://scholar.google.com/citations?view_op=list_colleagues&hl=en&user=sTR9SIQAAAAJ")

Coauthors = page%>% html_nodes(css = ".gsc_1usr_name a")%>%html_text()
Coauthors=as.data.frame(Coauthors)
names(Coauthors)='Coauthors'

Exploring Coauthors

In [175]:
head(Coauthors)  
dim(Coauthors)
Out[175]:
  Coauthors
1 Jason Evans
2 Mutlu Ozdogan
3 Rasmus Houborg
4 M. Tugrul Yilmaz
5 Joseph A. Santanello, Jr.
6 Seth Guikema
Out[175]:
  1. 27
  2. 1

As of today, Google Scholar lists 27 coauthors for him.

How many times have his coauthors been cited?

In [176]:
page <- read_html("https://scholar.google.com/citations?view_op=list_colleagues&hl=en&user=sTR9SIQAAAAJ")

citations = page%>% html_nodes(css = ".gsc_1usr_cby")%>%html_text()
In [177]:
citations
Out[177]:
  1. "Cited by 2228"
  2. "Cited by 1272"
  3. "Cited by 816"
  4. "Cited by 395"
  5. "Cited by 652"
  6. "Cited by 1531"
  7. "Cited by 673"
  8. "Cited by 467"
  9. "Cited by 7967"
  10. "Cited by 3970"
  11. "Cited by 2602"
  12. "Cited by 3468"
  13. "Cited by 3175"
  14. "Cited by 121"
  15. "Cited by 32"
  16. "Cited by 469"
  17. "Cited by 50"
  18. "Cited by 11"
  19. "Cited by 1187"
  20. "Cited by 1451"
  21. "Cited by 12411"
  22. "Cited by 1937"
  23. "Cited by 9"
  24. "Cited by 705"
  25. "Cited by 336"
  26. "Cited by 186"
  27. "Cited by 192"

Let's keep only the numeric part by removing the 'Cited by' prefix with gsub() (global substitute).

In [178]:
citations = gsub('Cited by','',citations)
In [179]:
citations
Out[179]:
  1. " 2228"
  2. " 1272"
  3. " 816"
  4. " 395"
  5. " 652"
  6. " 1531"
  7. " 673"
  8. " 467"
  9. " 7967"
  10. " 3970"
  11. " 2602"
  12. " 3468"
  13. " 3175"
  14. " 121"
  15. " 32"
  16. " 469"
  17. " 50"
  18. " 11"
  19. " 1187"
  20. " 1451"
  21. " 12411"
  22. " 1937"
  23. " 9"
  24. " 705"
  25. " 336"
  26. " 186"
  27. " 192"

Convert the strings to numeric, then to a data frame so that they are easy to use with ggplot2.

In [180]:
citations=as.numeric(citations)
citations=as.data.frame(citations)
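As an aside, the cleaning and conversion steps can be collapsed into one expression. The sketch below uses three of the strings shown earlier and compares two variants: removing every non-digit character with a regular expression, and the prefix removal used in this post. The variable names are made up for illustration.

```r
raw <- c("Cited by 2228", "Cited by 1272", "Cited by 9")

# Variant 1: strip every non-digit character, then convert.
digits_only <- as.numeric(gsub("[^0-9]", "", raw))

# Variant 2 (as in this post): drop the prefix; the leading space that
# remains is harmless because as.numeric() ignores surrounding whitespace.
post_style <- as.numeric(gsub("Cited by", "", raw))

identical(digits_only, post_style)
```

The first variant is a little more robust if the label text ever changes, but it would also silently merge digits that are separated by other characters, so it is worth inspecting the raw strings first.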

Affiliation of coauthors

In [181]:
page <- read_html("https://scholar.google.com/citations?view_op=list_colleagues&hl=en&user=sTR9SIQAAAAJ")

affilation = page%>% html_nodes(css = ".gsc_1usr_aff")%>%html_text()
affilation=as.data.frame(affilation)
names(affilation)='Affilation'

Now, let's create a data frame that consists of coauthors, citations and affiliations.

In [182]:
cauthors=cbind(Coauthors,citations,affilation)
cauthors
Out[182]:
    Coauthors                            citations  Affilation
1   Jason Evans                          2228       University of New South Wales
2   Mutlu Ozdogan                        1272       Assistant Professor of Environmental Science and Forest Ecology, University of Wisconsin
3   Rasmus Houborg                       816        Research Scientist at King Abdullah University of Science and Technology
4   M. Tugrul Yilmaz                     395        Assistant Professor, Civil Engineering Department, Middle East Technical University, Turkey
5   Joseph A. Santanello, Jr.            652        NASA-GSFC Hydrological Sciences Laboratory
6   Seth Guikema                         1531       Johns Hopkins University
7   Christopher Hain                     673        Assistant Research Scientist, University of Maryland
8   John D. Bolten                       467        NASA
9   Sujay Kumar                          7967       Hydrological sciences laboratory, NASA Goddard Space Flight Center
10  Wade Crow                            3970       USDA Hydrology and Remote Sensing Laboratory
11  Tonie van Dam                        2602       Professor University of Luxemburg
12  Christa Peters-Lidard                3468       Deputy Director for Hydrospheric and Biospheric Sciences, NASA Goddard Space Flight …
13  HM van Es                            3175       cornell university
14  Amin Dezfuli                         121        Johns Hopkins University
15  Erin Urquhart                        32         ORISE Postdoctoral Fellow
16  Matthew Hoffman                      469        Assistant Professor of Mathematical Sciences, Rochester Institute of Technology
17  Weston Buckley Anderson              50         Columbia University
18  Hamada Badr                          11         PhD Candidate at Johns Hopkins University
19  Jeremy Foltz                         1187       Professor of Ag. & Applied Economics, University of Wisconsin-Madison
20  Francisco Olivera                    1451       Texas A&M University
21  Stan D. Wullschleger                 12411      Oak Ridge National Laboratory
22  William Pan                          1937       Duke University, Nicholas School of Environment & Global Health Institute
23  Fisseha Berhane                      9          PhD Candidate, Johns Hopkins University
24  Tsegaye Tadesse                      705        Associate Professor/ Climatologist-Remote Sensing Expert, National Drought Mitigation …
25  Denis Valle                          336        Assistant Professor, University of Florida
26  Sauleh Siddiqui                      186        Assistant Professor of Civil Engineering, Johns Hopkins University
27  Partha Sarathi Bhattacharjee, Ph.D.  192        Support Scientist
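The same kind of table can also be assembled from a single parsed page, rather than re-reading the URL for each column. The sketch below is only an illustration under stated assumptions: the HTML is a hand-written mock that imitates the three class names used in this post (the real page structure may differ), and `grab()` is a small hypothetical helper, not an rvest function.

```r
library(rvest)

# Hypothetical markup imitating the class names used in this post.
mock_page <- read_html('
  <html><body>
    <div class="gsc_1usr">
      <h3 class="gsc_1usr_name"><a href="#">Jason Evans</a></h3>
      <div class="gsc_1usr_aff">University of New South Wales</div>
      <div class="gsc_1usr_cby">Cited by 2228</div>
    </div>
    <div class="gsc_1usr">
      <h3 class="gsc_1usr_name"><a href="#">Seth Guikema</a></h3>
      <div class="gsc_1usr_aff">Johns Hopkins University</div>
      <div class="gsc_1usr_cby">Cited by 1531</div>
    </div>
  </body></html>
')

# Hypothetical helper: text of every node matching a CSS selector.
grab <- function(page, css) page %>% html_nodes(css) %>% html_text()

# Parse once, extract three columns, and build the data frame in one step.
mock_cauthors <- data.frame(
  Coauthors  = grab(mock_page, ".gsc_1usr_name a"),
  citations  = as.numeric(gsub("Cited by", "", grab(mock_page, ".gsc_1usr_cby"))),
  Affilation = grab(mock_page, ".gsc_1usr_aff"),
  stringsAsFactors = FALSE
)
mock_cauthors
```

Note that data.frame() stops with an error if the three vectors have different lengths, which would flag a coauthor entry that is missing one of the fields.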

Re-order coauthors based on their citations

Let's re-order the coauthors by citation count so that the bars in our plot appear in sorted order.

In [183]:
cauthors$Coauthors <- factor(cauthors$Coauthors, levels = 
                cauthors$Coauthors[order(cauthors$citations, decreasing = FALSE)])
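To see what this re-ordering does, here is a small sketch on a toy data frame that uses three of the rows shown above (`toy` is a made-up name). ggplot2 draws a discrete axis in the order of the factor levels, so setting the levels by increasing citation count controls the order of the bars.

```r
toy <- data.frame(
  Coauthors = c("Jason Evans", "Fisseha Berhane", "Stan D. Wullschleger"),
  citations = c(2228, 9, 12411),
  stringsAsFactors = FALSE
)

# Same idiom as above: factor levels sorted by increasing citation count.
toy$Coauthors <- factor(
  toy$Coauthors,
  levels = toy$Coauthors[order(toy$citations, decreasing = FALSE)]
)

levels(toy$Coauthors)
```

The least cited coauthor becomes the first level. With coord_flip(), the first level is drawn at the bottom, so the most cited coauthor ends up at the top of the chart.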
In [21]:
ggplot(cauthors, aes(Coauthors, citations)) +
  geom_bar(stat = "identity", fill = "#ff8c1a", size = 5) +
  theme(axis.title.y = element_blank()) +
  ylab("# of citations") +
  theme(plot.title = element_text(size = 18, colour = "blue"),
        axis.text.y = element_text(colour = "grey20", size = 12)) +
  ggtitle('Citations of his coauthors') +
  coord_flip()

He has published with scientists who have been cited more than 12,000 times and with students like me who are just toddling.

Summary

In this post, we saw how to scrape Google Scholar. We scraped my advisor's account and collected the citation counts of his papers, along with his coauthors, their affiliations, and how many times they have been cited.

As we have seen in this post, it is easy to scrape an HTML page using the rvest R package. SelectorGadget is also useful for finding out which CSS selector matches the data of interest.


Update: My advisor told me that Google Scholar picks up only a minority of his coauthors. Some of the scientists who have published with him, and whom he would expect to be the most cited, don’t show up. Further, the results for some others are counterintuitive (e.g., senior scientists with more publications have fewer Google Scholar citations than their juniors). So, Google Scholar data should be used with caution.

If you have any questions, feel free to post a comment below.




