I am practicing web scraping, regular expressions and natural language processing in R. In this post, I will find the most commonly used words in one of my published papers. The paper can be accessed from the Journal of Climate.
I am using the painless 'rvest' R package for web scraping and the 'tm' package for natural language processing. The 'stringr' package is used to tidy the text (remove unwanted characters).
library(rvest)
library(tm)
library(SnowballC)
library(stringr)
page <- read_html("http://journals.ametsoc.org/doi/full/10.1175/JCLI-D-13-00693.1")
paper <- page %>%
  html_nodes(css = ".NLM_sec_level_1") %>%
  html_text()
paper # Let's see the first one
The paper is divided into Introduction, Data and methods, Results and discussion, and Conclusions.
# Replace any non-alphanumeric character with a space
# (the original second gsub() pass was redundant: after this step
# no non-alphanumeric characters remain to match)
paper <- str_replace_all(paper, "[^[:alnum:]]", " ")
paper
# Create corpus from the scraped sections
corpus <- Corpus(VectorSource(paper))
# Convert to lower-case
corpus <- tm_map(corpus, tolower)

# Convert corpus to a PlainTextDocument
corpus <- tm_map(corpus, PlainTextDocument)

# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)

# Remove English stopwords
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Create the document-term matrix
frequencies <- DocumentTermMatrix(corpus)
Let's search for words that occur at least 50 times.
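In the pipeline above this would be `findFreqTerms(frequencies, lowfreq = 50)`. Here is a minimal self-contained sketch of the same call on a toy corpus (the two documents and the threshold of 2 are invented for illustration):

```r
library(tm)

# Toy documents standing in for the paper's sections
docs <- c("mjo anomalies precipitation anomalies",
          "precipitation anomalies over east africa")
toy_corpus <- Corpus(VectorSource(docs))
toy_dtm <- DocumentTermMatrix(toy_corpus)

# Terms appearing at least 2 times across the whole corpus
findFreqTerms(toy_dtm, lowfreq = 2)
```

On the real document-term matrix, raising `lowfreq` to 50 filters the vocabulary down to the handful of terms the paper leans on most.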
The paper investigates the impacts of an intraseasonal oscillation called the MJO on precipitation over east Africa. The processes through which the oscillation influences precipitation over the region are investigated by looking at anomalies of various atmospheric and oceanic fields, such as sea surface temperature, atmospheric winds and sea level pressure. Therefore, it is expected that anomalies, mjo and precipitation are the top three most commonly used words in the paper.
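One quick way to check this expectation is to sum the counts of each term across all sections of the document-term matrix and sort. A sketch, assuming the `frequencies` matrix built earlier in the post:

```r
# Total count of each term across all sections, sorted descending
freq <- sort(colSums(as.matrix(frequencies)), decreasing = TRUE)
head(freq, 10)  # inspect the most frequent terms
```

If the expectation holds, "anomalies", "mjo" and "precipitation" should appear at the top of this list.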