I am practicing web scraping, regular expressions, and natural language processing in R. In this post, I will find the most commonly used words in one of my published papers. The paper can be accessed from the Journal of Climate.
I am using the painless 'rvest' R package for web scraping and the 'tm' package for natural language processing. The 'stringr' package is used to tidy the text (remove unwanted characters).
library(rvest)
library(tm)
library(SnowballC)
library(stringr)
page <- read_html("http://journals.ametsoc.org/doi/full/10.1175/JCLI-D-13-00693.1")
paper = page %>% html_nodes(css = ".NLM_sec_level_1") %>% html_text()
paper[1] # Let's see the first one
length(paper)
The paper is divided into Introduction, Data and methods, Results and discussion, and Conclusions.
# Replace every non-alphanumeric character with a space
paper = str_replace_all(paper, "[^[:alnum:]]", " ")
paper[2]
# Create corpus
corpus = Corpus(VectorSource(paper))
# Convert to lower-case
corpus = tm_map(corpus, content_transformer(tolower))
# convert corpus to a Plain Text Document
corpus = tm_map(corpus, PlainTextDocument)
# Remove punctuation
corpus = tm_map(corpus, removePunctuation)
# Remove stopwords
corpus = tm_map(corpus, removeWords, stopwords("english"))
# Create matrix
frequencies = DocumentTermMatrix(corpus)
Let's find the words that occur at least 50 times:
findFreqTerms(frequencies, lowfreq=50)
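`findFreqTerms` only tells us which words pass the threshold, not how often each occurs. To rank the terms, a minimal sketch (assuming the `frequencies` document-term matrix built above):

```r
# Sum each term's count across all sections of the paper
term_counts = colSums(as.matrix(frequencies))

# Show the ten most frequent terms in decreasing order
head(sort(term_counts, decreasing = TRUE), 10)
```

Converting with `as.matrix` is fine here because the matrix is small; for large corpora, `slam::col_sums` avoids densifying the sparse matrix.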
The paper investigates the impacts of an intra-seasonal oscillation called the MJO on precipitation over East Africa. The processes through which the oscillation influences precipitation over the region are investigated by looking at anomalies of various atmospheric and oceanic fields, such as sea surface temperature, atmospheric winds, and sea level pressure. Therefore, it is expected that "anomalies", "mjo", and "precipitation" are the top three most commonly used words in the paper.