# Fisseha Berhane, PhD

#### Data Scientist


### Web Scraping and Natural Language Processing: most commonly used words in a journal paper

I am practicing web scraping, regular expressions, and natural language processing in R. In this post, I will find the most commonly used words in one of my published papers. The paper can be accessed from the Journal of Climate.

I am using the painless 'rvest' R package for web scraping and the 'tm' package for natural language processing. The 'stringr' package is used to make my text tidy (remove unwanted characters).

In [4]:
library(rvest)
library(tm)
library(SnowballC)
library(stringr)


### Scraping

In [5]:
page <- read_html("http://journals.ametsoc.org/doi/full/10.1175/JCLI-D-13-00693.1")

# Extract the text of every element with the CSS class "NLM_sec_level_1"
paper = page %>% html_nodes(css = ".NLM_sec_level_1") %>% html_text()
paper[1]  # Let's see the first one

In [24]:
length(paper)

Out[24]:
4

The paper is divided into four sections: Introduction, Data and methods, Results and discussion, and Conclusions.

#### Remove unwanted characters

In [7]:
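# Replace every non-alphanumeric character with a space
paper = str_replace_all(paper, "[^[:alnum:]]", " ")
# Replace any remaining character that is not an ASCII letter, digit, or space with a hyphen
paper = gsub("[^A-Za-z0-9 ]", "-", paper)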
paper[2]
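
To make the effect of the first pattern concrete, here is a minimal sketch on a made-up sentence (the sample text and variable names are hypothetical, not taken from the paper). It uses base R's `gsub` instead of `str_replace_all` so that it runs without extra packages; for plain ASCII input the two should behave the same way.

```r
# Hypothetical sample text, not from the paper
sample_text <- "Rainfall (Oct-Dec) rose by 25%, see Fig. 3!"

# Every character that is not a letter or digit becomes a space
spaced <- gsub("[^[:alnum:]]", " ", sample_text)

# Collapse runs of spaces and trim the ends to tidy the result
cleaned <- trimws(gsub("\\s+", " ", spaced))

print(cleaned)
```

Collapsing the repeated spaces is an extra step added here for readability; the word counts computed later should not depend on it.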


### Searching for the most common words

In [10]:
# Create corpus

corpus = Corpus(VectorSource(paper))

In [26]:
# Convert to lower-case

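# tolower is a base R function, so wrap it in content_transformer() for tm_map
corpus = tm_map(corpus, content_transformer(tolower))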

In [15]:
# convert corpus to a Plain Text Document

corpus = tm_map(corpus, PlainTextDocument)

In [12]:
# Remove punctuation

corpus = tm_map(corpus, removePunctuation)

In [13]:
# Remove stopwords

corpus = tm_map(corpus, removeWords, stopwords("english"))
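
The idea behind stopword removal can be sketched in base R on a toy sentence. The short stopword list and the sentence below are hypothetical; `stopwords("english")` is much longer. This is an illustration of whole-word matching, not necessarily how `removeWords` is implemented internally.

```r
# Hypothetical mini stopword list and toy sentence
stop_words <- c("the", "of", "on", "over", "is")
text <- "the impact of the mjo on precipitation over east africa"

# One pattern that matches any stopword, but only as a whole word
pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
no_stop <- gsub(pattern, "", text, perl = TRUE)

# Collapse the gaps left behind by the removed words
no_stop <- trimws(gsub("\\s+", " ", no_stop))

print(no_stop)
```

The `\\b` word boundaries matter: without them, the "on" at the end of "precipitation" would also be removed.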

In [16]:
# Create matrix

frequencies = DocumentTermMatrix(corpus)


### Most commonly used words

Let's search for words that occur at least 50 times:

In [25]:
findFreqTerms(frequencies, lowfreq=50)

Out[25]:
1. "anomalies"
2. "mjo"
3. "precipitation"
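
To illustrate what a document-term matrix holds and how a frequency threshold can be applied to it, here is a small base R sketch. The three toy "documents", the threshold of 3, and all variable names are made up for illustration. It assumes that the threshold is applied to each term's total count across all documents, which is my understanding of how `findFreqTerms` works.

```r
# Hypothetical toy documents (already lower-cased and cleaned)
docs <- c("mjo anomalies precipitation mjo",
          "precipitation anomalies mjo winds",
          "mjo precipitation pressure")

# Split each document into words and collect the vocabulary
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary term per document: rows = documents, columns = terms
counts <- vapply(tokens,
                 function(tok) as.integer(table(factor(tok, levels = vocab))),
                 integer(length(vocab)))
dtm <- t(counts)
colnames(dtm) <- vocab

# Total count of each term over all documents, then apply the threshold
term_totals <- colSums(dtm)
frequent <- names(term_totals)[term_totals >= 3]

print(dtm)
print(frequent)
```

Sorting `term_totals` in decreasing order would give a full ranking of the words rather than only those above the threshold.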

### Conclusion

The paper investigates the impacts of an intraseasonal oscillation called the MJO on precipitation over East Africa. The processes through which the oscillation influences precipitation over the region are investigated by looking at anomalies of various atmospheric and oceanic fields, such as sea surface temperature, atmospheric winds, and sea level pressure. It is therefore not surprising that "anomalies", "mjo", and "precipitation" are the three most commonly used words in the paper.