### Web Scraping and Natural Language Processing: most commonly used words in a journal paper

I am practicing web scraping, regular expressions, and natural language processing in R. In this post, I will find the most commonly used words in one of my published papers. The paper can be accessed from the Journal of Climate.

I am using the painless 'rvest' R package for web scraping and the 'tm' package for natural language processing. The 'stringr' package is used to make my text tidy (remove unwanted characters).

In [4]:
library(rvest)
library(tm)
library(SnowballC)
library(stringr)


### Scraping

In [5]:
page <- read_html("http://journals.ametsoc.org/doi/full/10.1175/JCLI-D-13-00693.1")

# Select every top-level section of the article and extract its text
paper = page %>% html_nodes(css = ".NLM_sec_level_1") %>% html_text()
paper[1]  # Let's see the first one

Out[5]:
1. "1. IntroductionThe Madden–Julian oscillation (MJO), which is a 30–60-day oscillation centered around the equator, is responsible for the majority of weather variability in the tropics (Madden and Julian 1994). The MJO appears as an eastward propagating large-scale system in convection, zonal winds, and upper-level velocity potential (Hendon and Salby 1994). The system usually develops in the western Indian Ocean, and precipitation anomalies are recognizable as it propagates eastward to the western Pacific Ocean. When it reaches the cold waters in the eastern Pacific, it becomes nondescript. However, precipitation usually reappears as it reaches the tropical Atlantic Ocean and Africa (Madden and Julian 1971, 1972).The MJO is strongest in winter and weakest in summer (Wang and Rui 1990; Hendon and Salby 1994). Notably, however, no matter whether it is winter or summer, the MJO influences rainfall in a number of regions in the tropics and extratropics (Jones 2000; Paegle et al. 2000; Higgins and Shi 2001; Carvalho et al. 2004; Jones et al. 2004; Barlow et al. 2005; Donald et al. 2006; Lorenz and Hartmann 2006; Jeong et al. 2008; Wheeler et al. 2009; Zhang et al. 2009; Pai et al. 2011, among many others). Pohl and Camberlin (2006a,b; hereafter PC06a and PC06b) identify equatorial East Africa (EA) as a region in which the MJO can influence intraseasonal precipitation. They diagnose an MJO influence on precipitation in both the long rains (March–May) and the short rains (October–December) for selected regions in Kenya and northern Tanzania, with an observed contrast of influence between highland and coastal areas. They attribute the MJO influence and the intraregional contrast to a suite of mechanisms related to deep convection, moisture advection, and stratiform precipitation.The identification of an MJO influence in EA is both intriguing and potentially quite valuable. 
EA is a topographically diverse region and one of the most meteorologically complex regions on the African continent (Spinage 2012; Cook and Vizy 2013). Precipitation variability on interannual, interseasonal, and intraseasonal time scales has profound and extensively documented impacts on rain-fed agriculture, pastoralism, food and water security, and human health (Epstein 1999; Funk et al. 2005; Verdin et al. 2005; Bowden and Semazzi 2007; Funk et al. 2008; Ummenhofer et al. 2009; Anyah and Qiu 2012; Lyon and DeWitt 2012; Cook and Vizy 2013). While many studies have addressed challenges of explaining and predicting climate variability on seasonal and interannual time scales (Nicholson and Kim 1997; Indeje and Semazzi 2000; Mutai and Ward 2000; Black 2005; Hastenrath 2007; Owiti et al. 2008; Funk et al. 2008; Ummenhofer et al. 2009), relatively few have addressed intraseasonal variability on time scales that are potentially explainable, and perhaps predictable, based on MJO.It is well known that the long rains and short rains differ in their sensitivity to large-scale climate drivers and in the characteristics of precipitation (e.g., Camberlin et al. 2009). In addition, each season exhibits systematic differences in rainfall patterns between the early, middle, and late season (Fig. 1). These seasons are transitions between winter and summer monsoons (Hastenrath 2007) and correspond to the period when the intertropical convergence zone (ITCZ) crosses the equator in its south–north and then north–south migrations, respectively (Mutai and Ward 2000; Camberlin and Philippon 2002). The ITCZ modulates the northeast trades blowing during the southern summer and the southeast trades during the northern summer (Asnani 1993, 2005). Variability in the characteristics of the ITCZ is closely associated with variability in rainfall of the region (Gitau 2011).View larger version (97K)Fig. 1. 
Climatology of TRMM precipitation (mm day−1) and wind vectors at 850 hPa (m s−1) from NCEP-R1. Precipitation values less than 0.5 mm day−1 are suppressed. (left) Long rains for (a) March, (c) April, and (e) May and (center) short rains for (b) October, (d) November, and (f) December. The box in (a) shows the study region. (right) Map showing area of (a)–(f).For the long rains, several studies have indicated that the teleconnections linked to variability also differ across the season (Camberlin and Philippon 2002; Zorita and Tilya 2002), suggesting that atmospheric processes associated with precipitation at the beginning and end of the season are not the same (Camberlin and Okoola 2003). For this reason, authors of previous studies have recommended that studies of interannual variability consider each month of the long rainy season separately (Camberlin and Philippon 2002).Here, we apply this reasoning to an analysis of the MJO influence on EA, using the Climate Prediction Center’s (CPC’s) operational MJO index and the all-season real-time multivariate MJO index (RMM) from the Centre for Australian Weather and Climate Research. The overarching objective of the study is to explore impacts of the MJO on tropospheric circulations affecting EA during the long and short rains and associated changes in precipitation on intraseasonal time scales. In this respect our analysis builds on the work presented in PC06a and PC06b, but for a larger geographic extent and more recent period, and with the analysis carried out for each calendar month, individually, during both rainy seasons. These differences allow for detailed exploration of intraregional and intraseasonal variability in the MJO influence on EA. In addition, we employ multiple datasets in the analysis and explore a number of mechanisms not specifically identified by PC06a and PC06b. The paper is organized as follows: Section 2 describes data and methods, followed by results and discussion in section 3. 
Finally, a summary and conclusions are offered in section 4."
In [24]:
length(paper)

Out[24]:
4

The paper is divided into four sections: Introduction, Data and methods, Results and discussion, and Conclusions.
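To make the selector step easier to follow without hitting the journal's server, here is a minimal sketch on a small inline HTML string. The markup, the `toy_*` names, and the `footer` class are made up for illustration; only the `.NLM_sec_level_1` class comes from the code above. It assumes `read_html()` is given literal HTML rather than a URL.

```r
library(rvest)

# Hypothetical page: two top-level sections plus one unrelated element
toy_html <- '<html><body>
<div class="NLM_sec_level_1"><h2>1. Introduction</h2><p>First section text.</p></div>
<div class="NLM_sec_level_1"><h2>2. Data and methods</h2><p>Second section text.</p></div>
<div class="footer">Not part of the paper.</div>
</body></html>'

toy_page <- read_html(toy_html)

# Keep only the elements carrying the section class, then extract their text
toy_sections <- toy_page %>%
  html_nodes(css = ".NLM_sec_level_1") %>%
  html_text()

length(toy_sections)   # one element per matched section
toy_sections[1]
```

Because `html_text()` joins the text of nested elements without adding a separator, the heading and the first sentence run together, which is also why the real output above starts with "1. IntroductionThe".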

#### Remove unwanted characters

In [7]:
paper=str_replace_all(paper, "[^[:alnum:]]", " ")
paper=gsub("[^A-Za-z0-9 ]", "-", paper)
paper[2]

Out[7]:

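As a small illustration of what the first substitution does, here is a sketch on a made-up string (not text from the paper). The `toy`, `step1`, and `clean` names are hypothetical, and the `str_squish()` call is an extra tidy-up step that the code above does not perform.

```r
library(stringr)

# Made-up input containing punctuation, hyphens, and brackets
toy <- "MJO (30-60-day): rain, winds!"

# Replace every character that is not a letter or digit with a space
step1 <- str_replace_all(toy, "[^[:alnum:]]", " ")

# Optional extra step: collapse runs of spaces and trim both ends
clean <- str_squish(step1)
clean
```

Each removed character leaves a space behind, so words stay separated, but hyphenated terms such as "30-60-day" are split into separate tokens.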
### Searching for the most common words

In [10]:
# Create corpus

corpus = Corpus(VectorSource(paper))

In [26]:
# Convert to lower-case

corpus = tm_map(corpus, content_transformer(tolower))

In [15]:
# convert corpus to a Plain Text Document

corpus = tm_map(corpus, PlainTextDocument)

In [12]:
# Remove punctuation

corpus = tm_map(corpus, removePunctuation)

In [13]:
# Remove stopwords

corpus = tm_map(corpus, removeWords, stopwords("english"))

In [16]:
# Create matrix

frequencies = DocumentTermMatrix(corpus)


### Most commonly used words

Let's search for words that occur at least 50 times.

In [25]:
findFreqTerms(frequencies, lowfreq=50)

Out[25]:
1. "anomalies"
2. "mjo"
3. "precipitation"
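
`findFreqTerms()` returns only the terms, not how often each occurs. The sketch below runs the same kind of pipeline on three made-up sentences and also prints the counts by summing the columns of the document-term matrix. The sentences and the `toy_*` and `word_counts` names are hypothetical, and it uses `VCorpus()` with `content_transformer()`, which is one common way to write these steps in recent versions of 'tm' and not necessarily how the cells above were run.

```r
library(tm)

# Three made-up documents standing in for the sections of the paper
docs <- c("The MJO influences precipitation.",
          "Precipitation anomalies follow the MJO.",
          "Wind anomalies and precipitation.")

toy_corpus <- VCorpus(VectorSource(docs))
toy_corpus <- tm_map(toy_corpus, content_transformer(tolower))
toy_corpus <- tm_map(toy_corpus, removePunctuation)
toy_corpus <- tm_map(toy_corpus, removeWords, stopwords("english"))

toy_dtm <- DocumentTermMatrix(toy_corpus)

# Total count of each term across all documents, most frequent first
word_counts <- sort(colSums(as.matrix(toy_dtm)), decreasing = TRUE)
word_counts

# Terms whose total count is at least 2
findFreqTerms(toy_dtm, lowfreq = 2)
```

Converting the matrix with `as.matrix()` is fine for a single paper but can use a lot of memory on a large corpus.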

### Conclusion

The paper investigates the impacts of an intraseasonal oscillation, the MJO, on precipitation over East Africa. The processes through which the oscillation influences precipitation over the region are investigated by looking at anomalies of various atmospheric and oceanic fields, such as sea surface temperature, atmospheric winds, and sea level pressure. It is therefore no surprise that "anomalies", "mjo", and "precipitation" are the three most commonly used words in the paper.