Fisseha Berhane, PhD

Data Scientist


Web Scraping and Natural Language Processing: most commonly used words in a journal paper

I am practicing web scraping, regular expressions and natural language processing in R. In this post, I will find the most commonly used words in one of my published papers. The paper can be accessed from the Journal of Climate.

I am using the painless 'rvest' R package for web scraping and the 'tm' package for natural language processing. The 'stringr' package is used to make my text tidy (remove unwanted characters).

In [4]:
library(rvest)    # web scraping
library(stringr)  # string cleaning
library(tm)       # text mining
In [5]:
page <- read_html("")

paper <- page %>% html_nodes(css = ".NLM_sec_level_1") %>% html_text()
paper[1]  # Let's see the first one
  1. "1. IntroductionThe Madden–Julian oscillation (MJO), which is a 30–60-day oscillation centered around the equator, is responsible for the majority of weather variability in the tropics (Madden and Julian 1994). The MJO appears as an eastward propagating large-scale system in convection, zonal winds, and upper-level velocity potential (Hendon and Salby 1994). The system usually develops in the western Indian Ocean, and precipitation anomalies are recognizable as it propagates eastward to the western Pacific Ocean. When it reaches the cold waters in the eastern Pacific, it becomes nondescript. However, precipitation usually reappears as it reaches the tropical Atlantic Ocean and Africa (Madden and Julian 1971, 1972).The MJO is strongest in winter and weakest in summer (Wang and Rui 1990; Hendon and Salby 1994). Notably, however, no matter whether it is winter or summer, the MJO influences rainfall in a number of regions in the tropics and extratropics (Jones 2000; Paegle et al. 2000; Higgins and Shi 2001; Carvalho et al. 2004; Jones et al. 2004; Barlow et al. 2005; Donald et al. 2006; Lorenz and Hartmann 2006; Jeong et al. 2008; Wheeler et al. 2009; Zhang et al. 2009; Pai et al. 2011, among many others). Pohl and Camberlin (2006a,b; hereafter PC06a and PC06b) identify equatorial East Africa (EA) as a region in which the MJO can influence intraseasonal precipitation. They diagnose an MJO influence on precipitation in both the long rains (March–May) and the short rains (October–December) for selected regions in Kenya and northern Tanzania, with an observed contrast of influence between highland and coastal areas. They attribute the MJO influence and the intraregional contrast to a suite of mechanisms related to deep convection, moisture advection, and stratiform precipitation.The identification of an MJO influence in EA is both intriguing and potentially quite valuable. 
EA is a topographically diverse region and one of the most meteorologically complex regions on the African continent (Spinage 2012; Cook and Vizy 2013). Precipitation variability on interannual, interseasonal, and intraseasonal time scales has profound and extensively documented impacts on rain-fed agriculture, pastoralism, food and water security, and human health (Epstein 1999; Funk et al. 2005; Verdin et al. 2005; Bowden and Semazzi 2007; Funk et al. 2008; Ummenhofer et al. 2009; Anyah and Qiu 2012; Lyon and DeWitt 2012; Cook and Vizy 2013). While many studies have addressed challenges of explaining and predicting climate variability on seasonal and interannual time scales (Nicholson and Kim 1997; Indeje and Semazzi 2000; Mutai and Ward 2000; Black 2005; Hastenrath 2007; Owiti et al. 2008; Funk et al. 2008; Ummenhofer et al. 2009), relatively few have addressed intraseasonal variability on time scales that are potentially explainable, and perhaps predictable, based on MJO.It is well known that the long rains and short rains differ in their sensitivity to large-scale climate drivers and in the characteristics of precipitation (e.g., Camberlin et al. 2009). In addition, each season exhibits systematic differences in rainfall patterns between the early, middle, and late season (Fig. 1). These seasons are transitions between winter and summer monsoons (Hastenrath 2007) and correspond to the period when the intertropical convergence zone (ITCZ) crosses the equator in its south–north and then north–south migrations, respectively (Mutai and Ward 2000; Camberlin and Philippon 2002). The ITCZ modulates the northeast trades blowing during the southern summer and the southeast trades during the northern summer (Asnani 1993, 2005). Variability in the characteristics of the ITCZ is closely associated with variability in rainfall of the region (Gitau 2011).View larger version (97K)Fig. 1. 
Climatology of TRMM precipitation (mm day−1) and wind vectors at 850 hPa (m s−1) from NCEP-R1. Precipitation values less than 0.5 mm day−1 are suppressed. (left) Long rains for (a) March, (c) April, and (e) May and (center) short rains for (b) October, (d) November, and (f) December. The box in (a) shows the study region. (right) Map showing area of (a)–(f).For the long rains, several studies have indicated that the teleconnections linked to variability also differ across the season (Camberlin and Philippon 2002; Zorita and Tilya 2002), suggesting that atmospheric processes associated with precipitation at the beginning and end of the season are not the same (Camberlin and Okoola 2003). For this reason, authors of previous studies have recommended that studies of interannual variability consider each month of the long rainy season separately (Camberlin and Philippon 2002).Here, we apply this reasoning to an analysis of the MJO influence on EA, using the Climate Prediction Center’s (CPC’s) operational MJO index and the all-season real-time multivariate MJO index (RMM) from the Centre for Australian Weather and Climate Research. The overarching objective of the study is to explore impacts of the MJO on tropospheric circulations affecting EA during the long and short rains and associated changes in precipitation on intraseasonal time scales. In this respect our analysis builds on the work presented in PC06a and PC06b, but for a larger geographic extent and more recent period, and with the analysis carried out for each calendar month, individually, during both rainy seasons. These differences allow for detailed exploration of intraregional and intraseasonal variability in the MJO influence on EA. In addition, we employ multiple datasets in the analysis and explore a number of mechanisms not specifically identified by PC06a and PC06b. The paper is organized as follows: Section 2 describes data and methods, followed by results and discussion in section 3. 
Finally, a summary and conclusions are offered in section 4."

The paper is divided into Introduction, Data and methods, Results and discussion, and Conclusions.
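To make the selector step concrete, here is a minimal sketch on a made-up HTML fragment (not the journal page itself). The names `toy_html`, `toy_page` and `toy_sections` are hypothetical; only the `.NLM_sec_level_1` class comes from the code above. It shows that `html_nodes()` followed by `html_text()` returns one character string per matched section.

```r
library(rvest)

# Made-up HTML fragment that mimics the class used in the selector above
toy_html <- '<html><body>
  <div class="NLM_sec_level_1"><h2>1. Introduction</h2><p>First section text.</p></div>
  <div class="NLM_sec_level_1"><h2>2. Data and methods</h2><p>Second section text.</p></div>
  <div class="other">Sidebar text that the selector should ignore.</div>
</body></html>'

# read_html() also accepts a literal HTML string, so no network access is needed
toy_page <- read_html(toy_html)

toy_sections <- toy_page %>%
  html_nodes(css = ".NLM_sec_level_1") %>%
  html_text()

length(toy_sections)  # number of matched sections
toy_sections[1]       # heading and body text of the first section
```

Because `html_text()` concatenates the text of all child nodes, a section heading runs straight into the first paragraph, which is why the scraped introduction starts with "1. IntroductionThe".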

Remove unwanted characters

In [7]:
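paper <- str_replace_all(paper, "[^[:alnum:]]", " ")  # every non-alphanumeric character becomes a space
paper <- gsub("[^A-Za-z0-9 ]", "-", paper)            # anything left outside A-Z, a-z, 0-9 and space becomes a hyphen
paper[2]  # Let's see the second one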
  1. "2 Data and methodsa Data We use multiple datasets to study associations of the MJO with precipitation and tropospheric circulation The precipitation dataset used in this study is the Tropical Rainfall Measuring Mission TRMM 3B42 Multisatellite Precipitation Analysis TMPA version 7 The dataset has a horizontal resolution of 0 25- - 0 25- latitude- longitude Huffman et al 2010 Previous studies have shown that TMPA captures variability of precipitation in East Africa reasonably well although some versions of the data have exhibited a bias in the magnitude of estimated precipitation rates e g Dinku et al 2007 Li et al 2009 Habib et al 2012 The version 7 multisensor product used in this study has not been evaluated in peer reviewed publications but its behavior is similar to earlier products with some evidence that biases in highland regions have been reduced Interpolated outgoing longwave radiation OLR estimates derived from the Advanced Very High Resolution Radiometer AVHRR onboard National Oceanic and Atmospheric Administration NOAA polar orbiting satellites Liebmann and Smith 1996 were employed to examine MJO associated changes in patterns of deep convection Negative OLR anomalies tend to correspond to positive precipitation anomalies while positive OLR anomalies tend to correspond to negative precipitation anomalies Atmospheric fields i e wind vector data pressure velocity - temperature humidity and precipitable water and sea level pressure SLP were drawn from the National Centers for Environmental Prediction NCEP - National Center for Atmospheric Research NCAR reanalysis NCEP R1 Kalnay et al 1996 The wind vectors and temperature are considered as - -most reliable- as they strongly depend on instrumental measurements while - and relative humidity are considered to be - -quite reliable- since they rely more on general circulation model parameterization Pohl and Camberlin 2006b Both the wind vector and OLR datasets are available at 2 5- - 2 5- latitude- 
longitude resolution and were obtained from the website of the NOAA Earth System Research Laboratory ESRL Physical Sciences Division PSD http www esrl noaa gov psd For purposes of comparison we repeat our analyses using SLP and atmospheric fields drawn from the European Centre for Medium Range Weather Forecasts ECMWF Interim Re Analysis ERA Interim which is the latest global atmospheric reanalysis produced by the ECMWF Dee et al 2011 The dataset replaces the 40 yr ECMWF Re Analysis ERA 40 and addresses several difficult data assimilation problems encountered during the production of ERA 40 Dee et al 2011 For detailed information about ERA Interim products the reader is referred to Dee et al 2011 Sea surface temperature SST data were also acquired from the NOAA ESRL PSD high resolution 0 25- analysis product For more details about this dataset the reader is referred to Reynolds et al 2007 The MJO indices used in this study are the CPC MJO index Chen and Del Genio 2009 and the RMM Wheeler and Hendon 2004 hereafter WH04 The CPC MJO index is generated by first applying an extended empirical orthogonal function EEOF analysis to pentad velocity potential at 200 hPa for ENSO neutral and weak ENSO winters November- April during 1979- 2000 Xue et al 2002 Barrett and Leslie 2009 The first EEOF consists of 10 time lagged patterns Then 10 MJO indices centered at 20- 70- 80- 100- 120- 140- and 160- E and 120- 40- and 10- W are constructed by regressing the daily data onto the 10 patterns of the first EEOF Positive negative values represent suppressed enhanced convection Each index is normalized by dividing by its standard deviation Barrett and Leslie 2009 Several previous studies have used the CPC MJO index for analyses of MJO process and impacts e g Chen and Del Genio 2009 Ridout and Flatau 2011 Del Genio et al 2012 Straub 2013 among others The indices and their details are available on the CPC website at http www cpc ncep noaa gov products precip CWlink daily mjo index mjo 
index shtml The daily real time multivariate MJO indices RMM1 and RMM2 of WH04 are calculated as the principal component PC time series of the two leading empirical orthogonal functions EOFs of combined daily mean fields of 850 and 200 hPa zonal winds and OLR averaged over the tropics 15- N- 15- S WH04 categorized the eastward propagation of the MJO into eight phases each corresponding to the geographical position of its active convective center see their Fig 7 These phases constitute a full MJO cycle that is strong in the Indian Ocean and decays over the central Pacific On average each phase lasts for about 6 days WH04 developed a two dimensional phase space diagram with RMM1 and RMM2 as the horizontal and vertical Cartesian axes which is used for viewing the spatial and temporal evolution of the MJO In this phase space representation strong MJO events move in a large counterclockwise direction around the origin while weak MJO variability usually appears as random movement near the origin Phase 1 denotes the period when the center of convective activity is over Africa In phases 2 and 3 the convective envelope of the MJO is in the equatorial Indian Ocean phases 4 and 5 correspond to the period when the MJO- s convective envelope is in the Maritime Continent and phases 6 and 7 correspond to the period when it is in the equatorial Pacific Ocean The square root of the sum of the squares of RMM1 and RMM2 represents amplitude of the MJO When the amplitude of the MJO is greater than 1 the eight phases are categorized as - -strong- MJO phases otherwise the MJO is categorized as - -weak- irrespective of the phase of the MJO RMM indices are available online at http cawcr gov au staff mwheeler maproom RMM index htm Analyses that involve precipitation are constrained by the availability of TRMM satellite data which starts in 1998 As a result precipitation analyses cover the period 1998- 2012 OLR and dynamical analyses are presented for the modern satellite record 1979- 2012 
while the SST data cover the period from 1982 to 2012 To test the stability of MJO associations over time we repeated all 1979- 2012 analyses using data only for 1979- 97 and data only for 1998- 2012 Results for these two time periods are consistent at seasonal scale and for most months Small differences between the 1979- 97 and 1998- 2012 periods are noted in the results section where they are relevant b Data analysis Combinations of linear correlations and composites are employed to explore associations between MJO and precipitation in EA and corresponding changes in tropospheric circulation Composite and correlation analyses are performed at pentad scale for each calendar month of both rainy seasons in order to capture subseasonal variability Pentads from 2- 31 March 1- 30 April 1- 30 May 3 October- 1 November 2 November- 1 December and 2- 31 December are considered for the months of March April May October November and December respectively MJO composites for the CPC indices are constructed using all pentads with CPC index amplitude equal to or greater than one and above a threshold that has been used in previous studies e g Chen and Del Genio 2009 Barrett and Leslie 2009 This results in between 25 and 43 composite pentads per month for the TRMM period 1998- 2012 A total of 90 204 pentads were available for each month six per year for 1998- 2012 1979- 2012 In addition wind and vertical velocity are analyzed at daily resolution for CPC pentads with strong MJO convection or subsidence to investigate whether the anomalies are change of strength of the prevailing motion or actual reversals Composite figures in this paper show the difference between enhanced MJO convection and suppressed MJO convection that is in each month the composites are the mean of pentads with MJO index less than or equal to negative one minus pentads with MJO index greater than or equal to one All analyses were repeated using daily RMM indices to verify the robustness of the results obtained 
employing the CPC MJO index When using the RMM index in each month days with MJO index of amplitude one and above are used Results for CPC and RMM indices are overwhelmingly similar so we focus on CPC results for simplicity For all figures showing CPC results we provide the equivalent RMM figures in the supplementary material To calculate composites we first compute the long term monthly mean for a given variable for each month as the average of all the values in each month Composites of all variables considered are computed for each calendar month based on the MJO indices aswhere the left hand term is the pentad anomaly daily for RMM the first term on the right is the value of a variable on a given pentad day employing the CPC RMM index and the last term on the right is the monthly mean of the variable considered Composites of OLR SLP vertical motion and wind vector anomalies at different levels are calculated for each MJO index These anomalies are examined to elucidate the physical mechanisms by which the MJO impacts rainfall on monthly time scales To investigate changes in components of the thermodynamic balance we employ the hydrostatic thermodynamic energy equation given bywhere T is temperature V is horizontal wind vector Sp is the static stability parameter Cp is the specific heat of dry air and J denotes diabatic heating In Eq 2 the left term is tendency while the first term on the right is horizontal temperature advection Static stability is proportional to the vertical gradient of temperature so is the adiabatic term that represents the vertical advection of temperature and the effect of adiabatic warming and cooling with vertical motion The diabatic heating term is calculated as a residual Moist static energy H composites are also calculated at each grid point H is found usingwhere Cp is the specific heat of air at constant pressure T is air temperature g is gravitational acceleration Z is geopotential height l- is latent heat of vaporization and q is 
specific humidity Lower tropospheric buoyancy is quantified using moist static instability which is calculated as moist static energy MSE at 1000 hPa minus saturation moist static energy at 700 hPa H1000 -- Hs700 Seager et al 2003 Saturation moist static energy Hs is calculated in the same manner as H Eq 3 but saturated specific humidity is used in place of specific humidity Throughout much of the study region the surface lies above 1000 hPa but anomalies of T Z and q at 1000 hPa and at the surface exhibit very similar patterns so the metric can still be used to diagnose the stability of the lower troposphere McHugh 2004 Composites of moisture flux divergence are calculated usingwhere MFD is moisture flux divergence q is specific humidity and u and - are zonal and meridional wind vectors respectively Also represents horizontal advection of specific humidity and denotes the product of specific humidity and horizontal mass divergence In all analyses that involve calculations of gradients centered difference techniques are used For analyses that involve wind speed vertical motion temperature or sea level pressure fields NCEP R1 and ERA Interim are both employed to confirm the robustness of findings The datasets provided similar results in all cases and NCEP R1 is used in the figures because it has been used in many previous studies in the region e g Mutai and Ward 2000 Camberlin and Okoola 2003 McHugh 2004 Hastenrath 2007 Hastenrath et al 2007 Lyon and DeWitt 2012 To test the significance of correlation coefficients a two tailed t test is used In the composite analysis a procedure outlined by Terray et al 2003 is used This procedure is useful to overcome drawbacks associated with the normality assumption of the Student- s t test "
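The effect of the first regular expression is easier to see on a short string. The sketch below uses base R's `gsub()` with the same `[^[:alnum:]]` pattern so that it runs without 'stringr'. The sample sentence and the names `toy_text`, `step1` and `step2` are made up for illustration, and the whitespace-collapsing step is an optional extra that is not part of the code above.

```r
toy_text <- "resolution of 0.25 (Huffman et al. 2010)."

# Step 1: every character that is not a letter or digit becomes a space,
# so the string keeps its length but loses dots and parentheses
step1 <- gsub("[^[:alnum:]]", " ", toy_text)

# Optional extra step (an assumption, not in the post):
# collapse runs of spaces and trim both ends
step2 <- trimws(gsub(" +", " ", step1))

step1
step2
```

Note that decimal numbers are split by this cleaning ("0.25" becomes "0 25"), which matches what the output above shows for the dataset resolution.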

Searching for the most common words

In [10]:
# Create corpus
corpus = Corpus(VectorSource(paper))
In [26]:
# Convert to lower-case

corpus = tm_map(corpus, content_transformer(tolower))
In [15]:
# convert corpus to a Plain Text Document

corpus = tm_map(corpus, PlainTextDocument)
In [12]:
# Remove punctuation

corpus = tm_map(corpus, removePunctuation)
In [13]:
# Remove stopwords 

corpus = tm_map(corpus, removeWords, stopwords("english"))
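As a rough illustration of what stop-word removal does, here is a base R sketch on a single sentence. The sentence is adapted from the paper's introduction, and `toy_stopwords` is a hypothetical four-word list chosen for this example; the list returned by `stopwords("english")` in 'tm' is much longer.

```r
sentence <- "The MJO is strongest in winter and weakest in summer"

# Lower-case first so that "The" matches the stop word "the"
words <- strsplit(tolower(sentence), " ", fixed = TRUE)[[1]]

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch)
toy_stopwords <- c("the", "is", "in", "and")

# Keep only the words that are not in the stop-word list
kept <- words[!(words %in% toy_stopwords)]
kept
```

This also shows why the lower-casing step comes before stop-word removal: the stop-word list is lower-case, so capitalized words would otherwise slip through.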
In [16]:
# Create matrix

frequencies = DocumentTermMatrix(corpus)

Most commonly used words

Let's search for words that occur at least 50 times

In [25]:
findFreqTerms(frequencies, lowfreq=50)
  1. "anomalies"
  2. "mjo"
  3. "precipitation"
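`findFreqTerms()` returns only the terms, not how often they occur. One way to see the counts themselves is to sum the columns of the document-term matrix and sort them. The sketch below does this on a made-up two-document corpus; `toy_docs`, `toy_corpus`, `toy_dtm` and `word_counts` are hypothetical names, and the same two lines at the end could be applied to the `frequencies` matrix above.

```r
library(tm)

# Made-up documents standing in for the sections of the paper
toy_docs <- c("mjo precipitation anomalies mjo",
              "precipitation mjo winds")

toy_corpus <- VCorpus(VectorSource(toy_docs))
toy_dtm <- DocumentTermMatrix(toy_corpus)

# Total count of each term across all documents, most frequent first
word_counts <- sort(colSums(as.matrix(toy_dtm)), decreasing = TRUE)
word_counts

# Terms that occur at least twice in the toy corpus
findFreqTerms(toy_dtm, lowfreq = 2)
```

Converting to a dense matrix with `as.matrix()` is fine for a single paper, but it may use a lot of memory on a large corpus.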


The paper investigates the impacts of an intraseasonal oscillation called the MJO on precipitation over East Africa. The processes through which the oscillation influences precipitation over the region are investigated by looking at anomalies of various atmospheric and oceanic fields such as sea surface temperature, atmospheric winds and sea level pressure. Therefore, it is expected that anomalies, mjo and precipitation are the three most commonly used words in the paper.
