Fisseha Berhane, PhD

Data Scientist


Text Mining, Scraping and Sentiment Analysis with R: Russia this week





Recently, I have been working on text mining, scraping and sentiment analysis with R and Python. I am especially interested in data from Facebook and Twitter. In this post, I scrape Twitter to understand what has been said about Russia and its relations with the Middle East. In particular, we will look at the sentiment of posts from November 24-29, 2015. For this exercise, we will consider only posts in English.

We will use the R packages twitteR, tm, stringr, and plyr, among others.

Data extraction

The data were extracted from Twitter on November 29, 2015, and cover posts made since November 24. In total, 9999 posts were extracted.

In [ ]:
library(twitteR)
library(ROAuth)
require(RCurl)

# The key and secret come from your Twitter developer account

key="hidden"

secret="hidden"


# Set the working directory for the whole process

setwd("C:/Fish/text_mining_and_web_scraping")


download.file(url="http://curl.haxx.se/ca/cacert.pem",
              destfile="C:/Fish/text_mining_and_web_scraping/cacert.pem",
              method="auto")



authenticate <- OAuthFactory$new(consumerKey=key,
                                 consumerSecret=secret,
                                 requestURL="https://api.twitter.com/oauth/request_token",
                                 accessURL="https://api.twitter.com/oauth/access_token",
                                 authURL="https://api.twitter.com/oauth/authorize")

setup_twitter_oauth(key, secret)



save(authenticate, file="twitter authentication.Rdata")

russiatweets=searchTwitter("russia",lang="en",n=9999,since='2015-11-24')

Preparing data

We will use tm, R's text mining package, to prepare the tweets.

In [ ]:
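library(tm)
library(stringr)  # provides str_replace_all(), used below

# Extract the text of each tweet
russialist=sapply(russiatweets,function(x) x$getText())
# Replace every non-alphanumeric character with a space
russialist=str_replace_all(russialist, "[^[:alnum:]]", " ")
russialist=gsub("[^A-Za-z0-9]", " ", russialist)
# Drop URL fragments, matching whole words only so that words such as "welcome" stay intact
russialist=gsub("\\bhttps?\\b", " ", russialist)
russialist=gsub("\\bcom\\b", " ", russialist)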

# Create corpus
russiacorpus=Corpus(VectorSource(russialist))

# Convert to lower-case; content_transformer() keeps each document a valid tm text document
russiacorpus=tm_map(russiacorpus,content_transformer(tolower))

# Remove stopwords; other words that are not relevant to the analysis such as http(s) can also be included

russiacorpus=tm_map(russiacorpus,function(x) removeWords(x,stopwords()))

# convert corpus to a Plain Text Document
russiacorpus=tm_map(russiacorpus,PlainTextDocument)
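To make the effect of the clean-up steps concrete, here is a minimal base-R sketch applied to two made-up tweets. The tweets and the variable names are hypothetical, and the sketch only approximates the pipeline above: it matches "https" and "com" as whole words, which is an assumption about what the clean-up is meant to remove.

```r
# Two made-up tweets (hypothetical examples, not from the extracted data)
tweets <- c("Russia hits #ISIS targets https://t.co/abc123",
            "Welcome news: talks resume, via example.com")

# Replace every non-alphanumeric character with a space
clean <- gsub("[^A-Za-z0-9]", " ", tweets)

# Lower-case, then drop URL fragments that survive as whole words
clean <- tolower(clean)
clean <- gsub("\\b(https?|com)\\b", " ", clean, perl = TRUE)

# Collapse runs of whitespace and trim the ends
clean <- trimws(gsub("\\s+", " ", clean))

print(clean)
```

Because the pattern uses word boundaries, "welcome" in the second tweet is left intact even though it contains the letters "com".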

Most common words using word cloud

The wordcloud package is handy for creating word clouds in R.

In [ ]:
library(wordcloud)

col=brewer.pal(6,"Dark2")
wordcloud(russiacorpus, min.freq=5, scale=c(5,2), rot.per=0.5,
          random.color=TRUE, max.words=45, random.order=FALSE, colors=col)
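A word cloud is essentially a picture of a word-frequency table. As a rough illustration of what is being drawn, the following base-R sketch counts words in three made-up documents and keeps those above a frequency threshold. The documents and the threshold of 2 are hypothetical.

```r
# Hypothetical cleaned tweets, for illustration only
docs <- c("russia syria airstrikes",
          "russia turkey jet",
          "russia syria sanctions")

# Split on whitespace and count how often each word occurs
words <- unlist(strsplit(docs, "\\s+"))
freq  <- sort(table(words), decreasing = TRUE)

# Keep words that occur at least twice (the role min.freq plays in wordcloud())
print(freq[freq >= 2])
```

In the cloud itself, the words with the highest counts are drawn largest.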

Dendrogram of most common words

In [ ]:
# Create the term-document matrix

russiatdm <- TermDocumentMatrix(russiacorpus)

# List frequent terms, then remove sparse terms
findFreqTerms(russiatdm, lowfreq=300) # experiment with lowfreq
tdm <- removeSparseTerms(russiatdm, sparse=0.93) # experiment with sparse

# scale it
tdmscale <- scale(tdm)

# calculate distances for clustering (named d to avoid masking the dist() function)
d <- dist(tdmscale, method = "euclidean")

# Use hierarchical clustering
fit <- hclust(d)

par(mai=c(1,1.2,1,0.5))
plot(fit, xlab="", sub="", col.main="salmon")

As the two figures above show, Russia-related tweets were associated with ISIS, Syria, airstrikes, bombing, etc. We also clearly see the recent downing of a Russian jet by Turkey and Russia's response: the sanctions that Putin approved today (November 29, 2015).
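To show how a dendrogram groups terms that occur in the same tweets, here is a small sketch on a made-up term-by-document matrix. The counts are hypothetical, and the scaling step is omitted for brevity.

```r
# Hypothetical term-by-document counts (rows = terms, columns = tweets)
m <- matrix(c(3, 0, 2, 0,
              2, 0, 3, 0,
              0, 4, 0, 3,
              0, 3, 0, 4),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("syria", "isis", "turkey", "jet"),
                            paste0("tweet", 1:4)))

# Euclidean distance between the term rows, then agglomerative clustering
d   <- dist(m, method = "euclidean")
fit <- hclust(d)

# Cut the tree into two groups of terms
groups <- cutree(fit, k = 2)
print(groups)
```

Terms with similar usage across tweets ("syria" and "isis", "turkey" and "jet") end up in the same branch, which is how the dendrogram of the real tweets should be read.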

Sentiment analysis

Now, let's analyze the sentiment of the tweets.

In [ ]:
# Import positive and negative word lists; the data can be downloaded from https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

pos = readLines("positivewords.txt")
neg = readLines("negativewords.txt")
library("stringr")
library("plyr")        


In [ ]:

# Wrapper that scores each sentence as (number of positive words) - (number of negative words)

score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
  
{
  scores = laply(sentences,
                 function(sentence, pos.words, neg.words)
                 {
                   # remove punctuation - using global substitute
                   sentence = gsub("[[:punct:]]", "", sentence)
                   # remove control characters
                   sentence = gsub("[[:cntrl:]]", "", sentence)
                   # remove digits
                   sentence = gsub('\\d+', '', sentence)
                   # define error handling function when trying tolower
                   tryTolower = function(x)
                     
                   {
                     # create missing value
                     y = NA
                     # tryCatch error
                     try_error = tryCatch(tolower(x), error=function(e) e)
                     # if not an error
                     if (!inherits(try_error, "error"))
                       y = tolower(x)
                     # result
                     return(y)
                   }
                   # use tryTolower with sapply
                   sentence = sapply(sentence, tryTolower)
                   # split sentence into words with str_split (stringr package)
                   word.list = str_split(sentence, "\\s+")
                   words = unlist(word.list)
                   # compare words to the dictionaries of positive & negative terms
                   pos.matches = match(words, pos.words)
                   neg.matches = match(words, neg.words)
                   # get the position of the matched term or NA
                   # we just want a TRUE/FALSE
                   pos.matches = !is.na(pos.matches)
                   neg.matches = !is.na(neg.matches)
                   # final score
                   score = sum(pos.matches) - sum(neg.matches)
                   return(score)
                 }, pos.words, neg.words, .progress=.progress )
  # data frame with scores for each sentence
  scores.df = data.frame(text=sentences, score=scores)
  return(scores.df)
}
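The scoring rule itself is simple: count the words that appear in the positive list and subtract the count of words that appear in the negative list. A stripped-down base-R sketch of that rule is shown below. The word lists, sentences, and the helper name score_one are hypothetical stand-ins, not the lexicon or data used in this post.

```r
# Tiny hypothetical word lists standing in for the full opinion lexicon
pos_words <- c("good", "peace", "welcome")
neg_words <- c("attack", "crisis", "bad")

# Score = number of positive matches minus number of negative matches
score_one <- function(sentence, pos_words, neg_words) {
  words <- unlist(strsplit(tolower(sentence), "\\s+"))
  sum(words %in% pos_words) - sum(words %in% neg_words)
}

sentences <- c("peace talks are good news",
               "bad crisis after the attack",
               "the jet was flying")
scores <- vapply(sentences, score_one, integer(1),
                 pos_words = pos_words, neg_words = neg_words,
                 USE.NAMES = FALSE)
print(scores)
```

A positive score means more positive than negative words, a negative score the opposite, and zero means either no matches or an equal number of each.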
                                     
In [ ]:

scores = score.sentiment(russialist, pos, neg, .progress='text')

hist(scores[,2],xlab=" ",main="Sentiment of tweets from 24-29 Nov, 2015 that pertain to Russia", border="black",col="skyblue")
                     

Summary

As the histogram shows, negative-sentiment tweets outnumbered positive ones among the Russia-related tweets from November 24-29 in the data used in this exercise. The most common words in the tweets were ISIS, Syria, airstrikes, bombing, etc. We also clearly see the recent downing of a Russian jet by Turkey and Russia's response: the sanctions that Putin approved on November 29.
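The comparison read off the histogram can also be made explicit by counting scores by sign. The sketch below does this for a made-up vector of scores; the numbers are purely illustrative and are not the results of this analysis.

```r
# Hypothetical sentiment scores, for illustration only (not the post's results)
example_scores <- c(-2, -1, -1, 0, 0, 1, -3, 2, 0, -1)

# Label each score by its sign and count the labels
label <- factor(sign(example_scores),
                levels = c(-1, 0, 1),
                labels = c("negative", "neutral", "positive"))
counts <- table(label)
print(counts)

# Share of each class
print(round(prop.table(counts), 2))
```

Applied to the second column of the scores data frame, the same few lines would give the number of negative, neutral, and positive tweets.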


