Recently, I have been working on text mining, scraping and sentiment analysis with R and Python. I am especially interested in data from Facebook and Twitter. In this post, I scrape Twitter to understand what has been said about Russia and its relations with the Middle East. In particular, we will look at the sentiment of posts from November 24-29, 2015. For this exercise, we will consider only posts in English.
We will use the R packages twitteR, tm, stringr and plyr, among others.
The data was extracted from Twitter on November 29, 2015, and covers tweets posted since November 24. In total, 9999 posts were extracted.
library(twitteR)
library(ROAuth)
require(RCurl)
# The consumer key and secret come from your Twitter developer account
key="hidden"
secret="hidden"
# Set the working directory for the whole process
setwd("C:/Fish/text_mining_and_web_scraping")
download.file(url="http://curl.haxx.se/ca/cacert.pem",
              destfile="C:/Fish/text_mining_and_web_scraping/cacert.pem",
              method="auto")
authenticate <- OAuthFactory$new(consumerKey=key,
                                 consumerSecret=secret,
                                 requestURL="https://api.twitter.com/oauth/request_token",
                                 accessURL="https://api.twitter.com/oauth/access_token",
                                 authURL="https://api.twitter.com/oauth/authorize")
setup_twitter_oauth(key, secret)
save(authenticate, file="twitter authentication.Rdata")
russiatweets=searchTwitter("russia",lang="en",n=9999,since='2015-11-24')
We will use tm, the text mining package for R, to clean the tweets and build a corpus.
library(tm)
library(stringr) # provides str_replace_all, used below
russialist=sapply(russiatweets,function(x) x$getText())
russialist=str_replace_all(russialist, "[^[:alnum:]]", " ")
russialist=gsub("[^A-Za-z0-9]", " ", russialist)
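# Drop URL fragments left behind as standalone tokens; \\b leaves words such as "become" intact
russialist=gsub("\\bhttps?\\b", " ", russialist)
russialist=gsub("\\bcom\\b", " ", russialist)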
# Create corpus
russiacorpus=Corpus(VectorSource(russialist))
# Convert to lower case; content_transformer() lets tm_map apply a plain string function
russiacorpus=tm_map(russiacorpus,content_transformer(tolower))
# Remove stopwords; other words that are not relevant to the analysis such as http(s) can also be included
russiacorpus=tm_map(russiacorpus,function(x) removeWords(x,stopwords()))
# convert corpus to a Plain Text Document
russiacorpus=tm_map(russiacorpus,PlainTextDocument)
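To see what these cleaning steps do to a single tweet, here is a minimal base-R sketch. The tweet text and the tiny stop-word list are made up for illustration; the real pipeline above uses tm's stopwords() instead.

```r
# Hypothetical tweet, for illustration only
tweet <- "Russia's jets hit #Syria again! https://t.co/abc123"

# Replace everything that is not a letter or digit with a space
clean <- gsub("[^A-Za-z0-9]", " ", tweet)

# Lower-case and split on runs of whitespace
clean <- tolower(clean)
words <- strsplit(trimws(clean), "\\s+")[[1]]

# Hypothetical stop-word list standing in for tm's stopwords()
stop_words <- c("s", "again", "https", "t", "co")
words <- words[!words %in% stop_words]

print(words)
```

Note that the URL's random suffix (abc123 here) survives the cleaning, which is one reason to experiment with extra words to remove.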
The wordcloud package is handy for creating word clouds in R.
library(wordcloud)
col=brewer.pal(6,"Dark2")
wordcloud(russiacorpus, min.freq=5, scale=c(5,2), rot.per=0.5,
          random.color=TRUE, max.words=45, random.order=FALSE, colors=col)
# Create the term-document matrix
russiatdm <- TermDocumentMatrix(russiacorpus)
# List the terms that occur at least 300 times (experiment with lowfreq)
findFreqTerms(russiatdm, lowfreq=300)
# Remove sparse terms (experiment with sparse)
tdm <- removeSparseTerms(russiatdm, sparse=0.93)
# Scale the matrix
tdmscale <- scale(as.matrix(tdm))
# Calculate distances between terms for clustering
termdist <- dist(tdmscale, method="euclidean")
# Use hierarchical clustering
fit <- hclust(termdist)
par(mai=c(1,1.2,1,0.5))
plot(fit, xlab="", sub="", col.main="salmon")
As we can see from the above two figures, Russia-related tweets were associated with ISIS, Syria, airstrikes, bombing, etc. We also clearly see the recent downing of a Russian jet by Turkey and Russia's response: sanctions that Putin approved today.
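The clustering steps can be hard to picture on a large matrix, so here is a small base-R sketch on a made-up term-document matrix. The terms, documents and counts are hypothetical; the point is only that terms with similar usage across documents end up in the same branch.

```r
# Hypothetical term-document counts: rows are terms, columns are documents
m <- matrix(c(3, 2, 0, 0,
              2, 3, 0, 0,
              0, 0, 3, 2,
              0, 0, 2, 3),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("syria", "isis", "turkey", "jet"),
                            paste0("doc", 1:4)))

# scale() centres and scales each column (document)
scaled <- scale(m)

# dist() measures Euclidean distance between rows (terms)
d <- dist(scaled, method = "euclidean")

# Hierarchical clustering, then cut the tree into two groups
toyfit <- hclust(d)
groups <- cutree(toyfit, k = 2)
print(groups)
```

Here "syria" and "isis" share one group and "turkey" and "jet" the other, because each pair appears in the same documents.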
Now, let's analyze the sentiment of the tweets.
# Import the positive and negative word lists; the data can be downloaded from https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
pos = readLines("positivewords.txt")
neg = readLines("negativewords.txt")
library("stringr")
library("plyr")
# Score each sentence as (number of positive words) - (number of negative words)
score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
  scores = laply(sentences,
    function(sentence, pos.words, neg.words)
    {
      # remove punctuation - using global substitute
      sentence = gsub("[[:punct:]]", "", sentence)
      # remove control characters
      sentence = gsub("[[:cntrl:]]", "", sentence)
      # remove digits
      sentence = gsub('\\d+', '', sentence)
      # define error handling function when trying tolower
      tryTolower = function(x)
      {
        # create missing value
        y = NA
        # tryCatch error
        try_error = tryCatch(tolower(x), error=function(e) e)
        # if not an error
        if (!inherits(try_error, "error"))
          y = tolower(x)
        # result
        return(y)
      }
      # use tryTolower with sapply
      sentence = sapply(sentence, tryTolower)
      # split sentence into words with str_split (stringr package)
      word.list = str_split(sentence, "\\s+")
      words = unlist(word.list)
      # compare words to the dictionaries of positive & negative terms
      pos.matches = match(words, pos.words)
      neg.matches = match(words, neg.words)
      # get the position of the matched term or NA
      # we just want a TRUE/FALSE
      pos.matches = !is.na(pos.matches)
      neg.matches = !is.na(neg.matches)
      # final score
      score = sum(pos.matches) - sum(neg.matches)
      return(score)
    }, pos.words, neg.words, .progress=.progress )
  # data frame with scores for each sentence
  scores.df = data.frame(text=sentences, score=scores)
  return(scores.df)
}
scores = score.sentiment(russialist, pos, neg, .progress='text')
hist(scores[,2],xlab=" ",main="Sentiment of tweets from 24-29 Nov, 2015 that pertain to Russia", border="black",col="skyblue")
As we can see from the histogram, the tweets associated with Russia from 24-29 November in this exercise's data carried negative sentiment more often than positive. The most common words in the tweets were ISIS, Syria, airstrikes, bombing, etc. We also clearly see the recent downing of a Russian jet by Turkey and Russia's response: sanctions that Putin approved on the 29th of November.
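To make the scoring rule concrete, here is a simplified base-R sketch on toy data. The word lists and sentences are hypothetical, and the helper score_one skips the punctuation and digit stripping that score.sentiment performs; it only shows the positive-minus-negative count and how the scores can be tabulated by sign.

```r
# Hypothetical lexicons and sentences, for illustration only
pos_words <- c("good", "peace", "support")
neg_words <- c("attack", "war", "bad")
sentences <- c("peace talks get good support",
               "bad news as war and attack continue",
               "russia and turkey meet")

# Score one sentence: positive matches minus negative matches
score_one <- function(sentence, pos_words, neg_words) {
  words <- strsplit(tolower(sentence), "\\s+")[[1]]
  sum(words %in% pos_words) - sum(words %in% neg_words)
}

toy_scores <- vapply(sentences, score_one, integer(1),
                     pos_words = pos_words, neg_words = neg_words,
                     USE.NAMES = FALSE)

# Count how many sentences are negative, neutral or positive
sentiment <- factor(sign(toy_scores), levels = c(-1, 0, 1),
                    labels = c("negative", "neutral", "positive"))
print(toy_scores)
print(table(sentiment))
```

Applying the same table() call to sign(scores$score) from the real data would give the counts behind the histogram.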