Recently, the presidential candidate Donald Trump has become controversial. In particular, he has faced strong criticism over his provocative call to temporarily bar Muslims from entering the US.
One of the many uses of social media analytics is sentiment analysis, where we evaluate whether posts on a specific issue are positive or negative.
We can integrate R and Tableau for text data mining in social media analytics, machine learning, predictive modeling, etc., by taking advantage of the numerous R packages and compelling Tableau visualizations.
In this post, let's mine tweets and analyze their sentiment using R. We will use Tableau to visualize the results: the spatio-temporal distribution of tweets, the cities and states with the largest numbers of tweets, and a map of tweet sentiment. This will help us see in which areas his comments are perceived as positive and where they are perceived as negative.
library(twitteR)
library(ROAuth)
library(RCurl)
library(stringr)
library(tm)
library(ggmap)
library(plyr)    # load plyr before dplyr so dplyr's verbs are not masked
library(dplyr)
library(wordcloud)
# All of this info comes from your Twitter developer account
key="hidden"
secret="hidden"
# set working directory for the whole process
setwd("C:/Fish/text_mining_and_web_scraping")
download.file(url="http://curl.haxx.se/ca/cacert.pem",
              destfile="C:/Fish/text_mining_and_web_scraping/cacert.pem",
              method="auto")
authenticate <- OAuthFactory$new(consumerKey=key,
                                 consumerSecret=secret,
                                 requestURL="https://api.twitter.com/oauth/request_token",
                                 accessURL="https://api.twitter.com/oauth/access_token",
                                 authURL="https://api.twitter.com/oauth/authorize")
setup_twitter_oauth(key, secret)
save(authenticate, file="twitter authentication.Rdata")
Let's scrape the most recent tweets from various cities across the US.
N=2000
lats=c(38.9,40.7,37.8,39,37.4,28,30,42.4,48,36,32.3,33.5,34.7,33.8,37.2,41.2,46.8,
46.6,37.2,43,42.7,40.8,36.2,38.6,35.8,40.3,43.6,40.8,44.9,44.9)
lons=c(-77,-74,-122,-105.5,-122,-82.5,-98,-71,-122,-115,-86.3,-112,-92.3,-84.4,-93.3,
-104.8,-100.8,-112, -93.3,-89,-84.5,-111.8,-86.8,-92.2,-78.6,-76.8,-116.2,-98.7,-123,-93)
#cities=DC,New York,San Francisco,Colorado,Mountain View,Tampa,Austin,Boston,
#       Seattle,Las Vegas,Montgomery,Phoenix,Little Rock,Atlanta,Springfield,
#       Cheyenne,Bismarck,Helena,Springfield,Madison,Lansing,Salt Lake City,Nashville,
#       Jefferson City,Raleigh,Harrisburg,Boise,Lincoln,Salem,St. Paul
S=200   # search radius in miles around each city (adjust as needed)
loc=do.call(rbind, lapply(1:length(lats), function(i) searchTwitter('Donald+Trump',
    lang="en", n=N, resultType="recent",
    geocode=paste(lats[i], lons[i], paste0(S,"mi"), sep=","))))
We can check that there are no repeated entries by comparing length(loc) and length(unique(loc)).
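For example, a quick sanity check:
# these two numbers should be equal if there are no duplicate statuses
length(loc)
length(unique(loc))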
loclat=sapply(loc, function(x) as.numeric(x$getLatitude()))
loclat=sapply(loclat, function(z) ifelse(length(z)==0,NA,z))
loclon=sapply(loc, function(x) as.numeric(x$getLongitude()))
loclon=sapply(loclon, function(z) ifelse(length(z)==0,NA,z))
locdate=lapply(loc, function(x) x$getCreated())
locdate=sapply(locdate,function(x) strftime(x, format="%Y-%m-%d %H:%M:%S",tz = "UTC"))
loctext=sapply(loc, function(x) x$getText())
loctext=unlist(loctext)
isretweet=sapply(loc, function(x) x$getIsRetweet())
retweeted=sapply(loc, function(x) x$getRetweeted())
retweetcount=sapply(loc, function(x) x$getRetweetCount())
favoritecount=sapply(loc, function(x) x$getFavoriteCount())
favorited=sapply(loc, function(x) x$getFavorited())
screenname=sapply(loc, function(x) x$getScreenName())
statussource=sapply(loc, function(x) x$getStatusSource())
truncated=sapply(loc, function(x) x$getTruncated())
data=as.data.frame(cbind(tweet=loctext,date=locdate,lat=loclat,lon=loclon,
isretweet=isretweet,retweeted=retweeted, retweetcount=retweetcount,
screenname=screenname,statussource=statussource,truncated=truncated,
favoritecount=favoritecount,favorited=favorited))
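One caveat worth noting: building the data frame via cbind() coerces every column to character (and, with older R defaults, to factors), so it is safer to convert the numeric columns back before filtering and mapping; otherwise as.numeric() on a factor would return level codes instead of the coordinates. A small defensive step:
# restore numeric columns that cbind() coerced to character/factor
data$lat = as.numeric(as.character(data$lat))
data$lon = as.numeric(as.character(data$lon))
data$retweetcount = as.numeric(as.character(data$retweetcount))
data$favoritecount = as.numeric(as.character(data$favoritecount))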
First, let's create a word cloud of the tweets. A word cloud helps us visualize the most common words in the tweets and get a general feel for their content.
# Create corpus
corpus=Corpus(VectorSource(data$tweet))
# Convert to lower-case
corpus=tm_map(corpus,tolower)
# Remove stopwords
corpus=tm_map(corpus,function(x) removeWords(x,stopwords()))
# convert corpus to a Plain Text Document
corpus=tm_map(corpus,PlainTextDocument)
col=brewer.pal(6,"Dark2")
wordcloud(corpus, min.freq=25, scale=c(5,2), rot.per=0.25,
          random.color=T, max.words=45, random.order=F, colors=col)
We see from the word cloud that among the most frequent words in the tweets are 'muslim', 'muslims', 'ban', 'president', 'bush', and 'job'. This suggests that most tweets were about Trump's recent idea of temporarily banning Muslims from entering the US.
The dashboard below shows a time series of the number of tweets scraped. We can switch the time unit between hour and day, and the dashboard updates based on the selected unit. The pattern of tweet volume over time helps us drill in and see how each activity or campaign event is being perceived.
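The aggregation behind that dashboard is done in Tableau, but the same hourly counts can be sketched in R (illustrative only; the created/hour column names here are just examples):
# illustrative: count tweets per hour in R
data$created = as.POSIXct(as.character(data$date), format="%Y-%m-%d %H:%M:%S", tz="UTC")
tweets_per_hour = data %>%
  group_by(hour = format(created, "%Y-%m-%d %H:00")) %>%
  dplyr::summarise(n_tweets = n())
head(tweets_per_hour)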
Since some tweets do not have lat/lon values, we will remove them because we want geographic information to show the tweets and their attributes on a map.
data=filter(data, !is.na(lat),!is.na(lon))
lonlat=select(data,lon,lat)
Let's get the full address of each tweet location using the Google Maps API. The ggmap package is what enables us to get the street address, city, zip code and state of each tweet from its longitude and latitude. Since the Google Maps API does not allow more than 2500 queries per day, I used a couple of machines to reverse geocode the latitude/longitude information into full addresses. Even so, I was not able to reverse geocode all of the tweets I scraped, so the following visualizations show only the portion of the scraped tweets that I managed to reverse geocode.
result <- do.call(rbind,
                  lapply(1:nrow(lonlat),
                         function(i) revgeocode(as.numeric(lonlat[i,1:2]))))
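Because of that daily quota, one practical way to spread the work is to reverse geocode in chunks and save each chunk so the job can be resumed the next day or on another machine. A rough sketch of the idea (not the exact script used here; the chunk size and file names are arbitrary):
# sketch: process coordinates in quota-sized chunks, saving each chunk to resume later
chunk_size = 2000
chunks = split(seq_len(nrow(lonlat)), ceiling(seq_len(nrow(lonlat)) / chunk_size))
for (k in seq_along(chunks)) {
  idx = chunks[[k]]
  res_k = sapply(idx, function(i) revgeocode(as.numeric(lonlat[i, 1:2])))
  saveRDS(res_k, paste0("revgeocode_chunk_", k, ".rds"))
}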
If we look at some of the values of result, we see that it contains the full addresses of the locations where the tweets were posted.
result[1:5,]
[,1]
[1,] "1778 Woodglo Dr, Asheboro, NC 27205, USA"
[2,] "1550 Missouri Valley Rd, Riverton, WY 82501, USA"
[3,] "118 S Main St, Ann Arbor, MI 48104, USA"
[4,] "322 W 101st St, New York, NY 10025, USA"
[5,] "322 W 101st St, New York, NY 10025, USA"
So, we will apply some regular expressions and string manipulation to separate the city, zip code and state into different columns.
data2=lapply(result, function(x) unlist(strsplit(x,",")))
address=sapply(data2,function(x) paste(x[1:3],collapse=''))
city=sapply(data2,function(x) x[2])
stzip=sapply(data2,function(x) x[3])
zipcode = as.numeric(str_extract(stzip,"[0-9]{5}"))
state=str_extract(stzip,"[:alpha:]{2}")
data2=as.data.frame(list(address=address,city=city,zipcode=zipcode,state=state))
# concatenate data2 to data
data=cbind(data,data2)
Some text cleaning:
tweet=data$tweet
tweet_list=lapply(tweet, function(x) iconv(x, "latin1", "ASCII", sub=""))
tweet_list=lapply(tweet_list, function(x) gsub("htt.*",' ',x))
tweet=unlist(tweet_list)
data$tweet=tweet
We will use lexicon-based sentiment analysis. A list of positive and negative opinion words (sentiment words) for English was downloaded from here.
pos= readLines("positivewords.txt")
neg= readLines("negativewords.txt")
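Depending on the source, the lexicon files may start with comment lines (the widely used Hu and Liu lists, for instance, prefix comments with ';') and may contain blank lines, so a little defensive cleanup does not hurt. This assumes that file format; adjust to whatever your downloaded files look like:
# strip possible ';'-prefixed comment lines and blanks from the lexicons
pos = pos[!grepl("^;", pos) & pos != ""]
neg = neg[!grepl("^;", neg) & neg != ""]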
# Wrapper function for sentiment analysis
score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
  scores = laply(sentences,
                 function(sentence, pos.words, neg.words)
                 {
                   # remove punctuation
                   sentence = gsub("[[:punct:]]", "", sentence)
                   # remove control characters
                   sentence = gsub("[[:cntrl:]]", "", sentence)
                   # remove digits
                   sentence = gsub('\\d+', '', sentence)
                   # define error handling function when trying tolower
                   tryTolower = function(x)
                   {
                     # create missing value
                     y = NA
                     # tryCatch error
                     try_error = tryCatch(tolower(x), error=function(e) e)
                     # if not an error
                     if (!inherits(try_error, "error"))
                       y = tolower(x)
                     # result
                     return(y)
                   }
                   # use tryTolower with sapply
                   sentence = sapply(sentence, tryTolower)
                   # split sentence into words with str_split (stringr package)
                   word.list = str_split(sentence, "\\s+")
                   words = unlist(word.list)
                   # compare words to the dictionaries of positive & negative terms
                   pos.matches = match(words, pos.words)
                   neg.matches = match(words, neg.words)
                   # match() gives the position of the matched term or NA;
                   # we just want a TRUE/FALSE
                   pos.matches = !is.na(pos.matches)
                   neg.matches = !is.na(neg.matches)
                   # final score
                   score = sum(pos.matches) - sum(neg.matches)
                   return(score)
                 }, pos.words, neg.words, .progress=.progress)
  # data frame with scores for each sentence
  scores.df = data.frame(score=scores)
  return(scores.df)
}
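Before scoring the real tweets, a quick sanity check on a couple of toy sentences helps confirm the function behaves as expected (the exact values depend on which words happen to be in the lexicon):
# toy example: the first sentence should score positive, the second negative,
# assuming words like "great" and "terrible" are in the lexicon
score.sentiment(c("What a great and wonderful idea",
                  "This is a terrible, horrible plan"), pos, neg)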
score = score.sentiment(tweet, pos, neg, .progress='text')
data$score = score$score   # score.sentiment() returns a data frame; keep the numeric column
hist(data$score, xlab=" ", main="Sentiment of sample tweets\n that have Donald Trump in them",
     border="black", col="skyblue")
We see from the histogram that the sentiment is slightly positive. Using Tableau, we will see the spatial distribution of the sentiment scores.
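Optionally, we can also bucket the numeric score into a categorical label before exporting; this makes color-coding the tweets in Tableau easier (a small convenience step, not required for the maps below):
# optional: categorical sentiment label for easier color-coding in Tableau
data$sentiment = ifelse(data$score > 0, "positive",
                        ifelse(data$score < 0, "negative", "neutral"))
table(data$sentiment)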
Now, let's save the data as a CSV file and import it into Tableau.
write.csv(data,"tweets_Trump.csv")
The interactive map below shows the tweets that I was able to reverse geocode. The size of each circle is proportional to the number of favorites the tweet received. We can hover over each circle and read the tweet, the address it was tweeted from, and the date and time it was posted.
Similarly, the dashboard below shows the tweets with circle size proportional to the number of times each tweet was retweeted. Again, we can hover over each circle to read the tweet, the address it was tweeted from, and the date and time it was posted.
In the following three visualizations, the top zip codes, cities and states by number of tweets are shown. We can change how many zip codes, cities and states to display using the scrollbars shown in each viz. These visualizations help us see how the tweets are distributed by state, city and zip code.
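The published dashboards are built in Tableau, but the same counts can be cross-checked in R with dplyr, for example the states with the most reverse geocoded tweets (illustrative only):
# illustrative: top 10 states by number of reverse geocoded tweets
top_states = data %>%
  group_by(state) %>%
  dplyr::summarise(n_tweets = n()) %>%
  arrange(desc(n_tweets)) %>%
  head(10)
top_states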
Sentiment analysis has myriad uses. For example, a company may want to know what customers like most about its product and which issues leave them unsatisfied. When a company releases a new product, is it perceived positively or negatively? How does customer sentiment vary across space and time?
The viz below shows the sentiment score of the reverse geocoded tweets by state. We see that the tweets have the highest positive sentiment in NY, NC and TX.
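That state-level view boils down to averaging the sentiment score per state, which we can also compute directly in R as a cross-check (illustrative; the viz itself is built in Tableau):
# average sentiment score per state
state_sentiment = data %>%
  group_by(state) %>%
  dplyr::summarise(mean_score = mean(score, na.rm = TRUE)) %>%
  arrange(desc(mean_score))
head(state_sentiment)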
In this post, we saw how to integrate R and Tableau for text mining, sentiment analysis and visualization. Using these tools together enables us to answer detailed questions.
We used a sample of the most recent tweets that mention Donald Trump, and since I was not able to reverse geocode all of the tweets I scraped because of the limit imposed by the Google Maps API, we used only about 6000 tweets. The average sentiment is slightly above zero, and some states show strongly positive sentiment. Statistically speaking, however, mining a sample of ample size is important for making robust conclusions.
The accuracy of our sentiment analysis depends on how fully the words in the tweets are covered by the lexicon. Moreover, since tweets may contain slang, jargon and colloquial words that may not be included in the lexicon, sentiment analysis results need careful evaluation.