Fisseha Berhane, PhD

Data Scientist

443-970-2353 fisseha@jhu.edu

Text Analytics with R

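This lab covers text analytics with R: a bag-of-words representation is built with the tm package and then used to train classification trees (CART), a random forest, and a logistic regression model.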

The data for the first part of the analysis comes from the 2010 TREC Legal Track. Each email is labeled as responsive (1) or non-responsive (0).

Load the dataset

In [97]:
emails = read.csv("energy_bids.csv", stringsAsFactors=FALSE)

str(emails)
'data.frame':	855 obs. of  2 variables:
 $ email     : chr  "North America's integrated electricity market requires cooperation on environmental policies Commission for Environmental Coope"| __truncated__ "FYI -----Original Message----- From: \t\"Ginny Feliciano\" <gfeliciano@earthlink.net>@ENRON [mailto:IMCEANOTES-+22Ginny+20Felic"| __truncated__ "14:13:53 Synchronizing Mailbox 'Kean, Steven J.' 14:13:53 Synchronizing Hierarchy 14:13:53 Synchronizing Favorites 14:13:53 Syn"| __truncated__ "^ ----- Forwarded by Steven J Kean/NA/Enron on 03/02/2001 12:27 PM ----- Suzanne_Nimocks@mckinsey.com Sent by: Carol_Benter@mck"| __truncated__ ...
 $ responsive: int  0 1 0 1 0 0 1 0 0 0 ...

Look at emails

In [98]:
emails$email[1]
Out[98]:
"North America's integrated electricity market requires cooperation on environmental policies Commission for Environmental Cooperation releases working paper on North America's electricity market Montreal, 27 November 2001 -- The North American Commission for Environmental Cooperation (CEC) is releasing a working paper highlighting the trend towards increasing trade, competition and cross-border investment in electricity between Canada, Mexico and the United States. It is hoped that the working paper, Environmental Challenges and Opportunities in the Evolving North American Electricity Market, will stimulate public discussion around a CEC symposium of the same title about the need to coordinate environmental policies trinationally as a North America-wide electricity market develops. The CEC symposium will take place in San Diego on 29-30 November, and will bring together leading experts from industry, academia, NGOs and the governments of Canada, Mexico and the United States to consider the impact of the evolving continental electricity market on human health and the environment. \"Our goal [with the working paper and the symposium] is to highlight key environmental issues that must be addressed as the electricity markets in North America become more and more integrated,\" said Janine Ferretti, executive director of the CEC. \"We want to stimulate discussion around the important policy questions being raised so that countries can cooperate in their approach to energy and the environment.\" The CEC, an international organization created under an environmental side agreement to NAFTA known as the North American Agreement on Environmental Cooperation, was established to address regional environmental concerns, help prevent potential trade and environmental conflicts, and promote the effective enforcement of environmental law. 
The CEC Secretariat believes that greater North American cooperation on environmental policies regarding the continental electricity market is necessary to: * protect air quality and mitigate climate change, * minimize the possibility of environment-based trade disputes, * ensure a dependable supply of reasonably priced electricity across North America * avoid creation of pollution havens, and * ensure local and national environmental measures remain effective. The Changing Market The working paper profiles the rapid changing North American electricity market. For example, in 2001, the US is projected to export 13.1 thousand gigawatt-hours (GWh) of electricity to Canada and Mexico. By 2007, this number is projected to grow to 16.9 thousand GWh of electricity. \"Over the past few decades, the North American electricity market has developed into a complex array of cross-border transactions and relationships,\" said Phil Sharp, former US congressman and chairman of the CEC's Electricity Advisory Board. \"We need to achieve this new level of cooperation in our environmental approaches as well.\" The Environmental Profile of the Electricity Sector The electricity sector is the single largest source of nationally reported toxins in the United States and Canada and a large source in Mexico. In the US, the electricity sector emits approximately 25 percent of all NOx emissions, roughly 35 percent of all CO2 emissions, 25 percent of all mercury emissions and almost 70 percent of SO2 emissions. These emissions have a large impact on airsheds, watersheds and migratory species corridors that are often shared between the three North American countries. \"We want to discuss the possible outcomes from greater efforts to coordinate federal, state or provincial environmental laws and policies that relate to the electricity sector,\" said Ferretti. 
\"How can we develop more compatible environmental approaches to help make domestic environmental policies more effective?\" The Effects of an Integrated Electricity Market One key issue raised in the paper is the effect of market integration on the competitiveness of particular fuels such as coal, natural gas or renewables. Fuel choice largely determines environmental impacts from a specific facility, along with pollution control technologies, performance standards and regulations. The paper highlights other impacts of a highly competitive market as well. For example, concerns about so called \"pollution havens\" arise when significant differences in environmental laws or enforcement practices induce power companies to locate their operations in jurisdictions with lower standards. \"The CEC Secretariat is exploring what additional environmental policies will work in this restructured market and how these policies can be adapted to ensure that they enhance competitiveness and benefit the entire region,\" said Sharp. Because trade rules and policy measures directly influence the variables that drive a successfully integrated North American electricity market, the working paper also addresses fuel choice, technology, pollution control strategies and subsidies. The CEC will use the information gathered during the discussion period to develop a final report that will be submitted to the Council in early 2002. For more information or to view the live video webcast of the symposium, please go to: http://www.cec.org/electricity. You may download the working paper and other supporting documents from: http://www.cec.org/programs_projects/other_initiatives/electricity/docs.cfm?varlan=english. Commission for Environmental Cooperation 393, rue St-Jacques Ouest, Bureau 200 Montréal (Québec) Canada H2Y 1N9 Tel: (514) 350-4300; Fax: (514) 350-4314 E-mail: info@ccemtl.org ***********"
In [99]:
emails$responsive[1]
Out[99]:
0
In [100]:
emails$email[2]
Out[100]:
"FYI -----Original Message----- From: \"Ginny Feliciano\" @ENRON [mailto:IMCEANOTES-+22Ginny+20Feliciano+22+20+3Cgfeliciano+40earthlink+2Enet+3E+40ENRON@ENRON.com] Sent: Thursday, June 28, 2001 3:40 PM To: Silvia Woodard; Paul Runci; Katrin Thomas; John A. Riggs; Kurt E. Yeager; Gregg Ward; Philip K. Verleger; Admiral Richard H. Truly; Susan Tomasky; Tsutomu Toichi; Susan F. Tierney; John A. Strom; Gerald M. Stokes; Kevin Stoffer; Edward M. Stern; Irwin M. Stelzer; Hoff Stauffer; Steven R. Spencer; Robert Smart; Bernie Schroeder; George A. Schreiber, Jr.; Robert N. Schock; James R. Schlesinger; Roger W. Sant; John W. Rowe; James E. Rogers; John F. Riordan; James Ragland; Frank J. Puzio; Tony Prophet; Robert Priddle; Michael Price; John B. Phillips; Robert Perciasepe; D. Louis Peoples; Robert Nordhaus; Walker Nolan; William A. Nitze; Kazutoshi Muramatsu; Ernest J. Moniz; Nancy C. Mohn; Callum McCarthy; Thomas R. Mason; Edward P. Martin; Jan W. Mares; James K. Malernee; S. David Freeman; Edwin Lupberger; Amory B. Lovins; Lynn LeMaster; Hoesung Lee; Lay, Kenneth; Lester Lave; Wilfrid L. Kohl; Soo Kyung Kim; Melanie Kenderdine; Paul L. Joskow; Ira H. Jolles; Frederick E. John; John Jimison; William W. Hogan; Robert A. Hefner, III; James K. Gray; Craig G. Goodman; Charles F. Goff, Jr.; Jerry D. Geist; Fritz Gautschi; Larry G. Garberding; Roger Gale; William Fulkerson; Stephen E. Frank; George Frampton; Juan Eibenschutz; Theodore R. Eck; Congressman John Dingell; Brian N. Dickie; William E. Dickenson; Etienne Deffarges; Wilfried Czernie; Loren C. Cox; Anne Cleary; Bernard H. Cherry; Red Cavaney; Ralph Cavanagh; Thomas R. Casten; Peter Bradford; Peter D. Blair; Ellen Berman; Roger A. Berliner; Michael L. Beatty; Vicky A. Bailey; Merribel S. Ayres; Catherine G. Abbott Subject: Energy Deregulation - California State Auditor Report Attached is my report prepared on behalf of the California State Auditor. 
I look forward to seeing you at The Aspen Institute Energy Policy Forum. Charles J. Cicchetti Pacific Economics Group, LLC - ca report new.pdf ***********"
In [101]:
emails$responsive[2]
Out[101]:
1

Responsive emails

In [102]:
table(emails$responsive)
Out[102]:
  0   1 
716 139 
In [103]:
library(tm)

Create corpus

In [104]:
corpus = Corpus(VectorSource(emails$email))

corpus[[1]]
Out[104]:
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 5607

Pre-process data

In [105]:
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, PlainTextDocument)
In [106]:
length(stopwords("english"))
Out[106]:
174
In [107]:
corpus = tm_map(corpus, removePunctuation)
In [108]:
corpus = tm_map(corpus, removeWords, stopwords("english"))
In [109]:
corpus = tm_map(corpus, stemDocument)
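
The cleaning steps above can be illustrated on a single string without tm. This is a minimal base-R sketch, not what tm_map does internally: `clean_text` and `toy_stopwords` are hypothetical names introduced here, the stop-word list is a tiny stand-in for stopwords("english"), and stemming is left out.

```r
# Toy stop-word list (hypothetical; much shorter than tm's stopwords("english")).
toy_stopwords <- c("i", "have", "to", "the", "by", "has")

# Lower-case, strip punctuation, and drop stop words from one string.
clean_text <- function(x, stop_words) {
  x <- tolower(x)                             # convert to lower-case
  x <- gsub("[[:punct:]]", "", x)             # remove punctuation
  words <- strsplit(x, "[[:space:]]+")[[1]]   # split into words
  words <- words[nzchar(words) & !(words %in% stop_words)]
  paste(words, collapse = " ")
}

cleaned <- clean_text("I have to say, Apple has by far the BEST service!", toy_stopwords)
print(cleaned)
```

With recent versions of tm, the lower-casing step is commonly written as tm_map(corpus, content_transformer(tolower)), which keeps the documents in their original class and makes the separate PlainTextDocument conversion unnecessary.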

Create matrix

dtm = DocumentTermMatrix(corpus)

dtm

Remove sparse terms

In [110]:
dtm = removeSparseTerms(dtm, 0.97)
dtm
Out[110]:
<<DocumentTermMatrix (documents: 855, terms: 788)>>
Non-/sparse entries: 51612/622128
Sparsity           : 92%
Maximal term length: 19
Weighting          : term frequency (tf)
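
The 0.97 threshold can be read as "keep a term only if it is absent from fewer than 97% of the documents", which for 855 emails means a term has to occur in about 26 of them. The sketch below applies that rule to a toy matrix in base R. It assumes this reading of removeSparseTerms from its documentation, and `toy_dtm` and `threshold` are made-up illustrations rather than the lab's data.

```r
# Toy document-term matrix: 10 documents (rows), 3 terms (columns).
toy_dtm <- cbind(
  common = c(1, 2, 0, 1, 1, 0, 3, 1, 0, 1),  # present in 7 of 10 documents
  medium = c(0, 1, 0, 0, 1, 0, 0, 1, 0, 1),  # present in 4 of 10 documents
  rare   = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1)   # present in 1 of 10 documents
)

# Sparsity of a term: fraction of documents in which it does not occur.
term_sparsity <- colMeans(toy_dtm == 0)

# Keep only the terms whose sparsity is below the threshold.
threshold <- 0.7
kept <- toy_dtm[, term_sparsity < threshold, drop = FALSE]
print(colnames(kept))
```

Lowering the threshold keeps fewer, more frequent terms; raising it toward 1 keeps almost every term.
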
Create data frame
In [111]:
labeledTerms = as.data.frame(as.matrix(dtm))
Add in the outcome variable
In [112]:
labeledTerms$responsive = emails$responsive
In [113]:
str(labeledTerms)
'data.frame':	855 obs. of  789 variables:
 $ 100                : num  0 0 0 0 0 0 5 0 0 0 ...
 $ 1400               : num  0 0 0 0 0 0 0 0 0 0 ...
 $ 1999               : num  0 0 0 0 0 0 0 0 0 0 ...
 $ 2000               : num  0 0 1 0 1 0 6 0 1 0 ...
 $ 2001               : num  2 1 0 0 0 0 7 0 0 0 ...
 $ 713                : num  0 0 0 0 0 0 0 0 0 0 ...
 $ 77002              : num  0 0 0 0 0 0 0 0 0 0 ...
 $ abl                : num  0 0 0 0 0 0 2 0 0 0 ...
 $ accept             : num  0 0 0 0 0 0 1 0 0 0 ...
 $ access             : num  0 0 0 0 0 0 0 0 0 0 ...
 $ accord             : num  0 0 0 0 0 0 1 0 0 0 ...
 $ account            : num  0 0 0 0 0 0 3 0 0 0 ...
 $ act                : num  0 0 0 0 0 0 1 0 0 0 ...
 $ action             : num  0 0 0 0 1 0 0 0 0 0 ...
 $ activ              : num  0 0 1 0 1 0 1 0 0 0 ...
 $ actual             : num  0 0 0 0 0 0 0 0 0 0 ...
 $ add                : num  0 0 0 0 0 0 1 0 0 0 ...
 $ addit              : num  1 0 0 0 0 0 1 0 0 0 ...
 $ address            : num  3 0 0 0 2 0 0 0 0 1 ...
 $ administr          : num  0 0 0 0 0 0 1 0 0 0 ...
 $ advanc             : num  0 0 0 0 0 0 0 0 0 0 ...
 $ advis              : num  0 0 0 0 0 0 0 0 0 0 ...
 $ affect             : num  0 0 0 0 2 0 0 0 0 0 ...
 $ afternoon          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ agenc              : num  0 0 0 0 1 0 0 0 0 0 ...
 $ ago                : num  0 0 0 0 0 0 1 0 0 0 ...
 $ agre               : num  0 0 0 0 0 0 0 0 0 0 ...
 $ agreement          : num  2 0 0 0 2 0 1 0 0 1 ...
 $ alan               : num  0 0 0 0 0 1 0 0 0 0 ...
 $ allow              : num  0 0 0 0 0 0 2 0 0 0 ...
 $ along              : num  1 0 0 0 1 0 1 0 0 0 ...
 $ alreadi            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ also               : num  1 0 0 0 0 0 8 0 0 0 ...
 $ altern             : num  0 0 0 0 0 0 0 0 1 0 ...
 $ although           : num  0 0 0 0 0 0 6 0 0 0 ...
 $ amend              : num  0 0 0 0 0 0 0 0 0 0 ...
 $ america            : num  4 0 0 0 0 0 0 0 1 0 ...
 $ among              : num  0 0 0 0 0 0 3 0 0 0 ...
 $ amount             : num  0 0 0 0 0 0 1 0 0 0 ...
 $ analysi            : num  0 0 0 2 0 0 0 0 0 0 ...
 $ analyst            : num  0 0 0 0 0 0 6 0 0 0 ...
 $ andor              : num  0 0 0 0 0 0 0 0 0 0 ...
 $ andrew             : num  0 0 0 0 0 0 0 0 0 0 ...
 $ announc            : num  0 0 0 0 0 0 2 0 0 0 ...
 $ anoth              : num  0 0 0 0 0 0 6 0 0 0 ...
 $ answer             : num  0 0 0 0 0 0 2 0 0 0 ...
 $ anyon              : num  0 0 0 0 0 0 0 0 0 0 ...
 $ anyth              : num  0 0 0 0 0 0 0 0 0 0 ...
 $ appear             : num  0 0 0 0 0 0 3 0 0 0 ...
 $ appli              : num  0 0 0 0 0 0 0 0 0 0 ...
 $ applic             : num  0 0 0 0 0 0 0 0 0 0 ...
 $ appreci            : num  0 0 0 0 1 0 0 0 0 0 ...
 $ approach           : num  3 0 0 0 0 0 1 0 0 0 ...
 $ appropri           : num  0 0 0 0 0 0 0 1 0 0 ...
 $ approv             : num  0 0 0 0 0 0 1 0 0 0 ...
 $ approxim           : num  1 0 0 0 0 0 1 0 0 0 ...
 $ april              : num  0 0 0 0 0 0 3 0 0 0 ...
 $ area               : num  0 0 0 0 1 0 3 0 0 0 ...
 $ around             : num  2 0 0 0 0 0 1 0 0 0 ...
 $ arrang             : num  0 0 0 0 0 0 0 0 0 0 ...
 $ articl             : num  0 0 0 0 0 0 1 0 0 0 ...
 $ ask                : num  0 0 0 0 0 1 0 0 0 0 ...
 $ asset              : num  0 0 0 0 0 0 2 0 0 0 ...
 $ assist             : num  0 0 0 0 0 0 0 0 0 0 ...
 $ associ             : num  0 0 1 0 1 0 0 0 0 0 ...
 $ assum              : num  0 0 0 0 0 1 0 0 0 0 ...
 $ attach             : num  0 1 0 1 1 0 1 0 3 1 ...
 $ attend             : num  0 0 0 0 0 0 0 0 1 0 ...
 $ attent             : num  0 0 0 0 0 0 1 0 0 0 ...
 $ attorney           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ august             : num  0 0 0 0 0 0 0 0 0 0 ...
 $ author             : num  0 0 1 0 0 0 0 0 0 0 ...
 $ avail              : num  0 0 0 0 0 0 0 0 0 0 ...
 $ averag             : num  0 0 0 0 0 0 5 0 0 0 ...
 $ avoid              : num  1 0 0 0 1 0 2 0 0 0 ...
 $ awar               : num  0 0 0 0 0 0 0 0 0 0 ...
 $ back               : num  0 0 0 0 1 1 1 0 0 0 ...
 $ balanc             : num  0 0 0 0 0 0 0 0 0 0 ...
 $ bank               : num  0 0 0 0 0 0 2 0 0 0 ...
 $ base               : num  0 0 0 0 1 0 9 0 0 0 ...
 $ basi               : num  0 0 0 0 0 0 1 0 0 0 ...
 $ becom              : num  1 0 0 0 0 0 4 0 0 0 ...
 $ begin              : num  0 0 0 0 0 0 0 0 0 0 ...
 $ believ             : num  1 0 0 0 0 0 0 0 0 0 ...
 $ benefit            : num  1 0 0 0 0 0 5 0 0 0 ...
 $ best               : num  0 0 0 0 0 0 0 0 0 1 ...
 $ better             : num  0 0 0 0 0 0 2 0 0 0 ...
 $ bid                : num  0 0 0 0 0 0 1 0 0 0 ...
 $ big                : num  0 0 0 0 0 1 6 0 0 0 ...
 $ bill               : num  0 0 0 0 0 0 0 0 0 0 ...
 $ billion            : num  0 0 0 0 0 0 2 0 0 0 ...
 $ bit                : num  0 0 0 0 0 1 2 0 0 0 ...
 $ board              : num  1 0 0 0 0 0 0 0 0 0 ...
 $ bob                : num  0 0 0 0 0 0 0 0 0 0 ...
 $ book               : num  0 0 0 0 0 0 0 0 0 0 ...
 $ brian              : num  0 1 0 0 0 0 0 0 0 0 ...
 $ brief              : num  0 0 0 0 0 0 0 0 0 0 ...
 $ bring              : num  1 0 0 0 0 0 2 0 0 0 ...
 $ build              : num  0 0 0 0 0 0 7 0 1 0 ...
  [list output truncated]

Split the data

In [114]:
library(caTools)

set.seed(144)

spl = sample.split(labeledTerms$responsive, 0.7)

train = subset(labeledTerms, spl == TRUE)
test = subset(labeledTerms, spl == FALSE)
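
sample.split keeps the share of responsive emails roughly the same in both sets. The idea can be sketched in base R by sampling 70% within each class separately. `stratified_split` is a hypothetical helper written for this illustration and is not the algorithm caTools uses, so the rows it selects will differ from those chosen above.

```r
set.seed(144)

# Class labels with the same counts as table(emails$responsive): 716 zeros, 139 ones.
labels <- rep(c(0, 1), times = c(716, 139))

# Mark about `ratio` of the rows of each class as training rows.
stratified_split <- function(y, ratio) {
  in_train <- logical(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    n_train <- round(ratio * length(idx))
    in_train[idx[sample.int(length(idx), n_train)]] <- TRUE
  }
  in_train
}

toy_spl <- stratified_split(labels, 0.7)
print(table(labels, toy_spl))
```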

Build a CART model

In [115]:
library(rpart)
library(rpart.plot)

emailCART = rpart(responsive~., data=train, method="class")

prp(emailCART)

Make predictions on the test set

In [116]:
pred = predict(emailCART, newdata=test)
pred[1:10,]
Out[116]:
                       0          1
character(0)   0.2156863 0.7843137
character(0).1 0.9557522 0.04424779
character(0).2 0.9557522 0.04424779
character(0).3 0.8125    0.1875
character(0).4 0.4       0.6
character(0).5 0.9557522 0.04424779
character(0).6 0.9557522 0.04424779
character(0).7 0.9557522 0.04424779
character(0).8 0.125     0.875
character(0).9 0.125     0.875
In [117]:
pred.prob = pred[,2]
Compute accuracy
In [118]:
table(test$responsive, pred.prob >= 0.5)
Out[118]:
   
    FALSE TRUE
  0   195   20
  1    17   25
In [119]:
accuracy=sum(diag(table(test$responsive, pred.prob >= 0.5)))/sum(table(test$responsive, pred.prob >= 0.5))
accuracy
Out[119]:
0.856031128404669
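
Accuracy alone hides how the errors are distributed. As a small sketch, the confusion matrix printed above can be re-entered by hand to compute sensitivity and specificity as well. The names `cm`, `cm_accuracy`, `sensitivity`, and `specificity` are introduced here for illustration only.

```r
# Confusion matrix from the output above (rows: actual, columns: predicted >= 0.5).
cm <- matrix(c(195, 17, 20, 25), nrow = 2,
             dimnames = list(actual = c("0", "1"),
                             predicted = c("FALSE", "TRUE")))

cm_accuracy <- sum(diag(cm)) / sum(cm)          # share of correct predictions
sensitivity <- cm["1", "TRUE"] / sum(cm["1", ]) # responsive emails that were caught
specificity <- cm["0", "FALSE"] / sum(cm["0", ])# non-responsive emails correctly passed over

print(round(c(accuracy = cm_accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 3))
```
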
Baseline model accuracy
In [120]:
table(test$responsive)
Out[120]:
  0   1 
215  42 
In [121]:
accuracy=max(table(test$responsive))/sum(table(test$responsive))
accuracy
Out[121]:
0.836575875486381

ROC curve

In [122]:
library(ROCR)

predROCR = prediction(pred.prob, test$responsive)

perfROCR = performance(predROCR, "tpr", "fpr")

plot(perfROCR, colorize=TRUE)
Loading required package: gplots

Attaching package: 'gplots'

The following object is masked from 'package:stats':

    lowess

From the curve, lowering the classification threshold to about 0.15 gives a higher sensitivity (true positive rate) than the 0.5 threshold, while the false positive rate is about 0.2.
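
The trade-off behind a threshold choice can be sketched by computing the true and false positive rates at two cutoffs. The probabilities and labels below are made up for illustration and are not taken from the email model; `rates_at` is a hypothetical helper.

```r
# True positive rate and false positive rate at a given probability cutoff.
rates_at <- function(prob, actual, cutoff) {
  predicted <- prob >= cutoff
  c(tpr = sum(predicted & actual == 1) / sum(actual == 1),
    fpr = sum(predicted & actual == 0) / sum(actual == 0))
}

# Made-up predicted probabilities and true labels (4 positives, 6 negatives).
toy_prob   <- c(0.90, 0.70, 0.30, 0.20, 0.60, 0.20, 0.10, 0.05, 0.05, 0.05)
toy_actual <- c(1,    1,    1,    1,    0,    0,    0,    0,    0,    0)

at_half <- rates_at(toy_prob, toy_actual, 0.5)
at_low  <- rates_at(toy_prob, toy_actual, 0.15)
print(rbind(cutoff_0.50 = at_half, cutoff_0.15 = at_low))
```

In this toy case the lower cutoff finds every positive, but it also flags more negatives, which is the same kind of trade-off the ROC curve summarizes.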

Compute AUC
In [124]:
performance(predROCR, "auc")@y.values
Out[124]:
  1. 0.793632336655593

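An AUC of about 0.79 means that, given one randomly chosen responsive email and one randomly chosen non-responsive email, the model assigns the higher predicted probability to the responsive one about 79% of the time.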
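
The pairwise reading of AUC can be checked directly on a small example. The sketch below compares every positive score with every negative score in base R. The data are made up, and `auc_by_pairs` is a hypothetical helper, not how ROCR computes the value.

```r
# AUC as the share of (positive, negative) pairs in which the positive
# example gets the higher score; ties count as half a win.
auc_by_pairs <- function(scores, labels) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  wins <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
  mean(wins)
}

# Made-up scores: 3 positives and 4 negatives.
toy_scores <- c(0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1)
toy_labels <- c(1,   1,   1,   0,   0,   0,   0)

toy_auc <- auc_by_pairs(toy_scores, toy_labels)
print(toy_auc)
```

Here 11 of the 12 positive/negative pairs are ordered correctly, so the AUC is 11/12.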


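Second part

The second part predicts whether a tweet about Apple is negative, using tweets.csv, which holds the text of each tweet and its average sentiment score (Avg).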

In [24]:
tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
str(tweets)
'data.frame':	1181 obs. of  2 variables:
 $ Tweet: chr  "I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore" "iOS 7 is so fricking smooth & beautiful!! #ThanxApple @Apple" "LOVE U @APPLE" "Thank you @apple, loving my new iPhone 5S!!!!!  #apple #iphone5S pic.twitter.com/XmHJCU4pcb" ...
 $ Avg  : num  2 2 1.8 1.8 1.8 1.8 1.8 1.6 1.6 1.6 ...

Create dependent variable

In [25]:
tweets$Negative = as.factor(tweets$Avg <= -1)

table(tweets$Negative)
Out[25]:
FALSE  TRUE 
  999   182 
In [26]:
library(tm)
In [27]:
library(SnowballC)

Create corpus

In [28]:
corpus = Corpus(VectorSource(tweets$Tweet))
Look at corpus
In [29]:
corpus

corpus[[1]]
Out[29]:
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 1181
Out[29]:
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 101

Convert to lower-case

In [30]:
corpus = tm_map(corpus, tolower)

corpus[[1]]
Out[30]:
"i have to say, apple has by far the best customer care service i have ever received! @apple @appstore"
In [31]:
corpus = tm_map(corpus, PlainTextDocument)

Remove punctuation

In [32]:
corpus = tm_map(corpus, removePunctuation)

corpus[[1]]
Out[32]:
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 97

Look at stop words

In [33]:
stopwords("english")[1:10]
Out[33]:
  1. "i"
  2. "me"
  3. "my"
  4. "myself"
  5. "we"
  6. "our"
  7. "ours"
  8. "ourselves"
  9. "you"
  10. "your"
In [34]:
length(stopwords("english"))
Out[34]:
174

Remove stopwords and apple

In [35]:
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))

corpus[[1]]
Out[35]:
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 67

Stem document

In [36]:
corpus = tm_map(corpus, stemDocument)

corpus[[1]]
Out[36]:
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 61

Create matrix

In [37]:
frequencies = DocumentTermMatrix(corpus)

frequencies
Out[37]:
<<DocumentTermMatrix (documents: 1181, terms: 3289)>>
Non-/sparse entries: 8980/3875329
Sparsity           : 100%
Maximal term length: 115
Weighting          : term frequency (tf)

Look at matrix

In [38]:
inspect(frequencies[1000:1005,505:515])
<<DocumentTermMatrix (documents: 6, terms: 11)>>
Non-/sparse entries: 1/65
Sparsity           : 98%
Maximal term length: 9
Weighting          : term frequency (tf)

              Terms
Docs           cheapen cheaper check cheep cheer cheerio cherylcol chief
  character(0)       0       0     0     0     0       0         0     0
  character(0)       0       0     0     0     0       0         0     0
  character(0)       0       0     0     0     0       0         0     0
  character(0)       0       0     0     0     0       0         0     0
  character(0)       0       0     0     0     0       0         0     0
  character(0)       0       0     0     0     1       0         0     0
              Terms
Docs           chiiiiqu child children
  character(0)        0     0        0
  character(0)        0     0        0
  character(0)        0     0        0
  character(0)        0     0        0
  character(0)        0     0        0
  character(0)        0     0        0

Check for sparsity

In [39]:
findFreqTerms(frequencies, lowfreq=20)
Out[39]:
  1. "android"
  2. "anyon"
  3. "app"
  4. "appl"
  5. "back"
  6. "batteri"
  7. "better"
  8. "buy"
  9. "can"
  10. "cant"
  11. "come"
  12. "dont"
  13. "fingerprint"
  14. "freak"
  15. "get"
  16. "googl"
  17. "ios7"
  18. "ipad"
  19. "iphon"
  20. "iphone5"
  21. "iphone5c"
  22. "ipod"
  23. "ipodplayerpromo"
  24. "itun"
  25. "just"
  26. "like"
  27. "lol"
  28. "look"
  29. "love"
  30. "make"
  31. "market"
  32. "microsoft"
  33. "need"
  34. "new"
  35. "now"
  36. "one"
  37. "phone"
  38. "pleas"
  39. "promo"
  40. "promoipodplayerpromo"
  41. "realli"
  42. "releas"
  43. "samsung"
  44. "say"
  45. "store"
  46. "thank"
  47. "think"
  48. "time"
  49. "twitter"
  50. "updat"
  51. "use"
  52. "via"
  53. "want"
  54. "well"
  55. "will"
  56. "work"

Remove sparse terms

In [40]:
sparse = removeSparseTerms(frequencies, 0.995)
sparse
Out[40]:
<<DocumentTermMatrix (documents: 1181, terms: 309)>>
Non-/sparse entries: 4669/360260
Sparsity           : 99%
Maximal term length: 20
Weighting          : term frequency (tf)

Convert to a data frame

In [41]:
tweetsSparse = as.data.frame(as.matrix(sparse))

Make all variable names R-friendly

In [42]:
colnames(tweetsSparse) = make.names(colnames(tweetsSparse)) # do this any time building a data frame from text
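
A quick way to see why this step matters is to run make.names on a few terms that are not valid R variable names. The terms below are examples picked for illustration.

```r
# Terms as they might come out of a document-term matrix.
term_names <- c("100", "2001", "iphone5", "if")

# make.names prefixes names that start with a digit and
# appends a dot to reserved words such as "if".
safe_names <- make.names(term_names)
print(safe_names)
```

Names that are already valid, such as "iphone5", are left unchanged.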

Add dependent variable

In [43]:
tweetsSparse$Negative = tweets$Negative
In [44]:
library(caTools)
In [45]:
set.seed(123)

split = sample.split(tweetsSparse$Negative, SplitRatio = 0.7)

trainSparse = subset(tweetsSparse, split==TRUE)
testSparse = subset(tweetsSparse, split==FALSE)

Build a CART model

In [46]:
library(rpart)
library(rpart.plot)
In [47]:
tweetCART = rpart(Negative ~ ., data=trainSparse, method="class")

prp(tweetCART)

Evaluate the performance of the model

In [48]:
predictCART = predict(tweetCART, newdata=testSparse, type="class")
In [49]:
table(testSparse$Negative, predictCART) # confusion matrix
Out[49]:
       predictCART
        FALSE TRUE
  FALSE   294    6
  TRUE     37   18
In [50]:
accuracy=sum(diag(table(testSparse$Negative, predictCART)))/(sum(table(testSparse$Negative, predictCART)))

accuracy
Out[50]:
0.87887323943662

Baseline accuracy

In [51]:
table(testSparse$Negative)
Out[51]:
FALSE  TRUE 
  300    55 
In [52]:
accuracy=max(table(testSparse$Negative))/(sum(table(testSparse$Negative)))
accuracy
Out[52]:
0.845070422535211

Random forest model

In [53]:
library(randomForest)
set.seed(123)
randomForest 4.6-10
Type rfNews() to see new features/changes/bug fixes.
In [54]:
tweetRF = randomForest(Negative ~ ., data=trainSparse)

Make predictions

In [55]:
predictRF = predict(tweetRF, newdata=testSparse)

table(testSparse$Negative, predictRF)
Out[55]:
       predictRF
        FALSE TRUE
  FALSE   293    7
  TRUE     34   21
In [56]:
accuracy=(sum(diag(table(testSparse$Negative, predictRF))))/(sum(table(testSparse$Negative, predictRF)))

accuracy
Out[56]:
0.884507042253521

Logistic Regression

In [57]:
tweetLR = glm(Negative ~ ., data=trainSparse, family='binomial')
Warning message:
: glm.fit: algorithm did not converge
Warning message:
: glm.fit: fitted probabilities numerically 0 or 1 occurred
Now, make predictions using the logistic regression model:
In [58]:
predictions = predict(tweetLR, newdata=testSparse, type="response")
Warning message:
In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == : prediction from a rank-deficient fit may be misleading
In [59]:
table(testSparse$Negative, predictions>0.5)
Out[59]:
       
        FALSE TRUE
  FALSE   253   47
  TRUE     22   33

Calculate the accuracy of the logistic regression model

In [60]:
accuracy=(sum(diag(table(testSparse$Negative, predictions>0.5))))/(sum(table(testSparse$Negative, predictions>0.5)))

accuracy
Out[60]:
0.805633802816901

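This is an example of overfitting. The test-set accuracy of the logistic regression model (0.806) is below the baseline (0.845) and well below CART (0.879) and the random forest (0.885). The warnings point the same way: fitted probabilities of 0 or 1 show that the model fits the training set almost perfectly, but it does not perform well on the test set. A logistic regression model with a large number of variables (309 terms here) is particularly at risk of overfitting.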
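
The comparison can be reproduced from the confusion matrices printed above. The sketch below re-enters those counts by hand; `make_cm`, `models`, and `baseline` are names introduced for this illustration.

```r
# Build a 2x2 confusion matrix (rows: actual, columns: predicted).
make_cm <- function(tn, fp, fn, tp) {
  matrix(c(tn, fn, fp, tp), nrow = 2,
         dimnames = list(actual = c("FALSE", "TRUE"),
                         predicted = c("FALSE", "TRUE")))
}

# Counts copied from the test-set outputs above.
models <- list(
  CART         = make_cm(294, 6, 37, 18),
  RandomForest = make_cm(293, 7, 34, 21),
  Logistic     = make_cm(253, 47, 22, 33)
)

accuracy <- sapply(models, function(cm) sum(diag(cm)) / sum(cm))
baseline <- 300 / (300 + 55)  # always predict the majority class (FALSE)

print(round(c(Baseline = baseline, accuracy), 3))
```

Accuracy is not the whole picture, though: the logistic regression model flags 33 of the 55 negative tweets, compared with 18 for CART, at the cost of 47 false positives.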


