Fisseha Berhane, PhD

Data Scientist


Using SparkR in Rstudio with Hadoop Deployed on AWS EC2 - part 1

In a previous post, I wrote about how to integrate Hadoop ecosystem components with business intelligence (BI) tools. In the next couple of posts, I will blog about how to use SparkR, an R package that provides a light-weight frontend for using Apache Spark from R, in RStudio with the Hortonworks Data Platform (HDP) Hadoop distribution deployed on Amazon Web Services (AWS) Elastic Compute Cloud (EC2). SparkR provides a distributed data frame implementation that supports operations like selection, filtering, and aggregation, similar to the dplyr R package, but on large datasets. SparkR also supports distributed machine learning using MLlib. You can read more on the Apache Spark website.

This link and this YouTube video describe the steps for deploying a Hadoop cluster with HDP on Amazon EC2. I have deployed HDP on AWS EC2 with seven m4.2xlarge Red Hat nodes.

To use R on Red Hat Enterprise Linux 7, we need Extra Packages for Enterprise Linux (EPEL). EPEL is a Fedora Special Interest Group that creates, maintains, and manages a high-quality set of additional packages for Enterprise Linux, including, but not limited to, Red Hat Enterprise Linux (RHEL), CentOS, Scientific Linux (SL), and Oracle Linux (OL). You can learn more on their website.
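Because epel-release packages are tied to the major RHEL version, it is worth confirming the release on each node before installing. A quick check (guarded so it is harmless on non-Red Hat systems):

```shell
# Print the OS release so we pick the matching epel-release package;
# falls back to a message on non-Red Hat systems.
cat /etc/redhat-release 2>/dev/null || echo "not a Red Hat-based system"
```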

Installing R

Let's get EPEL before we install R.

In [ ]:
# install the epel-release package for RHEL 7
su -c 'rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm'
In [ ]:
sudo yum update
sudo yum install R

If we run the commands above, we get an error because some of R's dependencies live in Red Hat repositories that are disabled by default.

We solve the problem by enabling Red Hat's Extras and Optional repos.
Enable Red Hat's Extras and Optional repos
In [ ]:
yum-config-manager --enable rhui-REGION-rhel-server-extras rhui-REGION-rhel-server-optional

Now, we can install R

In [ ]:
sudo yum update
sudo yum install R

Now, we can start R from the terminal.
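A quick sanity check from the shell confirms the installation worked (guarded so it is harmless on machines without R):

```shell
# Print the installed R version, or a note if R is not on PATH.
if command -v R >/dev/null 2>&1; then
  R --version | head -n 1
else
  echo "R is not on PATH"
fi
```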

Installing Rstudio

Download and install RStudio Server. You can get the most recent version from the RStudio website.

In [ ]:
sudo yum install --nogpgcheck rstudio-server-rhel-0.99.903-x86_64.rpm

Installing packages

Install some of the most commonly used R packages. We install the packages as "root" so that they are accessible to all users. If the packages are installed from within R or RStudio, they go into the installing user's personal library and are accessible only to that user.

In [ ]:
sudo su - -c "R -e \"install.packages(c('caret','ggplot2','ggthemes','gridExtra','jsonlite','lattice','lubridate','manipulate','maps','maptools','markdown','NeuralNetTools','nnet', 'pdftools','plotmo','plyr','randomForest','RColorBrewer','Rcpp','RCurl','readr','reshape2','rgdal','rmarkdown','RMongo', 'RMySQL','ROAuth','ROCR','RODBC', 'rpart','RPostgreSQL','rsconnect','RSQLite','scales','shiny','shinydashboard','shinythemes', 'sp','sqldf','stringi','stringr','tidyr','tm','twitteR','wordcloud'), repos='https://cran.rstudio.com/')\""
Note: We have to install R and the packages we want to use on all nodes on our cluster.
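One way to avoid repeating these steps by hand on every node is to loop over the workers with ssh. This is only a sketch, not the post's method: the hostnames are placeholders for your cluster's nodes, and the loop just prints the commands (drop the echo to actually run them):

```shell
# Hypothetical node list -- replace with your cluster's hostnames.
NODES="node2.example.com node3.example.com node4.example.com"

for h in $NODES; do
  # Dry run: print the command each node would receive.
  echo "ssh $h 'sudo yum install -y R'"
done
```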

Let's create a user account and give it a password, which we will use to log in to RStudio.

In [ ]:
sudo adduser fish
In [ ]:
sudo passwd fish    # enter the new password (e.g., Rstudioiscool) at the prompt

By default, RStudio Server runs on port 8787 and accepts connections from all remote clients; make sure port 8787 is also open in the instance's AWS security group.
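To verify the server is reachable, you can point a browser (or curl) at port 8787 on the master node's public address. The hostname below is a placeholder, not an address from this cluster:

```shell
# Hypothetical public DNS -- substitute your master node's address.
HOST="ec2-0-0-0-0.compute-1.amazonaws.com"
URL="http://$HOST:8787"
echo "$URL"
# From your workstation: curl -I "$URL"   (a running RStudio Server responds)
```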

Now, let's open a browser and log in to RStudio Server using the username and password we created.

Let's run some R code and make sure everything is fine.


In this blog post, we saw how to install R, RStudio Server, and R packages on a Hortonworks Data Platform Hadoop cluster deployed on AWS EC2. In the next couple of posts, we will start using SparkR.
