In a previous post, I wrote about how to integrate Hadoop ecosystem components with Business Intelligence (BI) tools. In the next couple of posts, I will blog about how to use SparkR, an R package that provides a lightweight frontend for using Apache Spark from R, in RStudio with the Hortonworks Data Platform (HDP) Hadoop distribution deployed on Amazon Web Services (AWS) Elastic Compute Cloud (EC2). SparkR provides a distributed data frame implementation that supports operations like selection, filtering, and aggregation, similar to the dplyr R package, but on large datasets. SparkR also supports distributed machine learning using MLlib. You can read more on the Apache Spark website.
This link and this YouTube video have the steps for deploying a Hadoop cluster on Amazon EC2 with HDP. I have deployed HDP on AWS EC2 with seven m4.2xlarge Red Hat nodes.
To use R on Red Hat Enterprise Linux 7, we need Extra Packages for Enterprise Linux (EPEL). EPEL is a Fedora Special Interest Group that creates, maintains, and manages a high-quality set of additional packages for Enterprise Linux, including, but not limited to, Red Hat Enterprise Linux (RHEL), CentOS, Scientific Linux (SL), and Oracle Linux (OL). You can learn more from their website.
Let's get EPEL before we install R:
su -c 'rpm -Uvh http://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-8.noarch.rpm'
sudo yum update
sudo yum install R
If we run the code above, yum fails with missing-dependency errors: some of the packages R needs come from the RHEL server extras and optional repositories, which are not enabled by default on EC2. Enable them first:
sudo yum-config-manager --enable rhui-REGION-rhel-server-extras rhui-REGION-rhel-server-optional
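To confirm the repositories are now active, we can list the enabled repos (the repository ids are the ones enabled above; the exact names can vary slightly by region and image):
# list the enabled repositories and look for the extras and optional repos
yum repolist enabled | grep rhel-server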
Now, we can install R
sudo yum update
sudo yum install R
Now, we can start R from the terminal.
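As a quick check that the installation worked:
# print the installed R version
R --version
# start an interactive R session (quit with q())
R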
Download and install RStudio Server. You can get the most recent version here.
wget https://download2.rstudio.org/rstudio-server-rhel-0.99.903-x86_64.rpm
sudo yum install --nogpgcheck rstudio-server-rhel-0.99.903-x86_64.rpm
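RStudio Server comes with an admin utility that can be used to check the installation and confirm the service is running:
# verify the RStudio Server installation and check the service status
sudo rstudio-server verify-installation
sudo rstudio-server status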
Install some of the most commonly used R packages. We install the packages as "root" so that they are accessible to all users. If the packages are installed from within R or RStudio instead, they go into the installing user's personal library and are accessible only to that user.
sudo su - -c "R -e \"install.packages(c('caret','ggplot2','ggthemes','gridExtra','jsonlite','lattice','lubridate','manipulate','maps','maptools','markdown','NeuralNetTools','nnet', 'pdftools','plotmo','plyr','randomForest','RColorBrewer','Rcpp','RCurl','readr','reshape2','rgdal','rmarkdown','RMongo', 'RMySQL','ROAuth','ROCR','RODBC', 'rpart','RPostgreSQL','rsconnect','RSQLite','scales','shiny','shinydashboard','shinythemes', 'sp','sqldf','stringi','stringr','tidyr','tm','twitteR','wordcloud'), repos='http://cran.rstudio.com/')\""
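To confirm the packages landed in the system-wide library rather than a personal one, we can print the library paths and check for one of the packages (caret is used here just as an example; the site library path may differ on your system):
# the site library (e.g. /usr/lib64/R/library) should appear in .libPaths()
# and the packages installed above should be found there
R -e ".libPaths(); 'caret' %in% rownames(installed.packages())"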
Let's create a user account and set its password; we will use these credentials to log in to RStudio. Running passwd for the new user prompts for the password (here, Rstudioiscool).
sudo adduser fish
sudo passwd fish
By default RStudio Server runs on port 8787 and accepts connections from all remote clients.
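The EC2 security group for the node running RStudio Server also needs to allow inbound traffic on port 8787; once it does, the server is reachable at http://<ec2-public-dns>:8787. To confirm it is listening locally:
# check that RStudio Server is listening on the default port 8787
sudo ss -tlnp | grep 8787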
Now, let's open a browser and log in to RStudio Server using the username and password we created.
Let's run some R code and make sure everything is fine.
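As a small smoke test (run here from the terminal with R -e, though the same expressions can just as well be typed into the RStudio console in the browser), we can load one of the packages installed earlier and print the session information:
# load ggplot2 and print the session info as a quick smoke test
R -e "library(ggplot2); sessionInfo()"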
In this blog post, we saw how to install R, RStudio Server, and R packages on the Hortonworks Data Platform Hadoop distribution running on AWS EC2. In the next couple of posts, we will start using SparkR.