Fisseha Berhane, PhD

Data Scientist

443-970-2353 fisseha@jhu.edu CV Resume LinkedIn GitHub Twitter


Data Analysis and Machine Learning with R



Integrating Tableau with R through R Notebooks and Shiny for Descriptive, Inferential and Predictive Analytics

Integrating R Notebooks and R shiny with Tableau enables us to have descriptive, inferential and predictive analytics in our Tableau story/dashboard... more



Using PostgreSQL and shiny with a dynamic leaflet map: monitoring trash cans

When there is increased social activity, trash cans fill up more quickly. Conversely, during very cold weather, the trash cans can take one or two more days to fill. Therefore, knowing when the trash cans are full lets us pick them up right away rather than waiting for a specific day of the week.

The code is available on GitHub



Leaflet, Plotly and Shiny: Weather Forecasts In The Northeast

Integrating JavaScript libraries with R helps create interactive visualizations. This blog post uses Leaflet, which is the leading open-source JavaScript library for interactive maps, and plotly to create weather forecast visualizations... more




Email and Text Message Alerts Based on Streaming Sensor Data

How can we get email and text message alerts when sensors either fail or transmit abnormal readings? If we have a dashboard built on dynamic data and we want alerts when certain conditions are met, how can we do that? One option is using R-shiny.



Logistic Regression Regularized with Optimization

Logistic regression predicts the probability of the outcome being true. In this blog post, we will build logistic regression models to predict whether a student gets admitted into a university and whether microchips from a fabrication plant pass quality assurance ... more



Anomaly Detection with R

Anomaly detection is used in many applications. It is a common technique for fraud detection. It is also used in manufacturing to detect anomalous systems such as aircraft engines, and it can identify anomalous medical devices and machines in a data center. In this blog post, we will implement an anomaly detection algorithm and apply it to detect failing servers on a network ... more
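
As a rough illustration of the idea, here is a minimal sketch in Python (the post itself works in R, and this is not its code). It assumes one common approach, which the teaser does not confirm: fit an independent Gaussian to each feature of healthy servers and flag any server whose estimated density falls below a threshold. The readings, the feature names and the threshold `epsilon` are all made up for the example.

```python
import math
import statistics

def fit_gaussian(rows):
    """Per-feature mean and population variance of the training rows."""
    columns = list(zip(*rows))
    mus = [statistics.fmean(col) for col in columns]
    variances = [statistics.pvariance(col, mu) for col, mu in zip(columns, mus)]
    return mus, variances

def density(row, mus, variances):
    """Product of independent univariate Gaussian densities."""
    p = 1.0
    for x, mu, var in zip(row, mus, variances):
        p *= math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)
    return p

# Made-up (latency in ms, throughput in mb/s) readings from healthy servers.
normal_servers = [(14.1, 15.0), (13.8, 14.6), (14.5, 15.3), (13.9, 14.9),
                  (14.2, 15.1), (14.0, 14.8), (14.3, 15.2), (13.7, 14.7)]
mus, variances = fit_gaussian(normal_servers)

epsilon = 1e-3  # hypothetical threshold; in practice tuned on labelled data
candidates = [(14.0, 15.0), (20.5, 9.0)]
flags = [density(c, mus, variances) < epsilon for c in candidates]
print(flags)  # True marks a server flagged as anomalous
```

The first candidate sits close to the healthy readings and is not flagged; the second is far from them on both features and is.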



Analytical and Numerical Solutions, with R, to Linear Regression Problems

This post shows how to implement numerical and analytical solutions to linear regression problems using R. This is the first programming exercise in the Coursera machine learning course offered by Andrew Ng. The course is offered with Matlab/Octave. Since R is the lingua franca of data science, I plan to do all the programming exercises in Andrew's course with R ... more
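
To make the analytical/numerical distinction concrete, here is a small Python sketch (the post itself uses R, and this is not its code). It fits a one-predictor model both with the closed-form least-squares solution and with batch gradient descent, then compares the two. The data points, learning rate `alpha` and iteration count are assumptions chosen for the example.

```python
def fit_closed_form(xs, ys):
    """Ordinary least squares for y = b0 + b1*x via the closed-form solution."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

def fit_gradient_descent(xs, ys, alpha=0.05, iters=20000):
    """Minimise half the mean squared error with batch gradient descent."""
    n = len(xs)
    b0 = b1 = 0.0
    for _ in range(iters):
        errors = [b0 + b1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / n
        grad1 = sum(e * x for e, x in zip(errors, xs)) / n
        b0 -= alpha * grad0
        b1 -= alpha * grad1
    return b0, b1

# Made-up data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
analytical = fit_closed_form(xs, ys)
numerical = fit_gradient_descent(xs, ys)
print(analytical, numerical)
```

With a small enough learning rate, the numerical estimates converge to the analytical ones; too large a rate makes gradient descent diverge, which is the usual trade-off between the two approaches.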



Visualizing Streaming Data with Shiny

This post is on visualizing streaming data with Shiny... more.




Using SparkR in RStudio with Hadoop Deployed on AWS EC2 - part 1

SparkR provides a distributed data frame implementation that supports operations like selection, filtering and aggregation, similar to the dplyr R package but on large datasets. SparkR also supports distributed machine learning using MLlib... more



Using SparkR in RStudio with Hadoop Deployed on AWS EC2 - part 2

In a previous post, we saw how to install R, RStudio Server and R packages on an AWS EC2 Red Hat cluster to use with the Hortonworks Data Platform (HDP 2.4) Hadoop distribution. Now, let’s use SparkR for data munging... more


Sentiment Analysis of Donald Trump's views on Muslims using R and Tableau

In this post, we will focus on how to integrate R and Tableau for text mining, sentiment analysis and visualization. Using these tools together enables us to answer detailed questions... more



Working with databases in R

The dplyr package, which is one of my favorite R packages, works with in-memory data and with data stored in databases. In this post, I will share my experience on using dplyr to work with databases... more



JSON data manipulation: the R way vs the Python way

JavaScript Object Notation (JSON) is the most common data format used for asynchronous browser/server communication and knowing how to work with JSON data is important as we get various datasets out there in this format. In this blog post, we will see how to analyse JSON data with R. In the next blog post, we will perform the same tasks using Python... more.



My Two Favorite Packages for Data Manipulation in R

dplyr and data.table are awesome because they make data manipulation more fun. Both packages have their strengths. While dplyr is more elegant and resembles natural language, data.table is succinct and we can do a lot with data.table in just a single line. Further, data.table is, in some cases, faster and it may be a go-to package when performance and memory are constraints... more



Performing SQL selects on R data frames

For anyone who has an SQL background and wants to learn R, the sqldf package is very useful because it enables us to use SQL commands in R. Anyone with basic SQL skills can use them to manipulate data frames in R... more



Machine Learning for Drug Adverse Event Discovery

Clustering can be used for knowledge discovery in drug adverse event reactions. Especially in cases where the data has millions of observations and we cannot get any insight visually, clustering comes in handy for summarizing our data, for getting statistical insights and for discovering new knowledge... more



Semi-automated rainfall prediction models for any geographic region using R (Shiny)

Here, I used shiny, an R package that makes it easy to build interactive web applications (apps) straight from R, together with HTML, CSS and JavaScript, to develop semi-automated machine learning models that predict rainfall over a region the user selects. The user can extract the predictand and predictors by drawing a polygon over a region. Then, the user can select some or all of the machine learning algorithms provided. The models provided include linear regression models (GLM, SGLM), tree-based ensemble models (random forest and boosting), support vector machines, artificial neural networks, and other non-linear models (GAM, SGAM, MARS). Finally, the user can download the analysis steps they used, such as the region they selected, the time period they specified, the predictand and predictors they chose and the preprocessing options they used, along with the model results, in PDF or HTML format. A quick demo is shown in the video below.

The server.R and ui.R code is on GitHub




Supervised Machine Learning with R and Python

Here I show how to build various machine learning models in Python and R. The models include linear regression, logistic regression, tree-based models (bagging and random forest) and support vector machines (SVM)... more



Web scraping with R using rvest: Population of U.S. states and territories

In this post, we will use the rvest web scraping R package to scrape US population data from Wikipedia and use ggplot2 to visualize the population data by state... more.



US Hospital Ranking Shiny App

This is my shiny app for viewing the performance of various US hospitals in treating heart attack, heart failure and pneumonia. We can select a state and outcome, see the rank of a hospital in that state and compare its performance with all hospitals across the nation.... more.



How is climate changing and where with R

Although global temperature has risen, the magnitude of the change varies from region to region. Here, I use R to investigate how temperature has changed in the last 110 years... more.



How Many Live on How Much, and Where with Shiny

This is a shiny app that shows world population by income. The income data can be downloaded from here and the shape file can be downloaded from here. The ui.R and server.R code is available here



Using linear and non-linear regression models to predict global temperature

In this project, I compare the performance of support vector machines, neural networks, boosting, classification and regression trees, random forest and linear models (such as the generalized linear model, lasso, ridge regression and elastic net) in predicting the average global temperature anomaly... more



Ensemble Machine Learning Techniques for Human Activity Recognition

In this project, ensemble tree-based predictive models that determine the manner in which an exercise is done are built. The models considered are Random Forest, Adaptive Boosting and Bagged Adaptive Boosting... more



Velloso et al. (2013)



ggplot in R and Python

The grammar of graphics package (ggplot2) is the best data visualization library in R. The concept of grammar of graphics is also implemented in Python with the library ggplot and it has similar commands to ggplot2... more



Correlation map of climate variables

To understand physical mechanisms and to develop statistical rainfall prediction models, correlation analysis is used as a first step. Here, I show how to use R to generate a correlation map between rainfall and sea surface temperature... more




Reproducible Research with R

It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit (http://www.fitbit.com), Nike Fuelband (http://www.nike.com/us/en_us/c/nikeplusfuelband), or Jawbone Up (https://jawbone.com/up). These types of devices are part of the “quantified self” movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech... more



Downloading data from the web using R

There are different ways that we can download and read data into R. Some examples are shown... more



Google scholar scraping with R

In this post, I will show how to scrape a Google Scholar account. In particular, we will use the 'rvest' R package to scrape the Google Scholar account of my advisor. We will see his coauthors, how many times they are cited and their affiliations... more



Slidify presentation of a Shiny App

Slidify helps to create data-centric presentations. It allows embedded code chunks and mathematical formulas to be rendered correctly. Final products are HTML files, which can be viewed with any web browser and shared easily.


Shiny is an R package that makes it easy to build interactive web applications straight from R. Here I have developed a shiny app that calculates the area average rainfall and temperature climatologies and trend over any region selected by the user over Africa (see presentation).



Composite analysis to capture non-linear relationships

Correlation captures linear relationships but does not handle non-linear ones, so composite analysis is also widely used to understand physical mechanisms and to develop statistical rainfall prediction models. Here, I show how to use R to generate composites of rainfall based on sea surface temperature... more.



Most Harmful Storms and Weather Events In The United States


This report seeks to investigate storms and other weather events that cause the highest number of fatalities and injuries. Moreover, it shows which events have the greatest economic consequences... more.



Hospital Rankings In The United States

Here, I compare the performance of hospitals in the USA using data from the Hospital Compare web site (http://hospitalcompare.hhs.gov) run by the U.S. Department of Health and Human Services. Hospitals are ranked on a statewide and nationwide basis, considering different outcomes... more.



Car Fuel Efficiency and Transmission Type

In this analysis, the relationship between a set of variables and miles per gallon (MPG) is explored using the mtcars dataset. Particularly, the MPG difference between automatic and manual transmissions is evaluated and quantified using multivariate linear regression models... more.




The Role of Regular Expressions in Creating Tidy Data

In this analysis, let's use regular expressions in R to prepare a tidy dataset for later analysis and demonstrate the strength of regular expressions... more.



World's Biggest Companies

Let's visualize the distribution of the world's biggest companies using the Forbes2000 data from the HSAUR2 package... more.



Approximating distributions

Here, let's see the approximation of some distributions by other distributions, when certain criteria are met, through simulations... more.



Predicting Earnings from census data

In this problem, we are going to use census information about an individual to predict how much a person earns -- in particular, whether the person earns more than $50,000 per year... more.



Letter Recognition


This is a letter recognition exercise using tree-based models... more.



Quick overview of climate trends using Shiny

Global climate is changing and this change is apparent across a wide range of observations. The impacts of climate change on rainfall and temperature vary from region to region. Shiny, an R package that makes it easy to build interactive web applications (apps) straight from R, can be used to see the trends of different climate variables over different parts of the world very quickly. Here, I developed a Shiny app that displays the trends of temperature and rainfall over any selected region of Africa. This app can be used as a starting point in studying the impacts of, adaptation to and mitigation of climate change over a region. The app is available on RStudio.



Working with Dates and Times in R using Power Consumption Data

Here, data from the UC Irvine Machine Learning Repository, a popular repository for machine learning datasets, are used to show how to work with dates in R... more.



Text Analytics with R

This lab is on text analytics with R using logistic regression and regression trees... more.



Visualizing election predictions using ggplot2

Here, ggplot2 is used to visualize US presidential election predictions... more.



Visualizing murder rates by state in the US with ggplot2

Let's visualize murder rate by state in the US using ggplot2... more.



Using simulation to demonstrate the Central Limit Theorem

The central limit theorem (CLT) states that the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed, regardless of the underlying distribution (Wikipedia). Here, I demonstrate this theorem using simulations... more.
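
As a rough illustration of the idea, here is a short Python sketch (the post itself uses R, and this is not its code). It repeatedly draws samples from a skewed exponential distribution and checks that the sample means cluster around the population mean with a spread close to sigma/sqrt(n). The choice of distribution, the sample size, the number of repetitions and the seed are arbitrary assumptions.

```python
import random
import statistics

def sample_means(n, reps, rate=1.0, seed=42):
    """Return `reps` means, each over `n` draws from an Exponential(rate)."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(rate) for _ in range(n))
        for _ in range(reps)
    ]

means = sample_means(n=40, reps=2000)
center = statistics.fmean(means)  # theory: 1/rate = 1.0
spread = statistics.stdev(means)  # theory: (1/rate)/sqrt(40), about 0.158
print(f"mean of sample means: {center:.3f}, sd: {spread:.3f}")
```

A histogram of `means` would look roughly bell-shaped even though the underlying exponential distribution is strongly right-skewed.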



The Effect of Vitamin C on Tooth Growth in Guinea Pigs

In this analysis, inferential statistics is employed to investigate if the length of odontoblasts (cells responsible for tooth growth) in guinea pigs is influenced by the dose levels of vitamin C (0.5, 1, and 2 mg/day). Moreover, the impacts of two delivery methods (orange juice or ascorbic acid) on the length of odontoblasts are studied... more.



Exploratory Analysis of Fine Particulate Matter

In this analysis, the trend of fine particulate matter over time, by source, in the United States and over specific cities is investigated... more.



Spatial distribution of Food and Drug Administration's adverse events reports data

This post shows how to download the FDA's quarterly adverse event reports data, concatenate the files, analyze them using the dplyr package and display the results by country and gender using ggplot2... more.




Visualizing outcomes of drug-related adverse events

How many of the drug-related adverse events in the FDA database resulted in deaths, disabilities, hospitalization, etc.? Here we will download various datasets and join them... more.




Web scraping with R using rvest: List of countries by proven oil reserves

In this post, let's see how to scrape information from the web using the rvest R package. Specifically, we will scrape the "List of countries by proven oil reserves" table from Wikipedia... more.


Wikipedia


Web Scraping and Natural Language Processing: most commonly used words in a journal paper

I am practicing web scraping, regular expressions and natural language processing in R. In this post, I will find the most commonly used words in one of my published papers... more.



Text Mining, Scraping and Sentiment Analysis with R: Russia this week

In this post, I am scraping Twitter to understand what has been said about Russia and its relations with the Middle East. In particular, we will see the sentiment of posts from November 24-29, 2015. For this exercise, we will consider posts in English... more.



Top searches associated with each nation with R

In this post, we will get the top searches associated with each nation. To do so, we will first scrape the list of world countries from Wikipedia... more.



Google Trends Analytics using Shiny

In this post, I will show how we can use Shiny to analyse Google Trends data and create a dashboard. Shiny is an R package that makes it easy to build interactive web apps straight from R. For a nice look and feel, we will use the shinydashboard package.... more.



Visualizing world cities using R

In this post, we will create a world map that shows world cities using ggplot2. We will get the cities and their attributes from Wikipedia. After we get the world cities data from Wikipedia, we will use the string manipulation packages stringi and stringr .... more.



Installing and loading many R packages at once

I like Jupyter notebook because it enables me to use R, Python and Matlab in the same session. Recently, I was trying to install other kernels and, for an unknown reason, my Jupyter notebook crashed. Then, I uninstalled Anaconda and reinstalled it. The problem: I lost all the R packages I had installed over time .... more.



Document Clustering

Clustering is an unsupervised learning technique with wide applications. Some examples where clustering is commonly applied are market segmentation, social network analytics, and astronomical data analysis. This post is on document clustering or text clustering, which is a very popular application of clustering algorithms. We will see K-means clustering and hierarchical clustering... more.



A Shiny Dashboard of Adverse Drug Event Reports

This is a shiny dashboard developed using openFDA data. The data is in JSON format. The R library jsonlite is used to access the data from the openFDA website and convert it to a data frame. The user can select one, several or all types of events.

The ui.R and server.R code is available here


PDF Mining with R using Shiny

This application helps to get useful insights from PDF documents by creating visualizations and summarizations. It also enables searching, sorting and filtering. We can browse through lots of documents in a single click and get a summary and comparison of the documents in minutes.... more.



The importance of Data Visualization

Before we perform any analysis and come up with any assumptions about the distribution of and relationships between variables in our datasets, it is always a good idea to visualize our data in order to understand their properties and identify appropriate analytics techniques... more.



Using Amazon Relational Database Service with Python and R

Amazon Relational Database Service (RDS) is a distributed relational database service by Amazon Web Services (AWS). It simplifies the setup, operation, and scaling of a relational database for use in applications. In this blog post, we will see how to use R and Python with Amazon RDS. AWS RDS has a free tier for anybody to use for testing/development efforts... more.



Interactive visualization with R-Shiny versus with Tableau - Part 1

This post is on interactive treemaps with Shiny and Tableau... more.



Using MongoDB with R and Python

This post shows how to use Python and R to analyse data stored in MongoDB, a leading NoSQL database engine in use today. When dealing with large volumes of data, using MongoDB can give us a performance boost ... more.



Data Manipulation with Python Pandas and R data.table

Pandas is a commonly used data manipulation library in Python. data.table, on the other hand, is among the best data manipulation packages in R. It is succinct, and we can do a lot with data.table in just a single line. Further, data.table is generally faster than Pandas (see benchmark here) and it may be a go-to package when performance is a constraint. For someone who knows one of these packages, I thought it could help to show code that performs the same tasks in both packages side by side to help them quickly study the other. If you know either of them and want to learn the other, this blog post is for you ... more.