Fisseha Berhane, PhD

Data Scientist

443-970-2353 fisseha@jhu.edu CV Resume LinkedIn GitHub Twitter


Data Analysis and Machine Learning with Python and Apache Spark


Using Amazon Relational Database Service with Python and R

Amazon Relational Database Service (RDS) is a distributed relational database service by Amazon Web Services (AWS). It simplifies the setup, operation, and scaling of a relational database for use in applications. In this blog post, we will see how to use R and Python with Amazon RDS. AWS RDS has a free tier that anyone can use for testing and development... more.
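For concreteness, here is a minimal sketch of querying an RDS database from Python, assuming a PostgreSQL instance reachable through psycopg2; the endpoint, database name, credentials, and table are placeholders, not values from the post.

```python
# Minimal sketch: querying a PostgreSQL RDS instance from Python.
# The endpoint, database name, credentials, and table are placeholders.
import psycopg2
import pandas as pd

conn = psycopg2.connect(
    host="mydb.abc123xyz.us-east-1.rds.amazonaws.com",  # RDS endpoint (placeholder)
    port=5432,
    dbname="testdb",
    user="myuser",
    password="mypassword",
)

# Read a table straight into a pandas DataFrame for analysis
df = pd.read_sql("SELECT * FROM sales LIMIT 10;", conn)
print(df.head())

conn.close()
```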



Using MongoDB with R and Python

This post shows how to use Python and R to analyse data stored in MongoDB, the most widely used NoSQL database engine today. When dealing with large volumes of data, MongoDB can give us a performance boost ... more.
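As a flavor of the Python side, here is a minimal sketch of querying MongoDB with pymongo; the connection string, database, collection, and field names are made up for illustration.

```python
# Minimal sketch: reading MongoDB documents into pandas with pymongo.
# Connection string, database, collection, and field names are placeholders.
from pymongo import MongoClient
import pandas as pd

client = MongoClient("mongodb://localhost:27017/")
collection = client["mydb"]["customers"]

# Simple query: customers from a given city, projecting a few fields
cursor = collection.find({"city": "Baltimore"}, {"_id": 0, "name": 1, "age": 1})
df = pd.DataFrame(list(cursor))
print(df.head())

# Server-side aggregation: average age by city
pipeline = [{"$group": {"_id": "$city", "avg_age": {"$avg": "$age"}}}]
for doc in collection.aggregate(pipeline):
    print(doc)
```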



Data Manipulation with Python Pandas and R Data.Table

Pandas is a commonly used data manipulation library in Python. data.table, on the other hand, is among the best data manipulation packages in R. data.table is succinct, and we can do a lot with it in just a single line. Further, data.table is generally faster than Pandas (see benchmark here) and may be the go-to package when performance is a constraint. For someone who knows one of these packages, I thought it could help to show code that performs the same tasks in both packages side by side, making it quicker to pick up the other. If you know either of them and want to learn the other, this blog post is for you ... more.
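As a small taste of the side-by-side idea, here is a sketch of a few common pandas operations with rough data.table equivalents noted in comments; the column names and values are invented for illustration.

```python
# Minimal sketch of the side-by-side idea: common pandas operations with the
# rough data.table equivalents noted in comments. Column names are made up.
import pandas as pd

df = pd.DataFrame({"city": ["A", "A", "B"], "sales": [10, 20, 30]})

# Filter rows            data.table: DT[sales > 15]
high = df[df["sales"] > 15]

# Select columns         data.table: DT[, .(city, sales)]
subset = df[["city", "sales"]]

# Aggregate by group     data.table: DT[, .(total = sum(sales)), by = city]
totals = df.groupby("city", as_index=False)["sales"].sum()

print(high, subset, totals, sep="\n\n")
```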



Big Data Analytics with Spark - part 2

Most of the time, data analysis involves more than one table of data. Therefore, it is important to know techniques that enable us to combine data from various tables. In this blog post, let's see how we can work with joins in Spark... more
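A minimal PySpark sketch of the idea, using invented tables and column names:

```python
# Minimal sketch of a DataFrame join in PySpark; tables and columns are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("joins").getOrCreate()

orders = spark.createDataFrame(
    [(1, 101, 250.0), (2, 102, 75.0)], ["order_id", "customer_id", "amount"])
customers = spark.createDataFrame(
    [(101, "Alice"), (103, "Bob")], ["customer_id", "name"])

# Inner join keeps only matching customer_ids; try "left" or "outer" to compare
joined = orders.join(customers, on="customer_id", how="inner")
joined.show()
```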


Big Data Analytics with Spark - part 1

In this big data era, Spark, a fast and general engine for large-scale data processing, is one of the most widely used big data tools. It is a cluster computing framework for scalable and efficient analysis of big data. Tools such as R and Pandas run on a single machine, but with Spark we can use many machines that divide the tasks among themselves and perform fault-tolerant computations by distributing the data over a cluster... more
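A minimal sketch of distributing a simple computation with PySpark; the numbers and partition count here are arbitrary.

```python
# Minimal sketch: distributing a computation over a cluster with PySpark RDDs.
from pyspark import SparkContext

sc = SparkContext(appName="intro")

# The list is split into partitions and processed in parallel across the cluster
rdd = sc.parallelize(range(1, 1001), numSlices=8)
total = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(total)

sc.stop()
```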


Introduction to Machine Learning with Apache Spark

One of the most common uses of big data is to predict what users want. This allows Google to show you relevant ads, Amazon to recommend relevant products, and Netflix to recommend movies that you might like. This lab will demonstrate how we can use Apache Spark to recommend movies to a user. We will start with some basic techniques, and then use the Spark MLlib library's Alternating Least Squares method to make more sophisticated predictions... more

[Image: collaborative filtering (source: Wikipedia)]
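A minimal sketch of the ALS idea using the DataFrame-based API in pyspark.ml, which may differ from the RDD-based code used in the lab; the tiny ratings sample and parameter values are placeholders.

```python
# Minimal sketch of ALS-based recommendations with Spark's DataFrame API;
# the ratings data below is a tiny made-up sample, not the lab's dataset.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("als").getOrCreate()

ratings = spark.createDataFrame(
    [(0, 10, 4.0), (0, 11, 2.0), (1, 10, 5.0), (1, 12, 3.0)],
    ["userId", "movieId", "rating"])

als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          rank=10, maxIter=10, regParam=0.1, coldStartStrategy="drop")
model = als.fit(ratings)

# Top-3 movie recommendations for every user
model.recommendForAllUsers(3).show(truncate=False)
```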

Linear Regression with Apache Spark

This lab covers a common supervised learning pipeline, using a subset of the Million Song Dataset from the UCI Machine Learning Repository. Our goal is to train a linear regression model to predict the release year of a song given a set of audio features... more
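A minimal sketch of the pipeline idea with Spark ML, standing in for the lab's actual code; the toy rows and feature names are made up.

```python
# Minimal sketch: assemble audio features into a vector and fit a linear
# regression to predict the release year. Rows here are placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("song-year").getOrCreate()

songs = spark.createDataFrame(
    [(2001.0, 0.52, 0.71), (1985.0, 0.33, 0.44), (2009.0, 0.81, 0.62)],
    ["year", "feat1", "feat2"])

assembler = VectorAssembler(inputCols=["feat1", "feat2"], outputCol="features")
train = assembler.transform(songs)

lr = LinearRegression(featuresCol="features", labelCol="year", regParam=0.01)
model = lr.fit(train)
print(model.coefficients, model.intercept)
```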


Click-Through Rate Prediction with Apache Spark

This lab covers the steps for creating a click-through rate (CTR) prediction pipeline using the Criteo Labs dataset that was used for a recent Kaggle competition... more
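A minimal sketch of the general approach, here using feature hashing and logistic regression from pyspark.ml, which may differ from the lab's own featurization code; the rows below are invented, not Criteo data.

```python
# Minimal sketch of the CTR idea: hash categorical ad features into a fixed-size
# vector and fit logistic regression. The rows are made up, not Criteo data.
from pyspark.sql import SparkSession
from pyspark.ml.feature import FeatureHasher
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ctr").getOrCreate()

clicks = spark.createDataFrame(
    [(1.0, "mobile", "sports"), (0.0, "desktop", "news"), (1.0, "mobile", "news")],
    ["label", "device", "category"])

hasher = FeatureHasher(inputCols=["device", "category"],
                       outputCol="features", numFeatures=1024)
train = hasher.transform(clicks)

lr = LogisticRegression(featuresCol="features", labelCol="label", regParam=0.01)
model = lr.fit(train)
print(model.summary.areaUnderROC)
```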


Principal Component Analysis of Neuroscience Data with Apache Spark

This lab delves into exploratory analysis of neuroscience data, specifically using principal component analysis (PCA) and feature-based aggregation. We will use a dataset of light-sheet imaging recorded by the Ahrens Lab at Janelia Research Campus, and hosted on the CodeNeuro data repository... more
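To make the PCA step concrete, here is a small NumPy sketch of PCA via the SVD on random data; the real lab distributes this computation with Spark.

```python
# Minimal sketch of PCA on a small NumPy matrix (rows = signals, columns = time
# points); the data is random and only stands in for the imaging recordings.
import numpy as np

X = np.random.rand(100, 240)           # made-up data: 100 signals, 240 time points
Xc = X - X.mean(axis=0)                # center each column

# PCA via SVD: rows of Vt are the principal components (temporal patterns here)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
top_scores = Xc.dot(Vt[:2].T)          # project onto the first two components
explained = (s ** 2) / np.sum(s ** 2)  # fraction of variance per component
print(top_scores.shape, explained[:2])
```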


Web Server Log Analysis with Apache Spark

Server log analysis is an ideal use case for Spark. It's a very large, common data source and contains a rich set of information. Spark allows you to store your logs in files on disk cheaply, while still providing a quick and simple way to perform data analysis on them... more
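A minimal sketch of the parsing-and-counting pattern in PySpark; the log path and the assumed Apache-style line format are placeholders.

```python
# Minimal sketch: parse Apache-style access logs with PySpark and count status codes.
# The log path and the assumed line format are placeholders.
import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("logs").getOrCreate()

LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+)')

def parse(line):
    m = LOG_RE.match(line)
    if m is None:
        return None
    host, ts, method, path, status, size = m.groups()
    return (host, method, path, int(status))

logs = (spark.sparkContext.textFile("access_log.txt")   # placeholder path
        .map(parse)
        .filter(lambda rec: rec is not None))

# Count responses per HTTP status code
status_counts = logs.map(lambda rec: (rec[3], 1)).reduceByKey(lambda a, b: a + b)
print(status_counts.collect())
```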


Building a word count application with Spark

The volume of unstructured text in existence is growing dramatically, and Spark is an excellent tool for analyzing this type of data. Here, I am calculating the most common words in the... more
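A minimal sketch of the classic word count pattern in PySpark; the input path is a placeholder.

```python
# Minimal sketch of word count in PySpark; the input path is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

counts = (spark.sparkContext.textFile("shakespeare.txt")     # placeholder path
          .flatMap(lambda line: line.lower().split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))

# Fifteen most frequent words
for word, count in counts.takeOrdered(15, key=lambda pair: -pair[1]):
    print(word, count)
```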


Text Analysis and Entity Resolution with Spark

Entity resolution is a common, yet difficult problem in data cleaning and integration. This lab will demonstrate how we can use Apache Spark to apply powerful and scalable text analysis techniques and perform entity resolution across two datasets of commercial products... more


Introduction to Spark programming

Here, I am using the Python programming interface to Spark (pySpark) in my introductory Spark program... more


NumPy Python package and Python lambda expressions

This lab covers NumPy, linear algebra with NumPy and Spark, and Python lambda expressions, among other topics... more
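A small sketch of the two pieces covered here: basic NumPy linear algebra and lambda expressions as small anonymous functions.

```python
# Minimal sketch: basic NumPy linear algebra and a Python lambda expression.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

print(np.dot(a, b))        # dot product: 32.0
print(a * b)               # elementwise product
print(np.linalg.norm(a))   # Euclidean norm

# A lambda is an unnamed one-expression function, handy for map/filter
# and for Spark transformations
square = lambda x: x ** 2
print(list(map(square, [1, 2, 3])))
```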


Supervised Machine Learning with R and Python

Here I show how to build various machine learning models in Python and R. The models include linear regression, logistic regression, tree-based models (bagging and random forest) and support vector machines (SVM)... more
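A minimal Python-side sketch in the same spirit, fitting a few scikit-learn models on a built-in toy dataset; the post itself may use different data and settings.

```python
# Minimal sketch: fit logistic regression, a random forest, and an SVM with
# scikit-learn on a built-in toy dataset and compare test accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LogisticRegression(max_iter=5000),
              RandomForestClassifier(n_estimators=200, random_state=0),
              SVC(kernel="rbf", gamma="scale")):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```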


Working with NetCDF data using NumPy and matplotlib


NetCDF (Network Common Data Form) is a self-describing, portable, scalable and appendable machine-independent data format. More information can be found on Wikipedia. This format is commonly used in atmospheric science and oceanography, as it is convenient for storing many variables of many dimensions. Here, let's see how to open NetCDF data in Python and generate a monthly climatology of global... more
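A minimal sketch of opening a NetCDF file with the netCDF4 package and averaging a variable over time; the file name and variable names are placeholders.

```python
# Minimal sketch: open a NetCDF file and average a variable over time.
# The file name and variable names are placeholders.
from netCDF4 import Dataset
import numpy as np

nc = Dataset("air_temperature.nc")        # placeholder file
print(nc.variables.keys())                # inspect the self-describing metadata

# Assume a (time, lat, lon) temperature variable; compute the time-mean field
temp = nc.variables["air"][:]             # placeholder variable name
lat = nc.variables["lat"][:]
lon = nc.variables["lon"][:]
climatology = np.mean(temp, axis=0)
print(climatology.shape, lat.shape, lon.shape)

nc.close()
```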



Numpy and Pandas to work with multidimensional arrays

Here, I show the convenience of using the Python package NumPy together with Pandas and Matplotlib when working with climate data... more



Analysing the Madden-Julian Oscillation using Numpy and Scipy



The Madden-Julian Oscillation (MJO) is the major mode of intra-seasonal variability in the tropics. Since its dynamics still remain unclear, the MJO intrigues many atmospheric physicists and mathematicians... more



ggplot in R and Python

The grammar of graphics package (ggplot2) is arguably the best data visualization library in R. The grammar of graphics concept is also implemented in Python by the ggplot library, which has commands similar to ggplot2's... more
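As a sketch of the grammar-of-graphics style in Python, here is an example using the plotnine package, a ggplot2-like implementation; the post itself uses the older ggplot library, so treat the import names here as a substitution.

```python
# Minimal sketch of grammar-of-graphics plotting in Python with plotnine
# (used here in place of the older ggplot library); the data is made up.
import pandas as pd
from plotnine import ggplot, aes, geom_point, labs

df = pd.DataFrame({"wt": [2.6, 2.9, 3.2, 3.4, 4.1],
                   "mpg": [21.0, 22.8, 21.4, 18.7, 14.3]})

p = (ggplot(df, aes(x="wt", y="mpg"))
     + geom_point()
     + labs(x="Weight", y="Miles per gallon"))
print(p)   # or p.save("scatter.png")
```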


Integrating RDDs and DataFrames in a machine learning pipeline

This lab demonstrates how to integrate RDDs and DataFrames in a supervised machine learning pipeline... more
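A minimal sketch of mixing the two APIs, parsing raw records as an RDD and then fitting a Spark ML pipeline on the resulting DataFrame; the field names and values are made up.

```python
# Minimal sketch: parse raw records as an RDD, convert to a DataFrame, then
# fit a Spark ML pipeline. Field names and values are placeholders.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("rdd-df-pipeline").getOrCreate()

# RDD side: low-level parsing of raw text records
raw = spark.sparkContext.parallelize(["1.0,2.3,0.5", "0.0,1.1,3.2", "1.0,0.7,1.9"])
rows = raw.map(lambda line: [float(x) for x in line.split(",")])

# DataFrame side: give the parsed records a schema and feed a pipeline
df = rows.toDF(["label", "x1", "x2"])
pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["x1", "x2"], outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(df)
model.transform(df).select("label", "prediction").show()
```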


Spark DataFrame API: word count application

Source

Spark DataFrame API: log analytics

Source

Spark DataFrame and RDD: Logistic Regression


Spark DataFrame and RDD: Principal Component Analysis


Power Plant Machine Learning Pipeline Application

This notebook is an end-to-end exercise of performing Extract-Transform-Load and Exploratory Data Analysis on a real-world dataset, and then applying several different machine learning algorithms to solve a supervised regression problem on the dataset... more


Predicting Movie Ratings

One of the most common uses of big data is to predict what users want. This allows Google to show you relevant ads, Amazon to recommend relevant products, and Netflix to recommend movies that you might like. This lab will demonstrate how we can use Apache Spark to recommend movies to a user... more


Word Count Lab: Building a word count application

The volume of unstructured text in existence is growing dramatically, and Spark is an excellent tool for analyzing this type of data. In this lab, we will write code that calculates the most common words in the Complete Works of William Shakespeare retrieved from Project Gutenberg... more


Text Analysis and Entity Resolution

Entity resolution is a common, yet difficult problem in data cleaning and integration. This lab will demonstrate how we can use Apache Spark to apply powerful and scalable text analysis techniques and perform entity resolution across two datasets of commercial products... more