Fisseha Berhane, PhD

Data Scientist

Resume | LinkedIn | GitHub | Twitter


Data Analysis and Machine Learning with Python and Apache Spark



Parallelizing your Python model building process with Pandas UDF in PySpark

Assume you want to train a model per category, you have thousands of categories, and each category has thousands of records. Training the models sequentially could take a very long time. However, if you parallelize your scikit-learn code in PySpark with a Pandas UDF, you can shorten the run time significantly. If this sounds of interest to you, you can watch my video tutorial below to learn how to do it.
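
As a rough sketch of the idea (my own minimal toy example, not the tutorial's code; the column names and schema are invented), Spark's applyInPandas hands each category's rows to an ordinary scikit-learn fit as a Pandas DataFrame:

    import pandas as pd
    from pyspark.sql import SparkSession
    from sklearn.linear_model import LinearRegression

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("a", 1.0, 2.0), ("a", 2.0, 4.1), ("b", 1.0, 3.0), ("b", 2.0, 5.9)],
        ["category", "x", "y"])

    def fit_per_category(pdf: pd.DataFrame) -> pd.DataFrame:
        # Runs once per category on a worker, with plain Pandas data.
        model = LinearRegression().fit(pdf[["x"]], pdf["y"])
        return pd.DataFrame({"category": [pdf["category"].iloc[0]],
                             "slope": [float(model.coef_[0])]})

    results = (df.groupBy("category")
                 .applyInPandas(fit_per_category,
                                schema="category string, slope double"))
    results.show()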



ROC Curve could be misleading with imbalanced data: Precision-Recall Curve is more informative

Although the ROC curve and the area under the ROC curve are commonly used to evaluate model performance with both balanced and imbalanced datasets, as shown in this blog post, if your data is imbalanced, the Precision-Recall curve and the area under that curve are more informative than the ROC curve and the area under it. In fact, the ROC curve can be misleading for binary classification problems with imbalanced data... more.
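
To make the contrast concrete, here is a small self-contained sketch (a toy setup of my own, not the post's data) comparing the two areas on a dataset with roughly 1% positives:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, average_precision_score
    from sklearn.model_selection import train_test_split

    # Toy dataset with roughly 1% positives.
    X, y = make_classification(n_samples=10000, weights=[0.99, 0.01],
                               random_state=42)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
    probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

    print("ROC AUC:          ", roc_auc_score(y_te, probs))            # often looks great
    print("Average precision:", average_precision_score(y_te, probs))  # more sobering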



Data distributions where K-means clustering fails; can DBSCAN be a solution?

For K-means clustering to work well, the distribution of each cluster should be approximately spherical, all variables should have similar variance, and each cluster should have a roughly equal number of observations. Can DBSCAN be a solution for datasets that do not have the properties mentioned above?... more.
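
A minimal illustration of the failure mode, using scikit-learn's two-moons toy data (the eps value below is an illustrative choice, not a tuned one):

    from sklearn.datasets import make_moons
    from sklearn.cluster import KMeans, DBSCAN

    X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

    # K-means cuts straight through the two non-spherical moons...
    km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    # ...while density-based DBSCAN recovers each moon as its own cluster.
    db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)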



Sampling using truncated hash

Let's suppose tens of millions of people visit your website every day and you want to do ad hoc analysis. You cannot use all the data, since that is computationally costly and jobs will take a long time. A better approach is to work with a representative sample. But how can you sample the data so that a visitor included in today's sample is also included in future samples?... more.
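
One simple way to get that property (a sketch of the idea, with a made-up ID scheme and sampling rate): hash a stable visitor ID and keep the visitor whenever the truncated hash falls below the sampling rate, so membership is deterministic from day to day:

    import hashlib

    def in_sample(visitor_id: str, rate: float = 0.01) -> bool:
        # Hash the stable visitor ID and keep the first 6 hex digits;
        # the same visitor always lands in the same bucket.
        digest = hashlib.md5(visitor_id.encode("utf-8")).hexdigest()
        return int(digest[:6], 16) / 16**6 < rate

    visitors = [f"user_{i}" for i in range(100000)]
    sample = [v for v in visitors if in_sample(v)]
    print(len(sample))  # roughly 1%, and stable across days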



Hive Partitioning with Spark

In this post, I will show how to perform Hive partitioning in Spark and talk about its benefits, including performance... more.
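
A minimal sketch of the mechanics in PySpark (the table and columns are invented for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    logs = spark.createDataFrame(
        [(2017, 5, "GET /"), (2017, 6, "GET /about")],
        ["year", "month", "request"])

    # Each (year, month) pair becomes its own directory under the table path.
    (logs.write.mode("overwrite")
         .partitionBy("year", "month")
         .format("parquet")
         .saveAsTable("logs_partitioned"))

    # Filters on partition columns let Spark skip whole partitions.
    spark.sql("SELECT COUNT(*) FROM logs_partitioned "
              "WHERE year = 2017 AND month = 5").show()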



How using scikit-learn in Spark could save the day

We may want to use scikit-learn with Spark when training a model in scikit-learn takes too long, when the machine learning algorithm or optimization technique we want does not exist in Spark but exists in scikit-learn, or when we know scikit-learn but not Spark... more.
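
One common pattern, sketched here under the assumption that the workers have scikit-learn installed: train locally, broadcast the fitted model, and score Spark data with a pandas UDF:

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf
    from sklearn.linear_model import LogisticRegression

    spark = SparkSession.builder.getOrCreate()

    # Train locally in scikit-learn, then broadcast the fitted model.
    model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
    bc_model = spark.sparkContext.broadcast(model)

    @pandas_udf("double")
    def score(x: pd.Series) -> pd.Series:
        # Each executor scores its partitions with the broadcast model.
        return pd.Series(bc_model.value.predict_proba(x.to_frame())[:, 1])

    df = spark.createDataFrame([(0.5,), (2.5,)], ["x"])
    df.withColumn("p", score("x")).show()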



Issues to pay attention to when performing PCA in Spark, Python and R

In this post, I will cover the data preprocessing required, how to implement PCA in R, Python and Spark, and how to translate the results... more.
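
The single biggest preprocessing issue is scaling: PCA is sensitive to the scale of the variables, so standardize first. A minimal sketch in Python:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])  # very different scales

    # Without standardization, the large-scale column dominates every component.
    print(PCA().fit(X).explained_variance_ratio_)
    # After standardization, the variance is shared across components.
    print(PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_)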



Benefits of and Tips on Hortonworks Apache Spark Certification

This is about the hands-on, performance-based certification for Spark on the Hortonworks Data Platform (HDPCD). In this article, I will share the benefits one gets from the certification process, and some tips on how to prepare for it... more.



Machine Learning with Python scikit-learn Vs R Caret - Part 1

We will use the scikit-learn library in Python and the caret package in R. In this part, we will first perform exploratory data analysis (EDA) on a real-world dataset, and then apply non-regularized linear regression to solve a supervised regression problem on the dataset... more.

Machine Learning with Text in PySpark - Part 1

We usually work with structured data in our machine learning applications. However, unstructured data can also carry content that is vital for machine learning. In this blog post, we will see how to use PySpark to build machine learning models with unstructured text data... more.
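
A minimal sketch of what such a pipeline can look like in PySpark (toy documents; the stage choices here are illustrative, not necessarily the post's):

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF, IDF

    spark = SparkSession.builder.getOrCreate()
    docs = spark.createDataFrame(
        [(0, "spark makes big data simple"),
         (1, "pandas is great for small data")],
        ["id", "text"])

    # Tokenize, hash terms into a fixed-size vector, then weight by IDF.
    pipe = Pipeline(stages=[
        Tokenizer(inputCol="text", outputCol="words"),
        HashingTF(inputCol="words", outputCol="tf"),
        IDF(inputCol="tf", outputCol="features")])
    features = pipe.fit(docs).transform(docs)
    features.select("id", "features").show(truncate=False)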



Extreme Gradient Boosting (XGBoost) with R and Python

Extreme Gradient Boosting is among the hottest libraries in supervised machine learning these days. It shines when we have lots of training data and the features are numeric or a mixture of numeric and categorical fields. In this post, we will see how to use it in R and Python... more.
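
For a quick taste, here is a short Python sketch (the hyperparameters are arbitrary illustrative values, not the post's tuned ones):

    import xgboost as xgb
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, random_state=7)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=7)

    model = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
    model.fit(X_tr, y_tr)
    print("test accuracy:", model.score(X_te, y_te))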



Leveraging Hive with Spark using Python

In this blog post, we will see how to use Spark with Hive: how to create and use Hive databases and tables, how to load and insert data into Hive tables, how to query data from Hive tables, and how to save DataFrames to any Hadoop-supported file system... more.
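
A hedged sketch of the basic workflow (the database and table names are invented):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")
    spark.sql("USE demo_db")

    people = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
    people.write.mode("overwrite").saveAsTable("people")  # Hive-managed table

    spark.sql("SELECT name FROM people WHERE id = 2").show()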



Spark RDDs Vs DataFrames vs SparkSQL - Part 5: Using Functions

This is the fifth tutorial in the Spark RDDs Vs DataFrames vs SparkSQL blog post series. In this part, we will see how to use functions (scalar, aggregate and window functions) in Spark the RDD way, the DataFrame way and the SparkSQL way... more.
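
For example, here is the difference between an aggregate and a window function, sketched the DataFrame way on toy data:

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()
    sales = spark.createDataFrame(
        [("east", 10), ("east", 30), ("west", 20)], ["region", "amount"])

    # Aggregate function: one row per group.
    sales.groupBy("region").agg(F.sum("amount").alias("total")).show()

    # Window function: the group total attached to every row.
    w = Window.partitionBy("region")
    sales.withColumn("region_total", F.sum("amount").over(w)).show()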



Spark RDDs Vs DataFrames vs SparkSQL - Part 4: Set Operators

This is the fourth tutorial in the Spark RDDs Vs DataFrames vs SparkSQL blog post series. In this part, we will see set operators in Spark the RDD way, the DataFrame way and the SparkSQL way... more.
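
A quick sketch of the three set operators the DataFrame way (toy data):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    a = spark.createDataFrame([(1,), (2,), (3,)], ["id"])
    b = spark.createDataFrame([(2,), (3,), (4,)], ["id"])

    a.union(b).distinct().show()  # UNION (union() alone keeps duplicates)
    a.intersect(b).show()         # INTERSECT
    a.subtract(b).show()          # EXCEPT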



Spark RDDs Vs DataFrames vs SparkSQL - Part 3: Web Server Log Analysis

This is the third tutorial in the Spark RDDs Vs DataFrames vs SparkSQL blog post series. In the first part, we saw how to retrieve, sort and filter data. In the second part, we saw how to work with multiple tables. In this tutorial, we will see how to analyze web server logs... more.



Spark RDDs Vs DataFrames vs SparkSQL - Part 2 : Working With Multiple Tables

This is the second tutorial in the Spark RDDs Vs DataFrames vs SparkSQL blog post series. In the first part, we saw how to retrieve, sort and filter data using Spark RDDs, DataFrames and SparkSQL. In this tutorial, we will see how to work with multiple datasets in Spark the RDD way, the DataFrame way and with SparkSQL... more.



Spark RDDs Vs DataFrames vs SparkSQL - Part 1: Retrieving, Sorting and Filtering

For the next couple of weeks, I will write a blog post series on how to perform the same tasks using Spark Resilient Distributed Datasets (RDDs), DataFrames and Spark SQL. If you want to learn/master Spark with Python, or if you are preparing for a Spark certification to show your skills in big data, these articles are for you... more.



Analyzing the Bible and the Quran using Spark

Most of the data out there is unstructured, and Spark is an excellent tool for analyzing this type of data. Here, we will analyze the Bible and the Quran. We will see the distribution of words, the most common words in both scriptures, and the average word frequency. This approach could also be scaled up to find the most common words and the distribution of all words on the Internet... more.



Spark DataFrames: Exploring Chicago Crimes

This is the second blog post in the Spark tutorial series to help people prepare for the Hortonworks Apache Spark Certification. The first one is here. If you want to learn/master Spark with Python, or if you are preparing for a Spark certification to show your skills in big data, these articles are for you... more.



Merging DataFrames with Pandas

Python's popularity across the current trending technologies in IT keeps increasing year after year. Therefore, mastering Python opens more options in the marketplace. Python is also one of the most popular data science tools, and one of the reasons for that popularity is the Pandas package. In this blog post, we will see how to use Pandas to merge lots of data files... more.
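
The core operation in miniature (a toy example of my own):

    import pandas as pd

    left = pd.DataFrame({"id": [1, 2, 3], "city": ["NYC", "LA", "DC"]})
    right = pd.DataFrame({"id": [2, 3, 4], "sales": [10, 20, 30]})

    # Inner join keeps only ids present in both frames;
    # how="outer" would keep everything, filling gaps with NaN.
    print(left.merge(right, on="id", how="inner"))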



Using Amazon Relational Database Service with Python and R

Amazon Relational Database Service (RDS) is a distributed relational database service by Amazon Web Services (AWS). It simplifies the setup, operation, and scaling of a relational database for use in applications. In this blog post, we will see how to use R and Python with Amazon RDS. AWS RDS has a free tier for anybody to use for testing/development efforts... more.



Using MongoDB with R and Python

This post shows how to use Python and R to analyze data stored in MongoDB, the top NoSQL database engine in use today. When dealing with large volumes of data, using MongoDB can give us a performance boost... more.
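
A minimal PyMongo sketch (the post also covers R; the connection string, database and collection names below are placeholders assuming a local MongoDB instance):

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")  # assumed local instance
    visits = client["demo_db"]["visits"]

    visits.insert_one({"user": "alice", "pages": 5})
    for doc in visits.find({"pages": {"$gte": 3}}):
        print(doc)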



Data Manipulation with Python Pandas and R Data.Table

Pandas is a commonly used data manipulation library in Python. data.table, on the other hand, is among the best data manipulation packages in R. data.table is succinct, and we can do a lot with it in just a single line. Further, data.table is generally faster than Pandas (see benchmark here), and it may be a go-to package when performance is a constraint. I thought it could help to show code that performs the same tasks in both packages side by side, so that someone who knows one of them can quickly study the other. If you know either of them and want to learn the other, this blog post is for you... more.



Big Data Analytics with Spark - part 2

Most of the time, data analysis involves more than one table of data. Therefore, it is important to know the techniques that enable us to combine data from various tables. In this blog post, let's see how we can work with joins in Spark... more
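
A tiny sketch of a DataFrame join (invented tables):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    emp = spark.createDataFrame(
        [(1, "alice", 10), (2, "bob", 20)], ["emp_id", "name", "dept_id"])
    dept = spark.createDataFrame([(10, "data science")], ["dept_id", "dept"])

    # Inner join keeps matches only; how="left" would keep bob with a null dept.
    emp.join(dept, on="dept_id", how="inner").show()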


Big Data Analytics with Spark - part 1

In this big data era, Spark, a fast and general engine for large-scale data processing, is the hottest big data tool. Spark is a cluster computing framework used for scalable and efficient analysis of big data. Other data analysis tools such as R and Pandas run on a single machine, but with Spark we can use many machines, which divide the tasks among themselves and perform fault-tolerant computations by distributing the data over a cluster... more


Introduction to Machine Learning with Apache Spark

One of the most common uses of big data is to predict what users want. This allows Google to show you relevant ads, Amazon to recommend relevant products, and Netflix to recommend movies that you might like. This lab will demonstrate how we can use Apache Spark to recommend movies to a user. We will start with some basic techniques, and then use the Spark MLlib library's Alternating Least Squares method to make more sophisticated predictions... more
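
A hedged sketch of Alternating Least Squares on toy ratings (the hyperparameters are illustrative, and this uses the DataFrame-based pyspark.ml API rather than the RDD-based MLlib API the lab may use):

    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS

    spark = SparkSession.builder.getOrCreate()
    ratings = spark.createDataFrame(
        [(0, 10, 4.0), (0, 11, 1.0), (1, 10, 5.0), (1, 12, 2.0)],
        ["userId", "movieId", "rating"])

    als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
              rank=5, maxIter=5, regParam=0.1)
    model = als.fit(ratings)
    model.recommendForAllUsers(2).show(truncate=False)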


Linear Regression with Apache Spark

This lab covers a common supervised learning pipeline, using a subset of the Million Song Dataset from the UCI Machine Learning Repository. Our goal is to train a linear regression model to predict the release year of a song given a set of audio features... more


Click-Through Rate Prediction with Apache Spark

This lab covers the steps for creating a click-through rate (CTR) prediction pipeline using the Criteo Labs dataset that was used for a recent Kaggle competition... more


Principal Component Analysis of Neuroscience Data with Apache Spark

This lab delves into exploratory analysis of neuroscience data, specifically using principal component analysis (PCA) and feature-based aggregation. We will use a dataset of light-sheet imaging recorded by the Ahrens Lab at Janelia Research Campus, and hosted on the CodeNeuro data repository... more


Web Server Log Analysis with Apache Spark

Server log analysis is an ideal use case for Spark. It's a very large, common data source and contains a rich set of information. Spark allows you to store your logs in files on disk cheaply, while still providing a quick and simple way to perform data analysis on them... more


Building a word count application with Spark

The volume of unstructured text in existence is growing dramatically, and Spark is an excellent tool for analyzing this type of data. Here, I am calculating the most common words in the... more
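
The core of any Spark word count looks roughly like this (a minimal RDD sketch on an in-memory line of text, not the application's full code):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    lines = spark.sparkContext.parallelize(["to be or not to be"])

    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    print(counts.collect())  # pairs like ('to', 2), ('be', 2), ('or', 1), ...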


Text Analysis and Entity Resolution with Spark

Entity resolution is a common, yet difficult problem in data cleaning and integration. This lab will demonstrate how we can use Apache Spark to apply powerful and scalable text analysis techniques and perform entity resolution across two datasets of commercial products... more


Introduction to Spark programming

Here, I am using the Python programming interface to Spark (PySpark) in my introductory Spark program... more


NumPy Python package and Python lambda expressions

This lab covers NumPy, linear algebra with NumPy and Spark, and Python lambda expressions, among other topics... more


Supervised Machine Learning with R and Python

Here I show how to build various machine learning models in Python and R. The models include linear regression, logistic regression, tree-based models (bagging and random forest) and support vector machines (SVM)... more


Working with NetCDF data using NumPy and matplotlib


NetCDF (Network Common Data Form) is a self-describing, portable, scalable and appendable machine-independent data format. More information can be found on Wikipedia. This format is commonly used in atmospheric science and oceanography, as it is convenient for storing various variables of many dimensions. Here, let's see how to open NetCDF data in Python and generate a monthly climatology of global... more
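
A minimal sketch of the idea (the file and variable names are placeholders; this assumes the netCDF4 Python package):

    import numpy as np
    from netCDF4 import Dataset

    nc = Dataset("precip.mon.mean.nc")   # hypothetical monthly file
    precip = nc.variables["precip"][:]   # shape: (time, lat, lon)

    # Monthly climatology: average all Januaries, all Februaries, and so on.
    climatology = np.array([precip[m::12].mean(axis=0) for m in range(12)])
    print(climatology.shape)             # (12, lat, lon)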



Numpy and Pandas to work with multidimensional arrays

Here, I show the convenience of using the Python package NumPy together with Pandas and Matplotlib when working with climate data... more



Analysing the Madden-Julian Oscillation using Numpy and Scipy



The Madden-Julian Oscillation (MJO) is the major mode of intra-seasonal variability in the tropics. Since its dynamics remain unclear, the MJO is intriguing to many atmospheric physicists and mathematicians... more



ggplot in R and Python

The grammar of graphics package (ggplot2) is the best data visualization library in R. The concept of the grammar of graphics is also implemented in Python in the ggplot library, which has commands similar to ggplot2's... more
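
As a small taste of the grammar in Python, here is a sketch using plotnine, a current Python implementation of the same grammar (the post itself uses the older ggplot library):

    import pandas as pd
    from plotnine import ggplot, aes, geom_point

    df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2.1, 3.9, 6.2, 8.1]})
    # Build the plot by adding layers, just as in R's ggplot2.
    print(ggplot(df, aes("x", "y")) + geom_point())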


Integrating RDDs and DataFrames in a machine learning pipeline

This lab demonstrates how to integrate RDDs and DataFrames in a supervised machine learning pipeline... more


Spark DataFrame API: word count application

Source

Spark DataFrame API: log analytics

Source

Spark DataFrame and RDD: Logistic Regression


Spark DataFrame and RDD: Principal Component Analysis


Power Plant Machine Learning Pipeline Application

This notebook is an end-to-end exercise of performing Extract-Transform-Load and Exploratory Data Analysis on a real-world dataset, and then applying several different machine learning algorithms to solve a supervised regression problem on the dataset... more


Predicting Movie Ratings

One of the most common uses of big data is to predict what users want. This allows Google to show you relevant ads, Amazon to recommend relevant products, and Netflix to recommend movies that you might like. This lab will demonstrate how we can use Apache Spark to recommend movies to a user... more


Word Count Lab: Building a word count application

The volume of unstructured text in existence is growing dramatically, and Spark is an excellent tool for analyzing this type of data. In this lab, we will write code that calculates the most common words in the Complete Works of William Shakespeare retrieved from Project Gutenberg... more

