Data Analysis and Machine Learning with Python and Apache Spark
Parallelizing your Python model building process with Pandas UDF in PySpark
Assume you want to train a separate model per category, you have thousands of categories, and each category has thousands of records. If you train the models sequentially, it could take a very long time. However, if you parallelize your scikit-learn code in PySpark with a Pandas UDF, you can shorten the run time significantly.
If this sounds interesting, you can watch my video tutorial below to learn how to do it.
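As a taste of the approach, here is a minimal sketch that trains one scikit-learn model per category with a grouped Pandas UDF (using the applyInPandas API of recent Spark versions; the DataFrame df and the column names category, x1, x2 and y are hypothetical):

```python
# A minimal sketch: train one scikit-learn model per category in parallel
# with a grouped Pandas UDF. Column names (category, x1, x2, y) are
# hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def train_per_category(pdf: pd.DataFrame) -> pd.DataFrame:
    # each call receives all rows of one category as a pandas DataFrame
    model = LinearRegression().fit(pdf[["x1", "x2"]], pdf["y"])
    return pd.DataFrame({"category": [pdf["category"].iloc[0]],
                         "r2": [model.score(pdf[["x1", "x2"]], pdf["y"])]})

# df is a Spark DataFrame with columns: category, x1, x2, y
results = (df.groupBy("category")
             .applyInPandas(train_per_category,
                            schema="category string, r2 double"))
```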
ROC Curve could be misleading with imbalanced data: Precision-Recall Curve is more informative
Although the ROC curve and the area under it are commonly used to evaluate model performance with both balanced and imbalanced datasets, as shown in this blog post, if your data is imbalanced, the Precision-Recall curve and the area under it are more informative than the ROC curve and its area. In fact, the ROC curve can be misleading for binary classification problems with imbalanced data... more.
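As a quick illustration of the gap between the two metrics, here is a minimal sketch on a synthetic imbalanced dataset (all parameter choices are illustrative):

```python
# A minimal sketch comparing ROC AUC with average precision (the area
# under the Precision-Recall curve) on an imbalanced toy dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05],
                           random_state=0)          # ~5% positives
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print("ROC AUC:          ", roc_auc_score(y_te, probs))           # often looks optimistic
print("Average precision:", average_precision_score(y_te, probs)) # usually much lower
```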
Data distributions where K-means clustering fails; can DBSCAN be a solution?
For K-means clustering to work well, the distribution of each attribute (variable) should be approximately spherical, all variables should have similar variances, and each cluster should have a roughly equal number of observations. Can DBSCAN be a solution for datasets that do not have these properties?... more.
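For a quick feel of the difference, here is a minimal sketch on the classic two-moons data, where K-means splits the moons incorrectly while DBSCAN recovers them from density alone (the eps value is an illustrative guess):

```python
# A minimal sketch: K-means fails on non-spherical clusters (two moons),
# while DBSCAN recovers them from density alone.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)  # eps tuned by eye
```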
Sampling using truncated hash
Let's suppose tens of millions of people visit your website every day and you want to do ad hoc analysis. You cannot use all the data, since that is computationally costly and the jobs would take a long time; a better approach is to work with a representative sample. But how can you sample from the data so that a visitor included in today's sample is also included in future samples?... more.
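The core trick is to hash a stable visitor identifier, truncate the hash, and keep the visitor whenever the truncated value falls below a threshold. A minimal sketch, with hypothetical visitor ids:

```python
# A minimal sketch of deterministic sampling with a truncated hash:
# hash the visitor id, keep it if the truncated hash falls below a
# threshold. The same visitor is then kept in every day's sample.
import hashlib

def in_sample(visitor_id: str, rate: float = 0.01) -> bool:
    # take the first 8 hex digits of the MD5 hash -> a number in [0, 16**8)
    bucket = int(hashlib.md5(visitor_id.encode()).hexdigest()[:8], 16)
    return bucket < rate * 16**8

visits = ["user_1", "user_2", "user_3"]          # hypothetical visitor ids
sample = [v for v in visits if in_sample(v, rate=0.10)]
```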
Hive Partitioning with Spark
In this post, I will show how to perform Hive partitioning in Spark and discuss its benefits, including performance gains... more.
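As a minimal sketch of the idea (the DataFrame df, the table name and the partition columns are hypothetical):

```python
# A minimal sketch of writing a partitioned Hive table with Spark;
# df and spark are assumed to exist already.
(df.write
   .partitionBy("year", "month")          # one directory per partition value
   .mode("overwrite")
   .saveAsTable("logs_partitioned"))

# Queries that filter on the partition columns read only the matching
# directories (partition pruning):
spark.sql("SELECT * FROM logs_partitioned WHERE year = 2018 AND month = 1")
```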

How using scikit-learn in Spark could save the day
We may want to use scikit-learn with Spark when training a model in scikit-learn takes too long, when the machine learning algorithm or optimization technique we want does not exist in Spark but exists in scikit-learn, or when we know scikit-learn but not Spark... more.
Issues to pay attention to when performing PCA in Spark, Python and R
In this post, I will cover the data preprocessing required, how to implement PCA in R, Python and Spark, and how to translate the results... more.
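One such issue: scikit-learn's PCA centers the data but does not scale it, so variables measured on larger scales dominate the components unless you standardize first. A minimal sketch in Python:

```python
# A minimal sketch of one preprocessing issue: scikit-learn's PCA centers
# the data but does not scale it, so standardize before fitting when the
# variables are on different scales.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data
X_std = StandardScaler().fit_transform(X)      # zero mean, unit variance

pca = PCA(n_components=2).fit(X_std)
print(pca.explained_variance_ratio_)           # share of variance per component
```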

Benefits of and Tips on Hortonworks Apache Spark Certification
This is about the hands-on, performance-based certification for Spark on the Hortonworks Data Platform (HDPCD). In this article, I will share the benefits one gets from the certification process, and some tips on how to prepare for it... more.

Machine Learning with Python scikit-learn Vs R Caret - Part 1
We will use the Scikit-learn library in Python and the Caret package in R. In this part, we will first perform Exploratory Data Analysis (EDA) on a real-world dataset, and then apply non-regularized linear regression to solve a supervised regression problem on the dataset... more.

Machine Learning with Text in PySpark - Part 1
We usually work with structured data in our machine learning applications, but unstructured data can also contain content that is vital for machine learning. In this blog post, we will see how to use PySpark to build machine learning models with unstructured text data... more.
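A minimal sketch of such a pipeline with the pyspark.ml API, assuming a Spark DataFrame train_df with hypothetical text and label columns:

```python
# A minimal sketch of a PySpark text pipeline: tokenize, hash term
# frequencies, weight with IDF, then fit logistic regression.
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="tf")
idf = IDF(inputCol="tf", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

# train_df is a Spark DataFrame with hypothetical columns: text, label
model = Pipeline(stages=[tokenizer, tf, idf, lr]).fit(train_df)
```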
Extreme Gradient Boosting (XGBoost) with R and Python
Extreme Gradient Boosting is among the hottest libraries in supervised machine learning these days. It shines when we have lots of training data whose features are numeric or a mixture of numeric and categorical fields. In this post, we will see how to use it in R and Python... more.
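A minimal sketch with the xgboost package's scikit-learn-style API in Python (hyperparameter values are illustrative):

```python
# A minimal sketch of XGBoost on a built-in scikit-learn dataset.
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_tr, y_tr)
print("accuracy:", model.score(X_te, y_te))
```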

Leveraging Hive with Spark using Python
In this blog post, we will see how to use Spark with Hive: how to create and use Hive databases and tables, how to load and insert data into Hive tables, how to query data from Hive tables, and how to save DataFrames to any Hadoop-supported file system... more.
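A minimal sketch of these operations, with hypothetical database, table and path names:

```python
# A minimal sketch of the Hive operations covered in the post; the
# DataFrame df and all names below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS demo")
spark.sql("USE demo")
df.write.mode("overwrite").saveAsTable("events")        # save a DataFrame as a Hive table
counts = spark.sql("SELECT COUNT(*) AS n FROM events")  # query it back
counts.write.mode("overwrite").parquet("/tmp/event_counts")  # save to a Hadoop-supported FS
```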

Spark RDDs Vs DataFrames vs SparkSQL - Part 5: Using Functions
This is the fifth tutorial on the Spark RDDs Vs DataFrames vs SparkSQL blog post series. In this part, we will see how to use functions (scalar, aggregate and window functions) in Spark the RDD way, the DataFrame way and the SparkSQL way ... more.
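As a small taste, here is a minimal sketch of a window function the DataFrame way, with hypothetical column names, and its SparkSQL equivalent in a comment:

```python
# A minimal sketch of a window function: rank rows within each group.
# The DataFrame df and the columns dept and salary are hypothetical.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("dept").orderBy(F.desc("salary"))
ranked = df.withColumn("rank", F.rank().over(w))

# The same thing the SparkSQL way:
# SELECT *, RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS rank
# FROM employees
```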
Spark RDDs Vs DataFrames vs SparkSQL - Part 4: Set Operators
This is the fourth tutorial on the Spark RDDs Vs DataFrames vs SparkSQL blog post series. In this part, we will see set operators in Spark the RDD way, the DataFrame way and the SparkSQL way ... more.
Spark RDDs Vs DataFrames vs SparkSQL - Part 3: Web Server Log Analysis
This is the third tutorial on the Spark RDDs Vs DataFrames vs SparkSQL blog post series. In the first part, we saw how to retrieve, sort and filter data. In the second part, we saw how to work with multiple tables. In this tutorial, we will see how to analyze web server logs... more.
Spark RDDs Vs DataFrames vs SparkSQL - Part 2 : Working With Multiple Tables
This is the second tutorial on the Spark RDDs Vs DataFrames vs SparkSQL blog post series. In the first part, we saw how to retrieve, sort and filter data using Spark RDDs, DataFrames and SparkSQL. In this tutorial, we will see how to work with multiple datasets in Spark the RDD way, the DataFrame way and with SparkSQL ... more.
Spark RDDs vs DataFrames vs SparkSQL - Part 1: Retrieving, Sorting and Filtering
For the next couple of weeks, I will write a blog post series on how to perform the same tasks using Spark Resilient Distributed Dataset (RDD), DataFrames and Spark SQL. If you want to learn/master Spark with Python or if you are preparing for a Spark Certification to show your skills in big data, these articles are for you... more.
Analyzing the Bible and the Quran using Spark
Most of the data out there is unstructured, and Spark is an excellent tool for analyzing this type of data. Here, we will analyze the Bible and the Quran. We will see the distribution of words, the most common words in both scriptures, and average word frequencies. This approach could also be scaled to find the most common words and the distribution of all words on the Internet... more.
Spark DataFrames: Exploring Chicago Crimes
This is the second blog post on the Spark tutorial series to help people prepare for the Hortonworks Apache Spark Certification. The first one is here. If you want to learn/master Spark with Python or if you are preparing for a Spark Certification to show your skills in big data, these articles are for you. ... more.

Merging DataFrames with Pandas
Python's popularity across the current trending technologies in IT is increasing year over year, so mastering Python opens more options in the marketplace. Python is also one of the most popular data science tools, and one of the reasons for this is the Pandas package. In this blog post, we will see how to use Pandas to merge lots of data files... more.
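A minimal sketch of the two basic patterns, stacking files with concat and joining on a key with merge, using hypothetical file and column names:

```python
# A minimal sketch: read many files and combine them, either by stacking
# rows (concat) or by joining on a key (merge).
import glob
import pandas as pd

frames = [pd.read_csv(f) for f in glob.glob("data/sales_*.csv")]
sales = pd.concat(frames, ignore_index=True)             # stack rows

stores = pd.read_csv("data/stores.csv")
merged = sales.merge(stores, on="store_id", how="left")  # SQL-style join
```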

Using Amazon Relational Database Service with Python and R
Amazon Relational Database Service (RDS) is a distributed relational database service by Amazon Web Services (AWS). It simplifies the setup, operation, and scaling of a relational database for use in applications. In this blog post, we will see how to use R and Python with Amazon RDS. AWS RDS has a free tier for anybody to use for testing/development efforts... more.

Using MongoDB with R and Python
This post shows how to use Python and R to analyse data stored in MongoDB, the top NoSQL database engine in use today. When dealing with large volumes of data, using MongoDB can give us a performance boost... more.
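A minimal sketch with the pymongo driver, using hypothetical database, collection and field names:

```python
# A minimal sketch: query MongoDB with pymongo and load the result
# into pandas for analysis. All names below are hypothetical.
from pymongo import MongoClient
import pandas as pd

client = MongoClient("mongodb://localhost:27017/")
collection = client["mydb"]["visits"]

cursor = collection.find({"country": "US"}, {"_id": 0})  # filter and drop _id
df = pd.DataFrame(list(cursor))                          # analyse with pandas
```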

Data Manipulation with Python Pandas and R Data.Table
Pandas is a commonly used data manipulation library in Python. Data.Table, on the other hand, is among the best data manipulation packages in R. Data.Table is succinct: we can do a lot with it in just a single line. Further, data.table is generally faster than Pandas (see benchmark here) and may be the go-to package when performance is a constraint. For someone who knows one of these packages, seeing code that performs the same tasks in both packages, side by side, is a quick way to learn the other. If you know either of them and want to learn the other, this blog post is for you... more.

Big Data Analytics with Spark - part 2
Most of the time, data analysis involves more than one table of data. Therefore, it is important to know techniques that enable us to combine data from various tables. In this blog post, let's see how we can work with joins in Spark... more
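A minimal sketch of joins in Spark, assuming two hypothetical DataFrames orders and customers sharing a customer_id column:

```python
# A minimal sketch of joins in Spark; orders and customers are assumed
# to be existing Spark DataFrames with a shared customer_id column.
inner = orders.join(customers, on="customer_id", how="inner")
left = orders.join(customers, on="customer_id", how="left")  # keep unmatched orders

# The SparkSQL equivalent of the inner join:
# SELECT * FROM orders o JOIN customers c ON o.customer_id = c.customer_id
```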
Big Data Analytics with Spark - part 1
In this big data era, Spark, a fast and general engine for large-scale data processing, is the hottest big data tool. Spark is a cluster computing framework used for scalable and efficient analysis of big data. Other data analysis tools such as R and Pandas run on a single machine, but with Spark we can use many machines, which divide the tasks among themselves and perform fault-tolerant computations by distributing the data over a cluster... more
Introduction to Machine Learning with Apache Spark
One of the most common uses of big data is to predict what users want. This allows Google to show you relevant ads, Amazon to recommend relevant products, and Netflix to recommend movies that you might like. This lab will demonstrate how we can use Apache Spark to recommend movies to a user. We will start with some basic techniques, and then use the Spark MLlib library's Alternating Least Squares method to make more sophisticated predictions... more
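A minimal sketch of the Alternating Least Squares step with the pyspark.ml API, assuming a ratings DataFrame with userId, movieId and rating columns (hyperparameters are illustrative):

```python
# A minimal sketch of ALS collaborative filtering; the ratings DataFrame
# (userId, movieId, rating) is assumed to exist.
from pyspark.ml.recommendation import ALS

als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          rank=8, regParam=0.1, coldStartStrategy="drop")
model = als.fit(ratings)
top_movies = model.recommendForAllUsers(10)   # top-10 recommendations per user
```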
Linear Regression with Apache Spark
This lab covers a common supervised learning pipeline, using a subset of the Million Song Dataset from the UCI Machine Learning Repository. Our goal is to train a linear regression model to predict the release year of a song given a set of audio features... more

Click-Through Rate Prediction with Apache Spark
This lab covers the steps for creating a click-through rate (CTR) prediction pipeline using the Criteo Labs dataset that was used for a recent Kaggle competition... more

Principal Component Analysis of Neuroscience Data with Apache Spark
This lab delves into exploratory analysis of neuroscience data, specifically using principal component analysis (PCA) and feature-based aggregation. We will use a dataset of light-sheet imaging recorded by the Ahrens Lab at Janelia Research Campus, and hosted on the CodeNeuro data repository... more

Web Server Log Analysis with Apache Spark
Server log analysis is an ideal use case for Spark. It's a very large, common data source and contains a rich set of information. Spark allows you to store your logs in files on disk cheaply, while still providing a quick and simple way to perform data analysis on them... more
Building a word count application with Spark
The volume of unstructured text in existence is growing dramatically, and Spark is an excellent tool for analyzing this type of data. Here, I am calculating the most common words in the... more
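The classic word count, the RDD way, as a minimal sketch (the file path is hypothetical and the SparkContext sc is assumed to exist):

```python
# A minimal sketch of word count with Spark RDDs; "corpus.txt" is a
# hypothetical input file and sc is an existing SparkContext.
counts = (sc.textFile("corpus.txt")
            .flatMap(lambda line: line.lower().split())   # one record per word
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b)
            .takeOrdered(15, key=lambda kv: -kv[1]))      # top 15 most common words
```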
Text Analysis and Entity Resolution with Spark
Entity resolution is a common, yet difficult problem in data cleaning and integration. This lab will demonstrate how we can use Apache Spark to apply powerful and scalable text analysis techniques and perform entity resolution across two datasets of commercial products... more
Introduction to Spark programming
Here, I am using the Python programming interface to Spark (pySpark) in my introductory Spark program... more
NumPy Python package and Python lambda expressions
This lab covers NumPy, NumPy and Spark linear algebra and Python lambda expressions, among others... more
Supervised Machine Learning with R and Python
Here I show how to build various machine learning models in Python and R. The models include linear regression, logistic regression, tree-based models (bagging and random forest) and support vector machines (SVM)... more
Working with NetCDF data using NumPy and matplotlib
NetCDF (Network Common Data Form) is a self-describing, portable, scalable and appendable machine-independent data format; more information can be found on Wikipedia. This format is commonly used in atmospheric science and oceanography, as it is convenient for storing variables of many dimensions. Here, let's see how to open NetCDF data in Python and generate monthly climatology of global... more
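A minimal sketch of reading a NetCDF file with the netCDF4 package and computing a monthly climatology, with hypothetical file and variable names:

```python
# A minimal sketch: open a NetCDF file and build a monthly climatology.
# The file name and variable names are hypothetical.
from netCDF4 import Dataset
import numpy as np

nc = Dataset("precip.mon.mean.nc")            # open for reading
precip = nc.variables["precip"][:]            # (time, lat, lon) array
lats = nc.variables["lat"][:]

# monthly climatology: average each calendar month over all years
climatology = np.array([precip[m::12].mean(axis=0) for m in range(12)])
```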
Numpy and Pandas to work with multidimensional arrays
Here, I show the convenience of using the python package numpy together with pandas and matplotlib in working with climate data.... more

Analysing the Madden-Julian Oscillation using Numpy and Scipy
The Madden-Julian Oscillation (MJO) is the major mode of intra-seasonal variability in the tropics. Since its dynamics remain unclear, the MJO is intriguing to many atmospheric physicists and mathematicians... more
ggplot in R and Python
The grammar of graphics package (ggplot2) is the best data visualization library in R. The concept of grammar of graphics is also implemented in Python with the ggplot library, which has commands similar to ggplot2's... more

Integrating RDDs and DataFrames in a machine learning pipeline
This lab demonstrates how to integrate RDDs and DataFrames in a supervised machine learning pipeline... more

Spark DataFrame API: word count application

Spark DataFrame API: log analytics

Spark DataFrame and RDD: Logistic Regression

Spark DataFrame and RDD: Principal Component Analysis

Power Plant Machine Learning Pipeline Application
This notebook is an end-to-end exercise of performing Extract-Transform-Load and Exploratory Data Analysis on a real-world dataset, and then applying several different machine learning algorithms to solve a supervised regression problem on the dataset... more
Predicting Movie Ratings
One of the most common uses of big data is to predict what users want. This allows Google to show you relevant ads, Amazon to recommend relevant products, and Netflix to recommend movies that you might like. This lab will demonstrate how we can use Apache Spark to recommend movies to a user... more
Word Count Lab: Building a word count application
The volume of unstructured text in existence is growing dramatically, and Spark is an excellent tool for analyzing this type of data. In this lab, we will write code that calculates the most common words in the Complete Works of William Shakespeare retrieved from Project Gutenberg... more
Text Analysis and Entity Resolution
Entity resolution is a common, yet difficult problem in data cleaning and integration. This lab will demonstrate how we can use Apache Spark to apply powerful and scalable text analysis techniques and perform entity resolution across two datasets of commercial products... more