Fisseha Berhane, PhD

Data Scientist

443-970-2353 fisseha@jhu.edu CV Resume Linkedin GitHub twitter



Integrating Hadoop and BI tools: Analyzing and Visualizing Big Data in Tableau with Spark

This blog post shows how to integrate Hadoop with business intelligence tools such as Tableau and leverage the capabilities of Spark for querying data. We can query data in Hive using the Tableau connector for Spark SQL and create amazing dashboards in a short time... more


Using SparkR in Rstudio with Hadoop Deployed on AWS EC2 - part 2

In a previous post, we saw how to install R, RStudio Server and R packages on an AWS EC2 Red Hat cluster for use with the Hortonworks Data Platform (HDP 2.4) Hadoop distribution. Now, let’s use SparkR for data munging... more


Analogy between a Data Lake and a Natural Lake

A data lake is analogous to a natural lake. A data lake helps us store massive amounts of data of different types and shapes cheaply. We can store data with a well-defined data model, unstructured data such as social media posts, and binary data such as images and videos. Similar to a natural lake, a data lake stores data of any type, size and shape... more



Data Ingestion to a Hadoop Data Lake with Jupyter

Jupyter is a web-based notebook used for data exploration, visualization, sharing and collaboration. It is an ideal environment for experimenting with different ideas and/or datasets. We can start with vague ideas and, after various experiments in Jupyter, crystallize them into projects. It can also be used for staging data from a data lake for use by BI and other tools. In this blog post, we will see how to use Jupyter to ingest data into the Hadoop Distributed File System (HDFS). Finally, we will explore our data in HDFS using Spark and create a simple visualization... more


Using SparkR in Rstudio with Hadoop Deployed on AWS EC2 - part 1

SparkR provides a distributed data frame implementation that supports operations such as selection, filtering and aggregation, similar to the dplyr R package but on large datasets. SparkR also supports distributed machine learning using MLlib... more





