Fisseha Berhane, PhD

Data Scientist

443-970-2353 fisseha@jhu.edu

Analogy between a Data Lake and a Natural Lake

At Aurotech, my coworkers and I create data lakes for our customers and use data lake capabilities to develop solutions to different business problems. When I mention a data lake to my friends, they ask me what it is. In this post, I will describe, for readers who are not data savvy, what a data lake is, what it is used for and how its infrastructure and platform differ from traditional systems.

More than 95% of the world's data was collected in recent years. Billions of devices collect data of different kinds, sizes and shapes, and as billions of people get connected to the internet, the amount of data generated daily keeps growing rapidly. Massive amounts of data come from social media (e.g., Facebook, Twitter), weather observations (e.g., rainfall radar and other global weather networks), machines and sensors, online shopping websites and so on. Traditional systems break down under the volume, velocity and variety of this data, so big data infrastructures and platforms were created to store and analyze it.

Before speaking about a data lake, let's first talk about a natural lake. A natural lake can have many tributaries (rivers) that flow into it. The water from the different tributaries might differ in quality and quantity. Some rivers could be clean and others could be dirty and muddy. The rivers might also have different slopes. Some could have a steep slope and a rapid flow of water, while others could have a gentle slope where the flow is slow and more or less steady. Some rivers could flow throughout the year while others might flow only in the rainy season. Some rivers may have regulated dams upstream, so their flow could be periodic.



A data lake is analogous to a natural lake. A data lake lets us cheaply store massive amounts of data of different types and shapes. We can store data with a well-defined data model, unstructured data such as social media posts, and binary data such as images and videos. Just as a natural lake takes in water from any river, a data lake stores any type of data.
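For readers who want to see the idea in code, here is a minimal sketch of the "store anything" principle. It assumes a local temporary folder standing in for the lake; the names `ingest`, `read_table` and the zone folders are made up for illustration and are not part of any real data lake product. The point it shows is that data is written unchanged, and a structure is applied only when the data is read.

```python
import csv
import io
import tempfile
from pathlib import Path


def ingest(lake: Path, zone: str, name: str, payload: bytes) -> Path:
    """Store the payload unchanged; no schema is enforced at write time."""
    target = lake / zone / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


def read_table(path: Path) -> list[dict]:
    """Apply a structure only at read time by parsing the raw bytes as CSV."""
    text = path.read_bytes().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


# A temporary folder plays the role of the lake (hypothetical stand-in).
lake = Path(tempfile.mkdtemp(prefix="toy_lake_"))

# Three "tributaries": tabular data, free text and raw binary bytes.
orders = ingest(lake, "structured", "orders.csv", b"order_id,amount\n1,19.99\n2,5.00\n")
post = ingest(lake, "unstructured", "post.txt", "Loving the new phone!".encode("utf-8"))
image = ingest(lake, "binary", "photo.bin", bytes([0x89, 0x50, 0x4E, 0x47]))

# Only the tabular file is given a structure, and only when we read it.
rows = read_table(orders)
total = sum(float(row["amount"]) for row in rows)

# Everything that landed in the lake, whatever its type.
stored = sorted(p.relative_to(lake).as_posix() for p in lake.rglob("*") if p.is_file())
```

A real data lake replaces the local folder with distributed cloud storage, but the habit is the same: land the data first, decide how to interpret it later.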

The other point of analogy is speed. A data lake can capture and store data arriving at different speeds. Some of it could be streaming data, and analyzing it on the fly could be important for extracting valuable business insights instantly. A data lake helps us ingest streaming data and run any type of data processing and complex analytics on it.
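As a small illustration of analyzing data on the fly, the sketch below computes a running average that is updated as each reading arrives, without waiting for the full data set. The `sensor_stream` function and its temperature values are invented for the example; a real lake would read from a streaming service instead. The same function also accepts data that has already been collected, which hints at how one piece of analysis code can serve both fast and slow "rivers".

```python
from typing import Iterable, Iterator


def running_mean(readings: Iterable[float]) -> Iterator[float]:
    """Yield the average of all readings seen so far, one value per reading."""
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count


def sensor_stream() -> Iterator[float]:
    """Hypothetical stream: readings are produced one at a time."""
    for value in (20.0, 22.0, 24.0):
        yield value


# Data that has already been collected and stored.
batch = [20.0, 22.0, 24.0]

# The same analysis runs over the live stream and over the stored batch.
stream_means = list(running_mean(sensor_stream()))
batch_means = list(running_mean(batch))
```

Both calls produce the same averages, because the analysis only ever looks at one reading at a time and does not care how quickly the readings arrive.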

Further, a data lake provides cheap computing power designed for high-performance processing and analytics. Another interesting point is that a data lake can scale up or down based on our needs. Whether we work with terabytes, petabytes or even exabytes of data, the data lake takes care of scalability and the same code works for data of various sizes. Just as we can swim and fish in a natural lake, we can use a data lake to discover hidden insights, associations and patterns, and to make predictions. We can also use it to propose new business hypotheses.

Interested in learning more? Visit Microsoft Azure Data Lake. Amazon also offers webinars on building a data lake on AWS.




