Header image

Understanding Data Storage Solutions: Databases vs. Data Warehouses vs Data Lakes

November 11, 2021 blog-post big-data cloud data-cloud data-storage

With the mass adoption of data-driven practices, companies are squeezing as many insights as they can from the numerous types of data sources that exist today. The increasing reliance on data called for the emergence of different data storage solutions to address the inherent problems of analyzing data at a huge scale. These solutions are called data warehouses and data lakes.

Why would we need data lakes and data warehouses? Haven’t we been storing data for decades using traditional Online Transaction Processing (OLTP) databases? As we will see, OLTP databases, data warehouses, and data lakes are very different. Each of them solves different problems within the organization.

In this article we’ll:

  1. Do a high-level overview of the different storage solutions
  2. Understand the different properties of each solution and how they fulfill different use cases; and finally
  3. Identify the roles data warehouses and data lakes in a modern data architecture and contrast how it differs from a traditional OLTP database
OLTP Databases: Storing Data for Operations

OLTP Databases are the traditional databases that we’re most familiar with. Usually referred to as just “databases”, these systems are optimized for the rapid handling of business transactions. As such, we use these databases to fulfill basic Create-Read-Update-Delete (CRUD) operations and serve as the back-end of our applications.

Properties

Use Cases

Examples

Data Lakes: Data Repository for Unstructured, Semi-structured, and Structured Data

A data lake is a huge storage repository that is usually used to hold raw data in their original forms before being used for analysis. It’s essentially a high-capacity thumb drive that exists in the cloud! Unlike the previous two storage solutions, the value of the data lake lies in its flexibility: it can store any type of data. From images and videos to CSVs and XLSX files -- everything is fair game for the data lake.

Properties

Use Cases

Examples

The Need for a Data Lake

Why would you want this? The need for a data lake is best understood in contrast to the data warehouse.

Storing data directly into warehouses requires rigid decision making on the part of data engineers. If data won’t be used to answer key business decisions immediately, then there’s little reason to store it in the warehouse! Adding these prematurely will incur expenses that have no apparent business value. In addition, storing data in a warehouse requires a data schema that may not yet be available.

But business requirements often change — the business questions being asked now may not be the same business questions asked tomorrow, and not collecting data from the onset might lead to business questions left unanswered simply because the data is not available! This is especially true for ephemeral data.

Having cheaper storage costs and being able to store data in its raw form makes the data lake a good solution for this problem.

Hitting the Nail on the Head

Here’s a visual to summarize the different data storage solutions we discussed:

OLTP Database Data Warehouse Data Lake
Data Structure Structured Structured Unstructured, Semi-structured, Structured
High-level Use Case Processing Business Transactions Data Science/Analysis Data Science/Analysis
Users Business Users/Customers Data Scientists/Data Analysts/Business Analysts Data Scientists
Unique Advantage Optimized for rapidly dealing with transactions one at a time Highly performant for executing analytical queries on large number of rows of data Can store raw, untransformed data at optimal cost

Data Structure

Databases and warehouses can only store structured data. A data lake on the other hand can store any type of data, be it structured, semi-structured, or unstructured.

Use Cases

OLTP Databases are most commonly used for transactional purposes. Both data lakes and data warehouses are purpose-made for data science and data analysis purposes.

Users

The users for databases are both business users and customers that use different operational systems. Examples of these are Enterprise Resource Planning (ERP) software for business users and social media applications for customers.

Unique Advantage

As we have come to show, these three data storage systems are all unique and have their own set of advantages.

First, the advantage of the traditional OLTP database is that it’s optimized for rapidly dealing with transactions one at a time. This makes it the ideal storage solution for operational systems.

Next, the data warehouse is the best for answering key business questions, because it can easily crunch through large amounts of data.

Lastly, the data lake has the capability to store any type of data, making it the perfect staging area for unstructured, semi-structured, and structured data.

While the delineation between OLTP Databases and the other two systems is clear, it’s a bit harder to understand why we need both Data Warehouses and Data Lakes for analysis. To aid in our understanding, let’s borrow from Asha Saxena, chair of Big Data Analytics Consulting Firm Future Technologies, Inc, who provides the following analogy:

"Think of a data warehouse like a shopping mall. It has discrete “shops” within it that store structured data -- bits that are presorted into formats that database software can interact with. In contrast, a data lake is like a disorganized flea market. It has “stalls”, but where one stops and the next one begins is not so clear. Unlike data warehouses, data lakes can contain both structured and unstructured data. Unstructured data, as the name implies, refers to “messy” digital information, such as audio, images, and video."

Thinking Machines Can Help You Migrate to the Cloud

While having the knowledge to distinguish between data storage solutions is useful, knowing how each of them fits in a modern cloud data platform is much more involved. Different organizations have different data needs, and as such, the cloud migration strategies need to be tailor-fit for each.

We have worked with dozens of companies to figure out and solve their different data challenges, and can assist you to have a properly executed cloud migration. Let us handle the data nitty-gritty so you can focus on running your business! Contact us to get started.

MORE STORIES

Building Beautiful Plots with Matplotlib

Our data science lead Stef Sy shows you how to generate colorful line, bar, scatter plots and more using just a few lines of code from the Matplotlib library.

The Business Case for A/B Testing

One of the biggest promises of data science is for business owners and executives to be able to understand the causal relationships and fundamental drivers that underpin their business. Find out how A/B tests can be leveraged to do just that and why you should.

5 Ways You Already Rely On Artificial Intelligence

Sure, Siri is a long way from being Scarlett Johansson. But she’s a real-world example of how artificial intelligence has crept into daily life.