Understanding Data Storage Solutions: Databases vs. Data Warehouses vs Data Lakes

Understanding Data Storage Solutions: Databases vs. Data Warehouses vs Data Lakes

November 11, 2021 blog-post big-data cloud data-cloud data-storage

With the mass adoption of data-driven practices, companies are squeezing as many insights as they can from the numerous types of data sources that exist today. The increasing reliance on data called for the emergence of different data storage solutions to address the inherent problems of analyzing data at a huge scale. These solutions are called data warehouses and data lakes.

Why would we need data lakes and data warehouses? Haven’t we been storing data for decades using traditional Online Transaction Processing (OLTP) databases? As we will see, OLTP databases, data warehouses, and data lakes are very different. Each of them solves different problems within the organization.

In this article we’ll:

Do a high-level overview of the different storage solutions
Understand the different properties of each solution and how they fulfill different use cases; and finally
Identify the roles data warehouses and data lakes in a modern data architecture and contrast how it differs from a traditional OLTP database

OLTP Databases: Storing Data for Operations

OLTP Databases are the traditional databases that we’re most familiar with. Usually referred to as just “databases”, these systems are optimized for the rapid handling of business transactions. As such, we use these databases to fulfill basic Create-Read-Update-Delete (CRUD) operations and serve as the back-end of our applications.

Properties

Storing operational data
Highly performant for short transactions and finding specific rows
Not optimized for running column-specific operations and aggregating/grouping large amounts of data

Use Cases

Filing patient information for healthcare
Signing up a new user for a social media website
Taking orders for an e-commerce website

Examples

PostgreSQL
MySQL
Oracle
MongoDB
Redis

Data Lakes: Data Repository for Unstructured, Semi-structured, and Structured Data

A data lake is a huge storage repository that is usually used to hold raw data in their original forms before being used for analysis. It’s essentially a high-capacity thumb drive that exists in the cloud! Unlike the previous two storage solutions, the value of the data lake lies in its flexibility: it can store any type of data. From images and videos to CSVs and XLSX files -- everything is fair game for the data lake.

Properties

Data can be loaded in a raw, untransformed state
Can store any type of data (e.g. images, videos, audio, XLSX, CSVs, etc.)

Use Cases

Storing data that you don’t want to lose while still being unsure how to use them
Quickly reading and writing files in their raw form

Examples

Google Cloud Storage
Amazon S3
Azure Data Lake

The Need for a Data Lake

Why would you want this? The need for a data lake is best understood in contrast to the data warehouse.

Storing data directly into warehouses requires rigid decision making on the part of data engineers. If data won’t be used to answer key business decisions immediately, then there’s little reason to store it in the warehouse! Adding these prematurely will incur expenses that have no apparent business value. In addition, storing data in a warehouse requires a data schema that may not yet be available.

But business requirements often change — the business questions being asked now may not be the same business questions asked tomorrow, and not collecting data from the onset might lead to business questions left unanswered simply because the data is not available! This is especially true for ephemeral data.

Having cheaper storage costs and being able to store data in its raw form makes the data lake a good solution for this problem.

Hitting the Nail on the Head

Here’s a visual to summarize the different data storage solutions we discussed:

	OLTP Database	Data Warehouse	Data Lake
Data Structure	Structured	Structured	Unstructured, Semi-structured, Structured
High-level Use Case	Processing Business Transactions	Data Science/Analysis	Data Science/Analysis
Users	Business Users/Customers	Data Scientists/Data Analysts/Business Analysts	Data Scientists
Unique Advantage	Optimized for rapidly dealing with transactions one at a time	Highly performant for executing analytical queries on large number of rows of data	Can store raw, untransformed data at optimal cost

Data Structure

Databases and warehouses can only store structured data. A data lake on the other hand can store any type of data, be it structured, semi-structured, or unstructured.

Use Cases

OLTP Databases are most commonly used for transactional purposes. Both data lakes and data warehouses are purpose-made for data science and data analysis purposes.

Users

The users for databases are both business users and customers that use different operational systems. Examples of these are Enterprise Resource Planning (ERP) software for business users and social media applications for customers.

Unique Advantage

As we have come to show, these three data storage systems are all unique and have their own set of advantages.

First, the advantage of the traditional OLTP database is that it’s optimized for rapidly dealing with transactions one at a time. This makes it the ideal storage solution for operational systems.

Next, the data warehouse is the best for answering key business questions, because it can easily crunch through large amounts of data.

Lastly, the data lake has the capability to store any type of data, making it the perfect staging area for unstructured, semi-structured, and structured data.

While the delineation between OLTP Databases and the other two systems is clear, it’s a bit harder to understand why we need both Data Warehouses and Data Lakes for analysis. To aid in our understanding, let’s borrow from Asha Saxena, chair of Big Data Analytics Consulting Firm Future Technologies, Inc, who provides the following analogy:

"Think of a data warehouse like a shopping mall. It has discrete “shops” within it that store structured data -- bits that are presorted into formats that database software can interact with. In contrast, a data lake is like a disorganized flea market. It has “stalls”, but where one stops and the next one begins is not so clear. Unlike data warehouses, data lakes can contain both structured and unstructured data. Unstructured data, as the name implies, refers to “messy” digital information, such as audio, images, and video."

Thinking Machines Can Help You Migrate to the Cloud

While having the knowledge to distinguish between data storage solutions is useful, knowing how each of them fits in a modern cloud data platform is much more involved. Different organizations have different data needs, and as such, the cloud migration strategies need to be tailor-fit for each.

We have worked with dozens of companies to figure out and solve their different data challenges, and can assist you to have a properly executed cloud migration. Let us handle the data nitty-gritty so you can focus on running your business! Contact us to get started.