Understanding Data Storage Solutions: Databases vs. Data Warehouses vs Data Lakes
With the mass adoption of data-driven practices, companies are squeezing as many insights as they can from the numerous types of data sources that exist today. The increasing reliance on data called for the emergence of different data storage solutions to address the inherent problems of analyzing data at a huge scale. These solutions are called data warehouses and data lakes.
Why would we need data lakes and data warehouses? Haven’t we been storing data for decades using traditional Online Transaction Processing (OLTP) databases? As we will see, OLTP databases, data warehouses, and data lakes are very different. Each of them solves different problems within the organization.
In this article we’ll:
- Do a high-level overview of the different storage solutions
- Understand the different properties of each solution and how they fulfill different use cases; and finally
- Identify the roles data warehouses and data lakes in a modern data architecture and contrast how it differs from a traditional OLTP database
OLTP Databases: Storing Data for Operations
OLTP Databases are the traditional databases that we’re most familiar with. Usually referred to as just “databases”, these systems are optimized for the rapid handling of business transactions. As such, we use these databases to fulfill basic Create-Read-Update-Delete (CRUD) operations and serve as the back-end of our applications.
- Storing operational data
- Highly performant for short transactions and finding specific rows
- Not optimized for running column-specific operations and aggregating/grouping large amounts of data
- Filing patient information for healthcare
- Signing up a new user for a social media website
- Taking orders for an e-commerce website
Data Lakes: Data Repository for Unstructured, Semi-structured, and Structured Data
A data lake is a huge storage repository that is usually used to hold raw data in their original forms before being used for analysis. It’s essentially a high-capacity thumb drive that exists in the cloud! Unlike the previous two storage solutions, the value of the data lake lies in its flexibility: it can store any type of data. From images and videos to CSVs and XLSX files -- everything is fair game for the data lake.
- Data can be loaded in a raw, untransformed state
- Can store any type of data (e.g. images, videos, audio, XLSX, CSVs, etc.)
- Storing data that you don’t want to lose while still being unsure how to use them
- Quickly reading and writing files in their raw form
- Google Cloud Storage
- Amazon S3
- Azure Data Lake
The Need for a Data Lake
Why would you want this? The need for a data lake is best understood in contrast to the data warehouse.
Storing data directly into warehouses requires rigid decision making on the part of data engineers. If data won’t be used to answer key business decisions immediately, then there’s little reason to store it in the warehouse! Adding these prematurely will incur expenses that have no apparent business value. In addition, storing data in a warehouse requires a data schema that may not yet be available.
But business requirements often change — the business questions being asked now may not be the same business questions asked tomorrow, and not collecting data from the onset might lead to business questions left unanswered simply because the data is not available! This is especially true for ephemeral data.
Having cheaper storage costs and being able to store data in its raw form makes the data lake a good solution for this problem.
Hitting the Nail on the Head
Here’s a visual to summarize the different data storage solutions we discussed:
|OLTP Database||Data Warehouse||Data Lake|
|Data Structure||Structured||Structured||Unstructured, Semi-structured, Structured|
|High-level Use Case||Processing Business Transactions||Data Science/Analysis||Data Science/Analysis|
|Users||Business Users/Customers||Data Scientists/Data Analysts/Business Analysts||Data Scientists|
|Unique Advantage||Optimized for rapidly dealing with transactions one at a time||Highly performant for executing analytical queries on large number of rows of data||Can store raw, untransformed data at optimal cost|
Databases and warehouses can only store structured data. A data lake on the other hand can store any type of data, be it structured, semi-structured, or unstructured.
OLTP Databases are most commonly used for transactional purposes. Both data lakes and data warehouses are purpose-made for data science and data analysis purposes.
The users for databases are both business users and customers that use different operational systems. Examples of these are Enterprise Resource Planning (ERP) software for business users and social media applications for customers.
As we have come to show, these three data storage systems are all unique and have their own set of advantages.
First, the advantage of the traditional OLTP database is that it’s optimized for rapidly dealing with transactions one at a time. This makes it the ideal storage solution for operational systems.
Next, the data warehouse is the best for answering key business questions, because it can easily crunch through large amounts of data.
Lastly, the data lake has the capability to store any type of data, making it the perfect staging area for unstructured, semi-structured, and structured data.
While the delineation between OLTP Databases and the other two systems is clear, it’s a bit harder to understand why we need both Data Warehouses and Data Lakes for analysis. To aid in our understanding, let’s borrow from Asha Saxena, chair of Big Data Analytics Consulting Firm Future Technologies, Inc, who provides the following analogy:
"Think of a data warehouse like a shopping mall. It has discrete “shops” within it that store structured data -- bits that are presorted into formats that database software can interact with. In contrast, a data lake is like a disorganized flea market. It has “stalls”, but where one stops and the next one begins is not so clear. Unlike data warehouses, data lakes can contain both structured and unstructured data. Unstructured data, as the name implies, refers to “messy” digital information, such as audio, images, and video."
Thinking Machines Can Help You Migrate to the Cloud
While having the knowledge to distinguish between data storage solutions is useful, knowing how each of them fits in a modern cloud data platform is much more involved. Different organizations have different data needs, and as such, the cloud migration strategies need to be tailor-fit for each.
We have worked with dozens of companies to figure out and solve their different data challenges, and can assist you to have a properly executed cloud migration. Let us handle the data nitty-gritty so you can focus on running your business! Contact us to get started.