You’ve probably heard the catchphrase, “Data is the new oil.” You need it to run your business, and its potential value is huge… But first you have to drill for it.
Companies that can collect, process and analyze their data resources efficiently and effectively will gain competitive advantage. A robust data warehouse is often critical to success.
Data warehouses have traditionally been on-premises systems. But a cloud-based or hybrid alternative may now be a better choice for many organizations. This post compares on-premises versus cloud data warehouse architectures and explains the pros and cons of each.
What is a data warehouse?
A data warehouse is a central repository that stores and processes both historical and new data from multiple databases and applications (including potentially SaaS platforms) to improve the speed and scope of problem-solving and decision-making. Data scientists and other roles direct queries, reporting and analytics requests to the data warehouse via business intelligence (BI) and analytics tools and dashboards. Data warehouses support business decisions in many areas, from how to price products to where to cut costs to whether to launch a new service.
Columnar versus row-based data storage
Why not just query your relational databases (RDBMS) directly? One reason is that BI/analytics requests could significantly impact performance of business-critical RDBMS and associated “systems of record.”
Also, the row-based structure of RDBMS, which is ideal for transactional usage, isn’t as well suited to analytical processing. This is why many data warehouses, especially cloud-based offerings, store data in a columnar format.
For example, aggregation queries (like “What is the average age of customers who bought widgets?”) can be much faster in a columnar database because the data elements involved are stored closer together on the storage media, making read times (latency) lower. But if you need to query multiple fields per record (like “What is the name, email address and date of birth of each person who bought a widget in 2018?”) the advantages of columnar databases diminish.
Similarly, write times can be much slower in columnar databases. This is less problematic in data warehouses versus transactional applications because new data is generally added less frequently, via so-called “data dumps.”
Data warehouse structures
The more data sources you need to access and the more kinds of questions you want to answer, the more complex your data warehouse structure might need to become.
A basic data warehouse provides summary data derived from similar types of operational systems, like RDBMS. As you add more data sources, you may need to setup a “staging area” to convert diverse data into a more homogeneous structure that is easier to query. Some solutions then add “data marts,” which tailor data from individual lines of business to support the specific analysis needs of different user roles.
ELT or ETL?
Most data warehouses pull data from different line of business sources, such as marketing, sales, accounting, etc. and/or from the organization overall. All this data can either be aggregated within a central repository and then transformed into a suitable format (an Extract-Load-Transform (ELT) approach); or initially pulled into a temporary database to be transformed before being moved to a central repository (an Extract-Transform-Load (ETL) process).
The ETL approach is typically the choice for on-premises data warehouses, as hardware limitations dictate processing the data off-platform. ETL can also help ensure privacy and compliance by cleansing the data before loading. The massive scalability of cloud data warehouse platforms makes them amenable to ELT, which can enable users to transform raw data according to their specific needs after it’s loaded.
On-premises data warehouse architectures
An on-premises data warehouse requires dedicated storage, servers and administrative resources. As such, it can be slow and costly to scale up as data volumes expand.
Traditional data warehouses have a three-tier architecture:
- The bottom tier consists of a database server that extracts data from multiple sources; e.g., RDBMS and SaaS
- The middle tier consists of an online analytical processing (OLAP) server that transforms the data into a more analytics-friendly structure
- The upper tier is the client-facing applications, which provide tools for data analysis, reporting and data mining
Cloud data warehouse architectures
A cloud data warehouse, also called Data Warehouse-as-a-Service (DWaaS), is a SaaS offering. You pay as you go for storage and compute power, and your data is stored, queried and analyzed over the internet.
Since the service provider manages the physical hardware and IT resources, cost-effective scalability and much lower setup costs are among the key advantages of cloud data warehouses. Gartner estimates that 75% of all databases will be deployed on or migrated to the cloud by 2022, with cost and flexibility being perhaps the primary drivers for this rapid shift.
Some generic architecture components common to many cloud data warehouse architectures include:
- Clusters, which are large groupings of shared computing resources called nodes
- Nodes are discrete computing resources with dedicated CPU, RAM and storage. Queries are executed across multiple nodes in a cluster, depending on what data is stored on them.
- Slices. Each node has multiple partitions, called slices, which operate in parallel to accelerate query execution.
These elements together enable massively parallel processing, another big performance advantage of cloud data warehouses.
Cloud data warehouse architectures can vary significantly. Amazon Redshift uses a model similar to the above, and requires users to provision resources more like an on-premises data warehouse. Google BigQuery, in contrast, is serverless; Google manages system resources dynamically with no user input or visibility.
Pros and cons of on-premises and cloud data warehouses
Where to host your data warehouse? That depends on a balance of cost, scalability/growth, query efficiency and other factors that are unique to your business. Here are the top issues to consider:
Because you pay only for the storage and CPU time you actually use, plus there are no physical machines to buy and no maintenance, admin or upgrade costs, cloud data warehouses are much less expensive—especially upfront. According to Forrester Research, most companies save between 20% and 80% on cost with a cloud data warehouse solution.
Security and compliance
Security and compliance concerns are the top roadblocks to moving data warehouses to the cloud. Leading DWaaS providers offer robust security controls that are probably better than many businesses could deliver on their own. However, in virtualized cloud environments you may not know where your data resides or where and when it is moved. You also may not know when a vendor has trusted your data to a “fourth party” whose security posture might be suspect. Concerns like these can make it hard to prove compliance.
Some of the security topics to explore with DWaaS providers include:
- Data encryption at rest and in transit
- User identity management and authentication
- Security and privacy policies for replication and archiving of data
- Compliance with regulations like HIPAA and PCI DSS, and with cybersecurity frameworks like SOC 2 and ISO 27001
To scale an on-premises data warehouse solution, you need to buy new hardware and/or software (often more than you need right away). In the cloud, you get the capacity you need on-demand to support growth and across both peak and off-peak times.
Reliability and availability
With on-premises systems, reliability and availability depend on your IT resources. In cloud environments, you can hold the service provider to SLAs, which are typically pretty stringent (e.g., 99.9% uptime). DWaaS providers may also be able to offer high availability or improved resilience by replicating data across different geographical regions.
Third-party tools can make data loading faster and easier. However, it is generally more of a challenge with cloud platforms because data has to travel over external networks. Further, cloud data loading tools are often vendor-specific, which can make moving to a new DWaaS a technological and contractual struggle.
Time to market
Thanks to faster setup and rollout times, DWaaS typically offers faster time to market (and faster time to value) than an on-premises data warehouse you must stand up from scratch.
The volume, type(s) and location of data all influence latency, which is the time required to store or retrieve data. Because of their distributed architectures, cloud data warehouses generally have more latency issues than onsite data warehouses. The fresher you need your data to be (one minute versus one hour, for instance), the bigger problem latency can present. Especially if you need near real-time data analysis, it’s important to test and measure latency before choosing a cloud provider.
The leading cloud service providers increasingly offer pre-built integrations with growing ecosystems of cloud services that could help you load or process your data (assuming no compliance problems); e.g., identity and access management and BI/big data analytics tools. In an on-premises scenario, comparable integrations could be largely on you.
Choosing the right data warehouse solution
Before you implement a data warehouse, carefully consider your unique technical, business and user requirements and research likely options in detail. Given the high levels of time, effort and cost involved you don’t want any surprises.
If you don’t have database architecture and design expertise in-house, a trusted advisor can help ensure you ask all the right questions and make all the right moves.
For a free consultation to discuss your needs and questions in-depth, contact Buda Consulting.