Cloud-based Enterprise Data Warehousing: A Comparison of Snowflake vs Amazon Redshift vs Azure Synapse

June 6, 2022 | Commercial Data Management
Adoption Of Big Data Repositories In Pharma

Cloud-based Enterprise Data Warehousing: A Comparison of Snowflake vs Amazon Redshift vs Azure Synapse

Cloud-based Enterprise Data Warehousing: A Comparison of Snowflake vs Amazon Redshift vs Azure Synapse 1024 683 deepak

Adoption of Big Data Repositories in Pharma: Background

The use of big data in drug research stemmed from more efficient data access and more secure data repositories for biopharma companies. Historically, data management systems and warehouses were based on heavily structured paradigms that assumed the data was being captured for specific questions. But with a big data boom, access to huge amounts of historical and current information via traditional data warehouses became a formidable task.

Biotech companies need to retrieve information for analysis, interpretation, trend identification regardless of whether objectives focus on clinical trials, regulatory compliance, marketing and sales alongside distribution. As pharma companies brace with the reality of being more data-driven to get better insights and achieve top-of-the-line revenues, cloud data warehouse technologies are the answer for faster insights. The global Data Warehouse as a Service (DWaaS) market size is expected to grow at a Compound Annual Growth Rate (CAGR) of 22.3% to reach USD 12.9 billion by 2026 from USD 4.7 billion in 2021.

One of the key differences between traditional databases and cloud-based databases is that the latter is designed keeping elasticity and scalability in mind. Cloud-based data warehouses do not need physical hardware and its quicker to set up. In terms of functionality, cloud data warehouses have a completely different approach to loading data, querying and moving the data around the various levels in the company. They also function without interfacing with the hardware, with a service layer that allows IT teams to manage things remotely via the cloud.

Compared to on-premises data repositories, cloud-based data warehouses allow data managers to plan, procure, maintain data securely and efficiently. Unlike on-premises data warehouses, pharma majors do not have to invest on hardware related to number of nodes and disks. At a time when digital transformation is imperative, cloud data warehouses allow pharma companies to tap into unstructured data seamlessly. It allows them multiple user touchpoints across all channels- social, in-person, marketing analytics, and other data collected from automation systems. Since the user touchpoints for marketing are innumerable, the data points generated are huge and can be effectively analyzed by a cloud data warehouse.

Snowflake

This is a data warehouse that is a multi-cloud software as a service (SaaS) solution built on the back of the major cloud provider’s storage options. Snowflake data warehouse can be backed by an Azure Storage Account, an AWS S3 account or even a Google Cloud storage. Snowflake’s model means that the data warehouse is technically on their Snowflake cloud which is publicly deployed and not private. Snowflake maintains a collection of pre-provisioned and pre-warmed virtual machine instances for the sake of compute. It features its own ANSI-compliant (to an extent) SQL dialect, called SnowSQL.

As a data warehouse, compute and storage are merely services to be called in to. Storage is the external account set during provisioning, and compute is procured at query-time from existing VM pool. This compute capacity is allocated in small, medium and large sizes, with each step up the size chart representing a doubling of the number of VM’s concurrently processing the data. When a query is submitted, it is compiled and distributed to the compute nodes, data is procured and processing is completed, collated and returned.

Azure Synapse

Azure Synapse is a very new data platform from Microsoft, introduced in December 2020. While it has longstanding roots in Microsoft’s award winning Massively Parallel Processing (MPP) database work that spans both on-premises appliances and the cloud (PDW->APS->Azure SQL DW. Synapse has rebranded since adding an otherwise unmatched collection of services into a single platform. Synapse offers integration to Azure Data Factory (ADF) pipelines, SQL DW Dedicated Pools, Synapse Serverless and even Spark Pools. It also offers compatibility with many other compute and storage services by virtue of the Azure Data Factory integration. It is a PaaS (platform-as-a-service) solution and effectively represents an integration of many other existing Microsoft data solutions. This integration is natively presented in the Synapse Studio experience, which is the front-end of Synapse. This allows for development and monitoring alongside other functions.

Amazon Redshift

Amazon Redshift is a cloud data warehouse service that handles data sets of various sizes that range from few gigabytes to a petabyte or more. Users initially launch a set of nodes and provision them, post which data is uploaded and an analysis is carried out. This data warehouse is part of a broader Amazon Web Services (AWS) ecosystem. The Redshift data warehousing service can offer various features. For instance, users can export data to and from a data lake and integrate with other platforms such as Salesforce, Facebook Ads, Google Analytics, Jira, Slack and Splunk. This data warehouse service achieves high performance and has efficient storage options using columnar storage and data compression alongside zone maps. It allows teams to quickly access petabytes of data, run queries and produce visualizations. Business intelligence reports can be produced in record time.

Amazon Redshift v/s Snowflake v/s Azure Synapse: Understanding Differences

Architecture:
All three data warehouses use the MPP architecture. In AWS, the data warehouse is made of clusters with compute nodes split into node slices. The leader node assigns the individual compute nodes with the code. The entire data warehouse system uses industry standard JDBC or ODBC to communicate with client applications.

Snowflake architecture on the other hand is made for the cloud and combined with an SQL query engine. It combines traditional shared disk with the shared-nothing database architectures giving it three core layers namely database storage, cloud services and query processing.

Azure Synapse uses a scale-out architecture to distribute the computational processing of data across various nodes. It also separates compute from storage, allowing users to scale out compute independently of the data stored in the system. One of the most critical areas of difference is in cost. Snowflake offers a Pay-As-You-Go billing for compute that is calculated on a per-second basis with a minimum of 60 seconds with auto-suspend and auto-resume capabilities.

Typically speaking, query execution takes about 2 minutes, and you pay only for 2 minutes of compute assuming that the Virtual Warehouse is suspended after the query execution. With Synapse, compute usage is charged on an hourly basis. If the data warehouse is active for 12 hours in a month, users will be billed for the 12 hours the data warehouse existed. It is the organization that is responsible for managing and resume Synapse of even plan for its capacity and educate developers to optimize overall consumption and cost.

Snowflake also has an innovative multi-cluster, shared data architecture that stands out the most and is effectively made ground up for the cloud. With Snowflake, different workloads can be isolated concurrently on a shared data layer. By creating separate Virtual Warehouses, that can be seen as differently sized and isolated “compute instances” on shared data. It is possible to achieve unlimited scale and concurrency based on the unprecedented compute needs without any downtime. Snowflake also is designed to scale the needs of the organization overtime.

Synapse provides a dedicated SQL pool option with a per-defined unit of scale called the Data Warehouse Unit or DWU and a serverless SQL pool option where scaling is done automatically to meet different scaling needs. The data is then moved between compute nodes using Data Movement Service or DMS within a dedicated Synapse SQL pool. There is also a separation of different workloads that can be defined by a dedicated resource pools within a limited pool of nodes (like traditional MPPs such as Impala). These resource pools must be defined and monitored overtime by an administrator. Each Azure Synapse has capacity limits; therefore, an organization might need to stand up and manage multiple Azure Synapse Analytics services when these limits are exceeded. Additionally, Azure Synapse is mainly designed for Big Data loads.

Performance:
Redshift offers a fine performance on almost all data types. But the performance is low when it is dealing with semi-structured data such as JSON files. The concept of distribution keys is very useful when you are trying to get optimal performance. Redshift Distribution Keys (DIST Keys) determine where data is stored and clusters store data fundamentally across the compute nodes. Query performance suffers when a large amount of data is stored on a single node therefore distribution keys are styles that distribute data across clusters. Amazon Redshift also stores your data on disk in sorted order according to sort key. The Amazon Redshift query optimizer uses sort order when it determines optimal query plans.

Snowflake contrastingly, separates compute from storage allowing it to take in concurrent workloads letting users run multiple queries at a time. The workloads don’t impact each other and that leads to faster performance.

Azure Synapse has an architecture that allows for concurrent query processing. Users can extract insights from their data and visualize it. Azure Synapse has the best ability to scale up and down and perform well under various load levels. You can run performance benchmarks using data with varying degree of workloads. Azure services can set up to auto scale and needs an administrator’s attention. Administrators can also partition data structures to improve performance and do other kinds of performance optimization. Azure Synapse takes automatic snapshots of the data warehouse throughout the day to create restore points that are available for seven days. You can also manually trigger as many as 42 user-defined snapshots. Snapshots storage counts toward storage allotment for billing purposes. You can also restore the data warehouse from any snapshot by issuing a restore command.

Integrations:
Amazon Redshift supports integration with the entire AWS ecosystem that includes Amazon Dynamo DB, Amazon RDS, Amazon S3, AWS Data Pipeline and AWS EMR. Redshift also partners with the other platforms. Snowflake offers native connectivity with multiple BI, data integration and analytics tools such as Azure Data Factory, IBM Cognos, Oracle Analytics Cloud, Fivetran and Google Cloud while Azure Synapse comes with many integration tools such as logic apps, API management, Event Grid and Service Bus that enable you to connect to a wide variety of third-party services. Synapse also supports native integration with BI, ML software and operational databases.

Security:
Redshift enables both users and AWS to have responsibilities to ensure that the data is secure. AWS controls access to Redshift resources at all levels. It is also compliant with ISO, PCI, HIPAA BAA and SOC 1,2,3 standards.

Snowflake has a very secure cloud platform and complies with many data protection standards that include SOC1 type 2, SOC 2Type 2 for Snowflake editions and HIPAA, HITRUST and PCI DSS for business-critical edition or even higher. Snowflake has also implemented controlled access management and data security by encrypting all data and files.

Azure offers both data protection services for both on-premises and cloud workloads. These services include access management, information security, threat protection, data protection and network security.

Data Backup and Recovery: Redshift has a system that is advanced for both automated and manual snapshots. The snapshots facilitate recovery in the case of the occurrence of an unseen event. These snapshots are stored in S3 using an encrypted SSL connection.

Snowflake on the other hand uses fail-safe rather than backup. The fail-safe approach offers a 7-day period during which any Snowflake data that might be lost can be recovered. The concept of Snowflake Time Travel also is very critical in data storage and retrieval. This enables access to historical data, both changed and deleted, at any point within a defined period. It also serves as a powerful tool for restoring data-related objects like tables, schemas and databases that have been accidentally or intentionally deleted. The concept of Time-Travel also enables duplicating and backing up data from key points in the past. It also helps users analyze data usage and manipulation over specified periods of time.

Microsoft has an in-built Azure Backup feature for back up. This scales well to meet your backup storage needs.

Use Cases:
Redshift is suitable for any business that deals with large scale data and runs on AWS where queries need a quick response. It is also good for businesses that need a data warehouse solution with a transparent pricing model with almost no administrative costs. Amazon Redshift is also a relational database management system (RDBMS), so it is compatible with other RDBMS applications. Although it provides the same functionality as a typical RDBMS, including online transaction processing (OLTP) functions such as inserting and deleting data, Amazon Redshift is optimized for high-performance analysis and reporting of very large datasets.

Snowflake is suitable for companies that want to deploy a cloud data warehouse with nearly unlimited automatic scaling and high performance. Every warehouse within Snowflake is on its own independent compute cluster. Warehouses do not share resources with other virtual warehouses. This means that Snowflake supports near-unlimited concurrency for both queries and users. On top of this, Snowflake is cloud-agnostic and runs on all three major clouds, AWS, GCP, and Azure.

Azure Synapse uses a distributed query system that leverages T-SQL (T-SQL is very similar to conventional SQL but it comes with a few added benefits and is primarily used with Microsoft services). It offers what’s known as Dedicated SQL pools, Serverless SQL pools, and Spark Pools. It also has a great price and performance ratio and is an enterprise grade data warehouse which offers seamless integrations. Synapse is native to Azure, so this means that Synapse is only available to Microsoft customers who are already leveraging Microsoft’s existing cloud offerings.

Enterprise Data Warehouses: Road Ahead for the New Class of Business Workloads in the Cloud

Early cloud was all about infrastructure-as-a-service that would spin up storage and compute alongside networking resources to support pharma companies. Increasingly, cloud workloads are going beyond infrastructure services and becoming increasingly diverse. Data warehouses are now leveraging data by infusing artificial intelligence into applications by simplifying analytics and scaling with the cloud to deliver business insights in near real time. These data stores and analytic databases are breaking away from legacy enterprise data warehouses that were too cumbersome and complicated, slow to keep pace with the speed of business. Since pharma data is increasingly exponentially, cloud data warehouses enable user access and safeguards precious data, all while maintaining processing speeds. Cloud data warehouses also allow IT resources to dedicate time to more important business intelligence projects. All these cloud data warehouses seamlessly integrate new data sources while upholding data integrity throughout the collection, governance, transformation and storage process. Organizations therefore share a common data repository thereby simplifying the monitoring and deployment experience.

References

Nicholas Samuel. Snowflake vs AWS vs Azure: Top 8 Unique Differences
Snowflake vs AWS vs Azure: Top 8 Unique Differences – Learn | Hevo (hevodata.com)
March 25, 2022 

Marianna Park. Cloud Data Warehouse Comparison: Redshift vs BigQuery vs Azure vs Snowflake for Real-Time Workloads
Cloud Data Warehouse Comparison: Redshift vs BigQuery vs Azure vs Snowflake for Real-Time Workloads | Striim.
2022 

Travis Manning. Synapse vs Snowflake: The Data Warehouse Debate.
Synapse vs Snowflake: The Data Warehouse Debate (bluegranite.com)
Sep
tember 23, 2021 

Subscribe to Our Insights

Contact us