Snowflake Migration and Testing
What is Snowflake?
Snowflake is a data warehouse built from the ground up for the cloud to analyze massive data volumes at scale.
The critical difference between Snowflake and traditional data warehouses is that it separates data storage from compute (processing). This separation enables massively parallel processing (MPP) at scale and is one of the reasons enterprises have adopted Snowflake as the platform of choice when migrating their legacy on-prem data warehouses.
Additionally, Snowflake introduces the concept of a virtual warehouse, which encapsulates a cluster of compute nodes. A virtual warehouse provides the resources, such as memory, temporary storage, and CPU, needed to carry out DML operations and SQL execution. By building on this separation of storage and compute, Snowflake gives organizations a highly optimized ability to scale horizontally or vertically by creating warehouses of different configurations and sizes for different types of workloads.
Migrating to Snowflake
Most migrations are complex and can lead to failures and rework, with the added cost of lost time and effort. Consequently, most migrations require a carefully developed migration plan that, among other things, pays meticulous attention to migration testing. In this document, we recommend a carefully crafted step-by-step approach for Teradata to Snowflake migration with a focus on testing your data. We explore and recommend testing platforms, techniques, and procedures to leverage during and after the migration, with a focus on automating the overall data migration testing effort.
‘Staged’ vs ‘Lift and Shift’ Migration
Some factors may lead organizations to adopt a “staged migration” approach. Compelling reasons to choose this type of migration arise when the warehouse platform consists of many data marts interconnected with applications, or when critical data and processes within the data warehouse no longer perform well and require re-engineering.
Other factors may lead organizations to adopt a “lift and shift” approach for migrating to Snowflake. Organizations may choose this approach when there are pressing time constraints, when data is highly integrated across the existing warehouse, or when the organization is migrating a single standalone data mart or a system with well-designed data and processes built on standard ANSI SQL.
Notwithstanding the migration approach, it is an established practice to implement a phase-wise plan and course of action for data warehouse migrations. The following list outlines the general phases suggested when migrating to Snowflake.
- Schema migration
- Initial data load
- Daily delta load
- Running source and Snowflake in parallel
- Running downstream systems in parallel
- Cutover to Snowflake
The stepwise approach listed above lends itself to the early adoption of testing data quality and it encourages left shifting the test effort in the DevOps cycle for data migration projects.
Thus, the primary challenge in any data migration is testing and certifying the migrated data. Migration-related data quality issues are among the most disruptive contributors to delays and cost overruns in data migration projects.
iCEDQ blends smoothly into both the staged and the lift-and-shift migration approaches by dovetailing non-intrusively into the data movement cycles implicit in each, as the examples in this document illustrate. The staged approach is more nuanced and an order of magnitude more complex than lift and shift.
DevOps approach to Snowflake migration testing
Left shifting migration testing activities on the migration project timeline is the new normal in a world driven by contemporary and effective paradigms such as DevOps – especially for achieving a high-quality, cost-effective and timely migration. This can be realized readily using a test-driven development (TDD) approach for data migration projects.
Adopting a test-driven development approach for migration from a legacy data warehouse to Snowflake is an optimal mechanism for discovering data issues as soon as possible. As pointed out earlier, adopting testing early reduces migration rework and greatly lowers the probability of time and cost overruns during a migration project. The following sections elaborate on how these benefits can be realized by following a meticulous approach to data migration testing using iCEDQ.
The iCEDQ left shift approach for data migration testing is based on consistent testing in between each of the migration steps. It is firmly aligned to certify each step of the Snowflake migration effort. Additionally, iCEDQ addresses the vitally important aspect of executing a comprehensive test suite at each stage to discover any data quality failures and issues as early as possible in the data migration cycle.
The generic test plan blueprint depicted below is designed to ensure thorough testing and test coverage during each step of Snowflake migration. We highly recommend adopting this blueprint for Snowflake migration testing to effectively leverage the repeatable, reusable and automated rules-based testing elements built into the iCEDQ platform.
Migrate & Test Schema
The key step is to move database objects from the existing source database to Snowflake. This step will include moving schemas, tables, views, sequences, and any other object(s) supported by Snowflake.
Once the schema is migrated, the crucial step is to create and run tests to detect any missing objects, columns, or incorrectly mapped datatypes between the source database and Snowflake.
Schema Compare Test
To identify schema discrepancies between the source database and Snowflake, create tests to compare Schema Names, Table Names, View Names, Sequence Names, Column Names, and Datatypes for corresponding columns. This information is readily available in the system tables for any database.
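As a sketch of the comparison logic, the Python snippet below diffs two sets of (table, column, datatype) tuples, as one might extract from each side's system tables. All table and column names in the sample data are hypothetical.

```python
# Minimal sketch of a schema-compare test. Assumes each side's metadata has
# already been fetched (e.g. from the source's system tables and Snowflake's
# INFORMATION_SCHEMA) into (table, column, datatype) tuples.

def compare_schemas(source_cols, target_cols):
    """Return columns missing on each side and columns whose datatypes differ."""
    src = {(t, c): dt for t, c, dt in source_cols}
    tgt = {(t, c): dt for t, c, dt in target_cols}
    missing_in_target = sorted(set(src) - set(tgt))
    missing_in_source = sorted(set(tgt) - set(src))
    type_mismatches = sorted(
        (key, src[key], tgt[key])
        for key in set(src) & set(tgt)
        if src[key] != tgt[key]
    )
    return missing_in_target, missing_in_source, type_mismatches

# Hypothetical metadata: the NOTE column was mapped to the wrong datatype.
source = [("ORDERS", "ORDER_ID", "NUMBER"), ("ORDERS", "NOTE", "TEXT")]
target = [("ORDERS", "ORDER_ID", "NUMBER"), ("ORDERS", "NOTE", "FLOAT")]
missing_t, missing_s, mismatches = compare_schemas(source, target)
```

The same three outputs map directly onto the failure categories above: missing objects, missing columns, and incorrectly mapped datatypes.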
It is particularly important to be alert when comparing data types between the source and target, since a one-to-one mapping between source and target data types cannot be assumed. Each database defines its own data types. Hence, schema tests must factor in the correct datatype mapping between the source database and Snowflake.
Run and Test Initial Data Loads
A one-time load process entails migrating the historical data from a source database to a raw zone (S3/ Azure Blob/ GCS) and into Snowflake tables.
We kick off the testing cycle in this stage by creating tests between the source database and Snowflake to identify any data truncation, transformation, rounding, or any other data issue.
Since the initial loads carry historical data, which could run to billions of rows, users should create different types of tests to keep test execution timely and avoid unnecessary egress charges. The types of tests required at this stage are explained below.
Row Count Test
This is the initial test to compare row counts between the source database and Snowflake tables. This test should pass or fail based on the actual versus expected outcome.
Even when the row count test passes, it does not imply that the data is identical. It is best to create drill-down and detailed checks before declaring the data issue-free.
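The pass/fail logic is simple but worth making explicit: the test compares actual and expected counts and fails on any difference. The counts below are stand-in integers; in practice they come from COUNT queries against each system.

```python
# Hedged sketch of a row-count test between a source table and its
# Snowflake counterpart. The numbers are illustrative placeholders.

def row_count_test(source_count, target_count):
    """A row-count test passes only when actual equals expected exactly."""
    return {
        "source": source_count,
        "target": target_count,
        "passed": source_count == target_count,
    }

# Two rows were lost somewhere in the load, so this test fails.
result = row_count_test(1_000_000, 999_998)
```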
Data Compare Test
This test is created to compare data between the source and Snowflake tables row by row and column by column. This test will recognize the exact data issues that an organization needs to fix before a rerun of the initial load.
A sampling of the issues which this test helps to identify are outlined below.
- Data Truncation
- Floating-point issues
- Incorrect DateTime
- Special character conversion issues
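A minimal sketch of this comparison in Python: rows from both sides are matched on a key and compared column by column, reporting each mismatch. The key, columns, and sample values are illustrative only.

```python
# Sketch of a row-by-row, column-by-column compare keyed on a primary key.

def compare_rows(source_rows, target_rows, key):
    """Yield (key_value, column, source_value, target_value) per mismatch."""
    tgt = {row[key]: row for row in target_rows}
    for row in source_rows:
        other = tgt.get(row[key])
        if other is None:
            yield (row[key], None, row, None)  # row missing in target
            continue
        for col, val in row.items():
            if other.get(col) != val:
                yield (row[key], col, val, other.get(col))

# Hypothetical rows: the amount lost its fractional part during the load.
source = [{"id": 1, "name": "Ann", "amount": 10.5}]
target = [{"id": 1, "name": "Ann", "amount": 10.0}]
diffs = list(compare_rows(source, target, key="id"))
```

Each emitted tuple points at the exact row and column to investigate before rerunning the initial load.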
Data Aggregation Test
Instead of comparing historical data on a row by row basis, users can create data aggregation tests to evaluate quantities such as the sum of numeric columns by month or year between the source database and Snowflake. If a data aggregation test fails for a specific group, then a row by row comparison for that particular group may be performed.
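The drill-down idea can be sketched as follows: sum a numeric column per group on each side, then flag only the groups whose totals disagree for row-level comparison. The group and column names are hypothetical.

```python
# Sketch of a data-aggregation test: compare per-group sums and record only
# the groups that disagree, so row-by-row compares stay narrowly scoped.

from collections import defaultdict

def aggregate(rows, group_col, value_col):
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_col]] += row[value_col]
    return dict(totals)

# Hypothetical monthly totals; February disagrees between the two sides.
source = [{"month": "2023-01", "amt": 5.0}, {"month": "2023-02", "amt": 7.0}]
target = [{"month": "2023-01", "amt": 5.0}, {"month": "2023-02", "amt": 6.0}]

src_agg = aggregate(source, "month", "amt")
tgt_agg = aggregate(target, "month", "amt")
failed_groups = sorted(g for g in src_agg if src_agg[g] != tgt_agg.get(g))
```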
Run and Test Delta Loads
As upstream systems generate the data, it is necessary to load the delta in Snowflake by running a daily delta load process. It can be a separate process, or upstream systems can directly load into Snowflake.
Create tests to ensure that the source database and Snowflake are in sync. The tests should be executed daily as long as both the systems are running in parallel.
Delta Data Compare Test
Users can create tests to compare the daily delta loads to identify issues between the source database and Snowflake. Execute these tests continuously until the source system is switched off.
Data Validation Test
Snowflake does not enforce primary key constraints. Hence it is crucial to create a test to check for duplicates in critical tables. This type of data validation helps users quickly identify the upstream processes that are causing data quality issues.
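The duplicate check mirrors a GROUP BY ... HAVING COUNT(*) > 1 query; the Python sketch below applies the same logic to hypothetical sample rows.

```python
# Sketch of a duplicate-check validation on a key column. Since Snowflake
# does not enforce primary keys, this catches key values loaded twice.

from collections import Counter

def find_duplicates(rows, key):
    """Return {key_value: occurrence_count} for values appearing > 1 time."""
    counts = Counter(row[key] for row in rows)
    return {k: n for k, n in counts.items() if n > 1}

# Hypothetical rows: id 2 was loaded twice by an upstream process.
rows = [{"id": 1}, {"id": 2}, {"id": 2}, {"id": 3}]
dupes = find_duplicates(rows, "id")
```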
Test Downstream Systems
BI tools, ETL tools, or downstream processes that are consuming data from the source database and Snowflake should function in the same manner. Therefore, it is crucial to create tests that identify data issues created by downstream systems.
Data Compare Test
The downstream applications/systems must generate the same output irrespective of whether the data originates from the source database or Snowflake. Hence, to ensure that the downstream systems are performing as expected, it is essential to create tests that compare and check the output data of the downstream tools.
Snowflake Migration Testing Walkthrough
Organizations are migrating to Snowflake from various databases. A concrete example provides a firm understanding of the specific test cases explained above.
Teradata to Snowflake Migration Testing
A comprehensive list of test cases is illustrated below to get a better understanding of the technical aspects of how to test a specific database with Snowflake. Teradata is the source database chosen to illustrate the example test cases.
Schema Compare Test
Refer to Migrate & Test Schema for a quick understanding of this type of test. Users can query the system tables of Teradata and Snowflake to compare table names, column names, and datatypes. Since the datatypes are not identical, modify the Teradata query to generate the equivalent Snowflake datatype.
The sample query below illustrates this type of test. Please note that the query needs to be modified based on how one has mapped the datatypes between Teradata and Snowflake.
```sql
-- Teradata side: translate Teradata type codes (dbc.ColumnsV.ColumnType)
-- into their Snowflake equivalents. The table name is illustrative; adjust
-- the mapping to match your own datatype decisions.
SELECT
    TableName,
    ColumnName,
    CASE
        WHEN ColumnType = 'CF' AND CharType = 1 THEN 'TEXT'
        WHEN ColumnType = 'CV' THEN 'TEXT'
        WHEN ColumnType = 'CO' THEN 'TEXT'
        WHEN ColumnType = 'I1' THEN 'NUMBER'
        WHEN ColumnType = 'I2' THEN 'NUMBER'
        WHEN ColumnType = 'I8' THEN 'NUMBER'
        WHEN ColumnType = 'I'  THEN 'NUMBER'
        WHEN ColumnType = 'D'  THEN 'NUMBER'
        WHEN ColumnType = 'F'  THEN 'FLOAT'
        WHEN ColumnType = 'N'  THEN 'NUMERIC'
        WHEN ColumnType = 'DA' THEN 'DATE'
        WHEN ColumnType = 'AT' THEN 'TIME'
        WHEN ColumnType = 'SZ' THEN 'TIMESTAMP WITH TIME ZONE'
        WHEN ColumnType = 'TS' THEN 'TIMESTAMP_NTZ'
        WHEN ColumnType = 'TZ' THEN 'TIME WITH TIME ZONE'
        WHEN ColumnType = 'JN' THEN 'VARIANT'
        WHEN ColumnType = 'A1' THEN 'ARRAY'
        WHEN ColumnType = 'AN' THEN 'ARRAY'
        WHEN ColumnType = 'CF' AND CharType = 4 THEN 'BINARY'
        WHEN ColumnType = 'BO' THEN 'BINARY'
        WHEN ColumnType = 'BF' THEN 'BINARY'
        WHEN ColumnType = 'BV' THEN 'BINARY'
    END AS equivalent_snowflake_datatype
FROM dbc.ColumnsV
WHERE TableName = 'SAMPLE_DATATYPE_TABLE';

-- Snowflake side: read the corresponding metadata from INFORMATION_SCHEMA.
SELECT t.table_name, c.column_name, c.data_type
FROM information_schema.tables t
INNER JOIN information_schema.columns c ON (t.table_name = c.table_name)
WHERE t.table_name = 'SAMPLE_DATATYPE_TABLE' AND t.table_schema = 'PUBLIC';
```
Row Count Test
Refer to Row Count Test for a quick understanding of this type of test.
Comparing the counts of the two tables is the most basic check. When a user executes a count query against Teradata, it may return a numeric overflow error, because COUNT returns an INTEGER by default and large tables can exceed the maximum integer range. Write the count query with an explicit cast, as shown below, to avoid the overflow condition.

```sql
-- Casting to BIGINT avoids Teradata's default INTEGER return type for COUNT.
SELECT CAST(COUNT(*) AS BIGINT) AS TARGET_COUNT FROM SAMPLE_TABLE;
```
Data Aggregation Test
Refer to Data Aggregation Test for a quick understanding of this type of test.
The goal is to compare aggregated values of each numeric column (smallint, float, decimal, and so on), either for the complete table or per group using a GROUP BY clause, between Teradata and Snowflake.
If the aggregate value grows too large, users will face numeric overflow issues. Users may also encounter rounding problems for decimal datatypes if the columns have different precisions. Both of these discrepancies can be resolved by writing the SQL as shown below.
Similarly, HASH functions combined with aggregate functions can be used for string and date datatypes.
```sql
-- Teradata side: cast each numeric column before summing to avoid overflow
-- and precision mismatches. Table and column names are illustrative.
SELECT
    SUM(CAST(INT_COLUMN AS DECIMAL(38,0))) AS INT_COLUMN,
    SUM(CAST(FLOAT_COLUMN AS DECIMAL(38,0))) AS FLOAT_COLUMN,
    SUM(CAST(NUMERIC_COLUMN AS DECIMAL(38,0))) AS NUMERIC_COLUMN
FROM SAMPLE_TABLE;

-- Snowflake side: plain sums of the same columns.
SELECT
    SUM(INT_COLUMN) AS INT_COLUMN,
    SUM(FLOAT_COLUMN) AS FLOAT_COLUMN,
    SUM(NUMERIC_COLUMN) AS NUMERIC_COLUMN
FROM SAMPLE_TABLE;
```
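The hash-plus-aggregate idea for string and date columns can be sketched as follows: hash each value to an integer and sum the hashes, so two datasets containing the same values, in any row order, produce the same aggregate. In SQL this would use the databases' own HASH and SUM functions; md5 here is only a stand-in.

```python
# Sketch of an order-independent hash aggregate for non-numeric columns.
# md5 is a stand-in for the database's HASH function.

import hashlib

def hash_aggregate(values):
    """Sum a stable hash of each value; order-independent by construction."""
    total = 0
    for v in values:
        digest = hashlib.md5(str(v).encode("utf-8")).hexdigest()
        total += int(digest[:16], 16)  # fold part of the digest to an integer
    return total

a = hash_aggregate(["alpha", "beta", "gamma"])
b = hash_aggregate(["gamma", "alpha", "beta"])  # same values, other order
```

Because addition is commutative, equal aggregates on both sides strongly suggest the two columns hold the same set of values, without a row-by-row compare.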
Data Validation Test
Refer to Data Validations for a quick understanding of this type of test.
Users can create technical or business validations on Snowflake tables after the migration to ensure upstream processes are not causing any data issues. An example of such a validation is a Duplicate Check. Use the query below to identify duplicates in a table.
|GROUP BY COLUMN_NAME|
|HAVING COUNT(*) > 1|
Data Compare Test
Refer to Data Compare Test for a quick understanding of this type of test.
Users can compare either a complete dataset in a single test or create multiple tests to compare only a chunk of data at a time for performance optimizations.
```sql
-- Run the same query against the source (Teradata) and the target (Snowflake)
-- and compare the two result sets.
SELECT * FROM SAMPLE_TABLE;   -- source side
SELECT * FROM SAMPLE_TABLE;   -- Snowflake side
```
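One way to carve a large compare into chunks is to partition the key space into ranges and emit one filtered query per range. The table and key names below are hypothetical.

```python
# Sketch of chunked comparison: split an integer key range into equal-sized
# slices and build one WHERE clause per slice, so each test compares only a
# portion of the table at a time.

def chunk_predicates(min_key, max_key, chunks):
    """Yield (low, high) inclusive key ranges covering [min_key, max_key]."""
    size = (max_key - min_key + chunks) // chunks  # ceiling division
    low = min_key
    while low <= max_key:
        high = min(low + size - 1, max_key)
        yield (low, high)
        low = high + 1

# Hypothetical ORDERS table keyed by ORDER_ID, compared in four slices.
queries = [
    f"SELECT * FROM ORDERS WHERE ORDER_ID BETWEEN {lo} AND {hi}"
    for lo, hi in chunk_predicates(1, 100, 4)
]
```

Each query is then run against both sides and the result sets compared, allowing the chunks to be tested in parallel or spread over time.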
Automate Snowflake Migration Testing With iCEDQ
- Quickly scale testing of your migration to Snowflake Data Warehouse from Netezza, Teradata, Oracle, or any other database.
- Create tests for thousands of tables in a matter of minutes using our Snowflake Data Migration test generator.
- Generate data certification reports for your business to sign off on the Snowflake migration.
- Integrate with your data pipeline through orchestration or scheduling tools.