Databricks Migration Testing
Summary
A $300 billion pharmaceutical solutions company that processes vast amounts of data had relied heavily on Azure Synapse for its data management needs. However, after undergoing major internal restructuring, they faced ongoing challenges handling large-scale data within Synapse. To overcome these challenges and meet their growing data processing and storage requirements, the team decided to expand their capabilities by moving to the Databricks cloud platform. In 2024, the pharmaceutical company initiated the migration of data from Azure Synapse to Databricks.
Initially, the Data Engineering (DE) team was responsible for both data migration and testing. During the first few development and migration runs, maintaining data consistency was crucial due to the sensitive nature of healthcare data. As a result, data migration testing became a key focus of this multi-million-dollar project. When the scoping and estimation of the project were completed, it was projected to take approximately:
- 26 weeks of manual testing
- 6 resources
- Testing limited to sampled data only
These estimates were costly, resource-intensive, and came with an extended timeline. At that point, the Quality Engineering team decided to automate the migration testing.
After careful evaluation, iceDQ was chosen as the data testing platform. This decision resulted in high-quality data, a faster project timeline, and significant cost savings.
“Not only did we achieve near-perfect quality, but we also saved time and money on the project.” – Director of Quality Engineering (QE).
Post iceDQ:
- 0 defects in production
- 20 weeks saved from the project timeline
- Resources reduced from 6 to 3
- 87% cost reduction, resulting in approx. $400,000 in savings
This case study highlights how the pharmaceutical company successfully mitigated the risks of migrating big data from Synapse to Databricks. iceDQ, an automated data migration testing platform, played a crucial role in streamlining the migration process and ensuring its success.
The System
The internal restructuring resulted in a complex dataset comprising healthcare, pharmaceutical, and drug store data, which ETL pipelines processed through three layers into the final data warehouse (a simplified sketch follows the list):
- Bronze Layer: Raw landing layer where data from files is loaded into the database with minimal modification. This ensures 100% data capture, even if the data is faulty.
- Silver Layer: Cleaned and conformed data. Data from the bronze layer is extracted, cleaned, rationalized, and loaded into the silver layer.
- Gold Layer: The final layer where the cleaned data from the silver layer is integrated and prepared for end-user analytics.
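To make the three layers concrete, here is a minimal PySpark sketch of how data might flow through them on Databricks. All paths, table names, and columns (pharmacy_claims, claim_id, and so on) are hypothetical placeholders, not the company's actual schema:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: land the raw file data with minimal modification so that 100%
# of it is captured, even if some records are faulty.
bronze = (spark.read.option("header", "true")
          .csv("/mnt/landing/pharmacy_claims/"))          # hypothetical path
bronze.write.format("delta").mode("append").saveAsTable("bronze.pharmacy_claims")

# Silver: extract from bronze, then clean and rationalize.
silver = (spark.table("bronze.pharmacy_claims")
          .filter(F.col("claim_id").isNotNull())          # drop faulty rows
          .dropDuplicates(["claim_id"])                   # hypothetical key
          .withColumn("claim_date", F.to_date("claim_date", "yyyy-MM-dd")))
silver.write.format("delta").mode("overwrite").saveAsTable("silver.pharmacy_claims")

# Gold: integrate the cleaned data and prepare it for end-user analytics.
gold = (spark.table("silver.pharmacy_claims")
        .groupBy("store_id", "claim_date")                # hypothetical columns
        .agg(F.count("*").alias("claim_count")))
gold.write.format("delta").mode("overwrite").saveAsTable("gold.daily_claim_summary")
```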
During the analysis, the tables were grouped by size and criticality to ensure adequate testing coverage: 432 small tables, 144 medium tables, and 144 large tables.
The Data Migration Project
The team identified approximately 120 tables in each layer (Bronze, Silver, and Gold) and for each environment (Staging and Production), resulting in the migration of around 720 tables in total.
As part of the migration planning, the following steps were taken:
- Environment Setup: Each database layer (Bronze, Silver, and Gold) was recreated in the Databricks environment.
- ETL Development: Specific ETL processes were developed to copy and transfer data from Azure Synapse to the Databricks environment.
- Chunked Data Migration: The data was migrated in chunks into the Databricks environment, as sketched below.
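A minimal sketch of what such a chunked copy might look like in PySpark, reading from the Synapse dedicated SQL pool over JDBC and landing the data as a Delta table. The connection details, table name, partitioning key (row_id), and bounds are hypothetical; the team's actual ETL processes are not public:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical Synapse dedicated SQL pool connection.
jdbc_url = ("jdbc:sqlserver://myworkspace.sql.azuresynapse.net:1433;"
            "database=dw;encrypt=true")

# Read the source table in parallel chunks: Spark splits the row_id range
# into numPartitions slices and issues one bounded query per slice.
src = (spark.read.format("jdbc")
       .option("url", jdbc_url)
       .option("dbtable", "dbo.pharmacy_claims")     # hypothetical table
       .option("user", "etl_user")                   # hypothetical credentials;
       .option("password",                           # dbutils is available in
               dbutils.secrets.get("etl", "synapse-pw"))  # Databricks notebooks
       .option("partitionColumn", "row_id")          # hypothetical numeric key
       .option("lowerBound", "1")
       .option("upperBound", "100000000")            # hypothetical bounds
       .option("numPartitions", "64")
       .load())

# Land the chunked reads into the corresponding Databricks Delta table.
src.write.format("delta").mode("append").saveAsTable("bronze.pharmacy_claims")
```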
The Data Migration Testing Challenge
Achieving zero data leakage during the Databricks migration was challenging due to schema and data type differences between the two platforms, the risk of records being dropped or altered in transit, and the massive volume of data involved.
- Manual Testing Inefficiency: Before using iceDQ, all data testing was done manually, which was time-consuming and inadequate.
- Massive Data Volume: The migration involved 5.47 billion records across 720 tables, posing a significant challenge.
- Iterative Testing Cycles: Each table required multiple testing cycles due to the iterative development and QA process, adding complexity and extending the project timeline.
- Data Transformations: Numerous data transformations were performed to improve data quality, including column-level changes and mandatory schema adjustments, which increased the scope and complexity of testing.
- Strict Deadline Pressure: A fixed go-live date left limited time for thorough data testing, putting intense pressure on the team to deliver high-quality data within a tight timeframe.
The time, cost, and resource estimates provided by the Data Engineering team were enormous.
The Manual Testing Estimates
The initial plan for data testing included 4 manual tests for each of the 720 tables, totaling 2,880 tests. Given the testing requirements and challenges, the team estimated approximately 792 man-days to complete the migration testing; with 6 resources, that works out to 132 working days, or roughly 26 weeks.
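As a quick sanity check, the estimate decomposes as follows; the per-test effort figure at the end is derived from the stated numbers, not quoted from the team:

```python
tables = 720
tests_per_table = 4
total_tests = tables * tests_per_table       # 2,880 manual tests

man_days = 792
resources = 6
elapsed_days = man_days / resources          # 132 working days with 6 testers
weeks = elapsed_days / 5                     # ~26.4 five-day weeks

hours_per_test = man_days * 8 / total_tests  # ~2.2 hours of effort per test
print(total_tests, elapsed_days, round(weeks, 1), round(hours_per_test, 1))
```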
Automated Databricks Migration Testing with iceDQ Tool
The Quality Engineering team evaluated iceDQ across several parameters and decided to automate the data migration testing. Their goal was to reduce the testing time from 26 weeks to 6 weeks and decrease the number of resources needed from 6 to 3.
- Once the data was copied to Databricks, iceDQ seamlessly connected across the Azure Synapse and Databricks environments.
- This enabled iceDQ to automatically compare data at the table, row, and column levels.
- As a result, it could certify 100% of the data with full-volume testing, ensuring complete accuracy.
- All of this was achieved in six weeks, a fraction of the 26-week manual estimate.
- For detailed implementation steps, please refer to the appendix on iceDQ; a simplified illustration of this kind of cross-platform comparison follows.
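The PySpark below illustrates the kind of full-volume, row-and-column-level comparison that iceDQ automated. It is not iceDQ's actual API; the connection details and table names are the same hypothetical placeholders used in the migration sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jdbc_url = ("jdbc:sqlserver://myworkspace.sql.azuresynapse.net:1433;"
            "database=dw;encrypt=true")                   # hypothetical source

# Source side: the original table still in Azure Synapse.
src = (spark.read.format("jdbc")
       .option("url", jdbc_url)
       .option("dbtable", "dbo.pharmacy_claims")          # hypothetical table
       .option("user", "qe_user")                         # hypothetical credentials
       .option("password", dbutils.secrets.get("qe", "synapse-pw"))
       .load())

# Target side: the migrated Delta table in Databricks.
tgt = spark.table("bronze.pharmacy_claims")

# Table level: row counts must match exactly (full volume, not a sample).
assert src.count() == tgt.count(), "row count mismatch"

# Row and column level: every value must match in both directions.
cols = sorted(src.columns)
missing_in_target = src.select(cols).exceptAll(tgt.select(cols))
extra_in_target = tgt.select(cols).exceptAll(src.select(cols))
assert missing_in_target.count() == 0, "rows missing or altered in target"
assert extra_in_target.count() == 0, "unexpected rows in target"
```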
Key Benefits of iceDQ
- Fully Automated Large-Scale Data Testing: iceDQ automated the testing for 5.47 billion records, removing manual testing dependencies and ensuring 100% data coverage, accuracy, and reliability, while eliminating gaps in spot checks during migration.
- Reusability with Multiple Executions: Once rules were created in iceDQ, they could be re-executed many times, with minimal modifications covering roughly 90% of required changes, which was ideal for the dev-QA team’s iterative process.
- Out-of-the-Box Transformations: iceDQ provided configurable, out-of-the-box checks to validate complex transformation logic.
- User-Friendly Interface: The UI streamlined rule creation and testing, with features like rule duplication, improving efficiency and enabling faster test cycles without compromising data quality.
- Workflows: Rules were grouped into workflows, simplifying execution and enhancing test efficiency.
- Email Notifications: Automated email alerts were sent to designated users about test failures, ensuring timely action.
- Scheduling Capability: High-priority rules were auto-scheduled daily, enabling continuous monitoring and ensuring data integrity during migration.
- Performance Improvement: iceDQ consistently surpassed performance thresholds, ensuring robust validation cycles and quick feedback during migration.
- Test Evidence: The centralized platform provided real-time dashboards and exception reports, tracking testing progress and highlighting critical issues.
- Cross-Environment Reuse: Rules created in the staging environment were deployed to production, reducing rule creation time while improving consistency.
- Test Templatization: Templatized rules accelerated rule creation, allowing a team of three resources to create 2,880 rules in six weeks, instead of the projected twenty-six weeks with six resources.
The Result
With the testing approach implemented in iceDQ, the team saw a significant reduction in indirect migration costs, which dropped from approximately $450K to $55K.
Additionally, the migration testing required only three resources instead of six and cut approximately 20 weeks from the project timeline.
Conclusion
The seamless migration testing from Synapse to Databricks showcases the power of the strategic partnership between iceDQ and the pharmaceutical company. The synergy between the two teams significantly accelerated and enhanced the data migration testing process. Additionally, the company gained greater control and oversight of their testing workflows, boosting their ability to meet data migration requirements efficiently.
Appendix
- Language Variable Check: The language column at the source contained two-character codes, while the target used the full name of the language. To address this, multiple conversion statements were applied to map the source codes to the target format, based on the defined rules (illustrated in the sketch after this list).
- Reconciliation (Recon) Rule: Recon rules allow users to compare data between two data sources, row by row and column by column, to identify data issues. The data sources can be databases, files, or REST APIs. Users can use iceDQ to perform full volume testing or delta testing.
- Null Checks on Several Fields: NULL checks were applied to several fields to ensure that no values were lost when the data was migrated to the destination.
- Out-of-the-Box Checks: iceDQ provided predefined out-of-the-box checks, such as completeness, length, pattern, and date checks, to build validation rules.
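Combining the language variable and NULL checks above, here is a minimal PySpark sketch of what such validations can look like. The table names, the key column patient_id, and the code-to-name mapping are hypothetical stand-ins for the project's defined rules:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

src = spark.table("synapse_copy.patients")     # hypothetical source snapshot
tgt = spark.table("gold.patients")             # hypothetical migrated table

# Language variable check: convert the source's two-character codes to the
# target's full names via CASE-like conversion statements, then compare.
language_full = (F.when(F.col("language") == "EN", "English")
                 .when(F.col("language") == "ES", "Spanish")
                 .when(F.col("language") == "FR", "French"))
src_conv = src.withColumn("language_full", language_full)

mismatches = (src_conv.join(tgt, "patient_id")             # hypothetical key
              .filter(src_conv["language_full"] != tgt["language"]))

# Null checks: fields that must survive the migration intact.
null_violations = tgt.filter(F.col("patient_id").isNull() |
                             F.col("language").isNull())

print("language mismatches:", mismatches.count())
print("null violations:", null_violations.count())
```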