How to test an ETL Process - Basic Concept
The answer lies in the understanding of an ETL process. An ETL process at its core reads data, applies a transformation on it and then loads the data. This can be represented by the following simplistic equation.
Input Data + Transformation = Output Data
The above equation can provide us the information for creating physical tests to validate the ETL processes. I will use a simple example below to explain the ETL testing mechanism.
The above equation can provide us the information for creating physical tests to validate the ETL processes. We will use a simple example below to explain the ETL testing mechanism.
A source table has an individual and corporate customer. The requirement is that an ETL process should take the corporate customers only and populate the data in a target table. The test cases required to validate the ETL process by reconciling the source (input) and target (Output) data. The transformation rule also specifies that output should only have corporate customers.
Physical Test 1: Count the corporate customers from the source. Count the customers from the target table. If the ETL transformation is correct the count should be an exact match.
This is pretty good test however the count might misguide if the same record is loaded more than once as it cannot distinguish between each customer.
Physical Test 2: Compare each corporate customer from the source to the customer on target. This kind of reconciliation can be done at the row or attribute level. Hence, it will not only prove the validation of counts but will also prove that the customer is exactly the same on both sides.
There are various permutations and combination of these types of rules including advancing complexity. Also, this example proves that the concepts for ETL testing are totally different from that of GUI based software testing.
As discussed earlier in the article – ETL testing Vs. Application Testing, ETL process are background jobs that cannot be tested with conventional QA tools. They need specialized software. In the drawing below, we have a bunch of ETL processes that are reading, transforming and loading customer, orders and shipment data. We will take these examples and then create test cases and rules in iCEDQ to certify the processes. The examples below will also clear the thought process and the principles behind ETL testing.
This process is loading customer data from a file to a staging table (CUSTOMER STAGE).
See Test Rule 1 and Test Rule 2.
This process is loading sales transaction data from OMS (Source B) into table SALES TRANSACTION.
This process is loading product rating data from OMS (Source B) into table PRODUCT RATING.
Product rating data is also sourced from an additional data source C (Market Data Vendor) into another PRODUCT RATING table.
An ETL process is loading data the table SHIPMENT from a Shipment System (Source D).
This process is reading data from CUSTOMER STAGE table and population into CUSTOMER DIM table.
This ETL process is more complex than other ETL process. It is reading data from SALES TRANSACTION table and populating ORDER and DAILY SALES SUMMARY table.
- It must do lookup from PRODUCT table to get the attribute Product Name.
- It must summarize sales transaction data to get summary of daily quantity and dollar amount.
Ensure that the incoming data has valid data and data is in correct format.
This data validation of customer file is needed to ensure that ETL1 gets correct data.
This rule tests if the data in the source is valid. In the customer file:
- Make sure “Name” attribute is NOT NULL. If customer name is missing the record might not be useful.
- Date of Birth format is in YYYYMMDD format. The data is stored in a file; hence, it is possible that the source system did not send the data in correct format.
Even though the downstream ETL processes are not responsible for incoming upstream data from source system, still it is important to validate the source data because;
- The source data content and format do not confirm the requirement specifications
- Source systems accidently or knowingly changed the way data was sent.
- Find Data issues early in the flow and project time line.
- It will prevent ETL process from populating incorrect data downstream.
- Some errors in data can break the processes in production. If the ETL developer is aware of the issues he can either skip the data or modify the ETL process to handle the exception.
- Often the ETL developers or the data warehouse managers are blamed for the data issues, even if they are not responsible for it. Developers and managers spend a lot of time analyzing and defending issues that are not of their creation. Source data Validation will avoid most of these situations.
Ensure data is copied (transported) from system A to system B correctly.
Source target reconciliation will certify that the ETL1 has not dropped data or added extra data in the processes of copying the data from file to Stage.
This rules teste if the ETL Process (1) has loaded the source data correctly to target. In this case CUSTOMER FILE is loaded in staging area.
- Make sure row counts matches on both sides. Many times, process might have run but loaded nothing. Same data might be loaded twice.
- The attributes on each row matches between source and target. Even though the records count matches the data at attribute level might not be same.
- If applicable data is ﬁltered correctly.
Usually such kind of tests need to connect across two different systems. In this case its between file server and database. It is impossible to implement this rule without an iCEDQ’s ETL Testing and Automation platform. Consider the alternative:
- These tests are not possible are very expensive to do manually
- In the absence of iCEDQ, the level of effort required for custom programming will be in multiple weeks and not very good
- Big data volumes on Hadoop, HDFS files cannot be dealt with conventional approach
- Support of different file formats such as Avro, Json, Parquet is not possible
Raise alarm, if same data coming from two different sources do not match.
This ETL Testing rule makes sure that the data processed by ETL3 and ETL4 from two diﬀerent system is within acceptable tolerance.
Product Ratings data is delivered by both source B (Order Management System) and source C (Market Data Vendor). Since the products are same
- Each product from both systems should have same rating.
- Report any mismatches
A data validation rule can be setup to notify data that is outside the expected range and it can work. However, if the product rating data is important for business and there is another source that can provide data.
- It is highly unlikely that both the source systems are incorrect for the same product at the same day. Hence the data reconciliation between two source works.
If another source is not available, then data reconciliation can be done with previous days data. This can capture sudden changes or if the previous days data is sent again by the source system.
Validate the data independent of the ETL process but based on business rules.
Validate if the data is following the business rule. These rules are independent of ETL processes. Ex :
Audit if the
The order table is populated by ETL2 process. And it can be very complicated. However, a simple rule that the business is aware can suggest a data or processing issue.
- Net Amount =? Gross Amount – (Commission Amount + Tax Amount + Fee Amount)
But why rely on the specification given by the ETL process.
- Empower business users to apply their understanding of business to create these rules
- The specification document cannot be the only source to verify the data. As the specification can also be wrong
The ETL process might be correct but the source it self might be providing wrong data
Reconcile the data independent of ETL processing logic but based on Business Rules.
Audit if the data between two business areas is consistent and reconciles correctly. These Audit rules are independent of the ETL processes.
ETL 2 is populating data in Sales Transaction table and ETL 5 is populating data in Shipment table.
- The ETL testing rules verifies if there are any shipment for which sales transactions data does not exist
- Since for a Transaction, there could be multiple shipment. The rule can sum the shipment quantity and verify with the transaction quantity
Regardless of the system is used to process data or store data the business rules are universal.
- ETL 2 and ETL 5 both be could be correct, but what if the orders data was never sent to the ETL processes?
- This kind of situations can never be caught in ETL testing based on specifications of the ETL process.
- An Independent business person would immediately know that for Shipment I need to have an order
- Similar approach can be applied to insurance data. Ex. Claims vs. Policy
- These can also be applied between MDM systems and Transactional systems
Reconcile the data based on entity relationship and independent of the ETL processing or data source.
Audit if the factual data and reference data reconciles.
Product table exists and doesn’t have an ETL process populating it. Probably is static table that get populated manually occasionally.
- Check Foreign Key for Product ID between product dimension and Shipment fact
- Database constraints are not always physically implemented in a database
- The ETL processes are usually designed to populate specific tables. It is possible that two ETL processes populated two tables without the Foreign Key check.
It is possible that Master data or Reference data is populated manually in the data warehouse. In such cases, if the operational system starts sending transactions then there could be data issues if the ETL process does not validate the FK.
Reconcile data structures independently.
This reconciliation makes sure that the physical schema matches between two systems.
- It verifies data structure as compared to a known state
- Used to verify structures between QA and PROD environment
- Watch for changes in structure of incoming files
- Table structure can be changed accidentally or changed intentionally without notification
- During data migration projects special focus is needed to verify the compatibility of data structures between legacy systems and new system
As we see from the discussion above even though we call ETL testing; it is not just about ETL testing. An understanding of ETL, Data Structures as well as business context is necessary to effectively do ETL testing. Any delinquency on behalf of the tester or absence of a capable audit rules engine can and will cause havoc in your data centric projects.
- A Practical Guide for Data Centric Testing: Automated ETL Testing
- Overcome Data Testing Challenges
- Agile Data Warehouse Testing & Data Migration Testing
- Migrating Database to Redshift, Snowflake, Azure DW and Test with iCEDQ
- Data Migration Testing Techniques to Migrate Data Successfully
- The Data Migration Process & the Potential Risks
- ETL Development & ETL Testing – a Pipeline for Data Warehouse Testing
- ETL Testing and Data Quality Governance Software - The Missing Link
- DataOps Implementation Guide
- AML Software Implementation & Production Monitoring with iCEDQ DataOps Platform
- What Are The Challenges Of A Data Factory
- 3 Reasons Why You Need to Perform ETL Testing
- ETL Testing - Unit Testing vs. Quality Assurance for Data Warehouse
- ETL Testing Vs. Application Testing - The Fundamental Difference