With the exponential growth in data, there is a surge of data centric projects. And these data projects require testing as a part of DataOps practice.
What is Data Testing?
Data Testing is a QA activity performed by testers in a non-production environment of a data-centric system involving a large volume of data.
The primary goal of data testing is to ensure that the data processes are certified before deploying them into production. Here are some of the systems where data testing is required:
- Customer Relationship Management – CRM
- Master Data Management – MDM
- Data Warehouse /Data Marts – DWH
- Reference Data Management – RDM
- Cloud Data Migration Projects
- Big Data Projects
- Data Lake Projects
Data Testing Types & Methods
There are various types of data testing. Depending on the use cases they vary slightly but the fundamentals remain the same.
ETL Testing |
ETL Testing is a process to verify that a data process is correctly extracting, transforming, and loading the data according to the specified requirements. It is done by validating and/or comparing the generated output with the input data. |
Data Warehouse Testing |
Data Warehouse Testing is the process of verifying data loaded into a data warehouse to ensure it meets the business requirements. This is done by certifying data transformations, integrations, the execution, and scheduling order of various data processes.
Data warehouse testing is an extension of ETL testing but with a few caveats. While ETL testing focuses on certifying individual processes, data warehouse testing places additional emphasis on:
|
Data Migration Testing |
Data Migration Testing is the process of verifying data from a legacy system or database to a new system. There are typically two types of data migrations:
|
Big Data Testing |
Big Data Testing is essentially data testing, but at a petabyte and terabyte scale in a big data ecosystem. Because of the volume of data, there are many restrictions on data movement, and it requires scalable cluster-based processing. |
Cloud Data Testing |
As companies move data to the cloud. Cloud data testing needs the ability to connect to cloud and work natively in the AWS, Azure or the GCP cloud environments. |
Data Lake Testing |
Data lake testing involves large amounts of structured, semi-structured, and unstructured data. Also, there are no size limits. The data is also stored under high security and is encrypted. This creates additional testing challenges. |
BI Report Testing |
BI Report Testing is used to validate and verify the data in BI reports and dashboards. The report’s accuracy is confirmed through data validation against business rules and data source reconciliation with the database. |
Data Testing Techniques
Data testing essentially requires that a data process runs and then output generated by the data process is compared with the expected data. Data testing generally involves two fundamental techniques, data validation and data reconciliation. To get expected values:
- Static Values: There are predetermined values the users expect. These are essentially predetermined static values.
- Dynamic Values: Calculated values based on the data feed and the data transformation specifications. Such values are dynamic since the input keeps changing.
For more details, please refer to ETL Testing.
Data Testing Solutions
A data testing solution must have the following capabilities:
-
- Data Testing: Almost all applications, including mobile apps, involve data. But here we are referring to is not a few transactions but millions or billions of records.
- Input data: While data processing does the transformation, if the input data itself is not correct then the data output will also be affected. So, it is essential to verify the quality of input data while in the development phase itself.
- Reference Data: Reference data in systems like account types and employee types drive a lot of processing logic. Hence it is critical to ensure that the reference data values are exactly as expected.
- Referential Integrity: Compare the parent and child table for values that are in child table but not in parent table.
- Data Processes Testing: Data-centric systems usually have data processes often called ETL or data pipelines. These processes extract, transform and load millions of records in batch or in real time. If these processes are not developed or tested properly, it will greatly impact when the jobs are deployed in production.
- Data Process Orchestration Testing: Data pipelines are not standalone processes. But in fact, work in synchronization with many other tasks.
- Process Schedule Testing: Tests to ensure tasks are executed based on predetermined schedules and events.
- Process Dependency Testing: Any mistake in the execution order will corrupt the data. Hence it is essential to test the process dependencies before deploying and running them in production.
- Data Models and Schema Testing: The data is held in a data model. For example, developers can accidentally change column data types or length. This will affect the data populated in the system.
Hence the data testing solution usually has capabilities to test and verify the model based on reference schemas. - BI Report Testing: While data in tables is commonly understood, data is also pushed into reports. Just because data in the database is correct does not guarantee the accuracy of the generated report. The data testing solution must also have the capabilities to test reports.
- Data Testing: Almost all applications, including mobile apps, involve data. But here we are referring to is not a few transactions but millions or billions of records.
Manual vs Automated Data Testing
Manual Data Testing involves sampling and visual inspection of data. However, manual data testing is impossible to do because of the inability of humans to process data beyond a few records. Developers might build some scripts to augment some of the testing activities. They try to build data in excel but it is not scalable or repeatable. Usually, they end up doing some sampling or limiting the scope of test cases.
The flaws in manual data testing are:
-
-
- Limit on records.
- Data sampling.
- Not repeatable due to the Ad hoc nature of testing.
- Requires skilled worker.
- Limited use cases.
- Lacks scalability.
- No reusability.
-
Automated data testing on the other hand is based on program templates. There are literally no limits on data volumes. As the data scales so does the computing power. Hence, there is no need to do data sampling. The data testing software has the capabilities to go across environments that enables comparisons.
What is the difference between Data Testing and Application Testing?
A web or mobile application usually has screens with which a user interacts. On the other hand, the systems involving data processing almost never have any screens with which a user interacts. In data testing, large data volume and the ETL processes are the core that drives everything. Application testing is usually focused on UI. But what about testing data-centric systems involving large amounts of data processing? Instead of screens, they have data processes running in the background.
Here’s a summary of the differences.
Application Testing | Data Testing | |
Project Types |
Application testing spans a wide spectrum of web apps and mobile apps. |
On the other hand, data testing zeroes in on projects like data migration, data |
Testing Objective and Focus | Application Testing addresses everything from user interface intricacies to scripting, APIs, functions, and code integrity. |
For data testing, the emphasis is on ETL/data processes, process orchestration, and unique attention to data integrity sets it apart as a specialized discipline. |
Data Volume |
Application testing spans various dimensions, one of them being data. But in the |
However, data testing puts a spotlight on the critical nuances of data, |
Certification |
In application testing the certification focus is on code integrity. |
Data testing is essentially designed to certify data integrity. |
Expected vs. Actual |
Application testing compares the actual behavior of user interfaces and scripts |
Data testing navigates the complex terrain of data integrity, migration |
Performance Testing |
In application testing the focus is on the speed at which the UI or the |
For data testing the performance is usually calculated by rows processed per |
Employee Skillsets |
Both processes demand a skill set that combines technical acumen and a deep |
In contrast, data testing necessitates expertise in handling data sources and |
Testing Tools |
Application testing often employs tools like Selenium and JMeter. |
Data testing leverages specialized tools like iceDQ for |
Conclusion
With the advent of data centric projects, it has become critical for enterprises to focus on data testing during the development phase itself. If the data process is not tested as part of quality assurance, then there will be many more defects in production data. Also, there is an exponential increase in the volume of data as well as complexity of data processing; these factors necessitate automated data testing.