Data Testing

With the exponential growth in data, there has been a surge of data-centric projects, and these projects require testing as part of the DataOps practice.

What is Data Testing?


Data Testing is a QA activity performed by testers in a non-production environment of a data-centric system involving a large volume of data.

The primary goal of data testing is to ensure that the data processes are certified before deploying them into production. Here are some of the systems where data testing is required:

  • Customer Relationship Management – CRM
  • Master Data Management – MDM
  • Data Warehouse / Data Marts – DWH
  • Reference Data Management – RDM
  • Cloud Data Migration Projects
  • Big Data Projects
  • Data Lake Projects

Data Testing Types & Methods

There are various types of data testing. They vary slightly depending on the use case, but the fundamentals remain the same.

ETL Testing

ETL Testing is a process to verify that a data process is correctly extracting, transforming, and loading the data according to the specified requirements. It is done by validating and/or comparing the generated output with the input data.
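As a minimal sketch of this idea, the snippet below runs a hypothetical transformation (the trimming and unit-conversion rules are illustrative, not from any specific tool) and then certifies the output by recomputing each value independently from the input:

```python
# ETL-testing sketch: run a (hypothetical) transformation, then reconcile
# the generated output against the input data per the specified rules.

def transform(rows):
    """Illustrative ETL step: trim/uppercase names, convert cents to dollars."""
    return [{"id": r["id"],
             "name": r["name"].strip().upper(),
             "amount": r["amount_cents"] / 100} for r in rows]

source = [
    {"id": 1, "name": " alice ", "amount_cents": 1250},
    {"id": 2, "name": "bob",     "amount_cents": 99},
]
target = transform(source)

# Check 1: no rows lost or duplicated during the load.
assert len(target) == len(source)

# Check 2: each output value matches the transformation rule,
# recomputed independently from the input data.
for src, tgt in zip(source, target):
    assert tgt["name"] == src["name"].strip().upper()
    assert tgt["amount"] == src["amount_cents"] / 100
```

In practice the same two checks (row-count reconciliation plus rule-by-rule value reconciliation) are applied across millions of records rather than two.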

Data Warehouse Testing

Data Warehouse Testing is the process of verifying data loaded into a data warehouse to ensure it meets the business requirements. This is done by certifying data transformations, integrations, and the execution and scheduling order of various data processes.

Data warehouse testing is an extension of ETL testing but with a few caveats. While ETL testing focuses on certifying individual processes, data warehouse testing places additional emphasis on:

  • Reference data
  • Data pipeline orchestration
  • Data integration

Data Migration Testing

Data Migration Testing is the process of verifying data migrated from a legacy system or database to a new system. There are typically two types of data migrations:

  1. The data model has minimal or no changes; only the database vendor or version changes.
  2. The data model between the legacy and new system is drastically different, but the business model is similar or the same.
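For the first type of migration, where the data model is unchanged, reconciliation can be as simple as comparing row counts and row contents between the two databases. The sketch below uses two in-memory SQLite databases as stand-ins for the legacy and new systems; the table and column names are illustrative.

```python
import sqlite3

# Migration-reconciliation sketch: compare row counts and row contents
# between a legacy table and its migrated copy.

legacy = sqlite3.connect(":memory:")
new = sqlite3.connect(":memory:")
for db in (legacy, new):
    db.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
    db.executemany("INSERT INTO customers VALUES (?, ?)",
                   [(1, "a@x.com"), (2, "b@x.com")])

def snapshot(db):
    count = db.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
    rows = set(db.execute("SELECT id, email FROM customers"))
    return count, rows

# The count check catches dropped or duplicated rows; the row-set
# comparison catches silently corrupted values.
assert snapshot(legacy) == snapshot(new)
```

For the second type, where the data model changes, the expected values must instead be derived by applying the mapping rules to the legacy data before comparing.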

Big Data Testing

Big Data Testing is essentially data testing, but at terabyte and petabyte scale in a big data ecosystem. Because of the volume of data, there are many restrictions on data movement, and it requires scalable, cluster-based processing.

Cloud Data Testing

As companies move data to the cloud, cloud data testing needs the ability to connect to and work natively in AWS, Azure, or GCP cloud environments.

Data Lake Testing

Data lake testing involves large amounts of structured, semi-structured, and unstructured data, with no practical limits on size. The data is also stored under high security and encrypted, which creates additional testing challenges.

BI Report Testing

BI Report Testing is used to validate and verify the data in BI reports and dashboards. The report’s accuracy is confirmed through data validation against business rules and data source reconciliation with the database.
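A simple form of report reconciliation is to recompute a figure shown on a report directly against the source database. The sketch below uses an in-memory SQLite table and a hard-coded "report value" as stand-ins; in practice the report value would be scraped or exported from the BI tool.

```python
import sqlite3

# BI-report reconciliation sketch: certify a report figure by
# recomputing it against the database it was built from.

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?)",
               [("EU", 100.0), ("EU", 50.0), ("US", 75.0)])

# "EU revenue" figure as displayed on the report (illustrative).
report_value = 150.0

db_value = db.execute(
    "SELECT SUM(amount) FROM sales WHERE region = 'EU'").fetchone()[0]

# The report is certified only if it reconciles with the database.
assert report_value == db_value
```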

Data Testing Techniques

Data testing essentially requires running a data process and then comparing the generated output with the expected data. It generally involves two fundamental techniques: data validation and data reconciliation. Expected values come from two sources:

  1. Static Values: Predetermined values that the users expect, fixed ahead of time.
  2. Dynamic Values: Values calculated from the data feed and the data transformation specifications. Such values are dynamic since the input keeps changing.
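The two kinds of expected values can be sketched as follows. The feed, the status codes, and the "daily total" rule are illustrative assumptions, not from any specific system.

```python
# Two ways to obtain expected values for a data test.

feed = [{"amount": 120}, {"amount": 80}, {"amount": 300}]

# 1. Static expected values: fixed ahead of time, independent of the
# input feed (e.g. the only status codes the business allows).
allowed_statuses = {"OPEN", "CLOSED"}
record_status = "OPEN"                      # value produced by the process
assert record_status in allowed_statuses    # data validation

# 2. Dynamic expected values: recomputed from the input feed according
# to the transformation spec, so they change whenever the input changes.
produced_total = 500                        # output of the data process
expected_total = sum(r["amount"] for r in feed)
assert produced_total == expected_total     # data reconciliation
```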

For more details, please refer to ETL Testing.

Data Testing Solutions

A data testing solution must have the following capabilities:

    1. Data Testing: Almost all applications, including mobile apps, involve data. But here we are referring not to a few transactions but to millions or billions of records.
      • Input data: While the data process does the transformation, if the input data itself is incorrect then the output will also be affected. So it is essential to verify the quality of input data during the development phase itself.
      • Reference Data: Reference data in systems like account types and employee types drive a lot of processing logic. Hence it is critical to ensure that the reference data values are exactly as expected.
      • Referential Integrity: Compare the parent and child tables for values that exist in the child table but not in the parent table.
    2. Data Processes Testing: Data-centric systems usually have data processes, often called ETL or data pipelines. These processes extract, transform, and load millions of records in batch or in real time. If these processes are not developed and tested properly, it will have a great impact once the jobs are deployed in production.
    3. Data Process Orchestration Testing: Data pipelines are not standalone processes; they work in synchronization with many other tasks.
      • Process Schedule Testing: Tests to ensure tasks are executed based on predetermined schedules and events.
      • Process Dependency Testing: Any mistake in the execution order will corrupt the data. Hence it is essential to test the process dependencies before deploying and running them in production.
    4. Data Models and Schema Testing: The data is held in a data model. Developers can, for example, accidentally change column data types or lengths, which affects the data populated in the system.
      Hence a data testing solution usually has capabilities to test and verify the model against reference schemas.
    5. BI Report Testing: While data in tables is commonly understood, data is also pushed into reports. Just because data in the database is correct does not guarantee the accuracy of the generated report. The data testing solution must also have the capabilities to test reports.
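The referential-integrity check described in point 1 above is essentially an anti-join: find child-table values with no matching parent row. A minimal sketch, with illustrative table contents:

```python
# Referential-integrity sketch: flag child rows whose foreign key has
# no matching row in the parent table (an anti-join).

parent = [{"account_id": 1}, {"account_id": 2}]
child = [{"txn": 10, "account_id": 1},
         {"txn": 11, "account_id": 3}]   # account_id 3 has no parent

parent_ids = {r["account_id"] for r in parent}
orphans = [r for r in child if r["account_id"] not in parent_ids]

# A passing test would require orphans to be empty; here the check
# correctly surfaces transaction 11 as an orphan.
assert [r["txn"] for r in orphans] == [11]
```

In SQL the same check is typically expressed as a LEFT JOIN filtered on NULL parent keys, or a NOT EXISTS subquery.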

Manual vs Automated Data Testing

Manual Data Testing involves sampling and visual inspection of data. However, manual data testing is impractical because humans cannot process more than a few records. Developers might build scripts to augment some of the testing activities, or compare data in Excel, but these approaches are neither scalable nor repeatable. Usually, they end up sampling data or limiting the scope of test cases.

The flaws in manual data testing are:

      1. Limit on records.
      2. Data sampling.
      3. Not repeatable due to the ad hoc nature of testing.
      4. Requires skilled workers.
      5. Limited use cases.
      6. Lacks scalability.
      7. No reusability.

Automated data testing, on the other hand, is based on program templates. There are practically no limits on data volume: as the data scales, so does the computing power, so there is no need for data sampling. Data testing software can also connect across environments, which enables cross-system comparisons.

What is the difference between Data Testing and Application Testing?


A web or mobile application usually has screens with which a user interacts, so application testing is usually focused on the UI. Data-centric systems, on the other hand, almost never have user-facing screens; instead, they have data processes running in the background, and large data volumes and ETL processes are the core that drives everything.

Here’s a summary of the differences.

Project Types

Application testing spans a wide spectrum of web apps and mobile apps. Data testing, on the other hand, zeroes in on projects like data migration, data pipelines, data warehouses, and big data.

Testing Objective and Focus

Application testing addresses everything from user interface intricacies to scripting, APIs, functions, and code integrity. Data testing emphasizes ETL/data processes, process orchestration, and data integrity, which sets it apart as a specialized discipline.

Data Volume

Application testing spans various dimensions, one of them being data, but its data involvement is limited to the few records created by a transaction. Data testing, in contrast, involves millions or billions of records.

Certification

In application testing the certification focus is on code integrity. Data testing is essentially designed to certify data integrity.

Expected vs. Actual

Application testing compares the actual behavior of user interfaces and scripts against the expected behavior. Data testing compares actual data against expected data across data integrity, migration accuracy, and big data processing.

Performance Testing

In application testing the focus is on the speed at which the UI or the underlying functions respond to a request, measured in microseconds. Data performance testing, on the other hand, operates in minutes and hours. Performance is usually measured in rows processed per second, computed from the time required to read, transport, process, and load data into the target database. Loading time is further broken down into update, insert, and delete speed.

Employee Skillsets

Both disciplines demand a skill set that combines technical acumen and a deep understanding of the tools at play. Application testing requires proficiency in user interface testing, scripting, code-level understanding of screen behavior, and tools like Selenium and JMeter. Data testing necessitates expertise in handling source and target data, SQL, data models, and reference data; a command of SQL is essential for effective data manipulation and validation.

Testing Tools

Application testing often employs tools like Selenium and JMeter. Data testing leverages specialized tools like iceDQ for comprehensive data quality assurance.

Conclusion

With the advent of data-centric projects, it has become critical for enterprises to focus on data testing during the development phase itself. If data processes are not tested as part of quality assurance, there will be many more defects in production data. Moreover, the exponential increase in data volume and in the complexity of data processing necessitates automated data testing.


Sandesh Gawande

CTO and Founder at iceDQ.
First to introduce automated data testing. Advocate for data reliability engineering.
