QA Challenges in Data Integration Projects

Download Whitepaper from here QA Challenges Data Integration Projects

Quality Assurance (QA) is a very important component of any data-centric application project.  Projects such as data warehouse, data migration, ETL, Data Lakes and MDM are no exception. The Majority of these projects are the multi-year and multi-million dollar in nature due to the amount of work and products required. Therefore it’s necessary to have proper planning for QA in place to avoid late discoveries of process and data errors.

While the methodologies of testing have evolved considerably over the years, the science of QA in data integration project has not. In this article, we’ll focus on some of the key challenges with data warehouse testing, data migration testing and ETL testing.

Let’s examine some of the QA challenges in Data Integration Projects:,

  1. Lack of automation
  2. Lack of Agility
  3. No Test Driven Development (TDD)
  4. Clutter
  5. Lack of Regression Testing

1. Lack of Automation

It’s obvious that manual testing is a very time consuming and error prone method. In data integration projects, there are many tasks that are difficult or humanly impossible to test manually.

For example:

  • In data integration projects, an ETL process is reading millions of records from a file and loading them into a database. To test this ETL process the data from the source files needs to be reconciled with the data in the target table. Only then will we know if the ETL process has executed successfully or not. However, even if we assume that there is little or no transformation, how will you compare data in a file with the data in a table?
  • Imagine testing thousands of tables during a system migration.  Is it even possible to test such a huge volume of data as well as an entity?
  • For any change, the testing has to be performed again which gets repetitive and hence errors are more likely to occur.  The main drawback of manual testing is having to work on a sample data set; large volumes can’t be tested for obvious reasons. A large amount of time is also wasted on coding the scripts rather than testing.

iceDQ software provides the ability to test ETL processes by validating the output data as well as reconciling Source data vs Target data.  And, with iceDQ’s In-Memory Rules Engine you can test the entire data set, which typically contains millions of rows. Additionally, the data could be in XML, Mainframe, RDBMS, Flat File or in Hadoop. iceDQ encapsulates the complexity as well as the native data schemas to make life easy.

Reconciliation rule - iceDQ

Fig 1: Validation Rule: Tests ETL process by validating output data  | Reconciliation Rule: Test ETL Transformations  by Reconciling Source data vs Target data

2. Lack of Agility

The waterfall approach works well if the process is going to be predictable and won’t change over the course of development. Sadly that doesn’t happen often in data-centric applications and the waterfall approach (due to its fixed operating style), can cause major issues in data integration projects for a number of reasons:

  • Long timelines of data integration projects force you to wait until the end to discover your mistakes
  • The requirements are constantly evolving
  • It’s better to test a story in smaller units of work rather than testing an entire project at once

A more flexible and smarter approach needs to be implemented in place of Waterfall. iceDQ follows an Agile approach which is adaptable and provides developers time to work while requirements are changing that makes it less error prone.

Agile datawarehouse development-iceDQ

Fig 2: With iceDQ in place, audit rules can be automated and linked to the specific requirements of the story

3. No Test Driven Development (TDD)

As mentioned above, the drawbacks of the Waterfall approach are a fixed operating style and will perform only when there are no changes in the process.

There are some basic issues with the testing:

  • The QA team waits for the development team to finish the code before beginning the QA process
  • Late involvement of the Business Users in the QA phase
  • QA team is too dependent on the developers rather than the business, to find the testing requirements

Test Driven Development is an evolutionary approach to development, which stresses a test-first approach.  Testing teams can start working at the beginning of the SDLC, work with the Business Users and gather QA requirements independently of the development stream.   They no longer wait for the development team to finish their job and then take over.  The QA process commences at the beginning of the SDLC as per the Test Driven Development, which can result in a 33% savings of time, thus finishing the project sooner.

No test driven development - iceDQ

Fig 3: By starting the QA process at the beginning of the SDLC as per the Test Driven Development result in saving of 33% of you time

4. Clutter

Current ETL testing requires quite a few manual steps, utilities, and human coordination.  In order to perform testing, the QA team typically uses MS Excel, SQL, and Shell Scripts. The sheer variety of utilities and people involved causes clutter and disorganization in the QA process.

For example:

  • Many desktop-based tools and utilities are used that were never designed for Data Warehouse testing or Data Migration testing
  • There is no central repository of rules
  • No execution logs or reports
  • No way for the users, BAs, Developers or the QA team to coordinate

Conversely, with iceDQ, you can plan out the complete process and design workflows, there is hardly any need for additional utilities. The system is well organized and properly documents all steps created in the platform.

iCEDQ Repository - iCEDQ

Fig 4: iceDQ software provides a well-organized as well as properly documented system

5. Lack of Regression Testing & Release sign off

Regression testing ensures that the older code still works with the new changes. With the conventional non- automated approach it’s nearly impossible to support regression testing.

  • The manual scripts or SQL are stored on the developer’s desktop which is virtually impossible to find at the time of regression testing.
  • The QA team isn’t able to find the scripts and SQL’s if the regression testing occurs after a few months. The QA team members may have been shifted, promoted, transferred or left the company by that time.
  • The details required to sign off might not be available for the Release Manager

With iceDQ, the audit rules for ETL testing, Data Warehouse testing, and Migration testing are stored in a centralized repository for future access. These rules can be recalled on demand, combined with new rules and executed in minutes. iceDQ provides detailed reports and execution logs that are necessary for the sign off of new releases. It further provides continuous build management support through it API’s, which ultimately results in less risk and higher quality.

iceDQ ddlc

Fig 5: iceDQ provides detailed reports, execution logs and build management support through it APIs which ultimately results in less risk and results in better quality and are really important for Sign Offs and Releases.

Sandesh Gawande - CTO iceDQ

Sandesh Gawande

CTO and Founder at iceDQ.
First to introduce automated data testing. Advocate for data reliability engineering.

Leave a Reply

Your email address will not be published. Required fields are marked *

Post comment