“Information is the oil of the 21st century, and analytics is the combustion engine”- Peter Sondergaard, Senior Vice President, Gartner

Big Data and Business Intelligence are becoming an increasingly important source of statistical information which is used as a vital part of the critical decision-making process of all businesses. Bernard Marr, in his article titled “4 Ways Big Data Will Change Every Business” reiterates the industry-belief that “big data and its implications will affect every single business -from Fortune 500 enterprises to mom and pop companies – and change how we do business, inside and out.”

Therefore, it is critical that Big Data technicians ensure that the data that is manipulated into statistical information is loaded into the data warehouse correctly. It goes without saying that any errors during the Extract-Transform-Load (ETL) process can and will have dire consequences for the business.

Automated ETL testing: The problem and the solution

The Problem

A large part of the ETL testing process is to compare the source data with the data loaded into the data warehouse. This process translates into checking the data across both platforms and row by row to make sure that the data has been correctly loaded. Users will only able a sample set of data using an Excel spreadsheet; it is impossible to compare all the data, row by row.

The modern software development style that follows the Agile methodology causes the ETL code to be built and deployed continuously. Therefore, it creates problems when it comes to regression testing as constant deployment requires continuous testing. As a result, the chance of existing functionality ceasing to function at optimal levels is very high because of the continual process of adding new functionality to the data warehousing modules.

Thereby, translating into the fact that the existing functionality needs to be retested to ensure that it is still returning the desired results. Again, it is virtually impossible to regression test the whole data warehousing software manually without employing substantial human resources.

Data-centric testing

Data-centric testing’s core function is to test the ETL processes to ensure that the data loaded into a data warehouse is accurately inserted into the data warehouse tables. The source data is typically not stored in the data warehouse; therefore, it has to be extracted from one or more source databases.

Secondly, the original data is not always in the correct format, so it has to be converted from its source format into the format that matches the data warehouse tables.

Thirdly, and finally, this data has to be manipulated from data into informational statistics that can be used by management for forecasting as well as making decisions that will drive the business forward and increase its overall revenue.

Thus, the purpose of data-centric testing, with particular reference to ETL testing, is to determine whether data is extracted from its source, massaged into the required format, and loaded correctly into the data warehouse environment.

The Solution

Thus, if there are any reasons to make a case for automated ETL testing, it is the fact that manual testing cannot test a broad enough sample of the data to be 100% satisfied that the ETL process will not break in the production environment. Nor can manual testing cope with the rigorous demands of regression testing in the short time frames allocated to the Agile software development lifecycle.

An ETL testing solution such as iCEDQ is not only the answer to automated ETL testing, but it also provides users with an exclusive testing engine that allows users to set up testing rules to compare and validate millions of rows automatically that are extracted from multiple source databases and loaded into the data warehouse. Additionally, it helps users identify specific rows and columns where there is a discrepancy between the original database table and the data warehouse table.

The set of rules to test each module in the ETL code are kept together in an individual test suite. Consequently, apart from the ability to accurately test large datasets consisting of millions of rows of data to identify any issues during the extract, transform, and load process, iCEDQ offers users the ability to regression test by combining all of the rules from the different test suites into one test suite.

Final thoughts

Business Intelligence is playing and will continue to play a fundamental role in how businesses are managed, what decisions are taken, and how key indicators such as sales figures, return on investment, overall growth, and net profit are portrayed. Finally, as the ability of database management and data warehousing systems to collect, store, and manipulate data is improved, the requirement for more sophisticated statistics will grow; thus, creating the need for automated ETL systems that can test greater amounts of data quickly, efficiently, and successfully.