Data-centric projects are becoming both bigger in size and complexity, which makes execution that much more difficult. This not only creates delays in project execution but also results in poor data quality. More and more projects are facing:
- Longer time to Market – The time required for projects is increasing, with many cloud data migration projects having multi-year timelines.
- Delayed or failed projects – Data teams are underestimating the complexity of the data projects resulting in last moment surprises as well as cost overruns.
- Poor Data Quality – Projects are delayed due to testing issues that are discovered too late in the project lifecycle.
- User dissatisfaction and Complaints – Data quality is an afterthought, resulting in high rates of user dissatisfaction.
- High Production Cost Fixes – Lack of test automation has resulted in lots of refactoring or patchwork in production.
- Testing on big data volumes – The large volumes have made is generally impossible to test the data manually.
- Regression testing nearly impossible – After the delivery of the project, code revision or ETL processes require complete regression testing. However, these concepts are missing in the data engineering side.
Costly Manpower – The manual and repetitive tasks are still not automated and either require manual work or custom coding, which often will take highly skilled talent off of other critical work.
While there are many macro and micro issues affecting the delivery of data engineering projects, the following are some of the underlying causes:
1. Siloed Teams
The team is usually divided into development, QA, operations and business users. In almost all Data Integration projects, development teams try to build and test ETL processes, reports as fast as possible and throw the code across the wall to the operations teams and business users. However, when the data issues start appearing in production, business users become unhappy. They point fingers at Operations people, who in turn point fingers at QA people. The QA group then puts the blame on the development teams.
|During Development of ETL and Release…|
|…and when data defects are found in production after release!|
2. Lack of Code Repository
ETL, Database procedures, schemas, schedules, and reports are not treated as code. In the early nineties, the ETL and Reporting tools came into existence and since they created custom ETL objects or Reports, they were not treated as code.
3. Lack of Data Management Repository
Configuration data, Reference data, and Test data are not managed. A data project requires test data, however, test data is not created in advance nor linked to the test cases.
Reference data is required to initialize the database. For example, default values for customer types must be created in advance so it doesn’t have any data source. If the reference data is missing, none of the ETL processes will work.
Configuration tables data must also be prepopulated. Some of the configuration data is used for incremental or delta processing. Some data values are used to populate metadata about the processes.
4. Lack of Test Automation
The way data processes (ETL) and reports are tested is very different than how software applications are tested. [link] In order to test, the ETL process is executed first and then the data is compared from the original to certify the ETL process. This is because the quality is determined by the expected vs actual. The actual data is the data added or updated by the ETL process and expected is the input data plus the data transformation rule(s).
5. Lack of Automated Build and Deployment
Since most ETL and Report developers use GUI /tools to create their processes, the code is not visible. The ETL tool stores the code directly into its repository. This creates a false narrative that since there is no code, there’s no need to manage, version or integrate. The majority of ETL tools now provide APIs to import and deploy the code into different environments, the functionality of which is often ignored.
6.Lack of Agile & Test-Driven Development (TDD)
While data transformation rules are provided to developers, the business doesn’t share testing and monitoring requirements during development. Once the developers have completed development, only then the focus shifts to testing. This is now late in the process and quite often this is when users start complaining. At this late stage is the time when data monitoring issues are considered.
7. Lack of Whitebox Monitoring
Data Quality and governance is an afterthought. Developers neither seek nor integrates hooks into their data process to monitor the data quality once it’s in production. When the system goes live, there is nothing available for operations to certify the data.
8. Lack of Regression Testing
After the system goes live, if any data issues are found, the development team must go back and fix the code. This creates a big testing challenge in order to complete regression testing since previous/older test cases must be considered to test the ETL flow. If they’ve not used a test automation tool that stores the rules in a repository, nothing will exist.
Many of the problem statements defined above are already solved in the software development world, implementing concepts such as Agile Development, CICD, Test Automation, and DevOps. It’s time the data world borrows some of these ideas and adopts them in the data world as well.
DataOps is the application of Agile Development, Continues Integration, Continues Deployment, Continuous Testing methodologies and DevOps principles, with the addition of some data specific considerations to a data-centric project. It could be any of the data integration or data migration projects such as data warehouses, data lakes, big data, ETL, Data Migration, BI Reporting, and Cloud Migration.
To implement DataOps the organization must focus on three things:
- People and their culture
- Defining standard practices and processes.
- Automation testing & Monitoring tools
A. Identify the people and their culture
In a data project, there are many types of resources. However, their roles also define their boundary. Developers, testers on one side of the wall and business users, operations data stewards are on the other side.
DataOps is about removing this wall and the first cultural change required for DataOps is to:
- Tell the development team that they are responsible for data quality issues that will appear in production environments.
- Tell the business users it’s their responsibility to provide the data transformation requirements as well as Audit Rules for Validation and Reconciliation of data.
This small change will ensure the developer involves the business users and data stewards right from the beginning of the project. DataOps adoption results in a transformation of organizational culture, automating every aspect of SDLC from test automation to production data monitoring. Beyond that, DataOps results in a culture shift, which removes the barriers between development and operations teams that are broken. There is no more throwing over the wall and running away from the responsibilities.
|DataOps Transforms the Culture of the Organization.|
|With DataOps everyone is on the same side of the wall!|
Now, instead of sequential steps, developers can create the design and develop the tests in parallel to the development of the data pipeline. By using Non-Linear timelines Time-to-Market is now 33% faster.
B. Get the automation tools for DataOps
DataOps in not possible with proper automation tools. The organization must acquire multiple software platforms to support DataOps, such as:
- Code Repository, Ex. Git
- QA software for Data Test Automation, Ex. iCEDQ
- Test Data Repository, Ex. Stored in dedicated database or file server
- CICD software, Ex. Jenkins
- Production Data Monitoring Software, Ex. iCEDQ
- Issue management software, Ex. Jira, ServiceNow
The idea is to continuously integrate, deploy, test and monitor the data and processes in an automated fashion. The purpose of each tool will be clearer with the process diagrams in the section below.
C. Define the DataOps Practice
Requirements process, development process, data testing process, test data management, production data monitoring and defect tracking. Assuming people and the tools are in place.
a. Develop and Integrate in a Code Repository
|Store the code in a repository. The main requirement is to ensure the code is accessible to some automation tool. The code in data-centric projects is a combination of ETL code, BI report code, scheduler/orchestration code, database procedures, database schema (DDL) and some DML. Both ETL and reports code must be captured and stored in some repository.|
|The test automation repository should consist of test cases, regression packs, test automation scripts and production data monitoring rules. This test can be called on- demand by a CICD script. This will ensure all testing and monitoring rules are stored and accessible in the future.|
DataOps has a very special data component required for configurations, test and database initialization. Eventually, the ETL process will collect data, the database will store it and the reports will show it, when the system is fully live.
Whenever new data processes are added or updated, usually some data must be prepopulated into the database e.g., it could be reference data, test data or configuration data. Configuration data could be dates required for incremental loads.
b. Implement CICD Pipeline
DataOps changes both the culture as well as the processes.
- Continuous Integration – In the previous section, it’s clear that call code must be stored in some repository and available for DevOps automation. With code, it becomes easy to manage various code branches and versions. Based on the release plan, the code can be selected and integrated with the help of CICD tools like Jenkins.
- Continuous Deployment – The integrated code is pulled by Jenkins and deployed with the help of API’s of command line import and export utilities. Depending on the code type, the code is pushed to a Database, ETL, Reporting platform. Further, CICD tools will also deploy initialization data in the database. This will create the necessary QA or production environment which is ready for further execution.
- Initialization Tests – Once the environment is ready with code and data, the CICD tool will execute iCEDQ rules to validate the data structures (database objects, tables, column, datatypes, etc.) as well as initial data.
- ETL/Report Execution – The next step for the CICD tool is to execute the scheduler to orchestrate the execution of the ETL process and reports.
- ETL/Report Testing – Once the data is loaded by ETL and reports are executed, iCEDQ can run the test and verify the validity of both the ETL as well as report quality. This is unique to DataOps because without first executing the ETL or the reports, there is no way to do the data testing.
- Production Monitoring – Once the system is live, the hooks left by the development and QA team will be used for monitoring the production systems, which is also sometimes referred to as white box monitoring. The business also benefits as they now have hooks (testing rules) developed by QA teams available to monitor the production data pipeline on an ongoing basis.
- Once the system is online and running based on the schedules, the Audit Rules in iCEDQ will also start running
- When iCEDQ notices any discrepancies in the data it will identify the specific data issues and raise alerts.
- The Issue logging system can then be used as a source of changes in the data pipeline or a simple update of the data.
- If there is a change in the code due to defects found in the data or a new business requirement is discovered, the DataOps cycle repeats again.
Why DataOps with iCEDQ results in better Data Quality?
One of the direct impacts of DataOps is the improvement of data quality for the data pipeline. There are three core reasons for this:
- Cultural Changes
- One of the concepts of us versus them is gone. The developers are responsible for the quality of data in production.
- Business Users are involved early and forces them to provide business requirements as well as data testing and monitoring requirements.
- All these checks and validations are added in iCEDQ and can be further used to test and/or monitor the data in production.
- Automation of Testing
- DataOps results in test automation which can improve productivity by 70% over manual testing.
- Now that the tests are automated, the test coverage can improve by more than 200%. Some tests are time-consuming if done manually, however, with DataOps automation, there are no such limitations. There can now be an increase in both the number of tests as well as the complexity of the tests that can be run.
- The cost of production monitoring and refactoring of code is reduced as more defects are captured early in the life cycle of the data pipeline.
- Test and monitoring automation also enable regression testing. The testing and monitoring rules are stored in the system and can be recalled as needed during the regression tests.
- Production Monitoring
- Some of the tests created during development and QA in iCEDQ are reused in the Production environment to monitor the data.
- Automation of monitoring also removes the limits on the volume of data that can be scanned. Organizations can move from sampling data to big data without any issues. With its Big Data edition, iCEDQ platform can monitor the production data without data volume constraints.
- The iCEDQ rules can be embedded in the ETL batch or rules can also run periodically with its built-in scheduler.
- iCEDQ notifies the workflow or ticketing systems whenever there is a data issue.
DataOps is all about reducing data organization siloes and automating the data engineering pipeline. CDOs, business users, data stewards are involved early in the data development life cycle. It forces the organization to automate all its processes, including testing. The data quality tasks are now implemented early in the project life cycle. This provides enormous benefits to the data development team as well as operations and business teams with data issues occurring in production environments.
- Faster Time-to-Market
- Improves Data Quality
- Lowers Cost Per Defect