Why we developed iceDQ

Data Reliability Engineering

The Data Factory Concept

Data-centric projects and systems, such as data warehouses, big data platforms, CRM, and data migrations, are like a factory: raw data is ingested and then, in a series of orchestrated steps, cleansed, transformed, and integrated to produce the final product, the data. This is very similar to an assembly line in a manufacturing plant.

The Final Data Quality Fallacy

Conventionally, enterprises focus solely on inspecting the final data delivered in production. However, data issues often stem from prior missteps made during development, testing, and operations. By the time defects are found in production data, it's too late.

  1. Bad data is already integrated, hence almost impossible to undo.
  2. Downstream business has already consumed the bad data and is now impacted.
  3. Triaging and fixing data issues in production incurs high costs and downtime.
  4. Business continuity is disrupted, impacting both the business and its customers.
  5. The organization is exposed to regulatory, compliance, reputational, financial, and legal risks and penalties.

Digging Deeper

Data defects in production often have their root cause much earlier, in the development and operations of the data platform. In the data factory analogy, organizations navigate two crucial phases before delivering the final data: building the data factory and running it effectively. Mistakes at any point in this lifecycle directly impact the final data quality.

Phase 1, Build the Data Factory: In this phase, requirements are collected, processes are coded, and the data pipelines are orchestrated. Data testing in the build phase is needed to capture key issues:

  1. Incorrectly captured requirements
  2. Bugs in data processing code
  3. Scheduling and orchestration errors
  4. Unhandled corner and edge cases
  5. Deployment issues

Phase 2, Run the Data Factory: Once the data factory is assembled, it starts receiving data and all the processes run in an orchestrated manner. Apart from defects carried over from the build phase, additional data errors are introduced because of a lack of operational checks and controls:

  1. Poor quality of raw input data
  2. Scheduling delays
  3. Unauthorized configuration or code changes

Phase 3, Final Production Data: Carry-over defects missed in the prior two phases, along with ongoing challenges, can still cause defects in the final production data.

  1. Defects that escaped detection in the earlier two phases can surface later.
  2. Evolving business processes, rules, and reference data can introduce new errors in data.

We’ve seen that just checking data quality at the very end isn’t enough. Instead, a more comprehensive approach called data reliability engineering is needed.

Data Quality vs. Data Reliability Engineering

DQ – Data Quality

Data Quality is a measure of the data at a single point in time.

DR – Data Reliability

Data Reliability is consistent data quality over time.

DRE – Data Reliability Engineering

Data Reliability Engineering is the practice of integrating people, processes, and products to deliver reliable data.

The 3Ps of Data Reliability Engineering

Successful implementation of DRE requires people, processes, and the iceDQ platform across all phases of the data development life cycle.

How to Implement DRE?

Automate Data Testing

One of the first steps towards DRE is to automate data testing, ensuring that each ETL process is tested before it is deployed into production.
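
A minimal sketch of such an automated test is shown below, written in pytest style with SQLAlchemy; the connection strings, table names, and columns (orders, fact_orders, amount) are illustrative assumptions rather than a prescribed setup.

```python
# A minimal sketch of an automated post-ETL reconciliation test (pytest style).
# Connection strings, table names, and column names are illustrative assumptions.
import sqlalchemy as sa

SOURCE = sa.create_engine("postgresql://user:pass@source-db/sales")
TARGET = sa.create_engine("postgresql://user:pass@warehouse-db/dw")

def scalar(engine, query):
    """Run a query and return its single scalar result."""
    with engine.connect() as conn:
        return conn.execute(sa.text(query)).scalar()

def test_row_counts_match():
    # Every source order should land in the warehouse fact table.
    src = scalar(SOURCE, "SELECT COUNT(*) FROM orders")
    tgt = scalar(TARGET, "SELECT COUNT(*) FROM fact_orders")
    assert src == tgt, f"Row count mismatch: source={src}, target={tgt}"

def test_amount_totals_match():
    # Aggregate totals must survive the transformation unchanged.
    src = float(scalar(SOURCE, "SELECT COALESCE(SUM(amount), 0) FROM orders"))
    tgt = float(scalar(TARGET, "SELECT COALESCE(SUM(order_amount), 0) FROM fact_orders"))
    assert abs(src - tgt) < 0.01, f"Total mismatch: source={src}, target={tgt}"
```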

Enable Chaos Testing with Full Data

Instead of sampling data, use the complete data set for testing. This ensures complete test coverage by discovering corner cases missed during analysis.
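
The sketch below illustrates the idea with pandas by comparing the complete source and target data sets through a full outer join; for very large volumes the same comparison is typically pushed down into the database. The table names, business key, and compared column are illustrative assumptions.

```python
# A minimal sketch of full-data reconciliation (no sampling) using pandas.
# Table names, the business key, and the compared column are illustrative assumptions.
import pandas as pd
import sqlalchemy as sa

SOURCE = sa.create_engine("postgresql://user:pass@source-db/sales")
TARGET = sa.create_engine("postgresql://user:pass@warehouse-db/dw")

# Pull the complete data sets, not a sample, so edge cases cannot hide.
src = pd.read_sql("SELECT order_id, amount FROM orders", SOURCE)
tgt = pd.read_sql("SELECT order_id, order_amount AS amount FROM fact_orders", TARGET)

# A full outer join on the business key exposes missing, extra, and changed rows.
diff = src.merge(tgt, on="order_id", how="outer",
                 suffixes=("_src", "_tgt"), indicator=True)

missing_in_target = diff[diff["_merge"] == "left_only"]
extra_in_target = diff[diff["_merge"] == "right_only"]
matched = diff[diff["_merge"] == "both"]
value_mismatches = matched[matched["amount_src"].round(2) != matched["amount_tgt"].round(2)]

print(f"missing: {len(missing_in_target)}, extra: {len(extra_in_target)}, "
      f"value mismatches: {len(value_mismatches)}")
```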

Adopt Integrated QA + QC

Ensure that quality controls are applied not only to the final data, but also that the tooling and data pipelines themselves are tested thoroughly.

Adopt Shift-Left

Shift the focus from the end of the data development lifecycle to the left by involving business users in requirements gathering, establishing data audit rules, and making sure that checks and controls are established in operations.

Deploy Whitebox Monitoring

As part of code deployment, deploy not only the ETL code but also embedded checks. This allows the operations team to monitor the production pipelines with minimal effort.
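
The sketch below shows one way an embedded check could look: the check functions ship with the pipeline code, so deploying the pipeline also deploys its monitoring. The function, table, and column names are illustrative assumptions.

```python
# A minimal sketch of checks embedded in the pipeline code itself, so deploying
# the ETL also deploys its monitoring. Function, table, and column names are
# illustrative assumptions.
import logging

logger = logging.getLogger("pipeline.checks")

def check_not_null(df, column):
    """Fail the load step if a mandatory column contains NULLs."""
    nulls = int(df[column].isna().sum())
    if nulls > 0:
        raise ValueError(f"{nulls} NULL values in mandatory column '{column}'")
    logger.info("check_not_null passed for column %s", column)

def load_fact_orders(df, target_engine):
    # The whitebox checks run as part of the load itself and surface in the
    # pipeline's own logs, which the operations team already monitors.
    check_not_null(df, "order_id")
    check_not_null(df, "order_amount")
    df.to_sql("fact_orders", target_engine, if_exists="append", index=False)
    logger.info("loaded %d rows into fact_orders", len(df))
```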

Create Organizational Memory

Store all testing, monitoring, and data observability rules in a centralized repository that teams can access over a long period of time.

Monitor Operations

Establish checks and controls in the production pipeline to capture data issues before they can cause damage in downstream systems.
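
One possible shape for such a control, sketched below, is a gate that runs a batch's checks and blocks the downstream publish step if any of them fail. The function names and the structure of the checks are illustrative assumptions.

```python
# A minimal sketch of an operational control: downstream publishing is blocked
# unless the batch passes all of its checks. The function names and the shape
# of the checks (callables that raise AssertionError on failure) are
# illustrative assumptions.

def run_controls(batch_id, checks):
    """Run every control for a batch and collect the failures."""
    failures = []
    for name, check in checks.items():
        try:
            check(batch_id)
        except AssertionError as exc:
            failures.append(f"{name}: {exc}")
    return failures

def publish_batch(batch_id, checks, publish):
    """Publish the batch downstream only if all controls pass."""
    failures = run_controls(batch_id, checks)
    if failures:
        # Hold the batch back before bad data reaches downstream consumers.
        raise RuntimeError(f"Batch {batch_id} held back: " + "; ".join(failures))
    publish(batch_id)
```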

Observe Production Data

Observe production data for anomalies and raise incident notifications in real time as issues occur, with supporting information for triage and root cause analysis.
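
As a rough sketch of this idea, the example below flags a daily row count that deviates sharply from its trailing history and produces an incident message; the z-score threshold and the notification mechanism are illustrative assumptions.

```python
# A minimal sketch of a data observability check: flag a daily row count that
# deviates sharply from its recent history. The z-score threshold and the
# notification mechanism are illustrative assumptions.
import statistics

def detect_volume_anomaly(history, today, z_threshold=3.0):
    """Return an incident message if today's count is an outlier, else None."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1.0  # guard against zero variance
    z = (today - mean) / stdev
    if abs(z) > z_threshold:
        return (f"Row-count anomaly: today={today}, "
                f"trailing mean={mean:.0f}, z-score={z:.1f}")
    return None

# Trailing 30 days of row counts for a table vs. today's load.
incident = detect_volume_anomaly(history=[10_120, 9_980, 10_250] * 10, today=4_300)
if incident:
    print(incident)  # in practice, route this to an alerting or incident channel
```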

Measure Failure Rate and Magnitude

Measure and report both the frequency and magnitude of data errors.
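
A small sketch of such a report is shown below: it computes a failure rate (how often checks fail) and a magnitude (what fraction of rows are affected). The structure of the check results is an illustrative assumption.

```python
# A minimal sketch of a report covering both the frequency and the magnitude of
# data errors. The structure of the check results is an illustrative assumption.

def summarize(check_results):
    """check_results: list of dicts with 'check', 'rows_tested', 'rows_failed'."""
    total_checks = len(check_results)
    failed_checks = sum(1 for r in check_results if r["rows_failed"] > 0)
    rows_tested = sum(r["rows_tested"] for r in check_results)
    rows_failed = sum(r["rows_failed"] for r in check_results)
    return {
        # Frequency: how often do checks fail?
        "failure_rate": failed_checks / total_checks if total_checks else 0.0,
        # Magnitude: how much data is affected when they do?
        "failed_row_ratio": rows_failed / rows_tested if rows_tested else 0.0,
    }

print(summarize([
    {"check": "not_null_order_id", "rows_tested": 1_000_000, "rows_failed": 0},
    {"check": "amount_reconciliation", "rows_tested": 1_000_000, "rows_failed": 1_250},
]))
# -> {'failure_rate': 0.5, 'failed_row_ratio': 0.000625}
```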

Conclusion

By adopting a holistic data reliability approach, companies can significantly reduce the hidden work that goes into correcting data issues. The result is direct savings in cost and time, improved delivery timelines, and, most importantly, a reputation for reliable, high-quality data.