Data Reliability Engineering with iceDQ 

Data Factory Concept

Data-centric systems, such as data warehouses, big data platforms, CRM systems, and data migrations, are like a factory: raw data is ingested and then, in a series of orchestrated steps, cleansed, transformed, and integrated to produce the final product, much like an assembly line.

Data Quality


Data quality assessment typically captures data quality dimensions at a specific point in time.

However, these point-in-time measurements have inherent limitations. Because they focus on a single instance, they cannot account for temporal variation, and without sustained monitoring they give an unreliable picture of the data.

Moreover, they restrict their scope to data quality outcomes, neglecting the underlying processes and systems responsible for generating the data.
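
For illustration, here is a minimal Python sketch of such a point-in-time assessment. It scores a small, hypothetical set of customer records on two common dimensions, completeness and validity, at the moment the check runs; the records, field names, and email rule are invented for the example and are not tied to any particular platform.

```python
import re
from datetime import datetime, timezone

# Hypothetical customer records pulled from a production table at one instant.
rows = [
    {"customer_id": 1, "email": "a@example.com", "country": "US"},
    {"customer_id": 2, "email": None,            "country": "DE"},
    {"customer_id": 3, "email": "not-an-email",  "country": "US"},
]

# Completeness: share of rows whose email is populated.
completeness = sum(r["email"] is not None for r in rows) / len(rows)

# Validity: share of populated emails that match a simple pattern.
pattern = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
populated = [r for r in rows if r["email"] is not None]
validity = sum(bool(pattern.match(r["email"])) for r in populated) / len(populated)

snapshot = {
    "measured_at": datetime.now(timezone.utc).isoformat(),
    "email_completeness": round(completeness, 2),  # 0.67
    "email_validity": round(validity, 2),          # 0.5
}
print(snapshot)
```

Whatever the scores say, they describe only this instant; the limitations discussed above are about what such a snapshot cannot tell you.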

Conventionally, enterprises focus solely on inspecting the final data delivered in production. However, data issues often stem from missteps that occurred earlier, during development, testing, and operations. By the time data defects are found in production, it is too late:

  1. Bad data is already integrated, making it almost impossible to undo. 
  2. Downstream business has already consumed the bad data and is now impacted. 
  3. Triaging and fixing data issues in production incurs high costs and downtime. 
  4. Business continuity is disrupted, impacting both the business and its customers. 
  5. The organization is exposed to regulatory, compliance, reputational, financial, and legal risks and penalties. 

Data Reliability

Data reliability is defined as the consistent delivery of data quality over time through the development of a reliable system.  

It extends beyond a snapshot of data quality by incorporating a temporal dimension. Success is determined by the continuous measurement and maintenance of data quality across a defined period.

The metrics below provide insight into the overall health of data systems and their components, enabling organizations to make informed decisions about design, maintenance, and risk management (a short calculation sketch follows the list):

  • Mean Time to Failure (MTTF): The average time a non-repairable system or component is expected to function before it fails. 
  • Mean Time Between Failures (MTBF): The average time a repairable system or component is expected to function between failures. 
  • Mean Time to Repair (MTTR): The average time it takes to repair a failed system or component and restore it to operational status. 
  • Failure Rate (λ): The frequency with which a system or component fails, expressed as the number of failures per unit of time. 
  • Availability (A): The probability that a system or component will be operational and available when needed. 
  • Reliability Function (R(t)): The probability that a system or component will function without failure for a specified time period t. 
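
As a rough illustration of how these metrics relate, the Python sketch below derives them from a hypothetical incident log for a single repairable data pipeline. The log entries, the one-month observation window, and the constant-failure-rate assumption behind R(t) are all assumptions made for the example.

```python
import math
from datetime import datetime

# Illustrative incident log for one repairable data pipeline: when each run
# failed and when service was restored. The timestamps are made up.
incidents = [
    {"failed_at": datetime(2024, 1, 3, 2, 0),  "restored_at": datetime(2024, 1, 3, 5, 0)},
    {"failed_at": datetime(2024, 1, 15, 2, 0), "restored_at": datetime(2024, 1, 15, 3, 0)},
    {"failed_at": datetime(2024, 1, 29, 2, 0), "restored_at": datetime(2024, 1, 29, 6, 0)},
]
observation_hours = 31 * 24  # total observation window (January)

repair_hours = [(i["restored_at"] - i["failed_at"]).total_seconds() / 3600
                for i in incidents]
downtime = sum(repair_hours)
uptime = observation_hours - downtime

mtbf = uptime / len(incidents)           # mean operating time between failures
mttr = downtime / len(incidents)         # mean time to repair
failure_rate = len(incidents) / uptime   # λ: failures per operating hour
availability = mtbf / (mtbf + mttr)      # steady-state availability A
reliability_30d = math.exp(-failure_rate * 30 * 24)  # R(t) for t = 30 days,
                                                     # assuming a constant failure rate

print(f"MTBF: {mtbf:.1f} h, MTTR: {mttr:.1f} h, λ: {failure_rate:.4f}/h, "
      f"A: {availability:.2%}, R(30 days): {reliability_30d:.2%}")
```

In practice these figures would be computed continuously from pipeline run and incident records rather than a hand-built list.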

Data Reliability vs Data Quality

  • Scope: Data reliability covers data, data pipelines, orchestration, and more; data quality is concerned only with the data itself.
  • Measurement: Data reliability measures consistent quality over time; data quality measures quality at a single point in time.
  • Metrics: Data reliability uses metrics such as MTTF, MTBF, and MTTR; data quality metrics are based on data quality dimensions.
  • Environments: Data reliability spans both non-production and production environments; data quality applies only to the production environment.

Data Reliability Engineering

Data Reliability Engineering involves the systematic application of best engineering practices and techniques, integrating people, processes, and products, to ensure the delivery of reliable data. 

This practice identifies potential failures, analyzes their root causes, and implements measures to prevent or mitigate them by scientifically designing, developing, testing, operating, and monitoring data systems throughout the data development lifecycle.


Digging Deeper into Data Reliability Engineering

Often, data defects in production have their root causes much earlier in the development and operations of the data platform. In the data factory analogy, organizations navigate two crucial phases before delivering final data: building the data factory and running it effectively. Mistakes at any point in this lifecycle directly impact the final data quality. 

Phase 1 – Build the Data Factory: In this phase, requirements are collected, processes are coded, and the data pipelines are orchestrated. Data testing in the build phase is needed to capture key issues such as the following (a testing sketch follows the list):

  1. Incorrectly captured requirements. 
  2. Bugs in data processing code. 
  3. Scheduling and orchestration errors. 
  4. Missed corner or edge cases. 
  5. Deployment issues. 
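
As a sketch of what an automated build-phase test might look like, independent of any specific tool, the Python snippet below reconciles a transformed target against what the mapping specification says the source should produce. The tables, the filter-out-cancelled-orders rule, and the test names are invented for illustration.

```python
# Build-phase reconciliation test: verify that the transformed target matches
# what the mapping specification says the source should produce.
source_orders = [
    {"order_id": 1, "amount_usd": 100.0, "status": "NEW"},
    {"order_id": 2, "amount_usd": 250.0, "status": "CANCELLED"},
    {"order_id": 3, "amount_usd": 80.0,  "status": "NEW"},
]
# Target rows after the ETL step under test (cancelled orders should be dropped).
target_orders = [
    {"order_id": 1, "amount_usd": 100.0},
    {"order_id": 3, "amount_usd": 80.0},
]

def expected_target(rows):
    """Re-derive the expected output from the mapping rule in the specification."""
    return [{"order_id": r["order_id"], "amount_usd": r["amount_usd"]}
            for r in rows if r["status"] != "CANCELLED"]

def test_row_counts_match():
    assert len(target_orders) == len(expected_target(source_orders))

def test_amounts_reconcile():
    expected = {r["order_id"]: r["amount_usd"] for r in expected_target(source_orders)}
    actual = {r["order_id"]: r["amount_usd"] for r in target_orders}
    assert actual == expected

if __name__ == "__main__":
    test_row_counts_match()
    test_amounts_reconcile()
    print("build-phase reconciliation tests passed")
```

Running such tests against every ETL change before deployment is what catches requirement, coding, and edge-case defects while they are still cheap to fix.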

Phase 2 – Run the Data Factory: Once the data factory is assembled, it starts receiving data, and all the processes run in an orchestrated manner. Apart from defects carried over from the build phase, additional data errors are introduced when operational checks and controls are missing (a sketch of such checks follows the list):

  1. Poor quality of raw input data. 
  2. Scheduling delays. 
  3. Unauthorized configuration or code changes. 
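
A minimal Python sketch of such operational checks, run before a daily load is allowed to proceed: it verifies that a hypothetical feed arrived on time and that its row count falls within an expected band. The feed metadata, schedule, and thresholds are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Operational checks evaluated before a daily load proceeds.
batch = {
    "feed_name": "daily_transactions",
    "arrived_at": datetime.now(timezone.utc) - timedelta(minutes=20),
    "row_count": 48_500,
}
scheduled_for = datetime.now(timezone.utc) - timedelta(hours=1)

EXPECTED_ROWS = (40_000, 60_000)   # acceptable row-count band for this feed
MAX_DELAY = timedelta(hours=2)     # feed must land within 2 hours of schedule

failures = []
if batch["arrived_at"] - scheduled_for > MAX_DELAY:
    failures.append("feed arrived late")
if not EXPECTED_ROWS[0] <= batch["row_count"] <= EXPECTED_ROWS[1]:
    failures.append(f"row count {batch['row_count']} outside expected band")

if failures:
    # In practice this would raise an incident and block the downstream load.
    raise RuntimeError(f"{batch['feed_name']}: " + "; ".join(failures))
print(f"{batch['feed_name']}: operational checks passed, load may continue")
```

Checks like these do not fix poor input data or unauthorized changes, but they stop such problems from silently flowing further down the pipeline.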

Final Production Data: Carry-over defects that were missed in the prior two phases, along with ongoing challenges, can still cause defects in the final production data:

  1. Defects that escaped detection in the earlier two phases can surface later.  
  2. Business changes: Evolving business processes, rules, and reference data can introduce new errors in data.  

We’ve seen that just checking data quality at the very end isn’t enough. Instead, a more comprehensive approach called data reliability engineering is needed.

How to Implement Data Reliability Engineering with iceDQ?

Successful implementation of DRE requires the three Ps (People, Processes, and the iceDQ Platform) across all phases of the data development lifecycle.


  1. Data Reliability Engineer: A data reliability engineer is a professional who builds, runs, and manages a data-centric system to deliver reliable data consistently over time. They do this by ensuring the system is built correctly and the data is tested properly, and by employing appropriate data monitoring and data observability. Their primary goal is to minimize failures and downtime, thereby maximizing efficiency and reducing costs. 
  2. Automate Data Testing: One of the first steps towards DRE is to automate data testing, ensuring that each ETL process is tested before being deployed into production. 
  3. Enable Chaos Testing with Full Data: Instead of sampling, test with the complete data set. This ensures full test coverage by uncovering corner cases missed during analysis.
  4. Adopt Integrated QA + QC: Ensure that quality controls are applied not only to the final data but that the tooling and data pipelines are also thoroughly tested.  
  5. Adopt Shift-Left: Move the focus from the end of the data development lifecycle to the left by involving business users in requirements gathering, establishing data audit rules, and ensuring that checks and controls are established in operations. 
  6. Deploy Whitebox Monitoring: As part of code deployment, deploy not only the ETL code but also the embedded checks that go with it. This allows the operations team to monitor production pipelines with minimal effort. 
  7. Create Organizational Memory: Store all the testing, monitoring and data observability rules in a centralized repository that can be accessed by teams over a long period of time. 
  8. Monitor Operations: Establish checks and controls in the production pipeline to capture any data issues before they can cause damage in the downstream systems. 
  9. Observe Production Data: Observe production data for anomalies and raise incidents in real time as they happen, with supporting information for triage and root cause analysis (a minimal sketch follows this list).  
  10. Measure Failure Rate and Magnitude: Measure and report both the frequency and magnitude of data errors. 
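
To make step 9 concrete, here is a minimal Python sketch of observing one production metric: a simple z-score flags anomalous daily row counts, standing in for the baselining, alerting, and triage context a full data observability platform would provide. The history values and threshold are invented for the example.

```python
import statistics

# Daily loaded row counts for a production table over the last seven days,
# plus today's value. A simple z-score flags anomalous days.
history = [10_120, 9_980, 10_340, 10_050, 9_870, 10_200, 10_110]
today = 6_400

mean = statistics.mean(history)
stdev = statistics.stdev(history)
z_score = (today - mean) / stdev

THRESHOLD = 3.0  # how many standard deviations count as an anomaly
if abs(z_score) > THRESHOLD:
    # In practice: raise an incident with the metric, its deviation, and the
    # recent trend so the data reliability engineer can triage quickly.
    print(f"ANOMALY: today's row count {today} deviates {z_score:.1f} standard "
          f"deviations from the 7-day mean of {mean:.0f}")
else:
    print(f"row count {today} is within the normal range (z = {z_score:.1f})")
```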

Conclusion

By adopting a holistic data reliability approach, companies can significantly reduce the hidden effort involved in correcting data issues. This leads to direct savings in cost and time and to improved delivery timelines. Most importantly, it helps build a reputation for delivering reliable, high-quality data.