Data Observability

What is Data Observability?

Data Observability Definition: It is the practice of understanding the internal state or condition of a data-centric system by carefully monitoring and analyzing the data, logs, and signals emitted by the system.

Historical Background: In 1960, Rudolf E. Kalman formally introduced the term “observability” as part of control theory to describe a system’s capacity to be measured through its outputs. The term was later adapted as software observability for complex, distributed software systems, where it refers to the ability to understand a system’s internal state by analyzing logs and traces and identifying abnormal patterns.

The Evolution from Data Monitoring to Data Observability: As data-centric systems evolved, they became increasingly complex, with big data volumes, intricate process chains, and distributed architectures.

While data monitoring can flag data issues, it often falls short in diagnosing root causes, assessing impacts, and uncovering hidden problems in such complex systems.

To address these challenges, the observability concept was borrowed from software and applied to data-centric systems.

  • Both are complementary: data observability is built on top of data monitoring.
  • The goal of data monitoring is to track data issues, while data observability focuses on troubleshooting and root cause analysis.
  • Only a data observability system that deeply analyzes data, logs, and events, and traces them using data lineage, can understand both the cause and the impact of an issue.

Example: A monitoring system can detect when a data process crashes due to a connection timeout, but unless there is a catastrophic failure it cannot determine whether the process loaded the data correctly. This is where data observability excels: it analyzes data and logs to verify process correctness, and it can further assess the impact and potential causes of defects.

Data Observability Metrics

A data-centric system primarily consists of data, ETL processes, and the underlying infrastructure. A well-observed system includes a process that regularly collects metrics essential for assessing system health. The key metrics include the following (a short sketch of computing a few of them appears after the list):

  1. Freshness Metric: This metric monitors the delivery time of files and data. It is useful for tracking late data arrivals or long processing times.
  2. Volume Metric: This metric checks for sudden changes in data volume. A sudden drop in data volume can indicate a problem in upstream systems. It is also used to track large deviations in aggregate financial numbers.
  3. Distribution Metric: This metric detects anomalies in the categorical and numerical distribution of data over time.
    Categorical distribution example: The count of customers grouped by customer type.
    Numerical distribution example: The average dollar transactions by store.
  4. Data Quality Metric: This metric measures the usual data quality dimensions such as completeness, consistency, uniqueness, etc.
  5. Schema Drift Metric: This metric monitors database schema changes such as added or dropped tables and columns, or data type changes.
  6. Data Pipeline Metric: This metric tracks processing errors in data processes, either by reading logs or by comparing input data with output data. Beyond this, it is useful for tracking schedule delays, process execution times, and other data pipeline orchestration issues.
  7. Data Drift Metric: ML models rely heavily on features. The composition of feature data can drift over time relative to the data used during initial training, resulting in unreliable model predictions. This metric tracks the composition of the data and alerts when it changes.
  8. Data Contract Metric: This metric tracks vendor promises against actual data delivery. Data vendors commit to certain SLAs and data quality levels; this metric tracks deviations from the agreed contractual requirements.
  9. Resource Metric: Data processing utilizes CPU, memory, storage, and network. This metric tracks resource consumption trends and warns the user when necessary.
  10. Spend Metric: This metric tracks and alerts on infrastructure usage costs. Many companies rely on cloud platforms such as AWS and Snowflake for data processing and storage; tracking usage and billing allows corrective action to be taken before sudden cost surprises.
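
To make a few of these concrete, here is a minimal sketch in Python that computes freshness, volume, distribution, and completeness values for a single batch of rows. The `customers` sample and its columns are hypothetical; in practice these values would come from queries against the observed warehouse tables.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical batch of customer rows; in practice this would come from a
# warehouse query against the observed table.
customers = [
    {"id": 1, "type": "retail",    "zip": "10001", "loaded_at": datetime(2024, 5, 1, 5, 55, tzinfo=timezone.utc)},
    {"id": 2, "type": "wholesale", "zip": None,    "loaded_at": datetime(2024, 5, 1, 5, 58, tzinfo=timezone.utc)},
    {"id": 3, "type": "retail",    "zip": "94107", "loaded_at": datetime(2024, 5, 1, 6, 2,  tzinfo=timezone.utc)},
]

now = datetime.now(timezone.utc)
metrics = {
    # Freshness: minutes since the newest record arrived.
    "freshness_minutes": (now - max(r["loaded_at"] for r in customers)).total_seconds() / 60,
    # Volume: number of rows delivered in this batch.
    "row_count": len(customers),
    # Distribution: count of customers grouped by customer type.
    "type_distribution": dict(Counter(r["type"] for r in customers)),
    # Data quality (completeness): share of rows with a zip code present.
    "zip_completeness": sum(r["zip"] is not None for r in customers) / len(customers),
}
print(metrics)
```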

Data Observability Architecture

To provide a comprehensive understanding of data observability architecture, this section is divided into two parts. First, we’ll explore the key components that make up the architecture, detailing each element’s role and significance. Then, we will delve into how these components work together to enable effective data observability.

Figure 1: Data Observability Components

Key Components: Data observability consists of the following components. Please refer to Figure 1.
  • Metrics: Define and measure metrics.
  • Anomalies: Detect anomalies based on rules or AI/ML pattern detection.
  • Impact: Assess the downstream impact and collect supporting evidence.
  • Notification: Identify the relevant system or person and notify them for immediate action.
  • Tickets: Open incident management tickets for follow-up and documentation.
  • KPIs: Generate KPIs around failures, quality, outages, and maintenance costs.

How does it work? Refer to Figure 2 below for a flow diagram highlighting the key processes and how the components mentioned above work together to form the Data Observability architecture.

Figure 2: How Data Observability Works

Create Metrics: For every data asset, such as a table, column, file, or ETL process, create one or more metrics to observe its condition. As mentioned above, these could relate to data volumes, ETL processes, schema changes, and more. The system is then periodically profiled to capture and store these metric values with timestamps.
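
As a rough illustration, the sketch below defines two metrics for a hypothetical `warehouse.orders` table and captures timestamped snapshots into an in-memory store. The measurement lambdas are stand-ins for the profiling queries a real platform would run on a schedule against the warehouse.

```python
from datetime import datetime, timezone

# Metric definitions: which asset to observe and how to measure it. The
# lambdas stand in for real profiling queries.
metric_definitions = [
    {"asset": "warehouse.orders", "metric": "row_count",
     "measure": lambda: 125_480},   # e.g. SELECT COUNT(*) FROM warehouse.orders
    {"asset": "warehouse.orders", "metric": "null_zip_pct",
     "measure": lambda: 0.1},       # e.g. % of rows with a missing zip code
]

metric_store = []  # append-only history of timestamped observations

def capture_snapshot():
    """Profile every defined metric once and store the values with timestamps."""
    captured_at = datetime.now(timezone.utc).isoformat()
    for definition in metric_definitions:
        metric_store.append({
            "asset": definition["asset"],
            "metric": definition["metric"],
            "value": definition["measure"](),
            "captured_at": captured_at,
        })

capture_snapshot()  # in practice this runs on a schedule, e.g. hourly
print(metric_store)
```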

Detect Anomalies: The next step is to analyze telemetry data, such as logs, metrics, and traces, from all aspects of the data pipelines. A key aspect of the data observability methodology is the use of historical time-series data.

  • Learn the data patterns created over time. This data can be used to establish baselines for key metrics, such as data volume, latency, and error rates.
  • Use machine learning or rules to detect unusual data patterns or anomalies. Any deviations from these baselines can then be flagged as potential anomalies.
  • Example: The platform can detect a spike in the number of customer records with a missing zip code. If it determines that the percentage of such records has suddenly increased from 0.1% to 5%, it flags this as an anomaly (see the sketch below).
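
Below is a minimal sketch of the rule/statistics-based flavor of this step: a new observation is flagged when it falls more than three standard deviations from the historical baseline of the metric. The three-sigma threshold and the sample history are illustrative assumptions; production systems may use seasonal models or ML instead.

```python
from statistics import mean, stdev

def is_anomaly(history, latest, sigmas=3.0):
    """Flag `latest` if it deviates more than `sigmas` std-devs from the baseline."""
    mu, sd = mean(history), stdev(history)
    if sd == 0:
        return latest != mu
    return abs(latest - mu) > sigmas * sd

# Daily % of customer records with a missing zip code (historical baseline).
null_zip_pct_history = [0.10, 0.12, 0.09, 0.11, 0.10, 0.13, 0.08]

print(is_anomaly(null_zip_pct_history, 0.11))  # False -- within the baseline
print(is_anomaly(null_zip_pct_history, 5.0))   # True  -- the 0.1% -> 5% spike
```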

Analyze the Impact: Once an anomaly is detected, data lineage is used to track the upstream systems that populate the data and the downstream systems to which the data flows and which are impacted.

  • Identify the origin of the problem. The data can then be used to detect anomalies, identify trends, and understand the root cause of problems.
  • Understand the impacted downstream systems, tables, or columns (see the sketch below).
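
The sketch below illustrates the downstream half of this step with a toy lineage graph: a breadth-first walk from the defective asset lists every table and report it feeds. The asset names and edges are hypothetical; a real system would read them from a metadata catalog or lineage store.

```python
from collections import deque

# Hypothetical lineage edges: source asset -> list of directly downstream assets.
lineage = {
    "crm.customers": ["staging.customers"],
    "staging.customers": ["warehouse.dim_customer"],
    "warehouse.dim_customer": ["reports.customer_churn", "ml.churn_features"],
}

def downstream_impact(asset, edges):
    """Breadth-first walk over the lineage edges to list every impacted asset."""
    impacted, seen = [], set()
    queue = deque(edges.get(asset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        impacted.append(node)
        queue.extend(edges.get(node, []))
    return impacted

print(downstream_impact("crm.customers", lineage))
# ['staging.customers', 'warehouse.dim_customer',
#  'reports.customer_churn', 'ml.churn_features']
```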

Notify: Once all the systems are mapped, it is also important to identify the people or groups responsible for fixing the defect, as well as those affected by it. With the help of knowledge graphs, the system can automatically notify them.

Open Tickets and Workflow: Notification messages by email or text are often meant for an immediate operational response. Using a ticketing system to track and follow up on issues is essential for documentation and guaranteed attention. A proper ticketing system has assigned owners, escalations, and workflows, which ensures that not only are tactical actions taken but long-term fixes are also implemented through corrective action.

Generate KPIs: While the metrics measure the condition of the data and processes, they do not by themselves help manage overall trends. For this, dashboards are needed that show various quality and failure KPIs, such as MTTF, MTTR, and defect rate. Additionally, they should support slicing and dicing the data from different perspectives, such as source system, table, database, or department.
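
As a small illustration, the sketch below derives MTTR and a defect rate from a couple of hypothetical incident records; a real dashboard would compute these KPIs from the ticketing history and pipeline-run logs.

```python
from datetime import datetime

# Hypothetical incident records pulled from the ticketing system.
incidents = [
    {"opened": datetime(2024, 5, 1, 9, 0),  "resolved": datetime(2024, 5, 1, 11, 0)},
    {"opened": datetime(2024, 5, 7, 14, 0), "resolved": datetime(2024, 5, 7, 15, 30)},
]
total_pipeline_runs = 400  # from the orchestration logs
failed_runs = len(incidents)

# MTTR: average hours between an incident being opened and resolved.
mttr_hours = sum(
    (i["resolved"] - i["opened"]).total_seconds() / 3600 for i in incidents
) / len(incidents)

# Defect rate: share of pipeline runs that produced an incident.
defect_rate = failed_runs / total_pipeline_runs

print(f"MTTR: {mttr_hours:.2f} h, defect rate: {defect_rate:.2%}")
# MTTR: 1.75 h, defect rate: 0.50%
```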

Data Observability Benefits

  1. Identifying and Troubleshooting Unknown Issues: Data observability allows organizations to uncover problems they weren’t aware of and address them proactively.
  2. Predicting Potential Problems: It equips businesses with tools to predict and prevent potential issues before they become critical.
  3. Gaining Insights into Hidden System State: Data observability provides insights into variables, parameters, processes, or components that may not be directly visible but significantly impact system behavior.
  4. Empowers Diverse Roles: Benefits various stakeholders – operations, compliance, governance, data quality, and business users – by providing reliable, insightful data. Let’s explore how below.
  • Operations Teams: Data observability is a crucial tool for operations teams who rely on it to ensure the health and performance of systems.
  • Compliance Experts: Compliance experts use data observability to maintain data integrity and adherence to regulations.
  • Governance Authorities: Governance authorities leverage data observability to maintain data quality and security.
  • Data Quality Specialists: Specialists in data quality use observability to ensure that data is not only available but also trustworthy.
  • Business Users: Business users rely on data observability to extract meaningful insights for decision-making.

Limitations of Data Observability

Despite its strengths, data observability has inherent limitations:

  1. Reactive Nature: Data observability primarily identifies issues after they’ve occurred, limiting its ability to prevent problems upstream.
  2. Aggregate Focus: Its emphasis on aggregate data can obscure granular details, hindering deep-dive analysis.
  3. Probabilistic Insights: Data observability often relies on statistical models, which can introduce uncertainty in problem identification.
  4. Dependency on Historical Data: Advanced observability techniques frequently require substantial historical data, which may not always be accessible.

Conclusion

Data observability is critical to the DataOps practice and is a core component of data reliability engineering. It provides a holistic view of the health and performance of data systems, enabling organizations to identify and resolve issues quickly and proactively. This can lead to significant benefits, including improved data quality, reduced downtime, and increased agility, and it ensures that data remains reliable and trustworthy.

iceDQ offers a unified data testing, monitoring, and observability platform that helps organizations achieve 100% data reliability. By seamlessly integrating into data pipelines and leveraging AI-driven features, iceDQ helps you detect and diagnose data issues proactively. With features like automated testing, real-time monitoring, observability, and comprehensive reporting, iceDQ ensures quality throughout the data development life cycle.