Data Monitoring is the systematic and continuous process of establishing checks, controls and notifications for production data pipelines and input data.
Data monitoring is directed primarily, though not exclusively, at the operations and support teams responsible for running the production data pipelines. It detects data and processing issues in the runtime environment so that corrective action can be taken immediately. It involves three distinct activities: establishing checks, controls, and notifications.
Why is Data Monitoring Required?
This question is often asked: “If I have a data testing tool in development, and scheduling and data observability tools in production, why do I still need a data monitoring tool?”
Data testing is a QA activity that focuses on verifying the data processes while they are under development.
Data observability and data monitoring are both Quality Control (QC) activities that keep an eye on production: observability checks that the final data is accurate, while data monitoring checks that the data processing steps are running smoothly and efficiently.
Here are some key points underlying the need for a data monitoring solution.
Introduction of Runtime Errors in the Operational Environment | Even after thorough data testing and certification, data processes can encounter unforeseen issues once deployed into production, because the runtime environment itself can introduce new problems that were not present during development. |
Scheduling Tools Limitations (Silent Data Issues) | Existing enterprise scheduling tools can only monitor the success of the batch processes; they cannot verify whether the data has been transformed correctly. These tools do not inspect the content of the data; they only detect hard, catastrophic errors that cause outright process failure. For example, a scheduling tool can catch an expired password, a dropped table, or a process that runs out of memory. Silent failures related to data, however, often go undetected by orchestration tools. Here are a few cases that you will see in your data factory. |
Data Observability Limitations (Stopping Error Propagation) | Data observability detects anomalies after the fact, when the data has already been produced and is at the point of use by end users. It is the last resort: it cannot stop the data pipeline, it only warns users about the state of the data. Because it is not embedded in the production pipeline, it has no direct impact on the data processing pipeline or batch. |
Helping Ops and Support Teams | Data monitoring creates real-time alerts and notifications so the team can act. If necessary, they can stop the process, fix the issue, and restart the data pipeline with little or no impact on downstream users. |
How to Implement Data Monitoring
Data monitoring involves:
1. Blackbox Data Monitoring
Blackbox data monitoring is a method of monitoring a system or application by observing its external behavior, without needing access to its internal workings. The system is treated as a “black box”: the monitor simply takes note of error codes when the process fails or times out.
Blackbox monitoring is typically performed by operations teams using their scheduling and orchestration tools. These tools monitor for obvious error responses and exit codes generated by batch processes.
Such data monitoring will detect hard failures. Here are some examples:
- Connection errors.
- Database connection timeout errors.
- Hardware failures.
- Schema changes, such as dropped tables or columns referenced by the ETL processes.
Most scheduling and orchestration tools come equipped with this type of hard failure monitoring by default, allowing them to detect, stop, and report such errors.
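As a rough illustration of the blackbox approach, the Python sketch below launches a batch step as a child process and looks only at its exit code, much as a scheduler would. The helper name `run_step_blackbox` and the two inline commands are hypothetical, invented for this example; they do not represent any particular scheduling tool.

```python
import subprocess
import sys


def run_step_blackbox(command, timeout_seconds=60):
    """Run a batch step and classify it using only its exit code.

    Hypothetical helper: it observes the process from the outside only
    and never looks at the data the step produced.
    """
    try:
        result = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return "TIMEOUT"
    return "SUCCESS" if result.returncode == 0 else "HARD_FAILURE"


# A step that exits with a non-zero code is caught.
crashing_step = [sys.executable, "-c", "import sys; sys.exit(1)"]
# A step that loads zero records but exits cleanly is not.
silent_step = [sys.executable, "-c", "rows = []; print(len(rows), 'rows loaded')"]

crash_status = run_step_blackbox(crashing_step)
silent_status = run_step_blackbox(silent_step)
print(crash_status, silent_status)
```

The second step is reported as a success even though it loaded nothing, because the exit code is the only signal available from outside the process.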
However, this kind of monitoring has limitations: it cannot detect silent data defects introduced inside the data process. This leads us to Whitebox monitoring.
2. Whitebox Data Monitoring
Whitebox monitoring, unlike Blackbox monitoring, dives deep into the internal workings of an application or system. It provides a transparent view of what’s happening “inside the box” to understand its health and performance.
The limitation of Blackbox monitoring is that it cannot see inside the envelope. For example, if an input file has zero records, the ETL process will load zero records and still report success. Whitebox data monitoring, on the other hand, checks the content of the data file and returns an exit code that indicates a failure.
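The zero-record case can be sketched in Python as follows. This is a minimal example under stated assumptions: the file is a CSV with a header row, and the function names and the exit code value 2 are arbitrary choices made for illustration.

```python
import csv
import os
import tempfile


def count_data_rows(path):
    """Count the records in a CSV file, excluding the header row."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row if there is one
        return sum(1 for _ in reader)


def whitebox_exit_code(path, min_rows=1):
    """Return 0 if the file has enough records, otherwise a non-zero code.

    The value 2 is an arbitrary choice for this sketch.
    """
    return 0 if count_data_rows(path) >= min_rows else 2


# Demonstration with a header-only file, that is, zero data records.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("order_id,amount\n")
    empty_file = tmp.name

exit_code = whitebox_exit_code(empty_file)
os.remove(empty_file)
print(exit_code)
```

In a real job, the check script could pass this value to `sys.exit` so that the scheduler sees a failure instead of a silent success.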
In the context of DataOps, developers can no longer just hand over the data processes to operations and expect the Ops team to support the data pipelines.
During development, developers integrate monitoring checks with the ETL pipelines, providing little windows into the system that let operations teams see how it is working. These monitoring checks are deployed to production along with the ETL code. They run alongside the data pipelines, providing valuable insights into hidden data errors that might previously have gone unnoticed.
DataOps has revolutionized this approach. With Whitebox monitoring, operations teams not only receive the ETL pipeline and its dependencies, but also benefit from:
- Embedded checks: These checks are deployed automatically when the data pipeline is deployed.
- Real-time monitoring: These checks run as part of the orchestration process, providing immediate feedback on data quality.
3. In-Process Data Monitoring
In this method, checks are run in-line with the ETL process to verify that data processing occurred as expected. First, the process(es) are executed. Then, the check is run. If the check fails, the process is considered a failure, is stopped, and the support team is notified.
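The execute, check, stop, and notify sequence described above might look like this in Python. It is only a sketch: the step and check functions are stand-in lambdas, and `PipelineHalted`, `run_in_process`, and the `notify` callback are hypothetical names introduced for the example.

```python
class PipelineHalted(Exception):
    """Raised when an in-line check fails and the pipeline must stop."""


def run_in_process(steps, notify):
    """Run each (name, step, check) in order; halt on the first failed check."""
    completed = []
    data = None
    for name, step, check in steps:
        data = step(data)          # first, the process is executed
        if not check(data):        # then, the check is run
            notify(f"Check failed after step '{name}'; pipeline halted.")
            raise PipelineHalted(name)
        completed.append(name)
    return completed


# Hypothetical two-step pipeline in which the extract step returns no rows.
alerts = []
steps = [
    ("extract", lambda _: [], lambda rows: len(rows) > 0),
    ("load", lambda rows: rows, lambda rows: True),
]

try:
    run_in_process(steps, alerts.append)
    halted_at = None
except PipelineHalted as exc:
    halted_at = str(exc)

print(halted_at, alerts)
```

Because the exception is raised before the next step starts, the load step never runs on the defective data.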
When to use in-process data monitoring?
- The process being monitored is critical.
- The check takes minimal time and does not affect the timeline.
- The organization is mature and confident in the effectiveness of the checks.
- The checks are critical to the usage of the data.
- There are only a few checks.
The advantages of in-process data monitoring:
- When a data or process defect is detected, further execution stops immediately, preventing downstream damage.
- The time needed to fix the defect can often be absorbed within the run timeline, with little or no delay for end users.
The drawbacks of in-process data monitoring:
- The additional checks make the batch take longer to execute.
- The job comes to a halt until corrective action is taken.
- Halting the whole data pipeline can create a panic situation.
4. Out-of-Process Data Monitoring
This is parallel data monitoring: checks are executed concurrently with the process flows or, alternatively, at predetermined scheduled times. This ensures that the checks do not disrupt the execution timeline or interrupt the process when errors occur.
It only notifies the concerned parties, and it is up to them to act.
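For contrast with the in-process method, here is a minimal Python sketch of a notify-only check runner. The function name, the sample snapshot, and the two checks are assumptions made for illustration; in practice such a function could be triggered by a scheduler at fixed times.

```python
def run_out_of_process_checks(snapshot, checks, notify):
    """Evaluate every check against a data snapshot; notify on failures only.

    Nothing is raised and no pipeline is stopped: acting on the
    notifications is left to the people who receive them.
    """
    failed = []
    for name, check in checks.items():
        if not check(snapshot):
            failed.append(name)
            notify(f"Data check '{name}' failed.")
    return failed


# Hypothetical snapshot of already-loaded data.
snapshot = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": None, "amount": 35.5},
]
checks = {
    "customer_id_not_null": lambda rows: all(r["customer_id"] is not None for r in rows),
    "amount_positive": lambda rows: all(r["amount"] > 0 for r in rows),
}

notifications = []
failed_checks = run_out_of_process_checks(snapshot, checks, notifications.append)
print(failed_checks)
```

All checks run to completion regardless of earlier failures, which is what allows many checks to be executed without affecting the pipeline timeline.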
When to use out-of-process data monitoring?
- The defects being monitored are noncritical.
- The checks might take long enough to negatively affect the data processing timelines.
- The organization is not yet mature or not confident in the effectiveness of the checks.
- The checks are not critical to the usage of the data.
- There are many checks to execute.
The advantages of out-of-process data monitoring:
- The data pipeline takes less time to execute.
- When a defect is detected, there is no urgency to respond because the process is not stopped.
- It gives the operations team time to learn how a check behaves and to convert it into an in-line monitor as they become confident.
- It does not create unnecessary panic for each defect.
The drawbacks of out-of-process data monitoring:
- Because the process does not stop when a data defect is noticed, the defect affects downstream data and becomes difficult to undo.
- No action is taken until a support person looks at the defect.
5. Input Data Monitoring
The data generated depends on the input data and the processing applied to it. One key action is therefore to monitor the input data provided by data vendors or by internal divisions within the company.
Here are some basic checks that can help monitor input data before it is processed by the ETL process:
- Business validation of the incoming data to ensure the values are within expected ranges.
- Reconciliation of data between two systems to confirm that the differences, if any, fall within an acceptable range.
- Watching for new reference data values that are not on the predetermined list.
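The three input checks listed above can be sketched in Python as small functions. The field names, the sample trades, the 1% tolerance, and the currency list are all assumptions invented for this example.

```python
def validate_range(rows, field, low, high):
    """Business validation: return rows whose value falls outside [low, high]."""
    return [r for r in rows if not (low <= r[field] <= high)]


def reconcile_totals(source_total, target_total, tolerance=0.01):
    """Reconciliation: True if the totals differ by at most the relative tolerance."""
    if source_total == 0:
        return target_total == 0
    return abs(source_total - target_total) / abs(source_total) <= tolerance


def new_reference_values(rows, field, known_values):
    """Reference data watch: return values that are not on the predetermined list."""
    return sorted({r[field] for r in rows} - set(known_values))


# Hypothetical incoming vendor data.
incoming = [
    {"trade_id": 1, "currency": "USD", "price": 101.5},
    {"trade_id": 2, "currency": "EUR", "price": -4.0},
    {"trade_id": 3, "currency": "XXX", "price": 99.0},
]

out_of_range = validate_range(incoming, "price", 0, 10_000)
totals_match = reconcile_totals(source_total=196.5, target_total=196.5)
unknown_currencies = new_reference_values(incoming, "currency", {"USD", "EUR", "GBP"})
print(out_of_range, totals_match, unknown_currencies)
```

Each function returns the offending records or values rather than a bare pass or fail, so a notification can say exactly what was wrong with the delivery.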
6. File Data Monitoring
File formats remain a common method for data delivery today. However, compared to tables, files offer less structure, which can lead to challenges such as unexpected column shifts, truncated files, or incorrect data formats.
File delivery related monitoring:
- Delivery timing problems.
- Partial file deliveries.
- Mismatches between manifest files and data.
- Re-issuing of identical files.
File content related monitoring:
- Empty files.
- Data quality problems.
- Incorrect file schema.
- Data format errors.
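A few of the file checks listed above (empty files, incorrect schema, manifest mismatches, and re-issued identical files) can be combined in a short Python sketch. It assumes CSV content passed in as text, a manifest that states the expected row count, and a set of hashes of previously received files; the function name and issue labels are hypothetical.

```python
import csv
import hashlib
import io


def inspect_file(content, expected_columns, manifest_row_count, seen_hashes):
    """Return a list of issues found in one delivered CSV file (given as text)."""
    issues = []

    # Re-issued files: compare a content hash with those already received.
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        issues.append("identical file re-issued")
    seen_hashes.add(digest)

    rows = list(csv.reader(io.StringIO(content)))
    header, data = (rows[0], rows[1:]) if rows else ([], [])
    if not data:
        issues.append("empty file")
    if header != expected_columns:
        issues.append("incorrect file schema")
    if len(data) != manifest_row_count:
        issues.append("manifest row count mismatch")
    return issues


seen = set()
good = "id,amount\n1,10.5\n2,7.0\n"
first = inspect_file(good, ["id", "amount"], 2, seen)
again = inspect_file(good, ["id", "amount"], 2, seen)
short = inspect_file("id,amount\n", ["id", "amount"], 2, seen)
print(first, again, short)
```

Delivery timing and partial deliveries are not covered here, since they depend on how and when files arrive rather than on file content.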
7. ETL Data Pipeline Monitoring
The quality of the output data depends on the input data and the data pipeline. The data pipeline can be affected by defects in the data processing, or by a process being executed with the wrong parameters or in the wrong order. Either issue can cause a hard failure or a soft failure.
Hard failures: As discussed earlier, hard failures of a data pipeline are easy to detect because the data process itself crashes due to schema changes, connection issues, and so on. Most process orchestration tools will take care of them.
Soft failures: Soft failures, however, are very difficult to detect because there are no outward symptoms unless you check the data that the process generated. Some of the reasons behind soft failures are:
- Wrong processes or code is deployed into production.
- Incorrect or accidental execution of the processes.
Such issues cannot be detected directly; they require a data monitoring strategy driven by business rules. It involves:
- Validating the output data against business rules: for example, checking whether a dollar amount falls below or above a certain range, or whether the customer count is too low.
- Reconciling reference data with transformed data: for example, checking whether shipments exceed orders.
When such business rules fail, it can be inferred that the processes populating the defective data have issues.
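The business rules mentioned above could be expressed in Python roughly as follows. The thresholds, field names, and sample data are assumptions made for illustration, not recommended values.

```python
def check_amount_range(rows, low, high):
    """Flag rows whose dollar amount is below or above the expected range."""
    return [r for r in rows if r["amount"] < low or r["amount"] > high]


def check_min_customers(rows, minimum):
    """True if the number of distinct customers is not suspiciously low."""
    return len({r["customer_id"] for r in rows}) >= minimum


def shipments_exceeding_orders(ordered, shipped):
    """Return products whose shipped quantity is greater than the ordered quantity."""
    return sorted(p for p, qty in shipped.items() if qty > ordered.get(p, 0))


# Hypothetical output of a nightly ETL run.
output_rows = [
    {"customer_id": "C1", "amount": 250.0},
    {"customer_id": "C2", "amount": 1_500_000.0},
]

bad_amounts = check_amount_range(output_rows, low=0, high=100_000)
enough_customers = check_min_customers(output_rows, minimum=100)
over_shipped = shipments_exceeding_orders({"A": 10, "B": 5}, {"A": 10, "B": 8})
print(bad_amounts, enough_customers, over_shipped)
```

None of these rules inspects the ETL code itself; a failure only signals that the process that produced the data deserves a closer look.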
8. Data Contract Monitoring
Data contracts are agreements between data providers and data consumers that specify how, when, and what data will be delivered. They might also contain penalties for non-delivery.
The data provider might be an internal department or an external agency. It is essential to monitor incoming data and compare it with the data contract to hold the data provider accountable for the promises made regarding:
- Schema
- Data Format
- Data Quality
- Delivery Timing
Data monitoring checks can be created based on the data contracts, enabling tracking against benchmarks and comparing vendors.
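One way to turn a contract into checks is to store its terms as data and evaluate each delivery against them. The Python sketch below covers the four areas listed above; the contract keys, thresholds, and sample delivery are hypothetical and would differ for any real agreement.

```python
import re
from datetime import datetime, time

# Hypothetical contract; the keys and thresholds are invented for this sketch.
CONTRACT = {
    "columns": ["account_id", "trade_date", "amount"],
    "date_pattern": r"\d{4}-\d{2}-\d{2}",
    "max_null_rate": 0.05,
    "deliver_by": time(6, 0),
}


def check_contract(rows, delivered_at, contract=CONTRACT):
    """Return the contract terms (schema, format, quality, timing) that were violated."""
    violations = []
    if any(list(r.keys()) != contract["columns"] for r in rows):
        violations.append("schema")
    dates = [r.get("trade_date") for r in rows]
    if any(d is not None and not re.fullmatch(contract["date_pattern"], d) for d in dates):
        violations.append("data format")
    nulls = sum(1 for r in rows for v in r.values() if v is None)
    cells = sum(len(r) for r in rows)
    if cells and nulls / cells > contract["max_null_rate"]:
        violations.append("data quality")
    if delivered_at.time() > contract["deliver_by"]:
        violations.append("delivery timing")
    return violations


delivery = [
    {"account_id": "A1", "trade_date": "2024-03-01", "amount": 100.0},
    {"account_id": "A2", "trade_date": "03/01/2024", "amount": None},
]
violations = check_contract(delivery, delivered_at=datetime(2024, 3, 2, 7, 15))
print(violations)
```

Recording the returned violations per provider and per delivery would give the benchmark history needed to compare vendors over time.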
Benefits of Data Monitoring
- Detects silent data defects often missed by scheduling tools.
- Prevents further downstream damage in real time.
- Provides enough time for operations and support teams to fix the data and restart the process.
- Complies with regulations by continuously auditing and documenting activity, thus producing an audit trail useful for certification requirements.
- Enforces data contracts from various data providers.
- Lowers operating costs with proactive process monitoring by preventing the compounding effect of undoing and rerunning the entire nightly batch.
iceDQ’s Data Monitoring Solution
Data monitoring is built into the iceDQ platform with features and functionality that integrate into process execution workflows. When properly implemented, operations should be able to quickly detect and respond to data and process issues before they become problems.
- Audit Rules Engine for Checks: Users can add rules to monitor execution daily.
- Workflow Integration for Controls: Integrate embedded rules into process execution within the workflow using the command line or web service interface. This enables automatic termination of process flows when data or process defects are detected.
- Compliance Reporting & Alerts: Comply with SOX, FINRA, GDPR, Solvency II, and many other regulations by implementing proper monitoring and alerting mechanisms.
Conclusion
Data quality is a byproduct of the system consisting of input data and the ETL processes that populate data. Hence, monitoring data processes is critical. While existing enterprise scheduling and process orchestration tools can monitor the success or failure of a data process, they cannot verify if the data has been transformed correctly, leading to downstream data problems.
Therefore, data process monitoring in production is an essential component for delivering reliable data.