In this guide, I will explain both data quality (DQ) and the six data quality dimensions. You will learn advanced data quality concepts, data quality measurement, and examples of different data quality dimensions. This guide shares my 25+ years of experience in real-world data engineering. Let’s dive right in.
What is Data Quality?
Data quality (DQ) is defined as the data’s suitability for a user’s defined purpose. It is subjective, as the concept of quality is relative to the standards defined by the end user’s expectations.
Data Quality Expectations: It is possible that for the exact same data, different users may have totally different data quality expectations, depending on their usage. For example, the accounting department needs data that is accurate to the penny, whereas the marketing team does not, because approximate sales numbers are enough to determine sales trends.
Instead of just providing definitions of the different data quality dimensions, this guide offers a comprehensive, nuanced list of use cases and examples from our professional experience.
What are the Six Data Quality Dimensions?
The six data quality dimensions are Accuracy, Completeness, Consistency, Uniqueness, Timeliness, and Validity. However, this classification is not universally agreed upon.
In this guide, we have added four more dimensions: Currency, Conformity, Integrity, and Precision, bringing the total to 10 DQ dimensions.
As humans, we naturally like to classify things. For example, we categorize animals into various categories such as reptiles, mammals, birds, etc. Similarly, Data Quality dimensions serve as a conceptual framework designed to group data quality issues with similar patterns. Of course, you can choose to restrict, expand, or create your own taxonomy.
What is Accuracy Data Quality Dimension?
Data accuracy is the degree to which data accurately represents real-world things, events, or an agreed-upon source.
For example, if a prospective employee has an inaccurate interview address, they won't be able to attend the interview until they obtain the correct address.
We will take two examples to explain the data accuracy dimension and how it can be measured:
a. Data Accuracy Measurement With Physical World – Example
Data accuracy can be judged by comparing data values with physical measurement or observations.
Example: We perform this data accuracy check at the grocery store every time we make a purchase – by checking the items on the bill and then physically verifying the items in the grocery cart. However, this manual testing is not feasible at scale. Imagine checking the accuracy of inventory data for thousands of items: someone would have to go to the warehouse and count each item manually.
b. Data Accuracy Measurement With Reference Source – Example
Another way to measure accuracy is by comparing actual values to standard or reference values provided by a reliable source.
Example: The Consumer Price Index (CPI) is published by the US Bureau of Labor Statistics. If you have CPI index values in your database, you can compare them with the reference values obtained from the Bureau's website to measure accuracy.
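A minimal sketch of this reference-source check in Python. The reference dictionary stands in for values fetched from the authoritative source; all values, keys, and the tolerance are illustrative, not real CPI figures.

```python
# Hypothetical reference values, standing in for data from an authoritative source.
reference_cpi = {"2020-01": 257.971, "2020-02": 258.678, "2020-03": 258.115}

# Values as stored in our own database (the last one is wrong on purpose).
our_cpi = {"2020-01": 257.971, "2020-02": 258.678, "2020-03": 260.000}

def accuracy_ratio(ours, reference, tolerance=0.001):
    """Fraction of records whose value matches the reference within a tolerance."""
    matched = sum(
        1 for key, value in ours.items()
        if key in reference and abs(value - reference[key]) <= tolerance
    )
    return matched / len(ours)

print(accuracy_ratio(our_cpi, reference_cpi))  # 2 of 3 records match the reference
```

The tolerance parameter matters in practice: reference sources often publish rounded figures, so an exact-equality comparison would flag false mismatches.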
What is Completeness Data Quality Dimension?
The completeness data quality dimension is defined as the percentage of data actually populated relative to a fully (100%) populated dataset.
You have probably heard multiple times that data is incomplete for making decisions.

Example: A salesperson wants to send an email to the customer, but the data entry operator did not fill in the email address. In this case, the data is not inaccurate; rather, the email attribute was left empty. When data is missing, it directly impedes the operations of any organization.

We will provide four examples to explain different types of data completeness quality issues:
a. Completeness Check – Missing Records Example
You are an eligible voter, but at the voting booth, the record with your name is missing from the voter's list. This is an example of a missing record under the completeness data quality dimension.
b. Completeness Check – Null Attribute Example
Even though you have all the customer records, some attributes within those records might be missing values. For example, each customer record should include a name, email address, and phone number. However, the phone number or email address might be missing in some of the records.
c. Completeness Check – Missing Reference Data Example
A system might not have all the reference values required for the domain.
Example: A banker is trying to update a customer account to a "Suspended" state. The banker expects three reference values: "Open", "Closed", and "Suspended".
However, the reference table has only two domain values: "Open" and "Closed". The banker cannot find the "Suspended" reference value in the data. This is an example of reference data completeness; it is a specific case of the missing records example, where entire records are absent from the reference table.
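The check above can be sketched as a simple set difference. The state names and the expected domain follow the banking example; in a real system the expected set would come from the business's reference data catalog.

```python
# Reference values the business expects (from the banking example above).
expected_states = {"Open", "Closed", "Suspended"}

# Domain values the system actually stores.
reference_table = {"Open", "Closed"}

# Any expected value absent from the reference table is a completeness defect.
missing = expected_states - reference_table
print(sorted(missing))  # ['Suspended']
```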
d. Completeness Check – Data Truncations Example
Even if an attribute is populated with a data value, it’s possible that the values were truncated during the loading process. This often occurs if the ETL process variables are not correctly defined or if the target attribute is not large enough to capture the full length of the data values.
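Truncation is hard to prove after the fact, but one common heuristic is to flag values that exactly fill the target column width, since a truncated value always lands on that boundary. A sketch, with a made-up column width and made-up emails:

```python
# Hypothetical width of the target email column (illustrative).
EMAIL_COL_WIDTH = 20

emails = ["tom@example.com", "a.very.long.address@"]  # the second looks cut off

# Values that exactly fill the column are truncation suspects, not proof:
# a legitimate value can also happen to be exactly that long.
suspects = [e for e in emails if len(e) == EMAIL_COL_WIDTH]
print(suspects)  # ['a.very.long.address@']
```

A stronger check, where possible, is to compare lengths against the source system before and after the load.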
What is Consistency Data Quality Dimension?
Consistency refers to how closely your data aligns with or matches another dataset or a reference dataset.
Here are a few examples of Data Consistency DQ dimension:
a. Record Level Data Consistency Across Source and Target
When data is loaded from one system to another, it’s important to ensure that the data reconciles with the source system. Source vs. target reconciliation often reveals inconsistencies in the records. Below is an example of an inconsistency at the record level. The record for Tom exists in the source but not in the target system.
b. Attribute Consistency Across Source And Target
Another specialized example of inconsistency between the source and target is when the records exist on both sides, but their attributes do not match. In the case below, the records for Tom and Ken exist on both sides, but the target side is missing Tom's email and Ken's phone number.
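Attribute-level reconciliation can be sketched as below. The records mirror the Tom/Ken example; in practice both sides would be query results keyed on a shared business key.

```python
# Source-side records keyed by customer name (illustrative data).
source = {
    "Tom": {"email": "tom@x.com", "phone": "555-0100"},
    "Ken": {"email": "ken@x.com", "phone": "555-0101"},
}

# Target-side records, with two attributes dropped during the load.
target = {
    "Tom": {"email": None, "phone": "555-0100"},
    "Ken": {"email": "ken@x.com", "phone": None},
}

# Every (record, attribute) pair whose target value differs from the source.
mismatches = [
    (name, attr)
    for name, attrs in source.items()
    for attr, value in attrs.items()
    if target.get(name, {}).get(attr) != value
]
print(mismatches)  # [('Tom', 'email'), ('Ken', 'phone')]
```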
c. Data Consistency Between Two Subject Areas
In a clothing store, a customer’s order shows one gown and three pairs of dress pants. However, the shipping dataset for the same order indicates that the store must ship three gowns and one pair of dress pants. In this case, the orders and shipment quantities are inconsistent between the two datasets.
d. Transaction Data Consistency
A transaction is a collection of read/write operations that succeed only if all the contained operations are successful. If the transaction is not executed properly, it can create consistency issues in the data.
The opening balance for account A500 was $9000, and $1000 was withdrawn. At the end of the day, the A500 account should have a balance of $8000, but it is showing as $4000. This discrepancy occurred because the transaction was not executed properly, creating inconsistency in the data.
e. Data Consistency Over Time
Data values and volumes are expected to remain consistent over time, with only minor variations unless there is a significant business change.
Example: You receive IBM stock prices every day, and suddenly you notice that the value is 10 times higher. Such a jump in the stock price in a single day is nearly impossible; it is more likely a simple mistake of misplacing the decimal point.
Similarly, most companies acquire customers at a steady and consistent pace. If the business typically acquires about 500 new customers every day, and suddenly one day the number zooms to thousands, it’s highly likely that the data was loaded twice due to an error. If the customer count suddenly drops to zero, it’s possible that the data processor failed to run for that day.
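Both scenarios above (sudden spikes, sudden zeros) can be caught with a simple volume check against a trailing average. The counts, the threshold factor, and the function name are all illustrative; production checks often use seasonally-aware baselines instead.

```python
# Daily new-customer counts; the last value looks like a double load (illustrative).
daily_new_customers = [510, 495, 505, 500, 5020]

def volume_anomalies(counts, factor=3.0):
    """Flag day indexes whose count is zero or more than `factor` times
    the average of all prior days."""
    flags = []
    for i in range(1, len(counts)):
        prior_avg = sum(counts[:i]) / i
        if counts[i] == 0 or counts[i] > factor * prior_avg:
            flags.append(i)
    return flags

print(volume_anomalies(daily_new_customers))  # [4] -- the suspicious day
```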
f. Consistency In Data Representation Across Systems
Reference data is expected to be stored consistently not only within a dataset but also across multiple data stores.
Example: In a customer dataset, the reference table for sex includes “Male”, “Female”, and “Unknown”.
This reference data might be used across multiple systems. For example, a Return Material Authorization (RMA) system can experience reference data consistency issues if:
- Same meaning, different representation: The business definitions are the same but different data values are used to represent the same business concept.
- Missing reference data values: One or more reference data values are missing.
- Additional reference values: One or more reference values are added.
- Finer granularity: The reference values are further subdivided into more detailed levels.
- Same representation, different meaning: The data values are the same but used differently, which is difficult to catch.
What is Uniqueness Data Quality Dimension?
Uniqueness refers to whether an object or event is recorded more than once in a dataset.
An event or entity should be recorded only once. Duplicate data should be avoided, as it can lead to double counting or misreporting.
Below are examples of duplicate data:
a. Same Entity Is Represented by Different Identities

There is a general expectation that a single physical entity should only be represented once. In this example, the customer is recorded twice: initially as "Thomas" and a second time under the nickname "Tom". Anyone accessing the data may be confused about which name to use for the customer. Additionally, information about the customer might be split across the two records. As a result, the company may count two customers, even though there is only one.
If you simply check the data, you cannot determine if “Thomas” and “Tom” are the same because the names are different. To deduplicate such records, you will need secondary but universally unique information, such as email addresses.
b. The Same Entity Is Represented Multiple Times With the Same Identity
In this case, the record identifier is exactly the same. This type of duplication is easy to detect because the keys in the dataset can also be compared to each other to identify the duplicates.
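Key-based duplicate detection is the easy case and can be sketched in a few lines. The identifiers are made up for illustration.

```python
# Counting identifiers finds same-key duplicates directly.
from collections import Counter

customer_ids = [101, 102, 103, 102, 104, 101]  # illustrative identifiers

dupes = [cid for cid, n in Counter(customer_ids).items() if n > 1]
print(sorted(dupes))  # [101, 102]
```

The "Thomas"/"Tom" case from the previous example cannot be solved this way; it needs a secondary unique attribute (such as email) or fuzzy matching.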
What is Validity Data Quality Dimension?
Data validity refers to how closely a data value matches predetermined values or a calculation.
Here are three examples of the Validity DQ dimension:
a. Data Validity Based On Business Rules Or Calculation
The data captured in the datastore can come from a graphical user interface or an automated ETL process. But is the data valid according to the business rules?
Example: The business rule for Net Amount is: Net Amount = Gross Amt – (Tax Amt + Fee Amt + Commission Amt).
The net amount can be validated by calculating the expected value based on the business rule given above.
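A sketch of that validation. The record values and field names are illustrative; the formula follows the business rule stated above.

```python
# Illustrative record; 'net' is the stored value we want to validate.
record = {"gross": 100.00, "tax": 8.00, "fee": 2.00, "commission": 5.00, "net": 85.00}

# Recompute the net amount from the business rule and compare to the stored value.
expected_net = record["gross"] - (record["tax"] + record["fee"] + record["commission"])
is_valid = abs(expected_net - record["net"]) < 0.005  # tolerate float noise below a cent

print(is_valid)  # True
```

For money, a stricter implementation would use `decimal.Decimal` rather than floats so that no tolerance is needed at all.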
b. Data Validity For Range Of Values
Data values can also be based on predefined ranges. For example, the value (numeric or date) in an attribute must fall within the specified range.
Numeric Range: Weight ranges for USPS parcels. If the weight data doesn't match the parcel type, the data is considered invalid.

| Parcel Type | Weight Limit |
| --- | --- |
| Parcel | Contents must weigh less than 70 lbs. |
| Large Parcel | Contents must weigh less than 70 lbs. |
| Irregular Parcel | Contents must weigh less than 16 oz. |
Date Range: A liquor shop cannot have a customer who is less than 21 years old, and it is rare for a customer to be older than 100 years.
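Both range checks can be sketched as below. The weight limits follow the parcel table above (16 oz = 1 lb); the fixed `today` date, the function names, and the year-length approximation are illustrative assumptions.

```python
from datetime import date

def weight_ok(parcel_type, weight_lbs):
    """Numeric range check: weight must be under the limit for the parcel type."""
    limits = {"Parcel": 70, "Large Parcel": 70, "Irregular Parcel": 1}  # 16 oz = 1 lb
    return weight_lbs < limits[parcel_type]

def age_ok(birth_date, today=date(2020, 5, 5)):
    """Date range check: liquor-store customer age must be between 21 and 100."""
    age = (today - birth_date).days / 365.25  # rough year length
    return 21 <= age <= 100

print(weight_ok("Parcel", 65))   # True
print(age_ok(date(2005, 1, 1)))  # False: under 21
```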
c. Invalid Sequence

Normally, you cannot ship without having the order in place; that is the business rule. So, if you find a shipping record with a shipping date earlier than the order date, there is clearly a data validity issue.
What is Timeliness Data Quality Dimension?
Timeliness refers to the time lag between the actual event time and the time the event is captured in a system, making it available for use.
When an actual event occurs, the system needs to capture the event information, process it, and store it for further downstream usage. However, this process is never instantaneous.
The acceptable delay between the actual event occurrence and the availability of its data, as defined by the business or the downstream process, determines the timeliness quality dimension. It is important to note that the data is still valid, simply delayed.
Here we consider two timeliness data quality examples:
a. Late For Business Process

A pizza restaurant promises to deliver a pizza within 50 minutes. However, the order booking clerk enters the data two hours late for some reason. In this case, the data itself is correct, but for the business it is too late. The pizza is delivered late, which will result in negative reviews and potentially a loss of future business. This is a failure to meet the promise of timeliness.

Even though the data is accurate in terms of the business process and expectations, the timeliness issue makes the data of poor quality.
b. Time Lag In Real-Time Systems

In automated trading, decisions to buy/sell stocks are processed in microseconds. Users expect the immediate availability of data for their algorithmic trading.

If there is a lag in the availability of data, their competitors will have an advantage. Even if the data is accurate, it still suffers from poor timeliness quality. A similar situation can occur with self-driving cars, where any lag in the arrival of data can cause accidents, as the system won't be able to make course corrections in time.
What is Currency Data Quality Dimension?
Data Currency is defined as the degree to which the data reflects the current real-world state of the entity it describes.
Often, the data captured reflects the state of an entity at a point in time, but that state can change. If the state transitions are not captured correctly, the data becomes outdated.
Here are two examples of the data currency DQ dimension:
a. Changed Address

A mailing list contains customers' addresses, but if customers have moved to a new address, the data loses its currency.
b. Expired Coupon

Suppose you are trying to sell a wedding gown to a customer and send a discount coupon as an incentive to purchase. The coupon is sent based on data showing the customer is unmarried and in the market for a wedding dress. However, the customer is already married.

Since the data was not updated in time, it still reflects the customer's old state, and the data currency is compromised.
What is the difference between Data Timeliness and Currency?
Timeliness refers to the late arrival of data or a delay, while the information remains accurate. However, if the data arrives late and reflects a state that has changed or expired, it becomes irrelevant, losing its value or currency.
What is Conformity Data Quality Dimension?
Conformity means that the data values of the same attributes must be represented in a consistent format and adhere to the correct data types.
Humans have a unique ability to discern subtle differences and recognize commonality, whereas computers cannot. Even if the data values are correct, if the data does not adhere to the same standard format or data type, it results in conformity data quality issues.
Below are two examples of the data conformity DQ dimension:
a. Format Conformity

The order date below is expected to follow the 'MM/DD/YYYY' format. While the data may appear correct to humans, any variation in the data format can cause chaos for computers.

- Don's record has the date in 'YYYY/M/DD' format.
- Joe's record has the date in the correct 'MM/DD/YYYY' format.
- Tim's record is in the 'YYYY/M/DD HH:MM:SS' format.
Data Format conformity issues can typically be identified using regular expressions.
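A sketch of such a regular-expression check for the 'MM/DD/YYYY' format. The pattern validates the shape of the value, not calendar correctness (it would accept 02/31/2020); the sample dates mirror the examples above.

```python
import re

# MM must be 01-12, DD must be 01-31, YYYY any four digits.
DATE_RE = re.compile(r"(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])/\d{4}")

dates = ["05/05/2020", "2020/5/05", "05/05/2020 10:30:00"]

# fullmatch requires the whole string to conform, so trailing timestamps fail too.
conforming = [d for d in dates if DATE_RE.fullmatch(d)]
print(conforming)  # ['05/05/2020']
```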
b. Data Type Conformity
Data type is another case of conformity quality issue. The order amount in the table below is expected to be numeric, but Joe's record is written in alphanumeric format. This is a data type conformity issue.
What is Integrity Data Quality Dimension?
Data Integrity is the degree to which defined relational constraints are implemented between two datasets.
Data integrity issues can arise within a single system or across multiple systems. The key characteristic of the integrity data quality dimension is the relationship between two datasets.

Here are two examples of the data integrity dimension:
a. Referential Integrity Or Foreign Keys:
A child record's foreign key must always reference an existing record in the parent dataset. For example, an order might have a customer number as a foreign key, which means that the customer number must also exist in the customer table. The master dataset could reside within the same database or in a different system.
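When the parent lives in a different system and the database cannot enforce the constraint, the check can be run as a batch job. A sketch with illustrative customer numbers and order records:

```python
# Known customer numbers from the parent (customer) dataset.
customers = {1001, 1002, 1003}

# Child (order) records carrying the customer number as a foreign key.
orders = [
    {"order_id": "A1", "customer_id": 1001},
    {"order_id": "A2", "customer_id": 1002},
    {"order_id": "A3", "customer_id": 1009},  # orphan: no such customer
]

# Orders whose foreign key has no matching parent are integrity defects.
orphans = [o["order_id"] for o in orders if o["customer_id"] not in customers]
print(orphans)  # ['A3']
```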
b. Cardinality Integrity
Another example of the integrity data quality dimension is cardinality (1:1, 1:many, etc.). Cardinality defines the ratio between two datasets. For example, an employee can have only one badge (1:1). If the cardinality of the relationship is known in advance, it can be checked under the data integrity DQ dimension.
What is Precision Data Quality Dimension?
Precision refers to the degree to which the data has been rounded or aggregated.
In industrial measurements, precision and accuracy are different concepts. Accuracy refers to the deviation from the target data value, while precision pertains to the closeness of the values to each other. In data quality measurement, Precision is a derived concept used to identify errors related to rounding or aggregation of data.
Below are some examples of precision errors:
a. Precision Errors Due To Rounding Of Number
Depending on the degree of precision provided by the GPS coordinates, a location can be off by kilometers. The table below shows values ranging from two decimal places of precision to five; the corresponding location error ranges from about 1 meter to over 1 kilometer.
| GPS Decimal Places | Decimal Degrees | N/S or E/W at equator |
| --- | --- | --- |
| 2 | 0.01 | 1.1132 km |
| 3 | 0.001 | 111.32 m |
| 4 | 0.0001 | 11.132 m |
| 5 | 0.00001 | 1.1132 m |
Imagine the consequences of a military bombing occurring 1 km away from the intended location.
In stock trading, the SEC under Rule 612 mandates a minimum precision for stocks: those worth over $1.00 must have a precision of $0.01, while stocks under $1.00 require a precision of $0.0001.
| Stock | Date | End of Day Price |
| --- | --- | --- |
| IBM | 05/05/2020 | $122.58 |
| JPM | 05/05/2020 | $92.00 |
| MTNB (Penny Stock) | 05/05/2020 | $0.7064 |
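A price-precision check modeled loosely on the increments described above can be sketched with `Decimal`, which avoids float rounding surprises. The function name and the $1.00 boundary handling are illustrative, not a compliance implementation.

```python
from decimal import Decimal

def precision_ok(price):
    """Check that a price uses no finer increment than its tier allows:
    $0.01 for prices of $1.00 and above, $0.0001 below $1.00."""
    p = Decimal(price)
    increment = Decimal("0.01") if p >= 1 else Decimal("0.0001")
    return p % increment == 0

print(precision_ok("122.58"))   # True
print(precision_ok("0.7064"))   # True
print(precision_ok("122.585"))  # False: finer than a penny
```

Note that prices are passed as strings; constructing `Decimal` from a float would smuggle binary rounding error into the check.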
b. Time Precision
Store accounting is done at the day level and may not require the exact second of purchase. However, for credit card fraud detection, time precision must be accurate to the second.
c. Granularity Precision
Every time data is aggregated, it loses detail, or precision. Granular data cannot be derived from summarized data.

At first glance, granularity may not seem like an obvious aspect of precision. However, for certain operations, aggregated or summarized data is not useful.

For example, if you want to pay each salesperson's commission based on their individual sales, you will need the specific sales numbers, not just the aggregated total.
Commission Calculator

| Salesperson | $ Sales by Each Employee | Commission % | $ Sales × Commission % = Commission Amount |
| --- | --- | --- | --- |
| John Dove | — | 3% | ? |
| Evan Gardner | — | 3% | ? |
| Accessories | — | 3% | ? |
But the data below does not have precision at the salesperson level; it is summarized across all employees for each month of the quarter. Since the head of sales does not have individual sales data for each salesperson, they cannot pay commissions.
Total Sales

| Month | Total Sales |
| --- | --- |
| Jan 2020 | $4,050,000 |
| Feb 2020 | $3,500,000 |
| Mar 2020 | $500,000 |
Data Quality Measurement
Data quality measurement is simply the ratio of defective records, as identified by one of the data quality dimensions, to the total number of records.
| Data Quality Dimension | Measurement |
| --- | --- |
| Accuracy | # of records with inaccurate data / total # of records |
| Completeness | # of records with incomplete data / total # of records |
| Timeliness | # of records with late-arriving data / total # of records |
| Currency | # of records with outdated data / total # of records |
| Consistency | # of records with inconsistent data / total # of records |
| Uniqueness | # of non-unique records / total # of records |
| Validity | # of records with invalid data / total # of records |
| Conformity | # of records with non-conforming data / total # of records |
| Integrity | # of records with integrity issues / total # of records |
| Precision | # of records with imprecise data / total # of records |
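One way to turn a defect ratio from the table into a dashboard-ready score is to report the percentage of records that pass. The helper name and rounding choice are my own, not a standard formula:

```python
def dq_score(defective, total):
    """Percent of records that pass a dimension's check: 100 * (1 - defective/total)."""
    return round(100 * (1 - defective / total), 2)

# e.g. 25 inaccurate records out of 1000 total
print(dq_score(25, 1000))  # 97.5
```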
The above ratios can easily be represented as gauges on a dashboard, and they can be aggregated or drilled down across different dimensions.
Conclusion
I hope you liked the data quality examples and now understand that there is much more to data quality than the six DQ dimensions. Do not fret too much about these classifications; choose the ones you like or define your own.
Do you agree with our thought process? Leave a comment below!