In this guide, I will explain both data quality (DQ) and the six data quality dimensions. Additionally, you will learn advanced data quality concepts, data quality measurement, and examples of different data quality dimensions. This guide draws on my 25+ years of real-world data engineering experience. Let’s dive right in.
What is Data Quality?
Data quality (DQ) is defined as the data’s suitability for a user’s defined purpose. It is subjective, as the concept of quality is relative to the standards defined by the end users’ expectations.
Data Quality Expectations: It is possible that for the exact same data, depending on its usage, different users can have totally different data quality expectations. For example, the accounting department needs data accurate to the penny, whereas the marketing team does not, because approximate sales numbers are enough to determine sales trends.
Instead of just providing definitions of different data quality dimensions, this guide provides a comprehensive and very nuanced list of use cases and examples from our professional experience.
What are the Six Data Quality Dimensions?
The six data quality dimensions are Accuracy, Completeness, Consistency, Uniqueness, Timeliness, and Validity. However, this classification is not universally agreed upon.
In this guide we have added four more – Currency, Conformity, Integrity, and Precision – to create a total of 10 DQ dimensions.
We, by nature, like to classify things. For example, we classify animals into various categories such as reptiles, mammals, and birds. Similarly, data quality dimensions are a superficial concept introduced to bucket data quality issues with similar patterns. You can, of course, choose to restrict or expand the list, or come up with your own taxonomy.
What is Accuracy Data Quality Dimension?
Data accuracy is the degree to which data represent real-world things, events, or an agreed-upon source.
For example, if a prospective employee has an inaccurate interview address, he won’t be able to attend the interview until he obtains the correct address.
We will take two examples explaining the data accuracy dimension and how it can be measured:
-
a. Data Accuracy Measurement With Physical World – Example
Data accuracy can be judged by comparing the data values with a physical measurement or physical observations.
Example: We do this data accuracy check at the grocery store every time we make a purchase by checking the items on the bill and then physically checking for those items in the grocery cart. However, this manual testing is not feasible at scale. Imagine checking the accuracy of inventory data for thousands of items: someone must go to the warehouse and count each of the items.
-
b. Data Accuracy Measurement With Reference Source – Example
Another way to measure accuracy is by simply comparing actual values to standard or reference values provided by a reliable source.
Example: The Consumer Price Index (CPI) is published by the US Bureau of Labor Statistics.
If you have CPI values in your database, then for the accuracy measurement you can compare them with the reference values obtained from the Bureau of Labor Statistics website.
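To make this concrete, here is a minimal sketch of such a reference comparison in Python. The monthly values are hypothetical, hard-coded for illustration; in practice the reference values would be pulled from the Bureau of Labor Statistics website.

```python
# Minimal sketch: compare stored CPI values against reference values
# from a trusted source (illustrative, hard-coded data).

our_cpi = {"2020-01": 257.971, "2020-02": 258.678, "2020-03": 258.115}
reference_cpi = {"2020-01": 257.971, "2020-02": 258.678, "2020-03": 258.150}

TOLERANCE = 0.001  # allow tiny rounding differences

inaccurate = []
for month, ref_value in reference_cpi.items():
    our_value = our_cpi.get(month)
    if our_value is None or abs(our_value - ref_value) > TOLERANCE:
        inaccurate.append((month, our_value, ref_value))

print(f"Inaccurate records: {len(inaccurate)} of {len(reference_cpi)}")
for month, ours, ref in inaccurate:
    print(f"  {month}: stored={ours} reference={ref}")
```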
What is Completeness Data Quality Dimension?
The completeness data quality dimension is defined as the percentage of data actually populated compared to the possibility of it being 100% populated.
You have probably heard multiple times that the data is too incomplete to make a decision. For example, a salesperson wants to send an email to a customer, but the data entry operator did not fill in the email address. Here it is not that the data was inaccurate; the email attribute was simply left empty. When data is missing, it directly impedes the operations of any organization. We will use four examples explaining different types of data completeness quality issues:
-
a. Completeness Check – Missing Records Example
You are an eligible voter, but at the voting booth, the record with your name is missing from the voter’s list. This is an example of a missing record for the completeness data quality dimension.
-
b. Completeness Check – Null Attribute Example
Even though you have all the customer records, some of the attributes in the customer records might be missing values. For example, each customer record must have a name, email address, and phone number. However, the phone number or the email address might be missing in some of the customer records.
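A minimal sketch of such a null-attribute completeness check, assuming a hypothetical in-memory list of customer records; a real implementation would typically run the same logic against a database table or a DataFrame.

```python
# Minimal sketch: measure attribute-level completeness for customer records.

customers = [
    {"name": "Tom", "email": "tom@example.com", "phone": None},
    {"name": "Ken", "email": None,              "phone": "555-0100"},
    {"name": "Joe", "email": "joe@example.com", "phone": "555-0101"},
]

required_attributes = ["name", "email", "phone"]

for attr in required_attributes:
    missing = sum(1 for c in customers if not c.get(attr))
    completeness = 100.0 * (len(customers) - missing) / len(customers)
    print(f"{attr}: {completeness:.1f}% complete ({missing} missing)")
```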
-
c. Completeness Check – Missing Reference Data Example
A system might not have all the reference values required for the domain. For example, a banker is trying to update a customer account to a “Suspended” state. The banker expects three reference values.
1. Open
2. Closed
3. Suspended
But the reference table has only two domain values, “Open” and “Closed”. The banker cannot find the “Suspended” reference value in the data. This is a case of reference data completeness, a specific case of the earlier example where entire records are missing.
-
d. Completeness Check – Data Truncations Example
Even if an attribute is populated with a data value, it is possible that the values got truncated while loading. This often happens if the ETL process variables are not correctly defined, or the target attribute is not large enough to capture the full length of the data values.
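One hedged heuristic (not the only approach) is to flag values whose length exactly equals the target column’s declared maximum, since truncated loads tend to pile up at that limit. A sketch, assuming a hypothetical VARCHAR(10) target column:

```python
# Heuristic sketch: values that exactly hit the declared column width
# are suspects for truncation during the ETL load.

MAX_LEN = 10  # hypothetical VARCHAR(10) target column

loaded_values = ["Jonathan S", "Amy", "Christophe", "Bob"]

suspects = [v for v in loaded_values if len(v) == MAX_LEN]
print(f"Possible truncations: {suspects}")  # ['Jonathan S', 'Christophe']
```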
What is Consistency Data Quality Dimension?
Data consistency describes how closely your data aligns, or is uniform, with another dataset or a reference dataset.
Here are a few examples of the Data Consistency DQ dimension:
-
a. Record Level Data Consistency Across Source and Target
When data is loaded from one system to another, we need to ensure that the data reconciles with the source system. Source vs. target reconciliation usually surfaces the inconsistencies in the records. Below is an example of inconsistency at the record level: the record for Tom exists in the source but not in the target system.
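A minimal reconciliation sketch using primary keys, with hypothetical source and target key sets standing in for the two systems:

```python
# Minimal sketch: record-level reconciliation between source and target
# using the primary key (customer name here, purely for illustration).

source_keys = {"Tom", "Ken", "Joe"}
target_keys = {"Ken", "Joe"}

missing_in_target = source_keys - target_keys      # records the load dropped
unexpected_in_target = target_keys - source_keys   # records that should not exist

print("Missing in target:", missing_in_target)        # {'Tom'}
print("Unexpected in target:", unexpected_in_target)  # set()
```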
-
b. Attribute Consistency Across Source And Target
This is another, more specific example of inconsistency between source and target. The records exist on both sides, but their attributes do not match. In the case below, records for Tom and Ken exist on both sides, but on the target side Tom’s record is missing the email and Ken’s record is missing the phone number.
-
c. Data Consistency Between Two Subject Areas
In a clothing store, a customer’s order shows one gown and three dress pants. The shipping dataset for the same order shows that the store must ship three gowns and one pair of dress pants. Here the order and shipment quantities are inconsistent between the two datasets.
-
d. Transaction Data Consistency
A transaction is a collection of read/write operations succeeding only if all contained operations succeed. If the transaction is not executed properly then it can create consistency issues in the data.
The opening balance for account A500 was $9,000, and $1,000 was taken out. So, at the end of the day, the A500 account should have an end-of-day balance of $8,000, but it is showing as $4,000. This happened because the transaction was not executed properly, which created an inconsistency in the data.
-
e. Data Consistency Over Time
Data values and volumes are expected to be consistent over time with minor variations unless there is a big business change.
You receive IBM stock prices every day, and suddenly you notice that the value is 10 times higher. A tenfold jump in a stock price in a single day is nearly impossible. This could be a simple mistake of a misplaced decimal point.
Most companies acquire customers at a steady and consistent pace. If the business acquires about 500 new customers every day, and suddenly one day the number zooms to thousands, then there is a high possibility that the data was loaded twice because of some error. If the customer count suddenly drops to zero, then it is possible that the data processor has failed to run for that day.
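A simple volume check of this kind can be sketched as follows, assuming hypothetical daily new-customer counts; the thresholds are arbitrary and would be tuned per business.

```python
# Minimal sketch: flag a day's record volume that deviates sharply
# from the trailing average (possible double load or missed run).

daily_new_customers = [510, 495, 502, 488, 505, 1020]  # last value is today

history, today = daily_new_customers[:-1], daily_new_customers[-1]
baseline = sum(history) / len(history)

if today == 0:
    print("Volume check failed: no records loaded today (missed run?)")
elif today > 1.5 * baseline or today < 0.5 * baseline:
    print(f"Volume check failed: {today} vs baseline {baseline:.0f} (double load?)")
else:
    print("Volume check passed")
```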
-
f. Consistency In Data Representation Across Systems
The reference data is expected to be stored consistently not only in a dataset but also across multiple data stores. In a customer dataset, the reference table for sex has Male, Female, and Unknown.
Now this reference data might be used in multiple systems, and, for example, the Return Material Authorization system can have reference data consistency issues if:
- Same meaning but different representation: the business definitions are the same, but different data values represent the same business concept.
- One or more of the reference data values is missing.
- One or more additional reference values are added.
- The reference values are subdivided further into finer granularity.
- Same representation but different meaning. This is difficult to catch as the data values are the same but used differently.
What is Uniqueness Data Quality Dimension?
Uniqueness is violated when the occurrence of an object or an event gets recorded multiple times in a dataset.
An event or entity should be recorded only once. No one wants duplicate data because it can cause double counting or misreporting.
Below are examples of duplicate data:
-
a. Same Entity Is Represented With Different Identities
There is a general expectation that a single physical entity should be represented only once. In this example, the customer is recorded twice, first as Thomas and a second time under the nickname Tom. Someone accessing the data will be confused about what to call the customer. Also, the information about the customer might be partially stored across the two records, and the company will count two customers whereas there is only one.
If you simply check the data, you cannot determine whether Thomas and Tom are the same person because the names are different. You will need secondary but universally unique information, such as email addresses, to deduplicate such records.
-
b. Same Entity Is Represented Multiple Times With Same Identity
In this case, the record identifier is exactly the same. This case is easy to detect because the keys in the dataset can be compared against themselves to find the duplicates.
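A sketch of that duplicate-key detection, assuming a hypothetical list of customer IDs:

```python
# Minimal sketch: find keys that appear more than once in a dataset.
from collections import Counter

customer_ids = ["C001", "C002", "C003", "C002", "C004", "C002"]

duplicates = {key: count for key, count in Counter(customer_ids).items() if count > 1}
print("Duplicate keys:", duplicates)  # {'C002': 3}
```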
What is Validity Data Quality Dimension?
Data validity describes the closeness of a data value to predetermined values or a calculation.
Here are three examples of the Validity DQ dimension:
-
a. Data Validity Based On Business Rules Or Calculation
The data captured in the datastore can come through a graphical user interface or some background ETL process. But is the data valid according to the business rules? The business rule for Net Amount is Gross Amt – (Tax Amt + Fee Amt + Commission Amt).
The net amount can be validated by calculating the expected value based on the business rule.
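A minimal sketch of that rule as a validity check, using hypothetical order records:

```python
# Minimal sketch: validate Net Amt = Gross Amt - (Tax + Fee + Commission).

orders = [
    {"id": 1, "gross": 100.0, "tax": 8.0, "fee": 2.0, "commission": 5.0, "net": 85.0},
    {"id": 2, "gross": 200.0, "tax": 16.0, "fee": 4.0, "commission": 10.0, "net": 175.0},
]

TOLERANCE = 0.01  # allow penny-level rounding

for order in orders:
    expected_net = order["gross"] - (order["tax"] + order["fee"] + order["commission"])
    if abs(order["net"] - expected_net) > TOLERANCE:
        print(f"Order {order['id']}: invalid net {order['net']} (expected {expected_net})")
```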
-
b. Data Validity For Range Of Values
The data values can also be based on ranges provided in advance. For example, the value (numeric or date) in an attribute must fall in the specified range.
Numeric Range: Weight range for a USPS parcel. If the weight data doesn’t match the parcel type, then we know the data is invalid.
| Parcel Type | Weight Limit |
| --- | --- |
| Parcel | Contents must weigh less than 70 lbs. |
| Large Parcel | Contents must weigh less than 70 lbs. |
| Irregular Parcel | Contents must weigh less than 16 oz. |
Date Range: A liquor shop cannot have a customer who is less than 21 years old, and it is rarely possible that a customer is older than 100 years.
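A sketch of both range checks with hypothetical parcel and customer data; the weight limits mirror the table above and the age bounds mirror the liquor-store example.

```python
# Minimal sketch: range-based validity checks for parcel weight and customer age.

WEIGHT_LIMITS_LBS = {"Parcel": 70, "Large Parcel": 70, "Irregular Parcel": 1}  # 16 oz = 1 lb

parcels = [("Parcel", 45.0), ("Irregular Parcel", 2.5)]
for parcel_type, weight in parcels:
    if weight >= WEIGHT_LIMITS_LBS[parcel_type]:
        print(f"Invalid weight: {weight} lbs for {parcel_type}")

customer_ages = [34, 19, 104]
for age in customer_ages:
    if not (21 <= age <= 100):
        print(f"Invalid customer age: {age}")
```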
-
c. Invalid Sequence
Normally you cannot ship without having the order in place; that is the business rule. So, if you find a shipping record with a shipping date earlier than the order date, there is obviously a data validity problem.
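A sketch of the sequence check, assuming hypothetical order and ship dates:

```python
# Minimal sketch: a shipping date must not be earlier than the order date.
from datetime import date

shipments = [
    {"order_id": 1, "order_date": date(2020, 5, 1), "ship_date": date(2020, 5, 3)},
    {"order_id": 2, "order_date": date(2020, 5, 4), "ship_date": date(2020, 5, 2)},
]

for s in shipments:
    if s["ship_date"] < s["order_date"]:
        print(f"Order {s['order_id']}: ship date {s['ship_date']} precedes order date {s['order_date']}")
```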
What is Timeliness Data Quality Dimension?
Timeliness is the time lag between when an event actually occurs and when it is captured in a system and made available for use.
When an actual event occurs, the system needs to capture the event information, process it, and store it for downstream usage. This is, however, never instantaneous.
The delay between the actual event occurrence and the data availability expected by the business or the downstream process defines the timeliness quality dimension. It is important to understand that the data is still valid, just late.
Here we consider two timeliness data quality examples:
-
a. Late For Business Process
A pizza restaurant promises to deliver a pizza within 50 minutes. But the order booking clerk, for some reason, enters the order two hours late. In this case, the data is correct by itself, but for the business it arrives too late. The pizza is delivered late, which will result in negative reviews and the probable loss of future business, because the restaurant did not keep its timeliness promise.
Even though the data is accurate, in the context of the business process and its expectations the data is of poor quality.
-
b. Time Lag In Real-Time Systems
In automated trading, decisions to buy or sell stocks are made in microseconds. The user expects the immediate availability of data for algorithmic trading.
If there is a lag in the availability of data, competitors will have an advantage. Again, even if the data is accurate, it still has poor timeliness quality. A similar situation can occur in self-driving cars, where any lag in the arrival of data can cause accidents because the car cannot course-correct in time.
What is Currency Data Quality Dimension?
Data currency is about how well the state captured in the dataset reflects the current real-world state.
Often the captured data contains the current state of an entity. The state of the object can change over time, and if the state transition is not captured correctly, the data becomes useless.
Here are two examples of the data currency DQ dimension:
-
a. Changed Address
A mailing list has customers’ addresses. But if a customer has already moved to a new address, the data loses its currency.
-
b. Expired Coupon
Suppose you are trying to sell a wedding gown to a customer and send a discount coupon as an incentive to purchase. The coupon was sent because the data showed the customer is unmarried and in the market for a wedding dress. But the customer is already married.
Since the data was not updated in time, it still shows an old state of the customer, and the data currency is bad.
-
What is the difference between Data Timeliness and Currency?
Timeliness is about the late arrival of data: the information is delayed but still accurate. Currency is about data that reflects a state that has since changed or expired, so the data has become irrelevant and lost its value.
What is Conformity Data Quality Dimension?
Conformity means that data values of the same attribute must be represented in a uniform format and data type.
Humans have a unique ability to discern subtle differences and understand commonality; computers cannot. Even if the values are correct, if the data does not follow the same standard format or data type, then it has conformity data quality issues.
Below are two examples of data conformity:
-
a. Format Conformity
The order date below is expected to follow the ‘MM/DD/YYYY’ format. For humans, the data in the table looks correct, but for a computer the format variations will cause chaos.
- Don’s record has the date in ‘YYYY/M/DD’
- Joe’s record has the date in the correct ‘MM/DD/YYYY’ format
- The third record, for Tim, is in ‘YYYY/M/DD HH:MM:SS’
Data format conformity issues can usually be identified with the help of regular expressions.
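For example, here is a hedged sketch of such a regular-expression check for the ‘MM/DD/YYYY’ format; the pattern below only validates the shape of the value, not whether the date itself is real.

```python
# Minimal sketch: flag order dates that do not conform to MM/DD/YYYY.
import re

MM_DD_YYYY = re.compile(r"^(0[1-9]|1[0-2])/(0[1-9]|[12][0-9]|3[01])/\d{4}$")

order_dates = {
    "Don": "2020/5/12",
    "Joe": "05/12/2020",
    "Tim": "2020/5/12 10:31:22",
}

for name, value in order_dates.items():
    if not MM_DD_YYYY.match(value):
        print(f"{name}: non-conforming date format -> {value!r}")
```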
-
b. Data Type Conformity
Data type mismatch is another case of a conformity quality issue. The order amount in the table below is expected to be numeric, but the record for Joe is written in an alphanumeric format, which is a data type conformity issue.
What is Integrity Data Quality Dimension?
The data integrity quality dimension is the degree to which defined relational constraints are enforced between two datasets.
Data integrity issues can exist within a single system or across two systems. The main characteristic of the integrity data quality dimension is the relationship between two datasets.
Here are two examples of the data integrity dimension:
-
a. Referential Integrity Or Foreign Keys:
Every parent record referenced by a child dataset must actually exist. If an order carries a customer number as a foreign key, then that customer number must also exist in the customer table. The master dataset could be in the same database or in a different system.
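A sketch of an orphan-record check, assuming hypothetical order and customer tables held as simple Python collections; the same idea is often expressed in SQL as a LEFT JOIN where the parent key comes back NULL.

```python
# Minimal sketch: every customer number on an order must exist
# in the customer (parent) table.

customers = {"C001", "C002", "C003"}
orders = [("O-10", "C001"), ("O-11", "C004"), ("O-12", "C002")]

orphans = [(order_id, cust) for order_id, cust in orders if cust not in customers]
print("Orders with missing customers:", orphans)  # [('O-11', 'C004')]
```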
-
b. Cardinality Integrity
Another example of the integrity data quality dimension is cardinality: 1:1, 1:many, and so on. Cardinality defines the ratio between two datasets. For example, an employee can have only one badge (1:1). If the cardinality of the relationship is known in advance, it can be checked under the integrity DQ dimension.
What is Precision Data Quality Dimension?
Precision is the degree to which the data has been rounded or aggregated.
In industrial measurements, precision and accuracy are different concepts. Accuracy is the deviation from the target value, while precision is about the closeness among values. Precision in data quality measurement is more of a derived concept, devised to check for rounding and aggregation errors in data.
Below are some examples of precision errors:
-
a. Precision Errors Due To Rounding Of Number
Depending on the degree of precision provided by the GPS coordinates, the location might differ by kilometers. The table below shows values from two-digit to five-digit precision. The location error can range from about 1 meter to about 1 kilometer.
| GPS Decimal Places | Decimal Degrees | N/S or E/W at Equator |
| --- | --- | --- |
| 2 | 0.01 | 1.1132 km |
| 3 | 0.001 | 111.32 m |
| 4 | 0.0001 | 11.132 m |
| 5 | 0.00001 | 1.1132 m |
Imagine the consequences of military bombing 1 km away from the intended location.
For stock trading, the SEC under Rule 612 has mandated a minimum price increment of $0.01 for stocks quoted above $1.00, while stock quotes under $1.00 need a precision of $0.0001.
| Stock | Date | End of Day Price |
| --- | --- | --- |
| IBM | 05/05/2020 | $122.58 |
| JPM | 05/05/2020 | $92.00 |
| MTNB (Penny Stock) | 05/05/2020 | $0.7064 |
-
b. Time Precision
Store accounting is done at the day level and might not care about the exact second of a purchase. But for credit card fraud detection, the time must be precise to the second.
-
c. Granularity Precision
Every time data is aggregated it loses details or precision. You cannot derive granular data from summarized data.
At first glance, granularity does not seem like an obvious case of precision. For some operations aggregated or summarized data is not useful.
If you want to pay each salesperson’s commission based on his personal sales, you will need his individual sales numbers.
Commission Calculator

| Product | $ Sales by Each Employee | Commission | $ Sales by Emp x Commission % = Commission Amount |
| --- | --- | --- | --- |
| John Dove | — | 3% | ? |
| Evan Gardner | — | 3% | ? |
| Accessories | — | 3% | ? |
Data Quality Measurement
There are two fundamental ways of measuring and reporting data quality issues:
-
a. DQ Measurement With Success To Fail Ratio Approach
This is simply the ratio of defective records found for a given data quality dimension to the total number of records available.
| Data Quality Dimension | Measurement |
| --- | --- |
| Accuracy | # of records with inaccurate data / total # of records |
| Completeness | # of records with incomplete data / total # of records |
| Timeliness | # of records with late data / total # of records |
| Currency | # of records with stale data / total # of records |
| Consistency | # of records with inconsistent data / total # of records |
| Uniqueness | # of non-unique records / total # of records |
| Validity | # of records with invalid data / total # of records |
| Conformity | # of records with non-conforming data / total # of records |
| Integrity | # of records with integrity issues / total # of records |
| Precision | # of records with imprecise data / total # of records |
The above is easily represented by a gauge in a dashboard. It is also easy to aggregate or drill down across different dimensions.
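A hedged sketch of the ratio calculation, using hypothetical defect counts per dimension:

```python
# Minimal sketch: defect ratio (and pass rate) per data quality dimension.

total_records = 10_000
defects_by_dimension = {"Accuracy": 120, "Completeness": 340, "Uniqueness": 45}  # hypothetical

for dimension, defects in defects_by_dimension.items():
    defect_ratio = defects / total_records
    print(f"{dimension}: {defect_ratio:.2%} defective, {1 - defect_ratio:.2%} passing")
```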
-
b. DQ Measurement With Six Sigma Approach
To measure data quality, we can borrow the concept of Six Sigma quality from manufacturing. In simple terms, it compares the actual mistakes made against the opportunities to make mistakes.
- Think of your data infrastructure as a data factory
- Each table is a product type
- Every record in a table is a product you are delivering.
- Every column is an opportunity to make a mistake.
Thus, evaluating how many products (records) you delivered defect-free, compared to how many opportunities for defects you had, will give your sigma rating for data quality.
With data, the product we deliver is an instance of a record. We can refer to a data record as a unit product, and a table is a type of unit. All the records delivered during a period are the total units. Assume that of these 10,000 records (units), 200 are defective. We also need to determine the possible opportunities for failure in a unit, in our case a data record; there are many ways to do this, but here each record has 20 attributes/columns that can hold defective values.

| Metric | Formula | Value |
| --- | --- | --- |
| Total Units (Total Records) | - | 10,000 |
| Total Units with Defects (Failed Records) | - | 200 |
| Defects Per Unit (DPU) | Failed Records / Total Records | 200 / 10,000 = 0.02 |
| Opportunities for Failure per Unit (Record) | - | 20 |
| Defects Per Opportunity (DPO) | Defects / (Units x Opportunities) | 200 / (10,000 x 20) = 0.001 = 0.1% |
| Yield (capability of data engineering to produce defect-free records) | 1 - DPO | 1 - 0.001 = 0.999 = 99.9% |
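The same walkthrough can be sketched as a small calculation, using the figures from the table above; DPMO (defects per million opportunities) is added here only to connect the result to the sigma level table that follows.

```python
# Minimal sketch: DPU, DPO, yield, and DPMO for the example above.

total_records = 10_000         # total units
defective_records = 200        # units with at least one defect
opportunities_per_record = 20  # attributes/columns per record

dpu = defective_records / total_records                                # 0.02
dpo = defective_records / (total_records * opportunities_per_record)  # 0.001
yield_rate = 1 - dpo                                                   # 0.999
dpmo = dpo * 1_000_000                                                 # defects per million opportunities

print(f"DPU={dpu}, DPO={dpo}, Yield={yield_rate:.1%}, DPMO={dpmo:.0f}")
```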
What is Sigma Level for Data Quality?
| Sigma Level | Defects per Million (Records) | Yield (Records with No Defects) |
| --- | --- | --- |
| 6 | 3.4 | 99.99966% |
| 5 | 230 | 99.977% |
| 4 | 6,210 | 99.38% |
| 3 | 66,800 | 93.32% |
| 2 | 308,000 | 69.15% |
| 1 | 690,000 | 30.85% |
Conclusion
I hope you liked the data quality examples and now understand that there is much more to data quality than the six DQ dimensions. Do not fret too much about these classifications; choose the ones you like or define your own.
Do you agree with our thought process? Leave a comment below!