6 Dimensions of Data Quality: Complete Guide with Examples & Measurement Methods
Table of Contents
- Data Quality
- Six Data Quality Dimensions
- Accuracy Data Quality Dimension
- Completeness Data Quality Dimension
- Consistency Data Quality Dimension
- Uniqueness Data Quality Dimension
- Validity Data Quality Dimension
- Timeliness Data Quality Dimension
- Currency Data Quality Dimension
- Conformity Data Quality Dimension
- Integrity Data Quality Dimension
- Precision Data Quality Dimension
- Data Quality Measurement
- Regulatory Compliance
- Conclusion
- FAQs
In this guide, I will explain both Data Quality (DQ) and the six data quality dimensions. Additionally, you will learn advanced data quality concepts, data quality measurement, and examples of different data quality dimensions. This guide shares my 25+ years of experience in real-world data engineering. Let’s dive right in.
What is Data Quality?
Data Quality (DQ) is defined as the data’s suitability for a user’s defined purpose. It is subjective, as the concept of quality is relative to the standards defined by the end user’s expectations.
Data Quality Expectations: It is possible that for the exact same data, different users may have totally different data quality expectations, depending on their usage. For example, the accounting department needs accurate data to the penny, whereas the marketing team does not, because the approximate sales numbers are enough to determine the sales trends.
Instead of just providing definitions of the different data quality dimensions, this guide offers a comprehensive, nuanced set of use cases and examples from our professional experience.
What are the Six Data Quality Dimensions?
The six data quality dimensions are Accuracy, Completeness, Consistency, Uniqueness, Timeliness, and Validity. However, this classification is not universally agreed upon.
In this guide, we have added four more dimensions: Currency, Conformity, Integrity, and Precision, bringing the total to 10 DQ dimensions.
As humans, we naturally like to classify things. For example, we categorize animals into various categories such as reptiles, mammals, birds, etc. Similarly, Data Quality dimensions serve as a conceptual framework designed to group data quality issues with similar patterns. Of course, you can choose to restrict, expand, or create your own taxonomy.
What is Accuracy Data Quality Dimension?
Data accuracy is the degree to which data accurately represents real-world things, events, or an agreed-upon source.
For example, if a prospective employee is given an inaccurate interview address, they won’t be able to attend the interview until they obtain the correct address.
We will take two examples to explain the data accuracy dimension and how it can be measured:
- Data accuracy measurement with physical world
- Data accuracy measurement with reference source
Data Accuracy Measurement with Physical World – Example
Data accuracy can be judged by comparing data values with physical measurement or observations.
Example: We perform this data accuracy check at the grocery store every time we make a purchase – by checking the items on the bill and then physically verifying the items in the grocery cart.
However, this manual testing is not feasible at scale. Imagine checking the accuracy of inventory data for thousands of items: someone would have to go to the warehouse and count each item manually.
Data Accuracy Measurement with Reference Source – Example
Another way to measure accuracy is by comparing actual values to standard or reference values provided by a reliable source.
Example: The Consumer Price Index (CPI) is published by the US Bureau of Labor Statistics. If you have CPI index values in your database, you can compare them with the reference values obtained from the Bureau’s website to measure accuracy.
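As a sketch, accuracy against a reference source can be computed as the share of stored values that match the reference, optionally within a tolerance. The CPI figures below are illustrative placeholders, not real published values.

```python
# Sketch: fraction of stored values matching a reference source.
def accuracy_ratio(stored: dict, reference: dict, tolerance: float = 0.0) -> float:
    """Share of stored values within `tolerance` of the reference value."""
    common = [k for k in stored if k in reference]
    if not common:
        return 0.0
    matches = sum(1 for k in common if abs(stored[k] - reference[k]) <= tolerance)
    return matches / len(common)

# Illustrative CPI values only -- not actual published figures.
stored_cpi    = {"2020-01": 257.97, "2020-02": 258.68, "2020-03": 258.12}
reference_cpi = {"2020-01": 257.97, "2020-02": 258.68, "2020-03": 258.11}

print(accuracy_ratio(stored_cpi, reference_cpi))        # exact-match accuracy
print(accuracy_ratio(stored_cpi, reference_cpi, 0.05))  # accuracy within +/-0.05
```

A tolerance is often necessary in practice, since reference sources may round or revise published figures.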
What is Completeness Data Quality Dimension?
The completeness data quality dimension is defined as the percentage of data actually populated relative to full (100%) population.
You have probably heard multiple times that data is incomplete for making decisions.
Example: A salesperson wants to send an email to the customer, but the data entry operator did not fill in the email address. In this case, the data is not inaccurate, rather the email attribute was left empty. When data is missing, it directly impedes the operations of any organization.
We will provide four examples to explain different types of data completeness quality issues:
- The record itself is missing
- A value in an attribute is missing
- A reference value is missing
- Data truncation
Missing Records – Example
You are an eligible voter, but at the voting booth, the record with your name is missing from the voters list. This is an example of a missing record under the completeness data quality dimension.
Null Attribute – Example
Even though you have all the customer records, some attributes within those records might be missing values. For example, each customer record should include a name, email address, and phone number. However, the phone number or email address might be missing in some of the records.
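A minimal null-attribute completeness check might look like this; the required fields and customer records are illustrative assumptions.

```python
# Sketch: completeness as the percentage of required attribute values populated.
REQUIRED_FIELDS = ("name", "email", "phone")

def completeness_pct(records):
    total = len(records) * len(REQUIRED_FIELDS)
    filled = sum(
        1 for r in records for f in REQUIRED_FIELDS if r.get(f) not in (None, "")
    )
    return 100.0 * filled / total if total else 0.0

customers = [
    {"name": "Ann", "email": "ann@example.com", "phone": "555-0100"},
    {"name": "Bob", "email": None, "phone": "555-0101"},       # missing email
    {"name": "Cal", "email": "cal@example.com", "phone": ""},  # missing phone
]
print(f"{completeness_pct(customers):.1f}% of required attributes populated")
```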
Missing Reference Data – Example
A system might not have all the reference values required for the domain.
Example: A banker is trying to update a customer account to a “Suspended” state. The banker expects three reference values:
- Open
- Closed
- Suspended
However, the reference table has only two domain values: “Open” and “Closed”. The banker cannot find the “Suspended” reference value in the data. This is an example of a reference data completeness issue. It is a specific case of the missing-record example above, where the missing record is a reference value.
Data Truncation – Example
Even if an attribute is populated with a data value, it’s possible that the values were truncated during the loading process. This often occurs if the ETL process variables are not correctly defined or if the target attribute is not large enough to capture the full length of the data values.
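One way to detect likely truncation is to compare source values with what landed in the target, given the target column’s declared width. The width of 10 and the city names below are assumptions for illustration.

```python
# Sketch: flag target values that look like truncated copies of the source.
TARGET_COLUMN_WIDTH = 10  # assumed declared width of the target attribute

def find_truncated(source_vals, target_vals, width=TARGET_COLUMN_WIDTH):
    issues = []
    for src, tgt in zip(source_vals, target_vals):
        # A value longer than the column that matches the target's prefix
        # is a strong truncation signal.
        if len(src) > width and tgt == src[:width]:
            issues.append((src, tgt))
    return issues

source_cities = ["Springfield", "Boston", "San Francisco"]
target_cities = ["Springfiel", "Boston", "San Franci"]
print(find_truncated(source_cities, target_cities))
```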
What is Consistency Data Quality Dimension?
Consistent data refers to how closely your data aligns with or matches another dataset or a reference dataset.
It refers to the degree to which data values are the same across multiple datasets, systems, or repositories. It evaluates whether data remains uniform and free from contradiction regardless of where it resides, whether within an organization’s information ecosystem or in external systems.
Consistency is particularly crucial in distributed systems, data warehouses, and when implementing data replication.
Here are a few examples of Data Consistency DQ dimension:
- Record level data consistency across source and target
- Attribute level data consistency across source and target
- Data consistency between subject areas
- Data consistency in transactions
- Data consistency across time
- Data consistency in data representation
Record level data consistency across source and target – Example
When data is loaded from one system to another, it’s important to ensure that the data reconciles with the source system. Source vs. target reconciliation often reveals inconsistencies in the records. Consider an inconsistency at the record level: the record for Tom exists in the source but not in the target system.
Attribute consistency across source and target – Example
Another specialized example of inconsistency between the source and target is when the records exist on both sides, but their attributes do not match. For instance, the records for Tom and Ken exist on both sides, but the target side is missing Tom’s email and Ken’s phone number.
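Both record-level and attribute-level reconciliation can be sketched as a keyed comparison; the records below are illustrative.

```python
# Sketch: reconcile source and target datasets keyed on a record ID.
source = {
    1: {"name": "Tom", "email": "tom@example.com", "phone": "555-0100"},
    2: {"name": "Ken", "email": "ken@example.com", "phone": "555-0101"},
    3: {"name": "Sue", "email": "sue@example.com", "phone": "555-0102"},
}
target = {
    1: {"name": "Tom", "email": None, "phone": "555-0100"},
    2: {"name": "Ken", "email": "ken@example.com", "phone": None},
}

# Record-level inconsistency: keys present in the source but not the target.
missing_in_target = source.keys() - target.keys()

# Attribute-level inconsistency: same key, differing field values.
mismatches = sorted(
    (key, field)
    for key in source.keys() & target.keys()
    for field in source[key]
    if source[key][field] != target[key].get(field)
)
print(missing_in_target)  # {3}
print(mismatches)         # [(1, 'email'), (2, 'phone')]
```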
Data consistency between two subject areas – Example
In a clothing store, a customer’s order shows one gown and three pairs of dress pants. However, the shipping dataset for the same order indicates that the store must ship three gowns and one pair of dress pants. In this case, the order and shipment quantities are inconsistent between the two datasets.
Transaction Data consistency – Example
A transaction is a collection of read/write operations that succeed only if all the contained operations are successful. If the transaction is not executed properly, it can create consistency issues in the data.
The opening balance for account A500 was $9000, and $1000 was withdrawn. At the end of day, the A500 account should have a balance of $8000, but it shows $4000. This discrepancy occurred because the transaction was not executed properly, creating inconsistency in the data.
Data Consistency over time – Example
Data values and volumes are expected to remain consistent over time, with only minor variations unless there is a significant business change.
Example: You receive IBM stock prices every day, and suddenly, you notice a tenfold increase in its value. A 1000% increase in stock prices in a single day is nearly impossible. This could simply be a decimal placement error.
Similarly, most companies acquire customers at a steady and consistent pace. If the business typically acquires about 500 new customers every day, and suddenly one day the number zooms to thousands, it’s highly likely that the data was loaded twice due to an error. If the customer count suddenly drops to zero, it’s possible that the data processor failed to run for that day.
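A simple volume check over time could flag days that deviate sharply from the running average; the threshold factor and daily counts below are illustrative assumptions.

```python
# Sketch: flag days whose volume deviates sharply from the prior average.
def volume_anomalies(daily_counts, factor=3.0):
    flagged = []
    for i in range(1, len(daily_counts)):
        baseline = sum(daily_counts[:i]) / i  # average of all prior days
        if daily_counts[i] > factor * baseline or daily_counts[i] < baseline / factor:
            flagged.append(i)
    return flagged

# ~500 new customers/day, then a suspected double load, then a failed job.
new_customers = [500, 510, 495, 5200, 0]
print(volume_anomalies(new_customers))  # [3, 4]
```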
Consistency in data representation across systems – Example
Reference data is expected to be stored consistently, not only within a dataset but also across multiple data stores.
Example: In a customer dataset, the reference table for sex includes “Male”, “Female”, and “Unknown”.
This reference data might be used across multiple systems. For example, the Return Material Authorization (RMA) can experience reference data consistency issues if:
- Same meaning, different representation: The business definitions are the same but different data values are used to represent the same business concept.
- Missing reference data values: One or more reference data values are missing.
- Additional reference values: One or more reference values are added.
- Finer granularity: The reference values are further subdivided into more detailed levels.
- Same representation, different meaning: The data values are the same but used differently, which is difficult to catch.
What is Uniqueness Data Quality Dimension?
Uniqueness refers to whether an object or event is recorded more than once in a dataset.
An event or entity should be recorded only once. Duplicate data should be avoided as it can lead to double counting or misreporting.
Below are examples of duplicate data:
- One entity is represented by two identities
- One entity is represented multiple times with the same identity
Same entity is represented with different identities – Example
There is a general expectation that a single physical entity should only be represented once. In this example, the customer is recorded twice, initially as “Thomas” and then by his nickname “Tom”. Anyone accessing the data may become confused about which customer name to use to proceed. Additionally, information about the customer might be split across the two records. As a result, the company may count two customers, even though there is only one.
If you simply check the data, you cannot determine if “Thomas” and “Tom” are the same because the names are different. To de-duplicate such records, you will need secondary but universally unique information, such as email addresses.
Same entity is represented multiple times with same identity – Example
In this case, the record identifier is the same. This type of duplication is easy to detect because the keys in the dataset can be compared to each other to identify the duplicates.
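Detecting same-key duplicates can be sketched with a simple frequency count; the customer IDs below are illustrative.

```python
# Sketch: detect records that share the same key.
from collections import Counter

customer_ids = ["C001", "C002", "C003", "C002", "C004", "C002"]
duplicates = {key: n for key, n in Counter(customer_ids).items() if n > 1}
print(duplicates)  # {'C002': 3}
```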
What is Validity Data Quality Dimension?
Data validity refers to how closely a data value matches predetermined values or a calculation.
Here are three examples of the Validity DQ dimension:
- Data validity based on business rules or a calculation
- Data validity for a range of values
- Invalid sequence
Data Validity based on Business Rules or Calculation – Example
The data captured in the data store can come from a graphical user interface or an automated ETL process. But is the data valid according to the business rules?
Example: The business rule for the Net Amount is: Gross Amt – (Tax Amt + Fee Amt + Commission Amt).
The net amount can be validated by calculating the expected value based on the business rule given above.
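The business-rule validation above can be sketched directly; the amounts are illustrative, and `Decimal` is used to avoid floating-point rounding in money arithmetic.

```python
# Sketch: validate Net Amount against the business rule
# Net = Gross - (Tax + Fee + Commission). Amounts are illustrative.
from decimal import Decimal

def net_amount_valid(row):
    expected = row["gross"] - (row["tax"] + row["fee"] + row["commission"])
    return row["net"] == expected

order = {
    "gross": Decimal("100.00"),
    "tax": Decimal("8.25"),
    "fee": Decimal("1.50"),
    "commission": Decimal("5.00"),
    "net": Decimal("85.25"),
}
print(net_amount_valid(order))  # True
```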
Data Validity for Range of Values – Example
Data values can also be based on predefined ranges. For example, the value (numeric or date) in an attribute must fall within the specified range.
Numeric Range: Weight range for a USPS parcel. If the weight data doesn’t match the parcel type, then the data is considered invalid.
Parcel Type | Weight Constraint |
Parcel | Contents must weigh less than 70 lbs. |
Large Parcel | Contents must weigh more than 70 lbs. |
Irregular Parcel | Contents must weigh less than 16 oz. |
Date Range: A liquor shop cannot have a customer who is less than 21 years old, and it is rare for a customer to be older than 100 years.
Invalid Sequence – Example
Normally, you cannot ship without having the order in place; that is the business rule. So, if you find a shipping record with a shipping date earlier than the order date, there is clearly a data validity issue.
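The sequence rule can be sketched as a date comparison; the order records below are illustrative.

```python
# Sketch: flag shipments dated earlier than their order.
from datetime import date

orders = [
    {"id": 1, "ordered": date(2020, 5, 1), "shipped": date(2020, 5, 3)},
    {"id": 2, "ordered": date(2020, 5, 4), "shipped": date(2020, 5, 2)},
]
invalid_sequence = [o["id"] for o in orders if o["shipped"] < o["ordered"]]
print(invalid_sequence)  # [2]
```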
What is Timeliness Data Quality Dimension?
Timeliness refers to the time lag between the actual event time and the time the event is captured in a system, making it available for use.
When an actual event occurs, the system needs to capture the event information, process it, and store it for further downstream usage. However, this process is never instantaneous.
The delay between actual event occurrence and the availability of data, defined by the business or the downstream process, defines the timeliness quality dimension. It is important to note that the data is still valid and simply delayed.
Timeliness refers to the time expectation for the accessibility of data – measuring whether data is available when it is needed for use. It focuses on whether data delivery meets established deadlines or service level agreements. For example, ensuring “all records in the customer dataset must be loaded by 9:00 am” would be a timeliness requirement.
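A timeliness SLA check like the 9:00 am example can be sketched as follows; the dataset names and load timestamps are illustrative.

```python
# Sketch: check dataset loads against a 9:00 am SLA deadline.
from datetime import datetime, time

SLA_DEADLINE = time(9, 0)

loads = [
    ("customers", datetime(2020, 5, 5, 8, 42)),
    ("orders",    datetime(2020, 5, 5, 9, 17)),
]
late_loads = [name for name, loaded_at in loads if loaded_at.time() > SLA_DEADLINE]
print(late_loads)  # ['orders']
```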
Here we are considering two timeliness data quality examples:
- Late for business
- Lag in the data capture
Late for Business Process – Example
A pizza restaurant promises to deliver a pizza within 50 minutes. However, for some reason the order booking clerk enters the data two hours late. The data itself is correct in this case, but it arrives too late for the business. The late delivery of the pizzas will result in negative reviews and could potentially lead to a loss of future business. This is a failure to meet the promise of timeliness.
Even though the data is accurate for business processes and expectations, the issue of timeliness makes the data poor in quality.
Time Lag in Real-Time Systems – Example
In automated trading, decisions to buy / sell stocks are processed in microseconds. The user expects the immediate availability of data for their algorithmic trading.
If there is a lag in the availability of data, their competitors will have an advantage. Even if the data is accurate, it still suffers from poor timeliness quality.
A similar situation can occur in self-driving cars, where any data lag may lead to accidents as the system won’t be able to make corrections to its path in time.
What is Currency Data Quality Dimension?
Data currency is defined as the degree to which the data reflects the current real-world state of the entity it describes.
Often, the data captured reflects the current state of an entity, but the state of the object can change over time. If the state transition is not captured correctly, the data becomes outdated.
Currency measures how up to date the data is compared to what is represented. It might be timely, but the state of the underlying object can change over time and the data might represent its past state, making it useless.
Here are two examples of the data currency DQ dimension:
- Changed Address
- Expired Coupon
Changed Address – Example
A mailing list contains customers’ addresses, but if a customer has moved to a new address, the data loses its currency.
Expired Coupon – Example
Suppose you are trying to sell a wedding gown to your customer and send a discount coupon as an incentive to purchase. The coupon is sent based on data showing the customer is unmarried and in the market for a wedding dress. However, the customer is already married.
Since the data was not updated on time, it still reflects the customer’s old state, and the data currency is compromised.
What is Conformity Data Quality Dimension?
Conformity means that the data values of the same attributes must be represented in a consistent format and adhere to the correct data types.
Humans have a unique ability to discern subtle differences and recognize commonality, whereas computers cannot. Even if the data values are correct, if the data does not adhere to the same standard format or data type, it results in conformity data quality issues.
Below are two examples of the data conformity DQ dimension:
- Format Conformity
- Data Type Conformity
Format Conformity – Example
The order date below is expected to follow ‘MM/DD/YYYY’ format. While the data may appear correct to humans, any changes in the data format can cause chaos for computers.
- Don’s record has a date in ‘YYYY/M/DD’ format.
- Joe’s record has the correct ‘MM/DD/YYYY’ format.
- Tim’s record is in ‘YYYY/M/DD HH:MM:SS’ format.
Data Format conformity issues can typically be identified using regular expressions.
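As a sketch, a regular expression for the expected MM/DD/YYYY format can flag the non-conforming records; the names and dates mirror the illustrative examples above.

```python
# Sketch: flag order dates that do not conform to MM/DD/YYYY.
import re

MM_DD_YYYY = re.compile(r"^(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])/\d{4}$")

order_dates = {
    "Don": "2020/5/01",           # wrong field order
    "Joe": "05/01/2020",          # conforms
    "Tim": "2020/5/01 10:15:30",  # extra time component
}
nonconforming = [name for name, d in order_dates.items() if not MM_DD_YYYY.match(d)]
print(nonconforming)  # ['Don', 'Tim']
```

Note that a regex checks format only; a value like 02/31/2020 would still pass and needs a calendar-aware validity check.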
Data Type Conformity – Example
Data type is another case of a conformity quality issue. The order amount is expected to be numeric, but Joe’s record is written in alphanumeric format. This is a data type conformity issue.
What is Integrity Data Quality Dimension?
Data integrity is the degree to which defined relational constraints are implemented between two datasets.
Data integrity issues can arise within a single system or across multiple systems. The key characteristic of the integrity data quality dimension is the relationship between two datasets. Two common integrity checks are:
- Referential Integrity
- Relationship Cardinality
Referential Integrity or Foreign Keys – Example
A foreign key in a child record must always reference an existing record in the parent dataset. For example, an order might carry a customer number as a foreign key, which means that the customer number must also exist in the customer table. The parent (master) dataset could reside within the same database or in a different system.
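The referential-integrity check can be sketched by looking for orphan foreign keys; the customer numbers and orders below are illustrative.

```python
# Sketch: verify every order's customer number exists in the customer table.
customer_table = {"C001", "C002", "C003"}
orders = [
    {"order_id": 1, "customer_no": "C001"},
    {"order_id": 2, "customer_no": "C009"},  # orphan foreign key
]
orphan_orders = [
    o["order_id"] for o in orders if o["customer_no"] not in customer_table
]
print(orphan_orders)  # [2]
```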
Cardinality Integrity – Example
Another example of the integrity data quality dimension is cardinality: 1:1, 1:many, etc. Cardinality defines the ratio between two datasets. For example, an employee can have only one badge (1:1). If the cardinality of the relationship is known in advance, it can be checked under the data integrity DQ dimension.
What is Precision Data Quality Dimension?
Precision refers to the degree to which the data has been rounded or aggregated.
In industrial measurements, precision and accuracy are different concepts. Accuracy refers to the deviation from the target data value, while precision pertains to the closeness of the values to each other. In data quality measurement, precision is a derived concept used to identify errors related to rounding or aggregation of data.
Below are some examples of precision errors:
- Numerical precision
- Time precision
- Granularity precision
Precision errors due to rounding off numbers – Example
Depending on the degree of precision provided by the GPS coordinates, the location can differ by kilometers. The table below shows values ranging from two-digit precision to five-digit precision. The location error can range from 1 meter to 1 kilometer.
GPS Decimal Places | Decimal Degrees | N/S or E/W at equator |
2 | 0.01 | 1.1132 km |
3 | 0.001 | 111.32 m |
4 | 0.0001 | 11.132 m |
5 | 0.00001 | 1.1132 m |
Imagine the consequences of a military bombing missing the target by a kilometer from the intended location.
In stock trading, SEC Rule 612 mandates minimum price increments for quotations: stocks priced at $1.00 or more must be quoted in increments of $0.01, while stocks under $1.00 may be quoted in increments of $0.0001.
Stock | Date | End of Day Price |
IBM | 05/05/2020 | $122.58 |
JPM | 05/05/2020 | $92.00 |
MTNB (Penny Stock) | 05/05/2020 | $0.7064 |
Time Precision – Example
The store accounting is done at the day level and may not require the exact second of purchase. However, for credit card fraud detection, time precision must be accurate to the second.
Granularity Precision – Example
Every time data is aggregated, it loses details or precision. Granular data cannot be derived from summarized data.
At first glance, granularity may not seem like an obvious aspect of precision. However, for certain operations, aggregated or summarized data is not useful.
For example, if you want to pay each salesperson’s commission based on their individual sale, you will need the specific sales figure, not just the aggregated total.
Commission Calculator
Salesperson | $ Sales by Employee | Commission % | Commission Amount (Sales × Commission %) |
John Dove | — | 3% | ? |
Evan Gardner | — | 3% | ? |
Accessories | — | 3% | ? |
But the data below does not have precision at the salesperson level. It is summarized across all employees for each month of the quarter. Since the head of sales does not have individual sales data for each salesperson, they cannot pay commissions.
Total Sales
Month | Total Sales |
Jan 2020 | $4,050,000 |
Feb 2020 | $3,500,000 |
Mar 2020 | $500,000 |
Dimension Comparison Matrix
Understanding how data quality dimensions relate to each other helps organizations develop comprehensive quality strategies:
Dimension | Complementary Dimensions | Potential Conflicts | Primary Focus |
Accuracy | Validity, Integrity | Timeliness | Correctness |
Completeness | Integrity, Validity | Uniqueness | Coverage |
Consistency | Accuracy, Validity | Currency | Coherence |
Timeliness | Currency | Accuracy, Completeness | Recency |
Validity | Accuracy, Consistency | None | Conformance |
Uniqueness | Accuracy | Completeness | Singularity |
Data Quality Measurement
This is simply the ratio of defective records, as identified by one of the data quality dimensions, to the total number of records.
Data Quality Dimension | Measurement |
Accuracy | # of records with inaccurate data / total # of records |
Completeness | # of records with incomplete data / total # of records |
Timeliness | # of records with late data / total # of records
Currency | # of records with outdated data / total # of records
Consistency | # of records with inconsistent data / total # of records |
Uniqueness | # of non-unique records / total # of records |
Validity | # of records with invalid data / total # of records |
Conformity | # of records with non-conforming data / total # of records
Integrity | # of records with integrity issues / total # of records
Precision | # of records with imprecise data / total # of records |
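The ratio above turns directly into a per-dimension score, often expressed as the percentage of non-defective records; the defect counts below are illustrative.

```python
# Sketch: per-dimension quality scores from defect counts.
defects = {"accuracy": 40, "completeness": 20, "uniqueness": 5}
total_records = 1000

# Score = percentage of records WITHOUT a defect in that dimension.
scores = {dim: 100.0 * (1 - count / total_records) for dim, count in defects.items()}
for dim, score in scores.items():
    print(f"{dim}: {score:.1f}%")
```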
Measurement Framework
Implementing a structured measurement approach ensures consistent evaluation of data quality dimensions:
Scoring Methodology
- Define Metrics: Establish specific measurements for each dimension
- Set Thresholds: Determine acceptable quality levels by data type and use case
- Weight Dimensions: Assign importance factors based on business impact
- Calculate Composite Scores: Combine dimensional scores with overall quality rating
- Establish Baselines: Document initial scores for measuring improvement
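The steps above can be sketched as a weighted composite score. The weights and current scores mirror the sample template that follows (the template’s overall 93.1 rounds each row’s score before summing, so the unrounded composite differs slightly).

```python
# Sketch: weighted composite data quality score (weights/scores illustrative).
dimensions = {
    "accuracy":     (0.25, 92),
    "completeness": (0.20, 96),
    "consistency":  (0.15, 85),
    "timeliness":   (0.15, 91),
    "validity":     (0.15, 97),
    "uniqueness":   (0.10, 98),
}
# Weights should sum to 100%.
assert abs(sum(w for w, _ in dimensions.values()) - 1.0) < 1e-9

composite = sum(weight * score for weight, score in dimensions.values())
print(f"Overall quality score: {composite:.2f}")
```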
Sample Scoring Template
Dimension | Metric | Weight | Target | Current | Score |
Accuracy | % matching reference source | 25% | >95% | 92% | 23.0 |
Completeness | % required fields populated | 20% | >98% | 96% | 19.2 |
Consistency | % agreement across systems | 15% | >90% | 85% | 12.8 |
Timeliness | % records updated within SLA | 15% | >95% | 91% | 13.7 |
Validity | % records passing all rules | 15% | >98% | 97% | 14.6 |
Uniqueness | % unique entities | 10% | >99% | 98% | 9.8 |
OVERALL | | 100% | | | 93.1 |
The above can be easily represented by a gauge representation on a dashboard. It can also be easily aggregated or drilled down into different dimensions.
Regulatory Compliance
Data quality dimensions directly support compliance with various regulatory frameworks. For financial institutions interested in Basel Committee on Banking Supervision (BCBS 239) compliance, please see our dedicated article on BCBS 239 Solutions.
Other regulatory frameworks that emphasize data quality dimensions include:
- GDPR (General Data Protection Regulation)
- HIPAA (Health Insurance Portability and Accountability Act)
- SOX (Sarbanes-Oxley Act)
- CCPA (California Consumer Privacy Act)
Conclusion
I hope you liked the data quality examples and now understand that there are many more than the six DQ dimensions. Do not fret too much about these classifications; choose the one you like or define your own.
Do you agree with our thought process? Leave a comment below!
I will soon have another article as to why these data quality dimensions are meaningless from a business point of view. Watch this space…
FAQs
What does the consistency dimension refer to when data quality is being measured?
The consistency dimension refers to whether data values are coherent and free of contradiction across different datasets, systems, or time periods. It ensures that logically related data elements maintain their relationships and that the same entity is represented uniformly wherever it appears.
Which dimension for measuring data quality means that the data conforms to a set of predefined standards and definitions such as type and format?
The data quality dimension that ensures conformity to predefined standards and definitions such as type and format is called data conformity.
- Conforming to required data types (numeric, text, date, etc.)
- Following specified formats (date formats, phone numbers, etc.)
Data quality is measured across which three dimensions?
A: Data quality is not typically measured across just three dimensions, but rather six primary ones: accuracy, completeness, consistency, validity, timeliness, and uniqueness. If you had to pick the top three, they would be accuracy, completeness, and consistency.
What data quality dimension ensures that the data stored in multiple locations is the same?
A: Consistency dimension refers to:
- Value consistency: Ensuring the same data elements have identical values across different systems.
- Format consistency: Maintaining uniform data formats (e.g., date formats, units of measurement).
- Structural consistency: Preserving compatible data structures and relationships.
- Temporal consistency: Making sure data remains coherent and accurate over time.
Which dimension of data quality focuses on ensuring that data is up to date?
The dimension of data quality that focuses on ensuring data is up to date is timeliness. This dimension measures whether information is available when it’s needed.
Which data quality dimension is most closely related to the timeliness of data?
Currency is most closely related to timeliness. While timeliness focuses on data being available when needed, currency specifically addresses how well data reflects the current state of what it represents. Both dimensions deal with temporal aspects of data quality.
What is the difference between timeliness and currency in data quality dimensions?
A: While these dimensions are closely related and sometimes grouped together, they have distinct focuses: timeliness concerns whether data arrives when it is needed (the data may be correct but late), while currency concerns whether data reflects the current real-world state (data can be delivered on time yet still represent outdated information).
Which dimension ensures data conforms to predefined standards and definitions?
The validity dimension ensures that data conforms to predefined standards, formats, types, and business rules. It focuses on structural correctness and adherence to established definitions, including data type constraints, allowed value ranges, and formatting requirements.
During a routine data quality assessment exercise, the assessors found that a date field contained data from the US in M/D/Y format and dates from the rest of the world in D/M/Y format. This will result in a poor score on which of the following data quality dimensions?
In this case the data values are correct, but their representation is not. This issue would primarily result in a poor score on the consistency dimension of data quality. But if your organization uses the conformity dimension, that is a more accurate classification.
What does the completeness dimension check?
It checks whether all required records are present and the data elements within the records are populated.
What does the completeness dimension rate mean?
The completeness dimension rates the degree to which all required data is present in a dataset. For example, a customer database might be rated 95% complete if 95% of customer records have all the required information filled in.
Which of these is not a dimension of data? Data complexity. data quality. data integrity. data storage.
The two values ‘data complexity’ and ‘data storage’ are not data quality dimensions.
How is data quality measured across dimensions?
Data quality is typically measured across dimensions by establishing specific metrics for each dimension, setting thresholds for acceptable quality, and calculating scores that indicate compliance levels. Organizations often create weighted composite scores that combine individual dimension ratings into an overall quality assessment.