6 Dimensions of Data Quality, Examples, and Measurement

In this guide, I will explain both data quality (DQ) and the six data quality dimensions. You will learn advanced data quality concepts, data quality measurement, and examples of different data quality dimensions. This guide shares my 25+ years of experience in real-world data engineering. Let’s dive right in.

What is Data Quality?

Data quality (DQ) is defined as the data’s suitability for a user’s defined purpose. It is subjective, as the concept of quality is relative to the standards defined by the end user’s expectations.

Data Quality Expectations: For the exact same data, different users may have very different data quality expectations, depending on how they use it. For example, the accounting department needs data accurate to the penny, whereas the marketing team does not, because approximate sales numbers are enough to determine sales trends.

Instead of just providing definitions of different data quality dimensions, this guide offers a comprehensive and very nuanced list of use cases and examples from our professional experience.

What are the Six Data Quality Dimensions?

The six data quality dimensions are Accuracy, Completeness, Consistency, Uniqueness, Timeliness, and Validity. However, this classification is not universally agreed upon.

In this guide, we have added four more dimensions (Currency, Conformity, Integrity, and Precision), bringing the total to 10 DQ dimensions.

As humans, we naturally like to classify things. For example, we group animals into categories such as reptiles, mammals, and birds. Similarly, data quality dimensions serve as a conceptual framework designed to group data quality issues with similar patterns. Of course, you can choose to restrict, expand, or create your own taxonomy.

What is Accuracy Data Quality Dimension?

Data accuracy is the degree to which data correctly represents real-world things, events, or an agreed-upon source.

For example, if a prospective employee is given an inaccurate interview address, they won’t be able to attend the interview until they obtain the accurate address.

We will take two examples to explain the data accuracy dimension and how it can be measured:

  1. Data Accuracy Measurement with the physical world.
  2. Data Accuracy Measurement with Reference Source.
  • a. Data Accuracy Measurement With Physical World – Example

Data accuracy can be judged by comparing data values with physical measurements or observations.

Example: We perform this data accuracy check at the grocery store every time we make a purchase, by comparing the items on the bill with the items physically in the grocery cart.

However, this manual testing is not feasible at scale. Imagine checking the accuracy of inventory data for thousands of items: someone would have to go to the warehouse and count each item manually.

  • b. Data Accuracy Measurement With Reference Source – Example

Another way to measure accuracy is by comparing actual values to standard or reference values provided by a reliable source.

Example: The Consumer Price Index (CPI) is published by the U.S. Bureau of Labor Statistics (BLS). If you have CPI values in your database, you can compare them with the reference values obtained from the BLS website to measure accuracy.
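
A reference-source comparison like this one can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CPI figures are made up, and the names `reference_cpi`, `stored_cpi`, and `find_inaccurate` are hypothetical. In practice, the reference values would be loaded from the publisher's data.

```python
# Hypothetical values for illustration only; these are not real CPI figures.
reference_cpi = {"2020-01": 100.0, "2020-02": 100.5, "2020-03": 101.0}
stored_cpi = {"2020-01": 100.0, "2020-02": 105.0, "2020-03": 101.0}


def find_inaccurate(stored, reference, tolerance=0.0):
    """Return the keys whose stored value differs from the reference value."""
    return sorted(
        key
        for key, value in stored.items()
        if key in reference and abs(value - reference[key]) > tolerance
    )


print(find_inaccurate(stored_cpi, reference_cpi))  # ['2020-02']
```

The `tolerance` parameter is an assumption that lets each user decide how close is close enough, which matches the idea that quality expectations depend on usage.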

What is Completeness Data Quality Dimension?

The completeness data quality dimension is defined as the percentage of data that is populated, measured against the 100% that could be populated.

You have probably heard many times that the data is too incomplete for making decisions.

Example: A salesperson wants to send an email to a customer, but the data entry operator did not fill in the email address. In this case, the data is not inaccurate; rather, the email attribute was left empty. When data is missing, it directly impedes the operations of any organization.

We will provide four examples to explain different types of data completeness quality issues:

  1. The record itself is missing.
  2. A value in an attribute is missing.
  3. A reference value is missing.
  4. Data truncation.
    • a. Completeness Check – Missing Records Example

    You are an eligible voter, but at the voting booth, the record with your name is missing from the voter’s list. This is an example of a missing record under the completeness data quality dimension.
    • b. Completeness Check – Null Attribute Example

    Even though you have all the customer records, some attributes within those records might be missing values. For example, each customer record should include a name, email address, and phone number. However, the phone number or email address might be missing in some of the records.
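
As a rough sketch, the share of populated values per attribute can be computed as follows. The sample records and the `completeness` helper are hypothetical, and the sketch assumes that both `None` and an empty string count as missing.

```python
# Made-up customer records; None and "" are both treated as missing.
customers = [
    {"name": "Tom", "email": "tom@example.com", "phone": None},
    {"name": "Ken", "email": "", "phone": "555-0100"},
    {"name": "Ann", "email": "ann@example.com", "phone": "555-0101"},
]


def completeness(records, attribute):
    """Return the fraction of records in which the attribute is populated."""
    if not records:
        return 1.0  # nothing to check, so nothing is missing
    populated = sum(1 for r in records if r.get(attribute) not in (None, ""))
    return populated / len(records)


for attribute in ("name", "email", "phone"):
    print(attribute, f"{completeness(customers, attribute):.0%}")
```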

    • c. Completeness Check – Missing Reference Data Example

    A system might not have all the reference values required for the domain.

    Example: A banker is trying to update a customer account to a “Suspended” state. The banker expects three reference values:
    1. Open
    2. Closed
    3. Suspended

    However, the reference table has only two domain values: “Open” and “Closed”. The banker cannot find the “Suspended” reference value in the data. This is an example of incomplete reference data. It is a special case of the missing-record example above, because an entire reference record is absent.

    • d. Completeness Check – Data Truncations Example

    Even if an attribute is populated with a data value, it’s possible that the values were truncated during the loading process. This often occurs if the ETL process variables are not correctly defined or if the target attribute is not large enough to capture the full length of the data values.

    What is Consistency Data Quality Dimension?

    Data consistency refers to how closely your data aligns with or matches another dataset or a reference dataset.

    • a. Record Level Data Consistency Across Source and Target

    When data is loaded from one system to another, it’s important to ensure that the data reconciles with the source system. Source vs. target reconciliation often reveals inconsistencies in the records. For example, at the record level, the record for Tom may exist in the source but not in the target system.
    • b. Attribute Consistency Across Source And Target

    Another specialized example of inconsistency between the source and target is when the records exist on both sides, but their attributes do not match. For example, the records for Tom and Ken exist on both sides, but the target side is missing Tom’s email and Ken’s phone number.
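
Both the record-level and the attribute-level reconciliation can be sketched with plain dictionaries keyed by customer name. The data and the `reconcile` function below are hypothetical, and a third customer, Ann, is invented to show a record that is missing from the target.

```python
# Made-up source and target extracts, keyed by customer name.
source = {
    "Tom": {"email": "tom@example.com", "phone": "555-0100"},
    "Ken": {"email": "ken@example.com", "phone": "555-0101"},
    "Ann": {"email": "ann@example.com", "phone": "555-0102"},
}
target = {
    "Tom": {"email": None, "phone": "555-0100"},
    "Ken": {"email": "ken@example.com", "phone": None},
}


def reconcile(source, target):
    """Return (records missing in target, (key, attribute) pairs that differ)."""
    missing = sorted(source.keys() - target.keys())
    mismatched = []
    for key in sorted(source.keys() & target.keys()):
        for attribute, value in source[key].items():
            if target[key].get(attribute) != value:
                mismatched.append((key, attribute))
    return missing, mismatched


missing, mismatched = reconcile(source, target)
print(missing)     # ['Ann']
print(mismatched)  # [('Ken', 'phone'), ('Tom', 'email')]
```
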
    • c. Data Consistency Between Two Subject Areas

    In a clothing store, a customer’s order shows one gown and three pairs of dress pants. However, the shipping dataset for the same order indicates that the store must ship three gowns and one pair of dress pants. In this case, the orders and shipment quantities are inconsistent between the two datasets.

    • d. Transaction Data Consistency

    A transaction is a collection of read/write operations that succeed only if all the contained operations are successful. If the transaction is not executed properly, it can create consistency issues in the data.

    The opening balance for account A500 was $9000, and $1000 was withdrawn. At the end of the day, the A500 account should have a balance of $8000, but it is showing as $4000. This discrepancy occurred because the transaction was not executed properly, creating inconsistency in the data.
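
A check for this kind of discrepancy can be sketched by recomputing the closing balance from the opening balance and the day's transactions. The `expected_closing` helper is hypothetical, and the sketch only detects the symptom (a balance that does not add up); it does not enforce transactional behavior in the database.

```python
from decimal import Decimal


def expected_closing(opening, transactions):
    """Opening balance plus the signed sum of the day's transactions."""
    return opening + sum(transactions, Decimal("0"))


opening = Decimal("9000")
transactions = [Decimal("-1000")]   # one withdrawal
reported_closing = Decimal("4000")  # value shown in the account

expected = expected_closing(opening, transactions)
print(expected, expected == reported_closing)  # 8000 False
```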

    • e. Data Consistency Over Time

    Data values and volumes are expected to remain consistent over time, with only minor variations unless there is a significant business change.

    Example: You receive IBM stock prices every day, and suddenly you notice that the value has jumped to 10 times the previous day’s price. A 900% increase in a stock price in a single day is nearly impossible. This could be a simple mistake of misplacing the decimal point.

    Similarly, most companies acquire customers at a steady and consistent pace. If the business typically acquires about 500 new customers every day, and suddenly one day the number zooms to thousands, it’s highly likely that the data was loaded twice due to an error. If the customer count suddenly drops to zero, it’s possible that the data processor failed to run for that day.
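
One simple way to flag such days is to compare each daily value with a baseline. The sketch below uses the median of the series as the baseline and a factor of two as the threshold; both choices, the sample counts, and the `flag_outliers` name are assumptions made for illustration.

```python
from statistics import median


def flag_outliers(values, max_ratio=2.0):
    """Return indexes of values that are zero or negative, or that deviate
    from the median by more than a factor of max_ratio."""
    baseline = median(values)
    flagged = []
    for index, value in enumerate(values):
        too_high = value > baseline * max_ratio
        too_low = value < baseline / max_ratio
        if value <= 0 or too_high or too_low:
            flagged.append(index)
    return flagged


# Made-up daily counts of new customers: one double load, one failed run.
daily_new_customers = [500, 520, 480, 1010, 495, 0, 510]
print(flag_outliers(daily_new_customers))  # [3, 5]
```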

    • f. Consistency In Data Representation Across Systems

    Reference data is expected to be stored consistently not only within a dataset but also across multiple data stores.

    Example: In a customer dataset, the reference table for sex includes “Male”, “Female”, and “Unknown”.

    This reference data might be used across multiple systems. For example, a Return Material Authorization (RMA) system can experience reference data consistency issues if:

    1. Same meaning, different representation: The business definitions are the same but different data values are used to represent the same business concept.
    2. Missing reference data values: One or more reference data values are missing.
    3. Additional reference values: One or more reference values are added.
    4. Finer granularity: The reference values are further subdivided into more detailed levels.
    5. Same representation, different meaning: The data values are the same but used differently, which is difficult to catch.

    What is Uniqueness Data Quality Dimension?

    Uniqueness is the degree to which each object or event is recorded only once in a dataset.

    An event or entity should be recorded only once. Duplicate data should be avoided, as it can lead to double counting or misreporting.

    • a. Same Entity Is Represented by Different Identities

    There is a general expectation that a single physical entity should be represented only once. In this example, the customer is recorded twice, first as “Thomas” and a second time under the nickname “Tom”. Anyone accessing the data may be confused about which name to use for the customer. Additionally, information about the customer might be split across the two records. As a result, the company may count two customers, even though there is only one.

    If you simply check the data, you cannot determine if “Thomas” and “Tom” are the same because the names are different. To deduplicate such records, you will need secondary but universally unique information, such as email addresses.

    • b. The Same Entity Is Represented Multiple Times With the Same Identity

    In this case, the record identifier is exactly the same. This type of duplication is easy to detect because the keys in the dataset can be compared with each other to identify the duplicates.
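
Both kinds of duplicates can be sketched in Python. The records, the `id` and `email` field names, and the two helper functions are hypothetical, and the sketch assumes that email addresses match once they are trimmed and lowercased.

```python
from collections import Counter, defaultdict

# Made-up customer records with both kinds of duplicates.
customers = [
    {"id": 101, "name": "Thomas", "email": "tom@example.com"},
    {"id": 102, "name": "Tom", "email": "Tom@Example.com"},
    {"id": 103, "name": "Ken", "email": "ken@example.com"},
    {"id": 103, "name": "Ken", "email": "ken@example.com"},
]


def duplicate_keys(records, key):
    """Same identity recorded more than once: keys that occur repeatedly."""
    counts = Counter(record[key] for record in records)
    return sorted(value for value, count in counts.items() if count > 1)


def same_entity_groups(records, attribute):
    """Different identities sharing one normalized secondary attribute."""
    groups = defaultdict(set)
    for record in records:
        groups[record[attribute].strip().lower()].add(record["id"])
    return {value: sorted(ids) for value, ids in groups.items() if len(ids) > 1}


print(duplicate_keys(customers, "id"))         # [103]
print(same_entity_groups(customers, "email"))  # {'tom@example.com': [101, 102]}
```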

    What is Validity Data Quality Dimension?

    Data validity refers to how closely a data value matches predetermined values or a calculation. 

    • a. Data Validity Based On Business Rules Or Calculation

    The data captured in the datastore can come from a graphical user interface or an automated ETL process.  But is the data valid according to the business rules?

    Example: The business rule for Net Amount is Gross Amt – (Tax Amt + Fee Amt + Commission Amt).

    The net amount can be validated by calculating the expected value based on the business rule given above.
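
A minimal sketch of this validation recomputes the net amount from the rule and compares it with the stored value. The field names and sample rows are made up, and `Decimal` is used here to avoid floating-point rounding in currency amounts.

```python
from decimal import Decimal


def expected_net(row):
    """Net Amount = Gross Amt - (Tax Amt + Fee Amt + Commission Amt)."""
    return row["gross"] - (row["tax"] + row["fee"] + row["commission"])


def invalid_net_rows(rows):
    """Return the ids of rows whose stored net amount breaks the rule."""
    return [row["id"] for row in rows if row["net"] != expected_net(row)]


# Made-up rows; the second row stores a net amount that breaks the rule.
rows = [
    {"id": 1, "gross": Decimal("100.00"), "tax": Decimal("8.00"),
     "fee": Decimal("2.00"), "commission": Decimal("5.00"), "net": Decimal("85.00")},
    {"id": 2, "gross": Decimal("200.00"), "tax": Decimal("16.00"),
     "fee": Decimal("2.00"), "commission": Decimal("10.00"), "net": Decimal("180.00")},
]
print(invalid_net_rows(rows))  # [2]
```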

    • b. Data Validity For Range Of Values

    Data validity can also be based on predefined ranges. For example, the value (numeric or date) in an attribute must fall within the specified range.

    Numeric Range: Weight range for a USPS parcel. If the weight data doesn’t match the parcel type, then the data is considered invalid.

    Parcel             Contents must weigh less than 70 lbs.
    Large Parcel       Contents must weigh less than 70 lbs.
    Irregular Parcel   Contents must weigh less than 16 oz.
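    Date Range: A liquor shop cannot have a customer who is less than 21 years old, and it is rare for a customer to be older than 100 years. A date of birth that implies an age outside this range should be treated as invalid or at least suspect.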
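
The age rule can be sketched as a range check on the date of birth. The function names and sample dates are hypothetical, and treating the upper bound of 100 years as a hard limit is an assumption; in practice, such records may only be flagged for review.

```python
from datetime import date


def age_on(birth_date, as_of):
    """Return the age in whole years on the as_of date."""
    years = as_of.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred in as_of's year.
    if (as_of.month, as_of.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def is_valid_age(birth_date, as_of, min_age=21, max_age=100):
    """Check that the age implied by birth_date falls within the range."""
    return min_age <= age_on(birth_date, as_of) <= max_age


as_of = date(2020, 5, 5)
print(is_valid_age(date(1999, 5, 5), as_of))  # True
print(is_valid_age(date(1999, 5, 6), as_of))  # False
```
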
    • c. Invalid Sequence

    Normally, you cannot ship without having the order in place. That is the business rule. So, if you find a shipping record that has a shipping date earlier than the order date, there is clearly a data validation issue.
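
A sequence rule like this can be checked by comparing the two dates on each record. The field names and sample orders below are hypothetical; orders that have not shipped yet are skipped.

```python
from datetime import date


def invalid_sequences(orders):
    """Return ids of orders whose ship date is earlier than the order date."""
    return [
        order["order_id"]
        for order in orders
        if order["ship_date"] is not None
        and order["ship_date"] < order["order_date"]
    ]


# Made-up orders: A2 was "shipped" before it was ordered.
orders = [
    {"order_id": "A1", "order_date": date(2020, 5, 5), "ship_date": date(2020, 5, 7)},
    {"order_id": "A2", "order_date": date(2020, 5, 5), "ship_date": date(2020, 5, 3)},
    {"order_id": "A3", "order_date": date(2020, 5, 5), "ship_date": None},
]
print(invalid_sequences(orders))  # ['A2']
```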

    What is Timeliness Data Quality Dimension?

    Timeliness refers to the time lag between the actual event time and the time the event is captured in a system, making it available for use. 

    When an actual event occurs, the system needs to capture the event information, process it, and store it for further downstream usage. However, this process is never instantaneous. 

    Timeliness is judged by comparing the delay between the actual event and the availability of its data against the limit defined by the business or the downstream process. It is important to note that late data is still valid; it is simply delayed.
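
The lag can be measured as the difference between the event time and the capture time, and then compared with the allowed delay. The field names, sample timestamps, and the five-minute limit below are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def late_records(records, max_lag):
    """Return (id, lag) for records captured more than max_lag after the event."""
    late = []
    for record in records:
        lag = record["captured_at"] - record["event_at"]
        if lag > max_lag:
            late.append((record["id"], lag))
    return late


# Made-up orders: the second one was entered two hours after it was placed.
orders = [
    {"id": 1, "event_at": datetime(2020, 5, 5, 18, 0),
     "captured_at": datetime(2020, 5, 5, 18, 2)},
    {"id": 2, "event_at": datetime(2020, 5, 5, 18, 5),
     "captured_at": datetime(2020, 5, 5, 20, 5)},
]
for order_id, lag in late_records(orders, max_lag=timedelta(minutes=5)):
    print(order_id, lag)
```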

    Here are two timeliness data quality examples:

    1. Late for business
    2. Lag in the data capture
    • a. Late For Business Process

    A pizza restaurant promises to deliver a pizza within 50 minutes. However, the order booking clerk enters the order two hours late for some reason. In this case, the data itself is correct, but for the business, it is too late. The pizza is delivered late, which will result in negative reviews and potentially a loss of future business. This is a failure to meet the promise of timeliness.

    Even though the data is accurate, it misses the timing the business process expects, and that makes it poor-quality data.

    • b. Time Lag In Real-Time Systems

    In automated trading, decisions to buy or sell stocks are made in microseconds. Users expect data to be available immediately for their algorithmic trading.

    If there is a lag in the availability of data, their competitors will have an advantage. Even if the data is accurate, it still suffers from poor timeliness quality.

    A similar situation can occur with self-driving cars, where any lag in the arrival of data can cause accidents because the system won’t be able to make a course correction in time.

    What is Currency Data Quality Dimension?

    Data currency is the degree to which the state captured in the dataset reflects the current state of the real world.

    Often, the data captured reflects the state of an entity at the time of capture, but the state of the object can change over time. If the state transitions are not captured correctly, the data becomes outdated.

    Here are two examples of the data currency DQ dimension:

    1. Address has changed
    2. Coupon has expired
    • a. Changed Address

    A mailing list contains customers’ addresses, but if a customer has moved to a new address, the data loses its currency.
    • b. Expired Coupon

    Suppose you are trying to sell a wedding gown and send a customer a discount coupon as an incentive to purchase. The coupon is sent based on data showing that the customer is unmarried and in the market for a wedding dress. However, the customer is already married.

    Since the data was not updated in time, it still reflects the customer’s old state, and the data currency is compromised.

    • What is the difference between Data Timeliness and Currency?

    Timeliness refers to the late arrival of data or a delay, while the information remains accurate. However, if the data arrives late and reflects a state that has changed or expired, it becomes irrelevant, losing its value or currency.

    What is Conformity Data Quality Dimension?

    Conformity means that the data values of the same attribute must be represented in a consistent format and adhere to the correct data type.

    Humans have a unique ability to discern subtle differences and recognize commonality, whereas computers cannot. Even if the data values are correct, data that does not adhere to the same standard format or data type results in conformity data quality issues.

    Below are two examples of the data conformity DQ dimension:

    1. Format Conformity
    2. Data Type Conformity

      • a. Format Conformity

      In this example, the order date is expected to follow the ‘MM/DD/YYYY’ format. While the data may appear correct to humans, any change in the date format can cause chaos for computers.

      1. Don’s record has the date in ‘YYYY/M/DD’ format.
      2. Joe’s record has the date in the correct ‘MM/DD/YYYY’ format.
      3. Tim’s record has the date in ‘YYYY/M/DD HH:MM:SS’ format.

      Data format conformity issues can typically be identified using regular expressions.
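
A regular-expression check for the MM/DD/YYYY format can be sketched as follows. The sample values are made up to mirror the three cases above, and the extra `strptime` step is an added assumption: it rejects strings that have the right shape but are not real dates.

```python
import re
from datetime import datetime

# Two digits, slash, two digits, slash, four digits.
MM_DD_YYYY = re.compile(r"\d{2}/\d{2}/\d{4}")


def conforms(value):
    """Return True if value is a real date written as MM/DD/YYYY."""
    if not MM_DD_YYYY.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%m/%d/%Y")
    except ValueError:
        return False  # right shape, but not a calendar date (e.g. month 13)
    return True


for value in ["2020/5/05", "05/05/2020", "2020/5/05 10:15:00", "13/40/2020"]:
    print(value, conforms(value))
```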

      • b. Data Type Conformity

      The data type is another source of conformity quality issues. For example, the order amount is expected to be numeric, but Joe’s record contains an alphanumeric value. This is a data type conformity issue.

      What is Integrity Data Quality Dimension?

      Data Integrity is the degree to which defined relational constraints are implemented between two datasets.

      Data integrity issues can arise within a single system or across multiple systems. The key characteristic of the integrity data quality dimension is the relationship between two datasets.

      Here are two examples of the data integrity dimension:

      1. Referential Integrity
      2. Relationship Cardinality
      • a. Referential Integrity Or Foreign Keys:

      Every parent record referenced by a child record must exist in the parent dataset. For example, an order might have a customer number as a foreign key, which means that the customer number must also exist in the customer table. The master dataset could reside within the same database or in a different system.
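
When the two datasets live in different systems and no database constraint can enforce the relationship, an orphan check can be sketched like this. The field names and sample rows are hypothetical.

```python
def orphan_orders(orders, customers):
    """Return ids of orders whose customer number has no customer record."""
    known = {customer["customer_no"] for customer in customers}
    return [
        order["order_id"]
        for order in orders
        if order["customer_no"] not in known
    ]


# Made-up data: order 11 points to a customer that does not exist.
customers = [{"customer_no": 1}, {"customer_no": 2}]
orders = [
    {"order_id": 10, "customer_no": 1},
    {"order_id": 11, "customer_no": 3},
]
print(orphan_orders(orders, customers))  # [11]
```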

      • b. Cardinality Integrity

      Another example of the integrity data quality dimension is cardinality: 1:1, 1:many, and so on. Cardinality defines how many records in one dataset can relate to a record in the other. For example, an employee can only have one badge (1:1). If the cardinality of the relationship is known in advance, it can be checked under the data integrity DQ dimension.

      What is Precision Data Quality Dimension?

      Precision refers to the degree to which the data has been rounded or aggregated.

      In industrial measurements, precision and accuracy are different concepts. Accuracy refers to the deviation from the target data value, while precision pertains to the closeness of the values to each other. In data quality measurement, Precision is a derived concept used to identify errors related to rounding or aggregation of data.

      Below are some examples of precision errors:

      1. Numerical precision
      2. Time precision
      3. Granularity precision
      • a. Precision Errors Due To Rounding Of Number

      Depending on the degree of precision provided by the GPS coordinates, the location can differ by kilometers. The table below shows values ranging from two-digit precision to five-digit precision.  The location error can range from 1 meter to 1 kilometer.

      GPS Decimal Places   Decimal Degrees   N/S or E/W at Equator
      2                    0.01              1.1132 km
      3                    0.001             111.32 m
      4                    0.0001            11.132 m
      5                    0.00001           1.1132 m
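
The figures in the table follow from a single constant: one degree at the equator spans roughly 111.32 km, so each extra decimal place divides the distance by ten. A short sketch of that arithmetic, with a hypothetical function name:

```python
# One degree at the equator is about 111.32 km, consistent with the table.
KM_PER_DEGREE_AT_EQUATOR = 111.32


def resolution_meters(decimal_places):
    """Approximate distance in meters covered by one unit in the last
    decimal place of a coordinate, measured at the equator."""
    return KM_PER_DEGREE_AT_EQUATOR * 1000 / 10 ** decimal_places


for places in range(2, 6):
    print(places, f"{resolution_meters(places):.4f} m")
```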

      Imagine the consequences of a military bombing occurring 1 km away from the intended location.

      In stock trading, SEC Rule 612 mandates a minimum price increment: stocks priced at $1.00 or more are quoted in increments of $0.01, while stocks priced under $1.00 can be quoted in increments of $0.0001.

      Stock                Date         End-of-Day Price
      IBM                  05/05/2020   $122.58
      JPM                  05/05/2020   $92.00
      MTNB (Penny Stock)   05/05/2020   $0.7064
      • b. Time Precision

      The store accounting is done at the day level and may not require the exact second of purchase. However, for credit card fraud detection, time precision must be accurate to the second.

      • c. Granularity Precision

      Every time data is aggregated it loses details or precision. Granular data cannot be derived from summarized data.

      At first glance, granularity may not seem like an obvious aspect of precision. However, for certain operations, aggregated or summarized data is not useful.

      For example, if you want to pay each salesperson’s commission based on their individual sales, you will need the sales number for each salesperson, not just the aggregated total.

      Commission Calculator
      Product        $ Sales by Each Employee   Commission %   Commission Amount ($ Sales x Commission %)
      John Dove                                 3%             ?
      Evan Gardner                              3%             ?
      Accessories                               3%             ?

      But the data below does not have precision at the salesperson level. It is summarized across all employees for each month of the quarter. Since the head of sales does not have individual sales data for each salesperson, they cannot pay commissions.

      Total Sales
      Month      Total Sales
      Jan 2020   $4,050,000
      Feb 2020   $3,500,000
      Mar 2020   $500,000

      Data Quality Measurement

      Data quality can be measured as the ratio of defective records, as identified under one of the data quality dimensions, to the total number of records.

      Data Quality Dimension   Measurement
      Accuracy                 # of records with inaccurate data / total # of records
      Completeness             # of records with incomplete data / total # of records
      Timeliness               # of records with late data / total # of records
      Currency                 # of records with outdated data / total # of records
      Consistency              # of records with inconsistent data / total # of records
      Uniqueness               # of non-unique records / total # of records
      Validity                 # of records with invalid data / total # of records
      Conformity               # of records with nonconforming data / total # of records
      Integrity                # of records with integrity issues / total # of records
      Precision                # of records with imprecise data / total # of records

      These ratios can easily be shown as gauges on a dashboard. They can also be aggregated or drilled down by dimension.
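
As a small sketch, the ratio can be computed per dimension from the defect counts. The counts and the `defect_ratio` name are made up for illustration.

```python
def defect_ratio(defective, total):
    """Return defective records divided by total records."""
    if total <= 0:
        raise ValueError("total must be a positive number of records")
    return defective / total


# Made-up counts: (defective records, total records) per dimension.
counts = {"Completeness": (25, 1000), "Uniqueness": (10, 1000)}
scores = {
    dimension: defect_ratio(defective, total)
    for dimension, (defective, total) in counts.items()
}
for dimension, score in scores.items():
    print(dimension, f"{score:.1%}")
```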

      Conclusion

      I hope you liked the data quality examples and understand that there is much more to data quality than the six DQ dimensions. Do not fret too much about these classifications; choose the one you like or define your own.

      Do you agree with our thought process? Leave a comment below!

      Sandesh Gawande

      CEO and Founder at iceDQ.
      First to introduce automated data testing. Advocate for data reliability engineering.
