12 Data Quality Metrics That ACTUALLY Matter

Barr Moses / 7 min read.
March 30, 2023

Why do data quality metrics matter?

If you’re in data, you’re either currently working on a data quality project or you just wrapped one up. It’s the law of bad data – there’s always more of it.

Traditional methods of measuring data quality metrics are often time and resource-intensive, spanning several variables, from accuracy (a no-brainer) and completeness, to validity and timeliness (in data, there’s no such thing as being fashionably late). But the good news is there’s a better way to approach data quality metrics.

Data downtime – periods of time when your data is partial, erroneous, missing, or otherwise inaccurate – is an important data quality metric for any company striving to be data-driven.

It might sound cliché, but it’s true – we work hard to collect, track, and use data, but so often we have no idea if the data is actually accurate. In fact, companies frequently end up having excellent data pipelines, but terrible data. So what’s all this hard work to set up a fancy data architecture worth if, at the end of the day, we can’t actually use the data?

Measured this way, data downtime is a simple data quality KPI that helps you determine the reliability of your data, giving you the confidence necessary to use it or lose it.

The North Star data quality metric

Overall, data downtime is a function of the following data quality metrics:

  • Number of data incidents (N) – This factor is not always in your control, given that you rely on data sources “external” to your team, but it’s certainly a driver of data downtime.
  • Time-to-detection (TTD) – In the event of an incident, how quickly are you alerted? In extreme cases, this quantity can be measured in months if you don’t have the proper methods for detection in place. Silent errors made by bad data can result in costly decisions, with repercussions for both your company and your customers.
  • Time-to-resolution (TTR) – Following a known incident, how quickly were you able to resolve it?

By this method, a data incident refers to a case where a data product (e.g., a Looker report) is “incorrect,” which could be a result of a number of root causes, including:

  • All/parts of the data are not sufficiently up-to-date
  • All/parts of the data are missing/duplicated
  • Certain fields are missing/incorrect

Here are some examples of things that are not a data incident:

  • A planned schema change that does not “break” any downstream data
  • A table that stops updating as a result of an intentional change to the data system (deprecation)

Bringing this all together, I’d propose the right formula for data downtime is:

Data downtime = N × (average TTD + average TTR)

In other words, data downtime is the number of data incidents multiplied by the sum of the average time to detection and the average time to resolution, which makes it both an effective data quality metric and a very simple data quality KPI.
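As a minimal sketch (not a prescribed implementation), here is how that calculation could look in Python, assuming you log each incident with timestamps for when the bad data landed, when it was detected, and when it was resolved; the field names are hypothetical:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Incident:
    occurred_at: datetime   # when the bad data first landed (hypothetical field)
    detected_at: datetime   # when a person or monitor noticed the problem
    resolved_at: datetime   # when the fix was confirmed

def data_downtime(incidents: list[Incident]) -> timedelta:
    """Data downtime = N * (average TTD + average TTR)."""
    if not incidents:
        return timedelta(0)
    n = len(incidents)
    avg_ttd = sum((i.detected_at - i.occurred_at).total_seconds() for i in incidents) / n
    avg_ttr = sum((i.resolved_at - i.detected_at).total_seconds() for i in incidents) / n
    return timedelta(seconds=n * (avg_ttd + avg_ttr))

# Example: 3 incidents averaging 2 hours to detect and 1 hour to fix
# works out to 3 * (2 + 1) = 9 hours of data downtime.
```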

If you want to take this data quality KPI a step further, you could also categorize incidents by severity and weight uptime by level of severity, which we have addressed in another post.

With the right combination of automation, advanced detection, and seamless resolution, you can minimize data downtime by reducing TTD and TTR. There are even ways to reduce N, such as data SLAs and data health insights.

11 More Important Data Quality Metrics

Of course, if we hold that data downtime is our north star data quality KPI, then all of the data points that make up the formula are important as well. For example:

1. Total number of incidents (N)

This metric measures the number of errors or anomalies across all of your data pipelines. A high number of incidents indicates areas where more resources need to be dedicated to optimizing data systems and processes. However, it’s important to keep in mind that, depending on your level of data quality maturity, more data incidents can be a good thing. It doesn’t necessarily mean you HAVE more incidents, just that you are catching more.

2. Table uptime:

It is important to place the total number of incidents within the larger context of table uptime, or the percentage of tables without an incident. This can be filtered by type of incident (for example, by custom data freshness rules) to get a broad look at SLA adherence.
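As an illustrative sketch, table uptime over a reporting period might be computed like this, assuming you can list your production tables and the subset that had at least one incident (both inputs are hypothetical):

```python
def table_uptime(production_tables: set[str], tables_with_incidents: set[str]) -> float:
    """Percentage of production tables that had no incident during the period."""
    if not production_tables:
        return 100.0
    clean = production_tables - tables_with_incidents
    return 100.0 * len(clean) / len(production_tables)

# table_uptime({"orders", "users", "events"}, {"events"}) -> ~66.7% uptime
```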

3. Time to response (detection):

This metric measures the median time from an incident being created until a member of the data engineering team updates it with a status (typically “investigating,” but it could also be “expected,” “no action needed,” or “false positive”).

4. Time to fixed (resolution):

Once an incident has been given a status, we then want to understand the median time from that moment until the status is updated to fixed.
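A rough sketch of both medians, assuming each incident record carries the timestamps of its status changes as datetime objects (the dictionary keys below are hypothetical):

```python
from statistics import median

def _median_hours(deltas) -> float:
    """Median of a collection of timedeltas, expressed in hours."""
    return median(d.total_seconds() / 3600 for d in deltas)

def time_to_response_hours(incidents: list[dict]) -> float:
    """Median hours from incident creation to the first status update."""
    return _median_hours(i["first_status_at"] - i["created_at"] for i in incidents)

def time_to_fixed_hours(incidents: list[dict]) -> float:
    """Median hours from the first status update to the 'fixed' status."""
    return _median_hours(i["fixed_at"] - i["first_status_at"]
                         for i in incidents if i.get("fixed_at"))
```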



Sort by domain:

It’s important to measure the above data quality metrics by domain to understand the areas where additional optimization may be needed.

Other types of data quality metrics are important as well, including those that measure table reliability, scope of monitoring coverage, tidiness, and incident response.

First let’s look at table reliability metrics:

5. Importance Score:

Risk is a combination of frequency and severity. You can’t understand the severity of a data incident if you don’t also understand how important the underlying table is. Importance score, measured by the number of read/writes as well as downstream consumption at the BI level via data lineage, is an important part of enabling data engineers to triage their incident response. It can also indicate a natural starting place for where layering on more advanced custom monitoring, data tests, or data contracts may be appropriate.

6. Table health:

If the importance score captures severity, then table health, or the number of incidents a table has experienced over a certain time period, provides the frequency end of the risk equation.
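To make the risk framing concrete, one illustrative way to combine the two is sketched below; the weights, field names, and the 30-day window are assumptions rather than a prescribed scoring model:

```python
def importance_score(reads: int, writes: int, downstream_bi_assets: int,
                     usage_weight: float = 1.0, lineage_weight: float = 5.0) -> float:
    """Severity proxy: how heavily a table is read/written, plus how many
    dashboards or reports depend on it downstream (via lineage)."""
    return usage_weight * (reads + writes) + lineage_weight * downstream_bi_assets

def table_risk(incidents_last_30d: int, importance: float) -> float:
    """Risk = frequency (table health) x severity (importance score)."""
    return incidents_last_30d * importance

# Tables with the highest risk scores are natural candidates for custom
# monitors, data tests, or data contracts.
```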

Now let’s look at scope of data monitoring coverage:

7. Table Coverage:

Having comprehensive coverage across as many of your production tables as possible is critical to ensuring data reliability. This is because data systems are so interdependent. Issues from one table flow into others. With a data observability platform, your coverage across data freshness, data volume, and schema change monitors should be at or near 100% (but that’s not true for all modern data quality solutions).

8. Custom Monitors Created:

Table coverage helps you ensure you have a broad base of coverage across as many of your production tables as possible. Data engineering teams can get additional coverage by creating custom monitors on key tables or fields. It is important to understand the number of incidents being created by each custom monitor type to avoid alert fatigue.
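As a quick sketch of how you might track that, assuming each incident records which monitor raised it (the "monitor_type" key is hypothetical):

```python
from collections import Counter

def incidents_per_monitor_type(incidents: list[dict]) -> Counter:
    """Tally incidents by the monitor type that raised them (e.g. 'freshness',
    'volume', 'schema', 'custom_sql'). A type that dwarfs the others is a
    candidate for tuning before it causes alert fatigue."""
    return Counter(i["monitor_type"] for i in incidents)
```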

Now let’s look at, for lack of a better word, “tidiness” data quality metrics. Tracking and understanding these metrics can help you reduce incidents related to poor organization, usability, and overall management.

9. Number of unused tables and dashboards:

It can take some guts to deprecate tables and dashboards. Because it’s easier to just leave them lying around, unused tables and dashboards quickly accumulate. This can make it hard to find data, or worse, your stakeholders might end up leveraging the wrong table or dashboard, resulting in two sources of truth (one of which was meant to be retired).
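One simple way to surface deprecation candidates, assuming you can pull a last-queried timestamp per table from your warehouse’s query logs (the inputs and the 90-day cutoff are hypothetical):

```python
from datetime import datetime, timedelta

def unused_tables(last_queried: dict[str, datetime],
                  now: datetime, days: int = 90) -> list[str]:
    """Tables with no queries in the last `days` days are candidates for deprecation."""
    cutoff = now - timedelta(days=days)
    return sorted(table for table, ts in last_queried.items() if ts < cutoff)
```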

10. Deteriorating queries:

These are queries displaying a consistent increase in execution runtime over the past 30 days. It is important to catch these deteriorating queries before they become failed queries and a data incident.
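One way to flag such queries, sketched here as an assumption rather than a specific product feature, is to fit a simple trend line to each query’s daily runtimes over the window:

```python
from statistics import linear_regression  # Python 3.10+

def is_deteriorating(daily_runtime_seconds: list[float],
                     min_slope_seconds_per_day: float = 1.0) -> bool:
    """True if runtime trends upward over the window (e.g. the past 30 days)."""
    if len(daily_runtime_seconds) < 2:
        return False
    slope, _intercept = linear_regression(list(range(len(daily_runtime_seconds))),
                                          daily_runtime_seconds)
    return slope > min_slope_seconds_per_day
```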

Finally, we covered key incident response metrics that comprise data downtime, but we’ll add one more below.

11. Status update rate:

The better your team is at updating the status of incidents, the better your incident response will be.

Focusing on the data quality metrics that matter

Tracking key data quality metrics is the first step in understanding data quality, and from there, ensuring data reliability. With fancy algorithms and data quality KPIs flying all over the place, it’s easy to overcomplicate how to measure data quality.

Sometimes it’s best to focus on what matters. But to do that, you need the systems and processes in place to measure (ideally in a dashboard) the many data metrics that comprise your overall data health.

Categories: Big Data
Tags: big data engineer, big data quality, data quality, metrics
Credit: Monte Carlo

About Barr Moses

CEO and Co-Founder of Monte Carlo Data. Lover of data observability and action movies. #datadowntime
