Plenty of us complain about data quality, or worry our data isn’t good enough for analytics. But what do we mean when we say “data quality”?
When I was in business school, we defined quality as fitness for an intended purpose. For instance, a broom handle may be a high-quality item for sweeping floors, but a low-quality item for hitting a baseball.
In this sense, quality data is data that fits your intended analysis. Five aspects of data quality make up fitness for analysis: relevance, accuracy, completeness, recency and cleanliness.
Quality data is relevant. Your data should describe or pertain to the time period, location and/or population that make up and affect what you are analyzing. It should also be directly related to the goals of your analysis. For instance, if your analytics project is intended to reduce manufacturing waste, then your data should measure defects in raw materials and finished goods, plus products returned from the distribution channel.
How do you know if your data is relevant? You’ll notice more if it’s not relevant. In that case, your analysis will yield results that are unrelated to your problem or just don’t make sense.
What if your data is not relevant? You will need to generate or acquire data that is relevant.
Quality data is accurate. Your data needs to accurately reflect or correspond to what you’re measuring, to the required level of measurement. It should also be free of typos, transpositions, and other inaccuracies of data entry and classification.
How do you know if your data is accurate? Your data will pass spelling checks, checksums, spot checks and other measures of internal accuracy and consistency.
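To make the checksum idea concrete, here is a minimal sketch in Python. It assumes your identifiers carry a Luhn check digit, as payment card numbers do; your own data may use a different scheme or none at all. The function name luhn_is_valid is mine, not part of any library.

```python
def luhn_is_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Keep only the digits, so "7992-7398-713" and "79927398713" behave alike.
    digits = [int(ch) for ch in number if ch.isdecimal()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0


print(luhn_is_valid("79927398713"))  # True: a commonly used valid test number
print(luhn_is_valid("79927398731"))  # False: last two digits transposed
```

A check like this catches single-digit typos and most adjacent transpositions. It says nothing about whether the number belongs to the right customer, so it complements spot checks rather than replacing them.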
What if your data is not accurate? You’ll need to either fix your data, or generate or acquire new data. But you need to work only on data that is related to the analysis that you’re trying to perform.
Quality data is complete. Complete data measures or describes all the relevant aspects of the problem you’re trying to solve. It encompasses the total population, time period and/or geographic area that you’re studying. There are no items missing from series.
How do you know if your data is complete? You’ll notice more if it’s not complete. In that case, you’ll find yourself redoing calculations and analyses to fill in the gaps you’ve discovered. Discovering and filling gaps can be good for data quality, but can be time-consuming and frustrating.
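You can also look for gaps before they interrupt your analysis. The sketch below assumes a daily series, where every calendar day between the first and last observation should be present. The function name missing_days is hypothetical, and a weekly or business-day series would need a different step.

```python
from datetime import date, timedelta


def missing_days(observed: list[date]) -> list[date]:
    """Return the calendar days absent between the first and last observation."""
    if not observed:
        return []
    present = set(observed)
    start, end = min(present), max(present)
    span = (end - start).days
    # Generate every day in the range and keep the ones we never saw.
    all_days = (start + timedelta(days=offset) for offset in range(span + 1))
    return [day for day in all_days if day not in present]


readings = [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 5)]
print(missing_days(readings))
# [datetime.date(2024, 1, 3), datetime.date(2024, 1, 4)]
```

Note that this only finds gaps inside the range you already have. It cannot tell you that the series should have started earlier or run later.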
What if your data is not complete? You have a choice. You can decide that it’s complete enough to base a decision upon, or you can acquire or generate additional data to round out your data set.
Quality data is recent. Recent data reflects the current state of a measurement. Recency is measured relative to the problem you’re trying to solve. Recent geological data will be much older than recent stock market data. Disparate data sets that are related to the problem you’re addressing can have different levels of recency.
How do you know if your data is recent? Your data should carry some sort of time stamp. Some data might seem old, like census data, and yet be the most recent, relevant data that you can apply.
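If your records do carry a time stamp, a recency check can be a few lines of code. This is a sketch under assumptions: each record is a dictionary with a timezone-aware measured_at field (a field name I made up), and you choose the maximum acceptable age to suit your problem.

```python
from datetime import datetime, timedelta, timezone


def stale_records(records, max_age, now=None):
    """Return the records whose 'measured_at' time stamp is older than max_age."""
    # Allow the caller to pin "now" so the check is repeatable.
    now = now or datetime.now(timezone.utc)
    cutoff = now - max_age
    return [record for record in records if record["measured_at"] < cutoff]


as_of = datetime(2024, 6, 1, tzinfo=timezone.utc)
prices = [
    {"id": 1, "measured_at": datetime(2024, 5, 31, tzinfo=timezone.utc)},
    {"id": 2, "measured_at": datetime(2023, 1, 1, tzinfo=timezone.utc)},
]
stale = stale_records(prices, max_age=timedelta(days=30), now=as_of)
print([record["id"] for record in stale])  # [2]
```

The max_age argument is where the "relative to the problem" part lives: thirty days might be generous for stock prices and far too strict for census data.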
What if your data is not recent? You can take new measurements if recency is an issue. You could re-measure a portion of your data and see if it varies significantly from your existing data.
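One rough way to compare a re-measured sample against your existing data is to look at how far the average has moved. The sketch below is only that, a rough comparison and not a statistical significance test. The function name relative_drift and the 5% threshold are my own illustrative choices.

```python
from statistics import mean


def relative_drift(existing, remeasured):
    """Return the relative change in the mean between old and new measurements."""
    old_mean = mean(existing)
    new_mean = mean(remeasured)
    if old_mean == 0:
        raise ValueError("relative drift is undefined when the old mean is zero")
    return abs(new_mean - old_mean) / abs(old_mean)


old_values = [100, 102, 98]
new_values = [110, 112, 108]
drift = relative_drift(old_values, new_values)

# The 5% threshold is an arbitrary example; pick one that fits your analysis.
if drift > 0.05:
    print("Existing data may be out of date; consider re-measuring more of it.")
```

If the decision riding on the analysis is important, a proper statistical test on the two samples is a better guide than a single threshold.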
Quality data is clean. Clean data is free of duplicate records. The data is organized, standardized, structured and labelled or documented to the extent possible. Most data in the world is unstructured, in that it doesn’t fit into the neat fields of a data table. Think social media, emails, reports, videos and images. Still, even unstructured data can be in a well-documented and standardized format.
How do you know if your data is clean? In part, it will just look clean. You won’t find other data-dependent tasks interrupted by impromptu data cleaning. Also, your counts will be accurate because there are no duplicates.
What if your data is not clean? Use ETL (extract, transform, load) or de-duping tools. Normalize data when different terms or values are used to represent the same information. Make data as consistent as possible in terms of labels, categories, time stamps and other types of structures.
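For a small data set, normalizing and de-duping can be sketched in plain Python. Everything specific here is an assumption for illustration: the name and state fields, the STATE_ALIASES mapping and the choice to treat two rows as duplicates when both fields match after normalization.

```python
# Hypothetical lookup table: different spellings that mean the same state.
STATE_ALIASES = {"calif.": "CA", "california": "CA", "ca": "CA"}


def normalize_state(value: str) -> str:
    """Map known spellings of a state to one standard code."""
    trimmed = value.strip()
    return STATE_ALIASES.get(trimmed.lower(), trimmed)


def clean_rows(rows):
    """Normalize each row, then keep only the first copy of each duplicate."""
    seen = set()
    cleaned = []
    for row in rows:
        record = {
            # Collapse runs of whitespace and standardize capitalization.
            "name": " ".join(row["name"].split()).title(),
            "state": normalize_state(row["state"]),
        }
        key = (record["name"], record["state"])
        if key not in seen:
            seen.add(key)
            cleaned.append(record)
    return cleaned


raw = [
    {"name": "ada  lovelace", "state": "Calif."},
    {"name": "Ada Lovelace", "state": "CA"},
    {"name": "Alan Turing", "state": "california"},
]
print(clean_rows(raw))
# [{'name': 'Ada Lovelace', 'state': 'CA'}, {'name': 'Alan Turing', 'state': 'CA'}]
```

The order matters: normalize first, then de-dupe. The first two rows only look like duplicates once their spelling and spacing agree. Simple title-casing will also mangle some names, such as "McDonald", so treat this as a starting point.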
The next time you catch someone, including me or possibly yourself, complaining about data quality, take a moment to dive deeper into these five characteristics of data quality. Getting more specific about quality helps you pinpoint tangible problems with specific solutions that you can implement.