Scary Data


It’s Friday the 13th and a full moon just passed; it seems as appropriate a time as any to talk about scary data - that is, the common issues that affect data quality that we can be on the lookout for.

Duplicate Data. Sometimes you’re going to legitimately receive records that look very similar. Most of the time, though, if a record matches another one exactly in your data, you’re probably dealing with some sort of data ingestion issue: either your source isn’t configured right and is passing records to you more than once, or your ETL is introducing accidental duplicates into the input stream. Either way, duplicate data can be a pain to deal with once it’s found its way into your system, so cutting it off at the pass when you can is the best idea.
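As a minimal sketch of what cutting duplicates off at the pass might look like, here’s a Python generator that fingerprints each incoming record and silently drops exact matches. The record shape (plain JSON-serializable dicts) and the hashing choice are assumptions, not a prescription:

```python
import hashlib
import json

def record_fingerprint(record: dict) -> str:
    """Hash a record's canonical JSON form so exact duplicates collide."""
    canonical = json.dumps(record, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def dedupe_stream(records):
    """Yield only the first occurrence of each exact duplicate in a feed."""
    seen = set()
    for record in records:
        fingerprint = record_fingerprint(record)
        if fingerprint in seen:
            continue  # exact duplicate - likely an ingestion issue, skip it
        seen.add(fingerprint)
        yield record
```

In a real pipeline you’d probably also count or log the skipped records, since a sudden spike in duplicates is exactly the kind of ingestion problem you want to hear about early.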

Inconsistent Formatting. Let’s say you receive phone numbers from a couple of different vendors. One of them sends the values to you as (937) 123-4567. The other sends them to you as 9371234567. Neither is necessarily right or wrong - they’re just different, and if you don’t do something about it, your ability to use the data for searching, matching, or other purposes will be limited. Best practice here? Save the data in one place exactly as it came in, and do a simple transformation to an accepted standard early in your process.
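A minimal sketch of that early transformation, assuming ten-digit US numbers and Python as the pipeline language:

```python
import re

def normalize_phone(raw: str) -> str:
    """Strip formatting and reduce a US phone number to its ten digits."""
    digits = re.sub(r"\D", "", raw)  # drop everything that isn't a digit
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop a leading country code
    if len(digits) != 10:
        raise ValueError(f"Unexpected phone number format: {raw!r}")
    return digits

# Both vendor formats converge on the same standard value:
assert normalize_phone("(937) 123-4567") == normalize_phone("9371234567")
```

The raw value still gets saved exactly as it arrived; the standardized copy is what downstream searching and matching should use.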

Old Data. This can be particularly challenging when it comes to demographic data about your customers - if they’ve moved or changed a phone number or name, how will you know when a change is valid, versus just a mistake? I’ve discussed this previously here, but it’s worth repeating: saving all of the versions of data is a valid approach, as long as you have a standard mechanism for deciding which one represents the latest, best information. Making sure that you tag every row that enters your system with a timestamp will also help you to make decisions about data obsolescence.
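A sketch of both ideas together, assuming rows travel through the pipeline as plain dictionaries: stamp each row on arrival, keep every version, and use the stamp when you need to pick a winner:

```python
from datetime import datetime, timezone

def tag_with_ingested_at(record: dict) -> dict:
    """Stamp every incoming row so obsolescence decisions are possible later."""
    return {**record, "ingested_at": datetime.now(timezone.utc)}

def latest_version(versions: list[dict]) -> dict:
    """Pick the most recently ingested version of a record."""
    return max(versions, key=lambda record: record["ingested_at"])
```

“Most recent wins” is only one possible standard mechanism - you might instead prefer the version from the most trusted source - but the point is to pick a rule and apply it consistently.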

Empty Fields. Your feeds are arriving without certain data points populated. Sometimes this isn’t a major problem - you might not need or expect a middle name from every individual. But other times it poses a real problem. Did the user really respond to your survey question that they visited your store 0 times? Or did they just not answer the question? In that case, making sure your form design lets you tell the difference is critical. In other situations, pushing back on vendors may be the only way to improve the population of fields. Either way, your systems need to be prepared to deal with the presence of empty data - and to make inferences about what it means.
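One way to preserve that difference downstream, assuming the form sends an empty value when a question is skipped:

```python
def visits_response(raw_value) -> int | None:
    """Distinguish 'answered zero' from 'did not answer'."""
    if raw_value is None or raw_value == "":
        return None  # unanswered - make no inference about visit count
    return int(raw_value)  # an explicit answer, which may legitimately be 0
```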

Overloading. Whenever I hear the phrase “sometimes it means…” I shudder, because it almost always indicates that a data field has been overloaded - that is, used for more than one purpose. Such situations need to be addressed early in your data transformation process: the conditions under which the field means one thing versus another need to be identified and tested against, and the data need to be separated into two fields that each have a specific, individual semantic meaning. If you don’t do this, you’re going to end up wondering why half of the values for “Size” read “Master of Business Administration”.
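A sketch of that separation, using a hypothetical “Size” column that was also (mis)used for education data. The rule for telling the two apart is an assumption here; in practice it’s exactly the condition you’d have to identify and test against in your own data:

```python
def split_overloaded_field(value: str) -> dict:
    """Separate an overloaded 'Size' column into two specific fields."""
    garment_sizes = {"XS", "S", "M", "L", "XL", "XXL"}
    cleaned = value.strip()
    if cleaned.upper() in garment_sizes:
        return {"size": cleaned.upper(), "degree": None}
    # Anything else is assumed to be a degree name that leaked into the column.
    return {"size": None, "degree": cleaned}
```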

The effort that goes into cleaning, standardizing, and preparing data can sometimes be substantial. But doing so - and automating ways to detect when those safeguards are not enough - will save you a lot of time in the future, as your users come to trust the data your systems provide.
