A Trusted Reference
I may have been part of one of the last generations for whom leaving home to go to college meant having a dictionary with me in my dorm room. The Internet age was at its advent - I was one of only a smattering of students with a computer in my room - and when writing term papers or other assignments, a thick dictionary was the best reference work you could have to make sure the words you were using were not only the words you wanted to use, but were spelled correctly.
Students today probably don’t bring a dictionary with them to stack in their dorm room - not when there’s autocorrect in every cell phone - but the idea of a trusted dictionary remains important to those of us working in data. Having a reference available that will help you determine when it’s appropriate (or not) to use a particular attribute or data column is still necessary.
For me, the closest analog that I can think of is when I have to buy some specific cleaner at the store. I know, in general, which aisle I need to go to in order to find the product, and I know in general what products I’m going to be looking for - something to help clean kitchen surfaces. But it’s not until I read the label and understand the specific benefits, limitations and cautions of each product that I can confidently select the one that is going to do what I need - and not permanently damage my kitchen.
A data dictionary has a similar purpose to the labels on the cleaning products - it should be there to help guide decisions about whether to use a particular field, column, attribute, or not. There are lots of things that get overlooked, however, when building a data dictionary… and we all have an opportunity to fix that.
For example, let’s assume a field named “Last Name”. In 99% of the data dictionaries I’ve encountered, “Last Name” is likely not to have any sort of description at all. After all, we might think, it stands for itself in terms of what it means. But does it? What do or don’t we know about the “Last Name” field?
Are the contents of this field expected to be in mixed case, or all upper-case letters?
Do we allow punctuation of any sort in this field for names like L’Heureux, or for hyphenated names like Brown-Jones? Or is it only alphabetic characters?
What will be the contents of this field in cases where the ordering of given names versus family names is reversed from what we’re accustomed to, such as Korean names? Will the contents house the family name? Or the given name?
Should we expect generational suffixes like Jr or Sr to be in this field?
Will this field ever be null or empty? Under what circumstances?
Are there other fields that may be related and be better choices?
Each of the above questions could play a crucial role in whether a user decides that the field is the correct choice for them, for example in putting together a marketing campaign. It would look a bit absurd in most cases if a mailing paired a given name in all capital letters with a family name in mixed case. Having a robust data dictionary that identifies those sorts of factors helps to avoid those types of mix-ups.
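To make that concrete, here is a minimal sketch of what a dictionary entry for “Last Name” might look like if it answered the questions above. Everything in it is an assumption for illustration: the FieldEntry class, its attribute names, the violations helper, and the answers themselves (mixed case, apostrophes and hyphens allowed, suffixes stored elsewhere) are hypothetical, not a standard or any particular tool’s format.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FieldEntry:
    """One data dictionary entry. All attribute names are hypothetical."""
    name: str
    description: str
    casing: str              # "mixed" or "upper"
    allowed_pattern: str     # regex the whole value must match
    holds: str               # "family name" or "given name"
    includes_suffix: bool    # are Jr/Sr stored in this field?
    nullable: bool
    null_reason: Optional[str] = None
    related_fields: List[str] = field(default_factory=list)


def violations(entry: FieldEntry, value: Optional[str]) -> List[str]:
    """Return the documented rules a value breaks (empty list if none)."""
    if value is None or value == "":
        return [] if entry.nullable else ["value is required"]
    problems = []
    if entry.casing == "upper" and value != value.upper():
        problems.append("expected all upper-case")
    if re.fullmatch(entry.allowed_pattern, value) is None:
        problems.append("contains characters outside the allowed set")
    return problems


# Illustrative answers only; a real entry would record what the source system does.
last_name = FieldEntry(
    name="Last Name",
    description="Family name, regardless of the order in which the name is written.",
    casing="mixed",
    # Letters, optionally joined by a single apostrophe or hyphen.
    allowed_pattern=r"[^\W\d_]+(?:['’-][^\W\d_]+)*",
    holds="family name",
    includes_suffix=False,
    nullable=True,
    null_reason="Person uses a single name.",
    related_fields=["Name Suffix", "Full Name"],
)

print(violations(last_name, "Brown-Jones"))
print(violations(last_name, "Smith Jr."))
```

The point is less the checking function than the fact that each question becomes an explicit, answered attribute that a user can read before choosing the field.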
The final question in that list above, about other choices and related fields, is one that is probably overlooked more often than any other - but may be among the most important. Let’s go back to that example about kitchen cleaners. I may pick up a bottle that says it’s inappropriate for concrete counters. If that’s all it says, then while it’s helpful, it doesn’t tell me what I can use - I’m going to have to pick up a bunch of other bottles. But if it says to use some specific alternative instead, some sister brand or product, then I can grab that product off the shelf instead, without having to do as much work.
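The “use this instead” label can be sketched the same way. In the rough example below, the USE_INSTEAD table, the recommend function, and every field name and purpose are made up for illustration; the idea is only that a caution in the dictionary should point to an alternative wherever one exists.

```python
from typing import Dict, Optional

# Hypothetical guidance: for each field, the purposes it is unsuitable for,
# paired with the field to use instead (None if no alternative is documented).
USE_INSTEAD: Dict[str, Dict[str, Optional[str]]] = {
    "Last Name": {
        "mail salutation": "Preferred Salutation",
        "identity matching": "Legal Family Name",
    },
}


def recommend(field_name: str, purpose: str) -> str:
    """Return the field to use for a purpose, following any documented redirect."""
    redirects = USE_INSTEAD.get(field_name, {})
    if purpose not in redirects:
        return field_name  # no caution recorded for this purpose
    alternative = redirects[purpose]
    if alternative is None:
        raise LookupError(
            f"{field_name} is unsuitable for {purpose}; no alternative documented"
        )
    return alternative


print(recommend("Last Name", "mail salutation"))
```

A user who lands on the wrong field gets sent straight to the right one, rather than having to pick up every other bottle on the shelf.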
Those of us who perform data curation have the opportunity to help everyone downstream by being conscientious about how we construct and maintain our data dictionaries. They don’t need to be thousand-page tomes that we lug everywhere, but if they’re concise and complete, and they answer all the right questions, they’ll save our users a ton of time.