2020 New Year’s Data Resolutions

With the dawn of a new year (and, depending on whom you believe, a new decade) upon us, I wanted to take a moment to make a few resolutions. These aren’t the typical New Year’s resolutions - nothing about losing weight, quitting smoking, exercising more, giving up chocolate, or anything like that. These are data resolutions: things I’m going to make sure I continue to do with data.

  • I will resolve duplicate records when they occur and investigate the upstream processes to understand why they happened and how to avoid them in the future.

  • I will convert data to a proper datatype as early as possible, and capture any conversion errors to a log file. I won’t leave everything as strings even though that feels easiest.

  • I will maintain numeric precision for as long as possible in my ETL streams, and I’ll document when I’ve knowingly reduced precision so that there are fewer questions that end up coming down to rounding errors.

  • I will have standardization rules for every attribute I maintain: for example, uppercase/lowercase/mixed case, whether to allow punctuation, whether to allow numerics or whitespace, and so forth. These rules will be well-documented and public.

  • I will deal with nulls appropriately, and early in my processing; if necessary, I’ll add more metadata that will help to make the difference between null and some other condition very clear.

  • I will consider how late-arriving dimensionality might affect my ETL processing, and introduce appropriate placeholder records as needed to ensure referential integrity.

  • I will measure inbound data volumes and track trends so that I can understand what future data volumes might look like and appropriately plan, and so that I can identify situations where more data is arriving than makes sense.

  • I will build keys that are unique and that anticipate future states that may threaten that uniqueness; my goal will be to avoid expensive refactoring.

  • I will involve subject matter experts regularly to evaluate exception conditions, to help understand data relationships, and to construct appropriate reference data.

  • I will not introduce arbitrary relationship constraints and will generally accept that relationships have cardinalities of zero, one, or infinity.

  • I will have a robust and reliable backup and restoration strategy.
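
To make a few of these concrete, here is a minimal Python sketch of how the resolutions on early type conversion, numeric precision, standardization, nulls, and duplicates might look in a single cleaning step. Everything in it is an assumption for illustration: the field names (name, amount), the logger name, the uppercase-and-collapse-whitespace rule, and the choice to keep the first record seen for each name are all hypothetical, not a prescription.

```python
import logging
from decimal import Decimal, InvalidOperation

logger = logging.getLogger("etl.conversion")  # hypothetical logger name


def standardize_name(raw):
    """Assumed rule: trim, collapse runs of whitespace, uppercase."""
    return " ".join(raw.split()).upper()


def parse_amount(raw):
    """Convert a string to Decimal as early as possible.

    Returns (value, status). The status is extra metadata that separates
    a value that was never supplied ("missing") from one that was supplied
    but could not be converted ("invalid").
    """
    if raw is None or not raw.strip():
        return None, "missing"
    try:
        value = Decimal(raw.strip())
    except InvalidOperation:
        logger.warning("could not convert amount %r", raw)
        return None, "invalid"
    if not value.is_finite():  # Decimal accepts "NaN" and "Infinity"
        logger.warning("non-finite amount %r", raw)
        return None, "invalid"
    return value, "ok"


def clean_rows(rows):
    """Return (kept, duplicates); the first row seen for each name is kept."""
    kept = {}
    duplicates = []
    for row in rows:
        amount, status = parse_amount(row.get("amount"))
        record = {
            "name": standardize_name(row["name"]),
            "amount": amount,
            "amount_status": status,
        }
        if record["name"] in kept:
            duplicates.append(record)  # set aside for upstream investigation
        else:
            kept[record["name"]] = record
    return list(kept.values()), duplicates
```

Using Decimal rather than float keeps a value such as "10.10" exact through the pipeline, and pointing the logger at a file (for example with logging.basicConfig(filename=...)) gives the conversion-error log. The duplicates list is returned rather than silently dropped so that someone can go and ask why they happened.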

Hopefully your data resolutions are similar, and I hope you have a happy and successful 2020 in the data world!
