Cold November Grain
Sorry about the punny title - but this is actually a pretty interesting topic, and I’m feeling a bit punchy as it approaches Thanksgiving.
Something that I’ve talked about at great length with my clients (and my coworkers, and friends and family at inappropriate times) is the concept of data grain. What is grain? Grain is the level to which data gets aggregated in order to report a metric.
One of the richest areas to mine for examples of this concept is the sport of baseball, which is packed with statistics. Let’s think about baseball for a moment and talk about batting average. Now, batting average is a fairly simple statistic - it’s the number of hits achieved divided by the number of at-bats taken. Easy enough.
But what is its grain? I get different answers when I ask this question. When I ask this question of clients during a design session they almost universally - initially, at least - respond that the grain of batting average is “player”, that is, that we calculate the batting average on a player-by-player basis. When I ask this question of family, on the other hand, their response is almost universally “shut up about this already and pass the stuffing.”
The fact is, though, that while my clients have been partially right - the grain of batting average can be player - that’s not the end of the story. Batting average can also be calculated for an entire team, for example, or for a player in a given season, or for a team in a given season, and so forth. Understanding this generally encourages data engineers, analysts and others to do two things that are important from a data governance perspective:
Naming columns in a way that helps to clarify the grain of particular metrics
Constructing calculations for these metrics in a manner that is reusable across different grains
The second one is particularly important; if tomorrow everyone agreed that batting average would be redefined, it would be much simpler to change the logic that calculates it in one place and have it apply to all grains than to hunt it down in multiple places and replace the logic in each one (it’s much less prone to error, too).
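To make that a little more concrete, here is a minimal Python sketch of one way the "define it once, apply it at any grain" idea could look. Everything in it is made up for illustration: the sample rows, the player and team names, and the helper names `batting_average` and `batting_average_by` are assumptions, not anything from a real system.

```python
from collections import defaultdict

# Hypothetical sample data: one row per player per season.
FIELDS = ("player", "team", "season", "hits", "at_bats")
ROWS = [
    ("Alvarez", "Hawks", 2022, 150, 500),
    ("Alvarez", "Hawks", 2023, 120, 480),
    ("Brooks", "Hawks", 2023, 90, 400),
    ("Chen", "Owls", 2023, 160, 520),
]


def batting_average(hits, at_bats):
    """The single place the metric is defined: hits divided by at-bats."""
    return hits / at_bats if at_bats else None


def batting_average_by(rows, grain):
    """Sum hits and at-bats at the requested grain, then apply the metric."""
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        record = dict(zip(FIELDS, row))
        # The grain is just the tuple of columns we group by.
        key = tuple(record[column] for column in grain)
        totals[key][0] += record["hits"]
        totals[key][1] += record["at_bats"]
    return {key: batting_average(h, ab) for key, (h, ab) in totals.items()}


# Same definition, three different grains.
by_player = batting_average_by(ROWS, ("player",))
by_team_season = batting_average_by(ROWS, ("team", "season"))
by_player_season = batting_average_by(ROWS, ("player", "season"))
```

Note that the sketch sums hits and at-bats at the chosen grain and only then divides, rather than averaging averages that were already computed at a finer grain. If the definition of batting average changed, only `batting_average` would need to be touched.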
Another advantage of thinking about metrics in terms of their grain is that it allows us to contextualize and organize our thought process a bit more clearly; this makes, among other things, data gap analysis more productive by letting our teams ideate about possible data usage without immediately needing to worry about which physical table a metric needs to be stored on.
It’s an important concept to keep in mind and one to make sure you discuss - just, perhaps, not on Thanksgiving.