Data as Water
Lots of terminology in the data world refers to data as essentially water - data lakes, data oceans, data flows, and so forth. Even Amazon’s S3 service puts data into buckets. While chatting the other day with a fellow data professional, though, I realized that the metaphor of data as water goes well beyond oceans and lakes - and much, much deeper.
Just as water is the essential compound for sustaining life on our planet, so too is data the essential element for sustaining the life of an organization. In nature, water arrives in the form of rain - which can be heavy or light - and in business, data arrives sometimes as a trickle and sometimes as a flood.
As data professionals, we have to be prepared to build the plumbing that takes data from its sources and moves it where it needs to go. The pipes, elbows, valves, and other fittings we build with are what clean, filter, integrate, and standardize our data. The valves may be the most important parts of all - and they’re often the most overlooked.
While it can be tempting to build a single run of pipe, bending and turning and twisting as necessary to carry our data from point A to point B, it’s critical that we interrupt that flow sometimes with a valve that we can turn on or off, or redirect to a separate course. What happens when we realize that we’ve built some part of our plumbing incorrectly? Without smartly placed valves, we’re forced to reconstruct the entire thing; on the other hand, if we’ve integrated a couple of valves into our plan, then we can turn the flow off for a bit, fix the part between the valves, and keep going.
This idea of modularity is borrowed from plenty of other domains within computer science, but I feel it’s especially important when working with data. Being able to pause the waterworks and examine the quality of the data passing through a particular point is incredibly helpful for diagnosing and correcting problems in an otherwise mature, well-performing data flow. Sure, you can do every bit of data processing in a single step - but should you?
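To make that a little more concrete, here’s a minimal sketch of what a “valved” pipeline might look like in Python. Everything in it - the stage names, file paths, columns, and the pandas-based transforms - is hypothetical, just a stand-in for whatever your own plumbing actually does:

```python
import pandas as pd

# Each stage reads from the previous "valve" point and writes to the next one,
# so any one segment of pipe can be shut off, inspected, or replaced on its own.
# Paths and column names below are purely illustrative.

def extract(raw_path: str, out_path: str) -> None:
    """Pull raw data in and park it, untouched, behind the first valve."""
    df = pd.read_csv(raw_path)
    df.to_parquet(out_path)

def clean(in_path: str, out_path: str) -> None:
    """Filter and standardize; the result sits behind the second valve."""
    df = pd.read_parquet(in_path)
    df = df.dropna(subset=["customer_id"])              # filter out unusable rows
    df["email"] = df["email"].str.strip().str.lower()   # standardize a field
    df.to_parquet(out_path)

def integrate(in_path: str, lookup_path: str, out_path: str) -> None:
    """Join against reference data before anything flows downstream."""
    df = pd.read_parquet(in_path)
    lookup = pd.read_parquet(lookup_path)
    df.merge(lookup, on="customer_id", how="left").to_parquet(out_path)

if __name__ == "__main__":
    extract("raw/orders.csv", "stage/orders_raw.parquet")
    clean("stage/orders_raw.parquet", "stage/orders_clean.parquet")
    integrate("stage/orders_clean.parquet", "ref/customers.parquet",
              "stage/orders_final.parquet")
```

The point isn’t the particulars of pandas or parquet - it’s that each function boundary is a valve: a place where the flow can be stopped, the water examined, and a single stretch of pipe repaired without touching the rest.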
Of course, as with anything else, the answer is “it depends.” But more often than not - unless we’re talking about real-time or near-real-time applications where every microsecond is critical - there’s an opportunity to modularize the work and allow (even if only temporarily) for data to be examined at different steps in its flow.
Having these sorts of intermediate data in place is a great way to let your data engineers examine and forensically deduce the cause of problems that show up later in your data pipeline - just like a plumber working with water can remove a section of pipe to check whether there’s a blockage there. And - just like with plumbing - while we’d love to believe there will never be a problem that requires that ability, we should recognize that of course there will be, and of course it’ll be easier to deal with when we can examine things in that fashion.
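In practice, that forensic work can be as simple as opening up one of those intermediate files and poking at it - something like the following, again with made-up paths and a made-up column:

```python
import pandas as pd

# Open the "valve" between the clean and integrate stages and look for a blockage.
df = pd.read_parquet("stage/orders_clean.parquet")

print(df.shape)                       # did we lose more rows than expected upstream?
print(df["order_total"].describe())   # any wild outliers sneaking through?
print(df.isna().sum())                # which columns still have missing values?
```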
It’s easy to overdo that sort of approach and end up with tons of extra data sitting around that you don’t need and that never gets cleaned up - I’m not advocating for that. But making a couple of data “valves” part of your processing plan from the very beginning is a smart way to ensure you’re giving future data plumbers a fighting chance at fixing problems without having to rip out the pipes down to the studs.