DevOps Observability and Lessons for Data Quality
This is a preliminary draft - see avaitla16.com/how-i-write
What's old is new again. I've been spending some time in the data quality space, and much of it reminds me of the approaches DevOps teams have taken over the past ten years or so, just with different names. These notes were inspired by a brief conversation I had with linkedin.com/in/kevinzenghu
Let's begin with a common DevOps architecture:
You have a website that calls an origin service, which in turn calls other services that work together to fulfill the request. The ops team's goals are to keep this operation fast and reliable (minimize downtime), and, when things do break, to be alerted and have the tools to resolve the issue effectively.
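A minimal sketch of that chain in Python, with the service names, URLs, and timeouts invented for illustration:

```python
# Hypothetical origin service fanning out to downstream services to build one page.
# Any single slow or failing call can degrade the whole request, which is why
# the end-to-end path is what users (and ops) ultimately care about.
import requests

DOWNSTREAM = [
    "http://users-svc.internal/profile",
    "http://orders-svc.internal/recent",
    "http://recs-svc.internal/suggest",
]

def handle_request(user_id: str) -> dict:
    payload = {}
    for url in DOWNSTREAM:
        resp = requests.get(url, params={"user_id": user_id}, timeout=0.5)
        resp.raise_for_status()  # surface downstream failures to the caller
        payload[url] = resp.json()
    return payload
```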
There's a good philosophical guide that has been the most helpful material I've found summarizing operational best practices: docs.google.com/document/d/199PqyG3UsyXlwie..
Now back to the topic of data quality. We want to know that our dashboards work reliably and, when things do break, that we can resolve them quickly.
The first matter is alerting when something is wrong. Alerts can easily go awry and become spammy noise. Ops teams run into this problem too, and the guide above recommends monitoring symptoms, not causes. That means monitoring the fact that the website went down, not the fact that the database crashed. Certainly you could monitor both, but there are hundreds of reasons a website behind a service mesh could go down, so get the most bang for your buck by monitoring the site itself. One caveat: always monitor disk usage on the database.
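As a rough illustration, a symptom-first check might look like the sketch below; the endpoint, mount path, and thresholds are assumptions, not recommendations.

```python
# Symptom check: is the user-facing site up? Plus the one cause-level exception
# called out above: database disk usage.
import shutil
import requests

def site_is_up(url: str = "https://www.example.com/") -> bool:
    try:
        return requests.get(url, timeout=3).status_code == 200
    except requests.RequestException:
        return False

def db_disk_usage_pct(path: str = "/var/lib/postgresql") -> float:
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

if not site_is_up():
    print("PAGE: website is down (symptom)")
if db_disk_usage_pct() > 85:
    print("PAGE: database disk above 85% (the one cause we always watch)")
```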
Now extend this to data quality. The most important things are the end-user reports, so monitor the marts themselves before the stg tables. There are some basic items to watch. In the DevOps space you monitor the number of 500 errors, latency, and throughput (the last being more prone to false positives). In data quality land you monitor freshness, volume, schema, and distribution, in addition to custom business-specific metrics.
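A rough sketch of those basic monitors as SQL checks against a hypothetical analytics.orders_mart table; the table, columns, and windows are illustrative.

```python
# Each query returns a value that an alerting job would compare to a threshold.
MART_MONITORS = {
    # Freshness: has the mart been loaded recently enough?
    "freshness": """
        SELECT MAX(updated_at) AS last_loaded_at
        FROM analytics.orders_mart
    """,
    # Volume: did roughly the expected number of rows land today?
    "volume": """
        SELECT COUNT(*) AS rows_loaded_today
        FROM analytics.orders_mart
        WHERE updated_at >= CURRENT_DATE
    """,
    # Schema: do the columns still match what downstream reports expect?
    "schema": """
        SELECT column_name, data_type
        FROM information_schema.columns
        WHERE table_schema = 'analytics' AND table_name = 'orders_mart'
    """,
    # Distribution: is a key metric drifting outside its usual range?
    "distribution": """
        SELECT AVG(order_total) AS avg_order_total
        FROM analytics.orders_mart
        WHERE updated_at >= CURRENT_DATE
    """,
}
```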
Now suppose an incident has happened and an alert has fired. The site is down and we'd like to know the root cause. The best tool in ops land today is probably distributed tracing. Rather than frantically asking hundreds of different services what's wrong and starting to panic, a tracing solution can tell you which service has slowed down or is facing an outage, so you can root-cause systematically and focus your effort on the right subsystem.
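For example, a minimal OpenTelemetry sketch (service and span names invented) shows the shape of it; a tracing backend would render these nested spans and make the slow hop obvious.

```python
# Emit nested spans for one request through the chain; in production the
# ConsoleSpanExporter would be swapped for an exporter to a tracing backend.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("origin-service")

def handle_request():
    with tracer.start_as_current_span("origin.handle_request"):
        with tracer.start_as_current_span("users-svc.lookup"):
            pass  # call the users service here
        with tracer.start_as_current_span("orders-svc.recent"):
            pass  # call the orders service here

handle_request()
```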
From there you break out specialized debugging tools like gdb, OS monitors, and so on; there are a lot of them, and Brendan Gregg's work is a good starting point: brendangregg.com/overview.html
Now extend this to data quality. Lineage is crucial to doing the same thing and accelerating TTR (time to resolution). It can help pinpoint what is broken (which upstream models were the source of the breakage) and the impact to the organization (which downstream reports consume this data and who uses them). The specialized tools, which are still in their infancy and which I expect will continue to develop, mostly amount to running raw SQL, viewing diffs side by side, and reviewing developer code changes and releases.
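A toy sketch of the idea, with an invented lineage graph: walk upstream from the broken mart to find candidate causes, and downstream to find the blast radius.

```python
# Edges point from a model to the models/reports that consume it (made up here).
LINEAGE = {
    "stg_orders": ["orders_mart"],
    "stg_customers": ["orders_mart", "customers_mart"],
    "orders_mart": ["revenue_dashboard"],
    "customers_mart": ["churn_report"],
}

def downstream(node, graph=LINEAGE):
    # Everything reachable from `node`: the blast radius of a breakage.
    seen, stack = set(), [node]
    while stack:
        for child in graph.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def upstream(node, graph=LINEAGE):
    # Reverse the edges, then reuse the same traversal to find candidate causes.
    reversed_graph = {}
    for parent, children in graph.items():
        for child in children:
            reversed_graph.setdefault(child, []).append(parent)
    return downstream(node, reversed_graph)

print(upstream("revenue_dashboard"))  # where the breakage could have come from
print(downstream("stg_orders"))       # which marts and reports are impacted
```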
See more here on how this works in practice: avaitla16.com/column-level-lineage-and-impl...
So far we've mostly been talking about reactive approaches, but a proactive one is often far better: an ounce of prevention is worth a pound of cure.
In software this takes the form of unit and integration tests, code review, and QA testing. Most of it can be automated, though there's still a real need for manual effort when it comes to visual UI testing or testing on mobile devices.
In the data quality space this still feels quite rudimentary: the usual state of things is to view one dashboard pointing at QA data and another pointing at prod data and to spot-check the differences by eye (does this graph look different from that one?). It's interesting to see how datafold.com has begun to take a more rigorous approach to data quality "testing" by automatically running and checking data diffs before code even gets merged, so you can see the impact far more comprehensively and exhaustively than could ever be done manually.
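The underlying idea can be sketched in a few lines; the table shapes and the use of pandas below are assumptions for illustration, not Datafold's actual implementation.

```python
# Compare the same mart built from a PR branch vs. production, keyed on a
# primary key, and summarize added rows, removed rows, and changed values.
import pandas as pd

def data_diff(prod: pd.DataFrame, branch: pd.DataFrame, key: str) -> dict:
    merged = prod.merge(
        branch, on=key, how="outer", suffixes=("_prod", "_branch"), indicator=True
    )
    both = merged[merged["_merge"] == "both"]
    changed = 0
    for col in (c for c in prod.columns if c != key):
        changed += int((both[f"{col}_prod"] != both[f"{col}_branch"]).sum())
    return {
        "rows_only_in_prod": int((merged["_merge"] == "left_only").sum()),
        "rows_only_in_branch": int((merged["_merge"] == "right_only").sum()),
        "changed_values": changed,
    }
```

Run against every model a pull request touches, a summary like this surfaces unexpected impact before the change ever reaches prod.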
Generally, the process we've arrived at for investigating data quality issues follows this format (a sketch of an alert definition that encodes these steps appears after the list):
Start by creating an alert for the data quality issue.
When making the alert, does it include the right investigation steps so that when it fires, the on-call knows what to do?
When the test fails, do the docs it was created with still make sense, and can they be updated if the investigation turns out to be unique?
Is there a process for showing operators the bad data (in an internal tool) so they can triage it?
If it's a false positive, how can we improve the test to be less flaky?
Do operators have the right data in their UI to get full context, or do we have to dig into the database every time?
Is there a way for operators to fix the bad data, and have we notified them of it?
Have we found a root cause and addressed it so the issue doesn't happen again?
Are regression tests now in place?
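One way to make the list above concrete: an alert definition that carries its own investigation steps, so the on-call never starts from zero. The structure and field names below are assumptions, not any particular tool's schema.

```python
# A data quality alert bundled with its runbook; the check and steps are examples.
from dataclasses import dataclass, field

@dataclass
class DataQualityAlert:
    name: str
    check_sql: str
    runbook: list = field(default_factory=list)  # investigation steps for the on-call
    owner: str = "data-oncall"

orders_freshness = DataQualityAlert(
    name="orders_mart_freshness",
    check_sql=(
        "SELECT MAX(updated_at) < NOW() - INTERVAL '6 hours' AS is_stale "
        "FROM analytics.orders_mart"
    ),
    runbook=[
        "Check lineage for failed or delayed upstream models.",
        "Diff today's load against yesterday's in the internal triage tool.",
        "If this is a false positive, widen the freshness window and note why.",
        "File a root-cause ticket and add a regression test before closing.",
    ],
)
```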
May the queries flow, and your pagers be quiet.