This is a preliminary draft - see avaitla16.com/how-i-write
Column level lineage helps us understand a variety of things:
- When reports break, what column was the source of the breakage
- When we modify a field in a stg table, what downstream marts are impacted
- Are downstream marts using an upstream column in the correct way
- Onboarding new team members to a data model
- Helping team members build trust in data as they know how items are derived
For this exercise we'll use Bigquery and illustrate some common lineage properties that are important to be aware of. We'll use the jaffle shop repository as our test bed. Here's a vimeo link on getting this setup: vimeo.com/614780503
Now let's look at the final models customers
and orders
generated by dbt.
customers.sql
orders.sql
DBT computes lineages based on manual annotations with ref
: docs.getdbt.com/reference/dbt-jinja-functio..
This approach is more explicit, but makes column level lineage difficult as the model writer would need annotations on each column itself. A better approach is described through query log analysis here: cloud.google.com/architecture/building-a-bi..
Column level lineage is very similar to how a compiler parses a function in c++ (it traces where variables are instantiated and where they are used). There are challenges in this approach when you do select star (you need to know state of tables at that time) or import star from module (python dialect), which makes this approach challenging since it is determined at runtime, but in the sql case you could infer this from the schema of the tables (if you move outside the compile/parse only step and only if the select star isn’t in a hidden cte/ sub query).
The key problems that make column lineage tricky are: renamed columns, junction only tables which don’t materialize in the final table, counts and aggregations, and star expansion.
In many ways ideas that have been known for quite some time are now seeing a reemergence in new fields. What's old is new again.