Notes on syncing systems

When you only have 1 system (your primary database), there's no need for data syncing.

When you have 2 systems (salesforce for sales and your database), there's 2 directions of sync (sf -> db and db -> sf).

When you have 3 systems (zendesk for support, salesforce for sales, and your database), there's 6 systems to sync (sf -> zd, sf -> db, zd -> sf, zd -> db, db -> sf, db -> zd).

When you have 4 systems (iterable for marketing, zendesk for support, salesforce for sales, and your database) there's 12 systems to sync.

When you have n systems there's theoretically n * (n - 1) systems to sync:

400px-Complete_graph_K7.svg.png

New systems are introduced when businesses grow to best serve members (it's practically not possible to build everything in house in a single database). It's important to provide operators the best tools for the job (sf, zd, iterable are non trivial to replicate). Nevertheless, nothing is free, and new systems come with this implicit integration cost which doubles behind the scenes (so growth becomes a hiring bottleneck that cannot keep up).

In addition to implicit integration sprawl, there are additional unseen challenges to building a good syncer. A syncer not only needs to get data from A to B but needs to consider the following:

What is the desired sync frequency? Do we need real-time/incremental sync (usually change data capture) or is nightly ok (usually full resyncs)? Are there rate limits in place?
How do we know the sync is working. Suppose we only sync in one direction A -> B. Employee Joe updates a value in A, it syncs to B, and then Sue updates the value in system B. Our sync sees theres no issue, but behind the scenes the value has been changed. Every sync problem has a dual problem of integrity checking (does the sync still work).
When a sync fails, do we know why it failed? How is observability baked into a sync systems for troubleshooting? Can operators be notified when needed to fix the issue quickly?
Do off the shelf integrations already exist for the systems you're trying to integrate (can you trust they work / does it fit 95% of your requirements or 100% of them)? Do you need to build your sync script / validation script from scratch (how long would that take to build but also maintain from a tco perspective)?
Validation and negotiation of different schemas. Suppose we sync google sheets to mysql, but the google sheet has an errant string, when the schema only accepts an integer. How do you enforce or support integrity needs of different systems.
Conflict resolution and circular references: Are sync systems built in a way that loops cannot occur, and when two values differ theres a way to resolve them (last write wins, crdts, max hops / ttl)

Generally there's ways the complexity is removed through consolidation points (like a warehouse):

DB <--> Warehouse <--> (Zd, Sf)

In the above example you can always force everything through the warehouse to avoid edge connections everywhere. Warehouses also have many challenges (real-time may not always be possible, point to point sync may be faster to implement)

You can also have clear data flows that are only in one direction by clearly marking which system supports which fields:

Salesforce -> DB -> Warehouse <-> Zd

This simplifies the graph practically into something more like this:

Which ultimately reduces the number of cron jobs for syncing and verifications you need to manage. For now I need to get back to writing my cronjob sync scripts - rather than writing about writing cronjob sync scripts 😂