Unlocking OLAP in MySQL
This is a preliminary draft - see avaitla16.com/how-i-write
Historically you've needed an OLTP (online transaction processing system - like mysql, postgres, sqlserver, oracle) database to run transactions on and then an ETL to an OLAP system to perform reports, the modern data stack including:
- Modern Data Stack Warehouses: Bigquery, Databricks, Snowflake, Redshift, Clickhouse, Materialize, etc
- Enterprise Products: Sybase IQ, Vertica, Informix, Teradata, Oracle Warehouse etc
Warehouses generally operate with massive amounts of compute in memory, columnar storage, and ruthless parallelization (often times eschewing indexes all together). They also come with helpful SQL that's intended for Analysts convenience (pivoting, unnest, statistics or ml functions). An extensive background can be found here: holistics.io/blog/the-rise-and-fall-of-the-..
A discussion on the separation of OLTP and OLAP is given here: philip.greenspun.com/wtr/data-warehousing.h.. and databases which attempt to combine them is called HTAP (en.wikipedia.org/wiki/Hybrid_transactional/..)
I believe the need to have a separate OLTP and OLAP is a constraint solely due to technology and splitting the two causes a lot of problems:
- A social separation of the development team from the data team (devs generally understand their OLTP syntax but not the OLAP syntax, and data teams understand the OLAP syntax but maybe not the OLTP one). This causes a gap in data understanding as developers who build the features and no many nuances aren't doing reporting against it. This is given to data engineering who need to unravel the development teams schemas after the fact.
- Lack of realtime analytics (fivetrans minimum sync frequency is 5 minutes and from there materializations may run nightly)
- There are computations that may be sophisticated that you only want to do once and not have repeated in OLAP (for risk of incorrect calculations). This includes a persons invoice which they need to see in OLTP but also finance needs to see in an analytics report.
- Early stage companies may create more overhead of having two systems rather than doing everything in one place and artificially have to adopt the syntax and nuances of an OLAP system - often times the dev team is also much larger than the analysts team so centralization in one HTAP system means .
Indeed for folks using Postgres, they often use it for reporting as well on small datasets so you only need to know the Postgres dialect. Citus Data is one such HTAP example for postgres with distributed tables (citusdata.com), but you can also see tuning advice here (narrator.ai/blog/using-postgresql-as-a-data..). Similarly with SQL Server can also run reporting queries with a different storage engine seamlessly under the hood in SQL Server Reporting Services.
To date this has not been possible in MySQL. However there are a key set of things that have changed in the past ten years to start making this possible now:
- Basic 101 Reporting Syntax in MySQL 8.0: CTE's and Window Functions
- Innovations in mysql wire compatibility:
- Most fascinating is the development over the past ten years of Heatwave - which is 100% parity with MySQL
- MariaDB Columnstore
- Singlestore, Clickhouse, and TiDB
- New convenience tooling to model in MySQL:
- DBT MySQL Adapter github.com/dbeatty10/dbt-mysql written by Doug Beatty
- Fivetran has started to support MySQL as a destination
We'll now go through a sample of how MySQL can be a blazing fast reporting engine... TODO migrate a workload and show the heatwave performance for analytics. We focus on heatwave for 100% compatibility and the ability to use it as a replication chain - so no ETL or query syntax changes needed whatsoever.