Upgrading MySQL from 8.0.17 to 8.0.23
AWS performed a minor version upgrade on our MySQL servers from 8.0.17 to 8.0.23 this past Saturday at 4:55 AM CST. This happens automatically because of the instance's auto minor version upgrade setting:
In most cases a minor version upgrade is uneventful, but this one led to two key issues.
We were first alerted by a large number of errors of the following kind: (1038, 'Out of sort memory, consider increasing server sort buffer size')
Unsure what had happened, our first step was to check the AWS console to see what was going on:
We then realized the database had been upgraded, and the logs confirm this:
T09:44 /rdsdbbin/mysql/bin/mysqld: Shutdown complete (mysqld 8.0.17) Source distribution.
T09:45 /rdsdbbin/mysql/bin/mysqld (mysqld 8.0.23) starting as process 22020
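As a quick sanity check, the running version can also be confirmed from any client session; this is generic SQL rather than something taken from our logs:

```sql
-- Confirm the server version from a client; after the restart this reports 8.0.23.
SELECT VERSION();
```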
We then googled the primary error and quickly found https://bugs.mysql.com/bug.php?id=103225. In particular we read:
On MySQL 8.0.23, a certain dataset and query cause an "Out of sort memory" (1038) error.
This is reproducible, and it does not happen on 8.0.17.
A potentially relevant change is that somewhere between those versions, we started sorting small blobs, such as TEXT, as addon fields instead of always doing sort-by-rowid. This is the reason why there's now more pressure on the sort buffer (but for most cases, sorts should still be faster). I see you have a TEXT field in your data (but I haven't looked at it apart from that).
We noticed that we are indeed doing a SELECT * query on a table that has a TEXT column. The quick fix is to increase the sort buffer; the longer-term fix is to optimize the columns selected in the query itself:
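A minimal sketch of both fixes is below. The table and column names are made up for illustration, and on RDS sort_buffer_size is normally changed through the DB parameter group rather than SET GLOBAL:

```sql
-- Short-term fix: raise the per-session sort buffer (example value).
-- On RDS this goes in the DB parameter group; SET GLOBAL is shown here
-- as it would apply to a self-managed server.
SET GLOBAL sort_buffer_size = 2 * 1024 * 1024;  -- 2 MB instead of the 256 KB default

-- Longer-term fix: keep the TEXT column out of the sort by selecting only
-- the columns the query needs ('articles', 'title', etc. are hypothetical).
-- Before: SELECT * FROM articles ORDER BY created_at DESC LIMIT 50;
SELECT id, title, created_at
FROM articles
ORDER BY created_at DESC
LIMIT 50;
```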
Now the second issue was much more subtle. We started noticing that our Fivetran cluster was no longer completing jobs: syncs with small amounts of data were taking over 12 hours, and every single sync was failing:
After working with the Fivetran support team we tested a few different timeouts, which didn't do much, and finally created a database on MySQL 8.0.17 replicating from our primary cluster. Lo and behold, connecting to it did let the sync complete. Looking into it further, the team noticed the invisible columns feature introduced in MySQL 8.0.23: dev.mysql.com/doc/refman/8.0/en/invisible-c... This feature causes issues in the underlying binlog reader here: github.com/osheroff/mysql-binlog-connector-.. (which also causes issues in Maxwell's daemon: github.com/zendesk/maxwell/issues/1724)
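For context, this is roughly what the feature looks like; the table below is a made-up example, not one of our schemas:

```sql
-- An INVISIBLE column (new in MySQL 8.0.23) is excluded from SELECT *
-- but can still be referenced explicitly.
CREATE TABLE example (
  id         BIGINT PRIMARY KEY,
  payload    TEXT,
  deleted_at DATETIME INVISIBLE
);

SELECT * FROM example;               -- returns id and payload only
SELECT id, deleted_at FROM example;  -- invisible columns must be named explicitly
```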
In particular, this breakage only occurs if you have binlog_row_metadata=FULL AND are running MySQL 8.0.23 (mysqlhighavailability.com/more-metadata-is-..). We did have this setting on, and turning it off allowed the underlying binlog connector to work again.
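A sketch of the check and the workaround; again, on RDS the change goes through the DB parameter group, and the SET GLOBAL form assumes a self-managed server:

```sql
-- Confirm whether the extra row metadata is being written to the binlog.
SHOW VARIABLES LIKE 'binlog_row_metadata';  -- FULL or MINIMAL

-- Workaround: drop back to the default MINIMAL until the connector is patched.
SET GLOBAL binlog_row_metadata = 'MINIMAL';
```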
Finally our sync works again:
It's an interesting case where multiple conditions occurring at the same time (MySQL 8.0.23 + FULL row metadata + the MySQL binlog connector bug not parsing the invisible-column metadata) lead to an issue, even though no single one of them causes a problem by itself.
Some follow-ups from the Fivetran team:
- Syncs should fail quickly when the binlog thread fails (otherwise people think the sync is still running, just slowly).
- Incorporate the new Binlog connector patch when available.
Huge thanks to the Fivetran team for their help uncovering what was going on here!
Also incredible kudos to Shyiko and Osheroff for making and maintaining such a crucial library in mysql-binlog-connector-java!