July 20, 2019
Stop me if you’ve heard this one before: you have a use case for Postgres logical replication and a few minutes/hours/days after enabling it, PagerDuty is yelling that the database’s disk is dangerously full and would you please do something about it before bits start spilling out onto the datacentre floor?
The most common cause for logical replication consuming disk (and which this post is not about, so bear with me) is that the replication consumer is offline. A logical replication slot works by keeping track of the oldest address in the oldest write-ahead log (WAL) that the consumer has already replicated; postgres will consequently refuse to delete anything older from disk until the cursor advances. That way if the consumer crashes, postgres has maintained a copy of all changes so that the consumer can pick up where it left off. As changes are replicated, the consumer advances the slot’s
restart_lsn and WAL files older than that position can be cleaned up.
It follows then that if the disk is filling up, the consumer is probably offline. Right? Right?
I was taken a bit by surprise when some of our RDS postgres instances began eating their disks at the same time that the logical replication consumer was online, active, and claiming (accurately) that it saw no new changes to replicate. This isn’t a narrative trick where I’m going to later tell you that I was wrong and later found consumer wasn’t really running, because it was definitely running!
So what’s up, postgres?
This disk-filling behaviour wasn’t consistent for all of the databases with replication slots. A heatmap analysis showed that these postgres instances followed one of three behaviours:
- Postgres instances that were write-heavy seemed to be working normally and were not filling up their disk
- Postgres instances with infrequent writes were filling up their disk when idle, and then reclaiming the space shortly after a write occurred
- Postgres instances with no writes were just filling up disk space like it was going out of style
Maybe the second category could be attributed to databases with large changesets that were being vacuumed, but the metrics didn’t back that story. RDS metrics can differentiate between transaction log usage vs general disk usage, and the transaction log disk usage made the very same sawtooth pattern as space was reclaimed.
Anyway here’s what’s going on: a logical replication slot is for consuming changes from one database in a postgres instance, whereas the WAL files store changes made across all databases in the postgres instance. This kind of makes sense, since WAL files are also used for streaming replication, where the WAL files are replicated to another instance rather than the changes within. Take this instance for example:
bigdata=> \list List of databases Name | Owner | Encoding | Collate | Ctype | Access privileges ------------+----------+----------+------------+------------+----------------------- bigdata | bigdata | UTF8 | en_US.utf8 | en_US.utf8 | postgres | postgres | UTF8 | en_US.utf8 | en_US.utf8 | rdsadmin | rdsadmin | UTF8 | en_US.utf8 | en_US.utf8 | rdsadmin=CTc/rdsadmin template0 | rdsadmin | UTF8 | en_US.utf8 | en_US.utf8 | =c/rdsadmin + | | | | | rdsadmin=CTc/rdsadmin template1 | rdsadmin | UTF8 | en_US.utf8 | en_US.utf8 | =c/rdsadmin + | | | | | rdsadmin=CTc/rdsadmin (5 rows)
bigdata database has a replication slot with an active consumer:
bigdata=> SELECT slot_name, slot_type, database, active, restart_lsn FROM pg_replication_slots; slot_name | slot_type | database | active | restart_lsn -----------+-----------+----------+--------+------------- big_repl | logical | bigdata | t | 402/E4000840 (1 row)
If you write to
bigdata then the consumer will find new changes to replicate, and advance the cursor to the last transaction. But what if you write to
rdsadmin instead? What if write only to
rdsadmin and never to
bigdata? Well, I guess postgres will have to generate new WAL files to track those changes:
bigdata=> SELECT pg_current_wal_lsn(); pg_current_wal_lsn -------------------- 40F/D4000900 (1 row)
That seems further along than
restart_lsn and yet the consumer is reporting no changes on
bigdata. That’s because the way this consumer is programmed, it doesn’t advance the cursor unless it finds changes. The WAL files being generated by inserts to
rdsadmin will be retained because the cursor isn’t advancing and this is where our disk space is going. As soon as a single change is made to
bigdata, that transaction goes to the “front of the line” and the consumer will replicate changes up to that LSN. The advancing cursor means older WAL files can be cleaned up.
This explains the sawtooth pattern on the databases with infrequent writes, and the unbounded disk consumption on databases with no writes! No wait, it only almost explains the pattern because you can’t write to the
But Amazon can.
While investigating the problem, AWS support wrote to us the following:
“In RDS, we write to a heartbeat table in our internal “rdsadmin” database every 5 minutes, so RDS instances are usually never “idle” for longer than 5 minutes, whereas PostgreSQL installed on EC2 machine may appear “idle” indefinitely. On RDS hosts, we periodically write heartbeats to our admin DB on the host. As a result, even if there is no user activity, WAL files are generated.”
Heartbeats for Heartbeats
This is pretty solvable now that we know what’s happening. Here are a couple ways with varying degrees of danger and autonomy:
- Probably the most “correct” solution is for the consumer to know how to fast-forward the cursor without accidentally losing data. The logical replication consumer knows when there’s no data to replicate, and it knows where the front of the front of the line is care of
pg_current_wal_lsn();. If this sounds conceptually easy then I’m sure it’s still quite difficult to implement safely.
- If the consumer is someone else’s software then it may not be within your power to fix it, but we can still trick it. Since we know that writing to the database with the active replication slot will unjam things, let’s just do that.
Since Amazon likes to write to their database every five minutes, let’s do the same. First create a table:
bigdata=> CREATE TABLE IF NOT EXISTS public._heartbeat (id bigint NOT NULL, heartbeat bigint NOT NULL);
And then upsert into it every 5 minutes so we only ever need to store between 0 and 1 row:
bigdata=> WITH upsert AS (UPDATE _heartbeat SET heartbeat=1234 WHERE id=1 RETURNING *) INSERT INTO _heartbeat (id, heartbeat) SELECT 1, 1234 WHERE NOT EXISTS (SELECT * FROM upsert);
1234 with an actual auto-incrementing value. Where this runs and what gets inserted into the heartbeat table is an exercise left to the reader, but simplicity and team conventions can guide that decision. A Lambda fired from a CloudWatch Event is a pretty modern-day cron, but you could use an oldschool cron too. Resist the temptation to do the insert as part of your app’s healthcheck.
Once implemented, the
bigdata logical replication consumer will always see something new to replicate every 5 minutes at least. Even if no one else is writing to the database, the consumer will advance the cursor, and you’ll avoid that 4am page.
Big disclaimer: be careful copying and pasting random code you find on the internet! These are things I might have done, but that doesn’t mean you should do them too. If you do, you definitely accept all responsibility! Database names have been anonmized for this post – please don’t go looking for a
bigdata instance in our AWS account, because there ain’t one.