[ADMIN] standby with a fixed lag behind the master

Discussion:

(too old to reply)

Alexey Klyukin

2012-07-26 09:24:04 UTC

Hello,

I've recently come across the task of setting up a PostgreSQL 9.1 standby server that is N hours behind the master, i.e. only transactions that finished N hours in the past or older should be replayed on a standby. The goal is to have a known good state server to perform backups from and possibly revert to in case of unwanted changes on primary. It seems that there is no mechanism in PostgreSQL to just ask the standby to keep a fixed distance (in terms of either WAL segments or time) between the primary, so these are possible solutions:

1. Use restore command on standby to fetch the current WAL segment only if it has been created not less than N hours in the past (according to ctime).
2. Pause the restore process on standby if the lag * is less than N hours (with pg_xlog_replay_pause()) and resume if it is more than that.
3. Set recovery_target_time to current - 6 hours and pause_at_recovery_target to true, periodically check whether the recovery is paused, reset the recovery target time to a new value (and restart the standby) if it is.

* - the lag would be calculated as now() - pg_last_xact_replay_timestamp() on standby.

Both 2 and 3 requires external cron job to pause/resume the recovery, and 1, while being the easiest of all, doesn't work with SR (unless it's combined with WAL shipping). I wonder if there are other well established approaches at solving this problem and if there is an interest for adding such feature to the -core?

Thank you,
--
Alexey Klyukin http://www.commandprompt.com
The PostgreSQL Company – Command Prompt, Inc.

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Jerry Sievers

2012-07-26 13:11:30 UTC

Permalink

Post by Alexey Klyukin
Hello,
I've recently come across the task of setting up a PostgreSQL 9.1
standby server that is N hours behind the master, i.e. only
transactions that finished N hours in the past or older should be
replayed on a standby. The goal is to have a known good state server
to perform backups from and possibly revert to in case of unwanted
changes on primary. It seems that there is no mechanism in
PostgreSQL to just ask the standby to keep a fixed distance (in
terms of either WAL segments or time) between the primary, so these

A not unreasonable problem to want options for solving.

Post by Alexey Klyukin
1. Use restore command on standby to fetch the current WAL segment
only if it has been created not less than N hours in the past
(according to ctime).

This is the approach I implemented a few years ago by writing a custom
version of pg_standby that implements all the standard logic plus
checking for a configurable age of WAL segment before applying same.
I chose to do the program in Python rather than trying to hack the
feature into the C language native version but of course this was due
to personal preference.

I have seen this feature requested a few times before but not sure
ever added to the community release.

Post by Alexey Klyukin
2. Pause the restore process on standby if the lag * is less than N
hours (with pg_xlog_replay_pause()) and resume if it is more than
that.
3. Set recovery_target_time to current - 6 hours and
pause_at_recovery_target to true, periodically check whether the
recovery is paused, reset the recovery target time to a new value
(and restart the standby) if it is.
* - the lag would be calculated as now() - pg_last_xact_replay_timestamp() on standby.
Both 2 and 3 requires external cron job to pause/resume the
recovery, and 1, while being the easiest of all, doesn't work with
SR (unless it's combined with WAL shipping). I wonder if there are
other well established approaches at solving this problem and if
there is an interest for adding such feature to the -core?

Right and I"m not sure such a strategy would play nice with
hot-standby either.

One strategy if you have enough storage, is to have multiple standby
instances perhaps all running on the same standby host. One of them
is a streamer and the other a lagger. You would go live on whichever
instance was necessary to recover from whatever disaster you had.

In practice, we've never had to use any of our lagged standbys for the
intended purpose, always unlagging them and bringing online at latest
transaction.

But... If you are ever faced with a case of like junior DBA drops
important table in production, deployment code mangles DB or
whatever, you may be glad you had a lagged standby to migrate to.

In the case of guarding against risk of disaster from a deployment
gone bad, and given just one or more regular standbys in the typical
near real-time mode, it's much easier to just halt one of them after
apps are down and immediatly prior to the deployment. If things go
badly, online that standby and start running on it.

HTH

Post by Alexey Klyukin
Thank you,
--
Alexey Klyukin http://www.commandprompt.com
The PostgreSQL Company Command Prompt, Inc.
--
http://www.postgresql.org/mailpref/pgsql-admin

--
Jerry Sievers
Postgres DBA/Development Consulting
e: ***@comcast.net
p: 732.216.7255

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Steve Crawford

2012-07-26 16:18:14 UTC

Permalink

Post by Alexey Klyukin
Hello,
1. Use restore command on standby to fetch the current WAL segment only if it has been created not less than N hours in the past (according to ctime).
2. Pause the restore process on standby if the lag * is less than N hours (with pg_xlog_replay_pause()) and resume if it is more than that.
3. Set recovery_target_time to current - 6 hours and pause_at_recovery_target to true, periodically check whether the recovery is paused, reset the recovery target time to a new value (and restart the standby) if it is.
* - the lag would be calculated as now() - pg_last_xact_replay_timestamp() on standby.
Both 2 and 3 requires external cron job to pause/resume the recovery, and 1, while being the easiest of all, doesn't work with SR (unless it's combined with WAL shipping). I wonder if there are other well established approaches at solving this problem and if there is an interest for adding such feature to the -core?

This sounds like a weird way to approach this problem. If you want a
snapshot n-hours ago why not just schedule a dump or base-backup to run
at the desired times? Or continuously archive your WAL so you can
recover up to any point you want rather than just a pre-defined n-hours?

I'm currently looking at using the recently announced pgbarman to
simplify the overall process. (http://www.pgbarman.org/)

Cheers,
Steve

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Amador Alvarez

2012-08-08 17:34:54 UTC

Permalink

Hi there,

And why not shipping older WAL files to the target on a regular basis ?.
On the master you can control with a crontab job to ship the wanted WAL
files (n hours older than current time and clean the shipped up, check
rsync options up) in a regular basis.

A.A.

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin