Postgres WAL Recovery Fails... And Then Works...


Lonni J Friedman

2013-01-14 20:02:34 UTC

Your errors look somewhat similar to a problem I reported last week
(no replies thus far):
http://www.postgresql.org/message-id/CAP=oouE5niXgAO_34Q+FGq=***@mail.gmail.com

Except in my case no number of restarts helped. You didn't say, were
you explicitly copying $PGDATA or using some other mechanism to
migrate the data elsewhere?

Also, which version of postgres are you using?

Hi Everyone,
We had to fail over and take a full base backup to get our slave database
back online, and ran into an interesting scenario. After copying the data
directory, setting up recovery.conf, and starting the slave database,
the database crashed while replaying xlogs. However, on a second start
the database replayed xlogs farther than it initially got, but ultimately
failed again. After starting the DB a third time, PostgreSQL replayed even
further and caught up to the master to start streaming replication. Is this
common and/or acceptable?
base/16408/18967399 does not exist
1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
invalid pages
1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
terminated by signal 6: Aborted
base/16408/18967399 does not exist
1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
invalid pages
1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
terminated by signal 6: Aborted
Fortunately, these errors only pertain to indexes, which can be rebuilt.
But is this a sign that the data directory on the slave is corrupt?
1. Data Directory Copy Finishes.
2. Recovery.conf Setup
recovery at 2013-01-12 00:14:06 UTC
incomplete startup packet
database system is starting up
unpigz: /mnt/db/wals/00000009.history does not exist -- skipping
"0000000900008E45000000B8" from archive
"0000000900008E450000008B" from archive
"pg_snapshots": No such file or directory
reached at 8E45/8B174840
database system is starting up
"0000000900008E450000008C" from archive
"0000000900008E450000008D" from archive
*SNIP*
"0000000900008E4800000066" from archive
"0000000900008E4800000067" from archive
base/16408/18967399 does not exist
1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
invalid pages
1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
terminated by signal 6: Aborted
server processes
4. PostgreSQL shuts down...
5. Debugging logs enabled in postgresql.conf.
while in recovery at log time 2013-01-11 18:05:31 UTC
once some data might be corrupted and you might need to choose an earlier
recovery target.
incomplete startup packet
database system is starting up
unpigz: /mnt/db/wals/00000009.history does not exist -- skipping
"0000000900008E45000000B8" from archive
8E45/B80AF650
8E45/8B173180; shutdown FALSE
0/552803703; next OID: 24427698
MultiXactOffset: 2442921
ID: 3104202601, in database 16408
956718952, limited by database with OID 16408
cleanup 1 init 0
"pg_snapshots": No such file or directory
"0000000900008E450000008B" from archive
"0000000900008E450000008C" from archive
*SNIP*
"0000000900008E4800000062" from archive
"0000000900008E4800000063" from archive
"0000000900008E4800000064" from archive
"0000000900008E4800000065" from archive
"0000000900008E4800000066" from archive
"0000000900008E4800000067" from archive
reached at 8E48/67AC4E28
accept read only connections
"0000000900008E4800000068" from archive
"0000000900008E4800000069" from archive
"0000000900008E480000006A" from archive
"0000000900008E480000006B" from archive
"0000000900008E480000006C" from archive
*SNIP*
"0000000900008E4F00000079" from archive
"0000000900008E4F0000007A" from archive
base/16408/18967399 does not exist
1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
invalid pages
1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
terminated by signal 6: Aborted
server processes
7. PostgreSQL shuts down...
while in recovery at log time 2013-01-11 19:50:31 UTC
once some data might be corrupted and you might need to choose an earlier
recovery target.
incomplete startup packet
database system is starting up
unpigz: /mnt/db/wals/00000009.history does not exist -- skipping
"0000000900008E4A00000039" from archive
8E4A/39CD4BA0
8E4A/19F0D210; shutdown FALSE
0/552859005; next OID: 24427698
MultiXactOffset: 2443321
ID: 3104202601, in database 16408
956718952, limited by database with OID 16408
cleanup 1 init 0
"pg_snapshots": No such file or directory
"0000000900008E4A00000019" from archive
"0000000900008E4A0000001A" from archive
*SNIP*
"0000000900008E4F00000077" from archive
"0000000900008E4F00000078" from archive
"0000000900008E4F00000079" from archive
"0000000900008E4F0000007A" from archive
reached at 8E4F/7A22BD08
accept read only connections
"0000000900008E4F0000007B" from archive
"0000000900008E4F0000007C" from archive
"0000000900008E4F0000007D" from archive
"0000000900008E4F0000007E" from archive
*SNIP*
"0000000900008E53000000D9" from archive
"0000000900008E53000000DA" from archive
"0000000900008E53000000DB" from archive
"0000000900008E53000000DC" from archive
"0000000900008E53000000DD" from archive
unpigz: /mnt/db/wals/0000000900008E53000000DE does not exist -- skipping
in log file 36435, segment 222, offset 0
unpigz: /mnt/db/wals/0000000900008E53000000DE does not exist -- skipping
successfully connected to primary
file=base/16408/22873432 time=2.538 msec
file=base/16408/18967506 time=12.054 msec
file=base/16408/18967506_fsm time=0.095 msec
file=base/16408/22873244 time=0.144 msec
file=base/16408/22892823 time=0.087 msec
9. Slave DB connected to streaming replication with Master DB and all
seems fine.
Any help would be appreciated. Thanks!

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin
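
For anyone lining up the archive file names in the logs above with the "reached at" positions, here is a minimal Python sketch. It assumes the default 16 MB WAL segment size and the 24-hex-digit file name layout (timeline, log file, segment); the function names are made up for illustration and are not part of PostgreSQL.

```python
import re

# Assumption: default 16 MB WAL segments and the 24-hex-digit file name
# layout (8 digits timeline, 8 digits log file, 8 digits segment).
SEGMENT_SIZE = 16 * 1024 * 1024

_NAME_RE = re.compile(r"^[0-9A-Fa-f]{24}$")


def parse_wal_name(name):
    """Split a WAL segment file name into (timeline, log file, segment)."""
    if not _NAME_RE.match(name):
        raise ValueError("not a WAL segment file name: %r" % name)
    return int(name[0:8], 16), int(name[8:16], 16), int(name[16:24], 16)


def wal_name_for_lsn(timeline, lsn):
    """Return the segment file name containing an LSN such as '8E48/67AC4E28'."""
    log_hex, offset_hex = lsn.split("/")
    log = int(log_hex, 16)
    # The segment number is the byte offset divided by the segment size.
    segment = int(offset_hex, 16) // SEGMENT_SIZE
    return "%08X%08X%08X" % (timeline, log, segment)


if __name__ == "__main__":
    # The missing file from the third start, as decimal numbers.
    print(parse_wal_name("0000000900008E53000000DE"))  # (9, 36435, 222)
    # The position reached on the second start, as a file name.
    print(wal_name_for_lsn(9, "8E48/67AC4E28"))  # 0000000900008E4800000067
```

The first call matches the "log file 36435, segment 222" message that follows the missing 0000000900008E53000000DE file, and the second matches the last segment restored before "reached at 8E48/67AC4E28".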

KONDO Mitsumasa

2013-01-15 01:08:16 UTC

Hi,

It may be PG 9.2.x on an HA cluster (pgsql RA on Pacemaker).

This is a known bug.
When the HA cluster starts, PG uses recovery.conf during crash recovery.
But in PG 9.2, recovery with recovery.conf requires a database cluster
that has not crashed.

Please read the following mailing list threads:
[HACKERS] [BUG?] lag of minRecoveryPont in archive recovery
[BUGS] PITR potentially broken in 9.2

The latest development branch (the next release, 9.2.3) has this bug fixed.


--
NTT OSS Center
Mitsumasa KONDO


Albe Laurenz

2013-01-15 08:17:29 UTC

Post by Phil Monroe
We had to fail over and take a full base backup to get our slave database back online, and ran into an
interesting scenario. After copying the data directory, setting up recovery.conf, and starting the
slave database, the database crashed while replaying xlogs. However, on a second start the database
replayed xlogs farther than it initially got, but ultimately failed again. After starting the DB a
third time, PostgreSQL replayed even further and caught up to the master to start streaming
replication. Is this common and/or acceptable?

Certainly not acceptable.

Just checking: When you were "copying the data directory",
did you have online backup mode enabled?

Yours,
Laurenz Albe


Heikki Linnakangas

2013-01-15 10:51:50 UTC

How did you perform the base backup? Did you use pg_basebackup? Or if
you did a filesystem-level copy, did you use pg_start/stop_backup
correctly? Did you take the base backup from the master server, or from
another slave?

This looks similar to the bug discussed here:
http://www.postgresql.org/message-id/CAMkU=1wpvYJVEDo6Qvq4QbosZ+AV6BMVCf+XVCG=***@mail.gmail.com.
That was fixed in 9.2.2, so if you're using 9.2.1 or 9.2.0, try upgrading.

- Heikki
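
As a rough way to apply the advice above, the following Python sketch checks whether a reported version string falls in the 9.2 series before 9.2.2. The 9.2.2 threshold is taken from the reply above; the function names are hypothetical, and the check says nothing about other bugs or other release series.

```python
import re

# Assumption: the threshold is the release named in the reply above
# ("fixed in 9.2.2"); adjust it if a later fix is the one that matters.
FIXED_IN = (9, 2, 2)


def parse_version(text):
    """Extract the first three-part version number, e.g. (9, 2, 1)."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no x.y.z version found in %r" % text)
    return tuple(int(part) for part in match.groups())


def needs_upgrade(version_text, fixed_in=FIXED_IN):
    """True if the version is in the same x.y series but older than the fix."""
    version = parse_version(version_text)
    return version[:2] == fixed_in[:2] and version < fixed_in


if __name__ == "__main__":
    print(needs_upgrade("PostgreSQL 9.2.1 on Ubuntu 12.04"))  # True
    print(needs_upgrade("PostgreSQL 9.2.2"))  # False
```

The version string could come from `select version()` or `postgres --version`; the sketch only needs the x.y.z part.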


Phil Monroe

2013-01-15 19:54:49 UTC

Sorry, my initial response got blocked because I replied with the logs
quoted again.

Post by Lonni J Friedman
Also, which version of postgres are you using?

PostgreSQL 9.2.1 on Ubuntu 12.04

Post by Lonni J Friedman
Except in my case no number of restarts helped. You didn't say, were
you explicitly copying $PGDATA or using some other mechanism to
migrate the data elsewhere?

We have a very large database (~5TB), so we use a script that runs
parallel rsyncs to copy the data directory
(https://gist.github.com/4477190/#file-pmrcp-rb). The whole copy process
ended up taking ~3.5 hours. It was a physical copy of $PGDATA (which
is located at /var/lib/postgresql/9.2/main/ on both machines). We
used the following process:

1. Master archives WAL files to Backup Host.
2. Execute on Master: psql -c "select pg_start_backup('DATE-slave-restore')"
3. Execute on Master: RCP='rsync -cav --inplace -e rsh'
   EXCLUDE='pg_xlog' pmrcp /var/lib/postgresql/9.2/main/
   prd-db-01:/var/lib/postgresql/9.2/main/ > /tmp/backup.log
4. Execute on Master: psql -c "select pg_stop_backup()"
5. On Slave, set up recovery.conf to read the WAL archive on Backup Host.
6. Execute on Slave: pg_ctlcluster 9.2 main start (as described in
   initial email)

Best,
Phil
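
The start/copy/stop sequence in steps 2 to 4 above could be wrapped so that pg_stop_backup() runs even if the copy fails. This is only a sketch under stated assumptions: it uses a single plain rsync in place of the parallel pmrcp script, mirrors the pg_xlog exclusion from step 3, and assumes the backup label contains no quote characters. The function names are made up for illustration.

```python
import subprocess


def backup_commands(label, source_dir, dest):
    """Return the (start, copy, stop) command lines for a low-level base backup."""
    # Assumption: label is a simple string without quote characters.
    start = ["psql", "-c", "select pg_start_backup('%s')" % label]
    # Assumption: one plain rsync stands in for the parallel pmrcp script.
    copy = ["rsync", "-a", "--inplace", "--exclude=pg_xlog", source_dir, dest]
    stop = ["psql", "-c", "select pg_stop_backup()"]
    return start, copy, stop


def run_base_backup(label, source_dir, dest, run=subprocess.check_call):
    """Run start, copy, stop in order; always leave backup mode afterwards."""
    start, copy, stop = backup_commands(label, source_dir, dest)
    run(start)
    try:
        # rsync can exit non-zero on a live data directory (for example when
        # files vanish mid-copy); a real script would decide what to tolerate.
        run(copy)
    finally:
        run(stop)
```

Passing a different `run` callable makes it possible to dry-run the sequence, for example `run_base_backup("DATE-slave-restore", "/var/lib/postgresql/9.2/main/", "prd-db-01:/var/lib/postgresql/9.2/main/", run=print)`.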


Phil Monroe

2013-01-15 18:33:42 UTC

Sorry, my initial response got blocked because I replied with the logs
quoted again.

Post by Lonni J Friedman
Also, which version of postgres are you using?

PostgreSQL 9.2.1 on Ubuntu 12.04

Post by Lonni J Friedman
Except in my case no number of restarts helped. You didn't say, were
you explicitly copying $PGDATA or using some other mechanism to
migrate the data elsewhere?

Phil Monroe

2013-01-14 23:45:09 UTC

Post by Lonni J Friedman
Also, which version of postgres are you using?

PostgreSQL 9.2.1 on Ubuntu 12.04

Post by Lonni J Friedman
Except in my case no number of restarts helped. You didn't say, were
you explicitly copying $PGDATA or using some other mechanism to
migrate the data elsewhere?