Discussion:
Postgres point-in-time recovery failure
Cheryl Grant
2013-02-26 02:00:20 UTC
Hi, I'm trying to test restoration of a database using point-in-time
recovery. I'm taking a backup of the database using pg_basebackup:
pg_basebackup -D /postgres/data -Fp -l RestorePostgres -U reco -w -h
radmast01 -p 5432
Then attempting to recover the backup on a second server using the following
recovery.conf settings:
restore_command = 'cp /apps/postgres/backup/WAL/%f %p'
recovery_target_time = '2013-02-26 12:53:00'
recovery_target_inclusive=true
Every time I start the recovery I get the following error in the log file
and the instance crashes:
2844 LOG: database system was interrupted; last known up at 2013-02-26
12:46:56 EST
2844 LOG: creating missing WAL directory "pg_xlog/archive_status"
2844 LOG: starting point-in-time recovery to 2013-02-26 12:53:00+11
2844 LOG: restored log file "000000010000017D00000056" from archive
2844 LOG: unexpected pageaddr 17D/2E000000 in log file 381, segment 86,
offset 0
2844 LOG: invalid checkpoint record
2844 FATAL: could not locate required checkpoint record
2844 HINT: If you are not restoring from a backup, try removing the file
"/apps/postgres/data/backup_label".
2825 LOG: startup process (PID 2844) exited with exit code 1
2825 LOG: aborting startup due to startup process failure
Albe Laurenz
2013-02-26 08:38:12 UTC
Post by Cheryl Grant
2844 LOG: starting point-in-time recovery to 2013-02-26 12:53:00+11
2844 LOG: restored log file "000000010000017D00000056" from archive
2844 LOG: unexpected pageaddr 17D/2E000000 in log file 381, segment 86,
offset 0
2844 LOG: invalid checkpoint record
2844 FATAL: could not locate required checkpoint record
That indicates that the WAL file 000000010000017D00000056 is
broken. Are you sure that it is from the PostgreSQL server
you backed up? How did you archive the WAL files?

Yours,
Laurenz Albe
Cheryl Grant
2013-02-26 23:18:30 UTC
I'm doing a pg_basebackup to create the instance with -x specified, so some
of the logs are in the pg_xlog directory after the backup. It always seems
to fall over with the same error on the first log. I've tried this numerous
times with different backups and it always fails on the first log.

I've used the same method to create a hot standby, which works, but only
because streaming replication is getting the data across. But this won't
work in a disaster recovery situation.

My backup command for the primary WAL logs is a script. Here are the
contents of the script:

ls -1 $PGDATA/pg_xlog | while read f; do
{
    if [ -f $PGDATA/pg_xlog/$f ] ; then
        if [ ! -f $LOGPATH/$f ] ; then
            echo "$PGDATA/pg_xlog/$f" >> $LOGFILE
            cp $PGDATA/pg_xlog/$f $LOGPATH
            status=$?
            echo status=$status >> $LOGFILE
            scp $LOGPATH/$f $SCPHOST:$LOGPATH &
        fi
    fi
} done;
Post by Albe Laurenz
That indicates that the WAL file 000000010000017D00000056 is
broken. Are you sure that it is from the PostgreSQL server
you backed up? How did you archive the WAL files?
--
Cheryl Grant
Albe Laurenz
2013-02-27 08:58:45 UTC
Post by Albe Laurenz
Post by Cheryl Grant
2844 LOG: starting point-in-time recovery to 2013-02-26 12:53:00+11
2844 LOG: restored log file "000000010000017D00000056" from archive
2844 LOG: unexpected pageaddr 17D/2E000000 in log file 381, segment 86,
offset 0
2844 LOG: invalid checkpoint record
2844 FATAL: could not locate required checkpoint record
That indicates that the WAL file 000000010000017D00000056 is
broken. Are you sure that it is from the PostgreSQL server
you backed up? How did you archive the WAL files?
Post by Cheryl Grant
I'm doing a pg_basebackup to create the instance with -x specified, so some of the logs are in the
pg_xlog directory after the backup. It always seems to fall over with the same error on the first log.
I've tried this numerous times with different backups and it always fails on the first log.
Ah, but what the above log entry says is that it
took the WAL file from the archive location and
copied it into pg_xlog.

So the WAL file created by the -x switch of
pg_basebackup was overwritten with a file from
the archive.

Does the archive contain a different (= wrong)
copy of the WAL file?
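
One way to check, assuming you still have an untouched copy of the
base backup somewhere (the second path below is only a placeholder),
is to compare checksums of the two copies of that segment:

    # compare the archived segment with the copy that pg_basebackup -x
    # left in the base backup's pg_xlog (adjust the placeholder path)
    md5sum /apps/postgres/backup/WAL/000000010000017D00000056 \
           /path/to/untouched/basebackup/pg_xlog/000000010000017D00000056

If the checksums differ, the copy in the archive is the bad one.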
Post by Cheryl Grant
I've used the same method to create a hot standby, which works, but only because streaming
replication is getting the data across. But this won't work in a disaster recovery situation.
Even with streaming replication that should not
work, if the problem is a bad WAL file in the archive.
Post by Cheryl Grant
ls -1 $PGDATA/pg_xlog | while read f; do
{
    if [ -f $PGDATA/pg_xlog/$f ] ; then
        if [ ! -f $LOGPATH/$f ] ; then
            echo "$PGDATA/pg_xlog/$f" >> $LOGFILE
            cp $PGDATA/pg_xlog/$f $LOGPATH
            status=$?
            echo status=$status >> $LOGFILE
            scp $LOGPATH/$f $SCPHOST:$LOGPATH &
        fi
    fi
} done;
That's not your archive_command, right?
At what points is this script run?
Could it have copied an incomplete WAL file to the archive? Note that
pg_xlog also contains recycled segments, which have been renamed for
future use but still hold old data; a script that blindly copies
everything in pg_xlog will archive those too, and stale contents like
that are exactly what the "unexpected pageaddr" message suggests.

The right way to archive WAL files is to set an appropriate
archive_command in postgresql.conf. Then each WAL segment is archived
as soon as it is complete, and the PostgreSQL server knows whether
archiving succeeded or failed.
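
For example, something like this in postgresql.conf (the archive
directory here is the one your restore_command already points at;
adjust the paths to your setup):

    wal_level = archive       # or hot_standby, if the server also feeds a standby
    archive_mode = on         # changing this requires a server restart
    archive_command = 'test ! -f /apps/postgres/backup/WAL/%f && cp %p /apps/postgres/backup/WAL/%f'

The test ! -f guard refuses to overwrite a segment that is already in
the archive, so a mistake shows up as an archiving failure in the log
instead of silently corrupting the archive.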

Yours,
Laurenz Albe