WAL recovery failure

Discussion:

(too old to reply)

Rick Dicaire

2012-09-25 18:22:48 UTC

Hi folks, we're snapshotting at san level, two volumes containing a
postgres 9.1.5 database, and pg_xlog respectively.
This database cluster is a streaming replicant.
We're stopping the database completely, taking the snapshots,
restarting the database.
I export these snapshots as luns to a VM, mount the disks in the VM,
cp -a the database and pg_xlog contents to
a staging area. Start postgres using -D /staging/9.1/main. Postgres
fails to start.

From log:

2012-09-25 12:41:30.147 CDT,,,9597,,5061ecca.257d,1,,2012-09-25
12:41:30 CDT,,0,LOG,00000,"database system was interrupted while in
recovery at 2012-09-25 12:18:06 CDT",,"This probably means that some
data is corrupted and you will have to use the last backup for
recovery.",,,,,,,""
2012-09-25 12:41:30.155 CDT,,,9597,,5061ecca.257d,2,,2012-09-25
12:41:30 CDT,,0,LOG,00000,"starting archive recovery",,,,,,,,,""
2012-09-25 12:41:30.354 CDT,,,9597,,5061ecca.257d,3,,2012-09-25
12:41:30 CDT,,0,LOG,00000,"restored log file
""000000010000074D000000EE"" from archive",,,,,,,,,""
2012-09-25 12:41:30.529 CDT,,,9597,,5061ecca.257d,4,,2012-09-25
12:41:30 CDT,,0,LOG,00000,"restored log file
""000000010000074D000000E2"" from archive",,,,,,,,,""
2012-09-25 12:41:30.529 CDT,,,9597,,5061ecca.257d,5,,2012-09-25
12:41:30 CDT,,0,PANIC,XX000,"unexpected pageaddr 74D/7C000000 in log
file 1869, segment 226, offset 0",,,,,,,,,""
2012-09-25 12:41:30.530 CDT,,,9595,,5061ecc8.257b,1,,2012-09-25
12:41:28 CDT,,0,LOG,00000,"startup process (PID 9597) was terminated
by signal 6: Aborted",,,,,,,,,""
2012-09-25 12:41:30.530 CDT,,,9595,,5061ecc8.257b,2,,2012-09-25
12:41:28 CDT,,0,LOG,00000,"terminating any other active server
processes",,,,,,,,,""

/staging/9.1/main/recovery.conf:

restore_command = 'cp /staging/9.1/main/pg_xlog/%f "%p"'

Does the PANIC shown in above log entry indicate a corrupt WAL file?

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Fujii Masao

2012-09-26 00:01:52 UTC

Permalink

Post by Rick Dicaire
Hi folks, we're snapshotting at san level, two volumes containing a
postgres 9.1.5 database, and pg_xlog respectively.
This database cluster is a streaming replicant.
We're stopping the database completely, taking the snapshots,
restarting the database.
I export these snapshots as luns to a VM, mount the disks in the VM,
cp -a the database and pg_xlog contents to
a staging area. Start postgres using -D /staging/9.1/main. Postgres
fails to start.
2012-09-25 12:41:30.147 CDT,,,9597,,5061ecca.257d,1,,2012-09-25
12:41:30 CDT,,0,LOG,00000,"database system was interrupted while in
recovery at 2012-09-25 12:18:06 CDT",,"This probably means that some
data is corrupted and you will have to use the last backup for
recovery.",,,,,,,""
2012-09-25 12:41:30.155 CDT,,,9597,,5061ecca.257d,2,,2012-09-25
12:41:30 CDT,,0,LOG,00000,"starting archive recovery",,,,,,,,,""
2012-09-25 12:41:30.354 CDT,,,9597,,5061ecca.257d,3,,2012-09-25
12:41:30 CDT,,0,LOG,00000,"restored log file
""000000010000074D000000EE"" from archive",,,,,,,,,""
2012-09-25 12:41:30.529 CDT,,,9597,,5061ecca.257d,4,,2012-09-25
12:41:30 CDT,,0,LOG,00000,"restored log file
""000000010000074D000000E2"" from archive",,,,,,,,,""
2012-09-25 12:41:30.529 CDT,,,9597,,5061ecca.257d,5,,2012-09-25
12:41:30 CDT,,0,PANIC,XX000,"unexpected pageaddr 74D/7C000000 in log
file 1869, segment 226, offset 0",,,,,,,,,""
2012-09-25 12:41:30.530 CDT,,,9595,,5061ecc8.257b,1,,2012-09-25
12:41:28 CDT,,0,LOG,00000,"startup process (PID 9597) was terminated
by signal 6: Aborted",,,,,,,,,""
2012-09-25 12:41:30.530 CDT,,,9595,,5061ecc8.257b,2,,2012-09-25
12:41:28 CDT,,0,LOG,00000,"terminating any other active server
processes",,,,,,,,,""
restore_command = 'cp /staging/9.1/main/pg_xlog/%f "%p"'

/staging/9.1/main/pg_xlog is really archive directory? restore_command
should be set to the command which copies the archived WAL files from
the archive area.

Regards,

--
Fujii Masao
--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Rick Dicaire

2012-09-26 01:56:21 UTC

Permalink

Post by Fujii Masao
/staging/9.1/main/pg_xlog is really archive directory? restore_command
should be set to the command which copies the archived WAL files from
the archive area.

Yes. This is the archive directory.

This begs another question, does a replicant write its own wal files to disk?

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Albe Laurenz

2012-09-26 09:21:08 UTC

Permalink

Post by Rick Dicaire

Post by Fujii Masao
/staging/9.1/main/pg_xlog is really archive directory?

restore_command

Post by Rick Dicaire

Post by Fujii Masao
should be set to the command which copies the archived WAL files from
the archive area.

Yes. This is the archive directory.

But according to what you wrote before you start the server
with -D /staging/9.1/main.
That means that /staging/9.1/main/pg_xlog contains the active
WAL files, right?

So the restore_command 'cp /staging/9.1/main/pg_xlog/%f "%p"'
would just try to cp the files onto themselves, right?

The WAL archive directory must be different from the
active WAL directory. Check if that's your problem!

Yours,
Laurenz Albe

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Rick Dicaire

2012-09-26 11:35:13 UTC

Permalink

Post by Albe Laurenz
But according to what you wrote before you start the server
with -D /staging/9.1/main.
That means that /staging/9.1/main/pg_xlog contains the active
WAL files, right?
So the restore_command 'cp /staging/9.1/main/pg_xlog/%f "%p"'
would just try to cp the files onto themselves, right?
The WAL archive directory must be different from the
active WAL directory. Check if that's your problem!

/staging/9.1/main/pg_xlog/ was a symlink to /staging/9.1/pg_xlog/,
removed the link, created
the directory, restarted postgres:

2012-09-26 06:29:20.094 CDT,,,13904,,5062e710.3650,1,,2012-09-26
06:29:20 CDT,,0,LOG,00000,"database system was interrupted while in
recovery at log time 2012-09-25 09:33:28 CDT",,"If this has occurred
more than once some data might be corrupted and you might need to
choose an earlier recovery target.",,,,,,,""
2012-09-26 06:29:20.094 CDT,,,13904,,5062e710.3650,2,,2012-09-26
06:29:20 CDT,,0,LOG,00000,"creating missing WAL directory
""pg_xlog/archive_status""",,,,,,,,,""
2012-09-26 06:29:20.117 CDT,,,13904,,5062e710.3650,3,,2012-09-26
06:29:20 CDT,,0,LOG,00000,"starting archive recovery",,,,,,,,,""
2012-09-26 06:29:20.361 CDT,,,13904,,5062e710.3650,4,,2012-09-26
06:29:20 CDT,,0,LOG,00000,"restored log file
""000000010000074D000000EE"" from archive",,,,,,,,,""
2012-09-26 06:29:20.498 CDT,,,13904,,5062e710.3650,5,,2012-09-26
06:29:20 CDT,,0,LOG,00000,"restored log file
""000000010000074D000000E2"" from archive",,,,,,,,,""
2012-09-26 06:29:20.498 CDT,,,13904,,5062e710.3650,6,,2012-09-26
06:29:20 CDT,,0,PANIC,XX000,"unexpected pageaddr 74D/7C000000 in log
file 1869, segment 226, offset 0",,,,,,,,,""
2012-09-26 06:29:20.499 CDT,,,13902,,5062e70f.364e,1,,2012-09-26
06:29:19 CDT,,0,LOG,00000,"startup process (PID 13904) was terminated
by signal 6: Aborted",,,,,,,,,""
2012-09-26 06:29:20.499 CDT,,,13902,,5062e70f.364e,2,,2012-09-26
06:29:19 CDT,,0,LOG,00000,"terminating any other active server
processes",,,,,,,,,""

Two WAL files are read in, choking on the third. Is it a corrupt WAL file?

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin