[ADMIN] How to deal with corrupted database?

Discussion:

(too old to reply)

Ruslan A. Bondar

2011-11-09 11:02:25 UTC

Hello all.

This is a first time I receive such an issue.
My database was corrupted some way.
When I'm trying to access the database via psql I receive:

***@udb:/etc/bacula# psql -U bacula
psql: FATAL: could not read block 0 in file "base/16387/86057840": read only 0 of 8192 bytes

So I want to drop it, and recover from backup. But when I'm trying to drop the database I see:

postgres=# drop database bacula;
ERROR: could not access status of transaction 15892843
DETAIL: Could not open file "pg_subtrans/00F2": No such file or directory.

Is there any way to recover the database to working state, or drop it?

OS: Linux Debian Sid
postgres version: 9.0.4
uname: Linux udb 2.6.32-5-xen-686 #1 SMP Tue Oct 19 17:09:04 UTC 2010 i686 GNU/Linux

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Craig Ringer

2011-11-09 13:04:47 UTC

Permalink

Post by Ruslan A. Bondar
Hello all.
This is a first time I receive such an issue.
My database was corrupted some way.

Before you do ANYTHING else, make a copy of your database somewhere
safe. See:

http://wiki.postgresql.org/wiki/Corruption

Post by Ruslan A. Bondar
psql: FATAL: could not read block 0 in file "base/16387/86057840": read only 0 of 8192 bytes
postgres=# drop database bacula;
ERROR: could not access status of transaction 15892843
DETAIL: Could not open file "pg_subtrans/00F2": No such file or directory.
Is there any way to recover the database to working state, or drop it?

*ONLY* once you've made a full backup copy, you may be able to set
zero_damaged_pages to get a usable dump.

Do you know what caused this? The missing files suggest it was probably
file system corruption - was there a disk failure? fsck run with errors?
Unexpected reboot on a RAID controller with a dead backup battery?

--
Craig Ringer

Ruslan A. Bondar

2011-11-09 13:37:44 UTC

Permalink

There were no unexpected reboots.
First issue was some kind of deadlock (concurrent insert and concurrent delete on a table) I saw them wile reindexing the database.
Also mesages like this were in dmesg:
[3681001.529179] INFO: task postgres:12432 blocked for more than 120 seconds.
[3681001.529191] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[3681001.529205] postgres D ed8c0e98 0 12432 740 0x00000000
[3681001.529225] ec83f700 00000286 ec361080 ed8c0e98 c1f8bee0 c145ae20 c145ae20 c1456354
[3681001.529262] ec83f8bc c3e54e20 00000000 00000000 00000000 00000001 ffffffff ec361080
[3681001.529312] c3e50354 ec83f8bc 36d8262f 00000000 ec83f700 00000000 00000000 00000000
[3681001.529369] Call Trace:
[3681001.529385] [<c128d717>] ? __mutex_lock_common+0xe8/0x13b
[3681001.529401] [<c128d779>] ? __mutex_lock_slowpath+0xf/0x11
[3681001.529416] [<c128d80a>] ? mutex_lock+0x17/0x24
[3681001.529429] [<c128d80a>] ? mutex_lock+0x17/0x24
[3681001.529444] [<c10bc2a3>] ? generic_file_llseek+0x17/0x44
[3681001.529458] [<c10bc28c>] ? generic_file_llseek+0x0/0x44
[3681001.529473] [<c10bb145>] ? vfs_llseek+0x30/0x34
[3681001.529487] [<c10bc1a1>] ? sys_llseek+0x3a/0x7a
[3681001.529501] [<c1008efc>] ? syscall_call+0x7/0xb

So I've stopped software caused these inserts and deletes, but reindexing shows same warnings. I've restarted postgresql server. Postgres restarted successfully, but the database became unaccessible. Filesystem is clean. File base/16387/86057840 exists but is zero length. File pg_subtrans/00F2 does not exists.

Also after several restarts postgres can't start. Messages are:

2011-11-09 16:25:04 MSK LOG: database system shutdown was interrupted; last known up at 2011-11-09 14:13:38 MSK
2011-11-09 16:25:04 MSK LOG: database system was not properly shut down; automatic recovery in progress
2011-11-09 16:25:04 MSK FATAL: the database system is starting up
2011-11-09 16:25:04 MSK LOG: consistent recovery state reached at 171/19BE8060
2011-11-09 16:25:04 MSK LOG: redo starts at 171/19BE8060
2011-11-09 16:25:04 MSK LOG: incomplete startup packet
2011-11-09 16:25:04 MSK LOG: record with zero length at 171/19C26010
2011-11-09 16:25:04 MSK LOG: redo done at 171/19C25FB4
2011-11-09 16:25:04 MSK LOG: last completed transaction was at log time 2011-11-09 13:05:20.105323+03
2011-11-09 16:25:04 MSK FATAL: xlog flush request 171/1B1374E0 is not satisfied --- flushed only to 171/19C26010
2011-11-09 16:25:04 MSK CONTEXT: writing block 0 of relation base/16385/86064815_vm
2011-11-09 16:25:04 MSK LOG: startup process (PID 3570) exited with exit code 1
2011-11-09 16:25:04 MSK LOG: aborting startup due to startup process failure

This database isn't mission critical, so if you want - I can experiment on this.

On Wed, 09 Nov 2011 21:04:47 +0800

Post by Craig Ringer

Post by Ruslan A. Bondar
Hello all.
This is a first time I receive such an issue.
My database was corrupted some way.

Before you do ANYTHING else, make a copy of your database somewhere
http://wiki.postgresql.org/wiki/Corruption

Post by Ruslan A. Bondar
psql: FATAL: could not read block 0 in file
"base/16387/86057840": read only 0 of 8192 bytes
So I want to drop it, and recover from backup. But when I'm trying
postgres=# drop database bacula;
ERROR: could not access status of transaction 15892843
DETAIL: Could not open file "pg_subtrans/00F2": No such file or directory.
Is there any way to recover the database to working state, or drop it?

*ONLY* once you've made a full backup copy, you may be able to set
zero_damaged_pages to get a usable dump.
Do you know what caused this? The missing files suggest it was
probably file system corruption - was there a disk failure? fsck run
with errors? Unexpected reboot on a RAID controller with a dead
backup battery?
--
Craig Ringer

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Craig Ringer

2011-11-10 02:36:55 UTC

Permalink

Post by Ruslan A. Bondar
This database isn't mission critical, so if you want - I can

experiment on this.

Brilliant! It's rare for people to be able to investigate issues like
this, most of the time they just have to get up and running ASAP and
often destroy data required to usefully investigate in the process.

Post by Ruslan A. Bondar
First issue was some kind of deadlock (concurrent insert and concurrent delete on a table) I saw them wile reindexing the database.
[3681001.529385] [<c128d717>] ? __mutex_lock_common+0xe8/0x13b
[3681001.529401] [<c128d779>] ? __mutex_lock_slowpath+0xf/0x11
[3681001.529416] [<c128d80a>] ? mutex_lock+0x17/0x24
[3681001.529429] [<c128d80a>] ? mutex_lock+0x17/0x24
[3681001.529444] [<c10bc2a3>] ? generic_file_llseek+0x17/0x44
[3681001.529458] [<c10bc28c>] ? generic_file_llseek+0x0/0x44
[3681001.529473] [<c10bb145>] ? vfs_llseek+0x30/0x34
[3681001.529487] [<c10bc1a1>] ? sys_llseek+0x3a/0x7a
[3681001.529501] [<c1008efc>] ? syscall_call+0x7/0xb

OK, so it was hung waiting for a lock within the VFS layer. That's
rather odd at best.

Your system details are:

OS: Linux Debian Sid
postgres version: 9.0.4
uname: Linux udb 2.6.32-5-xen-686 #1 SMP Tue Oct 19 17:09:04 UTC 2010
i686 GNU/Linux

so you're running a Xen kernel (guest? Or host? I'm assuming guest) on a
very fresh kernel on a testing version of Debian. I won't be entirely
shocked if this is a kernel issue.

I don't see any obvious deadlock reports on a search for

"vfs_llseek" OR "generic_file_llseek" deadlock

but

"vfs_llseek" OR "generic_file_llseek" mutex_lock

finds:

https://bugzilla.redhat.com/show_bug.cgi?id=716991 (unrelated?)

and more interestingly:

http://postgresql.1045698.n5.nabble.com/Load-Increase-td4269457.html

... which unfortunately doesn't give any OS/kernel info, but is another
recent report.

There have been some recent changes in file system locking:

http://lwn.net/Articles/448038/

so I'd appreciate it if you could pass this report on to the kernel
folks involved in case they want to investigate further.

Post by Ruslan A. Bondar
So I've stopped software caused these inserts and deletes, but reindexing shows same warnings. I've restarted postgresql server.

How did you restart PostgreSQL?

If there were backends hung in the vfs, did the eventually terminate by
themselves? If not, did you terminate them yourself? How? With a signal
(kill)? Which signal? Some other way?

Post by Ruslan A. Bondar
Postgres restarted successfully, but the database became unaccessible. Filesystem is clean. File base/16387/86057840 exists but is zero length. File pg_subtrans/00F2 does not exists.

Hm, ok. I'm a bit suspicious of the deadlock in the kernel. It isn't
necessarily a kernel issue, but given the system you're running on I
won't be too surprised if it is either. There's a fairly good chance the
trigger for this was a kernel issue munching your data.

Are you able to reproduce this issue with another instance of PostgreSQL
running with a freshly created database cluster (initdb) ?

Post by Ruslan A. Bondar
2011-11-09 16:25:04 MSK FATAL: xlog flush request 171/1B1374E0 is not satisfied --- flushed only to 171/19C26010
2011-11-09 16:25:04 MSK CONTEXT: writing block 0 of relation base/16385/86064815_vm
2011-11-09 16:25:04 MSK LOG: startup process (PID 3570) exited with exit code 1
2011-11-09 16:25:04 MSK LOG: aborting startup due to startup process failure

I don't know enough about Pg's guts to suggest how to proceed from here.
Maybe a pg_resetxlog might get you up and running (albeit with potential
data damage) but I'm not sure.

--
Craig Ringer

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Ruslan A. Bondar

2011-11-10 05:53:31 UTC

Permalink

Post by Craig Ringer

Post by Ruslan A. Bondar
So I've stopped software caused these inserts and deletes, but
reindexing shows same warnings. I've restarted postgresql server.

How did you restart PostgreSQL?
If there were backends hung in the vfs, did the eventually terminate
by themselves? If not, did you terminate them yourself? How? With a
signal (kill)? Which signal? Some other way?

Just /etc/init.d/postgresql-9.0 restart
As far as I know it uses "smart" shutdown on debian.

Post by Craig Ringer

Post by Ruslan A. Bondar
Postgres restarted successfully, but the database became
unaccessible. Filesystem is clean. File base/16387/86057840 exists
but is zero length. File pg_subtrans/00F2 does not exists.

Hm, ok. I'm a bit suspicious of the deadlock in the kernel. It isn't
necessarily a kernel issue, but given the system you're running on I
won't be too surprised if it is either. There's a fairly good chance
the trigger for this was a kernel issue munching your data.
Are you able to reproduce this issue with another instance of
PostgreSQL running with a freshly created database cluster (initdb) ?

Ok. I'll reinit clean cluster, restore databases from backups and will try to reproduce the issue.
I'll have nearly the same configuration and data like before crash. Hope the issue will appear.
Also I've copied old, corrupted, database for the case issue will not appear in a week.

P.S. Yes, the host is xen guest and postgres works under a heavy load.
But at a hardware level it uses 4 individual HDD's in a mirror for storing it's data.
The filesystem operations should be fast enough.

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Kevin Grittner

2011-11-09 15:21:18 UTC

Permalink

Post by Ruslan A. Bondar
ERROR: could not access status of transaction 15892843
DETAIL: Could not open file "pg_subtrans/00F2": No such file or directory.
postgres version: 9.0.4

Did you move to 9.0.x using pg_upgrade version 9.0.3 or earlier?

http://wiki.postgresql.org/wiki/20110408pg_upgrade_fix

-Kevin

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Ruslan A. Bondar

2011-11-09 15:53:50 UTC

Permalink

No, this was a clean install of 9.0.x.
Maybe It was upgraded from 9.0.3 to 9.0.4. I'm not sure.

On Wed, 09 Nov 2011 09:21:18 -0600

Post by Ruslan A. Bondar

Post by Ruslan A. Bondar
ERROR: could not access status of transaction 15892843
DETAIL: Could not open file "pg_subtrans/00F2": No such file or

directory.

Post by Ruslan A. Bondar
postgres version: 9.0.4

Did you move to 9.0.x using pg_upgrade version 9.0.3 or earlier?
http://wiki.postgresql.org/wiki/20110408pg_upgrade_fix
-Kevin

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Kevin Grittner

2011-11-09 15:57:37 UTC

Permalink

Post by Ruslan A. Bondar
No, this was a clean install of 9.0.x.

Then it isn't related to the issue I raised.

Post by Ruslan A. Bondar
Maybe It was upgraded from 9.0.3 to 9.0.4. I'm not sure.

You don't normally use pg_upgrade for that; this was a pg_upgrade bug.

-Kevin

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin