Possible database corruption

Discussion:

Possible database corruption - urgent

Benjamin Krajmalnik

2013-01-07 21:22:25 UTC

I have a situation where pg_xlog started growing until it filled up the
disk drive.

I got alerted to the error and started investigating.

Checked the logs and I am seeing the following entry repeatedly:

2013-01-07 01:49:12 GMT ERROR: could not open file
"base/16748/181979366_fsm": No such file or directory

2013-01-07 01:49:12 GMT CONTEXT: writing block 1 of relation
base/16748/181979366_fsm

2013-01-07 01:49:12 GMT WARNING: could not write block 1 of
base/16748/181979366_fsm

I checked the actual file system, and that file is indeed missing.
181979366 exists.

Is there a way to get the system back up and running?

I stopped the postmaster and am moving the pg_xlog directory to a
partition that has room left on it, but I still need to resolve this
missing-file problem.
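
As a rough illustration of the file-system check described above, the following Python sketch reports which forks of a relation are present under base/<database OID>/. The function name fork_status and the example data directory are hypothetical; the suffixes follow PostgreSQL's on-disk naming (_fsm for the free space map, _vm for the visibility map). A missing _fsm file on its own is not proof of corruption, since small or never-vacuumed relations may legitimately have no such fork.

```python
from pathlib import Path

# Fork name -> file name suffix appended to the relfilenode.
FORK_SUFFIXES = {"main": "", "fsm": "_fsm", "vm": "_vm"}


def fork_status(data_dir, db_oid, relfilenode):
    """Return {fork name: True/False} for files under base/<db_oid>/."""
    base = Path(data_dir) / "base" / str(db_oid)
    return {
        name: (base / f"{relfilenode}{suffix}").is_file()
        for name, suffix in FORK_SUFFIXES.items()
    }


# Hypothetical usage (the data directory path is an assumption):
#   fork_status("/usr/local/pgsql/data", 16748, 181979366)
```

This only inspects the first segment of each fork and does not touch the files, so it should be safe to run against a stopped cluster.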

Benjamin Krajmalnik

2013-01-07 21:31:29 UTC

I forgot to mention - PostgreSQL 9.0 - my apologies.

Can I just recreate the file using touch so it exists and then restart
PostgreSQL?

The system core dumped and was attempting to go into recovery mode:

2013-01-07 01:49:12 GMT ERROR: could not open file
"base/16748/181979366_fsm": No such file or directory

2013-01-07 01:49:12 GMT CONTEXT: writing block 1 of relation
base/16748/181979366_fsm

2013-01-07 01:49:12 GMT WARNING: could not write block 1 of
base/16748/181979366_fsm

.

.

.

2013-01-07 01:49:12 GMT ERROR: could not open file
"base/16748/181979366_fsm": No such file or directory

2013-01-07 01:49:12 GMT CONTEXT: writing block 1 of relation
base/16748/181979366_fsm

2013-01-07 01:49:12 GMT WARNING: could not write block 1 of
base/16748/181979366_fsm

Benjamin Krajmalnik

2013-01-07 21:35:48 UTC

Sorry for the cut and paste error.

This is the log entry when the pg_xlog partition ran out of space:

2013-01-07 20:50:22 GMT [local]PANIC: could not write to file
"pg_xlog/xlogtemp.49680": No space left on device

2013-01-07 20:50:22 GMT [local]STATEMENT: INSERT INTO tbltmptests
(testhash, testtime, statusid, replytxt, replyval, groupid) V

2013-01-07 20:50:23 GMT LOG: server process (PID 49680) was terminated
by signal 6: Abort trap

2013-01-07 20:50:23 GMT LOG: terminating any other active server
processes

2013-01-07 20:50:23 GMT [local]WARNING: terminating connection because
of crash of another server process

2013-01-07 20:50:23 GMT [local]DETAIL: The postmaster has commanded
this server process to roll back the current transaction an

2013-01-07 20:50:23 GMT [local]HINT: In a moment you should be able to
reconnect to the database and repeat your command.

.

.

.

2013-01-07 20:50:23 GMT [local]FATAL: the database system is in
recovery mode

2013-01-07 20:50:23 GMT LOG: all server processes terminated;
reinitializing

2013-01-07 20:50:24 GMT LOG: database system was interrupted; last
known up at 2013-01-07 00:31:02 GMT

2013-01-07 20:50:24 GMT LOG: database system was not properly shut
down; automatic recovery in progress

2013-01-07 20:50:24 GMT LOG: consistent recovery state reached at
52F/8CE57490

2013-01-07 20:50:24 GMT LOG: redo starts at 52F/7BABC118

2013-01-07 20:50:38 GMT [local]FATAL: the database system is in
recovery mode

2013-01-07 20:50:53 GMT [local]FATAL: the database system is in
recovery mode

2013-01-07 20:51:08 GMT [local]FATAL: the database system is in
recovery mode

2013-01-07 20:51:24 GMT [local]FATAL: the database system is in
recovery mode

2013-01-07 20:51:39 GMT [local]FATAL: the database system is in
recovery mode

2013-01-07 20:51:54 GMT [local]FATAL: the database system is in
recovery mode
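
To see at a glance when each kind of message first appeared, entries like those above can be tallied by severity. This is a minimal sketch that assumes the prefix shape seen in this thread (timestamp, time zone, an optional [host], then the severity); other log_line_prefix settings would need a different pattern. The names LINE_RE and summarize are made up for illustration.

```python
import re
from collections import Counter

# Assumed line shape: "2013-01-07 20:50:22 GMT [local]PANIC: message"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<tz>\S+) "
    r"(?:\[(?P<host>[^\]]*)\])?\s*"
    r"(?P<level>PANIC|FATAL|ERROR|WARNING|LOG|DETAIL|HINT|CONTEXT|STATEMENT):"
    r"\s*(?P<msg>.*)$"
)


def summarize(lines):
    """Count entries per severity and record the first timestamp of each."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # wrapped continuation line or unrelated text
        level = match.group("level")
        counts[level] += 1
        first_seen.setdefault(level, match.group("ts"))
    return counts, first_seen
```

Run over the whole server log, the first_seen mapping would show how long the ERROR entries had been recurring before the PANIC.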

Walter Hurry

2013-01-07 22:21:36 UTC

On Mon, 07 Jan 2013 14:35:48 -0700, Benjamin Krajmalnik wrote:

<snip>

What do you think you will gain by adding "urgent" to your subject line?
What do you think you will gain by posting in HTML?

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Craig Ringer

2013-01-08 02:23:56 UTC

Post by Benjamin Krajmalnik
I have a situation where pg_xlog started growing until it filled up
the disk drive.

This should not ever cause corruption. If it has, there's a bug at work.

A crash is reasonable (albeit undesirable; it'd be better to just report
errors on connections) - but database corruption is not.

Before doing ANYTHING else, read
http://wiki.postgresql.org/wiki/Corruption and act on it.

How big is the DB?

What file system is it on?

PostgreSQL 9.0.[what?] ?

Host OS?

Disk subsystem?

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Benjamin Krajmalnik

2013-01-08 02:30:57 UTC

Thanks for the reply - I posted an update that I had resolved the issue.

When the partition with the WAL files filled up due to the missing fsm
file (I wonder what caused that), the db panicked.

After moving all 43GB of WAL files to a different partition, the
database went into recovery mode, and after about half an hour of
processing the WAL files the server came back online.

The only thing still pending is for the system to clean out all of the
now-unused WAL files.

Once this is done, I will move the WAL files back to their own spindle.

Since the database would not restart until the WAL files were moved, I
feared data corruption - which thankfully did not occur.

The DB is Postgres 9.0.4 running on FreeBSD 8.1/amd64. The disk
subsystem is dual RAID-1 SAS, with OS/WAL on one set of spindles and
data on the other.
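
Before relocating a WAL directory as described above, it can help to confirm that the target partition has room for it. The sketch below is only an illustration: dir_size_bytes, has_room and the 10% headroom are assumptions rather than anything PostgreSQL provides, and the server should be stopped before any files are moved.

```python
import os
import shutil


def dir_size_bytes(path):
    """Sum the sizes of regular files directly inside path (not recursive)."""
    total = 0
    for entry in os.scandir(path):
        if entry.is_file(follow_symlinks=False):
            total += entry.stat(follow_symlinks=False).st_size
    return total


def has_room(target_dir, needed_bytes, headroom=0.10):
    """True if target_dir's file system has needed_bytes plus headroom free."""
    free = shutil.disk_usage(target_dir).free
    return free >= needed_bytes * (1 + headroom)


# Hypothetical usage (both paths are assumptions):
#   needed = dir_size_bytes("/usr/local/pgsql/data/pg_xlog")
#   has_room("/mnt/spare", needed)
```

The size count skips subdirectories, so it slightly underestimates a directory that also holds an archive_status folder; the headroom is meant to absorb that.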
