Discussion:
Database corruption event, unlockable rows, possibly bogus virtual xids? (-1/4444444444)
(too old to reply)
Albe Laurenz
2013-02-22 08:15:50 UTC
Permalink
I'm doing a postmortem on a corruption event we had. I have an idea on what happened, but not sure.
I figure I'd share what happened and see if I'm close to right here.
Event: Running 9.1.6 with hot-standby, archiving 4 months of wal files, and even a nightly pg_dump
all. 50G database. Trying to update or delete a row in a small (21 row, but heavily used table) would
lock up completely. Never finish. Removed all clients, restarted the db instance, no joy. Check
pg_stat_activity, and nothing that wasn't idle.... run the delete, locked up.
Ran (SELECT*FROM pg_class JOIN pg_locks ON pg_locks.relation = pg_class.oid;) with no clients
touching this row, fresh restart of the db, and saw virtualtransactions against this same table where
the values would be -1/nnnnn were nnnnn was a huge number. Turns out we had about 159 entries from
different tables in the database. Checked hot-standby and, of course, no locks or anything. Switched
to hot-standby.
Hot-standby instantly gained these locks, Also noticed that 2 prepared transactions migrated to the
hot-standby. Binary upgraded to 9.1.8, locks still existed. Ended up executing the one thing we knew
would work. Take the site down, pg_dumpall to fresh instance. Everything is fine.
A little more background: We were running 9.1.4 back when 9.1.6 came out. We saw there was possible
corruption issues and did a binary upgrade and reindexing. Everything seemed to be fine, but I wonder
if we really had problems back then. We rebuilt the hot-standby after the binary upgrade via normal
restore and wal-file replays. I should also note that this row that had the lock on it that would not
go away, was created by an app server that was killed (via -9) since it was non-responsive, and the
row 'create date' (in db and logs) is the exact time the app server was killed.
I was wondering if a) these virtualtransactions that start with '-1/' indicate a problem, b) if this
could have happened from pre 9.1.6 corruption that was fixed in 9.1.6. Or, could this have occurred
when we killed that app server? Or.... am I looking in the wrong place.
I do still have the old data directories so I can start them up and check out the dataset. Any advice?
What do you get for
SELECT * FROM pg_prepared_xacts;

Have you tried to ROLLBACK PREPARED those prepared transactions?

The "-1" is quite alright:

BEGIN;
INSERT INTO addr VALUES (10, '127.0.0.1');
PREPARE TRANSACTION 'test';

test=> SELECT * FROM pg_prepared_xacts;
transaction | gid | prepared | owner | database
-------------+------+-------------------------------+---------+----------
2275 | test | 2013-02-22 09:09:04.627137+01 | laurenz | test

test=> SELECT database, relation, transactionid, virtualtransaction, mode FROM pg_locks;
database | relation | transactionid | virtualtransaction | mode
----------+----------+---------------+--------------------+------------------
16421 | 11069 | | 2/15 | AccessShareLock
| | | 2/15 | ExclusiveLock
| | 2275 | -1/2275 | ExclusiveLock
16421 | 24711 | | -1/2275 | RowExclusiveLock
(4 rows)

The last two rows are the prepared transaction.

ROLLBACK PREPARED 'test';

Gets rid of them.

Yours,
Laurenz Albe
--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin
Ned Wolpert
2013-02-22 16:59:27 UTC
Permalink
As it turns out, it was prepared transactions at fault. I restored my test
bed, forced a rollback and the locks were gone. At least now I can connect
those dots. :-)
I'm doing a postmortem on a corruption event we had. I have an idea on
what happened, but not sure.
I figure I'd share what happened and see if I'm close to right here.
Event: Running 9.1.6 with hot-standby, archiving 4 months of wal
files, and even a nightly pg_dump
all. 50G database. Trying to update or delete a row in a small (21 row,
but heavily used table) would
lock up completely. Never finish. Removed all clients, restarted the db
instance, no joy. Check
pg_stat_activity, and nothing that wasn't idle.... run the delete,
locked up.
Ran (SELECT*FROM pg_class JOIN pg_locks ON pg_locks.relation =
pg_class.oid;) with no clients
touching this row, fresh restart of the db, and saw virtualtransactions
against this same table where
the values would be -1/nnnnn were nnnnn was a huge number. Turns out we
had about 159 entries from
different tables in the database. Checked hot-standby and, of course,
no locks or anything. Switched
to hot-standby.
Hot-standby instantly gained these locks, Also noticed that 2 prepared
transactions migrated to the
hot-standby. Binary upgraded to 9.1.8, locks still existed. Ended up
executing the one thing we knew
would work. Take the site down, pg_dumpall to fresh instance. Everything
is fine.
A little more background: We were running 9.1.4 back when 9.1.6 came
out. We saw there was possible
corruption issues and did a binary upgrade and reindexing. Everything
seemed to be fine, but I wonder
if we really had problems back then. We rebuilt the hot-standby after
the binary upgrade via normal
restore and wal-file replays. I should also note that this row that had
the lock on it that would not
go away, was created by an app server that was killed (via -9) since it
was non-responsive, and the
row 'create date' (in db and logs) is the exact time the app server was
killed.
I was wondering if a) these virtualtransactions that start with '-1/'
indicate a problem, b) if this
could have happened from pre 9.1.6 corruption that was fixed in 9.1.6.
Or, could this have occurred
when we killed that app server? Or.... am I looking in the wrong place.
I do still have the old data directories so I can start them up and
check out the dataset. Any advice?
What do you get for
SELECT * FROM pg_prepared_xacts;
Have you tried to ROLLBACK PREPARED those prepared transactions?
BEGIN;
INSERT INTO addr VALUES (10, '127.0.0.1');
PREPARE TRANSACTION 'test';
test=> SELECT * FROM pg_prepared_xacts;
transaction | gid | prepared | owner | database
-------------+------+-------------------------------+---------+----------
2275 | test | 2013-02-22 09:09:04.627137+01 | laurenz | test
test=> SELECT database, relation, transactionid, virtualtransaction, mode FROM pg_locks;
database | relation | transactionid | virtualtransaction | mode
----------+----------+---------------+--------------------+------------------
16421 | 11069 | | 2/15 | AccessShareLock
| | | 2/15 | ExclusiveLock
| | 2275 | -1/2275 | ExclusiveLock
16421 | 24711 | | -1/2275 |
RowExclusiveLock
(4 rows)
The last two rows are the prepared transaction.
ROLLBACK PREPARED 'test';
Gets rid of them.
Yours,
Laurenz Albe
--
Virtually, Ned Wolpert

"Settle thy studies, Faustus, and begin..." --Marlowe
Loading...