[ADMIN] Help! PostgreSQL stuck at starting up after crash

Discussion:

(too old to reply)

Samuel Hwang

2012-01-19 00:33:37 UTC

version Postgresql 9.1.1 on centos5 x64

We experience slow performance and found the server is running 3 vacuum
process on the same db which use up 99% of CPU.
Then we kill -9 one of those process which cause postgresql to crash and it
tried to restart after the crash
However when the starting process reach the last WAL files, it just stuck
there

pg_controldata shows the db is in Archive Recovery mode and when using psql
to connect the db, it says FATAL: the database system starting up.

I took a chance and upgrade to PostgreSql 9.1.2 and see if anything changed
it still stuck at the end of recovery.
pg_controldata shows db is in Crash recovery, but that probably different
wording I think
using psql to connect the db, it says FATAL: the database system is
starting up.

I pretty much run out of idea here.
Can anyone help what to go from here?

Samuel

Samuel Hwang

2012-01-19 16:56:29 UTC

Permalink

Thanks, Rick.

I think we learned the lesson not to do kill -9. One weird thing was that
the three stuck automatically vacuum processes had been running for
probably more than a day and they were all working on the same pretty
static database. I doubt there were something wrong with our db. However it
would be nice postgtrsql would be tolerant to that.

The symptoms we experienced was postgresql was extremely slow and failed
all the db tests. The slowness came from the 3 vacuum process which use up
all the cpu.

Apparently killing is not the right solution. Is a proper/safer way to
recover the performance of postgresql?

Post by Samuel Hwang
We experience slow performance and found the server is running 3 vacuum
process on the same db which use up 99% of CPU.
Then we kill -9 one of those process which cause postgresql to crash and

Post by Samuel Hwang
tried to restart after the crash

While I cannot help with restarting, you shouldn't ever use kill -9
unless no other kill signal works.
kill default signal is SIGTERM (15), refer to man 7 signal (in linux).
vacuum processes can be set on a per table basis, perhaps you need to
closely examine that part of configuration.
Sorry I can't help further.
--
http://www.ardynet.com
http://www.ardynet.com:9000/ardymusic.ogg.m3u

--
*Shian-Miin Samuel Hwang* | Software Developer | Phone 1-403-2626519 (ext.
276) | Fax 1-403-233-8046

*Replicon* | Hassle-Free Time & Expense Management Software - 7,300
Customers - 70 Countries
www.replicon.com | facebook <http://www.facebook.com/Replicon.inc> |
twitter<http://twitter.com/Replicon>|
blog <http://www.replicon.com/blog/> | contact
us<http://www.replicon.com/about_replicon/contact_us.aspx>

*We are hiring!* | search
jobs<http://tbe.taleo.net/NA2/ats/careers/searchResults.jsp?org=REPLICON&cws=1&act=sort&sortColumn=1&__utma=1.651918544.1299001662.1299170819.1299174966.10&__utmb=1.8.10.1299174966&__utmc=1&__utmx=-&__utmz=1.1299174985.10.3.utmcsr=google%7Cutmccn=%28organic%29%7Cutmcmd=organic%7Cutmctr=replicon%20careers&__utmv=1.%7C3=Visitor%20Type=Prospects=1,&__utmk=40578466>

David Hornsby

2012-01-19 14:30:56 UTC

Permalink

Sounds like you have a corrupt wal files that you will have to reset the
wal logs with pgresetxlog.

http://www.postgresql.org/docs/8.2/static/app-pgresetxlog.html

This will result in missing transactions so before you do this shutdown
postgres and make a copy of the database files first. That way if you
don't like what happens you can always go back to the way things were.

Also right now would be a good time to evaluate your backup strategy,
which is a different topic for a different thread, but I can certainly
help with that as well.

-David Hornsby

Post by Samuel Hwang
version Postgresql 9.1.1 on centos5 x64
We experience slow performance and found the server is running 3 vacuum
process on the same db which use up 99% of CPU.
Then we kill -9 one of those process which cause postgresql to crash and it
tried to restart after the crash
However when the starting process reach the last WAL files, it just stuck
there
pg_controldata shows the db is in Archive Recovery mode and when using psql
to connect the db, it says FATAL: the database system starting up.
I took a chance and upgrade to PostgreSql 9.1.2 and see if anything changed
it still stuck at the end of recovery.
pg_controldata shows db is in Crash recovery, but that probably different
wording I think
using psql to connect the db, it says FATAL: the database system is
starting up.
I pretty much run out of idea here.
Can anyone help what to go from here?
Samuel

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Samuel Hwang

2012-01-19 17:09:48 UTC

Permalink

I don't know how to make sure if WAL logs corrupted.
At the end of the recovery in postgresql log I saw

2012-01-18 18:30:58.570 MST 3666 - LOG: consistent recovery state
reached at 56C/CD0AFE00
2012-01-18 18:30:58.587 MST 3666 - LOG: recovery stopping before abort
of transaction 541802043, time 2012-01-18 12:50:08.531615-07
2012-01-18 18:30:58.587 MST 3666 - LOG: redo done at 56C/CD226C58
2012-01-18 18:30:58.587 MST 3666 - LOG: last completed transaction was
at log time 2012-01-18 12:49:28.321605-07
2012-01-18 18:30:58.589 MST 3666 - LOG: selected new timeline ID: 2
2012-01-18 18:30:59.187 MST 3666 - LOG: archive recovery complete

just nothing happened after that and postgresql is stuck at starting up and
not getting out of archive recovery mode.
at that time there is no cpu/disk activities and it seemed like it's
waiting for something?

Fortunately this is a development/test database and we don't have any
backup plan on it as data loss is not a big issue.
In production environment we do set up stream replication and on going
backup for both db and WAL logs.
However, this do raise some worries, especially my impression is that
postgresql shall protect the data pretty well.

I will give it a try on pg_resetxlog, Thanks for the heads-up.

Post by David Hornsby
Sounds like you have a corrupt wal files that you will have to reset the
wal logs with pgresetxlog.
http://www.postgresql.org/docs/8.2/static/app-pgresetxlog.html
This will result in missing transactions so before you do this shutdown
postgres and make a copy of the database files first. That way if you
don't like what happens you can always go back to the way things were.
Also right now would be a good time to evaluate your backup strategy,
which is a different topic for a different thread, but I can certainly
help with that as well.
-David Hornsby

Kevin Grittner

2012-01-19 17:16:39 UTC

Permalink

Post by Samuel Hwang
I don't know how to make sure if WAL logs corrupted.
At the end of the recovery in postgresql log I saw
2012-01-18 18:30:58.570 MST 3666 - LOG: consistent recovery
state reached at 56C/CD0AFE00
2012-01-18 18:30:58.587 MST 3666 - LOG: recovery stopping
before abort of transaction 541802043, time 2012-01-18
12:50:08.531615-07
2012-01-18 18:30:58.587 MST 3666 - LOG: redo done at
56C/CD226C58
2012-01-18 18:30:58.587 MST 3666 - LOG: last completed
transaction was at log time 2012-01-18 12:49:28.321605-07
2012-01-18 18:30:58.589 MST 3666 - LOG: selected new timeline ID: 2
2012-01-18 18:30:59.187 MST 3666 - LOG: archive recovery
complete
just nothing happened after that and postgresql is stuck at
starting up and not getting out of archive recovery mode.\

What do you base that on? (copy/paste)

Post by Samuel Hwang
at that time there is no cpu/disk activities and it seemed like
it's waiting for something?

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Samuel Hwang

2012-01-19 22:43:38 UTC

Permalink

Post by Samuel Hwang
just nothing happened after that and postgresql is stuck at
starting up and not getting out of archive recovery mode.\

What do you base that on? (copy/paste)

==> I have waited for at least 2 hours, but the database is still in
starting up state. In the mean time, there is not cpu/disk activities.

Post by Samuel Hwang
at that time there is no cpu/disk activities and it seemed like
it's waiting for something?

That looks like normal recovery. I would expect it to be waiting
for clients to connect at that point.

What happens when you try to connect to the database after that
above has been logged? (Copy/paste the psql command line and any
errors, please.)

==> when I tried to use psql to connect the db, I got

FATAL: the database system is starting up

and it doesn't allow connection.

On Thu, Jan 19, 2012 at 10:16 AM, Kevin Grittner <

Post by Samuel Hwang

Post by Samuel Hwang
I don't know how to make sure if WAL logs corrupted.
At the end of the recovery in postgresql log I saw
2012-01-18 18:30:58.570 MST 3666 - LOG: consistent recovery
state reached at 56C/CD0AFE00
2012-01-18 18:30:58.587 MST 3666 - LOG: recovery stopping
before abort of transaction 541802043, time 2012-01-18
12:50:08.531615-07
2012-01-18 18:30:58.587 MST 3666 - LOG: redo done at
56C/CD226C58
2012-01-18 18:30:58.587 MST 3666 - LOG: last completed
transaction was at log time 2012-01-18 12:49:28.321605-07
2012-01-18 18:30:58.589 MST 3666 - LOG: selected new timeline ID: 2
2012-01-18 18:30:59.187 MST 3666 - LOG: archive recovery complete
just nothing happened after that and postgresql is stuck at
starting up and not getting out of archive recovery mode.\

What do you base that on? (copy/paste)

Post by Samuel Hwang
at that time there is no cpu/disk activities and it seemed like
it's waiting for something?

Samuel Hwang

2012-01-19 22:45:34 UTC

Permalink

pg_resetxlog does the trick and db can be started and readable.

I am dumping the data out and import to a newly created database cluster.

We pretty much lost the data for the last two days, but since our
postgresql were running well, it is fewer than it looks.

Thanks a lot for the help.

Samuel Hwang

2012-01-19 22:46:41 UTC

Permalink

correct typo.
We pretty much lost the data for the last two days, but since our
postgresql wereN'T running well, it is fewer than it looks.

Post by Samuel Hwang
pg_resetxlog does the trick and db can be started and readable.
I am dumping the data out and import to a newly created database cluster.
We pretty much lost the data for the last two days, but since our
postgresql were running well, it is fewer than it looks.
Thanks a lot for the help.

stuck

Post by Samuel Hwang
there
pg_controldata shows the db is in Archive Recovery mode and when using psql
to connect the db, it says FATAL: the database system starting up.
I took a chance and upgrade to PostgreSql 9.1.2 and see if anything changed
it still stuck at the end of recovery.
pg_controldata shows db is in Crash recovery, but that probably

different

Post by Samuel Hwang
wording I think
using psql to connect the db, it says FATAL: the database system is
starting up.
I pretty much run out of idea here.
Can anyone help what to go from here?
Samuel

--
*Shian-Miin Samuel Hwang* | Software Developer | Phone 1-403-2626519(ext. 276) | Fax
1-403-233-8046
*Replicon* | Hassle-Free Time & Expense Management Software - 7,300
Customers - 70 Countries
www.replicon.com | facebook <http://www.facebook.com/Replicon.inc> |
twitter <http://twitter.com/Replicon> | blog<http://www.replicon.com/blog/>| contact
us <http://www.replicon.com/about_replicon/contact_us.aspx>
*We are hiring!* | search jobs<http://tbe.taleo.net/NA2/ats/careers/searchResults.jsp?org=REPLICON&cws=1&act=sort&sortColumn=1&__utma=1.651918544.1299001662.1299170819.1299174966.10&__utmb=1.8.10.1299174966&__utmc=1&__utmx=-&__utmz=1.1299174985.10.3.utmcsr=google%7Cutmccn=%28organic%29%7Cutmcmd=organic%7Cutmctr=replicon%20careers&__utmv=1.%7C3=Visitor%20Type=Prospects=1,&__utmk=40578466>

David Hornsby

2012-01-19 22:50:20 UTC

Permalink

Glad to hear that you can spin up the database and get to your data.
Sorry to hear that you did end up losing data on that server.

-David Hornsby

Post by Samuel Hwang
correct typo.
We pretty much lost the data for the last two days, but since our
postgresql wereN'T running well, it is fewer than it looks.
pg_resetxlog does the trick and db can be started and readable.
I am dumping the data out and import to a newly created database cluster.
We pretty much lost the data for the last two days, but since our
postgresql were running well, it is fewer than it looks.
Thanks a lot for the help.
On Thu, Jan 19, 2012 at 7:30 AM, David Hornsby
Sounds like you have a corrupt wal files that you will have to reset the
wal logs with pgresetxlog.
http://www.postgresql.org/docs/8.2/static/app-pgresetxlog.html
This will result in missing transactions so before you do this shutdown
postgres and make a copy of the database files first. That way if you
don't like what happens you can always go back to the way things were.
Also right now would be a good time to evaluate your backup strategy,
which is a different topic for a different thread, but I can certainly
help with that as well.
-David Hornsby

Post by Samuel Hwang
version Postgresql 9.1.1 on centos5 x64
We experience slow performance and found the server is

running 3 vacuum

Post by Samuel Hwang
process on the same db which use up 99% of CPU.
Then we kill -9 one of those process which cause postgresql

to crash and

Post by Samuel Hwang
it
tried to restart after the crash
However when the starting process reach the last WAL files,

it just stuck

Post by Samuel Hwang
there
pg_controldata shows the db is in Archive Recovery mode and

when using

Post by Samuel Hwang
psql
to connect the db, it says FATAL: the database system

starting up.

Post by Samuel Hwang
I took a chance and upgrade to PostgreSql 9.1.2 and see if

anything

Post by Samuel Hwang
changed
it still stuck at the end of recovery.
pg_controldata shows db is in Crash recovery, but that

probably different

Post by Samuel Hwang
wording I think
using psql to connect the db, it says FATAL: the database

system is

Post by Samuel Hwang
starting up.
I pretty much run out of idea here.
Can anyone help what to go from here?
Samuel

--
*Shian-Miin Samuel Hwang* | Software Developer | Phone
1-403-2626519 <tel:1-403-2626519> (ext. 276) | Fax 1-403-233-8046
<tel:1-403-233-8046>
*Replicon* | Hassle-Free Time & Expense Management Software -
7,300 Customers - 70 Countries
www.replicon.com <http://www.replicon.com/> | facebook
<http://www.facebook.com/Replicon.inc> | twitter
<http://twitter.com/Replicon> | blog
<http://www.replicon.com/blog/> | contact us
<http://www.replicon.com/about_replicon/contact_us.aspx>
*We are hiring!* | search jobs
<http://tbe.taleo.net/NA2/ats/careers/searchResults.jsp?org=REPLICON&cws=1&act=sort&sortColumn=1&__utma=1.651918544.1299001662.1299170819.1299174966.10&__utmb=1.8.10.1299174966&__utmc=1&__utmx=-&__utmz=1.1299174985.10.3.utmcsr=google%7Cutmccn=%28organic%29%7Cutmcmd=organic%7Cutmctr=replicon%20careers&__utmv=1.%7C3=Visitor%20Type=Prospects=1,&__utmk=40578466>
--
*Shian-Miin Samuel Hwang* | Software Developer | Phone 1-403-2626519
(ext. 276) | Fax 1-403-233-8046
*Replicon* | Hassle-Free Time & Expense Management Software - 7,300
Customers - 70 Countries
www.replicon.com <http://www.replicon.com/> | facebook
<http://www.facebook.com/Replicon.inc> | twitter
<http://twitter.com/Replicon> | blog <http://www.replicon.com/blog/> |
contact us <http://www.replicon.com/about_replicon/contact_us.aspx>
*We are hiring!* | search jobs
<http://tbe.taleo.net/NA2/ats/careers/searchResults.jsp?org=REPLICON&cws=1&act=sort&sortColumn=1&__utma=1.651918544.1299001662.1299170819.1299174966.10&__utmb=1.8.10.1299174966&__utmc=1&__utmx=-&__utmz=1.1299174985.10.3.utmcsr=google%7Cutmccn=%28organic%29%7Cutmcmd=organic%7Cutmctr=replicon%20careers&__utmv=1.%7C3=Visitor%20Type=Prospects=1,&__utmk=40578466>

--
David Hornsby
Beechglen Development Inc.
P: (513) 922 - 0509 x432
C: (513) 254 - 0605
F: (513) 347 - 2834
W: beechglen.com