[GENERAL] Streaming Replication Server Crash
Craig Ringer
2012-10-22 07:02:22 UTC
Hi All,
We have configured Streaming Replication between the Primary and Standby
servers, with the Pgpool-II load balancing module diverting SELECT
statements to the Standby server. As per our observations, the Standby
server crashed during peak hours today, with the following error messages:
2012-10-19 12:26:46 IST [1338]: [18-1] user=,db= LOG: server process (PID 15565) was terminated by signal 10
2012-10-19 12:26:46 IST [1338]: [19-1] user=,db= LOG: terminating any other active server processes
That's odd. SIGUSR1 (signal 10) shouldn't terminate PostgreSQL.

Was the server intentionally sent SIGUSR1 by an admin? Do you know what
triggered the signal?

Are you running any procedural languages other than PL/PgSQL, or any
custom C extensions? Anything that might have unwittingly cleared the
signal handler for SIGUSR1?

--
Craig Ringer
--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin
Tom Lane
2012-10-22 12:52:38 UTC
Post by Craig Ringer
2012-10-19 12:26:46 IST [1338]: [18-1] user=,db= LOG: server process
(PID 15565) was terminated by signal 10
That's odd. SIGUSR1 (signal 10) shouldn't terminate PostgreSQL.
Was the server intentionally sent SIGUSR1 by an admin? Do you know what
triggered the signal?
SIGUSR1 is used for all sorts of internal cross-process signaling
purposes. There's no need to hypothesize any external force sending
it; if somebody had broken a PG process's signal handling setup for
SIGUSR1, a crash of this sort could be expected in short order.

But having said that, are we sure 10 is SIGUSR1 on the OP's platform?
AFAIK, that signal number is not at all compatible across different
flavors of Unix. (I see SIGUSR1 is 30 on OS X for instance.)
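
As a rough illustration of that point, the number behind a given signal name can be checked on whatever machine is at hand. This is only a sketch using Python's standard signal module; the helper name signal_number is made up for the example and is not anything from PostgreSQL.

```python
import signal

def signal_number(name):
    """Return this platform's number for a signal name, or None if the
    platform does not define that signal (hypothetical helper)."""
    value = getattr(signal, name, None)
    return int(value) if value is not None else None

if __name__ == "__main__":
    # The numbers printed here depend on the OS and architecture.
    for name in ("SIGUSR1", "SIGBUS", "SIGSEGV"):
        print(name, signal_number(name))
```

Running it on the affected host would show which name the number 10 really corresponds to there.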
Post by Craig Ringer
Are you running any procedural languages other than PL/PgSQL, or any
custom C extensions? Anything that might have unwittingly cleared the
signal handler for SIGUSR1?
libperl has a bad habit of thinking it can mess with the process's
signal setup ...

regards, tom lane
Craig Ringer
2012-10-23 05:08:10 UTC
http://stackoverflow.com/questions/6403803/how-to-get-backtrace-function-line-number-on-solaris
Actually, that link doesn't apply to this problem; it's for getting a
stack trace programmatically.

Try:

http://publib.boulder.ibm.com/httpserv/ihsdiag/get_backtrace.html

http://www.princeton.edu/~unix/Solaris/troubleshoot/adb.html

Most of the good links I could find were on blogs.sun.com, which Oracle
have helpfully redirected to www.oracle.com - where the pages don't
actually exist.

--
Craig Ringer
Tom Lane
2012-10-23 05:20:31 UTC
Post by Tom Lane
But having said that, are we sure 10 is SIGUSR1 on the OP's platform?
AFAIK, that signal number is not at all compatible across different
flavors of Unix. (I see SIGUSR1 is 30 on OS X for instance.)
Gah. I incorrectly thought that POSIX specified signal *numbers*, not
just names. That does not appear to actually be the case. Thanks.
This isn't the first time I've wondered exactly which signal was meant
in a postmaster child-crash report. Seems like it might be worth
expending some code on a symbolic translation, instead of just printing
the number. That'd be easy enough (for common signal names) on Unix,
but has anyone got a suggestion how we might do something useful on
Windows?
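
For the Unix side, a minimal sketch of what such a symbolic translation might look like, written in Python rather than in the postmaster's C. The function name describe_exit and the message wording are assumptions for illustration only; the os.WIF* helpers used here are POSIX-only.

```python
import os
import signal

def describe_exit(status):
    """Render a wait()-style status roughly like a child-crash report,
    adding the signal's symbolic name when the platform knows it."""
    if os.WIFSIGNALED(status):
        signum = os.WTERMSIG(status)
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Number not known to this platform's signal table.
            name = "unknown signal"
        return "terminated by signal %d (%s)" % (signum, name)
    if os.WIFEXITED(status):
        return "exited with code %d" % os.WEXITSTATUS(status)
    return "unrecognized status %d" % status
```

Because the name is looked up on the machine that produced the status, the report stays correct even where the numbers differ between platforms.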

regards, tom lane
Craig Ringer
2012-10-23 05:58:15 UTC
Post by Tom Lane
This isn't the first time I've wondered exactly which signal was meant
in a postmaster child-crash report. Seems like it might be worth
expending some code on a symbolic translation, instead of just printing
the number. That'd be easy enough (for common signal names) on Unix,
but has anyone got a suggestion how we might do something useful on
Windows?
Here's a typical Windows exception:


2012-10-04 14:29:08 CEST LOG: server process (PID 1416) was terminated
by exception 0xC0000005

2012-10-04 14:29:08 CEST HINT: See C include file "ntstatus.h" for a
description of the hexadecimal value.


These codes can be translated with FormatMessage:


http://msdn.microsoft.com/en-us/library/windows/desktop/ms679351(v=vs.85).aspx
http://support.microsoft.com/kb/259693

FormatMessage may not be safe to perform in the context of a munged heap
or some other failure conditions, so you probably don't want to do it
from a crash handler. It is safe for the postmaster to do it based on
the exception code it gets from the dying backend, though.

I'd say the best option is for the postmaster to print the output of
FormatMessage(FORMAT_MESSAGE_ALLOCATE_BUFFER | FORMAT_MESSAGE_FROM_SYSTEM | FORMAT_MESSAGE_FROM_HMODULE, ...)
when it sees the exception code from the dying backend.
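
To show the kind of report this would produce, here is a small Python sketch. It does not call FormatMessage; it assumes a tiny hard-coded table of well-known NTSTATUS values from ntstatus.h, and the names NTSTATUS_NAMES and describe_exception are invented for the example.

```python
# A few well-known NTSTATUS exception codes, as defined in ntstatus.h.
# This table is deliberately incomplete; it is only an illustration.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000001D: "STATUS_ILLEGAL_INSTRUCTION",
    0xC0000094: "STATUS_INTEGER_DIVIDE_BY_ZERO",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
}

def describe_exception(code):
    """Format an exception code as in the log line above, appending the
    symbolic name when the code is in the table (hypothetical helper)."""
    code &= 0xFFFFFFFF  # normalise to an unsigned 32-bit value
    name = NTSTATUS_NAMES.get(code)
    if name is None:
        return "terminated by exception 0x%08X" % code
    return "terminated by exception 0x%08X (%s)" % (code, name)
```

A real implementation would presumably ask the system for the message text instead of carrying its own table.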

RtlNtStatusToDosError may also be of interest:
http://msdn.microsoft.com/en-us/library/windows/desktop/ms680600(v=vs.85).aspx
... but it's in Winternl.h so it's not guaranteed to exist / be
compatible between versions and can only be accessed via runtime dynamic
linking. Not ideal.

--
Craig Ringer
Myers Brian D
2012-10-23 18:06:07 UTC
It looks like there's no standard way to do that. Here's how I'd do it in Python:

[CODE]
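import signal
# Map signal numbers to names, skipping non-signal constants such as SIG_DFL and SIG_IGN.
dict((int(num), name) for name, num in vars(signal).items()
     if name.startswith('SIG') and not name.startswith('SIG_'))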
[/CODE]

In C, I guess I'd just do a switch statement on the signal names common to Windows and POSIX, as exposed in SIGNAL.H. It looks like all you get on Windows is:

http://msdn.microsoft.com/en-us/library/xdkz3x12(v=vs.110).aspx
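
A rough Python equivalent of that switch, restricted to the names I believe both POSIX and the Windows signal.h define. The list COMMON_SIGNALS and the helper common_signal_name are assumptions made for this sketch.

```python
import signal

# Signal names assumed to be defined by both POSIX and the Windows C runtime.
COMMON_SIGNALS = ("SIGABRT", "SIGFPE", "SIGILL", "SIGINT", "SIGSEGV", "SIGTERM")

def common_signal_name(signum):
    """Return the portable name for signum on this platform, or None if
    the number is not one of the common signals (hypothetical helper)."""
    for name in COMMON_SIGNALS:
        # getattr guards against a platform that lacks one of the names.
        if getattr(signal, name, None) == signum:
            return name
    return None
```

Anything outside that short list, SIGUSR1 included, would still have to be reported by number.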

Brian

-----Original Message-----
From: pgsql-admin-***@postgresql.org [mailto:pgsql-admin-***@postgresql.org] On Behalf Of Tom Lane
Sent: Monday, October 22, 2012 10:21 PM
To: Craig Ringer
Cc: raghu ram; pgsql-***@postgresql.org; pgsql-general
Subject: Re: [ADMIN] [GENERAL] Streaming Replication Server Crash
Post by Tom Lane
But having said that, are we sure 10 is SIGUSR1 on the OP's platform?
AFAIK, that signal number is not at all compatible across different
flavors of Unix. (I see SIGUSR1 is 30 on OS X for instance.)
Gah. I incorrectly thought that POSIX specified signal *numbers*, not
just names. That does not appear to actually be the case. Thanks.
This isn't the first time I've wondered exactly which signal was meant in a postmaster child-crash report. Seems like it might be worth expending some code on a symbolic translation, instead of just printing the number. That'd be easy enough (for common signal names) on Unix, but has anyone got a suggestion how we might do something useful on Windows?

regards, tom lane

