Escaping a blocked sendto() syscall without causing a restart
Jerry Sievers
2013-01-17 20:36:24 UTC
Does anyone know if one of the signals below can be sent to break out
of this state *without* the postmaster sensing a crashed backend?

Several times in the past, at other companies, I've seen backends that
would not respond to cancel or SIGTERM because of a syscall blocked
on I/O.

Quite often, though, the backend would apparently notice the broken
socket eventually, receive the signals, and exit cleanly.

I've got one that's been wedged like that for a couple days now.

I recall trying several signals in a similar situation a while ago.
One of them interrupted the syscall all right, but it was an abort,
and we got the customary spontaneous postmaster restart.


PostgreSQL 8.4.13 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.3.real (Debian 4.3.2-1.1) 4.3.2, 64-bit

$ uname -a
Linux somebox.foo.zizzy 2.6.36 #5 SMP Thu Jul 28 17:52:31 UTC 2011 x86_64 GNU/Linux
$
$ strace -p 31603
Process 31603 attached - interrupt to quit
sendto(9, "default_rate_3m_v4: 0.1224\nmonth_"..., 3440, 0, NULL, 0 <unfinished ...>
Process 31603 detached
$
$ kill -l
1) SIGHUP 2) SIGINT 3) SIGQUIT 4) SIGILL
5) SIGTRAP 6) SIGABRT 7) SIGBUS 8) SIGFPE
9) SIGKILL 10) SIGUSR1 11) SIGSEGV 12) SIGUSR2
13) SIGPIPE 14) SIGALRM 15) SIGTERM 16) SIGSTKFLT
17) SIGCHLD 18) SIGCONT 19) SIGSTOP 20) SIGTSTP
21) SIGTTIN 22) SIGTTOU 23) SIGURG 24) SIGXCPU
25) SIGXFSZ 26) SIGVTALRM 27) SIGPROF 28) SIGWINCH
29) SIGIO 30) SIGPWR 31) SIGSYS 34) SIGRTMIN
35) SIGRTMIN+1 36) SIGRTMIN+2 37) SIGRTMIN+3 38) SIGRTMIN+4
39) SIGRTMIN+5 40) SIGRTMIN+6 41) SIGRTMIN+7 42) SIGRTMIN+8
43) SIGRTMIN+9 44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12
47) SIGRTMIN+13 48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14
51) SIGRTMAX-13 52) SIGRTMAX-12 53) SIGRTMAX-11 54) SIGRTMAX-10
55) SIGRTMAX-9 56) SIGRTMAX-8 57) SIGRTMAX-7 58) SIGRTMAX-6
59) SIGRTMAX-5 60) SIGRTMAX-4 61) SIGRTMAX-3 62) SIGRTMAX-2
63) SIGRTMAX-1 64) SIGRTMAX
$

Thanks
--
Jerry Sievers
e: ***@comcast.net
p: 312.241.7800
--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin
Tom Lane
2013-01-17 21:38:40 UTC
Post by Jerry Sievers
> Does anyone know if one of the signals below can be sent to break out
> of this state *without* the postmaster sensing a crashed backend?
>
> I've seen several times in the past at other companies, backends that
> will not respond to cancel nor SIGTERM due to syscall that's blocked
> on IO.
>
> Quite often though apparently the backend would notice the broken
> socket eventually and receive the signals and exit cleanly.
>
> I've got one that's been wedged like that for a couple days now.
>
> I recall trying several in a similar situation a while ago and of
> course one of them interrupted the syscall all right but it was an
> abort and we got the customary spontaneous postmaster restart.

Offhand it looks to me like most signals would kick the backend off the
send() call ... but it would loop right back and try again. See
internal_flush() in pqcomm.c. (If you're using SSL, this diagnosis
may or may not apply.)
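
To make the "loop right back and try again" behavior concrete, here is a
minimal sketch of that retry pattern. It is not the code from pqcomm.c:
flush_with_retry() is a hypothetical helper written for illustration, and
it assumes a plain, non-SSL, blocking socket. The point is only that a
signal which interrupts send() produces EINTR, and the loop responds by
calling send() again.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not PostgreSQL source): push len bytes to fd,
 * retrying whenever send() is interrupted by a signal.
 * Returns 0 on success, -1 on a hard error (errno is left set).
 */
int flush_with_retry(int fd, const char *buf, size_t len)
{
    size_t sent = 0;

    assert(buf != NULL || len == 0);

    while (sent < len) {
        ssize_t r = send(fd, buf + sent, len - sent, 0);

        if (r < 0) {
            if (errno == EINTR)
                continue;   /* a signal woke us up: go straight back into send() */
            return -1;      /* hard error such as EPIPE or ECONNRESET */
        }
        sent += (size_t) r; /* partial write: advance and keep going */
    }
    return 0;
}
```

With a loop of this shape, a signal whose handler merely returns cannot
get the process out of the blocked write; only a hard socket error, or a
signal that terminates the process outright, ends it.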

We can't do anything except repeat the send attempt if the client
connection is to be kept in a sane state. It's possible that if the
interrupt was a SIGTERM (forced exit) we could mark the connection dead
and return early, but it would probably take some thought and
experimentation to get useful behavior that way. And I'm not at all
sure if we could get it to work in SSL mode ...

So the short answer is no, you probably can't kill the session without
causing a restart. Possibly we should add a TODO to make this better.

What you might consider instead, if this is a recurring problem, is
adjusting the postmaster-side TCP keepalive parameters so that dead
connections are noticed more quickly. The default connection timeout
according to the TCP standards is on the order of hours, but you can
reduce that quite a lot if your network environment is at all reliable.
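
In PostgreSQL the relevant settings are tcp_keepalives_idle,
tcp_keepalives_interval, and tcp_keepalives_count in postgresql.conf. As a
rough illustration of what such settings amount to at the socket level on
Linux, here is a sketch; enable_keepalive() is a hypothetical helper, not
PostgreSQL code, and the numbers in the note below are only an example,
not a recommendation.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (Linux-specific option names): turn on TCP
 * keepalive for fd and set the per-socket timing.
 *   idle_secs     - idle time before the first probe is sent
 *   interval_secs - time between unanswered probes
 *   probe_count   - unanswered probes before the connection is dropped
 * Returns 0 on success, -1 if any setsockopt() call fails.
 */
int enable_keepalive(int fd, int idle_secs, int interval_secs, int probe_count)
{
    int on = 1;

    assert(fd >= 0);

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_secs, sizeof(idle_secs)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_secs, sizeof(interval_secs)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probe_count, sizeof(probe_count)) < 0)
        return -1;
    return 0;
}
```

For example, enable_keepalive(fd, 60, 10, 5) would have an idle connection
to a vanished peer declared dead after roughly 60 + 10 * 5 = 110 seconds,
compared with the Linux defaults of 7200 seconds idle, 75 seconds between
probes, and 9 probes.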

(But it's not clear to me why your stuck-for-a-couple-days case wouldn't
have timed out long since. Are you sure this isn't a client-side
problem, i.e., the client is wedged? If so, why not kill the client instead?)

regards, tom lane