Thanks, Tom.
The system has 16 logical cores (dual Xeon E5620s, quad-core with Hyper-Threading) and 96GB of RAM.
Nothing changed on the system when this started happening - at least
nothing in the configuration, which has not been touched since the
machine was set up.
Load averages are between 6 and 8 (all three of them).
Memory-wise, I do not see a problem - plenty of free memory, and only
120K of swap in use.
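For reference, I am going by the stock FreeBSD tools here, roughly:

# quick memory/swap sanity check (stock FreeBSD tools)
swapinfo -k                                                    # swap usage, 1K blocks
sysctl vm.stats.vm.v_free_count vm.stats.vm.v_inactive_count   # free/inactive pages
top -b | head -12                                              # one batch snapshot, incl. Mem/Swap lines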
The only thing I see when I run iostat is a burst of load on the drives
every so often (see the filter sketch after the output). mfid0 is the
database volume, mfid1 is pg_xlog.
db1# iostat -x 5
                         extended device statistics
device    r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4       0.0      0.0      0.0       0.0     0    1.6   0
ad6       0.0      0.0      0.0       0.0     0    0.8   0
mfid0    11.6    172.2    171.7    2988.5     0  178.1   6
mfid1     0.1    734.0      1.1    5898.8     0    0.2   4
ar0       0.3      0.7      1.0      10.3     0  114.7   0
                         extended device statistics
device    r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4       0.0      0.0      0.0       0.0     0    0.0   0
ad6       0.0      0.0      0.0       0.0     0    0.0   0
mfid0     2.8     73.4     22.4    1961.9     0    0.2   2
mfid1     0.0   1784.9      0.0   14303.2     0    0.0   5
ar0       0.0      1.0      0.0       6.8     0    0.8   0
                         extended device statistics
device    r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4       0.0      0.0      0.0       0.0     0    0.0   0
ad6       0.0      0.0      0.0       0.0     0    0.0   0
mfid0     2.2    652.7     44.8    7671.1  2053   99.6  10
mfid1     0.0   1525.9      0.0   12207.6     1    0.1  12
ar0       0.0      0.0      0.0       0.0     0    0.0   0
                         extended device statistics
device    r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4       0.0      0.0      0.0       0.0     0    0.0   0
ad6       0.0      0.0      0.0       0.0     0    0.0   0
mfid0     1.0   2202.4      8.0   19776.3     0  632.0  68
mfid1     0.0    574.3      0.0    4666.4     0    0.5  23
ar0       0.0      0.0      0.0       0.0     0    0.0   0
                         extended device statistics
device    r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4       0.0      0.0      0.0       0.0     0    0.0   0
ad6       0.0      0.0      0.0       0.0     0    0.0   0
mfid0     1.8     75.8     14.4    2242.8     0    0.3   2
mfid1     0.0   1722.5      0.0   13802.0     0    0.0   5
ar0       0.0      0.0      0.0       0.0     0    0.0   0
                         extended device statistics
device    r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4       0.0      0.0      0.0       0.0     0    0.0   0
ad6       0.0      0.0      0.0       0.0     0    0.0   0
mfid0     1.4     71.4     11.2    1991.6     0    0.2   1
mfid1     0.0   1708.9      0.0   13671.4     0    0.0   4
ar0       0.2      2.4      3.2      38.4     0    2.5   0
                         extended device statistics
device    r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4       0.0      0.0      0.0       0.0     0    0.0   0
ad6       0.0      0.0      0.0       0.0     0    0.0   0
mfid0     1.0     78.0      8.0    2222.5     0    0.1   1
mfid1     0.0   1541.0      0.0   12352.1     0    0.0   5
ar0       0.0      0.6      0.0       9.6     0    0.6   0
                         extended device statistics
device    r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4       0.0      0.0      0.0       0.0     0    0.0   0
ad6       0.0      0.0      0.0       0.0     0    0.0   0
mfid0     1.4     56.6     11.2    1629.1     0    0.3   2
mfid1     0.0   1558.3      0.0   12468.3     0    0.0   6
ar0       0.0      0.6      0.0       3.2     0    0.5   0
                         extended device statistics
device    r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4       0.0      0.0      0.0       0.0     0    0.0   0
ad6       0.0      0.0      0.0       0.0     0    0.0   0
mfid0     1.8     72.2     14.4    1706.7     0    0.2   1
mfid1     0.0   1696.5      0.0   13596.2     0    0.0   5
ar0       0.0      0.0      0.0       0.0     0    0.0   0
                         extended device statistics
device    r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4       0.0      0.0      0.0       0.0     0    0.0   0
ad6       0.0      0.0      0.0       0.0     0    0.0   0
mfid0     0.0   1757.7      0.0   16093.5   351  648.0  48
mfid1     0.0    503.9      0.0    4031.3     0    0.2  10
ar0       0.0      0.0      0.0       0.0     0    0.0   0
                         extended device statistics
device    r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4       0.0      0.0      0.0       0.0     0    0.0   0
ad6       0.0      0.0      0.0       0.0     0    0.0   0
mfid0     0.0    125.0      0.0    2542.5     0  404.4   1
mfid1     0.0   1743.9      0.0   13951.1     0    0.1   5
ar0       0.0      0.0      0.0       0.0     0    0.0   0
                         extended device statistics
device    r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4       0.0      0.0      0.0       0.0     0    0.0   0
ad6       0.0      0.0      0.0       0.0     0    0.0   0
mfid0     1.2     73.4      9.6    2122.6     0    0.1   1
mfid1     0.0   1624.3      0.0   13020.0     0    0.0   4
ar0       0.0      2.0      0.0      32.0     0    1.8   0
                         extended device statistics
device    r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4       0.0      0.0      0.0       0.0     0    0.0   0
ad6       0.0      0.0      0.0       0.0     0    0.0   0
mfid0     0.2     67.4      1.6    1955.8     0    0.1   0
mfid1     0.0   1334.5      0.0   10699.7     0    0.0   5
ar0       0.0      0.4      0.0       6.4     0    0.5   0
                         extended device statistics
device    r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4       0.0      0.0      0.0       0.0     0    0.0   0
ad6       0.0      0.0      0.0       0.0     0    0.0   0
mfid0     1.0     64.6      8.0    1789.1     0    0.2   1
mfid1     0.0   1541.0      0.0   12352.1     0    0.0   5
ar0       0.0      0.0      0.0       0.0     0    0.0   0
                         extended device statistics
device    r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4       0.0      0.0      0.0       0.0     0    0.0   0
ad6       0.0      0.0      0.0       0.0     0    0.0   0
mfid0     1.0     78.4      8.0    1855.3     0    0.2   1
mfid1     0.0   1652.1      0.0   13217.0     0    0.0   5
ar0       0.0      0.8      0.0       6.4     0    0.7   0
                         extended device statistics
device    r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4       0.0      0.0      0.0       0.0     0    0.0   0
ad6       0.0      0.0      0.0       0.0     0    0.0   0
mfid0     0.6    691.6      4.8    7919.5  3461   82.4  11
mfid1     0.0   1477.7      0.0   11821.7     1    0.3  15
ar0       0.0      0.0      0.0       0.0     0    0.0   0
                         extended device statistics
device    r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4       0.0      0.0      0.0       0.0     0    0.0   0
ad6       0.0      0.0      0.0       0.0     0    0.0   0
mfid0     0.4   1346.2      3.2   12558.1     0  855.1  41
mfid1     0.0   1020.8      0.0    8190.6     0    0.5  39
ar0       0.0      0.0      0.0       0.0     0    0.0   0
                         extended device statistics
device    r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4       0.0      0.0      0.0       0.0     0    0.0   0
ad6       0.0      0.0      0.0       0.0     0    0.0   0
mfid0     1.0     62.2      8.0    1792.6     0    0.3   2
mfid1     0.0   1636.3      0.0   13112.4     0    0.0   4
ar0       0.0      0.8      0.0      12.8     0    0.8   0
                         extended device statistics
device    r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4       0.0      0.0      0.0       0.0     0    0.0   0
ad6       0.0      0.0      0.0       0.0     0    0.0   0
mfid0     0.2     63.0      1.6    1818.1     0    0.1   0
mfid1     0.0   1710.6      0.0   13684.6     0    0.0   5
ar0       0.0      1.2      0.0      19.2     0    1.2   0
                         extended device statistics
device    r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4       0.0      0.0      0.0       0.0     0    0.0   0
ad6       0.0      0.0      0.0       0.0     0    0.0   0
mfid0     0.8     83.8      6.4    2330.6     0    0.1   1
mfid1     0.0   1218.8      0.0    9750.2     0    0.0   4
ar0       0.0      0.4      0.0       6.4     0    0.5   0
                         extended device statistics
device    r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4       0.0      0.0      0.0       0.0     0    0.0   0
ad6       0.0      0.0      0.0       0.0     0    0.0   0
mfid0     0.0     68.6      0.0    1655.7     0    0.1   0
mfid1     0.0   1708.9      0.0   13671.4     0    0.0   5
ar0       0.0      0.6      0.0       3.2     0    0.5   0
                         extended device statistics
device    r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4       0.0      0.0      0.0       0.0     0    0.0   0
ad6       0.0      0.0      0.0       0.0     0    0.0   0
mfid0     0.0     71.6      0.0    1999.4     0    0.0   0
mfid1     0.0   1071.3      0.0    8594.1     0    0.1   3
ar0       0.0      0.0      0.0       0.0     0    0.0   0
                         extended device statistics
device    r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4       0.0      0.0      0.0       0.0     0    0.0   0
ad6       0.0      0.0      0.0       0.0     0    0.0   0
mfid0     0.4   1373.6      3.2   12956.5     0  550.7  35
mfid1     0.0    999.8      0.0    8022.7     0    0.8  37
ar0       0.0      0.0      0.0       0.0     0    0.0   0
                         extended device statistics
device    r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4       0.0      0.0      0.0       0.0     0    0.0   0
ad6       0.0      0.0      0.0       0.0     0    0.0   0
mfid0     0.8     68.8      8.0    1908.5     0    0.2   1
mfid1     0.0   1370.5      0.0   10962.3     0    0.0   3
ar0       0.0      0.6      0.0       9.6     0    0.6   0
                         extended device statistics
device    r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4       0.0      0.0      0.0       0.0     0    0.0   0
ad6       0.0      0.0      0.0       0.0     0    0.0   0
mfid0     1.2     63.8      9.6    1912.1     0    0.3   2
mfid1     0.0   1355.8      0.0   10846.2     0    0.0   3
ar0       0.0      1.0      0.0      16.0     0    1.1   0
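The bursts are easier to see filtered out of the steady state - a quick
sketch (the 100ms threshold is arbitrary, just enough to separate the
spikes):

# show only mfid0 samples where the queue backs up or svc_t spikes
db1# iostat -x 5 | awk '$1 == "mfid0" && ($6 > 0 || $7 > 100)'

Those mfid0 bursts (w/s jumping past 1300, svc_t into the hundreds of
milliseconds, wait queues of 2000+) look like periodic flushes to me -
checkpoints would be my first guess, but I have not confirmed that.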
Looking at /var/log/messages, everything appears to be OK. The only
thing I found, and it was from over a month ago, was the following:
Nov 17 06:35:03 db1 kernel: MCA: Bank 8, Status 0x8c0000400001009f
Nov 17 06:35:03 db1 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Nov 17 06:35:03 db1 kernel: MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
Nov 17 06:35:03 db1 kernel: MCA: CPU 0 COR (1) RD channel ?? memory error
Nov 17 06:35:03 db1 kernel: MCA: Address 0x84aa98000
Nov 17 06:35:03 db1 kernel: MCA: Misc 0x4701041000041280
Nov 17 07:35:03 db1 kernel: MCA: Bank 8, Status 0x8c0000400001009f
Nov 17 07:35:03 db1 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Nov 17 07:35:03 db1 kernel: MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
Nov 17 07:35:03 db1 kernel: MCA: CPU 0 COR (1) RD channel ?? memory error
Nov 17 07:35:03 db1 kernel: MCA: Address 0x84aa98000
Nov 17 07:35:03 db1 kernel: MCA: Misc 0x4701041000046141
Nov 17 09:35:03 db1 kernel: MCA: Bank 8, Status 0x88000040000200cf
Nov 17 09:35:03 db1 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Nov 17 09:35:03 db1 kernel: MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
Nov 17 09:35:03 db1 kernel: MCA: CPU 0 COR (1) MS channel ?? memory error
Nov 17 09:35:03 db1 kernel: MCA: Misc 0x4701041000046141
Nov 18 13:34:59 db1 kernel: MCA: Bank 8, Status 0xcc0000800001009f
Nov 18 13:34:59 db1 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Nov 18 13:34:59 db1 kernel: MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
Nov 18 13:34:59 db1 kernel: MCA: CPU 0 COR (2) OVER RD channel ?? memory error
Nov 18 13:34:59 db1 kernel: MCA: Address 0x84aa98000
Nov 18 13:34:59 db1 kernel: MCA: Misc 0x4701041000044707
But it has not recurred since.
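If it does come back I want to know right away. A crude sh sketch along
these lines, run from cron, should catch it (the state-file path is
arbitrary, and newsyslog rotation would reset the count - good enough
for a quick check):

#!/bin/sh
# alert if the number of MCA lines in /var/log/messages has grown since last run
count=$(grep -c 'MCA:' /var/log/messages)
last=$(cat /var/db/mca.count 2>/dev/null || echo 0)
if [ "$count" -gt "$last" ]; then
    echo "MCA error count is now $count (was $last)" | mail -s 'MCA events on db1' root
fi
echo "$count" > /var/db/mca.count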
-----Original Message-----
Sent: Thursday, December 22, 2011 10:55 AM
To: Benjamin Krajmalnik
Cc: pgsql-admin
Subject: Re: [ADMIN] Problem with pgstat timeouts
> About a month ago, I started receiving quite a few pgstat timeouts on
> my production database.
> PostgreSQL 9.0.3 on amd64-portbld-freebsd8.1, compiled by GCC cc (GCC)
> 4.2.1 20070719 [FreeBSD], 64-bit
> I am not sure where to start looking for the cause.
Yeah, so are the rest of us :-(. People have been reporting sometimes
seeing these messages without any clear cause. We've been unable to
find any bug in Postgres that would explain them, and the obvious
explanation of "your system is so overloaded that the stats collector
is too slow to respond" isn't very satisfying if response time seems
snappy otherwise. I think we have seen one case where the problem was
provably caused by significant (multi-second) clock skew between
different processors, but that's likely not applicable to many people.
Can you reconstruct what changed in your system around the time you
started seeing these?
> It appears that these timeouts are affecting the performance of the
> application - sometimes it hangs for a few seconds before data gets
> returned, which was not the case before this.
Usually these timeouts are reported by autovacuum, and so should not be
affecting the performance of foreground queries, at least not queries
that are not explicitly requesting pgstats data. The fact that your
log messages don't include any query context reinforces the idea that
they are coming from autovacuum for you too. So if you are seeing
query response time problems, maybe the pgstat timeouts are another
symptom, not the cause, of an underlying too-much-load problem. You
might try turning on query duration logging and see if you can
correlate response-time glitches and pgstat timeouts to anything else
observable, such as peaks of system load average.
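Concretely, that's just a couple of settings in postgresql.conf plus a
reload (the 500ms threshold is an arbitrary starting point, not a
recommendation):

log_min_duration_statement = 500ms   # log any statement slower than this
log_checkpoints = on                 # log checkpoint starts/ends, to correlate with I/O bursts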
regards, tom lane
--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin