[ADMIN] Problem with pgstat timneouts

Discussion:

(too old to reply)

Benjamin Krajmalnik

2011-12-22 16:32:51 UTC

About a month ago, I started receiving quite a few pgstat timeouts on my
production database.

PostgreSQL 9.0.3 on amd64-portbld-freebsd8.1, compiled by GCC cc (GCC)
4.2.1 20070719 [FreeBSD], 64-bit

I am not sure where to start looking for the cause.

Database has been up since march without any restart

db1# ps -aux | grep stat

pgsql 1287 1.0 0.0 22832 4724 ?? Ss 11Mar11 4265:13.87
postgres: stats collector process (postgres)

It appears that these timeouts are affecting the performance of the
application - sometimes it hangs for a few seconds before data gets
returned, which was not the case before this.

2011-12-22 03:45:04 MST WARNING: pgstat wait timeout

2011-12-22 03:45:04 MST WARNING: pgstat wait timeout

2011-12-22 03:45:04 MST WARNING: pgstat wait timeout

2011-12-22 03:45:04 MST WARNING: pgstat wait timeout

2011-12-22 03:45:41 MST WARNING: pgstat wait timeout

2011-12-22 03:45:41 MST WARNING: pgstat wait timeout

2011-12-22 03:45:41 MST WARNING: pgstat wait timeout

2011-12-22 03:46:18 MST WARNING: pgstat wait timeout

2011-12-22 03:46:18 MST WARNING: pgstat wait timeout

2011-12-22 03:47:32 MST WARNING: pgstat wait timeout

2011-12-22 03:47:32 MST WARNING: pgstat wait timeout

2011-12-22 03:47:32 MST WARNING: pgstat wait timeout

2011-12-22 03:48:09 MST WARNING: pgstat wait timeout

2011-12-22 03:48:09 MST WARNING: pgstat wait timeout

2011-12-22 03:48:09 MST WARNING: pgstat wait timeout

2011-12-22 03:49:26 MST WARNING: pgstat wait timeout

Runtime statistics setting for the database are as follows:

#-----------------------------------------------------------------------
-------

# RUNTIME STATISTICS

#-----------------------------------------------------------------------
-------

# - Query/Index Statistics Collector -

#track_activities = on

#track_counts = on

#track_functions = none # none, pl, all

#track_activity_query_size = 1024 # (change requires restart)

# On FreeBSD, this is a performance hog, so keep it off if you need
speed

update_process_title = off

#update_process_title = on

#stats_temp_directory = 'pg_stat_tmp'

# - Statistics Monitoring -

#log_parser_stats = off

#log_planner_stats = off

#log_executor_stats = off

#log_statement_stats = off

Tom Lane

2011-12-22 17:54:33 UTC

Permalink

Post by Benjamin Krajmalnik
About a month ago, I started receiving quite a few pgstat timeouts on my
production database.
PostgreSQL 9.0.3 on amd64-portbld-freebsd8.1, compiled by GCC cc (GCC)
4.2.1 20070719 [FreeBSD], 64-bit
I am not sure where to start looking for the cause.

Yeah, so are the rest of us :-(. People have been reporting sometimes
seeing these messages without any clear cause. We've been unable to
find any bug in Postgres that would explain them, and the obvious
explanation of "your system is so overloaded that the stats collector
is too slow to respond" isn't very satisfying if response time seems
snappy otherwise. I think we have seen one case where the problem was
provably caused by significant (multi-second) clock skew between
different processors, but that's likely not applicable to many people.

Can you reconstruct what changed in your system around the time you
started seeing these?

Post by Benjamin Krajmalnik
It appears that these timeouts are affecting the performance of the
application - sometimes it hangs for a few seconds before data gets
returned, which was not the case before this.

Usually these timeouts are reported by autovacuum, and so should not be
affecting the performance of foreground queries, at least not queries
that are not explicitly requesting pgstats data. The fact that your
log messages don't include any query context reinforces the idea that
they are coming from autovacuum for you too. So if you are seeing
query response time problems, maybe the pgstat timeouts are another
symptom, not the cause, of an underlying too-much-load problem. You
might try turning on query duration logging and see if you can correlate
response-time glitches and pgstat timeouts to anything else observable,
such as peaks of system load average.

regards, tom lane

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Benjamin Krajmalnik

2011-12-22 18:25:46 UTC

Permalink

Thanks, Tom.
The system is running 16 cores (dual E5620's), 96GB RAM.
Nothing changed on the system when this started happening - at least
nothing in the configuration - it has not been touched since it was set
up.
Load averages are in between 6 and 8 (all 3 of them)
Memory wise, I do not see a problem - plenty of free memory, and only
120K of swap has been used.
The only thing which I see when I run iostat is load on the drives every
so often. Mifd0 is the database, mifd1 is the pg_xlog.

db1# iostat -x 5
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 1.6 0
ad6 0.0 0.0 0.0 0.0 0 0.8 0
mfid0 11.6 172.2 171.7 2988.5 0 178.1 6
mfid1 0.1 734.0 1.1 5898.8 0 0.2 4
ar0 0.3 0.7 1.0 10.3 0 114.7 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 2.8 73.4 22.4 1961.9 0 0.2 2
mfid1 0.0 1784.9 0.0 14303.2 0 0.0 5
ar0 0.0 1.0 0.0 6.8 0 0.8 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 2.2 652.7 44.8 7671.1 2053 99.6 10
mfid1 0.0 1525.9 0.0 12207.6 1 0.1 12
ar0 0.0 0.0 0.0 0.0 0 0.0 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 1.0 2202.4 8.0 19776.3 0 632.0 68
mfid1 0.0 574.3 0.0 4666.4 0 0.5 23
ar0 0.0 0.0 0.0 0.0 0 0.0 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 1.8 75.8 14.4 2242.8 0 0.3 2
mfid1 0.0 1722.5 0.0 13802.0 0 0.0 5
ar0 0.0 0.0 0.0 0.0 0 0.0 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 1.4 71.4 11.2 1991.6 0 0.2 1
mfid1 0.0 1708.9 0.0 13671.4 0 0.0 4
ar0 0.2 2.4 3.2 38.4 0 2.5 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 1.0 78.0 8.0 2222.5 0 0.1 1
mfid1 0.0 1541.0 0.0 12352.1 0 0.0 5
ar0 0.0 0.6 0.0 9.6 0 0.6 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 1.4 56.6 11.2 1629.1 0 0.3 2
mfid1 0.0 1558.3 0.0 12468.3 0 0.0 6
ar0 0.0 0.6 0.0 3.2 0 0.5 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 1.8 72.2 14.4 1706.7 0 0.2 1
mfid1 0.0 1696.5 0.0 13596.2 0 0.0 5
ar0 0.0 0.0 0.0 0.0 0 0.0 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 0.0 1757.7 0.0 16093.5 351 648.0 48
mfid1 0.0 503.9 0.0 4031.3 0 0.2 10
ar0 0.0 0.0 0.0 0.0 0 0.0 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 0.0 125.0 0.0 2542.5 0 404.4 1
mfid1 0.0 1743.9 0.0 13951.1 0 0.1 5
ar0 0.0 0.0 0.0 0.0 0 0.0 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 1.2 73.4 9.6 2122.6 0 0.1 1
mfid1 0.0 1624.3 0.0 13020.0 0 0.0 4
ar0 0.0 2.0 0.0 32.0 0 1.8 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 0.2 67.4 1.6 1955.8 0 0.1 0
mfid1 0.0 1334.5 0.0 10699.7 0 0.0 5
ar0 0.0 0.4 0.0 6.4 0 0.5 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 1.0 64.6 8.0 1789.1 0 0.2 1
mfid1 0.0 1541.0 0.0 12352.1 0 0.0 5
ar0 0.0 0.0 0.0 0.0 0 0.0 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 1.0 78.4 8.0 1855.3 0 0.2 1
mfid1 0.0 1652.1 0.0 13217.0 0 0.0 5
ar0 0.0 0.8 0.0 6.4 0 0.7 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 0.6 691.6 4.8 7919.5 3461 82.4 11
mfid1 0.0 1477.7 0.0 11821.7 1 0.3 15
ar0 0.0 0.0 0.0 0.0 0 0.0 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 0.4 1346.2 3.2 12558.1 0 855.1 41
mfid1 0.0 1020.8 0.0 8190.6 0 0.5 39
ar0 0.0 0.0 0.0 0.0 0 0.0 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 1.0 62.2 8.0 1792.6 0 0.3 2
mfid1 0.0 1636.3 0.0 13112.4 0 0.0 4
ar0 0.0 0.8 0.0 12.8 0 0.8 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 0.2 63.0 1.6 1818.1 0 0.1 0
mfid1 0.0 1710.6 0.0 13684.6 0 0.0 5
ar0 0.0 1.2 0.0 19.2 0 1.2 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 0.8 83.8 6.4 2330.6 0 0.1 1
mfid1 0.0 1218.8 0.0 9750.2 0 0.0 4
ar0 0.0 0.4 0.0 6.4 0 0.5 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 0.0 68.6 0.0 1655.7 0 0.1 0
mfid1 0.0 1708.9 0.0 13671.4 0 0.0 5
ar0 0.0 0.6 0.0 3.2 0 0.5 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 0.0 71.6 0.0 1999.4 0 0.0 0
mfid1 0.0 1071.3 0.0 8594.1 0 0.1 3
ar0 0.0 0.0 0.0 0.0 0 0.0 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 0.4 1373.6 3.2 12956.5 0 550.7 35
mfid1 0.0 999.8 0.0 8022.7 0 0.8 37
ar0 0.0 0.0 0.0 0.0 0 0.0 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 0.8 68.8 8.0 1908.5 0 0.2 1
mfid1 0.0 1370.5 0.0 10962.3 0 0.0 3
ar0 0.0 0.6 0.0 9.6 0 0.6 0
extended device statistics
device r/s w/s kr/s kw/s wait svc_t %b
ad4 0.0 0.0 0.0 0.0 0 0.0 0
ad6 0.0 0.0 0.0 0.0 0 0.0 0
mfid0 1.2 63.8 9.6 1912.1 0 0.3 2
mfid1 0.0 1355.8 0.0 10846.2 0 0.0 3
ar0 0.0 1.0 0.0 16.0 0 1.1 0

Looking at /var/log/messages, everything appears to be ok.
The only thing I found, but it was from over a month ago, was the
following:
Nov 17 06:35:03 db1 kernel: MCA: Bank 8, Status 0x8c0000400001009f
Nov 17 06:35:03 db1 kernel: MCA: Global Cap 0x0000000000001c09, Status
0x0000000000000000
Nov 17 06:35:03 db1 kernel: MCA: Vendor "GenuineIntel", ID 0x206c2, APIC
ID 0
Nov 17 06:35:03 db1 kernel: MCA: CPU 0 COR (1) RD channel ?? memory
error
Nov 17 06:35:03 db1 kernel: MCA: Address 0x84aa98000
Nov 17 06:35:03 db1 kernel: MCA: Misc 0x4701041000041280
Nov 17 07:35:03 db1 kernel: MCA: Bank 8, Status 0x8c0000400001009f
Nov 17 07:35:03 db1 kernel: MCA: Global Cap 0x0000000000001c09, Status
0x0000000000000000
Nov 17 07:35:03 db1 kernel: MCA: Vendor "GenuineIntel", ID 0x206c2, APIC
ID 0
Nov 17 07:35:03 db1 kernel: MCA: CPU 0 COR (1) RD channel ?? memory
error
Nov 17 07:35:03 db1 kernel: MCA: Address 0x84aa98000
Nov 17 07:35:03 db1 kernel: MCA: Misc 0x4701041000046141
Nov 17 09:35:03 db1 kernel: MCA: Bank 8, Status 0x88000040000200cf
Nov 17 09:35:03 db1 kernel: MCA: Global Cap 0x0000000000001c09, Status
0x0000000000000000
Nov 17 09:35:03 db1 kernel: MCA: Vendor "GenuineIntel", ID 0x206c2, APIC
ID 0
Nov 17 09:35:03 db1 kernel: MCA: CPU 0 COR (1) MS channel ?? memory
error
Nov 17 09:35:03 db1 kernel: MCA: Misc 0x4701041000046141
Nov 18 13:34:59 db1 kernel: MCA: Bank 8, Status 0xcc0000800001009f
Nov 18 13:34:59 db1 kernel: MCA: Global Cap 0x0000000000001c09, Status
0x0000000000000000
Nov 18 13:34:59 db1 kernel: MCA: Vendor "GenuineIntel", ID 0x206c2, APIC
ID 0
Nov 18 13:34:59 db1 kernel: MCA: CPU 0 COR (2) OVER RD channel ?? memory
error
Nov 18 13:34:59 db1 kernel: MCA: Address 0x84aa98000
Nov 18 13:34:59 db1 kernel: MCA: Misc 0x4701041000044707
But it has not reoccurred.

-----Original Message-----
Sent: Thursday, December 22, 2011 10:55 AM
To: Benjamin Krajmalnik
Cc: pgsql-admin
Subject: Re: [ADMIN] Problem with pgstat timneouts

Post by Benjamin Krajmalnik
About a month ago, I started receiving quite a few pgstat timeouts

Post by Benjamin Krajmalnik
production database.
PostgreSQL 9.0.3 on amd64-portbld-freebsd8.1, compiled by GCC cc

(GCC)

Post by Benjamin Krajmalnik
4.2.1 20070719 [FreeBSD], 64-bit
I am not sure where to start looking for the cause.

Yeah, so are the rest of us :-(. People have been reporting sometimes
seeing these messages without any clear cause. We've been unable to
find any bug in Postgres that would explain them, and the obvious
explanation of "your system is so overloaded that the stats collector
is too slow to respond" isn't very satisfying if response time seems
snappy otherwise. I think we have seen one case where the problem was
provably caused by significant (multi-second) clock skew between
different processors, but that's likely not applicable to many people.
Can you reconstruct what changed in your system around the time you
started seeing these?

Usually these timeouts are reported by autovacuum, and so should not be
affecting the performance of foreground queries, at least not queries
that are not explicitly requesting pgstats data. The fact that your
log messages don't include any query context reinforces the idea that
they are coming from autovacuum for you too. So if you are seeing
query response time problems, maybe the pgstat timeouts are another
symptom, not the cause, of an underlying too-much-load problem. You
might try turning on query duration logging and see if you can
correlate
response-time glitches and pgstat timeouts to anything else

observable,

such as peaks of system load average.
regards, tom lane

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Tom Lane

2011-12-22 18:45:22 UTC

Permalink

Post by Benjamin Krajmalnik
The only thing which I see when I run iostat is load on the drives every
so often. Mifd0 is the database, mifd1 is the pg_xlog.

Hmm. Maybe checkpoints or something else saturating your drives for a
little while? You might want to be more aggressive about smoothing out
checkpoints, cf
http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm
At the very least I'd enable checkpoint logging for awhile and see
if there seems to be any correlation to the response-time glitches.

regards, tom lane

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Benjamin Krajmalnik

2011-12-22 18:55:53 UTC

Permalink

checkpoint_segments = 1024 # in logfile segments, min 1,
16MB each
checkpoint_timeout = 60min # range 30s-1h
checkpoint_completion_target = 0.85 # checkpoint target duration,
0.0 - 1.0
#checkpoint_warning = 30s # 0 disables

select * from pg_stat_bgwriter gives ne the following:
checkpoints_timed - 6887
checkpoints_req - 36

A snapshot from the log (I am logging checkpoints)

2011-12-22 01:04:33 MST LOG: checkpoint starting: time
2011-12-22 01:12:32 MST WARNING: pgstat wait timeout
2011-12-22 01:12:32 MST WARNING: pgstat wait timeout
2011-12-22 01:13:12 MST WARNING: pgstat wait timeout
2011-12-22 01:13:12 MST WARNING: pgstat wait timeout
2011-12-22 01:13:12 MST WARNING: pgstat wait timeout
2011-12-22 01:13:18 MST WARNING: pgstat wait timeout
2011-12-22 01:13:18 MST WARNING: pgstat wait timeout
2011-12-22 01:35:35 MST WARNING: pgstat wait timeout
2011-12-22 01:35:35 MST WARNING: pgstat wait timeout
2011-12-22 01:38:32 MST WARNING: pgstat wait timeout
2011-12-22 01:38:32 MST WARNING: pgstat wait timeout
2011-12-22 01:41:28 MST LOG: checkpoint complete: wrote 329593 buffers
(41.9%); 0 transaction log file(s) added, 0 removed, 731 rec

2011-12-22 09:04:33 MST LOG: checkpoint starting: time
2011-12-22 09:06:54 MST WARNING: pgstat wait timeout
2011-12-22 09:06:54 MST WARNING: pgstat wait timeout
2011-12-22 09:06:54 MST WARNING: pgstat wait timeout
2011-12-22 09:07:33 MST WARNING: pgstat wait timeout
2011-12-22 09:07:33 MST WARNING: pgstat wait timeout
2011-12-22 09:07:33 MST WARNING: pgstat wait timeout
2011-12-22 09:13:41 MST WARNING: pgstat wait timeout
2011-12-22 09:22:07 MST ERROR: canceling autovacuum task
2011-12-22 09:22:07 MST CONTEXT: automatic vacuum of table
"ishield.public.tbltmptests"
2011-12-22 09:28:41 MST WARNING: pgstat wait timeout
2011-12-22 09:28:41 MST WARNING: pgstat wait timeout
2011-12-22 09:31:35 MST WARNING: pgstat wait timeout
2011-12-22 09:31:35 MST WARNING: pgstat wait timeout
2011-12-22 09:33:16 MST WARNING: pgstat wait timeout
2011-12-22 09:33:16 MST WARNING: pgstat wait timeout
2011-12-22 09:33:55 MST WARNING: pgstat wait timeout
2011-12-22 09:33:55 MST WARNING: pgstat wait timeout
2011-12-22 09:33:55 MST WARNING: pgstat wait timeout
2011-12-22 09:35:10 MST WARNING: pgstat wait timeout
2011-12-22 09:35:10 MST WARNING: pgstat wait timeout
2011-12-22 09:35:10 MST ERROR: canceling autovacuum task
2011-12-22 09:35:10 MST CONTEXT: automatic vacuum of table
"ishield.public.tbltmptests"
2011-12-22 09:36:21 MST WARNING: pgstat wait timeout
2011-12-22 09:36:21 MST WARNING: pgstat wait timeout
2011-12-22 09:37:02 MST WARNING: pgstat wait timeout
2011-12-22 09:37:02 MST WARNING: pgstat wait timeout
2011-12-22 09:37:02 MST WARNING: pgstat wait timeout
2011-12-22 09:37:43 MST WARNING: pgstat wait timeout
2011-12-22 09:37:43 MST WARNING: pgstat wait timeout
2011-12-22 09:38:20 MST WARNING: pgstat wait timeout
2011-12-22 09:38:20 MST WARNING: pgstat wait timeout
2011-12-22 09:38:59 MST WARNING: pgstat wait timeout
2011-12-22 09:38:59 MST WARNING: pgstat wait timeout
2011-12-22 09:38:59 MST WARNING: pgstat wait timeout
2011-12-22 09:42:33 MST LOG: checkpoint complete: wrote 362856 buffers
(46.1%); 0 transaction log file(s) added, 0 removed, 908 rec

I do have an aggressive autovacuum because I have some tables that get
updated thousands of times per minute and otherwise they bloat terribly,
but it is very seldom that I see cancelation of autovacuum, and only on
this table, which is a temporary table used for the insertion of the
data, in which records are inserted and deleted - never updated.

-----Original Message-----
Sent: Thursday, December 22, 2011 11:45 AM
To: Benjamin Krajmalnik
Cc: pgsql-admin
Subject: Re: [ADMIN] Problem with pgstat timneouts

Post by Benjamin Krajmalnik
The only thing which I see when I run iostat is load on the drives

every

Post by Benjamin Krajmalnik
so often. Mifd0 is the database, mifd1 is the pg_xlog.

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Benjamin Krajmalnik

2011-12-30 18:49:03 UTC

Permalink

Just a WAG - is it possible this is being caused by the connection
pooler (pgbouncer in my case)?

-----Original Message-----
Sent: Thursday, December 22, 2011 11:56 AM
To: Tom Lane
Cc: pgsql-admin
Subject: Re: [ADMIN] Problem with pgstat timneouts
checkpoint_segments = 1024 # in logfile segments, min 1,
16MB each
checkpoint_timeout = 60min # range 30s-1h
checkpoint_completion_target = 0.85 # checkpoint target duration,
0.0 - 1.0
#checkpoint_warning = 30s # 0 disables
checkpoints_timed - 6887
checkpoints_req - 36
A snapshot from the log (I am logging checkpoints)
2011-12-22 01:04:33 MST LOG: checkpoint starting: time
2011-12-22 01:12:32 MST WARNING: pgstat wait timeout
2011-12-22 01:12:32 MST WARNING: pgstat wait timeout
2011-12-22 01:13:12 MST WARNING: pgstat wait timeout
2011-12-22 01:13:12 MST WARNING: pgstat wait timeout
2011-12-22 01:13:12 MST WARNING: pgstat wait timeout
2011-12-22 01:13:18 MST WARNING: pgstat wait timeout
2011-12-22 01:13:18 MST WARNING: pgstat wait timeout
2011-12-22 01:35:35 MST WARNING: pgstat wait timeout
2011-12-22 01:35:35 MST WARNING: pgstat wait timeout
2011-12-22 01:38:32 MST WARNING: pgstat wait timeout
2011-12-22 01:38:32 MST WARNING: pgstat wait timeout
2011-12-22 01:41:28 MST LOG: checkpoint complete: wrote 329593 buffers
(41.9%); 0 transaction log file(s) added, 0 removed, 731 rec
2011-12-22 09:04:33 MST LOG: checkpoint starting: time
2011-12-22 09:06:54 MST WARNING: pgstat wait timeout
2011-12-22 09:06:54 MST WARNING: pgstat wait timeout
2011-12-22 09:06:54 MST WARNING: pgstat wait timeout
2011-12-22 09:07:33 MST WARNING: pgstat wait timeout
2011-12-22 09:07:33 MST WARNING: pgstat wait timeout
2011-12-22 09:07:33 MST WARNING: pgstat wait timeout
2011-12-22 09:13:41 MST WARNING: pgstat wait timeout
2011-12-22 09:22:07 MST ERROR: canceling autovacuum task
2011-12-22 09:22:07 MST CONTEXT: automatic vacuum of table
"ishield.public.tbltmptests"
2011-12-22 09:28:41 MST WARNING: pgstat wait timeout
2011-12-22 09:28:41 MST WARNING: pgstat wait timeout
2011-12-22 09:31:35 MST WARNING: pgstat wait timeout
2011-12-22 09:31:35 MST WARNING: pgstat wait timeout
2011-12-22 09:33:16 MST WARNING: pgstat wait timeout
2011-12-22 09:33:16 MST WARNING: pgstat wait timeout
2011-12-22 09:33:55 MST WARNING: pgstat wait timeout
2011-12-22 09:33:55 MST WARNING: pgstat wait timeout
2011-12-22 09:33:55 MST WARNING: pgstat wait timeout
2011-12-22 09:35:10 MST WARNING: pgstat wait timeout
2011-12-22 09:35:10 MST WARNING: pgstat wait timeout
2011-12-22 09:35:10 MST ERROR: canceling autovacuum task
2011-12-22 09:35:10 MST CONTEXT: automatic vacuum of table
"ishield.public.tbltmptests"
2011-12-22 09:36:21 MST WARNING: pgstat wait timeout
2011-12-22 09:36:21 MST WARNING: pgstat wait timeout
2011-12-22 09:37:02 MST WARNING: pgstat wait timeout
2011-12-22 09:37:02 MST WARNING: pgstat wait timeout
2011-12-22 09:37:02 MST WARNING: pgstat wait timeout
2011-12-22 09:37:43 MST WARNING: pgstat wait timeout
2011-12-22 09:37:43 MST WARNING: pgstat wait timeout
2011-12-22 09:38:20 MST WARNING: pgstat wait timeout
2011-12-22 09:38:20 MST WARNING: pgstat wait timeout
2011-12-22 09:38:59 MST WARNING: pgstat wait timeout
2011-12-22 09:38:59 MST WARNING: pgstat wait timeout
2011-12-22 09:38:59 MST WARNING: pgstat wait timeout
2011-12-22 09:42:33 MST LOG: checkpoint complete: wrote 362856 buffers
(46.1%); 0 transaction log file(s) added, 0 removed, 908 rec
I do have an aggressive autovacuum because I have some tables that get
updated thousands of times per minute and otherwise they bloat
terribly,
but it is very seldom that I see cancelation of autovacuum, and only on
this table, which is a temporary table used for the insertion of the
data, in which records are inserted and deleted - never updated.

-----Original Message-----
Sent: Thursday, December 22, 2011 11:45 AM
To: Benjamin Krajmalnik
Cc: pgsql-admin
Subject: Re: [ADMIN] Problem with pgstat timneouts

Post by Benjamin Krajmalnik
The only thing which I see when I run iostat is load on the drives

every

Post by Benjamin Krajmalnik
so often. Mifd0 is the database, mifd1 is the pg_xlog.

Hmm. Maybe checkpoints or something else saturating your drives for

little while? You might want to be more aggressive about smoothing

out

checkpoints, cf
http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm
At the very least I'd enable checkpoint logging for awhile and see
if there seems to be any correlation to the response-time glitches.
regards, tom lane

--
http://www.postgresql.org/mailpref/pgsql-admin

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Benjamin Krajmalnik

2012-01-04 17:54:52 UTC

Permalink

I have proceeded to do some more checking, and I see in iostat that the pg_xlog drive has a significantly higher busy state than before. Whereas it was barely busy when we first spun up the server (total %busy since we started the server is about 6%) it is now in its 80's almost steady state. We have a set of partitioned tables which are continuously updated, and based on the size of them they no longer fit in the shared memory which was allocated. Pg_xlog is in a SAS RAID 1. The server is set up with streaming replication to an identical server. The one thing which I just checked is the RAID mode on the server.
db1# megacli64 -LDInfo -lAll -aAll

Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name :
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
Size : 2.452 TB
State : Optimal
Strip Size : 256 KB
Number Of Drives per span:2
Span Depth : 6
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Access Policy : Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None

Virtual Drive: 1 (Target Id: 1)
Name :
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
Size : 418.656 GB
State : Optimal
Strip Size : 256 KB
Number Of Drives : 2
Span Depth : 1
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Access Policy : Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None

Is it possible that the WriteThrough is what is causing the high io (and maybe the pgstat wait timeouts as well)?
If this is the case, would it be safe to change the cache to Write back?

Additionally, and somewhat unrelated, is there anything special which I need to do when restarting the primary server vis-à-vis the streaming replication server? In other words, if I were to restart the main server, will the streaming replication server reconnect and pick up once the primary comes online?

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Kevin Grittner

2012-01-04 18:43:59 UTC

Permalink

Post by Benjamin Krajmalnik
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Is it possible that the WriteThrough is what is causing the high
io (and maybe the pgstat wait timeouts as well)?

Yes.

Post by Benjamin Krajmalnik
If this is the case, would it be safe to change the cache to Write back?

Maybe. How did it get into this (non-default) state? Are batteries
installed and healthy?

Post by Benjamin Krajmalnik
Additionally, and somewhat unrelated, is there anything special
which I need to do when restarting the primary server vis-à-vis
the streaming replication server? In other words, if I were to
restart the main server, will the streaming replication server
reconnect and pick up once the primary comes online?

I think that should be pretty automatic as long as you haven't
promoted the standby to be the new master.

-Kevin

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Benjamin Krajmalnik

2012-01-04 18:48:02 UTC

Permalink

Batteries are installed and healthy. I wonder if it did it during one of the charge/discharge cycles and just did not revert (interestingly, I have one more server that did just that). I am trying to get the info from LSI as to what may have caused it, if it is safe to revert the cache mode while it is running, and, of course, the exact command so I don't foobar things up.

-----Original Message-----
Sent: Wednesday, January 04, 2012 11:44 AM
To: pgsql-admin; Benjamin Krajmalnik
Subject: Re: [ADMIN] Problem with pgstat timeouts

Yes.

Post by Benjamin Krajmalnik
If this is the case, would it be safe to change the cache to Write back?

Maybe. How did it get into this (non-default) state? Are batteries
installed and healthy?

I think that should be pretty automatic as long as you haven't
promoted the standby to be the new master.
-Kevin

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Benjamin Krajmalnik

2012-01-04 22:48:09 UTC

Permalink

Just a followup - thanks to Tom, Kevin, and the rest of the user group, as usual, for their great support.
It appears that I have hit a nice firmware bug, so I am going to have to upgrade the controller firmware. Apparently, it has some issues communicating with the BBU so although the BBU is showing as good, trying to bring it to writeback says it is "below threshold". Since I am going to be taking down 4 servers (our entire infrastructure) to perform the firmware upgrades, I will end up replacing the BBU's for good measure.

Once again, thanks to everyone for their assistance, and I guess this pgstat timeout can be attributed to the controller :)

-----Original Message-----
Sent: Wednesday, January 04, 2012 11:48 AM
To: Kevin Grittner; pgsql-admin
Subject: Re: [ADMIN] Problem with pgstat timeouts
Batteries are installed and healthy. I wonder if it did it during one
of the charge/discharge cycles and just did not revert (interestingly,
I have one more server that did just that). I am trying to get the
info from LSI as to what may have caused it, if it is safe to revert
the cache mode while it is running, and, of course, the exact command
so I don't foobar things up.

-----Original Message-----
Sent: Wednesday, January 04, 2012 11:44 AM
To: pgsql-admin; Benjamin Krajmalnik
Subject: Re: [ADMIN] Problem with pgstat timeouts

Yes.

Post by Benjamin Krajmalnik
If this is the case, would it be safe to change the cache to Write back?

Maybe. How did it get into this (non-default) state? Are batteries
installed and healthy?

I think that should be pretty automatic as long as you haven't
promoted the standby to be the new master.
-Kevin

--
http://www.postgresql.org/mailpref/pgsql-admin

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin