CPU causes 100% load in user space when ntp client runs and postgresql is under heavy load

Discussion:

Dennis Brouwer

2012-09-24 13:53:10 UTC

Dear mailing list,

I am currently benchmarking postgresql-9.2 on Debian squeeze (Linux
2.6.32-5-amd64 x86_64 GNU/Linux).

The server used for benchmarking has a quad-core E5-1620, 32 GB RAM and, for
storage, an LSI-9265 with 8 SSDs. The freshly restored database is about
90 GB in size and deliberately does not fit in RAM, so that the IO system is
tested as well.

The database mainly consists of a partitioned table with 6 partitions. In
order to test the performance I run 32 queries in parallel doing some
grouping queries on the partitioned table. Every query runs in its own
transaction. While the number of concurrent queries may be higher than
recommended, we consider this a stress test as well.
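
A parallel run like the one described above could be driven by a small
harness such as the following. This is only a minimal sketch: `run_parallel`
and the `run_query` callable are hypothetical names, and `run_query` is
assumed to open its own connection and transaction for each query. Timing
every query separately makes the occasional very slow one easy to spot.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_parallel(run_query, queries, workers=32):
    """Run every query through run_query on a pool of worker threads.

    run_query is a hypothetical callable supplied by the caller; it is
    assumed to execute one query in its own transaction.
    Returns a list of (query, elapsed_seconds) pairs in input order.
    """
    def timed(query):
        start = time.monotonic()
        run_query(query)
        return query, time.monotonic() - start

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of the input queries.
        return list(pool.map(timed, queries))
```

Sorting the returned pairs by elapsed time shows which of the 32 queries
stalled in a given run.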

Last week I was repeatedly able to run all these tests on the database
without any issue, but recently, all of a sudden and at random, some of the
queries ran a factor of 100 slower. It may take hours to complete the
transaction. At the same moment we see a dramatic decrease in IO and the
CPU is nearly 100% busy in user space.
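
The split between user-space time and IO wait can be quantified from two
snapshots of the aggregate `cpu` line in /proc/stat. The following is a
minimal sketch, assuming the usual Linux field order (user, nice, system,
idle, iowait, irq, softirq, steal); the function name `cpu_shares` is
made up for illustration.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def cpu_shares(before, after):
    """Return the percentage of CPU time per category between two
    aggregate 'cpu' lines taken from /proc/stat (assumed field order)."""
    def parse(line):
        parts = line.split()
        if not parts or parts[0] != "cpu":
            raise ValueError("expected the aggregate 'cpu' line")
        # Ignore trailing guest counters; guest time is already part of user.
        return [int(value) for value in parts[1:1 + len(FIELDS)]]

    deltas = [b - a for a, b in zip(parse(before), parse(after))]
    total = sum(deltas)
    if total <= 0:
        raise ValueError("snapshots are identical or out of order")
    return {name: 100.0 * delta / total for name, delta in zip(FIELDS, deltas)}
```

Reading the first line of /proc/stat twice, a few seconds apart, and passing
both lines to `cpu_shares` gives the user and iowait percentages for that
interval. A stalled run as described above would show a high `user` share
and a low `iowait` share.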

After days of testing I may have found the cause: the ntp client. If I stop
the ntp client the problem vanishes.

I have started reading on spinlocks and other related material, but this
is all rather complicated stuff and I kindly ask in which direction I should
search. The issue can be reproduced for both postgresql-9.1 and
postgresql-9.2 and perhaps can be rephrased as: Very high CPU load in user
space (at random) with ntp enabled and (long?) running transactions.

Perhaps somebody from the mailing list has sufficient experience debugging
this kind of behaviour to exclude a bug in postgresql. Much appreciated!

Very kind regards,

Dennis Brouwer
M4N

P.S. If required I can provide more details like: the queries, auto_explain
output, iostat, top, iotop, postgresql.conf etc etc.

Tom Lane

2012-09-24 16:30:45 UTC

Post by Dennis Brouwer
Last week I was repeatedly able to run all these tests on the database
without any issue but recently, all of a sudden at random, some of the
queries performed a factor 100 less. It may take hours to complete the
transaction. At the same moment we see a dramatic decrease in IO and the
CPU is nearly 100% busy in user space.
After days of testing I may have found the cause: the ntp client. If I stop
the ntp client the problem vanishes.
I have started reading on spinlocks and other related material but this all
is rather complicated stuff and kindly ask in what direction I should
search. The issue can be reproduced for both postgresql-9.1 and
postgresql-9.2 and perhaps can be rephrased as: Very high CPU load in user
space (at random) with ntp enabled and (long?) running transactions.

That's really bizarre. What "ntp client" are you using exactly? Is it
configured to adjust the system clock by slewing, or by stepping? Can
you identify what part of the code is eating CPU (try perf or oprofile)?

regards, tom lane

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Dennis Brouwer

2012-09-24 20:00:54 UTC

Dear Tom Lane,

Thanks for the tip about using perf or oprofile, but ntp might not be the
problem at all. During testing with ntp off the problem was still
reproducible, albeit less frequently. It might have something to do with
excessive row locking. We are currently looking into the explain results from
the postgresql log to see if there is a pattern to be observed, and reading
the pg_locks chapters from the books ;-). It will take some time to
understand what's going on.

I still might need to use the tools to identify where in the code the CPU
user load comes from.

I'll keep you posted.

Kind regards,

Dennis Brouwer
M4N

Dennis Brouwer

2012-09-25 12:58:31 UTC

Hi Tom,

I have now excluded ntp as the root cause of the CPU cycles being wasted in
user space.

I installed perf and monitored two servers (with different postgresql
versions and hardware specifications) which are "hanging", and I have some
output. Since I'm no expert at interpreting the output of perf top, what
would be the next step?

Would it be a good idea to a) read the perf manual and/or b) provide the
output of perf top as a first step to see what is going on?

What I think I see is a lot of spin_lock_irq and scheduler activity.

Any guidance much appreciated.

Most Regards,

Dennis Brouwer
M4N

Tom Lane

2012-09-25 15:57:36 UTC

Post by Dennis Brouwer
I now have excluded ntp as root cause for the CPU cycles being wasted in
user space.

Good, cause that wasn't making any sense at all.

Post by Dennis Brouwer
I installed perf and monitored two servers (with different postgresql
versions and hardware specification) which are "hanging" and have some
output. Since I'm no die-hard at interpreting the output of perf top what
would be the next step to do?

I'd suggest asking for help in pgsql-performance. I don't know much
about perf either (still an oprofile guy), but the people who do know
it hang out there.

regards, tom lane

Marcello Perathoner

2012-09-24 21:05:03 UTC

Any chance you are hitting this known Linux bug (the leap-second issue
discussed in the links below) in conjunction with a misconfigured ntp
server? I.e., does a

# date -s now

fix the cpu load?

http://blog.mozilla.org/it/2012/06/30/mysql-and-the-leap-second-high-cpu-and-the-fix/comment-page-1/#comment-1471

http://serverfault.com/questions/403732/anyone-else-experiencing-high-rates-of-linux-server-crashes-during-a-leap-second

--
Marcello Perathoner
***@gutenberg.org