[ADMIN] Delay in completion of aggregation inserts when run in a single commit - PG 9.1.2
Craig Ringer
2012-07-05 06:33:50 UTC
- Staging table population happens in the expected time: anywhere from
10-15 minutes for each process.
- The aggregation process, a set of 15 insert queries, almost never
completes. The entire process happens in a single commit (population
of staging tables and aggregation).
When you say "never completes" - can you be more specific?

Is the query not progressing - say, waiting on a lock?

Is it doing work, but performing very poorly?

Does it terminate with a failure? If so, error message?

If the query just keeps running but you're not sure what it is or isn't
doing, you should examine pg_stat_activity and pg_locks. From
pg_stat_activity you can get the backend process ID; examine that
backend using system tools (ps, top, etc) to see if it's using CPU and
whether it's doing disk I/O. From pg_locks you can get an idea of
whether a query is stuck waiting on a lock.
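
For example (these are the standard system views; the column names shown
are the 9.1 ones - procpid and current_query were renamed to pid and
query in 9.2):

-- Find the backend PID of the long-running statement and whether it's
-- blocked waiting on a lock:
SELECT procpid, waiting, query_start, current_query
FROM pg_stat_activity;

-- Locks that have been requested but not yet granted:
SELECT locktype, relation::regclass, mode, pid
FROM pg_locks
WHERE NOT granted;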
Could you be kind enough to explain how queries run differently when
in a "single commit" as opposed to running each of them individually?
There shouldn't be tons of difference. The first queries continue to
hold locks while the others run, but since you're in the same
transaction it won't affect subsequent queries, only concurrent ones.

If you have other concurrent work going on it's possible you're running
into locking issues that you don't encounter when you release locks with
individual commits.

Also, some resources can only be released when a transaction commits. If
you're doing something like using huge numbers of savepoints or PL/PgSQL
BEGIN ... EXCEPTION blocks it's possible you're running into resource
issues. The SAVEPOINT code has improved massively since 8.3, when I had
similar issues, but savepoints still aren't free.
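
For example, a loop like this (a hypothetical sketch, not the original
poster's code) sets an implicit savepoint on every entry into the
exception block:

CREATE TABLE target_table (i integer PRIMARY KEY);  -- hypothetical table

DO $$
BEGIN
    FOR n IN 1..100000 LOOP
        BEGIN
            -- n % 1000 guarantees duplicates, so the handler actually fires
            INSERT INTO target_table VALUES (n % 1000);
        EXCEPTION WHEN unique_violation THEN
            NULL;  -- entering this block set an implicit savepoint
        END;
    END LOOP;
END;
$$;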

Start by looking at the locking situation, and by determining whether
your query is making slow forward progress or is simply waiting.
postgresql.conf remains untouched, except for log writing, which has
been set to "all".
That's going to be VERY sub-optimal if you're bulk-loading data. You'll
need to push checkpoint_segments right up. Check your logs; you're
probably seeing warnings about checkpoints being too frequent. Hell, my
checkpoint_segments is at 50 on my *laptop* - albeit a very fast
SSD-equipped laptop that does lots of bulk-data unit testing.

You'll want to push effective_cache_size up to 4 or 5 GB given you have
three instances. Hard to say re shared_buffers, I have little experience
with running multiple instances on one machine.
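
As a rough sketch, something like this in postgresql.conf might be a
starting point (illustrative values only; the right numbers depend on
your hardware and workload):

checkpoint_segments = 32            # more WAL between checkpoints for bulk loads
checkpoint_completion_target = 0.9  # spread checkpoint I/O out over time
effective_cache_size = 4GB          # planner hint: OS cache plus shared_buffers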

--
Craig Ringer
Albe Laurenz
2012-07-05 08:00:08 UTC
[...]
- The aggregation process, a set of 15 insert queries, almost never
completes. The entire process happens in a single commit (population of
staging tables and aggregation).
- When I run these 15 queries individually, it happens in no time.
Could you be kind enough to explain how queries run differently when
in a "single commit" as opposed to running each of them individually?
Apart from minor differences (current_timestamp, ...) the
main difference is that there is no COMMIT after each query.
COMMIT is expensive since it requires a write to hard disk.

Try to get the execution plans of the queries in both cases
and see if you can spot a difference.
The auto_explain contrib module might come in handy for that.
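
For example, loaded per-session (auto_explain and these parameters are
standard contrib; the threshold is just an illustration):

LOAD 'auto_explain';
SET auto_explain.log_min_duration = '10s';  -- log plans of statements slower than this
SET auto_explain.log_analyze = on;          -- include actual rows/timings, at some overhead

The logged plans can then be compared with EXPLAIN output from running
the same queries standalone.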

Yours,
Laurenz Albe
Craig Ringer
2012-07-05 12:22:47 UTC
[...]
- The aggregation process, a set of 15 insert queries, almost never
completes. The entire process happens in a single commit (population of
staging tables and aggregation).
- When I run these 15 queries individually, it happens in no time.
Could you be kind enough to explain how queries run differently when
in a "single commit" as opposed to running each of them individually?
Apart from minor differences (current_timestamp, ...) the
main difference is that there is no COMMIT after each query.
COMMIT is expensive since it requires a write to hard disk.
Try to get the execution plans of the queries in both cases
and see if you can spot a difference.
Actually, that's a thought. Autovacuum can't see uncommitted work, and
thus can't analyze it. Maybe they need an explicit ANALYZE or two after
a bulk insert or update during their data load.
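
Something along these lines inside the load transaction (staging_table
is a hypothetical stand-in for the real tables):

CREATE TABLE staging_table (id integer, val integer);  -- hypothetical stand-in

BEGIN;
INSERT INTO staging_table
    SELECT g, g * 10 FROM generate_series(1, 100000) AS g;  -- bulk load
-- ANALYZE sees the uncommitted rows of its own transaction, so the
-- aggregation queries that follow get fresh statistics:
ANALYZE staging_table;
-- ... the aggregation inserts would run here ...
COMMIT;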

--
Craig Ringer
Akash
2012-07-06 05:06:13 UTC
@Craig

I could only wait for 6 hours. I cancelled the process after that.

No locks on any of the tables when running.

Yes, it was progressing. Of the 15 tables, when I cancelled after 6 hours,
the 3rd table was being populated (as seen in pg_stat_activity).

Below are some changes I have made in postgresql.conf, some based on your
suggestions:

1) constraint_exclusion = partition, which I had already set, supported by
a trigger on every insert with around 100 checks for each table (date
constraints).
2) log_destination = 'stderr'
3) logging_collector = on
4) autovacuum = off; we are doing a daily VACUUM ANALYZE on all tables.
5) shared_buffers = 32MB
6) max_prepared_transactions = 100
7) work_mem = 16MB
8) maintenance_work_mem = 64MB
9) wal_writer_delay = 1000ms
10) checkpoint_segments = 16

After these changes the process is moving forward, but somehow I feel it
is only a matter of time before I reach the next choking point, since
process completion is taking a little longer each time I run it.

Another thing I noticed - when I run the query:

select relname, age(relfrozenxid) from pg_class order by age(relfrozenxid)
desc;

relname | age
--------------------------------------------------+------------
pg_toast_28344_index | 2147483647
pg_toast_28351_index | 2147483647
pg_toast_33106_index | 2147483647
pg_toast_33099_index | 2147483647
pg_toast_32128_index | 2147483647
pg_toast_28232_index | 2147483647
pg_toast_33092_index | 2147483647
promo_seq | 2147483647
pg_toast_33085_index | 2147483647
pg_toast_32135_index | 2147483647
pg_toast_33120_index | 2147483647
... and there are many more.

Regards,
Akash.

Craig Ringer
2012-07-06 05:28:43 UTC
Post by Akash
@Craig
I could only wait for 6 hours. I cancelled the process after that.
No locks on any of the tables when running.
That's a very ambiguous statement. I'm assuming you are saying "our code
does not take any explicit locks on those tables using LOCK TABLE or
SELECT ... FOR SHARE / FOR UPDATE" - because there /most certainly are/
locks on those tables whenever you're running a query against them. Even
a simple SELECT takes out a lock to prevent the table from being dropped
while the query runs.
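
You can see this for yourself (a small demonstration using the session's
own backend PID):

BEGIN;
SELECT count(*) FROM pg_class;           -- any ordinary query
SELECT relation::regclass, mode, granted
FROM pg_locks
WHERE pid = pg_backend_pid();            -- the AccessShareLock is held until COMMIT
COMMIT;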

It could be that no locks are held by transactions other than the one
doing the run, though.
Post by Akash
Yes, it was progressing. Of the 15 tables, when I cancelled after 6 hours,
the 3rd table was being populated (as seen in pg_stat_activity).
OK. If you turn auto_explain on and use it to get a plan for the slow
queries, do those plans differ from the plans produced when running the
same queries standalone?

Did you examine iostat? Did you look at what work the postgres process
was doing, to see whether it was busy with CPU or I/O (iowait) or idle
waiting for something else to happen?
Post by Akash
4) autovacuum = off; we are doing a daily VACUUM ANALYZE on all tables.
No!

Unless your database has /no/ write activity other than these batch
loads, you should have autovacuum on. The more frequently autovacuum
runs, the better a job it will do of preventing bloat, maintaining table
stats, etc.

If you have tables that are only ever modified by a bulk-load script
that then immediately vacuums them, set the autovacuum parameters for
those tables so they're excluded; don't turn autovacuum off entirely.
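
For example (staging_table is a hypothetical stand-in; this is the
standard per-table storage-parameter syntax):

-- Opt this one table out of autovacuum instead of disabling it cluster-wide:
ALTER TABLE staging_table SET (autovacuum_enabled = false);
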
Post by Akash
5) shared_buffers = 32MB
That's tiny and will severely constrain Pg's resources.

I notice you didn't set effective_cache_size either.
Post by Akash
After these changes the process is moving forward, but somehow I feel it
is only a matter of time before I reach the next choking point, since
process completion is taking a little longer each time I run it.
Well, your tables and indexes are probably bloating horribly because
autovacuum is turned off, so that's not surprising.

If you drop and recreate, or TRUNCATE, the tables between load runs you
might be OK with autovac off for those tables, but what you're
describing makes me think otherwise.
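
If you do go that route, a TRUNCATE between runs might look like this
(staging_table again being a hypothetical stand-in):

-- TRUNCATE reclaims all space at once and resets the table, so bloat
-- can't carry over from one load run to the next:
TRUNCATE TABLE staging_table;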

--
Craig Ringer
Eduardo Sá dos Reis
2012-07-09 13:26:23 UTC
Hi,

The following messages are occurring in my Postgres log:

1 4ffad9f4.67db 2012-07-09 10:17:40 BRT LOG: connection received: host=10.20.0.1 port=2368
2 4ffad9f4.67db 2012-07-09 10:17:40 BRT LOG: connection authorized: user=usu_cad database=prd_cad
3 4ffad9f4.67db 2012-07-09 10:17:42 BRT LOG: could not send data to client: Connection reset by peer
4 4ffad9f4.67db 2012-07-09 10:17:42 BRT LOG: unexpected EOF on client connection

Users cannot run the system.

Does anyone know what might be happening?

Thanks.

Eduardo Reis
Craig Ringer
2012-07-09 23:40:20 UTC
Post by Eduardo Sá dos Reis
Hi,
1 4ffad9f4.67db 2012-07-09 10:17:40 BRT LOG: connection received: host=10.20.0.1 port=2368
2 4ffad9f4.67db 2012-07-09 10:17:40 BRT LOG: connection authorized: user=usu_cad database=prd_cad
3 4ffad9f4.67db 2012-07-09 10:17:42 BRT LOG: *could not send data to client: Connection reset by peer*
4 4ffad9f4.67db 2012-07-09 10:17:42 BRT LOG: *unexpected EOF on client connection*
Users cannot run the system.
If this is a Windows system with a 3rd party software firewall (Windows
Firewall is OK, anything else isn't) then remove or disable the firewall
and see if the problem goes away.

Otherwise, more info please. PostgreSQL version, OS and version, client
program, whether the client program is using psqlODBC, pgJDBC, libpq
directly, etc.
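
For the server side, one way to capture that is:

SELECT version();  -- reports the PostgreSQL server version and build platform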

--
Craig Ringer
