Discussion:
Autovacuum issues with truncate and create index ...
Baptiste LHOSTE
2012-12-20 13:40:46 UTC
Hi everybody,

We are having issues with the autovacuum process.

Our database is composed of two kinds of tables:
- the first kind are partitions,
- the second kind are classic tables.

Every five minutes we execute the following process (a rough SQL sketch of one cycle follows the list):
- we drop the constraint of the target partition
- we drop the indexes (2) of the target partition
- we truncate all data of the target partition
- we copy new data (around 150 000 rows) into the target partition
- we create the indexes (2) on the target partition
- we create the constraint of the target partition
- we aggregate data from the main table of the previous partition and insert the result into the second kind of tables (max 200 rows)
- finally we delete old data from the second kind of tables
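
In SQL, one cycle looks roughly like this (all table, column, index and constraint names are only illustrative, not our real ones):

-- steps 1-3: strip the target partition
ALTER TABLE flowpool_1_segment_200 DROP CONSTRAINT flowpool_1_segment_200_check;
DROP INDEX flowpool_1_segment_200_idx1;
DROP INDEX flowpool_1_segment_200_idx2;
TRUNCATE flowpool_1_segment_200;
-- step 4: load the new data (around 150 000 rows)
COPY flowpool_1_segment_200 FROM '/tmp/new_data.copy';
-- steps 5-6: rebuild the indexes and the constraint
CREATE INDEX flowpool_1_segment_200_idx1 ON flowpool_1_segment_200 (start_time);
CREATE INDEX flowpool_1_segment_200_idx2 ON flowpool_1_segment_200 (src_addr);
ALTER TABLE flowpool_1_segment_200 ADD CONSTRAINT flowpool_1_segment_200_check CHECK (segment_id = 200);
-- step 7: aggregate the previous partition into a table of the second kind (max 200 rows)
INSERT INTO agg_example (period, total_bytes)
SELECT date_trunc('hour', start_time), sum(bytes) FROM flowpool_1_segment_199 GROUP BY 1;
-- step 8: delete old data from the table of the second kind
DELETE FROM agg_example WHERE period < now() - interval '30 days';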

Our problem is that we cannot perform the last action (deletion of old data) because it takes too much time (2-3 minutes for about 200 rows). Some tables have about 5 000 000 000 rows to delete, so we really need this step to run.

So I looked in pg_stat_all_tables, and I can see that each partition of the first kind of tables is analyzed once a day, but the second kind of tables are analyzed only once every 5 or 6 days.
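
For reference, that check was a query along these lines:

SELECT relname, last_analyze, last_autoanalyze, last_autovacuum
FROM pg_stat_all_tables
WHERE schemaname = 'public'
ORDER BY last_autoanalyze NULLS FIRST;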

I have tried to decrease the autovacuum thresholds for these tables with the following command:
ALTER TABLE tableName SET (autovacuum_vacuum_threshold=20, autovacuum_vacuum_scale_factor=0.01, autovacuum_analyze_threshold=10, autovacuum_analyze_scale_factor=0.005, autovacuum_vacuum_cost_delay=10, autovacuum_vacuum_cost_limit=1000);

With these settings the autovacuum process starts to work on the second kind of tables, but our process then blocks at step 3 (truncate) or step 5 (create index).

As soon as I reset the autovacuum thresholds for the second kind of tables, our process runs perfectly again.
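
The reset itself is just something like:

ALTER TABLE tableName RESET (autovacuum_vacuum_threshold, autovacuum_vacuum_scale_factor, autovacuum_analyze_threshold, autovacuum_analyze_scale_factor, autovacuum_vacuum_cost_delay, autovacuum_vacuum_cost_limit);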

Could our process create a deadlock with the autovacuum process?

Are we doing something wrong in the autovacuum configuration?

Our PostgreSQL instance runs on a server with 16 CPUs, 16 GB of RAM, and 4 GB of swap.


name | current_setting
-----------------------------+-------------------------------------------------------------------------------------------------------
version | PostgreSQL 8.4.11 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.4.real (Debian 4.4.5-8) 4.4.5, 64-bit
autovacuum_naptime | 30s
checkpoint_segments | 64
constraint_exclusion | partition
custom_variable_classes | alaloop
effective_cache_size | 14GB
external_pid_file | /var/run/postgresql/8.4-main.pid
lc_collate | en_US.UTF-8
lc_ctype | en_US.UTF-8
listen_addresses | *
log_autovacuum_min_duration | 0
log_line_prefix | %t
maintenance_work_mem | 1GB
max_connections | 100
max_locks_per_transaction | 256
max_prepared_transactions | 100
max_stack_depth | 5MB
port | 5432
server_encoding | UTF8
shared_buffers | 756MB
ssl | on
stats_temp_directory | /dev/shm
temp_buffers | 1536
TimeZone | localtime
unix_socket_directory | /var/run/postgresql
work_mem | 756MB


---
Baptiste LHOSTE
***@alaloop.com

ALALOOP S.A.S. - Technopole Izarbel - 64210 Bidart
Téléphone : +33 (0) 5 59 41 51 10
www.alaloop.com
Kevin Grittner
2012-12-20 15:04:30 UTC
Post by Baptiste LHOSTE
- finally we delete old data from the second kind of tables
With these settings the autovacuum process starts to work on the
second kind of tables, but our process then blocks at step 3
(truncate) or step 5 (create index).
As soon as I reset the autovacuum thresholds for the second kind
of tables, our process runs perfectly again.
Could our process create a deadlock with the autovacuum process?
Would it be possible for you to create such a situation and capture
the contents of pg_stat_activity and pg_locks while it is going on?
What messages related to autovacuum or deadlocks do you see in the
server log while this is going on?
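
Something along these lines would do, resolving the relation OIDs so the lock list is easier to read (just a suggestion, the raw catalog contents are fine too):

SELECT * FROM pg_stat_activity;

SELECT l.pid, l.locktype, l.relation::regclass AS relation, l.mode, l.granted
FROM pg_locks l
ORDER BY l.pid;
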
Post by Baptiste LHOSTE
PostgreSQL 8.4.11
Would it be possible to update your 8.4 installation to the latest
bug fix (currently 8.4.15) to rule out the influence of any bugs
which have already been fixed?

-Kevin
Baptiste LHOSTE
2012-12-20 15:59:43 UTC
Post by Kevin Grittner
Would it be possible for you to create such a situation and capture
the contents of pg_stat_activity and pg_locks while it is going on?
What messages related to autovacuum or deadlocks do you see in the
server log while this is going on?
Before the change we could only see automatic ANALYZE log entries for the first kind of tables.

After the change I can see automatic ANALYZE log entries for both kinds of tables.

Here is the pg_stat_activity output during the issue:
datid | datname | procpid | usesysid | usename | current_query | waiting | xact_start | query_start | backend_start | client_addr | client_port
-------+---------+---------+----------+----------+-------------------------------------------------------------------------+---------+-------------------------------+-------------------------------+-------------------------------+---------------+-------------
42539 | flows | 13792 | 16384 | asa | TRUNCATE flowpool_24_segment_221 | f | 2012-12-20 16:55:01.346986+01 | 2012-12-20 16:55:01.347051+01 | 2012-12-20 16:50:22.073586+01 | 10.48.168.160 | 33550
42539 | flows | 12443 | 16384 | asa | <IDLE> | f | | 2012-12-20 15:51:32.787359+01 | 2012-12-20 13:00:09.968633+01 | 10.48.168.163 | 58799
42539 | flows | 13804 | 10 | postgres | autovacuum: ANALYZE public.agg_t1406_outgoing_a3_src_net_and_dst_net_f5 | f | 2012-12-20 16:56:00.919196+01 | 2012-12-20 16:56:00.919196+01 | 2012-12-20 16:51:27.079612+01 | |
42539 | flows | 12444 | 16384 | asa | <IDLE> | f | | 2012-12-20 16:09:53.123326+01 | 2012-12-20 13:00:16.82323+01 | 10.48.168.163 | 58800
42539 | flows | 12453 | 16384 | asa | <IDLE> | f | | 2012-12-20 13:01:15.019182+01 | 2012-12-20 13:01:15.015847+01 | 10.48.168.163 | 58801
42539 | flows | 13629 | 16384 | asa | TRUNCATE flowpool_15_segment_216 | f | 2012-12-20 16:55:01.783653+01 | 2012-12-20 16:55:01.783725+01 | 2012-12-20 16:30:22.331191+01 | 10.48.168.160 | 33495
42539 | flows | 13793 | 16384 | asa | TRUNCATE flowpool_19_segment_215 | f | 2012-12-20 16:55:01.296588+01 | 2012-12-20 16:55:01.297449+01 | 2012-12-20 16:50:22.095245+01 | 10.48.168.160 | 33551
42539 | flows | 13822 | 10 | postgres | autovacuum: ANALYZE public.agg_t449_incoming_a3_src_net_and_dst_net_f5 | f | 2012-12-20 16:53:34.720815+01 | 2012-12-20 16:53:34.720815+01 | 2012-12-20 16:51:57.097049+01 | |
42539 | flows | 13658 | 16384 | asa | TRUNCATE flowpool_2_segment_218 | f | 2012-12-20 16:55:01.606198+01 | 2012-12-20 16:55:01.606266+01 | 2012-12-20 16:35:30.327792+01 | 10.48.168.160 | 33505
42539 | flows | 13680 | 16384 | asa | TRUNCATE flowpool_16_segment_228 | f | 2012-12-20 16:55:01.710645+01 | 2012-12-20 16:55:01.713793+01 | 2012-12-20 16:40:21.689958+01 | 10.48.168.160 | 33512
42539 | flows | 13870 | 16384 | asa | select * from pg_stat_activity; | f | 2012-12-20 16:56:05.58373+01 | 2012-12-20 16:56:05.58373+01 | 2012-12-20 16:56:03.432322+01 | | -1
42539 | flows | 13770 | 16384 | asa | TRUNCATE flowpool_31_segment_208 | f | 2012-12-20 16:55:01.782363+01 | 2012-12-20 16:55:01.791481+01 | 2012-12-20 16:45:24.082223+01 | 10.48.168.160 | 33544
42539 | flows | 13771 | 16384 | asa | TRUNCATE flowpool_17_segment_211 | f | 2012-12-20 16:55:01.729515+01 | 2012-12-20 16:55:01.736037+01 | 2012-12-20 16:45:24.147856+01 | 10.48.168.160 | 33545
42539 | flows | 13849 | 10 | postgres | autovacuum: ANALYZE public.agg_t1251_incoming_a7_src_port_and_proto_f5 | f | 2012-12-20 16:56:01.00984+01 | 2012-12-20 16:56:01.00984+01 | 2012-12-20 16:52:27.111586+01 | |
42539 | flows | 13795 | 16384 | asa | TRUNCATE flowpool_22_segment_217 | f | 2012-12-20 16:55:01.342442+01 | 2012-12-20 16:55:01.345095+01 | 2012-12-20 16:50:22.159256+01 | 10.48.168.160 | 33553



Here is the pg_locks output during the issue:
locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted
---------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+--------------------------+---------
transactionid | | | | | | 27720953 | | | | 12/0 | 13770 | ExclusiveLock | t
virtualxid | | | | | 11/3140 | | | | | 11/3140 | 13870 | ExclusiveLock | t
relation | 42539 | 106752 | | | | | | | | 13/0 | 13771 | ShareLock | t
relation | 42539 | 106752 | | | | | | | | 13/0 | 13771 | AccessExclusiveLock | t
relation | 42539 | 88542 | | | | | | | | 12/0 | 13770 | ShareLock | t
relation | 42539 | 88542 | | | | | | | | 12/0 | 13770 | AccessExclusiveLock | t
transactionid | | | | | | 27720938 | | | | 1/0 | 13792 | ExclusiveLock | t
transactionid | | | | | | 27720969 | | | | 7/2476 | 13793 | ExclusiveLock | t
virtualxid | | | | | 13/2821 | | | | | 13/0 | 13771 | ExclusiveLock | t
relation | 42539 | 106755 | | | | | | | | 13/0 | 13771 | ShareLock | t
relation | 42539 | 106755 | | | | | | | | 13/0 | 13771 | AccessExclusiveLock | t
relation | 42539 | 112041 | | | | | | | | 1/0 | 13792 | ShareLock | t
relation | 42539 | 112041 | | | | | | | | 1/0 | 13792 | AccessExclusiveLock | t
virtualxid | | | | | 1/15720 | | | | | 1/0 | 13792 | ExclusiveLock | t
virtualxid | | | | | 7/2476 | | | | | 7/2476 | 13793 | ExclusiveLock | t
relation | 42539 | 88544 | | | | | | | | 12/0 | 13770 | AccessExclusiveLock | t
transactionid | | | | | | 27720971 | | | | 9/2935 | 13658 | ExclusiveLock | t
relation | 42539 | 10969 | | | | | | | | 11/3140 | 13870 | AccessShareLock | t
relation | 42539 | 6697688 | | | | | | | | 14/2459 | 13849 | AccessShareLock | t
relation | 42539 | 49499 | | | | | | | | 8/15669 | 13822 | AccessShareLock | t
relation | 42539 | 88539 | | | | | | | | 12/0 | 13770 | ShareLock | t
relation | 42539 | 88539 | | | | | | | | 12/0 | 13770 | AccessExclusiveLock | t
relation | 42539 | 112038 | | | | | | | | 1/0 | 13792 | ShareLock | t
relation | 42539 | 112038 | | | | | | | | 1/0 | 13792 | AccessExclusiveLock | t
relation | 42539 | 106757 | | | | | | | | 13/0 | 13771 | AccessExclusiveLock | t
virtualxid | | | | | 6/2974 | | | | | 6/2974 | 13629 | ExclusiveLock | t
relation | 42539 | 50023 | | | | | | | | 3/3137 | 13804 | ShareUpdateExclusiveLock | t
virtualxid | | | | | 3/3137 | | | | | 3/3137 | 13804 | ExclusiveLock | t
relation | 42539 | 80808 | | | | | | | | 7/2476 | 13793 | RowExclusiveLock | t
transactionid | | | | | | 27720951 | | | | 13/0 | 13771 | ExclusiveLock | t
virtualxid | | | | | 9/2935 | | | | | 9/2935 | 13658 | ExclusiveLock | t
relation | 42539 | 49492 | | | | | | | | 8/15669 | 13822 | ShareUpdateExclusiveLock | t
virtualxid | | | | | 15/2206 | | | | | 15/2206 | 13795 | ExclusiveLock | t
relation | 42539 | 91227 | | | | | | | | 9/2935 | 13658 | RowExclusiveLock | t
relation | 42539 | 49498 | | | | | | | | 8/15669 | 13822 | AccessShareLock | t
relation | 42539 | 112043 | | | | | | | | 1/0 | 13792 | AccessExclusiveLock | t
relation | 42539 | 6697685 | | | | | | | | 14/2459 | 13849 | ShareUpdateExclusiveLock | t
virtualxid | | | | | 14/2459 | | | | | 14/2459 | 13849 | ExclusiveLock | t
relation | 42539 | 50026 | | | | | | | | 3/3137 | 13804 | RowExclusiveLock | t
relation | 42539 | 6697689 | | | | | | | | 14/2459 | 13849 | AccessShareLock | t
relation | 42539 | 86013 | | | | | | | | 6/2974 | 13629 | RowExclusiveLock | t
virtualxid | | | | | 8/15669 | | | | | 8/15669 | 13822 | ExclusiveLock | t
virtualxid | | | | | 12/14901 | | | | | 12/0 | 13770 | ExclusiveLock | t
transactionid | | | | | | 27720970 | | | | 6/2974 | 13629 | ExclusiveLock | t
relation | 42539 | 50027 | | | | | | | | 3/3137 | 13804 | RowExclusiveLock | t
(45 rows)
Post by Kevin Grittner
Would it be possible to update your 8.4 installation to the latest
bug fix (currently 8.4.15) to rule out the influence of any bugs
which have already been fixed?
Is there a way to upgrade without having to dump all the data and restore it after the upgrade?

Best regards, Baptiste.

---
Baptiste LHOSTE
***@alaloop.com

ALALOOP S.A.S. - Technopole Izarbel - 64210 Bidart
Téléphone : +33 (0) 5 59 41 51 10
www.alaloop.com
Baptiste LHOSTE
2012-12-20 16:11:09 UTC
Post by Baptiste LHOSTE
Post by Kevin Grittner
Would it be possible to update your 8.4 installation to the latest
bug fix (currently 8.4.15) to rule out the influence of any bugs
which have already been fixed?
Is there a way to upgrade without having to dump all the data and restore it after the upgrade?
I have checked, but Debian only offers an upgrade to 8.4.13.

Best regards, Baptiste.

---
Baptiste LHOSTE
***@alaloop.com

ALALOOP S.A.S. - Technopole Izarbel - 64210 Bidart
Téléphone : +33 (0) 5 59 41 51 10
www.alaloop.com
Kevin Grittner
2012-12-20 16:42:08 UTC
[no processes waiting]
[all locks granted]
Was the blocking you described occurring at the time you captured
this? It doesn't seem to be showing any problem.
Is there a way to upgrade without having to dump all the data and
restore it after the upgrade?
A minor release (where the version number doesn't change before the
second dot) never requires a dump and restore. There is sometimes a
need to do some cleanup work; for example, if a bug is fixed which
could corrupt a particular type of index, the release notes may
recommend rebuilding all indexes of that type to repair any damage
which may have occurred before the bug was fixed.
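
As a purely hypothetical example (the index and table names are placeholders), such a cleanup step would be along the lines of:

REINDEX INDEX some_damaged_index;
-- or, to rebuild every index on one table:
REINDEX TABLE some_table;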

http://www.postgresql.org/support/versioning/

-Kevin
Baptiste LHOSTE
2012-12-20 17:08:43 UTC
Post by Kevin Grittner
Was the blocking you described occurring at the time you captured
this? It doesn't seem to be showing any problem.
Yes indeed. We have noticed that no process seems to be in a waiting state, but:
- before the autovacuum process starts to work on both kinds of tables, truncate and index creation take less than 2 seconds
- after the autovacuum process starts to work on both kinds of tables, truncate and index creation never finish

We have to stop our process, reset the autovacuum thresholds for the second kind of tables, and then restart our process.

Could the fact that PostgreSQL's statistics are not up to date explain this behavior?
Is it possible that the autovacuum process does not cancel itself when we perform a TRUNCATE or a CREATE INDEX?
Post by Kevin Grittner
A minor release (where the version number doesn't change before the
second dot) never requires a dump and restore. There is sometimes a
need to do some cleanup work; for example, if a bug is fixed which
could corrupt a particular type of index, the release notes may
recommend rebuilding all indexes of that type to repair any damage
which may have occurred before the bug was fixed.
As mentioned in the previous mail, we can only upgrade to 8.4.13 with Debian.
Kevin Grittner
2012-12-20 22:46:12 UTC
Post by Baptiste LHOSTE
Post by Kevin Grittner
Was the blocking you described occurring at the time you
captured this? It doesn't seem to be showing any problem.
Yes indeed. We have noticed that no process seems to be in a
waiting state, but:
 - before the autovacuum process starts to work on both kinds
   of tables, truncate and index creation take less than 2
   seconds
 - after the autovacuum process starts to work on both kinds
   of tables, truncate and index creation never finish
Just so we know how to interpret that, how many minutes, hours, or
days did you wait to see whether it would ever end?
Post by Baptiste LHOSTE
We have to stop our process, reset the autovacuum thresholds
for the second kind of tables, and then restart our process.
Could the fact that PostgreSQL's statistics are not up to date
explain this behavior?
Is it possible that the autovacuum process does not cancel itself
when we perform a TRUNCATE or a CREATE INDEX?
At the time you captured data from pg_stat_activity and pg_locks,
there were three autovacuum processes active, all running ANALYZE,
and eight TRUNCATE commands active on "normal" connections. All the
TRUNCATE statements started in the same second. One of the ANALYZE
tasks started about a minute and a half before that, the other two
started about a minute after the TRUNCATE statements. All are on
different tables, and no heavyweight locking is showing up.

I've heard enough reports of behavior like this to believe that
there is some sort of bug here, but the nature of it is not
apparent. We could really use more clues.

If it doesn't cause too much pain to let it get into this state for
a few minutes, it might help diagnose the issue if you could start
`vmstat 1` before you let it get into this state, and capture `ps
aux | grep postgres`, pg_stat_activity, and pg_locks at intervals while
it is in this state. Looking at all of the above might suggest a
cause. If we can find the cause, we can almost certainly fix it.
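
For the pg_locks snapshots, one query that ties each lock back to the session holding or requesting it (using the 8.4 column names; adjust as you see fit) would be something like:

SELECT a.procpid, a.waiting, l.locktype, l.relation::regclass AS relation,
       l.mode, l.granted, a.current_query
FROM pg_locks l
JOIN pg_stat_activity a ON a.procpid = l.pid
ORDER BY l.granted, a.procpid;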

-Kevin