Discussion:
[ADMIN] Data corruption after SAN snapshot
Terry Schmitt
2012-08-07 22:23:33 UTC
Hi All,

I have a pretty strange issue that I'm looking for ideas on.
I'm using Postgres Plus Advanced Server 9.1, but I believe this problem is
relevant to community Postgres as well. It is certainly possible that this is
an EDB bug, and I am already working with them on this.

We are migrating 1TB+ from Oracle to PPAS. Our new environment consists of
a primary server with two "read-only" clones. We use NetApp SAN storage,
take a NetApp consistent snapshot on the primary server, and then use flex
clones for the read-only servers. The clones power up, do a short recovery,
and all should be well. We have been using this method for two years, though
with PPAS 8.4, physical servers, and ext4.
The new environment is RHEL 6.x guests running inside Red Hat Virtualization,
using XFS and LVM.

The problem is that after the data load, we take a warm snapshot and the
cloned databases are coming up corrupt.
A classic example is: ERROR: could not read block 1 in file
"base/18511/13872": read only 0 of 8192 bytes. Looking at the data file, it
is 8k in size, so obviously block 1 is missing from the file. So far I have
identified indexes and sequences as corrupt, but I believe it could be any
object.
Since the snapshot is essentially a crash, this also means the system is not
crash resistant.

Looking through the timeline of events, it is clear that data exists in RAM
on the primary server but is not being written out to the SAN for the
snapshot, and hence is missing when the clone starts up. My first thought is
that fsync is not working. PPAS has fsync on and is using fdatasync.
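
For the curious, those two settings are easy to confirm from any session;
something along these lines works on PPAS and stock Postgres alike:

   psql -c "SELECT name, setting FROM pg_settings
             WHERE name IN ('fsync', 'wal_sync_method');"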

I ran a rudimentary test using: dd if=/dev/zero of=dd_test2 bs=8k
count=128k conv=fdatasync and crashed the server immediately after dd
completed. Everything behaves as expected: with fsync or fdatasync, the file
exists after the crash and reboot. Leaving out the sync results in a missing
file after the crash/reboot, but that is expected. This simple test shows
that fdatasync is working, but does not prove it under load.
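
Spelled out, the two variants look like this (the second file name is only
for illustration):

   # 1 GiB written and flushed with fdatasync before dd exits;
   # the file should survive the crash/reboot
   dd if=/dev/zero of=dd_test2 bs=8k count=128k conv=fdatasync

   # the same write with no sync at all; losing this file is expected
   dd if=/dev/zero of=dd_test2_nosync bs=8k count=128k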

So at this point I don't know whether the fdatasync is being issued but not
honored by the OS stack, or whether PPAS is even issuing the sync at all.

Anyone have a solid method to test if fdatasync is working correctly or
thoughts on troubleshooting this? It is extremely time-consuming to
replicate the problem, and even then the corruption moves around, so it's
hard to know immediately whether there is corruption at all. I'm hoping to
use a tool set outside of Postgres to positively eliminate the OS stack as
the culprit.

Sorry for the lengthy post, but hopefully it's clear what is going on.

Thanks!
T
Simon Riggs
2012-08-07 23:01:09 UTC
Post by Terry Schmitt
I have a pretty strange issue that I'm looking for ideas on.
I'm using Postgres Plus Advanced Server 9.1, but I believe this problem is
relevant to community Postgres as well. It is certainly possible that this is
an EDB bug, and I am already working with them on this.
I think it's important to say that as soon as someone patches/forks the
Postgres code, the onus is really on whoever did that to show that the
problem wasn't created by their changes. That is very much the reason why my
company and many others don't market a forked product and stick closely to
the community-distributed binaries.

I have much respect for your vendor's ability to diagnose faults and
recreate them on stock Postgres if appropriate, so I think we should wait
for that to happen rather than attempt to answer this just yet. Or,
alternatively, please submit a test case that fails on Postgres if you
doubt that.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Terry Schmitt
2012-08-07 23:18:00 UTC
Simon,
While I agree with your reply in general and am working that angle and
more, I'm hoping to add to my personal toolkit and gain more insight into
methods for testing fsync and proving beyond a doubt that it is functioning
properly on any given system, no matter what type of database I'm running.

T
Post by Simon Riggs
Post by Terry Schmitt
I have a pretty strange issue that I'm looking for ideas on.
I'm using Postgres Plus Advanced Server 9.1, but I believe this problem is
relevant to community Postgres as well. It is certainly possible that this is
an EDB bug, and I am already working with them on this.
I think it's important to say that as soon as someone patches/forks the
Postgres code, the onus is really on whoever did that to show that the
problem wasn't created by their changes. That is very much the reason why my
company and many others don't market a forked product and stick closely to
the community-distributed binaries.
I have much respect for your vendor's ability to diagnose faults and
recreate them on stock Postgres if appropriate, so I think we should wait
for that to happen rather than attempt to answer this just yet. Or,
alternatively, please submit a test case that fails on Postgres if you
doubt that.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Craig Ringer
2012-08-08 01:11:02 UTC
Post by Terry Schmitt
Anyone have a solid method to test if fdatasync is working correctly or
thoughts on troubleshooting this?
Try diskchecker.pl

https://gist.github.com/3177656
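
If I remember the workflow from Brad's post correctly, it goes roughly like
this (host name, file name, and size are placeholders):

   # on a second machine that will stay up:
   ./diskchecker.pl -l

   # on the machine under test; pull the plug / hard-reset while this runs:
   ./diskchecker.pl -s otherhost create test_file 500

   # once the crashed machine is back up:
   ./diskchecker.pl -s otherhost verify test_file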

The other obvious step is that you've changed three things, so start
isolation testing.

- Test Postgres Plus Advanced Server 8.4, which you knew worked, on your
new file system and OS.

- Test PP9.1 on your new OS but with ext3, which you knew worked

- Test PP9.1 on your new OS but with ext4, which should work if ext3 did

- Test PP9.1 on a copy of your *old* OS with the old file system setup.

- Test mainline PostgreSQL 9.1 on your new setup to see if it's PP specific.

Since each test sounds moderately time-consuming, you'll probably need to
find a way to automate them. I'd first see if I could reproduce the problem
by running pgbench against the same setup that's currently failing; if that
reproduces the fault, you can reuse the same pgbench run for the other
tests, along the lines of the sketch below.
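
Something like this would be my starting point (scale factor, client count,
and the "bench" database name are only placeholders to adjust for your
hardware):

   pgbench -i -s 1000 bench          # initialize a roughly 15 GB pgbench db
   pgbench -c 16 -j 4 -T 1800 bench  # 30 minutes of concurrent write traffic
   # take the NetApp snapshot while the run above is still in flight,
   # then start a clone from it and see whether it comes up clean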

--
Craig Ringer
Terry Schmitt
2012-08-08 01:34:17 UTC
Thanks Craig.

"# Brad's el-ghetto do-our-storage-stacks-lie?-script" I like it already :)

I may play around with that. Looks interesting. For everyone else, here's a
post describing the use of diskchecker:
http://brad.livejournal.com/2116715.html

I experimented with sysbench today, which was somewhat enlightening: it
clearly shows the impact that fsync/fdatasync has on file system
performance. It's pretty obvious that fsync is writing out to disk, simply
based on the throughput of each test.
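
For anyone who wants to repeat the comparison, the sysbench fileio test is
what I have in mind; the flags below are from memory of the 0.4-era syntax,
so treat this as a rough sketch rather than a recipe:

   sysbench --test=fileio --file-total-size=2G prepare
   # random writes with an fsync after every request
   sysbench --test=fileio --file-total-size=2G --file-test-mode=rndwr \
            --file-fsync-freq=1 --max-time=60 --max-requests=0 run
   # the same workload with periodic fsync disabled, for comparison
   sysbench --test=fileio --file-total-size=2G --file-test-mode=rndwr \
            --file-fsync-freq=0 --max-time=60 --max-requests=0 run
   sysbench --test=fileio cleanup

A large throughput gap between the two runs at least shows the syncs are not
being skipped outright.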

Using pgbench is a good idea, as I can throw a high transaction rate at the
database and take a snapshot during the test. So far, executing pg_dumpall
seems to be fairly reliable for finding the corrupt objects after my initial
data load, but unfortunately much of the corruption has been in indexes,
which pg_dump will not expose.

Thanks for the input,
T
Post by Craig Ringer
Post by Terry Schmitt
Anyone have a solid method to test if fdatasync is working correctly or
thoughts on troubleshooting this?
Try diskchecker.pl
https://gist.github.com/3177656
The other obvious step is that you've changed three things, so start
isolation testing.
- Test Postgres Plus Advanced Server 8.4, which you knew worked, on your
new file system and OS.
- Test PP9.1 on your new OS but with ext3, which you knew worked
- Test PP9.1 on your new OS but with ext4, which should work if ext3 did
- Test PP9.1 on a copy of your *old* OS with the old file system setup.
- Test mainline PostgreSQL 9.1 on your new setup to see if it's PP specific.
Since each test sounds moderately time-consuming, you'll probably need to
find a way to automate them. I'd first see if I could reproduce the problem
by running pgbench against the same setup that's currently failing; if that
reproduces the fault, you can reuse the same pgbench run for the other tests.
--
Craig Ringer
Stephen Frost
2012-08-08 01:39:31 UTC
Terry,
Post by Terry Schmitt
So far, executing pg_dumpall
seems to be fairly reliable for finding the corrupt objects after my
initial data load, but unfortunately much of the corruption has been in
indexes, which pg_dump will not expose.
Shouldn't be too hard to write a script that'll do a query against each
table using an ORDER BY that matches each index, at least for 'simple'
indexes, which'll typically cause an in-order index traversal.
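
A rough, untested sketch of what I mean, driving the queries off the
catalogs and ordering by the leading column of each index (the database
name is a placeholder; expression indexes simply fall out of the join):

   DB=mydb
   psql -At -F ' ' "$DB" -c "
     SELECT i.indrelid::regclass, quote_ident(a.attname)
       FROM pg_index i
       JOIN pg_attribute a ON a.attrelid = i.indrelid
                          AND a.attnum   = i.indkey[0]
      WHERE i.indisvalid" |
   while read -r tbl col; do
       echo "checking $tbl ordered by $col"
       # enable_seqscan=off nudges the planner towards the index, so
       # reading the whole table ordered this way should walk the index
       psql "$DB" -o /dev/null \
            -c "SET enable_seqscan = off; SELECT * FROM $tbl ORDER BY $col;" \
           || echo "POSSIBLE CORRUPTION: $tbl / $col"
   done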

Just a thought.

Thanks,

Stephen
Craig Ringer
2012-08-08 02:38:52 UTC
Post by Stephen Frost
Terry,
Post by Terry Schmitt
So far, executing pg_dumpall
seems to be fairly reliable for finding the corrupt objects after my
initial data load, but unfortunately much of the corruption has been in
indexes, which pg_dump will not expose.
Shouldn't be too hard to write a script that'll do a query against each
table using an ORDER BY that matches each index, at least for 'simple'
indexes, which'll typically cause an in-order index traversal.
I'd really like a "VERIFY" command for PostgreSQL, though a proper one
isn't really possible without block checksums.

I'm currently working on a virtual plug pull tool that uses VMs to
simulate abrupt crashes of the machine PostgreSQL is running on. One of
the bigger challenges is that Pg doesn't offer any reliable way to
detect even simple corruption.

Maybe a pg_read_relation(oid) that simply reads all blocks in an index
or table would help. It could live in the `adminpack' module (
http://www.postgresql.org/docs/9.1/static/adminpack.html) or
`pageinspect' module (
http://www.postgresql.org/docs/9.1/static/pageinspect.html).


It turns out I can use the pageinspect functions to do a rough kind of
verify, but it's pretty slow and inconvenient. E.g.:

WITH pages(page) AS (
    SELECT get_raw_page('tablename'::text, pageno)
    FROM generate_series(0, (SELECT relpages FROM pg_class
                             WHERE relname = 'tablename') - 1) AS pageno
)
SELECT page_header(page), heap_page_items(page) FROM pages;


takes 90 ms where a 'SELECT * FROM tablename' takes 6.2 ms. On a bigger
table, the query takes 3939.912 ms vs. 125.135 ms for a table scan.

Of course, pageinspect is mostly interesting for indexes, where I'd do:

create or replace function scan_index(indexname text)
returns setof record as $$
    SELECT page_header(get_raw_page($1, 0));
    WITH pagenumbers(pageno) AS (
        SELECT generate_series(1, (SELECT relpages FROM pg_class
                                   WHERE relname = $1) - 1)
    )
    SELECT bt_page_items($1, pageno) FROM pagenumbers;
$$ language sql volatile;

SELECT scan_index('some_idx');


... but that's getting really quite slow and still hasn't touched the
free space map or visibility map.

Of course, these checks prove nothing about subtle corruption or incorrect
contents; they only make sure Pg can read the pages and that they look
vaguely sane. They also don't do any kind of consistency checking between
index and table.

--
Craig Ringer

Stephen Frost
2012-08-08 01:34:25 UTC
Terry,
Post by Terry Schmitt
The new environment is RHEL 6.x guests running inside Red Hat Virtualization,
using XFS and LVM.
That's quite a shift, yet you left out any details on this piece.

How is the VM connected to the NetApp LUN? What kind of options have been
chosen with regard to caching? Have you tried your 8.4 "known good" software
stack under this virtualized environment?

Thanks,

Stephen