Monitoring Replication on Master/Slave Postgres(9.1)

Discussion:

(too old to reply)

Shams Khan

2012-11-26 10:05:09 UTC

HI All,

I want to know the ways to monitor the replication, whether the master and
slave server are sync.
The way I am checking it is:

*On Master:*

[***@livedb1 ~]# ps -ef | grep sender
postgres 2152 7045 0 Nov05 ? 01:05:33 postgres: wal sender
process postgres 10.0.0.2(45862) streaming 17A/35B62638
root 26030 10506 0 15:26 pts/2 00:00:00 grep sender

radius=# SELECT pg_current_xlog_location();
pg_current_xlog_location
--------------------------
17A/342A6F78
(1 row)

*On slave database:*

[***@liveDB2 ~]# ps -ef | grep receiver
root 11725 4286 0 15:28 pts/0 00:00:00 grep receiver
postgres 27250 3432 0 Nov05 ? 00:35:33 postgres: wal receiver
process streaming 17A/363EE668

radius=# select pg_last_xlog_receive_location();
pg_last_xlog_receive_location
-------------------------------
17A/34366C30
(1 row)

*If I go to pg_xlog folder:* I checked on slave WAL files are generated.
However I could not find the same result whenever I run
pg_current_xlog_location(); on master and on the other side I get the
different number (on slave)

My question is how do we read these numbers(17A/342A6F78 and 17A/34366C30)
I never got the same results on both servers. Is that mean slave is not
synced with master?
If yes, then how do we ensure my replication is working fine?

Please share you knowledge and through some light.

thanx

Shams Khan

2012-11-27 07:37:09 UTC

Permalink

Can somebody help me on that????

---------- Forwarded message ----------
From: Shams Khan <***@gmail.com>
Date: Mon, Nov 26, 2012 at 3:35 PM
Subject: Monitoring Replication on Master/Slave Postgres(9.1)
To: pgsql-***@postgresql.org

HI All,

I want to know the ways to monitor the replication, whether the master and
slave server are sync.
The way I am checking it is:

*On Master:*

[***@livedb1 ~]# ps -ef | grep sender
postgres 2152 7045 0 Nov05 ? 01:05:33 postgres: wal sender
process postgres 10.0.0.2(45862) streaming 17A/35B62638
root 26030 10506 0 15:26 pts/2 00:00:00 grep sender

radius=# SELECT pg_current_xlog_location();
pg_current_xlog_location
--------------------------
17A/342A6F78
(1 row)

*On slave database:*

[***@liveDB2 ~]# ps -ef | grep receiver
root 11725 4286 0 15:28 pts/0 00:00:00 grep receiver
postgres 27250 3432 0 Nov05 ? 00:35:33 postgres: wal receiver
process streaming 17A/363EE668

radius=# select pg_last_xlog_receive_location();
pg_last_xlog_receive_location
-------------------------------
17A/34366C30
(1 row)

*If I go to pg_xlog folder:* I checked on slave WAL files are generated.
However I could not find the same result whenever I run
pg_current_xlog_location(); on master and on the other side I get the
different number (on slave)

My question is how do we read these numbers(17A/342A6F78 and 17A/34366C30)
I never got the same results on both servers. Is that mean slave is not
synced with master?
If yes, then how do we ensure my replication is working fine?

Please share you knowledge and through some light.

thanx

Shams Khan

2012-11-27 08:02:58 UTC

Permalink

on the master use
select procpid,usesysid, usename, application_name, client_addr, state,
sent_location,write_location,sync_state from pg_stat_replication;
the last field is the state of your repliocation status sync or async
Best regards,
Sergey Garas

Post by Shams Khan
Can somebody help me on that????
---------- Forwarded message ----------
Date: Mon, Nov 26, 2012 at 3:35 PM
Subject: Monitoring Replication on Master/Slave Postgres(9.1)
HI All,
I want to know the ways to monitor the replication, whether the master
and slave server are sync.
*On Master:*
postgres 2152 7045 0 Nov05 ? 01:05:33 postgres: wal sender
process postgres 10.0.0.2(45862) streaming 17A/35B62638
root 26030 10506 0 15:26 pts/2 00:00:00 grep sender
radius=# SELECT pg_current_xlog_location();
pg_current_xlog_location
--------------------------
17A/342A6F78
(1 row)
*On slave database:*
root 11725 4286 0 15:28 pts/0 00:00:00 grep receiver
postgres 27250 3432 0 Nov05 ? 00:35:33 postgres: wal receiver
process streaming 17A/363EE668
radius=# select pg_last_xlog_receive_location();
pg_last_xlog_receive_location
-------------------------------
17A/34366C30
(1 row)
*If I go to pg_xlog folder:* I checked on slave WAL files are
generated. However I could not find the same result whenever I run
pg_current_xlog_location(); on master and on the other side I get the
different number (on slave)
My question is how do we read these numbers(17A/342A6F78 and
17A/34366C30) I never got the same results on both servers. Is that mean
slave is not synced with master?
If yes, then how do we ensure my replication is working fine?
Please share you knowledge and through some light.
thanx

Stuart Bishop

2012-11-27 08:34:35 UTC

Permalink

Post by Shams Khan
I want to know the ways to monitor the replication, whether the master and
slave server are sync.

On the slave, run "SELECT now() - pg_last_xact_replay_timestamp() AS
time_lag;". That tells you how far behind in time the slave is.

Post by Shams Khan
My question is how do we read these numbers(17A/342A6F78 and 17A/34366C30) I
never got the same results on both servers. Is that mean slave is not synced
with master?

With asynchronous replication, the slave will always lag behind the
master. The query I posted above will tell you by how much.

--
Stuart Bishop <***@stuartbishop.net>
http://www.stuartbishop.net/
--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Shams Khan

2012-11-27 09:39:30 UTC

Permalink

Thanks for the response Stuart...It was really helpful:

Below are some doubts I wanted to clarify..please read and suggest.
Can we also check if replication was broken earlier...somehow due to power
failure of some other reasons in past...The reason I am asking is I am
getting some discrepancies in data between master and slave...? I want to
know the reason of it...

thanx

---------- Forwarded message ----------
From: Stuart Bishop <***@stuartbishop.net>
Date: Tue, Nov 27, 2012 at 2:03 PM
Subject: Re: [ADMIN] Monitoring Replication on Master/Slave Postgres(9.1)

Post by Shams Khan
HI All,
I want to know the ways to monitor the replication, whether the master and
slave server are sync.

On the slave, run "SELECT now() - pg_last_xact_replay_timestamp() AS
time_lag;". That tells you how far behind in time the slave is.

Post by Shams Khan
My question is how do we read these numbers(17A/342A6F78 and

17A/34366C30) I

Post by Shams Khan
never got the same results on both servers. Is that mean slave is not

synced

Post by Shams Khan
with master?

With asynchronous replication, the slave will always lag behind the
master. The query I posted above will tell you by how much.

--
Stuart Bishop <***@stuartbishop.net>
http://www.stuartbishop.net/

Shams Khan

2012-11-28 18:21:53 UTC

Permalink

Can anybody answer the below query???

---------- Forwarded message ----------
From: Shams Khan <***@gmail.com>
Date: Mon, Nov 26, 2012 at 3:35 PM
Subject: Monitoring Replication on Master/Slave Postgres(9.1)
To: pgsql-***@postgresql.org

HI All,

I want to know the ways to monitor the replication, whether the master and
slave server are sync.
The way I am checking it is:

*On Master:*

[***@livedb1 ~]# ps -ef | grep sender
postgres 2152 7045 0 Nov05 ? 01:05:33 postgres: wal sender
process postgres 10.0.0.2(45862) streaming 17A/35B62638
root 26030 10506 0 15:26 pts/2 00:00:00 grep sender

radius=# SELECT pg_current_xlog_location();
pg_current_xlog_location
--------------------------
17A/342A6F78
(1 row)

*On slave database:*

[***@liveDB2 ~]# ps -ef | grep receiver
root 11725 4286 0 15:28 pts/0 00:00:00 grep receiver
postgres 27250 3432 0 Nov05 ? 00:35:33 postgres: wal receiver
process streaming 17A/363EE668

radius=# select pg_last_xlog_receive_location();
pg_last_xlog_receive_location
-------------------------------
17A/34366C30
(1 row)

*If I go to pg_xlog folder:* I checked on slave WAL files are generated.
However I could not find the same result whenever I run
pg_current_xlog_location(); on master and on the other side I get the
different number (on slave)

My question is how do we read these numbers(17A/342A6F78 and 17A/34366C30)
I never got the same results on both servers. Is that mean slave is not
synced with master?
If yes, then how do we ensure my replication is working fine?

Please share you knowledge and through some light.

thanx

Steve Crawford

2012-11-28 19:02:26 UTC

Permalink

...how do we ensure my replication is working fine?...

Below is the core of one of my bash-script tools. It could use some
tweaking (comments welcome) but works well. The script is run every
minute by cron on master and standby servers. It auto-determines whether
the server is currently a master or standby so the same script can be
deployed to all servers.

If a master-server, it updates a one-record test table with a current
timestamp to ensure there is activity on the master.

If a standby-server, it determines the lag based both on the age of
pg_last_xact_replay_timestamp() and on the age of the record in the test
table then returns the worst of the two.

The delay value is set in $standby_delay which is a value in seconds.
It's up to you to decide what constitutes an issue that requires
attention (but remember that 60-seconds does not necessarily indicate a
problem on an idle server). My first-level alert triggers at 130-seconds
and I have never hit that much of a delay.

#!/bin/bash
#
# Check PostgreSQL sync-status
#
# Requires table "sync_status" with column "sync_time" of type timestamp
with time zone

# We need a temp file
tempquery="$(mktemp /tmp/monitor_db_synchronizationXXXXXXXXXX)"

# If master, update sync_status timestamp and return 0. If standby,
check both age
# of log-replay location and of timestamp in sync_status table and set
$standby_delay
# to the greater of the two (in seconds)
#
standby_delay=$(
psql -q --tuples-only --no-align 2>/dev/null <<EOS
\o ${tempquery}
select
case when setting='on' then
'
with logdelay as
(
select
case when
pg_last_xlog_receive_location()=pg_last_xlog_replay_location() then 0::int
else
(extract(epoch from now())-extract(epoch from
pg_last_xact_replay_timestamp()))::int
end as replicadelay
union
select
(extract(epoch from now())-extract(epoch from sync_time))::int
as replicadelay
from
sync_status
)
select
max(replicadelay)
from
logdelay
;
'
else
'
begin;
delete from sync_status;
insert into sync_status (sync_time) values (now()) returning 0::int as
replicadelay;
commit;
'
end
from pg_settings where name='transaction_read_only';
\o
\i ${tempquery}
EOS
)

# Cleanup temp file
test -f "${tempquery}" && rm "${tempquery}"

# Do some alert based on the number of seconds of lag between master and
standby here

Cheers,
Steve

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Shams Khan

2012-11-28 19:24:57 UTC

Permalink

Thanks for the response Steve...It was really helpful:

Below are some doubts I wanted to clarify..please read and suggest.
Can we also check if replication was broken earlier...somehow due to power
failure of some other reasons in past...The reason I am asking is I am
getting some discrepancies in data between master and slave...? I want to
know the reason of it...

thanx

On Thu, Nov 29, 2012 at 12:32 AM, Steve Crawford <

Post by Steve Crawford

...how do we ensure my replication is working fine?...

Below is the core of one of my bash-script tools. It could use some
tweaking (comments welcome) but works well. The script is run every minute
by cron on master and standby servers. It auto-determines whether the
server is currently a master or standby so the same script can be deployed
to all servers.
If a master-server, it updates a one-record test table with a current
timestamp to ensure there is activity on the master.
If a standby-server, it determines the lag based both on the age of
pg_last_xact_replay_timestamp(**) and on the age of the record in the
test table then returns the worst of the two.
The delay value is set in $standby_delay which is a value in seconds. It's
up to you to decide what constitutes an issue that requires attention (but
remember that 60-seconds does not necessarily indicate a problem on an idle
server). My first-level alert triggers at 130-seconds and I have never hit
that much of a delay.
#!/bin/bash
#
# Check PostgreSQL sync-status
#
# Requires table "sync_status" with column "sync_time" of type timestamp
with time zone
# We need a temp file
tempquery="$(mktemp /tmp/monitor_db_**synchronizationXXXXXXXXXX)"
# If master, update sync_status timestamp and return 0. If standby, check
both age
# of log-replay location and of timestamp in sync_status table and set
$standby_delay
# to the greater of the two (in seconds)
#
standby_delay=$(
psql -q --tuples-only --no-align 2>/dev/null <<EOS
\o ${tempquery}
select
case when setting='on' then
'
with logdelay as
(
select
case when
pg_last_xlog_receive_location(**)=pg_last_xlog_replay_**location() then
0::int
else
(extract(epoch from now())-extract(epoch from
pg_last_xact_replay_timestamp(**)))::int
end as replicadelay
union
select
(extract(epoch from now())-extract(epoch from sync_time))::int as
replicadelay
from
sync_status
)
select
max(replicadelay)
from
logdelay
;
'
else
'
begin;
delete from sync_status;
insert into sync_status (sync_time) values (now()) returning 0::int as
replicadelay;
commit;
'
end
from pg_settings where name='transaction_read_only';
\o
\i ${tempquery}
EOS
)
# Cleanup temp file
test -f "${tempquery}" && rm "${tempquery}"
# Do some alert based on the number of seconds of lag between master and
standby here
Cheers,
Steve
--
http://www.postgresql.org/**mailpref/pgsql-admin<http://www.postgresql.org/mailpref/pgsql-admin>

Steve Crawford

2012-11-28 22:09:37 UTC

Permalink

Post by Shams Khan
Below are some doubts I wanted to clarify..please read and suggest.
Can we also check if replication was broken earlier...somehow due to
power failure of some other reasons in past...The reason I am asking
is I am getting some discrepancies in data between master and
slave...? I want to know the reason of it...

What sort of discrepancies? I am unaware of situations in a properly
configured and functioning system where the standby could be anything
other than lagging the master but I didn't dig through release notes to
see if there were any related bugs.

Cheers,
Steve

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Gabriel Muñoz

2012-11-29 02:31:00 UTC

Permalink

I use "select * from pg_stat_replication;" and compare columns:
sent_location | write_location | flush_location | replay_location

Gabriel.

On Wed, Nov 28, 2012 at 7:09 PM, Steve Crawford <

Post by Shams Khan
Below are some doubts I wanted to clarify..please read and suggest.
Can we also check if replication was broken earlier...somehow due to
power failure of some other reasons in past...The reason I am asking is I
am getting some discrepancies in data between master and slave...? I want
to know the reason of it...
What sort of discrepancies? I am unaware of situations in a properly

configured and functioning system where the standby could be anything other
than lagging the master but I didn't dig through release notes to see if
there were any related bugs.
Cheers,
Steve
--
http://www.postgresql.org/**mailpref/pgsql-admin<http://www.postgresql.org/mailpref/pgsql-admin>

Kris Bushover

2012-11-28 19:29:48 UTC

Permalink

The Bucardo check_postgres module contains a hot_standby_delay check function which will calculate the delta between the xlog position of the master with the slave(s).

http://bucardo.org/check_postgres/check_postgres.pl.html#hot_standby_delay

-----Original Message-----
From: pgsql-admin-***@postgresql.org [mailto:pgsql-admin-***@postgresql.org] On Behalf Of Steve Crawford
Sent: Wednesday, November 28, 2012 1:02 PM
To: pgsql-***@postgresql.org
Subject: Re: [ADMIN] Fwd: Monitoring Replication on Master/Slave Postgres(9.1)

...how do we ensure my replication is working fine?...

Below is the core of one of my bash-script tools. It could use some tweaking (comments welcome) but works well. The script is run every minute by cron on master and standby servers. It auto-determines whether the server is currently a master or standby so the same script can be deployed to all servers.

If a master-server, it updates a one-record test table with a current timestamp to ensure there is activity on the master.

If a standby-server, it determines the lag based both on the age of
pg_last_xact_replay_timestamp() and on the age of the record in the test table then returns the worst of the two.

The delay value is set in $standby_delay which is a value in seconds.
It's up to you to decide what constitutes an issue that requires attention (but remember that 60-seconds does not necessarily indicate a problem on an idle server). My first-level alert triggers at 130-seconds and I have never hit that much of a delay.

#!/bin/bash
#
# Check PostgreSQL sync-status
#
# Requires table "sync_status" with column "sync_time" of type timestamp with time zone

# We need a temp file
tempquery="$(mktemp /tmp/monitor_db_synchronizationXXXXXXXXXX)"

# If master, update sync_status timestamp and return 0. If standby,
check both age
# of log-replay location and of timestamp in sync_status table and set
$standby_delay
# to the greater of the two (in seconds)
#
standby_delay=$(
psql -q --tuples-only --no-align 2>/dev/null <<EOS
\o ${tempquery}
select
case when setting='on' then
'
with logdelay as
(
select
case when
pg_last_xlog_receive_location()=pg_last_xlog_replay_location() then 0::int
else
(extract(epoch from now())-extract(epoch from
pg_last_xact_replay_timestamp()))::int
end as replicadelay
union
select
(extract(epoch from now())-extract(epoch from sync_time))::int
as replicadelay
from
sync_status
)
select
max(replicadelay)
from
logdelay
;
'
else
'
begin;
delete from sync_status;
insert into sync_status (sync_time) values (now()) returning 0::int as
replicadelay;
commit;
'
end
from pg_settings where name='transaction_read_only';
\o
\i ${tempquery}
EOS
)

# Cleanup temp file
test -f "${tempquery}" && rm "${tempquery}"

# Do some alert based on the number of seconds of lag between master and
standby here

Cheers,
Steve
--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin