Deploying PostgreSQL Streaming Replication HA with Pacemaker+Corosync (pha4pgsql)

Published: 2025/4/5

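Introduction

Among the many PostgreSQL HA solutions, streaming-replication HA offers a good balance of performance, reliability, and deployment cost, and is currently widely adopted. Among the tools for managing a streaming-replication cluster, Pacemaker+Corosync is one of the more mature and reliable.

However, building a PostgreSQL streaming-replication HA cluster directly on Pacemaker+Corosync is complicated to configure, involves many parameters, and is hard to use, especially for users without Pacemaker experience. pha4pgsql therefore builds on the Pacemaker+Corosync streaming-replication HA solution, simplifying cluster configuration and wrapping common cluster operations in commands, with the aim of making the cluster easier to deploy and use. It also enhances the pgsql RA from Resource Agent 3.9.7 by introducing a distributed lock service, which prevents split-brain in two-node clusters and ensures that no data is lost after a failover under synchronous replication.

The wrapped and modified components are available at the following location; a detailed description follows.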
https://github.com/ChenHuajun/pha4pgsql

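Features

Failover within seconds

Supports two-node and multi-node clusters

Supports synchronous and asynchronous replication

Zero data loss on failover under synchronous replication

Provides a read-write VIP and a read-only VIP, so the cluster topology is transparent to applications

Basic architecture and principles

Pacemaker + Corosync serve as the cluster infrastructure: Corosync handles cluster communication and membership management, and Pacemaker handles resource management.

The cluster resources include PostgreSQL and the VIPs. The Resource Agent (RA) for PostgreSQL is expgsql, which carries out starting, stopping, monitoring, and failover of PostgreSQL.

At initial cluster startup, expgsql compares the xlog positions of all nodes and chooses the node with the most recent xlog as Master. The other nodes become Slaves and connect to the Master through the read-write VIP for WAL replication.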
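The initial election can be pictured as "pick the node with the largest xlog position". The following is a minimal sketch of that idea only, not expgsql's code: the function name pick_latest_node and the "node position" input format are assumptions, and positions are assumed to be 16-digit hex strings like those shown in pgsql_REPL_INFO.

```shell
# pick_latest_node: read "node xlog_position" pairs on stdin and print the
# node whose position is greatest. Hypothetical helper, not part of expgsql.
pick_latest_node() {
    best_node=""
    best_pos=-1
    while read -r node pos; do
        [ -n "$node" ] || continue
        # Interpret the position as a hexadecimal number for comparison.
        val=$((0x$pos))
        if [ "$val" -gt "$best_pos" ]; then
            best_node=$node
            best_pos=$val
        fi
    done
    printf '%s\n' "$best_node"
}

# Example with made-up positions for two nodes.
printf '%s\n' "node1 00000000070000D0" "node2 0000000007000090" | pick_latest_node
```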

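After the cluster has started, expgsql continuously monitors PostgreSQL's health. When expgsql detects a PostgreSQL resource failure, it reports it to Pacemaker, which takes the appropriate action.

If the PostgreSQL process fails, PostgreSQL is restarted in place and the fail-count on that node is incremented by 1.
When the fail-count reaches 3, the PostgreSQL resource is no longer assigned to that node. If the node was the Master, a Slave is promoted to Master, i.e. a failover is initiated.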
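The fail-count rule can be written down as a small decision function. Pacemaker makes this decision internally; the function below is only an illustration, and its name, arguments, and return words are made up for the example.

```shell
# next_action FAIL_COUNT ROLE [THRESHOLD]
# Prints what the text describes for a failed PostgreSQL process:
#   restart  - restart in place (fail-count still below the threshold)
#   failover - the node is no longer eligible and was Master: promote a Slave
#   ban      - the node is no longer eligible and was a Slave
# Hypothetical helper for illustration; THRESHOLD defaults to 3.
next_action() {
    fail_count=$1
    role=$2
    threshold=${3:-3}
    if [ "$fail_count" -lt "$threshold" ]; then
        echo restart
    elif [ "$role" = "Master" ]; then
        echo failover
    else
        echo ban
    fi
}

next_action 1 Master
next_action 3 Master
```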

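When Corosync detects a node failure (a host or network failure), Pacemaker likewise acts according to the situation.

In a multi-node cluster, a partition that does not contain more than half of the member nodes voluntarily releases all resources in that partition, including PostgreSQL and the VIPs.
If the quorate partition has no Master, Pacemaker promotes a Slave to Master, i.e. initiates a failover.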
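The majority rule is simple arithmetic: a partition keeps its resources only if it holds more than half of the configured nodes. A minimal sketch, with has_quorum as a hypothetical name (Corosync and Pacemaker compute quorum themselves):

```shell
# has_quorum PARTITION_SIZE CLUSTER_SIZE
# Succeeds (exit status 0) when the partition holds more than half of the nodes.
# Hypothetical helper for illustration only.
has_quorum() {
    [ $(( $1 * 2 )) -gt "$2" ]
}

for size in 1 2 3; do
    if has_quorum "$size" 3; then
        echo "partition of $size/3: keeps resources"
    else
        echo "partition of $size/3: releases resources"
    fi
done
```

Note that in a two-node cluster a single surviving node (1 of 2) never holds a majority, which is why two-node clusters rely on the distributed lock service for arbitration instead.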

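expgsql on the Master continuously monitors the replication health of the Slaves; under synchronous replication it selects one Slave as the synchronous Slave.

When the synchronous Slave fails, expgsql on the Master temporarily switches from synchronous to asynchronous replication so that writes on the Master do not hang. Once the failed Slave recovers, or if another healthy Slave exists, it switches back to synchronous replication.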
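In PostgreSQL terms, this switch amounts to changing synchronous_standby_names: an empty value means asynchronous replication, a Slave name means that Slave is synchronous. A minimal sketch follows; the function name is made up, and picking the first healthy Slave is an assumption rather than expgsql's documented selection rule.

```shell
# sync_standby_names HEALTHY_SLAVE...
# Prints the synchronous_standby_names setting: the first healthy Slave, or an
# empty value (asynchronous replication) when no healthy Slave is given.
# Hypothetical helper; the "first healthy Slave" choice is an assumption.
sync_standby_names() {
    if [ "$#" -gt 0 ]; then
        printf "synchronous_standby_names = '%s'\n" "$1"
    else
        printf "synchronous_standby_names = ''\n"
    fi
}

sync_standby_names node2   # a healthy Slave exists: synchronous
sync_standby_names         # no healthy Slave: asynchronous
```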

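After a cluster partition, a Slave could be promoted to new Master while the old Master switches to asynchronous replication, causing split-brain and double writes. To prevent this, a distributed lock service is introduced for arbitration. Both the promotion of a Slave to new Master and the old Master's switch to asynchronous replication must first acquire the lock, which keeps the two from happening at the same time. A Master that loses the lock stops its PostgreSQL process on its own, preventing two Masters.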
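One way such a lock could be taken against a PostgreSQL table is an upsert that succeeds only when the lock is free, already held by the caller, or expired. The sketch below prints such a statement; it is an assumption for illustration and not necessarily what pha4pgsql's own lock tool does. It uses the distlock table layout given in the lock service setup section, needs PostgreSQL 9.5 or later on the lock server for ON CONFLICT, and the function name and arguments are hypothetical.

```shell
# lock_sql LOCKNAME OWNER TTL_SECONDS
# Prints a SQL statement that takes or renews the lock only if it is free,
# already ours, or expired. A returned row means the lock was obtained.
# Hypothetical helper; arguments are interpolated as-is, so they must be trusted.
lock_sql() {
    cat <<EOF
INSERT INTO distlock(lockname, owner, ts, expired_time)
VALUES ('$1', '$2', now(), interval '$3 seconds')
ON CONFLICT (lockname) DO UPDATE
   SET owner = EXCLUDED.owner, ts = EXCLUDED.ts, expired_time = EXCLUDED.expired_time
 WHERE distlock.owner = EXCLUDED.owner
    OR distlock.ts + distlock.expired_time < now()
RETURNING owner;
EOF
}

lock_sql pgsql_cls1 node1 30
```

Because the whole check-and-set is a single statement, two nodes racing for the same lockname cannot both get a row back, which is the property the arbitration relies on.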

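If the distributed lock service fails while all PostgreSQL nodes are healthy, expgsql ignores the lock service, so the cluster service is unaffected. However, if the Master suffers a node failure (as distinct from a resource failure) while the lock service is down, the cluster cannot fail over normally.

Under synchronous replication only the synchronous Slave is eligible to become the candidate Master; combined with the protection of the distributed lock, this ensures that no data is lost after failover.

At initial cluster startup and on every failover, a Slave is promoted to Master with pg_ctl promote, which increments the timeline by 1. The Master node name, the timeline, and the xlog position at the switch are recorded in the cluster CIB.

When the cluster restarts, the Master node is determined from the information recorded in the cluster CIB, and the timeline is left unchanged.

Before starting PostgreSQL, expgsql checks the node's timeline and xlog; if they conflict with the information recorded in the cluster CIB, it reports an error. The node then has to be repaired manually, for example with cls_repair_slave (pg_rewind).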
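The xlog part of this check can be sketched from the rule stated in the network failure test later in the article: a node whose latest checkpoint is at or beyond the position of the last timeline promotion cannot rejoin as a Slave. The function below covers only that comparison; its name is hypothetical, the timeline comparison is left out, and positions are assumed to be 16-digit hex strings as in pgsql_REPL_INFO.

```shell
# can_follow_new_master NODE_CHECKPOINT RECORDED_POSITION
# Both arguments are hex xlog positions without the 0x prefix. Succeeds when
# the node's latest checkpoint lies before the recorded switch position.
# Hypothetical helper; a sketch of the rule, not expgsql's actual check.
can_follow_new_master() {
    [ $((0x$1)) -lt $((0x$2)) ]
}

# 0000000009045900 is a made-up checkpoint position for the example.
if can_follow_new_master 0000000009045900 0000000009045828; then
    echo "no conflict: start as Slave"
else
    echo "conflict: repair with cls_repair_slave or cls_rebuild_slave"
fi
```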

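The read-write VIP is bound to the Master node and the read-only VIP to one of the Slaves. Applications only need to access the VIPs and need not care which node they are actually talking to.

Cluster command overview

cls_start
Starts the cluster.

cls_stop
Stops the cluster.

cls_online_switch
Performs an online Master/Slave switch.

cls_master
Prints the name of the current Master node.

cls_status
Shows the cluster status.

cls_cleanup
Clears resource state and the fail-count. Once a resource's failure count (fail-count) on a node reaches 3, Pacemaker no longer assigns that resource to the node; after fixing the fault by hand, run cleanup so that Pacemaker tries to start the resource again.

cls_reset_master [master]
Sets pgsql_REPL_INFO so that the specified node becomes Master. If no Master is specified, clears pgsql_REPL_INFO so that Pacemaker re-selects, among all nodes, the one with the most recent xlog position as Master. For emergency repair only, when no node in the cluster satisfies the conditions for being Master.

cls_repair_slave
Repairs the current node with pg_rewind. Mainly used to repair an old Master: it rolls back the updates made beyond the point where the timelines diverged and sets up replication from the new Master. pg_rewind is only available in PostgreSQL 9.5 and later.

cls_rebuild_slave
Rebuilds the Slave on the current node with pg_basebackup. Before running this command, stop the PostgreSQL process on the current node and empty the old data directory.

cls_unmanage
Unmanages all resources, taking them out of Pacemaker's control. When Pacemaker and Corosync must be restarted but the PostgreSQL service cannot be stopped, call this command first, then restore management with cls_manage after Pacemaker and Corosync have restarted.

cls_manage
Reverts the unmanaged state of resources caused by cls_unmanage.

cls_standby_node [nodename]
Releases all resources on a node. Useful for maintenance of a particular node, such as an upgrade.

cls_unstandby_node [nodename]
Reverts the standby state of a node caused by cls_standby_node.

Dependencies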

pacemaker
pcs
psmisc
policycoreutils-python
postgresql-server

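Installation

The installation steps below use the deployment of a two-node HA cluster in the following environment as an example.

OS: CentOS 7.0
Node 1 hostname: node1
Node 2 hostname: node2
writer_vip: 192.168.41.136
reader_vip: 192.168.41.137
Connection string of the PostgreSQL instance used as the distributed lock service: "host=node3 port=5439 dbname=postgres user=postgres"

Installing and configuring the Linux cluster environment

Environment preparation

Set up clock synchronization on all nodes

Give every node its own hostname (node1, node2)

Set up name resolution for all nodes (edit /etc/hosts)

Set up SSH mutual trust between all nodes

Disable the firewall and set SELinux to permissive mode

Run on all nodes:

setenforce 0
sed -i.bak "s/SELINUX=enforcing/SELINUX=permissive/g" /etc/selinux/config
systemctl disable firewalld.service
systemctl stop firewalld.service
iptables -F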

Install Pacemaker, Corosync, and related packages

Run on all nodes:

yum install -y pacemaker pcs psmisc policycoreutils-python

Note: if the Pacemaker shipped with the OS is old, downloading a newer version is recommended. Quite a few bugs were encountered with Pacemaker 1.1.7, so that version and anything older are not recommended.

Enable the services

Run on all nodes:

systemctl start corosync.service
systemctl enable corosync.service
systemctl start pacemaker.service
systemctl enable pacemaker.service
systemctl start pcsd.service
systemctl enable pcsd.service

Set the hacluster user's password

Run on all nodes:

echo hacluster | passwd hacluster --stdin

Cluster authentication

Run on any one node:

pcs cluster auth -u hacluster -p hacluster node1 node2

Synchronize the configuration

Run on any one node:

pcs cluster setup --last_man_standing=1 --name pgcluster node1 node2

Start the cluster

Run on any one node:

pcs cluster start --all

Installing and configuring PostgreSQL

Install PostgreSQL

Run on all nodes:

yum install postgresql-server

The PostgreSQL shipped with the OS is often old; see http://www.postgresql.org/download/linux/ for how to install the latest PostgreSQL.

Create the primary database

Run on node1:

Create the data directory

mkdir -p /data/postgresql/data
chown -R postgres:postgres /data/postgresql/
chmod 0700 /data/postgresql/data

Initialize the database

su - postgres
initdb -D /data/postgresql/data/

Edit postgresql.conf

listen_addresses = '*'
wal_level = hot_standby
synchronous_commit = on
max_wal_senders=5
wal_keep_segments = 32
hot_standby = on
replication_timeout = 5000
wal_receiver_status_interval = 2
max_standby_streaming_delay = -1
max_standby_archive_delay = -1
restart_after_crash = off
hot_standby_feedback = on

Note: on PostgreSQL 9.3 and later, replace replication_timeout with wal_sender_timeout. On PostgreSQL 9.5 and later, you can add "wal_log_hints = on" so that pg_rewind can be used to repair an old Master.
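If the same setup script has to serve several PostgreSQL versions, the parameter name can be chosen from the version number. This is a minimal sketch of the rule in the note above; the function name timeout_param is hypothetical.

```shell
# timeout_param VERSION   (VERSION such as 9.2, 9.5.3 or 10)
# Prints the name of the sender-timeout setting for that PostgreSQL version:
# replication_timeout before 9.3, wal_sender_timeout from 9.3 on.
# Hypothetical helper for illustration.
timeout_param() {
    major=${1%%.*}
    case $1 in
        *.*) minor=${1#*.}; minor=${minor%%.*} ;;
        *)   minor=0 ;;
    esac
    if [ "$major" -gt 9 ] || { [ "$major" -eq 9 ] && [ "$minor" -ge 3 ]; }; then
        echo wal_sender_timeout
    else
        echo replication_timeout
    fi
}

echo "$(timeout_param 9.5) = 5000"
```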

Edit pg_hba.conf

local all all trust
host all all 192.168.41.0/24 md5
host replication all 192.168.41.0/24 md5

Start PostgreSQL

pg_ctl -D /data/postgresql/data/ start

Create the replication user

createuser --login --replication replication -P

On 9.5 and later, add the "-s" option if pg_rewind support is needed.

createuser --login --replication replication -P -s

Create the standby database

Run on node2:

Create the data directory

mkdir -p /data/postgresql/data
chown -R postgres:postgres /data/postgresql/
chmod 0700 /data/postgresql/data

Create the base backup

su - postgres
pg_basebackup -h node1 -U replication -D /data/postgresql/data/ -X stream -P

Stop the PostgreSQL service

Run on node1:

pg_ctl -D /data/postgresql/data/ stop

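Configure the distributed lock service

The distributed lock service prevents split-brain in a two-node cluster. When a network failure partitions the cluster, the standby may be promoted to primary while the old primary switches from synchronous to asynchronous replication, which can lead to data loss. The distributed lock service ensures that only one of the two, the promotion of the new primary or the old primary's switch to asynchronous replication, can succeed.

The distributed lock service is implemented by a separate PostgreSQL service outside the HA cluster. The lock table must be created beforehand.

create table if not exists distlock(
    lockname text primary key,
    owner text not null,
    ts timestamptz not null,
    expired_time interval not null
);

Optionally, a lock history table can be created; every change of the lock's owner (a primary/standby role switch) is then recorded in the history table (distlock_history).

create table if not exists distlock_history(
    id serial primary key,
    lockname text not null,
    owner text not null,
    ts timestamptz not null,
    expired_time interval not null
);

CREATE OR REPLACE FUNCTION distlock_log_update() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' or NEW.owner <> OLD.owner THEN
        INSERT INTO distlock_history(lockname, owner, ts, expired_time) values(NEW.lockname, NEW.owner, NEW.ts, NEW.expired_time);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS distlock_log_update ON distlock;

CREATE TRIGGER distlock_log_update AFTER INSERT OR UPDATE ON distlock
    FOR EACH ROW EXECUTE PROCEDURE distlock_log_update();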

Installing and configuring pha4pgsql

Run on any one node:

Download pha4pgsql

git clone https://github.com/ChenHuajun/pha4pgsql.git

Edit config.ini

cluster_type=dual
OCF_ROOT=/usr/lib/ocf
RESOURCE_LIST="msPostgresql vip-master vip-slave"
pha4pgsql_dir=/opt/pha4pgsql
writer_vip=192.168.41.136
reader_vip=192.168.41.137
node1=node1
node2=node2
vip_nic=eno33554984
vip_cidr_netmask=24
pgsql_pgctl=/usr/bin/pg_ctl
pgsql_psql=/usr/bin/psql
pgsql_pgdata=/data/postgresql/data
pgsql_restore_command=""
pgsql_rep_mode=sync
pgsql_repuser=replication
pgsql_reppassord=replication
pgsql_enable_distlock=true
pgsql_distlock_psql_cmd='/usr/bin/psql \\"host=node3 port=5439 dbname=postgres user=postgres connect_timeout=5\\"'
pgsql_distlock_lockname=pgsql_cls1

Adjust the parameters above to your environment. When several clusters share the lock service, make sure each cluster's pgsql_distlock_lockname value is unique.
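When several clusters share one lock service, a quick check over their config.ini files can catch a duplicated lock name before it causes trouble. The helpers below are a sketch and not part of pha4pgsql; they assume plain key=value lines as in the sample above.

```shell
# ini_value FILE KEY: print the value of the first KEY=value line in FILE.
# Hypothetical helper; KEY is used as a literal sed pattern, so it should
# contain only letters, digits and underscores.
ini_value() {
    sed -n "s/^$2=//p" "$1" | head -n 1
}

# duplicate_locknames FILE...: print every pgsql_distlock_lockname value that
# appears in more than one of the given config files (nothing if all differ).
duplicate_locknames() {
    for f in "$@"; do
        ini_value "$f" pgsql_distlock_lockname
    done | sort | uniq -d
}
```

It could be called as duplicate_locknames cluster1/config.ini cluster2/config.ini (both paths are placeholders); empty output means the lock names are unique.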

Install pha4pgsql

sh install.sh
./setup.sh

Note that the installation only needs to be run on one node.

Set the environment variable

export PATH=/opt/pha4pgsql/bin:$PATH

Start the cluster

cls_start

Check the cluster status

cls_status

Sample cls_status output:

[root@node1 pha4pgsql]# cls_status
Last updated: Fri Apr 22 02:01:01 2016
Last change: Fri Apr 22 02:01:00 2016 by root via crm_resource on node1
Stack: corosync
Current DC: node2 (2) - partition with quorum
Version: 1.1.12-a14efad
2 Nodes configured
4 Resources configured

Online: [ node1 node2 ]

Full list of resources:

 vip-master (ocf::heartbeat:IPaddr2): Started node1
 vip-slave (ocf::heartbeat:IPaddr2): Started node2
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ node1 ]
     Slaves: [ node2 ]

Node Attributes:
* Node node1:
    + #cluster-name : pgcluster
    + #site-name : pgcluster
    + master-pgsql : 1000
    + pgsql-data-status : LATEST
    + pgsql-master-baseline : 00000000070000D0
    + pgsql-status : PRI
* Node node2:
    + #cluster-name : pgcluster
    + #site-name : pgcluster
    + master-pgsql : -INFINITY
    + pgsql-data-status : STREAMING|ASYNC
    + pgsql-status : HS:async

Migration summary:
* Node node2:
* Node node1:

pgsql_REPL_INFO:node1|1|00000000070000D0

Check the cluster's health. A fully healthy cluster satisfies the following conditions:

msPostgresql is started on every node

msPostgresql is in the Master state on one node and in the Slave state on the others

The pgsql-data-status value of each Slave node is one of the following

STREAMING|SYNC
Synchronous replication Slave
STREAMING|POTENTIAL
Candidate synchronous replication Slave
STREAMING|ASYNC
Asynchronous replication Slave

For details of the pgsql-data-status values, see the description below.

The transitional state of data is displayed. This state remains after stopping pacemaker. When starting pacemaker next time, this state is used to judge whether my data is old or not.

DISCONNECT
Master changes other node state into DISCONNECT if Master can't detect connection of replication because of LAN failure or breakdown of Slave and so on.
{state}|{sync_state}
Master changes other node state into {state}|{sync_state} if Master detects connection of replication. {state} and {sync_state} mean the state of replication which is retrieved using "select state and sync_state from pg_stat_replication" on Master. For example, INIT, CATCHUP, and STREAMING are displayed in {state} and ASYNC, SYNC are displayed in {sync_state}.
LATEST
It's displayed when it's Master.

These states are the transitional state of final data, and may not be consistent with the state of actual data. For instance, during PRI, the state is "LATEST". But when the node is stopped or down, this state "LATEST" is maintained if Master doesn't exist in other nodes. It never changes to "DISCONNECT" for oneself. When another node is newly promoted, this new Master changes the state of the old Master to "DISCONNECT". When no node can become Master, this "LATEST" will be kept.
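The Slave condition from the health checklist can be turned into a small test, for example for use in a monitoring script. This is a sketch; the function name is hypothetical, and how the status string is obtained from the cluster is left out.

```shell
# slave_data_status_ok STATUS
# Succeeds for the three pgsql-data-status values a healthy Slave may show.
# Hypothetical helper for illustration.
slave_data_status_ok() {
    case $1 in
        'STREAMING|SYNC'|'STREAMING|POTENTIAL'|'STREAMING|ASYNC') return 0 ;;
        *) return 1 ;;
    esac
}

for status in 'STREAMING|SYNC' 'DISCONNECT'; do
    if slave_data_status_ok "$status"; then
        echo "$status: healthy Slave"
    else
        echo "$status: needs attention"
    fi
done
```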

Failure tests

postgres process failure on the Master

Forcibly kill the postgres process on the Master

[root@node1 pha4pgsql]# killall postgres

Check the cluster status
Because migration-threshold="3" is set, after a single ordinary error Pacemaker restarts the postgres process in place and no Master/Slave switch takes place. (If the Master's physical machine or network fails, failover happens immediately.)

[root@node1 pha4pgsql]# cls_status
Last updated: Fri Apr 22 02:03:17 2016
Last change: Fri Apr 22 02:03:10 2016 by root via crm_resource on node1
Stack: corosync
Current DC: node2 (2) - partition with quorum
Version: 1.1.12-a14efad
2 Nodes configured
4 Resources configured

Online: [ node1 node2 ]

Full list of resources:

 vip-master (ocf::heartbeat:IPaddr2): Started node1
 vip-slave (ocf::heartbeat:IPaddr2): Started node2
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ node1 ]
     Slaves: [ node2 ]

Node Attributes:
* Node node1:
    + #cluster-name : pgcluster
    + #site-name : pgcluster
    + master-pgsql : 1000
    + pgsql-data-status : LATEST
    + pgsql-master-baseline : 0000000007000250
    + pgsql-status : PRI
* Node node2:
    + #cluster-name : pgcluster
    + #site-name : pgcluster
    + master-pgsql : 100
    + pgsql-data-status : STREAMING|SYNC
    + pgsql-status : HS:sync

Migration summary:
* Node node2:
* Node node1:
   pgsql: migration-threshold=3 fail-count=1 last-failure='Mon Apr 18 09:14:28 2016'

Failed actions:
    pgsql_monitor_3000 on node1 'unknown error' (1): call=205, status=complete, exit-reason='none', last-rc-change='Fri Apr 22 02:02:56 2016', queued=0ms, exec=0ms

pgsql_REPL_INFO:node1|1|00000000070000D0

Kill the postgres process on the Master two more times, then check the cluster status.
This time a failover has taken place: a new Master has been elected and the timeline has been advanced.

[root@node1 pha4pgsql]# cls_status
Last updated: Fri Apr 22 02:07:33 2016
Last change: Fri Apr 22 02:07:31 2016 by root via crm_resource on node1
Stack: corosync
Current DC: node2 (2) - partition with quorum
Version: 1.1.12-a14efad
2 Nodes configured
4 Resources configured

Online: [ node1 node2 ]

Full list of resources:

 vip-master (ocf::heartbeat:IPaddr2): Started node2
 vip-slave (ocf::heartbeat:IPaddr2): Stopped
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ node2 ]
     Stopped: [ node1 ]

Node Attributes:
* Node node1:
    + #cluster-name : pgcluster
    + #site-name : pgcluster
    + master-pgsql : -INFINITY
    + pgsql-data-status : DISCONNECT
    + pgsql-status : STOP
* Node node2:
    + #cluster-name : pgcluster
    + #site-name : pgcluster
    + master-pgsql : 1000
    + pgsql-data-status : LATEST
    + pgsql-master-baseline : 0000000007000410
    + pgsql-status : PRI

Migration summary:
* Node node2:
* Node node1:
   pgsql: migration-threshold=3 fail-count=3 last-failure='Mon Apr 18 09:18:58 2016'

Failed actions:
    pgsql_monitor_3000 on node1 'not running' (7): call=237, status=complete, exit-reason='none', last-rc-change='Fri Apr 22 02:07:26 2016', queued=0ms, exec=0ms

pgsql_REPL_INFO:node2|2|0000000007000410

Repair the old Master
Repair the old Master with pg_basebackup

[root@node1 pha4pgsql]# rm -rf /data/postgresql/data/*
[root@node1 pha4pgsql]# cls_rebuild_slave
22636/22636 kB (100%), 1/1 tablespace
All resources/stonith devices successfully cleaned up
wait for recovery complete
.....
slave recovery of node1 successed
[root@node1 pha4pgsql]# cls_status
Last updated: Fri Apr 22 02:40:48 2016
Last change: Fri Apr 22 02:40:36 2016 by root via crm_resource on node1
Stack: corosync
Current DC: node2 (2) - partition with quorum
Version: 1.1.12-a14efad
2 Nodes configured
4 Resources configured

Online: [ node1 node2 ]

Full list of resources:

 vip-master (ocf::heartbeat:IPaddr2): Started node2
 vip-slave (ocf::heartbeat:IPaddr2): Started node1
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ node2 ]
     Slaves: [ node1 ]

Node Attributes:
* Node node1:
    + #cluster-name : pgcluster
    + #site-name : pgcluster
    + master-pgsql : 100
    + pgsql-data-status : STREAMING|SYNC
    + pgsql-status : HS:sync
* Node node2:
    + #cluster-name : pgcluster
    + #site-name : pgcluster
    + master-pgsql : 1000
    + pgsql-data-status : LATEST
    + pgsql-master-baseline : 0000000007000410
    + pgsql-status : PRI

Migration summary:
* Node node2:
* Node node1:

pgsql_REPL_INFO:node2|2|0000000007000410

On 9.5 and later the old Master can also be repaired with pg_rewind

[root@node1 pha4pgsql]# cls_repair_slave
connected to server
servers diverged at WAL position 0/7000410 on timeline 2
rewinding from last common checkpoint at 0/7000368 on timeline 2
reading source file list
reading target file list
reading WAL in target
need to copy 67 MB (total source directory size is 85 MB)
69591/69591 kB (100%) copied
creating backup label and updating control file
syncing target data directory
Done!
All resources/stonith devices successfully cleaned up
wait for recovery complete
....
slave recovery of node1 successed

Master network failure

Cluster status before the failure

Before the failure the Master is node1

[root@node1 pha4pgsql]# cls_status
Last updated: Fri Apr 22 11:28:26 2016
Last change: Fri Apr 22 11:25:56 2016 by root via crm_resource on node1
Stack: corosync
Current DC: node2 (2) - partition with quorum
Version: 1.1.12-a14efad
2 Nodes configured
4 Resources configured

Online: [ node1 node2 ]

Full list of resources:

 vip-master (ocf::heartbeat:IPaddr2): Started node1
 vip-slave (ocf::heartbeat:IPaddr2): Started node2
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ node1 ]
     Slaves: [ node2 ]

Node Attributes:
* Node node1:
    + #cluster-name : pgcluster
    + #site-name : pgcluster
    + master-pgsql : 1000
    + pgsql-data-status : LATEST
    + pgsql-master-baseline : 0000000009044898
    + pgsql-status : PRI
* Node node2:
    + #cluster-name : pgcluster
    + #site-name : pgcluster
    + master-pgsql : 100
    + pgsql-data-status : STREAMING|SYNC
    + pgsql-status : HS:sync

Migration summary:
* Node node2:
* Node node1:

pgsql_REPL_INFO:node1|12|0000000009044898

Cut off communication between the Master and the other nodes

[root@node1 pha4pgsql]# iptables -A INPUT -j DROP -s node2
[root@node1 pha4pgsql]# iptables -A OUTPUT -j DROP -d node2
[root@node1 pha4pgsql]# iptables -A INPUT -j DROP -s node3
[root@node1 pha4pgsql]# iptables -A OUTPUT -j DROP -d node3

Wait ten seconds or so, then check the cluster status

Viewed from node1 (the old Master): having lost the distributed lock, node1 has stopped all resources deployed on it.

[root@node1 pha4pgsql]# cls_status
Last updated: Fri Apr 22 11:34:46 2016
Last change: Fri Apr 22 11:25:56 2016 by root via crm_resource on node1
Stack: corosync
Current DC: node1 (1) - partition with quorum
Version: 1.1.12-a14efad
2 Nodes configured
4 Resources configured

Online: [ node1 ]
OFFLINE: [ node2 ]

Full list of resources:

 vip-master (ocf::heartbeat:IPaddr2): Stopped
 vip-slave (ocf::heartbeat:IPaddr2): Stopped
 Master/Slave Set: msPostgresql [pgsql]
     Stopped: [ node1 node2 ]

Node Attributes:
* Node node1:
    + #cluster-name : pgcluster
    + #site-name : pgcluster
    + master-pgsql : -INFINITY
    + pgsql-data-status : LATEST
    + pgsql-status : STOP

Migration summary:
* Node node1:
   pgsql: migration-threshold=3 fail-count=2 last-failure='Fri Apr 22 11:34:23 2016'

Failed actions:
    pgsql_promote_0 on node1 'unknown error' (1): call=990, status=complete, exit-reason='none', last-rc-change='Fri Apr 22 11:34:15 2016', queued=0ms, exec=7756ms

pgsql_REPL_INFO:node1|12|0000000009044898

Viewed from node2: node2 has been promoted to new Master, and the PostgreSQL timeline has advanced from 12 to 13.

[root@node2 ~]# cls_status
Last updated: Sun May 8 01:02:04 2016
Last change: Sun May 8 00:57:47 2016 by root via crm_resource on node1
Stack: corosync
Current DC: node2 (2) - partition with quorum
Version: 1.1.12-a14efad
2 Nodes configured
4 Resources configured

Online: [ node2 ]
OFFLINE: [ node1 ]

Full list of resources:

 vip-master (ocf::heartbeat:IPaddr2): Started node2
 vip-slave (ocf::heartbeat:IPaddr2): Stopped
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ node2 ]
     Stopped: [ node1 ]

Node Attributes:
* Node node2:
    + #cluster-name : pgcluster
    + #site-name : pgcluster
    + master-pgsql : 1000
    + pgsql-data-status : LATEST
    + pgsql-master-baseline : 0000000009045828
    + pgsql-status : PRI

Migration summary:
* Node node2:

pgsql_REPL_INFO:node2|13|0000000009045828

Note that a network partition has occurred at this point, so node1 and node2 each hold a different view of the cluster state.

Restore the network on node1

[root@node1 pha4pgsql]# iptables -F

Check the cluster status on node1 again
After the node1 and node2 partitions merge, the cluster adopts node2's configuration rather than node1's, which is exactly what we want (node2's cluster configuration has the higher version, so it wins). Meanwhile, when Pacemaker tries to restart the PostgreSQL process on node1, it finds that node1's latest checkpoint position is greater than or equal to the position of the last timeline promotion, so node1 cannot connect to the new Master as a Slave. An error is reported and node1 is kept offline.

[root@node1 pha4pgsql]# cls_status
Last updated: Fri Apr 22 11:49:44 2016
Last change: Sun May 8 00:57:47 2016 by root via crm_resource on node1
Stack: corosync
Current DC: node2 (2) - partition with quorum
Version: 1.1.12-a14efad
2 Nodes configured
4 Resources configured

Online: [ node1 node2 ]

Full list of resources:

 vip-master (ocf::heartbeat:IPaddr2): Started node2
 vip-slave (ocf::heartbeat:IPaddr2): Started node1
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ node2 ]
     Stopped: [ node1 ]

Node Attributes:
* Node node1:
    + #cluster-name : pgcluster
    + #site-name : pgcluster
    + master-pgsql : -INFINITY
    + pgsql-data-status : DISCONNECT
    + pgsql-status : STOP
* Node node2:
    + #cluster-name : pgcluster
    + #site-name : pgcluster
    + master-pgsql : 1000
    + pgsql-data-status : LATEST
    + pgsql-master-baseline : 0000000009045828
    + pgsql-status : PRI

Migration summary:
* Node node2:
* Node node1:
   pgsql: migration-threshold=3 fail-count=1000000 last-failure='Sun May 8 01:12:57 2016'

Failed actions:
    pgsql_start_0 on node1 'unknown error' (1): call=1022, status=complete, exit-reason='The master's timeline forked off current database system timeline 13 before latest checkpoint location 0000000009045828, REPL_IN', last-rc-change='Fri Apr 22 11:49:35 2016', queued=0ms, exec=2123ms

pgsql_REPL_INFO:node2|13|0000000009045828

Repair node1 (the old Master)

node1 (the old Master) is repaired in the same way as before, using cls_repair_slave or cls_rebuild_slave.

[root@node1 pha4pgsql]# cls_repair_slave
connected to server
servers diverged at WAL position 0/9045828 on timeline 13
rewinding from last common checkpoint at 0/9045780 on timeline 13
reading source file list
reading target file list
reading WAL in target
need to copy 211 MB (total source directory size is 229 MB)
216927/216927 kB (100%) copied
creating backup label and updating control file
syncing target data directory
Done!
All resources/stonith devices successfully cleaned up
wait for recovery complete
..........
slave recovery of node1 successed

PostgreSQL process failure on the Slave

Forcibly kill the postgres process on the Slave

[root@node2 pha4pgsql]# killall postgres

Check the cluster status
Because migration-threshold="3" is set, after a single ordinary error Pacemaker restarts the postgres process in place.

[root@node2 ~]# cls_status
Last updated: Sun May 8 01:34:36 2016
Last change: Sun May 8 01:33:01 2016 by root via crm_resource on node1
Stack: corosync
Current DC: node2 (2) - partition with quorum
Version: 1.1.12-a14efad
2 Nodes configured
4 Resources configured

Online: [ node1 node2 ]

Full list of resources:

 vip-master (ocf::heartbeat:IPaddr2): Started node1
 vip-slave (ocf::heartbeat:IPaddr2): Started node2
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ node1 ]
     Slaves: [ node2 ]

Node Attributes:
* Node node1:
    + #cluster-name : pgcluster
    + #site-name : pgcluster
    + master-pgsql : 1000
    + pgsql-data-status : LATEST
    + pgsql-master-baseline : 00000000090650F8
    + pgsql-status : PRI
* Node node2:
    + #cluster-name : pgcluster
    + #site-name : pgcluster
    + master-pgsql : 100
    + pgsql-data-status : STREAMING|SYNC
    + pgsql-status : HS:sync

Migration summary:
* Node node2:
   pgsql: migration-threshold=3 fail-count=1 last-failure='Sun May 8 01:32:44 2016'
* Node node1:

Failed actions:
    pgsql_monitor_4000 on node2 'not running' (7): call=227, status=complete, exit-reason='none', last-rc-change='Sun May 8 01:32:44 2016', queued=0ms, exec=0ms

pgsql_REPL_INFO:node1|14|00000000090650F8

Kill the postgres process on the Slave two more times, then check the cluster status.

Once the fail-count reaches 3, Pacemaker no longer starts PostgreSQL and leaves it stopped.

[root@node2 ~]# cls_status
Last updated: Sun May 8 01:36:16 2016
Last change: Sun May 8 01:36:07 2016 by root via crm_resource on node1
Stack: corosync
Current DC: node2 (2) - partition with quorum
Version: 1.1.12-a14efad
2 Nodes configured
4 Resources configured

Online: [ node1 node2 ]

Full list of resources:

 vip-master (ocf::heartbeat:IPaddr2): Started node1
 vip-slave (ocf::heartbeat:IPaddr2): Stopped
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ node1 ]
     Stopped: [ node2 ]

Node Attributes:
* Node node1:
    + #cluster-name : pgcluster
    + #site-name : pgcluster
    + master-pgsql : 1000
    + pgsql-data-status : LATEST
    + pgsql-master-baseline : 00000000090650F8
    + pgsql-status : PRI
* Node node2:
    + #cluster-name : pgcluster
    + #site-name : pgcluster
    + master-pgsql : -INFINITY
    + pgsql-data-status : DISCONNECT
    + pgsql-status : STOP

Migration summary:
* Node node2:
   pgsql: migration-threshold=3 fail-count=3 last-failure='Sun May 8 01:36:08 2016'
* Node node1:

Failed actions:
    pgsql_monitor_4000 on node2 'not running' (7): call=240, status=complete, exit-reason='none', last-rc-change='Sun May 8 01:36:08 2016', queued=0ms, exec=0ms

pgsql_REPL_INFO:node1|14|00000000090650F8

At the same time, the replication mode on the Master (node1) is automatically switched to asynchronous, so that writes do not hang.

[root@node1 pha4pgsql]# tail /var/lib/pgsql/tmp/rep_mode.conf
synchronous_standby_names = ''

Repair the Slave
Run cls_cleanup on node2; once the fail-count is cleared, Pacemaker starts the PostgreSQL process again.

[root@node2 ~]# cls_cleanup
All resources/stonith devices successfully cleaned up
[root@node2 ~]# cls_status
Last updated: Sun May 8 01:43:13 2016
Last change: Sun May 8 01:43:08 2016 by root via crm_resource on node1
Stack: corosync
Current DC: node2 (2) - partition with quorum
Version: 1.1.12-a14efad
2 Nodes configured
4 Resources configured

Online: [ node1 node2 ]

Full list of resources:

 vip-master (ocf::heartbeat:IPaddr2): Started node1
 vip-slave (ocf::heartbeat:IPaddr2): Started node2
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ node1 ]
     Slaves: [ node2 ]

Node Attributes:
* Node node1:
    + #cluster-name : pgcluster
    + #site-name : pgcluster
    + master-pgsql : 1000
    + pgsql-data-status : LATEST
    + pgsql-master-baseline : 00000000090650F8
    + pgsql-status : PRI
* Node node2:
    + #cluster-name : pgcluster
    + #site-name : pgcluster
    + master-pgsql : 100
    + pgsql-data-status : STREAMING|SYNC
    + pgsql-status : HS:sync

Migration summary:
* Node node2:
* Node node1:

pgsql_REPL_INFO:node1|14|00000000090650F8

At the same time, the replication mode on the Master (node1) automatically switches back to synchronous.

[root@node1 pha4pgsql]# tail /var/lib/pgsql/tmp/rep_mode.conf
synchronous_standby_names = 'node2'

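Appendix 1: Changes to the pgsql RA

The expgsql RA used by this project is a modification of the pgsql RA in Resource Agent 3.9.7. The changes are as follows:

Introduced a distributed lock service to prevent split-brain in two-node clusters and to prevent data loss during failover.
Both promote and the switch from synchronous to asynchronous replication in monitor must first acquire the lock. This ensures that the two cannot happen at the same time and thereby prevents data loss on failover in synchronous replication mode. The following parameters were introduced accordingly:

enable_distlock
Whether to enable distributed lock arbitration; recommended for two-node clusters.
distlock_lock_cmd
The lock command of the distributed lock service.
distlock_unlock_cmd
The unlock command of the distributed lock service.
distlock_lockservice_deadcheck_nodelist
The list of nodes used for a second check when the distributed lock service cannot be reached; the RA connects to these nodes over ssh and tries to acquire the lock from there. If none of the nodes in the list can reach the distributed lock service, the lock service is considered failed and the lock is treated as acquired. If any node in the list is itself unreachable, the lock is treated as not acquired.

A PostgreSQL-based distributed lock implementation is also built in, namely tools/distlock.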
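The second-check rule for distlock_lockservice_deadcheck_nodelist can be summarized as a decision over one status per listed node. The sketch below is an interpretation of the description above, not code from expgsql: the status words and the function name are made up, and letting the first node that reaches the lock service decide is an assumption.

```shell
# deadcheck_result STATUS...
# One STATUS per node in distlock_lockservice_deadcheck_nodelist:
#   node_unreachable  the node itself could not be reached over ssh
#   lock_unreachable  the node was reached but could not access the lock service
#   acquired | denied the node reached the lock service and got that answer
# Prints "acquired" or "denied". Hypothetical helper for illustration.
deadcheck_result() {
    answer=""
    for status in "$@"; do
        case $status in
            # Any unreachable node: treat the lock as not acquired.
            node_unreachable) echo denied; return 0 ;;
            # Remember the first real answer from the lock service.
            acquired|denied)  [ -n "$answer" ] || answer=$status ;;
        esac
    done
    # No node could reach the lock service: consider the service dead
    # and treat the lock as acquired.
    echo "${answer:-acquired}"
}

deadcheck_result lock_unreachable lock_unreachable
deadcheck_result lock_unreachable node_unreachable
```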

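Promotes a Slave to Master by either restart or pg_ctl promote, chosen dynamically according to whether the Master has changed.
When the Master has changed, the Slave is promoted with pg_ctl promote; when it has not, promotion is done by restart. The restart_on_promote parameter of the original pgsql RA is dropped accordingly.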
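Read as a rule, the choice depends only on whether the node being promoted is the Master that was recorded last time. A minimal sketch, assuming "the Master has changed" means exactly that comparison; the function name is hypothetical.

```shell
# promote_method RECORDED_MASTER THIS_NODE
# Prints how THIS_NODE would be promoted: "restart" when it is the Master
# recorded previously, "pg_ctl promote" when the Master role moves to it.
# Hypothetical helper; the comparison is an assumption about expgsql's logic.
promote_method() {
    if [ "$1" = "$2" ]; then
        echo restart
    else
        echo "pg_ctl promote"
    fi
}

promote_method node1 node1   # cluster restart, same Master
promote_method node1 node2   # failover to node2
```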

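Records the timeline and xlog position from before PostgreSQL's last timeline switch
This information is stored in the cluster configuration variable pgsql_REPL_INFO. The value of pgsql_REPL_INFO consists of the following three parts, joined by '|'.

Master node name
Timeline before pg_ctl promote
End position of the timeline before pg_ctl promote

When the RA starts, it checks whether the current node conflicts with the state recorded in pgsql_REPL_INFO; if it does, the RA reports an error and refuses to start the resource. Because of this check, the PGSQL.lock lock file of the original pgsql RA is dropped.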
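For scripts that need the individual fields, a pgsql_REPL_INFO value can be split with plain parameter expansion. This is a sketch with a hypothetical function name; the sample value is the one shown in the network failure test.

```shell
# parse_repl_info VALUE
# Splits "master|timeline|xlog_position" and prints one field per line.
# Hypothetical helper for illustration.
parse_repl_info() {
    master=${1%%|*}        # text before the first '|'
    rest=${1#*|}           # text after the first '|'
    timeline=${rest%%|*}
    position=${rest#*|}
    printf 'master=%s\ntimeline=%s\nposition=%s\n' "$master" "$timeline" "$position"
}

parse_repl_info 'node2|13|0000000009045828'
```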

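On resource start, the Master node name recorded in pgsql_REPL_INFO is used so that the previous Master keeps its role.
This speeds up cluster startup and avoids unnecessary Master/Slave switches. Only at initial startup, when pgsql_REPL_INFO is empty, does the cluster determine the Master by comparing xlog positions.

For the original functionality of the pgsql RA, see: PgSQL Replicated Cluster

Appendix 2: References

Principles and practice of PostgreSQL streaming-replication high availability
PgSQL Replicated Cluster
Building a PostgreSQL cluster with Pacemaker+Corosync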
