之前通过官方的WAL日志流同步实现了主从备份功能 这里讲解了如何实现主备自动切换
搭建基于日志流的主从切换
- 详见PgSql基于异步流复制的热备份
- 主机 172.30.16.6
- 备机 172.30.16.7
配置 $PGDATA/recovery.conf 文件
- 查看备份机 172.30.16.7原有配置1 
 2
 3[root@localhost ~]$ cat $PGDATA/recovery.conf 
 standby_mode = 'on'
 primary_conninfo = 'user=slony host=172.30.16.6 port=5432 sslmode=disable sslcompression=1'
- 添加一行 trigger_file = '/opt/pgsql/data/upgrade'
- 添加自动恢复 主机变备机要用很重要 recovery_target_timeline = latest
- 我们看下添加之后的文件1 
 2
 3
 4
 5
 6[root@localhost ~]$ echo "trigger_file = '/opt/pgsql/data/upgrade'" >> $PGDATA/recovery.conf 
 [root@localhost ~]$ cat $PGDATA/recovery.conf
 standby_mode = 'on'
 primary_conninfo = 'user=slony host=172.30.16.6 port=5432 sslmode=disable sslcompression=1'
 trigger_file = '/opt/pgsql/data/upgrade'
 recovery_target_timeline = latest
- 重启 PgSql pgc restart注意: 这里的pgc是命令别名 详见PgSql常用指令
测试主备切换
- 模拟原机 172.30.16.6宕机su postgresql -c "pg_ctl -D $PGDATA -l logfile stop"
- 备份机 172.30.16.7上可以看到日志开始报错1 
 2
 3
 4
 5
 6
 7
 8
 9LOG: replication terminated by primary server 
 DETAIL: End of WAL reached on timeline 1 at 0/41FF480.
 FATAL: could not send end-of-streaming message to primary: no COPY in progress
 LOG: invalid record length at 0/41FF480: wanted 24, got 0
 FATAL: could not connect to the primary server: could not connect to server: 拒绝连接
 Is the server running on host "172.30.16.6" and accepting
 TCP/IP connections on port 5432?
- 这个时候在备份机 172.30.16.7新建上面设置的恢复文件touch /opt/pgsql/data/upgrade
- 我们再看日志 提示找到目标文件 备机 172.30.16.7已经升级为主机1 
 2
 3
 4
 5
 6
 7
 8
 9
 10FATAL: could not connect to the primary server: could not connect to server: 拒绝连接 
 Is the server running on host "172.30.16.6" and accepting
 TCP/IP connections on port 5432?
 LOG: trigger file found: /opt/pgsql/data/upgrade
 LOG: redo done at 0/41FF410
 LOG: selected new timeline ID: 2
 LOG: archive recovery complete
 LOG: MultiXact member wraparound protections are now enabled
 LOG: database system is ready to accept connections
 LOG: autovacuum launcher started
- 在 172.30.16.7用pg_controldata看一下1 
 2
 3
 4
 5
 6pg_control version number: 960 
 Catalog version number: 201608131
 Database system identifier: 6341692574859629745
 Database cluster state: in production 这里就代表处于主机状态
 pg_control last modified: 2016年10月26日 星期三 18时10分10秒
 省略余下部分...主备切换最明显的一点就是 备机 172.30.16.7下的recovery.conf文件变成了recovery.done
切换源主机为备份机
- 从原备份机 172.30.16.7复制recovery.conf(recovery.done)到原主机172.30.16.6- 执行 scp -P 22 $PGDATA/recovery.done root@172.30.16.6:$PGDATA/recovery.conf
 
- 执行 
- 修改 原主机 172.30.16.6为备份模式- 修改 recovery.conf调整primary_conninfo修改IP为16.7的地址1 
 2
 3
 4standby_mode = 'on' 
 primary_conninfo = 'user=slony host=172.30.16.7 port=5432 sslmode=disable sslcompression=1'
 trigger_file = '/opt/pgsql/data/upgrade'
 recovery_target_timeline = latest
- 修改 postgresql.conf文件 启动热备份hot_standby = on
 
- 修改 
- 启动当前备机 172.30.16.6执行su postgresql -c "pg_ctl -D $PGDATA -l logfile start"
- 如果 当前备机 172.30.16.6启动后出现以下错误就说明时间线不一致 需要复制原备机的时间线文件1 
 2
 3LOG: restarted WAL streaming at 0/4000000 on timeline 1 
 LOG: replication terminated by primary server
 DETAIL: End of WAL reached on timeline 1 at 0/41FF480.- 在 原备机 172.30.16.7执行下列命令scp -P 22 $PGDATA/pg_xlog/00000002.history root@172.30.16.6:$PGDATA/pg_xlog/
- 过一会儿 就会自动恢复了 可以看到 redo starts at 0/41FF480自动化恢复了1 
 2
 3
 4
 5
 6LOG: entering standby mode 
 LOG: consistent recovery state reached at 0/41FF480
 LOG: invalid record length at 0/41FF480: wanted 24, got 0
 LOG: database system is ready to accept read only connections
 LOG: started streaming WAL from primary at 0/4000000 on timeline 2
 LOG: redo starts at 0/41FF480
 
- 在 原备机 
如复制了时间线文件还是不能自动恢复 请检查
recovery.conf是不是忘了加recovery_target_timeline = latest这一条 并且重启数据库
使用Shell脚本实现主备自动切换
由于对外业务不可能切换IP地址 而且DNS解析至少需要10分钟才能刷新缓存 所以这里用虚拟IP(VIP)
- 虚拟IP配置详见 Linux下虚拟IP(Virtual Internet Protocol)配置;
使用NMAP获取数据库运行状态
- nmap是用来扫描机器端口开放情况的工具 结果可以分为 open => 开启 close => 关闭 filtered => 过滤(有可能被防火墙拦截)
- 如果提示找不到命令 请执行 yum install nmap安装
- 执行 nmap 172.30.16.7 -p 5432 | grep 5432 | awk '{ print $2 }'可以得到结果 open
编写备机升级主机脚本
- vi $PGDATA/upgrade.sh1 
 2
 3
 4
 5
 6
 7
 8
 9
 10
 11INF='enp2s0:0' 
 MAINIP='172.30.16.7' #主服务器IP
 VIP='172.30.16.10' #虚拟IP地址
 UPGRADEFILE='/opt/pgsql/data/upgrade' #主备升级文件
 if [[ `nmap ${MAINIP} -p 5432 | grep 5432 | awk '{ print $2 }'` != "open" ]]; then
 touch ${UPGRADEFILE}
 ifconfig ${INF} ${VIP} netmask 255.255.255.0 up
 else
 rm -f ${UPGRADEFILE}
 ifconfig ${INF} down
 end
使用任务计划循环检测主机运行状态
- CronTab 使用 详见crontab使用简介
- 输入 crontab -e打开用户任务计划表 配置为每分钟检测一次 按ESC 输入:wq 保存退出1 * * * * * $PGDATA/upgrade.sh >/dev/null 2>&1 
- 重启服务 service crond restart