GPDB gprecoverseg -F Unable to connect to database
在GPDB运维过程中,有时会发生节点错误,一般采用gprecoverseg 进行节点的恢复,如果不能恢复,会尝试使用gprecoverseg -F进行全量恢复,一般在这个环节问题可以解决。但是个别情况下,会报出Unable to connect to database的错误,这个时候就需要特别处理了。
错误信息:报错信息为Unable to connect to database. Retrying 1
gprecoverseg failed. (Reason=’Unable to connect to database and start transaction’) exiting…
一般在retry5次后,会自动退出。
解决办法:发现其实出现这个问题只需要修改pg_changetracking下的CT_METADATA,或者说从其他正常的主事例上拷贝一份到出问题的主事例上即可。
下面说说分析和解决的具体步骤:
gpstate -s 去查看具体信息,可以发现一个主节点down了,一个mirror切换为主了。
注意这个时候使用 gprecoverseg -F -v 去显示恢复的详细信息
[gpadmin@gpmaster ~]$ gprecoverseg -F -v
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[INFO]:-Starting gprecoverseg with args: -F -v
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Checking if MASTER_DATA_DIRECTORY env variable is set.
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Obtaining master's port from master data directory
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Read from postgresql.conf port=5432
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Read from postgresql.conf max_connections=200
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Checking if GPHOME env variable is set.
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 4.3.4.1 build 2'
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Connecting to dbname=template1
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.2.15 (Greenplum Database 4.3.4.1 build 2) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2 compiled on Feb 10 2015 14:15:10'
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[INFO]:-Checking if segments are ready
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[INFO]:-Obtaining Segment details from master...
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Connecting to dbname=template1
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Validating configuration...
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Sending msg getStatus and cmdStr $GPHOME/bin/gp_primarymirror -h gpslave-1 -p 40000
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Adding cmd to work_queue: $GPHOME/bin/gp_primarymirror -h gpslave-1 -p 40000
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Sending msg getStatus and cmdStr $GPHOME/bin/gp_primarymirror -h gpslave-2 -p 40000
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Adding cmd to work_queue: $GPHOME/bin/gp_primarymirror -h gpslave-2 -p 40000
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker0] got cmd: $GPHOME/bin/gp_primarymirror -h gpslave-1 -p 40000
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker1] got cmd: $GPHOME/bin/gp_primarymirror -h gpslave-2 -p 40000
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker0] finished cmd: Get segment status cmdStr='ssh -o 'StrictHostKeyChecking no' gpslave-1 ". /usr/local/greenplum-db/./greenplum_path.sh; $GPHOME/bin/gp_primarymirror -h gpslave-1 -p 40000"' had result: cmd had rc=1 completed=True halted=False
stdout=''
stderr='mode: PrimarySegment
segmentState: ChangeTrackingDisabled --content=0的primary的状态是ChangeTrackingDisabled
dataState: InChangeTracking
faultType: NotInitialized
mode: PrimarySegment
segmentState: ChangeTrackingDisabled
dataState: InChangeTracking
faultType: NotInitialized
'
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker1] finished cmd: Get segment status cmdStr='ssh -o 'StrictHostKeyChecking no' gpslave-2 ". /usr/local/greenplum-db/./greenplum_path.sh; $GPHOME/bin/gp_primarymirror -h gpslave-2 -p 40000"' had result: cmd had rc=1 completed=True halted=False
stdout=''
stderr='mode: PrimarySegment
segmentState: Ready --content=1的primary的状态是Ready
dataState: InChangeTracking
faultType: NotInitialized
mode: PrimarySegment
segmentState: Ready
dataState: InChangeTracking
faultType: NotInitialized
'
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Encountered error Not ready to connect to database mode: PrimarySegment
segmentState: ChangeTrackingDisabled --由此可见,之前无法执行全量恢复是由于content=0的primary有问题,与mirror无关
dataState: InChangeTracking
faultType: NotInitialized
mode: PrimarySegment
segmentState: ChangeTrackingDisabled
dataState: InChangeTracking
faultType: NotInitialized
20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[INFO]:-Unable to connect to database. Retrying 1
^C20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-WorkerPool haltWork()
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker0] haltWork
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker1] haltWork
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker2] haltWork
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker3] haltWork
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker4] haltWork
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker5] haltWork
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker6] haltWork
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker7] haltWork
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker8] haltWork
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker9] haltWork
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker10] haltWork
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker11] haltWork
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker12] haltWork
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker13] haltWork
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker14] haltWork
20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker15] haltWork
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker5] stopping
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker3] stopping
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker6] stopping
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker4] stopping
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker2] stopping
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker7] stopping
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker9] stopping
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker10] stopping
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker12] stopping
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker8] stopping
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker11] stopping
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker14] stopping
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker13] stopping
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker15] stopping
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker0] stopping
20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker1] stopping
User Interrupted
查看content=0的primary此次启动的日志记录,与下面content=1的primary此次启动的日志记录对比,发现还是CHANGETRACKING有问题。而且gprecoverseg -F执行后没有产生日志记录,说明相应文件只在数据库启动时生效。
查看**/data/primary/gpseg0/pg_changetracking** 下的文件,就这么三个文件,而造成上面gprecoverseg无法执行的正是其中的CT_METADATA
将mirror切换为主的节点下的pg_chnagetracking下的CT_METADATA重命名,并copy到之前的出错的主节点上,然后执行gprecoverseg -F即可解决此问题。
