Advertisement

GPDB gprecoverseg -F Unable to connect to database

阅读量:

在GPDB运维过程中,有时会发生节点错误,一般采用gprecoverseg 进行节点的恢复,如果不能恢复,会尝试使用gprecoverseg -F进行全量恢复,一般在这个环节问题可以解决。但是个别情况下,会报出Unable to connect to database的错误,这个时候就需要特别处理了。
错误信息:报错信息为Unable to connect to database. Retrying 1
gprecoverseg failed. (Reason=’Unable to connect to database and start transaction’) exiting…
一般在retry5次后,会自动退出。
解决办法:发现其实出现这个问题只需要修改pg_changetracking下的CT_METADATA,或者说从其他正常的主事例上拷贝一份到出问题的主事例上即可。

下面说说分析和解决的具体步骤:
gpstate -s 去查看具体信息,可以发现一个主节点down了,一个mirror切换为主了。
注意这个时候使用 gprecoverseg -F -v 去显示恢复的详细信息

复制代码
    [gpadmin@gpmaster ~]$ gprecoverseg -F -v
    20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[INFO]:-Starting gprecoverseg with args: -F -v
    20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Checking if MASTER_DATA_DIRECTORY env variable is set.
    20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Obtaining master's port from master data directory
    20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Read from postgresql.conf port=5432
    20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Read from postgresql.conf max_connections=200
    20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Checking if GPHOME env variable is set.
    20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 4.3.4.1 build 2'
    20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Connecting to dbname=template1
    20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.2.15 (Greenplum Database 4.3.4.1 build 2) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2 compiled on Feb 10 2015 14:15:10'
    20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[INFO]:-Checking if segments are ready
    20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[INFO]:-Obtaining Segment details from master...
    20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Connecting to dbname=template1
    20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Validating configuration...
    20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Sending msg getStatus and cmdStr $GPHOME/bin/gp_primarymirror -h gpslave-1 -p 40000
    20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Adding cmd to work_queue: $GPHOME/bin/gp_primarymirror -h gpslave-1 -p 40000
    20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Sending msg getStatus and cmdStr $GPHOME/bin/gp_primarymirror -h gpslave-2 -p 40000
    20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Adding cmd to work_queue: $GPHOME/bin/gp_primarymirror -h gpslave-2 -p 40000
    20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker0] got cmd: $GPHOME/bin/gp_primarymirror -h gpslave-1 -p 40000
    20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker1] got cmd: $GPHOME/bin/gp_primarymirror -h gpslave-2 -p 40000
    20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker0] finished cmd: Get segment status cmdStr='ssh -o 'StrictHostKeyChecking no' gpslave-1 ". /usr/local/greenplum-db/./greenplum_path.sh; $GPHOME/bin/gp_primarymirror -h gpslave-1 -p 40000"' had result: cmd had rc=1 completed=True halted=False
    stdout=''
    stderr='mode: PrimarySegment
    segmentState: ChangeTrackingDisabled --content=0的primary的状态是ChangeTrackingDisabled
    dataState: InChangeTracking
    faultType: NotInitialized
    mode: PrimarySegment
    segmentState: ChangeTrackingDisabled
    dataState: InChangeTracking
    faultType: NotInitialized
    '
    20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker1] finished cmd: Get segment status cmdStr='ssh -o 'StrictHostKeyChecking no' gpslave-2 ". /usr/local/greenplum-db/./greenplum_path.sh; $GPHOME/bin/gp_primarymirror -h gpslave-2 -p 40000"' had result: cmd had rc=1 completed=True halted=False
    stdout=''
    stderr='mode: PrimarySegment
    segmentState: Ready --content=1的primary的状态是Ready
    dataState: InChangeTracking
    faultType: NotInitialized
    mode: PrimarySegment
    segmentState: Ready
    dataState: InChangeTracking
    faultType: NotInitialized
    '
    20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-Encountered error Not ready to connect to database mode: PrimarySegment
    segmentState: ChangeTrackingDisabled --由此可见,之前无法执行全量恢复是由于content=0的primary有问题,与mirror无关
    dataState: InChangeTracking
    faultType: NotInitialized
    mode: PrimarySegment
    segmentState: ChangeTrackingDisabled
    dataState: InChangeTracking
    faultType: NotInitialized
    20150727:22:51:32:003078 gprecoverseg:gpmaster:gpadmin-[INFO]:-Unable to connect to database. Retrying 1
    ^C20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-WorkerPool haltWork()
    20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker0] haltWork
    20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker1] haltWork
    20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker2] haltWork
    20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker3] haltWork
    20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker4] haltWork
    20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker5] haltWork
    20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker6] haltWork
    20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker7] haltWork
    20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker8] haltWork
    20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker9] haltWork
    20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker10] haltWork
    20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker11] haltWork
    20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker12] haltWork
    20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker13] haltWork
    20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker14] haltWork
    20150727:22:51:34:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker15] haltWork
    20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker5] stopping
    20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker3] stopping
    20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker6] stopping
    20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker4] stopping
    20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker2] stopping
    20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker7] stopping
    20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker9] stopping
    20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker10] stopping
    20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker12] stopping
    20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker8] stopping
    20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker11] stopping
    20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker14] stopping
    20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker13] stopping
    20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker15] stopping
    20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker0] stopping
    20150727:22:51:37:003078 gprecoverseg:gpmaster:gpadmin-[DEBUG]:-[worker1] stopping
    User Interrupted
    
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    

查看content=0的primary此次启动的日志记录,与下面content=1的primary此次启动的日志记录对比,发现还是CHANGETRACKING有问题。而且gprecoverseg -F执行后没有产生日志记录,说明相应文件只在数据库启动时生效。
查看**/data/primary/gpseg0/pg_changetracking** 下的文件,就这么三个文件,而造成上面gprecoverseg无法执行的正是其中的CT_METADATA

将mirror切换为主的节点下的pg_chnagetracking下的CT_METADATA重命名,并copy到之前的出错的主节点上,然后执行gprecoverseg -F即可解决此问题。

全部评论 (0)

还没有任何评论哟~