Skip to content

Fix recovery slot handling before pg_basebackup#1675

Open
aviralgarg05 wants to merge 2 commits intoapache:mainfrom
aviralgarg05:fix-1654-precreate-recovery-slot
Open

Fix recovery slot handling before pg_basebackup#1675
aviralgarg05 wants to merge 2 commits intoapache:mainfrom
aviralgarg05:fix-1654-precreate-recovery-slot

Conversation

@aviralgarg05
Copy link
Copy Markdown

Fixes #1654

What does this PR do?

This PR fixes the recovery flow when the internal WAL replication slot does not already exist on the source segment.

Before this change, both gpsegrecovery and gpconfigurenewsegment would start pg_basebackup first and only retry with slot creation after the backup failed. In practice, that meant a full base backup could run for a long time and then fail at the end because the slot was missing.

This change fixes that at the root:

  • adds a shared helper to check whether the replication slot already exists
  • creates the slot up front when needed, before pg_basebackup starts
  • removes the fallback second pg_basebackup attempt from both recovery paths
  • updates unit tests to cover the new behavior and the new failure mode

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Breaking Changes

None.

Test Plan

Tested with focused management-script unit coverage and repeated verification runs.

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel

Commands used:

  • python3 -m py_compile gpMgmt/sbin/gpsegrecovery.py gpMgmt/bin/gppylib/commands/pg.py gpMgmt/bin/lib/gpconfigurenewsegment
  • PYTHONPATH=/Users/aviralgarg/Everything/cloudberry/gpMgmt/bin:/Users/aviralgarg/Everything/cloudberry/gpMgmt/sbin:/Users/aviralgarg/Everything/cloudberry/gpMgmt/bin/lib python3.11 -m unittest -v gpMgmt.bin.gppylib.commands.test.unit.test_unit_pg_base_backup gpMgmt.bin.gppylib.test.unit.test_unit_gpsegrecovery

The unit suite above was run multiple times and passed consistently.

Impact

Performance:
This avoids wasting time on a full pg_basebackup that was destined to fail only because the replication slot was missing. Recovery becomes more predictable and avoids unnecessary duplicate work.

User-facing changes:
Recovery now fails less often in clusters where the internal replication slot has not been created yet, and it no longer relies on a second backup attempt to recover from that condition.

Dependencies:
No product dependencies were added.

Checklist

Additional Context

The fix is intentionally narrow and only touches the recovery paths involved in the reported issue:

  • gpMgmt/bin/gppylib/commands/pg.py
  • gpMgmt/sbin/gpsegrecovery.py
  • gpMgmt/bin/lib/gpconfigurenewsegment
  • related unit tests

I did not run the broader integration suites in this environment.

CI Skip Instructions

No CI skip requested.

Copy link
Copy Markdown

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @aviralgarg05 welcome!🎊 Thanks for taking the effort to make our project better! 🙌 Keep making such awesome contributions!

Copy link
Copy Markdown
Member

@yjhjstz yjhjstz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

welcome, LGTM~

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

gprecoverseg runs pg_basebackup twice when replication slot does not exist

2 participants