SPOC-494: Introduce random delays injection for testing bgworker races#408
SPOC-494: Introduce random delays injection for testing bgworker races#408
Conversation
📝 WalkthroughWalkthroughAdds compile-time worker injection hooks and optional random-delay implementation; inserts delay calls at several worker/apply points; updates build, Docker, and GitHub Actions to enable SPOCK_RANDOM_DELAYS; modifies regression tests to add explicit spock.sync_event()/spock.wait_for_sync_event() synchronization barriers. Changes
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsGit: Failed to clone repository. Please run the Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 0 |
| Duplication | 0 |
TIP This summary will be updated as you push new changes. Give us feedback
Adds SPOCK_WORKER_DELAY() at worker start/finish and before each
INSERT/UPDATE/DELETE in the apply loop.
Two mechanisms, mutually exclusive at build time:
- SPOCK_RANDOM_DELAYS=1 make: compiles in unconditional 1–100 ms
random delays, no runtime setup needed.
- PG17+ without the flag: wires INJECTION_POINT("spock-worker-delay")
so the core injection_points module can attach callbacks at runtime.
Pre-PG17 compiles to a no-op.
The per-transaction delay introduced by SPOCK_WORKER_DELAY() in handle_begin exposed several races in the regression suite where tests were checking subscriber state before the apply worker had caught up. Four tests needed fixes: autoddl: security label checks on the subscriber were racing against the apply worker. Replace the "short round trip" comments with an explicit sync_event / wait_for_sync_event barrier before each pg_seclabels query. primary_key: two fixes: - Before issuing CREATE UNIQUE INDEX via replicate_ddl, add a sync barrier to ensure the rows that must cause the duplicate-key failure already exist on the subscriber. Without it, the index could be created successfully, leaving the subscription enabled instead of disabled. - After the DDL, replace the immediate sub_show_status() check with a polling loop guarded by statement_timeout = 30s, waiting for sub_enabled to become false before reading the status. sync_table, generated_columns: sub_resync_table would trigger a COPY that collided with rows the delayed apply worker had not yet inserted. Add a sync barrier before each sub_resync_table call so the apply worker is fully caught up before the sync begins. att_list: add a trailing sync barrier so the DROP TABLE is guaranteed to have been applied on the subscriber before subsequent tests run.
fbbc440 to
fe8e181
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/regress/sql/primary_key.sql (1)
141-155: Scope the wait loop totest_subscriptionfor stricter determinism.The current predicate exits on any disabled subscription. Filtering by
sub_nameavoids accidental early exit if additional subscriptions exist in the test environment.Proposed tweak
- EXIT WHEN EXISTS (SELECT 1 FROM spock.subscription WHERE sub_enabled = false); + EXIT WHEN EXISTS ( + SELECT 1 + FROM spock.subscription + WHERE sub_name = 'test_subscription' + AND sub_enabled = false + );🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/regress/sql/primary_key.sql` around lines 141 - 155, The wait loop currently exits when any subscription is disabled; narrow the EXISTS predicate to only consider the subscription named "test_subscription" by adding a filter on spock.subscription.sub_name (e.g., WHERE sub_name = 'test_subscription' AND sub_enabled = false) so the loop only finishes for that specific subscription and avoids false positives from other subscriptions in the environment.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@tests/regress/sql/primary_key.sql`:
- Around line 141-155: The wait loop currently exits when any subscription is
disabled; narrow the EXISTS predicate to only consider the subscription named
"test_subscription" by adding a filter on spock.subscription.sub_name (e.g.,
WHERE sub_name = 'test_subscription' AND sub_enabled = false) so the loop only
finishes for that specific subscription and avoids false positives from other
subscriptions in the environment.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 16899bd2-02d8-46df-9456-5dc590204416
⛔ Files ignored due to path filters (4)
tests/regress/expected/att_list.outis excluded by!**/*.outtests/regress/expected/generated_columns.outis excluded by!**/*.outtests/regress/expected/primary_key.outis excluded by!**/*.outtests/regress/expected/sync_table.outis excluded by!**/*.out
📒 Files selected for processing (7)
.github/workflows/random-delays-regress.ymltests/docker/Dockerfile-step-1.el9tests/regress/sql/att_list.sqltests/regress/sql/autoddl.sqltests/regress/sql/generated_columns.sqltests/regress/sql/primary_key.sqltests/regress/sql/sync_table.sql
✅ Files skipped from review due to trivial changes (1)
- .github/workflows/random-delays-regress.yml
There was a problem hiding this comment.
🧹 Nitpick comments (2)
tests/regress/sql/primary_key.sql (1)
150-150: Clarify the purpose of this COMMIT.The
DO $$ ... $$;block runs as a single statement under psql's autocommit. Adding an explicitCOMMIT;after it seems redundant since there's no open transaction. If this is defensive coding against a specific edge case (e.g., implicit transaction from prior commands), a brief comment would help future maintainers understand the intent.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/regress/sql/primary_key.sql` at line 150, The explicit COMMIT; after the DO $$ ... $$; block appears redundant because psql runs DO as a single statement under autocommit; either remove the standalone COMMIT; line or, if it was added as defensive protection against an implicit/open transaction from earlier statements, replace it with a brief comment explaining that intent (reference the DO $$ ... $$; block and the COMMIT; token) so future maintainers understand why the extra COMMIT is present..github/workflows/random-delays-regress.yml (1)
58-63: Container cleanup could be more robust.Currently, if
docker cpfails for a reason other than missing files (e.g., container not running), the|| truesuppresses the error but thedocker rmon line 63 still runs. However, if the "Run regression tests" step times out or is cancelled,$REG_CT_NAMEmay not be set.Consider extracting container cleanup to a dedicated step to ensure it always runs:
♻️ Suggested improvement
- name: Collect regression artifacts if: ${{ always() }} run: | docker cp "$REG_CT_NAME":/home/pgedge/spock/tests/regress/regression_output \ "${GITHUB_WORKSPACE}/tests/regress/" || true - docker rm -f "$REG_CT_NAME" || true + + - name: Cleanup container + if: ${{ always() }} + run: | + docker rm -f "$REG_CT_NAME" || true🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/random-delays-regress.yml around lines 58 - 63, Split artifact collection and container teardown: remove the docker rm -f "$REG_CT_NAME" from the "Collect regression artifacts" run block (keep docker cp || true), and add a new separate step (e.g., "Cleanup regression container") that runs if: always() and performs a guarded removal: first test that the REG_CT_NAME env var is set/non-empty (e.g., [ -n "${REG_CT_NAME:-}" ]), then check container existence (docker ps -a --filter "name=$REG_CT_NAME" --format '{{.Names}}' or docker inspect) and only then run docker rm -f "$REG_CT_NAME"; this ensures teardown always runs but won’t error when REG_CT_NAME is missing or the container doesn’t exist.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In @.github/workflows/random-delays-regress.yml:
- Around line 58-63: Split artifact collection and container teardown: remove
the docker rm -f "$REG_CT_NAME" from the "Collect regression artifacts" run
block (keep docker cp || true), and add a new separate step (e.g., "Cleanup
regression container") that runs if: always() and performs a guarded removal:
first test that the REG_CT_NAME env var is set/non-empty (e.g., [ -n
"${REG_CT_NAME:-}" ]), then check container existence (docker ps -a --filter
"name=$REG_CT_NAME" --format '{{.Names}}' or docker inspect) and only then run
docker rm -f "$REG_CT_NAME"; this ensures teardown always runs but won’t error
when REG_CT_NAME is missing or the container doesn’t exist.
In `@tests/regress/sql/primary_key.sql`:
- Line 150: The explicit COMMIT; after the DO $$ ... $$; block appears redundant
because psql runs DO as a single statement under autocommit; either remove the
standalone COMMIT; line or, if it was added as defensive protection against an
implicit/open transaction from earlier statements, replace it with a brief
comment explaining that intent (reference the DO $$ ... $$; block and the
COMMIT; token) so future maintainers understand why the extra COMMIT is present.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 597b08dc-1194-48f6-84da-b09f3eeace57
⛔ Files ignored due to path filters (4)
tests/regress/expected/att_list.outis excluded by!**/*.outtests/regress/expected/generated_columns.outis excluded by!**/*.outtests/regress/expected/primary_key.outis excluded by!**/*.outtests/regress/expected/sync_table.outis excluded by!**/*.out
📒 Files selected for processing (12)
.github/workflows/random-delays-regress.ymlMakefileinclude/spock_injection.hsrc/spock_apply.csrc/spock_injection.csrc/spock_worker.ctests/docker/Dockerfile-step-1.el9tests/regress/sql/att_list.sqltests/regress/sql/autoddl.sqltests/regress/sql/generated_columns.sqltests/regress/sql/primary_key.sqltests/regress/sql/sync_table.sql
✅ Files skipped from review due to trivial changes (4)
- Makefile
- tests/regress/sql/sync_table.sql
- src/spock_apply.c
- src/spock_worker.c
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/regress/sql/generated_columns.sql
fe8e181 to
dd8b072
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (3)
.github/workflows/random-delays-regress.yml (1)
8-11: Consider restricting workflow triggers to reduce CI load.Running random-delay stress tests on every PR open/sync may be excessive and consume CI resources. Consider:
- Using only
workflow_dispatchfor manual triggering- Adding a schedule for periodic runs
- Adding path filters to only trigger on relevant source changes
If running on PRs is intentional for catching timing regressions early, this is fine as-is.
Example: limit to workflow_dispatch + schedule
on: workflow_dispatch: - pull_request: - types: [opened, synchronize, reopened] + schedule: + - cron: '0 3 * * 0' # Weekly on Sunday at 3 AM🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/random-delays-regress.yml around lines 8 - 11, The workflow currently triggers on workflow_dispatch and pull_request (types: opened, synchronize, reopened), which may run heavy random-delay tests on every PR; update the trigger configuration to reduce CI load by removing or narrowing the pull_request trigger (e.g., remove pull_request entirely to rely on manual workflow_dispatch, or replace it with a scheduled cron and/or path filters), or if PR triggers are needed limit them via path: or specific branches so only relevant source changes run the tests; modify the top-level triggers (workflow_dispatch, pull_request, schedule/path filters) accordingly to implement the chosen restriction.src/spock_injection.c (1)
42-42: Consider reducing log level from LOG to DEBUG1.With
SPOCK_WORKER_DELAY()called at every INSERT/UPDATE/DELETE operation in addition to worker start/finish, logging atLOGlevel will generate significant output. Consider usingDEBUG1to reduce noise while still allowing visibility when needed.Proposed fix
- elog(LOG, "Spock random delay: sleeping %ld ms", delay_ms); + elog(DEBUG1, "Spock random delay: sleeping %ld ms", delay_ms);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/spock_injection.c` at line 42, Change the elog call in the SPOCK_WORKER_DELAY() path from LOG to DEBUG1 to reduce noisy output; locate the call elog(LOG, "Spock random delay: sleeping %ld ms", delay_ms) in spock_injection.c (used during SPOCK_WORKER_DELAY() invocations and worker start/finish) and replace the log level with DEBUG1 so the message remains available at debug verbosity but not at LOG level.tests/regress/sql/primary_key.sql (1)
150-150: Verify necessity of explicit COMMIT after DO block.The
DOblock creates its own implicit transaction in psql. The explicitCOMMITafter it would be a no-op in autocommit mode. Is this needed for a specific test execution context, or can it be removed?🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/regress/sql/primary_key.sql` at line 150, The explicit COMMIT after the DO block appears redundant because the DO block runs in its own transaction; inspect the test harness around the DO block and either remove the trailing explicit COMMIT statement (delete the lone "COMMIT;" line) if tests run in autocommit mode, or if a commit is required by a non-autocommit wrapper, replace the standalone COMMIT with an explicit BEGIN/COMMIT pair or a clarifying comment; ensure the change targets the DO block section and the lone "COMMIT;" token so the test behavior remains correct.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In @.github/workflows/random-delays-regress.yml:
- Around line 8-11: The workflow currently triggers on workflow_dispatch and
pull_request (types: opened, synchronize, reopened), which may run heavy
random-delay tests on every PR; update the trigger configuration to reduce CI
load by removing or narrowing the pull_request trigger (e.g., remove
pull_request entirely to rely on manual workflow_dispatch, or replace it with a
scheduled cron and/or path filters), or if PR triggers are needed limit them via
path: or specific branches so only relevant source changes run the tests; modify
the top-level triggers (workflow_dispatch, pull_request, schedule/path filters)
accordingly to implement the chosen restriction.
In `@src/spock_injection.c`:
- Line 42: Change the elog call in the SPOCK_WORKER_DELAY() path from LOG to
DEBUG1 to reduce noisy output; locate the call elog(LOG, "Spock random delay:
sleeping %ld ms", delay_ms) in spock_injection.c (used during
SPOCK_WORKER_DELAY() invocations and worker start/finish) and replace the log
level with DEBUG1 so the message remains available at debug verbosity but not at
LOG level.
In `@tests/regress/sql/primary_key.sql`:
- Line 150: The explicit COMMIT after the DO block appears redundant because the
DO block runs in its own transaction; inspect the test harness around the DO
block and either remove the trailing explicit COMMIT statement (delete the lone
"COMMIT;" line) if tests run in autocommit mode, or if a commit is required by a
non-autocommit wrapper, replace the standalone COMMIT with an explicit
BEGIN/COMMIT pair or a clarifying comment; ensure the change targets the DO
block section and the lone "COMMIT;" token so the test behavior remains correct.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b3e36ddd-dc09-4ad2-a70d-cf9d5da31abd
⛔ Files ignored due to path filters (4)
tests/regress/expected/att_list.outis excluded by!**/*.outtests/regress/expected/generated_columns.outis excluded by!**/*.outtests/regress/expected/primary_key.outis excluded by!**/*.outtests/regress/expected/sync_table.outis excluded by!**/*.out
📒 Files selected for processing (12)
.github/workflows/random-delays-regress.ymlMakefileinclude/spock_injection.hsrc/spock_apply.csrc/spock_injection.csrc/spock_worker.ctests/docker/Dockerfile-step-1.el9tests/regress/sql/att_list.sqltests/regress/sql/autoddl.sqltests/regress/sql/generated_columns.sqltests/regress/sql/primary_key.sqltests/regress/sql/sync_table.sql
✅ Files skipped from review due to trivial changes (1)
- Makefile
🚧 Files skipped from review as they are similar to previous changes (3)
- src/spock_apply.c
- src/spock_worker.c
- include/spock_injection.h
Adds .github/workflows/random-delays-regress.yml, a dedicated workflow that builds Spock with SPOCK_RANDOM_DELAYS=1 and runs the full regression suite. The random delays (1–100 ms at every transaction begin, worker start, and worker finish) expose timing-sensitive races that are invisible under normal test load. The workflow runs on pull_request and workflow_dispatch across PG15–18, with a 60-minute timeout per run to accommodate the extra per-transaction overhead. Dockerfile-step-1.el9 gains a SPOCK_RANDOM_DELAYS build argument (default empty, i.e. disabled) that threads through to the Spock make invocation. Normal builds are unaffected.
dd8b072 to
0ba6e80
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
.github/workflows/random-delays-regress.yml (1)
8-11: Consider limiting workflow triggers to reduce CI load.This workflow builds 4 Docker images (one per PG version) with 30-minute timeout each, potentially adding significant CI time to every PR. Consider:
- Adding a
pathsfilter to only run when relevant files change (e.g.,src/**,tests/**,Makefile)- Or running only on
workflow_dispatchand scheduled basis instead of every PRExample: Add paths filter
on: workflow_dispatch: pull_request: types: [opened, synchronize, reopened] + paths: + - 'src/**' + - 'include/**' + - 'tests/**' + - 'Makefile' + - 'patches/**' + - '.github/workflows/random-delays-regress.yml'🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/random-delays-regress.yml around lines 8 - 11, The workflow currently triggers on pull_request (types: [opened, synchronize, reopened]) which causes expensive builds for every PR; update the workflow triggers to reduce CI load by either adding a paths filter under the pull_request trigger (e.g., limit to relevant paths like src/**, tests/**, Makefile) or remove the pull_request trigger and keep only workflow_dispatch and an optional schedule; modify the "on" block to include the chosen change (adjust the pull_request entry or remove it) so the workflow only runs when relevant files change or when manually/scheduled invoked.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In @.github/workflows/random-delays-regress.yml:
- Around line 8-11: The workflow currently triggers on pull_request (types:
[opened, synchronize, reopened]) which causes expensive builds for every PR;
update the workflow triggers to reduce CI load by either adding a paths filter
under the pull_request trigger (e.g., limit to relevant paths like src/**,
tests/**, Makefile) or remove the pull_request trigger and keep only
workflow_dispatch and an optional schedule; modify the "on" block to include the
chosen change (adjust the pull_request entry or remove it) so the workflow only
runs when relevant files change or when manually/scheduled invoked.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c90a0bf0-00e5-452e-8d1e-c1d48b8a7455
⛔ Files ignored due to path filters (5)
tests/regress/expected/att_list.outis excluded by!**/*.outtests/regress/expected/autoddl.outis excluded by!**/*.outtests/regress/expected/generated_columns.outis excluded by!**/*.outtests/regress/expected/primary_key.outis excluded by!**/*.outtests/regress/expected/sync_table.outis excluded by!**/*.out
📒 Files selected for processing (12)
.github/workflows/random-delays-regress.ymlMakefileinclude/spock_injection.hsrc/spock_apply.csrc/spock_injection.csrc/spock_worker.ctests/docker/Dockerfile-step-1.el9tests/regress/sql/att_list.sqltests/regress/sql/autoddl.sqltests/regress/sql/generated_columns.sqltests/regress/sql/primary_key.sqltests/regress/sql/sync_table.sql
✅ Files skipped from review due to trivial changes (3)
- Makefile
- src/spock_apply.c
- src/spock_worker.c
🚧 Files skipped from review as they are similar to previous changes (1)
- include/spock_injection.h
Guard the INJECTION_POINT() path with USE_INJECTION_POINTS instead of PG_VERSION_NUM >= 170000. The header and macro only exist when PostgreSQL is built with --enable-injection-points.
Summary
Adds
SPOCK_WORKER_DELAY()at three points in the replication pipeline to support timing-related test scenarios:Two independent mechanisms are provided depending on build-time configuration:
SPOCK_RANDOM_DELAYS=1 make— compiles in unconditional random delays (1–100 ms) at all three sites using the PostgreSQL global PRNG. No runtime setup required. Useful for stress-testing races in the worker lifecycle and apply path.Injection points (PG17+, no flag set) — wires
INJECTION_POINT("spock-worker-delay")so the coreinjection_pointsmodule can attach callbacks at runtime without any Spock-specific SQL functions: