
fix: resolve flaky e2e test root causes in kine #11

Open
jarvis9443 wants to merge 1 commit into master from fix/flaky-e2e-tests

Conversation


@jarvis9443 jarvis9443 commented Apr 6, 2026

Summary

Fix kine-level issues that caused intermittent e2e test failures:

  • COMPACT batching: Batch deletes into 1000-row chunks to avoid long DB locks
  • MSSQL NOLOCK: Add WITH (NOLOCK) hints to prevent deadlocks under concurrent access
  • MSSQL WAITFOR DELAY: Implement proper polling intervals for MSSQL
  • Stale read fixes: Fix eventual consistency issues in two-kine-server setup

Verified across 235 rounds of PG/MySQL/MSSQL integration tests with 20+ consecutive passes on each database.

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved transaction rollback handling for SQL Server to prevent crashes during serialization failures.
    • Added retry logic for PostgreSQL schema setup to handle transient constraint violations.
  • Performance Improvements

    • Reduced lock contention in SQL Server revision queries.
    • Optimized database compaction to allow better concurrent query execution.

- COMPACT: batch deletes into 1000-row chunks to avoid long DB locks
- MSSQL: add WITH (NOLOCK) hints to prevent deadlocks under concurrent access
- MSSQL: add WAITFOR DELAY for polling intervals
- PG/generic: fix stale read issues in two-kine-server setup

These fixes eliminate intermittent e2e test failures caused by kine-level
timing and locking issues, verified across 235 rounds of PG/MySQL/MSSQL
integration tests with 20+ consecutive passes on each database.
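The COMPACT batching described above can be sketched as follows; `chunkIDs` is an illustrative helper invented here, not code from this PR, and the real implementation issues one short DELETE transaction per batch rather than collecting IDs in memory:

```go
package main

import "fmt"

// chunkIDs splits the revisions pending compaction into batches of at
// most batchSize, so each DELETE runs in its own short transaction and
// releases its locks quickly instead of holding them for the whole set.
func chunkIDs(ids []int64, batchSize int) [][]int64 {
	var batches [][]int64
	for len(ids) > 0 {
		n := batchSize
		if len(ids) < n {
			n = len(ids)
		}
		batches = append(batches, ids[:n])
		ids = ids[n:]
	}
	return batches
}

func main() {
	ids := make([]int64, 2500)
	for i := range ids {
		ids[i] = int64(i + 1)
	}
	// 2500 pending revisions -> three transactions: 1000, 1000, 500 rows.
	for _, b := range chunkIDs(ids, 1000) {
		fmt.Println(len(b))
	}
}
```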

coderabbitai bot commented Apr 6, 2026

📝 Walkthrough


The changes address database transaction handling and concurrency across multiple SQL drivers and the logging layer. Updates include improved error handling for MSSQL rollback failures, query optimization with NOLOCK hints for SQL Server, retry logic for PostgreSQL schema setup, and tuning of batch processing parameters for transaction management.

Changes

Cohort / File(s) Summary
MSSQL Transaction Rollback Error Handling
pkg/drivers/generic/tx.go
Modified MustRollback to suppress fatal logging for MSSQL-specific rollback errors when the error message contains "no corresponding BEGIN TRANSACTION", handling cases where MSSQL auto-rolls back the transaction on serialization failures. Added a strings import for the substring match.
MSSQL Query Optimization
pkg/drivers/mssql/mssql.go
Added WITH (NOLOCK) table hints to revision polling queries in dialect.AfterSQL, introducing NOLOCK-backed scalar subqueries to reduce lock contention on the kine table during polling operations.
PostgreSQL Schema Setup Robustness
pkg/drivers/pgsql/pgsql.go
Implemented retry logic (up to 3 attempts) for schema and index statement execution in setup, with conditional retry on pgconn.PgError with UniqueViolation SQLSTATE and "pg_class" constraint name, including 100ms sleep between attempts.
Compaction Batch Tuning
pkg/logstructured/sqllog/sql.go
Reduced compactBatchSize constant from 1000 to 100 and added 50ms sleep between successful batch iterations to reduce SERIALIZABLE transaction scope and yield to concurrent queries.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
E2e Test Quality Review ⚠️ Warning PR contains zero E2E test files covering database driver fixes and compaction logic across PostgreSQL, MySQL, and MSSQL. Add comprehensive E2E tests validating all fixes across databases and apply identified code corrections: fix pgsql retry logic, revert compactBatchSize to 1000, replace NOLOCK with READPAST.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix: resolve flaky e2e test root causes in kine' is directly related to the main purpose of the changeset, which addresses flaky test issues through multiple database-specific fixes.



Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
pkg/drivers/generic/tx.go (1)

53-57: Harden rollback suppression with errors.Is and normalized message matching.

Line 53 should prefer errors.Is(err, sql.ErrTxDone) to handle wrapped errors; Line 57’s case-sensitive substring check is brittle.

Proposed patch
 import (
 	"context"
 	"database/sql"
+	"errors"
 	"strings"
 	"time"
@@
-		if err != sql.ErrTxDone &&
+		if !errors.Is(err, sql.ErrTxDone) &&
 			// MSSQL auto-rolls back transactions on serialization failures or
 			// deadlocks; a subsequent Rollback() returns this driver-specific
 			// error instead of sql.ErrTxDone.
-			!strings.Contains(err.Error(), "no corresponding BEGIN TRANSACTION") {
+			!strings.Contains(strings.ToLower(err.Error()), "no corresponding begin transaction") {
 			logrus.Fatalf("Transaction rollback failed: %v", err)
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/drivers/generic/tx.go` around lines 53 - 57, Replace the fragile
comparison with errors.Is by changing the check from `err != sql.ErrTxDone` to
`!errors.Is(err, sql.ErrTxDone)` and make the MSSQL message match robust by
normalizing the error string to lower-case before doing the substring check
(e.g., `strings.Contains(strings.ToLower(err.Error()), "no corresponding begin
transaction")`), ensuring wrapped errors and case differences are handled;
update the surrounding conditional that currently references `sql.ErrTxDone` and
the literal "no corresponding BEGIN TRANSACTION".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/drivers/mssql/mssql.go`:
- Around line 157-168: The current polling SQL strings nolockRevSQL,
nolockCompactRevSQL and the query assigned to dialect.AfterSQL use "WITH
(NOLOCK)" which permits dirty (uncommitted) reads; change those hints to "WITH
(READPAST)" so the poll skips locked/uncommitted rows instead of reading them.
Locate nolockRevSQL, nolockCompactRevSQL and the q(fmt.Sprintf(...)) that sets
dialect.AfterSQL and replace each "WITH (NOLOCK)" occurrence with "WITH
(READPAST)" ensuring the same SQL structure and placeholders remain unchanged.

In `@pkg/drivers/pgsql/pgsql.go`:
- Around line 152-155: The retry branch logs "retrying" and sleeps even on the
final attempt because it doesn't check the loop's last iteration; change the
if-block that checks pgErr (the condition using pgerrcode.UniqueViolation and
strings.Contains(pgErr.ConstraintName, "pg_class")) to only call
logrus.Warnf("... retrying ...", attempt+1, err) and time.Sleep(100 *
time.Millisecond) when the current attempt is not the final one (e.g., check
attempt < maxAttempts-1 or attempt != lastAttempt); leave the error
handling/continue flow untouched for the final attempt so no misleading log or
unnecessary sleep occurs.

In `@pkg/logstructured/sqllog/sql.go`:
- Around line 20-24: The compaction is over-throttled: increase compactBatchSize
from 100 to 1000 (change the compactBatchSize constant) and remove the
unconditional 50ms sleep inside the compaction loop (the loop that currently
sleeps at the end of each iteration) — instead make the pause adaptive or
configurable (e.g., a new CompactionConfig or env var controlling
throttle/backoff) so under backlog the loop can process multiple back-to-back
batches while still allowing small delays for MSSQL if needed; update references
to compactBatchSize and the compaction loop/sleep call accordingly.

---

Nitpick comments:
In `@pkg/drivers/generic/tx.go`:
- Around line 53-57: Replace the fragile comparison with errors.Is by changing
the check from `err != sql.ErrTxDone` to `!errors.Is(err, sql.ErrTxDone)` and
make the MSSQL message match robust by normalizing the error string to
lower-case before doing the substring check (e.g.,
`strings.Contains(strings.ToLower(err.Error()), "no corresponding begin
transaction")`), ensuring wrapped errors and case differences are handled;
update the surrounding conditional that currently references `sql.ErrTxDone` and
the literal "no corresponding BEGIN TRANSACTION".
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d73fdc60-e5fe-4846-b745-d765ca282dd7

📥 Commits

Reviewing files that changed from the base of the PR and between 54ffde6 and 1c850ce.

📒 Files selected for processing (4)
  • pkg/drivers/generic/tx.go
  • pkg/drivers/mssql/mssql.go
  • pkg/drivers/pgsql/pgsql.go
  • pkg/logstructured/sqllog/sql.go

Comment on lines +157 to +168

 	// Use NOLOCK hint to avoid lock contention with concurrent writes.
 	// The poll query runs every ~1s; any missed rows from dirty reads
 	// will be caught in the next iteration.
 	nolockRevSQL := `SELECT MAX(rkv.id) FROM kine AS rkv WITH (NOLOCK)`
 	nolockCompactRevSQL := `SELECT MAX(crkv.prev_revision) FROM kine AS crkv WITH (NOLOCK) WHERE crkv.name = 'compact_rev_key'`
 	dialect.AfterSQL = q(fmt.Sprintf(`
 		SELECT (%s) AS id, (%s) AS compact_revision, %s
-		FROM kine AS kv
+		FROM kine AS kv WITH (NOLOCK)
 		WHERE
 			kv.name LIKE ? AND
 			kv.id > ?
-		ORDER BY kv.id ASC`, revSQL, compactRevSQL, columns))
+		ORDER BY kv.id ASC`, nolockRevSQL, nolockCompactRevSQL, columns))


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In SQL Server, for a polling query on an append-only event table, does WITH (NOLOCK) allow reading uncommitted rows that may later roll back? Compare this with WITH (READPAST) regarding lock avoidance and consistency guarantees.

💡 Result:

Yes, in SQL Server, WITH (NOLOCK) allows reading uncommitted rows from an append-only event table that may later roll back. NOLOCK (equivalent to READUNCOMMITTED) enables dirty reads by not acquiring shared locks and ignoring exclusive locks from other transactions, potentially reading data modifications that get rolled back. This provides high concurrency and avoids blocking but risks inconsistency, such as seeing phantom rows that disappear or data that was never committed. In comparison, WITH (READPAST) avoids locks by skipping locked rows entirely, ensuring only committed data is read (no dirty reads) but potentially missing rows that match the query criteria if they are locked during an insert/update transaction. READPAST requires READ COMMITTED or REPEATABLE READ isolation (or SNAPSHOT with other hints) and is useful for queue-like scenarios but offers no guarantee against missing data. For a polling query on an append-only event table:

  • NOLOCK: may read uncommitted inserts (later rolled back), avoiding all blocking.
  • READPAST: skips locked (uncommitted) rows, so polls only committed events but ignores in-flight ones, also avoiding blocking.

NOLOCK prioritizes speed over accuracy (dirty reads possible); READPAST prioritizes avoiding dirty reads but sacrifices completeness.


NOLOCK allows reading uncommitted rows that may later roll back.

Lines 160–164 use WITH (NOLOCK) hints on the polling query, which enables dirty reads. For an append-only event table, this risks processing events from transactions that subsequently roll back, causing phantom rows to disappear from the dataset after being published. Consider using WITH (READPAST) instead to skip locked (uncommitted) rows while avoiding blocking, ensuring only committed data is polled.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/drivers/mssql/mssql.go` around lines 157 - 168, The current polling SQL
strings nolockRevSQL, nolockCompactRevSQL and the query assigned to
dialect.AfterSQL use "WITH (NOLOCK)" which permits dirty (uncommitted) reads;
change those hints to "WITH (READPAST)" so the poll skips locked/uncommitted
rows instead of reading them. Locate nolockRevSQL, nolockCompactRevSQL and the
q(fmt.Sprintf(...)) that sets dialect.AfterSQL and replace each "WITH (NOLOCK)"
occurrence with "WITH (READPAST)" ensuring the same SQL structure and
placeholders remain unchanged.

Comment on lines +152 to +155

			if pgErr, ok := err.(*pgconn.PgError); ok && pgErr.Code == pgerrcode.UniqueViolation && strings.Contains(pgErr.ConstraintName, "pg_class") {
				logrus.Warnf("SETUP EXEC attempt %d hit pg_class race, retrying: %v", attempt+1, err)
				time.Sleep(100 * time.Millisecond)
				continue


⚠️ Potential issue | 🟡 Minor

Avoid “retrying” log and sleep on the last attempt.

Line 153 and Line 154 execute even on the final attempt (attempt == 2), so the code logs “retrying” and waits 100ms when no retry will happen.

Proposed fix
-			if pgErr, ok := err.(*pgconn.PgError); ok && pgErr.Code == pgerrcode.UniqueViolation && strings.Contains(pgErr.ConstraintName, "pg_class") {
-				logrus.Warnf("SETUP EXEC attempt %d hit pg_class race, retrying: %v", attempt+1, err)
-				time.Sleep(100 * time.Millisecond)
-				continue
-			}
-			return err
+			if pgErr, ok := err.(*pgconn.PgError); ok && pgErr.Code == pgerrcode.UniqueViolation && strings.Contains(pgErr.ConstraintName, "pg_class") {
+				if attempt < 2 {
+					logrus.Warnf("SETUP EXEC attempt %d hit pg_class race, retrying: %v", attempt+1, err)
+					time.Sleep(100 * time.Millisecond)
+					continue
+				}
+			}
+			return err
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/drivers/pgsql/pgsql.go` around lines 152 - 155, The retry branch logs
"retrying" and sleeps even on the final attempt because it doesn't check the
loop's last iteration; change the if-block that checks pgErr (the condition
using pgerrcode.UniqueViolation and strings.Contains(pgErr.ConstraintName,
"pg_class")) to only call logrus.Warnf("... retrying ...", attempt+1, err) and
time.Sleep(100 * time.Millisecond) when the current attempt is not the final one
(e.g., check attempt < maxAttempts-1 or attempt != lastAttempt); leave the error
handling/continue flow untouched for the final attempt so no misleading log or
unnecessary sleep occurs.

Comment on lines +20 to +24

	// Use smaller batches to reduce lock contention on MSSQL, where
	// SERIALIZABLE DELETE transactions can block other queries for the
	// entire batch duration. 100 revisions per batch keeps each
	// transaction under ~1 s even on slow backends.
	compactBatchSize = 100


⚠️ Potential issue | 🟠 Major

Compaction is over-throttled globally (100 batch + fixed 50ms sleep each loop).

Line 24 plus Lines 174-176 can reduce compaction throughput by an order of magnitude under backlog, and it diverges from the stated 1,000-row objective.

Proposed patch
-	compactBatchSize = 100
+	compactBatchSize = 1000
@@
-			time.Sleep(50 * time.Millisecond)
+			select {
+			case <-s.ctx.Done():
+				return
+			case <-time.After(50 * time.Millisecond):
+			}

Also applies to: 174-176

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/logstructured/sqllog/sql.go` around lines 20 - 24, The compaction is
over-throttled: increase compactBatchSize from 100 to 1000 (change the
compactBatchSize constant) and remove the unconditional 50ms sleep inside the
compaction loop (the loop that currently sleeps at the end of each iteration) —
instead make the pause adaptive or configurable (e.g., a new CompactionConfig or
env var controlling throttle/backoff) so under backlog the loop can process
multiple back-to-back batches while still allowing small delays for MSSQL if
needed; update references to compactBatchSize and the compaction loop/sleep call
accordingly.
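One way to make the pause adaptive rather than unconditional, as the comment asks, is to throttle only between full batches and exit once a partial batch signals the backlog is drained. This is a hedged sketch (`compactLoop` and its signature are invented for illustration, not kine's API):

```go
package main

import (
	"fmt"
	"time"
)

const compactBatchSize = 1000

// compactLoop runs compactOnce until the backlog is drained. A full batch
// implies more rows likely remain, so the loop only throttles between full
// batches; a short batch means the backlog is gone and the loop exits,
// never paying the unconditional per-iteration sleep under steady state.
func compactLoop(compactOnce func() (int, error), throttle time.Duration) (int, error) {
	total := 0
	for {
		n, err := compactOnce()
		if err != nil {
			return total, err
		}
		total += n
		if n < compactBatchSize {
			return total, nil // backlog drained
		}
		time.Sleep(throttle) // brief yield to concurrent queries between full batches
	}
}

func main() {
	// Simulate a 2500-row backlog: two full batches, then a 500-row batch.
	backlog := 2500
	compacted, err := compactLoop(func() (int, error) {
		n := compactBatchSize
		if backlog < n {
			n = backlog
		}
		backlog -= n
		return n, nil
	}, time.Millisecond)
	fmt.Println(compacted, err) // 2500 <nil>
}
```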
