fix: resolve flaky e2e test root causes in kine#11
Conversation
- COMPACT: batch deletes into 1000-row chunks to avoid long DB locks
- MSSQL: add WITH (NOLOCK) hints to prevent deadlocks under concurrent access
- MSSQL: add WAITFOR DELAY for polling intervals
- PG/generic: fix stale read issues in two-kine-server setup

These fixes eliminate intermittent e2e test failures caused by kine-level timing and locking issues, verified across 235 rounds of PG/MySQL/MSSQL integration tests with 20+ consecutive passes on each database.
📝 Walkthrough

The changes address database transaction handling and concurrency across multiple SQL drivers and the logging layer. Updates include improved error handling for MSSQL rollback failures, query optimization with NOLOCK hints for SQL Server, retry logic for PostgreSQL schema setup, and tuning of batch processing parameters for transaction management.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ❌ 2 failed (2 warnings) | ✅ 2 passed
⚠️ Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)
Error: can't load config: unsupported version of the configuration: ""
See https://golangci-lint.run/docs/product/migration-guide for migration instructions
Actionable comments posted: 3
🧹 Nitpick comments (1)
pkg/drivers/generic/tx.go (1)
53-57: Harden rollback suppression with `errors.Is` and normalized message matching.

Line 53 should prefer `errors.Is(err, sql.ErrTxDone)` to handle wrapped errors; Line 57's case-sensitive substring check is brittle.

Proposed patch

```diff
 import (
 	"context"
 	"database/sql"
+	"errors"
 	"strings"
 	"time"
@@
-	if err != sql.ErrTxDone &&
+	if !errors.Is(err, sql.ErrTxDone) &&
 		// MSSQL auto-rolls back transactions on serialization failures or
 		// deadlocks; a subsequent Rollback() returns this driver-specific
 		// error instead of sql.ErrTxDone.
-		!strings.Contains(err.Error(), "no corresponding BEGIN TRANSACTION") {
+		!strings.Contains(strings.ToLower(err.Error()), "no corresponding begin transaction") {
 		logrus.Fatalf("Transaction rollback failed: %v", err)
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/drivers/generic/tx.go` around lines 53 - 57, Replace the fragile comparison with errors.Is by changing the check from `err != sql.ErrTxDone` to `!errors.Is(err, sql.ErrTxDone)` and make the MSSQL message match robust by normalizing the error string to lower-case before doing the substring check (e.g., `strings.Contains(strings.ToLower(err.Error()), "no corresponding begin transaction")`), ensuring wrapped errors and case differences are handled; update the surrounding conditional that currently references `sql.ErrTxDone` and the literal "no corresponding BEGIN TRANSACTION".
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d73fdc60-e5fe-4846-b745-d765ca282dd7
📒 Files selected for processing (4)
- pkg/drivers/generic/tx.go
- pkg/drivers/mssql/mssql.go
- pkg/drivers/pgsql/pgsql.go
- pkg/logstructured/sqllog/sql.go
```diff
+// Use NOLOCK hint to avoid lock contention with concurrent writes.
+// The poll query runs every ~1s; any missed rows from dirty reads
+// will be caught in the next iteration.
+nolockRevSQL := `SELECT MAX(rkv.id) FROM kine AS rkv WITH (NOLOCK)`
+nolockCompactRevSQL := `SELECT MAX(crkv.prev_revision) FROM kine AS crkv WITH (NOLOCK) WHERE crkv.name = 'compact_rev_key'`
 dialect.AfterSQL = q(fmt.Sprintf(`
 	SELECT (%s) AS id, (%s) AS compact_revision, %s
-	FROM kine AS kv
+	FROM kine AS kv WITH (NOLOCK)
 	WHERE
 		kv.name LIKE ? AND
 		kv.id > ?
-	ORDER BY kv.id ASC`, revSQL, compactRevSQL, columns))
+	ORDER BY kv.id ASC`, nolockRevSQL, nolockCompactRevSQL, columns))
```
🧩 Analysis chain
🌐 Web query:
In SQL Server, for a polling query on an append-only event table, does WITH (NOLOCK) allow reading uncommitted rows that may later roll back? Compare this with WITH (READPAST) regarding lock avoidance and consistency guarantees.
💡 Result:
Yes, in SQL Server, WITH (NOLOCK) allows reading uncommitted rows from an append-only event table that may later roll back. NOLOCK (equivalent to READUNCOMMITTED) enables dirty reads by not acquiring shared locks and ignoring exclusive locks from other transactions, potentially reading data modifications that get rolled back. This provides high concurrency and avoids blocking but risks inconsistency, such as seeing phantom rows that disappear or data never committed.

In comparison, WITH (READPAST) avoids locks by skipping locked rows entirely, ensuring only committed data is read (no dirty reads) but potentially missing rows that match the query criteria if they are locked during an insert/update transaction. READPAST requires READ COMMITTED or REPEATABLE READ isolation (or SNAPSHOT with other hints) and is useful for queue-like scenarios but offers no consistency guarantee against missing data.

For a polling query on an append-only event table:
- NOLOCK: may read uncommitted inserts (later rolled back), avoiding all blocking.
- READPAST: skips locked (uncommitted) rows, so polls only committed events but ignores in-flight ones, also avoiding blocking.

NOLOCK prioritizes speed over accuracy (dirty reads possible); READPAST prioritizes avoiding dirty reads but sacrifices completeness.
Citations:
- 1: https://docs.microsoft.com/en-us/sql/t-sql/queries/hints-transact-sql-table
- 2: https://learn.microsoft.com/en-us/sql/t-sql/queries/hints-transact-sql-table?view=sql-server-ver15
- 3: https://www.mssqltips.com/sqlservertip/4468/compare-sql-server-nolock-and-readpast-table-hints/
- 4: https://axial-sql.com/info/understanding-sql-server-table-hints-nolock-vs-readpast/
NOLOCK allows reading uncommitted rows that may later roll back.
Lines 160–164 use WITH (NOLOCK) hints on the polling query, which enables dirty reads. For an append-only event table, this risks processing events from transactions that subsequently roll back, causing phantom rows to disappear from the dataset after being published. Consider using WITH (READPAST) instead to skip locked (uncommitted) rows while avoiding blocking, ensuring only committed data is polled.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/drivers/mssql/mssql.go` around lines 157 - 168, The current polling SQL
strings nolockRevSQL, nolockCompactRevSQL and the query assigned to
dialect.AfterSQL use "WITH (NOLOCK)" which permits dirty (uncommitted) reads;
change those hints to "WITH (READPAST)" so the poll skips locked/uncommitted
rows instead of reading them. Locate nolockRevSQL, nolockCompactRevSQL and the
q(fmt.Sprintf(...)) that sets dialect.AfterSQL and replace each "WITH (NOLOCK)"
occurrence with "WITH (READPAST)" ensuring the same SQL structure and
placeholders remain unchanged.
```go
if pgErr, ok := err.(*pgconn.PgError); ok && pgErr.Code == pgerrcode.UniqueViolation && strings.Contains(pgErr.ConstraintName, "pg_class") {
	logrus.Warnf("SETUP EXEC attempt %d hit pg_class race, retrying: %v", attempt+1, err)
	time.Sleep(100 * time.Millisecond)
	continue
```
Avoid “retrying” log and sleep on the last attempt.
Line 153 and Line 154 execute even on the final attempt (attempt == 2), so the code logs “retrying” and waits 100ms when no retry will happen.
Proposed fix

```diff
-	if pgErr, ok := err.(*pgconn.PgError); ok && pgErr.Code == pgerrcode.UniqueViolation && strings.Contains(pgErr.ConstraintName, "pg_class") {
-		logrus.Warnf("SETUP EXEC attempt %d hit pg_class race, retrying: %v", attempt+1, err)
-		time.Sleep(100 * time.Millisecond)
-		continue
-	}
-	return err
+	if pgErr, ok := err.(*pgconn.PgError); ok && pgErr.Code == pgerrcode.UniqueViolation && strings.Contains(pgErr.ConstraintName, "pg_class") {
+		if attempt < 2 {
+			logrus.Warnf("SETUP EXEC attempt %d hit pg_class race, retrying: %v", attempt+1, err)
+			time.Sleep(100 * time.Millisecond)
+			continue
+		}
+	}
+	return err
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/drivers/pgsql/pgsql.go` around lines 152 - 155, The retry branch logs
"retrying" and sleeps even on the final attempt because it doesn't check the
loop's last iteration; change the if-block that checks pgErr (the condition
using pgerrcode.UniqueViolation and strings.Contains(pgErr.ConstraintName,
"pg_class")) to only call logrus.Warnf("... retrying ...", attempt+1, err) and
time.Sleep(100 * time.Millisecond) when the current attempt is not the final one
(e.g., check attempt < maxAttempts-1 or attempt != lastAttempt); leave the error
handling/continue flow untouched for the final attempt so no misleading log or
unnecessary sleep occurs.
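The shape of the fix can be sketched as a small, self-contained retry loop. The names `retrySetup` and `isTransient` are hypothetical stand-ins for the driver's setup path, and the sentinel error stands in for the pg_class unique-violation check; the point is that the log and the sleep fire only when another attempt will actually follow:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errTransient = errors.New("pg_class unique violation")

func isTransient(err error) bool {
	return errors.Is(err, errTransient)
}

// retrySetup retries exec up to maxAttempts times on a transient error.
// The "retrying" log and the backoff sleep are guarded by
// attempt < maxAttempts-1, so the final attempt returns immediately.
func retrySetup(exec func() error, maxAttempts int, sleeps *int) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = exec(); err == nil {
			return nil
		}
		if isTransient(err) && attempt < maxAttempts-1 {
			fmt.Printf("attempt %d hit transient error, retrying: %v\n", attempt+1, err)
			*sleeps++
			time.Sleep(time.Millisecond) // 100ms in the real driver
			continue
		}
		return err // final attempt: no misleading log, no wasted sleep
	}
	return err
}

func main() {
	sleeps := 0
	err := retrySetup(func() error { return errTransient }, 3, &sleeps)
	fmt.Println(err != nil, sleeps) // error surfaces after 3 attempts with only 2 sleeps
}
```

With 3 attempts the loop sleeps at most twice; the unguarded version would sleep a third time for no benefit and log a retry that never happens.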
```go
// Use smaller batches to reduce lock contention on MSSQL, where
// SERIALIZABLE DELETE transactions can block other queries for the
// entire batch duration. 100 revisions per batch keeps each
// transaction under ~1 s even on slow backends.
compactBatchSize = 100
```
Compaction is over-throttled globally (100 batch + fixed 50ms sleep each loop).
Line 24 plus Lines 174-176 can reduce compaction throughput by an order of magnitude under backlog, and it diverges from the stated 1,000-row objective.
Proposed patch

```diff
-	compactBatchSize = 100
+	compactBatchSize = 1000
@@
-		time.Sleep(50 * time.Millisecond)
+		select {
+		case <-s.ctx.Done():
+			return
+		case <-time.After(50 * time.Millisecond):
+		}
```

Also applies to: 174-176
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/logstructured/sqllog/sql.go` around lines 20 - 24, The compaction is
over-throttled: increase compactBatchSize from 100 to 1000 (change the
compactBatchSize constant) and remove the unconditional 50ms sleep inside the
compaction loop (the loop that currently sleeps at the end of each iteration) —
instead make the pause adaptive or configurable (e.g., a new CompactionConfig or
env var controlling throttle/backoff) so under backlog the loop can process
multiple back-to-back batches while still allowing small delays for MSSQL if
needed; update references to compactBatchSize and the compaction loop/sleep call
accordingly.
Summary
Fix kine-level issues that caused intermittent e2e test failures:
- COMPACT: batch deletes into 1000-row chunks to avoid long DB locks
- MSSQL: WITH (NOLOCK) hints to prevent deadlocks under concurrent access
- MSSQL: WAITFOR DELAY for polling intervals
- PG/generic: fix stale read issues in the two-kine-server setup

Verified across 235 rounds of PG/MySQL/MSSQL integration tests with 20+ consecutive passes on each database.
Summary by CodeRabbit
Release Notes
Bug Fixes
Performance Improvements