Skip to content

raftadmin: add phase 3 native admin surface#472

Merged
bootjp merged 7 commits intomainfrom
feat/replace-raft-phase3
Apr 10, 2026
Merged

raftadmin: add phase 3 native admin surface#472
bootjp merged 7 commits intomainfrom
feat/replace-raft-phase3

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Apr 9, 2026

No description provided.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request replaces the external raftadmin dependency with a native implementation within the repository. It introduces a new cmd/raftadmin CLI tool, updates the internal raft administration services, and modifies the existing codebase to use these new internal services instead of the external ones. The reviewer suggests improving the CLI tool's usability by providing command-specific usage messages rather than a generic one.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 9, 2026

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request replaces external dependencies for Raft administration and health monitoring with internal implementations. It introduces the internal/raftadmin package, a dedicated cmd/raftadmin CLI tool, and updates the protobuf definitions to include a RaftAdmin service. Feedback focuses on improving the health monitoring efficiency by making the polling interval configurable and addressing security concerns regarding the use of insecure gRPC credentials in the administrative tool.

Comment on lines +33 to +54
func observeLeaderHealth(ctx context.Context, engine raftengine.Engine, healthSrv *health.Server, serviceNames []string) {
services := dedupeServices(serviceNames)
status := currentHealthStatus(ctx, engine)
setHealthStatus(healthSrv, services, status)

ticker := time.NewTicker(defaultHealthPollInterval)
defer ticker.Stop()

for {
select {
case <-ctx.Done():
setHealthStatus(healthSrv, services, healthpb.HealthCheckResponse_NOT_SERVING)
return
case <-ticker.C:
next := currentHealthStatus(ctx, engine)
if next != status {
setHealthStatus(healthSrv, services, next)
status = next
}
}
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The observeLeaderHealth goroutine uses a ticker to poll the health status. While the interval is small (250ms), it might be more efficient to use a notification mechanism if the underlying raft engine supports it (e.g., a channel for state changes). If polling is necessary, consider making the interval configurable via environment variables or configuration, as 250ms might be too aggressive for some environments or too slow for others.

Comment on lines +59 to +62
conn, err := grpc.NewClient(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
if err != nil {
return nil, func() {}, errors.Wrap(err, "dial raft admin")
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

The dialClient function uses grpc.NewClient with insecure.NewCredentials(). While this matches the current project state, it's important to ensure that administrative tools don't accidentally encourage insecure deployments in production. As this is a limited-scope implementation, explicitly document this limitation and consider adding runtime validation or a flag to support TLS credentials in the future to prevent misuse in a general context.

References
  1. For prototype or limited-scope implementations, explicitly document the limitations (e.g., single-node only, blocking startup) and add runtime validation to prevent misuse in a general context.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 9, 2026

/gemini review

@bootjp bootjp closed this Apr 9, 2026
@bootjp bootjp reopened this Apr 9, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request replaces external Raft administration and health-check dependencies with a native internal implementation, including a new RaftAdmin gRPC service and a CLI tool. Review feedback suggests enhancing error reporting in the CLI, enforcing a minimum health poll interval to protect system performance, and ensuring the background health-monitoring goroutine terminates correctly to avoid potential resource leaks.

case "0", "false", "no", "off":
return false, nil
default:
return false, errors.Errorf("parse %s", name)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The error message here is quite generic and doesn't provide much context to the user about what went wrong during parsing. Including the invalid value and the environment variable name would be more helpful.

Suggested change
return false, errors.Errorf("parse %s", name)
return false, errors.Errorf("invalid boolean value for %s: %q", name, raw)

Comment on lines +64 to +65
}
millis, err := strconv.Atoi(raw)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Allowing a health poll interval as low as 1ms could lead to excessive CPU usage, especially in clusters with many nodes or when the system is under load. It's generally safer to enforce a reasonable minimum interval (e.g., 10ms or 50ms) to prevent accidental performance degradation.

millis, err := strconv.Atoi(raw)
	if err != nil || millis <= 0 {
		return defaultHealthPollInterval
	}
	if millis < 10 {
		millis = 10
	}
References
  1. For low-latency systems, a short duration like 10ms is generally the acceptable minimum for repetitive operations or backoffs to prevent performance degradation.


healthSrv := health.NewServer()
healthpb.RegisterHealthServer(gs, healthSrv)
go observeLeaderHealth(ctx, engine, healthSrv, serviceNames, healthPollInterval())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The goroutine started here depends on the provided ctx for termination. If the caller passes context.Background() (which is the fallback at line 27), this goroutine will leak if the gRPC server is stopped but the process continues. While the current usage in main.go and tests seems to handle this via application-level contexts, it's a potential pitfall for future integrations.

References
  1. When designing interfaces and background tasks, use context.Context for managing deadlines and cancellation to ensure proper resource cleanup and prevent goroutine leaks.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 10, 2026

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request replaces external Raft administration and health dependencies with a new internal raftadmin package. It introduces a RaftAdmin gRPC service, a corresponding CLI tool, and integrated health checking. The changes also include updates to the Raft engine interfaces and implementations (Etcd and Hashicorp) to support these administrative operations. Feedback suggests enhancing the CLI's TLS configuration to support custom CA certificates and ensuring consistent context usage in the health monitoring logic to avoid static status states.

return nil, err
}
if useTLS {
return credentials.NewTLS(&tls.Config{MinVersion: tls.VersionTLS12}), nil
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

The TLS configuration uses MinVersion: tls.VersionTLS12 but does not specify any RootCAs. While this works for publicly trusted certificates, internal Raft administration endpoints often use self-signed certificates or internal CAs. Consider adding a flag to allow users to provide a custom CA certificate file for verification.

Comment on lines +32 to +35
if ctx == nil {
registerStaticHealthService(healthSrv, engine, serviceNames)
return
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When ctx is nil, the health status is set once but never updated. It would be safer to enforce a non-nil context to ensure proper lifecycle management and cancellation, as context.Context should be the primary mechanism for managing such operations in this repository.

References
  1. When designing interfaces, use context.Context for managing deadlines and cancellation.

return cfg, nil
}

pemBytes, err := os.ReadFile(caPath)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚫 [golangci] reported by reviewdog 🐶
G703: Path traversal via taint analysis (gosec)

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 10, 2026

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request replaces the external raftadmin and leaderhealth dependencies with a custom internal implementation. It introduces a new RaftAdmin gRPC service, a corresponding server implementation in internal/raftadmin, and a new CLI tool in cmd/raftadmin. The raftengine interfaces have been expanded to include administrative operations such as adding voters and transferring leadership, with concrete implementations added for the Hashicorp Raft engine. Jepsen tests and deployment scripts were also updated to integrate these changes. One review comment suggests verifying the Go version or restoring a loop variable capture in adapter/s3_test.go to prevent potential race conditions in parallel tests.

}

for _, tc := range cases {
tc := tc
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The tc := tc line has been removed from this for loop, which contains a parallel test (t.Parallel()). This is safe only if the project is built with Go 1.22 or newer, as loop variable semantics changed in that version to prevent this kind of data race in sub-tests.

If the project's Go version is older than 1.22, this change introduces a race condition where multiple parallel sub-tests might operate on the same test case data from the slice. Please ensure the project's minimum Go version is 1.22, or restore this line to prevent potential test flakiness.

@bootjp bootjp merged commit 02c6d2b into main Apr 10, 2026
6 of 7 checks passed
@bootjp bootjp deleted the feat/replace-raft-phase3 branch April 10, 2026 06:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant