raftadmin: add phase 3 native admin surface #472
Conversation
Code Review
This pull request replaces the external raftadmin dependency with a native implementation within the repository. It introduces a new cmd/raftadmin CLI tool, updates the internal raft administration services, and modifies the existing codebase to use these new internal services instead of the external ones. The reviewer suggests improving the CLI tool's usability by providing command-specific usage messages rather than a generic one.
/gemini review
Code Review
This pull request replaces external dependencies for Raft administration and health monitoring with internal implementations. It introduces the internal/raftadmin package, a dedicated cmd/raftadmin CLI tool, and updates the protobuf definitions to include a RaftAdmin service. Feedback focuses on improving the health monitoring efficiency by making the polling interval configurable and addressing security concerns regarding the use of insecure gRPC credentials in the administrative tool.
internal/raftadmin/health.go
Outdated
```go
func observeLeaderHealth(ctx context.Context, engine raftengine.Engine, healthSrv *health.Server, serviceNames []string) {
	services := dedupeServices(serviceNames)
	status := currentHealthStatus(ctx, engine)
	setHealthStatus(healthSrv, services, status)

	ticker := time.NewTicker(defaultHealthPollInterval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			setHealthStatus(healthSrv, services, healthpb.HealthCheckResponse_NOT_SERVING)
			return
		case <-ticker.C:
			next := currentHealthStatus(ctx, engine)
			if next != status {
				setHealthStatus(healthSrv, services, next)
				status = next
			}
		}
	}
}
```
The observeLeaderHealth goroutine uses a ticker to poll the health status. While the interval is small (250ms), it might be more efficient to use a notification mechanism if the underlying raft engine supports it (e.g., a channel for state changes). If polling is necessary, consider making the interval configurable via environment variables or configuration, as 250ms might be too aggressive for some environments or too slow for others.
cmd/raftadmin/main.go
Outdated
```go
conn, err := grpc.NewClient(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
if err != nil {
	return nil, func() {}, errors.Wrap(err, "dial raft admin")
}
```
The dialClient function uses grpc.NewClient with insecure.NewCredentials(). While this matches the current project state, administrative tools should not inadvertently encourage insecure deployments in production. Since this is a limited-scope implementation, document the limitation explicitly, and consider adding runtime validation or a flag for TLS credentials to prevent misuse in a general context.
References
- For prototype or limited-scope implementations, explicitly document the limitations (e.g., single-node only, blocking startup) and add runtime validation to prevent misuse in a general context.
/gemini review
Code Review
This pull request replaces external Raft administration and health-check dependencies with a native internal implementation, including a new RaftAdmin gRPC service and a CLI tool. Review feedback suggests enhancing error reporting in the CLI, enforcing a minimum health poll interval to protect system performance, and ensuring the background health-monitoring goroutine terminates correctly to avoid potential resource leaks.
cmd/raftadmin/main.go
Outdated
```go
case "0", "false", "no", "off":
	return false, nil
default:
	return false, errors.Errorf("parse %s", name)
```
The error message here is quite generic and doesn't provide much context to the user about what went wrong during parsing. Including the invalid value and the environment variable name would be more helpful.
```diff
- return false, errors.Errorf("parse %s", name)
+ return false, errors.Errorf("invalid boolean value for %s: %q", name, raw)
```
| } | ||
| millis, err := strconv.Atoi(raw) |
Allowing a health poll interval as low as 1ms could lead to excessive CPU usage, especially in clusters with many nodes or when the system is under load. It's generally safer to enforce a reasonable minimum interval (e.g., 10ms or 50ms) to prevent accidental performance degradation.
```go
millis, err := strconv.Atoi(raw)
if err != nil || millis <= 0 {
	return defaultHealthPollInterval
}
if millis < 10 {
	millis = 10
}
```
References
- For low-latency systems, a short duration like 10ms is generally the acceptable minimum for repetitive operations or backoffs to prevent performance degradation.
```go
healthSrv := health.NewServer()
healthpb.RegisterHealthServer(gs, healthSrv)
go observeLeaderHealth(ctx, engine, healthSrv, serviceNames, healthPollInterval())
```
The goroutine started here depends on the provided ctx for termination. If the caller passes context.Background() (which is the fallback at line 27), this goroutine will leak if the gRPC server is stopped but the process continues. While the current usage in main.go and tests seems to handle this via application-level contexts, it's a potential pitfall for future integrations.
References
- When designing interfaces and background tasks, use context.Context for managing deadlines and cancellation to ensure proper resource cleanup and prevent goroutine leaks.
/gemini review
Code Review
This pull request replaces external Raft administration and health dependencies with a new internal raftadmin package. It introduces a RaftAdmin gRPC service, a corresponding CLI tool, and integrated health checking. The changes also include updates to the Raft engine interfaces and implementations (Etcd and Hashicorp) to support these administrative operations. Feedback suggests enhancing the CLI's TLS configuration to support custom CA certificates and ensuring consistent context usage in the health monitoring logic to avoid static status states.
cmd/raftadmin/main.go
Outdated
```go
	return nil, err
}
if useTLS {
	return credentials.NewTLS(&tls.Config{MinVersion: tls.VersionTLS12}), nil
```
The TLS configuration uses MinVersion: tls.VersionTLS12 but does not specify any RootCAs. While this works for publicly trusted certificates, internal Raft administration endpoints often use self-signed certificates or internal CAs. Consider adding a flag to allow users to provide a custom CA certificate file for verification.
internal/raftadmin/health.go
Outdated
```go
if ctx == nil {
	registerStaticHealthService(healthSrv, engine, serviceNames)
	return
}
```
When ctx is nil, the health status is set once but never updated. It would be safer to enforce a non-nil context to ensure proper lifecycle management and cancellation, as context.Context should be the primary mechanism for managing such operations in this repository.
References
- When designing interfaces, use context.Context for managing deadlines and cancellation.
```go
	return cfg, nil
}

pemBytes, err := os.ReadFile(caPath)
```
🚫 [golangci] reported by reviewdog 🐶
G703: Path traversal via taint analysis (gosec)
/gemini review
Code Review
This pull request replaces the external raftadmin and leaderhealth dependencies with a custom internal implementation. It introduces a new RaftAdmin gRPC service, a corresponding server implementation in internal/raftadmin, and a new CLI tool in cmd/raftadmin. The raftengine interfaces have been expanded to include administrative operations such as adding voters and transferring leadership, with concrete implementations added for the Hashicorp Raft engine. Jepsen tests and deployment scripts were also updated to integrate these changes. One review comment suggests verifying the Go version or restoring a loop variable capture in adapter/s3_test.go to prevent potential race conditions in parallel tests.
| } | ||
|
|
||
| for _, tc := range cases { | ||
| tc := tc |
The tc := tc line has been removed from this for loop, which contains a parallel test (t.Parallel()). This is safe only if the project is built with Go 1.22 or newer, as loop variable semantics changed in that version to prevent this kind of data race in sub-tests.
If the project's Go version is older than 1.22, this change introduces a race condition where multiple parallel sub-tests might operate on the same test case data from the slice. Please ensure the project's minimum Go version is 1.22, or restore this line to prevent potential test flakiness.
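The idiom the comment refers to can be kept harmlessly on any Go version. This self-contained sketch mirrors the table-driven t.Parallel() shape with plain goroutines:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// runParallel mirrors the table-driven t.Parallel() pattern: each goroutine
// must see its own test case. The `tc := tc` shadowing is required before
// Go 1.22, where the loop variable was shared across iterations, and is
// harmless on Go 1.22+, where the variable is already per-iteration.
func runParallel(cases []string) []string {
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		got []string
	)
	for _, tc := range cases {
		tc := tc // safe on every Go version
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			got = append(got, tc)
			mu.Unlock()
		}()
	}
	wg.Wait()
	sort.Strings(got)
	return got
}

func main() {
	fmt.Println(runParallel([]string{"a", "b", "c"}))
}
```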