Fix recovered consensus pipes staying stopped after snapshot load#17438
Fix recovered consensus pipes staying stopped after snapshot load#17438Pengzna wants to merge 2 commits intoapache:masterfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes ConfigNode recovery behavior so consensus pipes loaded from a snapshot don’t remain incorrectly STOPPED after restart, while still preserving STOPPED state for pipes halted due to runtime exceptions.
Changes:
- Normalize recovered consensus pipe statuses to RUNNING during
PipeTaskInfo.processLoadSnapshot, except for pipes marked as stopped by runtime exception. - Add a unit test covering snapshot take/load and verifying the expected status normalization behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/persistence/pipe/PipeTaskInfo.java | Normalizes recovered consensus pipe runtime status after snapshot load. |
| iotdb-core/confignode/src/test/java/org/apache/iotdb/confignode/persistence/pipe/PipeTaskInfoConsensusPipeTest.java | Adds snapshot recovery test ensuring only healthy stopped consensus pipes are restarted. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| LOGGER.info( | ||
| "Recovered consensus pipes {} as RUNNING during snapshot load.", restartedConsensusPipes); | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
The INFO log prints the full list of restarted consensus pipe names, which can be very large (consensus pipes scale roughly with N*(N-1) per region). This may bloat logs during snapshot recovery; consider logging only the count (and optionally a small sample) at INFO, and the full list at DEBUG.
| LOGGER.info( | |
| "Recovered consensus pipes {} as RUNNING during snapshot load.", restartedConsensusPipes); | |
| } | |
| } | |
| final List<String> restartedConsensusPipeSample = | |
| samplePipeNamesForInfoLog(restartedConsensusPipes, 10); | |
| LOGGER.info( | |
| "Recovered {} consensus pipes as RUNNING during snapshot load. Sample: {}{}", | |
| restartedConsensusPipes.size(), | |
| restartedConsensusPipeSample, | |
| restartedConsensusPipes.size() > restartedConsensusPipeSample.size() ? " ..." : ""); | |
| if (LOGGER.isDebugEnabled()) { | |
| LOGGER.debug( | |
| "Recovered consensus pipes {} as RUNNING during snapshot load.", | |
| restartedConsensusPipes); | |
| } | |
| } | |
| } | |
| private List<String> samplePipeNamesForInfoLog( | |
| final List<String> pipeNames, final int maxSampleSize) { | |
| return new ArrayList<>(pipeNames.subList(0, Math.min(pipeNames.size(), maxSampleSize))); | |
| } |
| } finally { | ||
| new File(snapshotDir, "pipe_task_info.bin").delete(); | ||
| snapshotDir.delete(); | ||
| } |
There was a problem hiding this comment.
Test cleanup hard-codes the snapshot file name ("pipe_task_info.bin") and ignores delete() results. Consider deleting the temp directory recursively (and/or deleting all files under snapshotDir) so cleanup doesn't depend on PipeTaskInfo's private snapshot filename and doesn’t silently leak temp dirs if deletion fails.
Summary
Validation
JAVA_HOME=$(/usr/libexec/java_home -v 17) mvn -pl iotdb-core/confignode -Dspotless.skip=false spotless:applymvn -pl iotdb-core/confignode -Dtest=PipeTaskInfoConsensusPipeTest testearlier, but dependency resolution was too slow in the current environment