Exit quiescence when splice_init and tx_init_rbf are rejected#4495
Exit quiescence when splice_init and tx_init_rbf are rejected#4495jkczyz wants to merge 7 commits intolightningdevkit:mainfrom
splice_init and tx_init_rbf are rejected#4495Conversation
|
👋 Thanks for assigning @wpaulino as a reviewer! |
splice_init and tx_init_rbf are rejected
a5d670c to
57aad34
Compare
8cdc480 to
959b553
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4495 +/- ##
==========================================
- Coverage 87.09% 87.04% -0.05%
==========================================
Files 163 163
Lines 108856 108649 -207
Branches 108856 108649 -207
==========================================
- Hits 94808 94575 -233
- Misses 11563 11606 +43
+ Partials 2485 2468 -17
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
959b553 to
69a1a1c
Compare
| fn quiescent_negotiation_err(&mut self, err: ChannelError) -> InteractiveTxMsgError { | ||
| let exited_quiescence = | ||
| if matches!(err, ChannelError::Abort(_)) { self.exit_quiescence() } else { false }; | ||
| InteractiveTxMsgError { err, splice_funding_failed: None, exited_quiescence } | ||
| } |
There was a problem hiding this comment.
In splice_init, resolve_queued_contribution (which can return Abort(FeeRateTooHigh)) is called before validate_splice_init checks is_quiescent(). If a misbehaving peer sends splice_init during the STFU handshake (before quiescence is established), and the holder has a queued contribution with a low max_feerate, the Abort error would flow into quiescent_negotiation_err → exit_quiescence(), which has debug_assert!(!is_local_stfu_sent()) / debug_assert!(!is_remote_stfu_sent()).
Those asserts would fire because the STFU flags are still set (handshake incomplete). In release builds, exit_quiescence would harmlessly return false, but in debug/test builds this would panic.
Same issue applies to tx_init_rbf at line 12762 (and line 12780).
Consider guarding with an is_quiescent() check before calling exit_quiescence:
| fn quiescent_negotiation_err(&mut self, err: ChannelError) -> InteractiveTxMsgError { | |
| let exited_quiescence = | |
| if matches!(err, ChannelError::Abort(_)) { self.exit_quiescence() } else { false }; | |
| InteractiveTxMsgError { err, splice_funding_failed: None, exited_quiescence } | |
| } | |
| fn quiescent_negotiation_err(&mut self, err: ChannelError) -> InteractiveTxMsgError { | |
| let exited_quiescence = if matches!(err, ChannelError::Abort(_)) | |
| && self.context.channel_state.is_quiescent() | |
| { | |
| self.exit_quiescence() | |
| } else { | |
| false | |
| }; | |
| InteractiveTxMsgError { err, splice_funding_failed: None, exited_quiescence } | |
| } |
There was a problem hiding this comment.
Fixed, though I took the opportunity to have validate_splice_init and validate_ tx_init_rbf no longer call validate_splice_contributions and FundingScope::for_splice. Rather they are called at those call sites. This let's us check if we are quiescent before checking fee rates when calling resolve_queued_contribution.
|
No new issues found beyond the prior review comments. Prior comment status:
|
There was a problem hiding this comment.
Note that we'll now abort an in-progress RBF if the counterparty misbehaves by sending us tx_init_rbf. The spec isn't really clear on what to do here. There's a similar case in validate_splice_init where we'll disconnect if we already have a pending splice (even if negotiation is still in progress). The spec does indicate this should be done. So maybe both of these are fine?
When tx_init_rbf is rejected with ChannelError::Abort (e.g., insufficient RBF feerate, negotiation in progress, feerate too high), the error is converted to a tx_abort message but quiescence is never exited and holding cells are never freed. This leaves the channel stuck in a quiescent state. Fix this by intercepting ChannelError::Abort before try_channel_entry! in internal_tx_init_rbf, calling exit_quiescence on the channel, and returning the error with exited_quiescence set so that handle_error frees holding cells. Also make exit_quiescence available in non-test builds by removing its cfg gate. Update tests to use the proper RBF initiation flow (with tampered feerates) so that handle_tx_abort correctly echoes the abort and exits quiescence, rather than manually crafting tx_init_rbf messages that leave node 0 without proper negotiation state. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
d275551 to
3c55d3c
Compare
|
Rebased |
lightning/src/ln/channelmanager.rs
Outdated
| &self.logger, | ||
| ); | ||
| if let Err(ChannelError::Abort(_)) = &init_res { | ||
| funded_channel.exit_quiescence(); |
There was a problem hiding this comment.
Hm I'm not sure we want to use this since the debug assertions there are reachable, but the mark_response_received call makes me realize we may need to do this elsewhere to avoid an unintentional disconnect (maybe write a test to confirm?).
There was a problem hiding this comment.
Hm I'm not sure we want to use this since the debug assertions there are reachable,
The fixup commit makes sure that we don't return ChannelError::Abort until we know we are quiescent, so I think those asserts are unreachable from here?
but the
mark_response_receivedcall makes me realize we may need to do this elsewhere to avoid an unintentional disconnect (maybe write a test to confirm?).
Like when processing splice_init, splice_ack, tx_init_rbf, tx_ack_rbf, or interactive-tx messages?
One thing I noticed when exploring this as since we don't have a FundedChannel for v2 establishment, we use UNFUNDED_CHANNEL_AGE_LIMIT_TICKS to timeout instead. So updating sent_message_awaiting_response on the ChannelContext has no effect there. We'd effectively have two different mechanisms for timing out. Are we ok with that? Or should we try to use the UNFUNDED_CHANNEL_AGE_LIMIT_TICKS for splice funding, too?
There was a problem hiding this comment.
Like when processing splice_init, splice_ack, tx_init_rbf, tx_ack_rbf, or interactive-tx messages?
It should be when processing a message that triggers the end of quiescence (see FundedChannel::should_disconnect_peer_awaiting_response), so either tx_abort or tx_signatures.
One thing I noticed when exploring this as since we don't have a FundedChannel for v2 establishment, we use UNFUNDED_CHANNEL_AGE_LIMIT_TICKS to timeout instead. So updating sent_message_awaiting_response on the ChannelContext has no effect there. We'd effectively have two different mechanisms for timing out. Are we ok with that? Or should we try to use the UNFUNDED_CHANNEL_AGE_LIMIT_TICKS for splice funding, too?
The quiescence timeout needs to be much shorter because we have an active open channel that can handle payments.
There was a problem hiding this comment.
Gotcha. Updated and added a few tests. PTAL.
| ); | ||
|
|
||
| // Node 1 handles the echo (no-op since it already aborted). | ||
| nodes[1].node.handle_tx_abort(node_id_0, &tx_abort_echo); |
There was a problem hiding this comment.
We should include a payment such that the outbound HTLC gets sent out after the abort as a result of freeing the holding cell.
Add a payment during quiescence in test_splice_rbf_insufficient_feerate to verify that outbound HTLCs queued in the holding cell are freed when quiescence is exited by the tx_abort exchange. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The same bug fixed in the prior commit for tx_init_rbf also exists in internal_splice_init: when splice_init triggers FeeRateTooHigh in resolve_queued_contribution, the ChannelError::Abort goes through try_channel_entry! without exiting quiescence. Apply the same fix: intercept ChannelError::Abort before try_channel_entry!, call exit_quiescence, and return the error with exited_quiescence set. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The prior two commits manually intercepted ChannelError::Abort in the channelmanager handlers for splice_init and tx_init_rbf to exit quiescence before returning, since the channel methods didn't signal this themselves. The interactive TX message handlers already solved this by returning InteractiveTxMsgError which bundles exited_quiescence into the error type. Apply the same pattern: change splice_init and tx_init_rbf to return InteractiveTxMsgError, adding a quiescent_negotiation_err helper on FundedChannel that exits quiescence for Abort errors and passes through other variants unchanged. Extract handle_interactive_tx_msg_err in channelmanager to deduplicate the error handling across internal_tx_msg, internal_splice_init, and internal_tx_init_rbf. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
validate_splice_init and validate_tx_init_rbf check preconditions including quiescence. Move them before resolve_queued_contribution so that a misbehaving peer sending splice_init or tx_init_rbf before quiescence is rejected early. This also moves validate_splice_contributions and FundingScope construction into the callers since they depend on the resolved contribution. Have validate_tx_init_rbf return the last candidate's funding pubkeys so the caller can construct FundingScope without re-accessing pending_splice. Add a debug_assert in quiescent_negotiation_err that Abort errors only occur when the channel is quiescent, since exit_quiescence requires it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3c55d3c to
e4b1c6b
Compare
|
🔔 1st Reminder Hey @wpaulino! This PR has been waiting for your review. |
lightning/src/ln/channelmanager.rs
Outdated
| debug_assert!(!err.exited_quiescence || matches!(err.err, ChannelError::Abort(_))); | ||
|
|
||
| MsgHandleErrInternal::from_chan_no_close(err.err, channel_id) | ||
| .with_exited_quiescence(err.exited_quiescence) |
There was a problem hiding this comment.
This does feel a bit like a leaky abstraction - can we move exited_quiescence into the ChannelError and just call it needs_holding_cell_release? Even moreso because the variable doesn't make sense if we disconnect, close, or send an error.
There was a problem hiding this comment.
Ok, I think we can infer it from tx_abort. At least I managed to get Clause to convince me that. Also, renamed the variables to needs_holding_cell_release and include similar inference elsewhere.
Let me know if this is looks good. I can make it a fixup if you prefer. May need a careful look, though.
|
🔔 2nd Reminder Hey @wpaulino! This PR has been waiting for your review. |
The `exited_quiescence` field on `MsgHandleErrInternal` and `InteractiveTxMsgError` is a leaky abstraction -- the channelmanager error handling shouldn't know about quiescence, only whether the holding cell needs to be released. Infer this from the presence of a `tx_abort` instead, since exiting quiescence via an error always produces one. Remove `exited_quiescence` from `InteractiveTxMsgError`, `MsgHandleErrInternal`, and the return type of `Channel::tx_abort`, along with the `with_exited_quiescence` builder. For unfunded v2 channels, `tx_abort` may be present without quiescence having been exited, but the holding cell release is a no-op since an unfunded channel won't have any HTLCs. Similarly, the unreachable `debug_assert!(false)` branch in `fail_interactive_tx_negotiation` for funded channels produces a `tx_abort` without exiting quiescence, but the holding cell release is a no-op since the channel is still quiescent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| } else { | ||
| debug_assert!(false); | ||
| } |
There was a problem hiding this comment.
Bug: regression in defensive coding — channel can get stuck in quiescence in release builds.
The old code called clear_quiescent() before the if let, so it always ran regardless of which branch was taken. The new code only calls exit_quiescence() in the success branch. If this debug_assert!(false) branch is somehow reached in a release build (e.g., due to a bug elsewhere), the channel would be stuck in quiescence indefinitely — unable to generate new commitments — until a disconnect/reconnect cycle recovers it.
Consider moving exit_quiescence() after the if/else, or adding it to the else branch:
| } else { | |
| debug_assert!(false); | |
| } | |
| } else { | |
| debug_assert!(false); | |
| self.exit_quiescence(); | |
| } |
There was a problem hiding this comment.
Done. BTW, the reason it is moved after rather than left before is we can't take two mutable references to self.
Several code paths exit quiescence by calling `clear_quiescent()` directly without also clearing the disconnect timer via `mark_response_received()`. This causes the timer to fire after the splice completes or is aborted, spuriously disconnecting the peer. Replace `clear_quiescent()` with `exit_quiescence()` in `on_tx_signatures_exchange`, `reset_pending_splice_state`, and `peer_connected_get_handshake`, which clears both the quiescent state and the disconnect timer. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
b861c68 to
bdc17dc
Compare
Summary
tx_init_rbforsplice_initis rejected withAbort(e.g., insufficient RBF feerate, negotiation in progress), which left the channel stuck in a quiescent stateInteractiveTxMsgError, reusing the same pattern already used by the interactive TX message handlers, with a sharedhandle_interactive_tx_msg_errhelper in channelmanagerTest plan
test_splice_rbf_insufficient_feerateupdated to verify quiescence is properly exited aftertx_aborttest_splice_feerate_too_highupdated to verify quiescence is properly exited aftersplice_initrejection🤖 Generated with Claude Code
Based on #4494.