
Cp thd swa with ag #2829

Draft
sudhakarsingh27 wants to merge 4 commits into NVIDIA:main from sudhakarsingh27:cp_thd_swa_with_ag

Conversation

@sudhakarsingh27
Collaborator

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

… cu_seqlens

- Use per-step cu_seqlens_q_padded to select Q chunks instead of tensor slicing
- Use padded cu_seqlens_kv for K/V reordering (ensures divisibility)
- Add cu_seqlens_kv and cu_seqlens_kv_padded to AllGather function signature
- Compute per-step Q and KV cu_seqlens correctly from actual seqlens
- Support non-causal attention (all KV visible)
- Zero-initialize out/dq for THD to avoid garbage in padding regions
- Save per-step cu_seqlens in ctx for backward (avoid recomputation)
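The first bullet can be illustrated with a minimal sketch (plain Python, not the actual TE code; the helper name and the chunk assignment are assumptions based on the standard load-balanced CP scheme) of selecting a rank's Q chunks from cu_seqlens_q_padded offsets rather than by dense tensor slicing:

```python
# Hypothetical sketch: with context parallelism, each sequence is padded so
# its length divides 2*cp_size, and every rank owns two chunks per sequence.
# Selecting a rank's Q tokens via cu_seqlens_q_padded offsets keeps THD
# layouts correct when sequences have uneven lengths.

def q_token_ranges_for_rank(cu_seqlens_q_padded, cp_size, rank):
    """Return (start, end) token ranges this CP rank owns, per sequence.

    Each padded sequence is split into 2*cp_size equal chunks; rank r owns
    chunk r and chunk (2*cp_size - 1 - r), balancing work under causal masks.
    """
    ranges = []
    for i in range(len(cu_seqlens_q_padded) - 1):
        seq_start = cu_seqlens_q_padded[i]
        seq_len = cu_seqlens_q_padded[i + 1] - seq_start
        chunk = seq_len // (2 * cp_size)
        for c in (rank, 2 * cp_size - 1 - rank):
            ranges.append((seq_start + c * chunk, seq_start + (c + 1) * chunk))
    return ranges
```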

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Remove skip gates that blocked THD format with all_gather CP comm type.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
…seqlens_q_padded

The interleaved valid mask computation assumed cu_seqlens_q_padded starts
at 0. With the CP offset-based approach, cu_seqlens_q_padded can start at
a non-zero offset, causing a size mismatch. Use absolute positions from
cu_seqlens_q_padded to build the valid mask instead.
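A minimal sketch of the fix described above (plain Python, hypothetical helper name): the valid-token mask is built from absolute positions in cu_seqlens_q_padded, so it still works when the padded offsets begin at a non-zero value.

```python
# Hypothetical sketch: cu_seqlens holds actual sequence lengths (from 0);
# cu_seqlens_padded holds absolute padded offsets, which with the CP
# offset-based approach may start at a non-zero value.

def valid_mask_from_padded(cu_seqlens, cu_seqlens_padded):
    """Mark which positions in the padded THD layout hold real tokens."""
    base = cu_seqlens_padded[0]            # may be non-zero
    total = cu_seqlens_padded[-1] - base
    mask = [False] * total
    for i in range(len(cu_seqlens) - 1):
        actual_len = cu_seqlens[i + 1] - cu_seqlens[i]
        start = cu_seqlens_padded[i] - base  # absolute -> local offset
        for j in range(actual_len):
            mask[start + j] = True           # real token; rest stays padding
    return mask
```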

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
if qkv_format == "thd":
    # [cp*t, h, d] -> reorder to contiguous per-sequence order -> [t_full, h, d]
    chunk_ids_for_kv_ag = get_seq_chunk_ids_for_reordering_before_attn(cp_size, k.device)
    k_ag = reorder_seq_chunks_after_a2a_before_attn_thd(

@sudhakarsingh27 (Collaborator, Author) commented on Apr 3, 2026


This reorder_seq_chunks_after_a2a_before_attn_thd and the other related method are no longer "a2a"-specific; rename them to something like dualchunk_to_contiguous_order_thd and contiguous_to_dualchunk_order_thd.
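For context, a minimal sketch (hypothetical helper, plain Python, not the TE implementation) of the dual-chunk-to-contiguous permutation such a rename would describe:

```python
# Hypothetical sketch: with cp_size ranks, rank r holds chunks r and
# 2*cp_size-1-r, so after all-gather the chunks arrive in the order
# [0, 2N-1, 1, 2N-2, ...] (N = cp_size) and must be permuted back to
# contiguous order [0, 1, ..., 2N-1].

def dualchunk_to_contiguous_chunk_ids(cp_size):
    """Gather indices: for each contiguous chunk id c, the position of
    chunk c in the gathered (dual-chunk) layout."""
    gathered = []                      # order chunks arrive in after all-gather
    for r in range(cp_size):
        gathered += [r, 2 * cp_size - 1 - r]
    pos = {cid: i for i, cid in enumerate(gathered)}
    return [pos[c] for c in range(2 * cp_size)]
```

Reconstructing contiguous order is then `[gathered_chunks[i] for i in dualchunk_to_contiguous_chunk_ids(cp_size)]`.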

if use_fused_attention and causal and "bottom_right" not in attn_mask_type:
attn_mask_type = attn_mask_type + "_bottom_right"
if qkv_format != "thd":
attn_mask_type = attn_mask_type + "_bottom_right"

@sudhakarsingh27 (Collaborator, Author) commented on Apr 7, 2026


Why are we creating *_bottom_right masks only for non-THD cases? Shouldn't THD cases also have them?
Edit: for THD there are per-step masks, but those masks are the same for each step, i.e. "padding_causal_bottom_right" for both steps if causal, otherwise "padding" for both steps, so maybe there's no need to distinguish these two paths separately.
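A minimal sketch (hypothetical, plain Python) of the merged mask-type selection this comment suggests, assuming the per-step THD masks really are identical across steps:

```python
# Hypothetical sketch: append "_bottom_right" once for causal fused
# attention regardless of qkv_format, merging the BSHD/SBHD and THD paths
# as suggested in the comment above. Not the actual TE logic.

def cp_attn_mask_type(attn_mask_type, causal, use_fused_attention):
    if use_fused_attention and causal and "bottom_right" not in attn_mask_type:
        return attn_mask_type + "_bottom_right"
    return attn_mask_type
```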
