
Add Megatron-FSDP E2E integration test to TE CI/CD (L1).#2845

Open
cspades wants to merge 7 commits into NVIDIA:main from cspades:cye/mfsdp-te-e2e-test

Conversation

@cspades
Member

@cspades cspades commented Apr 7, 2026

Description

  • Adds Megatron-FSDP E2E integration tests to the TransformerEngine CI/CD.

Details

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@greptile-apps
Contributor

greptile-apps bot commented Apr 7, 2026

Greptile Summary

This PR adds a new L1_pytorch_mcore_fsdp_integration CI test that exercises the Megatron-FSDP end-to-end path with TransformerEngine, covering FSDP2 sharding, NVLink UBR, cpu-offloading, FP8 (MXFP8), and DCP checkpointing on GB200.

The implementation is clean: environment variables are set via standalone export/unset statements and python3 is invoked directly with backslash continuations, which correctly avoids the shell-parsing pitfalls raised in earlier review threads. The mocked vocab/merges setup mirrors the sibling L1_pytorch_mcore_integration test.

Confidence Score: 5/5

Safe to merge — the implementation is correct and all previously raised shell-parsing concerns have been resolved.

All three files are new additions that do not touch existing code paths. The test script correctly uses standalone export/unset statements rather than embedding them in a collapsed COMMAND string, so the bash-parsing bugs raised in prior review threads are not present. No P0 or P1 issues remain; remaining observations (torch.distributed.launch deprecation, missing trailing newline in .gitignore) are cosmetic P2 items that do not affect correctness or CI reliability.

No files require special attention.

Important Files Changed

Filename Overview
qa/L1_pytorch_mcore_fsdp_integration/test.sh New CI test script; correctly uses standalone export/unset statements and direct python3 invocation with backslash continuations — previous shell-parsing concerns do not apply to this implementation.
qa/L1_pytorch_mcore_fsdp_integration/.gitignore Ignores the cloned Megatron-LM directory and generated vocab.json; correct and minimal. Missing trailing newline (cosmetic).
qa/L1_pytorch_mcore_fsdp_integration/merges.txt Stub BPE merges file with version header; sufficient for mock-data mode used by the test.
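To illustrate the mock-data setup described in the table above, here is a hypothetical sketch of how a test script might generate the mock `vocab.json` (4096 tokens, per the flowchart) and the stub `merges.txt`. The paths, token naming, and special-token handling are assumptions for illustration; the actual `test.sh` in this PR may differ in its details.

```shell
# Hypothetical sketch of the mock tokenizer setup; not the actual test.sh.
set -e
WORKDIR="$(mktemp -d)"

# Generate a mock vocab.json with 4096 tokens plus an end-of-text token.
# The real test mirrors the sibling L1_pytorch_mcore_integration test.
python3 - "$WORKDIR/vocab.json" <<'EOF'
import json, sys
vocab = {f"token{i}": i for i in range(4096)}
vocab["<|endoftext|>"] = len(vocab)
with open(sys.argv[1], "w") as f:
    json.dump(vocab, f)
EOF

# Stub BPE merges file: a version header is sufficient for mock-data mode.
printf '#version: 0.2\n' > "$WORKDIR/merges.txt"

echo "Mock tokenizer files written to $WORKDIR"
```

Because the test runs in mock-data mode, the tokenizer files only need to be well-formed, not meaningful.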

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([test.sh start]) --> B{Megatron-LM\ncloned?}
    B -- No --> C[git clone NVIDIA/Megatron-LM\ngit checkout MCORE_REF]
    C --> D[Generate mock vocab.json\n4096 tokens]
    B -- Yes --> D
    D --> E[unset CUDA_DEVICE_MAX_CONNECTIONS\nexport NVTE_* env vars\nexport NVTE_CPU_OFFLOAD_V1=0]
    E --> F[python3 -m torch.distributed.launch\n--nproc_per_node=GPU_COUNT\npretrain_gpt.py]
    F --> G[FSDP2 + TE training\n10 train iters, BF16+FP8 MXFP8\ncpu-offload, NVLink UBR, DCP ckpt]
    G --> H([Pass / Fail via set -e])
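The environment setup and launch step from the flowchart can be sketched as below. `GPU_COUNT` and the specific `NVTE_*` values are placeholders assumed from the summary; only `NVTE_CPU_OFFLOAD_V1=0` is named explicitly in the flowchart, and the real `test.sh` pins its own values. The launch command is echoed rather than executed here, since the real script runs it under CI on GB200.

```shell
# Hypothetical sketch of the env-var setup and launch step; not the actual test.sh.
set -e

# The flowchart shows this variable being unset before launch.
unset CUDA_DEVICE_MAX_CONNECTIONS

# One of the NVTE_* knobs exercised by the test, per the flowchart.
export NVTE_CPU_OFFLOAD_V1=0

GPU_COUNT="${GPU_COUNT:-8}"  # placeholder; the CI node's GPU count

# Echo rather than execute the distributed launch command.
echo python3 -m torch.distributed.launch \
    --nproc_per_node="$GPU_COUNT" \
    pretrain_gpt.py
```

As the review notes, `torch.distributed.launch` is deprecated in favor of `torchrun`; the reviewer flagged this as a cosmetic P2 item that does not affect CI reliability.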

Reviews (11): Last reviewed commit: "Bump MCore commit."

timmoon10
timmoon10 previously approved these changes Apr 7, 2026
Collaborator

@timmoon10 timmoon10 left a comment


LGTM, pending CI

@timmoon10
Collaborator

Pipeline 47956532

@cspades cspades force-pushed the cye/mfsdp-te-e2e-test branch 2 times, most recently from 5fb4871 to fce5369 on April 8, 2026 at 00:51
@cspades
Member Author

cspades commented Apr 8, 2026

Depends on this: NVIDIA/Megatron-LM#4133

This PR correctly uses decoupled_grad depending on the FP8 recipe, matching the distributed optimizer logic in Megatron-Core.

cspades and others added 7 commits April 10, 2026 11:46
Signed-off-by: Cory Ye <cye@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Cory Ye <44509866+cspades@users.noreply.github.com>
Signed-off-by: Cory Ye <cye@nvidia.com>
Signed-off-by: Cory Ye <cye@nvidia.com>
Signed-off-by: Cory Ye <cye@nvidia.com>
Signed-off-by: Cory Ye <cye@nvidia.com>
Signed-off-by: Cory Ye <cye@nvidia.com>
@cspades cspades force-pushed the cye/mfsdp-te-e2e-test branch from 0fa5989 to d685275 on April 10, 2026 at 18:46

3 participants