
Add Megatron-FSDP E2E integration test to TE CI/CD (L1).#2845

Open
cspades wants to merge 7 commits into NVIDIA:main from cspades:cye/mfsdp-te-e2e-test

Conversation

@cspades
Member

@cspades cspades commented Apr 7, 2026

Description

  • Adds Megatron-FSDP E2E integration tests to the TransformerEngine CI/CD.

Details

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@greptile-apps
Contributor

greptile-apps bot commented Apr 7, 2026

Greptile Summary

This PR adds a new L1_pytorch_mcore_fsdp_integration CI test that exercises the Megatron-FSDP end-to-end path with TransformerEngine, covering FSDP2 sharding, NVLink UBR, cpu-offloading, FP8 (MXFP8), and DCP checkpointing on GB200.

The implementation is clean: environment variables are set via standalone export/unset statements and python3 is invoked directly with backslash continuations, which correctly avoids the shell-parsing pitfalls raised in earlier review threads. The mocked vocab/merges setup mirrors the sibling L1_pytorch_mcore_integration test.

Confidence Score: 5/5

Safe to merge — the implementation is correct and all previously raised shell-parsing concerns have been resolved.

All three files are new additions that do not touch existing code paths. The test script correctly uses standalone export/unset statements rather than embedding them in a collapsed COMMAND string, so the bash-parsing bugs raised in prior review threads are not present. No P0 or P1 issues remain; remaining observations (torch.distributed.launch deprecation, missing trailing newline in .gitignore) are cosmetic P2 items that do not affect correctness or CI reliability.

No files require special attention.

Important Files Changed

Filename Overview
qa/L1_pytorch_mcore_fsdp_integration/test.sh New CI test script; correctly uses standalone export/unset statements and direct python3 invocation with backslash continuations — previous shell-parsing concerns do not apply to this implementation.
qa/L1_pytorch_mcore_fsdp_integration/.gitignore Ignores the cloned Megatron-LM directory and generated vocab.json; correct and minimal. Missing trailing newline (cosmetic).
qa/L1_pytorch_mcore_fsdp_integration/merges.txt Stub BPE merges file with version header; sufficient for mock-data mode used by the test.
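To illustrate the mock-data setup described in the table above, here is a hypothetical sketch of how a test script might generate the mock `vocab.json` (4096 tokens, per the flowchart) and the stub `merges.txt`. The paths, token naming, and special-token handling are assumptions for illustration; the actual `test.sh` in this PR may differ in its details.

```shell
# Hypothetical sketch of the mock tokenizer setup; not the actual test.sh.
set -e
WORKDIR="$(mktemp -d)"

# Generate a mock vocab.json with 4096 tokens plus an end-of-text token.
# The real test mirrors the sibling L1_pytorch_mcore_integration test.
python3 - "$WORKDIR/vocab.json" <<'EOF'
import json, sys
vocab = {f"token{i}": i for i in range(4096)}
vocab["<|endoftext|>"] = len(vocab)
with open(sys.argv[1], "w") as f:
    json.dump(vocab, f)
EOF

# Stub BPE merges file: a version header is sufficient for mock-data mode.
printf '#version: 0.2\n' > "$WORKDIR/merges.txt"

echo "Mock tokenizer files written to $WORKDIR"
```

Because the test runs in mock-data mode, the tokenizer files only need to be well-formed, not meaningful.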

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([test.sh start]) --> B{Megatron-LM\ncloned?}
    B -- No --> C[git clone NVIDIA/Megatron-LM\ngit checkout MCORE_REF]
    C --> D[Generate mock vocab.json\n4096 tokens]
    B -- Yes --> D
    D --> E[unset CUDA_DEVICE_MAX_CONNECTIONS\nexport NVTE_* env vars\nexport NVTE_CPU_OFFLOAD_V1=0]
    E --> F[python3 -m torch.distributed.launch\n--nproc_per_node=GPU_COUNT\npretrain_gpt.py]
    F --> G[FSDP2 + TE training\n10 train iters, BF16+FP8 MXFP8\ncpu-offload, NVLink UBR, DCP ckpt]
    G --> H([Pass / Fail via set -e])
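The environment setup and launch step from the flowchart can be sketched as below. `GPU_COUNT` and the specific `NVTE_*` values are placeholders assumed from the summary; only `NVTE_CPU_OFFLOAD_V1=0` is named explicitly in the flowchart, and the real `test.sh` pins its own values. The launch command is echoed rather than executed here, since the real script runs it under CI on GB200.

```shell
# Hypothetical sketch of the env-var setup and launch step; not the actual test.sh.
set -e

# The flowchart shows this variable being unset before launch.
unset CUDA_DEVICE_MAX_CONNECTIONS

# One of the NVTE_* knobs exercised by the test, per the flowchart.
export NVTE_CPU_OFFLOAD_V1=0

GPU_COUNT="${GPU_COUNT:-8}"  # placeholder; the CI node's GPU count

# Echo rather than execute the distributed launch command.
echo python3 -m torch.distributed.launch \
    --nproc_per_node="$GPU_COUNT" \
    pretrain_gpt.py
```

As the review notes, `torch.distributed.launch` is deprecated in favor of `torchrun`; the reviewer flagged this as a cosmetic P2 item that does not affect CI reliability.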

Reviews (11): Last reviewed commit: "Bump MCore commit."

timmoon10
timmoon10 previously approved these changes Apr 7, 2026
Collaborator

@timmoon10 timmoon10 left a comment


LGTM, pending CI

@timmoon10
Collaborator

Pipeline 47956532

@cspades cspades force-pushed the cye/mfsdp-te-e2e-test branch 2 times, most recently from 5fb4871 to fce5369 on April 8, 2026 at 00:51
@cspades
Member Author

cspades commented Apr 8, 2026

Depends on this: NVIDIA/Megatron-LM#4133

This PR correctly uses decoupled_grad depending on the FP8 recipe, matching the distributed optimizer logic in Megatron-Core.

cspades and others added 7 commits April 10, 2026 11:46
Signed-off-by: Cory Ye <cye@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Cory Ye <44509866+cspades@users.noreply.github.com>
Signed-off-by: Cory Ye <cye@nvidia.com>
Signed-off-by: Cory Ye <cye@nvidia.com>
Signed-off-by: Cory Ye <cye@nvidia.com>
Signed-off-by: Cory Ye <cye@nvidia.com>
Signed-off-by: Cory Ye <cye@nvidia.com>
@cspades cspades force-pushed the cye/mfsdp-te-e2e-test branch from 0fa5989 to d685275 on April 10, 2026 at 18:46

3 participants