Add Megatron-FSDP E2E integration test to TE CI/CD (L1).#2845
cspades wants to merge 7 commits into NVIDIA:main
Greptile Summary

This PR adds a new […] The implementation is clean: environment variables are set via standalone […]

Confidence Score: 5/5

Safe to merge: the implementation is correct and all previously raised shell-parsing concerns have been resolved.

All three files are new additions that do not touch existing code paths. The test script correctly uses standalone export/unset statements rather than embedding them in a collapsed COMMAND string, so the bash-parsing bugs raised in prior review threads are not present. No P0 or P1 issues remain; the remaining observations (torch.distributed.launch deprecation, missing trailing newline in .gitignore) are cosmetic P2 items that do not affect correctness or CI reliability.

No files require special attention.
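To illustrate the shell-parsing point in the summary, here is a minimal sketch of one way an assignment embedded in a collapsed command string can go wrong, compared with a standalone `export`. It is not taken from the PR's test script: `NVTE_DEMO_FLAG` is a hypothetical variable name, and `printenv` stands in for the real training command. The exact bug raised in the earlier review threads may have differed.

```shell
# NVTE_DEMO_FLAG is a hypothetical variable name used only for this demo.

# Pattern 1: the assignment is embedded in a collapsed command string.
# The shell recognizes VAR=value prefixes before expansion, so once $COMMAND
# is expanded its first word is looked up as a command name and fails.
COMMAND="NVTE_DEMO_FLAG=1 printenv NVTE_DEMO_FLAG"
embedded_status=0
embedded_result=$($COMMAND 2>/dev/null) || embedded_status=$?

# Pattern 2: the variable is exported on its own line, then the command runs.
export NVTE_DEMO_FLAG=1
COMMAND="printenv NVTE_DEMO_FLAG"
standalone_result=$($COMMAND)
unset NVTE_DEMO_FLAG

echo "embedded:   status=${embedded_status} output='${embedded_result}'"
echo "standalone: output='${standalone_result}'"
```

In the first pattern the command never runs and the status is non-zero, while the second pattern sees the exported value. Keeping `export` and `unset` as separate statements avoids relying on how an expanded string is re-parsed.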
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A([test.sh start]) --> B{Megatron-LM\ncloned?}
B -- No --> C[git clone NVIDIA/Megatron-LM\ngit checkout MCORE_REF]
C --> D[Generate mock vocab.json\n4096 tokens]
B -- Yes --> D
D --> E[unset CUDA_DEVICE_MAX_CONNECTIONS\nexport NVTE_* env vars\nexport NVTE_CPU_OFFLOAD_V1=0]
E --> F[python3 -m torch.distributed.launch\n--nproc_per_node=GPU_COUNT\npretrain_gpt.py]
F --> G[FSDP2 + TE training\n10 train iters, BF16+FP8 MXFP8\ncpu-offload, NVLink UBR, DCP ckpt]
G --> H([Pass / Fail via set -e])
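The setup steps in the flowchart can be sketched roughly as follows. This is a dry run under stated assumptions, not the PR's actual `test.sh`: the helper names `make_mock_vocab` and `print_launch_command`, the `tokN` placeholder tokens, and the default of one GPU are all hypothetical. The clone step is left out, and the arguments that the real script passes to `pretrain_gpt.py` are omitted because they are not shown here.

```shell
# make_mock_vocab PATH SIZE: write a JSON object mapping SIZE placeholder
# tokens ("tok0", "tok1", ...) to consecutive integer ids.
make_mock_vocab() {
  vocab_path=$1
  vocab_size=$2
  i=0
  {
    printf '{'
    while [ "$i" -lt "$vocab_size" ]; do
      if [ "$i" -gt 0 ]; then printf ', '; fi
      printf '"tok%d": %d' "$i" "$i"
      i=$((i + 1))
    done
    printf '}\n'
  } > "$vocab_path"
}

# print_launch_command GPU_COUNT: dry run that only prints the launcher line
# named in the flowchart; it does not start any training.
print_launch_command() {
  echo "python3 -m torch.distributed.launch --nproc_per_node=$1 pretrain_gpt.py"
}

# Step D: mock vocabulary with 4096 entries, written to a scratch directory.
workdir=$(mktemp -d)
make_mock_vocab "${workdir}/vocab.json" 4096

# Step E: environment handling as standalone statements.
unset CUDA_DEVICE_MAX_CONNECTIONS
export NVTE_CPU_OFFLOAD_V1=0

# Step F: show what would be launched (GPU_COUNT defaults to 1 here).
print_launch_command "${GPU_COUNT:-1}"
```

Printing the launcher line instead of executing it keeps the sketch runnable on a machine without GPUs or a Megatron-LM checkout.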
Reviews (11): Last reviewed commit: "Bump MCore commit."
Pipeline 47956532
Force-pushed from 5fb4871 to fce5369.
Depends on this: NVIDIA/Megatron-LM#4133
This PR correctly uses […]
Signed-off-by: Cory Ye <cye@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Cory Ye <44509866+cspades@users.noreply.github.com>
Force-pushed from 0fa5989 to d685275.
Description
Details
[…] decoupled_grad bugs related to FusedAdam, and other less obvious CPU offloading and Tensor API bugs that are difficult to catch without running Megatron-FSDP. This functional test aims to reduce how often such bugs slip through.

Type of change
Changes
Checklist: