Skip to content

Enforce minimum NCCL version for cuBLASMp#2857

Merged
vcherepanov-nv merged 1 commit intoNVIDIA:mainfrom
vcherepanov-nv:enforce_min_nccl_for_cublasmp
Apr 10, 2026
Merged

Enforce minimum NCCL version for cuBLASMp#2857
vcherepanov-nv merged 1 commit intoNVIDIA:mainfrom
vcherepanov-nv:enforce_min_nccl_for_cublasmp

Conversation

@vcherepanov-nv
Copy link
Copy Markdown
Collaborator

Description

cuBLASMp uses API, introduced in NCCL 2.29. This change adds a check to CMake to enforce this minimum version and avoid linker errors.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Check that NCCL version is >= 2.29 if building with cuBLASMp integration

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 8, 2026

Greptile Summary

Adds a CMake find_nccl_version helper that parses the NCCL version from nccl.h and enforces >= 2.29.0 when building with NVTE_WITH_CUBLASMP, preventing linker errors caused by the missing NCCL 2.29 API. The version-comparison logic and the FATAL_ERROR messaging are correct. Two minor robustness gaps exist in find_path (no path hints, cached result) that are flagged as P2 suggestions.

Confidence Score: 5/5

Safe to merge; all remaining findings are P2 style/robustness suggestions that do not block correctness in standard single-NCCL environments.

The core version-check logic is correct: NCCL version is parsed deterministically from the header, compared with CMake's VERSION_LESS, and raises a FATAL_ERROR on failure. The two P2 flags (no HINTS in find_path, cached path variable) affect edge-case environments and do not introduce build failures for the common case.

transformer_engine/common/CMakeLists.txt – find_path hints and cache hygiene in find_nccl_version.

Vulnerabilities

No security concerns identified.

Important Files Changed

Filename Overview
transformer_engine/common/CMakeLists.txt Adds find_nccl_version helper and enforces NCCL >= 2.29.0 when NVTE_WITH_CUBLASMP is enabled; find_path lacks HINTS and uses a cached result that could mismatch the linked NCCL in multi-install environments.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[cmake configure] --> B{NVTE_WITH_CUBLASMP?}
    B -- No --> Z[skip cuBLASMp setup]
    B -- Yes --> C[find_nccl_version\nfind_path nccl.h]
    C --> D[parse NCCL_MAJOR\nNCCL_MINOR\nNCCL_PATCH]
    D --> E{version fields\nparsed OK?}
    E -- No --> F[FATAL_ERROR\nFailed to parse NCCL version]
    E -- Yes --> G[NCCL_VERSION = MAJOR.MINOR.PATCH]
    G --> H{NCCL_VERSION\nVERSION_LESS 2.29.0?}
    H -- Yes --> I[FATAL_ERROR\nNCCL >= 2.29.0 required]
    H -- No --> J[find_library CUBLASMP_LIB\nfind_library NCCL_LIB]
    J --> K[target_link_libraries\nNCCL + cuBLASMp]
    K --> L[STATUS messages\ncuBLASMp path + NCCL version]
Loading

Reviews (1): Last reviewed commit: "Enforce minimum NCCL version for cuBLASM..." | Re-trigger Greptile

Comment on lines +102 to +105
find_path(_nvte_nccl_include_dir
NAMES nccl.h
PATH_SUFFIXES include
REQUIRED)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 find_path may locate the wrong nccl.h

find_path here has no PATHS or HINTS, so it searches CMake's default system paths rather than the NCCL that will actually be linked. On systems where both a system NCCL and a CUBLASMP_DIR-bundled NCCL exist, the header discovered could belong to a different version than the NCCL_LIB found by the later find_library call, making the version check unreliable.

Consider extracting a hint from CUBLASMP_DIR, e.g.:

find_path(_nvte_nccl_include_dir
          NAMES nccl.h
          HINTS ${CUBLASMP_DIR}
          PATH_SUFFIXES include
          REQUIRED)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this suggestion from Greptile sensible? My understanding was that cuBLASMp does not come pre-packaged with it's own NCCL so I don't think this makes sense.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're correct—my suggestion doesn't make sense. If cuBLASMp doesn't bundle NCCL, then ${CUBLASMP_DIR} won't contain nccl.h.

The potential issue I was flagging (that find_path and find_library could locate different NCCL installations on systems with multiple versions) is real in principle, but the fix would be to accept a user-provided NCCL_DIR or derive the include path from the library location found by find_library, not to look in CUBLASMP_DIR.

For most practical scenarios where NCCL is installed in a single standard location, the current approach should work fine—both find_path and find_library will discover the same installation through CMake's default search paths.

Tip: You can customize Greptile's behavior for this repo with .greptile/rules and .greptile/config.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vcherepanov-nv I think Greptile's latest reply here does have a point here that a user's NCCL dir may differ from the system installation, for example, if they are using NCCL from a pip package. I'm not sure if NCCL from pip is supported with NVTE_WITH_CUBLASMP, but if it is we should probably allow the user to provide a NCCL_DIR hint (unless we have a better way of choosing between pip packages and system installations in CMake)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think existing find_library approach would find pip-installed NCCL. So whatever find_path finds, it's likely whatever find_library finds as well. Until now no one asked for a custom path to NCCL, so I'd like to avoid adding any extra configurable paths, unless absolutely necessary.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meaning - less options to configure, less options to misconfigure.

Comment on lines +102 to +105
find_path(_nvte_nccl_include_dir
NAMES nccl.h
PATH_SUFFIXES include
REQUIRED)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Cached find_path result may become stale on reconfigure

find_path stores _nvte_nccl_include_dir in the CMake cache by default. If NCCL is upgraded and cmake is re-run without clearing the build directory, the old cached path is reused and the version check reads the stale nccl.h, potentially passing or failing incorrectly. Use NO_CACHE (available since CMake 3.21) or prepend unset(_nvte_nccl_include_dir CACHE) to force a fresh search every configure run.

find_path(_nvte_nccl_include_dir
          NAMES nccl.h
          HINTS ${CUBLASMP_DIR}
          PATH_SUFFIXES include
          NO_CACHE
          REQUIRED)

@jberchtold-nvidia
Copy link
Copy Markdown
Collaborator

LGTM, thanks Vlad!

@vcherepanov-nv vcherepanov-nv requested a review from timmoon10 April 8, 2026 21:55
@ptrendx
Copy link
Copy Markdown
Member

ptrendx commented Apr 8, 2026

/te-ci

@vcherepanov-nv vcherepanov-nv merged commit 2f17c9b into NVIDIA:main Apr 10, 2026
38 of 42 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants