feat(models): add vLLM provider support#1860

Open
NmanQAQ wants to merge 2 commits into bytedance:main from NmanQAQ:fix/vllm-qwen-thinking-toggle

Conversation


NmanQAQ commented Apr 4, 2026

Summary

This PR adds DeerFlow support for vLLM 0.19.0 OpenAI-compatible chat endpoints and fixes the Qwen reasoning toggle so flash mode can actually disable thinking.

What Changed

  • Add deerflow.models.vllm_provider:VllmChatModel to preserve vLLM's non-standard reasoning field across non-streaming responses, streaming deltas, and follow-up tool-call turns
    • backend/packages/harness/deerflow/models/vllm_provider.py
    • backend/tests/test_vllm_provider.py
  • Extend the model factory so vLLM thinking-disable logic handles both legacy chat_template_kwargs.thinking configs and the Qwen/vLLM 0.19.0 chat_template_kwargs.enable_thinking format
    • backend/packages/harness/deerflow/models/factory.py
    • backend/tests/test_model_factory.py
  • Document the Qwen-specific toggle, add a VLLM_API_KEY example, and update the sample vLLM config
    • .env.example
    • config.example.yaml
    • README.md
    • backend/CLAUDE.md
  • Refresh backend/uv.lock to reflect the dependency state introduced with the provider addition
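The reasoning-preservation behavior described above can be sketched as a small conversion helper. This is an illustrative standalone function, not the PR's actual code (which subclasses ChatOpenAI); the function name and dict shapes are hypothetical:

```python
def convert_vllm_message(raw: dict) -> dict:
    """Convert a vLLM chat message dict into a LangChain-style message dict,
    carrying the non-standard `reasoning` field into additional_kwargs so it
    is not dropped by the default OpenAI conversion path (hypothetical sketch)."""
    additional_kwargs = {}
    reasoning = raw.get("reasoning")
    # Use an explicit None check so falsy-but-present values (e.g. "") survive.
    if reasoning is not None:
        additional_kwargs["reasoning"] = reasoning
    return {
        "role": raw.get("role", "assistant"),
        "content": raw.get("content", ""),
        "additional_kwargs": additional_kwargs,
    }
```

On a follow-up tool-call turn, the preserved `additional_kwargs["reasoning"]` can then be echoed back to vLLM instead of being silently lost.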

User-Visible Effect

  • DeerFlow can be configured against vLLM 0.19.0 using a first-class provider instead of a generic OpenAI adapter
  • vLLM reasoning content is preserved across tool-call turns instead of being dropped by the default LangChain OpenAI conversion path
  • For Qwen-style vLLM models, switching DeerFlow to flash mode now turns thinking off reliably via extra_body.chat_template_kwargs.enable_thinking
  • Existing configs that still use chat_template_kwargs.thinking continue to work because requests are normalized before they are sent
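The normalization of legacy configs can be sketched as follows. This is a minimal illustration of the idea, assuming the key names stated in the PR description (`chat_template_kwargs.thinking` legacy vs. `chat_template_kwargs.enable_thinking` for Qwen/vLLM 0.19.0); the function name is hypothetical:

```python
def normalize_thinking_kwargs(extra_body: dict) -> dict:
    """Rewrite the legacy `thinking` key to the Qwen/vLLM `enable_thinking`
    key inside chat_template_kwargs, without mutating the input
    (illustrative sketch of the normalization, not the PR's code)."""
    body = dict(extra_body)
    kwargs = dict(body.get("chat_template_kwargs", {}))
    if "thinking" in kwargs:  # legacy config format
        kwargs["enable_thinking"] = kwargs.pop("thinking")
    body["chat_template_kwargs"] = kwargs
    return body
```

With this in place, a config that sets `chat_template_kwargs.thinking: false` would produce the same request body as one that sets `enable_thinking: false` directly.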

Validation

Backend:

    cd backend
    uvx ruff check .
    uvx ruff format --check .
    uv run --with pytest python -m pytest tests/test_vllm_provider.py tests/test_model_factory.py -q
    uv run pytest tests/test_client_live.py -v -s

Notes

  • tests/test_client_live.py completed with 17 passed, 1 skipped, 1 failed in the local Windows environment. The remaining failure is a UnicodeEncodeError when the test prints a Unicode symbol during TestLiveMultiToolChain::test_write_then_read; the underlying write/read tool flow completed successfully before the console-print failure.
  • A broader local Windows run of uv run pytest tests --ignore=tests/test_client_live.py --ignore=tests/test_create_deerflow_agent_live.py -q still reports pre-existing local-sandbox path separator and symlink-permission failures unrelated to this PR.


CLAassistant commented Apr 4, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ WillemJiang
❌ NmanQAQ


NmanQAQ seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

NmanQAQ force-pushed the fix/vllm-qwen-thinking-toggle branch from b54d5a7 to 88e2f3b (April 5, 2026 00:09)
WillemJiang (Collaborator) commented:

@NmanQAQ thanks for your contribution. Please click the CLA button to sign the CLA before we merge your PR.

Copilot AI (Contributor) left a comment


Pull request overview

Adds first-class vLLM (OpenAI-compatible) chat provider support, ensuring vLLM’s non-standard reasoning field is preserved across full responses, streaming deltas, and follow-up tool-call turns, and improves the “thinking disabled” normalization for Qwen/vLLM chat template kwargs.

Changes:

  • Introduces VllmChatModel provider to preserve vLLM reasoning across requests and responses.
  • Extends the model factory to disable thinking for vLLM/Qwen via chat_template_kwargs (supporting both legacy thinking and new enable_thinking).
  • Updates documentation and examples for vLLM setup; refreshes the backend lockfile and adds tests.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.

File summary:

  • README.md — Adds vLLM/Qwen configuration example and explains the vLLM reasoning/thinking toggle behavior.
  • config.example.yaml — Documents a vLLM 0.19.0 example configuration and bumps config_version.
  • backend/uv.lock — Updates locked dependencies for the new provider-related dependency state.
  • backend/tests/test_vllm_provider.py — Adds unit tests covering vLLM reasoning preservation and thinking kwarg normalization.
  • backend/tests/test_model_factory.py — Adds regression tests for vLLM thinking-disable behavior in both thinking and enable_thinking formats.
  • backend/packages/harness/deerflow/models/vllm_provider.py — Implements VllmChatModel (ChatOpenAI subclass) to preserve vLLM reasoning and normalize chat template kwargs.
  • backend/packages/harness/deerflow/models/factory.py — Adds deep-merge helper and vLLM-specific disable logic for chat_template_kwargs.
  • backend/CLAUDE.md — Documents the new vLLM provider and Qwen reasoning toggle behavior.
  • .env.example — Adds VLLM_API_KEY example entry.
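The deep-merge helper mentioned for factory.py is a common pattern for layering `extra_body` overrides onto existing config without clobbering nested dicts. A generic sketch (the actual helper's name and signature in the PR may differ):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into `base`, returning a new dict and
    leaving both inputs unmodified (generic sketch of a deep-merge helper)."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(merged.get(key), dict) and isinstance(value, dict):
            merged[key] = deep_merge(merged[key], value)  # recurse into nested dicts
        else:
            merged[key] = value  # scalar or type mismatch: override wins
    return merged
```

This lets the factory inject `{"chat_template_kwargs": {"enable_thinking": False}}` into a model's `extra_body` while preserving any sibling keys the user already configured there.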

Comment on lines +107 to +112
    if reasoning := _dict.get("reasoning"):
        additional_kwargs["reasoning"] = reasoning
        reasoning_text = _reasoning_to_text(reasoning)
        if reasoning_text:
            additional_kwargs["reasoning_content"] = reasoning_text

Copilot AI Apr 5, 2026


In _convert_delta_to_message_chunk_with_reasoning, the reasoning field is only preserved when it is truthy (if reasoning := _dict.get("reasoning")). If vLLM emits an empty string / empty list as a valid intermediate delta value, this will silently drop the reasoning key and can break the “echo prior reasoning” requirement on subsequent turns. Consider checking is not None instead of truthiness and then separately deciding whether to compute/set reasoning_content text.
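The suggested fix can be sketched as below. The helper name and the string-only text extraction are illustrative simplifications (the PR's `_reasoning_to_text` may handle richer shapes):

```python
def preserve_reasoning(delta: dict, additional_kwargs: dict) -> None:
    """Preserve the `reasoning` key even when its value is falsy (empty
    string or list), per the review suggestion: check `is not None`
    instead of walrus truthiness (sketch, not the PR's code)."""
    reasoning = delta.get("reasoning")
    if reasoning is not None:  # NOT `if reasoning := ...`, which drops "" / []
        additional_kwargs["reasoning"] = reasoning
        # Separately decide whether there is any text worth exposing.
        text = reasoning if isinstance(reasoning, str) else ""
        if text:
            additional_kwargs["reasoning_content"] = text
```

This keeps the key present on empty intermediate deltas, so the "echo prior reasoning" requirement still holds on subsequent turns.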

    # Bump this number when the config schema changes.
    # Run `make config-upgrade` to merge new fields into your local config.yaml.
    -config_version: 5
    +config_version: 6
Copilot AI Apr 5, 2026


config_version was bumped from 5 → 6, but the PR appears to only add/adjust examples and provider behavior (no config schema changes). Since config_version is used to warn users that their config is outdated, consider keeping it at 5 unless there’s a concrete schema migration that make config-upgrade needs to apply.

Suggested change:

    -config_version: 6
    +config_version: 5

Comment on lines +145 to +151
    - name: qwen3-32b-vllm
      display_name: Qwen3 32B (vLLM)
      use: deerflow.models.vllm_provider:VllmChatModel
      model: Qwen/Qwen3-32B
      api_key: dummy
      base_url: http://localhost:8000/v1
      supports_thinking: true
Copilot AI Apr 5, 2026


The vLLM example uses api_key: dummy, but the PR also introduces VLLM_API_KEY in .env.example and uses $VLLM_API_KEY in config.example.yaml. To avoid confusing users, consider switching this README snippet to api_key: $VLLM_API_KEY (and optionally note that some local vLLM deployments accept any non-empty key).
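Applying that suggestion, the README snippet would read as follows (assuming VLLM_API_KEY is defined as in .env.example; many local vLLM deployments accept any non-empty key):

```yaml
- name: qwen3-32b-vllm
  display_name: Qwen3 32B (vLLM)
  use: deerflow.models.vllm_provider:VllmChatModel
  model: Qwen/Qwen3-32B
  api_key: $VLLM_API_KEY
  base_url: http://localhost:8000/v1
  supports_thinking: true
```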
