perf: trim irrelevant delegation examples (0.38 -> 0.43 SWE-bench Verified dev)#237
Open
alanzabihi wants to merge 1 commit into main from
Conversation
homanp reviewed Apr 2, 2026
b7aa8db to d6572c5
Contributor: @alanzabihi I will test this with schedule workflows and other tasks that are not coding related and see what the effect is.
Results
Two baselines, because the codebase changed between the initial autoresearch run (Mar 30) and the ablation study (Apr 2). The original baseline (32/100) was measured before batch mode, the computer sub-agent, and @-mention autocomplete landed. The current main baseline (38/100) was measured against today's main with all those features. Both comparisons point in the same direction.
What changed
One edit to src/agent/agent.ts: removed 9 delegation examples that reference tools irrelevant to coding tasks, and kept 3 examples that match what the model actually does during code tasks.
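The shape of the trim can be sketched like this (a hypothetical illustration only: the example names, object shape, and tool set below are invented for clarity and are not the actual contents of the EXAMPLES block in src/agent/agent.ts):

```typescript
// Hypothetical sketch of trimming a delegation-examples list down to the
// entries that only reference tools exercised during code tasks.
// None of these names are from the real agent.ts.
interface DelegationExample {
  tools: string[]; // tools the worked example references
  text: string;    // the example prompt/response text (elided here)
}

const DELEGATION_EXAMPLES: DelegationExample[] = [
  { tools: ["read", "edit"], text: "..." },   // code-task example: kept
  { tools: ["generate_image"], text: "..." }, // irrelevant to coding: removed
];

// Assumed set of tools relevant to SWE-bench-style code tasks.
const CODE_TASK_TOOLS = new Set(["read", "edit", "bash"]);

const trimmed = DELEGATION_EXAMPLES.filter((ex) =>
  ex.tools.every((t) => CODE_TASK_TOOLS.has(t)),
);
```

The actual PR is a hand edit of the prompt text, not a runtime filter; the sketch just shows which kind of example survives the cut.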
The delegation policy rules, tool descriptions in tools.ts, and all sub-agent definitions are untouched. Schedule tools, generate_image, generate_video, computer, and vision all remain fully functional -- this change only removes their delegation examples from the agent system prompt, not the tools or their descriptions.

Method
Two rounds of experiments.
Round 1: autoresearch (28 experiments, ~36 hours). An automated prompt-tuning loop based on Karpathy's autoresearch pattern. Each experiment edits the orchestration policy text, runs all 100 SWE-bench tasks with fixed budgets (grok-code-fast-1, 120 max tool rounds, 900s timeout), evaluates via the standard SWE-bench Docker harness, keeps improvements, and reverts regressions. The best variant (0.42, exp22) combined three changes: example trim, policy simplification, and vision removal from tools.ts.

Round 2: ablation study (9 experiments, ~9 hours). Isolated each change to understand which ones actually help. Ran against current main with all recent features (batch mode, computer sub-agent, @-mention autocomplete).
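The keep/revert loop can be sketched as a simple hill climb (a minimal sketch, not the actual autoresearch harness; `runBenchmark` stands in for the full 100-task SWE-bench evaluation):

```typescript
// Minimal hill-climbing sketch of the keep-improvements / revert-regressions
// loop. A real harness would apply the prompt edit, run the benchmark in
// Docker, and score resolved tasks; here the scorer is injected.
type Variant = { name: string; prompt: string };

function hillClimb(
  base: Variant,
  candidates: Variant[],
  runBenchmark: (v: Variant) => number,
): { best: Variant; score: number } {
  let best = base;
  let score = runBenchmark(base);
  for (const cand of candidates) {
    const s = runBenchmark(cand);
    if (s > score) {
      // Keep the improvement; a non-improving candidate is simply not adopted,
      // which is the "revert" step.
      best = cand;
      score = s;
    }
  }
  return { best, score };
}
```

Greedy acceptance like this is cheap but order-dependent, which is one reason a separate ablation round (isolating each change) is worth running afterwards.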
Ablation results
The example trim alone (exp 5) is the single strongest change and the simplest. Combinations that looked promising in isolation showed negative interaction effects when combined (exp 3, 7). Adding computer guidance back to the simplified policy actively hurt (exp 9 vs 8).
Why this is the safest change
This PR only removes examples from the EXAMPLES block. It does not touch:

- src/grok/tools.ts (task tool description unchanged, vision mention stays)
- buildSubagentPrompt

Every tool remains fully described and accessible. The model can still use generate_image, schedule_create, computer, and everything else -- it just no longer sees worked examples for those tools in the coding-focused EXAMPLES block.
Run-to-run variance
SWE-bench results are stochastic. We observed 4-point spreads between identical runs on different machines. In the ablation, the example-trim variant (0.43) is the single highest score across all 9 experiments. Nothing else matched it. The direction is consistent with the earlier 28-experiment autoresearch run where example trimming appeared in every "keep" commit.
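A 4-point spread is roughly what binomial noise predicts at this sample size: treating the 100 tasks as independent pass/fail trials with a true pass rate near 0.4, one standard error of the observed score is about 5 points. A quick check (an illustrative calculation, not part of the harness):

```typescript
// Standard error of a pass rate estimated from n independent pass/fail tasks:
// sqrt(p * (1 - p) / n).
function passRateStdErr(p: number, n: number): number {
  return Math.sqrt((p * (1 - p)) / n);
}

// At p = 0.4 and n = 100 this is ~0.049, i.e. about 5 tasks of run-to-run
// wobble at one standard error -- enough to produce 4-point spreads between
// identical runs without any real change.
const se = passRateStdErr(0.4, 100);
```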
Full autoresearch experiment log (28 runs, round 1)