
perf: trim irrelevant delegation examples (0.38 -> 0.43 SWE-bench Verified dev) #237

Open
alanzabihi wants to merge 1 commit into main from perf/trim-irrelevant-delegation-examples

Conversation


alanzabihi (Contributor) commented Apr 2, 2026

Results

| Metric | Original baseline | Current main | This PR | Delta (vs main) |
|---|---|---|---|---|
| SWE-bench Verified dev (n=100) | 32/100 | 38/100 | 43/100 | +5pp, +13% relative |
| Median task runtime | 121.7s | 167.6s | 93.3s | -44% |
| Cost per resolved task | $0.56 | $0.61 | $0.55 | -10% |

There are two baselines because the codebase changed between the initial autoresearch run (Mar 30) and the ablation study (Apr 2). The original baseline (32/100) was measured before batch mode, the computer sub-agent, and @-mention autocomplete landed. The current main baseline (38/100) was measured against today's main with all those features. Both comparisons point in the same direction.

What changed

One edit to src/agent/agent.ts. Removed 9 delegation examples that reference tools irrelevant to coding tasks:

- "investigate why this test fails" -> delegate to explore first, then continue with findings
- "refactor this module" -> delegate a focused part to general when helpful
- "open the host app and click through it" -> use computer
- "generate a logo" -> use generate_image
- "animate this still image" -> use generate_video
- Recurring specialized workflows -> use the matching custom sub-agent via task
- "every weekday at 9am run this check" -> use schedule_create with a cron expression
- "run this once automatically" -> use schedule_create with the right timing
- "make sure scheduled jobs keep running" -> use schedule_daemon_status and schedule_daemon_start

Kept 3 examples that match what the model actually does during code tasks:

- "review this change" -> delegate to explore first
- "research how auth works" -> delegate to explore first
- "verify this feature locally" -> use verify

The delegation policy rules, tool descriptions in tools.ts, and all sub-agent definitions are untouched. Schedule tools, generate_image, generate_video, computer, and vision all remain fully functional -- this change only removes their delegation examples from the agent system prompt, not the tools or their descriptions.

Method

Two rounds of experiments.

Round 1: autoresearch (28 experiments, ~36 hours). An automated prompt-tuning loop based on Karpathy's autoresearch pattern. Each experiment edits the orchestration policy text, runs all 100 SWE-bench tasks with fixed budgets (grok-code-fast-1, 120 max tool rounds, 900s timeout), evaluates via the standard SWE-bench Docker harness, keeps improvements, and reverts regressions. The best variant (0.42, exp 21) combined three changes: example trim, policy simplification, and vision removal from tools.ts.
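The keep/revert loop above can be sketched as follows. This is a simplified stand-in, not the actual autoresearch harness: `proposeVariant` and `evaluate` are placeholders for the real prompt editor and the SWE-bench Docker evaluation:

```typescript
// Minimal sketch of a keep/revert prompt-tuning loop, assuming a caller
// supplies a variant generator and an evaluation function.
type Variant = { description: string; promptText: string };

function autoresearch(
  basePrompt: string,
  proposeVariant: (current: string, round: number) => Variant,
  evaluate: (promptText: string) => number, // resolved fraction over the task set
  rounds: number,
): { promptText: string; rate: number } {
  let best = { promptText: basePrompt, rate: evaluate(basePrompt) };
  for (let i = 0; i < rounds; i++) {
    const variant = proposeVariant(best.promptText, i);
    const rate = evaluate(variant.promptText);
    if (rate > best.rate) {
      best = { promptText: variant.promptText, rate }; // keep the improvement
    }
    // otherwise revert: the previous best prompt carries forward
  }
  return best;
}
```

In the real run, each `evaluate` call is ~100 Docker-harness task executions, which is why 28 experiments took ~36 hours.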

Round 2: ablation study (9 experiments, ~9 hours). Isolated each change to understand which ones actually help. Ran against current main with all recent features (batch mode, computer sub-agent, @-mention autocomplete).

Ablation results

| # | Variant | Rate | vs main | Changes |
|---|---|---|---|---|
| 1 | current main (baseline) | 0.38 | -- | none |
| 2 | vision removal only | 0.41 | +3pp | tools.ts |
| 3 | prompt changes only | 0.35 | -3pp | policy + examples |
| 4 | policy simplification only | 0.41 | +3pp | policy rules |
| 5 | example trim only | 0.43 | +5pp | examples (this PR) |
| 6 | policy + vision | 0.39 | +1pp | policy + tools.ts |
| 7 | examples + vision | 0.35 | -3pp | examples + tools.ts |
| 8 | all three combined | 0.42 | +4pp | policy + examples + tools.ts |
| 9 | all three + computer added back | 0.37 | -1pp | all + computer guidance |

The example trim alone (exp 5) is the single strongest change and the simplest. Combinations that looked promising in isolation showed negative interaction effects when combined (exp 3, 7). Adding computer guidance back to the simplified policy actively hurt (exp 9 vs 8).

Why this is the safest change

This PR only removes examples from the EXAMPLES block. It does not touch:

- The DEFAULT DELEGATION POLICY rules (all 11 rules stay, including computer)
- The TOOLS section (all tool descriptions stay, including computer_*, schedule_*, generate_*)
- src/grok/tools.ts (task tool description unchanged, vision mention stays)
- Sub-agent prompts and behavioral rules in buildSubagentPrompt
- Any functional code

Every tool remains fully described and accessible. The model can still use generate_image, schedule_create, computer, and everything else -- it just no longer sees worked examples for those tools in the coding-focused EXAMPLES block.

Run-to-run variance

SWE-bench results are stochastic. We observed 4-point spreads between identical runs on different machines. In the ablation, the example-trim variant (0.43) is the single highest score across all 9 experiments. Nothing else matched it. The direction is consistent with the earlier 28-experiment autoresearch run where example trimming appeared in every "keep" commit.
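For context on the size of that spread, a back-of-the-envelope check (not part of the PR, and only a rough model since the task set is fixed) treats each run as n=100 Bernoulli trials:

```typescript
// One-sigma standard error of a pass rate measured over n tasks:
// sqrt(p * (1 - p) / n). At p = 0.40 and n = 100 this is about 0.049,
// i.e. roughly +/-4.9 percentage points of noise per run, so an observed
// 4-point spread between identical runs is within one sigma.
function binomialStdErr(p: number, n: number): number {
  return Math.sqrt((p * (1 - p)) / n);
}

const se = binomialStdErr(0.4, 100); // about 0.049
```

This is why the ablation leans on consistency of direction across experiments rather than any single score.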

Full autoresearch experiment log (28 runs, round 1)
| Exp | Rate | Status | Description |
|---|---|---|---|
| 0 (baseline) | 0.38 | keep | trimmed examples from main |
| 1 | 0.40 | keep | simplify policy to 5 rules, 3 examples |
| 2 | 0.34 | discard | emphasize direct action, reduce to 4 rules |
| 3 | 0.33 | discard | add test-running guidance |
| 4 | 0.31 | discard | ultra-minimal: 3 rules, 1 example |
| 5 | 0.38 | discard | bug-fixing workflow examples |
| 6 | 0.26 | discard | shorten task tool summary |
| 7 | 0.34 | discard | shorten delegate description |
| 8 | 0.36 | discard | enrich sub-agent capability descriptions |
| 9 | 0.35 | discard | encourage direct action |
| 10 | 0.34 | discard | remove all examples, keep 5 rules |
| 11 | 0.39 | discard | streamline template: remove vision + proactive nudge |
| 12 | 0.39 | discard | remove vision mention only |
| 13 | 0.23 | discard | remove vision, keep winner ordering |
| 14 | 0.33 | discard | make task the default |
| 15 | 0.41 | keep | explore before general when context comes first |
| 16 | 0.40 | discard | swap in review/research examples |
| 17 | 0.40 | discard | code review as explicit explore use case |
| 18 | 0.35 | discard | explicit delegate-only-background rule |
| 19 | 0.38 | discard | remove-vision + task-vs-delegate rule |
| 20 | 0.38 | discard | remove-vision + explore-before-general |
| 21 | 0.42 | keep | remove-vision + review-and-research examples |
| 22 | 0.38 | discard | explore-first summary + winner text |
| 23 | 0.39 | discard | explore-first summary + remove-vision template |
| 24 | 0.14 | discard | test/review/verify examples |
| 25-28 | 0.00 | discard | various crashes |

@github-actions github-actions bot added the contributor:verified Contributor passed trust analysis. label Apr 2, 2026
@alanzabihi alanzabihi requested a review from homanp April 2, 2026 13:26
@homanp homanp self-assigned this Apr 2, 2026

@homanp homanp left a comment


How will the prompt update affect behavior on tasks that require image generation, computer use, etc., now that those examples are stripped from the prompt?


@alanzabihi alanzabihi force-pushed the perf/trim-irrelevant-delegation-examples branch from b7aa8db to d6572c5 Compare April 3, 2026 06:39
@alanzabihi alanzabihi changed the title perf: simplify agent delegation prompt (0.32 -> 0.42 SWE-bench Verified dev) perf: trim irrelevant delegation examples (0.38 -> 0.43 SWE-bench Verified dev) Apr 3, 2026

homanp (Contributor) commented Apr 3, 2026

@alanzabihi I will test this with schedule workflows and other tasks that are not coding related and see what the effect is.
