perf: trim irrelevant delegation examples (0.38 -> 0.43 SWE-bench Verified dev)#237
Open
alanzabihi wants to merge 1 commit into main from
Conversation
homanp reviewed Apr 2, 2026
b7aa8db to d6572c5
Contributor: @alanzabihi I will test this with schedule workflows and other tasks that are not coding related and see what the effect is.
Results
Two baselines, because the codebase changed between the initial autoresearch run (Mar 30) and the ablation study (Apr 2). The original baseline (32/100) was measured before batch mode, the computer sub-agent, and @-mention autocomplete landed. The current main baseline (38/100) was measured against today's main with all those features. Both comparisons point in the same direction.
What changed
One edit to src/agent/agent.ts: removed 9 delegation examples that reference tools irrelevant to coding tasks, and kept 3 examples that match what the model actually does during code tasks.
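The shape of the trim can be sketched like this (a hypothetical illustration only: the example names, object shape, and tool set below are invented for clarity and are not the actual contents of the EXAMPLES block in src/agent/agent.ts):

```typescript
// Hypothetical sketch of trimming a delegation-examples list down to the
// entries that only reference tools exercised during code tasks.
// None of these names are from the real agent.ts.
interface DelegationExample {
  tools: string[]; // tools the worked example references
  text: string;    // the example prompt/response text (elided here)
}

const DELEGATION_EXAMPLES: DelegationExample[] = [
  { tools: ["read", "edit"], text: "..." },   // code-task example: kept
  { tools: ["generate_image"], text: "..." }, // irrelevant to coding: removed
];

// Assumed set of tools relevant to SWE-bench-style code tasks.
const CODE_TASK_TOOLS = new Set(["read", "edit", "bash"]);

const trimmed = DELEGATION_EXAMPLES.filter((ex) =>
  ex.tools.every((t) => CODE_TASK_TOOLS.has(t)),
);
```

The actual PR is a hand edit of the prompt text, not a runtime filter; the sketch just shows which kind of example survives the cut.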
The delegation policy rules, tool descriptions in tools.ts, and all sub-agent definitions are untouched. Schedule tools, generate_image, generate_video, computer, and vision all remain fully functional -- this change only removes their delegation examples from the agent system prompt, not the tools or their descriptions.

Method
Two rounds of experiments.
Round 1: autoresearch (28 experiments, ~36 hours). An automated prompt-tuning loop based on Karpathy's autoresearch pattern. Each experiment edits the orchestration policy text, runs all 100 SWE-bench tasks with fixed budgets (grok-code-fast-1, 120 max tool rounds, 900s timeout), evaluates via the standard SWE-bench Docker harness, keeps improvements, and reverts regressions. The best variant (0.42, exp22) combined three changes: example trim, policy simplification, and vision removal from tools.ts.

Round 2: ablation study (9 experiments, ~9 hours). Isolated each change to understand which ones actually help. Ran against current main with all recent features (batch mode, computer sub-agent, @-mention autocomplete).
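The keep/revert loop can be sketched as a simple hill climb (a minimal sketch, not the actual autoresearch harness; `runBenchmark` stands in for the full 100-task SWE-bench evaluation):

```typescript
// Minimal hill-climbing sketch of the keep-improvements / revert-regressions
// loop. A real harness would apply the prompt edit, run the benchmark in
// Docker, and score resolved tasks; here the scorer is injected.
type Variant = { name: string; prompt: string };

function hillClimb(
  base: Variant,
  candidates: Variant[],
  runBenchmark: (v: Variant) => number,
): { best: Variant; score: number } {
  let best = base;
  let score = runBenchmark(base);
  for (const cand of candidates) {
    const s = runBenchmark(cand);
    if (s > score) {
      // Keep the improvement; a non-improving candidate is simply not adopted,
      // which is the "revert" step.
      best = cand;
      score = s;
    }
  }
  return { best, score };
}
```

Greedy acceptance like this is cheap but order-dependent, which is one reason a separate ablation round (isolating each change) is worth running afterwards.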
Ablation results
The example trim alone (exp 5) is the single strongest change and the simplest. Combinations that looked promising in isolation showed negative interaction effects when combined (exp 3, 7). Adding computer guidance back to the simplified policy actively hurt (exp 9 vs 8).
Why this is the safest change
This PR only removes examples from the EXAMPLES block. It does not touch:

- src/grok/tools.ts (task tool description unchanged, vision mention stays)
- buildSubagentPrompt

Every tool remains fully described and accessible. The model can still use generate_image, schedule_create, computer, and everything else -- it just no longer sees worked examples for those tools in the coding-focused EXAMPLES block.
Run-to-run variance
SWE-bench results are stochastic. We observed 4-point spreads between identical runs on different machines. In the ablation, the example-trim variant (0.43) is the single highest score across all 9 experiments. Nothing else matched it. The direction is consistent with the earlier 28-experiment autoresearch run where example trimming appeared in every "keep" commit.
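A 4-point spread is roughly what binomial noise predicts at this sample size: treating the 100 tasks as independent pass/fail trials with a true pass rate near 0.4, one standard error of the observed score is about 5 points. A quick check (an illustrative calculation, not part of the harness):

```typescript
// Standard error of a pass rate estimated from n independent pass/fail tasks:
// sqrt(p * (1 - p) / n).
function passRateStdErr(p: number, n: number): number {
  return Math.sqrt((p * (1 - p)) / n);
}

// At p = 0.4 and n = 100 this is ~0.049, i.e. about 5 tasks of run-to-run
// wobble at one standard error -- enough to produce 4-point spreads between
// identical runs without any real change.
const se = passRateStdErr(0.4, 100);
```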
Full autoresearch experiment log (28 runs, round 1)