
Auto-improve content extraction with OpenAI evaluation#74

Open
paskal wants to merge 10 commits into modularise-retrieval from openai-auto-extraction

Conversation


@paskal paskal commented Mar 29, 2026

Summary

  • Adds automatic extraction quality evaluation using OpenAI during content extraction
  • When OpenAI is configured (--openai-api-key) and no rule exists for the domain, GPT evaluates the extraction result and suggests CSS selectors if the result is poor
  • Iterates up to 3 times (configurable via --openai-max-iter) and saves the best selector as a rule for future use
  • ExtractAndImprove() adds a force mode that ignores existing rules, for cases where a user reports bad extraction
  • Protected POST /api/content-parsed-wrong?url=... endpoint triggers the force mode
  • Fail-open: GPT errors never break extraction; the original result is returned unchanged

Depends on #73 (modularise-retrieval).

Key design decisions

  • Inline evaluation: runs during Extract(), not asynchronously. The first request to a new domain may take longer (GPT calls) but always returns the best result; subsequent requests use the saved rule
  • GPT sees the URL, the extracted text, and truncated HTML: enough context for accurate selector suggestions without excessive token cost
  • Force mode: passes a nil rule to the general parser (ignoring stored rules), then re-evaluates from scratch
  • Image extraction deferred: extractPics (which downloads images via HTTP) runs once on the final result, not on every evaluation iteration

New configuration

Flag               Env              Default       Description
--openai-api-key   OPENAI_API_KEY   none          Enables auto-evaluation when set
--openai-model     OPENAI_MODEL     gpt-5.4-mini  Model for evaluation
--openai-max-iter  OPENAI_MAX_ITER  3             Max evaluation iterations
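For example, assuming the service binary is named `ureadability` (the actual binary name may differ), auto-evaluation could be enabled like this:

```shell
# Enable auto-evaluation; flags mirror the table above.
# The key can be passed via the environment instead of --openai-api-key.
export OPENAI_API_KEY="sk-placeholder"   # placeholder, not a real key
./ureadability --openai-model=gpt-5.4-mini --openai-max-iter=3
```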

New interface

type AIEvaluator interface {
    Evaluate(ctx context.Context, url, extractedText, htmlBody, prevSelector string) (*EvalResult, error)
}

paskal added 10 commits March 29, 2026 20:34
Add AIEvaluator and MaxGPTIter fields to UReadability, implement
evaluateAndImprove() loop that iterates with AI to find better CSS
selectors, and add ExtractAndImprove() force mode that bypasses
stored rules for re-evaluation.

Add GET /api/content-parsed-wrong protected endpoint that triggers
ExtractAndImprove() to re-evaluate and improve extraction for a URL.

- fix UTF-8 truncation in buildUserPrompt (rune-safe slicing for multi-byte content)
- pass prevSelector through evaluation loop so AI avoids repeating failed selectors
- fix double getText processing in extractWithSelector (return raw HTML, process once)
- add normalizeLinks and extractPics to AI-improved content
- change /api/content-parsed-wrong from GET to POST (mutating operation)
- add context timeout (60s) for OpenAI API calls
- return sentinel errInvalidJSON instead of nil,nil anti-pattern
- return error on double invalid JSON instead of silent fail-open
- create OpenAI client once via sync.Once instead of per-call
- consolidate duplicate genParser closure with getContentGeneral
- add test for retry succeeding after initial invalid JSON
- add test for Rules.Save failure not propagating
- fix CLAUDE.md mock location description
- add /api/content-parsed-wrong to README API section
- customParser now delegates to extractWithSelector (eliminates duplicated
  goquery parse+find+html loop)
- image extraction moved out of evaluation loop — runs once on final
  result instead of every iteration
- extract "ai-evaluator" to aiEvaluatorUser constant
- fix incorrect doc comment on callAPI
- remove unused getAuth test helper
- remove redundant cancel() call and restating comments
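The rune-safe truncation fix mentioned in the commits amounts to slicing by runes instead of bytes. A minimal sketch, with a hypothetical helper name (the PR's buildUserPrompt change is only described, not shown, here):

```go
package main

// truncateRunes keeps at most n runes of s, never splitting a
// multi-byte UTF-8 sequence the way a raw byte slice s[:n] could.
// Hypothetical helper, not the PR's actual code.
func truncateRunes(s string, n int) string {
	r := []rune(s)
	if len(r) <= n {
		return s
	}
	return string(r[:n])
}
```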
