Auto-improve content extraction with OpenAI evaluation #74
Open
paskal wants to merge 10 commits into modularise-retrieval from
…ion quality evaluation
Add AIEvaluator and MaxGPTIter fields to UReadability, implement evaluateAndImprove() loop that iterates with AI to find better CSS selectors, and add ExtractAndImprove() force mode that bypasses stored rules for re-evaluation.
Add GET /api/content-parsed-wrong protected endpoint that triggers ExtractAndImprove() to re-evaluate and improve extraction for a URL.
- fix UTF-8 truncation in buildUserPrompt (rune-safe slicing for multi-byte content)
- pass prevSelector through evaluation loop so AI avoids repeating failed selectors
- fix double getText processing in extractWithSelector (return raw HTML, process once)
- add normalizeLinks and extractPics to AI-improved content
- change /api/content-parsed-wrong from GET to POST (mutating operation)
- add context timeout (60s) for OpenAI API calls
- return sentinel errInvalidJSON instead of nil,nil anti-pattern
- return error on double invalid JSON instead of silent fail-open
- create OpenAI client once via sync.Once instead of per-call
- consolidate duplicate genParser closure with getContentGeneral
- add test for retry succeeding after initial invalid JSON
- add test for Rules.Save failure not propagating
- fix CLAUDE.md mock location description
- add /api/content-parsed-wrong to README API section
- customParser now delegates to extractWithSelector (eliminates duplicated goquery parse+find+html loop)
- image extraction moved out of evaluation loop: runs once on final result instead of every iteration
- extract "ai-evaluator" to aiEvaluatorUser constant
- fix incorrect doc comment on callAPI
- remove unused getAuth test helper
- remove redundant cancel() call and restating comments
Summary
- When an OpenAI API key is configured (--openai-api-key) and there is no existing rule for the domain, GPT evaluates the extraction result and suggests CSS selectors if the result is poor
- Iterates up to a configurable limit (--openai-max-iter) and saves the best selector as a rule for future use
- ExtractAndImprove() force mode ignores existing rules, for when a user reports bad extraction
- POST /api/content-parsed-wrong?url=... endpoint for force mode

Depends on #73 (modularise-retrieval).
Key design decisions
- Evaluation runs synchronously inside Extract(), not async. The first request to a new domain may take longer (GPT calls), but always returns the best result. Subsequent requests use the saved rule
- Force mode passes a nil rule to the general parser (ignores stored rules), then re-evaluates from scratch
- extractPics (which downloads images via HTTP) runs once on the final result, not on every evaluation iteration

New configuration
| Flag | Env | Default |
|---|---|---|
| --openai-api-key | OPENAI_API_KEY | |
| --openai-model | OPENAI_MODEL | gpt-5.4-mini |
| --openai-max-iter | OPENAI_MAX_ITER | 3 |

New interface