Modularise URL retrieval with Cloudflare Browser Rendering support by paskal · Pull Request #73 · ukeeper/ukeeper-readability

paskal · 2026-03-29T13:07:06Z

Summary

Addresses review feedback on radio-t/super-bot#156: content extraction improvements belong in ukeeper-readability (the extraction layer), not in super-bot (the consumer).

Introduces Retriever interface in the extractor package to abstract URL fetching — implementations return raw HTML + final URL + headers, existing parsing pipeline stays unchanged
HTTPRetriever — extracts the current HTTP fetch logic (Safari user-agent, redirect following, connection reuse)
CloudflareRetriever — uses Cloudflare Browser Rendering /content endpoint for JS-rendered pages behind bot protection
CLI flags --cf-account-id / --cf-api-token to enable Cloudflare retrieval; defaults to HTTP when not set
Backward compatible: UReadability{} without Retriever field falls back to a cached HTTPRetriever

Additional improvements in touched code:

checkToken helper with subtle.ConstantTimeCompare, extracted from extractArticleEmulateReadability (an earlier revision also applied it to POST /api/extract; commit 4bdfd51 reverted that, since the endpoint never had token auth)
normalizeLinks signature simplified from *http.Request to *url.URL
Fixed %b → %v format verb bug in text.go, switched from stdlib log to lgr for consistency
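
A constant-time token check of the kind the first item describes could look roughly like this. Only the name `checkToken` and the use of `subtle.ConstantTimeCompare` come from the PR; the signature and the way the helper is called are assumptions for illustration.

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// checkToken reports whether got matches want. The signature is an
// assumption. subtle.ConstantTimeCompare returns 1 only when both slices
// have equal contents; its running time depends on the length of the
// inputs, not on where they differ, and it returns 0 immediately when the
// lengths differ.
func checkToken(want, got string) bool {
	return subtle.ConstantTimeCompare([]byte(want), []byte(got)) == 1
}

func main() {
	fmt.Println(checkToken("secret", "secret")) // true
	fmt.Println(checkToken("secret", "guess"))  // false
}
```

A plain `want == got` comparison can stop at the first differing byte, which is the timing difference this helper avoids.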

Extract URL fetching abstraction from the inline HTTP logic in extractWithRules. Defines Retriever interface, RetrieveResult struct, and HTTPRetriever with Safari user-agent, redirect following, and timeout support. Includes moq generate directive and comprehensive tests.

Generate moq mock for Retriever interface as a test-only file (retriever_mock_test.go) instead of mocks/ subpackage to avoid import cycle (mocks/retriever.go would import extractor, cycling with readability_test.go). Run gofmt on all modified files, zero lint issues.

- fix err shadowing in deferred Body.Close() in both retrievers (use closeErr)
- handle Cloudflare API success=false response explicitly instead of treating JSON error as HTML
- truncate CF API error body to 512 bytes in error messages
- add comment documenting CF retriever URL limitation (no final URL after JS redirects)
- fix pre-existing %b format verb in text.go logging (should be %v)
- replace network-dependent TestCloudflareRetriever_DefaultBaseURL with local httptest
- add TestCloudflareRetriever_SuccessFalse for the new success=false handling
- add TestExtractWithCustomRetriever integration test using RetrieverMock
- remove duplicate plan file from docs/plans/ (already in completed/)
- update README.md with new CF CLI flags and feature description
- update CLAUDE.md CI bullet to reflect split docker.yml workflow

POST /api/extract never had token auth in the original code. The checkToken refactoring should only apply to the legacy /content/v1/parser endpoint which always had it.

paskal · 2026-03-29T20:59:51Z

This PR addresses the review feedback on radio-t/super-bot#156 — the content extraction improvement belongs in ukeeper-readability (the extraction layer), not in super-bot. With the Retriever interface and Cloudflare Browser Rendering support here, super-bot#156 can be closed.

paskal added 17 commits March 29, 2026 20:28

feat: implement CloudflareRetriever for Browser Rendering API

5bcd16d

feat: wire Retriever interface into UReadability extraction pipeline

4b932a3

feat: add CLI flags and wire Cloudflare retriever in main.go

6128691

feat: verify acceptance criteria for Retriever interface

8d560a0

feat: update documentation for Retriever interface and CLI flags

a7ff6d0

fix: address code review findings

7119f29

fix: address code review findings

ef56b96

fix: address codex review findings

298fa03

fix: address code review findings

e0562ed

fix: address code review findings

60f1a07

fix: address code review findings

38255d5

fix: cache default retriever, add defensive timeouts, extract constants

cd8ab64

fix: revert token auth addition to POST /api/extract

4bdfd51

POST /api/extract never had token auth in the original code. The checkToken refactoring should only apply to the legacy /content/v1/parser endpoint which always had it.

docs: add OpenAI auto-extraction improvement plan

1ef736b

paskal force-pushed the modularise-retrieval branch from ff27ed8 to 1ef736b on March 29, 2026 19:28

paskal mentioned this pull request Mar 29, 2026

Auto-improve content extraction with OpenAI evaluation #74

Open

paskal wants to merge 17 commits into master from modularise-retrieval

1 participant
