Modularise URL retrieval with Cloudflare Browser Rendering support #73
Open
Extract URL fetching abstraction from the inline HTTP logic in extractWithRules. Defines Retriever interface, RetrieveResult struct, and HTTPRetriever with Safari user-agent, redirect following, and timeout support. Includes moq generate directive and comprehensive tests.
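The abstraction described in this commit could be sketched roughly as follows. This is not the code from the PR: the `RetrieveResult` field names, the `Retrieve` method signature, the `Client` and `UserAgent` fields, and the `demo` helper are all assumptions made for illustration. Only the type names `Retriever`, `RetrieveResult`, and `HTTPRetriever` come from the commit message.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// RetrieveResult carries what the parsing pipeline needs from a fetch.
// Field names are hypothetical.
type RetrieveResult struct {
	Body   []byte      // raw HTML
	URL    string      // final URL after redirects
	Header http.Header // response headers
}

// Retriever abstracts how a page is fetched (method signature is an assumption).
type Retriever interface {
	Retrieve(ctx context.Context, url string) (*RetrieveResult, error)
}

// HTTPRetriever fetches pages with a plain GET; reusing one Client keeps connections alive.
type HTTPRetriever struct {
	Client    *http.Client
	UserAgent string
}

func (h *HTTPRetriever) Retrieve(ctx context.Context, url string) (res *RetrieveResult, err error) {
	client := h.Client
	if client == nil {
		client = http.DefaultClient
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, fmt.Errorf("build request: %w", err)
	}
	if h.UserAgent != "" {
		req.Header.Set("User-Agent", h.UserAgent)
	}
	resp, err := client.Do(req) // http.Client follows redirects by default
	if err != nil {
		return nil, fmt.Errorf("fetch %s: %w", url, err)
	}
	defer func() {
		// separate name so the close error never shadows the named return
		if closeErr := resp.Body.Close(); closeErr != nil && err == nil {
			err = fmt.Errorf("close body: %w", closeErr)
		}
	}()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, fmt.Errorf("read body: %w", err)
	}
	// resp.Request is the last request sent, so its URL is the post-redirect URL
	return &RetrieveResult{Body: body, URL: resp.Request.URL.String(), Header: resp.Header}, nil
}

// demo runs the retriever against a local server with one redirect (no network needed).
func demo() (finalPath, body string, err error) {
	mux := http.NewServeMux()
	mux.HandleFunc("/old", func(w http.ResponseWriter, r *http.Request) {
		http.Redirect(w, r, "/new", http.StatusFound)
	})
	mux.HandleFunc("/new", func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, "<html>ok</html>")
	})
	srv := httptest.NewServer(mux)
	defer srv.Close()

	var ret Retriever = &HTTPRetriever{
		Client:    &http.Client{Timeout: 5 * time.Second},
		UserAgent: "sketch-agent/1.0", // placeholder, not the Safari string used in the PR
	}
	out, err := ret.Retrieve(context.Background(), srv.URL+"/old")
	if err != nil {
		return "", "", err
	}
	return strings.TrimPrefix(out.URL, srv.URL), string(out.Body), nil
}

func main() {
	path, body, err := demo()
	fmt.Println(path, body, err)
}
```

Because callers depend only on the interface, a browser-rendering implementation can be swapped in without touching the parsing code.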
Generate moq mock for Retriever interface as a test-only file (retriever_mock_test.go) instead of mocks/ subpackage to avoid import cycle (mocks/retriever.go would import extractor, cycling with readability_test.go). Run gofmt on all modified files, zero lint issues.
- fix err shadowing in deferred Body.Close() in both retrievers (use closeErr)
- handle Cloudflare API success=false response explicitly instead of treating JSON error as HTML
- truncate CF API error body to 512 bytes in error messages
- add comment documenting CF retriever URL limitation (no final URL after JS redirects)
- fix pre-existing %b format verb in text.go logging (should be %v)
- replace network-dependent TestCloudflareRetriever_DefaultBaseURL with local httptest
- add TestCloudflareRetriever_SuccessFalse for the new success=false handling
- add TestExtractWithCustomRetriever integration test using RetrieverMock
- remove duplicate plan file from docs/plans/ (already in completed/)
- update README.md with new CF CLI flags and feature description
- update CLAUDE.md CI bullet to reflect split docker.yml workflow
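The success=false handling and the 512-byte truncation from this commit might look something like the sketch below. It assumes the error arrives as a JSON object with a boolean `success` field, as in Cloudflare's general API response envelope. The names `checkCFBody`, `truncate`, and `maxErrBody` are hypothetical and not taken from the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// maxErrBody caps how much of an API response is quoted in an error message.
const maxErrBody = 512

// cfEnvelope declares only the field needed for the check.
// A pointer distinguishes "success absent" from "success: false".
type cfEnvelope struct {
	Success *bool `json:"success"`
}

// truncate returns at most n bytes of b, marking the cut with "...".
func truncate(b []byte, n int) string {
	if len(b) <= n {
		return string(b)
	}
	return string(b[:n]) + "..."
}

// checkCFBody returns an error when body is a JSON object reporting success=false.
// Anything that does not parse as such an object is assumed to be rendered HTML.
func checkCFBody(body []byte) error {
	var env cfEnvelope
	if err := json.Unmarshal(body, &env); err != nil {
		return nil // not a JSON object, treat as HTML
	}
	if env.Success != nil && !*env.Success {
		return fmt.Errorf("cloudflare API returned success=false: %s", truncate(body, maxErrBody))
	}
	return nil
}

func main() {
	fmt.Println(checkCFBody([]byte("<html><body>hi</body></html>")))
	fmt.Println(checkCFBody([]byte(`{"success":false,"errors":[{"message":"bad token"}]}`)))
}
```

Truncating by bytes can split a multi-byte character at the cut, which is usually acceptable for a log message.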
POST /api/extract never had token auth in the original code. The checkToken refactoring should only apply to the legacy /content/v1/parser endpoint which always had it.
This PR addresses the review feedback on radio-t/super-bot#156: the content extraction improvement belongs in ukeeper-readability (the extraction layer), not in super-bot. With the Retriever interface and Cloudflare Browser Rendering support here, super-bot#156 can be closed.
Summary
Addresses review feedback on radio-t/super-bot#156: content extraction improvements belong in ukeeper-readability (the extraction layer), not in super-bot (the consumer).
- `Retriever` interface in the extractor package to abstract URL fetching: implementations return raw HTML + final URL + headers, existing parsing pipeline stays unchanged
- `HTTPRetriever`: extracts the current HTTP fetch logic (Safari user-agent, redirect following, connection reuse)
- `CloudflareRetriever`: uses Cloudflare Browser Rendering `/content` endpoint for JS-rendered pages behind bot protection
- `--cf-account-id` / `--cf-api-token` to enable Cloudflare retrieval; defaults to HTTP when not set
- `UReadability{}` without `Retriever` field falls back to a cached `HTTPRetriever`

Additional improvements in touched code:

- `checkToken` helper with `subtle.ConstantTimeCompare`, extracted from `extractArticleEmulateReadability`, now also protects `POST /extract`
- `normalizeLinks` signature simplified from `*http.Request` to `*url.URL`
- `%b` → `%v` format verb bug in `text.go`, switched from stdlib `log` to `lgr` for consistency
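For readers unfamiliar with constant-time comparison, a token check built on `subtle.ConstantTimeCompare` could be sketched as below. The signature and the decision to reject when no token is configured are assumptions for this example; the actual `checkToken` helper in the PR may behave differently.

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// checkToken reports whether got equals want, comparing in time that does not
// depend on where the first mismatching byte is.
// Assumption: an empty want (no token configured) rejects every request.
func checkToken(got, want string) bool {
	if want == "" {
		return false
	}
	// ConstantTimeCompare returns 1 only when both slices have equal contents;
	// it returns 0 right away when lengths differ, so only the length can leak.
	return subtle.ConstantTimeCompare([]byte(got), []byte(want)) == 1
}

func main() {
	fmt.Println(checkToken("secret", "secret"))
	fmt.Println(checkToken("secreT", "secret"))
}
```

A plain `got == want` comparison can return as soon as a byte differs, which in principle lets an attacker measure how much of a guess was correct.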