Add CLI Filtering for CVs and Introduce --dry-run Option for Deployment #57
anshuman9468 wants to merge 2 commits into `dbpedia:main`
Conversation
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough

Adds pipe-separated download filtering to Databus URIs, implements filter-matching logic for JSON-LD Part nodes (extensions, compression, content variants), extends the download flow to apply filters, and adds a CLI `--dry-run` option for deployment.
Sequence Diagram

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant CLI
    participant Download
    participant JSONLD
    participant Filesystem
    User->>CLI: download("base_uri|filter1|filter2")
    CLI->>Download: invoke download(base_uri, filters)
    Download->>Download: parse URI -> base_uri + filters
    Download->>JSONLD: fetch artifact/version/group JSON-LD
    JSONLD-->>Download: return @graph with Part nodes
    Download->>Download: for each Part -> _matches_filters(node, filters)
    alt match
        Download->>Download: collect file URLs
    else no match
        Download-->>Download: skip Part
    end
    Download->>Filesystem: download collected URLs
    Filesystem-->>Download: files saved
    Download-->>CLI: results
    CLI-->>User: report complete
```
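Step 3 of the diagram (parse URI -> base_uri + filters) can be sketched as a minimal helper. The function name `parse_download_uri` is illustrative, not the PR's actual code; it only assumes the pipe-separated syntax shown in the diagram:

```python
def parse_download_uri(uri: str) -> tuple[str, list[str]]:
    """Split a pipe-separated Databus download URI into the base URI
    and the trailing filter tokens (e.g. ".ttl", "..gz")."""
    base_uri, *filters = uri.split("|")
    return base_uri, filters

# Example: a version URI with a format filter and a compression filter.
base, filters = parse_download_uri(
    "https://databus.dbpedia.org/acct/group/artifact/2024.01.01|.ttl|..gz"
)
```

A URI without any `|` separator simply yields an empty filter list, so unfiltered downloads keep working unchanged.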
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: 1 passed, 4 failed (2 warnings, 2 inconclusive)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@databusclient/api/download.py`:
- Around line 903-915: The filter logic in the for-loop that checks
f.startswith(".") and f.startswith("..") currently reads formatExtension and
compression from node.get(...) only, causing valid Part nodes with only a "file"
field to fail; update the checks in that loop to fall back to extracting the
extension and compression from node["file"] when node.get("formatExtension") or
node.get("compression") are empty—use the file name (node.get("file") or
node["file"]) to derive the format extension (e.g., the suffix after the last
'.') and the compression (e.g., a trailing .gz/.bz2/etc.), then compare those
derived values (lowercased) against expected in the existing f.startswith(".")
and f.startswith("..") branches so the filters match when metadata fields are
absent.
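The fallback described above can be sketched as a self-contained helper. The function name and the set of recognized compression suffixes are assumptions for illustration, not the repository's actual code:

```python
import os
import re
from urllib.parse import urlparse

def derive_extension_and_compression(file_url: str) -> tuple[str, str]:
    """Derive (format_extension, compression) from a file URL's basename,
    for Part nodes whose formatExtension/compression fields are absent."""
    basename = os.path.basename(urlparse(file_url).path).lower()
    compression = ""
    # Assumed suffix set; extend as needed for other compression formats.
    match = re.search(r"\.(gz|bz2|xz)$", basename)
    if match:
        compression = match.group(1)
        basename = basename[: match.start()]
    # The remaining suffix after the last '.' is the format extension.
    extension = basename.rsplit(".", 1)[-1] if "." in basename else ""
    return extension, compression
```

The filter loop would then compare these derived values (already lowercased) against `expected` only when the metadata fields are empty.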
In `@databusclient/cli.py`:
- Around line 162-171: The current dry-run branch returns before building the
dataset/DataID preview so `--dry-run --webdav-url` prints an empty list; fix by
ensuring metadata is constructed even in dry-run mode: extract or reuse the same
metadata-generation logic used for real deploys (the code that produces
`metadata` from `distributions`, `remote`, `path`, and `webdav_url`) and call it
regardless of `dry_run`, but only skip the actual upload side-effect when
`dry_run` is true (i.e., call `webdav.upload_to_webdav` or a new
`build_metadata` helper in both cases or pass a dry_run flag to
`webdav.upload_to_webdav`), and remove the premature `return` so the
DataID/metadata preview is printed for dry runs.
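The restructuring suggested above could look like the following sketch, where `build_metadata` is a hypothetical stand-in for the real metadata-generation logic and only the upload side effect is gated on `dry_run`:

```python
def build_metadata(distributions, remote, path, webdav_url):
    # Hypothetical stand-in for the real DataID/metadata construction;
    # running it on dry runs surfaces the same errors as a real deploy.
    return [
        {"distribution": d, "target": f"{webdav_url}/{path}"}
        for d in distributions
    ]

def deploy(distributions, remote, path, webdav_url, dry_run=False):
    # Always build the metadata so --dry-run previews the real DataID.
    metadata = build_metadata(distributions, remote, path, webdav_url)
    if dry_run:
        print("[DRY-RUN] Skipping WebDAV upload")
    else:
        # The real upload side effect would happen here, e.g.
        # webdav.upload_to_webdav(distributions, remote, path, webdav_url)
        pass
    return metadata
```

With this shape there is no early `return` before metadata construction, so dry runs and real runs exercise the same preview path.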
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d7b368af-d320-4e10-8d56-42702b89db04
📒 Files selected for processing (4)
- README.md
- databusclient/api/download.py
- databusclient/cli.py
- tests/test_filter.py
```python
for f in filters:
    if f.startswith(".."):
        # Compression filter
        expected = f[2:].lower()
        actual = str(node.get("compression", "")).lower()
        if actual != expected:
            return False
    elif f.startswith("."):
        # Format extension filter
        expected = f[1:].lower()
        actual = str(node.get("formatExtension", "")).lower()
        if actual != expected:
            return False
```
Fall back to the file name when `formatExtension` or `compression` is missing.
`formatExtension` and `compression` are optional in published metadata, so `.ttl` / `..gz` filters currently fail on otherwise valid Part nodes that only expose `file`. That makes the new filter syntax silently skip matching files on those datasets.
💡 Localized fix

```diff
 for f in filters:
     if f.startswith(".."):
         # Compression filter
         expected = f[2:].lower()
         actual = str(node.get("compression", "")).lower()
+        if not actual:
+            actual = _detect_compression_format(str(node.get("file", ""))) or ""
         if actual != expected:
             return False
     elif f.startswith("."):
         # Format extension filter
         expected = f[1:].lower()
         actual = str(node.get("formatExtension", "")).lower()
+        if not actual:
+            path = urlparse(str(node.get("file", ""))).path
+            basename = os.path.basename(path).lower()
+            basename = re.sub(r"\.(bz2|gz|xz)$", "", basename)
+            actual = basename.rsplit(".", 1)[-1] if "." in basename else ""
         if actual != expected:
             return False
```
```python
if dry_run:
    click.echo("[DRY-RUN] Skipping WebDAV upload")
    metadata = []
else:
    metadata = webdav.upload_to_webdav(distributions, remote, path, webdav_url)

if dry_run:
    click.echo("[DRY-RUN] Generated metadata (partial):")
    click.echo(json.dumps(metadata, indent=2))
    return
```
`--dry-run` in WebDAV mode never builds the DataID preview.
This branch returns after printing `[]`, so it skips the dataset-construction path that real metadata deploys use. As a result, `deploy --dry-run --webdav-url ...` does not actually simulate the deploy or surface metadata/DataID errors until a real run.
Description
This PR introduces two key enhancements to improve usability and safety:
CLI Filtering for CVs: Adds support for filtering CVs directly via the command-line interface, enabling more efficient and targeted processing.
--dry-run Deployment Option: Introduces a --dry-run flag for deployment commands, allowing users to simulate the deployment process without making any actual changes. This helps in validating configurations and preventing unintended updates.
These changes aim to improve developer experience by providing better control over CV handling and safer deployment workflows.
Related Issues
Closes #12
Summary by CodeRabbit
New Features
- `--dry-run` option for deploy to generate and print metadata without performing deployment.
- `completion` subcommand to generate shell completions (bash, zsh, fish).

Documentation
Tests