
Add CLI Filtering for CVs and Introduce --dry-run Option for Deployment #57

Open
anshuman9468 wants to merge 2 commits into dbpedia:main from anshuman9468:fix/cv-filtering-and-dryrun

Conversation


@anshuman9468 anshuman9468 commented Apr 10, 2026

Description

This PR introduces two key enhancements to improve usability and safety:

  • CLI Filtering for CVs: Adds support for filtering CVs directly via the command-line interface, enabling more efficient and targeted processing.
  • --dry-run Deployment Option: Introduces a --dry-run flag for deployment commands, allowing users to simulate the deployment process without making any actual changes. This helps validate configurations and prevent unintended updates.

These changes aim to improve developer experience by providing better control over CV handling and safer deployment workflows.

Related Issues
Closes #12

Summary by CodeRabbit

  • New Features

    • Download filtering via pipe-separated filters on URIs to select content variants, formats, and compression.
    • --dry-run option for deploy to generate and print metadata without performing deployment.
    • completion subcommand to generate shell completions (bash, zsh, fish).
  • Documentation

    • README updated with filter syntax, matching rules, and dry-run usage examples.
  • Tests

    • Added unit tests covering download filter matching and URL extraction.
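
The pipe-separated filter syntax summarized above could be parsed with something as simple as the following sketch. The function name `split_download_uri` is a hypothetical stand-in, not the actual databusclient API:

```python
# Hypothetical sketch: split a Databus download URI of the form
# 'https://.../artifact|filter1|filter2' into its base URI and filter list.
# The real implementation in databusclient/api/download.py may differ.
def split_download_uri(uri: str) -> tuple[str, list[str]]:
    base, *filters = uri.split("|")
    return base, filters
```

A URI without any `|` separators simply yields an empty filter list, so unfiltered downloads keep working unchanged.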


coderabbitai bot commented Apr 10, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7951de90-5aef-4aab-aade-e06cd21cdd32

📥 Commits

Reviewing files that changed from the base of the PR and between 03c820b and d7a3a3a.

📒 Files selected for processing (1)
  • tests/test_filter.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/test_filter.py

📝 Walkthrough

Adds pipe-separated download filtering to Databus URIs, implements filter-matching logic for JSON-LD Part nodes (extensions, compression, content variants), extends download flow to apply filters, adds a CLI --dry-run for deploy and a completion subcommand, and includes unit tests for filtering.

Changes

  • Documentation (README.md): Documented the new download filter URI syntax ($URI|filter1|filter2), matching rules for content variants (key=value or value), file extensions (.ext), compression (..gz), example usage, and the --dry-run deploy flag.
  • Download & Filtering (databusclient/api/download.py): Added parsing of pipe-separated filters in download(), propagated an optional filters: List[str] into _download_version, _download_artifact, _download_group, and _get_file_download_urls_from_artifact_jsonld. Implemented _matches_filters(node, filters) to match compression (..name), extensions (.ext), and content-variant key/value or bare-value checks against dcv:/dataid-cv: fields; updated URL extraction to skip non-matching Part nodes.
  • CLI Enhancements (databusclient/cli.py): Added a --dry-run flag to the deploy command and plumbing to generate/print JSON-LD without performing the upload/deploy across Classic, Metadata-file, and WebDAV modes. Added a completion subcommand accepting bash, zsh, or fish.
  • Tests (tests/test_filter.py): New unit tests for _matches_filters and _get_file_download_urls_from_artifact_jsonld, covering empty/None filters, extension matching, compression matching, CV key/value and bare-value matching, combined filters, and negative cases.
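
The matching rules described in the walkthrough can be sketched as follows. This is an illustrative simplification of the documented behavior, not the code from download.py: it only checks `dcv:`-prefixed keys, whereas the real implementation also handles `dataid-cv:` fields:

```python
# Simplified sketch of the documented filter-matching rules:
#   '..name'    -> compression must equal name
#   '.ext'      -> format extension must equal ext
#   'key=value' -> content variant dcv:key must equal value
#   'value'     -> some dcv:* field must equal value
def matches_filters(node: dict, filters: list[str]) -> bool:
    for f in filters:
        if f.startswith(".."):  # compression filter
            if str(node.get("compression", "")).lower() != f[2:].lower():
                return False
        elif f.startswith("."):  # format-extension filter
            if str(node.get("formatExtension", "")).lower() != f[1:].lower():
                return False
        elif "=" in f:  # content-variant key=value filter
            key, value = f.split("=", 1)
            if str(node.get(f"dcv:{key}", "")) != value:
                return False
        else:  # bare content-variant value filter
            cv_values = {str(v) for k, v in node.items() if k.startswith("dcv:")}
            if f not in cv_values:
                return False
    return True  # empty filter list matches everything
```

All filters must match for a Part node to be kept, so `$URI|.ttl|..gz|lang=en` narrows the download to English Turtle files compressed with gzip.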

Sequence Diagram

sequenceDiagram
    autonumber
    actor User
    participant CLI
    participant Download
    participant JSONLD
    participant Filesystem

    User->>CLI: download("base_uri|filter1|filter2")
    CLI->>Download: invoke download(base_uri, filters)
    Download->>Download: parse URI -> base_uri + filters
    Download->>JSONLD: fetch artifact/version/group JSON-LD
    JSONLD-->>Download: return `@graph` with Part nodes
    Download->>Download: for each Part -> _matches_filters(node, filters)
    alt match
        Download->>Download: collect file URLs
    else no match
        Download-->>Download: skip Part
    end
    Download->>Filesystem: download collected URLs
    Filesystem-->>Download: files saved
    Download-->>CLI: results
    CLI-->>User: report complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 4

❌ Failed checks (2 warnings, 2 inconclusive)

  • Out of Scope Changes check (⚠️ Warning): The PR includes a new CLI completion subcommand that is not mentioned in the PR description or issue objectives, representing out-of-scope functionality. Resolution: either remove the completion subcommand or clarify its relevance to issue #12 and update the PR description accordingly.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 58.82%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check (❓ Inconclusive): The PR description covers both features and their benefits, but the required checklist items are not completed or checked off. Resolution: complete the checklist items to confirm code style compliance, self-review, testing, and documentation updates have been performed.
  • Linked Issues check (❓ Inconclusive): The PR partially addresses issue #12 by adding filtering support and dry-run functionality, but does not implement all the core download features (vault token authentication, Docker image) mentioned in the linked issue. Resolution: clarify whether this PR is intended as a partial implementation of issue #12 or if additional features are expected in separate PRs.

✅ Passed checks (1 passed)

  • Title check (✅ Passed): The title accurately describes the two main features added in the PR: CLI filtering for CVs and a --dry-run deployment option.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@databusclient/api/download.py`:
- Around line 903-915: The filter logic in the for-loop that checks
f.startswith(".") and f.startswith("..") currently reads formatExtension and
compression from node.get(...) only, causing valid Part nodes with only a "file"
field to fail; update the checks in that loop to fall back to extracting the
extension and compression from node["file"] when node.get("formatExtension") or
node.get("compression") are empty—use the file name (node.get("file") or
node["file"]) to derive the format extension (e.g., the suffix after the last
'.') and the compression (e.g., a trailing .gz/.bz2/etc.), then compare those
derived values (lowercased) against expected in the existing f.startswith(".")
and f.startswith("..") branches so the filters match when metadata fields are
absent.

In `@databusclient/cli.py`:
- Around line 162-171: The current dry-run branch returns before building the
dataset/DataID preview so `--dry-run --webdav-url` prints an empty list; fix by
ensuring metadata is constructed even in dry-run mode: extract or reuse the same
metadata-generation logic used for real deploys (the code that produces
`metadata` from `distributions`, `remote`, `path`, and `webdav_url`) and call it
regardless of `dry_run`, but only skip the actual upload side-effect when
`dry_run` is true (i.e., call `webdav.upload_to_webdav` or a new
`build_metadata` helper in both cases or pass a dry_run flag to
`webdav.upload_to_webdav`), and remove the premature `return` so the
DataID/metadata preview is printed for dry runs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d7b368af-d320-4e10-8d56-42702b89db04

📥 Commits

Reviewing files that changed from the base of the PR and between ca7ac2d and 03c820b.

📒 Files selected for processing (4)
  • README.md
  • databusclient/api/download.py
  • databusclient/cli.py
  • tests/test_filter.py

Comment on lines +903 to +915
    for f in filters:
        if f.startswith(".."):
            # Compression filter
            expected = f[2:].lower()
            actual = str(node.get("compression", "")).lower()
            if actual != expected:
                return False
        elif f.startswith("."):
            # Format extension filter
            expected = f[1:].lower()
            actual = str(node.get("formatExtension", "")).lower()
            if actual != expected:
                return False


⚠️ Potential issue | 🟠 Major

Fall back to the file name when formatExtension or compression is missing.

file_format and compression are optional in published metadata, so .ttl / ..gz currently fail on otherwise valid Part nodes that only expose file. That makes the new filter syntax silently skip matching files on those datasets.

💡 Localized fix
     for f in filters:
         if f.startswith(".."):
             # Compression filter
             expected = f[2:].lower()
-            actual = str(node.get("compression", "")).lower()
+            actual = str(node.get("compression", "")).lower()
+            if not actual:
+                actual = _detect_compression_format(str(node.get("file", ""))) or ""
             if actual != expected:
                 return False
         elif f.startswith("."):
             # Format extension filter
             expected = f[1:].lower()
-            actual = str(node.get("formatExtension", "")).lower()
+            actual = str(node.get("formatExtension", "")).lower()
+            if not actual:
+                path = urlparse(str(node.get("file", ""))).path
+                basename = os.path.basename(path).lower()
+                basename = re.sub(r"\.(bz2|gz|xz)$", "", basename)
+                actual = basename.rsplit(".", 1)[-1] if "." in basename else ""
             if actual != expected:
                 return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@databusclient/api/download.py` around lines 903 - 915, The filter logic in
the for-loop that checks f.startswith(".") and f.startswith("..") currently
reads formatExtension and compression from node.get(...) only, causing valid
Part nodes with only a "file" field to fail; update the checks in that loop to
fall back to extracting the extension and compression from node["file"] when
node.get("formatExtension") or node.get("compression") are empty—use the file
name (node.get("file") or node["file"]) to derive the format extension (e.g.,
the suffix after the last '.') and the compression (e.g., a trailing
.gz/.bz2/etc.), then compare those derived values (lowercased) against expected
in the existing f.startswith(".") and f.startswith("..") branches so the filters
match when metadata fields are absent.
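
The fallback the review asks for could look like the helper below. The function name `derive_from_file` and the supported compression suffixes are assumptions for illustration; the actual fix in download.py may be shaped differently:

```python
import os
import re
from urllib.parse import urlparse

def derive_from_file(file_url: str) -> tuple[str, str]:
    """Derive (format_extension, compression) from a file URL when the
    formatExtension/compression metadata fields are absent.

    Assumes gz/bz2/xz are the compression suffixes of interest."""
    basename = os.path.basename(urlparse(file_url).path).lower()
    compression = ""
    m = re.search(r"\.(gz|bz2|xz)$", basename)
    if m:
        compression = m.group(1)
        basename = basename[: m.start()]  # strip the compression suffix
    extension = basename.rsplit(".", 1)[-1] if "." in basename else ""
    return extension, compression
```

Wired into `_matches_filters`, the derived values would only be consulted when `node.get("formatExtension")` or `node.get("compression")` is empty, so Part nodes with explicit metadata keep their current behavior.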

Comment on lines +162 to +171
if dry_run:
    click.echo("[DRY-RUN] Skipping WebDAV upload")
    metadata = []
else:
    metadata = webdav.upload_to_webdav(distributions, remote, path, webdav_url)

if dry_run:
    click.echo("[DRY-RUN] Generated metadata (partial):")
    click.echo(json.dumps(metadata, indent=2))
    return

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

--dry-run in WebDAV mode never builds the DataID preview.

This branch returns after printing [], so it skips the same dataset-construction path that real metadata deploys use. The result is that deploy --dry-run --webdav-url ... does not actually simulate the deploy or surface metadata/DataID errors until a real run.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@databusclient/cli.py` around lines 162 - 171, The current dry-run branch
returns before building the dataset/DataID preview so `--dry-run --webdav-url`
prints an empty list; fix by ensuring metadata is constructed even in dry-run
mode: extract or reuse the same metadata-generation logic used for real deploys
(the code that produces `metadata` from `distributions`, `remote`, `path`, and
`webdav_url`) and call it regardless of `dry_run`, but only skip the actual
upload side-effect when `dry_run` is true (i.e., call `webdav.upload_to_webdav`
or a new `build_metadata` helper in both cases or pass a dry_run flag to
`webdav.upload_to_webdav`), and remove the premature `return` so the
DataID/metadata preview is printed for dry runs.
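
The restructuring the prompt describes amounts to always building the metadata and gating only the upload side effect on the flag. In this sketch, `build_metadata` and `upload_to_webdav` are hypothetical stand-ins for the real cli.py/webdav code, shown only to illustrate the control flow:

```python
import json

def build_metadata(distributions, remote, path, webdav_url):
    # Hypothetical stand-in for the dataset/DataID construction step.
    return [{"distribution": d, "remote": remote, "path": path} for d in distributions]

def upload_to_webdav(metadata, webdav_url):
    # Hypothetical stand-in for the real upload side effect.
    raise RuntimeError("network side effect; never reached in dry-run")

def deploy_webdav(distributions, remote, path, webdav_url, dry_run=False):
    # Build the metadata unconditionally so dry runs surface the same
    # DataID/metadata errors a real deploy would.
    metadata = build_metadata(distributions, remote, path, webdav_url)
    if dry_run:
        print("[DRY-RUN] Skipping WebDAV upload")
        print("[DRY-RUN] Generated metadata (partial):")
        print(json.dumps(metadata, indent=2))
    else:
        upload_to_webdav(metadata, webdav_url)  # side effect only on real runs
    return metadata
```

With this shape, `--dry-run --webdav-url ...` prints the constructed metadata instead of an empty list, and there is no early return that skips validation.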

