fix(memory): skip duplicate fact entries#1857

Closed
Gujiassh wants to merge 1 commit into bytedance:main from Gujiassh:fix/memory-fact-dedup-clean

Conversation


Gujiassh (Contributor) commented Apr 4, 2026

Summary

  • skip appending duplicate fact content when applying memory updates
  • preserve existing fact removal and max-facts trimming behavior
  • add focused regression coverage for existing duplicates, same-batch duplicates, and max-facts trimming

End-user benefit

Without this fix, repeated memory updates can accumulate the same fact content over and over, which bloats the stored memory and degrades later memory injection quality. This change keeps the fact list concise while preserving the existing update flow.

Testing

  • backend/.venv/bin/python -m pytest backend/tests/test_memory_updater.py -q

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-opencode)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
@WillemJiang (Collaborator) commented:

@Gujiassh, thanks for your contribution. Please resolve the conflicts from the main branch.


Copilot AI left a comment

Pull request overview

Fixes long-term memory bloat by preventing duplicate fact entries (based on fact content) from being appended during memory updates, while preserving existing removal and max-facts trimming behavior.

Changes:

  • Add content-based deduplication for facts during _apply_updates().
  • Introduce a small helper to canonicalize fact content for dedupe comparisons.
  • Add regression tests covering existing duplicates, same-batch duplicates, and interaction with max-facts trimming.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Files reviewed:

  • backend/packages/harness/deerflow/agents/memory/updater.py: Adds a content-key helper and uses it to skip adding duplicate facts during memory update application.
  • backend/tests/test_memory_updater.py: Adds focused tests validating dedupe behavior and ensuring removals/trimming still work.

Comment on lines 356 to 365

```diff
 if confidence >= config.fact_confidence_threshold:
     content = fact.get("content", "")
     fact_key = _fact_content_key(content)
     if fact_key is not None and fact_key in existing_fact_keys:
         continue

     fact_entry = {
         "id": f"fact_{uuid.uuid4().hex[:8]}",
-        "content": fact.get("content", ""),
+        "content": content,
         "category": fact.get("category", "context"),
```

Copilot AI Apr 5, 2026

content is taken directly from fact.get("content") and stored back into memory without type/emptiness normalization. If the LLM returns a non-string (e.g., null/list/dict), later _strip_upload_mentions_from_memory() calls _UPLOAD_SENTENCE_RE.search() on that value and will raise a TypeError, aborting the memory update. Consider coercing to str or skipping facts whose content is not a non-empty string (e.g., strip() to empty) and using the normalized value both for dedupe and storage.

Copilot uses AI. Check for mistakes.
Comment on lines +126 to +130
with patch(
"deerflow.agents.memory.updater.get_memory_config",
return_value=_memory_config(max_facts=2, fact_confidence_threshold=0.7),
):
result = updater._apply_updates(current_memory, update_data, thread_id="thread-9")
Copy link

Copilot AI Apr 5, 2026

This test constructs a MemoryConfig with max_facts=2, but MemoryConfig.max_facts is constrained with ge=10 (see memory_config.py). The helper bypasses validation via setattr, so the test is exercising an impossible production configuration and may become brittle if assignment validation is enabled later. Suggest keeping max_facts within valid bounds (>=10) and adjusting the fixture data accordingly (e.g., generate >10 facts and assert trimming).
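The bound the review cites could look like the following sketch. The real MemoryConfig reportedly enforces `ge=10` via its validation layer; this stdlib dataclass only mimics that constraint, and everything here beyond the `max_facts >= 10` rule is illustrative.

```python
# Hedged sketch of a config that rejects out-of-bounds max_facts at
# construction time, so tests cannot exercise impossible configurations.
from dataclasses import dataclass


@dataclass
class MemoryConfig:
    max_facts: int = 50
    fact_confidence_threshold: float = 0.7

    def __post_init__(self):
        # Mirrors the ge=10 constraint described in the review comment.
        if self.max_facts < 10:
            raise ValueError("max_facts must be >= 10")


# A test fixture built this way stays within valid bounds instead of
# bypassing validation with setattr:
config = MemoryConfig(max_facts=10, fact_confidence_threshold=0.7)
```

A trimming test would then generate more than 10 facts and assert the list is cut to `max_facts`, rather than relying on an unreachable `max_facts=2`.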


Gujiassh (Contributor, Author) commented Apr 5, 2026

I rechecked this while resolving the main-branch conflicts and found that the duplicate-fact fix is already present on current main, including the content-key guard and the focused dedupe regressions. The same fix was already merged earlier as #1193, so rebasing this branch now would only replay behavior that is already upstream. Closing this one as superseded to avoid duplicate review noise.


Gujiassh (Contributor, Author) commented Apr 5, 2026

Closing as superseded by the already-merged main-branch fix and tests from #1193.

@Gujiassh Gujiassh closed this Apr 5, 2026

Gujiassh (Contributor, Author) commented Apr 5, 2026

Follow-up clarification: current main already contains the duplicate-fact guard in backend/packages/harness/deerflow/agents/memory/updater.py and the focused dedupe regressions in backend/tests/test_memory_updater.py. The equivalent fix was merged earlier via #1193, so this PR is being closed as superseded rather than rebased.


3 participants