knowledge: consolidation proposals endpoint

Finds near-duplicate entities the upsert-time cosine guard didn't
catch — typically legacy data from before that guard landed, or
pairs whose embeddings sit between 0.85 (default proposal floor)
and 0.92 (auto-collapse threshold). Pure read-side feature; the
actual merging still goes through the existing
/knowledge/entities/merge action.

New DAO method `find_consolidation_proposals(threshold,
max_groups)`:
  - Loads every non-rejected entity with an embedding.
  - Partitions by entity_type so a person can't cluster with a
    place.
  - Computes pairwise cosine within each partition; edges above
    the threshold feed a union-find for transitive grouping
    (Sara → Sarah → Sarah J. all land in one cluster).
  - Tracks min/max cosine per component so the UI can show "how
    tight" each cluster is before clicking in.
  - Returns groups of >= 2 sorted by size desc then max cosine
    desc; trimmed to `max_groups`.
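
The steps above can be sketched in isolation. This is a minimal,
std-only illustration, not the code in knowledge_dao: `Ent`, `Group`,
`propose` and `sample_entities` are hypothetical names, the edge test
is assumed to be inclusive (`>=`), and min/max cosine are assumed to
be tracked over accepted edges only.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an entity row (field names are assumptions).
#[derive(Debug, Clone)]
pub struct Ent {
    pub id: i32,
    pub entity_type: String,
    pub embedding: Vec<f32>,
}

/// Hypothetical stand-in for `ConsolidationGroup`.
#[derive(Debug, Clone, PartialEq)]
pub struct Group {
    pub ids: Vec<i32>,
    pub min_cosine: f32,
    pub max_cosine: f32,
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Union-find root lookup with path halving.
fn find(parent: &mut [usize], mut i: usize) -> usize {
    while parent[i] != i {
        parent[i] = parent[parent[i]];
        i = parent[i];
    }
    i
}

pub fn propose(ents: &[Ent], threshold: f32, max_groups: usize) -> Vec<Group> {
    let mut parent: Vec<usize> = (0..ents.len()).collect();
    let mut edges: Vec<(usize, f32)> = Vec::new();
    for i in 0..ents.len() {
        for j in (i + 1)..ents.len() {
            // Partition by type: a person never clusters with a place.
            if ents[i].entity_type != ents[j].entity_type {
                continue;
            }
            let c = cosine(&ents[i].embedding, &ents[j].embedding);
            if c >= threshold {
                let (ri, rj) = (find(&mut parent, i), find(&mut parent, j));
                if ri != rj {
                    parent[ri] = rj;
                }
                edges.push((i, c));
            }
        }
    }
    // Min/max cosine per component, taken over the accepted edges.
    let mut stats: HashMap<usize, (f32, f32)> = HashMap::new();
    for (i, c) in &edges {
        let root = find(&mut parent, *i);
        let s = stats.entry(root).or_insert((*c, *c));
        s.0 = s.0.min(*c);
        s.1 = s.1.max(*c);
    }
    let mut members: HashMap<usize, Vec<i32>> = HashMap::new();
    for i in 0..ents.len() {
        let root = find(&mut parent, i);
        members.entry(root).or_default().push(ents[i].id);
    }
    // Any component with >= 2 members has at least one edge, so stats exist.
    let mut groups: Vec<Group> = members
        .into_iter()
        .filter(|(_, ids)| ids.len() >= 2)
        .map(|(root, ids)| {
            let (lo, hi) = stats[&root];
            Group { ids, min_cosine: lo, max_cosine: hi }
        })
        .collect();
    // Size descending, then max cosine descending; trim to max_groups.
    groups.sort_by(|a, b| {
        b.ids.len().cmp(&a.ids.len()).then(b.max_cosine.total_cmp(&a.max_cosine))
    });
    groups.truncate(max_groups);
    groups
}

/// Toy data: three persons roughly 20 degrees apart, plus two places.
pub fn sample_entities() -> Vec<Ent> {
    let e = |id, ty: &str, v: [f32; 2]| Ent {
        id,
        entity_type: ty.to_string(),
        embedding: v.to_vec(),
    };
    vec![
        e(1, "person", [1.0, 0.0]),
        e(2, "person", [0.9397, 0.3420]),
        e(3, "person", [0.7660, 0.6428]),
        e(4, "place", [1.0, 0.0]),
        e(5, "place", [0.999, 0.04]),
    ]
}
```

With a threshold of 0.9 on the toy data, entities 1 and 3 are not
directly similar enough, but both link to entity 2, so all three
persons end up in one group. Entity 4 has the same embedding as
entity 1 yet stays out of that group because its type differs.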

New endpoint `GET /knowledge/consolidation-proposals?threshold=&limit=`
accepts a threshold (default 0.85, clamped to 0.5–0.99 to prevent
the "every entity in one mega-cluster" case) and a limit (default
50, clamped to 1–200). It returns groups with per-entity persona
fact-count breakdowns baked in, which saves the UI a separate query
per cluster member.
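
The parameter handling can be illustrated on its own. The defaults
and bounds below are the ones the handler in this commit uses;
`normalize_params` itself is a hypothetical helper written for this
sketch, since the handler does the same thing inline.

```rust
/// Hypothetical helper mirroring the handler's inline defaults and clamps.
/// Returns (cosine threshold, maximum number of groups).
pub fn normalize_params(threshold: Option<f32>, limit: Option<i64>) -> (f32, usize) {
    // Missing threshold falls back to 0.85, then is clamped to [0.5, 0.99].
    // Note: f32::clamp returns NaN when the input is NaN.
    let threshold = threshold.unwrap_or(0.85).clamp(0.5, 0.99);
    // Missing limit falls back to 50, then is clamped to [1, 200].
    // Clamping before the cast keeps negative values from wrapping.
    let max_groups = limit.unwrap_or(50).clamp(1, 200) as usize;
    (threshold, max_groups)
}
```

For example, `normalize_params(Some(0.0), Some(-5))` yields
`(0.5, 1)`, and a request with neither parameter yields `(0.85, 50)`.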

ConsolidationGroup is exported through database/mod.rs so the
handler can use it without depending on knowledge_dao internals.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cameron Cordes
2026-05-11 18:43:11 -04:00
parent 89d0a6527c
commit 6620fa48d7
3 changed files with 276 additions and 5 deletions


@@ -7,8 +7,8 @@ use std::sync::Mutex;
use crate::data::Claims;
use crate::database::models::{Entity, EntityFact, EntityPhotoLink, InsertEntityFact};
use crate::database::{
EntityFilter, EntityPatch, EntitySort, FactFilter, FactPatch, KnowledgeDao, PersonaFilter,
RecentActivity,
ConsolidationGroup, EntityFilter, EntityPatch, EntitySort, FactFilter, FactPatch, KnowledgeDao,
PersonaFilter, RecentActivity,
};
use crate::personas::PersonaDaoData;
use crate::state::AppState;
@@ -330,6 +330,27 @@ pub struct RecentQuery {
pub limit: Option<i64>,
}
#[derive(Deserialize)]
pub struct ConsolidationQuery {
    /// Cosine threshold for clustering. Default 0.85, which is looser
    /// than the upsert-time guard (0.92) so this view surfaces
    /// "probably same" pairs for human review.
    pub threshold: Option<f32>,
    /// Maximum number of groups to return. Default 50, clamped to
    /// 1..=200 by the handler.
    pub limit: Option<i64>,
}

#[derive(Serialize)]
pub struct ConsolidationGroupView {
    pub entities: Vec<EntitySummary>,
    pub min_cosine: f32,
    pub max_cosine: f32,
}

#[derive(Serialize)]
pub struct ConsolidationResponse {
    pub groups: Vec<ConsolidationGroupView>,
}
// ---------------------------------------------------------------------------
// Service registration
// ---------------------------------------------------------------------------
@@ -370,7 +391,11 @@ where
web::resource("/facts/{id}/restore")
.route(web::post().to(restore_fact::<D>)),
)
.service(web::resource("/recent").route(web::get().to(get_recent::<D>))),
.service(web::resource("/recent").route(web::get().to(get_recent::<D>)))
.service(
web::resource("/consolidation-proposals")
.route(web::get().to(get_consolidation_proposals::<D>)),
),
)
}
@@ -1146,3 +1171,64 @@ async fn get_recent<D: KnowledgeDao + 'static>(
}
}
}
async fn get_consolidation_proposals<D: KnowledgeDao + 'static>(
    req: HttpRequest,
    claims: Claims,
    query: web::Query<ConsolidationQuery>,
    dao: web::Data<Mutex<D>>,
    persona_dao: PersonaDaoData,
) -> impl Responder {
    // Clamp threshold so a curious client can't drag the cosine
    // floor to 0 and pull every entity into one giant cluster.
    let threshold = query.threshold.unwrap_or(0.85).clamp(0.5, 0.99);
    let max_groups = query.limit.unwrap_or(50).clamp(1, 200) as usize;
    let persona = resolve_persona_filter(&req, &claims, &persona_dao);

    let cx = opentelemetry::Context::current();
    let mut dao = dao.lock().expect("Unable to lock KnowledgeDao");
    let groups: Vec<ConsolidationGroup> =
        match dao.find_consolidation_proposals(&cx, threshold, max_groups) {
            Ok(g) => g,
            Err(e) => {
                log::error!("find_consolidation_proposals: {:?}", e);
                return HttpResponse::InternalServerError()
                    .json(serde_json::json!({"error": "Database error"}));
            }
        };

    // Decorate with per-persona fact counts so the curation UI can
    // show "default 8 · journal 3" inline and the curator can pick
    // which entity is the strongest target.
    let entity_ids: Vec<i32> = groups
        .iter()
        .flat_map(|g| g.entities.iter().map(|e| e.id))
        .collect();
    let breakdowns = dao
        .get_persona_breakdowns_for_entities(&cx, &entity_ids, persona.user_id())
        .unwrap_or_default();

    let groups_view: Vec<ConsolidationGroupView> = groups
        .into_iter()
        .map(|g| ConsolidationGroupView {
            entities: g
                .entities
                .into_iter()
                .map(|e| {
                    let id = e.id;
                    let summary = EntitySummary::from(e);
                    match breakdowns.get(&id) {
                        Some(bd) => summary.with_persona_breakdown(bd.clone()),
                        None => summary,
                    }
                })
                .collect(),
            min_cosine: g.min_cosine,
            max_cosine: g.max_cosine,
        })
        .collect();

    HttpResponse::Ok().json(ConsolidationResponse {
        groups: groups_view,
    })
}
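
The decoration step in the handler above (collect every entity id
once, fetch all breakdowns in one call, then attach them by lookup)
can be sketched without the web or database layers. Everything here
is a simplified assumption: `Summary` stands in for `EntitySummary`,
groups are plain id lists, and a breakdown is modelled as
(persona name, fact count) pairs, which may not match the real type.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for `EntitySummary`.
#[derive(Debug, Clone, PartialEq)]
pub struct Summary {
    pub id: i32,
    /// (persona name, fact count) pairs; empty when no breakdown exists.
    pub persona_counts: Vec<(String, i64)>,
}

/// Flatten all group members into one id list for a single batched lookup.
pub fn collect_ids(groups: &[Vec<i32>]) -> Vec<i32> {
    groups.iter().flat_map(|g| g.iter().copied()).collect()
}

/// Attach a breakdown to each member where one exists; members without
/// an entry in the map are still returned, just with no counts.
pub fn decorate(
    groups: Vec<Vec<i32>>,
    breakdowns: &HashMap<i32, Vec<(String, i64)>>,
) -> Vec<Vec<Summary>> {
    groups
        .into_iter()
        .map(|g| {
            g.into_iter()
                .map(|id| Summary {
                    id,
                    persona_counts: breakdowns.get(&id).cloned().unwrap_or_default(),
                })
                .collect()
        })
        .collect()
}
```

The point of the pattern is that the number of breakdown queries
stays at one regardless of how many groups or members come back.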