FIX SOURCE INGESTION: DEDUPLICATE AND RELEVANCE-FILTER YOUTUBE INPUTS
The input set contains the same YouTube video twice and content unrelated to backend/AI-in-SDLC, exposing gaps in our ingestion pipeline. Add deterministic deduplication by YouTube videoId and a lightweight relevance classifier on titles/descriptions to filter off-topic items. This reduces noise before summarization and speeds editorial review.
Cuts reviewer time and model token spend on irrelevant media.
Improves trust in automated digests and downstream metrics.
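The deduplication step described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names `extract_video_id` and `dedupe_by_video_id` are hypothetical, the set of recognized URL shapes (watch, youtu.be, embed, shorts, live) is an assumption about what the input set contains, and keeping the first occurrence is just one deterministic tie-break.

```python
from typing import List, Optional
from urllib.parse import parse_qs, urlparse

# Hypothetical helper: the URL shapes handled here are an assumption,
# not an exhaustive list of YouTube link formats.
def extract_video_id(url: str) -> Optional[str]:
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    # Normalize common subdomain prefixes.
    for prefix in ("www.", "m."):
        if host.startswith(prefix):
            host = host[len(prefix):]
    if host == "youtu.be":
        video_id = parsed.path.lstrip("/").split("/")[0]
        return video_id or None
    if host in ("youtube.com", "music.youtube.com", "youtube-nocookie.com"):
        if parsed.path == "/watch":
            values = parse_qs(parsed.query).get("v")
            return values[0] if values else None
        parts = parsed.path.strip("/").split("/")
        if len(parts) >= 2 and parts[0] in ("embed", "shorts", "live"):
            return parts[1]
    return None

def dedupe_by_video_id(urls: List[str]) -> List[str]:
    """Keep the first URL seen for each videoId; input order is preserved."""
    seen = set()
    kept = []
    for url in urls:
        video_id = extract_video_id(url)
        # Items without a recognizable videoId fall back to the raw URL as key.
        key = video_id if video_id is not None else url
        if key in seen:
            continue
        seen.add(key)
        kept.append(url)
    return kept

if __name__ == "__main__":
    inputs = [
        "https://www.youtube.com/watch?v=abc123XYZ_-&t=30s",
        "https://youtu.be/abc123XYZ_-",
        "https://www.youtube.com/watch?v=other000001&list=PL1",
    ]
    for url in dedupe_by_video_id(inputs):
        print(url)
```

Keying on the parsed videoId rather than the raw URL is what lets a timestamped link, a short link, and a playlist link collapse to the same item.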
Compare LLM zero-shot vs. a small supervised classifier over embeddings for relevance on a labeled set.
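One way to run this comparison is a small harness that scores any candidate classifier against the same labeled set. The sketch below assumes each classifier is wrapped as a callable taking a title/description string and returning True for relevant; the name `evaluate` and the choice of precision, recall, and F1 as metrics are assumptions, and no LLM or embedding code is included.

```python
from typing import Callable, Dict, Iterable, Tuple

def evaluate(
    classifier: Callable[[str], bool],
    labeled: Iterable[Tuple[str, bool]],
) -> Dict[str, float]:
    """Score a relevance classifier on (text, is_relevant) pairs."""
    tp = fp = fn = tn = 0
    for text, is_relevant in labeled:
        predicted = classifier(text)
        if predicted and is_relevant:
            tp += 1
        elif predicted and not is_relevant:
            fp += 1
        elif not predicted and is_relevant:
            fn += 1
        else:
            tn += 1
    # Guard against division by zero when a classifier predicts no positives
    # or the labeled set contains no relevant items.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}
```

Calling `evaluate` once per candidate (the zero-shot LLM wrapper and the supervised classifier over embeddings) on the same labeled pairs gives directly comparable numbers.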
Evaluate exact videoId matching vs. embedding-based near-duplicate detection to catch re-uploads and playlist variants.
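The embedding-based side of this evaluation could look like the sketch below: flag pairs of items with different videoIds whose embeddings are nearly parallel. It assumes embeddings are already computed elsewhere and supplied as plain lists of floats; the function names and the 0.95 cosine threshold are placeholders to be tuned on labeled re-uploads, not recommended values.

```python
import math
from typing import List, Sequence, Tuple

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def find_near_duplicates(
    items: List[Tuple[str, Sequence[float]]],
    threshold: float = 0.95,  # assumed value; tune on a labeled set
) -> List[Tuple[str, str]]:
    """Return (id_a, id_b) pairs with different ids and similarity >= threshold."""
    pairs = []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            id_a, vec_a = items[i]
            id_b, vec_b = items[j]
            if id_a == id_b:
                continue  # exact videoId matching already covers this case
            if cosine_similarity(vec_a, vec_b) >= threshold:
                pairs.append((id_a, id_b))
    return pairs
```

The pairwise loop is quadratic in the number of items, which is acceptable for a small evaluation set; a larger corpus would need an approximate nearest-neighbor index instead.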