People do not think in queries. They think in moments.
They know what they want to find inside a video or across a collection. Most traditional video search systems are hit or miss. They run one search, return a ranked list, and stop. If the top results are wrong, the system does not improve. The user has to restart by rewriting the query and trying again.
That leads to the same loop every time: search, skim, rephrase, restart. The system forces retries without learning from them.
Deep Search is our answer. It treats the first query as a starting point, then uses follow-ups to revise how it searches and improve results over time. It does not learn only from user feedback. It also runs internal checks that detect when results look wrong, adjust the plan, and retry automatically.
How Deep Search works
Deep Search is not a single search call. It is a retrieval loop.
A user gives a natural language request. The system turns that request into an executable plan, runs targeted searches across multiple indexes, combines the results, checks whether the clips actually match the intent, and then either retries or returns clips.
If the system cannot continue because a required detail is missing, it pauses and asks a short clarification question. If it is confident enough to show results, it pauses and returns a ranked list of clips.
At a high level, the flow looks like this:
- Convert the request into a plan
- Execute targeted searches across the right indexes
- Combine the results into candidate clips
- Check whether those clips satisfy the intent
- Retry with a revised plan if results are weak
- Rerank accepted clips and show them
- Pause for follow-up input or clarification when needed
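[Figure: the Deep Search loop. A natural language request enters the graph, which either shows clips or asks one question when a detail is missing. Validation uses extracted signals and does not reprocess frames, the Search Node is the single execution point, and graph state is persisted at each pause and resumed on user input.]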
That is the full system at a high level. The rest of this post moves down through that loop, starting with the structure that makes retrieval possible and then walking through how execution works step by step.
The retrieval graph
We model the system as a state machine using LangGraph. LangGraph lets us define explicit nodes, transitions, and pause states while preserving execution state across retries and user interactions.
Deep Search begins with a routing step. Its job is to decide whether the system is handling a new request or resuming an existing session.
- A fresh request starts in the Planner Node
- A resumed session starts in the Interpreter Node with saved state
From there, execution moves through the following stages:
- Planner Node: Converts user intent into a structured plan
- Search Node: Executes the plan and combines results across indexes
- Validation Node: Checks whether candidate clips actually satisfy the intent
- Recovery Node: Handles empty results by broadening the plan or asking for clarification
- Interpreter Node: Turns feedback or follow-ups into controlled plan edits
- Reranking Node: Reorders accepted clips for final display
- Preview State and Clarification State: The only two pause states in the system
  - Preview State: The point where Deep Search stops and shows a ranked list of clips to the user.
  - Clarification State: The point where Deep Search stops and asks one question because it is missing a detail that blocks the search.
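The routing between these nodes can be sketched as a small transition function. This is plain Python, not LangGraph's API. The node names follow the list above, while the state keys (`resumed`, `candidates`, `accepted`, `needs_clarification`) are hypothetical names chosen for illustration.

```python
# Hypothetical state keys; node names follow the post. Plain Python, not LangGraph's API.
PAUSE_STATES = {"preview", "clarification"}


def entry_node(state: dict) -> str:
    """Routing step: fresh requests start at the planner, resumed sessions at the interpreter."""
    return "interpreter" if state.get("resumed") else "planner"


def next_node(current: str, state: dict) -> str:
    """Return the node that follows `current`, given the state that node produced."""
    if current in PAUSE_STATES:
        # Pause states end the run; the next run starts again from entry_node.
        raise ValueError(f"{current!r} is a pause state and has no outgoing transition")
    if current == "planner":
        return "search"
    if current == "search":
        # An empty result set goes to recovery, anything else to validation.
        return "validation" if state.get("candidates") else "recovery"
    if current == "validation":
        # At least one accepted clip moves on to reranking; otherwise the
        # feedback goes to the interpreter, which edits the plan.
        return "reranking" if state.get("accepted") else "interpreter"
    if current in ("recovery", "interpreter"):
        # Both nodes either revise the plan and retry, or ask one question.
        return "clarification" if state.get("needs_clarification") else "search"
    if current == "reranking":
        return "preview"
    raise ValueError(f"no transition defined from {current!r}")
```

Writing the transitions as one function makes it easy to see that every path ends in one of the two pause states.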
This creates two loops inside the system: one that the user sees and one that happens internally.
The outer loop
This is the loop the user sees.
When one of the pause states is reached:
- The current graph state is persisted
- The system waits for user input
This is what allows follow-ups to build on previous reasoning instead of restarting the search from scratch every time.
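As a rough illustration of pausing and resuming, the sketch below stores the state as JSON in an in-memory dictionary. In the real system LangGraph preserves execution state, so everything here (the `_SESSIONS` store, `pause`, `resume`, and the state keys) is a hypothetical stand-in rather than the actual mechanism.

```python
import json

# In-memory stand-in for a durable store (assumption for illustration only).
_SESSIONS: dict[str, str] = {}


def pause(session_id: str, state: dict) -> None:
    """Persist the graph state as JSON when a pause state is reached."""
    _SESSIONS[session_id] = json.dumps(state)


def resume(session_id: str, user_input: str) -> dict:
    """Reload the saved state and attach the new user input to it."""
    state = json.loads(_SESSIONS[session_id])
    state["history"] = state.get("history", []) + [user_input]
    state["resumed"] = True  # tells the routing step to start at the interpreter
    return state
```

Because the saved plan comes back with the follow-up attached, the next run can edit that plan instead of planning from scratch.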
The inner loop
Inside a single run, Deep Search may revise and retry multiple times before it pauses.
A typical internal flow looks like this:
- Execute the current plan in the Search Node
- If candidates exist, send them to the Validation Node
- If the result set is empty or the Validation Node rejects the candidates, revise the plan
- Retry execution
The user only sees the final outcome of this internal loop.
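The internal flow above can be sketched as a bounded retry loop. The `search`, `validate`, and `revise` callables are hypothetical stand-ins for the Search Node, the Validation Node, and the plan-revision step, and the retry budget is an assumption, since the post does not state one.

```python
from typing import Callable


def run_inner_loop(
    plan: dict,
    search: Callable[[dict], list],
    validate: Callable[[list], list],
    revise: Callable[[dict], dict],
    max_attempts: int = 3,  # assumed retry budget; the post does not state one
) -> tuple[list, dict, int]:
    """Search, validate, and revise until some clips are accepted or the budget runs out."""
    for attempt in range(1, max_attempts + 1):
        candidates = search(plan)
        # Validation only runs when the search produced candidates.
        accepted = validate(candidates) if candidates else []
        if accepted:
            return accepted, plan, attempt
        # Empty or all-rejected: change the plan and retry.
        plan = revise(plan)
    return [], plan, max_attempts
```

The caller only sees the returned clips and the final plan, which mirrors how the user only sees the outcome of the loop.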
So far, we have looked at the retrieval loop from the top down. To understand how Deep Search executes a query, we first need to look at what it is actually searching over. That starts with indexing each clip into structured signals.
Indexing: How we make clips searchable
Deep Search works because it has real structure to work with. That structure is built through the following steps:
Step 1: Turn a video into scenes
We ingest the video and audio into VideoDB, then run VideoDB’s scene detection. This gives us consistent clip boundaries for everything that follows.
Step 2: Generate base signals per clip
For each clip, we generate foundational signals:
- Transcript Generation
- Object Detection
These signals serve two purposes:
- They are directly searchable in some cases
- They provide structured inputs for higher-level semantic extraction
Step 3: Extract structured meaning with a VLM
For each clip, the model fuses:
- Temporal video frames
- Transcript
- Detected objects
We keep fields short and evidence-based. Here are the extraction entities we store per clip and what they capture:
| Extraction entity | What it captures |
|---|---|
| Location | Setting and environment. Interior or exterior, style, time of day, weather cues, and scene scale. |
| Action | Dominant actions and interactions. Key verbs, motion, and actor-object interactions. |
| Scene description | Broad visual description. Costumes, colors, ambience, staging, plus on-screen text when present. |
| Character description | Appearance and identity traits. Age cues, clothing, accessories, distinguishing features, and body language. |
| Shot type | Dominant camera framing over the clip, for example wide, close-up, or establishing. |
| Emotion | Primary emotion signal for the clip, with confidence and evidence source. |
| Topic | What is being discussed or sung about, not the exact words. |
| Transcript | The exact spoken words in the clip. |
| Object description | Main objects and their attributes. Condition, color, distinctive markings, and relevance in the scene. |
This separation is intentional. Different signals express different kinds of meaning, and keeping them separate gives us more control at retrieval time.
Step 4: Build video-level structure
Once we have processed the clips, we generate higher-level structure for the full video:
- Subplot summaries that break the video into contiguous story segments
- A final summary that describes the full arc
These become additional searchable fields, especially useful when the user is describing a broader sequence or storyline rather than a single moment.
Step 5: Build separate semantic indexes
Finally, we create separate semantic indexes for each field. Instead of forcing every query through a single embedding space, Deep Search can choose the index that best matches the user’s intent.
This makes retrieval more precise. A query about location, dialogue, action, or sequence can be routed to the field that represents that kind of meaning best.
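To make the idea of one index per field concrete, here is a toy sketch. The clip records are invented, and word overlap stands in for the semantic similarity a real index would compute, so treat this as an illustration of the routing idea and not of how VideoDB indexes work.

```python
from collections import defaultdict

# Hypothetical clip records: one short text per extraction field.
clips = [
    {"clip_id": "c1", "location": "hotel corridor at night", "action": "man walking with a phone"},
    {"clip_id": "c2", "location": "beach at sunset", "action": "two people running"},
]

# One index per field: field name -> list of (clip_id, text) entries.
indexes: dict[str, list[tuple[str, str]]] = defaultdict(list)
for clip in clips:
    for field, text in clip.items():
        if field != "clip_id":
            indexes[field].append((clip["clip_id"], text))


def search_field(field: str, query: str) -> list[str]:
    """Toy lookup: word overlap stands in for semantic similarity."""
    words = set(query.lower().split())
    return [cid for cid, text in indexes[field] if words & set(text.lower().split())]
```

Here a location phrase such as "hotel corridor" finds `c1` in the location index and nothing in the action index, which is the kind of separation that per-field routing relies on.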
With the indexes in place, the next step is to turn the user’s request into an executable search plan. This process is driven by the graph.
The Planner Node: Turning a request into a plan
The first stage in execution is the Planner Node. Its job is to convert the request into something the system can actually run.
The output is a plan object that answers three questions:
- What should we search for?
- Where should we search for it?
- How strict should we be when combining and relaxing constraints?
A plan has four main parts:
- Subqueries: Each subquery has an ID, a query string, and a list of indexes to search.
- Join plan: Defines how to combine subquery result sets, usually with AND for intersection or OR for union.
- Metadata filters: Faceted filters applied to every search call, such as actors, characters, shot type, emotion, or objects.
- Fallback order: The order in which constraints should be relaxed when results are weak.
A simplified example looks like this:
    subqueries:
      - subquery_id: Q1
        index: [location]
        q: "hotel corridor"
      - subquery_id: Q2
        index: [action]
        q: "walking while holding a phone"
      - subquery_id: Q3
        index: [transcript, topic]
        q: "talking on the phone"
    join_plan:
      op: AND
      subqueries: [Q1, Q2, Q3]
    metadata_filters:
      actors: ["Tom Cruise"]
    fallback_order: ["actors"]
The Planner Node does not try to compress everything into one query string. It decomposes the request into a few targeted searches that can be combined later.
It also extracts metadata filters. In this case, Tom Cruise becomes an actor filter, which is applied to every search call so retrieval happens inside the right subset of clips from the start.
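One way to hold such a plan in code is a set of small dataclasses. The field names mirror the example plan, but the class names and the `check` method are hypothetical and not taken from the Deep Search codebase.

```python
from dataclasses import dataclass, field


@dataclass
class Subquery:
    subquery_id: str
    q: str
    index: list[str]


@dataclass
class JoinPlan:
    op: str  # "AND" for intersection, "OR" for union
    subqueries: list[str]


@dataclass
class Plan:
    subqueries: list[Subquery]
    join_plan: JoinPlan
    metadata_filters: dict[str, list[str]] = field(default_factory=dict)
    fallback_order: list[str] = field(default_factory=list)

    def check(self) -> None:
        """Reject plans whose join refers to a subquery that was never defined."""
        known = {s.subquery_id for s in self.subqueries}
        missing = set(self.join_plan.subqueries) - known
        if missing:
            raise ValueError(f"join plan references unknown subqueries: {sorted(missing)}")


plan = Plan(
    subqueries=[
        Subquery("Q1", "hotel corridor", ["location"]),
        Subquery("Q2", "walking while holding a phone", ["action"]),
        Subquery("Q3", "talking on the phone", ["transcript", "topic"]),
    ],
    join_plan=JoinPlan("AND", ["Q1", "Q2", "Q3"]),
    metadata_filters={"actors": ["Tom Cruise"]},
    fallback_order=["actors"],
)
plan.check()  # passes: every subquery named in the join is defined
```

A typed plan like this gives later stages something they can validate and edit field by field.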
The Search Node: Executing the plan and combining results
Once the Planner Node produces a plan, the next stage is the Search Node. This is where the plan turns into candidate clips.
The Search Node does two things:
- Execute each subquery against the right indexes.
- Combine the result sets using the join plan.
Step 1: Run subqueries in parallel
Each subquery targets one or more indexes. The Search Node runs them independently.
Before querying an index, it generates a few alternative phrasings of the subquery. This improves recall because the wording that matches the indexed text is not always the wording the user typed.
By default, we generate a small number of variants per subquery. If the subquery targets the Transcript index, we also keep the original phrasing as an extra variant, since exact wording often matters there.
For each variant, we call the VideoDB search API with the plan’s metadata filters applied. That means the actor filter from the example is active for every call.
Each subquery produces a set of clip hits with scores. If the same clip appears across multiple variants for the same subquery, we fuse those hits into one clip entry and keep the best score.
At the end of this step, we have one result set per subquery.
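A minimal sketch of this step, assuming two hypothetical callables: `rephrase` generates alternative phrasings and `search` wraps the actual search call, with the plan's metadata filters assumed to be applied inside it. Neither name comes from the VideoDB SDK.

```python
from typing import Callable

Hit = tuple[str, float]  # (clip_id, score)


def run_subquery(
    query: str,
    indexes: list[str],
    rephrase: Callable[[str], list[str]],     # hypothetical variant generator
    search: Callable[[str, str], list[Hit]],  # hypothetical wrapper: (index, text) -> hits
) -> dict[str, float]:
    """Search every variant on every index, keeping the best score per clip."""
    variants = list(rephrase(query))
    if "transcript" in indexes and query not in variants:
        variants.append(query)  # exact wording often matters for transcripts
    best: dict[str, float] = {}
    for index in indexes:
        for variant in variants:
            for clip_id, score in search(index, variant):
                # Fuse repeated hits into one entry with the highest score seen.
                best[clip_id] = max(score, best.get(clip_id, float("-inf")))
    return best
```

The returned mapping is the per-subquery result set that the join step consumes.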
Step 2: Combine results with a boolean join
The Search Node then applies the join plan.
If the join plan uses AND, we take the intersection. A clip survives only if it appears in every subquery result set. If the join plan uses OR, we take the union. A clip survives if it appears in any subquery result set. The output of the Search Node is a single list of candidate clips sorted by score.
If that list is empty, the system routes to the Recovery Node.
If that list is not empty, the system routes to the Validation Node.
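The join itself reduces to set intersection or union over clip IDs. The sketch below assumes each subquery result set is a `{clip_id: score}` mapping and that a clip's combined score is the sum of its per-subquery scores. The post does not say how scores are combined, so that part is an assumption.

```python
def join_results(op: str, result_sets: list[dict[str, float]]) -> list[tuple[str, float]]:
    """Combine per-subquery {clip_id: score} maps with AND (intersection) or OR (union)."""
    if not result_sets:
        return []
    key_sets = [set(r) for r in result_sets]
    if op == "AND":
        clip_ids = set.intersection(*key_sets)
    elif op == "OR":
        clip_ids = set.union(*key_sets)
    else:
        raise ValueError(f"unsupported join op: {op!r}")
    # Assumption: a clip's combined score is the sum of its per-subquery scores.
    combined = {cid: sum(r.get(cid, 0.0) for r in result_sets) for cid in clip_ids}
    # Highest score first; clip_id breaks ties so the order is deterministic.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))
```

An empty list from this function is what sends the flow to the Recovery Node.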
Where the flow splits
At this point, the system has already done two things:
- Built a plan
- Executed that plan across the right indexes
From here, the flow splits depending on whether any candidate clips were found.
There are now two possible paths:
- The Validation Node, which handles the non-empty case
- The Recovery Node, which handles the empty case
The Validation Node: Checking whether candidate clips really match
[Figure: the Validation Node labels each candidate clip PASS, FAIL, or AMBIG. Accepted clips move on, together with the intent, plan, and session history, and come out as ranked clips.]
The Search Node returns candidates based on semantic similarity and boolean joins. That produces plausible matches, but plausibility is not enough.
A clip can score high and still miss the intent in subtle ways:
- The action matches but the location does not
- The character appears but is not performing the requested action
- Dialogue contains similar wording but refers to something else
The Validation Node is the final verification layer before results are shown. It labels each candidate Pass, Fail, or Ambiguous. It does not reprocess video frames. It operates only on previously extracted structured signals and transcript snippets, which constrains decisions to known evidence and reduces hallucinated matches.
When all candidates fail
If every candidate is labeled Fail, the Validation Node produces a structured feedback object describing the mismatch pattern, for example:
- “Location constraint satisfied but action missing”
- “Dialogue matched but no speaking action detected”
- “Actor filter overly restrictive”
This feedback is passed to the Interpreter Node, which applies controlled edits to the plan and triggers another search.
The Validation Node does not just filter results. It also produces the signal that drives the next iteration.
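The branch after validation can be sketched as follows. The shape of the verdicts and of the returned object is an assumption made for illustration. The post only says that candidates are labeled and that an all-Fail outcome produces structured feedback.

```python
from collections import Counter


def summarize_validation(verdicts: dict[str, dict]) -> dict:
    """Split verdicts into accepted clips or, if all fail, a feedback object.

    `verdicts` maps clip_id -> {"label": "PASS" | "FAIL" | "AMBIGUOUS", "reason": str}.
    This shape and the returned dict are assumptions, not the real schema.
    """
    accepted = [cid for cid, v in verdicts.items() if v["label"] in ("PASS", "AMBIGUOUS")]
    if accepted:
        return {"route": "reranking", "accepted": accepted}
    # Every candidate failed: report the most common mismatch pattern first.
    reasons = Counter(v["reason"] for v in verdicts.values())
    return {
        "route": "interpreter",
        "feedback": [reason for reason, _ in reasons.most_common()],
    }
```

Ordering the feedback by frequency is one simple way to point the next plan edit at the dominant mismatch.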
The Recovery Node: What happens when the search returns nothing
The Recovery Node runs only when the Search Node produced zero candidate clips.
This usually means the plan is too strict somewhere. The join may be intersecting signals that rarely co-occur. A filter may be narrowing too hard. A subquery may be phrased in a way that does not match the collection’s vocabulary.
The Recovery Node looks at the current query, the current plan, and what has already been tried in the session, then chooses one of two outcomes:
- Revise the plan to broaden recall, then retry search
- Pause and ask a short clarification question, because a missing detail is blocking the search
Broadening is done through controlled edits. Typical changes include:
- Relaxing low-priority constraints first, such as objects or emotion
- Weakening the join strategy, for example switching part of an AND into an OR when it makes sense
- Rewording a subquery to be less specific
- Adding an extra index to a subquery to give it another source of evidence
If the Recovery Node decides a missing detail is the real blocker, it routes to the Clarification State and asks a single question. Once the user answers, the graph resumes and continues with an updated plan.
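A simplified sketch of controlled broadening, covering only two of the edits listed above: dropping a filter by following `fallback_order`, then weakening the join. The ordering of those two steps, the edit names, and the dictionary plan shape are assumptions.

```python
import copy
from typing import Optional


def broaden(plan: dict, tried: list[str]) -> tuple[Optional[dict], str]:
    """Return (revised_plan, edit_name), or (None, "clarify") when nothing is left to relax."""
    revised = copy.deepcopy(plan)  # never mutate the plan that was already tried
    # 1. Drop the next filter in the fallback order that has not been relaxed yet.
    for name in plan.get("fallback_order", []):
        edit = f"drop_filter:{name}"
        if name in revised["metadata_filters"] and edit not in tried:
            del revised["metadata_filters"][name]
            return revised, edit
    # 2. Weaken the join from intersection to union.
    if plan["join_plan"]["op"] == "AND" and "join:OR" not in tried:
        revised["join_plan"]["op"] = "OR"
        return revised, "join:OR"
    # 3. Nothing left to relax: pause and ask one clarification question.
    return None, "clarify"
```

Tracking what has already been tried keeps the node from repeating the same relaxation across retries.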
At this point, we have covered how the plan is built, how it is executed, and how the system branches based on results. The next part is what ties those retries and follow-ups together.
The Interpreter Node: Turning feedback into plan edits
The Interpreter Node is what keeps the loop moving.
It runs in two situations:
- After a pause, when the user sends a follow-up or answers a clarification question
- After the Validation Node rejects all candidates and returns feedback
In both cases, the job is to take a signal and convert it into a small, controlled update to the current plan.
What the Interpreter Node reads
The Interpreter Node looks at the full context it needs to make a good decision:
- The original query
- The current plan
- The most recent results shown to the user, if any
- The user input, if we are resuming after a pause
- Validation feedback, if we are in the all-rejected path
- The accumulated history of plan changes and question-answer turns in the session
This matters because a follow-up is rarely meaningful on its own. For example, “more like clip 2” only makes sense if the system knows what clip 2 was.
What the Interpreter Node outputs
The Interpreter Node produces one of two outcomes:
- A batch of plan edits
- A clarification question, if it still needs a missing detail
Most of the time, it returns plan edits.
Those edits are applied to the plan, recorded in history, and the system routes back to the Search Node to run the updated plan.
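Applying a batch of edits and recording them can be sketched like this. The edit vocabulary (`set_filter`, `remove_filter`, `rewrite_subquery`, `set_join`) is hypothetical. The post only says that edits are small, controlled, and recorded in history.

```python
import copy


def apply_edits(plan: dict, edits: list[dict], history: list[dict]) -> dict:
    """Apply a batch of small edits to a copy of the plan and record each one in history."""
    revised = copy.deepcopy(plan)
    for edit in edits:
        op = edit["op"]
        if op == "set_filter":
            revised["metadata_filters"][edit["name"]] = edit["values"]
        elif op == "remove_filter":
            revised["metadata_filters"].pop(edit["name"], None)
        elif op == "rewrite_subquery":
            for sq in revised["subqueries"]:
                if sq["subquery_id"] == edit["subquery_id"]:
                    sq["q"] = edit["q"]
        elif op == "set_join":
            revised["join_plan"]["op"] = edit["value"]
        else:
            raise ValueError(f"unknown edit op: {op!r}")
        history.append(edit)  # the session keeps a trail of every change
    return revised
```

Restricting changes to a fixed set of operations is what keeps the updates controlled: the interpreter can adjust the plan but cannot replace it wholesale.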
How the loop reconnects
Any time the system decides the plan needs to change, it routes back to the Search Node. That includes:
- The Recovery Node broadening the plan after an empty result
- The Interpreter Node applying user follow-ups after a pause
- The Interpreter Node applying validation feedback when all candidates are rejected
The Search Node remains the single execution point. The rest of the graph exists to decide whether the current plan is good enough and, if not, how to revise it.
The Reranking Node: Preparing clips for display
Once the Validation Node returns at least one candidate as Pass or Ambiguous, the system has something usable. At that point the flow stops being about recovery and becomes about presentation.
The next stage is the Reranking Node.
The Reranking Node takes the accepted candidates and reorders them into a final ranked list for display. Its input is not just retrieval scores. It also includes the original query, the current plan, and the session history, so ranking reflects intent and preferences, not just similarity. It returns a new ordering of clip IDs.
After reranking, the graph pauses in the Preview State and returns the ranked clips.
How this works in practice
Deep Search works best when the goal is to find a specific moment. The first query is only a starting point. The system refines toward the right clip through planning, validation, and controlled retries.
It is less suited for broad discovery prompts like “funny scenes” or “best moments,” where relevance depends more on subjective judgment than precise intent.
This is the key shift: instead of relying on one-shot ranking, Deep Search treats retrieval as a loop that improves before returning results.
Open source
We’re open sourcing Deep Search so anyone can inspect how it works, try it themselves, and extend it for their own use cases. You can explore the code, architecture, and examples here: https://github.com/video-db/deepsearch