Deep Search: How We Built an Engine for Finding Exact Moments in Video

People do not think in queries. They think in moments.

They know what they want to find inside a video or across a collection. Most traditional video search systems are hit or miss. They run one search, return a ranked list, and stop. If the top results are wrong, the system does not improve. The user has to restart by rewriting the query and trying again.

That leads to the same loop every time: search, skim, rephrase, restart. The system forces retries without learning from them.

Deep Search is our answer. It treats the first query as a starting point, then uses follow-ups to revise how it searches and improve results over time. It does not rely on user feedback alone: internal checks decide when results look wrong, adjust the plan, and retry automatically.

How Deep Search works

Deep Search is not a single search call. It is a retrieval loop.

A user gives a natural language request. The system turns that request into an executable plan, runs targeted searches across multiple indexes, combines the results, checks whether the clips actually match the intent, and then either retries or returns clips.

If the system cannot continue because a required detail is missing, it pauses and asks a short clarification question. If it has enough confidence to show results, it pauses and returns a ranked list of clips.

At a high level, the flow looks like this:

Convert the request into a plan
Execute targeted searches across the right indexes
Combine the results into candidate clips
Check whether those clips satisfy the intent
Retry with a revised plan if results are weak
Rerank accepted clips and show them
Pause for follow-up input or clarification when needed

[Diagram: Deep Search retrieval graph. User Request (natural language) → Planner Node → Search Node → Candidates? If yes: Validation Node → Reranking Node → Preview State (show clips). If no: Recovery Node → Need missing detail? If yes: Clarification State (ask 1 question). If no: rerun the Search Node. A user answer or follow-up resumes through the Interpreter Node, which reruns the Search Node.]

Notes from the diagram:

The Validation Node uses extracted signals and does not reprocess frames
Pause states persist graph state and resume on user input
The Search Node is the single execution point

That is the full system at a high level. The rest of this post moves down through that loop, starting with the structure that makes retrieval possible and then walking through how execution works step by step.

The retrieval graph

We model the system as a state machine using LangGraph. LangGraph lets us define explicit nodes, transitions, and pause states while preserving execution state across retries and user interactions.

Deep Search begins with a routing step. Its job is to decide whether the system is handling a new request or resuming an existing session.

A fresh request starts in the Planner Node
A resumed session starts in the Interpreter Node with saved state
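
The routing decision above can be pictured as a single function. This is a minimal sketch, not the actual Deep Search router: `route_entry` and the shape of `saved_state` are hypothetical names chosen for illustration.

```python
from typing import Optional


def route_entry(saved_state: Optional[dict]) -> str:
    """Pick the entry node for a run.

    `saved_state` is a hypothetical stand-in for whatever the graph
    persisted at the last pause; None means there is no earlier session.
    """
    if saved_state is None:
        return "planner"       # fresh request: build a plan from scratch
    return "interpreter"       # resumed session: interpret the new input


fresh_entry = route_entry(None)
resumed_entry = route_entry({"plan": {"subqueries": []}, "history": []})
```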

From there, execution moves through the following stages:

Planner Node: Converts user intent into a structured plan
Search Node: Executes the plan and combines results across indexes
Validation Node: Checks whether candidate clips actually satisfy the intent
Recovery Node: Handles empty results by broadening the plan or asking for clarification
Interpreter Node: Turns feedback or follow-ups into controlled plan edits
Reranking Node: Reorders accepted clips for final display
Preview State and Clarification State: The only two pause states in the system
- Preview State: The point where Deep Search stops and shows a ranked list of clips to the user.
- Clarification State: The point where Deep Search stops and asks one question because it is missing a detail that blocks the search.

This creates two loops inside the system: one that the user sees and one that happens internally.

The outer loop

[Diagram: Outer loop. Run Deep Search → Preview State (show clips) or Clarification State (ask 1 question) → user follow-up or user answer → resume from saved state → Run Deep Search.]

This is the loop the user sees.

When one of the pause states is reached:

The current graph state is persisted
The system waits for user input

This is what allows follow-ups to build on previous reasoning instead of restarting the search from scratch every time.
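
To make the pause-and-resume behavior concrete, here is a small sketch that uses an in-memory dictionary in place of a real persistence layer. The names `pause`, `resume`, and `_STORE`, as well as the fields stored in the state, are assumptions made for this example and are not taken from the Deep Search code.

```python
import json
from typing import Dict

# Hypothetical in-memory stand-in for a persistent checkpoint store.
_STORE: Dict[str, str] = {}


def pause(session_id: str, state: dict, pause_state: str) -> None:
    """Persist the graph state when a pause state is reached."""
    snapshot = dict(state, paused_at=pause_state)
    _STORE[session_id] = json.dumps(snapshot)


def resume(session_id: str, user_input: str) -> dict:
    """Reload the saved state and attach the new user input to it."""
    state = json.loads(_STORE[session_id])
    state["user_input"] = user_input
    return state


pause("s1", {"plan": {"q": "hotel corridor"}, "history": []}, "preview")
resumed = resume("s1", "more like clip 2")
```

Because the plan and history come back with the follow-up, the next run can edit the existing plan rather than building a new one.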

The inner loop

[Diagram: Inner loop (internal retries). Search → Validate → Good enough? If yes: Rerank. If no: Interpret, then back to Search.]

Inside a single run, Deep Search may revise and retry multiple times before it pauses.

A typical internal flow looks like this:

Execute the current plan in the Search Node
If candidates exist, send them to the Validation Node
If the result set is empty or the Validation Node rejects the candidates, revise the plan
Retry execution

The user only sees the final outcome of this internal loop.
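
The internal flow above can be sketched as a bounded loop. This is an illustration under assumptions: the `max_retries` cap is not stated anywhere in this post, and the single `revise` callback stands in for both the Recovery Node and the Interpreter Node.

```python
from typing import Callable, List


def run_inner_loop(plan: dict,
                   search: Callable[[dict], List[dict]],
                   validate: Callable[[List[dict]], List[dict]],
                   revise: Callable[[dict], dict],
                   max_retries: int = 3) -> List[dict]:
    """Search, validate, and revise until clips are accepted or retries run out."""
    for _ in range(max_retries + 1):
        candidates = search(plan)
        accepted = validate(candidates) if candidates else []
        if accepted:
            return accepted        # these would go on to reranking
        plan = revise(plan)        # empty or all rejected: revise and retry
    return []                      # a caller would pause for clarification here


# Toy callbacks: the first plan is too strict, the revised one finds a clip.
calls = []

def toy_search(plan):
    calls.append(plan["join"])
    return [{"id": "c1"}] if plan["join"] == "OR" else []

def toy_validate(candidates):
    return candidates

def toy_revise(plan):
    return dict(plan, join="OR")

accepted = run_inner_loop({"join": "AND"}, toy_search, toy_validate, toy_revise)
```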

So far, we have looked at the retrieval loop from the top down. To understand how Deep Search executes a query, we first need to look at what it is actually searching over. That starts with indexing each clip into structured signals.

Indexing: How we make clips searchable

[Diagram: Indexing pipeline. Video + Audio Ingest → Scene Detection (stable clip boundaries) → Base signals per clip (Transcript, Object Detection, Face Detection) → VLM Extraction → Video-level structure (subplot summaries + full arc summary) → Semantic Indexes.]

Deep Search works because it has real structure to work with. That structure is built through the following steps:

Step 1: Turn a video into scenes

We ingest the video and audio into VideoDB, then run its scene detection. This gives us consistent clip boundaries for everything that follows.

Step 2: Generate base signals per clip

For each clip, we generate foundational signals:

Transcript Generation
Object Detection

These signals serve two purposes:

They are directly searchable in some cases
They provide structured inputs for higher-level semantic extraction

Step 3: Extract structured meaning with a VLM

For each clip, the model fuses:

Temporal video frames
Transcript
Detected objects

We keep fields short and evidence-based. Here are the extraction entities we store per clip and what they capture:

Extraction entity	What it captures
Location	Setting and environment. Interior or exterior, style, time of day, weather cues, and scene scale.
Action	Dominant actions and interactions. Key verbs, motion, and actor-object interactions.
Scene description	Broad visual description. Costumes, colors, ambience, staging, plus on-screen text when present.
Character description	Appearance and identity traits. Age cues, clothing, accessories, distinguishing features, and body language.
Shot type	Dominant camera framing over the clip, for example wide, close-up, or establishing.
Emotion	Primary emotion signal for the clip, with confidence and evidence source.
Topic	What is being discussed or sung about, not the exact words.
Transcript	The exact spoken words in the clip.
Object description	Main objects and their attributes. Condition, color, distinctive markings, and relevance in the scene.

This separation is intentional. Different signals express different kinds of meaning, and keeping them separate gives us more control at retrieval time.

Step 4: Build video-level structure

Once we have processed the clips, we generate higher-level structure for the full video:

Subplot summaries that break the video into contiguous story segments
A final summary that describes the full arc

These become additional searchable fields, especially useful when the user is describing a broader sequence or storyline rather than a single moment.

Step 5: Build separate semantic indexes

Finally, we create separate semantic indexes for each field. Instead of forcing every query through a single embedding space, Deep Search can choose the index that best matches the user’s intent.

This makes retrieval more precise. A query about location, dialogue, action, or sequence can be routed to the field that represents that kind of meaning best.
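
As a rough illustration of why per-field indexes help, the sketch below keeps one tiny index per field and scores clips by token overlap. Token overlap is only a toy stand-in for embedding similarity, and `FIELD_INDEXES`, `search_field`, and the clip texts are all invented for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical per-field text for two clips; a real system would store embeddings.
FIELD_INDEXES: Dict[str, Dict[str, str]] = {
    "location": {"c1": "hotel corridor at night", "c2": "city street in rain"},
    "action": {"c1": "man walking while holding a phone", "c2": "woman running"},
}


def search_field(field: str, query: str) -> List[Tuple[str, float]]:
    """Score clips in one field index by the share of query tokens they contain."""
    tokens = set(query.lower().split())
    if not tokens:
        return []
    hits = []
    for clip_id, text in FIELD_INDEXES[field].items():
        overlap = len(tokens & set(text.lower().split())) / len(tokens)
        if overlap > 0:
            hits.append((clip_id, overlap))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


location_hits = search_field("location", "hotel corridor")
action_hits = search_field("action", "hotel corridor")
```

The same query matches in the location index and finds nothing in the action index, which is the kind of separation a single shared index would blur.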

With the indexes in place, the next step is to turn a user's request into an executable search plan. This process is driven by the graph.

The Planner Node: Turning a request into a plan

The first stage in execution is the Planner Node. Its job is to convert the request into something the system can actually run.

The output is a plan object that answers three questions:

What should we search for?
Where should we search for it?
How strict should we be when combining and relaxing constraints?

A plan has four main parts:

Subqueries: Each subquery has an ID, a query string, and a list of indexes to search.
Join plan: Defines how to combine subquery result sets, usually with AND for intersection or OR for union.
Metadata filters: Faceted filters applied to every search call, such as actors, characters, shot type, emotion, or objects.
Fallback order: The order in which constraints should be relaxed when results are weak.

A simplified example looks like this:

subqueries:
  - subquery_id: Q1
    index: [location]
    q: "hotel corridor"
  - subquery_id: Q2
    index: [action]
    q: "walking while holding a phone"
  - subquery_id: Q3
    index: [transcript, topic]
    q: "talking on the phone"
join_plan:
  op: AND
  subqueries: [Q1, Q2, Q3]
metadata_filters:
  actors: ["Tom Cruise"]
fallback_order: ["actors"]

The Planner Node does not try to compress everything into one query string. It decomposes the request into a few targeted searches that can be combined later.

It also extracts metadata filters. In this case, Tom Cruise becomes an actor filter, which is applied to every search call so retrieval happens inside the right subset of clips from the start.
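
For readers who prefer types to YAML, the example plan can be mirrored with dataclasses. The field names follow the example above; the class names and the `check` method are assumptions added for this sketch and are not part of the published plan format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Subquery:
    subquery_id: str
    q: str
    index: List[str]


@dataclass
class JoinPlan:
    op: str                    # "AND" (intersection) or "OR" (union)
    subqueries: List[str]


@dataclass
class Plan:
    subqueries: List[Subquery]
    join_plan: JoinPlan
    metadata_filters: Dict[str, List[str]] = field(default_factory=dict)
    fallback_order: List[str] = field(default_factory=list)

    def check(self) -> None:
        """Raise if the join names an unknown subquery or an unsupported operator."""
        known = {s.subquery_id for s in self.subqueries}
        missing = [sid for sid in self.join_plan.subqueries if sid not in known]
        if missing:
            raise ValueError(f"join references unknown subqueries: {missing}")
        if self.join_plan.op not in ("AND", "OR"):
            raise ValueError(f"unsupported join op: {self.join_plan.op}")


plan = Plan(
    subqueries=[
        Subquery("Q1", "hotel corridor", ["location"]),
        Subquery("Q2", "walking while holding a phone", ["action"]),
        Subquery("Q3", "talking on the phone", ["transcript", "topic"]),
    ],
    join_plan=JoinPlan("AND", ["Q1", "Q2", "Q3"]),
    metadata_filters={"actors": ["Tom Cruise"]},
    fallback_order=["actors"],
)
plan.check()
```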

The Search Node: Executing the plan and combining results

Once the Planner Node produces a plan, the next stage is the Search Node. This is where the plan turns into candidate clips.

The Search Node does two things:

Execute each subquery against the right indexes.
Combine the result sets using the join plan.

Step 1: Run subqueries in parallel

Each subquery targets one or more indexes. The Search Node runs the subqueries independently of one another, which is what allows them to execute in parallel.

Before querying an index, it generates a few alternative phrasings of the subquery. This improves recall because the wording that matches the indexed text is not always the wording the user typed.

By default, we generate a small number of variants per subquery. If the subquery targets the Transcript index, we also keep the original phrasing as an extra variant, since exact wording often matters there.
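
The variant rule can be sketched as follows. The paraphrasing step is passed in as a callback because this post does not say how variants are produced; `build_variants`, `toy_paraphrase`, and the default of three variants are assumptions for illustration.

```python
from typing import Callable, List


def build_variants(query: str, indexes: List[str],
                   paraphrase: Callable[[str, int], List[str]],
                   n_variants: int = 3) -> List[str]:
    """Return phrasings to search with, keeping the original for transcript searches."""
    variants = list(paraphrase(query, n_variants))
    if "transcript" in indexes:
        variants.append(query)              # exact wording often matters here
    return list(dict.fromkeys(variants))    # drop duplicates, keep order


def toy_paraphrase(query: str, n: int) -> List[str]:
    """Toy stand-in for a real paraphrasing step."""
    templates = ["someone {}", "a person {}", "{} scene"]
    return [t.format(query) for t in templates[:n]]


transcript_variants = build_variants(
    "talking on the phone", ["transcript", "topic"], toy_paraphrase)
action_variants = build_variants(
    "walking while holding a phone", ["action"], toy_paraphrase)
```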

For each variant, we call the VideoDB search API with the plan’s metadata filters applied. That means the actor filter from the example is active for every call.

Each subquery produces a set of clip hits with scores. If the same clip appears across multiple variants for the same subquery, we fuse those hits into one clip entry and keep the best score.
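
Fusing hits across variants comes down to keeping the best score per clip. A minimal sketch, assuming each hit is a `(clip_id, score)` pair and higher scores are better; the function name and sample scores are made up for the example.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, float]


def fuse_variant_hits(hits_per_variant: Iterable[List[Hit]]) -> Dict[str, float]:
    """Merge hits from all variants of one subquery, keeping each clip's best score."""
    best: Dict[str, float] = {}
    for hits in hits_per_variant:
        for clip_id, score in hits:
            if score > best.get(clip_id, float("-inf")):
                best[clip_id] = score
    return best


fused = fuse_variant_hits([
    [("c1", 0.62), ("c2", 0.40)],   # variant 1
    [("c1", 0.81)],                 # variant 2
    [("c3", 0.55), ("c2", 0.35)],   # variant 3
])
```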

At the end of this step, we have one result set per subquery.

Step 2: Combine results with a boolean join

The Search Node then applies the join plan.

If the join plan uses AND, we take the intersection. A clip survives only if it appears in every subquery result set. If the join plan uses OR, we take the union. A clip survives if it appears in any subquery result set. The output of the Search Node is a single list of candidate clips sorted by score.

If that list is empty, the system routes to the Recovery Node.

If that list is not empty, the system routes to the Validation Node.
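
The join and the routing decision that follows it can be sketched together. One detail is an assumption: this post does not say how scores are combined when a clip appears in several result sets, so the sketch simply averages them. The function names are also hypothetical.

```python
from typing import Dict, List, Tuple


def join_results(op: str,
                 result_sets: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Combine per-subquery results ({subquery_id: {clip_id: score}}) with AND or OR."""
    sets = list(result_sets.values())
    if not sets:
        return []
    if op == "AND":
        clip_ids = set(sets[0]).intersection(*sets[1:])   # in every result set
    elif op == "OR":
        clip_ids = set().union(*sets)                     # in any result set
    else:
        raise ValueError(f"unsupported join op: {op}")
    combined = []
    for clip_id in clip_ids:
        found = [s[clip_id] for s in sets if clip_id in s]
        # Assumption: average the scores of the subqueries that matched this clip.
        combined.append((clip_id, sum(found) / len(found)))
    return sorted(combined, key=lambda hit: (-hit[1], hit[0]))


def next_node(candidates: List[Tuple[str, float]]) -> str:
    """Route to validation when there are candidates, otherwise to recovery."""
    return "validation" if candidates else "recovery"


results = {
    "Q1": {"c1": 0.9, "c2": 0.7},
    "Q2": {"c1": 0.6, "c3": 0.8},
    "Q3": {"c1": 0.75},
}
and_hits = join_results("AND", results)
or_hits = join_results("OR", results)
```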

Where the flow splits

At this point, the system has already done two things:

Built a plan
Executed that plan across the right indexes

From here, the flow splits depending on whether any candidate clips were found.

[Diagram: Planner Node → Search Node. If candidates exist: Validation Node, then onward. If no candidates: Recovery Node, which either retries the Search Node or routes to the Clarification State.]

There are now two possible paths:

The Validation Node, which handles the non-empty case
The Recovery Node, which handles the empty case

The Validation Node: Checking whether candidate clips really match

[Diagram: Validation example. The Search Node outputs 10 candidate clips and the validator labels each one: clips 3, 7, and 10 are PASS; clips 5 and 8 are AMBIG (ambiguous); clips 1, 2, 4, 6, and 9 are FAIL. After validation, PASS and AMBIG clips are kept and FAIL clips are dropped. The kept clips are reranked using intent, plan, and session history, then shown in the Preview State as ranked clips.]

The Search Node returns candidates based on semantic similarity and boolean joins. That produces plausible matches, but plausibility is not enough.

A clip can score high and still miss the intent in subtle ways:

The action matches but the location does not
The character appears but is not performing the requested action
Dialogue contains similar wording but refers to something else

The Validation Node is the final verification layer before results are shown. It does not reprocess video frames. It operates only on previously extracted structured signals and transcript snippets, which constrains decisions to known evidence and reduces hallucinated matches.

When all candidates fail

If every candidate is labeled Fail, the Validation Node produces a structured feedback object describing the mismatch pattern, for example:

“Location constraint satisfied but action missing”
“Dialogue matched but no speaking action detected”
“Actor filter overly restrictive”

This feedback is passed to the Interpreter Node, which applies controlled edits to the plan and triggers another search.

The Validation Node does not just filter results. It also produces the signal that drives the next iteration.
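
Both roles of the Validation Node, filtering and producing feedback, can be sketched in one function. The verdict labels follow the ten-clip example earlier in this section. The shape of the feedback object and the per-clip `reasons` input are assumptions, since this post only gives example messages.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def apply_verdicts(verdicts: Dict[str, str],
                   reasons: Dict[str, str]) -> Tuple[List[str], Optional[dict]]:
    """Keep PASS and AMBIG clips; if none survive, summarize why the clips failed."""
    kept = [clip_id for clip_id, v in verdicts.items() if v in ("PASS", "AMBIG")]
    if kept:
        return kept, None
    # Every clip failed: report mismatch patterns, most frequent first.
    counts = Counter(reasons.get(clip_id, "unspecified") for clip_id in verdicts)
    feedback = {
        "all_rejected": True,
        "mismatch_patterns": [reason for reason, _ in counts.most_common()],
    }
    return [], feedback


labels = ["FAIL", "FAIL", "PASS", "FAIL", "AMBIG",
          "FAIL", "PASS", "AMBIG", "FAIL", "PASS"]
verdicts = {f"clip{i}": label for i, label in enumerate(labels, start=1)}
kept, feedback = apply_verdicts(verdicts, {})

none_kept, failure_feedback = apply_verdicts(
    {"a": "FAIL", "b": "FAIL", "c": "FAIL"},
    {"a": "action missing", "b": "action missing", "c": "location mismatch"},
)
```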

The Recovery Node: What happens when the search returns nothing

The Recovery Node runs only when the Search Node produced zero candidate clips.

This usually means the plan is too strict somewhere. The join may be intersecting signals that rarely co-occur. A filter may be narrowing too hard. A subquery may be phrased in a way that does not match the collection’s vocabulary.

The Recovery Node looks at the current query, the current plan, and what has already been tried in the session, then chooses one of two outcomes:

Revise the plan to broaden recall, then retry search
Pause and ask a short clarification question, because a missing detail is blocking the search

Broadening is done through controlled edits. Typical changes include:

Relaxing low-priority constraints first, such as objects or emotion
Weakening the join strategy, for example switching part of an AND into an OR when it makes sense
Rewording a subquery to be less specific
Adding an extra index to a subquery to give it another source of evidence

If the Recovery Node decides a missing detail is the real blocker, it routes to the Clarification State and asks a single question. Once the user answers, the graph resumes and continues with an updated plan.
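
A simplified sketch of the broadening logic follows. It makes several assumptions that this post does not state: filters are dropped one at a time in `fallback_order`, the whole join is switched from AND to OR rather than only part of it, and clarification is the last resort. The plan is a plain dictionary shaped like the earlier example.

```python
import copy
from typing import Tuple


def broaden(plan: dict) -> Tuple[str, dict]:
    """Return ("retry", broader_plan), or ("clarify", plan) when nothing is left to relax."""
    new_plan = copy.deepcopy(plan)          # never mutate the caller's plan
    filters = new_plan.get("metadata_filters", {})
    for name in new_plan.get("fallback_order", []):
        if name in filters:
            del filters[name]               # relax the lowest-priority filter first
            return "retry", new_plan
    if new_plan["join_plan"]["op"] == "AND":
        new_plan["join_plan"]["op"] = "OR"  # weaken the join
        return "retry", new_plan
    return "clarify", new_plan


strict_plan = {
    "join_plan": {"op": "AND", "subqueries": ["Q1", "Q2"]},
    "metadata_filters": {"actors": ["Tom Cruise"], "emotion": ["tense"]},
    "fallback_order": ["emotion", "actors"],
}
step1 = broaden(strict_plan)
step2 = broaden(step1[1])
step3 = broaden(step2[1])
step4 = broaden(step3[1])
```

Each call relaxes exactly one thing, which keeps the retries controlled and makes every change easy to record in the session history.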

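[Diagram: Planner Node → Search Node. If candidates exist: Validation Node, which continues onward when the result is ok or routes to the Interpreter Node when all candidates are rejected. If no candidates: Recovery Node, which either retries the Search Node or routes to the Clarification State.]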

At this point, we have covered how the plan is built, how it is executed, and how the system branches based on results. The next part is what ties those retries and follow-ups together.

The Interpreter Node: Turning feedback into plan edits

The Interpreter Node is what keeps the loop moving.

It runs in two situations:

After a pause, when the user sends a follow-up or answers a clarification question
After the Validation Node rejects all candidates and returns feedback

In both cases, the job is to take a signal and convert it into a small, controlled update to the current plan.

What the Interpreter Node reads

The Interpreter Node looks at the full context it needs to make a good decision:

The original query
The current plan
The most recent results shown to the user, if any
The user input, if we are resuming after a pause
Validation feedback, if we are in the all-rejected path
The accumulated history of plan changes and question-answer turns in the session

This matters because a follow-up is rarely meaningful on its own. For example, “more like clip 2” only makes sense if the system knows what clip 2 was.

What the Interpreter Node outputs

The Interpreter Node produces one of two outcomes:

A batch of plan edits
A clarification question, if it still needs a missing detail

Most of the time, it returns plan edits.

Those edits are applied to the plan, recorded in history, and the system routes back to the Search Node to run the updated plan.
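
One way to picture "controlled" plan edits is a small, closed vocabulary of operations applied to a copy of the plan. The four operation names below are hypothetical, as this post does not list the real edit types, and the plan is a plain dictionary shaped like the earlier example.

```python
import copy
from typing import List


def apply_edits(plan: dict, edits: List[dict], history: List[dict]) -> dict:
    """Apply a batch of edits to a copy of the plan and record the batch in history."""
    new_plan = copy.deepcopy(plan)
    subqueries = {s["subquery_id"]: s for s in new_plan["subqueries"]}
    for edit in edits:
        op = edit["op"]
        if op == "set_filter":
            new_plan["metadata_filters"][edit["name"]] = edit["values"]
        elif op == "remove_filter":
            new_plan["metadata_filters"].pop(edit["name"], None)
        elif op == "rewrite_subquery":
            subqueries[edit["subquery_id"]]["q"] = edit["q"]
        elif op == "add_index":
            target = subqueries[edit["subquery_id"]]["index"]
            if edit["index"] not in target:
                target.append(edit["index"])
        else:
            raise ValueError(f"unknown edit op: {op}")
    history.append({"edits": edits})        # only reached if every edit applied
    return new_plan


current_plan = {
    "subqueries": [{"subquery_id": "Q2", "q": "walking while holding a phone",
                    "index": ["action"]}],
    "metadata_filters": {"actors": ["Tom Cruise"]},
}
history: List[dict] = []
edited_plan = apply_edits(current_plan, [
    {"op": "remove_filter", "name": "actors"},
    {"op": "add_index", "subquery_id": "Q2", "index": "scene_description"},
], history)
```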

How the loop reconnects

Any time the system decides the plan needs to change, it routes back to the Search Node. That includes:

The Recovery Node broadening the plan after an empty result
The Interpreter Node applying user follow-ups after a pause
The Interpreter Node applying validation feedback when all candidates are rejected

The Search Node remains the single execution point. The rest of the graph exists to decide whether the current plan is good enough and, if not, how to revise it.

The Reranking Node: Preparing clips for display

Once the Validation Node returns at least one candidate as Pass or Ambiguous, the system has something usable. At that point the flow stops being about recovery and becomes about presentation.

The next stage is the Reranking Node.

The Reranking Node takes the accepted candidates and reorders them into a final ranked list for display. Its input is not just retrieval scores. It also includes the original query, the current plan, and the session history, so ranking reflects intent and preferences, not just similarity. It returns a new ordering of clip ids.

After reranking, the graph pauses in the Preview State and returns the ranked clips.
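
The input and output contract of the Reranking Node can be illustrated with a deterministic stand-in. The ordering rule here is entirely an assumption made for the example: clips the user disliked in earlier turns go last, PASS ranks above AMBIG, and retrieval score breaks ties. The actual node may rank very differently.

```python
from typing import Dict, List

VERDICT_RANK = {"PASS": 0, "AMBIG": 1}


def rerank(scores: Dict[str, float], verdicts: Dict[str, str],
           disliked: List[str]) -> List[str]:
    """Return clip ids in display order under the toy rule described above."""
    def sort_key(clip_id: str):
        return (clip_id in disliked,              # previously disliked clips go last
                VERDICT_RANK[verdicts[clip_id]],  # PASS before AMBIG
                -scores[clip_id],                 # then higher retrieval score first
                clip_id)                          # stable tie-break
    return sorted(scores, key=sort_key)


order = rerank(
    scores={"c3": 0.71, "c5": 0.90, "c7": 0.64, "c10": 0.80},
    verdicts={"c3": "PASS", "c5": "AMBIG", "c7": "PASS", "c10": "PASS"},
    disliked=["c10"],
)
```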

How this works in practice

Deep Search works best when the goal is to find a specific moment. The first query is only a starting point. From there, the system refines toward the right clip through planning, validation, and controlled retries.

It is less suited for broad discovery prompts like “funny scenes” or “best moments,” where relevance depends more on subjective judgment than precise intent.

This is the key shift: instead of relying on one-shot ranking, Deep Search treats retrieval as a loop that improves before returning results.

Open source

We’re open sourcing Deep Search so anyone can inspect how it works, try it themselves, and extend it for their own use cases. You can explore the code, architecture, and examples here: https://github.com/video-db/deepsearch
