How to Evaluate Multimodal VLMs for Your Video Use Case

A practical workflow for evaluating video VLM setups with VideoDB and Langfuse, from task definition and dataset design to tracing, scoring, and deployment decisions.

This post explains how we evaluate VLMs for real video use cases and how to build a repeatable workflow around VideoDB and Langfuse.

The goal is simple: do not evaluate only the model, evaluate the full setup. For video workflows, the output depends on the segmentation strategy, frame sampling, video resolution, prompts, model choice, reasoning budgets, latency requirements, and post-processing.

The goal of the evaluation is not to declare a winner in the abstract. The goal is to decide what setup is right for your task, on your videos, at the quality, latency, and cost you can support.

Define the task before touching the stack

Start by writing down what the system is expected to do.

That sounds basic, but it shapes almost everything that follows. Retrieval, monitoring, summarization, moderation, metadata extraction, and Q&A are different tasks. They produce different outputs, tolerate different errors, and usually require different extraction and evaluation strategies.

At this stage, the goal is not to answer every possible question. The goal is to narrow the problem enough that the benchmark reflects the real use case.

A useful way to do that is to get clarity on a few broad dimensions:

  • What is the system expected to produce? A ranked clip list, an alert, a summary, an answer, or structured metadata all need to be evaluated differently.
  • What does success look like in practice? In some workflows, false positives are the main problem. In others, missing an event is worse. This is where you define what "good enough" actually means for the product.
  • What kind of signal does the task depend on? Some tasks depend mostly on static visual frames. Others depend on motion, spoken content, scene changes, visible text, or a combination of these. That directly affects extraction strategy, frame count, and model choice.
  • What constraints does the system need to operate under? Real-time systems, batch pipelines, low-cost pipelines, and quality-first pipelines all push the setup in different directions.

Once these questions are clear, the rest of the setup becomes easier to design and much easier to interpret.

They also tell you where to start. If the task depends on short-lived actions, you will usually test denser sampling or more frames. If the video is mostly static, lighter extraction and smaller models may be enough. If latency or cost is the main constraint, the benchmark should include lighter configurations early. If quality matters most, start with a stronger baseline and optimize down later.
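
One way to keep the answers to these questions from drifting is to record them as a small spec that lives next to the benchmark. The sketch below is only an illustration: the `TaskSpec` name, its fields, and the example values are assumptions, not part of VideoDB or Langfuse.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical task definition; the field names are illustrative only."""
    name: str
    output_type: str                       # "ranked_clips", "alert", "summary", "answer", "metadata"
    costlier_error: str                    # "false_positive" or "false_negative"
    signals: Tuple[str, ...] = ()          # e.g. ("motion", "visible_text", "speech")
    max_latency_s: Optional[float] = None  # None = batch job with no hard limit

    def __post_init__(self) -> None:
        # Fail early if the precision/recall trade-off was never decided.
        if self.costlier_error not in ("false_positive", "false_negative"):
            raise ValueError(f"unknown costlier_error: {self.costlier_error!r}")


# Example: a near-real-time alerting task where a missed event is the worse error.
spec = TaskSpec(
    name="loading-dock-alerts",
    output_type="alert",
    costlier_error="false_negative",
    signals=("motion",),
    max_latency_s=5.0,
)
print(spec)
```

Writing the spec down first makes it harder to change the definition of success after the results come in.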

Build the dataset around the production decision

The dataset is the centre of the eval.

If the dataset does not reflect production, the results will not help much. Public benchmarks are fine for sanity checks, but they do not answer the question most teams actually care about: will this work on our data?

That means your evaluation set should include:

  • Normal cases
  • Hard cases
  • Near-miss negatives
  • Boring stretches
  • Failure modes you already know about

For example, surveillance data should include occlusion, low light, motion blur, empty scenes, and crowded scenes. Meeting data should include crosstalk, screen shares, poor audio, quiet speakers, and long static sections. Retrieval tasks should include semantically similar wrong answers, not just obvious misses.

Do not build the set around what is easiest to label. Build it around the product decision you need to make.
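
A quick coverage check can catch a lopsided evaluation set before any model is run. The following is a minimal sketch, assuming each evaluation item is a plain dict tagged with a `slice` label; the slice names mirror the list above, and the item layout and sample rows are made up for illustration.

```python
from collections import Counter

# Slice names follow the list above; the exact labels are an assumption.
REQUIRED_SLICES = {"normal", "hard", "near_miss_negative", "boring", "known_failure"}

# Made-up items, only to show the shape of a manifest.
eval_items = [
    {"video_id": "vid-001", "start": 12.0, "end": 17.0, "slice": "normal"},
    {"video_id": "vid-001", "start": 40.0, "end": 45.0, "slice": "boring"},
    {"video_id": "vid-002", "start": 3.0, "end": 8.0, "slice": "hard"},
    {"video_id": "vid-003", "start": 0.0, "end": 5.0, "slice": "near_miss_negative"},
]


def slice_coverage(items):
    """Return (items per slice, required slices that have no items yet)."""
    counts = Counter(item["slice"] for item in items)
    missing = REQUIRED_SLICES - set(counts)
    return counts, missing


counts, missing = slice_coverage(eval_items)
print(dict(counts))
print("missing slices:", sorted(missing))  # -> missing slices: ['known_failure']
```

Keeping the slice label on every item also makes it possible to report scores per slice later instead of only one overall number.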

Define what accuracy means for the task

For retrieval, the real question is usually whether the right moment appears in the results, how high it ranks, and whether similar-but-wrong clips stay out.
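
Those three retrieval questions map onto three small metrics. This is a sketch under the assumption that results and labels are lists of clip IDs; the function names and the sample IDs are illustrative.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant clips that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant clip, or 0.0 if none was returned."""
    relevant = set(relevant_ids)
    for rank, clip_id in enumerate(ranked_ids, start=1):
        if clip_id in relevant:
            return 1.0 / rank
    return 0.0


def distractors_in_top_k(ranked_ids, near_miss_ids, k):
    """Number of similar-but-wrong clips that leaked into the top k."""
    return len(set(ranked_ids[:k]) & set(near_miss_ids))


ranked = ["c7", "c2", "c9", "c4"]   # system output, best first
relevant = ["c2", "c4"]             # labeled correct moments
near_miss = ["c7"]                  # labeled similar-but-wrong clips

print(recall_at_k(ranked, relevant, k=3))            # -> 0.5
print(reciprocal_rank(ranked, relevant))             # -> 0.5
print(distractors_in_top_k(ranked, near_miss, k=3))  # -> 1
```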

For alerting, the question is usually whether the alert stream is usable. A detector that catches every event but fires alerts constantly may still be the wrong system.
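
One way to make "usable" measurable is to report alert volume next to precision and recall. The sketch below assumes alerts and labeled events are timestamps in seconds and counts an alert as correct when it lands within a tolerance of an unmatched event; the matching rule, the tolerance, and the sample numbers are assumptions.

```python
def alert_stream_stats(alerts, events, video_hours, tolerance_s=5.0):
    """Greedily match alerts to events in time order; each event matches at most once."""
    unmatched_events = sorted(events)
    true_positives = 0
    for t in sorted(alerts):
        hit = next((e for e in unmatched_events if abs(e - t) <= tolerance_s), None)
        if hit is not None:
            unmatched_events.remove(hit)
            true_positives += 1
    return {
        "precision": true_positives / len(alerts) if alerts else 0.0,
        "recall": true_positives / len(events) if events else 0.0,
        "alerts_per_hour": len(alerts) / video_hours,
        "false_alerts_per_hour": (len(alerts) - true_positives) / video_hours,
    }


# Made-up example: every event is caught, but half of the alerts are noise.
stats = alert_stream_stats(
    alerts=[10.0, 12.0, 300.0, 905.0],
    events=[11.0, 900.0],
    video_hours=0.5,
)
print(stats)
```

Here recall is 1.0 while precision is 0.5 and the stream produces 4 false alerts per hour of video, which is the kind of result a single accuracy number would hide.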

For summarization, the useful question is whether the summary is factually correct, covers the important events, and avoids inventing things.

For metadata extraction, it is often better to score field by field. If you need location, action, visible_text, and object_count, score those separately.
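
Field-level scoring can be as simple as per-field accuracy over paired outputs and labels. This sketch assumes exact-match comparison, which suits counts and enums better than free text; the helper name and sample records are illustrative.

```python
FIELDS = ("location", "action", "visible_text", "object_count")


def field_accuracy(predictions, references, fields=FIELDS):
    """Exact-match accuracy per field across paired prediction/reference dicts."""
    hits = {f: 0 for f in fields}
    for pred, ref in zip(predictions, references):
        for f in fields:
            # A missing field in the prediction counts as a miss.
            if f in pred and pred[f] == ref[f]:
                hits[f] += 1
    n = len(references)
    return {f: (hits[f] / n if n else 0.0) for f in fields}


predictions = [
    {"location": "warehouse", "action": "unloading", "visible_text": "BAY 4", "object_count": 3},
    {"location": "warehouse", "action": "walking", "visible_text": "", "object_count": 2},
]
references = [
    {"location": "warehouse", "action": "unloading", "visible_text": "BAY 4", "object_count": 2},
    {"location": "warehouse", "action": "walking", "visible_text": "EXIT", "object_count": 2},
]

print(field_accuracy(predictions, references))
# -> {'location': 1.0, 'action': 1.0, 'visible_text': 0.5, 'object_count': 0.5}
```

A single averaged score here would be 0.75 and would not show that the weak fields are text reading and counting.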

This is also where precision and recall become product choices instead of academic terms. Decide early whether missed events or false alarms are more expensive for the use case.
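
If a single number is needed, the F-beta score is one standard way to encode that choice: beta above 1 weights recall more heavily (missed events are costlier), and beta below 1 weights precision more heavily (false alarms are costlier). The snippet is a plain implementation of the standard formula; the input values are made up.

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# The same detector (precision 0.5, recall 1.0) under two product priorities.
print(round(f_beta(0.5, 1.0, beta=2.0), 3))  # missed events are worse
print(round(f_beta(0.5, 1.0, beta=0.5), 3))  # false alarms are worse
```

The same detector scores noticeably higher under beta = 2 than under beta = 0.5, so the value of beta should be fixed before runs are compared.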

Compare setups, not just model names

Once the task and dataset are defined, compare complete configurations.

For video use cases, the main knobs are usually:

  • Segmentation strategy
  • Frame count
  • Interval length
  • Prompt
  • Model family
  • Reasoning or thinking settings
  • Resolution and preprocessing
  • Downstream validation logic

VideoDB already exposes several of these directly. Its extract_scenes() method supports shot-based and time-based extraction, and its index_scenes() method supports custom scene indexes.

That is why the right benchmark unit is not "model A vs model B." It is "configuration A vs configuration B."
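
In practice this means enumerating configurations explicitly instead of looping over model names. Below is a minimal sketch, assuming a hypothetical `EvalConfig` record; the field names, model names, and prompt ID are placeholders.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class EvalConfig:
    """One complete setup to benchmark; field names are hypothetical."""
    extraction_type: str  # "shot" or "time"
    frame_count: int
    model_name: str
    prompt_id: str

    @property
    def label(self) -> str:
        # Stable label so every trace and score can be grouped by configuration.
        return f"{self.extraction_type}-f{self.frame_count}-{self.model_name}-{self.prompt_id}"


configs = [
    EvalConfig(extraction, frames, model, prompt)
    for extraction, frames, model, prompt in product(
        ("shot", "time"),
        (3, 10),
        ("model-small", "model-large"),  # placeholder model names
        ("prompt-v1",),
    )
]

print(len(configs))       # -> 8
print(configs[0].label)   # -> shot-f3-model-small-prompt-v1
```

The grid grows multiplicatively, so a first run usually varies only two or three knobs and holds the rest fixed.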

Build the evaluation stack

We use VideoDB and Langfuse to run and track the evaluation workflow.

  • VideoDB for ingest, segmentation, frame extraction, indexing, playback evidence, and running VideoDB-hosted models on scenes
  • Langfuse for traces, datasets, experiment runs, and later analysis

Start with VideoDB

The first job is to turn raw video into something benchmarkable.

That means uploading the asset, extracting scenes, choosing frame sampling, and optionally creating a baseline scene index. VideoDB's quickstart and scene methods support all of that directly.

import os
import videodb

conn = videodb.connect(api_key=os.environ["VIDEO_DB_API_KEY"])
coll = conn.get_collection()
video = coll.upload(url="https://example.com/sample-video.mp4")

If you want natural scene boundaries, use shot-based extraction. If you want fixed windows for benchmarking, use time-based extraction. VideoDB's docs show both patterns.

Shot-based extraction

from videodb import SceneExtractionType

scene_collection = video.extract_scenes(
    extraction_type=SceneExtractionType.shot_based,
    extraction_config={
        "threshold": 30,      # Sensitivity (lower = more sensitive)
        "frame_count": 10     # Frames per detected shot
    }
)

for scene in scene_collection.scenes:
    print(scene.id, scene.start, scene.end)
    for frame in scene.frames:
        print(frame.url)

Time-based extraction

from videodb import SceneExtractionType

scene_collection = video.extract_scenes(
    extraction_type=SceneExtractionType.time_based,
    extraction_config={
        "time": 5,
        "frame_count": 3
    }
)

for scene in scene_collection.scenes:
    print(scene.id, scene.start, scene.end)
    for frame in scene.frames:
        print(frame.url)

At this point, you have the video units the benchmark will run on: scene boundaries with sampled frames.

These sampled frames are what the VideoDB-hosted model uses when you call describe on a scene. After the model returns descriptions, labels, or other structured metadata, you can use VideoDB's index_scenes() method to turn that output into searchable indexes.

With that, the media side of the workflow is set up. The next step is to run the VLM over these scenes.

Make the first request

The easiest path is to call describe directly on a VideoDB scene. That keeps the benchmark easy to reason about: the extraction step is explicit, the input is inspectable, and every output can still be tied back to the exact scene and sampled frames that produced it.

A minimal first request looks like this:

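# Take the first extracted scene as the unit for this request.
scene = scene_collection.scenes[0]

description = scene.describe(
    model_name="google/gemma-4-31B-it",
    # Same prompt as the tracing example later in this post; replace it with your task prompt.
    prompt=(
        "Describe the scene, key actions, and any visible text. "
        "Keep the description grounded in what is visible in the sampled frames."
    ),
)

print(description)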

This keeps model execution inside VideoDB while still letting you control the model and prompt used for each scene.

Trace and compare with Langfuse

Once the execution layer is in place, the next job is to make the runs inspectable and reproducible.

In this workflow, Langfuse is the observability layer. It helps us trace each evaluation item, attach metadata, compare outputs, define metrics, and preserve enough context to understand why a result was good or bad.

This matters because a benchmark is not only about producing a score. It is also about being able to answer questions like:

  • What exact input produced this output?
  • Which configuration generated the result?
  • How was it scored?
  • What changed between two runs?

A useful trace for offline evaluation usually includes:

  • Video ID
  • Scene start and end timestamps
  • Frame URLs
  • Extraction config
  • Prompt
  • Model name
  • Output
  • Scores

That way, every result stays tied back to the exact media evidence and configuration that produced it.

Define the right metrics before you compare runs

Not every task should be judged the same way, and not every score should be reduced to one overall number. A good evaluation pipeline should score the task in a way that reflects the actual product decision.

The important thing is to define those metrics early and keep them stable while comparing runs.

A simple trace structure is:

  • One root span per evaluation item
  • One child span for the model call
  • Final output and metadata on the root

A simple example looks like this:

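from time import perf_counter
from dotenv import load_dotenv
from langfuse import get_client

# Load the Langfuse credentials from a local .env file.
load_dotenv()

langfuse = get_client()

MODEL_NAME = "google/gemma-4-31B-it"
PROMPT = (
    "Describe the scene, key actions, and any visible text. "
    "Keep the description grounded in what is visible in the sampled frames."
)


def evaluate_scene_output(output_text):
    # Placeholder scorer (hypothetical): replace it with the metric defined for your task.
    # Here it only checks that the model returned a non-empty description.
    return 1.0 if output_text and output_text.strip() else 0.0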

for scene in scene_collection.scenes:
    frame_urls = [frame.url for frame in scene.frames]

    with langfuse.start_as_current_observation(
        as_type="span",
        name="video-evaluation",
        input={
            "video_id": video.id,
            "scene_id": scene.id,
            "scene_start": scene.start,
            "scene_end": scene.end,
            "frame_urls": frame_urls,
        },
        metadata={
            "model_name": MODEL_NAME,
            "prompt": PROMPT,
        },
    ) as root:
        with langfuse.start_as_current_observation(
            as_type="generation",
            name="scene-describe",
            model=MODEL_NAME,
            input={
                "scene_id": scene.id,
                "frame_urls": frame_urls,
                "prompt": PROMPT,
            },
        ) as generation:
            start = perf_counter()

            output_text = scene.describe(
                model_name=MODEL_NAME,
                prompt=PROMPT,
            )

            latency_ms = round((perf_counter() - start) * 1000, 4)
            score = evaluate_scene_output(output_text)

            generation.update(
                output=output_text,
                metadata={
                    "latency_ms": latency_ms,
                    "score": score,
                    "model_name": MODEL_NAME,
                },
            )

        root.update(
            output={
                "result": output_text,
                "latency_ms": latency_ms,
                "score": score,
            }
        )

langfuse.flush()

At this point, the trace contains the full context for each evaluation item: input, output, latency, and score. That makes the evaluation observable end to end.
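
Once each item carries a score and a latency, comparing configurations is a grouping step. The sketch below assumes the per-scene results have been collected into plain dicts with `config`, `score`, and `latency_ms` keys (the `config` label is an assumption) and uses made-up numbers; it does not call Langfuse.

```python
import math
from collections import defaultdict
from statistics import mean


def summarize(records):
    """Group per-scene records by configuration; report mean score and p95 latency."""
    by_config = defaultdict(list)
    for record in records:
        by_config[record["config"]].append(record)

    summary = {}
    for config, rows in by_config.items():
        latencies = sorted(row["latency_ms"] for row in rows)
        # Nearest-rank 95th percentile.
        p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
        summary[config] = {
            "n": len(rows),
            "mean_score": mean(row["score"] for row in rows),
            "p95_latency_ms": latencies[p95_index],
        }
    return summary


records = [
    {"config": "time-f3", "score": 1.0, "latency_ms": 800.0},
    {"config": "time-f3", "score": 0.0, "latency_ms": 1200.0},
    {"config": "time-f3", "score": 1.0, "latency_ms": 950.0},
    {"config": "shot-f10", "score": 1.0, "latency_ms": 2100.0},
    {"config": "shot-f10", "score": 1.0, "latency_ms": 2500.0},
]

for config, stats in summarize(records).items():
    print(config, stats)
```

Reporting a tail latency next to the mean score keeps slow outliers visible, which matters when the task has a latency limit.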

Use the output to make a decision

The output of this workflow should not be "model X won."

It should help you answer practical questions:

  • Which configuration becomes the default path?
  • Which lighter setup is good enough for easier cases?
  • Which stronger setup should be reserved for harder slices?
  • Where does the current system still fail?
  • Should the next change be in the prompt, the extraction strategy, the thresholds, or the model itself?

That is the real purpose of the benchmark. It is not to produce a leaderboard. It is to help you decide what to deploy and what to improve next.
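
The first of those questions can be turned into a simple rule: among the configurations that clear the quality bar, take the cheapest. This is a sketch under assumed inputs; the `relative_cost` field, the threshold, and all numbers are invented for illustration.

```python
def choose_default(summary, min_score):
    """Return the cheapest configuration whose mean score clears the bar, or None."""
    eligible = {
        name: stats for name, stats in summary.items()
        if stats["mean_score"] >= min_score
    }
    if not eligible:
        return None
    return min(eligible, key=lambda name: eligible[name]["relative_cost"])


# Made-up per-configuration results.
summary = {
    "time-f3-small": {"mean_score": 0.78, "relative_cost": 1.0},
    "shot-f10-small": {"mean_score": 0.86, "relative_cost": 2.5},
    "shot-f10-large": {"mean_score": 0.91, "relative_cost": 8.0},
}

print(choose_default(summary, min_score=0.85))  # -> shot-f10-small
print(choose_default(summary, min_score=0.95))  # -> None
```

A `None` result is itself useful: it says no tested configuration is good enough yet, so the next change belongs in the setup rather than in the selection rule.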

Let the evaluation compound over time

A good evaluation run should not disappear after you make the first decision.

Over time, the traces, scores, and reviewed outputs start to become a high-quality dataset of real examples from your own domain. That makes future evaluations easier, helps catch regressions earlier, and gives you a stronger base for prompt iteration, dataset expansion, or even training and adapting a smaller model on your own custom data if that becomes the right next step.
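
Catching regressions can start as a per-slice comparison between a stored baseline run and a new run. The following sketch assumes each run has been reduced to a mapping from slice name to score; the slice names, tolerance, and scores are made up.

```python
def find_regressions(baseline, candidate, tolerance=0.02):
    """Return slices where the candidate's score dropped by more than `tolerance`."""
    regressions = {}
    for slice_name, base_score in baseline.items():
        new_score = candidate.get(slice_name)
        if new_score is None:
            continue  # slice was not evaluated in the new run
        drop = base_score - new_score
        if drop > tolerance:
            regressions[slice_name] = round(drop, 4)
    return regressions


baseline = {"normal": 0.92, "low_light": 0.71, "crowded": 0.80}
candidate = {"normal": 0.93, "low_light": 0.62, "crowded": 0.79}

print(find_regressions(baseline, candidate))  # -> {'low_light': 0.09}
```

In this example the overall average barely moves, but the low-light slice drops by 0.09, which is why the comparison is done per slice.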

Run it on your own data

We have open-sourced the pipeline behind this workflow so you can run the same process on your own videos, define your own metrics, swap in your own models, and compare configurations without rebuilding the stack from scratch. You can find the repo here: benchmark-vlms.

A good first run is usually small and deliberate:

  • Pick one use case
  • Build a representative evaluation set
  • Define the metric that matters for that task
  • Compare a few meaningful configurations
  • Review the traces and failure cases
  • Choose a default path and, if required, a fallback strategy

Make the benchmark useful

If your goal is the best possible quality, start with a stronger baseline and optimize down later. Compare model choice, frame count, extraction interval, and resolution first, since those usually have a bigger impact on output quality than smaller model-running tweaks.

If your goal is lower latency, look first at lighter models, shorter context, fewer frames, lower resolution where acceptable, and model-side optimizations like batching and caching.

If your goal is lower cost, test the same task with smaller models, quantized models, fewer frames, longer sampling intervals, and caching.

A practical way to think about it is:

  • If accuracy or quality is the problem, start with the parts of the system that affect how much signal the model actually sees. In video workflows, that usually means segmentation strategy, sampling density, number of frames per segment, and resolution. If the benchmark is missing short actions, quick scene changes, or small visual details, the fix may not be a different model right away. It may be denser sampling, more frames, or better scene boundaries. If those changes do not move the result enough, then it makes sense to compare stronger models or more reasoning-heavy settings.
  • If latency is the problem, reduce the amount of work each request has to do. That usually means sending fewer frames, shortening context, lowering resolution where acceptable, or moving to a smaller or faster model. It can also mean tightening the model-running setup so requests are handled more efficiently.
  • If cost is the problem, look for the parts of the pipeline that are easiest to simplify without breaking quality. That can mean fewer frames, longer extraction intervals, lower resolution, smaller models, quantized models, or caching repeated prompt and context patterns. It can also mean introducing a lighter default path for common or easy cases, and reserving the more expensive configuration only for the slices that actually need it.
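
A rough frame budget makes these trade-offs concrete before anything is run. The sketch below assumes time-based extraction with one fixed-length segment per interval and `frame_count` frames per segment, matching the `time` and `frame_count` settings shown earlier; it ignores any per-model pricing.

```python
import math


def frames_per_video_hour(interval_s, frame_count):
    """Frames sent to the model per hour of video under time-based extraction."""
    segments = math.ceil(3600 / interval_s)
    return segments * frame_count


dense = frames_per_video_hour(interval_s=5, frame_count=3)    # settings from the earlier example
light = frames_per_video_hour(interval_s=10, frame_count=2)   # a hypothetical lighter setup

print(dense)          # -> 2160
print(light)          # -> 720
print(dense / light)  # -> 3.0
```

Doubling the interval and dropping one frame per segment cuts the frame volume by a factor of three, so the useful question for the benchmark is how much quality that lighter setup gives up on the slices that matter.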

By the end of a run, you should know which configuration becomes the default path, which setup gives you the best quality, which lighter configuration is acceptable when latency or cost matters more, and where the current system still fails. That is the output that matters.