Welcome to the first issue of The VideoDB Dispatch: our notes on video AI, multimodal models, agents, and what we are building at VideoDB.

This issue is about a shift we keep running into while building with agents:

Agents will show their work through video.


The main signal: agents will show their work through video

The next important output from an agent may not be an answer. It may be a video.

A video artifact that carries the work: what the agent saw, tried, broke, fixed, and verified.

As agents move from chat boxes into browsers, terminals, editors, remote desktops, dashboards, and live workflows, more of their work becomes visual. A final text response can say “done.” Logs can show commands. But neither captures the full shape of a run: loading states, wrong turns, failed installs, visual errors, hidden popups, confusing docs, or the moment something finally works.

That is why video is the native artifact of agent work.

We have this hooked up to our OpenClaw agent in Slack. Instead of cloning a repo and wrestling with setup, the team just tags the agent with a repo URL and watches a video of the repo running.

Tagging our OpenClaw agent in Slack to see run-llama/liteparse run without setting it up locally

At first, this looks like monitoring. Record the run, stream it live, replay it later, and search for the moment something happened.

Then it becomes proof. The agent does not just claim it completed the task; it can show the path it took.

Then it becomes an interface. Ask an agent to try a GitHub repo, and it returns a first-person walkthrough. Ask it to research a topic, and it returns a sourced video briefing.

We think this is going to show up everywhere: debugging agent runs, QA, reviewing repos, and turning research into video briefings.

Here are three experiments from our side:

  • Agent monitoring: recording and indexing computer-use sessions so runs can be watched live, replayed later, and searched by what happened on screen (see the sketch after this list). GitHub repo
  • Repo walkthroughs: giving an agent a GitHub repo and a goal, then asking it to set up, run, record, and narrate the experience in first person. Sample walkthrough. We are working on a Twitter/X bot where you can tag any repo you want to see running and get back a video walkthrough. Stay tuned.
  • Generated video briefings: giving an agent a topic and getting back a sourced video assembled from research, real assets, narration, and review. GitHub repo · Samples: US-Iran war news digest, financial market report
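
To make the first experiment concrete, here is a minimal sketch of the index-and-search step, assuming the shape of the VideoDB Python SDK; method names, enums, and parameters here are illustrative and may differ from the repo:

    from videodb import connect, IndexType, SearchType

    # Connect and pick up the collection that holds recorded agent sessions.
    conn = connect(api_key="YOUR_VIDEODB_API_KEY")
    coll = conn.get_collection()

    # Upload a finished screen recording of an agent run.
    session = coll.upload(url="https://example.com/agent-run-recording.mp4")

    # Index what happened on screen (and any narration) so the run becomes searchable.
    session.index_scenes(prompt="Describe commands, errors, popups, and UI state")
    session.index_spoken_words()

    # Jump straight to the moment something broke.
    results = session.search(
        query="install fails with a missing dependency",
        index_type=IndexType.scene,
        search_type=SearchType.semantic,
    )
    print(results.get_shots())  # matching segments with timestamps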

Model watch

A few model launches and research directions we have been watching.

ChatGPT Images 2.0: generated visuals become production assets

OpenAI’s image update showcases better precision and control, stronger typography, multilingual text rendering, flexible aspect ratios, and more polished layout-heavy outputs: posters, magazine spreads, infographics, brochures, comic pages, product grids, and educational diagrams.

That matters because image generation is moving beyond one-off pictures. When models can reliably produce readable text, structured layouts, and consistent visual styles, they become useful as asset generators.

For video agents, this is a practical unlock. Not every scene needs full video generation. Many useful videos are assembled from title cards, diagrams, screenshots, charts, overlays, thumbnails, generated stills, clips, narration, and music. Better image models make it easier for an agent to create the missing visual layer, then compose it into a watchable video.
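
As a rough sketch of that composition step, here is what assembling clips, generated stills, and narration into one watchable stream could look like, assuming VideoDB's Timeline and asset primitives; the asset IDs and parameters are placeholders:

    from videodb import connect
    from videodb.timeline import Timeline
    from videodb.asset import AudioAsset, ImageAsset, VideoAsset

    conn = connect(api_key="YOUR_VIDEODB_API_KEY")
    timeline = Timeline(conn)

    # Inline track: clips play back to back.
    timeline.add_inline(VideoAsset(asset_id="vid-intro-clip", start=0, end=5))
    timeline.add_inline(VideoAsset(asset_id="vid-demo-clip", start=0, end=30))

    # Overlays sit on top of the inline track at a given second:
    # a generated title card at 0s, a chart at 12s, narration across the piece.
    timeline.add_overlay(0, ImageAsset(asset_id="img-title-card", duration=4))
    timeline.add_overlay(12, ImageAsset(asset_id="img-metrics-chart", duration=8))
    timeline.add_overlay(0, AudioAsset(asset_id="aud-narration"))

    # Render a single streamable video from the assembled pieces.
    stream_url = timeline.generate_stream()
    print(stream_url)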

Reference: OpenAI announcement


We published a paper on Gemini thinking for video understanding

The paper tackles a practical question in video understanding:

Does giving a vision-language model more “thinking” actually improve scene understanding?

In our benchmark of Gemini vision-language models on scenes extracted from 100 hours of video, we looked at how internal reasoning traces, which we call thought streams, affect the final scene outputs.

Key findings:

  • More thinking helps, but gains plateau quickly in this setup.
  • Most quality improvement happens in the first few hundred thought tokens; beyond about 700 tokens, extra thinking adds cost with smaller gains.
  • Flash Lite 1024 was the quality leader in the benchmark while using fewer thought tokens than Flash Dynamic.
  • Tight reasoning budgets increase compression-step hallucination: the final output more often includes details that were not explicitly present in the observable thought stream.
  • Flash and Flash Lite think about many of the same things: cross-tier thought-stream similarity is nearly as high as the consistency of a single model across repeated runs.
  • Flash Lite was more token-efficient, spending less of its budget on process narration and more on scene content.

Production video understanding is not only about model quality. It is also about cost, efficiency, and trust. If a model uses substantially more thinking tokens but gives only a small quality bump, that changes how you design indexing pipelines.
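
In an indexing pipeline, that trade-off becomes a budget knob. Here is a minimal sketch with the google-genai Python SDK, assuming a Gemini 2.5 Flash-class model that accepts a thinking budget; the 700-token cap echoes the plateau we observed, and you would tune it for your own setup:

    from google import genai
    from google.genai import types

    client = genai.Client()  # reads GEMINI_API_KEY from the environment

    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[
            "Describe the key events in this scene.",
            # ...the scene's frames or video part would be passed here...
        ],
        config=types.GenerateContentConfig(
            # Cap internal reasoning: in our benchmark, most of the quality
            # arrived in the first few hundred thought tokens.
            thinking_config=types.ThinkingConfig(thinking_budget=700),
        ),
    )
    print(response.text)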

Read the full paper: Thought streams for video scene understanding


Things we liked

Flipbook: a generative visual internet

Flipbook is a delightful experiment: an infinite visual browser where every page is generated on demand as an image. Click anywhere to explore that part of the image in more depth.

It feels like a preview of a more visual web, especially in its experimental live video stream mode, where generated images, interactive browsing, and video streams start blending together.


What’s new in VideoDB

Organization management for VideoDB Console

We shipped Organization Management in VideoDB Console for Pro users.

You can now invite team members into your organization so they can access shared VideoDB assets in VideoDB Console and manage API keys together.


Build idea: create a personalized video agent

If you want to try this direction, start with Agentic Streams. We have been experimenting with agents for news digests, market reports, and topic-based video briefings.

The interesting part is personalization. Instead of a generic “make me a video” prompt, give the agent a beat, sources, format, and taste:

“Make me a 3-minute weekly video on open-source AI launches. Use GitHub trending, Hacker News, X posts, and YouTube demos as sources. Keep the tone technical and concise. Show links, screenshots, and clips as proof.”

A useful video agent should know:

  • Sources: which YouTube channels, websites, repos, newsletters, papers, or social accounts to follow
  • Evidence: what counts as proof (clips, screenshots, charts, commits, posts, citations)
  • Format: duration, sections, pacing, narration style, and visual density
  • Taste: what to skip, what to emphasize, and what “high signal” means for you

That turns the agent from a generic content generator into a personalized video researcher: it gathers evidence, assembles the story, and returns something you can watch instead of another long feed to scroll.
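
To make that brief concrete, here is one way to encode it before handing it to an agent. This is plain illustrative Python: the VideoBrief structure and the make_briefing entry point are hypothetical stand-ins for whatever your agent (an Agentic Streams workflow, for example) actually accepts.

    from dataclasses import dataclass, field

    @dataclass
    class VideoBrief:
        """A personalized brief for a video research agent (illustrative structure)."""
        topic: str
        sources: list[str]      # channels, repos, newsletters, accounts to follow
        evidence: list[str]     # what counts as proof: clips, screenshots, commits, citations
        duration_minutes: int = 3
        tone: str = "technical and concise"
        skip: list[str] = field(default_factory=list)  # taste: what to leave out

    weekly_oss = VideoBrief(
        topic="open-source AI launches this week",
        sources=["GitHub trending", "Hacker News", "X", "YouTube demos"],
        evidence=["links", "screenshots", "clips"],
        skip=["announcements with no runnable demo"],
    )

    # Hypothetical entry point: hand the brief to your video agent.
    # video_url = make_briefing(weekly_oss)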


Explore, ship, collaborate

Two updates for builders who want to explore, ship, or collaborate with VideoDB.

VideoDB for Developers

Our developer page is the best place to connect with the VideoDB builder ecosystem, follow upcoming events, and explore ways to collaborate with us.

Explore it here: VideoDB for Developers

Growth Forge

We also launched Growth Forge: a 14-day sprint for five builders to build a growth agent for our agents.

The idea is simple: growth is becoming less like campaigns and more like loops. We want to work with builders who can design, ship, and prove an agentic growth engine.

Apply here: Growth Forge


Thanks for reading. If you end up building something with agent video, reply and send it our way. We would love to see it.