The problem
OpenClaw is a framework for running autonomous agents. You ask it to do something (book a flight, dig through a repo, file a PR, fill out a form) and it goes off and does it. What you get back is the result. Sometimes a "done" message. Sometimes a failure. Almost never the bit in between.
That bit in between is what bites you.
When an agent quietly opens the wrong tab, fills the wrong field, gets stuck on a captcha for forty seconds, or accepts a cookie banner you did not want it to accept, none of that shows up in logs. Logs tell you which tools the agent called. Traces tell you how long each call took. Neither one tells you what was on the screen when the agent decided to click. If the agent fails, you find out from the missing outcome, not from a record you can rewind.
This stops being a curiosity and starts being a real operational problem the moment you are running agents on remote machines, against real accounts, with real consequences. "It worked on my laptop" does not apply when the laptop is a headless EC2 box and the agent has been running for six hours unattended.
The thing that is missing is not more logging. It is a recording.
How VideoDB solves it
VideoDB has a primitive called Capture. It treats a desktop the same way it treats any other continuous video stream: ingest it, hold it as an RTStream, and then run the rest of the platform on top of it.
That last part is the interesting bit. Once a screen and its audio are RTStreams, every other VideoDB primitive applies to them with no special handling:
- index_visuals builds a searchable scene index over what is on screen.
- start_transcript runs live speech-to-text on the audio.
- index_audio summarizes audio into queryable segments.
- search() runs natural-language queries against the index and returns timestamped clips.
- generate_stream(start, end) mints a playable URL for any time range.
A desktop does not need a custom pipeline. It is a pipeline.
For an OpenClaw agent, that means the agent's screen becomes the same kind of thing as a video file or an RTSP camera feed: indexable, searchable, clippable, all through the same SDK calls you would make against any other media. The agent itself stays untouched. No instrumentation, no event hooks, no tool-call wrappers.
How we used it: the skill
We packaged this as an OpenClaw skill called videodb-monitoring. A skill, in OpenClaw, is a folder OpenClaw loads at startup. It contains a SKILL.md, which is the playbook the agent reads, plus whatever scripts the playbook tells it to run.
This skill ships exactly two scripts:
- monitor.ts, a long-running daemon that opens a Capture session and streams the screen.
- videodb.ts, a one-shot CLI the agent calls to ask questions about what has been recorded.
The agent decides when to invoke either of them. The user never sees monitor.ts or videodb.ts. They just chat with the agent.
Architecture
    User chats with the OpenClaw agent
    "Open example.com and send me the recording"
                    |
                    v
    OpenClaw agent loop reads SKILL.md
                    |
         +----------+------------------+
         |                             |
         v                             v
    monitor.ts                    videodb.ts
    background daemon             one-shot CLI
    runs until killed             now, stream, search, summary, transcript
         |                             |
         | CaptureClient               | connect(apiKey)
         | RTSP screen + audio         | getCaptureSession
         v                             v
                  VideoDB Cloud
                        |
                        +--> CaptureSession
                        |         |
                        |         v
                        |     RTStreams
                        |       - screen
                        |       - system_audio
                        |
                        +--> Visual index
                        +--> Transcript and audio index
                        |
                        v
         search, stream, scenes, getTranscript
                        |
                        v
    Agent reply with clip URL, search results, summary, or transcript segments
Shared state lives in ~/.openclaw/openclaw.json:
- isRunning
- captureSessionId
- monitorPid
- env.VIDEODB_API_KEY
monitor.ts and videodb.ts never talk to each other directly. They share state through ~/.openclaw/openclaw.json, the same JSON file OpenClaw uses for all skill configuration. That is how the CLI knows which capture session to query.
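As a rough illustration of that shared state, here is a minimal sketch that works on an already-parsed config object. The field names come from the list above. The nesting under skills.entries.videodb-monitoring, the reset values, and the helper names readSkillState, withSkillState, and clearedSkillState are assumptions made for the example, not the skill's actual code.

```typescript
// Hypothetical shape of the skill's slice of ~/.openclaw/openclaw.json.
// Field names come from the article; the exact nesting is an assumption.
interface SkillEntry {
  enabled?: boolean;
  isRunning?: boolean;
  captureSessionId?: string | null;
  monitorPid?: number | null;
  env?: { VIDEODB_API_KEY?: string };
}

interface OpenClawConfig {
  skills?: { entries?: Record<string, SkillEntry> };
}

const SKILL_NAME = "videodb-monitoring";

// Read the skill's state from a parsed config; empty object if absent.
function readSkillState(config: OpenClawConfig): SkillEntry {
  return config.skills?.entries?.[SKILL_NAME] ?? {};
}

// Return a copy of the config with some skill fields updated, leaving
// every other skill and setting in the file untouched.
function withSkillState(config: OpenClawConfig, patch: Partial<SkillEntry>): OpenClawConfig {
  const entries = config.skills?.entries ?? {};
  return {
    ...config,
    skills: {
      ...config.skills,
      entries: { ...entries, [SKILL_NAME]: { ...entries[SKILL_NAME], ...patch } },
    },
  };
}

// Reset the three runtime fields, as the monitor does on shutdown.
// The reset values (false / null / null) are an assumption.
function clearedSkillState(config: OpenClawConfig): OpenClawConfig {
  return withSkillState(config, { isRunning: false, captureSessionId: null, monitorPid: null });
}
```

In this sketch the monitor would write `withSkillState(config, { isRunning: true, monitorPid: process.pid })` back to disk, and the CLI would call `readSkillState(config).captureSessionId` to find the session to query. Keeping the helpers pure leaves the file I/O in one place.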
What runs when
There are three time scales at play.
Once, at install. You copy the skill into ~/.openclaw/workspace/skills/videodb-monitoring/, run npm install, set the API key, and restart the OpenClaw gateway. After that the agent can see the skill in its skill list.
Once per first use. The first time the user asks the agent for a recording, the agent reads SKILL.md, sees that VIDEODB_IS_RUNNING is false, and starts monitor.ts in the background with nohup. From that moment on, screen and audio are streaming to VideoDB continuously. The monitor does not stop on its own. It runs until you kill it.
On every agent turn that needs the recording. The agent invokes videodb.ts with a subcommand: now, stream, search, summary, or transcript. Each of these is a short-lived process that connects to VideoDB, asks one question, prints the answer, and exits.
Inside monitor.ts
monitor.ts is the only piece of the skill that is stateful. Everything else is a one-shot CLI. The structure is roughly:
- Resolve the API key. getApiKey() checks process.env.VIDEODB_API_KEY and process.env.VIDEO_DB_API_KEY first, then falls back to skills.entries.videodb-monitoring.env.VIDEODB_API_KEY in ~/.openclaw/openclaw.json. If nothing is set, the monitor exits with a message telling you which key to set.
- Mark itself as running. setSkillConfig("isRunning", true) and setSkillConfig("monitorPid", process.pid) write into the OpenClaw config so other tools and future agent turns can see that capture is live and which PID owns it.
- Create a capture session. createSession(apiKey) calls conn.createCaptureSession({ endUserId: "openclaw-monitor", metadata: { app: "openclaw-monitoring" } }) and conn.generateClientToken(). The session ID is written into config as captureSessionId. The client token is held in memory and used to authorize the capture client.
- Open the local capture client. new CaptureClient({ sessionToken: token }) is the VideoDB SDK's bridge to the OS-level recorder. It needs screen-capture, which is required, and microphone, which is optional. If the user denies microphone access, the monitor logs the failure and continues with screen only.
- List channels and pick the defaults. client.listChannels() returns the displays and audio devices the OS exposes. The monitor takes channels.displays.default and, if permitted, channels.systemAudio.default. Both are selected with store: true, which tells VideoDB to persist the recording rather than just stream it through.
- Start the session. client.startSession({ sessionId, channels }) is what actually flips the screen and audio into RTStreams on the VideoDB side.
- Kick off indexing in the background. Five seconds after the session starts, startIndexing() runs.
- Hold the process open. return new Promise(() => {}) keeps Node alive forever. The monitor is meant to run indefinitely; it only stops on a signal.
- Clean up on shutdown. SIGINT, SIGTERM, uncaughtException, and unhandledRejection all route through clearSkillState(), which resets isRunning, captureSessionId, and monitorPid in config and tells client.stopSession() and client.shutdown() to close the streams cleanly. SIGHUP is deliberately ignored so nohup and disown work as expected.
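The key lookup in the first step can be sketched as a pure function. This is an illustration of the lookup order described above, not the skill's getApiKey() itself: the name resolveApiKey, the Env and ConfigWithSkillEnv types, and the choice to return null rather than exit are all assumptions made so the example stands alone.

```typescript
type Env = Record<string, string | undefined>;

// Only the part of openclaw.json this lookup cares about.
interface ConfigWithSkillEnv {
  skills?: {
    entries?: Record<string, { env?: Record<string, string | undefined> }>;
  };
}

// Resolve the key in the order the article describes: both environment
// variable spellings first, then the skill's env block in the config file.
// Returns null when nothing is set so the caller can print a hint and exit.
function resolveApiKey(env: Env, config: ConfigWithSkillEnv): string | null {
  const fromEnv = env["VIDEODB_API_KEY"] || env["VIDEO_DB_API_KEY"];
  if (fromEnv) return fromEnv;

  const skillEnv = config.skills?.entries?.["videodb-monitoring"]?.env;
  const fromConfig = skillEnv ? skillEnv["VIDEODB_API_KEY"] : undefined;
  return fromConfig || null;
}
```

A caller would pass `process.env` and the parsed config file. Taking both as arguments keeps the precedence rule easy to check without touching the real environment.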
The indexing step does three things:
- screen.indexVisuals({ prompt, batchConfig: { type: "time", value: 5, frameCount: 1 } }): every five seconds, one frame is sampled and described by an LLM using the prompt.
- audio.startTranscript(): live transcription on system audio.
- audio.indexAudio({ prompt: "Summarize the audio content.", batchConfig: { type: "time", value: 30 } }): every 30 seconds, the audio is summarized.
The five-second wait is there to let the streams reach active state before we attach indexes to them.
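To make the shape of the indexing step concrete, here is a minimal sketch written against stand-in interfaces rather than the real SDK types. The method names and option objects mirror the three calls quoted above. The ScreenStream and AudioStream interfaces, the promise return types, and the idea that audio may be absent are assumptions for illustration.

```typescript
// Stand-ins for the two RTStream handles. These mirror the calls quoted
// in the article; they are not the VideoDB SDK's own type definitions.
interface BatchConfig {
  type: "time";
  value: number;        // batch length in seconds
  frameCount?: number;  // frames sampled per batch (visual index only)
}

interface ScreenStream {
  indexVisuals(opts: { prompt: string; batchConfig: BatchConfig }): Promise<unknown>;
}

interface AudioStream {
  startTranscript(): Promise<unknown>;
  indexAudio(opts: { prompt: string; batchConfig: BatchConfig }): Promise<unknown>;
}

// Attach all three indexes. Audio is treated as optional here because the
// monitor can run screen-only when the audio channel is not permitted.
function startIndexing(
  screen: ScreenStream,
  audio: AudioStream | null,
  visualPrompt: string,
): Promise<unknown[]> {
  const jobs: Promise<unknown>[] = [
    // One frame every five seconds, described using the visual prompt.
    screen.indexVisuals({
      prompt: visualPrompt,
      batchConfig: { type: "time", value: 5, frameCount: 1 },
    }),
  ];
  if (audio) {
    // Live transcription, plus a summary of each 30-second batch.
    jobs.push(audio.startTranscript());
    jobs.push(
      audio.indexAudio({
        prompt: "Summarize the audio content.",
        batchConfig: { type: "time", value: 30 },
      }),
    );
  }
  return Promise.all(jobs);
}
```

In the monitor, a call like this would sit behind the five-second delay, for example inside a `setTimeout(..., 5000)` callback. Whether the real script starts the three indexes in parallel or one after another is not stated in the text; the sketch starts them together.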
The default visual prompt is also defined here, and it shapes everything search and summary later returns:
Describe the screen: (1) Active application and current activity.
(2) Browser status - is one open? What URL/page?
(3) Any error dialogs, crashes, or warning messages?
(4) Timestamp if a clock is visible.
You can override it with --visual-prompt "..." when launching monitor.ts.
Inside videodb.ts
videodb.ts is the agent's hands. It has five subcommands. Every one of them follows the same pattern: load config, connect to VideoDB, fetch the capture session, get the relevant RTStream, do one thing, exit.
now prints Math.floor(Date.now() / 1000). That is it. The agent uses this to bracket a recording: capture now before doing the work, capture now again after, then ask for a stream URL between the two timestamps.
stream <start> <end> calls screen.generateStream(startTs, endTs) and returns a playable URL for that time range. The whole "record this and send me the recording" flow is just now, do work, now, stream.
search <query> runs screen.search({ query, resultThreshold: 5 }) over the visual index. The CLI calls getShots() and then await shot.generateStream() on each one, printing the shot's text, timestamp range, search score, and a watchable URL per shot. This is what powers "find when I opened the spreadsheet."
summary [--hours N] pulls the visual scene index for the last N hours, defaulting to 0.5. It calls screen.listSceneIndexes() to find the index, then index.getScenes(start, now, 1, 50) to fetch up to 50 scenes in the window. Each scene is printed as [HH:MM:SS] <description>. This is the agent's "what did I do in the last hour?" answer.
transcript [--hours N] pulls system-audio transcript segments from audio.getTranscript({ start, end: now, pageSize: 100 }) and prints each as [HH:MM:SS] <text>. This is the agent's "what was said in that meeting?" answer.
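The time-window handling shared by summary and transcript is plain arithmetic, and can be sketched without touching the SDK. The helper names parseHours, windowFor, and formatLine are hypothetical. The sketch also assumes timestamps are Unix seconds and prints them in UTC; the text does not say which time zone the real CLI uses.

```typescript
// Parse an optional "--hours N" flag from the argument list. Falls back
// to the supplied default when the flag is missing or not a positive number.
function parseHours(args: string[], fallback: number): number {
  const i = args.indexOf("--hours");
  if (i === -1 || i + 1 >= args.length) return fallback;
  const n = Number(args[i + 1]);
  return Number.isFinite(n) && n > 0 ? n : fallback;
}

// Turn "the last N hours" into a start/end pair of Unix seconds.
function windowFor(hours: number, nowSec: number): { start: number; end: number } {
  return { start: nowSec - Math.round(hours * 3600), end: nowSec };
}

// Render one output line as "[HH:MM:SS] text". UTC is used here so the
// output does not depend on the machine's time zone (an assumption).
function formatLine(tsSec: number, text: string): string {
  const clock = new Date(tsSec * 1000).toISOString().slice(11, 19);
  return `[${clock}] ${text}`;
}
```

With summary's default of 0.5, `windowFor(parseHours(argv, 0.5), now)` covers the last 30 minutes. The resulting start and end would then be handed to index.getScenes or audio.getTranscript.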
The shared loadConfig() helper does two things worth knowing about:
- It reads the API key from the same places monitor.ts does: environment first, then openclaw.json.
- It reads captureSessionId from config, and if isRunning is false, it prints a clear hint about how to start the monitor instead of failing silently.
Both files write to ~/.videodb/logs/: monitor.log for the daemon and skill.log for the CLI. Same format, same place, easy to tail.
What this gets you
Once the monitor is up, the agent has these capabilities, all reachable through videodb.ts:
- Record any time range on demand. now + work + now + stream is the whole pattern. The clip is a real, shareable URL backed by VideoDB's player.
- Find specific moments by description. search "user opened Amazon" returns timestamped clips, ranked by semantic match, with playable URLs.
- Summarize a window of activity. summary --hours 2 returns the last two hours of visual descriptions, timestamped and in chronological order.
- Read back what was said. transcript --hours 1 returns the live transcript for the last hour of system audio.
- Persist everything. Because monitor.ts sets store: true on both channels, the recording survives the session. You can come back tomorrow and search a clip from today.
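The now, work, now, stream pattern from the first bullet can be sketched in a few lines. Only the Math.floor(Date.now() / 1000) expression and the stream <start> <end> argument order come from the text. The helpers nowSeconds, startBracket, and streamArgs are hypothetical, and the injectable clock exists only to make the example testable.

```typescript
// Same value the `now` subcommand prints: whole seconds since the Unix epoch.
function nowSeconds(clock: () => number = Date.now): number {
  return Math.floor(clock() / 1000);
}

interface TimeRange {
  start: number;
  end: number;
}

// Take the first timestamp immediately; take the second when stop() is
// called. Whatever the agent does in between is the recorded work.
function startBracket(clock: () => number = Date.now): { stop(): TimeRange } {
  const start = nowSeconds(clock);
  return {
    stop(): TimeRange {
      return { start, end: nowSeconds(clock) };
    },
  };
}

// Build the argument list for the `stream <start> <end>` subcommand.
function streamArgs(range: TimeRange): string[] {
  return ["stream", String(range.start), String(range.end)];
}
```

An agent-side wrapper would call `startBracket()`, do the task, call `stop()`, and pass `streamArgs(range)` to videodb.ts. Because the timestamps are whole seconds, work that finishes in under a second produces an empty range; how the real CLI handles that case is not covered here.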
Three steps to install it
mkdir -p ~/.openclaw/workspace/skills/videodb-monitoring
cp -r videodb-monitoring-skill/* ~/.openclaw/workspace/skills/videodb-monitoring/
cd ~/.openclaw/workspace/skills/videodb-monitoring && npm install
openclaw config set skills.entries.videodb-monitoring.env.VIDEODB_API_KEY 'sk-xxx'
openclaw config set skills.entries.videodb-monitoring.enabled true
openclaw gateway restart
Get the key from console.videodb.io. After the gateway restart, openclaw skills list shows videodb-monitoring. The next time you ask the agent for a recording, the skill takes over.
That is the whole setup. Two TypeScript files, one config entry, one restart.
Why this turned out to be small
When we started, the obvious design was a backend service that watched OpenClaw, instrumented its tool calls, and emitted screenshots at decision points. We did not build that. We did not have to.
The reason is that VideoDB Capture treats the desktop as a media stream from the first frame. Once screen and system_audio are RTStreams, search, transcripts, summaries, and clip URLs are SDK calls, the same calls you would use against a YouTube video. The agent's screen has no special status. And because the OS-level capture happens outside the agent, OpenClaw does not need a single line changed.
What is left is gluing the agent's intent to the right SDK call at the right moment. That is SKILL.md, monitor.ts, and videodb.ts. The skill is small because the heavy lifting already lives in VideoDB.
Built by the VideoDB team. Free API key at console.videodb.io. Community on Discord.
