Pair Programmer: live screen and audio context for coding agents

A short engineering note on turning a developer session into searchable context with VideoDB Capture, RTStreams, local events, and agent-side retrieval.

The context problem for coding agents

Coding agents are good at repositories. They can read files, inspect diffs, run tests, and edit code. They are much worse at the live session around the repository: the red terminal error that flashed by, the browser tab you were looking at, the thing you said out loud, or the tutorial playing in the background.

That missing context changes the interface. Instead of saying “fix this,” you have to copy the error, describe the screen, summarize what you just tried, and restate your intent. The agent is not failing because it cannot reason. It is failing because it cannot see or hear the working session.

Pair Programmer is a VideoDB-powered skill that gives coding agents that live context. It captures the selected display, microphone, and system audio; indexes those streams in real time; writes recent context to a local event log; and lets the agent search that context before it answers or acts.

What Pair Programmer does

This is not a new IDE and it is not an autonomous coding product. Pair Programmer is a context layer for the agent you already use, including Claude Code, Codex, Cursor, and other skill-compatible coding agents.

The goal is narrow:

  • Let the agent answer questions about what happened on screen.
  • Use mic transcript as developer intent.
  • Include system audio from tutorials, demos, calls, or other machine audio.
  • Keep recent context cheap, local, and inspectable.
  • Keep long-session context searchable without flooding the agent conversation.
  • Let the agent act on spoken instructions with the screen as reference.

The useful abstraction is simple: your development session becomes a set of searchable, timestamped streams.

How recording starts

Pair Programmer treats a development session as three live streams.

Source            Stream          Local event channel
selected display  Video RTStream  visual_index
microphone        Audio RTStream  transcript
system audio      Audio RTStream  audio_index

The user starts a recording from the coding agent with /pair-programmer record. A small Electron recorder opens because source selection, screen permissions, microphone access, and system audio are desktop problems. Once the user chooses what to capture, VideoDB Capture creates RTStreams for the selected sources.

The startup path is intentionally explicit:

  1. Check the local PID guard so two recorders do not run at once.
  2. Verify the VideoDB API key from the environment or project .env.
  3. Launch the Electron recorder with the project root as its working directory.
  4. Ask VideoDB Capture for available display, mic, and system-audio channels.
  5. Let the user choose sources in the picker.
  6. Start a Capture session and write the RTStream IDs to local session state.
  7. Begin indexing each stream as events arrive.
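
The first two steps are the easiest to get subtly wrong, so here is a minimal Python sketch of how they might look. It is an illustration, not the skill's actual code: the PID file path, the `VIDEO_DB_API_KEY` variable name, and both helper names are assumptions, and the liveness check is POSIX-only.

```python
import os
from pathlib import Path
from typing import Optional

# Hypothetical path and variable name, chosen for illustration only.
PID_FILE = Path("/tmp/videodb_pp_recorder.pid")
KEY_NAME = "VIDEO_DB_API_KEY"


def recorder_is_running(pid_file: Path = PID_FILE) -> bool:
    """Return True if the PID file names a process that is still alive."""
    try:
        pid = int(pid_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return False  # no file, or unreadable contents
    if pid <= 0:
        return False  # never probe process groups
    try:
        os.kill(pid, 0)  # POSIX: signal 0 only checks that the process exists
    except ProcessLookupError:
        pid_file.unlink(missing_ok=True)  # stale file left by a crashed recorder
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def find_api_key(project_root: Path, name: str = KEY_NAME) -> Optional[str]:
    """Look in the environment first, then in project_root/.env."""
    value = os.environ.get(name)
    if value:
        return value
    env_file = project_root / ".env"
    if not env_file.exists():
        return None
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        if key.strip() == name:
            return raw.strip().strip("'\"") or None
    return None
```

A launcher would refuse to start when `recorder_is_running()` is true or `find_api_key(project_root)` returns `None`, before it spawns the Electron recorder at all.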

The recording state has to be visible and stoppable. A tool that can see and hear a machine needs clear user control as part of the core product, not as polish.

How live streams become events

The important unit is not a recording file but a timestamped event.

Screen frames are indexed every few seconds, with a small frame sample per batch. Mic and system audio are indexed in sentence-sized chunks. The defaults favor low-latency live context over perfect offline analysis.

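# Simplified sketch of the indexing setup.
streams = capture(screen=True, mic=True, system_audio=True)

for stream in streams:
    if stream.kind == "video":
        # Screen: describe a few sampled frames from each short batch.
        stream.index_visuals(batch_seconds=2, frame_count=3)
    else:
        # Mic and system audio: index one sentence-sized chunk at a time.
        stream.index_audio(batch="sentence")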

That gives the agent small context atoms while the developer is still working. It does not need to wait for an exported recording or a post-session transcript.

The local JSONL event log

Every useful event is appended to a local JSONL file:

/tmp/videodb_pp_events.jsonl

A line looks like this (pretty-printed here; in the file each event is a single line):

{
  "ts": "2026-03-05T10:15:30.123Z",
  "unix_ts": 1772705730.123,
  "channel": "visual_index",
  "data": {
    "text": "User is viewing VS Code with auth.ts open"
  }
}

This file is deliberately boring. That is what makes it useful.

For recent context, the agent does not need a vector search call or a large prompt. It can read the tail of one channel, grep for a word, or join two channels by timestamp.
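
As a rough illustration of how little code that takes, here is a Python sketch of the "tail one channel" and "grep for a word" cases. The field names follow the example event above; the helper names are mine, not part of the skill.

```python
import json
from pathlib import Path

EVENT_LOG = Path("/tmp/videodb_pp_events.jsonl")


def load_events(path=EVENT_LOG):
    """Parse the JSONL log, skipping blank or partially written lines."""
    events = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                events.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # the recorder may be mid-write on the last line
    return events


def tail_channel(events, channel, n=5):
    """Return the last n events from one channel, oldest first."""
    return [e for e in events if e.get("channel") == channel][-n:]


def grep_events(events, term):
    """Case-insensitive substring match on each event's data.text."""
    needle = term.lower()
    return [
        e for e in events
        if needle in e.get("data", {}).get("text", "").lower()
    ]
```

For example, `tail_channel(load_events(), "transcript", n=3)` returns the last three things the developer said, and `grep_events(load_events(), "auth")` returns every event on any channel that mentions auth.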

Saving raw events locally also keeps search simple. Instead of hardcoding one brittle shell pipeline, the skill describes the filtering logic and lets the agent choose the right tool for the question: grep for exact terms, tail for recent context, jq or Python for timestamp windows, and RTStream search when the local file is not enough.
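
The timestamp-window case is the one most worth sketching, because it is how two channels get joined. The following Python is a minimal example under the assumption that every event carries a numeric `unix_ts`, as in the sample line; the function names and default window sizes are illustrative.

```python
def events_near(events, anchor_ts, channel, before=10.0, after=10.0):
    """Events on `channel` whose unix_ts falls in [anchor - before, anchor + after]."""
    lo, hi = anchor_ts - before, anchor_ts + after
    return [
        e for e in events
        if e.get("channel") == channel
        and "unix_ts" in e
        and lo <= e["unix_ts"] <= hi
    ]


def join_by_time(events, left_channel, right_channel, window=10.0):
    """Pair each left-channel event with the right-channel events around it."""
    pairs = []
    for left in events:
        if left.get("channel") != left_channel or "unix_ts" not in left:
            continue
        nearby = events_near(events, left["unix_ts"], right_channel, window, window)
        pairs.append((left, nearby))
    return pairs


# Tiny in-memory sample so the sketch runs without a real log file.
sample = [
    {"unix_ts": 100.0, "channel": "visual_index", "data": {"text": "auth.ts open"}},
    {"unix_ts": 104.0, "channel": "transcript", "data": {"text": "this login check looks wrong"}},
    {"unix_ts": 300.0, "channel": "visual_index", "data": {"text": "browser docs tab"}},
]
for spoken, seen in join_by_time(sample, "transcript", "visual_index"):
    print(spoken["data"]["text"], "->", [e["data"]["text"] for e in seen])
```

The linear scan is deliberate. A session log is small enough that the agent gains more from code it can read and adapt than from an index.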

Searching across screen, mic, and system audio

Some questions are not one-channel questions.

If I ask, “what was happening when I mentioned the auth bug?”, the answer may need mic transcript, screen state, system audio, and semantic search. The main agent should not read all of that raw context itself.

Instead, it can fan out the search:

Worker             Job                                                Output
Transcript worker  Search transcript events for “auth bug”            Timestamp and spoken context
Screen worker      Inspect visual_index events around that timestamp  What was visible
Audio worker       Check audio_index events near the same time        Tutorial, call, or demo context
Semantic worker    Run RTStream search for “auth bug login failure”   Fuzzy matches across the stream indexes

Each worker gets one job and returns a compact summary. The main agent spends its context on synthesis instead of log scanning.
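
The shape of that fan-out can be sketched in Python. This is a simplified model, not how the skill dispatches workers: it treats each worker as a plain function over already-loaded events, assumes the log is in append (chronological) order, and leaves out the semantic worker because that one depends on RTStream search rather than the local file. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def transcript_worker(events, phrase):
    """Find the most recent transcript event that mentions the phrase."""
    needle = phrase.lower()
    hits = [
        e for e in events
        if e["channel"] == "transcript" and needle in e["data"]["text"].lower()
    ]
    if not hits:
        return None
    last = hits[-1]  # relies on the log being in chronological order
    return {"unix_ts": last["unix_ts"], "said": last["data"]["text"]}


def channel_worker(events, channel, anchor_ts, window=15.0, limit=3):
    """Summarize one channel as a few short strings near the anchor time."""
    near = [
        e["data"]["text"] for e in events
        if e["channel"] == channel and abs(e["unix_ts"] - anchor_ts) <= window
    ]
    return near[-limit:]


def fan_out(events, phrase):
    """Resolve the anchor timestamp first, then query the other channels in parallel."""
    anchor = transcript_worker(events, phrase)
    if anchor is None:
        return {"anchor": None}
    with ThreadPoolExecutor(max_workers=2) as pool:
        screen = pool.submit(channel_worker, events, "visual_index", anchor["unix_ts"])
        audio = pool.submit(channel_worker, events, "audio_index", anchor["unix_ts"])
        return {"anchor": anchor, "screen": screen.result(), "audio": audio.result()}
```

Note the ordering constraint: the screen and audio workers need the timestamp that the transcript worker finds, so only those two can run side by side. The `limit` parameter is what keeps each worker's output compact.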

This is the part that makes the retrieval feel practical. Screen events are dense. Transcript events are sparse. System audio may or may not matter. RTStream search is useful, but not always needed. Splitting those paths keeps the main loop clean.

How spoken requests become code changes

Search answers “what happened?” The next loop is “do the thing I just asked for.”

/pair-programmer act reads the recent mic transcript, finds the latest actionable instruction, pulls screen context around that timestamp, and then uses the agent’s normal coding tools.

# Simplified sketch of the act loop.
recent_speech = read_recent(channel="transcript", minutes=3)
instruction = find_latest_actionable_request(recent_speech)
# Screen events near the instruction resolve words like "this" and "that".
screen_context = read_visual_events_around(instruction.timestamp)

if instruction.is_clear:
    confirm(instruction.summary)
    edit_code(instruction, screen_context)
    run_relevant_checks()
else:
    ask_for_clarification()

Voice gives intent. Screen context gives referents. The agent still does normal engineering work: read files, edit files, run commands, and test changes.
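
To make the first two steps of that loop concrete, here is a hedged Python sketch. In practice the agent itself would most likely judge which utterance is actionable; the verb list below is only a stand-in so the data flow is visible. The function names echo the outline above, but the signatures, the `Instruction` type, and the heuristic are my assumptions.

```python
import time
from dataclasses import dataclass
from typing import Optional

# Placeholder heuristic: a real agent would use its own judgment here.
ACTION_VERBS = ("fix", "add", "rename", "remove", "refactor", "change", "update", "write")


@dataclass
class Instruction:
    text: str
    timestamp: float


def read_recent(events, channel, minutes, now=None):
    """Events on one channel from the last `minutes` minutes."""
    now = time.time() if now is None else now
    cutoff = now - minutes * 60
    return [e for e in events if e["channel"] == channel and e["unix_ts"] >= cutoff]


def find_latest_actionable_request(recent_speech) -> Optional[Instruction]:
    """Walk backwards through the transcript and stop at the first imperative."""
    for event in reversed(recent_speech):
        words = [w.strip(".,!?") for w in event["data"]["text"].lower().split()]
        if any(w in ACTION_VERBS for w in words):
            return Instruction(event["data"]["text"], event["unix_ts"])
    return None
```

The returned timestamp is the important part: it is what lets the agent pull the screen events from the moment the instruction was spoken, rather than from whatever is on screen when the command runs.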

Future concept: re-reading a time range

Pair Programmer does not support this yet, but the VideoDB infrastructure makes it a natural next step: let the agent re-look at a specific moment with a specific question.

Today, the local JSONL file gives the agent text memory: timestamped events, transcript chunks, and screen descriptions. The RTStream gives it video memory: the original session is still addressable by time. For many coding tasks, that distinction matters. A caption might say “terminal shows an auth error,” but the useful detail may be the exact filename, stack trace, browser state, or line number visible at that moment.

A future API could let the agent take a timestamp from the local event map and ask the original stream a more specific question:

answer = rtstream.ask(
    "Look at the terminal. What exact error, filename, and stack trace are visible?",
    start="10:15:20",
    end="10:15:45",
)

That is different from asking the saved caption again. It is also different from re-querying only the few sampled frames that produced the first event. The stronger primitive is to revisit the original time range and ask a better question.

In that model, a development session lives in two forms: a small text memory file for fast retrieval, and a video memory stream for moments where visual detail matters more than the first text summary.
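
The link between the two forms is a time range. As a small sketch of that one step, the Python below turns an event's `ts` field into a padded window. The helper name, the padding defaults, and the `HH:MM:SS` output format are assumptions for illustration; a real API might expect stream-relative offsets instead of wall-clock times.

```python
from datetime import datetime, timedelta


def window_around(ts_iso, before=10, after=15):
    """Return (start, end) as HH:MM:SS strings padded around an ISO 8601 timestamp."""
    # Older Python versions do not parse a trailing "Z", so normalize it first.
    moment = datetime.fromisoformat(ts_iso.replace("Z", "+00:00"))
    start = moment - timedelta(seconds=before)
    end = moment + timedelta(seconds=after)
    return start.strftime("%H:%M:%S"), end.strftime("%H:%M:%S")


start, end = window_around("2026-03-05T10:15:30.123Z")
print(start, end)  # 10:15:20 10:15:45
```

Padding more after the event than before it is a guess that errors and stack traces tend to stay on screen for a while once they appear.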

What VideoDB handles and what the agent handles

Pair Programmer works best as a context substrate, not as a new IDE.

VideoDB handles the media layer: capture, RTStreams, indexing, and searchable video memory. The coding agent handles retrieval, synthesis, and action.

That boundary is what keeps the system small. Give the agent eyes, ears, and memory, then let it use the tools it already has.