How I Built a Real-Time Meeting Copilot in a Day
By the VideoDB Team
I might be one of the few people who like attending meetings, but only when I am prepared and can extract good notes and insights from them. If only that were the easy part. A lot of conversations drift from the main topic, and we realise it far too late. Sometimes I forget when to stop talking, or that I should chime in with my points if I have them. Worst of all is when the meeting ends and, after going over everything that was discussed, I find there are still gaps or unanswered questions that I either forgot to ask or that popped up mid-meeting.
The issue is that tracking and analysing the direction of the meeting while constantly noting things down takes a lot of cognitive load, which I would rather spend on actually discussing things or brainstorming. This is where I felt I could do better. I was looking for a meeting assistant of sorts that would do all this for me.
So, like anyone else, I went looking for existing solutions. The bigger issue with most of them was that they mainly supported post-meeting analysis, which is great but not really optimal for me. I want something where I can look back and see what was discussed 5 minutes ago, follow a live checklist, see what I should ask or clear up, and so on.
Since I didn't find anything meaningful, I decided to create my own solution for this.
In this article, I will show how I built the whole app.
So let's begin!
What would the solution app look like?
For the app to help me in all my meetings, I pictured its process in three stages:
- Pre-meeting -> Setting up and preparing context for the meeting
- In-meeting -> Live meeting analysis, nudges, questions, live transcript etc.
- Post-meeting -> Recording, transcript, summary
This gives me a good mental model for the app to be built. Basically:
- The AI Meeting Agent gathers all the info before the call begins
- Once the call starts, I need live insights
- After the meeting, I need the recording, transcript and summary
After some brainstorming, I decided on the following flow and features.
Pre-meeting features
Before the meeting the AI Meeting Agent will:
- Gather basic information like meeting title, description and goals
- Ask questions back to gain more information for the meeting analysis
- Prepare a meeting checklist which covers the goals and agenda of the meeting
In-meeting features
These features are the most fundamental and, as far as I could see, probably the most technically challenging part.
- The AI Agent should record the meeting. This means the audio and the meeting screen.
- It should give me a live transcript of what is being said, and differentiate between me and the other meeting participants
- A Live Assistance flow, where the AI Agent gives me questions to ask about points I have missed, or nudges for things I should say
- Metrics that tell me how much I am speaking compared to others, so I know when to give or take the floor, and similar metrics
- Ability to attach external sources of knowledge such as a common issues board, tickets, notes etc. via different providers like Hubspot, Linear, Notion etc., so that the AI Agent has the context to be helpful
This might be a long list, but once the audio of the meeting is available, all of these become manageable.
Post-meeting features
Finally, after the meeting is over, I must be able to get:
- The final recording of the meeting session
- Summary, transcript, key points and checklists
- Potentially, some post-meeting workflow attachments, so that I can attach a Zapier or n8n workflow at the end to store the meeting data, share it on Slack, or perhaps run it through another AI Agent.
That is a very tall order of tasks. But let's see what we can build, starting with the technical implementation of this app.
Technical Implementation
Let's first clear up what I want from this app:
- It can be a desktop or a web app, depending on which is more feasible.
- I need to figure out how to record and stream the live screen video and audio for AI analysis
- There are live AI Agents I need to run, and some post-meeting ones, meaning I need both the live and the recorded feed of the meeting.
As you can see, the most important part of this app is the recording. And for that, my choice was VideoDB Capture.
Recording with VideoDB Capture
For what we are building, VideoDB Capture is a great choice.
VideoDB Capture is a desktop capture SDK. It takes the screen, mic, and system audio off the machine, processes them, and sends back transcripts and scene descriptions in a couple of seconds.
This handles the live transcript and video analysis, while also giving us the final recording of the session. So all our needs are met here.
VideoDB Capture takes the display, system audio and mic audio as input, and gives back transcripts for the mic and system audio, plus AI indexing of the visuals and audio.
We will make the most of the transcript websocket streams.
Building the app with VideoDB Capture
VideoDB Capture comes with its own binary, which requires OS-level permissions for recording audio and screen. An Electron desktop app makes the most intuitive sense.
In an Electron app, we run two processes:
- The renderer process -> this is the UI, running in a Chromium window. It can't touch the OS or Node directly.
- The main process -> this is Node.js, where the app talks to the OS, runs native binaries, and accesses the filesystem.
Here is what I decided on for this app's renderer process:
- React 19 -> renders the whole UI: meeting setup, live transcript, nudges, cue cards, summary
- Vite -> dev server and bundler for the renderer
- Tailwind + Radix UI -> styling and primitive components (dialog, tabs, accordion, select)
- Zustand -> client-side state for the session, recording, copilot
- TanStack Query + tRPC client -> typed data fetching from the local server
- React Markdown -> renders AI-generated summaries and notes
Here is what I have decided for this app's main process:
- Electron -> owns windows, app lifecycle, system tray, custom call-md:// protocol, global menus
- Node.js -> the runtime everything else lives in
- @videodb/recorder (CaptureClient) -> spawns the native recorder binary for screen, mic, and system audio
- Hono + @hono/node-server -> local HTTP server on a chosen port
- tRPC server (@hono/trpc-server) -> typed RPC mounted on Hono at /api/trpc
- better-sqlite3 + Drizzle ORM -> local SQLite for meetings, recordings, and copilot state
- @modelcontextprotocol/sdk -> connects to MCP servers (Linear, Notion, etc.) over stdio or HTTP
- OpenAI SDK -> runs the live-assist, cue-card, nudge, and summary agents
- Pino -> structured logging
Now let's begin
How I actually built the features
I'm going to walk through this in the order the user actually hits the app. Recorder first (because nothing else works without it), then the pre-meeting wizard, then everything that fires while the call is going, then what happens once they hit stop.
1. Plugging in VideoDB Capture
I built this first. Every other feature in the app reads from the streams VideoDB Capture produces, so until this works, nothing works.
There are two pieces of the SDK you have to wire up together. CaptureClient from videodb/capture owns the native recorder binary. It enumerates OS channels (mic, system audio, displays), runs the session, and lets you pause or resume individual tracks. connect() from videodb is the other half. It opens authenticated websockets that stream transcript and scene-index events back from VideoDB while the recorder is running.
Anything that touches the OS or the SDK lives in Electron's main process. The renderer talks to it over IPC.
The IPC bridge
From the renderer side this is pretty boring. You need to list channels (so you can show a device picker on the start screen), start, and stop. That's it.
// src/preload/index.ts
capture: {
listChannels: (sessionToken, apiUrl) =>
ipcRenderer.invoke('recorder-list-channels', sessionToken, apiUrl),
startRecording: (params) =>
ipcRenderer.invoke('recorder-start-recording', params),
stopRecording: () =>
ipcRenderer.invoke('recorder-stop-recording'),
}
listChannels runs before the user has actually decided to record, so the main process spins up a CaptureClient just to enumerate devices and then reuses it when the user hits start. This matters more than it sounds. If you create two clients back-to-back the binary will throw an "another recorder instance is running" error and you'll spend an afternoon figuring out why.
// src/main/ipc/capture.ts
import { ipcMain } from 'electron';
import { CaptureClient } from 'videodb/capture';
import { connect } from 'videodb';

// `logger` is the app's Pino logger (import omitted here).
// One shared client: the native binary rejects a second concurrent instance.
let captureClient: CaptureClient | null = null;

ipcMain.handle('recorder-list-channels', async (_e, sessionToken, apiUrl) => {
  if (!captureClient) {
    captureClient = new CaptureClient({ sessionToken, ...(apiUrl && { apiUrl }) });
    captureClient.on('error', (err) => logger.error({ err }, 'CaptureClient error'));
  }
  const channels = await captureClient.listChannels();
  return channels.all().map((ch) => ({
    channelId: ch.id, type: ch.type, name: ch.name,
  }));
});
Starting a session: recorder plus websockets
When the user hits Start, four things happen in the main process. The order matters. First, open a videodb connection with the session token. Second, open three websockets, one each for mic transcript, system-audio transcript, and scene index off the screen. Third, reuse (or create) the CaptureClient and attach its event listeners. Fourth, build the channel configs from listChannels() and call startSession.
The websocket setup is the bit I tripped over first. connect() and connectWebsocket() look like the same call but they're not. connectWebsocket() returns a builder, and you have to call .connect() on it to actually open the socket. Also, each socket only carries one stream, so you need three of them.
// src/main/ipc/capture.ts
const videodbConnection = connect({ sessionToken, baseUrl: apiUrl });
const micWs = await (await videodbConnection.connectWebsocket()).connect();
const sysAudioWs = await (await videodbConnection.connectWebsocket()).connect();
const screenWs = await (await videodbConnection.connectWebsocket()).connect();
listenForMessages(micWs, 'mic');
listenForMessages(sysAudioWs, 'system_audio');
listenForVisualIndexMessages(screenWs);
Then the recorder. listChannels() returns a typed object that you index by device class (mics.default, systemAudio.default, displays.default), and you build channel configs from those IDs. The thing to know here is the transcript: true flag. That's the switch that turns on transcription for a channel. Without it, you can have all three sockets open and not a single transcript event will arrive.
// reuse the client created during listChannels above
setupCaptureEventListeners(); // forwards recording:started/stopped/error to renderer
const channels = await captureClient.listChannels();
const captureChannels = [
{ channelId: channels.mics.default.id, type: 'audio', record: true, transcript: true },
{ channelId: channels.systemAudio.default.id, type: 'audio', record: true, transcript: true },
{ channelId: channels.displays.default.id, type: 'video', record: true },
];
await captureClient.startSession({
sessionId: config.sessionId,
channels: captureChannels,
});
Notice the display channel doesn't have transcript. The scene-index socket is fed by VideoDB's visual analyzer running over the screen stream on its own. It's a different pipeline from audio transcription, but the consumer side looks identical. Events arrive on a websocket and you forward them on.
Reading the streams
Each websocket exposes an async iterator. The mic and system-audio listeners do exactly the same thing except for the source tag they stamp on each segment, so they share a function.
async function listenForMessages(
ws: WebSocketConnection,
source: 'mic' | 'system_audio'
) {
for await (const msg of ws.receive()) {
const channel = msg.channel || msg.type || msg.event_type;
if (channel !== 'transcript' && !msg.text) continue;
const data = (msg.data ?? {}) as Record<string, unknown>;
sendRecorderEvent({
event: 'transcript',
data: {
text: (data.text ?? msg.text ?? '') as string,
isFinal: (data.is_final ?? msg.is_final ?? false) as boolean,
start: (data.start ?? msg.start) as number,
end: (data.end ?? msg.end) as number,
source,
},
});
}
}
Two things about the payload that took me a while to figure out. Different SDK builds put the same fields either at the top level (msg.text) or nested under msg.data (msg.data.text). The listener has to handle both, which is why the double-coalesce is there. It looks paranoid but I had it crash on me both ways. The other thing is isFinal. That's the difference between a rolling partial (which you'd want for live captions) and a locked segment (which is what you actually save to disk and feed to downstream agents). If you skip partials you get cleaner state. If you want a typewriter effect, render them.
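To make the "handle both shapes" point concrete, here is a minimal standalone sketch of the normalization step, pulled out of the listener so it can be tested without a socket. The `RawStreamMessage` shape and the `normalizeTranscript` name are assumptions for illustration, not types from the SDK, and defaulting missing timestamps to 0 is a choice of this sketch.

```typescript
// Hypothetical message shape: fields may sit at the top level or under `data`.
interface RawStreamMessage {
  channel?: string;
  text?: string;
  is_final?: boolean;
  start?: number;
  end?: number;
  data?: { text?: string; is_final?: boolean; start?: number; end?: number };
}

interface TranscriptSegment {
  text: string;
  isFinal: boolean;
  start: number;
  end: number;
  source: 'mic' | 'system_audio';
}

// Returns null for messages that carry no usable transcript text.
function normalizeTranscript(
  msg: RawStreamMessage,
  source: 'mic' | 'system_audio',
): TranscriptSegment | null {
  const data = msg.data ?? {};
  // Prefer the nested field, fall back to the top-level one.
  const text = (data.text ?? msg.text ?? '').trim();
  if (text === '') return null;
  return {
    text,
    isFinal: data.is_final ?? msg.is_final ?? false,
    start: data.start ?? msg.start ?? 0,
    end: data.end ?? msg.end ?? 0,
    source,
  };
}

// Nested payload, already final.
const nested = normalizeTranscript(
  { channel: 'transcript', data: { text: 'hello', is_final: true, start: 1, end: 2 } },
  'mic',
);
// Flat payload, still a partial because is_final is absent.
const flat = normalizeTranscript({ text: 'hi there', start: 3, end: 4 }, 'system_audio');
console.log(nested, flat);
```

Keeping this as a pure function means the listener loop only has to call it and skip `null` results.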
sendRecorderEvent is just a wrapper around mainWindow.webContents.send('recorder-event', event). Every stream message turns into one IPC message.
The scene-index listener has the same shape: async iterator, filter by channel name, forward to the renderer. The only difference is which events you let through.
async function listenForVisualIndexMessages(ws: WebSocketConnection) {
for await (const msg of ws.receive()) {
const channel = msg.channel || msg.type || msg.event_type;
if (channel !== 'scene_index' && channel !== 'visual_index') continue;
const data = (msg.data ?? {}) as Record<string, unknown>;
sendRecorderEvent({
event: 'visual_index',
data: {
text: (data.text ?? msg.text ?? '') as string,
start: (data.start ?? msg.start) as number,
end: (data.end ?? msg.end) as number,
},
});
}
}
Routing into state
The renderer subscribes once on mount and routes each event into a Zustand store, keyed by source:
// renderer
useEffect(() => {
return window.electronAPI.on.recorderEvent((event) => {
if (event.event === 'transcript') {
const speaker = event.data.source === 'mic' ? 'me' : 'them';
transcriptStore.append(speaker, event.data);
} else if (event.event === 'visual_index') {
visualIndexStore.append(event.data);
}
});
}, []);
Mic is "me", system audio is "them". You get speaker separation without a diarization model, because VideoDB is handing you two physically separate audio channels.
And that's it. Once that transcript store is filling up, every feature in the rest of the post is just a function over it. Live transcript view, talk-ratio metrics, the 20-second live-assist loop, MCP intent detection, the post-meeting summary. They all read from transcriptStore. Get this pipe clean and the rest is downhill.
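As a rough illustration of what "a function over the transcript store" means, here is a framework-free sketch of a store that keeps one rolling partial per speaker and locks it when the final arrives. It is a plain class standing in for the Zustand store, and `TranscriptLog` and its methods are hypothetical names.

```typescript
type Speaker = 'me' | 'them';

interface StoredSegment {
  speaker: Speaker;
  text: string;
  isFinal: boolean;
}

// Plain class standing in for the Zustand store; names are illustrative.
class TranscriptLog {
  private segments: StoredSegment[] = [];

  append(speaker: Speaker, text: string, isFinal: boolean): void {
    // A new fragment replaces that speaker's open partial, if there is one,
    // so the view shows a single rolling line per speaker.
    const openIndex = this.findOpenPartial(speaker);
    if (openIndex !== -1) {
      this.segments[openIndex] = { speaker, text, isFinal };
    } else {
      this.segments.push({ speaker, text, isFinal });
    }
  }

  // Only locked segments should be persisted or sent to downstream agents.
  finals(): StoredSegment[] {
    return this.segments.filter((s) => s.isFinal);
  }

  all(): StoredSegment[] {
    return [...this.segments];
  }

  private findOpenPartial(speaker: Speaker): number {
    for (let i = this.segments.length - 1; i >= 0; i--) {
      const s = this.segments[i];
      if (s && s.speaker === speaker && !s.isFinal) return i;
    }
    return -1;
  }
}

const log = new TranscriptLog();
log.append('me', 'so the', false);
log.append('me', 'so the plan is', false);   // replaces the first partial
log.append('them', 'sounds good', true);
log.append('me', 'so the plan is Q3', true); // locks my line
console.log(log.all());
```

The live view would render `all()` for the typewriter effect, while the agents and the database would only ever see `finals()`.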
State persistence
While the call is running, transcripts go into SQLite via Drizzle. When recording stops, the main process kicks off an export poller that watches VideoDB for the final stitched video URL and writes it back to the recording row. So if the user closes the app right after stopping, the recording isn't lost. There's a recoverPendingSessions() step on startup that reconciles any rows still stuck in processing.
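Here is one way an export poller like this could be structured. It is a sketch under assumptions: the `ExportStatus` shape, the exponential backoff, and the injected `check` function are all hypothetical, since the actual VideoDB call for fetching the export status is not shown here.

```typescript
type ExportStatus = { state: 'processing' } | { state: 'ready'; url: string };

// Exponential backoff, capped at maxMs (the delays are illustrative).
function nextDelayMs(attempt: number, baseMs: number = 2_000, maxMs: number = 30_000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// `check` is injected so the poller does not depend on any particular API client.
async function pollForExport(
  check: () => Promise<ExportStatus>,
  maxAttempts: number = 20,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<string | null> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await check();
    if (status.state === 'ready') return status.url;
    await sleep(nextDelayMs(attempt));
  }
  // Still processing: leave the row as-is so startup recovery can pick it up.
  return null;
}

// Usage with a fake status check that becomes ready on the third call.
let calls = 0;
pollForExport(
  async (): Promise<ExportStatus> =>
    ++calls < 3 ? { state: 'processing' } : { state: 'ready', url: 'https://example.invalid/video' },
  5,
  () => Promise.resolve(), // skip real waiting in the demo
).then((url) => console.log(url));
```

Returning `null` instead of throwing on timeout fits the recovery story: a row that is still in processing simply gets polled again the next time the app starts.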
2. Pre-meeting
Two things have to happen before the user can hit Start. Permissions, and the meeting setup wizard.
OS permissions
This is pure main-process work, because Electron's systemPreferences module, which queries the macOS TCC permission database, is only available there:
// src/main/ipc/permissions.ts
import { ipcMain, shell, systemPreferences } from 'electron';

ipcMain.handle('check-mic-permission', async () =>
  systemPreferences.getMediaAccessStatus('microphone') === 'granted'
);
ipcMain.handle('request-mic-permission', async () =>
  systemPreferences.askForMediaAccess('microphone')
);
ipcMain.handle('request-screen-permission', async () => {
  // macOS does not allow programmatic prompts for screen recording, so we open Settings.
  if (systemPreferences.getMediaAccessStatus('screen') !== 'granted') {
    await shell.openExternal('x-apple.systempreferences:com.apple.preference.security?Privacy_ScreenCapture');
    return false;
  }
  return true;
});
The renderer calls these and renders the right "grant access" UI based on the result.
Meeting setup wizard
A small multi-step form. Title, description, participants, goals, then AI-generated probing questions, then an AI-generated discussion checklist. Both AI steps go through the LLM service (OpenAI SDK pointed at VideoDB's OpenAI-compatible proxy):
// pseudocode of src/main/services/meeting-setup.service.ts
const questions = await llm.chat({
system: PROBING_QUESTIONS_PROMPT,
user: `Meeting: ${name}\nDescription: ${description}\nGoals: ${goals}`,
responseFormat: 'json',
});
const checklist = await llm.chat({
system: CHECKLIST_PROMPT,
user: `Meeting: ${name}\nQ&A: ${JSON.stringify(answers)}`,
responseFormat: 'json',
});
The wizard state lives in Zustand on the renderer. Once the user confirms, it gets POSTed to the local Hono server via tRPC and persisted to SQLite as a meeting_setup row. That same row is loaded back into the live-assist context during the call so the AI knows what the meeting was supposed to be about.
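Since both wizard steps ask the model for JSON, it helps to parse the reply defensively before it reaches the store. A minimal sketch, assuming the model returns an object with an array of strings under a known key; the key names and the `parseStringList` helper are hypothetical.

```typescript
// Pulls a list of non-empty strings out of a JSON reply, or returns []
// when the reply is not valid JSON or does not have the expected shape.
function parseStringList(raw: string, key: string): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return [];
  }
  if (typeof parsed !== 'object' || parsed === null) return [];
  const value = (parsed as Record<string, unknown>)[key];
  if (!Array.isArray(value)) return [];
  return value.filter(
    (entry): entry is string => typeof entry === 'string' && entry.trim() !== '',
  );
}

const questions = parseStringList(
  '{"questions": ["Who is attending?", "", 42, "What is the deadline?"]}',
  'questions',
);
console.log(questions);
```

Falling back to an empty list lets the wizard show a "no suggestions" state instead of crashing on a malformed reply.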
If the user has connected Google Calendar, calendar-poller.service.ts runs in the main process, polls upcoming events, and pushes them to the renderer via calendar:events-updated. When a meeting is about to start, it fires calendar:auto-start-recording and the renderer opens the setup wizard pre-filled with the event details.
3. In-meeting
Everything below is fed by the transcript stream coming out of VideoDB Capture. Each feature is its own service in the main process. They subscribe to the transcript and emit their own events to the renderer.
Live transcript view
The renderer subscribes once and appends each fragment into a Zustand store, keyed by speaker:
// renderer
window.electronAPI.on.recorderEvent((event) => {
if (event.event === 'transcript') {
const speaker = event.data.source === 'mic' ? 'me' : 'them';
transcriptStore.append(speaker, event.data);
}
});
Mic is "me", system audio is "them". So the speaker separation is basically free. No diarization model. VideoDB already gives us two physically separate audio channels.
What's on screen (visual index)
The third websocket I opened earlier carries scene-index events. VideoDB watches the screen stream and tells me, in plain English, what is on the user's screen right now. Stuff like "a Notion page titled Q3 Planning is visible" or "the user is sharing a Figma file with three frames open."
I save each event to SQLite via a small IPC handler:
// src/main/ipc/visual-index.ts
ipcMain.handle('visual-index:save-item', async (_event, data) => {
const item = createVisualIndexItem({
id: uuid(),
recordingId: data.recordingId,
sessionId: data.sessionId,
text: data.text,
startTime: data.startTime,
endTime: data.endTime,
});
return { success: true, id: item.id };
});
Two things use these scene descriptions. The Live Assist prompt gets a "SCREEN CONTENT" block when something relevant is on screen, so the AI can suggest things like "you might want to walk them through the second column on this Figma frame." And the floating widget shows the latest scene description as a tiny live caption, so the user knows the AI is "seeing" the same thing they are.
Stream toggles (mute mid-call)
The user can flip mic, system audio, or screen on and off mid-call. That maps cleanly onto the recorder's pause/resume API:
// preload
pauseTracks: (tracks) => ipcRenderer.invoke('recorder-pause-tracks', tracks),
resumeTracks: (tracks) => ipcRenderer.invoke('recorder-resume-tracks', tracks),
// main
ipcMain.handle('recorder-pause-tracks', async (_e, tracks) => {
if (captureClient) await captureClient.pauseTracks(tracks);
});
When the user toggles their mic off in the UI, the renderer calls pauseTracks(['mic']). The native binary stops recording that track but keeps the session alive. So the rest of the meeting is unaffected, and the metrics service knows there is a gap.
Live checklist panel
The checklist that the AI built during pre-meeting is not just LLM context. It is also rendered as a tickable panel during the call. Each item is matched against the rolling transcript and marked covered, partial, or missing:
// inside the copilot loop
for (const item of playbook.items) {
  const evidence = findEvidence(item.keywords, recentSegments);
  item.status = evidence.length === 0 ? 'missing'
    : evidence.length < 2 ? 'partial'
    : 'covered';
  // emit inside the loop, where `item` is still in scope
  sendToRenderer('copilot:playbook', { item, snapshot });
}
The user can also click an item to mark it done manually. So if the AI misses something, they have an out.
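The matching step can be as simple as case-insensitive keyword search over recent segments. This is a minimal sketch of that idea, not the app's actual matcher: the `findEvidence` body and the `scoreItem` helper are assumptions, and only the missing/partial/covered thresholds come from the loop above.

```typescript
interface ChecklistItem {
  label: string;
  keywords: string[];
  status: 'missing' | 'partial' | 'covered';
}

// Hypothetical matcher: returns the segments that mention any keyword.
function findEvidence(keywords: string[], segments: string[]): string[] {
  const needles = keywords.map((k) => k.toLowerCase());
  return segments.filter((segment) => {
    const haystack = segment.toLowerCase();
    return needles.some((needle) => haystack.includes(needle));
  });
}

// Same thresholds as the loop: 0 hits is missing, 1 is partial, 2+ is covered.
function scoreItem(item: ChecklistItem, segments: string[]): ChecklistItem {
  const evidence = findEvidence(item.keywords, segments);
  const status =
    evidence.length === 0 ? 'missing' : evidence.length < 2 ? 'partial' : 'covered';
  return { ...item, status };
}

const recent = [
  'We should talk pricing next',
  'The pricing tiers changed last month',
  'Let us move on',
];
const pricing = scoreItem(
  { label: 'Discuss pricing', keywords: ['pricing'], status: 'missing' },
  recent,
);
const timeline = scoreItem(
  { label: 'Agree on timeline', keywords: ['deadline', 'timeline'], status: 'missing' },
  recent,
);
console.log(pricing.status, timeline.status);
```

Plain substring matching will miss paraphrases, which is exactly why the manual tick-off matters as a fallback.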
Bookmarking
Sometimes the user just wants to flag a moment. "Come back to this," "ask about this later," that kind of thing. There's a hotkey and a button on the floating widget that drops a bookmark with the current timestamp:
ipcMain.handle('copilot:create-bookmark', async (_e, data) => {
const bookmark = createBookmark({
recordingId: data.recordingId,
timestamp: data.timestamp,
category: data.category, // 'question' | 'todo' | 'highlight'
note: data.note,
});
return { success: true, bookmark };
});
These show up later on the recording detail page as clickable timestamps, so the user can jump straight to that point in the playback.
Live nudges (Live Assist)
Every 20 seconds, a timer in the main process grabs the rolling transcript window plus the meeting context (name, description, probing Q&A, checklist) and asks the LLM what the user should say or ask right now:
// src/main/services/live-assist.service.ts
const SYSTEM_PROMPT = `You are a live meeting coach. You receive a rolling 20-second
transcript from an ongoing meeting. Your job is to surface helpful nudges for the User.
...
Return a JSON object with two arrays of strings:
{
"say_this": ["That's a great point, should we lock in Q3 as our target?"],
"ask_this": ["What specific metrics are behind that 15% number?"]
}`;
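setInterval(async () => {
  try {
    const recent = transcriptBuffer.getLast(20_000); // ms
    const result = await llm.chatJSON({
      system: SYSTEM_PROMPT,
      user: buildContextBlock(meetingContext, recent),
    });
    sendToRenderer('live-assist:update', { insights: result, processedAt: Date.now() });
  } catch (err) {
    // A failed LLM call should skip this tick, not crash the main process.
    logger.error({ err }, 'live-assist tick failed');
  }
}, 20_000);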
The Live Assist panel in the renderer just listens on live-assist:update and renders the two arrays.
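The `getLast(20_000)` call does most of the work in that loop, so here is a sketch of what a rolling buffer behind it might look like. It assumes each segment is stamped with a `receivedAt` time in milliseconds on arrival; that field, the eviction policy, and `formatWindow` are illustrative choices rather than the app's real implementation.

```typescript
interface BufferedSegment {
  text: string;
  source: 'mic' | 'system_audio';
  receivedAt: number; // epoch ms, stamped on arrival (an assumption of this sketch)
}

class TranscriptBuffer {
  private segments: BufferedSegment[] = [];
  private maxAgeMs: number;

  constructor(maxAgeMs: number = 5 * 60_000) {
    this.maxAgeMs = maxAgeMs;
  }

  push(segment: BufferedSegment): void {
    this.segments.push(segment);
    // Drop anything older than maxAgeMs so memory stays bounded on long calls.
    const cutoff = segment.receivedAt - this.maxAgeMs;
    this.segments = this.segments.filter((s) => s.receivedAt >= cutoff);
  }

  // Segments received within the last `windowMs`, oldest first.
  getLast(windowMs: number, now: number = Date.now()): BufferedSegment[] {
    return this.segments.filter((s) => now - s.receivedAt <= windowMs);
  }
}

// Render the window as speaker-tagged lines for the LLM prompt.
function formatWindow(segments: BufferedSegment[]): string {
  return segments
    .map((s) => `${s.source === 'mic' ? 'ME' : 'THEM'}: ${s.text}`)
    .join('\n');
}

const buffer = new TranscriptBuffer();
buffer.push({ text: 'old point', source: 'mic', receivedAt: 1_000 });
buffer.push({ text: 'what about Q3?', source: 'system_audio', receivedAt: 40_000 });
buffer.push({ text: 'Q3 works', source: 'mic', receivedAt: 55_000 });
const recentText = formatWindow(buffer.getLast(20_000, 60_000));
console.log(recentText);
```

Passing `now` explicitly keeps the window logic deterministic in tests, while the timer can rely on the `Date.now()` default.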
Metrics and coaching nudges
conversation-metrics.service.ts runs on a 10-second timer and computes talk ratio, pace (WPM), question count, and monologue detection from the buffered transcript. No LLM in this loop, just plain code over the segments.
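To show how little code this needs, here is a sketch of the four metrics as one pure function. It assumes segment `start` and `end` are in seconds and treats any other-side segment as the end of a monologue; the `computeMetrics` name and the result shape are hypothetical.

```typescript
interface TimedSegment {
  source: 'mic' | 'system_audio';
  text: string;
  start: number; // seconds (assumed unit)
  end: number;   // seconds (assumed unit)
}

interface ConversationMetrics {
  talkRatio: number;           // user's share of total speaking time, 0..1
  wordsPerMinute: number;      // user's pace
  questionCount: number;       // question marks in the user's speech
  longestMonologueSec: number; // longest uninterrupted run of user speech
}

function computeMetrics(segments: TimedSegment[]): ConversationMetrics {
  let mineSec = 0;
  let theirsSec = 0;
  let myWords = 0;
  let questionCount = 0;
  let longest = 0;
  let runStart: number | null = null;

  for (const seg of segments) {
    const duration = Math.max(0, seg.end - seg.start);
    if (seg.source === 'mic') {
      mineSec += duration;
      myWords += seg.text.split(/\s+/).filter(Boolean).length;
      questionCount += (seg.text.match(/\?/g) ?? []).length;
      // Extend the current uninterrupted run of user speech.
      if (runStart === null) runStart = seg.start;
      longest = Math.max(longest, seg.end - runStart);
    } else {
      theirsSec += duration;
      runStart = null; // the other side spoke, so the monologue ends
    }
  }

  const total = mineSec + theirsSec;
  return {
    talkRatio: total === 0 ? 0 : mineSec / total,
    wordsPerMinute: mineSec === 0 ? 0 : myWords / (mineSec / 60),
    questionCount,
    longestMonologueSec: longest,
  };
}

const metrics = computeMetrics([
  { source: 'mic', text: 'Here is the plan for Q3', start: 0, end: 30 },
  { source: 'mic', text: 'Does that work for you?', start: 30, end: 60 },
  { source: 'system_audio', text: 'Yes it does', start: 60, end: 80 },
]);
console.log(metrics);
```

Counting question marks depends on the transcript carrying punctuation, so this count should be read as a rough signal rather than an exact number.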
The metrics feed nudge-engine.service.ts, which fires a coaching nudge when something looks off. Long monologue, low question count, talk ratio skewed too far. Nudges are rate-limited so the user doesn't get spammed mid-call:
// src/main/services/copilot/sales-copilot.service.ts
this.metricsTimer = setInterval(() => this.updateMetrics(), 10_000);
copilot.on('nudge', ({ nudge }) => sendToRenderer('copilot:nudge', { nudge }));
The renderer shows nudges as toasts inside the recording view and also forwards them to the floating widget.
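Rate limiting here can be a small piece of state: a global cooldown between any two nudges, plus a longer cooldown per nudge type. The sketch below shows that idea under assumed names and thresholds; `NudgeLimiter`, the nudge kinds, and the 30-second and 2-minute gaps are illustrative, not the app's actual values.

```typescript
type NudgeKind = 'monologue' | 'low_questions' | 'talk_ratio';

class NudgeLimiter {
  private lastAny = -Infinity;
  private lastByKind = new Map<NudgeKind, number>();
  private globalGapMs: number;
  private perKindGapMs: number;

  constructor(globalGapMs: number, perKindGapMs: number) {
    this.globalGapMs = globalGapMs;
    this.perKindGapMs = perKindGapMs;
  }

  // Returns true if the nudge may be shown now, and records it if so.
  tryFire(kind: NudgeKind, now: number): boolean {
    if (now - this.lastAny < this.globalGapMs) return false;
    const lastSame = this.lastByKind.get(kind) ?? -Infinity;
    if (now - lastSame < this.perKindGapMs) return false;
    this.lastAny = now;
    this.lastByKind.set(kind, now);
    return true;
  }
}

// At most one nudge every 30 s, and the same kind at most every 2 min.
const limiter = new NudgeLimiter(30_000, 120_000);
const fired = [
  limiter.tryFire('monologue', 0),        // first nudge goes through
  limiter.tryFire('talk_ratio', 10_000),  // blocked by the global gap
  limiter.tryFire('talk_ratio', 40_000),  // global gap has passed
  limiter.tryFire('monologue', 80_000),   // same kind fired 80 s ago, blocked
  limiter.tryFire('monologue', 130_000),  // both gaps have passed
];
console.log(fired);
```

The per-kind gap is what stops the same "you're talking too much" toast from repeating every half minute during a long explanation.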
MCP auto-triggering
This is the part I had the most fun with. The user connects MCP servers (Linear, Notion, Hubspot, etc.) via OAuth or stdio. During the call, intent-detector.service.ts watches the transcript for patterns that suggest a tool call would help:
// src/main/services/mcp/intent-detector.service.ts
const INTENT_PATTERNS = [
{ pattern: /who\s+(is|are)\s+\w+\s+(at|from)\s+\w+/i,
intent: 'crm_lookup',
toolKeywords: ['contact', 'search'], confidence: 0.7 },
{ pattern: /(schedule|book)\s+a?\s*(call|meeting|demo)/i,
intent: 'schedule_meeting',
toolKeywords: ['calendar', 'schedule'], confidence: 0.8 },
// documentation lookup, deal lookup, competitor research, etc.
];
Regex matching is cheap, so it runs on every transcript segment. When something hits with high enough confidence, mcp-agent.service.ts asks the LLM to pick the right tool from the connected MCP servers and fill in the arguments via function calling:
const tools = await toolAggregator.getAllTools(); // tools from every connected MCP server
const decision = await llm.chat({
system: MCP_AGENT_PROMPT,
user: `Detected intent: ${intent}\nTranscript: "${segment}"`,
tools,
});
if (decision.tool_calls?.length) {
const result = await mcpClient.executeTool(decision.tool_calls[0]);
sendToRenderer('mcp:result', { result }); // shown as a cue card / panel
}
The MCP results panel in the renderer renders the tool output (markdown, links, structured fields) inline with the transcript.
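The regex pass that gates the LLM call can be sketched as a pure function over one transcript segment. The two patterns are the ones listed earlier; the `detectIntent` name, the "highest confidence wins" rule, and the 0.75 default threshold are assumptions for illustration.

```typescript
interface IntentPattern {
  pattern: RegExp;
  intent: string;
  toolKeywords: string[];
  confidence: number;
}

interface DetectedIntent {
  intent: string;
  confidence: number;
  toolKeywords: string[];
}

const PATTERNS: IntentPattern[] = [
  { pattern: /who\s+(is|are)\s+\w+\s+(at|from)\s+\w+/i,
    intent: 'crm_lookup',
    toolKeywords: ['contact', 'search'], confidence: 0.7 },
  { pattern: /(schedule|book)\s+a?\s*(call|meeting|demo)/i,
    intent: 'schedule_meeting',
    toolKeywords: ['calendar', 'schedule'], confidence: 0.8 },
];

// Returns the highest-confidence matching intent at or above the threshold.
function detectIntent(
  segment: string,
  patterns: IntentPattern[],
  minConfidence: number = 0.75,
): DetectedIntent | null {
  let best: DetectedIntent | null = null;
  for (const p of patterns) {
    if (p.confidence < minConfidence || !p.pattern.test(segment)) continue;
    if (best === null || p.confidence > best.confidence) {
      best = { intent: p.intent, confidence: p.confidence, toolKeywords: p.toolKeywords };
    }
  }
  return best;
}

const scheduling = detectIntent('Can we schedule a demo next week?', PATTERNS);
const lookupStrict = detectIntent('Who is Priya at Acme?', PATTERNS);
const lookupLoose = detectIntent('Who is Priya at Acme?', PATTERNS, 0.6);
console.log(scheduling?.intent, lookupStrict, lookupLoose?.intent);
```

Only a non-null result would go on to the LLM tool-selection step, which keeps the expensive call off the hot path for ordinary small talk.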
Bookmarking and the floating widget
The main process owns a separate BrowserWindow for the floating widget, which is the always-on-top recording controls. It has its own preload (src/preload/widget.ts) and its own set of IPC channels. Bookmark requests from either window go through tRPC to the local Hono server and end up in SQLite.
4. Post-meeting
When the user hits stop, three things happen in parallel:
- The main process tells CaptureClient to stop the session and shut down.
- It starts the export poller to wait for VideoDB to produce the final stitched video URL.
- It runs three LLM calls against the full transcript to build the summary:
// src/main/services/copilot/summary-generator.service.ts
const [shortOverview, keyPoints, postMeetingChecklist] = await Promise.all([
llm.chat({ system: SHORT_OVERVIEW_SYSTEM_PROMPT, user: transcript }),
llm.chatJSON({ system: KEY_POINTS_SYSTEM_PROMPT, user: transcript }),
llm.chatJSON({ system: CHECKLIST_SYSTEM_PROMPT, user: transcript }),
]);
const summary = { shortOverview, keyPoints, postMeetingChecklist, generatedAt: Date.now() };
copilot.emit('call-ended', { summary, metrics, duration });
The call-ended event is forwarded to the renderer, which renders the summary view, fills in the recording detail page, and offers the markdown export. exportMeetingToMarkdown writes a single .md file with transcript, summary, metrics, and bookmarks all in one place.
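For a sense of what the markdown export involves, here is a small sketch that turns a meeting object into a single markdown string. The `MeetingExport` shape and the `meetingToMarkdown` name are hypothetical, bookmark timestamps are assumed to be seconds into the recording, and metrics are left out to keep it short.

```typescript
interface MeetingExport {
  title: string;
  shortOverview: string;
  keyPoints: string[];
  transcript: { speaker: 'me' | 'them'; text: string }[];
  bookmarks: { timestamp: number; note: string }[]; // seconds into the recording (assumed)
}

// 65 seconds becomes "01:05".
function formatTimestamp(totalSeconds: number): string {
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = Math.floor(totalSeconds % 60);
  return `${String(minutes).padStart(2, '0')}:${String(seconds).padStart(2, '0')}`;
}

function meetingToMarkdown(m: MeetingExport): string {
  const lines: string[] = [
    `# ${m.title}`,
    '',
    '## Summary',
    '',
    m.shortOverview,
    '',
    '## Key points',
    '',
    ...m.keyPoints.map((point) => `- ${point}`),
    '',
    '## Bookmarks',
    '',
    ...m.bookmarks.map((b) => `- [${formatTimestamp(b.timestamp)}] ${b.note}`),
    '',
    '## Transcript',
    '',
    ...m.transcript.map((t) => `**${t.speaker === 'me' ? 'Me' : 'Them'}:** ${t.text}`),
  ];
  return lines.join('\n') + '\n';
}

const markdown = meetingToMarkdown({
  title: 'Weekly sync',
  shortOverview: 'Agreed to target Q3 for the launch.',
  keyPoints: ['Launch targets Q3', 'Pricing review next week'],
  transcript: [
    { speaker: 'me', text: 'Should we lock in Q3?' },
    { speaker: 'them', text: 'Yes, Q3 works.' },
  ],
  bookmarks: [{ timestamp: 65, note: 'Follow up on pricing' }],
});
console.log(markdown);
```

Building the string separately from writing the file keeps the formatting testable; the file write then becomes a one-line call in the main process.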
Finally, workflow-webhook.service.ts fires any user-configured webhooks (n8n, Zapier, Slack, custom CRM) with the meeting payload. That's how the "send my meeting notes to X" automations work.
Recording history
Every meeting the user records ends up in a list inside the app. The home view pulls a list of past recordings from SQLite, and clicking one opens a detail page with the playable video, full transcript, summary, metrics, bookmarks, and the visual-index timeline. All of that comes from the same DB rows that the live call wrote into.
// renderer (TanStack Query + tRPC)
const { data: recordings } = trpc.recordings.list.useQuery();
const { data: detail } = trpc.recordings.byId.useQuery(id);
Because the playback URL only exists once VideoDB has finished exporting the stitched video, the recording card shows a "processing" state until the export poller updates the row. Once it lands, the row flips to "ready" and the renderer rerenders with the playback URL.
Local-first storage
The whole thing is local. SQLite, the videos, the logs. They all live under Electron's userData path, which on macOS resolves to ~/Library/Application Support/call-md/:
// src/main/db/index.ts
const userDataPath = app.getPath('userData');
const dbDir = path.join(userDataPath, 'data');
return path.join(dbDir, 'call-md.db');
So if the user closes the app, restarts the laptop, or loses internet, their meetings are still there on disk. The only thing that needs the cloud is the VideoDB-side video export and any MCP servers they have connected.
So the whole loop is capture, then live agents on the transcript, then summary, then export. The main process owns every external integration: the recorder binary, OS permissions, websockets, SQLite, MCP servers, OpenAI. The renderer is a thin React UI subscribed to a handful of IPC channels.
