Part 11: Building Realtime and Voice Agents

You've built chat interfaces in Part 10—rendering streaming responses, visualizing tool calls, and deploying frontends. Now you'll add realtime communication and voice capabilities—the technologies that make AI interactions feel natural, immediate, and conversational.

This part teaches you to build voice-enabled AI agents, implement realtime bidirectional communication, and create multimodal experiences that go beyond text.


Why Realtime and Voice Matter

Text chat is powerful, but voice is natural. Humans evolved to speak, not type:

  • Voice interfaces: Hands-free interaction, accessibility, natural conversation flow
  • Realtime communication: Immediate feedback, duplex conversations, live collaboration
  • Multimodal IO: Images, screens, audio—richer context for AI understanding

Voice AI is the next frontier. This part prepares you to build it.


What You'll Learn

Realtime APIs for Agents

You'll implement bidirectional communication patterns:

  • Server-Sent Events (SSE): Streaming AI responses from server to browser
  • WebSockets: Full-duplex communication for conversational AI
  • WebRTC: Peer-to-peer connections for voice/video AI
  • Connection management: Reconnection logic, heartbeats, graceful disconnection (a minimal client sketch follows this list)
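
To make the connection-management pattern concrete, here is a minimal sketch in TypeScript. The endpoint URL and ping message shape are hypothetical placeholders, and the backoff values are illustrative starting points, not a definitive implementation.

```typescript
// Minimal reconnecting WebSocket client with a heartbeat.
// The endpoint URL and message shape are placeholders, not a real service.

type MessageHandler = (data: string) => void;

class AgentSocket {
  private ws?: WebSocket;
  private heartbeat?: ReturnType<typeof setInterval>;
  private retryDelay = 1000; // grows on each failure, capped below

  constructor(private url: string, private onMessage: MessageHandler) {}

  connect(): void {
    this.ws = new WebSocket(this.url);

    this.ws.onopen = () => {
      this.retryDelay = 1000; // reset backoff on a successful connection
      // Send a lightweight ping so intermediaries keep the connection alive.
      this.heartbeat = setInterval(() => this.ws?.send('{"type":"ping"}'), 15000);
    };

    this.ws.onmessage = (event) => this.onMessage(event.data);

    this.ws.onclose = () => {
      clearInterval(this.heartbeat);
      // Reconnect with exponential backoff, capped at 30 seconds.
      setTimeout(() => this.connect(), this.retryDelay);
      this.retryDelay = Math.min(this.retryDelay * 2, 30000);
    };
  }

  send(data: string): void {
    if (this.ws?.readyState === WebSocket.OPEN) this.ws.send(data);
  }
}

// Usage: stream user turns to the agent, render responses as they arrive.
const socket = new AgentSocket('wss://example.com/agent', (msg) => console.log(msg));
socket.connect();
```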

Browser Audio Capabilities

You'll build voice-enabled AI interfaces:

  • Audio capture: Using Web Audio API to record user speech
  • Voice Activity Detection (VAD): Detecting when users start/stop speaking (a simple energy-based sketch follows this list)
  • Streaming to STT: Sending audio chunks to Speech-to-Text services
  • Playing TTS responses: Rendering Text-to-Speech audio in browsers
  • Duplex conversations: Managing simultaneous input/output audio streams
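
As a sketch of audio capture plus a crude VAD, the following uses getUserMedia and an AnalyserNode to compute the RMS energy of each audio window. The 0.02 threshold is an illustrative value you would tune per device and environment; production VAD typically uses a trained model rather than raw energy.

```typescript
// Capture microphone audio and run a crude energy-based VAD.
// The 0.02 threshold is illustrative; real tuning depends on the mic and room.

async function startVoiceDetection(onSpeech: (speaking: boolean) => void) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const audioContext = new AudioContext();
  const source = audioContext.createMediaStreamSource(stream);
  const analyser = audioContext.createAnalyser();
  analyser.fftSize = 2048;
  source.connect(analyser);

  const samples = new Float32Array(analyser.fftSize);
  let speaking = false;

  const tick = () => {
    analyser.getFloatTimeDomainData(samples);
    // Root-mean-square energy of the current window.
    const rms = Math.sqrt(samples.reduce((sum, s) => sum + s * s, 0) / samples.length);
    const nowSpeaking = rms > 0.02;
    if (nowSpeaking !== speaking) {
      speaking = nowSpeaking;
      onSpeech(speaking); // fires when the user starts or stops speaking
    }
    requestAnimationFrame(tick);
  };
  tick();
}
```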

TTS/STT Pipelines

You'll implement speech processing workflows:

  • Speech-to-Text integration: OpenAI Whisper, Google STT, Deepgram (a Whisper example follows this list)
  • Text-to-Speech pipelines: OpenAI TTS, ElevenLabs, Google TTS
  • Latency optimization: Chunked processing, streaming audio, parallel pipelines
  • Quality tradeoffs: Balancing accuracy, latency, and cost
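
A minimal STT round trip might look like the following: record a few seconds with MediaRecorder, then post the blob to OpenAI's transcription endpoint. The endpoint and field names match OpenAI's documented API at the time of writing; in a real app the API key stays on your server, never in browser code.

```typescript
// Record a short utterance with MediaRecorder and send it to OpenAI's
// transcription endpoint. Check the OpenAI docs for current details.

async function transcribeOnce(apiKey: string): Promise<string> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);
  const chunks: Blob[] = [];

  recorder.ondataavailable = (e) => chunks.push(e.data);
  const stopped = new Promise<void>((resolve) => (recorder.onstop = () => resolve()));

  recorder.start();
  await new Promise((r) => setTimeout(r, 5000)); // record ~5 seconds
  recorder.stop();
  await stopped;

  const form = new FormData();
  form.append('file', new Blob(chunks, { type: 'audio/webm' }), 'speech.webm');
  form.append('model', 'whisper-1');

  const res = await fetch('https://api.openai.com/v1/audio/transcriptions', {
    method: 'POST',
    headers: { Authorization: `Bearer ${apiKey}` },
    body: form,
  });
  const { text } = await res.json();
  return text;
}
```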

Multimodal Interactions

You'll create rich AI experiences beyond text:

  • Image/screen capture: Allowing AI to see user screens or uploaded images (a capture sketch follows this list)
  • Tool visualization: Showing when/how AI uses external tools
  • Rich media responses: Rendering charts, tables, code blocks from AI
  • Interactive elements: Buttons, forms, and widgets within AI conversations
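
As one concrete example, the sketch below grabs a single frame of the user's screen with getDisplayMedia and encodes it as a JPEG data URL, ready to attach to a vision-capable model request.

```typescript
// Grab one frame of the user's screen as a JPEG data URL. getDisplayMedia
// prompts the user to pick a screen or window; nothing is captured without
// consent.

async function captureScreenFrame(): Promise<string> {
  const stream = await navigator.mediaDevices.getDisplayMedia({ video: true });
  const video = document.createElement('video');
  video.srcObject = stream;
  await video.play();

  const canvas = document.createElement('canvas');
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  canvas.getContext('2d')!.drawImage(video, 0, 0);

  // Stop capture immediately; we only needed a single frame.
  stream.getTracks().forEach((track) => track.stop());
  return canvas.toDataURL('image/jpeg', 0.8);
}
```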

Mobile & PWA Considerations

You'll build AI experiences that work everywhere:

  • Progressive Web Apps: Offline-first capabilities for AI tools
  • Mobile optimization: Touch interfaces, responsive layouts, gesture controls
  • Background processing: Handling audio while the app is backgrounded
  • Permission management: Microphone, camera, location access flows (an example follows this list)
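
A minimal microphone permission flow might look like this. The error name comes from the getUserMedia specification; the returned states are an illustrative convention, not a standard API.

```typescript
// Request microphone access up front and map the browser's failure modes to
// user-facing states.

type MicPermission = 'granted' | 'denied' | 'unavailable';

async function requestMicrophone(): Promise<MicPermission> {
  try {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    stream.getTracks().forEach((t) => t.stop()); // release until actually needed
    return 'granted';
  } catch (err) {
    if (err instanceof DOMException && err.name === 'NotAllowedError') {
      return 'denied'; // user declined, or the page lacks a secure context
    }
    return 'unavailable'; // no device, or the device is in use elsewhere
  }
}
```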

Load, Cost, and Quality of Service

You'll optimize realtime performance:

  • Backpressure handling: Slowing down when systems are overloaded
  • Fallback strategies: Degrading gracefully when primary services fail
  • Caching: Semantic caching for repeated AI queries
  • Rate limiting: Managing costs while maintaining user experience (a token-bucket sketch follows this list)
  • Token budgeting: Staying within context window limits
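
As a sketch of client-side rate limiting, here is a small token bucket; the capacity and refill rate are illustrative, and in production you would enforce limits on the server as well.

```typescript
// A small token-bucket rate limiter: each request spends one token; tokens
// refill at a fixed rate. Capacity and refill values here are illustrative.

class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private capacity = 10, private refillPerSecond = 2) {
    this.tokens = capacity;
  }

  tryAcquire(): boolean {
    const elapsed = (Date.now() - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerSecond);
    this.lastRefill = Date.now();
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false; // caller should queue, degrade, or surface a retry hint
  }
}

// Usage: gate outgoing model calls so a chatty client can't run up costs.
const bucket = new TokenBucket(10, 2);
if (bucket.tryAcquire()) {
  // proceed with the AI request
}
```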

Prerequisites

This part builds on:

  • Part 5 (Python): Understanding async patterns that apply to TypeScript/JavaScript
  • Part 6 (AI Native): Knowing agent APIs (OpenAI SDK, MCP) you'll integrate with
  • Part 9 (TypeScript): Language fundamentals, async patterns, HTTP/WebSocket communication
  • Part 10 (Frontends): Chat UIs, streaming responses, component architecture

Complete Part 10 before starting this part.


What Makes This Different

Traditional audio/video courses teach media processing. This part teaches voice and realtime for AI agents:

Traditional approach:

  • Record and playback audio files
  • Build video conferencing apps
  • Handle media encoding/decoding

Our approach:

  • Stream voice to AI and back in real time
  • Handle variable AI response latencies in voice flows
  • Build duplex conversations where AI and human can interrupt each other
  • Optimize for natural conversation rhythm

You're building conversational AI, not just media apps.


Real-World Applications

These skills enable you to build:

Voice AI Applications:

  • Voice-controlled home automation
  • AI phone assistants with natural conversation flow
  • Language learning apps with pronunciation feedback
  • Accessibility tools for visually impaired users

Realtime Collaboration:

  • Shared AI workspaces where teams interact with agents together
  • Live coding assistants that respond as you type
  • Multiplayer AI games

Multimodal Products:

  • AI that can see and describe your screen
  • Visual debugging assistants
  • Document analysis with image understanding

Chapter Progression

This part's chapters build up your realtime and voice capabilities:

Realtime APIs for Agents

Implement SSE, WebSockets, and WebRTC for bidirectional agent communication. Handle reconnection, heartbeats, and graceful degradation.

Browser Audio Capture

Use Web Audio API to capture user speech. Implement Voice Activity Detection to know when users are speaking.

TTS/STT Pipelines

Build end-to-end speech processing workflows. Integrate with OpenAI Whisper, Google STT, ElevenLabs, and optimize for latency.

Multimodal IO

Add image/screen capture, tool visualization, and rich media rendering to your AI interfaces.

Mobile & PWA

Optimize for mobile devices, implement Progressive Web App patterns, and handle background audio processing.

Load, Cost, and QoS

Manage backpressure, implement fallback strategies, and optimize costs for realtime AI systems.


Pedagogical Approach

This part uses all four teaching layers:

  • Layer 1 (Manual Foundation): Understanding audio APIs, WebSocket protocols, streaming patterns
  • Layer 2 (AI Collaboration): Building voice components with Claude Code/Cursor assistance
  • Layer 3 (Intelligence Design): Creating reusable audio utilities, streaming patterns, voice pipelines
  • Layer 4 (Spec-Driven): Implementing complete voice AI products from specifications

You'll build progressively: audio capture → speech recognition → AI processing → speech synthesis → natural conversation.


Success Metrics

You succeed when you can:

  • ✅ Implement realtime communication with SSE/WebSockets/WebRTC
  • ✅ Capture browser audio and detect voice activity
  • ✅ Build STT/TTS pipelines with latency optimization
  • ✅ Create multimodal experiences (text, voice, images)
  • ✅ Optimize for mobile devices and Progressive Web Apps
  • ✅ Manage performance, cost, and quality of service for realtime systems

What You'll Build

Capstone projects:

  1. Voice AI Interface: Browser-based voice assistant with STT/TTS integration and natural conversation flow
  2. Multimodal Agent: AI that can see your screen, hear your voice, and respond with rich media
  3. Mobile AI App: Progressive Web App with offline capabilities and mobile optimization

By the end, you'll have built complete voice-enabled AI experiences.


Looking Ahead

After mastering realtime and voice, you're ready for Part 12: Agentic AI is the Future—exploring emerging patterns like the Agentic Web, Agentic Organizations, and Agentic Commerce.

You've built the full interactive stack: Backend (Parts 5-7), Language (Part 9), Frontend (Part 10), Voice/Realtime (Part 11). Part 12 shows you where this technology is heading.