Part 11: Building Realtime and Voice Agents

You've built chat interfaces in Part 10—rendering streaming responses, visualizing tool calls, and deploying frontends. Now you'll add realtime communication and voice capabilities—the technologies that make AI interactions feel natural, immediate, and conversational.

This part teaches you to build voice-enabled AI agents, implement realtime bidirectional communication, and create multimodal experiences that go beyond text.


Why Realtime and Voice Matter

Text chat is powerful, but voice is natural. Humans evolved to speak, not type:

  • Voice interfaces: Hands-free interaction, accessibility, natural conversation flow
  • Realtime communication: Immediate feedback, duplex conversations, live collaboration
  • Multimodal IO: Images, screens, audio—richer context for AI understanding

Voice AI is the next frontier. This part prepares you to build it.


What You'll Learn

Realtime APIs for Agents

You'll implement bidirectional communication patterns:

  • Server-Sent Events (SSE): Streaming AI responses from server to browser
  • WebSockets: Full-duplex communication for conversational AI
  • WebRTC: Peer-to-peer connections for voice/video AI
  • Connection management: Reconnection logic, heartbeats, graceful disconnection (a minimal client sketch follows this list)
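
To make the connection-management pattern concrete, here is a minimal sketch in TypeScript. The endpoint URL and ping message shape are hypothetical placeholders, and the backoff values are illustrative starting points, not a definitive implementation.

```typescript
// Minimal reconnecting WebSocket client with a heartbeat.
// The endpoint URL and message shape are placeholders, not a real service.

type MessageHandler = (data: string) => void;

class AgentSocket {
  private ws?: WebSocket;
  private heartbeat?: ReturnType<typeof setInterval>;
  private retryDelay = 1000; // grows on each failure, capped below

  constructor(private url: string, private onMessage: MessageHandler) {}

  connect(): void {
    this.ws = new WebSocket(this.url);

    this.ws.onopen = () => {
      this.retryDelay = 1000; // reset backoff on a successful connection
      // Send a lightweight ping so intermediaries keep the connection alive.
      this.heartbeat = setInterval(() => this.ws?.send('{"type":"ping"}'), 15000);
    };

    this.ws.onmessage = (event) => this.onMessage(event.data);

    this.ws.onclose = () => {
      clearInterval(this.heartbeat);
      // Reconnect with exponential backoff, capped at 30 seconds.
      setTimeout(() => this.connect(), this.retryDelay);
      this.retryDelay = Math.min(this.retryDelay * 2, 30000);
    };
  }

  send(data: string): void {
    if (this.ws?.readyState === WebSocket.OPEN) this.ws.send(data);
  }
}

// Usage: stream user turns to the agent, render responses as they arrive.
const socket = new AgentSocket('wss://example.com/agent', (msg) => console.log(msg));
socket.connect();
```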

Browser Audio Capabilities

You'll build voice-enabled AI interfaces:

  • Audio capture: Using Web Audio API to record user speech
  • Voice Activity Detection (VAD): Detecting when users start/stop speaking (a simple energy-based sketch follows this list)
  • Streaming to STT: Sending audio chunks to Speech-to-Text services
  • Playing TTS responses: Rendering Text-to-Speech audio in browsers
  • Duplex conversations: Managing simultaneous input/output audio streams
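
As a sketch of audio capture plus a crude VAD, the following uses getUserMedia and an AnalyserNode to compute the RMS energy of each audio window. The 0.02 threshold is an illustrative value you would tune per device and environment; production VAD typically uses a trained model rather than raw energy.

```typescript
// Capture microphone audio and run a crude energy-based VAD.
// The 0.02 threshold is illustrative; real tuning depends on the mic and room.

async function startVoiceDetection(onSpeech: (speaking: boolean) => void) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const audioContext = new AudioContext();
  const source = audioContext.createMediaStreamSource(stream);
  const analyser = audioContext.createAnalyser();
  analyser.fftSize = 2048;
  source.connect(analyser);

  const samples = new Float32Array(analyser.fftSize);
  let speaking = false;

  const tick = () => {
    analyser.getFloatTimeDomainData(samples);
    // Root-mean-square energy of the current window.
    const rms = Math.sqrt(samples.reduce((sum, s) => sum + s * s, 0) / samples.length);
    const nowSpeaking = rms > 0.02;
    if (nowSpeaking !== speaking) {
      speaking = nowSpeaking;
      onSpeech(speaking); // fires when the user starts or stops speaking
    }
    requestAnimationFrame(tick);
  };
  tick();
}
```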

TTS/STT Pipelines

You'll implement speech processing workflows:

  • Speech-to-Text integration: OpenAI Whisper, Google STT, Deepgram (a Whisper example follows this list)
  • Text-to-Speech pipelines: OpenAI TTS, ElevenLabs, Google TTS
  • Latency optimization: Chunked processing, streaming audio, parallel pipelines
  • Quality tradeoffs: Balancing accuracy, latency, and cost
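
A minimal STT round trip might look like the following: record a few seconds with MediaRecorder, then post the blob to OpenAI's transcription endpoint. The endpoint and field names match OpenAI's documented API at the time of writing; in a real app the API key stays on your server, never in browser code.

```typescript
// Record a short utterance with MediaRecorder and send it to OpenAI's
// transcription endpoint. Check the OpenAI docs for current details.

async function transcribeOnce(apiKey: string): Promise<string> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);
  const chunks: Blob[] = [];

  recorder.ondataavailable = (e) => chunks.push(e.data);
  const stopped = new Promise<void>((resolve) => (recorder.onstop = () => resolve()));

  recorder.start();
  await new Promise((r) => setTimeout(r, 5000)); // record ~5 seconds
  recorder.stop();
  await stopped;

  const form = new FormData();
  form.append('file', new Blob(chunks, { type: 'audio/webm' }), 'speech.webm');
  form.append('model', 'whisper-1');

  const res = await fetch('https://api.openai.com/v1/audio/transcriptions', {
    method: 'POST',
    headers: { Authorization: `Bearer ${apiKey}` },
    body: form,
  });
  const { text } = await res.json();
  return text;
}
```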

Multimodal Interactions

You'll create rich AI experiences beyond text:

  • Image/screen capture: Allowing AI to see user screens or uploaded images (a capture sketch follows this list)
  • Tool visualization: Showing when/how AI uses external tools
  • Rich media responses: Rendering charts, tables, code blocks from AI
  • Interactive elements: Buttons, forms, and widgets within AI conversations
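
As one concrete example, the sketch below grabs a single frame of the user's screen with getDisplayMedia and encodes it as a JPEG data URL, ready to attach to a vision-capable model request.

```typescript
// Grab one frame of the user's screen as a JPEG data URL. getDisplayMedia
// prompts the user to pick a screen or window; nothing is captured without
// consent.

async function captureScreenFrame(): Promise<string> {
  const stream = await navigator.mediaDevices.getDisplayMedia({ video: true });
  const video = document.createElement('video');
  video.srcObject = stream;
  await video.play();

  const canvas = document.createElement('canvas');
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  canvas.getContext('2d')!.drawImage(video, 0, 0);

  // Stop capture immediately; we only needed a single frame.
  stream.getTracks().forEach((track) => track.stop());
  return canvas.toDataURL('image/jpeg', 0.8);
}
```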

Mobile & PWA Considerations

You'll build AI experiences that work everywhere:

  • Progressive Web Apps: Offline-first capabilities for AI tools
  • Mobile optimization: Touch interfaces, responsive layouts, gesture controls
  • Background processing: Handling audio while the app is backgrounded
  • Permission management: Microphone, camera, location access flows (an example follows this list)
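
A minimal microphone permission flow might look like this. The error name comes from the getUserMedia specification; the returned states are an illustrative convention, not a standard API.

```typescript
// Request microphone access up front and map the browser's failure modes to
// user-facing states.

type MicPermission = 'granted' | 'denied' | 'unavailable';

async function requestMicrophone(): Promise<MicPermission> {
  try {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    stream.getTracks().forEach((t) => t.stop()); // release until actually needed
    return 'granted';
  } catch (err) {
    if (err instanceof DOMException && err.name === 'NotAllowedError') {
      return 'denied'; // user declined, or the page lacks a secure context
    }
    return 'unavailable'; // no device, or the device is in use elsewhere
  }
}
```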

Load, Cost, and Quality of Service

You'll optimize realtime performance:

  • Backpressure handling: Slowing down when systems are overloaded
  • Fallback strategies: Degrading gracefully when primary services fail
  • Caching: Semantic caching for repeated AI queries
  • Rate limiting: Managing costs while maintaining user experience (a token-bucket sketch follows this list)
  • Token budgeting: Staying within context window limits
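
As a sketch of client-side rate limiting, here is a small token bucket; the capacity and refill rate are illustrative, and in production you would enforce limits on the server as well.

```typescript
// A small token-bucket rate limiter: each request spends one token; tokens
// refill at a fixed rate. Capacity and refill values here are illustrative.

class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private capacity = 10, private refillPerSecond = 2) {
    this.tokens = capacity;
  }

  tryAcquire(): boolean {
    const elapsed = (Date.now() - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerSecond);
    this.lastRefill = Date.now();
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false; // caller should queue, degrade, or surface a retry hint
  }
}

// Usage: gate outgoing model calls so a chatty client can't run up costs.
const bucket = new TokenBucket(10, 2);
if (bucket.tryAcquire()) {
  // proceed with the AI request
}
```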

Prerequisites

This part builds on:

  • Part 5 (Python): Understanding async patterns that apply to TypeScript/JavaScript
  • Part 6 (AI Native): Knowing agent APIs (OpenAI SDK, MCP) you'll integrate with
  • Part 9 (TypeScript): Language fundamentals, async patterns, HTTP/WebSocket communication
  • Part 10 (Frontends): Chat UIs, streaming responses, component architecture

Complete Part 10 before starting this part.


What Makes This Different

Traditional audio/video courses teach media processing. This part teaches voice and realtime for AI agents:

Traditional approach:

  • Record and playback audio files
  • Build video conferencing apps
  • Handle media encoding/decoding

Our approach:

  • Stream voice to AI and back in real time
  • Handle variable AI response latencies in voice flows
  • Build duplex conversations where AI and human can interrupt each other
  • Optimize for natural conversation rhythm

You're building conversational AI, not just media apps.


Real-World Applications

These skills enable you to build:

Voice AI Applications:

  • Voice-controlled home automation
  • AI phone assistants with natural conversation flow
  • Language learning apps with pronunciation feedback
  • Accessibility tools for visually impaired users

Realtime Collaboration:

  • Shared AI workspaces where teams interact with agents together
  • Live coding assistants that respond as you type
  • Multiplayer AI games

Multimodal Products:

  • AI that can see and describe your screen
  • Visual debugging assistants
  • Document analysis with image understanding

Chapter Progression

This part's chapters build up your realtime and voice capabilities:

Realtime APIs for Agents

Implement SSE, WebSockets, and WebRTC for bidirectional agent communication. Handle reconnection, heartbeats, and graceful degradation.

Browser Audio Capture

Use Web Audio API to capture user speech. Implement Voice Activity Detection to know when users are speaking.

TTS/STT Pipelines

Build end-to-end speech processing workflows. Integrate with OpenAI Whisper, Google STT, ElevenLabs, and optimize for latency.

Multimodal IO

Add image/screen capture, tool visualization, and rich media rendering to your AI interfaces.

Mobile & PWA

Optimize for mobile devices, implement Progressive Web App patterns, and handle background audio processing.

Load, Cost, and QoS

Manage backpressure, implement fallback strategies, and optimize costs for realtime AI systems.


Pedagogical Approach

This part uses all four teaching layers:

  • Layer 1 (Manual Foundation): Understanding audio APIs, WebSocket protocols, streaming patterns
  • Layer 2 (AI Collaboration): Building voice components with Claude Code/Cursor assistance
  • Layer 3 (Intelligence Design): Creating reusable audio utilities, streaming patterns, voice pipelines
  • Layer 4 (Spec-Driven): Implementing complete voice AI products from specifications

You'll build progressively: audio capture → speech recognition → AI processing → speech synthesis → natural conversation.


Success Metrics

You succeed when you can:

  • ✅ Implement realtime communication with SSE/WebSockets/WebRTC
  • ✅ Capture browser audio and detect voice activity
  • ✅ Build STT/TTS pipelines with latency optimization
  • ✅ Create multimodal experiences (text, voice, images)
  • ✅ Optimize for mobile devices and Progressive Web Apps
  • ✅ Manage performance, cost, and quality of service for realtime systems

What You'll Build

Capstone projects:

  1. Voice AI Interface: Browser-based voice assistant with STT/TTS integration and natural conversation flow
  2. Multimodal Agent: AI that can see your screen, hear your voice, and respond with rich media
  3. Mobile AI App: Progressive Web App with offline capabilities and mobile optimization

By the end, you'll have built complete voice-enabled AI experiences.


Looking Ahead

After mastering realtime and voice, you're ready for Part 12: Agentic AI is the Future—exploring emerging patterns like the Agentic Web, Agentic Organizations, and Agentic Commerce.

You've built the full interactive stack: Backend (Parts 5-7), Language (Part 9), Frontend (Part 10), Voice/Realtime (Part 11). Part 12 shows you where this technology is heading.