Part 11: Building Realtime and Voice Agents
You've built chat interfaces in Part 10—rendering streaming responses, visualizing tool calls, and deploying frontends. Now you'll add realtime communication and voice capabilities—the technologies that make AI interactions feel natural, immediate, and conversational.
This part teaches you to build voice-enabled AI agents, implement realtime bidirectional communication, and create multimodal experiences that go beyond text.
Why Realtime and Voice Matter
Text chat is powerful, but voice is natural. Humans evolved to speak, not type:
- Voice interfaces: Hands-free interaction, accessibility, natural conversation flow
- Realtime communication: Immediate feedback, duplex conversations, live collaboration
- Multimodal IO: Images, screens, audio—richer context for AI understanding
Voice AI is the next frontier. This part prepares you to build it.
What You'll Learn
Realtime APIs for Agents
You'll implement bidirectional communication patterns:
- Server-Sent Events (SSE): Streaming AI responses from server to browser
- WebSockets: Full-duplex communication for conversational AI
- WebRTC: Peer-to-peer connections for voice/video AI
- Connection management: Reconnection logic, heartbeats, graceful disconnection (see the sketch after this list)
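To make connection management concrete, here is a minimal sketch of a reconnecting WebSocket client with a heartbeat, in browser TypeScript. The `wss://example.com/agent` URL and the `{"type": "ping"}` message shape are assumptions about your backend; a production client would also cap retries and distinguish clean from dirty closes.

```typescript
// Minimal reconnecting WebSocket client with a heartbeat.
// URL and ping message shape are assumptions; adapt to your backend.

type MessageHandler = (data: string) => void;

class AgentSocket {
  private ws?: WebSocket;
  private heartbeat?: ReturnType<typeof setInterval>;
  private retries = 0;

  constructor(
    private url: string,
    private onMessage: MessageHandler,
  ) {}

  connect(): void {
    this.ws = new WebSocket(this.url);

    this.ws.onopen = () => {
      this.retries = 0;
      // Ping every 15s so idle proxies don't silently drop the connection.
      this.heartbeat = setInterval(() => {
        this.ws?.send(JSON.stringify({ type: "ping" }));
      }, 15_000);
    };

    this.ws.onmessage = (event) => this.onMessage(event.data);

    this.ws.onclose = () => {
      clearInterval(this.heartbeat);
      // Reconnect with capped exponential backoff: 1s, 2s, 4s, ... max 30s.
      const delay = Math.min(1000 * 2 ** this.retries, 30_000);
      this.retries += 1;
      setTimeout(() => this.connect(), delay);
    };
  }

  send(payload: unknown): void {
    this.ws?.send(JSON.stringify(payload));
  }
}

// Usage: stream user turns up, agent tokens down.
const socket = new AgentSocket("wss://example.com/agent", (data) => {
  console.log("agent:", data);
});
socket.connect();
```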
Browser Audio Capabilities
You'll build voice-enabled AI interfaces:
- Audio capture: Using Web Audio API to record user speech
- Voice Activity Detection (VAD): Detecting when users start/stop speaking (sketched after this list)
- Streaming to STT: Sending audio chunks to Speech-to-Text services
- Playing TTS responses: Rendering Text-to-Speech audio in browsers
- Duplex conversations: Managing simultaneous input/output audio streams
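As a preview, here is a minimal sketch of microphone capture with a crude energy-threshold VAD using the Web Audio API. The 0.01 RMS cutoff and the `#start` button id are illustrative assumptions; production systems typically use a trained VAD model (e.g., Silero) rather than a raw energy threshold.

```typescript
// Capture microphone audio and run a crude energy-threshold VAD.
// The 0.01 RMS threshold is an illustrative assumption; tune per mic/room.

async function startVad(onSpeech: (speaking: boolean) => void) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const ctx = new AudioContext();
  const source = ctx.createMediaStreamSource(stream);
  const analyser = ctx.createAnalyser();
  analyser.fftSize = 2048;
  source.connect(analyser);

  const buf = new Float32Array(analyser.fftSize);
  let speaking = false;

  const tick = () => {
    analyser.getFloatTimeDomainData(buf);
    // Root-mean-square energy of the current window.
    const rms = Math.sqrt(buf.reduce((s, x) => s + x * x, 0) / buf.length);
    const nowSpeaking = rms > 0.01;
    if (nowSpeaking !== speaking) {
      speaking = nowSpeaking;
      onSpeech(speaking);
    }
    requestAnimationFrame(tick);
  };
  tick();
}

// Wire to a user gesture (hypothetical #start button) so the browser
// allows the AudioContext to start.
document.querySelector("#start")!.addEventListener("click", () => {
  startVad((s) => console.log(s ? "speech started" : "speech ended"));
});
```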
TTS/STT Pipelines
You'll implement speech processing workflows:
- Speech-to-Text integration: OpenAI Whisper, Google STT, Deepgram (an STT/TTS round trip is sketched after this list)
- Text-to-Speech pipelines: OpenAI TTS, ElevenLabs, Google TTS
- Latency optimization: Chunked processing, streaming audio, parallel pipelines
- Quality tradeoffs: Balancing accuracy, latency, and cost
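As a taste of these pipelines, here is a minimal server-side sketch (Node 18+) of STT and TTS calls against OpenAI's audio endpoints. The endpoint paths and model names (`whisper-1`, `tts-1`, voice `alloy`) reflect the API as of this writing; verify against the OpenAI docs before relying on them.

```typescript
// Minimal STT + TTS calls against OpenAI's audio endpoints (Node 18+).
// Model names and paths are current as of this writing; check the docs.

const OPENAI_KEY = process.env.OPENAI_API_KEY!;

// Speech-to-text: multipart upload of recorded audio.
async function transcribe(audio: Blob): Promise<string> {
  const form = new FormData();
  form.append("file", audio, "speech.webm");
  form.append("model", "whisper-1");
  const res = await fetch("https://api.openai.com/v1/audio/transcriptions", {
    method: "POST",
    headers: { Authorization: `Bearer ${OPENAI_KEY}` },
    body: form,
  });
  const { text } = await res.json();
  return text;
}

// Text-to-speech: returns audio bytes you can write to a file or
// hand to an <audio> element on the client.
async function speak(text: string): Promise<ArrayBuffer> {
  const res = await fetch("https://api.openai.com/v1/audio/speech", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${OPENAI_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "tts-1", voice: "alloy", input: text }),
  });
  return res.arrayBuffer();
}
```

Note that this is the simple request/response form; the latency-optimized versions you'll build later stream audio chunks in both directions instead of waiting for complete files.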
Multimodal Interactions
You'll create rich AI experiences beyond text:
- Image/screen capture: Allowing AI to see user screens or uploaded images (sketched after this list)
- Tool visualization: Showing when/how AI uses external tools
- Rich media responses: Rendering charts, tables, code blocks from AI
- Interactive elements: Buttons, forms, and widgets within AI conversations
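To preview screen capture, here is a minimal sketch that grabs a single frame of the user's screen and encodes it as a PNG data URL, ready to attach to a vision-capable model request. How you send the image downstream depends on the model's API and is omitted here.

```typescript
// Grab one frame of the user's screen and encode it as a data URL.

async function captureScreenFrame(): Promise<string> {
  // The browser prompts the user to pick a screen, window, or tab.
  const stream = await navigator.mediaDevices.getDisplayMedia({ video: true });
  const track = stream.getVideoTracks()[0];

  // Play the live track in a detached <video>, then draw it onto a canvas.
  const video = document.createElement("video");
  video.srcObject = stream;
  video.muted = true;
  await video.play();

  const canvas = document.createElement("canvas");
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  canvas.getContext("2d")!.drawImage(video, 0, 0);

  track.stop(); // release the capture immediately after one frame
  return canvas.toDataURL("image/png");
}
```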
Mobile & PWA Considerations
You'll build AI experiences that work everywhere:
- Progressive Web Apps: Offline-first capabilities for AI tools
- Mobile optimization: Touch interfaces, responsive layouts, gesture controls
- Background processing: Handling audio while the app is in the background
- Permission management: Microphone, camera, location access flows (see the sketch after this list)
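As a sketch of a permission flow, here is one way to check microphone access before prompting. Permissions API support for the "microphone" name varies by browser (Chromium supports it; others may throw), hence the try/catch fallback.

```typescript
// Check microphone permission before prompting, falling back gracefully.

async function ensureMicAccess(): Promise<MediaStream | null> {
  try {
    const status = await navigator.permissions.query({
      name: "microphone" as PermissionName,
    });
    if (status.state === "denied") {
      // Surface UI explaining how to re-enable the mic in browser settings.
      return null;
    }
  } catch {
    // Permissions API unavailable; getUserMedia below will prompt anyway.
  }
  // This call triggers the permission prompt if not yet granted.
  return navigator.mediaDevices.getUserMedia({ audio: true });
}
```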
Load, Cost, and Quality of Service
You'll optimize realtime performance:
- Backpressure handling: Slowing producers when downstream systems are overloaded
- Fallback strategies: Degrading gracefully when primary services fail
- Caching: Semantic caching for repeated AI queries
- Rate limiting: Managing costs while maintaining user experience (a token-bucket sketch follows this list)
- Token budgeting: Staying within context window limits
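To illustrate rate limiting, here is a minimal token-bucket sketch: each request spends one token, and tokens refill at a fixed rate. The capacity and refill numbers are illustrative assumptions to tune against your provider's limits and budget.

```typescript
// Token-bucket rate limiter: bursts up to `capacity`, sustained rate
// of `refillPerSecond`. Numbers below are illustrative only.

class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(
    private capacity: number,        // max burst size
    private refillPerSecond: number, // sustained request rate
  ) {
    this.tokens = capacity;
  }

  tryAcquire(): boolean {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsed * this.refillPerSecond,
    );
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Allow bursts of 5 requests, 1 request/second sustained.
const limiter = new TokenBucket(5, 1);
if (!limiter.tryAcquire()) {
  // Over budget: queue the request, degrade to a cheaper model, or retry.
}
```

The same bucket works client-side (to throttle outgoing requests) or server-side (to cap per-user spend).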
Prerequisites
This part builds on:
- Part 5 (Python): Understanding async patterns that apply to TypeScript/JavaScript
- Part 6 (AI Native): Knowing agent APIs (OpenAI SDK, MCP) you'll integrate with
- Part 9 (TypeScript): Language fundamentals, async patterns, HTTP/WebSocket communication
- Part 10 (Frontends): Chat UIs, streaming responses, component architecture
You should complete Part 10 before starting this part.
What Makes This Different
Traditional audio/video courses teach media processing. This part teaches voice and realtime for AI agents:
Traditional approach:
- Record and playback audio files
- Build video conferencing apps
- Handle media encoding/decoding
Our approach:
- Stream voice to AI and back in real-time
- Handle variable AI response latencies in voice flows
- Build duplex conversations where AI and human can interrupt each other
- Optimize for natural conversation rhythm
You're building conversational AI, not just media apps.
Real-World Applications
These skills enable you to build:
Voice AI Applications:
- Voice-controlled home automation
- AI phone assistants with natural conversation flow
- Language learning apps with pronunciation feedback
- Accessibility tools for vision-impaired users
Realtime Collaboration:
- Shared AI workspaces where teams interact with agents together
- Live coding assistants that respond as you type
- Multiplayer AI games
Multimodal Products:
- AI that can see and describe your screen
- Visual debugging assistants
- Document analysis with image understanding
Chapter Progression
This part's chapters build realtime and voice capability:
Realtime APIs for Agents
Implement SSE, WebSockets, and WebRTC for bidirectional agent communication. Handle reconnection, heartbeats, and graceful degradation.
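For a flavor of the SSE pattern, here is a minimal sketch of consuming a server-sent token stream with EventSource. The `/agent/stream` path, the per-token message shape, and the named "done" event are assumptions about your backend's protocol.

```typescript
// Consume a server-sent token stream. Path and event shapes are
// assumptions about the backend's protocol.

const source = new EventSource("/agent/stream?prompt=hello");

// Default "message" events carry one token each in this assumed protocol.
source.onmessage = (event) => {
  const { token } = JSON.parse(event.data);
  document.querySelector("#reply")!.textContent += token;
};

// EventSource reconnects automatically on transient errors, so close
// only when the server signals completion with a named "done" event.
source.addEventListener("done", () => source.close());
```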
Browser Audio Capture
Use Web Audio API to capture user speech. Implement Voice Activity Detection to know when users are speaking.
TTS/STT Pipelines
Build end-to-end speech processing workflows. Integrate with OpenAI Whisper, Google STT, ElevenLabs, and optimize for latency.
Multimodal IO
Add image/screen capture, tool visualization, and rich media rendering to your AI interfaces.
Mobile & PWA
Optimize for mobile devices, implement Progressive Web App patterns, and handle background audio processing.
Load, Cost, and QoS
Manage backpressure, implement fallback strategies, and optimize costs for realtime AI systems.
Pedagogical Approach
This part uses all four teaching layers:
- Layer 1 (Manual Foundation): Understanding audio APIs, WebSocket protocols, and streaming patterns
- Layer 2 (AI Collaboration): Building voice components with Claude Code/Cursor assistance
- Layer 3 (Intelligence Design): Creating reusable audio utilities, streaming patterns, and voice pipelines
- Layer 4 (Spec-Driven): Implementing complete voice AI products from specifications
You'll build progressively: audio capture → speech recognition → AI processing → speech synthesis → natural conversation.
Success Metrics
You succeed when you can:
- ✅ Implement realtime communication with SSE/WebSockets/WebRTC
- ✅ Capture browser audio and detect voice activity
- ✅ Build STT/TTS pipelines with latency optimization
- ✅ Create multimodal experiences (text, voice, images)
- ✅ Optimize for mobile devices and Progressive Web Apps
- ✅ Manage performance, cost, and quality of service for realtime systems
What You'll Build
Capstone projects:
- Voice AI Interface: Browser-based voice assistant with STT/TTS integration and natural conversation flow
- Multimodal Agent: AI that can see your screen, hear your voice, and respond with rich media
- Mobile AI App: Progressive Web App with offline capabilities and mobile optimization
By the end, you'll have built complete voice-enabled AI experiences.
Looking Ahead
After mastering realtime and voice, you're ready for Part 12: Agentic AI is the Future—exploring emerging patterns like the Agentic Web, Agentic Organizations, and Agentic Commerce.
You've built the full interactive stack: Backend (Parts 5-7), Language (Part 9), Frontend (Part 10), Voice/Realtime (Part 11). Part 12 shows you where this technology is heading.