ARIA — Multipurpose AI Voice Agent

Real-time phone AI built for adaptable workflows: Twilio media streams, a Node/Hono WebSocket bridge, Gemini Live or a Sequential Deepgram → Gemini → dual-TTS path, local VAD for barge-in, Supabase persistence, and Upstash-backed sessions. The shipped demo focuses on multilingual clinic reception — the same architecture extends to other tool-driven voice flows.

// AI Systems / Real-Time / Telephony · Multipurpose voice
ARIA: active call, real-time transcription, booking and SMS tool flows
Multipurpose voice stack · clinic reception reference · multilingual STT/TTS · Live vs Sequential · Supabase + Redis · local VAD barge-in
  • ARIA is a multipurpose AI voice agent — Twilio telephony, a Node/Hono WebSocket bridge, and switchable Live vs Sequential stacks so workflows aren’t limited to one vertical. The production reference build automates a medical clinic receptionist: callers are greeted, handled in their own language, and guided through booking with SMS follow-up.
  • End-to-end multilingual support — detects spoken language and continues the whole conversation in that language (Deepgram nova-2 multi + BCP-47 normalization in Sequential mode; native detection in Gemini Live).
  • Dual-provider TTS in Sequential mode (tts.ts):
    • English (en-*): Deepgram Aura 2 (e.g. aura-2-thalia-en) for low-latency output
    • All other languages: Google Cloud TTS (µ-law, 8 kHz); if GOOGLE_TTS_API_KEY is unset, falls back gracefully to the English Deepgram voice with a warning
  • Two switchable call handlers (toggle via CallHandler import in server.ts):
    • LiveCallHandler: Gemini Multimodal Live (gemini-3.1-flash-live-preview) — native audio in/out, Aoede voice, real-time function calling and transcription; 8 kHz µ-law upsampled to 16 kHz PCM ingress, 24 kHz PCM downsampled to 8 kHz µ-law egress in 20 ms frames
    • SequentialCallHandler: Twilio µ-law → Deepgram streaming STT (nova-2 multi) → Gemini 2.5 Flash (text, streaming + tools) → sentence splitter → dual TTS; parallel TTS fetch with a serialized speak queue for low first-word latency
    • Sequential extras: echo gate with a 300 ms post-speak cooldown, a reprompt after 4 s of silence, and a two-layer booking safeguard in CONFIRMING (a forced tool call plus a guard against premature booking claims)
  • Local Silero ONNX VAD (@ericedouard/vad-node-realtime) for instant barge-in:
    • Caller speech clears Twilio’s outbound buffer (clear) and, in Live mode, sends turnComplete: true to preempt the model mid-turn
  • Live sessions inject [SYSTEM STATE UPDATE] user messages after each tool execution so booking state stays aligned despite the Live API’s persistent context.
  • State machine with full logging: GREETING → COLLECTING_INFO → CONFIRMING → BOOKED (plus FAILED); dynamic instructions keep the LLM aware of missing fields before booking.
  • Appointment flow:
    • Collects name, date, time, and reason via dialogue; checks slot availability against PostgreSQL
    • Confirms verbally and sends SMS confirmation via Twilio
    • Live mode: post-booking hangup waits for Gemini’s goodbye audio (with a 15 s safety timer)
  • Data: Supabase for appointments and call_logs; Upstash Redis (REST) for session keys with ~1 h TTL and deleteSession cleanup on hang-up.
  • Stack wiring: Hono for HTTP + WebSocket routes (/twiml, /media-stream), alawmulaw for µ-law decode, onnxruntime-node for VAD inference.
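
The dual-provider TTS routing described in the list above can be sketched as a small pure function. This is a minimal illustration, not the contents of tts.ts: the names TtsRoute and pickTtsRoute, the en-US fallback tag, and the shape of the returned object are all assumptions made for the example. Only the routing rule (en-* to Deepgram Aura 2, everything else to Google Cloud TTS, English fallback with a warning when GOOGLE_TTS_API_KEY is unset) comes from the description.

```typescript
// Hypothetical types and names, for illustration only (not taken from tts.ts).
type TtsProvider = "deepgram" | "google";

interface TtsRoute {
  provider: TtsProvider;
  language: string; // BCP-47 tag that will actually be synthesized
  voice?: string;   // only set for the Deepgram route
  warning?: string; // present when a fallback was applied
}

// Example voice named in the feature list; the fallback tag is an assumption.
const DEEPGRAM_VOICE = "aura-2-thalia-en";
const ENGLISH_FALLBACK = "en-US";

function pickTtsRoute(languageTag: string, googleApiKey: string | undefined): TtsRoute {
  const tag = languageTag.trim();
  // Primary subtag of the BCP-47 tag, e.g. "en" for "en-GB".
  const [primary = ""] = tag.toLowerCase().split("-");

  if (primary === "en") {
    return { provider: "deepgram", language: tag, voice: DEEPGRAM_VOICE };
  }
  if (googleApiKey) {
    return { provider: "google", language: tag };
  }
  // No Google key: degrade to English speech rather than failing the call.
  return {
    provider: "deepgram",
    language: ENGLISH_FALLBACK,
    voice: DEEPGRAM_VOICE,
    warning: `GOOGLE_TTS_API_KEY is unset; falling back to English for "${tag}"`,
  };
}
```

A caller would typically pass process.env.GOOGLE_TTS_API_KEY as the second argument and log the warning field when it is present.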

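The Live-mode ingress step mentioned above (8 kHz µ-law from Twilio upsampled to 16 kHz PCM) can be illustrated as follows. This is a sketch under stated assumptions: the project uses the alawmulaw package for decoding, whereas this example spells out the standard G.711 µ-law expansion by hand so that it is self-contained, and the 2x upsampling uses simple linear interpolation, which may differ from the resampling the project actually performs. The function names are hypothetical.

```typescript
// Standard G.711 µ-law expansion of one byte to a 16-bit linear sample.
function decodeMuLawSample(byte: number): number {
  const u = ~byte & 0xff;           // µ-law bytes are stored bit-inverted
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -magnitude : magnitude;
}

// Assumption: 2x upsampling by linear interpolation (8 kHz -> 16 kHz).
// Each input sample is followed by the midpoint to the next sample.
function upsample2x(input: Int16Array): Int16Array {
  const out = new Int16Array(input.length * 2);
  for (let i = 0; i < input.length; i++) {
    const current = input[i];
    const next = i + 1 < input.length ? input[i + 1] : current;
    out[2 * i] = current;
    out[2 * i + 1] = (current + next) >> 1;
  }
  return out;
}

// Hypothetical helper: one Twilio media payload (already base64-decoded)
// converted to 16 kHz, 16-bit PCM.
function muLawToPcm16k(payload: Uint8Array): Int16Array {
  const pcm8k = new Int16Array(payload.length);
  for (let i = 0; i < payload.length; i++) {
    pcm8k[i] = decodeMuLawSample(payload[i]);
  }
  return upsample2x(pcm8k);
}
```

A 20 ms frame at 8 kHz is 160 µ-law bytes, so each frame yields 320 PCM samples at 16 kHz. Linear interpolation keeps the example short; a production resampler would often add low-pass filtering to limit imaging artifacts.
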
// tech

TypeScript · Node.js · Hono · Twilio · WebSockets · Gemini Live · Gemini 2.5 Flash · Deepgram · Aura 2 · Google Cloud TTS · Supabase · Upstash Redis · Silero VAD · onnxruntime-node · alawmulaw