InstantAIguru Twilio ConversationRelay Architecture Documentation
Overview
The InstantAIguru Twilio ConversationRelay implementation is a voice AI system deployed across three layers:
- Twilio ConversationRelay - Handles telephony, WebSocket connections, STT/TTS
- CloudFlare Worker - Edge-based intelligent relay with preprocessing and interrupt handling
- AWS Lambda - Backend AI processing with RAG retrieval and response generation
This architecture enables real-time voice conversations with low latency, streaming responses, and sophisticated intent detection.
Architecture Layers
Layer 1: Twilio ConversationRelay (Telephony Layer)
Responsibilities:
- Receive incoming phone calls
- Handle audio streaming to/from caller
- Speech-to-Text (STT) conversion
- Text-to-Speech (TTS) playback
- DTMF digit collection
- WebSocket connection management to CloudFlare Worker
Key Features:
- Real-time bidirectional audio streaming
- Google Cloud STT with telephony-optimized model
- Configurable TTS providers (Google, Amazon Polly, Microsoft Azure)
- Voice and language customization per phone number
- Interrupt detection (when user speaks during AI response)
- DTMF detection for numeric input workflows
- Session handoff with custom parameters
TwiML Configuration: The TwiML response uses the <Connect> verb with the <ConversationRelay> noun to establish a WebSocket connection. It configures parameters for STT/TTS providers, voice selection, and language. Custom <Parameter> tags are used to pass session context (such as the active conversation thread and the agent persona) to the WebSocket server, along with the authenticated session credentials needed by the backend.
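As a rough illustration of the TwiML described above, the sketch below assembles a <Connect>/<ConversationRelay> response with <Parameter> children. The attribute names mirror the ones shown later in the visual flow diagram; the RelayOptions shape, the function names, and the example endpoint are hypothetical and are not taken from the actual webhook code.

```typescript
// Hypothetical input shape; the real handler loads these values from the
// per-number configuration profile.
interface RelayOptions {
  url: string;
  welcomeGreeting: string;
  language: string;
  ttsProvider: string;
  voice: string;
  parameters: { [name: string]: string };
}

// Escape the characters that are unsafe inside XML attribute values.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

function buildRelayTwiml(options: RelayOptions): string {
  // One <Parameter> tag per session-context entry.
  const parameterTags = Object.keys(options.parameters).map(
    (key) =>
      `      <Parameter name="${escapeXml(key)}" value="${escapeXml(options.parameters[key])}"/>`
  );
  const relayOpen =
    `    <ConversationRelay url="${escapeXml(options.url)}"` +
    ` welcomeGreeting="${escapeXml(options.welcomeGreeting)}"` +
    ` language="${escapeXml(options.language)}"` +
    ` ttsProvider="${escapeXml(options.ttsProvider)}"` +
    ` voice="${escapeXml(options.voice)}">`;
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    "  <Connect>",
    relayOpen,
    ...parameterTags,
    "    </ConversationRelay>",
    "  </Connect>",
    "</Response>",
  ].join("\n");
}
```

Escaping every attribute value matters here because session parameters such as the persona name come from configuration and may contain characters like & or quotes.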
Layer 2: CloudFlare Worker (Edge Relay Layer)
Responsibilities:
- WebSocket endpoint for Twilio ConversationRelay
- Edge-based preprocessing for simple intents (greetings, farewells)
- SSE stream consumption from AWS Lambda
- Real-time response chunking for TTS optimization
- Interrupt handling and session state management
- Multi-language support (English, Spanish)
WebSocket Routes: The Worker exposes distinct routes per environment (production and development) and per supported language (English and Spanish), so each call is handled by the correct preprocessing and localization logic.
Key Responsibilities:
Session Handling
Sets up the WebSocket connection and routes messages:
The session handler initializes the WebSocket connection and sets up event listeners. It handles the setup message to extract TwiML parameters into the session context, the prompt message to process user speech, dtmf for digit collection, and interrupt to handle user interruptions. It also ensures proper cleanup on connection close.
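The routing described above could look roughly like the following dispatcher. This is a sketch under assumptions: the handler interface and function names are invented for illustration, and the digit field on dtmf messages is assumed rather than shown in this document.

```typescript
// Message shapes follow the examples in this document; "digit" is an assumption.
type RelayMessage =
  | { type: "setup"; customParameters?: { [key: string]: string } }
  | { type: "prompt"; voicePrompt: string }
  | { type: "dtmf"; digit: string }
  | { type: "interrupt" };

// Hypothetical callbacks standing in for the real session logic.
interface SessionHandlers {
  onSetup(params: { [key: string]: string }): void;
  onPrompt(text: string): void;
  onDtmf(digit: string): void;
  onInterrupt(): void;
}

// Returns true when the message was recognized and dispatched.
function routeMessage(raw: string, handlers: SessionHandlers): boolean {
  let parsed: any;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    return false; // Malformed JSON is ignored rather than crashing the session.
  }
  if (!parsed || typeof parsed.type !== "string") {
    return false;
  }
  const message = parsed as RelayMessage;
  switch (message.type) {
    case "setup":
      handlers.onSetup(message.customParameters || {});
      return true;
    case "prompt":
      handlers.onPrompt(message.voicePrompt);
      return true;
    case "dtmf":
      handlers.onDtmf(message.digit);
      return true;
    case "interrupt":
      handlers.onInterrupt();
      return true;
    default:
      return false;
  }
}
```

In a Worker, a function like this would typically be called from the WebSocket message listener with the event's string payload.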
Input Processing
Core processing that handles preprocessing and Lambda communication:
Preprocessing Flow: The input processor is the core logic at the edge. It first checks for edge-based preprocessing matches (using AI or static rules). If a match is found, it sends an immediate response. If not, it encodes the user prompt and context, connects to the AWS Lambda backend via SSE, and streams the response chunks back to the client. It handles the parsing of SSE events (chunks, answers, context updates) and manages the session state.
Preprocessing Functions:
The worker includes 70+ regex patterns for fast detection of questions and requests, plus dedicated patterns for specific intents such as 'greeting' and 'farewell'. Simple intents receive an immediate edge response without backend latency.
System Message to Filler Mapping:
The worker converts Lambda progress messages (e.g., "Analyzing", "Searching", "Evaluating") into natural conversational fillers. This maps system states to random variations of "hold music" phrases (e.g., "Let me check that...") to maintain user engagement during processing delays.
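One way to picture this mapping is a keyword table with a random pick per keyword. The keywords below come from the progress examples in this document, while the filler phrases (other than "Let me check that...") and the function name are made up for illustration.

```typescript
// Illustrative filler table; the real phrase lists are not shown in this document.
const FILLERS: { [keyword: string]: string[] } = {
  analyzing: ["Let me think about that...", "One moment..."],
  searching: ["Let me check that...", "Let me look that up..."],
  evaluating: ["Almost there...", "Just a second..."],
};

// Returns a filler phrase for a progress message, or null when the message
// should stay silent. The random source is injectable so the choice can be
// made deterministic in tests.
function fillerFor(progress: string, random: () => number = Math.random): string | null {
  const lower = progress.toLowerCase();
  for (const keyword of Object.keys(FILLERS)) {
    if (lower.indexOf(keyword) !== -1) {
      const options = FILLERS[keyword];
      return options[Math.floor(random() * options.length)];
    }
  }
  return null;
}
```

Returning null for unmapped progress messages keeps internal status text, such as a rephrased question, from being spoken to the caller.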
Interrupt Handling
Cancels ongoing Lambda requests when the user interrupts:
The interrupt handler sets a break flag, cancels any ongoing Lambda requests using an AbortController, cancels the SSE reader, and waits for the session busy state to clear, ensuring the system is ready for the next user input immediately.
Session State Management:
The WebSocket object maintains internal flags to track the session status:
- Whether the system is currently processing a prompt.
- Whether a user interruption has occurred.
- Whether a backend AI request is pending.
- The active stream reader, to allow for immediate cancellation.
- The conversation context and parameters, preserved across turns.
Layer 3: AWS Lambda (Backend Processing Layer)
Responsibilities:
Voice Webhook Handler
Generates initial TwiML to establish the ConversationRelay connection:
The voice webhook handler verifies the Twilio request signature, sanitizes the phone number, and loads the configuration profile for that number. It initializes the chat context manager and determines the correct WebSocket route (development/production, English/Spanish). Finally, it generates and returns the XML response containing the <ConversationRelay> instruction with all necessary context parameters.
Handoff Handler
Handles user input during transfer scenarios:
The handoff handler processes the results of a Twilio <Gather> action initiated during a handoff or menu flow. It evaluates the user's input (DTMF or speech) against configured options and generates the subsequent TwiML to route the call (e.g., dialing a configured destination, playing a message, or hanging up).
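The sketch below shows one possible shape for this evaluation: the caller's input is looked up in a table of configured options and turned into TwiML. The <Dial>, <Say>, and <Hangup> verbs are standard TwiML; the option table, the fallback to hanging up on unmatched input, and all names are assumptions rather than the handler's actual behavior.

```typescript
// Hypothetical option model for a handoff or menu flow.
type HandoffAction =
  | { kind: "dial"; number: string }
  | { kind: "say"; message: string }
  | { kind: "hangup" };

function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// input is the DTMF digits or the speech result reported by <Gather>.
function handoffTwiml(input: string, options: { [key: string]: HandoffAction }): string {
  const key = input.trim().toLowerCase();
  // Assumption: unmatched input ends the call.
  const action: HandoffAction = Object.prototype.hasOwnProperty.call(options, key)
    ? options[key]
    : { kind: "hangup" };
  let body: string;
  switch (action.kind) {
    case "dial":
      body = `<Dial>${escapeXml(action.number)}</Dial>`;
      break;
    case "say":
      body = `<Say>${escapeXml(action.message)}</Say>`;
      break;
    default:
      body = "<Hangup/>";
  }
  return `<?xml version="1.0" encoding="UTF-8"?><Response>${body}</Response>`;
}
```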
Streaming Request Handler
Handles streaming responses via SSE:
The streaming request handler accepts the prompt and context from the request, validates authentication, and establishes an SSE stream. It initializes the chat context and hands off to the query orchestrator to process the request, streaming the results back to the client.
Query Orchestration
Core query processing with RAG retrieval:
The query orchestrator drives the AI processing. It performs pre-qualification to analyze intent and potentially rephrase the question. Based on the analysis, it either generates a direct response or triggers a RAG (Retrieval-Augmented Generation) workflow to search the knowledge base and generate an informed answer. It reports progress via the stream.
RAG Retrieval
Retrieval-Augmented Generation with streaming:
The RAG retrieval stage searches the vector index for relevant documents, builds a context block from the retrieval results, and calls the AI model (e.g., Claude, GPT) to generate a response based on the retrieved information, streaming chunks of the answer as they are generated.
Stream Finalization
Finalizes the SSE stream by writing the final data object and closing the stream connection.
2. WebSocket Connection Establishment
Twilio ConversationRelay receives TwiML
↓
Initiates WebSocket connection to the edge Worker endpoint
↓
CloudFlare Worker receives WebSocket upgrade request
↓
Session handler starts listening
CloudFlare Worker Processing:
- Accepts WebSocket upgrade request
- Creates WebSocketPair (client ↔ server)
- Returns client WebSocket to Twilio (101 Switching Protocols)
- Starts the session handler on the server WebSocket
- Stores connection parameters (backend route, environment flag, language)
- Sets up event listeners for:
  - message → Route to handler based on type
  - close → Clean up session
  - error → Handle errors gracefully
3. Session Setup
Twilio ConversationRelay sends "setup" message
↓
Session handler processes setup
↓
Stores context from TwiML Parameters
Setup Message (from Twilio):
{
"type": "setup",
"callSid": "CA123abc456def",
"from": "+12025551234",
"to": "+18005551234",
"customParameters": {
"user_email": "+12025551234",
"display_name": "John Doe",
"thread_id": "CA123abc456def",
"guru_name": "AI Assistant",
"checkLiveTransfer": "true"
}
}
The setup message also carries the authenticated session credentials the backend uses to load the correct configuration; those values are omitted here.
CloudFlare Worker Processing:
- Extracts customParameters
- Stores context on the WebSocket object:
- Session parameters (such as the conversation thread and persona).
- Call metadata (SID, timestamps).
- Initializes session state flags.
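A minimal sketch of this setup step, assuming the message shape from the example above. The SessionContext field names and the flag names are hypothetical; the document only states that parameters, call metadata, and state flags are stored on the WebSocket object.

```typescript
interface SetupMessage {
  type: "setup";
  callSid: string;
  from: string;
  to: string;
  customParameters?: { [key: string]: string };
}

// Hypothetical per-session record.
interface SessionContext {
  callSid: string;
  from: string;
  to: string;
  startTime: string;
  params: { [key: string]: string };
  busy: boolean;      // a prompt is being processed
  breaking: boolean;  // an interruption is in progress
  inBackend: boolean; // a backend request is pending
}

function createSessionContext(setup: SetupMessage, now: Date = new Date()): SessionContext {
  // Copy the parameters so later turns cannot mutate the original message.
  const params: { [key: string]: string } = {};
  const source = setup.customParameters || {};
  for (const key of Object.keys(source)) {
    params[key] = source[key];
  }
  return {
    callSid: setup.callSid,
    from: setup.from,
    to: setup.to,
    startTime: now.toISOString(),
    params,
    busy: false,
    breaking: false,
    inBackend: false,
  };
}
```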
4. User Prompt Processing
User speaks
↓
Twilio ConversationRelay performs Speech-to-Text
↓
Sends "prompt" message with voicePrompt
↓
Session handler routes to the input processor
Prompt Message (from Twilio):
{
"type": "prompt",
"voicePrompt": "What are your business hours?"
}
5. Edge Preprocessing (CloudFlare Worker)
Input processor receives voicePrompt
↓
Try AI preprocessing (if enabled)
↓
Try static preprocessing
↓
If match found, send immediate response
↓
Otherwise, forward to Lambda
AI Preprocessing (Optional): The system attempts to classify the user's intent using a fast AI model (e.g., Claude Haiku). This model categorizes inputs into intents such as 'greeting', 'farewell', 'question', or 'live_agent_request'. If a simple intent like a greeting is detected, the worker generates a response immediately without invoking the potentially slower main backend.
Static Preprocessing: For even lower latency, the system checks the input against a set of regex patterns and static rules:
- Greeting Intent: Matches words like "hi", "hello".
- Farewell Intent: Matches words like "goodbye", "see you".
- Question Detection: Uses 70+ regex patterns to identify if the input is a question (e.g., starts with "what", "how", "can you").
If a static match is found (e.g., a greeting), a random friendly response is selected and sent immediately. If the input is identified as a complex question, or no static match is found, it is forwarded to the main Lambda backend.
If Preprocessed Response Found: When a locally generated response is available, it is sent directly to Twilio. The system then checks if the response implies terminating the call (e.g., "Goodbye") and sends the appropriate end-session signal if necessary.
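The static path can be sketched as a handful of module-scope patterns checked in order. The three patterns and the reply lists below are small stand-ins for the real rule set (which the document says contains 70+ patterns), and the function name is hypothetical.

```typescript
// Compiled once at module scope. Illustrative patterns only.
const GREETING = /^(hi|hello|hey)\b[\s!.,]*$/i;
const FAREWELL = /^(goodbye|bye|see you)\b[\s!.,]*$/i;
const QUESTION = /^(what|how|when|where|why|who|can you|could you|do you)\b/i;

const GREETING_REPLIES = ["Hello! How can I help you today?", "Hi there! What can I do for you?"];
const FAREWELL_REPLIES = ["Goodbye! Have a great day.", "Thanks for calling. Goodbye!"];

type Preprocessed = { intent: "greeting" | "farewell"; reply: string } | null;

// Returns an immediate reply for simple intents, or null when the input
// should be forwarded to the backend.
function preprocessStatic(input: string, random: () => number = Math.random): Preprocessed {
  const text = input.trim();
  if (QUESTION.test(text)) {
    return null; // Questions always go to the backend.
  }
  if (GREETING.test(text)) {
    return { intent: "greeting", reply: GREETING_REPLIES[Math.floor(random() * GREETING_REPLIES.length)] };
  }
  if (FAREWELL.test(text)) {
    return { intent: "farewell", reply: FAREWELL_REPLIES[Math.floor(random() * FAREWELL_REPLIES.length)] };
  }
  return null;
}
```

The patterns are anchored at both ends so that an utterance like "hello, what are your hours?" is not mistaken for a bare greeting and still reaches the backend.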
6. Lambda Request (SSE Fetch)
No preprocessing match
↓
Input processor constructs backend request
↓
Fetches the Lambda streaming endpoint
↓
Opens SSE ReadableStream
Backend Request Construction: The worker constructs the request to the Lambda streaming endpoint. It passes the user's question and the full session context (encoded as a JSON string).
Fetch with Abort Controller: The system uses the fetch API to initiate the connection to Lambda. Crucially, it attaches an AbortController. This allows the worker to immediately cancel the pending network request if:
- The user interrupts (speaks again).
- The WebSocket connection closes.
- A network error occurs.
If the fetch fails (non-200 status), a localized error message is sent to the user. Upon success, a ReadableStream reader is obtained to process the incoming Server-Sent Events.
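As an illustration of the request construction, the sketch below encodes the question and the JSON-serialized context into a query string and prepares fetch options that carry the abort signal. The use of GET, the question and context parameter names, and the endpoint are assumptions; the document does not specify the transport details.

```typescript
// Assumption: question and context travel as URL query parameters.
function buildBackendUrl(
  endpoint: string,
  question: string,
  context: { [key: string]: unknown }
): string {
  const query =
    "question=" + encodeURIComponent(question) +
    "&context=" + encodeURIComponent(JSON.stringify(context));
  return endpoint + (endpoint.indexOf("?") === -1 ? "?" : "&") + query;
}

// Options for fetch(); passing the signal lets abort() cancel the request.
function buildFetchInit(controller: AbortController): {
  method: string;
  headers: { [key: string]: string };
  signal: AbortSignal;
} {
  return {
    method: "GET",
    headers: { Accept: "text/event-stream" },
    signal: controller.signal,
  };
}
```

A caller would keep the AbortController on the session so that an interrupt or a WebSocket close can call abort() on it while the stream is still open.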
7. Lambda Processing (Streaming Request Handler)
Lambda receives request
↓
Streaming handler extracts question and context
↓
Validates session credentials
↓
Sets up SSE response stream
↓
Calls the query orchestrator with the stream
Streaming Handler Processing: The streaming request handler first extracts the prompt and context parameters. It performs a security check by validating the supplied session credentials against the loaded configuration. Once validated, it initializes the HTTP response stream with the proper headers for Server-Sent Events (Content-Type: text/event-stream). It then instantiates the chat context manager and hands off execution to the core query orchestration logic, which writes chunks directly to the open stream.
8. Query Handling (Orchestration + RAG)
Query orchestrator starts processing
↓
Sends SSE: { progress: "Analyzing..." }
↓
Performs pre-qualification (intent analysis)
↓
Sends SSE: { progress: "Rephrased Question: ..." }
↓
Runs RAG retrieval
↓
Sends SSE: { progress: "Searching knowledge base..." }
↓
Retrieves from the vector index
↓
Sends SSE: { progress: "Selected 5 relevant documents" }
↓
Generates response with streaming
↓
Sends SSE: { chunk: "Our business" }
↓
Sends SSE: { chunk: " hours are" }
↓
Sends SSE: { chunk: " Monday-Friday" }
↓
Sends SSE: { chunk: " 9am-5pm" }
↓
Sends SSE: { answer: "Our business hours are Monday-Friday 9am-5pm" }
SSE Event Sequence: The protocol involves a sequence of JSON messages sent over the stream:
- Status: Initial connection confirmation (status: "connected").
- Progress: Updates on the AI's thought process (e.g., "Analyzing", "Searching knowledge base").
- Chunk: Pieces of the generated response text streamed token-by-token.
- Answer: The final complete text for logging and verification.
9. SSE Stream Consumption (CloudFlare Worker)
Input processor reads from SSE stream
↓
Accumulates buffer
↓
Splits on '\n\n'
↓
Parses JSON events
↓
Categorizes: chunks, answers, context
↓
Sends to Twilio ConversationRelay
SSE Parsing Loop: The SSE parsing logic runs in a loop, reading chunks from the Lambda stream. It buffers and splits data into events, then processes them based on type: 'chunk' (intermediate text), 'answer' (final text), 'context' (updates), or 'progress' (system status). It handles sanitization, chunking adjustments for TTS, and checks for special signals like goodbye or live agent transfer.
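The buffering and splitting step can be sketched as a small class that accepts decoded text as it arrives and returns only complete events. The class and function names are hypothetical; the "\n\n" delimiter, the data: prefix, and the event keys follow the examples in this document.

```typescript
type SseEvent = { [key: string]: unknown };

class SseBuffer {
  private buffer = "";

  // Accepts the next piece of decoded stream text and returns every complete
  // event it contains. A trailing partial event stays buffered.
  push(text: string): SseEvent[] {
    this.buffer += text;
    const parts = this.buffer.split("\n\n");
    this.buffer = parts.pop() || "";
    const events: SseEvent[] = [];
    for (const part of parts) {
      for (const line of part.split("\n")) {
        if (line.indexOf("data:") !== 0) {
          continue;
        }
        try {
          const parsed = JSON.parse(line.slice(5).trim());
          if (parsed && typeof parsed === "object") {
            events.push(parsed);
          }
        } catch (err) {
          // Invalid JSON is skipped so one bad event cannot end the session.
        }
      }
    }
    return events;
  }
}

// Classifies an event by the first known key it carries.
function eventKind(event: SseEvent): string {
  for (const kind of ["chunk", "answer", "context", "progress", "status"]) {
    if (Object.prototype.hasOwnProperty.call(event, kind)) {
      return kind;
    }
  }
  return "unknown";
}
```

Keeping the unfinished tail in the buffer is what makes the parser robust to network reads that split an event in the middle of its JSON payload.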
10. Text-to-Speech Playback
Twilio ConversationRelay receives text chunks
↓
Queues for TTS synthesis
↓
Streams audio to caller
↓
Monitors for user interrupt
ConversationRelay Processing:
- Receives text message from CloudFlare Worker
- Checks interruptible and preemptible flags:
  - interruptible: true → User can speak to interrupt
  - preemptible: true → This chunk can be skipped if new content arrives
  - preemptible: false → This chunk must be played completely
- Queues for TTS synthesis using configured provider (Google, Polly, Azure)
- Synthesizes audio chunk
- Streams audio to caller
- If last: false, waits for next chunk
- If last: true, finalizes and awaits next user input
TTS Configuration: The TTS settings (provider, voice, language) are defined in the initial TwiML configuration and respected throughout the session.
Audio Streaming:
- Sample rate: 8000 Hz (Twilio default for voice)
- Codec: μ-law (G.711)
- Latency: ~200-500ms from text to audio start
11. User Interrupt Handling
User speaks during AI response
↓
Twilio ConversationRelay detects interrupt
↓
Sends "interrupt" message
↓
CloudFlare Worker invokes the interrupt handler
↓
Cancels ongoing Lambda SSE stream
↓
Waits for new prompt
Interrupt Message (from Twilio):
{
"type": "interrupt",
"timestamp": "2024-01-15T10:30:45.123Z"
}
CloudFlare Worker Processing: When the worker receives an interrupt message, it logs the event and immediately invokes the interrupt handler to halt all current activities.
Interrupt Handling Steps: The interrupt handler works through the following steps in order:
- Sets Flags: Marks the session as 'breaking' to prevent new logic from starting.
- Cancels Backend: Aborts any pending fetch requests to the Lambda backend.
- Cancels Stream: Cancels the active SSE reader to stop processing incoming chunks.
- Waits for Clear: Loops briefly to ensure the 'busy' state has been fully reset.
- Resets State: Returns the session to a clean IDLE state, ready for the new input.
Result:
- Lambda SSE stream is cancelled via AbortController
- CloudFlare Worker stops sending text to Twilio
- Session is ready for next user input
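The cancellation steps above can be sketched synchronously as follows. The session fields are hypothetical names for the flags this document describes, and the brief wait for the busy state to clear is left out here because it depends on the surrounding asynchronous processing loop.

```typescript
// Hypothetical session fields; the real flags live on the WebSocket object.
interface InterruptibleSession {
  breaking: boolean;
  busy: boolean;
  inBackend: boolean;
  abortController: AbortController | null;
  cancelReader: (() => void) | null; // e.g. wraps the SSE reader's cancel()
}

function handleInterrupt(session: InterruptibleSession): void {
  // 1. Mark the session as breaking so no new work starts.
  session.breaking = true;

  // 2. Abort any pending backend fetch.
  if (session.abortController) {
    session.abortController.abort();
    session.abortController = null;
  }

  // 3. Cancel the active stream reader so buffered chunks are dropped.
  if (session.cancelReader) {
    session.cancelReader();
    session.cancelReader = null;
  }

  // 4 and 5. Simplification: reset directly to idle instead of waiting for
  // the processing loop to observe the breaking flag.
  session.inBackend = false;
  session.busy = false;
  session.breaking = false;
}
```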
12. Session Termination
Goodbye Detection:
Assistant response contains goodbye phrase
↓
Goodbye detector returns true
↓
CloudFlare sends "end" message with handoffData
↓
Twilio ConversationRelay ends call
Goodbye Patterns: The system uses regex pattern matching on the assistant's response to detect if the conversation should end ("goodbye"). If triggered, it sends a termination signal with the appropriate reason code.
End Message:
{
"type": "end",
"handoffData": "{\"reasonCode\":\"goodbye\",\"reason\":\"The assistant said goodbye\",\"startTime\":\"2024-01-15T10:25:30.000Z\"}"
}
Live Agent Transfer:
Assistant response indicates live agent needed
↓
Live-agent detector returns true
↓
CloudFlare sends "end" message with live_agent_request
↓
Twilio ConversationRelay triggers the handoff action endpoint
↓
Backend handles the handoff
Live Agent Patterns: Similarly, it checks if the AI suggests transferring to a live agent. If triggered, it sends a termination signal with the live_agent_request reason code.
Live Agent End Message:
{
"type": "end",
"handoffData": "{\"reasonCode\":\"live_agent_request\",\"reason\":\"The assistant determined a live agent is needed\",\"startTime\":\"2024-01-15T10:25:30.000Z\"}"
}
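Both termination paths produce the same message shape, which the sketch below builds from a detected reason code. The two regex patterns are illustrative placeholders for the real detectors, and the function names are hypothetical; the message fields and reason strings are taken from the examples above.

```typescript
type ReasonCode = "goodbye" | "live_agent_request";

// Illustrative patterns only; the real detectors are not shown in this document.
const GOODBYE_PATTERN = /\b(goodbye|bye for now|have a (great|nice) day)\b/i;
const LIVE_AGENT_PATTERN = /\b(transfer you|connect you (to|with) (a|an) (live )?(agent|representative))\b/i;

// Returns the reason the call should end, or null to keep the session open.
function detectEnd(response: string): ReasonCode | null {
  if (LIVE_AGENT_PATTERN.test(response)) {
    return "live_agent_request";
  }
  if (GOODBYE_PATTERN.test(response)) {
    return "goodbye";
  }
  return null;
}

// handoffData is itself a JSON string nested inside the JSON message.
function buildEndMessage(reasonCode: ReasonCode, startTime: string): string {
  const reason =
    reasonCode === "goodbye"
      ? "The assistant said goodbye"
      : "The assistant determined a live agent is needed";
  return JSON.stringify({
    type: "end",
    handoffData: JSON.stringify({ reasonCode, reason, startTime }),
  });
}
```

Checking for a transfer before a goodbye is a design choice in this sketch, so that a response such as "Let me transfer you, goodbye" is routed to the handoff flow rather than ending the call.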
Visual Flow Diagram
┌─────────────────────────────────────────────────────────────────────────┐
│ INCOMING PHONE CALL │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ TWILIO → Voice webhook (POST) │
│ • CallSid, From, To, CallStatus │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ LAMBDA: Voice Webhook Handler │
│ • Validate Twilio signature │
│ • Load per-number configuration profile │
│ • Create chat context manager │
│ • Return TwiML with ConversationRelay │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ TwiML RESPONSE │
│ <ConversationRelay │
│ url="wss://<edge-worker-endpoint>/ws" │
│ welcomeGreeting="Hello!" │
│ language="en-US" │
│ ttsProvider="google" │
│ voice="en-US-Neural2-H"> │
│ <Parameter name="thread_id" value="..."/> │
│ <Parameter name="guru_name" value="..."/> │
│ <!-- plus authenticated session credentials --> │
│ </ConversationRelay> │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ TWILIO CONVERSATIONRELAY │
│ WebSocket Connection → the edge Worker endpoint │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ CLOUDFLARE WORKER: Session Handler │
│ • Accept WebSocket │
│ • Listen for messages │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ MESSAGE TYPE: "setup" │
│ • Extract customParameters from TwiML │
│ • Store context: {thread_id, guru_name, ...} │
│ • Initialize session state │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ USER SPEAKS → Twilio STT → voicePrompt │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ MESSAGE TYPE: "prompt" │
│ { "type": "prompt", "voicePrompt": "What are your hours?" } │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ CLOUDFLARE WORKER: Input Processor │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ PREPROCESSING (Edge) │ │
│ │ • Try AI preprocessing (fast model) │ │
│ │ • Try static pattern matching │ │
│ │ • Intents: greeting, farewell, simple Q&A │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │ │ │
│ Matched? Not Matched │
│ │ │ │
│ ▼ ▼ │
│ ┌────────────────────────────────┐ ┌───────────────────────────────┐ │
│ │ Send Immediate Response │ │ Forward to Lambda │ │
│ │ • ws.send({type:"text",...}) │ │ • Encode question & context │ │
│ │ • Check goodbye/live agent │ │ • Fetch SSE stream │ │
│ │ • Send "end" if needed │ │ • Open ReadableStream │ │
│ └────────────────────────────────┘ └───────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ LAMBDA: Streaming Request Handler │
│ • Extract question and context from the request │
│ • Validate session credentials │
│ • Set up SSE response stream │
│ • stream.write("data: {\"status\":\"connected\"}\n\n") │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────────────┐
│ LAMBDA: Query Orchestrator │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ SSE: {"progress": "Analyzing..."} │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ Pre-qualification │ │
│ │ • Intent analysis │ │
│ │ • Question rephrasing │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ SSE: {"progress": "Rephrased Question: ..."} │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ RAG retrieval │ │
│ │ • Query the vector index with embeddings │ │
│ │ • Retrieve top-k relevant documents │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ SSE: {"progress": "Selected 5 relevant documents"} │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ Streaming response generation │ │
│ │ • Build RAG context │ │
│ │ • Call AI model (OpenAI, Bedrock, Google) │ │
│ │ • Stream chunks as they arrive │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ SSE: {"chunk": "Our business"} │ │
│ │ SSE: {"chunk": " hours are"} │ │
│ │ SSE: {"chunk": " Monday-Friday"} │ │
│ │ SSE: {"chunk": " 9am-5pm"} │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ SSE: {"answer": "Our business hours are Monday-Friday 9am-5pm"} │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ stream.end() │ │
│ └───────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ CLOUDFLARE WORKER: Input Processor SSE Parsing │
│ • Read from stream.body.getReader() │
│ • Accumulate buffer, split on "\n\n" │
│ • Parse JSON events: chunk, answer, context, progress │
│ • Send to Twilio ConversationRelay │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ CHUNK EVENTS → Relay to Twilio │
│ ws.send({ │
│ type: "text", │
│ token: "Our business", │
│ last: false, │
│ interruptible: true, │
│ preemptible: false │
│ }) │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ ANSWER EVENT → Final Response │
│ ws.send({ │
│ type: "text", │
│ token: "", // or full text if no chunks sent │
│ last: true │
│ }) │
│ │
│ Check for goodbye or live agent transfer │
│ If detected: │
│ ws.send({ │
│ type: "end", │
│ handoffData: JSON.stringify({ │
│ reasonCode: "goodbye" | "live_agent_request", │
│ reason: "...", │
│ startTime: "..." │
│ }) │
│ }) │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ TWILIO CONVERSATIONRELAY │
│ • Queue text chunks for TTS │
│ • Synthesize audio (Google TTS) │
│ • Stream audio to caller │
│ • Monitor for user interrupt │
│ • If "end" message: terminate call or handoff │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ CALLER HEARS AI RESPONSE │
│ "Our business hours are Monday through Friday from 9 AM to 5 PM." │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────┴─────────────────────────────┐
│ │
▼ ▼
┌───────────────────┐ ┌─────────────────────┐
│ User Speaks Again │ │ User Says Goodbye │
│ (loop to prompt) │ │ or Requests Agent │
└───────────────────┘ └─────────────────────┘
│
▼
┌─────────────────────┐
│ Call Ends or │
│ Transfers to Agent │
└─────────────────────┘
Key Technical Details
1. Multi-Language Support
WebSocket Routing: Each call is routed to a language- and environment-specific path so that production and development traffic, and English and Spanish traffic, are handled by the matching preprocessing and localization logic.
Language-Specific Processing: The system uses the session's language property to select the appropriate preprocessing functions, goodbye detectors, and intent classifiers. It also localizes error messages sent to the user (e.g., "Sorry, there was a connection error" vs. "Lo siento, hubo un error de conexión").
Spanish Functions: Specialized regex detectors are implemented to recognize Spanish phrases for termination or transfer (e.g., "Adiós", "Déjame transferirte").
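A language-keyed lookup is one simple way to organize this selection. In the sketch below the error strings are the ones quoted above, while the goodbye patterns, the language-tag handling, and the function names are assumptions for illustration.

```typescript
type Language = "en" | "es";

const CONNECTION_ERROR: { [L in Language]: string } = {
  en: "Sorry, there was a connection error",
  es: "Lo siento, hubo un error de conexión",
};

// Illustrative detectors; the real pattern sets are larger.
const GOODBYE_DETECTORS: { [L in Language]: RegExp } = {
  en: /\b(goodbye|bye)\b/i,
  es: /\b(adi[oó]s|hasta luego)\b/i,
};

// Assumption: anything that does not start with "es" falls back to English.
function resolveLanguage(value: string | undefined): Language {
  return value && value.toLowerCase().indexOf("es") === 0 ? "es" : "en";
}

function saidGoodbye(response: string, language: Language): boolean {
  return GOODBYE_DETECTORS[language].test(response);
}
```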
2. DTMF and Numeric Input
Digit Accumulation: The system handles DTMF input by accumulating digits arriving via WebSocket.
- End Marker: If the user presses #, the accumulated sequence is immediately processed.
- Max Digits: If the count matches the configured maximum, it auto-submits.
- Context: Minimum and maximum digit constraints can be set dynamically via the session context.
- Voice Digits: The system also extracts digits spoken by the user (e.g., "one two three") from the voice prompt if a numeric input is expected.
Context Configuration: Numeric input constraints (min/max digits) can be configured via session parameters.
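The accumulation rules can be sketched as a small collector plus a helper for spoken digits. The class and function names are hypothetical, and one behavior is an assumption: pressing # before the minimum length is reached is ignored rather than submitted.

```typescript
class DigitCollector {
  private digits = "";

  constructor(private minDigits: number, private maxDigits: number) {}

  // Returns the completed sequence when it should be submitted, else null.
  push(digit: string): string | null {
    if (digit === "#") {
      // Assumption: the end marker only submits once minDigits is satisfied.
      return this.digits.length >= this.minDigits ? this.flush() : null;
    }
    if (!/^[0-9]$/.test(digit)) {
      return null; // Ignore "*" and anything else that is not a digit.
    }
    this.digits += digit;
    return this.digits.length >= this.maxDigits ? this.flush() : null;
  }

  private flush(): string {
    const result = this.digits;
    this.digits = "";
    return result;
  }
}

const DIGIT_WORDS: { [word: string]: string } = {
  zero: "0", one: "1", two: "2", three: "3", four: "4",
  five: "5", six: "6", seven: "7", eight: "8", nine: "9",
};

// Extracts digits from a voice prompt, accepting both numerals and English
// number words.
function spokenDigits(prompt: string): string {
  let out = "";
  for (const token of prompt.toLowerCase().split(/[^a-z0-9]+/)) {
    if (/^[0-9]+$/.test(token)) {
      out += token;
    } else if (Object.prototype.hasOwnProperty.call(DIGIT_WORDS, token)) {
      out += DIGIT_WORDS[token];
    }
  }
  return out;
}
```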
3. TTS Optimization
Response Sanitization & Chunking: Response text is split into chunks (max 256 characters) to optimize for TTS playback latency. This ensures that the telephony provider can start speaking the first part of a long sentence while the rest is being processed or transmitted. Additionally, text is sanitized to remove URLs, email addresses, and special characters that might cause TTS errors.
Sending Chunked Response: The system iterates through the text chunks, sending them to Twilio with the interruptible: true flag. This allows the user to cut in at any point. Small delays are inserted between chunks to prevent network jitter from causing out-of-order playback.
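A word-boundary splitter is one straightforward way to honor the 256-character limit. The sketch below is an assumed implementation, not the Worker's actual chunker; the message fields match the text messages shown in the flow diagram, and the pacing delays between sends are omitted.

```typescript
interface TextMessage {
  type: "text";
  token: string;
  last: boolean;
  interruptible: boolean;
}

// Splits text into pieces of at most maxLength characters, preferring to
// break at the last space inside each window.
function chunkForTts(text: string, maxLength: number = 256): string[] {
  const chunks: string[] = [];
  let rest = text.trim();
  while (rest.length > maxLength) {
    let cut = rest.lastIndexOf(" ", maxLength);
    if (cut <= 0) {
      cut = maxLength; // No space available: fall back to a hard cut.
    }
    chunks.push(rest.slice(0, cut).trim());
    rest = rest.slice(cut).trim();
  }
  if (rest.length > 0) {
    chunks.push(rest);
  }
  return chunks;
}

// Wraps each chunk in a text message; only the final one has last: true.
function toTextMessages(chunks: string[]): TextMessage[] {
  return chunks.map((token, index): TextMessage => ({
    type: "text",
    token,
    last: index === chunks.length - 1,
    interruptible: true,
  }));
}
```

Note that this splitter drops the whitespace at each break, so a sender that relies on the receiver concatenating tokens would need to add a separating space back.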
4. Session State Management
State Flags: The system tracks the session's lifecycle using properties on the WebSocket object, including whether a prompt is being processed, whether an interruption is active, whether the backend is generating, how much of the current response has been sent, and the accumulated answer for logging.
State Transitions: The session moves from IDLE to BUSY when input arrives. During BUSY, it may transition to an in-backend state while fetching. Once the answer is complete (or interrupted), it returns to IDLE.
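These transitions can be written down as a small pure function. The state and event names below are hypothetical labels for the states described above, not identifiers from the Worker code.

```typescript
type SessionState = "IDLE" | "BUSY" | "IN_BACKEND";
type SessionEvent = "input" | "fetch_started" | "answer_complete" | "interrupted";

// Returns the next state; events that do not apply leave the state unchanged.
function nextState(state: SessionState, event: SessionEvent): SessionState {
  switch (event) {
    case "input":
      return state === "IDLE" ? "BUSY" : state;
    case "fetch_started":
      return state === "BUSY" ? "IN_BACKEND" : state;
    case "answer_complete":
    case "interrupted":
      return "IDLE";
    default:
      return state;
  }
}
```

Modeling the flags as one explicit state makes invalid combinations, such as a pending backend request on an idle session, impossible to represent.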
5. Error Handling
Lambda Request Errors: When the Lambda backend returns a non-200 status or a missing body, the system sends a localized error message (English or Spanish) to the user via TTS and resets the session state flags.
SSE Parsing Errors: Incoming SSE data lines are parsed safely; invalid JSON is logged and skipped to prevent session crashes.
WebSocket Errors: The worker listens for WebSocket errors. If the error is related to a closed connection or network loss, it is ignored to prevent log spam. Errors on open sessions are logged for debugging.
Global Error Handlers: Global event listeners are attached in the CloudFlare Worker environment to catch unhandled exceptions and promise rejections. The error context is logged instead of letting the instance crash unexpectedly.
6. Performance Optimizations
Pre-compiled Regex Patterns: To ensure minimal latency at the edge, heavy regex patterns (for question detection or intent matching) are compiled once at the module scope rather than per-request.
Batched SSE Processing: Network chunks are naturally batched. The system buffers incomplete events and processes complete JSON payloads from the SSE stream in batches to optimize network throughput.
Response Streaming: The Lambda backend streams AI-generated content chunks immediately to the response stream as Server-Sent Events (SSE), ensuring low latency without buffering.
CloudFlare Edge Locations:
- Global edge network (275+ cities)
- Low latency to both Twilio and AWS Lambda
- WebSocket termination at edge
- Reduces round-trip time for preprocessing
Configuration
Phone Number Configuration
Each phone number is backed by a configuration profile that defines its behavior. At a high level a profile covers:
- Identity: assistant name and display name.
- Voice Settings: TTS vendor (e.g., Google, Azure), voice selection, and greeting message.
- AI Settings: preferred model (e.g., Claude Sonnet 4.6), RAG behavior (enabled/disabled), and system instructions.
- Workflows: logical triggers and steps for specific tasks like appointment scheduling.
Profiles, credentials, and knowledge-base wiring are managed server-side and are never exposed to the client.
Deployment
CloudFlare Worker Deployment
The CloudFlare Worker (ConversationRelay) is deployed globally and is configured, via environment variables, to reach the correct AWS Lambda backend for the production and development environments.
Lambda Deployment
The AWS backend is deployed as a set of managed functions. Key functions include the voice webhook, which generates the initial TwiML returned to Twilio, and the streaming handler, which serves the SSE stream requests for AI processing. Functions are configured with appropriate timeouts (e.g., 30s) and CORS settings.
Monitoring and Debugging
CloudFlare Worker Logs
Log monitoring provides real-time visibility into the edge connection. Key events tracked include:
- Call setup and parameter extraction (Caller ID, Language).
- AI Preprocessing decisions (Response vs. No Response).
- Backend Fetch initiation and status.
- WebSocket session events (Open, Close, Error).
Lambda Logs (CloudWatch)
Backend logs provide detailed execution traces for the AI logic:
- Configuration loading and credential validation.
- Query understanding and rephrasing.
- RAG document selection specifics.
- Streaming response generation milestones.
Persistent Logging
The system maintains granular logs for auditing and analysis, kept in separate stores for request/response audit pairs, full chat transcripts, and voice-call metadata (such as duration and status).
Troubleshooting
Issue: WebSocket Connection Fails
Symptoms:
- ConversationRelay can't connect to CloudFlare Worker
- Immediate call termination
Causes:
- Edge security settings blocking WebSocket
- Invalid WebSocket URL in TwiML
- Worker not deployed or crashed
Solutions:
- Verify Deployment: Confirm the CloudFlare Worker is active and its deployment completed successfully.
- Check Logs: Review the Worker's real-time logs for errors.
- Verify Edge Settings: Ensure the edge configuration permits WebSocket connections.
Issue: No Response from AI
Symptoms:
- User speaks but gets no response
- Timeout after 30 seconds
Causes:
- Lambda timeout (30s max)
- Vector search timeout
- AI model throttling
Solutions:
- Timeouts: Verify Lambda timeout settings are sufficient (e.g., >30s).
- Infrastructure: Verify vector-search connectivity and AI provider API status/quotas.
Issue: TTS Error 64111
Symptoms:
- Twilio error: "Unable to synthesize text"
- Call drops after AI response
Cause:
- Response contains URLs or special characters that TTS can't handle.
Solution: The system automatically sanitizes text before sending it to TTS, stripping URLs and email addresses and replacing special characters.
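A sanitizer along these lines can be sketched with a few regex passes. The exact patterns and the set of characters treated as special are assumptions; the Worker's real rules are not shown in this document.

```typescript
// Removes content that TTS engines tend to choke on, then collapses the
// whitespace left behind. Patterns are illustrative, not exhaustive.
function sanitizeForTts(text: string): string {
  return text
    .replace(/https?:\/\/\S+/gi, " ")                          // URLs with a scheme
    .replace(/\bwww\.\S+/gi, " ")                              // bare www. links
    .replace(/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/gi, " ")   // email addresses
    .replace(/[*_#`~<>|\\]/g, " ")                             // markdown and markup symbols
    .replace(/\s+/g, " ")
    .trim();
}
```

For example, sanitizeForTts("Visit https://example.com/docs or email help@example.com for *more* details.") yields "Visit or email for more details."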
Issue: User Interrupt Not Working
Symptoms:
- User speaks but AI continues talking
- No interrupt detection
Causes:
- interruptible flag not set on response chunks.
- Session break logic failing to cancel backend requests.
Solutions: Ensure the Worker sets interruptible: true on the text messages it sends to Twilio, and that the WebSocket listener triggers the session cancellation routine when an interrupt message arrives.
Summary
This Twilio ConversationRelay implementation provides a production-ready, scalable voice AI system with the following key characteristics:
Architecture:
- 3-layer design: Twilio → CloudFlare → Lambda
- Edge-based preprocessing for low latency
- Server-Sent Events for streaming responses
- Real-time interrupt handling
Performance:
- Global edge presence via CloudFlare (275+ cities)
- Sub-second response times for simple queries
- Streaming AI responses for natural conversation flow
- Optimized TTS chunking to prevent buffer overflow
Features:
- Multi-language support (English, Spanish, extensible)
- DTMF and voice digit collection
- Live agent transfer capability
- Goodbye detection and call termination
- Configurable per phone number
- Workflow engine for complex interactions
- RAG retrieval from a vector knowledge base
Reliability:
- Comprehensive error handling at all layers
- Automatic retry and fallback logic
- Session state management with graceful cleanup
- Abort controllers for request cancellation
- Dead session timeout detection
Monitoring:
- CloudFlare Worker logs (real-time)
- Lambda CloudWatch logs
- Persistent logging for analytics
- Detailed debug logging throughout
This architecture provides a robust foundation for building sophisticated voice AI applications with enterprise-grade reliability and performance.