Vapi Track
Module 2 of 6

Conversation Design

Design script flows, craft effective system prompts, select voices, and handle user interruptions for natural phone conversations.

16 min read

What You'll Learn

  • Design effective system prompts that guide agent behavior, set tone, and handle edge cases in voice conversations
  • Configure voice selection in Vapi including provider differences, voice cloning options, and emotion/speed controls
  • Implement interruption handling and endpointing settings to make conversations feel natural rather than robotic
  • Structure conversation flows using background context, dynamic variables, and conditional branching logic
  • Write first messages and call scripts optimized for voice rather than text, accounting for how people speak on the phone

Writing System Prompts for Voice

Voice prompts have different requirements than text-based chatbot prompts. When someone is on the phone, they expect short, punchy responses - not paragraphs. Your system prompt must explicitly constrain response length and format.

Key voice-specific prompt rules to include:

  • No markdown: The agent should never output bullet points, bold text, or headers. TTS reads these as literal characters or awkward pauses.
  • Response length: "Keep all responses under 2-3 sentences unless the caller explicitly asks for a detailed explanation."
  • No lists: "Never use numbered or bulleted lists. Instead, say items conversationally: 'We have three options: the first is... the second is... and the third is...'"
  • Filler awareness: "Do not start responses with filler words like 'Certainly!' or 'Great question!' - go directly to the answer."
  • Natural language numbers: "Say phone numbers and dates naturally: 'July fifteenth' not '07/15', 'five-five-five, one-two-three-four' not '5551234'."
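The number rule is also something you can enforce in code when you inject values into a prompt or first message, rather than trusting the LLM to reformat them. A minimal sketch of a hypothetical preprocessing helper (the function name and grouping logic are assumptions, not part of Vapi):

```python
# Spell out digits so TTS reads "five-five-five" instead of "five hundred
# fifty-five". Grouping follows common US phone-number patterns.
SPOKEN_DIGITS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def speak_phone(number: str) -> str:
    """Render a phone number as spoken digit groups for TTS."""
    digits = [c for c in number if c.isdigit()]
    if len(digits) == 7:
        groups = [digits[:3], digits[3:]]
    elif len(digits) == 10:
        groups = [digits[:3], digits[3:6], digits[6:]]
    else:
        groups = [digits]
    return ", ".join("-".join(SPOKEN_DIGITS[d] for d in g) for g in groups)

print(speak_phone("5551234"))  # five-five-five, one-two-three-four
```

Pre-formatting values this way keeps the prompt rule as a backstop rather than the only line of defense.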

Prompt structure for voice agents follows a clear hierarchy: identity, purpose, personality, knowledge scope, hard constraints, and fallback behavior. The fallback is critical - tell the agent exactly what to say when it cannot answer ("I don't have that information, but I can transfer you to a human agent who can help."). Without explicit fallback instructions, agents hallucinate confidently.

Dynamic variables let you personalize prompts at call time. In Vapi, use {{variableName}} syntax in your system prompt, then pass values when creating the call via the API (under assistantOverrides.variableValues when the call references a saved assistant). Common uses include caller name, account status, appointment date, and location.
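As a sketch, here is what a call-creation payload with dynamic variables might look like. The IDs are placeholders, customerName and appointmentDate are illustrative variable names, and the exact location of variableValues in the payload should be verified against the current Vapi API reference; the render function is a local stand-in for the substitution Vapi performs server-side:

```python
import re

# Prompt template using Vapi's {{variableName}} syntax.
system_prompt = (
    "You are Ava, a scheduling assistant for Acme Dental. "
    "Greet {{customerName}} and confirm the appointment on {{appointmentDate}}."
)

# Sketch of a call-creation payload. IDs are placeholders.
call_payload = {
    "phoneNumberId": "<your-phone-number-id>",
    "customer": {"number": "+15555550123"},
    "assistantId": "<your-assistant-id>",
    "assistantOverrides": {
        "variableValues": {
            "customerName": "Dana",
            "appointmentDate": "July fifteenth",
        }
    },
}

def render(template: str, values: dict) -> str:
    """Local stand-in for the substitution Vapi performs server-side."""
    return re.sub(r"\{\{(\w+)\}\}",
                  lambda m: values.get(m.group(1), m.group(0)), template)

print(render(system_prompt, call_payload["assistantOverrides"]["variableValues"]))
```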

Quick Test: Rewrite Your System Prompt for Voice

Open your Vapi assistant and rewrite the system prompt using this structure:

1) "You are [Name], [role] for [Company]."

2) "Your job is to [primary purpose]."

3) "You speak in a [tone] way."

4) "Keep all responses under 2 sentences."

5) "If asked something outside [topic], say: [specific fallback phrase]."

Test it with a phone call and notice how constraining response length improves conversation naturalness.
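If you generate assistants programmatically, the five-part structure above can be captured as a small template function. This is an illustrative helper, not a Vapi API; the fixed constraint lines reflect the voice rules from earlier in this module:

```python
def build_voice_prompt(name: str, role: str, company: str, purpose: str,
                       tone: str, topic: str, fallback: str) -> str:
    """Assemble a voice-ready system prompt following the 5-part structure."""
    return "\n".join([
        f"You are {name}, {role} for {company}.",
        f"Your job is to {purpose}.",
        f"You speak in a {tone} way.",
        "Keep all responses under 2 sentences.",
        "Never use markdown, bullet points, or numbered lists.",
        f'If asked something outside {topic}, say: "{fallback}"',
    ])

print(build_voice_prompt(
    "Ava", "a scheduling assistant", "Acme Dental",
    "book and reschedule appointments", "warm, efficient", "scheduling",
    "I don't have that information, but I can transfer you to someone who does.",
))
```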

Voice Selection and TTS Configuration

Voice selection is one of the most impactful choices for your agent because it shapes the entire caller experience. Vapi supports multiple TTS providers, each with distinct tradeoffs:

ElevenLabs offers the most human-like quality. The "Rachel" and "Adam" voices are popular defaults. ElevenLabs also supports voice cloning - you can upload audio of a real person and create a custom voice. Downsides: highest cost and occasionally higher latency.

PlayHT offers a good balance of quality and cost. Their v2 voices are noticeably better than v1. Good for high-volume applications where cost matters.

Deepgram Aura is the fastest and cheapest option. Quality is lower but acceptable for utility-focused agents (scheduling, IVR replacement). Best choice when minimizing cost is the priority.

OpenAI TTS (alloy, echo, nova, shimmer) offers consistent quality at a mid-range price. Nova and Shimmer sound the most natural for customer-facing applications.

In the Vapi dashboard under Voice settings, you can also configure:

  • Speed: 0.8-1.2x. Slightly slower (0.9x) helps with comprehension for complex information.
  • Stability (ElevenLabs only): Higher stability = more consistent, lower = more expressive variation.
  • Background noise/ambience: Vapi lets you add subtle office or coffee shop background noise to make the agent sound more like a real person in a real environment.

For B2B enterprise use cases, a neutral professional voice (deep, measured) builds more trust. For consumer-facing retail or hospitality, a warmer, more expressive voice works better. Test voices by generating a 60-second script and listening before committing.
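The provider tradeoffs above can be collapsed into a few starting presets. The provider keys and voice IDs below are assumptions - check the voice list in the Vapi dashboard for the exact identifiers it expects - and the speed clamp mirrors the 0.8-1.2x range noted earlier:

```python
# Illustrative per-provider starting points; verify provider keys and
# voice IDs against the Vapi dashboard before use.
VOICE_PRESETS = {
    "quality-first": {"provider": "11labs", "voiceId": "rachel", "stability": 0.6},
    "balanced":      {"provider": "playht", "voiceId": "jennifer"},
    "low-cost":      {"provider": "deepgram", "voiceId": "aura-asteria-en"},
    "mid-range":     {"provider": "openai", "voiceId": "nova"},
}

def voice_config(profile: str, speed: float = 0.9) -> dict:
    """Build a voice config dict; speed is clamped to the 0.8-1.2x range."""
    cfg = dict(VOICE_PRESETS[profile])
    cfg["speed"] = max(0.8, min(1.2, speed))
    return cfg

print(voice_config("quality-first"))
```

Starting from a named profile and overriding only speed keeps experiments reproducible while you A/B test voices.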

Interruption Handling and Endpointing

Endpointing and interruption sensitivity are the two settings most developers overlook, yet they have the biggest impact on whether conversations feel natural or robotic.

Endpointing is how Vapi determines when the caller has finished speaking. The default is silence detection - after a certain pause, Vapi assumes the turn is over and sends the audio to the LLM. The silence threshold is configurable (typically 500-1000ms). Too low and the agent cuts off callers mid-sentence. Too high and there are awkward pauses.

Vapi also supports smart endpointing using a secondary LLM to analyze whether the sentence feels complete. This handles cases like "I want to book an appointment... for next Tuesday" where the pause in the middle would otherwise trigger a premature response.

Interruption handling controls what happens when the caller speaks while the agent is talking. The interruptionsEnabled setting (true by default) lets callers break in at any point. Setting numWordsToInterruptAssistant to 2-3 means the agent only stops after detecting at least 2-3 words, reducing false triggers from background noise.

For outbound sales calls, you often want to allow interruptions freely. For IVR-style interactions where you need the caller to hear full instructions before responding, consider disabling interruptions for specific turns.

The backchannel feature plays subtle affirmations ("mm-hmm", "I see", "sure") while the caller speaks. This makes the silence during transcription feel more human. Enable it under Assistant > Advanced settings.
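Collected in one place, the turn-taking settings from this section might look like the sketch below. interruptionsEnabled and numWordsToInterruptAssistant are the names used in this module; silenceThresholdMs, smartEndpointingEnabled, and backchannelingEnabled are stand-in keys for whatever your Vapi version calls these options, so check the current assistant schema before copying:

```python
# Turn-taking configuration sketch; key names are illustrative stand-ins
# for the corresponding fields in the Vapi assistant schema.
turn_taking = {
    "interruptionsEnabled": True,
    "numWordsToInterruptAssistant": 2,  # ignore one-word noise like "uh"
    "silenceThresholdMs": 800,          # appointment-booking sweet spot
    "smartEndpointingEnabled": True,    # LLM-based sentence-completeness check
    "backchannelingEnabled": True,      # "mm-hmm" while the caller speaks
}
```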

Endpointing Tuning Strategy

Start with Vapi's default endpointing settings and run 10-15 test calls. Count how many times the agent cut off the caller mid-sentence (endpoint too fast) vs. had awkward pauses (endpoint too slow). Adjust the silence threshold by 100ms increments based on your findings. Most agents for appointment booking work best around 800-1000ms. Fast-paced sales contexts work better at 500-600ms.
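The tuning loop above can be expressed as a tiny helper that nudges the threshold by 100ms per batch of test calls, clamped to the 500-1000ms working range. The function is a hypothetical sketch of the strategy, not a Vapi API:

```python
def adjust_threshold(threshold_ms: int, cutoffs: int, awkward_pauses: int) -> int:
    """Nudge the silence threshold by 100ms based on test-call tallies.

    cutoffs: times the agent cut the caller off (endpoint too fast).
    awkward_pauses: times the agent lagged (endpoint too slow).
    """
    if cutoffs > awkward_pauses:
        threshold_ms += 100  # agent interrupts callers: wait longer
    elif awkward_pauses > cutoffs:
        threshold_ms -= 100  # agent lags: respond sooner
    return max(500, min(1000, threshold_ms))

print(adjust_threshold(800, 5, 1))  # 900
```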

Conversation Flow Architecture

There are two primary ways to structure conversation flow in Vapi: freeform LLM-driven and workflow-structured.

Freeform is the default - the system prompt tells the agent what to do and the LLM figures out the conversation path. This works well for open-ended support agents and general assistants. The downside is unpredictability; the LLM can deviate from the intended flow.

Vapi Workflows (launched in late 2024) add a visual flow builder where you define explicit nodes: Conversation nodes (the agent speaks and listens), API Request nodes (call external services), Logic nodes (if/else branching), and Transfer nodes (hand off to another number or agent). Workflows give you much more control over the exact sequence of a call.

For a medical appointment booking flow, a well-structured workflow might look like:

  1. Greeting node - "Hi, this is Dr. Smith's office..."
  2. Collect name node - ask for and confirm caller name
  3. Collect DOB node - for patient verification
  4. Intent detection - new appointment vs. existing appointment vs. billing
  5. Branch to appropriate sub-flow
  6. API call to check calendar availability
  7. Confirm booking and end call

The Vapi Workflow builder is accessible from Assistants > Workflows in the dashboard. Even if you prefer freeform LLM agents, understanding workflows helps you write better system prompts by thinking in terms of explicit states the conversation should pass through.
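Sketching the flow as plain data before opening the visual builder is a useful design step. The node schema below is illustrative only - it is not Vapi's Workflow export format - and the four node types mirror the kinds named above; the logic node covers both intent detection and the branch to a sub-flow:

```python
# Illustrative node sketch of the medical booking flow; not Vapi's
# actual Workflow export format.
NODE_TYPES = {"conversation", "api_request", "logic", "transfer"}

booking_flow = [
    {"id": "greeting", "type": "conversation",
     "goal": "Hi, this is Dr. Smith's office..."},
    {"id": "collect_name", "type": "conversation",
     "goal": "Ask for and confirm the caller's name."},
    {"id": "collect_dob", "type": "conversation",
     "goal": "Collect date of birth for patient verification."},
    {"id": "intent", "type": "logic",  # detect intent, then branch
     "branches": ["new_appointment", "existing_appointment", "billing"]},
    {"id": "availability", "type": "api_request",
     "goal": "Check calendar availability."},
    {"id": "confirm", "type": "conversation",
     "goal": "Confirm the booking and end the call."},
]
```

Writing the states out like this also sharpens freeform prompts: each node becomes a sentence in the system prompt describing what the agent must accomplish before moving on.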

Try This Yourself

In the Vapi dashboard, go to Assistants and click the Workflows tab. Create a new workflow and add three nodes: a "Say" node with a greeting, a "Gather" node that asks for the caller's name, and a "Say" node that responds using the collected name ("Nice to meet you, {{callerName}}!"). Save and test the workflow using the built-in call simulator. This simple exercise teaches you the node-based flow model that scales to complex multi-step agents.

Core Insights

  • Voice system prompts must explicitly forbid markdown, limit response length, and define clear fallback phrases - text chatbot prompts do not work as-is for voice
  • ElevenLabs gives the best voice quality; Deepgram Aura is the best cost-performance tradeoff for high-volume deployments
  • Endpointing (how long to wait for silence) and interruption sensitivity are the two settings that most affect perceived conversation naturalness
  • Vapi Workflows add a visual node-based flow builder for explicit call path control, while freeform LLM mode is better for open-ended conversations
  • Dynamic variables ({{variableName}} syntax) let you personalize system prompts at call time, enabling context-aware conversations from the first second