To build an AI voice agent, you scope one call type, write the conversation as a script, assemble a pipeline of speech-to-text, a language model, and text-to-speech (or pick a platform like Vapi or Retell that bundles all three), wire it to a phone number, connect your calendar or CRM through function calling, run at least 50 test calls, and launch on a fraction of your real call volume before scaling. On a managed platform you can have a working prototype in an afternoon; a custom build on your own stack typically runs 4 to 12 weeks for basic scope and 3 to 6 months for enterprise scope, according to development estimates published by Riseup Labs.
The guides ranking for this query today are vendor blogs from Rasa, AssemblyAI, and Cake, and each is genuinely useful. But all three share the same blind spots: not one publishes a dollar figure, walks through connecting a real phone number, mentions the TCPA or the FCC's 2024 ruling on AI voices, or shows what a finished call sounds like. This guide covers what they cover and fills those gaps: named stack choices, a seven-step build framework with a checkable artifact at every step, latency targets with real numbers, honest cost math, US compliance rules, and a failure-mode catalog.
One disclosure up front: MapleVoice sells the opposite of DIY: fully managed voice agents that we build and operate for you. That bias gets stated once, in a clearly marked section near the end. Everything before it is the build guide we would want if we were on your side of the table, including the cases where building is genuinely the right call.
What You Are Actually Building: The Voice Agent Pipeline
An AI voice agent is software that holds a real-time spoken conversation over the phone and takes action: it answers calls, figures out what the caller wants, looks things up, books appointments, and hands off to a human when needed. That last clause separates it from its cousins. A chatbot handles typed text. A voice assistant adds speech but mostly answers questions. A voice agent completes tasks, which means it needs working connections to your calendar, CRM, or order system, not just a knowledge base. Rasa's guide draws this assistant-versus-agent line well, and it is worth internalizing before you build: most businesses asking how to build one actually need the action-taking version.
It is also not an IVR. A phone tree forces callers through press-1 menus with a fixed vocabulary. A voice agent listens to open-ended natural speech, holds context across turns, handles interruptions, and recovers when callers change their minds mid-sentence.
Under the hood, most production agents run a cascaded pipeline. Voice activity detection notices the caller is speaking, streaming speech-to-text transcribes audio as it arrives, an endpointing model decides the caller has finished a thought, a large language model decides what to say and which tools to call, and text-to-speech streams the reply back, all in well under two seconds. The newer alternative is a realtime speech-to-speech API, such as OpenAI's Realtime API or Google's Gemini Live, where a single model consumes and produces audio directly. Speech-to-speech cuts latency and sounds more natural, but as of 2026 you trade away per-stage control: it is harder to swap in a better transcription model, enforce exact scripted wording, or tune domain vocabulary. Most business builds still choose the cascade for that control. The glossary below covers the terms the rest of this guide leans on.
- STT / ASR: speech-to-text, also called automatic speech recognition. Converts caller audio into text.
- TTS: text-to-speech. Converts the agent's written reply into audio.
- VAD: voice activity detection. Detects when someone is speaking versus silence.
- Endpointing: deciding the caller has finished their turn. Too eager and you interrupt; too patient and you create dead air.
- Barge-in: the caller talking over the agent. The agent must stop speaking and listen.
- TTFT and TTFB: time to first token (LLM) and time to first byte (TTS). The two biggest latency levers.
- Function calling: the LLM invoking your APIs (check a calendar, create a booking) instead of just generating words.
- Containment rate: the percentage of calls fully handled without a human.
- Cascaded pipeline vs. speech-to-speech: chained STT, LLM, and TTS stages versus one model that hears and speaks directly.
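To make the cascade concrete, here is a minimal Python sketch of one caller turn passing through the three stages. The stage functions (`fake_stt`, `fake_llm`, `fake_tts`) are hypothetical stand-ins, not any vendor's API, and the stages run one after another for readability; a production pipeline would stream them so they overlap.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Turn:
    transcript: str      # what speech-to-text heard
    reply_text: str      # what the language model decided to say
    reply_audio: bytes   # what text-to-speech produced


def run_turn(
    audio: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> Turn:
    """Pass one caller utterance through the cascade, stage by stage."""
    transcript = stt(audio)
    reply_text = llm(transcript)
    reply_audio = tts(reply_text)
    return Turn(transcript, reply_text, reply_audio)


# Hypothetical stand-ins so the sketch runs without any vendor SDK.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(transcript: str) -> str:
    return f"Sure, I can help with: {transcript}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


turn = run_turn(b"book a cleaning", fake_stt, fake_llm, fake_tts)
print(turn.reply_text)
```

Because each stage is just a function argument, swapping in a different transcription or voice provider does not touch the rest of the turn, which is the per-stage control the cascade buys you.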
Who Should Build One, and Who Should Not
Voice agents solve a specific cluster of problems: calls missed during rushes and after hours, callers abandoning hold queues, staff interrupted by the same five questions, and leads going cold because nobody followed up within minutes. If those are your problems, some form of voice agent, built or bought, probably pays for itself.
Building one yourself is a different question. DIY makes sense when you have engineering capacity, enough call volume that per-minute platform pricing hurts, requirements no platform supports, or data-residency rules that force self-hosting. It rarely makes sense for a dental office, law firm, or restaurant that needs the phone answered next week, where every hour spent debugging prompts is an hour not spent on patients, clients, or covers. Answer these ten questions before committing to any path:
- How many calls per month, and what is the average length? Under roughly 1,000 minutes a month, per-minute pricing barely matters; build effort dominates.
- How many distinct call types? Booking, rescheduling, hours, pricing, emergencies. Start with one.
- What share of calls arrive after hours or while staff are busy?
- What systems must it touch: calendar, CRM, POS, EHR?
- Will it ever make outbound calls? This triggers TCPA obligations; see the compliance section.
- Are you in an all-party recording-consent state, or do you take calls from one?
- Does any call involve health information? HIPAA changes your vendor list.
- Who on your team will own prompts, testing, and monitoring after launch, and how many hours a week do they have?
- What is your latency tolerance: would your callers forgive a two-second pause?
- What is your kill criterion: at what failure rate do you switch approaches?
The Three Ways to Get a Voice Agent
Every path to a working agent is some mix of three options: assemble components yourself, configure a managed platform, or hire a service that does it all. Worth knowing as you read elsewhere: each of the three guides ranking for this query funnels readers to its own layer, Rasa to its orchestration platform, AssemblyAI to its transcription API, and Cake to its infrastructure platform. That is not a criticism, just a reason to want the comparison laid out neutrally. (On speed, Cake's guide pegs no-code platform setup at hours rather than weeks, which matches what we see.)
Cake's guide also makes a fair point about the middle tier: prebuilt platforms can feel like a black box when you cannot see why the agent behaved a certain way, and per-minute costs climb with volume. The honest counterpoint is that platform observability has improved, and for most teams the black box is still far cheaper than the glass house they would build themselves.
| Path | Runtime cost | Time to live | Your ongoing workload | Best for |
|---|---|---|---|---|
| DIY component stack | Roughly $0.05-$0.15/min in raw STT, LLM, TTS, and telephony fees, per cost guides from CloudTalk and GetVoIP, plus engineering time | 4-12 weeks for basic scope, 3-6 months enterprise, per Riseup Labs | High: you own prompts, monitoring, upgrades, and 2 a.m. outages | Dev teams with high volume, custom requirements, or data-residency rules |
| Managed platform (Vapi, Retell, Bland, Synthflow) | Roughly $0.25-$0.50/min all-in, per the same cost-guide cluster | Hours to a prototype; days to weeks to production quality | Medium: vendor runs infrastructure; you own prompts, testing, integrations | Technical operators who want control without building infrastructure |
| Done-for-you service (MapleVoice and similar) | Flat monthly fee; no per-minute meter | About 48 hours with MapleVoice | Low: vendor builds, tests, monitors, and maintains | Businesses that want answered phones, not a software project |
Choosing Your Stack, Layer by Layer
If you are building, here are the real choices at each layer as of 2026, with the selection criteria that actually matter.
- Speech-to-text: Deepgram and AssemblyAI are the most commonly cited picks for streaming phone audio; OpenAI's Whisper is the common self-hosted pick. Judge on streaming latency and accuracy on phone-quality audio with your domain vocabulary: drug names, menu items, street names. Skip Mozilla DeepSpeech; as Rasa's guide notes, the project is archived and no longer safe for new production builds.
- LLM: a frontier hosted model (GPT-4-class or Claude) for reasoning and reliable function calling, or a self-hosted Llama or Mistral variant where data residency demands it, an option Cake's guide flags as workable for narrow flows. Judge on time-to-first-token under load and tool-calling reliability, not benchmark scores.
- Text-to-speech: ElevenLabs is, per Cake's guide, widely considered the market leader for realism; Cartesia and Rime, both on Rasa's commercial-provider list, are known for low latency, and Azure Speech for enterprise language coverage. Judge on time-to-first-byte, pronunciation of your proper nouns, and how it reads numbers, addresses, and times aloud.
- Telephony: Twilio and Telnyx for buying numbers and SIP trunking. Rasa's connector list (Jambonz, Twilio Media Streams, AudioCodes, and Genesys Cloud) is a good map of how agents attach to existing phone infrastructure.
- Orchestration: the glue that manages turn-taking, interruptions, state, and tool calls. Pipecat and LiveKit Agents are the two most visible open-source frameworks, and AssemblyAI's tutorials build on LiveKit. Vapi and Retell are essentially this layer sold as a product.
Step 1: Pick One Call Type and Define Success (Half a Day)
Resist the everything-agent. Pick the single call type that is highest volume and lowest ambiguity: appointment booking for most service businesses, order taking or reservations for restaurants, lead qualification for agencies. Rasa's start-narrow advice is right; the version with teeth is committing to one call type in writing.
Write a one-page scope document that names the call type, the three to five outcomes the agent may produce (booked, rescheduled, message taken, transferred, ended), the systems it may touch, and the situations where it must hand off to a human: emergencies, angry callers, anything involving money disputes. Then set numeric success criteria before you build anything: a target containment rate (70 percent is a reasonable opening goal for a single-purpose booking line), a maximum acceptable escalation rate, and the latency ceiling you will hold yourself to.
Artifact for this step: a one-page scope doc with numbers in it. If you cannot fill in the numbers yet, you are not ready for step 2.
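One way to keep yourself honest about "numbers in it" is to treat the scope doc as data. The sketch below is an illustration, not a required format: the `ScopeDoc` class and its field names are hypothetical, and the readiness check simply encodes the three-to-five-outcomes rule and the three numeric criteria described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScopeDoc:
    call_type: str
    outcomes: list[str]
    systems: list[str]
    handoff_triggers: list[str]
    target_containment: Optional[float] = None   # e.g. 0.70
    max_escalation: Optional[float] = None       # e.g. 0.15
    latency_ceiling_s: Optional[float] = None    # e.g. 1.2

    def missing_numbers(self) -> list[str]:
        """Names of the numeric success criteria that are still blank."""
        numbers = {
            "target_containment": self.target_containment,
            "max_escalation": self.max_escalation,
            "latency_ceiling_s": self.latency_ceiling_s,
        }
        return [name for name, value in numbers.items() if value is None]

    def ready_for_step_2(self) -> bool:
        """Ready only with three to five outcomes and every number filled in."""
        return 3 <= len(self.outcomes) <= 5 and not self.missing_numbers()


scope = ScopeDoc(
    call_type="appointment booking",
    outcomes=["booked", "rescheduled", "message taken", "transferred", "ended"],
    systems=["calendar", "CRM"],
    handoff_triggers=["emergency", "angry caller", "money dispute"],
    target_containment=0.70,
)
print(scope.missing_numbers(), scope.ready_for_step_2())
```

In this example the draft is not ready: the containment target is set, but the escalation ceiling and latency ceiling are still blank.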
Step 2: Write the Call as a Transcript Before Any Code (1-2 Days)
Voice agents are scripts before they are software. Write at least three full transcripts by hand: the happy path, a messy path where the caller rambles and changes their mind, and an escalation path. Reading them aloud teaches you more about voice UX than any framework doc: spoken replies must be one to two sentences, the agent must confirm names and numbers by reading them back, and every dead end needs a graceful exit line.
Here is an illustrative example, written for this article rather than taken from a live call; you can hear real recorded calls at /call-recordings.
Step 3: Assemble the Pipeline and Hit the Latency Budget (1-2 Weeks DIY, a Day on a Platform)
Once the transcripts feel right, distill them into a system prompt: the agent's role, its allowed outcomes, hard rules, the confirmation patterns, and the escalation triggers. Then wire your chosen layers together and get the agent talking in a browser test call before touching telephony.
An illustrative skeleton of a working system prompt, condensed to the four blocks that matter. Role: you are the booking assistant for Lakeside Dental; you can book, move, or cancel appointments, and answer questions only from the approved facts list. Hard rules: never quote a price that is not on the price list; never say a booking is confirmed until the calendar tool returns success; offer a human transfer whenever the caller asks. Style: replies of one to two sentences, one question per turn, read every name and number back before using it. Escalation: transfer immediately for emergencies, billing disputes, or a caller who is clearly upset. A prompt with those four blocks, filled in for your business, beats most first drafts that try to anticipate every utterance.
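If you keep the four blocks as separate strings, assembling and size-checking the prompt takes a few lines. This is a sketch under assumptions: the block text is condensed from the skeleton above, the function name `build_system_prompt` is hypothetical, and the 800-word default mirrors the rough ceiling this step suggests for a version-one prompt.

```python
PROMPT_BLOCKS = {
    "Role": (
        "you are the booking assistant for Lakeside Dental; you can book, move, "
        "or cancel appointments, and answer questions only from the approved facts list."
    ),
    "Hard rules": (
        "never quote a price that is not on the price list; never say a booking is "
        "confirmed until the calendar tool returns success; offer a human transfer "
        "whenever the caller asks."
    ),
    "Style": (
        "replies of one to two sentences, one question per turn, read every name "
        "and number back before using it."
    ),
    "Escalation": (
        "transfer immediately for emergencies, billing disputes, or a caller who "
        "is clearly upset."
    ),
}


def build_system_prompt(blocks: dict[str, str], max_words: int = 800) -> str:
    """Join the labeled blocks and refuse prompts that exceed the word ceiling."""
    prompt = "\n\n".join(f"{name}: {text}" for name, text in blocks.items())
    word_count = len(prompt.split())
    if word_count > max_words:
        raise ValueError(f"prompt is {word_count} words; ceiling is {max_words}")
    return prompt


prompt = build_system_prompt(PROMPT_BLOCKS)
print(len(prompt.split()), "words")
```

Failing loudly on an oversized prompt is a design choice: it keeps prompt growth a deliberate decision rather than something that creeps in fix by fix.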
The make-or-break engineering constraint is latency. Rasa's guide nails the framing: a 400-millisecond delay is invisible in chat and feels broken in voice. Below is the working budget practitioners aim for, stage by stage. Three tactics buy back the most time: stream every stage so they overlap; have the agent speak a brief acknowledgment such as 'one moment, checking the calendar' whenever a tool call will take more than a second; and keep the system prompt lean, because every token you add is paid on every turn. Artifacts for this step: a version-one system prompt under roughly 800 words, and a recorded test call where measured voice-to-voice response beats your target.
| Pipeline stage | Working target | What blows the budget |
|---|---|---|
| Voice activity detection and endpointing | 200-300 ms after the caller stops | Over-patient endpointing; AssemblyAI's tutorial treats a 700 ms pause as end-of-utterance, a reasonable upper bound |
| Streaming STT (partial transcripts) | 100-300 ms behind live speech | Non-streaming transcription that waits for complete audio |
| LLM time-to-first-token | 300-500 ms | Long system prompts, cold starts, slow tool calls inside the turn |
| TTS time-to-first-byte | 100-250 ms | Non-streaming synthesis; very long first sentences |
| Total voice-to-voice | Under 1 second target; alert at 1.2 s p95 | Any stage running sequentially instead of streaming |
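The budget arithmetic is worth checking by hand. The sketch below adds up the stage targets from the table as if the stages ran strictly one after another, then applies the 1.2-second p95 alert to a list of measured response times. The function names are hypothetical, and the percentile uses the simple nearest-rank method, which is one of several reasonable definitions.

```python
import math

# Stage targets from the table above, in milliseconds (low, high).
STAGE_TARGETS_MS = {
    "vad_endpointing": (200, 300),
    "streaming_stt": (100, 300),
    "llm_ttft": (300, 500),
    "tts_ttfb": (100, 250),
}

ALERT_P95_S = 1.2


def sequential_range_ms(targets: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Best and worst case if every stage waited for the previous one to finish."""
    low = sum(lo for lo, _ in targets.values())
    high = sum(hi for _, hi in targets.values())
    return low, high


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of measured voice-to-voice times."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(95 * len(ordered) / 100)
    return ordered[rank - 1]


def needs_alert(samples: list[float]) -> bool:
    return p95(samples) > ALERT_P95_S


print(sequential_range_ms(STAGE_TARGETS_MS))
```

Run sequentially, the stage targets total 700 to 1,350 ms, so the high end already misses the one-second goal. That gap is the reason streaming overlap matters: it is what brings the total under target.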
Step 4: Wire Up Telephony: Numbers, Forwarding, and STIR/SHAKEN (2-5 Days)
This is the step every ranking guide skips, and it is the first real blocker for a business build. You have three ways to get calls into the agent. Buy a new number from Twilio or Telnyx and publish it, which is clean but abandons your existing number's equity. Forward your existing line, either unconditionally or conditionally so the agent only catches the calls you miss or calls after hours, which is the gentlest rollout. Or port the number entirely, which takes days to weeks and is hard to reverse, so it should come last, not first.
If you will make outbound calls, the carrier layer gets stricter. STIR/SHAKEN is the caller-ID authentication framework US carriers use; calls from unregistered or poorly attested numbers increasingly get labeled Spam Likely or blocked outright. Register your numbers and business identity with your carrier, ramp calling volume gradually, and monitor your number's reputation. None of this is optional if you want your calls answered.
Artifact for this step: a real phone number that rings your agent, plus a tested failover. If the agent or any vendor in the chain goes down, calls must roll to a human line or voicemail, never to dead air.
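The conditional-forwarding and failover rules can be written down as a small decision function before you configure anything at the carrier. This is a sketch of the logic only: `route_call` and its three inputs are hypothetical, and in practice the decision lives in your carrier's or platform's call-routing settings rather than in your own code.

```python
def route_call(agent_healthy: bool, business_open: bool, staff_free: bool) -> str:
    """Return where an inbound call should land: 'human', 'agent', or 'voicemail'."""
    # Conditional forwarding: staff answer first whenever they can.
    if business_open and staff_free:
        return "human"
    # The agent catches missed and after-hours calls, but only while it is healthy.
    if agent_healthy:
        return "agent"
    # Failover: if the agent or a vendor in the chain is down, never leave dead air.
    return "voicemail"


for healthy in (True, False):
    for is_open in (True, False):
        for free in (True, False):
            print(healthy, is_open, free, "->", route_call(healthy, is_open, free))
```

Enumerating every combination, as the loop does, is a cheap way to confirm that no input leaves the caller without a destination.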
Step 5: Connect the Tools: Calendars, CRMs, and Confirmed Writes (About a Week)
Function calling is what turns a conversationalist into an agent: the model invokes your APIs to check real availability, create the booking, and log the call in the CRM. The cardinal rule is tool-confirmed writes. The agent may only say 'you're booked' after the calendar API returns success. Skip this and you will eventually hit the worst failure in voice AI: a confident, fluent confirmation of an appointment that does not exist.
Make write operations idempotent so a retry cannot create duplicate bookings, handle the caller's timezone versus the business's explicitly, and write everything back: the booking, the call outcome, and the transcript reference, so the front desk can see what the agent did without asking. Artifact for this step: one live end-to-end test in which a phone call produces a real calendar event and a real CRM entry.
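Tool-confirmed writes and idempotent retries fit in a short sketch. Everything named here is an assumption for illustration: `FakeCalendar` is an in-memory stand-in for whatever calendar API you use, and the idempotency key is derived from the call ID, caller, and slot, which is one reasonable choice among several.

```python
import hashlib


class FakeCalendar:
    """In-memory stand-in for a real calendar API (hypothetical interface)."""

    def __init__(self) -> None:
        self.events: dict[str, dict] = {}

    def create_event(self, key: str, caller: str, slot_utc: str) -> dict:
        # A retry with the same key keeps the existing event: no duplicate booking.
        self.events.setdefault(key, {"caller": caller, "slot_utc": slot_utc})
        return {"ok": True}


def booking_key(call_id: str, caller: str, slot_utc: str) -> str:
    """Derive a stable idempotency key from the facts that define one booking."""
    return hashlib.sha256(f"{call_id}|{caller}|{slot_utc}".encode()).hexdigest()


def book_and_reply(calendar, call_id: str, caller: str, slot_utc: str) -> str:
    """Say 'booked' only after the calendar write reports success."""
    key = booking_key(call_id, caller, slot_utc)
    try:
        result = calendar.create_event(key, caller, slot_utc)
    except Exception:
        result = {"ok": False}
    if result.get("ok"):
        return "You're booked."
    return "I couldn't confirm that booking. Let me connect you with the front desk."


calendar = FakeCalendar()
first = book_and_reply(calendar, "call-1", "+15550100", "2026-03-02T15:00:00Z")
retry = book_and_reply(calendar, "call-1", "+15550100", "2026-03-02T15:00:00Z")
print(first, len(calendar.events))
```

The confirmation sentence is chosen by code from the tool result, not left to the model's judgment. The slot is stored in UTC so the caller's and the business's timezones are converted explicitly at the edges.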
Step 6: Test Like a Pessimist: 50 Shadow Calls (3-5 Days)
Internal happy-path demos prove nothing. Before launch, run at least 50 scripted shadow calls across a deliberately hostile matrix, and score each one pass or fail against your step 1 criteria.
- Accents and speech styles: fast talkers, soft talkers, non-native speakers, children.
- Acoustic abuse: speakerphone in a car, kitchen noise, wind, a weak cell connection.
- Barge-in: interrupt the agent mid-sentence and verify it stops and listens.
- Compound requests: cancel Tuesday and book something next month, oh, and do you take Aetna?
- Hard names and numbers: verify the spell-back confirmation catches deliberate misreadings.
- Out-of-scope and emergencies: medical urgency, legal threats, a demand for a manager. Each must escalate, not improvise.
- Silence and nonsense: ten seconds of nothing, a pocket dial, a fax tone.
- The angry caller: verify the tone holds and the human-transfer offer comes early.
Set a numeric bar before you start, for example 45 of 50 passing with zero failures in the escalation category, and do not launch below it. Artifact for this step: the filled-in scorecard.
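The example bar for this step, 45 of 50 passing with zero failures in the escalation category, is easy to check mechanically once the scorecard is filled in. The sketch below assumes each call is recorded as a `(category, passed)` pair; the function name and category labels are hypothetical.

```python
def launch_ready(
    results: list[tuple[str, bool]],
    min_calls: int = 50,
    min_pass: int = 45,
    zero_fail_category: str = "escalation",
) -> bool:
    """Apply the launch bar: enough calls, enough passes, no escalation failures."""
    passed = sum(1 for _, ok in results if ok)
    escalation_failed = any(
        category == zero_fail_category and not ok for category, ok in results
    )
    return len(results) >= min_calls and passed >= min_pass and not escalation_failed


scorecard = (
    [("accents", True)] * 20
    + [("barge_in", True)] * 15
    + [("noise", False)] * 5
    + [("escalation", True)] * 10
)
print(launch_ready(scorecard))
```

Keeping the escalation check separate from the pass count matters: a run can clear 45 passes overall and still be blocked by one mishandled emergency.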
Step 7: Launch at Partial Volume and Watch Six Numbers (Ongoing)
Do not flip 100 percent of calls on day one. Start with after-hours only, or 20 to 25 percent of inbound volume, and expand as the numbers hold. Rasa's KPI list is the strongest single idea on the current SERP; here is the working version with thresholds attached.
- Containment rate: the share of calls fully handled without a human. A reasonable opening target for a single-purpose booking line is around 70 percent; investigate any week-over-week decline.
- Escalation rate and reasons: a working ceiling of roughly 15 percent, with every escalation tagged with a cause.
- Task completion: of callers who wanted a booking, how many left with one.
- Voice-to-voice latency: p95 under 1.2 seconds, with an alert above it.
- Transcription accuracy on your domain terms: spot-check ten calls a week against the audio.
- Time-to-first-meaningful-action, a metric framing borrowed from Rasa's guide: how long until the agent does something useful rather than just acknowledging.
Listen to actual recordings weekly, not just dashboards; the worst failures read fine in transcripts and sound terrible. Artifact: a dashboard with these six numbers and alerts on two of them, latency and escalation.
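The rate-based numbers among the six can be computed from a plain call log. The sketch below assumes each call is a dictionary with hypothetical keys (`outcome`, `wanted_booking`, `got_booking`) and takes the p95 latency as an input measured elsewhere; the two alert thresholds are the 1.2-second and 15 percent figures from this step.

```python
LATENCY_ALERT_S = 1.2
ESCALATION_CEILING = 0.15


def weekly_kpis(calls: list[dict], p95_latency_s: float) -> tuple[dict, list[str]]:
    """Compute containment, escalation, and task completion, plus the two alerts."""
    if not calls:
        raise ValueError("no calls to score")
    total = len(calls)
    contained = sum(1 for c in calls if c["outcome"] == "contained")
    escalated = sum(1 for c in calls if c["outcome"] == "escalated")
    wanted = [c for c in calls if c["wanted_booking"]]
    completed = sum(1 for c in wanted if c["got_booking"])
    kpis = {
        "containment_rate": contained / total,
        "escalation_rate": escalated / total,
        "task_completion": completed / len(wanted) if wanted else None,
        "p95_latency_s": p95_latency_s,
    }
    alerts = []
    if p95_latency_s > LATENCY_ALERT_S:
        alerts.append("latency")
    if kpis["escalation_rate"] > ESCALATION_CEILING:
        alerts.append("escalation")
    return kpis, alerts


log = (
    [{"outcome": "contained", "wanted_booking": True, "got_booking": True}] * 7
    + [{"outcome": "escalated", "wanted_booking": True, "got_booking": False}] * 2
    + [{"outcome": "abandoned", "wanted_booking": False, "got_booking": False}]
)
kpis, alerts = weekly_kpis(log, p95_latency_s=1.05)
print(kpis, alerts)
```

In this toy week containment sits at the 70 percent target, but the 20 percent escalation rate trips the alert, which is the cue to read the escalation tags and listen to those recordings.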
Compliance Guardrails: TCPA, Recording Consent, and HIPAA
US compliance is absent from every guide ranking for this query, and it is where DIY builders get hurt. The rules below are public regulatory facts, stated as of 2026; confirm specifics with counsel.
Outbound calling is the high-voltage zone. In February 2024 the FCC issued a declaratory ruling that AI-generated voices count as artificial or prerecorded voices under the TCPA, which means outbound calls using an AI voice require the called party's prior express consent, and prior express written consent for marketing calls. TCPA statutory damages run $500 per violation and up to $1,500 for willful violations, per call, and class actions are routine. If your agent dials out, consent records, do-not-call scrubbing, calling-hour windows, and instant opt-out handling are launch requirements, not enhancements.
Recording consent: federal law and most states require one party's consent, but a number of states, including California, Florida, Illinois, Maryland, Massachusetts, Pennsylvania, and Washington as of 2026, require all parties to consent. Since you cannot control where callers are standing, the safe pattern is announcing recording at the start of every call. Pair it with an AI disclosure in the greeting, something as simple as 'I'm an automated assistant': it builds trust, state legislatures are actively adding disclosure requirements as of 2026, and callers behave more predictably when they are not guessing.
HIPAA: if the agent handles protected health information for a covered entity, every vendor that touches call data (the platform, the STT provider, the LLM provider, the telephony carrier) must sign a business associate agreement. One BAA-less link in the chain is a violation. Many consumer-tier AI APIs will not sign one, which by itself reshapes a healthcare stack. The launch checklist:
- Inbound-only at launch unless you hold documented consent for outbound.
- Recording disclosure in the first five seconds of every call.
- AI disclosure in the greeting.
- Outbound: prior express written consent on file before any marketing call with an AI voice.
- Outbound: scrub against the National Do Not Call Registry and honor opt-outs immediately.
- Outbound: respect TCPA calling hours, 8 a.m. to 9 p.m. in the called party's local time.
- STIR/SHAKEN registration and ongoing number-reputation monitoring.
- Healthcare: signed BAAs from every vendor in the audio path, no exceptions.
- A data-retention policy: how long recordings and transcripts live, and who can access them.
- A written kill-switch procedure for shutting the agent off in minutes.
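Several of the outbound items on this checklist reduce to a gate that runs before every dial. The sketch below is illustrative only and is not legal advice: the function names are hypothetical, it treats 9 p.m. exactly as already closed (a conservative assumption), and it uses a fixed UTC offset so it runs anywhere. In production you would more likely pass a `zoneinfo.ZoneInfo` for the called party's region so daylight saving is handled.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def within_calling_hours(now_utc: datetime, called_party_tz: tzinfo) -> bool:
    """True between 8 a.m. and 9 p.m. in the called party's local time."""
    # now_utc must be timezone-aware; a naive datetime would be misread as local.
    local = now_utc.astimezone(called_party_tz)
    return time(8, 0) <= local.time() < time(21, 0)


def may_dial(
    now_utc: datetime,
    called_party_tz: tzinfo,
    has_written_consent: bool,
    on_dnc_list: bool,
    opted_out: bool,
) -> bool:
    """Every condition must hold before an AI-voice marketing call is placed."""
    return (
        has_written_consent
        and not on_dnc_list
        and not opted_out
        and within_calling_hours(now_utc, called_party_tz)
    )


central_standard = timezone(timedelta(hours=-6))  # fixed offset, for illustration
now = datetime(2026, 1, 15, 14, 30, tzinfo=timezone.utc)  # 8:30 a.m. local
print(may_dial(now, central_standard, True, False, False))
```

Writing the gate as a single boolean makes it easy to log the reason a call was skipped and to show, later, that the checks ran.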
What It Really Costs: The Math Nobody Publishes
None of the three ranking guides prints a dollar figure; Rasa's how-much-does-it-cost FAQ manages to answer without one. The cost-comparison pages that rank for the cost query do publish numbers, so here they are, attributed and assembled. Cost guides from CloudTalk and GetVoIP put raw infrastructure-layer runtime around $0.05 to $0.15 per minute and managed platforms at roughly $0.25 to $0.50 per minute. Master of Code's development-cost guide puts custom builds at $10,000 to $25,000 for an MVP, $25,000 to $150,000 for production, and $75,000 to $300,000 for enterprise scope, with roughly $800 to $3,000 a month in running costs at 10,000 minutes per month. A widely shared Medium analysis by Jordan Gibbs worked bare-bones DIY component runtime out to about $0.28 per hour, far below even the infrastructure-layer range, which mostly shows how wildly published figures swing with stack choices and usage assumptions.
What ambushes DIY budgets is the bottom two rows of the table below: ongoing maintenance and the hidden line item. Component fees are nearly free at small volume; the engineering hours are not. A build that consumes three weeks of one engineer's time has already cost more than a year of most managed options before it answers its first real call. Run the math on your own volume before the sunk-cost machine starts running.
| Cost line | DIY stack | Managed platform | Done-for-you |
|---|---|---|---|
| Runtime at 500 calls/mo (about 1,500 min) | $75-$225 in component fees | $375-$750 | Flat monthly fee, volume-independent |
| Runtime at 2,000 calls/mo (about 6,000 min) | $300-$900 | $1,500-$3,000 | Same flat fee |
| Build cost | $10K-$25K MVP to $25K-$150K production, per Master of Code, or weeks of in-house engineering | Usually $0 upfront; your hours go to prompts, flows, and testing | Typically none; the setup is the service |
| Ongoing maintenance | Your engineers: monitoring, prompt regressions, model upgrades, vendor changes | Your operator: log review, prompt tuning, retesting | Included |
| Hidden line item | On-call burden and the opportunity cost of those engineers | The per-minute bill grows with your success | Less low-level control over the stack |
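To run the math on your own volume, the runtime rows of the table reduce to minutes times a per-minute range. The sketch below uses the per-minute figures attributed above to CloudTalk and GetVoIP and the three-minute average call the table implies; the function names are hypothetical, and the break-even helper ignores ongoing engineering time, so it flatters the DIY path.

```python
# Per-minute runtime ranges quoted in this section (low, high), in dollars.
RATES_PER_MIN = {"diy": (0.05, 0.15), "managed": (0.25, 0.50)}


def monthly_runtime(calls: int, avg_minutes: float, path: str) -> tuple[float, float]:
    """Low and high monthly runtime cost for a given call volume and path."""
    minutes = calls * avg_minutes
    low, high = RATES_PER_MIN[path]
    return round(minutes * low, 2), round(minutes * high, 2)


def break_even_months(build_cost: float, diy_monthly: float, managed_monthly: float):
    """Months until DIY runtime savings repay the build; None if there are none."""
    savings = managed_monthly - diy_monthly
    if savings <= 0:
        return None
    return build_cost / savings


print(monthly_runtime(500, 3, "diy"), monthly_runtime(500, 3, "managed"))
print(break_even_months(10_000, 75, 750))
```

At 500 calls a month, even the most favorable pairing (the cheapest DIY runtime against the priciest managed rate, with a $10,000 MVP at the low end of Master of Code's range) takes more than a year to break even on runtime savings alone.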
Failure Modes Nobody Warns You About
Every voice agent fails. The difference between a good build and a liability is whether you designed for the failures in advance. These are the recurring ones, with mitigations.
- Hallucinated confirmations: the agent tells a caller they are booked when no write succeeded. Mitigation: tool-confirmed writes only, enforced in code, not in the prompt.
- Misheard names and numbers: transcription quietly turns Kaitlyn into Caitlin or drops a digit. Mitigation: spell-back and read-back confirmation on every name and number, plus confidence-gated re-asks.
- Barge-in mishandling: the agent talks over interruptions, or stops and never recovers. Mitigation: test interruption explicitly in every release.
- Dead air during tool calls: three silent seconds while the calendar API thinks. Mitigation: spoken acknowledgments before any slow operation.
- Spam and robocall loops: other robots call your robot and burn minutes. Mitigation: rate limiting, known-spam screening, and a maximum call duration.
- Vendor and model outages: any one of four vendors going down takes the agent with it. Mitigation: failover routing to voicemail or a human line, plus alerting that actually pages someone.
- Model-version drift: a provider upgrades the underlying model and your agent's behavior shifts overnight. Mitigation: pin model versions and run your 50-call regression suite before adopting any upgrade.
- Accent and noise blind spots: error rates climb for some caller groups while the averages look fine. Mitigation: segment your accuracy spot-checks; never trust the mean.
One limitation no mitigation removes: some callers want a human, full stop, and some calls (grief, rage, true emergencies, complex negotiations) should never be automated. The right design goal is not 100 percent containment; it is fast, context-rich handoff for the calls a machine should not own.
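The read-back mitigation for misheard names and numbers can be sketched as a function that picks the agent's next line from the transcription confidence. The 0.85 threshold, the function name, and the exact wording are all hypothetical; whether your STT provider exposes a usable confidence score, and on what scale, is something to verify against its documentation.

```python
REASK_THRESHOLD = 0.85  # hypothetical cutoff; tune it against your own spot-checks


def confirmation_prompt(field: str, value: str, confidence: float) -> str:
    """Re-ask on low confidence; otherwise spell or read the value back."""
    if confidence < REASK_THRESHOLD:
        return f"Sorry, I didn't catch that. Could you repeat your {field}?"
    if field == "phone number":
        # Read digits one at a time so a dropped digit is audible.
        read_back = " ".join(ch for ch in value if ch.isdigit())
    else:
        # Spell names letter by letter so Kaitlyn and Caitlin sound different.
        read_back = "-".join(ch for ch in value.upper() if ch.isalpha())
    return f"I have your {field} as {read_back}. Is that right?"


print(confirmation_prompt("name", "Kaitlyn", 0.95))
print(confirmation_prompt("phone number", "555-0142", 0.60))
```

The caller's answer to the read-back, not the transcript, is what should unlock the next step; a "no" sends the agent back to ask again.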
The Maintenance Bill: Week 2, Month 2, Month 6
All three ranking guides say continuously improve, and none says what that costs. Here is the honest operating cadence.
Week 2: daily log triage. Production callers surface phrasings, requests, and edge cases your testing never imagined; expect a steady stream of small prompt fixes and at least one how-did-it-do-that incident. Budget an hour a day.
Month 2: the cadence settles into weekly. The recurring work is reviewing escalation tags, refreshing the facts the agent quotes (hours, prices, menus, providers), and running the regression suite before any prompt or model change. Budget a few hours a week, indefinitely. This is the number DIY plans most often omit.
Month 6: structural work appears. A vendor changes pricing or deprecates an API version; a better TTS voice ships; the model under your agent gets superseded; seasonal scripts need rewriting. An agent is a product you operate, not a project you finish. If nobody owns it on your org chart, it decays, and decayed voice agents do not fail loudly; they quietly mishandle callers until someone finally reviews the recordings.
When Not to Build, and Where MapleVoice Fits
Honest routing, including the answers that pay us nothing. Build DIY if you have engineers, tens of thousands of minutes a month, hard data-residency requirements, or a use case no platform supports; at that scale, owning the stack wins. Use a Vapi- or Retell-class platform if you are a technical operator who wants control over prompts and flows without owning infrastructure; for most builders this is the right DIY choice. Hire a human answering service if call volume is tiny and every call is deeply relational. And do nothing if missed calls are not actually costing you anything; not every business has this problem.
Here is the clearly marked sales paragraph. MapleVoice is the done-for-you path: we build, test, and operate the agent so you skip every step above. Agents go live in about 48 hours and answer 24/7 in under 2 seconds. They book appointments, qualify leads, and take orders; transfer to a human with full context; and integrate with your booking, CRM, and POS systems. Pricing is a flat monthly rate with no per-minute meter, builds are tuned to your vertical across the 20 industries we cover, outbound includes TCPA controls, and we sign BAAs for qualifying healthcare customers. Every call produces a recording, transcript, summary, call reason, outcome, and next step, the same artifacts this guide just told you to build, delivered on day one.
Next steps, whichever path you take. If you are buying, see /compare for side-by-sides against the DIY platforms and /pricing for the flat-rate model. If you are still mapping the territory, /blog/what-is-an-ai-voice-agent covers the fundamentals, and /call-recordings has real calls to benchmark your build against. If you are building: open a doc, name one call type, and write the first transcript today. That half-day artifact is worth more than another week of vendor research.
Frequently asked questions
Can I build my own AI voice agent?
Yes. A solo developer can ship a working prototype in a day on a platform like Vapi or Retell, and in a few weeks on open-source frameworks like Pipecat or LiveKit Agents. The hard part is not the prototype; it is telephony, integrations, compliance, testing, and the permanent maintenance load after launch.
How much does it cost to build an AI voice agent?
Raw component runtime is roughly $0.05 to $0.15 per minute and managed platforms about $0.25 to $0.50 per minute, per CloudTalk and GetVoIP cost guides. Custom development runs $10,000 to $25,000 for an MVP and up to $150,000 for production scope, per Master of Code's estimates.
How long does it take to build an AI voice agent?
Hours to a prototype on a no-code platform, then 4 to 12 weeks for a basic custom build and 3 to 6 months for enterprise scope, per Riseup Labs estimates. Done-for-you services compress this further; MapleVoice deploys fully managed agents in about 48 hours.
Can you build an AI voice agent with no code?
Yes. Platforms like Vapi, Retell, Synthflow, and SignalWire bundle speech-to-text, the language model, text-to-speech, and telephony behind visual flow builders, and a functional agent can be configured in hours. You still own the non-code work: writing scripts, running 50 hostile test calls, and maintaining it after launch.
What is the difference between an AI voice agent and an IVR?
An IVR is a menu; a voice agent is a conversation. IVRs force callers through fixed press-1 trees with a rigid vocabulary, while voice agents understand open-ended speech, hold context across turns, handle interruptions, and complete tasks like booking appointments by calling your calendar or CRM APIs directly.
Do I need to build speech recognition or text-to-speech from scratch?
No, and you should not try. Production builds compose existing providers: Deepgram, AssemblyAI, or Whisper for transcription; ElevenLabs, Cartesia, or Azure for voices; a frontier LLM in between. Your real engineering work is orchestration, latency, tool integrations, and testing, not rebuilding already-solved components.
Should I deploy a voice agent in the cloud or on-premises?
Cloud, unless regulation forces otherwise. Cloud deployment is faster to ship, easier to scale, and how every managed platform operates. On-premises or private-cloud hosting matters mainly in finance, government, and healthcare where data-residency rules apply, and it points you toward self-hostable stacks like Whisper plus open-weight LLMs.
Is it legal for an AI voice agent to record calls?
Yes, with consent. Most US states require only one party's consent, but states including California, Florida, Pennsylvania, and Washington require all parties' consent as of 2026. Since callers can be anywhere, the safe practice is announcing recording at the start of every call, paired with an AI disclosure.
Will an AI voice agent sound robotic?
Not if you engineer for it. Modern TTS voices are close to human; what actually sounds robotic is latency and rigidity: long pauses, talked-over interruptions, and paragraph-length answers. Keep voice-to-voice response under about a second, support barge-in, and write replies of one to two sentences.
