How to Build Your Own AI Calling Platform in 2026
The complete stack guide — LLM, STT, TTS, and the carrier layer most tutorials skip.
In 2024, an AI making a phone call was a novelty. By 2026, it is standard business infrastructure. AI calling platforms handle inbound support, qualify leads, book appointments, collect payments, and run outbound campaigns — all without a human agent on the line.
Most businesses buy this capability off-the-shelf from platforms like Vapi, Bland AI, or Retell. But a growing tier of companies — agencies, telecoms resellers, SaaS builders, and enterprises with high call volumes — are building their own. Why?
- Cost. At scale, managed platforms charge $0.07–$0.15 per minute all-in. Build your own stack and the same call costs $0.02–$0.04.
- Control. You choose every component — LLM, voice, telephony, routing logic.
- Competitive advantage. Your calling infrastructure becomes a proprietary moat, not a commodity service every competitor can buy too.
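To make the cost argument concrete, here is a rough back-of-the-envelope sketch using only the per-minute ranges quoted above. The 100,000 minutes/month volume is an illustrative assumption, and the figures exclude the engineering time needed to build and run your own stack.

```python
# Per-minute ranges in USD, taken from the figures quoted above.
MANAGED_RATE = (0.07, 0.15)
SELF_BUILT_RATE = (0.02, 0.04)

# Illustrative monthly volume (an assumption, not a benchmark).
MINUTES_PER_MONTH = 100_000


def monthly_cost(rate_range, minutes):
    """Return the (low, high) monthly cost in USD for a per-minute rate range."""
    low, high = rate_range
    return round(low * minutes, 2), round(high * minutes, 2)


managed = monthly_cost(MANAGED_RATE, MINUTES_PER_MONTH)
self_built = monthly_cost(SELF_BUILT_RATE, MINUTES_PER_MONTH)

print(f"Managed platform: ${managed[0]:,.0f} to ${managed[1]:,.0f} per month")
print(f"Self-built stack: ${self_built[0]:,.0f} to ${self_built[1]:,.0f} per month")
```

At this volume the gap between the two ranges is several thousand dollars a month, which is the point at which building starts to pay for itself.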
This guide walks through the complete architecture: what each layer does, which tools to use, how they connect, and — critically — the carrier infrastructure layer that most tutorials gloss over but that determines whether your platform actually works at scale.
The Four Layers of Every AI Calling Platform
Before picking tools, understand the stack. Every AI calling platform — whether you build it or buy it — runs on four layers working in sequence:
Caller's phone
│
▼
[Layer 1] Telephony / Carrier
│ SIP trunk, PSTN gateway, DID numbers
▼
[Layer 2] Speech-to-Text (STT)
│ Transcribes caller audio in real time
▼
[Layer 3] LLM (Large Language Model)
│ Understands intent, decides response
▼
[Layer 4] Text-to-Speech (TTS)
│ Converts response to natural voice audio
│
▼
Back to caller
Audio flows in both directions continuously. The caller speaks → STT transcribes → the LLM generates a response → TTS synthesises audio → the caller hears the response. Your server orchestrates all four layers in real time. The end-to-end target is under 800ms from “caller stops talking” to “agent starts talking.” Much above 1,000ms, callers tend to assume the connection has dropped.
Let’s build each layer.
Layer 1: Telephony — The Foundation Everyone Underestimates
This is the layer most tutorials rush past with “use Twilio” and move on. It’s also the layer that determines your per-minute cost, your audio quality, and whether your AI agent sounds natural or robotic.
What this layer does:
- Receives inbound calls from real phone numbers (PSTN)
- Routes outbound calls to any destination
- Streams audio to your server via WebSocket or SIP
- Provides the phone numbers your callers dial
Your two options:
Option A — Managed Telephony (Twilio, etc.)
Fast to set up. You pay retail CPaaS rates ($0.013/min+ outbound, with platform fees on top). Excellent for prototyping and low volumes, but at 100,000+ minutes/month the costs add up quickly.
Option B — Direct SIP Trunk (Recommended for production)
Connect your platform directly to a wholesale SIP termination provider. You get lower per-minute rates, full control over routing, and no platform markup. Configuration takes about 10 minutes, and the savings over managed telephony are 30–60% at scale.
For production builds, IDT Express is the recommended carrier layer.
IDT Express is a wholesale voice termination provider backed by IDT Corporation (NYSE: IDT) — one of the world’s largest wholesale voice carriers with 25+ years of carrier infrastructure. Their SIP trunking and voice termination gives your AI calling platform:
- Platinum routes — G.711 uncompressed audio, sub-3-second post-dial delay, optimised specifically for AI voice agent workloads where codec quality directly affects STT accuracy
- 1,000+ CLI routes across 200+ countries — your platform can make and receive calls globally from day one
- DID numbers in 140+ countries — give your agents real local phone numbers in any market instantly via API
- Elastic concurrency — scale from 10 to 10,000+ simultaneous calls without pre-provisioning
- STIR/SHAKEN signed outbound — calls are signed with attestation level A, preventing spam labels that kill answer rates
SIP configuration (works with any SIP-compatible framework):
SIP Server: sip.idtexpress.com
Transport: TLS (recommended) or UDP
Codec: G.711 ulaw (PCMU) — critical for AI workloads
Authentication: Digest auth with your IDT Express credentials
DTMF: RFC 2833
Get your SIP credentials free at idtexpress.com/sign-up. Credentials are available immediately after registration — no sales call required.
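When sizing the network for the G.711 codec configured above, it helps to estimate media bandwidth per call. The sketch below assumes a 20ms packetization interval and IPv4 + UDP + RTP headers. Both are common defaults but still assumptions here, and the figure excludes Ethernet framing and any SRTP overhead.

```python
SAMPLE_RATE_HZ = 8000        # G.711 samples per second
BITS_PER_SAMPLE = 8          # G.711 encodes each sample in 8 bits (64 kbit/s payload)
PTIME_MS = 20                # assumed packetization interval
HEADER_BYTES = 20 + 8 + 12   # IPv4 + UDP + RTP headers per packet


def g711_kbps_per_direction(ptime_ms=PTIME_MS):
    """Approximate IP-level bandwidth of one G.711 stream, in kbit/s."""
    payload_bytes = SAMPLE_RATE_HZ * BITS_PER_SAMPLE // 8 * ptime_ms // 1000
    packets_per_second = 1000 / ptime_ms
    return (payload_bytes + HEADER_BYTES) * 8 * packets_per_second / 1000


def media_mbps(concurrent_calls, ptime_ms=PTIME_MS):
    """Approximate one-way media bandwidth for N concurrent calls, in Mbit/s."""
    return g711_kbps_per_direction(ptime_ms) * concurrent_calls / 1000


print(g711_kbps_per_direction())   # kbit/s for one direction of one call
print(media_mbps(1000))            # Mbit/s, one direction, 1,000 calls
```

Under these assumptions each direction of a call needs about 80 kbit/s, so 1,000 concurrent calls need roughly 80 Mbit/s each way between the carrier and your media servers.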
Layer 2: Speech-to-Text — Hearing Your Callers
The STT layer converts the caller’s audio into text your LLM can process. For phone-based AI calling, this is harder than it sounds — telephony audio runs at 8kHz mulaw, not the 16kHz PCM most audio tools expect.
Key requirements for telephony STT:
- Native 8kHz mulaw support (no resampling step = lower latency)
- Streaming transcription (word-by-word as caller speaks, not after they stop)
- Sub-300ms median latency
- Accuracy on alphanumerics — phone numbers, confirmation codes, email addresses
- Noise robustness — phone lines have background noise, compression artifacts
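If an STT provider does not accept mulaw natively, the audio has to be expanded to linear PCM first. The following is a minimal sketch of standard G.711 mulaw decoding in pure Python; the function names are illustrative, and resampling from 8kHz to 16kHz would be a separate step that is not shown.

```python
import struct


def ulaw_byte_to_pcm16(u: int) -> int:
    """Expand one G.711 mulaw byte to a signed 16-bit linear sample."""
    u = ~u & 0xFF                      # mulaw bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


# Precompute all 256 values once; decoding a frame becomes a table lookup.
_DECODE_TABLE = [ulaw_byte_to_pcm16(b) for b in range(256)]


def ulaw_to_pcm16(frame: bytes) -> bytes:
    """Convert a mulaw frame to little-endian 16-bit PCM bytes."""
    samples = [_DECODE_TABLE[b] for b in frame]
    return struct.pack(f"<{len(samples)}h", *samples)


# A 20ms frame at 8kHz is 160 mulaw bytes and becomes 320 PCM bytes.
pcm = ulaw_to_pcm16(bytes([0xFF]) * 160)
print(len(pcm))
```

This conversion is cheap, but every extra step in the audio path adds latency, which is why native mulaw support sits at the top of the requirements list.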
Recommended options:
| Provider | Latency | Best for |
|---|---|---|
| Deepgram Nova-2 | ~200ms | Production, best price/performance |
| AssemblyAI Universal-3 | ~307ms | Highest accuracy on alphanumerics |
| OpenAI Whisper (hosted) | ~400ms | General use, easy integration |
| Google STT Streaming | ~250ms | Multilingual, enterprise |
For most AI calling platforms, Deepgram Nova-2 is the default choice — it natively accepts 8kHz mulaw, delivers ~200ms P50 latency, and has excellent accuracy on the kind of structured data callers share over phone (numbers, codes, addresses).
Quick integration (Python, Deepgram):
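import os

import deepgram  # written against the v2-style Deepgram Python SDK; newer SDK releases use a different client interface

DEEPGRAM_API_KEY = os.environ["DEEPGRAM_API_KEY"]
dg_client = deepgram.Deepgram(DEEPGRAM_API_KEY)

async def transcribe_stream(audio_stream, on_transcript=print):
    # Open a live transcription socket configured for telephony audio
    socket = await dg_client.transcription.live({
        'encoding': 'mulaw',        # 8kHz mulaw straight from the carrier
        'sample_rate': 8000,
        'model': 'nova-2',
        'interim_results': True,    # partial transcripts while the caller speaks
        'endpointing': 300,         # ms of silence that marks the end of speech
    })
    # Deliver each transcript message to the callback
    socket.registerHandler(socket.event.TRANSCRIPT_RECEIVED, on_transcript)
    # Send audio chunks as they arrive from your SIP/WebSocket layer
    async for chunk in audio_stream:
        socket.send(chunk)
    # Signal end of stream so the final transcript is flushed
    await socket.finish()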
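With interim results and endpointing enabled, the STT layer emits many partial messages per utterance, and the LLM should only be called once the caller has finished a turn. The sketch below assembles a full utterance from streaming results. It assumes each message is a dict with `is_final`, `speech_final`, and `channel.alternatives[0].transcript` fields, which is how Deepgram's streaming responses are shaped to the best of my knowledge; verify the field names against the current API reference. `UtteranceAssembler` is a hypothetical helper name.

```python
from typing import List, Optional


class UtteranceAssembler:
    """Collects finalized transcript segments until the speaker's turn ends."""

    def __init__(self) -> None:
        self._parts: List[str] = []

    def feed(self, result: dict) -> Optional[str]:
        """Feed one streaming result; return the full utterance at end of speech."""
        alternatives = result.get("channel", {}).get("alternatives", [])
        text = alternatives[0].get("transcript", "") if alternatives else ""

        # Interim results may still be revised, so only keep finalized text.
        if result.get("is_final") and text:
            self._parts.append(text)

        # speech_final marks the endpoint: hand the whole turn to the LLM.
        if result.get("speech_final") and self._parts:
            utterance = " ".join(self._parts)
            self._parts = []
            return utterance
        return None


def make_result(text, is_final, speech_final):
    """Build a message in the assumed response shape (for demonstration only)."""
    return {
        "is_final": is_final,
        "speech_final": speech_final,
        "channel": {"alternatives": [{"transcript": text}]},
    }


assembler = UtteranceAssembler()
outputs = [
    assembler.feed(make_result("my number", False, False)),
    assembler.feed(make_result("my number is", True, False)),
    assembler.feed(make_result("five five five", True, True)),
]
print(outputs[-1])
```

Only the last call returns text, so the LLM is invoked once per caller turn rather than once per partial transcript.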
Layer 3: The LLM Brain — Claude for AI Calling
The LLM receives the caller’s transcribed text, maintains conversation context, decides what to say next, and optionally calls external tools (your CRM, calendar, payment system).
Why Claude for AI calling:
Claude (Anthropic) has emerged as a strong choice for voice agent LLMs for several reasons specific to calling use cases:
- Instruction following — Claude reliably follows complex system prompts that define agent persona, call flow, escalation rules, and tool usage. In calling contexts, an agent going “off-script” is a serious problem. Claude’s instruction adherence is class-leading.
- Concise responses — Claude tends toward shorter, more conversational responses when prompted correctly — essential for voice where long responses feel unnatural and add TTS latency.
- Tool use — Claude’s function calling (tool use API) handles real-time lookups — checking calendar availability, querying CRM records, verifying order status — reliably and with correct JSON formatting.
- Multi-turn context — maintains coherent conversation context across a full call without “forgetting” earlier exchanges.
Model selection for calling:
- Claude Haiku — lowest latency (~150ms), best for simple inbound flows (FAQ, routing, appointment confirmation)
- Claude Sonnet — balanced latency and reasoning, best for most calling use cases
- Claude Opus — highest reasoning quality, use only when call complexity demands it (complex sales, negotiations)
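One way to apply this selection in code is a small router that picks a model tier per call type. This is a sketch: the call-type labels are hypothetical names derived from the examples above, and the function returns tier names rather than concrete model IDs, which you would map in your own configuration.

```python
# Call types drawn from the examples above; the exact labels are hypothetical.
SIMPLE_CALL_TYPES = {"faq", "routing", "appointment_confirmation"}
COMPLEX_CALL_TYPES = {"complex_sales", "negotiation"}


def select_model_tier(call_type: str) -> str:
    """Return the model tier to use for a given call type."""
    if call_type in SIMPLE_CALL_TYPES:
        return "haiku"      # lowest latency for simple inbound flows
    if call_type in COMPLEX_CALL_TYPES:
        return "opus"       # reserve for calls that need the most reasoning
    return "sonnet"         # balanced default for everything else


print(select_model_tier("faq"))
print(select_model_tier("lead_qualification"))
```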
System prompt structure for a calling agent:
You are [Agent Name], a [role] for [Company].
PERSONALITY: [tone, style — e.g., "professional but warm, concise"]
YOUR GOAL: [primary objective — e.g., "qualify inbound leads and book a demo"]
CALL FLOW:
1. Greet caller by name if available
2. Identify their need in one open question
3. [flow logic...]
4. If you cannot resolve: transfer to human agent
RULES:
- Keep all responses under 3 sentences
- Never fabricate information — if unsure, say so
- Always confirm before taking any action (booking, updating records)
- End the call professionally if the caller asks to stop
TOOLS AVAILABLE:
- check_calendar(date, time) → returns availability
- create_booking(name, email, date, time) → books appointment
- lookup_customer(phone_number) → returns CRM record
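Listing tools in the system prompt is not enough on its own: the tools also have to be declared to the API and executed by your server. Below is a minimal sketch of how `check_calendar` might be declared in the Anthropic tool format (`name`, `description`, `input_schema`) and how a `tool_use` block could be turned into a `tool_result` block. The calendar logic and the booked slot are hypothetical stand-ins for a real backend.

```python
import json

# Tool declaration passed to the API alongside the conversation messages.
TOOLS = [
    {
        "name": "check_calendar",
        "description": "Check whether an appointment slot is available.",
        "input_schema": {
            "type": "object",
            "properties": {
                "date": {"type": "string", "description": "ISO date, e.g. 2026-03-01"},
                "time": {"type": "string", "description": "24-hour time, e.g. 10:00"},
            },
            "required": ["date", "time"],
        },
    },
]


def check_calendar(date: str, time: str) -> dict:
    """Hypothetical stand-in for a real calendar lookup."""
    booked = {("2026-03-01", "10:00")}
    return {"available": (date, time) not in booked}


HANDLERS = {"check_calendar": check_calendar}


def run_tool(tool_use_block: dict) -> dict:
    """Execute a tool_use block and build the matching tool_result block."""
    handler = HANDLERS.get(tool_use_block["name"])
    if handler is None:
        content, is_error = f"Unknown tool: {tool_use_block['name']}", True
    else:
        content, is_error = json.dumps(handler(**tool_use_block["input"])), False
    return {
        "type": "tool_result",
        "tool_use_id": tool_use_block["id"],
        "content": content,
        "is_error": is_error,
    }


result = run_tool({
    "type": "tool_use",
    "id": "toolu_example",
    "name": "check_calendar",
    "input": {"date": "2026-03-01", "time": "10:00"},
})
print(result["content"])
```

The `tool_result` block goes back to the model in the next user message, and the model then phrases the answer for the caller.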
Layer 4: Text-to-Speech — Your Agent’s Voice
The TTS layer converts Claude’s text response into natural-sounding audio delivered back to the caller. Voice quality here directly impacts whether callers trust and engage with your agent.
Key requirements:
- Streaming TTS (start playing before the full response is generated — cuts latency by ~40%)
- Low latency first-byte generation (<300ms)
- Natural prosody — pauses, emphasis, pacing
- Telephony-compatible output (8kHz or convertible)
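Streaming TTS pays off most when the LLM output is also streamed, so synthesis can begin on the first sentence while later sentences are still being generated. The sketch below splits an incoming stream of text chunks into complete sentences. The splitting rule is deliberately naive and will mis-split abbreviations such as "Dr. Smith", so treat it as a starting point rather than a finished tokenizer.

```python
import re
from typing import Iterable, Iterator

# Split on whitespace that follows sentence-ending punctuation.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a stream of text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_BOUNDARY.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever remains when the stream ends.
    if buffer.strip():
        yield buffer.strip()


llm_chunks = ["Sure, I can", " help. What day", " works for you? ", "Morning or afternoon."]
for sentence in sentences_from_stream(llm_chunks):
    print(sentence)  # each sentence would be sent to the TTS service here
```

The first sentence is available after the second chunk, so the caller can start hearing audio before the LLM has finished its reply.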
Recommended options:
| Provider | Voice quality | Latency | Best for |
|---|---|---|---|
| ElevenLabs | Highest | ~250ms | Customer-facing, premium |
| Cartesia | Excellent | ~100ms | Latency-sensitive calling |
| OpenAI TTS | Very good | ~300ms | Easy integration, consistent |
| Azure Neural TTS | Very good | ~200ms | Enterprise, multilingual |
For AI calling platforms, Cartesia has gained traction largely because of its roughly 100ms first-audio-byte latency, which matters when the agent needs to start speaking immediately after processing.
Orchestration: Connecting the Four Layers
Your server is the orchestration layer — the WebSocket bridge that routes audio between telephony and your AI pipeline. Two main frameworks:
Pipecat (open source, recommended)
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.anthropic import AnthropicLLMService
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.transports.network.fastapi_websocket import FastAPIWebsocketTransport

# Simplified sketch: API keys, transport params, and voice settings are omitted.
# Check each constructor's arguments against the Pipecat version you install.
async def run_agent(websocket):
    transport = FastAPIWebsocketTransport(websocket)
    pipeline = Pipeline([
        transport.input(),          # Audio in from SIP/WebSocket
        DeepgramSTTService(),       # Speech to text
        AnthropicLLMService(        # Claude LLM
            model="claude-sonnet-4-20250514",
            system=SYSTEM_PROMPT
        ),
        CartesiaTTSService(),       # Text to speech
        transport.output()          # Audio out to caller
    ])
    # A pipeline is executed by wrapping it in a task and handing it to a runner
    await PipelineRunner().run(PipelineTask(pipeline))
LiveKit Agents (open source, for higher scale)
Best when you need to handle thousands of concurrent calls with WebRTC and SIP simultaneously. More infrastructure overhead but significantly better concurrency management.
Putting It Together: The Full Architecture
Caller dials DID number (IDT Express)
│
▼
IDT Express SIP Trunk
(G.711 audio, Platinum route)
│
▼
Your SIP Gateway / WebSocket Server
(Pipecat or LiveKit orchestration)
│ │
▼ ▼
Deepgram STT Cartesia TTS
(transcribe) (synthesise)
│ ▲
▼ │
Claude Sonnet (Anthropic)
(understand + decide + respond)
│
▼
Tool calls (CRM, calendar, payments)
End-to-end latency budget (target: under 800ms):
- Telephony to server: ~50ms (IDT Platinum route)
- STT (Deepgram): ~200ms
- LLM (Claude Sonnet): ~200ms
- TTS first byte (Cartesia): ~100ms
- Server to telephony: ~50ms
- Total: ~600ms ✓
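The budget above is easy to keep honest in code. The sketch below sums the listed stages and reports the headroom against the 800ms target. It also shows what happens if the 300ms endpointing wait from the STT configuration is counted as a separate stage; whether that wait overlaps with the quoted STT latency depends on the provider, so the second figure is an assumption to test rather than a measurement.

```python
TARGET_MS = 800

# Stage estimates in milliseconds, taken from the budget above.
BUDGET_MS = {
    "telephony_to_server": 50,
    "stt": 200,
    "llm": 200,
    "tts_first_byte": 100,
    "server_to_telephony": 50,
}


def check_budget(stages, target_ms=TARGET_MS):
    """Return (total_ms, headroom_ms); negative headroom means over budget."""
    total = sum(stages.values())
    return total, target_ms - total


baseline = check_budget(BUDGET_MS)
with_endpointing = check_budget({**BUDGET_MS, "endpointing_wait": 300})

print(f"Baseline: {baseline[0]}ms total, {baseline[1]}ms headroom")
print(f"With endpointing wait: {with_endpointing[0]}ms total, {with_endpointing[1]}ms headroom")
```

If measured turn latency lands closer to the second figure, the endpointing threshold is the first setting worth tuning.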
Scaling Your Platform
Once your single-agent proof of concept works, scaling requires:
Concurrent calls: Each call runs in its own process/container. IDT Express SIP trunks are elastic — no pre-provisioning needed. Scale your compute (not your carrier) to handle bursts.
Global reach: IDT Express DID numbers in 140+ countries mean your platform can accept local calls in any market without separate carrier relationships per country. One account, one API, global coverage.
Cost optimization at scale:
- Use Claude Haiku for simple call types (saves ~70% on LLM cost vs Sonnet)
- IDT Express Instant routes for outbound bulk campaigns (lowest per-minute rates)
- IDT Express Platinum routes for customer-facing inbound (quality matters for retention)
- Cache TTS for common phrases (greetings, hold messages) — saves ~15% TTS cost
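The phrase cache in the last item can be very small. Here is a minimal in-memory sketch keyed by voice and text. `PhraseCache` and `fake_synthesize` are hypothetical names, and the synthesize callable stands in for whatever TTS client you use. A production version would likely persist the audio and bound the cache size.

```python
import hashlib
from typing import Callable, Dict


class PhraseCache:
    """Caches synthesized audio for fixed phrases, keyed by voice and text."""

    def __init__(self, synthesize: Callable[[str, str], bytes]) -> None:
        self._synthesize = synthesize
        self._audio: Dict[str, bytes] = {}

    def get(self, voice_id: str, text: str) -> bytes:
        """Return cached audio, calling the TTS backend only on a miss."""
        text = text.strip()
        key = hashlib.sha256(f"{voice_id}|{text}".encode("utf-8")).hexdigest()
        if key not in self._audio:
            self._audio[key] = self._synthesize(voice_id, text)
        return self._audio[key]


tts_calls = []


def fake_synthesize(voice_id: str, text: str) -> bytes:
    """Stand-in for a real TTS request; records each call for the demo."""
    tts_calls.append((voice_id, text))
    return f"<audio:{voice_id}:{text}>".encode("utf-8")


cache = PhraseCache(fake_synthesize)
greeting = "Thanks for calling. How can I help?"
first = cache.get("voice-a", greeting)
second = cache.get("voice-a", greeting)   # served from the cache
print(len(tts_calls))
```

Keying on the voice as well as the text keeps agents with different voices from sharing audio by accident.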
STIR/SHAKEN: All outbound calls through IDT Express are automatically STIR/SHAKEN signed. For AI calling platforms, this is non-negotiable — unsigned calls are increasingly labelled as spam by major carriers, killing your outbound answer rates before your agent even speaks.
Getting Started: Your Stack Checklist
| Component | Recommended tool | Time to set up |
|---|---|---|
| Carrier / SIP trunk | IDT Express | 10 minutes |
| DID numbers | IDT Express | Instant via API |
| STT | Deepgram Nova-2 | 15 minutes |
| LLM | Claude Sonnet (Anthropic API) | 10 minutes |
| TTS | Cartesia or ElevenLabs | 15 minutes |
| Orchestration | Pipecat | 30 minutes |
| Infra | AWS/GCP + Docker | 1–2 hours |
Realistic timeline:
- Working proof of concept: 1–2 days
- Single-use-case production agent: 1–2 weeks
- Full multi-use-case platform with analytics: 4–8 weeks
Frequently Asked Questions
Q: Can I build an AI calling platform without coding experience?
A: Not realistically at the infrastructure level described here. For no-code options, Vapi, Bland AI, and Synthflow are purpose-built managed platforms. If you want to build your own and have a development team, this guide covers everything you need.
Q: Why does audio codec matter for AI calling?
A: Speech-to-text engines perform significantly better on G.711 audio (64 kbit/s, no low-bitrate compression) than on G.729 audio (8 kbit/s, heavily compressed). Compressed codecs introduce artifacts that can reduce transcription accuracy by up to 15%, which degrades your LLM’s input and produces worse agent responses. IDT Express Platinum routes default to G.711.
Q: How do I get phone numbers for my AI calling platform?
A: IDT Express provides DID (Direct Inward Dialing) numbers in 140+ countries, provisionable via API. Numbers activate instantly. You can assign them to inbound call flows, give them to specific AI agents, or use them for outbound caller ID.
Q: Does my AI calling platform need to be STIR/SHAKEN compliant?
A: Yes, for US outbound calls. FCC rules require voice service providers to implement STIR/SHAKEN on the IP portions of their networks, so your outbound calls need to be signed by the carrier that originates them. IDT Express handles this automatically: all outbound traffic is signed with attestation level A. Unsigned outbound calls are increasingly labelled “Spam Likely” by major carriers.
Next Steps
If you’re ready to build:
- Create a free IDT Express account — get your SIP credentials and $25 free credit. Takes 60 seconds.
- Sign up for Anthropic API access — your Claude API key is available immediately at console.anthropic.com
- Install Pipecat — pip install pipecat-ai and run the quickstart
- Make your first AI call — a working proof-of-concept in under a day
For the carrier layer — the infrastructure that connects your AI to the real phone network — IDT Express voice termination and SIP trunking is purpose-built for AI voice workloads: G.711 audio, elastic concurrency, global DID numbers, and STIR/SHAKEN compliance included.

