
How to Build an AI Calling Platform (2026): Stack, Architecture & Carrier Setup

May 20, 2026

Maxim Cibotaru

8 min read

How to Build Your Own AI Calling Platform in 2026

The complete stack guide — LLM, STT, TTS, and the carrier layer most tutorials skip.

In 2024, an AI making a phone call was a novelty. By 2026, it is standard business infrastructure. AI calling platforms handle inbound support, qualify leads, book appointments, collect payments, and run outbound campaigns — all without a human agent on the line.

Most businesses buy this capability off-the-shelf from platforms like Vapi, Bland AI, or Retell. But a growing tier of companies (agencies, telecom resellers, SaaS builders, and enterprises with high call volumes) are building their own. Why?

Cost. At scale, managed platforms charge $0.07–$0.15 per minute all-in. Build your own stack and the same call costs $0.02–$0.04.
Control. You choose every component — LLM, voice, telephony, routing logic.
Competitive advantage. Your calling infrastructure becomes a proprietary moat, not a commodity service every competitor can buy too.
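To make the cost gap concrete, the per-minute ranges above can be turned into a monthly figure. The sketch below is plain arithmetic on the quoted rates; the 100,000-minute volume is an illustrative assumption and `monthly_cost` is a hypothetical helper, not part of any platform's tooling.

```python
# Per-minute price ranges quoted above, in USD (low, high)
MANAGED = (0.07, 0.15)
SELF_BUILT = (0.02, 0.04)

def monthly_cost(rate_range, minutes):
    """Return the (low, high) monthly cost in USD for a per-minute price range."""
    low, high = rate_range
    return (round(low * minutes, 2), round(high * minutes, 2))

minutes = 100_000  # assumed monthly volume, for illustration only
managed = monthly_cost(MANAGED, minutes)
self_built = monthly_cost(SELF_BUILT, minutes)

print(f"Managed:    ${managed[0]:,.0f} to ${managed[1]:,.0f} per month")
print(f"Self-built: ${self_built[0]:,.0f} to ${self_built[1]:,.0f} per month")
```

Swap in your own volume and negotiated rates to see where the break-even sits for your traffic.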

This guide walks through the complete architecture: what each layer does, which tools to use, how they connect, and — critically — the carrier infrastructure layer that most tutorials gloss over but that determines whether your platform actually works at scale.

The Four Layers of Every AI Calling Platform

Before picking tools, understand the stack. Every AI calling platform — whether you build it or buy it — runs on four layers working in sequence:

Caller's phone
      │
      ▼
[Layer 1] Telephony / Carrier
      │  SIP trunk, PSTN gateway, DID numbers
      ▼
[Layer 2] Speech-to-Text (STT)
      │  Transcribes caller audio in real time
      ▼
[Layer 3] LLM (Large Language Model)
      │  Understands intent, decides response
      ▼
[Layer 4] Text-to-Speech (TTS)
      │  Converts response to natural voice audio
      │
      ▼
Back to caller

Audio flows in two directions continuously. The caller speaks → STT transcribes → LLM generates a response → TTS synthesises audio → caller hears the response. Your server orchestrates all four layers in real time. The end-to-end target: under 800ms from “caller stops talking” to “agent starts talking.” Anything above 1,000ms and the caller assumes the connection dropped.
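The turn described above can be sketched as a single function that passes data through the three AI stages and records how long each one takes. This is a deliberately sequential illustration with stand-in stages; `handle_turn` and the `fake_*` functions are hypothetical names, and a production system would stream and overlap the stages rather than run them one after another.

```python
import time

def handle_turn(caller_audio, stt, llm, tts):
    """Run one conversational turn and record how long each stage took (ms)."""
    timings = {}

    def timed(name, fn, arg):
        start = time.perf_counter()
        result = fn(arg)
        timings[name] = (time.perf_counter() - start) * 1000
        return result

    transcript = timed("stt", stt, caller_audio)   # audio -> text
    reply_text = timed("llm", llm, transcript)     # text -> response text
    reply_audio = timed("tts", tts, reply_text)    # response text -> audio
    timings["total"] = sum(timings.values())
    return reply_audio, timings

# Stand-in stages (hypothetical): real ones would call your STT, LLM and TTS providers
def fake_stt(audio):
    return "what time do you open"

def fake_llm(text):
    return "We open at nine."

def fake_tts(text):
    return text.encode("utf-8")

audio_out, timings = handle_turn(b"\x00" * 160, fake_stt, fake_llm, fake_tts)
print(audio_out, {stage: round(ms, 3) for stage, ms in timings.items()})
```

Logging per-stage timings like this from the first prototype makes it much easier to see which layer is consuming the latency budget.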

Let’s build each layer.

Layer 1: Telephony — The Foundation Everyone Underestimates

This is the layer most tutorials rush past with “use Twilio” and move on. It’s also the layer that determines your per-minute cost, your audio quality, and whether your AI agent sounds natural or robotic.

What this layer does:

Receives inbound calls from real phone numbers (PSTN)
Routes outbound calls to any destination
Streams audio to your server via WebSocket or SIP
Provides the phone numbers your callers dial

Your two options:

Option A — Managed Telephony (Twilio, etc.)

Fast to set up. Pay retail CPaaS rates ($0.013/min+ outbound, platform fees on top). Excellent for prototyping and low volumes. At 100,000+ minutes/month, costs compound quickly.

Option B — Direct SIP Trunk (Recommended for production)

Connect your platform directly to a wholesale SIP termination provider. Lower per-minute rates, full control over routing, no platform markup. Takes 10 minutes to configure. Saves 30–60% at scale.

For production builds, IDT Express is the recommended carrier layer.

IDT Express is a wholesale voice termination provider backed by IDT Corporation (NYSE: IDT) — one of the world’s largest wholesale voice carriers with 25+ years of carrier infrastructure. Their SIP trunking and voice termination gives your AI calling platform:

Platinum routes — G.711 uncompressed audio, sub-3-second post-dial delay, optimised specifically for AI voice agent workloads where codec quality directly affects STT accuracy
1,000+ CLI routes across 200+ countries — your platform can make and receive calls globally from day one
DID numbers in 140+ countries — give your agents real local phone numbers in any market instantly via API
Elastic concurrency — scale from 10 to 10,000+ simultaneous calls without pre-provisioning
STIR/SHAKEN signed outbound — calls are signed with attestation level A, preventing spam labels that kill answer rates

SIP configuration (works with any SIP-compatible framework):

SIP Server: sip.idtexpress.com
Transport: TLS (recommended) or UDP
Codec: G.711 ulaw (PCMU) — critical for AI workloads
Authentication: Digest auth with your IDT Express credentials
DTMF: RFC 2833
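Choosing G.711 also fixes how much network bandwidth each call needs, which is worth estimating before sizing servers. The sketch below works it out from first principles: 8,000 one-byte samples per second, plus RTP, UDP and IPv4 headers on every packet. The 20 ms packetization interval is an assumption (a common default); encrypted media (SRTP) and Ethernet framing add a little more and are not counted here.

```python
SAMPLE_RATE_HZ = 8000        # G.711 samples per second
BITS_PER_SAMPLE = 8          # one byte per sample after mu-law companding
PACKET_MS = 20               # assumed packetization interval
HEADER_BYTES = 12 + 8 + 20   # RTP + UDP + IPv4 headers per packet

def g711_kbps_per_direction(packet_ms=PACKET_MS):
    """IP-layer bandwidth of one G.711 stream in one direction, in kbit/s."""
    payload_bytes = SAMPLE_RATE_HZ * BITS_PER_SAMPLE // 8 * packet_ms // 1000
    packets_per_second = 1000 // packet_ms
    return (payload_bytes + HEADER_BYTES) * 8 * packets_per_second / 1000

def trunk_mbps(concurrent_calls):
    """Bandwidth per direction for a number of simultaneous calls, in Mbit/s."""
    return g711_kbps_per_direction() * concurrent_calls / 1000

print(g711_kbps_per_direction(), "kbit/s per call, per direction")
print(trunk_mbps(1000), "Mbit/s per direction for 1,000 concurrent calls")
```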

Get your SIP credentials free at idtexpress.com/sign-up. Credentials are available immediately after registration — no sales call required.

Layer 2: Speech-to-Text — Hearing Your Callers

The STT layer converts the caller’s audio into text your LLM can process. For phone-based AI calling, this is harder than it sounds — telephony audio runs at 8kHz mulaw, not the 16kHz PCM most audio tools expect.
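If a tool in your pipeline only accepts linear PCM, the mu-law bytes have to be expanded first. The following is a minimal sketch of standard G.711 mu-law decoding to 16-bit signed samples using only the standard library; the function names are illustrative, and resampling from 8kHz to 16kHz is a separate step not shown here.

```python
import struct

def ulaw_byte_to_pcm16(u):
    """Expand one G.711 mu-law byte to a 16-bit signed linear sample."""
    u = ~u & 0xFF                  # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude

# Precompute all 256 values once; decoding then becomes a table lookup
_DECODE_TABLE = [ulaw_byte_to_pcm16(b) for b in range(256)]

def ulaw_to_pcm16(data: bytes) -> bytes:
    """Decode a mu-law buffer to little-endian 16-bit PCM."""
    samples = [_DECODE_TABLE[b] for b in data]
    return struct.pack(f"<{len(samples)}h", *samples)

pcm = ulaw_to_pcm16(b"\xff\x00\x80")
print(struct.unpack("<3h", pcm))
```

Each input byte becomes two output bytes, so a 20 ms frame of 160 mu-law bytes decodes to 320 bytes of PCM.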

Key requirements for telephony STT:

Native 8kHz mulaw support (no resampling step = lower latency)
Streaming transcription (word-by-word as caller speaks, not after they stop)
Sub-300ms median latency
Accuracy on alphanumerics — phone numbers, confirmation codes, email addresses
Noise robustness — phone lines have background noise, compression artifacts

Recommended options:

Provider	Latency	Best for
Deepgram Nova-2	~200ms	Production, best price/performance
AssemblyAI Universal-3	~307ms	Highest accuracy on alphanumerics
OpenAI Whisper (hosted)	~400ms	General use, easy integration
Google STT Streaming	~250ms	Multilingual, enterprise

For most AI calling platforms, Deepgram Nova-2 is the default choice — it natively accepts 8kHz mulaw, delivers ~200ms P50 latency, and has excellent accuracy on the kind of structured data callers share over phone (numbers, codes, addresses).

Quick integration (Python, Deepgram):

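import os

# Legacy v2-style Deepgram Python SDK interface; newer SDK releases expose a different client
from deepgram import Deepgram

dg_client = Deepgram(os.environ["DEEPGRAM_API_KEY"])

async def transcribe_stream(audio_stream, on_transcript):
    # Open a live transcription socket configured for raw telephony audio
    socket = await dg_client.transcription.live({
        'encoding': 'mulaw',       # G.711 mu-law, as delivered by the SIP trunk
        'sample_rate': 8000,       # telephony sample rate, no resampling needed
        'model': 'nova-2',
        'interim_results': True,   # emit partial transcripts while the caller speaks
        'endpointing': 300,        # ms of silence before speech is treated as finished
    })
    # Deliver every transcript message to the caller-supplied callback
    socket.registerHandler(socket.event.TRANSCRIPT_RECEIVED, on_transcript)
    # Send audio chunks as they arrive from your SIP/WebSocket layer
    async for chunk in audio_stream:
        socket.send(chunk)
    # Signal end of stream so the final results are flushed
    await socket.finish()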
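With `interim_results` enabled, the same stretch of speech arrives several times as the hypothesis is revised, so only finalized segments should reach the LLM. The sketch below shows one way to assemble a full caller turn. It assumes each message is a parsed JSON dict carrying `is_final`, `speech_final` and `channel.alternatives[0].transcript`, which is my understanding of Deepgram's streaming result shape and should be checked against the current API reference; `TurnAssembler` is a hypothetical helper.

```python
class TurnAssembler:
    """Collects finalized transcript segments until the speaker's turn ends."""

    def __init__(self):
        self._segments = []

    def feed(self, message):
        """Return the full utterance when the turn ends, otherwise None."""
        alternatives = message.get("channel", {}).get("alternatives", [])
        text = alternatives[0].get("transcript", "").strip() if alternatives else ""
        if message.get("is_final") and text:
            self._segments.append(text)      # interim hypotheses are ignored
        if message.get("speech_final") and self._segments:
            utterance = " ".join(self._segments)
            self._segments = []              # reset for the next turn
            return utterance
        return None

def result(text, is_final, speech_final):
    """Build a message in the assumed result shape (for demonstration only)."""
    return {"is_final": is_final, "speech_final": speech_final,
            "channel": {"alternatives": [{"transcript": text}]}}

assembler = TurnAssembler()
outputs = [
    assembler.feed(result("my number", False, False)),      # interim, dropped
    assembler.feed(result("my number is", True, False)),    # final segment
    assembler.feed(result("five five five", True, True)),   # final, turn ends
]
print(outputs)
```

Only the last call returns text, so the LLM is invoked once per caller turn rather than once per partial result.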

Layer 3: The LLM Brain — Claude for AI Calling

The LLM receives the caller’s transcribed text, maintains conversation context, decides what to say next, and optionally calls external tools (your CRM, calendar, payment system).

Why Claude for AI calling:

Claude (Anthropic) has emerged as a strong choice for voice agent LLMs for several reasons specific to calling use cases:

Instruction following — Claude reliably follows complex system prompts that define agent persona, call flow, escalation rules, and tool usage. In calling contexts, an agent going “off-script” is a serious problem. Claude’s instruction adherence is class-leading.
Concise responses — Claude tends toward shorter, more conversational responses when prompted correctly — essential for voice where long responses feel unnatural and add TTS latency.
Tool use — Claude’s function calling (tool use API) handles real-time lookups — checking calendar availability, querying CRM records, verifying order status — reliably and with correct JSON formatting.
Multi-turn context — maintains coherent conversation context across a full call without “forgetting” earlier exchanges.

Model selection for calling:

Claude Haiku — lowest latency (~150ms), best for simple inbound flows (FAQ, routing, appointment confirmation)
Claude Sonnet — balanced latency and reasoning, best for most calling use cases
Claude Opus — highest reasoning quality, use only when call complexity demands it (complex sales, negotiations)
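The model selection above amounts to a lookup from call type to model tier. A minimal sketch follows; the call-type labels are hypothetical, and the tier names are placeholders that you would map to concrete model IDs in your own configuration.

```python
# Call types taken from the list above; the string labels are illustrative
MODEL_TIER_BY_CALL_TYPE = {
    "faq": "haiku",
    "routing": "haiku",
    "appointment_confirmation": "haiku",
    "complex_sales": "opus",
    "negotiation": "opus",
}

def pick_model_tier(call_type, default="sonnet"):
    """Return the model tier for a call type, falling back to the balanced tier."""
    return MODEL_TIER_BY_CALL_TYPE.get(call_type, default)

print(pick_model_tier("faq"))
print(pick_model_tier("lead_qualification"))
```

Keeping the mapping in data rather than in branching logic makes it easy to move a call type between tiers after measuring real latency and resolution rates.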

System prompt structure for a calling agent:

You are [Agent Name], a [role] for [Company].

PERSONALITY: [tone, style — e.g., "professional but warm, concise"]

YOUR GOAL: [primary objective — e.g., "qualify inbound leads and book a demo"]

CALL FLOW:
1. Greet caller by name if available
2. Identify their need in one open question
3. [flow logic...]
4. If you cannot resolve: transfer to human agent

RULES:
- Keep all responses under 3 sentences
- Never fabricate information — if unsure, say so
- Always confirm before taking any action (booking, updating records)
- End the call professionally if the caller asks to stop

TOOLS AVAILABLE:
- check_calendar(date, time) → returns availability
- create_booking(name, email, date, time) → books appointment
- lookup_customer(phone_number) → returns CRM record
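Listing tools in the prompt is not enough on its own: each tool also needs a schema passed to the API and a handler on your server. The sketch below declares one of the tools above in the shape the Anthropic Messages API expects (`name`, `description`, `input_schema`) and dispatches a `tool_use` block to local code. The block is represented as a plain dict for simplicity, and the calendar handler is a placeholder that always reports availability.

```python
import json

TOOLS = [
    {
        "name": "check_calendar",
        "description": "Check whether a date and time slot is available.",
        "input_schema": {
            "type": "object",
            "properties": {
                "date": {"type": "string", "description": "ISO date, e.g. 2026-06-01"},
                "time": {"type": "string", "description": "24-hour time, e.g. 14:30"},
            },
            "required": ["date", "time"],
        },
    },
]

def check_calendar(date, time):
    # Placeholder: a real implementation would query your calendar system
    return {"date": date, "time": time, "available": True}

HANDLERS = {"check_calendar": check_calendar}

def run_tool(tool_use):
    """Turn a tool_use block (as a dict) into a tool_result block."""
    handler = HANDLERS.get(tool_use["name"])
    if handler is None:
        content, is_error = f"Unknown tool: {tool_use['name']}", True
    else:
        content, is_error = json.dumps(handler(**tool_use["input"])), False
    return {"type": "tool_result", "tool_use_id": tool_use["id"],
            "content": content, "is_error": is_error}

example = {"type": "tool_use", "id": "toolu_example", "name": "check_calendar",
           "input": {"date": "2026-06-01", "time": "14:30"}}
print(run_tool(example))
```

The returned block goes back to the model in the next user message, after which the model phrases the answer for the caller.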

Layer 4: Text-to-Speech — Your Agent’s Voice

The TTS layer converts Claude’s text response into natural-sounding audio delivered back to the caller. Voice quality here directly impacts whether callers trust and engage with your agent.

Key requirements:

Streaming TTS (start playing before the full response is generated — cuts latency by ~40%)
Low latency first-byte generation (<300ms)
Natural prosody — pauses, emphasis, pacing
Telephony-compatible output (8kHz or convertible)

Recommended options:

Provider	Voice quality	Latency	Best for
ElevenLabs	Highest	~250ms	Customer-facing, premium
Cartesia	Excellent	~100ms	Latency-sensitive calling
OpenAI TTS	Very good	~300ms	Easy integration, consistent
Azure Neural TTS	Very good	~200ms	Enterprise, multilingual

For AI calling platforms, Cartesia has gained traction specifically because of its first-audio-byte latency of roughly 100ms. When you need the agent to start speaking immediately after processing, this matters.
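Streaming TTS pays off most when the LLM output is forwarded sentence by sentence instead of waiting for the full reply. The generator below is a minimal sketch of that hand-off: it buffers streamed tokens and yields each sentence as soon as its end is seen. The regex split is a simplification (it would break on abbreviations such as "Dr. Smith"), and `sentences_from_tokens` is a hypothetical name.

```python
import re

# A sentence ends at ., ! or ? followed by whitespace
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def sentences_from_tokens(tokens):
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever remains when the stream ends
    if buffer.strip():
        yield buffer.strip()

streamed = ["Sure", ". ", "I can ", "book that", ". What", " time works?"]
chunks = list(sentences_from_tokens(streamed))
print(chunks)
```

Each yielded sentence can be sent to the TTS provider immediately, so the caller hears the first sentence while the rest of the reply is still being generated.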

Orchestration: Connecting the Four Layers

Your server is the orchestration layer — the WebSocket bridge that routes audio between telephony and your AI pipeline. Two main frameworks:

Pipecat (open source, recommended)

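from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.anthropic import AnthropicLLMService
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.transports.network.fastapi_websocket import FastAPIWebsocketTransport

# Simplified sketch: constructor arguments (API keys, voice ID, transport params,
# context aggregators) are omitted, and import paths and signatures differ between
# Pipecat releases. Check the documentation for your installed version.
async def run_agent(websocket):
    transport = FastAPIWebsocketTransport(websocket)

    pipeline = Pipeline([
        transport.input(),           # Audio in from SIP/WebSocket
        DeepgramSTTService(),        # Speech to text
        AnthropicLLMService(         # Claude LLM
            model="claude-sonnet-4-20250514",
            system=SYSTEM_PROMPT
        ),
        CartesiaTTSService(),        # Text to speech
        transport.output()           # Audio out to caller
    ])

    # A Pipeline is not run directly: wrap it in a task and hand it to a runner
    await PipelineRunner().run(PipelineTask(pipeline))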

LiveKit Agents (open source, for higher scale)

Best when you need to handle thousands of concurrent calls with WebRTC and SIP simultaneously. It carries more infrastructure overhead but offers significantly better concurrency management.

Putting It Together: The Full Architecture

Caller dials DID number (IDT Express)
              │
              ▼
    IDT Express SIP Trunk
    (G.711 audio, Platinum route)
              │
              ▼
    Your SIP Gateway / WebSocket Server
    (Pipecat or LiveKit orchestration)
         │            │
         ▼            ▼
  Deepgram STT    Cartesia TTS
  (transcribe)    (synthesise)
         │            ▲
         ▼            │
    Claude Sonnet (Anthropic)
    (understand + decide + respond)
         │
         ▼
    Tool calls (CRM, calendar, payments)

End-to-end latency budget (target: under 800ms):

Telephony to server: ~50ms (IDT Platinum route)
STT (Deepgram): ~200ms
LLM (Claude Sonnet): ~200ms
TTS first byte (Cartesia): ~100ms
Server to telephony: ~50ms
Total: ~600ms ✓
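The budget above is easy to turn into a check that runs against measured values. In this sketch the per-stage figures and the 800ms target come from the list above, while the stage keys and `check_budget` are hypothetical names.

```python
# Per-stage budget from the list above, in milliseconds
BUDGET_MS = {
    "telephony_to_server": 50,
    "stt": 200,
    "llm": 200,
    "tts_first_byte": 100,
    "server_to_telephony": 50,
}
TARGET_MS = 800

def check_budget(measured_ms, target_ms=TARGET_MS):
    """Return (total, headroom against target, stages over their own budget)."""
    total = sum(measured_ms.values())
    over = [stage for stage, ms in measured_ms.items()
            if ms > BUDGET_MS.get(stage, 0)]
    return total, target_ms - total, over

planned = check_budget(BUDGET_MS)
slow_llm = check_budget({**BUDGET_MS, "llm": 450})
print(planned)
print(slow_llm)
```

Negative headroom means the turn misses the target, and the list of offending stages shows where to look first.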

Scaling Your Platform

Once your single-agent proof of concept works, scaling requires:

Concurrent calls: Each call runs in its own process/container. IDT Express SIP trunks are elastic — no pre-provisioning needed. Scale your compute (not your carrier) to handle bursts.
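How much compute to provision follows from a simple relationship (Little's law): average concurrent calls equal the call arrival rate multiplied by the average call duration. The sketch below applies it with one call per worker, matching the one-process-per-call model described above. The traffic figures and the headroom factor are illustrative assumptions, not recommendations.

```python
import math

def average_concurrent_calls(calls_per_hour, avg_call_seconds):
    """Little's law: concurrency = arrival rate x average duration."""
    return calls_per_hour * avg_call_seconds / 3600

def workers_needed(calls_per_hour, avg_call_seconds, calls_per_worker=1, headroom=1.3):
    """Workers required to absorb average load plus a safety margin for bursts."""
    peak = average_concurrent_calls(calls_per_hour, avg_call_seconds) * headroom
    return math.ceil(peak / calls_per_worker)

# Assumed traffic: 6,000 calls per hour averaging 3 minutes each
concurrent = average_concurrent_calls(6000, 180)
print(concurrent, "concurrent calls on average")
print(workers_needed(6000, 180, headroom=1.5), "workers with 50% headroom")
```

Real traffic is bursty, so treat the result as a starting point and adjust the headroom from observed peaks.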

Global reach: IDT Express DID numbers in 140+ countries mean your platform can accept local calls in any market without separate carrier relationships per country. One account, one API, global coverage.

Cost optimization at scale:

Use Claude Haiku for simple call types (saves ~70% on LLM cost vs Sonnet)
IDT Express Instant routes for outbound bulk campaigns (lowest per-minute rates)
IDT Express Platinum routes for customer-facing inbound (quality matters for retention)
Cache TTS for common phrases (greetings, hold messages) — saves ~15% TTS cost
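Caching TTS output for fixed phrases can be as simple as a dictionary keyed by voice and text. The sketch below wraps any synthesis function; `PhraseCache` and `fake_synthesize` are hypothetical, and a real deployment would likely persist the audio (for example on disk or in a shared cache) rather than hold it in process memory.

```python
import hashlib

class PhraseCache:
    """Memoizes synthesized audio so repeated phrases are only generated once."""

    def __init__(self, synthesize):
        self._synthesize = synthesize   # callable: (text, voice) -> audio bytes
        self._audio = {}
        self.hits = 0
        self.misses = 0

    def get(self, text, voice="default"):
        key = hashlib.sha256(f"{voice}\n{text.strip()}".encode("utf-8")).hexdigest()
        if key in self._audio:
            self.hits += 1
        else:
            self.misses += 1
            self._audio[key] = self._synthesize(text, voice)
        return self._audio[key]

def fake_synthesize(text, voice):
    # Stand-in for a real TTS request
    return f"{voice}:{text}".encode("utf-8")

cache = PhraseCache(fake_synthesize)
first = cache.get("Thanks for calling. How can I help?")
second = cache.get("Thanks for calling. How can I help?")
print(cache.hits, cache.misses, first == second)
```

Including the voice in the key matters: the same greeting in two voices is two different audio clips.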

STIR/SHAKEN: All outbound calls through IDT Express are automatically STIR/SHAKEN signed. For AI calling platforms, this is non-negotiable — unsigned calls are increasingly labelled as spam by major carriers, killing your outbound answer rates before your agent even speaks.

Getting Started: Your Stack Checklist

Component	Recommended tool	Time to set up
Carrier / SIP trunk	IDT Express	10 minutes
DID numbers	IDT Express	Instant via API
STT	Deepgram Nova-2	15 minutes
LLM	Claude Sonnet (Anthropic API)	10 minutes
TTS	Cartesia or ElevenLabs	15 minutes
Orchestration	Pipecat	30 minutes
Infra	AWS/GCP + Docker	1–2 hours

Realistic timeline:

Working proof of concept: 1–2 days
Single-use-case production agent: 1–2 weeks
Full multi-use-case platform with analytics: 4–8 weeks

Frequently Asked Questions

Q: Can I build an AI calling platform without coding experience?
A: Not realistically at the infrastructure level described here. For no-code options, Vapi, Bland AI, and Synthflow are purpose-built managed platforms. If you want to build your own and have a development team, this guide covers everything you need.

Q: Why does audio codec matter for AI calling?
A: Speech-to-text engines perform significantly better on G.711 audio (64 kbit/s companded PCM, with no low-bitrate compression) than on G.729 audio (compressed to 8 kbit/s). Compressed codecs introduce artifacts that can reduce transcription accuracy by up to 15%, which degrades your LLM’s input and produces worse agent responses. IDT Express Platinum routes default to G.711.

Q: How do I get phone numbers for my AI calling platform?
A: IDT Express provides DID (Direct Inward Dialing) numbers in 140+ countries, provisionable via API. Numbers activate instantly. You can assign them to inbound call flows, give them to specific AI agents, or use them for outbound caller ID.

Q: Does my AI calling platform need to be STIR/SHAKEN compliant?
A: Yes, for US outbound calls. FCC rules require voice service providers to implement STIR/SHAKEN caller ID authentication on the IP portions of their networks, so outbound calls should leave through a carrier that signs them. IDT Express handles this automatically: all outbound traffic is signed with attestation level A. Unsigned calls are increasingly labelled “Spam Likely” by major carriers.

Next Steps

If you’re ready to build:

Create a free IDT Express account — get your SIP credentials and $25 free credit. Takes 60 seconds.
Sign up for Anthropic API access — your Claude API key is available immediately at console.anthropic.com
Install Pipecat — pip install pipecat-ai and run the quickstart
Make your first AI call — a working proof-of-concept in under a day

For the carrier layer — the infrastructure that connects your AI to the real phone network — IDT Express voice termination and SIP trunking is purpose-built for AI voice workloads: G.711 audio, elastic concurrency, global DID numbers, and STIR/SHAKEN compliance included.

Start building: free account →
