NYSE: IDT
How to Build an AI Calling Platform (2026): Stack, Architecture & Carrier Setup


How to Build Your Own AI Calling Platform in 2026

The complete stack guide — LLM, STT, TTS, and the carrier layer most tutorials skip.

In 2024, an AI making a phone call was a novelty. By 2026, it is standard business infrastructure. AI calling platforms handle inbound support, qualify leads, book appointments, collect payments, and run outbound campaigns — all without a human agent on the line.

Most businesses buy this capability off-the-shelf from platforms like Vapi, Bland AI, or Retell. But a growing tier of companies (agencies, telecom resellers, SaaS builders, and enterprises with high call volumes) are building their own. Why?

  • Cost. At scale, managed platforms charge $0.07–$0.15 per minute all-in. Build your own stack and the same call costs $0.02–$0.04.
  • Control. You choose every component — LLM, voice, telephony, routing logic.
  • Competitive advantage. Your calling infrastructure becomes a proprietary moat, not a commodity service every competitor can buy too.
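
To put the cost bullet in monthly terms, here is a minimal sketch that multiplies the per-minute ranges quoted above by a call volume. The 100,000 minutes/month figure and the function name are assumptions chosen for illustration, not numbers from any provider.

```python
# Per-minute price ranges quoted above (USD).
MANAGED_RATE = (0.07, 0.15)      # managed platform, all-in
SELF_BUILT_RATE = (0.02, 0.04)   # self-built stack


def monthly_cost(minutes, rate_range):
    """Return the (low, high) monthly cost in USD for a given call volume."""
    low, high = rate_range
    return (round(minutes * low, 2), round(minutes * high, 2))


if __name__ == "__main__":
    minutes = 100_000  # assumed monthly volume, for illustration only
    print("Managed:   ", monthly_cost(minutes, MANAGED_RATE))
    print("Self-built:", monthly_cost(minutes, SELF_BUILT_RATE))
```

At that assumed volume, the managed range works out to $7,000 to $15,000 per month against $2,000 to $4,000 for the self-built stack.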

This guide walks through the complete architecture: what each layer does, which tools to use, how they connect, and — critically — the carrier infrastructure layer that most tutorials gloss over but that determines whether your platform actually works at scale.

The Four Layers of Every AI Calling Platform

Before picking tools, understand the stack. Every AI calling platform — whether you build it or buy it — runs on four layers working in sequence:

Caller's phone
[Layer 1] Telephony / Carrier
      │  SIP trunk, PSTN gateway, DID numbers
[Layer 2] Speech-to-Text (STT)
      │  Transcribes caller audio in real time
[Layer 3] LLM (Large Language Model)
      │  Understands intent, decides response
[Layer 4] Text-to-Speech (TTS)
      │  Converts response to natural voice audio
Back to caller

Audio flows in two directions continuously. The caller speaks → STT transcribes → LLM generates a response → TTS synthesises audio → caller hears the response. Your server orchestrates all four layers in real time. The end-to-end target: under 800ms from “caller stops talking” to “agent starts talking.” Anything above 1,000ms and the caller assumes the connection dropped.
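
The turn loop described above can be sketched as three awaited stages with a timer around them. This is only an outline: the three stage functions are hypothetical stubs standing in for real STT, LLM, and TTS calls, and a production pipeline would stream between stages rather than wait for each to finish.

```python
import asyncio
import time


# Stub stages standing in for real STT, LLM and TTS services (all hypothetical).
async def transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")          # pretend the audio is already text


async def generate_reply(text: str) -> str:
    return f"You said: {text}"


async def synthesise(text: str) -> bytes:
    return text.encode("utf-8")           # pretend the text is audio


async def handle_turn(caller_audio: bytes):
    """Run one caller turn through STT -> LLM -> TTS and time it."""
    started = time.perf_counter()
    transcript = await transcribe(caller_audio)
    reply = await generate_reply(transcript)
    agent_audio = await synthesise(reply)
    elapsed_ms = (time.perf_counter() - started) * 1000
    return agent_audio, elapsed_ms


if __name__ == "__main__":
    audio, elapsed_ms = asyncio.run(handle_turn(b"book a demo"))
    verdict = "within target" if elapsed_ms < 800 else "over target"
    print(audio, f"{elapsed_ms:.1f}ms", verdict)
```

Measuring from the end of caller speech to the first byte of agent audio, as `handle_turn` does, gives you the number to compare against the 800ms target.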

Let’s build each layer.

Layer 1: Telephony — The Foundation Everyone Underestimates

This is the layer most tutorials rush past with “use Twilio” and move on. It’s also the layer that determines your per-minute cost, your audio quality, and whether your AI agent sounds natural or robotic.

What this layer does:

  • Receives inbound calls from real phone numbers (PSTN)
  • Routes outbound calls to any destination
  • Streams audio to your server via WebSocket or SIP
  • Provides the phone numbers your callers dial

Your two options:

Option A: Managed telephony (Twilio, etc.)

Fast to set up. Pay retail CPaaS rates ($0.013/min+ outbound, platform fees on top). Excellent for prototyping and low volumes. At 100,000+ minutes/month, costs compound quickly.

Option B: Direct SIP trunk from a wholesale carrier

Connect your platform directly to a wholesale SIP termination provider. Lower per-minute rates, full control over routing, no platform markup. Takes 10 minutes to configure. Saves 30–60% at scale.

For production builds, IDT Express is the recommended carrier layer.

IDT Express is a wholesale voice termination provider backed by IDT Corporation (NYSE: IDT) — one of the world’s largest wholesale voice carriers with 25+ years of carrier infrastructure. Their SIP trunking and voice termination gives your AI calling platform:

  • Platinum routes — G.711 uncompressed audio, sub-3-second post-dial delay, optimised specifically for AI voice agent workloads where codec quality directly affects STT accuracy
  • 1,000+ CLI routes across 200+ countries — your platform can make and receive calls globally from day one
  • DID numbers in 140+ countries — give your agents real local phone numbers in any market instantly via API
  • Elastic concurrency — scale from 10 to 10,000+ simultaneous calls without pre-provisioning
  • STIR/SHAKEN signed outbound — calls are signed with attestation level A, preventing spam labels that kill answer rates

SIP configuration (works with any SIP-compatible framework):

SIP Server: sip.idtexpress.com
Transport: TLS (recommended) or UDP
Codec: G.711 ulaw (PCMU) — critical for AI workloads
Authentication: Digest auth with your IDT Express credentials
DTMF: RFC 2833

Get your SIP credentials free at idtexpress.com/sign-up. Credentials are available immediately after registration — no sales call required.

Layer 2: Speech-to-Text — Hearing Your Callers

The STT layer converts the caller’s audio into text your LLM can process. For phone-based AI calling, this is harder than it sounds — telephony audio runs at 8kHz mulaw, not the 16kHz PCM most audio tools expect.

Key requirements for telephony STT:

  • Native 8kHz mulaw support (no resampling step = lower latency)
  • Streaming transcription (word-by-word as caller speaks, not after they stop)
  • Sub-300ms median latency
  • Accuracy on alphanumerics — phone numbers, confirmation codes, email addresses
  • Noise robustness — phone lines have background noise, compression artifacts
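
If your STT provider does not accept mulaw natively, the audio has to be decoded to linear PCM first. The sketch below implements the standard G.711 mu-law expansion in pure Python; the function names are illustrative, and it keeps the 8kHz sample rate (resampling to 16kHz would be a separate step).

```python
import struct

BIAS = 0x84  # offset added by the G.711 mu-law encoder


def mulaw_byte_to_pcm16(value: int) -> int:
    """Decode one G.711 mu-law byte into a signed 16-bit linear sample."""
    value = ~value & 0xFF                      # mu-law bytes are stored inverted
    magnitude = ((value & 0x0F) << 3) + BIAS   # 4-bit mantissa plus bias
    magnitude <<= (value & 0x70) >> 4          # scale by the 3-bit exponent
    return BIAS - magnitude if value & 0x80 else magnitude - BIAS


def mulaw_to_pcm16(frame: bytes) -> bytes:
    """Convert a mu-law frame to little-endian 16-bit PCM at the same sample rate."""
    samples = [mulaw_byte_to_pcm16(b) for b in frame]
    return struct.pack(f"<{len(samples)}h", *samples)


if __name__ == "__main__":
    frame = bytes([0xFF] * 160)            # 20 ms of mu-law silence at 8 kHz
    print(len(mulaw_to_pcm16(frame)))      # 320 bytes: two bytes per sample
```

Every decode step like this adds work per audio frame, which is why native mulaw support is listed first among the requirements.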

Recommended options:

Provider                 Latency   Best for
Deepgram Nova-2          ~200ms    Production, best price/performance
AssemblyAI Universal-3   ~307ms    Highest accuracy on alphanumerics
OpenAI Whisper (hosted)  ~400ms    General use, easy integration
Google STT Streaming     ~250ms    Multilingual, enterprise

For most AI calling platforms, Deepgram Nova-2 is the default choice — it natively accepts 8kHz mulaw, delivers ~200ms P50 latency, and has excellent accuracy on the kind of structured data callers share over phone (numbers, codes, addresses).

Quick integration (Python, Deepgram):

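import os

from deepgram import Deepgram  # legacy deepgram-sdk v2 interface

dg_client = Deepgram(os.environ["DEEPGRAM_API_KEY"])

async def transcribe_stream(audio_stream, on_transcript=print):
    # Open a live transcription socket configured for telephony audio
    socket = await dg_client.transcription.live({
        'encoding': 'mulaw',
        'sample_rate': 8000,
        'model': 'nova-2',
        'interim_results': True,
        'endpointing': 300,   # ms of silence before speech is treated as finished
    })
    # Deliver each transcript message (interim and final) to the callback
    socket.registerHandler(socket.event.TRANSCRIPT_RECEIVED, on_transcript)
    # Send audio chunks as they arrive from your SIP/WebSocket layer
    async for chunk in audio_stream:
        socket.send(chunk)
    # Flush any buffered audio and close the connection
    await socket.finish()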

Layer 3: The LLM Brain — Claude for AI Calling

The LLM receives the caller’s transcribed text, maintains conversation context, decides what to say next, and optionally calls external tools (your CRM, calendar, payment system).

Why Claude for AI calling:

Claude (Anthropic) has emerged as a strong choice for voice agent LLMs for several reasons specific to calling use cases:

  • Instruction following — Claude reliably follows complex system prompts that define agent persona, call flow, escalation rules, and tool usage. In calling contexts, an agent going “off-script” is a serious problem. Claude’s instruction adherence is class-leading.
  • Concise responses — Claude tends toward shorter, more conversational responses when prompted correctly — essential for voice where long responses feel unnatural and add TTS latency.
  • Tool use — Claude’s function calling (tool use API) handles real-time lookups — checking calendar availability, querying CRM records, verifying order status — reliably and with correct JSON formatting.
  • Multi-turn context — maintains coherent conversation context across a full call without “forgetting” earlier exchanges.

Model selection for calling:

  • Claude Haiku — lowest latency (~150ms), best for simple inbound flows (FAQ, routing, appointment confirmation)
  • Claude Sonnet — balanced latency and reasoning, best for most calling use cases
  • Claude Opus — highest reasoning quality, use only when call complexity demands it (complex sales, negotiations)
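
One way to apply this selection is a lookup from call type to model tier, falling back to the balanced tier. The call-type labels below are hypothetical examples based on the list above, and the values are tier names rather than API model identifiers.

```python
# Hypothetical call categories mapped to the model tiers described above.
TIER_BY_CALL_TYPE = {
    "faq": "haiku",
    "routing": "haiku",
    "appointment_confirmation": "haiku",
    "complex_sales": "opus",
    "negotiation": "opus",
}


def pick_model_tier(call_type: str) -> str:
    """Return the model tier for a call type, defaulting to the balanced tier."""
    return TIER_BY_CALL_TYPE.get(call_type, "sonnet")


if __name__ == "__main__":
    for call_type in ("faq", "lead_qualification", "negotiation"):
        print(call_type, "->", pick_model_tier(call_type))
```

Keeping the mapping in data rather than in branching code makes it easy to move a call type to a different tier once you have measured latency and quality for it.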

System prompt structure for a calling agent:

You are [Agent Name], a [role] for [Company].

PERSONALITY: [tone, style — e.g., "professional but warm, concise"]

YOUR GOAL: [primary objective — e.g., "qualify inbound leads and book a demo"]

CALL FLOW:
1. Greet caller by name if available
2. Identify their need in one open question
3. [flow logic...]
4. If you cannot resolve: transfer to human agent

RULES:
- Keep all responses under 3 sentences
- Never fabricate information; if unsure, say so
- Always confirm before taking any action (booking, updating records)
- End the call professionally if the caller asks to stop

TOOLS AVAILABLE:
- check_calendar(date, time) → returns availability
- create_booking(name, email, date, time) → books appointment
- lookup_customer(phone_number) → returns CRM record
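
Listing tools in the prompt is not enough on its own: the model also needs machine-readable definitions, and your server needs to execute the calls it requests. The sketch below shows two of the tools in the name / description / input_schema shape that the Anthropic Messages API accepts for its `tools` parameter, plus a local dispatcher. The handler bodies and return values are placeholders, not a real calendar or CRM integration.

```python
# Tool definitions as JSON-schema dictionaries for the `tools` parameter.
TOOLS = [
    {
        "name": "check_calendar",
        "description": "Return whether a date and time slot is available.",
        "input_schema": {
            "type": "object",
            "properties": {
                "date": {"type": "string", "description": "ISO date, e.g. 2026-03-01"},
                "time": {"type": "string", "description": "24-hour time, e.g. 14:30"},
            },
            "required": ["date", "time"],
        },
    },
    {
        "name": "lookup_customer",
        "description": "Return the CRM record for a caller's phone number.",
        "input_schema": {
            "type": "object",
            "properties": {"phone_number": {"type": "string"}},
            "required": ["phone_number"],
        },
    },
]


# Placeholder handlers: real ones would query your calendar and CRM.
def check_calendar(date, time):
    return {"date": date, "time": time, "available": True}


def lookup_customer(phone_number):
    return {"phone_number": phone_number, "found": False}


HANDLERS = {"check_calendar": check_calendar, "lookup_customer": lookup_customer}


def run_tool(name, tool_input):
    """Dispatch a tool call requested by the model to the matching handler."""
    if name not in HANDLERS:
        return {"error": f"unknown tool: {name}"}
    return HANDLERS[name](**tool_input)
```

When a response contains a tool-use block, you would pass its name and input to `run_tool` and send the result back to the model as the tool result for that call.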

Layer 4: Text-to-Speech — Your Agent’s Voice

The TTS layer converts Claude’s text response into natural-sounding audio delivered back to the caller. Voice quality here directly impacts whether callers trust and engage with your agent.

Key requirements:

  • Streaming TTS (start playing before the full response is generated — cuts latency by ~40%)
  • Low latency first-byte generation (<300ms)
  • Natural prosody — pauses, emphasis, pacing
  • Telephony-compatible output (8kHz or convertible)

Recommended options:

Provider          Voice quality  Latency  Best for
ElevenLabs        Highest        ~250ms   Customer-facing, premium
Cartesia          Excellent      ~100ms   Latency-sensitive calling
OpenAI TTS        Very good      ~300ms   Easy integration, consistent
Azure Neural TTS  Very good      ~200ms   Enterprise, multilingual

For AI calling platforms, Cartesia has gained traction specifically because of its roughly 100ms first-audio-byte latency, which matters when the agent needs to start speaking immediately after processing.

Orchestration: Connecting the Four Layers

Your server is the orchestration layer — the WebSocket bridge that routes audio between telephony and your AI pipeline. Two main frameworks:

Pipecat (open source, recommended)

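from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.anthropic import AnthropicLLMService
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.transports.network.fastapi_websocket import FastAPIWebsocketTransport

# Schematic outline: the service constructors also take API keys and other
# arguments, and import paths differ between Pipecat releases.
# SYSTEM_PROMPT is the prompt text from the previous section.
async def run_agent(websocket):
    transport = FastAPIWebsocketTransport(websocket)

    pipeline = Pipeline([
        transport.input(),           # Audio in from SIP/WebSocket
        DeepgramSTTService(),        # Speech to text
        AnthropicLLMService(         # Claude LLM
            model="claude-sonnet-4-20250514",
            system=SYSTEM_PROMPT
        ),
        CartesiaTTSService(),        # Text to speech
        transport.output()           # Audio out to caller
    ])

    # A pipeline is executed by wrapping it in a task and handing it to a runner
    await PipelineRunner().run(PipelineTask(pipeline))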

LiveKit Agents (open source, for higher scale)

Best when you need to handle thousands of concurrent calls with WebRTC and SIP simultaneously. More infrastructure overhead but significantly better concurrency management.

Putting It Together: The Full Architecture

Caller dials DID number (IDT Express)
              │
              ▼
    IDT Express SIP Trunk
    (G.711 audio, Platinum route)
              │
              ▼
    Your SIP Gateway / WebSocket Server
    (Pipecat or LiveKit orchestration)
         │            │
         ▼            ▼
  Deepgram STT    Cartesia TTS
  (transcribe)    (synthesise)
         │            ▲
         ▼            │
    Claude Sonnet (Anthropic)
    (understand + decide + respond)
         │
         ▼
    Tool calls (CRM, calendar, payments)

End-to-end latency budget (target: under 800ms):

  • Telephony to server: ~50ms (IDT Platinum route)
  • STT (Deepgram): ~200ms
  • LLM (Claude Sonnet): ~200ms
  • TTS first byte (Cartesia): ~100ms
  • Server to telephony: ~50ms
  • Total: ~600ms
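
The budget above is simple addition, but keeping it in code makes it easy to re-check when you swap a provider. A minimal sketch, with stage names chosen here for illustration:

```python
# Per-stage estimates from the budget above, in milliseconds.
LATENCY_BUDGET_MS = {
    "telephony_to_server": 50,
    "stt": 200,
    "llm": 200,
    "tts_first_byte": 100,
    "server_to_telephony": 50,
}
TARGET_MS = 800


def budget_summary(stages, target_ms=TARGET_MS):
    """Return (total, headroom) in ms; negative headroom means over budget."""
    total = sum(stages.values())
    return total, target_ms - total


if __name__ == "__main__":
    total, headroom = budget_summary(LATENCY_BUDGET_MS)
    print(f"total={total}ms headroom={headroom}ms")  # total=600ms headroom=200ms
```

The 200ms of headroom is what absorbs network jitter and slower-than-median responses from any one stage.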

Scaling Your Platform

Once your single-agent proof of concept works, scaling requires:

Concurrent calls: Each call runs in its own process/container. IDT Express SIP trunks are elastic — no pre-provisioning needed. Scale your compute (not your carrier) to handle bursts.

Global reach: IDT Express DID numbers in 140+ countries mean your platform can accept local calls in any market without separate carrier relationships per country. One account, one API, global coverage.

Cost optimization at scale:

  • Use Claude Haiku for simple call types (saves ~70% on LLM cost vs Sonnet)
  • IDT Express Instant routes for outbound bulk campaigns (lowest per-minute rates)
  • IDT Express Platinum routes for customer-facing inbound (quality matters for retention)
  • Cache TTS for common phrases (greetings, hold messages) — saves ~15% TTS cost
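
The TTS caching idea in the last bullet can be sketched as a small in-memory cache keyed by a hash of the normalised phrase. The class name and the injected `synthesise` callable are hypothetical; a real deployment would likely use a shared store so that every call worker benefits from the same cache.

```python
import hashlib


class TTSCache:
    """Cache synthesised audio so repeated phrases skip the TTS provider."""

    def __init__(self, synthesise):
        self._synthesise = synthesise   # callable: text -> audio bytes
        self._audio = {}                # phrase hash -> audio bytes
        self.misses = 0                 # number of real TTS requests made

    def get(self, text: str) -> bytes:
        # Normalise case and surrounding whitespace so trivial variants share a key.
        key = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
        if key not in self._audio:
            self.misses += 1
            self._audio[key] = self._synthesise(text)
        return self._audio[key]


if __name__ == "__main__":
    cache = TTSCache(lambda text: text.encode("utf-8"))  # stand-in for a TTS call
    cache.get("Please hold while I check that for you.")
    cache.get("please hold while I check that for you. ")
    print(cache.misses)  # only the first request reaches the provider
```

Only fixed phrases such as greetings and hold messages are worth caching this way; responses that include caller-specific details will rarely repeat.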

STIR/SHAKEN: All outbound calls through IDT Express are automatically STIR/SHAKEN signed. For AI calling platforms, this is non-negotiable — unsigned calls are increasingly labelled as spam by major carriers, killing your outbound answer rates before your agent even speaks.

Getting Started: Your Stack Checklist

Component            Recommended tool               Time to set up
Carrier / SIP trunk  IDT Express                    10 minutes
DID numbers          IDT Express                    Instant via API
STT                  Deepgram Nova-2                15 minutes
LLM                  Claude Sonnet (Anthropic API)  10 minutes
TTS                  Cartesia or ElevenLabs         15 minutes
Orchestration        Pipecat                        30 minutes
Infra                AWS/GCP + Docker               1–2 hours

Realistic timeline:

  • Working proof of concept: 1–2 days
  • Single-use-case production agent: 1–2 weeks
  • Full multi-use-case platform with analytics: 4–8 weeks

Frequently Asked Questions

Q: Can I build an AI calling platform without coding experience?
A: Not realistically at the infrastructure level described here. For no-code options, Vapi, Bland AI, and Synthflow are purpose-built managed platforms. If you want to build your own and have a development team, this guide covers everything you need.

Q: Why does audio codec matter for AI calling?
A: AI speech-to-text engines perform significantly better on G.711 uncompressed audio than on G.729 compressed audio. Compressed codecs introduce artifacts that reduce transcription accuracy by up to 15%, which degrades your LLM’s input and produces worse agent responses. IDT Express Platinum routes default to G.711.

Q: How do I get phone numbers for my AI calling platform?
A: IDT Express provides DID (Direct Inward Dialing) numbers in 140+ countries, provisionable via API. Numbers activate instantly. You can assign them to inbound call flows, give them to specific AI agents, or use them for outbound caller ID.

Q: Does my AI calling platform need to be STIR/SHAKEN compliant?
A: Yes, for US outbound calls. FCC regulations require STIR/SHAKEN signing on all outbound calls. IDT Express handles this automatically: all outbound traffic is signed with attestation level A. Non-compliant outbound calls are increasingly labelled “Spam Likely” by major carriers.


Next Steps

If you’re ready to build:

  1. Create a free IDT Express account — get your SIP credentials and $25 free credit. Takes 60 seconds.
  2. Sign up for Anthropic API access — your Claude API key is available immediately at console.anthropic.com
  3. Install Pipecat — pip install pipecat-ai and run the quickstart
  4. Make your first AI call — a working proof-of-concept in under a day

For the carrier layer — the infrastructure that connects your AI to the real phone network — IDT Express voice termination and SIP trunking is purpose-built for AI voice workloads: G.711 audio, elastic concurrency, global DID numbers, and STIR/SHAKEN compliance included.
