AI & Voice

Building a Multilingual Voice AI Pipeline with Bhashini for Public Welfare Surveys

A real-time speech pipeline supporting 14 Indian languages using Bhashini — enabling automated voice surveys with under 2-second latency.

Platform

Web & Telephony

Duration

2 Months

<2s

Pipeline latency

14

Languages supported

0

Human translators needed

Project overview

A real-time STT → NMT → TTS pipeline built on Bhashini, demonstrating that production-grade multilingual voice systems for India don't require expensive commercial APIs or compromises on language coverage.


Type

AI & Voice

Stack

8 technologies

The challenge

A public welfare program needed to conduct large-scale citizen surveys across multiple Indian states. The existing system only supported English, which meant the majority of the target population couldn't participate.

English-only surveys excluded the majority of the target population

Human translators created bottlenecks and couldn't scale beyond a few hundred calls per day

Commercial speech APIs lacked reliable support for most Indian regional languages

Per-minute pricing from commercial providers made large-scale deployment financially unviable

No unified pipeline existed that could handle speech recognition, translation, and synthesis in a single flow

What we set out to do

  • 01

    Build a unified STT → Translation → TTS pipeline supporting 14 Indian languages

  • 02

    Achieve end-to-end voice pipeline latency under 2 seconds

  • 03

    Integrate with Bhashini for sovereign, cost-effective speech services

  • 04

    Enable citizens to complete surveys entirely in their native language without human intervention

  • 05

    Pre-generate survey audio assets at scale using async queue processing
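The targets above imply a fixed contract between the three stages. A minimal TypeScript sketch of that contract follows; everything here is illustrative — the exact set of 14 language codes and all identifiers are assumptions, not taken from the project.

```typescript
// Hypothetical shapes for the survey voice pipeline. The language set below
// is an assumed list of 14 ISO 639-1 codes, not the project's confirmed one.
const SUPPORTED = [
  "as", "bn", "en", "gu", "hi", "kn", "ml", "mr",
  "ne", "or", "pa", "ta", "te", "ur",
] as const;

type LanguageCode = (typeof SUPPORTED)[number];

function isSupported(code: string): code is LanguageCode {
  return (SUPPORTED as readonly string[]).includes(code);
}

// One round trip through the pipeline produces all three artifacts plus timing.
interface PipelineResult {
  transcript: string;       // ASR output in the citizen's language
  translation: string;      // NMT output in the survey's base language
  replyAudioBase64: string; // TTS audio for the next prompt
  latencyMs: { asr: number; nmt: number; tts: number };
}
```

Encoding the language list at the type level lets the compiler reject unsupported routing at build time rather than at call time.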

How we solved it

01

Bhashini API Integration

Integrated with Bhashini's Dhruva inference API for ASR, Neural Machine Translation, and TTS.

Key decision

Bhashini over Google Cloud Speech / AWS Transcribe

Result

Coverage for 14 Indian languages. Significantly more cost-effective at scale.
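As a sketch, an ASR request to Bhashini's pipeline-compute endpoint can be assembled as below. The payload follows the ULCA-style pipeline format as we understand it; the endpoint URL, auth header, and `serviceId` are placeholders to verify against current Dhruva documentation.

```typescript
// Hedged sketch of an ASR request payload for Bhashini's Dhruva inference API.
// Field names follow the ULCA pipeline-compute shape; serviceId is a placeholder.
interface AsrRequest {
  pipelineTasks: {
    taskType: "asr";
    config: {
      language: { sourceLanguage: string };
      serviceId: string;
      audioFormat: string;
      samplingRate: number;
    };
  }[];
  inputData: { audio: { audioContent: string }[] };
}

function buildAsrRequest(
  audioBase64: string,
  sourceLanguage: string,
  serviceId: string,
): AsrRequest {
  return {
    pipelineTasks: [{
      taskType: "asr",
      config: {
        language: { sourceLanguage },
        serviceId,
        audioFormat: "wav",     // assumed input format
        samplingRate: 16000,    // assumed telephony-friendly rate
      },
    }],
    inputData: { audio: [{ audioContent: audioBase64 }] },
  };
}

// Usage (network call elided; URL and auth scheme are placeholders):
// await fetch("https://dhruva-api.bhashini.gov.in/services/inference/pipeline", {
//   method: "POST",
//   headers: { "Content-Type": "application/json", Authorization: API_KEY },
//   body: JSON.stringify(buildAsrRequest(audio, "hi", ASR_SERVICE_ID)),
// });
```

Keeping payload construction separate from transport makes the request shape unit-testable without hitting the API.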

02

Automated Translation in the Pipeline

Automatic translation between any supported language pair using Bhashini's NMT service.

Key decision

Automated NMT over manual/human translation workflows

Result

Zero human translators needed. Full round-trip translation handled automatically.
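The translation leg can be sketched the same way: one helper builds an NMT task config, and a second pairs the inbound and outbound directions for the round trip described above. Field names follow the same assumed ULCA-style shape and `serviceId` is again a placeholder.

```typescript
// Illustrative NMT task builder in the assumed ULCA-style pipeline shape.
function buildNmtTask(source: string, target: string, serviceId: string) {
  return {
    taskType: "translation" as const,
    config: {
      language: { sourceLanguage: source, targetLanguage: target },
      serviceId,
    },
  };
}

// Round trip: citizen's language -> survey base language for the answer,
// and back for the next prompt. No human translator in the loop.
function roundTripTasks(citizenLang: string, baseLang: string, serviceId: string) {
  return {
    inbound: buildNmtTask(citizenLang, baseLang, serviceId),
    outbound: buildNmtTask(baseLang, citizenLang, serviceId),
  };
}
```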

03

Real-Time STT → NMT → TTS Pipeline

Three-stage pipeline with per-stage latency tracking and language-aware routing.

Key decision

Three-stage pipeline with per-stage latency tracking

Result

End-to-end pipeline latency under 2 seconds.
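The per-stage latency tracking can be sketched as a small runner that times each stage independently. The stage functions here are stand-ins for the real Bhashini calls; language-aware routing would live inside them.

```typescript
// Minimal sketch of the three-stage runner with per-stage latency tracking.
type Stage = "asr" | "nmt" | "tts";

async function runPipeline(
  input: string,
  stages: Record<Stage, (data: string) => Promise<string>>,
): Promise<{ output: string; latencyMs: Record<Stage, number> }> {
  const latencyMs: Record<Stage, number> = { asr: 0, nmt: 0, tts: 0 };
  let current = input;
  for (const stage of ["asr", "nmt", "tts"] as Stage[]) {
    const start = Date.now();
    current = await stages[stage](current); // stand-in for the external API call
    latencyMs[stage] = Date.now() - start;  // per-stage timing feeds the <2s budget
  }
  return { output: current, latencyMs };
}
```

Timing each stage separately, rather than only end to end, shows which leg is eating the latency budget when a call drifts toward the 2-second ceiling.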

04

Pre-Generation & Dynamic Variables

Static survey content pre-generated via BullMQ queues. Dynamic content uses variable placeholders.

Key decision

Queue-based pre-generation + real-time synthesis for dynamic variables

Result

Zero TTS latency for static content. Personalized audio without sacrificing speed.
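Splitting a prompt into static and dynamic parts is the core of this step: static segments can be pre-rendered through the queue, and only the variable slots need real-time synthesis. A sketch, assuming a `{{name}}` placeholder convention (the project's actual syntax may differ):

```typescript
// Splits a survey prompt into static segments (pre-generated offline, e.g. via
// BullMQ jobs) and dynamic variable slots (synthesized at call time).
type Segment =
  | { kind: "static"; text: string }
  | { kind: "variable"; name: string };

function splitPrompt(template: string): Segment[] {
  const segments: Segment[] = [];
  const re = /\{\{(\w+)\}\}/g; // assumed placeholder syntax
  let last = 0;
  let m: RegExpExecArray | null;
  while ((m = re.exec(template)) !== null) {
    if (m.index > last) {
      segments.push({ kind: "static", text: template.slice(last, m.index) });
    }
    segments.push({ kind: "variable", name: m[1] });
    last = re.lastIndex;
  }
  if (last < template.length) {
    segments.push({ kind: "static", text: template.slice(last) });
  }
  return segments;
}
```

At playback time the pre-generated clips and the freshly synthesized variable clips are stitched in order, which is why static content costs zero TTS latency.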

Measurable impact

<2s

End-to-end pipeline latency

14

Indian languages supported

0

Human translators needed

0ms

TTS latency for pre-generated survey audio

Tech stack

NestJS

Bhashini Dhruva API

BullMQ

Redis

WebSockets

FFmpeg

Azure Blob Storage

TypeScript

What we learned

This project demonstrated that building production-grade multilingual voice systems for India doesn't require expensive commercial APIs or compromising on language coverage.

  • 01

    Bhashini provides viable, production-ready speech AI for Indian languages — but requires careful model routing

  • 02

    Pre-generating static audio via queues eliminates TTS latency for known content

  • 03

    Dynamic variable support makes personalized, multilingual audio feasible at scale

  • 04

    Error isolation across pipeline stages is critical when depending on external APIs
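The error-isolation lesson can be sketched as a tiny wrapper: each external call runs in its own try/catch and returns a fallback value plus the name of the failed stage, so one API hiccup degrades a single step instead of dropping the call. Names and fallback policy here are illustrative.

```typescript
// Illustrative error-isolation wrapper for external pipeline stages.
async function isolated<T>(
  stage: string,
  call: () => Promise<T>,
  fallback: T,
): Promise<{ value: T; failedStage?: string }> {
  try {
    return { value: await call() };
  } catch {
    // Degrade gracefully: return the fallback and record which stage failed,
    // so the caller can retry, skip, or surface a partial result.
    return { value: fallback, failedStage: stage };
  }
}
```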

Ready to build something that matters?

We solve problems that don't have Stack Overflow answers. Let's talk.

Book a Discovery Call