Senior Voice AI Engineer

Closing Date: 24 May 2026

Job Published: 24 Apr 2026

Contact Email: jobs@devicedriven.com

Brief Description

Position Overview
We are seeking an experienced Senior Voice AI Engineer to build the voice infrastructure for an intelligent conversational AI agent serving a US-based client. You will own the real-time voice layer, ensuring natural, low-latency voice interactions that feel human-like and responsive.
This is a hands-on technical role requiring deep expertise in speech technologies, real-time audio systems, and telephony integration. You should have proven experience building production voice systems that handle real user conversations at scale.
Key Responsibilities
Speech & Voice Pipeline
● Implement and optimize Speech-to-Text (STT) pipelines for accuracy, latency, and robustness
● Integrate and fine-tune Text-to-Speech (TTS) engines for natural prosody and appropriate tone
● Implement Voice Activity Detection (VAD) for accurate speech endpoint detection
● Handle interruptions, barge-in, and natural turn-taking in conversations
● Optimize for real-time performance with sub-500 ms end-to-end latency
Real-Time Infrastructure
● Build low-latency audio streaming infrastructure using WebSockets/WebRTC
● Implement audio preprocessing (noise reduction, echo cancellation, normalization)
● Design resilient pipelines that handle network variability and audio quality issues
● Build connection management for concurrent voice sessions at scale
Telephony Integration
● Integrate with telephony platforms (Twilio, Vonage) for phone-based voice channels
● Handle call lifecycle management (inbound, outbound, transfers, hold)
● Implement DTMF handling and IVR fallback capabilities
● Support multiple audio codecs and telephony protocols
Quality & Optimization
● Establish metrics for voice quality (latency, Word Error Rate, naturalness)
● Build monitoring and alerting for real-time voice pipeline health
● Analyze call recordings to identify quality improvement opportunities
● Collaborate with the AI/Agent team on seamless voice-to-agent handoff

Preferred Skills

Required Qualifications
Experience
● 5+ years of software engineering experience
● 2+ years building real-time voice or audio systems
● Production experience with STT/TTS integration at scale
● Experience with telephony platforms and voice channels
● Background in audio processing or speech technologies

Technical Skills

Category	Requirements
Speech-to-Text	Deepgram, Whisper, Google STT, AWS Transcribe, or Azure Speech
Text-to-Speech	ElevenLabs, Google TTS, Amazon Polly, Azure Speech, or PlayHT
Languages	Python (primary), with ability to work in TypeScript/Node.js
Real-Time	WebSockets, WebRTC, audio streaming, low-latency optimization
Telephony	Twilio Voice, Vonage, SIP/RTP protocols, call handling
Audio Processing	VAD, noise reduction, audio codecs, PyAudio/soundfile/librosa
Infrastructure	Docker, cloud platforms (AWS/GCP), Redis for session state
Monitoring	Real-time metrics, latency tracking, call quality analytics

AI & Productivity Skills
● Active user of AI-assisted development tools (Claude, Copilot, Cursor, or similar)
● Ability to rapidly prototype and benchmark different STT/TTS providers
● Experience evaluating speech model accuracy and performance
Preferred Qualifications
● Experience with voice agent platforms (Voiceflow, Retell, Vapi, or similar)
● Background in speech recognition or synthesis research/engineering
● Experience with streaming ASR and incremental TTS
● Knowledge of audio signal processing and DSP fundamentals
● Multi-language voice support experience
● Experience optimizing for mobile and low-bandwidth environments
● Understanding of accessibility requirements for voice interfaces
Technical Context
You will be building the voice layer for a conversational AI system that:
● Handles natural spoken conversations with end users
● Requires ultra-low latency for real-time interaction
● Operates across telephony and web-based voice channels
● Integrates with an agentic AI backend for conversation logic
● Must scale to handle concurrent voice sessions reliably
Soft Skills
● Performance Obsession: Relentlessly optimize for latency and quality
● User Empathy: Ensure voice interactions feel natural and responsive
● Problem Solving: Debug complex issues across audio, network, and speech layers
● Collaboration: Work closely with the AI/Agent Engineer on integration points
● Attention to Detail: Small latency improvements matter significantly in voice
About the Engagement
This position is with DeviceDriven, a technology consulting firm partnering with a US-based company building next-generation conversational AI experiences. You will work alongside an AI/Agent Engineer to deliver a complete voice AI solution.
Application Process
Interested candidates should provide:
1. Updated resume highlighting voice/audio systems experience
2. Examples of voice or real-time audio systems you have built
3. Metrics you achieved (latency, accuracy, scale)
4. Current and expected compensation