Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech
Original reporting by Hugging Face

Over half the world's population speaks more than one language, and for many speakers, code-switching (blending languages mid-sentence) is a natural communication style. It is common in enterprise settings, from customer support to IT helpdesks, yet little dedicated research has evaluated how voice agents handle it. Prompted by a customer's need to support a code-switching user base, we developed a new benchmark and dataset to assess Automatic Speech Recognition (ASR) systems. ASR is the critical first step in any voice agent pipeline, and transcription errors made there propagate downstream with real operational consequences.
Our benchmark covers four key language pairs (Spanish-English, French-English, Canadian French-English, German-English) within Human Resources and IT Service Management scenarios. We measure model performance using Word Error Rate (WER) for exact accuracy, alongside Semantic Word Error Rate (SWER) and Answer Error Rate (AER) to gauge meaning preservation for downstream tasks. We evaluated seven ASR systems, including frontier Large Audio Language Models (LALMs) and open-source solutions.
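As a point of reference for the first of these metrics, here is a minimal sketch of the standard WER computation: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the lowercase-and-split normalization are assumptions made for illustration; the benchmark's actual text normalization is not described here, and SWER and AER are not sketched because their definitions are not given in this summary.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance / number of reference words.

    Normalization (lowercasing, whitespace split) is an assumption for
    this sketch, not the benchmark's documented preprocessing.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative (made-up) Spanish-English utterance with one misrecognized word.
reference = "necesito resetear mi password por favor"
hypothesis = "necesito resetear mi pasword por favor"
print(f"{word_error_rate(reference, hypothesis):.3f}")  # 0.167
```

Because WER counts every mismatched word equally, a transcript can score poorly on WER while still preserving the meaning, which is why the benchmark pairs it with semantic metrics.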
Key Findings
Our findings reveal that the performance cost of code-switching varies significantly across language pairs and models. ElevenLabs Scribe V2, Google Gemini 3 Flash, and Assembly AI Universal 3-Pro emerged as the top performers, demonstrating surprising robustness to bilingual input and showing only a small degradation compared to monolingual speech. These results suggest that for leading ASR systems, code-switching is becoming a manageable condition, though careful benchmarking for specific language pairs remains crucial.
Our investigation confirms that code-switching, a ubiquitous aspect of multilingual communication, is increasingly within the grasp of frontier ASR systems. Although code-switching has historically been a formidable challenge, top models such as ElevenLabs Scribe V2, Gemini 3 Flash, and Assembly AI Universal 3-Pro handle language shifts with minimal performance degradation compared to monolingual speech. Crucially, semantic accuracy often holds even when word-level errors occur, an encouraging sign for downstream applications. Our analysis further revealed that the likelihood of transcription errors is linked to the frequency of language switches, while the severity of those errors correlates with the overall density of code-mixing. Errors also concentrate disproportionately in the English segments of code-switched utterances, which hints at complex interactions when a model adapts to an embedded language.
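To make the two quantities in that analysis concrete, the sketch below computes one plausible version of each from per-token language tags: the number of switch points in an utterance, and the share of tokens outside the utterance's dominant language. Both definitions, the function name `switch_stats`, and the example tags are assumptions for illustration; this summary does not state how the benchmark itself measured switch frequency or code-mixing density.

```python
from collections import Counter


def switch_stats(lang_tags: list[str]) -> tuple[int, float]:
    """Return (switch_points, mixing_density) for one utterance.

    switch_points: number of adjacent token pairs whose language differs.
    mixing_density: fraction of tokens not in the most frequent language.
    Both are assumed, illustrative definitions.
    """
    if not lang_tags:
        raise ValueError("lang_tags must contain at least one tag")

    switch_points = sum(
        1 for prev_tag, next_tag in zip(lang_tags, lang_tags[1:])
        if prev_tag != next_tag
    )
    dominant_count = Counter(lang_tags).most_common(1)[0][1]
    mixing_density = 1 - dominant_count / len(lang_tags)
    return switch_points, mixing_density


# Hypothetical tags for "necesito resetear mi password del laptop hoy".
tags = ["es", "es", "es", "en", "es", "en", "es"]
switches, density = switch_stats(tags)
print(switches, f"{density:.3f}")  # 4 0.286
```

Separating the two numbers matters because they can diverge: an utterance with a single long English clause has high mixing density but only one or two switch points, while an utterance sprinkled with isolated English terms has many switch points at a similar density.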
Advancing Inclusive AI
These findings matter beyond technical metrics. For global enterprises, reliable code-switching ASR translates directly into greater operational efficiency and, more importantly, a more natural experience for bilingual customers. Letting users communicate in the way they find most comfortable removes a significant barrier and moves the industry closer to inclusive AI systems that reflect the world's linguistic diversity. Future research should examine the specific contextual factors within embedded-language segments that trigger errors, and should expand benchmarks to cover more language pairs and naturally spoken, non-synthetic code-switched audio. Continued work is needed before ASR fully adapts to the multilingual reality of human interaction, so that language diversity becomes a strength rather than a hurdle for AI.
Frequently asked questions
- How effectively do modern AI speech recognition systems process conversations involving code-switching?
- Frontier Automatic Speech Recognition (ASR) systems are increasingly robust at handling code-switching, the seamless blending of languages mid-sentence. Leading models show only minimal performance degradation compared to monolingual speech. While word-level errors can occur, semantic accuracy is often preserved, which is crucial for downstream applications. This progress suggests that code-switching is becoming a manageable condition for advanced ASR, though specific language pair benchmarking remains vital.
- Which advanced AI speech recognition models demonstrate the best performance in handling code-switching?
- Several advanced AI speech recognition models exhibit strong performance in processing code-switched speech. ElevenLabs Scribe V2, Google Gemini 3 Flash, and Assembly AI Universal 3-Pro have emerged as top performers. These models demonstrate surprising robustness to bilingual input, showing only a small degradation in accuracy compared to monolingual speech. Their ability to maintain semantic understanding even with some word errors is a significant advancement for multilingual communication.
- Why is it important for AI voice agents to accurately understand code-switching in human communication?
- Accurately understanding code-switching is crucial for AI voice agents to provide inclusive and efficient services. Over half the world's population speaks multiple languages, and code-switching is a natural communication style. Reliable code-switching ASR enhances operational efficiency for global enterprises and delivers a superior, more natural experience for bilingual customers. This advancement removes communication barriers, moving the industry closer to truly inclusive AI systems that reflect global linguistic diversity.