Why Popular Speech Recognition Tools Are Failing People With Aphasia
Speech recognition systems like OpenAI Whisper are widely used for transcription, but a new study finds they consistently fail people with aphasia, a language disorder affecting roughly 2 million Americans. Researchers tested six popular automatic speech recognition (ASR) services and discovered that speakers with aphasia experienced significantly worse transcription accuracy compared to control groups, revealing that standard auditing practices mask harm to vulnerable populations.
What Are the Main Problems With Current Speech Recognition Audits?
The research identifies three critical pitfalls in how companies and researchers evaluate speech recognition systems. Most audits follow a one-size-fits-all approach that doesn't account for the real needs and preferences of people with speech disorders. The study examined six major ASR services, from Amazon, AssemblyAI, Google, Microsoft, OpenAI, and Rev AI, and found that standard evaluation methods fail to capture the full picture of performance disparities.
- Text Standardization Bias: Most audits "clean" transcriptions by removing capitalization, punctuation, and filler words in a standardized way, but this approach can mask performance differences and ignores how marginalized communities prefer their speech to be transcribed.
- Overlooking Intersectional Differences: Audits typically report only high-level demographic comparisons without examining nuanced subgroups or acoustic factors like background noise and pauses, which can significantly affect transcription accuracy for people with aphasia.
- Relying on a Single Metric: Most audits report only Word Error Rate (WER), which fails to capture semantic meaning or detect hallucinations, where the system fabricates entire phrases that never existed in the original speech.
Hallucinations are particularly dangerous for people with aphasia in high-stakes settings like medical appointments, where false transcriptions could mislead doctors or misrepresent a patient's intended speech. The consequences extend beyond inconvenience; inaccurate transcriptions in medical AI scribes or legal proceedings can directly harm vulnerable users who depend on these tools.
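To make the single-metric pitfall concrete, here is a minimal sketch of Word Error Rate as it is commonly defined: the word-level edit distance between the reference and the ASR output, divided by the number of reference words. The function name and the toy sentences are illustrative assumptions, not material from the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Toy sentences, invented for illustration only.
reference = "i went to the store"
misheard = "i want to da store"  # two substituted words
hallucinated = "i went to the store thank you for watching"  # four fabricated words

print(word_error_rate(reference, misheard))      # 2 errors / 5 words = 0.4
print(word_error_rate(reference, hallucinated))  # 4 errors / 5 words = 0.8
```

WER reduces both outputs to a single number. It counts the fabricated phrase as four insertions, but nothing in the score says that the system invented content rather than misheard it, which is why the researchers argue for reporting additional metrics.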
How Should Speech Recognition Systems Be Audited More Fairly?
The researchers propose a community-driven auditing framework that addresses each of these pitfalls. Rather than treating all users as identical, the approach emphasizes engagement with affected communities to understand their lived experiences and preferences. The study surveyed and interviewed people with aphasia to better understand how they actually use speech recognition technology and what matters most to them.
- Multiple Text Standardization Methods: Test and report performance across different transcription approaches, allowing communities to choose the method that best serves their needs rather than imposing a single standard.
- Granular Demographic Analysis: Examine performance across intersectional subgroups and relevant acoustic properties, not just broad demographic categories, to uncover disparities that affect specific populations.
- Comprehensive Error Metrics: Report multiple evaluation metrics beyond Word Error Rate, including measures that detect hallucinations and capture semantic accuracy, providing a more complete picture of system performance.
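The three recommendations above can be sketched as one small reporting loop: score every transcript under several text standardization schemes, group the results by speaker subgroup, and report more than one error measure. Everything here is an assumption made for illustration: the normalizer definitions, the filler list, the sample sentences, and the use of insertion rate as a crude stand-in for a hallucination measure. None of it is the study's actual pipeline or metric set.

```python
import re
from collections import defaultdict

FILLERS = {"um", "uh", "er"}  # illustrative filler list (assumption)


def keep_verbatim(text):
    return text


def lowercase_no_punct(text):
    # Lowercase and strip punctuation, keeping apostrophes inside words.
    return re.sub(r"[^\w\s']", "", text.lower())


def drop_fillers(text):
    words = lowercase_no_punct(text).split()
    return " ".join(w for w in words if w not in FILLERS)


NORMALIZERS = {
    "verbatim": keep_verbatim,
    "lowercase_no_punct": lowercase_no_punct,
    "drop_fillers": drop_fillers,
}


def edit_counts(ref_words, hyp_words):
    """Return (substitutions, deletions, insertions) for one minimum-cost alignment."""
    # Each cell holds (total_cost, subs, dels, ins); min() compares total_cost first.
    prev = [(j, 0, 0, j) for j in range(len(hyp_words) + 1)]
    for i, r in enumerate(ref_words, start=1):
        curr = [(i, 0, i, 0)]
        for j, h in enumerate(hyp_words, start=1):
            diag, up, left = prev[j - 1], prev[j], curr[j - 1]
            match = diag if r == h else (diag[0] + 1, diag[1] + 1, diag[2], diag[3])
            deletion = (up[0] + 1, up[1], up[2] + 1, up[3])
            insertion = (left[0] + 1, left[1], left[2], left[3] + 1)
            curr.append(min(match, deletion, insertion))
        prev = curr
    _, subs, dels, ins = prev[-1]
    return subs, dels, ins


def audit(samples):
    """Aggregate WER and insertion rate per (group, normalizer) pair."""
    totals = defaultdict(lambda: [0, 0, 0])  # errors, insertions, reference words
    for s in samples:
        for name, normalize in NORMALIZERS.items():
            ref = normalize(s["reference"]).split()
            hyp = normalize(s["hypothesis"]).split()
            subs, dels, ins = edit_counts(ref, hyp)
            t = totals[(s["group"], name)]
            t[0] += subs + dels + ins
            t[1] += ins
            t[2] += len(ref)
    return {
        key: {"wer": errors / n, "insertion_rate": inserted / n}
        for key, (errors, inserted, n) in totals.items()
        if n
    }


# Invented toy data; real audits would use many recordings per subgroup.
samples = [
    {"group": "aphasia",
     "reference": "Um, the boy... the boy kicked the ball.",
     "hypothesis": "the boy kicked the ball thank you"},
    {"group": "control",
     "reference": "The boy kicked the ball.",
     "hypothesis": "The boy kicked the ball."},
]

report = audit(samples)
for (group, scheme), metrics in sorted(report.items()):
    print(group, scheme, metrics)
```

On this toy input the same transcript scores differently depending on whether fillers are removed before scoring, which is the kind of sensitivity a single fixed standardization would hide. A real audit would replace the `group` label with finer subgroups and acoustic properties, and would likely add a semantic similarity measure alongside the edit-based ones.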
The study used data from AphasiaBank, a publicly available repository containing audio recordings and transcriptions from 551 people with aphasia and 347 control participants performing narrative tasks like story retelling and picture description. This real-world speech data proved essential for identifying performance gaps that standard benchmarks miss.
Why Does Aphasia Matter for Speech Recognition Testing?
Aphasia is a language disorder caused by brain damage that affects communication abilities. It impacts roughly 0.6% of the population, or about 2 million Americans, with most cases occurring in people over 50 years old. People with aphasia face unique challenges with speech recognition systems because they are disadvantaged in two intersecting ways.
First, people with aphasia are severely underrepresented in the speech datasets used to train ASR systems, meaning the technology was not built with their speech patterns in mind. Second, they disproportionately rely on voice-based interfaces because aphasia frequently impairs writing and typing due to its broader effects on language production. For many people with aphasia, speech recognition tools are not a convenience but an essential accessibility feature for communication and documentation.
When these systems fail, the consequences are amplified. A person with aphasia may have fewer alternative modalities to fall back on than someone without a language disorder. If voice transcription fails, they often cannot simply switch to typing or writing. This makes auditing accuracy for this population not just a fairness issue but an accessibility imperative.
What Do the Study Results Show About Current Systems?
The research found consistently higher (worse) Word Error Rates for speakers with aphasia than for control speakers across all six tested systems. This means that the services from OpenAI (Whisper), Google, Amazon, Microsoft, AssemblyAI, and Rev AI all showed measurable performance gaps when transcribing speech from people with aphasia compared to people without the disorder.
The study demonstrates that while standard auditing procedures do reveal these disparities, they treat people with aphasia as a monolithic group and rely solely on Word Error Rate, potentially masking nuanced performance differences that could inform technical improvements. When researchers applied the proposed holistic auditing framework, examining different text standardization methods, intersectional subgroups, and multiple error metrics, they uncovered performance patterns that the standard approach had overlooked.
The researchers call on ASR practitioners to implement these robust, community-driven auditing practices better suited for the rapidly changing speech recognition landscape. As automatic speech recognition becomes increasingly integrated into healthcare, legal proceedings, employment, and daily communication, ensuring equitable performance across all user populations is not optional but essential.
This work highlights a broader challenge in AI development: systems trained on majority populations often fail marginalized groups, and standard evaluation methods can inadvertently hide these failures. By centering the needs and experiences of people with aphasia, the research provides a roadmap for more equitable auditing that could benefit other marginalized communities relying on speech recognition technology.