Benchmarking at a glance
1. Define your evaluation goal
Before comparing providers and models, define which aspects of performance matter most for your use case. Here are examples of performance aspects that carry more weight in specific domain applications of speech-to-text:
- Accuracy on noisy backgrounds: for contact centers, telephony, and field recordings.
- Speaker diarization quality: for meeting assistants and multi-speaker calls.
- Named entity accuracy: for workflows that extract people, organizations, phone numbers, or addresses.
- Domain-specific vocabulary handling: for medical, legal, or financial transcription.
- Timestamp accuracy: for media workflows that need readable, well-timed captions.
- Filler-word handling: for agentic workflows.
2. Choose the right dataset
The right dataset depends on the use case and traffic shape you want to measure: you wouldn't benchmark call-center audio with clean podcast recordings. Pick audio that matches your real traffic along these axes:
- Language: target language(s), accents, code-switching frequency.
- Audio quality: noisy field recordings, telephony, studio, or browser microphone.
- Topics and domain: medical, financial, operational, legal, etc.
- Typical words that matter: numbers, proper nouns, acronyms, domain-specific terms.
- Interaction pattern: single-speaker dictation, dialogue, multi-speaker meetings, or long-form recordings.
Where possible, combine two kinds of datasets:
- Public datasets, for comparability and immediate availability.
- Private in-domain datasets, when available, to ensure no data is “spoiled” by speech-to-text providers having trained their models on the very datasets you’re benchmarking.
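One practical way to keep these axes actionable is to tag every sample with them in an evaluation manifest, so results can be sliced later. A minimal sketch (the field names and paths here are hypothetical, not a prescribed format):

```python
# Hypothetical evaluation manifest: one entry per audio sample, tagged
# along the axes above so results can be sliced during analysis.
samples = [
    {"path": "calls/fr_noisy_001.wav", "language": "fr",
     "audio_quality": "telephony", "domain": "contact_center",
     "interaction": "dialogue"},
    {"path": "meetings/en_clean_014.wav", "language": "en",
     "audio_quality": "studio", "domain": "operational",
     "interaction": "multi_speaker"},
]

def slice_by(samples, **criteria):
    """Return the samples matching every key/value pair in criteria."""
    return [s for s in samples if all(s.get(k) == v for k, v in criteria.items())]

telephony = slice_by(samples, audio_quality="telephony")
```

Tagging at ingestion time is much cheaper than re-labeling the dataset once you discover a provider underperforms on one slice.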
3. Normalize transcripts before computing WER
Normalization removes surface-form differences (casing, abbreviations, numeric rendering) so you compare apples to apples when judging transcription output.

| Reference | Prediction | Why raw WER is wrong |
|---|---|---|
| It's $50 | it is fifty dollars | Contraction and currency formatting differ, but the semantic content is the same. |
| Meet at Point 14 | meet at point fourteen | Normalization should preserve the numbered entity instead of collapsing it into an unrelated form. |
| Mr. Smith joined at 3:00 PM | mister smith joined at 3 pm | Honorific and timestamp formatting differ, but the transcript content is equivalent. |
A common option is whisper-normalizer, which does not affect numbers and applies aggressive lowercasing and punctuation stripping.
Gladia’s recommended approach is gladia-normalization, our open-source library designed for transcript evaluation:
- It's $50 → it is 50 dollars
- Meet at Point 14 → meet at point 14
- Mr. Smith joined at 3:00 PM → mister smith joined at 3 pm
gladia-normalization: an open-source transcript normalization library used before WER computation.
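To make the idea concrete, here is a minimal normalization sketch. It is an illustrative assumption, not the gladia-normalization implementation: it only lowercases, expands a couple of contractions and honorifics, strips punctuation, and collapses whitespace, whereas real pipelines also handle numbers, currency, and timestamps.

```python
import re

# Toy replacement table; a real library covers far more contractions,
# honorifics, and numeric forms.
REPLACEMENTS = {"it's": "it is", "mr.": "mister"}

def normalize(text: str) -> str:
    text = text.lower()
    for src, dst in REPLACEMENTS.items():
        text = text.replace(src, dst)
    text = re.sub(r"[.,:;!?]", "", text)      # drop punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace
```

The key property is determinism: the exact same function must run on both the reference and every provider's prediction, otherwise the comparison is biased.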
4. Compute WER correctly
Word Error Rate measures the edit distance between a reference transcript and a predicted transcript at the word level. The standard formula is:

WER = (S + D + I) / N

where:
- S = substitutions
- D = deletions
- I = insertions
- N = number of words in the reference transcript

To compute WER fairly:
- Prepare a reference transcript for each audio sample.
- Run each provider on the exact same audio.
- Normalize both the reference and each prediction with the same pipeline.
- Compute WER on the normalized outputs.
- Aggregate results across the full dataset.
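The formula above can be computed with a small stdlib-only sketch (in practice a library such as jiwer is the usual choice); it runs a word-level Levenshtein distance and divides by the reference length:

```python
def wer(reference: str, prediction: str) -> float:
    """Word-level edit distance divided by reference length.
    Both inputs are assumed to be already normalized."""
    ref, hyp = reference.split(), prediction.split()
    # Dynamic-programming Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1           # substitution
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)
```

For example, `wer("it is fifty dollars", "it is 50 dollars")` is 0.25: one substitution over four reference words.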
5. Interpret results carefully
Do not stop at a single WER number. Review:
- overall average WER
- median WER and spread across files
- breakdowns by language, domain, or audio condition
- failure modes on proper nouns, acronyms, and numbers
- whether differences are consistent or concentrated in a few hard samples
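These aggregations are straightforward once per-file WER scores carry metadata. A sketch with hypothetical results (file names and slice fields are illustrative):

```python
import statistics

# Hypothetical per-file results: WER plus metadata for slicing.
results = [
    {"file": "a.wav", "wer": 0.08, "language": "en", "quality": "studio"},
    {"file": "b.wav", "wer": 0.21, "language": "en", "quality": "telephony"},
    {"file": "c.wav", "wer": 0.55, "language": "fr", "quality": "telephony"},
]

wers = [r["wer"] for r in results]
print(f"mean={statistics.mean(wers):.3f} median={statistics.median(wers):.3f}")

# Breakdown by audio condition, to spot failures concentrated in one slice.
for quality in sorted({r["quality"] for r in results}):
    subset = [r["wer"] for r in results if r["quality"] == quality]
    print(quality, round(statistics.mean(subset), 3))
```

Here the mean (0.28) hides that the error is concentrated in telephony audio, which is exactly what the slice breakdown surfaces.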
Common pitfalls
- Comparing providers on different datasets
- Using low-quality or inconsistent ground truth
- Treating punctuation and formatting differences as recognition errors
- Drawing conclusions from too few samples
- Reporting one average score without any slice analysis
- Not inspecting the reference transcript: if it contains text not present in the audio, for example an intro like “this audio is a recording of…”, it will inflate WER across all providers
- Not experimenting with provider configurations: for example, using Gladia’s custom vocabulary to improve proper noun accuracy, then comparing against the ground truth