Benchmarking speech-to-text systems is easy to get wrong. Small methodology changes can produce large swings in reported quality, which makes comparisons misleading.

Benchmarking at a glance

  1. Define your goal: decide what “good” means for your product before comparing systems.
  2. Choose the right dataset: benchmark on audio that matches your real traffic and target users.
  3. Normalize transcripts: normalize both references and hypothesis outputs before computing WER.
  4. Compute WER: measure substitutions, deletions, and insertions on normalized text.
  5. Interpret results carefully: look beyond one average score and inspect meaningful slices.

1. Define your evaluation goal

Before comparing providers and models, define which aspects of performance matter most for your use case. The examples below show which aspects typically carry the most weight in common speech-to-text applications:
  • Accuracy on noisy backgrounds: for contact centers, telephony, and field recordings.
  • Speaker diarization quality: for meeting assistants and multi-speaker calls.
  • Named entity accuracy: for workflows that extract people, organizations, phone numbers, or addresses.
  • Domain-specific vocabulary handling: for medical, legal, or financial transcription.
  • Timestamp accuracy: for media workflows that need readable, well-timed captions.
  • Filler-word handling: for agentic workflows.
Those choices shape every downstream decision: which dataset to use, which normalization rules to apply, and which metrics to report. If your benchmark does not reflect your real traffic, the result will not tell you much about production performance.

2. Choose the right dataset

The right dataset depends on the use case and traffic shape you want to measure. Benchmarking a call-center product on clean podcast recordings, for example, tells you little about how it will behave in production. Pick audio that matches your real traffic along these axes:
  • Language: target language(s), accents, code-switching frequency.
  • Audio quality: noisy field recordings, telephony, studio, or browser microphone.
  • Topics and domain: medical, financial, operational, legal, etc.
  • Typical words that matter: numbers, proper nouns, acronyms, domain-specific terms.
  • Interaction pattern: single-speaker dictation, dialogue, multi-speaker meetings, or long-form recordings.
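One way to keep these axes visible is to tag every benchmark sample with them and check how the dataset is distributed. The sketch below is illustrative only: the `Sample` fields and the `share_by` helper are hypothetical names, not part of any provider's tooling.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One benchmark item, tagged along the axes listed above (hypothetical schema)."""
    audio_path: str
    language: str
    audio_quality: str
    domain: str
    interaction: str


def share_by(samples, axis):
    """Return the fraction of samples for each value of the given axis."""
    counts = Counter(getattr(sample, axis) for sample in samples)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


samples = [
    Sample("a.wav", "en", "telephony", "financial", "dialogue"),
    Sample("b.wav", "en", "studio", "financial", "dictation"),
]
print(share_by(samples, "audio_quality"))
```

Comparing these shares with the mix you see in production is a quick sanity check that the benchmark is not dominated by audio your users rarely send.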
Use transcripts that are strong enough to serve as ground truth, and prefer a mix of:
  • public datasets (for comparability and immediate availability)
  • private in-domain datasets, when available, to rule out contamination: some speech-to-text providers may have trained their models on the very public datasets you are benchmarking on.
Your favorite LLM with internet access can be very effective at finding public datasets that match your use case.

3. Normalize transcripts before computing WER

Normalization removes surface-form differences (casing, abbreviations, numeric rendering) so you compare apples to apples when judging transcription output.
| Reference | Prediction | Why raw WER is wrong |
| --- | --- | --- |
| It's $50 | it is fifty dollars | Contraction and currency formatting differ, but the semantic content is the same. |
| Meet at Point 14 | meet at point fourteen | The normalization should preserve the numbered entity instead of collapsing it into an unrelated form. |
| Mr. Smith joined at 3:00 PM | mister smith joined at 3 pm | Honorific and timestamp formatting differ, but the transcript content is equivalent. |
A common baseline is “Whisper-style normalization” (OpenAI, 2022), implemented in packages like whisper-normalizer. Its limitation is that it does not affect numbers and applies aggressive lowercasing and punctuation stripping. Gladia’s recommended approach is gladia-normalization, our open-source library designed for transcript evaluation:
  • It's $50 -> it is 50 dollars
  • Meet at Point 14 -> meet at point 14
  • Mr. Smith joined at 3:00 PM -> mister smith joined at 3 pm

gladia-normalization

Open-source transcript normalization library used before WER computation.
from normalization import load_pipeline

# Load one pipeline and reuse it for both sides of the comparison.
pipeline = load_pipeline("gladia-3", language="en")

reference = "Meet at Point 14. It's $50 at 3:00 PM."
prediction = "meet at point fourteen it is fifty dollars at 3 pm"

# Normalize the reference and the prediction with the same rules before scoring.
normalized_reference = pipeline.normalize(reference)
normalized_prediction = pipeline.normalize(prediction)
Always apply the same normalization pipeline to both the reference transcript and every hypothesis output you compare. Changing the normalization rules between vendors — or forgetting to normalize one side — invalidates the benchmark.
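To see why both sides must go through the same rules, here is a deliberately tiny stand-in normalizer. It is not gladia-normalization and covers only a couple of cases; `toy_normalize` and its replacement table are assumptions made for illustration.

```python
import re

# Toy rules for illustration only; a real benchmark should use a full normalization library.
REPLACEMENTS = {"it's": "it is", "mr.": "mister"}


def toy_normalize(text: str) -> str:
    """Lowercase, expand a few known forms, simplify ':00' times, strip punctuation."""
    text = text.lower()
    for source, target in REPLACEMENTS.items():
        text = text.replace(source, target)
    text = re.sub(r"(\d+):00\b", r"\1", text)  # "3:00" -> "3"
    text = re.sub(r"[^\w\s]", " ", text)       # remaining punctuation becomes whitespace
    return " ".join(text.split())


reference = "Mr. Smith joined at 3:00 PM"
prediction = "mister smith joined at 3 pm"

# Raw word sequences differ even though the spoken content is the same.
print(reference.split() == prediction.split())
# After applying the same function to both sides, they match.
print(toy_normalize(reference) == toy_normalize(prediction))
```

If only the reference were normalized, or each provider's output went through different rules, the remaining mismatches would be counted as recognition errors.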

4. Compute WER correctly

Word Error Rate measures the edit distance between a reference transcript and a predicted transcript at the word level. The standard formula is:
WER = (S + D + I) / N
Where:
  • S = substitutions
  • D = deletions
  • I = insertions
  • N = number of words in the reference transcript
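As a minimal sketch of the formula, S + D + I can be obtained as the word-level edit (Levenshtein) distance between the two transcripts. The function names `word_errors` and `wer` are illustrative, and the code assumes whitespace tokenization of text that has already been normalized.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Minimum number of substitutions, deletions, and insertions (S + D + I)."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N, where N is the number of reference words."""
    n = len(reference.split())
    if n == 0:
        raise ValueError("reference transcript is empty")
    return word_errors(reference, hypothesis) / n


print(wer("meet at point 14", "meet at point fourteen"))  # 0.25
```

Because N counts reference words only, a hypothesis with many insertions can produce a WER above 1.0.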
Lower is better. In practice:
  1. Prepare a reference transcript for each audio sample.
  2. Run each provider on the exact same audio.
  3. Normalize both the reference and each prediction with the same pipeline.
  4. Compute WER on the normalized outputs.
  5. Aggregate results across the full dataset.
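The aggregation step can be done in more than one way, and the choice changes the headline number. The sketch below assumes you already have per-file `(errors, reference_words)` pairs; `corpus_wer` and `mean_file_wer` are illustrative names.

```python
from statistics import mean


def corpus_wer(counts):
    """Pool errors and reference words across all files, then divide once."""
    total_errors = sum(errors for errors, _ in counts)
    total_words = sum(words for _, words in counts)
    return total_errors / total_words


def mean_file_wer(counts):
    """Average the per-file WERs, giving every file equal weight."""
    return mean(errors / words for errors, words in counts)


# (errors, reference words) per file: one short, error-prone clip and one long, clean one.
counts = [(1, 4), (10, 100)]
print(corpus_wer(counts))
print(mean_file_wer(counts))
```

Pooled WER weights long files more heavily, while the per-file mean lets short clips move the score. Whichever you pick, use the same one for every provider and state it alongside the results.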
Do not compute WER on raw transcripts if providers format numbers, punctuation, abbreviations, or casing differently. That mostly measures formatting conventions, not recognition quality.
Inspect your reference transcripts carefully before computing WER. If a reference contains text that is not actually present in the audio, for example an intro such as “this audio is a recording of…”, it can make WER look much worse across all providers.

5. Interpret results carefully

Do not stop at a single WER number. Review:
  • overall average WER
  • median WER and spread across files
  • breakdowns by language, domain, or audio condition
  • failure modes on proper nouns, acronyms, and numbers
  • whether differences are consistent or concentrated in a few hard samples
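A per-slice summary covers several of these checks at once. This is a minimal sketch that assumes each file's WER has already been computed and tagged with a slice label; the `summarize` helper and the row layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median


def summarize(rows):
    """Group per-file WERs by slice and report count, mean, median, and worst file."""
    by_slice = defaultdict(list)
    for row in rows:
        by_slice[row["slice"]].append(row["wer"])
    return {
        name: {
            "n": len(values),
            "mean": mean(values),
            "median": median(values),
            "max": max(values),
        }
        for name, values in by_slice.items()
    }


rows = [
    {"slice": "telephony", "wer": 0.30},
    {"slice": "telephony", "wer": 0.10},
    {"slice": "studio", "wer": 0.05},
]
for name, stats in summarize(rows).items():
    print(name, stats)
```

A large gap between mean and median, or a high maximum, suggests that a few hard files are driving the average.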
Two systems can post similar average WER while failing on different error classes. Separate statistically meaningful gaps from noise introduced by dataset composition or normalization choices. If two systems are close, inspect actual transcript examples before drawing strong conclusions.
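One way to judge whether a gap is more than noise is a paired bootstrap over files. The sketch below is one possible approach under stated assumptions, not a prescribed method: both systems were scored on the same files in the same order, each file has a non-empty reference, and `bootstrap_wer_gap` is an illustrative name.

```python
import random


def bootstrap_wer_gap(counts_a, counts_b, n_resamples=1000, seed=0):
    """Approximate 95% interval for pooled WER(A) - WER(B), resampling files with replacement.

    counts_a and counts_b are per-file (errors, reference_words) pairs in the same file order.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("both systems must be scored on the same files")
    rng = random.Random(seed)
    n_files = len(counts_a)
    gaps = []
    for _ in range(n_resamples):
        picked = [rng.randrange(n_files) for _ in range(n_files)]
        words = sum(counts_a[i][1] for i in picked)
        errors_a = sum(counts_a[i][0] for i in picked)
        errors_b = sum(counts_b[i][0] for i in picked)
        gaps.append((errors_a - errors_b) / words)
    gaps.sort()
    low = gaps[n_resamples * 25 // 1000]        # 2.5th percentile
    high = gaps[n_resamples * 975 // 1000 - 1]  # 97.5th percentile
    return low, high


system_a = [(2, 10), (0, 12), (5, 20), (1, 8)]
system_b = [(1, 10), (1, 12), (4, 20), (1, 8)]
print(bootstrap_wer_gap(system_a, system_b))
```

If the interval includes zero, the data does not clearly separate the two systems, and reading individual transcripts is likely to be more informative than the averages.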

Common pitfalls

  • Comparing providers on different datasets
  • Using low-quality or inconsistent ground truth
  • Treating punctuation and formatting differences as recognition errors
  • Drawing conclusions from too few samples
  • Reporting one average score without any slice analysis
  • Not inspecting the reference transcript: if it contains text not present in the audio, for example an intro like “this audio is a recording of…”, it will inflate WER across all providers
  • Not experimenting with provider configurations: for example, try Gladia’s custom vocabulary to improve proper noun accuracy, then score the new output against the ground truth