Question 1

How accurate is WhatsApp voice note transcription?

Accepted Answer

For clear speech in a quiet environment, OpenAI Whisper achieves around 95% word-level accuracy on WhatsApp .opus voice notes (roughly 5% Word Error Rate, or WER). Ordinary background noise such as a café or street lowers accuracy to 90–94%, and heavy noise (wind, crowds, construction) reduces it further to 80–90%. The exact figure depends on language, speaker clarity, and how aggressively WhatsApp compressed the original audio.
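Word Error Rate is the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference text, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not ThreadRecap's evaluation code; the function name and the lowercase, whitespace-split normalisation are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra word in transcript
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Word-level accuracy as quoted in this FAQ is then approximately 1 minus the WER. Real evaluations usually also strip punctuation and normalise numbers before comparing.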

Question 2

What audio format does WhatsApp use for voice notes?

Accepted Answer

WhatsApp records voice notes as Opus-encoded audio inside an OGG container, exported as .opus files. The codec runs at roughly 16 kbps mono at a 16 kHz sample rate, optimised for speech rather than music. Older iOS exports occasionally use .m4a (AAC). Both formats are read directly by ThreadRecap from the export .zip.
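As a rough illustration of reading voice notes straight from an export archive, the sketch below lists the audio files inside a .zip by extension. It is an assumption-laden example rather than ThreadRecap's actual loader: the function name is hypothetical, and it relies only on the two extensions mentioned above.

```python
import zipfile
from pathlib import PurePosixPath

# Extensions named in the answer above; every other file is ignored.
VOICE_NOTE_EXTENSIONS = {".opus", ".m4a"}


def list_voice_notes(export_zip) -> list[str]:
    """Return the archive paths of voice-note audio files, sorted by name."""
    with zipfile.ZipFile(export_zip) as archive:
        return sorted(
            name
            for name in archive.namelist()
            if PurePosixPath(name).suffix.lower() in VOICE_NOTE_EXTENSIONS
        )
```

The function accepts either a path or a file-like object, because `zipfile.ZipFile` does. Matching on the lowercased suffix keeps files with upper-case extensions from being skipped.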

Question 3

Which speech-to-text model does ThreadRecap use?

Accepted Answer

ThreadRecap transcribes WhatsApp voice notes with OpenAI's Whisper, first published in 2022 and updated through the large-v3 generation. Whisper was originally trained on 680,000 hours of multilingual audio and supports 99 languages, with usable quality on roughly 50 of them.

Question 4

Which languages get the best transcription accuracy?

Accepted Answer

Whisper performs best on the languages most represented in its training data. English, Spanish, Portuguese, French, German, Italian, Dutch, Polish, Russian, Mandarin, Japanese, and Arabic typically land between 4% and 12% Word Error Rate on clean audio. Lower-resource languages and strong regional dialects can climb to 15–25% WER, which is still useful for summarisation but less reliable for direct quotation.

Question 5

Why does the transcription get names and proper nouns wrong?

Accepted Answer

Speech-to-text models predict the most statistically likely word given context, and uncommon names rarely appear in training data. Whisper will often substitute a phonetic neighbour (for example "Schwarzschild" becoming "short shield"). Sentence-level meaning is usually preserved, but proper nouns, brand names, and numeric identifiers should be spot-checked against the original audio.

Question 6

Does voice note length affect transcription accuracy?

Accepted Answer

Length itself is not a meaningful accuracy factor. Whisper processes audio in 30-second windows with overlap, so a 5-minute clip is just ten or so windows stitched together. Quality degrades with noise or a speaker change inside a window, not with total duration. Very short clips (under 3 seconds) can be less accurate because Whisper has limited context to disambiguate homophones.
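The window arithmetic can be sketched as follows. This is a simplified illustration that assumes back-to-back windows with no overlap (with overlap the count would be slightly higher); the function name is hypothetical.

```python
import math

WINDOW_SECONDS = 30.0  # window length stated in the answer above


def window_bounds(duration_seconds: float) -> list[tuple[float, float]]:
    """Split a clip into consecutive 30-second windows; the last may be shorter."""
    if duration_seconds <= 0:
        return []
    count = math.ceil(duration_seconds / WINDOW_SECONDS)
    return [
        (i * WINDOW_SECONDS, min((i + 1) * WINDOW_SECONDS, duration_seconds))
        for i in range(count)
    ]
```

A 300-second clip yields ten windows under this assumption, while a 2.5-second clip yields a single short window, which is the low-context case described above.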

Question 7

Can Whisper separate multiple speakers in a WhatsApp voice note?

Accepted Answer

No. Whisper produces a single transcript without speaker labels. WhatsApp voice notes are usually one-person recordings, so this rarely matters. For the occasional multi-voice clip (a recorded meeting, hands-free dictation), the speech of all speakers is concatenated into one transcript and the reader has to infer speaker turns from context.

Question 8

How does background noise change Whisper's behaviour?

Accepted Answer

Constant background noise (engine hum, air conditioning) is filtered surprisingly well. Intermittent noise (sirens, doors, dogs) and overlapping speech are the harder cases, where Whisper either drops words or hallucinates short phrases that fill the silence. Long silences are the most common hallucination trigger and are handled inside ThreadRecap by voice activity detection before transcription.
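To make the idea of voice activity detection concrete, here is a deliberately simple energy-based sketch: it keeps only the spans whose loudness (RMS) exceeds a threshold, so long silences never reach the transcriber. This is a stand-in technique for illustration, not the detector ThreadRecap uses; the function name, frame length, and threshold are all assumed values.

```python
import math


def speech_spans(samples, sample_rate=16000, frame_ms=30, threshold=0.02):
    """Return (start_s, end_s) spans whose RMS energy reaches the threshold.

    samples: sequence of floats in [-1.0, 1.0].
    threshold: illustrative value; real recordings need tuning.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    spans = []  # [start_sample, end_sample] pairs
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms < threshold:
            continue  # treat this frame as silence
        end = start + len(frame)
        if spans and spans[-1][1] == start:
            spans[-1][1] = end  # extend the span that ended at this frame
        else:
            spans.append([start, end])
    return [(a / sample_rate, b / sample_rate) for a, b in spans]
```

An energy threshold cannot tell speech from loud non-speech noise such as a siren, which is why production systems typically use a trained VAD model instead.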

Question 9

How does ThreadRecap handle voice notes inside a chat summary?

Accepted Answer

After transcription, each voice note is inserted into the conversation timeline at its original timestamp, attributed to the original sender, and tagged as audio. The downstream outputs (summary, decisions, action items, and open questions) treat voice content identically to typed messages, so a decision made in audio is captured the same as one written in text.
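The timeline step can be pictured as a merge of two message lists by timestamp. The sketch below is a hypothetical data model for illustration only; the `Message` fields and the `merge_timeline` name are assumptions, not ThreadRecap's internal schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Message:
    timestamp: datetime
    sender: str
    text: str
    kind: str = "text"  # "text" for typed messages, "audio" for transcripts


def merge_timeline(typed: list[Message], transcripts: list[Message]) -> list[Message]:
    """Interleave typed messages and transcribed voice notes by timestamp."""
    # sorted() is stable, so messages sharing a timestamp keep their input order.
    return sorted(typed + transcripts, key=lambda m: m.timestamp)
```

Because both kinds of message share one type, any downstream step that reads `text` sees spoken and typed content the same way, while `kind` still records where each entry came from.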

Question 10

What happens with code-switching or mixed-language voice notes?

Accepted Answer

By default, Whisper detects the language from the opening 30 seconds of audio, so a clip that switches languages mid-sentence (English to Spanish, Portuguese to English) usually transcribes the dominant language correctly and may stumble at the switch point. ThreadRecap sets the language hint from the chat's primary locale, which improves accuracy when the chat is mostly one language with occasional foreign phrases.

Question 11

How can I improve transcription accuracy before sending a voice note?

Accepted Answer

Hold the phone close to your mouth, speak at a moderate pace, avoid pacing or moving the device, state names and numbers slowly, and record in the quietest space available. Voice notes recorded indoors with the phone 10–20 cm from the mouth typically land in the 95%+ accuracy range. Outdoor or driving recordings should be assumed to be 5–10 percentage points lower.

Question 12

Is a 93% accurate transcription actually useful?

Accepted Answer

Yes. A 93% accuracy rate means roughly 7 words out of every 100 are wrong, but those errors are typically minor (verb tense, filler words, proper nouns). Decisions, deadlines, owners, and action items (the parts a recap actually cares about) survive intact in almost every clip we have processed. The alternative, ignoring voice notes entirely, can lose 30–50% of a conversation's content.
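The "7 words out of every 100" figure is simple arithmetic, sketched below for transcripts of any length. The function name is hypothetical, and treating the error count as word count times (1 minus accuracy) is a simplification of how WER is measured.

```python
def expected_word_errors(word_count: int, accuracy: float) -> int:
    """Estimate how many words are wrong in a transcript of the given length."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return round(word_count * (1.0 - accuracy))
```

For a 100-word voice note at 93% accuracy this gives 7 expected errors, matching the figure above.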

Condition	Typical WER	Word-level accuracy
Clear speech, quiet room, native speaker	4–6%	~95%
Café, street, indoor with HVAC	6–10%	90–94%
Outdoor wind, crowd, construction	10–20%	80–90%
Speaker overlap, talking over each other	15–30%	70–85%
Heavy regional dialect, mumbled speech	12–25%	75–88%


WhatsApp Voice Note Transcription Accuracy (2026)
