Voice notes carry the real content of most modern WhatsApp conversations. The 2-minute clip explaining a decision, the rapid-fire daily standup, the parent group's pickup logistics, all of it lives in audio. If transcription is wrong, the recap is wrong, and the most important part of the conversation gets mangled.
This page is a working reference for what to expect from WhatsApp voice note transcription, what moves the accuracy numbers, and how ThreadRecap handles the awkward cases.
How WhatsApp encodes voice notes
WhatsApp records voice notes with the Opus audio codec inside an OGG container. The exported file extension is `.opus`, occasionally `.m4a` on older iOS exports (AAC inside an MP4 container). The Opus encoder runs in voice-over-IP mode at roughly 16 kbps, mono, 16 kHz sample rate, tuned for intelligibility rather than musical fidelity.
Two consequences matter for transcription:
Compression artefacts are aggressive. Opus at 16 kbps is good enough to understand speech but it strips most of the harmonic detail above 8 kHz. Sibilants ("s", "sh", "f") and unvoiced stops ("p", "t", "k") are the first casualties when bandwidth drops further on a poor connection.
Sample rate is fixed at 16 kHz. Whisper works on 16 kHz audio internally, so there is no resampling penalty. There is also no audio above the 8 kHz Nyquist limit to recover, which sets a hard ceiling on what any speech-to-text model can hear.
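The arithmetic behind both points is short. The sketch below takes the 16 kbps and 16 kHz figures quoted above as given. The function names are illustrative and not part of any library.

```python
def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency a signal sampled at this rate can represent."""
    return sample_rate_hz / 2


def approx_size_bytes(bitrate_bps: int, duration_s: float) -> float:
    """Rough audio payload size, ignoring OGG container overhead."""
    return bitrate_bps * duration_s / 8


print(nyquist_hz(16_000))              # 8000.0
print(approx_size_bytes(16_000, 120))  # 240000.0
```

A 2-minute voice note is therefore roughly 240 KB of audio, and nothing above 8 kHz is present in it.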
ThreadRecap reads the `.opus` files directly from the export `.zip`, decodes them, runs voice activity detection to strip silence, and feeds the audio into Whisper. No intermediate format conversion is involved. The same pipeline is available as a standalone .opus to readable text tool if you only need the audio transcribed.
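The first step of that pipeline, finding the audio attachments inside an export archive, can be sketched with the Python standard library. This is an illustration and not ThreadRecap's actual code. The function name and the file names in the demo archive are assumptions made for the example.

```python
import io
import zipfile

AUDIO_EXTENSIONS = (".opus", ".m4a")


def list_voice_notes(export_zip) -> list[str]:
    """Return the names of audio attachments inside a chat export archive."""
    with zipfile.ZipFile(export_zip) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.lower().endswith(AUDIO_EXTENSIONS)
        )


# Build a tiny in-memory archive so the sketch runs without a real export.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("_chat.txt", "placeholder chat log")
    archive.writestr("PTT-20240105-WA0001.opus", b"")
    archive.writestr("IMG-20240105-WA0002.jpg", b"")

print(list_voice_notes(buffer))  # ['PTT-20240105-WA0001.opus']
```

Decoding and transcription would follow for each returned name. Those steps depend on an audio decoder and a Whisper runtime, so they are left out here.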
The model: Whisper, what generation, what numbers
ThreadRecap's voice-to-text tool runs on OpenAI's Whisper, originally released in 2022 and updated through the large-v3 generation. The original model was trained on 680,000 hours of multilingual web audio, and later generations on more. Whisper covers 99 languages and produces usable quality on roughly 50 of them.
Whisper does three things internally that you should know about:
30-second windows. The model encodes audio in 30-second windows and stitches the transcripts together. A 4-minute voice note is processed as at least eight windows, not as one continuous stream.
Joint language ID. The first 30 seconds run through a language detection head before transcription. Code-switching that happens later in the clip can confuse the language anchor.
No speaker labels. Whisper outputs a single transcript with no diarisation. WhatsApp voice notes are nearly always single-speaker, so this is rarely an issue in practice.
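The windowing arithmetic can be sketched as follows. This is a simplification that assumes fixed, back-to-back 30-second windows. Whisper's reference implementation advances each window based on the timestamps it predicts, so real boundaries and window counts can differ. `fixed_windows` is a name made up for this example.

```python
import math


def fixed_windows(duration_s: float, window_s: float = 30.0):
    """Split a clip into consecutive windows of at most window_s seconds."""
    duration_s = float(duration_s)
    count = math.ceil(duration_s / window_s)
    return [
        (i * window_s, min((i + 1) * window_s, duration_s))
        for i in range(count)
    ]


windows = fixed_windows(240)
print(len(windows))  # 8
print(windows[-1])   # (210.0, 240.0)
```

Any word that straddles one of these boundaries is a candidate for a stitching error, which is why rushed speech and mid-clip language switches tend to cause trouble there.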
Real-world Word Error Rate (WER) on WhatsApp-style audio:
| Condition | Typical WER | Word-level accuracy |
| --- | --- | --- |
| Clear speech, quiet room, native speaker | 4–6% | ~95% |
| Café, street, indoor with HVAC | 6–10% | 90–94% |
| Outdoor wind, crowd, construction | 10–20% | 80–90% |
| Speaker overlap, talking over each other | 15–30% | 70–85% |
| Heavy regional dialect, mumbled speech | 12–25% | 75–88% |
WER is the number of inserted, deleted, and substituted words divided by the number of words actually spoken, expressed as a percentage. A 5% WER means roughly 5 errors for every 100 words spoken, but the wrong words are usually low-information (verb tense slips, filler words, occasional proper nouns).
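For readers who want to measure WER on their own clips, the standard calculation is a word-level edit distance divided by the reference length. The sketch below is minimal and makes two assumptions: case is ignored and punctuation is left untouched. Evaluation toolkits usually apply more text normalisation than this.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # Classic dynamic-programming edit distance, computed over words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)


reference = "we are shipping friday at ten"
hypothesis = "we are shipping friday at two"
print(round(word_error_rate(reference, hypothesis), 3))  # 0.167
```

One substituted word out of six gives a WER of about 17%. This also shows why short clips produce noisy WER figures.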
Language coverage in practice
Whisper's accuracy follows the distribution of its training data. The languages with the most hours represented also score the best.
Tier 1 (4–8% WER on clean audio): English, Spanish, Portuguese, French, German, Italian, Dutch, Russian, Polish, Mandarin Chinese, Japanese, Korean. These are the languages where ThreadRecap delivers near-human transcription quality on typical WhatsApp voice notes.
Tier 2 (8–15% WER): Arabic, Turkish, Hindi, Thai, Vietnamese, Czech, Hungarian, Swedish, Greek, Hebrew, Indonesian, Catalan. Strong utility for summarisation, but proper nouns and numbers should be spot-checked.
Tier 3 (15–25%+ WER): Less common languages, heavy regional dialects, code-mixed varieties. Still useful for "what was this about" recall, but direct quotation should be verified against the audio.
Brazilian Portuguese, European Portuguese, and Latin American Spanish all sit firmly in Tier 1. Rio carioca, paulistano, gaúcho, and similar regional Brazilian accents transcribe at the same accuracy as standardised broadcast Portuguese in our experience. Strong rural dialects with non-standard vocabulary land closer to the Tier 2 number.
What goes wrong, in order of frequency
1. Proper nouns
Names, brand names, place names, and product names are the most common errors. Whisper substitutes a phonetic neighbour: "Priya" becomes "Pria"; "Schwarzschild building" becomes "short shield building"; "Botafogo" might become "Bota fogo". Sentence meaning survives, spelling does not. Always verify proper nouns before quoting.
2. Numbers and dates
Times and dates are usually right (Whisper has seen enough "twenty-third" and "23rd" patterns to handle both). Phone numbers, prices, and order codes are riskier. A spoken "PIX 1.250 reais" can land as "1,250", "1.250", or "1250" depending on locale convention, which is a formatting issue rather than a content error.
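Because the variation is in formatting only, a downstream comparison can normalise it away. The sketch below assumes whole-number amounts, so "1.250" is read as one thousand two hundred fifty, as in the example above, and not as 1.25. `digits_only` is a hypothetical helper and not part of any library.

```python
import re


def digits_only(amount: str) -> str:
    """Drop thousands separators so differently formatted amounts compare equal.

    Assumes the value has no decimal part; applying this to "1.25" would
    wrongly produce "125".
    """
    return re.sub(r"[.,\s]", "", amount)


variants = ["1,250", "1.250", "1250"]
print({digits_only(v) for v in variants})  # {'1250'}
```

Amounts with decimal parts need locale-aware parsing, which this sketch does not attempt.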
3. Technical jargon
Industry-specific terms outside the training distribution (specialised medical, legal, engineering vocabulary) get phonetic substitutions. Common technical English (API, SDK, frontend, deploy) transcribes correctly because the corpus is dominated by English-language web audio.
4. Code-switching mid-sentence
"So basically, vamos a hacer the deployment tomorrow" is hard. Whisper detects language at the window boundary and tries to commit. Brief switches usually transcribe correctly; sustained switches across a 30-second boundary can produce one window in the wrong language.
5. Hallucinations on silence
Whisper's Achilles heel: long silent passages can trigger fabricated text, often filler phrases like "thanks for watching" carried over from training data. ThreadRecap pre-processes audio with voice activity detection, trimming silence before the model sees the audio, which removes this category of error in practice.
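The idea behind silence trimming can be shown with a naive energy threshold. This is not ThreadRecap's detector, whose internals are not described here. Production voice activity detectors are considerably more robust, and the frame size and threshold below are assumptions chosen for the example.

```python
def trim_silence(samples, frame_size=160, threshold=0.01):
    """Keep only frames whose mean absolute amplitude exceeds threshold.

    samples: floats in [-1.0, 1.0]; frame_size=160 is 10 ms at 16 kHz.
    """
    kept = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        energy = sum(abs(s) for s in frame) / len(frame)
        if energy > threshold:
            kept.extend(frame)
    return kept


speech = [0.2, -0.3] * 80   # one loud 160-sample frame
silence = [0.0] * 480       # three silent frames
trimmed = trim_silence(silence + speech + silence)
print(len(trimmed))  # 160
```

Only the loud frame survives, so the model never sees the long silent stretches that tend to trigger fabricated text.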
A worked example
Here is what the same 35-second voice note looks like under three conditions:
Quiet office, native English speaker:
"Quick update on the launch. We're shipping Friday at 10 AM. Marcus owns the landing copy, Priya is on billing, and I'll handle the Slack announcement. Open question on whether we need a press hold."
WER on this clip: ~3%. The only error was the capitalisation of "Marcus".
Same speaker, walking down a busy street:
"Quick update on the launch. We're shipping Friday at 10 AM. Mark is on the landing copy, Pria is on billing, and I'll handle the slack announcement. Open question on whether we need a press hole."
WER ~9%. Two name substitutions, "Slack" lower-cased, "press hold" misheard as "press hole". Decisions and timeline survived; names need verification.
Same speaker, in a car with windows down:
"Update on launch. Shipping Friday at 10. [unintelligible] is on landing, [unintelligible] on billing, I'll handle the announcement. Question on press."
WER ~22%. Names dropped entirely (Whisper preferred to skip rather than guess), but the decision and the timeline are still recoverable.
How ThreadRecap turns transcripts into a recap
After transcription, each voice note is inserted into the conversation timeline at the exact timestamp where it was sent, attributed to the original sender, and flagged as audio. From there the analysis layer treats voice and text identically.
That means:
A decision spoken in a voice note appears in the Decisions section.
An action item spoken in audio appears in Action Items with the original speaker as the owner.
The Summary synthesises voice and text together rather than treating them as separate streams.
The Notable Quotes output can pull from voice notes, with the timestamp link going back to the original audio.
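The merge step described above amounts to interleaving two message streams by send time while keeping an audio flag. The sketch below is minimal, and the `Message` type, its field names, and the sample data are all hypothetical. ThreadRecap's real data model is not documented here.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Message:
    sent_at: datetime
    sender: str
    text: str
    is_audio: bool = False  # True when the text came from a transcript


def merge_timeline(text_messages, transcribed_notes):
    """Interleave typed messages and transcripts by send time."""
    return sorted(text_messages + transcribed_notes, key=lambda m: m.sent_at)


typed = [
    Message(datetime(2024, 1, 5, 9, 0), "Ana", "Are we still on for Friday?"),
    Message(datetime(2024, 1, 5, 9, 5), "Ana", "Great, thanks."),
]
notes = [
    Message(datetime(2024, 1, 5, 9, 2), "Marcus",
            "Yes, shipping Friday at 10 AM.", is_audio=True),
]

for message in merge_timeline(typed, notes):
    tag = "[audio] " if message.is_audio else ""
    print(f"{message.sent_at:%H:%M} {message.sender}: {tag}{message.text}")
```

Once the streams are merged, anything that extracts decisions or action items can walk a single ordered list and still tell which lines came from audio.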
Without this merge step, an AI tool that "transcribes voice notes" but then summarises only the text content will systematically miss the most substantive parts of the conversation. This is the most common failure mode of general-purpose chat summarisers. When the chat is a recurring work call, the merged transcript can turn WhatsApp threads into meeting minutes directly. For personal chats, the same data can extract relationship insights from a chat instead.
How to improve accuracy before recording
If you regularly send voice notes that will end up in a recap:
Distance. Hold the phone 10–20 cm from your mouth. Closer than that introduces breath and plosive noise; further than that picks up room reverb.
Pace. Moderate pace beats fast or slow. Whisper handles natural conversational speech well; rushed speech compounds errors at 30-second window boundaries.
Environment. Indoors beats outdoors. Stationary beats walking. Silent room beats music or TV in the background.
Names and numbers. State them deliberately, ideally twice if they matter ("invoice number 4-7-2-9, four seven two nine"). The redundancy gives the model a second chance.
One language per clip. If you switch languages, do it across a sentence break, not mid-sentence.
These are not strict requirements. ThreadRecap is built to deal with realistic WhatsApp audio, including kitchen ambient and walking-down-the-street recordings. They are levers if you want to push from "good enough for a summary" toward "verbatim quote".
How to improve accuracy after the fact
Inside ThreadRecap:
Audio playback at message position. Every transcribed voice note has an inline player. Click to verify any specific clip against the transcript.
Spot-check proper nouns first. That is where 70% of meaningful errors live.
Check numbers in commitments. "By Tuesday at 2" and "by Tuesday at 12" are a one-character difference and a meaningful one.
Use the AI follow-up. Asking "where exactly did Marcus agree to the deadline?" returns the exact clip and timestamp, which surfaces transcription issues if the underlying audio actually said something different.
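If you keep transcripts outside ThreadRecap, a rough heuristic can build the spot-check list for you. The sketch below flags tokens that contain digits and words capitalised mid-sentence. It is an illustration under simple assumptions: it misses names at the start of a sentence and over-flags weekdays and acronyms. The function name is made up for the example.

```python
import re


def spot_check_candidates(transcript: str) -> list[str]:
    """Return numbers and mid-sentence capitalised words worth verifying."""
    candidates = []
    for sentence in re.split(r"(?<=[.!?])\s+", transcript):
        for index, word in enumerate(sentence.split()):
            token = word.strip(".,!?\"'")
            if not token:
                continue
            is_number = any(ch.isdigit() for ch in token)
            # Skip the first word (capitalised by convention) and "I", "I'll".
            is_name = (
                index > 0
                and token[0].isupper()
                and token.split("'")[0] != "I"
            )
            if is_number or is_name:
                candidates.append(token)
    return candidates


text = "We're shipping Friday at 10 AM. Billing goes to Priya."
print(spot_check_candidates(text))  # ['Friday', '10', 'AM', 'Priya']
```

Each flagged token is a place to compare the transcript against the original audio before quoting.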
The accuracy tradeoff, stated plainly
No transcription is perfect. Whisper sits comfortably in the same accuracy range as the major commercial alternatives (Google Speech-to-Text, AWS Transcribe, Deepgram) for the languages where they all have strong coverage, and ahead of most of them for low-resource languages.
The honest comparison is not Whisper vs. perfect. It is Whisper vs. ignoring voice notes entirely. Voice notes typically carry 30–50% of a conversation's content. A 93% accurate transcript that captures every decision and every action item, with a handful of misspelled names you can fix in 30 seconds, is dramatically more useful than a recap that skips half the conversation by design.