If your WhatsApp conversation uses voice notes, a normal text summary will be wrong. The chat log shows "audio omitted" where the voice notes used to be, so any tool that summarises the text alone is summarising half a conversation and confidently presenting it as the whole.
The correct workflow is:
Export the chat as a `.zip` with media.
Transcribe every voice note.
Merge transcripts into the chat timeline at the original timestamps.
Run analysis on the combined stream and extract decisions, action items, and open questions.
This page is the working playbook for that workflow at scale, including the parts most guides skip: what `.opus` actually is, why the merge step matters more than the transcription step, and how to keep group chats useful when half the participants only ever send 30-second voice memos.
WhatsApp records voice notes with the Opus audio codec inside an OGG container, exported as `.opus` files. Older iOS exports occasionally use `.m4a` (AAC inside an MP4 container).
Technical specifics:
Codec: Opus in voice-over-IP mode.
Bitrate: roughly 16 kbps.
Channels: mono.
Sample rate: 16 kHz.
Container: OGG (`.opus`) or MP4 (`.m4a`).
Two consequences:
Compression is aggressive. Opus at 16 kbps preserves intelligibility but discards much of the fine spectral detail, and a 16 kHz sample rate carries nothing above 8 kHz at all. Sibilants and unvoiced stops are the first things to degrade on a poor connection.
Sample rate matches Whisper's input rate. No resampling penalty, but no audio above 8 kHz to recover either.
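As a rough worked example of what those numbers imply, the sketch below estimates the audio payload of a voice note from its duration and bitrate, and derives the highest frequency a given sample rate can represent. The 16 kbps and 16 kHz figures come from the list above; the function names are illustrative, and OGG container overhead is ignored.

```python
def estimated_size_kb(duration_s: float, bitrate_kbps: float = 16.0) -> float:
    """Approximate audio payload in kilobytes (1 KB = 1000 bytes here)."""
    bits = duration_s * bitrate_kbps * 1000
    return bits / 8 / 1000


def nyquist_khz(sample_rate_khz: float) -> float:
    """Highest frequency a sampled signal can carry: half the sample rate."""
    return sample_rate_khz / 2


print(estimated_size_kb(30))  # 60.0 -> a 30-second memo is about 60 KB
print(nyquist_khz(16))        # 8.0  -> nothing above 8 kHz survives
```

By this estimate, a few hundred short voice notes add up to tens of megabytes, which is why export size grows quickly in audio-heavy chats.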
If you export with media, the `.zip` includes the audio files alongside `_chat.txt`, and the chat log references each one with an `<attached: ...opus>` line. If you export without media, the audio files are missing entirely and the chat log shows only `audio omitted` text where the voice notes used to be.
Practical takeaway: no media, no audio transcription. Re-export with media if you missed it the first time. If you only need the audio side and not the full chat, the `.opus` to readable text tool handles the same files in isolation.
Step 1: Export the chat with media
iPhone
Open the chat.
Tap the contact or group name at the top.
Scroll to Export Chat.
Choose Attach Media.
Save or share the `.zip`.
Android
Open the chat.
Tap the menu (three dots, top right).
Tap More.
Tap Export chat.
Choose Include media.
Save or share the `.zip`.
Tip: if your export becomes too large (hundreds of megabytes or more), start with a smaller timeframe. Recent month, recent project, recent incident. Uploading three years of media when you only need this week's standups is wasted bandwidth and credits.
Step 2: Verify the export contains voice notes
Inside the `.zip`, you should see:
A chat text file (often `_chat.txt`, sometimes `WhatsApp Chat - <name>.txt`).
Multiple `.opus` or `.m4a` audio files (one per voice note).
Image, video, and other media files if any were sent.
If you do not see `.opus` or `.m4a` files, the export was made without media. Re-export.
If you see them but they are all very small (under 1 KB), the export hit a media-size cap and the audio is corrupted. Re-export with a smaller date range.
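The checks in this step can be scripted. The following is a minimal sketch using Python's standard `zipfile` module; the function name `inspect_export` and the returned tuple shape are assumptions for illustration, and the 1 KB threshold is simply the rule of thumb stated above.

```python
import zipfile
from pathlib import PurePosixPath

AUDIO_EXTS = {".opus", ".m4a"}
MIN_AUDIO_BYTES = 1024  # the "under 1 KB" warning threshold from the text


def inspect_export(zip_source):
    """Return (chat_files, audio_files, suspicious_audio) for an export.

    `zip_source` may be a path or a binary file-like object.
    """
    chat, audio, suspicious = [], [], []
    with zipfile.ZipFile(zip_source) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            suffix = PurePosixPath(info.filename).suffix.lower()
            if suffix == ".txt":
                chat.append(info.filename)
            elif suffix in AUDIO_EXTS:
                audio.append(info.filename)
                # Uncompressed size; tiny files suggest a truncated export.
                if info.file_size < MIN_AUDIO_BYTES:
                    suspicious.append(info.filename)
    return chat, audio, suspicious
```

An empty `audio` list means the export was made without media. A `suspicious` list that contains every audio file points to the size-cap problem described above.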
Step 3: Bulk transcription strategy (the only one that scales)
Transcribing voice notes one by one is a waste of time. A scalable pipeline does this automatically:
1. Parse the chat log and detect every voice note reference (`<attached: ...opus>` lines).
2. Match each reference to the actual `.opus` or `.m4a` file inside the `.zip`.
3. Decode the audio and run voice activity detection to strip silence, which avoids a class of Whisper hallucinations on silent stretches.
4. Transcribe with a speech-to-text model (Whisper-class is the current standard).
5. Return per-clip results: text, language, confidence, timestamps inside the clip.
6. Merge transcripts into the conversation timeline at the original send timestamps.
That last step is the difference between "a pile of audio transcripts" and "a usable recap". Most tools that advertise WhatsApp voice transcription stop at step five and leave the merge as a manual exercise.
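The first two steps of that pipeline (detecting references and matching them to files) can be sketched as follows. The line layout `[DD/MM/YYYY, HH:MM:SS] Sender: <attached: NAME>` is an assumption based on one common iOS export style; date order, brackets, and the attachment marker vary by platform and locale, so treat the regular expression as a starting point. The function name `find_voice_notes` is hypothetical.

```python
import re
from datetime import datetime

# Assumed line shape: "[27/01/2026, 14:32:11] Alex: <attached: NAME.opus>"
# Exports may carry an invisible U+200E mark before "[" or "<attached".
LINE_RE = re.compile(
    r"^\u200e?\[(?P<ts>\d{2}/\d{2}/\d{4}, \d{2}:\d{2}:\d{2})\] "
    r"(?P<sender>[^:]+): \u200e?<attached: (?P<file>[^>]+\.(?:opus|m4a))>\s*$"
)


def find_voice_notes(chat_text, zip_names):
    """List voice note references and whether each file is in the archive."""
    available = set(zip_names)
    refs = []
    for line in chat_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # ordinary text message or another attachment type
        refs.append({
            "sender": m["sender"],
            "sent_at": datetime.strptime(m["ts"], "%d/%m/%Y, %H:%M:%S"),
            "file": m["file"],
            "present": m["file"] in available,
        })
    return refs
```

Keeping `sent_at` and `sender` next to the file name at this stage is what makes the later merge possible: the transcript can be placed back at the exact position of the original message.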
Step 4: Merge transcripts into the timeline
A correctly merged transcript looks like a normal message in the conversation timeline:
Sender: Alex.
Type: audio.
Timestamp: 14:32:11 on 27 January 2026 (original send time).
Transcript: "Ok, we will ship Friday. John owns the landing page. I will handle billing."
With this structure, downstream analysis can correctly extract:
Decisions: ship Friday.
Owners: John for landing page.
Action items: billing tasks (owner: speaker).
Open questions: anything unresolved in the transcript.
Without timeline merge, the AI sees the chat log without audio content and the audio transcripts as a separate disconnected stream. The recap then misses commitments made only in audio, which in many work chats is the majority of substantive content.
This is the most common failure mode of generic transcription tools paired with general-purpose summarisers.
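To make the merge concrete, here is a minimal sketch under stated assumptions: the chat has already been parsed into message records, and transcripts are keyed by audio file name. The `Message` fields and the `merge_transcripts` name are illustrative, not any tool's actual data model.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class Message:
    sent_at: datetime
    sender: str
    kind: str                         # "text" or "audio"
    text: str                         # body, or transcript for audio
    audio_file: Optional[str] = None  # set only for voice notes


def merge_transcripts(messages: List[Message],
                      transcripts: Dict[str, str]) -> List[Message]:
    """Fill each audio message with its transcript, keeping send order."""
    merged = []
    for msg in messages:
        if msg.kind == "audio" and msg.audio_file in transcripts:
            # Copy rather than mutate, so the parsed log stays untouched.
            msg = replace(msg, text=transcripts[msg.audio_file])
        merged.append(msg)
    # sorted() is stable: messages sent in the same second keep log order.
    return sorted(merged, key=lambda m: m.sent_at)
```

The result is a single stream in which a voice note is just another message with a sender and timestamp, so the analysis step sees commitments made in audio in the same place it sees typed ones.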
Step 5: Turn transcripts into real outputs
Once audio is merged into the timeline, the choice of analysis goal shapes what you get:
Meeting Recap
Context and purpose.
Agenda topics in order.
Decisions made (with the deciding speaker and timestamp).
Action items (owner, deadline if mentioned, current state).
Open questions.
Suggested follow-ups.
Best for project standups, sprint planning, retros conducted in WhatsApp. The same output reads cleanly as meeting minutes from a WhatsApp chat when you need a sharable artefact.
Action Items only
Task list.
Owner per task.
Deadline or "no deadline mentioned".
Blockers.
Best when you only need a current commitments list and the broader context is not needed.
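One way to picture the action-items output is as a small record type. The sketch below is only an illustration of the fields listed above; the class name, field names, and rendering format are assumptions, not a documented output schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ActionItem:
    task: str
    owner: str
    deadline: Optional[str] = None   # None when the chat never states one
    blockers: List[str] = field(default_factory=list)

    def deadline_label(self) -> str:
        return self.deadline or "no deadline mentioned"


def render(items: List[ActionItem]) -> str:
    """Format action items as a plain-text list, blockers indented below."""
    lines = []
    for item in items:
        lines.append(
            f"- {item.task} | owner: {item.owner} "
            f"| deadline: {item.deadline_label()}"
        )
        lines.extend(f"    blocked by: {b}" for b in item.blockers)
    return "\n".join(lines)


print(render([ActionItem("Landing page", "John"),
              ActionItem("Billing", "Alex")]))
```

Making the deadline optional, instead of guessing one, keeps the list honest about what was actually said in the chat.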
Conflict Resolution
Root cause.
Each side's perspective.
Misunderstandings.
Resolution status.
Next steps.
Best for arguments and disagreements that played out in audio. Voice tone often matters here, but the transcript captures the content even if it loses the tone.
Decisions
Decision text.
Who decided.
Supporting context.
Dissent (if any).
Date and timestamp.
Best for project history audits or when you need a defensible record of what was agreed and when.
Relationship Insights
Tone arc over time.
Recurring topics.
Communication patterns.
Best for personal or partnership chats where the value is in the longitudinal view rather than specific commitments. The full output shape is documented under relationship insights from WhatsApp history.
If the export contains no voice notes, re-export with Include media (Android) or Attach Media (iPhone). Without media, the audio files are not in the `.zip` at all.
My `.zip` is too big to upload
Start with a smaller timeframe. If you only need "what happened this week," do not export three years of media. WhatsApp also caps exports at 10,000 messages when media is included; for very long chats, run two exports: one without media for full historical coverage, and one with media for the recent period that contains the voice notes you actually need.
The tool transcribed audio but the recap is still generic
Almost always means the transcripts were not merged into the conversation timeline before analysis. Audio transcripts as a separate document do not carry conversational context, so the analysis cannot reason about who said what and when. ThreadRecap performs the merge automatically; if you are using a different tool, this step is usually missing.
Group chats are noisy
Filter participants. In a 12-person work chat, the three or four people doing 80% of the substantive talking are usually the only ones whose messages and voice notes need to enter the analysis. Combine participant filtering with date-range filtering to focus the recap and reduce credit cost.
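A rough sketch of that filtering, assuming messages are available as `(sender, sent_at)` pairs. Message count is used here as a crude stand-in for "substantive talking", which is an assumption of this example; the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime


def top_participants(messages, coverage=0.8):
    """Most active senders whose messages together reach `coverage`."""
    counts = Counter(sender for sender, _ in messages)
    total = sum(counts.values())
    chosen, covered = [], 0
    for sender, n in counts.most_common():
        if covered / total >= coverage:
            break
        chosen.append(sender)
        covered += n
    return chosen


def filter_messages(messages, senders, start, end):
    """Keep messages from `senders` with start <= sent_at < end."""
    keep = set(senders)
    return [(s, t) for s, t in messages if s in keep and start <= t < end]
```

Running the participant filter first and the date filter second keeps the activity ranking based on the whole chat, not only on the window being recapped.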
The transcript got names wrong
This is expected behaviour for Whisper: proper nouns are the most common error category. Spot-check names against the original audio using the inline player (every transcribed clip in ThreadRecap has a player at the message position). Names that appear repeatedly in the chat tend to converge on the right spelling because Whisper has more context to anchor on.
Privacy basics for voice notes
Voice notes can include identity cues, names, locations, and confidential details. The minimum a serious tool should provide:
Preview of what will be processed before upload.
Selective upload: chat text and voice note audio sent to servers; photos, videos, and documents never uploaded.
Encrypted account storage for chat text, voice note audio, and processed recaps, with explicit user control over deletion.
Clear retention policy in writing.
No model training on user-uploaded content.
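The preview and selective-upload items can be illustrated with a short sketch that sorts archive members by extension before anything leaves the device. This is an assumption-laden illustration, not any particular tool's implementation; `plan_upload` and the extension set are hypothetical.

```python
from pathlib import PurePosixPath

# Only the chat log and voice note audio are candidates for upload.
UPLOAD_EXTS = {".txt", ".opus", ".m4a"}


def plan_upload(member_names):
    """Split archive member names into what is uploaded and what stays local."""
    plan = {"upload": [], "keep_local": []}
    for name in member_names:
        ext = PurePosixPath(name).suffix.lower()
        bucket = "upload" if ext in UPLOAD_EXTS else "keep_local"
        plan[bucket].append(name)
    return plan
```

Showing the resulting plan to the user before any network request is one way to satisfy the "preview" requirement. Note that extension matching alone would also pick up a shared `.m4a` music file, so a real tool would need a stricter check.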
ThreadRecap parses `.zip` files locally in the browser, never uploads photos, videos, or documents, stores chat text and voice note audio encrypted in your account alongside processed recaps, and gives you deletion control through the dashboard at any time. Review the privacy policy for retention specifics before uploading sensitive content.
Quick reference
Can I transcribe WhatsApp voice notes to text for free?
ThreadRecap's 5 free credits on sign-up cover a typical short or medium chat end-to-end. Other free tools exist but usually have stricter limits or unclear data handling. Treat truly free options as higher risk for sensitive content.
What file format are WhatsApp voice messages?
`.opus` (Opus codec, OGG container) is the default. `.m4a` (AAC, MP4 container) appears in older iOS exports. Both appear inside the export `.zip` when media is included.
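File extensions can be wrong, so it can help to confirm the container from the first bytes of the file. The sketch below relies on two well-known signatures: OGG pages start with `OggS` (and an Opus stream announces itself with `OpusHead` in the first page), while MP4 files normally carry an `ftyp` box at byte offset 4. The function name and return labels are illustrative.

```python
def sniff_container(header: bytes) -> str:
    """Classify a file from its first bytes (pass at least 64 bytes)."""
    if header.startswith(b"OggS"):
        # The Opus identification header sits in the first OGG page.
        return "ogg-opus" if b"OpusHead" in header[:64] else "ogg"
    if header[4:8] == b"ftyp":
        return "mp4"
    return "unknown"
```

For example, calling `sniff_container` on the first 64 bytes read from an extracted voice note should return `"ogg-opus"` for a standard `.opus` file and `"mp4"` for an `.m4a` one.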
Do I need media export for transcription?
Yes. No media in the export means no audio files to transcribe.
What is the best end result to aim for?
A searchable timeline where voice notes are merged back into the conversation, plus a structured output such as decisions and action items with owners and deadlines. The transcript on its own is much less useful than the same transcript inside the conversational context.
Run the workflow
Export your WhatsApp chat with media, upload the `.zip`, let the pipeline transcribe every voice note in bulk, and pick a goal to generate a structured recap with decisions and action items in minutes.