Transcribe every voice note from a WhatsApp export at once | ThreadRecap
If you have ever tried to work through a WhatsApp conversation where half the messages are voice notes, you already know the problem: you press play, wait, take a note, press play again, lose your place, and repeat. That workflow collapses the moment the volume grows. ThreadRecap solves it by transcribing every voice note in an export simultaneously, merging the results back into the chat timeline so you can read the whole conversation as text.
Why one-by-one transcription stops scaling at 10 voice notes
Playing voice notes individually is fine for a quick personal exchange. It breaks down in three common situations:
High-volume group chats. A busy project group can accumulate dozens of voice notes in a single day. Listening to each one sequentially takes longer than the original conversation did.
Archived or historical chats. When you need to reconstruct what was agreed weeks or months ago, scrubbing through audio is slow and error-prone. A searchable text record is far more useful.
Evidence and compliance use cases. Legal teams, HR departments, and compliance officers need a complete, timestamped record. Manually transcribing audio one clip at a time introduces gaps and inconsistencies that undermine the document's reliability.
The fundamental issue is that audio is not searchable. Text is. Batch transcription converts the entire voice layer of a chat into something you can scan, search, copy, and cite.
ThreadRecap uses OpenAI Whisper for all voice note transcription. On clear audio, Whisper achieves approximately 95% accuracy. A few characteristics of the model are worth understanding before you process a large export.
What Whisper does well
Whisper was trained on a broad multilingual dataset covering 99+ languages. It handles a wide range of accents, moderate background noise, and the relatively short clip lengths that are typical of WhatsApp voice notes. The compressed .opus format does not materially degrade transcription quality for most recordings made in normal conditions.
Where accuracy drops
Whisper's training data is approximately 65% English. The remaining 35% is distributed across 99+ other languages, which means per-language accuracy is uneven. Languages with smaller representation in the training corpus will produce more errors. Additionally, recordings made in noisy environments, on low-quality microphones, or with heavy distortion will fall below the 95% benchmark. Always review transcripts before using them in formal or legal contexts.
Whisper for privacy-sensitive workflows
One reason Whisper is particularly well suited to sensitive communications is that it can be run in environments where you control data handling. ThreadRecap stores voice note audio encrypted in your account, and you can delete it at any time from the dashboard. Photos, videos, and documents in your export never leave your device.
Supported formats: .opus, .m4a, and .mp3
WhatsApp encodes voice notes as .ogg files using the OPUS codec. The files are typically referenced with the .opus extension in an export. ThreadRecap also accepts .m4a and .mp3 files, which appear in exports from certain device configurations or when voice notes have been forwarded and re-encoded.
You do not need to convert files before uploading. The batch processor identifies each audio file in the export ZIP, determines its format, and routes it to the transcription pipeline automatically. If a file is corrupt or unplayable, it is flagged in the output rather than silently skipped, so you have a complete record of what was and was not transcribed.
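The scan-and-flag behaviour described above can be sketched with the Python standard library. This is an illustrative sketch, not ThreadRecap's actual implementation: the extension list matches the formats named in this article, and the per-entry read is one simple way to surface unreadable files instead of silently skipping them.

```python
import io
import zipfile

AUDIO_EXTENSIONS = (".opus", ".m4a", ".mp3")

def scan_export(zip_bytes: bytes) -> dict:
    """Identify audio files in a WhatsApp export ZIP.

    Returns a dict with three lists: readable audio entries, entries that
    could not be read (flagged, not skipped), and non-audio entries.
    """
    audio, flagged, other = [], [], []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            name = info.filename.lower()
            if not name.endswith(AUDIO_EXTENSIONS):
                other.append(info.filename)
                continue
            try:
                # A truncated or corrupt entry raises on read, so it ends
                # up in the flagged list rather than disappearing.
                zf.read(info.filename)
                audio.append(info.filename)
            except Exception:
                flagged.append(info.filename)
    return {"audio": audio, "flagged": flagged, "other": other}
```

The point of returning a `flagged` list rather than raising is exactly the guarantee the article describes: a complete record of what was and was not transcribed.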
Open the chat or group in WhatsApp, go to the chat settings, and choose Export Chat. When prompted, select Include Media. This bundles the voice note files into the ZIP alongside the chat text file. Without media included, there are no audio files to transcribe.
Step 1: Export the chat with media included
Step 2: Upload the ZIP to ThreadRecap
Go to /whatsapp-voice-to-text and upload the ZIP file. ThreadRecap accepts files up to 2 GB, which covers exports containing 60,000 or more messages. The file is sent directly from your device to your encrypted account storage. Photos, videos, and documents in the ZIP are ignored and never uploaded.
Step 3: Batch transcription runs
ThreadRecap parses the chat text file to extract the message timeline, then identifies every audio file referenced in that timeline. Each .opus, .m4a, or .mp3 file is passed to the Whisper pipeline. Clips are processed in parallel rather than sequentially, so a large export does not require proportionally more waiting time.
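Parallel-rather-than-sequential processing can be sketched with a thread pool. Here `transcribe_clip` is a stand-in stub, since the real Whisper call is not shown in this article; the shape of the batch step is what matters.

```python
from concurrent.futures import ThreadPoolExecutor

def transcribe_clip(path: str) -> dict:
    # Stand-in for the real transcription call; returns one record per clip.
    return {"file": path, "text": f"transcript of {path}"}

def transcribe_batch(paths: list[str], workers: int = 8) -> list[dict]:
    """Transcribe clips concurrently. Wall-clock time grows sub-linearly
    with clip count, instead of proportionally as with a sequential loop."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with the timeline.
        return list(pool.map(transcribe_clip, paths))
```

Preserving input order in the batch step is what makes the later timeline merge straightforward: each result can be matched back to its message position without re-sorting.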
Step 4: Transcripts merge into the timeline
Once transcription is complete, each transcript is inserted into the chat timeline at the correct position, attributed to the correct sender, and timestamped. The result is a unified, readable conversation that includes both text messages and the transcribed content of every voice note. From there, ThreadRecap can generate structured outputs including Meeting Recaps, Action Items, Decisions, and evidence-ready reports.
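The merge step can be sketched as a substitution over the parsed timeline: each attachment reference is replaced by its transcript while the original timestamp and sender attribution are left intact. The line format and the `<attached: …>` marker below are assumptions based on common WhatsApp export layouts, not a specification of ThreadRecap's parser.

```python
import re

# Matches an attachment marker such as "<attached: PTT-0001.opus>"
ATTACHMENT = re.compile(r"<attached:\s*(?P<file>[^>]+)>")

def merge_transcripts(chat_lines: list[str], transcripts: dict[str, str]) -> list[str]:
    """Replace voice-note attachment markers with their transcripts,
    keeping the timestamp and sender from the original chat line."""
    merged = []
    for line in chat_lines:
        match = ATTACHMENT.search(line)
        if match:
            filename = match.group("file").strip()
            if filename in transcripts:
                # Substitute only the marker; the rest of the line survives.
                line = line.replace(match.group(0),
                                    f"[voice note] {transcripts[filename]}")
        merged.append(line)
    return merged
```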
Multi-language detection per clip and how it interacts with code-switching
Per-clip language detection
ThreadRecap does not require you to declare a language before processing. Whisper assesses each audio clip independently and transcribes it in the language it detects. This means a single export can contain voice notes in English, Spanish, Portuguese, and French, and each will be transcribed in its own language without any manual configuration.
It is important to understand that multilingual transcription outputs text in the detected language. It does not translate. If you need translated output, that is a separate step.
Code-switching
Code-switching is when a speaker mixes two languages within a single clip, for example beginning a sentence in English and finishing it in Portuguese. This is common in bilingual communities and international teams.
Whisper handles many code-switching cases, particularly when one language clearly dominates the clip. However, per-clip language detection works on the assumption that a single language is present. When two languages are used roughly equally within one short clip, the model may commit to the wrong language for part of the output or produce a lower-confidence transcript. Clips flagged as low-confidence are marked in the ThreadRecap output so you can prioritise them for manual review.
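Flagging clips for review can be sketched as a threshold over a per-clip confidence score. The `avg_logprob` field name follows the segment output of OpenAI's open-source Whisper implementation, but the cut-off value and the result shape here are illustrative assumptions, not ThreadRecap's actual criteria.

```python
def flag_low_confidence(results: list[dict], threshold: float = -1.0) -> list[str]:
    """Return files whose mean segment log-probability falls below the
    threshold, so they can be prioritised for manual review."""
    flagged = []
    for result in results:
        segments = result.get("segments", [])
        if not segments:
            continue
        mean_logprob = sum(s["avg_logprob"] for s in segments) / len(segments)
        if mean_logprob < threshold:
            flagged.append(result["file"])
    return flagged
```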
Practical implications for multilingual teams
If your team communicates in a dominant language with occasional phrases in a second language, batch transcription will produce usable results with minimal review. If your chats involve sustained code-switching across multiple clips, plan for a review pass before treating the transcripts as authoritative records.
Getting the most from a batch run
A few practical points before you start:
Export with media. This is the single most common reason a batch run produces no transcripts. If the ZIP contains only the chat text file, there is nothing to transcribe.
Check recording quality. The 95% accuracy figure applies to clear audio. Clips recorded in loud environments or on damaged microphones will need more review time.
Use the dashboard to manage retention. After you have downloaded or shared your transcripts, you can delete the source audio from your account. You are in control of what is stored and for how long.
Consider the output format for your use case. If you are preparing a legal or compliance document, use the evidence-ready report output, which preserves sender attribution, timestamps, and an unedited transcript alongside any structured summary.
Batch transcription does not change the content of your conversations. It makes the content accessible, searchable, and usable in ways that audio alone cannot be.
May 3, 2026 · 7 min read