Bulk transcription of WhatsApp .opus voice notes | ThreadRecap
Voice messages have become one of the dominant communication formats on WhatsApp, yet the moment you export a chat, those messages land on your computer as a pile of `.opus` files that most desktop software simply refuses to play. Understanding why that happens, and how to turn every one of those files into searchable text without touching them individually, is what this guide covers.
What .opus is and why WhatsApp uses it
Opus is an open, royalty-free audio codec standardised by the Internet Engineering Task Force. It was designed specifically for interactive speech and audio transmission over the internet, covering use cases from Voice over IP and video conferencing to in-game chat. WhatsApp encodes every voice message using Opus, typically at 8–16 kHz sampling rates, delivered over the Real-time Transport Protocol.
The codec earns its place in a messaging app for two reasons: efficiency and speed. Opus can scale from 6 kb/s narrowband speech all the way up to 510 kb/s high-quality stereo audio. More importantly for a live messaging context, its algorithmic delay is 26.5 ms by default and can be reduced to as low as 5 ms when latency matters more than bitrate. That combination of low bandwidth and near-instant delivery is exactly what a mobile app sending short voice clips across variable network conditions needs.
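To put those bitrates in perspective, the arithmetic below estimates how much data one minute of audio occupies. It is a rough sketch: it assumes a constant bitrate, ignores container overhead, and uses 16 kb/s purely as an illustrative speech setting, not as a measured WhatsApp value. The function name `approx_size_bytes` is hypothetical.

```python
def approx_size_bytes(bitrate_kbps: float, duration_s: float) -> int:
    """Approximate audio payload size, ignoring container overhead."""
    bits = bitrate_kbps * 1000 * duration_s
    return round(bits / 8)


# Illustrative figures only: 16 kb/s is an assumed speech bitrate, while
# 6 kb/s and 510 kb/s are the codec's lower and upper limits quoted above.
for kbps in (6, 16, 510):
    size = approx_size_bytes(kbps, 60)
    print(f"{kbps} kb/s for 60 s is about {size:,} bytes")
```

At the assumed 16 kb/s, a one-minute voice note comes to roughly 120 kB, which is why short clips travel well over weak mobile connections.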
Technically, Opus achieves this by blending two underlying algorithms: SILK, which is optimised for speech, and CELT, which is a lower-latency MDCT-based algorithm suited to a broader range of audio content. The result is a single codec that handles the full range of human voice recordings without switching formats.
When WhatsApp packages an Opus stream for storage, it wraps it in an OGG container. The `.opus` extension on the files in your export simply indicates an OGG container with an Opus audio stream inside.
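The container layout can be checked by hand. An OGG page begins with the four bytes `OggS`, and the first packet of an Opus stream begins with the eight bytes `OpusHead`, followed by a few little-endian fields. The sketch below reads those fields from the start of a file. The helper names `read_opus_head` and `looks_like_opus` are hypothetical, and the code assumes the identification header sits in the first page, as it does in a standard Opus file.

```python
import struct


def read_opus_head(data: bytes) -> dict:
    """Parse the OpusHead packet from the first OGG page of a file."""
    if len(data) < 27 or data[:4] != b"OggS":
        raise ValueError("not an OGG stream: missing 'OggS' capture pattern")
    # The fixed OGG page header is 27 bytes. Byte 26 is the segment count,
    # and that many lacing bytes follow before the payload begins.
    n_segments = data[26]
    payload = data[27 + n_segments:]
    if len(payload) < 19 or payload[:8] != b"OpusHead":
        raise ValueError("first OGG page does not carry an Opus header")
    # Little-endian fields after the 8-byte magic: version, channel count,
    # pre-skip, original input sample rate, output gain, channel mapping.
    version, channels, pre_skip, rate, gain, mapping = struct.unpack_from(
        "<BBHIhB", payload, 8
    )
    return {
        "version": version,
        "channels": channels,
        "pre_skip": pre_skip,
        "input_sample_rate": rate,
        "output_gain": gain,
        "mapping_family": mapping,
    }


def looks_like_opus(path: str) -> bool:
    """True if the file at `path` starts with an OGG page holding OpusHead."""
    with open(path, "rb") as fh:
        try:
            read_opus_head(fh.read(512))
            return True
        except ValueError:
            return False
```

Calling `looks_like_opus` on a file from an export is a quick way to confirm that a `.opus` file really is OGG-wrapped Opus before sending it further down a pipeline.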
Why most desktop players cannot open .opus directly
The `.opus` extension is not registered by default on Windows or macOS. When you double-click one of these files, the operating system looks for an associated application, finds none, and either prompts you to choose a program or throws an error. Even applications that do launch will often fail to decode the file because they lack a built-in Opus codec.
Windows Media Player does not include native Opus support. iTunes and the macOS Music app are similarly limited. QuickTime, which handles a wide range of formats, does not decode Opus out of the box. The players that do work, such as VLC or certain browser-based players, require either a bundled codec library or a system-level codec pack that most users have never installed.
This is a practical problem when a chat export contains dozens or hundreds of voice notes. Even if you install a compatible player, listening through each file one by one is not a realistic approach to understanding a long conversation. The `.opus` format was optimised for transmission, not for post-hoc desktop review.
How ThreadRecap pipes .opus through Whisper
ThreadRecap is built around a specific workflow: you export your WhatsApp chat on your device, then upload the resulting ZIP file to the platform. The export-and-upload sequence matters because it means you hold the file before anything is transmitted. Photos, videos, and documents never leave your device; only chat text and voice note audio are processed, and those are stored encrypted in your account. You can delete them at any time from the dashboard.
Once the ZIP arrives, ThreadRecap unpacks every `.opus` file from the export and routes each one through OpenAI Whisper. Whisper accepts the OGG/Opus format directly, which avoids any intermediate conversion step that could introduce quality loss or metadata errors. The transcription pipeline runs across all voice notes in the export in parallel rather than sequentially, which is what makes bulk processing practical for large or long-running group chats.
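The unpack-and-fan-out step can be sketched in a few lines. This is an illustration under stated assumptions, not ThreadRecap's actual code: `transcribe_export` and `fake_transcribe` are hypothetical names, the member names in the demo are invented, and the `transcribe` callable stands in for whatever Whisper call a real pipeline would make.

```python
import io
import zipfile
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def transcribe_export(
    zip_bytes: bytes,
    transcribe: Callable[[bytes], str],
    max_workers: int = 4,
) -> Dict[str, str]:
    """Return {member name: transcript} for every .opus file in an export ZIP."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = [n for n in archive.namelist() if n.lower().endswith(".opus")]
        # Read audio up front so worker threads never share the ZipFile handle.
        audio = {name: archive.read(name) for name in names}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so names and transcripts stay aligned.
        transcripts = list(pool.map(transcribe, audio.values()))
    return dict(zip(audio.keys(), transcripts))


def fake_transcribe(audio: bytes) -> str:
    """Hypothetical stand-in; a real pipeline would call a Whisper model here."""
    return f"<{len(audio)} bytes of audio>"


# Build a tiny in-memory export with invented member names to exercise the code.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("_chat.txt", "chat text")
    demo.writestr("note-0001.opus", b"\x00" * 10)
    demo.writestr("note-0002.opus", b"\x00" * 25)
    demo.writestr("photo.jpg", b"\xff\xd8")

print(transcribe_export(buffer.getvalue(), fake_transcribe))
```

A thread pool suits this step because each transcription call spends most of its time waiting on a model or a network response, so the files can be processed side by side rather than one after another.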
For a detailed walkthrough of the conversion mechanics, see the /opus-to-text feature page.
The output for each file is a plain-text transcript tagged with the sender's name and the original message timestamp. That tagged output is what feeds the timeline merge described in the next section.
Performance numbers: time per minute of audio, accuracy ranges
Whisper Large-v3, the model ThreadRecap uses, achieves a 2.7% Word Error Rate on the LibriSpeech clean benchmark. On real-world English audio, including the kind of informal, sometimes noisy recordings that characterise WhatsApp voice notes, the Word Error Rate sits in the 8–12% range. Accuracy varies by language, speaker accent, recording environment, and whether the speaker is close to the microphone.
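Word Error Rate is the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference text, divided by the number of words in the reference. The sketch below computes it with a standard word-level edit distance. The function name `word_error_rate` is hypothetical, the normalisation (lowercasing and whitespace splitting) is a simplifying assumption, and published benchmarks typically apply more careful text normalisation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words; one row is kept at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of five gives a 20% error rate.
print(word_error_rate("see you at ten tomorrow", "see you at two tomorrow"))
```

Read this way, an error rate of about 10% means roughly one word in ten needs correcting, which is usually enough to follow the gist of a voice note but worth checking when exact wording matters.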
A few practical observations about what affects accuracy in WhatsApp-specific audio:
Background noise is the single largest accuracy reducer. A voice note recorded on a busy street or with music playing in the background will produce more errors than one recorded in a quiet room.
Accents and code-switching (mixing two languages mid-sentence) can push error rates above the 8–12% range for Whisper, though the model handles many language combinations reasonably well.
Short clips of one or two seconds, common in casual chats, sometimes produce less reliable output than clips of ten seconds or more, because there is less audio context for the model to anchor on.
Clear, close-mic speech in a single language consistently sits at the lower end of the error range.
Under good recording conditions, Whisper's accuracy is generally high and in line with industry standards for clear audio.
Merging transcripts back into the conversation timeline
A transcript that exists as a separate file, detached from the conversation it came from, has limited value. The key step in ThreadRecap's pipeline is the timeline merge: each completed transcript is inserted into the conversation at the exact position and timestamp of the original voice message.
This means that when you view the processed chat, a voice note from a participant appears as a text block attributed to that participant, timestamped to the second it was sent, sitting between the text messages that preceded and followed it. The conversation reads as a single continuous thread rather than a mix of text and opaque audio references.
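A merge of this kind can be sketched as follows. It is a minimal illustration, assuming the chat has already been parsed into message records and that each voice note message carries the name of its audio file. The `Message` class, the `merge_transcripts` function, the `[voice note]` prefix, and the sample data are all hypothetical.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Message:
    sent_at: datetime
    sender: str
    text: str
    audio_file: Optional[str] = None  # set when the message is a voice note


def merge_transcripts(
    messages: List[Message], transcripts: Dict[str, str]
) -> List[Message]:
    """Swap each voice note's placeholder for its transcript, in time order."""
    merged = []
    for msg in messages:
        if msg.audio_file and msg.audio_file in transcripts:
            # Sender and timestamp are untouched; only the text is filled in.
            msg = replace(msg, text=f"[voice note] {transcripts[msg.audio_file]}")
        merged.append(msg)
    return sorted(merged, key=lambda m: m.sent_at)


chat = [
    Message(datetime(2026, 5, 3, 9, 14, 2), "Ana", "Are we still on for ten?"),
    Message(datetime(2026, 5, 3, 9, 15, 40), "Ben", "", audio_file="note-0001.opus"),
    Message(datetime(2026, 5, 3, 9, 16, 5), "Ana", "Fine, see you then."),
]
for m in merge_transcripts(chat, {"note-0001.opus": "Running late, make it half past."}):
    print(f"[{m.sent_at:%Y-%m-%d %H:%M:%S}] {m.sender}: {m.text}")
```

Because the transcript replaces the text of the original record rather than being appended as a new one, the spoken content keeps the sender and second-level timestamp of the voice message it came from.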
The timeline merge has several downstream effects:
Search becomes uniform. You can search the entire conversation, including what was spoken, using a single query.
Summaries include spoken content. ThreadRecap's Meeting Recap and Action Items outputs draw on the full conversation, not just typed messages. A decision announced in a voice note is captured the same way a typed decision is.
Evidence output is complete. For legal, dispute, or compliance use cases, a conversation record that omits voice notes has gaps. The merged timeline closes those gaps, producing a document where every communication event is represented in text form with its original timestamp.
WhatsApp's built-in transcription and where it stops
WhatsApp offers its own voice message transcription, though availability varies by platform and language. It works on-device, which is a genuine privacy advantage, but it comes with significant constraints: it supports five languages on Android and around twenty on iOS, it transcribes one message at a time, and it produces no summary, no action items, and no exportable record. For a user who wants to review a single recent voice note in a supported language, it is convenient. For anyone dealing with a large export, a multi-language group, or a situation where a complete and structured record matters, the built-in feature does not reach far enough.
ThreadRecap is not positioned as a replacement for WhatsApp's native features. The two workflows address different needs. The native feature is immediate and requires no export. ThreadRecap is designed to handle larger volumes and more complex transcription needs than single-message tools.
A note on privacy and data handling
Because voice notes contain spoken words rather than typed text, they often carry more personal information than a text message of equivalent length. ThreadRecap's handling reflects that: voice note audio is stored encrypted in your account, not processed in a way that exposes it to third parties, and you retain full control over deletion via the dashboard. The export-and-upload workflow also means the file exists on your device before any data leaves it, giving you a clear point of control at the start of the process.
May 3, 2026 | 7 min read