OPUS to Text: Convert WhatsApp Voice Notes | ThreadRecap
You exported a WhatsApp chat and found a folder full of .opus files. What are they, why does WhatsApp use this format, and how do you turn them into readable text?
What is an .opus file
Opus is an audio codec designed for interactive speech and music. It was standardized by the Internet Engineering Task Force (IETF) as RFC 6716 and is an open, royalty-free format.
WhatsApp uses Opus for voice messages because it:
Compresses audio efficiently (small file sizes)
Maintains good speech quality at low bitrates
Is optimized for real-time voice communication
Works across all platforms (iOS, Android, Web)
When you record a voice note in WhatsApp, it is saved as an .opus file.
Why Opus specifically, not MP3 or AAC
The choice of Opus was deliberate and technical. MP3 was designed primarily for music and produces noticeably larger files when encoding speech at the same perceived quality. AAC offers strong compression but is encumbered by licensing requirements, making it a less attractive default for a product that ships on billions of devices. Opus, by contrast, was standardised by the IETF as an open, royalty-free codec that performs well in the low-bitrate range where human speech lives.
The practical result is that a 1-minute WhatsApp voice note in .opus format is typically only 50 to 100 KB. That compactness matters enormously at scale: WhatsApp processes hundreds of millions of voice notes every day, and each kilobyte saved multiplies across mobile data plans, server storage, and delivery latency worldwide.
The sequential number at the start of each exported media filename is not arbitrary. WhatsApp increments that leading integer for every piece of media in the conversation, regardless of type. That means audio files, images, videos, and documents all share the same counter. If you filter the export folder to show only `.opus` files, the gaps in the sequence numbers reveal where photos or other attachments appeared in the timeline.
The timestamp embedded in the filename matches the send time shown in the chat, which makes it straightforward to reconstruct the exact moment each voice note was sent even before you open _chat.txt. This structure is also how tools like ThreadRecap anchor each transcription to the correct position in the conversation: the filename in _chat.txt and the filename in the zip are identical, so the two sources can be joined without ambiguity.
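A minimal sketch of reading those gaps follows. It assumes filenames shaped like `00000003-AUDIO-2024-01-15-10-30-45.opus` (a counter, a type label, then the send time). That pattern and the sample names are illustrative assumptions, not a documented WhatsApp specification, so adjust the regular expression to whatever your own export contains.

```python
import re
from datetime import datetime

# Assumed filename pattern (hypothetical; check it against your own export):
#   <counter>-<TYPE>-<YYYY-MM-DD-HH-MM-SS>.<ext>
MEDIA_NAME = re.compile(
    r"^(?P<seq>\d+)-(?P<kind>[A-Z]+)-"
    r"(?P<stamp>\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2})\.(?P<ext>\w+)$"
)


def parse_media_name(name):
    """Return (sequence number, send time, extension), or None if no match."""
    match = MEDIA_NAME.match(name)
    if match is None:
        return None
    sent = datetime.strptime(match["stamp"], "%Y-%m-%d-%H-%M-%S")
    return int(match["seq"]), sent, match["ext"].lower()


def sequence_gaps(names):
    """List counter values missing between the .opus files in `names`."""
    parsed = (parse_media_name(name) for name in names)
    seqs = sorted(p[0] for p in parsed if p is not None and p[2] == "opus")
    if not seqs:
        return []
    present = set(seqs)
    return [n for n in range(seqs[0], seqs[-1] + 1) if n not in present]


voice_notes = [
    "00000003-AUDIO-2024-01-15-10-30-45.opus",
    "00000004-AUDIO-2024-01-15-10-32-10.opus",
    "00000007-AUDIO-2024-01-15-11-05-00.opus",
]
# Counters 5 and 6 are absent, so two other attachments sit between the notes.
print(sequence_gaps(voice_notes))
```

The parsed send time can then be compared against the timestamps in _chat.txt if you want to double-check the join.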
ThreadRecap supports WhatsApp export .zip files up to 2 GB and conversations of 60,000 or more messages, including embedded voice notes. For long-running group chats where voice notes have accumulated over months or years, that capacity means no manual splitting of the export is required before upload.
Why you cannot just play .opus files
Most computers and phones can play .opus files with the right app. VLC, for example, handles Opus natively. But playing each voice note one by one and taking notes is impractical when you have 20 or 50 voice messages.
The real problem is not playback — it is turning all those voice notes into searchable, analyzable text. A dedicated OPUS to text converter handles this automatically.
The time cost of manual transcription
The arithmetic is straightforward but worth spelling out. Manually transcribing a 2-minute voice note takes approximately 5 to 10 minutes when you factor in pausing, rewinding to catch unclear words, and typing. A group chat that contains 30 voice notes averaging 90 seconds each represents roughly 45 minutes of audio. At 2.5 to 5 minutes of work per minute of audio, converting the whole set by hand could consume roughly 2 to 4 hours of focused work. That figure does not include the time needed to reinsert each transcription into the conversation at the correct timestamp so that it reads coherently alongside the surrounding text messages.
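The estimate can be recomputed from its inputs. The sketch below uses only the figures stated above (30 notes, 90 seconds each, 5 to 10 minutes of work per 2 minutes of audio); the function name is made up for illustration.

```python
def manual_transcription_hours(note_count, avg_seconds, work_minutes_per_audio_minute):
    """Estimate hours of typing needed for a set of voice notes."""
    audio_minutes = note_count * avg_seconds / 60
    return audio_minutes * work_minutes_per_audio_minute / 60


# A 2-minute note taking 5 to 10 minutes means 2.5x to 5x real time.
low = manual_transcription_hours(30, 90, 5 / 2)
high = manual_transcription_hours(30, 90, 10 / 2)
print(f"Between {low} and {high} hours of manual work")
```

Swapping in your own note count and average length gives a quick sense of whether manual transcription is realistic for a given chat.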
How to convert .opus to text
Manual approach
Open each .opus file in a media player
Listen and type out the content
Insert the text into the conversation at the right position
This is accurate but extremely time-consuming. A 2-minute voice note takes 5-10 minutes to transcribe manually.
Using ThreadRecap
Export your WhatsApp chat with media (include the .opus files)
Upload the exported .zip file to ThreadRecap
Each voice note is transcribed using OpenAI Whisper
Transcriptions are inserted into the conversation timeline
The result is a complete conversation where voice notes and text messages flow together in chronological order.
How the transcription pipeline works
ThreadRecap uses OpenAI Whisper, a speech recognition model trained on a large multilingual dataset. When you upload a WhatsApp export zip, ThreadRecap parses _chat.txt to identify every line that references an `.opus` or `.m4a` attachment, extracts the corresponding audio files, passes them through Whisper, and then splices the returned text back into the conversation at the exact timestamp position. The output is a unified transcript where a voice note appears as a clearly labelled block of text between the surrounding typed messages.
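The parse-and-splice step can be sketched roughly as follows. This is not ThreadRecap's actual code: the `<attached: ...>` marker, the sample chat lines, and the `transcribe` callable are all assumptions made for illustration, and real export formats vary by platform and app version.

```python
import re

# Assumed attachment marker (hypothetical; export formats vary by platform):
#   [15/01/2024, 10:30:45] Alice: <attached: 00000003-AUDIO-....opus>
ATTACHMENT = re.compile(r"<attached: (?P<name>[^>]+\.(?:opus|m4a))>", re.IGNORECASE)


def splice_transcripts(chat_lines, transcribe):
    """Replace each voice-note reference with a labelled transcript.

    `transcribe` is any callable that maps a filename to text. In a real
    pipeline it would wrap a speech recognition model.
    """
    def label(match):
        return f"[voice note] {transcribe(match['name'])}"

    # Lines without an audio attachment pass through unchanged.
    return [ATTACHMENT.sub(label, line) for line in chat_lines]


chat = [
    "[15/01/2024, 10:30:12] Alice: Running late, details in a sec",
    "[15/01/2024, 10:30:45] Alice: <attached: 00000003-AUDIO-2024-01-15-10-30-45.opus>",
    "[15/01/2024, 10:31:02] Bob: ok!",
]

# Stand-in transcriber so the example runs without audio files or a model.
fake_transcripts = {
    "00000003-AUDIO-2024-01-15-10-30-45.opus": "Traffic is bad, see you at eleven.",
}

merged = splice_transcripts(chat, fake_transcripts.__getitem__)
for line in merged:
    print(line)
```

Because each transcript replaces the attachment marker in place, it inherits the timestamp and sender of the original line. If you use the open-source `openai-whisper` Python package, the `transcribe` argument could be a small wrapper around `model.transcribe(path)["text"]`, with `model` obtained from `whisper.load_model(...)`; whether ThreadRecap does it this way is not stated here.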
On clear recordings of a single speaker, Whisper achieves approximately 95% accuracy. That means a 100-word voice note will contain roughly 5 errors on average under good conditions, which is sufficient for most search, summary, and review tasks without any manual correction.
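Accuracy figures of this kind are commonly expressed as one minus the word error rate (WER): the number of word substitutions, insertions, and deletions divided by the number of words actually spoken. The article does not say how the 95% figure was measured, so treating it as WER is an assumption. A minimal sketch for checking a transcript against a hand-corrected reference:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


spoken = "see you at the station at eleven"
transcribed = "see you at the station at seven"
wer = word_error_rate(spoken, transcribed)
print(f"WER: {wer:.3f}, accuracy: {1 - wer:.3f}")
```

Short notes are sensitive to single mistakes: one wrong word out of seven already puts this example well below the 95% mark.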
What happens to the audio quality
WhatsApp records voice notes at relatively low bitrates to keep file sizes small. A 1-minute voice note is typically 50-100 KB. Despite this compression, modern speech recognition handles WhatsApp audio well.
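To put the file sizes in perspective, the average bitrate follows directly from size and duration. The sketch below takes 1 KB as 1000 bytes and uses the 50 to 100 KB per minute range quoted above; the function name is illustrative.

```python
def average_bitrate_kbps(size_kb, duration_seconds):
    """Average bitrate in kilobits per second, taking 1 KB as 1000 bytes."""
    return size_kb * 8 / duration_seconds


for size_kb in (50, 100):
    rate = average_bitrate_kbps(size_kb, 60)
    print(f"{size_kb} KB per minute is about {rate:.1f} kbit/s")
```

For comparison, music is commonly distributed at 128 kbit/s or more, roughly ten times these rates.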
Factors that affect transcription quality:
Background noise — Quiet recordings transcribe best
Language — Major languages (English, Spanish, Portuguese, etc.) have the highest accuracy
Multiple speakers — If someone else is talking in the background, accuracy drops
Understanding accuracy limitations
The 95% figure represents a ceiling that applies to favourable conditions. Real-world WhatsApp voice notes are often recorded in less controlled environments: on the street, in a car, or in a room with other people talking. Background noise masks parts of the speech signal, which leaves the model less certain about what was said and shows up as word-level errors in the transcript.
Languages that are underrepresented in Whisper's training data also see lower accuracy. Major world languages with large amounts of publicly available audio, such as English, Spanish, French, German, and Portuguese, perform close to the 95% benchmark. Less-resourced languages may fall meaningfully below that. If your WhatsApp conversations are primarily in a less-resourced language, it is worth reviewing transcriptions carefully before using them for any purpose that requires precision.
Multiple simultaneous speakers are a distinct challenge. Whisper is a transcription model, not a diarisation system, so it does not attempt to separate overlapping voices or label who said what within a single audio file. If a voice note captures two people speaking at once, the output will be a best-effort blend rather than an accurate representation of either speaker.
Opus vs other audio formats
WhatsApp specifically chose Opus over alternatives:
MP3: Larger files, not optimized for speech
AAC: Good quality, but subject to licensing requirements
Opus: Best compression-to-quality ratio for speech, open standard
Some older WhatsApp exports may contain .m4a files instead of .opus, depending on the WhatsApp version and device. ThreadRecap handles both formats.
When you might see .m4a instead of .opus
WhatsApp migrated its default voice note format to Opus incrementally. Exports from conversations that began several years ago, or backups restored from older devices, can still contain .m4a files recorded under the previous default. The .m4a container typically holds AAC-encoded audio, which has different compression characteristics than Opus but is still handled correctly by speech recognition tools designed for voice content. If your export folder contains a mix of .opus and .m4a files, that is normal and reflects the migration history of that specific chat. ThreadRecap processes both formats without requiring any pre-conversion step on your part.
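If you want to check which formats an export contains before uploading it, a short script can count the audio attachments by extension. This is a sketch using Python's standard `zipfile` module; the demo archive and its filenames are made up so the example runs without a real export.

```python
import io
import zipfile
from collections import Counter
from pathlib import PurePosixPath

AUDIO_EXTENSIONS = {".opus", ".m4a"}


def count_voice_notes(zip_source):
    """Count audio attachments by extension in an export zip.

    `zip_source` can be a path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        suffixes = (PurePosixPath(name).suffix.lower() for name in archive.namelist())
        return Counter(suffix for suffix in suffixes if suffix in AUDIO_EXTENSIONS)


# Build a tiny in-memory archive with hypothetical filenames.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    for name in (
        "_chat.txt",
        "00000001-AUDIO-2019-06-02-08-15-00.m4a",
        "00000002-PHOTO-2019-06-02-08-16-00.jpg",
        "00000003-AUDIO-2024-01-15-10-30-45.opus",
    ):
        demo.writestr(name, b"")
buffer.seek(0)

counts = count_voice_notes(buffer)
print(dict(counts))
```

For a real export, pass the path to the .zip file instead of the in-memory buffer.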
The bottom line
.opus files are just voice notes in an efficient audio format. The challenge is not the format itself but the volume — when a conversation has dozens of voice notes, manually listening to each one is not practical.
Automated transcription turns those .opus files into text that can be searched, summarized, and analyzed alongside the rest of the conversation.