How Libopus Encodes Audio Frame Duration
This article explains how the Opus audio codec (libopus) encodes audio frame duration in the first byte of every packet. By analyzing the structure of the Table of Contents (TOC) byte, we will see how the configuration bits and the frame count code together define the duration of each frame and of the whole packet without a separate length or timing header.
The Opus codec (standardized in RFC 6716) is designed for low overhead and interactive real-time applications. To minimize latency and bandwidth, libopus does not use complex nested headers to describe packet contents. Instead, every Opus packet begins with a single, mandatory Table of Contents (TOC) byte that tells the decoder exactly how to interpret the packet, including the duration of the audio frames.
The TOC Byte Structure
The TOC byte is divided into three distinct bitfields:
 0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
| config  |s| c |
+-+-+-+-+-+-+-+-+
Bits are numbered as in RFC 6716, with bit 0 being the most significant bit of the byte.
- config (Bits 0–4; 5 bits): The configuration number.
- s (Bit 5; 1 bit): The stereo flag (\(0\) for mono, \(1\) for stereo).
- c (Bits 6–7; 2 bits): The frame count code.
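The combination of the config bits and the c bits lets the decoder calculate both the duration of an individual frame and the total duration of the packet. For frame count codes 0 to 2 the TOC byte alone is enough; for code 3 the decoder also reads the byte that follows it.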
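As a minimal sketch of the layout above, the three fields can be pulled out of a TOC byte with shifts and masks. The function name `parse_toc` is hypothetical and not part of libopus; the only thing taken from the specification is the field layout, with config in the five most significant bits.

```python
def parse_toc(toc):
    """Split an Opus TOC byte into (config, stereo, frame_count_code)."""
    if not 0 <= toc <= 0xFF:
        raise ValueError("TOC must be a single byte")
    config = toc >> 3          # bits 0-4 in RFC numbering: the top five bits
    stereo = (toc >> 2) & 0x1  # bit 5: 0 = mono, 1 = stereo
    code = toc & 0x3           # bits 6-7: the frame count code
    return config, stereo, code


if __name__ == "__main__":
    print(parse_toc(0x78))  # config 15, mono, code 0
```

Because RFC 6716 numbers bits from the most significant end, "bits 0–4" correspond to `toc >> 3` rather than `toc & 0x1F`.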
Step 1: Determining Frame Duration via the config Bits
The first five bits (config) map to a predefined table in the Opus specification. This configuration number identifies:
1. The codec mode: SILK (speech-optimized), CELT (music/low-latency-optimized), or Hybrid.
2. The audio bandwidth: Narrowband, Mediumband, Wideband, Super-wideband, or Fullband.
3. The frame size (duration): The base duration of a single audio frame in the packet.
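Depending on the configuration number, the base frame duration can be 2.5, 5, 10, 20, 40, or 60 milliseconds. Not every mode supports every duration: SILK configurations use 10, 20, 40, or 60 ms; Hybrid configurations use 10 or 20 ms; and CELT configurations use 2.5, 5, 10, or 20 ms. Because this table is fixed by the specification, the decoder only needs to read these 5 bits to know the exact duration of a single frame.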
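The lookup can be sketched as a small function that follows the configuration table in RFC 6716 (configs 0 to 11 are SILK, 12 to 15 are Hybrid, 16 to 31 are CELT). The names `describe_config` and the `*_US` tuples are hypothetical, and durations are kept in microseconds so that 2.5 ms stays an integer.

```python
SILK_US = (10_000, 20_000, 40_000, 60_000)
HYBRID_US = (10_000, 20_000)
CELT_US = (2_500, 5_000, 10_000, 20_000)


def describe_config(config):
    """Return (mode, bandwidth, frame_duration_us) for a config number 0..31."""
    if not 0 <= config <= 31:
        raise ValueError("config must fit in five bits")
    if config < 12:
        # SILK: three bandwidths, four durations each
        return "SILK", ("NB", "MB", "WB")[config >> 2], SILK_US[config & 3]
    if config < 16:
        # Hybrid: two bandwidths, two durations each
        return "Hybrid", ("SWB", "FB")[(config - 12) >> 1], HYBRID_US[config & 1]
    # CELT: four bandwidths (no mediumband), four durations each
    return "CELT", ("NB", "WB", "SWB", "FB")[(config - 16) >> 2], CELT_US[config & 3]


if __name__ == "__main__":
    for cfg in (0, 13, 16, 31):
        print(cfg, describe_config(cfg))
```

In C, libopus exposes the same information through `opus_packet_get_samples_per_frame()`, which returns the frame size in samples for a given sampling rate rather than in time units.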
Step 2: Determining Frame Count via the c Bits
The final two bits of the TOC byte (c) define the frame count code. This code dictates how many frames of the duration specified by the config bits are bundled into the packet:
- Code 0 (00): The packet contains exactly one frame. The total packet duration equals the frame duration.
- Code 1 (01): The packet contains exactly two frames of equal compressed size. The total packet duration is double the frame duration.
- Code 2 (10): The packet contains two frames of different compressed sizes. Both frames still have the duration given by config, so the total packet duration is again double the frame duration; only the byte lengths differ, and the length of the first frame is signaled explicitly in the bytes that follow the TOC byte.
- Code 3 (11): The packet contains an arbitrary number of frames. The TOC byte is immediately followed by a frame count byte that carries the number of frames in its low six bits, together with a flag that marks the packet as Constant Bitrate (CBR) or Variable Bitrate (VBR) and a flag that signals padding. The count must be at least 1, and the 120 ms packet limit caps it at 48 frames (48 frames of 2.5 ms).
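The mapping from frame count code to frame count can be sketched as follows. The function name `frame_count` is hypothetical; the logic mirrors what `opus_packet_get_nb_frames()` does in libopus, with the assumption that a zero count in a code 3 packet is reported as an error here.

```python
def frame_count(packet):
    """Return the number of frames in an Opus packet (bytes-like)."""
    if len(packet) < 1:
        raise ValueError("empty packet has no TOC byte")
    code = packet[0] & 0x3  # low two bits of the TOC byte
    if code == 0:
        return 1
    if code in (1, 2):
        return 2  # two frames, with equal (1) or different (2) compressed sizes
    # Code 3: the next byte holds v (VBR flag), p (padding flag), and a 6-bit count.
    if len(packet) < 2:
        raise ValueError("code 3 packet is missing its frame count byte")
    count = packet[1] & 0x3F
    if count == 0:
        raise ValueError("code 3 packet must contain at least one frame")
    return count


if __name__ == "__main__":
    print(frame_count(bytes([0x03, 0x83])))  # VBR flag set, three frames
```

The VBR and padding flags occupy the two most significant bits of the frame count byte, so masking with `0x3F` isolates the count.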
Calculating Total Packet Duration
By combining the frame duration (derived from config) and the frame count (derived from c), the decoder calculates the total packet duration using a simple formula:
\[\text{Total Packet Duration} = \text{Frame Duration} \times \text{Frame Count}\]
The Opus specification limits the total duration of a packet to 120 milliseconds. Since the longest frame is 60 ms, codes 0 to 2 can never exceed this limit; only a code 3 packet can, for example three 60 ms frames (180 ms). Such a packet is malformed, and the decoder rejects it rather than decoding it.
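Putting the two lookups together, a minimal sketch of the duration calculation, including the 120 ms check, might look like this. The function name `packet_duration_us` is hypothetical, durations are in microseconds, and raising `ValueError` is an assumption of this sketch; libopus itself reports malformed packets through an error code, and its `opus_packet_get_nb_samples()` returns the result in samples.

```python
MAX_PACKET_US = 120_000  # 120 ms limit from RFC 6716


def packet_duration_us(packet):
    """Return the total duration of an Opus packet in microseconds."""
    if len(packet) < 1:
        raise ValueError("empty packet has no TOC byte")
    toc = packet[0]
    config = toc >> 3  # top five bits
    code = toc & 0x3   # low two bits

    # Frame duration from the config table (SILK, Hybrid, CELT).
    if config < 12:
        frame_us = (10_000, 20_000, 40_000, 60_000)[config & 3]
    elif config < 16:
        frame_us = (10_000, 20_000)[config & 1]
    else:
        frame_us = (2_500, 5_000, 10_000, 20_000)[config & 3]

    # Frame count from the frame count code.
    if code == 0:
        count = 1
    elif code in (1, 2):
        count = 2
    else:
        if len(packet) < 2:
            raise ValueError("code 3 packet is missing its frame count byte")
        count = packet[1] & 0x3F
        if count == 0:
            raise ValueError("code 3 packet must contain at least one frame")

    total = frame_us * count
    if total > MAX_PACKET_US:
        raise ValueError("packet exceeds 120 ms")
    return total


if __name__ == "__main__":
    print(packet_duration_us(bytes([0x78])))      # one 20 ms Hybrid frame
    print(packet_duration_us(bytes([0x83, 48])))  # 48 CELT frames of 2.5 ms
```

Working in integer microseconds avoids floating-point comparisons against the 120 ms limit.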
Through this bit-allocation design, every Opus packet carries its timing information in its first byte, or in its first two bytes when frame count code 3 is used. A decoder can therefore size its buffers and keep audio synchronized by inspecting one or two bytes, before decoding any audio data.