Minimum Algorithmic Latency of libopus

This article explains the minimum algorithmic latency achievable when encoding audio with the libopus library. It outlines the components that contribute to this delay, shows how the 5 millisecond minimum is derived, and describes the configuration required to reach it.

Understanding Algorithmic Latency in Opus

Algorithmic latency is the inherent delay introduced by an audio codec’s design, independent of hardware processing speed or network transmission times. In the Opus audio codec (implemented via libopus), this latency is determined by two primary factors: the frame size (or packet duration) and the codec’s look-ahead buffer.

To achieve the absolute lowest latency, the encoder must be configured to use its smallest supported frame size.

The 5 Millisecond Minimum Limit

The absolute minimum algorithmic latency achievable with standard libopus is 5.0 milliseconds (ms).

This 5.0 ms limit is the sum of two distinct components:

Frame Size (2.5 ms): The shortest standard audio frame duration supported by the Opus codec is 2.5 ms.
Look-ahead (2.5 ms): The MDCT (Modified Discrete Cosine Transform) used in the CELT layer of Opus applies overlapping windows, so the encoder needs 2.5 ms of look-ahead to cover the overlap with the next frame. The overlap-add step on the decoder side is what cancels the transform's time-domain aliasing.

When you configure libopus to encode using 2.5 ms frames, the mathematical delay formula is:

\[\text{Algorithmic Latency} = \text{Frame Size} + \text{Look-ahead}\]
\[\text{Algorithmic Latency} = 2.5\text{ ms} + 2.5\text{ ms} = 5.0\text{ ms}\]
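The same sum can be checked from sample counts. The following is a minimal C sketch, not part of libopus: the function name `algorithmic_latency_us` is made up for illustration, and the example values assume a 48 kHz sample rate, where 2.5 ms corresponds to 120 samples.

```c
#include <assert.h>

/* Hypothetical helper (not a libopus function): total algorithmic latency
 * in microseconds, given the frame size and look-ahead in samples. */
long long algorithmic_latency_us(int frame_samples, int lookahead_samples,
                                 int sample_rate_hz)
{
    assert(sample_rate_hz > 0);
    long long total_samples = (long long)frame_samples + lookahead_samples;
    /* samples / (samples per second) = seconds; scale to microseconds first
     * so the integer division does not lose the fractional milliseconds. */
    return total_samples * 1000000LL / sample_rate_hz;
}
```

With a 120-sample frame and a 120-sample look-ahead at 48 kHz, `algorithmic_latency_us(120, 120, 48000)` evaluates to 5000, that is, 5.0 ms.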

How to Configure libopus for Minimum Latency

To achieve the 5.0 ms latency target in a practical application, you must configure the encoder with specific parameters:

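Force CELT Mode: Opus operates in three modes: SILK (for voice), CELT (for music/low-latency), and Hybrid. SILK has a minimum frame size of 10 ms and a 5 ms look-ahead (15 ms total), so reaching 5 ms requires CELT-only operation. Through the public libopus API this is done by creating the encoder with OPUS_APPLICATION_RESTRICTED_LOWDELAY, which disables the SILK and Hybrid modes. With the other application settings the encoder adds extra delay to keep its modes time-aligned, so the look-ahead it reports is longer than 2.5 ms (6.5 ms at 48 kHz) even when it produces CELT frames.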
Set Frame Size to 2.5 ms: When passing audio buffers to the opus_encode() function, the number of samples per channel must correspond exactly to 2.5 ms. For a 48 kHz sampling rate, this equals 120 samples per channel (\(48000 \times 0.0025\)).
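The two settings above might be combined as in the following sketch. It assumes libopus is installed and that its header is reachable as `<opus.h>` (for example via `cc example.c $(pkg-config --cflags --libs opus)`); mono 48 kHz input, a silent test frame, and the 4000-byte output buffer are illustrative choices rather than requirements. The look-ahead is queried from the encoder instead of being hard-coded, so the program reports whatever the installed library actually uses.

```c
#include <stdio.h>
#include <opus.h>

#define SAMPLE_RATE   48000
#define CHANNELS      1
#define FRAME_SAMPLES (SAMPLE_RATE / 400)  /* 2.5 ms at 48 kHz = 120 samples */
#define MAX_PACKET    4000                 /* generous output buffer size */

int main(void)
{
    int err = OPUS_OK;

    /* The restricted low-delay application keeps the encoder in CELT-only
     * operation, which is what permits the short look-ahead. */
    OpusEncoder *enc = opus_encoder_create(SAMPLE_RATE, CHANNELS,
                                           OPUS_APPLICATION_RESTRICTED_LOWDELAY,
                                           &err);
    if (err != OPUS_OK || enc == NULL) {
        fprintf(stderr, "encoder creation failed: %s\n", opus_strerror(err));
        return 1;
    }

    /* Ask the encoder how many samples of look-ahead it uses. */
    opus_int32 lookahead = 0;
    err = opus_encoder_ctl(enc, OPUS_GET_LOOKAHEAD(&lookahead));
    if (err != OPUS_OK) {
        fprintf(stderr, "look-ahead query failed: %s\n", opus_strerror(err));
        opus_encoder_destroy(enc);
        return 1;
    }

    /* Encode one 2.5 ms frame of silence as a placeholder for real audio. */
    opus_int16 pcm[FRAME_SAMPLES * CHANNELS] = {0};
    unsigned char packet[MAX_PACKET];
    opus_int32 nbytes = opus_encode(enc, pcm, FRAME_SAMPLES,
                                    packet, (opus_int32)sizeof packet);
    if (nbytes < 0) {
        fprintf(stderr, "encode failed: %s\n", opus_strerror(nbytes));
        opus_encoder_destroy(enc);
        return 1;
    }

    double latency_ms = 1000.0 * (FRAME_SAMPLES + lookahead) / SAMPLE_RATE;
    printf("frame: %d samples, look-ahead: %d samples, latency: %.1f ms\n",
           FRAME_SAMPLES, (int)lookahead, latency_ms);
    printf("encoded packet size: %d bytes\n", (int)nbytes);

    opus_encoder_destroy(enc);
    return 0;
}
```

If the reported look-ahead is 120 samples, the printed latency matches the 5.0 ms figure derived earlier. A larger value suggests the encoder was not created with the restricted low-delay application.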

Trade-offs of Ultra-Low Latency Encoding

While a 5 ms algorithmic latency is ideal for real-time interactive applications like musical collaboration or gaming, it comes with trade-offs:

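Lower Compression Efficiency: Shorter frames give the transform less frequency resolution, so a higher bitrate is needed to reach the same audio quality. Smaller frame sizes also mean more packets must be transmitted per second: 400 packets per second are required for 2.5 ms frames, compared to only 50 packets per second for 20 ms frames.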
Increased Network Overhead: Because each packet requires IP, UDP, and RTP headers, transmitting 400 packets per second significantly increases the network overhead compared to larger frame sizes, even if the actual audio payload remains small.
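To put rough numbers on the header cost, the sketch below computes the packet rate and the header bitrate for a given frame duration. It is a minimal C illustration with hypothetical helper names, and it assumes one frame per packet and 40 bytes of headers per packet (20 for IPv4 without options, 8 for UDP, 12 for RTP without extensions). IPv6, SRTP, or link-layer framing would raise these figures.

```c
#include <assert.h>

/* Assumed per-packet header size in bytes: IPv4 (20) + UDP (8) + RTP (12),
 * with no IP options, RTP header extensions, or CSRC entries. */
enum { ASSUMED_HEADER_BYTES = 20 + 8 + 12 };

/* Hypothetical helper: packets per second for a frame duration given in
 * microseconds, assuming exactly one frame per packet. */
int packets_per_second(int frame_us)
{
    assert(frame_us > 0);
    return 1000000 / frame_us;
}

/* Hypothetical helper: bits per second spent on headers alone. */
long header_overhead_bps(int frame_us)
{
    return (long)packets_per_second(frame_us) * ASSUMED_HEADER_BYTES * 8;
}
```

Under these assumptions, 2.5 ms frames (`header_overhead_bps(2500)`) spend 128000 bit/s on headers, while 20 ms frames (`header_overhead_bps(20000)`) spend 16000 bit/s, before any audio payload is counted.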