Libopus Stereo to Mono Downmixing Algorithms

This article explores the formulas used by the Libopus (Opus codec) library to downmix stereo audio channels into a single mono channel. It covers both the passive linear downmix used when a stereo signal is converted to mono at the codec's input or output, and the frequency-domain joint-stereo representation employed internally within the CELT layer of the codec.

Passive Time-Domain Downmixing

When Libopus is instructed to encode a stereo input as mono, or to decode a stereo stream to a mono output, it defaults to a standard passive time-domain downmix that averages the two channels with fixed, equal weights. Averaging rather than summing keeps the result within the range of the inputs, so the downmix cannot clip, and the equal weights retain the level balance of both channels.

The mathematical formula for this conversion is a simple linear combination of the left (\(L\)) and right (\(R\)) channels:

\[M[n] = 0.5 \cdot L[n] + 0.5 \cdot R[n]\]

Where:

  * \(M[n]\) is the resulting mono sample at index \(n\).
  * \(L[n]\) is the left channel sample at index \(n\).
  * \(R[n]\) is the right channel sample at index \(n\).
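The formula above can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not code taken from Libopus; the function name `downmix_passive` and the sample values are hypothetical.

```python
def downmix_passive(left, right):
    """Average two equal-length channels sample by sample: M[n] = 0.5*L[n] + 0.5*R[n]."""
    if len(left) != len(right):
        raise ValueError("channels must have the same length")
    return [0.5 * l + 0.5 * r for l, r in zip(left, right)]


# Hypothetical samples in the range [-1.0, 1.0].
left = [0.5, -0.25, 1.0]
right = [0.5, 0.75, -1.0]

print(downmix_passive(left, right))  # [0.5, 0.25, 0.0]
```

The last sample shows the main weakness of a passive downmix: content that is equal in level but opposite in polarity cancels completely.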

Fixed-Point vs. Floating-Point Implementation

Depending on the build configuration of the Libopus library, this formula is implemented in one of two ways to optimize for CPU architecture:

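  1. Fixed-Point Math (for embedded systems): To avoid slow floating-point operations, the library works on integers and replaces the division by two with an arithmetic right shift by one bit: \[M[n] = (L[n] + R[n]) \gg 1\] The sum must be held in a type wider than the samples (for example 32 bits for 16-bit input) so that it cannot overflow, and an arithmetic shift rounds toward negative infinity rather than to the nearest value.
  2. Floating-Point Math (for modern desktop/mobile CPUs): The library directly multiplies the sum of the channels by a float value of 0.5f. Multiplying by 0.5 only changes the exponent of a float, so (underflow aside) the scaling itself introduces no rounding error.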
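The difference between the two variants can be seen in a short Python sketch. It is not Libopus code: the function names are hypothetical, and Python integers are unbounded, so the overflow that a C implementation has to guard against with a wider intermediate type cannot occur here.

```python
def downmix_fixed(left, right):
    """Integer downmix: add the samples, then arithmetic shift right by one bit."""
    # In C, the sum of two 16-bit samples needs a 32-bit intermediate.
    # Python's >> on a negative int is an arithmetic shift (floors the result).
    return [(l + r) >> 1 for l, r in zip(left, right)]


def downmix_float(left, right):
    """Floating-point downmix: scale the sum by 0.5."""
    return [0.5 * (l + r) for l, r in zip(left, right)]


# Hypothetical 16-bit sample values.
left = [32767, -3, 1]
right = [32767, 0, 0]

print(downmix_fixed(left, right))  # [32767, -2, 0]
print(downmix_float(left, right))  # [32767.0, -1.5, 0.5]
```

The second and third samples show the rounding behaviour: the shift turns -1.5 into -2 and 0.5 into 0, while the floating-point path keeps the half.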

Frequency-Domain Joint-Stereo Coding (CELT Layer)

Inside the Opus codec, particularly within the CELT (Constrained Energy Lapped Transform) layer, which codes the signal as MDCT coefficients grouped into frequency bands, stereo signals are not merely summed. To achieve high compression efficiency, the codec uses Mid/Side (M/S) stereo coding, coupled with band-by-band energy normalization.

1. Mid/Side (M/S) Transformation

For each frequency band, the left and right channels are converted into Mid (\(M\)) and Side (\(S\)) channels using the following orthogonal transformation:

\[M = \frac{L + R}{\sqrt{2}}\]

\[S = \frac{L - R}{\sqrt{2}}\]

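The Mid channel represents the mono downmix: it is the passive average \(0.5 \cdot (L + R)\) scaled by \(\sqrt{2}\), and it carries everything the two channels have in common. The Side channel contains the spatial differences. Because of the \(1/\sqrt{2}\) scaling, the transform preserves total energy and is its own inverse.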
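As a minimal sketch of the transform, the following Python applies the two formulas to one band of coefficients and then inverts them. The function names and the coefficient values are hypothetical; the code illustrates the mathematics rather than the Libopus source.

```python
import math

INV_SQRT2 = 1.0 / math.sqrt(2.0)


def to_mid_side(left, right):
    """Forward M/S transform: M = (L + R) / sqrt(2), S = (L - R) / sqrt(2)."""
    mid = [(l + r) * INV_SQRT2 for l, r in zip(left, right)]
    side = [(l - r) * INV_SQRT2 for l, r in zip(left, right)]
    return mid, side


def to_left_right(mid, side):
    """Inverse transform; with the 1/sqrt(2) scaling it has the same form."""
    left = [(m + s) * INV_SQRT2 for m, s in zip(mid, side)]
    right = [(m - s) * INV_SQRT2 for m, s in zip(mid, side)]
    return left, right


def energy(band):
    """Sum of squared coefficients."""
    return sum(x * x for x in band)


# Hypothetical band coefficients.
left = [1.0, 0.5, -0.25]
right = [1.0, -0.5, 0.25]

mid, side = to_mid_side(left, right)
left2, right2 = to_left_right(mid, side)

# Total energy is unchanged by the transform (up to rounding).
print(math.isclose(energy(left) + energy(right), energy(mid) + energy(side)))
```

Applying `to_left_right` to the output of `to_mid_side` returns the original channels up to floating-point rounding, which is what makes the representation lossless before quantization.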

2. Band-Wise Intensity Stereo and Spherical Coupling

To limit the phase-cancellation and comb-filtering artifacts common in simple passive downmixing, the CELT layer applies a band-wise intensity stereo algorithm with energy normalization.

For each frequency band, the energy of the Mid and Side channels is computed:

\[E_M = \sum_i M_i^2\]

\[E_S = \sum_i S_i^2\]

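Rather than discarding the Side channel outright at low bitrates, Libopus codes the shape of each band with Pyramid Vector Quantization (PVQ), an algebraic vector quantization scheme whose codewords are normalized onto a unit hypersphere. The relationship between the Mid and Side channels is preserved using an angle parameter, \(\theta\), computed from the ratio of their norms (the square roots of the band energies):

\[\theta = \arctan\left(\sqrt{\frac{E_S}{E_M}}\right)\]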
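The energy and angle computation can be sketched as follows. The sketch takes the arctangent of the ratio of band norms (the square roots of \(E_S\) and \(E_M\)), on the assumption that the angle is defined in the amplitude domain; the function names are hypothetical, and the cosine/sine gains at the end are an illustrative assumption about how such an angle splits a band between Mid and Side, not a transcription of the Libopus source.

```python
import math


def band_energy(band):
    """E = sum of squared coefficients in the band."""
    return sum(x * x for x in band)


def stereo_angle(mid, side):
    """Angle in [0, pi/2]: 0 means pure Mid, pi/2 means pure Side."""
    norm_mid = math.sqrt(band_energy(mid))
    norm_side = math.sqrt(band_energy(side))
    # atan2 avoids a division by zero when the Mid band is silent.
    return math.atan2(norm_side, norm_mid)


# Hypothetical band: Mid and Side both have norm 5, so the angle is pi/4.
mid = [3.0, 4.0]
side = [0.0, 5.0]

theta = stereo_angle(mid, side)
mid_gain, side_gain = math.cos(theta), math.sin(theta)  # assumed gain split

print(math.isclose(theta, math.pi / 4))
```

Using `atan2` instead of dividing the norms keeps the computation defined for a band with no Mid energy, where the angle is simply \(\pi/2\).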

In extremely low-bitrate scenarios where the Side channel is entirely suppressed (effectively downmixing the codec representation to mono), the encoder uses \(\theta = 0\). This collapses the band to the Mid channel alone, scaled by the combined energy of the two input channels:

\[M_{normalized} = \frac{L + R}{\sqrt{\|L\|^2 + \|R\|^2}}\]

Here \(L\) and \(R\) are the vectors of coefficients in the band, and \(\|L\|^2 = \sum_i L_i^2\) is the band energy of the left channel (likewise for \(R\)). The scaling ties the gain of the collapsed band to the energy of the two inputs, which helps compensate for level lost to partial cancellation between the channels. It cannot restore content that is exactly out of phase: when \(L = -R\) the numerator is zero regardless of the scaling.
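A small Python sketch makes the behaviour of the normalized Mid formula concrete. It reads \(L\) and \(R\) as vectors of band coefficients and the denominator as the square root of their combined band energy, which is an interpretive assumption; the function name `collapse_to_mid`, the `eps` guard, and the test vectors are hypothetical.

```python
import math


def collapse_to_mid(left, right, eps=1e-15):
    """Sum the band coefficients and scale by sqrt(E_L + E_R)."""
    energy = sum(l * l for l in left) + sum(r * r for r in right)
    norm = math.sqrt(energy) + eps  # eps guards against a silent band
    return [(l + r) / norm for l, r in zip(left, right)]


def norm_of(band):
    """Euclidean norm of a band."""
    return math.sqrt(sum(x * x for x in band))


# Channels with no overlap: the collapsed band has unit norm.
uncorrelated = collapse_to_mid([1.0, 0.0], [0.0, 1.0])
# Identical channels: the sum adds coherently, giving norm sqrt(2).
in_phase = collapse_to_mid([1.0, 0.0], [1.0, 0.0])
# Exactly opposite channels: the sum is zero before any scaling.
anti_phase = collapse_to_mid([1.0, 0.0], [-1.0, 0.0])

print(math.isclose(norm_of(uncorrelated), 1.0))
print(math.isclose(norm_of(in_phase), math.sqrt(2.0)))
print(anti_phase)  # [0.0, 0.0]
```

The three cases show what the scaling does and does not achieve: it normalizes the level against the input energy, but a band whose channels are exactly opposite still collapses to silence.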