Libopus Stereo to Mono Downmixing Algorithms
This article explores the specific mathematical algorithms and formulas used by the Libopus (Opus codec) library to downmix stereo audio channels into a single mono channel. It covers both the standard time-domain linear downmixing used during basic input/output conversions and the advanced frequency-domain joint-stereo representations employed internally within the SILK and CELT layers of the codec.
Passive Time-Domain Downmixing
When Libopus is instructed to encode a stereo input as mono, or decode a stereo stream to a mono output, it defaults to a standard passive time-domain downmix algorithm. Each channel is weighted by 0.5, so the sum can never exceed the range of the input samples (no digital clipping), and both channels contribute equally to the result.
The mathematical formula for this conversion is a simple linear combination of the left (\(L\)) and right (\(R\)) channels:
\[M[n] = 0.5 \cdot L[n] + 0.5 \cdot R[n]\]
Where:
- \(M[n]\) is the resulting mono sample at index \(n\).
- \(L[n]\) is the left channel sample at index \(n\).
- \(R[n]\) is the right channel sample at index \(n\).
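The formula above can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not code taken from Libopus. The helper names `deinterleave` and `downmix_passive` are hypothetical, and the interleaved sample layout (L0, R0, L1, R1, ...) is an assumption about how the stereo buffer is stored.

```python
def deinterleave(pcm):
    """Split interleaved stereo [L0, R0, L1, R1, ...] into (left, right)."""
    if len(pcm) % 2:
        raise ValueError("interleaved stereo needs an even sample count")
    return pcm[0::2], pcm[1::2]


def downmix_passive(left, right):
    """Average two equal-length channels: M[n] = 0.5*L[n] + 0.5*R[n]."""
    if len(left) != len(right):
        raise ValueError("channels must have the same length")
    return [0.5 * l + 0.5 * r for l, r in zip(left, right)]


# Three stereo frames with samples in the range [-1.0, 1.0].
left, right = deinterleave([1.0, 1.0, 0.5, -0.5, -1.0, 0.0])
mono = downmix_passive(left, right)
print(mono)  # [1.0, 0.0, -0.5]
```

The first frame shows that two full-scale in-phase samples stay at full scale, and the second shows that equal samples of opposite sign cancel to zero.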
Fixed-Point vs. Floating-Point Implementation
Depending on the build configuration of the Libopus library, this formula is implemented in one of two ways to optimize for CPU architecture:
- Fixed-Point Math (for embedded systems): To avoid slow floating-point operations, the library utilizes bitwise shifts. The division by two is executed using a bitwise right-shift operator: \[M[n] = (L[n] + R[n]) \gg 1\]
- Floating-Point Math (for modern desktop/mobile CPUs): The library directly multiplies the sum of the channels by a float value of 0.5f to maintain high precision and prevent signal degradation.
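The difference between the two variants can be seen in a small sketch. It assumes 16-bit signed input samples, and the function names are hypothetical rather than Libopus identifiers. Python integers never overflow, so the sketch only models the arithmetic. In C, the sum of two 16-bit samples needs a wider intermediate type, and right-shifting a negative signed value is implementation-defined.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def downmix_fixed(left, right):
    """Integer downmix: (L[n] + R[n]) >> 1."""
    out = []
    for l, r in zip(left, right):
        total = l + r            # may need 17 bits for 16-bit inputs
        out.append(total >> 1)   # arithmetic shift: rounds toward -infinity
    return out


def downmix_float(left, right):
    """Floating-point downmix: 0.5 * (L[n] + R[n])."""
    return [0.5 * (l + r) for l, r in zip(left, right)]


left = [32767, -32768, 3, -3]
right = [32767, -32768, 0, 0]
print(downmix_fixed(left, right))  # [32767, -32768, 1, -2]
print(downmix_float(left, right))  # [32767.0, -32768.0, 1.5, -1.5]
```

The last two frames show the cost of the shift: odd sums lose half a unit and are always rounded down, whereas the floating-point version keeps the fractional part.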
Frequency-Domain Joint-Stereo Coding (CELT Layer)
Inside the Opus codec, particularly within the CELT (Constrained Energy Lapped Transform) layer, which codes MDCT coefficients grouped into frequency bands, stereo signals are not merely summed. To achieve high compression efficiency, the codec uses a mathematical transform known as Mid/Side (M/S) stereo coding, coupled with band-by-band energy normalization.
1. Mid/Side (M/S) Transformation
For each frequency band, the left and right channels are converted into Mid (\(M\)) and Side (\(S\)) channels using the following orthogonal transformation:
\[M = \frac{L + R}{\sqrt{2}}\]
\[S = \frac{L - R}{\sqrt{2}}\]
The Mid channel represents the mono downmix, containing the sum of the in-phase signals, while the Side channel contains the spatial differences.
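A short sketch can check the two properties that make this transform useful: it is its own inverse, and it preserves total energy. The function names are hypothetical, and the sketch operates on plain lists of coefficients rather than on any Libopus data structure.

```python
import math

INV_SQRT2 = 1.0 / math.sqrt(2.0)


def lr_to_ms(left, right):
    """Forward transform: M = (L + R)/sqrt(2), S = (L - R)/sqrt(2)."""
    mid = [(l + r) * INV_SQRT2 for l, r in zip(left, right)]
    side = [(l - r) * INV_SQRT2 for l, r in zip(left, right)]
    return mid, side


def ms_to_lr(mid, side):
    """Inverse transform; the same butterfly applied to (M, S)."""
    left = [(m + s) * INV_SQRT2 for m, s in zip(mid, side)]
    right = [(m - s) * INV_SQRT2 for m, s in zip(mid, side)]
    return left, right


def energy(x):
    """Sum of squares of a coefficient vector."""
    return sum(v * v for v in x)


left = [0.5, -0.25, 1.0]
right = [0.5, 0.75, -1.0]
mid, side = lr_to_ms(left, right)
back_left, back_right = ms_to_lr(mid, side)

# Orthogonality: total energy is the same in both representations.
print(energy(left) + energy(right), energy(mid) + energy(side))
```

The \(1/\sqrt{2}\) scaling is what distinguishes this from the passive downmix: Mid is proportional to the average \(0.5 \cdot (L + R)\), but larger by a factor of \(\sqrt{2}\).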
2. Band-Wise Intensity Stereo and Spherical Coupling
To prevent phase cancellation and comb-filtering artifacts common in simple passive downmixing, the CELT layer applies a normalized band-wise intensity algorithm.
For each frequency band, the energy of the Mid and Side channels is computed:
\[E_M = \sum_i M_i^2\]
\[E_S = \sum_i S_i^2\]
Instead of discarding the Side channel completely during low-bitrate mono downmixing, Libopus uses an algebraic vector quantization scheme on a hypersphere (pyramid vector quantization, PVQ). The relationship between the Mid and Side channels is preserved using an angle parameter, \(\theta\):
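\[\theta = \arctan\left(\sqrt{\frac{E_S}{E_M}}\right)\]
The square root converts the band energies into amplitudes, so \(\theta = 0\) corresponds to a band that is pure Mid and \(\theta = \pi/2\) to a band that is pure Side.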
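The band energies and the angle can be computed as in the sketch below. It assumes the angle is measured between the band amplitudes, that is, the square roots of \(E_M\) and \(E_S\), and it uses `math.atan2` so that a band with zero Mid energy does not cause a division by zero. The function names are hypothetical, and no quantization of \(\theta\) is modeled.

```python
import math


def band_energy(coeffs):
    """Sum of squared coefficients in one frequency band."""
    return sum(c * c for c in coeffs)


def stereo_angle(mid, side):
    """Angle between the Mid and Side amplitudes of one band, in radians."""
    e_mid = band_energy(mid)
    e_side = band_energy(side)
    # atan2 handles e_mid == 0 (pure Side) without dividing by zero.
    return math.atan2(math.sqrt(e_side), math.sqrt(e_mid))


pure_mid = stereo_angle([3.0, 4.0], [0.0, 0.0])   # no Side content
pure_side = stereo_angle([0.0, 0.0], [3.0, 4.0])  # no Mid content
balanced = stereo_angle([3.0, 4.0], [4.0, 3.0])   # equal energies

print(pure_mid)                      # 0.0
print(pure_side / math.pi)           # approximately 0.5
print(balanced / math.pi)            # approximately 0.25
```

The three cases bracket the range of the parameter: a mono-like band sits at 0, a band with equal Mid and Side energy sits at \(\pi/4\), and a band containing only spatial difference sits at \(\pi/2\).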
In extremely low-bitrate scenarios where the Side channel is entirely suppressed (effectively downmixing the codec representation to mono), the encoder uses \(\theta = 0\). This mathematically collapses the signal to the Mid channel while adjusting the gain dynamically to preserve the perceived acoustic energy of the original stereo field:
\[M_{normalized} = \frac{L + R}{\sqrt{L^2 + R^2}}\]
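This rescaling keeps the level of the downmixed band stable when partial cancellation between the channels would otherwise attenuate it. It cannot recover a band in which the two channels are exactly out of phase: there \(L + R = 0\), so the numerator vanishes regardless of the gain.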
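The behavior of the normalized formula can be explored with a small sketch. It rests on one interpretive assumption: \(L^2\) and \(R^2\) in the denominator are read as band energies, summed over all coefficients of the band. The function name `normalized_mid` and the `eps` guard for silent bands are hypothetical, and this is not the Libopus implementation.

```python
import math


def normalized_mid(left, right, eps=1e-12):
    """Per-band (L + R) / sqrt(sum(L^2) + sum(R^2)), guarded against silence."""
    e_left = sum(l * l for l in left)
    e_right = sum(r * r for r in right)
    norm = math.sqrt(e_left + e_right)
    if norm < eps:
        return [0.0] * len(left)  # silent band: nothing to normalize
    return [(l + r) / norm for l, r in zip(left, right)]


def vector_norm(x):
    """Euclidean norm of a coefficient vector."""
    return math.sqrt(sum(v * v for v in x))


quiet = normalized_mid([0.3, 0.4], [0.3, 0.4])      # in phase, low level
loud = normalized_mid([3.0, 4.0], [3.0, 4.0])       # same shape, 10x louder
antiphase = normalized_mid([3.0, 4.0], [-3.0, -4.0])

print(vector_norm(quiet), vector_norm(loud))
print(antiphase)  # [0.0, 0.0]
```

The quiet and loud bands produce the same normalized output, which shows that the result depends on the shape of the band and not on its level. The antiphase band still comes out as zeros, because the gain is applied after the channels have been summed.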