How Libopus Decides Mid-Side Stereo Coding

This article explains how the libopus encoder chooses between Left/Right (L/R) stereo and Mid-Side (M/S) stereo coding. It covers how SILK and CELT, the encoder’s two coding engines, use channel correlation, energy distribution, and bitrate constraints to balance compression efficiency against audio quality.

The Opus audio codec is a hybrid format containing two distinct technologies: SILK (optimized for voice) and CELT (optimized for general audio and music). Because these two engines operate differently, the libopus encoder employs two distinct algorithmic strategies to decide when to use Mid-Side (M/S) stereo coding.

The SILK Layer Decision Process

For speech and low-bitrate scenarios, the SILK engine determines whether to use M/S coding by analyzing the correlation between the left and right input channels.

  1. Correlation Estimation: SILK measures how strongly the left and right channels are correlated, which is closely related to how well the Side signal can be predicted from the Mid signal. If the channels are highly correlated (meaning the left and right channels are very similar), the Side signal carries little energy and M/S coding is highly efficient.
  2. Predictive Weighting: Instead of a simple matrix addition and subtraction, SILK uses a prediction-based M/S scheme. It codes the Mid channel as the primary signal, predicts the Side channel from the Mid channel using a short prediction filter, and codes only what the prediction misses. The prediction weights are sent in the bitstream, so a high prediction gain means the remaining Side residual is cheap to code. If the channels are largely independent, the prediction gain is low and the Side residual needs correspondingly more bits; at low bitrates the encoder can respond by narrowing the stereo image, down to coding the Mid channel alone.
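
The prediction step above can be illustrated with a minimal Python sketch. It assumes a single-tap least-squares predictor and the convention M = (L+R)/2, S = (L-R)/2. The function names and the single-tap simplification are assumptions made for illustration and are not the libopus implementation.

```python
import math


def mid_side(left, right):
    """Convert L/R sample lists to Mid and Side (assumed (L+R)/2, (L-R)/2 convention)."""
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    return mid, side


def predict_side_from_mid(mid, side):
    """Fit a single weight w that minimizes sum((side - w * mid)^2).

    Returns (w, prediction_gain_db), where the gain compares the Side energy
    with the energy left over after subtracting the prediction w * mid.
    """
    mm = sum(m * m for m in mid)
    sm = sum(s * m for s, m in zip(side, mid))
    ss = sum(s * s for s in side)
    if mm == 0.0 or ss == 0.0:
        return 0.0, 0.0  # nothing to predict from, or nothing to predict
    w = sm / mm
    residual = ss - w * sm  # equals ss - sm^2 / mm for the optimal w
    residual = max(residual, 1e-12 * ss)  # clamp so the gain stays finite
    gain_db = 10.0 * math.log10(ss / residual)
    return w, gain_db


# Hypothetical usage: right is a scaled copy of left, so Side is fully predictable.
mid, side = mid_side([1.0, 2.0, 3.0, 4.0], [0.5, 1.0, 1.5, 2.0])
w, gain_db = predict_side_from_mid(mid, side)
```

A large gain means the Side channel is almost fully described by the weight and the Mid channel, so very little residual has to be coded. A gain near 0 dB means the prediction removes nothing and the Side channel costs about as much as an independent channel.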

The CELT Layer Decision Process

For music, high-fidelity audio, and high-bandwidth modes, the CELT engine operates in the frequency domain using the Modified Discrete Cosine Transform (MDCT). CELT’s stereo handling is granular: much of it operates on a per-frequency-band basis.

  1. Sum and Difference Energy Analysis: CELT divides the spectrum into bands whose widths roughly follow the Bark scale of critical bands. For each band, the encoder calculates the energy of the Mid (\(M = (L+R)/\sqrt{2}\)) and Side (\(S = (L-R)/\sqrt{2}\)) signals, where \(L\) and \(R\) are the band’s MDCT coefficients.
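  2. Stereo Coupling (Rotation): CELT handles M/S coding using “stereo coupling.” Instead of just coding M and S independently, it codes the total energy of the band and an angle (\(\theta\)) representing the ratio of the Side amplitude to the Mid amplitude in that band (\(\theta = \arctan(\lVert S \rVert / \lVert M \rVert)\), where \(\lVert \cdot \rVert\) is the square root of the band energy). A small \(\theta\) means the band is almost entirely Mid, so few bits need to be spent on Side.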
  3. Correlation Thresholding: The encoder estimates how similar the MDCT coefficients of the Left and Right channels are. If the similarity is above a dynamically calculated threshold, coupled stereo (M/S) is used. If it is low (indicating a wide stereo field or independent signals), dual stereo (L/R) is used. Note that in the Opus bitstream the dual-stereo choice is signaled with a single flag per frame, whereas the angle \(\theta\) is coded per band.
  4. Transient Handling: If a transient (a sudden spike in energy) is detected in only one channel, the encoder may temporarily disable M/S coding for those frames. This prevents pre-echo artifacts from leaking from one channel to the other.
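
The band analysis in steps 1 to 3 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the libopus code: the function names, the use of a plain normalized cross-correlation, and the fixed threshold of 0.5 are all hypothetical, and the real encoder derives its threshold dynamically.

```python
import math


def band_stereo_stats(left, right):
    """Compute Mid/Side energies, the coupling angle, and L/R correlation.

    left, right: MDCT coefficients of one band (or any span of coefficients).
    """
    inv_sqrt2 = 1.0 / math.sqrt(2.0)
    mid = [(l + r) * inv_sqrt2 for l, r in zip(left, right)]
    side = [(l - r) * inv_sqrt2 for l, r in zip(left, right)]
    e_mid = sum(m * m for m in mid)
    e_side = sum(s * s for s in side)
    # theta = arctan(||S|| / ||M||); atan2 also handles a silent Mid signal
    theta = math.atan2(math.sqrt(e_side), math.sqrt(e_mid))
    e_left = sum(l * l for l in left)
    e_right = sum(r * r for r in right)
    cross = sum(l * r for l, r in zip(left, right))
    denom = math.sqrt(e_left * e_right)
    corr = cross / denom if denom > 0.0 else 0.0
    return {"e_mid": e_mid, "e_side": e_side, "theta": theta, "corr": corr}


def choose_stereo_mode(corr, threshold=0.5):
    """Pick coupled (M/S) or dual (L/R) stereo; the threshold is a hypothetical constant."""
    return "coupled" if corr > threshold else "dual"


# Hypothetical usage: identical channels give theta = 0 and correlation 1.
stats = band_stereo_stats([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mode = choose_stereo_mode(stats["corr"])
```

With identical channels all energy lands in Mid and \(\theta\) is 0. With uncorrelated channels of equal energy, Mid and Side carry equal energy and \(\theta\) is \(\pi/4\), which is the case where coupling saves nothing.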

The Influence of Bitrate Constraints

The available bitrate heavily influences the decision threshold for both SILK and CELT: