Role of the Pitch Estimator in libopus SILK

This article explains the technical role of the internal pitch estimator within the SILK encoding layer of the libopus codec. It covers how the estimator detects the fundamental frequency of speech, drives Long-Term Prediction (LTP), and influences voicing decisions to optimize compression efficiency and voice quality.

The SILK layer in the Opus audio codec is specifically optimized for voice communication. At the core of SILK’s high efficiency is its internal pitch estimator, which analyzes incoming audio frames to identify the fundamental frequency (pitch) of voiced speech. Because human speech contains highly repetitive, periodic patterns during vowel sounds, identifying the exact spacing of these repetitions allows the encoder to compress the audio signal significantly.

Driving Long-Term Prediction (LTP)

The primary function of the pitch estimator is to guide the Long-Term Prediction (LTP) filter. While Short-Term Prediction (LPC) removes spectral redundancy caused by the vocal tract, the LTP filter removes the redundancy caused by the vibration of the vocal cords (pitch).

Once the pitch estimator has determined the pitch lag (the delay, in samples, between successive pitch periods), the LTP filter predicts the current samples from a scaled copy of the past excitation signal one lag earlier and subtracts that prediction. Instead of transmitting the entire waveform, SILK only needs to encode the residual, the difference between the predicted and the actual signal. During voiced speech this residual carries far less energy than the original, which reduces the required bitrate.
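The effect of the lag on the residual can be illustrated with a minimal sketch. It uses a single-tap predictor on a synthetic periodic signal, which is a simplification chosen for clarity and not the filter libopus uses; the function names, the 40-sample period, and the unit gain are all assumptions made for this example.

```python
import math


def ltp_residual(signal, lag, gain):
    """Return signal[n] - gain * signal[n - lag] for every n >= lag."""
    return [signal[n] - gain * signal[n - lag] for n in range(lag, len(signal))]


def energy(samples):
    """Sum of squared samples."""
    return sum(s * s for s in samples)


# Synthetic stand-in for a voiced excitation: a period of 40 samples
# (200 Hz at an assumed 8 kHz sample rate), fundamental plus one harmonic.
PERIOD = 40
signal = [
    math.sin(2 * math.pi * n / PERIOD) + 0.5 * math.sin(4 * math.pi * n / PERIOD)
    for n in range(320)
]

energy_in = energy(signal[PERIOD:])
energy_right_lag = energy(ltp_residual(signal, lag=PERIOD, gain=1.0))
energy_wrong_lag = energy(ltp_residual(signal, lag=PERIOD + 7, gain=1.0))

print(f"input energy:           {energy_in:.3f}")
print(f"residual, correct lag:  {energy_right_lag:.3e}")
print(f"residual, lag off by 7: {energy_wrong_lag:.3f}")
```

With the correct lag the residual energy collapses to rounding noise, while a lag that is off by a few samples leaves a residual comparable in size to the input. Real speech is only approximately periodic, so the gain in practice is smaller than in this idealized case.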

The Multi-Stage Pitch Search Process

To minimize CPU usage while maintaining high accuracy, the SILK pitch estimator operates in three distinct stages:

Coarse Search (4 kHz): The input signal is downsampled to 8 kHz and then decimated again to 4 kHz. A normalized cross-correlation over the allowed lag range at this low rate identifies a short list of candidate pitch lags, which sharply reduces the search space and the computational cost.
Intermediate Search (8 kHz): Only the lag regions that showed high correlation in the coarse search are re-evaluated at 8 kHz. At this stage the encoder also judges whether the frame is periodic enough to be treated as voiced.
Final Refinement (input rate): For the 12 kHz and 16 kHz internal rates (16 kHz being wideband SILK), the best lag is refined to sample accuracy at the original sampling rate, together with a pitch contour that lets the lag vary slightly from subframe to subframe. SILK pitch lags are integers. Because human pitch does not always align with sample boundaries, the remaining sub-sample misalignment is absorbed by the multi-tap LTP filter, whose taps around the lag act as an interpolator. This matters most for high-pitched voices, such as women’s and children’s speech, where one sample is a larger fraction of the pitch period.
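The coarse-to-fine idea behind the staged search can be sketched as follows. This is a simplified two-level illustration under stated assumptions, not libopus code: the block-averaging `decimate` helper stands in for a real resampler, the coarse rate is a parameter, the 2 ms to 18 ms lag range and the small long-lag penalty `bias_per_sample` are illustrative values, and every function name is hypothetical.

```python
import math


def normalized_correlation(x, start, length, lag):
    """Normalized cross-correlation of x[start:start+length] with itself `lag` samples earlier."""
    num = e_now = e_past = 0.0
    for n in range(start, start + length):
        num += x[n] * x[n - lag]
        e_now += x[n] * x[n]
        e_past += x[n - lag] * x[n - lag]
    den = math.sqrt(e_now * e_past)
    return num / den if den > 0.0 else 0.0


def score_lags(x, lags, max_lag, bias_per_sample):
    """Score each lag; the small penalty on long lags breaks ties between a period and its multiples."""
    start, length = max_lag, len(x) - max_lag
    return {
        lag: normalized_correlation(x, start, length, lag) - bias_per_sample * lag
        for lag in lags
    }


def decimate(x, factor):
    """Crude decimation by averaging groups of `factor` samples (not a proper resampler)."""
    return [sum(x[i:i + factor]) / factor for i in range(0, len(x) - factor + 1, factor)]


def coarse_to_fine_pitch(x, fs_hz, coarse_hz=4000, min_lag_ms=2, max_lag_ms=18,
                         num_candidates=3, bias_per_sample=1e-5):
    """Estimate an integer pitch lag (in samples at fs_hz) with a two-level search."""
    factor = fs_hz // coarse_hz
    lo_c, hi_c = min_lag_ms * coarse_hz // 1000, max_lag_ms * coarse_hz // 1000
    lo_f, hi_f = min_lag_ms * fs_hz // 1000, max_lag_ms * fs_hz // 1000

    # Coarse stage: search every lag, but on the cheap decimated signal.
    coarse_scores = score_lags(decimate(x, factor), range(lo_c, hi_c + 1), hi_c, bias_per_sample)
    candidates = sorted(coarse_scores, key=coarse_scores.get, reverse=True)[:num_candidates]

    # Fine stage: search only a narrow window around each candidate at the full rate.
    fine_lags = set()
    for candidate in candidates:
        center = candidate * factor
        fine_lags.update(range(max(lo_f, center - factor), min(hi_f, center + factor) + 1))
    fine_scores = score_lags(x, sorted(fine_lags), hi_f, bias_per_sample)
    return max(fine_scores, key=fine_scores.get)


def harmonic_signal(period, length):
    """Synthetic periodic test signal with three harmonics."""
    partials = ((1, 1.0), (2, 0.5), (3, 0.3))
    return [sum(a * math.sin(2 * math.pi * h * n / period) for h, a in partials)
            for n in range(length)]


# 40 ms of a 160 Hz tone complex at 16 kHz: the true period is 100 samples.
estimated_lag = coarse_to_fine_pitch(harmonic_signal(100, 640), fs_hz=16000)
print(estimated_lag)
```

The fine stage evaluates a few dozen lags instead of the full 2 ms to 18 ms range, which is where the savings come from. The sketch is only exercised on clean synthetic signals whose period is a multiple of the decimation factor; a production estimator needs further safeguards against octave errors and noise.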

Voicing Decisions and Parameter Quantization

Beyond finding the pitch lag, the pitch estimator measures the “pitch correlation”, or voicing strength: how closely the signal matches itself one pitch lag earlier. If the correlation is high, the frame is classified as voiced, and the encoder activates the 5-tap LTP filter to exploit the periodicity. If the correlation is low, the frame is treated as unvoiced (for example whispered speech or fricatives such as “s” and “f”), and the LTP filter is bypassed. No pitch lag or LTP coefficients are transmitted for such a frame, which saves bits, and its noise-like excitation is coded without long-term prediction.
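The voiced/unvoiced branch and the 5-tap filter can be sketched like this. The sketch assumes the taps are centered on the pitch lag, covering the samples from `lag - 2` to `lag + 2` back; the 0.6 threshold, the tap values, and all function names are illustrative assumptions, not values or identifiers taken from libopus.

```python
import math


def is_voiced(pitch_correlation, threshold=0.6):
    """Classify a frame from its pitch correlation. The 0.6 threshold is illustrative only."""
    return pitch_correlation >= threshold


def ltp_filter_residual(history, frame, lag, taps):
    """Subtract a 5-tap long-term prediction, centered `lag` samples back, from each frame sample."""
    if len(taps) != 5:
        raise ValueError("expected 5 LTP taps")
    if lag < 3 or len(history) < lag + 2:
        raise ValueError("lag too small or history too short for this lag")
    buf = list(history) + list(frame)
    residual = []
    for n in range(len(history), len(buf)):
        # The taps cover buf[n - lag + 2] down to buf[n - lag - 2].
        prediction = sum(taps[k] * buf[n - lag + 2 - k] for k in range(5))
        residual.append(buf[n] - prediction)
    return residual


def process_frame(history, frame, lag, taps, pitch_correlation):
    """Return (voiced, residual). Unvoiced frames bypass the LTP filter unchanged."""
    if is_voiced(pitch_correlation):
        return True, ltp_filter_residual(history, frame, lag, taps)
    return False, list(frame)


# Synthetic example: a sinusoid with a 40-sample period, split into history and frame.
PERIOD = 40
signal = [math.sin(2 * math.pi * n / PERIOD) for n in range(200)]
history, frame = signal[:120], signal[120:]
center_tap_only = [0.0, 0.0, 1.0, 0.0, 0.0]

voiced_flag, voiced_residual = process_frame(history, frame, PERIOD, center_tap_only, 0.95)
unvoiced_flag, unvoiced_residual = process_frame(history, frame, PERIOD, center_tap_only, 0.10)

print(voiced_flag, max(abs(r) for r in voiced_residual))
print(unvoiced_flag, unvoiced_residual == frame)
```

With a perfectly periodic input and only the center tap set, the voiced residual is essentially zero. Spreading weight onto the neighboring taps is what lets a filter of this shape track pitch periods that fall between integer lags.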

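The pitch estimator directly affects the final bitstream. For voiced frames it outputs the pitch lag index and a pitch contour index, and a separate LTP analysis step uses the resulting lags to compute the LTP coefficients. These parameters are then quantized and entropy-encoded, so an accurate pitch estimate translates directly into better voice quality at a given bandwidth.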