Libopus DTX Behavior with Absolute Silence

This article explains how the libopus audio codec processes absolute digital silence when Discontinuous Transmission (DTX) is enabled. It covers the transition from active streaming to the DTX state, how often packets are still sent during silence, and how the decoder fills the gaps between them. Together these reduce bandwidth dramatically without tearing down the stream.

Absolute digital silence here means a continuous stream of pulse-code modulation (PCM) samples that are all exactly zero. When such input is fed into the libopus encoder with DTX enabled through the OPUS_SET_DTX encoder control, the encoder moves through several stages before it stops producing packets that need to be transmitted.
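As a minimal illustration of that definition (a sketch only, not the check libopus performs internally), the following Python function tests whether a frame of 16-bit PCM is digitally silent. The function name and the 960-sample frame (20 ms of mono audio at 48 kHz) are assumptions chosen for the example.

```python
from array import array


def is_digital_silence(pcm):
    """Return True when every sample in the frame is exactly zero."""
    return not any(pcm)


# One 20 ms mono frame at 48 kHz holds 960 samples (illustrative choice).
silent_frame = array("h", [0] * 960)
# A single non-zero sample is enough to break digital silence.
almost_silent_frame = array("h", [0] * 959 + [1])

print(is_digital_silence(silent_frame))
print(is_digital_silence(almost_silent_frame))
```

Even very quiet dither or microphone noise fails this test, which is why absolute silence is a special case rather than the typical DTX input.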

The Initial Transition Phase

The encoder does not stop transmitting the moment silence begins. After the first silent frame it keeps producing regular audio packets for a short hangover period. In recent libopus releases this is on the order of 200 milliseconds (ten 20 ms frames), though the exact duration is an implementation detail that can vary with the library version, coding mode, and frame size. The hangover gives the encoder’s Voice Activity Detection (VAD) logic time to confirm that the signal is stably inactive rather than in a brief pause between words.
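The confirmation window can be pictured as a counter of accumulated inactive time. The sketch below is a simplified model, not libopus code: the function name is hypothetical, and the default 200 ms threshold is an assumption based on the ten-frame hangover used in recent libopus sources.

```python
def regular_frames_before_dtx(frame_ms, hangover_ms=200):
    """Count inactive frames still sent as regular packets before DTX starts.

    Models a counter that accumulates inactive time and switches to DTX
    only once the total strictly exceeds hangover_ms (an assumed threshold).
    """
    if frame_ms <= 0:
        raise ValueError("frame_ms must be positive")
    inactive_ms = 0.0
    sent = 0
    while True:
        inactive_ms += frame_ms
        if inactive_ms > hangover_ms:
            return sent          # this frame would be the first DTX frame
        sent += 1                # still inside the hangover window


for frame_ms in (10, 20, 40, 60):
    n = regular_frames_before_dtx(frame_ms)
    print(f"{frame_ms} ms frames: {n} regular packets, {n * frame_ms} ms of audio")
```

Under these assumptions, frame sizes that do not divide the threshold evenly (such as 60 ms) give a slightly shorter window, which is one reason the observed hangover can differ between configurations.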

Entering the DTX State

Once the inactivity has lasted longer than the hangover period, the encoder enters the DTX state. For ordinary quiet input, this decision depends on the VAD classifying the background as inactive noise. Absolute digital silence is the simplest case: the input has zero energy, so there is no background noise to characterize, and recent libopus releases also detect all-zero input directly and treat it as inactive.

Packet Reduction and Comfort Noise (CNG)

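Once in the DTX state, libopus stops producing a regular audio packet for every frame. For frames that fall inside a DTX period, opus_encode returns a packet of 2 bytes or fewer, which the libopus encoder documentation says does not need to be transmitted; the application is responsible for skipping it. Opus does not define a dedicated comfort noise packet type. Instead, the encoder periodically emits an ordinarily encoded frame that keeps the stream alive and refreshes the decoder’s state.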
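On the sending side, the libopus encoder documentation notes that an encoded packet of 2 bytes or fewer does not need to be transmitted when DTX is in use. A minimal sketch of that filtering step follows; the function name and the example payload bytes are hypothetical, and the packets are plain byte strings rather than real encoder output.

```python
def filter_for_transmission(encoded_packets):
    """Drop packets of 2 bytes or fewer, which signal DTX and need not be sent."""
    return [p for p in encoded_packets if len(p) > 2]


# Hypothetical encoder output: one regular packet, two DTX placeholders,
# then a small refresh packet. The byte values are made up for illustration.
encoded = [
    b"\x78" + bytes(59),     # regular 60-byte packet
    b"\x78",                 # 1-byte DTX placeholder
    b"\x78",                 # 1-byte DTX placeholder
    b"\x78\x00\x01\x02",     # small refresh packet
]

to_send = filter_for_transmission(encoded)
print(len(to_send))
```

Keeping this decision in the application means the transport layer, not the codec, controls whether anything goes on the wire during silence.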

With 20 ms frames, packets are no longer sent every 20 milliseconds; instead, a refresh packet is emitted approximately once every 400 milliseconds (about every 20 frames). For absolute silence these refresh packets encode a zero-energy signal and, with variable bitrate, are typically very small. The stream’s effective payload bitrate can drop from its active setting to roughly 0.1 kbps or less, depending on the size of the refresh packets; transport header overhead is additional.
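The size of the saving follows from simple arithmetic. The sketch below compares payload bitrates under assumed packet sizes: 60 bytes per 20 ms packet corresponds to a 24 kbps active stream, and the 4-byte packet sent every 400 ms during silence is an illustrative assumption, not a measured value.

```python
def payload_kbps(packet_bytes, interval_ms):
    """Average payload bitrate in kbit/s for one packet every interval_ms."""
    return packet_bytes * 8 / interval_ms   # bits per millisecond equals kbit/s


active = payload_kbps(60, 20)   # 60-byte packet every 20 ms
dtx = payload_kbps(4, 400)      # assumed 4-byte packet every 400 ms

print(f"active: {active:.2f} kbps, DTX: {dtx:.2f} kbps")
# prints "active: 24.00 kbps, DTX: 0.08 kbps"
```

These figures count codec payload only. Per-packet transport headers are usually larger than a packet this small, so the on-wire rate during silence is dominated by header overhead.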

Decoder Interpretation

The decoder is not told explicitly that the encoder has entered DTX. The receiving application notices that no packet has arrived for a given frame interval and, for each missing frame, calls the decoder in its loss-concealment mode by passing a NULL payload to opus_decode. The concealment logic generates comfort noise modeled on the most recently decoded audio. Because the frames decoded before and during the DTX period represent zero-energy input, the output remains silent or very nearly so, with no artificial background hiss added, and closely mirrors the silence at the input.
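The receiver-side behavior can be sketched with a toy playout loop. Everything here is hypothetical: ToyDecoder is a stand-in that treats payloads as raw PCM and conceals a missing frame by fading the previous one. It is not the libopus API or its actual concealment algorithm, and it only illustrates why silence before a gap yields silence during the gap.

```python
FRAME_SAMPLES = 960  # 20 ms at 48 kHz, mono (illustrative choice)


class ToyDecoder:
    """Hypothetical stand-in for a codec decoder; not the libopus API."""

    def __init__(self):
        self.last_frame = [0] * FRAME_SAMPLES

    def decode(self, payload):
        if payload is None:
            # Concealment: derive the output from the previous frame.
            # Halving toward zero means silence stays silence.
            frame = [int(s / 2) for s in self.last_frame]
        else:
            # Toy "codec": the payload is already a list of PCM samples.
            frame = list(payload)
        self.last_frame = frame
        return frame


def play_out(slots):
    """Decode one frame per 20 ms slot; None marks a slot with no packet."""
    decoder = ToyDecoder()
    return [decoder.decode(payload) for payload in slots]


# A silent packet followed by two slots in which nothing arrived.
frames = play_out([[0] * FRAME_SAMPLES, None, None])
print(all(sample == 0 for frame in frames for sample in frame))
```

In a real receiver the None slots would correspond to decoder calls made without a payload, and the comfort noise level would follow whatever the decoder last heard rather than a fixed fade.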

Resuming Active Transmission

When the input changes from absolute silence back to active audio, the encoder’s activity detection resets its inactivity counter on the first frame it classifies as active. The encoder leaves the DTX state and resumes producing regular compressed audio packets starting with that frame. DTX itself therefore adds no extra delay, and the start of the new audio segment is not clipped, provided the VAD recognizes the onset promptly.
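The whole cycle (hangover, DTX, periodic refresh, and immediate resumption) can be modeled as a small per-frame state machine. This is a simplified sketch under stated assumptions, not libopus code: the class name is hypothetical, and the 200 ms hangover and 400 ms maximum DTX run are assumed defaults.

```python
class DtxGate:
    """Toy per-frame DTX decision; thresholds are assumptions, not libopus API."""

    def __init__(self, frame_ms=20, hangover_ms=200, max_dtx_ms=400):
        self.frame_ms = frame_ms
        self.hangover_ms = hangover_ms
        self.max_dtx_ms = max_dtx_ms
        self.inactive_ms = 0

    def should_transmit(self, active):
        """Return True if this frame should be sent as a regular packet."""
        if active:
            self.inactive_ms = 0              # activity resets the counter at once
            return True
        self.inactive_ms += self.frame_ms
        if self.inactive_ms <= self.hangover_ms:
            return True                       # still inside the hangover window
        if self.inactive_ms <= self.hangover_ms + self.max_dtx_ms:
            return False                      # DTX: nothing needs to be sent
        self.inactive_ms = self.hangover_ms   # send a refresh, then resume DTX
        return True


gate = DtxGate()
# 40 silent frames (800 ms) followed by one active frame.
pattern = [gate.should_transmit(False) for _ in range(40)]
pattern.append(gate.should_transmit(True))
print("".join("T" if sent else "." for sent in pattern))
```

In the printed pattern, T marks a transmitted frame and a dot marks a skipped one. The final T shows that the first active frame is sent without waiting for the DTX run to finish.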