Turn Detector Design
    from abc import ABC, abstractmethod
    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional


    class TurnDetectionAction(Enum):
        DO_NOTHING = 1
        STOP_SPEAKING = 2
        START_GENERATION = 3


    class TurnDetectionSemantic(Enum):
        IDLE = "idle"
        INCOMPLETE = "incomplete"
        COMPLETE = "complete"
        WAIT = "wait"
        BACKCHANNEL = "backchannel"
        SHOULD_BACKCHANNEL = "should_backchannel"


    class TurnVADResult(Enum):
        SPEECH = 1
        SILENCE = 2


    @dataclass(frozen=True)
    class TurnDetectionResult:
        action: TurnDetectionAction
        semantic: TurnDetectionSemantic
        vad_result: TurnVADResult | None = None


    class TurnDetector(ABC):
        """Abstract interface for turn-taking detectors."""

        @property
        def listening(self) -> bool:
            ...

        @listening.setter
        def listening(self, value: bool) -> None:
            ...

        def listening_lock(self, is_async: bool = True):
            ...

        @abstractmethod
        def detect(
            self,
            audio: Optional[bytes] = None,
            text: Optional[str] = None,
            speech_start: bool = False,
            speech_pause: Optional[bool] = None,
        ) -> TurnDetectionResult:
            ...

        async def async_detect(
            self,
            audio: Optional[bytes] = None,
            text: Optional[str] = None,
            speech_start: bool = False,
            speech_pause: Optional[bool] = None,
        ) -> TurnDetectionResult:
            ...

        @abstractmethod
        def clone(self) -> "TurnDetector":
            ...
Best Practice for Implementing detect
The framework actually calls async_detect. Therefore, the best practice when implementing a new turn detector is to implement async_detect first, and then wrap it synchronously in detect.
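    import asyncio

    def detect(
        self,
        audio: Optional[bytes] = None,
        text: Optional[str] = None,
        speech_start: bool = False,
        speech_pause: Optional[bool] = None,
    ) -> TurnDetectionResult:
        # Synchronous wrapper around async_detect. asyncio.run starts a new
        # event loop and raises RuntimeError if a loop is already running in
        # this thread, so call detect only from synchronous code.
        return asyncio.run(
            self.async_detect(audio, text, speech_start, speech_pause)
        )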
If the underlying implementation is already synchronous, you may also implement detect directly and reuse the default async_detect wrapper from the base class.
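A minimal sketch of this second route follows. The real default async_detect is not shown in this document, so the wrapper below is only an assumed shape: it offloads the blocking call to a worker thread with asyncio.to_thread. KeywordDetector and its punctuation rule are hypothetical, and plain strings stand in for TurnDetectionResult to keep the example short.

```python
import asyncio
from typing import Optional


class KeywordDetector:
    """Hypothetical synchronous detector standing in for a TurnDetector subclass."""

    def detect(
        self,
        audio: Optional[bytes] = None,
        text: Optional[str] = None,
        speech_start: bool = False,
        speech_pause: Optional[bool] = None,
    ) -> str:
        # Toy rule: a pause after text that ends in sentence punctuation ends the turn.
        if speech_pause and text and text.rstrip().endswith((".", "?", "!")):
            return "START_GENERATION"
        return "DO_NOTHING"

    async def async_detect(
        self,
        audio: Optional[bytes] = None,
        text: Optional[str] = None,
        speech_start: bool = False,
        speech_pause: Optional[bool] = None,
    ) -> str:
        # Assumed default behaviour: run the blocking detect off the event loop.
        return await asyncio.to_thread(
            self.detect, audio, text, speech_start, speech_pause
        )


result = asyncio.run(
    KeywordDetector().async_detect(text="What time is it?", speech_pause=True)
)
print(result)
```

Offloading to a thread keeps a slow synchronous detector from stalling the event loop that the framework uses to call async_detect.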
async_detect
TurnDetector can consume audio signals, ASR text, and VAD side signals at the same time. Each call should return a TurnDetectionResult representing the turn-taking judgment at the current moment.
Input Parameters
- audio: The current audio frame, as PCM 16-bit, mono, 16 kHz bytes.
- text: The ASR text accumulated so far for the current turn.
- speech_start: A signal passed in when VAD has just detected speech start.
- speech_pause: A signal passed in when the user may currently be pausing, usually used together with text.
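Because the audio format is fixed, the duration of a frame follows directly from its byte length. The helper below is an illustrative sketch; frame_duration_ms is a hypothetical name, not part of the framework.

```python
SAMPLE_RATE_HZ = 16_000  # 16 kHz
BYTES_PER_SAMPLE = 2     # 16-bit PCM
CHANNELS = 1             # mono


def frame_duration_ms(audio: bytes) -> float:
    """Return the duration of a PCM frame in milliseconds."""
    bytes_per_frame = BYTES_PER_SAMPLE * CHANNELS
    if len(audio) % bytes_per_frame != 0:
        raise ValueError("audio length is not a whole number of 16-bit samples")
    samples = len(audio) // bytes_per_frame
    return samples * 1000 / SAMPLE_RATE_HZ


silence = b"\x00\x00" * 320         # 320 samples of silence
print(frame_duration_ms(silence))   # 20.0
```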
Typical combinations of these inputs are:
- Only audio: pure audio-based detection path
- Only text and speech_pause: text-semantic detection path
- Only speech_start=True: notify the detector that the current speaking turn has started
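The three combinations above can be told apart from the arguments alone. The sketch below shows one way a detector might route a call; classify_inputs and its return labels are hypothetical, and how mixed inputs are handled is left open here because the document does not specify it.

```python
from typing import Optional


def classify_inputs(
    audio: Optional[bytes] = None,
    text: Optional[str] = None,
    speech_start: bool = False,
    speech_pause: Optional[bool] = None,
) -> str:
    """Name the detection path implied by the arguments (hypothetical helper)."""
    if speech_start and audio is None and text is None:
        return "turn_start"      # only speech_start=True
    if audio is not None and text is None:
        return "audio"           # only audio
    if text is not None and audio is None:
        return "text"            # text, normally accompanied by speech_pause
    return "mixed_or_empty"      # anything else is left to the detector


print(classify_inputs(audio=b"\x00\x00" * 320))                 # audio
print(classify_inputs(text="turn it off", speech_pause=True))   # text
print(classify_inputs(speech_start=True))                       # turn_start
```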
The two representative implementations in the current repository correspond to two different paths:
- SoulxDuplug: primarily audio-based, with fallback on text pause signals
- LLMTurnDetector: primarily text-semantic, mainly relying on text and speech_pause
Return Value
The return value is TurnDetectionResult, consisting of three parts:
- action: the action the service layer should execute immediately
- semantic: the semantic interpretation of the current conversation state
- vad_result: an optional VAD result, used only when the detector also proxies VAD state
Meaning of TurnDetectionAction
- DO_NOTHING: do not trigger any extra action at the moment
- STOP_SPEAKING: the system should interrupt its ongoing spoken output
- START_GENERATION: the system should start generating a response
In practice:
- STOP_SPEAKING is generally used when the user interrupts the system while it is speaking
- START_GENERATION is generally used when the user is determined to have finished speaking and the system can start answering
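As an illustration of how a service layer might act on these values, the sketch below dispatches on the action. FakeService, stop_tts, start_generation, and apply_action are hypothetical stand-ins; the real service layer is not described in this document.

```python
from enum import Enum


class TurnDetectionAction(Enum):
    DO_NOTHING = 1
    STOP_SPEAKING = 2
    START_GENERATION = 3


class FakeService:
    """Hypothetical service layer that records what it was asked to do."""

    def __init__(self) -> None:
        self.events: list[str] = []

    def stop_tts(self) -> None:
        self.events.append("tts_stopped")

    def start_generation(self) -> None:
        self.events.append("generation_started")


def apply_action(action: TurnDetectionAction, service: FakeService) -> None:
    """Execute the immediate behaviour requested by the detector."""
    if action is TurnDetectionAction.STOP_SPEAKING:
        service.stop_tts()            # user barged in while the system was speaking
    elif action is TurnDetectionAction.START_GENERATION:
        service.start_generation()    # user finished; begin answering
    # DO_NOTHING: nothing to execute


service = FakeService()
for action in (
    TurnDetectionAction.DO_NOTHING,
    TurnDetectionAction.STOP_SPEAKING,
    TurnDetectionAction.START_GENERATION,
):
    apply_action(action, service)
print(service.events)  # ['tts_stopped', 'generation_started']
```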
Meaning of TurnDetectionSemantic
- IDLE: there is no clear turn-advancing signal at the moment
- INCOMPLETE: the user is still continuing the current turn and has not finished
- COMPLETE: the user's current input is semantically complete
- WAIT: the user explicitly expresses a waiting intent
- BACKCHANNEL: the input is a short acknowledgment and should not count as turn completion
- SHOULD_BACKCHANNEL: the current state suggests that the system may produce a backchannel
semantic is mainly used to express the detector's semantic judgment, while action determines the immediate service-layer behavior. They are related, but not equivalent.
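One way to keep the two fields from contradicting each other is a small check run before returning a result. The sketch below is an assumption rather than framework behaviour: it rejects only START_GENERATION paired with a semantic whose definition says the user's turn is not over. is_consistent is a hypothetical helper.

```python
from enum import Enum


class TurnDetectionAction(Enum):
    DO_NOTHING = 1
    STOP_SPEAKING = 2
    START_GENERATION = 3


class TurnDetectionSemantic(Enum):
    IDLE = "idle"
    INCOMPLETE = "incomplete"
    COMPLETE = "complete"
    WAIT = "wait"
    BACKCHANNEL = "backchannel"
    SHOULD_BACKCHANNEL = "should_backchannel"


# Semantics that, by definition, mean the user's turn has not been completed.
_TURN_NOT_FINISHED = {
    TurnDetectionSemantic.INCOMPLETE,
    TurnDetectionSemantic.WAIT,
    TurnDetectionSemantic.BACKCHANNEL,
}


def is_consistent(
    action: TurnDetectionAction, semantic: TurnDetectionSemantic
) -> bool:
    """Illustrative policy: never start generating on an unfinished turn."""
    if action is TurnDetectionAction.START_GENERATION:
        return semantic not in _TURN_NOT_FINISHED
    return True


print(is_consistent(TurnDetectionAction.START_GENERATION, TurnDetectionSemantic.COMPLETE))
print(is_consistent(TurnDetectionAction.START_GENERATION, TurnDetectionSemantic.BACKCHANNEL))
```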
Meaning of vad_result
vad_result is optional and is only used when the turn detector also serves as a VAD proxy.
- TurnVADResult.SPEECH: the current state is speaking
- TurnVADResult.SILENCE: the current state is silence
When the pipeline does not have an independent VAD configured, the system uses this field to trigger VAD behavior. If a frontend or backend VAD is configured, this field is ignored.
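The precedence rule can be sketched as a single function. resolve_vad and the independent_vad_configured flag are hypothetical names; how the pipeline actually records its VAD configuration is not described here.

```python
from enum import Enum
from typing import Optional


class TurnVADResult(Enum):
    SPEECH = 1
    SILENCE = 2


def resolve_vad(
    detector_vad: Optional[TurnVADResult],
    independent_vad_configured: bool,
) -> Optional[TurnVADResult]:
    """Return the detector's VAD state if the pipeline should act on it, else None."""
    if independent_vad_configured:
        return None          # a frontend or backend VAD exists: ignore the field
    return detector_vad      # no independent VAD: the detector acts as the proxy


print(resolve_vad(TurnVADResult.SPEECH, independent_vad_configured=False))
print(resolve_vad(TurnVADResult.SPEECH, independent_vad_configured=True))
```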
Meaning of listening
The TurnDetector base class includes a built-in listening state and its locks. This state is used to distinguish whether the detector is currently "listening for the user to finish input" or "listening for whether the user interrupts system output".
Common conventions are:
- listening = True: the system is waiting for the user to finish, and the detector should decide when to emit START_GENERATION
- listening = False: the system is playing output, and the detector should decide whether the user's input should trigger STOP_SPEAKING
In the service layer, TurnDetectorManager sets listening to False when TTS starts playback, and restores it to True when playback finishes or is interrupted.
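The sketch below illustrates this convention with stand-in classes. ListeningState, PlaybackHooks, on_tts_start, on_tts_end, and choose_action are hypothetical names; the real TurnDetectorManager API is not shown in this document, and plain strings stand in for TurnDetectionAction.

```python
import threading


class ListeningState:
    """Lock-guarded listening flag, standing in for the TurnDetector base class."""

    def __init__(self) -> None:
        self._listening = True
        self._lock = threading.Lock()

    @property
    def listening(self) -> bool:
        with self._lock:
            return self._listening

    @listening.setter
    def listening(self, value: bool) -> None:
        with self._lock:
            self._listening = value


class PlaybackHooks:
    """Hypothetical stand-in for the manager's TTS playback callbacks."""

    def __init__(self, detector: ListeningState) -> None:
        self.detector = detector

    def on_tts_start(self) -> None:
        self.detector.listening = False   # system is speaking: watch for barge-in

    def on_tts_end(self) -> None:
        self.detector.listening = True    # playback finished or was interrupted


def choose_action(listening: bool, user_finished: bool, user_speaking: bool) -> str:
    """Pick the action implied by the listening convention."""
    if listening:
        return "START_GENERATION" if user_finished else "DO_NOTHING"
    return "STOP_SPEAKING" if user_speaking else "DO_NOTHING"


detector = ListeningState()
hooks = PlaybackHooks(detector)
hooks.on_tts_start()
print(choose_action(detector.listening, user_finished=False, user_speaking=True))
hooks.on_tts_end()
print(choose_action(detector.listening, user_finished=True, user_speaking=False))
```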
Implementation Suggestions
- Whenever returning a result, ensure that action and semantic are semantically consistent.
- If the detector maintains session state internally, that state must belong only to the current instance and must not be shared across sessions.
- If the implementation uses speech_pause, it should treat it as a hint that "a pause is occurring now", not as a reset signal after the turn ends.
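The last two suggestions can be sketched together. The example assumes that clone() copies configuration but starts with fresh session state, which this document does not state explicitly. PauseCountingDetector and pauses_needed are hypothetical, and strings stand in for TurnDetectionResult.

```python
from typing import Optional


class PauseCountingDetector:
    """Hypothetical detector that ends a turn after enough pause hints."""

    def __init__(self, pauses_needed: int = 2) -> None:
        self.pauses_needed = pauses_needed   # configuration, safe to copy
        self._pause_count = 0                # session state, owned by this instance

    def detect(
        self,
        text: Optional[str] = None,
        speech_start: bool = False,
        speech_pause: Optional[bool] = None,
    ) -> str:
        if speech_start:
            self._pause_count = 0            # state is reset at turn start
            return "DO_NOTHING"
        if speech_pause:
            self._pause_count += 1           # a pause is evidence, not a reset
            if text and self._pause_count >= self.pauses_needed:
                return "START_GENERATION"
        return "DO_NOTHING"

    def clone(self) -> "PauseCountingDetector":
        # Assumed semantics: same configuration, no shared session state.
        return PauseCountingDetector(self.pauses_needed)


session_a = PauseCountingDetector()
session_a.detect(text="book a table", speech_pause=True)
session_b = session_a.clone()
print(session_b.detect(text="hello", speech_pause=True))         # DO_NOTHING
print(session_a.detect(text="book a table", speech_pause=True))  # START_GENERATION
```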