Recipe
The examples below show how to extend the framework by modifying it directly.
Introduce a New ASR Model
Assume you want to add Qwen3ASRFlashRealtime, whose implementation currently lives in src/xtalk/speech/asr/qwen3_asr_flash_realtime.py.
- Create qwen3_asr_flash_realtime.py under src/xtalk/speech/asr.
- Prepare the class skeleton and implement the required methods. For model interfaces, refer to src/xtalk/speech/interfaces.py.
from typing import Optional

from ..interfaces import ASR


class Qwen3ASRFlashRealtime(ASR):
    def __init__(
        self,
        *,
        api_key: Optional[str] = None,
        config: Optional[Qwen3ASRFlashConfig] = None,
    ) -> None:
        ...

    def recognize(self, audio: bytes) -> str:
        ...

    def recognize_stream(
        self,
        audio: bytes,
        *,
        is_final: bool = False,
        chat_history: str | None = None,
    ) -> str:
        ...

    def stream_chunk_bytes_hint(self) -> int | None:
        ...

    def reset(self) -> None:
        ...

    def clone(self) -> "ASR":
        ...

    async def async_recognize(self, audio: bytes) -> str:
        ...

    async def async_recognize_stream(
        self,
        audio: bytes,
        *,
        is_final: bool = False,
        chat_history: str | None = None,
    ) -> str:
        ...
- Register the new implementation in src/xtalk/speech/asr/__init__.py.
- Use it in the configuration:
"asr": {
    "type": "Qwen3ASRFlashRealtime",
    "params": {
        "api_key": "your key"
    }
}
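The registration and configuration steps above can be pictured with the sketch below. It is not the framework's actual loader: `ASR_REGISTRY`, `register_asr`, `build_asr`, and the stand-in class `EchoASR` are hypothetical names, used only to illustrate how a `type` string and a `params` object in the configuration could map onto a constructor call.

```python
import json
from typing import Any, Dict, Optional

# Hypothetical registry: maps the "type" string from the config to a class.
ASR_REGISTRY: Dict[str, type] = {}


def register_asr(cls: type) -> type:
    """Record a class under its own name so a config can refer to it."""
    ASR_REGISTRY[cls.__name__] = cls
    return cls


@register_asr
class EchoASR:
    """Stand-in for a real ASR model; it only stores its parameters."""

    def __init__(self, *, api_key: Optional[str] = None) -> None:
        self.api_key = api_key

    def recognize(self, audio: bytes) -> str:
        # A real model would decode the audio; this returns a size marker.
        return f"<{len(audio)} bytes>"


def build_asr(config: Dict[str, Any]) -> Any:
    """Look up config["asr"]["type"] and pass "params" as keyword arguments."""
    section = config["asr"]
    try:
        cls = ASR_REGISTRY[section["type"]]
    except KeyError:
        raise ValueError(f"unknown ASR type: {section['type']!r}") from None
    return cls(**section.get("params", {}))


config = json.loads(
    '{"asr": {"type": "EchoASR", "params": {"api_key": "your key"}}}'
)
asr = build_asr(config)
print(asr.recognize(b"\x00\x01"))  # prints "<2 bytes>"
```

Because `params` is passed as keyword arguments, a keyword-only constructor such as the one in the skeleton above would reject misspelled parameter names at construction time.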
Introduce a New Agent
Refer to src/xtalk/llm_agent/experimental.py. The implementation and configuration process is similar to the section above, but remember to register the model in src/xtalk/llm_agent/__init__.py.
accept Logic
async def async_accept(self, context: AgentContext) -> AsyncIterator[AgentOutput]:
    pass
The accept method subscribes to external inputs and starts the related processing logic. AgentContext comes from src/xtalk/serving/modules/llm_agent_context_manager.py. The currently stable context types include asr_partial, asr_final, and loop.
The loop event is triggered once when the connection is established. It can be used for any proactive logic, or to start an output loop. src/xtalk/llm_agent/experimental.py uses it to trigger proactive dialogue.
AgentOutput can be a string, a tool call, or a tool call result. After a tool call returns, the Manager can use it to trigger related logic. For example, in src/xtalk/serving/modules/llm_agent_context_manager.py, the direct_audio tool call triggers downstream logic in src/xtalk/serving/modules/direct_audio_manager.py to generate directly playable audio events.
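As a rough illustration of the accept logic described above, the sketch below dispatches on the three context types. `SketchContext` and `SketchAgent` are hypothetical stand-ins: the real `AgentContext` may carry different fields, and the outputs here are plain strings only, with no tool calls.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, List


@dataclass
class SketchContext:
    """Hypothetical stand-in for AgentContext: a type tag plus optional text."""

    type: str  # "asr_partial", "asr_final", or "loop"
    text: str = ""


class SketchAgent:
    async def async_accept(self, context: SketchContext) -> AsyncIterator[str]:
        if context.type == "loop":
            # Triggered once per connection: a place for proactive output.
            yield "Hello, how can I help?"
        elif context.type == "asr_final":
            # A finished utterance: produce a reply (an echo in this sketch).
            yield f"You said: {context.text}"
        # "asr_partial" produces no output in this sketch.


async def collect(agent: SketchAgent, context: SketchContext) -> List[str]:
    """Drain the async iterator returned by async_accept into a list."""
    return [chunk async for chunk in agent.async_accept(context)]


async def main() -> List[str]:
    agent = SketchAgent()
    results: List[str] = []
    for ctx in (
        SketchContext("loop"),
        SketchContext("asr_partial", "hel"),
        SketchContext("asr_final", "hello"),
    ):
        results.extend(await collect(agent, ctx))
    return results


outputs = asyncio.run(main())
print(outputs)
```

Writing `async_accept` as an async generator lets the caller forward each output as soon as it is produced instead of waiting for the whole reply.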
By design, the Agent is expected to be the main reasoning core of the whole system, integrating information from the other components into its output.
Introduce a New Manager
The Agent in the previous section relies on a new Manager, src/xtalk/serving/modules/direct_audio_manager.py, to turn tool-call output into audio events. Manager implementations can be created directly under src/xtalk/serving/modules and must then be registered in src/xtalk/serving/service.py and src/xtalk/serving/module_types.py.
Manager uses the observer pattern for event subscription and publishing. All events are defined in src/xtalk/serving/events.py. src/xtalk/serving/modules/input_gateway.py and src/xtalk/serving/modules/output_gateway.py are special cases responsible for receiving frontend input and sending output back to the frontend.
To invoke models inside a Manager, refer to src/xtalk/serving/modules/asr_manager.py. You can use methods such as pipeline.get_asr_model.
Note that the wait_for_completion argument of the event publishing method publish controls whether awaiting publish blocks until the listeners directly triggered by that event have completed. Enabling wait_for_completion at every step of an event chain ensures that every handler in that chain finishes before control returns to the original event source.
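The effect of wait_for_completion can be sketched with a toy event bus. `SketchBus`, its `subscribe` and `drain` methods, and the event names are assumptions made for illustration; only the `publish` name and the `wait_for_completion` flag come from the text above, and the real implementation may differ.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable, DefaultDict, List, Set, Tuple

Handler = Callable[[str], Awaitable[None]]


class SketchBus:
    """Hypothetical observer-style event bus, not the framework's own."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self._background: Set[asyncio.Task] = set()

    def subscribe(self, event: str, handler: Handler) -> None:
        self._handlers[event].append(handler)

    async def publish(
        self, event: str, payload: str, *, wait_for_completion: bool = False
    ) -> None:
        tasks = [asyncio.create_task(h(payload)) for h in self._handlers[event]]
        if wait_for_completion:
            # Return only after the listeners of this event have finished.
            await asyncio.gather(*tasks)
        else:
            # Fire and forget; keep references so the tasks stay alive.
            for task in tasks:
                self._background.add(task)
                task.add_done_callback(self._background.discard)

    async def drain(self) -> None:
        """Wait for any fire-and-forget handlers that are still running."""
        await asyncio.gather(*self._background)


async def main() -> Tuple[List[str], List[str]]:
    bus = SketchBus()
    log: List[str] = []

    async def on_reply(payload: str) -> None:
        await asyncio.sleep(0.01)
        log.append(f"reply:{payload}")

    async def on_asr_final(payload: str) -> None:
        # Waiting here as well propagates completion down the chain.
        await bus.publish("reply", payload.upper(), wait_for_completion=True)
        log.append(f"asr_final:{payload}")

    bus.subscribe("reply", on_reply)
    bus.subscribe("asr_final", on_asr_final)

    await bus.publish("asr_final", "hi", wait_for_completion=True)
    waited = list(log)  # both handlers in the chain have already run

    log.clear()
    await bus.publish("asr_final", "hi")  # returns before any handler runs
    not_waited = list(log)
    await bus.drain()
    return waited, not_waited


waited, not_waited = asyncio.run(main())
print(waited, not_waited)
```

In this sketch the first publish returns only after both handlers in the chain have appended to the log, while the second returns with the log still empty.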
Introduce a New Model Type
Create the interface in src/xtalk/speech/interfaces.py, then create the corresponding folder and model file under src/xtalk/speech. The workflow is similar to introducing a new ASR model. After that, register the new type in src/xtalk/model_loader.py and src/xtalk/model_types.py.
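A new model type starts from an interface. The sketch below shows one possible shape using Python's `abc` module; the `Denoiser` type, its methods, and `PassthroughDenoiser` are hypothetical examples, not part of the framework, and the registration steps in model_loader.py and model_types.py are not shown.

```python
from abc import ABC, abstractmethod


class Denoiser(ABC):
    """Hypothetical model type; the name and methods are illustrative only."""

    @abstractmethod
    def denoise(self, audio: bytes) -> bytes:
        """Return a cleaned copy of the input audio."""

    @abstractmethod
    def clone(self) -> "Denoiser":
        """Return an independent instance, mirroring the ASR skeleton's clone."""


class PassthroughDenoiser(Denoiser):
    """Trivial implementation that returns the audio unchanged."""

    def denoise(self, audio: bytes) -> bytes:
        return audio

    def clone(self) -> "Denoiser":
        return PassthroughDenoiser()


model = PassthroughDenoiser()
print(model.denoise(b"abc"))  # prints b'abc'
```

Declaring the methods as abstract means that an implementation which forgets one of them fails at instantiation rather than at first use.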