Introduce a New Model
Experimental API
Note: See examples/sample_app/custom_model.py and examples/sample_app/echo_agent.py for details.
Note: See Recipe for adding a model of an existing type.
You may want to introduce a new model of an existing type (e.g. text-to-speech), or add a model of a new type (e.g. a model that handles backchannel). To do so, call register_model_search_spec before an xtalk_instance is created from the config:
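from pathlib import Path

from xtalk import Xtalk

# Register the custom implementation before the instance is built from config.
Xtalk.register_model_search_spec(
    slot="llm_agent",
    spec=Path(__file__).parent / "echo_agent.py",
)
# `args` is assumed to come from the application's own argument parser.
xtalk_instance = Xtalk.from_config(args.config)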
Here, slot matches the name of the corresponding init argument in Pipeline. You can check Xtalk.MODEL_REGISTRY for existing slots, or use a new slot to represent a new type of model (see examples/sample_app/custom_service.py, where llm_output_refactor_model can serve as the new slot).
spec is the path to the model implementation. An example implementation in echo_agent.py looks like this:
from xtalk.model_types import Agent


class EchoAgent(Agent):
    """A simple agent that echoes user input."""

    def generate(self, input) -> str:
        if isinstance(input, dict):
            return input["content"]
        return input

    def clone(self) -> "EchoAgent":
        return EchoAgent()
Then you can use the custom model in the config file:
{
  "asr": {
    "type": "Qwen3ASRFlashRealtime",
    "params": {
      "api_key": "<API_KEY>"
    }
  },
  "llm_agent": "EchoAgent",
  "tts": {
    "type": "CosyVoice",
    "params": {
      "api_key": "<API_KEY>"
    }
  }
}
Recipe
Recipes for the major model customizations are listed below. For other model types, read the source code for their interfaces. These interfaces are updated from time to time.
Note: See src/xtalk/model_types.py for all available model types.
[!IMPORTANT] X-Talk provides default asynchronous implementations of the sync methods, usually via run_in_executor (for example, async_recognize for recognize in ASR). However, to achieve the best concurrency in production, we recommend implementing these async versions yourself.
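The default async behaviour described above can be pictured roughly as follows. This is a minimal sketch and not X-Talk's actual code: SketchASR, BlockingASR, and NativeAsyncASR are hypothetical stand-ins, and only the recognize / async_recognize method names come from the text.

```python
import asyncio
import time


class SketchASR:
    """Hypothetical stand-in for a base class; not the real X-Talk interface."""

    def recognize(self, audio: bytes) -> str:
        raise NotImplementedError

    async def async_recognize(self, audio: bytes) -> str:
        # Assumed default: run the blocking sync method in the default
        # thread-pool executor so the event loop stays responsive.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self.recognize, audio)


class BlockingASR(SketchASR):
    """Implements only the sync method and inherits the executor wrapper."""

    def recognize(self, audio: bytes) -> str:
        time.sleep(0.05)  # pretend to call a blocking backend
        return f"{len(audio)} bytes"


class NativeAsyncASR(BlockingASR):
    """Overrides the async method with a non-blocking implementation."""

    async def async_recognize(self, audio: bytes) -> str:
        await asyncio.sleep(0.05)  # pretend to await an async client
        return f"{len(audio)} bytes"


async def main() -> list:
    audio = b"\x00\x00" * 160
    # Both variants can be awaited concurrently; the native one does not
    # occupy a worker thread while it waits.
    return await asyncio.gather(
        BlockingASR().async_recognize(audio),
        NativeAsyncASR().async_recognize(audio),
    )


results = asyncio.run(main())
```

Both calls return the same text. The difference is that the inherited wrapper ties up one executor thread per in-flight request, which is the likely reason a native async implementation scales better under many concurrent sessions.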
New ASR (automatic speech recognition) Model
Your ASR class must inherit from xtalk.speech.interfaces.ASR and implement the following methods:
- recognize(audio: bytes) -> str
  - Recognize audio in a single pass.
- reset() -> None
  - Reset internal recognition state.
- clone() -> ASR
  - Return a new instance for use in new or concurrent sessions.
  - Sharing weights/connections (e.g., _shared_model) is allowed, but per-session state must not be shared.
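The three required methods can be sketched as below. To keep the example runnable on its own, it defines a local ASR base class as a stand-in for xtalk.speech.interfaces.ASR; FakeModel and MyASR are hypothetical names, and the real base class may declare more than what is shown here.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class ASR(ABC):
    """Stand-in for xtalk.speech.interfaces.ASR so the sketch runs alone."""

    @abstractmethod
    def recognize(self, audio: bytes) -> str: ...

    @abstractmethod
    def reset(self) -> None: ...

    @abstractmethod
    def clone(self) -> "ASR": ...


class FakeModel:
    """Hypothetical read-only backend; a real one would hold weights or a client."""

    def transcribe(self, audio: bytes) -> str:
        # 16-bit PCM means two bytes per sample.
        return f"<{len(audio) // 2} samples>"


class MyASR(ASR):
    def __init__(self, shared_model: Optional[FakeModel] = None) -> None:
        # Read-only resource: safe to share between clones.
        self._shared_model = shared_model or FakeModel()
        # Per-session state: every instance gets its own list.
        self._history: List[str] = []

    def recognize(self, audio: bytes) -> str:
        text = self._shared_model.transcribe(audio)
        self._history.append(text)
        return text

    def reset(self) -> None:
        self._history.clear()

    def clone(self) -> "MyASR":
        # Reuse the model, but start from fresh state.
        return MyASR(shared_model=self._shared_model)


base = MyASR()
session = base.clone()
first_result = session.recognize(b"\x00\x00" * 4)
```

After these calls, session and base point at the same FakeModel, while only session has an entry in its history.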
The methods below are optional:
* recognize_stream(audio: bytes, *, is_final: bool = False) -> str
  * Interface for streaming incremental recognition.
  * Returns the cumulative recognition result up to this point.
* async_recognize(audio: bytes)
* async def async_recognize_stream(self, audio: bytes, *, is_final: bool = False)
[!IMPORTANT] Input for recognize and recognize_stream is raw PCM bytes: 16-bit, mono, 16 kHz. You may need to convert your audio yourself.
Note: X-Talk has a default implementation of recognize_stream based on a MockStreamRecognizer, so non-streaming ASR models work without extra effort.
Note: You can refer to existing implementations (e.g., src/xtalk/speech/asr/zipformer_local.py) when building your own ASR class. We recommend deploying ASR as a separate service and invoking it via API calls within the ASR class; see src/xtalk/speech/asr/sherpa_onnx_asr.py for a reference implementation.
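If your audio source does not already produce 16-bit mono 16 kHz PCM, a conversion step along these lines may help. This is a standard-library-only sketch that assumes the source delivers interleaved float samples in [-1.0, 1.0]; the helper name float_to_pcm16_mono_16k is hypothetical and not part of X-Talk, and linear interpolation is used only for brevity (a dedicated resampler gives better quality).

```python
import struct


def float_to_pcm16_mono_16k(samples, src_rate, channels=1, dst_rate=16000):
    """Convert interleaved float samples to little-endian PCM16 mono bytes."""
    # Downmix: average the channels of each interleaved frame.
    frames = [
        sum(samples[i:i + channels]) / channels
        for i in range(0, len(samples) - channels + 1, channels)
    ]
    if not frames:
        return b""

    # Resample by linear interpolation between neighbouring source frames.
    out_len = max(1, round(len(frames) * dst_rate / src_rate))
    step = src_rate / dst_rate
    resampled = []
    for n in range(out_len):
        pos = n * step
        left = min(int(pos), len(frames) - 1)
        right = min(left + 1, len(frames) - 1)
        frac = pos - left
        resampled.append(frames[left] * (1.0 - frac) + frames[right] * frac)

    # Clip, scale to the signed 16-bit range, and pack little-endian.
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in resampled]
    return struct.pack(f"<{len(ints)}h", *ints)


# One second of 48 kHz stereo silence becomes one second of 16 kHz mono.
pcm = float_to_pcm16_mono_16k([0.0] * (48000 * 2), src_rate=48000, channels=2)
```

The resulting bytes object can then be passed to recognize or recognize_stream.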
New TTS (text-to-speech) Model
Your new TTS class must inherit from xtalk.speech.interfaces.TTS and implement the following methods:
- synthesize(self, text: str) -> bytes
  - Input: The text to synthesize.
  - Output: Raw audio bytes in PCM 16-bit, mono, 48000 Hz.
- clone(self) -> TTS
  - Return a new TTS instance:
    - It should have isolated runtime state to avoid cross-session interference, and it may share read-only resources if your backend supports that.
Note: Follow this integration contract for new TTS implementations:
- Non-streaming TTS: implement synthesize; optionally override async_synthesize for async efficiency.
- Streaming TTS: still implement synthesize, and additionally override synthesize_stream. You may also override async_synthesize and async_synthesize_stream for async efficiency.
- Do not override synthesize_stream for a non-streaming backend just to adapt signatures. The base-class default already wraps synthesize into one chunk for compatibility, and that inherited wrapper should not be treated as native streaming support.
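A non-streaming backend under this contract might look like the sketch below. The local TTS base class is a stand-in for xtalk.speech.interfaces.TTS so the example runs on its own, and its synthesize_stream body is only an assumed shape of the inherited one-chunk wrapper. BeepTTS is a hypothetical toy backend that emits a tone instead of speech.

```python
import math
import struct
from abc import ABC, abstractmethod
from typing import Iterable

SAMPLE_RATE = 48000  # output rate required for synthesize


class TTS(ABC):
    """Stand-in for xtalk.speech.interfaces.TTS so the sketch runs alone."""

    @abstractmethod
    def synthesize(self, text: str) -> bytes: ...

    @abstractmethod
    def clone(self) -> "TTS": ...

    def synthesize_stream(self, text: str, **kwargs) -> Iterable[bytes]:
        # Assumed shape of the inherited compatibility wrapper: one chunk.
        yield self.synthesize(text)


class BeepTTS(TTS):
    """Toy non-streaming backend: 10 ms of a 440 Hz tone per character."""

    def synthesize(self, text: str) -> bytes:
        n_samples = len(text) * SAMPLE_RATE // 100
        samples = (
            int(0.3 * 32767 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE))
            for n in range(n_samples)
        )
        # PCM 16-bit, mono, little-endian.
        return struct.pack(f"<{n_samples}h", *samples)

    def clone(self) -> "BeepTTS":
        # No runtime state to isolate in this toy backend.
        return BeepTTS()


tts = BeepTTS().clone()
audio = tts.synthesize("hello")
chunks = list(tts.synthesize_stream("hello"))
```

BeepTTS deliberately leaves synthesize_stream alone, so the stream contains exactly one chunk equal to the full synthesize output.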
Optional methods
- synthesize_stream(self, text: str, **kwargs) -> Iterable[bytes]
  - Override this method only if your backend supports true streaming synthesis.
- set_voice(self, voice_names: list[str])
  - This method works with the TTSVoiceChange event in TTSManager to switch voices via language model tool calls.
  - Usually there is only one element in voice_names, and this is the current behavior of the tool call result. However, some TTS models may support mixing multiple voices for reference, which is why voice_names is a list.
- set_emotion(self, emotion: str | list[float])
  - This method works with the TTSEmotionChange event in TTSManager to switch emotions via language model tool calls.
  - The current tool call result only carries emotion as a str. However, you may also want list[float] as an emotion vector for future use.
- async def async_synthesize(self, text: str, **kwargs: Any)
  - Optional async optimization for both streaming and non-streaming backends.
- async def async_synthesize_stream(self, text: str, **kwargs: Any)
  - Optional async optimization for streaming backends. If omitted, the base class asynchronously iterates over synthesize_stream.
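For a backend that does stream natively, the optional methods might fit together as in this sketch. StreamingTTSSketch is a hypothetical standalone class (it does not inherit from the real base class), the per-word chunking is invented for illustration, and asyncio.sleep(0) merely stands in for awaiting a network read.

```python
import asyncio
from typing import Any, AsyncIterator, Iterable, List


class StreamingTTSSketch:
    """Hypothetical streaming backend; only the method names follow the text."""

    def __init__(self) -> None:
        self.voice = "default"

    def set_voice(self, voice_names: List[str]) -> None:
        # A single name is the common case; this sketch just takes the first.
        if voice_names:
            self.voice = voice_names[0]

    def synthesize_stream(self, text: str, **kwargs: Any) -> Iterable[bytes]:
        # Yield one fake PCM chunk per word instead of one big buffer.
        for word in text.split():
            yield b"\x00\x00" * len(word)

    def synthesize(self, text: str) -> bytes:
        # Still required: the full result is the concatenation of the chunks.
        return b"".join(self.synthesize_stream(text))

    async def async_synthesize_stream(
        self, text: str, **kwargs: Any
    ) -> AsyncIterator[bytes]:
        # Native async path: hand control back to the loop between chunks.
        for chunk in self.synthesize_stream(text, **kwargs):
            await asyncio.sleep(0)
            yield chunk


async def collect(tts: StreamingTTSSketch, text: str) -> List[bytes]:
    return [chunk async for chunk in tts.async_synthesize_stream(text)]


tts = StreamingTTSSketch()
tts.set_voice(["narrator"])
stream_chunks = asyncio.run(collect(tts, "hello wide world"))
```

Here the stream yields several chunks whose concatenation matches synthesize, which is the property that distinguishes native streaming from the inherited one-chunk wrapper.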