CHIM Text-to-Speech Services

This page documents the text-to-speech services that HerikaServer currently uses for CHIM. The sections below follow the quickstart flow: recommended services first, then the other supported services.

Recommended

PocketTTS

CPU-oriented local text-to-speech using the standard CHIM voice cache workflow. This is one of the main first-run recommendations.

  • Best for: easiest local baseline.

Chatterbox

Local text-to-speech service using the same general voice cache and management flow as XTTS and PocketTTS, but aimed at a more expressive output path.

  • Best for: expressive local speech.

CHIM XTTS

The dedicated XTTS path in CHIM. This is the service to use when you specifically want the CHIM XTTS workflow, uploaded voice samples, and the familiar clone/cache flow.

  • Best for: XTTS-specific voice cloning workflow.
  • Driver id: xtts-fastapi.
  • Supports paralinguistic tag prompts in the current server config.

Inworld

Hosted text-to-speech with language, model, temperature, and speed controls. It is a good option if you want a cloud voice service instead of running speech locally.

  • Best for: hosted voice generation with cloning support.
  • Strengths: voice cloning flow, multi-language support, model selection.
  • Needs: Inworld workspace and API credentials.

Cartesia

Hosted text-to-speech with current language and model options, including the newer Sonic models listed in the server schema.

  • Best for: higher-end hosted voice path with strong language support.
  • Strengths: multiple models, speed control, automatic voice sync from the CHIM voice cache.
  • Needs: Cartesia API key.

Other Services

MeloTTS

Local text-to-speech that maps closely to Skyrim voice types. This is useful when you want a voice-type-oriented local setup instead of the XTTS-style cloned voice flow.

Mimic3

Local HTTP text-to-speech service with direct voice, rate, and volume controls. Useful if you want simple local synthesis without using the XTTS-family flow.

Piper Text-to-Speech

Local Piper endpoint integration. Good for self-hosted offline speech if you are comfortable downloading and managing voice models manually.

Kokoro

Local Kokoro endpoint integration. It is a current text-to-speech driver in the connector system and uses a local HTTP endpoint.

KoboldCPP Text-to-Speech

Local text-to-speech path exposed through the KoboldCPP extra text-to-speech endpoint. Use it only if your local KoboldCPP stack is already set up for speech.

Zonos

Zonos Gradio endpoint integration for users who want that specific self-hosted model family. In the current schema it supports a wide language list and voice IDs in the CHIM voice cache style.

xVASynth

Legacy-friendly local Skyrim voice workflow tied to xVASynth models. Use this only if you already know you want the xVASynth route.

Azure

Hosted Azure speech synthesis with mood/style support, voice selection, and prosody controls. This is still one of the more configurable cloud voice options in the CHIM stack.
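
Azure expresses style and prosody settings through SSML. The following is a minimal sketch of what such a request body can look like, not the markup HerikaServer actually generates. The function name build_azure_ssml, the voice en-US-JennyNeural, and the style and rate values are illustrative assumptions.

```python
from xml.sax.saxutils import escape, quoteattr


def build_azure_ssml(text, voice="en-US-JennyNeural", style="cheerful",
                     rate="+10%", pitch="+0%", lang="en-US"):
    """Return an SSML string with a speaking style and prosody settings.

    Hypothetical helper for illustration; voice and style names are examples.
    """
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" '
        f'xml:lang={quoteattr(lang)}>'
        f'<voice name={quoteattr(voice)}>'
        # express-as selects the mood/style of the voice
        f'<mstts:express-as style={quoteattr(style)}>'
        # prosody adjusts speaking rate and pitch
        f'<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>'
        # escape() keeps characters such as & and < from breaking the XML
        f'{escape(text)}'
        '</prosody></mstts:express-as></voice></speak>'
    )


ssml = build_azure_ssml("Halt! Who goes there?", style="angry", rate="-5%")
print(ssml)
```

Not every Azure voice supports every style, so treat the style value as something to check against the voice you selected.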

ElevenLabs

Hosted ElevenLabs speech with model, stability, similarity, style, speed, and optional v3 audio tag controls.
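
As a rough illustration of how these controls map onto an ElevenLabs request, the sketch below only builds the URL, headers, and JSON body and sends nothing. The helper name build_elevenlabs_request, the default values, and the placeholder voice ID and API key are assumptions for illustration; this is not the code CHIM uses.

```python
import json

ELEVENLABS_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"


def build_elevenlabs_request(text, voice_id, api_key,
                             model_id="eleven_multilingual_v2",
                             stability=0.5, similarity=0.75,
                             style=0.0, speed=1.0):
    """Return (url, headers, body) for a text-to-speech call.

    Hypothetical helper; the default values are illustrative only.
    """
    url = ELEVENLABS_URL.format(voice_id=voice_id)
    headers = {"xi-api-key": api_key, "Content-Type": "application/json"}
    body = {
        "text": text,
        "model_id": model_id,
        "voice_settings": {
            "stability": stability,
            "similarity_boost": similarity,
            "style": style,
            "speed": speed,
        },
    }
    return url, headers, json.dumps(body)


# With a v3 model, audio tags are written inline in the text (assumed example).
url, headers, body = build_elevenlabs_request(
    "[whispers] Keep your voice down.",
    voice_id="YOUR_VOICE_ID",
    api_key="YOUR_API_KEY",
    model_id="eleven_v3",
)
print(url)
print(body)
```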

OpenAI Text-to-Speech

Hosted OpenAI speech synthesis using the current audio speech endpoint. Useful if you want a direct OpenAI voice path instead of a separate cloud text-to-speech provider.
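
For orientation, here is a minimal sketch of a request to the OpenAI audio speech endpoint. It only constructs the request object; the helper name build_openai_speech_request, the model tts-1, the voice alloy, and the placeholder API key are illustrative assumptions rather than CHIM defaults.

```python
import json
import urllib.request

OPENAI_SPEECH_URL = "https://api.openai.com/v1/audio/speech"


def build_openai_speech_request(text, api_key, model="tts-1",
                                voice="alloy", response_format="mp3"):
    """Build (but do not send) a POST request for speech synthesis.

    Hypothetical helper; model and voice values are examples.
    """
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        OPENAI_SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_openai_speech_request("Well met, traveler.", api_key="YOUR_API_KEY")
print(request.full_url)
```

Passing the request to urllib.request.urlopen would return the audio bytes in the requested format, assuming a valid key.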

Deepgram

Hosted Deepgram text-to-speech. It remains a supported CHIM service, but it is an alternative pick rather than a main quickstart choice.
