Proposal for OVOS-AUDIO-IN-1, the audio input service specification.
Problem
The audio input service — the component that acquires audio, runs pre-STT processing, transcribes to text, and injects the result into the utterance lifecycle — has no normative contract. How it is implemented (microphone, file, remote stream, wake word, VAD) is entirely deployer-defined and should stay that way. What it must produce is not specified anywhere.
Proposal
Minimal spec with three normative obligations:
- An STT mechanism MUST exist. It is deployer-defined: engine, model, API, or local process are all out of scope
- Audio-transformer chain MUST run before STT (TRANSFORM-1 §3.1). Canonical use cases: language identification (writing session.detected_lang), denoising/normalisation, speaker recognition (result written into Message.context)
- MUST emit ovos.utterance.handle with data.utterances (array of transcription candidates) and data.lang (BCP-47 output language)
Everything else (capture method, STT engine selection, post-STT transformer chains) is a deployer concern and explicitly out of scope.
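The three obligations above can be sketched roughly as follows. This is an illustration only, not the proposed implementation: the function name handle_audio, the callable signatures for transformers and the STT engine, and the plain-dict message shape (type / data / context) are all assumptions made for the example. Only the message type ovos.utterance.handle and the data.utterances and data.lang fields come from the proposal.

```python
from typing import Any, Callable

# Hypothetical shapes for illustration; the spec leaves all of these deployer-defined.
AudioTransformer = Callable[[bytes, dict[str, Any]], bytes]  # (audio, context) -> audio
SttEngine = Callable[[bytes, str], list[str]]                # (audio, lang) -> candidates


def handle_audio(
    audio: bytes,
    transformers: list[AudioTransformer],
    stt: SttEngine,
    lang: str,
    context: dict[str, Any],
) -> dict[str, Any]:
    """Run the audio-transformer chain, then STT, then build the message to emit."""
    ctx = dict(context)  # work on a copy so the caller's context is not mutated

    # Obligation 2: the audio-transformer chain runs before STT.
    # A transformer may rewrite the audio and/or write results into the context.
    for transform in transformers:
        audio = transform(audio, ctx)

    # Obligation 1: some STT mechanism exists (here an injected callable).
    utterances = stt(audio, lang)

    # Obligation 3: emit ovos.utterance.handle with data.utterances and data.lang.
    # Assumption: in this sketch the output language equals the STT input language.
    return {
        "type": "ovos.utterance.handle",
        "data": {"utterances": utterances, "lang": lang},
        "context": ctx,
    }
```

In a real deployment the returned dict would be sent on the message bus rather than returned. The sketch keeps that step out because the transport is not part of this proposal.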
Language fields
- Language selection order (inputs to STT): session.detected_lang → session.request_lang → session.lang
- data.lang: the transcript's output language (what the text is in)
- session.stt_lang (SHOULD write): the language the STT model was configured to assume; matches data.lang in normal transcription, diverges in speech-translation (stt_lang = audio's spoken language, data.lang = translated output language)
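A minimal sketch of the selection order, assuming the session is available as a plain dict keyed by the field names above. The helper name select_stt_lang and the treatment of missing or empty values as "not set" are assumptions for illustration, not part of the proposal.

```python
from typing import Any, Optional

# Fallback order taken from the proposal: detected_lang, then request_lang, then lang.
LANG_PRIORITY = ("detected_lang", "request_lang", "lang")


def select_stt_lang(session: dict[str, Any]) -> Optional[str]:
    """Return the first language set in the session, following the priority order."""
    for key in LANG_PRIORITY:
        value = session.get(key)
        if value:  # assumption: missing, None, and "" all mean "not set"
            return value
    return None


# Normal transcription: stt_lang and data.lang match.
session = {"lang": "en-US", "request_lang": "pt-PT"}
stt_lang = select_stt_lang(session)
session["stt_lang"] = stt_lang          # SHOULD write
data = {"utterances": ["olá mundo"], "lang": stt_lang}

# Speech-translation: the audio is Portuguese, the transcript is English,
# so session.stt_lang and data.lang diverge.
translated = {"utterances": ["hello world"], "lang": "en-US"}
```

Here request_lang wins over lang because detected_lang is not set, so the engine is configured for pt-PT.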
PR
PR #51