Spec proposal: OVOS-AUDIO-IN-1 — Audio Input Service #52

@JarbasAl

Description

Proposal for OVOS-AUDIO-IN-1, the audio input service specification.

Problem

The audio input service — the component that acquires audio, runs pre-STT processing, transcribes to text, and injects the result into the utterance lifecycle — has no normative contract. How it is implemented (microphone, file, remote stream, wake word, VAD) is entirely deployer-defined and should stay that way. What it must produce is not specified anywhere.

Proposal

Minimal spec with three normative obligations:

  1. An STT mechanism MUST exist. It is deployer-defined: the engine, model, API, or local process are all out of scope.
  2. The audio-transformer chain MUST run before STT (TRANSFORM-1 §3.1). Canonical use cases: language identification (writing session.detected_lang), denoising/normalisation, and speaker recognition (result written into Message.context).
  3. The service MUST emit ovos.utterance.handle with data.utterances (array of transcription candidates) and data.lang (BCP-47 output language).

Everything else (capture method, STT engine selection, post-STT transformer chains) is a deployer concern and explicitly out of scope.
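
To make the three obligations concrete, here is a minimal Python sketch of a service that runs a transformer chain, calls an STT function, and builds the outgoing message. It is an illustration under assumptions, not the reference implementation: the function name build_utterance_message, the callable signatures, the stand-in transformer and STT, and the type/data/context dict layout with session stored under context["session"] are all hypothetical choices made for this example. Only the message name ovos.utterance.handle and the fields data.utterances, data.lang and session.detected_lang come from the proposal. Language selection is left out here and passed in as a parameter.

```python
import copy
from typing import Any, Callable, Dict, List, Optional

# Hypothetical signatures, chosen only for this sketch.
AudioTransformer = Callable[[bytes, Dict[str, Any]], bytes]
SttEngine = Callable[[bytes, str], List[str]]


def build_utterance_message(
    audio: bytes,
    lang: str,
    transformers: List[AudioTransformer],
    stt: SttEngine,
    context: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Run the transformer chain, transcribe, and build the outgoing message."""
    # Copy so the caller's context is not mutated (an implementation choice).
    ctx = copy.deepcopy(context) if context else {}
    ctx.setdefault("session", {})

    # Obligation 2: every audio transformer runs before STT.
    for transform in transformers:
        audio = transform(audio, ctx)

    # Obligation 1: some STT mechanism produces transcription candidates.
    utterances = stt(audio, lang)

    # Obligation 3: emit ovos.utterance.handle with data.utterances and data.lang.
    return {
        "type": "ovos.utterance.handle",
        "data": {"utterances": utterances, "lang": lang},
        "context": ctx,
    }


def fake_lang_id(audio: bytes, ctx: Dict[str, Any]) -> bytes:
    """Stand-in language identifier that writes session.detected_lang."""
    ctx["session"]["detected_lang"] = "en-US"
    return audio


def fake_stt(audio: bytes, lang: str) -> List[str]:
    """Stand-in STT that returns fixed candidates."""
    return ["turn on the lights", "turn on the light"]


message = build_utterance_message(b"\x00\x01", "en-US", [fake_lang_id], fake_stt)
```

How the audio is captured and which engine sits behind the stt callable do not appear in the sketch at all, which is the point of keeping them out of scope.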

Language fields

  • Language selection order (inputs to STT): session.detected_lang → session.request_lang → session.lang
  • data.lang — the transcript's output language (what the text is in)
  • session.stt_lang (SHOULD write) — the language the STT model was configured to assume; matches data.lang in normal transcription, diverges in speech-translation (stt_lang = audio's spoken language, data.lang = translated output language)
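
The language fields above could be handled along these lines. This is a sketch under assumptions: it reads the selection order as "first non-empty field wins", it returns None when no field is set (the proposal does not say what should happen in that case), and the helper names select_input_lang and annotate_languages are hypothetical. The session and data objects are plain dicts here purely for illustration.

```python
from typing import Any, Dict, Optional

# Field names come from the proposal; treating the order as
# "first non-empty value wins" is an assumption of this sketch.
SELECTION_ORDER = ("detected_lang", "request_lang", "lang")


def select_input_lang(session: Dict[str, Any]) -> Optional[str]:
    """Return the first non-empty language field, or None if none is set."""
    for key in SELECTION_ORDER:
        value = session.get(key)
        if value:
            return value
    return None


def annotate_languages(
    session: Dict[str, Any],
    data: Dict[str, Any],
    spoken_lang: str,
    output_lang: str,
) -> None:
    """Record the STT input language and the transcript's output language."""
    session["stt_lang"] = spoken_lang  # language the STT model assumed
    data["lang"] = output_lang         # language the transcript text is in


# Speech-translation case: Portuguese audio, English transcript.
session = {"request_lang": "pt-PT", "lang": "en-US"}
chosen = select_input_lang(session)
data = {"utterances": ["good morning"]}
annotate_languages(session, data, spoken_lang=chosen, output_lang="en-US")
```

In normal transcription the same value would be passed for spoken_lang and output_lang, so session.stt_lang and data.lang match; they differ only in the speech-translation case shown.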

PR

PR #51
