Mimora

A local, offline pronunciation trainer. Mimora says a phrase out loud, you repeat it, and it scores how close you were — highlighting the words to work on. Practice the same phrase until you pass, then move on to the next one. Everything runs on your machine: speech synthesis, speech recognition, phrase generation, and acoustic analysis.

Mimora is built on the SpeakLoop voice-tutor stack and reuses the pronunciation-scoring core of OpenPronounce (MIT) as a library.

How it works

For each practice phrase Mimora runs a simple loop:

Prompt — a phrase is generated by the local LLM from your practice text, then spoken aloud by the Kokoro TTS voice (this synthesized audio is also the reference for scoring).
Record — you hold SPACE (or click the mic) and repeat the phrase.
Analyze — your recording and the reference are compared in a background thread: Wav2Vec2 acoustic similarity (DTW), phoneme-level word errors, and prosody (pitch / energy).
Feedback — you get a score out of 100, what was recognized, and the words to improve.
Loop — repeat the same phrase until the score passes the configurable threshold, then generate the next phrase.

You can replay the reference and your own recording back-to-back to hear the difference.
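
The loop above can be sketched in a few lines. This is only an illustration of the control flow, not the code in `main.py`: `record_attempt` and `score_attempt` are hypothetical stand-ins for the Record and Analyze steps, and `max_attempts` is an illustrative safety cap that the README does not mention.

```python
from typing import Callable


def practice_phrase(
    phrase: str,
    record_attempt: Callable[[], object],
    score_attempt: Callable[[str, object], float],
    threshold: float = 70.0,
    max_attempts: int = 10,
) -> list[float]:
    """Repeat one phrase until an attempt reaches the threshold.

    Returns the score of every attempt, in order. The callables are
    hypothetical placeholders for the recording and analysis steps.
    """
    scores: list[float] = []
    for _ in range(max_attempts):
        recording = record_attempt()              # Record step
        score = score_attempt(phrase, recording)  # Analyze step
        scores.append(score)                      # Feedback step would show this
        if score >= threshold:                    # passed: caller moves to the next phrase
            break
    return scores
```

The default threshold mirrors `PRONUNCIATION_SCORE_THRESHOLD` (70.0); a caller would generate the next phrase once the last returned score passes.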

Features

🎙️ Push-to-talk recording with peak normalization and silence gating.
🗣️ Single voice everywhere — the reference phrase is synthesized by Kokoro, the same engine used for prompts (no second TTS).
🧠 LLM-generated phrases built from an editable practice text panel — paste your own paragraph, song, or sentences to drill.
📊 Pronunciation scoring combining acoustic similarity — per-step cosine DTW over Wav2Vec2 embeddings (40%) — with phoneme error rate (30%) and word error rate (30%). All components are length-invariant; the acoustic floor is calibrated to your voice with python pronounce/calibrate.py.
🔁 Replay reference vs. your recording to compare.
😀 Articulation face — a schematic mouth opens and closes with the speech as a reference or your recording plays, and shows a smiley reflecting your score while idle.
🧵 Responsive UI — analysis and model loading run in daemon threads; the GUI is updated only via root.after().
💻 Fully local & offline after the models are downloaded.
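
To make the scoring feature more concrete, here is a minimal pure-Python sketch of a per-step cosine DTW cost and of the 40/30/30 weighting. It is not the code in `pronounce/speech.py`: the way the acoustic floor rescales the similarity, the default `acoustic_floor=0.5`, and the `1 - rate` mapping for the error rates are all assumptions made for illustration.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def dtw_per_step_cost(ref, usr):
    """Mean cosine distance along the cheapest DTW alignment path.

    Dividing by the path length keeps the value comparable across
    phrases of different durations.
    """
    if not ref or not usr:
        raise ValueError("both sequences must be non-empty")
    n, m = len(ref), len(usr)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]   # accumulated distance
    steps = [[0] * (m + 1) for _ in range(n + 1)]    # path length so far
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Cheapest predecessor: diagonal, up, or left.
            prev_cost, prev_steps = min(
                (cost[i - 1][j - 1], steps[i - 1][j - 1]),
                (cost[i - 1][j], steps[i - 1][j]),
                (cost[i][j - 1], steps[i][j - 1]),
            )
            cost[i][j] = prev_cost + cosine_distance(ref[i - 1], usr[j - 1])
            steps[i][j] = prev_steps + 1
    return cost[n][m] / steps[n][m]


def _clamp(x):
    return min(1.0, max(0.0, x))


def combine_score(per_step_cost, per, wer, acoustic_floor=0.5):
    """Blend acoustic (40%), phoneme (30%) and word (30%) components into 0-100.

    Assumption: similarity at or below acoustic_floor counts as 0,
    and the range [floor, 1] is stretched linearly to [0, 1].
    """
    similarity = 1.0 - per_step_cost
    acoustic = _clamp((similarity - acoustic_floor) / (1.0 - acoustic_floor))
    phoneme = _clamp(1.0 - per)
    word = _clamp(1.0 - wer)
    return 100.0 * (0.4 * acoustic + 0.3 * phoneme + 0.3 * word)
```

In this sketch the embeddings are plain lists of floats; in practice they would be the frame-level Wav2Vec2 vectors of the reference and of your recording.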

Architecture

| File | Responsibility |
| --- | --- |
| `main.py` | `PronunciationTrainerGUI`: Tkinter GUI, recording, the Prompt→Record→Analyze→Feedback→Loop state machine, threading orchestration, LLM-server subprocess management. |
| `pronounce/speech.py` | Pronunciation analysis core (adapted from OpenPronounce). Single entry point `analyze(...)`; Wav2Vec2 embeddings + DTW, phoneme comparison, prosody, scoring. No GUI dependency. |
| `pronounce/calibrate.py` | On-request scoring calibration: reads the per-attempt samples from `logs/pronounce_samples.jsonl` and writes the acoustic floor to `pronounce/calibration.json`. |
| `mimora/tts.py` | `TTSManager`: Kokoro TTS. `synthesize()` returns the waveform; `play_array()` plays any waveform (reference at 24 kHz, your recording at 16 kHz). `loudness_envelope()` precomputes the per-frame mouth-openness track used by the face. |
| `mimora/face_widget.py` | `FaceWidget`: schematic articulation face (Tk Canvas). Talking mouth driven from a precomputed loudness track while audio plays; smiley reflecting the score when idle. Stdlib `tkinter` only. |
| `mimora/stt.py` | `STTManager`: faster-whisper speech-to-text (loaded at startup; kept available for future use). |
| `mimora/llm.py` | `LLMManager`: OpenAI-compatible client. `generate_phrase()` produces one practice phrase per request. |
| `mimora/config.py` | All configuration: device, model names, score threshold, practice-text path, phrase-generation settings, audio settings. |
| `llm_server/server.py` | Standalone FastAPI server loading GGUF models via `llama_cpp`; runs as a separate process to avoid CUDA contention. See `llm_server/README.md`. |
| `config/` | User configuration data: `settings.json` (hand-edited preferences) and `themes/` (UI color schemes). |
| `texts/practice_text.txt` | Default source text shown in the input panel at startup; put your own practice texts in `texts/`. |
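
The table mentions a precomputed per-frame mouth-openness track. One plausible way to build such a track is a normalized per-frame RMS, sketched below. This is an assumption about the approach, not the actual `loudness_envelope()` in `mimora/tts.py`; the function name `envelope_sketch` and the `frame_ms` parameter are hypothetical.

```python
import math


def envelope_sketch(samples, sample_rate, frame_ms=40):
    """Return one openness value in [0, 1] per frame of audio.

    samples is a sequence of floats; each frame's RMS is divided by the
    loudest frame's RMS, so the loudest frame maps to a fully open mouth.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    rms = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms.append(math.sqrt(sum(s * s for s in frame) / len(frame)))
    peak = max(rms, default=0.0)
    # An all-silent clip yields a closed mouth for every frame.
    return [r / peak if peak > 0 else 0.0 for r in rms]
```

Precomputing the track once means the widget only has to look up the value for the current playback position while the audio plays.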

Requirements

Python 3.11 or 3.12 (both are used for development and testing). Python 3.13 and newer are not yet supported (as of June 2026).
Windows is the primary target (TTS playback uses winsound); a sounddevice fallback exists for other platforms.
A microphone and speakers.
For GPU acceleration: an NVIDIA GPU with a CUDA-enabled PyTorch build.
espeak-ng (native binary, required by the phonemizer) — installed separately, see below.

Models

install.py pre-downloads all of these (see Quick install). Otherwise the three Hugging Face models are fetched automatically on first run, and only the GGUF chat model must be obtained manually.

| Model | Used by | Notes |
| --- | --- | --- |
| `facebook/wav2vec2-large-960h` | pronunciation analysis | ~1.2 GB; via `install.py` or on first run |
| Kokoro-82M (`hexgrad/Kokoro-82M`) | text-to-speech | via `install.py` or on first run |
| faster-whisper `small` | speech-to-text | via `install.py` or on first run |
| A GGUF chat model (e.g. `Llama-3.2-3B-Instruct-Q4_K_M`) | phrase generation | via `install.py`, or download manually into `models/` |

Installation

Quick install (script)

install.py automates the whole setup: it installs the Python dependencies, auto-detects an NVIDIA GPU and installs the matching CUDA builds of torch and llama-cpp-python, checks for espeak-ng, pre-downloads the Hugging Face models into model_cache/, and downloads the GGUF chat model into models/.

git clone <your-repo-url> Mimora
cd Mimora

# Create and activate a virtual environment, then run the installer INSIDE it
# (the script installs into whatever interpreter runs it):
python -m venv .venv
.venv\Scripts\activate            # Windows
# source .venv/bin/activate       # macOS / Linux

python install.py

The installer prints each step and the exact command, then asks before running it (answer Y to run, n to abort, s to skip). Anything already installed is detected and offered as reinstall-or-skip rather than blindly redone. The full run is logged to logs/install.log.

Useful flags:

--yes — run non-interactively (skips already-installed steps; add --reinstall to force them)
--dry-run — print the steps and commands without executing anything
--cpu — skip the GPU (CUDA) installs
--skip-models / --skip-gguf — skip the model / GGUF downloads

On Windows the installer checks for espeak-ng (a native binary, see below) but does not install it; follow the printed instructions if it is missing. Also on Windows, enabling Developer Mode lets the model cache use symlinks; without it the installer falls back to copying files (more disk use, but it always works).

Manual installation

# 1. Clone
git clone <your-repo-url> Mimora
cd Mimora

# 2. Core + LLM-server dependencies
pip install -r requirements.txt
pip install -r llm_server/requirements.txt

# 3. Pronunciation module dependencies (Wav2Vec2, phonemizer, DTW, etc.)
pip install -r pronounce/requirements.txt

Install espeak-ng (required for phoneme analysis)

phonemizer needs the native espeak-ng binary on your PATH:

Windows — download and run the installer from the espeak-ng releases.
macOS — brew install espeak-ng
Linux (Debian/Ubuntu): sudo apt-get install espeak-ng

GPU support (recommended)

The default torch and llama-cpp-python wheels are CPU-only. For NVIDIA GPUs:

PyTorch — install a CUDA build (other CUDA versions: see pytorch.org):
```
python -m pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124 --force-reinstall
```
Reinstall torch and torchaudio together: force-reinstalling torch alone leaves a torchaudio built against the previous torch, which then fails to import (OSError: [WinError 127]) and breaks pronunciation analysis.

llama-cpp-python: build with CUDA (the commands below use PowerShell syntax; see llm_server/README.md for details):

$env:CMAKE_ARGS="-DGGML_CUDA=on"
pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

Get a GGUF model

install.py already downloads llama-3.2-3b-instruct-q4_k_m.gguf into models/. To do it manually instead, download a small instruct model (e.g. Llama-3.2-3B-Instruct-Q4_K_M.gguf) and place it at the path set by EXTERNAL_MODEL_PATH in mimora/config.py (default: models/llama-3.2-3b-instruct-q4_k_m.gguf).

Usage

python main.py

On first launch the app loads the STT, TTS, and Wav2Vec2 models and starts the LLM server. This takes a while, and it downloads several GB if the models were not already fetched by install.py. Once it shows Ready:

Edit the Practice text panel (or keep the default).
Click 🎲 New phrase — Mimora generates a phrase and speaks it.
Hold SPACE (or press and hold the mic button) and repeat the phrase; release when done.
Read your score and the words to improve in the feedback panel.
Use ▶ Reference / ▶ My recording to compare, then repeat or generate the next phrase.

Press ESC or close the window to quit (the LLM server subprocess is terminated cleanly).
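
The recording made in step 3 is described in Features as peak-normalized and silence-gated. A minimal sketch of what that can look like follows. The function name and both thresholds (`silence_rms`, `target_peak`) are illustrative assumptions, not values taken from Mimora.

```python
import math


def gate_and_normalize(samples, silence_rms=0.01, target_peak=0.95):
    """Return None for a near-silent take, else a peak-normalized copy.

    samples is a sequence of floats in [-1, 1]. A take whose RMS is below
    silence_rms is treated as "nothing was said" and is not analyzed.
    """
    if not samples:
        return None
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms < silence_rms:
        return None                      # silence gate: skip analysis
    peak = max(abs(s) for s in samples)  # > 0 here because rms > 0
    gain = target_peak / peak
    return [s * gain for s in samples]   # loudest sample now at target_peak
```

Gating first avoids amplifying background noise in an empty take up to full scale.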

Configuration

Key options in mimora/config.py (overridable via config/settings.json):

| Setting | Default | Description |
| --- | --- | --- |
| `WAV2VEC2_MODEL_NAME` | `facebook/wav2vec2-large-960h` | Pronunciation/transcription model. |
| `WAV2VEC2_DEVICE` | `DEVICE` (cuda/cpu) | Set to `"cpu"` to avoid VRAM contention with llama_cpp / Kokoro. |
| `PRONUNCIATION_SCORE_THRESHOLD` | `70.0` | Score (0–100) required to pass a phrase. |
| `PRACTICE_TEXT_FILE` | `texts/practice_text.txt` | Source text pre-loaded into the input panel. |
| `PHRASE_GEN_TEMPERATURE` / `PHRASE_GEN_MAX_TOKENS` | `0.7` / `40` | Phrase-generation sampling. |
| `PHRASE_GEN_WINDOW_SENTENCES` | `5` | Sentences of the source text sent to the model per request (sliding window). |
| `PHRASE_GEN_WINDOW_REPEATS` | `5` | Phrases generated per window position before it slides forward by half its size. |
| `LLM_BACKEND` | `local_server` | `local_server` (auto-started subprocess) or `lm-studio`. |
| `MAX_RECORD_SECONDS` | `20` | Safety cap on recording length. |
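
The two `PHRASE_GEN_WINDOW_*` settings interact, so a small sketch may help. It computes which sentences would be sent for the n-th phrase request under the defaults. The function name is hypothetical, and two details are assumptions: "half its size" is rounded down (5 becomes a step of 2), and the window wraps to the start of the text once it runs past the end.

```python
def window_for_request(sentences, request_index, window=5, repeats=5):
    """Sentences sent to the model for the request_index-th phrase (0-based).

    The window stays put for `repeats` requests, then slides forward by
    half its size.
    """
    if not sentences:
        return []
    step = max(1, window // 2)                # assumption: half, rounded down
    start = (request_index // repeats) * step
    start %= len(sentences)                   # assumption: wrap around at the end
    return sentences[start:start + window]
```

With 12 sentences and the defaults, requests 0 to 4 see sentences 0 to 4, requests 5 to 9 see sentences 2 to 6, and so on, so consecutive windows overlap and each passage is drilled more than once.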

Using the pronunciation core as a library

The pronounce/ package is GUI-agnostic and can be used on its own:

import pronounce

pronounce.load_models()   # load Wav2Vec2 once (and warm_up() to remove first-call latency)

result = pronounce.analyze(
    user_audio=user_waveform,        # np.ndarray, 16 kHz mono
    expected_text="hello world",
    reference_audio=reference_wav,   # np.ndarray (e.g. Kokoro output)
    user_sr=16000,
    reference_sr=24000,
)

print(result.score)              # 0–100
print(result.passed)             # score >= threshold
print(result.transcription)      # what was recognized
print(result.words_with_errors)  # words to improve
print(result.prosody)            # {"f0": [...], "energy": [...]}

Running the tests

# Fast unit tests (pure logic, no model download, offline)
python -m unittest tests.test_speech -v

# Optional end-to-end check on real audio (downloads the model, needs espeak-ng)
python tests/test_speech.py path/to/user.wav [path/to/reference.wav]

GPU / CPU notes

Two torch models (Wav2Vec2 and Kokoro) plus llama_cpp can compete for VRAM on a single GPU. Mimora mitigates this in two ways:

The LLM runs in a separate process (llm_server/), and the practice loop runs its phases (LLM → Kokoro → Wav2Vec2) sequentially, so they don't synthesize/infer at the same time.
If VRAM is still tight, set WAV2VEC2_DEVICE = "cpu" in mimora/config.py — short phrases analyze acceptably on CPU.

Known limitations

English only (phonemizer en-us).
The transcription-based word errors only surface mistakes the ASR actually "hears"; subtle distortions where the word is still recognized may not appear in the word list (the acoustic DTW + prosody partially compensate).
Scoring is heuristic — the acoustic floor depends on your voice and microphone. After a practice session run python pronounce/calibrate.py to fit it to your data (--dry-run previews the change); every attempt's raw components are logged to logs/pronounce_samples.jsonl and logs/main.log for inspection.
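
The second limitation is easy to see with a word error rate computed from the transcription. The sketch below uses a standard word-level edit distance; it is an illustration and not necessarily how `pronounce/speech.py` computes its word errors.

```python
def word_error_rate(expected, recognized):
    """Word-level Levenshtein distance divided by the expected word count."""
    ref = expected.lower().split()
    hyp = recognized.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # expected word missing from the transcript
                cur[j - 1] + 1,          # extra word in the transcript
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)
```

If you slightly distort "brown" but the ASR still transcribes "brown", the rate stays at 0.0 and the word never reaches the list; only a transcript such as "the quick crown fox" registers an error.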

Credits

OpenPronounce (MIT) — the pronunciation-scoring core reused in pronounce/.
Kokoro-82M — text-to-speech.
faster-whisper — speech-to-text.
Wav2Vec2 (Hugging Face Transformers) — acoustic embeddings and transcription.
llama.cpp / llama-cpp-python — local LLM inference.

License

See LICENSE. The reused OpenPronounce components are MIT-licensed; their attribution is retained in pronounce/speech.py.
