A local, offline pronunciation trainer. Mimora says a phrase out loud, you repeat it, and it scores how close you were, highlighting the words to work on. Practice the same phrase until you pass, then move on to the next one. Everything runs on your machine: speech synthesis, speech recognition, phrase generation, and acoustic analysis.
Mimora is built on the SpeakLoop voice-tutor stack and reuses the pronunciation-scoring core of OpenPronounce (MIT) as a library.
For each practice phrase Mimora runs a simple loop:
- Prompt: a phrase is generated by the local LLM from your practice text, then spoken aloud by the Kokoro TTS voice (this synthesized audio is also the reference for scoring).
- Record: you hold `SPACE` (or click the mic) and repeat the phrase.
- Analyze: a background thread compares your recording with the reference using Wav2Vec2 acoustic similarity (DTW), phoneme-level word errors, and prosody (pitch / energy).
- Feedback: you get a score out of 100, what was recognized, and the words to improve.
- Loop: repeat the same phrase until the score passes the configurable threshold, then generate the next phrase.
You can replay the reference and your own recording back-to-back to hear the difference.
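
The loop above can be sketched as a small function. This is a minimal illustration and not the state machine in `main.py`: `practice_phrase`, `Attempt`, the `record` and `analyze` callables, and the `max_attempts` cap are hypothetical names introduced here. Only the default threshold of 70.0 is taken from the configuration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Attempt:
    phrase: str
    score: float
    passed: bool


def practice_phrase(
    phrase: str,
    record: Callable[[], object],
    analyze: Callable[[object, str], float],
    threshold: float = 70.0,
    max_attempts: int = 10,  # hypothetical safety cap, not a Mimora setting
) -> List[Attempt]:
    """Repeat one phrase until the score reaches the threshold."""
    attempts: List[Attempt] = []
    for _ in range(max_attempts):
        audio = record()                  # Record
        score = analyze(audio, phrase)    # Analyze
        passed = score >= threshold       # Feedback
        attempts.append(Attempt(phrase, score, passed))
        if passed:
            break                         # Loop ends; the caller requests the next phrase
    return attempts


# Stand-in recorder and scorer so the sketch runs without audio or models.
fake_scores = iter([52.0, 66.5, 81.0])
history = practice_phrase(
    "hello world",
    record=lambda: None,
    analyze=lambda audio, phrase: next(fake_scores),
)
print([attempt.score for attempt in history])
```

Passing the recorder and scorer in as callables keeps the loop testable without a microphone or any loaded model.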
- Push-to-talk recording with peak normalization and silence gating.
- Single voice everywhere: the reference phrase is synthesized by Kokoro, the same engine used for prompts (no second TTS).
- LLM-generated phrases built from an editable practice text panel: paste your own paragraph, song, or sentences to drill.
- Pronunciation scoring that combines acoustic similarity (per-step cosine DTW over Wav2Vec2 embeddings, 40%) with phoneme error rate (30%) and word error rate (30%). All components are length-invariant; the acoustic floor is calibrated to your voice with `python pronounce/calibrate.py`.
- Replay reference vs. your recording to compare.
- Articulation face: a schematic mouth opens and closes with the speech as a reference or your recording plays, and shows a smiley reflecting your score while idle.
- Responsive UI: analysis and model loading run in daemon threads; the GUI is updated only via `root.after()`.
- Fully local & offline after the models are downloaded.
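
To make "per-step cosine DTW" concrete, here is a minimal NumPy sketch of one way to compute a length-invariant similarity between two embedding sequences. It is an assumption about the general technique, not the code in `pronounce/speech.py`: the function name `cosine_dtw_similarity` and the choice to average the cost over the warping-path length are illustrative.

```python
import numpy as np


def cosine_dtw_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Mean cosine similarity along the cheapest DTW path.

    a has shape (n, d) and b has shape (m, d): one embedding per frame.
    """
    # Normalize rows so that a dot product equals cosine similarity.
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    cost = 1.0 - a @ b.T  # (n, m) cosine distance per frame pair

    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)        # accumulated path cost
    steps = np.zeros((n + 1, m + 1), dtype=int)  # path length so far
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = (
                (acc[i - 1, j - 1], steps[i - 1, j - 1]),  # match
                (acc[i - 1, j], steps[i - 1, j]),          # a advances
                (acc[i, j - 1], steps[i, j - 1]),          # b advances
            )
            best_cost, best_steps = min(candidates)
            acc[i, j] = cost[i - 1, j - 1] + best_cost
            steps[i, j] = best_steps + 1

    # Dividing by the path length keeps the result independent of duration.
    return float(1.0 - acc[n, m] / steps[n, m])


rng = np.random.default_rng(0)
reference = rng.normal(size=(20, 16))
slow_copy = np.repeat(reference, 2, axis=0)  # same content, spoken twice as slowly
unrelated = rng.normal(size=(25, 16))

same = cosine_dtw_similarity(reference, slow_copy)
different = cosine_dtw_similarity(reference, unrelated)
print(round(same, 3), round(different, 3))
```

The time-stretched copy scores close to 1 because DTW absorbs the tempo difference, while unrelated frames score much lower.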
| File | Responsibility |
|---|---|
| `main.py` | `PronunciationTrainerGUI`: Tkinter GUI, recording, the Prompt→Record→Analyze→Feedback→Loop state machine, threading orchestration, LLM-server subprocess management. |
| `pronounce/speech.py` | Pronunciation analysis core (adapted from OpenPronounce). Single entry point `analyze(...)`; Wav2Vec2 embeddings + DTW, phoneme comparison, prosody, scoring. No GUI dependency. |
| `pronounce/calibrate.py` | On-request scoring calibration: reads the per-attempt samples from `logs/pronounce_samples.jsonl` and writes the acoustic floor to `pronounce/calibration.json`. |
| `mimora/tts.py` | `TTSManager`: Kokoro TTS. `synthesize()` returns the waveform; `play_array()` plays any waveform (reference at 24 kHz, your recording at 16 kHz). `loudness_envelope()` precomputes the per-frame mouth-openness track used by the face. |
| `mimora/face_widget.py` | `FaceWidget`: schematic articulation face (Tk Canvas). Talking mouth driven from a precomputed loudness track while audio plays; smiley reflecting the score when idle. Stdlib tkinter only. |
| `mimora/stt.py` | `STTManager`: faster-whisper speech-to-text (loaded at startup; kept available for future use). |
| `mimora/llm.py` | `LLMManager`: OpenAI-compatible client. `generate_phrase()` produces one practice phrase per request. |
| `mimora/config.py` | All configuration: device, model names, score threshold, practice-text path, phrase-generation settings, audio settings. |
| `llm_server/server.py` | Standalone FastAPI server loading GGUF models via llama_cpp; runs as a separate process to avoid CUDA contention. See `llm_server/README.md`. |
| `config/` | User configuration data: `settings.json` (hand-edited preferences) and `themes/` (UI color schemes). |
| `texts/practice_text.txt` | Default source text shown in the input panel at startup; put your own practice texts in `texts/`. |
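
As an illustration of the per-frame mouth-openness track mentioned for `mimora/tts.py`, the sketch below computes a normalized RMS loudness value for each animation frame. It is a guess at the idea, not the actual `loudness_envelope()` implementation: the function name `loudness_track_sketch`, the `fps` parameter, and the use of RMS are assumptions.

```python
import numpy as np


def loudness_track_sketch(wave: np.ndarray, sample_rate: int, fps: int = 30) -> np.ndarray:
    """Return one loudness value in [0, 1] per animation frame."""
    hop = max(1, sample_rate // fps)            # samples per animation frame
    n_frames = int(np.ceil(len(wave) / hop))
    padded = np.zeros(n_frames * hop, dtype=np.float64)
    padded[: len(wave)] = wave                  # zero-pad the last partial frame
    rms = np.sqrt(np.mean(padded.reshape(n_frames, hop) ** 2, axis=1))
    peak = rms.max() if n_frames else 0.0
    return rms / peak if peak > 0 else rms      # loudest frame maps to 1.0


# One second at 24 kHz: silence for the first half, a 220 Hz tone for the second.
sr = 24000
t = np.arange(sr) / sr
signal = 0.5 * np.sin(2 * np.pi * 220 * t)
signal[: sr // 2] = 0.0

track = loudness_track_sketch(signal, sr, fps=30)
print(len(track), track[:15].max(), round(float(track[15:].min()), 2))
```

Precomputing the track once means the face only has to look up a value per frame while audio plays, rather than analyzing audio on the GUI thread.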
- Python 3.11 or 3.12 (developed and tested on 3.11 and 3.12). Python 3.13 and newer are not yet supported (as of June 2026).
- Windows is the primary target (TTS playback uses `winsound`); a `sounddevice` fallback exists for other platforms.
- A microphone and speakers.
- For GPU acceleration: an NVIDIA GPU with a CUDA-enabled PyTorch build.
- espeak-ng (native binary, required by the phonemizer), installed separately; see below.
install.py pre-downloads all of these (see Quick install).
Otherwise the three Hugging Face models are fetched automatically on first run,
and only the GGUF chat model must be obtained manually.
| Model | Used by | Notes |
|---|---|---|
| `facebook/wav2vec2-large-960h` | pronunciation analysis | ~1.2 GB; via `install.py` or on first run |
| Kokoro-82M (`hexgrad/Kokoro-82M`) | text-to-speech | via `install.py` or on first run |
| faster-whisper `small` | speech-to-text | via `install.py` or on first run |
| A GGUF chat model (e.g. `Llama-3.2-3B-Instruct-Q4_K_M`) | phrase generation | via `install.py`, or download manually into `models/` |
install.py automates the whole setup: it installs the Python dependencies,
auto-detects an NVIDIA GPU and installs the matching CUDA builds of torch and
llama-cpp-python, checks for espeak-ng, pre-downloads the Hugging Face models
into model_cache/, and downloads the GGUF chat model into models/.

    git clone <your-repo-url> Mimora
    cd Mimora

    # Create and activate a virtual environment, then run the installer INSIDE it
    # (the script installs into whatever interpreter runs it):
    python -m venv .venv
    .venv\Scripts\activate          # Windows
    # source .venv/bin/activate     # macOS / Linux

    python install.py

The installer prints each step and the exact command, then asks before running
it (answer Y to run, n to abort, s to skip). Anything already installed is
detected and offered as reinstall-or-skip rather than blindly redone. The full
run is logged to logs/install.log.
Useful flags:
- `--yes`: run non-interactively (skips already-installed steps; add `--reinstall` to force them)
- `--dry-run`: print the steps and commands without executing anything
- `--cpu`: skip the GPU (CUDA) installs
- `--skip-models` / `--skip-gguf`: skip the model / GGUF downloads

espeak-ng (a native binary, see below) is checked but not installed on Windows;
follow the printed instructions if it is missing. On Windows, enabling
Developer Mode lets the model cache use symlinks; without it the installer
falls back to copying files (more disk use, but it always works).

    # 1. Clone
    git clone <your-repo-url> Mimora
    cd Mimora

    # 2. Core + LLM-server dependencies
    pip install -r requirements.txt
    pip install -r llm_server/requirements.txt

    # 3. Pronunciation module dependencies (Wav2Vec2, phonemizer, DTW, etc.)
    pip install -r pronounce/requirements.txt

phonemizer needs the native espeak-ng binary on your PATH:
- Windows: download and run the installer from the espeak-ng releases.
- macOS: `brew install espeak-ng`
- Linux: `sudo apt-get install espeak-ng`

The default torch and llama-cpp-python wheels are CPU-only. For NVIDIA GPUs:
- PyTorch: install a CUDA build (other CUDA versions: see pytorch.org):

      python -m pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124 --force-reinstall

  Reinstall `torch` and `torchaudio` together: force-reinstalling `torch` alone leaves a `torchaudio` built against the previous torch, which then fails to import (`OSError: [WinError 127]`) and breaks pronunciation analysis.
- llama-cpp-python: build with CUDA (see `llm_server/README.md` for details):

      $env:CMAKE_ARGS="-DGGML_CUDA=on"
      pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

install.py already downloads llama-3.2-3b-instruct-q4_k_m.gguf into models/.
To do it manually instead, download a small instruct model (e.g. Llama-3.2-3B-Instruct-Q4_K_M.gguf) and place it at the path set by EXTERNAL_MODEL_PATH in mimora/config.py (default: models/llama-3.2-3b-instruct-q4_k_m.gguf).

    python main.py

On first launch the app loads the STT, TTS, and Wav2Vec2 models and starts the LLM server (this takes a while and downloads several GB). Once it shows Ready:
- Edit the Practice text panel (or keep the default).
- Click 🎲 New phrase; Mimora generates a phrase and speaks it.
- Hold `SPACE` (or press and hold the mic button) and repeat the phrase; release when done.
- Read your score and the words to improve in the feedback panel.
- Use ▶ Reference / ▶ My recording to compare, then repeat or generate the next phrase.

Press ESC or close the window to quit (the LLM server subprocess is terminated cleanly).
Key options in mimora/config.py (overridable via config/settings.json):
| Setting | Default | Description |
|---|---|---|
| `WAV2VEC2_MODEL_NAME` | `facebook/wav2vec2-large-960h` | Pronunciation/transcription model. |
| `WAV2VEC2_DEVICE` | `DEVICE` (`cuda`/`cpu`) | Set to `"cpu"` to avoid VRAM contention with llama_cpp / Kokoro. |
| `PRONUNCIATION_SCORE_THRESHOLD` | `70.0` | Score (0–100) required to pass a phrase. |
| `PRACTICE_TEXT_FILE` | `texts/practice_text.txt` | Source text pre-loaded into the input panel. |
| `PHRASE_GEN_TEMPERATURE` / `PHRASE_GEN_MAX_TOKENS` | `0.7` / `40` | Phrase-generation sampling. |
| `PHRASE_GEN_WINDOW_SENTENCES` | `5` | Sentences of the source text sent to the model per request (sliding window). |
| `PHRASE_GEN_WINDOW_REPEATS` | `5` | Phrases generated per window position before it slides forward by half its size. |
| `LLM_BACKEND` | `local_server` | `local_server` (auto-started subprocess) or `lm-studio`. |
| `MAX_RECORD_SECONDS` | `20` | Safety cap on recording length. |
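
The interaction between `PHRASE_GEN_WINDOW_SENTENCES` and `PHRASE_GEN_WINDOW_REPEATS` can be sketched as a generator. This is an illustrative reading of the table, not the code in `mimora/llm.py`: the names `split_sentences` and `window_schedule`, the regex-based sentence splitter, rounding "half its size" down, and wrapping back to the start at the end of the text are all assumptions.

```python
import re
from itertools import islice
from typing import Iterator, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def window_schedule(sentences: List[str], window: int = 5, repeats: int = 5) -> Iterator[List[str]]:
    """Yield the slice of source sentences to send with each phrase request."""
    if not sentences:
        return
    step = max(1, window // 2)        # slide by half the window (rounded down)
    start = 0
    while True:
        current = sentences[start:start + window]
        for _ in range(repeats):      # several phrases per window position
            yield current
        start += step
        if start >= len(sentences):   # assumption: wrap around at the end
            start = 0


sentences = split_sentences("One. Two! Three? Four. Five. Six. Seven. Eight.")
first_five = list(islice(window_schedule(sentences, window=4, repeats=2), 5))
for chunk in first_five:
    print(chunk)
```

Overlapping windows mean each sentence is seen in more than one context before the generator moves past it.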
The pronounce/ package is GUI-agnostic and can be used on its own:

    import pronounce

    pronounce.load_models()  # load Wav2Vec2 once (and warm_up() to remove first-call latency)

    result = pronounce.analyze(
        user_audio=user_waveform,        # np.ndarray, 16 kHz mono
        expected_text="hello world",
        reference_audio=reference_wav,   # np.ndarray (e.g. Kokoro output)
        user_sr=16000,
        reference_sr=24000,
    )

    print(result.score)              # 0-100
    print(result.passed)             # score >= threshold
    print(result.transcription)      # what was recognized
    print(result.words_with_errors)  # words to improve
    print(result.prosody)            # {"f0": [...], "energy": [...]}

To run the tests:

    # Fast unit tests (pure logic, no model download, offline)
    python -m unittest tests.test_speech -v

    # Optional end-to-end check on real audio (downloads the model, needs espeak-ng)
    python tests/test_speech.py path/to/user.wav [path/to/reference.wav]

Several GPU models (Wav2Vec2 and Kokoro in torch, plus llama_cpp) can compete for VRAM on a single GPU. Mimora mitigates this in two ways:
- The LLM runs in a separate process (`llm_server/`), and the practice loop runs its phases (LLM → Kokoro → Wav2Vec2) sequentially, so they don't synthesize/infer at the same time.
- If VRAM is still tight, set `WAV2VEC2_DEVICE = "cpu"` in `mimora/config.py`; short phrases analyze acceptably on CPU.

- English only (phonemizer `en-us`).
- The transcription-based word errors only surface mistakes the ASR actually "hears"; subtle distortions where the word is still recognized may not appear in the word list (the acoustic DTW + prosody partially compensate).
- Scoring is heuristic: the acoustic floor depends on your voice and microphone. After a practice session, run `python pronounce/calibrate.py` to fit it to your data (`--dry-run` previews the change); every attempt's raw components are logged to `logs/pronounce_samples.jsonl` and `logs/main.log` for inspection.

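
To show why the acoustic floor matters, here is a hedged sketch of how a calibrated floor could feed into the 40/30/30 weighting listed among the features. The weights come from this README; everything else is an assumption made for illustration: the function names, the linear rescaling of similarity between the floor and 1.0, and the clamping of error rates to [0, 1]. The real formula lives in `pronounce/speech.py` and may differ.

```python
def rescale_acoustic(similarity: float, floor: float) -> float:
    """Map [floor, 1.0] onto [0.0, 1.0]; anything below the floor counts as 0."""
    if floor >= 1.0:
        return 0.0
    return min(1.0, max(0.0, (similarity - floor) / (1.0 - floor)))


def combined_score(similarity: float, per: float, wer: float, floor: float = 0.5) -> float:
    """Weighted 0-100 score: 40% acoustic, 30% phoneme, 30% word.

    `floor` is a hypothetical default; in Mimora it would come from calibration.
    """
    acoustic = rescale_acoustic(similarity, floor)
    phoneme = 1.0 - min(1.0, max(0.0, per))   # phoneme error rate to accuracy
    word = 1.0 - min(1.0, max(0.0, wer))      # word error rate to accuracy
    return 100.0 * (0.4 * acoustic + 0.3 * phoneme + 0.3 * word)


# Same attempt scored against two different floors.
low_floor = combined_score(similarity=0.9, per=0.1, wer=0.0, floor=0.6)
high_floor = combined_score(similarity=0.9, per=0.1, wer=0.0, floor=0.8)
print(round(low_floor, 1), round(high_floor, 1))
```

Under these assumptions the same raw similarity yields a noticeably lower score when the floor is higher, which is why fitting the floor to your own voice and microphone changes how strict the trainer feels.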
- OpenPronounce (MIT): the pronunciation-scoring core reused in `pronounce/`.
- Kokoro-82M: text-to-speech.
- faster-whisper: speech-to-text.
- Wav2Vec2 (Hugging Face Transformers): acoustic embeddings and transcription.
- llama.cpp / llama-cpp-python: local LLM inference.

See LICENSE. The reused OpenPronounce components are MIT-licensed; their attribution is retained in pronounce/speech.py.