vikonix/Mimora


Mimora

A local, offline pronunciation trainer. Mimora says a phrase out loud, you repeat it, and it scores how close you were, highlighting the words to work on. Practice the same phrase until you pass, then move on to the next one. Everything runs on your machine: speech synthesis, speech recognition, phrase generation, and acoustic analysis.

Mimora is built on the SpeakLoop voice-tutor stack and reuses the pronunciation-scoring core of OpenPronounce (MIT) as a library.


How it works

For each practice phrase Mimora runs a simple loop:

  1. Prompt: a phrase is generated by the local LLM from your practice text, then spoken aloud by the Kokoro TTS voice (this synthesized audio is also the reference for scoring).
  2. Record: you hold SPACE (or click the mic) and repeat the phrase.
  3. Analyze: your recording and the reference are compared in a background thread using Wav2Vec2 acoustic similarity (DTW), phoneme-level word errors, and prosody (pitch / energy).
  4. Feedback: you get a score out of 100, what was recognized, and the words to improve.
  5. Loop: repeat the same phrase until the score passes the configurable threshold, then generate the next phrase.

You can replay the reference and your own recording back-to-back to hear the difference.
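
The acoustic comparison in the Analyze step can be pictured with a small, dependency-free sketch. It is not the code in pronounce/speech.py: the function names are hypothetical, plain Python lists stand in for Wav2Vec2 embedding frames, and dividing the accumulated cost by the path length is only one plausible reading of "per-step" DTW.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length vectors (0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an all-zero frame as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def dtw_mean_cost(reference, user):
    """Average cosine distance along the cheapest DTW alignment path.

    Dividing by the path length (assumption) keeps the result comparable
    between short and long phrases.
    """
    n, m = len(reference), len(user)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]   # accumulated distance
    steps = [[0] * (m + 1) for _ in range(n + 1)]    # path length so far
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(reference[i - 1], user[j - 1])
            # Cheapest predecessor: diagonal match, or a stretch of either side.
            prev_cost, prev_steps = min(
                (cost[i - 1][j - 1], steps[i - 1][j - 1]),
                (cost[i - 1][j], steps[i - 1][j]),
                (cost[i][j - 1], steps[i][j - 1]),
            )
            cost[i][j] = prev_cost + d
            steps[i][j] = prev_steps + 1
    return cost[n][m] / steps[n][m]
```

A user recording that holds one sound longer than the reference still aligns at zero cost, which is the property that makes DTW suitable for comparing speech at different speaking rates.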


Features

  • 🎙️ Push-to-talk recording with peak normalization and silence gating.
  • 🗣️ Single voice everywhere: the reference phrase is synthesized by Kokoro, the same engine used for prompts (no second TTS).
  • 🧠 LLM-generated phrases built from an editable practice text panel: paste your own paragraph, song, or sentences to drill.
  • 📊 Pronunciation scoring that combines acoustic similarity (per-step cosine DTW over Wav2Vec2 embeddings, 40%) with phoneme error rate (30%) and word error rate (30%). All components are length-invariant; the acoustic floor is calibrated to your voice with python pronounce/calibrate.py.
  • 🔁 Replay reference vs. your recording to compare.
  • 😀 Articulation face: a schematic mouth opens and closes with the speech as a reference or your recording plays, and shows a smiley reflecting your score while idle.
  • 🧵 Responsive UI: analysis and model loading run in daemon threads; the GUI is updated only via root.after().
  • 💻 Fully local & offline after the models are downloaded.
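
To make the 40/30/30 weighting concrete, here is a minimal sketch of how the three components might be folded into one score. Only the weights come from the feature list above; the function name, the clamping, and the way the acoustic floor rescales the similarity are assumptions and may differ from what pronounce/speech.py does.

```python
def combine_score(acoustic_similarity, phoneme_error_rate, word_error_rate,
                  acoustic_floor=0.0):
    """Blend the three components into a 0-100 score (hypothetical helper).

    acoustic_similarity: 0..1, higher is better.
    phoneme_error_rate, word_error_rate: 0..1, lower is better.
    acoustic_floor: similarity that should count as zero (assumption about
    how the calibrated floor is applied).
    """
    if not 0.0 <= acoustic_floor < 1.0:
        raise ValueError("acoustic_floor must be in [0, 1)")

    def clamp(x):
        return max(0.0, min(1.0, x))

    # Stretch [floor, 1] onto [0, 1] so the floor contributes nothing.
    acoustic = clamp((acoustic_similarity - acoustic_floor) / (1.0 - acoustic_floor))
    phoneme = 1.0 - clamp(phoneme_error_rate)
    word = 1.0 - clamp(word_error_rate)
    return 100.0 * (0.4 * acoustic + 0.3 * phoneme + 0.3 * word)
```

Under these assumptions, a recording with perfect transcription but acoustic similarity sitting exactly at the floor scores 60, which is below the default pass threshold of 70.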

Architecture

| File | Responsibility |
| --- | --- |
| main.py | PronunciationTrainerGUI: Tkinter GUI, recording, the Prompt→Record→Analyze→Feedback→Loop state machine, threading orchestration, LLM-server subprocess management. |
| pronounce/speech.py | Pronunciation analysis core (adapted from OpenPronounce). Single entry point analyze(...); Wav2Vec2 embeddings + DTW, phoneme comparison, prosody, scoring. No GUI dependency. |
| pronounce/calibrate.py | On-request scoring calibration: reads the per-attempt samples from logs/pronounce_samples.jsonl and writes the acoustic floor to pronounce/calibration.json. |
| mimora/tts.py | TTSManager: Kokoro TTS. synthesize() returns the waveform; play_array() plays any waveform (reference at 24 kHz, your recording at 16 kHz). loudness_envelope() precomputes the per-frame mouth-openness track used by the face. |
| mimora/face_widget.py | FaceWidget: schematic articulation face (Tk Canvas). Talking mouth driven from a precomputed loudness track while audio plays; smiley reflecting the score when idle. Stdlib tkinter only. |
| mimora/stt.py | STTManager: faster-whisper speech-to-text (loaded at startup; kept available for future use). |
| mimora/llm.py | LLMManager: OpenAI-compatible client. generate_phrase() produces one practice phrase per request. |
| mimora/config.py | All configuration: device, model names, score threshold, practice-text path, phrase-generation settings, audio settings. |
| llm_server/server.py | Standalone FastAPI server loading GGUF models via llama_cpp; runs as a separate process to avoid CUDA contention. See llm_server/README.md. |
| config/ | User configuration data: settings.json (hand-edited preferences) and themes/ (UI color schemes). |
| texts/practice_text.txt | Default source text shown in the input panel at startup; put your own practice texts in texts/. |
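
The per-frame mouth-openness track mentioned for mimora/tts.py can be illustrated with a short sketch. This is not the real loudness_envelope(): the function name, the 40 ms frame size, and the choice of RMS scaled to the loudest frame are all assumptions made for illustration.

```python
import math


def sketch_loudness_envelope(samples, sample_rate, frame_ms=40):
    """Return one openness value in 0..1 per frame (hypothetical helper).

    samples: sequence of floats in -1..1 (mono waveform).
    frame_ms: frame length in milliseconds (assumed value).
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    rms_per_frame = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms_per_frame.append(math.sqrt(sum(s * s for s in frame) / len(frame)))
    peak = max(rms_per_frame, default=0.0)
    if peak == 0.0:
        return [0.0] * len(rms_per_frame)  # silence keeps the mouth closed
    # Scale so the loudest frame opens the mouth fully.
    return [value / peak for value in rms_per_frame]
```

Precomputing the track once per waveform means the face only has to look up a value by playback position while audio plays, rather than analyzing audio on the GUI thread.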

Requirements

  • Python 3.11 or 3.12 (the versions Mimora is developed and tested on). Python 3.13 and newer are not yet supported (as of June 2026).
  • Windows is the primary target (TTS playback uses winsound); a sounddevice fallback exists for other platforms.
  • A microphone and speakers.
  • For GPU acceleration: an NVIDIA GPU with a CUDA-enabled PyTorch build.
  • espeak-ng (native binary, required by the phonemizer), installed separately; see below.

Models

install.py pre-downloads all of these (see Quick install). Otherwise the three Hugging Face models are fetched automatically on first run, and only the GGUF chat model must be obtained manually.

| Model | Used by | Notes |
| --- | --- | --- |
| facebook/wav2vec2-large-960h | pronunciation analysis | ~1.2 GB; via install.py or on first run |
| Kokoro-82M (hexgrad/Kokoro-82M) | text-to-speech | via install.py or on first run |
| faster-whisper small | speech-to-text | via install.py or on first run |
| A GGUF chat model (e.g. Llama-3.2-3B-Instruct-Q4_K_M) | phrase generation | via install.py, or download manually into models/ |

Installation

Quick install (script)

install.py automates the whole setup: it installs the Python dependencies, auto-detects an NVIDIA GPU and installs the matching CUDA builds of torch and llama-cpp-python, checks for espeak-ng, pre-downloads the Hugging Face models into model_cache/, and downloads the GGUF chat model into models/.

git clone <your-repo-url> Mimora
cd Mimora

# Create and activate a virtual environment, then run the installer INSIDE it
# (the script installs into whatever interpreter runs it):
python -m venv .venv
.venv\Scripts\activate            # Windows
# source .venv/bin/activate       # macOS / Linux

python install.py

The installer prints each step and the exact command, then asks before running it (answer Y to run, n to abort, s to skip). Anything already installed is detected and offered as reinstall-or-skip rather than blindly redone. The full run is logged to logs/install.log.

Useful flags:

  • --yes: run non-interactively (skips already-installed steps; add --reinstall to force them)
  • --dry-run: print the steps and commands without executing anything
  • --cpu: skip the GPU (CUDA) installs
  • --skip-models / --skip-gguf: skip the model / GGUF downloads

espeak-ng (a native binary, see below) is checked but not installed on Windows; follow the printed instructions if it is missing. On Windows, enabling Developer Mode lets the model cache use symlinks; without it the installer falls back to copying files (more disk use, but it always works).

Manual installation

# 1. Clone
git clone <your-repo-url> Mimora
cd Mimora

# 2. Core + LLM-server dependencies
pip install -r requirements.txt
pip install -r llm_server/requirements.txt

# 3. Pronunciation module dependencies (Wav2Vec2, phonemizer, DTW, etc.)
pip install -r pronounce/requirements.txt

Install espeak-ng (required for phoneme analysis)

phonemizer needs the native espeak-ng binary on your PATH:

  • Windows: download and run the installer from the espeak-ng releases.
  • macOS: brew install espeak-ng
  • Linux: sudo apt-get install espeak-ng

GPU support (recommended)

The default torch and llama-cpp-python wheels are CPU-only. For NVIDIA GPUs:

  • PyTorch: install a CUDA build (other CUDA versions: see pytorch.org):
    python -m pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124 --force-reinstall
    Reinstall torch and torchaudio together: force-reinstalling torch alone leaves a torchaudio built against the previous torch, which then fails to import (OSError: [WinError 127]) and breaks pronunciation analysis.
  • llama-cpp-python: build with CUDA (see llm_server/README.md for details):
    $env:CMAKE_ARGS="-DGGML_CUDA=on"
    pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

Get a GGUF model

install.py already downloads llama-3.2-3b-instruct-q4_k_m.gguf into models/. To do it manually instead, download a small instruct model (e.g. Llama-3.2-3B-Instruct-Q4_K_M.gguf) and place it at the path set by EXTERNAL_MODEL_PATH in mimora/config.py (default: models/llama-3.2-3b-instruct-q4_k_m.gguf).


Usage

python main.py

On first launch the app loads the STT, TTS, and Wav2Vec2 models and starts the LLM server (this takes a while and downloads several GB). Once it shows Ready:

  1. Edit the Practice text panel (or keep the default).
  2. Click 🎲 New phrase: Mimora generates a phrase and speaks it.
  3. Hold SPACE (or press and hold the mic button) and repeat the phrase; release when done.
  4. Read your score and the words to improve in the feedback panel.
  5. Use ▶ Reference / ▶ My recording to compare, then repeat or generate the next phrase.

Press ESC or close the window to quit (the LLM server subprocess is terminated cleanly).
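
What happens to a recording between releasing SPACE and analysis is listed under Features as peak normalization and silence gating. The sketch below shows one way those two steps could look; the function names, the 0.9 target peak, and the 0.01 RMS threshold are assumptions, not values taken from main.py.

```python
import math


def peak_normalize(samples, target_peak=0.9):
    """Scale the waveform so its loudest sample reaches target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


def is_silent(samples, rms_threshold=0.01):
    """True when the overall RMS level is below the gate (assumed threshold)."""
    if not samples:
        return True
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms < rms_threshold


def prepare_recording(samples):
    """Return a normalized copy, or None when the take should be discarded."""
    if is_silent(samples):
        return None
    return peak_normalize(samples)
```

Gating before normalizing matters in this arrangement: normalizing first would amplify background noise to full scale and make a silent take look like speech.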


Configuration

Key options in mimora/config.py (overridable via config/settings.json):

| Setting | Default | Description |
| --- | --- | --- |
| WAV2VEC2_MODEL_NAME | facebook/wav2vec2-large-960h | Pronunciation/transcription model. |
| WAV2VEC2_DEVICE | DEVICE (cuda/cpu) | Set to "cpu" to avoid VRAM contention with llama_cpp / Kokoro. |
| PRONUNCIATION_SCORE_THRESHOLD | 70.0 | Score (0–100) required to pass a phrase. |
| PRACTICE_TEXT_FILE | texts/practice_text.txt | Source text pre-loaded into the input panel. |
| PHRASE_GEN_TEMPERATURE / PHRASE_GEN_MAX_TOKENS | 0.7 / 40 | Phrase-generation sampling. |
| PHRASE_GEN_WINDOW_SENTENCES | 5 | Sentences of the source text sent to the model per request (sliding window). |
| PHRASE_GEN_WINDOW_REPEATS | 5 | Phrases generated per window position before it slides forward by half its size. |
| LLM_BACKEND | local_server | local_server (auto-started subprocess) or lm-studio. |
| MAX_RECORD_SECONDS | 20 | Safety cap on recording length. |
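
The interaction between PHRASE_GEN_WINDOW_SENTENCES and PHRASE_GEN_WINDOW_REPEATS is easier to see in code. The sketch below is a hypothetical helper, not the logic in mimora/llm.py: rounding "half its size" down, adding a final window that reaches the last sentence, and wrapping back to the start are all assumptions.

```python
def window_for_request(sentences, request_index, window=5, repeats=5):
    """Pick the sentences sent to the model for the Nth phrase request.

    window / repeats mirror PHRASE_GEN_WINDOW_SENTENCES and
    PHRASE_GEN_WINDOW_REPEATS. Step size and wrap-around are assumptions.
    """
    if len(sentences) <= window:
        return list(sentences)  # text shorter than one window: send it all
    step = max(1, window // 2)          # "half its size", rounded down
    last_start = len(sentences) - window
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)       # make sure the final sentences are covered
    position = request_index // repeats  # the window moves every `repeats` requests
    start = starts[position % len(starts)]  # wrap to the beginning at the end
    return sentences[start:start + window]
```

With the defaults and a 12-sentence text, requests 0 to 4 see sentences 0 to 4, requests 5 to 9 see sentences 2 to 6, and so on, so consecutive windows overlap and phrases drift through the text gradually.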

Using the pronunciation core as a library

The pronounce/ package is GUI-agnostic and can be used on its own:

import pronounce

pronounce.load_models()   # load Wav2Vec2 once (and warm_up() to remove first-call latency)

result = pronounce.analyze(
    user_audio=user_waveform,        # np.ndarray, 16 kHz mono
    expected_text="hello world",
    reference_audio=reference_wav,   # np.ndarray (e.g. Kokoro output)
    user_sr=16000,
    reference_sr=24000,
)

print(result.score)              # 0–100
print(result.passed)             # score >= threshold
print(result.transcription)      # what was recognized
print(result.words_with_errors)  # words to improve
print(result.prosody)            # {"f0": [...], "energy": [...]}

Running the tests

# Fast unit tests (pure logic, no model download, offline)
python -m unittest tests.test_speech -v

# Optional end-to-end check on real audio (downloads the model, needs espeak-ng)
python tests/test_speech.py path/to/user.wav [path/to/reference.wav]

GPU / CPU notes

Two torch models (Wav2Vec2 and Kokoro) plus llama_cpp can compete for VRAM on a single GPU. Mimora mitigates this in two ways:

  • The LLM runs in a separate process (llm_server/), and the practice loop runs its phases (LLM → Kokoro → Wav2Vec2) sequentially, so they don't synthesize/infer at the same time.
  • If VRAM is still tight, set WAV2VEC2_DEVICE = "cpu" in mimora/config.py; short phrases analyze acceptably on CPU.

Known limitations

  • English only (phonemizer en-us).
  • The transcription-based word errors only surface mistakes the ASR actually "hears"; subtle distortions where the word is still recognized may not appear in the word list (the acoustic DTW + prosody partially compensate).
  • Scoring is heuristic: the acoustic floor depends on your voice and microphone. After a practice session run python pronounce/calibrate.py to fit it to your data (--dry-run previews the change); every attempt's raw components are logged to logs/pronounce_samples.jsonl and logs/main.log for inspection.
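
The second limitation follows directly from how a transcription-based word list is built. The sketch below uses difflib from the standard library as a stand-in for whatever alignment pronounce/speech.py performs; the function names and the tokenization rule are hypothetical.

```python
import difflib
import re


def tokenize(text):
    """Lowercase words only, so punctuation and casing never count as errors."""
    return re.findall(r"[a-z']+", text.lower())


def flag_words(expected_text, recognized_text):
    """Expected words that the recognizer replaced or dropped (hypothetical helper)."""
    expected = tokenize(expected_text)
    recognized = tokenize(recognized_text)
    matcher = difflib.SequenceMatcher(a=expected, b=recognized, autojunk=False)
    flagged = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            flagged.extend(expected[i1:i2])
    return flagged
```

If "brown" is heard as "crown" the word is flagged, but a slightly distorted "brown" that is still transcribed as "brown" produces an empty list, leaving the acoustic and prosody components as the only place where the distortion can show up.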

Credits

License

See LICENSE. The reused OpenPronounce components are MIT-licensed; their attribution is retained in pronounce/speech.py.