ThirdLetterC/normalize_uk-cpp

normalize-uk-cpp

C++23 Ukrainian text normalization and tokenization utilities with optional Python 3.10+ bindings.

CMake

cmake -S . -B build
cmake --build build
ctest --test-dir build

Enable Python bindings explicitly when building with CMake:

cmake -S . -B build-python -DNORMALIZE_UK_CPP_BUILD_PYTHON=ON
cmake --build build-python
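A quick way to confirm that the bindings are visible to a given interpreter is to ask the import system whether it can locate the module. The sketch below assumes the module is named normalize_uk, as in the import shown in the Python section; the helper name bindings_available is hypothetical and not part of the project. Where a plain CMake build places the compiled module is not stated here, so this check only reports what the current interpreter can find on its import path.

```python
import importlib.util


def bindings_available(module_name="normalize_uk"):
    """Return True if the current interpreter can locate the named module.

    Hypothetical helper for illustration; it does not import the module,
    it only asks the import system whether a spec can be found.
    """
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Assumption: the bindings are exposed as the top-level module normalize_uk.
    print("normalize_uk importable:", bindings_available())
```

If this reports False after a build, the interpreter running the check is probably not the one the bindings were built or installed for.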

Python

Install the package from the repository root:

python -m pip install .

Then import it and call the normalization helpers:

import normalize_uk as nuk

print(nuk.number_to_words(123))
print(nuk.normalize_ukrainian("01.05.2024"))
print(nuk.normalize_ukrainian_with_preset("01.05.2024", nuk.NormalizePreset.TtsFriendly))
print([sentence.text for sentence in nuk.split_sentences("П'ять зв'язків. Два.")])
print([token.text for token in nuk.tokenize("П'ять зв'язків.")])
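The calls above can also be combined. The following is a minimal sketch that splits a text into sentences and normalizes each one separately, using only split_sentences, the .text attribute, and normalize_ukrainian as they appear in the example above. The helper name normalize_by_sentence is hypothetical, and the sketch assumes normalize_ukrainian accepts and returns a plain string. The module is passed in as an argument so the helper can be exercised without the package installed.

```python
def normalize_by_sentence(text, nuk):
    """Split text into sentences, then normalize each sentence on its own.

    `nuk` is expected to be the normalize_uk module (or any object exposing
    split_sentences and normalize_ukrainian with the same call shape).
    Hypothetical helper for illustration, not part of the library.
    """
    sentences = nuk.split_sentences(text)
    # Each sentence object exposes its content via .text, as in the README example.
    return [nuk.normalize_ukrainian(sentence.text) for sentence in sentences]


if __name__ == "__main__":
    try:
        import normalize_uk as nuk
    except ImportError:
        print("normalize_uk is not installed; run `python -m pip install .` first")
    else:
        for line in normalize_by_sentence("П'ять зв'язків. Два.", nuk):
            print(line)
```

Normalizing per sentence keeps each output line aligned with one input sentence, which can be convenient when feeding text to a downstream consumer one sentence at a time.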

More examples live in examples/python/.

Formatting

The project includes a .clang-format file and a CMake formatting target. Install clang-format, then run:

cmake --build build --target format
