Gradio

About

Morpheus combines unsupervised morphological supervision (Morfessor) with self-supervised objectives (skip-gram, contrastive, masked LM) to learn segmentations that are surface-preserving (lossless decode(encode(w))==w), morphologically aligned, and language-modeling-friendly — the lowest BPC among reversible tokenizers, and the highest frequency-weighted token purity on TR-MMLU.

Because it is neural, the same forward pass that tokenizes also yields a 320-dim word embedding. On the showcase set here, Turkish root families are 31.8× more separable in Morpheus's space than in BERTurk's. In the paper's controlled evaluation, Morpheus leads on lexical retrieval (root-family MAP 0.85 vs BGE-M3 0.80, BERTurk 0.49) and same-root verification (ROC-AUC 1.00), while contextual encoders lead on inflection- and context-dependent tasks (NER, case/number probing) — making Morpheus a strong, cheap lexical index complement to a dense semantic encoder.

See the GitHub repo for architecture details and full evaluation.

🇹🇷 Morpheus: Turkish Morpheme Tokenizer & Word Embedder

About