🇹🇷 Morpheus: Turkish Morpheme Tokenizer & Word Embedder
Neural morpheme-aware tokenizer and word embedder for Turkish.
- Segmentation — type Turkish text, see the morpheme-level breakdown, and a lossless-vs-lossy roundtrip comparison against the rule-based TurkishTokenizer
- Embedding Explorer — find nearest neighbors in Morpheus space, toggle to compare against BERTurk (dbmdz/bert-base-turkish-cased)
Showcase auto-built from the full training corpus with the filters below. Final showcase: 593 words across 35 root families.
Model: lonewolflab/Morpheus-TR-50K ·
Baseline: dbmdz/bert-base-turkish-cased ·
Code: GitHub
Per-word breakdown
About
Morpheus combines unsupervised morphological supervision
(Morfessor) with self-supervised
objectives (skip-gram, contrastive, masked LM) to learn segmentations that are
surface-preserving (lossless decode(encode(w))==w), morphologically
aligned, and language-modeling-friendly — the lowest BPC among reversible
tokenizers, and the highest frequency-weighted token purity on TR-MMLU.
Because it is neural, the same forward pass that tokenizes also yields a 320-dim word embedding. On the showcase set here, Turkish root families are 31.8× more separable in Morpheus's space than in BERTurk's. In the paper's controlled evaluation, Morpheus leads on lexical retrieval (root-family MAP 0.85 vs BGE-M3 0.80, BERTurk 0.49) and same-root verification (ROC-AUC 1.00), while contextual encoders lead on inflection- and context-dependent tasks (NER, case/number probing) — making Morpheus a strong, cheap lexical index complement to a dense semantic encoder.
See the GitHub repo for architecture details and full evaluation.