🇹🇷 Morpheus: Turkish Morpheme Tokenizer & Word Embedder

Neural morpheme-aware tokenizer and word embedder for Turkish.

  1. Segmentation: type Turkish text to see its morpheme-level breakdown, plus a roundtrip comparison (lossless vs. lossy) against the rule-based TurkishTokenizer
  2. Embedding Explorer: find nearest neighbors in Morpheus space, and toggle to compare against BERTurk (dbmdz/bert-base-turkish-cased)

The showcase is built automatically from the full training corpus using the filters below. Final showcase: 593 words across 35 root families.

Model: lonewolflab/Morpheus-TR-50K · Baseline: dbmdz/bert-base-turkish-cased · Code: GitHub


About

Morpheus combines morphological supervision from Morfessor, an unsupervised segmenter, with self-supervised objectives (skip-gram, contrastive, masked LM). The learned segmentations are surface-preserving (lossless: decode(encode(w)) == w), morphologically aligned, and language-modeling-friendly. Morpheus achieves the lowest bits per character (BPC) among reversible tokenizers and the highest frequency-weighted token purity on TR-MMLU.
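The lossless property above, decode(encode(w)) == w, can be checked mechanically for any tokenizer. The sketch below is illustrative only: `surface_encode`, `canonical_encode`, and `decode` are hypothetical toy functions, not the Morpheus or TurkishTokenizer API. The toy lossy encoder replaces the plural allomorphs "ler"/"lar" with a canonical tag, which is one plausible way a rule-based analyzer can lose the surface form.

```python
from typing import Callable, Iterable, List

PLURAL_SUFFIXES = ("ler", "lar")  # toy suffix inventory, for illustration only


def roundtrip_failures(
    encode: Callable[[str], List[str]],
    decode: Callable[[List[str]], str],
    words: Iterable[str],
) -> List[str]:
    """Return every word w for which decode(encode(w)) != w."""
    return [w for w in words if decode(encode(w)) != w]


def surface_encode(word: str) -> List[str]:
    """Hypothetical lossless encoder: split off a plural suffix, keep its surface form."""
    for suffix in PLURAL_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return [word[: -len(suffix)], suffix]
    return [word]


def canonical_encode(word: str) -> List[str]:
    """Hypothetical lossy encoder: replace the plural suffix with a canonical tag."""
    pieces = surface_encode(word)
    if len(pieces) == 2:
        return [pieces[0], "lAr"]  # vowel harmony information is discarded here
    return pieces


def decode(pieces: List[str]) -> str:
    """Concatenate pieces back into a single string."""
    return "".join(pieces)


words = ["evler", "kitaplar", "Ankara"]
print(roundtrip_failures(surface_encode, decode, words))    # []
print(roundtrip_failures(canonical_encode, decode, words))  # ['evler', 'kitaplar']
```

To run the same check against a real tokenizer, pass its own encode and decode callables to `roundtrip_failures`; an empty result on a word list means the tokenizer is reversible on that list.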

Because it is neural, the same forward pass that tokenizes also yields a 320-dim word embedding. On the showcase set here, Turkish root families are 31.8× more separable in Morpheus's space than in BERTurk's. In the paper's controlled evaluation, Morpheus leads on lexical retrieval (root-family MAP 0.85 vs. 0.80 for BGE-M3 and 0.49 for BERTurk) and on same-root verification (ROC-AUC 1.00), while contextual encoders lead on inflection- and context-dependent tasks (NER, case/number probing). This makes Morpheus a strong, cheap lexical index that complements a dense semantic encoder.
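As a rough illustration of nearest-neighbor lookup and root-family retrieval scoring, the sketch below ranks words by cosine similarity and computes average precision for one query. The 3-dimensional vectors are made-up toy values, not Morpheus embeddings (which are 320-dimensional), and the paper's exact MAP protocol may differ from this standard definition.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_neighbors(
    query: str, vectors: Dict[str, Sequence[float]], k: int = 3
) -> List[Tuple[str, float]]:
    """Rank all other words by cosine similarity to the query word."""
    q = vectors[query]
    scored = [(w, cosine(q, v)) for w, v in vectors.items() if w != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision@rank over the ranks at which relevant items appear."""
    hits, total = 0, 0.0
    for rank, word in enumerate(ranked, start=1):
        if word in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Toy vectors (assumed values): two root families, "ev" and "kitap".
toy_vectors = {
    "ev": [1.0, 0.1, 0.0],
    "evler": [0.9, 0.2, 0.0],
    "evde": [0.8, 0.0, 0.1],
    "kitap": [0.0, 1.0, 0.1],
    "kitaplar": [0.1, 0.9, 0.0],
}

ranking = [w for w, _ in nearest_neighbors("ev", toy_vectors, k=4)]
print(ranking)                                        # ['evler', 'evde', 'kitaplar', 'kitap']
print(average_precision(ranking, {"evler", "evde"}))  # 1.0
```

Averaging `average_precision` over every query word gives a mean average precision (MAP) score; a value near 1.0 means same-root words consistently rank ahead of words from other families.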

See the GitHub repo for architecture details and full evaluation.