Reference

Data Sources

Learn where the data in this API comes from and how it was processed.

Vocabulary Data

📚

Wiktionary

en.wiktionary.org

Wiktionary is a free, collaborative, multilingual dictionary. We extracted German vocabulary entries including translations, gender, part of speech, and example sentences.

Raw entries: 362,664

After cleaning: 346,346

CEFR tagged: 8,242

License: CC BY-SA 4.0

Sentence Data

💬

Tatoeba

tatoeba.org

Tatoeba is a large database of sentences and translations. We extracted German-English sentence pairs and tagged them with CEFR levels based on vocabulary complexity.

Raw pairs: 485,132

After cleaning: 479,628

Curated selection: 2,300

License: CC BY 2.0 FR
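
The exact tagging rule is not documented here. As a minimal sketch of what tagging "based on vocabulary complexity" could look like, the example below assigns a sentence the highest CEFR level found among its known words. The `WORD_LEVELS` lookup, its level assignments, and the `tag_sentence` helper are hypothetical and for illustration only; they are not part of this API.

```python
import re

CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]

# Hypothetical word-to-level lookup with illustrative values only.
# A real pipeline would build this from the CEFR-tagged vocabulary data.
WORD_LEVELS = {
    "ich": "A1", "bin": "A1", "müde": "A1",
    "die": "A1", "ist": "A1",
    "entscheidung": "B1",
    "umstritten": "B2",
}


def tag_sentence(sentence, word_levels=WORD_LEVELS):
    """Return the highest CEFR level among known words, or None if none are known."""
    # [^\W\d_]+ matches runs of Unicode letters, so ä, ö, ü and ß stay intact.
    tokens = re.findall(r"[^\W\d_]+", sentence.lower())
    levels = [word_levels[t] for t in tokens if t in word_levels]
    if not levels:
        return None
    # The hardest word decides the level of the whole sentence.
    return max(levels, key=CEFR_ORDER.index)


print(tag_sentence("Ich bin müde."))
print(tag_sentence("Die Entscheidung ist umstritten."))
```

Taking the maximum is a deliberately simple choice: a single advanced word pushes the whole sentence up a level. Other rules, such as an average or a percentile over word levels, would be equally plausible readings of the description above.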

Grammar Rules

📖

Custom Curated

Original content

Grammar rules were written and curated by hand to cover German grammar broadly, including cases, verb conjugations, sentence structure, and special features.

Total rules: 365

Categories: 31

Processing Pipeline

All data passes through a six-step cleaning and enrichment pipeline:

Download: Fetch raw data from sources
Clean: Remove duplicates, fix encoding, normalize text
Validate: Apply German-safe rules that preserve umlauts, ß, and capitalization
Enrich: Add CEFR levels, fill in missing genders, add frequency ranking
Select: Curate best entries per level
Export: Generate API-ready JSONL files
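
The Clean, Validate, and Export steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual pipeline code: the entry fields (`word`, `pos`) and the helper names `clean_text`, `clean_entries`, and `to_jsonl` are hypothetical.

```python
import json
import unicodedata


def clean_text(text: str) -> str:
    """Normalize encoding and whitespace without altering German characters."""
    # NFC composes "a" + combining diaeresis into a single "ä". It does not
    # strip umlauts or rewrite ß, and it leaves capitalization untouched.
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of whitespace (including non-breaking spaces) to one space.
    return " ".join(text.split())


def clean_entries(entries):
    """Drop empty and duplicate entries, keeping the first occurrence."""
    seen = set()
    cleaned = []
    for entry in entries:
        word = clean_text(entry["word"])
        # The key is case-sensitive on purpose: "Essen" (noun) and
        # "essen" (verb) must not be merged into one entry.
        key = (word, entry.get("pos"))
        if not word or key in seen:
            continue
        seen.add(key)
        cleaned.append({**entry, "word": word})
    return cleaned


def to_jsonl(entries) -> str:
    """Serialize entries as JSONL, one JSON object per line."""
    # ensure_ascii=False keeps ä, ö, ü and ß readable instead of \u escapes.
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in entries)


raw = [
    {"word": "Stra\u00dfe", "pos": "noun"},
    {"word": "  Stra\u00dfe ", "pos": "noun"},    # duplicate after cleaning
    {"word": "Ma\u0308dchen", "pos": "noun"},     # decomposed "ä"
    {"word": "Essen", "pos": "noun"},
    {"word": "essen", "pos": "verb"},
]
cleaned = clean_entries(raw)
print(to_jsonl(cleaned))
```

The point of the sketch is what it avoids: there is no lowercasing, no ASCII folding, and no accent stripping, since any of those would damage German text (for example by merging "Essen" with "essen" or turning "Straße" into "Strasse").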

Open Source

The dataset is available on our GitHub repository.