Data Sources
Learn about where the data in this API comes from and how it was processed.
Vocabulary Data
📚
Wiktionary
en.wiktionary.org
Wiktionary is a free, collaborative, multilingual dictionary. We extracted German vocabulary entries, including translations, gender, part of speech, and example sentences.
Raw entries: 362,664
After cleaning: 346,346
CEFR-tagged: 8,242
License: CC BY-SA 4.0
Sentence Data
💬
Tatoeba
tatoeba.org
Tatoeba is a large database of sentences and translations. We extracted German-English sentence pairs and tagged them with CEFR levels based on vocabulary complexity.
Raw pairs: 485,132
After cleaning: 479,628
Curated selection: 2,300
License: CC BY 2.0 FR
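Tagging a sentence by vocabulary complexity could work roughly as sketched below. This is only an illustration under stated assumptions: the rule used here (a sentence takes the highest CEFR level among its known words) is a guess at one plausible approach, not the documented method, and `WORD_LEVELS`, `LEVEL_ORDER`, and `tag_sentence` are hypothetical names with made-up sample data.

```python
import re

# Hypothetical word-to-level lookup with made-up sample levels.
# A real lookup would come from the CEFR-tagged vocabulary entries.
WORD_LEVELS = {
    "ich": "A1", "bin": "A1", "müde": "A1", "habe": "A1",
    "geschlafen": "A2", "obwohl": "B1",
}
LEVEL_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]


def tag_sentence(sentence, word_levels=WORD_LEVELS):
    """Return the highest CEFR level among known words, or None if none is known.

    Assumption: the hardest known word decides the sentence level.
    """
    # [^\W\d_]+ matches runs of Unicode letters, so ä, ö, ü and ß stay intact.
    words = re.findall(r"[^\W\d_]+", sentence.lower())
    ranks = [LEVEL_ORDER.index(word_levels[w]) for w in words if w in word_levels]
    return LEVEL_ORDER[max(ranks)] if ranks else None
```

With this sample lookup, `tag_sentence("Ich bin müde, obwohl ich geschlafen habe.")` is driven by "obwohl", the hardest known word in the sentence. Unknown words are ignored rather than guessed at, which keeps the sketch simple but can under-rate sentences with rare vocabulary.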
Grammar Rules
📖
Custom Curated
Original content
Grammar rules were written and curated by hand to cover German grammar, including cases, verb conjugations, sentence structure, and special features.
Total rules: 365
Categories: 31
Processing Pipeline
All data passes through a six-step cleaning and enrichment pipeline:
- Download: Fetch raw data from sources
- Clean: Remove duplicates, fix encoding, normalize text
- Validate: Apply German-safe rules that preserve umlauts, ß, and capitalization
- Enrich: Add CEFR levels, gender completion, frequency ranking
- Select: Curate best entries per level
- Export: Generate API-ready JSONL files
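The Clean, Validate, and Export steps above might look something like the following minimal sketch. It is not the project's actual pipeline code: the entry fields (`word`, `pos`) and the function names are assumptions made for illustration. The point it shows is that normalization can be limited to Unicode composition and whitespace, so umlauts, ß, and capitalization pass through unchanged.

```python
import json
import unicodedata


def clean_text(text):
    """Normalize to NFC and collapse whitespace; case, umlauts, and ß are untouched."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def clean_entries(entries):
    """Clean each entry and drop duplicates, keeping the first occurrence.

    Assumption: an entry is a dict with "word" and optional "pos" keys.
    The duplicate key is case-sensitive, so "Morgen" and "morgen" both survive.
    """
    seen = set()
    result = []
    for entry in entries:
        word = clean_text(entry["word"])
        key = (word, entry.get("pos"))
        if not word or key in seen:
            continue
        seen.add(key)
        result.append({**entry, "word": word})
    return result


def to_jsonl(entries):
    """Serialize one JSON object per line; ensure_ascii=False keeps ä, ö, ü, ß readable."""
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in entries)
```

NFC normalization matters here because the same umlaut can arrive either as a single code point or as a base letter plus a combining diaeresis. Without it, two visually identical words would not be detected as duplicates.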
Open Source
The dataset is available in our GitHub repository.