Reference

Data Sources

Learn about where the data in this API comes from and how it was processed.

Vocabulary Data

Wiktionary is a free, collaborative, multilingual dictionary. We extracted German vocabulary entries including translations, gender, part of speech, and example sentences.

Raw entries: 362,664
After cleaning: 346,346
CEFR tagged: 8,242
License: CC BY-SA 4.0

Sentence Data

Tatoeba is a large database of sentences and translations. We extracted German-English sentence pairs and tagged them with CEFR levels based on vocabulary complexity.

Raw pairs: 485,132
After cleaning: 479,628
Curated selection: 2,300
License: CC BY 2.0 FR
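
One simple way to tag a sentence by vocabulary complexity is to assign it the highest CEFR level found among its known words. The sketch below illustrates that idea only; the `WORD_LEVELS` lexicon, the `tag_sentence` function, and the "highest level wins" rule are assumptions made for illustration, not the tagger actually used for this dataset.

```python
import re

CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]

# Hypothetical word-to-level lexicon; a real one would come from the
# CEFR-tagged vocabulary entries.
WORD_LEVELS = {
    "ich": "A1",
    "habe": "A1",
    "einen": "A1",
    "hund": "A1",
    "trotzdem": "B1",
    "nachhaltigkeit": "B2",
}


def tag_sentence(sentence, word_levels=WORD_LEVELS):
    """Return the highest CEFR level among known words, or None if none are known."""
    # \w is Unicode-aware for str patterns, so umlauts and ß stay inside tokens.
    tokens = re.findall(r"\w+", sentence.lower())
    ranks = [CEFR_ORDER.index(word_levels[t]) for t in tokens if t in word_levels]
    return CEFR_ORDER[max(ranks)] if ranks else None


print(tag_sentence("Ich habe einen Hund."))
print(tag_sentence("Trotzdem habe ich einen Hund."))
```

Unknown words are ignored in this sketch, so a sentence made up entirely of out-of-lexicon words gets no level rather than a guessed one.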

Grammar Rules

Custom Curated

Original content

Grammar rules were written and curated by hand to cover German grammar comprehensively, including cases, verb conjugations, sentence structure, and special features.

Total rules: 365
Categories: 31

Processing Pipeline

All data goes through a six-step cleaning and enrichment pipeline:

  1. Download: Fetch raw data from sources
  2. Clean: Remove duplicates, fix encoding, normalize text
  3. Validate: Apply German-safe rules (preserve umlauts, ß, and capitalization)
  4. Enrich: Add CEFR levels, gender completion, frequency ranking
  5. Select: Curate best entries per level
  6. Export: Generate API-ready JSONL files
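
The Clean, Validate, and Export steps can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the function names and the `word` field are hypothetical, and deduplication here is a simple exact match after normalization.

```python
import json
import unicodedata


def clean_text(text):
    """Normalize Unicode and whitespace without touching case, umlauts, or ß."""
    # NFC composes "a" + combining diaeresis into "ä"; it leaves ß unchanged.
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def clean_entries(entries):
    """Drop empty entries and exact duplicates, keeping first occurrences."""
    seen = set()
    cleaned = []
    for entry in entries:
        word = clean_text(entry["word"])  # "word" is an assumed field name
        if not word or word in seen:
            continue
        seen.add(word)
        cleaned.append({**entry, "word": word})
    return cleaned


def to_jsonl(entries):
    """Serialize one JSON object per line; ensure_ascii=False keeps ä, ö, ü, ß readable."""
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in entries)


raw = [
    {"word": "Straße "},
    {"word": "Straße"},          # duplicate once whitespace is trimmed
    {"word": "Ma\u0308dchen"},   # decomposed umlaut, composed by NFC
    {"word": "essen"},           # verb
    {"word": "Essen"},           # noun: kept, because case is preserved
]
print(to_jsonl(clean_entries(raw)))
```

Comparing entries case-sensitively matters for German, where capitalization distinguishes nouns from other parts of speech, so lowercasing before deduplication would merge entries that should stay separate.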

Open Source

The dataset is available in our GitHub repository.