DeWaC is a 1.7-billion-word corpus constructed from the Web by limiting the crawl to the .de domain and using medium-frequency words from the Süddeutsche Zeitung corpus and basic German vocabulary lists as seeds. The corpus was POS-tagged and lemmatized with the TreeTagger using this tagset; more information is available here.
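
Since the corpus is POS-tagged and lemmatized, a common first step is to read the annotation back into word, tag and lemma triples. The following is only a minimal sketch: it assumes a one-token-per-line layout with three tab-separated columns (word, POS tag, lemma), which is how TreeTagger typically prints its output, and it assumes that markup lines such as `<s>` carry no tabs. The actual DeWaC distribution files may be laid out differently. The names `Token` and `parse_tagged_lines` are hypothetical, and the sample sentence and its tags are invented for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    word: str
    pos: str
    lemma: str


def parse_tagged_lines(lines: Iterable[str]) -> Iterator[Token]:
    """Yield tokens from lines assumed to look like word<TAB>pos<TAB>lemma."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and markup-only lines such as <s> or </s>
        # (assumption: markup lines contain no tab character).
        if not line or (line.startswith("<") and "\t" not in line):
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # ignore lines that do not match the assumed layout
        yield Token(*fields)


# Invented sample in the assumed format; not taken from the corpus.
sample = [
    "<s>\n",
    "Das\tART\tdie\n",
    "Haus\tNN\tHaus\n",
    "ist\tVAFIN\tsein\n",
    "alt\tADJD\talt\n",
    ".\t$.\t.\n",
    "</s>\n",
]

tokens = list(parse_tagged_lines(sample))

# Count lemmas, leaving out tokens whose tag starts with "$"
# (used for punctuation in this invented sample).
lemma_counts = Counter(t.lemma for t in tokens if not t.pos.startswith("$"))
```

For a corpus of this size, passing an open file handle to `parse_tagged_lines` instead of a list keeps memory use flat, because the function consumes its input lazily, line by line.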

  • corpora/dewac.txt
  • Last modified: 2017/10/19 10:35
  • by eros