DeWaC

DeWaC is a 1.7-billion-word corpus constructed from the Web by limiting the crawl to the .de domain and using medium-frequency words from the Süddeutsche Zeitung corpus and basic German vocabulary lists as seeds. The corpus was POS-tagged and lemmatized with the TreeTagger using this tagset; more information is available here.
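
For readers who want to process the tagged corpus, the following is a minimal sketch of reading TreeTagger-style output. It assumes the usual TreeTagger layout of one token per line with three tab-separated fields (word, POS tag, lemma) and markup lines without tabs; the actual DeWaC file layout should be checked against the distribution. The `Token` type, the `parse_tagged_lines` function and the sample lines are hypothetical and only serve as illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    """One tagged token: surface form, POS tag, lemma."""
    word: str
    pos: str
    lemma: str


def parse_tagged_lines(lines: Iterable[str]) -> Iterator[Token]:
    """Yield tokens from word<TAB>tag<TAB>lemma lines.

    Lines that do not have exactly three tab-separated fields
    (for example sentence markup or blank lines) are skipped.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue
        yield Token(*parts)


# Invented sample in the assumed format; the tags are illustrative only.
sample = [
    "<s>\n",
    "Haus\tNN\tHaus\n",
    "ist\tVAFIN\tsein\n",
    "alt\tADJD\talt\n",
    "</s>\n",
]

tokens = list(parse_tagged_lines(sample))
lemmas = [t.lemma for t in tokens]
```

With a real corpus file, the same function can be applied to an open file handle (for example `open(path, encoding=...)`), since it only iterates over lines; the correct encoding depends on the distributed files and is not specified here.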

  • Last modified: 2017/10/19 08:31
  • by eros