corpora:dewac

Differences

This shows you the differences between two versions of the page.

corpora:dewac [2017/10/19 10:31] – created eros
corpora:dewac [2017/10/19 10:35] (current) eros
Line 1:
 ====== DeWaC ======
  
-DeWaC is a 1.7 billion word corpus constructed from the Web limiting the crawl to the **.de** domain and using medium-frequency words from the SudDeutsche Zeitung corpus and basic German vocabulary lists as seeds. The corpus was POS-tagged and lemmatized with the [[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/|TreeTagger]] using this [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/stts_guide.pdf|tagset]], more information available {{http://wacky.sslmit.unibo.it/lib/exe/fetch.php?media=papers:wacky_2008.pdf|here}}.
+DeWaC is a 1.7 billion word corpus constructed from the Web limiting the crawl to the **.de** domain and using medium-frequency words from the SudDeutsche Zeitung corpus and basic German vocabulary lists as seeds. The corpus was POS-tagged and lemmatized with the [[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/|TreeTagger]] using this [[corpora:tagsets:german|tagset]], more information available {{http://wacky.sslmit.unibo.it/lib/exe/fetch.php?media=papers:wacky_2008.pdf|here}}.
  • corpora/dewac.1508401909.txt.gz
  • Last modified: 2017/10/19 10:31
  • by eros