UkWaC

UkWaC is a 2-billion-word corpus built from the Web by limiting the crawl to the .uk domain and using medium-frequency words from the British National Corpus (BNC) as seeds. The corpus was POS-tagged and lemmatized with the TreeTagger. The tagset is linked below, and more information can be found in wacky_2008.pdf.

Consult the tagset
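
Since the corpus is POS-tagged and lemmatized, each token carries a word form, a tag, and a lemma. The following is a minimal sketch of reading such annotations in Python, assuming the one-token-per-line, tab-separated layout (word, POS tag, lemma) that the TreeTagger writes by default, with markup such as sentence boundaries on lines of their own. The names `Token` and `parse_vertical` and the sample lines are illustrative only and are not taken from the corpus distribution.

```python
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    """One annotated token: surface form, POS tag, lemma."""
    word: str
    pos: str
    lemma: str


def parse_vertical(lines: Iterable[str]) -> Iterator[Token]:
    """Yield tokens from tab-separated word/tag/lemma lines.

    Assumption: every token line has exactly three tab-separated fields.
    Lines with any other shape (markup, blank lines) are skipped.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue
        yield Token(*parts)


# Made-up sample in the assumed layout, not real corpus data.
sample = [
    "<s>\n",
    "The\tDT\tthe\n",
    "corpora\tNNS\tcorpus\n",
    "</s>\n",
]
tokens = list(parse_vertical(sample))
```

For a real file, the same function can be fed an open file handle instead of a list, so the corpus is streamed line by line rather than loaded into memory.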

  • Last modified: 2018/10/15 08:31
  • by eros