ukWaC is a 2-billion-word corpus constructed from the Web by limiting the crawl to the .uk domain and using medium-frequency words from the BNC as seeds. The corpus was POS-tagged and lemmatized with the TreeTagger. The tagset is available here; more information can be found in this paper.

Consult the tagset
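
As a rough illustration of what POS-tagged and lemmatized text looks like in practice, the sketch below reads token lines in the three-column, tab-separated layout that the TreeTagger emits by default (word, tag, lemma). It is an assumption that the distributed corpus files follow this layout, with markup lines such as `<s>` in between; the `Token` type, the `parse_tagged_lines` helper, and the sample lines are hypothetical and not taken from the corpus.

```python
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    """One annotated token: surface form, POS tag, lemma."""
    word: str
    pos: str
    lemma: str


def parse_tagged_lines(lines: Iterable[str]) -> Iterator[Token]:
    """Yield a Token for every line with exactly three tab-separated fields.

    Assumption: markup lines (e.g. <s>, </s>) and blank lines contain no
    tabs, so they fail the field-count check and are skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue
        yield Token(*fields)


# Illustrative sample only; the tags are examples, not quoted from the tagset.
sample = [
    "<s>\n",
    "The\tDT\tthe\n",
    "corpus\tNN\tcorpus\n",
    "was\tVBD\tbe\n",
    "tagged\tVVN\ttag\n",
    ".\tSENT\t.\n",
    "</s>\n",
]

tokens = list(parse_tagged_lines(sample))
lemmas = [t.lemma for t in tokens]
```

Filtering on the field count rather than on a leading `<` keeps a literal `<` token from being mistaken for markup.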

corpora/ukwac.txt · Last modified: 2018/10/15 10:31 by eros