

UkWaC

UkWaC is a 2-billion-word corpus constructed from the Web by limiting the crawl to the .uk domain and using medium-frequency words from the British National Corpus (BNC) as seeds. The corpus was POS-tagged and lemmatized with the TreeTagger. The tagset is available here; more information can be found in this paper.

Consult the tagset
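
Since the corpus is POS-tagged and lemmatized, each token carries a tag and a lemma alongside its surface form. The following is a minimal sketch of reading such annotations, assuming the common TreeTagger-style layout of one token per line with tab-separated word, tag, and lemma, and structural markup on lines of its own. The layout, the sample lines, and the names `Token` and `parse_vertical` are illustrative assumptions, not taken from the corpus documentation.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    word: str
    pos: str
    lemma: str


def parse_vertical(lines: Iterable[str]) -> Iterator[Token]:
    """Yield tokens from lines assumed to look like word<TAB>tag<TAB>lemma."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        # Markup lines such as <s> and blank lines have no tabs, so they
        # are skipped here; so is anything else without exactly 3 fields.
        if len(parts) != 3:
            continue
        yield Token(*parts)


# Hypothetical sample in the assumed format (not real corpus data).
sample = [
    "<s>\n",
    "The\tDT\tthe\n",
    "cats\tNNS\tcat\n",
    "sleep\tVVP\tsleep\n",
    ".\tSENT\t.\n",
    "</s>\n",
]

tokens = list(parse_vertical(sample))
lemma_counts = Counter(t.lemma for t in tokens)
```

Counting by lemma rather than by word form groups inflected variants together, which is the usual reason to work with the lemmatized layer. For a real file, the same function could be fed an open file handle instead of the sample list.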

  • Last modified: 2017/11/17 09:45
  • by eros