corpora:ukwac

====== UkWaC ======
  
UkWaC is a 2-billion-word corpus constructed from the Web by limiting the crawl to the **.uk** domain and using medium-frequency words from the [[http://www.natcorp.ox.ac.uk/|BNC]] as seeds. The corpus was POS-tagged and lemmatized with the [[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/|TreeTagger]]. The tagset is available [[corpora:tagsets:english|here]]; more information can be found in this {{:corpora:wacky_2008.pdf|paper}}.
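
As a rough illustration of working with POS-tagged and lemmatized text, the sketch below reads TreeTagger-style output. It assumes one token per line with three tab-separated fields (word, POS tag, lemma) and markup lines such as ''<s>'' in between; this page does not state the actual file layout, so check the corpus files before relying on it. The names ''Token'' and ''parse_tagged_lines'' and the sample sentence are made up for this example.

```python
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    """One corpus token (hypothetical container for this sketch)."""
    word: str
    pos: str
    lemma: str


def parse_tagged_lines(lines: Iterable[str]) -> Iterator[Token]:
    """Yield a Token for every line with exactly three tab-separated fields.

    Assumption: markup lines (e.g. <s>, </s>) contain no tabs, so they are
    skipped together with blank or malformed lines.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:
            yield Token(*parts)


# Invented sample in the assumed layout; not taken from the corpus.
sample = [
    "<s>\n",
    "The\tDT\tthe\n",
    "cats\tNNS\tcat\n",
    "sleep\tVVP\tsleep\n",
    ".\tSENT\t.\n",
    "</s>\n",
]

tokens = list(parse_tagged_lines(sample))
lemmas = [t.lemma for t in tokens]
```

For a real file, an open file handle can be passed in place of ''sample'', since the function only iterates over lines.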
===== Tagset =====
  
Consult the [[corpora:tagsets:english|tagset]].
  • corpora/ukwac.txt
  • Last modified: 2018/10/15 08:31
  • by eros