corpora:ukwac

Differences

This shows you the differences between two versions of the page.

corpora:ukwac [2017/10/19 09:22] – [UkWaC] eros
corpora:ukwac [2018/10/15 08:31] (current) – [UkWaC] eros
Line 1:
 ====== UkWaC ======
  
-UkWaC is a 2 billion word corpus constructed from the Web limiting the crawl to the **.uk** domain and using medium-frequency words from the [[http://www.natcorp.ox.ac.uk/|BNC]] as seeds. The corpus was POS-tagged and lemmatized with the [[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/|TreeTagger]]. The tagset is available [[corpora:tagsets:english|here]], more information can be found in this {{http://wacky.sslmit.unibo.it/lib/exe/fetch.php?media=papers:wacky_2008.pdf|paper}}.
+UkWaC is a 2 billion word corpus constructed from the Web limiting the crawl to the **.uk** domain and using medium-frequency words from the [[http://www.natcorp.ox.ac.uk/|BNC]] as seeds. The corpus was POS-tagged and lemmatized with the [[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/|TreeTagger]]. The tagset is available [[corpora:tagsets:english|here]], more information can be found in this {{:corpora:wacky_2008.pdf|paper}}.
 ===== Tagset =====
  
 Consult the [[corpora:tagsets:english|tagset]]
  • corpora/ukwac.1508404962.txt.gz
  • Last modified: 2017/10/19 09:22
  • by eros