corpora:ukwac [2017/11/17 10:45] – eros | corpora:ukwac [2018/10/15 10:29] – [UkWaC] eros |
UkWaC is a 2-billion-word corpus constructed from the Web by limiting the crawl to the **.uk** domain and using medium-frequency words from the [[http://www.natcorp.ox.ac.uk/|BNC]] as seeds. The corpus was POS-tagged and lemmatized with the [[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/|TreeTagger]]. The tagset is available [[corpora:tagsets:english|here]]; more information can be found in this {{http://wacky.sslmit.unibo.it/lib/exe/fetch.php?media=papers:wacky_2008.pdf|paper}}.

===== Tagset =====

Consult the [[corpora:tagsets:english|tagset]].