This shows you the differences between two versions of the page.
corpora:itwac [2017/10/19 10:18] – eros | corpora:itwac [2017/10/19 10:19] (current) – eros |
---|
===== ITWaC ===== | ===== ITWaC ===== |
| |
* **itWaC**: a 2-billion-word corpus constructed from the Web by limiting the crawl to the **.it** domain and using medium-frequency words from the [[corpora:Repubblica]] corpus and basic Italian vocabulary lists as seeds. The corpus was POS-tagged with the [[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/|TreeTagger]] using this [[corpora:tagsets:italian|tagset]] and lemmatized using the [[http://sslmit.unibo.it/morphit|Morph-it!]] lexicon. More information is available {{:papers:wacky_2008.pdf|here}}. | * **itWaC**: a 2-billion-word corpus constructed from the Web by limiting the crawl to the **.it** domain and using medium-frequency words from the [[corpora:Repubblica]] corpus and basic Italian vocabulary lists as seeds. The corpus was POS-tagged with the [[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/|TreeTagger]] using this [[corpora:tagsets:italian|tagset]] and lemmatized using the [[http://sslmit.unibo.it/morphit|Morph-it!]] lexicon. More information is available {{http://wacky.sslmit.unibo.it/lib/exe/fetch.php?media=papers:wacky_2008.pdf|here}}. |
| |
* semantically and syntactically annotated **Italian Wikipedia**: | * semantically and syntactically annotated **Italian Wikipedia**: |
* [[http://medialab.di.unipi.it/Project/QA/wikiCoNLL.bz2|CoNLL format]] ([[http://medialab.di.unipi.it/wiki/Tanl_Tagsets|tagset]]) | * [[http://medialab.di.unipi.it/Project/QA/wikiCoNLL.bz2|CoNLL format]] ([[http://medialab.di.unipi.it/wiki/Tanl_Tagsets|tagset]]) |
* [[http://medialab.di.unipi.it/Project/QA/wikiMT.bz2|MultiTag format]] | * [[http://medialab.di.unipi.it/Project/QA/wikiMT.bz2|MultiTag format]] |
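The CoNLL-format dump listed above is distributed as a bz2 archive, so it can be streamed without unpacking it first. The following is a minimal sketch, not a reference reader: it assumes the usual CoNLL convention of one token per line with tab-separated columns and a blank line between sentences, which should be checked against the actual file and the linked tagset page. The function names and the sample rows are hypothetical.

```python
import bz2
from typing import Iterable, Iterator, List

Sentence = List[List[str]]  # one inner list of column values per token


def read_sentences(lines: Iterable[str]) -> Iterator[Sentence]:
    """Group token lines into sentences.

    Assumption: columns are tab-separated and sentences are separated
    by blank lines (the common CoNLL layout, not verified for this file).
    """
    sentence: Sentence = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(line.split("\t"))
    # Emit a trailing sentence that is not followed by a blank line.
    if sentence:
        yield sentence


def read_conll_bz2(path: str, encoding: str = "utf-8") -> Iterator[Sentence]:
    """Stream sentences from a bz2-compressed file without unpacking it."""
    with bz2.open(path, mode="rt", encoding=encoding) as handle:
        yield from read_sentences(handle)


# Hypothetical sample rows; the real column inventory may differ.
sample = [
    "1\tIl\til\tRD\n",
    "2\tcorpus\tcorpus\tS\n",
    "\n",
    "1\tFine\tfine\tS\n",
]
sentences = list(read_sentences(sample))
```

With a local copy of the archive, `for sentence in read_conll_bz2("wikiCoNLL.bz2"): ...` iterates sentence by sentence, which keeps memory use flat on a large dump. The encoding is also an assumption and may need adjusting.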