itWaC: a 2-billion-word corpus constructed from the Web by limiting the crawl to the
.it domain and using medium-frequency words from the
Repubblica corpus and basic Italian vocabulary lists as seeds. The corpus was POS-tagged with the
TreeTagger using this
tagset, and lemmatized using the
Morph-it! lexicon. More information is available
here.
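
As a rough illustration of working with a POS-tagged and lemmatized corpus like this one, the sketch below reads tokens in a one-token-per-line layout with tab-separated word, tag and lemma columns, which is the layout TreeTagger prints when asked for tokens and lemmas. Whether the distributed itWaC files use exactly this layout is an assumption here, and the helper `parse_vertical`, the sample sentence and the tag names in it are hypothetical, not taken from the actual tagset.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    word: str
    pos: str
    lemma: str


def parse_vertical(lines: Iterable[str]) -> Iterator[Token]:
    """Yield tokens from tab-separated word/POS/lemma lines (assumed layout)."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and XML-like structural markup such as <s> or </s>.
        if not line or line.startswith("<"):
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # ignore malformed lines rather than guess at their content
        yield Token(*fields)


# Hypothetical sample; the tags below are illustrative only.
sample = [
    "<s>\n",
    "Il\tDET:def\til\n",
    "gatto\tNOUN\tgatto\n",
    "dorme\tVER:fin\tdormire\n",
    ".\tSENT\t.\n",
    "</s>\n",
]

tokens = list(parse_vertical(sample))
# Count lemmas, leaving out the sentence-final punctuation token.
lemma_counts = Counter(t.lemma for t in tokens if t.pos != "SENT")
```

Because `parse_vertical` is a generator, the same function could be fed an open file handle and would process a large corpus line by line without loading it into memory.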