corpora:itwac

  • itWaC: a 2-billion-word corpus constructed from the Web by limiting the crawl to the .it domain and using medium-frequency words from the Repubblica corpus and basic Italian vocabulary lists as seeds. The corpus was POS-tagged with the TreeTagger using this tagset and lemmatized using the Morph-it! lexicon. More information is available here.
  • Last modified: 2017/10/19 10:16
  • by eros