corpora:itwac

  • itWaC: a 2-billion-word corpus built from the Web by limiting the crawl to the .it domain and using medium-frequency words from the Repubblica corpus and basic Italian vocabulary lists as seeds. The corpus was POS-tagged with TreeTagger using this tagset and lemmatized using the Morph-it! lexicon. More information is available here.
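A tagged and lemmatized corpus of this kind is often processed one token per line. The sketch below is a minimal illustration only: it assumes a tab-separated layout of word form, POS tag, and lemma, with sentence boundaries marked by lines such as `<s>`, which is the usual TreeTagger output shape but is not confirmed on this page for the itWaC distribution. The names `Token`, `read_vertical`, `lemma_counts`, and `SAMPLE` are hypothetical, and the tags `DET`, `NOUN`, and `VERB` are placeholders rather than labels from the actual tagset.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    """One corpus token: surface form, POS tag, lemma."""
    form: str
    pos: str
    lemma: str


def read_vertical(lines: Iterable[str]) -> Iterator[Token]:
    """Yield tokens from lines assumed to be 'form<TAB>pos<TAB>lemma'.

    Lines that do not have exactly three tab-separated fields
    (blank lines, markup such as <s> or </s>) are skipped.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue
        yield Token(*parts)


def lemma_counts(lines: Iterable[str]) -> Counter:
    """Count (lemma, pos) pairs over all tokens in the input."""
    return Counter((tok.lemma, tok.pos) for tok in read_vertical(lines))


# Made-up sample data; the tags are placeholders, not the real tagset.
SAMPLE = (
    "<s>\n"
    "Il\tDET\til\n"
    "gatto\tNOUN\tgatto\n"
    "dorme\tVERB\tdormire\n"
    "</s>\n"
    "<s>\n"
    "I\tDET\til\n"
    "gatti\tNOUN\tgatto\n"
    "</s>\n"
)

if __name__ == "__main__":
    for (lemma, pos), n in lemma_counts(SAMPLE.splitlines()).most_common():
        print(f"{lemma}\t{pos}\t{n}")
```

For a corpus of about 2 billion words, passing an open file object to `read_vertical` instead of a list of lines keeps memory use flat, since the generator reads one line at a time.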
  • Last modified: 2017/10/19 10:19
  • by eros