FrWaC is a 1.6-billion-word corpus constructed from the Web by limiting the crawl to the .fr domain and using medium-frequency words from the Le Monde Diplomatique corpus and basic French vocabulary lists as seeds. The corpus was POS-tagged and lemmatized with the TreeTagger using this tagset; more information is available here.
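As a rough illustration of working with POS-tagged and lemmatized text of this kind, the sketch below reads tab-separated lines into records and counts lemmas. It assumes a three-column layout (token, POS tag, lemma), which is what TreeTagger prints when run with its `-token` and `-lemma` options, and it assumes that lines beginning with `<` are markup. Whether the distributed FrWaC files follow exactly this layout is not stated here. The `Token` type, the `parse_tagged_lines` helper, and the sample sentence with its tags are all hypothetical and not taken from the corpus.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    form: str
    pos: str
    lemma: str


def parse_tagged_lines(lines: Iterable[str]) -> Iterator[Token]:
    """Yield Token records from tab-separated token/tag/lemma lines."""
    for line in lines:
        line = line.rstrip("\n")
        # Assumption: blank lines and lines starting with "<" are markup
        # (e.g. sentence boundaries), not tokens.
        if not line or line.startswith("<"):
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # skip malformed rows rather than guess their meaning
        yield Token(*fields)


# Hypothetical sample for illustration only; not an excerpt from FrWaC.
SAMPLE = """<s>
Le\tDET:ART\tle
chat\tNOM\tchat
dort\tVER:pres\tdormir
.\tSENT\t.
</s>
"""

tokens = list(parse_tagged_lines(SAMPLE.splitlines()))

# Count lemmas, leaving out sentence-final punctuation.
lemma_counts = Counter(t.lemma for t in tokens if t.pos != "SENT")
```

For a real corpus file, the same generator could be fed an open file handle instead of `SAMPLE.splitlines()`, so the text is processed line by line without loading it all into memory.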

corpora/frwac.txt · Last modified: 2017/10/19 11:11 by eros