The “la Repubblica” corpus is a very large corpus of Italian newspaper text (approximately 380M tokens).

The corpus is tokenized, POS-tagged (with TreeTagger trained on ad-hoc resources), lemmatized (with Morph-it) and categorized in terms of genre and topic (with SVMLight trained on ad-hoc resources).

The POS tagset used by the corpus is listed at the end of this page. The genre labels used in the corpus are news-report and comment; the topic labels are church, culture, economics, education, news, politics, science, society, sport and weather.

Articles in the corpus are structured into the following (mostly optional) parts: title, subtitle, summary, text. Meta-data information about article author and year is also available.
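As an illustration only, the article parts and labels described above could be held in a record like the one sketched here. The class name, field names and validation are hypothetical and do not reflect the corpus's actual encoding; since the page does not say which parts are mandatory, every field is treated as optional.

```python
from dataclasses import dataclass
from typing import Optional

# Label inventories as listed on this page.
GENRES = {"news-report", "comment"}
TOPICS = {
    "church", "culture", "economics", "education", "news",
    "politics", "science", "society", "sport", "weather",
}


@dataclass
class Article:
    """Hypothetical in-memory record for one corpus article."""

    # Structural parts (the page calls them "mostly optional").
    title: Optional[str] = None
    subtitle: Optional[str] = None
    summary: Optional[str] = None
    text: Optional[str] = None
    # Meta-data.
    author: Optional[str] = None
    year: Optional[int] = None
    # Automatically assigned categories.
    genre: Optional[str] = None
    topic: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject labels that are not in the documented inventories.
        if self.genre is not None and self.genre not in GENRES:
            raise ValueError(f"unknown genre label: {self.genre!r}")
        if self.topic is not None and self.topic not in TOPICS:
            raise ValueError(f"unknown topic label: {self.topic!r}")
```

For example, `Article(text="...", genre="comment", topic="sport")` builds a record with no title, while `Article(genre="editorial")` raises `ValueError`.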

Some very out-of-date information about how the corpus was encoded can be found in the following paper: M. Baroni, S. Bernardini, F. Comastri, L. Piccioni, A. Volpi, G. Aston, M. Mazzoleni. 2004. Introducing the “la Repubblica” corpus: A large, annotated, TEI(XML)-compliant corpus of newspaper Italian. Proceedings of LREC 2004.

Other people who have contributed to the development of the corpus are Eros Zanchetta and Sara Castagnoli. Please cite the above paper or the URL (a permanent link to the corpus interface) if you publish work based on the “la Repubblica” corpus.

ADV             adverb (excluding -mente forms)
ADV:mente       adverb ending in -mente
ARTPRE          preposition + article
AUX:fin         finite form of auxiliary
AUX:fin:cli     finite form of auxiliary with clitic
AUX:geru        gerundive form of auxiliary
AUX:geru:cli    gerundive form of auxiliary with clitic
AUX:infi        infinitival form of auxiliary
AUX:infi:cli    infinitival form of auxiliary with clitic
AUX:ppast       past participle of auxiliary
AUX:ppre        present participle of auxiliary
DET:demo        demonstrative determiner
DET:indef       indefinite determiner
DET:num         numeral determiner
DET:poss        possessive determiner
DET:wh          wh determiner
NOCAT           non-linguistic element
NPR             proper noun
PRO:demo        demonstrative pronoun
PRO:indef       indefinite pronoun
PRO:num         numeral pronoun
PRO:pers        personal pronoun
PRO:poss        possessive pronoun
PUN             non-sentence-final punctuation mark
SENT            sentence-final punctuation mark
VER2:fin        finite form of modal/causal verb
VER2:fin:cli    finite form of modal/causal verb with clitic
VER2:geru       gerundive form of modal/causal verb
VER2:geru:cli   gerundive form of modal/causal verb with clitic
VER2:infi       infinitival form of modal/causal verb
VER2:infi:cli   infinitival form of modal/causal verb with clitic
VER2:ppast      past participle of modal/causal verb
VER2:ppre       present participle of modal/causal verb
VER:fin         finite form of verb
VER:fin:cli     finite form of verb with clitic
VER:geru        gerundive form of verb
VER:geru:cli    gerundive form of verb with clitic
VER:infi        infinitival form of verb
VER:infi:cli    infinitival form of verb with clitic
VER:ppast       past participle of verb
VER:ppast:cli   past participle of verb with clitic
VER:ppre        present participle of verb
WH              wh word
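
The tags above follow a colon-separated pattern: a main category, an optional subtype, and an optional final `cli` marking an attached clitic. A minimal sketch of how such a tag might be split into its parts is given below; the names `ParsedTag`, `parse_tag` and `is_verbal` are hypothetical helpers, not part of any corpus tool.

```python
from typing import NamedTuple, Optional


class ParsedTag(NamedTuple):
    category: str           # main category, e.g. "VER", "VER2", "AUX", "DET"
    subtype: Optional[str]  # e.g. "fin", "infi", "demo", "mente"; None if absent
    clitic: bool            # True when the tag ends in ":cli"


def parse_tag(tag: str) -> ParsedTag:
    """Split a tag such as 'VER:infi:cli' into category, subtype and clitic flag."""
    parts = tag.split(":")
    # A trailing "cli" component marks a form with an attached clitic.
    clitic = len(parts) > 1 and parts[-1] == "cli"
    if clitic:
        parts = parts[:-1]
    subtype = parts[1] if len(parts) > 1 else None
    return ParsedTag(parts[0], subtype, clitic)


def is_verbal(tag: str) -> bool:
    """True for lexical verbs, modal/causal verbs and auxiliaries."""
    return parse_tag(tag).category in {"VER", "VER2", "AUX"}
```

With this sketch, `parse_tag("VER:infi:cli")` yields the category `VER`, the subtype `infi` and a true clitic flag, while `parse_tag("NPR")` yields a bare category with no subtype.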
  • Last modified: 2016/05/02 16:04
  • by eros