Repubblica

Click here to consult the "la Repubblica" corpus

The “la Repubblica” corpus is a very large corpus of Italian newspaper text (approximately 380M tokens).

The corpus is tokenized, POS-tagged (with TreeTagger, trained on ad-hoc resources), lemmatized (with Morph-it) and categorized by genre and topic (with SVMLight, trained on ad-hoc resources).
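As an illustration of what working with this kind of annotation can look like, here is a minimal Python sketch. It assumes a tab-separated, one-token-per-line layout (form, POS tag, lemma), which is the layout TreeTagger typically prints; the format actually served by the corpus interface is not described on this page and may differ. The names `Token` and `read_vertical` and the sample line are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    form: str   # surface form as it appears in the text
    pos: str    # POS tag from the tagset documented on this page
    lemma: str  # lemma assigned during lemmatization


def read_vertical(lines: Iterable[str]) -> Iterator[Token]:
    """Yield tokens from lines of the ASSUMED form 'form<TAB>pos<TAB>lemma'.

    Any line that does not have exactly three tab-separated fields
    (blank lines, structural markup, etc.) is skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3:
            yield Token(*fields)


# Invented sample input, for illustration only.
sample = ["governo\tNOUN\tgoverno\n", "<text>\n", "\n"]
tokens = list(read_vertical(sample))
```

Skipping malformed lines rather than raising keeps the reader tolerant of interleaved markup; a stricter reader might prefer to report them.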

The POS tagset used by the corpus is listed in the Tagset section below. The genre labels used in the corpus are news-report and comment; the topic labels are church, culture, economics, education, news, politics, science, society, sport and weather.

Articles in the corpus are structured into the following (mostly optional) parts: title, subtitle, summary, text. Meta-data information about article author and year is also available.
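The article structure and label inventories described above could be modelled roughly as follows. This is only a sketch: the class name `Article` and its field names are hypothetical, and since the page does not say which parts are mandatory, every part is treated as optional here.

```python
from dataclasses import dataclass
from typing import Optional

# Label inventories as listed on this page.
GENRES = frozenset({"news-report", "comment"})
TOPICS = frozenset({
    "church", "culture", "economics", "education", "news",
    "politics", "science", "society", "sport", "weather",
})


@dataclass
class Article:
    # Structural parts (ASSUMPTION: all optional; the page says "mostly optional").
    title: Optional[str] = None
    subtitle: Optional[str] = None
    summary: Optional[str] = None
    text: Optional[str] = None
    # Metadata.
    author: Optional[str] = None
    year: Optional[int] = None
    genre: Optional[str] = None
    topic: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject labels that are not in the documented inventories.
        if self.genre is not None and self.genre not in GENRES:
            raise ValueError(f"unknown genre label: {self.genre!r}")
        if self.topic is not None and self.topic not in TOPICS:
            raise ValueError(f"unknown topic label: {self.topic!r}")


# Invented example values, for illustration only.
example = Article(title="Titolo di prova", genre="comment", topic="sport")
```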

Some very out-of-date information about how the corpus was encoded can be found in the following paper: M. Baroni, S. Bernardini, F. Comastri, L. Piccioni, A. Volpi, G. Aston, M. Mazzoleni. 2004. Introducing the “la Repubblica” corpus: A large, annotated, TEI(XML)-compliant corpus of newspaper Italian. Proceedings of LREC 2004.

Other people who have contributed to the development of the corpus are Eros Zanchetta and Sara Castagnoli. Please cite the above paper or the URL http://sslmit.unibo.it/repubblica (a permanent link to the corpus interface) if you publish work based on the “la Repubblica” corpus.

Tagset

Tag             Meaning
ADJ             adjective
ADV             adverb (excluding -mente forms)
ADV:mente       adverb ending in -mente
ART             article
ARTPRE          preposition + article
AUX:fin         finite form of auxiliary
AUX:fin:cli     finite form of auxiliary with clitic
AUX:geru        gerundive form of auxiliary
AUX:geru:cli    gerundive form of auxiliary with clitic
AUX:infi        infinitival form of auxiliary
AUX:infi:cli    infinitival form of auxiliary with clitic
AUX:ppast       past participle of auxiliary
AUX:ppre        present participle of auxiliary
CHE             che
CLI             clitic
CON             conjunction
DET:demo        demonstrative determiner
DET:indef       indefinite determiner
DET:num         numeral determiner
DET:poss        possessive determiner
DET:wh          wh determiner
NEG             negation
NOCAT           non-linguistic element
NOUN            noun
NPR             proper noun
NUM             number
PRE             preposition
PRO:demo        demonstrative pronoun
PRO:indef       indefinite pronoun
PRO:num         numeral pronoun
PRO:pers        personal pronoun
PRO:poss        possessive pronoun
PUN             non-sentence-final punctuation mark
SENT            sentence-final punctuation mark
VER2:fin        finite form of modal/causal verb
VER2:fin:cli    finite form of modal/causal verb with clitic
VER2:geru       gerundive form of modal/causal verb
VER2:geru:cli   gerundive form of modal/causal verb with clitic
VER2:infi       infinitival form of modal/causal verb
VER2:infi:cli   infinitival form of modal/causal verb with clitic
VER2:ppast      past participle of modal/causal verb
VER2:ppre       present participle of modal/causal verb
VER:fin         finite form of verb
VER:fin:cli     finite form of verb with clitic
VER:geru        gerundive form of verb
VER:geru:cli    gerundive form of verb with clitic
VER:infi        infinitival form of verb
VER:infi:cli    infinitival form of verb with clitic
VER:ppast       past participle of verb
VER:ppast:cli   past participle of verb with clitic
VER:ppre        present participle of verb
WH              wh word
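
The tags above follow a colon-separated pattern: a main category, an optional subtype, and an optional trailing cli marking an attached clitic. A minimal Python sketch of how such tags might be decomposed is given below; the names `ParsedTag`, `parse_tag` and `is_verbal` are hypothetical helpers, not part of any tool distributed with the corpus.

```python
from typing import NamedTuple, Optional


class ParsedTag(NamedTuple):
    category: str           # main category, e.g. "VER", "DET", "ADV"
    subtype: Optional[str]  # e.g. "fin", "demo", "mente"; None if absent
    has_clitic: bool        # True when the tag ends in ":cli"


def parse_tag(tag: str) -> ParsedTag:
    """Split a tag such as 'VER:fin:cli' into its components."""
    parts = tag.split(":")
    # A trailing lowercase "cli" component marks an attached clitic;
    # the standalone tag "CLI" has a single component and is not affected.
    has_clitic = len(parts) > 1 and parts[-1] == "cli"
    if has_clitic:
        parts = parts[:-1]
    subtype = parts[1] if len(parts) > 1 else None
    return ParsedTag(parts[0], subtype, has_clitic)


def is_verbal(tag: str) -> bool:
    """True for main verbs (VER), modal/causal verbs (VER2) and auxiliaries (AUX)."""
    return parse_tag(tag).category in {"VER", "VER2", "AUX"}


parsed = parse_tag("VER:fin:cli")
```

Grouping tags by `category` in this way makes it straightforward, for instance, to count all verbal tokens regardless of form or clitic attachment.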
corpora/repubblica.txt · Last modified: 2017/05/24 15:14 by eros