The “la Repubblica” corpus is a very large corpus of Italian newspaper text (approximately 380M tokens).

The corpus is tokenized, POS-tagged (with TreeTagger trained on ad hoc resources), lemmatized (with Morph-it!), and categorized by genre and topic (with SVMLight trained on ad hoc resources).

The POS tagset used by the corpus is available here. The genre labels used in the corpus are news-report and comment; topic labels are church, culture, economics, education, news, politics, science, society, sport, weather.

Articles in the corpus are structured into the following (mostly optional) parts: title, subtitle, summary, text. Metadata on each article's author and year is also available.
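As an illustration only, the article structure and label sets described above could be modelled in memory as follows. This is a minimal sketch, not the corpus's actual encoding: the class name `Article`, its field names, and the `present_parts` helper are hypothetical, and since the description does not say which parts are mandatory, every part is treated as optional here.

```python
from dataclasses import dataclass
from typing import List, Optional

# Label sets as listed in the corpus description.
GENRES = {"news-report", "comment"}
TOPICS = {
    "church", "culture", "economics", "education", "news",
    "politics", "science", "society", "sport", "weather",
}


@dataclass
class Article:
    """Hypothetical in-memory record for one corpus article."""

    genre: str
    topic: str
    # Assumption: the description says the parts are "mostly optional"
    # without naming the required ones, so all default to None.
    title: Optional[str] = None
    subtitle: Optional[str] = None
    summary: Optional[str] = None
    text: Optional[str] = None
    # Metadata fields (author and year).
    author: Optional[str] = None
    year: Optional[int] = None

    def __post_init__(self) -> None:
        # Reject labels that are not in the documented label sets.
        if self.genre not in GENRES:
            raise ValueError(f"unknown genre label: {self.genre!r}")
        if self.topic not in TOPICS:
            raise ValueError(f"unknown topic label: {self.topic!r}")

    def present_parts(self) -> List[str]:
        """Return the names of the structural parts that are filled in."""
        parts = ("title", "subtitle", "summary", "text")
        return [name for name in parts if getattr(self, name) is not None]
```

For example, `Article(genre="comment", topic="politics", title="T", text="Body")` would report `title` and `text` as its present parts, while a genre label outside the two documented values raises a `ValueError`.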

Some very out-of-date information about how the corpus was encoded can be found in the following paper: M. Baroni, S. Bernardini, F. Comastri, L. Piccioni, A. Volpi, G. Aston, M. Mazzoleni. 2004. Introducing the “la Repubblica” corpus: A large, annotated, TEI(XML)-compliant corpus of newspaper Italian. Proceedings of LREC 2004.

Eros Zanchetta and Sara Castagnoli have also contributed to the development of the corpus. Please cite the above paper or the URL (a permanent link to the corpus interface) if you publish work based on the “la Repubblica” corpus.

