
CITRA-25

  • Language: Italian (native)
  • Text type: informative, imaginative
  • Genre: press, general prose, learned writing, fiction
  • Time reference: 2020-2025
  • Sampling unit: full text - with exceptions
  • Size: missing

The Corpus of Italian Texts for Register Analysis (CITRA-25) is a principled collection of written texts sampled to be representative of 21st-century Italian. The rationale behind its compilation is twofold: on the one hand, to provide a reference point for (mostly) human-authored texts at a time when synthetic data is beginning to populate contemporary spaces of discourse production and reception; on the other, to allow for contrastive register analysis with comparable corpora. The latter is achieved through careful reconsideration of the sampling frame adopted by the Brown Family of Corpora, a de facto standard for balance and representativeness that has influenced later approaches to corpus design. While consistency with Brown's wider schematisation of text types and genres is preserved, its internal composition is redefined and made explicit in the form of subgenres, with a view to reflecting relevant changes in situational contexts that have occurred between the 1960s and the 2020s. Comparability is thus reflected in functional coverage and balance logic, complemented by a timeframe similar to that of recent representatives of Brown's approach such as BE21 and Koditex.

The corpus is intended to be maximally representative of contemporary written Italian (from the years 2020-2025), defined as Italian linguistic productions encoded in the written modality, typically occurring in highly-to-moderately controlled settings and oriented towards standard or neo-standard norms. Its strata are formalised through an inductive process of categorisation, mapping discourse practices onto established genre conventions and fitting these, in turn, into the wider typology popularised by Brown: press, general prose, learned writing and fiction.

Press

missing text

General prose

missing text

Learned writing

missing text

Fiction

missing text

Texts have been manually collected from the web following stratified sampling strategies. In addition to the time reference, selection and exclusion criteria have been devised to address the specific features of each stratum. The content of the selected webpages has been converted to .txt files with UTF-8 encoding. Para- and peritextual artefacts (e.g. tables and figures, footnotes, authors' names and text metadata) have been normalised by means of placeholders, implemented as empty-element XML tags in the format <deleted object="VALUE"/>. File IDs follow the template [YEAR]_[GENRE]_[SUBGENRE]_000.txt, where the three-digit number indicates sequential progression within each subgenre by year.
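The naming template and the placeholder format described above can be handled programmatically. The following Python sketch is illustrative only and is not part of the corpus distribution: the function names are hypothetical, the pattern assumes a four-digit year, lowercase alphabetic genre and subgenre codes and a three-digit counter, and the object value "table" is a made-up example.

```python
import re

# Assumption: IDs look like 2024_pr_rev_001.txt, i.e. four-digit year,
# lowercase genre and subgenre codes, three-digit sequential counter.
FILE_ID = re.compile(
    r"^(?P<year>\d{4})_(?P<genre>[a-z]+)_(?P<subgenre>[a-z]+)_(?P<seq>\d{3})\.txt$"
)

# Matches placeholders of the form <deleted object="VALUE"/>.
PLACEHOLDER = re.compile(r'<deleted\s+object="([^"]*)"\s*/>')


def parse_file_id(name):
    """Split a file ID into its components (hypothetical helper)."""
    match = FILE_ID.match(name)
    if match is None:
        raise ValueError(f"not a valid file ID: {name!r}")
    return {
        "year": int(match["year"]),
        "genre": match["genre"],
        "subgenre": match["subgenre"],
        "seq": int(match["seq"]),
    }


def placeholder_objects(text):
    """List the object values of all placeholders found in a text."""
    return PLACEHOLDER.findall(text)


def strip_placeholders(text):
    """Remove all placeholders, e.g. before counting words."""
    return PLACEHOLDER.sub("", text)


if __name__ == "__main__":
    print(parse_file_id("2024_pr_rev_001.txt"))
    print(placeholder_objects('See <deleted object="table"/> for details.'))
```

Whether placeholders should be stripped or kept depends on the analysis: keeping them preserves the position of the removed material, while stripping them yields running text only.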

Press

missing text

General prose

missing text

Learned writing

missing text

Fiction

missing text

Data from the corpus has been annotated for contextual and structural information in a semi-automatic fashion. Metadata fields, their values and associated descriptions are reported in the table below. Their operationalisation, based on a review of existing schemes from comparable resources, allows for the analysis of sociolinguistic variables, most notably along the diastratic (author's gender, publisher), diamesic (type of publication) and diaphasic (genre and subgenre, topic) axes. Given the purpose of the corpus, as well as the heterogeneity of texts included, structural information has been encoded following a minimal TEI structure encompassing sentences (<s>) and headings (<head>).
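As an illustration of how the minimal structural markup could be read, the Python sketch below extracts headings and sentences from a small fragment using the standard library. The wrapping <text> element, the sample sentences and the function name are assumptions made for the sake of the example; the actual file layout may differ.

```python
import xml.etree.ElementTree as ET

# Made-up fragment: only <s>, <head> and <deleted/> come from the
# description above; the <text> wrapper is an assumption.
SAMPLE = """<text>
<head>Titolo di esempio</head>
<s>Prima frase.</s>
<s>Seconda frase con <deleted object="figure"/> segnaposto.</s>
</text>"""


def read_structure(xml_string):
    """Return (headings, sentences) as lists of plain strings."""
    root = ET.fromstring(xml_string)

    def flat(element):
        # Join all text inside the element and collapse whitespace,
        # so that an embedded placeholder leaves no double space.
        return " ".join("".join(element.itertext()).split())

    headings = [flat(h) for h in root.iter("head")]
    sentences = [flat(s) for s in root.iter("s")]
    return headings, sentences


if __name__ == "__main__":
    heads, sents = read_structure(SAMPLE)
    print(heads)
    print(sents)
```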

Metadata scheme

Metadata field | Metadata value | Description
id | e.g. 2024_pr_rev_001.txt | file's ID, encapsulating selection criteria
text type | informative; imaginative | the 1st level domain to which the text is assigned based on functional criteria (text types)
genre | press; general prose; learned writing; fiction | the 2nd level domain to which the text is assigned based on structural conventions
subgenre | e.g. editorial; novel | the 3rd level domain to which the text is assigned based on structural conventions
topic | e.g. politics; fantasy, thriller | the subject matter around which the text is built. Multiple co-occurring topics are separated by a comma
publication_year | e.g. 2025 | the year when the text was issued
publication_type | print; digital; print and digital | the medium through which the text is made available
publisher | e.g. Consiglio Nazionale delle Ricerche | the individual, company or entity producing and distributing the text
author_gender | man; men; woman; women; mixed; N/a | authors' assumed gender, catering to both individual and collaborative production
word_count | e.g. 781 | the total number of tokens from the cleaned text as reported in the text editor
url | e.g. https://cineforum.it/recensione/Nosferatu | a link directing to the original text file
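The closed value sets in the scheme above lend themselves to a simple consistency check. The Python sketch below is a hypothetical helper, not a tool shipped with the corpus: it assumes each text's metadata is available as a dictionary keyed by the field names in the table, and the example record is made up from the sample values listed there.

```python
# Closed value sets copied from the metadata scheme.
ALLOWED = {
    "text type": {"informative", "imaginative"},
    "genre": {"press", "general prose", "learned writing", "fiction"},
    "publication_type": {"print", "digital", "print and digital"},
    "author_gender": {"man", "men", "woman", "women", "mixed", "N/a"},
}

# Made-up record for illustration only.
EXAMPLE = {
    "text type": "imaginative",
    "genre": "fiction",
    "subgenre": "novel",
    "topic": "fantasy, thriller",
    "publication_type": "print and digital",
    "author_gender": "woman",
}


def split_topics(value):
    """Split a comma-separated topic field into a list of topics."""
    return [topic.strip() for topic in value.split(",") if topic.strip()]


def validate(record):
    """Return a list of problems found in the closed-set fields."""
    problems = []
    for field, allowed in ALLOWED.items():
        value = record.get(field)
        if value not in allowed:
            problems.append(f"{field}: unexpected value {value!r}")
    return problems


if __name__ == "__main__":
    print(split_topics(EXAMPLE["topic"]))
    print(validate(EXAMPLE))
```

Open fields such as subgenre, topic and publisher are deliberately left unchecked here, since the table gives examples rather than full value lists for them.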

missing text

missing text

missing text

  • Last modified: 2026/04/29 12:13
  • by dpolizzi