CITRA-25
Corpus of Italian Texts for Register Analysis (2020-2025)
Corpus summary
- Language: Italian (native)
- Text type: informative, imaginative
- Genre: press, general prose, learned writing, fiction
- Time reference: 2020-2025
- Sampling unit: full text - with exceptions
- Size: missing
Motivation and rationale
The Corpus of Italian Texts for Register Analysis (CITRA-25) is a principled collection of written texts sampled to be representative of 21st-century Italian. The rationale behind its compilation is twofold: on the one hand, to provide a reference point for (mostly) human-authored texts at a time when synthetic data is beginning to populate contemporary spaces of discourse production and reception; on the other, to allow for contrastive register analysis with comparable corpora. The latter is achieved through careful reconsideration of the sampling frame adopted by the Brown Family of Corpora, a de facto standard for balancedness and representativeness that has influenced later approaches to corpus design. While consistency with Brown's wider schematisation of text types and genres is preserved, its internal composition is redefined and made explicit in the form of subgenres, with an eye towards exemplifying relevant changes in situational contexts that occurred between the 1960s and the 2020s. Comparability is thus reflected in functional coverage and balance logic, complemented by a timeframe similar to that of such recent representatives of Brown's approach as BE21 and Koditex.
Domain definition
The corpus is intended to be maximally representative of contemporary written Italian (from the years 2020-2025), defined as Italian linguistic productions encoded in the written modality, typically occurring in highly-to-moderately controlled settings and oriented towards standard or neo-standard norms. Its strata are formalised through an inductive process of categorisation, mapping discourse practices onto established genre conventions and fitting these, in turn, into the wider typology popularised by Brown: press, general prose, learned writing and fiction.
Press
missing text
General prose
missing text
Learned writing
missing text
Fiction
Data collection, conversion and normalisation
Texts have been manually collected from the web following stratified sampling strategies. Apart from time reference, selection and exclusion criteria have been devised that specifically address the features of each stratum. The content of selected webpages has been converted to .txt files with UTF-8 encoding, normalising para- and peritextual artefacts (e.g. tables and figures, footnotes, authors' names and text metadata) by means of placeholders implemented as empty-element XML tags in the format <deleted object="VALUE"/>. File IDs follow the template [YEAR]_[GENRE]_[SUBGENRE]_000.txt, where the trailing three-digit number indicates sequential progression within each subgenre by year.
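The naming template and the placeholder format can be illustrated with a short Python sketch. It assumes that genre and subgenre are encoded as short lowercase abbreviations (as in the example ID 2024_pr_rev_001.txt), that the counter always has three digits, and that the attribute value is enclosed in straight double quotes. The function names and the sample sentence are hypothetical and not part of the corpus tooling.

```python
import re

# Assumed shape of a file ID: [YEAR]_[GENRE]_[SUBGENRE]_000.txt
FILE_ID = re.compile(
    r"^(?P<year>\d{4})_(?P<genre>[a-z]+)_(?P<subgenre>[a-z]+)_(?P<seq>\d{3})\.txt$"
)

# Empty-element placeholder, e.g. <deleted object="figure"/>
PLACEHOLDER = re.compile(r'<deleted object="(?P<value>[^"]*)"\s*/>')


def parse_file_id(name):
    """Split a file ID into its components, or return None if it does not match."""
    match = FILE_ID.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    parts["year"] = int(parts["year"])
    parts["seq"] = int(parts["seq"])
    return parts


def placeholder_values(text):
    """List the object values of all placeholders found in a text."""
    return [m.group("value") for m in PLACEHOLDER.finditer(text)]


def strip_placeholders(text):
    """Remove placeholders and collapse the whitespace they leave behind."""
    return re.sub(r"\s+", " ", PLACEHOLDER.sub(" ", text)).strip()


if __name__ == "__main__":
    print(parse_file_id("2024_pr_rev_001.txt"))
    # Hypothetical sentence containing one placeholder
    sample = 'Il film <deleted object="figure"/> è uscito.'
    print(placeholder_values(sample))
    print(strip_placeholders(sample))
```

Stripping placeholders in this way may be useful before computing token counts, since the tags themselves are not part of the original text.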
Press
missing text
General prose
missing text
Learned writing
missing text
Fiction
missing text
Annotation
Data from the corpus has been annotated for contextual and structural information in a semi-automatic fashion. Metadata fields, their values and associated descriptions are reported in the table below. Their operationalisation, based on a review of existing schemes from comparable resources, allows for the analysis of sociolinguistic variables, most notably along the diastratic (author's gender, publisher), diamesic (type of publication) and diaphasic (genre and subgenre, topic) axes. Given the purpose of the corpus, as well as the heterogeneity of texts included, structural information has been encoded following a minimal TEI structure encompassing sentences (<s>) and headings (<head>).
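As a rough illustration of how the structural annotation might be read, the following Python sketch counts headings and sentences in a minimally annotated text. The <text> wrapper element, the sample wording and the function names are assumptions made for the example; only the <s> and <head> elements and the <deleted/> placeholder come from the description of the corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the <text> wrapper and the wording are illustrative only.
SAMPLE = """<text>
<head>Titolo di esempio</head>
<s>Prima frase di esempio.</s>
<s>Seconda frase <deleted object="footnote"/> di esempio.</s>
</text>"""


def structure_counts(xml_string):
    """Count headings and sentences in a minimally annotated text."""
    root = ET.fromstring(xml_string)
    return {
        "head": len(root.findall(".//head")),
        "s": len(root.findall(".//s")),
    }


def sentences(xml_string):
    """Return the plain text of each <s> element, skipping nested placeholders."""
    root = ET.fromstring(xml_string)
    # itertext() yields the text around child elements, so the empty
    # <deleted/> tag contributes nothing; extra whitespace is collapsed.
    return [" ".join("".join(s.itertext()).split()) for s in root.iter("s")]


if __name__ == "__main__":
    print(structure_counts(SAMPLE))
    print(sentences(SAMPLE))
```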
Metadata scheme
| Metadata field | Metadata value | Description |
|---|---|---|
| id | e.g. 2024_pr_rev_001.txt | file's ID, encapsulating selection criteria |
| text type | informative; imaginative | the 1st level domain to which the text is assigned based on functional criteria (text types) |
| genre | press; general prose; learned writing; fiction | the 2nd level domain to which the text is assigned based on structural conventions |
| subgenre | e.g. editorial; novel | the 3rd level domain to which the text is assigned based on structural conventions |
| topic | e.g. politics; fantasy, thriller | the subject matter around which the text is built. Multiple co-occurring topics are separated by commas |
| publication_year | e.g. 2025 | the year when the text was issued |
| publication_type | print; digital; print and digital | the medium through which the text is made available |
| publisher | e.g. Consiglio Nazionale delle Ricerche | the individual, company or entity producing and distributing the text |
| author_gender | man; men; woman; women; mixed; N/a | authors' assumed gender, catering to both individual and collaborative production |
| word_count | e.g. 781 | the total number of tokens from the cleaned text as reported in the text editor |
| url | e.g. https://cineforum.it/recensione/Nosferatu | a link directing to the original text file |
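The closed value sets listed in the table lend themselves to a simple consistency check. The sketch below is one possible way to do this in Python, assuming that each text's metadata is available as a dictionary keyed by the field names above and that publication_year must fall within the corpus time reference (2020-2025). The function names and the sample record are hypothetical.

```python
# Closed value sets taken from the metadata table
ALLOWED = {
    "text type": {"informative", "imaginative"},
    "genre": {"press", "general prose", "learned writing", "fiction"},
    "publication_type": {"print", "digital", "print and digital"},
    "author_gender": {"man", "men", "woman", "women", "mixed", "N/a"},
}


def validate_record(record):
    """Return a list of problems found in one metadata record (empty if none)."""
    problems = []
    for field, allowed in ALLOWED.items():
        value = record.get(field)
        if value not in allowed:
            problems.append(f"{field}: unexpected value {value!r}")
    # Assumption: publication years are bounded by the corpus time reference.
    year = record.get("publication_year")
    if not isinstance(year, int) or not 2020 <= year <= 2025:
        problems.append(f"publication_year: {year!r} outside 2020-2025")
    return problems


def split_topics(topic):
    """Split a comma-separated topic field into a list of topics."""
    return [t.strip() for t in topic.split(",") if t.strip()]


if __name__ == "__main__":
    # Hypothetical record, for illustration only
    record = {
        "text type": "imaginative",
        "genre": "fiction",
        "publication_type": "print and digital",
        "author_gender": "woman",
        "publication_year": 2023,
        "topic": "fantasy, thriller",
    }
    print(validate_record(record))
    print(split_topics(record["topic"]))
```

Open fields such as subgenre, publisher and url are left unchecked here, since the table gives only examples for them rather than exhaustive value lists.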
License and conditions of use
missing text
References
missing text
Contributors
missing text
