CITRA-25

Corpus of Italian Texts for Register Analysis (2020-2025)

Corpus summary

Language: Italian (native)
Text type: informative, imaginative
Genre: press, general prose, learned writing, fiction
Time reference: 2020-2025
Sampling unit: full text - with exceptions
Size: missing

Motivation and rationale

The Corpus of Italian Texts for Register Analysis (CITRA-25) is a principled collection of written texts sampled to be representative of 21st-century Italian. The rationale behind its compilation is twofold: on the one hand, to provide a reference point for (mostly) human-authored texts at a time when synthetic data is beginning to populate contemporary spaces of discourse production and reception; on the other, to allow for contrastive register analysis with comparable corpora. The latter is achieved through careful reconsideration of the sampling frame adopted by the Brown Family of Corpora, a de facto standard for balancedness and representativeness that has influenced later approaches to corpus design. While consistency with Brown's wider schematisation of text types and genres is preserved, its internal composition is redefined and made explicit in the form of subgenres, with an eye towards exemplifying relevant changes in situational contexts that occurred between the 1960s and the 2020s. Comparability is thus reflected in functional coverage and balance logic, complemented by a timeframe similar to that of recent representatives of Brown's approach such as BE21 and Koditex.

Domain definition

The corpus is intended to be maximally representative of contemporary written Italian (from the years 2020-2025), defined as Italian linguistic productions encoded in the written modality, typically occurring in highly-to-moderately controlled settings and oriented towards standard or neo-standard norms. Its strata are formalised through an inductive process of categorisation, mapping discourse practices onto established genre conventions and fitting these, in turn, into the wider typology popularised by Brown: press, general prose, learned writing and fiction.

Press

missing text

General prose

missing text

Learned writing

missing text

Fiction

missing text

Data collection, conversion and normalisation

Texts have been manually collected from the web following stratified sampling strategies. Apart from time reference, selection and exclusion criteria have been devised that specifically address the features of each stratum. The content of selected webpages has been converted to .txt files with UTF-8 encoding, normalising para- and peritextual artefacts (e.g. tables and figures, footnotes, author names and text metadata) by means of placeholders implemented as empty-element XML tags in the format <deleted object="VALUE"/>. File IDs follow the template [YEAR]_[GENRE]_[SUBGENRE]_000.txt, where the final three digits indicate sequential progression within each subgenre by year.
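The naming template and the placeholder format described above can be handled with two regular expressions. The following is a minimal sketch, not the project's own tooling: the function names are hypothetical, the genre and subgenre codes are assumed to be short lowercase alphabetic strings (as in the documented example 2024_pr_rev_001.txt), and the placeholder value "table" used in the usage note is illustrative only.

```python
import re

# Assumption: genre and subgenre codes are lowercase alphabetic strings,
# as in the documented example "2024_pr_rev_001.txt".
FILE_ID = re.compile(
    r"^(?P<year>\d{4})_(?P<genre>[a-z]+)_(?P<subgenre>[a-z]+)_(?P<seq>\d{3})\.txt$"
)

# Placeholder format documented for the corpus: <deleted object="VALUE"/>
PLACEHOLDER = re.compile(r'<deleted object="(?P<value>[^"]*)"/>')


def parse_file_id(file_id):
    """Split a file ID into its components, or return None if it does not match."""
    match = FILE_ID.match(file_id)
    if match is None:
        return None
    parts = match.groupdict()
    return {
        "year": int(parts["year"]),
        "genre": parts["genre"],
        "subgenre": parts["subgenre"],
        "seq": int(parts["seq"]),
    }


def placeholder_values(text):
    """List the object values of all placeholders found in a text."""
    return PLACEHOLDER.findall(text)


def strip_placeholders(text):
    """Remove all placeholders from a text, leaving surrounding characters untouched."""
    return PLACEHOLDER.sub("", text)
```

For instance, parse_file_id("2024_pr_rev_001.txt") yields the year 2024, the codes "pr" and "rev", and the sequence number 1, while placeholder_values('Testo <deleted object="table"/> testo') lists the single value "table".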

Press

missing text

General prose

missing text

Learned writing

missing text

Fiction

missing text

Annotation

The corpus has been annotated for contextual and structural information in a semi-automatic fashion. Metadata fields, their values and associated descriptions are reported in the table below. Their operationalisation, based on a review of existing schemes from comparable resources, allows for the analysis of sociolinguistic variables, most notably along the diastratic (author's gender, publisher), diamesic (type of publication) and diaphasic (genre and subgenre, topic) axes. Given the purpose of the corpus, as well as the heterogeneity of the texts included, structural information has been encoded following a minimal TEI structure encompassing sentences (<s>) and headings (<head>).
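As an illustration of how such minimal structural markup might be read, the sketch below extracts headings and sentences with Python's standard library. Only the <head> and <s> elements are documented for the corpus; the <text> wrapper, the sample sentences and the function name are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the <text> wrapper and the Italian sentences are
# illustrative; only <head> and <s> are documented for the corpus.
SAMPLE = """<text>
<head>Un titolo di esempio</head>
<s>Questa è la prima frase.</s>
<s>Questa è la seconda frase.</s>
</text>"""


def read_structure(xml_string):
    """Return (headings, sentences) as two lists of plain strings."""
    root = ET.fromstring(xml_string)
    # itertext() also collects text around nested empty elements,
    # such as placeholders occurring inside a sentence.
    headings = ["".join(el.itertext()) for el in root.iter("head")]
    sentences = ["".join(el.itertext()) for el in root.iter("s")]
    return headings, sentences


headings, sentences = read_structure(SAMPLE)
print(len(headings), "heading(s),", len(sentences), "sentence(s)")
```

Keeping the markup this shallow means that sentence-level counts can be derived without any schema-specific tooling.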

Metadata scheme

Metadata field	Metadata value	Description
id	e.g. 2024_pr_rev_001.txt	file's ID, encapsulating selection criteria
text type	informative; imaginative	the 1st-level domain to which the text is assigned based on functional criteria (text types)
genre	press; general prose; learned writing; fiction	the 2nd-level domain to which the text is assigned based on structural conventions
subgenre	e.g. editorial; novel	the 3rd-level domain to which the text is assigned based on structural conventions
topic	e.g. politics; fantasy, thriller	the subject matter around which the text is built. Multiple co-occurring topics are separated by commas
publication_year	e.g. 2025	the year when the text was issued
publication_type	print; digital; print and digital	the medium through which the text is made available
publisher	e.g. Consiglio Nazionale delle Ricerche	the individual, company or entity producing and distributing the text
author_gender	man; men; woman; women; mixed; N/a	authors' assumed gender, covering both individual and collaborative production
word_count	e.g. 781	the total number of tokens from the cleaned text as reported in the text editor
url	e.g. https://cineforum.it/recensione/Nosferatu	a link to the original text
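The closed value sets in the table lend themselves to a simple consistency check. The sketch below is one possible way to validate a metadata record, under stated assumptions: the dictionary layout, the function name and the sample record are hypothetical, while the allowed values and the 2020-2025 range are taken from the scheme above.

```python
# Closed value sets copied from the metadata scheme.
ALLOWED = {
    "text type": {"informative", "imaginative"},
    "genre": {"press", "general prose", "learned writing", "fiction"},
    "publication_type": {"print", "digital", "print and digital"},
    "author_gender": {"man", "men", "woman", "women", "mixed", "N/a"},
}


def validate_record(record):
    """Return a list of problems found in one metadata record (empty if none)."""
    problems = []
    for field, allowed in ALLOWED.items():
        value = record.get(field)
        if value not in allowed:
            problems.append(f"{field}: unexpected value {value!r}")
    # The corpus covers texts issued between 2020 and 2025.
    year = record.get("publication_year")
    if not (isinstance(year, int) and 2020 <= year <= 2025):
        problems.append(
            f"publication_year: expected an integer between 2020 and 2025, got {year!r}"
        )
    return problems


# Illustrative record; the combination of values is invented for the example.
record = {
    "id": "2024_pr_rev_001.txt",
    "text type": "informative",
    "genre": "press",
    "publication_year": 2024,
    "publication_type": "digital",
    "author_gender": "N/a",
}
print(validate_record(record))
```

Open fields such as topic, publisher and url are deliberately left unchecked here, since the scheme gives only examples for them.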

missing text
