corpora:desert_island_discs_corpus

Differences

This shows you the differences between two versions of the page.

corpora:desert_island_discs_corpus [2026/02/10 16:02] adriano
corpora:desert_island_discs_corpus [2026/02/10 16:06] (current) adriano
Line 3: Line 3:
 ===== A corpus of transcriptions of BBC's Desert Island Discs episodes (1951-2025) =====
  
-The **Desert Island Discs** corpus is a diachronic collection of nearly 12 million words of spoken English. It contains transcripts from the complete archive of 74 years of of [[https://en.wikipedia.org/wiki/Desert_Island_Discs|Desert Island Discs]], a [[https://www.bbc.co.uk/programmes/b006qnmr|radio programme by BBC]]. The corpus features 2,448 episodes, corresponding to all those available since 1951 and updated to the end of May 2025. The corpus is annotated for a large set of parameters, including speaker turns and socio-demographic information (such as gender, profession, age at time of recording) derived from [[https://www.wikidata.org/wiki/Wikidata:Main_Page|Wikidata]].
+The **Desert Island Discs** corpus is a diachronic collection of nearly 12 million words of spoken English. It contains transcripts from the complete archive of 74 years of [[https://en.wikipedia.org/wiki/Desert_Island_Discs|Desert Island Discs]], a [[https://www.bbc.co.uk/programmes/b006qnmr|radio programme by BBC]].
+
+The corpus features **2,448 episodes**, corresponding to all those available since 1951 and updated to the end of May 2025. The corpus is **annotated for a large set of parameters**, including speaker turns and socio-demographic information (such as gender, profession, age at time of recording) derived from [[https://www.wikidata.org/wiki/Wikidata:Main_Page|Wikidata]].
  
 The corpus can be accessed freely from the [[https://bellatrix.sslmit.unibo.it/noske/public/#dashboard?corpname=desert_island_discs|CoLiTec corpora platform]].
  
-The following table illustrates the available metadata (full list of metadata concerning hosts, guests, episodes and turn).
+The following table illustrates the **available metadata** (full list of metadata concerning hosts, guests, episodes and turn).
  
 ^Metadata field type ^ Metadata field name ^ Metadata field value ^ Metadata source ^
-|On host & guest|Host's name|E.g. Sue Lawley|BBC archive|
+|**On host & guest**|Host's name|E.g. Sue Lawley|BBC archive|
 | |Guest's name|E.g. Douglas Adams|BBC archive|
 | |Guest's gender|Female, male, nonbinary, transgender female, transgender male|Wikidata|
Line 30: Line 32:
 | |Guest's positions held|E.g. Member of the House of Lords|Wikidata|
 | |Guest's Wikidata identifier|E.g. Q42|Wikidata|
-|On recording|Recording: exact date|E.g. 1994-02-06|BBC archive|
+|**On recording**|Recording: exact date|E.g. 1994-02-06|BBC archive|
 | |Recording: decade|E.g. 1990|Inferred from other metadata|
 | |Recording: year|E.g. 1994|Inferred from other metadata|
 | |Text ID|E.g. DouglasAdams_1994|Inferred from other metadata|
-|On turn|Turn type|guest, host, intro, music, other, thanking and ending|Heuristics based on WhisperX output|
+|**On turn**|Turn type|guest, host, intro, music, other, thanking and ending|Heuristics based on WhisperX output|
  
 ==== Corpus statistics ====
  
-Summary statistics on the corpus are provided below.
+Summary statistics on **corpus size** are provided below.
  
 |Number of texts (episodes)|2,448|
Line 62: Line 64:
 The corpus may be used for research and educational purposes only. Users may not reconstruct full programmes, systematically extract content, or redistribute materials beyond what is permitted by applicable copyright law.
  
-For copyright-related concerns, please contact: [[mailto:adriano.ferraresi@unibo.it|adriano.ferraresi@unibo.it]]
+For copyright-related concerns, please contact: [[adriano.ferraresi@unibo.it|adriano.ferraresi@unibo.it]]
  • corpora/desert_island_discs_corpus.1770735740.txt.gz
  • Last modified: 2026/02/10 16:02
  • by adriano