corpora:desert_island_discs_corpus

Differences

This shows you the differences between two versions of the page.

corpora:desert_island_discs_corpus [2025/09/16 17:32] – created eros
corpora:desert_island_discs_corpus [2026/02/10 16:06] (current) adriano
Line 3: Line 3:
 ===== A corpus of transcriptions of BBC's Desert Island Discs episodes (1951-2025) =====
 
-The **Desert Island Discs** corpus is a diachronic collection of nearly 12 million words of spoken English. It contains transcripts from the complete archive of 74 years of of [[https://en.wikipedia.org/wiki/Desert_Island_Discs|Desert Island Discs]], a [[https://www.bbc.co.uk/programmes/b006qnmr|radio programme by BBC]]. The corpus features 2,448 episodes, corresponding to all those available since 1951 and updated to the end of May 2025. The corpus is annotated for a large set of parameters, including speaker turns and socio-demographic information (such as gender, profession, age at time of recording) derived from Wikidata.
+The **Desert Island Discs** corpus is a diachronic collection of nearly 12 million words of spoken English. It contains transcripts from the complete archive of 74 years of [[https://en.wikipedia.org/wiki/Desert_Island_Discs|Desert Island Discs]], a [[https://www.bbc.co.uk/programmes/b006qnmr|radio programme by BBC]].
 
-The corpus can be accessed freely from the CoLiTec corpora platform: [[https://corpora.dipintra.it]].
+The corpus features **2,448 episodes**, corresponding to all those available since 1951 and updated to the end of May 2025. The corpus is **annotated for a large set of parameters**, including speaker turns and socio-demographic information (such as gender, profession, age at time of recording) derived from [[https://www.wikidata.org/wiki/Wikidata:Main_Page|Wikidata]].
 
-The following table illustrates the available metadata (full list of metadata concerning hosts, guests, episodes and turn).
+The corpus can be accessed freely from the [[https://bellatrix.sslmit.unibo.it/noske/public/#dashboard?corpname=desert_island_discs|CoLiTec corpora platform]].
+
+The following table illustrates the **available metadata** (full list of metadata concerning hosts, guests, episodes and turn).
 
 ^Metadata field type ^ Metadata field name ^ Metadata field value ^ Metadata source ^
-|On host & guest|Host's name|E.g. Sue Lawley|BBC archive|
+|**On host & guest**|Host's name|E.g. Sue Lawley|BBC archive|
 | |Guest's name|E.g. Douglas Adams|BBC archive|
 | |Guest's gender|Female, male, nonbinary, transgender female, transgender male|Wikidata|
Line 30: Line 32:
 | |Guest's positions held|E.g. Member of the House of Lords|Wikidata|
 | |Guest's Wikidata identifier|E.g. Q42|Wikidata|
-|On recording|Recording: exact date|E.g. 1994-02-06|BBC archive|
+|**On recording**|Recording: exact date|E.g. 1994-02-06|BBC archive|
 | |Recording: decade|E.g. 1990|Inferred from other metadata|
 | |Recording: year|E.g. 1994|Inferred from other metadata|
 | |Text ID|E.g. DouglasAdams_1994|Inferred from other metadata|
-|On turn|Turn type|guest, host, intro, music, other, thanking and ending|Heuristics based on WhisperX output|
+|**On turn**|Turn type|guest, host, intro, music, other, thanking and ending|Heuristics based on WhisperX output|
-
+
-Summary statistics on the corpus are provided below (corpus size information).
+==== Corpus statistics ====
+
+Summary statistics on **corpus size** are provided below.
 
 |Number of texts (episodes)|2,448|
Line 51: Line 55:
  
 Further information on how the corpus was compiled can be found in the article.
 +
 +==== Copyright and Use ====
 +
 +This corpus contains transcriptions of copyrighted radio programme recordings. Copyright in the original audio material remains with the respective rightsholders. The recordings were obtained from publicly available sources and processed for non-commercial research and teaching purposes. The original recordings are available via the BBC’s official online archive.
 +
 +Only textual transcriptions are made available through this platform. Access is provided exclusively via a query-based interface, which returns short textual excerpts (concordances). Audio files and full transcripts are not distributed.
 +
 +The corpus may be used for research and educational purposes only. Users may not reconstruct full programmes, systematically extract content, or redistribute materials beyond what is permitted by applicable copyright law.
 +
 +For copyright-related concerns, please contact: [[adriano.ferraresi@unibo.it|adriano.ferraresi@unibo.it]]
  • corpora/desert_island_discs_corpus.1758036754.txt.gz
  • Last modified: 2025/09/16 17:32
  • by eros