====== Desert Island Discs Corpus ======

===== A corpus of transcriptions of BBC's Desert Island Discs episodes (1951-2025) =====

The **Desert Island Discs** corpus is a diachronic collection of nearly 12 million words of spoken English. It contains transcripts from the complete archive of 74 years of [[https://en.wikipedia.org/wiki/Desert_Island_Discs|Desert Island Discs]], a [[https://www.bbc.co.uk/programmes/b006qnmr|radio programme by the BBC]]. The corpus comprises 2,448 episodes: all those available from 1951 up to the end of May 2025.

The corpus is annotated for a large set of parameters, including speaker turns and socio-demographic information (such as gender, profession, age at time of recording) derived from Wikidata.

The corpus can be accessed freely from the CoLiTec corpora platform: [[https://corpora.dipintra.it]].

The following table lists all available metadata concerning hosts, guests, episodes and turns.

^ Metadata field type ^ Metadata field name ^ Metadata field value ^ Metadata source ^
| On host & guest | Host's name | E.g. Sue Lawley | BBC archive |
| | Guest's name | E.g. Douglas Adams | BBC archive |
| | Guest's gender | Female, male, nonbinary, transgender female, transgender male | Wikidata |
| | Guest's date of birth | E.g. 1952-03-11 | Wikidata |
| | Guest's year of birth | E.g. 1952 | Wikidata |
| | Guest's year of death | E.g. 2001 | Wikidata |
| | Guest's age at date of recording: exact | E.g. 41 | Inferred from other metadata |
| | Guest's age at date of recording: generation | E.g. 40 | Inferred from other metadata |
| | Guest's place of birth | E.g. Cambridge | Wikidata |
| | Guest's country of birth | E.g. United Kingdom | Wikidata |
| | Guest's country of citizenship | E.g. United Kingdom | Wikidata |
| | Guest's occupation: category | academic, activist, actor, architect, artist, broadcaster, businessperson, chef, comedian, dancer, designer, director, engineer, explorer, farmer, gardener, journalist, lawyer, medical personnel, military, misc, model, musician, photographer, politician, producer, religion personnel, scientist, sportsperson, trainer, writer | Inferred from other metadata (authors' classification) |
| | Guest's occupation: first mentioned | E.g. playwright | Wikidata |
| | Guest's occupation: all | E.g. playwright; screenwriter; novelist; science fiction writer | Wikidata |
| | Guest's education (where) | E.g. Brentwood School | Wikidata |
| | Guest's languages spoken | E.g. English | Wikidata |
| | Guest's political affiliation (member of party) | E.g. Labour Party | Wikidata |
| | Guest has children? | Y, NA | Wikidata |
| | Guest's positions held | E.g. Member of the House of Lords | Wikidata |
| | Guest's Wikidata identifier | E.g. Q42 | Wikidata |
| On recording | Recording: exact date | E.g. 1994-02-06 | BBC archive |
| | Recording: decade | E.g. 1990 | Inferred from other metadata |
| | Recording: year | E.g. 1994 | Inferred from other metadata |
| | Text ID | E.g. DouglasAdams_1994 | Inferred from other metadata |
| On turn | Turn type | guest, host, intro, music, other, thanking and ending | Heuristics based on WhisperX output |

Summary statistics on the size of the corpus are provided below.

| Number of texts (episodes) | 2,448 |
| Number of different guests | 2,352 |
| Number of tokens | 14,217,426 |
| Number of types | 114,242 |
| Average text length: median (and IQR) | 5242.5 (1504.0) |

==== Citation ====

If you use the corpus in your research, teaching, or any other work, please cite:

[CITATION HERE]

Further information on how the corpus was compiled can be found in the article.

corpora/desert_island_discs_corpus.txt · Last modified: 2025/09/16 17:32 by eros
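
==== Example: deriving the inferred fields ====

Several fields in the metadata table are marked "Inferred from other metadata". The sketch below shows one plausible way such values could be derived from a guest's name, date of birth and recording date. The rules are assumptions based only on the example values in the table (age 41, generation 40, decade 1990, Text ID DouglasAdams_1994), and the function names are hypothetical; the actual procedure used by the corpus authors is not documented on this page.

```python
from datetime import date


def age_at_recording(born: date, recorded: date) -> int:
    """Age in completed years on the recording date."""
    had_birthday = (recorded.month, recorded.day) >= (born.month, born.day)
    return recorded.year - born.year - (0 if had_birthday else 1)


def floor_to_decade(value: int) -> int:
    """Round down to the nearest multiple of ten (assumed rule for
    both the 'generation' and the 'decade' fields)."""
    return value - value % 10


def text_id(guest_name: str, recorded: date) -> str:
    """Guest name without whitespace, underscore, recording year
    (assumed pattern, inferred from the example 'DouglasAdams_1994')."""
    return "".join(guest_name.split()) + "_" + str(recorded.year)


# Example values taken from the metadata table
born = date(1952, 3, 11)
recorded = date(1994, 2, 6)

age = age_at_recording(born, recorded)
print(age)                             # 41
print(floor_to_decade(age))            # 40
print(floor_to_decade(recorded.year))  # 1990
print(text_id("Douglas Adams", recorded))  # DouglasAdams_1994
```

With the example dates from the table, the guest had not yet had a birthday in the recording year, which is why the exact age comes out as 41 rather than 42.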