
Desert Island Discs Corpus

The Desert Island Discs corpus is a diachronic collection of nearly 12 million words of spoken English. It contains transcripts from the complete archive of 74 years of Desert Island Discs, a BBC radio programme. The corpus features 2,448 episodes, corresponding to all those available since 1951, and is updated to the end of May 2025. The corpus is annotated for a large set of parameters, including speaker turns and socio-demographic information (such as gender, profession and age at time of recording) derived from Wikidata.

The corpus can be accessed freely from the CoLiTec corpora platform: https://corpora.dipintra.it.

The following table lists the available metadata on hosts, guests, episodes and turns.

| Metadata field type | Metadata field name | Metadata field value | Metadata source |
|---|---|---|---|
| On host & guest | Host's name | E.g. Sue Lawley | BBC archive |
| On host & guest | Guest's name | E.g. Douglas Adams | BBC archive |
| On host & guest | Guest's gender | Female, male, nonbinary, transgender female, transgender male | Wikidata |
| On host & guest | Guest's date of birth | E.g. 1952-03-11 | Wikidata |
| On host & guest | Guest's year of birth | E.g. 1952 | Wikidata |
| On host & guest | Guest's year of death | E.g. 2001 | Wikidata |
| On host & guest | Guest's age at date of recording: exact | E.g. 41 | Inferred from other metadata |
| On host & guest | Guest's age at date of recording: generation | E.g. 40 | Inferred from other metadata |
| On host & guest | Guest's place of birth | E.g. Cambridge | Wikidata |
| On host & guest | Guest's country of birth | E.g. United Kingdom | Wikidata |
| On host & guest | Guest's country of citizenship | E.g. United Kingdom | Wikidata |
| On host & guest | Guest's occupation: category | academic, activist, actor, architect, artist, broadcaster, businessperson, chef, comedian, dancer, designer, director, engineer, explorer, farmer, gardener, journalist, lawyer, medical personnel, military, misc, model, musician, photographer, politician, producer, religion personnel, scientist, sportsperson, trainer, writer | Inferred from other metadata (authors’ classification) |
| On host & guest | Guest's occupation: first mentioned | E.g. playwright | Wikidata |
| On host & guest | Guest's occupation: all | E.g. playwright; screenwriter; novelist; science fiction writer | Wikidata |
| On host & guest | Guest's education (where) | E.g. Brentwood School | Wikidata |
| On host & guest | Guest's languages spoken | E.g. English | Wikidata |
| On host & guest | Guest's political affiliation (member of party) | E.g. Labour Party | Wikidata |
| On host & guest | Guest has children? | Y, NA | Wikidata |
| On host & guest | Guest's positions held | E.g. Member of the House of Lords | Wikidata |
| On host & guest | Guest's Wikidata identifier | E.g. Q42 | Wikidata |
| On recording | Recording: exact date | E.g. 1994-02-06 | BBC archive |
| On recording | Recording: decade | E.g. 1990 | Inferred from other metadata |
| On recording | Recording: year | E.g. 1994 | Inferred from other metadata |
| On recording | Text ID | E.g. DouglasAdams_1994 | Inferred from other metadata |
| On turn | Turn type | guest, host, intro, music, other, thanking and ending | Heuristics based on WhisperX output |
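
Several fields in the table are marked as "Inferred from other metadata". The following Python sketch shows one way such fields could be derived from a guest's date of birth and the recording date, using the Douglas Adams example values given in the table. It is an illustration only, not the authors' actual procedure: the function names are hypothetical, the "generation" field is assumed to be the exact age rounded down to the decade, and the Text ID is assumed to be the guest's name without spaces followed by the recording year.

```python
from datetime import date


def age_at_recording(born: date, recorded: date) -> int:
    """Completed years between the date of birth and the recording date."""
    had_birthday = (recorded.month, recorded.day) >= (born.month, born.day)
    return recorded.year - born.year - (0 if had_birthday else 1)


def floor_to_decade(value: int) -> int:
    """Round an age or a year down to the nearest multiple of ten."""
    return value // 10 * 10


def make_text_id(guest_name: str, recorded: date) -> str:
    """Assumed rule: name with whitespace removed, underscore, recording year."""
    return "".join(guest_name.split()) + f"_{recorded.year}"


# Example values taken from the metadata table (Douglas Adams).
born = date(1952, 3, 11)
recorded = date(1994, 2, 6)

age = age_at_recording(born, recorded)
derived = {
    "age_exact": age,
    "age_generation": floor_to_decade(age),
    "recording_year": recorded.year,
    "recording_decade": floor_to_decade(recorded.year),
    "text_id": make_text_id("Douglas Adams", recorded),
}
print(derived)
```

Under these assumptions the derived values agree with the examples in the table (41, 40, 1994, 1990 and DouglasAdams_1994). How the corpus handles names containing punctuation, or guests with more than one episode in the same year, is not stated on this page.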

Summary statistics on corpus size are provided below.

| Measure | Value |
|---|---|
| Number of texts (episodes) | 2,448 |
| Number of different guests | 2,352 |
| Number of tokens | 14,217,426 |
| Number of types | 114,242 |
| Average text length: median (and IQR) | 5,242.5 (1,504.0) |
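
As a rough illustration of what these measures mean, the Python sketch below computes the number of texts, tokens and types, together with the median and interquartile range (IQR) of text length, for a tiny invented sample. It is a sketch under stated assumptions rather than the authors' method: the toy texts are made up, tokens are obtained by whitespace splitting, types are counted case-sensitively, and the quartiles use the "inclusive" method of `statistics.quantiles`. The corpus figures may rest on a different tokeniser and quartile definition.

```python
from statistics import median, quantiles


def corpus_size_stats(texts: dict[str, list[str]]) -> dict[str, float]:
    """Summarise a corpus given as a mapping from text ID to its token list."""
    lengths = [len(tokens) for tokens in texts.values()]
    all_tokens = [tok for tokens in texts.values() for tok in tokens]
    # Quartile definitions vary between tools; "inclusive" is an assumption.
    q1, _, q3 = quantiles(lengths, n=4, method="inclusive")
    return {
        "texts": len(texts),
        "tokens": len(all_tokens),
        "types": len(set(all_tokens)),
        "median_length": median(lengths),
        "iqr_length": q3 - q1,
    }


# Invented toy texts with hypothetical IDs; not taken from the corpus.
toy_texts = {
    "A_2000": "the island is the island".split(),
    "B_2001": "music on the island".split(),
    "C_2002": "a book".split(),
    "D_2003": "the luxury item and the book and music".split(),
}

stats = corpus_size_stats(toy_texts)
print(stats)
```

With an even number of texts the median is the mean of the two middle lengths, which is why a value ending in .5 can occur.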

If you use the corpus in your research, teaching, or any other work, please cite:

[CITATION HERE]

Further information on how the corpus was compiled can be found in the article.

Last modified: 2025/09/16 17:32 by eros