Desert Island Discs Corpus
A corpus of transcriptions of the BBC's Desert Island Discs episodes (1951-2025)
The Desert Island Discs corpus is a diachronic collection of nearly 12 million words of spoken English. It contains transcripts from the complete archive of 74 years of Desert Island Discs, a BBC radio programme. The corpus comprises 2,448 episodes: all those available from 1951 to the end of May 2025. The corpus is annotated for a large set of parameters, including speaker turns and socio-demographic information (such as gender, profession and age at time of recording) derived from Wikidata.
The corpus can be accessed freely from the CoLiTec corpora platform: https://corpora.dipintra.it.
The following table lists all available metadata on hosts, guests, episodes and turns.
| Metadata field type | Metadata field name | Metadata field value | Metadata source |
|---|---|---|---|
| On host & guest | Host's name | E.g. Sue Lawley | BBC archive |
| | Guest's name | E.g. Douglas Adams | BBC archive |
| | Guest's gender | Female, male, nonbinary, transgender female, transgender male | Wikidata |
| | Guest's date of birth | E.g. 1952-03-11 | Wikidata |
| | Guest's year of birth | E.g. 1952 | Wikidata |
| | Guest's year of death | E.g. 2001 | Wikidata |
| | Guest's age at date of recording: exact | E.g. 41 | Inferred from other metadata |
| | Guest's age at date of recording: generation | E.g. 40 | Inferred from other metadata |
| | Guest's place of birth | E.g. Cambridge | Wikidata |
| | Guest's country of birth | E.g. United Kingdom | Wikidata |
| | Guest's country of citizenship | E.g. United Kingdom | Wikidata |
| | Guest's occupation: category | academic, activist, actor, architect, artist, broadcaster, businessperson, chef, comedian, dancer, designer, director, engineer, explorer, farmer, gardener, journalist, lawyer, medical personnel, military, misc, model, musician, photographer, politician, producer, religion personnel, scientist, sportsperson, trainer, writer | Inferred from other metadata (authors’ classification) |
| | Guest's occupation: first mentioned | E.g. playwright | Wikidata |
| | Guest's occupation: all | E.g. playwright; screenwriter; novelist; science fiction writer | Wikidata |
| | Guest's education (where) | E.g. Brentwood School | Wikidata |
| | Guest's languages spoken | E.g. English | Wikidata |
| | Guest's political affiliation (member of party) | E.g. Labour Party | Wikidata |
| | Guest has children? | Y, NA | Wikidata |
| | Guest's positions held | E.g. Member of the House of Lords | Wikidata |
| | Guest's Wikidata identifier | E.g. Q42 | Wikidata |
| On recording | Recording: exact date | E.g. 1994-02-06 | BBC archive |
| | Recording: decade | E.g. 1990 | Inferred from other metadata |
| | Recording: year | E.g. 1994 | Inferred from other metadata |
| | Text ID | E.g. DouglasAdams_1994 | Inferred from other metadata |
| On turn | Turn type | guest, host, intro, music, other, thanking and ending | Heuristics based on WhisperX output |
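Several fields in the table are marked "Inferred from other metadata". The sketch below shows one way such values could be derived from a guest's date of birth and the recording date, using the example values from the table. It is an illustration under stated assumptions, not the authors' actual procedure: the function names are hypothetical, the "generation" and "decade" fields are assumed to be the value rounded down to a multiple of ten, and the Text ID is assumed to be the guest's name with non-alphanumeric characters removed, followed by the recording year.

```python
from datetime import date


def age_at_recording(born: date, recorded: date) -> int:
    """Full years elapsed between the date of birth and the recording date."""
    had_birthday = (recorded.month, recorded.day) >= (born.month, born.day)
    return recorded.year - born.year - (0 if had_birthday else 1)


def decade_band(value: int) -> int:
    """Round down to a multiple of ten (assumed rule for 'generation' and 'decade')."""
    return value - value % 10


def text_id(guest_name: str, recorded: date) -> str:
    """Assumed scheme: name without spaces or punctuation, underscore, recording year."""
    compact = "".join(ch for ch in guest_name if ch.isalnum())
    return f"{compact}_{recorded.year}"


# Example values taken from the metadata table.
born = date(1952, 3, 11)
recorded = date(1994, 2, 6)

age = age_at_recording(born, recorded)
print(age)                                 # 41
print(decade_band(age))                    # 40
print(decade_band(recorded.year))          # 1990
print(text_id("Douglas Adams", recorded))  # DouglasAdams_1994
```

With the example dates the guest's birthday had not yet occurred in the recording year, which is why the exact age is 41 rather than 42.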
Summary statistics on corpus size are provided below.
| Statistic | Value |
|---|---|
| Number of texts (episodes) | 2,448 |
| Number of different guests | 2,352 |
| Number of tokens | 14,217,426 |
| Number of types | 114,242 |
| Average text length: median (and IQR) | 5242.5 (1504.0) |
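For readers who want to compute comparable figures on a subset of the corpus, the following sketch derives the same kinds of statistics from already-tokenized texts. It is an assumption-laden illustration: the function name `size_summary` is hypothetical, the toy texts are invented, types are counted as distinct token strings without case folding, and the IQR uses the "inclusive" quartile method from Python's `statistics` module. The tokenization and quartile method behind the published figures are not stated here, so results may differ.

```python
from statistics import median, quantiles


def size_summary(texts: list[list[str]]) -> dict:
    """Corpus size statistics for a list of tokenized texts."""
    lengths = [len(tokens) for tokens in texts]
    # Quartiles of text length; the quartile method is an assumption.
    q1, _, q3 = quantiles(lengths, n=4, method="inclusive")
    return {
        "texts": len(texts),
        "tokens": sum(lengths),
        "types": len({tok for tokens in texts for tok in tokens}),
        "median_length": median(lengths),
        "iqr_length": q3 - q1,
    }


# Invented toy data, split on whitespace for simplicity.
toy = [
    "my first disc is a song".split(),
    "the book and the luxury".split(),
    "a song".split(),
    "and the island".split(),
]

for name, value in size_summary(toy).items():
    print(f"{name}: {value}")
```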
Citation
If you use the corpus in your research, teaching, or any other work, please cite:
[CITATION HERE]
Further information on how the corpus was compiled can be found in the article.