The Desert Island Discs corpus is a diachronic collection of nearly 12 million words of spoken English. It contains transcripts from the complete 74-year archive of Desert Island Discs, a BBC radio programme.
The corpus comprises 2,448 episodes, covering all those available from 1951 to the end of May 2025. It is annotated for a large set of parameters, including speaker turns and socio-demographic information (such as gender, profession and age at time of recording) derived from Wikidata.
The corpus can be accessed freely from the CoLiTec corpora platform.
The following table gives the full list of available metadata concerning hosts, guests, episodes and turns.
| Metadata field type | Metadata field name | Metadata field value | Metadata source |
|---|---|---|---|
| On host & guest | Host's name | E.g. Sue Lawley | BBC archive |
| | Guest's name | E.g. Douglas Adams | BBC archive |
| | Guest's gender | Female, male, nonbinary, transgender female, transgender male | Wikidata |
| | Guest's date of birth | E.g. 1952-03-11 | Wikidata |
| | Guest's year of birth | E.g. 1952 | Wikidata |
| | Guest's year of death | E.g. 2001 | Wikidata |
| | Guest's age at date of recording: exact | E.g. 41 | Inferred from other metadata |
| | Guest's age at date of recording: generation | E.g. 40 | Inferred from other metadata |
| | Guest's place of birth | E.g. Cambridge | Wikidata |
| | Guest's country of birth | E.g. United Kingdom | Wikidata |
| | Guest's country of citizenship | E.g. United Kingdom | Wikidata |
| | Guest's occupation: category | academic, activist, actor, architect, artist, broadcaster, businessperson, chef, comedian, dancer, designer, director, engineer, explorer, farmer, gardener, journalist, lawyer, medical personnel, military, misc, model, musician, photographer, politician, producer, religion personnel, scientist, sportsperson, trainer, writer | Inferred from other metadata (authors’ classification) |
| | Guest's occupation: first mentioned | E.g. playwright | Wikidata |
| | Guest's occupation: all | E.g. playwright; screenwriter; novelist; science fiction writer | Wikidata |
| | Guest's education (where) | E.g. Brentwood School | Wikidata |
| | Guest's languages spoken | E.g. English | Wikidata |
| | Guest's political affiliation (member of party) | E.g. Labour Party | Wikidata |
| | Guest has children? | Y, NA | Wikidata |
| | Guest's positions held | E.g. Member of the House of Lords | Wikidata |
| | Guest's Wikidata identifier | E.g. Q42 | Wikidata |
| On recording | Recording: exact date | E.g. 1994-02-06 | BBC archive |
| | Recording: decade | E.g. 1990 | Inferred from other metadata |
| | Recording: year | E.g. 1994 | Inferred from other metadata |
| | Text ID | E.g. DouglasAdams_1994 | Inferred from other metadata |
| On turn | Turn type | guest, host, intro, music, other, thanking and ending | Heuristics based on WhisperX output |
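Several fields in the table are marked "Inferred from other metadata", but this page does not describe how they were computed. The sketch below shows one plausible derivation, using the Douglas Adams example values from the table. The function names (`age_at_recording`, `floor_to_decade`, `make_text_id`) are hypothetical, and the rules they encode (age in completed years, rounding down to the nearest ten, guest name with whitespace removed plus the recording year) are assumptions that happen to reproduce the examples, not the authors' documented procedure.

```python
from datetime import date


def age_at_recording(birth: date, recording: date) -> int:
    """Age in completed years on the recording date (assumed definition)."""
    had_birthday = (recording.month, recording.day) >= (birth.month, birth.day)
    return recording.year - birth.year - (0 if had_birthday else 1)


def floor_to_decade(value: int) -> int:
    """Round down to the nearest multiple of ten (e.g. 41 -> 40, 1994 -> 1990)."""
    return value // 10 * 10


def make_text_id(guest_name: str, recording: date) -> str:
    """Guest name without whitespace, plus the recording year (assumed scheme)."""
    return "".join(guest_name.split()) + f"_{recording.year}"


# Example values taken from the metadata table.
birth = date.fromisoformat("1952-03-11")
recording = date.fromisoformat("1994-02-06")

exact_age = age_at_recording(birth, recording)
derived = {
    "age_exact": exact_age,                              # 41
    "age_generation": floor_to_decade(exact_age),        # 40
    "recording_year": recording.year,                    # 1994
    "recording_decade": floor_to_decade(recording.year), # 1990
    "text_id": make_text_id("Douglas Adams", recording), # DouglasAdams_1994
}
print(derived)
```

Names containing punctuation or diacritics, and guests who appear in more than one episode in the same year, would need extra rules that this sketch does not attempt to guess.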
Summary statistics on corpus size are provided below.
| Statistic | Value |
|---|---|
| Number of texts (episodes) | 2,448 |
| Number of different guests | 2,352 |
| Number of tokens | 14,217,426 |
| Number of types | 114,242 |
| Average text length: median (and IQR) | 5242.5 (1504.0) |
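For readers unfamiliar with the last row, the median is the middle value of the per-episode text lengths and the interquartile range (IQR) is the distance between the first and third quartiles. The sketch below computes both with Python's standard `statistics` module. The `lengths` list is invented toy data, not corpus figures, and the quartile method used for the corpus is not stated on this page, so the default "exclusive" method here is an assumption.

```python
import statistics

# Invented per-episode text lengths, for illustration only (not corpus data).
lengths = [4100, 4600, 5000, 5200, 5300, 5900, 6400, 7100]

median_length = statistics.median(lengths)

# quantiles(n=4) returns the three cut points Q1, Q2 and Q3.
q1, _q2, q3 = statistics.quantiles(lengths, n=4)
iqr = q3 - q1

print(f"{median_length} ({iqr})")
```

Other quartile conventions (for example `method="inclusive"`) can give slightly different IQR values on the same data.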
If you use the corpus in your research, teaching, or any other work, please cite:
[CITATION HERE]
Further information on how the corpus was compiled can be found in the article.
This corpus contains transcriptions of copyrighted radio programme recordings. Copyright in the original audio material remains with the respective rightsholders. The recordings were obtained from publicly available sources and processed for non-commercial research and teaching purposes. The original recordings are available via the BBC’s official online archive.
Only textual transcriptions are made available through this platform. Access is provided exclusively via a query-based interface, which returns short textual excerpts (concordances). Audio files and full transcripts are not distributed.
The corpus may be used for research and educational purposes only. Users may not reconstruct full programmes, systematically extract content, or redistribute materials beyond what is permitted by applicable copyright law.
For copyright-related concerns, please contact: adriano.ferraresi@unibo.it