Desert Island Discs Corpus
A corpus of transcriptions of BBC's Desert Island Discs episodes (1951-2025)

The Desert Island Discs corpus is a diachronic collection of nearly 12 million words of spoken English. It contains transcripts from the complete archive of 74 years of Desert Island Discs, a BBC radio programme.

The corpus features 2,448 episodes, that is, all those available from 1951 to the end of May 2025. It is annotated for a large set of parameters, including speaker turns and socio-demographic information on guests (such as gender, profession and age at time of recording) derived from Wikidata.

The corpus can be accessed freely from the CoLiTec corpora platform.

The following table gives the full list of available metadata on hosts, guests, recordings and turns.

Metadata field type	Metadata field name	Metadata field value	Metadata source
On host & guest	Host's name	E.g. Sue Lawley	BBC archive
	Guest's name	E.g. Douglas Adams	BBC archive
	Guest's gender	Female, male, nonbinary, transgender female, transgender male	Wikidata
	Guest's date of birth	E.g. 1952-03-11	Wikidata
	Guest's year of birth	E.g. 1952	Wikidata
	Guest's year of death	E.g. 2001	Wikidata
	Guest's age at date of recording: exact	E.g. 41	Inferred from other metadata
	Guest's age at date of recording: generation	E.g. 40	Inferred from other metadata
	Guest's place of birth	E.g. Cambridge	Wikidata
	Guest's country of birth	E.g. United Kingdom	Wikidata
	Guest's country of citizenship	E.g. United Kingdom	Wikidata
	Guest's occupation: category	academic, activist, actor, architect, artist, broadcaster, businessperson, chef, comedian, dancer, designer, director, engineer, explorer, farmer, gardener, journalist, lawyer, medical personnel, military, misc, model, musician, photographer, politician, producer, religion personnel, scientist, sportsperson, trainer, writer	Inferred from other metadata (authors’ classification)
	Guest's occupation: first mentioned	E.g. playwright	Wikidata
	Guest's occupation: all	E.g. playwright; screenwriter; novelist; science fiction writer	Wikidata
	Guest's education (where)	E.g. Brentwood School	Wikidata
	Guest's languages spoken	E.g. English	Wikidata
	Guest's political affiliation (member of party)	E.g. Labour Party	Wikidata
	Guest has children?	Y, NA	Wikidata
	Guest's positions held	E.g. Member of the House of Lords	Wikidata
	Guest's Wikidata identifier	E.g. Q42	Wikidata
On recording	Recording: exact date	E.g. 1994-02-06	BBC archive
	Recording: decade	E.g. 1990	Inferred from other metadata
	Recording: year	E.g. 1994	Inferred from other metadata
	Text ID	E.g. DouglasAdams_1994	Inferred from other metadata
On turn	Turn type	guest, host, intro, music, other, thanking and ending	Heuristics based on WhisperX output
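
Several fields in the table are marked as "Inferred from other metadata". The sketch below shows one way such fields could be computed from a guest's name, date of birth and recording date. It is an illustration only: the function name `derive_fields`, the dictionary keys and the exact rules (for example, rounding age down to the decade for the "generation" field) are assumptions chosen to match the example values in the table, not a description of the authors' actual procedure.

```python
from datetime import date


def derive_fields(guest_name: str, birth: date, recording: date) -> dict:
    """Hypothetical helper: compute the inferred metadata fields for one episode."""
    # Exact age: subtract one year if the birthday has not yet
    # occurred in the recording year.
    birthday_pending = (recording.month, recording.day) < (birth.month, birth.day)
    age = recording.year - birth.year - int(birthday_pending)
    return {
        "age_exact": age,
        # Assumption: "generation" is the age rounded down to the decade.
        "age_generation": age // 10 * 10,
        "recording_year": recording.year,
        "recording_decade": recording.year // 10 * 10,
        # Assumption: Text ID is the name without spaces plus the recording year.
        "text_id": f"{''.join(guest_name.split())}_{recording.year}",
    }


fields = derive_fields("Douglas Adams", date(1952, 3, 11), date(1994, 2, 6))
print(fields)
```

With the example values from the table (date of birth 1952-03-11, recording date 1994-02-06), this yields an exact age of 41, a generation of 40, a decade of 1990 and the Text ID DouglasAdams_1994.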

Corpus statistics

Summary statistics on corpus size are provided below.

Number of texts (episodes)	2,448
Number of different guests	2,352
Number of tokens	14,217,426
Number of types	114,242
Text length: median (IQR)	5,242.5 (1,504.0)
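
For readers who want to compute comparable figures on their own data, the following is a minimal sketch of how token counts, type counts, and the median and interquartile range (IQR) of text length can be obtained. It assumes each text is already tokenised into a list of strings and that types are counted case-sensitively; the tokenisation and counting conventions used for the corpus itself are not stated here, so results are not expected to match the figures above. The function name `corpus_summary` is hypothetical.

```python
import statistics


def corpus_summary(texts: list[list[str]]) -> dict:
    """Hypothetical helper: summary statistics for a list of tokenised texts."""
    lengths = [len(tokens) for tokens in texts]
    # Quartiles of text length; requires at least two texts.
    q1, _, q3 = statistics.quantiles(lengths, n=4)
    return {
        "texts": len(texts),
        "tokens": sum(lengths),
        # Types: distinct token forms across the whole collection
        # (assumption: case-sensitive, no normalisation).
        "types": len({token for tokens in texts for token in tokens}),
        "median_length": statistics.median(lengths),
        "iqr_length": q3 - q1,
    }


# Toy data standing in for tokenised transcripts.
toy_texts = [["a", "b", "c"], ["a", "b"], ["d", "e", "f", "g"], ["a"]]
summary = corpus_summary(toy_texts)
print(summary)
```

Note that different quartile definitions give slightly different IQR values; `statistics.quantiles` uses the "exclusive" method by default.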

Citation

If you use the corpus in your research, teaching, or any other work, please cite:

[CITATION HERE]

Further information on how the corpus was compiled can be found in the article.

Copyright and Use

This corpus contains transcriptions of copyrighted radio programme recordings. Copyright in the original audio material remains with the respective rightsholders. The recordings were obtained from publicly available sources and processed for non-commercial research and teaching purposes. The original recordings are available via the BBC’s official online archive.

Only textual transcriptions are made available through this platform. Access is provided exclusively via a query-based interface, which returns short textual excerpts (concordances). Audio files and full transcripts are not distributed.

The corpus may be used for research and educational purposes only. Users may not reconstruct full programmes, systematically extract content, or redistribute materials beyond what is permitted by applicable copyright law.

For copyright-related concerns, please contact: adriano.ferraresi@unibo.it
