EPIC Transcription Conventions
Theoretical Framework
Transcribing oral texts represents a first level of analysis, in that transcription is by its nature a selective process. Indeed, it is virtually impossible to reproduce all the characteristics of speech in writing, as there are several levels (i.e. linguistic, paralinguistic and extra-linguistic), each comprising a potentially unlimited number of features, e.g. pauses, repetitions, prosody, body language and many more. The specific type of material under investigation and the aim of one's research are among the most significant factors influencing the way oral texts are transcribed for later analysis.
Against this background, the aim of EPIC researchers was to prepare a large amount of original and simultaneously interpreted texts for automatic analysis. Thus, the idea was to produce a basic transcript, to which further levels of annotation could be easily added if and when needed.
All the conventions are summarised in the table below.
Transcription Levels
Linguistic level
All the words spoken by both speakers and simultaneous interpreters are transcribed (orthographic transcription). There are no punctuation signs in the transcripts, as they could be misleading and create problems in automatic analysis. Transcribed texts are segmented into units of meaning, on the basis of the speaker's intonation and the syntactic information in the sentence involved. The double bar sign // is used to indicate the end of each segment. This segmentation also supports the alignment process between source and target texts.
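As a minimal sketch of how this segmentation can be exploited downstream, the following snippet splits a transcript into its units of meaning at the double-bar delimiter. The function name and the sample sentence are illustrative, not part of the EPIC tooling:

```python
def split_units(transcript: str) -> list[str]:
    """Split a transcript into units of meaning at the '//' delimiter,
    dropping surrounding whitespace and empty segments."""
    return [u.strip() for u in transcript.split("//") if u.strip()]

# hypothetical transcript fragment with two segment boundaries
raw = "ladies and gentlemen good morning // the session is open // thank you"
units = split_units(raw)
```

Each element of `units` then corresponds to one alignable segment of the source or target text.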
Spelling conventions follow the standards applied in EU official documents. These indications can be found in the Interinstitutional Style Guide which is available on the European Parliament website for all the official languages of the Union.
Figures, dates and percentages are fully spelt out.
Paralinguistic level
This level is limited to a very small number of features, namely truncated words and mispronunciations.
In order to perform automatic POS-tagging, mispronounced words and those with an internal truncation are first “normalised” and then transcribed as they were actually spoken. For truncations we use the - symbol at the end of the truncated word (e.g. Pre- President it is a pleasure to be here…; this is important for all the countries…). For mispronunciations we simply enclose the mispronounced word between slashes, after the intended word (e.g. qui al Parlamento /Parlomento/ si discute…).
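One plausible reading of this normalisation step can be sketched in a few lines of Python. The convention assumed here, following the examples above, is that the slashed form follows the intended word and that truncated fragments end in a hyphen; the function name is hypothetical and the sketch is not the EPIC team's actual tool:

```python
import re

def normalise(text: str) -> str:
    """Produce a 'normalised' transcript suitable for POS-tagging:
    drop slashed forms such as /Parlomento/ or /pro_posal/ (the
    intended word precedes them) and drop truncated fragments
    ending in '-', such as 'Pre-'."""
    # remove mispronounced / internally truncated variants in slashes
    text = re.sub(r"/[^/\s]+/", "", text)
    # remove word truncations marked with a final hyphen
    text = re.sub(r"\S+-(?=\s|$)", "", text)
    # collapse the whitespace left behind
    return re.sub(r"\s+", " ", text).strip()
```

For example, `normalise("qui al Parlamento /Parlomento/ si discute")` yields the clean string `"qui al Parlamento si discute"`, which a tagger can process without tripping over the disfluency markup.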
Pauses are also included, but they are currently annotated on the basis of the transcriber's perception only, i.e. they have not been measured electronically with appropriate tools. Both silent (…) and filled (ehm) pauses are considered, but no details are provided about their duration. Although this might sound like a methodological bias, it is in fact an attempt to reflect the oral data closely while producing user-friendly transcripts.
Extra-linguistic level
This level provides information about the context, the speaker and the spoken material itself. All this is structured in the form of a header, which comes before the transcribed text and which was used to set the parameters to carry out automatic queries.
The header structure is the same for all texts (see Header and Search Parameters).
EPIC transcripts are first saved in text format. Then, they are transformed into XML documents to allow for POS-tagging and lemmatisation. This is done by using the TreeTagger for English and Italian and FreeLing for Spanish. The tagged subcorpora are then encoded using the IMS Corpus WorkBench. The corpora can be searched by entering a query in the box provided. Each simple query result provides a link (on the left) to the full text of the transcript, from which it is possible to display, alternatively, all the header fields, the “normalised” transcript, the part-of-speech tags, the lemmas and the transcribed version of the speech (including mispronunciations and internal truncations).
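The IMS Corpus WorkBench encoder reads one-token-per-line “vertical” text, with positional attributes (here word, POS tag and lemma) separated by tabs and structural units wrapped in tags. The sketch below, with hypothetical tag names and a made-up sample, shows how tagged tokens and the // segment boundaries could be rendered in that format; it is an illustration of the input shape, not the EPIC team's actual conversion script:

```python
def to_vertical(tokens):
    """Render (word, pos, lemma) triples as CWB 'vertical' input:
    one token per line, attributes tab-separated, with each unit
    of meaning wrapped in <s> ... </s> structural tags."""
    lines = ["<s>"]
    for word, pos, lemma in tokens:
        if word == "//":            # unit-of-meaning delimiter
            lines += ["</s>", "<s>"]
        else:
            lines.append(f"{word}\t{pos}\t{lemma}")
    lines.append("</s>")
    return "\n".join(lines)

# illustrative tagged tokens (tags are made up, not TreeTagger output)
sample = [
    ("the", "DT", "the"),
    ("session", "NN", "session"),
    ("//", "", ""),
    ("thank", "VV", "thank"),
    ("you", "PP", "you"),
]
vertical = to_vertical(sample)
```

The resulting text can then be passed to the CWB encoding tools, after which queries over word forms, tags or lemmas become possible.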
SPEECH FEATURE | EXAMPLE | TRANSCRIPTION CONVENTION
---|---|---
Word truncations (final) | propo | propo-
Word truncations (internal) | pro posal | proposal /pro_posal/
Pronunciation disfluencies | Parlomento | Parlamento /Parlomento/
Pauses (filled / silent) | | ehm / …
Numbers | 532 | five hundred and thirty-two
Figures | 4% | four per cent
Dates | 1997 | nineteen ninety-seven
Unintelligible words | | #
Units of meaning | based on syntax and intonation | //
Easing the transcription process
Needless to say, transcribing spoken material is a demanding task, requiring significant effort and patience. Here are some suggestions to speed up the transcription process. Note that these suggestions are valid for EPIC material, but they could be applied to other spoken text sources as well.
In this respect, more and more sources providing audio material together with “neat”, revised transcripts are currently available on the Web. As with EPIC material, these could be used as a basis to produce transcripts that reflect spoken language features very closely. Some examples are the United Nations' website, the CNN radio Web pages and the national governments' websites of several countries.
In our case, all EP Plenary debates are always transcribed by EU officials and the verbatim reports are then translated into all the official languages. However, these transcripts do not reflect spoken language features (e.g. repetitions, reformulations, unfinished sentences etc.) very closely. They are written texts and they have been stylistically revised. Nevertheless, the verbatim reports are an extremely useful basis to work on, i.e. they are a first draft for our transcripts.
As for the target texts, i.e. the simultaneously interpreted versions, it is not possible to use the official translations of the verbatim reports (available in all the EU official languages), because they are written texts produced by the EP translators after each plenary and differ considerably from the versions produced by the EP interpreters during the debate. Since all members of the EPIC team are trained interpreters, we have decided to use speech recognition programs to speed up the transcription process. We perform “shadowing” (i.e. listening to the speech and simultaneously repeating it aloud) using headsets and microphones connected to our computers, which we have trained to recognise our voices. In this way, the computers do the writing and we obtain a first draft of the interpreted versions very quickly. The draft is then revised manually and mistakes are corrected. This method can be applied to other oral sources as well in order to obtain a draft transcript more easily and quickly, provided that the transcriber can perform shadowing.