
The header

The header contains extra-linguistic information on each speech. It is made up of a number of fields, which provide information about the transcript file, the speech and the speaker. The following is an example of the template we use:

(date: 25-02-04-p speech number: 017 language: en type: org-en

duration: short timing: 24

text length: short number of words: 69

speed: high words per minute: 172

source text delivery: impromptu

speaker: Cox, Patrick gender: M country: Ireland mother tongue: yes

political function: President of the European Parliament political group: ELDR

topic: Procedure & Formalities specific topic: speeches on matters of political importance comments: NA)

The first group of four fields (date, speech number, language and type) makes up a reference code, which is used to classify the speeches. In the date field, the first item (25) indicates the day and the second (02) the month (in this case, February), followed by the year (04, that is, 2004). The letter (m) or (p) tells us whether the speech was delivered during a morning or an afternoon sitting (in this particular case, in the afternoon). The number that follows (017 in our example) is a sequential number we assign to speeches.

The abbreviations “en”, “it” and “es” indicate, respectively, a speech in English, Italian or Spanish. “org” and “int” indicate whether it is an original speech (i.e. a source text) or an interpretation (i.e. a target text). If it is an interpreted speech, we indicate both source and target languages, for example “int-en-it” means that the speech was interpreted from English into Italian.
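As an illustration only, the date and type fields described above could be decoded along the following lines. This is a minimal sketch and not part of the EPIC toolchain: the function names are hypothetical, and we assume that two-digit years belong to the 2000s and that the fields are always hyphen-separated as in the example.

```python
from datetime import date

# Codes described in the text: m/p for the sitting, org/int for the text type.
SITTINGS = {"m": "morning", "p": "afternoon"}
KINDS = {"org": "original", "int": "interpretation"}


def parse_date_field(value):
    """Decode a date field such as '25-02-04-p' (day-month-year-sitting)."""
    day, month, year, sitting = value.split("-")
    # Assumption: two-digit years refer to 20xx (the recordings date from 2004).
    return date(2000 + int(year), int(month), int(day)), SITTINGS[sitting]


def parse_type_field(value):
    """Decode a type field such as 'org-en' or 'int-en-it'."""
    kind, source, *target = value.split("-")
    return {
        "kind": KINDS[kind],
        "source": source,
        # Only interpreted speeches carry a target language.
        "target": target[0] if target else None,
    }


print(parse_date_field("25-02-04-p"))
print(parse_type_field("int-en-it"))
```

With the example header, the date field decodes to 25 February 2004, afternoon sitting, and “int-en-it” decodes to an interpretation from English into Italian.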

This reference code is followed by a number of fields containing information on the speech, namely duration, text length and speed. We have recorded the exact figures for the number of seconds (timing), the number of words and the words per minute (calculated by dividing the number of words by the duration in seconds and multiplying by 60; in the example above, 69 words in 24 seconds give roughly 172 words per minute). We have also classified the duration of speeches as short, medium or long (short: < 120 secs; medium: 121-360 secs; long: > 360 secs).

The same applies to text length, classified as short, medium or long (short: < 300 words; medium: 301-1000 words; long: > 1000 words).

Speed was classified as low, medium or high (low: < 130 w/m; medium: 131-160 w/m; high: > 160 w/m). It must be pointed out that these values were calculated on the basis of the present corpus of speeches, and can therefore only be considered representative of this type of material, that is, speeches delivered during a specific group of plenary sittings of the European Parliament. In different contexts (e.g. the Italian conference interpreting market), a speech lasting 5 minutes (300 seconds) would be considered short rather than medium, since simultaneous interpreters normally work in shifts of about 30 minutes. Likewise, a speech delivered at an average speed of 150 w/m is fast (not medium) by normal conference interpreting standards. However, owing to the specific rules for the allocation of speaking time in European Parliament sittings (click on Source Texts in the left-hand side bar for more information), most MEPs try to say as much as possible in the shortest possible time and therefore tend to speak very fast. In this particular context, 150 w/m can thus be considered a medium speed.
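The three classifications can be sketched in a few lines of code. This is only an illustration of the thresholds given above, not the procedure we actually used. The function names are hypothetical. Note also that the stated bands leave the boundary values (e.g. exactly 120 seconds or 130 w/m) unassigned; the sketch assumes that such values fall into the medium band.

```python
def words_per_minute(words, seconds):
    """Speed in words per minute: words divided by seconds, times 60."""
    return words / seconds * 60


def classify(value, lower, upper, labels):
    """Return labels[0] below `lower`, labels[2] above `upper`, else labels[1].

    Assumption: values lying exactly on a threshold count as the middle band.
    """
    if value < lower:
        return labels[0]
    if value > upper:
        return labels[2]
    return labels[1]


def describe_speech(words, seconds):
    """Build the duration, text length and speed fields for one speech."""
    wpm = words_per_minute(words, seconds)
    return {
        "duration": classify(seconds, 120, 360, ("short", "medium", "long")),
        "text length": classify(words, 300, 1000, ("short", "medium", "long")),
        "speed": classify(wpm, 130, 160, ("low", "medium", "high")),
        "words per minute": wpm,
    }


# The example header: 69 words delivered in 24 seconds.
print(describe_speech(69, 24))
```

For the example header (69 words in 24 seconds) this gives a short duration, a short text length and a high speed at 172.5 w/m, which is consistent with the figure of 172 recorded in the header; how the recorded figure was rounded is not stated here.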

Other information related to the speech includes source text delivery (that is, mode of presentation of the source speech), classified as impromptu, read or mixed. This information is recorded in the transcripts of interpreted speeches as well, since it is important to know whether the source text was read or improvised when analysing the target text.

We have grouped the speeches on the basis of macro-categories indicating the general topic of each speech and we have also recorded the specific topic under discussion in the debate. Specific topics are varied, ranging from the Parmalat fraud case to human rights in Afghanistan. A full list of specific topics, with corresponding clip numbers, is available in the archive (click on Multimedia Archive in the left-hand side bar).

The next fields in the header contain information on the speaker: name, gender, country of origin, mother tongue, political function and political group. When the speaker is an interpreter, no values are assigned to the fields name, country, political function and political group (indicated as NA, that is, not assigned).

The labels “European Commission” and “European Council” indicate that the speaker is either a Commissioner or a European Council Minister: in both cases, we record the field of action of the Commissioner or the Council configuration in the space reserved for comments at the end of the header.

European Commission's areas of responsibility:

Agriculture and Fisheries
Administrative Reform
Competition
Enterprise and Information Society
Internal Market
Research
Development and Humanitarian Aid
Enlargement
External Relations
Trade
Health and Consumer Protection
Education and Culture
Budget
Environment
Justice and Home Affairs
Employment and Social Affairs
Regional Policy
Economic and Monetary Affairs
Relations with the European Parliament, Transport and Energy
President of the European Commission

European Council configurations:

General Affairs and External Relations
Economic and Financial Affairs
Cooperation in the fields of Justice and Home Affairs
Employment, Social Policy, Health and Consumer Affairs
Competitiveness
Transport, Telecommunications and Energy
Agriculture and Fisheries
Environment
Education, Youth and Culture

Finally, the label “guest” indicates that the speaker does not belong to a European Union institution: s/he could be a head of state or government, an intellectual, a politician from a country outside the EU, etc.

The last field is the space reserved for comments. As mentioned above, this space is used to add information on Commissioners and European Council Ministers, but also to indicate whether the speaker has a noticeable accent (Scottish, Welsh, Irish; Andalusian, Latin American), to comment on any technical problems in the recordings and to record any unusual features of a speech that may be useful for later analysis.
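Since every header follows the same template, its fields can be read automatically. The following is a minimal sketch of one way to do so, assuming that the field names are exactly those of the example template and that no field value itself contains a field name followed by a colon. The function name `parse_header` is hypothetical and is not an existing EPIC tool.

```python
import re

# Field names as they appear in the example template.
FIELDS = [
    "date", "speech number", "language", "type",
    "duration", "timing", "text length", "number of words",
    "speed", "words per minute", "source text delivery",
    "speaker", "gender", "country", "mother tongue",
    "political function", "political group",
    "topic", "specific topic", "comments",
]

# A field name followed by a colon marks the start of a new field.
# The regex engine takes the leftmost match, so "specific topic:" is
# recognised as a whole before the shorter "topic" can match inside it.
FIELD_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in FIELDS) + r"):\s*"
)


def parse_header(text):
    """Turn a header block into a dict mapping field names to values."""
    # Collapse line breaks and drop the enclosing parentheses.
    flat = " ".join(text.split()).strip("()")
    # Splitting on a pattern with one capture group yields
    # [prefix, name, value, name, value, ...].
    parts = FIELD_PATTERN.split(flat)
    return {name: value.strip() for name, value in zip(parts[1::2], parts[2::2])}


EXAMPLE = """(date: 25-02-04-p speech number: 017 language: en type: org-en
duration: short timing: 24
text length: short number of words: 69
speed: high words per minute: 172
source text delivery: impromptu
speaker: Cox, Patrick gender: M country: Ireland mother tongue: yes
political function: President of the European Parliament political group: ELDR
topic: Procedure & Formalities specific topic: speeches on matters of political importance comments: NA)"""

header = parse_header(EXAMPLE)
print(header["speaker"], header["words per minute"])
```

All values are kept as strings, so numeric fields such as timing or number of words would still need to be converted before any calculation.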

The multimedia archive is currently stored on the hard disk of a dedicated machine, but there are plans to load it on an Internet server to enable external researchers to access the audio and video clips as well as the transcripts which make up the EPIC corpus.

The EP plenary sessions were recorded off the satellite news channel EbS (Europe by Satellite), which enables viewers to select different sound channels for different EU languages. Four TV+videorecorder workstations were used for each plenary to obtain a recording of the original sound channel, and recordings of the English, Italian and Spanish sound channels (that is, of the interpreters working in the three booths).

The part-sessions recorded include the following plenaries: February, March, April and July (2004). See the official 2004 EP calendar.

We used 240-minute VHS tapes. Each part-session generally lasted four days (Monday to Thursday), and we used about 2 tapes per day per language, reflecting the EbS broadcasting schedule, which does not cover entire EP part-sessions. Moreover, owing to technical difficulties with satellite broadcasts or with our recording equipment, it was not always possible to record everything that was broadcast by EbS. It must be noted that EbS also broadcasts press conferences and stock footage which European TV channels can use when reporting on EU affairs. Therefore, although on average we used 28 VHS tapes for each plenary, our recordings had to be edited to select only the debates. The next step was digitisation.

The VHS tapes with the recordings of the original speakers are being digitised as video files, as visual information is potentially useful for later analysis of the corpus. By contrast, the interpreted speeches are digitised as audio files, since the images on these VHS tapes are exactly the same (i.e. the plenary speakers), whereas our interest lies in the audio information (i.e. the interpreters' performances). For each plenary, we thus obtain one video file (the original version) and three audio files (the English, Italian and Spanish interpretations respectively).

The recordings of the original speakers are converted into digital video files using Pinnacle Studio 9.0, a video-capture and editing programme. The chosen format for the video files is “.mpeg1”.

The recordings of the interpreted speeches are digitised using Cool Edit Pro 2.0, a sound editor. The chosen format is “.wav” (sample rate = 32,000 Hz; channel = mono; resolution = 8 bit), which ensures very good audio quality for possible future studies of prosodic features (distribution of pauses, hesitations, etc.). There are plans to upload the EPIC archive to a dedicated Web server from which researchers will be able to download the clips. When the project reaches that stage, the “.wav” clips will be converted into a lighter format, probably “.mp3”.
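For anyone wishing to check that an audio clip matches the format described above, the following sketch uses Python's standard `wave` module. It is an illustration only: the function names are hypothetical, and we assume that the sample rate is 32,000 Hz and that 8-bit resolution corresponds to a sample width of one byte.

```python
import wave

# Assumed reading of the stated format: 32,000 Hz, mono, 8 bit (1 byte per sample).
EXPECTED_FORMAT = {"framerate": 32000, "nchannels": 1, "sampwidth": 1}


def wav_parameters(path):
    """Read sample rate, channel count and sample width (in bytes) from a .wav file."""
    with wave.open(path, "rb") as clip:
        return {
            "framerate": clip.getframerate(),
            "nchannels": clip.getnchannels(),
            "sampwidth": clip.getsampwidth(),
        }


def matches_expected_format(path):
    """True if the clip uses the sample rate, channels and resolution given above."""
    return wav_parameters(path) == EXPECTED_FORMAT


def uncompressed_size_bytes(seconds):
    """Approximate size of the audio data alone (file header excluded)."""
    return (
        seconds
        * EXPECTED_FORMAT["framerate"]
        * EXPECTED_FORMAT["nchannels"]
        * EXPECTED_FORMAT["sampwidth"]
    )
```

Under these assumptions one minute of audio takes up about 1.92 million bytes before any conversion to a lighter format.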

Once the original recording of each plenary has been converted into a video file, all the speeches made in Italian, English and Spanish are selected and saved as individual clips (video files for the original speakers and audio files for the interpreters).

The archive includes video clips of each source language speaker, audio clips of the corresponding interpreted target texts, and the transcripts of all the texts.