E.P.I.C. (European Parliament Interpreting Corpus)


EPIC is an open, parallel, trilingual (Italian, English and Spanish) corpus of European Parliament speeches and their corresponding interpretations currently being compiled at DIT (University of Bologna).

Three researchers (Annalisa Sandrelli, Claudio Bendazzoli and Cristina Monti) have been able to work on the project full-time for two years (2004-2006) thanks to two research grants provided by the Scuola Superiore di Studi Umanistici (SSSUB) of the University of Bologna, one grant provided by SITLeC, and funding from the Ministry for University and Research (RFO ex quota 60%). They are supported by an interdisciplinary Research Group coordinated by Professor Mariachiara Russo, which includes interpreters, translators and computational linguists. The members of the Directionality Research Group are: Mariachiara Russo, Annalisa Sandrelli, Claudio Bendazzoli, Cristina Monti, Marco Baroni, Gabriele Mack, Elio Ballardini, Peter Mead and Silvia Bernardini.

In 2004 several European Parliament plenary sessions were recorded from the news channel EbS (Europe by Satellite). By selecting different audio channels, it was possible to record both the original speakers and the interpreters working in the various booths (in our case, Italian, English and Spanish). All the material thus obtained is being digitised and edited with dedicated software in order to create a multimedia archive. At the moment, video and audio files are not available online, but information on the content and structure of the archive can be obtained by clicking on Multimedia Archive in the left-hand sidebar.

All the clips are transcribed following specific conventions. Each transcript also features a specially designed header where we record information about the text (e.g. duration of the speech, mode of delivery, average speed, etc.) and the speaker (e.g. nationality, gender, political function, etc.). This enables us to carry out focused searches later on by selecting the relevant search parameters. For more information, click on Header and Search Parameters in the left-hand sidebar.

The next step is the POS (part-of-speech) tagging and lemmatising of the transcripts (see Tagging in the left-hand sidebar). This is done with existing taggers, namely TreeTagger (for Italian and English) and FreeLing (for Spanish).

The final step in the compilation of EPIC is the alignment of source texts and target texts in order to create parallel subcorpora (see Aligned Texts). Overall, EPIC is made up of three subcorpora of original texts (Org-It, Org-En and Org-Es) and six subcorpora of interpreted texts (indicated as Int followed by the language direction, e.g. En-It for English into Italian) covering all the combinations and directions of the three languages, as well as six aligned subcorpora of source and target texts (indicated as Org + Int).
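The subcorpus layout described above can be enumerated in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the exact spelling of the interpreted subcorpus labels (here `Int-En-It`) is an assumption based on the naming pattern given in the text, not a confirmed identifier from the corpus itself.

```python
from itertools import permutations

# The three EPIC languages, using the abbreviations from the text.
LANGUAGES = ["It", "En", "Es"]


def original_subcorpora(languages=LANGUAGES):
    """One subcorpus of original speeches per language."""
    return [f"Org-{lang}" for lang in languages]


def interpreted_subcorpora(languages=LANGUAGES):
    """One subcorpus per direction, i.e. per ordered pair of distinct languages.

    The label format "Int-<source>-<target>" is an assumption.
    """
    return [f"Int-{src}-{tgt}" for src, tgt in permutations(languages, 2)]


def aligned_subcorpora(languages=LANGUAGES):
    """Pair each original subcorpus with each interpretation of it (Org + Int)."""
    return [
        (f"Org-{src}", f"Int-{src}-{tgt}")
        for src, tgt in permutations(languages, 2)
    ]


if __name__ == "__main__":
    print(original_subcorpora())
    print(interpreted_subcorpora())
    for source, target in aligned_subcorpora():
        print(source, "+", target)
```

With three languages, every ordered pair of distinct languages gives one interpreting direction, which is where the figure of six interpreted and six aligned subcorpora comes from.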

This complex structure makes it possible to carry out separate searches in original texts and/or in interpreted texts, for example in order to compare original English with interpreted English, or to compare English source texts with their two interpreted target texts, in Italian and Spanish. The aligned subcorpora, in turn, may be searched to study the relevance of the language combination in simultaneous interpreting (a Romance language and a Germanic language vs. two Romance languages) and the influence of directionality on simultaneous interpreting (e.g. to detect strategies and patterns when interpreting from Italian into English vs. English into Italian).

EPIC can be queried using CQL (the Corpus Query Language). For a quick tutorial and for information on the properties we encoded as positional and structural attributes, please read the advanced query how-to (see the link in the left-hand sidebar).
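To give a rough idea of what a CQL query looks like, the Python sketch below assembles query strings in the common bracketed form, where each token is described by attribute-value conditions. The helper names `token` and `sequence` are hypothetical, and the attribute names `lemma` and `pos` as well as the tag values are assumptions made for illustration; the attributes actually encoded in EPIC are the ones listed in the advanced query how-to.

```python
def token(**attrs):
    """Build one CQL token expression, e.g. [lemma="say" & pos="V.*"].

    Values are treated as regular expressions by the query engine.
    With no conditions, "[]" matches any single token.
    """
    if not attrs:
        return "[]"
    conditions = " & ".join(f'{name}="{value}"' for name, value in attrs.items())
    return f"[{conditions}]"


def sequence(*tokens):
    """Join token expressions into a query for consecutive tokens."""
    return " ".join(tokens)


if __name__ == "__main__":
    # Assumed attribute names and tag values, for illustration only.
    print(token(lemma="say", pos="V.*"))
    print(sequence(token(pos="JJ"), token(lemma="parliament")))
```

The resulting strings would then be pasted into the query interface; whether a given attribute or tag exists depends on the tagset used for each language.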

For information on how to extract n-gram frequency lists, please read the frequency lists how-to (see the link in the left-hand sidebar).
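As a generic illustration of what an n-gram frequency list is, the sketch below counts n-grams over a list of tokens in plain Python. It is not the procedure described in the how-to, which works on the indexed corpus; the function name and the sample sentence are made up for this example.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every run of n consecutive tokens in the input list."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # zip over n shifted copies of the list yields each n-gram as a tuple.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(ngrams)


if __name__ == "__main__":
    # Invented sample text, already tokenised on whitespace.
    sample = "thank you mr president thank you".split()
    for ngram, count in ngram_frequencies(sample, 2).most_common():
        print(" ".join(ngram), count)
```

Sorting by count, as `most_common()` does, gives the usual frequency-list view with the most frequent sequences first.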

Useful Links: Interpreting at the European Parliament

EPIC material was collected with the support of the EU Multimedia Archive Service.
