Differences
This shows you the differences between two versions of the page.
| corpora:nomadlingo [2025/06/27 13:24] – created eros | corpora:nomadlingo [2026/01/09 09:32] (current) – eros | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| - | ====== NomadLingo 1.0 ====== | + | ====== NomadLingo 1.1 ====== |
| + | ===== General description ===== | ||
| + | |||
| + | NomadLingo is the first publicly available corpus documenting multilingual, | ||
| + | |||
| + | The corpus aims to represent translingual interactions based on the fluid use of English as a lingua franca, other languages and linguae francae such as Spanish and Portuguese, and strategies of transcultural communication like intercomprehension and peer/ | ||
| + | |||
| + | The folder NomadLingo1.1 open contains: | ||
| + | |||
| + | * this readme file; | ||
| + | * a folder with the three annotated versions of the corpus (Annotated_versions_NomadLingo1.1), | ||
| + | * a folder named Naked_NomadLingo1.1 containing files which only include transcribed conversation, | ||
| + | * a folder named MetadataNomadLingo1.1 containing information about speakers and sessions in the corpus in .xls and .csv formats; | ||
| + | * a folder named Annotation_schema_NomadLingo1.1 containing the annotation scheme in .xls and .csv formats; | ||
| + | * the corpus in .csv format. | ||
| + | |||
| + | A second version of the corpus, to which access is restricted for privacy reasons, includes two further folders: Recordings, containing the .wav files, and Privacy Notice and Informed Consent documents, containing legal and ethical documentation. To request access to the full version of the dataset, contact novella.tedesco2@unibo.it. | ||
| + | |||
| + | You can find the project on GitHub (https:// | ||
| + | |||
| + | ===== Information about the research project ===== | ||
| + | |||
| + | This dataset is part of a larger research project named FLO, the European Fluid Languages Observatory, | ||
| + | |||
| + | The general approach to research undertaken in the FLO project is ethnographic and participatory (Tedesco 2025). For this reason, notable features of NomadLingo1.0 include: | ||
| + | * the presence of the main researcher in the data; | ||
| + | * participants actively involved in the organisation of data collection procedures; | ||
| + | * variable audio quality; | ||
| + | * rich contextual information (see Section 7 and Section 8). | ||
| + | |||
| + | ===== Information about data collection, privacy regulation and license ===== | ||
| + | |||
| + | Data were collected with the researcher' | ||
| + | |||
| + | ===== Information about the corpus ===== | ||
| + | |||
| + | Number of tokens: 82,897 | ||
| + | Number of texts: 9 | ||
| + | Number of data collection sessions: 6 | ||
| + | Number of transcribed recordings: 9 | ||
| + | Number of speakers: 45 digital nomads (+2 hostel volunteers) | ||
| + | |||
| + | Each text is a transcription of conversations recorded at some of the events organised to collect data within the FLO project. Complete information about all the project sessions can be found in the file NomadLingo_sessions.csv. Each session is marked with a unique identifier (e.g., M1, M2 etc.). | ||
| + | |||
| + | The corpus is available in five different versions: | ||
| + | * Trans& | ||
| + | * Trans& | ||
| + | * Trans& | ||
| + | * Naked_NomadLingo1.1 contains texts with metadata related only to speakers and sessions, and no annotation of linguistic or pragmatic phenomena (see Sections 7 and 8); | ||
| + | * NomadLingo1.1.csv, | ||
| + | |||
| + | The corpus contains 9 texts, with the following characteristics: | ||
| + | |||
| + | < | ||
| + | M1: | ||
| + | Original audio length: 1h57mins | ||
| + | Transcribed portion: 59mins | ||
| + | Number of tokens: 9,803 | ||
| + | Number of speakers: 7 | ||
| + | |||
| + | M2: | ||
| + | Original audio length: 2h41mins | ||
| + | Transcribed portion: 1h51mins | ||
| + | Number of tokens: 9,543 | ||
| + | Number of speakers: 9 | ||
| + | |||
| + | M7_1: | ||
| + | Original audio length: 40mins | ||
| + | Transcribed portion: 40mins | ||
| + | Number of tokens: 6,321 | ||
| + | Number of speakers: 4 | ||
| + | |||
| + | M7_2: | ||
| + | Original audio length: 1h32mins | ||
| + | Transcribed portion: 1h32mins | ||
| + | Number of tokens: 8,817 | ||
| + | Number of speakers: 11 | ||
| + | |||
| + | M7_3: | ||
| + | Original audio length: 2h21mins | ||
| + | Transcribed portion: 38mins | ||
| + | Number of tokens: 4,023 | ||
| + | Number of speakers: 9 | ||
| + | |||
| + | GC1_1: | ||
| + | Original audio length: 1h47mins | ||
| + | Transcribed portion: 1h47mins | ||
| + | Number of tokens: 13,905 | ||
| + | Number of speakers: 14 | ||
| + | |||
| + | GC1_2: | ||
| + | Original audio length: 1h55mins | ||
| + | Transcribed portion: 1h55mins | ||
| + | Number of tokens: 11,341 | ||
| + | Number of speakers: 9 | ||
| + | |||
| + | F1: | ||
| + | Original audio length: 3h8mins | ||
| + | Transcribed portion: 50mins | ||
| + | Number of tokens: 6,844 | ||
| + | Number of speakers: 8 | ||
| + | |||
| + | F2: | ||
| + | Original audio length: 1h18mins | ||
| + | Transcribed portion: 1h18mins | ||
| + | Number of tokens: 12,300 | ||
| + | Number of speakers: 6 | ||
| + | </ | ||
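
The per-text figures above can be cross-checked against the corpus totals. The following is a minimal sketch that uses only the values listed in this readme; the helper name `to_minutes` and the tuple layout are illustrative choices and are not part of the corpus distribution.

```python
import re

# Figures copied from the per-text list above:
# (text, original audio length, transcribed portion, number of tokens)
TEXTS = [
    ("M1", "1h57mins", "59mins", 9803),
    ("M2", "2h41mins", "1h51mins", 9543),
    ("M7_1", "40mins", "40mins", 6321),
    ("M7_2", "1h32mins", "1h32mins", 8817),
    ("M7_3", "2h21mins", "38mins", 4023),
    ("GC1_1", "1h47mins", "1h47mins", 13905),
    ("GC1_2", "1h55mins", "1h55mins", 11341),
    ("F1", "3h8mins", "50mins", 6844),
    ("F2", "1h18mins", "1h18mins", 12300),
]


def to_minutes(duration: str) -> int:
    """Convert strings such as '1h57mins' or '59mins' to minutes."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)mins)?", duration)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {duration!r}")
    hours, minutes = match.groups(default="0")
    return int(hours) * 60 + int(minutes)


total_tokens = sum(tokens for _, _, _, tokens in TEXTS)
total_audio = sum(to_minutes(audio) for _, audio, _, _ in TEXTS)
total_transcribed = sum(to_minutes(part) for _, _, part, _ in TEXTS)

print(total_tokens)        # 82897, the total reported for the corpus
print(total_audio)         # 1039 minutes of original audio
print(total_transcribed)   # 690 minutes of transcribed audio
```

The token counts of the nine texts add up to the 82,897 tokens reported for the whole corpus; the transcribed portions add up to 11h30mins out of 17h19mins of original audio.
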
| + | |||
| + | ===== Transcription procedure and conventions ===== | ||
| + | |||
| + | The main segmentation criterion is based on discourse units (Mahlberg 2014): the longest possible units that have independent meaning and are understandable even when taken out of context. Specifically, | ||
| + | |||
| + | * generally, one conversational turn corresponds to a single segment; | ||
| + | * a segment can be interrupted by brief segments spoken by other speakers; in such cases, the segment is not split, and overlapping parts are indicated in square brackets, e.g. < | ||
| + | |||
| + | |||
| + | The following transcription conventions (inspired by Jefferson 2004) have been applied: | ||
| + | |||
| + | < | ||
| + | [] = overlapping | ||
| + | : = long sounds | ||
| + | (number) = pause longer than one second | ||
| + | XX = inaudible portion; each X corresponds roughly to one syllable. | ||
| + | ? = ascending tone | ||
| + | # = hypothesis | ||
| + | * = clipped words | ||
| + | ((laughs)) = double round brackets and third-person verbs are used to mark a non-verbal sound attributed to a specific speaker. | ||
| + | ((laughing)) = double round brackets and -ing verbs are used to mark a non-verbal sound attributed to several or unidentified speakers. | ||
| + | Filler words are transcribed as they are heard, except for hesitations (i.e., filled pauses which could sound like uh, mh, ehm). These are all transcribed ' | ||
| + | </ | ||
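
For analyses that need plain running text, some of these marks can be removed automatically. The sketch below is only an illustration: the function name `plain_text` and the sample segment are invented, it handles a subset of the conventions (non-verbal sounds, numbered pauses, overlap brackets and lengthening colons), and it assumes that colons occur in the transcripts only as lengthening marks.

```python
import re


def plain_text(segment: str) -> str:
    """Return a segment with a subset of the transcription marks removed."""
    # ((laughs)) / ((laughing)): non-verbal sounds are dropped entirely.
    text = re.sub(r"\(\([^()]*\)\)", " ", segment)
    # (2) or (1.5): pauses longer than one second are dropped.
    text = re.sub(r"\(\d+(?:[.,]\d+)?\)", " ", text)
    # [overlapping speech]: keep the words, drop the brackets.
    text = re.sub(r"[\[\]]", "", text)
    # so: / so:: : lengthening marks (assumes no other use of colons).
    text = text.replace(":", "")
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


# Invented segment, not taken from the corpus.
sample = "so: I was [in madeira] (2) ((laughs)) yeah"
print(plain_text(sample))  # so I was in madeira yeah
```

The marks for inaudible portions, hypotheses and clipped words are left untouched here, since their exact placement in a segment is best checked against the transcripts themselves.
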
| + | |||
| + | Transcription followed a semi-automatic, double-review procedure: transcripts generated by WhisperAI, run on a local server to protect the data, were revised by three different reviewers. The last reviewer also revised the annotation (see Sections 8 and 9). | ||
| + | |||
| + | ===== Encoding ===== | ||
| + | |||
| + | Texts are encoded in UTF-8 and feature an XML structure. | ||
| + | |||
| + | ===== Metadata ===== | ||
| + | |||
| + | The metadata included in the texts relate to the speaker(s) and the communicative context. The communicative context is described in the tag < | ||
| + | |||
| + | * session code (an identifier attributed to each event, like F2, the second data collection event that took place in Fuerteventura); | ||
| + | * month and year when the data collection took place; | ||
| + | * country (Spain or Portugal), region (Canary Islands or Madeira Islands) and location (such as Corralejo or Las Palmas de Gran Canaria) where the data were collected; | ||
| + | * planning, which refers to whether the nomads had been gathered for a particular event or the conversation was recorded without planning (it should be noted that unplanned data collection suffers from poor data quality, and that even the events referred to as planned are natural events organised by community managers, hostel managers, and in only a few cases by the researcher with the help of other community members); | ||
| + | * context, which can be closed, controlled environment or public, depending on whether it's a private or public space; | ||
| + | * number of speakers; | ||
| + | * interaction which can be completely free or task-based; | ||
| + | * whether a self-introduction by the speakers is present within the text; | ||
| + | * number of different nationalities present at the event; | ||
| + | * all the languages known by all the speakers in the session. | ||
| + | |||
| + | The attributes specified for speakers are: " | ||
| + | Metadata can also be accessed in .csv format. In the folder called ' | ||
| + | |||
| + | The text is included into a < | ||
| + | |||
| + | ===== Annotation schema ===== | ||
| + | |||
| + | The annotated versions of the corpus include xml tags aimed to highlight contextual information, | ||
| + | |||
| + | The XML structure of the files contained in the NomadLingo corpus is outlined below: | ||
| + | |||
| + | <code xml> | ||
| + | <session ...> | ||
| + | < | ||
| + | <turn ...> | ||
| + | < | ||
| + | [transcription in plain text which may contain linguistic annotation] < | ||
| + | </ | ||
| + | </ | ||
| + | <turn ...> | ||
| + | < | ||
| + | < | ||
| + | [transcription in plain text] | ||
| + | </ | ||
| + | </ | ||
| + | </ | ||
| + | < | ||
| + | <turn ...> | ||
| + | < | ||
| + | [transcription in plain text which may contain linguistic annotation] < | ||
| + | </ | ||
| + | </ | ||
| + | </ | ||
| + | < | ||
| + | </ | ||
| + | </ | ||
| + | </ | ||
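
A file with this structure can be read with any XML parser. The following is a minimal sketch using Python's standard library; it relies only on the `session`, `turn` (with its `ID` attribute) and `speaker` (with its `code` attribute) elements that appear in this readme. The sample document, its attribute values and its text are invented, and the function name `turns` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample that mimics the outline above; the real files contain
# further elements and attributes that are not reproduced here.
SAMPLE = """<session>
  <turn ID="1">
    <speaker code="S1">where are you from</speaker>
  </turn>
  <turn ID="2">
    <speaker code="S2">from the north</speaker>
  </turn>
</session>"""


def turns(root):
    """Yield (turn ID, speaker code, text) for every speaker element."""
    # iter() searches at any depth, so intermediate container elements
    # between <session> and <turn> do not need to be known in advance.
    for turn in root.iter("turn"):
        for speaker in turn.iter("speaker"):
            # itertext() also collects text nested inside annotation tags.
            text = " ".join("".join(speaker.itertext()).split())
            yield turn.get("ID"), speaker.get("code"), text


root = ET.fromstring(SAMPLE)  # for a file: ET.parse(path).getroot()
for turn_id, code, text in turns(root):
    print(turn_id, code, text)
```

Because the text is gathered with `itertext()`, words wrapped in linguistic annotation tags remain part of the extracted transcription.
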
| + | |||
| + | The full annotation schema is available in both csv and xls formats in the folder Annotation_schema_NomadLingo1.1. Some examples are provided in the next Section. | ||
| + | |||
| + | ===== Annotation procedure ===== | ||
| + | |||
| + | The revision of both transcriptions and annotation involved the participation of Master' | ||
| + | |||
| + | ===== Linguistic annotation ===== | ||
| + | |||
| + | The corpus features four main linguistic annotation categories. | ||
| + | |||
| + | * Misunderstanding sequences are annotated at turn level. They may contain various turns and other linguistic annotation. The annotation of misunderstanding sequences is inspired by the annotation schema proposed by Cervini & Paone (2025). An example of a misunderstanding sequence is provided below (Example 1). | ||
| + | |||
| + | * Three types of interactional moves are annotated within each turn, at the level of transcription, | ||
| + | |||
| + | <code xml> | ||
| + | Example 1) | ||
| + | < | ||
| + | <turn ID=" | ||
| + | < | ||
| + | Where are you from though in Italy? | ||
| + | </ | ||
| + | </ | ||
| + | <turn ID=" | ||
| + | < | ||
| + | < | ||
| + | </ | ||
| + | </ | ||
| + | <turn ID=" | ||
| + | < | ||
| + | Where are you from though in Italy? | ||
| + | </ | ||
| + | </ | ||
| + | <turn ID=" | ||
| + | < | ||
| + | From the north | ||
| + | </ | ||
| + | </ | ||
| + | </ | ||
| + | |||
| + | Example 2) | ||
| + | <turn ID=" | ||
| + | < | ||
| + | < | ||
| + | </ | ||
| + | </ | ||
| + | <turn ID=" | ||
| + | < | ||
| + | mh mh | ||
| + | </ | ||
| + | </ | ||
| + | |||
| + | Example 3) | ||
| + | <turn ID=" | ||
| + | < | ||
| + | your finance | ||
| + | </ | ||
| + | </ | ||
| + | <turn ID=" | ||
| + | < | ||
| + | no ehm | ||
| + | </ | ||
| + | </ | ||
| + | <turn ID=" | ||
| + | < | ||
| + | income | ||
| + | </ | ||
| + | </ | ||
| + | <turn ID=" | ||
| + | < | ||
| + | < | ||
| + | </ | ||
| + | </ | ||
| + | </ | ||
| + | |||
| + | * Translanguaging phenomena are the third category of linguistic phenomena annotated in the corpus. For the purposes of annotation, translanguaging phenomena are defined as all those instances where translanguaging processes (García and Wei 2014, Wei 2018) become evident in linguistic data. According to translanguaging theories, speakers make use of their semiotic repertoire in a holistic way, depending on their communicative aims as well as on the context where communication takes place. For operational reasons, translanguaging phenomena are categorised into: a) codemixing, | ||
| + | |||
| + | <code xml> | ||
| + | Example 4) | ||
| + | <turn ID=" | ||
| + | < | ||
| + | < | ||
| + | </ | ||
| + | </ | ||
| + | |||
| + | Example 5) | ||
| + | <turn ID=" | ||
| + | < | ||
| + | < | ||
| + | </ | ||
| + | </ | ||
| + | <turn ID=" | ||
| + | < | ||
| + | it's in front of mercadona | ||
| + | </ | ||
| + | </ | ||
| + | |||
| + | Example 6) | ||
| + | <turn ID=" | ||
| + | < | ||
| + | plans change | ||
| + | </ | ||
| + | </ | ||
| + | <turn ID=" | ||
| + | < | ||
| + | < | ||
| + | </ | ||
| + | </ | ||
| + | <turn ID=" | ||
| + | < | ||
| + | < | ||
| + | </ | ||
| + | </ | ||
| + | </ | ||
| + | |||
| + | It should be noted that this English-centric representation has been applied for reasons of convenience, | ||
| + | |||
| + | * The fourth category of phenomena annotated in the corpus is repairs. The term generally refers to modifications made by speakers to their own speech while speaking. In this corpus, two types of repair have been annotated, namely reformulation and restart, both of which cause some kind of interruption in the speech flow. Restarts normally occur at the beginning of a clause, whereas during reformulation speakers modify a part of their speech by rephrasing it. However, it should be noted that the line between these two phenomena is not sharp, and some cases may fall into both categories (Schegloff et al. 1977) (see Examples 7 and 8). | ||
| + | |||
| + | <code xml> | ||
| + | Example 7) | ||
| + | <speaker code=" | ||
| + | I was born in russia but <repair type=" | ||
| + | </ | ||
| + | |||
| + | Example 8) | ||
| + | <speaker code=" | ||
| + | you need to ask for it like in thailand it's like <repair type=" | ||
| + | </ | ||
| + | </ | ||
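
Repair annotations of this kind can be counted per type with a few lines of code. The sketch below is an assumption-laden illustration: it assumes that `repair` is an element wrapping the repaired words and that its `type` attribute takes the values `restart` and `reformulation`, which should be checked against the annotation schema. The sample turns are invented and are not corpus data.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample; the attribute values "restart" and "reformulation"
# are assumed from the description above.
SAMPLE = """<session>
  <turn ID="1">
    <speaker code="S1">I lived <repair type="restart">in in</repair> in lisbon</speaker>
  </turn>
  <turn ID="2">
    <speaker code="S2">it was <repair type="reformulation">cheap I mean affordable</repair></speaker>
  </turn>
</session>"""

root = ET.fromstring(SAMPLE)

# Count repair elements by the value of their type attribute.
repairs_by_type = Counter(repair.get("type") for repair in root.iter("repair"))

for repair_type, count in sorted(repairs_by_type.items()):
    print(repair_type, count)
```

The same pattern applies to the other annotation categories, provided the element and attribute names are taken from the files in Annotation_schema_NomadLingo1.1.
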
| + | |||
| + | ===== License and use ===== | ||
| + | |||
| + | The corpus is available for consultation on NoSketchEngine hosted by the University of Bologna servers (https:// | ||
| + | The transcripts, | ||
| + | |||
| + | ===== References ===== | ||
| + | |||
| + | * Antonini, R., Cirillo, L., Rossato, L. and Torresi, I. (eds.) (2017) Non-professional Interpreting and Translation: | ||
| + | * Bhatia, Tej K. & Ritchie, William C. (Eds.). (2004). The Handbook of Bilingualism. Malden, MA: Blackwell. | ||
| + | * Cervini, C.; Paone, E., Annotazione di dati orali in contesti di interazione plurilingue: | ||
| + | * García, Ofelia & Li Wei (2014). Translanguaging: | ||
| + | * Jefferson, Gail (2004) Glossary of transcript symbols with an Introduction. In G. H. Lerner (Ed.) Conversation Analysis: Studies from the first generation (pp. 13-23). Philadelphia: | ||
| + | * Mahlberg, M. (2014). Corpus linguistics and discourse analysis. In K. P. Schneider & A. Barron (Eds.), Pragmatics of Discourse (pp. 215–238). De Gruyter. | ||
| + | * Schegloff, E., Jefferson, G. and Sacks, H. (1977). The preference for self-correction in the organization of repair in conversation. Language, 53, 361-382. | ||
| + | * Tedesco, N. (2025). Translanguaging in the era of digital nomadism: a sociolinguistic perspective on voluntary mobility in Europe. Discov glob soc 3, 50. https:// | ||
| + | * Tedesco, N., Bernardini S., Cervini, C. (2025). NomadLingo1.0 open [dataset]. http:// | ||
| + | * Wei, L. (2018) Translanguaging as a Practical Theory of Language, Applied Linguistics, | ||
| + | |||
| + | ===== Acknowledgements ===== | ||
| + | |||
| + | Contributors: | ||