corpora:nomadlingo

corpora:nomadlingo [2025/06/27 13:24] – created eros
corpora:nomadlingo [2026/01/09 09:32] (current) eros
 -====== NomadLingo 1.======
 +====== NomadLingo 1.======
  
 +===== General description =====
 +
 +NomadLingo is the first publicly available corpus documenting multilingual, naturally occurring interactions among European digital nomads, a rapidly growing yet understudied transnational community (Tedesco 2025). The corpus contains transcripts of extracts from naturally occurring conversations that were audio-recorded between November 2023 and April 2024 at social events organised and promoted within digital nomad communities based in Madeira and the Canary Islands. The transcribed recordings total 11 hours 38 minutes. For further information about the texts in the corpus, see Section 4. NomadLingo1.1 open is an updated version of NomadLingo1.0 open (Tedesco et al. 2025): two annotation layers have been added (see Section 8) and the transcripts have been further revised.
 +
 +The corpus aims to represent translingual interactions based on the fluid use of English as a lingua franca, other languages and linguae francae such as Spanish and Portuguese, and strategies of transcultural communication like intercomprehension and peer/self-translation. 
 +
 +The folder NomadLingo1.1 open contains:
 +
 +  * this readme file;
 +  * a folder with the three annotated versions of the corpus (Annotated_versions_NomadLingo1.1), including Trans&repair&misunderstanding&interactionAnnotated_NomadLingo1.1, Trans&repairAnnotated_NomadLingo1.1, and TranslanguagingAnnotated_NomadLingo1.1; 
 +  * a folder named Naked_NomadLingo1.1 containing files which only include transcribed conversation, structural and contextual annotation but no linguistic annotation; 
 +  * a folder named MetadataNomadLingo1.1 containing information about speakers and sessions in the corpus in .xls and .csv formats;
 +  * a folder named Annotation_schema_NomadLingo1.1 containing the annotation scheme in .xls and .csv formats;
 +  * the corpus in .csv format. 
 +
 +Another version of the corpus, to which access is restricted for privacy reasons, includes two additional folders: Recordings, which contains the .wav files, and Privacy Notice and Informed Consent documents, which contains the legal and ethical documentation. To request access to the full version of the dataset, contact novella.tedesco2@unibo.it.
 +
 +The project is available on GitHub (https://github.com/novella-tedesco/FLO) and OSF (DOI 10.17605/OSF.IO/UK9NY), where the code for data processing and analysis is shared and updated.
 +
 +===== Information about the research project =====
 +
 +This dataset is part of a larger research project named FLO, the European Fluid Languages Observatory, which investigates the transcultural communicative strategies that characterise informal communication in European communities of digital nomads. The data were collected as part of a Ph.D. research project at the Department of Interpreting and Translation of the University of Bologna by Novella Tedesco, supervised by Prof. Silvia Bernardini and Prof. Cristiana Cervini (https://www.thenomadlinguist.eu/flo/).
 +
 +The general approach to research undertaken in the FLO project is ethnographic and participatory (Tedesco 2025). For this reason, notable features of NomadLingo1.0 include: 
 +  * the presence of the main researcher in the data; 
 +  * participants actively involved in the organisation of data collection procedures; 
 +  * variable audio quality; 
 +  * rich contextual information (see Section 7 and Section 8).
 +
 +===== Information about data collection, privacy regulation and license =====
 +
 +Data were collected with the researcher's personal device and participants' devices, all of which were smartphones. The software Mootiv Audio was used to capture audio in .wav format. All participants were asked to fill in a questionnaire and the informed consent form, and to sign the privacy policy. The transcripts are fully pseudonymised. The audio files are restricted because they contain sensitive and personal data. To request access to the audio files, contact Novella Tedesco (novella.tedesco2@unibo.it).
 +
 +===== Information about the corpus =====
 +
 +  * Number of tokens: 82,897
 +  * Number of texts: 9
 +  * Number of data collection sessions: 6
 +  * Number of transcribed recordings: 9
 +  * Number of speakers: 45 digital nomads (+2 hostel volunteers)
 +
 +Each text is a transcription of conversations recorded at some of the events organised to collect data within the FLO project. Complete information about all the project sessions can be found in the file NomadLingo_sessions.csv. Each session is marked with a unique identifier (e.g., M1, M2 etc.).
 +
 +The corpus is available in five different versions: 
 +  * Trans&repair&misunderstanding&interactionAnnotated_NomadLingo1.1 contains texts and annotation of misunderstanding sequences, interactional moves, translanguaging practices and repair strategies (see Sections 8 and 9); 
 +  * Trans&repairAnnotated_NomadLingo1.1 features texts and annotation of translanguaging practices and repair strategies (see Sections 8 and 9);
 +  * TranslanguagingAnnotated_NomadLingo1.1 includes texts where only translanguaging phenomena are annotated (see Sections 8 and 9); 
 +  * Naked_NomadLingo1.1 contains texts with metadata related to speakers and sessions only, and no annotation of linguistic or pragmatic phenomena (see Sections 7 and 8); 
 +  * NomadLingo1.1.csv, a single file which holds all metadata and annotation in columns separate from the transcripts. The transcripts without any type of annotation can therefore be extracted by copying the content of column 10 into a .txt file. 
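
The extraction described in the last item above can also be scripted. What follows is a minimal Python sketch, not part of the corpus distribution. It assumes that the file is comma-delimited and UTF-8 encoded, that it has one header row, and that "column 10" is counted from 1. The function name `extract_transcripts` and the output file name in the usage note are hypothetical.

```python
import csv


def extract_transcripts(csv_path, out_path, column=10, has_header=True):
    """Write the text of one CSV column (1-based) to a plain-text file.

    Assumptions (not confirmed by the readme): comma delimiter,
    UTF-8 encoding, one header row, column numbering starting at 1.
    Returns the number of transcript lines written.
    """
    index = column - 1  # convert the 1-based column number to a 0-based index
    written = 0
    with open(csv_path, newline="", encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        reader = csv.reader(src)
        if has_header:
            next(reader, None)  # skip the header row if present
        for row in reader:
            # Skip rows that are too short or whose transcript cell is empty
            if len(row) > index and row[index].strip():
                dst.write(row[index].strip() + "\n")
                written += 1
    return written
```

A call such as `extract_transcripts("NomadLingo1.1.csv", "NomadLingo1.1_transcripts.txt")` would then produce one transcript segment per line; if the delimiter or column position differs in the actual file, the parameters need adjusting.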
 +
 +The corpus contains 9 texts, with the following characteristics: 
 +
 +<code>
 +M1:
 + Original audio length: 1h57mins
 + Transcribed portion: 59mins
 + Number of tokens: 9,803
 + Number of speakers: 7
 +
 +M2: 
 + Original audio length: 2h41mins
 + Transcribed portion: 1h51mins
 + Number of tokens: 9,543 
 + Number of speakers: 9
 +
 +M7_1:
 + Original audio length: 40mins
 + Transcribed portion: 40mins
 + Number of tokens: 6,321
 + Number of speakers: 4
 +
 +M7_2:
 + Original audio length: 1h32mins
 + Transcribed portion: 1h32mins
 + Number of tokens: 8,817
 + Number of speakers: 11
 +
 +M7_3:
 + Original audio length: 2h21mins
 + Transcribed portion: 38mins
 + Number of tokens: 4,023
 + Number of speakers: 9
 +
 +GC1_1:
 + Original audio length: 1h47mins
 + Transcribed portion: 1h47mins
 + Number of tokens: 13,905
 + Number of speakers: 14 
 +
 +GC1_2:
 + Original audio length: 1h55mins
 + Transcribed portion: 1h55mins
 + Number of tokens: 11,341
 + Number of speakers: 9
 +
 +F1:
 + Original audio length: 3h8mins
 + Transcribed portion: 50mins
 + Number of tokens: 6,844
 + Number of speakers: 8
 +
 +F2:
 + Original audio length: 1h18mins
 + Transcribed portion: 1h18mins
 + Number of tokens: 12,300
 + Number of speakers: 6
 +</code>
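
The per-text figures above can be cross-checked against the corpus totals. The short Python sketch below copies the transcribed portions and token counts from the listing; the helper `to_minutes` and the dictionary `TEXTS` are illustrative names, not part of the corpus tooling.

```python
import re

# Per-text figures copied from the listing above: (transcribed portion, tokens)
TEXTS = {
    "M1": ("59mins", 9803),
    "M2": ("1h51mins", 9543),
    "M7_1": ("40mins", 6321),
    "M7_2": ("1h32mins", 8817),
    "M7_3": ("38mins", 4023),
    "GC1_1": ("1h47mins", 13905),
    "GC1_2": ("1h55mins", 11341),
    "F1": ("50mins", 6844),
    "F2": ("1h18mins", 12300),
}


def to_minutes(duration):
    """Convert strings such as '1h57mins' or '40mins' to whole minutes."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)mins)?", duration)
    if not match or not duration:
        raise ValueError(f"unrecognised duration: {duration!r}")
    hours, minutes = match.groups()
    return int(hours or 0) * 60 + int(minutes or 0)


total_tokens = sum(tokens for _, tokens in TEXTS.values())
total_minutes = sum(to_minutes(d) for d, _ in TEXTS.values())

print(total_tokens)               # 82897
print(divmod(total_minutes, 60))  # (11, 30)
```

The token total matches the 82,897 tokens reported for the corpus. The transcribed portions, listed here to the minute, add up to 11 hours 30 minutes, slightly below the 11 hours 38 minutes given in the general description, presumably because the per-text values are rounded.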
 +
 +===== Transcription procedure and conventions =====
 +
 +The main segmentation criterion is based on discourse units (Mahlberg 2014): the longest possible units that have independent meaning and are understandable even when taken out of context. Specifically, the following criteria are applied flexibly depending on the case:
 +
 +  * generally, one conversational turn corresponds to a single segment;
 +  * a segment can be interrupted by brief segments spoken by other speakers; in such cases, the segment is not split, and overlapping parts are indicated in square brackets, e.g. <code>[Yeah it's wow it's just amazing] I have to say.// [Yeah yeah]</code>
 +
 +
 +The following transcription conventions (inspired by Jefferson 2004) have been applied:
 +
 +<code>
 +[] = overlapping
 +: = long sounds 
 +(number) = pause longer than one second
 +XX = inaudible portion; each X corresponds roughly to one syllable
 +? = ascending tone
 +# = hypothesis
 +  * = clipped words
 +((laughs)) = double round brackets and third-person verbs mark a non-verbal sound attributed to a specific speaker.
 +((laughing)) = double round brackets and -ing verbs mark a non-verbal sound attributed to several or unidentified speakers.
 +Filler words are transcribed as they are heard, except for hesitations (i.e., filled pauses which could sound like uh, mh, ehm). These are all transcribed as 'ehm'.
 +</code>
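
For word-level processing it can be useful to strip these symbols from a segment. The Python sketch below is one possible reading of the conventions, not an official tool: it assumes that `((...))` always encloses a non-verbal sound, that a number in round brackets is a timed pause, and that `//` separates overlapping contributions as in the example above. The function name `strip_conventions` is hypothetical. The markers XX, ?, # and the clipped-word symbol are left untouched.

```python
import re


def strip_conventions(segment):
    """Return a plain-text version of a transcribed segment.

    Assumed reading of the conventions: ((...)) marks non-verbal
    sounds, (3) marks a timed pause, [ ] wrap overlapping speech,
    ':' marks a lengthened sound and '//' separates overlapping speakers.
    """
    text = re.sub(r"\(\([^)]*\)\)", " ", segment)   # ((laughs)), ((laughing))
    text = re.sub(r"\(\d+(?:\.\d+)?\)", " ", text)  # (2) = pause in seconds
    text = text.replace("//", " ")                  # overlap separator
    text = re.sub(r"[\[\]]", "", text)              # keep overlapped words, drop brackets
    text = re.sub(r"(?<=\w):+", "", text)           # so: -> so
    return re.sub(r"\s+", " ", text).strip()
```

Applied to the segmentation example above, the function returns `Yeah it's wow it's just amazing I have to say. Yeah yeah`. Note that the colon rule would also alter clock times such as 10:30, so it should be adapted if such strings occur in the data.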
 + 
 +Transcription followed a semi-automatic, double-review procedure: transcripts generated by WhisperAI, run on a local server to protect the data, were revised by three different revisers. The last reviser also revised the annotation (see Sections 8 and 9). 
 +
 +===== Encoding =====
 +
 +Texts are encoded in UTF-8 and feature an XML structure. 
 +
 +===== Metadata =====
 +
 +The metadata included in the texts relate to the speaker(s) and the communicative context. The communicative context is described in the tag <session>, which contains information related to: 
 +
 +  * session code (an identifier attributed to each event, like F2, the second data collection event that took place in Fuerteventura); 
 +  * month and year when the data collection took place; 
 +  * country (Spain or Portugal), region (Canary Islands or Madeira Islands) and location (such as Corralejo or Las Palmas de Gran Canaria) where the data were collected;
 +  * planning, which refers to whether the nomads had been gathered for a particular event or the conversation was recorded without planning (it should be noted that unplanned data collection suffers from poor data quality, and that even the events referred to as planned are natural events organised by community managers, hostel managers, and only in a few cases by the researcher with the help of other community members);
 +  * context, which can be closed, controlled environment or public, depending on whether the space is private or public; 
 +  * number of speakers; 
 +  * interaction, which can be completely free or task-based; 
 +  * whether a self-introduction by the speakers is present within the text; 
 +  * number of different nationalities present at the event;
 +  * all the languages known by all the speakers in the session. 
 +
 +The attributes specified for speakers are "code", "nationality", and "language" (where language refers to all the languages that speakers declared they know). For more information about the speakers, see the file NomadLingo11_participants.csv. 
 +Metadata can also be accessed in .csv format. The folder called 'Metadata' contains information related to the speakers and sessions in NomadLingo1.1. The last column of the files FLO_sessions.csv and NomadLingo1.0_sessions.csv provides a detailed description of each event. 
 +
 +The text is enclosed in a <conversation> tag, and each turn is marked by a unique ID. Turn length in tokens is also specified inside the <turn> tag. 
 +
 +===== Annotation schema =====
 +
 +The annotated versions of the corpus include XML tags that mark contextual information, conversation structure, misunderstanding sequences, translanguaging phenomena and repair strategies. 
 +
 +The XML structure of the files contained in the NomadLingo corpus is outlined below:
 +
 +<code xml>
 +<session ...>
 +  <conversation>
 +    <turn ...>
 +      <speaker ...>
 +        [transcription in plain text which may contain linguistic annotation] <translanguaging type="[codemixing|codeswitching|translation]"> </translanguaging> <repair type="[reformulation|restart]"> </repair>
 +      </speaker>
 +    </turn>
 +    <turn ...>
 +      <intro>
 +        <speaker ...>
 +          [transcription in plain text]
 +        </speaker>
 +      </intro>
 +    </turn>
 +    <misunderstanding type="[solved|unsolved]">
 +      <turn ...>
 +        <speaker ...>
 +          [transcription in plain text which may contain linguistic annotation] <interactional_move type="[asking|checking|confirming]"> </interactional_move>
 +        </speaker>
 +      </turn>
 +    </misunderstanding>
 +    <sound> [description of non-verbal sounds in plain text] </sound>
 +  </conversation>
 +</session>
 +</code>
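
A file with this structure can be read with any XML parser. The Python sketch below uses the standard library and a hypothetical miniature file built from two turns of the corpus; the `<session>` attributes are omitted because their names are not listed in the outline, and the helper `iter_turns` is an illustrative name. It assumes that the distributed files are well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature file following the outline above; the
# <session> attributes are omitted because they are not listed here.
SAMPLE = """
<session>
  <conversation>
    <turn ID="GC1_2_0634" Length="2">
      <speaker code="AR_C_02" nationality="Argentinian" language="Spanish,English,Argentinian">
        <translanguaging type="codeswitching">donde queda?</translanguaging>
      </speaker>
    </turn>
    <turn ID="GC1_2_0635" Length="6">
      <speaker code="ES_C_04" nationality="Spanish" language="Spanish,English,French">
        it's in front of mercadona
      </speaker>
    </turn>
  </conversation>
</session>
"""


def iter_turns(root):
    """Yield (turn ID, declared length, speaker code, plain text) per turn."""
    for turn in root.iter("turn"):
        speaker = turn.find(".//speaker")  # may be nested inside <intro>
        if speaker is None:
            continue
        # itertext() flattens nested annotation tags into plain text
        text = " ".join("".join(speaker.itertext()).split())
        yield turn.get("ID"), int(turn.get("Length")), speaker.get("code"), text


root = ET.fromstring(SAMPLE)
for row in iter_turns(root):
    print(row)
```

The declared `Length` value is read as given rather than recomputed, since the tokenisation rules behind it are not described in this readme.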
 +
 +The full annotation schema is available in both .csv and .xls formats in the folder Annotation_schema_NomadLingo1.1. Some examples are provided in the Linguistic annotation section below. 
 +
 +===== Annotation procedure =====
 +
 +The revision of both transcriptions and annotation involved Master's students in Specialised Translation at the University of Bologna (Alessandro Mongardini, Eva Zaccariotto, Esther Cocco, Eleonora Castaldo) and, in some cases, study participants who remain anonymous for privacy reasons. For advice or questions, contact novella.tedesco2@unibo.it. 
 +
 +===== Linguistic annotation =====
 +
 +The corpus features four main linguistic annotation categories. 
 +
 +  * Misunderstanding sequences are annotated at turn level. They may span several turns and contain other linguistic annotation. Their annotation is inspired by the schema proposed by Cervini & Paone (2025). An example of a misunderstanding sequence is provided below (Example 1). 
 +
 +  * Three types of interactional moves are annotated within each turn, at the level of transcription, i.e. the tag encloses the transcribed portion to which it refers: a) 'asking', when speakers ask their interlocutor for clarification (see Example 1); b) 'checking', when speakers check their interlocutors' understanding (see Example 2); c) 'confirming', when a speaker confirms that their interlocutors have understood correctly (see Example 3). Examples are provided below:
 +
 +<code xml>
 +Example 1) 
 +<misunderstanding type="solved">
 + <turn ID="M7_3_0097" Length="7">
 + <speaker code="PT_M_02" nationality="Portuguese" language="Portuguese,Spanish,English,German">
 + Where are you from though in Italy?
 + </speaker>
 + </turn>
 + <turn ID="M7_3_0098" Length="1">
 + <speaker code="IT_MC_01" nationality="Italian" language="Italian,English,Spanish">
 + <interactional_move type="asking"> what? </interactional_move>
 + </speaker>
 + </turn>
 + <turn ID="M7_3_0099" Length="7">
 + <speaker code="PT_M_02" nationality="Portuguese" language="Portuguese,Spanish,English,German">
 + Where are you from though in Italy?
 + </speaker>
 + </turn>
 + <turn ID="M7_3_0100" Length="3">
 + <speaker code="IT_MC_01" nationality="Italian" language="Italian,English,Spanish">
 + From the north
 + </speaker>
 + </turn>
 +</misunderstanding>
 +
 +Example 2) 
 +<turn ID="M7_2_0008" Length="1">
 + <speaker code="Researcher_IT" nationality="Italian" language="Italian,English,Neapolitan,Spanish,German,French">
 + <interactional_move type="checking"> ok? </interactional_move>
 + </speaker>
 +</turn>
 +<turn ID="M7_2_0009" Length="2">
 + <speaker code="GE_M_04" nationality="German" language="German,English">
 + mh mh
 + </speaker>
 +</turn>
 +
 +Example 3) 
 +<turn ID="F2_0543" Length="2">
 + <speaker code="ES_C_01" nationality="Spanish" language="English,French,Spanish">
 + your finance
 + </speaker>
 +</turn>
 +<turn ID="F2_0544" Length="2">
 + <speaker code="IT_MC_01" nationality="Italian" language="Italian,English,Spanish">
 + no ehm
 + </speaker>
 +</turn>
 +<turn ID="F2_0545" Length="1">
 + <speaker code="GE_C_01" nationality="German" language="German,English,Spanish">
 + income
 + </speaker>
 +</turn>
 +<turn ID="F2_0546" Length="29">
 + <speaker code="IT_MC_01" nationality="Italian" language="Italian,English,Spanish">
 + <interactional_move type="confirming"> income exactly </interactional_move> depend of your income it's like if you grow like 30000 you pay seventy percent if you're like fifty of your income is like twenty-one
 + </speaker>
 +</turn>
 +</code>
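
Sequences like the one in Example 1 can be summarised programmatically. The following Python sketch runs on an abridged copy of Example 1 (speaker attributes shortened, a `<conversation>` root added so that the fragment parses); the function name `summarise_misunderstandings` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abridged copy of Example 1, wrapped in a <conversation> root for parsing
SAMPLE = """
<conversation>
  <misunderstanding type="solved">
    <turn ID="M7_3_0097" Length="7">
      <speaker code="PT_M_02">Where are you from though in Italy?</speaker>
    </turn>
    <turn ID="M7_3_0098" Length="1">
      <speaker code="IT_MC_01"><interactional_move type="asking"> what? </interactional_move></speaker>
    </turn>
  </misunderstanding>
</conversation>
"""


def summarise_misunderstandings(root):
    """Return one dict per <misunderstanding>: outcome, turn IDs and moves."""
    summaries = []
    for seq in root.iter("misunderstanding"):
        summaries.append({
            "outcome": seq.get("type"),
            "turns": [t.get("ID") for t in seq.iter("turn")],
            "moves": [(m.get("type"), (m.text or "").strip())
                      for m in seq.iter("interactional_move")],
        })
    return summaries


print(summarise_misunderstandings(ET.fromstring(SAMPLE)))
```

Each dictionary records whether the sequence was solved, which turns it spans and which interactional moves occur inside it, which is a convenient starting point for counting move types per outcome.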
 +
 +  * Translanguaging phenomena are the third category of linguistic phenomena annotated in the corpus. For the purposes of annotation, translanguaging phenomena are all those instances where translanguaging processes (García and Wei 2014, Wei 2018) become evident in linguistic data. According to translanguaging theories, speakers draw on their semiotic repertoire holistically, depending on their communicative aims and on the context in which communication takes place. For operational reasons, translanguaging phenomena are categorised into: a) codemixing, when a segment contains words in any language or variety other than standard English (see Example 4); b) codeswitching, when a whole segment is in a language or variety other than standard English (Bhatia & Ritchie 2004) (see Example 5); c) translation, when a segment is a self-translation or an instance of peer-interpreting (Antonini et al. 2017) (see Example 6).
 +
 +<code xml>
 +Example 4) 
 +<turn ID="M7_2_0082" Length="16">
 + <speaker code="RS_M_01" nationality="Serbian" language="English,Portuguese,Serbian,Italian">
 + <translanguaging type="codemixing">do you have glasses for the vino? This is called the vino no?</translanguaging> 
 + </speaker>
 +</turn>
 +
 +Example 5)
 +<turn ID="GC1_2_0634" Length="2">
 + <speaker code="AR_C_02" nationality="Argentinian" language="Spanish,English,Argentinian">
 + <translanguaging type="codeswitching">donde queda?</translanguaging>
 + </speaker>
 +</turn>
 +<turn ID="GC1_2_0635" Length="6">
 + <speaker code="ES_C_04" nationality="Spanish" language="Spanish,English,French">
 + it's in front of mercadona
 + </speaker>
 +</turn>
 +
 +Example 6)
 +<turn ID="M7_2_0254" Length="2">
 + <speaker code="GE_M_04" nationality="German" language="German,English">
 + plans change
 + </speaker>
 +</turn>
 +<turn ID="M7_2_0255" Length="2">
 + <speaker code="NO_M_01" nationality="Norwegian" language="Norwegian,English,Portuguese">
 + <interactional_move type="asking">  blends strange?</interactional_move>
 + </speaker>
 +</turn>
 +<turn ID="M7_2_0256" Length="4">
 + <speaker code="GE_M_04" nationality="German" language="German,English">
 + <translanguaging type="translation">pläne ändern sich</translanguaging>
 + </speaker>
 +</turn>
 +</code>
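
The three translanguaging types can be counted directly from the `type` attribute. The Python sketch below uses abridged turns from Examples 4 to 6, wrapped in a `<conversation>` root so that the fragment parses; `count_translanguaging` is an illustrative name.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Abridged turns from Examples 4-6, wrapped in a <conversation> root
SAMPLE = """
<conversation>
  <turn ID="M7_2_0082"><speaker code="RS_M_01">
    <translanguaging type="codemixing">do you have glasses for the vino?</translanguaging>
  </speaker></turn>
  <turn ID="GC1_2_0634"><speaker code="AR_C_02">
    <translanguaging type="codeswitching">donde queda?</translanguaging>
  </speaker></turn>
  <turn ID="M7_2_0256"><speaker code="GE_M_04">
    <translanguaging type="translation">pläne ändern sich</translanguaging>
  </speaker></turn>
</conversation>
"""


def count_translanguaging(root):
    """Count <translanguaging> elements by their type attribute."""
    return Counter(el.get("type") for el in root.iter("translanguaging"))


counts = count_translanguaging(ET.fromstring(SAMPLE))
print(counts["codemixing"], counts["codeswitching"], counts["translation"])  # 1 1 1
```

The same pattern works for the other annotation tags by changing the element name passed to `iter`.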
 +
 +It should be noted that this English-centric representation has been adopted for convenience, given that English is the officially declared language of many digital nomad communities (Tedesco 2025) and that the majority of the segments are in English. It should also be noted that the distinction among types of translanguaging does not reflect a theoretical position and has been applied for analytical reasons only. 
 +
 +  * The fourth category of phenomena annotated in the corpus is repairs. The term generally refers to modifications that speakers make to their own speech while speaking. In this corpus, two types of repair have been annotated, namely reformulation and restart, both of which interrupt the flow of speech in some way. Restarts normally occur at the beginning of a clause, whereas in reformulations speakers modify part of their speech by rephrasing it. However, it should be noted that the line between these two phenomena is not sharp, and some cases may fall into both categories (Schegloff et al. 1977) (see Examples 7 and 8).
 +
 +<code xml>
 +Example 7) 
 +<speaker code="GE_M_03" nationality="German" language="German,English,Spanish,Russian,Portuguese">
 + I was born in russia but <repair type="restart">i was I descended</repair> of germans and then I went back to germany and so: so it's it's weird to say half German half Russian because-
 +</speaker>
 +
 +Example 8)
 +<speaker code="IT_MC_01" nationality="Italian" language="Italian,English,Spanish">
 + you need to ask for it like in thailand it's like <repair type="reformulation"> everyone every place </repair>
 +</speaker>
 +</code>
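
Repair spans can be retrieved in the same way as the other inline annotations. The Python sketch below parses Example 8 as a standalone fragment; the function name `list_repairs` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Example 8, used here as a self-contained input
SAMPLE = (
    '<speaker code="IT_MC_01" nationality="Italian" '
    'language="Italian,English,Spanish">'
    "you need to ask for it like in thailand it's like "
    '<repair type="reformulation"> everyone every place </repair>'
    "</speaker>"
)


def list_repairs(speaker):
    """Return (repair type, repaired words) pairs for one <speaker> element."""
    return [(r.get("type"), "".join(r.itertext()).split())
            for r in speaker.iter("repair")]


print(list_repairs(ET.fromstring(SAMPLE)))
# [('reformulation', ['everyone', 'every', 'place'])]
```

Splitting the repaired span on whitespace is only a rough word count and does not necessarily match the tokenisation used for the `Length` attribute.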
 +
 +===== License and use =====
 +
 +The corpus is available for consultation on NoSketchEngine, hosted on University of Bologna servers (https://bellatrix.sslmit.unibo.it/noske/public/#open).
 +The transcripts, metadata and annotation scheme are released under a Creative Commons licence, but the recordings are closed to the public for privacy reasons. They can, however, be accessed on request, together with the legal documentation. For any queries, contact novella.tedesco2@unibo.it.
 +
 +===== References =====
 +
 +  * Antonini, R., Cirillo, L., Rossato, L. and Torresi, I. (eds.) (2017) Non-professional Interpreting and Translation: State of the Art and Future of an Emerging Field of Research. Amsterdam: John Benjamins Publishing Company.
 +  * Bhatia, Tej K. & Ritchie, William C. (Eds.). (2004). The Handbook of Bilingualism. Malden, MA: Blackwell.
 +  * Cervini, C.; Paone, E., Annotazione di dati orali in contesti di interazione plurilingue: insegnare l’intercomprensione a studenti di Scienze mediche veterinarie, in: La comunicazione parlata 4 - Spoken communication. I venti anni del GSCP., Roma, Aracne, 2025, pp. 53 - 65 (La comunicazione parlata)
 +  * García, Ofelia & Li Wei (2014). Translanguaging: Language, Bilingualism and Education. London: Palgrave Pivot.
 +  * Jefferson, Gail (2004) Glossary of transcript symbols with an Introduction. In G. H. Lerner (Ed.) Conversation Analysis: Studies from the first generation (pp. 13-23). Philadelphia: John Benjamins.
 +  * Mahlberg, M. (2014). Corpus linguistics and discourse analysis. In K. P. Schneider & A. Barron (Eds.), Pragmatics of Discourse (pp. 215–238). De Gruyter.
 +  * Schegloff, E., Jefferson, G. and Sacks, H. (1977). The preference for self-correction in the organization of repair in conversation. Language, 53, 361-382.
 +  * Tedesco, N. (2025). Translanguaging in the era of digital nomadism: a sociolinguistic perspective on voluntary mobility in Europe. Discov glob soc 3, 50. https://doi.org/10.1007/s44282-025-00191-8.
 +  * Tedesco, N., Bernardini S., Cervini, C. (2025). NomadLingo1.0 open [dataset]. http://hdl.handle.net/20.500.11752/OPEN-1042 
 +  * Wei, L. (2018) Translanguaging as a Practical Theory of Language, Applied Linguistics, Volume 39, Issue 1, Pages 9–30, https://doi.org/10.1093/applin/amx039
 +
 +===== Acknowledgements =====
 +
 +Contributors: Silvia Bernardini, Cristiana Cervini, Alessandro Mongardini, Eva Zaccariotto, Esther Cocco, Eleonora Castaldo, digital nomad communities based in: Porto Santo, Funchal, Ponta do Sol, Las Palmas de Gran Canaria, Maspalomas, Corralejo, Lajares, El Cotillo. 
  
  • corpora/nomadlingo.txt
  • Last modified: 2026/01/09 09:32
  • by eros