9:30 - 11:30 |
Eckhard Bick (University of Southern Denmark)
The Grammatical Annotation of Speech Corpora: Techniques and Perspectives
This talk discusses the grammatical annotation of speech corpora on the one hand (C-ORAL-Brasil, NURC) and speech-like text on the other (e-mail, chat, tv-news, parliamentary discussions), drawing on Portuguese data for the former and English data for the latter. We try to identify and compare linguistic markers for speechlikeness ("orality") in different genres, and argue that broad-coverage Constraint Grammar parsers such as PALAVRAS and EngGram can be adapted to these features, and used across the text-speech divide. Special topics include phonetic variation, emoticons and syntactic features. For ordinary, transcribed speech corpora we propose a system of two-level annotation, where overlaps, retractions and phonetic variation are maintained as meta-tagging, while allowing conventional annotation of an orthographically normalized textual layer. In the absence of punctuation, syntactic segmentation can be achieved by exploiting prosodic breaks as delimiters in parsing rules. With the exception of chat data, the modified "oral" CG parsers perform reasonably close to their written language counterparts, even for true transcribed speech, achieving accuracy rates (F-scores) above 98% for PoS tags and 93-95% for syntactic function.
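As a rough illustration of using prosodic breaks as segmentation delimiters, the sketch below splits an unpunctuated token stream into utterance-like segments. It is plain preprocessing in Python, not Constraint Grammar rule syntax, and it does not reproduce the PALAVRAS or EngGram setup. The break markers ("//" for a terminal break, "/" for a non-terminal one), the function name and the sample utterance are all assumptions made for this example.

```python
from typing import Iterable, List

# Assumed break markers (hypothetical for this sketch): "//" closes an
# utterance, "/" marks a weaker, non-terminal prosodic break.
TERMINAL_BREAK = "//"
NON_TERMINAL_BREAK = "/"


def segment_on_breaks(tokens: Iterable[str]) -> List[List[str]]:
    """Split a token stream into segments at terminal prosodic breaks.

    Non-terminal breaks stay inside the segment, so later processing
    could still treat them as weaker boundary cues.
    """
    segments: List[List[str]] = []
    current: List[str] = []
    for token in tokens:
        if token == TERMINAL_BREAK:
            if current:  # ignore empty segments (e.g. doubled breaks)
                segments.append(current)
                current = []
        else:
            current.append(token)
    if current:  # flush a trailing segment that has no closing break
        segments.append(current)
    return segments


# Invented sample utterance, for illustration only.
transcript = "eu acho que sim / mas não sei // vamos ver //".split()
segments = segment_on_breaks(transcript)
```

Each resulting segment can then be passed to a parser as if it were a punctuation-delimited sentence.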
|
15:30 - 17:30 |
Inês Duarte & Ana Isabel Mata (Universidade de Lisboa)
Exploring European Portuguese Spontaneous Speech: Prosodic, Syntactic and Pragmatic Annotation Guidelines across Domains and Corpora
Studies on prosody-syntax-discourse interface relations based on naturally occurring speech are gaining growing interest. Corpora annotated with all these levels of linguistic information are not very common (e.g., Calhoun et al., 2010 and references therein).
The COPAS project team selected a balanced corpus of European Portuguese (with respect to discourse types and speakers' gender and age), isolated utterances illustrating different contrast and parallel structures, drew up annotation guidelines, and applied them to the subset of utterances referred to hereafter as the COPAS corpus.
The COPAS corpus includes (i) a subset of the CPE-FACES corpus (Mata 1999; Mata et al., 2014), 16h of recorded spontaneous and prepared unscripted speech collected in high schools from 3 teachers and 25 teenage students (of both genders), all speakers of Standard European Portuguese (Lisbon); and (ii) a subset of the CORAL corpus (Viana et al. 1998; Trancoso et al. 1998), 9h of spoken dialogue following the main guidelines of the HCRC Map Task Corpus – 64 dialogues between 32 young-adult speakers (of both genders) from the Lisbon region.
The annotation guidelines were defined by a multidisciplinary team. After the forced alignment of data (phone, syllable, word), four groups of annotation tiers were associated with the speech signal: (i) an orthographic tier, enriched with punctuation marks, disfluencies, and paralinguistic events; (ii) two prosodic tiers (following the ToBI framework), one for tones and the other for break indices; (iii) three syntactic tiers, for construction type, for construction position in the structure and for syntactic function; and (iv) a discourse tier, for the discourse function of the target constituents. The manual annotation was performed using Praat (Boersma & Weenink, 2013). A database was built, with all these tiers time-aligned with the target structures.
In this presentation, we will focus on the multi-level annotation process of left periphery structures. In the development of the project, the analysis of this subset of target structures, coded for syntactic features, discourse function, prosodic prominence and phrasing in a time-aligned way, provided fundamental training and testing material for further application to other target structures (namely, clefts and right periphery constituents). In addition, there was time to thoroughly review the annotation and to explore the correlations shown by the statistical analysis.
As we will try to show, the first exploration of the results obtained for the left periphery strengthens our belief that speech corpora with multi-level annotation are a valuable resource for looking into grammar module relations in language use from an integrated viewpoint.
References
- Boersma, P., & Weenink, D. (2013). Praat: doing phonetics by computer [Computer program]. Version 5.3.56, retrieved 15 September 2013 from http://www.praat.org
- Calhoun, S., Carletta, J., Brenier, J. M., Mayo, N., Jurafsky, D., Steedman, M., & Beaver, S. (2010). The NXT-format Switchboard Corpus: a rich resource for investigating the syntax, semantics, pragmatics and prosody of dialogue. Language Resources & Evaluation, 44, pp. 387–419.
- Duarte, I., et al. (2013). Left Periphery: the (mainly) syntactic part of the annotation. “First Workshop of COPAS”, Lisbon, May.
- Mata, A. I. (1999). Para o Estudo da Entoação em Fala Espontânea e Preparada no Português Europeu: Metodologia, Resultados e Implicações Didácticas. PhD Thesis, University of Lisbon.
- Mata, A. I., et al. (2014). "Teenage and adult speech in school context: building and processing a corpus of European Portuguese". In Proceedings of 9th International Conference on Language Resources and Evaluation, LREC 2014, Reykjavik, Iceland: 3914-3919.
- Mata, A. I., et al. (2014). "Prosodic, syntactic, semantic guidelines for topic structures across domains and corpora". In Proceedings of 9th International Conference on Language Resources and Evaluation, LREC 2014, Reykjavik, Iceland: 1188-1193.
- Mata, A. I. (2015). "Prosodic Cues to Topic Status in Spontaneous Speech". Workshop “Prosody-syntax-semantics interfaces in Portuguese: exploring spontaneous speech corpora”, Lisbon, July.
- Trancoso, I., et al. (1998). "Corpus de diálogo CORAL". In PROPOR’98, Porto Alegre, Brazil.
- Viana, M. C., et al. (1998). "Apresentação do Projecto CORAL - Corpus de Diálogo Etiquetado". In Workshop I de Linguística Computacional, Lisboa, Portugal.
- Viana, C., Frota, S., Falé, I., Fernandes, F., Mascarenhas, I., Mata, A. I., Moniz, H. & Vigário, M. (2007). "Towards a P_ToBI". PAPI2007. Workshop on the Transcription of Intonation in Ibero-Romance. University of Minho, Portugal.
|
15:30 - 17:30 |
Brian Clancy (University of Limerick)
Using Spoken Corpora to Investigate Pragmatic Variation
One of the major contributions to current linguistic knowledge derived from corpora has been the insight spoken corpora have afforded into the nuances and particulars of inter- and intra-varietal variation. This workshop will focus on the potential of spoken corpora for unearthing linguistic patterns that characterise pragmatic variation at both of these levels. Comparing spoken corpora affords insights into not only the lexico-grammatical features present, but also into the nature of different pragmatic systems (e.g. Barron and Schneider, 2005; Schneider and Barron, 2008). A highly iterative corpus pragmatic approach will be taken in order to focus on similarities and differences amongst varieties of English and also amongst different contexts within these varieties in which pragmatic forms and functions interact.
The workshop will primarily utilise the Limerick Corpus of Irish English (LCIE), a one-million-word corpus of spoken Irish English collected from a number of different context-types in the Republic of Ireland. Data taken from LCIE will be complemented by insights from the British National Corpus (BNC) and the Corpus of Contemporary American English (COCA).
The starting point for any corpus analysis is, in the main, the frequency list and this corpus method, in addition to others such as keyword and concordance, will be thoroughly introduced, exemplified and systematically employed in order to illustrate the benefits of even the most basic corpus investigative procedures. From a context-specific point of view, the workshop will focus on data taken from intimate discourse – the spoken language of couples, families and close friends in private, non-professional settings. Individual phenomena such as pronouns, pragmatic markers, vocatives and taboo language will be examined, as will features of conversational organisation such as turn-taking, in order to demonstrate the nature of pragmatic variation within intimate discourse when compared to data from other context-types such as the workplace or the classroom. Questioning the ways in which linguistic items are used, particularly if they occur in differing proportions in different corpora, can provide insights, both intuited and unexpected, about language use in context and can empirically bring to light the varietal and contextual nuances of different pragmatic phenomena.
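As a minimal sketch of the most basic procedure mentioned above, the frequency list, the following Python code counts word forms in two small texts and normalises the counts per million words, so that corpora of different sizes can be compared. The tokeniser, the function names and the two sample strings are assumptions made for this example; they are not LCIE, BNC or COCA data, and they do not reproduce any particular corpus tool.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lower-case the text and extract simple word forms (naive tokeniser)."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(tokens: List[str]) -> Counter:
    """Return raw counts per word form."""
    return Counter(tokens)


def per_million(count: int, corpus_size: int) -> float:
    """Normalise a raw count to occurrences per million words."""
    return count * 1_000_000 / corpus_size


# Invented sample texts standing in for two context-types.
intimate = "well you know, you know yourself like"
workplace = "you will need to send the report today"

intimate_tokens = tokenize(intimate)
workplace_tokens = tokenize(workplace)

intimate_freq = frequency_list(intimate_tokens)
workplace_freq = frequency_list(workplace_tokens)

# Compare one item across the two samples using normalised frequencies.
item = "you"
intimate_rate = per_million(intimate_freq[item], len(intimate_tokens))
workplace_rate = per_million(workplace_freq[item], len(workplace_tokens))
```

Normalised rates rather than raw counts are what make a difference in proportion between two corpora visible, which is the kind of contrast the workshop sets out to question.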
References
- Barron, A., & Schneider, K., eds. (2005). The Pragmatics of Irish English. Berlin: Mouton de Gruyter.
- Schneider, K., & Barron, A., eds. (2008). Variational Pragmatics: A focus on regional varieties in pluricentric languages. Amsterdam: John Benjamins.
|