Article published in: Spoken Corpora and Linguistic Studies
Edited by Tommaso Raso and Heliana Mello
[Studies in Corpus Linguistics 61] 2014
pp. 105–128
The grammatical annotation of speech corpora
Techniques and perspectives
This chapter discusses the grammatical annotation of speech corpora on the one hand (C-ORAL-Brasil, NURC) and speech-like text on the other (e-mail, chat, tv-news, parliamentary discussions), drawing on Portuguese data for the former and English data for the latter. We try to identify and compare linguistic orality markers (“speechlikeness”) in different genres, and argue that broad-coverage Constraint Grammar parsers such as PALAVRAS and EngGram can be adapted to these features, and used across the text-speech divide. Special topics include emoticons, phonetic variation and syntactic features. For ordinary speech corpora we propose a system of two-level annotation, where overlaps, retractions and phonetic variation are maintained as meta-tagging, while allowing conventional annotation of an orthographically normalized textual layer. In the absence of punctuation, syntactic segmentation can be achieved by exploiting prosodic breaks as delimiters in parsing rules. With the exception of chat data, our modified “oral” CG parsers perform reasonably close to their written language counterparts, even for true transcribed speech, achieving accuracy rates (F-scores) above 98% for PoS tags and 93–95% for syntactic function.
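The two-level idea and the use of prosodic breaks as segment delimiters can be illustrated with a minimal sketch. It is not the PALAVRAS or EngGram implementation. The break markers ("//" for a terminal break, "/" for a non-terminal one), the meta-tag names, and the small variant table are assumptions made for illustration only. As a simplification, only terminal breaks close a parsing unit here, while non-terminal breaks are kept as meta-tags on the preceding token.

```python
from dataclasses import dataclass, field
from typing import Dict, List

TERMINAL_BREAK = "//"     # assumed marker for a terminal prosodic break
NON_TERMINAL_BREAK = "/"  # assumed marker for a non-terminal prosodic break


@dataclass
class Token:
    form: str                                      # normalized form, the layer a parser would see
    meta: List[str] = field(default_factory=list)  # meta-tags kept outside the textual layer


def segment(transcript: str, variants: Dict[str, str]) -> List[List[Token]]:
    """Split a whitespace-tokenized transcript into parsing units at terminal breaks.

    `variants` maps phonetic spellings to normalized orthography (hypothetical table).
    """
    units: List[List[Token]] = []
    current: List[Token] = []
    for raw in transcript.split():
        if raw == TERMINAL_BREAK:
            # A terminal break closes the current unit, standing in for punctuation.
            if current:
                units.append(current)
                current = []
        elif raw == NON_TERMINAL_BREAK:
            # Keep the break as meta-information on the preceding token.
            if current:
                current[-1].meta.append("<break>")
        elif raw.lower() in variants:
            # Normalize the spelling, but remember the transcribed variant.
            current.append(Token(variants[raw.lower()], [f"<phon:{raw}>"]))
        else:
            current.append(Token(raw))
    if current:
        units.append(current)  # unit left open at the end of the transcript
    return units
```

Called as `segment("cê tá bem / né // eu tô cansado //", {"cê": "você", "tá": "está", "tô": "estou"})`, the sketch yields two units whose normalized layer could be passed to a conventional parser, while the transcribed variants and the non-terminal break remain recoverable from the meta-tags.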
Published online: 14 November 2014