Article published in: Multilingual Corpora and Multilingual Corpus Analysis
Edited by Thomas Schmidt and Kai Wörner
[Hamburg Studies on Multilingualism 14] 2012
pp. 25–46
Technological and methodological challenges in creating, annotating and sharing a learner corpus of spoken German
This article discusses questions concerning the creation, annotation and sharing of spoken language corpora. We use the Hamburg Map Task Corpus (HAMATAC), a small corpus in which advanced learners of German were recorded solving a map task, as an example to illustrate our main points. We first give an overview of the corpus creation and annotation process including recording, metadata documentation, transcription and semi-automatic annotation of the data. We then discuss the manual annotation of disfluencies as an example case in which many of the typical and challenging problems for data reuse – in particular the reliability of interpretative annotations – are revealed.
Published online: 15 November 2012