Chapter published in:
Computational Phraseology
Edited by Gloria Corpas Pastor and Jean-Pierre Colson
[IVITRA Research in Linguistics and Literature 24] 2020
pp. 84–110
Computational extraction of formulaic sequences from corpora
Two case studies of a new extraction algorithm
Alexander Wahl | Donders Institute for Brain, Cognition and Behaviour, Radboud University
We describe a new algorithm for the extraction of formulaic
language from corpora. Entitled MERGE (Multi-word Expressions from the
Recursive Grouping of Elements), it iteratively combines adjacent bigrams
into progressively longer sequences based on lexical association strengths.
We then provide empirical evidence for this approach via two case studies.
First, we compare the performance of MERGE to that of another algorithm by
evaluating the output of each approach against manually annotated
formulaic sequences from the spoken component of the British National
Corpus. Second, we employ two child language corpora to examine whether
MERGE can predict the formulas that the children learn based on caregiver
input. Ultimately, we show that MERGE indeed performs well, offering a
powerful approach for the extraction of formulas.
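
The merging procedure summarized above can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration and is not taken from the chapter: the use of pointwise mutual information as the association measure, the minimum pair frequency of 2, the first-seen tie-breaking, the restriction to strictly contiguous pairs, and the hypothetical function names `best_pair`, `merge_once`, and `merge_sketch`.

```python
import math
from collections import Counter


def best_pair(tokens, min_freq=2):
    """Return the adjacent pair with the highest PMI, or None if no pair qualifies.

    PMI and the min_freq cutoff are illustrative assumptions, not the
    chapter's stated association measure or threshold.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    best, best_score = None, float("-inf")
    for (a, b), f_ab in bigrams.items():
        if f_ab < min_freq:
            continue
        score = math.log2(f_ab * n / (unigrams[a] * unigrams[b]))
        if score > best_score:  # strict '>' keeps the first-seen pair on ties
            best, best_score = (a, b), score
    return best


def merge_once(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])  # concatenate word tuples
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def merge_sketch(words, iterations):
    """Repeatedly merge the strongest adjacent pair; return merges in order."""
    tokens = [(w,) for w in words]  # each token is a tuple of words
    merged = []
    for _ in range(iterations):
        pair = best_pair(tokens)
        if pair is None:
            break
        tokens = merge_once(tokens, pair)
        merged.append(pair[0] + pair[1])
    return merged


if __name__ == "__main__":
    sample = "i do not know x i do not know y".split()
    for sequence in merge_sketch(sample, iterations=5):
        print(" ".join(sequence))
```

Because a merged sequence re-enters the token stream as a single unit, later iterations can join it with its neighbours, which is how sequences longer than two words can emerge in this sketch.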
Keywords: formulaic sequences, collocation extraction, lexical association, child language, MERGE, adjusted frequency list
Published online: 08 May 2020
https://doi.org/10.1075/ivitra.24.05wah