Chapter published in: Computational Phraseology
Edited by Gloria Corpas Pastor and Jean-Pierre Colson
[IVITRA Research in Linguistics and Literature 24] 2020
pp. 226–245
Empirical variability of Italian multiword expressions as a useful feature for their categorisation
In contemporary linguistics, the definition of the entities referred to as multiword expressions (MWEs) remains controversial. It is intuitively clear that some words, when they appear together, have a “special bond” in terms of meaning (e.g. black hole, mountain chain) or lexical choice (e.g. strong tea, to fill a form), in contrast to free combinations. Nevertheless, the great variety of features and anomalous behaviours that these expressions exhibit makes it difficult to organise them into categories and has given rise to a large body of differing and sometimes overlapping terminology.
So far, most approaches in corpus linguistics have focused on automatically extracting MWEs from corpora by using statistical association measures, while theoretical aspects of their definition, typology and behaviour, as they emerge from quantitative corpus-based studies, have not been widely explored, especially for languages with rich morphology and relatively free word order, such as Italian.
This contribution shows that a systematic analysis of the empirical behaviour of Italian MWEs in large corpora, with respect to parameters such as syntactic and lexical variation, is useful for outlining a categorisation of the expressions into homogeneous sets that approximately correspond to what are intuitively known as multiword units (“polirematiche” in the Italian lexicographic tradition) and lexical collocations. The importance of this kind of approach is that the resulting categorisation of MWEs is grounded in empirical data rather than relying on intuitive and not always coherent linguistic definitions.
The variational features taken into account are (1) the possibility for the expressions to be syntactically transformed, and (2) the possibility for one of the components to be replaced with a synonym. These features can be investigated automatically and quantitatively using purpose-built tools, whose methodology is fully explained, provided that an annotated corpus and a list of expressions are available. It is possible to show that the kind and magnitude of the attested variation appear highly correlated with the grammatical structure of a given phrase, indicating that the bond between the components of a multiword unit or a lexical collocation can be formed by activating different kinds of restrictions, depending on the grammatical pattern considered.
Keywords: collocation, categorisation, multiword expressions, PAISÀ corpus, semantic variation
Published online: 08 May 2020