Chapter published in:
Computational Phraseology
Edited by Gloria Corpas Pastor and Jean-Pierre Colson
[IVITRA Research in Linguistics and Literature 24] 2020
pp. 24–41
Translation asymmetries of multiword expressions in machine translation
An analysis of the TED-MWE corpus
Johanna Monti | Università degli Studi di Napoli “L’Orientale”
Mihael Arcan | Insight Centre for Data Analytics
Federico Sangati | Università degli Studi di Napoli “L’Orientale”
Machine Translation (MT) is now extensively used both as a tool
to overcome language barriers on the internet and as a professional tool to
translate technical documentation. The technology has rapidly evolved in
recent years thanks to the availability of large amounts of data in digital
format and in particular parallel corpora, which are used to train
Statistical Machine Translation (SMT) tools. The quality of MT has
considerably improved, but the translation of multiword expressions (MWEs)
remains a major open challenge, from both a theoretical and a
practical point of view (Monti, 2013). We define MWEs as any group of two or more words or terms
in a language lexicon that generally conveys a single meaning, such as the
Italian expressions anima gemella (soul mate),
carta di credito (credit card), acqua e
sapone (water and soap), piovere a catinelle
(rain cats and dogs). The persistence of MWE mistranslations in MT
output originates from their lexical, syntactic, semantic, pragmatic and
also translational idiomaticity. Further research is therefore needed
to achieve significant improvements in MT and
translation technologies. In particular, it is important to develop
resources, mainly MWE-annotated corpora, which can be used for both MT
training and evaluation purposes (Monti and Todirascu, 2016).

This work focuses on the translation asymmetries between English
and Italian MWEs and on how they affect SMT performance. By translation
asymmetries we mean the differences which may occur between an MWE in a
source language and its equivalent in the target language, as in
many-to-many word translations (En. to be in a position to
→ It. essere in grado di), many-to-one (En. to set
free → It. liberare) and finally one-to-many
correspondences (En. overcooked → It. cotto
troppo). This chapter describes the evaluation of
mistranslations caused by translation asymmetries concerning multiword
expressions detected in the TED-MWE corpus (http://tiny.cc/TED_MWE), which
contains 1,500 sentences and 31,000 EN tokens. This corpus is a subset of
the TED spoken corpus (Monti et al.,
2015) annotated with all the MWEs detected during the evaluation
process. The corpus contains the following information: (i) the English
source text, (ii) the Italian human translations (from the parallel corpus),
and (iii) the Italian SMT output. All the annotators were Italian native
speakers with a good knowledge of the English language and with a background
in linguistics and computational linguistics. They were asked to identify
all MWEs in the source text together with their translations in
approximately 300 random sentences each and to evaluate the automatic
translation correctness. The identified MWEs and the evaluation of both the
human and the machine translation are also recorded in the corpus. This
chapter will discuss (i) the related work concerning the impact of
anisomorphism (the absence of an exact correspondence between words in two
different languages) and the consequent translation asymmetries on MWE
translation quality in MT, (ii) the corpus, (iii) the annotation guidelines,
(iv) the methodology adopted during the annotation process (Monti et al., 2015), (v) the results
of the annotation and finally (vi) the evaluation of translation asymmetries
in the corpus and ideas for future work.
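
The three asymmetry types named above (many-to-many, many-to-one, one-to-many) can be illustrated with a minimal sketch. It labels a source/target pair purely by counting whitespace-separated tokens on each side. The function name `classify_asymmetry` and the whitespace tokenisation are assumptions made for illustration only; they are not the annotation procedure used for the TED-MWE corpus.

```python
def classify_asymmetry(source: str, target: str) -> str:
    """Label a translation pair by token counts (hypothetical helper).

    Tokens are whitespace-separated strings, which is a simplification:
    real MWE annotation relies on linguistic judgement, not word counts.
    """
    source_is_multiword = len(source.split()) > 1
    target_is_multiword = len(target.split()) > 1
    if source_is_multiword and target_is_multiword:
        return "many-to-many"
    if source_is_multiword:
        return "many-to-one"
    if target_is_multiword:
        return "one-to-many"
    return "one-to-one"


# English -> Italian pairs taken from the abstract.
examples = [
    ("to be in a position to", "essere in grado di"),
    ("to set free", "liberare"),
    ("overcooked", "cotto troppo"),
]

for english, italian in examples:
    print(f"{english} -> {italian}: {classify_asymmetry(english, italian)}")
```

A token count only captures the surface shape of the correspondence; it says nothing about whether a given word group is actually an MWE, which is what the human annotators had to decide.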
Keywords: machine translation, translation asymmetries, multiword expressions, TED-MWE corpus
Published online: 08 May 2020
https://doi.org/10.1075/ivitra.24.02mon