| Abstract | | We introduce a new probabilistic model for transliteration that
performs significantly better than previous approaches, is
language-agnostic, requiring no knowledge of the source or target
languages, and is capable of both generation (creating the most
likely transliteration of a source word) and discovery (selecting
the most likely transliteration from a list of candidate words).
Our experimental results demonstrate improved accuracy over the
existing state-of-the-art by more than 10% in Chinese, Hebrew and
Russian. While past work has commonly made use of fixed-size n-gram
features along with more traditional models such as HMM or
Perceptron, we utilize an intuitive notion of "productions", where
each source word can be segmented into a series of contiguous,
non-overlapping substrings of any size, each of which independently
transliterates to a substring in the target language with a given
probability (e.g. P(wash)⇒ вaш) = 0.95). To learn these parameters,
we employ ExpectationMaximization (EM), with the alignment between
substrings in the source and target word training pairs as our
latent data. Despite the size of the parameter space and the
2|w|-1 possible segmentations to consider for each word,
by using dynamic programming each iteration of EM takes
O(m6n) time, where m is the length of the longest word
in the data and n is the number of word pairs, and is very fast in
practice. Furthermore, discovering transliterations takes only
O(m4w) time, where w is the number of candidate words to
choose from, and generating a transliteration takes
O(i>m2k2) time, where k is a pruning
constant (we used a value of 100). Additionally, we are able to
obtain training examples in an unsupervised fashion from Wikipedia
by using a relatively simple algorithm to filter potential word
pairs. |