Jeff Pasternack

Computer Science Ph.D. / University of Illinois at Urbana-Champaign

Learning Better Transliterations

Abstract We introduce a new probabilistic model for transliteration that performs significantly better than previous approaches, is language-agnostic, requiring no knowledge of the source or target languages, and is capable of both generation (creating the most likely transliteration of a source word) and discovery (selecting the most likely transliteration from a list of candidate words). Our experimental results demonstrate improved accuracy over the existing state-of-the-art by more than 10% in Chinese, Hebrew and Russian. While past work has commonly made use of fixed-size n-gram features along with more traditional models such as HMM or Perceptron, we utilize an intuitive notion of "productions", where each source word can be segmented into a series of contiguous, non-overlapping substrings of any size, each of which independently transliterates to a substring in the target language with a given probability (e.g. P(wash)⇒ вaш) = 0.95). To learn these parameters, we employ ExpectationMaximization (EM), with the alignment between substrings in the source and target word training pairs as our latent data. Despite the size of the parameter space and the 2|w|-1 possible segmentations to consider for each word, by using dynamic programming each iteration of EM takes O(m6n) time, where m is the length of the longest word in the data and n is the number of word pairs, and is very fast in practice. Furthermore, discovering transliterations takes only O(m4w) time, where w is the number of candidate words to choose from, and generating a transliteration takes O(i>m2k2) time, where k is a pruning constant (we used a value of 100). Additionally, we are able to obtain training examples in an unsupervised fashion from Wikipedia by using a relatively simple algorithm to filter potential word pairs.
Authors Jeff Pasternack and Dan Roth
Venue ACM Conference on Information and Knowledge Management (CIKM)
Year 2009
Download PDF Document