

Transliteration is the task of transcribing a word or phrase originally composed in one writing system to another in an understandable form. Transliteration is used to translate “out of vocabulary” (OOV) words in the source language, which have no actual translation to the target language. Proper nouns (people, places, and organizations) are the major source of OOVs, making transliteration a crucial problem. For example, proper nouns make up 40% of search queries and over 70% of most searches and are especially important in interlingual communication in the global media. This problem manifests between any two languages, even those that employ the same writing system, as it is most desirable to represent differences in pronunciation. It is even more crucial when moving between writing systems, as the source and target languages may have sets of phonemes that do not map bijectively. That is, the source and target languages may each contain phonemes that do not exist in the other language, and may also be missing phonemes from the other language.

Side note: linguistically, transliteration actually refers to a map between graphemes (units of a writing system) rather than phonemes (units of a language’s sound base); transcription is a more accurate term for what you’re doing in this assignment. However, in the machine translation literature, transcription typically refers to converting speech to a text representation rather than text to text. The results of transcription and transliteration are often quite similar, as letters represent similar sounds in many languages. However, transcription does not try to maintain a bijective map between letters of two writing scripts, in order to better represent pronunciation in the target language. This introduces ambiguity and increases the difficulty of back-transliteration into the source language, but results in a clearer form in the target language that can be understood and pronounced without an understanding of the source language.

Arabic-English Transliteration

Arabic to English transliteration is a particularly interesting problem since the writing scripts and phoneme sets of the two languages are particularly disparate. Additionally, the Arabic writing system may omit vowels, optionally using diacritics to resolve ambiguities. Note that all Arabic words are the same length (4 characters) but the English transliterations are variable length. In these examples and the project, all Arabic text is ordered left-to-right. ‘م ح م د’ would be rendered as ‘Muhammad’ in the Roman alphabet. (Note that the Arabic characters refer to the consonants “m”, “h”, “m”, and “d”.)

The training data is provided in data-train/ain, consisting of about 14,000 pairs of English and Arabic transliterations. These pairs were taken from Chris Callison-Burch’s data set (from Transliterating From All Languages), and were originally generated by scraping Wikipedia article names. Each line contains an English word and an Arabic word, separated by a tab.

You have 1600 lines of Arabic testing data in data-dev/ar-pub.test. The corresponding English transliterations have been provided for the first 800 lines in data-dev/en-pub.test, so you may score yourself. The remaining 800 lines will be used to calculate an official score for your submission. For your submission, you should include all 1600 lines so that we can grade appropriately. Your task is to minimize the edit distance of the English transliteration. To reach the baseline, extend the default system by allowing characters from the source language to map to multiple characters in the destination language, and vice versa. One method is to identify certain digraphs in one language which will map to a single morpheme in the other language.
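Since the task is scored by the edit distance of the English transliteration, it may help to have a local scorer. The following is a minimal sketch that assumes the metric is plain Levenshtein distance (unit-cost insertions, deletions, and substitutions) averaged over lines; the official grader may normalize or weight differently, and the function names `edit_distance` and `mean_edit_distance` are hypothetical.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, using unit costs (an assumption)."""
    # prev[j] holds the distance between the first i-1 chars of a and first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def mean_edit_distance(hypotheses: Sequence[str], references: Sequence[str]) -> float:
    """Average per-line edit distance; lower is better."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must have the same length")
    if not references:
        raise ValueError("nothing to score")
    total = sum(edit_distance(h, r) for h, r in zip(hypotheses, references))
    return total / len(references)
```

To score yourself, you could pass your first 800 output lines as `hypotheses` and the lines of data-dev/en-pub.test as `references`.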

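The idea of letting one source character map to several target characters, and vice versa, can be illustrated with a greedy longest-match lookup. This is only a sketch under stated assumptions: the table `MAPPING` is a tiny hand-written, hypothetical example (it is not the default system and is not learned from the training data), and `transliterate` is an illustrative name.

```python
# Hypothetical hand-written table; a real system would estimate mappings from data.
MAPPING = {
    "\u0645": "m",   # ARABIC LETTER MEEM
    "\u062d": "h",   # ARABIC LETTER HAH
    "\u062f": "d",   # ARABIC LETTER DAL
    "\u0634": "sh",  # ARABIC LETTER SHEEN: one source char -> two target chars
    "\u062e": "kh",  # ARABIC LETTER KHAH: one source char -> two target chars
}


def transliterate(word: str, mapping: dict = MAPPING, max_key_len: int = 2) -> str:
    """Greedy longest-match transliteration.

    Source keys may span several characters (up to max_key_len), and each
    key may emit several target characters. Unknown symbols, including the
    spaces between characters, are skipped.
    """
    out = []
    i = 0
    while i < len(word):
        # Try the longest candidate key first, then shorter ones.
        for n in range(min(max_key_len, len(word) - i), 0, -1):
            chunk = word[i:i + n]
            if chunk in mapping:
                out.append(mapping[chunk])
                i += n
                break
        else:
            i += 1  # no key matched: skip this symbol
    return "".join(out)
```

With this table, the four consonants of ‘م ح م د’ come out as `mhmd`, which shows why a fixed table alone cannot produce ‘Muhammad’: the vowels are not written in the Arabic, so a stronger system has to model them.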