In his 2002 book Lost Languages, Andrew Robinson, then the literary editor of a London higher-education supplement, argued that “successfully deciphering archaeological puzzles requires a synthesis of logic and intuition… which computers do not possess (and probably cannot possess).”
Regina Barzilay, an assistant professor at the MIT Computer Science and Artificial Intelligence Lab, Ben Snyder, a graduate student in her lab, and Kevin Knight of the University of Southern California took that claim personally. At the annual meeting of the Association for Computational Linguistics in Sweden next month, they will present a paper on a novel computer system that deciphered a large part of the ancient Semitic language Ugaritic in a matter of hours. In addition to helping archaeologists decipher the eight or so ancient languages that have so far resisted their efforts, the work could also help expand the number of languages that automated translation systems like Google Translate can handle.
To replicate the “intuition” that Robinson says computers can’t grasp, the researchers’ software makes several assumptions. The first is that the language being deciphered is closely related to another language: in the case of Ugaritic, the researchers chose Hebrew. The second is that there’s a systematic way to map the alphabet of one language onto the alphabet of the other, and that correlated symbols will occur with similar frequencies in the two languages.
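A minimal sketch of that frequency assumption, not the authors’ actual model: rank each script’s symbols by how often they occur and pair them up rank-for-rank as a first alphabet-mapping hypothesis. The corpora and the function name below are invented for illustration.

```python
from collections import Counter

def frequency_mapping(unknown_corpus: str, known_corpus: str) -> dict:
    """Pair the symbols of two scripts by descending frequency rank."""
    unknown_ranked = [s for s, _ in Counter(unknown_corpus.replace(" ", "")).most_common()]
    known_ranked = [s for s, _ in Counter(known_corpus.replace(" ", "")).most_common()]
    return dict(zip(unknown_ranked, known_ranked))

# Toy strings standing in for Ugaritic and Hebrew text:
print(frequency_mapping("abbcccdddd", "wxxyyyzzzz"))  # {'d': 'z', 'c': 'y', 'b': 'x', 'a': 'w'}
```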
The system makes a similar assumption at the level of the word: languages should have at least some cognates, words with common roots, like the many word pairs that French and Spanish share. Finally, the system makes a similar mapping for parts of words. For example, a word like “overloading” has both a prefix (“over”) and a suffix (“ing”). The system predicts that other words in the language will have the prefix “over” or the suffix “ing,” or both, and that the cognate of “overloading” in another language (say, “surchargeant” in French) will have a similar three-part structure.
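As an illustration of that three-part structure, here is a hypothetical segmentation routine; the affix lists are invented for the example and are not drawn from the paper.

```python
KNOWN_PREFIXES = ["over", "un", "re"]
KNOWN_SUFFIXES = ["ing", "ed", "s"]

def segment(word: str) -> tuple:
    """Greedily strip one known prefix and one known suffix, if present."""
    prefix = next((p for p in KNOWN_PREFIXES if word.startswith(p)), "")
    rest = word[len(prefix):]
    suffix = next((s for s in KNOWN_SUFFIXES if rest.endswith(s)), "")
    stem = rest[: len(rest) - len(suffix)] if suffix else rest
    return prefix, stem, suffix

print(segment("overloading"))  # ('over', 'load', 'ing')
```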
Crosstalk
The system works out these different levels of correspondence on its own. For example, it might start with several competing hypotheses about alphabetic mappings, based entirely on symbol frequency: symbols that occur frequently in one language are mapped to those that occur frequently in the other. Using a kind of probabilistic modeling common in AI research, the system would then figure out which of those mappings seems to identify a set of consistent suffixes and prefixes. From that, it could look for word-level correspondences, which in turn could help it refine its alphabetic mappings. “We iterate through the data hundreds, thousands of times,” Snyder says, “and each time, our guesses get more likely because we’re actually getting closer to a solution where we get more consistency.” Eventually, the system reaches a point where changing the mappings no longer improves consistency.
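The passage describes probabilistic iteration; as a much simpler stand-in, here is a toy hill-climbing loop in the spirit of the process Snyder describes: propose small changes to the letter mapping and keep a change only if it does not make the mapped words less consistent (here, “consistency” is just the number of matches against a known lexicon). The scoring function, symbols, words, and lexicon are all invented placeholders.

```python
import random

def score(mapping: dict, unknown_words: list, known_lexicon: set) -> int:
    """Count unknown-script words whose letter-by-letter image is a known word."""
    return sum(
        "".join(mapping.get(c, "?") for c in w) in known_lexicon
        for w in unknown_words
    )

def refine(mapping: dict, unknown_words: list, known_lexicon: set,
           iters: int = 1000, seed: int = 0):
    """Repeatedly propose letter-assignment swaps, keeping non-worsening ones."""
    rng = random.Random(seed)
    best = score(mapping, unknown_words, known_lexicon)
    symbols = list(mapping)
    for _ in range(iters):
        a, b = rng.sample(symbols, 2)  # propose: swap two letter assignments
        mapping[a], mapping[b] = mapping[b], mapping[a]
        new = score(mapping, unknown_words, known_lexicon)
        if new >= best:
            best = new  # keep the change
        else:
            mapping[a], mapping[b] = mapping[b], mapping[a]  # revert
    return mapping, best

# Three made-up symbols, two made-up words, and a tiny stand-in lexicon;
# the loop converges to a mapping that turns both words into lexicon entries:
print(refine({"1": "a", "2": "t", "3": "c"}, ["231", "321"], {"cat", "act"}))
```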
Ugaritic has already been deciphered: otherwise, the researchers would have no way of assessing their system’s performance. The Ugaritic alphabet has 30 letters, and the system correctly mapped 29 of them to their Hebrew equivalents. About a third of the words in Ugaritic have Hebrew equivalents, and of those, the system correctly identified 60 percent. “Of the ones that are wrong, they’re often only wrong by one letter, so they’re often very good guesses,” Snyder says.
In addition, he points out, the system does not currently use any contextual information to resolve ambiguities. For example, the Ugaritic words for “house” and “daughter” are spelled the same way, but their Hebrew equivalents are not. Although the system might occasionally confuse the two, a human decipherer could easily tell from context which was intended.
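One way a program could mimic that human step, sketched here with invented context-cue lists (nothing below comes from the paper): pick the candidate reading whose typical neighboring words actually appear nearby.

```python
# Invented cue words for each candidate reading:
CONTEXT_CUES = {
    "house": {"built", "roof", "entered"},
    "daughter": {"born", "married", "father"},
}

def disambiguate(candidates: list, context_words: set) -> str:
    """Choose the candidate sharing the most cue words with the context."""
    return max(candidates, key=lambda c: len(CONTEXT_CUES[c] & context_words))

print(disambiguate(["house", "daughter"], {"the", "roof", "was", "built"}))  # house
```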
Bubble
Still, Andrew Robinson remains skeptical. “If the authors believe that their approach will ultimately lead to computerized ‘automatic’ reading of currently undeciphered scripts,” he writes in an email, “then I’m afraid I’m not convinced by their work at all.” The researchers’ approach, he says, assumes that the language to be deciphered has an alphabet that can be mapped onto that of a known language—“which is almost certainly not the case for any of the other important undeciphered scripts,” Robinson writes. It also assumes, he says, that it’s clear where one character or word ends and another begins, which is not true for many deciphered and undeciphered scripts.
“Each language has its own challenges,” Barzilay agrees. “Successful decipherment would most likely require a method tailored to the language.” But, as she points out, deciphering Ugaritic took years and relied on a few lucky breaks, such as the discovery of an axe with the word for “axe” written on it in Ugaritic. “The output from our system would shorten that process by orders of magnitude,” she says.
Indeed, Snyder and Barzilay don’t think a system like the one they designed with Knight would ever replace human decipherers. “But it’s a powerful tool that can aid the human decipherment process,” Barzilay says. What’s more, a variation of it could also help extend the versatility of translation software. Many online translators rely on the analysis of parallel texts to determine word correspondences: they might, for example, comb through the collected works of Voltaire, Balzac, Proust, and many other writers, in both English and French, looking for consistent mappings between words. “That’s how statistical translation systems have worked for the last 25 years,” Knight says.
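A bare-bones sketch of the parallel-text idea Knight describes (real systems use far more sophisticated alignment models; the sentence pairs are invented): count how often word pairs co-occur across aligned sentences and treat the most frequent pairings as candidate translations.

```python
from collections import Counter

# Two invented English-French sentence pairs:
pairs = [
    ("the house is big", "la maison est grande"),
    ("the house is old", "la maison est vieille"),
]

# Count every (English word, French word) co-occurrence across aligned sentences:
cooccur = Counter(
    (e, f)
    for en, fr in pairs
    for e in en.split()
    for f in fr.split()
)

# "house" and "maison" co-occur in every pair, making them a strong candidate:
print(cooccur[("house", "maison")])  # 2
```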
But not all languages have such comprehensively translated literatures: Snyder points out that Google Translate currently handles only 57 languages. The techniques used in the decipherment system could be adapted to build lexicons for thousands of other languages. “The technology is very similar,” says Knight, who works in machine translation. “They feed off each other.”