Monday, December 23, 2024

Essays in English provide information about other languages

Computer scientists at MIT and Israel’s Technion have discovered an unexpected source of information about the world’s languages: the habits of native speakers of those languages when writing in English.

The work could enable computers to sift through relatively accessible documents and produce rough data that might take trained linguists months in the field to gather. That data could, in turn, lead to better computational tools.

“These [linguistic] features that our system learns are, of course, on the one hand, theoretically interesting for linguists,” says Boris Katz, principal research scientist at MIT’s Computer Science and Artificial Intelligence Laboratory and one of the leaders of the new work. “But on the other hand, they are starting to be used more and more often in applications. Everyone is very interested in creating computational tools for the world’s languages, but to create them you need these features. So we may be able to do much more than just learn linguistic features. … These features could be extremely valuable for building better parsers, better speech recognizers, better natural language translators, and so on.”

In fact, as Katz explains, the researchers’ theoretical discovery arose from their work on practical applications: About a year ago, Katz proposed to one of his students, Yevgeny Berzak, that he try to write an algorithm that could automatically determine the native language of a person writing in English. The hope was to develop grammar proofreading software that could be tailored to the user’s specific linguistic background.

Family resemblance

With the support of Katz and Roi Reichart, an engineering professor at the Technion who was a postdoc at MIT, Berzak built a system that searched more than 1,000 English essays written by native speakers of 14 different languages. First, he analyzed the parts of speech of the words in each sentence of each essay and the relationships between them. He then looked for patterns in these relationships that correlated with the authors’ native languages.

Like most machine learning classification algorithms, Berzak’s assigned probabilities to its conclusions. For example, it might conclude that a particular essay had a 51 percent chance of having been written by a native Russian speaker, a 33 percent chance of having been written by a native Polish speaker, and only a 16 percent chance of having been written by a native Japanese speaker.
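To make the idea of a probabilistic native-language classifier concrete, here is a minimal sketch in Python. It is not the researchers’ system: it assumes essays arrive already tagged with parts of speech, uses adjacent tag pairs as a stand-in for the syntactic relationships described above, and uses a naive Bayes model with a uniform prior as a stand-in for whatever classifier the team actually trained. The tag names, the toy training data, the language labels, and the functions `pos_bigrams`, `train`, and `predict_proba` are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical training data: (native language label, part-of-speech tags of an essay).
TRAINING = [
    ("lang_A", ["DET", "NOUN", "VERB"]),
    ("lang_B", ["NOUN", "NOUN", "VERB"]),
]

def pos_bigrams(tags):
    """Adjacent part-of-speech pairs, a crude proxy for syntactic patterns."""
    return list(zip(tags, tags[1:]))

def train(essays):
    """Count how often each tag pair occurs in essays from each native language."""
    counts = {}
    for lang, tags in essays:
        counts.setdefault(lang, Counter()).update(pos_bigrams(tags))
    return counts

def predict_proba(counts, tags):
    """Naive Bayes with add-one smoothing and a uniform prior (an assumption)."""
    vocab = {pair for c in counts.values() for pair in c}
    log_scores = {}
    for lang, c in counts.items():
        total = sum(c.values()) + len(vocab)
        log_scores[lang] = sum(
            math.log((c[pair] + 1) / total) for pair in pos_bigrams(tags)
        )
    # Normalize in a numerically stable way so the scores sum to 1.
    peak = max(log_scores.values())
    unnormalized = {lang: math.exp(s - peak) for lang, s in log_scores.items()}
    norm = sum(unnormalized.values())
    return {lang: value / norm for lang, value in unnormalized.items()}

model = train(TRAINING)
probs = predict_proba(model, ["DET", "NOUN", "VERB", "DET", "NOUN"])
```

The result is one probability per candidate language, summing to 1, which is the same kind of output as the 51/33/16 percent split in the example above.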

Analyzing the results of their experiments, Berzak, Katz, and Reichart noticed something unusual: the algorithm’s probability estimates provided a quantitative measure of how closely related any two languages were. For example, the syntactic patterns of Russian speakers were more similar to those of Polish speakers than to those of Japanese speakers.

When they used this measure to create a family tree of the 14 languages in their dataset, it was almost identical to the family tree generated from data collected by linguists. For example, the nine languages that belonged to the Indo-European family were clearly different from the five that were not, and the Romance and Slavic languages were more similar to each other than to other Indo-European languages.
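One way to picture how probability estimates can turn into a family tree is the sketch below. It is an illustration under stated assumptions, not the researchers’ procedure: the probabilities are made up, the similarity measure (the average probability that essays from one language are attributed to another, symmetrized) is an assumption, and the tree is built with simple average-linkage clustering. The functions `similarity`, `leaves`, and `build_tree` are hypothetical names.

```python
from itertools import combinations

# Made-up classifier outputs: (true native language, predicted probabilities).
ESSAYS = [
    ("Russian",  {"Russian": 0.51, "Polish": 0.33, "Japanese": 0.16}),
    ("Russian",  {"Russian": 0.60, "Polish": 0.30, "Japanese": 0.10}),
    ("Polish",   {"Russian": 0.35, "Polish": 0.55, "Japanese": 0.10}),
    ("Japanese", {"Russian": 0.10, "Polish": 0.10, "Japanese": 0.80}),
]

def similarity(essays):
    """Mean probability that essays from language a are attributed to b, symmetrized."""
    langs = sorted({true for true, _ in essays})
    totals = {a: {b: 0.0 for b in langs} for a in langs}
    counts = {a: 0 for a in langs}
    for true, probs in essays:
        counts[true] += 1
        for b in langs:
            totals[true][b] += probs[b]
    mean = {a: {b: totals[a][b] / counts[a] for b in langs} for a in langs}
    return {
        frozenset((a, b)): (mean[a][b] + mean[b][a]) / 2
        for a, b in combinations(langs, 2)
    }

def leaves(node):
    """All language names under a tree node (a name or a pair of subtrees)."""
    return [node] if isinstance(node, str) else leaves(node[0]) + leaves(node[1])

def build_tree(sim):
    """Repeatedly merge the two most similar clusters (average linkage)."""
    nodes = sorted({lang for pair in sim for lang in pair})

    def link(x, y):
        pairs = [(a, b) for a in leaves(x) for b in leaves(y)]
        return sum(sim[frozenset(p)] for p in pairs) / len(pairs)

    while len(nodes) > 1:
        x, y = max(combinations(nodes, 2), key=lambda p: link(*p))
        nodes = [n for n in nodes if n not in (x, y)] + [(x, y)]
    return nodes[0]

tree = build_tree(similarity(ESSAYS))
print(tree)  # ('Japanese', ('Polish', 'Russian'))
```

With these toy numbers, Polish and Russian merge first and Japanese joins last, mirroring the Russian, Polish, and Japanese example in the text.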

What’s your type?

“What’s striking about this tree is that our system inferred it without ever seeing a single word in any of these languages,” Berzak says. “We’re essentially getting the similarity structure for free. Now we can go a step further and use this tree to predict typological features of a language about which we have no linguistic knowledge.”

By “typological features,” Berzak means the types of syntactic patterns that linguists use to characterize languages: things like the typical order of subject, object, and verb; how negation is formed; or whether nouns take articles. A widely used online linguistic database called the World Atlas of Language Structures (WALS) identifies nearly 200 such features and contains data on more than 2,000 languages.

However, Berzak says, for some of these languages WALS contains only a few typological features; the rest have not yet been determined. Even widely studied European languages may have dozens of missing entries in the WALS database. Berzak points out that at the time of his research, only 14 percent of the entries in WALS had been filled in.
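The idea of filling in missing typological entries from similar languages can be sketched as a weighted vote among a language’s nearest neighbors. This is a simplified illustration, not the researchers’ method: the language labels, feature names, feature values, and similarity scores are all made up, and `predict_feature` is a hypothetical helper.

```python
from collections import Counter

# Made-up feature table in the spirit of WALS; None marks an undocumented entry.
FEATURES = {
    "lang_A": {"word_order": "SVO", "articles": None},
    "lang_B": {"word_order": "SOV", "articles": "yes"},
    "lang_C": {"word_order": "SOV", "articles": "no"},
    "target": {"word_order": None,  "articles": None},
}

# Made-up similarity of the target language to each documented language,
# for example derived from classifier probabilities on English essays.
SIMILARITY = {"target": {"lang_A": 0.6, "lang_B": 0.3, "lang_C": 0.1}}

def predict_feature(target, feature, similarity, features, k=2):
    """Similarity-weighted vote among the k closest languages that document the feature."""
    known = [
        (similarity[target][lang], values[feature])
        for lang, values in features.items()
        if lang != target
        and lang in similarity[target]
        and values.get(feature) is not None
    ]
    votes = Counter()
    for weight, value in sorted(known, reverse=True)[:k]:
        votes[value] += weight
    return votes.most_common(1)[0][0] if votes else None

word_order = predict_feature("target", "word_order", SIMILARITY, FEATURES)
articles = predict_feature("target", "articles", SIMILARITY, FEATURES)
```

Languages that lack the feature themselves are skipped, so a prediction can still be made when the closest neighbor has a gap of its own.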

Branching out

The 14 languages on which the researchers conducted their initial experiments were those for which a sufficient number of essays were publicly available, an average of 88 for each. Katz, however, is confident that with enough training data, the system will perform equally well on other languages. Berzak points out that the African language Tswana, which has only five entries in WALS, nevertheless has 6 million speakers worldwide. It shouldn’t be too hard, Berzak argues, to find more essays in English written by native Tswana speakers.

“There are people who debate the extent to which, when you learn a second language, you just start over and learn the structure of the language,” says Robert Frank, chairman of the linguistics department at Yale University. “Another hypothesis is that you think of the new language as a modified version of your own language. Some researchers think that such modifications occur at a fairly superficial level. But others think that they operate on elements of abstract grammar.”

The technique used by the MIT researchers could sharpen that debate, Frank says. The ability to predict features of speakers’ native languages from the syntax of their written English, he says, shows that “there’s a clear reflection of the grammar of the original language. So it’s not like they’re just starting from scratch.”

For example, in French, objects follow verbs, unless they are pronouns, in which case they precede the verbs. In Yiddish, both pronouns and definite objects precede the verb, but other objects do not. So are French and Yiddish verb-object languages or object-verb languages?

Frank would like to see how well the MIT researchers’ technique predicts classifications within other, more specific and abstract typological systems. “What are the basic characteristics that would determine, ‘Oh, is it just a pronoun, or is it just objects that have certain kinds of properties?’” Frank says. “I’m cautiously optimistic. I’m excited about the possibility that even more abstract properties will be reflected in English production.”
