Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory have shown in two recently published papers that for several specific tasks it is possible to write computer programs using regular language rather than specialized programming languages.
This work could be helpful for programmers, and it may allow non-programmers to manipulate common file types – such as text documents and spreadsheets – in ways that previously required knowledge of programming languages. The researchers’ methods could also be applied to other programming tasks, expanding the range of contexts in which programmers can specify functions using ordinary language.
“I don’t think we’ll be able to do this for everything in programming, but there are areas where there are lots of examples of people doing translation,” says Regina Barzilay, an assistant professor of computer science and electrical engineering and co-author of both papers. “If the information is there, you can learn how to translate that language into code.”
In other cases, Barzilay says, programmers may already be practicing writing specifications that describe computational tasks in precise and formal language. “Even though they’re written in natural language and have some variability, they’re not exactly Shakespeare,” Barzilay says. “So again, they can be translated.”
The researchers’ recent work demonstrates both approaches. In a paper presented in June at the annual conference of the North American Chapter of the Association for Computational Linguistics, Barzilay and graduate student Nate Kushman used examples taken from the Web to teach a computer system to convert natural-language descriptions into so-called “regular expressions”: combinations of symbols that enable much more flexible file searches than the standard search functions built into most software.
In a paper presented at the annual conference of the Association for Computational Linguistics in August, Barzilay and her graduate student Tao Lei joined forces with professor of electrical engineering and computer science Martin Rinard and his graduate student Fan Long to describe a system that automatically learned how to handle data stored in different file formats, based on specifications prepared for a popular programming competition.
Regular irregularities
As Kushman explains, computer science researchers have had some success with systems that translate natural-language questions into special-purpose formal languages—for example, languages used to specify database searches. “Typically, these techniques work by finding a fairly direct mapping between natural language and this formal representation,” Kushman says. “In general, logical forms are written by hand to get a nice mapping.”
Unfortunately, Kushman says, this approach doesn’t work for regular expressions, which are strings of symbols that can describe the data in a file in great detail. A regular expression might target, say, only the numeric entries in a spreadsheet that are three columns away from a cell containing a word of any length whose last three letters are “BOS.”
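As a rough illustration (not drawn from the papers), the simpler part of that pattern, words whose last three letters are “BOS”, could be written in standard regular-expression syntax using Python’s `re` module; the sample text here is invented:

```python
import re

# Match any word, of any length, whose last three letters are "BOS":
#   \b   - word boundary, \w* - any run of word characters, BOS\b - ends in BOS
pattern = re.compile(r"\b\w*BOS\b")

text = "Depart LOGANBOS at noon; arrive NYCBOS later. Not BOSton."
matches = pattern.findall(text)
print(matches)  # -> ['LOGANBOS', 'NYCBOS']
```

Note that “BOSton” is not matched: the letters “BOS” must sit at the end of the word, which is exactly the kind of positional detail that is easy to state in English but awkward to read off from the symbols.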
But regular expressions, as they’re usually written, don’t map well onto natural language. For example, Kushman explains, a regular expression used to search for a three-letter word starting with “a” would have one symbol to indicate the beginning of the word, another to indicate the letter “a,” a set of symbols to indicate any single letter, and a set of symbols to indicate that the previous operation should be repeated twice. “If I try to do the same syntactic mapping that I would normally do,” Kushman says, “I can’t extract any subfragment of that that would indicate ‘three-letter.’”
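The search Kushman describes might look like the following in conventional regex syntax (a minimal sketch, not taken from the paper). Notice that no one piece of the pattern corresponds to the phrase “three-letter”:

```python
import re

# A three-letter word starting with "a":
#   \b     - word boundary (the start of a word)
#   a      - the literal letter "a"
#   [a-z]  - any single lowercase letter
#   {2}    - repeat the previous token twice
pattern = re.compile(r"\ba[a-z]{2}\b")

words = "ask the cat and all ants about art"
print(pattern.findall(words))  # -> ['ask', 'and', 'all', 'art']
```

The notion “three letters” is split between the literal “a” and the repetition count “{2}”, so there is no single subfragment a direct syntactic mapping could latch onto.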
However, Kushman and Barzilay found that every regular expression has an equivalent that hews much closer to natural language – though it may not be very concise or, to a programmer, very intuitive. Moreover, using a mathematical construct called a graph, it is possible to represent all equivalent versions of a regular expression at once. Kushman and Barzilay’s system therefore needs to learn only one straightforward way of mapping natural language to symbols; it can then use the graph to find a more concise version of the same expression.
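A small example of such an equivalence, assumed here for illustration rather than taken from the paper: a verbose pattern that mirrors the English phrase “the letter ‘a,’ then a letter, then a letter” almost token by token, and a concise pattern using a repetition operator. Both accept exactly the same strings:

```python
import re

# Verbose form: mirrors "a, then a letter, then a letter" word by word.
verbose = re.compile(r"\ba[a-z][a-z]\b")
# Concise form: the version a programmer would actually write.
concise = re.compile(r"\ba[a-z]{2}\b")

text = "ask and all art apple"
# The two expressions are equivalent: they find the same matches.
print(verbose.findall(text) == concise.findall(text))  # -> True
```

A system that learns the straightforward mapping to the verbose form can then search the space of equivalent expressions for the concise one.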
When Kushman presented the paper he co-authored with Barzilay, he asked the assembled computer scientists to write down a regular expression for a fairly simple text search. When he revealed the answer and asked how many of them had gotten it right, only a few hands went up. So the system could be useful for experienced programmers, but it could also allow ordinary users of, say, spreadsheets and word processors to specify complex searches using natural language.
Opening Gambit
The system that Barzilay, Rinard, Lei, and Long developed can automatically write input parsers, a vital component of all software applications. Each application has an associated file type – .doc for Word, .pdf for document viewers, .mp3 for music players, and so on – and each file type organizes data differently. For example, an image file might start with a few bits indicating the file type, a few more bits indicating the width and height of the image, and a few more bits indicating the number of bits assigned to each pixel, before moving on to the bits that actually represent the pixel colors.
Input parsers determine which parts of a file contain which types of data: Without an input parser, a file is just a random string of zeros and ones.
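A hand-written input parser for a header like the one sketched above might look like this. The format here (a “toy image” with a magic number, then width, height, and bits per pixel as 16-bit integers) is invented purely for illustration; it is not one of the formats the researchers worked with:

```python
import struct

# Toy image header (invented for illustration): a 2-byte magic number,
# then width, height, and bits-per-pixel as unsigned 16-bit big-endian
# integers, followed by the raw pixel bytes.
HEADER = struct.Struct(">2sHHH")

def parse_image(data: bytes) -> dict:
    magic, width, height, bpp = HEADER.unpack_from(data)
    if magic != b"IM":
        raise ValueError("not a toy-image file")
    pixels = data[HEADER.size:]  # everything after the header is pixel data
    return {"width": width, "height": height, "bpp": bpp, "pixels": pixels}

# Build a sample file in memory: a 4x2 image at 8 bits per pixel.
sample = struct.pack(">2sHHH", b"IM", 4, 2, 8) + bytes(8)
print(parse_image(sample)["width"])  # -> 4
```

Without this kind of parser, the same bytes are just an undifferentiated string of zeros and ones; the parser is what assigns each region of the file a meaning.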
The MIT researchers’ system can write input parsers from natural-language specifications. They tested it on more than 100 examples drawn from the Association for Computing Machinery’s International Collegiate Programming Contest, which publishes file specifications for every programming challenge it poses. The system was able to produce working input parsers for about 80 percent of the specifications. In the remaining cases, changing just one or two words of the specification usually yielded a working parser.
“You could use this as an interactive tool for the programmer,” Long says. “The programmer could look at these cases and see what kind of changes they need to make to the natural language—maybe a word is hard for the system to understand.”
The system starts with minimal information about how written specifications might correspond to parsing programs. It knows a handful of words that should consistently refer to particular data types – the word “integer,” for example – and it knows that the specification will probably describe some data structures nested inside others: an image file, say, might consist of many chunks, each of which begins with a few bytes indicating its size.
Otherwise, the system simply tries out lots of different interpretations of the specification on a few sample files; in the researchers’ experiments, the samples, too, were posted on the competition website. If the resulting parser fails on some samples, the system slightly revises its interpretation of the specification. Moreover, as it builds more and more working parsers, it becomes increasingly adept at recognizing patterns in the way parsers are specified. The system needed only about 10 minutes of computation on an ordinary laptop to generate candidate parsers for all 100-odd specifications.
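The generate-and-test loop described above can be caricatured in a few lines. This is a schematic sketch under invented assumptions, not the researchers’ code: the real system proposes interpretations of an English specification, whereas here the “candidates” are just three hard-coded ways of reading a line of text:

```python
# Schematic sketch of a generate-and-test loop: propose candidate
# parsers, keep the first one that handles every sample input.

def candidate_parsers():
    # Hypothetical candidate interpretations of "each line holds numbers":
    yield lambda line: [int(line)]                          # one integer per line
    yield lambda line: [int(x) for x in line.split(",")]    # comma-separated
    yield lambda line: [int(x) for x in line.split()]       # whitespace-separated

def find_working_parser(samples):
    for parser in candidate_parsers():
        try:
            # A candidate survives only if it parses every sample line.
            if all(parser(line) is not None for line in samples):
                return parser
        except ValueError:
            continue  # this interpretation fails on a sample; try the next
    return None

samples = ["1 2 3", "4 5"]
parser = find_working_parser(samples)
print(parser("7 8 9"))  # -> [7, 8, 9]
```

The first two candidates raise errors on the sample files and are discarded; only the whitespace-separated interpretation survives, which is the essence of refining an interpretation until it fits the evidence.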
“This is a big first step toward enabling everyday users to program computers without having to know a programming language,” says Luke Zettlemoyer, an assistant professor of computer science and engineering at the University of Washington. “The techniques they developed should certainly generalize to other related programming tasks.”