Monday, December 23, 2024

Finding a good read among billions of choices

With billions of books, news stories, and documents available online, there has never been a better time to read, if only you had time to sift through all the options. "There's a lot of text on the internet," says Justin Solomon, an assistant professor at MIT. "Anything to help cut through all that material is incredibly useful."

Working with the MIT-IBM Watson AI Lab and his Geometric Data Processing Group at MIT, Solomon recently presented a new technique for cutting through massive amounts of text at the Conference on Neural Information Processing Systems (NeurIPS). Their method combines three popular text-analysis tools, topic modeling, word embeddings, and optimal transport, to deliver better and faster results than competing methods on a popular benchmark for classifying documents.

If an algorithm knows what you liked in the past, it can scan millions of possibilities for something similar. As natural-language processing techniques improve, those suggestions get faster and more accurate.

In the method presented at NeurIPS, an algorithm summarizes a collection of, say, books into topics based on words commonly used across the collection. It then divides each book into its five to 15 most important topics, with an estimate of how much each topic contributes to the book overall.
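
For a concrete picture of this topic-modeling step, here is a minimal sketch using scikit-learn's latent Dirichlet allocation. The tiny corpus, the choice of library, and every parameter are illustrative assumptions, not the setup used in the paper.

```python
# Sketch of the topic-modeling step: fit a topic model over a small corpus
# and keep each document's most significant topics. Library and parameters
# are illustrative only.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "the fleet sailed across the sea under heavy guns",
    "fossils in sedimentary rock record the history of life",
    "the syndicate built new weapons for the coming war",
]

# Bag-of-words counts feed the topic model.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)

# Summarize the collection into a handful of topics.
lda = LatentDirichletAllocation(n_components=5, random_state=0)
doc_topics = lda.fit_transform(X)   # rows: documents, columns: topic weights

# Keep only each document's most important topics (here, the top 3).
top_k = 3
for i, weights in enumerate(doc_topics):
    top = np.argsort(weights)[::-1][:top_k]
    print(f"doc {i}: topics {top.tolist()}, weights {weights[top].round(2).tolist()}")
```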

To compare books, the researchers draw on two other tools: word embeddings, a technique that turns words into lists of numbers reflecting their similarity in popular usage, and optimal transport, a framework for calculating the most efficient way to move objects, or data points, among multiple destinations.
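
A rough sketch of these two ingredients appears below: made-up word "embeddings" and an optimal transport cost between two topics' word distributions, computed here with the POT (Python Optimal Transport) library. The vectors, topics, and uniform word weights are invented for illustration and are not the paper's actual data.

```python
# Sketch: toy word embeddings plus an optimal transport cost between two
# word distributions. Embedding values are made up; real systems would use
# pretrained vectors such as word2vec or GloVe.
import numpy as np
import ot  # POT: Python Optimal Transport

embeddings = {
    "ship":   np.array([0.9, 0.1, 0.0]),
    "sea":    np.array([0.8, 0.2, 0.1]),
    "war":    np.array([0.1, 0.9, 0.2]),
    "fossil": np.array([0.0, 0.2, 0.9]),
    "rock":   np.array([0.1, 0.1, 0.8]),
}

topic_a = ["ship", "sea", "war"]      # e.g. a nautical/war topic
topic_b = ["fossil", "rock", "sea"]   # e.g. a geology topic

# Uniform weight on each word within a topic.
a = np.full(len(topic_a), 1.0 / len(topic_a))
b = np.full(len(topic_b), 1.0 / len(topic_b))

# Ground cost: Euclidean distance between word embeddings.
M = np.array([[np.linalg.norm(embeddings[u] - embeddings[v])
               for v in topic_b] for u in topic_a])

# Optimal transport cost: the cheapest way to move topic_a's words onto topic_b's.
cost = ot.emd2(a, b, M)
print(f"transport cost between topics: {cost:.3f}")
```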

Word embeddings make it possible to use optimal transport twice: first to compare topics across the collection as a whole, and then, within any pair of books, to measure how closely their common topics overlap.
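
Under the same toy assumptions as the previous sketch, the snippet below shows how the two levels can nest: topic-to-topic distances (for example, transport costs between topics' word-embedding clouds) become the ground cost for a second optimal transport problem between two books' topic weights. The numbers are placeholders chosen only to show the shape of the computation.

```python
# Sketch of the nested (hierarchical) use of optimal transport: book-level
# distance computed over topic weights, with topic-level distances as the
# cost of moving weight between topics. All values are illustrative.
import numpy as np
import ot

# Each book's weights over the same three collection-level topics.
book_1 = np.array([0.6, 0.3, 0.1])   # e.g. mostly nautical
book_2 = np.array([0.1, 0.2, 0.7])   # e.g. mostly geology

# Pairwise topic distances, e.g. transport costs between topics' word clouds
# (placeholder values here).
topic_dist = np.array([
    [0.0, 0.4, 0.9],
    [0.4, 0.0, 0.7],
    [0.9, 0.7, 0.0],
])

doc_distance = ot.emd2(book_1, book_2, topic_dist)
print(f"book-to-book distance: {doc_distance:.3f}")
```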

This technique works especially well when scanning large collections of books and long documents. In the study, the researchers cite the example of Frank Stockton's "The Great War Syndicate," a 19th-century American novel that anticipated the rise of nuclear weapons. If you are looking for a similar book, a topic model would help you identify the dominant themes it shares with other books: in this case, nautical, elemental, and war themes.

But a topic model alone would not identify Thomas Huxley's 1863 lecture "The Past Condition of Organic Nature" as a good match. Huxley was a supporter of Charles Darwin's theory of evolution, and his lecture, peppered with mentions of fossils and sedimentation, reflected emerging ideas about geology. When the themes of Huxley's lecture are matched with Stockton's novel through optimal transport, cross-cutting motifs emerge: Huxley's themes of geography, flora/fauna, and lore correspond closely to Stockton's nautical, elemental, and war themes, respectively.

Modeling books by their representative topics, rather than by individual words, makes high-level comparisons possible. "If you ask a person to compare two books, they break each one down into easy-to-understand concepts and then compare the concepts," says the study's lead author, Mikhail Yurochkin, a researcher at IBM.

The study shows that this approach makes comparisons faster and more accurate. The researchers compared 1,720 pairs of books from the Project Gutenberg dataset in one second, more than 800 times faster than the next-best method.

In addition to categorizing documents faster and more accurately, the method offers insight into the model's decision-making process. Through the list of topics it surfaces, users can see why the model recommends a particular document.

The study's other authors are Sebastian Claici and Edward Chien, a graduate student and a postdoc, respectively, in MIT's Department of Electrical Engineering and Computer Science and the MIT Computer Science and Artificial Intelligence Laboratory, and Farzaneh Mirzazadeh, a researcher at IBM.
