A new tool makes it easier for database users to perform complex statistical analyses of tabular data without needing to know what is going on behind the scenes.
GenSQL, a generative AI system for databases, can help users make predictions, detect anomalies, guess missing values, fix errors, or generate synthetic data with just a few keystrokes.
For example, if the system were used to analyze the medical records of a patient who has always had high blood pressure, it could flag a blood pressure reading that is low for that particular patient but would otherwise fall within the normal range.
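The blood pressure example can be made concrete with a minimal sketch. This is not GenSQL code and does not reflect how GenSQL detects anomalies; it only illustrates the idea of judging a reading against one patient's own history rather than a population-wide range. The readings, the 90 to 120 mmHg range, the threshold, and the function name `is_unusual_for_patient` are all assumptions made up for illustration.

```python
from statistics import mean, stdev


def is_unusual_for_patient(history, reading, threshold=3.0):
    """Return True if `reading` lies more than `threshold` standard
    deviations from the mean of this patient's own past readings."""
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # All past readings are identical, so any different value is unusual.
        return reading != baseline
    z_score = (reading - baseline) / spread
    return abs(z_score) > threshold


# Hypothetical systolic readings (mmHg) for a patient whose blood
# pressure has always been high.
history = [152, 148, 155, 150, 149, 153, 151]
new_reading = 118

# Illustrative population-wide "normal" range (an assumption, not medical advice).
in_population_range = 90 <= new_reading <= 120

# Compared with the patient's own baseline, the same reading stands out.
unusual_for_patient = is_unusual_for_patient(history, new_reading)

print(in_population_range, unusual_for_patient)
```

A check against the population range alone would pass this reading, while the per-patient comparison flags it, which is the distinction the example in the text draws.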
GenSQL automatically integrates a tabular dataset with a generative probabilistic AI model, which can account for uncertainty and adjust its decision-making based on new data.
In addition, GenSQL can be used to create and analyze synthetic data that mimics the real data in the database. This can be especially useful in situations where confidential data, such as patient medical records, cannot be shared, or when real data is sparse.
This new tool is built on top of SQL, a programming language for creating and manipulating databases that was introduced in the late 1970s and is used by millions of developers around the world.
“Historically, SQL taught the business world what a computer could do. They didn’t have to write custom programs; they just had to ask questions of a database in a high-level language. We think that as we move from just querying data to asking questions of models and data, we’re going to need an analogous language that teaches people the coherent questions to ask a computer that has a probabilistic model of the data,” says Vikash Mansinghka, senior author of the paper introducing GenSQL and principal investigator and project manager of the Probabilistic Computing Project in the Department of Brain and Cognitive Sciences at MIT.
When the researchers compared GenSQL to popular AI-based approaches to data analysis, they found that it was not only faster but also produced more accurate results. Importantly, the probabilistic models used by GenSQL are explainable, so users can read and edit them.
“Looking at the data and trying to find some meaningful patterns by using a few simple statistical rules can result in missing important interactions. You really want to capture the correlations and dependencies of variables, which can be quite complex, in the model. With GenSQL, we want to enable a large group of users to query their data and their model without having to know all the details,” adds lead author Mathieu Huot, a research scientist in the Department of Brain and Cognitive Sciences and a member of the Probabilistic Computing Project.
They were joined on the paper by Matin Ghavami and Alexander Lew, both graduate students at MIT; Cameron Freer, a research scientist; Ulrich Schaechtle and Zane Shelby of Digital Garage; Martin Rinard, an MIT professor in the Department of Electrical Engineering and Computer Science and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); and Feras Saad, an assistant professor at Carnegie Mellon University. The research was recently presented at the ACM Conference on Programming Language Design and Implementation.
Connecting models and databases
SQL, which stands for Structured Query Language, is a programming language for storing and manipulating information in a database. In SQL, people can ask questions about data using keywords, for example to sum, filter, or group database records.
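As a small illustration of summing, filtering, and grouping, the sketch below runs one standard SQL query through Python's built-in `sqlite3` module. The `developers` table, its columns, and its rows are invented for this example; the query uses ordinary SQL and none of GenSQL's extensions.

```python
import sqlite3

# In-memory database with a hypothetical table of developers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE developers (name TEXT, city TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO developers VALUES (?, ?, ?)",
    [
        ("Ana", "Seattle", 150),
        ("Ben", "Seattle", 130),
        ("Chen", "Boston", 120),
        ("Dee", "Boston", 90),
    ],
)

# WHERE filters records, GROUP BY groups them, and SUM adds up each group.
rows = conn.execute(
    """
    SELECT city, SUM(salary)
    FROM developers
    WHERE salary >= 100
    GROUP BY city
    ORDER BY city
    """
).fetchall()

print(rows)  # [('Boston', 120), ('Seattle', 280)]
conn.close()
```

The query answers a question about the records themselves. It says nothing about records that are absent from the table, which is the gap a model-backed query is meant to address.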
But querying a model can provide deeper insights because models can capture what the data means to a given person. For example, a programmer wondering whether she is underpaid is probably more interested in what the salary data means to her individually than in trends in the database records.
The researchers noted that SQL does not provide an effective way to incorporate probabilistic AI models, while approaches that use probabilistic models to draw inferences do not support complex database queries.
To fill this gap, they created GenSQL, which lets the user query both a dataset and a probabilistic model using a straightforward yet powerful formal programming language.
GenSQL users upload their data and probabilistic model, which the system automatically integrates. They can then run queries on the data that also draw on the probabilistic model running in the background. This not only enables more complex queries, but can also provide more accurate answers.
For example, a query in GenSQL might be something like: “What is the probability that a programmer from Seattle knows the Rust programming language?” Simply looking at the correlation between columns in the database may miss subtle dependencies. Incorporating a probabilistic model can capture more complex interactions.
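To give a rough sense of what answering such a question from a model involves, here is a minimal sketch. It is neither GenSQL syntax nor the kind of model GenSQL uses: a very simple Beta-Bernoulli estimate stands in for the probabilistic model, and the rows, the prior, and the function name `prob_knows_rust` are assumptions for illustration.

```python
def prob_knows_rust(rows, city, prior_yes=1.0, prior_no=1.0):
    """Posterior mean of P(knows Rust | city) under a Beta(prior_yes, prior_no)
    prior, updated with the matching rows. A toy stand-in for a real model."""
    in_city = [row for row in rows if row["city"] == city]
    yes = sum(1 for row in in_city if row["knows_rust"])
    no = len(in_city) - yes
    return (yes + prior_yes) / (yes + no + prior_yes + prior_no)


# Hypothetical survey rows.
developers = [
    {"city": "Seattle", "knows_rust": True},
    {"city": "Seattle", "knows_rust": True},
    {"city": "Seattle", "knows_rust": False},
    {"city": "Boston", "knows_rust": False},
]

# Two of three Seattle rows say yes; the prior pulls the estimate toward 0.5.
p_seattle = prob_knows_rust(developers, "Seattle")

# No rows exist for Austin, so the answer falls back on the prior alone
# instead of failing or returning an empty result.
p_austin = prob_knows_rust(developers, "Austin")

print(p_seattle, p_austin)
```

Even this toy model can answer a question about a city with no records, which a plain count over the table cannot do. A richer model would also account for dependencies among many columns at once.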
The probabilistic models GenSQL uses are also auditable, so people can see which data the model uses for decision-making. In addition, these models provide a calibrated measure of uncertainty along with each answer.
For example, with this calibrated uncertainty, if someone asks the model to predict the outcomes of different cancer treatments for a patient from a minority group that is underrepresented in the dataset, GenSQL will tell the user that it is uncertain, and how uncertain it is, rather than overconfidently advocating for the wrong treatment.
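The link between underrepresentation and uncertainty can be sketched with a simple Beta-Bernoulli model. This is an illustrative assumption, not the model GenSQL uses, and the outcome counts below are invented. The point is only that fewer observations leave a wider posterior, so the reported uncertainty grows for groups with little data.

```python
from math import sqrt


def beta_posterior(successes, failures, prior_a=1.0, prior_b=1.0):
    """Mean and standard deviation of the Beta posterior over a success
    rate, starting from a Beta(prior_a, prior_b) prior."""
    a = prior_a + successes
    b = prior_b + failures
    mean = a / (a + b)
    std = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, std


# Hypothetical treatment outcomes: many records for one group, few for another.
well_mean, well_std = beta_posterior(successes=700, failures=300)
rare_mean, rare_std = beta_posterior(successes=2, failures=1)

print(f"well represented: {well_mean:.3f} +/- {well_std:.3f}")
print(f"underrepresented: {rare_mean:.3f} +/- {rare_std:.3f}")
```

Both groups have a similar estimated success rate, but the standard deviation for the underrepresented group is more than ten times larger, which is the signal a user would need before acting on the prediction.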
Faster and more accurate results
To evaluate GenSQL, the researchers compared their system to popular baseline methods that use neural networks. GenSQL was between 1.7 and 6.8 times faster than those approaches, executing most queries in a few milliseconds while delivering more accurate results.
They also applied GenSQL in two case studies: in the first, the system identified mislabeled clinical trial data, and in the second, it generated accurate synthetic data that captured complex relationships in genomics.
Next, the researchers want to apply GenSQL more broadly to model human populations at scale. With GenSQL, they can generate synthetic data to draw conclusions about things like health and salary, while controlling what information is used in the analysis.
They also want to make GenSQL easier to use and more powerful by adding new optimizations and automation to the system. In the long run, the researchers want to enable users to make natural-language queries in GenSQL. Their goal is to eventually develop a ChatGPT-like AI expert that one could talk to about any database and that grounds its answers in GenSQL queries.
This research is funded in part by the Defense Advanced Research Projects Agency (DARPA), Google, and the Siegel Family Endowment.