By examining changes in gene expression, researchers learn how cells function at the molecular level, which can help them understand the development of certain diseases.
However, humans have approximately 20,000 genes that can interact with each other in complex ways, so even knowing which groups of genes to target is an enormously complicated problem. Moreover, genes work together in modules that regulate each other.
MIT researchers have now developed theoretical foundations for methods that could identify the best way to aggregate genes into related groups, so that the underlying cause-and-effect relationships between multiple genes can be effectively understood.
Importantly, this new method achieves this using only observational data. This means that researchers do not have to conduct costly and sometimes infeasible intervention experiments to obtain the data needed to infer underlying causal relationships.
In the long term, this technique could help scientists identify, more accurately and efficiently, potential target genes that trigger specific behaviors, potentially enabling them to develop precision therapies for patients.
“In genomics, it is critical to understand the mechanism underlying cellular states. But cells have a multi-scale structure, so the level of summarization is critical, too. If you find the right way to aggregate the data you observe, the information you learn about the system should be more interpretable and useful,” says Jiaqi Zhang, a student and Eric and Wendy Schmidt Center Fellow who is a co-author of the paper on this technique.
On the paper, Zhang is joined by co-author Ryan Welch, currently a master’s student in engineering; and senior author Caroline Uhler, a professor in the Department of Electrical Engineering and Computer Science (EECS) and the Institute for Data, Systems, and Society (IDSS), who is also director of the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard and a researcher at MIT’s Laboratory for Information and Decision Systems (LIDS). The research will be presented at the Conference on Neural Information Processing Systems.
Learning from observational data
The problem the researchers set out to tackle involves learning gene programs. These programs describe which genes work together to regulate other genes in biological processes such as cell development or differentiation.
Because scientists cannot efficiently study the interactions of all 20,000 genes, they use a technique called causal disentanglement to learn how to combine related groups of genes into a representation that allows them to efficiently study cause-and-effect relationships.
In previous work, the researchers demonstrated how this can be done effectively in the presence of interventional data, that is, data obtained by perturbing variables in the network.
However, conducting intervention experiments is often expensive, and there are certain scenarios in which such experiments are either unethical or the technology is not good enough for the intervention to be successful.
With only observational data, researchers cannot compare genes before and after an intervention to learn how groups of genes function together.
“Most causal disentanglement research assumes access to interventions, so it was unclear how much information could be disentangled from observational data alone,” Zhang says.
MIT researchers have developed a more general approach that uses a machine learning algorithm to efficiently identify and aggregate groups of observed variables, such as genes, using only observational data.
They can use this technique to identify causal modules and reconstruct an accurate representation of the cause-and-effect mechanism. “Although this research was motivated by the problem of explaining cellular programs, we first needed to develop a new causal theory to understand what could and could not be learned from observational data. With this theory in hand, in future work we will be able to apply our knowledge to genetic data and identify gene modules as well as their regulatory connections,” says Uhler.
Layered representation
Using statistical techniques, the researchers can compute a mathematical function known as the variance of the Jacobian of each variable’s score. Causal variables that do not influence any subsequent variables should have a variance of zero.
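As a toy illustration of this zero-variance property (not the researchers' algorithm), consider a hypothetical two-variable model chosen here purely as an assumption: x1 is standard Gaussian, and x2 = sin(x1) plus Gaussian noise. The score is the gradient of the log density, and its Jacobian is the matrix of second derivatives. For this model the diagonal entries can be written out by hand, so the sketch below evaluates them on samples and compares their variances. The function names, the choice of sin, and the noise level are all illustrative.

```python
import math
import random
import statistics

SIGMA = 0.5  # assumed noise standard deviation for x2


def sample_toy_model(n, seed=0):
    """Draw n samples from the assumed model x1 ~ N(0, 1), x2 = sin(x1) + noise."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        x2 = math.sin(x1) + rng.gauss(0.0, SIGMA)
        samples.append((x1, x2))
    return samples


def score_jacobian_diagonal(x1, x2):
    """Diagonal of the Jacobian of the score, derived by hand for this model.

    log p(x1, x2) = -x1**2 / 2 - (x2 - sin(x1))**2 / (2 * SIGMA**2) + const
    """
    resid = x2 - math.sin(x1)
    # Second derivative in x1: depends on the sample, because x1 influences x2.
    d1 = -1.0 + (-math.cos(x1) ** 2 - resid * math.sin(x1)) / SIGMA**2
    # Second derivative in x2: a constant, because x2 influences nothing.
    d2 = -1.0 / SIGMA**2
    return d1, d2


def diagonal_variances(n=5000, seed=0):
    """Variance across samples of each diagonal entry, returned as (x1, x2)."""
    entries = [score_jacobian_diagonal(x1, x2) for x1, x2 in sample_toy_model(n, seed)]
    var_x1 = statistics.pvariance([d1 for d1, _ in entries])
    var_x2 = statistics.pvariance([d2 for _, d2 in entries])
    return var_x1, var_x2


if __name__ == "__main__":
    var_x1, var_x2 = diagonal_variances()
    print("variance for x1 (has a downstream variable):", var_x1)
    print("variance for x2 (no downstream variable):  ", var_x2)
```

In this sketch the entry for x2 is the same for every sample, so its variance is zero, while the entry for x1 changes from sample to sample. In practice the density is unknown and these quantities would have to be estimated from data, which this example does not attempt.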
The researchers reconstruct the representation in a layer-by-layer structure, starting by removing the variables in the bottom layer that have a variance of zero. They then work backward, layer by layer, removing the variables with zero variance to determine which variables, or groups of genes, are connected.
“Identifying the variances that are zero quickly becomes a combinatorial objective that is quite hard to solve, so developing an efficient algorithm that could solve it was a major challenge,” Zhang says.
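The layer-by-layer peeling order can be sketched on a small graph whose structure is already known. This is only an illustration of the ordering, under a strong assumption: the zero-variance test is replaced by a direct lookup of whether a variable still has any remaining downstream variables, which the real method would have to infer from data. The graph, the node names, and the function peel_layers are hypothetical.

```python
def peel_layers(children):
    """Peel a known directed acyclic graph from the bottom layer upward.

    children maps each node to the set of nodes it directly influences.
    Returns a list of layers, bottom layer first.
    """
    remaining = set(children)
    layers = []
    while remaining:
        # Stand-in for the zero-variance test: a node qualifies when none of
        # its children are still present among the remaining nodes.
        bottom = {v for v in remaining if not (children[v] & remaining)}
        if not bottom:
            raise ValueError("graph contains a cycle; no bottom layer exists")
        layers.append(sorted(bottom))
        remaining -= bottom
    return layers


# Hypothetical example: module A regulates B and C, which both regulate D.
example_graph = {
    "A": {"B", "C"},
    "B": {"D"},
    "C": {"D"},
    "D": set(),
}

if __name__ == "__main__":
    for depth, layer in enumerate(peel_layers(example_graph)):
        print("layer", depth, "from the bottom:", layer)
```

Here D is removed first, then B and C together, then A, mirroring the backward, layer-by-layer order described above.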
Ultimately, their method produces an abstract representation of observed data with layers of interrelated variables that accurately summarizes the underlying cause-and-effect structure.
Each variable represents an aggregated group of genes that function together, and the relationship between two variables represents how one group of genes regulates another. Their method effectively captures all the information used to determine each layer of variables.
After proving that their technique was theoretically sound, the researchers ran simulations to demonstrate that the algorithm could successfully disentangle meaningful causal representations using only observational data.
In the future, the researchers want to apply this technique to real-world genetics applications. They also want to explore how their method can provide additional insight in situations where some interventional data is available, or help researchers understand how to design effective genetic interventions. Eventually, this method could help scientists better determine which genes function together as part of the same program, which could help identify drugs that target these genes to treat specific diseases.
This research is funded in part by the MIT-IBM Watson AI Lab and the U.S. Office of Naval Research.