In a recent conversation, Joshua Schrier, Ph.D., spoke about the science that has been taking place in his lab since he became the first Kim B. and Stephen E. Bepler Chair in Chemistry last fall. His research blends three scientific branches: quantum mechanics, chemistry, and computational science. This month, his research on biases in chemical reaction data was published in the journal Nature. Fordham News spoke to him about his work.
The Bepler endowed chair has allowed you more time to tackle research projects. That includes your $7.4 million project, “Discovering reactions and uncovering mechanisms of perovskite [mineral] formation,” funded by the Defense Advanced Research Projects Agency. Tell me about this project.
Most experiments are designed, conducted, and interpreted by humans. The goal of this project is to create the capability for machine-specified experiments, so that computer algorithms can select new experiments to perform and accelerate the scientific discovery process.
What does “machine-specified” mean?
We want to give computer algorithms the ability to perform experiments in the real world. But to do this, we need to make sure that the specifications of what to do are completely unambiguous. Humans are pretty good at working with imprecise instructions. If I say, “Hey, let’s go to the zoo,” you would infer it’s the Bronx Zoo and that “us” includes you and me and other individuals within earshot. But a computer is not going to know what I meant: What zoo? What entrance? How do we get there? Who should go? We are working to develop software that allows people (or computers) to specify the experiments they want to be performed. The software turns that into an unambiguous set of instructions. This might include a mixture of instructions for human operators and for machines—just like the way that specifying where you want to go in an Uber ride fills in the details of how to get there. Finally, we want to make it easy to collect all of the things that happen during the process so that we can learn from that data.
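The kind of ambiguity being removed can be illustrated with a toy specification object. (The field names, reagents, and values below are hypothetical, not the project’s actual schema.) The point is that every quantity carries explicit units and every step is explicit, so nothing is left to inference:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReagentStep:
    reagent: str        # an unambiguous chemical identifier, not "the usual solution"
    volume_ul: float    # explicit units (microliters), not "a few drops"

@dataclass(frozen=True)
class ExperimentSpec:
    steps: tuple        # ordered ReagentStep instances
    temperature_c: float
    stir_seconds: int

# A fully specified (hypothetical) experiment: no implicit defaults to guess at.
spec = ExperimentSpec(
    steps=(ReagentStep("PbI2 solution", 500.0),
           ReagentStep("methylammonium iodide solution", 250.0)),
    temperature_c=95.0,
    stir_seconds=900,
)
```

Because nothing is implicit, the same specification could in principle be dispatched to a robot or handed to a human operator without reinterpretation.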
Like programming a self-driving car, but for science experiments?
Yeah. That’s the high-level goal: a “self-driving” or autonomous laboratory. Just like a self-driving car, we have to be able to “steer” the experiments (specify what to do) and “see” the world. So we are also collecting as much information as we can about everything that happens in the laboratory, so that the algorithms can make sense of what is happening when devising new experiment plans. Experiment specifications are the steering wheel, so to speak. As new experiments are performed, machine-learning models get trained on the new data. This is a general problem across many areas of science: how do we use data to more efficiently get scientific insight? Because of the scale of the data, we use algorithms to sift through it, identify anomalies, and use the insights latent in that data to devise the next round of experiment plans.
There’s another part to your project: using this “self-driving laboratory” to develop as many different types of perovskites—minerals that help create solar cells—as possible, and then identify the most useful perovskites.
Yes. Essentially, what we’ve cooked up—in collaboration with researchers at Lawrence Berkeley National Laboratory and Haverford College—is a way to do these types of [perovskite] syntheses using commercially available laboratory robots. More specifically, organohalide perovskite materials are hybrid materials that have both organic and inorganic building units—and changing these changes their electronic and optical properties. As a result, there is a general interest in using perovskites for high-performance, low-cost solar cells. We are using the robotic system [called RAPID] to try to discover new materials that will have higher performance. But just to be clear, our focus for now is on discovering new compounds. We don’t yet build devices from these discoveries, although we are expanding work in that direction [in collaboration with researchers from MIT]. It would be neat if we also found some really great high-performance perovskites—but even if we do not, we’ll still be able to learn rules about how they form, and demonstrate this toolbox, which can be applied to other scientific problems.
Another ongoing research project is the National Science Foundation-funded “Dark Reaction Project.” What is that about?
“Dark reactions” sounds mysterious, right? But it’s a simple idea. Most of the experiments performed in laboratories are never reported. Journals tend to publish only a single example of “success.” So this vast, unreported collection of marginal successes and failures never gets exposed to the world. By analogy to the astronomer’s “dark matter,” we like to think of “dark reactions” as the vast majority of scientific experiments that aren’t seen directly [in journal articles] yet still influence scientists’ decisions in non-obvious ways.
The good news is that scientists keep good laboratory notebooks, so the “dark reactions” are in principle available. This project is an initiative to harness the unpublished failures and marginal successes [dark reactions] in laboratory notebooks, turn them into digital data, and use that to advance hydrothermal synthesis of oxides. Once you digitize the results, you can use that database to build a machine-learning model. With that machine-learning model, you can recommend reactions to perform in the laboratory.
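As a rough sketch of that pipeline (not the project’s actual model or features, which are more sophisticated): once records include failures as well as successes, even a trivial nearest-neighbor learner can score a proposed reaction by how it compares to past digitized outcomes. Feature values here are purely illustrative.

```python
def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def predict(records, candidate, k=3):
    """Majority vote over the k digitized records nearest to the candidate."""
    nearest = sorted(records, key=lambda r: distance(r[0], candidate))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

# Each record: (features, outcome). Features might encode, say,
# (pH, temperature / 100, reaction time in days) — hypothetical choices.
records = [
    ((2.0, 0.90, 1.0), 1),  # success
    ((2.2, 0.85, 1.5), 1),  # success
    ((7.0, 0.30, 0.5), 0),  # failure — a "dark reaction" that never got published
    ((6.5, 0.40, 0.5), 0),  # failure
]

print(predict(records, (2.1, 0.88, 1.2)))  # 1: near the past successes
```

The recommendation step is then just ranking untried candidates by their predicted score and sending the best ones to the lab.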
So the machine-learning model is learning from “dark reactions,” or our mistakes—what not to do?
Correct. And you can only do this if you’ve got the complete record of successes and failures.
If you look at all the published scientific literature, all you see are successes. You never see any of the failures. So if you’re trying to identify a mathematical function that divides success from failure (and that’s really all you’re doing with machine learning: finding that function), then your algorithm is going to look at all of these examples in the published literature and say, “Oh, good news, everything is successful.” Because all the examples that it sees are examples of success.
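The success-only problem can be shown with a deliberately trivial model (purely illustrative, with made-up labels): a classifier trained on published results alone has nothing to learn from and collapses to “everything works.”

```python
def train_majority(labels):
    """Degenerate 'model': always predict whichever label dominates training."""
    return 1 if sum(labels) * 2 >= len(labels) else 0

published = [1, 1, 1, 1, 1]           # journals report only the successes
notebook  = [1, 1, 0, 0, 1, 0, 0]     # a lab notebook keeps the failures too

model_pub = train_majority(published)  # can never predict failure
model_lab = train_majority(notebook)   # failures shift the decision
print(model_pub, model_lab)  # 1 0
```

The same collapse happens, less visibly, with any real classifier: without negative examples there is no boundary to find.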
Lastly, you have a paper that was recently published by Nature: “Anthropogenic biases in chemical reaction data hinder exploratory inorganic synthesis.” How does it relate to dark reactions?
This work is supported by the same project from the National Science Foundation, and is a natural continuation. “Dark reactions” are the experiments that have been tried in the laboratory, but not reported because they are “failures” or marginal successes. But what about the “extra dark” reactions that don’t even get attempted? In practice, chemical experiments are planned by human scientists and thus are subject to a variety of human cognitive biases, heuristics, and social influences that might lead to some reactions being systematically excluded. What we were able to show in this study is that such biases are present in the chemical reaction literature, and that the underrepresented reactions are not being excluded for any “good” reason—it’s not because they are more expensive, or more difficult, or more prone to failure, but rather simply because humans tend to get stuck in a rut when planning reactions. This might just be a curiosity, except for the fact that these anthropogenic (human-generated) data are now being widely used to train machine-learning models to predict chemical syntheses. The hazard is that we end up making the machine in our own image, so to speak, rather than letting it perform as well as it could. We were able to show that, indeed, human-selected experiments were inferior to randomly generated experiments for building machine-learning models, even when the humans were given far more reaction data.
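The rut effect can be sketched numerically (all numbers hypothetical, not from the paper): if human-chosen experiments cluster in a familiar corner of a reagent-combination space, they cover much less of it than the same number of unbiased random draws, which is one reason a model trained on them generalizes worse.

```python
import random

random.seed(0)
space = [(a, b) for a in range(10) for b in range(10)]  # 100 reagent pairs

# "Anthropogenic" sampling: humans keep revisiting a familiar 4x4 corner.
human = [(random.randint(0, 3), random.randint(0, 3)) for _ in range(60)]

# Unbiased sampling: the same budget of 60 draws, uniform over the space.
uniform = [random.choice(space) for _ in range(60)]

def coverage(picks):
    """Fraction of the full space that a set of experiments actually touched."""
    return len(set(picks)) / len(space)

print(coverage(human), coverage(uniform))
```

The human picks can never cover more than 16 of the 100 cells, no matter how many experiments are run, while the uniform draws spread across the whole space.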
This interview has been edited and condensed for clarity.