Ian Foster: Exploring and Evaluating Foundation Models

Large language models aren’t just powering chatbots like ChatGPT. This type of computational model is an example of a particular flavor of artificial intelligence known as foundation models, which are trained on vast amounts of data to make inferences in new areas. Although text is one rich data source, science offers many more, from biology and chemistry to physics and beyond. Such models open up a tantalizing new set of research questions. How effective are foundation models for science? How could they be improved? Could they help researchers work on challenging questions? And what might they mean for the future of science?

This episode begins a series where we’ll explore these questions and more, talking with computational scientists about their work with foundation models and the opportunities and challenges in this exciting, rapidly changing area of research. We’ll start by talking with Ian Foster of Argonne National Laboratory and the University of Chicago about AuroraGPT, a foundation model being developed for science and named for Argonne’s new exascale computer.

From the episode:

Foundation models are machine learning algorithms that power a range of generative artificial intelligence tools. The term was coined by Stanford University researchers in a 2021 report: On the Opportunities and Risks of Foundation Models.

AuroraGPT is an Argonne National Laboratory project examining how Department of Energy supercomputers could help researchers understand and develop foundation models for science. It is named for the Aurora supercomputer, the exascale system housed at the Argonne Leadership Computing Facility.

We discussed the potential role of foundation models as subject matter experts, an idea that was described in the Department of Energy’s AI for Science, Energy and Security Report 2023.

Ian mentioned areas where foundation models are showing promise for science: short-to-medium term weather forecasting using FourCastNet and protein structure prediction using AlphaFold.

Transcript

SPEAKERS

Sarah Webb, Ian Foster

Sarah Webb  00:00

Welcome to season six of Science in Parallel, a podcast about people and projects in computational science. I’m your host, Sarah Webb, and this series will focus on artificial intelligence, specifically foundation models, and how they might be applied in science. Large language models, such as those that support ChatGPT and other similar applications, are one example of this type of AI model.

Sarah Webb  00:48

In this episode, you’ll hear from Ian Foster, a scientist at Argonne National Laboratory, where he directs the Data Science and Learning Division. He’s also a professor at the University of Chicago. He’ll be our initial guide to this research area and talk about AuroraGPT, a project to construct a foundation model for science. The project is named for Argonne’s exascale computer, which became fully available for research use in January.

Sarah Webb  01:19

This episode combines two interviews that took place in January and February. The first one happened before DeepSeek’s results generated big tech news and shook financial markets. We’ll discuss that toward the end of the episode. Join us for a conversation that describes more about what foundation models are, what they could offer researchers and how they might contribute to new scientific discoveries.

Sarah Webb  01:44

Ian, I’m really looking forward to talking to you today. Thank you for coming on the podcast.

Ian Foster  01:49

It’s my pleasure. Thanks for having me.

Sarah Webb  01:51

AI and foundation models are big buzzwords right now. Can you give a little bit of context for how you got to this point and the kinds of things that you’ve been interested in over your career?

Ian Foster  02:03

I’ve always been interested in how we can use computers, especially high-performance computers, to accelerate discovery, to enhance the abilities of scientists to tackle challenging problems. And that has led me into many different things over the years. I started out working in AI methods, got involved in high-performance simulation, did a lot of work with organization and analysis of large amounts of data. And all of those really come together when we start talking about modern AI methods, generative AI and foundation models.

Sarah Webb  02:41

I feel like large language models with ChatGPT and things like that are really dominating the conversation about AI. What do you think has made all of that so compelling, and how much of it is the external world hyping ChatGPT versus the power of what foundation models can actually do?

Ian Foster  03:00

It’s a bit of all of the above. We’ve seen some remarkable developments in technology just over the last few years, and that comes from a number of different factors. First of all, certain developments in algorithms, new AI methods, methods for training large models. Secondly, the increased amount of computing that we’ve been able to bring to bear on large language model and other model training. And those together have led to some really remarkable capabilities. And many of us are engaged in trying to understand what those new methods can do for science. And as with many new technologies that have arisen over the years, we end up trying all sorts of things and often being surprised by what works, and sometimes being surprised by what doesn’t work.

Sarah Webb  03:49

What makes large language models powerful and exciting?

Ian Foster  03:53

So a large language model, which is an example of a broader class of entity called a foundation model, is basically a large, general-purpose AI model trained on a very large and diverse data set. In the case of a large language model, the data set is large quantities of natural language text; in the broader class of foundation models, it could also include images, data sets and other sorts of information. These models are trained by an iterative process in which the values for billions of parameters are set in ways that allow them to predict the possible future outcomes of patterns in the data. And as part of that, they’re able to learn these very rich internal representations of language, or, in the broader case, images, computer code, etc. And these internal representations then allow them to be adapted or fine-tuned for a wide range of downstream tasks like text classification, question answering, image captioning, computer coding and so forth. And what is impressive about the foundation models, the large language models of today, is that they seem to be able to address a very wide range of downstream tasks. It used to be that you’d build a machine learning model for one purpose, and then you’d have to build a new one for another purpose. But we seem to be able to use these most recent models for many different purposes, which is very exciting.
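A minimal sketch of the next-token objective Ian describes, written in plain PyTorch with a toy corpus and a tiny recurrent model standing in for a transformer. None of this is AuroraGPT code; it just illustrates how parameters are adjusted iteratively to predict what comes next in the data. Fine-tuning reuses essentially the same loop, starting from already-trained weights and more specialized data.

```python
# Toy next-token training loop (assumes PyTorch is installed).
import torch
import torch.nn as nn

corpus = "proteins fold into structures that determine their function. " * 50
chars = sorted(set(corpus))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in corpus])   # character tokens

class TinyLM(nn.Module):
    def __init__(self, vocab, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)  # stand-in for a transformer
        self.head = nn.Linear(dim, vocab)              # scores for the next token

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))
        return self.head(h)

model = TinyLM(len(chars))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):
    starts = torch.randint(0, len(data) - 33, (16,)).tolist()
    x = torch.stack([data[s:s + 32] for s in starts])      # input tokens
    y = torch.stack([data[s + 1:s + 33] for s in starts])  # same text shifted by one
    logits = model(x)                                       # (batch, seq, vocab)
    loss = loss_fn(logits.reshape(-1, len(chars)), y.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()

print("final next-token loss:", round(loss.item(), 3))
```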

Sarah Webb

So basically, ChatGPT is one slice of this larger foundation model piece.

Ian Foster  05:15

Yeah, so ChatGPT, to be precise, is an application, a chatbot built on top of a foundation model developed by this company, OpenAI. They’ve collected a very large amount of text, mostly from the web, but presumably also from many other sources, to train a foundation model that they then use to implement ChatGPT.

Sarah Webb  05:55

Let’s move into this foundation model for science. How are you thinking about foundation models and the data that’s going into them? How, in this science space, are we thinking about training these multi-purpose models as you’ve been describing them?

Ian Foster  06:12

Yeah, that’s a fascinating question, in a way, as they say, the $64,000 or maybe now the $64 billion question. So we have models like ChatGPT and others that we know can be very effective at quite a few different things. We also know that they’re not as good at some things as we’d like them to be. We suspect that is because they haven’t been trained on the scientific data that we have access to. And so one thing we want to do, that we are doing, is trying to build models that are either directly trained on scientific data that wouldn’t have been seen by ChatGPT, or alternatively, fine-tune models, as they say, further refine them, to know about scientific data.

Sarah Webb  07:00

So at this point, what are a couple of examples of those types of gaps that you’re talking about, where the model doesn’t quite do what we would like it to do?

Ian Foster  07:09

So we’re in the midst of many experiments, both at Argonne and with other collaborators at other national labs, trying these models on many different problems to learn what they’re good at and what they’re not so good at. In areas like, say, biology, we find that they know quite a bit about some aspects of, say, cellular metabolism, but when you get into details, they clearly don’t have the knowledge that we’d like them to have. So we’re trying to work out how to collect data from various sources, the literature, but also scientific datasets to teach them that information.

Sarah Webb  07:45

The DOE report that came out, I believe it was 2023, the AI for Science, Energy and Security report. This quote struck me: “Foundation models are intended to become the digital equivalent of a subject matter expert.” What do you see that meaning or possibly meaning depending on what we learn about, what foundation models are capable of?

Ian Foster  08:06

What we see with ChatGPT and similar systems is that, by training them on very large amounts of data, we give them a very wide breadth of knowledge. What we want to be able to do is to fine-tune them on more specialized, domain-specific data to give them really expert-level capability in specific areas. And it could be materials properties or molecular properties of various systems. And if we get this combination of general knowledge and domain-specific knowledge, we’ll end up with something that works a bit like a subject matter expert. You’d like these models to be able to function as a very knowledgeable colleague that you can ask questions about some area that you want to investigate, perhaps a colleague who knows more about that specific area than you do, and is able to at least generate plausible ideas that you can then investigate further, perhaps in conversation with that expert, or perhaps by other means.

Sarah Webb  09:05

So let’s talk a little bit more about the kinds of things that you’re working on that are going on at Argonne. One core project that I’ve heard a little bit about is AuroraGPT. But let’s talk a little bit about what you’re thinking about at Argonne, and a little bit more about AuroraGPT.

Ian Foster  09:23

Argonne has really been a pioneer in applying computational methods to advance scientific discovery, and that has actually involved work in AI of various sorts since the 1960s, certainly the development of very powerful computers and computational methods and their application in different domains. We also, of course, have a lot of scientists and engineers with very specialized knowledge and unique data sets, so we want to pull these things together to see how we can contribute to the development and application of foundation models. Now it turns out these foundation models, large language models, involve a number of different components. So first of all, of course, data, unique data that is used to train them. Secondly, massive amounts of computing and the ability to apply that computing. And thirdly, applications in which these new models can be applied. And so we have people at Argonne with expertise in all of those areas. And we also have, in the form of Aurora, our exascale computer, a very powerful and in some ways unique system. So the AuroraGPT project is us pulling these elements together to address this challenge of foundation models for science.

Sarah Webb  10:40

So what’s the scope of that? Because it seems like there are so many directions in which you could go as you’re thinking about that.

Ian Foster  10:48

We, of course, have very good people and powerful computers, but you realize that industry is investing millions of dollars in these problems. So where can we make a unique contribution? Well, we’re doing a few things. One is building, via experiment, experience in this whole foundation model training and application pipeline. So we’re focusing on particular areas so far, materials, biology and climate, with some work in astronomy, and then building a complete foundation model training pipeline. So that means collecting data, preparing it, actually training foundation models on Aurora, working out how to perform tasks called post-training and fine-tuning, which perhaps we don’t need to go into, putting a lot of effort into evaluation, which I’d like to say more about, and then working out how to apply the resulting models to various applications.

Ian Foster  11:42

I think the evaluation question is really interesting. So you build a large language model, and you want to know: is it better, more useful, than, say, ChatGPT or Llama, or any of the other models that are out there? So how do you ask and answer that question? It turns out that the machine learning community, the AI community, has built many benchmark sets for studying the performance of these models, but most of them are not focused on scientific reasoning. So we have to start with very little ability to determine whether a model that we might create is actually better than another model. So we’re putting a lot of effort into the question of how to evaluate models for science, and we’re doing that a few ways. One is we’re building specialized, basically, question-answer data sets that test the knowledge of these models about scientific topics. Then we’re building methods for evaluating their ability to perform scientific reasoning. And then we’re doing, first of all, lab-style evaluations, and secondly, field-style evaluations, in which we engage real scientists in trying to solve problems with and without these large language models as research assistants and seeing how well they do with and without the large language models.
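A minimal sketch of the question-answer style of evaluation Ian mentions: present multiple-choice science questions and score the replies. The questions and the ask_model() stub are illustrative placeholders, not Argonne’s evaluation suite; a real run would call an actual model.

```python
# Toy multiple-choice evaluation harness.
from dataclasses import dataclass

@dataclass
class Item:
    question: str
    choices: list[str]
    answer: str  # letter of the correct choice

BENCH = [
    Item("Which force holds protons and neutrons together in a nucleus?",
         ["A) gravity", "B) the strong nuclear force", "C) friction"], "B"),
    Item("What molecule carries genetic information in most organisms?",
         ["A) DNA", "B) glucose", "C) ATP"], "A"),
]

def ask_model(prompt: str) -> str:
    """Placeholder: swap in a call to a real local or hosted model."""
    return "A"  # trivial stand-in that always answers 'A'

def evaluate(items):
    correct = 0
    for item in items:
        prompt = (item.question + "\n" + "\n".join(item.choices) +
                  "\nAnswer with a single letter.")
        reply = ask_model(prompt).strip().upper()
        correct += reply.startswith(item.answer)
    return correct / len(items)

print(f"accuracy: {evaluate(BENCH):.2f}")
```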

Sarah Webb  12:59

That last idea is really, really interesting to me. So does the same group of scientists, say, work on a problem in one way or the other way? How does that experiment play out?

Ian Foster  13:09

Yeah, so the first thing we’re doing, actually, is not with and without, but just getting people to tackle hard problems and report on their experiences, and seeing whether they are able to develop solutions to problems that perhaps were not known to them ahead of time. One technique that one of my colleagues, Franck Cappello, and his colleagues have been pioneering is one in which you take a result from a recent scientific paper and see whether a scientist who hasn’t read the paper is able to reproduce that result with the help of a large language model. For example, I think one problem they worked on was checkpointing algorithms. These are algorithms that will save the state of a program while it’s running, and that can operate asynchronously, so that not every node has to work together. And one group recently published a paper describing an asynchronous checkpointing algorithm that was more efficient than previous ones. And the question was, could someone who wasn’t aware of that work come up with the same algorithm with the help of a large language model? And it turned out they could do reasonably well at it. So that taught us something.
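A minimal sketch of the asynchronous checkpointing idea being described: a background thread writes a snapshot of the program state to disk while the main computation keeps going. This illustrates the concept only; it is not the algorithm from the paper Ian refers to, and the state and file name are made up.

```python
# Toy asynchronous checkpointing: snapshot state, write it off the critical path.
import copy
import json
import threading

def write_checkpoint(state: dict, path: str) -> None:
    with open(path, "w") as f:
        json.dump(state, f)

def simulate(steps: int, checkpoint_every: int = 100):
    state = {"step": 0, "value": 0.0}
    writer = None
    for step in range(1, steps + 1):
        state["step"] = step
        state["value"] += 0.5                  # stand-in for real computation
        if step % checkpoint_every == 0:
            if writer is not None:
                writer.join()                  # don't overlap writes of the same file
            snapshot = copy.deepcopy(state)    # consistent copy of the current state
            writer = threading.Thread(
                target=write_checkpoint, args=(snapshot, "checkpoint.json"))
            writer.start()                     # I/O proceeds while the loop continues
    if writer is not None:
        writer.join()
    return state

print(simulate(1000))
```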

Sarah Webb  14:13

Okay, I see. So it’s about process. Yes, got it. There are lots of steps that you were talking about here to make this come together. How is this progressing?

Ian Foster  14:25

Yeah, I would emphasize first of all that we view the project as a learning and research activity. The goal was not necessarily to produce a competitor to ChatGPT; it’s to learn by doing. What we want to learn is where DOE labs, and science in general, can do unique things that will advance scientific discovery. And so I think we’re learning a lot about this question of evaluation. We’re learning things about how to prepare scientific data to use it to teach models new things. And we’re learning things about how to use these newly trained models in scientific research.

Ian Foster  15:05

One way you can think of the training process is that you’re trying to teach something. It’s, I think, a mistake to anthropomorphize these things too much. But if you think of it as an uneducated child, you’re trying to teach these models things that you would like your research assistant to know. So how do you teach things about biology or physics? And then how do you evaluate whether your teaching has been successful? We do that in educational institutions with people, right? Well, how do we do it? We lecture to them, we get them to solve problems; then we evaluate them in silly ways, like multiple choice tests and so forth. So we want to do similar things, but hopefully better things, in teaching these models, and then we want to put them to work and see to what extent they’re actually helpful to research scientists.

Sarah Webb  15:55

Yeah, so we were talking a little bit about this evaluation piece, but I think maybe I want to take a step back and talk a little bit about the data piece that you were talking about. How do you have to think about the data in all of these pretty different subject areas as you’re thinking about bringing this into a foundation model and making it something that the algorithms, the computers, can work with?

Ian Foster  16:19

We’re still learning a lot there, but here are some observations. Large language models learn by being presented with lots of natural language text, and they learn patterns within that text. And you know, the wonderful thing about most natural language text is that it’s been created to communicate something to someone. And so by seeing lots of that, you learn how people communicate and what they want to communicate. So in the sciences, of course, we have scientific articles, we have perhaps textbooks, if we can get access to those, that are all designed with this communication in mind. But then we also have lots of scientific data which was not designed to communicate. It’s just observations of the natural world, or perhaps simulations of natural phenomena. So how do we extract information from that, and what do we want to extract?

Ian Foster  17:09

These are things that I think we’re still learning. In some specialized domains like protein structure or weather forecasting, people have worked out how to build predictive models based on these same transformer-based techniques that are used in large language models. So you get things like FourCastNet, which is very good at short-to-medium-range weather forecasting. You’ve got AlphaFold, which is wonderful at folding proteins. We think we’re going to see many similar models in other domains of science. But then we’d also like to take other sorts of databases and use those to train models. So, for example, we’ve got huge amounts of data about cellular metabolism. One approach we’re trying is turning those into English-language narratives and then using those narratives to train the model. We’re still getting experience of how that works.
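A minimal sketch of that data-to-narrative step: render structured records as English sentences a language model could be trained on. The example records and sentence template are illustrative, not Argonne’s metabolism datasets or wording.

```python
# Toy conversion of structured records into English-language training text.
records = [
    {"enzyme": "hexokinase", "substrate": "glucose",
     "product": "glucose-6-phosphate", "pathway": "glycolysis"},
    {"enzyme": "pyruvate kinase", "substrate": "phosphoenolpyruvate",
     "product": "pyruvate", "pathway": "glycolysis"},
]

def to_narrative(rec: dict) -> str:
    return (f"In the {rec['pathway']} pathway, the enzyme {rec['enzyme']} "
            f"converts {rec['substrate']} into {rec['product']}.")

training_text = "\n".join(to_narrative(r) for r in records)
print(training_text)
```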

Sarah Webb  17:55

I want to go back to the evaluation piece a little bit, because there’s a question of hallucinations and other things that can come out of these models. How are you thinking about those issues and assessing uncertainty in these models as well?

Ian Foster  18:11

Yeah, no, those are really, of course, interesting and important questions. One thing to point out is hallucinations are usually viewed as a bad thing, so if you ask a factual question and you get the wrong answer, that is disappointing and may even be dangerous in some cases. On the other hand, if you’re asking a model to come up with possible hypotheses that would explain some data, then in a sense, you want them to hallucinate, but you want them to hallucinate in an intelligent manner. So the point there is, I guess, the evaluation techniques you use will vary according to the intended application.

Sarah Webb  18:44

Basically, you might want a model to pull things together in an unusual way, because that might make you look at a problem in a new way.

Ian Foster  18:53

I mean, people debate what human creativity is. But, you know, I think one important part of it is this ability to combine ideas from different domains, constrained by knowledge of physical laws, and come up with possible explanations for things, which you then might go out and evaluate via experiment or simulation. And so one area in which we want to apply these models is in the generation of hypotheses and then perhaps also the development, or the proposal, of experiments that could evaluate hypotheses. So how would you work out whether a model that’s designed to generate hypotheses is doing well or badly? It’s a challenging task. You can ask humans to evaluate how well it’s doing, and that’s one approach that people are taking. If they’re generating hypotheses that can be evaluated by simulation or simple experiments, then you can actually test them. But I think working out how to do that is something that we’re still studying.

Sarah Webb  19:47

What about the uncertainty piece and understanding if you’re asking a model to predict what might happen scientifically or based on physical principles, what sort of approaches are you taking to figure out uncertainty?

Ian Foster  20:01

Yeah, so that’s another area, of course, which people are very concerned about. There are some approaches one can take. One is to ask it to make predictions that you can test fairly easily, so that you can perform a simulation that will evaluate a prediction once it’s made. On the other hand, if what you’re asking it to predict is something that’s very difficult to simulate or perhaps dangerous to experiment with, then things are more complicated. So then you’d like to find ways of, as you said, evaluating the uncertainty of a prediction. So you might ask it the same question many different ways or with slightly different assumptions, and see what comes out. Another is to try and get it to evaluate whether it’s extrapolating or interpolating: is the prediction that you’re asking for in an area in which it’s previously shown itself to perform well?
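A minimal sketch of the “ask the same question many different ways” heuristic: pose several rephrasings of a question and measure how often the answers agree, with low agreement flagging higher uncertainty. The ask_model() stub and the rephrasings are illustrative placeholders, not a specific method from the project.

```python
# Toy agreement check across rephrased prompts.
from collections import Counter

def ask_model(prompt: str) -> str:
    """Placeholder: swap in a call to an actual model."""
    return "about 1.1 eV"

def agreement(question: str, rephrasings: list[str]) -> float:
    answers = [ask_model(p.format(q=question)) for p in rephrasings]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)   # 1.0 means every phrasing agrees

rephrasings = [
    "{q}",
    "In one short sentence: {q}",
    "Assume standard conditions. {q}",
]
score = agreement("What is the band gap of silicon?", rephrasings)
print(f"agreement across phrasings: {score:.2f}")
```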

Sarah Webb  20:01

Have there been any interesting surprises along the way, things that either you haven’t expected, or interesting avenues that have come out of this work?

Ian Foster  20:26

So far, we have a subgroup at Argonne working on medium-range and regional weather forecasting using transformer-based models, and they are getting just amazing results. So that’s been, I think, a big surprise to the community and to us in terms of the abilities a small team has managed to deliver. We’re also, I think, very pleased with how well the techniques we’re applying are working in terms of predicting possible experimental targets for some protein design applications, but it still is preliminary work in that space.

Sarah Webb  21:29

And how do you see these models connecting with the traditional modeling and simulation parts of the computational science ecosystem? Because obviously that’s been the core bread and butter of what DOE has been doing for years now.

Ian Foster  21:46

I certainly don’t see, for the most part, these models replacing conventional computational simulations, but we’re hopeful that they’ll be able to do at least a couple of things. One is to allow people to build models more quickly, codes more quickly. People are seeing good success in the use of these models for writing software, or helping people to write software, to adapt software for new platforms, to debug software when problems arise. Another area of interest is in using these techniques to develop efficient approximations to various functions that are currently computationally intractable, or at least very computationally demanding. So, in a sense, we see these models, these techniques, serving as expert research assistants for software developers. When you think that DOE has, I’m sure, billions of lines of software out there that continuously need to be adapted to new platforms and new purposes, this could be a tremendous advantage.
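A minimal sketch of the “efficient approximation” idea: fit a cheap learned surrogate to an expensive function so later evaluations are fast. The expensive_function here is a toy stand-in for a costly simulation, and the scikit-learn regressor is just one possible choice, not the project’s approach.

```python
# Toy surrogate model for an expensive function (assumes numpy and scikit-learn).
import numpy as np
from sklearn.neural_network import MLPRegressor

def expensive_function(x):
    """Stand-in for a costly simulation or quantum-chemistry calculation."""
    return np.sin(3 * x) + 0.5 * x**2

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 1))
y = expensive_function(X).ravel()

surrogate = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
surrogate.fit(X, y)                     # train once on expensive evaluations

x_new = np.array([[0.7]])
print("expensive:", expensive_function(x_new).item())
print("surrogate:", surrogate.predict(x_new).item())   # fast approximation
```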

Sarah Webb  22:46

DeepSeek made headlines in late January, and I asked Ian more about that model and its implications for scientific research. What was the accomplishment here? And what do you think is interesting about DeepSeek?

Ian Foster  22:59

Well, you know, DeepSeek is the product of a fascinating company in China which has clearly got considerable resources behind it, and some very smart people. So they released a model, which I don’t think is as capable as the latest OpenAI models, but it’s coming close in some regards, and that got people very excited, I think, for a couple of reasons. One is because it’s coming from outside the U.S. and particularly from China. And the other is that they claim to have created it using substantially fewer resources than OpenAI. It’s hard to know exactly how many resources they used, so I’m not so concerned about that. But I think the fact that it’s as capable as it is, is interesting. Now, people are busy exploring its use for different purposes. In particular, people are exploring its use for science, and I think they find that its ability to engage in scientific reasoning is a bit of a mixed bag. Sometimes it seems to do very well, sometimes it doesn’t do as well as we would like, but it’s certainly an intriguing development.

Sarah Webb  24:02

What do you think this development means for this idea of looking at foundation models? How might this affect the kinds of questions that we were talking about before?

Ian Foster  24:15

Well, I think we continue to be impressed and in some regards surprised by the pace of progress in both closed and, to some extent, open models coming out of industry and the nonresearch space. And so that’s both exciting, and I think it also forces us to think, well, what is going to be the right approach over the next several years for science? Should we be building our own from scratch, which is what we originally assumed, or should we be working out how to leverage and perhaps collaborate with the developers of commercial models? So I think we need to be trying both approaches.

Sarah Webb  24:51

This is a time of incredible change, and we need people who are capable of working on these systems, understanding these systems. What do you think are the implications for up-and-coming computer scientists, computational scientists? What kinds of things do we need to be thinking about in terms of building a workforce that understands all the different pieces that are coming together here?

Ian Foster  25:15

Some people express concern that these AI models will render certain professions obsolete. But certainly in the sciences, to misquote someone, I can’t remember who, the AI will not replace scientists, but scientists who use AI will replace those who don’t use it. It’s probably true to a significant extent. So I think the most important thing that people can be doing at the moment is just experimenting, learning what the state of the art in models can do and investigating in a fairly open-ended manner how they might use them to accelerate the work that they’re doing. And I don’t mean using them to write papers or review papers. Perhaps they can serve some purpose there, but using them, for example, as we see a lot of people doing now, in a fairly open-ended manner to explore concepts, to propose hypotheses, to propose different approaches to solving problems. We’ve started a series of exercises. Initially at Argonne, we had, I think, 200 scientists get together for a day, each with a challenging problem they wanted to investigate, got them working using the best models we could bring to bear and asked them to report on their experiences. In a little while, we’ll be doing a similar exercise with 1,000 scientists across several national labs, and there are two purposes. The explicit one is that we’re going to learn how they find these models helping them. Secondly, we’re going to engage these people in using the tools and learning about the tools.

Sarah Webb  26:45

It sounds like a win-win. In many ways, the scientists learn how to work with these tools, and you learn some things about the tools and how they work.

Ian Foster  26:53

Yes, exactly. And we hope that it will also, as a third purpose, help us understand where these tools need to be improved to benefit DOE science.

Sarah Webb  27:04

What are the problems or challenges that keep you up at night relative to this, the things that you chew on most?

Ian Foster  27:12

Yeah, there are a couple of things that I think about a lot. One is, how does the national research enterprise engage most effectively with what’s going on in industry? So we see a tremendous rate of development out there. We want to innovate ourselves in the ways that most benefit the U.S. research enterprise. So where are the areas we should be focusing our attention, things that presumably are not going to be done by outside parties? And then a second question concerns the Department of Energy in particular: one thing it’s done very effectively for many years is build research facilities that serve the national, and even international in some cases, scientific community. So what should those facilities look like going forward? We historically have designed and built instruments that are designed to be used by human scientists. Do we need to look at building different sorts of instruments that may be designed to be used by AI scientists? Those are two questions, at least.

Sarah Webb  28:11

And I guess, what are you most excited about? I mean, I just feel like this world is moving really rapidly.

Ian Foster  28:18

It is. I think computing maybe 10 years ago seemed almost a bit boring. We could build slightly faster computers and apply them to slightly bigger problems, but we could sort of guess what would be happening in the next few years. That’s no longer the case, right? We don’t know what’s happening, so that’s somewhat nerve-wracking as well, but I hope that we’ll see these techniques being used to develop solutions to problems that the U.S. and the world face. A whole range of them: cost-effective energy production, environmental protection, global health. I think there are opportunities in all of those areas.

Sarah Webb  28:54

Is there anything else that you think is important to mention?

Ian Foster  28:57

I just encourage people to learn about what these technologies can do and engage in open-ended thinking. This is an area where many people can contribute in fascinating ways.

Sarah Webb  29:09

Ian, thank you so much. I’ve really enjoyed talking with you today.

Ian Foster  29:13

My pleasure. Okay, keep up the good work. Bye-bye.

Sarah Webb  29:16

To learn more about Ian Foster and AuroraGPT and for other resources on foundation models for science, please check out our show notes at scienceinparallel.org. Science in Parallel is produced by the Krell Institute and is a media project of the Department of Energy Computational Science Graduate Fellowship program. Any opinions expressed are those of the speaker and not those of their employer, the Krell Institute or the U.S. Department of Energy. Our music is by Steve O’Reilly. This episode was written, produced and edited by me, Sarah Webb.

Transcript produced using otter.ai.
