In this third episode of our “Generative AI” series, Cornell Tech and SC Johnson College of Business professor Karan Girotra joins us once again to assess the current capabilities and business uses of generative AI tech and examine what's coming next — as well as what's not.
With Cornell Tech and SC Johnson College of Business professor Karan Girotra, we will look closely at late-breaking technical advances in generative AI, including new video capabilities, autonomous agents and AI-enabled robotics, as well as the impending arrival of the next generation of models.
Plus, we’ll highlight how organizations in finance, health, education, media and manufacturing are using these technologies in clever ways. We’ll also chart a path for the next generation of use cases — ones that go beyond using assistants to enhance individual productivity.
What You'll Learn
The Cornell Keynotes podcast is brought to you by eCornell, which offers more than 200 online certificate programs to help professionals advance their careers and organizations. Karan Girotra is an author of three online programs:
Learn more about all of our generative AI certificate programs.
Follow Girotra on LinkedIn and X.
Did you enjoy this episode of the Cornell Keynotes podcast? Watch the Keynote.
Chris Wofford: In this third episode of our Generative AI series, Cornell Tech and SC Johnson College of Business professor Karan Girotra is joined by Professor Alexander "Sasha" Rush, an authority on natural language processing, machine learning and open source development, which is the subject we'll be covering today.
Chris Wofford: Our two professors go quite in depth on open source and discuss where the technology may be headed for business. It's fair to characterize this episode as next level to some degree: while it's best suited to people already immersed in the AI space, there's actually plenty of intriguing and thought-provoking discussion for anyone interested in open source and AI.
Chris Wofford: So check out the episode notes for links to several of eCornell's online AI certificate programs from Cornell University, including those authored by our faculty guest, Karan Girotra. And now, here's the conversation with Karan and Sasha.
Karan Girotra: So today our topic is open source in the context of large language models and, more broadly, generative AI. The interesting thing about open source is that it's one aspect of AI that has everyone excited, from the developers, to the folks trying to build companies and find applications for these models, to the regulators.
Karan Girotra: Everybody thinks of open source as a solution to a lot of these challenges, and so everybody talks about it. That said, I think it's quite nuanced. So today Sasha will help us dive through that nuance, understand what is really new here, and see how it matters from a business and a technology point of view.
Karan Girotra: So Sasha, let me start the conversation by simply asking you: what is open source? What does open source typically mean, in this context or other contexts?
Sasha Rush: Yeah, thanks for having me, Karan. So I would say open source has kind of technical, formal definitions. But intuitively, people think of it as mostly free software that you have access to the internals of.
Sasha Rush: And famous examples like Linux are these collaboratively developed open projects that originally were built by small teams but are now the thing powering a lot of compute around the world. The reason some of these questions are subtle is that a lot of it tends to revolve around the licensing behind the software itself and what you're allowed to do with it.
Sasha Rush: And so open source can mean a lot of different things within that context, but often it depends on whether companies are able to use it, whether people are able to scale with it, and how it's actually developed in practice. There's a wide range of software projects that fall under different open source guidelines.
Karan Girotra: So if I think about it, there are potentially three axes here: how much you pay for it, what's going on internally, and what you can do with it. Those are the three axes one could probably think about for open source. Now let's come to the context of large language models. What does open source really mean in that context?
Sasha Rush: So one thing that's crazy is that no one really knows right now. We're on a really wild frontier, and people are in the process of coming up with these definitions and formalizing them. But for most people, that matters less than what it actually means in terms of a community and what's being built.
Sasha Rush: I think when people talk about it informally, they're really talking about the fact that large companies and small distributed organizations are training large language models, or models for generative image processing, and releasing them to the public. That allows people to have access to things that are getting close to the scale of OpenAI-style generative models.
Sasha Rush: And it's been really exciting to see what people are doing and building with them. Now, the reason I say the definition is a little bit complicated is that the main models we think of as the leaders in this, quote, open source generative AI are models like the Llama models from Facebook, which under traditional definitions would really not be considered open source.
Sasha Rush: They really just consist of a single file that has a bunch of mysterious values, which we call weights, that allow you to generate text. So in that sense, the source is not available. But because you're able to do interesting things with it, and because you're able to build new and variant models, people are thinking of it within this realm of open source.
Sasha Rush: And in fact, Facebook is explicitly using the term open source AI.
Karan Girotra: Let me try to understand that a little better from a business point of view. So, three aspects: how much you pay for using it, how much we know about the internals, and what you can do with it. I guess OpenAI also has a free version, as does Llama. And for those who might not know, OpenAI is the company behind the GPT class of models, powering ChatGPT, and also behind image generation models like DALL-E. So on price, many versions are available free, and I guess what I'm hearing from you is that the real distinction is the internals: how much you know about them, or how much of them is open.
Karan Girotra: So what are the typical internals of a large language model? What are the different components in a large language model?
Sasha Rush: That's a good way of thinking about it. So OpenAI now has a free tier where you're able to send a request to their headquarters. They'll process that request, run their secret large language model, and send you back the answer.
Sasha Rush: I think no one really considers that within the realm of being open source. Despite the name of the company, people think of that as just free-to-use access to the model. Where Llama gets a little more interesting is that they're releasing what are called the weights of the model.
Sasha Rush: So, it's hard to describe what this is. It's a big, opaque file that they post on the internet, and if you download this file, you can get it to basically speak to you: you send it some text and it returns some new text. In that sense, it's similar to being able to run on your own computer the thing you would otherwise run on OpenAI's server.
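As a rough illustration, running released weights locally can look something like this with the Hugging Face transformers library. This is a minimal sketch, assuming the transformers package is installed; the model ID is illustrative, and gated weights require accepting the license first.

```python
# Minimal sketch: load open weights and generate text entirely on your
# own machine; no request ever leaves it.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("What does open source mean for AI?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```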
Sasha Rush: Where things start to get a little bit more interesting is you can do this secondary process, which people call fine tuning, where you start with the weights of the model that you were given, and then you feed in some more data. Maybe this data is proprietary or even private, and you basically update the weights of the model to now take in this new data.
Sasha Rush: It's hard to do that with these fully closed systems without sending them the private data, or paying them to update the weights for you. There's also another thing you can do with these models that people are excited about, which is that you can use them to generate more data at a very large scale, which allows you to do other things, like train your own new models on that data.
Sasha Rush: Again, this is something you may be able to do with some of the closed systems, but oftentimes the licensing around whether that's allowed is complex. It's complex for some of the open models too, but a little bit more lenient.
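Here is a minimal sketch of the fine-tuning loop Sasha describes, assuming PyTorch and transformers; the model ID and the list of private texts are stand-ins for your own setup.

```python
# Minimal fine-tuning sketch: start from released weights, feed in your
# own (possibly private) data, and update the weights in place.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

my_private_texts = ["..."]  # stand-in: proprietary documents, never sent anywhere
for text in my_private_texts:
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    # Standard causal-LM objective: predict each token from the ones before it.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.save_pretrained("my-fine-tuned-model")  # the updated weights stay local
```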
Karan Girotra: What you can do with these models is great, and we'll come back to that. But first I want to really understand what internals we can actually see.
Karan Girotra: If I go back 25 years, I was a Linux hacker, and the good news about Linux was that you could download the source, recompile it, and create the whole thing from scratch. I don't think we can do that with the weights. So to really create this whole thing from scratch, what all would I need, and what are the different companies giving me?
Karan Girotra: From what I understand, OpenAI is giving me nothing; they're saying, you give us the data and we'll process it for you. Llama is giving us something, but it doesn't sound like what Linux gives me, where I can replicate it from the start. So what's the difference here?
Sasha Rush: Yeah. So the secret sauce of something like Linux is the source code.
Sasha Rush: It's the literal programming people did all over the world to make Linux work. That source code required lots of talent, lots of patches, and gets updated day to day. The source code for language models is actually not that complex; in fact, roughly the same source code can run all the different models you might know.
Sasha Rush: The secret sauce of language models is the data they're trained on. And that data comes in two forms. One form is what's known as the pre-training data, the data these companies scrape from the entire web: every article, every book that's available on the web, all sorts of different text from newspapers, creative writing, math, or science papers.
Sasha Rush: That data is not released with the models. And for most of these models, we don't really know exactly what they were trained on, or how you might go about replicating that data. The second form of data is what's called instruction tuning data. And this is the data that gets the language model to kind of be polite and do what you ask it to do.
Sasha Rush: So that's the data that gets ChatGPT to respond to your question, to produce good answers, to figure out what you're asking and come up with a good answer to it. That data is often produced by very large teams of people who are annotating and responding to questions. So they basically have a massive factory of people who give human answers to questions so that the model can learn from them.
Sasha Rush: That data is also not released with these systems. And it can be quite expensive to produce, because you have to pay folks to produce this data in this form. Now, because of that, it's not possible to, say, compile a language model by yourself, because you don't have access to the data you need to replicate this process.
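For a sense of what one record of instruction tuning data looks like, here is an illustrative example; the exact formats vendors use are not public, and this shape simply mirrors common open datasets.

```python
# One illustrative instruction-tuning record: a human-written prompt
# paired with a human-written "good answer" the model learns to imitate.
# Real vendor formats are not public; this mirrors common open datasets.
example = {
    "instruction": "Summarize the following article in two sentences.",
    "input": "<article text goes here>",
    "output": "<a polite, helpful summary written by a human annotator>",
}
```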
Sasha Rush: People think that the reason the first kind of data is not released is all the legal issues surrounding language models. Even if that data is accessible, it might not be legal to train on, and it's unclear whether it's allowed to be released with these models.
Karan Girotra: And so OpenAI can pay its lawyers to fight against The New York Times, but I might not be able to.
Karan Girotra: So it's probably better for me not to even try messing with that data, and maybe better for OpenAI not to put that pre-training data out.
Sasha Rush: Well, that's one part of it. But the second part is that, at the moment, we actually don't know what, say, Facebook is training on. They might not want to reveal that they trained on certain types of data.
Sasha Rush: Again, it's just speculation, but we don't have a good sense.
Karan Girotra: And instruction data is costly to get, so companies spend a lot of money paying contractors around the world, I imagine, to annotate, to really say: this is a good response to a question, this is a polite way of talking, this is not a polite way of talking.
Karan Girotra: So the source is simple. I heard, and you can tell me if I'm correct or incorrect here, that the source for a large language model, roughly speaking, is under a thousand lines of code. Is that a myth, or is that true?
Sasha Rush: Yeah, it's a little bit subtle. So the code to run the language model once it's been trained is under a thousand lines, and it's surprisingly readable. In fact, about five or six years ago, I wrote a blog post that goes through all of the different details of how these things work, and honestly, the code hasn't changed so much, even in that time. Now, the code to train the model is a bit more complex.
Sasha Rush: I mean, it's a lot more complex. It's not Linux-style complex, but it can be a bit more challenging. But you don't actually need that to run the model in practice.
Karan Girotra: Right, right. So the training code is more complex than the inference code. We have the source code, we have the data, but then it seems there's also, I would imagine (again, I'm no expert), that to train these on very large corpuses of data, one probably has to come up with some tricks to make all the H100s work together in unison. So what about that piece?
Karan Girotra: What do we call that, and how important is it?
Sasha Rush: Yeah, it's a good question. So we call this distributed training. This refers to the fact that in order to train a model, you often need to use, say, hundreds of computers. And actually, for Llama 3, I believe the number was that they used 16,000 H100s to train it.
Sasha Rush: These things are extraordinarily expensive. The estimate for training Llama 3 was around $100 million for these computers, and so every millisecond counts when training them. So there are world experts on getting the most out of NVIDIA GPUs, world experts on networking and data center construction and even energy, all working to make these things as efficient as possible.
Sasha Rush: In some sense, even if they open sourced that code, very few organizations in the world would be able to take advantage of it or run it their way. And in fact, I think Google is an interesting case study too. They basically use their own chips internally, which you can use through the cloud, but they aren't actually released.
Sasha Rush: So even if they put out all the details, basically it would only apply to them.
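The core primitive behind distributed training can be sketched in a few lines; this is a toy illustration of data-parallel training with PyTorch, under the assumption that frontier-scale pipelines layer far more engineering (model and pipeline parallelism, networking, checkpointing) on top of it.

```python
# Toy data-parallel sketch: each GPU runs one copy of the model on its own
# slice of data, and gradients are averaged across GPUs every step.
# Launch with: torchrun --nproc_per_node=8 train.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")          # one process per GPU
rank = dist.get_rank()

model = torch.nn.Linear(512, 512).to(rank)  # toy stand-in for a transformer
ddp_model = DDP(model, device_ids=[rank])
optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=3e-4)

for _ in range(100):
    x = torch.randn(32, 512, device=rank)   # each rank sees different data
    loss = ddp_model(x).pow(2).mean()        # toy objective
    loss.backward()                          # DDP all-reduces gradients here
    optimizer.step()
    optimizer.zero_grad()
```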
Karan Girotra: And you think this stuff matters, right? So there's what I'm calling the engineering pipeline. There is the source code, there is the data, and then there are all the things you mentioned, which are probably not getting the most attention but are probably equally important:
Karan Girotra: the networking, the efficiency, even the design of the computers and the data centers that run this. And none of that information is out there. So I see two problems. Even if the information were out there, I'd probably not have the money or the access to the chips to use it, and in some cases those chips are really private.
Karan Girotra: But the engineering of it is also not public or out there. Is that correct to say, or is some of it more out in the open?
Sasha Rush: Yeah, that's correct to say. I mean, there are a lot of secrets about the details of these models and how they're trained.
Sasha Rush: That being said, I think people in the open source community would be happy just to know, say, the details of the data. There's an assumption that the actual training of these systems may be beyond the reach of open source consortiums or things of that form, but that knowing what the models were trained on would be very helpful for knowing how things work.
Sasha Rush: So I'll give you one example. One thing that's been very challenging with open source language models is that we don't fully know how to evaluate them or determine which one is better than another. People are thinking very hard about new benchmarks and new datasets to make this happen, but one question that comes up a lot is data contamination.
Sasha Rush: Because we do not know what these models are trained on, it's a little hard to know when they're doing something completely novel or whether they've learned it from the data they were trained on. So even just knowing what was or was not fed into the model would be beneficial for understanding the abilities they have, and the science behind how they work.
Karan Girotra: Very nice. So I guess the next question is about what you can do with these models. On the internals: we don't really know them. It's not like Linux, where you really know all the internals. We know some stuff, the source code, which is not that hard to know or that innovative.
Karan Girotra: The data is where the secret sauce might be, and we have limited information on it even in the most open models. And then there's the engineering skill, which would be another one to have. But data is the place where we want to start first, so at least we can evaluate these models and not get shocked next time ChatGPT says something that makes us think it has emergent properties when it might really just be in the data.
Karan Girotra: So at least that kind of shock, people getting worried about their Terminator fantasies, we could check against what the data actually contains. So that's cool. But what can we do with these models? If I read what Mark Zuckerberg says, he's using the word open source, but it's not really the source data or the engineering that's open; he says you can do a lot of new stuff with it that you couldn't do with other things.
Karan Girotra: So what are the new things we can do with these models? You mentioned one: we can run them locally. Beyond that, what else can we do, and why do we even care about running them locally?
Sasha Rush: So there's a bunch of things you can do with these models, and they all depend on how much compute or how much data you want to inject into the systems themselves.
Sasha Rush: The most common one that people talk about is this idea of fine tuning. What's exciting about fine tuning is that, let's say I'm Bloomberg and I have a ton of interesting financial data, and I don't think that data was originally used by the models to learn. I can take that data and continually train one of these open source models on it, and even just giving a model more data within a given domain will make it much better, or much smarter, in that setting.
Sasha Rush: You can kind of do this with closed models, but you end up having to pay them a lot of money, you don't have full control of the process, and you may have to literally send them the data you want to train on. So there's a lot of excitement. Oh, sorry, go on.
Karan Girotra: Yeah, a question on fine tuning.
Karan Girotra: What's the verdict on fine tuning? Because I hear some conflicting research, particularly around financial data and Bloomberg's model, which was fine tuned with as-good-as-it-gets proprietary financial data. One stream I hear is that fine tuning will make these models better, because they'll know a little bit more about the domain.
Karan Girotra: The second stream I hear is that if you just increase the pre-training data, fine tuning doesn't really matter, and even if you fine tune models, it's not like they necessarily adhere to the fine tuning data more than the pre-training data. So I don't know, what is the latest research on that?
Karan Girotra: What is the best practical knowledge on fine tuning? Is it good, necessary, or are the outcomes mixed?
Sasha Rush: There are a lot of small details about how this can work or how it can't work. A lot of it has to do with what's the most financially viable way to do it.
Sasha Rush: So, for instance, it seems like a lot of times it's better to use a bigger model that was trained on more data than it is to, say, fine tune a model. But if you keep other things equal, fine tuning is certainly going to improve the model within its space. And there's a general sense that we're maybe getting to the limit of, say, the size of models or the amount of training data.
Sasha Rush: And so having specialized data in particular will be a very useful thing going forward to get marginal gains.
Karan Girotra: And one more understanding question: fine tuning versus RAG, or impromptu adding of relevant information in the prompt. What's the verdict there, and how does it relate to open source?
Karan Girotra: I think most systems will allow you to do some sort of retrieval augmented generation, whereas fine tuning is certainly much more complicated. So if one of the advantages of open source is more control over fine tuning, how relevant is that? Could I just get away with doing retrieval augmented generation instead?
Sasha Rush: Yeah, so let's define these terms, because people may not fully have a good sense of them. Generally in the research community, we think of two ways to get your model to learn about a new domain. One is fine tuning, where you actually change the weights of the system itself.
Sasha Rush: The other is what's known as in context learning, where you give it the examples you want it to learn from in its context. So you basically can think of the first as kind of changing its brain, and the second as kind of first telling it what you want it to know, and then asking it to produce responses.
Sasha Rush: The difference between these two is a major question of research, and one that people are still thinking a lot about. Practically, at the moment, we know the following. We know that in context learning is inherently limited, in the sense that we can only make the context a certain length. Models have a fixed context length, and it's actually not extremely large.
Sasha Rush: So think about it as maybe ten pages that you can give to the model before running. The other downside is that when you do in context learning, you're making the approach slower, so you're going to have to pay the extra cost of running it. At the moment, we don't totally know which one works better, but we do know that fine tuning can scale to much larger data, and therefore it's a useful tool in our current toolbox.
Sasha Rush: Now there's a third idea here, which is called RAG, retrieval augmented generation. The idea is that maybe you can't fit all your data in context, but you can fit a subset of your data that basically fits. And the way you do that is you use a much weaker model to first determine what subset you think will be important for every given query.
Sasha Rush: This idea has been very practical over the last couple of years of using large language models, and certainly if I were, say, building a company today, that would be the first thing I would try. This works both for open source models and for closed source models as a way of getting them to work.
Sasha Rush: That being said, in the long term, I think people see this as maybe a band-aid, and maybe not the final approach. We either expect to have extremely long in context learning, or to figure out really efficient ways to do fine tuning. Which one we end up with will matter for whether open source or closed source wins, and for what the compute profiles look like.
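Here is a minimal RAG sketch, assuming the sentence-transformers library plays the role of the "much weaker model" Sasha mentions; the model name, documents and query are illustrative.

```python
# Minimal RAG sketch: a small, cheap embedding model picks the subset of
# documents most relevant to a query, and only that subset goes into the
# big model's limited context window.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # the "much weaker model"
docs = ["Q3 margins improved by 2 points.", "Headcount grew 5% in EMEA."]
doc_vecs = embedder.encode(docs)

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embedder.encode([query])[0]
    # Cosine similarity between the query and every document.
    scores = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return [docs[i] for i in np.argsort(-scores)[:k]]

query = "What happened to our margins?"
prompt = "Context:\n" + "\n".join(retrieve(query)) + "\n\nQuestion: " + query
# `prompt` now fits in context and can go to any model, open or closed.
```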
Karan Girotra: Yeah, very nice. Okay, so we can bring our own data into models in any of these formats, whether fine tuning, in context learning, or retrieval augmented generation, and open source has some ease or advantage of control in doing that. What else can we do with open source models? What else can I do with Llama that I can't do with OpenAI's models?
Sasha Rush: Yeah, so one idea that's emerged as a practical way of using open source models is distillation. The term distillation in the machine learning literature basically means using a very smart model as the teacher and a very fast model as the student. And we actually do use those terms.
Sasha Rush: The hope is that if you can get a very, very expensive model to produce lots of good examples, you can teach a much smaller, faster student model to do those tasks. So the idea is that you could take a really good model, but one that might be too expensive to run in production, and use it to generate a new dataset for training a much smaller, faster model.
Sasha Rush: Now, in theory, technically you can do this with closed source approaches. You could just make lots of queries to OpenAI and train on those. But the legality of this is a little bit in question right now. There are certain aspects of their terms of service that may prevent this, but people have been doing it, and I think there are cases ongoing.
Sasha Rush: For the open models, people are explicitly writing licenses around this question of data distillation. NVIDIA put out a very large model a couple weeks ago that allows you the freedom to distill it into any model you'd want. Llama also allows you to do this; they have some terms about the naming of the models you produce.
Sasha Rush: But people are coming to understand that this is maybe a practical use case that open models allow you to work with.
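A minimal sketch of the distillation workflow Sasha describes, assuming the transformers library; the model ID and prompts are illustrative, and you should check the teacher's license terms before doing this.

```python
# Minimal distillation sketch: a large open "teacher" model answers your
# task prompts, and its outputs become the training set for a small,
# fast "student" model.
from transformers import pipeline

# Illustrative teacher model ID; a 70B-class model is expensive to run,
# but it only needs to run once to build the dataset.
teacher = pipeline("text-generation",
                   model="meta-llama/Meta-Llama-3-70B-Instruct")

prompts = ["Explain gross margin to a new analyst."]  # stand-in task prompts
distilled = []
for p in prompts:
    out = teacher(p, max_new_tokens=200)[0]["generated_text"]
    distilled.append({"prompt": p, "completion": out})

# The student (say, a 2B-parameter model) is then fine-tuned on `distilled`
# exactly as in the fine-tuning sketch earlier.
```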
Karan Girotra: This is fascinating, because if I understand correctly, ChatGPT is essentially overkill for the vast majority of tasks we need to do. Even Apple, in its strategy, acknowledges that. Right now we have this super smart model for everything.
Karan Girotra: And that brings latency and cost challenges in production, at the inference stage, when people are actually using it. But what if we could use this to create specialized ChatGPTs that don't know everything about the world but know, for example, everything about our use cases in our teaching?
Karan Girotra: We're experimenting with using language models for, let's say, tutoring or other Q&A-type tasks that a teaching assistant would do. Probably a lot of that can be done with a much smaller model. So the idea here would be to use a large model to create something very specific for us.
Karan Girotra: What are the kinds of cost benefits we can see by doing that, if any of these things have been fully put into practice? Are we talking one tenth the runtime or inference cost, or smaller gains than that?
Sasha Rush: Yeah, I think it's not unreasonable that you could see something like one tenth the cost.
Sasha Rush: Again, particularly if you're doing a specialized model. One exciting one that Google actually recently released is a model known as Gemma 2B. The 2B refers to the fact that the size of the weights of the model is two billion parameters. Up until this point, people really hadn't been able to train very good models at that scale, but they were able to do it by training a much larger model and then distilling it into a smaller model that you can use in practice.
Sasha Rush: Another one that...
Karan Girotra: Just to benchmark that: a normal model would be 400B, or 400 billion parameters, and thanks to distillation we could use it to create a 2 billion parameter model. So, speaking as a layman, that looks like a 200 times reduction in the number of weights, or parameters, which could be a 200 times reduction in cost.
Karan Girotra: If I had to ballpark it: maybe not as good, but it sounds like almost as good.
Sasha Rush: Yeah, I think you should think of it more like 10 times. I don't think it's going to be as good at that scale. As a rule of thumb, you can roughly think of these numbers as being the speed of these models.
Sasha Rush: Again, it depends on whether you're trying to do a chatbot that's real time versus a model that's maybe offline. But I think roughly thinking about the size as the speed is not a bad way of doing it. One other thing to note, though, is that OpenAI has also been doing this. Well, I guess we don't know explicitly, but people assume they've been doing this as well.
Sasha Rush: So ChatGPT hasn't been standing still. If you use the newest version, which I think they call GPT-4o mini, that's likely a distilled version of the much larger model that they're serving.
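As a back-of-envelope check on the "size is roughly speed" rule of thumb, decode-time compute per token is commonly approximated as about 2 FLOPs per parameter; this is a rough heuristic, not a benchmark.

```python
# Rough heuristic: per-token decode compute scales with parameter count
# (~2 FLOPs per parameter per token), so the parameter ratio approximates
# the cost ratio on paper.
def flops_per_token(params: float) -> float:
    return 2 * params

teacher, student = 400e9, 2e9  # 400B-class model vs. a 2B distilled student
print(flops_per_token(teacher) / flops_per_token(student))  # 200.0 on paper
# In practice, memory bandwidth, batching, and quality loss shrink the
# realized savings, hence Sasha's "think more like 10 times."
```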
Karan Girotra: From a business point of view, I think distillation matters beyond the cost and latency advantages.
Karan Girotra: It's almost better if a model is not too general purpose. I'd almost like a model that knows only about my teaching and doesn't know about politics, for example, and therefore has almost no risk of bringing it in. I don't know if that kind of control can happen, but I can see that a, let's say, lobotomized or scaled-down version of the brain has advantages from a cost and latency point of view.
Karan Girotra: Maybe it also has advantages from a keeping-it-on-topic point of view. Or is that just speculation that may or may not pan out?
Sasha Rush: I think it's an interesting idea. I should say that we are really at the frontier of understanding how we can control what language models say or do. It has remained an extraordinarily hard challenge to do what's called unlearning, which is to get a model to forget something that it has learned.
Sasha Rush: And I think that distillation by itself is not really a silver bullet for that. There are approaches that might help, or that might be related to distillation, but we don't really have the technology to closely and very carefully put a guardrail around what a language model will talk about or do.
Sasha Rush: And yeah, I can see why that would be a frustrating thing from a business perspective.
Karan Girotra: Well, there's still plenty of good news here. From a business point of view, what I'm seeing is that I have perhaps not everything needed to recreate these models, which matters for research replicability,
Karan Girotra: but maybe that's not so important from a practical point of view. From a practical point of view, I can get these open source models, run them any which way I like, on my hardware, on my premises, and add my own data with no risk of that data being touched by any other company. And potentially I could use them to train a very custom model for my particular business line, which will be faster, cheaper, and offer all the advantages of responsiveness and so on that come from that.
Karan Girotra: Perhaps it even enables completely new kinds of use cases, because we can get these small models to work better. But there's one unanswered question. We often say you get what you pay for. So is the free stuff as good as the gold standard, the OpenAI stuff? How good is it?
Karan Girotra: Are we compromising on something? I know it's hard to measure the performance of these models, but given the metrics people have been using, how close are we on the open side?
Sasha Rush: Yeah, one thing that's crazy about all this is how fast it's been moving. We're roughly, I don't know, a year, year and a half into some of these approaches.
Sasha Rush: And we can say, within the last couple weeks, that we're pretty close. It seems like the latest versions of Llama 3 that were released are nearing about where GPT-4 was when it was released, on many of these benchmarks. I think a lot of people thought that was maybe not possible a couple years ago. It took about a year, but it seems like the open models have caught up.
Sasha Rush: Now, we don't know what OpenAI is working on. They may now be prepping GPT-5, which may have all sorts of new things that will take open source a while to catch up on. But given how short a time frame it is, it's pretty impressive that they've reached that ability.
Karan Girotra: Yeah. So if I were a business executive, in your considered opinion, is it a dangerous thing to bet on open source, or is it a fair bet? What would you do?
Karan Girotra: Would you hedge your bets? Let's say you're thinking of a relatively high-business-value application, where costs probably matter less. What would you do in those cases: hedge your bets, or feel comfortable going with one approach? I know it's a tough question to answer; that's why neither of us are CIOs. But what would you recommend a CIO do in this kind of situation?
Sasha Rush: Yeah, I'm very underqualified for this question. That's one part about Cornell Tech that's interesting: you get behind your students and their understanding of business structure. But no, there are many advantages to open source: not being locked into a vendor, being able to, say, choose other companies to actually run the inference of your model, and being able to customize all these properties and do it in a legal way.
Sasha Rush: Depending on what your structure is, you may or may not be able to use certain open source models, and you have to take that into account. But it definitely seems like a viable path forward in a way that maybe it wasn't a year ago.
Karan Girotra: Very nice. Yeah, you mentioned lock-in. If I start putting my business hat on, my economist hat: when I first learned about pre-training, about the structure of these models, I remember the computer scientists in the room were of course excited about the performance.
Karan Girotra: I wasn't sure how good these models could get, but I was sitting there thinking about the cost structure, and I'm outing myself as a real dork, a person who thinks about money a little more than perhaps I should.
Karan Girotra: To me, it seemed like the cost structure of pre-trained models was such that you have a lot of fixed cost up front for the pre-training stage, and the fine tuning is a minor cost. Whenever we see that pattern of costs, where you create something generic at great expense and then a lot of people can use it at relatively small customization cost, you tend to get winner-take-all markets, because somebody makes that one big thing.
Karan Girotra: So my original prediction here was that this was going to be like everything else big tech does, where the bigger get better. Somebody would invest a lot of money and then control, so to say, the highway that everybody would have to use. I always think of these as infrastructure, which has a similar cost structure:
Karan Girotra: you've got to pay a lot to lay down a road, and then everybody can use it at relatively modest cost. So my prediction was that this would have a winner-take-all nature and somebody would become a monopolist.
Karan Girotra: And as we've seen with other tech monopolies, everybody who builds an application that uses, for example, a language generation task, whether for conversation or for code, would have to use the same highway, because it's the biggest and only highway available, and we'd end up paying 30, 40 percent of our returns, or maybe an even larger fraction of whatever we make, to the monopolist who controls the backend. I'm happy to say it hasn't worked out that way so far.
Karan Girotra: So what's your take? Does that thinking make sense, that these things have a tendency to go winner-take-all, and yet they're not becoming winner-take-all? What do you think about that?
Sasha Rush: It's an interesting question. There are a couple of aspects that have been quite interesting. We've seen a lot of smaller companies that were pre-training models stop as the cost continued to get higher. In that sense, it seemed like, as you were saying, things were congealing into a winner-take-all scenario.
Sasha Rush: The people who are still doing it are places like Meta, which seem to have alternative reasons, like trying to avoid vendor lock-in. In his letter about Llama 3, Mark Zuckerberg explicitly talks about the fact that he felt so burned by the tax Apple took on its apps that he felt it was necessary to avoid that at any cost.
Sasha Rush: Now, we'll see if that continues. Each generation of these models seems to be exponentially more expensive, so he'll have to keep paying to keep doing this.
Karan Girotra: So it's more like a battle of the monopolists, or of the aspiring monopolists, where one might just want to play spoiler so no one else becomes a monopolist, rather than it becoming a market where you and I can compete. But we haven't really mentioned the big players here; I imagine the big open source model would be Llama.
Karan Girotra: Mistral is also a company with what I believe are mostly open models. Any insights on how they're able to pull it off? Or is it just that people do a lot of things in the early ages of a technology's advancement?
Sasha Rush: Mistral is an extremely impressive company. They've been able to get a lot of really strong engineering talent working there, and they've been able to produce some really strong open models.
Sasha Rush: The history is that the founders of Mistral, or some of them, were from the original Llama team, so it kind of forked off of that project. But they've been able to hire a lot of great folks interested in both open model building and other aspects. Their models are only partially open source: they have lower-tier models that are extremely good, which are open.
Sasha Rush: And then they have some larger models that are either not open, or open with licenses that restrict usage over a certain size. That being said, they're an extremely impressive company, and they've been able to keep up with a lot of what's going on. I really hope to see them continue growing over the next couple of years.
Karan Girotra: I think we have time to take one question, from Elizabeth. We do these live, unlike a typical podcast, so we can truly interact with folks. Chris, perhaps you can tell us more about what Elizabeth wanted to ask.
Chris Wofford: Yeah, let me chime in.
Chris Wofford: So Elizabeth asks: do you have to worry about divulging trade secrets if you use Llama 3 or others? What are some of the IP issues for companies when they use open source models? She goes further: in other words, what do companies need to worry about in protecting their proprietary data if they use these systems?
Sasha Rush: Yeah. So, if you are running Llama 3 on your own hardware or in your own cloud, it's not going to be shipping any data off premises. The weights Facebook provides are fully just a file that you can run. It's not a system or code, and it doesn't have any security risk of that form. Now, a lot of people are running these open source models with third-party inference providers.
Sasha Rush: These are companies that have done the work to set up the infrastructure to run these systems extremely fast. If you are sending your data to these parties, you have to be careful about their terms and how they're managing the data themselves. I know there are some that can set things up to run within your cloud or within your infrastructure.
Sasha Rush: But others literally just take your data and send you back the models' responses. In that case, you should treat it like any other data being shipped off premises.
Karan Girotra: So overall, this looks like a promising development. I always like saying that I was somewhat pessimistic about how this large language model technology, or broadly the new generative AI technology, was evolving.
Karan Girotra: It seemed like something that would become another tech monopoly, where the big players have the advantage. And while that does remain true, the big players are working very hard, it seems, to make this as accessible and as easy to build with as they can. I've not seen that in any previous generation of technology.
Karan Girotra: And as I always like saying, these models are easy to build with, but it is much harder to know what to build and what not to build with them. So in a way the technologists have done their job: they've made something quite accessible. In the specific case of open source, what I learned today is that these models are essentially do-what-you-want-with-them, and by and large free.
Karan Girotra: Now, they're not telling us everything about the secret sauce used to train them, but at some level, unless you're trying to compete with them at the model layer, that doesn't matter. At the user and application layer, for most of the things you'd want to do with these models, as Elizabeth was asking, you really can inject your data much more safely, without worrying about leakage at any place in the pipeline.
Karan Girotra: If you're a regulated business, everything stays within your control, within your standard IT practices, so it gives you a lot more ability to deal with that. And then there's this brilliant idea of distillation: you can use these models as teachers to create smaller student models.
Karan Girotra: If we think about it, that's how we train employees; that's how we train agents in every situation. It's possible with these open source models, not so much with other models. So they're free, they carry fewer risks, there's more you can do with them, and at least for now they're as good in performance. That seems like a big win.
Karan Girotra: Thank you so much, Sasha, for highlighting these advantages of open source models. I know our students, and the CIOs and other folks struggling with these choices, are always ahead. But I think we can say there is a new option that folks should very seriously consider, because of the advantages we mentioned, as they think about building their internal AI stacks.
Chris Wofford: Thanks for listening to Cornell Keynotes, and check out the episode notes for info on eCornell's online AI certificate programs from Cornell University.
Chris Wofford: Thanks again, friends, and subscribe to stay in touch.