
#465: The AI Revolution Won't Be Monopolized Transcript

Recorded on Thursday, May 9, 2024.

00:00 There hasn't been a boom like the AI boom since the dot com days, and it may look like

00:05 a space destined to be controlled by a couple of tech giants.

00:08 But Ines Montani thinks open source will play an important role in the future of AI.

00:13 I hope you join us for this excellent conversation about the future of AI and open source.

00:19 This is Talk Python to Me, episode 465, recorded May 8th, 2024.

00:24 Are you ready for your host? Here he is!

00:27 You're listening to Michael Kennedy on Talk Python to Me.

00:30 Live from Portland, Oregon, and this segment was made with Python.

00:37 Welcome to Talk Python to Me, a weekly podcast on Python.

00:40 This is your host, Michael Kennedy.

00:42 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython,

00:47 both on fosstodon.org.

00:50 Keep up with the show and listen to over seven years of past episodes at talkpython.fm.

00:55 We've started streaming most of our episodes live on YouTube.

00:59 Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows and

01:05 be part of that episode.

01:06 This episode is brought to you by Sentry.

01:08 Don't let those errors go unnoticed.

01:10 Use Sentry like we do here at Talk Python.

01:12 Sign up at talkpython.fm/sentry.

01:14 And it's brought to you by Porkbun.

01:18 Launching a successful project involves many decisions, not the least of which is choosing

01:22 a domain name.

01:23 Get a .app, .dev, or .foo domain name at Porkbun for just $1 for the first year at

01:29 talkpython.fm/porkbun.

01:32 Before we jump into the show, a quick announcement.

01:34 Over at Talk Python, we just launched a new course, and it's super relevant to today's

01:39 topic.

01:40 The course is called Getting Started with NLP and spaCy.

01:44 It was created by Vincent Warmerdam, who has spent time working directly on spaCy at Explosion

01:49 AI.

01:50 The course is a really fun exploration of what you can do with spaCy for processing

01:53 and understanding text data.

01:55 And Vincent uses the past nine years of Talk Python transcripts as the core data for the

02:00 course.

02:01 If you have text data you need to understand, check out the course today at talkpython.fm/spacy.

02:07 The link is in your podcast player show notes.

02:09 Now on to more AI and spaCy with Ines.

02:12 Ines, welcome back to Talk Python to Me.

02:14 Yeah, thanks for having me back again.

02:16 You're one of my favorite guests.

02:17 It's always awesome to have you.

02:19 Thanks.

02:20 You're my favorite podcast.

02:21 Thank you.

02:22 We have some really cool things to talk about.

02:24 spaCy, some of course, but also more broadly, we're going to talk about just LLMs and AI

02:32 and open source and business models and even monopolies.

02:35 We're going to cover a lot of things.

02:37 You've been kind of making a bit of a roadshow, a tour around much of Europe and talking about

02:44 some of these ideas, right?

02:46 Yeah.

02:47 I've been invited to quite a few conferences and I feel like this is like after COVID,

02:51 the first proper, proper year again that I'm traveling for conferences.

02:55 And I was like, why not?

02:56 And then I think especially now that so much is happening in the AI space, I think it's

03:00 actually really nice to go to these conferences and connect with actual developers.

03:05 Because if you're just sitting on the internet and you're scrolling, I don't know, LinkedIn

03:09 and sometimes it can be really hard to tell what people are really thinking and what's...

03:14 Do people really believe some of these hot, weird takes that people are putting out there?

03:19 So yeah, it was very, very nice to talk about some of these ideas, get them checked against

03:25 what developers think.

03:26 So yeah, it's been really cool.

03:28 And there's more to come.

03:29 Yeah, I know.

03:30 I'll be traveling again later this month to Italy for PyCon for my first time, then PyData

03:35 London and who knows what else.

03:37 If you must go to Italy and London, what terrible places to spend time in, huh?

03:42 I'm definitely very open for tips, especially for Italy, for Florence.

03:45 I've never been to Italy ever.

03:48 Oh, awesome.

03:49 I've been to Rome, but that's it.

03:51 And so I don't have any tips, but London is also fantastic.

03:55 Yeah.

03:56 Cool.

03:57 So people can check you out.

03:58 Maybe, I think, do you have a list of that publicly where people can see some of your

04:02 talks?

04:03 We can put that in the show notes.

04:04 Yeah.

04:06 It's on my website and then also on the Explosion site of our company, we've actually added

04:08 an events page because it came up a lot that like either me, Matt, people from our team

04:14 giving talks.

04:15 And so we thought like, hey, let's... and podcasts as well.

04:17 So we're collecting everything on one page, all the stuff we're doing, which is kind of

04:20 fun.

04:21 Which is quite a bit, actually, for sure.

04:22 Yeah.

04:24 Well, I know many people know you, but let's just talk a little bit about spaCy, Explosion,

04:31 Prodigy, the stuff that you guys are doing to give people a sense of where you're coming

04:34 from.

04:35 Yeah.

04:36 So we're an open source company and we build developer tools for AI, natural language processing

04:43 specifically.

04:44 So, you know, you're working with lots of text, you want to analyze it beyond just looking

04:47 for keywords.

04:49 That's kind of where we started and what we've always been focusing on.

04:53 So spaCy is probably what we're mostly known for, which is a popular open source library

04:58 for really what we call industrial-strength NLP.

05:01 So built for production, it's fast, it's efficient.

05:05 We've put a lot of work into having good usable, user-friendly, developer-friendly APIs.

05:11 Actually, yeah, I always set an example.

05:13 I always like to show in my talks a nice side effect that we never anticipated like that

05:17 is that ChatGPT and similar models are actually pretty good at writing Spacey code because

05:23 we put a lot of work into all of this stuff like backwards compatibility, not breaking

05:27 people's code all the time, stuff like that.

05:30 But that happens to really help, at least for now, with these models.

05:34 It's really nice.

05:35 It's a good thing you've done to make it a really stable API that people can trust.

05:40 But is it weird to see LLMs talking about stuff you all created?

05:44 It's kind of, it's funny in some way.

05:46 I mean, there is this whole other side to it, you know, doing user support and

05:54 detecting clearly auto-generated code. Because while for spaCy these models are pretty good, for

05:58 Prodigy, which is our annotation tool and is also scriptable in Python,

06:02 it's a bit less precise and they hallucinate a lot because there's just less code online

06:06 and on GitHub.

06:07 So we sometimes get like support requests where like users post their code and we're

06:12 like, this is so strange.

06:13 How did you find these APIs?

06:14 They don't exist.

06:15 And then we're like, ah, this was auto-generated.

06:17 Oh, okay.

06:19 So that was a very new experience.

06:21 And also it's, you know, I think everyone who publishes online deals with that, but

06:25 like, it's very frustrating to see all these like auto-generated posts that like look like

06:30 tech posts, but are completely bullshit and completely wrong.

06:34 Like I saw something on spacy-llm, which is our extension for integrating large language

06:40 models into spaCy.

06:41 And they're like some blog posts that look like they're tech blog posts, but they're

06:46 like completely hallucinated.

06:48 And it's very, very strange to see that about like your own software.

06:52 And also it frustrates me because that stuff is going to feed into the next generation

06:55 of these models.

06:56 Right.

06:57 And I think the models will stop being so good at this because they'll be full of stuff that they've

07:02 generated themselves, like APIs and things that don't even exist.

07:06 Yeah.

07:07 It's just going to cycle around and around and around until it just gets worse every

07:11 time.

07:12 And then.

07:13 That's interesting.

07:14 Like it's very interesting to see what's going on and where these things lead.

07:17 It is.

07:18 You know, I just had a thought.

07:19 I was, you know, OpenAI and some of these different companies are doing work to try

07:25 to detect AI generated images.

07:28 And I imagine AI generated content.

07:31 When I heard that, my thought was like, well, that's just because they kind of want to be

07:33 good citizens and they want to put little labels on it. But what if it's just so they

07:38 don't ingest it twice?

07:40 I think that's definitely, and I mean, in a way it's also good because, you know, it

07:44 would make these models worse.

07:46 And so from like, you know, from a product perspective for a company like OpenAI, that's

07:50 definitely very useful.

07:53 And I think also, you know, commercially, I think there's definitely, you know, a big

07:57 market in that also for like social networks and stuff to detect are these real images,

08:04 are these deep fakes, is there money in that too?

08:07 So it's not, I don't think it's just, yeah, being good citizens, but like there's a clear

08:11 product motivated thing, which is fine, you know, for a company.

08:14 Yeah, it is fine.

08:15 I just, I never really thought about it.

08:17 Of course.

08:19 I mean, I think we're getting to some point where in food or art, you hear about artisanal

08:25 handcrafted pizza or, you know, whatever.

08:28 Will there be artisanal human created tech that has got a special, special flavor to

08:33 it?

08:34 Like this was created with no AI.

08:35 Look at how cool this site is or whatever.

08:37 I think it's already something that like, you see, like, I don't know which product

08:41 this was, but I saw there was some ad campaign.

08:43 I think it might've been a language learning app or something else where they really like

08:48 put that into one of their like marketing claims, like, Hey, it's not AI generated.

08:53 We don't use AI.

08:54 It's actually real humans because it seems to be, you know, what people want.

08:58 They want, you know, they want to have at least that feeling.

09:00 So I definitely think there's an appeal of that also going forward.

09:04 The whole LLM and AI stuff, it's just, it's permeated culture so much.

09:09 I was at the motorcycle shop yesterday talking to a guy who was a motorcycle salesman.

09:13 And he was like, do you think that AI is going to change how software developers work?

09:18 Do you think they're still going to be relevant?

09:19 I'm like, you're a sales guy.

09:20 You're a motorcycle sales guy.

09:21 This is amazing.

09:22 How are you really this tuned into it, right?

09:25 You know, you think it's maybe just a little echo chamber of us

09:30 talking about it, but it seems like these kinds of conversations are more broad than

09:33 maybe you would have guessed.

09:34 ChatGPT definitely, you know, brought the conversation into the mainstream, but on the

09:38 other hand, on the plus side, it also means it makes it a lot easier for us kind of to

09:42 explain our work because people have at least heard of this.

09:46 And I think it's also for developers working in teams, like on the one hand, it can maybe

09:50 be frustrating to do this expectation management because you know, you have management who

09:54 just came back from some fancy conference and got sold on like, Ooh, we need like some

10:00 chat bot or LLM.

10:01 It's kind of the chat bot hype all over again that we already had in 2015 or so.

10:07 That can be frustrating.

10:08 Everyone thought those were going to be so important.

10:09 And now what are they doing?

10:10 Yeah, but it's like, yeah, I see a lot of parallels.

10:13 It's like, if you look at kind of the hype cycle and people's expectations and expectation

10:18 management, it's kind of the same thing in a lot of ways, only that now

10:22 a lot of parts actually kind of work, which we didn't really have before.

10:26 Yeah.

10:27 But yeah, it also means for teams and developers that they at least have some more funding

10:30 available and resources that they can work with.

10:33 Because I felt like before that happened, it looked like that, you know, companies are

10:37 really cutting their budgets, all these exploratory AI projects, they all got cut.

10:42 It was quite frustrating for a lot of developers.

10:44 And now at least, you know, it means they can actually work again, even though they

10:48 also have to kind of manage the expectations and like work around some of the wild ideas

10:55 that companies might have at the moment.

10:57 Absolutely.

10:58 Now, one of the things that's really important and we're going to get to here, give you a

11:02 chance to give a shout-out to the other thing that you all have is, how do you teach

11:06 these things information?

11:09 How do you get them to know things and so on?

11:12 And you know, for the spaCy world, you have Prodigy and maybe give a shout out to Prodigy

11:16 Teams.

11:17 That's something you're just announcing, right?

11:19 Yeah.

11:20 So that's currently in beta.

11:21 It's something we've been working on.

11:22 So the idea of Prodigy has always been, hey, you know, support spaCy, also other libraries.

11:27 And how can we, yeah, how can we make the training and data collection process more

11:31 efficient or so efficient that companies can in-house that process?

11:36 Like whether it's creating training data, creating evaluation data, like even if what

11:41 you're doing is completely generative and you have a model that does it well, you need

11:44 some examples and some data where you know the answer.

11:47 And often that's a structured data format.

11:49 So you need to create that. And, you know, we're really seeing that outsourcing that doesn't

11:54 work very well.

11:55 And also now with the newer technologies, like transfer learning, you don't need millions

12:00 of examples anymore.

12:01 So like this big, big data idea for task specific stuff is really dead in a lot of ways.

12:07 So Prodigy is a developer tool that you can script in Python and that makes it easy to

12:13 really collect this kind of structured data on text images and so on.

12:19 And then Prodigy Teams, that has been a very ambitious project.

12:22 We've really been, we've wanted to ship this a long time ago already, but it's been very

12:27 challenging because we basically want to bring also a lot of these ideas that probably we're

12:31 going to talk about today a bit into the cloud while retaining the data privacy.

12:36 And so you'll be able to run your own cluster on your own infrastructure that has the data

12:41 that's scriptable in Python.

12:43 So you can kind of script the SaaS app in Python, which is very cool, which you normally

12:47 can't do.

12:48 Your data never leaves your servers.

12:51 And you can basically also use these workflows like distillation, where you start out with

12:56 a super easy prototype that might use Llama or some other models, or ChatGPT, GPT-4.

13:03 Then you benchmark that, see how it does.

13:06 And then you collect some data until you can beat that in accuracy and have a task-specific

13:11 model that really only does the one extraction you're interested in.

13:14 And that model can be tiny.

13:15 Like we've had users build models that are under 10 megabytes.

13:19 Like that's, that is pretty crazy to think about these days.

13:23 And they run like 20 times faster, they're entirely private.

13:27 You can, you know, you don't need like tons of compute to run them.

13:31 And that's kind of really one of the workflows of the future that we see as very promising.
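The prototype-then-distill loop Ines describes can be sketched in plain Python. Everything below is hypothetical stand-in code, not Prodigy's actual API: `llm_extract` plays the role of a prompted model like GPT-4, `distilled_extract` a tiny task-specific model, and only the benchmarking logic is the point.

```python
# Sketch of the workflow: prototype with an LLM, benchmark it against a small
# gold evaluation set, then train a tiny task-specific model and compare.

def llm_extract(text: str) -> str:
    """Hypothetical stand-in for a prompted LLM doing label extraction."""
    return "ORG" if "spaCy" in text or "Explosion" in text else "NONE"

def distilled_extract(text: str) -> str:
    """Hypothetical stand-in for a small distilled task-specific model."""
    return "ORG" if "spaCy" in text else "NONE"

def accuracy(predict, gold) -> float:
    """Fraction of gold examples the model labels correctly."""
    correct = sum(1 for text, label in gold if predict(text) == label)
    return correct / len(gold)

# A tiny gold evaluation set: examples where we know the answer.
gold = [
    ("spaCy is fast", "ORG"),
    ("Explosion builds tools", "ORG"),
    ("the weather is nice", "NONE"),
    ("I like coffee", "NONE"),
]

llm_score = accuracy(llm_extract, gold)
small_score = accuracy(distilled_extract, gold)
print(f"LLM prototype: {llm_score:.2f}, distilled model: {small_score:.2f}")
```

In practice you would keep annotating and retraining until the small model matches or beats the prototype on this gold set, at which point the cheap, private model can replace the LLM for that one task.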

13:35 And it's also, people are often surprised how little task specific data you actually

13:40 need to, say, beat GPT-4 in accuracy.

13:43 It's not as much as people think.

13:45 And it's totally, you know, in a single workday, you could often do it.

13:50 The main idea we've been thinking about a lot is basically how can we make that workflow

13:53 better and more user-friendly, even for people who don't have an extensive machine learning

13:57 background.

13:59 Because one thing that like prompting an LLM or prompting a generative model has is that

14:03 it's a very low barrier to entry.

14:05 And it's very, very, the UX is very good.

14:08 You just type in a question, you talk to it the way you would talk to a human.

14:11 And that's easy to get started with.

14:13 The workflow, that's a bit more involved.

14:15 Yes, machine learning developers know how to do that, and they know when to do it.

14:20 But it's not as accessible to people who don't have all of that experience.

14:24 And so that's kind of the underlying thing that we're trying to solve.

14:30 This portion of Talk Python to Me is brought to you by Sentry.

14:32 In the last episode, I told you about how we use Sentry to solve a tricky problem.

14:37 This time, I want to talk about making your front end and back end code work more tightly

14:41 together.

14:42 If you're having a hard time getting a complete picture of how your app is working, and how

14:47 requests flow from the front end JavaScript app, back to your Python services down into

14:52 database calls for errors and performance, you should definitely check out Sentry's distributed

14:57 tracing.

14:58 With distributed tracing, you'll be able to track your software's performance, measure

15:02 metrics like throughput and latency, and display the impact of errors across multiple systems.

15:09 Distributed tracing makes Sentry a more complete performance monitoring solution, helping you

15:13 diagnose problems and measure your application's overall health more quickly.

15:18 Tracing in Sentry provides insights such as what occurred for a specific event or issue,

15:23 the conditions that cause bottlenecks or latency issues, and the endpoints and operations that

15:28 consume the most time.

15:29 Help your front end and back end teams work seamlessly together.

15:33 Check out Sentry's distributed tracing at talkpython.fm/sentry-trace.

15:39 That's talkpython.fm/sentry-trace.

15:40 And when you sign up, please use our code TALKPYTHON, all caps, no spaces, to get more

15:48 features and let them know that you came from us.

15:51 Thank you to Sentry for supporting the show.

15:54 You talked about transfer learning and using relatively small amounts of data to specialize models.

15:59 Tell people about what that is.

16:00 How do you actually do that?

16:02 It's actually the same idea that has ultimately really led to these large generative models

16:07 that we see.

16:08 And that's essentially realizing that we can learn a lot about the language and the world

16:14 and a lot of general stuff from raw text.

16:17 If we just train a model with a language modeling objective on a bunch of text on all internet

16:24 or parts of the internet or whatever, in order to basically solve the task, which can be

16:30 stuff like predict the next word, in order to do that, the model has to learn so much

16:34 in its weights and in its representations about the language and about really underlying

16:40 subtle stuff about a language that it's also really good at other stuff.

16:44 That's kind of in a nutshell the basic idea.

16:46 And that's then later led to larger and larger models and more and more of these ideas.

16:52 But yeah, the basic concept is if you just train on a lot of raw text and a lot of these

16:58 models are available, like something like BERT, that's already quite a few years old,

17:03 but still, if you look at the literature and look at the experiments people are doing,

17:07 it's still very competitive.

17:09 It's like you get really good results, even with one of the most basic foundation models.

17:14 And you can use that, initialize your model with that, and then just train a small task

17:18 network on top instead of training everything from scratch, which is what you had to do

17:22 before.

17:23 And it's like, if you imagine hiring a new employee, it's like, yes, you could raise them

17:29 from birth yourself, which is like a very creepy concept, but it's really

17:35 similar.

17:36 Teach them everything.

17:38 You were born to be a barista.

17:40 Let me tell you.

17:41 Yeah.

17:42 And then you teach them English and you teach them.

17:44 Yeah.

17:45 I mean, it's a lot of work.

17:47 And I guess you know this more than me because you have kids.

17:50 Right.

17:51 So yeah.

17:52 So, and you know, it's understandable that like, okay, this made a lot

17:56 of these ML projects really hard, but now you actually have the employee come in and

18:00 they can, they know how to talk to people.

18:02 They speak the language and all you have to teach them is like, Hey, here's how you make

18:06 a coffee here.

18:07 Exactly.

18:08 Yeah.

18:09 You basically lean on the school system to say they know the language, they know arithmetic,

18:15 they know how to talk to people.

18:16 I just need to show them how this espresso machine works.

18:19 Here's how you check in.

18:21 Please take out the trash every two hours.

18:23 Like, yeah, a very little bit of specialized information, but the sort of general

18:27 working human knowledge is like the base LLM.

18:30 Right.

18:31 That's the idea.

18:32 And also transfer learning, it's still, it's just one technology and in context learning,

18:36 which is what we have with these generative models, that's also just another technique.

18:41 Like it's, you know, it's not the case that transfer learning is sort of outdated or has

18:45 been replaced by in context learning.

18:47 It's two different strategies and you use them in different contexts.
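The idea Ines lays out, a frozen pretrained model with a small trainable task head on top, can be sketched in a few lines of pure Python. This is a toy illustration, not a real foundation model: the "encoder" is a crude hand-built feature extractor standing in for something like BERT, and only the tiny logistic-regression head gets trained.

```python
import math

# Transfer learning in miniature: the "pretrained" encoder is frozen and
# only a small task head (one weight per feature) is trained on top of it.

def pretrained_encoder(text: str) -> list:
    """Frozen stand-in for a foundation model: maps text to fixed features."""
    words = text.lower().split()
    return [
        sum(1 for w in words if w in {"great", "love", "nice"}),   # positive cue
        sum(1 for w in words if w in {"bad", "awful", "hate"}),    # negative cue
        1.0,                                                       # bias feature
    ]

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train_head(examples, epochs=200, lr=0.5):
    """Train only the small task head; the encoder is never updated."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for text, label in examples:
            x = pretrained_encoder(text)
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            # gradient step on log loss, applied only to the head weights
            w = [wi + lr * (label - p) * xi for wi, xi in zip(w, x)]
    return w

# A handful of labeled examples is enough, because the heavy lifting
# already happened inside the (frozen) encoder.
train = [("I love this", 1), ("great stuff", 1), ("this is awful", 0), ("bad and worse", 0)]
w = train_head(train)

def predict(text: str) -> int:
    x = pretrained_encoder(text)
    return int(sigmoid(sum(wi * xi for wi, xi in zip(w, x))) > 0.5)
```

The point of the sketch is the division of labor: four labeled examples train the head, while general "knowledge of the language" is assumed to already live in the frozen encoder, which is why task-specific data needs can be so small.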

18:51 So another thing I want to touch on for people, I know some people, probably everyone listening

18:56 is more or less aware of this, but in practice, a lot of folks out there, certainly

19:01 the ones who are not in the ML or developer space.

19:06 They just go to ChatGPT or they go somewhere and they're like, this is the AI I've gone

19:10 to, right?

19:12 Maybe they go to Bard.

19:13 I don't know.

19:14 Gemini, whatever they call it.

19:16 But there's a whole bunch, I mean, many, many, many open source models with all sorts of

19:22 variations.

19:23 One thing I really like is LM Studio, which someone introduced me to

19:29 a couple of months ago.

19:30 And basically it's a UI for exploring hugging face models and then downloading them and

19:35 running them with like a chat interface, just in a UI.

19:38 And the really cool thing is they just added Llama 3, but a lot of these are open source.

19:43 A lot of these are accessible.

19:44 You run them on your machine; 7-billion-parameter models run easily on my Mac mini.

19:48 Yeah.

19:49 What do you think about some of these models rather than the huge ones?

19:52 Yeah, no, I think it's, and also a lot of them are like, you know, the model itself

19:56 is not necessarily much smaller than what a lot of these chat systems deploy.

20:02 And I think it's also, you know, these are really just the core models for everything

20:06 that's like proprietary and sort of in-house behind like an API, there is at least one open source

20:13 version that's very similar.

20:15 Like I think the whole field is really based on academic research, a lot of the same

20:22 data that's available.

20:23 And I think the most important differentiation we see is then around these chat assistants

20:29 and how they work and how the products are designed.

20:31 So I think it's also, this is a, it's kind of a nice exercise or a nice way to look at

20:36 this distinction between the products versus the machine-facing models.

20:42 Because I think AI, or, you know, these products, are more than just a model.

20:46 And I think that's like a super important thing to keep in mind.

20:49 It's really relevant for this conversation because you have a whole section where you

20:53 talk about regulation and what is the thing, what is the aspect of these things that should

20:58 or could be regulated?

20:59 We'll get to that in a minute.

21:00 A lot of the confusion that people have around like, Ooh, is all

21:04 AI going to be locked away behind APIs?

21:07 And how do these bigger models work?

21:10 I think it kind of stems from the fact that the distinction between

21:14 the models and products isn't always clear.

21:17 And you know, you could even have, maybe, some companies that are in this business,

21:20 you know, it benefits them to call everything the AI.

21:24 That really doesn't help.

21:25 So here you really see the models.

21:27 Yeah.

21:28 And just sorry to talk over you, but to give people a sense, even if you search for Llama

21:32 3 in this thing, there's 192 different configured, modified, et cetera, ways to

21:39 work with the Llama 3 model, which is just crazy.

21:42 So there's a lot of, a lot of stuff that maybe people haven't really explored, I imagine.

21:46 Yeah, it's very cool.

21:47 Yeah.

21:49 One other thing about this, just while we're on it, is it also comes with an OpenAI-compatible API.

21:51 So you could just run it, turn on a server API, and point anything that wants to talk to

21:56 it.

21:57 Very fun.
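Because the local server speaks the OpenAI chat-completions wire format, talking to it takes only the standard library. A minimal sketch, with some assumptions: LM Studio's default local address and port (`localhost:1234`) and the model name `llama-3-8b-instruct` are placeholders here; check the app's server tab for the actual values on your machine.

```python
import json
import urllib.request

# Assumed default address for a locally running OpenAI-compatible server.
BASE_URL = "http://localhost:1234/v1"

def build_chat_request(model: str, prompt: str):
    """Build the URL and JSON body for a chat completion call."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return f"{BASE_URL}/chat/completions", json.dumps(payload).encode("utf-8")

def ask(model: str, prompt: str) -> str:
    """Send the request to the local server and return the reply text."""
    url, body = build_chat_request(model, prompt)
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # Only works with a local server actually running.
    print(ask("llama-3-8b-instruct", "Say hello in one word."))
```

The nice part of this design is that any existing client built for the hosted OpenAI API can be pointed at the local model just by swapping the base URL.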

21:58 But let's talk about some of the things you talked about in your talk.

22:01 The AI revolution will not be monopolized.

22:04 How open source beats economies of scale, even for LLMs.

22:07 I love it.

22:08 It's a great title and a great topic.

22:09 Thanks.

22:10 No, I'm very, it's something I'm very passionate about.

22:12 I was like, I was very, you know, happy to be able to say a lot of these things or to,

22:17 you know, also be given a platform.

22:19 You and I, we spoke before about open source and running successful businesses in the tech

22:24 space and all sorts of things.

22:25 So it's a cool follow on for sure.

22:27 I think one of the first parts that you talked about that was really interesting and has

22:32 nothing to do specifically with LLMs or AI is just why is open source a good choice and

22:39 why are people choosing, why is it a good thing to base businesses on and so on?

22:44 Yeah.

22:45 Also, often when I give this as a talk, I ask for a show of hands, like, Hey,

22:50 who uses open source software?

22:52 Who works for a company that depends on open source software?

22:55 Who's contributed before?

22:56 And usually I think most people raise their hand when I ask like who works for a company

23:01 that relies on open source software.

23:03 So I often feel like, Hey, I don't even have to like explain like, Hey, it's a thing.

23:09 It's more about like, you know, collecting these reasons.

23:11 And I do think a lot of it is around like, you know, the transparency, the extensibility,

23:16 it's all kind of connected.

23:17 Like you're not locked in, you can run it in house, you can fork it, you can program

23:23 with it.

23:24 Like those are all important things for companies when they adopt software.

23:28 And also often, you have these small teams running the project, they can accept PRs,

23:31 they can move fast.

23:33 There's a community around it that can basically give you a sense for, Hey, is this a thing?

23:37 Should I adopt it?

23:39 And all of this I think is important.

23:40 And we also often make a point, like, yes, I always mention, Hey, it's also free,

23:46 which is what people usually associate with open source software.

23:49 It's kind of the first thing that comes to mind, but I actually don't think this is for

23:52 companies the main motivation why they use open source software.

23:56 I absolutely agree.

23:58 Even though we have FOSS, free and open source software, this is not really why companies

24:05 care about it.

24:06 Right?

24:07 Some people do, some people don't, but companies, they often see that as a negative, I think

24:12 almost like, well, who do we sue if this goes wrong?

24:15 Where's our service level agreement?

24:17 Who's going to help us?

24:18 Who's legally obligated to help us?

24:20 We've definitely also seen that, or like we have companies who are like, well, who can

24:24 we pay or can we pay to like, I don't know, get some guarantee or like some support or

24:32 can you like confirm to us that, Hey, if there is a critical vulnerability, that's like really

24:37 directly affecting our software, which has never happened, but are you going to fix it?

24:42 We're like, yes, we can say that that's what we've been doing.

24:45 But if you want that guarantee, we can give that to you for money.

24:48 Sure.

24:49 But like, you can pay us, we'll promise to do what we already promised to do, but we'll

24:52 really double, double promise to do it.

24:53 Right.

24:54 That's definitely a thing.

24:55 And also it's kind of to, you know, to go back up to the business model thing, it's

24:57 what we've seen with Prodigy, which, you know, we really offer as a tool

25:02 that follows the open source spirit.

25:04 Like you don't, you pip install it, it's a Python library, you work with it, but we decided

25:09 to kind of use that as a stepping stone between our free open source offering and like the

25:13 SaaS product that we're about to launch soon, hopefully.

25:18 And it's kind of in the middle and it's paid.

25:20 And we've definitely not found that this is like a huge disadvantage for companies.

25:25 Like, yeah, sure, you always have companies with like no budget, but those are also usually

25:29 not the teams that are really doing, you know, a lot of the high value work because you know,

25:34 it is quite normal to have a budget for software tools.

25:38 Companies pay a lot for this.

25:39 Like, you know, if you, if you want to buy Prodigy, like that, that costs less than,

25:43 I don't know, getting a decent office chair, like, you know, in a, in a commercial context,

25:48 these, these scales are all a bit different.

25:50 So yeah, I do think companies are happy to pay for something that they need and that's

25:54 cool.

25:55 Yeah.

25:56 And the ones who wouldn't have paid, there's a group who said, well, maybe I'll use the

25:59 free one, but they're not serious enough about it to actually pay for it or actually make

26:04 use of it.

26:05 You know, I think of sort of analogies of piracy, right?

26:08 Like, oh, they stole our app or they stole our music.

26:10 Like, well, it was a link and they clicked it, but they wouldn't have bought it

26:14 or used it at all.

26:15 It's not like you lost a customer because they were not going to be customers.

26:18 They just happened to click the link.

26:19 Yeah, exactly.

26:20 I always tell the story of like, when I was, you know, a teenager, I did download a cracked

26:25 version of Adobe Photoshop.

26:27 And because I was a teenager, I would have never been able to afford it. Like, back then

26:31 they didn't have a SaaS model.

26:32 Like, I don't know what Photoshop would cost, but like, it's definitely not something

26:35 I would have been able to afford as a 13, 14 year old.

26:38 So I did find that online.

26:39 I downloaded it.

26:40 I'm pretty sure if Adobe had wanted, they could have come after me for that.

26:45 And I do think like, I don't know, maybe I'm giving them too much credit, but I do think

26:48 they might've not done that because they're like, well, what, it's not like we lost a

26:52 customer here.

26:53 And now I'm an adult and I'm, I'm proficient at Photoshop and now I'm paying for it.

26:57 Yeah, exactly.

26:58 And I think there was this whole generation of teenagers who then maybe went into creative

27:02 jobs and came in with Photoshop skills.

27:04 Like, compared to all these other teenagers I was hanging out with on

27:08 the internet, mostly girls,

27:11 I wasn't even that talented at Photoshop specifically.

27:13 So maybe there was someone smart who thought about this as, like, a business strategy:

27:18 let these teenagers have our professional tools.

27:21 Exactly.

27:22 It's almost marketing.

27:23 Yeah.

27:24 Another aspect here that I think is really relevant to LLMs is "runs in-house", aka we're

27:30 not sending our private data, private source code, API keys, et cetera, to other companies

27:36 that may even use that to train their models, which then regurgitate that back to other

27:40 people who are trying to solve the same problems.

27:42 Right.

27:43 That's also, we're definitely seeing that companies are becoming more and more aware

27:46 of this, which is good.

27:48 Like in a lot of industries, like I wouldn't want, I don't know, my healthcare provider

27:52 to just upload all of my data to like whichever SaaS tool they decide to use at the moment.

27:57 Like, you know, of course not.

27:59 So I think it's, you know, it's, it's good.

28:00 And then also with, you know, more data privacy regulations, that's all, that's really on

28:05 people's minds and people don't want this.

28:08 Like often we have companies or users who actually have to run a lot of their AI

28:13 stuff on completely air-gapped machines, so they can't even have internet access.

28:17 Or it's about, you know, financial stuff.

28:19 We're actually working on a case study that we're hoping to publish soon where even the

28:24 financial information can move markets.

28:26 It's even segregated in the office.

28:28 So it needs to be 100% in-house.

28:31 Yeah.

28:32 That makes sense.

28:33 And I think open source software, it's great because you can do that and you can build

28:36 your own things with it and really decide how you want to host it, how it fits into

28:41 your existing stack.

28:43 That's another big thing.

28:44 People will already use some tools and you know, you don't want to change your entire

28:49 workflow for every different tool or platform you use.

28:53 And I think especially people have been burned by that so many times by now.

28:56 And there are so many, you know, unreliable startups. You have a company

29:01 that really tries to convince you to build on their product,

29:04 and then two months later they close everything down. Or, you know, it doesn't even have to

29:08 be a startup.

29:09 You know, Google... I'm still mad at Google for shutting down Google Reader.

29:14 And I don't know, it's been over 10 years, I'm sure.

29:17 And I'm still angry about that.

29:19 I actually did this in practice.

29:20 We did it.

29:21 We were invited to give a talk at Google and I needed a text example to visualize, you

29:26 know, something grammatical, and the text I made was: Google shut down Google Reader.

29:30 That's a quiet protest.

29:33 That's amazing.

29:34 Yeah.

29:35 We're going to run sentiment analysis on this article here.

29:39 Sure.

29:40 Open source projects can become unmaintained and that sucks, but like, you know, you can

29:44 fork it.

29:45 It's there and you can have it.

29:47 So that's a mitigating factor.

29:49 And I think we've always called it like: you can reinvent the wheel, but don't reinvent

29:52 the road, which is basically, you can build something.

29:57 Reinventing the wheel I don't think is bad, but you don't want to make people follow,

30:01 like, you know, your way of doing everything.

30:05 And yeah, that's interesting.

30:06 Yeah.

30:07 Like we have electric cars now.

30:08 All right.

30:09 So give us a sense of some of the open source models in this AI space here.

30:15 I've kind of divided it into sort of three categories.

30:18 So one of them is what I've called task specific models.

30:22 So those are really models that are trained to do one specific thing or a few specific things.

30:27 It's kind of what we distribute for spaCy.

30:30 There's also a lot of really cool community projects like SciSpacy for scientific and biomedical

30:37 text.

30:38 Stanford also publishes their Stanza models.

30:41 And yeah, if you've been on the Hugging Face Hub, there's like tons of these models that

30:45 were really fine tuned to predict like a particular type of categories, stuff like that.

30:50 And so that's been around for quite a while, quite established.

30:54 A lot of people use these in production, and it's quite established. Especially by

30:58 today's standards, they're quite small and cheap, but of course they do one particular thing.

31:03 So they don't generalize very well.

31:05 So that's kind of the one category.

31:07 You probably used to think of them as large and now you see how giant, how many gigabytes

31:14 those models are, you know?

31:15 Yeah.

31:16 When deep learning first kind of came about and people were sort of migrating from linear

31:20 models and stuff, like I've met people complaining that the models were so big and

31:25 slow and that was before we even used much transfer learning and transformer models and

31:33 BERT and stuff.

31:34 And even when that came about, it was also of course the challenge like, Hey, these are

31:37 significantly bigger.

31:38 We do have to change a lot around it.

31:40 Or even, you know, Google who published BERT, they had to do a lot of work around it to

31:46 kind of make it work into their workflows and ship them into production and optimize

31:50 them because they're quite different from what was there before.

31:55 This portion of Talk Python to Me is sponsored by porkbun.com.

31:59 Launching a successful project involves many decisions, not the least of which is choosing

32:03 a domain name.

32:05 And as your project grows, ownership and control of that domain is critical.

32:09 You want a domain registrar that you can trust.

32:12 I recently moved a bunch of my domains to a new provider and asked the community who

32:15 they recommended I choose.

32:18 Porkbun was highly recommended.

32:20 Porkbun specializes in domains that developers need like .app, .dev and .foo domains.

32:26 If you're launching that next breakthrough developer tool or finally creating a dedicated

32:30 website for your open source project, how about a .dev domain or just show off your

32:36 kung.foo programming powers with a domain there.

32:40 These domains are designed to be secure by default.

32:43 All .app and .dev domains are HSTS preloaded, which means that all .app and .dev websites

32:49 will only load over an encrypted SSL connection.

32:53 This is the gold standard of website security.

32:56 If you're paying for Whois privacy, SSL certificates and more, you should definitely check out

33:02 Porkbun.

33:03 These features are always free with every domain.

33:05 So get started with your next project today.

33:08 Lock down your .app, .dev or .foo domain at Porkbun for only $1 for the first year.

33:14 That's right, just $1.

33:17 Visit talkpython.fm/porkbun.

33:19 That's talkpython.fm/porkbun.

33:22 The link is in your podcast player's show notes.

33:25 Thank you to Porkbun for supporting the show.

33:28 Another one in this category of task specific models is SciSpacy, which is kind of cool.

33:34 What's SciSpacy?

33:35 Yeah, so SciSpacy, that's for scientific biomedical text that was published by Allen AI researchers.

33:42 And yeah, it's really, it has like components specific for working with that sort of data.

33:49 And it's actually, if that's kind of the domain any listeners are

33:53 working with, definitely check it out.

33:55 They've also done some pretty smart work around, like, training components, but also implementing

34:02 like, hybrid rule-based things for, say, acronym expansion.

34:07 There are, like, cool algorithms that you can implement that don't necessarily need much

34:11 machine learning, but that work really well.

34:13 And so it's basically this suite of components and also models that are more tuned for that

34:19 domain.
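As a rough illustration of the kind of hybrid rule-based approach Ines describes, here is a hand-rolled sketch of acronym expansion. This is not SciSpacy's actual implementation (its detector follows a published method and is considerably more careful); it just shows why this kind of task often needs no machine learning at all:

```python
import re

def find_abbreviations(text):
    """Collect 'long form (ABBR)' pairs: the abbreviation's letters must
    match the initials of the words right before the parentheses."""
    pairs = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        abbr = match.group(1)
        preceding = text[:match.start()].split()
        candidate = preceding[-len(abbr):]
        if "".join(w[0].upper() for w in candidate) == abbr:
            pairs[abbr] = " ".join(candidate)
    return pairs

def expand(text, pairs):
    """Replace bare occurrences of each abbreviation with its long form."""
    for abbr, long_form in pairs.items():
        text = re.sub(rf"(?<!\()\b{abbr}\b(?!\))", long_form, text)
    return text

doc = "Magnetic resonance imaging (MRI) is common. MRI scans are routine."
pairs = find_abbreviations(doc)
print(pairs)  # {'MRI': 'Magnetic resonance imaging'}
```

Once the definition is found, every later bare occurrence can be expanded deterministically, which is exactly the kind of reliable behavior that is hard to guarantee from a generative model.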

34:20 You mentioned some, but also encoder models.

34:24 What's the difference between the task specific ones and the encoder ones?

34:26 That's kind of also what we were talking about earlier, actually, with the transfer learning

34:31 foundation models.

34:33 These are models trained with a language modeling objective, for example, like Google's BERT.

34:38 And that can also be the foundation for task specific models.

34:41 That's kind of what we're often doing nowadays.

34:44 Like you start out with some of these pre-trained weights, and then you train like this task

34:50 specific network on top of it that uses everything that is in these weights about the language

34:56 and the world.

34:57 And yeah, actually by today's standards, these are still relatively small and relatively

35:02 fast and they generalize better because they're trained on a lot of raw text that has like

35:08 a lot of, yeah, a lot of that intrinsic meta knowledge about the language and the world

35:13 that we need to solve a lot of other tasks.

35:16 Absolutely.

35:17 And then you've used the word, the term large generative models for things like Lama and

35:23 Mistral and so on.

35:25 One thing that's very unfortunate when we're talking about these models is that like everything

35:29 we've talked about here has at some point been called an LLM by someone.

35:34 That makes it like really hard to talk about it.

35:39 You can argue that like, well, all of them are kind of large language models.

35:43 And then there's also the marketing confusion.

35:47 When LLMs were hot, everyone wanted to have LLMs.

35:52 And so by some definition of LLMs, we've all been running LLMs in production for years.

35:57 But basically I've kind of decided, okay, I want to try and avoid that phrase as much

36:02 as possible because it really doesn't help.

36:04 And so large generative models kind of captures that same idea, but it makes it clear, okay,

36:09 these generate text, text goes in, text comes out and they're large and they're different

36:15 from the other types of models basically.

36:18 Question out of the audience is Mr. Magnetic said, I'd love to learn how to develop AI.

36:22 So maybe let me rephrase that just a little bit and see what your thoughts are.

36:25 Like if people want to get more foundational, this kind of stuff, like what areas should

36:28 they maybe focus in to learn?

36:32 What are your thoughts there?

36:33 It depends on really what it means.

36:35 Like if you really, there is a whole path to, okay, you really want to learn more about

36:41 the models, how they work, the research that goes into it.

36:45 I think there's a lot of actually also academic resources and courses that you can take that

36:50 are similar to what you would learn in university if you started...

36:54 An ML course.

36:55 Yeah, like ML and also I think some universities have made some of their like beginners courses

37:01 public.

37:02 I think Stanford has.

37:03 Yeah, right.

37:04 I thought Stanford, I think there's someone else, but like there's definitely also a lot

37:07 of stuff coming out.

37:08 So you can kind of go in that direction, really learn, okay, what goes into this?

37:14 What's the theory behind these?

37:15 And there are some people who really like that approach.

37:18 And then there's a whole more practical side.

37:20 Okay, I want to build an application that uses the technology and it solves a problem.

37:26 And often it helps to have like an idea of what you want to do.

37:29 Like if you don't want to develop AI for the sake of it, then it often helps like, hey,

37:33 you have, even if it's just your hobby, like you're into football and you come up with

37:37 like some fun problem, like you want to analyze football news, for example, and analyze it

37:44 for something you care about.

37:45 Like, I don't know, like often really helps to have this hobby angle or something you're

37:49 interested in.

37:50 Yeah, it does.

37:51 Yeah.

37:52 And then you can start looking at tools that go in that direction, like start with some

37:55 of these open source models, even, you know, try out some of these generative models, see

38:00 how you go; if you want to do information extraction, try out maybe something like spaCy.

38:06 That's like really a lot there.

38:07 And it's definitely become a lot easier to get started and build something these days.

38:12 Another thing you talked about was economies of scale.

38:16 And this one's really interesting.

38:17 So basically we've got Gemini and OpenAI where they've just got so much traffic. And kind

38:25 of coming back a little bit to the earlier question: if you want to do this kind of stuff and

38:29 run your own service, it's tricky even if you had the equivalent hardware,

38:33 because of even just the way you batch compute. Maybe you want to talk about that

38:38 a bit.

38:39 The idea of economies of scale is basically, well, as the, you know, as the companies produce

38:43 more output, the cost per unit decreases and yeah, there's like all kinds of, you know,

38:48 basically it gets cheaper to do more stuff.

38:51 And you know, they're like a lot of more boring, like businessy reasons why it's like that.

38:56 But I think for machine learning specifically, the fact that GPUs are so parallel really

39:02 makes a difference here.

39:03 And because, you know, you get the user text in, you can't just arbitrarily chop up

39:07 that text because the context matters.

39:09 You need to process that.

39:10 So in order to make the most use of the compute, you basically need to batch it up.

39:15 So you either, you know, kind of need to wait until there's enough to batch up.

39:20 And that means that, yes, that favors a lot of those providers that have a lot of traffic

39:25 or, you know, you introduce latency.
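The batching trade-off Ines describes can be sketched in a few lines of pure Python: a server either waits until it has enough requests to fill a batch, or flushes a partial one when the wait would hurt latency. The batch size and timeout here are made-up numbers for illustration:

```python
import time

def collect_batch(queue, max_batch=8, max_wait_s=0.05):
    """Pull requests into one GPU batch: flush when the batch is full,
    or when we've waited long enough that latency would suffer."""
    batch = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        if queue:
            batch.append(queue.pop(0))
        elif time.monotonic() >= deadline:
            break  # flush a partial batch rather than keep callers waiting
        else:
            time.sleep(0.001)  # a real server would await new requests here
    return batch

# A high-traffic provider fills batches instantly:
print(collect_batch(list(range(20))))   # a full batch of 8
# A low-traffic deployment eats the wait and flushes a partial batch:
print(collect_batch([101, 102]))        # [101, 102], after max_wait_s
```

With heavy traffic the GPU always sees full batches; with light traffic you pay either in latency (waiting) or in wasted compute (partial batches), which is the economy-of-scale effect being discussed.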

39:28 So that's definitely something that at least looks like a problem or, you know, something

39:33 that can be discouraging, because it feels like, hey, if supposedly

39:37 the only way you can kind of participate is by running these models, and either you have

39:41 to run them yourself or go via an API, then you're kind of doomed.

39:47 And does that mean that, okay, only some large companies can provide AI for us.

39:52 So that's kind of also the, you know, the point and, you know, the very legit like worry

39:56 that some people have, like, does that lead to like monopolizing AI basically?

40:01 It's a very valid concern because even if you say, okay, look, here's the deal.

40:06 OpenAI gets to run on Azure.

40:08 I can go get a machine with a GPU stuck to it and run that on Azure.

40:12 Well, guess what?

40:13 They get one of those huge ARM chips.

40:16 That's like the size of a plate and they get the special machines and they also get either

40:22 wholesale compute costs or they get just, we'll give you a bunch of compute for some

40:28 ownership of your company, kind of like Microsoft and OpenAI.

40:32 That's a very difficult thing to compete with on one hand, right?

40:34 Yes.

40:35 If you want to, you know, run your own like LLM or generative model API services, that's

40:42 definitely a disadvantage you're going to have.

40:46 But on the other hand, I think one thing that leads to this perception, and that I think is not

40:51 necessarily true, is the idea that to do anything you need basically larger

40:55 and larger models; that if you want to do something specific, the only way to

40:59 get there is to turn that request into arbitrary language and then use the largest model that

41:05 can handle arbitrary language and go from there.

41:07 And I know this is something that, you know, maybe a lot of LLM companies

41:11 want to tell you, but that's not necessarily true.

41:14 And you don't, yeah, for a lot of things you're doing, you don't even need to depend on a

41:19 large model at runtime.

41:21 You can distill it and you can use it at development time and then build something that you can

41:26 run in-house.

41:27 And these calculations also look, look very, very different if you're using something at

41:32 development time versus in production at runtime.

41:36 And then it can actually be totally fine to just run something in-house.

41:40 And the other point here is actually, if we're having a situation where, hey, you're

41:45 paying a large company to provide some service for you, provide a model for you via an API.

41:52 And there are lots of companies and kind of the main differentiator is who can offer it

41:56 for cheaper.

41:57 That's sort of the opposite of a monopoly at least, right?

42:00 That's like competition.

42:01 So this actually, I feel like economies of scale, this idea, does not prove that, hey,

42:07 we're heading into a monopoly.

42:11 And it's also not true, because if you realize that, hey,

42:15 you don't need the biggest, most arbitrary models for everything you're doing, then the

42:22 calculation looks very, very different.

42:24 Yeah, I agree.

42:25 I think there's a couple of thoughts I also have here. One is this LM Studio I was talking

42:30 about: I've been running the Llama 3 7-billion-parameter model locally instead of using ChatGPT

42:35 these days.

42:36 And it's been, I would say just as good.

42:38 And it's, it runs about the same speed on my Mac mini as a typical request does over

42:44 there.

42:45 I mean, can't handle as many, but it's just me, it's my computer, right?

42:47 I'm fine.
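For anyone who wants to try what Michael describes: LM Studio, like several local runners, exposes an OpenAI-compatible HTTP API on localhost, so a request can be built with nothing but the standard library. The port, path, and model name below are placeholders; check what your local server actually reports:

```python
import json
import urllib.request

def build_chat_request(prompt, model="local-model",
                       base_url="http://localhost:1234/v1"):
    """Build an OpenAI-style chat completion request for a local server.
    The base URL and model name are assumptions; LM Studio shows the
    actual values it serves in its UI."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Summarize our Q3 engineering report in three bullets.")
# Sending it only works with a local server actually running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the API shape matches OpenAI's, code written against a hosted model can often be pointed at an in-house one by changing only the base URL, which is the "keep your private data in-house" point in practice.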

42:48 And then the other one is if you specialize one of these models, right?

42:53 You feed it a bunch of your company's datasets.

42:55 It might not be able to write you something in the style of Shakespeare around, you know,

43:01 a legal contract or some weird thing like that, but it can probably answer really good

43:05 questions about what is our company policy on this?

43:08 Or what is, what are our engineering reports about this thing say, or, you know, stuff

43:12 that you actually care about, right?

43:14 You could run that.

43:15 That's kind of what you want.

43:16 Like you actually want to, if you're talking about, if we're going to like some of the

43:19 risks or things people are worried about, like a lot of that is around what people refer

43:23 to like, Oh, the model going rogue or like the model doing stuff it's not supposed to

43:27 do.

43:28 If you're just sort of wrapping ChatGPT and you're not careful, then when you're giving

43:32 it access to stuff, there's a lot of unintended things that people could do with it if you're

43:38 actually running this.

43:39 And once you expose it to users, there's like a lot of risks there.

43:42 And yeah, writing something in the style of Shakespeare is like probably the most harmless

43:46 outcome that you can get, but like that is kind of a risk.

43:51 And you basically, you know, you're also, you're paying and you're, you're putting all

43:54 this work into hosting and providing and running this model that has all these capabilities

43:59 that you don't need.

44:00 And a lot of them might actually be, you know, make it much harder to trust the system and,

44:05 and also, you know, make it a lot less transparent.

44:08 Like that's another aspect, like just, you know, you want your software to be modular

44:11 and transparent and that ties, ties back into what people want from open source.

44:16 But I think also what people want from software in general, like we've over decades and more,

44:21 we've built up a lot of best practices around software development and what makes sense.

44:26 And that's based on, you know, the reality of building software industry.

44:30 And just because there's like, you know, new capabilities and new things we can do and

44:35 a new paradigm doesn't mean we have to throw that all of these learnings away because,

44:39 oh, it's a new paradigm.

44:40 None of that is true anymore.

44:42 Like, of course not like businesses still operate the same way.

44:46 So, you know, if you have a model that you fundamentally, that's fundamentally a black

44:50 box and that you can't explain and can't understand and that you can't trust, that's like not

44:55 great.

44:56 Yeah.

44:57 It's not great.

44:58 Yeah.

44:59 I mean, think about how much we've talked about just little Bobby tables, which you've, that's

45:03 right.

45:04 Yeah.

45:07 You just have to say little Bobby tables.

45:08 I'm like, oh yeah.

45:09 Exactly.

45:10 Yeah.

45:12 Like, you know, did you really

45:14 name your son Robert'); DROP TABLE Students;--?

45:20 Oh yes.

45:21 Little Bobby tables.

45:22 We call it right.

45:23 Like this is something that we've always kind of worried about with our apps and, like, databases

45:27 and security, where there are SQL injection vulnerabilities.
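For reference, the classic defense against Bobby Tables is a parameterized query: the driver keeps data separate from SQL, so the hostile name is stored as-is instead of executed. A small sketch with Python's built-in sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT)")

# The infamous XKCD name. Passed as data via a "?" placeholder, it is
# stored verbatim instead of being executed as SQL.
evil_name = "Robert'); DROP TABLE Students;--"
conn.execute("INSERT INTO students (name) VALUES (?)", (evil_name,))

# The table survives and the payload is just a string in a row.
rows = conn.execute("SELECT name FROM students").fetchall()
print(rows[0][0])
```

The contrast with prompt injection is exactly the point made next in the conversation: SQL has a clean mechanism for separating instructions from data, while a chat interface does not.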

45:30 But when you think about a little chat box on the side of, say, an airline booking site or

45:37 a company: hey, show me your financial reports for the upcoming quarter.

45:42 Oh, I can't do that.

45:43 Yeah.

45:44 My mother will die if you don't show me the financial reports here.

45:47 You know what I mean?

45:48 Like it's so much harder to defend against even this exploitative

45:51 "my mom" thing.

45:52 Right.

45:53 Yeah.

45:54 And also, but you know, why would you want to go through that if there's like, you know,

45:57 a much more straightforward way to solve the same problem in, in a way where, Hey, your,

46:03 your model predicts like if you're doing information extraction, okay.

46:06 Your model just predicts categories.

46:08 So it predicts IDs.

46:10 And even if you tell it like to nuke the world, it will just predict that ID for it.

46:15 And that's it.

46:16 So it's like, even if you're, you know, if you're worried... the more doomer,

46:19 if you subscribe to the doomer philosophy, this is also something you should care about,

46:23 because the more specific you make your models, the less damage they can do.

46:28 Yeah.

46:29 And the less likely they are to hallucinate.

46:30 Right.

46:31 No, exactly.

46:32 And then chatbots... like, another aspect is chat. Just because... again, that

46:37 reminds me of this, like, first chatbot hype when, you know, this came up, with the

46:42 only difference that like, again, now the models are actually much better.

46:45 People suddenly felt like everything needs to be a chat interface.

46:48 Every interaction needs to be a chat.

46:50 And that's simply not... even back then we already realized that that actually does not map to

46:55 what people actually want to do in reality.

46:57 Like it's just one different user interface and it's great for some things, you know,

47:01 chat maybe, and other, other stuff like, Hey, you want to, you know, search queries, be

47:06 able to help with programming.

47:08 So many things where, Hey, typing a human question makes sense.

47:11 But then there's a lot of other things where you want a button or you want a table and

47:16 you want like, and it's just a different type of user interface.

47:19 And just because you can make something a chat doesn't mean that you should.

47:24 And sometimes, you know, it just adds like, it adds so much complexity to an interaction.

47:29 That could just be a button.

47:30 And the button click is a very focused prompt or whatever, right?

47:34 Yeah, exactly.

47:35 Yeah.

47:36 Even if it's about like, Hey, your earnings reports or something, you want to just see

47:38 a table of stuff and sum it up at the end.

47:41 You don't want your model to confidently say 2 million.

47:45 That's not solving the problem if you're a business analyst.

47:48 Yeah.

47:49 Like you want to see stuff.

47:50 So yeah.

47:51 And that actually also sort of ties into, yeah, another point that I've also had in

47:54 the talk, which is around like actually looking at what are actually the things we're trying

47:58 to solve in industry and how have these things changed?

48:01 And while there is new stuff you can now do, like generating texts and that finally works,

48:06 yay.

48:07 There's also a lot of stuff around text goes in, structured data comes out and that structured

48:12 data needs to be machine readable, not human readable, like needs to go into some other

48:16 process and a lot of industry problems, if you really think about it, have not changed

48:22 very much.

48:23 They've only changed in scale.

48:24 Like we started with index cards.

48:26 Well, there's kind of limit of how much you can do with that and how many projects you

48:30 can do at the same time.

48:31 But this was always, even since before computers, this has always been bringing structure into

48:35 unstructured data has always been the fundamental challenge.

48:38 And that's not going to just magically go away because we have new capacities and new

48:43 things we can do.

48:44 Let's talk about some of the workflows here.

48:47 So you have an example where you take a large model and do some prompting and this sort

48:53 of iterative model assisted data annotation.

48:56 Like, what's that look like?

48:58 You start out with this model, maybe one of these models that you can run locally or via an API

49:03 during development time, and you prompt it to produce some structured output, for example,

49:10 or some answer.

49:11 You know, we also have, like, for example, something like spacy-llm that

49:14 lets you plug in any model in the same way you would otherwise train a model yourself.

49:21 And then you look at the results, and you can actually get a good feel for how your

49:25 model is even doing.

49:27 And you can also, before you really get into distilling a model, you can create some data

49:32 to evaluate it.

49:33 Because I think that's something people are often forgetting, because it's

49:37 maybe not the funnest part, but it's really, you know, it's like writing tests.

49:41 It's like writing tests can be frustrating.

49:43 I remember when I kind of started out, like the tests are frustrating because they actually

49:47 kind of turn up all of these edge cases and mistakes that you kind of want to forget about.

49:52 Right.

49:53 Oh, I forgot to test for this.

49:54 Whoops.

49:55 Yeah.

49:57 And then like, Oh, if you start writing tests and you suddenly see all this stuff that goes

49:59 wrong and then you have to fix it.

50:01 And it's like, it's annoying.

50:02 So you better just not have tests.

50:04 I can see that.

50:05 But like evaluation is kind of like that.

50:07 And it's ultimately a lot of these problems.

50:10 You have to know what you want and here's the input.

50:14 Here's the expected output.

50:15 You kind of have to have to define that.

50:17 And that's not something any AI can help you with because you know, you are trying to teach

50:21 the machine something.

50:22 You're teaching the AI.

50:23 Yeah.

50:24 You want to build something that does what you want.

50:26 So you kind of need examples where you know the answer and then you can also evaluate

50:29 like, Hey, how does this model do out of the box for like some easy tasks?

50:34 Like, Hey, you might find something like GPT-4 can give you 75% accuracy out of the box without,

50:41 without any work.

50:42 So that's, that's kind of good or even higher.

50:44 And it's like, if it's a bit harder, you'll see, oh, okay,

50:47 you get like 20% accuracy, which is pretty bad.

50:50 And the bar is very low, but that's kind of the ballpark that you're also looking to beat.

50:55 And then you can look at examples that are predicted by the model.

50:58 All you have to do is look at them.

51:00 Yes.

51:01 Correct.

51:02 If not, you make a small correction, and then you go through that and basically do

51:05 that until you've beaten the baseline.
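The evaluation step Ines describes boils down to "here's the input, here's the expected output" plus an accuracy count. A minimal sketch in plain Python, with a trivial keyword model standing in for whatever you're actually evaluating (a prompted LLM, a distilled pipeline, rules); the examples and keywords are invented for illustration:

```python
# Hand-labeled evaluation examples: (input text, expected label).
eval_data = [
    ("The keeper saved a penalty in the 90th minute", "sports"),
    ("Shares fell sharply after the earnings call", "finance"),
    ("The midfielder signed a new contract", "sports"),
    ("The central bank raised interest rates", "finance"),
]

def toy_model(text):
    """Stand-in for whatever you're evaluating: GPT-4 out of the box,
    a distilled model, or a handful of rules."""
    finance_words = {"shares", "earnings", "bank", "rates"}
    return "finance" if finance_words & set(text.lower().split()) else "sports"

def accuracy(model, data):
    correct = sum(1 for text, gold in data if model(text) == gold)
    return correct / len(data)

score = accuracy(toy_model, eval_data)
print(f"{score:.0%}")
```

The number this prints is the baseline to beat: run the same `accuracy` function over the big generative model's predictions and over your smaller distilled model, and you know when the small one is good enough to ship.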

51:07 The transfer learning aspect, right?

51:09 Yeah.

51:10 So you use transfer learning in order to give the model, like, a solid foundation of knowledge

51:15 about the language and the world.

51:17 And you can end up with a model that's much smaller than what you started with.

51:21 And you have a model that's really has a task network that's only trained to do one specific

51:26 thing.

51:27 Which brings us to going from prototype to production, where you can sort of try some

51:32 of these things out, but then maybe not run a giant model, but something smaller, right?

51:36 Yeah.

51:38 So you can basically take the behavior that you're interested in from the larger model and train

51:42 components that do exactly that.

51:45 And another thing that's also good or helpful here is to have kind of a good path from prototype

51:51 to production.

51:52 I think that's also where a lot of machine learning projects in general often fail because

51:57 it's all, you have this nice prototype and it all looks promising and you've hacked something

52:01 together in your Jupyter notebook and that's all looking nice.

52:06 You maybe have like a nice Streamlit demo and you can show that, but then you're like,

52:10 okay, can we ship that?

52:11 And then if your workflow that leads to the prototype is completely different from the

52:16 workflow that leads to production, you might find that out exactly at that phase.

52:20 And that's kind of where projects go to die.

52:22 And that's sad.

52:23 And yeah, so that's, that's actually something we've been thinking about a lot.

52:27 And also what we've kind of been trying to achieve with spaCy LLM, where you have this

52:31 LLM component that you plug in and it does exactly the same as the components would do

52:36 at runtime.

52:37 And it really just slots in and then might use GPT-4 behind the scenes to create the

52:43 exact same structured object.

52:45 And then you can swap that out.

52:46 Or maybe, you know, there are a lot of things you might even want to swap out with rules

52:50 or no AI at all.

52:52 Like, you know, ChatGPT is good at recognizing US addresses, and it's great to

52:57 build a prototype, but instead of asking it to extract US addresses, for example, you

53:02 can ask it: give me spaCy Matcher rules for US addresses.

53:06 And it can actually do that pretty well.

53:08 And then you can bootstrap from there.
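To make the "spaCy Matcher rules" idea concrete: Matcher rules are just lists of per-token dictionaries, so what a model hands you is plain, inspectable data. This is a hedged, very simplified sketch of one US-street-address-shaped pattern; real address matching would need many more patterns:

```python
# A spaCy Matcher pattern is a list of dicts, one dict per token.
# This toy pattern matches things shaped like "123 Main Street" or "42 Oak Ave".
street_types = ["street", "st", "avenue", "ave", "road", "rd",
                "boulevard", "blvd"]

address_pattern = [
    {"IS_DIGIT": True},               # house number, e.g. "123"
    {"IS_TITLE": True, "OP": "+"},    # one or more capitalized name words
    {"LOWER": {"IN": street_types}},  # street type, case-insensitive
]

# With spaCy installed, you would register it roughly like this:
# from spacy.matcher import Matcher
# matcher = Matcher(nlp.vocab)
# matcher.add("US_ADDRESS", [address_pattern])
```

Because the rules are just data, you can read, review, and version-control exactly what gets matched, which is the transparency argument from earlier in the conversation.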

53:10 There's a lot of stuff like that that you can do.

53:12 And there might be cases where you find that, yeah, you can totally beat any model accuracy

53:17 and have a much more deterministic approach if you just write a regex.

53:22 Like that's still true.

53:23 That'll still work.

53:24 Yeah, it's still something it's easy to forget because, you know, again, if you look at research

53:29 and literature, nobody's talking about that because this is not an interesting research

53:34 question.

53:35 Like nobody cares.

53:36 You know, you can take any benchmark and say, I can beat ChatGPT accuracy with two regular

53:41 expressions.

53:42 And that's like, that's true.

53:43 Probably in some cases.

53:44 Yeah.

53:45 It's like, nobody cares.

53:46 Like that's not, that's not research.

53:47 For sure.

53:48 But you know, what is nice to do is to go to ChatGPT or LM Studio or whatever and say,

53:54 hey, I need a Python-based regular expression to match this text and this text.

53:59 And I want a capture group for that.

54:01 And I don't want to think about it.

54:02 It's really complicated.

54:03 Here you go.

54:04 Oh, perfect.

54:05 Now I'll run the regular expression.
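To make that concrete, here is a toy version of the kind of regex an LLM might hand back, with a capture group pulling out just the value of interest; the log line and pattern are invented for illustration:

```python
import re

# Hypothetical log line; we want just the dotted version number.
line = "service started, version 2.14.3, build 9f2c"

# Capture group 1 grabs the version while the surrounding text anchors the match.
pattern = re.compile(r"version\s+(\d+\.\d+\.\d+)")

match = pattern.search(line)
version = match.group(1) if match else None
print(version)
```

The nice part is that once you have the pattern, you can keep it under test and never think about how it was produced again.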

54:06 Yeah, that's actually, that's a good use case.

54:08 I've still been sort of hacking around on like this, you know, interactive regex, because

54:12 I'm not particularly good at regular expressions.

54:15 Neither am I.

54:16 Like, I can do it, but I know people who really can. I think my co-founder

54:20 Matt, he worked through it, like he's more the type who really approaches these things

54:24 very methodically.

54:25 He was like, now I want to read this one big book on regular expressions.

54:30 And he really did it the hardcore way, but that's why he's obviously much

54:35 better than I am.

54:36 I consider regular expressions kind of write-only.

54:39 Like you can write them and make them do stuff, but then reading them back is tricky.

54:42 Yeah.

54:43 At least for me.

54:44 All right, let's wrap this up.

54:45 So what are the things that you did here at the end of your presentation, which I want

54:49 to kind of touch on is you brought back some of the same ideas that we had for like, what

54:54 are the values of open source or why open source, but back to creating these smaller

54:59 focused models.

55:00 Talk us through this.

55:01 Or specific components.

55:02 Yeah.

55:03 I mean, if you kind of look at, hey, what are the advantages of the sort of approach

55:07 that we talked about, of distilling things down, of creating these smaller models, a

55:12 lot of it comes down to it being modular.

55:15 Again, you're not locked in to anything.

55:17 You own the model.

55:19 Nobody can take that away from you.

55:20 It's easier to write tests.

55:22 You have the flexibility.

55:23 You can extend it because you know, it's code.

55:26 You can program with it, because very rarely do you do machine learning for the sake

55:30 of machine learning.

55:31 It's always like, there is some other process.

55:34 You populate a database, you do some other stuff with your stack.

55:37 And so you want to program with it.

55:38 It needs to be affordable.

55:40 You want to understand it.

55:41 You need to be able to say, why is it doing what it's doing?

55:44 Like what do I do to fix it?

55:45 It again, runs in house.

55:47 It's entirely private.

55:49 And then yeah, when I was kind of thinking about this, I realized like, oh, actually,

55:53 you know, this really maps exactly to the reasons we talked about earlier

55:59 why people or companies choose open source.

56:01 And that's obviously not a coincidence.

56:03 It's because ultimately these are principles that we have come up with over a long period

56:07 of time of, yeah, that's good software development.

56:12 And ultimately AI is just another type of software development.

56:15 So of course it makes sense that the same principles make sense and are beneficial.

56:21 And that, you know, just having a workflow where everything's a black box and third party,

56:26 this can work for prototyping, but it's not, that kind of goes against a lot of the things

56:30 that we've identified as very useful in applied settings.

56:35 Absolutely.

56:36 All right.

56:37 So we have to answer the question.

56:39 Will it be monopolized?

56:42 Our contention is no, that open source wins even for LLMs.

56:46 Open source means there's no monopoly to be gained in AI.

56:49 You know, I've kind of broken it down into some of these strategies:

56:54 how do you get to a monopoly?

56:55 And this is not just some big abstract stuff.

56:58 These are things a lot of companies are actively thinking about.

57:00 If you're in a business where it's winner-takes-all, you

57:05 want to get rid of all of that competition that companies hate, that investors hate.

57:10 And there are ways to do that.

57:11 And companies really actively think about this.

57:12 Those pesky competitors, let's get rid of them.

57:14 There are different ways to do that.

57:16 Like one is having this compounding advantage.

57:19 So that's stuff like network effects. Like, you know, if you're a social network, of course,

57:22 that makes a lot of sense.

57:24 Everyone's on it.

57:25 If you could kind of have these network effects, that's good.

57:28 And economies of scale.

57:29 But as we've seen, like economies of scale is a pretty lame moat in that respect.

57:35 Like that has a lot of, you know, a lot of limitations.

57:37 It's not even fully true.

57:39 It's kind of the opposite of a monopoly in some ways.

57:42 Yeah.

57:43 Especially in software.

57:44 Yeah, in software.

57:45 Exactly.

57:46 So it's like, I don't think that's, that's not really the way to go.

57:49 One example that comes to mind, at least for me, maybe I'm seeing it wrong, but Amazon,

57:55 amazon.com, just, you know, how many companies can have a massive warehouse with everything

58:00 near every single person's house?

58:02 Yeah.

58:03 The one platform that everyone goes on.

58:04 So even if you're a retailer, you kind of, yeah, Amazon has kind of forced

58:08 everyone to either sell on Amazon or go bust because...

58:11 Exactly.

58:12 It's very sad, but it's the way it is.

58:14 And then network effects.

58:15 I'm thinking, you know, people might say Facebook or something, which is true, but I would say

58:20 like Slack, actually.

58:21 Oh, okay.

58:22 Or Slack or Discord or, you know, there's a bunch of little chat apps and things, but

58:27 if you want to have one and you want to have a little community, you want people to be

58:30 able to, well, I already have Slack open.

58:32 It's just an icon next to it versus install my own app.

58:36 Make sure you run it.

58:37 Be sure to check it.

58:38 Like people are going to forget to run it and you disappear off the space, you know?

58:41 That makes sense.

58:42 And I do think, you know, these things don't necessarily happen accidentally.

58:45 Like companies think about, okay, how do we, you know, Amazon definitely thought about

58:48 this.

58:49 This didn't just like happen to Amazon.

58:50 Yes, they were lucky in a lot of ways, but like, you know, that's, that's a strategy.

58:55 Exactly.

58:56 Yeah.

58:57 And then another way, which is not really relevant here, is controlling

58:59 a resource. That's really more relevant if it was

59:03 like a physical resource.

59:04 Fiber cables.

59:05 Something like that.

59:06 Yeah.

59:07 I mean, it's, yeah.

59:08 Or like in Germany, I think for a long time, the Telekom, they owned the wires in the building.

59:13 Right.

59:14 Exactly.

59:15 And they still do, I think.

59:16 So they used to have the monopoly.

59:17 Now they don't, but to some extent they still do, because they need

59:20 to come.

59:21 No matter who you sign up with for internet, Telekom needs to come and activate

59:26 it.

59:27 So if you sign up with Telekom, you usually get service a lot faster.

59:31 You get a little better service.

59:32 Yeah, exactly.

59:33 Don't wait two weeks, use us.

59:35 That's kind of, that's how it still works, but we don't, we don't really have that here.

59:38 And then the other, the next point that's very attractive, the final one is regulation.

59:43 So that's kind of like, you have to have a monopoly because the government says so.

59:47 And that is one where we have to be careful, because in all of these

59:53 discussions, we need to make the distinction between the models and the actual products.

01:00:00 They have very different characteristics and do very different things.

01:00:04 If that gets muddied, which a lot of companies quite actively do

01:00:09 in that discourse, then we might end up in a situation where we sort of accidentally

01:00:15 gift a company or some companies a monopoly via the regulation.

01:00:20 Because if we let them write the regulation, for example, we end up not just regulating

01:00:24 products, but lumping that in with the technology itself.

01:00:29 Yeah, it's a part of your talk.

01:00:31 I can't remember if it was the person hosting it or you who brought this up, but an example

01:00:35 of that might be all the third-party cookie banners, rather than just banning targeted

01:00:41 retargeting and tracking.

01:00:43 Like with the GDPR, instead of banning the thing that is the problem,

01:00:49 it's like, let's ban the implementation of the problem.

01:00:52 That's a risk. And in hindsight, yes, I think we would

01:00:56 all agree that we should have just banned targeted advertising.

01:01:01 Instead what we got is these cookie pop-ups.

01:01:02 That's like really annoying.

01:01:04 And that's actually one thing I feel the EU got right. You know,

01:01:08 I'm not an expert on AI regulation or the EU AI Act, but what I'm seeing is at least

01:01:14 they did make a distinction between use cases.

01:01:16 And it's very much, there is a focus on here are the products and the things people are

01:01:21 doing.

01:01:22 How high-risk is that, as opposed to how big is the model, you know,

01:01:26 because that doesn't say anything, and that would kind of be a very dangerous

01:01:30 way to go about it.

01:01:32 But the risk is of course, if we're rushing regulation,

01:01:37 then you know, we might actually end up with something that's not quite fit for purpose.

01:01:41 Or if we let big tech companies write the regulation or lobby.

01:01:45 Lobby for it.

01:01:46 Yeah.

01:01:47 Hey, here are my ideas. Because if they're doing that, I think it's pretty obvious

01:01:50 they're not just worried about the safety of AI when they're appealing to Congress

01:01:55 or whatever.

01:01:56 I think most people are aware of that, but yes, I think the intentions

01:02:00 are even less pure than that.

01:02:02 And I think that's a big risk.

01:02:04 Regulation is very tricky.

01:02:05 It's, you know, just for the record, I am pro regulation.

01:02:08 I'm very pro regulation in general, but I also think you can, if you fuck up regulation,

01:02:14 that can also be very damaging, obviously.

01:02:16 Absolutely.

01:02:17 And it can be put in a way so that it makes it hard for competitors to get into the system.

01:02:22 There's so much paperwork and so much monitoring that you need a team of 10 people just to

01:02:26 operate.

01:02:27 If you're a startup, you can't do that, while the big players can say, hey, we've got a thousand people

01:02:30 and 10 of them work on this.

01:02:31 Like, well.

01:02:32 Even beyond that, you know, if you think back to all the stuff we talked

01:02:35 about, this goes against a lot of the best practices of software.

01:02:40 This goes, you know, this goes against a lot of what we've identified that actually makes

01:02:46 good, secure, reliable, modular, whatever software, safe software internally.

01:02:53 And even doing a lot of the software development internally, like there are so many benefits

01:02:57 of that.

01:02:58 And I think, you know, companies, companies actually working on their own product is good.

01:03:02 And if it was suddenly true that only certain companies could even provide AI models,

01:03:08 I don't even know what that would mean for open source or for academic research.

01:03:12 Like that would make absolutely no sense.

01:03:13 I also don't think that's like really enforceable, but it would mean that, you know, this would

01:03:18 limit like everyone in what they could be doing.

01:03:21 And, you know, there are a lot of

01:03:25 other things you can do if you care about AI safety, but that's really not it.

01:03:30 And I also, you know, I just think being aware of that is good.

01:03:33 I can't really see an outcome where we actually do that.

01:03:37 It would really not make sense.

01:03:39 I can't see the reality of this, you know, shaking out, but I think it's still relevant.

01:03:45 I think the open source stuff and some of the smaller models really does give us a lot

01:03:48 of hope.

01:03:49 So that's awesome.

01:03:50 I feel positive, you know, also very positive about this.

01:03:53 I've also talked to a lot of developers at conferences who said like, yeah, actually

01:03:57 thinking and talking about this gave them some hope, which obviously is nice, because

01:04:02 that's definitely some of the vibe I got.

01:04:04 Like it can be kind of easy to end up a bit disillusioned by like a lot of the narratives

01:04:10 people hear and that, you know, also even if you're entering the field, you're like,

01:04:14 wait, a lot of this doesn't really make sense.

01:04:16 Like why is it like this?

01:04:18 It's like, no, it actually, you know, your intuition is right.

01:04:22 Like a lot of software, software engineering best practices, of course, still matter.

01:04:27 And you know, there are better ways, and

01:04:31 we're not just all going in that one direction.

01:04:33 And I think I definitely believe in that.

01:04:35 A lot of the reasons why open source won in a whole bunch of areas could be exactly why

01:04:40 it wins at LLMs as well.

01:04:41 Right.

01:04:42 Yep.

01:04:43 And you know, again, it's all based on open research.

01:04:44 A lot of stuff is already published and there's no secret sauce.

01:04:49 The software, you know, software industry does not run on like secrets.

01:04:53 All the differentiators are product stuff.

01:04:56 And yes, you know, OpenAI might monopolize or dominate AI-powered chat assistants, or

01:05:02 maybe Google will.

01:05:03 Like that's, you know, that's a whole race that, you know, if you're not in that business,

01:05:06 you don't have to be a part of, but that does not mean that anyone's going to win at or

01:05:10 monopolize AI.

01:05:11 Those are very different things.

01:05:13 Absolutely.

01:05:14 All right.

01:05:15 A good place to leave it as well, Ines.

01:05:16 Thanks for being here.

01:05:17 Yeah, thanks.

01:05:18 That was fun.

01:05:19 Yeah.

01:05:20 If people want to learn more about the stuff that you're doing,

01:05:21 Maybe check out the video of your talks or whatever.

01:05:24 What do you recommend?

01:05:25 Yeah, I'll definitely give you some links like that for

01:05:28 the show notes.

01:05:29 The slides are online, so you can have a look at that.

01:05:31 There is at least one recording of the talk online now from the really cool PyCon Lithuania.

01:05:37 It was my first time in Lithuania this year.

01:05:40 Definitely, you know, if you have a chance to visit their conference, it was a lot of

01:05:43 fun.

01:05:44 I learned a lot about Lithuania as well.

01:05:46 Also, on our website, Explosion AI, we publish kind of a feed of all kinds of stuff

01:05:51 that's happening, maybe some talk or podcast interview, community stuff.

01:05:57 There are a lot of super interesting plugins that are developed by people in the community,

01:06:02 papers that are published.

01:06:03 So we really try to give a nice overview of everything that's happening in our ecosystem.

01:06:08 And then of course, you could try out spaCy, spaCy LLM.

01:06:11 You know, if you want to try out some of these generative models, especially for prototyping

01:06:17 or production, whatever you want to do for structured data.

01:06:21 If you're at any of the conferences, check out the list of events and stuff. I'm going

01:06:27 to do a lot of travel this year.

01:06:29 So I would love to catch up with more developers in person and also learn more about all the

01:06:34 places I'm visiting.

01:06:35 So that's cool.

01:06:36 I've seen the list.

01:06:37 It's very, very comprehensive.

01:06:38 So, I'm kind of a neat freak.

01:06:40 I very much like to organize things in that way.

01:06:43 So yeah.

01:06:45 So there might be something local for people listening that you're going to be doing.

01:06:48 All right.

01:06:49 Well, as always, thank you for being on the show.

01:06:51 It's great to chat with you.

01:06:52 Yeah, thanks.

01:06:53 Thanks.

01:06:54 Till next time.

01:06:55 Bye.