
#465: The AI Revolution Won't Be Monopolized Transcript

Recorded on Thursday, May 9, 2024.

00:00 There hasn't been a boom like the AI boom since the dot com days, and it may look like

00:05 a space destined to be controlled by a couple of tech giants.

00:08 But Ines Montani thinks open source will play an important role in the future of AI.

00:13 I hope you join us for this excellent conversation about the future of AI and open source.

00:19 This is Talk Python to Me, episode 465, recorded May 8th, 2024.

00:24 Are you ready for your host? Here he is!

00:27 You're listening to Michael Kennedy on Talk Python to Me.

00:30 Live from Portland, Oregon, and this segment was made with Python.

00:37 Welcome to Talk Python to Me, a weekly podcast on Python.

00:40 This is your host, Michael Kennedy.

00:42 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython,

00:47 both on fosstodon.org.

00:50 Keep up with the show and listen to over seven years of past episodes at talkpython.fm.

00:55 We've started streaming most of our episodes live on YouTube.

00:59 Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows and

01:05 be part of that episode.

01:06 This episode is brought to you by Sentry.

01:08 Don't let those errors go unnoticed.

01:10 Use Sentry like we do here at Talk Python.

01:12 Sign up at talkpython.fm/sentry.

01:14 And it's brought to you by Porkbun.

01:18 Launching a successful project involves many decisions, not the least of which is choosing

01:22 a domain name.

01:23 Get a .app, .dev, or .foo domain name at Porkbun for just $1 for the first year at

01:29 talkpython.fm/porkbun.

01:32 Before we jump into the show, a quick announcement.

01:34 Over at Talk Python, we just launched a new course, and it's super relevant to today's

01:39 topic.

01:40 The course is called Getting Started with NLP and spaCy.

01:44 It was created by Vincent Warmerdam, who has spent time working directly on spaCy at Explosion

01:49 AI.

01:50 The course is a really fun exploration of what you can do with spaCy for processing

01:53 and understanding text data.

01:55 And Vincent uses the past nine years of Talk Python transcripts as the core data for the

02:00 course.

02:01 If you have text data you need to understand, check out the course today at talkpython.fm/spacy.

02:07 The link is in your podcast player show notes.

02:09 Now on to more AI and spaCy with Ines.

02:12 Ines, welcome back to Talk Python to Me.

02:14 Yeah, thanks for having me back again.

02:16 You're one of my favorite guests.

02:17 It's always awesome to have you.

02:19 Thanks.

02:20 You're my favorite podcast.

02:21 Thank you.

02:22 We have some really cool things to talk about.

02:24 spaCy, some of course, but also more broadly, we're going to talk about just LLMs and AI

02:32 and open source and business models and even monopolies.

02:35 We're going to cover a lot of things.

02:37 You've been kind of making a bit of a roadshow, a tour around much of Europe and talking about

02:44 some of these ideas, right?

02:46 Yeah.

02:47 I've been invited to quite a few conferences and I feel like this is like after COVID,

02:51 the first proper, proper year again that I'm traveling for conferences.

02:55 And I was like, why not?

02:56 And then I think especially now that so much is happening in the AI space, I think it's

03:00 actually really nice to go to these conferences and connect with actual developers.

03:05 Because if you're just sitting on the internet and you're scrolling, I don't know, LinkedIn

03:09 and sometimes it can be really hard to tell what people are really thinking and what's...

03:14 Do people really believe some of these hot, weird takes that people are putting out there?

03:19 So yeah, it was very, very nice to talk about some of these ideas, get them checked against

03:25 what developers think.

03:26 So yeah, it's been really cool.

03:28 And there's more to come.

03:29 Yeah, I know.

03:30 I'll be traveling again later this month to Italy for PyCon for my first time, then PyData

03:35 London and who knows what else.

03:37 If you must go to Italy and London, what terrible places to spend time in, huh?

03:42 I'm definitely very open for tips, especially for Italy, for Florence.

03:45 I've never been to Italy ever.

03:48 Oh, awesome.

03:49 I've been to Rome, but that's it.

03:51 And so I don't have any tips, but London is also fantastic.

03:55 Yeah.

03:56 Cool.

03:57 So people can check you out.

03:58 Maybe, I think, do you have a list of that publicly where people can see some of your

04:02 talks?

04:03 We can put that in the show notes.

04:04 Yeah.

04:06 It's on my website and then also on the Explosion site of our company, we've actually added

04:08 an events page because it came up a lot that like either me, Matt, people from our team

04:14 giving talks.

04:15 And so we thought like, hey, let's... and podcasts as well.

04:17 So we're collecting everything on one page, all the stuff we're doing, which is kind of

04:20 fun.

04:21 Which is quite a bit, actually, for sure.

04:22 Yeah.

04:24 Well, I know many people know you, but let's just talk a little bit about spaCy, Explosion,

04:31 Prodigy, the stuff that you guys are doing to give people a sense of where you're coming

04:34 from.

04:35 Yeah.

04:36 So we're an open source company and we build developer tools for AI, natural language processing

04:43 specifically.

04:44 So, you know, you're working with lots of text, you want to analyze it beyond just looking

04:47 for keywords.

04:49 That's kind of where we started and what we've always been focusing on.

04:53 So spaCy is probably what we're mostly known for, which is a popular open source library

04:58 for really what we call industrial-strength NLP.

05:01 So built for production, it's fast, it's efficient.

05:05 We've put a lot of work into having good usable, user-friendly, developer-friendly APIs.

05:11 Actually, yeah, I always set an example.

05:13 I always like to show in my talks a nice side effect that we never anticipated like that

05:17 is that ChatGPT and similar models are actually pretty good at writing Spacey code because

05:23 we put a lot of work into all of this stuff like backwards compatibility, not breaking

05:27 people's code all the time, stuff like that.

05:30 But that happens to really help, at least for now, with these models.

05:34 It's really nice.

05:35 It's a good thing you've done to make it a really stable API that people can trust.

05:40 But is it weird to see LLMs talking about stuff you all created?

05:44 It's kind of, it's funny in some way.

05:46 I mean, there is this whole other side to it, you know, doing user support and

05:54 detecting clearly auto-generated code. Because while for spaCy these models are pretty good, for

05:58 Prodigy, which is our annotation tool and is also scriptable in Python,

06:02 it's a bit less precise and they hallucinate a lot because there's just less code online

06:06 and on GitHub.

06:07 So we sometimes get like support requests where like users post their code and we're

06:12 like, this is so strange.

06:13 How did you find these APIs?

06:14 They don't exist.

06:15 And then we're like, ah, this was auto-generated.

06:17 Oh, okay.

06:19 So that was a very new experience.

06:21 And also it's, you know, I think everyone who publishes online deals with that, but

06:25 like, it's very frustrating to see all these like auto-generated posts that like look like

06:30 tech posts, but are completely bullshit and completely wrong.

06:34 Like I saw something on spacy-llm, which is our extension for integrating large language

06:40 models into spaCy.

06:41 And they're like some blog posts that look like they're tech blog posts, but they're

06:46 like completely hallucinated.

06:48 And it's very, very strange to see that about like your own software.

06:52 And also it frustrates me because that stuff is going to feed into the next generation

06:55 of these models.

06:56 Right.

06:57 And I think the models will stop being so good at this because they'll be full of stuff that they've

07:02 generated themselves, like APIs and things that don't even exist.

07:06 Yeah.

07:07 It's just going to cycle around and around and around until it just gets worse every

07:11 time.

07:12 And then.

07:13 That's interesting.

07:14 Like it's very interesting to see what's going on and where these things lead.

07:17 It is.

07:18 You know, I just had a thought.

07:19 I was, you know, OpenAI and some of these different companies are doing work to try

07:25 to detect AI generated images.

07:28 And I imagine AI generated content.

07:31 When I heard that, my thought was like, well, that's just because they kind of want to be

07:33 good citizens and they want to put little labels on it. But what if it's just so they

07:38 don't ingest it twice?

07:40 I think that's definitely, and I mean, in a way it's also good because, you know, it

07:44 would make these models worse.

07:46 And so from like, you know, from a product perspective for a company like OpenAI, that's

07:50 definitely very useful.

07:53 And I think also, you know, commercially, I think there's definitely, you know, a big

07:57 market in that also for like social networks and stuff to detect are these real images,

08:04 are these deep fakes, is there money in that too?

08:07 So it's not, I don't think it's just, yeah, being good citizens, but like there's a clear

08:11 product motivated thing, which is fine, you know, for a company.

08:14 Yeah, it is fine.

08:15 I just, I never really thought about it.

08:17 Of course.

08:19 I mean, I think we're getting to some point where in food or art, you hear about artisanal

08:25 handcrafted pizza or, you know, whatever.

08:28 Will there be artisanal human created tech that has got a special, special flavor to

08:33 it?

08:34 Like this was created with no AI.

08:35 Look at how cool this site is or whatever.

08:37 I think it's already something that like, you see, like, I don't know which product

08:41 this was, but I saw there was some ad campaign.

08:43 I think it might've been a language learning app or something else where they really like

08:48 put that into one of their like marketing claims, like, Hey, it's not AI generated.

08:53 We don't use AI.

08:54 It's actually real humans because it seems to be, you know, what people want.

08:58 They want, you know, they want to have at least that feeling.

09:00 So I definitely think there's an appeal of that also going forward.

09:04 The whole LLM and AI stuff, it's just, it's permeated culture so much.

09:09 I was at the motorcycle shop yesterday talking to a guy who was a motorcycle salesman.

09:13 And he was like, do you think that AI is going to change how software developers work?

09:18 Do you think they're still going to be relevant?

09:19 I'm like, you're a sales guy.

09:20 You're a motorcycle sales guy.

09:21 This is amazing.

09:22 How are you really this tuned into it, right?

09:25 You know, you think it's maybe just a little echo chamber of us

09:30 talking about it, but it seems like these kinds of conversations are more broad than

09:33 maybe you would have guessed.

09:34 ChatGPT definitely, you know, brought the conversation into the mainstream, but on the

09:38 other hand, on the plus side, it also means it makes it a lot easier for us kind of to

09:42 explain our work because people have at least heard of this.

09:46 And I think it's also for developers working in teams, like on the one hand, it can maybe

09:50 be frustrating to do this expectation management because you know, you have management who

09:54 just came back from some fancy conference and got sold on like, Ooh, we need like some

10:00 chat bot or LLM.

10:01 It's kind of the chat bot hype all over again that we already had in 2015 or so.

10:07 That can be frustrating.

10:08 Everyone thought those were going to be so important.

10:09 And now what are they doing?

10:10 Yeah, but it's like, yeah, I see a lot of parallels.

10:13 It's like, if you look at kind of the hype cycle and people's expectations and expectation

10:18 management, it's kind of the same thing in a lot of ways, only that now

10:22 a lot of parts actually kind of work, which we didn't really have before.

10:26 Yeah.

10:27 But yeah, it also means for teams and developers that they at least have some more funding

10:30 available and resources that they can work with.

10:33 Because I felt like before that happened, it looked like that, you know, companies are

10:37 really cutting their budgets, all these exploratory AI projects, they all got cut.

10:42 It was quite frustrating for a lot of developers.

10:44 And now at least, you know, it means they can actually work again, even though they

10:48 also have to kind of manage the expectations and like work around some of the wild ideas

10:55 that companies might have at the moment.

10:57 Absolutely.

10:58 Now, one of the things that's really important and we're going to get to here, give you a

11:02 chance to give a shout-out to the other thing that you all have is, how do you teach

11:06 these things information?

11:09 How do you get them to know things and so on?

11:12 And you know, for the spaCy world, you have Prodigy and maybe give a shout out to Prodigy

11:16 Teams.

11:17 That's something you're just announcing, right?

11:19 Yeah.

11:20 So that's currently in beta.

11:21 It's something we've been working on.

11:22 So the idea of Prodigy has always been, hey, you know, support spaCy, also other libraries.

11:27 And how can we, yeah, how can we make the training and data collection process more

11:31 efficient or so efficient that companies can in-house that process?

11:36 Like whether it's creating training data, creating evaluation data, like even if what

11:41 you're doing is completely generative and you have a model that does it well, you need

11:44 some examples and some data where you know the answer.

11:47 And often that's a structured data format.

11:49 So you need to create that. And, you know, we're really seeing that outsourcing that doesn't

11:54 work very well.

11:55 And also now with the newer technologies, like transfer learning, you don't need millions

12:00 of examples anymore.

12:01 So like this big, big data idea for task specific stuff is really dead in a lot of ways.

12:07 So Prodigy is a developer tool that you can script in Python and that makes it easy to

12:13 really collect this kind of structured data on text images and so on.

12:19 And then Prodigy Teams, that has been a very ambitious project.

12:22 We've really been, we've wanted to ship this a long time ago already, but it's been very

12:27 challenging because we basically want to bring also a lot of these ideas that probably we're

12:31 going to talk about today a bit into the cloud while retaining the data privacy.

12:36 And so you'll be able to run your own cluster on your own infrastructure that has the data

12:41 that's scriptable in Python.

12:43 So you can kind of script the SaaS app in Python, which is very cool, which you normally

12:47 can't do.

12:48 Your data never leaves your servers.

12:51 And you can basically also use these workflows like distillation, where you start out with

12:56 a super easy prototype that might use Llama or some other models, or ChatGPT, GPT-4.

13:03 Then you benchmark that, see how it does.

13:06 And then you collect some data until you can beat that in accuracy and have a task-specific

13:11 model that really only does the one extraction you're interested in.

13:14 And that model can be tiny.

13:15 Like we've had users build models that are under 10 megabytes.

13:19 Like that's, that is pretty crazy to think about these days.

13:23 And they run like 20 times faster, they're entirely private.

13:27 You can, you know, you don't need like tons of compute to run them.

13:31 And that's kind of really one of the workflows of the future that we see as very promising.
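The prototype-then-distill loop Ines describes can be sketched in plain Python. Everything below is hypothetical stand-in code, not Prodigy's actual API: `llm_extract` plays the role of a prompted model like GPT-4, `distilled_extract` a tiny task-specific model, and only the benchmarking logic is the point.

```python
# Sketch of the workflow: prototype with an LLM, benchmark it against a small
# gold evaluation set, then train a tiny task-specific model and compare.

def llm_extract(text: str) -> str:
    """Hypothetical stand-in for a prompted LLM doing label extraction."""
    return "ORG" if "spaCy" in text or "Explosion" in text else "NONE"

def distilled_extract(text: str) -> str:
    """Hypothetical stand-in for a small distilled task-specific model."""
    return "ORG" if "spaCy" in text else "NONE"

def accuracy(predict, gold) -> float:
    """Fraction of gold examples the model labels correctly."""
    correct = sum(1 for text, label in gold if predict(text) == label)
    return correct / len(gold)

# A tiny gold evaluation set: examples where we know the answer.
gold = [
    ("spaCy is fast", "ORG"),
    ("Explosion builds tools", "ORG"),
    ("the weather is nice", "NONE"),
    ("I like coffee", "NONE"),
]

llm_score = accuracy(llm_extract, gold)
small_score = accuracy(distilled_extract, gold)
print(f"LLM prototype: {llm_score:.2f}, distilled model: {small_score:.2f}")
```

In practice you would keep annotating and retraining until the small model matches or beats the prototype on this gold set, at which point the cheap, private model can replace the LLM for that one task.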

13:35 And it's also, people are often surprised how little task specific data you actually

13:40 need to, say, beat GPT-4 in accuracy.

13:43 It's not as much as people think.

13:45 And it's totally, you know, in a single workday, you could often do it.

13:50 The main idea we've been thinking about a lot is basically how can we make that workflow

13:53 better and more user-friendly, even for people who don't have an extensive machine learning

13:57 background.

13:59 Because one thing that like prompting an LLM or prompting a generative model has is that

14:03 it's a very low barrier to entry.

14:05 And it's very, very, the UX is very good.

14:08 You just type in a question, you talk to it the way you would talk to a human.

14:11 And that's easy to get started with.

14:13 The workflow, that's a bit more involved.

14:15 Yes, machine learning developers know how to do that, and they know when to do it.

14:20 But it's not as accessible to people who don't have all of that experience.

14:24 And so that's kind of the underlying thing that we're trying to solve.

14:30 This portion of Talk Python to Me is brought to you by Sentry.

14:32 In the last episode, I told you about how we use Sentry to solve a tricky problem.

14:37 This time, I want to talk about making your front end and back end code work more tightly

14:41 together.

14:42 If you're having a hard time getting a complete picture of how your app is working, and how

14:47 requests flow from the front end JavaScript app, back to your Python services down into

14:52 database calls for errors and performance, you should definitely check out Sentry's distributed

14:57 tracing.

14:58 With distributed tracing, you'll be able to track your software's performance, measure

15:02 metrics like throughput and latency, and display the impact of errors across multiple systems.

15:09 Distributed tracing makes Sentry a more complete performance monitoring solution, helping you

15:13 diagnose problems and measure your application's overall health more quickly.

15:18 Tracing in Sentry provides insights such as what occurred for a specific event or issue,

15:23 the conditions that cause bottlenecks or latency issues, and the endpoints and operations that

15:28 consume the most time.

15:29 Help your front end and back end teams work seamlessly together.

15:33 Check out Sentry's distributed tracing at talkpython.fm/sentry-trace.

15:39 That's talkpython.fm/sentry-trace.

15:40 And when you sign up, please use our code TALKPYTHON, all caps, no spaces, to get more

15:48 features and let them know that you came from us.

15:51 Thank you to Sentry for supporting the show.

15:54 You talked about transfer learning and using relatively small amounts of data to specialize models.

15:59 Tell people about what that is.

16:00 How do you actually do that?

16:02 It's actually the same idea that has ultimately really led to these large generative models

16:07 that we see.

16:08 And that's essentially realizing that we can learn a lot about the language and the world

16:14 and a lot of general stuff from raw text.

16:17 If we just train a model with a language modeling objective on a bunch of text on all internet

16:24 or parts of the internet or whatever, in order to basically solve the task, which can be

16:30 stuff like predict the next word, in order to do that, the model has to learn so much

16:34 in its weights and in its representations about the language and about really underlying

16:40 subtle stuff about a language that it's also really good at other stuff.

16:44 That's kind of in a nutshell the basic idea.

16:46 And that's then later led to larger and larger models and more and more of these ideas.

16:52 But yeah, the basic concept is if you just train on a lot of raw text and a lot of these

16:58 models are available, like something like BERT, that's already quite a few years old,

17:03 but still, if you look at the literature and look at the experiments people are doing,

17:07 it's still very competitive.

17:09 It's like you get really good results, even with one of the most basic foundation models.

17:14 And you can use that, initialize your model with that, and then just train a small task

17:18 network on top instead of training everything from scratch, which is what you had to do

17:22 before.

17:23 And it's like, if you imagine hiring a new employee, it's like, yes, you could raise them

17:29 from birth yourself, which is like a very creepy concept, but it's really

17:35 similar.

17:36 Teach them everything.

17:38 You were born to be a barista.

17:40 Let me tell you.

17:41 Yeah.

17:42 And then you teach them English and you teach them.

17:44 Yeah.

17:45 I mean, it's a lot of work.

17:47 And I guess you know this more than me because you have kids.

17:50 Right.

17:51 So yeah.

17:52 So, and you know, it's understandable that like, okay, this made a lot

17:56 of these ML projects really hard, but now you actually have the employee come in and

18:00 they can, they know how to talk to people.

18:02 They speak the language and all you have to teach them is like, Hey, here's how you make

18:06 a coffee here.

18:07 Exactly.

18:08 Yeah.

18:09 You basically lean on the school system to say they know the language, they know arithmetic,

18:15 they know how to talk to people.

18:16 I just need to show them how this espresso machine works.

18:19 Here's how you check in.

18:21 Please take out the trash every two hours.

18:23 Like, yeah, a very little bit of specialized information, but the sort of general

18:27 working human knowledge is like the base LLM.

18:30 Right.

18:31 That's the idea.

18:32 And also transfer learning, it's still, it's just one technology and in context learning,

18:36 which is what we have with these generative models, that's also just another technique.

18:41 Like it's, you know, it's not the case that transfer learning is sort of outdated or has

18:45 been replaced by in context learning.

18:47 It's two different strategies and you use them in different contexts.
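The idea Ines lays out, a frozen pretrained model with a small trainable task head on top, can be sketched in a few lines of pure Python. This is a toy illustration, not a real foundation model: the "encoder" is a crude hand-built feature extractor standing in for something like BERT, and only the tiny logistic-regression head gets trained.

```python
import math

# Transfer learning in miniature: the "pretrained" encoder is frozen and
# only a small task head (one weight per feature) is trained on top of it.

def pretrained_encoder(text: str) -> list:
    """Frozen stand-in for a foundation model: maps text to fixed features."""
    words = text.lower().split()
    return [
        sum(1 for w in words if w in {"great", "love", "nice"}),   # positive cue
        sum(1 for w in words if w in {"bad", "awful", "hate"}),    # negative cue
        1.0,                                                       # bias feature
    ]

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train_head(examples, epochs=200, lr=0.5):
    """Train only the small task head; the encoder is never updated."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for text, label in examples:
            x = pretrained_encoder(text)
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            # gradient step on log loss, applied only to the head weights
            w = [wi + lr * (label - p) * xi for wi, xi in zip(w, x)]
    return w

# A handful of labeled examples is enough, because the heavy lifting
# already happened inside the (frozen) encoder.
train = [("I love this", 1), ("great stuff", 1), ("this is awful", 0), ("bad and worse", 0)]
w = train_head(train)

def predict(text: str) -> int:
    x = pretrained_encoder(text)
    return int(sigmoid(sum(wi * xi for wi, xi in zip(w, x))) > 0.5)
```

The point of the sketch is the division of labor: four labeled examples train the head, while general "knowledge of the language" is assumed to already live in the frozen encoder, which is why task-specific data needs can be so small.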

18:51 So another thing I want to touch on for people, I know some people, probably everyone listening

18:56 is more or less aware of this, but in practice, a lot of folks out there, certainly

19:01 the ones who are not in the ML or developer space.

19:06 They just go to ChatGPT or they go somewhere and they're like, this is the AI I've gone

19:10 to, right?

19:12 Maybe they go to Bard.

19:13 I don't know.

19:14 Gemini, whatever they call it.

19:16 But there's a whole bunch, I mean, many, many, many open source models with all sorts of

19:22 variations.

19:23 One thing I really like is LM Studio, which someone introduced me to

19:29 a couple of months ago.

19:30 And basically it's a UI for exploring hugging face models and then downloading them and

19:35 running them with like a chat interface, just in a UI.

19:38 And the really cool thing is they just added Llama 3, but a lot of these are open source.

19:43 A lot of these are accessible.

19:44 You run them on your machine; 7-billion-parameter models run easily on my Mac mini.

19:48 Yeah.

19:49 What do you think about some of these models rather than the huge ones?

19:52 Yeah, no, I think it's, and also a lot of them are like, you know, the model itself

19:56 is not necessarily much smaller than what a lot of these chat systems deploy.

20:02 And I think it's also, you know, these are really just the core models for everything

20:06 that's like proprietary and sort of in-house behind like an API, there is at least one open source

20:13 version that's very similar.

20:15 Like I think the whole field is really based on academic research, a lot of the same

20:22 data that's available.

20:23 And I think the most important differentiation we see is then around these chat assistants

20:29 and how they work and how the products are designed.

20:31 So I think it's also, this is a, it's kind of a nice exercise or a nice way to look at

20:36 this distinction between the products versus the machine-facing models.

20:42 Because I think AI, or, you know, these products, are more than just a model.

20:46 And I think that's like a super important thing to keep in mind.

20:49 It's really relevant for this conversation because you have a whole section where you

20:53 talk about regulation and what is the thing, what is the aspect of these things that should

20:58 or could be regulated?

20:59 We'll get to that in a minute.

21:00 A lot of the confusion that people have around like, Ooh, is all

21:04 AI going to be locked away behind APIs?

21:07 And how do these bigger models work?

21:10 I think it kind of stems from the fact that the distinction between

21:14 the models and products isn't always clear.

21:17 And you know, you could even have, maybe, some companies that are in this business,

21:20 you know, it benefits them to call everything the AI.

21:24 That really doesn't help.

21:25 So here you really see the models.

21:27 Yeah.

21:28 And just sorry to talk over you, but to give people a sense, even if you search for Llama

21:32 3 in this thing, there's 192 different configured, modified, et cetera, ways to

21:39 work with the Llama 3 model, which is just crazy.

21:42 So there's a lot of, a lot of stuff that maybe people haven't really explored, I imagine.

21:46 Yeah, it's very cool.

21:47 Yeah.

21:49 One other thing about this, just while we're on it, is it also comes with an OpenAI-compatible API.

21:51 So you could just run it, turn on a server API, and point anything that wants to talk to

21:56 it.

21:57 Very fun.
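Because the local server speaks the OpenAI chat-completions wire format, talking to it takes only the standard library. A minimal sketch, with some assumptions: LM Studio's default local address and port (`localhost:1234`) and the model name `llama-3-8b-instruct` are placeholders here; check the app's server tab for the actual values on your machine.

```python
import json
import urllib.request

# Assumed default address for a locally running OpenAI-compatible server.
BASE_URL = "http://localhost:1234/v1"

def build_chat_request(model: str, prompt: str):
    """Build the URL and JSON body for a chat completion call."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return f"{BASE_URL}/chat/completions", json.dumps(payload).encode("utf-8")

def ask(model: str, prompt: str) -> str:
    """Send the request to the local server and return the reply text."""
    url, body = build_chat_request(model, prompt)
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # Only works with a local server actually running.
    print(ask("llama-3-8b-instruct", "Say hello in one word."))
```

The nice part of this design is that any existing client built for the hosted OpenAI API can be pointed at the local model just by swapping the base URL.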

21:58 But let's talk about some of the things you talked about in your talk.

22:01 The AI revolution will not be monopolized.

22:04 How open source beats economies of scale, even for LLMs.

22:07 I love it.

22:08 It's a great title and a great topic.

22:09 Thanks.

22:10 No, I'm very, it's something I'm very passionate about.

22:12 I was like, I was very, you know, happy to be able to say a lot of these things or to,

22:17 you know, also be given a platform.

22:19 You and I, we spoke before about open source and running successful businesses in the tech

22:24 space and all sorts of things.

22:25 So it's a cool follow on for sure.

22:27 I think one of the first parts that you talked about that was really interesting and has

22:32 nothing to do specifically with LLMs or AI is just why is open source a good choice and

22:39 why are people choosing, why is it a good thing to base businesses on and so on?

22:44 Yeah.

22:45 Also, often when I give this as a talk, I ask for a show of hands, like, Hey,

22:50 who uses open source software?

22:52 Who works for a company that depends on open source software?

22:55 Who's contributed before?

22:56 And usually I think most people raise their hand when I ask like who works for a company

23:01 that relies on open source software.

23:03 So I often feel like, Hey, I don't even have to like explain like, Hey, it's a thing.

23:09 It's more about like, you know, collecting these reasons.

23:11 And I do think a lot of it is around like, you know, the transparency, the extensibility,

23:16 it's all kind of connected.

23:17 Like you're not locked in, you can run it in house, you can fork it, you can program

23:23 with it.

23:24 Like those are all important things for companies when they adopt software.

23:28 And also often, you have these small teams running the project, they can accept PRs,

23:31 they can move fast.

23:33 There's a community around it that can basically give you a sense for, Hey, is this a thing?

23:37 Should I adopt it?

23:39 And all of this I think is important.

23:40 And we also often make a point, like, yes, I always mention, Hey, it's also free,

23:46 which is what people usually associate with open source software.

23:49 It's kind of the first thing that comes to mind, but I actually don't think this is for

23:52 companies the main motivation why they use open source software.

23:56 I absolutely agree.

23:58 Even though we have FOSS, free and open source software, this is not really why companies

24:05 care about it.

24:06 Right?

24:07 Some people do, some people don't, but companies, they often see that as a negative, I think

24:12 almost like, well, who do we sue if this goes wrong?

24:15 Where's our service level agreement?

24:17 Who's going to help us?

24:18 Who's legally obligated to help us?

24:20 We've definitely also seen that, or like we have companies who are like, well, who can

24:24 we pay or can we pay to like, I don't know, get some guarantee or like some support or

24:32 can you like confirm to us that, Hey, if there is a critical vulnerability, that's like really

24:37 directly affecting our software, which has never happened, but are you going to fix it?

24:42 We're like, yes, we can say that that's what we've been doing.

24:45 But if you want that guarantee, we can give that to you for money.

24:48 Sure.

24:49 But like, you can pay us, we'll promise to do what we already promised to do, but we'll

24:52 really double, double promise to do it.

24:53 Right.

24:54 That's definitely a thing.

24:55 And also it's kind of to, you know, to go back up to the business model thing, it's

24:57 what we've seen with Prodigy, which, you know, we really offer as a tool

25:02 that follows the open source spirit.

25:04 Like you don't, you pip install it, it's a Python library, you work with it, but we decided

25:09 to kind of use that as a stepping stone between our free open source offering and like the

25:13 SaaS product that we're about to launch soon, hopefully.

25:18 And it's kind of in the middle and it's paid.

25:20 And we've definitely not found that this is like a huge disadvantage for companies.

25:25 Like, yeah, sure, you always have companies with like no budget, but those are also usually

25:29 not the teams that are really doing, you know, a lot of the high value work because you know,

25:34 it is quite normal to have a budget for software tools.

25:38 Companies pay a lot for this.

25:39 Like, you know, if you, if you want to buy Prodigy, like that, that costs less than,

25:43 I don't know, getting a decent office chair, like, you know, in a, in a commercial context,

25:48 these, these scales are all a bit different.

25:50 So yeah, I do think companies are happy to pay for something that they need and that's

25:54 cool.

25:55 Yeah.

25:56 And the ones who wouldn't have paid, there's a group who said, well, maybe I'll use the

25:59 free one, but they're not serious enough about it to actually pay for it or actually make

26:04 use of it.

26:05 You know, I think of sort of analogies of piracy, right?

26:08 Like, oh, they stole our app or they stole our music.

26:10 Like, well, it was a link and they clicked it, but they wouldn't have bought it

26:14 or used it at all.

26:15 It's not like you lost a customer because they were not going to be customers.

26:18 They just happened to click the link.

26:19 Yeah, exactly.

26:20 I always tell the story of like, when I was, you know, a teenager, I did download a cracked

26:25 version of Adobe Photoshop.

26:27 And because I was a teenager, I would have never been able to afford it. Like, back then

26:31 they didn't have a SaaS model.

26:32 Like, I don't know what Photoshop would cost, but like, it's definitely not something

26:35 I would have been able to afford as a 13, 14 year old.

26:38 So I did find that online.

26:39 I downloaded it.

26:40 I'm pretty sure if Adobe had wanted, they could have come after me for that.

26:45 And I do think like, I don't know, maybe I'm giving them too much credit, but I do think

26:48 they might've not done that because they're like, well, what, it's not like we lost a

26:52 customer here.

26:53 And now I'm an adult and I'm, I'm proficient at Photoshop and now I'm paying for it.

26:57 Yeah, exactly.

26:58 And I think there was this whole generation of teenagers who then maybe went into creative

27:02 jobs and came in with Photoshop skills.

27:04 Like, compared to all these other teenagers I was hanging out with on

27:08 the internet, mostly girls,

27:11 I wasn't even that talented at Photoshop specifically.

27:13 So maybe there was someone smart who thought about this as, like, a business strategy:

27:18 let these teenagers have our professional tools.

27:21 Exactly.

27:22 It's almost marketing.

27:23 Yeah.

27:24 Another aspect here that I think is really relevant to LLMs is "runs in-house", aka we're

27:30 not sending our private data, private source code, API keys, et cetera, to other companies

27:36 that may even use that to train their models, which then regurgitate that back to other

27:40 people who are trying to solve the same problems.

27:42 Right.

27:43 That's also, we're definitely seeing that companies are becoming more and more aware

27:46 of this, which is good.

27:48 Like in a lot of industries, like I wouldn't want, I don't know, my healthcare provider

27:52 to just upload all of my data to like whichever SaaS tool they decide to use at the moment.

27:57 Like, you know, of course not.

27:59 So I think it's, you know, it's, it's good.

28:00 And then also with, you know, more data privacy regulations, that's all, that's really on

28:05 people's minds and people don't want this.

28:08 Like often we have companies or users who actually have to run a lot of their AI

28:13 stuff on completely air-gapped machines, so they can't even have internet access.

28:17 Or it's about, you know, financial stuff.

28:19 We're actually working on a case study that we're hoping to publish soon where even the

28:24 financial information can move markets.

28:26 It's even segregated in the office.

28:28 So it needs to be 100% in-house.

28:31 Yeah.

28:32 That makes sense.

28:33 And I think open source software, it's great because you can do that and you can build

28:36 your own things with it and really decide how you want to host it, how it fits into

28:41 your existing stack.

28:43 That's another big thing.

28:44 People will already use some tools and you know, you don't want to change your entire

28:49 workflow for every different tool or platform you use.

28:53 And I think especially people have been burned by that so many times by now.

28:56 And there are so many, you know, unreliable startups. You have a company

29:01 that really tries to convince you to build on their product,

29:04 and then two months later they close everything down. Or, you know, it doesn't even have to

29:08 be a startup.

29:09 You know, Google... I'm still mad at Google for shutting down Google Reader.

29:14 And I don't know, it's been over 10 years, I'm sure.

29:17 And I'm still angry about that.

29:19 I actually did this in practice.

29:20 We did it.

29:21 We were invited to give a talk at Google and I needed a text example to visualize, you

29:26 know, something grammatical, and the text I made was: Google shut down Google Reader.

29:30 That's a quiet protest.

29:33 That's amazing.

29:34 Yeah.

29:35 We're going to run sentiment analysis on this article here.

29:39 Sure.

29:40 Open source projects can become unmaintained and that sucks, but like, you know, you can

29:44 fork it.

29:45 It's there and you can have it.

29:47 So that's a mitigating factor.

29:49 And I think we've always called it like: you can reinvent the wheel, but don't reinvent

29:52 the road, which is basically, you can build something.

29:57 Reinventing the wheel I don't think is bad, but you don't want to make people follow,

30:01 like, you know, your way of doing everything.

30:05 And yeah, that's interesting.

30:06 Yeah.

30:07 Like we have electric cars now.

30:08 All right.

30:09 So give us a sense of some of the open source models in this AI space here.

30:15 I've kind of divided it into sort of three categories.

30:18 So one of them is what I've called task specific models.

30:22 So those are really models that are trained to do one specific thing or a few specific things.

30:27 It's kind of what we distribute for spaCy.

30:30 There's also a lot of really cool community projects like SciSpacy for scientific and biomedical

30:37 text.

30:38 Stanford also publishes their Stanza models.

30:41 And yeah, if you've been on the Hugging Face Hub, there's like tons of these models that

30:45 were really fine tuned to predict like a particular type of categories, stuff like that.

30:50 And so that's been around for quite a while, quite established.

30:54 A lot of people use these in production, and it's quite established. Especially by

30:58 today's standards, they're quite small and cheap, but of course they do one particular thing.

31:03 So they don't generalize very well.

31:05 So that's kind of the one category.

31:07 You probably used to think of them as large and now you see how giant, how many gigabytes

31:14 those models are, you know?

31:15 Yeah.

31:16 When deep learning first kind of came about and people were sort of migrating from linear

31:20 models and stuff, like I've met people complaining that the models were so big and

31:25 slow and that was before we even used much transfer learning and transformer models and

31:33 BERT and stuff.

31:34 And even when that came about, it was also of course the challenge like, Hey, these are

31:37 significantly bigger.

31:38 We do have to change a lot around it.

31:40 Or even, you know, Google who published BERT, they had to do a lot of work around it to

31:46 kind of make it work into their workflows and ship them into production and optimize

31:50 them because they're quite different from what was there before.

31:55 This portion of Talk Python to Me is sponsored by porkbun.com.

31:59 Launching a successful project involves many decisions, not the least of which is choosing

32:03 a domain name.

32:05 And as your project grows, ownership and control of that domain is critical.

32:09 You want a domain registrar that you can trust.

32:12 I recently moved a bunch of my domains to a new provider and asked the community who

32:15 they recommended I choose.

32:18 Porkbun was highly recommended.

32:20 Porkbun specializes in domains that developers need like .app, .dev and .foo domains.

32:26 If you're launching that next breakthrough developer tool or finally creating a dedicated

32:30 website for your open source project, how about a .dev domain or just show off your

32:36 kung.foo programming powers with a domain there.

32:40 These domains are designed to be secure by default.

32:43 All .app and .dev domains are HSTS preloaded, which means that all .app and .dev websites

32:49 will only load over an encrypted SSL connection.

32:53 This is the gold standard of website security.

32:56 If you're paying for Whois privacy, SSL certificates and more, you should definitely check out

33:02 Porkbun.

33:03 These features are always free with every domain.

33:05 So get started with your next project today.

33:08 Lock down your .app, .dev or .foo domain at Porkbun for only $1 for the first year.

33:14 That's right, just $1.

33:17 Visit talkpython.fm/porkbun.

33:19 That's talkpython.fm/porkbun.

33:22 The link is in your podcast player's show notes.

33:25 Thank you to Porkbun for supporting the show.

33:28 Another one in this category of task specific models is SciSpacy, which is kind of cool.

33:34 What's SciSpacy?

33:35 Yeah, so SciSpacy, that's for scientific biomedical text that was published by Allen AI researchers.

33:42 And yeah, it's really, it has like components specific for working with that sort of data.

33:49 And it's actually, if that's kind of the domain any listeners are

33:53 working with, definitely check it out.

33:55 They've also done some pretty smart work around, like, training components, but also implementing

34:02 like, hybrid rule-based things for, say, acronym expansion.

34:07 There are, like, cool algorithms that you can implement that don't necessarily need much

34:11 machine learning, but that work really well.

34:13 And so it's basically this suite of components and also models that are more tuned for that

34:19 domain.
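As a rough illustration of the kind of hybrid rule-based approach Ines describes, here is a hand-rolled sketch of acronym expansion. This is not SciSpacy's actual implementation (its detector follows a published method and is considerably more careful); it just shows why this kind of task often needs no machine learning at all:

```python
import re

def find_abbreviations(text):
    """Collect 'long form (ABBR)' pairs: the abbreviation's letters must
    match the initials of the words right before the parentheses."""
    pairs = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        abbr = match.group(1)
        preceding = text[:match.start()].split()
        candidate = preceding[-len(abbr):]
        if "".join(w[0].upper() for w in candidate) == abbr:
            pairs[abbr] = " ".join(candidate)
    return pairs

def expand(text, pairs):
    """Replace bare occurrences of each abbreviation with its long form."""
    for abbr, long_form in pairs.items():
        text = re.sub(rf"(?<!\()\b{abbr}\b(?!\))", long_form, text)
    return text

doc = "Magnetic resonance imaging (MRI) is common. MRI scans are routine."
pairs = find_abbreviations(doc)
print(pairs)  # {'MRI': 'Magnetic resonance imaging'}
```

Once the definition is found, every later bare occurrence can be expanded deterministically, which is exactly the kind of reliable behavior that is hard to guarantee from a generative model.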

34:20 You mentioned some, but also encoder models.

34:24 What's the difference between the task specific ones and the encoder ones?

34:26 That's kind of also what we were talking about earlier, actually, with the transfer learning

34:31 foundation models.

34:33 These are models trained with a language modeling objective, for example, like Google's BERT.

34:38 And that can also be the foundation for task specific models.

34:41 That's kind of what we're often doing nowadays.

34:44 Like you start out with some of these pre-trained weights, and then you train like this task

34:50 specific network on top of it that uses everything that is in these weights about the language

34:56 and the world.

34:57 And yeah, actually by today's standards, these are still relatively small and relatively

35:02 fast and they generalize better because they're trained on a lot of raw text that has like

35:08 a lot of, yeah, a lot of that intrinsic meta knowledge about the language and the world

35:13 that we need to solve a lot of other tasks.

35:16 Absolutely.

35:17 And then you've used the word, the term large generative models for things like Lama and

35:23 Mistral and so on.

35:25 One thing that's very unfortunate when we're talking about these models is that like everything

35:29 we've talked about here has at some point been called an LLM by someone.

35:34 That makes it like really hard to talk about it.

35:39 You can argue that like, well, all of them are kind of large language models.

35:43 And then there's also the marketing confusion.

35:47 When LLMs were hot, everyone wanted to have LLMs.

35:52 And so by some definition of LLMs, we've all been running LLMs in production for years.

35:57 But basically I've kind of decided, okay, I want to try and avoid that phrase as much

36:02 as possible because it really doesn't help.

36:04 And so large generative models kind of captures that same idea, but it makes it clear, okay,

36:09 these generate text, text goes in, text comes out and they're large and they're different

36:15 from the other types of models basically.

36:18 Question out of the audience is Mr. Magnetic said, I'd love to learn how to develop AI.

36:22 So maybe let me rephrase that just a little bit and see what your thoughts are.

36:25 Like if people want to get more foundational, this kind of stuff, like what areas should

36:28 they maybe focus in to learn?

36:32 What are your thoughts there?

36:33 It depends on really what it means.

36:35 Like if you really, there is a whole path to, okay, you really want to learn more about

36:41 the models, how they work, the research that goes into it.

36:45 I think there's a lot of actually also academic resources and courses that you can take that

36:50 are similar to what you would learn in university if you started...

36:54 An ML course.

36:55 Yeah, like ML and also I think some universities have made some of their like beginners courses

37:01 public.

37:02 I think Stanford has.

37:03 Yeah, right.

37:04 I thought Stanford, I think there's someone else, but like there's definitely also a lot

37:07 of stuff coming out.

37:08 So you can kind of go in that direction, really learn, okay, what goes into this?

37:14 What's the theory behind these?

37:15 And there are some people who really like that approach.

37:18 And then there's a whole more practical side.

37:20 Okay, I want to build an application that uses the technology and it solves a problem.

37:26 And often it helps to have like an idea of what you want to do.

37:29 Like if you don't want to develop AI for the sake of it, then it often helps like, hey,

37:33 you have, even if it's just your hobby, like you're into football and you come up with

37:37 like some fun problem, like you want to analyze football news, for example, and analyze it

37:44 for something you care about.

37:45 Like, I don't know, like often really helps to have this hobby angle or something you're

37:49 interested in.

37:50 Yeah, it does.

37:51 Yeah.

37:52 And then you can start looking at tools that go in that direction, like start with some

37:55 of these open source models, even, you know, try out some of these generative models, see

38:00 how you go; if you want to do information extraction, try out maybe something like spaCy.

38:06 That's like really a lot there.

38:07 And it's definitely become a lot easier to get started and build something these days.

38:12 Another thing you talked about was economies of scale.

38:16 And this one's really interesting.

38:17 So basically we've got Gemini and OpenAI where they've just got so much traffic. And kind

38:25 of coming back a little bit to the earlier question: if you want to do this kind of stuff and

38:29 run your own service, it's tricky even if you had the equivalent hardware,

38:33 because of even just the way you batch compute. Maybe you want to talk about that

38:38 a bit.

38:39 The idea of economies of scale is basically, well, as the, you know, as the companies produce

38:43 more output, the cost per unit decreases and yeah, there's like all kinds of, you know,

38:48 basically it gets cheaper to do more stuff.

38:51 And you know, they're like a lot of more boring, like businessy reasons why it's like that.

38:56 But I think for machine learning specifically, the fact that GPUs are so parallel really

39:02 makes a difference here.

39:03 And because, you know, you get the user text in, you can't just arbitrarily chop up

39:07 that text because the context matters.

39:09 You need to process that.

39:10 So in order to make the most use of the compute, you basically need to batch it up.

39:15 So you either, you know, kind of need to wait until there's enough to batch up.

39:20 And that means that, yes, that favors a lot of those providers that have a lot of traffic

39:25 or, you know, you introduce latency.
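The batching trade-off Ines describes can be sketched in a few lines of pure Python: a server either waits until it has enough requests to fill a batch, or flushes a partial one when the wait would hurt latency. The batch size and timeout here are made-up numbers for illustration:

```python
import time

def collect_batch(queue, max_batch=8, max_wait_s=0.05):
    """Pull requests into one GPU batch: flush when the batch is full,
    or when we've waited long enough that latency would suffer."""
    batch = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        if queue:
            batch.append(queue.pop(0))
        elif time.monotonic() >= deadline:
            break  # flush a partial batch rather than keep callers waiting
        else:
            time.sleep(0.001)  # a real server would await new requests here
    return batch

# A high-traffic provider fills batches instantly:
print(collect_batch(list(range(20))))   # a full batch of 8
# A low-traffic deployment eats the wait and flushes a partial batch:
print(collect_batch([101, 102]))        # [101, 102], after max_wait_s
```

With heavy traffic the GPU always sees full batches; with light traffic you pay either in latency (waiting) or in wasted compute (partial batches), which is the economy-of-scale effect being discussed.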

39:28 So that's definitely something that at least looks like a problem or, you know, something

39:33 that can be discouraging, because it feels like, hey, if supposedly

39:37 the only way you can kind of participate is by running these models, and either you have

39:41 to run them yourself or go via an API, then you're kind of doomed.

39:47 And does that mean that, okay, only some large companies can provide AI for us.

39:52 So that's kind of also the, you know, the point and, you know, the very legit like worry

39:56 that some people have, like, does that lead to like monopolizing AI basically?

40:01 It's a very valid concern because even if you say, okay, look, here's the deal.

40:06 OpenAI gets to run on Azure.

40:08 I can go get a machine with a GPU stuck to it and run that on Azure.

40:12 Well, guess what?

40:13 They get one of those huge ARM chips.

40:16 That's like the size of a plate and they get the special machines and they also get either

40:22 wholesale compute costs or they get just, we'll give you a bunch of compute for some

40:28 ownership of your company, kind of like Microsoft and OpenAI.

40:32 That's a very difficult thing to compete with on one hand, right?

40:34 Yes.

40:35 If you want to, you know, run your own like LLM or generative model API services, that's

40:42 definitely a disadvantage you're going to have.

40:46 But on the other hand, I think one thing that leads to this perception, and that I think is not

40:51 necessarily true, is the idea that to do anything you need basically larger

40:55 and larger models; that if you want to do something specific, the only way to

40:59 get there is to turn that request into arbitrary language and then use the largest model that

41:05 can handle arbitrary language and go from there.

41:07 And I know this is something that, you know, maybe a lot of LLM companies

41:11 want to tell you, but that's not necessarily true.

41:14 And you don't, yeah, for a lot of things you're doing, you don't even need to depend on a

41:19 large model at runtime.

41:21 You can distill it and you can use it at development time and then build something that you can

41:26 run in-house.

41:27 And these calculations also look, look very, very different if you're using something at

41:32 development time versus in production at runtime.

41:36 And then it can actually be totally fine to just run something in-house.

41:40 And the other point here is actually, if we're having a situation where, hey, you're

41:45 paying a large company to provide some service for you, provide a model for you via an API.

41:52 And there are lots of companies and kind of the main differentiator is who can offer it

41:56 for cheaper.

41:57 That's sort of the opposite of a monopoly at least, right?

42:00 That's like competition.

42:01 So this actually, I feel like economies of scale, this idea, does not prove that, hey,

42:07 we're heading into a monopoly.

42:11 And it's also not true, because if you realize that, hey,

42:15 you don't need the biggest, most arbitrary models for everything you're doing, then the

42:22 calculation looks very, very different.

42:24 Yeah, I agree.

42:25 I think there's a couple of thoughts I also have here. One is this LM Studio I was talking

42:30 about: I've been running the Llama 3 7-billion-parameter model locally instead of using ChatGPT

42:35 these days.

42:36 And it's been, I would say just as good.

42:38 And it's, it runs about the same speed on my Mac mini as a typical request does over

42:44 there.

42:45 I mean, can't handle as many, but it's just me, it's my computer, right?

42:47 I'm fine.
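For anyone who wants to try what Michael describes: LM Studio, like several local runners, exposes an OpenAI-compatible HTTP API on localhost, so a request can be built with nothing but the standard library. The port, path, and model name below are placeholders; check what your local server actually reports:

```python
import json
import urllib.request

def build_chat_request(prompt, model="local-model",
                       base_url="http://localhost:1234/v1"):
    """Build an OpenAI-style chat completion request for a local server.
    The base URL and model name are assumptions; LM Studio shows the
    actual values it serves in its UI."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Summarize our Q3 engineering report in three bullets.")
# Sending it only works with a local server actually running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the API shape matches OpenAI's, code written against a hosted model can often be pointed at an in-house one by changing only the base URL, which is the "keep your private data in-house" point in practice.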

42:48 And then the other one is if you specialize one of these models, right?

42:53 You feed it a bunch of your company's datasets.

42:55 It might not be able to write you something in the style of Shakespeare around, you know,

43:01 a legal contract or some weird thing like that, but it can probably answer really good

43:05 questions about what is our company policy on this?

43:08 Or what is, what are our engineering reports about this thing say, or, you know, stuff

43:12 that you actually care about, right?

43:14 You could run that.

43:15 That's kind of what you want.

43:16 Like you actually want to, if you're talking about, if we're going to like some of the

43:19 risks or things people are worried about, like a lot of that is around what people refer

43:23 to like, Oh, the model going rogue or like the model doing stuff it's not supposed to

43:27 do.

43:28 If you're just sort of wrapping ChatGPT and you're not careful, then when you're giving

43:32 it access to stuff, there's a lot of unintended things that people could do with it if you're

43:38 actually running this.

43:39 And once you expose it to users, there's like a lot of risks there.

43:42 And yeah, writing something in the style of Shakespeare is like probably the most harmless

43:46 outcome that you can get, but like that is kind of a risk.

43:51 And you basically, you know, you're also, you're paying and you're, you're putting all

43:54 this work into hosting and providing and running this model that has all these capabilities

43:59 that you don't need.

44:00 And a lot of them might actually be, you know, make it much harder to trust the system and,

44:05 and also, you know, make it a lot less transparent.

44:08 Like that's another aspect, like just, you know, you want your software to be modular

44:11 and transparent and that ties, ties back into what people want from open source.

44:16 But I think also what people want from software in general, like we've over decades and more,

44:21 we've built up a lot of best practices around software development and what makes sense.

44:26 And that's based on, you know, the reality of building software industry.

44:30 And just because there's like, you know, new capabilities and new things we can do and

44:35 a new paradigm doesn't mean we have to throw that all of these learnings away because,

44:39 oh, it's a new paradigm.

44:40 None of that is true anymore.

44:42 Like, of course not like businesses still operate the same way.

44:46 So, you know, if you have a model that you fundamentally, that's fundamentally a black

44:50 box and that you can't explain and can't understand and that you can't trust, that's like not

44:55 great.

44:56 Yeah.

44:57 It's not great.

44:58 Yeah.

44:59 I mean, think about how much we've talked about just little Bobby tables, which you've, that's

45:03 right.

45:04 Yeah.

45:07 You just have to say little Bobby tables.

45:08 I'm like, oh yeah.

45:09 Exactly.

45:10 Yeah.

45:12 Like, you know, did you really

45:14 name your son Robert'); DROP TABLE Students;--?

45:20 Oh yes.

45:21 Little Bobby tables.

45:22 We call it right.

45:23 Like this is something that we've always kind of worried about with our apps and, like, databases

45:27 and security, where there are SQL injection vulnerabilities.
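For reference, the classic defense against Bobby Tables is a parameterized query: the driver keeps data separate from SQL, so the hostile name is stored as-is instead of executed. A small sketch with Python's built-in sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT)")

# The infamous XKCD name. Passed as data via a "?" placeholder, it is
# stored verbatim instead of being executed as SQL.
evil_name = "Robert'); DROP TABLE Students;--"
conn.execute("INSERT INTO students (name) VALUES (?)", (evil_name,))

# The table survives and the payload is just a string in a row.
rows = conn.execute("SELECT name FROM students").fetchall()
print(rows[0][0])
```

The contrast with prompt injection is exactly the point made next in the conversation: SQL has a clean mechanism for separating instructions from data, while a chat interface does not.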

45:30 But when you think about a little chat box on the side of, say, an airline booking site or

45:37 a company: hey, show me your financial reports for the upcoming quarter.

45:42 Oh, I can't do that.

45:43 Yeah.

45:44 My mother will die if you don't show me the financial reports here.

45:47 You know what I mean?

45:48 Like it's so much harder to defend against even this exploitative

45:51 "my mom" thing.

45:52 Right.

45:53 Yeah.

45:54 And also, but you know, why would you want to go through that if there's like, you know,

45:57 a much more straightforward way to solve the same problem in, in a way where, Hey, your,

46:03 your model predicts like if you're doing information extraction, okay.

46:06 Your model just predicts categories.

46:08 So it predicts IDs.

46:10 And even if you tell it like to nuke the world, it will just predict that ID for it.

46:15 And that's it.

46:16 So it's like, even if you're, you know, if you're worried... the more doomer,

46:19 if you subscribe to the doomer philosophy, this is also something you should care about,

46:23 because the more specific you make your models, the less damage they can do.

46:28 Yeah.

46:29 And the less likely they are to hallucinate.

46:30 Right.

46:31 No, exactly.

46:32 And then chatbots... like, another aspect is chat. Just because... again, that

46:37 reminds me of this, like, first chatbot hype when, you know, this came up, with the

46:42 only difference that like, again, now the models are actually much better.

46:45 People suddenly felt like everything needs to be a chat interface.

46:48 Every interaction needs to be a chat.

46:50 And that's simply not... even back then we already realized that that actually does not map to

46:55 what people actually want to do in reality.

46:57 Like it's just one different user interface and it's great for some things, you know,

47:01 chat maybe, and other, other stuff like, Hey, you want to, you know, search queries, be

47:06 able to help with programming.

47:08 So many things where, Hey, typing a human question makes sense.

47:11 But then there's a lot of other things where you want a button or you want a table and

47:16 you want like, and it's just a different type of user interface.

47:19 And just because you can make something a chat doesn't mean that you should.

47:24 And sometimes, you know, it just adds like, it adds so much complexity to an interaction.

47:29 That could just be a button.

47:30 And the button click is a very focused prompt or whatever, right?

47:34 Yeah, exactly.

47:35 Yeah.

47:36 Even if it's about like, Hey, your earnings reports or something, you want to just see

47:38 a table of stuff and sum it up at the end.

47:41 You don't want your model to confidently say 2 million.

47:45 That's not solving the problem if you're a business analyst.

47:48 Yeah.

47:49 Like you want to see stuff.

47:50 So yeah.

47:51 And that actually also sort of ties into, yeah, another point that I've also had in

47:54 the talk, which is around like actually looking at what are actually the things we're trying

47:58 to solve in industry and how have these things changed?

48:01 And while there is new stuff you can now do, like generating texts and that finally works,

48:06 yay.

48:07 There's also a lot of stuff around text goes in, structured data comes out and that structured

48:12 data needs to be machine readable, not human readable, like needs to go into some other

48:16 process and a lot of industry problems, if you really think about it, have not changed

48:22 very much.

48:23 They've only changed in scale.

48:24 Like we started with index cards.

48:26 Well, there's kind of limit of how much you can do with that and how many projects you

48:30 can do at the same time.

48:31 But this was always, even since before computers, this has always been bringing structure into

48:35 unstructured data has always been the fundamental challenge.

48:38 And that's not going to just magically go away because we have new capacities and new

48:43 things we can do.

48:44 Let's talk about some of the workflows here.

48:47 So you have an example where you take a large model and do some prompting and this sort

48:53 of iterative model assisted data annotation.

48:56 Like, what's that look like?

48:58 You start out with this model, maybe one of these models that you can run locally or via an API

49:03 during development time, and you prompt it to produce some structured output, for example,

49:10 or some answer.

49:11 You know, we also have, like, for example, something like spacy-llm that

49:14 lets you plug in any model in the same way you would otherwise train a model yourself.

49:21 And then you look at the results, and you can actually get a good feel for how your

49:25 model is even doing.

49:27 And you can also, before you really get into distilling a model, you can create some data

49:32 to evaluate it.

49:33 Because I think that's something people are often forgetting, because it's

49:37 maybe not the funnest part, but it's really, you know, it's like writing tests.

49:41 It's like writing tests can be frustrating.

49:43 I remember when I kind of started out, like the tests are frustrating because they actually

49:47 kind of turn up all of these edge cases and mistakes that you kind of want to forget about.

49:52 Right.

49:53 Oh, I forgot to test for this.

49:54 Whoops.

49:55 Yeah.

49:57 And then like, Oh, if you start writing tests and you suddenly see all this stuff that goes

49:59 wrong and then you have to fix it.

50:01 And it's like, it's annoying.

50:02 So you better just not have tests.

50:04 I can see that.

50:05 But like evaluation is kind of like that.

50:07 And it's ultimately a lot of these problems.

50:10 You have to know what you want and here's the input.

50:14 Here's the expected output.

50:15 You kind of have to have to define that.

50:17 And that's not something any AI can help you with because you know, you are trying to teach

50:21 the machine something.

50:22 You're teaching the AI.

50:23 Yeah.

50:24 You want to build something that does what you want.

50:26 So you kind of need examples where you know the answer and then you can also evaluate

50:29 like, Hey, how does this model do out of the box for like some easy tasks?

50:34 Like, Hey, you might find something like GPT-4 can give you 75% accuracy out of the box without,

50:41 without any work.

50:42 So that's, that's kind of good or even higher.

50:44 And it's like, if it's a bit harder, you'll see, oh, okay,

50:47 you get like 20% accuracy, which is pretty bad.

50:50 And the bar is very low, but that's kind of the ballpark that you're also looking to beat.

50:55 And then you can look at examples that are predicted by the model.

50:58 All you have to do is look at them.

51:00 Yes.

51:01 Correct.

51:02 If not, you make a small correction, and then you go through that and basically do

51:05 that until you've beaten the baseline.
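The evaluation step Ines describes boils down to "here's the input, here's the expected output" plus an accuracy count. A minimal sketch in plain Python, with a trivial keyword model standing in for whatever you're actually evaluating (a prompted LLM, a distilled pipeline, rules); the examples and keywords are invented for illustration:

```python
# Hand-labeled evaluation examples: (input text, expected label).
eval_data = [
    ("The keeper saved a penalty in the 90th minute", "sports"),
    ("Shares fell sharply after the earnings call", "finance"),
    ("The midfielder signed a new contract", "sports"),
    ("The central bank raised interest rates", "finance"),
]

def toy_model(text):
    """Stand-in for whatever you're evaluating: GPT-4 out of the box,
    a distilled model, or a handful of rules."""
    finance_words = {"shares", "earnings", "bank", "rates"}
    return "finance" if finance_words & set(text.lower().split()) else "sports"

def accuracy(model, data):
    correct = sum(1 for text, gold in data if model(text) == gold)
    return correct / len(data)

score = accuracy(toy_model, eval_data)
print(f"{score:.0%}")
```

The number this prints is the baseline to beat: run the same `accuracy` function over the big generative model's predictions and over your smaller distilled model, and you know when the small one is good enough to ship.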

51:07 The transfer learning aspect, right?

51:09 Yeah.

51:10 So you use transfer learning in order to give the model, like, a solid foundation of knowledge

51:15 about the language and the world.

51:17 And you can end up with a model that's much smaller than what you started with.

51:21 And you have a model that's really has a task network that's only trained to do one specific

51:26 thing.

51:27 Which brings us to going from prototype to production, where you can sort of try some

51:32 of these things out, but then maybe not run a giant model, but something smaller, right?

51:36 Yeah.

51:38 So you can basically take the behavior that you're interested in from the larger model and train

51:42 components that do exactly that.

51:45 And another thing that's also good or helpful here is to have kind of a good path from prototype

51:51 to production.

51:52 I think that's also where a lot of machine learning projects in general often fail because

51:57 it's all, you have this nice prototype and it all looks promising and you've hacked something

52:01 together in your Jupyter notebook and that's all looking nice.

52:06 You maybe have like a nice Streamlit demo and you can show that, but then you're like,

52:10 okay, can we ship that?

52:11 And then if your workflow that leads to the prototype is completely different from the

52:16 workflow that leads to production, you might find that out exactly at that phase.

52:20 And that's kind of where projects go to die.

52:22 And that's sad.

52:23 And yeah, so that's, that's actually something we've been thinking about a lot.

52:27 And also what we've kind of been trying to achieve with spaCy LLM, where you have this

52:31 LLM component that you plug in and it does exactly the same as the components would do

52:36 at runtime.

52:37 And it really just slots in and then might use GPT-4 behind the scenes to create the

52:43 exact same structured object.

52:45 And then you can swap that out.

52:46 Or maybe, you know, there are a lot of things you might even want to swap out with rules

52:50 or no AI at all.

52:52 Like, you know, ChatGPT is good at recognizing US addresses, and it's great to

52:57 build a prototype, but instead of asking it to extract US addresses, for example, you

53:02 can ask it: give me spaCy Matcher rules for US addresses.

53:06 And it can actually do that pretty well.

53:08 And then you can bootstrap from there.
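To make the "spaCy Matcher rules" idea concrete: Matcher rules are just lists of per-token dictionaries, so what a model hands you is plain, inspectable data. This is a hedged, very simplified sketch of one US-street-address-shaped pattern; real address matching would need many more patterns:

```python
# A spaCy Matcher pattern is a list of dicts, one dict per token.
# This toy pattern matches things shaped like "123 Main Street" or "42 Oak Ave".
street_types = ["street", "st", "avenue", "ave", "road", "rd",
                "boulevard", "blvd"]

address_pattern = [
    {"IS_DIGIT": True},               # house number, e.g. "123"
    {"IS_TITLE": True, "OP": "+"},    # one or more capitalized name words
    {"LOWER": {"IN": street_types}},  # street type, case-insensitive
]

# With spaCy installed, you would register it roughly like this:
# from spacy.matcher import Matcher
# matcher = Matcher(nlp.vocab)
# matcher.add("US_ADDRESS", [address_pattern])
```

Because the rules are just data, you can read, review, and version-control exactly what gets matched, which is the transparency argument from earlier in the conversation.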

53:10 There's a lot of stuff like that that you can do.

53:12 And there might be cases where you find that, yeah, you can totally beat any model accuracy

53:17 and have a much more deterministic approach if you just write a regex.

53:22 Like that's still true.

53:23 That'll still work.

53:24 Yeah, it's still something it's easy to forget because, you know, again, if you look at research

53:29 and literature, nobody's talking about that because this is not an interesting research

53:34 question.

53:35 Like nobody cares.

53:36 You know, you can take any benchmark and say, I can beat ChatGPT accuracy with two regular

53:41 expressions.

53:42 And that's like, that's true.

53:43 Probably in some cases.

53:44 Yeah.

53:45 It's like, nobody cares.

53:46 Like that's not, that's not research.

53:47 For sure.

53:48 But you know, what is nice to do is to go to ChatGPT or LM Studio or whatever and say,

53:54 hey, I need a Python-based regular expression to match this text and this text.

53:59 And I want a capture group for that.

54:01 And I don't want to think about it.

54:02 It's really complicated.

54:03 Here you go.

54:04 Oh, perfect.

54:05 Now I'll run the regular expression.
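To make that concrete, here is a toy version of the kind of regex an LLM might hand back, with a capture group pulling out just the value of interest; the log line and pattern are invented for illustration:

```python
import re

# Hypothetical log line; we want just the dotted version number.
line = "service started, version 2.14.3, build 9f2c"

# Capture group 1 grabs the version while the surrounding text anchors the match.
pattern = re.compile(r"version\s+(\d+\.\d+\.\d+)")

match = pattern.search(line)
version = match.group(1) if match else None
print(version)
```

The nice part is that once you have the pattern, you can keep it under test and never think about how it was produced again.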

54:06 Yeah, that's actually, that's a good use case.

54:08 I've still been sort of hacking around on like this, you know, interactive regex, because

54:12 I'm not particularly good at regular expressions.

54:15 Neither am I.

54:16 Like, I can do it, but I know people who really can. I think my co-founder

54:20 Matt, he worked through it, like he's more the type who really approaches these things

54:24 very methodically.

54:25 He was like, now I want to read this one big book on regular expressions.

54:30 And he really did it the hardcore way, but that's why he's obviously much

54:35 better than I am.

54:36 I consider regular expressions kind of write-only.

54:39 Like you can write them and make them do stuff, but then reading them back is tricky.

54:42 Yeah.

54:43 At least for me.

54:44 All right, let's wrap this up.

54:45 So what are the things that you did here at the end of your presentation, which I want

54:49 to kind of touch on is you brought back some of the same ideas that we had for like, what

54:54 are the values of open source or why open source, but back to creating these smaller

54:59 focused models.

55:00 Talk us through this.

55:01 Or specific components.

55:02 Yeah.

55:03 I mean, if you kind of look at, hey, what are the advantages of the sort of approach

55:07 that we talked about, of distilling things down, of creating these smaller models, a

55:12 lot of it comes down to it being modular.

55:15 Again, you're not locked in to anything.

55:17 You own the model.

55:19 Nobody can take that away from you.

55:20 It's easier to write tests.

55:22 You have the flexibility.

55:23 You can extend it because you know, it's code.

55:26 You can program with it, because very rarely do you do machine learning for the sake

55:30 of machine learning.

55:31 It's always like, there is some other process.

55:34 You populate a database, you do some other stuff with your stack.

55:37 And so you want to program with it.

55:38 It needs to be affordable.

55:40 You want to understand it.

55:41 You need to be able to say, why is it doing what it's doing?

55:44 Like what do I do to fix it?

55:45 It again, runs in house.

55:47 It's entirely private.

55:49 And then yeah, when I was kind of thinking about this, I realized like, oh, actually,

55:53 you know, this really maps exactly to the reasons we talked about earlier

55:59 why people or companies choose open source.

56:01 And that's obviously not a coincidence.

56:03 It's because ultimately these are principles that we have come up with over a long period

56:07 of time of, yeah, that's good software development.

56:12 And ultimately AI is just another type of software development.

56:15 So of course it makes sense that the same principles make sense and are beneficial.

56:21 And that, you know, just having a workflow where everything's a black box and third party,

56:26 this can work for prototyping, but it's not, that kind of goes against a lot of the things

56:30 that we've identified as very useful in applied settings.

56:35 Absolutely.

56:36 All right.

56:37 So we have to answer the question.

56:39 Will it be monopolized?

56:42 Our contention is no, that open source wins even for LLMs.

56:46 Open source means there's no monopoly to be gained in AI.

56:49 You know, I've kind of broken it down into some of these strategies:

56:54 how do you get to a monopoly?

56:55 And this is not just some big abstract stuff.

56:58 These are things a lot of companies are actively thinking about.

57:00 If you're in a business where it's winner-takes-all, you

57:05 want to get rid of all of that competition that companies hate, that investors hate.

57:10 And there are ways to do that.

57:11 And companies really actively think about this.

57:12 Those pesky competitors, let's get rid of them.

57:14 There are different ways to do that.

57:16 Like one is having this compounding advantage.

57:19 So that's stuff like network effects. Like, you know, if you're a social network, of course,

57:22 that makes a lot of sense.

57:24 Everyone's on it.

57:25 If you could kind of have these network effects, that's good.

57:28 And economies of scale.

57:29 But as we've seen, like economies of scale is a pretty lame moat in that respect.

57:35 Like that has a lot of, you know, a lot of limitations.

57:37 It's not even fully true.

57:39 It's kind of the opposite of a monopoly in some ways.

57:42 Yeah.

57:43 Especially in software.

57:44 Yeah, in software.

57:45 Exactly.

57:46 So it's like, I don't think that's, that's not really the way to go.

57:49 One example that comes to mind, at least for me, maybe I'm seeing it wrong, but Amazon,

57:55 amazon.com, just, you know, how many companies can have a massive warehouse with everything

58:00 near every single person's house?

58:02 Yeah.

58:03 The one platform that everyone goes on.

58:04 So even if you're a retailer, you kind of, yeah, Amazon has kind of forced

58:08 everyone to either sell on Amazon or go bust because...

58:11 Exactly.

58:12 It's very sad, but it's the way it is.

58:14 And then network effects.

58:15 I'm thinking, you know, people might say Facebook or something, which is true, but I would say

58:20 like Slack, actually.

58:21 Oh, okay.

58:22 Or Slack or Discord or, you know, there's a bunch of little chat apps and things, but

58:27 if you want to have one and you want to have a little community, you want people to be

58:30 able to, well, I already have Slack open.

58:32 It's just an icon next to it versus install my own app.

58:36 Make sure you run it.

58:37 Be sure to check it.

58:38 Like people are going to forget to run it and you disappear off the space, you know?

58:41 That makes sense.

58:42 And I do think, you know, these things don't necessarily happen accidentally.

58:45 Like companies think about, okay, how do we, you know, Amazon definitely thought about

58:48 this.

58:49 This didn't just like happen to Amazon.

58:50 Yes, they were lucky in a lot of ways, but like, you know, that's, that's a strategy.

58:55 Exactly.

58:56 Yeah.

58:57 And then another way, which is not really relevant here, is controlling

58:59 a resource. That's really more relevant if it was

59:03 like a physical resource.

59:04 Fiber cables.

59:05 Something like that.

59:06 Yeah.

59:07 I mean, it's, yeah.

59:08 Or like in Germany, I think for a long time, the Telekom, they owned the wires in the building.

59:13 Right.

59:14 Exactly.

59:15 And they still do, I think.

59:16 So they used to have the monopoly.

59:17 Now they don't, but to some extent they still do, because they need

59:20 to come.

59:21 No matter who you sign up with for internet, Telekom needs to come and activate

59:26 it.

59:27 So if you sign up with Telekom, you usually get service a lot faster.

59:31 You get a little better service.

59:32 Yeah, exactly.

59:33 Don't wait two weeks, use us.

59:35 That's kind of, that's how it still works, but we don't, we don't really have that here.

59:38 And then the other, the next point that's very attractive, the final one is regulation.

59:43 So that's kind of like, you have to have a monopoly because the government says so.

59:47 And that is one where we have to be careful, because in all of these

59:53 discussions, we need to make the distinction between the models and the actual products.

01:00:00 They have very different characteristics and do very different things.

01:00:04 If that gets muddied, which a lot of companies quite actively do

01:00:09 in that discourse, then we might end up in a situation where we sort of accidentally

01:00:15 gift a company or some companies a monopoly via the regulation.

01:00:20 Because if we let them write the regulation, for example, we end up not just regulating

01:00:24 products, but lumping that in with the technology itself.

01:00:29 Yeah, it's a part of your talk.

01:00:31 I can't remember if it was the person hosting it or you who brought this up, but an example

01:00:35 of that might be all the third-party cookie banners, rather than just banning targeted

01:00:41 retargeting and tracking.

01:00:43 Like with the GDPR, instead of banning the thing that is the problem,

01:00:49 it's like, let's ban the implementation of the problem.

01:00:52 That's a risk. And in hindsight, yes, I think we would

01:00:56 all agree that we should have just banned targeted advertising.

01:01:01 Instead what we got is these cookie pop-ups.

01:01:02 That's like really annoying.

01:01:04 And that's actually one thing I feel the EU got right. You know,

01:01:08 I'm not an expert on AI regulation or the EU AI Act, but what I'm seeing is at least

01:01:14 they did make a distinction between use cases.

01:01:16 And it's very much, there is a focus on here are the products and the things people are

01:01:21 doing.

01:01:22 How high-risk is that, as opposed to how big is the model, you know,

01:01:26 because that doesn't say anything, and that would kind of be a very dangerous

01:01:30 way to go about it.

01:01:32 But the risk is of course, if we're rushing regulation,

01:01:37 then you know, we might actually end up with something that's not quite fit for purpose.

01:01:41 Or if we let big tech companies write the regulation or lobby.

01:01:45 Lobby for it.

01:01:46 Yeah.

01:01:47 Hey, here are my ideas. Because if they're doing that, I think it's pretty obvious

01:01:50 they're not just worried about the safety of AI when they're appealing to Congress

01:01:55 or whatever.

01:01:56 I think most people are aware of that, but yes, I think the intentions

01:02:00 are even less pure than that.

01:02:02 And I think that's a big risk.

01:02:04 Regulation is very tricky.

01:02:05 It's, you know, just for the record, I am pro regulation.

01:02:08 I'm very pro regulation in general, but I also think you can, if you fuck up regulation,

01:02:14 that can also be very damaging, obviously.

01:02:16 Absolutely.

01:02:17 And it can be put in a way so that it makes it hard for competitors to get into the system.

01:02:22 There's so much paperwork and so much monitoring that you need a team of 10 people just to

01:02:26 operate.

01:02:27 If you're a startup, you can't do that, while the big players can say, hey, we've got a thousand people

01:02:30 and 10 of them work on this.

01:02:31 Like, well.

01:02:32 Even beyond that, you know, if you think back to all the stuff we talked

01:02:35 about, this goes against a lot of the best practices of software.

01:02:40 This goes, you know, this goes against a lot of what we've identified that actually makes

01:02:46 good, secure, reliable, modular, whatever software, safe software internally.

01:02:53 And even doing a lot of the software development internally, like there are so many benefits

01:02:57 of that.

01:02:58 And I think, you know, companies, companies actually working on their own product is good.

01:03:02 And if it was suddenly true that only certain companies could even provide AI models,

01:03:08 I don't even know what that would mean for open source or for academic research.

01:03:12 Like that would make absolutely no sense.

01:03:13 I also don't think that's like really enforceable, but it would mean that, you know, this would

01:03:18 limit like everyone in what they could be doing.

01:03:21 And, you know, there are a lot of

01:03:25 other things you can do if you care about AI safety, but that's really not it.

01:03:30 And I also, you know, I just think being aware of that is good.

01:03:33 I can't really see an outcome where we actually do that.

01:03:37 It would really not make sense.

01:03:39 I can't see the reality of this, you know, shaking out, but I think it's still relevant.

01:03:45 I think the open source stuff and some of the smaller models really does give us a lot

01:03:48 of hope.

01:03:49 So that's awesome.

01:03:50 I feel positive, you know, also very positive about this.

01:03:53 I've also talked to a lot of developers at conferences who said like, yeah, actually

01:03:57 thinking and talking about this gave them some hope, which obviously is nice, because

01:04:02 that's definitely some of the vibe I got.

01:04:04 Like it can be kind of easy to end up a bit disillusioned by like a lot of the narratives

01:04:10 people hear and that, you know, also even if you're entering the field, you're like,

01:04:14 wait, a lot of this doesn't really make sense.

01:04:16 Like why is it like this?

01:04:18 It's like, no, it actually, you know, your intuition is right.

01:04:22 Like a lot of software, software engineering best practices, of course, still matter.

01:04:27 And you know, there are better ways, and

01:04:31 we're not just all going in that one direction.

01:04:33 And I think I definitely believe in that.

01:04:35 A lot of the reasons why open source won in a whole bunch of areas could be exactly why

01:04:40 it wins at LLMs as well.

01:04:41 Right.

01:04:42 Yep.

01:04:43 And you know, again, it's all based on open research.

01:04:44 A lot of stuff is already published and there's no secret sauce.

01:04:49 The software, you know, software industry does not run on like secrets.

01:04:53 All the differentiators are product stuff.

01:04:56 And yes, you know, OpenAI might monopolize or dominate AI-powered chat assistants, or

01:05:02 maybe Google will.

01:05:03 Like that's, you know, that's a whole race that, you know, if you're not in that business,

01:05:06 you don't have to be a part of, but that does not mean that anyone's going to win at or

01:05:10 monopolize AI.

01:05:11 Those are very different things.

01:05:13 Absolutely.

01:05:14 All right.

01:05:15 A good place to leave it as well, Ines.

01:05:16 Thanks for being here.

01:05:17 Yeah, thanks.

01:05:18 That was fun.

01:05:19 Yeah.

01:05:20 If people want to learn more about the stuff that you're doing,

01:05:21 Maybe check out the video of your talks or whatever.

01:05:24 What do you recommend?

01:05:25 Yeah, I'll definitely give you some links like that for

01:05:28 the show notes.

01:05:29 The slides are online, so you can have a look at that.

01:05:31 There is at least one recording of the talk online now from the really cool PyCon Lithuania.

01:05:37 It was my first time in Lithuania this year.

01:05:40 Definitely, you know, if you have a chance to visit their conference, it was a lot of

01:05:43 fun.

01:05:44 I learned a lot about Lithuania as well.

01:05:46 Also, on our website, Explosion AI, we publish kind of a feed of all kinds of stuff

01:05:51 that's happening, maybe some talk or podcast interview, community stuff.

01:05:57 There are a lot of super interesting plugins that are developed by people in the community,

01:06:02 papers that are published.

01:06:03 So we really try to give a nice overview of everything that's happening in our ecosystem.

01:06:08 And then of course, you could try out spaCy, spaCy LLM.

01:06:11 You know, if you want to try out some of these generative models, especially for prototyping

01:06:17 or production, whatever you want to do for structured data.

01:06:21 If you're at any of the conferences, check out the list of events and stuff. I'm going

01:06:27 to do a lot of travel this year.

01:06:29 So I would love to catch up with more developers in person and also learn more about all the

01:06:34 places I'm visiting.

01:06:35 So that's cool.

01:06:36 I've seen the list.

01:06:37 It's very, very comprehensive.

01:06:38 So, I'm kind of a neat freak.

01:06:40 I very much like to organize things in that way.

01:06:43 So yeah.

01:06:45 So there might be something local for people listening that you're going to be doing.

01:06:48 All right.

01:06:49 Well, as always, thank you for being on the show.

01:06:51 It's great to chat with you.

01:06:52 Yeah, thanks.

01:06:53 Thanks.

01:06:54 Till next time.

01:06:55 Bye.