Learn Python with Talk Python's 270 hours of courses

#417: Test-Driven Prompt Engineering for LLMs with Promptimize Transcript

Recorded on Monday, May 22, 2023.

00:00 Large language models and chat-based AIs are kind of mind-blowing at the moment.

00:04 Many of us are playing with them for working on code or just as a fun alternative to search, but others of us are building applications with AI at the core.

00:14 And when doing that, the slight unpredictable nature and probabilistic style of LLMs makes writing and testing Python code very tricky.

00:23 Enter Promptimize, from Maxime Beauchemin and Preset.

00:28 It's a framework for non-deterministic testing of LLMs inside of our applications.

00:33 Let's dive inside the AIs with Max. This is Talk Python To Me, episode 417, recorded May 22nd, 2023.

00:54 Welcome to Talk Python to me, a weekly podcast on Python. This is your host, Michael Kennedy.

00:59 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython, both on fosstodon.org. Be careful with impersonating accounts on other instances, there are many. Keep up with the show and listen to over seven years of past episodes at talkpython.fm. We've started streaming most of our episodes live on YouTube. Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows and be part of that episode.

01:28 This episode is brought to you by JetBrains, who encourage you to get work done with PyCharm.

01:34 Download your free trial of PyCharm Professional at talkpython.fm/done-with-pycharm.

01:41 And it's brought to you by The Compiler Podcast from Red Hat.

01:45 Listen to an episode of their podcast to demystify the tech industry over at talkpython.fm/compiler.

01:53 Max welcome to Talk Python to Me.

01:54 >> Well it's good to be back on the show and now I know it's live too so no mistakes.

01:58 I'm going to try to not say anything outrageous.

02:01 >> People get the unfiltered version so absolutely.

02:05 I love it when people come check out the live show.

02:08 Welcome back.

02:09 It's been a little while since you were on the show.

02:11 About since September.

02:13 We talked about Superset.

02:15 We also talked a little bit about Airflow, some of the stuff that you've been working on.

02:19 And now we're kind of circling back through this data side of things, but trying to bring AI into the whole story.

02:26 So pretty cool project I'm looking forward to talking to you about.

02:29 - Awesome, excited to talk about it too.

02:31 And these things are related in many ways.

02:34 Like one common foundation is Python.

02:37 Like a lot of these projects are in Python.

02:39 They're data related.

02:40 And here, Promptimize and prompt engineering and integrating AI are related in that we're building some AI features into Superset right now and into Preset, so it ties things together in some way.

02:55 - Yeah, I can certainly see a lot of synergy here.

02:57 Before we dive into it, it hasn't been that long since you were on the show, but give us a quick update, just a bit about your background for people who don't know you.

03:05 - Yeah, so my career is maybe 20 or so years in data, doing data engineering, trying to make data useful for organizations.

03:19 Over the past decade or so, I've been very involved in open source.

03:22 I started Apache Airflow in 2014.

03:25 So for those not familiar with Airflow, though, it's pretty well known now.

03:29 It's used at, I heard, like, I think it's like tens of thousands, I think above 100,000 companies are using Apache Airflow, which is kind of insane to think about.

03:38 It's like you started a little project.

03:40 So for me, I started this project at Airbnb and it really took off.

03:44 And I think it's just like great project community fit.

03:48 Like people really needed that.

03:50 It was the right abstraction for people at the time and still today.

03:53 And it just really took off.

03:56 So I was working on orchestration and then I was like, I just love things that are visual and interactive.

04:02 So there was no great open source BI tool out there, business intelligence.

04:07 So this whole data dashboarding, exploration, SQL IDE.

04:11 So it's a playground for people trying to understand and visualize and explore data.

04:16 So I started working on Apache Superset in, I think it was, 2015 or 16, at Airbnb too.

04:21 And we also brought that to the Apache Software Foundation.

04:24 So again, like a very, very popular open source project that's used in like tens of thousands, a hundred thousand organizations or so.

04:33 And today, it has become a great open source alternative to Tableau, Looker, all those business intelligence tools — very viable for organizations.

04:45 And then a quick plug: Preset.io, a company I started.

04:48 I'm also an entrepreneur, I started a company a little bit more than four years ago around Apache Superset, and the idea is to bring Superset to the masses.

04:58 So it's really like hosted, managed, state-of-the-art Apache Superset for everyone, with some bells and whistles.

05:04 So the best Superset you can run — there's a free version too.

05:08 So you can go and play and try it today — get started in five minutes.

05:12 So it's a little bit of a commercial pointer, but also very relevant to what I've been doing, personally over the past like three or four years.

05:20 - Some of the inspiration for some of the things we're gonna talk about as well and trying to bring some of the AI craziness back to products, right?

05:28 From an engineering perspective, not just a, "Hey, look, I asked what basketball team was gonna win this year and it gave me this answer," right? - It's like, and it caveats, "I don't know anything that happened since 2021."

05:42 So ChatGPT specifically — Bard is a little bit better at that — but it's like, you know, the last thing I read off of the internet was in fall 2021.

05:51 Makes some things a little bit challenging.

05:54 But yeah, so we're building, you know, AI features into Preset, you know, as a commercial open source company. We need to build some differentiators too from Superset.

06:02 We contribute a huge amount, like maybe 50, 80% of the work we do at Preset is contributed back to Superset, but we're looking to build differentiators.

06:10 And we feel like AI is a great kind of commercial differentiator too on top of Superset that makes people even more interested to come and run Preset too.

06:22 - Yeah, excellent.

06:23 And people — you say they're popular projects — like Airflow has 30,000 stars.

06:30 Apache Superset has 50,000 stars, which puts it on par with Django and Flask, for people's mental models out there, which is — I would say it's pretty well known.

06:39 So awesome.

06:41 - Yeah, stars is kind of a vanity metric in some ways, right?

06:44 So it's not necessarily usefulness or value delivered, but it's a proxy for popularity and hype, you know?

06:50 So it gives a good sense.

06:52 And I think like at 50,000 stars, if you look at, it's probably in the top 100 of GitHub projects.

06:59 If you remove the — in the top 100, there's a lot of documentation and guides and things that are not really open source projects.

07:08 So it's probably like top 100 open source project-ish, in both cases, which is--

07:13 - Right, it's so cool.

07:14 Like, it's like you start a project and you don't know whether it's gonna take off and how, and it's like, wow, it's just nice to see that.

07:22 - Yeah, absolutely.

07:23 I mean, on one hand, it's nice, but it doesn't necessarily make it better.

07:27 But it does mean there's a lot of people using it.

07:28 There's a lot of polish.

07:29 There's a lot of PRs and stuff that have been submitted to make it work.

07:35 A lot of things that you can plug into, right?

07:36 So there's certainly a value for having a project popular versus unpopular.

07:40 - Oh my God, yes.

07:41 And I would say one thing is all the — call it secondary assets — outside of the core project's documentation.

07:49 And there will be a lot of like use cases and testimonials and reviews and people bending the framework in various ways and forks and plugins.

07:58 Another thing too, that people I think don't really understand the value of in software and open source is, or I'm sure people understand the value but it's not talked about, it's just a whole battle tested thing.

08:10 Like when something is run at thousands of organizations in production for a long time, there's a lot of things that happen in the software that are very, very valuable for the organization adopting it.

08:23 - Well, let's talk large language models for a second.

08:28 So AI means different things to different people, right?

08:31 They kind of get carved off as they find some kind of productive productized use, right?

08:37 AI is this general term and like, oh, machine learning is now a thing that we've done, or computer vision is a thing we've done.

08:43 And the large language models are starting to find their own special space.

08:48 So maybe we could talk a bit about a couple of examples, just so people get a sense.

08:54 To me, ChatGPT seems like the most well-known.

08:57 What do you think?

08:58 - Yeah, I mean, well, I'll say if you think about what is a large language model, what are some of the leaps there?

09:05 And I'm not an expert, so I'm gonna try to not put my foot in my mouth, but some things that I think are interesting.

09:12 A large language model is a big neural network that is trained on a big corpus of text.

09:18 I think one of the big leaps we've seen is unsupervised learning.

09:22 So like really often like in machine learning in the past or pre-LLMs, we would have very specific like training set and outcomes we were looking for.

09:33 And then the training data would have to be really structured.

09:37 Here, what we're doing with large language models is feeding in a huge corpus of text, and what the large language model is trying to do or resolve is to chain words, right?

09:46 So it's trying to predict the next word, which seems like you would be able to put words together that kind of make sense, but you wouldn't think that consciousness — not just consciousness, but intelligence — would come out of that, and yet somehow it does, right?

10:01 Like if you chain — if you say, you know, Humpty Dumpty sat on a — it's really clear the next word is gonna be wall, but if you push this idea much further with a very large corpus of human knowledge, somehow there's some really amazing stuff that does happen with these large language models.

10:21 And I think that realization happened around ChatGPT, at GPT-3 and 3.5 getting pretty good, and then at 4, we're like, oh my god, this stuff can really — it seems like it can think or be smart or be very helpful.

10:36 - Yeah, the thing that I think impresses me the most about these is — people can tell me it's statistics and I'll believe them, but it seems like they have an understanding of the context of what they're talking about, more than just predicting like, Humpty Dumpty sat on the what?

10:52 It sat on the wall, right?

10:53 Obviously that's what was likely to come next when you see that set of words.

10:58 But there's an example that I like to play with, which I copied out.

11:02 I'll give it a little thing.

11:03 I'll say, "Hey, here's a program, Python program.

11:07 "I'm gonna ask you questions about it.

11:08 "Let's call it Arrow." and it's like this is a highly nested program that function that tests whether something's a platypus.

11:15 I saw this example somewhere and I thought, okay, this is pretty cool, but it has this.

11:19 If it's a mammal, then if it has fur, then if it has a beak, then if it has a tail, and you can just do stuff that's really interesting.

11:27 Like I'll ask it to--

11:29 - Is it a bird or something like that, or--

11:32 - Yeah, yeah, yeah, rewrite it, write it using guard clauses to be not nested, right?

11:39 - Oh yeah.

11:40 - Right?

11:41 And it'll say, sure, here we go.

11:43 And instead of being, if this, then nest, if this, then if this, it'll do, if not, if not return false, right?

11:48 Which is really cool.

11:49 And that's kind of a pretty interesting one.

11:51 But like, this is the one, this is the example that I think is crazy.

11:56 It's rewrite arrow to test for crocodiles.

12:00 - Yeah, using the, it's like what people would call a one shot, a few shot example of like, hey, here's an example of the kind of stuff I might want.

12:09 There's some different ways to do that, but it's a pattern in prompt engineering where you'll say you have a zero shot, one shot, few shot examples we can get into.

12:18 But it does feel like it understands the code, right?

12:21 Like what you're trying to do.

12:22 - Right, just for people listening, it said, okay, here's a function isCrocodile.

12:26 If not self.isReptile, if not self.hasScales, these are false examples, right?

12:31 And if it has four legs and a long snout, it can swim, right?

12:34 Like it rewrote the little tests and stuff, right?

12:37 In a way that seems really unlikely that it's just predicting likelihood 'cause it's never seen anything like this really, which is really, it's pretty mind blowing I think.
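
To picture the code being described, here's a rough sketch of the two styles — the class and attribute names are illustrative stand-ins, not the exact code from the chat:

```python
# Hypothetical reconstruction of the example discussed: a deeply nested
# "arrow" style check versus the guard-clause rewrite the AI produced.

class Animal:
    def __init__(self, is_reptile, has_scales, has_four_legs, has_long_snout, can_swim):
        self.is_reptile = is_reptile
        self.has_scales = has_scales
        self.has_four_legs = has_four_legs
        self.has_long_snout = has_long_snout
        self.can_swim = can_swim

    def is_crocodile_nested(self):
        # the "arrow" style: every condition adds another level of nesting
        if self.is_reptile:
            if self.has_scales:
                if self.has_four_legs:
                    if self.has_long_snout:
                        if self.can_swim:
                            return True
        return False

    def is_crocodile(self):
        # the guard-clause style: bail out early on any failed condition
        if not self.is_reptile:
            return False
        if not self.has_scales:
            return False
        if not self.has_four_legs:
            return False
        if not self.has_long_snout:
            return False
        return self.can_swim
```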

12:47 - Or it had, like it read the entire internet and all of GitHub and that kind of stuff.

12:51 So say it has seen some other things.

12:53 I think that's mind boggling.

12:54 It's just like, when you think about what it did there — it read the entire conversation so far, your input prompt, and it has a system prompt ahead of time that says, you know, you're ChatGPT, try to be helpful to people, and here's a bunch of things you should or should not say, be non-biased, try to be concise and good and virtuous.

13:17 And people have found all sorts of jailbreaks out of that.

13:19 But like, all it does from that point on is like try to predict the next word, which is kind of insane that it gets to, you know, the amount of structure that we see.

13:29 - Right, right.

13:30 That's a lot of structure there, right?

13:32 So pretty impressive.

13:33 And ChatGPT is starting to grow.

13:35 You know, if you've got version four and you can start using some of the plugins, it's gonna keep going crazy there.

13:39 Other examples — AssemblyAI just released LeMUR, which is a large language model but really focused on transcribing speech, which I think is kind of cool.

13:49 Microsoft released Microsoft Security Copilot, which is a large language model to talk about things like Nginx misconfigurations and stuff like that.

14:00 There's just a lot of stuff out there that's coming along here, right?

14:05 Thousands of models coming, that type of thing.

14:07 - On the open source front too, there's this whole ethical thing, like should everyone and anyone have access to open source models doing that?

14:16 Well, we don't really understand.

14:18 We probably shouldn't get into the ethics part of the debate here, 'cause that's a whole series of episodes we probably won't wanna get into.

14:26 But what's interesting is Databricks came up with a model, Facebook came up with one called LLaMA, and they open sourced and/or leaked the weights, so you have the model topology with the pre-trained weights.

14:39 In some cases, there's open source corpus of training that are also coming out and are also open sourced.

14:46 I mean, it's like, and these open source models are somewhat competitive or increasingly competitive with GPT-4, yeah, which is kind of crazy.

14:57 And some of them — where GPT-4 has limitations, they break through those limitations.

15:03 So one thing that's really important as a current limitation of the GPT models and LLMs is the prompt window, the token prompt window.

15:14 So basically when you ask a question, you know, it's been trained and has machine-learned with data up to — I think in the case of GPT-3.5 or 4, the corpus of training goes all the way to fall 2021.

15:29 So if you ask like, who is the current president of the United States, it just doesn't know, or it will tell you as of 2021, it is this person.

15:36 But so if you're trying to do tasks, like what I've been working on, we'll probably get into later in the conversation is trying to generate SQL.

15:45 It doesn't know your table.

15:46 So you have to say like, hey, here's all the tables in my database.

15:49 Now, can you generate SQL that does X on top of it?

15:52 And that context window is limited and increasing, but some of these open source models have different types of limitations.

16:00 - This portion of Talk Python to Me is brought to you by JetBrains, who encourage you to get work done with PyCharm.

16:09 PyCharm Professional is the complete IDE that supports all major Python workflows, including full stack development.

16:16 That's front-end JavaScript, Python backend, and data support, as well as data science workflows with Jupyter.

16:23 PyCharm just works out of the box.

16:25 Some editors provide their functionality through piecemeal add-ins that you put together from a variety of sources.

16:32 PyCharm is ready to go from minute one.

16:35 And PyCharm thrives on complexity.

16:37 The biggest selling point for me personally is that PyCharm understands the code structure of my entire project, even across languages such as Python and SQL and HTML.

16:48 If you see your editor completing statements just because the word appears elsewhere in the file, but it's not actually relevant to that code block, that should make you really nervous.

16:57 I've been a happy paying customer of PyCharm for years. Hardly a workday passes that I'm not deep inside PyCharm working on projects here at Talk Python. What tool is more important to your productivity than your code editor? You deserve one that works the best. So download your free trial of PyCharm professional today at talkpython.fm/done-with-pycharm and get work done. That link is in your podcast player show notes. Thank you to PyCharm from JetBrains for sponsoring the show and keeping Talk Python going strong.

17:31 Right. It's interesting to ask questions, right? But it's more interesting from a software developer perspective of can I teach it a little bit more about what my app needs to know or what my app structure is, right?

17:46 In your case, I want to use SuperSet to ask the database questions.

17:52 But if I'm going to bring in AI, it needs to understand the database structure so that when I say, "Help me do a query to do this thing," it needs to know what the heck to do, right?

18:02 >> The tables. So there's the stuff it knows and the stuff it can't know.

18:07 Some of it really comes down to whether this information is public on the Internet, whether it happened to be trained against it.

18:15 And then if it's in private, there's just no hope that it would know about, you know, your own internal documents or your database structure.

18:22 So in our case, it speaks SQL very, very well.

18:26 So we'll get into this example — how to get GPT to generate good SQL in the context of a tool like Superset or SQL Lab, which is our SQL IDE.

18:36 So it knows how to speak SQL super well.

18:38 It knows the different dialects of SQL very, very well.

18:41 It knows its functions, its date functions — which a lot of SQL engineers, myself included, can never remember, like what the Postgres date diff function is — but the GPT models just know SQL, know the dialects, and know the mechanics of SQL.

18:57 It understands data modeling, foreign keys, joins, primary, all this stuff it understands.

19:02 It knows nothing about your specific database — the schema names and the table names and the column names that it might be able to use.

19:11 So that's where we need to start providing some context and this context window is limited.

19:15 So it's like, how do you use that context well or as well as possible?

19:22 And that's the field, and some of the ideas behind it are prompt crafting and prompt engineering, which we can get into once we get there — maybe we're there already.
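
As a rough illustration of that context-passing idea — not Preset's actual prompts, just a minimal sketch with made-up table definitions:

```python
# Minimal sketch of assembling a text-to-SQL prompt with schema context.
# The table DDL and the prompt wording are illustrative, not the real thing.

SCHEMA_CONTEXT = """
CREATE TABLE employees (id INT, name TEXT, department_id INT, salary NUMERIC);
CREATE TABLE departments (id INT, name TEXT);
"""

def build_text_to_sql_prompt(user_question: str) -> str:
    return (
        "You are an assistant that writes ANSI SQL.\n"
        "Here are the tables available in the database:\n"
        f"{SCHEMA_CONTEXT}\n"
        "Only use tables and columns listed above.\n"
        "Return a single SQL statement, no explanation.\n\n"
        f"Question: {user_question}\nSQL:"
    )

print(build_text_to_sql_prompt("top 5 salaries per department"))
```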

19:31 - Yeah, yeah, well, yeah, I think where I see this stuff going is from this general purpose knowledge starting to bring in more private or personal or internal type of information, right?

19:44 Like our data about our customers is like structured like this in a table and here's what we know about them.

19:49 Now let us ask questions about our company and our thing, right?

19:53 And it's like starting to make inroads in that direction, I think.

19:56 - Yeah, and one thing to know about is that there's different approaches to teach or provide that context.

20:04 So one would be to build your own model from scratch, right?

20:09 And that's pretty prohibitive.

20:11 So you'd have to find the right corpus.

20:13 And instead of starting with a model that knows SQL and needs to know your table and context, you have to start from zero and very prohibitive.

20:21 Another one is you start from a base model at some point of some kind.

20:25 There's a topology, so there's different layers and number of neurons and it knows some things.

20:30 And then you load up some weights that are open source.

20:32 And then you say, I'm gonna tune this model to teach it my database schemas and basically my own corpus of data.

20:40 So it could be your data dictionaries, could be your internal documents, it could be your GitHub code, your dbt projects.

20:47 If you have one of your Airflow DAGs, be like, I'm gonna dump all this stuff in the model and that will get baked into the neural network itself.

20:55 That's doable, but pretty prohibitive in this era.

20:59 If you have the challenge that we have at Preset, which is we have multiple customers with different schemas.

21:04 We can't have spillover.

21:06 So you have to train a model for each one of our customers and serve a model for each one of our customers.

21:11 So still pretty prohibitive.

21:13 And a lot of people fall back on this third or fourth method that I would call prompt engineering, which is: I'm gonna use the base model, the OpenAI API, or just an API on top of an LLM.

21:25 And then, since it knows SQL already, I'll just say, hey, here's a bunch of tables that you might wanna use,

21:30 can you generate SQL on top of it?

21:32 So then that's just a big request with a lot of context.

21:36 Then we have to start thinking about maximizing the use of that context window to pass the information that's most relevant within the limits allowed by the specific model.
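
Continuing the sketch above, here's roughly what that "big request with a lot of context" can look like in code, assuming the OpenAI Python client as it existed around the time of this recording; the model name and wording are placeholders:

```python
# Sketch of the "prompt engineering" approach: send the base model your schema
# context plus the user question in one request. Assumes the pre-1.0 OpenAI
# Python client that was current when this episode was recorded.
import openai

def generate_sql(user_question: str, schema_context: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,  # keep it as deterministic as the model allows
        messages=[
            {"role": "system",
             "content": "You write SQL for the database described below.\n" + schema_context},
            {"role": "user", "content": user_question},
        ],
    )
    return response["choices"][0]["message"]["content"]
```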

21:47 - Right, and that starts to get into reproducibility, accuracy, and just those limitations, which is kind of an engineering type of thing, right?

21:56 - Yeah, and then, you know, maybe a topic too — this conversation is based on a recent blog post, so just going back to the flow of that blog post.

22:05 So we started by establishing the premise that everyone is trying to bring AI into their product today.

22:12 Thousands of product builders are currently exploring ways to harness the power of AI in the products and experiences they create.

22:19 That's the premise for us with text to SQL and SQL lab as part of superset and preset.

22:24 But I don't know, like if you think of any product, any startup, any SaaS product you use.

22:30 You work at HubSpot today, you're trying to figure out how to leverage AI to build sales chatbots or SDR chatbots, so everyone everywhere is trying to figure that out.

22:41 The challenge is, I guess, it's very probabilistic, and it's a different interface from anything we know.

22:47 Engineers would be like, oh, let's look at an API and leverage it, and APIs are very, very deterministic in general, while AI is kind of a wild beast to tame.

22:59 First, the interface is language, not code, and then what comes back is semi-probabilistic in nature.

23:08 - And it could change underneath you.

23:09 It's a little bit like web scraping in that regard.

23:12 That like, it does the same, it does the same, and then something out there changed, not your code, and then a potentially different behavior comes back, right?

23:20 'Cause they may have trained another couple of years, refine the model, switch the model, change the default temperature, all these things.

23:27 - Yeah, there's a lot that can happen there.

23:29 One thing I noticed, starting to work with what I would call prompt crafting — which is, you know, you work with ChatGPT and you craft different prompts, putting emphasis in one place or another, or changing the order of things, or just changing a word, right?

23:44 Just say like "important!" with an exclamation point — capitalize the words, you know, the reserved words in SQL — and then just the fact that you put "important!" will make it do it or not do it, changing from one model to another.

23:59 So one thing that's great is the model, at least at OpenAI, they are immutable as far as I know.

24:06 But like if you use GPT-3.5 Turbo, for instance, that's just one trained model.

24:12 I believe that that is immutable.

24:16 The chatbot on top of it might get fine tuned and change over time, but the model is supposed to be static.

24:23 You mentioned temperature, it'd be kind of interesting to just mention for those who are not familiar with that.

24:27 So when you interact with AI, one of the core parameters is temperature, and I think it's a value from zero to one — I'm not sure exactly how you pass it — but it basically defines how creative you want to let the AI be.

24:46 Like if you put it to zero, you're gonna have something more deterministic.

24:50 So asking the same question should lead to a similar or the same answer, though not in my experience.

24:56 It feels like it should, but it doesn't.

24:58 But then if you put a higher, it will get more creative.

25:01 Talk more about like how that actually seemed to work behind the scenes.

25:06 - Yeah, well, that variability seems to show up more in the image-based ones.

25:11 So for example, this article, this blog post that you wrote, you have this image here and you said, oh, and I made this image from mid-journey.

25:18 I've also got some examples of a couple that I did.

25:22 Where did I stick them?

25:23 Somewhere, here we go.

25:24 Where I asked, just for YouTube thumbnails, I asked Midjourney for a radio astronomy example that I can use, 'cause here's one that's not encumbered by some sort of licensing, but still looks kinda cool and is representative, right?

25:37 And there, it's like massive difference.

25:41 I'm not sure how much difference I've seen.

25:43 I know it will make some, but I haven't seen as dramatic of a difference on ChatGPT.

25:48 - Oh, ChatGPT, yeah.

25:50 Yeah, I'm not sure exactly how they introduced the variability on the generative images AI.

25:57 I know it's like this multi-dimensional space with a lot of words and a lot of images in there.

26:03 And then it's probably like where the location point of that, they randomized that point in that multi-dimensional space.

26:12 For ChatGPT, it's pretty easy to reason about — and I might be wrong on this, again, I'm not an expert — but you know the way it works is it takes the prompt and then it comes up with the next word sequentially.

26:24 So for each next word — so, Humpty Dumpty sat on a — it might be wall at 99%, but there might be 1% fence or something like that.

26:36 And if you up the temperature, it's more likely to pick a non-first word — they probably do it in a weighted way, like it's possible that it takes the second or the third word randomly — and then of course it becomes a tree, a decision tree: once it picks a word, the next word also changes.

26:57 So as you up that, it goes down paths that send it into more creative or different territory.
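
A toy sketch of the weighted-choice idea being described — the words and logit values are made up, but it shows how dividing by the temperature sharpens or flattens the distribution over the next word:

```python
# Toy illustration: logits are divided by the temperature before the softmax,
# so a low temperature almost always picks "wall", while a higher temperature
# lets "fence" through more often. Numbers are invented for the example.
import math
import random

def sample_next_word(word_logits: dict, temperature: float) -> str:
    scaled = {w: l / temperature for w, l in word_logits.items()}
    max_l = max(scaled.values())
    exp = {w: math.exp(l - max_l) for w, l in scaled.items()}  # stable softmax
    total = sum(exp.values())
    probs = {w: v / total for w, v in exp.items()}
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

logits = {"wall": 6.0, "fence": 2.0, "cloud": 0.5}
print(sample_next_word(logits, temperature=0.1))  # almost always "wall"
print(sample_next_word(logits, temperature=1.5))  # "fence" shows up more often
```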

27:04 - Right, right.

27:05 Yeah, a little butterfly effect, it makes a different choice here, and then it sends it, you know, sends it down through the graph.

27:11 Interesting.

27:12 So one thing that you really pointed out here, and I think it's maybe worth touching on a bit, is this idea of prompt engineering.

27:19 There's even places like learnprompting.org that try to teach you how to talk to these things.

27:25 And you make a strong distinction between prompt crafting or just talking to the AI versus really trying to put an engineering focus on it.

27:33 Do you wanna talk about the differentiation?

27:35 - Yeah, I think it's a super important differentiation, but one that I'm proposing, right?

27:40 So I don't think that people have settled as to what is one or what is the other.

27:45 I think I saw a Reddit post recently that was like, prompt engineering is just a load of crap.

27:50 Like, you know, anyone can go, 'cause they thought their understanding of prompt engineering was like, oh, you know, you fine tune or you craft your prompt and you say like, you are an expert AI working on, you know, creating molecules.

28:06 Now can you do this?

28:07 And then, you know, by doing that, you might get a better outcome.

28:10 Or one really interesting thing that people have been doing in prompt crafting that seems to have a huge impact — there have been papers written on this one specific hint or craft tweak — is "let's proceed step by step."

28:25 So basically, whatever the question is that you are asking — specifically around more mathematical questions or things that require more systematic step-by-step thinking — the whole just "let's think", "let's expose this", or "let's go about it step by step" makes it much better.

28:43 So here you might be able to, well, so, you know, if you had an example where ChatGPT-3 failed or ChatGPT-4 failed, you could just say, colon, let's go step by step, and it might succeed that time around.

28:59 - Maybe you can get it to help you understand instead of just get the answer, right?

29:04 Like, factor this polynomial into its primary, you know, solutions or roots or whatever, and you're like, okay, show me, don't just show me the answer, show me step by step so I could understand and try to learn from what you've done, right?

29:17 - Yeah, I mean, if you think about how the way that it's trying to come up with a new word, if all it does is a language-based answer to a mathematical question, like how many days are there between this date and that date?

29:30 That specific example might not exist, or it's kind of complicated for it to go about it, but if you say, let's think step by step — okay, there are this many months, this month's duration is this long, there are this many days since the beginning of that month — it might get it right that time around.
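
A tiny illustration of that tweak — the same question with and without the hint appended, plus a ground-truth date calculation you could score an answer against:

```python
# Illustrative only: the "let's think step by step" suffix nudges the model to
# lay out the intermediate reasoning before the final number. The datetime
# arithmetic is just a ground-truth check for scoring.
from datetime import date

question = "How many days are there between March 3, 2023 and May 22, 2023?"
direct_prompt = question
step_by_step_prompt = question + "\nLet's think step by step."

expected = (date(2023, 5, 22) - date(2023, 3, 3)).days  # 80
print(expected)
```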

29:48 - Right, or if it fails, you could pick up part way along where it has some more--

29:52 - Yeah, you know, and then you can trace, I mean, just you too, I think one thing is you should be extremely careful as taking for granted that it's right all the time, you know?

30:01 So that means it also helps you review its process and where it might be wrong.

30:06 But back to crafting versus engineering.

30:08 So crafting would be the process that I think is more attached to how you use ChatGPT every day, the same way that we've been trained at Googling over the past two decades.

30:20 You use quotes, you use plus and minus, and you know which keywords to use intuitively, where it's gonna get confused or not.

30:30 So I think prompt crafting is a different version of that that's just more wordy.

30:35 And if you're working with the AI to try to assist you, write your blog post, or to try to assist you in any task really, just to be smart about how you bring the context, how you tell it to proceed, goes a very, very long way.

30:49 So that's what I call prompt crafting, call it like one-off cases.

30:54 - Kind of what people do when they're interacting with the large language model.

30:58 - I think so, right?

30:59 Like it's not evident for a lot of people who are exploring the edge of where it fails, and they love to see it fail.

31:05 And then they don't think about like, oh, what could I have told it to get the answer I was actually looking for?

31:12 Like, ah, I got you wrong.

31:13 You know, it's as if I had that actor in a conversation of like, ah, you're wrong and I told you so.

31:18 You know, so I think there's a lot of that online.

31:21 But I think for all these examples that I've seen, I'm really tempted to take the prompt that they had and then give it an instruction or two or more and then figure out how to get it to come up with the right thing.

31:31 Prompt crafting is a super important skill.

31:33 You know, for most knowledge and information workers, you could probably get a boost of 50% to 10x for a lot of the tasks you do every day if you use AI well. So it's a great personal skill to have — go and develop that skill if you don't.

31:47 This portion of Talk Python to Me is sponsored by the Compiler Podcast from Red Hat. Just like you, I'm a big fan of podcasts and I'm happy to share a new one from a highly respected open source company. Compiler, an original podcast from Red Hat.

32:02 Do you want to stay on top of tech without dedicating tons of time to it?

32:05 Compiler presents perspectives, topics, and insights from the tech industry, free from jargon and judgment.

32:11 They want to discover where technology is headed beyond the headlines and create a place for new IT professionals to learn, grow, and thrive.

32:18 Compiler helps people break through the barriers and challenges turning code into community at all levels of the enterprise.

32:25 One recent and interesting episode is their "The Great Stack Debate." I love love love talking to people about how they architect their code, the trade-offs and conventions they chose, and the costs, challenges, and smiles that result.

32:37 This Great Stack Debate episode is like that.

32:40 Check it out and see if software is more like an onion, or more like lasagna, or maybe even more complicated than that.

32:46 It's the first episode in Compiler's series on software stacks.

32:50 Learn more about Compiler at talkpython.fm/compiler.

32:54 The link is in your podcast player show notes.

32:56 And yes, you could just go search for a compiler and subscribe to it.

33:00 But follow that link and click on your players icon to add it that way they know you came from us.

33:06 Our thanks to the compiler podcast for keeping this podcast going strong.

33:11 For prompt engineering, in my case, it's like you're building something, you're using an AI as an API behind the scenes, you want to pass it a bunch of relevant context, really specify what you want to get out of it.

33:26 Maybe you even want to get a structured output, right?

33:28 You might want to get a JSON blob out of it.

33:31 You say, "Return a JSON blob with the following format," so it's more structured.

33:36 So then to give all these instructions, there's this idea of providing few shots too.

33:41 You might be storing context in a vector database.

33:43 I don't know if we're getting ahead to that today, but there are ways to kind of structure and organize your potential embeddings or the things you want to pass as context.

33:52 So there's a lot here.
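
To make the structured-output idea concrete, here's a minimal sketch of asking for a fixed JSON shape and validating what comes back; the field names are illustrative:

```python
# Sketch of requesting structured output: spell out the exact JSON shape in
# the prompt, then parse and sanity-check the response. Field names invented.
import json

PROMPT_TEMPLATE = (
    "Answer the question below.\n"
    'Return ONLY a JSON object shaped like {"answer": "<yes or no>", "confidence": <0 to 1>}.\n\n'
    "Question: {question}\n"
)

def build_prompt(question: str) -> str:
    # str.replace avoids brace-escaping issues with literal JSON in the template
    return PROMPT_TEMPLATE.replace("{question}", question)

def parse_model_output(raw_text: str) -> dict:
    data = json.loads(raw_text)
    assert set(data) == {"answer", "confidence"}, "unexpected keys in model output"
    return data

print(build_prompt("Is Superset an Apache project?"))
print(parse_model_output('{"answer": "yes", "confidence": 0.82}'))
```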

33:53 I think somewhere, too, I talk about what prompt engineering is — if we scroll in the blog post to the "what is prompt engineering" part, it lists the kind of things. It might be higher in the post. - I was scrolling for people. - Where I introduce what prompt engineering is. Yeah — above this section, if you scroll up toward the beginning, like, "what is prompt engineering?"

34:21 >> Yeah, here.

34:22 >> Yeah, right here.

34:23 The definition of this is ChatGPT's version of it.

34:26 When you do prompt engineering, you can add context, which that means that you're going to have to retrieve context maybe from a database, from a user session, from your Redux store if you're in the front end.

34:37 You're going to go and fetch the context that's relevant in the context of the application, at least while building products.

34:43 Specify an answer format.

34:44 You could just say, yes, I just want a yes or no, a Boolean, I want a JSON blob with not only the answer, but your confidence on that answer or something like that.

34:53 Limiting scope, asking for pros and cons, incorporating verification or sourcing.

34:59 So that's more, you know, if you iterate on a prompt, you're gonna be rigorous about, is this prompt better than the previous prompt I had?

35:06 Like, if I pass five rows of sample data while doing text to SQL, does it do better or more poorly than if I pass 10 rows of sample data, or provide a certain amount of column-level statistics?

35:21 So, prompt engineering is not just prompt crafting.

35:23 It is like bringing the scientific method to it — bringing some engineering of fetching the right context, organizing it well, and then measuring the outcome.

35:33 - Right, exactly.

35:34 Something that comes out, you can measure and say, this is 10% better by my metric than it was before with this additional data, right?

35:41 That's a big difference.

35:43 - Right, and then there's so many things moving, right?

35:45 Like, everything is changing so fast in the space. So you're like, oh, well, ChatGPT 5 is out, or GPT-4 Turbo just came out at half the price.

35:55 Now I'm just gonna move to that.

35:56 They're like, wait, is that performing better?

35:59 Or what are the trade-off?

36:01 Or even I'm gonna add, I'm gonna move this section, asking for a certain JSON format above this other section.

36:08 I'm gonna write important exclamation point, do X.

36:12 Does that improve my results?

36:14 Or did I mess it up — and which of my test cases that succeeded before fail now, and which ones that failed before succeed now?

36:22 So it can be like, is that a better or worse iteration towards my goal?

36:28 - Right, right.

36:29 Kind of bringing this unit testing TDD mindset.

36:34 - Yes, yeah.

36:35 So that's what we're getting deeper into the blog post.

36:37 Right, so the blog post is talking about bringing this TDD, the test-driven development mindset to prompt engineering.

36:46 Right, and there's a lot of things that are in common.

36:49 You can take and apply and kind of transfer just over.

36:53 There are some things to that breakdown that are fundamentally different between testing a prompt or working with AI and working with just a bit more deterministic code testing type framework.

37:05 - Yeah, yeah, for sure.

37:08 So you called out a couple of reasons of why TDD is important for prompt engineering.

37:13 Maybe we could run through those.

37:15 - Yeah, so the first thing is the AI model is not a deterministic thing, unlike when you use a modern API, a GraphQL or REST API.

37:27 The format of what you ask is extremely clear, and then the format of what you get back is usually defined by a schema.

37:34 It's very deterministic.

37:36 It's pretty much guaranteed that if you do the same request, you get the same output-ish, or at least the same format.

37:41 With AI, that's not the case, right?

37:43 So it's much more unpredictable and probabilistic by nature.

37:48 Second one is handling complexity.

37:50 So AI systems are complex, black boxy, kind of unpredictable too.

37:55 So embrace that and assume that you might get something really creative coming out of there for better or for worse.

38:02 And then reducing risk, like you're shipping product.

38:06 If you're shipping product, writing product, you don't want necessarily any sort of like bias or weird thing like the AI could go crazy.

38:17 - Yeah, there are examples of AIs going crazy before like Tay, do you remember Microsoft Tay?

38:22 - I don't know that one, but I know of other examples.

38:25 - Yeah, I mean, it came out and it was like this sort of just, I'm here to learn from you internet and people just turned it into a racist and made it do all sorts of horrible things.

38:34 And they had to shut it down a couple of days later because it just, it's like, whoa, it met the internet and the internet is mean.

38:41 So that's not great.

38:42 - Yeah, train it on 4chan, or let it, you know, go crawl 4chan and Reddit.

38:48 It's not always gonna be nice.

38:49 - So bad, right?

38:51 I mean, you don't entirely control what's gonna come out of those things.

38:54 And so, you're a little more predictable, right?

38:57 - And it's not even like you don't entirely control.

39:00 Like I think, yeah, like basically, you know, control might be a complete illusion.

39:04 Like even the people working at OpenAI don't fully understand what's happening in there.

39:09 (laughs)

39:11 Well, it read a bunch of stuff and it's predicting the next word and it gets most things right.

39:18 By the way, they do a lot around this idea of like not necessarily TDD, but there's a whole eval framework so you can submit your evaluation functions to OpenAI.

39:26 And as they train the next version of things, they include that in what their evaluation system for the new thing.

39:34 Say, if I wanted to go and contribute back a bunch of like text to SQL type use cases as they call evals, then they would take that into consideration when they train their next models.

39:45 All right, so going down the list, reducing risk, right?

39:48 So you're integrating a beast that's not fully tamed into your product.

39:52 You probably wanna make sure it's tamed enough to live inside your product.

39:57 Continuous improvements, that should have been maybe the first one in the list is you're iterating on your prompts, you're trying to figure out a past context, you're trying different model versions, maybe you're trying some open source models or the latest GPT cheaper, greater thing.

40:15 So you wanna make sure that as you iterate, you're getting to the actual outcomes that you want systematically.

40:21 And performance measurement too, of like how long does it take?

40:24 How much does it cost?

40:26 You kind of need to have a handle on that.

40:30 The new model might be 3% better on your corpus of tests, but it might be six times the price.

40:36 Like, do you want, are you okay with that?

40:38 - Right, right, or just from a user perspective, yeah.

40:41 - Time to interaction, you know — that's one thing with AI we're realizing now: a lot of the prompts on GPT-4 will take, you know, one to seven seconds. Which, in the Google era — there have been some really great papers out of Google early on that prove that, you know, even 100 milliseconds has an impact on user behavior and how long they stay.

41:03 - Right.

41:04 Yeah, people give up on checkout flows or whatever going to the next part of your site on a, measurably on 100 millisecond blocks, right?

41:12 When you're talking, well, here's 7,000, here's 70 of those.

41:15 That's gonna have an effect, potentially.

41:17 - Oh, it has — it has been proven, and it's very intricate with usage patterns, session duration, session outcomes, right — and a second is a mountain.

41:27 If today we were to A/B test Google between whatever milliseconds it's at now and just one second or half a second, the results coming out of that A/B test would show very, very different behaviors.

41:39 - Wow.

41:40 - I think there's some, don't quote me on it, there's some really great papers, you know, written on TTI and just time to interaction and the way it influences user behavior.

41:48 So we're still, you know, in the AI world, it has to, if you're gonna wait two to seven seconds for your prompt to come back, it's got to add some real important value to what's happening.

41:58 - Yeah, it does.

41:59 I think it's interesting that it's presented as a chat.

42:01 I think that gives people a little bit of a pause.

42:04 Like, oh, it's talking to me.

42:05 So let's let it think for a second, rather than it's a website that's supposed to give me an answer.

42:09 - Yeah, compared to then, I guess, your basis for comparison is a human, not a website or comparing against Google.

42:17 So that's great.

42:18 - Yeah, I ask it a really hard question.

42:20 Give it some time, right?

42:21 Like that's not normally how we think about these things.

42:22 Okay, so you have kind of a workflow for this engineering of building a product and testing an AI inside of your product.

42:30 You wanna walk us through your workflow here?

42:32 - Yeah, you know, I think I looked at TDD originally — what is the normal TDD-type workflow?

42:40 And I just adapted this little diagram to prompt engineering, right?

42:46 'Cause the whole idea of the blog post is to bring the TDD mindset to prompt engineering.

42:51 So this is where I went, but yeah, the workflow is like, okay, define the use case and desired AI behavior.

42:58 What are you trying to solve with AI?

43:00 In my case, the example that I'll use and try to reuse throughout this conversation is, you know, text to SQL.

43:11 So we're trying to — given a user prompt and a database schema — get the AI to generate good, useful SQL, find the right tables and columns to use, that kind of stuff.

43:20 I create test cases.

43:22 So it's like, okay, if I have this database and I have this prompt, give me my top five salary per department on this HR dataset, there's a fairly deterministic output to that.

43:34 You could say the SQL is not necessarily deterministic.

43:36 There's different ways to write that SQL.

43:38 There's a deterministic data frame or results set that might come up.

43:42 - There is a right answer of the top five salaries.

43:45 - That's right.

43:45 - You're not getting, ultimately get that.

43:47 And it's great if it is deterministic 'cause you can test it.

43:52 If you're trying to use AI to say write an essay about Napoleon Bonaparte's second conquest, in less than 500 words, it's not as deterministic and it's hard to test whether the AI is doing good or not.

44:08 So you might need human evaluators.

44:10 But I would say for most AI products — where people are trying to bring AI into their product — in many cases it's more deterministic.

44:18 So another example of something more deterministic would be, say, getting AI to write Python functions — like, write a function that returns whether a number is prime, yes or no. You can get the function back and test it in a deterministic kind of way.

44:39 So anyways, just pointing out: you're only gonna be able to have a TDD mindset if you have a somewhat deterministic outcome to what you use the AI for.

44:49 Then create a prompt generator.

44:51 So that would be your first version. In the text to SQL example, it's: given the 20 tables in this database, and these column and table names and data types and sample data, generate SQL that answers the following user prompt.

45:05 And then the user prompt would say something like top five salaries per department.

45:12 And then — for people that are not on the visual stream, not YouTube, but just on audio — we're getting into the loop here, where it's: run the tests, evaluate the results, refine the tests, refine the prompts, and then start over.

45:24 Right, and probably compile the results, keep track of the results so that you can compare — not just are you 3% better on your test cases, but also which tests that used to fail succeed now, and which tests that used to succeed fail now.

45:41 And then once you're happy with the level of success you have, you can integrate the prompt into the product or maybe upgrade.

45:49 - Ship it.

45:50 - Yeah, ship it.

45:52 - Ship it.
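
Sketching that loop in code — this is a generic illustration of the workflow just described, not Promptimize itself; `call_model` and the test case are placeholders:

```python
# Minimal sketch of the loop: run each test case through the current prompt
# builder, score the output, and diff against the previous run to see which
# cases improved or regressed. `call_model` is a placeholder you'd implement.

def call_model(prompt: str) -> str:
    raise NotImplementedError("call your LLM API of choice here")

# (user input, scoring function over the generated SQL -> 0.0 to 1.0)
test_cases = [
    ("top 5 salaries per department",
     lambda sql: 1.0 if "group by" in sql.lower() else 0.0),
]

def run_suite(build_prompt) -> dict:
    return {text: score(call_model(build_prompt(text))) for text, score in test_cases}

def compare_runs(previous: dict, current: dict) -> None:
    for key, score in current.items():
        old = previous.get(key)
        if old is None:
            continue
        if score > old:
            print(f"improved:  {key} ({old:.2f} -> {score:.2f})")
        elif score < old:
            print(f"regressed: {key} ({old:.2f} -> {score:.2f})")
```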

45:53 So I think it's probably a good time to jump over to your framework for this because pytest and other testing frameworks in Python are great, but they're pretty low level compared to these types of questions you're trying to answer, right?

46:06 Like, how has this improved over time — before, I was doing 83% right, right?

46:13 pytest asserts a true or a false.

46:14 It doesn't assert that 83% is--

46:17 - Yeah, and it's a part of CI.

46:19 Like, if any of your pytest tests fail, you're probably gonna fail CI and not even merge the PR, right?

46:28 So one thing that's different between test-driven development and unit testing and prompt engineering is that the outcome is probabilistic.

46:37 It's not true or false.

46:38 It might not just be zero or one, right — there's a spectrum of pass or fail for a specific test.

46:45 You're like, oh, if it gets this column, but not this other column, you succeed at 50%.

46:51 So it's non-binary.

46:53 It's also, you don't need perfection to ship.

46:55 You just need better than the previous version or good enough to start with.
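
Here's a small sketch of what that non-binary scoring can look like; the column names and SQL are illustrative:

```python
# Sketch of a non-binary score: instead of pass/fail, grade the generated SQL
# by the fraction of expected column names it actually uses.

def column_coverage_score(generated_sql: str, expected_columns: list) -> float:
    sql = generated_sql.lower()
    hits = sum(1 for col in expected_columns if col.lower() in sql)
    return hits / len(expected_columns)

sql = "SELECT department, MAX(salary) FROM employees GROUP BY department"
print(column_coverage_score(sql, ["department", "salary"]))     # 1.0
print(column_coverage_score(sql, ["department", "hire_date"]))  # 0.5
```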

46:59 So the mindset is, so there's a bunch of differences.

47:03 And for those interested, we won't get into all of it here, but it's in the blog post.

47:06 I think I list out the things that are different between the two — I think it's a little bit above this.

47:12 But, you know, the first thing I want to say is that the level of ambition of this project, versus say an Airflow or a Superset, is very low, right?

47:19 So it's maybe more similar to a unit test library — and no discredit to the great, awesome unit test libraries out there, but those are fairly simple and straightforward, in the sense that the information architecture of pytest is probably simpler than the information architecture of Django, for instance, right?

47:40 It's just like a different thing.

47:41 And here, the level of ambition is much lower, and, you know, this is fairly simple.

47:47 So Promptimize is something that I created, which is a toolkit to help people write, evaluate, and score prompts, and understand the results while they iterate on prompt engineering.

48:04 But so in this case, I think I talk about the use case at Preset, which is: we have a big corpus that luckily was contributed by — I forgot which university — a bunch of PhD people who did a text-to-SQL contest.

48:17 - I think it was Yale.

48:18 - Yale, yeah.

48:19 - I think it was Yale, yeah.

48:21 - So great people at Yale were like, hey, we're gonna generate 3,000 prompts on 200 databases with the SQL that should be the outcome of that.

48:31 It's a big test set so that different researchers working on text to SQL can compare their results.

48:37 So for us, we're able to take that test set and some of our own test sets and run it at scale against OpenAI or against LLAMA or against different models.

48:49 And by doing that, we're able to evaluate this particular combo of this prompt engineering methodology with this model generates 73% accuracy.

49:01 And we have these reports we can compare fairly easily which prompts that, as I said before, were failing before are succeeding now and vice versa.

49:09 So you're like, am I actually making progress here or going backwards?

49:14 And if you try to do that on your own — like if you're crafting your prompt just anecdotally and trying it on five or six things — you quickly realize, oh shit, I'm gonna need to test against a much broader range of tests, with some rigor and methodology around that.

49:28 - So right, and try, how do you remember and go back and go, this actually made it better, right?

49:32 'cause it's hard to keep all that in your mind, yeah.

49:35 - Yeah, and something interesting that I'm realizing to work on this stuff is like, everything is changing so fast, right?

49:41 The models are changing fast, the prompt windows are changing fast, the vector databases — which are a way to organize and structure context for your prompts — are evolving extremely fast.

49:52 It feels like you're working on unsettled ground in a lot of ways. A lot of the stuff you're doing might be challenged by — you know, the Bard API came out last week, maybe it's better at SQL generation, and then I've got to throw away everything that I did on OpenAI.

50:06 But here's something you don't throw away, your test library and your use cases.

50:11 >> Right.

50:12 >> Maybe is the real asset here.

50:15 The rest of the stuff — it's all moving so fast that the mechanics of the prompt engineering itself, and the interface with whatever model is the best at the time, you're probably going to have to throw away as this evolves quickly.

50:29 but your test library is something really, really solid that you can perpetuate or keep improving and bringing along with you along the way.

50:39 It's an interesting thought around that.

50:41 >> Let's talk through this example you have on Promptimize's GitHub read me here.

50:47 To make it a little concrete for people, how do you actually write one of these tests?

50:51 >> Yeah. There's different types of prompts.

50:54 But what I wanted to get to was just like, what is the prompt and how do you evaluate it, right?

51:01 And then behind the scene, we're going to be, you know, discovering all your prompts and running them and compiling results and reports, right, and doing analytics and making it easy to do analytics on it.

51:11 The examples that we have here, and I'll try to be conscious of both the people who can read the code and people who don't, like the people who are just on audio.

51:19 But here, from promptimize.prompt we import a simple prompt case, and then we bring in some evals, which are just utility functions for evaluating the output of what comes back from the AI.

51:32 And here, I just create an array, a list of prompt cases.

51:38 And it's a prompt case like a test case.

51:41 And with this prompt case, this very simple one, I say, "Hello there!" And then I evaluate whether it says, you know, either "hi" or "hello" in the output, right?

51:53 So if any of the words exist in what comes back, I give it a one or a zero.

51:59 The framework allows you to say, oh, it has to have both these words, or to give a percentage of success based on the number of words from this list that it has.

52:08 But that's the first case.

52:11 The second one is a little bit more complicated, but name the top 50 guitar players of all time, I guess.

52:17 And I make sure that Frank Zappa is in the list 'cause I'm a Frank Zappa fan here.

52:21 But you could say, I want to make sure that at least three out of five of these are in the list.

52:29 Those are more like natural language, very simple tests too.

52:35 That's the Hello World essentially.
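
A rough reconstruction of the kind of README code being read through here — the exact module, class, and eval helper names are from memory and may differ from the actual library, so treat this as a sketch:

```python
# Rough reconstruction of the README example being described; module, class,
# and eval helper names are approximate and may not match the library exactly.
from promptimize.prompt import SimplePrompt  # name as described in the episode
from promptimize import evals

prompt_cases = [
    # score 1 if the response contains either "hi" or "hello", else 0
    SimplePrompt("hello there!", lambda response: evals.any_word(response, ["hi", "hello"])),
    # check that a favorite guitarist shows up in the generated list
    SimplePrompt(
        "name the top 50 guitar players of all time!",
        lambda response: evals.any_word(response, ["frank zappa"]),
    ),
]
```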

52:37 Then we're showing some examples of what's happening behind the scene.

52:40 Well, it will actually call the underlying API, get the results, run your eval function and compile a report.

52:48 What was the prompt?

52:49 What was the— oh, a bird just flew into my room!

52:52 That's gonna make the podcast interesting. Oh my goodness. Okay, that might be a first here. That is nuts. Oh, well, it's out of my room. Guess what? There are other people in that house. I'm just gonna close the door to my room and deal with it later. All right. Well, that's a first. - I've had a bat fly into my house once, but never a bird, so both are crazy.

53:21 How interesting — this is a first on the podcast. In eight years, we've never had a bird, a wild animal, enter the studio of a guest.

53:28 - Yes, well, welcome to my room.

53:31 I live in Tahoe, so I guess that's something, it's better than a bear, you know?

53:35 It could have been better.

53:36 - It is better than a bear.

53:37 - All right, but yeah, so just keep enumerating kind of what we're seeing visually here.

53:43 We'll keep a YAML file as the report output.

53:48 In Promptimize, you have your prompt cases — like test cases — and you have an output report that says: for this prompt case, here's the key, here's the user input that came in, here's what the prompt actually looked like, what was the response — the raw response from the API — what are all the tasks, how long did it run?

54:11 So a bunch of metadata and relevant information that we can use later to create these reports.

54:16 So you're like, was the score zero or one?

54:19 So you get the whole output report.
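To make the report shape a little more concrete: the sketch below shows the kind of per-case record described here, dumped to YAML. The field names are illustrative, taken from the conversation rather than from Promptimize's actual report schema.

```python
# Illustrative only: one report entry per prompt case, with field names based on the
# description above, not the exact Promptimize schema.
import yaml  # pip install pyyaml

report_entry = {
    "key": "prompt_hello_there",            # stable identifier for the prompt case
    "user_input": "hello there!",           # what the test supplied
    "prompt": "hello there!",               # the full prompt actually sent to the API
    "response": "Hi! How can I help you?",  # raw response from the API
    "score": 1.0,                           # result of the eval function(s)
    "weight": 1,
    "category": "greetings",
    "duration_seconds": 1.8,                # how long the call took
}

print(yaml.safe_dump({"prompt_cases": [report_entry]}, sort_keys=False))
```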

54:22 - Yeah, okay.

54:22 And then you also have a way to get like a report.

54:26 I'm not sure, maybe I scrolled past it.

54:27 - Yeah, I think it's--

54:28 - Where it shows you how it did, right?

54:30 I think that was in your--

54:31 - I think in the blog post, you see a much more--

54:33 - Oh, there it is.

54:35 - So this one, we're running the Spider dataset that I talked about.

54:40 Remember, it's the Yale-generated text-to-SQL competition corpus.

54:44 So here we looked at it, and my percentage of success is 70%.

54:49 So here you see weight and score.

54:52 So there's a way to say, "Oh, this particular prompt case is 10 times more important than another one." Right? So you can assign a relative importance, or weight, to your different test cases.

55:02 Now, one thing we didn't mention is that all these tests can be generated programmatically too.

55:07 It's the same philosophy behind, you know, Airflow: it's almost like a little DSL to write your test cases.

55:15 So it could read from a YAML file, for instance; in the case of what we do with Spider SQL, there's a big JSON file of all the prompts and all the databases.

55:23 And then we dynamically generate, you know, a thousand tests based on that.

55:28 So you can do programmatic test definition, more dynamic if you want it to be, or more static if you prefer that.

55:35 So in this case, we introduced this idea of a category too.

55:39 So I mentioned there are some features in Promptimize like categorizing your tests, or weights, and things like that.

55:47 So here we'll do some reporting per category.

55:51 What is the score per category?

55:53 You can see which databases it's performing well or poorly on.

55:57 So I could have another category, like large databases versus small databases, and see what the score is and compare reports.
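Here's a minimal sketch of that kind of programmatic definition, assuming a hypothetical JSON file of Spider-style examples. The file name, the expected_tables field, the eval helper, and the category/weight keyword arguments are illustrative assumptions, not the verified Promptimize API.

```python
# A sketch of dynamic prompt-case generation along the lines of the Spider SQL example.
# The JSON layout and the category/weight keyword arguments are assumptions.
import json

from promptimize.prompt_cases import PromptCase


def contains_all(response: str, words) -> float:
    """1.0 if every expected word shows up in the response, else 0.0."""
    response = response.lower()
    return 1.0 if all(w.lower() in response for w in words) else 0.0


with open("spider_examples.json") as f:   # hypothetical file: one entry per question
    examples = json.load(f)

prompt_cases = []
for example in examples:
    prompt_cases.append(
        PromptCase(
            example["question"],                                    # natural-language question
            lambda x, ex=example: contains_all(x, ex["expected_tables"]),
            category=example["db_id"],                              # report scores per database
            weight=example.get("weight", 1),                        # some cases count more
        )
    )
```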

56:05 It's pretty cool that it saves the test run to a file that you can then ask questions about and generate this report on, rather than just running it and passing or failing, right?

56:16 - Yeah, or like giving the output and then having to run it again.

56:19 Yeah, there are some other features around it, so you can memoize the tests.

56:24 So because it has the reports, if you, you know, exit out of it or restart it later, it won't rerun the same tests if it's the same hashed input, even though with AI, you might get a different answer with the same input.

56:38 But at least in this case, it will see, hey, it's the same prompt, so instead of waiting five seconds for OpenAI and then paying for the tokens and paying the piper,

56:48 You know, I'm just gonna skip that.

56:51 So there's some logic around skipping what's been done already.
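The memoization idea is generic enough to sketch outside of Promptimize: hash the prompt input and skip the slow, paid API call when a cached result for the same hash already exists. This is the concept, not Promptimize's internal code, and call_llm is a stand-in for whatever client function you use.

```python
# A generic sketch of memoizing LLM calls by hashing the prompt input.
import hashlib
import json


def prompt_key(prompt_text: str) -> str:
    """Stable hash for a prompt's input."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()


def run_with_cache(prompt_text: str, call_llm, cache_path: str = "report_cache.json"):
    """Return a cached response if this exact prompt was already run, else call the LLM."""
    try:
        with open(cache_path) as f:
            cache = json.load(f)
    except FileNotFoundError:
        cache = {}

    key = prompt_key(prompt_text)
    if key in cache:
        return cache[key]             # skip waiting on the API and paying for tokens

    response = call_llm(prompt_text)  # call_llm: your own client function
    cache[key] = response
    with open(cache_path, "w") as f:
        json.dump(cache, f)
    return response
```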

56:54 - It's not just a couple of milliseconds to run it.

56:56 It could be a while to get the answers.

56:58 - Yeah, also, like early libraries, I haven't written the threading for it yet, where you can say, oh, run it on eight threads.

57:06 So with Promptimize, I think the blog post is probably more impactful than the Python project itself.

57:15 If the Python project takes off and a bunch of people are using it to test prompts and contribute to it, it's great.

57:21 But I think it's more like, okay, this is uncharted territory, working with an AI type interface.

57:29 And then it's more like, how do we best do that as practitioners or as people building products.

57:36 I think that's the big idea there.

57:39 Then the test library, you could probably write your own.

57:41 I think for me that was a one or two week project.

57:44 The one thing I would say is that it would normally have taken longer if it wasn't for getting all the help from ChatGPT on things like, "I'm creating a project.

57:53 "I'm setting up my setup.py." Setup tools is always a little bit of pain in the ass.

57:59 Then I'm like, "Can you help me create my setup.py?" and then generate some code.

58:04 And I'm like, "Oh, I wanna make sure that PyPI is gonna get my readme from GitHub.

58:09 I forgot how to read the Markdown and pass that stuff in.

58:13 Can you do that for me?" And then ChatGPT generates this stuff very nicely.

58:17 Or, "I wanna make sure I use my request that requirements of TXT inside my dynamically building my setup tools integration.

58:26 Can you do that for me?" And it's just like, bam, bam, bam.
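That setuptools boilerplate looks roughly like the sketch below: read the long description from README.md so PyPI renders it, and pull install requirements from requirements.txt. A typical sketch, not Promptimize's actual setup.py, with a placeholder package name.

```python
# setup.py boilerplate of the kind described above.
from pathlib import Path

from setuptools import find_packages, setup

here = Path(__file__).parent
long_description = (here / "README.md").read_text(encoding="utf-8")
requirements = (here / "requirements.txt").read_text().splitlines()

setup(
    name="my-package",                         # placeholder name
    version="0.1.0",
    packages=find_packages(),
    install_requires=[r for r in requirements if r and not r.startswith("#")],
    long_description=long_description,
    long_description_content_type="text/markdown",  # so PyPI renders the Markdown README
)
```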

58:28 Like all the repetitive stuff.

58:30 I need a function.

58:31 - This is incredible, right?

58:32 - Go ahead.

58:33 - Yeah, I kind of want to close out the conversation with that.

58:36 I do agree that the blog post is super powerful in how it kind of teaches you to think about how might you go about testing, integrating with an AI and these types of products, right?

58:46 Much like TDD brought a way to think about how do we actually apply the concept of just, well, I have things and I can test them with this assert thing.

58:55 How should I actually go about building software, right?

58:56 So this is kind of that for AI integrated software.

59:00 So it's certainly worth people watching.

59:02 Let's just close it out with, you kind of touched on some of those things there.

59:05 Like, how do you recommend that people leverage things like ChatGPT to help them build their apps or how to use AI, this kind of - Oh my God, yeah.

59:17 - To like amp up your software development?

59:20 - 100%.

59:21 I mean, a lot of people report on Twitter that they used to Google all the problems they had while writing code and use a lot of Stack Overflow.

59:34 I don't know what the stats on Stack Overflow traffic are, but once you try working with ChatGPT to do coding, you probably don't go back to those other flows of, I don't know, putting your error message or stack trace into Google and then going to a bunch of Stack Overflow links and trying to make sense of what comes out.

59:54 To me, it's been so much better to go just with ChatGPT and there's a conversation there too.

01:00:00 So say for instance, in Promptimize, I needed a function, and it's like, can you write... I've written that function before, you know, but it's: can you crawl a given folder and look for modules that contain objects of a certain class and then bring that back? And you know, you have to use importlib and it's a little bit of a pain in the ass to write this. So it writes, you know, a function that works pretty well.

01:00:23 And I'm like, "Oh, I forgot to ask you to look into lists and dictionaries.

01:00:27 "Can you do that too?" Then it does that in a second.

01:00:30 It's like, "Oh, you didn't add type hints "and duck string and duck test.

01:00:34 "Can you write that too?" And it's a bang, bang, bang, and just like copy paste in your utils file and it works and you save like two hours.

01:00:43 - I think it would be really good at those things that are kind of algorithmic.

01:00:48 Now, it might be the kind of thing that you would do on a whiteboard job interview test, right?

01:00:53 it's just gonna know that really, really solid.

01:00:56 Actually, it knows quite a bit about the other libraries and stuff that are out there too.

01:01:01 - It's insane, yeah.

01:01:02 So one thing that I came across is, I leverage something called LangChain, which points to people getting interested in prompt engineering.

01:01:11 There's a really good... well, the library LangChain is really interesting.

01:01:15 It's not perfect, it's new, it's moving fast, but I'd push people to check it out.

01:01:20 Also like 41,000 stars, so very--

01:01:22 I know that's nice, right?

01:01:23 It's written in Python.

01:01:25 - Yes, you can do like, yeah, it's in Python too.

01:01:27 You should talk to whoever is writing this or started this.

01:01:32 But yeah, you can chain some prompts, so that the output of a prompt will generate the next one.

01:01:37 There's this idea of agents.

01:01:39 There's this idea of estimating tokens before doing the request.

01:01:44 There's a bunch of really cool things that it does.

01:01:47 To me, the docs are not that comprehensive.

01:01:49 If you Google "LangChain cookbook," you'll find someone else who wrote what I thought was a more comprehensive way to start.

01:02:03 This one has a YouTube video and an ipynb notebook file and introduces you to the concepts in an interactive way.

01:02:09 I thought that was really good.

01:02:12 But yeah, I was trying to use this, and I was like, "Oh, ChatGPT, can you generate a bunch of LangChain-related stuff?" And it was like, "I don't know of a project called LangChain."

01:02:20 It was created after 2021.

01:02:22 So I was like, I wish I could just say, just go read the GitHub, just read it all, read the docs, and then I'll ask you questions.

01:02:30 And then ChatGPT is not that great at that currently, at learning things it doesn't know, for reasons we talked about.

01:02:38 Bard is much more up to date, so for those projects... ChatGPT might be better at Django, 'cause it's old and settled, and it's better at writing code overall, but Bard might be decent and pretty good for--

01:02:52 - Right, if you ask advice on how to do promptimize stuff, it's like, I don't know what that is.

01:02:55 - Yeah, it's like, I've never heard of it. It might hallucinate too, I think; it'll just go and make shit up. Like I've seen it say, "Promptimize sounds like it would be this," and it just makes up stuff, so, not that great.

01:03:07 But yeah, absolutely, I encourage people to try it, you know, for any subtask that you're trying to do, to see if it can help you with it, and maybe try a variation on the prompt.

01:03:18 And then, you know, if it's not good at it, do it the old way.

01:03:21 But yeah, it might be better too, for those familiar with the idea of functional programming, where each function is more deterministic and can be reasoned about and unit tested in isolation.

01:03:32 ChatGPT is gonna be better at that 'cause it doesn't know about all your other packages and modules.

01:03:37 So it's really great for utils functions that are very deterministic and functional; super great at that.

01:03:44 Another thing, and you tell me when we run out of time, but another thing that was really interesting too, was bringing it into some of the concepts in Promptimize and into writing the blog post itself.

01:03:56 - Right.

01:03:56 - And asking things like, hey, I'm thinking about the differences in the properties of test-driven development as it applies to prompt engineering.

01:04:04 Here's my blog post, but can you think of other differences between the two that are very core?

01:04:10 And can you talk about the similarities and the differences? And it would come up with just really, really great ideas, brainstorming, and it's just very smart at mixing concepts.

01:04:22 - I do think one thing that's not a great idea is just say, "Write this for me." But if you've got something in mind and you're gonna say, "Give me some ideas," or, "Where should I go deeper into this?" And then you use your own creativity to create that, that's a totally valid use.

01:04:37 I wouldn't feel like, "Oh, I'm reading this AI crap." It brought out some insights that you had forgotten to think about and now you are, right?

01:04:45 - Or when it fails, instead of saying, like, I got it to fail, AI is wrong, I'm smarter than it, you're like, wait, is there something I can try? Here's what it didn't get right, and why; what did I need to tell it?

01:04:56 So you can go and edit your prompt, or ask a follow-up, and generally it will do better.

01:05:03 - Yeah, I think also you can ask it to find bugs, or security vulnerabilities.

01:05:06 - Yeah.

01:05:07 - Right, you're like, here's my 30-line function, do you see any bugs?

01:05:11 Do you-- - Yeah.

01:05:12 - Do you see any security vulnerabilities?

01:05:14 Like, yeah, you're passing this straight through, you're concatenating the string into the SQL or something like that.
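For anyone who hasn't seen the pattern, here's the kind of thing it might flag: an illustrative snippet with a deliberately unsafe string-concatenated query next to a parameterized one. This is a generic example, not code from the episode.

```python
# Illustration of the SQL injection issue mentioned above.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

user_input = "Robert'); DROP TABLE users; --"

# Risky: user input is pasted straight into the SQL string.
risky_query = "SELECT * FROM users WHERE name = '" + user_input + "'"

# Safer: let the driver handle the value as a bound parameter.
safe_rows = conn.execute(
    "SELECT * FROM users WHERE name = ?", (user_input,)
).fetchall()
```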

01:05:21 - Yeah, the rigor stuff too, like, you know, I would say writing a good docstring, writing doctests, writing unit tests, reviewing the logic, that kind of stuff.

01:05:33 It does, type hints, right?

01:05:34 If you're like me, I don't really like to write type hints up front, but I'm like, can you just sprinkle some type hints on top of that?

01:05:43 - Retrofit this thing for me.

01:05:44 - Yeah, that's it.

01:05:45 Just make it that production grade.

01:05:46 You know, one thing that's interesting too is, like, you know, you would think I'm a big TDD guy.

01:05:50 Like I don't do tests.

01:05:52 (laughs)

01:05:54 It's just not my thing.

01:05:55 I like to write code.

01:05:56 I don't think about what I'm gonna use the function for before I write it, but it's good at generating unit tests for a function too.

01:06:07 And then I think what's interesting with Promptimize too is that you want deterministic, what I call prompt cases, or test cases, but you can say, I've written five or six of these, can you write variations on that theme too?

01:06:22 So you can use it to generate test cases in the case of like TDD, but also the opposite, like for Promptimize, you can get it to generate stuff dynamically too.

01:06:32 - Yeah. - By itself.

01:06:33 - Yeah, it's pretty amazing.

01:06:35 It is pretty neat.

01:06:36 Let's maybe close this out. Well, I'll ask you one more question.

01:06:39 - Okay, can I do one more?

01:06:40 Can I show one more thing?

01:06:41 Since it's a Python podcast, if you go on that repo for Promptimize under examples, there's one called Python example.

01:06:50 - Here we go, something like this.

01:06:51 - Yeah, something like this.

01:06:52 So this stuff right here.

01:06:53 So here I wrote a prompt that asks the bot to generate a Python function.

01:07:01 Then I sandbox it and bring the function it wrote into the interpreter, and then I test it.

01:07:06 So I say, write a function that tests if a number is a prime number and returns a Boolean.

01:07:12 And then I test it; I have six static test cases for it.

01:07:16 So write a function that finds the greatest common denominator of two numbers, right?

01:07:21 Then behind the scene, we won't get into the class above.

01:07:24 The class above basically interacts with it, gets the input, then runs the test, then compiles the results, right?

01:07:31 So we could test how well 3.5 compares to four.

01:07:36 but I thought it was relevant for the Python folks on the line.

01:07:40 So we're testing how good it is at writing Python functions.

01:07:43 - Write a function that generates the Fibonacci sequence.

01:07:45 Yeah.

01:07:46 - Up to a certain number of terms, right?

01:07:48 So it's easy to test.
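The pattern is roughly: ask the model for a function, exec the returned code into a throwaway namespace, and score it with plain assertions. Below is a sketch under those assumptions; ask_llm is a hypothetical stand-in for whatever client call you use, the test cases mirror the is_prime example, and exec here is an isolation convenience, not a true security sandbox.

```python
# A sketch of testing model-generated Python: exec the code into its own namespace,
# then check it against known inputs and outputs.

def evaluate_generated_function(generated_code: str) -> float:
    """Return the fraction of test cases a generated `is_prime` function passes."""
    namespace: dict = {}
    exec(generated_code, namespace)      # throwaway namespace, not a security sandbox
    is_prime = namespace["is_prime"]

    test_cases = [(2, True), (3, True), (4, False), (17, True), (18, False), (1, False)]
    passed = sum(1 for n, expected in test_cases if is_prime(n) == expected)
    return passed / len(test_cases)


# Hypothetical usage:
# generated_code = ask_llm("Write a function `is_prime(n)` that returns a boolean.")
# print(evaluate_generated_function(generated_code))
```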

01:07:50 So it's cool stuff.

01:07:51 What was your last question?

01:07:53 - Oh, I was gonna say something like, see how far we can push it.

01:07:57 Write a Python function to use requests and Beautiful Soup to scrape the titles

01:08:06 of the episodes of Talk Python To Me.

01:08:09 - Oh yeah, and then, yeah, it is.

01:08:12 And, you know, one thing that's a pain in the butt for podcast people is to write up, like, what all we talked about.

01:08:18 So you use another AI to get the transcripts.

01:08:21 It's like, can you write something that's gonna leverage this library to transcribe it, summarize it, and publish it back, with SEO in mind?

01:08:32 - Yeah, it's really quite amazing.

01:08:34 It went through and said, okay, here's a function, and it knows talkpython.fm/episodes/all.

01:08:39 Use h, get the title, and let's just finish this out, Max.

01:08:43 I'll throw this into--

01:08:44 - An interpreter, see if it runs.

01:08:45 - Interpreter, and I'll see if I can get it to run.
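A scraper of that kind might look like the sketch below. The URL is the one mentioned in the conversation, while the CSS selector is an assumption about the page's markup and may need adjusting against the live site.

```python
# A sketch of the requests + Beautiful Soup scraper discussed here.
import requests
from bs4 import BeautifulSoup


def get_episode_titles() -> list[str]:
    """Fetch the episode listing page and pull out link texts that look like titles."""
    response = requests.get("https://talkpython.fm/episodes/all", timeout=30)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    # Assumption: episode links point at /episodes/show/...; adjust to the real markup.
    links = soup.select('a[href*="/episodes/show/"]')
    return [link.get_text(strip=True) for link in links if link.get_text(strip=True)]


if __name__ == "__main__":
    for title in get_episode_titles():
        print(title)
```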

01:08:48 - Hey, you know what's really interesting too, is you can give it a random function, like you can write a function that does certain things, and say, if I give this input to this function, what's gonna come out of it?

01:09:03 And it doesn't have an interpreter, but it can interpret code like you and I do, right?

01:09:08 Like an interview question: hey, here's a function; if I input a three as the value, what's it gonna return?

01:09:15 So it's able to follow the loops, you know, follow the if statements, and basically just do the logic.

01:09:21 - Yeah, another thing I think would be really good is to say, here's a function, explain to me what it does.

01:09:27 - Oh yeah, it's super great at that.

01:09:28 It's great at that for SQL too.

01:09:29 Here's a stupid long SQL query, can you explain it to me?

01:09:33 Or, if the explanation is long, it's like, can you just summarize that in a hundred words?

01:09:38 - Yeah, let's go step by step.

01:09:39 Let's go step by step.

01:09:41 What's this do?

01:09:42 - Well, yeah, I mean, maybe a closing statement is like this stuff is changing our world.

01:09:47 Like for me, I'm interested in how it's changing, how we're building products, you know?

01:09:51 But for the core things, as a data practitioner, as a Python expert, as a programmer, it's really changing the way people work day after day, faster than we all think.

01:10:04 And across the board, you might understand pretty well how it's changing your daily workflow as a software engineer, but it's changing people's workflows in chemistry and in every field; there's a lot we can leverage here if you use it well.

01:10:21 - Right, take this idea and apply it to whatever vertical you wanna think of, it's doing the same thing there, right?

01:10:28 - 100%. - Medicine, all over.

01:10:30 - Yeah, 100%, 100%.

01:10:32 All right, well, let's call it a wrap.

01:10:35 I think we're out of time here.

01:10:38 So really quick before we quit, a PyPI package to recommend, maybe something AI-related that you found recently, like, all these things are cool, people should check it out?

01:10:47 - Promptimize, I think it would be something to check out.

01:10:50 I think there's something called Future Tools that you could try to navigate; it shows all of the AI-powered tools that are coming out, and it's hard to keep up.

01:11:00 - Yeah, I think I have seen that, yeah.

01:11:02 - And then if you wanna keep up daily on what's happening in AI, there's TLDR AI; they have like a newsletter with their relevant list for the day.

01:11:12 I think that's a...

01:11:14 - It's hard to stay on top of; I prefer their weekly digest of what's going on in AI.

01:11:20 - It's more of a, just a stream of information.

01:11:24 - Yeah, it's just kind of dizzying, and it's like, oh, this new model does this, I gotta change everything to that.

01:11:29 And then something else comes out, and if you correct course too often, it's just like, you know, you do nothing.

01:11:36 'Cause you're like, the foundation's shifting too fast under you, so.

01:11:41 - Yeah, absolutely.

01:11:42 Well, very cool.

01:11:43 All right, and then final question.

01:11:45 You're gonna write some Python code.

01:11:47 What editor are you using these days?

01:11:48 - I'm a Vim user, yeah.

01:11:50 I know it's not the best.

01:11:52 I know all the limitations, but it's like muscle memory.

01:11:55 And I'm a UX guy now working on Superset.

01:11:59 I do appreciate the development of all the new IDEs and the functionality that they have.

01:12:05 I think it's amazing.

01:12:07 It's just that, for me, I know all my bash commands and Vim commands.

01:12:12 - Absolutely.

01:12:13 All right, well, Max, thanks for coming on the show, helping everyone explore this wild new frontier of AI and large language models.

01:12:19 And for-- - Yeah, well, you know, exploring it while we're still relevant, because I don't know how long we're gonna be relevant for.

01:12:27 So yeah.

01:12:28 - Yeah, enjoy it while we can, right?

01:12:30 Get out there.

01:12:30 Either control the robots or be controlled by them.

01:12:35 So get on the right side of that.

01:12:36 All right, thanks again.

01:12:38 - Thank you.

01:12:39 - This has been another episode of Talk Python to Me.

01:12:43 Thank you to our sponsors.

01:12:44 Be sure to check out what they're offering.

01:12:46 It really helps support the show.

01:12:48 The folks over at JetBrains encourage you to get work done with PyCharm.

01:12:52 PyCharm Professional understands complex projects across multiple languages and technologies, so you can stay productive while you're writing Python code and other code like HTML or SQL.

01:13:04 Download your free trial at talkpython.fm/done-with-pycharm.

01:13:09 Listen to an episode of Compiler, an original podcast from Red Hat.

01:13:14 Compiler unravels industry topics, trends, and things you've always wanted to know about tech through interviews with the people who know it best.

01:13:21 Subscribe today by following talkpython.fm/compiler.

01:13:25 Want to level up your Python?

01:13:27 We have one of the largest catalogs of Python video courses over at Talk Python.

01:13:31 Our content ranges from true beginners to deeply advanced topics like memory and async.

01:13:36 And best of all, there's not a subscription in sight.

01:13:38 Check it out for yourself at training.talkpython.fm.

01:13:42 Be sure to subscribe to the show, open your favorite podcast app, and search for Python.

01:13:46 We should be right at the top.

01:13:48 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the Direct RSS feed at /rss on talkpython.fm.

01:13:57 We're live streaming most of our recordings these days.

01:14:00 If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at talkpython.fm/youtube.

01:14:08 This is your host, Michael Kennedy.

01:14:10 Thanks so much for listening.

01:14:11 I really appreciate it.

01:14:12 Now get out there and write some Python code.

01:14:14 (upbeat music)

