#417: Test-Driven Prompt Engineering for LLMs with Promptimize Transcript
00:00 Large language models and chat-based AIs are kind of mind-blowing at the moment.
00:04 Many of us are playing with them for working on code or just as a fun alternative to search,
00:10 but others of us are building applications with AI at the core. And when doing that,
00:16 the slightly unpredictable nature and probabilistic style of LLMs makes writing and testing Python
00:22 code very tricky. Enter Promptimize from Maxime Beauchemin and Preset. It's a framework for
00:29 non-deterministic testing of LLMs inside of our applications. Let's dive inside the AIs with Max.
00:35 This is Talk Python To Me, episode 417, recorded May 22nd, 2023.
00:40 Welcome to Talk Python To Me, a weekly podcast on Python. This is your host,
00:58 Michael Kennedy. Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using
01:03 @talkpython, both on fosstodon.org. Be careful with impersonating accounts on other instances.
01:09 There are many. Keep up with the show and listen to over seven years of past episodes at talkpython.fm.
01:15 We've started streaming most of our episodes live on YouTube. Subscribe to our YouTube channel over
01:21 at talkpython.fm/youtube to get notified about upcoming shows and be part of that episode.
01:27 This episode is brought to you by JetBrains, who encourage you to get work done with PyCharm.
01:33 Download your free trial of PyCharm Professional at talkpython.fm/done-with-pycharm.
01:40 And it's brought to you by the Compiler Podcast from Red Hat. Listen to an episode of their podcast
01:46 to demystify the tech industry over at talkpython.fm/compiler.
01:51 Max, welcome to Talk Python to Me.
01:54 Well, it's good to be back on the show. And now I know it's live too, so no mistakes. I'm going to
01:59 try to not say anything outrageous.
02:02 People get the unfiltered versions. So absolutely. Absolutely. I love it when people come check out
02:07 the live show. Welcome back. It's been a little while since you were on the show, about since
02:12 September. We talked about Superset. We also talked a little bit about Airflow, some of the stuff that
02:18 you've been working on. And now we're kind of circling back through this data side of things,
02:23 but trying to bring AI into the whole story. So pretty cool project that I'm looking forward to
02:28 talking to you about.
02:29 Awesome. I'm excited to talk about it too. And these things are related in many ways. Like one
02:35 common foundation is Python. Like a lot of these projects are in Python. They're data related.
02:40 And here, Promptimize and prompt engineering and integrating AI is related in a way that we're
02:47 building some AI features into Superset right now and into Preset. So it ties things together in some way.
02:55 Yeah, I can certainly see a lot of synergy here. Before we dive into it, it hasn't been that long,
02:59 since you were on the show. But give us a quick update, just a bit about your background for people who don't know you.
03:05 Yeah, so my career is a career of like maybe 20 or so years in data building, doing data engineering,
03:13 doing trying to make useful data useful for organizations. Over the past decade or so, I've been very involved in open source.
03:22 So I started Apache Airflow in 2014. So for those not familiar with Airflow, though, it's pretty well known now.
03:28 It's used at, I heard like, I think it's like 10s of 1000, I think above 100,000 companies are using Apache Airflow, which is kind of insane to think about.
03:38 It's like you started a little project. So for me, I started this project at Airbnb, and it really took off.
03:44 And I think it's just like great project community fit, like people really needed that. It was the right abstraction for people at the time.
03:52 And still today, and it just really took off. So I was working on orchestration. And then I was like, I just love things that are visual and interactive.
04:01 So there was no great open source BI tool out there, business intelligence. So like data dashboarding, exploration, a SQL IDE.
04:11 So it's like a playground for people trying to understand and visualize and explore data. So I started working on Apache Superset in
04:18 2015 or 2016 at Airbnb, too. And we also brought that to the Apache Software Foundation. So again, like a very, very popular open source project that's used in tens of thousands, maybe 100,000 organizations.
04:31 And today it has become a super great open source alternative to Tableau, Looker, all those business intelligence tools, very viable for organizations.
04:44 And then a quick plug for Preset.io, the company I started. I'm also an entrepreneur; I started a company a little bit more than four years ago around Apache Superset.
04:54 And the idea is to bring Superset to the masses. So it's really like hosted, managed, state of the art Apache Superset for everyone with some bells and whistles.
05:04 So the best Superset you can run. There's a free version too, so you can go and play and try it, you know, get started in five minutes today.
05:12 So it's a, it's a little bit of a commercial pointer, but also very relevant to what I've been doing, you know, personally over the past like three or four years.
05:20 Sure, it's some of the inspiration for some of the things we're going to talk about as well and trying to bring some of the AI craziness back to products, right?
05:28 So I think that's a, you know, from an engineering perspective, not just a, hey, look, I asked it, what basketball team was gonna, you know, win this year, and it gave me this answer, right?
05:37 It's like, and it caveats, I don't know anything that happened since 2021. And AI specifically, Bard is a little bit better at that. But it's like, you know, the last thing I read off of the internet was in fall 2021.
05:50 It makes some things a little bit challenging. But yeah, so we're building, you know, AI features into Preset. You know, as a commercial open source company, we need to build some differentiators from Superset. We contribute a huge amount, like maybe 50 to 80% of the work we do at Preset is contributed back to Superset, but we're looking to build differentiators. And we feel like AI is a great kind of commercial differentiator on top of Superset that makes people even more interested to come and run
06:20 on Preset too. So yeah, excellent. And to jump in, you say they were popular projects: Airflow has 30,000 stars, Apache Superset has 50,000 stars, which puts it on par with Django and Flask, for people's sort of mental models out there, which is, I would say, pretty well known. So awesome.
06:40 Yeah, Star is a vanity metric in some ways. It's not necessarily usefulness or value delivered, but it's a proxy for popularity and hype. It gives a good sense. I think at 50,000 stars, it's probably in the top 100 of GitHub projects.
06:59 If you remove the top 100, there's a lot of documentations and guides and things that are not really open source projects. It's probably like top 100 open source project-ish in both cases.
07:13 Right.
07:14 It's pretty cool. It's like you start a project and you don't know whether it's going to take off and how. It's just nice to see that.
07:21 Yeah, absolutely. On one hand, it's nice, but it doesn't necessarily make it better.
07:26 But it does mean there's a lot of people using it. There's a lot of polish. There's a lot of PRs and stuff that have been submitted to make it work. A lot of things that can plug into it.
07:36 So there's certainly a value for having a project popular versus unpopular.
07:40 Oh my God. Yes. And I would say one thing is, you know, all the dark, call it like secondary assets outside of the core projects documentation. And there will be a lot of like use cases and testimonials and reviews and people bending the framework in various ways and forks and plugins.
07:58 So another thing too, that people, I think don't really understand the value of in software and open sources, or I'm sure people understand the value, but it's not talked about as just a whole battle tested thing.
08:10 So when something is run at thousands of organizations in production for a long time, there's a lot of things that happen in a software that are very, very valuable for the incremental organization adopting it.
08:23 Well, let's talk large language models for a second. So AI means different things to different people, right? It, they, they kind of get carved off as they find some kind of productive productize use, right?
08:37 AI is this general term and like, Oh, machine learning is now a thing that we've done, or computer vision is a thing we've done.
08:43 And the large language models are starting to find their own, their own special space. So maybe we could talk a bit about a couple of examples just so people get a sense.
08:53 I mean, to me, ChatGPT seems like the most well-known. What do you think?
08:58 Yeah. I mean, well, I'll say, if you think about like, what is a large language model and what are some of the leaps there?
09:05 And I'm not an expert, so I'm going to try to not put my foot in my mouth, but some things that I think are interesting.
09:11 So a large language model is a big neural network that is trained on a big corpus of text.
09:18 I think one of the big leaps we've seen is unsupervised learning. So like really often, like in machine learning in the past or pre-LLMs, we would have very specific like training set and outcomes we were looking for.
09:33 And then the training data would have to be really structured here.
09:36 Here, what we're doing with large language models is feeding a lot of huge corpus of text and what the large language model is trying to do or resolve is to chain words.
09:46 So he's trying to predict the next word, which seems like you would be able to put words together that kind of make sense.
09:55 But like you wouldn't think that consciousness, not just like consciousness, but intelligence would come out of that.
09:59 But somehow it does. Right.
10:01 Like if you chain that, it's like if you say, you know, Humpty Dumpty sat on a, it's really clear.
10:07 It's going to be wall, you know, the next word.
10:10 But if you push this idea much further with a very large corpus of like human knowledge, somehow there's some really amazing stuff that does happen on these large language models.
10:21 And I think that realization happened at around, you know, ChatGPT 3, 3.5 getting pretty good.
10:29 And then at four, like, oh, my God, this stuff can really kind of seems like it can think or be smart to be very helpful.
10:36 Yeah. The thing that I think impresses me the most about these is they seem and people can tell me it's statistics and, you know, I'll believe them.
10:44 But it seems like they have an understanding of the context of what they're talking about more than just predicting like Humpty Dumpty sat on the what it said on the wall.
10:53 Right. Obviously, that's what was likely to come next when you see those that set of words.
10:58 But there's an example that I like to play with, which I copy.
11:02 You know, I'll give it a little thing. I'll say, hey, here's a program, Python program.
11:07 I'm going to ask you questions about it. Let's call it arrow.
11:09 And it's like, this is a highly nested function that tests whether something's a platypus.
11:15 I saw this example somewhere and I thought, OK, this is this is pretty cool.
11:18 But it has this if it's a mammal, then if it has fur, then if it has a beak, then if it has a tail and you can just do stuff that's really interesting, like I'll ask it to a bird or something like that.
11:31 Yeah, yeah, yeah, yeah, rewrite it, write it using guard clauses to be not nested.
11:38 Right.
11:39 Oh, yeah.
11:40 Right. And it will say, sure, here we go.
11:42 And instead of being if this, then nest, if this, then if this, it'll do if not, if not return false.
11:47 Right. Which is which is really cool. And that's kind of a pretty interesting one.
11:51 But like this is this is the one. This is the example that I think is crazy is rewrite arrow to test for crocodiles.
12:00 They're using that.
12:01 It's like what people would call a one shot, a few shot example of like a here's a here's an example, the kind of stuff I might want.
12:09 There's some different ways to do that.
12:11 But it's a pattern in prompt engineering where you'll say you have a zero shot, one shot, few shot examples we get into.
12:17 But it does feel like it understands the code. Right. Like what you're trying to do.
12:22 Right. Just for people listening, it said, OK, here's a function is_crocodile.
12:26 If not self dot is reptile, if not self dot has scales, these are the false cases, right?
12:30 And if it has four legs and a long snout and can swim, right? Like it rewrote the little tests and stuff, right?
12:36 In a way that seems really unlikely from just predicting likelihood, because it's never seen anything quite like this, really, which is pretty mind-blowing, I think.
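For listeners who want to picture the rewrite being described, here's a rough sketch of the before and after; the attribute names are invented for illustration rather than taken from the actual example.

```python
# Deeply nested version, roughly like the platypus/crocodile example described above
def is_crocodile(animal):
    if animal.is_reptile:
        if animal.has_scales:
            if animal.has_long_snout:
                if animal.can_swim:
                    return True
    return False


# The same logic rewritten with guard clauses, the way the AI was asked to do it
def is_crocodile_flat(animal):
    if not animal.is_reptile:
        return False
    if not animal.has_scales:
        return False
    if not animal.has_long_snout:
        return False
    return animal.can_swim
```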
12:46 Or it or it had like it read the entire internet and all of GitHub and that kind of stuff.
12:51 So it has seen. Exactly.
12:52 Things that think that's mind boggling is just like can like when you think about what it did there is it read the the entire conversation so far, your your input prompt.
13:03 And it has like a system prompt ahead of time that says, you know, you're ChatGPT, try to be helpful to people.
13:09 And here's a bunch of things that you should or should not say and non bias.
13:13 You try to be concise and good and virtuous.
13:16 And people have found all sorts of jail breaks out of that.
13:19 But like all it does from that point on is try to predict the next word, which is kind of insane that it gets to, you know, the amount of structure that we see.
13:29 Right. Right. That's a lot of structure there. Right.
13:31 So pretty impressive. And ChatGPT is starting to grow.
13:35 You know, you've got version four and you can start using some of the plugins.
13:37 It's going to keep going crazy there.
13:39 Other examples: AssemblyAI just released LeMUR, which is a large language model, but really focused on transcribed speech, which I think is kind of cool.
13:48 Microsoft released Microsoft Security Copilot, which is a large language model to talk about things like nginx misconfigurations and stuff like that.
13:59 There's just just a lot of a lot of stuff out there that's coming along here. Right.
14:04 A lot of type of thing on the on the open source front to there's there's all ethical thing.
14:11 Like, should everyone and anyone have access to open source models doing that while we don't really understand and we probably shouldn't get to the ethics part of the debate here, because that's a whole series of episodes.
14:25 We probably want to get into.
14:26 But what's interesting is, you know, Databricks came up with a model,
14:30 Facebook came up with one called LLaMA, and they open sourced them, or the weights.
14:34 So you have the model topology with the pre trained weights.
14:39 In some cases, there's open source corpus of training that are also coming out and are also open source.
14:45 So that means these open source models are somewhat competitive, or increasingly competitive, with GPT.
14:54 Yeah, which is kind of crazy.
14:57 And in some areas where GPT-4 has limitations,
15:01 They break through these limitations.
15:03 So one thing that's really important as a current limitation of the GPT models and LLMs is the prompt window, the token prompt window.
15:13 So basically, when you ask a question, you know, it's been trained and has machine-learned with data up to, I think in the case of GPT-3.5 or 4,
15:24 the corpus of training goes all the way to fall 2021.
15:28 So if you ask, like, who is the current president of the United States, it just doesn't know, or it will tell you, as of 2021 it is this person.
15:36 But so if you're trying to do tasks like what I've been working on, which we'll probably get into later in the conversation, which is trying to generate
15:44 SQL, it doesn't know your tables.
15:46 So you have to say, hey, here's all the tables in my database.
15:48 Now, can you generate SQL that does X on top of it?
15:51 And that context window is increased is limited and increasing.
15:56 And some of these open source models have different types of limitations.
16:01 This portion of talk Python to me is brought to you by JetBrains, who encourage you to get work done with PyCharm.
16:08 PyCharm Professional is the complete IDE that supports all major Python workflows, including full stack development.
16:15 That's front end JavaScript, Python back end and data support, as well as data science workflows with Jupyter.
16:22 PyCharm just works out of the box.
16:24 Some editors provide their functionality through piecemeal add ins that you put together from a variety of sources.
16:31 PyCharm is ready to go from minute one.
16:34 And PyCharm thrives on complexity.
16:37 The biggest selling point for me personally is that PyCharm understands the code structure of my entire project, even across languages such as Python and SQL and HTML.
16:48 If you see your editor completing statements just because the word appears elsewhere in the file, but it's not actually relevant to that code block, that should make you really nervous.
16:58 I've been a happy paying customer of PyCharm for years.
17:01 Hardly a workday passes that I'm not deep inside PyCharm working on projects here at Talk Python.
17:08 What tool is more important to your productivity than your code editor?
17:12 You deserve one that works the best.
17:14 So download your free trial of PyCharm Professional today at talkpython.fm/donewithpycharm and get work done.
17:22 That link is in your podcast player show notes.
17:25 Thank you to PyCharm from JetBrains for sponsoring the show and keeping Talk Python going strong.
17:30 Right. It's interesting to ask questions, right?
17:34 But it's more interesting from a software developer's perspective of, can I teach it a little bit more about what my app needs to know or what my app structure is, right?
17:45 In your case, I want to use superset to ask the database questions.
17:51 But if I'm going to bring in AI, it needs to understand the database structure so that when I say help me do a query to do this thing, it needs to know what the heck to do, right?
18:02 The table I need to know.
18:03 So there's the stuff it knows and the stuff it can't know.
18:07 And some of it goes, is related to the fact that whether this information is public on the internet, whether it has happened to be trained against it.
18:15 And then if it's in private, there's just no hope that it would know about, you know, your own internal documents or your database structure.
18:21 So in our case, it speaks SQL very, very well.
18:25 So as we get into this example, like how to get GPT to generate good SQL in the context of a tool like Superset or SQL Lab, which is our SQL IDE.
18:34 So it knows how to speak SQL super well.
18:37 It knows the different dialects of SQL very, very well.
18:40 It knows its functions, its dates functions, which a lot of the SQL engineers on the call, like, yeah, I can never remember like what Postgres data function is.
18:50 But the GPT models just know SQL; they know the dialects, they know the mechanics of SQL, they understand data modeling, foreign keys, joins, primary keys, all this stuff it understands.
19:01 It knows nothing about your specific database, the, you know, the schema names and the table names, the column names that I might be able to use.
19:10 So that's where we need to start providing some context.
19:13 And this context window is limited.
19:15 So it's like, oh, how do you use that context well or as well as possible?
19:21 And that's the field and some of the ideas behind it is prompt crafting and prompt engineering, which we can get into once we get there.
19:29 Maybe we're there already.
19:31 Yeah, yeah.
19:32 Well, yeah, I think where I see this stuff going is from this general purpose knowledge starting to bring in more private or personal or internal type of information.
19:44 Right.
19:44 Like our data about our customers is like structured like this in a table.
19:48 And here's what we know about them.
19:49 Now let us ask questions about our company and our thing.
19:52 Right.
19:53 And it's like starting to make inroads in that direction, I think.
19:56 Yeah.
19:57 You know, one thing to know about is there's different approaches to teach or provide that context.
20:04 So one would be to build your own model from scratch.
20:07 Right.
20:08 And that's pretty prohibitive.
20:10 So you'd have to, you know, find the right corpus.
20:12 And instead of starting with a model that knows SQL and needs to know your table and context, you have to start from from zero and very prohibitive.
20:20 Another one is you start from a base model at some point of some kind.
20:24 There's a topology.
20:25 So there's, you know, different layers and number of neurons and it knows some things.
20:29 And then you load up some weights that are open source.
20:31 And then you say, I'm going to, I'm going to tune this model to teach it my database schemas and basically my own corpus of data.
20:39 So it could be your data dictionaries could be your internal documents.
20:43 It could be your GitHub code, your dbt projects.
20:46 It could be your Airflow DAGs; be like, I'm going to dump all this stuff into the model and that will get baked into the neural network itself.
20:54 That's doable, pretty prohibitive in this era.
20:58 If you have the challenge that we have a preset, which is we have multiple customers with different schemas, we can't have spillover.
21:05 So you have to train a model for each one of our customers and serve a model for each one of our customers.
21:10 So still pretty prohibitive.
21:12 And a lot of people fall back on this third or fourth method that I would call prompt engineering, which is going to use the base model, the OpenAI API or just an API to an LLM.
21:24 And then, it knows SQL already, so I'll just say, hey, here's a bunch of tables that you might want to use.
21:29 Can you generate SQL on top of it?
21:31 So then that's just a big request with a lot of context.
21:36 Then we have to start thinking about maximizing the use of that context window to pass the information that's most relevant within the limits allowed by the specific model.
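To make that concrete, here is a minimal sketch of assembling such a prompt; the helper name, schema text, and instructions are made up for illustration, and the real Preset implementation is certainly more involved.

```python
def build_text_to_sql_prompt(user_question: str, schema_ddl: str, sample_rows: str) -> str:
    """Pack the most relevant context into a single prompt within the context window."""
    return (
        "You are a helpful assistant that writes SQL.\n"
        "Here are the tables available in this database:\n"
        f"{schema_ddl}\n\n"
        "Here are a few sample rows for context:\n"
        f"{sample_rows}\n\n"
        "Using only these tables and columns, write a SQL query that answers:\n"
        f"{user_question}\n"
        "Return only the SQL, with no explanation."
    )


# Hypothetical usage
prompt = build_text_to_sql_prompt(
    user_question="Top 5 salaries per department",
    schema_ddl="CREATE TABLE employees (id INT, name TEXT, salary NUMERIC, department TEXT);",
    sample_rows="employees: (1, 'Ada', 120000, 'Engineering'), (2, 'Grace', 110000, 'Data')",
)
```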
21:46 Right. And that starts to get into reproducibility, accuracy, and just those limitations, which is kind of an engineering type of thing. Right.
21:56 Yeah. And then, you know, maybe a topic too. And, you know, this conversation is based on a recent blog post and the flow.
22:02 Just going back to the flow of that blog post, we started by establishing the premise that everyone is trying to bring AI into their product today.
22:11 Right. Thousands of product builders are currently exploring ways to harness the power of AI and the products and experiences they create.
22:18 That's the premise for us with text to SQL and SQL lab as part of preset.
22:23 But I don't know if you think of any product, any startup, any SAS product you use to work at HubSpot today, you're trying to figure out how to leverage AI to build, you know, sales chat bots or as the chat chat bot.
22:38 So everyone everywhere is trying to figure that out. And the challenge is the challenges, I guess, very probabilistic in a different interface to anything we know.
22:47 Like, you know, engineers would be like, oh, let's look at an API and leverage it.
22:51 And APIs are very, very deterministic. And in general, AI is kind of wild beast to tame.
22:58 You know, you ask first the interface is language, not code.
23:03 And then what comes back is like semi probabilistic in nature.
23:07 And it could change underneath you. It's a little bit like web scraping in that regard that like it does the same.
23:12 It does the same. And then, you know, something out there change, not your code.
23:16 And then a potentially different behavior comes back. Right.
23:20 Because they may have trained another couple of years, refine the model, switch the model, change the default temperature, all these things.
23:26 Yeah, there's a lot that can happen there. One thing I noticed, like starting to work with what I would call prompt crafting, which is, you know, you work with ChatGPT and you, you craft different prompt with putting emphasis in a place or another or changing the order of things or just changing a word.
23:43 Right. Just say like important exclamation point, capitalize the words, you know, the reserve words in SQL.
23:51 And then just the fact that you put important exclamation point, you know, will make it do it or not do it, changing from a model to another.
23:59 So one thing that's great is the models, at least at OpenAI, they are immutable as far as I know.
24:06 But like if you use GPT-3.5 Turbo, for instance, that's just one trained model, and I believe that is immutable.
24:15 The chat bot on top of it might get fine tuned and change over time.
24:20 But the model is supposed to be static.
24:23 You mentioned temperature.
24:24 Be kind of interesting to just mention for those who are not familiar with that.
24:27 So when you interact with AI, one of the core parameters is temperature.
24:31 And it's it's I think it's a value from zero to one or I'm not sure if you know how exactly you pass it.
24:39 But it basically defines how creative you want
24:43 to let the AI be. Like if you put it to zero,
24:48 you're going to have something more deterministic.
24:50 So asking the same question should lead to a similar or the same answer.
24:54 Though not in my experience, it feels like it should, but it doesn't.
24:57 But then if you put it higher, it will get more creative. Talk more about how that actually seems to work behind the scenes.
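For readers who want to see where that knob lives in code, here's a rough sketch using the OpenAI Python client as it existed around the time of this recording (the pre-1.0 interface); treat it as an illustration rather than a reference.

```python
import openai  # pre-1.0 style client, shown as a sketch

openai.api_key = "sk-..."  # placeholder

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Humpty Dumpty sat on a"}],
    temperature=0.0,  # low values are more deterministic, higher values more creative
)
print(response["choices"][0]["message"]["content"])
```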
25:05 Yeah. Well, the that variability seems to show up more in the image based ones.
25:10 So, for example, this article, this blog post that you wrote, you have this image here and you said, oh, and I made this image from mid journey.
25:17 I've also I was I got some examples of a couple that I did.
25:22 Where did I stick them somewhere?
25:23 Here we go.
25:24 Where I asked just for like YouTube thumbnails.
25:26 I asked mid journey for like a radio astronomy example that I can use, because here's one that's not encumbered by, you know, some sort of licensing, but still looks kind of cool and is representative.
25:36 Right.
25:37 And there it's like massive difference.
25:40 I don't I'm not sure how much difference I've seen.
25:43 I know it will make some, but I haven't seen as dramatic of a difference on chat.
25:47 Yeah.
25:48 Yeah.
25:49 Yeah.
25:50 I'm not sure exactly how they introduce the variability on the generative images.
25:55 AI.
25:56 Right.
25:57 I know it's like this multi dimensional space with a lot of words and a lot of images in there.
26:02 And then it's probably like, where the location point of that is, they randomize that point in that multi-dimensional space.
26:12 So for ChatGPT is pretty easy to reason about, and I might be wrong on this again, I'm not an expert, but you know how the way it works is it writes, it takes the prompt and then it comes up with the next word sequentially.
26:24 So for each word for the next word, so Humpty Dumpty sat on a, it might be wall at 99%, but like that might be 1% of, you know, fence or something like that.
26:36 And if you, you up the temperature, there's, it's more likely to pick the non first word and that probably less, they probably do in a weighted way.
26:48 Like it's possible.
26:49 I take a second or third word randomly.
26:51 And then of course it's going to get a tree or decision tree.
26:54 Once it picks the words, the next word is also changes.
26:57 So as you up that it goes down path that sends it into more creative or different.
27:04 Right.
27:05 Yeah.
27:05 A little butterfly effect.
27:06 It makes a different choice here.
27:08 And then it sends it, you know, sends it down through the graph.
27:11 Interesting.
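A toy sketch of that weighted-pick intuition, with made-up probabilities; real models apply temperature to the logits before sampling, but the effect is the same idea.

```python
import random

# Hypothetical next-word probabilities after "Humpty Dumpty sat on a"
candidates = {"wall": 0.99, "fence": 0.01}


def pick_next_word(probs: dict, temperature: float) -> str:
    if temperature == 0:
        # Greedy: always take the most likely word
        return max(probs, key=probs.get)
    # Raising probabilities to 1/temperature flattens (or sharpens) the distribution
    weights = [p ** (1.0 / temperature) for p in probs.values()]
    return random.choices(list(probs), weights=weights)[0]


print(pick_next_word(candidates, temperature=0.0))  # "wall" every time
print(pick_next_word(candidates, temperature=1.5))  # occasionally "fence"
```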
27:12 So one thing that you really pointed out here, and I think is, is maybe worth touching on a bit is this idea of prompt engineering.
27:19 There's even places like learnprompting.org that like try to teach you how to talk to these things.
27:24 And, and you make a strong distinction between prompt crafting or just talking to the AI versus like really trying to put an engineering focus on it.
27:33 Do you want to?
27:34 Yeah.
27:35 Because like, yeah, I think it's a super important differentiation, but one that I'm proposing.
27:40 Right.
27:41 So I don't think that people have settled as to what is one or what is the other.
27:45 I think I saw a Reddit post recently that was like prompt engineering is just a load of crap.
27:50 Like, you know, anyone can go because they thought their, their understanding of prompt engineering was like, oh, you know, you fine tune or you craft your prompt.
27:59 Then you say like, you are an expert AI working on, you know, creating molecules.
28:05 Now, can you do this?
28:06 And then, you know, by doing that, you might get a better outcome. Or, one really interesting thing that people have been doing in prompt crafting,
28:13 that seemed to have a huge impact, and there have been papers written on this specific hint or crafting tweak, is: let's proceed step by step.
28:26 So basically, if whatever question you are asking is specifically around, like, more mathematical things, or things that require more systematic step-by-step thinking, the whole, just like, let's think through this, or let's go about it
28:41 step by step, makes it much better.
28:43 So here you might be able to, well, so you know, if you had, if you had an example where ChatGPT 3 failed or ChatGPT 4 failed, you could just say, colon, let's go step by step.
28:56 And it might succeed that, that time around, which is kind of, maybe you can get it to help you understand instead of just get the answer.
29:03 Right. Like factor this polynomial into its primary solutions or roots or whatever.
29:10 And you're like, okay, show me that. Don't just show me the answer.
29:12 Show me step by step.
29:13 So I could understand and try to learn from what you've done. Right.
29:16 Yeah. I mean, if you think about how the way that it's trying to come up with a new word, if all it does is a language based answer to a mathematical question, like how many days are there between this date and that date?
29:29 There's no, that specific example might not exist or it's kind of complicated for it to go about it.
29:35 But if you say, let's think step by step: okay, there are this many months, this month's duration is this long, there are this many days since the beginning of that month, and it might get it right that time around.
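As a tiny illustration of that crafting tweak, something like the following; the question and wrapper are invented for the example.

```python
def with_step_by_step(question: str) -> str:
    """Append the 'step by step' nudge discussed above to a prompt."""
    return f"{question}\nLet's think step by step."


plain = "How many days are there between March 3, 2023 and May 22, 2023?"
nudged = with_step_by_step(plain)
# The nudged prompt tends to make the model lay out the months and day counts
# before committing to a final number, which often improves the answer.
```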
29:48 Right. Or if it fails, you could pick up partway along where it has some more.
29:52 Yeah. You know, and then you can trace, I mean, just you too, like, I think, you know, one thing is like, you should be extremely careful as like taking for granted that it's right all the time, you know?
30:00 So that means like, it also helps you review its process and where it might be wrong, but back to crafting versus engineering.
30:08 So crafting would be the process that I think is more attached to a use ChatGPT every day, the same way that, you know, we've been trained at Googling, you know, over the past like two decades.
30:20 You know, use quotes, use plus and minus, and you know which keywords to use intuitively, right? You know where it's going to get confused or not.
30:29 So I think prompt crafting is a different version of that. That's just more worthy.
30:35 And if you're, you know, you're working with the AI to try to assist you, write your blog post or to try to assist you in any task really, just to be smart about how you bring the context, how you tell it to proceed, goes a very, very long way.
30:49 So that's what I call prompt crafting, call it like one of cases.
30:53 What people do when they're interacting with, with the large language model.
30:57 I think so. Right. Like it's not evident for a lot of people. Well, they are exploring the edge of where it fails and they love to see it fail.
31:05 And, and then they don't think about like, Oh, what could I have told it to get the answer?
31:10 I was actually looking for like, ah, I got you wrong. You know, it's as if I had that actor in a conversation. I like, ah, you're wrong. And I told you so, you know, so I think there's a lot of that online, but I think for all these examples that I've seen,
31:22 I'm really tempted to take the prompt that they had and then give, give it an instruction or two or more, and then figure out how to get it to come up with the right thing.
31:30 So prompt crafting super important skill. you know, you could probably get a boost of, for most knowledge information workers, you'll get a boost of 50% to 10 X for a lot of the tasks you do every day.
31:41 If you use AI well, so it's great personal skill to have, go and develop that skill. If you don't.
31:48 This portion of talk Python is sponsored by the compiler podcast from Red Hat.
31:51 Just like you, I'm a big fan of podcasts and I'm happy to share a new one from a highly respected open source company compiler and original podcast from Red Hat.
32:01 Do you want to stay on top of tech without dedicating tons of time to it? Compiler presents perspectives, topics, and insights from the tech industry free from jargon and judgment.
32:11 They want to discover where technology is headed beyond the headlines and create a place for new IT professionals to learn, grow, and thrive.
32:17 Compiler helps people break through the barriers and challenges turning code into community at all levels of the enterprise.
32:23 One recent and interesting episode is there. The great stack debate.
32:27 I love, love, love talking to people about how they architect their code, the trade offs and conventions they chose, and the costs, challenges, and smiles that result.
32:37 This great stack debate episode is like that. Check it out and see if software is more like an onion or more like lasagna or maybe even more complicated than that.
32:45 It's the first episode in compiler series on software stacks. Learn more about compiler at talkpython.fm/compiler.
32:53 The link is in your podcast player show notes. And yes, you could just go search for compiler and subscribe to it. But follow that link and click on your players icon to add it.
33:03 That way they know you came from us. Our thanks to the compiler podcast for keeping this podcast going strong.
33:11 Prompt engineering, in my case, is like, you're building something, you're using an AI as an API behind the scenes.
33:19 You want to pass it a bunch of relevant contexts, really specify what you want to get out of it.
33:25 Maybe you even want to get a structured output, right? You might want to get a JSON blob out of it.
33:30 You say, return a JSON blob with the following format so it's more structured.
33:35 So then to give all these instructions, there's this idea of providing few shots too.
33:40 You might be storing context in a vector database. I don't know if we're getting to that today, but there are ways to kind of structure and organize your potential embeddings or the things you want to pass as context.
33:52 So there's a lot here. I think somewhere too, I talk about prompt engineering. If we scroll in the blog post, like what is prompt engineering?
33:59 And prompt engineering will list the kind of things. It might be higher in the post. Sorry, we're scrolling for people.
34:06 I want to introduce what is prompt engineering.
34:10 Yeah.
34:11 Well, there's not above this section.
34:16 You can scroll at the beginning of what is prompt engineering.
34:20 Yeah, here you go.
34:21 Oh yeah, right here. The definition of this is ChatGPT's version of it. So when you do prompt engineering, you can add context, which means that you're going to have to retrieve context, maybe from a database, from a user session, from your Redux store if you're in the front end, right?
34:37 You're going to go and fetch the context that's relevant in the context of the application.
34:41 It leads while building products.
34:42 Specify an answer format.
34:43 You could just say, yes, I just want a yes or no, a Boolean.
34:46 I want a JSON blob with not only the answer, but your confidence on that answer or something like that.
34:52 Limiting scope, asking for pros and cons, incorporating verification or sourcing.
34:59 So that's more, you know, if you iterate on a prompt, you're going to be rigorous about is this prompt better than the previous prompt I had like this.
35:07 If I pass five rows of sample data while doing text to SQL, does it do better than if I or does it do more poorly than if I pass 10 rows of sample data or provide a certain amount of column level statistics.
35:20 So prompt engineering is not just prompt crafting.
35:23 It is like bringing maybe the scientific method to it, bring some engineering of like fetching the right context and organizing it well and then measuring the outcome.
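A small hypothetical sketch of the "specify an answer format" idea mentioned above: ask for a JSON blob with the answer and a confidence, then validate whatever comes back.

```python
import json

# Hypothetical prompt that pins down a structured answer format
prompt = (
    "Is the following SQL query syntactically valid for PostgreSQL?\n"
    "SELECT name, salary FROM employees ORDER BY salary DESC LIMIT 5;\n"
    'Answer only with a JSON object of the form {"answer": true or false, "confidence": 0.0 to 1.0}.'
)

# `raw` stands in for the model's reply; in a real app you'd parse and validate it
raw = '{"answer": true, "confidence": 0.92}'
result = json.loads(raw)
assert isinstance(result["answer"], bool)
assert 0.0 <= result["confidence"] <= 1.0
```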
35:33 Right. Exactly.
35:34 Something that comes out, you can measure and say, this is, you know, 10% better by my metric than it was before with this additional data.
35:41 Right. That's, that's a big difference.
35:42 Right. And then there's so many things moving, right?
35:45 Like, and everything is changing so fast in the space that you're like, oh, well, ChatGPT five or GPT five is out or GPT four turbo is half the price and then just came out.
35:54 Now I'm just going to move to that.
35:55 They're like, wait, is that that performing better?
35:58 Or, you know, what are, what are the trade off?
36:01 Or even, I'm going to move this section, you know, asking for a certain JSON format, above this other section.
36:08 I'm going to write important, exclamation point, do X. Does that improve my results?
36:14 Does that mess it up?
36:16 And which one of my test case perhaps that succeeded before fails now, which one failed before succeeds now.
36:22 So you can be like, is that a better, a worse iteration towards my goal?
36:28 Right. Right.
36:29 Kind of training, bringing this unit testing TDD mindset.
36:33 Yes.
36:34 Yeah.
36:35 We're getting deeper into the blog post, right?
36:37 So the blog post is talking about bringing this TDD, the test driven development mindset to prompt engineering.
36:46 Right.
36:47 And there's a lot of things that are in common.
36:49 You can, you can take and apply and kind of transfer just over.
36:53 There are some things to that breakdown that are fundamentally different between testing a prompt or working with AI and working with, you know, just a bit more deterministic code testing type framework.
37:05 We can get into that.
37:06 Yeah. Yeah, for sure.
37:08 So you called out a couple of reasons of why TDD is important for prompt engineering.
37:13 Maybe we could run through those.
37:15 Yeah.
37:16 So, you know, the first thing is, the AI model is not a deterministic thing, right?
37:22 You use a modern API, a GraphQL or a REST API,
37:26 the format of what you ask is extremely clear.
37:30 And then the format of what you get back is usually defined by a schema.
37:34 It's like very deterministic, pretty much guaranteed that if you do the same request,
37:37 you'll get the same output-ish, or at least format.
37:40 what AI that's not the case, right?
37:43 So it's much more unpredictable and probabilistic by nature.
37:47 second one is handling complexity.
37:50 So, AI systems are complex, black boxy, kind of unpredictable too.
37:54 So, and embrace that and assume that you might get something really creative coming out of there for better or for worse.
38:01 and then reducing risk, like you're shipping product, you know, if you're shipping product, writing product,
38:08 you don't want to silly, like any sort of like, bias or weird thing like the AI could go a little crazy and, and yeah,
38:16 There are examples of AIs going crazy before like Tay.
38:20 Do you remember Microsoft Tay?
38:22 I don't know that one, but I know of other examples.
38:24 Yeah.
38:25 I mean, it came out and it was like this, this sort of just, you know, I'm here to learn from you internet.
38:30 And people just turned it into a racist and made it do all sorts of horrible things.
38:34 And they had to shut it down a couple of days later because it's just, it's like, whoa, it met the internet and the internet is mean.
38:40 So, that's not great.
38:42 Yeah.
38:42 Yeah.
38:43 Train it on a 4chan or let it go crawl 4chan and Reddit.
38:47 And that's not always going to be nice.
38:49 So bad.
38:50 Right.
38:51 I mean, you, you, you don't entirely control what's going to come out of those things.
38:54 And so.
38:55 Oh yeah.
38:56 Or I would say a little more predictable.
38:57 Right.
38:58 I would say, yeah, like, you don't entirely control.
38:59 Like I think, yeah, like basically, you know, control might be a complete illusion.
39:04 Like even the people working at open AI don't fully understand what's happening in there.
39:09 Yeah.
39:10 Like, well, it read a bunch of stuff and it's predicting the next word and, it gets most things right.
39:17 By the way, like they do a lot around this idea of like, not necessarily TDD, but there's a whole eval framework.
39:22 So you can submit your evaluation functions to open AI.
39:26 And as they, they train the next version of things, they include that in what their evaluation system for the new thing.
39:33 So say if I wanted to go and contribute back a bunch of like text to SQL type use cases, as they call evals, then they would take that into consideration when they, they train their next models.
39:45 All right.
39:45 So going down the list, reducing risk, right?
39:47 So you're integrating some, that beast that's not fully tamed into your product.
39:52 So you probably want to make sure it's tamed enough to live inside your product.
39:56 continuous improvements that I, that should have been maybe the first one in the list is you're iterating on your prompts.
40:02 You're trying to figure out a past context.
40:04 you're, you're, you're trying different model versions.
40:08 Maybe you're, you're trying some open source models or the latest GPT cheaper greater thing.
40:14 so you want to make sure that as you iterate, you're getting to the actual outcomes that you want systematically.
40:21 And then performance measurement too, like how long does it take?
40:24 How much does it cost?
40:25 you, you kind of need, to have a handle on that.
40:29 The new model might be 3% better on your corpus of tests, but it might be six times the price.
40:35 Like, do you, do you want, are you okay?
40:37 Right, right.
40:38 Or just from a user perspective.
40:40 Yeah.
40:40 Time to interaction.
40:41 You know, that's one thing with AI we're realizing now: a lot of the prompts on GPT-4 will be like, you know, one to seven seconds, which, in the Google era, you know, there have been some really great papers out of Google early on that prove that, you know, even a hundred milliseconds has an impact on user behaviors.
41:01 Right.
41:02 Right.
41:03 Yeah.
41:04 People give up on checkout flows or whatever, not going to the next part of your site on a measurably on a hundred millisecond blocks.
41:11 Right.
41:11 When you're talking, well, here's 7,000, you know, you're 70 of those.
41:14 That's going to have an effect potentially.
41:15 Oh, it has, has been proven and very intricate and usage pattern session duration session outcomes.
41:23 Right.
41:24 And you know, a second is a mountain.
41:26 If today, like we were at this AB test Google between like whatever millisecond it's at now, like just one second or half a second that the results coming out of that AB test would, which show very, very different behaviors.
41:39 Wow.
41:40 I think there are some really great papers, you know, written on TTI, just time to interaction, and the way it influences user behavior.
41:48 So we're still, you know, in the AI world has to, if you're going to wait two to seven seconds for your prompt to come back, it's got to add some real, some real important value to what's happening.
41:58 Yeah, it does.
41:59 I think it's interesting that it's presented as a chat.
42:01 Right.
42:02 I think that gives people a little bit of a pause, like, oh, it's talking to me.
42:05 So let's let, let's let it think for a second rather than it's a website that's supposed to give me an answer.
42:09 Yeah.
42:10 Compared to then, I guess your basis for comparison is a human, not a, you know, a website or not comparing against Google.
42:17 So that's great.
42:18 Yeah.
42:19 I ask it a really hard question.
42:20 Give it some time, right?
42:21 Like that's not normally how we think about these things.
42:23 Okay.
42:24 So what's the workflow from this engineering, building a product testing, like an AI inside of your product.
42:30 You want to walk us through your workflow here?
42:32 Yeah.
42:33 And you know, if you, if you, I think I looked at TDD, you know, and, and originally what is the normal like TDD type workflow?
42:40 And I just adapted this little diagram to, to, to prompt engineering work.
42:46 Cause the whole idea of the blog post is to, to bring prompt engine, like TDD mindset to prompt engineering.
42:51 So this is where, where I went, but yeah, the, the workflow is like, okay, define the use case and desired AI behavior.
42:58 What are you trying to solve with AI?
43:00 In my case, the example that I'll use and try to reuse throughout this conversation is, you know, text to SQL.
43:10 So like, we're trying to, given a user prompt and a database schema, get the AI to generate good, useful SQL, find the right tables and columns to use, that kind of stuff. Then, create test cases.
43:21 So it's like, okay, if I have this database and I have this prompt, give me the top five salaries per department on this HR dataset, there's a fairly deterministic output to that.
43:33 Or you could say the SQL is not necessarily deterministic.
43:35 There's different ways to write that SQL, but there's a deterministic data frame or results set that might come up.
43:41 There is a right answer of the top five salaries.
43:44 That's right.
43:45 You can see whether you're, you know, ultimately getting that.
43:47 And it's great if it is deterministic, because you can test it.
43:51 If you're trying to use AI to, say, write an essay about Napoleon Bonaparte's second conquest, you know, in less than 500 words, it's not as deterministic, so it's hard to test whether the AI is doing well or not.
44:08 So you might need human evaluators there. But I would say in most AI products, where people are trying to bring AI into their product,
44:16 in many cases it's more deterministic.
44:18 So another example of more deterministic would be, say, getting AI to write Python functions. It's like, oh, write a function that, you know, returns
44:28 whether a number is prime, yes or no. You can get the function and test it in a deterministic kind of way.
44:38 So anyways, just pointing that out.
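A sketch of what deterministically checking an AI-written function can look like; `generated_source` here stands in for whatever code the model returned.

```python
# Code the model might have returned for "write a function that returns whether a number is prime"
generated_source = """
def is_prime(n):
    if n < 2:
        return False
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return False
    return True
"""

namespace = {}
exec(generated_source, namespace)  # in a real system you'd sandbox this

is_prime = namespace["is_prime"]

# The outcome is testable even though the generated code itself may vary between runs
assert [n for n in range(20) if is_prime(n)] == [2, 3, 5, 7, 11, 13, 17, 19]
```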
44:40 You're only going to be able to have a TDD mindset if you have a somewhat deterministic, you know, outcome to the thing you want to use the AI for. Then, create the prompt generator.
44:50 So that would be your first version, or in the text to SQL example, it's: given, you know, the 20 tables in this database and these columns and table names and data types and sample data, generate SQL that answers the following user prompt.
45:04 And then the user prompt would say something like top five salaries per department.
45:10 and then you, then we're getting for people that are not on the visual, stream, not YouTube, but on just audio, we're getting into the loop here where it's like, run the test, evaluate the results, refine the tests, refine the prompts and then start over.
45:23 Right.
45:24 And probably compile the results, keep track of the results, so that you can compare, not just like, oh, are you 3% better on your test cases,
45:32 but also which tests that used to fail succeed now, and which tests that used to succeed fail now.
45:41 And then, once you're happy with the level of success you have, you know, you can integrate the prompt into the product or maybe upgrade.
45:49 Ship it.
45:50 Yeah.
45:50 Ship it.
45:51 Ship it.
45:52 So I think it's probably a good time to jump over to your framework for this, because pytest and other testing frameworks in Python are great.
46:01 But they're pretty low level compared to these types of questions you're trying to answer.
46:06 Right.
46:06 Like how has this improved over time for, you know, I was doing 83% right.
46:12 Right.
46:13 pytest asserts a true or a false.
46:14 It doesn't assert that 83% is okay.
46:17 Yeah.
46:18 Like if any of your pytest tests fail, CI is probably going to not allow, CI is not even going to merge the PR.
46:28 Right.
46:28 So one thing that's different between test-driven development and unit testing, and prompt engineering, is that the outcome is probabilistic.
46:36 It's not just true or false; it might be zero or one, right,
46:39 or a spectrum of, you know, how it fails.
46:43 So for a specific test, you're like, oh, it gets this column, but not this other column, so you succeed at, you know, 50%.
46:50 So it's non binary.
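A toy example of that kind of partial-credit scoring; the expected column names are invented for illustration.

```python
expected_columns = {"department", "salary"}


def score_columns(generated_sql: str) -> float:
    """Return the fraction of expected columns that show up in the generated SQL."""
    found = sum(1 for col in expected_columns if col in generated_sql.lower())
    return found / len(expected_columns)


print(score_columns("SELECT department, MAX(salary) FROM employees GROUP BY department"))  # 1.0
print(score_columns("SELECT department FROM employees"))  # 0.5
```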
46:52 It's also, you don't need perfection to ship.
46:54 You just need better than the previous version or good enough to start with.
46:58 So the mindset is, is, so there's a bunch of differences.
47:02 And for, for those interested, we won't, get into the, in the blog post.
47:06 I think I list out the things that are, different between the two.
47:09 I think it's a little bit above this, but, you know, the first thing I want to say is, like, the level of ambition of this project versus an Airflow or a Superset is very low.
47:18 Right. So it's maybe more similar to a unit test library, and no discredit to the great, awesome unit test libraries out there.
47:29 But you would think those are fairly simple and straightforward.
47:32 It's just that the information architecture of a pytest is probably simpler than the information architecture of a Django, for instance.
47:39 Right. It's just like a different thing.
47:41 And here, the level of ambition is much low and much, you know, for this is, is fairly simple.
47:47 So Promptimize is something that I created, which is a toolkit to help people write, evaluate, and score prompts, and understand them as they iterate while doing prompt engineering.
48:03 But so in this case, I think, I talk about the use case at preset, which is a, we have a big corpus, that luckily was contributed by, I forgot which university, but a bunch of PhD people did a text to SQL, contest.
48:16 I think it was Yale.
48:17 I think it was Yale.
48:18 Yale. Yeah.
48:19 Yeah.
48:20 So yeah.
48:21 So great people at Yale were like, Hey, we're going to generate, you know, 3000 prompts on 200 databases with the, the SQL that should be the outcome of that.
48:31 It's a big test set so that different researchers working on text to SQL can compare their results.
48:36 So for us, we're able to take that test set and some of our own test sets and run it at scale against, you know, OpenAI or against LLaMA or against different models.
48:48 And by, by doing that, we're able to evaluate like, you know, this particular combo of like this prompt engine methodology with this model generates, you know, 73% accuracy.
49:00 And we have these reports we can compare, you know, fairly easily, which prompts that, as I said before, we're failing before succeeding now and vice versa.
49:09 So you know, I mean, am I actually making progress here or going backwards?
49:13 And if you try to do that on your own, like if you're crafting your prompt, just anecdotally and try on five or six things like you quickly realize like, Oh shit, I'm going to need to really test out of much broader range of this and some rigor methodology around that.
49:27 So, right.
49:28 And try and you remember and go back and go, this actually made it better.
49:31 Right.
49:32 Cause it's, it's hard to keep all that in your mind.
49:34 Yeah.
49:34 Yeah.
49:35 And something interesting that I'm realizing to work on this stuff is like everything is changing so fast, right?
49:41 The models are changing fast.
49:42 The prompting windows are changing fast.
49:44 The vector databases, which is a way that organize and structure context for your prompts evolving extremely fast.
49:51 It feels like you're working on unsettled ground in a lot of ways.
49:55 Like a lot of stuff you're doing might be challenged by, you know, the Bard API came out last week and maybe it's better at SQL generation.
50:02 And then I've got to throw away everything that I did on OpenAI. But here's the thing: you don't throw away your test library and your use cases.
50:10 Right.
50:11 Maybe is the thing is the real asset here.
50:14 The rest of the stuff is like, oh yeah, it's moving so fast that all the mechanics of the prompt engineering itself and the interface with the, whatever model is the best at the time.
50:26 You're probably gonna have to throw away as this evolves quickly.
50:29 But your test library is something really, really solid that you can perpetuate or like, you know, keep improving and bringing along with you along the way.
50:39 So it's kind of an interesting thought around that.
50:41 Yeah.
50:42 Let's talk to this.
50:43 Let's talk to this example.
50:44 You have on Promptimize's GitHub README here, to make it a little concrete for people.
50:49 Like how do you actually write one of these tests?
50:51 Yeah.
50:52 So, so there's different types of prompts, but yeah, the, you know, what I wanted to get to was just like, just like, what is the prompt and how do you evaluate it?
51:00 Right.
51:01 And then behind the scene, we're going to be, you know, discovering all your prompts and running them and compiling results and reports, right.
51:08 And doing analytics and making it easy to do analytics on it.
51:11 The examples that we have here, and I'll try to be conscious of both the people who can read the code and the people who can't, like the people who are just on audio. But here, from promptimize.prompt, we import a simple prompt.
51:23 And then we bring some evals that are just utility functions for evaluating the output of what comes back from the AI.
51:31 And here, the first prompt case in the model, here, I could just create an array or a list of prompt cases.
51:38 And it's a prompt case, like a test case.
51:40 And with this prompt case, this very simple one, I say, hello there, exclamation point.
51:46 And then I evaluate that as says, you know, either hi or hello in the output.
51:52 Right.
51:53 So if any of the words exists and what comes back, I give it a one or a zero framework allows you to, you could say, oh, it has to have both these words or give the percentage of success based on the number of words from this list that it has.
52:07 But, you know, that's the first case.
52:10 The second one is like a little bit more complicated, but name the top 50 guitar players of all time, I guess.
52:17 And I make sure that Frank Zappa is in the list because I'm a Frank Zappa fan here.
52:21 But, you know, you could have a different, you could say, hey, I want to make sure that, you know, at least three out of five of these are in the list.
52:29 Those are like very more like natural language, very simple tests to, you know, just so that's the hello world, essentially.
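For readers following along in text, here's an approximation of the hello-world example being read from the README; the exact import paths, class names, and eval helpers may differ from the current Promptimize API, so check the project's README for the real thing.

```python
# Approximate reconstruction of the README example described above; names may not
# match the library exactly.
from promptimize.prompt_cases import PromptCase  # assumed import path
from promptimize import evals                    # assumed eval-helper module

simple_prompts = [
    # Send "hello there!" and score 1 if "hi" or "hello" shows up in the response
    PromptCase(
        "hello there!",
        lambda x: evals.any_word(x.response, ["hi", "hello"]),
    ),
    # Ask for the top 50 guitar players and check that Frank Zappa made the list
    PromptCase(
        "who are the top 50 best guitar players of all time?",
        lambda x: evals.any_word(x.response, ["zappa"]),
    ),
]
```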
52:37 And then, you know, we're showing some examples of what's happening behind the scenes.
52:40 Well, it will actually call, you know, the underlying API, get the results, run your eval function, and compile a report.
52:48 What was the prompt?
52:49 What was, oh, a bird just flew into my room.
52:52 Inside?
52:53 Yeah, that's going to make the podcast interesting.
52:56 Oh, my goodness.
52:58 Okay.
52:59 That might be a first here.
53:01 That is nuts.
53:02 Oh, well, it's out of my room.
53:04 Guess what?
53:05 There's other people in the house.
53:06 I'm just going to close the door to my room.
53:09 I'm dealing with it later.
53:11 All right.
53:12 Well, that's a first.
53:14 I've had a bat fly into my house once, but never a bird.
53:19 So both are crazy.
53:21 How indirect.
53:22 You're the first on the podcast in eight years.
53:23 We've never had a bird, or any wild animal, enter the studio of a guest.
53:28 Yes.
53:29 Well, welcome to my room.
53:30 I live in Tahoe.
53:31 So I guess that's something that's better than a bear.
53:34 You know, it could have been better.
53:35 It is better than a bear.
53:37 All right.
53:38 But yes, like this, let me narrate a bit of what we're seeing visually here.
53:43 You know, we'll keep a YAML file as the report output.
53:48 So in promptimize, you have your test case or your prompt cases.
53:52 I like test cases.
53:53 You have an output report that says for this, you know, prompt case.
53:57 Here's the key.
53:58 Here's the user input that actually came in.
54:02 Here's what the prompt looked like.
54:04 You know, what was the response or raw response from the API?
54:08 What are all the tasks?
54:09 How long did it run?
54:10 So a bunch of metadata and relevant information that we can use later to create these reports.
54:15 Saying like was the score zero or one.
54:18 So you get the whole output report.
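To make that report structure concrete for people on audio, here is a hypothetical sketch of one report entry, built as a Python dict and dumped to YAML; the field names are illustrative, not the exact keys promptimize writes:

```python
# Hypothetical shape of one entry in the YAML report; the real keys may differ.
import yaml  # assumes PyYAML is installed

report_entry = {
    "key": "prompt-hello-001",           # identifier / hash for the prompt case
    "user_input": "hello there!",        # the user input that came in
    "prompt": "hello there!",            # the fully rendered prompt sent to the API
    "response": "Hello! How can I help you today?",  # raw response from the API
    "score": 1.0,                        # result of the eval function (0 to 1)
    "duration_seconds": 1.8,             # how long the call took
}
print(yaml.safe_dump([report_entry], sort_keys=False))
```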
54:20 Yeah.
54:21 Okay.
54:22 And then you also have a way to get like a report.
54:25 I'm not sure.
54:26 Maybe I scroll down.
54:27 Yeah.
54:28 It shows you how it, how it did.
54:29 Right.
54:30 I think that was in your, in your...
54:31 I think that's in the blog post, where you see a much more detailed one.
54:33 Yeah.
54:34 So for this one, we're running the Spider dataset that I talked about.
54:39 Remember, it's the Yale-generated text-to-SQL competition corpus.
54:44 And so here, you know, my percentage of success is 70%.
54:49 So, you know, here you see weight and score.
54:51 So there's a way to say, oh, this particular prompt case is 10 times more important than
54:56 another one.
54:57 Right.
54:58 So you can assign a relative importance, a weight, to your different test cases.
55:02 Now, one thing we didn't mention: all these tests are generated programmatically too.
55:07 So it's the same philosophy behind, you know, Airflow; it's almost like a little DSL to write your test cases.
55:14 So I could read from a YAML file, for instance. In the case of what we do with Spider SQL, there's a big JSON file of all the prompts and all the databases.
55:22 And then we dynamically generate, you know, a thousand tests based on that.
55:27 So you can do programmatic test definition, more dynamic if you want it to be, or you could do it more statically if you prefer that.
55:35 So in this case, we introduced this idea of a category too.
55:39 So I mentioned there are some features in promptimize like categorizing your tests, or weights, you know, things like that.
55:47 So here we'll do some reporting per category.
55:50 What is the score per category?
55:52 You can see which databases
55:54 we're performing well
55:55 or poorly against.
55:57 So I could have another category, like large databases versus small databases, and see what the score is and compare reports.
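A minimal sketch of that programmatic, DSL-like style: reading cases from a JSON corpus dump (like the Spider file Max mentions) and generating weighted, categorized prompt cases in a loop. The file layout and the weight/category keyword arguments are assumptions for illustration, not the exact promptimize signature:

```python
# Sketch of programmatic test definition; the file layout and the keyword
# arguments (weight, category) are assumptions, not promptimize's exact API.
import json

from promptimize.prompt import SimplePrompt
from promptimize import evals

with open("spider_cases.json") as f:  # hypothetical dump of questions + expected keywords
    corpus = json.load(f)

prompt_cases = [
    SimplePrompt(
        case["question"],
        # Default argument captures this case's keywords inside the comprehension.
        lambda response, expected=case["expected_keywords"]: evals.any_word(response, expected),
        weight=case.get("weight", 1),           # some cases can count 10x more
        category=case.get("database", "misc"),  # enables per-category reporting
    )
    for case in corpus
]
```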
56:05 It's pretty cool that it saves the test run to a file that you can then ask questions about and generate this report from, rather than just running it and having it pass or fail.
56:15 Right.
56:16 Yeah.
56:17 Or like giving the output and then having to run it again.
56:19 Yeah.
56:20 There are some other features around that; for example, you can memoize the tests.
56:24 So because it has the reports, if you, like, exit out of it or restart it later, it won't rerun the same tests
56:32 if it's the same hash of the input. Even though with AI you might get a different answer with the same input, at least in this case it will say, hey, I'm rerunning the same prompt, so instead of waiting five seconds for OpenAI and paying the tokens, paying the piper,
56:48 you know, I'm just going to skip that.
56:50 So there's some logic around skipping what's been done already.
56:53 It's not just a couple of milliseconds to run it.
56:56 It could be a while to get the answers.
56:58 Yeah.
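The memoization Max describes can be pictured roughly like this: hash the rendered prompt, and if that hash already appears in a previous report, skip the API call. This is a simplified illustration of the idea, not promptimize's actual code:

```python
# Simplified illustration of report-based memoization, not promptimize's actual code.
import hashlib

def prompt_key(prompt_text: str) -> str:
    # Identical prompt text hashes to the same key.
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()

def run_with_memoization(prompt_text, call_api, previous_report: dict) -> dict:
    key = prompt_key(prompt_text)
    if key in previous_report:
        # Same hash as a previous run: skip the slow, token-costing API call.
        return previous_report[key]
    result = {"prompt": prompt_text, "response": call_api(prompt_text)}
    previous_report[key] = result
    return result
```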
56:59 It's also an early library.
57:00 I haven't written the threading for it yet, where you could say, oh, run it on eight threads.
57:05 So, sure.
57:07 With promptimize, I think,
57:09 you know, the blog post is probably more impactful than the Python project itself.
57:14 If the project takes off and a bunch of people are using it to test prompts and improving and contributing to it, that's great.
57:20 But I think it's more like, okay, this is uncharted territory,
57:24 working with an AI-type interface.
57:28 And then it's more like, oh, how do we best do that as practitioners or as people building products?
57:35 I think that's the big idea there. You know, for the test library, you could probably write your own.
57:40 Like, I think for me, that was a one- or two-week project.
57:43 Though, what I would say is: normally, that is, if it wasn't for getting all the help from ChatGPT. You know, it's like, I'm creating a project,
57:52 I'm setting up my setup.py; you know, setuptools is always a little bit of a guess.
57:57 And then I'm like, can you help me create my setup.py? And then, you know, it generates some code, and I'm like, oh, I want to make sure that PyPI is going to get my README from GitHub.
58:09 I forgot how to read the Markdown and parse that stuff.
58:12 Can you do that for me?
58:13 And then ChatGPT generates this stuff very nicely.
58:16 Right.
58:17 Or, I want to make sure I use my requirements.txt when dynamically building my setuptools integration.
58:25 Can you do that for me?
58:26 And it's just like, bam, bam, bam.
58:28 Like all the repetitive stuff.
58:29 Any function I need.
58:30 It's incredible.
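The kind of setup.py scaffolding Max is describing, pulling the long description from README.md and the dependencies from requirements.txt, might look something like this; the package name and version here are placeholders:

```python
# Minimal setup.py sketch: long description from README.md, deps from requirements.txt.
from pathlib import Path
from setuptools import setup, find_packages

long_description = Path("README.md").read_text(encoding="utf-8")
requirements = [
    line.strip()
    for line in Path("requirements.txt").read_text(encoding="utf-8").splitlines()
    if line.strip() and not line.startswith("#")
]

setup(
    name="my-package",        # placeholder name
    version="0.1.0",          # placeholder version
    packages=find_packages(),
    install_requires=requirements,
    long_description=long_description,
    long_description_content_type="text/markdown",  # so PyPI renders the Markdown README
)
```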
58:31 Go ahead.
58:32 Yeah.
58:33 I kind of want to close out the conversation with that.
58:35 I do agree that the blog post is super powerful in how it kind of teaches you to think about how you might go about testing and integrating with an AI in these types of products.
58:45 Right.
58:46 Much like TDD brought a way to think about how we actually apply the concept beyond just, well, I have things that I can test with this assert thing.
58:54 How should I actually go about building software?
58:56 Right.
58:57 So this is kind of that for AI-integrated software.
58:59 So it's certainly worth people watching.
59:01 Let's just close it out with, you know, you kind of touched on some of those things there.
59:04 Like, how do you recommend that people leverage things like ChatGPT to help them build their apps, or how to use AI
59:15 to just kind of amp up your software development?
59:20 A hundred percent.
59:21 I mean, a lot of people report it, you know, on Twitter. People used to Google all the problems they had while writing code and use a lot of Stack Overflow.
59:34 I don't know what the stats on Stack Overflow traffic are, but once you try working with ChatGPT to do coding, you probably don't go back to those other flows of, I don't know,
59:45 putting your error message or stack trace into Google and then going into a bunch of Stack Overflow links and trying to make sense of what comes out.
59:54 To me, it's been so much better to go just with ChatGPT.
59:58 And there's a conversation there too.
59:59 So say, for instance, for promptimize I needed a function. I'd written that function before, you know, but it's: can you crawl a certain given folder and look for modules that contain objects of a certain class and then bring that back?
01:00:15 And, you know, you have to use importlib, and it's a little bit of a pain in the ass to write this.
01:00:20 So it writes, you know, a function that works pretty well.
01:00:23 I'm like, Oh yeah, I forgot to ask you to look into lists and dictionaries.
01:00:26 Can you do that too?
01:00:27 Then it does that in a second.
01:00:29 It's like, you know, you didn't have type hints and docstrings and doctests.
01:00:34 Can you write those, you know, too?
01:00:36 And it's bang, bang, bang.
01:00:37 And you just copy-paste it into your utils file and it works.
01:00:40 And you save like two hours. You know, I think it would be really good at those things that are kind of algorithmic.
01:00:47 Now you might, they might be the kind of thing that you would do on a whiteboard job interview test.
01:00:52 Right.
01:00:53 It's just going to know that really, really solidly.
01:00:56 And actually, it knows quite a bit about the other libraries and stuff that are out there.
01:01:01 It's insane.
01:01:02 Yeah.
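The folder-crawling helper Max asked ChatGPT for, finding objects of a given class across the modules in a folder, including ones nested in lists and dicts, might look roughly like the sketch below. It's an illustration of the idea, not promptimize's actual discovery code:

```python
# Sketch of module discovery with importlib; not promptimize's actual code.
import importlib.util
from pathlib import Path

def find_instances(folder, target_class):
    """Import every .py file under folder and collect objects (including ones
    nested in lists, tuples, sets, and dicts) that are instances of target_class."""
    found = []

    def collect(value):
        if isinstance(value, target_class):
            found.append(value)
        elif isinstance(value, (list, tuple, set)):
            for item in value:
                collect(item)
        elif isinstance(value, dict):
            for item in value.values():
                collect(item)

    for path in Path(folder).rglob("*.py"):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # note: this executes the module's top-level code
        for value in vars(module).values():
            collect(value)
    return found
```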
01:01:03 So one thing that I came across: I leveraged something called LangChain, which I'd point people to if they're getting interested in prompt engineering.
01:01:10 There's a really good... well, the library LangChain is really interesting.
01:01:14 That's not perfect.
01:01:15 It's new.
01:01:16 It's moving fast, but I'd encourage people to check it out.
01:01:19 Also, like 41,000 stars.
01:01:21 So very popular. I know, that's right.
01:01:23 Is it in Python?
01:01:24 Yes.
01:01:25 Yeah, it's in Python too.
01:01:27 You should talk to whoever is writing this or started this, but yeah, you can chain some prompts so that the output of one prompt will generate the next one.
01:01:36 There's this idea of agents.
01:01:38 There's this idea of estimating tokens before doing the request.
01:01:43 There's a bunch of really cool things that it does.
01:01:47 To me, the docs are not that comprehensive.
01:01:49 There's someone else that created, if you Google "LangChain cookbook," you'll find someone else that wrote what I thought was a more comprehensive way to start.
01:02:02 That one has a YouTube video and an .ipynb file that introduces you to the concepts in an interactive way.
01:02:09 I thought that was really, really good.
01:02:11 But yeah, so I was trying to use this.
01:02:14 I was like, oh, can you generate a bunch of LangChain code for me?
01:02:17 And it was like, I don't know of a project called LangChain.
01:02:19 It was created after 2021.
01:02:21 So it's like, I wish I could just say, just go read the GitHub, you know, just read it all, read the docs.
01:02:27 And then I'll ask you questions. But ChatGPT is not that great currently at learning things it doesn't know, for the reasons we talked about.
01:02:37 Bard is much more up to date.
01:02:40 So for those projects, you know, ChatGPT might be better at Django because it's old and settled, and it's better at writing code overall.
01:02:48 But Bard might be decent, pretty good, for the newer projects.
01:02:52 Right.
01:02:52 If you ask it for advice on how to do promptimize stuff, it's like, I don't know what that is.
01:02:55 Yeah, it's like, I've never heard of it. It might hallucinate too.
01:02:58 I think if you go and ask, it might make stuff up; I've seen it.
01:03:01 I prompt it about promptimize,
01:03:02 and it sounds like it would be this, and it just makes up stuff.
01:03:05 So, not that great.
01:03:07 But yeah, absolutely.
01:03:08 I encourage people to try it, you know, for any subtask that you're trying to do, to see if it can help you with it, and maybe try a variation on the prompt.
01:03:17 And then, you know, if it's not good at it, do it the old way.
01:03:21 But yeah, it might be better, too, for those familiar with the idea of functional programming, where each function is more deterministic and can be reasoned about and unit tested in isolation.
01:03:32 So yeah, it's going to be better at that, because it doesn't know about all your other packages and modules.
01:03:37 So it's really great for the utils functions that are very deterministic and functional.
01:03:41 Yeah.
01:03:42 Super great at that.
01:03:43 Another thing, and you tell me when we run out of time, but another thing that was really interesting too, was using it on some of the concepts in promptimize and on writing the blog post itself.
01:03:55 Right.
01:03:56 And things like, hey, I'm thinking about the differences in the properties of test-driven development as it applies to prompt engineering.
01:04:03 Here's my blog post, but can you think of other differences between the two that are very core?
01:04:09 And, you know, can you talk about the similarities and the differences?
01:04:13 And it would come up with like, just really, really great ideas, right?
01:04:17 Brainstorming and just very smart at mixing concepts.
01:04:21 I do think one thing that's not a great idea is to just say, write this for me.
01:04:25 But if you've got something in mind and you're going to say, give me some ideas, or how should I go about this?
01:04:30 Where should I go deeper into this?
01:04:31 And then you use your own creativity to create that.
01:04:34 That's a totally valid use.
01:04:36 I wouldn't feel like, Oh, I'm reading this AI crap.
01:04:39 Right.
01:04:40 It just brought out some insights that you had forgotten to think about.
01:04:42 And now you are.
01:04:44 Right.
01:04:45 Or when it fails, instead of saying, I got it to fail,
01:04:47 AI is wrong,
01:04:48 I'm smarter than it,
01:04:49 you're like, wait, is there something I can try? You know, here's what it didn't get right,
01:04:54 and why, like, what did I need to tell it?
01:04:56 So you can go and edit your prompt or ask a follow-up.
01:04:59 And generally it will do better.
01:05:02 Yeah.
01:05:03 I think also you can ask it to find bugs or security vulnerabilities.
01:05:05 Yeah.
01:05:06 Right.
01:05:07 You're like, here's my 30-line function.
01:05:09 Do you see any bugs?
01:05:10 Do you see it?
01:05:12 Do you see any security vulnerabilities?
01:05:14 And it's like, yeah, you're concatenating the string straight into SQL or something.
01:05:20 Yeah.
01:05:21 The rigor.
01:05:21 Yeah.
01:05:22 The rigor stuff too.
01:05:23 or like, you know, I would say writing a good docstring, writing doctests, writing unit tests, reviewing the logic, that kind of stuff.
01:05:32 It does type hints, right?
01:05:34 If you're like me, I don't really like to write type hints upfront,
01:05:39 but I'm like, can you just like sprinkle some type hints on top of that?
01:05:42 Retrofit this thing for me.
01:05:43 Yeah.
01:05:44 That's it.
01:05:45 Just make it that production grade.
01:05:46 You know, one thing that's interesting too: you would think I'm a big TDD guy.
01:05:50 But I don't do tests.
01:05:52 It's just not my thing.
01:05:54 I like to write code.
01:05:55 I don't think about what I'm going to use the function for
01:05:58 before I write it. But it's good at generating unit tests for a function too.
01:06:06 And then I think what's interesting with promptimize too is, you might think you want deterministic, what I call prompt cases or test cases, but you can say, I've written, you know, five or six of these.
01:06:19 Can you write variations on that theme too?
01:06:22 So you can use it to generate test cases in the case of TDD, but also the opposite: for promptimize, you can get it to generate stuff dynamically itself.
01:06:33 Yeah.
01:06:34 It's, it's pretty amazing.
01:06:35 It is.
01:06:36 It is pretty amazing.
01:06:37 Let's maybe close this out, but I'll ask you one more question.
01:06:39 Okay.
01:06:40 Can I do one more?
01:06:41 Can I show one more thing?
01:06:42 Since it's a Python podcast: if you go to my repo, the repo for promptimize, under examples, there's one called Python exam.
01:06:49 Here we go.
01:06:50 Something like this.
01:06:51 Yeah.
01:06:52 So I'm going to do this stuff right here.
01:06:53 So here, I wrote a prompt that asks the bot to generate a Python function.
01:07:01 Then I sandbox it and bring the function it wrote into the interpreter.
01:07:05 And then I test it.
01:07:06 So I say, write a function that tests if a number is a prime number and returns a Boolean.
01:07:12 And then I test it; I have, you know, six static test cases for it.
01:07:16 Or, write a function that finds the greatest common divisor of two numbers.
01:07:19 Right.
01:07:20 Then behind the scenes, and we won't get into the class above, but the class above basically interacts with it, gets the input, then runs the tests and compiles the results.
01:07:30 Right.
01:07:30 So we could test, you know, how well GPT-3.5 compares to GPT-4.
01:07:35 Sure.
01:07:36 But I thought it was relevant for the Python folks on the line.
01:07:39 So we're testing how good it is at writing Python functions.
01:07:42 Write a function that generates the Fibonacci sequence.
01:07:44 Yeah.
01:07:45 Up to a number of terms, right?
01:07:47 It's easy to test.
01:07:48 So it's cool stuff.
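A rough picture of what that "Python exam" style of prompt case does under the hood: ask the model for a function, exec the returned code in an isolated namespace, then score it against static test cases. The class in the repo is more involved; ask_model here is a placeholder for the real API call:

```python
# Simplified stand-in for the "Python exam" example; ask_model is a placeholder
# for the real LLM call, and the repo's class does quite a bit more.

def run_python_exam(ask_model, instruction, function_name, test_cases):
    """Ask the model to write a function, exec it in an isolated namespace,
    and return the fraction of static test cases it passes."""
    code = ask_model(f"Write a Python function named {function_name} that {instruction}")
    namespace = {}
    exec(code, namespace)  # naive sandbox; fine for a local experiment
    func = namespace[function_name]
    passed = sum(1 for args, expected in test_cases if func(*args) == expected)
    return passed / len(test_cases)

# Example: "write a function that tests if a number is prime and returns a boolean"
prime_cases = [((2,), True), ((4,), False), ((17,), True), ((1,), False)]
# score = run_python_exam(ask_model, "tests if a number is a prime number", "is_prime", prime_cases)
```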
01:07:49 And what was your last question?
01:07:52 Oh, I was going to say something like, let's see how far we could push it.
01:07:56 I'll write
01:07:57 a Python function to use requests and Beautiful Soup to scrape the titles of episodes of Talk Python To Me.
01:08:08 Oh yeah.
01:08:09 And then, yeah, it is.
01:08:11 And you know, one thing that's a pain in the butt for podcast people is to write up what we talked about.
01:08:17 So you use another AI to get the transcripts.
01:08:20 It's like, can you write something that's going to leverage this library to transcribe it, summarize it, and publish it back, with SEO in mind?
01:08:30 Yeah, it's, it's really quite amazing.
01:08:33 It went through and said, okay, here's a function and it knows talkpython.fm/episode/all.
01:08:38 Use H, get the title.
01:08:39 And let's, let's just finish this out, Max.
01:08:42 I'll throw this into a.
01:08:43 An interpreter.
01:08:44 See if it runs.
01:08:45 An interpreter and I'll see if I can get it to run.
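For reference, the kind of function that prompt asks for might look something like the sketch below. The URL path and the assumption that episode titles live in table links are guesses about the page layout, so the selector may need adjusting:

```python
# Hedged sketch of the scraper the prompt describes; the URL path and CSS selector
# are assumptions about the page layout and may need adjusting.
import requests
from bs4 import BeautifulSoup

def scrape_talk_python_titles(url="https://talkpython.fm/episodes/all"):
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Assumption: each episode appears as a link inside the episode table.
    return [a.get_text(strip=True) for a in soup.select("table a") if a.get_text(strip=True)]

# titles = scrape_talk_python_titles()
# print(titles[:5])
```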
01:08:47 Hey, you know, what's really interesting too is that you can give it a random function.
01:08:51 Like, you can write a function, you know, a certain function that does certain things,
01:08:57 and you say, if I give this input to this function, what is it going to output?
01:09:02 And it doesn't have an interpreter, but it can interpret code like you and I do.
01:09:07 Right.
01:09:08 Like an interview question: hey, here's a function.
01:09:11 If I input a three as the value, what's going to come back, what's going to return?
01:09:14 So it's able to do it: follow the loops, you know, follow the if statements, and basically just trace through the logic.
01:09:20 Yeah.
01:09:21 Yeah.
01:09:22 Another thing I think would be really good is to say, here's a function.
01:09:25 Explain to me what it does.
01:09:26 Oh yeah.
01:09:27 It's super great at that.
01:09:28 It's great at that for SQL too.
01:09:29 Here's a, there's a stupid long SQL query.
01:09:31 Can you explain to me
01:09:32 what it does?
01:09:33 It's like, if the explanation is too long, can you just summarize that 300 lines, you know, in a hundred words?
01:09:37 Yeah.
01:09:38 Let's go step by step.
01:09:39 Let's go step by step.
01:09:40 What's this do?
01:09:41 But yeah, I mean, maybe a closing statement is: this stuff is changing our world.
01:09:46 For me, I'm interested in how it's changing how we're building products, you know, but also the core things, as a data practitioner, as a Python expert,
01:09:55 as a programmer; it's really changing the way people work, day after day, faster than we all think.
01:10:04 And across a lot of fields. Like, you know, you might understand pretty well how it's changing your daily workflow as a software engineer, but it's changing
01:10:12 people's workflows to do chemistry, or, like, in every field.
01:10:17 There's a lot we can leverage here,
01:10:19 if you use it well.
01:10:21 Right.
01:10:22 Yeah.
01:10:23 And apply it to, you know, whatever, vertical you want to think of it's, it's doing the same thing there.
01:10:28 Right.
01:10:29 Medicine all over.
01:10:30 Yeah.
01:10:31 A hundred percent.
01:10:32 Well, let's call it a wrap.
01:10:35 I think we're out of time here.
01:10:37 So, really quick before we quit: a PyPI package to recommend, maybe something AI-related that you found recently?
01:10:44 Like, all this stuff is cool.
01:10:45 People should check it out.
01:10:46 Promptimize,
01:10:47 I think, would be, you know, something to check out.
01:10:49 I think there's something called Future Tools that you could try to navigate; it shows all of the AI-powered tools that are coming out, and it's hard to keep up.
01:10:59 Yeah.
01:11:00 Yeah.
01:11:00 I think I have seen that.
01:11:01 Yeah.
01:11:01 And then, if you want to keep up daily with what's happening in AI, there's, you know, TLDR AI; they have, like, a daily list of what's relevant for the day.
01:11:12 I think it's hard to stay on top of, though; I prefer their weekly digest of what's going on in AI.
01:11:19 Otherwise it's just a stream of information.
01:11:23 yeah.
01:11:24 It's just kind of dizzying, and it's like, oh, this new model does this.
01:11:27 Like, I've got to change everything to that.
01:11:28 And then, then something else.
01:11:30 If you course-correct too often, it's just like, you know, you do nothing,
01:11:36 because the foundation is shifting too fast underneath you.
01:11:40 So yeah, absolutely.
01:11:41 Well, very cool.
01:11:42 All right.
01:11:43 And then final question, you know: if you're going to
01:11:45 write some Python code, what editor are you using these days?
01:11:47 I'm a Vim user.
01:11:48 Yeah.
01:11:49 I know it's not the best.
01:11:51 Like, I know all the limitations, but it's muscle memory.
01:11:54 And I'm a UX guy now, working on Superset.
01:11:58 I do appreciate the development of all the new IDEs and the functionality that they have.
01:12:04 I think it's amazing.
01:12:06 And it's just, for me, it's all like, I know all my bash commands and Vim commands.
01:12:11 Absolutely.
01:12:12 All right.
01:12:12 Well, Max, thanks for coming on the show, helping everyone explore this wild new frontier of AI and large language models.
01:12:19 And, you know, exploring while we're still relevant, because I don't know how long we're going to be relevant for.
01:12:26 So yeah.
01:12:27 Yeah.
01:12:28 Yeah.
01:12:29 Enjoy.
01:12:30 Enjoy while we can, right.
01:12:31 Get out there.
01:12:32 Either control the robots or be controlled by them.
01:12:34 So get on the right side of that.
01:12:36 All right.
01:12:37 Thanks again.
01:12:38 Thank you.
01:12:39 This has been another episode of talk Python to me.
01:12:43 Thank you to our sponsors.
01:12:44 Be sure to check out what they're offering.
01:12:46 It really helps support the show.
01:12:47 The folks over at JetBrains encourage you to get work done with PyCharm.
01:12:52 PyCharm professional understands complex projects across multiple languages and technologies.
01:12:57 So you can stay productive while you're writing Python code and other code like HTML or SQL.
01:13:03 Download your free trial at talkpython.fm/donewithpycharm.
01:13:08 Listen to an episode of compiler, an original podcast from Red Hat.
01:13:13 Compiler unravels industry topics, trends, and things you've always wanted to know about tech through interviews with the people who know it best.
01:13:20 Subscribe today by following talkpython.fm/compiler.
01:13:24 Want to level up your Python?
01:13:26 We have one of the largest catalogs of Python video courses over at Talk Python.
01:13:30 Our content ranges from true beginners to deeply advanced topics like memory and async.
01:13:35 And best of all, there's not a subscription in sight.
01:13:38 Check it out for yourself at training.talkpython.fm.
01:13:40 Be sure to subscribe to the show.
01:13:42 Open your favorite podcast app and search for Python.
01:13:46 We should be right at the top.
01:13:47 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm.
01:13:56 We're live streaming most of our recordings these days.
01:14:00 If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at /talkpython.fm/youtube.
01:14:08 This is your host, Michael Kennedy.
01:14:10 Thanks so much for listening.
01:14:11 I really appreciate it.
01:14:12 Now get out there and write some Python code.
01:14:14 Thanks for listening.