#410: The Intersection of Tabular Data and Generative AI Transcript

Recorded on Sunday, Apr 2, 2023.

00:00 AI has taken the world by storm.

00:02 It's gone from near zero to amazing in just a few years.

00:05 We have ChatGPT, we have Stable Diffusion.

00:08 What about Jupyter notebooks and pandas?

00:10 In this episode, we meet Justin Waugh, the creator of Sketch.

00:14 Sketch adds the ability to have conversational AI interactions about your pandas data frames, code and data right inside of your notebook.

00:24 It's pretty powerful and I know you'll enjoy the conversation.

00:27 This is Talk Python to Me, episode 410, recorded April 2nd, 2023.

00:34 Welcome to Talk Python to Me, a weekly podcast on Python.

00:48 This is your host, Michael Kennedy.

00:50 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython, both on fosstodon.org.

00:57 Be careful with impersonating accounts on other instances.

01:00 There are many.

01:01 Keep up with the show and listen to over seven years of past episodes at talkpython.fm.

01:06 We've started streaming most of our episodes live on YouTube.

01:10 Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows and be part of that episode.

01:18 This episode is brought to you by brilliant.org and us with our online courses over at Talk Python Training.

01:24 Justin, welcome to Talk Python to Me.

01:27 Thanks for having me.

01:28 It's great to have you here.

01:29 I'm a little suspicious.

01:30 I've got to know, I don't really know how to test whether you're actually Justin or an AI speaking as Justin.

01:38 What's the deal here?

01:40 There's no way to know now.

01:41 No, there's not.

01:42 Well, apparently I've recently learned from you that I can give you a bunch of Xs and other arbitrary characters.

01:48 This is like the test.

01:49 It's like asking Germans to say squirrel in World War II sort of thing.

01:54 It's the test.

01:55 It's the tell.

01:56 >> There's always going to be something.

01:58 Some sort of adversarial attack.

02:00 >> Exactly.

02:01 It's only going to get more interesting with this kind of stuff, for sure.

02:05 We're going to talk about using generative AI and large language models, paired with things like pandas, or consumed with straight Python with a couple of your projects, which are super exciting.

02:18 I think it's going to empower a lot of people in ways that it hasn't really been done yet.

02:24 So, awesome on that. But before we get to it, let's start with your story.

02:27 How did you get into programming and Python and AI?

02:30 I got into programming just like when I was a kid, TI-83, learning to code on that.

02:36 And then sort of just kept it up as a side hobby my whole life.

02:39 Didn't ever sort of choose it as my career path or anything for a while.

02:43 It chose you.

02:44 Yeah, it chose me. I dragged it along with me everywhere.

02:47 It's just like the toolkit. I went to undergrad for physics and electrical engineering, then did a physics PhD, experimental physics. During that, I did a lot of non-traditional languages, things like LabVIEW, Igor Pro, just weird Windows hotkey stuff, just trying to automate things. Yeah, sure. So I was sort of dragging that along. But along that path, I came across GPUs and used them for accelerating processing, specifically particle detection. So I was doing some electron counting in some detector experiments.

03:23 Is this like CUDA cores on NVIDIA type things?

03:25 Precisely.

03:26 Stuff like that. Okay.

03:26 That was the...

03:27 Was that with Python or was that with C++ or what?

03:29 At the time it was C++ and I made like a DLL and then called it from LabVIEW.

03:33 Wow, that's some crazy integration. That's like drag and drop programming too on the memory GPU.

03:39 Exactly. It was all over the place. Also, it was a distributed LabVIEW project.

03:43 We had multiple machines that were coordinating and doing this, all just to move some motors and measure electrons. But it got me into CUDA stuff, which was around the time that AlexNet and some of the very first neural net stuff was happening. And so those same convolutional kernels were the same exact code I was trying to write to run convolutions on these images. And so it's like, oh, look at this paper. Let me go read it. It seems like it's got so many citations. This is interesting. And then that sent me down the rabbit hole, like, oh, this AI stuff. Okay, let me go deep dive into this. And then I'd say that became the obsession from then. Yeah, it's been like eight years of doing that. Then after I left academia, I tried my own startup, then joined multiple others, and have been bouncing around as the founding engineer or early engineer at a startup for a while now. And yeah, Python has been the choice ever since late grad school and on. I would say it came in through the pandas and NumPy part, but then stuck for the scripting, just the power, you can throw anything together at any time.

04:47 So it seems like there were two groups that were just hammering GPUs, hammering them: crypto miners and AI people. But the physicists and some of those people doing large-scale research like that, they were the OG graphics card users, right? Way before crypto mining existed and really before AI was using graphics cards all that much. When I was looking at some of the code, pre-CUDA, there were some quant traders that were doing some crazy stuff off of shaders. It wasn't even CUDA yet, but it was shaders, and they were trying to extract the compute power out of them from that. So, it's like, look, if we could shave one millisecond off this, we can short them all day. Let's do it. But yeah, for the physicists, it's always been get as much compute as you can out of the devices you have, 'cause simulations are slow.

05:37 - Yeah, I remember when I was in grad school studying math, actually senior year, regular college, my bachelor's, the research team that I was on had gotten a used Silicon Graphics computer for a quarter million dollars and some Onyx workstations that we all were given.

05:54 I'm like, this thing is so awesome.

05:56 A couple years later, like an NVIDIA graphics card and like a simple PC would crush it.

06:02 Like that's $2,000.

06:03 It's just, yeah, there's so much power in those things to be able to harness them for whatever, I guess.

06:07 - Yeah, as long as you don't have too much branching, it works really well.

06:11 - Awesome.

06:12 So let's jump in and start talking about ChatGPT and some of this AI stuff before we totally get into the projects that you're working on, which bring that type of conversational generative AI to things like pandas, as you said. But to me, I don't know, maybe you've been more on the inside than I have, but to me, it looks like AI has been one of those things that's 30 years in the future forever, right?

06:42 It was like the Turing test and, oh, here's a chat.

06:45 I'm going to talk to this thing and see if it feels human or not.

06:48 And then, you know, there was like OCR and, and then all of a sudden we got self-driving cars, like, wait a minute, that's actually solving real problems.

06:57 And then we got things like ChatGPT, where people are like, wait, this can do my job.

07:02 It seems like just in the last couple of years, there's been some inflection point in this world.

07:08 What do you think?

07:09 Yeah, I think there's sort of like two key things that have sort of happened in the past, I guess, four or five years, four years, roughly.

07:15 One is the "Attention Is All You Need" paper from Google. This transformer architecture came out, and it's a good, very hungry model that can just absorb a lot of facts, almost like a nice learnable key-value store.

07:27 That's stuff.

07:28 So, and then the other thing is, is that GPUs, we were sort of just talking about GPU compute, but this has just been really, GPU compute has really been growing so fast.

07:38 If you look at the Moore's law equivalent type things, we're getting FLOPS out of these things faster and faster.

07:45 So it's been really nice.

07:47 I mean, obviously, there'll be a wall eventually, but it's been good riding this exponential curve for a bit.

07:52 Yeah. Is the benefit that we're getting from the faster GPUs, is that because people are able to program it better and the frameworks are getting better or because just the raw processing power is getting better?

08:03 All of the above.

08:04 Okay.

08:04 I think that there was a paper that tried to dissect this.

08:07 I wish I knew the reference, but I believe that their argument was that it was actually more the processing power was getting better. The actual like physical silicon, we're getting better at making that for specifically this type of stuff, but like how on exponentials. But yeah.

08:20 - Yeah, yeah, yeah, the power that those things take.

08:22 I have a gaming system over there and it has a GeForce 2070 Super.

08:29 I don't know what the Super really gets me, but it's better than the not Super, I guess.

08:33 Anyway, that one still plugs into the wall normal, but the newer ones like the 4090s, those things, the amount of power they consume, it's like space heater level of power.

08:45 Like, I don't know, 800 watts or something just for the GPU.

08:49 - You're gonna brown out the house if you plug in too many of those.

08:53 - Yeah, go look at those DGX A100 clusters and they've got like eight of those A100s just stacked right in there.

09:01 They take really beefy power supply.

09:03 - It's built directly attached to the power plant, electrical power plant.

09:09 Nuts, okay, so yeah, so those things are getting really, really massive.

09:12 Here's the paper, "Attention is all you need" from Google Research.

09:16 What was the story of that?

09:18 How's that play into things?

09:19 Yeah, so this came up during machine translation research at Google.

09:24 And the core thing is they present this idea of, instead of just stacking these layers of neural nets like we're used to, they replace the neural net layer with this concept of a transformer block.

09:38 A transformer block has this concept inside that's an attention mechanism.

09:43 The attention mechanism is effectively three matrices that you combine in a specific order. The logic is that one of the matrices takes you from your data to keys, so it's almost like identifying labels out of your data. Another one takes you from your data to queries.

10:02 And then it dot products those to find a weight. And then another one finds values for your things. So it takes this query and key, you get the weights for them, and then you take the ones that were closest to get those values from the third matrix. Yeah, it sort of looks a little bit like accessing an element in a dictionary, you know, a key-value lookup.

10:23 Yeah. And they, it's a differentiable version of that.
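The "differentiable dictionary lookup" described here can be sketched in plain Python. This is a minimal single-head illustration of scaled dot-product attention, not the paper's full transformer block; the helper names (`matmul`, `softmax`, `attention`) and the tiny identity matrices are assumptions made up for the example.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over the rows of x."""
    queries = matmul(x, w_q)  # data -> queries
    keys = matmul(x, w_k)     # data -> keys ("labels")
    values = matmul(x, w_v)   # data -> values
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Dot each query against every key, then normalize into weights.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # A soft lookup: blend the values by how well each key matched.
        blended = [sum(w * v[c] for w, v in zip(weights, values))
                   for c in range(len(values[0]))]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical toy input: three tokens with two features each.
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, identity, identity, identity)
```

Unlike a real dictionary, every key contributes a little to every lookup, which is what makes the operation differentiable and therefore trainable.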

10:27 And it did really well on their machine learning, sorry, on their machine translation stuff. I think one of the first big ones was this BERT model. And the architecture of the actual neural net code from that paper is effectively unchanged from this to ChatGPT.

10:45 Like there's a lot of stuff for like milking performance and increasing stability, but the actual like core essence of the actual mechanism that drives it, it's the same thing since this paper.

10:55 Interesting. It's funny that Google didn't release something sooner.

10:58 It's wild that they've had, they keep showing off that they've got like, equivalent or better things at different times, but then not releasing it.

11:06 When DALL-E happened, they had Imagen, I guess, I don't know how you say it.

11:10 And Parti, I think. They had two different, really good models, way better than DALL-E, way better than Stable Diffusion, that they had brought out and showed, demoed, but never released to be used.

11:23 So yeah, it's one of these, who knows what's going to happen with Google if they keep holding on to these things?

11:28 Yeah, well, I think there was some hesitation.

11:30 I don't know, holdups on accuracy or weird stuff like that.

11:34 Sure.

11:35 Yeah, now the cat's out of the bag.

11:38 Now it's happening.

11:38 Yeah, the cat's out of the bag and people are racing to do the best they can.

11:42 And it's going to have interesting consequences for us, both positive and negative, I think, but you know, let's leverage the positive once the cat's out of the bag anyway, right?

11:51 Yeah.

11:51 Hopefully you might as well like ask it questions for pandas.

11:54 So let's play a little bit with ChatGPT and maybe another one of these, image type things.

12:00 So I came in here and I stole this example from a blog post.

12:04 It's a pretty nice one about not using deeply nested code.

12:08 You can use a design pattern called a guard clause that will look and say, if the conditions are not right, we're going to return early, instead of having if something, if that also, if something else.

12:21 So there's this example that is written in a poor way and it says like it's checking for a platypus.

12:26 So it says if self.is_mammal, if self.has_fur, if self.has_beak, et cetera, et cetera.

12:33 It's all deeply nested.

12:35 And just for people who haven't played with ChatGPT, I put that in and it said, sure. I told it I wanted to call this arrow because it looks like an arrow, and it tells me a little bit about this.

12:45 So I'm going to ask it.

12:46 Please rewrite arrow to be less nested with guard clauses, right?

12:53 This is like a machine, right?

12:55 If I tell it this, what is it going to say?

12:57 Let's see.

12:58 It may fail, but it might, I think it's going to get it.

13:00 It's thinking.

13:01 I mistakenly put it into ChatGPT 4, which takes longer.

13:04 I might switch it over to three.

13:06 I don't know.

13:07 But the, the understanding of these things, there's a lot of hype about it.

13:11 Like, I think you kind of agree with me that maybe this hype is worthwhile.

13:16 Here we go.

13:16 So, look at this.

13:18 It rewrote.

13:19 It's a def is_platypus.

13:21 If not self.is_mammal, return False.

13:23 If not self.has_fur, and there's no more nesting.
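For readers following along without the screen, the before and after might look roughly like the sketch below. It is a reconstruction for illustration, not the code from the blog post or ChatGPT's actual output: the `Animal` class and the `has_tail` and `can_swim` checks are assumptions standing in for the "et cetera".

```python
from dataclasses import dataclass


@dataclass
class Animal:
    is_mammal: bool
    has_fur: bool
    has_beak: bool
    has_tail: bool   # assumed check, not named in the episode
    can_swim: bool   # assumed check, not named in the episode

    def is_platypus_nested(self) -> bool:
        """The 'arrow' shape: every extra condition adds a level of nesting."""
        if self.is_mammal:
            if self.has_fur:
                if self.has_beak:
                    if self.has_tail:
                        if self.can_swim:
                            return True
        return False

    def is_platypus(self) -> bool:
        """Guard clauses: bail out early, so the body stays flat."""
        if not self.is_mammal:
            return False
        if not self.has_fur:
            return False
        if not self.has_beak:
            return False
        if not self.has_tail:
            return False
        if not self.can_swim:
            return False
        return True
```

Both methods return the same answer for every combination of flags; only the shape of the code changes.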

13:25 That's pretty cool.

13:26 Right?

13:26 Yep.

13:26 I mean, I'm sure you've played with stuff like this, right?

13:29 Well, you've, yeah.

13:30 Big user of this.

13:31 I mean, this is kind of interesting, right?

13:32 Like it understood there was a structure and it understood what these were and it understood what I said, but what's more impressive is like, please rewrite the program to check for crocodiles, crocodiles, and you know, what is it gonna do here?

13:49 Let's see.

13:51 It says, sure, no problem.

13:52 Then it writes the function is_crocodile.

13:54 If not self.is_reptile.

13:56 If not self.has_scales.

13:58 If not self.has_long_snout.

14:01 Oh my gosh, it not only remembered, oh yeah, there's this new version I wrote in the guard clause format, but then it rewrote the tests.

14:10 I mean, and then it's explaining to me why it wrote it that way.

14:16 It's just, it's mind blowing, like how much you can have conversations with this and how much it understands things like code or physics or history.

14:26 What do you think?

14:27 - Yeah, it's really satisfying.

14:28 I love that it's such a powerful generalist at these things that are found on the internet.

14:34 So if it exists and it's in the training data, it can do so good at synthesizing, composing, bridging between them.

14:41 It's really satisfying.

14:42 So it's really fun asking it to-- as you're doing, rewriting, changing language.

14:46 I've been getting into a lot more JavaScript because I'm doing a bunch more front end stuff.

14:49 And just I sometimes will write a quick one-liner in Python that I know how to do with list comprehension.

14:54 And then I'll be like, make this for me in JavaScript because I can't figure out this, like, how to initialize an array with integers in it.

15:01 Yeah.

15:02 It's great for just like really quick spot checks.

15:05 And it also seems to know a lot about like really popular frameworks.

15:08 So you can ask it things that are surprisingly detailed, like how would you do CORS with requests in FastAPI.

15:16 And it can help you find that exact middleware.

15:19 You know, it's like boilerplate-y, but it's great that it can just be a source for that.

15:24 This portion of Talk Python to Me is brought to you by Brilliant.org.

15:27 You are a curious person who loves to learn about technology.

15:30 I know because you're listening to my show.

15:32 That's why you would also be interested in this episode's sponsor, Brilliant.org.

15:37 Brilliant.org is entertaining, engaging, and effective.

15:40 If you're like me and feel that binging yet another sitcom series is kind of missing out on life, then how about spending 30 minutes a day getting better at programming or deepening your knowledge and foundations of topics you've always wanted to learn better, like chemistry or biology over on Brilliant.

15:56 Brilliant has thousands of lessons, from foundational and advanced math to data science, algorithms, neural networks, and more, with new lessons added monthly.

16:05 When you sign up for a free trial, they ask a couple of questions about what you're interested in, as well as your background knowledge.

16:11 Then you're presented with a cool learning path to get you started right where you should be.

16:15 Personally, I'm going back to some science foundations.

16:18 I love chemistry and physics, but haven't touched them for 20 years.

16:22 So I'm looking forward to playing with PV=NRT, you know, the ideal gas law, and all the other foundations of our world. With Brilliant, you'll get hands-on on a whole universe of concepts in math, science, computer science, and solve fun problems while growing your critical thinking skills. Of course, you could just visit brilliant.org directly. Its URL is right there in the name, isn't it? But please use our link because you'll get something extra, 20% off an annual premium subscription. So sign up today at talkpython.fm/brilliant and start a seven-day free trial. That's talkpython.fm/brilliant. The link is in your podcast player show notes.

16:59 Thank you to brilliant.org for supporting the show.

17:04 It's insane. I don't know if I've got it in my history here. We're rewriting our mobile apps for Talk Python Training, for our courses, in Flutter, and we're having a problem downloading stuff concurrently using a particular library in Flutter.

17:21 And so I asked it, I said, "Hey, I want some help with a Flutter and Dart program."

17:26 It says, "What do you want?" "I'm using the Dio package.

17:29 Do you know it?" "Oh, yes, I'm familiar.

17:31 It does HTTP client stuff for Dart." "Okay, I want to download binary video files and a bunch of them given a URL.

17:37 I want to do them concurrently with three of them at a time.

17:40 Write the code for that." And boom, it just writes it.

17:43 using that library I told it about, not just Dart. So that's incredible that we can get this kind of assistance for knowledge and programming. Like you'll never find, I mean, I take that back. You might find that if there's a very specific Stack Overflow question or something, but if there's not a right-on question for it, you're not going to find it.
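The request in the episode was for Dart with the Dio package. As a rough Python analogue of "download a bunch of files, three at a time", a semaphore can cap how many downloads run at once. This sketch simulates the download with a short sleep instead of real HTTP, and `fake_download`, `download_all`, and the example.com URLs are all made-up names for illustration.

```python
import asyncio


async def fake_download(url, state):
    """Stand-in for a real HTTP download; tracks how many run at once."""
    state["active"] += 1
    state["peak"] = max(state["peak"], state["active"])
    await asyncio.sleep(0.01)  # pretend to wait on the network
    state["active"] -= 1
    return f"bytes of {url}"


async def download_all(urls, limit=3):
    """Download every URL, with at most `limit` downloads in flight."""
    semaphore = asyncio.Semaphore(limit)
    state = {"active": 0, "peak": 0}

    async def bounded(url):
        # Only `limit` coroutines can be inside this block at a time.
        async with semaphore:
            return await fake_download(url, state)

    # gather() returns results in the same order as the input URLs.
    results = await asyncio.gather(*(bounded(url) for url in urls))
    return results, state["peak"]


# Hypothetical URLs, purely for illustration.
video_urls = [f"https://example.com/video{i}.mp4" for i in range(10)]
results, peak = asyncio.run(download_all(video_urls))
```

Swapping `fake_download` for a real HTTP client call would keep the same structure; the semaphore is the part that enforces the three-at-a-time limit.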

18:01 - Yep, yep. I love when, you know, the Stack Overflow would exist for a variant of your question, but the exact one doesn't exist and you have to go grab like three of them to synthesize.

18:11 And it's just great at that.

18:13 It's also, yeah, it also is pretty good at fixing errors.

18:16 I mean, sometimes it can walk itself into just like lying to you repeatedly, but that's, you know, it's like--

18:21 - That's that accuracy part that's so problematic, yeah.

18:24 But you can also ask it like, here's my program, are there security vulnerabilities?

18:28 Or do you see any bugs?

18:29 And it'll find them.

18:30 - Yep, yeah, it's nuts.

18:32 - So people may be wondering, we haven't talked yet about your project Sketch, why I'm talking so much about ChatGPT.

18:38 So that is kind of the style of AI that your project brings to pandas, which we're gonna get to.

18:45 But I wanna touch on two more really quick AI things, then we'll dive into it.

18:49 The other is this just around images, just the ability to ask questions.

18:53 You've already mentioned three: DALL-E, Imagen, and then the other one from Google that they haven't put out yet, which I don't remember.

19:00 Midjourney is another, just the ability to say, "Hey, I want a picture of this." No, actually change it slightly like that.

19:07 It's mind-blowing.

19:08 - They're a lot of fun.

19:09 They're great for sparking creativity or having an idea and just getting to see it in front of you.

19:12 - I think it's more impressive to me than even this ChatGPT telling me how to write Dart, 'cause it's like, I gave you a blank canvas.

19:20 And so, for example, for this video, for this conversation, I'll probably use this as the YouTube thumbnail and image for this episode.

19:28 So I want an artificial intelligence panda.

19:31 And it came up, and I want it photorealistic in the style of National Geographic.

19:35 And so it gave me this panda.

19:37 You can see beautiful whiskers, but just behind the ear, you can see the fur is gone and it's like an android type of creature.

19:46 That's incredible.

19:46 I mean, that is a beautiful picture.

19:48 It's pretty accurate.

19:50 It's nuts that I can just go talk to these systems and ask them these questions.

19:54 - Yeah, I find it interesting you're comparing the ChatGPT and the Midjourney style and find the Midjourney ones impressive.

20:01 They are, I completely get it.

20:02 It's very visceral.

20:03 It's also, from another perspective, I think of like the weights and the scale of the model.

20:08 And the image, you know, these like image ones that like solve all images are so much smaller in scale than these like language ones that have all this other data and stuff.

20:16 So it's fascinating how complex language is.

20:18 - Yeah, I know the smarts is so much less, but just something about it actually came up with a creative picture that never existed.

20:27 Right, you could show this to somebody like, oh, that's an artificial panda, that's insane, right?

20:32 But it's, but I just gave it like a sentence or two.

20:35 Yeah, yeah, yeah. It's sort of a technical interpretation, but I love it because it's just phenomenal interpolation through semantically labeled space. So the words have meaning, and it understands the meaning and can move sliders of, like, well, I've seen lots of these machine things. I understand the concept of gears and this metal and the shiny texture and then the fur texture. They're very good at texture. Yeah, it's really great how it interprets all of that just to fit the small prompt.

21:05 >> Yeah. There are other angles which is frustrating.

21:07 I want it in the back of the picture, not the front. No, it's always in the center.

21:12 One more thing really quick, and this leads me into my final thing is GitHub Copilot.

21:17 GitHub Copilot is like this in your editor, which is insane.

21:21 You can just give it a comment or a series of comments, and it will write it.

21:25 I think ChatGPT is maybe more open-ended and more creative, but this is also a pretty interesting way to go.

21:32 I'm a heavy user of Copilot.

21:34 It's a weird crutch, and I'm slowly developing a need to have this in my browser.

21:40 I was on a flight recently and was without the internet, and Copilot wasn't working.

21:44 And I felt the like, I felt the difference.

21:47 I felt like I was like walking through mud instead of just like actually running a little bit.

21:51 And I was like, oh, I've been disconnected from my distributed mind.

21:55 I am broken partially.

21:57 Yeah.

21:58 So incredible.

21:59 So the last part I guess is like, what are the ethics of this?

22:03 Like I went on very positively about Midjourney, but how much of that is trained on copyrighted material?

22:10 Or there's GitHub Copilot.

22:11 How much of that is trained on GPL-based stuff that was in GitHub?

22:18 But when I use it, I don't have the GPL any longer on my code.

22:22 I might use it on a commercial code.

22:23 But just running it through the AI, does that strip licenses?

22:27 Or does it not?

22:28 There's a githubcopilotlitigation.com, which is interesting.

22:32 I mean, we might be finding out. There's also, I think, Getty, Getty Images.

22:38 I'm not a hundred percent sure, but I think Getty Images is suing one of these image generation companies.

22:44 I can't remember which one I don't.

22:46 Maybe Midjourney.

22:47 I don't think it's Midjourney.

22:48 I think it's Stable Diffusion, but anyway, it doesn't really matter.

22:50 Like there's a bunch of things that are pushing back against this.

22:53 Like, wait a minute, where did you get this data?

22:55 Did you have rights to use this data in this way?

22:57 And I mean, what are your thoughts on this, this angle of AI these days?

23:02 Yeah, I know. It sounds like I don't worry too much about it in either direction. I think, I guess as personal ethics, I believe in open source things, availability of things, because it just sort of accelerates collective progress. But that said, I also believe in slightly different social structures to help support people.

23:22 Like, I am, I guess, a personal believer in things like UBI or something in that direction. So when you combine those, I feel like things sort of work out kind of well. But it is still a thing that copyright exists, and that there is this sense of ownership: this is my thing, and I want to put licenses on it. And I think this sort of story started, presumably, though I wasn't really part of that conversation, when the internet came around and search engines happened, and Google could just go and pull up your thing from your page and summarize it in a little blurb on the page. Was that fair? What if it shows, you know, your shop, and it allows you to go buy that same product from other shops like it? I think the same things are showing up now. And in the same way, with the web and the internet, it was a large thing, but then, I don't know if it got quieter, but it sort of faded into the background. We sort of found new systems. It stopped being piracy and CDs and "the music industry is going to struggle," and hey, things like Spotify exist and streaming services exist and, like, I don't know what. - Better than ever, basically, yeah.

24:23 - So I think it's just evolution, and some things will change and adopt, some things will fall apart, and new things will be born.

24:30 It's just a great, it's a good time for lots of opportunity, I guess is the part that I'm excited about.

24:34 - Yeah, yeah, yeah, for sure.

24:35 I think that's definitely true.

24:37 Probably, you're probably right, it probably will turn out to be, old man yells at cloud, cloud doesn't care, sort of story, you know, in the end, where it's like, on the other hand, if somebody came back and said, you know, a court came back and said, you know what, actually anything trained on GPL and then you use Copilot on it, that's GPL, like that would have instantly mega effects, right?

25:02 - Yeah, yeah, and I guess there's also stuff like the, I didn't actually read the article, I only saw the headline, and yeah, that's the worst thing to do is to repeat a thing, which has a headline, but there was that Italy thing that I saw about, like, I don't know the extent.

25:15 - Yeah, yeah.

25:16 - Yeah, that was really clickbaity, but I didn't get a time to look at it yet.

25:19 - You could probably ask ChatGPT to summarize it for you.

25:22 - As long as it can do a Bing search, I guess, to get that updated.

25:25 - Yeah, yeah, yeah.

25:27 There's a lot of things playing in that space, right?

25:31 Some different places.

25:32 Okay, so yeah, very cool.

25:34 But as a regular user, I would say, regardless of kind of how you feel about this, at least this is my viewpoint right now.

25:40 It's like, regardless of how I feel about which side is right in these kinds of disputes, This stuff is out of the bag.

25:47 It's out there and available and it's a tool and it's like saying, you know, I don't wanna use spell check or I don't wanna use some kind of like code checking.

25:55 I just wanna write like in straight notepad because it's pure, right?

25:58 Like, sure you could do that, but there's these tools that will help us be more productive and it's better to embrace them and know them than to just like yell at them, I suppose.

26:08 - Yeah, a lot of accelerant you can get.

26:10 Really speed up whatever you wanna get done.

26:12 - Yeah, absolutely.

26:13 All right, so speaking of speeding up things, let's talk pandas.

26:18 And not even my artificial pandas, but actual programming pandas.

26:22 With this project that you all have from Approximate Labs called Sketch.

26:29 So Sketch is pretty awesome.

26:31 Sketch is actually why we're talking today, because I first talked about this on Python Bytes. This was sent over by Jake Furman to me, and he said, "You should check this thing out.

26:42 It's awesome." And yeah, it's pretty nuts.

26:46 So tell us about Sketch.

26:47 - Yeah, so even though I use Copilot, as sort of described already, and it's become a crutch, I found in Jupyter Notebooks, when I wanted to work with data, it just doesn't actually apply.

26:59 So on one side, it was sort of like missing the mark at times and so it was sort of like, how can I get this integrated into my flow, the way I actually work in a Jupyter Notebook?

27:08 Maybe I'm working in a Jupyter Notebook on a remote server and I don't wanna set up VS Code to do it, so I don't have Copilot at all.

27:13 Like there's a bunch of different reasons that I was just like in Jupyter.

27:16 It's a very different IDE experience.

27:17 - It is, yeah, it's super different.

27:19 But also you might want to ask questions about the data, not the structure of the code that analyzes the data, right?

27:24 - Exactly, yeah.

27:25 And so just a bunch of that type of stuff.

27:27 And then also at the other side, I was trying to find something that I could throw together that I thought was a strong demonstration of the value Approximate Labs is trying to chase, but wouldn't take me too much time to make.

27:39 So it was a, oh, I could probably just go throw this together pretty quickly.

27:42 I bet this is gonna be actually useful and helpful.

27:45 And so let's just do that.

27:46 And so I threw this on top of the actual library I was using, which was Sketch, and then shipped it.

27:52 So sort of shifted what the project was.

27:54 - Yeah, yeah.

27:55 So you also have this other project called Lambda Prompt.

27:58 And so were you trying to play around Lambda Prompt and then like see what you could kind of apply here to leverage it or is that--

28:05 - The full journey, I can get into it, started with data sketches.

28:09 I've left my last job to chase bringing the algorithm, like combining data sketches with AI, but just like the vague, like at that level.

28:19 - Tell us what a data sketch is real quick.

28:20 - Sure, yeah.

28:21 So a data sketch is a probabilistic aggregation of data.

28:24 So if you have, I think the most common one that people have heard of is HyperLogLog, and it's used to estimate cardinality.

28:31 So estimate the number of unique values in a column.

28:33 A data sketch is a class of algorithms that all sort of use roughly fixed-width, usually binary, representations, and then in a single pass, so they're O(N), will look at each row and hash the row and then update the sketch, or not necessarily hash, but they update this sketch object.

28:52 Essentially, those sketch objects also have another property that they are mergeable.

28:56 So you have this really fast O(N) pass to aggregate up, and you get this mergeability.

29:02 So you can map reduce it in trivial speeds.

29:06 The net result is that this like tight binary packed object can be used to approximate measures you were looking for on the original data.

29:15 So you could look at if you do a few of these, they're like data sketches.

29:19 You can go and estimate not just the unique count, but you can also estimate if this one column would join well with this other column, or you can estimate, oh, if I were to join this column to this column, then this third column that was on that other table would actually be correlated to this first column. So you get these, a bunch of different distributions.

29:39 You get a whole bunch of these types of properties.

29:43 And each sketch is sort of just, I would say, algorithmically engineered, like very, very engineered to be information theory optimal at solving one of those measures on the data.

29:54 And so they're tight packed binary representations.
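To make the idea concrete, here is a minimal sketch of one mergeable cardinality estimator, a k-minimum-values (KMV) sketch. It is not HyperLogLog and it is not the code Sketch uses; the class name `KMVSketch` and the parameter `k` are assumptions chosen for illustration. It shows the two properties described above: a single pass over the data with a fixed-size state, and a merge that gives the same state as sketching the combined data directly.

```python
import hashlib
import heapq


class KMVSketch:
    """Keep only the k smallest normalized hash values seen so far."""

    def __init__(self, k=256):
        self.k = k
        self.values = set()  # never grows beyond k entries

    @staticmethod
    def _hash(item):
        # Map any item to a pseudo-random float in [0, 1).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def update(self, item):
        # One update per row: a single O(N) pass over the column.
        self.values.add(self._hash(item))
        if len(self.values) > self.k:
            self.values.remove(max(self.values))

    def merge(self, other):
        # The k smallest values of the union are all the merged sketch needs.
        merged = KMVSketch(self.k)
        merged.values = set(heapq.nsmallest(self.k, self.values | other.values))
        return merged

    def estimate(self):
        # Fewer than k distinct hashes seen: the count is exact.
        if len(self.values) < self.k:
            return float(len(self.values))
        # Otherwise use the standard KMV estimator (k - 1) / k-th smallest hash.
        return (self.k - 1) / max(self.values)
```

Two sketches built on different chunks of a column can be combined with `a.merge(b)` and then queried with `estimate()`, which is what makes a map-reduce style aggregation cheap. The estimate is approximate once more than `k` distinct values have been seen.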

29:56 - All right, so you thought about, well, that's cool, but ChatGPT is cool too.

30:00 - Yeah, exactly.

30:00 The core thing was, so those representations aren't usable by AI right now. And when you actually go and use GPT-3 or something like this, you have to figure out a way to build the prompt to get it to do what you want. This was especially true in a pre-instruction tuning world, you had to really like, you had to play the prompt engineer role even more than you have to now. Now you could sort of get away with describing it to ChatGPT.

30:24 And one of the things that you really have to play the game of is how do you get all the information it's going to need into this prompt in a succinct but good enough way that it helps it do this. And so what Sketch was about was, rather than just looking at the context of the data, like the metadata, the column names and the code you have, also go get some representation of the content of the data, turn that into a string, and then bring that string in as part of the prompt. And then when it has that, it should do much better at actually generating code and generating answers to questions. And that's what Sketch was: a proof of concept of that, and it worked very well.

31:03 It really quickly showed how valuable actual data content context is.

31:08 Yeah, I would say people are-- it's resonating with people.

31:11 It's got 1.5 thousand stars on GitHub.

31:13 It looks about six months old, so that's pretty good growth there.

31:18 Yeah, January 16th was the day I posted it on Hacker News, and it had three, it was an empty repo at that point.

31:24 - Okay, three stars.

31:26 It's like me and my friends.

31:28 Okay, cool.

31:29 So this is a tool that basically patches pandas to add functionality or functions, literally, to pandas data frames that allows you to ask questions about it, right?

31:44 - Yep.

31:45 - So what kind of questions can you ask it?

31:46 What can it help you with?

31:47 - Yeah, so there's two classes of questions you can ask.

31:50 You can ask it the ask type questions.

31:53 These are sort of from that summary statistics data.

31:56 So from the general representation of your data, ask it to like give you answers about it.

32:02 Like what are the columns here?

32:03 You sort of have a conversation where it sort of understands the general, like shape of the data, general distributions, things like that, number of uniques, and like give that context to it, ask questions of that system.

32:16 And then the other one is ask it how to do something.

32:20 So you specifically can get it to write code to solve a problem you have.

32:23 You describe the problem you want and you can ask it to do that.

32:25 Right. I've got this data frame.

32:27 I want to plot a graph of this versus that, but color by the other thing.

32:31 Yep. And in the data space world, what I sort of decided to do is like in the demo here is just sort of walk through what are some standard things people want to ask of data?

32:41 Like what are those common questions that you hear like in Slack between a business team and an analyst team?

32:48 and it's just sort of like, oh, can you do this?

32:50 Can you get me this?

32:51 Can you tell me if there's any PII?

32:53 Is this safe to send?

32:54 Can I send the CSV around?

32:55 Can you clean up this CSV?

32:57 Oh, I need to load this into our catalog.

32:59 Can you describe each of these columns and check the data types all the way to, can you actually go get me analytics or plot this?

33:05 - Yeah, awesome.

33:07 So, and it plugs right into Jupyter Notebooks.

33:10 So you can just import it and basically installing Sketch, which is a pip or conda type thing, and then you just import it and it's good to go, right?

33:19 - Yep, using the Pandas extensions API, which allows you to essentially hook into their data frame, call back and register a--

33:28 - Interesting, so it's not as jammed on from the outside, it's a little more, plays a little nicer with Pandas rather than just like, we're gonna go to the class and just--

33:37 (laughing)

33:38 - Yeah, yeah, not full monkey patching here.

33:41 It's like hack supported, I think.

33:43 I don't see it used often, but it is somewhere in the docs.

33:46 Excellent.

33:47 But here it is.

33:47 So what I wanted to do for this is there's a, an example that you can do.

33:52 Like if you go to the repo, which obviously I'll link to, there's a video, which I mean, mad props to you because I review so many things, especially for the Python bites podcast, where it's a bunch of news items and new things, we're just going to check out.

34:04 And.

34:04 Yeah, we'll find people recommending GUI frameworks that have not a single screenshot.

34:10 Other types of things.

34:13 Like I have no way to judge what this thing even might look like. It's like, "What does it even make?

34:18 "I don't even know, but somebody put a lot of effort, "but they didn't bother to post an image." And you posted a minute and a half animation of it going through this process, which is really, really excellent.

34:28 So people can go and watch that one minute, one minute 30 video.

34:32 But there's also a Colab, open in Google Colab, which gives you a running interactive variant here.

34:40 So you can just follow along, right?

34:42 And play these pieces.

34:44 Oh, it requires me to sign up on and run it, That's okay.

34:47 Let me just talk people through some of the things it does and you can tell me what it's doing, how it's doing that, like how people might find that advantageous.

34:55 So import sketch, import pandas as pd, standard.

34:59 And then you can say pandas read CSV and you give it one from like some example CSV that you got on one of your GitHub repos, right?

35:08 Or in your account.

35:08 - Yeah, I found one online and then added just random synthetic data to it.

35:12 - Yeah, like, oh, here's a data dump.

35:13 No, just kidding.

35:14 So then you need to go to that data frame called sales data. You say .sketch.ask as a string, what columns might have PII, personal identifying information in them? Awesome. And so it comes, tell me how that works and what it's doing here.

35:33 So it does, I guess, it has to build up the prompt, which is sent to GPT, so to OpenAI's specific completion endpoint. Building up the prompt, it looks at the data frame and does a bunch of summarization stats on it. So it calculates uniques and sums and things like that. There's two modes in the back end: it either uses sketches to do those or just uses df.describe type stuff. And then it pulls those summary stats together for all the columns and throws them together with the rest of the prompt I have; we could go find it. But then it sends that prompt. It also grabs some information off of inspect. So it sort of walks the stack up to go and check the variable name, because the data frame is named sales data. So it actually tries to go find that variable name in your call stack so that when it writes code, it writes valid code. It puts all that together, sends it off to OpenAI, gets code back, and uses Python AST to parse it and check that it's valid. If it's not valid Python code, or it tried to import something that you don't have, it will ask it to rewrite once. So this is sort of like an iterative process.

36:36 So it takes the error and sends it back to OpenAI, like, "Hey, fix this code." Or in this case, with ask, it actually just sends that exact same prompt, but it changes the last question to, "Can you answer this question off of the information?"
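The flow just described (summarize the columns, find the variable name on the call stack, check the returned code, ask for one rewrite) can be sketched in plain Python. This is an illustration under stated assumptions, not Sketch's actual implementation: every function name here is hypothetical, a dict of lists stands in for the pandas data frame, and `complete` is any callable that takes a prompt string and returns text.

```python
import ast
import importlib.util
import inspect


def find_variable_name(obj, frame, default="df"):
    """Walk up the call stack looking for a variable bound to obj."""
    while frame is not None:
        for name, value in list(frame.f_locals.items()):
            if value is obj and not name.startswith("_"):
                return name
        frame = frame.f_back
    return default


def summarize_columns(table):
    """One line of summary statistics per column (count, uniques, sample)."""
    lines = []
    for name, values in table.items():
        present = [v for v in values if v is not None]
        lines.append(
            f"{name}: count={len(present)}, unique={len(set(present))}, "
            f"sample={present[:3]!r}"
        )
    return "\n".join(lines)


def build_prompt(table, question):
    # Start the search in the caller's frame, where the user's variable lives.
    name = find_variable_name(table, inspect.currentframe().f_back)
    return (
        f"The table is stored in a variable named `{name}`.\n"
        f"Column summaries:\n{summarize_columns(table)}\n"
        f"Question: {question}\n"
    )


def check_generated_code(code):
    """Return None if code parses and its imports resolve, else an error string."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return f"SyntaxError: {exc.msg} (line {exc.lineno})"
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules = [node.module]
        else:
            continue
        for module in modules:
            top_level = module.split(".")[0]
            if importlib.util.find_spec(top_level) is None:
                return f"ImportError: module {top_level!r} is not installed"
    return None


def generate_code(complete, prompt, max_rewrites=1):
    """Ask for code; on a failed check, send the error back for a rewrite."""
    code = complete(prompt)
    for _ in range(max_rewrites):
        error = check_generated_code(code)
        if error is None:
            break
        code = complete(
            f"{prompt}\nThe previous answer failed with:\n{error}\n"
            f"Fix this code:\n{code}"
        )
    return code
```

The check only parses the code and resolves its imports. It never executes anything, so code that passes can still be wrong at runtime.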

36:53 This portion of Talk Python to Me is brought to you by us over at Talk Python Training with our courses.

36:59 I want to tell you about a brand new one that I'm super excited about.

37:03 Python web apps that fly with CDNs.

37:06 If you have a Python web app, you want it to go super fast.

37:10 Static resources turn out to be a huge portion of that equation.

37:14 Leveraging a CDN could save you up to 75% of your server load and make your app way faster for your users.

37:21 And this course is a step-by-step guide on how to do it.

37:25 And using a CDN to make your Python apps faster is way easier than you think.

37:29 So if you've got a Python web app and you would like to have it scaled out globally, if you would like to have your users have a much better experience and maybe even save some money on server hosting and bandwidth, check out this course over at talkpython.fm/courses.

37:45 It'll be right up there at the top.

37:47 And of course, the link will be in your show notes.

37:49 Thank you to everyone who's taken one of our courses.

37:51 It really helps support the podcast.

37:53 Now back to the show.

37:56 - And so that sounds very, very similar to my arrow program.

38:01 Rewrite it with guarding clauses, redo it.

38:05 I gave you this data and this code and I asked you this question and you can have a little conversation but at some point you're like, "All right, well we're going to take what it gives me after a couple of rounds at it." - Yeah, I take the first one that doesn't, that passes an import check and passes AST linting.

38:20 When you use small models, you run into not valid Python a lot more, but with these ones, it's almost always good.

38:25 It's ridiculous. Yeah, it's crazy. Okay, so it says the columns that might have PII in them are credit card, SSN and purchase address. Okay, that's pretty excellent. And then you say, all right, sales data.sketch.ask, can you give me a friendly name for each column and output this as an HTML list, which is parsed as HTML and rendered in Jupyter Notebook accurately, right? So it - This is index, well that's an index.

38:52 - This one ends up being the same.

38:53 - It's not a great, this one is not a great example because it doesn't have to like infer 'cause the names are like order space date, right?

39:02 Instead of order, like maybe lowercase O and then like attach the big D or whatever.

39:07 But it'll give you some more information.

39:09 You can like kind of ask it questions about the type of data, right?

39:13 - Yeah, exactly.

39:14 I found this is really good at, if you play the game and you just name all your columns, like call one, call two, call three, call four, and you ask it, "Give me new column names for all of these." It gives you something that's pretty reasonable based off of the data, so pretty useful for that.

39:25 - So it's like, "Oh, these look like addresses, "so we'll call that address, "and this looks like social security numbers "and credit scores and whatnot." - Yep, so it can really help with that quick first onboarding step.

39:35 - Yeah, so everyone heard it here first.

39:37 Just name all your columns, one, two, three, four, and then just get help.

39:41 AI, what do we call these?

39:44 All right, so the next thing you did in this demo notebook was you said salesdata.sketch.

39:55 And this is different from before, I believe, because before you were saying ask, and now you say howto: create some derived features from the address.

40:05 Tell us about that.

40:08 - Yeah, this is the one that actually is the code writing.

40:08 It's essentially the exact same prompt, but the change is the very end.

40:12 It says, return this as Python code that you can execute to do this.

40:17 So instead of answering the question directly, answer the question with code that will answer the question.

40:18 - Right, write a Python line of code that will answer this question given this data, something like that.

40:23 - Yep, yep, something like that.

40:24 I don't remember exactly anymore, it's been a while.

40:26 But yeah, I iterated a little bit until it started working and I was like, okay, cool.

40:31 And so, ask it for that and then it spits back code.

40:34 And that was, it sounds overly simple, but that was it.

40:38 That was the moment and I was just like, oh, I could just ask it to do my analytics for me.

40:42 And it's just all the, every other feature and it just sort of became like apparently solvable with this.

40:47 And the more I played with it, the more it was just, I don't have to think about, I don't even have to go to Google or Stack Overflow to ask the question to get the API stuff for me.

40:55 I could, from zero to I have code that's working is one step in Jupyter.

40:59 - So you wrote that how-to and you gave it the question, and then it wrote the lines of code, and you just drop that into the next cell and just run it, right?

41:07 And so in this example, it said, well, we can come up with city, state, and zip code and by writing a vector transform, by passing a lambda that'll pull out the city from the string that was the full address and so on, right?

41:21 - Yeah. - That's pretty neat.

41:22 - Yeah, it's fun to see what it does.

41:25 Again, these things are always probabilistic, but it also usually serves as a great starting point, even if it doesn't get it right.

41:30 - Yeah, sure, you're like, "Oh, okay, I see.

41:32 "Maybe that's not exactly right, "'cause we have Europeans and their city, "maybe their zip code are in different orders sometimes." but it gives you something to work with pretty quickly, right, by asking just, what can I do?
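For a sense of what that kind of generated code looks like, here is a hedged sketch of the address split. The function name `split_address` and the "street, city, STATE ZIP" layout are assumptions for illustration, not the exact output from the demo. In pandas the same function could be passed to a column's `.apply(...)`.

```python
def split_address(address):
    """Split 'street, city, STATE ZIP' into parts; return None for other layouts."""
    parts = [part.strip() for part in address.split(",")]
    if len(parts) != 3:
        return None  # e.g. addresses that put the postal code somewhere else
    state_zip = parts[2].split()
    if len(state_zip) != 2:
        return None
    return {
        "street": parts[0],
        "city": parts[1],
        "state": state_zip[0],
        "zip": state_zip[1],
    }
```

Returning None for unexpected layouts, rather than guessing, makes the rows that need a second look easy to find.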

41:44 And then another one, this one's a little more interesting.

41:47 Instead of just saying, like, well, what other things can we pull out?

41:50 It's like, this gets towards the analytics side, right?

41:53 It says, get the top five grossing states for the sales data, right?

41:58 And it writes a group by, some sorts, and then it does a head given five, and that's pretty neat.
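The shape of that answer, group, sum, sort, take five, can be written without pandas as a rough sketch. The field names `state`, `price`, and `quantity` are assumptions, not the demo's actual column names.

```python
from collections import defaultdict


def top_grossing_states(rows, n=5):
    """Sum revenue per state, then return the n largest as (state, total) pairs."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["state"]] += row["price"] * row["quantity"]
    # pandas analogue: groupby(...).sum().sort_values(ascending=False).head(n)
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)[:n]
```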

42:04 Tell us about this.

42:05 I mean, I guess it's about the same, right?

42:06 Just asking more deep questions.

42:08 They all feel pretty similar to me. I guess I could jump towards things that I wanted to put next, but that weren't reliable enough to really make the cut.

42:19 I wanted to have it go, like, my question was, go build a model that predicts sales for the next six months and then plot it on a 2D plot with a dotted line for the predicted plot. And it would try, but it would always do something off, and I found I always had to break up the prompt into smaller--

42:37 >> Get intern-level code back.

42:39 >> Yeah, yeah.

42:41 >> It sort of works.

42:43 >> It was fun getting it to train models, but it was also its own separate thing I sort of didn't play with too much. And there's another part of Sketch that I guess is not in this notebook, I didn't realize.

42:55 Because you have to use the OpenAI API key, but it's the Sketch apply. And that's the--

43:01 I would say this one is another power tool. This one I don't really talk about. I don't even include it in the video, because it's not just plug and play; you do have to go set an environment variable. And so it's like, man, it's one step further than I want to add. It's not terrible, but it's a step. And so what it does is it lets you apply a completion endpoint of whatever your design row-wise. So for every single row, you can go and apply and run something. So if every row of your pandas data frame is some serialized text from a PDF or a file in your directory structure, and you just load it as a data frame, you can do df.sketch.apply, and it's almost the exact same as df.apply, but the thing you put in as your function is now just a Jinja template that will fill in your column variables for that row and then ask GPT to continue completing. So I think I did silly ones, like here's a few states, and then the prompt is extract the state for it, or so I think.
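A rough sketch of the row-wise idea, under stated assumptions: `apply_prompt` is a hypothetical name, the standard library's `string.Template` (with `$column` placeholders) stands in for a Jinja template, and `complete` is any callable that sends a prompt to a completion endpoint and returns the text.

```python
from string import Template


def apply_prompt(rows, template, complete):
    """Fill the template from each row's columns and complete it, one row at a time."""
    prompt_template = Template(template)
    return [complete(prompt_template.substitute(row)) for row in rows]
```

Each row is a mapping of column name to value, so a template such as "What is the capital of $state?" produces one prompt per row and one completion per prompt, which becomes the new column.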

44:00 Right, extract the capital of the state?

44:03 Yeah, yeah. So that's just pure information extraction from it, but you can sort of like this grows into a lot more.

44:10 So does that come out of the data or is that coming out of open AI where like it sees where is the capital of state and it sees New York, it's like, okay, well, all right, Albany.

44:21 This is purely extracting out of the model weights. Essentially, this is not like a factual extraction. So this is probably a bad example. Actually, the better example I did once was, what are some interesting colors that are good for each state, and it just came up with sort of flag-ish colors or sports team colors. That was sort of fun when it wrote them out as hex. You can also do things like, if you have a large text document, or actually I'll even do the more common one that I think everybody actually wants, which is you have messy data. You have addresses that are syntactically messy, and you could say normalize these addresses to be in this form, and you sort of just write one example, run apply, and you get a new column that is that cleaned-up data.

45:00 - Yeah, incredible.

45:02 Okay, couple things here.

45:04 It says I can use, can directly call OpenAI and not use your endpoints.

45:10 So at the moment, it kind of proxies through web service that you all have that somehow checks stuff or what does that do?

45:17 - Yeah, it was just a pure ease of use.

45:19 I wanted people to be able to do pip install and import sketch and actually get it because I know how much I use things in Colab or in Jupyter notebooks on weird machines and remembering an environment variable, managing secrets, it's like this whole overhead that I don't want to deal with.

45:34 And so I wanted to just offer a light weight way if you just want to be able to use it.

45:39 But I know that that's not sufficient for security.

45:42 People are going to be conscious of these things and want to be able to not go through my proxy thing that's there for help.

45:47 So I offered this up.

45:48 - What's next?

45:49 Do you have a roadmap for this?

45:51 Are you happy where it is and you're just letting it be, or do you have grand plans?

45:55 - I don't have much of a roadmap for this right now.

45:57 I'm actually, I guess there's like grand roadmap, which is like at the company scale, what we're working on.

46:02 I would say that if this, we're really trying to solve data and with AI just in general.

46:08 And so these are the types of things we hope to open source and just give out there.

46:12 Like actually everything we're hoping to open source.

46:14 But the starting place is gonna be a bunch of these like smaller toolkits or just utility things that hopefully save people time or very useful.

46:21 the grand thing we're working towards, I guess, is this more, like the, it's the full automated data stack.

46:28 It's like the dream, I think, that people have wanted, where you just ask it questions and it goes and pulls the data that you need.

46:34 It cleans it, it builds up the full pipeline, it executes the pipeline, it gets you to the result, and it shows you the result, and you look, you can inspect all of that, that whole DAG, and say, yes, I trust this.

46:44 So we're working on getting full end-to-end.

46:46 - So when I asked about that arrow program, I said, I think this will still do it.

46:51 I think this will probably work again.

46:54 And it did, which is awesome, just the way I expected.

46:56 But AI is not as deterministic as read the number seven.

47:01 If seven is less than eight, do this, right?

47:05 Like, what is the repeatability?

47:07 What is the sort of experience of doing this?

47:10 Like, I ran it, oh, I ran it again.

47:13 Is it gonna be pretty much the same or is it gonna have like, what's the mood of the AI when it gets to you?

47:19 - This is sort of a parameter you can, There's a little bit of a parameter you can set if you want to play that game with the temperature parameter on these models at higher and higher temperatures.

47:26 You get more and more random, but it can also truly be out of left field random if you go too high temperature.

47:32 - Okay, but you get maybe more creative solutions.

47:35 - Yeah, you can sometimes get that. And as you move towards zero, it gets more and more deterministic.

47:39 Unfortunately, for really trying to do some good, provable build chain type things with hashing and caching and stuff, it's not fully deterministic even at zero temperature.

47:49 But that's just, I think it's worth thinking about, but at the same time, run it once, see the answers that it gives you, comment that business out, and just put that as Markdown, you know, freeze it.

48:01 Like, memorialize it in Markdown, because you don't need to ask it over and over what columns have PII. Like, well, probably the same ones as last time.

48:10 We're just kind of like, right.

48:12 These columns, credit cards, social security and purchase address.

48:15 They have that.

48:16 And so now, you know, right.

48:18 - Yeah, there's always-- - Is that a reasonable way to think about it?

48:20 - I think, yeah, if you wanna get determinism or the performance is a thing that you're worried about, yeah, you can always cache.

48:26 I think, however you do it, comments or actually with systems.
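Caching "with systems" could be as small as the following sketch. `cached_complete` is a hypothetical helper, `complete` is any completion callable, and a plain dict stands in for whatever persistent store you would actually use. Keying on a hash of the prompt plus parameters means a repeated question returns the stored answer instead of a fresh, possibly different one.

```python
import hashlib
import json


def cached_complete(complete, prompt, cache, **params):
    """Return the stored answer for (prompt, params), calling the model only once."""
    payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    if key not in cache:
        cache[key] = complete(prompt, **params)
    return cache[key]
```

Because the parameters are part of the key, changing the temperature counts as a different question and triggers a new call.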

48:29 - Sure, sure, sure.

48:30 Or that like, how do I do that group by sorting business?

48:35 Like, you don't have to ask that over and over.

48:37 Once it gives you the answer, you just--

48:39 - Yeah, my workflow when I use Sketch, definitely I ask the question, I copy the code, and then I go delete the question or ask it a different question for my next problem that I have.

48:47 - Yeah. - I like, it's not code that, it is a little bit like vestigial when you like save your notebook at the end and you sort of want to go back and delete all the questions you asked because you don't need to rerun it when you actually just go to execute the notebook later.

49:01 - Yeah, that makes a lot of sense.

49:02 And plus you look smarter if you don't have to show how you got the answers.

49:06 - Look at this beautiful code that's even commented.

49:08 - Yeah, exactly.

49:09 I guess you could probably ask it to comment your code, right, if you wanted to. - Yeah, you can ask it to describe, there's been some really cool things where people will throw like assembly at it and ask it to translate to different languages so they can interpret it.

49:21 Or you could do really fun things like cross language, cross, I guess I'll say like levels of abstraction.

49:27 You could sort of ask it to describe it like at a very top level, or you can get really precise like for this line, what are all the implications if I change a variable or something like that.

49:34 - Yeah, that's really cool.

49:35 I suppose you could do that here.

49:36 Can you converse with it?

49:38 You can say, okay, you gave me this.

49:41 I guess what's the word?

49:41 Does it have like tokens and context like ChatGPT does?

49:44 Can you say, okay, that's cool, but I want it as integers, not as strings, or I don't know.

49:51 - I did not include that in this.

49:53 There was a version that had something like that, where I was sort of just keeping the last few calls around, but it quickly became, it didn't align with the Jupyter IDE experience, 'cause you end up scrolling up and down, and you have too much power over how you execute in a Jupyter notebook, so your context can change dramatically by just scrolling up. And trying to, via inspect, look across a Jupyter Notebook was just a whole other nightmare.

50:17 So I didn't try and like extract the code out of the notebook so that it could understand the local context.

50:21 - You could go straight to ChatGPT or something like that, take what it gave you and start asking it questions.

50:26 Okay, so another question that I had here about this.

50:30 So in order for it to do its magic, like you said the really important thought or breakthrough idea you had was like, not just the structure of the pandas code or anything like that, but also a little bit about the data.

50:42 What are the privacy implications of me asking this question about my data?

50:47 Suppose I have a super duper secret CSV. Should I not ask or howto on it?

50:53 Or what is the story there?

50:56 What's the, if I work with data, how much sharing do I do of something I might not want to share if I ask a question about it?

51:04 I'd say the same discretion you'd use if you would copy a row, a few rows of that data into ChatGPT to ask it a question about it. That is the level of concern, I guess, you should have. Specifically, I am not storing these things. But I know OpenAI is, at least it was, it seems like they're going to start getting towards like a 30-day thing. But so there's a little bit of, yeah, I mean, you're sending your stuff over the wire, like over the network, if you do this. And to use these language models, until they come local, until these things like LLaMA and Alpaca get good enough, yeah, they're gonna be remote.

51:38 Actually, that could be a fun, sorry.

51:39 I just now thought, that could be a fun thing.

51:40 Like, just go get alpaca working with Sketch so that it can be fully local.

51:45 - Oh, interesting, like a privacy preserving type of deal.

51:48 - Yeah, I hadn't actually, yeah, that's the power of these smaller models that are almost good enough.

51:53 I could probably just like quickly throw that in here and see if it, you know, maybe has a wider audience.

51:58 - You have an option to not go through your API but directly go to OpenAI.

52:03 you could have another one to pick other options, right?

52:06 Potentially.

52:07 - Yep, yep, yep.

52:08 The interface to these, one thing that I think is not, maybe it's talked about it more in other places, but I haven't heard as much excitement about it, is that the APIs have gotten pretty nice for this whole space.

52:21 They're all, the idea of a completion endpoint is pretty straightforward.

52:25 You send it some amount of text, and it will continue that text.

52:28 And it's such a, it's so simple, but it's so generalizable.

52:32 You could build so many tools off of just that one API endpoint, essentially.

52:35 And so combine that with an embedding endpoint, and you sort of have all you need to make complex AI apps.

52:41 - It's crazy.

52:42 Speaking of making AI apps, maybe touch a bit on your other project, LambdaPrompt.

52:49 - Yeah, LambdaPrompt.

52:50 Yeah, LambdaPrompt.

52:51 - But wait, before you get into it, mad props for like Greek letter.

52:55 Like that's a true physicist or mathematician.

52:57 I can appreciate that there.

52:59 - Yeah, yeah.

52:59 I was excited to put it everywhere, but then of course, these things don't.

53:03 Playing games with character sets and websites.

53:07 I'm the one that causes... I both feel the pain, have to clean the data that I also put into these systems.

53:13 Yeah, yeah. People are like, "A prompt? Why is the A so italicized?" - I don't get it. - Yeah, yeah.

53:18 - Okay. Tell us about this. - Yeah, so this one came...

53:21 I was working with... This is pre-GPT.

53:25 This was October. I guess it was right around ChatGPT coming out, like around that time, but I was really just messing around a lot with completion endpoints, as we're talking about, and I kept rewriting the same request boilerplate over and over. And then I also kept rewriting f-strings that I was trying to send in, and I was just like, Jinja templates solve this already. There already is formatting for strings in Python. Let me just use that, compose that into a function, and just let me call these completion endpoints. I don't want to think of them as an API endpoint. RPC is a nice mental model, but I want to use them as functions. I want to be able to put decorators on them. I want to be able to use them both async or not async in Python. I just want to have this as a thing that I can call really quickly with one line and do whatever I need to with it. And so I threw this together. It's very simple, honestly. I mean, the hardest part was just getting all the layers right. There's actually two things: you can make a prompt, and I wrap any function as a prompt, so not just these calls to GPT, and then I do tracing on it. So as you get into the call stack, every input and output is something you can get hooked into and trace with some call traces. So there's a bunch of just weird stuff to make the utility nice, but functionally, as you can see here, you just import it. You write a Jinja template with the class, and then you use that object that comes back as a function, and your Jinja template variables get filled in and your result is the text string that comes back out of-- - Interesting, and some people might be thinking, Jinja, okay, well I gotta create an HTML file and all that. Like, no, it's just a string that has double curlies for turning stuff into strings within the string.

55:06 Kind of a different way to do f strings as you were hinting at.
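The "prompt as a function" idea can be sketched in a few lines. This is not LambdaPrompt's actual API: the `Prompt` class below is hypothetical, `string.Template` with `$name` placeholders stands in for Jinja's double curlies, and `complete` is any callable that maps a prompt string to text. The `calls` list is a very small version of the input and output tracing mentioned above.

```python
from string import Template


class Prompt:
    """Wrap a template and a completion callable so a prompt behaves like a function."""

    def __init__(self, template, complete):
        self.template = Template(template)
        self.complete = complete
        self.calls = []  # (filled prompt, result) pairs, kept as a simple trace

    def __call__(self, **variables):
        filled = self.template.substitute(**variables)
        result = self.complete(filled)
        self.calls.append((filled, result))
        return result
```

Because the object is an ordinary callable, it can be used in a for loop, passed to another function, or wrapped in a decorator like any other Python function.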

55:10 - Yeah, yeah.

55:11 There was two pieces here.

55:12 I realized as I was doing this also, I think I sort of mentioned with Sketch, I really often was taking the output of a language model prompt, doing something in Python, or actually I can do a full example of the SQL writing like exploration we did.

55:27 But we would do these things that were sort of run GPT-3 to ask it to write the SQL.

55:34 You take the SQL, you go try and execute it, but it fails for whatever reason.

55:39 And you take that error and you say, "Hey, rewrite it." So we talked about that sort of pattern, which is sort of like rewriting.
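
The rewrite-on-error pattern could look something like the sketch below. It assumes an in-memory SQLite database, and `fake_write_sql` is a hypothetical stand-in for the model call: it returns a broken query first and a corrected one once it is shown an error message.

```python
import sqlite3
from typing import Optional


def fake_write_sql(question: str, error: Optional[str] = None) -> str:
    """Hypothetical model call: the first draft is wrong, the rewrite is right."""
    if error is None:
        return "SELECT nme FROM users"  # typo in the column name
    return "SELECT name FROM users"


def query_with_rewrites(conn, question, write_sql=fake_write_sql, max_tries=3):
    """Ask for SQL, run it, and feed any database error back for a rewrite."""
    error = None
    for _ in range(max_tries):
        sql = write_sql(question, error)
        try:
            return sql, conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            error = str(exc)  # the failure becomes context for the next attempt
    raise RuntimeError(f"no working query after {max_tries} tries: {error}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('ada')")

sql, rows = query_with_rewrites(conn, "List all user names")
print(sql, rows)
```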

55:44 Another one of the patterns was increase the temperature, ask it to write the SQL, you get like 10 different SQL answers.

55:50 In parallel, and this is where the async was really important for this, because I just wanted to use asyncio.gather and run all 10 of these truly in parallel against the OpenAI endpoint, get 10 different answers to the SQL, run all 10 queries against your database, then pool the ones that successfully ran and see which ones gave the same answer the most often.

56:11 And that's probably the correct answer.
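
A rough sketch of that sample-in-parallel-then-vote pattern, under stated assumptions: `fake_write_sql` is a hypothetical async stand-in for a high-temperature model call (it just cycles through a fixed list of candidate queries), the database is in-memory SQLite, and four samples are used instead of ten to keep it short.

```python
import asyncio
import sqlite3
from collections import Counter

# Pretend these are different samples returned at a high temperature.
CANDIDATES = [
    "SELECT COUNT(*) FROM orders",
    "SELECT COUNT(id) FROM orders",
    "SELECT COUNT(*) FROM ordrs",   # fails: no such table
    "SELECT MAX(id) FROM orders",   # runs, but gives a different answer
]


async def fake_write_sql(question: str, sample: int) -> str:
    """Hypothetical async model call; returns one candidate query per sample."""
    await asyncio.sleep(0)  # stands in for network latency
    return CANDIDATES[sample % len(CANDIDATES)]


def try_query(conn, sql):
    """Run a query, returning its rows as a hashable tuple or None on failure."""
    try:
        return tuple(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


async def most_common_answer(conn, question, n=4):
    # Request all n samples concurrently.
    queries = await asyncio.gather(*(fake_write_sql(question, i) for i in range(n)))
    # Keep only the queries that actually ran.
    results = [r for r in (try_query(conn, q) for q in queries) if r is not None]
    if not results:
        raise RuntimeError("none of the candidate queries ran")
    # The answer that the most queries agree on wins.
    return Counter(results).most_common(1)[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?)", [(10,), (20,), (30,)])

answer, votes = asyncio.run(most_common_answer(conn, "How many orders are there?"))
print(answer, votes)
```

Agreement between independently sampled queries is only a heuristic for correctness, which matches the "probably the correct answer" framing above.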

56:13 And just chaining that stuff, it's like very Pythonic functions. Like, you can really just imagine, oh, I just need to write a for loop, I just need to run this function, take the output, and feed it into another function. Very procedural. But with all the abstractions in the OpenAI API and everything else, there was nothing else really at the time. But even the new ones that have come out, like LangChain, that have sort of taken the space by storm now, are not really just trying to offer the minimal ingredient, which is the function. And to me, it was just like, if I can just offer the function, I can write a for loop, I can store a variable and then keep passing it in. You could do so many different emergent behaviors just starting with the function and then simple Python scripting on top of it.

56:57 And some interesting stuff here in LambdaPrompt. So you can kind of start it, set it up. I don't know, with ChatGPT you can tell it a few things: I'm going to ask you a question about a book. Okay. The book is a choose-your-own-adventure book. Okay. Now here I'm going to... like, you kind of prepare it, right? There's probably a more formal term for that, but you can do this here. Like you can say, "Hey, system, you are a type of bot," and that creates you an object that you can have a conversation with. And you say, "What should we get for lunch?" and your type of bot is pirate. And so it'll say, "As a pirate, I would suggest we have some hearty seafood," or whatever. Right? Like, that's beyond what you're doing with Sketch. I mean, obviously this is not so much for code. This is like conversing with Python rather than in Python. I don't know. And your editor.

57:43 This one was, the OpenAI chat API endpoint came out and I was just like, oh, I should support it. So that's what this is. I wanted to be able to use Jinja templates inside of the conversation.

57:53 So you can imagine a conversation that is prepared with like seven steps back and forth.

57:58 But you want to hard code the conversation, like how the flow of the conversation was going. And you want to template it so that on message three, it puts in your new context problem; on message four, it puts in the output from another prompt that you ran; on another message, it's this other data thing.

58:12 And then you ask it to complete this. The intent is, like, it's arbitrarily complex, but still something like that would be, you know, just three lines or so in LambdaPrompt.
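
A templated conversation of this kind might be sketched as below. This is an illustration, not LambdaPrompt's API: `render` is a small regex stand-in for Jinja, `build_messages` and the variable names (`bot_type`, `context`, `earlier_output`) are hypothetical, and the output uses the role/content message shape that the OpenAI chat endpoint accepts.

```python
import re

# Matches {{ name }} placeholders, the double-curly style Jinja uses.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, **variables: object) -> str:
    """Fill {{ name }} placeholders; a tiny stand-in for a real Jinja render."""
    return _PLACEHOLDER.sub(lambda m: str(variables[m.group(1)]), template)


# A hard-coded conversation flow with templated slots in individual messages.
CONVERSATION = [
    ("system", "You are a {{ bot_type }} bot."),
    ("user", "Here is some context: {{ context }}"),
    ("assistant", "{{ earlier_output }}"),  # output of a previously run prompt
    ("user", "What should we get for lunch?"),
]


def build_messages(conversation, **variables):
    """Render every message, producing role/content dicts for a chat endpoint."""
    return [
        {"role": role, "content": render(text, **variables)}
        for role, text in conversation
    ]


messages = build_messages(
    CONVERSATION,
    bot_type="pirate",
    context="we are docked near a fish market",
    earlier_output="Arr, noted.",
)
for message in messages:
    print(message)
```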

58:21 The idea was that it would offer up a really simple API for this.

58:25 - Well, the other thing that's interesting is you have a sync and an async version.

58:28 So that's cool.

58:29 People can check that out.

58:31 Also a way to make it hosted as a web service with, say, FastAPI or something like that.

58:37 - Yeah.

58:38 - You can make it a decorator if you like.

58:41 An @prompt decorator.

58:43 - Yeah, on any function you can just throw @prompt and it wraps it with the same class so that all the magic you get from that works.
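
One way a decorator like this could wrap both sync and async functions in a single class while tracing inputs and outputs is sketched below. The `Prompt` class, the `prompt` decorator, and the `TRACE` list here are hypothetical reimplementations for illustration, not the library's own code.

```python
import asyncio
import functools
import inspect

TRACE = []  # one (function name, args, kwargs, result) entry per completed call


class Prompt:
    """Wrap any function, sync or async, and record each call's input and output."""

    def __init__(self, func):
        self.func = func
        functools.update_wrapper(self, func)  # keep the original name and docstring

    def __call__(self, *args, **kwargs):
        if inspect.iscoroutinefunction(self.func):
            # Return a coroutine so callers can await the wrapped async function.
            return self._call_async(*args, **kwargs)
        result = self.func(*args, **kwargs)
        TRACE.append((self.func.__name__, args, kwargs, result))
        return result

    async def _call_async(self, *args, **kwargs):
        result = await self.func(*args, **kwargs)
        TRACE.append((self.func.__name__, args, kwargs, result))
        return result


def prompt(func):
    """Decorator form: wrap a plain function in the Prompt class."""
    return Prompt(func)


@prompt
def shout(text):
    return text.upper()


@prompt
async def greet(name):
    await asyncio.sleep(0)
    return shout(f"hello {name}")  # an async prompt calling a sync prompt


print(asyncio.run(greet("ada")))
for entry in TRACE:
    print(entry)
```

The inner call finishes first, so its trace entry is recorded before the outer one.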

58:50 The server bit is I took, so FastAPI has that sort of like inspection on the function part.

58:58 I did a little bit of middleware to get the two to play happily together.

59:01 And then all you have to do is import FastAPI and then run, you know, gunicorn that app.

59:07 And it's two lines, and any prompts you have made become their own independent REST endpoint, where you can just do a GET or a POST to it, and it returns the response from calling the prompt.

59:20 But these prompts can also be these chains of prompts.

59:22 Like one prompt can call another prompt, which can call another prompt, and those prompts can call async to not async, back to async, and things like that, and it should work.

59:30 Pretty sure.

59:32 This one actually, I did test everything, as far as I know.

59:34 I'm pretty sure I've got pretty good coverage.

59:35 - Yeah, super cool.

59:37 All right, well, getting a little short on time, but I think people are gonna really dig this, especially Sketch.

59:43 I think there's a lot of folks out there doing pandas that would love an AI buddy to help them do things, like not just analyze the code, but the data as well.

59:55 - Yeah, I think for anybody, I know it is for me, it's just like Copilot in your VS Code IDE: Sketch in your Jupyter IDE takes almost nothing to add, and whenever you're just sort of sitting there and you think you're about to alt-tab to go to Google, you can just try sketch.ask, and it's surprising how often that sketch.ask or sketch.howto gets you way closer to a solution without even having to leave your environment.

01:00:18 - It's like a whole other level of autocomplete, for sure, and super cool.

01:00:23 All right, now before I let you out of here, you gotta answer the final two questions.

01:00:26 If you're gonna write some Python code, and it's not a Jupyter notebook, what editor are you using?

01:00:32 It sounds to me like you may have just given a strong hint at what that might be.

01:00:35 Yeah, I've switched almost entirely to VS Code, and I've been really liking it with the remote development. I work across many machines, both cloud and local, and some five or six different machines are my primary working machines. I use the remote VS Code thing, and I have a unified environment that gives me the terminal, the files, and the code all in one, and Copilot on all of them.

01:00:59 - It's wild.

01:01:01 All right, and then notable PyPI package, I mean, pip install sketch, you can throw that out there if you like.

01:01:06 It's pretty awesome.

01:01:07 But anything you've run across, you're like, oh, this is, people should know about this.

01:01:11 - Yeah, let's see. - Doesn't have to be popular, just like, wow, this is cool.

01:01:13 - In the, I guess these two are very popular, but in the data space, I really, I'm a huge fan of Ray and also Arrow.

01:01:22 Like, I use those two tools as like my back end bread and butter for everything I do.

01:01:27 And so those have just been really great work.

01:01:30 - Apache Arrow, right? - Yes.

01:01:32 - And then Ray, I'm not sure.

01:01:35 - Ray is a distributed scheduling compute framework.

01:01:38 It's sort of like a-- - Right, right, right.

01:01:41 Yeah, I remember seeing about this, yeah.

01:01:42 - This is, it is, I'm parsing, we didn't talk about other things, but I'm like parsing Common Crawl, which is like 25 petabytes of data.

01:01:50 And Ray is great, it's just a workhorse.

01:01:52 It's really useful.

01:01:55 Like, I find it's so snappy and good, but it offers everything I need in a distributed environment.

01:02:01 So I can write code that runs on 100 machines and not have to think about it.

01:02:05 It works really well.

01:02:06 - That's pretty nuts.

01:02:07 Not as nuts as ChatGPT and Midjourney, but still pretty nuts.

01:02:10 So before we call it a day, do you wanna just tell people about Approximate Labs?

01:02:15 It sounds like you guys are making some good progress.

01:02:18 Might have some jobs for people that work in this kind of area as well.

01:02:21 - Yeah, so we're working at the intersection of AI and tabular data.

01:02:25 So anything related to training these large language models, and also tabular data.

01:02:29 So things with columns and rows.

01:02:31 We are trying to solve that problem, try and bridge the gap here, because there's a pretty big gap.

01:02:35 We have three main initiatives that we're working on, which is we're trying to build up the data set of data sets.

01:02:40 So just like The Pile or The Stack or LAION-5B, these big data sets that we use to train all these big models, we're making our own on tabular data.

01:02:49 We are training models.

01:02:50 So this is actually training large language models, training these full transformer models.

01:02:55 And then we're also building apps like Sketch, like UIs, things that are actually there to help make data more accessible to people.

01:03:01 So anything that helps people get value from data and make it open source, that's what we're working on.

01:03:07 We just raised our seed round, so we are now officially hiring.

01:03:11 So looking for people who are interested in the space and who are enthusiastic about these problems.

01:03:16 - Awesome.

01:03:17 Well, very exciting demo libraries, I guess, however you call them, but I think these are neat.

01:03:26 People are going to find a lot of cool uses for them.

01:03:30 Excellent work and congrats on all the success so far.

01:03:32 It sounds like you're just starting to take off.

01:03:36 - Thank you.

01:03:37 - All right, Justin, final call to action.

01:03:39 People want to get started, let's pick Sketch.

01:03:40 People want to get started with Sketch, what do you tell them?

01:03:43 - Just pip install it.

01:03:44 Give Sketch a try, pip install it, import it, and then throw it on your data frame.

01:03:49 - Awesome.

01:03:50 And then ask it questions or how-to's, yeah?

01:03:49 - Yeah, yeah, whatever you want.

01:03:51 If you really want to and you trust the model, throw some applies and have it clean your data for you.

01:03:56 - Cool, awesome.

01:03:58 All right, well, thanks for being on the show.

01:04:00 Coming in here and telling us about all your work.

01:04:02 It's great. - Yeah, thank you.

01:04:03 - Yeah, see you later. - Thanks for having me.

01:04:05 - This has been another episode of Talk Python to Me.

01:04:09 Thank you to our sponsors.

01:04:10 Be sure to check out what they're offering.

01:04:11 It really helps support the show.

01:04:14 Stay on top of technology and raise your value to employers or just learn something fun in STEM at brilliant.org.

01:04:21 Visit talkpython.fm/brilliant to get 20% off an annual premium subscription.

01:04:28 Want to level up your Python?

01:04:29 We have one of the largest catalogs of Python video courses over at Talk Python.

01:04:33 Our content ranges from true beginners to deeply advanced topics like memory and async.

01:04:38 And best of all, there's not a subscription in sight.

01:04:41 Check it out for yourself at training.talkpython.fm.

01:04:44 Be sure to subscribe to the show. Open your favorite podcast app and search for Python.

01:04:49 We should be right at the top.

01:04:50 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the Direct RSS feed at /rss on talkpython.fm.

01:04:59 We're live streaming most of our recordings these days.

01:05:03 If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at talkpython.fm/youtube.

01:05:11 This is your host, Michael Kennedy.

01:05:12 Thanks so much for listening.

01:05:13 I really appreciate it.

01:05:15 Now get out there and write some Python code.

01:05:16 [MUSIC]
