
#410: The Intersection of Tabular Data and Generative AI Transcript

Recorded on Sunday, Apr 2, 2023.

00:00 AI has taken the world by storm.

00:01 It's gone from near zero to amazing in just a few years.

00:05 We have ChatGPT, we have Stable Diffusion.

00:07 What about Jupyter Notebooks and Pandas?

00:10 In this episode, we meet Justin Waugh, the creator of Sketch.

00:13 Sketch adds the ability to have conversational AI interactions about your Pandas data frames, code, and data

00:21 right inside of your notebook.

00:23 It's pretty powerful, and I know you'll enjoy the conversation.

00:26 This is Talk Python to Me, episode 410, recorded April 2nd, 2023.

00:31 Welcome to Talk Python to Me, a weekly podcast on Python.

00:47 This is your host, Michael Kennedy.

00:49 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython,

00:55 both on fosstodon.org.

00:57 Be careful with impersonating accounts on other instances.

00:59 There are many.

01:00 Keep up with the show and listen to over seven years of past episodes at talkpython.fm.

01:05 We've started streaming most of our episodes live on YouTube.

01:09 Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows

01:15 and be part of that episode.

01:17 This episode is brought to you by Brilliant.org and us with our online courses over at Talk Python Training.

01:24 Justin, welcome to Talk Python to Me.

01:27 Thanks for having me.

01:28 It's great to have you here.

01:29 I'm a little suspicious.

01:30 I've got to know, do I really know how to test whether you're actually Justin or an AI speaking as Justin?

01:37 What's the deal here?

01:40 Yeah, there's no way to know now.

01:41 No, there's not.

01:42 Well, apparently I've recently learned from you that I can give you a bunch of Xs

01:46 and other arbitrary characters.

01:47 This is like the test.

01:49 It's like asking the Germans to say squirrel in World War II sort of thing.

01:54 Like it's the test.

01:55 It's the tell.

01:56 There's always going to be something.

01:58 It's some sort of adversarial attack.

01:59 Exactly.

02:01 It's only going to get more interesting with this kind of stuff for sure.

02:05 So we're going to talk about using generative AI and large language models paired with things like Pandas

02:13 or consumed with straight Python with a couple of your projects, which are super exciting.

02:18 I think it's going to empower a lot of people in ways that it hasn't really been done yet.

02:24 So awesome on that.

02:25 But before we get to it, let's start with your story.

02:27 How did you get into programming in Python and AI?

02:30 Let's see.

02:30 I got into programming in just like when I was a kid, TI-83, learning to code on that.

02:36 And then sort of just kept it up as a side hobby my whole life.

02:40 Didn't ever sort of choose it as my career path or anything for a while.

02:44 It chose you.

02:44 Yeah, it chose me.

02:46 It just, I dragged it along with me everywhere.

02:47 It's just like the toolkit.

02:49 I went to undergrad for physics and electrical engineering, then did a physics PhD, experimental physics.

02:57 During that, I did a lot of non-traditional languages, things like LabVIEW, Igor Pro,

03:02 just weird Windows, Windows hotkey for like just trying to like automate things.

03:08 Yeah, sure.

03:08 So just was sort of dragging that along.

03:11 But along that path, sort of came across GPUs and used it for accelerating processing,

03:16 specifically like particle detection.

03:17 So it was doing some like electron counting in some just detector experiments.

03:23 Is this like CUDA cores on NVIDIA type thing?

03:25 Precisely.

03:26 Stuff like that.

03:26 Okay.

03:27 And was that with Python or was that with C++ or what?

03:29 At the time it was C++ and I made like a DLL and then called it from LabVIEW.

03:33 Wow, that's some crazy integration.

03:35 It's like drag and drop programming too on the memory GPU.

03:39 Exactly.

03:40 It was all over the place.

03:41 Also had, it was a distributed LabVIEW project.

03:43 We had multiple machines that were coordinating and doing this all just to move some motors

03:49 and measure electrons.

03:50 But it got me into CUDA stuff, which then at the time was around the time

03:55 that the like AlexNet, some of these like very first neural net stuff was happening.

03:59 And so those same convolutional kernels were the same exact code that I was trying to write

04:03 to run like convolutions on these images.

04:04 And so it's like, oh, look at this like paper.

04:06 Oh, let me go read it.

04:07 It seems like it's got so many citations.

04:09 This is interesting.

04:09 And then like that sent me down the rabbit hole of like, oh, this AI stuff.

04:12 Oh, okay.

04:13 Let me go deep dive into this.

04:14 And then that just, I'd say that like became the obsession from then.

04:18 So it's been like eight years of doing that.

04:20 Then sort of just after I left academia, tried my own startup, then joined multiple others

04:26 and just sort of have been bouncing around as the sort of like founding engineer,

04:30 early engineer at startups for a while now.

04:32 And yeah, Python has been the choice ever since like late grad school and on.

04:38 I would say it sort of like came through the pandas and NumPy part, but then stuck for the scripting,

04:44 like just power, just can throw anything together at any time.

04:47 So it seems like there were two groups that were just hammering GPUs, hammering them,

04:53 crypto miners and AI people.

04:57 but the physicists and some of those people doing large scale research like that,

05:01 they were the OG graphics card users, right?

05:04 Way before crypto mining existed and really before AI was using graphics cards

05:09 all that much.

05:10 When I was like looking at some of the code, like pre-CUDA, there were some like

05:13 quant traders that were doing some like crazy stuff off of shaders.

05:17 Like it wasn't even CUDA yet, but it was shaders and they were trying to like

05:19 extract the compute power out of them from that.

05:22 So...

05:23 Look, if we could shave one millisecond off this, we can short them all day,

05:27 let's do it.

05:28 But yeah.

05:29 Yeah.

05:29 The physicists, I mean, it's always been like, yeah, it's always the get

05:33 as much compute as you can out of the, you know, devices you have because simulations are slow.

05:37 Yeah.

05:38 I remember when I was in grad school studying math, actually senior year,

05:41 regular college, my bachelor's, the research team that I was on had gotten a used

05:47 Silicon Graphics computer for a quarter million dollars and some Onyx workstations

05:53 that we were all given.

05:54 I'm like, this thing is so awesome.

05:56 A couple years later, like an NVIDIA graphics card and like a simple PC would crush it.

06:01 Like that's $2,000.

06:03 It's just, yeah, there's so much power in those things to be able to harness them

06:06 for whatever, I guess.

06:07 Yeah.

06:07 As long as you don't have too much branching, it works really well.

06:10 Awesome.

06:11 So let's jump in and start talking about, let's start to talk about ChatGPT

06:18 and some of this AI stuff before we totally get into the projects that you're working on,

06:24 which brings that type of conversational generative AI to things like Pandas,

06:30 as you said.

06:31 But to me, I don't know how, maybe you've been more on the inside than I have,

06:36 but to me, it looks like AI has been one of those things that's 30 years

06:41 in the future forever, right?

06:42 It was like the Turing test and, oh, here's a chat, I'm going to talk to this thing

06:46 and see if it feels human or not.

06:48 And then, you know, there was like OCR and then all of a sudden we got self-driving cars,

06:55 like, wait a minute, that's actually solving real problems.

06:57 And then we got things like ChatGPT where people are like, wait, this can do my job.

07:02 It seems like it, just in the last couple of years, there's been some inflection point

07:07 in this world.

07:08 What do you think?

07:09 Yeah, I think there's sort of like two key things that have sort of happened

07:12 in the past, I guess, four or five years, four years, roughly.

07:15 One is the attention is all you need paper from Google, sort of this transformer

07:19 architecture came out and it's sort of a good, very hungry model that can just sort of

07:23 absorb a lot of facts and just like a nice learnable key value store almost that's stuck.

07:28 So, and then the other thing is, is the GPUs.

07:31 We were sort of just talking about GPU compute, but this has just been really,

07:34 GPU compute has really been growing so fast.

07:38 If you like, look at the like Moore's law equivalent type things, like it's just,

07:41 the rate at which we're getting flops out of these things is getting faster and faster.

07:45 So, it's been really nice.

07:46 I mean, obviously there'll be a wall eventually, but it's been good riding this like

07:51 exponential curve for a bit.

07:52 Yeah, is the benefit that we're getting from the faster GPUs, is that because

07:57 people are able to program it better and the frameworks are getting better

08:00 or because just the raw processing power is getting better?

08:03 All of the above.

08:04 Okay.

08:04 I think that there was a paper that tried to dissect this.

08:07 I wish I knew the reference, but I believe that their argument was that it was

08:11 actually more the processing power was getting better.

08:13 The actual like physical silicon were getting better at making that for specifically

08:17 this type of stuff.

08:17 But like on exponentials, but yeah.

08:20 the power that those things take, I have a gaming system over there and it has a

08:26 GeForce 2070 Super.

08:29 I don't know what the Super really gets me, but it's better than the not Super,

08:32 I guess.

08:33 Anyway, that one still plugs into the wall normal, but the newer ones, like the 4090s,

08:40 those things, the amount of power they consume, it's like space heater level of power.

08:45 Like, I don't know, 800 watts or something just for the GPU.

08:48 You're going to brown out the house if you plug in too many of those.

08:53 Yeah.

08:53 Go look at those DGX A100 clusters and they've got like eight of those A100s just stacked

09:00 right in there.

09:00 They take really beefy power supplies.

09:03 It's built right directly attached to the power plant, electrical power plant.

09:08 Nuts.

09:09 Okay, so yeah, so those things are getting really, really massive.

09:12 Here's the paper Attention is All You Need from Google Research.

09:15 What was the story of that?

09:18 How's that play into things?

09:19 Yeah, so this came up during like machine translation sort of research at Google

09:23 and the core thing is they present this idea of instead of just stacking

09:30 these like layers of neural nets like we're sort of used to, they replace the like

09:34 neural net layer with this concept of a transformer block.

09:38 A transformer block has this concept inside that's an attention mechanism.

09:42 The attention mechanism is effectively three matrices that you combine in a specific order

09:48 and the sort of logic is that one of the matrices takes you from your data

09:54 to keys, so it's almost like identifying labels out of your data.

09:58 Another one takes you from your data to queries, and then it dot products those

10:03 to find weights, and then another one finds the values

10:08 for your things.

10:08 So it takes this query and key, you get the weights for them and then you take

10:13 the ones that were sort of the closest to get those values from the third matrix.

10:16 Just doing it sort of like looks a little bit like accessing an element in a dictionary

10:22 like key value lookup and it's a differentiable version of that and it did really well

10:28 on their machine learning sorry, on their machine translation stuff.

10:31 This was, I think, one of the first big ones is this BERT model, and from that paper

10:37 sort of the architecture of the actual neural net code is effectively unchanged

10:43 from this to ChatGPT.

10:45 Like there's a lot of stuff for like milking performance and increasing stability

10:50 but the actual like core essence of the actual mechanism that drives it it's the same

10:54 thing since this paper.
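
For readers who want to see the mechanism on the page, here is a minimal NumPy sketch of the scaled dot-product attention being described; it is just the core idea, not a production transformer block, and the matrix names are illustrative.

```python
import numpy as np

def scaled_dot_product_attention(X, W_q, W_k, W_v):
    # X: (n_tokens, d_model) input data; W_q, W_k, W_v: the three learned matrices.
    Q = X @ W_q                       # data -> queries
    K = X @ W_k                       # data -> keys (like labels for the data)
    V = X @ W_v                       # data -> values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # dot-product similarity between queries and keys
    # Softmax turns scores into weights, making this a differentiable key-value lookup.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                # weighted sum of values, like soft dictionary access
```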

10:55 Interesting.

10:55 It's funny that Google didn't release something sooner.

10:58 It's wild that they've had they keep showing off that they've got like equivalent

11:04 or better things at different times but then not releasing it.

11:11 When DALL-E happened they had Imagen, I guess, I don't know how you say it,

11:16 and what was the other one, Parti. Those two, they had two different really good, way better

11:20 than DALL-E, way better than Stable Diffusion models, that were out

11:20 and they like showed it demoed it like but never released it to be used so yeah

11:25 it's one of these who knows what's going to happen with Google if they keep

11:28 holding on to these things.

11:28 Yeah well I think there was some hesitation, I don't know, hold-ups on accuracy

11:33 or weird stuff like that.

11:34 Sure.

11:35 Yeah, now the cat's out of the bag, now it's happening.

11:38 Yeah the cat's out of the bag and people are racing to do the best they can and it's

11:43 going to have interesting consequences for us both positive and negative I think

11:47 but you know let's leverage the positive once the cat's out of the bag anyway right?

11:51 Yeah.

11:51 Hopefully.

11:52 Might as well like ask it questions for pandas.

11:55 So let's play a little bit with ChatGPT and maybe another one of these image type

12:00 things.

12:00 So I came in here and I stole this example from a blog post that's pretty nice

12:05 about not using deeply nested code.

12:08 You can use a design pattern called a guard clause that will look and say if the

12:14 conditions are not right we're going to return early instead of having if something

12:18 if that also if something else so there's this example that is written in a poor

12:23 way and it says like it's checking for a platypus so it says if self.ismammal

12:29 if self.hasfur if self.hasbeak etc.

12:32 it's all deeply nested and just for people who haven't played with ChatGPT

12:37 like I put that in and I said sure I told it I wanted to call this arrow

12:40 because it looks like an arrow and it says it tells me a little bit about

12:44 this so I'm going to ask it please rewrite arrow to be less nested with guard

12:52 clauses right this is like a machine right if I tell it this what is it going to say

12:57 let's see it may fail but I think it's going to get it it's thinking I put it

13:01 I mistakenly put it into ChatGPT-4 which takes longer I might switch it over to

13:06 3 I don't know but the understanding of these things there's a lot of hype

13:11 about it like I think you kind of agree with me that maybe this hype is worthwhile

13:16 here we go so look at this it rewrote it said is_platypus if not self.is_mammal

13:22 return false if not self.has_fur and there's no more nesting that's pretty

13:25 cool right yep I mean I'm sure you've you've played with stuff like this

13:29 right yeah big user of this I mean this is kind of interesting right like it

13:33 understood there was a structure and it understood what these were and it

13:35 understood what I said but what's more impressive is like please rewrite

13:40 the program to check for crocodiles and you know it what is it

13:49 going to do here let's see it says sure no problem writes the function is crocodile

13:54 if not self.is reptile if not self.has scales if not self.has long snout
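
For anyone following along without the screen, here is a minimal sketch of the before-and-after shapes being discussed; the attribute names are illustrative, not the exact code from the demo.

```python
# Deeply nested version: every check adds another level of indentation.
def is_crocodile_nested(self) -> bool:
    if self.is_reptile:
        if self.has_scales:
            if self.has_long_snout:
                return True
    return False

# Guard-clause version: return early when a condition fails, so the happy path stays flat.
def is_crocodile(self) -> bool:
    if not self.is_reptile:
        return False
    if not self.has_scales:
        return False
    if not self.has_long_snout:
        return False
    return True
```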

14:00 oh my gosh like it not only remembered oh yeah there's this new version I

14:06 wrote in the guard clause format but then it rewrote the tests I mean and then

14:12 it's explaining to me why it wrote it that way it's just it's mind blowing

14:18 like how how much you can have conversations with this and how much it understands

14:23 things like code or physics or history what do you think yeah it's really

14:28 satisfying I love that it's such a powerful generalist at these like things

14:33 that are found on the internet so if it like if it exists and it's in the training

14:36 data it can do so good at synthesizing composing bridging between them it's really

14:41 satisfying so it's really fun asking it to as you're doing rewriting changing

14:45 language I've been getting into a lot more JavaScript because I'm doing a

14:48 bunch more like front end stuff and just I sometimes will write a quick one liner in

14:51 Python that I know how to do with a list comprehension and then I'll be like

14:55 make this for me in JavaScript because I can't figure out this like how to

14:59 initialize an array with integers in it it's great for just like really quick spot

15:04 checks and it also seems to know a lot about like really popular frameworks

15:07 so you can ask it things that are surprisingly detailed about like a how would you

15:12 do CORS with requests in FastAPI and it can help you find that exact middleware

15:18 you know it's like boilerplate-y but it's great that it can just be a source for that
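
As a concrete example of that kind of boilerplate, this is the standard FastAPI CORS middleware setup; the allowed origin here is illustrative.

```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

# Allow cross-origin requests from a front end running on another host/port.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:3000"],  # illustrative origin
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
```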

15:22 this portion of Talk Python to Me is brought to you by brilliant.org you're a

15:28 curious person who loves to learn about technology I know because you're

15:31 listening to my show that's why you would also be interested in this episode's

15:35 sponsor brilliant.org brilliant.org is entertaining engaging and effective

15:40 if you're like me and feel that binging yet another sitcom series is kind

15:44 of missing out on life then how about spending 30 minutes a day getting better

15:47 at programming or deepening your knowledge and foundations of topics you've always

15:51 wanted to learn better like chemistry or biology over on brilliant brilliant

15:57 has thousands of lessons from foundational and advanced math to data science

16:01 algorithms neural networks and more with new lessons added monthly when you sign up

16:06 for a free trial they ask a couple of questions about what you're interested

16:09 in as well as your background knowledge then you're presented with a cool learning

16:13 path to get you started right where you should be personally I'm going back to some

16:17 science foundations I love chemistry and physics but haven't touched them

16:20 for 20 years so I'm looking forward to playing with PV equals NRT you know

16:26 the ideal gas law and all the other foundations of our world with brilliant

16:30 you'll get hands-on on a whole universe of concepts in math science computer science

16:35 and solve fun problems while growing your critical thinking skills of course

16:39 you could just visit brilliant.org directly its url is right there in the name

16:43 isn't it but please use our link because you'll get something extra 20% off

16:47 an annual premium subscription so sign up today at talkpython.fm/brilliant

16:52 and start a 7 day free trial that's talkpython.fm/brilliant the link is

16:57 in your podcast player show notes thank you to brilliant.org for supporting

17:01 the show it's insane I don't know if I've got it in my history here we're rewriting

17:09 our mobile apps for Talk Python Training for our courses in Flutter and we're

17:15 having a problem downloading stuff concurrently using a particular library

17:19 in Flutter and so I asked it I said hey I want some help with a Flutter and Dart

17:26 programs what do you want it says I'm using the dio package do you know it

17:30 oh yes I'm familiar it does HTTP client stuff for Dart okay I want to download

17:34 binary video files and a bunch of them given a URL I want to do them concurrently

17:39 with three of them at a time write the code for that and boom it just writes it

17:42 like using that library I told it about not just Dart so that's incredible that

17:48 we can get this kind of assistance for knowledge and programming like you'll

17:52 never find I mean I take that back you might find that if there's a very

17:55 specific stack overflow question or something but if there's not a write-on

17:59 question for it you're not going to find it I love when you know the stack

18:04 overflow would exist for a variant of your question but the exact one doesn't

18:08 exist and you have to go grab the three of them to synthesize and it's just great

18:12 at that it also is pretty good at fixing errors sometimes it can walk itself into

18:17 lying to you repeatedly but that's so problematic yeah but you can also ask

18:24 it here's my program are there security vulnerabilities or do you see any

18:28 bugs and it'll find them yep yeah it's nuts so people may be wondering we haven't

18:34 talked yet about your project sketch why I'm talking so much about ChatGPT

18:38 so that is kind of the style of AI that your project brings to pandas which we're

18:44 going to get to but I want to touch on two more really quick AI things that

18:47 we'll dive into it the other is this just around images just the ability to

18:52 ask questions you've already mentioned three: DALL-E, Imagen, and then the other

18:57 one I don't remember from Google that they haven't put out yet. Midjourney is

19:01 another just the ability to say hey I want a picture of this no actually

19:05 change it slightly like that it's mind blowing they're a lot of fun they're great

19:09 for sparking creativity or having idea and just getting to see it in front of

19:12 you I think it's more impressive to me than even this ChatGPT, telling it

19:16 I want an artificial intelligence panda and it came up and I

19:32 want it photorealistic in the style of National Geographic and so it gave

19:36 me this panda you can see beautiful whiskers but just behind the ear you can see

19:41 the fur is gone and it's like an android type of creature that is a beautiful

19:48 picture it's pretty accurate it's nuts that I can just go talk to these systems

19:52 and ask them these questions I find it interesting comparing the ChatGPT

19:56 and the Midjourney style I completely get it it's very visceral it's also

20:04 from another perspective I think of the weights and the scale of the model

20:07 and these image ones that solve all images are so much smaller in scale than

20:14 these language ones that have all this other data and stuff. So it's fascinating how complex language is. Yeah, I know the smarts is so much less,

20:20 but just something about it actually came up with a creative picture that never existed.

20:26 Yeah. Right. You could show this to somebody like, oh, that's an artificial panda. That's

20:31 insane. Right. But it's, but I just gave it like a sentence or two. Yeah. Yeah. Yeah. I don't know.

20:36 Yeah. This, it's a sort of a technical interpreter, but I, I love it because it's

20:41 like this, it's just phenomenal interpolation. It's like through semantically labeled space. So

20:46 like the words have meaning and it understands the meaning and can move sliders of like, well,

20:51 I've seen lots of these machine things. I understand the concept of gears and this metal and this,

20:54 like the shiny texture and then the fur texture and like, they're very good at texture. It's a,

21:00 yeah, really great how it interprets all of that just to fit the, you know, the small prompt.

21:04 Yeah. There are other angles of which it's frustrating. Like I want it turned, I want it

21:08 in the back of the picture, not the, no, it's always in the center. One more thing really quick.

21:13 And this leads me into my final thing is, is a GitHub copilot. GitHub copilot is like this in

21:19 your editor, which is kind of insane, right? You can just give it like a comment or a series of

21:23 comments and it will write it. I think ChatGPT is maybe more open-ended and more creative, but this

21:29 is, this is also a pretty interesting way to go. I'm a heavy user of copilot. I, if there's a,

21:35 there's a weird crutch and I'm like slowly developing like a need to have this in my browser. I was

21:40 on a flight recently and was without the internet and Copilot wasn't working. And I felt the, like,

21:46 I felt the difference. I felt like I was like walking through mud instead of just like actually

21:50 running a little bit. And I was like, Oh, I've been disconnected from my distributed mind. I am broken

21:56 partially. Yeah. So incredible. So the last part I guess is like, you know, what are the ethics

22:03 of this? Like I went on very positively about Midjourney, but how much of that is trained on

22:08 copyright material or there's GitHub copilot. How much of that is trained on GPL based stuff that was

22:16 in GitHub. But when I use it, I don't have the GPL any longer on my code. I might use it on commercial

22:23 code, but just running it through the AI, does that strip licenses or does it not? There's a GitHub

22:29 copilotlitigation.com, which is interesting. I mean, we might be finding out. There's also

22:35 think Getty, I think it's the Getty Images. I'm not 100% sure, but I think Getty Images is suing

22:41 one of these image generation companies. I can't remember which one, maybe Midjourney. I

22:47 don't think it's Midjourney. I think it's Stable Diffusion, but anyway, it doesn't really matter.

22:50 Like there's a bunch of things that are pushing back against this. Like, wait a minute,

22:53 where did you get this data? Did you have rights to use this data in this way? And I mean,

22:58 what are your thoughts on this angle of AI these days?

23:02 Yeah. I know it sounds like I don't worry too much about it in either direction. I think I

23:08 believe in personal ethics. I believe in open source things, availability of things,

23:14 because it just sort of like accelerates collective progress. But that said, I also believe in like

23:19 slightly different like social structures to help support people. Like I'm a, I guess,

23:24 a personal believer in things like UBI or something like that in that direction.

23:27 So when you combine those, I feel like it, you know, things sort of work out kind of well,

23:31 but when we like, but it is still a thing that like, copyright exists and that there is this sense of ownership and this is my thing. And I want to

23:38 put licenses on it. And, I think that this sort of story started presumably that I wasn't really

23:44 having this conversation, but like when the internet came around and search engines happened and like

23:49 Google could just go and pull up your thing from your page and summarize it in a little blob on the

23:54 page is, was that fair? What if it starts, you know, your shop and it allows you to go buy that

23:58 same product from other shops. Like it, I think that the same things are showing up and in the same way

24:04 that the web, like in the internet sort of, it's sort of, it was a large thing, but then it sort of,

24:08 I don't know if it got quieter, but it sort of became in the background. We sort of found new

24:12 systems. It stopped being piracy and CDs and the music industry is going to struggle. And Hey,

24:17 things like Spotify exist and streaming services exist. And like, I don't know what the next way

24:21 is.

24:21 They're doing better than ever basically. Yeah. Yeah. Yeah. So I think it's just evolution.

24:24 And then like the, some things will change and adopt some things will like fall apart and new

24:29 things will be born. I, that's just a great, it's a good time for lots of opportunity, I guess is the

24:33 part that I'm excited about.

24:34 Yeah. Yeah. Yeah. For sure. I think that's definitely true. It probably, you're probably right. It probably

24:39 will turn out to be, you know, old man yells at cloud cloud doesn't care sort of story, you know,

24:45 in the end where it's like, on the other hand, if, if somebody came back and said,

24:49 you know, a court came back and said, you know what, actually anything trained on GPL

24:54 and then you use copilot on it, that's GPL. Like that would have instantly mega effects. Right.

25:02 Yeah. I, yeah. And I, I guess there's also stuff like the, I don't, I didn't actually read the

25:06 article. I only saw the headline and you know, that's the worst thing to do is to repeat a thing,

25:09 which is a headline. But, there was that Italy thing that I saw about, like, I don't know.

25:14 Yeah. That was really clickbaity, but I didn't get time to look at it yet. So yeah. You could probably

25:20 ask ChatGPT to summarize it for you. As long as it can be like Bing, I guess, and get that updated.

25:25 Yeah. Yeah. Yeah. Yeah. There's a lot of, there's a lot of things playing in that space,

25:30 right? Some different places. Okay. So yeah, very cool. But as a regular user, I would say,

25:36 you know, regardless of kind of how you feel about this, at least this is my viewpoint right now.

25:40 It's like, regardless of how I feel about which side is right in these kinds of disputes,

25:45 this stuff is out of the bag. It's out there and available and it's a tool. And it's like saying,

25:50 you know, I don't want to use spell check or I don't want to use some kind of like code checking. I just

25:55 want to write like in straight notepad because it's pure, right? Like sure you could do that,

26:00 but there's these tools that will help us be more productive and it's better to embrace them and know

26:05 them than to just like yell at them, I suppose. Yeah. A lot of accelerant you can get.

26:10 really speed up whatever you want to get done. Yeah, absolutely. All right. So speaking of

26:15 speeding up things, let's talk pandas and not even my artificial pandas, but actual programming pandas

26:22 with this project that you all have from approximate. Yeah. Approximate labs called sketch. So sketch is

26:30 pretty awesome. Sketch is actually why we're talking today because I first talked about this on Python

26:35 bytes and I saw this was sent over there by Jake Furman and to me and said, you should check this

26:41 thing out. It's awesome. And, yeah, it's pretty nuts. So tell us about sketch. Yeah. So, even

26:49 though I use Copilot as I sort of described already, and it's become a crutch, I found in Jupyter

26:54 notebooks when I wanted to work with data, it just didn't, it doesn't actually apply there. So on one side,

27:00 it was sort of like missing the mark at times. And so it was sort of like, how can I get this

27:04 integrated into my flow? The way I actually work in a Jupyter notebook, if maybe I'm working a Jupyter

27:09 notebook on a remote server and I don't want to set up VS Code to do it. So I don't have copilot at all.

27:13 Like there's a bunch of different reasons that I was just like in Jupyter. It's a very different IDE

27:17 experience. It is. Yeah. It's super different, but also you might want to ask questions about the data,

27:21 not the structure of the code that analyzes the data, right? Exactly. Yeah. And so just a bunch of that

27:26 type of stuff. And then also at the other side, I was trying to find something that I could throw together

27:31 that I thought was strong demonstration of the value approximate labs is trying to chase, but wouldn't

27:38 take me too much time to make. So it was a, oh, I could probably just go throw this together pretty quickly.

27:42 I bet this is going to be actually useful and helpful. And so let's just do that. And so I threw it on top of

27:48 the actual library I was using, it was sketch. I put this on it and that sort of shifted what the

27:54 project was. Yeah. Yeah. So you also have this other project called Lambda Prompt. And so were

27:59 you trying to play around Lambda Prompt and then like see what you could kind of apply here to leverage

28:03 it? Or is that the full journey I can get into is started with data sketches. I left my last job

28:11 to chase bringing the algorithm, like combining data sketches with AI, but just like the vague,

28:17 like at that level. Tell us what data sketches is real quick. Sure. Yeah. So a data sketch is a

28:22 probabilistic aggregation of data. So if you have, I think the most common one that people have heard of

28:27 is hyperloglog and it's used to estimate cardinality. So estimate the number of unique

28:32 values in a column. A data sketch is a class of algorithms that all sort of like use roughly fixed

28:39 width in binary, usually representations. And then in a single pass, so their ON will look at each row

28:46 and hash the row and then update the sketch or not necessarily hash, but they update this sketch

28:52 object. Essentially. Those sketch objects also have another property that they are mergeable. So you

28:57 have this like really fast ON to go bring that like to aggregate up and you get this mergeability. So you

29:03 can map reduce it in, you know, trivial speeds. The net result is that this like tight binary packed

29:09 object can be used to approximate measures you were looking for on the original data. So you could look

29:15 at, if you do a few of these, they're like theta sketches, you can go and estimate not just the

29:21 unique count, but you can also estimate if this one column would join well with this other column,

29:25 or you can estimate, Oh, if I were to join this column to this column, then this third column that

29:30 was on that other table would actually be correlated to this first column over here. So you get these,

29:35 a bunch of different distributions, you get a whole bunch of these types of properties.

29:40 And each sketch is sort of just, I would say, algorithmically engineered, like very, very

29:44 engineered to be like information theory optimal at solving one of those like measures on the data.

29:50 And so tight packed binary representations.
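
To make the single-pass, mergeable idea concrete, here is a small sketch using the datasketch library's HyperLogLog; the library choice and the column values are illustrative, not what Sketch uses internally.

```python
from datasketch import HyperLogLog  # pip install datasketch

values = ["alice", "bob", "alice", "carol", "bob"]  # stand-in for one column of a table

hll = HyperLogLog()
for v in values:                      # single pass over the rows: O(N)
    hll.update(v.encode("utf-8"))     # hash each value and update the fixed-width sketch

other = HyperLogLog()
for v in ["carol", "dave"]:
    other.update(v.encode("utf-8"))

hll.merge(other)                      # sketches are mergeable, so map-reduce aggregation is trivial
print(hll.count())                    # approximate number of unique values across both chunks
```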

29:53 All right. So you thought about, well, that's cool, but ChatGPT is cool too.

29:57 Yeah.

29:58 What else?

30:00 The core thing was, so those representations aren't usable by AI right now. And when you actually go and

30:07 use GPT three or something like this, you have to figure out a way to build the prompt to get it to do

30:13 what you want. This was especially true in a pre instruction tuning world, you had to really like, you had to

30:18 play the prompt engineer role even more than you have to now. Now you could sort of get away with describing it to

30:23 ChatGPT. And one of the things that you really have to like, play the game of is how do you get all the

30:28 information it's going to need into this prompt in a succinct, but good enough way that it helps it do

30:35 this. And so what sketch was about was, rather than just looking at the context of the data, like the

30:41 metadata, the column names and the code you have, also go get some representation of the

30:48 content of the data, turn that into a string, and then bring that string in as part of the prompt.

30:53 And then when it has that, it should understand much better at actually generating code, generating

30:59 answers to questions. And that's what that sketch was a proof of concept of that, that worked very well.

31:03 It really quickly showed how valuable actual data content context is.

31:08 Yeah, I would say it's resonating with people. It's got 1,500 stars on GitHub.

31:13 And it looks about six months old. So that's pretty good growth there.

31:18 Yeah, January 16th was the day I posted it on Hacker News. And it had three,

31:22 there was an empty repo at that point.

31:24 Okay, three stars. It's like me and my friends. Okay, cool. So this is a tool that basically patches

31:33 pandas to add functionality or functions, literally to pandas data frames that allows you to ask

31:42 questions about it, right?

31:44 Yep.

31:44 So what kind of questions can you ask it? What can it help you with?

31:47 Yeah, so there's two classes of questions you can ask, you can ask it, the ask type questions,

31:53 these are sort of from that summary statistics data. So from the general, you know, representation of your

32:00 data, ask it to like, give you answers about it, like, what are the columns here, you sort of have

32:04 a conversation where it sort of understands the general under like shape of the data, general

32:10 distributions, things like that, number of uniques, and like give that context to it, ask questions of that

32:15 system. And then the other one is ask it how to do something. So you specifically can get it to write

32:21 code to solve a problem you have, you describe the problem you want, and you can ask it to do that.

32:25 Right. I've got this data frame, I want to plot a graph of this versus that, but color by the other

32:31 thing.

32:31 Yep. And in the data space world, what I sort of decided to do is like in the demo here is just sort of

32:37 walk through what are some standard things people want to ask of data, like, like, what are those common

32:43 questions that you hear, like, in Slack between, you know, like, business team and an analyst team. And it's just

32:49 sort of like, Oh, can you do this? Can you get me this? Can you tell me if there's any PII? Is this safe to send?

32:54 Can I send the CSV around? Can you clean up this CSV? Oh, I need to load this into our catalog. Can you

32:59 describe each of these columns and check the data types all the way to can you actually go get me

33:04 analytics or plot this?

33:05 Yeah. Awesome. So and it plugs right into Jupyter Notebooks, so you can just import it and basically

33:13 installing Sketch, which is a pip or Conda type thing, and then you just import it, and it's good to go,

33:19 right? Yep. Using the Pandas extensions API, which allows you to essentially hook into their data

33:24 frame callback and register a, you know, a function. Interesting. So it's not as jammed on from the

33:31 outside. It's a little more, plays a little nicer with Pandas rather than just like, we're going to go

33:35 to the class and just tap on it. Yeah, yeah. I, yeah. Not full monkey patching here. It's a,

33:41 it's like hack supported, I think. I don't, I don't see it used often, but it is somewhere in the docs.
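
That hook is the documented pandas register_dataframe_accessor decorator; a minimal sketch of how a library can hang a custom namespace off every DataFrame looks like this (the accessor body is a placeholder, not Sketch's actual implementation).

```python
import pandas as pd

@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    def __init__(self, pandas_obj: pd.DataFrame):
        self._df = pandas_obj

    def ask(self, question: str) -> str:
        # Placeholder: a real implementation would summarize self._df,
        # build a prompt, and send it to a language model.
        return f"You asked {question!r} about a frame with {len(self._df)} rows."

df = pd.DataFrame({"a": [1, 2, 3]})
print(df.demo.ask("What columns are here?"))
```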

33:46 Excellent. But here it is. So what I wanted to do for this is there's a, an example that you can do,

33:52 like if you go to the repo, which obviously I'll link to, there's a video, which I mean,

33:57 mad props to you because I review so many things, especially for the Python Bytes podcast, where

34:02 there's a bunch of news items and new things we're just going to check out. And we'll, we'll find people

34:06 recommending GUI frameworks that have not a single screenshot or other types of things. Like,

34:13 I have no way to judge whether this thing even might look like that. What does it even make?

34:17 I don't even know, but somebody put a lot of effort, but they didn't bother to post an image. And you

34:21 posted a minute and a half animation of it going through this process, which is really, really

34:27 excellent. So people can go and watch that one minute, one minute 30 video. But there's also a

34:34 collab opening Google collab, which gives you a running interactive variant here. So you can just

34:41 follow along, right? And play with these pieces. It requires me to sign up to run it, but that's okay.

34:46 Let me talk people through some of the things it does. And you can tell me what it's doing,

34:51 how it's doing that, like how people might find that advantageous. So import sketch, import pandas

34:57 as PD standard. And then you can say pandas read CSV and you give it one from a, like a,

35:03 some example CSV that you got on your, one of your GitHub repos, right? Or in your account.

35:08 Yeah. I found one online and then added just random synthetic data to it.

35:12 Yeah. Like, Oh, here's a data dump. No, just kidding.

35:14 So then you need to go to that data frame called sales data. You say dot sketch dot ask as a string,

35:21 what columns might have PII personal identifying information in them?
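
Roughly, the notebook cells being described look like this; the CSV path is a stand-in for the sample sales file used in the demo.

```python
import sketch  # pip install sketch
import pandas as pd

sales_data = pd.read_csv("sample_sales_data.csv")  # illustrative path to the example CSV

sales_data.sketch.ask("What columns might have PII (personally identifiable information) in them?")
```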

35:28 Awesome. And so it comes back with the answer. Tell me how that works and what it's doing here.

35:33 So it does, I guess it has to build up the prompt, which is sent to GPT. So to open AI specific

35:40 completion endpoint, the building up the prompt, it looks at the data frame. It does a bunch of

35:44 summarization stats on it. So it calculates uniques and sums and things like that. There's two modes in

35:50 the backend that either does sketches to do those, or it just uses like DF dot describe type stuff.

35:55 And then it pulls those summary stats together for all the columns, throws it together with my,

36:00 the rest of the prompt I have, you can, we can go find it, but then it sends that prompt.

36:05 Actually, it also grabs some information off of inspect. So it sort of like walks the,

36:10 the stack up to go and check the variable name because the data frame is named sales data.

36:15 So it actually tries to go find that variable name in your call stack so that it can, when it writes

36:20 code, it writes valid code, puts all that together, send it off to open AI, gets code back, uses Python

36:25 AST to parse it, check that it's valid. If it's not valid Python code, or you tried to import something

36:30 that you don't have, it will ask it to rewrite once. So this is sort of like an iterative process. So it

36:36 takes the error or it takes the thing and it sends it back to open AI. It's like, Hey, fix this code.

36:41 And then it, or in this case, sorry, ask, it actually just takes this, it sends that exact

36:45 same prompt, but it just changes the last question to, can you answer this question off of the information?
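
A very rough sketch of that loop, as described: summarize the frame, build a prompt, and sanity-check any returned code with the ast module before showing it. The prompt text and helper names here are placeholders, not Sketch's actual internals.

```python
import ast
import pandas as pd

def build_prompt(df: pd.DataFrame, frame_name: str, question: str) -> str:
    # Summary stats stand in for the richer data-sketch context described above.
    summary = df.describe(include="all").to_string()
    return (
        f"A pandas DataFrame named {frame_name} has this summary:\n{summary}\n\n"
        f"Question: {question}\nAnswer with runnable Python code."
    )

def code_is_usable(code: str) -> bool:
    try:
        ast.parse(code)  # is it syntactically valid Python?
    except SyntaxError:
        return False
    return True

# If the first completion fails code_is_usable(), the error is sent back once
# with a request to fix the code, as described above.
```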

36:51 This portion of Talk Python to Me is brought to you by us over at Talk Python Training with our courses.

36:58 And I want to tell you about a brand new one that I'm super excited about. Python web apps that fly

37:05 with CDNs. If you have a Python web app, you want it to go super fast. Static resources,

37:11 turn out to be a huge portion of that equation. Leveraging a CDN could save you up to 75% of your

37:17 server load and make your app way faster for users. And this course is a step-by-step guide on how to do

37:24 it. And using the CDN to make your Python apps faster is way easier than you think. So if you've

37:30 got a Python web app and you would like to have it scaled out globally, if you'd like to have your users

37:35 have a much better experience and maybe even save some money on server hosting and bandwidth,

37:41 check out this course over at talkpython.fm/courses. It'll be right up there at the top.

37:46 And of course the link will be in your show notes. Thank you to everyone who's taken one of our courses.

37:51 It really helps support the podcast. I'm back to the show.

37:56 And so that sounds very, very similar to my arrow program. Rewrite it with guard clauses, redo it.

38:02 Like you kind of, I gave you this data in this code and I asked you this question and you can have a

38:07 little conversation, but at some point you're like, all right, well, we're going to take what it gives me

38:10 after a couple of rounds at it. Right.

38:12 Yeah. I take the first one that doesn't, that like passes an import check and passes AST linting.

38:18 There was a, the, when you use small models, you run into not valid Python a lot more, but with these

38:23 ones, it's almost always good.

38:25 It's ridiculous. Yeah. Yeah. Yeah. It's crazy. Okay. So it says the columns that might have PII

38:30 in them are credit card, SSN and purchase address. Okay. That's pretty excellent. And then you say,

38:37 all right, sales data dot sketch dot ask. Can you give a friendly name to each column and output this

38:44 as an HTML list, which is parsed as HTML and rendered in Jupyter notebook accurately. Right. So it says

38:51 index. Well, that's an index.

38:52 This one ends up being the same.

38:53 It's not a great, this one is not a great example because it doesn't have to like infer

38:57 because the names are like order space date, right? Instead of order, like maybe lowercase

39:04 O and then like attached a big D or whatever, but it'll give you some more information. You

39:09 can like kind of ask it questions about the type of data, right?

39:13 Yeah, exactly. I found this is really good at if you play the game and you just name all

39:17 your columns, like call one, call two, call three, call four, and you ask it, give me new column

39:21 names for all of these. It gives you something that's pretty reasonable based off of the data.

39:24 So pretty useful.

39:24 Okay. So it's like, oh, these look like addresses. So we'll call that address. And this looks like

39:28 social security numbers and credit scores and whatnot.

39:31 Yep. Yep. So it can really help with that quick first onboarding step.

39:35 Yeah. So everyone heard it here first. Just name all your columns. One, two, three, four,

39:39 and then just get help. Like AI, what do we call these? All right. So the next thing you did in this

39:48 demo notebook was you said sales data dot sketch dot howto. And this is different than before I believe,

39:54 because before you were saying ask, and now you can say how to: create some derived features from the,

40:01 from the address. Tell us about that.
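
The call looks just like ask, except howto returns code instead of prose; the question text here is illustrative.

```python
# .ask answers in prose; .howto answers with Python you can paste into the next cell.
sales_data.sketch.howto("Create some derived features from the Purchase Address column")
```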

40:03 Yeah. This is the one that actually is the code writing. It's essentially the exact same prompt,

40:07 but the change is the very end. It says like, return this as Python code that you can execute to do

40:13 this. So instead of answering the question directly, answer the question with code that will answer the

40:17 question.

40:18 Right. Write a Python line of code that will answer this question given this data, something like that.

40:23 Yep. Yep. Something like that. I don't remember exactly anymore. It's been a while, but yeah,

40:27 some I've iterated a little bit until it started working and I was like, okay, cool. And so,

40:32 ask it for that. And then it spits back code. And that was, it sort of, it sounds overly simple,

40:37 but that was it. That was like, that was the moment. And I was just like, oh, I could just ask it to do my

40:42 analytics for me. And it's just all the, every other feature just sort of became like apparently

40:45 solvable with this. And the more I played with it, the more it was just, I don't have to think about,

40:50 I don't even have to go to Google or stack overflow to ask the question, to get the API stuff for me.

40:55 I could, from zero to I have code that's working is one step in Jupyter.

40:59 So you wrote that how to, and you gave it the question and then it wrote the lines of code and you just

41:04 drop that into the next cell and just run it. Right. And so for example, in this example, it said, well,

41:09 we can come up with city state and zip code and by writing a vector transform by passing a lambda,

41:16 that'll pull out, you know, the city from the string that was the full address and so on. Right.
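
The generated code comes back along these lines; this is a plausible reconstruction rather than the exact demo output, and it assumes a Purchase Address column formatted as "street, city, state zip".

```python
# Split "street, city, state zip" addresses into derived columns.
sales_data["City"] = sales_data["Purchase Address"].apply(lambda a: a.split(",")[1].strip())
sales_data["State"] = sales_data["Purchase Address"].apply(lambda a: a.split(",")[2].strip().split(" ")[0])
sales_data["Zip Code"] = sales_data["Purchase Address"].apply(lambda a: a.split(",")[2].strip().split(" ")[1])
```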

41:21 Yeah. That's pretty neat.

41:22 Yeah. It's fun to see what it, what it does. Not again, not any of these things are always

41:26 probabilistic, but it also usually serves as a great starting point if, even if it doesn't get it

41:30 right.

41:30 Yeah. Sure. You're like, oh, okay. I see. Maybe that's not exactly right. Cause we have Europeans

41:35 and their city, maybe, and their zip code are in different orders sometimes, but it gives you

41:40 something to work with pretty quickly. Right. By asking just a, what can I do?

41:44 And then another one, this one's a little more interesting instead of just saying like, well,

41:48 what other things can we pull out? It's like, this gets towards the analytics side, right? It says,

41:53 get the top five grossing states for the sales data. Right. And it writes a group by some sorts,

42:01 and then it does a head given five. And that's pretty neat. Tell us about this. I mean, I guess

42:05 it's about the same, right? Just ask more questions.
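
The shape of the code it writes for that request is roughly this; the column names are assumptions about the sample data, not necessarily the exact ones in the frame.

```python
# Sum sales per state, sort descending, and keep the top five.
top_states = (
    sales_data.groupby("State")["Sales"]
    .sum()
    .sort_values(ascending=False)
    .head(5)
)
print(top_states)
```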

42:08 They all feel pretty similar to me. I think, I guess I could jump towards like things that

42:13 I wanted to put next, but I didn't, we're not reliable enough to like really make the cut.

42:18 I wanted to have it go like in my question was like, go build a model that predicts sales for the

42:24 next six months and then plot it on a 2d plot with a dotted line for the predicted plot. And like,

42:31 it would try, but it would always do something off. And I found I always had to break up the

42:40 like prompt into, like, smaller, smaller intern-level pieces to get code back. Yeah.

42:40 Yeah. It was fun getting it to train models, but it was also its own like separate thing. I sort of

42:48 didn't play with too much. And there's another part of sketch that I guess is not in this notebook. I

42:54 didn't realize. Yeah. Because you have to use the open AI API key, but it's the sketch apply. And

43:00 that's the, I'll say this one is another just like power tool. This one has like, I don't really talk

43:07 about, I don't even include it in the video because it's not just like as plug and play, you do have to

43:11 go set an environment variable. And so it's like, yeah, that's one step further than I want to,

43:15 I don't, it's not terrible, but it's a step. And so what it does is it lets you apply a completion

43:22 endpoint of whatever your design row wise. So every single row, you can go and apply and run something.

43:29 So if every row of your pandas data frame is a, some serialized text from a PDF or something,

43:35 or a file in your directory structure, and you just load it as a data frame, you can do dot

43:39 df.sketch.apply. And it's almost the exact same as df.apply. But the thing you put in as your function

43:45 is now just a Jinja template that will fill in your column variables for that row and then ask GPT to

43:51 continue completing. So I think I did silly ones, like here's a few states. And then the prompt is

43:58 extract the state for it. Or so I think, right, extract the capital of the state. Yeah. Yeah. So

44:04 just pure information extraction from it, but you can sort of like this grows into a lot more.
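
A minimal sketch of that apply call: the argument is a Jinja template whose variables get filled from each row's columns, and one completion runs per row. This assumes an OPENAI_API_KEY environment variable is set when bypassing the hosted endpoint, and the column name is illustrative.

```python
import sketch
import pandas as pd

states = pd.DataFrame({"state": ["New York", "California", "Oregon"]})

# {{ state }} is replaced with each row's value before the completion runs.
states["capital"] = states.sketch.apply(
    "What is the capital of {{ state }}? Answer with just the city name."
)
```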

44:10 So does that come out of the data? Or is that coming out of open AI where like it sees what is

44:15 the capital of state and it sees New York? It's like, okay, well, all right, Albany.

44:20 Yeah. So this is purely extracting out of the model weights. Essentially, this is not like a factual

44:25 extraction. So this is probably a bad example because of that. But the thing that actually,

44:29 actually, the better example I did once was, what is like some interesting colors that are

44:34 good for each state? And it like just came up with a sort of like flaggish colors or sports team colors.

44:38 That was sort of fun when it wrote that as hex. You can also do things like if you have a large text

44:43 document or you can actually, I'll even do the more common one that I think everybody actually wants

44:47 is you have messy data. You have addresses that are like syntactically messy and you can say,

44:52 normalize these addresses to be in this form. And you sort of just write one example. You run

44:57 dot apply and you get a new column that is that cleaned up data. Yeah. Incredible. Okay. A couple

45:03 things here. It says I can use, you can directly call OpenAI and not use your endpoint. So at the

45:11 moment it kind of proxies through a web service that you all have that somehow checks stuff or what does

45:16 that do? Yeah. It was just a pure ease of use. I wanted people to be able to do pip install and

45:21 import sketch and actually get it because I know how much I use things in, in a collab or in Jupyter

45:28 notebooks on weird machines and remembering an environment variable, managing secrets. It's like

45:32 this whole overhead that I want to deal with. And so I wanted to just offer a lightweight way if you

45:38 just want to be able to use it. But I know that that's not sufficient for security. If people are going

45:42 to be conscious of these things and want to be able to, you know, not go through my proxy thing that's

45:46 there for help. So sure.

45:47 Offer this up.

45:48 What's next? Do you have a roadmap for this? Are you happy where it is and you're just letting it be or

45:53 do you have grand plans?

45:55 I don't have much of a roadmap for this right now. I'm actually, I guess there's like grand roadmap,

46:00 which is like at the company scale, what we're working on. I would say that if this, we're really trying to

46:05 solve data and with AI just in general. And so these are the types of things we hope to open source and

46:11 just give out there, like actually everything we're hoping to open source. But the starting place is

46:16 going to be a bunch of these like smaller toolkits or just utility things that hopefully save people

46:20 time or very useful. The grand thing we're working towards, I guess, is this more like the, it's the

46:26 full automated data stack. It's like the dream I think that people have wanted where you just ask it

46:31 questions and it goes and pulls the data that you need. It cleans it. It builds up the full pipeline.

46:36 It executes the pipeline. It gets you to the result and it shows you the result. And you look,

46:40 you can inspect all of that, that whole DAG and say, yes, I trust this. So we're working on getting

46:45 full end to end.

46:46 So when I went and asked about that Arrow program, I said, I think this will still do it. I think this

46:51 will probably work again. And it did, which is awesome. Just the way I expected. But, you know,

46:58 AI is not as deterministic as read the number seven. If seven is less than eight, do this,

47:05 right? Like what is the repeatability? What is the sort of experience of doing this? Like I ran it,

47:11 I ran it again. Is it going to be pretty much the same or is it going to have like, what's the mood

47:16 of the AI when it gets to you?

47:18 This is sort of a parameter you can, there's a little bit of a parameter you can set if you want

47:22 to play that game with the temperature parameter on these models at higher and higher temperatures,

47:26 you get more and more random, but it can also truly be out of left field random if you go too

47:31 high temperature.
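
In API terms that knob is just the temperature argument on the completion call; here is a minimal sketch using the pre-1.0 openai client of that era, with an illustrative model name and prompt.

```python
import openai

response = openai.Completion.create(
    model="text-davinci-003",  # illustrative model name
    prompt="Write pandas code to count the unique values in the 'state' column.",
    temperature=0,   # 0 is as deterministic as the API gets; higher values get more random
    max_tokens=200,
)
print(response["choices"][0]["text"])
```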

47:32 Okay. But you get maybe more creative solutions.

47:34 Yeah. You could sometimes get that. And as you move towards zero, it gets more and more

47:38 deterministic. Unfortunately for really trying to do some like good provable, like sort of like

47:43 build chain type things with like hashing and caching and stuff. It's not fully deterministic,

47:48 even at zero temperature, but that's just, I think it's worth thinking about, but at the same time,

47:53 run it once, see the answers that it gives you, comment that business out and just like put that

47:59 as markdown, you know, freeze it, like, memorialize it in markdown, because you don't need

48:05 to ask it over and over what columns have PII. Like, well, probably the same ones as last time.

48:10 We're just kind of like, right, these columns, credit card, social security and purchase address,

48:15 they have, have that. And so now, you know, right. Is that a reasonable way to think about it?

48:20 I think, yeah, if you, if you want to get determinism or the performance is a thing that

48:24 you're worried about, yeah, you can always cache. I think however you do it, comments or actual

48:28 systems.

48:28 Sure. Sure. Sure. Sure. Or that like, like, how do I, you know, how do I do that group by sorting

48:35 business? Like you don't have to ask that over and over once it gives you the answer.

48:38 Yeah. Yeah. My workflow, when I use sketch, definitely I asked the question, I copy the

48:43 code and then I go delete the question or ask it a different question for my next problem that I have.

48:53 Yeah. It's, that code is a little bit vestigial when, when you like save

48:53 your notebook at the end and you sort of want to go back and delete all the questions you asked because

48:57 you don't need to rerun it when you actually just go to execute the notebook later. But yeah,

49:01 that makes a lot of sense. And plus you look smarter if you don't have to show how you got

49:04 the answers.

49:05 Look at this beautiful code that's even commented.

49:07 Yeah, exactly. I guess you could probably ask it to comment your code, right?

49:12 Yeah. You can ask it to describe. There's been some really cool things where people will throw

49:17 like assembly at it and ask it to translate to different like languages so they can interpret

49:20 it. Or you could do really fun things like cross language, cross, I guess I'll say like levels

49:26 of abstraction. You could sort of ask it to describe it like at a very top level, or you can get really

49:30 precise, like for this line, what are all the implications if I change a variable or something like that?

49:34 Yeah, that's really cool. I suppose you could do that here. Can you converse with it? You

49:39 can say, okay, you gave me this. Does it, I guess what's the word, does it have like tokens in context

49:43 like ChatGPT does? Can you say, okay, that's cool, but I want these as integers, not as strings.

49:50 I don't know. Yeah, I did. I did not include that in this. There was a version that had something like

49:55 that, where I was sort of just keeping the last few calls around. But it quickly became it didn't align

50:00 with the Jupyter IDE experience, because you end up like scrolling up and down. And you have too much

50:04 power over how you execute in a Jupyter notebook. So your context can change dramatically by just scrolling

50:10 up. And trying to via inspect look across different like, across a Jupyter notebook is just a whole

50:16 other nightmare. So yeah, I didn't try and like extract the code out of the notebook so that it

50:20 could understand the local context. You could go straight to ChatGPT or something like that,

50:23 take what it gave you and start asking it questions.

50:26 Okay, so another question that I had here about this. So in order for that to do its magic,

50:32 like you said, the really important thought or breakthrough or idea you had was like, not just

50:37 the structure of the pandas code or anything like that, but also a little bit about the data.

50:41 What are the privacy implications of me asking this question about my data? Suppose I have a

50:47 super duper secret CSV. Should I not .ask or .howto on it? Or what is the story there?

50:55 What's the, if I work with data, how much sharing do I do of something I might not want to share if I

51:02 ask a question about it?

51:04 I'd say the same discretion you'd use if you would copy like a row or a few rows of that data into

51:09 a, into ChatGPT to ask it a question about it.

51:12 Okay.

51:12 Is the level of concern I guess you should have. Like, specifically, I am not storing these

51:19 things, but I know, at least as far as I can tell, it seems like they're going to start getting towards like a 30

51:24 day retention thing. So there's a little bit of, yeah, I mean, you're sending your stuff over the

51:28 wire, like over network, if you do this and to use these language models until they come local,

51:32 until these things like LLaMA and Alpaca get good enough that they're, yeah, they're going to be

51:37 remote. Actually, that could be a fun, sorry. I just now thought that could be a fun thing. Like

51:41 just go get Alpaca working with sketch so that it can be fully local.

51:45 Interesting. Like a privacy preserving type of deal.

51:48 Yeah. I hadn't actually, yeah, that's the, that's the power of these, smaller models that are

51:52 almost good enough. I could probably just like quickly throw that in here and see if it works.

51:56 Yeah, maybe it reaches a wider audience.

51:57 You have an option to not go through your API, but directly go to OpenAI. You could have another

52:04 one to pick other, other options, right? Potentially.

52:07 Yep. Yep. The interface to these, one thing that I think is not, maybe it's talked

52:14 about more in other places, but I haven't heard as much excitement about it, is that these,

52:17 the APIs have gotten pretty nice for this whole space. They're all like, the idea of a

52:23 completion endpoint is pretty straightforward. You send it some amount of text and it will continue

52:28 that text. And it's such a, it's so simple, but it's so generalizable. You could build so many

52:32 tools off of just that one API endpoint essentially. And so combine that with an embedding endpoint and

52:38 you sort of have all you need to, to make complex AI apps.
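
The two building blocks he mentions look roughly like this with the pre-1.0 openai package of the era: a completion endpoint that simply continues whatever text you send it, and an embedding endpoint that turns text into a vector. The model names and prompts below are just examples.

```python
# Completion: send text, get the continuation. Embedding: send text, get a vector.
import openai

completion = openai.Completion.create(
    model="text-davinci-003",
    prompt="Write a pandas one-liner that drops rows with any null values:\n",
    max_tokens=64,
    temperature=0.2,
)
print(completion["choices"][0]["text"])

# The embedding vector can be indexed, clustered, or compared for similarity.
embedding = openai.Embedding.create(
    model="text-embedding-ada-002",
    input="customer_id, purchase_date, total_amount",
)
vector = embedding["data"][0]["embedding"]
print(len(vector))  # e.g. 1536 dimensions for this model
```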

52:41 It's crazy. Speaking of making AI apps, maybe touch a bit on your other project, lambdaprompt.

52:48 So yeah, lambdaprompt. Yeah.

52:50 But before you get into it, mad props for like Greek letter, like that's a true physicist or

52:56 mathematician thing. I can appreciate that.

52:59 Yeah. I was excited to put it everywhere, but then of course these things don't play nicely. Playing games with character sets and websites, I'm the one that

53:08 causes it. I both feel the pain and have to clean the data that I also put into these systems.

53:13 So yeah. Yeah. People are like, lambda prompt? Why is the lambda so italicized? I don't get it.

53:17 Yeah. Okay. So this one came when I was working with, this is pre-ChatGPT.

53:25 This is October. I guess it was right around ChatGPT coming out like around that time.

53:29 But I was, I was really just messing around a lot with, completion endpoints as we were talking.

53:32 And I kept rewriting the same request boilerplate over and over. And then I also kept rewriting

53:38 f-strings that I was trying to like send in. And I was just like, ah, Jinja templates solved this

53:43 already. Like there already is formatting for strings in Python. Let me just use that, compose that into a

53:49 function. And there's, let me call these completion endpoints. I don't want to think of them as like

53:53 API endpoints. RPC is a nice mental model, but I want to use them as functions. I want to be able to

53:59 put decorators on them. I want to be able to use them both async or not async in Python. I want to,

54:05 I just want to have this as a thing that I can just call really quickly with one line and just do

54:10 whatever I need to with it. And so I threw this together; it's very simple. Like, honestly, I mean,

54:15 the hardest part was just getting all the layers of it. There are actually two things: you

54:20 can make a prompt out of anything, because I wrap any function as a prompt, so not just these

54:26 calls to GPT. And then I do tracing on it. So as you get into the call stack, every input and output

54:32 is something you can sort of get hooked into and trace with call traces. So there's a bunch of just

54:38 weird stuff to make the utility nice, but functionally, as you can see here,

54:42 you just import it, you write a Jinja template with the class, and then you use that object that

54:48 comes back as a function and your Jinja template variables get filled in. And your result is the

54:53 text string that comes back out of GPT. Interesting. And some people

54:57 might be thinking, Jinja, okay, well I've got to create an HTML file and all that. No, it's just a

55:02 string that has double curlies for turning stuff into strings within the string, kind of a different

55:08 way to do f-strings, as you were hinting at. Yeah. Yeah. There were two pieces here. I realized as

55:13 I was doing this also, and I think I sort of mentioned it with sketch, that I really often was taking the

55:18 output of a language model prompt and doing something in Python. Actually, I can do a full example of

55:24 the SQL writing exploration we did. We would do these things that were sort of, run

55:31 GPT-3 to ask it to write the SQL. You take the SQL, you go try and execute it, but it fails for

55:38 whatever reason, and you take that error and say, hey, rewrite it. So we talked about that

55:42 sort of pattern, which is sort of like rewriting. Another one of the patterns was increase the

55:46 temperature, ask it to write the SQL. You get like 10 different SQL answers in parallel. And this is where

55:52 the async was like really important for this. Cause I just wanted to use asyncio.gather and run all 10

55:57 of these truly in parallel against the OpenAI endpoint, get 10 different answers to the SQL,

56:02 run all 10 queries against your database, then poll on what's the most common. Like of the ones that

56:08 successfully ran, which ones gave the same answer the most often, then that's probably the correct

56:12 answer. And, just chaining that stuff. It's like very pythonic functions. Like you can really

56:19 just imagine like, Oh, I just need to write a for loop. I just need to run this function, take the

56:22 output, feed it into another function, very procedural. But all the abstractions on top of the

56:28 OpenAI API, just everything else, there was nothing else really at the time,

56:34 but even the new ones that have come out, like LangChain, that have sort of taken the space by

56:38 storm now, are not really just trying to offer the minimal ingredient, which is the function.

56:43 And to me, it was just like, if I can just offer the function, I can write a for loop. I can write,

56:47 I can store a variable and then keep passing it into it. You could do so many different

56:51 emergent behaviors with just starting with the function and then simple Python, scripting

56:56 on top of it.
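
A sketch of that parallel SQL pattern is below: ask for ten candidate queries at a higher temperature, run them all, and keep the answer the most successful queries agree on. It uses the pre-1.0 openai package's async call; the database file, model name, and question are invented for illustration, and lambdaprompt's real API may wrap this differently.

```python
# Generate N SQL candidates in parallel, execute them, and majority-vote the result.
import asyncio
import sqlite3
from collections import Counter

import openai

async def write_sql(question: str) -> str:
    resp = await openai.Completion.acreate(
        model="text-davinci-003",
        prompt=f"Write a SQLite query to answer: {question}\nSQL:",
        temperature=0.7,  # higher temperature so the attempts differ
        max_tokens=200,
    )
    return resp["choices"][0]["text"].strip()

def run_query(sql: str):
    with sqlite3.connect("analytics.db") as conn:  # hypothetical database
        return conn.execute(sql).fetchall()

async def best_answer(question: str, n: int = 10):
    # asyncio.gather fires all n generations truly in parallel.
    candidates = await asyncio.gather(*[write_sql(question) for _ in range(n)])
    results = []
    for sql in candidates:
        try:
            results.append(run_query(sql))
        except Exception:
            pass  # queries that fail to execute drop out of the vote
    if not results:
        return None
    # The answer the most successful queries agree on is probably correct.
    answer, _count = Counter(map(repr, results)).most_common(1)[0]
    return answer

# asyncio.run(best_answer("What were total sales per region last month?"))
```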

56:57 And there's some interesting stuff here, lambdaprompt. So you can kind of start

57:04 it, set it up. Kind of like ChatGPT, you can tell it a few things. I'm going to ask you a question

57:09 about a book. Okay. The book is a choose your own adventure book. Okay. Now here I'm going to, like,

57:14 you can prepare it, right? There's probably a more formal term for that, but you can do this here.

57:19 Like you can say, Hey system, you are a type of bot. And then you, that creates you an object that

57:25 you can have a conversation with. And you say, what should we get for lunch? And your type of bot is

57:29 pirate. And then it'll say, as a pirate, I would suggest we have some hearty seafood or whatever,

57:33 right? Like that's, that's beyond what you're doing with sketch. I mean, obviously this is not so much

57:37 for code. This is like conversing with Python rather than in Python. I don't know. And your editor.

57:42 Yeah. This one was when the OpenAI chat API endpoint came out and I was just like, oh, I should support

57:49 it. So that's what this, I wanted to be able to Jinja template inside of the conversation. So you

57:53 can imagine a conversation that is prepared with like seven steps back and forth, but you want to hard

57:59 code with the conversation, like how the flow of the conversation was going. And you want to template

58:02 it so that, like, on message three it puts your new context problem, on message four it puts the output

58:08 from another prompt that you ran on message five. It is this other data thing. And then you ask it to

58:13 complete this. The intent is, like, it's arbitrarily complex, but still something like that

58:18 would be, you know, just three lines or so in lambdaprompt. The idea was that it would offer up a really

58:23 simple API for this. The other thing that's interesting is it has an async and a non-async version. So that's,

58:28 that's cool. People can check that out. Also a way to make it hosted as a web service with, say,

58:35 like FastAPI or something like that. Yeah. And you can make it a decorator if you like, an @prompt

58:42 decorator. Yeah. On any function you can just throw @prompt and it wraps it with the same class

58:47 so that all the magic you get from that works. The server bit is, so FastAPI has

58:54 that sort of inspection on the function part. I did a little bit of middleware to get the

59:00 two happy together. And then all you have to do is import FastAPI and then run, you know, Gunicorn

59:06 that app. And it's two lines, and any prompts you have made become their own independent REST

59:14 endpoint where you can just do a get or a post to it. And it returns the response from calling the prompt.
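
A hand-rolled sketch of that idea is below: expose a prompt-like function as its own endpoint with FastAPI. lambdaprompt does this wiring for you; the exact import paths and hookup in the library may differ, and the summarize function here is a hypothetical stand-in for a real prompt.

```python
# Serve a prompt-style function as a REST endpoint with FastAPI.
from fastapi import FastAPI

app = FastAPI()

async def summarize(text: str) -> str:
    """Hypothetical prompt-function: in practice this would call a completion endpoint."""
    return f"(summary of {len(text)} characters of text)"

@app.get("/prompt/summarize")
async def summarize_endpoint(text: str):
    # A GET (or POST) to /prompt/summarize?text=... returns the prompt's response.
    return {"result": await summarize(text)}

# Run with, e.g.:  gunicorn -k uvicorn.workers.UvicornWorker module_name:app
```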

59:20 But these prompts can also be these chains of prompts. Like one prompt can call another prompt,

59:24 which can call another prompt. And those prompts can call async to non-async back to async and things

59:28 like that. And it should work. Pretty sure this one actually, I did test everything as far as I know,

59:34 I'm pretty sure I've got pretty good coverage. So yeah, super cool. All right. Well, we're getting a little

59:39 short on time, but I, I think people are going to really, really dig this, especially sketch. I think

59:43 there's a lot of folks out there doing pandas that would love an AI buddy to help them

59:50 do things like not just analyze the code, but the data as well.

59:55 Yeah. I think for anybody, I know it is for me, it's just like Copilot in

59:59 your VS Code IDE, sketch in your Jupyter IDE, takes almost nothing to add. And you,

01:00:05 whenever you're just sort of sitting there, you think you're about to alt tab to go to Google. You

01:00:08 could just try the sketch.ask and it's surprising how often that sketch.ask or sketch.howto gets you

01:00:14 way closer to a solution without even having to leave the, you don't even have to leave your,

01:00:17 your environment. It's like a whole other level of autocomplete for sure. And super cool. All right.
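
In a notebook cell, that sketch.ask workflow looks roughly like this. Importing sketch registers a .sketch accessor on DataFrames; the column names below are just an example dataset.

```python
import pandas as pd
import sketch

df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "purchase_date": ["2023-01-05", "2023-02-11", "2023-02-28"],
    "total_amount": [19.99, 5.00, 42.50],
})

# Ask a question about the data itself.
df.sketch.ask("Which columns might contain personally identifiable information?")

# Ask how to do something; it answers with code you can copy into the next cell.
df.sketch.howto("Group by month of purchase_date and sum total_amount")
```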

01:00:23 Now, before I let you out of here, you got to answer the final two questions. If you're going to write

01:00:27 some Python code and it's not a Jupyter notebook, what editor are you using? It sounds to me like you may

01:00:33 have just given a strong hint at what that might be. Yeah. I've switched almost entirely to VS Code.

01:00:38 and I've been really liking it with the remote development and, like it's just, I work

01:00:43 across like many machines, both cloud and local and some like five, six different machines are my

01:00:48 like primary working machines. And I use the remote, VS Code thing. And it just, I have a unified

01:00:53 environment that gives me terminal, the files and the code all in one, and Copilot on all of them.

01:00:59 Yeah. It's wild. All right. And then notable PyPI package. I mean, pip install sketch,

01:01:04 you can throw that out there if you like. That's pretty awesome. But anything you've run across

01:01:08 you're like, Oh, this is people should know about this. Yeah. It doesn't have to be popular. Just

01:01:12 like, Oh, this is cool. In the, I guess these, these two are very popular, but, in the data

01:01:17 space, I really, I'm a huge fan of Ray and also Arrow. Like I use those two tools as like

01:01:25 my backend bread and butter for everything I do. And so those have just been really great work.

01:01:30 Apache Arrow. Right. And then Ray, I'm not sure. Yeah. Ray is a distributed scheduling compute

01:01:38 framework. It's sort of like a, right. I don't know what they, yeah. I remember seeing about this.

01:01:42 Yeah. This is, I didn't talk about it earlier, but I'm parsing Common

01:01:47 Crawl, which is like 25 petabytes of data. And Ray is great. It's just the workhorse. Its power is

01:01:53 really useful. Like I find it's so snappy and good, and it offers everything I need in a distributed

01:02:00 environment. So I can write code that runs on a hundred machines and not have to think about it.
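
That appeal looks roughly like this with Ray's remote-function API: write ordinary Python, decorate it, and Ray schedules it across however many machines are in the cluster. The parse_page function and the docs list are made-up stand-ins for a real Common Crawl-style workload.

```python
import ray

ray.init()  # connect to a cluster, or start a local one for testing

@ray.remote
def parse_page(raw_html: str) -> int:
    # Pretend "parsing": just count words in the page.
    return len(raw_html.split())

docs = ["<html>one small page</html>", "<html>another page here</html>"]

# .remote() returns futures immediately; ray.get() gathers the results,
# whether the tasks ran on this laptop or a hundred machines.
futures = [parse_page.remote(doc) for doc in docs]
print(ray.get(futures))
```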

01:02:05 It works really well. That's, that's pretty nuts. Not as nuts as ChatGPT and Midjourney,

01:02:09 but it's still pretty nuts. So before we call it a day, do you want to tell people about Approximate

01:02:14 Labs? It sounds like you guys are making some good progress. You might have some jobs for people

01:02:19 to work in this kind of area as well. Yeah. So, we're, we're working at the intersection

01:02:23 of AI and tabular data. So anything related to training these large language models,

01:02:28 and also, tabular data. So things with columns and rows, we are trying to like solve that problem,

01:02:32 try and bridge the gap here. Cause there's a pretty big gap. We have three main initiatives

01:02:37 that we're working on. One is we're trying to build up the data set of data sets. So just like The Pile

01:02:41 or The Stack or LAION-5B, these big data sets that were used to train all these big

01:02:46 models. We're making our own on tabular data. We are training models. So this is actually

01:02:51 training large language models, training these full transformer models.

01:02:55 And then we're also building apps like sketch, like UIs, things that are actually there to help

01:03:00 make data more accessible to people. So anything that helps people get value from data and make it open

01:03:05 source. That's what we're working on. We just, raised our seed round. So we are now officially

01:03:11 hiring. So, looking for people who are interested in the space and who are enthusiastic

01:03:15 about these problems. Awesome. Well, very exciting demo libraries, I guess, however you call them.

01:03:23 But I think this, I think these are neat. People are going to find a lot of cool uses for them. So

01:03:27 excellent work and congrats on all the success so far. It sounds like you're just starting to take

01:03:32 off. Yeah. Thank you. All right, Justin, final call to action. People want to get started.

01:03:36 Let's pick sketch. People want to get started with sketch. What do you tell them?

01:03:39 Just pip install it. Give sketch a try: pip install it, import it, and then throw it at your data frame. Awesome. And then ask it questions or how-tos.

01:03:49 Yeah. Yeah. Yep. Whatever you want. If you really want to, and you

01:03:53 trust the model, like throw some applies at it and have it clean your data for you. Cool.

01:03:57 Awesome. All right. Well, thanks for being on the show.

01:04:00 Coming in here and telling us about all your work. It's great. Yeah. Thank you. Yeah. See you later.

01:04:04 Thanks for having me.

01:04:05 This has been another episode of talk Python to me. Thank you to our sponsors. Be sure to check out

01:04:11 what they're offering. It really helps support the show. Stay on top of technology and raise your value

01:04:16 to employers or just learn something fun in STEM at brilliant.org. Visit talkpython.fm/brilliant

01:04:24 to get 20% off an annual premium subscription. Want to level up your Python? We have one of the largest

01:04:30 catalogs of Python video courses over at Talk Python. Our content ranges from true beginners

01:04:35 to deeply advanced topics like memory and async. And best of all, there's not a subscription in

01:04:40 sight. Check it out for yourself at training.talkpython.fm. Be sure to subscribe to the show,

01:04:45 open your favorite podcast app and search for Python. We should be right at the top. You can also find

01:04:51 the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at

01:04:57 /rss on talkpython.fm. We're live streaming most of our recordings these days. If you want to be part of

01:05:03 the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at

01:05:08 talkpython.fm/youtube. This is your host, Michael Kennedy. Thanks so much for listening. I really

01:05:14 appreciate it. Now get out there and write some Python code.

01:05:16 I'll see you next time.
