#410: The Intersection of Tabular Data and Generative AI Transcript
00:00 AI has taken the world by storm.
00:01 It's gone from near zero to amazing in just a few years.
00:05 We have ChatGPT, we have Stable Diffusion.
00:07 What about Jupyter Notebooks and Pandas?
00:10 In this episode, we meet Justin Waugh, the creator of Sketch.
00:13 Sketch adds the ability to have conversational AI interactions about your Pandas data frames, code, and data
00:21 right inside of your notebook.
00:23 It's pretty powerful, and I know you'll enjoy the conversation.
00:26 This is Talk Python to Me, episode 410, recorded April 2nd, 2023.
00:31 Welcome to Talk Python to Me, a weekly podcast on Python.
00:47 This is your host, Michael Kennedy.
00:49 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython,
00:55 both on fosstodon.org.
00:57 Be careful with impersonating accounts on other instances.
00:59 There are many.
01:00 Keep up with the show and listen to over seven years of past episodes at talkpython.fm.
01:05 We've started streaming most of our episodes live on YouTube.
01:09 Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows
01:15 and be part of that episode.
01:17 This episode is brought to you by Brilliant.org and us with our online courses over at Talk Python Training.
01:24 Justin, welcome to Talk Python to Me.
01:27 Thanks for having me.
01:28 It's great to have you here.
01:29 I'm a little suspicious.
01:30 I've got to know: how do I really test whether you're actually Justin or an AI speaking as Justin?
01:37 What's the deal here?
01:40 Yeah, there's no way to know now.
01:41 No, there's not.
01:42 Well, apparently I've recently learned from you that I can give you a bunch of Xs
01:46 and other arbitrary characters.
01:47 This is like the test.
01:49 It's like asking the Germans to say squirrel in World War II sort of thing.
01:54 Like it's the test.
01:55 It's the tell.
01:56 There's always going to be something.
01:58 It's some sort of adversarial attack.
01:59 Exactly.
02:01 It's only going to get more interesting with this kind of stuff for sure.
02:05 So we're going to talk about using generative AI and large language models paired with things like Pandas
02:13 or consumed with straight Python with a couple of your projects, which are super exciting.
02:18 I think it's going to empower a lot of people in ways that it hasn't really been done yet.
02:24 So awesome on that.
02:25 But before we get to it, let's start with your story.
02:27 How did you get into programming in Python and AI?
02:30 Let's see.
02:30 I got into programming in just like when I was a kid, TI-83, learning to code on that.
02:36 And then sort of just kept it up as a side hobby my whole life.
02:40 Didn't ever sort of choose it as my career path or anything for a while.
02:44 It chose you.
02:44 Yeah, it chose me.
02:46 It just, I dragged it along with me everywhere.
02:47 It's just like the toolkit.
02:49 I went to undergrad for physics and electrical engineering, then did a physics PhD, experimental physics.
02:57 During that, I did a lot of non-traditional languages, things like LabVIEW, Igor Pro,
03:02 just weird Windows hotkey stuff for trying to automate things.
03:08 Yeah, sure.
03:08 So just was sort of dragging that along.
03:11 But along that path, sort of came across GPUs and used it for accelerating processing,
03:16 specifically like particle detection.
03:17 So it was doing some like electron counting in some just detector experiments.
03:23 Is this like CUDA cores on NVIDIA type thing?
03:25 Precisely.
03:26 Stuff like that.
03:26 Okay.
03:27 And was that with Python or was that with C++ or what?
03:29 At the time it was C++ and I made like a DLL and then called it from LabVIEW.
03:33 Wow, that's some crazy integration.
03:35 It's like drag and drop programming too on the memory GPU.
03:39 Exactly.
03:40 It was all over the place.
03:41 Also had, it was a distributed LabVIEW project.
03:43 We had multiple machines that were coordinating and doing this all just to move some motors
03:49 and measure electrons.
03:50 But it got me into CUDA stuff, which then at the time was around the time
03:55 that the like AlexNet, some of these like very first neural net stuff was happening.
03:59 And so those same convolutional kernels were the same exact code that I was trying to write
04:03 to run like convolutions on these images.
04:04 And so it's like, oh, look at this like paper.
04:06 Oh, let me go read it.
04:07 It seems like it's got so many citations.
04:09 This is interesting.
04:09 And then like that sent me down the rabbit hole of like, oh, this AI stuff.
04:12 Oh, okay.
04:13 Let me go deep dive into this.
04:14 And then that just, I'd say that became the obsession from then on.
04:18 So it's been like eight years of doing that.
04:20 Then sort of just after I left academia, tried my own startup, then joined multiple others
04:26 and just sort of have been bouncing around as the sort of like founding engineer,
04:30 early engineer at startups for a while now.
04:32 And yeah, Python has been the choice ever since like late grad school and on.
04:38 I would say it sort of like came through the pandas and NumPy part, but then stuck for the scripting,
04:44 like just power, just can throw anything together at any time.
04:47 So it seems like there were two groups that were just hammering GPUs, hammering them,
04:53 crypto miners and AI people.
04:57 but the physicists and some of those people doing large scale research like that,
05:01 they were the OG graphics card users, right?
05:04 Way before crypto mining existed and really before AI was using graphics cards
05:09 all that much.
05:10 When I was like looking at some of the code, like pre-CUDA, there were some like
05:13 quant traders that were doing some like crazy stuff off of shaders.
05:17 Like it wasn't even CUDA yet, but it was shaders and they were trying to like
05:19 extract the compute power out of them from that.
05:22 So...
05:23 Look, if we could shave one millisecond off this, we can short them all day,
05:27 let's do it.
05:28 But yeah.
05:29 Yeah.
05:29 The physicists, I mean, it's always been like, yeah, it's always the get
05:33 as much compute as you can out of the, you know, devices you have because simulations are slow.
05:37 Yeah.
05:38 I remember when I was in grad school studying math, actually senior year,
05:41 regular college, my bachelor's, the research team that I was on had gotten a used
05:47 Silicon Graphics computer for a quarter million dollars and some Onyx workstations
05:53 that we were all given access to.
05:54 I'm like, this thing is so awesome.
05:56 A couple years later, like an NVIDIA graphics card and like a simple PC would crush it.
06:01 Like that's $2,000.
06:03 It's just, yeah, there's so much power in those things to be able to harness them
06:06 for whatever, I guess.
06:07 Yeah.
06:07 As long as you don't have too much branching, it works really well.
06:10 Awesome.
06:11 So let's jump in and start talking about, let's start to talk about ChatGPT
06:18 and some of this AI stuff before we totally get into the projects that you're working on,
06:24 which brings that type of conversational generative AI to things like Pandas,
06:30 as you said.
06:31 But to me, I don't know how, maybe you've been more on the inside than I have,
06:36 but to me, it looks like AI has been one of those things that's 30 years
06:41 in the future forever, right?
06:42 It was like the Turing test and, oh, here's a chat, I'm going to talk to this thing
06:46 and see if it feels human or not.
06:48 And then, you know, there was like OCR and then all of a sudden we got self-driving cars,
06:55 like, wait a minute, that's actually solving real problems.
06:57 And then we got things like ChatGPT where people are like, wait, this can do my job.
07:02 It seems like it, just in the last couple of years, there's been some inflection point
07:07 in this world.
07:08 What do you think?
07:09 Yeah, I think there's sort of like two key things that have sort of happened
07:12 in the past, I guess, four or five years, four years, roughly.
07:15 One is the Attention Is All You Need paper from Google; sort of this transformer
07:19 architecture came out, and it's a very hungry model that can just sort of
07:23 absorb a lot of facts, almost like a nice learnable key-value store, and that stuck.
07:28 So, and then the other thing is, is the GPUs.
07:31 We were sort of just talking about GPU compute, but this has just been really,
07:34 GPU compute has really been growing so fast.
07:38 If you look at the, like, Moore's law equivalent type things, it's just,
07:41 the rate at which we're getting flops out of these things keeps getting faster and faster.
07:45 So, it's been really nice.
07:46 I mean, obviously there'll be a wall eventually, but it's been good riding this like
07:51 exponential curve for a bit.
07:52 Yeah, is the benefit that we're getting from the faster GPUs, is that because
07:57 people are able to program it better and the frameworks are getting better
08:00 or because just the raw processing power is getting better?
08:03 All of the above.
08:04 Okay.
08:04 I think that there was a paper that tried to dissect this.
08:07 I wish I knew the reference, but I believe that their argument was that it was
08:11 actually more that the processing power was getting better.
08:13 The actual physical silicon, we were getting better at making that for specifically
08:17 this type of stuff.
08:17 But like on exponentials, but yeah.
08:20 the power that those things take, I have a gaming system over there and it has a
08:26 GeForce 2070 Super.
08:29 I don't know what the Super really gets me, but it's better than the not Super,
08:32 I guess.
08:33 Anyway, that one still plugs into the wall normal, but the newer ones, like the 4090s,
08:40 those things, the amount of power they consume, it's like space heater level of power.
08:45 Like, I don't know, 800 watts or something just for the GPU.
08:48 You're going to brown out the house if you plug in too many of those.
08:53 Yeah.
08:53 Go look at those DGX A100 clusters and they've got like eight of those A100s just stacked
09:00 right in there.
09:03 They take really beefy power supplies.
09:03 It's built right directly attached to the power plant, electrical power plant.
09:08 Nuts.
09:09 Okay, so yeah, so those things are getting really, really massive.
09:12 Here's the paper Attention is All You Need from Google Research.
09:15 What was the story of that?
09:18 How's that play into things?
09:19 Yeah, so this came up during like machine translation sort of research at Google
09:23 and the core thing is they present this idea of instead of just stacking
09:30 these like layers of neural nets like we're sort of used to, they replace the like
09:34 neural net layer with this concept of a transformer block.
09:38 A transformer block has this concept inside that's an attention mechanism.
09:42 The attention mechanism is effectively three matrices that you combine in a specific order,
09:48 and the logic is that one of the matrices takes you from your data
09:54 to keys, so it's almost like identifying labels out of your data.
09:58 Another one takes you from your data to queries, and then it dot-products those
10:03 to find weights, and then another one finds the values
10:08 for your things.
10:08 So you take this query and key, you get the weights from them, and then you take
10:13 the ones that were the closest to get those values from the third matrix.
10:16 Doing that looks a little bit like accessing an element in a dictionary,
10:22 like a key-value lookup, and it's a differentiable version of that, and it did really well
10:28 on their machine translation stuff.
10:31 This was, I think, one of the first big ones, this BERT model. And from that paper,
10:37 sort of the architecture of the actual neural net code is effectively unchanged
10:43 from this to ChatGPT.
10:45 Like there's a lot of stuff for like milking performance and increasing stability
10:50 but the actual like core essence of the actual mechanism that drives it it's the same
10:54 thing since this paper.
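For readers who want to see that mechanism concretely, here is a minimal NumPy sketch of single-head scaled dot-product attention; the matrix names and toy shapes are illustrative, not taken from any particular implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product attention over a sequence X.

    X: (seq_len, d_model) token embeddings
    W_q, W_k, W_v: (d_model, d_k) projection matrices (the "three matrices")
    """
    Q = X @ W_q                                   # queries: "what am I looking for?"
    K = X @ W_k                                   # keys:    "what label do I carry?"
    V = X @ W_v                                   # values:  "what content do I return?"
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # dot-product similarity
    weights = softmax(scores, axis=-1)            # soft, differentiable "dictionary lookup"
    return weights @ V                            # weighted sum of values

# Toy example: 4 tokens, 8-dim embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(attention(X, W_q, W_k, W_v).shape)  # (4, 8)
```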
10:55 Interesting.
10:55 It's funny that Google didn't release something sooner.
10:58 It's wild that they keep showing off that they've got equivalent
11:04 or better things at different times, but then not releasing them.
11:06 When DALL-E happened, they had Imagen,
11:11 and what was the other one, Parti, I guess; they had two different, really good, way better
11:16 than DALL-E, way better than Stable Diffusion models that were done,
11:20 and they showed it, demoed it, but never released it to be used. So yeah,
11:25 it's one of these, who knows what's going to happen with Google if they keep
11:28 holding on to these things.
11:28 Yeah, well, I think there was some hesitation, I don't know, holdups on accuracy
11:33 or weird stuff like that.
11:34 Sure.
11:35 Yeah, now the cat's out of the bag; now it's happening.
11:38 Yeah the cat's out of the bag and people are racing to do the best they can and it's
11:43 going to have interesting consequences for us both positive and negative I think
11:47 but you know let's leverage the positive once the cat's out of the bag anyway right?
11:51 Yeah.
11:51 Hopefully.
11:52 Might as well like ask it questions for pandas.
11:55 So let's play a little bit with ChatGPT and maybe another one of these image type
12:00 things.
12:00 So I came in here and I stole this example from a blog post that's pretty nice
12:05 about not using deeply nested code.
12:08 You can use a design pattern called a guard clause that will look and say, if the
12:14 conditions are not right, we're going to return early, instead of having if something,
12:18 if that, also if something else. So there's this example that is written in a poor
12:23 way, and it's checking for a platypus, so it says if self.is_mammal,
12:29 if self.has_fur, if self.has_beak, etc.
12:32 It's all deeply nested, and just for people who haven't played with ChatGPT,
12:37 I put that in, and I told it I wanted to call this arrow
12:40 because it looks like an arrow, and it tells me a little bit about
12:44 this. So I'm going to ask it: please rewrite arrow to be less nested, with guarding
12:52 clauses. Right? This is like a machine, right? If I tell it this, what is it going to say?
12:57 Let's see. It may fail, but I think it's going to get it. It's thinking. I put it,
13:01 I mistakenly put it into GPT-4, which takes longer. I might switch it over to
13:06 3. I don't know. But the understanding of these things, there's a lot of hype
13:11 about it. Like, I think you kind of agree with me that maybe this hype is worthwhile.
13:16 Here we go. So look at this. It rewrote it. It said is_platypus: if not self.is_mammal,
13:22 return False; if not self.has_fur... and there's no more nesting. That's pretty
13:25 cool, right? Yep. I mean, I'm sure you've played with stuff like this, right? Yeah, big user of this.
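For reference, the rewrite being described looks roughly like this; the method and attribute names are reconstructed for illustration rather than taken from the blog post's exact code.

```python
# Deeply nested "arrow" version
def is_platypus(self):
    if self.is_mammal:
        if self.has_fur:
            if self.has_beak:
                if self.lays_eggs:
                    return True
    return False

# Rewritten with guard clauses: return early as soon as a condition fails
def is_platypus(self):
    if not self.is_mammal:
        return False
    if not self.has_fur:
        return False
    if not self.has_beak:
        return False
    if not self.lays_eggs:
        return False
    return True
```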
13:29 I mean, this is kind of interesting, right? Like it
13:33 understood there was a structure, and it understood what these were, and it
13:35 understood what I said. But what's more impressive is, like, please rewrite
13:40 the program to check for crocodiles. And, you know, what is it
13:49 going to do here? Let's see. It says sure, no problem, and writes the function is_crocodile:
13:54 if not self.is_reptile, if not self.has_scales, if not self.has_long_snout...
14:00 Oh my gosh. Like, it not only remembered, oh yeah, there's this new version I
14:06 wrote in the guard clause format, but then it rewrote the tests. And then
14:12 it's explaining to me why it wrote it that way it's just it's mind blowing
14:18 like how how much you can have conversations with this and how much it understands
14:23 things like code or physics or history what do you think yeah it's really
14:28 satisfying I love that it's such a powerful generalist at these like things
14:33 that are found on the internet so if it like if it exists and it's in the training
14:36 data it can do so good at synthesizing composing bridging between them it's really
14:41 satisfying so it's really fun asking it to as you're doing rewriting changing
14:45 language I've been getting into a lot more JavaScript because I'm doing a
14:48 bunch more like front end stuff and just I sometimes will write a quick one liner in
14:51 Python that I know how to do with a list comprehension and then I'll be like
14:55 make this for me in JavaScript because I can't figure out this like how to
14:59 initialize an array with integers in it it's great for just like really quick spot
15:04 checks and it also seems to know a lot about like really popular frameworks
15:07 So you can ask it things that are surprisingly detailed, like how would you
15:12 do CORS with requests in FastAPI, and it can help you find that exact middleware.
15:18 You know, it's like boilerplate-y, but it's great that it can just be a source for that.
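For reference, the CORS setup being described is mostly standard FastAPI boilerplate; the allowed origin below is a placeholder.

```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

# Allow a browser front end on another origin to call this API
app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://example.com"],  # placeholder: your front-end origin(s)
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
```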
15:22 this portion of Talk Python to Me is brought to you by brilliant.org you're a
15:28 curious person who loves to learn about technology I know because you're
15:31 listening to my show that's why you would also be interested in this episode's
15:35 sponsor brilliant.org brilliant.org is entertaining engaging and effective
15:40 if you're like me and feel that binging yet another sitcom series is kind
15:44 of missing out on life then how about spending 30 minutes a day getting better
15:47 at programming or deepening your knowledge and foundations of topics you've always
15:51 wanted to learn better like chemistry or biology over on brilliant brilliant
15:57 has thousands of lessons from foundational and advanced math to data science
16:01 algorithms neural networks and more with new lessons added monthly when you sign up
16:06 for a free trial they ask a couple of questions about what you're interested
16:09 in as well as your background knowledge then you're presented with a cool learning
16:13 path to get you started right where you should be personally I'm going back to some
16:17 science foundations I love chemistry and physics but haven't touched them
16:20 for 20 years, so I'm looking forward to playing with PV = nRT, you know,
16:26 the ideal gas law and all the other foundations of our world with brilliant
16:30 you'll get hands-on on a whole universe of concepts in math science computer science
16:35 and solve fun problems while growing your critical thinking skills of course
16:39 you could just visit brilliant.org directly its url is right there in the name
16:43 isn't it but please use our link because you'll get something extra 20% off
16:47 an annual premium subscription so sign up today at talkpython.fm/brilliant
16:52 and start a 7 day free trial that's talkpython.fm/brilliant the link is
16:57 in your podcast player show notes thank you to brilliant.org for supporting
17:01 the show it's insane I don't know if I've got it in my history here we're rewriting
17:09 our mobile apps for Talk Python Training, for our courses, in Flutter, and we're
17:15 having a problem downloading stuff concurrently using a particular library
17:19 in Flutter and so I asked it I said hey I want some help with a Flutter and Dart
17:26 programs what do you want it says I'm using the dio package do you know it
17:30 oh yes I'm familiar it does HTTP client stuff for Dart okay I want to download
17:34 binary video files and a bunch of them given a URL I want to do them concurrently
17:39 with three of them at a time write the code for that and boom it just writes it
17:42 like using that library I told it about not just Dart so that's incredible that
17:48 we can get this kind of assistance for knowledge and programming like you'll
17:52 never find I mean I take that back you might find that if there's a very
17:55 specific stack overflow question or something but if there's not a write-on
17:59 question for it you're not going to find it I love when you know the stack
18:04 overflow would exist for a variant of your question but the exact one doesn't
18:08 exist and you have to go grab the three of them to synthesize and it's just great
18:12 at that it also is pretty good at fixing errors sometimes it can walk itself into
18:17 lying to you repeatedly but that's so problematic yeah but you can also ask
18:24 it here's my program are there security vulnerabilities or do you see any
18:28 bugs and it'll find them yep yeah it's nuts so people may be wondering we haven't
18:34 talked yet about your project Sketch, why I'm talking so much about ChatGPT:
18:38 so that is kind of the style of AI that your project brings to pandas which we're
18:44 going to get to but I want to touch on two more really quick AI things that
18:47 we'll dive into it the other is this just around images just the ability to
18:52 ask questions. You've already mentioned three: DALL-E, Imagen, and then the other
18:57 one I don't remember from Google that they haven't put out yet. Midjourney is
19:01 another just the ability to say hey I want a picture of this no actually
19:05 change it slightly like that it's mind blowing they're a lot of fun they're great
19:09 for sparking creativity or having idea and just getting to see it in front of
19:12 you. I think it's more impressive to me than even this ChatGPT. I told it
19:16 I want an artificial intelligence panda, and I
19:32 want it photorealistic in the style of National Geographic, and so it gave
19:36 me this panda you can see beautiful whiskers but just behind the ear you can see
19:41 the fur is gone and it's like an android type of creature that is a beautiful
19:48 picture it's pretty accurate it's nuts that I can just go talk to these systems
19:52 and ask them these questions I find it interesting comparing the ChatGPT
19:56 and the Midjourney style. I completely get it, it's very visceral. It's also
20:04 from another perspective I think of the weights and the scale of the model
20:07 and these image ones that solve all images are so much smaller in scale than
20:14 these language ones that have all this other data and stuff. So it's fascinating how complex language is. Yeah, I know the smarts is so much less,
20:20 but just something about it actually came up with a creative picture that never existed.
20:26 Yeah. Right. You could show this to somebody like, oh, that's an artificial panda. That's
20:31 insane. Right. But it's, but I just gave it like a sentence or two. Yeah. Yeah. Yeah. I don't know.
20:36 Yeah. This, it's a sort of a technical interpreter, but I, I love it because it's
20:41 like this, it's just phenomenal interpolation. It's like through semantically labeled space. So
20:46 like the words have meaning and it understands the meaning and can move sliders of like, well,
20:51 I've seen lots of these machine things. I understand the concept of gears and this metal and this,
20:54 like the shiny texture and then the fur texture and like, they're very good at texture. It's a,
21:00 yeah, really great how it interprets all of that just to fit the, you know, the small prompt.
21:04 Yeah. There are other angles of which it's frustrating. Like I want it turned, I want it
21:08 in the back of the picture, not the, no, it's always in the center. One more thing really quick.
21:13 And this leads me into my final thing is, is a GitHub copilot. GitHub copilot is like this in
21:19 your editor, which is kind of insane, right? You can just give it like a comment or a series of
21:23 comments and it will write it. I think ChatGPT is maybe more open-ended and more creative, but this
21:29 is, this is also a pretty interesting way to go. I'm a heavy user of Copilot. There's a,
21:35 there's a weird crutch, and I'm slowly developing a need to have this in my browser. I was
21:40 on a flight recently and was without the internet, and Copilot wasn't working. And I felt the, like,
21:46 I felt the difference. I felt like I was like walking through mud instead of just like actually
21:50 running a little bit. And I was like, Oh, I've been disconnected from my distributed mind. I am broken
21:56 partially. Yeah. So incredible. So the last part I guess is like, you know, what are the ethics
22:03 of this? Like, I went on very positively about Midjourney, but how much of that is trained on
22:08 copyright material or there's GitHub copilot. How much of that is trained on GPL based stuff that was
22:16 in GitHub. But when I use it, I don't have the GPL any longer on my code. I might use it on commercial
22:23 code, but just running it through the AI, does that strip licenses or does it not? There's a
22:29 githubcopilotlitigation.com, which is interesting. I mean, we might be finding out. There's also, I
22:35 think Getty, I think it's the Getty images. I'm not 100% sure, but I think Getty images is suing
22:41 one of these image generation companies. I can't remember which one, maybe Midjourney. I
22:47 don't think it's Midjourney. I think it's Stable Diffusion, but anyway, it doesn't really matter.
22:50 Like there's a bunch of things that are pushing back against us. Like, wait a minute,
22:53 where did you get this data? Did you have rights to use this data in this way? And I mean,
22:58 what are your thoughts on this angle of AI these days?
23:02 Yeah. I know it sounds like I don't worry too much about it in either direction. I think I
23:08 believe in personal ethics. I believe in open source things, availability of things,
23:14 because it just sort of like accelerates collective progress. But that said, I also believe in like
23:19 slightly different like social structures to help support people. Like I'm a, I guess,
23:24 a personal believer in things like UBI or something like that, in that direction.
23:27 So when you combine those, I feel like it, you know, things sort of work out kind of well,
23:31 but it is still a thing that, like, copyright exists and that there is this sense of ownership and this is my thing. And I wanted to
23:38 put licenses on it. And, I think that this sort of story started presumably that I wasn't really
23:44 having this conversation, but like when the internet came around and search engines happened and like
23:49 Google could just go and pull up your thing from your page and summarize it in a little blob on the
23:54 page, was that fair? What if it's, you know, your shop, and it allows you to go buy that
23:58 same product from other shops. Like it, I think that the same things are showing up and in the same way
24:04 that the web, like in the internet sort of, it's sort of, it was a large thing, but then it sort of,
24:08 I don't know if it got quieter, but it sort of became in the background. We sort of found new
24:12 systems. It stopped being piracy and CDs and the music industry is going to struggle. And Hey,
24:17 things like Spotify exist and streaming services exist. And like, I don't know what the next way
24:21 is.
24:21 They're doing better than ever basically. Yeah. Yeah. Yeah. So I think it's just evolution.
24:24 And then like the, some things will change and adopt some things will like fall apart and new
24:29 things will be born. I, that's just a great, it's a good time for lots of opportunity, I guess is the
24:33 part that I'm excited about.
24:34 Yeah. Yeah. Yeah. For sure. I think that's definitely true. It probably, you're probably right. It probably
24:39 will turn out to be, you know, old man yells at cloud cloud doesn't care sort of story, you know,
24:45 in the end where it's like, on the other hand, if, if somebody came back and said,
24:49 you know, a court came back and said, you know what, actually anything trained on GPL
24:54 and then you use copilot on it, that's GPL. Like that would have instantly mega effects. Right.
25:02 Yeah. I, yeah. And I, I guess there's also stuff like the, I don't, I didn't actually read the
25:06 article. I only saw the headline and you know, that's the worst thing to do is to repeat a thing,
25:09 which is a headline. But, there was that Italy thing that I saw about, like, I don't know.
25:14 Yeah. That was really clickbaity, but I didn't get time to look at it yet. So yeah. You could probably
25:20 ask ChatGPT to summarize it for you. As long as it can Bing, I guess, to get that updated.
25:25 Yeah. Yeah. Yeah. Yeah. There's a lot of, there's a lot of things playing in that space,
25:30 right? Some different places. Okay. So yeah, very cool. But as a regular user, I would say,
25:36 you know, regardless of kind of how you feel about this, at least this is my viewpoint right now.
25:40 It's like, regardless of how I feel about which side is right in these kinds of disputes,
25:45 this stuff is out of the bag. It's out there and available and it's a tool. And it's like saying,
25:50 you know, I don't want to use spell check or I don't want to use some kind of like code checking. I just
25:55 want to write like in straight notepad because it's pure, right? Like sure you could do that,
26:00 but there's these tools that will help us be more productive and it's better to embrace them and know
26:05 them than to just like yell at them, I suppose. Yeah. A lot of accelerant you can get.
26:10 really speed up whatever you want to get done. Yeah, absolutely. All right. So speaking of
26:15 speeding up things, let's talk pandas and not even my artificial pandas, but actual programming pandas
26:22 with this project that you all have from Approximate, yeah, Approximate Labs, called Sketch. So Sketch is
26:30 pretty awesome. Sketch is actually why we're talking today because I first talked about this on Python
26:35 bytes and I saw this was sent over there by Jake Furman and to me and said, you should check this
26:41 thing out. It's awesome. And, yeah, it's pretty nuts. So tell us about sketch. Yeah. So, even
26:49 though I use Copilot, as I sort of described already, and it's become a crutch, I found, in Jupyter
26:54 notebooks, when I wanted to work with data, it just doesn't actually apply there. So on one side,
27:00 it was sort of like missing the mark at times. And so it was sort of like, how can I get this
27:04 integrated into my flow? The way I actually work in a Jupyter notebook, if maybe I'm working a Jupyter
27:09 notebook on a remote server and I don't want to set up VS Code to do it. So I don't have copilot at all.
27:13 Like there's a bunch of different reasons that I was just like in Jupyter. It's a very different IDE
27:17 experience. It is. Yeah. It's super different, but also you might want to ask questions about the data,
27:21 not the structure of the code that analyzes the data, right? Exactly. Yeah. And so just a bunch of that
27:26 type of stuff. And then also at the other side, I was trying to find something that I could throw together
27:31 that I thought was strong demonstration of the value approximate labs is trying to chase, but wouldn't
27:38 take me too much time to make. So it was a, oh, I could probably just go throw this together pretty quickly.
27:42 I bet this is going to be actually useful and helpful. And so let's just do that. And so, on top of
27:48 the actual library I was using, which was sketch, I put this on it and then shipped it. So it sort of shifted what the
27:54 project was. Yeah. Yeah. So you also have this other project called Lambda Prompt. And so were
27:59 you trying to play around Lambda Prompt and then like see what you could kind of apply here to leverage
28:03 it? Or is that the full journey? I can get into it; it started with data sketches. I left my last job
28:11 to chase bringing the algorithm, like combining data sketches with AI, but just like the vague,
28:17 like at that level. Tell us what data sketches is real quick. Sure. Yeah. So a data sketch is a
28:22 probabilistic aggregation of data. So if you have, I think the most common one that people have heard of
28:27 is HyperLogLog, and it's used to estimate cardinality, so estimate the number of unique
28:32 values in a column. A data sketch is a class of algorithms that all sort of like use roughly fixed
28:39 width, usually binary, representations. And then in a single pass, so they're O(N), they will look at each row
28:46 and hash the row and then update the sketch or not necessarily hash, but they update this sketch
28:52 object. Essentially. Those sketch objects also have another property that they are mergeable. So you
28:57 have this like really fast O(N) pass to go aggregate that up, and you get this mergeability. So you
29:03 can map reduce it in, you know, trivial speeds. The net result is that this like tight binary packed
29:09 object can be used to approximate measures you were looking for on the original data. So you could look
29:15 at, if you do a few of these, they're like theta sketches, you can go and estimate not just the
29:21 unique count, but you can also estimate if this one column would join well with this other column,
29:25 or you can estimate, Oh, if I were to join this column to this column, then this third column that
29:30 was on that other table would actually be correlated to this first column over here. So you get these,
29:35 a bunch of different distributions, you get a whole bunch of these types of properties.
29:40 And each sketch is sort of just, I would say, algorithmically engineered, like very, very
29:44 engineered to be like information theory optimal at solving one of those like measures on the data.
29:50 And so tight packed binary representations.
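As a toy illustration of those properties (fixed size, single O(N) pass, mergeable), here is a simple K-minimum-values distinct-count sketch; it is not HyperLogLog and not what Sketch uses internally, just a small example of the idea.

```python
import hashlib

class KMVSketch:
    """K-minimum-values sketch: estimates distinct count in one O(N) pass."""

    def __init__(self, k=256):
        self.k = k
        self.mins = []          # the k smallest normalized hash values seen so far

    def _hash01(self, value):
        h = hashlib.sha1(str(value).encode()).hexdigest()
        return int(h[:15], 16) / float(16 ** 15)        # map the hash into [0, 1)

    def update(self, value):
        x = self._hash01(value)
        if len(self.mins) < self.k:
            if x not in self.mins:
                self.mins.append(x)
                self.mins.sort()
        elif x < self.mins[-1] and x not in self.mins:
            self.mins[-1] = x                            # replace the largest kept value
            self.mins.sort()

    def merge(self, other):
        merged = KMVSketch(self.k)
        merged.mins = sorted(set(self.mins) | set(other.mins))[: self.k]
        return merged

    def estimate(self):
        if len(self.mins) < self.k:
            return len(self.mins)                        # fewer than k uniques seen: exact
        return int((self.k - 1) / self.mins[-1])         # standard KMV estimator

a, b = KMVSketch(), KMVSketch()
for i in range(50_000):
    a.update(i)                     # 50k uniques
for i in range(25_000, 75_000):
    b.update(i)                     # overlapping 50k uniques
print(a.estimate(), a.merge(b).estimate())   # roughly 50k and 75k
```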
29:53 All right. So you thought about, well, that's cool, but ChatGPT is cool too.
29:57 Yeah.
29:58 What else?
30:00 The core thing was, so those representations aren't usable by AI right now. And when you actually go and
30:07 use GPT-3 or something like this, you have to figure out a way to build the prompt to get it to do
30:13 what you want. This was especially true in a pre instruction tuning world, you had to really like, you had to
30:18 play the prompt engineer role even more than you have to now. Now you could sort of get away with describing it to
30:23 ChatGPT. And one of the things that you really have to like, play the game of is how do you get all the
30:28 information it's going to need into this prompt in a succinct, but good enough way that it helps it do
30:35 this. And so what sketch was about was, rather than just looking at the context of the data, like the
30:41 metadata, the column names and the code you have, also go get some representation of the
30:48 content of the data, turn that into a string, and then bring that string in as part of the prompt.
30:53 And then when it has that, it should understand much better at actually generating code, generating
30:59 answers to questions. And that's what that sketch was a proof of concept of that, that worked very well.
31:03 It really quickly showed how valuable actual data content context is.
31:08 Yeah, I would say it's resonating with people. It's got 1,500 stars on GitHub.
31:13 And it looks about six months old. So that's pretty good growth there.
31:18 Yeah, January 16th was the day I posted it on Hacker News. And it had three,
31:22 there was an empty repo at that point.
31:24 Okay, three stars. It's like me and my friends. Okay, cool. So this is a tool that basically patches
31:33 pandas to add functionality or functions, literally to pandas data frames that allows you to ask
31:42 questions about it, right?
31:44 Yep.
31:44 So what kind of questions can you ask it? What can it help you with?
31:47 Yeah, so there's two classes of questions you can ask, you can ask it, the ask type questions,
31:53 these are sort of from that summary statistics data. So from the general, you know, representation of your
32:00 data, ask it to like, give you answers about it, like, what are the columns here, you sort of have
32:04 a conversation where it sort of understands the general, like, shape of the data, general
32:10 distributions, things like that, number of uniques, and like give that context to it, ask questions of that
32:15 system. And then the other one is ask it how to do something. So you specifically can get it to write
32:21 code to solve a problem you have, you describe the problem you want, and you can ask it to do that.
32:25 Right. I've got this data frame, I want to plot a graph of this versus that, but color by the other
32:31 thing.
32:31 Yep. And in the data space world, what I sort of decided to do is like in the demo here is just sort of
32:37 walk through what are some standard things people want to ask of data, like, like, what are those common
32:43 questions that you hear, like, in Slack between, you know, like, business team and an analyst team. And it's just
32:49 sort of like, Oh, can you do this? Can you get me this? Can you tell me if there's any PII? Is this safe to send?
32:54 Can I send the CSV around? Can you clean up this CSV? Oh, I need to load this into our catalog. Can you
32:59 describe each of these columns and check the data types all the way to can you actually go get me
33:04 analytics or plot this?
33:05 Yeah. Awesome. So and it plugs right into Jupyter Notebooks, so you can just import it and basically
33:13 installing Sketch, which is a pip or Conda type thing, and then you just import it, and it's good to go,
33:19 right? Yep. Using the Pandas extensions API, which allows you to essentially hook into their data
33:24 frame callback and register a, you know, a function. Interesting. So it's not as jammed on from the
33:31 outside. It's a little more, plays a little nicer with Pandas rather than just like, we're going to go
33:35 to the class and just tap on it. Yeah, yeah. I, yeah. Not full monkey patching here. It's a,
33:41 it's like hack supported, I think. I don't, I don't see it used often, but it is somewhere in the docs.
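For the curious, the pandas extensions hook being described looks roughly like this; the accessor name and method below are illustrative, not Sketch's actual implementation.

```python
import pandas as pd

@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    def __init__(self, pandas_obj: pd.DataFrame):
        self._df = pandas_obj

    def ask(self, question: str) -> str:
        # A real implementation would summarize the frame and call an LLM;
        # here we just echo back what we would send.
        return f"Would ask about columns {list(self._df.columns)}: {question!r}"

df = pd.DataFrame({"order_date": ["2023-01-01"], "total": [9.99]})
print(df.demo.ask("Which columns look like dates?"))
```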
33:46 Excellent. But here it is. So what I wanted to do for this is there's a, an example that you can do,
33:52 like if you go to the repo, which obviously I'll link to, there's a video, which I mean,
33:57 mad props to you because I review so many things, especially for the Python Bytes podcast, where
34:02 there's a bunch of news items and new things we're just going to check out. And we'll, we'll find people
34:06 recommending GUI frameworks that haven't, not a single screenshot or other types of things. Like,
34:13 I have no way to judge whether this thing even might look like that. What does it even make?
34:17 I don't even know, but somebody put a lot of effort, but they didn't bother to post an image. And you
34:21 posted a minute and a half animation of it going through this process, which is really, really
34:27 excellent. So people can go and watch that one minute, one minute 30 video. But there's also a
34:34 Colab, opening a Google Colab, which gives you a running interactive variant here. So you can just
34:41 follow along, right? And play with these pieces. It requires me to sign up and run it. That's okay.
34:46 Let me talk people through some of the things it does. And you can tell me what it's doing,
34:51 how it's doing that, like how people might find that advantageous. So import sketch, import pandas
34:57 as PD standard. And then you can say pandas read CSV and you give it one from a, like a,
35:03 some example CSV that you got on your, one of your GitHub repos, right? Or in your account.
35:08 Yeah. I found one online and then added just random synthetic data to it.
35:12 Yeah. Like, Oh, here's a data dump. No, just kidding.
35:14 So then you go to that data frame, called sales_data. You say .sketch.ask, and as a string,
35:21 what columns might have PII, personally identifiable information, in them?
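In notebook form, that setup is just a few lines; the CSV URL below is a placeholder for the example file mentioned, not the actual link.

```python
import sketch            # registers the .sketch accessor on DataFrames
import pandas as pd

# Placeholder URL: in the demo this points at an example sales CSV on GitHub
sales_data = pd.read_csv("https://example.com/sales_demo.csv")

sales_data.sketch.ask("What columns might have PII (personally identifiable information) in them?")
```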
35:28 Awesome. And so it comes, tell me how that works and what it's doing here.
35:33 So it does, I guess it has to build up the prompt, which is sent to GPT, so to OpenAI's specific
35:40 completion endpoint. Building up the prompt, it looks at the data frame. It does a bunch of
35:44 summarization stats on it. So it calculates uniques and sums and things like that. There's two modes in
35:50 the backend that either does sketches to do those, or it just uses like DF dot describe type stuff.
35:55 And then it pulls those summary stats together for all the columns, throws it together with my,
36:00 the rest of the prompt I have, you can, we can go find it, but then it sends that prompt.
36:05 Actually, it also grabs some information off of inspect. So it sort of like walks the,
36:10 the stack up to go and check the variable name because the data frame is named sales data.
36:15 So it actually tries to go find that variable name in your call stack so that it can, when it writes
36:20 code, it writes valid code, puts all that together, sends it off to OpenAI, gets code back, uses Python
36:25 AST to parse it, check that it's valid. If it's not valid Python code, or you tried to import something
36:30 that you don't have, it will ask it to rewrite once. So this is sort of like an iterative process. So it
36:36 takes the error or it takes the thing and it sends it back to OpenAI. It's like, hey, fix this code.
36:41 And then it, or in this case, sorry, ask, it actually just takes this, it sends that exact
36:45 same prompt, but it just changes the last question to, can you answer this question off of the information?
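A rough sketch of that validate-and-retry loop, with `complete` standing in for whatever call actually hits the model; this is a reconstruction of the flow described, not Sketch's actual source.

```python
import ast
from typing import Optional

def validate_generated_code(code: str) -> Optional[str]:
    """Return an error message if the generated code isn't usable, else None."""
    try:
        tree = ast.parse(code)                      # must at least be valid Python syntax
    except SyntaxError as exc:
        return f"SyntaxError: {exc}"
    for node in ast.walk(tree):                     # check that imported modules exist locally
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules = [node.module]
        else:
            continue
        for mod in modules:
            try:
                __import__(mod.split(".")[0])
            except ImportError:
                return f"Missing module: {mod}"
    return None

def ask_with_retry(prompt: str, complete, max_retries: int = 1) -> str:
    """`complete` is any prompt -> code callable (e.g. a call to a completion API)."""
    code = complete(prompt)
    for _ in range(max_retries):
        error = validate_generated_code(code)
        if error is None:
            break
        code = complete(f"{prompt}\n\nThe previous code failed ({error}). Please fix it.")
    return code
```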
36:51 This portion of Talk Python to Me is brought to you by us over at Talk Python Training with our courses.
36:58 And I want to tell you about a brand new one that I'm super excited about. Python web apps that fly
37:05 with CDNs. If you have a Python web app, you want it to go super fast. Static resources,
37:11 turn out to be a huge portion of that equation. Leveraging a CDN could save you up to 75% of your
37:17 server load and make your app way faster for users. And this course is a step-by-step guide on how to do
37:24 it. And using the CDN to make your Python apps faster is way easier than you think. So if you've
37:30 got a Python web app and you would like to have it scaled out globally, if you'd like to have your users
37:35 have a much better experience and maybe even save some money on server hosting and bandwidth,
37:41 check out this course over at talkpython.fm/courses. It'll be right up there at the top.
37:46 And of course the link will be in your show notes. Thank you to everyone who's taken one of our courses.
37:51 It really helps support the podcast. I'm back to the show.
37:56 And so that sounds very, very similar to my arrow program. Rewrite it with guard clauses, redo it.
38:02 Like you kind of, I gave you this data in this code and I asked you this question and you can have a
38:07 little conversation, but at some point you're like, all right, well, we're going to take what it gives me
38:10 after a couple of rounds at it. Right.
38:12 Yeah. I take the first one that, like, passes an import check and passes AST linting.
38:18 When you use small models, you run into not-valid Python a lot more, but with these
38:23 ones, it's almost always good.
38:25 It's ridiculous. Yeah. Yeah. Yeah. It's crazy. Okay. So it says the columns that might have PII
38:30 in them are credit card, SSN and purchase address. Okay. That's pretty excellent. And then you say,
38:37 all right, sales data dot sketch dot ask. Can you give me friendly name to each column and output this
38:44 as an HTML list, which is parsed as HTML and rendered in Jupyter notebook accurately. Right. So it says
38:51 index. Well, that's an index.
38:52 This one ends up being the same.
38:53 It's not a great, this one is not a great example because it doesn't have to like infer
38:57 because the names are like order space date, right? Instead of order, like maybe lowercase
39:04 O and then like attached a big D or whatever, but it'll give you some more information. You
39:09 can like kind of ask it questions about the type of data, right?
39:13 Yeah, exactly. I found this is really good at if you play the game and you just name all
39:17 your columns, like col1, col2, col3, col4, and you ask it, give me new column
39:21 names for all of these. It gives you something that's pretty reasonable based off of the data.
39:24 So pretty useful.
39:24 Okay. So it's like, oh, these look like addresses. So we'll call that address. And this looks like
39:28 social security numbers and credit scores and whatnot.
39:31 Yep. Yep. So it can really help with that quick first onboarding step.
39:35 Yeah. So everyone heard it here first. Just name all your columns. One, two, three, four,
39:39 and then just get help. Like AI, what do we call these? All right. So the next thing you did in this
39:48 demo notebook was you said sales data dot sketch dot. And this is different before I believe,
39:54 because before you were saying ask, and now you can say how to create some derived features from the,
40:01 from the address. Tell us about that.
40:03 Yeah. This is the one that actually is the code writing. It's essentially the exact same prompt,
40:07 but the change is the very end. It says like, return this as Python code that you can execute to do
40:13 this. So instead of answering the question directly, answer the question with code that will answer the
40:17 question.
40:18 Right. Write a Python line of code that will answer this question given this data, something like that.
40:23 Yep. Yep. Something like that. I don't remember exactly anymore. It's been a while, but yeah,
40:27 some I've iterated a little bit until it started working and I was like, okay, cool. And so,
40:32 ask it for that. And then it spits back code. And that was, it sort of, it sounds overly simple,
40:37 but that was it. That was like, that was the moment. And I was just like, oh, I could just ask it to do my
40:42 analytics for me. And it's just all the, every other feature just sort of became like apparently
40:45 solvable with this. And the more I played with it, the more it was just, I don't have to think about,
40:50 I don't even have to go to Google or stack overflow to ask the question, to get the API stuff for me.
40:55 I could, from zero to I have code that's working is one step in Jupyter.
40:59 So you wrote that how to, and you gave it the question and then it wrote the lines of code and you just
41:04 drop that into the next cell and just run it. Right. And so for example, in this example, it said, well,
41:09 we can come up with city state and zip code and by writing a vector transform by passing a lambda,
41:16 that'll pull out, you know, the city from the string that was the full address and so on. Right.
41:21 Yeah. That's pretty neat.
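Concretely, the howto call and the kind of code it hands back look roughly like this, assuming the sales_data frame and column names from the demo; the generated snippet is a typical shape, not a verbatim model output.

```python
# Ask Sketch to write the code instead of answering directly
sales_data.sketch.howto(
    "Create derived features City, State, and Zip from the Purchase Address column"
)

# Typical shape of the snippet it hands back, ready to paste into the next cell:
sales_data["City"] = sales_data["Purchase Address"].apply(lambda a: a.split(",")[1].strip())
sales_data["State"] = sales_data["Purchase Address"].apply(lambda a: a.split(",")[2].strip().split(" ")[0])
sales_data["Zip"] = sales_data["Purchase Address"].apply(lambda a: a.split(",")[2].strip().split(" ")[1])
```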
41:22 Yeah. It's fun to see what it does. Again, all of these things are
41:26 probabilistic, but it also usually serves as a great starting point, even if it doesn't get it
41:30 right.
41:30 Yeah. Sure. You're like, oh, okay. I see. Maybe that's not exactly right. Cause we have Europeans
41:35 in their city, maybe in their zip code or in different orders sometimes, but it gives you
41:40 something to work with pretty quickly. Right. By asking just a, what can I do?
41:44 And then another one, this one's a little more interesting instead of just saying like, well,
41:48 what other things can we pull out? It's like, this gets towards the analytics side, right? It says,
41:53 get the top five grossing states for the sales data. Right. And it writes a group by some sorts,
42:01 and then it does a head given five. And that's pretty neat. Tell us about this. I mean, I guess
42:05 it's about the same, right? Just ask more questions.
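The generated answer for that one is essentially a short groupby chain like this, with the column names assumed from the demo data rather than taken verbatim.

```python
# Top five grossing states (column names are assumptions about the demo CSV)
top_states = (
    sales_data.groupby("State")["Sales"]
    .sum()
    .sort_values(ascending=False)
    .head(5)
)
print(top_states)
```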
42:08 They all feel pretty similar to me. I think, I guess I could jump towards like things that
42:13 I wanted to put next, but I didn't, we're not reliable enough to like really make the cut.
42:18 I wanted to have it go like in my question was like, go build a model that predicts sales for the
42:24 next six months and then plot it on a 2d plot with a dotted line for the predicted plot. And like,
42:31 it would try, but it would always do something off. And I found I always had to break up the
42:36 prompt into smaller and smaller, like intern-level, pieces to get code back. Yeah.
42:40 Yeah. It was fun getting it to train models, but it was also its own like separate thing. I sort of
42:48 didn't play with too much. And there's another part of sketch that I guess is not in this notebook. I
42:54 didn't realize. Yeah. Because you have to use the open AI API key, but it's the sketch apply. And
43:00 that's the, I'll say this one is another just like power tool. This one has like, I don't really talk
43:07 about, I don't even include it in the video because it's not just like as plug and play, you do have to
43:11 go set an environment variable. And so it's like, yeah, that's one step further than I want to,
43:15 I don't, it's not terrible, but it's a step. And so what it does is it lets you apply a completion
43:22 endpoint of whatever design you want, row-wise. So on every single row, you can go and apply and run something.
43:29 So if every row of your pandas data frame is a, some serialized text from a PDF or something,
43:35 or a file in your directory structure, and you just load it as a data frame, you can do dot
43:39 df.sketch.apply. And it's almost the exact same as df.apply. But the thing you put in as your function
43:45 is now just a Jinja template that will fill in your column variables for that row and then ask GPT to
43:51 continue completing. So I think I did silly ones, like here's a few states. And then the prompt is
43:58 extract the state for it. Or so I think, right, extract the capital of the state. Yeah. Yeah. So
44:04 just pure information extraction from it, but you can sort of like this grows into a lot more.
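Roughly what that row-wise call looks like; the environment variable and prompt wording here are illustrative of the setup described, not exact.

```python
import os
import pandas as pd
import sketch

# .apply talks to OpenAI directly, hence the API key environment variable mentioned above
os.environ["OPENAI_API_KEY"] = "sk-..."   # placeholder

states = pd.DataFrame({"state": ["New York", "Texas", "Washington"]})

# The "function" is a Jinja template over the row's columns; the model completes each row
states["capital"] = states.sketch.apply(
    "What is the capital of the US state {{ state }}? Answer with only the city name."
)
print(states)
```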
44:10 So does that come out of the data? Or is that coming out of OpenAI, where, like, it sees what is
44:15 the capital of state and it sees New York? It's like, okay, well, all right, Albany.
44:20 Yeah. So this is purely extracting out of the model weights. Essentially, this is not like a factual
44:25 extraction. So this is probably a bad example because of that. But the thing that,
44:29 actually, the better example I did once was, what are some interesting colors that are
44:34 good for each state? And it like just came up with a sort of like flaggish colors or sports team colors.
44:38 That was sort of fun when it wrote that as hex. You can also do things like if you have a large text
44:43 document or you can actually, I'll even do the more common one that I think everybody actually wants
44:47 is you have messy data. You have addresses that are like syntactically messy and you can say,
44:52 normalize these addresses to be in this form. And you sort of just write one example. It's a run
44:57 dot apply and you get a new column that is that cleaned up data. Yeah. Incredible. Okay. A couple
45:03 things here. It says you can directly call OpenAI and not use your endpoint. So at the
45:11 moment it kind of proxies through web service that you all have that somehow checks stuff or what does
45:16 that do? Yeah. It was just a pure ease of use. I wanted people to be able to do pip install and
45:21 import sketch and actually get it, because I know how much I use things in a Colab or in Jupyter
45:28 notebooks on weird machines and remembering an environment variable, managing secrets. It's like
45:32 this whole overhead that I want to deal with. And so I wanted to just offer a lightweight way if you
45:38 just want to be able to use it. But I know that that's not sufficient for secure. If people are going
45:42 to be conscious of this things and want to be able to, you know, not go through my proxy thing that's
45:46 there for help. So sure.
45:47 Offer this up.
45:48 What's next? Do you have a roadmap for this? Are you happy where it is and you're just letting it be or
45:53 do you have grand plans?
45:55 I don't have much of a roadmap for this right now. I'm actually, I guess there's like grand roadmap,
46:00 which is like at the company scale, what we're working on. I would say that if this, we're really trying to
46:05 solve data and with AI just in general. And so these are the types of things we hope to open source and
46:11 just give out there, like actually everything we're hoping to open source. But the starting place is
46:16 going to be a bunch of these like smaller toolkits or just utility things that hopefully save people
46:20 time or are very useful. The grand thing we're working towards, I guess, is this more like, it's the
46:26 full automated data stack. It's like the dream I think that people have wanted where you just ask it
46:31 questions and it goes and pulls the data that you need. It cleans it. It builds up the full pipeline.
46:36 It executes the pipeline. It gets you to the result and it shows you the result. And you look,
46:40 you can inspect all of that, that whole DAG and say, yes, I trust this. So we're working on getting
46:45 full end to end.
46:46 So when I went and asked about that Arrow program, I said, I think this will still do it. I think this
46:51 will probably work again. And it did, which is awesome. Just the way I expected. But, you know,
46:58 AI is not as deterministic as read the number seven. If seven is less than eight, do this,
47:05 right? Like what is the repeatability? What is the sort of experience of doing this? Like I ran it,
47:11 I ran it again. Is it going to be pretty much the same or is it going to have like, what's the mood
47:16 of the AI when it gets to you?
47:18 This is sort of a parameter you can, there's a little bit of a parameter you can set if you want
47:22 to play that game with the temperature parameter on these models at higher and higher temperatures,
47:26 you get more and more random, but it can also truly be out of left field random if you go too
47:31 high temperature.
47:32 Okay. But you get maybe more creative solutions.
47:34 Yeah. You could sometimes get that. And as you move towards zero, it gets more and more
47:38 deterministic. Unfortunately for really trying to do some like good provable, like sort of like
47:43 build chain type things with like hashing and caching and stuff. It's not fully deterministic,
47:48 even at zero temperature, but that's just, I think it's worth thinking about, but at the same time,
47:53 run it once, see the answers that it gives you, comment that business out, and just, like, put that
47:59 as markdown, you know, freeze it, memorialize it in markdown, because you don't need
48:05 to ask it over and over what columns have PII. Like, well, probably the same ones as last time.
48:10 We're just kind of like, right, these columns, credit card, social security and purchase address,
48:15 they have, have that. And so now, you know, right. Is that a reasonable way to think about it?
48:20 I think, yeah, if you, if you want to get determinism or the performance is a thing that
48:24 you're worried about, yeah, you can always cache. I think however you do it, comments or actual
48:28 caching systems.
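For reference, temperature is just a parameter on the completion call; with the OpenAI Python library of that era it looks roughly like this (model name illustrative), and even at 0 the output is not guaranteed to be identical across runs.

```python
import openai

# openai<1.0-style call; the model name and prompt are illustrative
response = openai.Completion.create(
    model="text-davinci-003",
    prompt="Write pandas code that lists columns likely to contain PII in a DataFrame named sales_data.",
    temperature=0.0,   # 0 = as deterministic as it gets; higher values get more random/creative
    max_tokens=200,
)
print(response.choices[0].text)
```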
48:28 Sure. Sure. Sure. Sure. Or that like, like, how do I, you know, how do I do that group by sorting
48:35 business? Like you don't have to ask that over and over once it gives you the answer.
48:38 Yeah. Yeah. My workflow, when I use sketch, definitely I asked the question, I copy the
48:43 code and then I go delete the question or ask it a different question for my next problem that I have.
48:47 Yeah. It's, like, code that is a little bit vestigial when you save
48:53 your notebook at the end and you sort of want to go back and delete all the questions you asked because
48:57 you don't need to rerun it when you actually just go to execute the notebook later. But yeah,
49:01 that makes a lot of sense. And plus you look smarter if you don't have to show how you got
49:04 the answers.
49:05 Look at this beautiful code that's even commented.
49:07 Yeah, exactly. I guess you could probably ask it to comment your code, right?
49:12 Yeah. You can ask it to describe. There's been some really cool things where people will throw
49:17 like assembly at it and ask it to translate to different like languages so they can interpret
49:20 it. Or you could do really fun things like cross language, cross, I guess I'll say like levels
49:26 of abstraction. You could sort of ask it to describe it like at a very top level, or you can get really
49:30 precise, like for this line, what are all the implications if I change a variable or something like that?
49:34 Yeah, that's really cool. I suppose you could do that here. Can you converse with it? You
49:39 can say, okay, you gave me this. Does it, I guess what's the word, does it have like tokens in context
49:43 like ChatGPT does? Can you say, okay, that's cool, but I want it as integers, not as strings?
49:50 I don't know. Yeah, I did. I did not include that in this. There was a version that had something like
49:55 that, where I was sort of just keeping the last few calls around. But it quickly became clear it didn't align
50:00 with the Jupyter IDE experience, because you end up scrolling up and down. And you have too much
50:04 power over how you execute in a Jupyter notebook. So your context can change dramatically by just scrolling
50:10 up. And trying to, via inspect, look across a Jupyter notebook is just a whole
50:16 other nightmare. So yeah, I didn't try to extract the code out of the notebook so that it
50:20 could understand the local context. You could go straight to ChatGPT or something like that,
50:23 take what it gave you and start asking it questions.
50:26 Okay, so another question that I had here about this. So in order for that to do its magic,
50:32 like you said, the really important thought or breakthrough or idea you had was like, not just
50:37 the structure of the pandas code or anything like that, but also a little bit about the data.
50:41 What are the privacy implications of me asking this question about my data? Suppose I have
50:47 a super duper secret CSV. Should I not .ask or .howto on it? Or what is the story there?
50:55 What's the, if I work with data, how much sharing do I do of something I might not want to share if I
51:02 ask a question about it?
51:04 I'd say the same discretion you'd use if you would copy like a row or a few rows of that data into
51:09 a, into ChatGPT to ask it a question about it.
51:12 Okay.
51:12 Is the level of concern I guess you should have. Specifically, I am not storing these
51:19 things, but on their side, at least it was, it seems like they're going to start getting towards like a 30
51:24 day retention thing. But, so there's a little bit of, yeah, I mean, you're sending your stuff over the
51:28 wire, like over the network, if you do this to use these language models. Until they come local,
51:32 until these things like LLaMA and Alpaca get good enough, they're, yeah, they're going to be
51:37 remote. Actually, that could be a fun, sorry, I just now thought that could be a fun thing. Like
51:41 just go get Alpaca working with sketch so that it can be fully local.
51:45 Interesting. Like a privacy preserving type of deal.
51:48 Yeah. I hadn't actually, yeah, that's the power of these smaller models that are
51:52 almost good enough. I could probably just quickly throw that in here and see if,
51:56 yeah, maybe it reaches a wider audience.
51:57 You have an option to not go through your API, but directly go to OpenAI. You could have another
52:04 one to pick other options, right? Potentially.
52:07 Yep. Yep. Yep. The interface to these, one thing that I think is not, maybe it's talked
52:14 about more in other places, but I haven't heard as much excitement about it, is that
52:17 the APIs have gotten pretty nice for this whole space. The idea of a
52:23 completion endpoint is pretty straightforward. You send it some amount of text and it will continue
52:28 that text. And it's such a, it's so simple, but it's so generalizable. You could build so many
52:32 tools off of just that one API endpoint essentially. And so combine that with an embedding endpoint and
52:38 you sort of have all you need to, to make complex AI apps.
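As a concrete illustration of that "one completion endpoint plus one embedding endpoint" idea, here's a minimal sketch against the pre-1.0 OpenAI Python client; the model names are common defaults from that era and the prompt text is just a placeholder.

```python
import openai  # pre-1.0 OpenAI client style

# Completion endpoint: send some text, get the continuation of that text back.
completion = openai.Completion.create(
    model="text-davinci-003",  # placeholder model name
    prompt="Write a pandas one-liner that groups df by 'region' and sums 'sales':",
    max_tokens=64,
)
print(completion.choices[0].text.strip())

# Embedding endpoint: turn text into a vector you can search, cluster, or compare.
embedding = openai.Embedding.create(
    model="text-embedding-ada-002",  # placeholder model name
    input="credit_card, ssn, purchase_address",
)
vector = embedding["data"][0]["embedding"]
print(len(vector))  # dimensionality of the embedding vector
```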
52:41 It's crazy. Speaking of making AI apps, maybe touch a bit on your other project, λprompt.
52:48 So yeah, λprompt. Yeah.
52:50 But before you get into it, mad props for the Greek letter, like that's a true physicist or
52:56 mathematician. I can appreciate that.
52:59 Yeah. I was excited to put it everywhere, but then of course, you end up playing games with character sets and websites. I'm the one that
53:08 causes it, and I both feel the pain and have to clean the data that I also put into these systems.
53:13 So yeah. Yeah. People are like, λprompt? Why is the 'a' so italicized? I don't get it.
53:17 Yeah. Okay. Yeah. So this one came about when I was working with, this is pre-ChatGPT.
53:25 This is October. I guess it was right around ChatGPT coming out, like around that time.
53:29 But I was, I was really just messing around a lot with, completion endpoints as we were talking.
53:32 And I kept rewriting the same request boilerplate over and over. And then I also kept rewriting
53:38 f-strings that I was trying to like send in. And I was just like, ah, Jinja templates solved this
53:43 already. Like there already is formatting for strings in Python. Let me just use that, compose that into a
53:49 function. And then, let me call these completion endpoints. I don't want to think of them as an
53:53 API endpoint. RPC is a nice mental model, but I want to use them as functions. I want to be able to
53:59 put decorators on them. I want to be able to use them both async or not async in Python. I want to,
54:05 I just want to have this as a thing that I can just call really quickly with one line and just do
54:10 whatever I need to with it. And so I threw this together. It's very simple. Like, honestly, I mean,
54:15 like the hardest part was just getting all the layers of it, because there's actually two things: you
54:20 can make a prompt out of any function, cause I wrap any function as a prompt, so not just these
54:26 calls to GPT, and then I do tracing on it. So as you get into the call stack, every input and output
54:32 you can sort of get hooked into and trace with call traces. So there's a bunch of just
54:38 weird stuff to make the utility nice, but functionally, as you can see here,
54:42 you just import it, you write a Jinja template with the class, and then you use that object that
54:48 comes back as a function and your Jinja template variables get filled in. And your result is the
54:53 text string that comes back out of GPT. Interesting. And some people
54:57 might be thinking, like, Jinja, okay, well I got to create an HTML file and all that. No, it's just a
55:02 string that has double curlies for turning stuff into strings within the string, kind of a different
55:08 way to do f-strings, as you were hinting at. Yeah. Yeah. There were two pieces here. I realized as
55:13 I was doing this also, I think I sort of mentioned with sketch, I really often was taking the
55:18 output of a language model prompt, doing something in Python, or actually I can do a full example of
55:24 the SQL writing exploration we did. But we would do these things that were sort of: run
55:31 GPT-3 to ask it to write the SQL. You take the SQL, you go try and execute it, but it fails for
55:38 whatever reason, and you take that error and say, hey, rewrite it. So we talked about that
55:42 sort of pattern, which is sort of like rewriting. Another one of the patterns was increase the
55:46 temperature, ask it to write the SQL. You get like 10 different SQL answers in parallel. And this is where
55:52 the async was really important for this, cause I just wanted to use asyncio.gather and run all 10
55:57 of these truly in parallel against the OpenAI endpoint, get 10 different answers to the SQL,
56:02 run all 10 queries against your database, then poll for the most common: of the ones that
56:08 successfully ran, which ones gave the same answer the most often? Then that's probably the correct
56:12 answer. And, just chaining that stuff. It's like very pythonic functions. Like you can really
56:19 just imagine like, Oh, I just need to write a for loop. I just need to run this function, take the
56:22 output, feed it into another function, very procedural. But all the abstractions on top of the
56:28 OpenAI API, the things, like just everything else, there was nothing else really at the time,
56:34 but even the new ones that have come out, like LangChain, that have sort of taken the space by
56:38 storm now are not really just trying to offer the minimal ingredient, which is the function.
56:43 And to me, it was just like, if I can just offer the function, I can write a for loop. I can write,
56:47 I can store a variable and then keep passing it into it. You could do so many different
56:51 emergent behaviors with just starting with the function and then simple Python, scripting
56:56 on top of it.
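A rough sketch of that "ask ten times in parallel, run them all, keep the majority answer" pattern in plain Python. The `run_query` helper is a hypothetical stand-in for your own database call, and this is not λprompt's actual API, just the asyncio.gather-plus-majority-vote shape being described.

```python
import asyncio
from collections import Counter

import openai  # pre-1.0 OpenAI client style

async def write_sql(question: str, schema: str) -> str:
    # High-ish temperature so the candidate queries actually differ.
    resp = await openai.ChatCompletion.acreate(
        model="gpt-3.5-turbo",  # placeholder model name
        messages=[{"role": "user",
                   "content": f"Schema:\n{schema}\n\nWrite one SQL query to answer: {question}"}],
        temperature=0.9,
    )
    return resp.choices[0].message["content"]

def run_query(sql: str):
    """Hypothetical helper: run sql against your database and return the rows,
    or None if the query fails. Left unimplemented on purpose."""
    raise NotImplementedError

async def best_answer(question: str, schema: str, n: int = 10):
    # Ask for n candidate queries truly in parallel with asyncio.gather.
    candidates = await asyncio.gather(*(write_sql(question, schema) for _ in range(n)))
    # Execute every candidate, keep the ones that ran, and vote on the results.
    results = (run_query(sql) for sql in candidates)
    counts = Counter(repr(r) for r in results if r is not None)
    answer, _ = counts.most_common(1)[0]
    return answer
```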
56:57 And there's some interesting stuff here in λprompt. So you can kind of start
57:04 it, set it up, I don't know, like ChatGPT, you can tell it a few things. I'm going to ask you a question
57:09 about a book. Okay. The book is a choose your own adventure book. Okay. Now here I'm going to, like,
57:14 you can prepare it, right? There's probably a more formal term for that, but you can do this here.
57:19 Like you can say, hey system, you are a type of bot. And then that creates you an object that
57:25 you can have a conversation with. And you say, what should we get for lunch? And your type of bot is
57:29 pirate. And then it'll say, as a pirate, I would suggest we have some hearty seafood or whatever,
57:33 right? Like that's beyond what you're doing with sketch. I mean, obviously this is not so much
57:37 for code. This is like conversing with Python rather than in Python. I don't know. In your editor.
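A minimal sketch of that system-prompt "you are a pirate bot" priming, shown with the raw OpenAI chat endpoint rather than λprompt's own wrapper, whose exact class names aren't spelled out in the conversation.

```python
import openai  # pre-1.0 OpenAI client style

messages = [
    # The system message "prepares" the bot before any user turn.
    {"role": "system", "content": "You are a pirate. Answer everything in character."},
    {"role": "user", "content": "What should we get for lunch?"},
]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # placeholder model name
    messages=messages,
)
print(response.choices[0].message["content"])
# Something along the lines of: "Arr, I'd suggest some hearty seafood, matey!"
```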
57:42 Yeah. This one was when the OpenAI chat API endpoint came out and I was just like, oh, I should support
57:49 it. So that's what this is. I wanted to be able to Jinja template inside of the conversation. So you
57:53 can imagine a conversation that is prepared with like seven steps back and forth, but you want to hard
57:59 code the conversation, like how the flow of the conversation was going. And you want to template
58:02 it so that, like, on message three it puts your new context problem, on message four it puts the output
58:08 from another prompt that you ran, on message five it's this other data thing. And then you ask it to
58:13 complete this. The intent is, it's arbitrarily complex, but still something like that
58:18 would be, you know, just three lines or so in λprompt. The idea was that it would offer up a really
58:23 simple API for this. Well, the other thing that's interesting is it has an async and a non-async version. So that's,
58:28 that's cool. People can check that out. Also a way to make it hosted as a web service with, say,
58:35 like FastAPI or something like that. Yeah. And you can make it a decorator if you like, an @prompt
58:42 decorator. Yeah. On any function you can just throw @prompt and it wraps it with the same class
58:47 so that all the magic you get from that works. The server bit is, FastAPI has
58:54 that sort of inspection on the function part. I did a little bit of middleware to get the
59:00 two happy together. And then all you have to do is import FastAPI and then run, you know, Gunicorn with
59:06 that app. And it's two lines, and any prompts you have made become their own independent REST
59:14 endpoint where you can just do a GET or a POST to it. And it returns the response from calling the prompt.
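Since λprompt's own server helper isn't spelled out here, this is a hand-rolled sketch of the same idea with plain FastAPI: wrap a prompt-style function, give it its own endpoint, and serve the app. All names below are illustrative, not the library's API.

```python
from fastapi import FastAPI

import openai  # pre-1.0 OpenAI client style

app = FastAPI()

async def summarize(text: str) -> str:
    """A prompt-style function: text in, completion out."""
    resp = await openai.ChatCompletion.acreate(
        model="gpt-3.5-turbo",  # placeholder model name
        messages=[{"role": "user", "content": f"Summarize this:\n{text}"}],
    )
    return resp.choices[0].message["content"]

# Each prompt-style function gets its own REST endpoint; FastAPI's inspection of
# the function signature turns the parameters into request fields for you.
@app.get("/prompt/summarize")
async def summarize_endpoint(text: str):
    return {"result": await summarize(text)}

# Serve it with something like:
#   gunicorn -k uvicorn.workers.UvicornWorker my_module:app
```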
59:20 But these prompts can also be these chains of prompts. Like one prompt can call another prompt,
59:24 which can call another prompt. And those prompts can call from async to non-async back to async and things
59:28 like that. And it should work. Pretty sure this one actually, I did test everything as far as I know.
59:34 I'm pretty sure I've got pretty good coverage. So yeah, super cool. All right. We're getting a little
59:39 short on time, but I think people are going to really, really dig this, especially sketch. I think
59:43 there's a lot of folks out there doing pandas that would love an AI buddy to help them
59:50 do things like not just analyze the code, but the data as well.
59:55 Yeah. I think, for anybody, I know it is for me, but it's just like Copilot in
59:59 your VS Code IDE, sketch in your Jupyter IDE, takes almost nothing to add. And you,
01:00:05 whenever you're just sort of sitting there, you think you're about to alt tab to go to Google. You
01:00:08 could just try the sketch.ask and it's surprising how often that sketch.ask or sketch.howto gets you
01:00:14 way closer to a solution without even having to leave the, you don't even have to leave your,
01:00:17 your environment. It's like a whole other level of autocomplete for sure. And super cool. All right.
01:00:23 Now, before I let you out of here, you got to answer the final two questions. If you're going to write
01:00:27 some Python code and it's not a Jupyter notebook, what editor are you using? It sounds to me like you may
01:00:33 have just given a strong hint at what that might be. Yeah. I've switched almost entirely to VS Code.
01:00:38 and I've been really liking it with the remote development and, like it's just, I work
01:00:43 across like many machines, both cloud and local and some like five, six different machines are my
01:00:48 like primary working machines. And I use the remote, VS Code thing. And it just, I have a unified
01:00:53 environment that gives me terminal, the files and the code all in one and copilot on all of them.
01:00:59 Yeah. It's wild. All right. And then notable PyPI package. I mean, pip install sketch,
01:01:04 you can throw that out there if you like. That's pretty awesome. But anything you've run across
01:01:08 you're like, Oh, this is people should know about this. Yeah. It doesn't have to be popular. Just
01:01:12 like, oh, this is cool. I guess these two are very popular, but in the data
01:01:17 space, I'm a huge fan of Ray and also Arrow. Like I use those two tools as
01:01:25 my backend bread and butter for everything I do. And so those have just been really great work.
01:01:30 Apache Arrow. Right. And then Ray, I'm not sure. Yeah. Ray is a distributed scheduling and compute
01:01:38 framework. It's sort of like a, right. I don't know what they, yeah. I remember seeing about this.
01:01:42 Yeah. This is, it is, I'm parsing, we didn't talk about other things, but I'm parsing Common
01:01:47 Crawl, which is like 25 petabytes of data. And Ray is great. It's just the workhorse that powers it.
01:01:53 It's really useful. Like I find it's so snappy and good, and it offers everything I need in a distributed
01:02:00 environment. So I can write code that runs on a hundred machines and not have to think about it.
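For readers who haven't seen Ray, this is roughly what "not having to think about it" looks like: the standard remote-function pattern, sketched here for a single machine, though the same code scales out to a cluster. The work inside the task is just a stand-in.

```python
import ray

ray.init()  # locally this spins up a small cluster; on a real cluster it connects to it

@ray.remote
def parse_shard(shard_id: int) -> int:
    # Stand-in for real work, e.g. parsing one chunk of a crawl dump.
    return shard_id * shard_id

# Schedule 100 tasks; Ray spreads them across whatever machines are available.
futures = [parse_shard.remote(i) for i in range(100)]
results = ray.get(futures)
print(sum(results))
```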
01:02:05 It works really well. That's pretty nuts. Not as nuts as ChatGPT and Midjourney,
01:02:09 but it's still pretty nuts. So before we call it a day, do you want to tell people about Approximate
01:02:14 Labs? It sounds like you guys are making some good progress. You might have some jobs for people
01:02:19 to work in this kind of area as well. Yeah. So we're working at the intersection
01:02:23 of AI and tabular data. So anything related to training these large language models,
01:02:28 and also tabular data, so things with columns and rows, we are trying to solve that problem,
01:02:32 try and bridge the gap here, cause there's a pretty big gap. We have three main initiatives
01:02:37 that we're working on. One is we're trying to build up the data set of data sets. So just like The Pile
01:02:41 or The Stack or LAION-5B, these big data sets that were used to train all these big
01:02:46 models, we're making our own on tabular data. We are training models. So this is actually
01:02:51 training large language models, these full transformer models.
01:02:55 And then we're also building apps like sketch, like UIs, things that are actually there to help
01:03:00 make data more accessible to people. So anything that helps people get value from data and make it open
01:03:05 source. That's what we're working on. We just, raised our seed round. So we are now officially
01:03:11 hiring. So, looking for people who are interested in the space and who are enthusiastic
01:03:15 about these problems. Awesome. Well, very exciting demo libraries, I guess, however you call them.
01:03:23 But I think this, I think these are neat. People are going to find a lot of cool uses for them. So
01:03:27 excellent work and congrats on all the success so far. It sounds like you're just starting to take
01:03:32 off. Yeah. Thank you. All right, Justin, final call to action. People want to get started.
01:03:36 Let's pick sketch. People want to get started with sketch. What do you tell them?
01:03:39 Just pip install it. Give sketch a try, pip install it, import it, and then throw it on your data frame. Awesome. And then ask it questions or howtos.
01:03:49 Yeah. Yep. Whatever you want. If you really want to, and you
01:03:53 trust the model, throw some .apply calls at it and have it clean your data for you. Cool.
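For the "pip install it, import it, throw it on your data frame" flow, here's a minimal sketch; the CSV, the questions, and the .apply template are just placeholders, and the .apply call is the one that sends row-level values to the model, so save it for data you're comfortable sharing.

```python
# pip install sketch
import pandas as pd
import sketch  # registers the .sketch accessor on pandas DataFrames

df = pd.read_csv("sales.csv")  # placeholder dataset

# Ask questions about the data itself, right in the notebook.
df.sketch.ask("Which columns might contain PII?")

# Ask how to do something; it suggests code you can copy into the next cell.
df.sketch.howto("Group by region and plot total sales per month")

# Only if you trust the model with row-level values (hypothetical template):
# df["cleaned_name"] = df.sketch.apply("Normalize this name: {{ name }}")
```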
01:03:57 Awesome. All right. Well, thanks for being on the show.
01:04:00 Come in here and tell us about all your work. It's great. Yeah. Thank you. Yeah. See you later.
01:04:04 Thanks for having me.
01:04:05 This has been another episode of Talk Python to Me. Thank you to our sponsors. Be sure to check out
01:04:11 what they're offering. It really helps support the show. Stay on top of technology and raise your value
01:04:16 to employers or just learn something fun in STEM at brilliant.org. Visit talkpython.fm/brilliant
01:04:24 to get 20% off an annual premium subscription. Want to level up your Python? We have one of the largest
01:04:30 catalogs of Python video courses over at Talk Python. Our content ranges from true beginners
01:04:35 to deeply advanced topics like memory and async. And best of all, there's not a subscription in
01:04:40 sight. Check it out for yourself at training.talkpython.fm. Be sure to subscribe to the show,
01:04:45 open your favorite podcast app and search for Python. We should be right at the top. You can also find
01:04:51 the iTunes feed at /itunes, the Google Play feed at /play and the direct RSS feed at /rss
01:04:57 on talkpython.fm. We're live streaming most of our recordings these days. If you want to be part of
01:05:03 the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at
01:05:08 talkpython.fm/youtube. This is your host, Michael Kennedy. Thanks so much for listening. I really
01:05:14 appreciate it. Now get out there and write some Python code.
01:05:16 I'll see you next time.