The Intersection of Tabular Data and Generative AI
Episode Deep Dive
Guests Introduction and Background
Justin Waugh is the creator of Sketch at Approximate Labs and a seasoned expert working at the intersection of Python, data, and AI. With a background in experimental physics (complete with hands-on LabVIEW and GPU processing experience), Justin moved from academia into startups, taking his passion for high-performance computing and machine learning to the software world. He used GPUs early on for electron counting in physics labs and recognized that the CUDA convolution kernels he was writing were the same ones powering the first deep neural net architectures. Justin's drive to merge data-driven science with practical applications led him to found and work at multiple startups, ultimately creating tools such as Sketch and Lambda Prompt to integrate conversational AI into data workflows.
What to Know If You're New to Python
If you're just getting started with Python and want to follow along more easily, focus on pandas data frames and Jupyter notebooks, since these are at the heart of the conversation about tabular data and conversational AI. It's helpful to understand how to install packages (e.g. pip install) and how to load data into pandas (e.g. pd.read_csv). Also, knowing how to run code snippets within Jupyter cells will help you explore libraries like Sketch interactively.
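For instance, a minimal self-contained session looks like the following (the CSV content here is made up for illustration; with a real file you would pass a path to pd.read_csv instead):

```python
import io
import pandas as pd  # install with: pip install pandas

# Inline CSV text stands in for a real file so the snippet runs anywhere;
# normally you would call pd.read_csv("your_file.csv") with a file path.
csv_text = "region,amount\nWest,100\nEast,250\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)            # (2, 2): two rows, two columns
print(df["amount"].sum())  # 350
```

Running cells like this in a Jupyter notebook is exactly the workflow the episode assumes.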
Key Points and Takeaways
- Sketch: A Conversational AI Layer for Pandas
Sketch is a Python library that augments pandas data frames with an AI-powered “ask” and “howto” interface. You can literally ask your data questions (e.g., “Which columns might contain sensitive info?”) or ask Sketch how to code certain data transformations and get workable Python code in return. It uses Large Language Models (LLMs) under the hood and context from the data itself to generate answers and code. This approach dramatically cuts down on switching between your notebook and external documentation or Stack Overflow.
- Links and Tools
- GitHub: Sketch repo
- Approximate Labs: approx.dev
- Bringing Conversational AI into Data Analysis
The conversation highlights how ChatGPT-like models don’t just assist with code generation but also interpret and explain data. Instead of purely writing transformations, these models can describe anomalies, identify potential data issues, and even highlight PII-related columns. This bidirectional conversation with your data frames opens new possibilities for collaborative data science and faster discovery.
- Data Sketches: Efficient Summaries for Large Datasets
Justin’s background in data sketches (probabilistic data structures like HyperLogLog) plays a key role in how Sketch can quickly grasp the “shape” of data. These sketches let you approximate metrics—like unique values—without scanning an entire massive dataset. Combining that snapshot of the data with GPT-like models gives them the right context to answer questions about the data efficiently.
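To make the idea concrete, here is a miniature HyperLogLog-style estimator in pure Python. This is a sketch of the general technique, not Sketch's actual implementation, and it omits the small- and large-range bias corrections that real libraries apply:

```python
import hashlib

def _hash64(value):
    # Deterministic 64-bit hash of the value's string form
    digest = hashlib.sha256(str(value).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def estimate_unique(values, num_buckets=64):
    """Estimate the number of distinct values without storing them.

    num_buckets must be a power of two. Each hash's low bits pick a
    bucket; the bucket remembers the longest run of low zero bits seen.
    Long zero runs are rare, so large maxima imply many distinct hashes.
    """
    p = num_buckets.bit_length() - 1
    registers = [0] * num_buckets
    for v in values:
        h = _hash64(v)
        bucket = h & (num_buckets - 1)
        rest = h >> p
        # rank = 1-indexed position of the lowest set bit
        rank = (rest & -rest).bit_length() if rest else 64 - p + 1
        registers[bucket] = max(registers[bucket], rank)
    alpha = 0.7213 / (1 + 1.079 / num_buckets)  # bias correction, ~0.709 for 64
    harmonic = sum(2.0 ** -r for r in registers)
    return int(alpha * num_buckets * num_buckets / harmonic)

# Duplicates never change the registers, so repeated data costs nothing extra
print(estimate_unique(range(10_000)))  # estimate of the distinct count (true value 10,000)
```

Tools in this space compute a handful of such compact summaries per column and hand that description, rather than the raw data, to the language model as context.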
- Lambda Prompt: A Toolkit for Building AI Functions
Justin also discussed Lambda Prompt, another library that turns LLM endpoints (like OpenAI’s) into straightforward Python functions using Jinja templates. By defining your own prompts as functions, you can chain or compose them for more complex tasks, such as generating SQL queries, rewriting code, or building custom chat-style features. It makes building AI-driven apps simpler and more “Pythonic.”
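The flavor of that API can be sketched with the standard library alone. Note this is a simplified stand-in, not lambdaprompt's real interface: the actual library uses Jinja templates and calls hosted LLM endpoints, while here the templating is string.Template and the "model" is a stub so the example runs offline:

```python
from string import Template

def prompt(template_str):
    """Turn a prompt template into a plain Python function.

    The returned function renders the template with its keyword
    arguments and hands the result to `complete`, a pluggable
    completion backend (a real one would call an LLM endpoint).
    """
    template = Template(template_str)

    def call(complete, **kwargs):
        rendered = template.substitute(**kwargs)
        return complete(rendered)

    return call

# Stand-in for a real LLM call, used only for demonstration
def echo_backend(rendered):
    return f"[model reply to: {rendered}]"

write_sql = prompt("Write a SQL query that $task.")
answer = write_sql(echo_backend, task="counts orders per region")
print(answer)
```

Because each prompt is just a function, you can chain or compose them like any other Python callables, which is the core idea behind the library.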
- GPU Computation and the Rise of AI
Justin's story about early GPU usage in physics labs illustrates how GPU hardware rapidly evolved from specialized graphics pipelines to mainstream parallel computing engines. The discussion highlights how frameworks leveraging GPU acceleration (like PyTorch or TensorFlow) have driven breakthroughs in image generation, text modeling, and large language models. This hardware+software synergy paved the way for the advanced AI tools we see today.
- Ethics and Licensing in AI Training Data
A recurring topic was whether AI systems, like GitHub Copilot or image generation models, inadvertently incorporate copyrighted or GPL-licensed material. Justin pointed out ongoing lawsuits and broader conversations about data usage, especially the potential for "license stripping" when code is regurgitated from a generative model. Although no definitive legal resolutions emerged, the discussion underscores that privacy and ethics in AI remain an evolving challenge.
- ChatGPT vs. GitHub Copilot for Python Coding
The episode compared the broader context-based ChatGPT experience with Copilot’s integrated approach. ChatGPT can do more open-ended tasks and explanations, while Copilot excels at inline code suggestions in IDEs like VS Code. Combining them can significantly level up your productivity, but each tool addresses slightly different developer workflows.
- Practical Examples: Data Cleaning and Feature Engineering
One highlight was showing how Sketch can parse addresses to extract city, state, and zip, or quickly group sales data by region. Being able to say "clean up messy addresses" and get workable Python code in a single step is especially valuable for people who handle daily data wrangling tasks. Even if the code is 90% correct, it can drastically reduce the time spent on boilerplate tasks.
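As an illustration, the kind of helper such a tool might generate for the address task could look like this (the regex, function name, and example strings are hypothetical, and truly messy addresses would need more robust handling):

```python
import re

# Matches a trailing "City, ST 12345" (optionally ZIP+4)
ADDRESS_RE = re.compile(
    r"(?P<city>[A-Za-z][A-Za-z .'-]*),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})(?:-\d{4})?\s*$"
)

def parse_city_state_zip(address):
    """Extract city/state/zip from a free-form address string."""
    match = ADDRESS_RE.search(address.strip())
    if match:
        return match.groupdict()
    return {"city": None, "state": None, "zip": None}

print(parse_city_state_zip("123 Main St, Portland, OR 97201"))
# {'city': 'Portland', 'state': 'OR', 'zip': '97201'}
```

In a DataFrame you would apply it column-wise, e.g. df["address"].apply(parse_city_state_zip), which is the one-step workflow described above.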
- Integrating Sketch in Jupyter for a Better Notebook Flow
Since many data scientists live in Jupyter Notebooks, Sketch offers a direct path to embedded AI queries. You can run df.sketch.ask("...") or df.sketch.howto("...") and remain within the environment without context switching to a browser. This synergy with Jupyter makes it an incredibly smooth experience for data exploration, data cleaning, and immediate code generation.
- Future of AI-Driven Data Tools
The discussion closed on bigger visions for fully automated data pipelines, advanced conversational data analysis, and bridging more complex tasks like model training. Justin's company, Approximate Labs, aims to unify the steps of data discovery, transformation, and high-level analysis through AI-driven solutions, indicative of a broader industry movement toward more intelligent data platforms.
Interesting Quotes and Stories
- Justin on missing GitHub Copilot while offline: “I was on a flight recently ... I felt like I was walking through mud instead of running. I realized I’ve become reliant on it in a big way.”
- On GPU usage in early physics: “At the time, I was using C++ and a distributed LabVIEW project just to move some motors and measure electrons—and I realized, it’s basically convolution kernels that the neural nets were doing, too.”
Key Definitions and Terms
- Generative AI: A branch of AI that creates new content (text, images, audio) based on training data, often powered by large language models.
- Data Sketches: Probabilistic data structures (like HyperLogLog) that let you estimate measures like unique counts quickly and with less memory.
- pandas: A Python library for data manipulation and analysis, providing data structures and operations to manipulate numerical tables and time series.
- LLM (Large Language Model): A neural network trained on vast text corpora to predict and generate human-like language responses and code.
Learning Resources
Below are a few curated learning resources to help deepen your knowledge and skill set around Python’s data and AI ecosystem:
- Move from Excel to Python with Pandas: Understand how to transition from Excel-based workflows to more scalable, code-driven solutions in Python using pandas.
- Data Science Jumpstart with 10 Projects: Gain hands-on experience with Python, data analysis, and real-world projects.
- Build An Audio AI App with Python and AssemblyAI: Explore a practical AI application and see how it integrates modern frameworks like FastAPI and GPT-based summarization.
Overall Takeaway
The rise of conversational AI for data analysis, as showcased by Sketch, signals a major transformation in how Python developers and data scientists interact with their datasets. By integrating language models directly into workflows—whether for code generation, data summarization, or interactive exploration—tools like Sketch and Lambda Prompt streamline repetitive tasks and open up new levels of creativity in data wrangling. This episode shows that, while powerful, AI-based solutions also bring considerations around ethics, licensing, and reliability. Overall, the conversation is a strong testament to Python’s vibrant community and the growing potential for AI-assisted development in everything from data cleaning to advanced analytics.
Links from the show
Lambdaprompt: github.com
Python Bytes 320 - Coverage of Sketch: pythonbytes.fm
ChatGPT: chat.openai.com
Midjourney: midjourney.com
Github Copilot: github.com
GitHub Copilot Litigation site: githubcopilotlitigation.com
Attention is All You Need paper: research.google.com
Live Colab Demo: colab.research.google.com
AI Panda from Midjourney: digitaloceanspaces.com
Ray: pypi.org
Apache Arrow: arrow.apache.org
Python Web Apps that Fly with CDNs Course: talkpython.fm
Watch this episode on YouTube: youtube.com
Episode transcripts: talkpython.fm
--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy
Episode Transcript
Collapse transcript
00:00 AI has taken the world by storm.
00:01 It's gone from near zero to amazing in just a few years.
00:05 We have ChatGPT, we have Stable Diffusion.
00:07 What about Jupyter Notebooks and Pandas?
00:10 In this episode, we meet Justin Waugh, the creator of Sketch.
00:13 Sketch adds the ability to have conversational AI interactions about your pandas data frames, code, and data
00:21 right inside of your notebook.
00:23 It's pretty powerful, and I know you'll enjoy the conversation.
00:26 This is Talk Python to Me, episode 410, recorded April 2nd, 2023.
00:31 Welcome to Talk Python to Me, a weekly podcast on Python.
00:47 This is your host, Michael Kennedy.
00:49 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython,
00:55 both on fosstodon.org.
00:57 Be careful with impersonating accounts on other instances.
00:59 There are many.
01:00 Keep up with the show and listen to over seven years of past episodes at talkpython.fm.
01:05 We've started streaming most of our episodes live on YouTube.
01:09 Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows
01:15 and be part of that episode.
01:17 This episode is brought to you by Brilliant.org and us with our online courses over at Talk Python Training.
01:24 Justin, welcome to Talk Python to Me.
01:27 Thanks for having me.
01:28 It's great to have you here.
01:29 I'm a little suspicious.
01:30 I got to know, I really know how to test whether you're actually Justin or an AI speaking as Justin.
01:37 What's the deal here?
01:40 Yeah, there's no way to know now.
01:41 No, there's not.
01:42 Well, apparently I've recently learned from you that I can give you a bunch of Xs
01:46 and other arbitrary characters.
01:47 This is like the test.
01:49 It's like asking the Germans to say squirrel in World War II sort of thing.
01:54 Like it's the test.
01:55 It's the tell.
01:56 There's always going to be something.
01:58 It's some sort of adversarial attack.
01:59 Exactly.
02:01 It's only going to get more interesting with this kind of stuff for sure.
02:05 So we're going to talk about using generative AI and large language models paired with things like Pandas
02:13 or consumed with straight Python with a couple of your projects, which are super exciting.
02:18 I think it's going to empower a lot of people in ways that it hasn't really been done yet.
02:24 So awesome on that.
02:25 But before we get to it, let's start with your story.
02:27 How did you get into programming in Python and AI?
02:30 Let's see.
02:30 I got into programming in just like when I was a kid, TI-83, learning to code on that.
02:36 And then sort of just kept it up as a side hobby my whole life.
02:40 Didn't ever sort of choose it as my career path or anything for a while.
02:44 It chose you.
02:44 Yeah, it chose me.
02:46 It just, I dragged it along with me everywhere.
02:47 It's just like the toolkit.
02:49 I got a, went to undergrad and for physics, electrical engineering, then did a physics PhD, experimental physics.
02:57 During that, I did a lot of non-traditional languages, things like LabVIEW, Igor Pro,
03:02 just weird Windows, Windows hotkey for like just trying to like automate things.
03:08 Yeah, sure.
03:08 So just was sort of dragging that along.
03:11 But along that path, sort of came across GPUs and used it for accelerating processing,
03:16 specifically like particle detection.
03:17 So it was doing some like electron counting in some just detector experiments.
03:23 Is this like CUDA cores on NVIDIA type thing?
03:25 Precisely.
03:26 Stuff like that.
03:26 Okay.
03:27 And was that with Python or was that with C++ or what?
03:29 At the time it was C++ and I made like a DLL and then called it from LabVIEW.
03:33 Wow, that's some crazy integration.
03:35 It's like drag and drop programming too on the memory GPU.
03:39 Exactly.
03:40 It was all over the place.
03:41 Also had, it was a distributed LabVIEW project.
03:43 We had multiple machines that were coordinating and doing this all just to move some motors
03:49 and measure electrons.
03:50 But it got me into CUDA stuff, which then at the time was around the time
03:55 that the like AlexNet, some of these like very first neural net stuff was happening.
03:59 And so those same convolutional kernels were the same exact code that I was trying to write
04:03 to run like convolutions on these images.
04:04 And so it's like, oh, look at this like paper.
04:06 Oh, let me go read it.
04:07 It seems like it's got so many citations.
04:09 This is interesting.
04:09 And then like that sent me down the rabbit hole of like, oh, this AI stuff.
04:12 Oh, okay.
04:13 Let me go deep dive into this.
04:14 And then that just, I'd say that like became the obsession from then.
04:18 So it's been like eight years of doing that.
04:20 Then sort of just after I left academia, tried my own startup, then joined multiple others
04:26 and just sort of have been bouncing around as the sort of like founding engineer,
04:30 early engineer at startups for a while now.
04:32 And yeah, Python has been the choice ever since like late grad school and on.
04:38 I would say it sort of like came through the pandas and NumPy part, but then stuck for the scripting,
04:44 like just power, just can throw anything together at any time.
04:47 So it seems like there were two groups that were just hammering GPUs, hammering them,
04:53 crypto miners and AI people.
04:57 but the physicists and some of those people doing large scale research like that,
05:01 they were the OG graphics card users, right?
05:04 Way before crypto mining existed and really before AI was using graphics cards
05:09 all that much.
05:10 When I was like looking at some of the code, like pre-CUDA, there were some like
05:13 quant traders that were doing some like crazy stuff off of shaders.
05:17 Like it wasn't even CUDA yet, but it was shaders and they were trying to like
05:19 extract the compute power out of them from that.
05:22 So...
05:23 Look, if we could shave one millisecond off this, we can short them all day,
05:27 let's do it.
05:28 But yeah.
05:29 Yeah.
05:29 The physicists, I mean, it's always been like, yeah, it's always the get
05:33 as much compute as you can out of the, you know, devices you have because simulations are slow.
05:37 Yeah.
05:38 I remember when I was in grad school studying math, actually senior year,
05:41 regular college, my bachelor's, the research team that I was on had gotten a used
05:47 silicon graphics computer for a quarter million dollars and some Onyx workstations
05:53 that we all were given to.
05:54 I'm like, this thing is so awesome.
05:56 A couple years later, like an NVIDIA graphics card and like a simple PC would crush it.
06:01 Like that's $2,000.
06:03 It's just, yeah, there's so much power in those things to be able to harness them
06:06 for whatever, I guess.
06:07 Yeah.
06:07 As long as you don't have too much branching, it works really well.
06:10 Awesome.
06:11 So let's jump in and start talking about, let's start to talk about ChatGPT
06:18 and some of this AI stuff before we totally get into the projects that you're working on,
06:24 which brings that type of conversational generative AI to things like Pandas,
06:30 as you said.
06:31 But to me, I don't know how, maybe you've been more on the inside than I have,
06:36 but to me, it looks like AI has been one of those things that's 30 years
06:41 in the future forever, right?
06:42 It was like the Turing test and, oh, here's a chat, I'm going to talk to this thing
06:46 and see if it feels human or not.
06:48 And then, you know, there was like OCR and then all of a sudden we got self-driving cars,
06:55 like, wait a minute, that's actually solving real problems.
06:57 And then we got things like ChatGPT where people are like, wait, this can do my job.
07:02 It seems like it, just in the last couple of years, there's been some inflection point
07:07 in this world.
07:08 What do you think?
07:09 Yeah, I think there's sort of like two key things that have sort of happened
07:12 in the past, I guess, four or five years, four years, roughly.
07:15 One is the attention is all you need paper from Google, sort of this transformer
07:19 architecture came out and it's sort of a good, very hungry model that can just sort of
07:23 absorb a lot of facts and just like a nice learnable key value store almost that's stuck.
07:28 So, and then the other thing is, is the GPUs.
07:31 We were sort of just talking about GPU compute, but this has just been really,
07:34 GPU compute has really been growing so fast.
07:38 If you like, look at the like Moore's law equivalent type things, like it's just,
07:41 it's faster how much we're getting flops out of these things like faster and faster.
07:45 So, it's been really nice.
07:46 I mean, obviously there'll be a wall eventually, but it's been good riding this like
07:51 exponential curve for a bit.
07:52 Yeah, is the benefit that we're getting from the faster GPUs, is that because
07:57 people are able to program it better and the frameworks are getting better
08:00 or because just the raw processing power is getting better?
08:03 All of the above.
08:04 Okay.
08:04 I think that there was a paper that tried to dissect this.
08:07 I wish I knew the reference, but I believe that their argument was that it was
08:11 actually more the processing power was getting better.
08:13 The actual like physical silicon were getting better at making that for specifically
08:17 this type of stuff.
08:17 But like on exponentials, but yeah.
08:20 the power that those things take, I have a gaming system over there and it has a
08:26 GeForce 2070 Super.
08:29 I don't know what the Super really gets me, but it's better than the not Super,
08:32 I guess.
08:33 Anyway, that one still plugs into the wall normal, but the newer ones, like the 4090s,
08:40 those things, the amount of power they consume, it's like space heater level of power.
08:45 Like, I don't know, 800 watts or something just for the GPU.
08:48 You're going to brown out the house if you plug in too many of those.
08:53 Yeah.
08:53 Go look at those DGX A100 clusters and they've got like eight of those A100s just stacked
09:00 right in there.
09:03 They take really beefy power supplies.
09:03 It's built right directly attached to the power plant, electrical power plant.
09:08 Nuts.
09:09 Okay, so yeah, so those things are getting really, really massive.
09:12 Here's the paper Attention is All You Need from Google Research.
09:15 What was the story of that?
09:18 How's that play into things?
09:19 Yeah, so this came up during like machine translation sort of research at Google
09:23 and the core thing is they present this idea of instead of just stacking
09:30 these like layers of neural nets like we're sort of used to, they replace the like
09:34 neural net layer with this concept of a transformer block.
09:38 A transformer block has this concept inside that's an attention mechanism.
09:42 The attention mechanism is effectively three matrices that you combine in a specific order
09:48 and the sort of logic is it is that one of the vectors takes you from some space
09:54 to keys so it's almost like it's like identifying labels out of your data.
09:58 Another one is taking you from your data to queries and then it like dot products those
10:03 to find a weight and then for the one and then another one finds weight values
10:08 for your things.
10:08 So it takes this query and key, you get the weights for them and then you take
10:13 the ones that were sort of the closest to get those values from the third matrix.
10:16 Just doing it sort of like looks a little bit like accessing an element in a dictionary
10:22 like key value lookup and it's a differentiable version of that and it did really well
10:28 on their machine learning sorry, on their machine translation stuff.
10:31 This was, I think it's like one of the first big one is this BERT model and that paper
10:37 sort of the architecture of the actual neural net code is effectively unchanged
10:43 from this to ChatGPT.
10:45 Like there's a lot of stuff for like milking performance and increasing stability
10:50 but the actual like core essence of the actual mechanism that drives it it's the same
10:54 thing since this paper.
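The query/key/value lookup Justin describes can be written out in a few lines of plain Python for a single query (a toy illustration, not production code: real transformers batch this as matrix multiplies on GPUs, and the square-root scaling follows the paper):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    # Score each key against the query (scaled dot product),
    # turn the scores into weights, then blend the value vectors.
    d = len(query)
    scores = [dot(query, k) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# The query matches the first key, so the output leans toward the first value
out = attention([1.0, 0.0],
                [[1.0, 0.0], [0.0, 1.0]],
                [[10.0, 0.0], [0.0, 10.0]])
```

This soft, differentiable version of a dictionary lookup is the "attention mechanism" inside each transformer block.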
10:55 Interesting.
10:55 It's funny that Google didn't release something sooner.
10:58 It's wild that they've had they keep showing off that they've got like equivalent
11:04 or better things at different times but then not releasing it.
11:11 When DALL-E happened they had Imagen, I guess, I don't know how you say it,
11:16 and what was it, Parti? Those two, they had two different, really good, way better
11:20 than DALL-E, way better than Stable Diffusion models that were out,
11:25 and they like showed it, demoed it, but never released it to be used, so yeah,
11:25 it's one of these who knows what's going to happen with Google if they keep
11:28 holding on to these things.
11:28 Yeah well I think there was some hesitation I don't know holds up on accuracy
11:33 or weird stuff like that.
11:34 Sure.
11:35 Yeah now cat's out of the bag now now it's happening.
11:38 Yeah the cat's out of the bag and people are racing to do the best they can and it's
11:43 going to have interesting consequences for us both positive and negative I think
11:47 but you know let's leverage the positive once the cat's out of the bag anyway right?
11:51 Yeah.
11:51 Hopefully.
11:52 Might as well like ask it questions for pandas.
11:55 so let's play a little bit with ChatGPT and maybe another one of these image type
12:00 things.
12:00 So I came in here and I stole this example from a blog post that's pretty nice
12:05 about not using deeply nested code.
12:08 You can use a design pattern called a guard clause that will look and say if the
12:14 conditions are not right we're going to return early instead of having if something
12:18 if that also if something else so there's this example that is written in a poor
12:23 way and it says like it's checking for a platypus so it says if self.is_mammal
12:29 if self.has_fur if self.has_beak etc.
12:32 it's all deeply nested and just for people who haven't played with ChatGPT
12:37 like I put that in and I said sure I told it I wanted to call this arrow
12:40 because it looks like an arrow and it says it tells me a little bit about
12:44 this so I'm going to ask it please rewrite arrow to be less nested with guard
12:52 clauses right this is like a machine right if I tell it this what is it going to say
12:57 let's see it may fail but I think it's going to get it it's thinking I put it
13:01 I mistakenly put it into ChatGPT 4 which takes longer I might switch it over to
13:06 3 I don't know but the understanding of these things there's a lot of hype
13:11 about it like I think you kind of agree with me that maybe this hype is worthwhile
13:16 here we go so look at this it rewrote it said if is_platypus if not self.is_mammal
13:22 return false if not has_fur and there's no more nesting that's pretty
13:25 cool right yep I mean I'm sure you've you've played with stuff like this
13:29 right yeah big user of this I mean this is kind of interesting right like it
13:33 understood there was a structure and it understood what these were and it
13:35 understood what I said but what's more impressive is like please rewrite
13:40 the program to check for crocodiles crocodiles and you know it what is it
13:49 going to do here let's see it says sure no problem writes the function is crocodile
13:54 if not self.is_reptile if not self.has_scales if not self.has_long_snout
14:00 oh my gosh like it not only remembered oh yeah there's this new version I
14:06 wrote in the guard clause format but then it rewrote the tests I mean and then
14:12 it's explaining to me why it wrote it that way it's just it's mind blowing
14:18 like how how much you can have conversations with this and how much it understands
14:23 things like code or physics or history what do you think yeah it's really
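For readers following along without the screen share, the before/after being described looks roughly like this (the attribute names are reconstructed from the audio, so treat them as illustrative):

```python
from dataclasses import dataclass

@dataclass
class Animal:
    is_mammal: bool
    has_fur: bool
    has_beak: bool

# Deeply nested "arrow" shape discussed above
def is_platypus_nested(animal):
    if animal.is_mammal:
        if animal.has_fur:
            if animal.has_beak:
                return True
    return False

# Guard-clause rewrite: return early as soon as a condition fails
def is_platypus(animal):
    if not animal.is_mammal:
        return False
    if not animal.has_fur:
        return False
    if not animal.has_beak:
        return False
    return True

platypus = Animal(is_mammal=True, has_fur=True, has_beak=True)
print(is_platypus(platypus))  # True
```

Both versions return the same answers; the guard-clause form just keeps every check at the same indentation level, which is what makes the rewrite easier to read.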
14:28 satisfying I love that it's such a powerful generalist at these like things
14:33 that are found on the internet so if it like if it exists and it's in the training
14:36 data it can do so good at synthesizing composing bridging between them it's really
14:41 satisfying so it's really fun asking it to as you're doing rewriting changing
14:45 language I've been getting into a lot more JavaScript because I'm doing a
14:48 bunch more like front end stuff and just I sometimes will write a quick one liner in
14:51 Python that I know how to do with a list comprehension and then I'll be like
14:55 make this for me in JavaScript because I can't figure out this like how to
14:59 initialize an array with integers in it it's great for just like really quick spot
15:04 checks and it also seems to know a lot about like really popular frameworks
15:07 so you can ask it things that are surprisingly detailed about like a how would you
15:12 do CORS with requests in FastAPI and it can help you find that exact middleware
15:18 you know it's like boilerplate-y but it's great that it can just be a source for that
15:22 this portion of Talk Python to Me is brought to you by brilliant.org you're a
15:28 curious person who loves to learn about technology I know because you're
15:31 listening to my show that's why you would also be interested in this episode's
15:35 sponsor brilliant.org brilliant.org is entertaining engaging and effective
15:40 if you're like me and feel that binging yet another sitcom series is kind
15:44 of missing out on life then how about spending 30 minutes a day getting better
15:47 at programming or deepening your knowledge and foundations of topics you've always
15:51 wanted to learn better like chemistry or biology over on brilliant brilliant
15:57 has thousands of lessons from foundational and advanced math to data science
16:01 algorithms neural networks and more with new lessons added monthly when you sign up
16:06 for a free trial they ask a couple of questions about what you're interested
16:09 in as well as your background knowledge then you're presented with a cool learning
16:13 path to get you started right where you should be personally I'm going back to some
16:17 science foundations I love chemistry and physics but haven't touched them
16:20 for 20 years so I'm looking forward to playing with PV equals NRT you know
16:26 the ideal gas law and all the other foundations of our world with brilliant
16:30 you'll get hands-on on a whole universe of concepts in math science computer science
16:35 and solve fun problems while growing your critical thinking skills of course
16:39 you could just visit brilliant.org directly its url is right there in the name
16:43 isn't it but please use our link because you'll get something extra 20% off
16:47 an annual premium subscription so sign up today at talkpython.fm/brilliant
16:52 and start a 7 day free trial that's talkpython.fm/brilliant the link is
16:57 in your podcast player show notes thank you to brilliant.org for supporting
17:01 the show it's insane I don't know if I've got it in my history here we're rewriting
17:09 our mobile apps for talkpython training for our courses in Flutter and we're
17:15 having a problem downloading stuff concurrently using a particular library
17:19 in Flutter and so I asked it I said hey I want some help with a Flutter and Dart
17:26 programs what do you want it says I'm using the dio package do you know it
17:30 oh yes I'm familiar it does HTTP client stuff for Dart okay I want to download
17:34 binary video files and a bunch of them given a URL I want to do them concurrently
17:39 with three of them at a time write the code for that and boom it just writes it
17:42 like using that library I told it about not just Dart so that's incredible that
17:48 we can get this kind of assistance for knowledge and programming like you'll
17:52 never find I mean I take that back you might find that if there's a very
17:55 specific stack overflow question or something but if there's not a write-on
17:59 question for it you're not going to find it I love when you know the stack
18:04 overflow would exist for a variant of your question but the exact one doesn't
18:08 exist and you have to go grab the three of them to synthesize and it's just great
18:12 at that it also is pretty good at fixing errors sometimes it can walk itself into
18:17 lying to you repeatedly but that's so problematic yeah but you can also ask
18:24 it here's my program are there security vulnerabilities or do you see any
18:28 bugs and it'll find them yep yeah it's nuts so people may be wondering we haven't
18:34 talked yet about your project sketch why I'm talking so much about ChatGPT
18:38 so that is kind of the style of AI that your project brings to pandas which we're
18:44 going to get to but I want to touch on two more really quick AI things that
18:47 we'll dive into it the other is this just around images just the ability to
18:52 ask questions you've already mentioned three, DALL-E, Imagen and then the other
18:57 one I don't remember from Google that they haven't put out yet Midjourney is
19:01 another just the ability to say hey I want a picture of this no actually
19:05 change it slightly like that it's mind blowing they're a lot of fun they're great
19:09 for sparking creativity or having idea and just getting to see it in front of
19:12 you I think it's more impressive to me than even this ChatGPT telling me
19:16 I want a, I want an artificial intelligence panda and it came up and I
19:32 want it photorealistic in the style of National Geographic and so it gave
19:36 me this panda you can see beautiful whiskers but just behind the ear you can see
19:41 the fur is gone and it's like an android type of creature that is a beautiful
19:48 picture it's pretty accurate it's nuts that I can just go talk to these systems
19:52 and ask them these questions I find it interesting comparing the ChatGPT
19:56 and the Midjourney style I completely get it it's very visceral it's also
20:04 from another perspective I think of the weights and the scale of the model
20:07 and these image ones that solve all images are so much smaller in scale than
20:14 these language ones that have all this other data and stuff. So it's fascinating how complex language is. Yeah, I know the smarts is so much less,
20:20 but just something about it actually came up with a creative picture that never existed.
20:26 Yeah. Right. You could show this to somebody like, oh, that's an artificial panda. That's
20:31 insane. Right. But it's, but I just gave it like a sentence or two. Yeah. Yeah. Yeah. I don't know.
20:36 Yeah. This, it's sort of a technical interpolation, but I, I love it because it's
20:41 like this, it's just phenomenal interpolation. It's like through semantically labeled space. So
20:46 like the words have meaning and it understands the meaning and can move sliders of like, well,
20:51 I've seen lots of these machine things. I understand the concept of gears and this metal and this,
20:54 like the shiny texture and then the fur texture and like, they're very good at texture. It's a,
21:00 yeah, really great how it interprets all of that just to fit the, you know, the small prompt.
21:04 Yeah. There are other angles of which it's frustrating. Like I want it turned, I want it
21:08 in the back of the picture, not the, no, it's always in the center. One more thing really quick.
21:13 And this leads me into my final thing is, is a GitHub copilot. GitHub copilot is like this in
21:19 your editor, which is kind of insane, right? You can just give it like a comment or a series of
21:23 comments and it will write it. I think ChatGPT is maybe more open-ended and more creative, but this
21:29 is, this is also a pretty interesting way to go. I'm a heavy user of Copilot. If there's a,
21:35 there's a weird crutch, and I'm like slowly developing a need to have this in my browser. I was
21:40 on a flight recently and was without the internet, and Copilot wasn't working. And I felt the, like,
21:46 I felt the difference. I felt like I was walking through mud instead of just like actually
21:50 running a little bit. And I was like, oh, I've been disconnected from my distributed mind. I am broken
21:56 partially. Yeah. So incredible. So the last part I guess is like, you know, what are the ethics
22:03 of this? Like, I went on very positively about Midjourney, but how much of that is trained on
22:08 copyrighted material? Or there's GitHub Copilot. How much of that is trained on GPL-based stuff that was
22:16 in GitHub? But when I use it, I don't have the GPL any longer on my code. I might use it on commercial
22:23 code, but just running it through the AI, does that strip licenses or does it not? There's a
22:29 githubcopilotlitigation.com, which is interesting. I mean, we might be finding out. There's also
22:35 think Getty, I think it's Getty Images. I'm not 100% sure, but I think Getty Images is suing
22:41 one of these image generation companies. I can't remember which one, maybe Midjourney. I
22:47 don't think it's Midjourney. I think it's Stable Diffusion, but anyway, it doesn't really matter.
22:50 Like, there's a bunch of things that are pushing back against this, like, wait a minute,
22:53 where did you get this data? Did you have rights to use this data in this way? And I mean,
22:58 what are your thoughts on this angle of AI these days?
23:02 Yeah. I know it sounds like I don't worry too much about it in either direction. I think I
23:08 believe in personal ethics. I believe in open source things, availability of things,
23:14 because it just sort of like accelerates collective progress. But that said, I also believe in like
23:19 slightly different, like, social structures to help support people. Like, I'm a, I guess,
23:24 a firm believer in things like UBI or something like that, in that direction.
23:27 So when you combine those, I feel like it, you know, things sort of work out kind of well,
23:31 but when we, like, but it is still a thing that, like, copyright exists, and that there is this sense of ownership, and this is my thing, and I wanted to
23:38 put licenses on it. And I think that this sort of story started, presumably, when I wasn't really around
23:44 having this conversation, but like when the internet came around and search engines happened and like
23:49 Google could just go and pull up your thing from your page and summarize it in a little blob on the
23:54 page, was that fair? What if it starts with, you know, your shop, and it allows you to go buy that
23:58 same product from other shops? Like, I think that the same things are showing up. And in the same way
24:04 that the web, like in the internet sort of, it's sort of, it was a large thing, but then it sort of,
24:08 I don't know if it got quieter, but it sort of became in the background. We sort of found new
24:12 systems. It stopped being piracy and CDs and the music industry is going to struggle. And Hey,
24:17 things like Spotify exist and streaming services exist. And like, I don't know what the next way
24:21 is.
24:21 They're doing better than ever, basically. Yeah. Yeah. Yeah. So I think it's just evolution.
24:24 And then, like, some things will change and adapt, some things will, like, fall apart, and new
24:29 things will be born. That's just, it's a good time for lots of opportunity, I guess, is the
24:33 part that I'm excited about.
24:34 Yeah. Yeah. Yeah. For sure. I think that's definitely true. It probably, you're probably right. It probably
24:39 will turn out to be, you know, old man yells at cloud cloud doesn't care sort of story, you know,
24:45 in the end where it's like, on the other hand, if, if somebody came back and said,
24:49 you know, a court came back and said, you know what, actually anything trained on GPL
24:54 and then you use copilot on it, that's GPL. Like that would have instantly mega effects. Right.
25:02 Yeah. I, yeah. And I, I guess there's also stuff like the, I don't, I didn't actually read the
25:06 article. I only saw the headline, and you know, that's the worst thing to do, is to repeat a thing
25:09 which is a headline. But there was that Italy thing that I saw about, like, I don't know.
25:14 Yeah. That was really clickbaity, but I didn't get time to look at it yet. So yeah. You could probably
25:20 ask ChatGPT to summarize it for you. As long as it can Bing, I guess, and get that updated.
25:25 Yeah. Yeah. Yeah. Yeah. There's a lot of, there's a lot of things playing in that space,
25:30 right? Some different places. Okay. So yeah, very cool. But as a regular user, I would say,
25:36 you know, regardless of kind of how you feel about this, at least this is my viewpoint right now.
25:40 It's like, regardless of how I feel about which side is right in these kinds of disputes,
25:45 this stuff is out of the bag. It's out there and available and it's a tool. And it's like saying,
25:50 you know, I don't want to use spell check or I don't want to use some kind of like code checking. I just
25:55 want to write like in straight notepad because it's pure, right? Like sure you could do that,
26:00 but there's these tools that will help us be more productive and it's better to embrace them and know
26:05 them than to just like yell at them, I suppose. Yeah. A lot of accelerant you can get.
26:10 really speed up whatever you want to get done. Yeah, absolutely. All right. So speaking of
26:15 speeding up things, let's talk pandas and not even my artificial pandas, but actual programming pandas
26:22 with this project that you all have from Approximate, yeah, Approximate Labs, called Sketch. So Sketch is
26:30 pretty awesome. Sketch is actually why we're talking today, because I first talked about this on Python
26:35 Bytes, and I saw this was sent over there, to me, by Jake Furman, who said, you should check this
26:41 thing out, it's awesome. And yeah, it's pretty nuts. So tell us about Sketch. Yeah. So even
26:49 though I use Copilot, as I sort of described already, and it's become a crutch, I found in Jupyter
26:54 notebooks, when I wanted to work with data, it just didn't, it doesn't actually apply there. So on one side,
27:00 it was sort of like missing the mark at times. And so it was sort of like, how can I get this
27:04 integrated into my flow? The way I actually work in a Jupyter notebook, if maybe I'm working a Jupyter
27:09 notebook on a remote server and I don't want to set up VS Code to do it. So I don't have copilot at all.
27:13 Like there's a bunch of different reasons that I was just like in Jupyter. It's a very different IDE
27:17 experience. It is. Yeah. It's super different, but also you might want to ask questions about the data,
27:21 not the structure of the code that analyzes the data, right? Exactly. Yeah. And so just a bunch of that
27:26 type of stuff. And then also at the other side, I was trying to find something that I could throw together
27:31 that I thought was a strong demonstration of the value Approximate Labs is trying to chase, but wouldn't
27:38 take me too much time to make. So it was a, oh, I could probably just go throw this together pretty quickly.
27:42 I bet this is going to be actually useful and helpful. And so let's just do that. And so I threw it on top of
27:48 the actual library I was using, which was Sketch. I put this on it and then shipped it, so it sort of shifted what the
27:54 project was. Yeah. Yeah. So you also have this other project called Lambda Prompt. And so were
27:59 you trying to play around Lambda Prompt and then like see what you could kind of apply here to leverage
28:03 it? Or is that the full journey? I can get into it. It started with data sketches. I left my last job
28:11 to chase bringing the algorithms, like combining data sketches with AI, but just at that vague,
28:17 like, at that level. Tell us what data sketches are real quick. Sure. Yeah. So a data sketch is a
28:22 probabilistic aggregation of data. So if you have, I think the most common one that people have heard of
28:27 is hyperloglog and it's used to estimate cardinality. So estimate the number of unique
28:32 values in a column. A data sketch is a class of algorithms that all sort of, like, use roughly fixed
28:39 width, usually binary, representations. And then in a single pass, so they're O(N), they'll look at each row
28:46 and hash the row and then update the sketch, or not necessarily hash, but they update this sketch
28:52 object, essentially. Those sketch objects also have another property, that they are mergeable. So you
28:57 have this, like, really fast O(N) to go aggregate up, and you get this mergeability. So you
29:03 can map-reduce it at, you know, trivial speeds. The net result is that this, like, tight binary-packed
29:09 object can be used to approximate measures you were looking for on the original data. So you could look
29:15 at, if you do a few of these, they're like theta sketches, you can go and estimate not just the
29:21 unique count, but you can also estimate if this one column would join well with this other column,
29:25 or you can estimate, Oh, if I were to join this column to this column, then this third column that
29:30 was on that other table would actually be correlated to this first column over here. So you get these,
29:35 a bunch of different distributions, you get a whole bunch of these types of properties.
29:40 And each sketch is sort of just, I would say, algorithmically engineered, like very, very
29:44 engineered to be like information theory optimal at solving one of those like measures on the data.
29:50 And so tight packed binary representations.
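A minimal, runnable illustration of the family he's describing: a K-Minimum-Values (KMV) sketch, a simpler cousin of HyperLogLog and theta sketches. The class and parameter names here are invented for this example, and real implementations (e.g. Apache DataSketches) are far more engineered, but the single-pass O(N) update, the small fixed-size state, and the mergeability are the same ideas.

```python
import hashlib

class KMVSketch:
    """K-Minimum-Values sketch: single-pass, fixed-size, mergeable
    distinct-count estimator (same family as HyperLogLog)."""

    def __init__(self, k=256):
        self.k = k
        self.mins = []  # the k smallest normalized hashes seen so far

    def _hash01(self, value):
        # Map any value to a pseudo-uniform float in (0, 1].
        h = hashlib.md5(str(value).encode()).digest()
        return (int.from_bytes(h[:8], "big") + 1) / 2**64

    def update(self, value):
        x = self._hash01(value)
        if x in self.mins:
            return  # duplicate value, sketch unchanged
        if len(self.mins) < self.k:
            self.mins.append(x)
            self.mins.sort()
        elif x < self.mins[-1]:
            self.mins[-1] = x
            self.mins.sort()

    def merge(self, other):
        # Union the two states and keep the k smallest: this is why
        # sketches map-reduce so cheaply.
        out = KMVSketch(self.k)
        out.mins = sorted(set(self.mins) | set(other.mins))[: self.k]
        return out

    def estimate(self):
        if len(self.mins) < self.k:
            return len(self.mins)  # saw fewer than k distinct values
        return int((self.k - 1) / self.mins[-1])

s = KMVSketch(k=256)
for i in range(10_000):
    s.update(f"user-{i}")
print(s.estimate())  # roughly 10,000, give or take a few percent
```

Because two sketches merge into another sketch of the same fixed size, per-partition sketches can be aggregated in a map-reduce and still answer the distinct-count question at the end.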
29:53 All right. So you thought about, well, that's cool, but ChatGPT is cool too.
29:57 Yeah.
29:58 What else?
30:00 The core thing was, so those representations aren't usable by AI right now. And when you actually go and
30:07 use GPT-3 or something like this, you have to figure out a way to build the prompt to get it to do
30:13 what you want. This was especially true in a pre-instruction-tuning world; you had to really, like, you had to
30:18 play the prompt engineer role even more than you have to now. Now you could sort of get away with describing it to
30:23 ChatGPT. And one of the things that you really have to like, play the game of is how do you get all the
30:28 information it's going to need into this prompt in a succinct, but good enough way that it helps it do
30:35 this. And so what sketch was about was, rather than just looking at the context of the data, like the
30:41 metadata, the column names and the code you have, also go get some representation of the
30:48 content of the data, turn that into a string, and then bring that string in as part of the prompt.
30:53 And then when it has that, it should do much better at actually generating code, generating
30:59 answers to questions. And that's what Sketch was a proof of concept of, and that worked very well.
31:03 It really quickly showed how valuable actual data content context is.
31:08 Yeah, I would say it's resonating with people. It's got 1,500 stars on GitHub.
31:13 And it looks about six months old. So that's pretty good growth there.
31:18 Yeah, January 16th was the day I posted it on Hacker News. And it had three stars,
31:22 it was an empty repo at that point.
31:24 Okay, three stars. It's like me and my friends. Okay, cool. So this is a tool that basically patches
31:33 pandas to add functionality or functions, literally to pandas data frames that allows you to ask
31:42 questions about it, right?
31:44 Yep.
31:44 So what kind of questions can you ask it? What can it help you with?
31:47 Yeah, so there's two classes of questions you can ask, you can ask it, the ask type questions,
31:53 these are sort of from that summary statistics data. So from the general, you know, representation of your
32:00 data, ask it to like, give you answers about it, like, what are the columns here, you sort of have
32:04 a conversation where it sort of understands the general, like, shape of the data, general
32:10 distributions, things like that, number of uniques, and like give that context to it, ask questions of that
32:15 system. And then the other one is ask it how to do something. So you specifically can get it to write
32:21 code to solve a problem you have, you describe the problem you want, and you can ask it to do that.
32:25 Right. I've got this data frame, I want to plot a graph of this versus that, but color by the other
32:31 thing.
32:31 Yep. And in the data space world, what I sort of decided to do is like in the demo here is just sort of
32:37 walk through what are some standard things people want to ask of data, like, like, what are those common
32:43 questions that you hear, like, in Slack between, you know, like, business team and an analyst team. And it's just
32:49 sort of like, Oh, can you do this? Can you get me this? Can you tell me if there's any PII? Is this safe to send?
32:54 Can I send the CSV around? Can you clean up this CSV? Oh, I need to load this into our catalog. Can you
32:59 describe each of these columns and check the data types all the way to can you actually go get me
33:04 analytics or plot this?
33:05 Yeah. Awesome. So and it plugs right into Jupyter Notebooks, so you can just import it and basically
33:13 installing Sketch, which is a pip or Conda type thing, and then you just import it, and it's good to go,
33:19 right? Yep. Using the Pandas extensions API, which allows you to essentially hook into their data
33:24 frame callback and register a, you know, a function. Interesting. So it's not as jammed on from the
33:31 outside. It's a little more, plays a little nicer with Pandas rather than just like, we're going to go
33:35 to the class and just tap on it. Yeah, yeah. Not full monkey patching here. It's,
33:41 it's, like, actually supported, I think. I don't, I don't see it used often, but it is somewhere in the docs.
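The hook being described is pandas' documented extensions API, `pd.api.extensions.register_dataframe_accessor`. Here's a toy accessor in that style; the `demo` namespace and its `summary` method are invented for illustration and are not Sketch's actual internals.

```python
import pandas as pd

# Sketch registers itself this way rather than monkey-patching:
# pandas' extensions API attaches a namespace to every DataFrame.
@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    def __init__(self, df):
        self._df = df

    def summary(self):
        # A stand-in for the kind of context Sketch gathers:
        # dtype and unique count per column.
        return {
            col: (str(self._df[col].dtype), int(self._df[col].nunique()))
            for col in self._df.columns
        }

df = pd.DataFrame({"state": ["NY", "CA", "NY"], "sales": [1, 2, 3]})
print(df.demo.summary())
```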
33:46 Excellent. But here it is. So what I wanted to do for this is there's a, an example that you can do,
33:52 like if you go to the repo, which obviously I'll link to, there's a video, which I mean,
33:57 mad props to you because I review so many things, especially for the Python Bytes podcast, where
34:02 there's a bunch of news items and new things we're just going to check out. And we'll, we'll find people
34:06 recommending GUI frameworks that have not a single screenshot, or other types of things. Like,
34:13 I have no way to judge what this thing even might look like. What does it even make?
34:17 I don't even know, but somebody put in a lot of effort, but they didn't bother to post an image. And you
34:21 posted a minute and a half animation of it going through this process, which is really, really
34:27 excellent. So people can go and watch that one-minute, one-minute-30 video. But there's also a
34:34 Colab, an "Open in Google Colab" link, which gives you a running interactive variant here. So you can just
34:41 follow along, right? And play with these pieces. It requires me to sign up and run it. That's okay.
34:46 Let me talk people through some of the things it does. And you can tell me what it's doing,
34:51 how it's doing that, like how people might find that advantageous. So import sketch, import pandas
34:57 as PD standard. And then you can say pandas read CSV and you give it one from a, like a,
35:03 some example CSV that you got on your, one of your GitHub repos, right? Or in your account.
35:08 Yeah. I found one online and then added just random synthetic data to it.
35:12 Yeah. Like, Oh, here's a data dump. No, just kidding.
35:14 So then you take that data frame, called sales data, and you say dot sketch dot ask, as a string,
35:21 what columns might have PII personal identifying information in them?
35:28 Awesome. And so it comes, tell me how that works and what it's doing here.
35:33 So it does, I guess it has to build up the prompt, which is sent to GPT, so to OpenAI's specific
35:40 completion endpoint. In building up the prompt, it looks at the data frame. It does a bunch of
35:44 summarization stats on it. So it calculates uniques and sums and things like that. There's two modes in
35:50 the backend that either does sketches to do those, or it just uses like DF dot describe type stuff.
35:55 And then it pulls those summary stats together for all the columns, throws it together with my,
36:00 the rest of the prompt I have, you can, we can go find it, but then it sends that prompt.
36:05 Actually, it also grabs some information off of inspect. So it sort of like walks the,
36:10 the stack up to go and check the variable name because the data frame is named sales data.
36:15 So it actually tries to go find that variable name in your call stack so that it can, when it writes
36:20 code, it writes valid code, puts all that together, send it off to open AI, gets code back, uses Python
36:25 AST to parse it, check that it's valid. If it's not valid Python code, or you tried to import something
36:30 that you don't have, it will ask it to rewrite once. So this is sort of like an iterative process. So it
36:36 takes the error or it takes the thing and it sends it back to open AI. It's like, Hey, fix this code.
36:41 And then it, or in this case, sorry, ask, it actually just takes this, it sends that exact
36:45 same prompt, but it just changes the last question to, can you answer this question off of the information?
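The parse-and-check step he describes (make sure the generated code is valid Python and only imports modules you have, otherwise feed the error back for one rewrite) can be sketched with the standard-library `ast` module. The function name and the allow-list policy here are illustrative, not Sketch's actual code.

```python
import ast

def validate_generated(code, available={"pandas", "math", "json"}):
    """Return None if LLM-generated code parses and only imports
    available modules; otherwise return an error message that could
    be fed back to the model for one rewrite attempt."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return f"not valid Python: {exc}"
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            roots = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            roots = [(node.module or "").split(".")[0]]
        else:
            continue
        missing = [r for r in roots if r not in available]
        if missing:
            return f"imports unavailable modules: {missing}"
    return None

print(validate_generated("import pandas as pd\npd.DataFrame()"))  # None
print(validate_generated("import nonexist_lib"))
```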
36:51 This portion of Talk Python to Me is brought to you by us over at Talk Python Training, with our courses.
36:58 And I want to tell you about a brand new one that I'm super excited about. Python web apps that fly
37:05 with CDNs. If you have a Python web app, you want it to go super fast. Static resources,
37:11 turn out to be a huge portion of that equation. Leveraging a CDN could save you up to 75% of your
37:17 server load and make your app way faster for users. And this course is a step-by-step guide on how to do
37:24 it. And using the CDN to make your Python apps faster is way easier than you think. So if you've
37:30 got a Python web app and you would like to have it scaled out globally, if you'd like to have your users
37:35 have a much better experience and maybe even save some money on server hosting and bandwidth,
37:41 check out this course over at talkpython.fm/courses. It'll be right up there at the top.
37:46 And of course the link will be in your show notes. Thank you to everyone who's taken one of our courses.
37:51 It really helps support the podcast. I'm back to the show.
37:56 And so that sounds very, very similar to my Arrow program. Rewrite it with guard clauses, redo it.
38:02 Like you kind of, I gave you this data in this code and I asked you this question and you can have a
38:07 little conversation, but at some point you're like, all right, well, we're going to take what it gives me
38:10 after a couple of rounds at it. Right.
38:12 Yeah. I take the first one that, like, passes an import check and passes AST linting.
38:18 When you use small models, you run into not-valid Python a lot more, but with these
38:23 ones, it's almost always good.
38:25 It's ridiculous. Yeah. Yeah. Yeah. It's crazy. Okay. So it says the columns that might have PII
38:30 in them are credit card, SSN and purchase address. Okay. That's pretty excellent. And then you say,
38:37 all right, sales data dot sketch dot ask, can you give me a friendly name for each column and output this
38:44 as an HTML list, which is parsed as HTML and rendered in Jupyter notebook accurately. Right. So it says
38:51 index. Well, that's an index.
38:52 This one ends up being the same.
38:53 It's not a great, this one is not a great example, because it doesn't have to, like, infer,
38:57 because the names are like order-space-date, right? Instead of order, like, maybe lowercase
39:04 o and then, like, attached to a big D or whatever. But it'll give you some more information. You
39:09 can, like, kind of ask it questions about the type of data, right?
39:13 Yeah, exactly. I found this is really good at, if you play the game and you just name all
39:17 your columns, like, col1, col2, col3, col4, and you ask it, give me new column
39:21 names for all of these, it gives you something that's pretty reasonable based off of the data.
39:24 So pretty useful.
39:24 Okay. So it's like, oh, these look like addresses. So we'll call that address. And this looks like
39:28 social security numbers and credit scores and whatnot.
39:31 Yep. Yep. So it can really help with that quick first onboarding step.
39:35 Yeah. So everyone heard it here first. Just name all your columns. One, two, three, four,
39:39 and then just get help. Like AI, what do we call these? All right. So the next thing you did in this
39:48 demo notebook was you said sales data dot sketch dot. And this is different before I believe,
39:54 because before you were saying ask, and now you can say how to create some derived features from the,
40:01 from the address. Tell us about that.
40:03 Yeah. This is the one that actually is the code writing. It's essentially the exact same prompt,
40:07 but the change is the very end. It says like, return this as Python code that you can execute to do
40:13 this. So instead of answering the question directly, answer the question with code that will answer the
40:17 question.
40:18 Right. Write a Python line of code that will answer this question given this data, something like that.
40:23 Yep. Yep. Something like that. I don't remember exactly anymore. It's been a while, but yeah,
40:27 some I've iterated a little bit until it started working and I was like, okay, cool. And so,
40:32 ask it for that. And then it spits back code. And that was, it sort of, it sounds overly simple,
40:37 but that was it. That was like, that was the moment. And I was just like, oh, I could just ask it to do my
40:42 analytics for me. And it's just all the, every other feature just sort of became like apparently
40:45 solvable with this. And the more I played with it, the more it was just, I don't have to think about,
40:50 I don't even have to go to Google or stack overflow to ask the question, to get the API stuff for me.
40:55 I could, from zero to I have code that's working is one step in Jupyter.
40:59 So you wrote that how to, and you gave it the question and then it wrote the lines of code and you just
41:04 drop that into the next cell and just run it. Right. And so for example, in this example, it said, well,
41:09 we can come up with city, state, and zip code by writing a vector transform, by passing a lambda,
41:16 that'll pull out, you know, the city from the string that was the full address and so on. Right.
41:21 Yeah. That's pretty neat.
41:22 Yeah. It's fun to see what it, what it does. Not again, not any of these things are always
41:26 probabilistic, but it also usually serves as a great starting point if, even if it doesn't get it
41:30 right.
41:30 Yeah. Sure. You're like, oh, okay. I see. Maybe that's not exactly right, because, with Europeans,
41:35 their city, maybe, and their zip code are in different orders sometimes, but it gives you
41:40 something to work with pretty quickly. Right. By asking just a, what can I do?
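Roughly the kind of code `.sketch.howto` hands back for that request, shown here on hypothetical data in the same shape; note it assumes US-style "street, city, ST zip" addresses, which is exactly the kind of assumption you'd want to double-check.

```python
import pandas as pd

# Hypothetical data in the demo's shape; the notebook's real
# sales_data comes from a CSV linked in the repo.
sales_data = pd.DataFrame({
    "Purchase Address": [
        "136 Church St, New York City, NY 10001",
        "562 2nd St, San Francisco, CA 94016",
    ]
})

# The style of code .sketch.howto returns: lambdas applied over the
# column, splitting the "street, city, ST zip" string.
sales_data["City"] = sales_data["Purchase Address"].apply(
    lambda a: a.split(",")[1].strip())
sales_data["State"] = sales_data["Purchase Address"].apply(
    lambda a: a.split(",")[2].strip().split(" ")[0])
sales_data["Zip"] = sales_data["Purchase Address"].apply(
    lambda a: a.split(",")[2].strip().split(" ")[1])
print(sales_data[["City", "State", "Zip"]])
```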
41:44 And then another one, this one's a little more interesting instead of just saying like, well,
41:48 what other things can we pull out? It's like, this gets towards the analytics side, right? It says,
41:53 get the top five grossing states for the sales data. Right. And it writes a group by some sorts,
42:01 and then it does a head given five. And that's pretty neat. Tell us about this. I mean, I guess
42:05 it's about the same, right? Just ask more questions.
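For a question like that, the generated code typically reduces to a group-by, a sort, and a `head(5)`. A stand-alone version of that shape on toy data (the column names are assumed, not from the demo's CSV):

```python
import pandas as pd

# Toy stand-in for the demo's sales_data.
sales_data = pd.DataFrame({
    "State": ["NY", "CA", "NY", "TX", "CA", "WA"],
    "Sales": [100, 250, 50, 80, 120, 60],
})

# "Get the top five grossing states" comes back as a group-by,
# a descending sort, and head(5):
top5 = (
    sales_data.groupby("State")["Sales"]
    .sum()
    .sort_values(ascending=False)
    .head(5)
)
print(top5)  # CA 370, NY 150, TX 80, WA 60
```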
42:08 They all feel pretty similar to me. I think, I guess I could jump towards, like, things that
42:13 I wanted to put in next, but they weren't reliable enough to, like, really make the cut.
42:18 I wanted to have it go, like, my question was like, go build a model that predicts sales for the
42:24 next six months and then plot it on a 2D plot with a dotted line for the predicted plot. And like,
42:31 it would try, but it would always do something off. And I found I always had to break up the,
42:36 like, prompt into, like, smaller and smaller pieces to get intern-level code back. Yeah.
42:40 Yeah. It was fun getting it to train models, but it was also its own like separate thing. I sort of
42:48 didn't play with too much. And there's another part of sketch that I guess is not in this notebook. I
42:54 didn't realize. Yeah. Because you have to use the open AI API key, but it's the sketch apply. And
43:00 that's the, I'll say this one is another just like power tool. This one has like, I don't really talk
43:07 about, I don't even include it in the video because it's not just like as plug and play, you do have to
43:11 go set an environment variable. And so it's like, yeah, that's one step further than I want to,
43:15 I don't, it's not terrible, but it's a step. And so what it does is it lets you apply a completion
43:22 endpoint of whatever your design, row-wise. So every single row, you can go and apply and run something.
43:29 So if every row of your pandas data frame is a, some serialized text from a PDF or something,
43:35 or a file in your directory structure, and you just load it as a data frame, you can do dot
43:39 df.sketch.apply. And it's almost the exact same as df.apply. But the thing you put in as your function
43:45 is now just a Jinja template that will fill in your column variables for that row and then ask GPT to
43:51 continue completing. So I think I did silly ones, like here's a few states. And then the prompt is
43:58 extract the state for it. Or so I think, right, extract the capital of the state. Yeah. Yeah. So
44:04 just pure information extraction from it, but you can sort of like this grows into a lot more.
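The per-row templating idea can be shown without calling OpenAI at all: fill a template from each row's columns, send it to a completion function, and store the result as a new column. Here the completion is a lookup-table stub so the example runs offline, and the template uses `str.format` rather than sketch's actual Jinja templates; everything here is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"state": ["New York", "Georgia"]})

def fake_complete(prompt):
    # Stand-in for the completion call sketch.apply would make;
    # a lookup table keeps the example offline and deterministic.
    capitals = {"New York": "Albany", "Georgia": "Atlanta"}
    for state, capital in capitals.items():
        if state in prompt:
            return capital
    return "?"

# Fill a template from each row's columns, "complete" it, and store
# the result as a new column (sketch.apply does this with Jinja).
template = "What is the capital of {state}? Answer with one word."
df["capital"] = df.apply(
    lambda row: fake_complete(template.format(**row)), axis=1)
print(df["capital"].tolist())  # ['Albany', 'Atlanta']
```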
44:10 So does that come out of the data? Or is that coming out of open AI where like it sees what is
44:15 the capital of state and it sees New York? It's like, okay, well, all right, Albany.
44:20 Yeah. So this is purely extracting out of the model weights. Essentially, this is not like a factual
44:25 extraction. So this is probably a bad example because of that. But the thing that, actually,
44:29 actually, the better example I did once was, what is like some interesting colors that are
44:34 good for each state? And it like just came up with a sort of like flaggish colors or sports team colors.
44:38 That was sort of fun when it wrote that as hex. You can also do things like if you have a large text
44:43 document or you can actually, I'll even do the more common one that I think everybody actually wants
44:47 is you have messy data. You have addresses that are like syntactically messy and you can say,
44:52 normalize these addresses to be in this form. And you sort of just write one example. It's a run
44:57 dot apply and you get a new column that is that cleaned up data. Yeah. Incredible. Okay. A couple
45:03 things here. It says I can, you can directly call OpenAI and not use your endpoint. So at the
45:11 moment it kind of proxies through a web service that you all have that somehow checks stuff, or what does
45:16 that do? Yeah. It was just pure ease of use. I wanted people to be able to do pip install and
45:21 import sketch and actually get it, because I know how much I use things in, in a Colab or in Jupyter
45:28 notebooks on weird machines, and remembering an environment variable, managing secrets, it's like
45:32 this whole overhead that I don't want to deal with. And so I wanted to just offer a lightweight way if you
45:38 just want to be able to use it. But I know that that's not sufficient for security, if people are going
45:42 to be conscious of these things and want to be able to, you know, not go through my proxy thing that's
45:46 there for help. So, sure.
45:47 Offer this up.
45:48 What's next? Do you have a roadmap for this? Are you happy where it is and you're just letting it be or
45:53 do you have grand plans?
45:55 I don't have much of a roadmap for this right now. I'm actually, I guess there's like grand roadmap,
46:00 which is, like, at the company scale, what we're working on. I would say that, with this, we're really trying to
46:05 solve data with AI, just in general. And so these are the types of things we hope to open source and
46:11 just give out there, like actually everything we're hoping to open source. But the starting place is
46:16 going to be a bunch of these like smaller toolkits or just utility things that hopefully save people
46:20 time or are very useful. The grand thing we're working towards, I guess, is this more, like, the, it's the
46:26 full automated data stack. It's like the dream I think that people have wanted where you just ask it
46:31 questions and it goes and pulls the data that you need. It cleans it. It builds up the full pipeline.
46:36 It executes the pipeline. It gets you to the result and it shows you the result. And you look,
46:40 you can inspect all of that, that whole DAG and say, yes, I trust this. So we're working on getting
46:45 full end to end.
46:46 So when I went and asked about that Arrow program, I said, I think this will still do it. I think this
46:51 will probably work again. And it did, which is awesome. Just the way I expected. But, you know,
46:58 AI is not as deterministic as read the number seven. If seven is less than eight, do this,
47:05 right? Like what is the repeatability? What is the sort of experience of doing this? Like I ran it,
47:11 I ran it again. Is it going to be pretty much the same or is it going to have like, what's the mood
47:16 of the AI when it gets to you?
47:18 This is sort of a parameter you can, there's a little bit of a parameter you can set if you want
47:22 to play that game with the temperature parameter on these models at higher and higher temperatures,
47:26 you get more and more random, but it can also truly be out of left field random if you go too
47:31 high temperature.
47:32 Okay. But you get maybe more creative solutions.
47:34 Yeah, you could sometimes get that. And as you move towards zero, it gets more and more deterministic. Unfortunately, for really trying to do good, provable build-chain type things with hashing and caching, it's not fully deterministic even at zero temperature. But that's just, I think, worth thinking about.
47:53 At the same time, run it once, see the answers it gives you, comment that business out, and just put that as markdown, you know, freeze it, memorialize it in markdown, because you don't need to ask it over and over which columns have PII. It's probably the same ones as last time. You just write: these columns, credit card, social security, and purchase address, have that. And so now you know. Is that a reasonable way to think about it?
48:20 I think, yeah, if you want determinism, or performance is a thing you're worried about, you can always cache, however you do it, in comments or with actual caching systems.
48:28 Sure, sure. Or like, how do I do that group-by sorting business? You don't have to ask that over and over once it gives you the answer.
48:38 Yeah. My workflow when I use Sketch: I ask the question, copy the code, and then delete the question or ask a different one for my next problem.
48:47 Yeah, it's a little bit vestigial; when you save your notebook at the end, you sort of want to go back and delete all the questions you asked, because you don't need to rerun them when you actually execute the notebook later. But yeah, that makes a lot of sense. And plus, you look smarter if you don't have to show how you got the answers.
49:05 Look at this beautiful code that's even commented.
49:07 Yeah, exactly. I guess you could probably ask it to comment your code, right?
49:12 Yeah, you can ask it to describe code. There have been some really cool things where people throw assembly at it and ask it to translate to different languages so they can interpret it. Or you can do really fun things across languages, or, I guess I'll say, across levels of abstraction: you can ask it to describe code at a very top level, or get really precise, like, for this line, what are all the implications if I change a variable, or something like that?
49:34 Yeah, that's really cool. I suppose you could do that here. Can you converse with it? You can say, okay, you gave me this, and, I guess, what's the word, does it keep tokens in context like ChatGPT does? Can you say, okay, that's cool, but I want it as integers, not as strings?
49:50 Yeah, I did not include that in this. There was a version that had something like that, where I was just keeping the last few calls around, but it quickly didn't align with the Jupyter IDE experience, because you end up scrolling up and down, and you have so much power over how you execute in a Jupyter notebook that your context can change dramatically just by scrolling up. And trying to introspect across a Jupyter notebook is just a whole other nightmare. So I didn't try to extract the code out of the notebook so that it could understand the local context. You could go straight to ChatGPT or something like that, take what it gave you, and start asking it questions.
50:26 Okay, so another question that I had about this. In order for that to do its magic, like you said, the really important thought or breakthrough you had was using not just the structure of the pandas code, but also a little bit about the data. What are the privacy implications of me asking a question about my data? Suppose I have a super-duper-secret CSV. Should I not ask or howto on it? What's the story there? If I work with data, how much sharing do I do of something I might not want to share if I ask a question about it?
51:04 I'd say the same discretion you'd use if you would copy a row or a few rows of that data into ChatGPT to ask it a question about it.
51:12 Okay.
51:12 That's the level of concern, I guess, you should have. Specifically, I am not storing these things, but I know OpenAI, at least, seems to be moving towards something like a 30-day retention policy. So yeah, you are sending your stuff over the wire, over the network, to use these language models, until they come local, until things like LLaMA and Alpaca get good enough; for now they're going to be remote. Actually, sorry, I just now thought that could be a fun thing: go get Alpaca working with Sketch so that it can be fully local.
51:45 Interesting. Like a privacy-preserving type of deal.
51:48 Yeah, I hadn't thought about that. That's the power of these smaller models that are almost good enough. I could probably just quickly throw that in here and see; maybe it reaches a wider audience.
51:57 You have an option to not go through your API but directly to OpenAI. You could have another one to pick other options, right? Potentially.
52:07 Yep. The interface to these, one thing that maybe isn't talked about as much, I haven't heard as much excitement about it, is that the APIs have gotten pretty nice for this whole space. The idea of a completion endpoint is pretty straightforward: you send it some amount of text and it will continue that text. It's so simple, but it's so generalizable. You could build so many tools off of just that one API endpoint, essentially. And so combine that with an embedding endpoint, and you sort of have all you need to make complex AI apps.
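The "one generalizable endpoint" point can be made concrete with a tiny sketch: everything below is built on a single `complete(text) -> str` function. The `stub_complete` backend is invented so the example runs offline; a real version would call a provider's completion API.

```python
from typing import Callable

# The whole contract: text in, continuation out.
CompletionFn = Callable[[str], str]

def make_summarizer(complete: CompletionFn) -> Callable[[str], str]:
    # One tool built on the single primitive.
    return lambda doc: complete(f"Summarize in one sentence:\n{doc}\nSummary:")

def make_classifier(complete: CompletionFn) -> Callable[[str], str]:
    # A totally different tool, same primitive underneath.
    return lambda text: complete(f"Label the sentiment (positive/negative):\n{text}\nLabel:")

# Hypothetical offline backend standing in for a real completion endpoint.
def stub_complete(prompt: str) -> str:
    if prompt.startswith("Summarize"):
        return "A short summary."
    return "positive"

summarize = make_summarizer(stub_complete)
classify = make_classifier(stub_complete)
print(summarize("Long document..."))  # -> "A short summary."
print(classify("I love this!"))       # -> "positive"
```

Swapping `stub_complete` for a real API call changes nothing about the tools built on top; that interchangeability is the generalizability being described.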
52:41 It's crazy. Speaking of making AI apps, maybe touch a bit on your other project, Lambda Prompt.
52:48 So yeah, Lambda Prompt.
52:50 But before you get into it, mad props for the Greek letter. That's a true physicist or mathematician; I can appreciate that.
52:59 Yeah, I was excited to put it everywhere, but of course these things don't play well; you end up playing games with character sets and websites. I'm the one that causes the pain, and I'm also the one that has to clean the data that I put into these systems. So yeah, people see the name and go, why is the "a" so italicized? I don't get it.
53:17 Okay, yeah. So this one came pre-ChatGPT; this was October, I guess right around ChatGPT coming out. I was really just messing around a lot with completion endpoints, as we were talking about, and I kept rewriting the same request boilerplate over and over. I also kept rewriting the f-strings I was trying to send in. And I was just like, ah, Jinja templates solved this already; there already is formatting for strings in Python. Let me just use that, compose it into a function, and call these completion endpoints. I don't want to think of them as API endpoints; RPC is a nice mental model, but I want to use them as functions. I want to be able to put decorators on them, use them both async and not async in Python. I just want a thing I can call really quickly with one line and do whatever I need with it. So I threw this together. It's very simple. Honestly, the hardest part was just getting all the layers right. There are actually two things: you can wrap any function as a prompt, not just these calls to GPT, and then I do tracing on it, so as you get into the call stack, every input and output can be hooked into and traced with call traces. There's a bunch of weird stuff to make the utility nice, but functionally, you just import it, write a Jinja template with the class, and use the object that comes back as a function. Your Jinja template variables get filled in, and your result is the text string that comes back out of GPT.
54:57 Interesting. And some people might be thinking, Jinja, okay, do I have to create an HTML file and all that? No, it's just a string with double curlies for turning values into strings within the string, kind of a different way to do f-strings, as you were hinting at.
55:08 Yeah. There were two pieces here that I realized as I was doing this. I think I mentioned with Sketch that I really often was taking the output of a language model prompt and doing something with it in Python. Actually, I can give a full example: the SQL-writing exploration we did. We would run GPT-3 to ask it to write the SQL, take the SQL, go try to execute it, and if it fails for whatever reason, take that error and say, hey, rewrite it. We talked about that pattern, which is sort of rewriting. Another one of the patterns was: increase the temperature, ask it to write the SQL, and get ten different SQL answers in parallel. This is where the async was really important, because I just wanted to use asyncio.gather and run all ten of these truly in parallel against the OpenAI endpoint, get ten different answers to the SQL, run all ten queries against your database, then, of the ones that successfully ran, see which ones gave the same answer most often; that's probably the correct answer. And just chaining that stuff together is very Pythonic. You can really just imagine, oh, I just need to write a for loop, I just need to run this function, take the output, feed it into another function, very procedural. But all the abstractions on top of the OpenAI API, and there was nothing else really at the time, but even the new ones that have come out since, like LangChain, that have sort of taken the space by storm, are not really trying to offer the minimal ingredient, which is the function. And to me it was just: if I can offer the function, I can write a for loop, I can store a variable and keep passing it in. You can get so many emergent behaviors just by starting with the function and simple Python scripting on top of it.
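The high-temperature voting pattern just described (sample several SQL candidates in parallel, run them all, keep the most common successful answer) can be sketched like this. The `generate_sql` and `run_query` stubs are invented for illustration; a real version would call the OpenAI API and a real database.

```python
import asyncio
from collections import Counter

async def generate_sql(question: str, seed: int) -> str:
    # Stand-in for a high-temperature completion call; real code would
    # hit the API and get a different candidate per sample.
    candidates = [
        "SELECT COUNT(*) FROM users",   # one phrasing
        "SELECT COUNT(*) FROM users",
        "SELECT COUNT(id) FROM users",  # different text, same answer
        "SELECT * FROM user",           # broken query
    ]
    return candidates[seed % len(candidates)]

async def run_query(sql: str):
    # Stand-in for executing against a database.
    results = {
        "SELECT COUNT(*) FROM users": 42,
        "SELECT COUNT(id) FROM users": 42,
    }
    if sql not in results:
        raise ValueError("query failed")
    return results[sql]

async def self_consistent_answer(question: str, n: int = 8):
    # Fire off all the generations truly in parallel, then all the queries.
    sqls = await asyncio.gather(*(generate_sql(question, i) for i in range(n)))
    results = await asyncio.gather(*(run_query(s) for s in sqls),
                                   return_exceptions=True)
    ok = [r for r in results if not isinstance(r, Exception)]
    # Majority vote over the answers that actually ran.
    return Counter(ok).most_common(1)[0][0]

print(asyncio.run(self_consistent_answer("How many users?")))  # -> 42
```

The broken candidates simply fail and drop out of the vote, which is exactly why the pattern tolerates a noisy generator.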
56:57 And there's some interesting stuff here in Lambda Prompt. You can kind of start it, seed it, I don't know what ChatGPT calls it, you can tell it a few things: I'm going to ask you a question about a book; the book is a choose-your-own-adventure book. You can prepare it, right? There's probably a more formal term for that, but you can do it here. You can say, hey system, you are a type of bot, and that creates an object you can have a conversation with. And you say, what should we get for lunch? And your type of bot is pirate, so it says, as a pirate, I would suggest we have some hearty seafood, or whatever, right? That's beyond what you're doing with Sketch. Obviously this is not so much for code; this is like conversing with Python rather than in Python and your editor.
57:42 Yeah. This one came when the OpenAI chat API endpoint came out, and I was just like, oh, I should support it. I wanted to be able to Jinja-template inside of the conversation. So you can imagine a conversation prepared with, like, seven steps back and forth, where you want to hard-code how the flow of the conversation goes, and template it so that on message three it puts in your new context, on message four the output from another prompt you ran, on message five this other data thing, and then you ask it to complete. The intent is that something arbitrarily complex like that would still be just three lines or so in Lambda Prompt. The idea was to offer a really simple API for this.
58:23 The other thing that's interesting is there's an async and a non-async version. That's cool; people can check that out. Also a way to host it as a web service with, say, FastAPI or something like that, and you can make it a decorator, an @prompt decorator.
58:42 Yeah. On any function you can just throw @prompt, and it wraps it with the same class so that all the magic works. For the server bit, FastAPI has that function-inspection part; I did a little bit of middleware to get the two happy together. Then all you have to do is import FastAPI and run, you know, gunicorn on that app. It's two lines, and any prompts you have made become their own independent REST endpoints where you can just do a GET or a POST, and it returns the response from calling the prompt.
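The template-becomes-a-function idea can be sketched without the library itself. Here `str.format` stands in for Jinja templating, and `make_prompt`, `stub_complete`, and the toy `@prompt` decorator are all invented stand-ins, not Lambda Prompt's real API:

```python
def make_prompt(template: str, complete):
    """A template plus a completion backend becomes a plain function."""
    def prompt_fn(**kwargs) -> str:
        # Fill in the template variables, then complete the text.
        return complete(template.format(**kwargs))
    return prompt_fn

# Hypothetical offline backend standing in for a GPT completion endpoint.
def stub_complete(text: str) -> str:
    return text.upper()

greet = make_prompt("Write a greeting for {name}:", stub_complete)
print(greet(name="Ada"))  # -> "WRITE A GREETING FOR ADA:"

# Decorator flavor, loosely mirroring an @prompt-style wrapper with
# toy call tracing on every input and output.
def prompt(fn):
    def wrapped(*args, **kwargs):
        result = fn(*args, **kwargs)
        print(f"traced {fn.__name__} -> {result!r}")
        return result
    return wrapped

@prompt
def shout(text: str) -> str:
    return stub_complete(text)
```

Because each prompt is just a callable, hanging it off a FastAPI route or composing it in a for loop needs no extra machinery, which is the design point being made.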
59:20 But these prompts can also be chains of prompts: one prompt can call another prompt, which can call another, and those can go from async to non-async and back, and it should work. I'm pretty sure I tested everything; as far as I know, I've got pretty good coverage.
59:34 Super cool. All right, we're getting a little short on time, but I think people are going to really dig this, especially Sketch. I think there's a lot of folks out there doing pandas that would love an AI buddy to help them do things, not just analyze the code, but the data as well.
59:55 Yeah, I think for anybody, I know it is for me, it's just like Copilot in the VS Code IDE: Sketch in your Jupyter IDE takes almost nothing to add. And whenever you're just sitting there and think you're about to Alt-Tab over to Google, you can just try sketch.ask, and it's surprising how often sketch.ask or sketch.howto gets you way closer to a solution without even having to leave your environment.
01:00:17 It's like a whole other level of autocomplete, for sure. Super cool. All right.
01:00:23 Now, before I let you out of here, you got to answer the final two questions. If you're going to write
01:00:27 some Python code and it's not a Jupyter notebook, what editor are you using? It sounds to me like you may
have just given a strong hint at what that might be.
01:00:38 Yeah, I've switched almost entirely to VS Code, and I've been really liking it with the remote development. I work across many machines, both cloud and local; some five or six different machines are my primary working machines. And I use the remote VS Code feature, so I have a unified environment that gives me the terminal, the files, and the code all in one, and Copilot on all of them.
01:00:59 Yeah, it's wild. All right. And then notable PyPI package? I mean, pip install sketch, you can throw that out there if you like, that's pretty awesome. But anything you've run across where you're like, oh, people should know about this? It doesn't have to be popular.
01:01:12 I guess these two are very popular, but in the data space I'm a huge fan of Ray and also Arrow. I use those two tools as my backend bread and butter for everything I do, and they have just been really great work.
01:01:30 Apache Arrow, right. And then Ray, I'm not sure; I remember seeing something about this.
01:01:42 Ray is a distributed scheduling and compute framework. We didn't get into it, but I'm parsing Common Crawl, which is like 25 petabytes of data, and Ray is great; it's just the workhorse. I find it so snappy and good, and it offers everything I need in a distributed environment, so I can write code that runs on a hundred machines and not have to think about it.
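Ray's core shape, submit tasks as futures and then gather the results, can be mimicked on one machine with the standard library. This sketch uses threads rather than Ray itself, and `parse_shard` is an invented stand-in for real work:

```python
from concurrent.futures import ThreadPoolExecutor

def parse_shard(shard_id: int) -> int:
    # Stand-in for parsing one chunk of a large crawl.
    return shard_id * shard_id

# Submit everything, then gather; the same shape as Ray's
# ray.get([parse_shard.remote(i) for i in range(...)]) on a cluster.
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(parse_shard, i) for i in range(10)]
    results = [f.result() for f in futures]

print(results[:5])  # -> [0, 1, 4, 9, 16]
```

The appeal Justin describes is that Ray keeps this exact submit-and-gather pattern while transparently scheduling the tasks across many machines.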
01:02:05 It works really well. That's pretty nuts. Not as nuts as ChatGPT and Midjourney, but it's still pretty nuts. So before we call it a day, do you want to tell people about Approximate Labs? It sounds like you're making some good progress and might have some jobs for people who want to work in this kind of area as well.
01:02:23 Yeah. So we're working at the intersection of AI and tabular data: anything related to training these large language models and also tabular data, things with columns and rows. We're trying to solve that problem, trying to bridge the gap, because there's a pretty big gap. We have three main initiatives we're working on. We're building up the dataset of datasets, like The Pile or The Stack or LAION-5B, those big datasets used to train all these big models; we're making our own for tabular data. We're training models, actually training large language models, full transformer models. And we're building apps like Sketch, UIs, things that are there to help make data more accessible to people. Anything that helps people get value from data, and making it open source, that's what we're working on. We just raised our seed round, so we are now officially hiring, looking for people who are interested in the space and enthusiastic about these problems.
01:03:15 Awesome. Well, very exciting demos and libraries, however you call them.
01:03:23 I think these are neat; people are going to find a lot of cool uses for them. So excellent work, and congrats on all the success so far. It sounds like you're just starting to take off.
01:03:32 Yeah, thank you.
01:03:36 All right, Justin, final call to action. Let's pick Sketch: people want to get started with Sketch, what do you tell them?
01:03:39 Give Sketch a try: pip install it, import it, and then throw it on your data frame. Then ask it questions or howtos.
01:03:49 Yep, whatever you want. And if you really want to, and you trust the model, throw some applies at it and have it clean your data for you.
01:03:57 Cool, awesome. All right, well, thanks for being on the show and coming here to tell us about all your work. It's been great.
01:04:00 Yeah, thank you. See you later. Thanks for having me.
01:04:05 This has been another episode of Talk Python To Me. Thank you to our sponsors; be sure to check out what they're offering, it really helps support the show. Stay on top of technology and raise your value to employers, or just learn something fun in STEM, at brilliant.org. Visit talkpython.fm/brilliant to get 20% off an annual premium subscription. Want to level up your Python? We have one of the largest catalogs of Python video courses over at Talk Python. Our content ranges from true beginners to deeply advanced topics like memory and async. And best of all, there's not a subscription in sight. Check it out for yourself at training.talkpython.fm. Be sure to subscribe to the show: open your favorite podcast app and search for Python; we should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. We're live streaming most of our recordings these days. If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at talkpython.fm/youtube. This is your host, Michael Kennedy. Thanks so much for listening, I really appreciate it. Now get out there and write some Python code. I'll see you next time.