Multimodal data with LanceDB

Episode #488, published Thu, Dec 12, 2024, recorded Tue, Nov 26, 2024


LanceDB is a developer-friendly, open source database for AI. It's used by well-known companies such as Midjourney and Character.ai. We have Chang She, the CEO and cofounder of LanceDB on to give us a look at the concept of multi-modal data and how you can use LanceDB in your own Python apps.

Episode Deep Dive

1. Introduction to LanceDB and Multimodal Data

What is LanceDB?
- A developer-friendly, open-source database for AI, built on top of the Lance format.
- Focused on multimodal data (e.g., text, images, videos, PDFs, etc.) and embedding vectors.
Multimodal Data
- Refers to data types that extend beyond traditional rows and columns (e.g., images, videos, text embeddings, 3D point clouds).
- LanceDB enables storing and querying these heterogeneous data types in one place.

Relevant Links

LanceDB on GitHub: github.com/lancedb/lancedb

2. Technology Stack and Rust

Core in Rust
- The project’s core data format and database engine are written in Rust for performance and safety.
- Originally started in C++, then switched to Rust to avoid common issues like segfaults and to leverage Rust’s robust tooling (Cargo).
Python Wrappers
- LanceDB offers Python APIs that wrap around the Rust core, providing a familiar developer experience for data science and AI use cases.
- Effort has been made to ensure contributors can extend LanceDB in Python, even if they don’t know Rust.

3. Lance Format, Arrow, and Ecosystem Integration

Columnar Format
- Lance is a columnar format designed specifically for AI/embedding data.
- Stores data on disk (or in cloud object storage like S3) in a way that is optimized for random access and high-performance reads.
Apache Arrow Integration
- Lance is fully compatible with Apache Arrow, making it easy to hand off data to (or ingest data from) DataFrame libraries like Pandas or Polars, as well as distributed engines like Spark or Ray.
- Random access with Arrow-based datasets greatly improves workflows involving large embeddings or image/video data.
Interoperability with Existing Tools
- The open data layer approach means LanceDB can fit into existing ecosystems—DuckDB, Polars, Pandas, Spark, etc.
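
To make the random-access point concrete, here is a toy Python sketch of a columnar layout. It is not Lance's actual on-disk encoding, which the episode does not describe; the column names, the 4-dimensional vectors, and the take_vector and take_rows helpers are all hypothetical. The idea it illustrates: when each column lives in its own contiguous buffer and every vector has the same width, fetching row i is an offset calculation rather than a scan.

```python
from array import array

DIM = 4  # hypothetical embedding width; real embeddings are far wider

# One contiguous buffer per column (columnar), instead of one record per row.
ids = array("q", [10, 11, 12])
vectors = array("f", [
    0.0, 0.5, 1.0, 1.5,   # row 0
    1.0, 1.5, 2.0, 2.5,   # row 1
    2.0, 2.5, 3.0, 3.5,   # row 2
])


def take_vector(row: int) -> list[float]:
    """Fetch one row's vector by offset, without touching other rows."""
    if not 0 <= row < len(ids):
        raise IndexError(f"row {row} out of range")
    start = row * DIM
    return list(vectors[start:start + DIM])


def take_rows(rows: list[int]) -> list[tuple[int, list[float]]]:
    """Random access to an arbitrary set of rows, in the order requested."""
    return [(ids[r], take_vector(r)) for r in rows]
```

Calling take_rows([2, 0]) reads only the two requested slices of the vector buffer, which is the access pattern that matters when sampling from large embedding or image tables.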

Relevant Links

Apache Arrow: arrow.apache.org
DuckDB: duckdb.org

4. Local File-Based Database (SQLite/DuckDB Mental Model)

Single-File Approach
- Like SQLite or DuckDB, LanceDB can be used as an embedded database, writing data to a local file.
- No extra server to manage: just connect to a file path and start inserting/searching data.
Scaling with Object Storage
- The same file-like approach extends to S3 or S3-compatible APIs (e.g., MinIO).
- Allows larger-scale scenarios without maintaining a specialized server in early development.
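
The embedded mental model can be sketched with SQLite, which ships in Python's standard library. This is an analogy rather than LanceDB code: the table name and the save_and_reload helper are made up for illustration. What it shows is the shape this section describes, where the application opens a path, writes, and later reopens the same path, with no server process in between.

```python
import os
import sqlite3
import tempfile


def save_and_reload(db_path: str) -> list[tuple[str, float]]:
    """Write rows to a local database file, then reopen it and read them back."""
    conn = sqlite3.connect(db_path)  # creates the file if it does not exist
    conn.execute("CREATE TABLE IF NOT EXISTS items (name TEXT, price REAL)")
    conn.executemany(
        "INSERT INTO items VALUES (?, ?)",
        [("foo", 10.0), ("bar", 20.0)],
    )
    conn.commit()
    conn.close()  # the "database" is now just a file on disk

    # A fresh connection (or a different process) sees the same data.
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT name, price FROM items ORDER BY name").fetchall()
    conn.close()
    return rows


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.db")
    rows = save_and_reload(path)
    file_exists = os.path.exists(path)
```

The only "deployment" step is choosing where the path points, which is why the same model can extend from a local disk to object storage.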

Relevant Links

MinIO: min.io

5. Querying & Indexing Vectors

Vector Indexing
- LanceDB is particularly optimized for embedding vectors (images, text embeddings, etc.).
- Offers disk-based indexes that allow searching large numbers of vectors without needing to load them fully in RAM.
GPU Acceleration
- For very large datasets (millions to billions of vectors), LanceDB can use GPUs (via frameworks like PyTorch) to build indexes much faster.
- Significantly reduces index creation time from days to hours or minutes depending on data size.
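
As a rough illustration of why an index helps, the sketch below compares a brute-force scan with a partition-based (IVF-style) search in plain Python. This is a generic technique chosen for illustration; the episode summary does not spell out LanceDB's index internals, and the hand-picked centroids, function names, and nprobes parameter are assumptions. A real index would learn its centroids (for example with k-means) and keep partitions on disk.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force(vectors, query, k):
    """Exact search: compare the query against every stored vector."""
    return sorted(range(len(vectors)), key=lambda i: l2(vectors[i], query))[:k]


def build_ivf(vectors, centroids):
    """Assign each vector's row id to the partition of its nearest centroid."""
    partitions = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(v, centroids[c]))
        partitions[nearest].append(i)
    return partitions


def ivf_search(vectors, centroids, partitions, query, k, nprobes=1):
    """Approximate search: scan only the nprobes partitions closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobes]
    candidates = [i for c in probe for i in partitions[c]]
    return sorted(candidates, key=lambda i: l2(vectors[i], query))[:k]


vectors = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]  # hand-picked for the demo
partitions = build_ivf(vectors, centroids)
query = (9.0, 9.0)

exact = brute_force(vectors, query, k=2)
approx = ivf_search(vectors, centroids, partitions, query, k=2, nprobes=1)
```

With nprobes=1 the search touches three of the six vectors and still finds the same two neighbors here. On real data the partition scan is approximate, so raising nprobes trades speed for recall.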

6. Python Usage and Pydantic Integration

Python API
- Install via pip install lancedb.
- Create tables, insert data, and run vector queries with straightforward Python calls, whether synchronously or asynchronously.
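
A minimal sketch of the synchronous Python API, assuming the lancedb package is installed. The table name, sample rows, and the nearest_items helper are made up for illustration, and the guarded import is only there so the snippet loads cleanly where lancedb is absent.

```python
import tempfile

try:
    import lancedb  # third-party: pip install lancedb
except ImportError:  # keeps the sketch loadable where lancedb is not installed
    lancedb = None


def nearest_items(db_dir, query, k=1):
    """Create a tiny table under db_dir and return the k nearest item names."""
    db = lancedb.connect(db_dir)  # a local path; no server involved
    table = db.create_table(
        "items",  # hypothetical table name
        data=[
            {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
            {"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
        ],
    )
    # With no index built, this is an exhaustive nearest-neighbor scan.
    hits = table.search(query).limit(k).to_list()
    return [hit["item"] for hit in hits]


if lancedb is not None:
    with tempfile.TemporaryDirectory() as tmp:
        result = nearest_items(tmp, [3.0, 4.0])
else:
    result = None
```

Swapping to_list() for to_pandas() returns the same hits as a DataFrame, provided pandas is installed.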
Pydantic Models
- LanceDB supports a “schema-first” approach using Pydantic.
- You can define your own BaseModel classes, specify which fields are embeddings, and LanceDB handles embedding generation (e.g., with OpenAI or local Hugging Face models).

Relevant Links

Pydantic: docs.pydantic.dev

7. Searching, RAG Workflows, and Integrations

Search API
- Perform vector searches (e.g., nearest neighbor lookups) with simple Python calls, returning results in Pandas DataFrames, Polars DataFrames, or Pydantic models.
RAG Orchestration
- LanceDB can integrate with external frameworks like LangChain or LlamaIndex, so that retrieval-augmented generation (RAG) workloads can use LanceDB for storing embeddings and retrieving context.
Bring Your Own Embeddings
- Integrations with multiple embedding providers, such as OpenAI, Hugging Face models, Cohere, and more.
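
The retrieval step of a RAG workflow can be sketched without any external services. In the example below, toy_embed is a deliberately crude bag-of-words stand-in for a real embedding model (OpenAI, Hugging Face, and so on), and the in-memory docs list stands in for a LanceDB table. Both are assumptions made to keep the example self-contained, as are the function names and the prompt wording.

```python
import math
from collections import Counter


def toy_embed(text, vocab):
    """Stand-in embedding: normalized word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query, docs, vocab, k=1):
    """Return the k documents whose embeddings are most similar to the query."""
    q = toy_embed(query, vocab)
    ranked = sorted(docs, key=lambda d: cosine(q, toy_embed(d, vocab)), reverse=True)
    return ranked[:k]


def build_prompt(question, context):
    """Combine retrieved context and the question into one prompt for an LLM."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


docs = [
    "lance is a columnar format for ai data",
    "minio is a self-hosted object store with an s3 api",
    "cargo is the rust build tool",
]
vocab = sorted({w for d in docs for w in d.split()})

question = "which object store speaks the s3 api"
top = retrieve(question, docs, vocab, k=1)
prompt = build_prompt(question, top)
```

In a real pipeline the embeddings would be computed once at ingest time and stored next to the text, so that only the query is embedded at search time.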

8. Production and Commercial Offerings

Open Source vs. Enterprise
- LanceDB is fully open source for local prototypes and moderate-scale production.
- For large-scale indexing (billions of vectors) or high throughput, LanceDB offers an Enterprise version and a Cloud (hosted) service, both built on the same Lance format.
Enterprise / On-Prem
- Larger organizations can run LanceDB Enterprise inside their own cloud account (or on-prem).
- Emphasizes high concurrency, vast data volumes, and enterprise security requirements.
Serverless RAG
- Some users run LanceDB in serverless environments (like AWS Lambda) pointing at S3-stored Lance data for cost-effective, fully managed solutions.

Overall Takeaway

LanceDB aims to simplify AI data workflows—whether you’re adding a quick vector search to a Python app or building a large-scale, multimodal data lake for enterprise. By embracing Apache Arrow, offering a columnar disk-based format, and integrating well with the Python ecosystem (including Pydantic, Polars, and LangChain), LanceDB makes it straightforward to store and query embeddings, images, and other unstructured data types. Users can start locally with a single-file approach and scale to enterprise or serverless solutions—all while working with the same fundamental Lance format.

Links from the show

Chang She: @changhiskhan
Chang on GitHub: github.com

LanceDB: lancedb.com
LanceDB Source: github.com
Embeddings API: github.com
MinIO: min.io
LanceDB Quickstart: github.com
VectorDB-recipes: github.com
Watch this episode on YouTube: youtube.com
Episode #488 deep-dive: talkpython.fm/488
Episode transcripts: talkpython.fm

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy

Episode Transcript

00:00 LanceDB is a developer-friendly open-source database for AI.

00:03 It's used by well-known companies such as Midjourney and Character.ai.

00:06 On this episode, we have Chang She, the CEO and co-founder of LanceDB, on to give us a look at the concept of multimodal data

00:15 and how you can use LanceDB in your own Python apps.

00:18 This is Talk Python To Me, episode 488, recorded November 26, 2024.

00:24 Are you ready for your host, here he is!

00:27 You're listening to Michael Kennedy on Talk Python To Me.

00:30 Live from Portland, Oregon, and this segment was made with Python.

00:34 Welcome to Talk Python To Me, a weekly podcast on Python.

00:40 This is your host, Michael Kennedy.

00:42 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython,

00:48 both accounts over at fosstodon.org, and keep up with the show and listen to over nine years of episodes at talkpython.fm.

00:56 If you want to be part of our live episodes, you can find the live streams over on YouTube.

01:00 Subscribe to our YouTube channel over at talkpython.fm/youtube and get notified about upcoming shows.

01:07 This episode is brought to you by Sentry.

01:09 Don't let those errors go unnoticed.

01:11 Use Sentry like we do here at Talk Python.

01:13 Sign up at talkpython.fm/sentry.

01:16 And this episode is brought to you by Bluehost.

01:19 Do you need a website fast?

01:20 Get Bluehost.

01:21 Their AI builds your WordPress site in minutes, and their built-in tools optimize your growth.

01:27 Don't wait.

01:27 Visit talkpython.fm/bluehost to get started.

01:31 Hey, Chang.

01:32 Welcome to Talk Python To Me.

01:34 Hey, how are you?

01:35 Excited to be here.

01:36 I'm excited to have you here.

01:37 We're going to talk about LanceDB, multidimensional data, all sorts of fun,

01:43 multimodal data, all sorts of fun things.

01:45 So really cool database, kind of in the same category as SQLite, DuckDB.

01:52 People want a quick mental model, but different kind of data, right?

01:56 Yeah, absolutely.

01:57 So I think we specialize in multimodal data or AI data.

02:02 And what we think about is included in that is, you know, the tabular data that we have from previous generations of data engineering, data science,

02:11 but also, you know, embedding vectors, images, videos, PDFs, basically anything that don't really fit neatly into like data frames or Excel sheets.

02:23 Yeah.

02:23 Less tabular data.

02:24 That's right.

02:25 If it's not good tabular, then you need something else probably.

02:28 Yep.

02:29 Yep.

02:29 Awesome.

02:30 Which is pretty interesting since I was one of the earliest core contributors to Pandas.

02:35 So I spent years of my life working on data frames and big data.

02:40 And now I'm working on-

02:41 You've given up on squares and rectangles?

02:42 Oh my gosh.

02:44 Well, never, but there's room to build something more here.

02:46 Yeah, I hear you.

02:47 And also, also I agree.

02:49 So it's going to be super fun to dive into all of that stuff together.

02:53 Before we do though, let's hear your story.

02:55 How do you get into programming?

02:56 I know this is written in Rust, but also has some programming APIs.

03:00 How much Python are you doing these days versus other languages?

03:03 Yeah.

03:04 The core of Lance format and LanceDB are in Rust.

03:08 And then there are Python APIs and wrappers around that, as well as TypeScript.

03:13 And the team spends a significant amount of time thinking about what the Python user experience

03:20 looks like and how best to combine the performance that you get from Rust and the ease of use

03:26 you get from Python.

03:27 And then also making it as extensible as possible so that even though it's a Python library,

03:33 if you want to add features, for example, you don't have to learn Rust just to be a contributor

03:40 to Lance or LanceDB.

03:41 You know, I think Python has benefited a lot from the various Rust-based projects out there,

03:48 right?

03:49 Obviously, we've got uv and Ruff.

03:51 We've got Pydantic, a bunch of other things that make it faster.

03:55 But there has been some pushback that if everything gets done in Rust, then it's hard for Python

04:00 people to contribute.

04:02 I don't necessarily subscribe to that because previously the answer was when things are too slow,

04:06 we write them in C.

04:07 Maybe it's just traditional Python people also know C, but traditional Python people don't also know Rust.

04:13 I think that's honestly the crux of the issue or the crux of the pushback.

04:18 So I really like to hear that, you know, the extensibility points and you've thought through

04:23 how Python people can extend it and contribute and stuff.

04:26 Yeah, I've spent a lot of time writing like Cython and C to make Python code go faster back in the day.

04:34 I think you're right.

04:35 A lot of the points are the same for Rust.

04:38 And actually, Rust is a lot safer, actually.

04:41 So we originally started Lance, in the very beginning, in C++, and we spent months writing C++ code for a new format.

04:51 And I think it was the Christmas break of 2022, that's what I call our Rust pill moment, when we decided to switch over.

05:01 And I think we ended up, we took about three weeks and we rewrote roughly four or five months of C++ code there.

05:08 Overall, I think we actually ended up getting better performance for the most part.

05:13 And the biggest thing was just us having the confidence to move forward very quickly without that, like, in the back of our mind, like, where's the next SEGFAULT coming from?

05:25 Yeah.

05:26 Yeah.

05:26 Oh, absolutely.

05:27 Or the next security vulnerability.

05:29 Exactly.

05:30 Because you did an sprintf, not one of the safe variants or whatever.

05:33 It's been a long time since I've done C++.

05:35 But tell me about that process.

05:38 I don't know how many people are working on the project at this time and how much experience you all had with Rust.

05:43 Why did you make that choice?

05:45 And how did it go?

05:46 Oh, yeah.

05:47 I'm a Rust neophyte.

05:49 Like, before that Christmas of 2022, I've never written a line of Rust.

05:54 Most of our team are also new to Rust.

05:59 Or they're new when they join LanceDB.

06:02 And I think what helps is that most of them were also already proficient in C++, right?

06:09 So the sort of core thinking around algorithms and design is all the same, performance and all that.

06:15 And the Rust language makes it just easier to learn.

06:20 You're more productive.

06:20 And it's just safer.

06:23 And I think one of the surprising benefits that we saw that I don't see people talking about a lot,

06:29 and maybe that's just us not being very good at CMake, is that productivity piece.

06:35 When we were writing lots of C++ code, we spent tons of time just wrestling with CMake to make the builds work.

06:42 And once we moved to Rust, Cargo just made that so easy.

06:47 Basically, we spent zero time wrestling with that.

06:49 Wow.

06:49 That's pretty wild.

06:51 So you all were pretty new to Rust.

06:53 How much do you think that LLMs, ChatGPT, Mistral, Llama, all these things, the availability to just go,

07:00 hey, chat thing, how do you do this in Rust?

07:03 Here it is in C++.

07:04 Did you all find that beneficial?

07:07 Did you all do this?

07:08 We use a lot of coding aid.

07:10 So from Copilot to Cursor, Continue, Zed.

07:14 We try out lots of different tools as well.

07:16 I think certainly the quality of the tooling matters, but at the core of it, there's that model performance.

07:24 And so the models today perform pretty well on Python and TypeScript, but on Rust, much less so.

07:32 Oh, interesting.

07:33 Yeah.

07:33 And so we always joke about, you know, God grant me the confidence of ChatGPT coming up with random Rust syntax.

07:41 Yeah.

07:42 I've had that as well.

07:43 I asked for some, how to do this in Python, because I was feeling lazy or something.

07:47 I just needed it for an example real quick.

07:48 And it said, you use this.

07:50 It looked completely plausible.

07:51 It was converting time zones and date times.

07:55 And just the time zone part of it was driving me nuts.

07:57 I'm like, all right, Chat, let's do this.

07:59 And it just made up stuff.

08:00 So, oh, you're going to use this function or this property based on this.

08:04 No, that doesn't exist.

08:05 I'm like, that doesn't exist.

08:06 Try again.

08:06 Oh, I'm so sorry.

08:07 Let me try another one.

08:08 Yeah.

08:09 What's interesting is I think that like in Python, if you're looking at really well-known libraries out there,

08:18 it generally does much better.

08:19 But even for lesser-known libraries that don't have as much training corpus or attention focused on them,

08:26 it tends to hallucinate a lot.

08:29 Even for something like Apache Arrow, like PyArrow kind of API, it's the standard for in-memory data.

08:38 But ChatGPT still makes up APIs for that.

08:42 And if you're looking at in terms of like the effect on the developer community,

08:47 so there are lots of times where I've seen either like Hacker News posts or comments on Hacker News posts

08:53 where there's a choice between two similar libraries.

08:57 And then I think the commenters are like, well, I'm choosing A because ChatGPT or Copilot generates better completions for that one.

09:05 How interesting.

09:06 I've never thought about that.

09:08 But yeah.

09:09 Wow.

09:09 You know, the Python, the PSF and JetBrains Python developer survey that came out,

09:17 one of the really interesting stats is 50% of the people who filled out that survey said they've been doing professional software development for less than two years.

09:25 Oh, wow.

09:26 That is surprising.

09:27 Yeah.

09:27 I mean, 50% of all Python developers, less than two years.

09:31 And I think there's got to be a strong pull for, yeah, I could go research that.

09:36 Let me just ask Chat first and see what it says.

09:38 Or chat, ask Copilot or whatever tools they're using.

09:42 That shouldn't be surprising.

09:44 And also what's interesting, I think I've noticed kind of the same thing with AI versus, you know, previous generations of machine learning.

09:52 In previous generations, it was all Python.

09:54 With AI, there's a huge TypeScript community that has formed around AI.

10:00 And then also, in previous generations of machine learning, you had to have years of training to learn math and stats and data analytics and data science before you were a productive machine learning engineer.

10:14 And I think today, with AI, that's definitely not the case: you can quickly, through some experimentation and self-learning, become a fairly proficient AI engineer, which is also a new term that's been coined in the last two years.

10:30 It's going to be really interesting to see how this affects the industry.

10:33 On one hand, it's going to supercharge people getting going faster, help them become unstuck if they're stuck on a problem.

10:41 But it could also end up hollowing out deep knowledge maybe in the long term.

10:45 I'm not entirely sure.

10:46 We'll see.

10:47 You know, that is very interesting.

10:49 I do think that what we're seeing is there's a lot of focus around sort of model capabilities and then maybe like RAG and, you know, how to marry context and knowledge bases to the model through, you know, vector search and vector databases.

11:05 But what we're seeing a lot is more traditional and large enterprises are now starting to adopt AI in wholesale.

11:13 And the way they're thinking about it is, you know, let's build an excellence center around AI internally and make AI applications easily accessible to the rest of the company across many different groups.

11:26 So you don't have to worry too much about AI infrastructure, where the data is stored, and how that interfaces with a model.

11:32 And maybe just think about, you know, more at the business level and the user level, how you want these things to work.

11:39 So I think that deep knowledge around AI statistics, machine learning will probably start to coalesce around these centers of innovation within these large enterprises.

11:50 And of course, within, you know, AI native startups, they, you know, they all have to have much deeper knowledge in advance of the rest of the market to stay ahead of the curve.

12:02 It's interesting.

12:03 And, you know, I started to see these bots.

12:05 We'll get to the database in just a second.

12:08 I started to see these bots as kind of almost replacing search.

12:11 You know, I find myself, maybe I'll just skip search.

12:14 Maybe I'll just ask this thing because I got a good shot of a good answer straight away rather than.

12:19 Right.

12:19 And that kind of ties into this enterprise thing.

12:21 Like, you know, searching Google is a little bit tricky sometimes.

12:24 I use Kagi or whatever you're using.

12:26 It's tricky because there's SEO tricks going on.

12:31 There's ads.

12:32 There's a lot of stuff where you're like, is this really relevant?

12:34 And I think that's probably even worse than trying to search within your enterprise.

12:38 Like, do we have documents on this or whatever?

12:40 It's probably just like, here's a 175 email email thread.

12:43 Like, no, no, thanks.

12:45 This isn't going to help.

12:46 Let's talk LanceDB and multimodal data.

12:49 And I guess maybe that's the place before we get into the details of Lance.

12:53 What is multimodal data anyway?

12:55 Yeah, absolutely.

12:56 So like I was saying earlier, I think with AI data, you know, vectors and embedding vectors is just scratching the surface.

13:03 One of the most powerful things about AI is that it makes it a lot easier to interact with the multimodal or unstructured data that we have.

13:13 And so, you know, if you're thinking about, you know, images, videos, audio forms, 3D point clouds, there's a lot that's happening in, you know, generation in things like autonomous vehicles.

13:25 But even for traditional enterprises, right, you have just oodles of, like, PDF documents or slide decks.

13:33 And there's lots of use cases for a tool that can help those users extract insights and then possibly train models on top of that kind of data.

13:45 And so if you look at data by volume, if you look at your average, like, TPC-H data table, it's like what, like 150 bytes or something like that per row.

13:55 And then embeddings, you know, if you just look at the previous generation of OpenAI, it's like, you know, 25 times that.

14:02 And then if you look at images and videos, you're quickly getting to these tables where even if you have the same number of rows, the data is just huge.

14:09 Right. It's not username, address and a few other things are still simple.

14:15 Yeah.

14:15 When we talk to a lot of our customers who are trying to build this new, what they call, unstructured data lake, they are already coming in with the expectation:

14:25 Okay, in terms of data volume, this is going to trounce the previous generations of data lakes.

14:33 We had big data before, but now we've got big data in another level, right?

14:37 Yeah, this is even bigger.

14:39 And it's not only that each row is bigger, but, you know, previous generations of tooling, you have, you know, like users or humans manually generating one data point at a time.

14:49 And now, you know, AI is generating its own training data at thousands of tokens per second, right?

14:55 You know, when you're doing completions, that completion itself becomes training data.

15:00 And so the rate at which the data is increasing is also growing very, very rapidly.

15:06 Yeah, if anyone wants to get a sense for just how intensely the world is focused on this, they just reopened Three Mile Island, the nuclear power plant, purely to plug in a single Azure data center for AI.

15:19 So, yeah, I think we're putting a lot of energy into generating data.

15:22 What would you say?

15:23 Well, I'm glad this data center, even if something goes wrong, the consequences will be much less dire.

15:29 Yes.

15:29 Honestly, I think nuclear is worth considering, rather than coal or whatever.

15:35 But still, that's a whole different discussion.

15:37 We don't need to go down that hole right now.

15:38 Maybe maybe later at the end.

15:40 Who knows?

15:40 You can turn this show into talk politics instead.

15:43 No, no, no, no.

15:44 Please.

15:45 No.

15:45 It's too early.

15:48 This portion of Talk Python To Me is brought to you by Sentry.

15:51 Code breaks.

15:52 It's a fact of life.

15:53 With Sentry, you can fix it faster.

15:55 As I've told you all before, we use Sentry on many of our apps and APIs here at Talk Python.

16:01 I recently used Sentry to help me track down one of the weirdest bugs I've run into in a long time.

16:07 Here's what happened.

16:08 When signing up for our mailing list, it would crash under a non-common execution path, like

16:14 situations where someone was already subscribed or entered an invalid email address or something

16:19 like this.

16:20 The bizarre part was that our logging of that unusual condition itself was crashing.

16:26 How is it possible for our log to crash?

16:29 It's basically a glorified print statement.

16:32 Well, Sentry to the rescue.

16:33 I'm looking at the crash report right now, and I see way more information than you'd expect

16:38 to find in any log statement.

16:40 And because it's production, debuggers are out of the question.

16:43 I see the traceback, of course, but also the browser version, client OS, server OS, server

16:49 OS version, whether it's production or QA, the email and name of the person signing up.

16:54 That's the person who actually experienced the crash.

16:56 Dictionaries of data on the call stack and so much more.

16:59 What was the problem?

17:01 I initialized the logger with the string info for the level rather than the enumeration.info,

17:08 which was an integer-based enum.

17:10 So the logging statement would crash, saying that I could not use less than or equal to between

17:15 strings and ints.

17:17 Crazy town.

17:18 But with Sentry, I captured it, fixed it, and I even helped the user who experienced that crash.

17:24 Don't fly blind.

17:26 Fix code faster with Sentry.

17:27 Create your Sentry account now at talkpython.fm/sentry.

17:32 And if you sign up with the code TALKPYTHON, all capital, no spaces, it's good for two free

17:38 months of Sentry's business plan, which will give you up to 20 times as many monthly events

17:42 as well as other features.

17:45 Let's talk about something, maybe dissect, kind of like what the H2 or the one-sentence elevator pitch here for LanceDB is.

17:53 So I'll read it out and we'll take it apart and you make sense of it for all of us, okay?

17:57 LanceDB is a developer-friendly, open-source database for AI.

18:01 From hyper-scalable vector search and advanced retrieval for RAG to streaming data,

18:07 interactively exploring large-scale AI sets.

18:10 Best foundation for your AI application.

18:13 Yeah, so open-source for starters.

18:16 Go over here to GitHub, you can see just under 10,000 stars, which is, you know, congratulations.

18:20 That's excellent.

18:22 Actually, across the different bits, but yeah, so really cool.

18:27 And it's under the Apache 2 license.

18:29 That's pretty open for people to do most things what they want, right?

18:33 Maybe not build a closed-source business on top of, but other than that, right?

18:37 The way we're thinking about it is there's a base layer of how do you work with AI?

18:43 And we've made Lance format, which is open-source and it's a columnar format that's optimized for AI.

18:51 And on top of that, we're building the different workloads that allow our users and customers to search, retrieve,

19:02 run analytical queries, run training workloads across the board for all of their AI needs across the enterprise.

19:12 So the repo that you're looking at, which is LanceDB open-source, that's one of the tools that we've built on top of the Lance format.

19:21 And the idea is when you, super common workload is you want to start experimenting with building RAG or agents with memory,

19:31 semantic search, AI-enabled semantic search, like image, text image and video search and things like that.

19:38 LanceDB is sort of the easiest way for you to get started and prototype.

19:43 So like you said in the beginning, the mental model would be something like equivalent to like a SQLite or DuckDB, right?

19:50 So the data is just a file.

19:52 There's no service to manage.

19:55 There's nothing to sort of connect to.

19:58 It's totally open.

19:59 So you can look at it with other tools that you have.

20:02 And then the actual database runs embedded.

20:04 I love this idea.

20:05 I think with things like SQLite and LanceDB, I just have a file, and there's a really smart engine on top of that file.

20:15 Instead of a complex server architecture to run and manage and security and firewalls and all of that, migrations, all these things.

20:25 It's really pretty positive, I think, to just say, you can just have a file.

20:29 When your app is running, it's part of your app.

20:31 And when your app shuts down, it saves the data and it shuts down with it, right?

20:36 A couple of interesting things about the, you know, it's just a file is we took that a little bit further than that.

20:43 One is I think that we wanted the file to be open and so that you can inspect it, you can work with it using your other tools, right?

20:51 So when you're building that AI application, one is if you, you have more than just the, that search and the vector lookup.

20:59 There's, you know, you might want to run SQL to look at metadata.

21:02 You might have tensors that, that will upload to PyTorch training workloads and things like that.

21:08 And then we wanted that layer.

21:10 So we wanted that file to be accessible by all your familiar tools, right?

21:15 From like pandas and polars, you know, new data frame engines like Daft, for example, and even distributed engines like Spark and Ray.

21:22 That was one big thing that we focused on with, with Lance Format is making it plug in into the ecosystem.

21:30 For folks who are coming with a Python background and a data background, working with AI would feel familiar.

21:37 Right. A lot of people know how to do SQL to ask questions about their data, even if they're not programmers.

21:42 Maybe a business analyst is like, well, I can write a query, maybe connect that result to some graph or whatever, right?

21:49 This was the primary reason why we chose Apache Arrow as the standard interface and the standard type system.

21:56 It's easy to plug in all these other tools on top.

21:59 So good question.

22:00 Now, from the audience out there, Carol asks: how's the integration with Narwhals?

22:04 I am not familiar enough with Narwhals.

22:08 So Narwhals is a layer that will let you work with multiple data frame libraries.

22:14 So if you wanted to work with say pandas and polars, you can use this and it will adapt depending on what the original data source is.

22:23 It'll adapt everything to a subset of polars commands.

22:28 So maybe the question is, how does it work with polars or pandas, right?

22:33 So maybe if it does with one or the other, then there might be a way to connect it.

22:37 The core interface input output is arrow.

22:39 And then with LanceDB, there's on the output end, you can convert results to pandas data frames or polars data frames.

22:47 And then on the input end to, I think natively, we can take pandas data frames as batches of input, but we convert that to arrow tables.

22:57 And so for others, you know, it's like polars, for example, it's really easy to convert polars to like an arrow table.

23:02 So a lot of that, even if it's not already automatic, it's like a .to_arrow() call, one command away.

23:08 Yeah, Carol added, given it works with arrow, it probably just works.

23:11 Yeah.

23:11 When we talk about a file, we don't just mean your local file system, but also object store.

23:17 So if you throw the data set on S3, you can just query it directly.

23:22 Now, you know, you'll hit additional latency because object store.

23:26 But it works.

23:27 So a lot of our users see that as a bridge into production so that, you know, they use LanceDB open source to do the prototyping MVP.

23:38 And then, you know, they, if their use cases fit before they need like a distributed system, it's easy for them to just say, hey, the embedded library lives in my application code.

23:50 And then I have a shared data layer in S3 or maybe an EFS or something like that.

23:55 And you can run distributed queries very easily still without having to manage your own additional systems.

24:02 Okay.

24:02 Just use really scalable storage layer like S3 or probably anything that has an S3 compatible API, you know, like DigitalOcean.

24:12 Yeah.

24:13 Anything that's like a POSIX compliant interface or an S3 compliant API.

24:18 So even MinIO or something like that, right?

24:21 Like one of your, yeah.

24:22 Yeah.

24:23 I think MinIO published their own integration with LanceDB as an example as well.

24:28 Oh, did they?

24:28 Yeah.

24:29 That's cool.

24:29 Maybe tell people real quick what MinIO is.

24:33 It sounds like you're familiar with it or I can.

24:35 Yeah, go ahead.

24:35 So MinIO is a self-hosted high-end sort of pretty complete version of what you would think of as S3.

24:44 But if you don't want to use S3, you want to host your own, on your own infrastructure with your own storage layer and things like that.

24:50 But you still want to talk S3 APIs and security and all that stuff.

24:53 Yeah, that's what MinIO is.

24:54 It's just a file really, really far.

24:57 And I think the results end up speaking for themselves.

25:01 We're seeing really good results.

25:02 And I think it really matches with the way that a lot of our users think about existing tooling and how to add AI to their existing business and applications.

25:12 We might be getting just a little bit ahead of ourselves, but we'll backtrack and talk APIs and how it works and stuff.

25:18 But what's the, you know, if it's just a file, what you said is obviously like a pretty broad statement by that, the way you all have implemented it.

25:26 But what's the go to production story?

25:29 Absolutely.

25:30 If you're using SQLite and that's your mental model, you're like, well, you probably should just switch to Postgres.

25:34 Right.

25:34 But it doesn't sound like that's your answer.

25:36 It sounds like you've thought it through on how to run it at larger scale.

25:39 In terms of large scale production, there's a number of different workloads for both online and offline.

25:46 So this is our commercial offering.

25:48 I'm just calling it LanceDB Enterprise.

25:50 And essentially, it's one distributed system that gives you huge scale, low latency, and high throughput, a system that can be backed by the same Lance data.

26:03 It's just a file in S3.

26:06 So the trick is to create that distributed system in a way that allows you to enjoy the cost efficiency of ObjectStore and the scalability of it while being very, very fast and very performant.

26:21 And so that's the production story for our customers who are putting AI applications into production where they need high throughput or they need very large index sizes or just to manage lots of vectors and query a small portion of it without the same sort of budget constraints as you would have with OpenSearch or other tools that don't give you that compute-storage separation.

26:49 So it says it has GPU support for building vector indexes, which is pretty awesome.

26:55 So databases, indexes are, they're like the magic speed dust you can sprinkle on them, obviously.

27:01 What's the story with indexing and this GPU support?

27:04 With traditional database indices, it's not very computationally intensive.

27:10 Unfortunately for vector indices and ANN indices, they're quite so.

27:14 So I think with Rust, you can make CPU based indexing for vector data pretty efficient.

27:21 So if you're looking at, you know, hundreds of thousands or even up to like maybe, you know, 30 or 50 million vectors, CPU indexing is pretty good and can get pretty fast.

27:32 But for our customers that have 15 billion vectors in one table that they need to index one index, that's going to take days.

27:42 All right.

27:42 So with GPU support, you know, we cut that down by more than like 15, 20 X.

27:47 So then it becomes like a, something that they can do repeatedly and have a acceptable feedback loop in terms of, you know, maybe adding new data, refreshing that, that index.

27:59 What is also what I love about the composable data ecosystem.

28:04 And we talk about that, but there's also an extension, which I'd like to add is, is like that composable AI data ecosystem.

28:11 So we talked a little bit about the benefits of like having arrow, which is just, it means if we make the input output APIs arrow, it just works with the rest of the ecosystem.

28:20 This is one of the points where we get additional benefits, where with GPU support, it's easy through the arrow interface to actually talk to PyTorch.

28:29 So we can use a lot of the GPU tooling in PyTorch to build a lot of our accelerated vector indexing tools.

28:37 So that way it saves us lots of time messing with CUDA interfaces and, and, and things like that, that are not quite mature in the Rust ecosystem yet.

28:48 It's something you don't have to build.

28:49 It just gets better on its own and you just get to upgrade it.

28:52 Right.

28:52 Exactly.

28:53 That's part of the magic of package managers and package repositories and stuff.

28:58 It's really cool.

28:59 So what is the workflow?

29:01 Like, take me through the journey that somebody might go on.

29:05 They have a bunch of data, they want to put it into the database and they want to put the resulting thing into production so that they could use it behind an API or something that they're implementing for their app or something like that.

29:17 Is there a big training block of time that you do?

29:21 And then you move the resulting data somewhere?

29:23 Or do you just, you put it in production and start adding to it over and just keep adding to it over time?

29:29 What's the workflow?

29:30 With LanceDB Enterprise, like the system that's running, you can just keep adding data to it.

29:34 The indexing and all of that is automatic once you've configured it properly, which is just, you know, here's the schema and like create indices on these columns, right?

29:43 Much like what you would do with a, like a Postgres table.

29:47 Once you have that set up and as you add data, the indexing is automatic.

29:51 And then for small amounts of data, our users typically will say like have Python dictionaries, JSON, or, you know, Pandas data frames.

30:00 They can add those to the database through the API for really large scale data.

30:05 So if you're working with terabytes or even petabytes of data, you don't want to be shoving that through the API like a thousand rows at a time.

30:13 So this is where the open data layer comes in.

30:15 So if you have a large data set and you have, whether it's Spark or Ray, you can use those large distributed systems to write data directly to S3 in the Lance open source format.

30:27 And then the system actually picks it up from object store and takes care of the indexing and compaction and all of that.

30:35 That's sort of like when we're talking about adding data to it, it's, it's, it's both.

30:40 This portion of talk Python to me is brought to you by Bluehost.

30:43 Got ideas, but no idea how to build a website.

30:47 Get Bluehost with their AI design tool.

30:49 You can quickly generate a high quality, fast loading WordPress site instantly.

30:54 Once you've nailed the look, just hit enter and your site goes live.

30:58 It's really that simple.

30:59 And it doesn't matter whether you're a hobbyist entrepreneur or just starting your side hustle.

31:03 Bluehost has you covered with built-in marketing and e-commerce tools to help you grow and scale your website.

31:10 For the long haul.

31:11 Since you're listening to my show, you probably know Python, but sometimes it's better to focus on what you're creating rather than a custom built website and add another month till you launch your idea.

31:20 When you upgrade to Bluehost Cloud, you get 100% uptime and 24/7 support to ensure your site stays online through heavy traffic.

31:29 Bluehost really makes building your dream website easier than ever.

31:33 So what's stopping you?

31:34 You've already got the vision.

31:35 Make it real.

31:36 Visit talkpython.fm/bluehost right now and get started today.

31:41 And thank you to Bluehost for supporting the show.

31:44 So I'm talking to you on a Mac mini M2 Pro with maxed out RAM and four terabytes of space or something.

31:51 Can I do interesting stuff on my computer?

31:54 Is it too small?

31:56 It's not too small at all.

31:57 One of the things about Lance format is it's all disk, right?

32:01 So most of the magic that we talked about with LanceDB, at least open source, lives in that open source format layer.

32:09 And that format is three things in one.

32:11 One, there's the columnar format, the file format, and then there's a table format, and then there's the indexing.

32:18 So the indexing is what speeds up the NN vector lookups.

32:22 Typically, we've made the index disk-based so that if you're looking up a few partitions in the index, it's only loading those partitions.

32:33 And so the RAM requirement is actually quite small.

32:37 And so on your systems, probably the more limiting factor is going to be like the disk size rather than how much memory you have.
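
The idea of a disk-based, partitioned index can be illustrated with a toy example. This is not Lance's actual index format or code; it is a standard-library sketch, with made-up function names, showing why only the probed partitions need to be read into memory.

```python
import json
import math
import tempfile
from pathlib import Path


def build_partitions(vectors, centroids, directory):
    """Assign each vector to its nearest centroid and write one file per partition."""
    parts = {i: [] for i in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))
        parts[nearest].append([vid, vec])
    for i, rows in parts.items():
        (directory / f"part-{i}.json").write_text(json.dumps(rows))


def search(query, centroids, directory, nprobes=1, k=1):
    """Read only the nprobes closest partitions from disk, then rank candidates."""
    order = sorted(range(len(centroids)), key=lambda i: math.dist(query, centroids[i]))
    loaded, candidates = [], []
    for i in order[:nprobes]:
        candidates.extend(json.loads((directory / f"part-{i}.json").read_text()))
        loaded.append(i)
    candidates.sort(key=lambda row: math.dist(query, row[1]))
    return [vid for vid, _ in candidates[:k]], loaded


index_dir = Path(tempfile.mkdtemp())
vectors = [[0.0, 0.0], [0.1, 0.2], [10.0, 10.0], [9.8, 10.1]]
centroids = [[0.0, 0.0], [10.0, 10.0]]
build_partitions(vectors, centroids, index_dir)

ids, loaded = search([9.9, 10.0], centroids, index_dir)
print(ids, loaded)
```

Memory use is bounded by the partitions that a query touches, which is why disk size rather than RAM tends to be the limit.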

32:45 A couple of terabytes.

32:46 I should be able to put some data on that.

32:47 Yeah, exactly.

32:48 A couple of free terabytes, that is.

32:49 All right.

32:51 Let's talk about the architecture a little bit, maybe a look inside.

32:55 Like give us a, you've talked about some of the pieces and formats, but give us a sense here.

33:00 This is sort of an architecture that includes more than just Lance, but it's more about, you know, how our customers are thinking about building their own multimodal data lake.

33:12 So we call this like vector data lake or unstructured data lake, and there's no standardized terminology yet, but this is what I was mentioning before.

33:20 Nice.

33:21 Yeah, I'll just add for the listeners, I'll put a link to a diagram that talks about this a little bit.

33:26 People can check out if they want.

33:27 So the idea is from these companies perspective, they now have, on the one hand, they have lots of data that they can now take advantage of because of new AI models and applications.

33:39 And on the other hand is there's lots of business units across that enterprise that wants to experiment with AI or add AI applications, whether it's agents or, you know, internal tooling or customer success, you know, video and image search.

33:57 And lots of different use cases where there's a lot of like, hey, can you sprinkle some AI magic on my existing business and maybe look for that 10x in terms of ROI.

34:10 You see, we have all this SharePoint data.

34:12 Can you help me out?

34:13 That's literally a conversation.

34:16 I'm sure that it is because SharePoint is bad and you don't want anything to do with it, but people keep putting stuff into it.

34:21 I don't know.

34:22 It's not just about the AI capabilities, but like, how do you build this base layer and make it a lot easier to access for the rest of the folks in the company?

34:33 And most of these companies aren't starting from scratch.

34:36 They already have made significant investments into the data science and analytical tooling, training for their data scientists, analysts.

34:44 And they also made significant investments into their existing data lake for large scale data processing, right?

34:51 So you don't want to make them have to throw all that away.

34:55 And instead, that open data layer is super important to plug into the rest of their existing ecosystem.

35:02 And with Lance, unlike Parquet or JSON or WebDataset, all of their multimodal data can live in one place so that they can do search and retrieval on the vectors.

35:13 They can run SQL.

35:15 They can do training and data processing workloads all sort of in one engine and one piece of data.

35:22 So it just makes things a lot simpler.

35:24 It allows us to add a lot more performance optimizations.

35:29 And then it saves, because of the size issue, it also saves these enterprises a lot in terms of cost so that they don't have to keep making different copies of different parts of the data for different parts of that workload.

35:42 Right.

35:42 This thing needs that format of the data.

35:44 That thing needs that format of the data.

35:46 Right.

35:46 Exactly.

35:47 Two examples that really stuck out to me, speaking with one user recently, and they're doing EDA in DuckDB.

35:56 So they have lots of metadata.

35:57 They're doing EDA in DuckDB.

35:58 Quick acronym, please.

36:00 What's EDA?

36:00 Exploratory data analysis.

36:02 Okay, got it.

36:03 They're running SQL in DuckDB, but then they need to fetch, you know, there's like 10 columns of images and like 10 columns of multi-vectors.

36:13 And so when they, at the end of where they finish running that join and filter, it takes just ages to fetch that data.

36:23 That's because of the lack of random access support across the board.

36:28 So then you have to think about other tooling, either a distributed engine or somehow just loading all of your data in memory.

36:37 And that's also not tenable.

36:38 Whereas with Lance format, because you can do random access very quickly, you can actually just quickly move on to the next step in the same Python application.

36:48 And because it works with DuckDB, you can get the row IDs out, fetch the rows, and then quickly move on to downstream processing, whether it's for like train a UMAP or, you know, something like that.

37:00 So that is just one small example where, you know, Lance makes it so that new workloads just work with existing tooling.

37:10 Yeah, that interop layer is really nice.

37:12 And another example is around, and again, this is like, you know, the size changes everything, where we spoke with folks that are like, you know, we manage a bunch of image and video data in, let's say, WebDataset.

37:26 And we want to iterate on our model, and to create new features means we have to download terabytes of WebDataset files, which are basically tarballs.

37:36 Open up the tarballs, like compute the feature from that data, put it back in the tarball and then upload it.

37:44 So to access one small piece and write a small piece, you often are downloading or transferring like 100x that because of the large multimodal data.

37:55 Whereas with something like Lance, instead of using WebDataset, the table format features of Lance allow you to just keep that data in place, just add a new column, which writes a new file.

38:07 And then you can, the readers can just see an updated version of that with new schema and you can move forward, you know, not having to pay that transfer cost.

38:17 And it's not just the infrastructure cost, but also in the time that your engineers are spending on that, managing that process and just running across maybe petabytes of that data.

38:29 That's a crazy amount of data.

38:30 Yeah.

38:30 Let's talk through what it, what it's like to write a little bit of code, you know, connecting to the database, doing some queries or adding some data.

38:36 You guys have a quick start on here.

38:37 Maybe you could just talk us through real quickly on how to get started in Python with it.

38:41 Overall, the goal for the quick start is you have a package that you install in seconds.

38:47 And then depending on how quickly you're running through the quick start that, you know, between like a minute to five minutes, you've got a quote unquote vector database running, embedded vector database running,

38:59 where you can start building applications on top, right?

39:02 So the first step is very familiar.

39:04 You just run pip install lancedb and that pulls in, you know, Lance format and installs our Rust packages.

39:12 And then you can start connecting to it, right?

39:15 So the default is you just give it a local file path and you say lancedb.connect.

39:21 And then that gives you a database connection, of which there are two flavors, a synchronous client and an asynchronous client.

39:30 I love it.

39:30 So the async is basically just connect_async.

39:34 And roughly the interfaces are roughly equivalent.

39:38 There's some disparities which we're addressing now, but the main difference is you just sprinkle some await in places.

39:45 The async and await keywords make concurrent programming so much easier than threads and callbacks and all that business.

39:52 For sure.

39:53 Yeah.

39:53 So the connect bit is if people are familiar with SQLite, you just here's a URL to a file.

39:59 That's pretty much you give it that.

40:01 And then you have what's your naming convention on the async API?

40:06 Are all the functions ending in underscore async?

40:09 Or do you create like an async client?

40:11 Or how do you differentiate?

40:12 That's a good question.

40:13 Yeah.

40:13 So we create an async client.

40:15 The method names are the same between the two.

40:18 It's really just that initial connect call that's different.

40:22 I see.

40:22 But that's kind of the factory method for the client, either sync or async that comes back.

40:27 And then you just, you know what you got.

40:29 In the second example, we see that in action.

40:31 First step is let's create a table and let's add some simple data to it.

40:36 Let's say we have two fields, item and a price, right?

40:41 And then an embedding vector here for the example, we're just going to use two digits for that vector.

40:46 In practice, of course, it's going to be like 1,500 or, you know, 3,000 or something like that.

40:52 That's hard to print on your code.

40:55 Right, right.

40:56 Yeah.

40:56 So here we just have, you know, essentially digits of pi here.

40:59 And then you can see whether you have the async connection or the sync connection.

41:05 In both, you call create_table.

41:07 You add it.

41:09 You give it the table name.

41:10 And then optionally, you can give it data to initialize it.

41:15 So if you have data initially, the schema is inferred from the data that you provide it when you call create table.

41:24 If you go to the next example, you can also initialize an empty table with just a schema.

41:33 So here in this quick start, you can have an Arrow schema.

41:37 Basically, this is the same schema as the data before, but you create an empty table, and then you can add data to it later.

41:43 I think I should update this.

41:45 But I think what's more convenient is that, as you see in the box below, we've added Pydantic support.

41:52 Beautiful.

41:53 There's a translation layer between Pydantic and Arrow schemas.

41:56 And so I think for a lot of our Python users, it's much easier to think in terms of Pydantic objects as the data model rather than manually dealing with the PyArrow API to create a schema.

42:08 We saw lots of issues where users are like, well, how do I create a fixed size list?

42:17 What is a fixed size list?

42:19 And why does my vector have to be that?

42:24 Should I do float 32 or 64?

42:27 There's lots of stuff that is just much easier to think of in terms of Python types and Python objects.

42:33 I love the Python, the pydantic integration there.

42:36 That's super cool.

42:37 Well, the data layer used for my course's website and the podcast website and stuff is all based on Beanie and Mongo, which is async.

42:45 Basically, you're writing async queries against pydantic models, which is, it's a real, got the validation, but you're not writing directly to the database and just random dictionaries.

42:55 And who knows if it stays consistent.

42:58 So speaking of which, you have a schema here.

43:00 Is this, how hard, how strictly is this enforced, right?

43:05 Is this Postgres level or is this MongoDB level?

43:08 Like, probably should be this.

43:10 Right.

43:11 This is fairly strictly enforced.

43:13 So think, think like Arrow tables and, you know, writing to Arrow tables.

43:18 There's a little bit of like give around like nullability.

43:21 Because a lot of times if you're providing data, the types can be inferred, which is, it's easier to do casting.

43:28 And then if you're inferring like nullable versus non-nullable, that can get you in trouble when you're inserting data.

43:37 So for example, if you have a non-nullable column that you've declared in schema, but new data coming in, sometimes that translation layer into Arrow just automatically turns it into a nullable, even though you didn't give it any nulls in the data.

43:53 And then when you insert it, it'll produce an error.

43:56 Do you respect things like optional typing in Pydantic models to control nullability, like optional float versus float being not nullable?

44:04 Yep.

44:05 So we do a bunch of that as well.

44:06 So in terms of Pydantic integration, you know, it's another Rust project.

44:11 I know we've loved working with it for a long time.

44:14 If any listeners out there are interested in just messing with Pydantic and Arrow, so that translation layer, we'd love to get some help as well.

44:24 So for some of the more complicated nested types out there, like lists of lists or lists of fixed-width lists and dictionaries and that kind of thing, the translation layer is incomplete.

44:38 But we know a couple of other companies and tools in the ecosystem who've also built their own kind of translation layer.

44:46 So it would be really interesting, I think, if there was a community member who can lead, say like, let's create a, let's get a bunch of projects together and let's create a standardized translation layer.

44:58 Not everyone just doing their own copy, right?

45:00 Right, right.

45:01 And then maybe like either Pydantic or maybe Arrow, one of the two projects can own that translation layer.

45:07 I think that would be really great for the, for the ecosystem.

45:11 That sounds like a fantastic idea.

45:12 And let's maybe wrap up our little code sample here with talking about querying the database or doing a search because it's fun.

45:20 I know it is fun to put data into a database and define schemas, but the actual purpose is to ask questions, right?

45:26 That's right.

45:27 We wanted to make it so that it's really familiar for people who've worked with databases and data frame engines.

45:34 So the main workload for LanceDB open source is that search API.

45:38 So you can say table.search, you can pass in the vector, the query vector, and then you can call .limit to say how many results we want.

45:50 And then a .to_ blah, blah, blah to determine what format you want the results back in.

45:57 So you have to_pandas here in the example, but you can convert it to Polars or you can just get it back as a list.

46:04 Awesome.

46:04 Yeah, that's really cool.

46:06 I love it.

46:06 Oh, and also a couple of other things.

46:08 So here you can also do, if you have a Pydantic model that you use as the table schema, you can also say to_pydantic and pass in the model and it'll automatically return a list of Pydantic objects back to you.

46:21 So this is particularly useful for like multi-step, like agent workflows where you want that structured data.

46:28 It's easy to connect it to the rest of your.

46:30 Right.

46:31 You would just maybe want to take your Pydantic model and say, turn that into JSON and hand it over to the next agent or something like that.

46:37 If we sort of move forward in that example a little bit.

46:40 Yeah.

46:40 In addition to setting up the schema, there's also the embedding API that's really interesting so that when you create the schema.

46:49 So here's an example where we can create the schema using Pydantic.

46:54 So in this block, I've declared the class Words, which is a LanceModel.

47:00 And the LanceModel is just a Pydantic BaseModel that knows how to convert itself to Arrow.

47:06 And then I have two fields in here, text and vector.

47:09 And what we have in LanceDB is an embedding registry where, here, I declared an embedding function called func.

47:19 And basically I call get_registry().get("openai").create().

47:24 And I give it the name of the embedding here.

47:26 We're using ada-002, but new models have been released since.

47:30 And I can use the Pydantic annotations to say, hey, the text field is the source field for that function.

47:36 So it's text: str = func.SourceField().

47:39 And then the function itself knows how many dimensions it is.

47:43 So I don't need to think about like how many dimensions this, the vector, the embedding has.

47:48 Once I've declared the schema, I can call a familiar .create_table workflow.

47:53 And then I can call table.add to add data to it.

47:56 And here, because I've declared the embedding function in the schema, I don't actually have to generate the embeddings myself.

48:04 So in this example, I'm only passing in the text field as input and LanceDB.

48:10 Oh, that's really cool.

48:11 Yeah, just takes care of calling the OpenAI API on your behalf and then adding the vectors before, you know, adding the whole batch to the table.

48:21 So OpenAI was our first one, but there's dozens of compatible embedding models.

48:27 So pretty much anything you can pull off Hugging Face.

48:30 And then there's lots of other vendors like Cohere, for example, and if you were running open source like Ollama, there's also an integration for that.

48:39 Nice. You can use some of the on machine ones potentially as well.

48:43 Exactly. Yeah.

48:43 So a lot of the ones you can pull off Hugging Face can just run locally and it exposes the options.

48:51 So if you do have your Mac mini or even your MacBook laptop, there are options.

48:57 There are lots of Hugging Face models where you can specify MPS to actually make it run a little bit faster.

49:02 Oh, nice. That uses the neural processing units or something instead of CPU or GPU.

49:07 OK. Yeah.

49:07 OK, that's cool. I didn't know that.

49:09 Very nice.

49:09 And it looks like, I did want to ask you, sort of what's the integration with things like OpenAI or Gemini or other things.

49:18 It looks like a lot of LanceDB is kind of for you building your app that is self-contained.

49:23 But also, you know, here's integration with some of the OpenAI API stuff.

49:27 What's how much do you depend on using other external AI systems versus just your own?

49:33 Right. So I think typically the embedding model is you either bring your own run locally or you call a third party API.

49:43 And then the actual completion, that's outside of the scope of LanceDB.

49:47 So the prompt engineering and the calling the completion model once you have the context retrieved, that's not part of the LanceDB API.

49:57 That said, we integrate with, let's say, LangChain or LlamaIndex.

50:02 So if you're comfortable with that layer of sort of AI orchestration or RAG orchestration, you can use their APIs and plug LanceDB into that.

50:13 So that that takes care of a lot of that, the other parts of the workflow.

50:18 And then if you're in the, let's say, like AWS ecosystem and you're familiar with, you know, Bedrock and things like that, there are a few AWS folks who've built a complete serverless RAG stack where they're using the Bedrock APIs for embedding creation and then for completion.

50:37 But then they have Lance data that's sitting on S3 and they're running LanceDB open source in an AWS Lambda function so that you don't have to manage any servers and can just be calling serverless functions to build your RAG application.

50:55 Sounds super cool.

50:56 All right, we're getting short, short on time.

50:58 Let's wrap it up with a couple of things here.

51:00 First of all, we have the open source LanceDB.

51:04 You talked about the enterprise stuff as well.

51:07 Let's just talk business a little, really quick.

51:09 I think it's interesting.

51:10 Always interesting to see companies that have maybe open core or some kind of open source foundation and how you guys are making this work for both contributing to open source, but also eating.

51:20 What's the story here?

51:23 I think it's always a very complicated topic.

51:26 I think the way that we think about it is your journey building AI.

51:31 When you're just starting out and you're doing that prototype and that MVP, a lot of times you don't even know the value of the thing that you want to build or whether it works.

51:42 So why would you want to go through the hassle of managing complicated infrastructure or pay a third party vendor to just have some small amounts of data, right?

51:53 So we wanted to make LanceDB open source super easy for you to just get started and also just bridge you into production in that early stage.

52:04 The cloud and enterprise offerings are essentially for when you go into production and you want to have a super scalable and highly performant vector database where you still don't have to worry about the infrastructure, but you have lots more challenging systems needs.

52:20 LanceDB Cloud is a hosted serverless offering.

52:25 So that's for, you know, small teams that have production needs, but maybe don't have, you know, billions of vectors or, you know, really challenging security requirements and things like that.

52:37 And LanceDB Enterprise is where we really have, okay, I need like a thousand to 10,000 queries per second.

52:44 I have just a ridiculous amount of data to catch myself before I curse and, on-prem as well.

52:52 Maybe.

52:52 Yeah.

52:52 I need my data to live on-prem.

52:55 And so the enterprise comes in, in two different packages.

52:59 One is a managed offering where you can bring your own bucket and then we just run the compute for you and give you a private link or it's, it can be fully, we call it BYOC.

53:10 So it runs within the customer account.

53:12 So nothing leaves the premises except for basic telemetry data.

53:16 Awesome.

53:16 All right.

53:17 And then, you know, quickly close it out with, I guess, what's next?

53:22 Like where are things going here?

53:24 I see LanceDB Cloud is in private beta early access mode.

53:28 You know, what are you, where are y'all going?

53:30 There's a couple of really exciting, first on the open source, you know, our vision for, for Lance is to make it the,

53:40 standard for working with AI, right?

53:42 So I think we already have lots of folks who are depending on Lance with image data, video data, large scale training, and we'll continue to, to make that better.

53:54 And then add additional encoding.

53:56 So for folks that not just have those, but, you know, compressing text data and the metadata that they have to make it as efficient as possible.

54:05 And I think there we've, you know, between Lance and LanceDB, I think we just broke through 2.2 million monthly downloads.

54:14 So we're really excited about that and, you know, that community uptake as, as well.

54:19 And, you know, we're getting lots of community collaborators and contributors, and we're looking to grow that community.

54:27 So on the open source side, it's, you know, better APIs, more automation around the data management, like compaction, and then just better encodings for the non-large blob types, right?

54:40 So we're looking to get smaller strings, numerical data, and things like that.

54:43 And then for LanceDB Enterprise, a lot of times, you have to look at the whole search engine, right?

54:51 So right now, our customers still have to do the chunking and embedding themselves, but we're basically looking to make it much easier for those types of workloads, not just in terms of productivity, but to also save them costs as well in terms of embeddings.

55:06 So a lot of times what we find is our customers will update a table with a new version that's like 80% the same.

55:13 So if we have the right APIs and manage the embedding calls for them, we can essentially save 80% of that cost in terms of embedding APIs without a complicated sort of query cache or embedding cache that adds more complexity.
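
The saving described here can be illustrated with a minimal sketch. It is not LanceDB's API or the team's actual design; every name in it (`content_key`, `embed_new_version`, `fake_embed`) is hypothetical, and the embedding function is a stand-in for a paid embedding API. The idea is only that rows whose content is unchanged between table versions keep their existing vectors, so the embedding call runs only for new or edited rows.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def content_key(text: str) -> str:
    """Stable key for a row's content (hypothetical helper)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_new_version(
    old_rows: Dict[str, Vector],
    new_texts: List[str],
    embed_fn: Callable[[str], Vector],
) -> Tuple[Dict[str, Vector], int]:
    """Build the next table version, embedding only rows that changed.

    old_rows maps content key -> embedding from the previous version.
    Returns the new mapping and the number of embedding calls made.
    """
    new_rows: Dict[str, Vector] = {}
    calls = 0
    for text in new_texts:
        key = content_key(text)
        if key in new_rows:
            continue  # duplicate text within this version
        if key in old_rows:
            new_rows[key] = old_rows[key]  # unchanged row: reuse the vector
        else:
            new_rows[key] = embed_fn(text)  # new or edited row: pay for a call
            calls += 1
    return new_rows, calls


def fake_embed(text: str) -> Vector:
    """Stand-in for a real embedding API call (assumption for the demo)."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


# Version 1: ten documents, all embedded.
v1_texts = [f"document {i}" for i in range(10)]
v1, v1_calls = embed_new_version({}, v1_texts, fake_embed)

# Version 2: 80% the same, two documents edited.
v2_texts = v1_texts[:8] + ["document 8 (edited)", "document 9 (edited)"]
v2, v2_calls = embed_new_version(v1, v2_texts, fake_embed)

print(v1_calls, v2_calls)  # 10 2
```

With 8 of 10 rows unchanged, the second version needs 2 embedding calls instead of 10, which is the 80% saving mentioned above.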

55:30 And then just deepening that, so a more complete search engine for AI.

55:36 And then on the offline side, complete features at scale around building that training, pre-processing, and exploratory data analysis workflow for those types of customers.

55:49 Well, it looks like a really cool product and set of services and nice work.

55:53 Thank you.

55:54 Yeah, you bet.

55:54 And thanks for being here.

55:55 You know, final parting thoughts.

55:57 People want to get started with LanceDB.

55:58 What do you tell them?

55:59 Oh, come to our Discord.

56:00 I don't know if I have the link, but give it to me and I'll put it in the show notes for people later.

56:04 Yeah, sounds good.

56:05 So come to our Discord.

56:07 We're all in there and it's pretty active and we respond fairly quickly.

56:13 So there's not a lot of noise.

56:15 So it's mostly sort of practical topics, debugging and talking about new features, maybe having a little bit of fun with new examples and things like that.

56:24 Awesome.

56:25 All right.

56:25 Well, thanks for being here.

56:26 See you later.

56:27 Thank you.

56:28 Thank you.

56:28 This has been another episode of Talk Python To Me.

56:31 Thank you to our sponsors.

56:33 Be sure to check out what they're offering.

56:35 It really helps support the show.

56:36 Take some stress out of your life.

56:38 Get notified immediately about errors and performance issues in your web or mobile applications with Sentry.

56:45 Just visit talkpython.fm/sentry and get started for free.

56:49 And be sure to use the promo code talkpython, all one word.

56:53 And this episode is brought to you by Bluehost.

56:55 Do you need a website fast?

56:57 Get Bluehost.

56:58 Their AI builds your WordPress site in minutes and their built-in tools optimize your growth.

57:03 Don't wait.

57:04 Visit talkpython.fm/bluehost to get started.

57:08 Want to level up your Python?

57:09 We have one of the largest catalogs of Python video courses over at Talk Python.

57:13 Our content ranges from true beginners to deeply advanced topics like memory and async.

57:18 And best of all, there's not a subscription in sight.

57:21 Check it out for yourself at training.talkpython.fm.

57:24 Be sure to subscribe to the show.

57:26 Open your favorite podcast app and search for Python.

57:29 We should be right at the top.

57:30 You can also find the iTunes feed at /itunes, the Google Play feed at /play,

57:35 and the direct RSS feed at /rss on talkpython.fm.

57:39 We're live streaming most of our recordings these days.

57:42 If you want to be part of the show and have your comments featured on the air,

57:46 be sure to subscribe to our YouTube channel at talkpython.fm/youtube.

57:50 This is your host, Michael Kennedy.

57:52 Thanks so much for listening.

57:54 I really appreciate it.

57:55 Now get out there and write some Python code.

57:57 We'll see you next time.