
#337: Kedro for Maintainable Data Science Transcript

Recorded on Friday, Oct 1, 2021.

00:00 Have you heard of Kedro? It's a Python framework for creating reproducible, maintainable, and modular data science code.

00:06 We all know that reproducibility and related topics are important ones in the data science space.

00:10 The freedom to pop open a notebook and just start exploring is much of the magic.

00:15 Yet, that freeform style can lead to difficulties in versioning, reproducibility, collaboration, and moving to production.

00:22 Solving these challenges is the goal of Kedro.

00:24 And we have three great guests from the Kedro community here to give us the rundown.

00:28 Yetunde Dada, Waylon Walker, and Ivan Danov.

00:31 This is Talk Python to Me, episode 337, recorded October 1st, 2021.

00:37 Welcome to Talk Python to Me, a weekly podcast on Python.

00:53 This is your host, Michael Kennedy.

00:55 Follow me on Twitter where I'm @mkennedy, and keep up with the show and listen to past episodes at talkpython.fm.

01:01 And follow the show on Twitter via @talkpython.

01:04 We've started streaming most of our episodes live on YouTube.

01:08 Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows and be part of that episode.

01:16 This episode is brought to you by Tabnine, the editor plugin that enhances your autocomplete by learning how you write code.

01:23 Us over at Talk Python training with our courses.

01:26 And the transcripts are brought to you by Assembly AI.

01:29 Yetunde, Ivan and Waylon, welcome to Talk Python to Me.

01:33 Thank you so much for having us.

01:35 Yeah, thank you for inviting us.

01:36 Yes, thank you.

01:37 Yeah, it's fantastic to have you all here.

01:38 Let's start with just a little bit of your background.

01:41 Since there's three of you, maybe not too long, but how do you all get into programming, Python, into this Kedro project and so on?

01:49 Yetunde, do you want to start?

01:50 Sure.

01:50 So I'm a principal product manager on Kedro.

01:52 I've been with the project for actually exactly today, three years.

01:56 Oh, nice.

01:58 Three years and eight months.

01:59 You've been there most of the time.

02:00 That's fantastic.

02:00 Yeah.

02:01 Ivan has been there from the beginning, so he'll talk about that.

02:03 My background is mechanical engineering, and I would have been a user of Kedro, we discovered.

02:08 I think one of the coolest user interviews I've ever done was with my former team.

02:12 They picked it up on their own, which is amazing.

02:14 Oh, nice.

02:15 And now I work at QuantumBlack.

02:18 That's the primary thing.

02:20 Yeah, okay.

02:21 Ivan?

02:21 Yeah, so I'm Ivan.

02:22 I'm a tech lead for Kedro.

02:24 I've been working with QuantumBlack for the last almost five years now.

02:29 Is it?

02:30 Yeah, five years.

02:30 So you've been there from the beginning, as we hear.

02:33 From the beginning for Kedro, not from the beginning of...

02:35 Yeah, yeah, yeah.

02:35 That's what I, yeah.

02:36 QuantumBlack.

02:36 Yeah.

02:37 So initially, it was a small internal tool that was being developed by another two people on our team, Nikos and Aris.

02:46 And then we decided to turn it into a product.

02:49 And then we started from scratch and developed it into what it is.

02:53 Then we found out that things can get serious.

02:55 And then that's how we hired Yetu, because we needed a PM as well.

03:00 As a proper product.

03:01 Otherwise, my background is in software engineering, all kinds of software engineering.

03:06 Started off as web developer.

03:08 I was very keen to do some game programming before that, but I couldn't find jobs related to that.

03:13 A lot of people are interested in doing game programming until you actually get into the reality of it.

03:18 And the reality is, a lot of times it can be a grind and it's hard to get a job.

03:23 I was going to say that I'm quite lucky that I didn't end up being a game programmer.

03:28 It's not the first time I heard that.

03:29 That's right.

03:30 Yeah.

03:30 Yeah.

03:30 So then, yeah, I moved on to distributed systems and I ended up doing data and AI at QB mainly.

03:38 So I'm kind of a newbie in the data field, but having been there for five years, only kind of a newbie.

03:45 A lot of people are in that realm, right?

03:49 A lot of people are coming into the whole data science side of things.

03:52 Waylon, how about you?

03:53 My background is in mechanical engineering.

03:55 Probably around 2014, I started diving deeper into the data side of things.

04:01 I had some family medical things that came up and it kind of severely limited my ability to travel.

04:08 And slowly over time, I just kind of doubled down on the data side of things.

04:13 Yeah.

04:13 It was right around the time you started the show.

04:15 And a lot of what I've learned has either been directly from this show or from like taking ideas from the show and diving into them.

04:24 Oh, awesome.

04:25 That's really cool.

04:25 Happy to bring that to you.

04:27 I think a lot of people come from backgrounds that are not, like, traditional CS backgrounds, right?

04:33 They're kind of coming in through a side channel.

04:34 I feel like the podcast has offered a lot of connection and extra information besides just what's on like the docs page of some projects for people, which is great.

04:44 It's awesome to hear.

04:45 So now I'm a team lead for a data science team and do Python every day.

04:50 And we use Kedro pretty heavily in all of our projects.

04:55 So before we move on to the topic, give me a quick thought on comparing mechanical engineering work life experience to software development, Python developer experience.

05:06 How do you feel about where you are?

05:08 I like where I'm at.

05:09 One thing that I kind of struggled with in mechanical engineering was there was a lot of learning in college.

05:16 There was a lot of things to learn.

05:17 And then you get into industry and everything's kind of like walled off.

05:22 And the learning doesn't completely stop, but it's very hard to do.

05:26 You know, on the software side, a lot more things are open to learn.

05:31 There's a lot more resources that are just out there in the open and not like behind proprietary IP patents and all that kind of stuff.

05:39 I think that works really well for me.

05:41 I'm a learner.

05:42 I don't remember the name... is it Gallup?

05:45 The personality study kind of thing.

05:48 Yeah.

05:48 I always come out with Learner as, like, the top thing for me.

05:51 Nice.

05:51 Yeah.

05:51 So it's a good fit.

05:52 Yeah.

05:52 If you don't like to continually learn and continually kind of reinvent yourself, then software, and data science especially, is probably not a super awesome place, because of the life cycle of these things.

06:04 I mean, if you want to take it to the extreme, you could do JavaScript.

06:06 But most of us, we have like at least a year on the technology before we move on.

06:11 Yeah.

06:12 All right.

06:12 Now, speaking of that, let's go ahead and get into our main topic here.

06:17 And I want to kick this off by talking about reproducible and maintainable data science.

06:22 So a little bit like engineering, people are coming to data science from these different angles, a lot of times from computational stuff, from biology or from finance and economics or whatever.

06:34 And they don't necessarily come with a fully baked set of, oh, here's the software engineering lifecycle skill set.

06:41 Here's how I set up my continuous integration.

06:43 Here's my testing.

06:44 And like a lot of times it's like, look, I got it working.

06:46 We're kind of good.

06:46 You know what I mean?

06:47 I think one of the areas the whole data science field is working with is taking a lot of these folks who are coming from non-traditional CS backgrounds and helping them create more reproducible, reusable bits of code, notebooks, maybe even code outside notebooks.

07:02 So let's maybe open the conversation there because that's where Kedro focuses

07:07 on helping data scientists create reproducible, maintainable code.

07:11 Yetunde, do you want to maybe kick us off with some thoughts on where Kedro's philosophy is on this?

07:17 So I guess if we actually break it down, it's a Python framework that helps you

07:21 do those specific things.

07:22 And when we talk about it being a framework, it's kind of embedded with best practices

07:27 and ways of structuring how you write code so that you can get that reproducible,

07:33 maintainable, and modular data science code.

07:35 But if I go into each one of the definitions, when we say reproducible, we kind of mean that

07:39 when I rerun this pipeline or rerun this experiment, I should get the same result at the end.

07:45 So there shouldn't be any real surprises that things have really changed or things are really breaking.

07:50 When we talk about it being maintainable, we now also add in an aspect of collaboration,

07:55 even for yourself, that if you come back to this code base like three months from now or six months from now,

08:00 you should be able to know what was going on in it and be able to modify and tweak it.

08:05 And other people should be able to do that with you too.

08:07 It's not really, it shouldn't be the biggest disaster if the main code contributor has left

08:13 and you now have to try, scramble to figure out what's going on there.

08:16 And then when we talk about it being modular, this is where we encompass some of the software engineering principles

08:21 that you wouldn't ordinarily learn if you enter data science from, maybe, the mathematician space

08:27 or even from the sciences, where we think about you being able to break your code base down into small units

08:34 so that it's possible to think about things like reuse, but also it's easier to do things like testing the code base as well.

08:40 All of these things like basically amount to trying to enforce like software engineering best practice,

08:46 especially where you recognize that you might need help with that.

08:49 One of the other areas that I've seen a lot of emphasis on the project about is collaboration.

08:56 And one of the things that can be challenging is if you're a data scientist working on a notebook,

09:01 you might run some cells, maybe on data that is live.

09:05 And so the data might be slightly changing.

09:07 You check that into GitHub or source control.

09:10 You've got the results in sort of a scrambled up JSON file notebook, and then someone else has rerun it at a different time,

09:19 and then they try to check it out.

09:20 Well, you know, you end up with these conflicts and other issues.

09:23 So the sort of natural flow of, hey, let's just check everything into Git,

09:28 and then we'll just synchronize over that can sort of fall apart with some of the traditional tooling of data science.

09:33 And that's definitely true.

09:35 I think Kedro's origins actually come from large teams, of at least three or four data scientists and data engineers,

09:44 machine learning engineers collaborating on the same code base, up to, you know, 12 people having to work on the same code base.

09:50 If you're using Jupyter Notebook and trying to construct your entire pipeline in it,

09:54 I think the workflow would look kind of strange with you waiting.

09:58 Maybe Waylon has some comment on like watching this in practice of, you know, waiting for someone to finish in the Jupyter Notebook

10:04 before you could have a go at it and try your things.

10:07 During our engagement with McKinsey, we were introduced to Kedro during kind of the first iteration.

10:12 Yeah, and Waylon, by the way, we've only talked about QuantumBlack so far,

10:16 but QuantumBlack is like a subsidiary of McKinsey.

10:18 So yeah, right.

10:20 This is all sort of the same organization in a sense, right?

10:23 So that's why this is coming together, yeah?

10:25 Yeah, good point.

10:26 But for the second half, they chose, just due to familiarity... the people that we had with us,

10:31 it was their first time using Kedro.

10:33 And they're like, well, if we want to move fast on the second half, let's not try to do anything new.

10:38 And let's just do notebooks like we always had.

10:41 Right, fall back to what they know sort of thing.

10:43 Yeah, and the workflow there was definitely like three people are on a project,

10:47 two people are sitting idle while one person has the notebook checked out.

10:51 Oh my gosh.

10:52 That sounds like old school SourceSafe-type source control.

10:55 Someone's locked the files, no one else can edit it.

10:58 Like that's just, it's completely 1990s style, right?

11:01 But it's a real problem.

11:02 I mean, some of these things are, there's an attempt to address them by having collaborative notebooks,

11:07 basically Google Docs type of experiences, right?

11:10 But usually those are in somebody else's cloud, somebody else's compute cloud.

11:14 And so you're taking the trade-offs of running it over there, right?

11:17 Yeah, and there's multiple things I think you're missing out on.

11:20 When we talk about even making a robust code base, writing unit tests in a notebook,

11:25 writing doc strings in a notebook becomes a little challenging.

11:28 And then we think about all the additional tools that you have to add for collaboration.

11:32 How am I going to do pull requests and review my friend's code?

11:35 Because we know the way that when you check in a Jupyter notebook, it's often with the weird JSON thing.

11:40 So how do I do reviews of my team members' code so we can overall improve the code base?

11:45 And then some of the features within notebooks, like cached state, also

11:50 cause issues down the line, because I might be working on a version of my code

11:55 where, when I rerun my entire Jupyter notebook, not everything will run.

11:58 So we're not necessarily down on notebooks or anything, for sure.

12:02 We believe that there's a space for them really for maybe doing exploratory data analysis,

12:07 trying to work out what's going on with the data set.

12:09 There's space for it with initial pipeline development as well if you're still prototyping

12:13 and you're not sure how things are going to go.

12:15 And then even for reporting maybe, if you want like a more visual interface for reporting.

12:19 But when you talk about like code that I want to be able to run in three months,

12:24 six months, that many people will be using, it has to be in Python scripts

12:27 and it's supported when it's in the framework.

12:30 Yeah, I think that's a good point.

12:31 You can definitely start in the notebook space and then eventually move it over.

12:36 Now, one thing I do want to give a shout out to, I don't remember the name,

12:40 but you can set up a Git pre-commit hook that will strip out the metadata,

12:46 the results of your notebook.

12:48 So that's kind of a fix, but it's still, you know, it's not that amazing.
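For readers following along, here is a minimal sketch of that kind of hook. It is not necessarily the exact tool Michael has in mind (dedicated tools exist for this); it just uses jupyter nbconvert's --clear-output flag in a hypothetical pre-commit script:

```bash
#!/bin/sh
# Hypothetical .git/hooks/pre-commit: clear notebook outputs before each commit
jupyter nbconvert --clear-output --inplace notebooks/*.ipynb
git add notebooks/*.ipynb
```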

12:52 The other thing, you talked about reproducibility.

12:55 One of the things that troubles me... and I'm a fan of notebooks, I like them.

12:59 On the Python Bytes podcast, we just talked about how JupyterLab is now a desktop application;

13:05 they just released that like this week, which is really cool.

13:08 So there's a lot of neat stuff happening around notebooks.

13:10 One of the things that I'm not a big fan of though, is the ability to reorder execution

13:16 or only execute part of it, right?

13:19 There's a lot of benefits to say, run the cell that computes the data that's really expensive.

13:22 And then, no, go back three cells, make a change here, run this one,

13:28 and then go back down four cells and run that one.

13:31 And it's kind of like a go-to with no explanation.

13:35 Right?

13:36 Where you can jump around in different orders.

13:38 And that certainly doesn't lead to reproducibility when it's up to the human's decision of like,

13:45 I decided I felt like I wanted to make a tweak and rerun that one.

13:48 And I forgot to run the intermediate step that used that.

13:51 It's very problematic for long-term reliability, reproducibility, and so on.

13:55 There's quite a few studies, I think, where people have tried to rerun notebooks.

14:00 There's one by NYU, I think back in 2020 or 2019, where they reran,

14:04 I think, over 80,000 notebooks.

14:07 And only 24% of them completed without error.

14:10 But only a very small part of those had the same result when the notebook finished running.

14:18 Yeah.

14:19 Wow.

14:19 Very interesting.

14:20 The other part that Kedro has is the versioned datasets.

14:24 So not only like just running the code itself, but you can check out like an exact version of the code

14:30 or the version of the data that was run last time.

14:34 Okay.

14:34 So you sort of store the data as well?

14:36 Like instead of just versioning just the source code or the notebook, you also version the data?

14:42 Yep.

14:42 That's an option as you're creating your catalog entries.

14:45 You can, it's as simple as just putting versioned equals true pretty much on most of the datasets.
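For context, a sketch of what that looks like as a catalog entry; the dataset name and path here are made up, while `versioned: true` is the feature itself:

```yaml
# conf/base/catalog.yml (hypothetical entry)
model_input:
  type: pandas.CSVDataSet
  filepath: data/03_primary/model_input.csv
  versioned: true  # each save lands in a timestamped location, and loads can pin a version
```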

14:50 Oh, that's really cool.

14:51 Ivan, how about your thoughts on this reproducibility, maintainability side of data science?

14:57 That's essentially why we started Kedro because here at QB, when we were going to clients

15:03 and we needed to be able to... if anyone has worked at a consultancy, you know that sometimes you need to rotate people,

15:09 you need to move people from one place to another.

15:11 And the pace is quite high.

15:14 And when people end up in, like, the middle of a project,

15:18 there is quite a long onboarding time, which is probably a week or more than that.

15:24 And that's super expensive for a client to pay for one extra person just to read the code that was written to that point.

15:30 Moreover, when you hand over code, it can't be just notebooks.

15:33 So we ended up resorting to, okay, having different stages where you do things in notebooks,

15:39 then you need to convert them in another programming language.

15:42 And then you have an extra person doing that.

15:44 And obviously that conversion wasn't done due to the limited time in the best way possible.

15:50 So it was quite hard to have this, you know, workflow of making

15:55 reproducible code without sacrificing speed and agility.

16:00 And out of that need, that was how the initial versions of Kedro were born.

16:04 I think notebooks are super useful as well, like, but they are definitely not for production code.

16:09 They're for exploring, for trying out, doing some different things, just basically a working session.

16:16 And I like the name notebook because it's essentially, you're just jotting things down in a notebook.

16:22 The thing is that what we see is people end up using those in production.

16:26 And I think that makes it hard.

16:29 You already explained, like all of you mentioned some of the issues.

16:32 How do you manage that?

16:35 And how do you deal with credentials, with all that additional stuff?

16:38 And for me, when I was coming from different software background, joining a data company,

16:42 and I was like, where are the frameworks here?

16:45 Like there was no frameworks.

16:47 Everything was platforms.

16:48 And there was no way for you to start a project.

16:51 And I found that super interesting.

16:55 Interesting.

16:56 Not like cookie cutter type of templates and those kinds of things for generating it.

17:00 Like here's how we integrate with all of our other libraries and infrastructure and just go, right?

17:05 Yeah, that wasn't there.

17:07 And it was quite hard to align on a similar process.

17:11 And it reminded me a lot to early days of web development when everyone had their own PHP scripts

17:17 that they would make and people didn't use frameworks a lot.

17:20 And then things moved on from there.

17:23 And for me, that's how it felt.

17:24 This portion of Talk Python to Me is brought to you by Tabnine.

17:29 As you know, I'm a big fan of rich text editors and all they can do to empower us

17:34 to work faster and smarter.

17:35 The most important feature there being autocomplete to help you write code more correctly and faster.

17:41 So why not supercharge your autocomplete?

17:43 If you haven't tried Tabnine, you should definitely check it out.

17:46 It works with PyCharm, VS Code, and even Vim among other text editors.

17:50 Tabnine is kind of a mind reader that gets even better as you use it.

17:53 Tabnine uses three AI models: an open-source-trained AI, a private codebase-trained AI,

18:00 and a team-trained AI.

18:02 A very cool benefit of Tabnine is the fact that they have a team-trained AI.

18:06 So if you're on a team working on the same project, you can all work on the same model

18:10 and get suggestions accordingly.

18:12 The more team members you add, the faster the AI will learn your project,

18:16 preferences, and patterns.

18:17 Tabnine is free to use.

18:19 They have a free forever subscription plan as well as a pro plan with advanced models,

18:25 an enterprise plan for larger organizations, and every plan supports inviting team members.

18:30 If you're a student, Tabnine is 100% free.

18:34 Just let them know you're a student, and you'll get the pro plan for free.

18:37 See what better adaptive autocomplete can do for you.

18:40 Visit talkpython.fm/tab9 to get started.

18:43 That's talkpython.fm/tab9, or just visit the link in your podcast player's show notes.

18:48 And thanks to Tabnine for supporting Talk Python to Me.

18:53 It was very interesting.

18:54 I think it was very interesting initially because of this, like, okay, how do we bring that

19:00 to people whose job is not to build software, but their job is to build models?

19:05 And these are different skill sets, right?

19:07 There's a lot of talk about how there are these skills that data scientists should learn

19:11 from software development, software engineering, and that's true.

19:13 It would help them a lot, but there's also a lot of skills that data scientists and people in engineering

19:18 and economics and biology have that, as software developers, you and I,

19:22 you know, we don't know the inner details of, you know, gene editing or mitochondria or whatever, right?

19:28 To be fair, I mean, it's not to put them in a negative light.

19:32 It's just some of these skills are not learned along the traditional path,

19:35 and so it does make things like reproducibility hard.

19:38 Yeah, that's why I think the data is like a very, you know, the data landscape is a very nice place

19:43 because it's a very creative mix of people from different backgrounds for that reason.

19:48 And I think what software engineers can do is, okay, help those people to be the most effective,

19:54 the most productive with what skills they already have.

19:57 Yeah.

19:57 And maybe just teach them just enough software engineering practices so they can go ahead.

20:03 They don't need to be experts in two things.

20:05 Yeah.

20:06 You can never be like full expert in, you know, software engineering and then biology

20:10 and, you know, DNA and all that.

20:13 And if we go that way and require people to be full-blown experts, that's, I think, the wrong path.

20:19 These are ways just to equip them with the tools.

20:22 Especially in Python.

20:23 Yeah.

20:24 The ethos of Python is that it's kind of, you can be very effective with a partial understanding of what it is.

20:29 You know, and if you require them to be full data scientists, then it's, you know, something different for sure.

20:35 Yeah.

20:35 So let's talk about Kedro specifically in terms of not just the philosophies

20:41 and some of the goals, but, you know, what is it?

20:43 When I first heard of it, I thought this feels a little bit like, you know, one of these data pipeline type of programs,

20:51 you know, like Luigi or something like that.

20:53 But that's not quite right, is it?

20:55 Yetunde?

20:56 No, it's not quite right because we really do dial in on that focus of like software engineering best practice first

21:01 implemented on code.

21:02 So we think about things like a project template for you to know where to store

21:07 different parts of your code.

21:08 And then we've also got a data catalog, which manages, like, how you load and save data.

21:13 We've also got a way for users to interact with configuration for the first time as well.

21:17 So removing what would have been hard-coded configuration, like the loading and saving paths

21:23 for data, out of the code, as well as things like extracting parameters and even implementing logging.

21:28 And then we also think around being able to have our own pipeline abstraction as well,

21:33 which is why everyone gets excited and thinks that Kedro is kind of like Luigi or Airflow.

21:37 I think we get grouped with Dagster as well, all these different pipeline abstractions,

21:42 but we really do focus on that journey of like, how do we even write code that's worth deploying?

21:47 Which is a kind of like a different focus because when you get to, our expectation is like when you get into Prefect,

21:52 Dagster, Luigi, Airflow, you really have a code base that's worth deploying.

21:56 And you just really need to think about like, I want it to run at 7 a.m. on Monday.

22:01 Right, or based on this trigger when a file shows up in blob storage or whatever, right?

22:05 Yeah, which is a completely different focus.

22:08 And we think that group, the orchestrators, as we call them, are really good at what they do.

22:11 But in terms of that whole process of leading up to a code base that's worth deploying,

22:16 that's actually what Kedro handles best.

22:18 Yeah, cool.

22:18 You guys want to add anything to that?

22:20 Maybe you can go a bit more into, like, functionality.

22:23 Yeah.

22:23 What it has to do.

22:24 I think I can't add anything after Yetu's excellent introduction.

22:27 I think initially that's the main thing we were asked for.

22:30 It's like, oh, why don't we use Airflow instead?

22:33 And we're not exactly the same.

22:35 You can just use Kedro and Airflow together.

22:38 And in fact, actually now we have a plugin because we connected with them

22:42 and it was a very nice collaboration we had.

22:44 But I think just to underline the main difference is, you know, like all of those tools,

22:49 they came out from big tech companies that already had good processes.

22:53 Yeah.

22:53 Like, for example, Luigi comes from Spotify and so on, right?

22:58 So Luigi from Spotify, Airflow from Airbnb.

23:00 And like, those are big tech companies.

23:02 They have uniform, like, or at least they have a big team that is taking care of the infrastructure.

23:07 So you own the infrastructure, you own everything.

23:09 You just need to put it there, like run your Airflow instance and then run your code there.

23:15 And for us, we come from consultancies where you don't know what infrastructure you'll find at your client.

23:22 So there is no way for us to say, okay, this is the scheduler we're going to use.

23:26 And it was really impossible.

23:27 Yeah.

23:28 And you also don't know what sort of team you're dropping into.

23:30 Is it a large, highly professional, experienced software team?

23:35 Or is it a bunch of research scientists who need a little help on the software side, right?

23:39 Yeah, absolutely.

23:40 Also the team, you don't know any of that.

23:43 And the thing is that we need to have some transferable skills within our teams

23:47 when you work on one project here and there.

23:50 So the next time you're more efficient and more productive, even though you're changing project.

23:55 And that's the main difference here with Kedro and those orchestrators that like we don't have that assumption on the infrastructure.

24:00 As Yetu mentioned, it's like we focus on how to make something that's worth deploying.

24:05 And then once you need to deploy it... at least, that's very hard to achieve,

24:09 but this is what we strive for: to make Kedro deployable basically anywhere.

24:14 If it's a managed service in AWS, or if it's Airflow, or maybe their offering,

24:20 they have an Astronomer offering, which is basically managed Airflow.

24:24 If you need any of that, Kedro should be able to run on all of this.

24:28 But during the development process, you don't need to, you know, use their primitives to create your nodes

24:35 and deal with all the, you know, the extra work.

24:38 The data scientists or data engineers, they don't need to care about, okay, what is an Airflow operator or these kind of things.

24:45 So does this mean that you're able to swap out these different data pipelines?

24:51 Like if, for example, I had a company we started on Luigi and we're like,

24:55 oh, we really want to move to Airflow.

24:57 We'd written our stuff in Kedro.

24:59 Would it make that easier?

24:59 Yeah, I would say yes.

25:01 Okay.

25:02 Depending how you've written it.

25:03 I guess you can always make code not portable, but yeah.

25:08 You know, one of the options that hasn't been mentioned yet is the one we're using,

25:12 which is just the Kedro Docker plugin.

25:14 So Kedro works very nicely inside of Docker.

25:17 Despite all the nice orchestrators out there, there's also the backup plan of just put it in Docker

25:23 and that can run virtually anywhere as well.

25:26 Yeah.

25:26 Sometimes that is really nice.

25:28 Just, I know there's all this stuff as a service to help me out, but we'll just go in the simple route

25:33 and run in it this way.

25:34 All right, let's talk about some of the features.

25:36 And, Ivan, you talked about the lack of some kind of template to get started.

25:41 And that's the first feature listed here.

25:44 So maybe tell folks about this.

25:46 I think that was probably the first one we implemented when we were rebuilding the thing.

25:50 Because what we found out is like lots of data scientists, they were naturally using like some cookie cutter templates.

25:56 Yeah.

25:56 Okay, this is my project.

25:57 This is the structure I like.

25:59 And then we had a big discussion with many different data scientists how to implement this.

26:05 So we settled on the thing that would be the bare minimum that you need for starting a project.

26:12 Right.

26:12 The minimal set that everyone is going to use.

26:14 One of the things I really dislike about templates, these types of project templates,

26:18 and I see them all over the place, is, oh, here's how you use this template to get started.

26:23 And the template says, okay, what we're going to do is we're going to set up Celery as your backend worker.

26:27 We're going to set up Postgres as your database.

26:29 We're going to set up SQLAlchemy as your data layer.

26:33 We're going to set up X, and you end up with 10 things.

26:35 You're like, I only want four of these, but the four it's helping with is really useful.

26:39 But then I got to hunt through and get the others out.

26:41 And it's just, you do want to aim for this minimal side because, you know,

26:46 while it's nice to have support for the other things, like if you foist it upon people

26:51 and they're like, this is more junk than I need, like I'm using less than half of this.

26:55 So this is not useful for me.

26:56 Right.

26:57 Yeah.

26:57 So that was your philosophy to go more minimalist on it.

26:59 Yeah, absolutely.

27:00 Even though, because at that time it was an internal tool, we had some stakeholders that we had to appease, right?

27:06 Like, we had internally a very well-developed data engineering convention.

27:11 And they absolutely wanted to have, you know, in the template, the folders

27:16 with those layers from the data engineering convention.

27:19 And I think you can still find that in Kedro.

27:21 So there are these kinds of things that we needed to do, but they're not mandatory;

27:26 you don't need to use them, and you can remove a lot of those.

27:30 And further down the line, we ended up introducing something called starters.

27:34 And this is essentially, you can have a custom template that you can start from.

27:38 And people are using them to create their own custom projects for their organizations.
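As a rough sketch of how starters are invoked from the CLI: pandas-iris is one of Kedro's official starter aliases, while the git URL is a made-up stand-in for a custom organizational template:

```bash
kedro new --starter=pandas-iris
kedro new --starter=https://github.com/your-org/your-kedro-starter.git  # hypothetical custom starter
```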

27:44 Yeah.

27:45 I think that's a great idea.

27:46 Yeah.

27:47 Yeah.

27:47 And we're using cookie cutter behind the scenes, which means like another thing that we wanted to do is to not reinvent the wheel

27:55 and use standard tools out there.

27:57 If the Python world is using cookie cutter, there is no reason for us not to use it.

28:01 Right.

28:02 And we went with that.

28:04 And that's how we settled on the template system.

28:07 Yeah.

28:07 Nice.

28:08 And you talked about, or Kedro talks about on the homepage, how it uses the Cookiecutter Data Science

28:15 template, which is a logical one. How much of that is in there?

28:19 Or have you kind of moved beyond that?

28:20 I think we moved beyond that.

28:21 It was mainly, that's what the inspiration was.

28:25 Ah, got it.

28:25 It's not like we're using it, but more like it was inspired by this because we found out like a few users back then,

28:31 they use that one.

28:33 And it was fairly sane.

28:34 I mean, if you don't need the framework, I think it's quite a good starting point.

28:38 So if you don't need a full blown Kedro setup, I would recommend that one.

28:43 And, we said, okay, how can we build upon that?

28:47 And like make, obviously we started from scratch, not copying any of their templating,

28:52 but saying, okay, this is a very good example of what we can be.

28:56 And then how can we achieve the same thing, but achieving our goals for making this framework.

29:05 And that's what we settled on.

29:11 And I think we still honor it in our documentation, because I think it was a good inspiration for us.

29:11 Yeah.

29:12 Very nice.

29:12 I love it.

29:13 And definitely thumbs up on using something like cookie cutter, right?

29:17 There's already so many templates out there that people are using.

29:20 People are somewhat familiar with the idea and they maybe know how to extend it.

29:24 Right.

29:24 So no need to go write your own macro language or something crazy.

29:28 Yeah.

29:28 All right.

29:29 Next main feature is the data catalog.

29:32 One thing I was going to mention here, Michael mentioned that you get like from a template,

29:37 a lot of times you get a, a thing that's got everything that you need.

29:41 And then a bunch of stuff you don't need.

29:42 I think one thing that plays well with Kedro is this catalog.

29:46 So with the catalog, I can kind of, you know, abstractly tell Kedro where my data is and what type it is.

29:52 So that catalog can load things from pandas, Spark, Dask, from databases.

29:58 There's pretty long list of data sources that it can load from.

30:01 So I don't need to change the template based on what my underlying data is or where it's stored.

30:07 That's really nice.

30:08 A lot of times when you're thinking of data, the abstractions mean you can switch between MySQL,

30:12 Microsoft SQL Server, and Postgres, not between, you know, Spark and a relational database or cloud storage or something like that.

30:21 That's a lot of flexibility.

30:23 One really nice feature I like that was added in the 0.16 series is that it's built on fsspec under the hood.

30:30 So you can have data sitting on S3, GCP or your local file system.

30:35 And all you do is change your file path, maybe like a slight tweak to your requirements.txt.

30:42 But other than that, Kedro just knows how to load data into whatever object type you ask for.
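A sketch of what that swap looks like in the catalog; the dataset name, bucket, and paths are made up:

```yaml
reviews:
  type: pandas.CSVDataSet
  # was: filepath: data/01_raw/reviews.csv  (local disk)
  filepath: s3://my-bucket/raw/reviews.csv  # fsspec handles the s3:// protocol (add s3fs to requirements.txt)
```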

30:48 Yeah, very nice.

30:49 Also helps on the data science side, if you're not super familiar with like remote blob storage APIs,

30:55 you don't have to learn that, right?

30:57 Which is good.

30:58 Cool.

30:58 And another thing to maybe mention is if you don't need the full Kedro template,

31:04 you don't need the pipeline and everything, you can use Kedro's catalog by itself.

31:09 Okay.

31:09 So maybe if you're just starting a notebook or just starting a project in a notebook,

31:13 you might want to move it to Kedro later.

31:16 You can start putting your catalog together as a Kedro catalog from the start.

31:21 Nice.

31:21 And then maybe as you move it more into source files, it's like part of that's already done,

31:26 right?

31:26 Yep.

31:26 You're closer to the destination.

31:28 Yep.

31:28 And you can use their loaders and savers so that you don't have to write any sort of

31:32 saving and loading code manually.
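A minimal sketch of that standalone usage, say at the top of a notebook; the dataset name and path are made up:

```python
import yaml
from kedro.io import DataCatalog

# Inline config for brevity; in a project this would live in conf/base/catalog.yml
config = yaml.safe_load("""
reviews:
  type: pandas.CSVDataSet
  filepath: data/01_raw/reviews.csv
""")

catalog = DataCatalog.from_config(config)
df = catalog.load("reviews")  # returns a pandas DataFrame, no hand-written read_csv
catalog.save("reviews", df)   # writes it back through the same definition
```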

31:34 Yeah.

31:35 All right.

31:35 Next one is pipeline abstraction, automatic resolution of dependencies between pure Python functions and data pipeline visualizations.

31:44 And you all have a cool visualizer there of a whole lot of stuff going on here.

31:49 I could zoom in here, but like this, this really interesting visualizer of what's happening.

31:56 who wants to sort of tell us what this one's about?

31:58 I'll happily take this one.

32:00 And I'll leave the starters to Yetu.

32:03 I really wanted to take that one because I want to kind of promote using more the API of the pipeline.

32:09 I think this is probably one of the best things we've done.

32:13 And it was kind of a find that we had, what the pipeline is in ours. I think that's probably one of the things that makes Kedro different from other tools.

32:22 We treat each processing node as a pure function.

32:25 So what you need to do is just write a pure function where you have inputs and outputs,

32:30 and you return stuff and that's all you need to do.

32:33 And then you need to announce that in a pipeline that, okay, I'm going to use that function.

32:39 we'll have those inputs from the catalog that Waylon was talking about.

32:43 And they're just aliases to, to those references in the catalog.

32:47 And then when I'm done with that function, then I'll save them to those datasets.

32:52 And you don't need to know, okay, what's the order of execution or like any of that.

32:58 You just need to think locally.

33:00 You need to think about that function you're dealing with.

33:03 Okay.

33:03 Maybe in this example that we have on the screen, like, people who are listening

33:08 might not be able to see, but let's say you have an input to a function, and let's say it's factory train.

33:15 In that example we have, that's your input.

33:18 And then you want to remove the null columns, and that's your function.

33:22 And then let's say we call the output a clean factory input, or what was it?

33:30 So that's all you need to think about how to solve that locally.

33:34 You don't need to think how that would fit in globally.

33:36 And then once you add enough of those functions, then the connection, those dependencies,

33:43 because you announced them in your inputs and your outputs, they'll be figured out by Kedro.

33:47 And then, you know, this graph, the visualization will be drawn for you out of your code.

33:53 And then you can use that for running your code in that particular order.
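A sketch of what Ivan describes; the function and dataset names are paraphrased from the example on screen:

```python
import pandas as pd
from kedro.pipeline import Pipeline, node

def remove_null_columns(df: pd.DataFrame) -> pd.DataFrame:
    """A plain, pure function: data in, data out, no Kedro-specific code."""
    return df.dropna(axis="columns", how="all")

# "Announce" the function in a pipeline; the strings are aliases into the data catalog
pipeline = Pipeline([
    node(remove_null_columns, inputs="factory_train", outputs="factory_train_clean"),
])
```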

33:57 And the reason I say we are so proud of this is because using these pure functions and connecting them as pipelines

34:04 gave us a lot of ability to reuse code and reuse parts of the pipelines without really taking care of, like, where the data is.

34:14 So you just work on a pipeline level and the connections, and then the data catalog will load and save things for you.

34:20 And that made it super, super easy for us to scale the types of projects we can build.

34:27 We started off with very small pipelines and now, maybe Yetu can talk more about this,

34:32 but we have projects which have, like, hundreds of nodes internally at QB.

34:36 Yeah.

34:37 If you look at this, there's a lot going on here.

34:39 And I really appreciate the idea of being able to just focus in on, you know,

34:43 small pieces.

34:44 It brings me back to your idea of talking about, well, let's just write a .php file and put,

34:49 start just, I'm going to start writing HTML and I'm going to start writing some SQL query and I'm going to write some more markup.

34:56 It's just all from scratch and there's zero structure and there's zero support.

35:01 So if you're going to do something like that, you're doing it all at once,

35:04 all at the same time.

35:05 Whereas, you know, compare that to like a modern framework, like Flask, all you do is write the view method.

35:10 You don't care about how the template gets rendered.

35:12 You don't care about how the request comes in or matching the verbs to figure it out.

35:16 You just, I know when I get here, Flask got me here.

35:18 I do the, the five lines of code that I got to do and we're good.

35:22 And I feel like this is real similar, right?

35:23 Yeah.

35:24 Yeah, absolutely.

35:25 You just write a function.

35:26 Yeah.

35:26 You write a function that says unify timestamp column name, or you write a function called remove null columns.

35:31 Like you can definitely do that.

35:33 That's not challenging, but if you look at this overall workflow, it looks like there's a lot going on here.

35:38 Yeah.

35:38 Yeah.

35:38 And I really liked that you brought that analogy with frameworks, with view and like these things,

35:43 because the more we started building that, the more it was very similar for me.

35:47 I mean, I've done some Ruby on Rails before, and actually the pipeline sounded like the

35:52 routes file.

35:52 So you would have different routes, like, okay, how do you register on this URL?

35:57 This is the action I would call, and actions here are our nodes.

36:01 Right.

36:02 And what data is provided to it and things and so on.

36:05 Yeah.

36:05 All that.

36:05 So the data that's provided to it is basically, you know, the URL endpoints and,

36:09 and maybe the post data and things like that.

36:12 And then the output is your views, for example, in, in, in traditional framework.

36:17 The only difference here is that your inputs and outputs are data; they're not URL data coming from,

36:24 you know, the request.

36:25 And then the response is not a view, but actually, again, saving to the data catalog.

36:29 And there is one subtle difference here is that you, you have dependencies between different routes where in web frameworks,

36:36 they're fairly independent, right?

36:38 It's a stateless thing.

36:39 Yeah.

36:39 Where here,

36:40 it's still stateless, but you have dependencies between different parts of the route.

36:45 And maybe here comes the real reason I wanted to talk about the pipeline: this abstraction, done

36:51 this way, gives you kind of like an algebra that you can use to combine pipelines,

36:57 to have prebuilt pipelines, add them together, or maybe just join them together,

37:02 or remove things from them, or maybe saying, because it's not just a tree, it

37:08 ends up being a DAG, a directed acyclic graph,

37:11 it means, like, I want to get the sub-pipeline that produces that output.

37:17 And then it will remove everything else for you.

37:19 Okay.

37:19 So maybe you're like saying, I only care about the output at the, say, fifth function,

37:25 this round timestamps thing, and it could strip off a whole bunch of the other pieces.

37:29 Because it's like, well, all this other stuff is not involved in this part of the chain of the pipeline,

37:34 traversing the acyclic graph in reverse.

37:36 Yeah, exactly.

37:36 And you don't need to do anything.

37:37 You just need to say, like, pipeline.to_outputs, and then specify.

37:41 So we have a bunch of methods that you can use.
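A sketch of that pipeline algebra; the method names are from Kedro's Pipeline API, while the pipeline variables and the node and output names are made up:

```python
full = cleaning + features + modelling         # pipelines combine with plain +
sub = full.to_outputs("round_timestamps")      # only the nodes needed for this output
tail = full.from_nodes("remove_null_columns")  # everything downstream of a node
```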

37:43 And, unfortunately they're underutilized.

37:46 Like people don't really use them.

37:48 And then when we ask like, Oh, how can I do that?

37:50 And it's like, there is one method call and people are like, Oh, that's so cool.

37:54 Maybe we need to improve our documentation.

37:57 And this one.

37:58 Yeah.

37:58 Maybe you should go on a podcast and tell people about it.

38:01 Yeah.

38:01 Yeah.

38:01 I think that's why I want to use the opportunity.

38:04 Yeah.

38:04 Absolutely.

38:05 You should.

38:06 Do you want to elaborate on some of these larger pipelines that you'll have going?

38:09 This actually speaks a lot to the collaborative way that you can work once you're using

38:13 Kedro pipelines, because now your team sessions can easily become something like,

38:17 okay, I know I need to work on like these three key functions because we know this is what we

38:21 want this pipeline to do.

38:23 And then you split out the work and work accordingly to produce those specific nodes and functions.

38:27 Also, one of the things that I'd like to call out about the pipeline abstraction is that you definitely

38:32 do get Kedro-Viz for free on top of it.

38:35 It's a pipeline visualization tool.

38:37 It's really cool because it allows you to give like kind of like a bird's eye view of

38:41 what's going on in the pipeline.

38:42 So you can actually understand like how different things are connected.
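For reference, the visualization ships as the Kedro-Viz plugin and is launched from the project directory:

```bash
pip install kedro-viz
kedro viz  # serves the interactive pipeline graph locally
```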

38:45 Some of the ways our users have used it are ways that we didn't imagine when it was originally created, like being able to talk to non-technical users or stakeholders about the way that the code base is structured, instead of diving into code and showing them,

38:59 hey, here's how the code works, because they'll be like, I don't know what's going on here.

39:02 But we've also found our users will do things like debugging with Kedro-Viz to find out,

39:08 oh, something doesn't appear right in my pipeline and then figure out what's going on there.

39:12 And in some ways, we've actually extended some of that functionality.

39:15 So you'll see that there's now like a code viewer for you to interact with your code.

39:20 And we have some, I guess, maybe exciting things planned down the line when we're talking about the roadmap.

39:25 We'll be able to allude to some of the work that we're doing with experiment tracking,

39:28 which will extend Kedro-Viz a bit further.

39:30 So nice.

39:31 Yeah.

39:32 Yeah.

39:32 You can definitely tell if you've got like a dependency mismatch and the order is wrong or something you could see,

39:37 oh, this one is supposed to be after that one.

39:39 And it's really nice.

39:40 The visualizers.

39:41 It's nice.

39:41 You've kind of got this like map thing you can cruise around on.

39:44 And for people who are listening, I'll definitely put this in the show notes so you can open it up.

39:50 and explore it.

39:50 There's a pretty elaborate pipeline here to explore.

39:53 Does this do anything runtime or is it just for visualizing the static structure?

39:57 It's actually just for visualizing the static structure.

40:00 We've tried to skirt away from what we call the orchestrator UI interface, because it takes us a bit too much

40:07 into that realm where, as I mentioned, we would prefer the orchestrators to play a part.

40:12 So for now it's just static view of what's going on in your code base.

40:15 Yeah.

40:16 No, this is great.

40:17 I feel like a lot of projects would benefit from this kind of stuff.

40:19 not just data science things, right?

40:21 Like I'd like this kind of view of my code for other things as well.

40:25 We have found people using it... Kedro-Viz is also available as a React app, and people use it

40:32 without Kedro.

40:34 So we'll find that they will build... one of the most common use cases we've seen built on top of Kedro-Viz is data lineage,

40:40 but specifically column level lineage that people will want to visualize.

40:44 So they end up using the React app for that.

40:46 I also have a friend who was playing a game.

40:49 He's actually one of the former maintainers on Kedro-Viz and Kedro,

40:53 where he was playing a game where he needed to work on how to build, I don't know what this game is called.

40:57 How to build.

40:58 It was Factorio.

40:59 Factorio.

41:00 You basically look at like how to, how to build up your factory, I think,

41:06 or something like that and use the different elements in the factory.

41:08 And he used Kedro-Viz to visualize what he should be doing in his factory.

41:12 So yeah, it's different ways.

41:14 Oh, that's funny.

41:14 I guess you can visualize lots of things with it.

41:18 How neat.

41:18 Yeah.

41:18 All right.

41:19 The next main feature here is deployment.

41:22 One comment I had on the, on the pipeline was, so, you know, you get a task to do on your sprint.

41:29 You sit down to work for the day.

41:31 And if you're not in this Kedro type of, or this framework mode, a lot of times it's like,

41:37 okay, open the notebook or open the script.

41:40 And I've got to run to a certain point to start my work.

41:43 Because I've got to have that data in memory, that's my option,

41:47 Or I'm manually saving things along the way.

41:50 Both have their downsides.

41:53 But so a lot of times it's like, okay, I'm going to run the notebook and then I'm going to go grab coffee.

41:57 And maybe when it's, when I get back, I can start my work.

42:00 So Kedro is saving each one of these points in the background along the way.

42:05 So when I get a task and it's like, Hey, you've got to put a note in between these two,

42:09 I can start right away.

42:10 Cause the data is already sitting there.

42:12 Oh, that's cool.

42:12 I can also use the pipeline DAG object,

42:15 like Ivan mentioned to just run that section of pipeline I'm working on as I'm working.

42:22 Interesting.

42:22 It's a little bit like just rerun the failing tests or just this one test or something like

42:27 that in the unit testing world.

42:28 Like instead of trying to rerun the entire test suite for every little change.
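A sketch of running just a slice from the CLI; the node name is made up, and the flag spellings are worth checking against your Kedro version:

```bash
kedro run --to-nodes=remove_null_columns    # run only up to this node
kedro run --from-nodes=remove_null_columns  # resume from here, reusing data saved by earlier runs
```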

42:31 Yeah.

42:32 That's a cool thing.

42:33 All right.

42:33 You did talk about the deployment stuff before.

42:35 So maybe you want to touch on some of the deployment stuff.

42:38 I guess it will just be a quick mention.

42:40 So what we do support is two deployment plugins right now.

42:44 The first one is Kedro-Docker, which packages your Kedro project in a Docker container.

42:47 And the second one is Kedro-Airflow, which was built with the Airflow Astronomer team,

42:52 which will take your Kedro pipeline and convert it into an Airflow DAG so that you can run it on Airflow.

42:57 But we also do support in our documentation, like, a few guides on how to deploy Kedro on Prefect,

43:03 on Kubeflow, on AWS Batch, AWS SageMaker, and AWS Databricks as well.
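A sketch of how those two plugins are typically invoked (commands from the plugins' docs; check current versions):

```bash
pip install kedro-docker kedro-airflow
kedro docker build    # kedro-docker: package the project as a Docker image
kedro airflow create  # kedro-airflow: generate an Airflow DAG from the pipeline
```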

43:11 One of them.

43:11 I always feel like with AWS and Azure, there's just, no matter how much I studied,

43:16 there's like three more things that are similar to what are there, but they're different.

43:19 So like I know Batch, but not the one you named after that.

43:22 Yeah.

43:23 That's definitely the case.

43:24 I believe it's AWS Databricks.

43:26 But you can kind of use the same methodology if you're working with Azure Databricks as well to deploy things there.

43:31 Ivan alluded to the fact that we really do pride ourselves on flexible deployment, because we don't know what your internal infrastructure is like

43:38 and therefore should be able to support the most generalizable case to do that.

43:42 So you can definitely check out those guides.

43:44 I know if there are guides missing, just raise GitHub issues and we'll look at it.

43:48 As well, we add a growing tally of mentions of things that we hadn't necessarily heard of people using with Kedro.

43:54 Yeah, it's worth checking out.

43:55 Yeah.

43:56 It's open source.

43:56 If people want to add new deployment stories, they can go and PRs are accepted.

44:01 Is that true?

44:01 Yeah.

44:02 Yeah.

44:02 Write up a guide and we'll take it.

44:04 We have a great, extensive contributing guide that's available in our documentation too,

44:10 that shows you how to make PRs, whether it's features or minor tech improvements or bug fixes,

44:17 as well as documentation too, because we like to write our docs.

44:20 Yeah, that's important.

44:21 And coming into October, it's Hacktoberfest.

44:23 So people can come in, and I think we will be really, really grateful for more deployment guides, because no matter how many you have,

44:32 you always run out, or to put it differently, no matter how many you have,

44:37 you never have everything.

44:39 And there is always someone would come and like, by the way, how do you deploy it on this AWS,

44:44 whatever, the new thing that they have?

44:46 It's like, how do I know?

44:48 It's like, this is the first time I hear about it.

44:50 So there is always room for more.

44:53 And this is the thing that we would really, really love help with.

44:57 So if you want to find something to contribute to for this October, maybe that could be Kedra.

45:02 Yeah, that'd be awesome.

45:04 Yeah.

45:04 I periodically will get, in October, contributions to, like, for example, my course projects,

45:10 the GitHub repos for them.

45:12 And it would be like a slight change in the wording.

45:14 Like, if it says "the Kedro documentation includes three examples to help you

45:18 get started," they might say, "to help you get started,

45:21 the documentation contains three examples."

45:23 There's a PR.

45:24 So it counts.

45:26 I'm like, so I just got to go through and close them.

45:26 So please, people listening, make it a minor but useful contribution. But yeah, it would be great to work on this.

45:32 Right.

45:32 And I feel like these types of pipelines are very accessible because they,

45:36 they narrow the focus so much, right.

45:39 When you get down to like certain tasks and certain things, you don't have to understand the whole project.

45:43 Just, how do you do this one task slightly differently?

45:45 Yeah, absolutely.

45:46 And I'm pretty sure, talking about contributions, that if it's just a change of words,

45:50 we might not have a t-shirt for you.

45:52 If you're adding a guide, we would definitely send out a t-shirt.

45:56 Oh, nice.

45:57 Awesome.

45:57 T-shirts are included, not just a passing

46:01 Hacktoberfest mention.

46:01 Okay.

46:02 So one of the things I find is a little tricky, is always talking through,

46:06 or like thinking through an example of these kinds of things.

46:08 They're very neat, but they also sometimes feel pretty abstract.

46:11 So, Ivan, would you maybe want to talk us through just this, like, hello world example?

46:16 I know it's hard to talk about code on audio.

46:18 So not, not exactly, but just give us a sense of what it means to write one of these pipelines.

46:24 Yeah.

46:24 Sure.

46:24 Starting from the first one, I think in the hello world example,

46:30 we show, like, what a node is, and it's not very different from a function.

46:34 It's just a function, actually.

46:35 And for your function, we accept two types of nodes, actually three types of nodes:

46:40 nodes that have only inputs, nodes that have only outputs, or nodes that have both inputs and outputs.

46:45 So we don't accept functions that have neither inputs nor outputs.

46:51 So for obvious reasons, because that doesn't do much.

46:54 And, you can start with your own function.

46:57 Let's say you can call it return greeting, which will just return hello.

47:01 So, just let me elaborate on that for a second.

47:03 So what you have is a Python function that takes no parameters and returns a string.

47:09 But the thing that's notable about this is it doesn't have a decorator or some other special thing about it.

47:16 It's literally just a bare Python function that has nothing to do

47:20 with Kedro per se.

47:21 Yeah, absolutely.

47:22 So the reason why we did it this way was to allow all kinds of people just to create functions.

47:29 Like if you don't know decorators and these kinds of things, you don't need to know that.

47:32 And then the second one is you might actually use functions from libraries that were not at all designed to be part of our framework.

47:39 You know, if you're importing a function, you cannot add a decorator to it.

47:43 Well, there are ways, obviously, in Python, always, but it's not, you know, very intuitive.

47:49 So we just work with pure functions.

47:52 How that gets turned into a node is actually the more curious part here.

47:56 And that happens through a helper function we have, which is conveniently called node.

48:02 And then you provide your function, and you provide your inputs and outputs as strings.

48:07 So these are the three things you provide.

48:10 So for example, for this return_greeting function, your inputs would be None,

48:15 because you don't have any inputs, and your outputs could be a string

48:19 that says "my_salutation".

48:21 That's how we've done it in the hello world example.

48:25 So this will create a node in the Kedro sense, which you can embed later in a pipeline.

48:30 Nice.

48:31 And then here, the output is a string.

48:34 Is that like the name of the output?

48:36 So you can use it in the pipeline?

48:37 Yeah.

48:37 Not the value, but like the name.

48:39 So you can refer to it later on, right?

48:40 Yeah, exactly.

48:41 So you can think of them as variables, in a way.

48:45 You can define where to store that variable through the data catalog later.

48:51 Yeah.

48:51 Very nice.
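
For readers following along in text, here is roughly what that looks like in code. This is a lightly adapted sketch of the hello world example from the Kedro documentation of this era, so exact class and module names may vary between Kedro versions:

```python
from kedro.io import DataCatalog, MemoryDataSet
from kedro.pipeline import Pipeline, node
from kedro.runner import SequentialRunner

# A plain Python function -- no decorator, nothing Kedro-specific.
def return_greeting():
    return "Hello"

# Wrap it in a node: it has no inputs, and its output is *named* "my_salutation".
return_greeting_node = node(return_greeting, inputs=None, outputs="my_salutation")

# A second plain function that consumes the first node's named output.
def join_statements(greeting):
    return f"{greeting} Kedro!"

join_statements_node = node(
    join_statements, inputs="my_salutation", outputs="my_message"
)

# The pipeline works out execution order from the input/output names.
pipeline = Pipeline([return_greeting_node, join_statements_node])

# The data catalog says where "my_salutation" lives -- here, just in memory.
data_catalog = DataCatalog({"my_salutation": MemoryDataSet()})

# Prints {'my_message': 'Hello Kedro!'}
print(SequentialRunner().run(pipeline, data_catalog))
```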

48:52 So it seems super easy.

48:54 You also have a more elaborate example.

48:56 It's called the spaceflights tutorial.

48:58 Yeah.

48:59 So for people who want to try it, what is this?

49:00 And we don't have to talk through it.

49:02 Just, if people want to go and play with it, what is this one?

49:04 I really want Yetu to introduce it, because initially, when we were thinking of an example,

49:08 I think she came up with the idea to make it more of a space flight.

49:12 And it was actually quite funny that this led to something more interesting that Yetu can share about.

49:20 Yeah.

49:20 Nice.

49:20 I think it was actually Dimitri that did this.

49:22 He was a former maintainer on the Kedro project.

49:24 But the scenario for this tutorial is that it's the year 2160.

49:28 You're somehow a data scientist predicting the price of space flights to the moon and back.

49:32 And you have access to three data sources, information about companies that are flying people to the moon,

49:38 reviews of the shuttles that they have, and then also the reviews customers have given while working with those companies.

49:43 And the whole thing is you just want to predict the price of a space flight.

49:47 So if you go through the tutorial, you'll get acquainted with everything from beginner functionality,

49:52 like installing Kedro and setting up the project template, all the way to

49:58 just about intermediate functionality in Kedro.

50:01 So you get up to speed in about like an hour, an hour and a half in total,

50:05 as you go through the full tutorial.

50:06 And it will teach you all the basics of like, how do I use the project template?

50:10 How do I use the data catalog?

50:11 How do I construct my pipeline?

50:13 How do I visualize my pipeline?

50:15 And how do I package my project as well?

50:17 So it's really useful for getting up to speed on that.

50:20 But we've had a really great time with the space flights project, because we found out that NASA,

50:26 a team at NASA was using Kedro.

50:27 So it was almost like a dream.

50:29 Oh, nice.

50:29 When we discovered that, it was like we went full circle and went to the moon with them.

50:36 They're actually doing space flights.

50:39 Amazing.

50:40 Yeah.

50:40 I'm pretty sure they chose this only because of our tutorial.

50:43 We were thinking Luigi, but then we saw this tutorial and we knew this was the one.

50:49 Now, I do love these imaginative examples and tutorials rather than something really boring.

50:55 Like, oh, let's build a to-do list.

50:56 And here's how we're going to do it.

50:57 Like, okay.

50:59 No, this sounds really fun.

51:01 So if people want to get a sense for what it's like to work with Kedro, you recommend this as the tutorial to work through to get started?

51:08 Definitely recommend this.

51:10 And then there are a few online resources if you want to use them.

51:13 We have a YouTuber.

51:14 I mean, he's been inactive for a while, but he still has really good YouTube tutorials.

51:18 Search for DataEngineerOne and look for his walkthrough of the tutorial there.

51:22 It's also very handy for getting up to speed if you want a video kind of workflow

51:27 as you go through the tutorial.

51:28 But we are going to be piloting some live stream workshops of us working through the tutorial

51:34 ourselves later on in the year.

51:36 So definitely do look out for that; I think probably the Quantum Black YouTube channel will have them.

51:41 And then also, oh yeah, you can actually see,

51:43 I have it opened up here.

51:44 First, I've got to get through the ad for people to see.

51:46 I'm not logged in over here.

51:48 But yeah, there we go.

51:49 That's DataEngineerOne in action.

51:51 And then I definitely recommend either joining in on those live streams as we host them.

51:55 Nice.

51:55 Or watching the follow-up YouTube videos as we do them.

51:58 Yeah, cool.

51:58 I'll link to some of the YouTube videos for people to go check out there.

52:02 Very nice.

52:03 So we're getting pretty short on time.

52:04 Maybe one thing we could just do to wrap this up is maybe, I know you talked about some of

52:09 the cool libraries and stuff you used to build this.

52:11 Maybe you could just talk a little bit briefly about the internals and some of the fun things

52:15 you used there.

52:16 Sure.

52:17 Sure. So, probably the good libraries are worth mentioning.

52:20 And maybe I won't use the time to talk too much about Kedro's internals, but just to

52:24 give a shout-out to some nice libraries that we found.

52:26 Yeah.

52:27 One thing that Waylon already mentioned was fsspec.

52:31 I think that was amazing.

52:32 And we really found it super useful.

52:35 I think it's in the Anaconda ecosystem, developed by some of the people there.

52:43 And the good news is that it's actually also becoming part of Pandas.

52:49 So whenever you're doing read_csv in the newest version of Pandas, it uses fsspec as well.

52:55 Oh yeah.

52:56 And now it says it's also in Dask, Pandas, even DVC and many other things.

53:01 This has been really, really useful because it simplified a lot of our code about the datasets.

53:08 We didn't have to... previously, we had a dataset for S3.

53:12 We had a dataset for GCS on GCP, for Azure Blob Storage, and all of that.

53:18 I don't know.

53:19 It's super annoying because they do exactly the same thing.

53:22 You write it many times, just changing endpoints and things like that.

53:26 It was super frustrating to maintain.

53:28 And we had many of those datasets, so it was super frustrating.

53:31 And when fsspec came out, it basically simplified things, maybe reduced our code base for datasets by a factor of three or something like that.

53:38 Wow.

53:38 Yeah.

53:39 Because they do that abstraction for you.

53:41 Yeah.

53:42 You just pass it along, right?

53:43 Yeah.

53:43 So if someone wants to treat a remote data store kind of like a local file, I think fsspec is really useful too.
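
To make that concrete, here's a small, hedged sketch of the kind of abstraction fsspec provides. The bucket names and file paths are invented for illustration, and the matching cloud extras (s3fs, gcsfs, and so on) have to be installed:

```python
import fsspec
import pandas as pd

# The URL scheme picks the filesystem backend, so one code path covers
# local disk, S3, GCS, Azure, and more.
with fsspec.open("s3://some-bucket/data.csv", mode="r") as f:
    df = pd.read_csv(f)

# Newer pandas versions route URLs through fsspec themselves, so this
# one-liner works the same way for a remote path as for a local one.
df = pd.read_csv("gs://some-bucket/data.csv")
```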

53:51 Cool.

53:51 Let's see.

53:52 Another one that you mentioned was Dynaconf.

53:55 Yeah.

53:56 That's a quite nice one.

53:58 We started using it recently.

53:59 What we wanted to do in Kedro: because we are a framework, there is some framework code that needs to call user code.

54:08 And that's a bit challenging because you don't know the user's package name, because the user will create their own code.

54:15 They will choose a package name, and you don't know what to import from.

54:18 So we came up with a pattern that was actually applied by Django, where you kind of configure projects by their package name, and then you load some of the settings.

54:31 So if people know Django, they know that they have an extensive way of doing settings in order to configure different things in Django.

54:38 How Dynaconf helped us with that is that it has a very clean, extensible abstraction for doing lazy loading of settings.

54:47 And why did we need lazy loading of settings?

54:50 For example, you might have multiple pipelines in Kedro.

54:55 One of them might not be completed.

54:57 There could be some errors in it.

54:58 But you still want to run the other pipeline.

55:01 And if you eagerly load all of that, then your code will fail for no reason.

55:07 Right, right.

55:08 Even if you weren't actually going to end up running that part.

55:10 Yeah, exactly.

55:11 It's almost like a compiled language versus a dynamic language.

55:15 Yeah, and here Python shines, right?

55:17 Because you can have things like Dynaconf, where you don't need to compile that path.

55:22 So it helped us a lot with making those settings load lazily. And there are other things, too.

55:28 You can add validators to validate settings and so on.

55:31 It's fairly extensive.

55:32 I recommend people read their documentation.

55:35 You can use it for many things.

55:37 So yeah, it's a nice package we stumbled on.
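
To give a flavor of that lazy-settings pattern, here's a minimal Dynaconf sketch, not Kedro's actual settings machinery; the file name and key are made up:

```python
from dynaconf import Dynaconf, Validator

# Nothing is read at import time; the settings files are only loaded when a
# key is first accessed, so a broken optional setting won't crash unrelated
# code paths.
settings = Dynaconf(
    settings_files=["settings.toml"],
    validators=[Validator("PROJECT_NAME", must_exist=True)],
)

def show_project_name():
    # First access triggers the lazy load; validation can also be run
    # explicitly once the settings are actually needed.
    settings.validators.validate()
    print(settings.PROJECT_NAME)
```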

55:41 Yeah, very neat.

55:41 Those are great recommendations.

55:42 So with just a little bit of time left, let's wrap up our conversation with where things are going.

55:48 Yatunda, maybe you want to give us a roadmap, a future view: for people who are using Kedro now, what's coming?

55:55 So I guess maybe one of the next upcoming features that you'll start to see being rolled out is experiment tracking in Kedro.

56:02 So, Kedro is already, I think Waylon spoke to it, really aware of being able to save your datasets.

56:09 And for us, our data catalog applies to models as well.

56:12 So we already had some form of model versioning in Kedro.

56:15 And we already have a concept of parameters too, so inputs.

56:20 But what we really needed to do was think about how we handle features, and how we handle metrics coming out of the pipeline as well.

56:28 And those are the two additions that we've made kind of as additional data sets in the Kedro framework too.

56:33 And then the last thing that we had to think around was like, how do we collect all of these things as one unit or one experiment?

56:39 That concept has actually been implemented on the framework side.

56:44 So you can really start to interact with the, you know, experiment tracking functionality there.

56:48 But a lot of the massive changes are going to be done on the front end, where you'll be able to, you know, look at the list of experiments that you've run, compare them as well.

56:57 And then we'll be building up like the functionality as we see fit, including probably MLflow model registry and model serving integration as well.

57:05 And that's probably going to be done through our data catalog as well.

57:07 Oh, that's cool. Is that, if I go to the trouble to train up a model and it takes a day, I can store it, and other people can just pull it down and use it without spending another day?

57:17 Exactly. That's the larger thinking around it.

57:19 So yeah, you can definitely look forward to that.
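
To make the model-sharing idea concrete, here's a hedged sketch of what versioning through the data catalog can look like. The entry names and file paths are hypothetical, and the tracking dataset type is the experiment-tracking addition mentioned above, so availability depends on your Kedro version:

```python
from kedro.io import DataCatalog

# Hypothetical catalog entries, written in the dict form that a catalog.yml
# file maps to.
catalog = DataCatalog.from_config(
    {
        # "versioned": True makes Kedro snapshot the model on every save,
        # so teammates can load a trained model instead of retraining it.
        "regressor": {
            "type": "pickle.PickleDataSet",
            "filepath": "data/06_models/regressor.pickle",
            "versioned": True,
        },
        # Experiment tracking adds dataset types like this one for metrics.
        "metrics": {
            "type": "tracking.MetricsDataSet",
            "filepath": "data/09_tracking/metrics.json",
        },
    }
)

model = catalog.load("regressor")  # loads the most recent saved version
```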

57:22 There are some open issues on our GitHub repository around configuration.

57:26 So I do suggest, if you've interacted with Kedro and you've had issues with scaling configuration, please do check them out and give us comments there.

57:34 Because that will decide whether, and when, we pick those issues up, based on user responses there.

57:40 So that's what I think you can look forward to.

57:43 Oh, fantastic. All right.

57:44 Ivan, I think we're going to take the two libraries you mentioned as the notable PyPI projects, just for the sake of time, since we're kind of over.

57:52 I'll just do one final question for everyone out there.

57:55 And that's if you're going to write some Python code, what editor do you use?

57:59 Ivan, you want to go first?

58:00 IntelliJ.

58:02 I come from Java and Scala world.

58:04 So I stick with IntelliJ.

58:05 So basically the Python support in like full-on IntelliJ, right?

58:09 Yeah.

58:10 Yeah.

58:10 Yeah.

58:10 And not PyCharm, but IntelliJ.

58:12 Yep.

58:13 Go ahead.

58:13 Ivan says no PyCharm; I'm full PyCharm.

58:16 Same here.

58:19 There's arguments on the team about this.

58:21 Oh, is there?

58:22 This is a point of contention.

58:23 I see.

58:24 That's funny because they're so similar, right?

58:27 It's not like VS Code versus PyCharm.

58:29 Oh, Waylon, how about you?

58:31 I'm an avid NeoVim user.

58:32 You know, part of my workflow as a lead data scientist is I bounce between probably

58:39 a dozen projects a day, between actual running pipelines, or maybe a couple

58:46 of our internal libraries that help those things run.

58:49 And it's really nice to have something lightweight that can run with pretty low resources.

59:00 Also, having it running in tmux makes it easy, with a few keystrokes,

59:03 to go into a specific project.

59:05 The editors all tend to look the same.

59:07 And when you have a bunch of projects looking the same,

59:10 it's very easy to edit the wrong one.

59:10 Yeah.

59:10 Yeah.

59:11 Cool.

59:11 All right.

59:12 A good recommendation.

59:13 All right.

59:13 Well, thank you all for being here.

59:16 Maybe final call to action.

59:17 People want to get started with Kedro.

59:19 Bring it into the organization.

59:21 Try it out.

59:21 What do you all say?

59:22 Well, you know how to get started.

59:23 Get into the spaceflights tutorial and then just shout if you have any issues.

59:27 We're up on Discord.

59:28 And we also do have a GitHub discussions page as well.

59:31 So you can just flag things there.

59:31 We help users at the different levels where they find themselves.

59:34 So definitely do that.

59:36 Yeah.

59:36 And I guess Quantum Black does consulting for Kedro.

59:40 So if people have these projects and they're like, I'm not sure we can handle this ourselves,

59:43 they could probably hire you all, right?

59:45 We've never quite had that level of interaction.

59:47 It's more that a Quantum Black data science and data engineering team will go out and use Kedro

59:54 as part of a larger engagement on, like, how do we solve a business problem.

59:59 So you can definitely learn Kedro that way.

01:00:01 But for the open source community, as we move in the open source space,

01:00:05 it would definitely be through the channels that we have available.

01:00:08 Cool.

01:00:08 One thing to mention is like, if people want to engage with Quantum Black because they know

01:00:12 about Kedro, please mention that.

01:00:16 I mean, like, say you found them because of Kedro, you know, the Kedro team.

01:00:20 So that, you know, we get the kudos.

01:00:23 Yeah.

01:00:24 Awesome.

01:00:25 Yeah, I bring that up because there's different ways to support open source, right?

01:00:29 I mean, there's the MongoDB model where they sell MongoDB as a service and Atlas and all

01:00:34 that.

01:00:35 There's, you know, like Tidelift.

01:00:37 There's GitHub support.

01:00:39 But here's like yet another way in which this project is being grown and being supported

01:00:43 because it's, you know, supporting you all doing your work.

01:00:46 So, yeah.

01:00:47 So, I mean, if that's the case, then I'll definitely say we do really want to be able to help a

01:00:52 lot of people in the industry as well, because we know that it's needed.

01:00:55 And we obviously recognize that as a framework, especially in the data science space, we are

01:01:00 a bit of a first mover.

01:01:01 So we suffer a lot of like first mover pains where people are like, why on earth do I need

01:01:05 a framework?

01:01:06 I don't need a framework.

01:01:07 So if you help us with breaking through those barriers, like please go for it and be an

01:01:12 advocate.

01:01:12 And I guess in this sense, be a Kedroid.

01:01:14 Right on.

01:01:15 All right.

01:01:16 Yatunda, Waylon, Ivan, thank you all for being here.

01:01:19 It's been great to learn about Kedro and great to chat with you.

01:01:22 It's been awesome.

01:01:22 Thank you so much.

01:01:23 Thank you.

01:01:24 Bye.

01:01:24 Thanks, Michael.

01:01:25 Bye.

01:01:25 This has been another episode of Talk Python to Me.

01:01:29 Our guests on this episode have been Yatunda Dada, Waylon Walker, and Ivan Danov.

01:01:34 And it's been brought to you by Tab9 and us over at Talk Python Training.

01:01:38 And the transcripts are brought to you by Assembly AI.

01:01:40 Supercharge your editor's

01:01:43 auto-complete with Tab9.

01:01:44 The editor add-in that uses AI to learn your coding styles and preferences and make you even

01:01:49 more effective.

01:01:50 Visit talkpython.fm/Tab9 to get started.

01:01:54 Do you need a great automatic speech-to-text API?

01:01:57 Get human-level accuracy in just a few lines of code.

01:02:00 Visit talkpython.fm/assemblyai.

01:02:02 Want to level up your Python?

01:02:04 We have one of the largest catalogs of Python video courses over at Talk Python.

01:02:08 Our content ranges from true beginners to deeply advanced topics like memory and async.

01:02:13 And best of all, there's not a subscription in sight.

01:02:16 Check it out for yourself at training.talkpython.fm.

01:02:19 Be sure to subscribe to the show, open your favorite podcast app, and search for Python.

01:02:24 We should be right at the top.

01:02:25 You can also find the iTunes feed at /itunes, the Google Play feed at /play,

01:02:30 and the direct RSS feed at /rss on talkpython.fm.

01:02:34 We're live streaming most of our recordings these days.

01:02:38 If you want to be part of the show and have your comments featured on the air,

01:02:41 be sure to subscribe to our YouTube channel at talkpython.fm/youtube.

01:02:46 This is your host, Michael Kennedy.

01:02:47 Thanks so much for listening.

01:02:49 I really appreciate it.

01:02:50 Now get out there and write some Python code.

01:02:52 Thank you.
