#337: Kedro for Maintainable Data Science Transcript
00:00 Have you heard of Kedro? It's a Python framework for creating reproducible, maintainable, and modular data science code. We all know that reproducibility and related topics are important ones in the data science space. The freedom to pop open a notebook and just start exploring is much of the magic.
00:15 Yet that free-form style can lead to difficulties in versioning, reproducibility, collaboration, and moving to production. Solving these challenges is the goal of Kedro, and we have three great guests from the Kedro community here to give us a rundown: Yetunde Dada, Waylon Walker, and Ivan Danov.
00:31 This is Talk Python to Me, episode 337, recorded October 1, 2021.
00:49 Welcome to Talk Python to Me, a weekly podcast on Python. This is your host, Michael Kennedy. Follow me on Twitter, where I'm @mkennedy and keep up with the show and listen to past episodes at 'talkpython.fm' and follow the show on Twitter via @talkpython.
01:04 We've started streaming most of our episodes live on YouTube. Subscribe to our YouTube channel over at 'talkpython.fm/youtube' to get notified about upcoming shows and be part of that episode. This episode is brought to you by Tabnine, the editor plugin that enhances your autocomplete by learning how you write code, and us over at Talk Python Training with our courses. The transcripts are brought to you by 'AssemblyAI'. Yetunde, Ivan, and Waylon, welcome to Talk Python to Me.
01:33 Thank you very much for having us.
01:35 Yeah. Thank you for inviting us.
01:36 Yes. Thank you.
01:37 Yeah. It's fantastic to have you all here. Let's start with just a little bit of your background, since there's three of you. Maybe not too long. But how did you all get into programming, Python, this Kedro project, and so on? Yetunde, want to start with you?
01:50 Sure. I am principal product manager on Kedro. I've been with the project for, actually, exactly three years today.
01:56 Oh, nice. How old is Kedro?
01:58 Two years and eight months.
01:59 Ivan has been there most of the time.
02:00 That's fantastic. Ivan has been there from the beginning, so he'll talk about that. My background is mechanical engineering, and I would have been a user of Kedro, we discovered. I think one of the coolest user interviews I've ever done was with my former team. They picked it up on their own, which is amazing.
02:14 Oh, no.
02:15 Ivan and I worked at Quantum Black.
02:19 That's the primary.
02:20 Yeah. Yeah. So, Ivan. I'm a tech lead for Kedro. I have been working with Quantum Black for almost the last five years now, is it? Yeah. Five years.
02:30 So you've been there from the beginning.
02:32 I've been here from the beginning for Kedro, not from the very beginning. Yeah. So initially it was a small internal tool that was being developed by another two people on our team, Nikos and Aris, and then we decided to turn this into a product, eventually started from scratch, and developed it into what it is.
02:53 Then we found out that things can get serious. And that's how we hired Yetunde, because we needed a PM as well, as a proper product.
03:02 Otherwise, my background is in software engineering. All kinds of software engineering. Started off as a web developer. I was very keen to do some game programming before that, but I couldn't find jobs related to that.
03:13 A lot of people are interested in doing game programming until you actually get into the reality of it. And the reality is a lot of times it can be a grind and it's hard to get a job.
03:22 And I was going to say that I'm quite lucky that I didn't end up being a game programmer.
03:28 Not the first time I heard that. That's right.
03:32 Yeah. I moved on to distributed systems, and I ended up doing data and AI at QB mainly. So I'm kind of a newbie in the data field, despite being there for five years.
03:44 Kind of a newbie, wow. A lot of people are in that realm, right? A lot of people are coming into the whole data science side of things. Waylon, how about you?
03:53 My background is in mechanical engineering, probably around 2014. I started diving deeper into the data side of things. I had some family medical things that came up, and it kind of severely limited my ability to travel.
04:08 And slowly over time, I just kind of doubled down on the data side of things.
04:13 It was right around the time you started the show, and a lot of what I've learned has either been directly from the show or from taking ideas from the show and diving into them.
04:24 Awesome. That's really cool. Happy to bring that to you. I think a lot of people who come from backgrounds that they're not like traditional CS backgrounds. They're kind of coming in through a side channel. I feel like the podcast has offered a lot of connection and extra information. Besides just what's on the docs page of some projects for people, which is great. It's awesome to hear.
04:44 So now I'm a team lead for a data science team and do Python every day, and we use Kedro pretty heavily in all of our projects.
04:55 Interesting. So before we move on to the topic, give me a quick thought on comparing the mechanical engineering work-life experience to the software development, Python, and developer experience. How do you feel about where you are?
05:08 I like where I'm at. One thing that I kind of struggled with in mechanical engineering was that there's a lot of learning in college. There's a lot of things to learn, and then you get into industry and everything's kind of walled off. The learning doesn't completely stop, but it's very hard to do. On the software side, more things are open to learn. There's a lot more resources that are just out there in the open and not, like, behind proprietary IP, patents, and all that kind of stuff. I think that works really well for me. I'm a learner. I don't remember exactly, but the Gallup personality study kind of things always come out with Learner as the top thing for me.
06:23 So, a little bit like engineering, people are coming to data science from these different angles, a lot of times from computational stuff, from biology, or from finance and economics, or whatever. And they don't necessarily come with a fully baked software engineering lifecycle skill set: here's how I set up my continuous integration, here's my testing. A lot of times they decide, look, I got it working, we're kind of good. You know what I mean? I think one of the areas the whole data science field is working on is taking a lot of these folks who are coming from non-traditional CS backgrounds and helping them create more reproducible, reasonable bits of code: notebooks, maybe even code outside notebooks. So let's maybe open the conversation there, because that's where Kedro focuses: on helping data scientists create reproducible, maintainable code. Yetunde, you want to maybe kick us off with some thoughts on where Kedro's philosophy is on this?
07:17 I guess if we actually break it down, it's a Python framework that helps you do those specific things. And when we talk about it being a framework, it's kind of embedded with these practices and ways of structuring how you write code so that you can get that reproducible, maintainable, and modular data science code. But if I go into each one of the definitions: when we say reproducible, we kind of mean that when I rerun this pipeline or rerun this experiment, I should get the same result at the end. So there shouldn't be any real surprises that things have really changed or things are really breaking. When we talk about it being maintainable, we now also add in an aspect of collaboration, even for yourself: if you come back to this code base three months from now or six months from now, you should be able to know what was going on in it and be able to modify and tweak it. And other people should be able to do that with you, too. It shouldn't be the biggest disaster if the main code contributor has left and you now have to scramble to figure out what's going on there. And then when we talk about it being modular, this is where we encompass some of the software engineering principles that you wouldn't ordinarily learn,
08:24 maybe, if you enter data science from the mathematician space or even from the sciences, where we think about you being able to break your code base down into small units so that it's possible to think about things like reuse.
08:37 But also it's easier to do things like testing the code base as well. All of these things basically amount to trying to enforce software engineering best practice, especially where you recognize that you might need help with that.
08:49 Yeah. One of the other areas that I've seen a lot of emphasis on the project about is collaboration. One of the things that can be challenging is if you're a data scientist working on a notebook, you might run some cells, maybe on data that is live. And so the data might be slightly changing. You check that into GitHub or Source Control. You've got the results in sort of a scrambled up JSON file notebook, and then someone else has rerun it at a different time, and then they try to check it out. Well, you end up with these conflicts and other issues, so the sort of natural flow of, hey, let's just check everything into git, and then we'll just synchronize over that can sort of fall apart with some of the traditional tooling of data science.
09:33 And that's definitely true. I think Kedro's origins actually come from large teams of at least three or four data scientists, data engineers, and machine learning engineers collaborating on the same code base, up to twelve people having to work on the same code base. If you're using your notebook and trying to construct your entire pipeline in it, I think the workflow would look kind of strange, with you waiting. Maybe Waylon has some comments on watching this in practice, of waiting for someone to finish in the Jupyter notebook before you could have a go, edit it, and try other things.
10:07 During our engagement with McKinsey, we were introduced to Kedro during kind of the first iteration.
10:12 And by the way, we've only talked about Quantum Black so far, but Quantum Black is like a subsidiary of McKinsey, right? This is all sort of the same organization, in a sense. Right. So that's why this comes together.
10:25 Good point.
10:26 So the second half, they chose just due to familiarity, the people that we had with us, it was their first time using Kedro, and they're like, well, if we want to move fast on the second half, let's not try to do anything new. And let's just do notebooks like we always had.
10:41 Oh, a fall-back-to-what-they-know sort of thing. Yeah.
10:43 And the workflow there was definitely like three people are on a project.
10:48 Two people are sitting idle while one person has the notebook checked out.
10:51 Oh, my gosh. That sounds like old-school SourceSafe-type source control. Someone's locked the files. No one else can edit them. It's completely 1990s style, but it's a real problem. For some of these things there's an attempt to address them by having collaborative notebooks, basically Google Docs type of experiences, right? But usually those are somebody else's cloud, somebody else's compute cloud. And so you're taking the trade-offs of running it over there, right?
11:17 Yeah. And there's multiple things I think you're missing out on. When we talk about even making a robust code base, writing unit tests in a notebook and writing docstrings in a notebook becomes a little challenging. And then we think about all the additional tools that you have to add for collaboration. How am I going to do pull requests and review my friends' code? Because we know that when you check in a Jupyter notebook, it's often with the weird JSON thing. So how do I do reviews of my team members' code so we can overall improve the code base? And then some of the features within notebooks, like cached state, cause issues down the line as well, because I might be working on a version of my code where, when I rerun my entire Jupyter notebook, not everything will run. So it's not to say down with notebooks or anything, for sure. We believe that there's a space for them, really: for doing exploratory data analysis, trying to work out what's going on with the data set. There's space for them in initial pipeline development as well, if you're still prototyping and you're not sure how things are going to go. And even for reporting, maybe, if you want a more visual interface for reporting. But when you talk about code that I want to be able to run in three months or six months, and that many people will be using, it has to be in Python scripts, and it's supported when it's in the framework. Yeah.
12:30 I think that's a good point. You can definitely start in the notebook space and then eventually move it over. Now, one thing I do want to give a shout-out to, and I don't remember the name, is that you can set up a Git pre-commit hook that will strip out the metadata and the results of your notebook. So it's kind of a fix, but still it's not that amazing.
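To make the idea of stripping results before a commit concrete, here is a minimal sketch using only the standard library. It is not the tool mentioned in the conversation (which goes unnamed); the function name `strip_outputs` and the sample notebook are hypothetical, and the only thing assumed is the usual notebook JSON layout, where code cells carry `outputs` and `execution_count` fields.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with code-cell outputs and counters cleared."""
    cleaned = json.loads(json.dumps(notebook))  # simple deep copy of JSON-compatible data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# A tiny hand-written notebook, for illustration only.
example = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["42\n"]}],
            "source": ["print(6 * 7)"],
        },
        {"cell_type": "markdown", "metadata": {}, "source": ["# Notes"]},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

stripped = strip_outputs(example)
print(json.dumps(stripped["cells"][0], indent=1))
```

With outputs and execution counters removed, two people who ran the same notebook at different times commit the same JSON, so diffs show only changes to the source cells.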
12:53 The other thing you talked about reproducibility one of the things that troubles me, and I'm a fan of notebooks, and I like it. And just on Python Bytes podcast, we just talked about how Jupyter Lab is now a desktop application. They just released that this week, which is really cool. There's a lot of neat stuff happening around notebooks. One of the things that I'm not a big fan of, though, is the ability to reorder execution or only execute part of it.
13:18 There's a lot of benefit to, say, running the cell that computes the data, which is really expensive, and then going back three cells, making a change here, running this one, and then going back down four cells and running that one. And it's kind of like a goto with no explanation, where you can jump around in different orders.
13:39 And that certainly doesn't lead to reproducibility when it's up to the human's decision. I decided I felt like I wanted to make a tweak and rerun that one, and I forgot to run the intermediate step that uses it. That is very problematic for long-term reliability, reproducibility, and so on.
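The forgotten intermediate step can be shown in a few lines. This is a toy model only, assuming each notebook cell is a function that mutates a shared namespace; the cell names and numbers are made up for illustration.

```python
def cell_1(ns):
    ns["prices"] = [10, 20, 30]


def cell_1_edited(ns):
    # The same cell after the author tweaks the input data.
    ns["prices"] = [100, 200, 300]


def cell_2(ns):
    ns["discounted"] = [p * 0.5 for p in ns["prices"]]


def cell_3(ns):
    ns["total"] = sum(ns["discounted"])


def run(cells, ns=None):
    """Execute cells in the given order against a shared namespace."""
    ns = {} if ns is None else ns
    for cell in cells:
        cell(ns)
    return ns


# Interactive session: run everything, edit cell 1, then rerun only cells 1 and 3.
session = run([cell_1, cell_2, cell_3])
run([cell_1_edited, cell_3], session)  # cell 2 was skipped, so "discounted" is stale

# Fresh top-to-bottom run of the edited notebook.
fresh = run([cell_1_edited, cell_2, cell_3])

print(session["total"], fresh["total"])
```

The interactive session still reports the total computed from the old prices, while a clean run of the very same cells gives a different number, which is the reproducibility gap described above.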
13:55 There's quite a few studies, I think, where people have tried to rerun notebooks. There's one by NYU back in 2019 where they reran, I think, over 80,000 notebooks, and only 24% of them completed without error. And then there's a very small part of those that had the same result when the notebook was finished running.
14:18 Yeah. Wow.
14:19 Very interesting.
14:20 The other part that Kedro has is the versioned data sets.
14:25 So not only just running the code itself, but you can check out, like, the exact version of the data that was run last time.
14:34 So you sort of store the data as well instead of just versioning just the source code or the notebook, you also version the data.
14:42 Yeah, that's an option. As you're creating your catalog entries, it's as simple as just putting version equals true, pretty much, on most of the data sets.
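As a rough illustration of what versioning the data alongside the code can mean, the sketch below keeps every save under its own version folder and lets you load either the latest or an exact earlier version. It is a plain-Python stand-in, not Kedro's implementation: the helper names `save_versioned` and `load_versioned`, the folder layout, and the timestamp format are all assumptions made for this example.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def save_versioned(root: Path, name: str, content: str, version: Optional[str] = None) -> str:
    """Write content to root/name/<version>/name and return the version label."""
    version = version or datetime.now(timezone.utc).strftime("%Y-%m-%dT%H.%M.%S.%fZ")
    target = root / name / version / name
    target.parent.mkdir(parents=True, exist_ok=False)  # never overwrite an old version
    target.write_text(content)
    return version


def load_versioned(root: Path, name: str, version: Optional[str] = None) -> str:
    """Load a specific version, or the lexicographically latest one if none is given."""
    if version is None:
        version = max(p.name for p in (root / name).iterdir())
    return (root / name / version / name).read_text()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Explicit version labels keep the example deterministic.
    v1 = save_versioned(root, "cars.csv", "id,price\n1,100\n", version="2021-10-01T09.00.00")
    v2 = save_versioned(root, "cars.csv", "id,price\n1,120\n", version="2021-10-01T10.00.00")
    latest = load_versioned(root, "cars.csv")
    first = load_versioned(root, "cars.csv", version=v1)

print(latest)
print(first)
```

Because old versions are never overwritten, rerunning last month's experiment can point at exactly the data it saw at the time.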
14:50 Oh, that's really cool.
14:52 How about your thoughts on this reproducibility, maintainability side of data science?
14:57 That's essentially why we started Kedro, because here at QB, when we were going to clients, if anyone has worked at a consultancy, you know that sometimes you need to rotate people. You need to move people from one place to another, and the pace is quite high. And when people end up in the middle of a project, there is quite a long onboarding time, which is probably a week or more than that. And that's super expensive for a client, to pay for one extra person just to read the code that was written to that point. Moreover, when you hand over code, it can't be just notebooks.
15:34 So we ended up resorting to having different stages where you do things in notebooks, then you need to convert them in another programming language, and then you have an extra person doing that. And obviously that conversion wasn't done in the best way possible, due to the limited time. So it was quite hard to have this workflow of making reproducible code without sacrificing speed and agility.
16:00 And out of that need, that was how the initial versions of Kedro were born. I think notebooks are super useful as well, but they are definitely not for production code. They're for exploring, for trying out, doing some different things. Just basically a working session.
16:16 And I like the name notebook, because essentially you're just catching things in a notebook.
16:22 The thing is that what we see is people end up using those in production.
16:28 I think that makes it hard. All of you already mentioned some of the issues. How do you manage that? And how do you deal with credentials, with traditional stuff? And for me, when I was coming from a different software background and joining a data company, I was like, where are the frameworks here? There were no frameworks, everything was platforms, and there was no way for you to start a project.
16:52 And I found that super interesting.
16:56 Not like Cookiecutter-type templates, those kinds of things for generating, hey, here's how we integrate with all of our other libraries and infrastructure, and just go.
17:06 Yeah, that wasn't there. And it was quite hard to align on a similar process. And it reminded me a lot of the early days of web development, when everyone had their own PHP scripts that they would make and people didn't use frameworks a lot. And then things moved on from there. And for me, that's how it felt.
17:25 This portion of Talk Python to Me is brought to you by Tabnine.
17:29 As you know, I'm a big fan of rich text editors and all they can do to empower us to work faster and smarter. The most important feature there being autocomplete to help you write code more correctly and faster.
17:41 So why not supercharge your autocomplete? If you haven't tried Tabnine, you should definitely check it out. It works with PyCharm, VS Code, and even Vim, among other text editors. Tabnine is kind of a mind reader that gets even better as you use it. Tabnine uses three AI models: an open-source-trained AI, a private-codebase-trained AI, and a team-trained AI.
18:02 A very cool benefit of Tabnine is the fact that they have a team-trained AI.
18:06 So if you're on a team working on the same project, you can all work on the same model and get suggestions accordingly. The more team members you add, the faster the AI will learn your project preferences and patterns. Tabnine is free to use. They have a free forever subscription plan as well as a pro plan with advanced models and an enterprise plan for larger organizations, and every plan supports inviting team members. If you're a student, Tabnine is free: just let them know you're a student and you'll get the pro plan for free. See what better adaptive autocomplete can do for you. Visit 'talkpython.fm/tabnine' to get started, that's 'talkpython.fm/tabnine', or just visit the link in your podcast player's show notes. And thanks to Tabnine for supporting Talk Python to Me.
18:52 It was very interesting. I think it was very interesting initially because of this. Okay, how do we bring that to people whose job is not to build software, but their job is to build models. And these are different skill sets, right.
19:07 I talked about how these are skills that data scientists should learn from software development and software engineering. And that's true. It would help them a lot. But there's also a lot of skills that data scientists and people in engineering and economics and biology have that software developers like you and I don't. We don't know the inner details of gene editing, mitochondria, or whatever. To be fair, I mean, it's not to put them in a negative light, it's just that some of these skills are not learned along the traditional path. And so it does make things like reproducibility hard.
19:38 Yeah, that's why I think the data landscape is a very nice place, because it's a very creative mix of people from different backgrounds. And I think what software engineers can do is, okay, help those people to be the most effective, the most productive with what skills they already have, and maybe just teach them just enough software engineering practices so they can go ahead. They don't need to be experts in two things. You can never be, like, a full expert in software engineering and then biology and DNA and all that. And if we go that way and require people to be full-blown experts, I think that's the wrong path.
20:19 These are ways just to equip them with the tools, especially in Python, where the ethos is that you can be very effective with a partial understanding of what it is.
20:29 If you require them to be full data scientists, then it's something different for sure. So let's talk about Kedro specifically, in terms of not just the philosophies and some of the goals, but what is it? When I first heard of it, I thought this feels a little bit like one of these data pipeline type of programs, like Luigi or something like that. But that's not quite right, is it, Yetunde?
20:56 No, it's not quite right, because we really do dial in on that focus of software engineering best practice first, implemented on code. So we think about things like a project template, for you to know where to store the different parts. Then we've also got a data catalog, which manages how you load and save data. We've also got a way for users to interact with configuration, maybe for the first time as well: removing from the code what would have been configuration, like their loading and saving paths for data, as well as things like removing parameters and even implementing logging. And then we also think about having our own pipeline abstraction as well, which is why everyone gets excited and thinks that Kedro is kind of like Luigi or Airflow.
21:38 I think we get grouped with Dagster as well, all these different pipeline abstractions, but we really do focus on that journey of how do we even write code that's worth deploying? Which is kind of a different focus, because our expectation is that when you're getting to Prefect, Dagster, Luigi, or Airflow, you already have a code base that's worth deploying. And you just really need to think about, I want it to run at 7:00 a.m. on Monday.
22:01 Right. Or based on this trigger when a file shows up in Blob storage or whatever. Right?
22:06 Which is a completely different focus. And we think that group, we call them the orchestrators, are really good at what they do.
22:12 But in terms of that whole process of leading up to a code base that's worth deploying, that's actually what Kedro handles first.
22:18 Yeah. Cool. You guys want to add anything to that?
22:20 Maybe you can go a bit more into, like, functionality, what it has.
22:24 I think I can't add anything after Yetunde's excellent introduction. I think initially the main thing we were asked was, why don't we use Airflow instead? And we're not exactly the same. So you can just use Kedro and Airflow together. And in fact, now we have a plugin, because we connected with them. And it was a very nice collaboration we had.
22:44 But I think just to underline the main difference is all of those tools they came out from big tech companies that already had good processes.
22:54 For example, Luigi comes from Spotify and so on. Right.
22:57 So Luigi from Spotify, Airflow from Airbnb. And those are big tech companies. They have uniform infrastructure, or at least they have a big team that is taking care of the infrastructure. So you own the infrastructure, you own everything, you just need to put it there: run your Airflow instance and then run your code there. And for us, we come from consultancies, where you don't know what infrastructure you'll find at your client. So there is no way for us to say, okay, this is the scheduler we're going to use. It was really impossible.
23:27 And you also don't know what sort of team you're dropping into. Is it a large, highly professional, experienced software team, or is it a bunch of research scientists who need a little help on the software side, right?
23:39 Yeah. Absolutely. Also the team you don't know any of that. And the thing is that we need to have some transferable skills within our teams when you work on one project here and there.
23:50 So the next time you're more efficient and more productive, even though you're changing projects. And that's the main difference here between Kedro and those orchestrators: we don't make that assumption about the infrastructure. As Yetunde mentioned, we focus on how to make something that's worth deploying. And then once you need to deploy it, and this is very hard to achieve, but it's what we strive for, the aim is to make Kedro deployable basically anywhere, whether it's a managed service in AWS or Airflow, or there's also an Astronomer offering, which is basically managed Airflow. If you need any of that, Kedro should be able to run on all of these. But during the development process, you don't need to use their primitives to create your nodes and deal with all the extra work that data scientists or data engineers don't need to care about. Okay, what is an Airflow operator, or these kinds of things?
24:45 So does this mean that you're able to swap out these different data pipelines?
24:52 For example, say I had a company, we started on Luigi, and we're like, oh, we really want to move to Airflow. If we'd written our stuff in Kedro, would it make that easier?
25:00 Yeah, I would say yes.
25:02 Depending how you've written it.
25:04 I guess you can always make code not portable.
25:07 But yeah, one of the options that hasn't been mentioned yet is the one we're using, which is just the Kedro Docker plugin. So Kedro works very nicely inside of Docker. Despite all the nice orchestrators out there. There's also the backup plan of just put it in Docker, and that can run virtually anywhere as well.
25:26 Yeah. Sometimes that is really nice.
25:28 I know there's all this stuff as a service to help me out, but we'll just go the simple route and run it this way.
25:34 All right.
25:34 Let's talk about some of the features, and even you talked about the lack of some kind of template to get started, and that's the first feature listed here. So maybe tell folks about this.
25:46 I think that was probably the first one we implemented when we were rebuilding this thing, because what we found out is that lots of data scientists were naturally using some Cookiecutter templates.
25:56 Yeah. Okay.
25:56 This is my project. This is the structure I like.
25:59 And then we had a big discussion with many different data scientists about how to implement this.
26:05 So we settled on the thing that would be the bare minimum that you would need for starting a project.
26:11 The minimal set that everyone is going to use. One of the things I really dislike about these types of project templates, and I see them all over the place, is this: here's how I use this template to get started, and the template says, okay, what we're going to do is set up Celery as your back-end worker, set up Postgres as your database, and set up SQLAlchemy as your data layer, and you end up with ten things. You're like, I only want four of these. The four it's helping with are really useful, but then I've got to hunt through and get the others out. So you do want to aim for this minimal side, because while it's nice to have support for the other things, if you force it upon people, they're like, this is more junk than I need, I'm using less than half of this, so this is not useful for me. Yeah. That was our philosophy, to go more minimalist on it.
26:59 Yeah. Absolutely. Even though, because at that time it was an internal tool, we had some stakeholders that we had to appease, right? Internally we had a very well-developed data engineering convention, and they absolutely wanted to have it in the template: we needed to have the folders with those layers in the data engineering. And I think you can still find that in Kedro. So there are these kinds of things that we needed to do, but they're not necessary. You don't need to use them, and you can remove a lot of those. And further down the line, we ended up introducing something called starters. Essentially you can have a custom template that you can start from, and people are using them to create their own custom projects for their organizations.
27:44 Yeah. I think that's a great idea. Yeah.
27:47 And we're using Cookiecutter behind the scenes. Another thing that we wanted to do is to not reinvent the wheel and to use standard tools out there. If the Python world is using Cookiecutter, there is no reason for us not to use it. Right.
28:02 And we went with that. And that's how we settled on the template system.
28:07 Yeah. Nice. And Kedro talks on the home page about how it uses the Cookiecutter Data Science template. How much of it is that? Or have you kind of moved beyond that?
28:20 I think we moved beyond that. It was mainly the inspiration. It's not like we're using it, but more like we were inspired by it, because we found out that a few users back then used that one. And it was fairly sane. I mean, if you don't need a framework, I think it's quite a good starting point. So if you don't need a full-blown Kedro setup, I would recommend that one.
28:44 And we said, okay, how can we build upon that and make obviously, we started from scratch, not copying any of their templating, but saying, okay, this is a very good example of what we can be.
28:56 And then how can we achieve the same thing? But achieving our goals for making this framework. And that's how we settled on.
29:05 And I think we still honor it because I think it was a good inspiration for us in our documentation.
29:11 Yeah. Very nice.
29:13 I love it. And definitely thumbs up on using something like cookie cutter. There's already so many templates out there that people are using. People are somewhat familiar with the idea they maybe know how to extend it. So no need to go write your own macro language or something crazy.
29:28 All right. Next main feature is the data catalog.
29:31 One thing I was going to mention here: Michael mentioned that a lot of times with a template you get a thing that's got everything you need, and then a bunch of stuff you don't need. I think one thing that plays well with Kedro is this catalog.
29:46 So with the catalog, I can kind of, you know, abstractly tell Kedro where my data is and what type it is.
29:52 So that catalog can load things from Pandas, Spark, Dask, from databases. There's a pretty long list of data sources that it can load from, so I don't need to change the template based on what my underlying data is or where it's stored.
30:07 That's really nice. A lot of times when you're thinking of data, the abstractions are you can switch between MySQL, Microsoft SQL Server and Postgres, not between Spark relational database or cloud storage or something like that. That's a lot of flexibility.
30:22 One really nice feature I like, that was added in the 0.16 series,
30:27 is that it's built on fsspec under the hood, so you can have data sitting on S3, GCP, or your local file system, and all you do is change your file path.
30:38 Maybe like a slight tweak to your requirements.txt. But other than that, Kedro knows how to load data into whatever object type you ask for.
30:48 Yeah, very nice. Also helps on the data science side. If you're not super familiar with remote Blob storage APIs, you don't have to learn that, right.
30:57 Which is good.
30:58 And another thing to maybe mention is that if you don't need the full Kedro template, if you don't need the pipeline and everything, you can use Kedro's catalog by itself. Okay. So maybe if you're just starting a project in a notebook and you might want to move it to Kedro later, you can start putting your catalog together as a Kedro catalog from the start.
31:21 Nice. And then maybe as you move it more into source files, like more of that's already done. Right.
31:26 You're closer to the destination and you can use their loaders and savers so that you don't have to write any sort of saving and loading code manually.
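To make the catalog idea concrete, here is a minimal plain-Python sketch. It is not Kedro's catalog API: `ToyCatalog`, `CSVEntry`, and `JSONEntry` are hypothetical names invented for illustration. The point it tries to show is that calling code asks for data by name, while the catalog entry alone decides the file path and format.

```python
import csv
import json
import tempfile
from pathlib import Path


class JSONEntry:
    """Hypothetical catalog entry that loads and saves a JSON file."""

    def __init__(self, filepath):
        self.filepath = Path(filepath)

    def load(self):
        return json.loads(self.filepath.read_text())

    def save(self, data):
        self.filepath.write_text(json.dumps(data))


class CSVEntry:
    """Hypothetical catalog entry that loads and saves rows as a CSV file."""

    def __init__(self, filepath):
        self.filepath = Path(filepath)

    def load(self):
        with self.filepath.open(newline="") as f:
            return list(csv.DictReader(f))

    def save(self, rows):
        with self.filepath.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)


class ToyCatalog:
    """Maps dataset names to entries, so calling code never touches paths."""

    def __init__(self, entries):
        self._entries = entries

    def load(self, name):
        return self._entries[name].load()

    def save(self, name, data):
        self._entries[name].save(data)


# Only this block knows where the data lives and what format it is in.
workdir = Path(tempfile.mkdtemp())
catalog = ToyCatalog(
    {
        "companies": CSVEntry(workdir / "companies.csv"),
        "model_params": JSONEntry(workdir / "params.json"),
    }
)

catalog.save("companies", [{"name": "Acme", "rating": "5"}])
catalog.save("model_params", {"test_size": 0.2})

companies = catalog.load("companies")
params = catalog.load("model_params")
print(companies, params)
```

Moving a dataset somewhere else in this sketch means editing one catalog entry; the code that calls `catalog.load("companies")` stays the same.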
31:35 All right. Next one is pipeline abstraction. Automatic resolution of dependencies between pure Python functions and data pipeline visualizations. And you all have a cool visualizer there. There's a whole lot of stuff going on here. I can zoom in here, but there's this really interesting visualizer of what's happening. Who wants to tell us what this one's about?
31:58 I will happily take this one, and I'll leave the coding starters to Yetunde.
32:03 I really want to take that one because I want to promote making more use of the pipeline API. I think this is probably one of the best things we've done, and I think it's probably one of the things that makes Kedro different from other tools. We treat each processing node as a pure function.
32:25 So what you need to do is just to write a pure function that you have inputs and outputs. You return stuff, and that's all you need to do. And then you need to announce that in a pipeline.
32:37 That, okay, I'm going to use that function. It will have those inputs from the catalog that Waylon was talking about, and they're just aliases to those references in the catalog. And then when I'm done with that function, I'll save the results to those data sets. And you don't need to know, okay, what's the order of execution or any of that. You just need to think locally; you need to think about the function you're dealing with. Maybe in this example that we have on the screen, which people who are listening might not be able to see, let's say you have an input to a function.
33:13 And let's say factory train.
33:16 In that example, that's your input. And then you want to remove the null columns, and that's your function.
33:22 And then let's say we called the output a clean factory input. Or what was it? So that's all you need to think about: how to solve that locally. You don't need to think about how that would fit in globally. And then once you add enough of those functions, those dependencies, because you announce them in your inputs and your outputs, will be figured out by Kedro. And then this graph, the visualization, will be drawn for you out of your code. And then you can use that for running your code in that particular order. And why I say that we are so proud of this is that using these pure functions and connecting them as pipelines gave us a lot of ability to reuse code and reuse parts of the pipelines,
34:10 without really having to take care of where the data is.
34:14 So you just work on a pipeline level and the connections, and then the data catalog will load and save things for you. And that made it super, super easy for us to scale the types of projects we can build. We started off with very small pipelines, and now we have, maybe Yetunde can talk more about this, projects which have hundreds of nodes internally in QB.
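The dependency resolution described above can be sketched with the standard library alone. This is an illustration of the idea, not Kedro's implementation: each function declares the names of its inputs and output, and a topological sort (here `graphlib.TopologicalSorter`, Python 3.9+) works out the execution order. The function `count_rows` and the dataset name `row_count` are made up for the example; `factory_train` and `clean_factory_input` echo the names mentioned in the conversation.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def remove_null_columns(factory_train):
    """Drop columns whose value is None in every row."""
    keep = [
        key
        for key in factory_train[0]
        if any(row[key] is not None for row in factory_train)
    ]
    return [{key: row[key] for key in keep} for row in factory_train]


def count_rows(clean_factory_input):
    return len(clean_factory_input)


# (function, input dataset names, output dataset name).
# Deliberately declared out of order: the order gets resolved below.
declared = [
    (count_rows, ["clean_factory_input"], "row_count"),
    (remove_null_columns, ["factory_train"], "clean_factory_input"),
]

# A function depends on whichever function produces one of its inputs.
producers = {output: func for func, _, output in declared}
graph = {
    func: {producers[name] for name in inputs if name in producers}
    for func, inputs, _ in declared
}
order = list(TopologicalSorter(graph).static_order())

# A dict stands in for the catalog; raw inputs are already "loaded".
data = {
    "factory_train": [
        {"temp": 20, "unused": None},
        {"temp": 22, "unused": None},
    ]
}
spec = {func: (inputs, output) for func, inputs, output in declared}
for func in order:
    inputs, output = spec[func]
    data[output] = func(*(data[name] for name in inputs))

print([func.__name__ for func in order])
print(data["row_count"])
```

Each function only knows about its own arguments and return value; the ordering falls out of the declared names.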
34:37 If you look at this, there's a lot going on here, and I really appreciate the idea of being able to just focus in on small pieces. It brings me back to your idea of talking about. Well, let's just write a PHP file and put start.
34:50 I'm going to start writing HTML, and I'm going to start writing some SQL query, and I'm going to write some more markup.
34:56 It's just all from scratch, and there's zero structure and zero support. So if you're going to do something like that, you're doing it all at once, all at the same time. Compare that to a framework like Flask: all you do is write the view method. You don't care about how the template gets rendered, you don't care about how the request comes in or matching the verbs to figure it out. I know when I get here, Flask got me here. I do the five lines of code I've got to do, and we're good. And I feel like this is real similar, right?
35:23 Yeah. Absolutely.
35:24 You just write a function. You write a function that says unify timestamp column name, or you write a function called remove null columns. You can definitely do that. That's not challenging. But if you look at this overall workflow, it looks like there's a lot going on here.
35:38 Yeah. And I really like that you brought up that analogy with web frameworks, with views and things like that, because the more we built this, the more similar it felt to me. I've done some Ruby on Rails before, and actually the pipeline sounded like the routes file, where you would have different routes. Like, okay, when you hit this URL, this is the action I would call. And the actions
36:00 here are our nodes, and what data is provided to them, and so on. All that.
36:05 So the data that's provided to it is basically the URL endpoints and maybe the post data and things like that. And then the output is your views.
36:14 For example, in a traditional framework. The only difference here is that your inputs and outputs are data. They're not URL data coming from the request. And the response is not a view but, again, something saved as data.
36:30 And there is one difference here: you have dependencies between different routes, whereas in web frameworks they're fairly independent. Right. It's a stateless thing, where here it's kind of still stateless, but you have dependencies between different parts of the route. And maybe here comes the real reason I wanted to talk about the pipeline: this abstraction, done this way, gives you a kind of algebra that you can use to combine pipelines. You can have pre-built pipelines and then add them together, or join them together, or remove things from them. Or, because it ends up being a directed acyclic graph, you can say, I want the sub-pipeline that produces that output, and it will remove everything else.
37:19 Okay. So maybe you're saying, I only care about the output of, say, this round timestamps function, and it could strip off a whole bunch of the other pieces because, well, all this other stuff is not involved in this part of the chain of the pipeline. Traversing the acyclic graph.
37:36 Yeah. Exactly. And you don't need to do anything. You just need to say pipeline two and then specify.
37:41 So we have a bunch of methods that you can use, and unfortunately they're underutilized; people don't really use them. And then when they ask, like, oh, how can I do that? It's like, there is one method call, and people are like, oh, that's so cool.
37:54 Maybe we need to improve our documentation on this one.
37:58 Yeah. Maybe we should go on a podcast and tell people about it. Yeah.
38:01 Yeah. I think that's why I wanted to use the opportunity.
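As a rough picture of the pipeline "algebra" described in this part of the conversation, here is a toy sketch in plain Python. It does not use or reproduce Kedro's pipeline methods: the tuple representation, the helper `sub_pipeline_to`, and the node and dataset names are all assumptions made for illustration. Combining pipelines is list concatenation here, and slicing walks the graph backwards from the requested output.

```python
# Each toy node is (name, input dataset names, output dataset name).
preprocessing = [
    ("unify_timestamps", ["raw"], "unified"),
    ("round_timestamps", ["unified"], "rounded"),
    ("remove_null_columns", ["raw"], "clean"),
]
modelling = [
    ("train_model", ["rounded", "clean"], "model"),
]

# "Adding" two pipelines is just combining their nodes in this sketch.
full = preprocessing + modelling


def sub_pipeline_to(nodes, target):
    """Keep only the nodes needed to produce the dataset named `target`."""
    producer = {output: (name, inputs, output) for name, inputs, output in nodes}
    needed, stack = [], [target]
    while stack:
        dataset = stack.pop()
        node = producer.get(dataset)  # None means a raw input, not a node output
        if node is not None and node not in needed:
            needed.append(node)
            stack.extend(node[1])  # now look for producers of this node's inputs
    # Preserve the original declaration order for readability.
    return [node for node in nodes if node in needed]


rounded_only = sub_pipeline_to(full, "rounded")
print([name for name, _, _ in rounded_only])
```

Asking for `"rounded"` drops `remove_null_columns` and `train_model`, because neither contributes to that output.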
38:04 Absolutely. You should. Yetunde, do you want to elaborate on some of these larger pipelines that you all have going? So it actually speaks a lot to the collaborative way that you can work once you're using Kedro pipelines.
38:14 Because now your team sessions can easily become something like, okay, I know I need to work on these three key functions because we know this is what we want the pipeline to do, and then you split out the work and work accordingly to produce those specific nodes and functions.
38:28 Also, one of the things that I'd like to call out about the pipeline abstraction is that you get Kedro-Viz for free on top of it.
38:35 It's a pipeline visualization tool. It's really cool because it gives you a kind of bird's-eye view of what's going on in the pipeline, so you can actually understand how different things are connected. Our users have used it in ways that we couldn't have imagined. It was originally created for being able to talk to non-technical users or stakeholders about the way that the code base was structured, instead of diving into code and showing them, hey, yeah, so the code works, because they'll be like, I don't know what's going on here. Yeah, they can see the diagram. But we also found our users will do things like debugging with Kedro-Viz, to find out, oh, something doesn't appear right in my pipeline, and then figure out what's going on there. And in some ways we've actually extended some of that functionality. So you'll see that there's now a code viewer for you to interact with your code, and we have some exciting things planned down the line. When we talk about the roadmap, we'll get to some of the work that we're doing with experiment tracking, which will extend Kedro-Viz a bit further. So nice. Yeah.
39:32 You can definitely tell if you've got a dependency mismatch and the order is wrong or something. You could see, oh, this one is supposed to be after that one. It's really nice. The visualizer is nice. You've kind of got this map thing you can cruise around on. For people who are listening, I'll definitely put this in the show notes so you can open it up and explore it. There's a pretty elaborate pipeline here to explore. Does this do anything at runtime, or is it just for visualizing the static structure?
39:57 It's actually just for visualizing the static structure.
40:00 We've tried to steer away from what we call the orchestrator UI, because that takes us a bit too far into that realm.
40:09 We prefer the orchestrators to play that part. So it's a static view of what's going on in your code base.
40:16 Now this is great. I feel like a lot of projects would benefit from this kind of stuff, not just data science things. Right? I'd like this kind of view of my code for other things as well.
40:25 And we have a lot of people using it that way. Kedro-Viz is also available as a React app, and people use it without Kedro. One of the most common use cases we've seen built on top of Kedro is data lineage, specifically column-level lineage that people want to visualize, so they end up using the React app for that. I also have a friend, he's actually one of the former maintainers on Kedro, who was playing a game where he needed to work out how to build.
40:56 I don't know, this game is called Factorio.
41:01 You basically look at how to build up your factory, I think, or something like that, and what to use for different elements in the factory, and he used Kedro-Viz to visualize what he should be doing in his factory.
41:12 So yeah, that's funny.
41:15 Then I guess you can visualize lots of things with it. How neat. All right. The next main feature here is deployment.
41:22 One comment I had on the pipeline: you get a task to do in your sprint, you sit down to work for the day, and if you're not in this Kedro-type framework mode, a lot of times it's like, okay, open the notebook or open the script, and I've got to run to a certain point to start my work because I've got to have that data in memory.
41:46 That's my option. Or I'm manually saving things along the way.
41:50 Both have their downsides. So a lot of times it's like, okay, I'm going to run the notebook and then I'm going to go grab coffee. And maybe when I get back, I can start my work. Kedro is saving each one of these points in the background along the way. So when I get a task and it's like, hey, you've got to put a node in between these two, I can start right away because the data is already sitting there.
42:12 Oh, that's cool. I can also use the pipeline DAG object, like Ivan mentioned, to just run the section of the pipeline I'm working on as I'm working. Interesting. It's a little bit like just rerunning the failing test, or just this one test, in the unit testing world, instead of trying to rerun the entire test suite for every little change. Yeah, that's a cool thing. You did talk about the deployment stuff before, so maybe you want to touch on some of it.
42:38 I guess it would just be a quick mention. What we support right now is two deployment plugins. The first one is Kedro-Docker, which packages your Kedro project in a Docker container. And the second one is Kedro-Airflow, which was built with the Airflow Astronomer team, and which will take your Kedro pipeline and convert it into an Airflow DAG so that you can run it on Airflow. But we also have, in our documentation, a few guides on how to deploy Kedro on Prefect, on Kubeflow, on AWS Batch, AWS SageMaker, and AWS Databricks as well.
43:10 I always feel like with AWS and Azure, no matter how much I study, there's like three more things that are similar to others there, but they're different. I know Batch. The one you named after that?
43:22 Yeah, that's definitely the case. I believe it's AWS Databricks, but you could use the same methodologies if you're working with Azure Databricks as well to deploy things. Ivan even alluded to the fact that we really pride ourselves on flexible deployment, because we don't know what your internal infrastructure is like, and therefore we should be able to support the most generalizable case. So you can definitely check out those guides. And if there are guides missing, just raise GitHub issues and we'll look at them as well. We have a growing tally of mentions of things that we hadn't necessarily heard of people using with Kedro.
43:54 Yeah, it's worth checking out, and it's open source. If people want to add new deployment stories, they can go ahead, and PRs are accepted. Is that true?
44:01 Yeah, write up the guide. We will take it.
44:06 We have a contributing guide, and that's available in our documentation too. It shows you how to make PRs, whether for features, minor tech improvements, or bug fixes, as well as documentation, because we like to write our docs.
44:20 Yeah, that's important.
44:21 And coming into October, it's Hacktoberfest, so people can come in, and I think we will be really, really grateful for more deployment guides, because no matter how many you have, you always run out. Or maybe put differently, no matter how many you have, you never have everything. There is always someone coming along, like, by the way, how do I deploy it on this AWS whatever, the new thing that they have? And it's like, how would I know? This is the first time I'm hearing about it. So there is always room for more.
44:53 And this is the thing that we would really love help with. If you want to find something to contribute to for this Hacktoberfest, maybe that could be Kedro.
45:03 Yeah, that'd be awesome. I periodically get Hacktoberfest contributions to, for example, the GitHub repos for my course projects. It would be a slight change in the wording: if it says, the Kedro documentation includes three examples to help you get started, they might say, to help you get started, the documentation contains three examples. There's a PR, so it counts. And I just have to go through and close them. So please, people listening: just a minor useful contribution. But yeah, it would be great to work on this. I feel like these types of pipelines are very accessible because they narrow the focus so much. When you get down to certain tasks, certain things, you don't have to understand the whole project, just how to do this other task slightly differently.
45:45 Yeah, absolutely. And talking about contributors: if it's just a change of words, we might not have a T-shirt for you. If you're adding guides, we will definitely send out a T-shirt.
45:56 That's awesome. T-shirts are included, not just passing Hacktoberfest. Okay, so one of the things I find a little tricky is talking through an example of these kinds of things. They're very neat, but they also sometimes feel pretty abstract. So Ivan, maybe you want to talk us through just this Hello World example. I know it's hard to talk about code on audio, so not exactly, just give us a sense of what it means to write one of these pipelines.
46:24 Starting from the first one: in the Hello World example, we show what a node is, and it's not very different from a function. It's just a function, actually. We accept two types of nodes.
46:38 Actually, three types of nodes: those that have only inputs, those that have only outputs, and those that have both inputs and outputs.
46:46 We don't accept functions that have neither inputs nor outputs, for obvious reasons, because that doesn't do much.
46:55 And you can start with your own function. Let's say you call it return greeting, which would just return Hello.
47:01 So let me elaborate on that for a second. What you have is a Python function that takes no parameters and returns a string.
47:09 But the thing that's notable about this is it doesn't have a decorator or some other special thing about it. It's literally just a bare Python function that has nothing to do with Kedro per se.
47:21 Yeah, absolutely.
47:23 The reason why we did it this way was to allow all kinds of people to just create functions. If you don't know decorators and these kinds of things, you don't need to know them. And the second reason is that you might actually use functions from libraries that were not at all designed to be part of our framework. If you're importing a function, you cannot add a decorator to it. Well, there are ways, obviously, in Python there always are, but it's not very intuitive. So we just work with pure functions.
47:52 How that gets turned into a node is actually more curious, and that happens through a helper function we have, which is conveniently called node. You provide your function, and you provide your inputs and outputs as strings. So these are the three things you can add. For example, for this return greeting function, your inputs would be None, because you don't have any inputs, and your output could be a string which says my salutation. That's how we've done it in the Hello World example.
48:25 So this will create a node in the Kedro sense, which you can embed later in a pipeline.
48:31 Here, the output is a static string. Is that like the name of the output, so you can use it in the pipeline? Not the value but the name, so you can refer to it later on. Right?
48:40 Yeah. Exactly. So you can think of them as variables, in a way. And where to store that variable, you can define through the data catalog later.
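The Hello World idea can be sketched in plain Python without installing anything. To be clear about assumptions: the `node` helper below is a toy stand-in written for this sketch, not Kedro's own `node` function, and the second function `join_statements` with its output name `my_message` is added only so there is something to wire together. What it tries to show is that the strings are dataset names, not values, and that the functions themselves stay bare.

```python
from collections import namedtuple

# Toy stand-in for a node: a function plus the *names* of its input and output.
ToyNode = namedtuple("ToyNode", ["func", "inputs", "outputs"])


def node(func, inputs, outputs):
    """Hypothetical helper mirroring the idea described above, not Kedro's own."""
    return ToyNode(func, inputs, outputs)


def return_greeting():
    # A bare function: no decorator, no framework imports.
    return "Hello"


def join_statements(greeting):
    return f"{greeting} Kedro!"


pipeline = [
    node(return_greeting, inputs=None, outputs="my_salutation"),
    node(join_statements, inputs="my_salutation", outputs="my_message"),
]

# A plain dict plays the role of an in-memory catalog.
catalog = {}
for step in pipeline:  # already in dependency order in this tiny example
    args = [] if step.inputs is None else [catalog[step.inputs]]
    catalog[step.outputs] = step.func(*args)

print(catalog)
```

Because `"my_salutation"` is only a name, where the value ends up (memory, a file, remote storage) can be decided separately from the functions.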
48:51 Yeah. Very nice.
48:52 So it seems super easy. You have also a more elaborate example. It says a spaceflight tutorial. Yeah.
48:59 What is this? We don't have to talk through it. Just, if people want to go and play with it, what is this one?
49:04 I really want Yetunde to introduce it, because initially when we were thinking of an example, I think she came up with the idea to make it about space flight.
49:12 And it was actually quite funny that this led to something more interesting that Yetunde can share.
49:20 I think it was actually Demetri, a former maintainer on the Kedro project. But the scenario for this tutorial is that it's the year 2160. You're somehow a data scientist predicting the price of space flight to the moon and back. And you have access to three data sources: information about companies that are flying people to the moon, reviews of the shuttles that they have, and the customer reviews that they've been given while working with those companies. And the whole thing is you just want to predict the price of a space flight. So if you go through the tutorial, you'll get acquainted with everything from beginner functionality,
49:53 install Kedro, set up my project template, all the way to just before intermediate functionality in Kedro. So you get up to speed in about an hour, an hour and a half in total, as you go through the full tutorial, and it will teach you all the basics: how do I use the project template? How do I use the data catalog? How do I construct my pipeline?
50:13 How do I visualize my pipeline? And how do I package my project as well?
50:17 So it's really useful for getting up to speed on that. But we've had a really great time with the spaceflight project, because we found out that a team at NASA was using Kedro. Oh, nice. When we discovered that, it was like we went full circle and went to the moon with that.
50:37 They're actually doing spaceflight. Amazing.
50:40 Yeah. I'm pretty sure they chose this only because of our tutorial.
50:45 We were thinking, Luigi, but then we saw this tutorial. We knew this was the one. I do love these imaginative examples and tutorials rather than something really boring. Let's build a to do list and here's how we're going to do it.
50:59 Okay, now this sounds really fun. So if people want to get a sense for what it's like to work with Kedro, you recommend this as the tutorial to work through to get started.
51:08 Definitely recommend this. And then there are a few online resources if you want to use them. We have a blogger;
51:14 he's been inactive for a while, but he still has really good YouTube tutorials. Search for Data Engineer One and look for a walkthrough of the tutorial there. It's also very handy for getting up to speed if you want a video kind of workflow as you go through the tutorial. But we are going to be piloting some livestream workshops of us working through the tutorial ourselves later in the year, so definitely do look out for that. I think probably the Quantum Black YouTube channel will have them, and then also.
51:42 Oh yeah. You can actually see he opened up at first.
51:44 So I got to get through the ad for people to see. I'm not logged in over here, but yeah, there we go.
51:49 That's Data Engineer One in action. And then I definitely recommend either joining in on those livestreams as we host them or watching the follow-up YouTube videos. Yeah, cool.
51:59 I'll link to some of the YouTube videos for people to go check out. Very nice. So we're getting pretty short on time. Ivan, maybe one thing we could do to wrap this up: I know you talked about some of the cool libraries and stuff you used to build this. Maybe you could talk a little bit about the internals and some of the fun things you use there.
52:16 Sure. So there are probably some good libraries that are worth mentioning. Maybe I won't spend too much time on Kedro's internals, but just give a shout-out to nice libraries that we found. One thing that Waylon already mentioned was fsspec. I think that was amazing, and we really found it super useful.
52:36 I think it is in the Anaconda ecosystem, developed by some of the people there. And the good news is it's also becoming part of Pandas.
52:49 So whenever you're loading a CSV in the newest version of Pandas, they use fsspec as well. Oh yeah. Now it says it's also in Dask, Pandas, even DVC, and many other things.
53:02 This has been really useful because it simplified a lot of our data set code. Previously we had a data set for S3, a data set for GCS, one for Azure Blob Storage, and all of that. It's super annoying because they do exactly the same thing, and you do it many times, just changing endpoints and things like that. When we had many of those data sets, it was super frustrating to maintain. And when this came out, it simplified things. Maybe it reduced our code base for data sets by three times or something like that.
53:39 Because they do that abstraction for you.
53:42 Pass it along.
53:43 Yeah. So if someone wants to treat a remote database as kind of like a local file, I think fsspec is really useful, too. Cool.
53:51 Let's see. Another one that you mentioned was Dynaconf.
53:56 That's quite a nice one. We started using it recently. Because Kedro is a framework, there is some framework code that needs to call user code, and that's a bit challenging because you don't know the package name of the user's code: the user will create their own code, they will choose a package name, and you don't know what to import from. So we came up with a pattern that is actually applied by Django, where you configure the project by its package name, and then you load some of the settings. So if people know Django, they know that it has an extensive way of doing settings in order to configure different things in Django.
54:39 How Dynaconf helped us with that is that it was a very extensive, very clean abstraction for lazy loading of settings.
54:48 And why did we need lazy loading of settings?
54:51 For example, you might have multiple pipelines in Kedro. One of them might not be completed; there could be some errors in it. But you still want to run the other pipeline.
55:02 If you eagerly load all of that, then your code will fail for no reason.
55:07 Right. Even if you weren't actually going to end up running that part.
55:11 It's almost like a compiled language versus a dynamic language. Yeah.
55:16 And here Python shines, because you can have things like Dynaconf where you don't need to compile that path.
55:22 So it helped us a lot with making those settings load lazily. And there are different things you can add, validators to validate settings, and so on. It's very extensive, so I recommend people read their documentation. You can use it for many things. So yeah, it's a nice package
55:40 that we stumbled on. Very neat. Those are great recommendations.
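The reason lazy loading matters here can be shown with a small plain-Python sketch. This is not Dynaconf's API and not how Kedro loads settings; `LazySettings` and the two loader functions are hypothetical. It only illustrates the behavior discussed: a broken pipeline definition should not stop you from running a different, working one.

```python
class LazySettings:
    """Toy lazy settings: each value is computed on first access, then cached."""

    def __init__(self, **loaders):
        self._loaders = loaders
        self._cache = {}

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, i.e. for our settings.
        loaders = self.__dict__.get("_loaders", {})
        if name not in loaders:
            raise AttributeError(name)
        cache = self.__dict__["_cache"]
        if name not in cache:
            cache[name] = loaders[name]()  # the loader runs here, not at startup
        return cache[name]


def load_good_pipeline():
    return ["clean", "train"]


def load_broken_pipeline():
    raise RuntimeError("this pipeline is not finished yet")


settings = LazySettings(
    good_pipeline=load_good_pipeline,
    broken_pipeline=load_broken_pipeline,
)

# Nothing has been loaded yet, so the broken pipeline causes no error here.
print(settings.good_pipeline)

try:
    settings.broken_pipeline
except RuntimeError as exc:
    print("only fails when actually requested:", exc)
```

With eager loading, constructing the settings object would have called both loaders and raised before any work could start.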
55:43 So with just a little bit of time left, let's wrap up our conversation with where things are going. Yetunde, maybe you want to give us a roadmap, a future view of what's coming for people who are using Kedro now.
55:55 I guess the next upcoming feature that you'll start to see being rolled out is experiment tracking in Kedro. Kedro is already, I think Waylon spoke to it, really good at saving your data sets. And for us, the data catalog applies to models as well, so we already had some form of model versioning in Kedro. And we already have the concept of parameters as inputs. But what we needed to do was think about how we handle features and how we handle metrics coming out of the pipeline. Those are the two additions that we've made, as additional data sets in the Kedro framework.
56:34 And then the last thing that we had to think around was how do we collect all of these things as one unit or one experiment?
56:40 That concept has actually been implemented on the framework side, so you can already start to interact with the experiment tracking functionality there. But a lot of the big changes are going to be on the front end, where you'll be able to look at the list of experiments that you've run and compare them. And then we'll be building up the functionality as we see fit, including probably MLflow model registry and model serving integration. And that's probably going to be done through our data catalog as well.
57:07 That's cool. So if I go to the trouble of training a model and it takes a day, I can store that, and other people can just pull it down and use it without spending another day.
57:17 Exactly. That's largely the thinking around it. So you can definitely look forward to that. There are also some open issues on our GitHub repository around configuration.
57:27 I do suggest, if you've interacted with Kedro and you've had issues with scaling configuration, please check them out and give us a comment there, because that will decide whether and when we pick those issues up, based on user responses there. So that's what I think you can look forward to.
57:43 Oh, fantastic. All right. Ivan, I think we're going to take the two libraries you mentioned as the notable PyPI projects, just for the sake of time, since we're kind of over. I'll just do one final question for everyone. If you're going to write some Python code, what editor do you use? Ivan, you want to go first?
58:00 IntelliJ. I come from the Java and Scala world, so I stick with IntelliJ.
58:05 So basically the Python support in, like, full on IntelliJ, right? Yeah.
58:10 And not PyCharm, but IntelliJ.
58:12 Yeah, go ahead.
58:13 Ivan says, like, no PyCharm. I'm full PyCharm.
58:18 Same here.
58:19 See arguments on the team about this.
58:21 Oh, is there? This is a point of contention. I see.
58:25 That's funny, because they're so similar, right? It's not like VS Code versus PyCharm. Waylon,
58:30 how about you?
58:31 I'm an avid Neovim user.
58:33 Part of my workflow as being a lead data scientist is I bounce between probably like a dozen projects a day between actual running pipelines. Or maybe it's like a couple of our internal libraries that help those things run.
58:49 And it's really nice to have something lightweight that can run with pretty low resources.
58:55 Also, having it running in tmux makes it easy: with a few keystrokes I can go into a specific project. The editors all tend to look the same, and when you have a bunch of projects looking the same, it's very easy to edit the wrong one.
59:11 All right.
59:12 Good recommendation. All right. Well, thank you all for being here. Maybe final call to action. People want to get started with Kedro, bring it into the organization, try it out. What do you all say?
59:22 Well, you know how to get started: get into the spaceflight tutorial, and then just shout if you have any issues. We're up on Discord, and we also have a GitHub discussions page. So you can just flag it, and we help users at whatever level they find themselves. So definitely do that.
59:36 And I guess Quantum Black does consulting for Kedro. So if people have these projects and they're like, I'm not sure we can handle this ourselves, they could probably hire you all, right?
59:45 We've never quite had that level of interest. A Quantum Black data science and data engineering team will go out and use Kedro as part of a larger engagement around a business problem, and you can definitely learn Kedro that way. But if you're in the open source community, it would definitely be through the channels we have available.
01:00:08 One thing to mention is if people want to engage with Quantum Black because they know about Kedro.
01:00:13 Please mention the Kedro team so that we get the credit.
01:00:24 Awesome. Yeah, I bring that up because there are different ways to support open source. Right? There's the MongoDB model, where they sell MongoDB as a service with Atlas and all that. There's Tidelift, there's GitHub support. But here's yet another way in which this project is being grown and supported, because it's supporting you all doing your work.
01:00:46 Yeah. I mean, if that's the case, then I'll definitely say we do really want to be able to help a lot of people in the industry as well, because we know that it's needed, and we obviously recognize that as a framework, especially in the data science space. We are a bit of a first mover, so we suffer a lot of first mover pains where people are like, why on Earth do I need a framework? I don't need a framework.
01:01:07 So if you can help us break through those barriers, please go for it and be an advocate. And I guess, in that sense, be a Kedroid.
01:01:15 Right on.
01:01:16 Yetunde, Waylon, Ivan. Thank you all for being here. It's been great to learn about Kedro and good to chat with you.
01:01:22 It's been awesome. Thank you so much.
01:01:23 Thank you. Bye.
01:01:24 Thanks, Michael.
01:01:25 This has been another episode of Talk Python to Me. Our guests on this episode have been Yetunde Dada, Waylon Walker, and Ivan Danov, and it's been brought to you by Tab Nine and us over at Talk Python Training, and the transcripts are brought to you by 'AssemblyAI'. Supercharge your editor's autocomplete with Tab Nine, the editor add-in that uses AI to learn your coding styles and preferences and make you even more effective. Visit 'talkpython.fm/tabnine' to get started. Do you need a great automatic speech-to-text API? Get human-level accuracy in just a few lines of code. Visit 'talkpython.fm/assemblyAI'. Want to level up your Python? We have one of the largest catalogs of Python video courses over at Talk Python. Our content ranges from true beginners to deeply advanced topics like memory and async, and best of all, there's not a subscription in sight. Check it out for yourself at 'training.talkpython.fm'. Be sure to subscribe to the show. Open your favorite podcast app and search for Python. We should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /Play, and the direct RSS feed at /RSS on 'talkpython.fm'.
01:02:34 We're live streaming most of our recordings these days. If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at 'talkpython.fm/youtube'.
01:02:46 This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it. Now get out there and write some Python code.