#330: Apache Airflow Open-Source Workflow with Python Transcript
00:00 If you're working with data pipelines, you definitely need to give Apache Airflow a look.
00:03 This pure Python workflow framework is one of the most popular and capable out there.
00:08 You create your workflows by writing Python code using clever language operators,
00:12 and then you can monitor them and even debug them visually once you get them started.
00:16 So stop writing manual code or cron job-based code to create data pipelines and check out Airflow.
00:21 And to do that, we have three great guests from the Airflow community.
00:25 Jarek Potiuk, Kaxil Naik, and Leah Cole.
00:28 This is Talk Python to Me, episode 330, recorded August 5th, 2021.
00:33 Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities.
00:53 This is your host, Michael Kennedy.
00:55 Follow me on Twitter where I'm @mkennedy.
00:57 And keep up with the show and listen to past episodes at talkpython.fm.
01:01 And follow the show on Twitter via @talkpython.
01:04 This episode is brought to you by us over at Talk Python Training.
01:08 And the transcripts are brought to you by Assembly AI.
01:11 Leah, Jarek, Kaxil, welcome to Talk Python to Me.
01:16 It's good to have you all here.
01:16 Thanks for having us.
01:17 Thank you.
01:18 Yeah.
01:18 It's really fun to be talking about Airflow.
01:20 These are the types of tools that I think they don't get that much awareness, but they're the kind of thing that can be the real backbone of a lot of teams, a lot of organizations, and so on.
01:31 So I think that'll be super fun to dive into.
01:34 And we'll all learn a lot.
01:35 And I suspect a lot of people listening will realize, oh, here's a whole class of tools.
01:39 I didn't even realize I should have considered to solve my problems.
01:42 But before we get down to that, let's start with your stories.
01:45 Leah, you go first.
01:46 How did you get into programming and Python?
01:47 Python was the first language that I learned.
01:51 I do have a bachelor's in computer science.
01:54 And the school I went to, that is the language that Intro to CS is taught in.
02:00 I am so jealous.
02:01 My Intro to CS class was in Scheme, which is a derivative of Lisp, which didn't seem that practical.
02:07 And then I was told I had to learn Fortran.
02:09 It would be the most useful language I'd ever learn.
02:11 Neither of which turned out to be true.
02:13 I wish I learned Python.
02:14 So thanks, Carleton College, Northfield, Minnesota, for giving me Python early.
02:19 And yes, I loved Python from the beginning.
02:22 You asked how I got into programming.
02:24 So I actually do have a parent in tech.
02:26 It is my dad.
02:28 And he tried to get me into programming a lot earlier.
02:30 And like a true teen, I said, absolutely not.
02:34 Because it was suggested by my dad.
02:36 It really wasn't until I got to school.
02:39 And I heard people say that Intro to CS was a fun elective.
02:43 For those only listening, that's totally in quotes, as I'm saying it, that I decided to take it.
02:49 And it turned out I really liked it.
02:51 And I decided to pivot from being a math major, which wasn't going very well, to being a computer science major.
02:57 Yeah, that's fantastic.
02:58 I was also a math major.
02:59 And I find the programming side, a lot of the same skill set you have to use.
03:04 Like the thinking through problem solving.
03:06 You have these constraints or axioms in math.
03:09 And you work from them.
03:11 But in math, you just come up with sort of like the next idea that is the next problem that is the next idea.
03:16 And in computers, you build stuff that people use.
03:19 Exactly.
03:19 And it's such a difference, I find.
03:21 It's puzzles is the programming.
03:23 And that was always the part of math that I liked.
03:25 I never liked the writing proofs or the theoretical side of things.
03:30 I just wanted to solve puzzles with logic and rules.
03:33 Yeah, fantastic.
03:34 Well, it sounds like you've landed in the right spot.
03:36 That's awesome.
03:36 I'm doing okay.
03:37 Jarek, how about you?
03:39 Yes, you talked about your first language.
03:41 So my first language in computer science during the studies was, I think, Delphi or Pascal.
03:48 I can't even remember that.
03:50 But actually, the first language I started programming in for real work was, listen to that, COBOL.
03:56 So I tend to joke that when I'm retiring, I will be a very well paid, five-hours-a-week COBOL programmer.
04:07 Because nobody else will know it.
04:09 You're going to keep the trucks delivering and the warehouses open.
04:12 Exactly.
04:13 Beautiful.
04:14 Just five hours a week.
04:17 Yeah, that's super cool job, I think.
04:20 But then Python is actually quite new in my portfolio, let's say, of languages.
04:24 I've learned it maybe six years ago.
04:27 And with my experience and years of working in CS, it's relatively late.
04:33 But I loved it from the first glance.
04:35 I used to work in like C, Java, C++, hundreds of other, a lot of other languages.
04:41 But Python was just super easy from the start and super nice tool and super, super friendly.
04:47 Like, after years of programming in Java, I was like, oh, in one line you can do what I would do in five pages of Java code.
04:57 Yeah.
04:57 Yeah.
04:58 That's cool.
04:58 And you can understand it as well.
05:00 Yes.
05:00 Yes.
05:02 So, yeah.
05:03 So, yeah.
05:03 I fell in love immediately.
05:04 And this is my absolutely favorite language right now.
05:08 Same here.
05:09 Kaxil?
05:09 Yeah.
05:10 For me, I did my bachelor's in electrical engineering.
05:13 So, didn't do anything over there.
05:15 But when I came in the UK to do my master's, we were taught R language and Java.
05:21 One fine day, we were ending the college in just a month or two.
05:26 And there was a presentation from someone in the university who was telling us how to use data science in the industry and everything.
05:32 They said, you should know Python.
05:34 And you're like, oh, but we were not taught Python.
05:37 And we are just one or two months away from doing our internships and everything.
05:41 And we don't know Python.
05:42 So, that's when I started looking into Python.
05:45 I got an internship.
05:47 And then I actually started learning more of Python.
05:50 So, this was 2016, I'm talking about.
05:53 And, yeah, since then it has been a wild, wild ride.
05:56 I have written a lot of Java, a lot of R.
05:58 But Python seems to be very easy to write, easy to understand.
06:03 Plus, the community behind it and the packages behind it are so vast that you can use it for anything.
06:09 And basically, yeah.
06:10 I saw a funny t-shirt once that said, I learned Python.
06:13 It was a great weekend.
06:13 Which I think is really funny, right?
06:16 Because on one hand, yeah, sure, you can go through.
06:20 And actually, the language is simple, especially if you know something else that's like Java or C++.
06:24 Oh, this is a breath of fresh air, right?
06:26 But on the other hand, I've been doing this for a long time, all day, every day.
06:30 And I'm still learning Python every day, right?
06:32 So, it's a really interesting juxtaposition of, like, you can learn the language really easily.
06:36 But then there's the standard library.
06:38 And then there's 300,000 PyPI packages.
06:40 Like, Airflow is just one of them.
06:42 And that's our whole topic today, right?
06:44 So, it's kind of both, right?
06:45 Yeah.
06:45 And the language keeps growing.
06:47 So, you've got to keep track of the cool new things that are released.
06:50 And things that are true in Python 2.7 are definitely not true today with Python 3.10.
06:57 It's grown a lot.
06:58 It definitely has.
07:00 And I saw that Airflow is not supporting the older versions of Python, basically, as they get deprecated.
07:05 So, yay for that, right?
07:06 Yeah, we have actually, you know, very, very strong rules following the Python release rules.
07:12 So, we've learned from what Python learned on the release schedule.
07:16 And we just follow it very, very closely with, like, when we support and when we stop supporting a Python version.
07:22 That makes a lot of sense.
07:23 Yeah, it's difficult to maintain compatibility between Python 2 and 3.
07:27 That's a lot of overhead.
07:28 Actually, we, yes.
07:30 I have nightmares about it, cherry picking all the stuff from the main branch to the old release branch and adding Python 2 support.
07:37 But Kaxil, you cannot complain.
07:39 I mean, we both, but Kaxil did that a lot.
07:41 And thanks to that, we've been several times top committers on Apache organization.
07:48 Like, there is a, like, this week, most commits made.
07:51 And that was us doing cherry picks between the Python 3 version and the 2.7 version.
07:56 And once, we made it both at the same time, top committers on Apache.
08:01 Fantastic.
08:02 I'm mad at GitHub.
08:03 GitHub does not count commits on any other branch except main or master.
08:07 Oh, not fair, not fair.
08:10 We'll have to do our own visualizations.
08:13 Exactly.
08:14 Yeah, exactly.
08:15 All right.
08:16 Before we move on, out in the live stream, Hawaii girl says Python is awesome, like all of us.
08:20 Yes, definitely.
08:20 Thanks for being here.
08:21 All right.
08:22 Well, let's start this at a slightly higher conversation than just Airflow.
08:27 So Airflow is one of these workflow management frameworks.
08:31 Whoever wants to take this, what is that?
08:34 Why do I need that?
08:35 When do I need that?
08:36 What are these tools, as I hinted at the beginning?
08:38 I want to walk through some of the history, though, like in 2014, 2015, where like data engineering
08:44 was not mainstream and everyone was just using Cron for scheduling their task.
08:51 And then there came Luigi, where people were using XML and those sort of languages to write
08:58 their DAG workflows to make sure that the task runs on schedule.
09:02 Yeah.
09:03 So DAGs directed acyclic graphs?
09:06 Yes.
09:06 Okay.
09:07 Yeah.
09:08 Let's be very clear.
09:09 You cannot have a circle for tasks.
09:12 The dependencies cannot be that.
09:15 People have tried that.
09:16 It takes a long time to finish those.
09:18 Yes.
09:19 A little bit.
09:19 A long time.
09:20 Some of those DAGs are still running.
09:22 So it takes a long time for them to start sometimes, too.
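The "no circles" rule can be illustrated with a minimal sketch using only the Python standard library (Python 3.9+). This is not how Airflow itself validates DAGs; the task names and the `run_order` helper are hypothetical and exist only for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def run_order(dependencies):
    """Return a valid execution order, or raise ValueError if the tasks form a cycle.

    `dependencies` maps each task name to the set of tasks it must wait for.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # The second element of exc.args is the list of nodes forming the cycle.
        raise ValueError(f"not a DAG, cycle found: {exc.args[1]}") from exc


# Hypothetical three-step pipeline: extract -> transform -> load
pipeline = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
order = run_order(pipeline)
```

A dependency map such as `{"a": {"b"}, "b": {"a"}}` has no valid order, so `run_order` raises instead of returning a schedule that could never finish.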
09:27 But yeah, I think to complete the history part, people just got bored writing the XML syntax.
09:34 And it's difficult to understand.
09:36 Similar to what we were talking about, Java and Python.
09:39 Like Python is much easier to read, easier to understand.
09:42 There came Airflow.
09:44 Maxime wrote Airflow in his time at Airbnb and open sourced it to Apache Software Foundation.
09:50 And that had a sigh of relief for people working on Luigi and others as well.
09:55 Because then you could write your workflows in an easy to understand language that you're already very familiar with.
10:01 You don't need to write those XMLs.
10:04 And who loves writing XMLs, first of all?
10:09 And so it's easy to understand.
10:11 Just configuration as code.
10:13 And there was also, I think, a slight move towards everything as code.
10:18 Infrastructure as code with Terraform and Ansible and whatnot.
10:22 And Airflow was just a perfect tool for workflow as code or DAGs as code.
10:28 And since, I think, 2016 to 2018, Airflow's popularity has skyrocketed with the advent of, like, separate specialized data engineering field.
10:38 Previously, I think software engineers used to do everything.
10:41 But then people or companies also realized that it's a separate field.
10:45 It's a lot of work.
10:46 You also can't just bring in a machine learning engineer and let them do everything.
10:51 It's a separate data engineer's job to write a pipeline, know how to handle the data, and everything from start to finish.
10:59 There are thousands of things that, first of all, your cron expressions or cron alone cannot handle.
11:05 The task dependencies, the SLAs and whatnot.
11:09 So I think that's when, with the advent of data engineering and people realizing the importance of data, Airflow's popularity grew massively around 2018,
11:19 which also, by the way, coincided with where Airflow became the top-level project in Apache Software Foundation.
11:25 Until then, Airflow was just an incubating project in ASF.
11:29 And then it became a top-level project.
11:32 And that was a big milestone for Airflow and the community.
11:35 I think data engineering is really interesting because a lot of people, when they think of,
11:39 well, what are the divisions of what you do with programming, you know, especially in Python?
11:43 Well, we've got, like, web programming, to some degree, UI programming, and then we've got data science.
11:49 Sort of web and data science are the two, but there's this middle ground where I feel like
11:54 people kind of don't want to go there, or that's the data, right?
11:57 You want to make sure if you get a bunch of data and you feed it to your model,
12:00 your model is only as good as the data you get, right?
12:03 If you're trying to automate some ingest of data or warehousing reporting,
12:08 it's only as good as the reliability of the data coming in, the accuracy, right?
12:13 We've got things, what is it, Great Expectations and stuff like that for testing, actually testing
12:18 against the data, not the code that works with the data.
12:22 Yes.
12:22 And let me just add, because also Airflow is really an orchestrator.
12:27 So, like, I used to sing in the choir for many years.
12:30 And for me, this is really, like, this parallel between the conductor and the team playing.
12:35 We don't do stuff in Airflow.
12:37 Airflow doesn't do stuff.
12:38 It just tells others what to do.
12:40 And there's data processing stuff.
12:43 So, like, we don't know how, basically, we as data engineers, because, like, we are actually,
12:48 you know, data software engineers writing things for data engineers.
12:52 So, when we think about, like, this cross of this both, like, software engineer and data
12:56 engineer, so we don't know how to actually make a model, machine learning model.
13:00 Or we don't know machine learning.
13:02 We don't know how to, even we don't know how to do MapReduce, right?
13:06 I mean, if you want to process a lot of data, but we know what to do with the data when it
13:11 comes, who should do next, what, and what, how to pass it somewhere else.
13:15 And we can make it super complex to define or complex in terms of composed of many, many
13:22 different steps in different relations.
13:24 But Airflow makes it super easy to manage the whole thing so that it runs smoothly and you
13:29 can operate it and you don't, and you can deal with any problems that arise on the go.
13:34 So, Jarek, at this point, I want to expand on it real quick.
13:38 There's a very human aspect to the workflow orchestration that I think both you, both
13:43 Kaxil and Jarek have touched on, which is that having a workflow orchestrator really enables
13:49 you to move from having, like, the data scientist in their silo working on this pipeline alone
13:55 to having a whole team of data scientists and data engineers working together.
14:00 Because you have really specialized folks who can work on building those models.
14:04 And that might not be the same group of people that's figuring out how to get the data from
14:08 A to B and making sure that it's healthy and is what the model and the data scientists are
14:14 expecting.
14:15 So I think it just enables a lot more collaboration and helps you have more specialists working together.
14:22 Yeah, it becomes that well-known, well-tested way to flow data down into the specialties that people need, right?
14:30 Exactly.
14:33 Talk Python to Me is partially supported by our training courses.
14:35 At Talk Python, we run a bunch of web apps and web APIs.
14:39 These power the training courses as well as the mobile apps on iOS and Android.
14:43 If I had to build these from scratch again today, there's no doubt which framework I would use.
14:48 It's FastAPI.
14:49 To me, FastAPI is the embodiment of modern Python and modern APIs.
14:54 You have beautiful usage of type annotations.
14:57 You have model binding and validation with Pydantic.
14:59 And you have first-class async and await support.
15:02 If you're building or rebuilding a web app, you owe it to yourself to check out our course,
15:06 Modern APIs with FastAPI over at Talk Python Training.
15:10 It'll take you from curious to production with FastAPI.
15:14 To learn more and get started today, just visit talkpython.fm/FastAPI or email us at sales@talkpython.fm.
15:21 One of the things you do in these types of frameworks is you build these tasks, right?
15:28 Give us an idea of what some of the tasks look like.
15:31 And you actually have a whole bunch of, would that be the integrations in there?
15:35 Or is that something different?
15:36 Providers.
15:37 That's the name that we are using in Airflow 2.
15:40 Yes.
15:40 So we have like more than 70 of those right now.
15:43 We do for 70 services we talk to, external services or databases or whatnot.
15:50 70 entities.
15:52 But within that, we have several hundreds of these so-called operators or sensors
15:58 or transfer operators, which perform the task.
16:02 And they're actually super easy.
16:04 It's just one method.
16:05 Execute.
16:06 That's it.
16:06 Right.
16:07 That's pretty much it.
16:08 Yeah.
16:08 There's the three things, the sensors, the operators, the transfers.
16:11 Like an example of a sensor might be waiting to see if an object is in S3 or in Google Cloud
16:18 storage.
16:19 And a transfer is moving something from A to B.
16:23 And an operator, those are the ones we probably have the most of, right?
16:27 Jarek and Kaxil?
16:28 And that, it can be anything in a service.
16:31 I don't know, like starting running.
16:33 So I work in Google Cloud, so the operators I'm most familiar with are the Google ones.
16:37 So like spinning up a Dataproc cluster and then running a Spark job on it or running something
16:44 on a Kubernetes pod.
16:45 There, yeah.
16:46 If you can dream it, either there is an operator for it or you can write an operator for it.
16:51 Yeah.
16:51 Yeah.
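The three kinds of task described above (operator, sensor, transfer) can be sketched in plain Python. In Airflow itself, operators subclass `BaseOperator` and implement `execute(context)`, and sensors implement `poke(context)`; the `MiniOperator`, `KeySensor`, and `CopyOperator` classes below are hypothetical stand-ins that only mimic that shape, with dictionaries standing in for storage buckets.

```python
import time


class MiniOperator:
    """Stand-in for an operator base class: subclasses implement execute()."""

    def __init__(self, task_id):
        self.task_id = task_id

    def execute(self, context):
        raise NotImplementedError


class KeySensor(MiniOperator):
    """Sensor-style task: poll until a key appears, or give up after `timeout` seconds."""

    def __init__(self, task_id, store, key, poke_interval=0.01, timeout=1.0):
        super().__init__(task_id)
        self.store, self.key = store, key
        self.poke_interval, self.timeout = poke_interval, timeout

    def poke(self, context):
        # One cheap check; execute() repeats it until it succeeds.
        return self.key in self.store

    def execute(self, context):
        deadline = time.monotonic() + self.timeout
        while not self.poke(context):
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{self.key!r} never arrived")
            time.sleep(self.poke_interval)
        return True


class CopyOperator(MiniOperator):
    """Transfer-style task: copy one object from a source store to a destination store."""

    def __init__(self, task_id, source, dest, key):
        super().__init__(task_id)
        self.source, self.dest, self.key = source, dest, key

    def execute(self, context):
        self.dest[self.key] = self.source[self.key]
        return self.key


# Dictionaries stand in for a landing bucket and a staging bucket.
landing = {"report.csv": "a,b\n1,2\n"}
staging = {}
KeySensor("wait_for_file", landing, "report.csv").execute(context={})
CopyOperator("landing_to_staging", landing, staging, "report.csv").execute(context={})
```

Each task does one thing behind a single `execute` method, which is what lets an orchestrator run, retry, and monitor the tasks without knowing what is inside them.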
16:52 When I started with Airflow back in 2017, we used Airflow for the same reason.
16:57 Like Airflow was designed for being a classic ETL tool or being an enabler of sorts.
17:02 So a lot of companies were migrating from on-premises to cloud.
17:06 We were doing a project in partnership with Google to move customers' data to cloud.
17:11 And we were using NiFi for data to be on GCS.
17:16 But from there, everything was orchestrated by Airflow.
17:20 So once the data lands in Google Cloud Storage, then there's classic ETL, that extract, transform,
17:26 load.
17:26 From GCS, it goes to BigQuery.
17:29 BigQuery does some manipulations and the data goes to, like, there's a dashboard, Data Studio,
17:36 that shows a rich dashboard behind it.
17:38 And this is all managed by Airflow.
17:41 And it was so easy because we separated this into tasks and we were using all the hooks and
17:46 operators that Leah and Jarek were talking about, which was like the GCS to GCS operator, to move
17:52 the data from the landing area to staging.
17:55 So your landing area remains untouched.
17:57 So you can verify with your vendor that the data is as received, even in the future.
18:03 And then there was the BigQuery operator to run SQL queries.
18:07 And then there are other operators for different GCP services.
18:11 So I think with Google, there was already a good amount of integrations back three, four
18:16 years back.
18:16 Similarly for Spark and other operators.
18:19 Yeah.
18:20 One of the things that stands out to me that might be really useful here is if something
18:24 goes wrong.
18:25 You know, you talked about the contrast being cron jobs or something like that.
18:29 And if something goes wrong with that, or you need to scale out across different machines
18:33 or whatever, and how do you get those timings right or other weird things?
18:37 So what's the mechanism for dealing with, you know, I'm going to get some data.
18:41 It drops in the cloud.
18:42 I'm going to pull it over.
18:43 But then maybe it's invalid data or something.
18:45 What's that look like?
18:47 So at least for Airflow, all the operators that were written previously or the idea behind
18:53 them was that the task, a single operator or a single task, should be idempotent.
18:56 So even if you run them multiple times, it should produce the same result.
19:00 So if a task for whatever reason fails, you could add more retries to it.
19:05 There's a retry parameter that the base class takes.
19:08 And you could say retries is four, retries is five, and Airflow will handle that for you.
19:13 So if a task fails, it will rerun it that number of times.
19:16 Right.
19:16 It could fail because the database server is down, or it could fail because it's never going
19:20 to work, right?
19:21 It could be either one.
19:21 Exactly.
19:22 And you want to be notified as well.
19:24 So then we had all those on_failure_callback, on_success_callback, those emails get sent out
19:30 saying the data didn't arrive at all or whatever the reason may be.
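The combination of idempotent tasks, retries, and failure callbacks can be sketched in plain Python. This is a simplified illustration, not Airflow's scheduler: `run_with_retries` and `load_daily_total` are hypothetical, and here the failure callback receives the exception, whereas Airflow passes its callbacks a context object.

```python
def run_with_retries(task, retries=0, on_failure_callback=None, on_success_callback=None):
    """Call `task` up to retries + 1 times, then report success or final failure."""
    last_error = None
    for _attempt in range(retries + 1):
        try:
            result = task()
        except Exception as exc:  # a real runner would also wait between attempts
            last_error = exc
            continue
        if on_success_callback:
            on_success_callback(result)
        return result
    if on_failure_callback:
        on_failure_callback(last_error)
    raise last_error


warehouse = {}
calls = {"count": 0}


def load_daily_total():
    """Idempotent load: writes by key, so rerunning it cannot duplicate data."""
    calls["count"] += 1
    if calls["count"] < 3:
        # Simulate a database that is down for the first two attempts.
        raise ConnectionError("database unavailable")
    warehouse["2021-08-05"] = 42
    return warehouse["2021-08-05"]


alerts = []
result = run_with_retries(load_daily_total, retries=4, on_failure_callback=alerts.append)
```

Because the task overwrites a single key instead of appending, running it again after a success leaves `warehouse` unchanged, which is what makes blind retries safe.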
19:34 There is even more to that because we also have the mechanism of backfilling the data.
19:38 So even in this case, it's not like not a server failure, but your data has improved.
19:43 Because you've got a new metadata and you want to reprocess the data you've already processed
19:48 for like last week or only process part of the data because it takes a lot of time.
19:54 And you know that the data up to a certain point is good, but then you have to process just part
19:59 of your workflow or part of your DAG for the last week.
20:03 You can do that with Airflow.
20:05 So you can just tell, make a command, run a command, just reprocess me that data for
20:10 this period of time, starting from this task, because this is where we know we have to reprocess
20:16 the data because the data has been cleaned up, for example.
20:18 Right.
20:19 You don't have to detect it.
20:20 You don't have to copy it down.
20:21 You've changed it locally and you want it to get fixed.
20:23 I see.
20:24 Okay.
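The backfill idea, reprocessing a date range starting from one task, can be sketched as a small planning function. It assumes a strictly linear pipeline and daily runs; the task names and `backfill_runs` are hypothetical, and Airflow's own backfill handles arbitrary DAG shapes and schedules.

```python
from datetime import date, timedelta


def backfill_runs(task_order, start, end, from_task):
    """Yield (run_date, task) pairs to reprocess.

    Covers every day in [start, end], beginning at `from_task` and
    continuing through all tasks after it in `task_order`.
    """
    first = task_order.index(from_task)
    day = start
    while day <= end:
        for task in task_order[first:]:
            yield day, task
        day += timedelta(days=1)


tasks = ["extract", "clean", "aggregate", "report"]
# The source data was fixed, so rerun the last week from "clean" onward.
plan = list(backfill_runs(tasks, date(2021, 7, 29), date(2021, 8, 4), "clean"))
```

The `extract` step is skipped for every day because the data up to that point is known to be good; only the downstream steps are rerun.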
20:24 And the super cool thing there is that this can be done by one person who doesn't know
20:28 what those tasks are doing at all.
20:30 Like they are just all the language of how the tasks are written.
20:34 The specification is written in the way that anyone can do that.
20:37 And then this person operating can very safely just rerun parts of it and be sure that what
20:42 comes out at the end is just what they are expecting.
20:45 And if you have like hundreds and thousands of DAGs written by, you know, tens and twenties
20:50 or hundreds of people, just one person can sit down and operate the whole
20:55 of it without understanding a single thing about how it works inside.
20:59 But knowing, with seeing what happened, this is like so powerful.
21:03 Yeah.
21:03 Part of Earth.
21:04 It lets you focus on just the steps and not how all the bits fit together, right?
21:08 So yeah, let's focus on a couple of things on the website here that I think are maybe worth
21:13 calling out.
21:14 One of the things here is that the project has four principles that are really nice.
21:21 Maybe you want to highlight those for people?
21:22 Yeah, I think.
21:24 Okay.
21:24 So the four principles, it's that Airflow is dynamic, extensible, elegant, and scalable.
21:30 And I am going to go ahead and pick my favorite one right here.
21:34 And it's one that we've kind of touched upon without spelling out clearly, which is that
21:38 Airflow is extensible.
21:39 Jarek talked about how we have these 70 plus providers, these various integrations with all
21:47 kinds of services from the big cloud providers to things like Slack, Snowflake, which I know
21:53 are also kind of big, to much smaller ones.
21:56 And if a provider doesn't exist or if an operator doesn't exist for a task that you need to perform,
22:02 you can write it and you can either write it and be running it in your instance of Airflow.
22:08 Or if you're being a good steward of open source, you can write it and contribute it back to the
22:12 community.
22:13 So other people who need to do that task can also benefit from what you've already figured out.
22:19 Yeah, that's really neat.
22:20 So a lot of these would be things down here, like if only one person has to write,
22:26 how do I connect to Hadoop?
22:28 Or if you go to airflow.apache.org, or you go to the bottom, there's all these different--
22:33 are these the operators, or what are these?
22:35 Or the tasks?
22:36 Those are integrations, integrations with the different services you have.
22:41 So Google, for example, is a big provider, but it consists of integration like Google Cloud,
22:45 CMS, data store, machine learning.
22:48 So you have a number of integrations per provider, even sometimes.
22:52 OK, cool.
22:53 And Leah, if I was going to create one of these, if I was going to be a good citizen, I'm like, oh,
22:57 I want to create one with AWS Lambda.
23:00 That exists.
23:00 But something like that, right?
23:01 Yeah.
23:02 Does that get contributed back to Airflow?
23:04 So when I pip install Airflow, does that come with it?
23:06 Or is there some external way to bring in--
23:09 Yes.
23:10 We do actually-- well, I'll have to double check with Jarek and Kaxil, because I know we've been messing around
23:14 with how we do the installs lately.
23:16 So it used to be that Airflow operators were packaged along with Airflow.
23:22 And when you did pip install Airflow, you would get everything.
23:25 And I think that you do still get a certain number of base operators that are kind of like provider agnostic that come with Airflow.
23:34 But the way we have things now is that all of these provider-based operators, sensors,
23:42 all these provider task things are packaged separately.
23:45 And you add them just like you would any other kind of Python package.
23:48 Right.
23:49 So for example, if you want to install the Google Cloud operators, you have that separately.
23:53 And the advantage of that is that they're released on a separate release schedule.
23:58 And follow versioning that ensures they're compatible with versions of Airflow.
24:03 And they're very clear about that.
24:06 And it's a lot easier for Airflow users to upgrade just the providers package than it is to upgrade the entirety of Airflow,
24:15 which for folks running in production, that is not always feasible or practical.
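Since providers ship as ordinary Python packages, one way to see which ones an environment has is to look at installed distribution names. The sketch below assumes the community providers follow the `apache-airflow-providers-<name>` naming used on PyPI (for example `apache-airflow-providers-google`); `installed_providers` is a hypothetical helper, not part of Airflow.

```python
from importlib.metadata import distributions

# Assumed naming convention for community provider packages on PyPI.
PROVIDER_PREFIX = "apache-airflow-providers-"


def installed_providers():
    """Return {package name: version} for provider packages in this environment."""
    found = {}
    for dist in distributions():
        # Normalize the name; a broken install may have no Name field at all.
        name = (dist.metadata["Name"] or "").lower().replace("_", "-")
        if name.startswith(PROVIDER_PREFIX):
            found[name] = dist.version
    return dict(sorted(found.items()))


providers = installed_providers()
```

In an environment without Airflow this simply returns an empty dictionary. Upgrading a single provider is then an ordinary package upgrade that leaves the core Airflow version untouched.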
24:19 Yeah, you can actually click on the documentation link on this page, Michael.
24:24 And then you will see all of those providers.
24:26 So you see the list of different provider packages.
24:29 And you can see the documentation of that versions, the different versions.
24:33 We release them very frequently.
24:35 Like every month, we have a bunch of providers released, which are adding new functionality.
24:40 And they are done completely separately, as Leah said.
24:43 Not the same release schedule as Airflow, and you can start using them faster.
24:48 And this is actually super cool that you can actually always find something there.
24:54 But if you don't, we don't actually force you to go this community route.
24:58 Like those are all providers which are developed by community and maintained by the community of Apache Airflow
25:03 under the Apache Software Foundation rules, which is called like Apache Way.
25:08 So the way how Apache releases software.
25:11 But if you want, you can actually build your own custom provider.
25:14 You can build your own custom operators and you can release them separately.
25:18 And somebody can install that.
25:19 And they even-- we even have integration points that if people are writing the custom providers,
25:23 they can use exactly the same feature as the community driven ones.
25:27 And you can install them as a package, as another Python package, completely independent from Airflow.
25:32 And it just plugs in the UI of Airflow, plugs into the whole framework, and you can start using it.
25:39 So it's both community and custom.
25:41 Yeah, you can go either path, right?
25:43 That's neat.
25:44 I think, Leah, what you're saying about the cadence, the release frequency, and maybe even the degree of seriousness with which you have to apply to these.
25:54 You might want the main Airflow to be treated differently than some edge package or integration, right?
26:01 Yes, definitely.
26:02 There was a proposal for Requests, the very popular HTTP library, to be integrated into Python to replace Python's HTTP layer.
26:14 And the decision of the core devs, I believe, was we don't want to do that to Requests.
26:19 Like, it will actually make Requests go much slower and only get released once a year with changes rather than, you know, as quickly as it needs to go.
26:28 Same thing for you all, right?
26:29 That was one of the biggest reasons for us to separate the providers, because when we were releasing 1.10.2, 1.10.3, 1.10.4, it meant that all the development was happening in the main or master branch.
26:41 And we were not releasing from the master branch because we were just releasing the minor or patch versions right now.
26:48 And because the code has to be tested thoroughly, even if there's a small bug in one of the providers, let's say a Google GCS bucket operator or something, it has to wait until the entire code has been tested and released.
27:01 So the cycle can be large, whereas what everyone was thinking, at least the committers and PMC members, that providers can be released more frequently, even if it means it can be released.
27:12 If we find a bug right now, we should fix it and go with the normal ASF releasing way, which is like three days of voting and release it.
27:19 So it is quicker release rather than waiting for the next month to club it into the core airflow release.
27:25 Plus that way, it's easier to also check the changes that happens because imagine checking the change log for 70 odd providers, including the airflow core in a single page.
27:36 So it will be a nightmare.
27:37 Yeah, I bet.
27:38 I'm just thinking of all the coordination of, well, there's some people working on the Discord integration and someone's working on the Samba integration and we're going to do a new release.
27:47 You've got to kind of feature freeze all that stuff.
27:49 So yeah, it makes a ton of sense to separate these things.
27:53 Actually, this is super, super cool that, you know, I'm the release manager for providers so far.
27:59 So I was releasing, I don't know, maybe six, seven releases over the last year.
28:03 And actually, I do it myself in like two, three hours.
28:07 I'm able to bring all the changes and put the release notes for all the 70 providers.
28:12 It's all fully automated and we can manage and release that without a worry that it will break something.
28:17 Because if one of those releases goes wrong, providers go wrong, we can simply yank this release.
28:22 This is this fantastic feature of PyPI that you can yank the release.
28:26 And this actually happened yesterday.
28:27 So we discovered that the PostgreSQL provider we released, version 2.1.0, had a backward incompatibility with a previous version of Airflow.
28:35 We haven't discovered that during our testing.
28:37 We test a lot of things, but this one slipped through.
28:40 But what we've done, just yanked this release.
28:43 Anyone can use the previous one.
28:44 When they install Airflow and PostgreSQL operators, they will install the latest version.
28:49 And in the meantime, we can just fix the PostgreSQL and release a new version.
28:53 And that's super cool, actually, for maintenance release and usability and stability of your installation.
29:01 Yeah, that's really good that you can change it around.
29:03 All right.
29:04 So I want to talk about, first of all, let's talk about installing.
29:08 So how do I get Airflow onto my computer?
29:11 It depends on if you want a hosted, managed version, like Cloud Composer, which I work on.
29:17 Or there is one for Amazon, MWAA.
29:19 And there's also Astronomer, which is where Kaxil works.
29:22 Or if you want to do it yourself.
29:24 Yes.
29:25 In general, though, we at least say that use the constraints file.
29:30 So every time when we release an Airflow version, we also tag in GitHub the constraints file for each of the release.
29:38 A constraints file contains the set of known dependencies that we have tested Airflow with on the CI.
29:45 Because Airflow has a lot of dependencies.
29:47 And before we started using constraints, there were a lot of instances where we just released Airflow.
29:54 And one of the dependencies released a breaking change in a minor or a patch version, which means users couldn't install Airflow.
30:00 And to get over it, we came up with this idea of using constraints file because Airflow is a library as well as an application.
30:09 So as a library, users want the latest versions, whereas as an application, you want the stable versions of everything.
30:16 So we came up with this balance of using the constraints file.
30:19 So if you check that Airflow version 2.1.2, we get the Python version and then we fetch that constraints file from GitHub and use that constraint file.
30:29 Because that way we can guarantee that it is reproducible and it will work for sure.
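A rough sketch of the constraints idea described above: build the URL of the constraints file for a given Airflow and Python version, then pass it to pip. The URL pattern follows the one in Airflow's installation docs as I understand it; the helper names `constraints_url` and `pip_install_command` are hypothetical, made up for this illustration.

```python
import sys


def constraints_url(airflow_version: str, python_version: str) -> str:
    """Return the URL of the constraints file tagged for one Airflow release.

    Assumption: constraints live in the apache/airflow GitHub repo under a
    tag named constraints-<airflow version>, one file per Python version.
    """
    return (
        "https://raw.githubusercontent.com/apache/airflow/"
        f"constraints-{airflow_version}/constraints-{python_version}.txt"
    )


def pip_install_command(airflow_version: str, python_version: str) -> list:
    """Build (but do not run) a pip command pinned by the constraints file."""
    return [
        "pip", "install", f"apache-airflow=={airflow_version}",
        "--constraint", constraints_url(airflow_version, python_version),
    ]


# Use the Python version of the current interpreter, e.g. "3.8".
current_python = f"{sys.version_info.major}.{sys.version_info.minor}"
print(" ".join(pip_install_command("2.1.2", current_python)))
```

The command is only printed here, so nothing gets installed by accident; running it is left to the reader.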
30:34 Yeah, very cool.
30:35 So if I go to the documentation, there's a couple of options.
30:39 I can run it locally.
30:40 I can run it in Docker.
30:41 I can run it in Astronomer.
30:43 But looking through the script to set things up here, it looks like there's a couple of steps.
30:50 So there's a database that does something.
30:52 There's some users who execute the task or, you know, you don't want to run as root most likely.
30:59 I suspect that's something you all discourage.
31:01 Probably.
31:02 And there's a web server and there's a scheduler.
31:05 So maybe tell us about that.
31:08 Whoever wants to take it.
31:08 Yes, I'll take it.
31:09 So Airflow is pretty complex in setup because it has multiple components.
31:14 Depending on the setup, you can talk to a Kubernetes cluster.
31:17 It can execute the workflow there.
31:19 Or you can have a Celery queue system processing your tasks and executing them on distributed workers.
31:25 Because the scalability part, which was one of those features of Airflow.
31:30 So you can have multiple workers, multiple machines, even several hundreds of them if you want.
31:36 And Airflow can be installed using all those capacity.
31:40 So we have Celery workers, we have Kubernetes workers, we have scheduler, we have web server.
31:44 And putting it together is not as simple as you would think.
31:49 Or actually, you can think that it's complex.
31:51 And it is.
31:51 However, we've made, like recently especially, we've put a lot of effort to make the kind of very simple way of installing Airflow.
32:00 Like, you know, like you just install it and it works.
32:02 And also, if you want to scale it to like a very complex one, you can also turn on all the knobs, put as many components you want in a way that fits you best.
32:12 So coming back a little bit to this installation, we have a Docker image.
32:17 So that's something I also worked for quite some time.
32:19 Together with Kaxil and the other maintainers, we iterated and perfected it.
32:24 So we have a very nice Docker image that can be used to both run Airflow as it is or build your own custom image, which contains all the new dependencies you want or all the special packages that you want to install, which are needed for you.
32:37 And then from that, we have Docker Compose, which is kind of a quick start.
32:42 So you can just, and this is this running Airflow in Docker, this part there.
32:46 When you run in Docker, does, say, the web server run in one container and the scheduler in another or something like that?
32:52 That's exactly what this Docker quick start.
32:54 It's orchestrated, yeah.
32:55 Okay.
32:55 Yes.
32:55 But it's super easy.
32:57 It's a really quick start.
32:59 You just download the Docker Compose file.
33:01 You just run two commands.
33:02 If you go a little bit down, then there's like a few commands to run.
33:06 And then off you go.
33:07 You have all these components talking together to each other and processing the DAGs.
33:12 And you can start playing with that.
33:14 It's not production ready, the Docker one.
33:16 But then there is the next step.
33:18 So you have a local installation.
33:21 It gives you Docker Compose.
33:22 And then I will transfer it to Kaxil because he was working mostly on that.
33:26 Yeah.
33:26 So we also have the Helm chart that we did.
33:29 The first version of Helm chart we released in March of this year.
33:32 So that's what we recommend for production uses.
33:35 That uses the official Docker image.
33:38 So we release like a lot of artifacts for Airflow.
33:41 And again, the documentation for Helm chart.
33:43 If you click on documentation again at the top and scroll all the way down, you will see a separate documentation for the Helm chart.
33:50 All right.
33:51 So go to Helm chart.
33:52 Yeah.
33:52 Okay.
33:52 Got it.
33:53 We have versioned all this documentation separately because they are different artifacts and all of them have different release cadence and are released separately.
34:02 And Helm chart is what we recommend for users because it comes with all the configurations that we have tested it in production environments.
34:09 Astronomer donated the Helm chart last year and we iterated on it a lot of times before we released it.
34:16 We also, me and Jarek, had a presentation in a recently concluded Airflow Summit.
34:20 So if users are interested in it, we can probably drop a link at the end of this session, I guess.
34:26 Yeah.
34:27 You all just had the Airflow Summit, right?
34:28 Yes.
34:29 Yeah.
34:29 Oh, I have a lot to say about this.
34:32 All right.
34:32 Well, tell us.
34:33 Community is definitely where the majority of my contributions to Airflow come in.
34:38 So this is our second ever Airflow Summit.
34:41 So far, it's been an annual thing, but I'm always nervous to say annual.
34:45 We can't, I don't want to make promises, but it's looking good.
34:49 Like we'll have it again.
34:50 So we had our first summit in 2020.
34:52 We had originally planned to have it be this 500 person in-person event and it's going to be in Mountain View.
34:59 That's how I got involved because we were looking to host it at the Computer History Museum.
35:04 And I said, oh, that's really close to where I work.
35:06 I can like be your liaison to the location.
35:09 And then, you know, there's a whole pandemic and everything.
35:12 And we ended up pivoting to a totally virtual event and it was a great success.
35:18 We did it in partnership with Software Guru.
35:20 They helped us run the summit last year.
35:23 And we felt that it was such a good success that we did it again this year.
35:28 And it just finished up in July.
35:30 We had 10,000, I think more than 10,000 at this point, registered attendees from all over the world.
35:37 That's really good for an online conference.
35:39 And for only the second edition too.
35:41 We're pretty proud.
35:42 Yeah.
35:42 And we had it live streamed in a bunch of different time zones.
35:45 So sometimes it was more Americas friendly.
35:47 Sometimes it was more EMEA friendly.
35:49 Sometimes it was more APAC friendly.
35:51 And we had all variations of talks.
35:53 We had ones that were customer use cases.
35:57 So people who are running Airflow in production or running one of the hosted managed versions of Airflow and what they're using it for.
36:06 We had people who are contributors talking about their first time contribution experience and why you shouldn't be scared to contribute to Airflow because we're really nice.
36:15 I promise we are, or at least we try to be. And we had more experienced contributors like Jarek and Kaxil talk about some of the more complex things that they've been working on over the past year and everything in between.
36:27 And there are so many talks and you have the summit page up right now.
36:32 Actually, all of the recordings and slides for those presentations that had slides available are up there for you to watch.
36:40 If you go to airflowsummit.org, there's many, many, many hours of content.
36:45 I highly encourage you to watch whatever sounds interesting for you.
36:50 Yeah.
36:51 I think this is great.
36:52 Like I said, congratulations on having 10,000 registered.
36:55 Thank you.
36:55 Yeah.
36:56 That's pretty amazing.
36:57 I think there's obviously a big group of people who know that this is like the right tool.
37:03 I think there's a lot of people who necessarily don't know for sure.
37:05 Like for example, there's on the Airflow GitHub page, it's 23,000 stars.
37:11 That's big time.
37:12 Yeah.
37:12 Django and Flask are 50, 50K.
37:14 So, I mean, that's, that's a lot of people using this and interested in this and so on.
37:19 I think that the best part about Airflow is the community.
37:23 And that's like why we have those stars, but also like why we had such a summit.
37:27 And Kaxil, you were going to say something.
37:29 Yeah.
37:30 I was just going to say that if you go by the PyPI stats, we have like 3 million downloads
37:35 a month or something like that, which is insane.
37:38 I know a good number of those come from CI and automated processes, but hey, all the other
37:43 packages also have the same thing.
37:45 So you can at least compare them between packages.
37:47 Yeah.
37:48 It's a relative statement at least, right?
37:49 Exactly.
37:51 And like Leah mentioned, the biggest, greatest thing about Airflow
37:56 is its community.
37:57 If you check the new contributors, I think we are up to more than 1600 contributors to the Airflow
38:04 project, which is great.
38:05 And every day we at least get a few new contributors trying to contribute to the project with whatever
38:11 they can and they must.
38:12 And I, again, through your medium, I would encourage people to go to Airflow website.
38:17 If you find anything, contribute it, fix it.
38:20 If you have some ideas about hooks, operators, anything, contribute it.
38:25 And we are there to help you.
38:27 Not only three of us, there are more than 30, 40 committers and PMC members, and there are
38:32 users helping users in the Airflow Slack channel.
38:35 We have more than 16,000, 17,000 members in the Airflow Slack workspace as well.
38:40 Wow.
38:40 That's cool.
38:41 So I actually want to give a quick plug for an Airflow Summit talk I gave this year that
38:46 was authored by me and a colleague.
38:48 It's called You Don't Have to Wait for Someone to Fix It for You.
38:51 And it is about the kinds of contributions that you can make to Airflow because there's all
38:56 those things that Kaxil mentioned.
38:58 But my personal opinion is that one of the best and easiest ways to contribute to Airflow or
39:05 any open source project really is to find something that is driving you nuts and to fix it.
39:10 Or at least to articulate really well what's driving you nuts and what needs to change.
39:16 Because a really good issue can be just as good of a contribution as a PR.
39:21 Because you may have just made the foundation for someone else to write a fabulous PR with a really
39:28 detailed issue.
39:29 And let me add to that as well because the community is definitely the thing that I love
39:34 most about Airflow.
39:35 The people are fantastic here.
39:38 And we are, all of us, all the committers, we are so much into making, like inviting people
39:45 to come and to join us or to give back for whatever they got from Airflow.
39:50 Like it's a free software.
39:51 Anyone can use it for free.
39:53 Giving back is just super nice.
39:55 But we don't stop at only talking about it.
39:58 Because if you see, if you see, scroll down a little bit above, you would see that we also
40:03 run a workshop during the Airflow Summit.
40:07 And this workshop is about contributing to Apache Airflow.
40:10 This year we had like 20 attendees coming and learning in three hours how to make your first
40:16 PR, how to communicate, how to be present in the community, how to make the most of it,
40:23 how to be super helpful to others as well.
40:26 And then we were just, it was like part of it was about coding, but all the rest was all
40:32 about communication, about speaking to people, about being able to express yourself and all
40:37 the stuff that you just needed.
40:38 And it is super important.
40:40 Who should I ask about this and things like that?
40:42 I know exactly who you should ask.
40:44 So actually one of our, one of my favorite stories about this year's Airflow Summit is we
40:49 had a speaker, I forget her last name.
40:50 Her first name is Tatiana and she's like a principal data engineer at the BBC.
40:54 And she went to the workshop last year.
40:57 And this year she was a speaker at the summit.
41:00 And her talk about how to basically like kind of debug when crazy stuff is going wrong in Airflow
41:06 was fabulous.
41:08 Oh, super.
41:08 Okay.
41:09 Yeah.
41:09 And you can, that's, people can live stream that off the sessions.
41:12 That's really cool.
41:13 Yeah.
41:14 Clearing Airflow Obstructions.
41:15 Awesome.
41:16 So that is an example of that workshop working.
41:19 Yeah.
41:19 Yeah.
41:20 Very cool.
41:20 Yeah.
41:20 I was just saying Airflow Summit is also one of a kind conference.
41:23 It's not like the normal conferences, mainly because we had the local meetup groups hosting
41:30 that day of the event.
41:31 So we had like the London meetup group.
41:33 We had the Bangalore meetup group, Melbourne, Warsaw meetup group.
41:37 And every, though we were bringing the community together.
41:40 So let's say the first day was hosted by the London meetup group, which was me, Ash and
41:45 other folks.
41:46 We were hosting that event for just for the Monday slot.
41:49 And then on the Tuesday, there were other PMC members, other community members from Japan
41:54 hosting that, some from Melbourne hosting that.
41:57 Similarly, those were the slots.
41:58 And someday even we had like some sort of overlap because we were trying to cover the
42:04 Pacific time zone and the Asian time zones, which was incredible because now you have tons
42:10 of content for the Airflow users to watch out.
42:12 Also, we had two community days.
42:15 We started from Thursdays.
42:16 So we had Thursday, all the talks about community, how you could make the contributions and stuff
42:22 like that.
42:23 Friday, we had that workshop.
42:24 And then from Monday to Friday, there were more about the Airflow use cases and why Airflow
42:30 2.0 was the big milestone for the project and what we are planning ahead for Airflow and
42:36 stuff like that.
42:37 There's a ton of stuff here.
42:38 I think people could watch for the rest of the year and study this and get a lot out
42:42 of it.
42:43 It's true.
42:43 I do think so.
42:44 And we actually even had a networking event there Friday night.
42:49 And that was a blast, actually.
42:51 It was.
42:51 The networking this year was like people learn how to use it online.
42:55 And that was like, well, not maybe as good as physical conferences.
43:00 So I'm looking forward to next year, which hopefully will be going to be partially at least a physical
43:05 event.
43:05 But it was good enough.
43:07 And I think that was really cool to talk to those people about all the different things,
43:12 not only Airflow.
43:13 So we are not only Airflow and not only Python and not only programming, but also people.
43:19 Yeah.
43:20 I feel like this is a project that would be easy to contribute to in the sense that if
43:25 I'm going to say contribute as a newcomer to Django, that's going to be hard because that's a highly
43:30 polished single piece of software.
43:32 And if you're going to make a change that affects millions of people and it's not easy.
43:37 Whereas here, if you want to add some kind of integration and it didn't exist before, you're
43:41 not going to break anybody's code.
43:43 You don't want to work with a bunch of legacy code.
43:44 There's a bunch of sort of broad but shallow places people could jump in and participate.
43:49 Well, and if people, if a newcomer does want to come in and like really jump into the deep
43:54 end, we do have this concept called AIP, which stands for Airflow Improvement Proposal.
44:00 And it kind of sets you up to not run into heartbreak if you open this, what you think is an
44:06 amazing PR and we're like, oh, no, no, no.
44:08 Hold on.
44:08 We're not ready for that.
44:09 Because it's almost like writing the outline before you write your essay.
44:13 I know it sounds kind of dry, but what it really is, is it's an opportunity to fully flesh out
44:18 this amazing idea you have and share it with the community and the community will give you
44:23 feedback and they will be productive about it because if they're not, they're not abiding
44:28 by community code of conduct.
44:30 Yeah.
44:30 I find it very unfortunate.
44:33 I feel really bad if people come and do a PR to some project that I have.
44:37 And granted, these are all very small open source projects.
44:39 But if they come and they actually do the work and the first I know about it is, boom, here's
44:43 a PR.
44:43 Yeah.
44:44 That's just not in the same zen of what I'm trying to accomplish with this.
44:49 And it's going to break the thing that makes it special.
44:51 It's a bummer.
44:52 So I have to reject it, right?
44:53 But you don't want to.
44:54 Yeah.
44:54 It'd be much better to say, I have this idea.
44:56 If I built this, would you want it?
44:58 You know, do you want the puppy?
44:59 Here's a puppy for Christmas.
45:00 Yes, exactly.
45:02 This is precisely what we are teaching people at those workshops because it's not obvious.
45:07 If you come from outside, you don't understand that.
45:09 We are not only teaching people about contributing the code, but also how to find yourself there,
45:15 like how to be empathetic, how to think about, put yourself in our shoes.
45:20 And on the other hand, how to tell what you want to tell in the way that we will understand
45:25 it.
45:25 Because it's sometimes really different worlds, different people, different backgrounds, different
45:29 expectations and assumptions.
45:31 So all this is the communication is that I'm a software engineer.
45:34 I love to do software engineering, but like 30, 40, 50% of my time is communication.
45:40 It's not actually coding.
45:42 And this is cool.
45:44 Related to this, I actually want to call out a really important Apache value that I
45:49 think that Airflow embodies, which is the concept of the importance of community over code.
45:55 And I really feel that the Airflow project lives that value.
45:59 And folks in the community really are trying to foster a positive community because they
46:05 understand that if the Airflow community is not healthy, then the Airflow code will not
46:11 live on.
46:12 It doesn't matter.
46:13 Yeah, it doesn't matter.
46:15 And if folks have questions about that, I do want to acknowledge that I am the one woman
46:21 in the room.
46:21 I am often the one woman in the room when it comes to Airflow.
46:26 And I would love to see that change and have more gender diverse folks come join.
46:32 And so if you are someone who identifies with that and wants to hear Leah's unfiltered views
46:38 on the community, feel free to reach out to me in the Airflow Slack or on my Twitter.
46:42 And like I said, I do think this is a project that if you want to get into open source, it's
46:46 one that has relatively low barriers, technically speaking.
46:49 Yes.
46:49 Oh, yeah.
46:50 The keynote talk I gave in the Airflow Summit on Thursday, the first talk.
46:54 So if you go to the Airflow Summit page, the first talk, then I talk about my journey as
47:00 well.
47:01 Because I was very afraid of contributing to open source because it feels intimidating at
47:05 first, and everything will be public.
47:08 Oh, who knows if I screw something up?
47:10 What would people say?
47:12 On my permanent record.
47:14 Yeah.
47:15 And I didn't know Python or didn't know it proficiently.
47:20 So I talk about my journey of how I did it.
47:22 I talk about 10 minutes about that and then how a new user can start contributing to the
47:28 project.
47:28 Because Airflow is a relatively still a larger code base.
47:32 And there are a lot of areas that people can target because if you try to learn everything
47:35 at once, it is going to be very difficult.
47:37 We have Helm charts.
47:39 We have Docker images.
47:40 We have scheduler, which is core to Airflow.
47:43 We have executors.
47:44 We have the CLI, REST API, and a lot of things like that.
47:48 So there is a lot of room for people to get expertise in a certain area.
47:52 And then if you start including all the integrations, then it's a whole piece, right?
47:56 You can just add your own integration and be an expert at that and become a contributor,
48:01 a PMC member just with those contributions.
48:04 Well, and in the interest of empathy, I would like to share that I do not know all of these
48:09 parts.
48:09 I think the part I'm most familiar with is the Google provider.
48:13 And I have never touched the Helm chart and it scares me because I haven't taken the time
48:17 to learn what it's all about.
48:18 But the good news is that other community members know and I know that I can look to them for
48:23 help when I do need to mess around with it.
48:26 Yeah, that's fantastic.
48:27 That's the beauty of the project, right?
48:28 If everyone knows everything, then why are we all here?
48:32 Each one of us knows their part, then that's the community.
48:36 Otherwise, it's not a community project.
48:37 Yeah.
48:38 Yeah.
48:38 Yes.
48:38 We're getting short on time.
48:39 I do want to touch on a couple of things that I think we haven't got a chance to touch
48:43 on that are really important.
48:45 One, let's talk about the user interface because one of the ways you all position this is
48:51 you don't want to do this all with just cron jobs and like sort of little scripts that are
48:55 put together and run on weird random triggers.
48:58 And one of the real big benefits is you have this really beautiful UI for all sorts of visualization
49:03 of like running workflows and all kinds of stuff, right?
49:06 You want to tell us about that?
49:08 I'll do the simple version because I think that Kaxil and Jarek know more about it than me,
49:12 but I'll tell you the two things I'm most excited about.
49:15 One of them is that it just got a huge makeover with Airflow 2.
49:18 So if you're an Airflow user and you haven't upgraded to Airflow 2, if you need one reason alone, it is that the UI is so much prettier and it is much more responsive.
49:28 And as a former cron user, I'll say that the best, easiest benefit you get from this
49:35 is you can just see what's failing.
49:36 You don't have to dig around and try to figure out what's missing, like you know that something went wrong.
49:41 All right, Jarek and Kaxil, that's my, I'm off my soapbox now.
49:44 Yeah, basically you have all the information you need, all the historical view in front of you.
49:48 Like if you want to see which task failed historically, you could just check the tree view.
49:52 And then this is the graph view where you can see how your task is proceeding.
49:55 Plus we now have auto refresh, like Leah mentioned from 2.0, which is like you don't need to press the refresh button,
50:01 which was a bit annoying for the Airflow 1.10.x version, which is very good.
50:05 Your task will continue.
50:06 You can see the progress that the Airflow is continuing this task.
50:10 If you click on that task, it will show you the logs of that task.
50:14 So everything is very intuitive and easy to monitor.
50:18 For people who are listening and are not watching the live stream, you can go and for example, in the graph, it'll show you all your tasks that you would do,
50:27 like download this file or run this bash script or whatever.
50:30 And then it actually shows you how they're working together.
50:32 And then they're colored as you progress through this DAG of tasks, right?
50:36 So you can actually visually see, was this one skipped?
50:39 Was this one successful?
50:40 Which one failed?
50:41 How far are you visually as a graph?
50:43 Which I think is awesome.
50:44 Yeah.
50:44 And one of the interesting thing over there is to understand the dependencies,
50:48 which was very interesting when I initially started with Airflow, that for a user or for a company to understand what all the tasks they are working on
50:56 and in a single flow, how does that dependency graph work on?
51:00 If you're depending on a data from a single client, how does that go to a dashboard?
51:04 So that end-to-end view, like it's an actual pipeline of sorts that you can see.
51:09 Yes. And just to add on that, so the visualization of the data flow is like super important
51:14 because then you can, with a glance, you can see what's going on and you can go to any part of it and focus on that and understand what's going on.
51:22 However, I will come back to kind of the roots because Airflow doesn't have a way by default to define those flows visually.
51:33 You can see them visually, but they are all defined as Python code.
51:37 And this is like the beauty of it.
51:38 And that was a very, very deliberate choice.
51:40 And this is the reason why we are at the Python talks today, because Airflow is all about Python.
51:46 So this visualization that you see here are really reflection of the code that you wrote as a writer.
51:53 And it means also that the common language between people using Airflow,
51:57 different parts of it, is Python.
51:59 This is the common language that we're using.
52:01 And this makes it so powerful.
52:03 And the visual part is pretty much an addition.
52:07 And it's necessary.
52:08 And it's more kind of result of the Python code which is being written.
52:12 A lot of workflow systems try to go in reverse, right?
52:14 They're like, here's your draggy, droppy set of tasks and options.
52:17 You drag it all together.
52:18 Then you press go.
52:19 Yes.
52:20 This lets you live at the code level.
52:22 This all breaks at the very moment when you want to have some custom work.
52:27 Because if you are used to the drag and dropping, you will not do coding.
52:31 You will not code the kind of customization that you want to do.
52:35 You will ask someone else to do that.
52:37 In Airflow, this is quite reverse.
52:38 I mean, everything is Python.
52:40 Everything.
52:40 Dependencies are Python.
52:42 The code itself is Python.
52:44 The blocks are Python.
52:45 But you can also write your own.
52:46 In the same place where you define your DAG, you can write your own custom operator
52:51 without having to use a black box operator of sorts.
52:55 And you don't have to leave the box of working on Python while doing that.
53:00 And this is so powerful.
53:02 I think this is the way why it is so popular between data engineers all over the world.
53:07 I think this is like one of the most popular workflow orchestration engines in the world right now.
53:12 That said,
53:13 I don't have hard data on that.
53:14 So it's just a feeling.
53:16 But I think that's the case.
53:18 I mean, we did have 10,000 people at the summit, Jarek.
53:21 Yes.
53:22 Yeah, for sure.
53:23 And while it is written in Python, you can use the bash operator to run like your Java code,
53:28 for example, or Scala or whatever.
53:30 So while everything is in Python, you can use it to run any other languages too.
53:36 You can run Docker image, Kubernetes task.
53:38 Because a lot of those workflows are also, okay, we have Kubernetes.
53:42 So we run everything in Kubernetes.
53:44 We run them as Docker containers.
53:45 And that's the only way you can do that.
53:48 Airflow can do that as well.
53:49 No problem whatsoever.
53:50 There is Kubernetes pod operator.
53:52 You can spin off a new Kubernetes pod to run your task.
53:55 But you can also have a Python code, which is very easy to put together and play with and run locally without all the overhead of building the Docker images and making them available to run you as a task.
54:07 So, so much more extensible and powerful.
54:11 Yeah, that's a very good point.
54:12 There's a lot of escape hatches to bring in other technologies.
54:15 That's cool.
54:15 Yeah.
54:16 Let me give people just a super quick sense of what it's like to write code for this, this Python code.
54:21 So you would say with DAG, with directed acyclic graph, and you give it some details.
54:26 And then you create these various tasks, like a task might be a bash operator or something like that, or like you said, a Kubernetes pod or whatever.
54:34 And then you just run them.
54:35 One thing I did want to ask you all about, like, what is this T1 double arrows into list of T2, T3 for the tasks?
54:42 Ooh, good question.
54:44 So you had those tasks matched to variables called T1, T2, and T3.
54:49 And this is how that visualization is defined, using those, like the bit shift operators in Python.
54:55 So this one would say that T2 and T3 run after T1, and they run in parallel.
55:01 And there are different ways of setting dependencies.
55:03 If you scroll down or just search for setting up dependencies on the right side of your, on the right side.
55:09 Yeah, setting up dependencies.
55:10 Yeah, there you go.
55:11 There are different ways you can set those dependencies between tasks.
55:14 You could do T1.set_downstream or T1.set_upstream.
55:19 You can, like, right shift.
55:20 You can left shift.
55:21 You can double bit shift as a transitive type thing.
55:24 It's set upstream.
55:25 Okay.
55:26 And the beauty of that, again, is that you can, this is all Python code.
55:30 Those are custom operators.
55:32 The left shift and right shift, they are just Python operators that are overridden.
55:36 Right.
55:36 And you can override them in the task, right?
55:38 Just like pathlib overrides forward slash to mean, like, combine parts of the path, right?
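To make the bit shift idea concrete, here is a toy sketch of how a class can overload `>>` and `<<` to record dependencies. It is an illustration only, not Airflow's actual implementation: the `Task` class and its `upstream`/`downstream` lists are hypothetical, though the method names `set_downstream` and `set_upstream` mirror the ones mentioned in the conversation.

```python
class Task:
    """Toy stand-in for a workflow task (hypothetical, not an Airflow class)."""

    def __init__(self, name):
        self.name = name
        self.upstream = []    # tasks that must finish before this one
        self.downstream = []  # tasks that run after this one

    def set_downstream(self, other):
        # Accept a single task or a list of tasks.
        others = other if isinstance(other, list) else [other]
        for task in others:
            if task not in self.downstream:
                self.downstream.append(task)
                task.upstream.append(self)

    def set_upstream(self, other):
        others = other if isinstance(other, list) else [other]
        for task in others:
            task.set_downstream(self)

    def __rshift__(self, other):
        # t1 >> t2 means "t2 runs after t1"; return other so chains work.
        self.set_downstream(other)
        return other

    def __lshift__(self, other):
        # t2 << t1 means "t2 runs after t1".
        self.set_upstream(other)
        return other

    def __rrshift__(self, other):
        # [t2, t3] >> t4: lists do not define >>, so Python falls back here.
        self.set_upstream(other)
        return self


t1, t2, t3, t4 = Task("t1"), Task("t2"), Task("t3"), Task("t4")
t1 >> [t2, t3]   # t2 and t3 both depend on t1 and may run in parallel
[t2, t3] >> t4   # t4 waits for both t2 and t3
print([task.name for task in t1.downstream])
print([task.name for task in t4.upstream])
```

Returning the right-hand operand from `__rshift__` is what lets a chain such as `a >> b >> c` read left to right.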
55:43 Wouldn't probably recommend that if you don't know about Airflow that much.
55:47 But the better thing there is that you can actually programmatically build the tasks and build the relationships.
55:55 So this is not something that is predefined in one file in the declarative way, like it's an XML file or JSON.
56:00 This is a Python code.
56:02 So you can pretty much dynamically build the DAG.
56:05 So very complex.
56:07 Like we saw, like, you know, the DAGs, which were like thousands, thousands of nodes built with like 200 lines of code because you could build those tasks.
56:16 You know what, which relationships you want to build in what way, like in for loop.
56:20 It's very hard to have a conditional in a JSON file or a XML file.
56:25 Yeah.
56:25 That's the thing.
56:26 Or loop.
56:27 Actually loop in JSON file is like, no.
56:30 I mean, there is no way to do that.
56:32 I mean, we do have XSLT.
56:33 You could go crazy.
56:34 Come on.
56:35 Yeah.
56:35 Yeah.
56:36 Please no.
56:37 And also, this is an explicit way of setting dependencies, but from Airflow 2.0 onwards, there's also an implicit way of having dependencies, which is like,
56:48 if you say that your Bash operator takes an input from another task, then Airflow sets dependencies between them implicitly because you are depending on an output of another task.
56:59 So it knows.
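The idea of inferring dependencies from data flow can be sketched in a few lines. In Airflow 2 this is what the TaskFlow API does for `@task`-decorated functions; the `task` decorator, `Output` class, and `edges` list below are hypothetical toy names that only imitate the concept and are not Airflow's code.

```python
edges = []  # (upstream_name, downstream_name) pairs discovered from data flow


class Output:
    """Placeholder for a task's future result; remembers which task produces it."""

    def __init__(self, producer):
        self.producer = producer


def task(func):
    """Toy decorator: calling the function registers it instead of running it."""

    def register(*args):
        for arg in args:
            if isinstance(arg, Output):
                # Consuming another task's output implies a dependency on it.
                edges.append((arg.producer, func.__name__))
        return Output(func.__name__)

    return register


@task
def extract():
    return [1, 2, 3]


@task
def transform(rows):
    return [r * 2 for r in rows]


@task
def load(rows):
    print(rows)


load(transform(extract()))  # no >> needed: the edges follow the arguments
print(edges)
```

Nothing is executed at definition time in this sketch; the calls only record who feeds whom, which is enough for a scheduler to know the order.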
57:00 Yeah.
57:01 That makes a lot of sense.
57:02 Cool.
57:02 All right.
57:02 So I think just two really quick things before we wrap it up.
57:07 We are short on time here.
57:08 One is we talked about the web UI for the stuff we're looking at, but there's also, you all describe a rich command line utility
57:17 To perform complex surgeries on DAGs.
57:20 Okay.
57:21 Why would you perform a surgery on one of these things?
57:24 And what is this all about?
57:25 Who wants to take that one?
57:26 Oh, I don't know that I've done surgery with the CLI, but I have used the CLI to give me information about my environment to figure out when things are misbehaving.
57:36 Yeah.
57:36 Okay.
57:36 Cool.
57:37 Like for diagnosis and stuff like that.
57:39 Yeah.
57:39 Like, cause we have, there's one command list DAGs and it also shows you how long the DAGs are taking to load.
57:46 So you can kind of see if one of them is your problem DAG.
57:49 If it's taking way longer to load than the rest, that usually means that I've made a mistake.
57:54 Yeah.
57:55 Yeah.
57:55 That command also gives you the parsing time and everything like that.
57:58 So it will tell you that it took five seconds to parse your DAG file, which means something is wrong in your DAG file.
58:04 You are probably importing a lot of things or doing some database calls on the top of your file, not inside the objects.
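A small experiment shows why top-level work makes a file slow to parse. The two source strings below are hypothetical "DAG files", and `time.sleep` stands in for a heavy import or a database call; the timing helper is an illustration and is not how Airflow measures parse time.

```python
import time

# Two hypothetical "DAG files" as source strings.
SLOW_FILE = """
import time
time.sleep(0.2)          # top level: runs every time the file is parsed
def my_task():
    return "done"
"""

FAST_FILE = """
import time
def my_task():
    time.sleep(0.2)      # inside the task: runs only when the task executes
    return "done"
"""


def parse_seconds(source):
    """Time how long it takes to execute a file's top-level code."""
    start = time.perf_counter()
    exec(compile(source, "<dag_file>", "exec"), {})
    return time.perf_counter() - start


slow = parse_seconds(SLOW_FILE)
fast = parse_seconds(FAST_FILE)
print(f"slow parse: {slow:.2f}s, fast parse: {fast:.2f}s")
```

Because the scheduler re-parses DAG files repeatedly, any cost at the top of the file is paid on every parse, while work moved inside the task is paid only when that task runs.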
58:11 Gotcha.
58:11 So you could find those sort of issues.
58:13 Also, you could use Airflow backfill CLI command to run all the backfilling of data if you got the data today.
58:20 And if you want to run it for last one year or so.
58:22 But also it's what is not mentioned in the document.
58:26 There is this, well, it is mentioned in the documentation.
58:28 We have also a very, very powerful and rich and very well written API.
58:33 So we have a stable API as of Airflow 2.
58:36 That was one of the improvements implemented.
58:38 So if you go to Apache Airflow, yeah, and scroll down on the left, yes, not this one, this stable.
58:43 All the way down, yeah.
58:44 Even below, there was like stable REST API.
58:48 Yeah, very good.
58:49 Yeah, yeah, yeah, gotcha.
58:50 Okay.
58:50 Yeah.
58:50 This API is written to the OpenAPI standard, which means that all the tools which you can imagine for like managing access, for trying out things, for testing the API calls, all the beautiful documentation that you see here with examples, this is all automatically generated.
59:08 From our API.
59:09 This is super cool because you can actually, and this is surprising.
59:13 You said that the UI is fantastic.
59:15 And yeah, it is.
59:16 But there are some companies who have their own UI, their own ways of looking at the processing pipelines.
59:21 And many, many, we've learned during the Airflow Summit, many of those companies, they actually build their own UI.
59:27 They don't use Airflow UI at all.
59:29 They just use the engine to execute it.
59:31 Right.
59:31 Maybe you want to integrate it into some larger thing they already have or something, yeah.
59:35 Exactly.
59:35 And this API makes it possible.
59:38 So you can just query which DAGs you have, which are the relationships, how this all works, which is successful, which not.
59:44 And then you can build beautiful UI or even ugly UI if you want.
59:48 But the UI, that is something that you're used to without looking even at the Airflow UI.
59:53 And this is also super powerful.
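For a sense of what querying the stable REST API looks like, here is a minimal sketch using only the standard library. The base URL, the `admin` credentials, and the use of basic authentication are assumptions about a local test setup; the `/api/v1/dags` path and the `dags` / `dag_id` response fields come from the Airflow 2 stable REST API. The example only builds the request, so it runs without a server.

```python
import base64
import json
import urllib.request

# Assumption: a local webserver address; adjust for a real deployment.
BASE_URL = "http://localhost:8080/api/v1"


def build_request(path, username, password):
    """Build (but do not send) an authenticated GET request for the REST API."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
    )


def list_dag_ids(username, password):
    """Fetch DAG ids; needs a running Airflow 2 server with basic auth enabled."""
    with urllib.request.urlopen(build_request("/dags", username, password)) as resp:
        return [dag["dag_id"] for dag in json.load(resp)["dags"]]


# Hypothetical credentials, used here only to show the request that would be sent.
request = build_request("/dags", "admin", "admin")
print(request.get_method(), request.full_url)
```

A custom UI of the kind described above would call something like `list_dag_ids` and then fetch runs and task states from the neighboring endpoints.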
59:55 Yeah.
59:55 And this is straight up REST API.
59:58 So while Python is awesome, if you're not a Python person, but you still want to adopt this, here's a way to integrate with it.
01:00:05 Right.
01:00:05 Absolutely.
01:00:06 And we have already started creating clients in different languages, like we have a Java client for Airflow built on this API spec.
01:00:15 Users can create their own APIs for a specific language because it, under the hood, uses OpenAPI.
01:00:20 So you can auto-generate clients for different languages.
01:00:23 Yeah.
01:00:24 Fantastic.
01:00:24 All right.
01:00:25 I think that is about time for us.
01:00:27 I did want to point out that Astronomer and AWS, but Astronomer, where you work, Kaxil, is a sponsor.
01:00:34 So if you want to run sort of Airflow as a service, that's kind of your job, right?
01:00:40 A hundred percent.
01:00:40 And also we, Astronomer, has their own registry.
01:00:44 So if you do open registry.astronomer.io, it makes it very easy to search for built-in providers that are baked inside Airflow.
01:00:52 Or if users create and maintain their own providers, it is very easy to search that as well.
01:00:58 I just posted a link if you want to check out.
01:01:02 One comment on that, because we also have Google Cloud Composer.
01:01:04 So we have Astronomer, AWS, and Cloud Composer.
01:01:08 These are like big embrace of Airflow as a service.
01:01:11 For us, it's like you can choose either you run it on your own, you run it using Astronomer, which have like great expertise in everything.
01:01:19 Because we have lots of people from Astronomer or our communities.
01:01:21 Then there are Amazon people.
01:01:24 Then there are Google or Amazon offering and Google offering.
01:01:27 And you are free to choose whatever you want.
01:01:30 Like how you want to run Airflow.
01:01:32 And you can move probably if you decide you need to move.
01:01:34 Yeah.
01:01:34 Absolutely.
01:01:35 That just has to do with the infrastructure.
01:01:36 The DAGs will be the same no matter where you take them.
01:01:40 You might have to do a few changes when it comes to like auth and making sure your keys are up to date.
01:01:45 Cool.
01:01:46 All right.
01:01:47 Let's wrap this up with a little bit of future looking.
01:01:49 Just whoever has the right visibility in our group here.
01:01:54 Just, you know, where are things going in the future?
01:01:55 People are excited about Airflow.
01:01:57 Like what can they look forward to?
01:01:59 There's a really good talk from the Airflow Summit that's called Looking Ahead Beyond Airflow 2.0.
01:02:04 That is with Ash from Astronomer and Ajemal from Google.
01:02:08 And I think the thing that Ash said over and over again is, well, there is no roadmap, but we do always have things going on.
01:02:16 So I'll turn it over to Kaxil.
01:02:17 No promises.
01:02:18 No promises.
01:02:19 But there are lots of that.
01:02:22 So yeah, we pretty much know the direction we are heading to.
01:02:25 So we want Airflow to be the orchestrator you want to use for whatever workflows you want to run.
01:02:30 That's it.
01:02:31 And there are lots of things that have to happen in order to get there, because we are so specialized on one hand, but we are opening up.
01:02:39 But we are on the road to really make it easy to accommodate more use cases, make it easier to run, make it faster, make it serve those cases which currently cannot be served because of some reasons, historical reasons mainly.
01:02:55 This is definitely some direction we are heading to open up to even more cases without losing the single focus.
01:03:01 Like we want to be great at scheduling tasks and orchestration.
01:03:05 That's it.
01:03:06 We don't want to do processing.
01:03:07 We don't want to go into this direction.
01:03:09 That doesn't make sense for us.
01:03:10 We want others to do processing and we will do orchestration the best way it's possible.
01:03:15 Yeah.
01:03:15 And there are two immediate things that we are already working on and are close to merging into the main branch. The first is making Airflow's scheduling more powerful.
01:03:24 That is, users will have more power than just expressing it in cron.
01:03:30 Users will also be able to say run it on the third trading day of the month or something like that.
01:03:37 Like that level of powerful timetable we want to provide to the users.
01:03:42 We call it timetables.
01:03:43 We will have Cron timetable.
01:03:45 We will have time delta timetable of sorts.
01:03:47 We are figuring that out.
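A schedule like "the third trading day of the month" is awkward in cron but short in Python. The helper below is a hypothetical illustration of that calculation, assuming a trading day is any Monday to Friday and ignoring market holidays; it is not the timetable interface Airflow ended up with.

```python
import datetime


def nth_weekday_of_month(year, month, n):
    """Return the n-th Monday-to-Friday date of a month (holidays are ignored)."""
    day = datetime.date(year, month, 1)
    seen = 0
    while day.month == month:
        if day.weekday() < 5:  # 0-4 are Monday through Friday
            seen += 1
            if seen == n:
                return day
        day += datetime.timedelta(days=1)
    raise ValueError(f"month {year}-{month:02d} has fewer than {n} weekdays")


print(nth_weekday_of_month(2021, 8, 3))
```

A real trading calendar would also need an exchange's holiday list, which is exactly the kind of logic that needs code rather than a cron expression.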
01:03:49 But we'll have that plus something called deferrable operators.
01:03:53 I mentioned the sensors, which currently poke, say, an API call over and over until it succeeds.
01:04:00 We are going to have a new component called the triggerer that will use Python's asyncio library to use resources in a more optimized manner.
01:04:10 Instead of polling, you just wait for it to happen and then boom, off it goes.
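The difference between polling and waiting can be sketched with `asyncio` directly. The coroutine and event names below are hypothetical, and this shows only the general pattern of many waiters sharing one event loop, not Airflow's trigger implementation.

```python
import asyncio


async def wait_for_event(name, event, log):
    """Suspend until the event fires; nothing is polled while waiting."""
    await event.wait()
    log.append(f"{name} resumed")


async def main():
    log = []
    file_arrived = asyncio.Event()  # hypothetical external condition
    # One event loop can hold many waiting coroutines at once.
    waiters = [
        asyncio.create_task(wait_for_event(f"task_{i}", file_arrived, log))
        for i in range(3)
    ]
    await asyncio.sleep(0.1)  # the waiters are suspended during this time
    log.append("event fired")
    file_arrived.set()
    await asyncio.gather(*waiters)
    return log


result = asyncio.run(main())
print(result)
```

Three waiting tasks cost one process here, whereas three polling sensors would each hold a worker while they loop and sleep.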
01:04:13 Yeah.
01:04:14 Okay.
01:04:14 That sounds cool.
01:04:15 Just one comment to this scheduling because those great examples.
01:04:18 One of the cases we want to serve, there is a real astronomer, not the company, real astronomer using Airflow.
01:04:24 And he wanted to start DAGs when there is a sunset and sunrise.
01:04:28 And, you know, when you are astronomer and flying around Earth, that's a little bit complex.
01:04:33 So the whole scheduling is going to be there to implement this astronomer request.
01:04:38 Yeah.
01:04:39 Fantastic.
01:04:39 It sounds really useful.
01:04:40 All right.
01:04:41 Well, I think that's it for covering Airflow.
01:04:44 But let's quickly wrap up with, I guess, just one of the questions since we're a little bit over time that I usually ask at the end.
01:04:50 So I'll ask you about your editor.
01:04:51 Yarek, if you're going to work on Airflow and other stuff, but what editor do you use for Python?
01:04:55 On a daily basis, I use IntelliJ, Ultimate.
01:04:58 That's my favorite editor.
01:05:00 However, very, very frequently, my favorite editor is VI.
01:05:04 I mean, I'm an old type guy and VI is always when I have to do something quick.
01:05:09 Somewhere where I don't have the editor started, VI is there.
01:05:14 And I, you know, have it in my, you know, like fingers.
01:05:16 I know how to quit VI.
01:05:17 It's easy.
01:05:18 I can learn you.
01:05:19 I can teach you, no problem.
01:05:21 Fantastic.
01:05:22 Yeah, I love that joke about random strings.
01:05:24 Kaxil?
01:05:25 For me, it's PyCharm.
01:05:27 I love PyCharm.
01:05:28 It's debugging.
01:05:29 It's going to the source code.
01:05:31 And those intelligent habits, QSOA.
01:05:34 Just a big fan of PyCharm.
01:05:36 Right on.
01:05:36 I use a combination of VS Code.
01:05:39 And I also have a soft spot for Vim.
01:05:41 Yeah, very cool.
01:05:42 Vim if it's going to be fast.
01:05:43 VS Code if it's not.
01:05:45 Yeah, we're going to be here for a while.
01:05:47 Let's get down to it.
01:05:48 Yeah.
01:05:48 Right on.
01:05:49 Well, thank you all for being here.
01:05:51 It's been really great.
01:05:52 Final call to action.
01:05:53 People want to get started either using Airflow or contributing to Airflow.
01:05:57 What do you tell them?
01:05:58 Oh, I tell them to go to the community page on the Airflow website.
01:06:02 And I tell them to sign up for the dev list and to join the Airflow Slack.
01:06:07 Yeah.
01:06:07 Fantastic.
01:06:07 All right.
01:06:08 Well, thanks again.
01:06:09 Thanks for being here.
01:06:10 Thank you.
01:06:10 Thank you for inviting us.
01:06:12 It was a great time.
01:06:13 Yeah.
01:06:13 Thank you.
01:06:14 Thanks.
01:06:14 Bye.
01:06:14 Bye.
01:06:15 Bye.
01:06:15 Bye.
01:06:15 Bye.
01:06:15 Bye.
01:06:16 This has been another episode of Talk Python to Me.
01:06:18 Our guests in this episode were Yarek Potik, Kaxil Naik, and Leah Cole.
01:06:23 And it's been brought to you by us over at Talk Python Training, as well as the transcripts
01:06:28 have been brought to you by Assembly AI.
01:06:30 Do you need a great automatic speech-to-text API?
01:06:33 Get human-level accuracy in just a few lines of code.
01:06:36 Visit talkpython.fm/assemblyai.
01:06:39 Want to level up your Python?
01:06:41 We have one of the largest catalogs of Python video courses over at Talk Python.
01:06:45 Our content ranges from true beginners to deeply advanced topics like memory and async.
01:06:50 And best of all, there's not a subscription in sight.
01:06:52 Check it out for yourself at training.talkpython.fm.
01:06:55 Be sure to subscribe to the show, open your favorite podcast app, and search for Python.
01:07:00 We should be right at the top.
01:07:01 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the
01:07:07 direct RSS feed at /rss on talkpython.fm.
01:07:11 We're live streaming most of our recordings these days.
01:07:14 If you want to be part of the show and have your comments featured on the air, be sure to
01:07:18 subscribe to our YouTube channel at talkpython.fm/youtube.
01:07:22 This is your host, Michael Kennedy.
01:07:24 Thanks so much for listening.
01:07:25 I really appreciate it.
01:07:26 Now get out there and write some Python code.
01:07:28 I'll see you next time.