#256: Click to run your notebook with Binder Transcript
00:00 Have you come across a GitHub repo with a Jupyter notebook that has a run in Binder button?
00:04 It seems magical. How does it know what dependencies and external libraries you might need?
00:09 And where does it run anyway? Like all technology, it's not magic. It's the result of hard work by
00:15 people behind the project. In this case, mybinder.org. On this episode, you'll meet Tim Head,
00:20 who has been working to bring Binder to us all. Take a look inside mybinder.org,
00:24 how it works, and the history of the project. This is Talk Python to Me, episode 256, recorded
00:30 February 20th, 2020. Welcome to Talk Python to Me, a weekly podcast on Python, the language,
00:50 the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy,
00:54 follow me on Twitter where I'm @mkennedy, keep up with the show and listen to past episodes at
00:59 talkpython.fm, and follow the show on Twitter via @talkpython. This episode is brought to you by
01:04 Linode and Talk Python Training. Be sure to check out what the offers are for both of these segments.
01:10 It really helps support the show. Tim, welcome to Talk Python to Me.
01:14 Hi, Mike.
01:15 It's great to have you on the show, and I'm really looking forward to learning more about Binder.
01:18 It's fantastic to be on the show and to talk to you about Binder, what we do, how it came to be,
01:25 and hopefully how it will continue forever in the future.
01:29 I hope so. It's definitely going to continue in some sense, no matter what, where GitHub is taking
01:37 all of the public repositories and encoding them on tape and putting them in some vault in some Nordic country.
01:44 I can't remember exactly where, but it's probably already been archived there for the world.
01:50 So it's definitely going to continue, but I hope it continues actively as well.
01:54 Yeah, that sounds good. We have a contributor who lives in Norway, so maybe we should try and organize
02:02 a trip to wherever this vault is.
02:05 Yes, exactly. I can't remember where it is, which country it's in, but yeah, it's up there.
02:10 It's got to be near Norway. Pretty cool.
02:12 All right, now, I'm definitely looking forward to talking about Binder and learning more and a lot
02:17 of the behind-the-scenes stuff. But before we get to all that, let's just start with your story.
02:21 How did you get into programming and Python?
02:23 So when I was a teenager, I wanted to do what everybody 20 years ago wanted to do, is build websites.
02:31 And at the time, my dad told me, oh yeah, maybe you should check out Python.
02:38 And then they had this fantastic web server called Zope.
02:42 And then I learned how to use that and make forum software and other little websites like that.
02:50 And that's how I got into Python.
02:52 And, if you wanted to be mean, I never really learned any other programming languages.
02:58 If you don't have to.
03:00 I think there's value in other programming languages.
03:04 But people who know Python are in a bit of a special place because it's so widely used
03:09 and accepted, you're not forced to go learn something else necessarily.
03:12 Other than JavaScript, everybody's forced to learn JavaScript if you want to do anything on
03:16 the web.
03:16 Yeah, that's right.
03:17 So I know a little bit of JavaScript.
03:19 And I was a physics student at university and then worked at CERN as well as a physicist.
03:29 And there we write a lot of C++.
03:31 So staring at assembler code is the other thing I do.
03:36 Yeah.
03:36 So are you still working at CERN?
03:39 So no, I'm not an academic anymore.
03:42 You've hung up your tweed jacket and your pipe?
03:45 Yeah.
03:46 Yeah.
03:46 Actually, I'm not sure if you get those before you become a professor.
03:50 Maybe only when you get a tenure, maybe full professor, you get the hat at the end.
03:55 Yeah.
03:55 Yeah.
03:56 Yeah.
03:57 So I think three or four years ago, I left academia and thought there was no better thing
04:04 to do in your life than be unemployed and moved to Switzerland on the same day.
04:09 And I started my own business, started my own small consulting company around data science,
04:16 machine learning, that kind of stuff.
04:18 And today I work for a small company in Zurich that is called Skribble.
04:24 So like you scribble on a piece of paper and we do electronic signatures.
04:30 So if you need to sign documents which require legal signatures, you should come to us.
04:37 Okay.
04:38 Yeah.
04:38 Yeah.
04:39 Super cool.
04:39 What was it like working at CERN?
04:41 Oh, it was one of the best places I've worked in my life.
04:44 There's a reason the competition to work there is so crazy.
04:48 Yeah.
04:49 It's fantastic.
04:50 It seems like a really amazing place and it's one of the true cutting edge places where science
04:56 is happening these days.
04:57 And they also didn't set the atmosphere on fire or create a black hole or anything like
05:02 that.
05:02 So it was all good.
05:03 Yeah.
05:04 It's a great place to work and I really, really enjoyed it.
05:08 But at some point decided it wasn't what I wanted to keep doing.
05:12 Sure.
05:13 And so I switched to do something else.
05:16 Yeah.
05:16 You know, I said I was in academics for a while.
05:19 I was working in math and I really loved it.
05:22 But in the end, I felt like I could just have more impact and just do more interesting,
05:26 concrete stuff with programming than I could with, you know, just math ideas.
05:31 Yeah.
05:32 Yeah.
05:32 I've been there.
05:33 I've been there with you.
05:33 Now, let's start this conversation at the high level before we dive down into Binder,
05:40 which is a tool for working with and hosting notebooks in a special way.
05:45 Hopefully I got that roughly right.
05:47 Yeah.
05:48 Yeah.
05:49 That's about right.
05:50 Cool.
05:50 But let's start at a high level.
05:52 Just like, what's the state of notebooks today?
05:54 Now, just a little background on me.
05:56 If you don't know, I do more on the web side, database side, random utility side.
06:03 Not as much data science.
06:05 Right.
06:06 Although I do pay attention to whatnot.
06:08 It's not where I live day to day.
06:10 So I'm genuinely asking, like, what's the state of notebooks today, do you think?
06:14 I think it is more exciting than it's ever been.
06:18 When notebooks started and, you know, you can discuss on who invented them.
06:23 But for a long time, it was a tool for describing what you were doing to a human and to the computer simultaneously.
06:32 Because you have the narrative text for humans and the code for the computer.
06:40 And now people are, like, going crazy, you know?
06:45 People are trying to automatically turn notebooks into web applications.
06:50 People are wanting to run, like, only small parts of their notebook and make it look like it's within a normal editor.
07:00 And it's like a text file kind of thing.
07:03 How can we make a notebook, which...
07:06 Right, right.
07:06 Both, like, PyCharm and VS Code have that kind of flavor, right?
07:11 There's, like, a special comment that separates the cells.
07:13 But you're kind of in a text file, in one of their views, at least.
07:16 Exactly, yeah.
07:17 And I don't use it very much, so I'm not sure.
07:20 But I think you can then just highlight lines and say, run these lines in my file.
07:26 Right.
07:27 Which, I mean, yeah.
07:29 You can imagine doing all sorts of completely crazy stuff that way.
07:33 Yeah, it's pretty exciting.
07:34 And there's a lot of online hosted places as well, right?
07:38 Google and Azure both have ways in which you can log into your notebooks.
07:43 And it may cost some money.
07:44 They may be free.
07:45 It depends, right?
07:46 Yeah.
07:46 So there's Google Colab, Azure Notebooks.
07:52 Various startups offer it as a service.
07:56 There's a service called CoCalc that offers hosted notebooks.
08:01 So there's a lot of people out there who do it.
08:05 And mybinder.org is one of them.
08:08 Okay.
08:08 I guess it's probably a good time to talk about Binder and, like, where does it fit in this world?
08:14 So Binder, what we offer people is that you send a link to somebody else.
08:20 And they click on it.
08:21 And if you did everything right, they can just run whatever code you wanted them to run.
08:29 For example, something you wanted them to try out.
08:31 It should just work.
08:32 And it runs in their browser.
08:34 So they can do it from their tablet, from their phone.
08:37 If they want to use a laptop, they can use a laptop as well.
08:40 But they don't have to install anything.
08:42 Do anything like this.
08:44 And in that sense, it's just another notebook hosting service.
08:49 The twist is that you cannot create an account.
08:54 There is no way for you to make an account with us.
08:57 Okay.
08:57 And that means you can just send the link to everybody and it will just start, which I think is fantastic.
09:05 So people use it a lot, for example, for workshops, because everybody knows that a tutorial or workshop they went to where they spent the first half of the workshop trying to get everybody set up.
09:17 Right.
09:17 And with a service like mybinder.org, you can send them all the link.
09:23 And if the class is not ginormous, they follow the link and then they have something that they can use for the workshop.
09:31 Then you can spend maybe the end of the workshop teaching them how to set it up locally.
09:35 But you can get going.
09:37 And Binder is something that you could set up locally?
09:40 Or maybe they would focus more just on here's how you set up this particular project with these notebooks and these requirements.
09:46 I guess either, huh?
09:48 But more likely the latter.
09:49 The answer is it depends what you want to do.
09:52 Imagine, like, how would I set up something that I can use for my course?
09:57 I would create a GitHub repository and I would put, for example, a requirements.txt in it.
10:03 And as human instructions, I would write, please run pip install -r requirements.txt.
10:12 And this is such a common pattern that for mybinder.org, we built a tool that goes looking for these files and says, aha, I found a requirements.txt.
10:23 I bet you the author of this repo wanted us to run pip install -r requirements.txt.
10:30 And in an overwhelming number of cases, that is what the author wanted us to do.
10:35 So that's what we automate.
10:37 And so the end goal is that we didn't invent any new way of setting up your software.
10:44 We tried to just spot the patterns of what lots of people are already doing.
10:50 That makes it easy to automate.
10:52 But it means also it should be fairly straightforward to then do it locally.
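A minimal sketch of the pattern described here, not the actual mybinder.org code: look for a well-known file and propose the command a human would have been told to run. The function name `suggest_install_command` is made up for illustration.

```python
from pathlib import Path
import tempfile


def suggest_install_command(repo_dir):
    """Return the install command implied by the repo's files, or None.

    Hypothetical helper: it only knows about requirements.txt.
    """
    if (Path(repo_dir) / "requirements.txt").is_file():
        # The same command the author would have written in the README.
        return ["pip", "install", "-r", "requirements.txt"]
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as repo:
        print(suggest_install_command(repo))
        (Path(repo) / "requirements.txt").write_text("numpy\n")
        print(suggest_install_command(repo))
```

Because the tool only automates what the README would have said, the same command works unchanged on a laptop.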
10:57 Right.
10:58 Because as long as you follow the same basic patterns as was intended anyway, then you're good.
11:03 But you get to avoid all of the challenge of, oh, wait, I don't have Python.
11:07 Or when I type Python, it says that print has these weird semicolon or these parentheses or, you know, some weird thing.
11:15 It's like print is not a function or I don't know, whatever it's going to say, because they have Python 2.
11:20 And they didn't know that when they type Python, they got the wrong Python, but they don't have permission to install the right Python.
11:27 And there's just like, yeah, there's just layers of these challenges.
11:31 And the problem is often like that's the first thing to hit people.
11:36 Right.
11:36 Like in a couple of weeks, they're probably fine to be addressing these issues and go,
11:40 Yeah, I understand what's happening.
11:42 I'm going to fix this.
11:43 But that's not what you want to be.
11:45 Welcome to Python.
11:46 Here's your four problems to struggle through and set up.
11:49 Right.
11:50 Exactly.
11:50 And it takes an infinite or seemingly infinite amount of time to debug all these problems.
11:56 If you have a group of 30 people, I guarantee there is going to be people there who have a problem you've never seen before.
12:02 And now as the person trying to run the course, you're like, oh, my God, we're going to do some live debugging.
12:09 That's not what I wanted to show.
12:11 Yeah, exactly.
12:12 It's the way that a lot of training sessions begin or a lot of workshops begin.
12:15 It's definitely not a good impression for the students.
12:19 And it's certainly not productive time.
12:21 So this is really nice.
12:23 Right.
12:23 So you can set this up.
12:24 Typically, I don't know, all the workshops and trainings I did all had GitHub repos.
12:29 And it's like, hey, here's your GitHub repo for the stuff we're giving you.
12:32 Here's the repo.
12:33 Here's the part of the repo where it has the code that you're given.
12:36 Here's the code where I'm going to write and I'll put it in there later.
12:39 Get started.
12:40 And that's a perfect fit for Binder, right?
12:42 Exactly.
12:42 Yeah.
12:43 And a lot of the people who work on Binder come from the Project Jupyter ecosystem.
12:53 So there's a strong focus on notebooks.
12:55 But now there are examples where you've replaced the UI of notebooks with VS Code because VS Code is also just a web app.
13:04 Actually, you can start if you wanted to.
13:07 You could start a repo which, to the user, presents as just VS Code in the cloud somewhere.
13:15 Oh, no kidding.
13:16 Now, I knew VS Code did that.
13:18 I knew that that's something they announced under the term Visual Studio Online.
13:23 And also Coder.com was doing that as well.
13:27 But how do you guys do that?
13:30 Like, I guess VS Code will let you set that up for your own environment, not just for theirs.
13:36 Is that how it works?
13:37 So we actually take advantage of the very hard work Coder.com, I think.
13:44 I'll give them a shout out because I think they're the people who did all the work for the VS Code port or modifications.
13:53 I'm not sure what you call it.
13:54 It does run as a web app.
13:58 Right, right.
13:58 Coder.com is definitely the first experience that I've seen where that was happening.
14:04 Yeah.
14:04 Yeah.
14:05 And they publish their hard work as open source.
14:08 So we allow people to install that or trigger that being installed in their binder.
14:16 And then you can use it as an alternative UI.
14:21 So that was one of the things that I was thinking of here as we were exploring some of the use cases and how this is interesting and useful.
14:29 Right.
14:29 So for notebooks, this makes a lot of sense.
14:38 So it's just a matter of where the server executes its kernel and whatnot.
14:41 Right.
14:42 So that's pretty easy to do, relatively speaking.
14:45 But if it's pure Python code, it's a different kind of challenge, right?
14:49 Because then how do you edit those files?
14:51 How do you interact with them?
14:52 How do you get a terminal into the environment?
14:54 It's a lot more multi-step, it felt like.
14:57 Yeah.
14:58 If you could fire up like a Visual Studio Code in the cloud instance, right?
15:03 That it's got the little terminal.
15:05 That terminal runs inside the container that's hosting it and so on.
15:09 I think that's pretty awesome.
15:10 That really expands the appeal of Binder, I think.
15:13 Yeah.
15:14 Yeah.
15:14 And you get a lot of the language server extension, the full, like, what is it?
15:25 Hints.
15:26 How to complete the function, what the arguments are.
15:29 All this stuff.
15:31 If you know how to install the plugins or enable the plugins you need, then you get all that good stuff.
15:38 This portion of Talk Python to Me is brought to you by Linode.
15:43 Whether you're working on a personal project or managing your enterprise's infrastructure,
15:48 Linode has the pricing, support, and scale that you need to take your project to the next level.
15:52 With 11 data centers worldwide, including their newest data center in Sydney, Australia,
15:57 enterprise-grade hardware, S3-compatible storage, and the next-generation network,
16:03 Linode delivers the performance that you expect at a price that you don't.
16:08 Get started on Linode today with a $20 credit, and you get access to native SSD storage,
16:13 a 40-gigabit network, industry-leading processors, their revamped cloud manager at cloud.linode.com,
16:19 root access to your server, along with their newest API, and a Python CLI.
16:23 Just visit talkpython.fm/Linode when creating a new Linode account, and you'll automatically get $20 credit for your next project.
16:31 Oh, and one last thing.
16:32 They're hiring.
16:33 Go to linode.com/careers to find out more.
16:36 Let them know that we sent you.
16:38 Do you set that up as part of, as a project owner?
16:44 When you create your binder setup, can you say, this is going to be a Python plus JavaScript thing,
16:49 so we're going to have to make sure those extensions come along for the ride?
16:52 Yeah.
16:52 Yeah.
16:53 So you, as the owner of the GitHub repo, you specify that you would also like to have,
16:59 I think it's called code server, the executable.
17:03 You would like to have that installed as well.
17:06 And then you format the link you share with people slightly differently so that we know to send you to this other UI.
17:13 And then, yeah, you need to also, if you know which plugins and so on to install,
17:19 then during the building phase of your binder, you have to add a few lines that we can execute to do all that stuff.
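As a rough illustration of "formatting the link slightly differently", here is a sketch that builds a mybinder.org launch link with an optional `urlpath` query parameter selecting the UI to open. The `/v2/gh/<org>/<repo>/<ref>` layout and `urlpath` follow the public mybinder.org link format as I understand it; the exact value that opens a VS Code UI depends on how the repository exposes it, so any such value is an assumption to check against your own setup.

```python
from urllib.parse import quote, urlencode


def binder_launch_url(org, repo, ref="HEAD", urlpath=None,
                      base="https://mybinder.org"):
    """Build a launch link for a GitHub repo; urlpath picks the UI to open."""
    url = f"{base}/v2/gh/{quote(org)}/{quote(repo)}/{quote(ref, safe='')}"
    if urlpath:
        # e.g. "lab" for JupyterLab; other values depend on the repo's setup.
        url += "?" + urlencode({"urlpath": urlpath})
    return url


if __name__ == "__main__":
    print(binder_launch_url("binder-examples", "requirements"))
    print(binder_launch_url("binder-examples", "requirements", urlpath="lab"))
```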
17:28 Sure. Okay. Well, that sounds very flexible and so on.
17:32 I want to talk about getting the launch binder, launch this in binder on your GitHub and all the details of how you do this.
17:39 But really, just since we're still kind of beginning, let me just ask you real quick,
17:42 how did you get started with Binder in this whole project anyway?
17:44 That's a good link back to my previous life at CERN, where as a student and as a postdoc also,
17:53 we write so much software to analyze our data and it does fairly complicated stuff.
17:58 And so when you see somebody else doing something cool, you're like, oh, I wish I could do that as well.
18:05 But if you ask anybody in particle physics or at CERN to share their code,
18:12 and then it only takes you, let's say, a week to get it running, doing what they were doing.
18:18 You get this thing, you compile that from source, and then you do this, and oh, you got to get this right version of the header.
18:25 Yeah. And so if it only takes a week, people are super impressed and super happy.
18:30 But for me and a bunch of my friends, it was just, we were like, no, this has to take like an hour maybe.
18:37 It's too long. If we just spend a week to figure out how to run it before we start modifying it, it's too long.
18:45 So at some point, we decided we need to do something about it or shut up complaining.
18:52 Yeah. Yeah. That's kind of my philosophy as well.
18:55 It's like, it's fine to complain and criticize, but it should come with a, but you should do this instead.
19:00 Or here's how we're making it better, right?
19:02 That's great.
19:03 So fortunately, there was a, or there's a yearly hackathon hosted at CERN called the WebFest.
19:09 And we're like, okay, going to take three days to try and build something.
19:13 And at least we will have tried.
19:16 So we started and the idea was to build something very similar to what we now call Binder.
19:23 What was exciting is that on the Saturday, so it's a weekend event on the Saturday morning or afternoon.
19:30 I read an email from Jeremy Freeman, who is the inventor or creator of mybinder.org, announcing mybinder.org.
19:41 I was like, oh, that sounds very much like what we're trying to do here.
19:45 So from day one, that was, yeah, it was exciting to see that other people are trying to do the same thing.
19:52 Yeah. For me, I then took a detour, just doing some more science and lost my, my connection to this.
19:59 And I think it was three years ago, two or three years ago, at PyCon, somebody said, we're trying to restart mybinder.org.
20:11 We've seen you did something similar.
20:13 Do you want to be involved?
20:15 Okay.
20:15 And since then, I've not managed to escape.
20:17 They pulled you back in.
20:20 Wonderful.
20:20 Cool.
20:21 Well, yeah, I mean, that totally makes sense.
20:23 It sort of fits with that overall theme of reproducible science and that kind of stuff, right?
20:28 Yeah.
20:29 If you can't create the environment to rerun or play or experiment with the code, well, that's a pretty bad shot against reproducibility.
20:37 Yes.
20:38 I will leave it at that.
20:40 For sure.
20:42 All right.
20:42 All right.
20:42 So the way that I learned about binder is I would go to various GitHub repos and I would see a little launch binder badge, right?
20:53 Like GitHub has all these little cool, I guess it's not really GitHub.
20:56 Like you can just put all these little badges on top of your readme that appear.
21:00 I guess it's a convention in GitHub.
21:03 It doesn't come from GitHub.
21:05 And it might say, this thing requires Python 3.5 or above, or the continuous integration is working, or there's 94% code coverage, or it could say, click here to launch this project in binder.
21:18 Now, to me, that was always really impressive because, you know, it's not enough to just take the code from the repository or the notebook from the repository and just run it.
21:28 Right.
21:29 As you described, there's all these requirements potentially, right?
21:32 Like if I'm using SciPy, you know, maybe that's not installed or more likely using some edge package that nobody knows about, or you need a certain version or whatever, right?
21:43 There's a lot of specific requirements that have to happen.
21:46 So to me, it just looked kind of like magic.
21:48 Like, oh, you can click on this button, and then this thing runs, but I have no idea how it might run because it seems like the environment where it's going to run should be so specific.
21:58 So how does this happen, right?
22:00 And I guess, yeah, give us the rundown of how that happens.
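For reference, the badge itself is just a Markdown image wrapped in a link. A small sketch that generates the README snippet, assuming the repository lives on GitHub and using the badge image served by mybinder.org; the helper name is made up.

```python
def binder_badge_markdown(org, repo, ref="HEAD"):
    """Return README Markdown for a 'launch binder' badge (illustrative helper)."""
    launch = f"https://mybinder.org/v2/gh/{org}/{repo}/{ref}"
    return f"[![Binder](https://mybinder.org/badge_logo.svg)]({launch})"


if __name__ == "__main__":
    print(binder_badge_markdown("binder-examples", "requirements"))
```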
22:03 There's a pretty good little UI that you can get started with over at mybinder.org that talks you through it.
22:12 But let's talk about what happens there.
22:13 We use a little tool we built called repo2docker, and it does exactly what the name suggests.
22:20 It will take your repository and will look at it and say, if there's an environment.yml, it will go, okay, we need to install Python.
22:32 Well, we need to install Conda, and then we need to run whatever the Conda command is to get all the stuff which is listed in the environment.yml installed.
22:44 And that really is 80% of what repo2docker does is it will look for these very well-known files.
22:53 In the Python world, it's requirements.txt, environment.yml, setup.py.
22:58 Does it know, like, Pipfile and pyproject.toml and those things?
23:04 We have some support for both of those as well.
23:07 It doesn't seem to be used as much.
23:11 So I would say it's something that could do with more love.
23:15 Sure. Okay.
23:16 And then we have the same for the R community.
23:19 They have a whole bunch of magic files that the community has agreed on using as a format for specifying things.
23:27 The Julia community.
23:28 Because the notebooks will support a bunch of different kernels these days, a bunch of different languages and runtimes.
23:33 And so I guess one of the questions is, when I create a Python project, I don't typically tell it, you're a Python project.
23:42 I mean, I guess the closest I ever get to that is I will have the .gitignore be the Python default .gitignore.
23:50 But other than that, when I create a new project, I don't tell it it's Python.
23:53 I just put Python files in there.
23:54 But I might also put JavaScript files or less files or other things that could confuse you guys.
24:00 So how do you, when you look at one of these projects, know, oh, this one's Python, that one's Julia?
24:05 We don't try and guess that so much as we have a list of files that we recognize.
24:11 So like, yeah, requirements.txt.
24:16 And when we see that, we say, okay, now add like the commands to install Conda, add the commands to install these pip packages to a Dockerfile.
24:29 And we build a Docker container for you.
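To make the "list of files we recognize" idea concrete, here is a toy sketch, not repo2docker's real logic or output. It maps recognized filenames to Dockerfile lines and ignores everything else, so a pile of JavaScript files changes nothing. The base image name and the exact `RUN` lines are placeholders.

```python
# Recognized config files and the Dockerfile steps they imply (illustrative).
RECOGNIZED = {
    "environment.yml": ["RUN conda env update -n base -f environment.yml"],
    "requirements.txt": ["RUN pip install -r requirements.txt"],
}


def render_dockerfile(filenames, base_image="example/base:latest"):
    """Render Dockerfile text from the recognized files in a repo listing."""
    lines = [f"FROM {base_image}", "COPY . /src", "WORKDIR /src"]
    for name, steps in RECOGNIZED.items():
        if name in filenames:
            lines.extend(steps)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # app.js and style.less are never inspected; only known names matter.
    print(render_dockerfile({"requirements.txt", "app.js", "style.less"}))
```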
24:31 I see.
24:31 So for example, if you see like a requirements.txt, you're like, aha, Python.
24:36 Exactly.
24:36 We don't try and like look at the files in the repo in general and try and deduce.
24:42 Okay, there's 80% Python files and 20% JavaScript.
24:46 Let's install Node and Python 3.7 or something like that.
24:51 Yeah.
24:51 It always makes me, I guess, frustrated or something.
24:54 When I see that, the GitHub estimation of what the project is, you know, you can like click down and the languages and it'll show like, you know, 70% this, 20% that.
25:03 Because sometimes there's just some huge file, right?
25:06 Like maybe I have one notebook, but there's like a ton of output in that notebook.
25:11 But it's otherwise pure Python straight files, .py files.
25:15 And it'll say, this is like 80% Jupyter.
25:17 Like, you know, not really.
25:19 It's more like output from Jupyter that you're counting or I'll decide, you know, I want to bundle up some node packages and put them in the repo so I don't ever have to worry about them vanishing or changing or ever.
25:33 Just put them there and they're fine.
25:35 And then it becomes a JavaScript project because, you know, it's got more JavaScript than it does Python, but it's not really.
25:41 It's just like the libraries, right?
25:42 So I can see that that would be very fraught with error.
25:45 Yeah.
25:46 So instead, the slogan we have is we reward community best practices by automating them.
25:52 We try and encourage people to just be boring and do the mainstream thing.
25:59 And in exchange, we can automatically build a Docker container or Docker image for them, which does what they want to do.
26:08 Right.
26:08 So one of the main things and the way you solve the challenge I started outlining in the beginning is like, it seems like magic.
26:14 Why is this thing that seems like it has a bunch of dependencies and is very specific?
26:19 Just run when I click it.
26:20 And it's because you've looked at the repository.
26:22 You've followed these best practices.
26:23 You've sort of guessed at the best practices and found what you're supposed to do.
26:28 And then you defined a Dockerfile that then built an image that then you run as a container when you click the button.
26:35 That sets up that environment, right?
26:38 Yeah.
26:38 And then, as a user, all you need to bring is something like a web browser.
26:44 And then at the very end, we connect you to it and off you go.
26:48 And with that, you have an environment.
26:49 So that's pretty interesting.
26:50 Where does that run?
26:51 Well, we're lucky to have four clusters around the world now, which is fantastic.
26:56 So we rely on people donating resources to run all these containers for people.
27:03 And we have one cluster in Google Kubernetes Engine.
27:08 We have one at OVH, which is a European cloud hoster.
27:12 We have one sponsored by the Turing Institute in London.
27:17 And we have one bare metal Kubernetes cluster at a social science research institute in Germany, in Cologne, called GESIS.
27:29 So they are really at the front, you know, none of this cloud hosting business.
27:34 We run our own Kubernetes cluster.
27:36 That's right.
27:37 You go in there, you pull the cover off of the 1U server, and you can just see Binder right in there.
27:42 Yeah.
27:44 Yeah.
27:44 That's really cool.
27:45 Are you looking for more or is it, is that enough?
27:48 Is it okay?
27:49 So our master plan for how to make this sustainable is to encourage more and more research institutes or cloud providers to host a small chunk of mybinder.org.
28:06 Because that way, each individual contributor doesn't have to pay a cloud bill, which is hundreds of thousands of dollars, but only maybe $20,000.
28:17 And if you're a big research university like MIT or University of Oxford or someone like that, spending $20,000 on something which benefits you because you can use it at your university for your researchers, but also benefits the whole world, it's still a difficult sell.
28:38 It's a much easier sell than if we come to you and say, do you have $200,000?
28:43 And they're like, oh.
28:44 In any currency, no.
28:45 Yeah.
28:47 People will just laugh at you and show you the door.
28:50 So that's the plan.
28:52 So the short answer to your question, are we looking for more, is yes.
28:56 It would be very interesting to add more clusters to this.
29:00 It should be easy to do technically now that we've gone from one to four.
29:07 It should be easy.
29:09 Pretty much after you go from one to two, that's the hardest step.
29:12 Right?
29:14 And then onward, it's not necessarily easy, but it's the...
29:17 Going from a very bespoke setup, where it's all matching this one thing, to like, now we have multiple environments.
29:22 Well, that's a pretty big step, I would say.
29:24 Yeah, that's true.
29:26 It comes with its own challenges because now you have four times as many things that can go wrong or idiosyncrasies to deal with.
29:37 But it's great to see that we can run on four such different Kubernetes flavors.
29:46 And I think it's great because it keeps us honest when we say we are not building something which is tied to any one cloud provider and their secret sauce.
29:58 So you can really take it and run it at home.
30:01 Right.
30:02 Yeah, that's super cool.
30:03 So basically, if you have admin access, like Binder has admin access to a Kubernetes cluster, it can create containers and pods and then spin them up and spin them down.
30:14 And like, it's basically just managing a Kubernetes cluster wherever that happens to run.
30:19 Exactly.
30:19 Yeah.
30:20 Okay.
30:20 Yeah.
30:20 That's what we do.
30:22 Yeah.
30:22 That sounds very exciting, actually.
30:25 That's cool.
30:25 Now, one of the things that's probably worth mentioning is if I have a GitHub repository and I go through this effort, this minor effort described at mybinder.org to like set this up and register it and whatnot and get my little badge so people can run my code there.
30:42 That's all well and good.
30:43 But what happens as I make changes to my GitHub repository?
30:47 Like, oh, I need a new version, a new library I'm going to add, or we're going to update this code, but it's going to require some underlying fundamental change to the environment.
30:55 Every time you make a commit to your repository, when you then launch it, we check in our cache, do we already have an image for this commit of this repo?
31:08 And if the answer is no, we will rebuild it or build it again.
31:12 Right, right.
31:13 Like probably you tag it by like the Git commit hash, the SHA, or something like that, right?
31:17 Yeah, exactly.
31:18 Okay.
31:18 And that's nice because anytime you change your repository and add a new dependency, we will just automatically rebuild it next time.
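The check described here can be sketched as a cache keyed by repository and commit. This is an assumption-laden illustration: `image_tag`, `ImageCache`, and the in-memory dictionary are stand-ins, whereas the real service consults a container registry.

```python
import re


def image_tag(repo_url, commit_sha):
    """Derive a tag from the repo URL plus the exact commit (illustrative)."""
    slug = re.sub(r"[^a-z0-9]+", "-", repo_url.lower()).strip("-")
    return f"{slug}:{commit_sha}"


class ImageCache:
    """Build an image only when this commit of this repo has no image yet."""

    def __init__(self, build):
        self._build = build   # callable(repo_url, commit_sha) -> image
        self._images = {}     # stand-in for a container registry

    def get_or_build(self, repo_url, commit_sha):
        tag = image_tag(repo_url, commit_sha)
        if tag not in self._images:
            self._images[tag] = self._build(repo_url, commit_sha)
        return self._images[tag]


if __name__ == "__main__":
    builds = []
    cache = ImageCache(lambda url, sha: builds.append(sha) or f"image@{sha}")
    cache.get_or_build("https://github.com/org/repo", "abc123")
    cache.get_or_build("https://github.com/org/repo", "abc123")  # cache hit
    cache.get_or_build("https://github.com/org/repo", "def456")  # new commit
    print(builds)
```

A new commit yields a new tag, so a changed dependency triggers a rebuild on the next launch without anyone registering the change.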
31:28 It has the disadvantage that it can take quite a long time to build your repository, depending on what crazy stuff you're trying to do, how many dependencies you need to install and compile from scratch.
31:43 It can take a long time to build the Docker image for your repository.
31:49 Right.
31:49 Do you do anything like layer this?
31:53 So with Docker, you can have like layers in your Dockerfiles and dependencies, and it'll cache all of them and only build the changes.
32:00 Do you do anything like here's the layer that is the dependencies of this project?
32:06 And here's the layer that actually has the code because the dependencies probably change infrequently, whereas the code is probably, you know, 10, 100 times more likely to change.
32:16 Right.
32:17 And the slow part is not getting the code.
32:18 The slow part is running the pip install -r on their Ubuntu or whatever.
32:23 Yeah.
32:24 So we try and be clever about how we order the layers in the Dockerfile.
32:29 Right.
32:30 Right.
32:30 Because that's the trick to making it do the minimum work when a change happens.
32:34 Yeah.
32:35 What makes it tricky is that a lot of package managers allow you to do arbitrary stuff.
32:41 We have a few, hopefully enough pieces of code that inspect your requirements.txt to try and figure out, are you referring to stuff in the rest of the repo?
32:53 Yes or no?
32:54 Because if you are, then we need to update the whole rest of the repo before we run your file.
32:59 If no, then we can cache it.
33:02 So, yeah, we try and be clever on that front.
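A rough sketch of the kind of check described here: look at requirements.txt, decide whether it points at files in the repository, and order the build steps accordingly. The function names and the matching rules are assumptions made for illustration; repo2docker's real detection logic may differ.

```python
from typing import List


def refers_to_local_files(requirements_text: str) -> bool:
    """Return True if any requirement line points into the repo itself."""
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        # Treat "-e ." the same as "." by stripping the editable flag.
        for flag in ("-e", "--editable"):
            if line.startswith(flag + " "):
                line = line[len(flag):].strip()
        # Including another requirements or constraints file also needs the repo.
        if line.startswith(("-r ", "-c ", "--requirement", "--constraint")):
            return True
        # Relative paths, absolute paths, and file: URLs are local references.
        if line.startswith((".", "/", "file:")):
            return True
    return False


def plan_layers(requirements_text: str) -> List[str]:
    """Order build steps so the slow install layer is reused when possible."""
    if refers_to_local_files(requirements_text):
        # The install needs the repo contents, so any code change redoes it.
        return ["copy repository", "install requirements"]
    # The install depends only on requirements.txt, so code changes reuse it.
    return ["copy requirements.txt", "install requirements", "copy repository"]
```

With the second ordering, a commit that only touches notebooks invalidates just the final copy step, and the expensive install layer comes from the cache.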
33:06 Right.
33:07 Because your biggest expense, at least I think, there's actually a Jupyter notebook that talks about this, which we could get into, which is fun.
33:14 But your biggest expense is compute, right?
33:17 Yes.
33:17 As opposed to storage or bandwidth or something.
33:19 Yeah.
33:19 We do have a fairly large Docker container registry.
33:23 And the funny story about it is that for a very long time, we paid no attention whatsoever to it.
33:30 And then one day in this notebook that you were referring to, which shows you how much money we spend on which line item in our cloud bill, we started noticing a new one.
33:43 And we're like, what is this?
33:44 Why is it getting bigger constantly?
33:45 And it turns out that there was, I think, tens of terabytes of images in our container registry.
33:54 And that is the level at which it starts being noticeable on our chart.
34:01 Right, right, right.
34:02 That's like $20,000 or $10,000 or something like that.
34:06 But it's a non-trivial number.
34:08 Yeah.
34:09 So I actually haven't looked at our billing costs in quite a while.
34:16 Before we had, when we only had two clusters, so that's probably at least half a year ago or so, we would say that it costs between $80,000 and $100,000 in Google Cloud cost to run mybinder.org.
34:35 Yes, the vast majority of that is paying for the virtual machines.
34:39 Right, right, right.
34:40 If you're a regular listener of the podcast, you surely heard about Talk Python's online courses.
34:47 But have you had a chance to try them out?
34:49 No matter the level you're looking for, we have a course for you.
34:52 Our Python for Absolute Beginners is like an introduction to Python plus that first-year computer science course that you never took.
34:59 Our data-driven web app courses build a full PyPI.org clone along with you right on the screen.
35:05 And we even have a few courses to dip your toe in with.
35:08 See what we have to offer at training.talkpython.fm or just click the link in your podcast player.
35:16 I guess that's worth talking about as well.
35:20 So there's a notebook, which I'll put the link into the show notes.
35:23 Basically, it's showing you the real-time, semi-real-time.
35:29 You can adjust it.
35:30 It's a notebook.
35:31 The cost over the day, over the week.
35:34 And it's a non-trivial number, right?
35:37 Like, I'm on the weekly cost, and it's showing, on average, maybe $1,000 a week plus a little bit more.
35:43 You know, that's...
35:44 I know open source is doing well these days, but that's a lot of money, $1,000 a week for compute.
35:50 So where does that money come from?
35:52 You talked...
35:53 You sort of hinted at it before, but where does it come from?
35:56 So the particular notebook there only tracks the cost we incur from Google.
36:04 Okay.
36:05 From the Google Cloud.
36:06 Which is, like, not necessarily a quarter, but one of the four, right?
36:08 Yeah.
36:08 It's probably the biggest of the four.
36:11 So I'll talk about how that bill gets paid.
36:16 At the very beginning, we were fortunate enough to have a grant from the Moore Foundation,
36:22 which included a chunk of money to pay for the bill.
36:25 And that was several years ago.
36:28 Now, we have friends, our benefactors at Google, who so far have been able to justify why Google should give us credits.
36:40 I see.
36:41 Credits that are high enough to pay the bill.
36:43 So that's fantastic to see.
36:45 So in some sense, Google's donating that compute to the project.
36:48 Yes.
36:49 Yeah.
36:49 Yeah.
36:50 Have you considered reaching out to the other big groups like Azure and AWS?
36:54 See if they want to be part of that party to get their name in some little halo effect?
37:00 Yeah.
37:01 So, I mean, that's how we got the cluster at OVH, which is a European cloud hoster.
37:07 They donate the compute resources for that cluster.
37:11 The Turing Institute's cluster runs on Azure cloud.
37:17 And I don't know exactly how the invoicing works there.
37:22 But I would imagine that at least in some way it's sponsored or supported by Azure.
37:30 The fact that the Turing Institute has enough spare money to finance such or donate resources for such an altruistic project.
37:43 Yeah.
37:43 Okay.
37:44 It's really cool because it really shows you that, you know, this is, it's not free.
37:48 And it's, I mean, at the core, right?
37:51 Somebody, those images have to get built somewhere.
37:54 They have to, the Docker containers and pods, Kubernetes pods have to run somewhere, right?
38:00 And this is, and also probably, I know it's probably a small percentage of the people that touch the button,
38:06 but it's still got to be a non-trivial amount of like real computational stuff happening, right?
38:11 Like it's scientific in large part, I would imagine.
38:14 Yeah.
38:14 So luckily for us, most people who start a notebook spend a lot more time thinking and reading than running code.
38:24 Yeah.
38:24 So we can cheat and overcommit our CPUs by quite a large factor.
38:30 Yeah.
38:31 I would think.
38:31 Yeah.
38:32 But yeah, I mean, they're not sitting there idling.
38:35 No, I'm sure.
38:36 Also, this doesn't help for the Python file and VS Code angle, but it might help on the notebook side.
38:44 Like a lot of those notebooks come pre-populated with the last data they had run, right?
38:49 You don't have to run them to read them initially.
38:52 Yes.
38:52 Is that true for your setup or is that just true for GitHub?
38:55 No.
38:55 So if the notebook in your repository has the output still in it from the last time it was run,
39:01 then we will show it to you because we use standard JupyterLab or Jupyter Notebook.
39:08 Right, right.
39:09 A UI to show it to you.
39:10 And so that way you get something to read before you ever try and run code.
39:17 Cool.
39:17 Yeah.
39:18 So maybe people actually load it up and never actually run it or they might not run the expensive bits or something.
39:23 So that's got to help as well.
39:24 Yeah.
39:24 So then it's a very expensive way to view your notebooks.
39:29 It would be good if you just want to read it.
39:33 Just read it on GitHub or something, right?
39:37 Yeah.
39:37 If the GitHub viewer works well, then that's good because that's much cheaper to run than starting up a Docker container.
39:46 For sure.
39:47 And, you know, I guess there are definitely benefits that you have on interactivity as well, right?
39:53 Like GitHub doesn't, it's kind of amazing that you can look at an output of a notebook and it looks like it ran,
39:58 but it's just the cached version or saved version of whatever was before.
40:02 But, you know, a lot of those have, like, the little widgets, those ipywidgets or whatever.
40:07 Yeah.
40:07 You have like little sliders that you can play with and like adjust it.
40:10 And in order to do those types of explorations, even if you're not writing code, but you're kind of exploring parameters,
40:16 you've got to have a real life system to do that, right?
40:19 Yeah.
40:19 Yeah.
40:20 You need some kind of kernel connected to it to run the computations that are needed when you slide the slider to 11.
40:28 Right.
40:29 Yeah.
40:30 Cool.
40:30 Now, I think it's awesome that all these people are donating the compute to the environment and making this available to the entire world.
40:39 But there might be situations where people want this type of system, but they, for some reason, can't put their data publicly out there or they don't want to link to it or they just want to keep it more controlled.
40:51 Is there a way to take what you guys have built and create, say, a "mine, mine, all mine" binder.org, you know, like an internal version of it that's not just out there on the public cloud?
41:04 Yeah.
41:04 Yeah.
41:04 So the software that is behind mybinder.org is called BinderHub, like JupyterHub, but BinderHub.
41:12 And that's open source, like all other Jupyter projects as well.
41:17 And you can deploy it yourself on your compute with your credit card.
41:23 And then you can access private repositories.
41:26 You can limit access to your binder instance.
41:29 So you have to log in instead of right now where it will, it's completely open to the public.
41:35 Sure.
41:36 Yeah.
41:36 So basically, once you have those links, you can just visit it.
41:39 Like that's the point.
41:40 You already emphasized that, right?
41:42 That you're not even supposed to have an account, which is kind of the opposite of I need to keep this private.
41:47 Exactly.
41:49 And also because we take compute, which has been donated to us by people.
41:56 We say, if you have private repositories, you probably also work for somebody who could pick up the bill themselves.
42:04 So it's a political decision that we've disabled these features for mybinder.org.
42:11 But you can take the same software and run it yourself.
42:14 And then you can access private repos.
42:17 So basically, there's no technical limitation.
42:20 But right now, in order to run one of your repositories on mybinder.org, it has to be a public repository.
42:28 That's right.
42:29 Because that's the zen that you guys are going after.
42:31 That's the overriding philosophy of what you're doing, right?
42:35 Yeah.
42:35 Okay.
42:35 Yeah, that sounds totally reasonable.
42:38 Now, I guess this repo2docker is also pretty interesting, potentially on its own, right?
42:48 Like, it's cool that you guys are using it for capturing these environments.
42:53 So you can run them on mybinder.org.
42:56 But is that something people would find useful outside of that use case?
43:00 Yes and no.
43:01 There's nothing stopping you.
43:03 Or we encourage people.
43:05 For example, when they...
43:06 So you can run repo2docker locally on your laptop.
43:10 And it will perform its little magic trick just as well as it does online.
43:15 Right.
43:15 We encourage people to do that, for example, if they're trying to debug why their binder isn't building or why it isn't doing quite what they want it to do.
43:24 Oh, interesting.
43:25 So if things are not working right, you're like, this container won't build or it's not finding the files or whatever it is, right?
43:31 The volumes aren't mapped right.
43:32 The ports aren't mapped right.
43:33 You can play around with repo2docker locally and get a better shot at figuring that out, huh?
43:38 Exactly.
43:38 Because you can see a lot more of the logs.
43:40 The turnaround is potentially much, much faster because you have a more powerful machine.
43:46 You're just in more direct contact with it than if you have to control it via some web form with the log automatically scrolling past and so on.
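For anyone who wants to try the local debugging workflow, here is a small sketch that assembles a repo2docker command line from Python. It assumes the `jupyter-repo2docker` package and Docker are installed; the helper names and the example repository are hypothetical, while `--ref` and `--no-run` are existing repo2docker options.

```python
import shutil
import subprocess
from typing import List, Optional


def repo2docker_command(repo: str, ref: Optional[str] = None,
                        run: bool = True) -> List[str]:
    """Assemble the command line for building a repository with repo2docker."""
    cmd = ["jupyter-repo2docker"]
    if ref:
        cmd += ["--ref", ref]      # build a specific branch, tag, or commit
    if not run:
        cmd.append("--no-run")     # build the image but do not start it
    cmd.append(repo)               # a Git URL or a local directory
    return cmd


def build_locally(repo: str, ref: Optional[str] = None) -> int:
    """Run the build and return the exit code; logs go to the terminal."""
    if shutil.which("jupyter-repo2docker") is None:
        raise RuntimeError("install it first: pip install jupyter-repo2docker")
    return subprocess.run(repo2docker_command(repo, ref, run=False)).returncode


command = repo2docker_command("https://github.com/example/demo",
                              ref="main", run=False)
```

Calling `build_locally(".")` from inside a checkout would build whatever is currently on disk, which avoids pushing a commit just to see the build log.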
43:58 Right, right.
43:58 I'm going to go do another commit to the repo to force a GitHub webhook to go off and make it happen again or whatever.
44:04 Yeah.
44:05 Exactly.
44:05 So you can use it locally.
44:08 The disadvantage is that you probably don't have as big a library of cached Docker image layers on your laptop that repo2docker uses.
44:19 Potentially you have lots of them already because Docker is very good at filling up your hard drive.
44:23 Right.
44:24 It won't clean up any of those cached ones unless you run like a Docker cleanup.
44:27 I forget the exact command.
44:29 But yeah, it's really easy to fill up your computer with Docker images that are stale cached results, right?
44:36 Exactly.
44:37 So you can run it locally.
44:39 I do run it locally once in a while to just get the environment that the author of the repository had in mind.
44:47 However, what I'm finding is that if the repository builds on mybinder.org or with repo2docker,
44:55 because it's in some way very simple-minded what repo2docker recognizes,
45:01 the instructions to do it by hand are also fairly straightforward.
45:05 So if you're comfortable using Conda, then it's quicker to make yet another Conda environment and just do what the author wanted you to do.
45:15 And because you can try whether it actually works by clicking on the badge in the readme and seeing it run on mybinder.org,
45:23 you're like, yeah, I'm pretty sure the instructions are complete.
45:26 And I'm not going to spend two hours and then find out that they forgot something.
45:31 Sure, sure, sure.
45:32 That's pretty cool.
45:33 I guess maybe if it depended on Linux and you were on Mac or something like that,
45:39 that might be a quick way to jumpstart a Docker container without knowing too much about Docker.
45:45 But yeah, if it's a kind of standard environment, then there's probably not a huge benefit.
45:49 Yeah.
45:49 If you're on a different flavor of Linux or whatever, then yes, automating the build of the Docker container and having it run locally is great.
46:02 And it is something people do, but it's, I would say, just not as convenient as clicking the link.
46:09 So that's why not as many people are doing it.
46:11 Yeah, for sure.
46:12 It's easy.
46:13 Can I get a public repository into mybinder.org that didn't decide to be in there?
46:21 And what I mean is I've got to go through mybinder.org and put in the repo info and stuff in general to set this up.
46:28 Could I just go, this one doesn't have a link, but I'm just going to give it to mybinder.org anyway and see what happens.
46:35 Yeah.
46:36 Yes.
46:37 You don't have to be an author of the repo or have any special rights on the repo to launch it.
46:42 Right.
46:42 So with no account and it's required to be public, right?
46:46 It's pretty hard to not fill out the form at mybinder.org with the info you want, right?
46:51 Exactly.
46:52 And a lot of projects will just work.
46:55 Yeah.
46:55 Because of the practices that you follow, right?
46:57 Exactly.
46:58 Yeah.
46:58 So then you can open a pull request and add the badge to the readme.
47:02 That's exactly what I was thinking.
47:03 Yeah.
47:04 So if you find a repo that should have a launch in Binder and yet it doesn't, a real nice way to make a simple contribution would be to go fill out mybinder.org.
47:14 If it works, go do a PR with the details required.
47:17 I mean, you could just edit the readme file on a branch, put the little icon there and submit that as the PR like here.
47:24 I was missing this.
47:26 I fixed it.
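The badge itself is one line of Markdown that wraps the mybinder.org badge image in a link to the launch URL. A minimal sketch for generating it follows; the function names and the example owner, repository, and branch are placeholders, and it covers GitHub repositories only.

```python
from urllib.parse import quote


def binder_launch_url(owner: str, repo: str, ref: str) -> str:
    """Launch URL for a public GitHub repository at a branch, tag, or commit."""
    # Refs such as "feature/x" contain a slash, so percent-encode the ref.
    return f"https://mybinder.org/v2/gh/{owner}/{repo}/{quote(ref, safe='')}"


def binder_badge_markdown(owner: str, repo: str, ref: str) -> str:
    """Markdown snippet to paste into a README."""
    url = binder_launch_url(owner, repo, ref)
    return f"[![Binder](https://mybinder.org/badge_logo.svg)]({url})"


badge = binder_badge_markdown("example-org", "demo", "main")
```

It is worth opening the launch URL once and checking that the build succeeds before putting the snippet into a pull request.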
47:26 Yeah.
47:27 That's a super contribution for the project.
47:30 It's a good way to get started and it will help whoever comes next to the project because they'll see the badge.
47:38 And even if they don't yet know what mybinder.org is, they might click it and then they can run the examples.
47:45 Yeah.
47:46 It seems like a really nice, generous thing to do and super easy.
47:49 Well, cool.
47:50 So we're getting near the end of our time and I was going to ask you just a really quick sort of high level question.
47:57 Looking back a bit on our conversation and your experience over time.
48:02 People recently have been asking me, why do you think Python has become so popular in the data science and the scientific computing space over the last five years or so?
48:11 You know, someone who worked at CERN and has been probably more involved in that than I have in a lot of ways.
48:16 What are your thoughts?
48:17 Why Python?
48:19 Why did it become so much more popular recently?
48:23 I think the short answer is I don't know because I've, since I was 15, thought Python is a pretty good language for programming.
48:32 Finally, everyone agrees with me.
48:34 Exactly.
48:35 So I don't really know what happened.
48:38 I think part of why it's Python and not some other language is, and this is a quote somebody else told me: you invite people to a job interview and make them do some coding on the whiteboard or something like that.
48:57 And they will write, even if they are, I don't know, C++ or Julia or whatever programmers, often they will write in something which looks very much like Python on the whiteboard.
49:09 They will not put all the curly brackets.
49:14 Or if you read Wikipedia, often in articles about algorithms there will be a little section in pseudocode, and you can almost copy and paste that and it will run.
49:25 Right.
49:25 So I think that's a killer feature in some sense.
49:29 It makes it just so much more accessible to people who don't think of themselves as programmers.
49:35 They are, they work in an insurance company and they want to do some data science.
49:42 They don't think of themselves as programmers.
49:45 They are, I don't know, actuaries or.
49:47 Right, right, right.
49:48 I think that agrees a lot with my general working theory is that Python is one of these special languages where it's as simple as possible to get started with like a tiny bit of computation.
50:01 You don't even have to have a function, you know, forget compiling, linking, headers, static main voids, just like here's the three lines I want.
50:08 I'm going to use those libraries.
50:10 I'm going to do this.
50:11 And so people easily get into it in the beginning.
50:15 But then unlike a lot of easy languages, you don't really outgrow it.
50:20 Right.
50:20 It's not like, well, you can't use Python anymore because now we're doing astronomy or whatever.
50:25 I was like, no, it's just, you just keep using the libraries and you can keep adding more computer science like concepts, functions, and then classes, then generators, then whatever.
50:35 So I feel like it sucks people in who don't believe, just like you said, that they're programmers.
50:41 They believe they're biologists, astronomers, physicists, whatever.
50:45 But then it's kind of got this, like, well, I already know this.
50:49 It's totally working.
50:50 Why would I leave and pick up a harder language?
50:53 Because there's just kind of this gravity.
50:55 Like, once you get sucked into it, there's no reason to get kicked out of it because you can just keep growing with the language and the libraries.
51:03 That's my theory as well.
51:04 Yeah.
51:05 Yeah, I think so too.
51:07 And the fact that we can put a nice user interface, in the programming language that you use, in some sense onto big, trusty Fortran and C++ libraries that have been around to do really heavy lifting, linear algebra and whatnot.
51:29 I think that is fantastic.
51:32 If you imagine Python didn't have that feature, then we would be stuck.
51:36 We'd have to write all this stuff from scratch.
51:38 Yeah.
51:39 Like this, we can put a nice UI on Fortran code and get all the benefits of having well-tested Fortran code do crazy linear algebra for us.
51:53 That's perfect.
51:54 But not think about it ever again.
51:56 Nice.
52:00 All right.
52:00 Well, yeah, that was definitely a good, insightful answer.
52:03 I like it.
52:04 All right.
52:05 Now, before you get out of here, before we call the show, you've got to answer the final two questions.
52:09 So, if you're going to write some Python code, what editor do you use?
52:13 Today, I use Atom.
52:14 Okay.
52:15 Well, cool.
52:16 Yeah, Atom is neat.
52:16 My favorite editor probably is Emacs in a console.
52:21 But at some point, I had to admit that I didn't want to spend as much time as I used to trying to understand how to get JavaScript tools to hook into my editor and auto-formatting stuff.
52:37 So, yes, I use Atom now.
52:39 Nice.
52:40 Very cool.
52:42 And then notable PyPI package.
52:44 Maybe not something necessarily super popular, but something like, oh, I saw this and it was so cool.
52:49 You should check it out.
52:50 Need to think.
52:51 While you're thinking, let me tell you about one thing that I ran across.
52:54 It's not exactly that, but it's actually super cool.
52:58 And I think it's relevant to the audience who would be interested in this topic as well.
53:02 So, have you heard about Carnets?
53:04 C-A-R-N-E-T-S?
53:06 Carnets?
53:07 No.
53:07 So, this is a standalone Jupyter Notebook environment that runs on iOS.
53:14 Not like the browser sees it, but it actually has NumPy and SciPy and all that stuff installed and Python executing disconnected on iOS.
53:25 So, we could put that one out there for sure.
53:28 And that's cool and it's open source and people can go check it out.
53:32 It does uncertainties.
53:34 So, maybe, yeah, it is actually just called uncertainties.
53:36 Oh, nice.
53:38 That's a good name.
53:39 Yeah.
53:39 So, it lets you put in numbers with uncertainties on them.
53:45 Okay.
53:45 And then you can do complicated computations on it and it will spit out, you know, the answer is three plus or minus 1.7.
53:55 Right, right.
53:56 Because if you have plus or minus 1, but then you square it and then you add this other uncertainty to it, right?
54:01 Like, the propagation of uncertainty is not obvious.
54:05 It's uncertain even.
54:06 Yeah.
54:08 I mean, it's one of these things where it's not difficult, but it's hard to do right.
54:14 Yeah.
54:14 Yeah.
54:14 You very easily make mistakes.
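To make the square-then-add example concrete, here is a standard-library sketch of first-order error propagation for independent quantities. The `Measured` class and the numbers are made up for illustration; this is not how the uncertainties package is implemented, and that package additionally tracks correlations between variables, which this sketch does not.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Measured:
    """A value with a one-sigma uncertainty (hypothetical helper)."""
    value: float
    sigma: float

    def __add__(self, other: "Measured") -> "Measured":
        # Independent uncertainties add in quadrature.
        return Measured(self.value + other.value,
                        math.hypot(self.sigma, other.sigma))

    def squared(self) -> "Measured":
        # d(x^2)/dx = 2x, so the uncertainty is scaled by |2x|.
        return Measured(self.value ** 2, abs(2 * self.value) * self.sigma)


x = Measured(3.0, 1.0)      # 3 plus or minus 1
y = Measured(2.0, 0.5)      # 2 plus or minus 0.5
result = x.squared() + y    # value 11, sigma sqrt(6**2 + 0.5**2)
```

Squaring needs its own rule because x is perfectly correlated with itself; treating x times x as a product of two independent numbers underestimates the error, which is the kind of mistake that is easy to make by hand.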
54:16 So, yeah.
54:17 That's cool.
54:17 Uncertainties.
54:18 That's a cool package.
54:19 Okay.
54:19 Uncertainties.
54:19 Nice.
54:20 Another one that's kind of in that realm is Pint, which lets you do different units and multiplications and divides and whatnot.
54:28 It's pretty cool.
54:29 Okay.
54:30 Very cool.
54:30 That's a good recommendation indeed.
54:32 All right, Tim.
54:33 Well, we're about out of time.
54:34 So, maybe a final call to action.
54:36 People are excited about mybinder.org and this whole concept that we talked about.
54:41 What do they do?
54:42 Try it out.
54:43 Tell a friend about it.
54:44 That is, I think, an easy thing to do.
54:47 And you can get it done because it's easy.
54:52 Yeah.
54:52 Yeah.
54:53 It looks like you've automated a lot and it looks really not hard at all to get started with.
54:57 Yeah.
54:57 And, of course, we would be super happy if people find mistakes or want to add new features if they stop by and make code contributions, make contributions to the documentation, or help us deal with the fact that there's so many people who use it.
55:17 If you have questions, you can contribute to mybinder.org, especially if you can't program.
55:25 That's actually an asset because there's lots of people who know how to program, who help run mybinder.org.
55:31 But explaining how it works, advertising it, giving talks about it, all that kind of good stuff.
55:40 Adding launch in Binder to various repositories that should have it.
55:44 Yeah.
55:44 For example, yeah.
55:46 Yeah, those are pretty straightforward things to do.
55:48 Awesome.
55:48 All right.
55:49 Well, it was great to learn more about mybinder.org and what you guys are up to.
55:53 Thanks for being on the show.
55:54 Thanks for taking the time to talk to me.
55:57 Yeah, you bet.
55:58 It was great.
55:59 Bye.
55:59 Bye.
55:59 This has been another episode of Talk Python to Me.
56:03 Our guest on this episode was Tim Head, and it's been brought to you by Linode and Talk Python's online courses.
56:09 Start your next Python project on Linode's state-of-the-art cloud service.
56:14 Just visit talkpython.fm/linode, L-I-N-O-D-E.
56:18 You'll automatically get a $20 credit when you create a new account.
56:21 Want to level up your Python?
56:24 If you're just getting started, try my Python Jumpstart by Building 10 Apps course.
56:29 Or if you're looking for something more advanced, check out our new async course that digs into all the different types of async programming you can do in Python.
56:37 And of course, if you're interested in more than one of these, be sure to check out our Everything Bundle.
56:42 It's like a subscription that never expires.
56:44 Be sure to subscribe to the show.
56:46 Open your favorite podcatcher and search for Python.
56:48 We should be right at the top.
56:50 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm.
56:59 This is your host, Michael Kennedy.
57:00 Thanks so much for listening.
57:02 I really appreciate it.
57:03 Now get out there and write some Python code.
57:05 I'll see you next time.