#196: Datalore: Hosted smart notebooks Transcript
00:00 If you're doing any sort of data exploration, you've likely heard about Jupyter Notebooks.
00:03 In fact, there are quite a few options for running and hosting your Jupyter Notebooks.
00:08 You may have also heard me rave about PyCharm as an editor. Well, on this episode,
00:12 you'll meet Adam Hood from the Datalore team at JetBrains. That's a new project that tries to
00:17 bring the power of PyCharm to notebooks and more. This is Talk Python to Me, episode 196,
00:22 recorded January 9, 2019.
00:24 Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the ecosystem,
00:43 and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm @mkennedy.
00:48 Keep up with the show and listen to past episodes at talkpython.fm and follow the show on Twitter via
00:53 @talkpython. This episode is sponsored by Linode and brilliant.org. Please check out what they're
00:59 offering during their segments. It really helps support the show. Adam, welcome to Talk Python.
01:04 Let's jump right in. How'd you get into programming?
01:06 How did I get into programming? I guess, well, originally my dad is a programmer,
01:11 so he tried to teach me when I was, I don't remember, 10 or 11. It didn't really take then,
01:16 but then I started taking classes in high school and college. Python, I didn't know anything
01:23 about Python or start learning Python until college. I took a few classes, but even then,
01:28 the bulk of my exposure to Python has been working on this project specifically. Before that,
01:34 it was just a couple of classes that I took.
01:36 Okay. So mostly college experience, you'd say, with getting into programming or like what,
01:41 did you go to college for computer science? What was the start there?
01:44 No, my degree is actually in math.
01:46 Okay.
01:46 I ended up going into programming partly to put off graduate school. And it was, I mean,
01:52 I had taken a number of programming classes and I also worked in a lab for two years and the bulk
01:58 of my work was programming in MATLAB granted, but still programming.
02:02 Yeah. I think that's how a lot of math people get into programming. They start there and they're
02:06 like, yeah, this is kind of crummy. What can I do other than this? What can I do other than MATLAB?
02:10 There you go.
02:11 Yeah. That's actually real similar to my story. Actually, I started in math and got into programming.
02:18 And I spent some time writing MATLAB stuff as well, but bailed on it as soon as I could,
02:24 basically.
02:24 Right.
02:25 Nice. So when you joined the Datalore team or the team that's working on the project that became
02:30 Datalore, that's more or less when you got into Python.
02:33 I guess it was at first we weren't really working on a Python editor. So it was probably another year
02:41 or so before I actually got into Python.
02:43 Right. So you got to build the foundation of all the tooling and whatnot.
02:46 Right.
02:47 Before it matters. Yeah.
02:48 Working on this project, that's when I really started getting into Python, especially when I started,
02:53 I've gone to a couple of conferences. I've demoed the tool a fair amount. So I've
02:58 certainly had some exposure to at least simple Python from that.
03:02 Yeah, sure. That's cool. So what do you do day to day at JetBrains?
03:06 I'm a developer. I'm working on building features, fixing bugs, and so on. I can say I've worked
03:13 around various different parts of the tool at various points in time. Maybe we could get into that later
03:18 once we talk more about the features that we actually have, I guess.
03:21 Yeah, yeah, sure. Well, let's just start there. Let's set the stage a little bit with what you
03:26 guys have created and what you've got here. And then we'll maybe take a step back. So what is Datalore?
03:32 Like what else is out there that's comparable or is it something totally new?
03:35 Well, it's a little bit new. Things that are comparable would be Jupyter Notebooks, especially
03:41 if you talk about Google Colaboratory, I guess. It's this computational notebook. In our case,
03:48 it's hosted in the cloud, also completely collaborative. You can share with
03:54 coworkers, collaborators, whoever. In my limited experience working with Jupyter and also based on
04:00 talking to a lot of people, I get the impression that a lot of people, they'll use Jupyter, but they'll
04:05 use something else as well sometimes to write the code and then they execute in Jupyter. A lot of people
04:11 have come to our booth at a conference talking about how they use PyCharm and Jupyter together and so on.
04:16 So maybe one idea here is that this is a single unified tool, which kind of takes the ideas from
04:22 Jupyter as the computational notebook, but also from PyCharm as an IDE. Also the collaboration that you
04:29 might find in Google Docs or now in Colaboratory.
04:31 Okay. So let's come back to that collaboration stuff because I think that's super interesting,
04:35 but maybe let's just take a step back and just talk about notebooks in general, not the computers
04:41 with keyboards, but, you know, Jupyter and other computational notebook type things. There was a
04:46 recent article and I forget what it was. It was in The Atlantic and it's, I've maybe mentioned it a
04:52 time or two on the show. And basically the title is "The Scientific Paper Is Obsolete." And it's
04:58 really kind of provocative. It has like this picture of like a scientific paper with 15 authors on it.
05:05 And it's literally on fire, right? Like it's crazy. And it's actually a super interesting look into the
05:11 history of where notebooks came from. And it looks a little bit at Mathematica and Wolfram and why that
05:19 really didn't take off in the scientific space in the same way, and how Jupyter and things like
05:25 Datalore and Google Colaboratory and whatnot are really redefining what it means to
05:32 present like computational work, like science, you know, hard science type stuff. Let me get your take
05:36 on that. What do you think about this positioning that like notebooks are the new scientific paper
05:40 for data science or for, you know, real science, like hard science, you know, chemistry, whatever.
05:45 Oh, it might take some doing to convince all the journals of that.
05:49 To be fair, they have a very strong financial vested interest in keeping their position. I think there's
05:58 actually something fairly wrong about this whole scientific journal setup because,
06:03 you know, so much of this research is funded by taxpayers and then is published in these private,
06:11 ultra expensive journals. Like, isn't that our research? Didn't we pay for that? So that's a whole
06:16 other angle, right? I mean, we shouldn't go down that too far, but there's also a lot of politics
06:20 involved in my experience, at least from what I've seen. But if we come back to the idea of the
06:26 notebooks, I mean, I remember seeing Mathematica for the first time and I thought it was a really awesome
06:31 tool. And it's pretty much what you were saying about the article, it was pretty
06:38 much exactly that. Everything that you have in the article, like you have this idea that you can have
06:44 the code and these nice visualizations or outputs, you can print out the mathematical formula and so on,
06:52 all just coming together in one thing. That was really, I thought that was really awesome when I
06:57 first was introduced to it. And yeah, in a lot of ways you can start to, I don't know if I want to say
07:03 replace papers because that's hard to do, but you do get to see the scientific process
07:09 when you look at a nice, well-organized notebook.
07:12 I mean, that does assume that it is put together for presentation, right? I mean,
07:17 you've probably seen a ton of notebooks that are just like the equivalent of spaghetti code,
07:21 just scrambled all through there. And you're like, what is this thing? And that doesn't actually help,
07:25 but like, at least it's the canvas upon which that could be built, right?
07:29 Right, exactly. And I think there's nothing wrong with this spaghetti code type notebook in the sense that,
07:36 you know, the first stage you have to do a lot of exploration or you're not going to get anywhere.
07:40 And it's not until you have to present it that you need to really think about it.
07:44 Yeah. So I agree. If you look at the way people work with notebooks and then at people who don't,
07:49 you know, say like web developers that work in something like PyCharm or Visual Studio Code,
07:55 there's this sort of iterative, exploratory way of working with your tool.
08:00 You don't get that in these other tools, really. Not in the same way, right?
08:04 You've got to rerun them all from scratch. You can't just like, let me tweak this formula and rerun that cell
08:09 and see how that looks. Right? So I definitely think there's this interesting way of working
08:14 in this exploring data with notebooks. Yeah. Let's dig into Datalore.
08:21 Sure.
08:21 Tell people what it is. Like, how do you guys, if you're at a conference, you know, be it PyCon or whatever,
08:27 and someone comes up and says, Datalore, what's that? What do you tell them?
08:30 I've had a lot of different lines throughout the years, actually.
08:34 It's an online computational notebook in Python.
08:37 To me, it kind of looks like something that is really familiar to people that use Jupyter,
08:43 but has a lot of the features of, say, PyCharm.
08:47 That's definitely one of the ideas. Yeah.
08:49 A lot of the code smarts around understanding the code, suggesting fixes and improvements and whatnot,
08:55 like PyCharm does, but in notebook form, right?
08:59 Yes. And I mean, that's always been one of the big trademarks of the whole company,
09:04 this idea of understanding the code, helping you with your code. So it would be
09:09 kind of remiss on our part if we didn't include that.
09:12 Yeah. It definitely would look out of place if it wasn't there, right?
09:15 Yeah.
09:15 You have all these cool, like, code assistant, code intention tools. Like,
09:19 why are they not being brought over here? When you talk about building it,
09:23 who do you imagine your users are?
09:26 I can't really speak to specifics. I would just say, like, the whole realm of data scientists,
09:31 data analysts, anyone who works with data and works with data in Python in particular,
09:37 hopefully could use this tool.
09:39 Sure. Like, pretty similar to the folks that might choose Jupyter?
09:42 Yes.
09:43 Okay. Maybe it's worth doing, like, a compare and contrast between Datalore and Jupyter or JupyterLab.
09:48 So how are they different? What does Jupyter have that you guys don't? What do you have
09:54 that Jupyter doesn't? Maybe position those two things for us.
09:57 Sure. I guess it's easier for me to talk about what we have that Jupyter doesn't.
10:01 Sure.
10:01 Not that I want to say Jupyter is awful. I mean, obviously, Jupyter is doing a great job of filling
10:06 the space right now. One of the big differences is the way that we work with the environment that
10:13 you have. In Jupyter, you have this one global environment. You execute the code in some cell,
10:20 and it executes on top of that environment.
10:22 So, like, basically, if I declare, like, a variable in a cell, it's just global to all the other cells,
10:27 and it's that kind of thing?
10:29 At least that's my understanding of Jupyter up to this point.
10:33 Okay.
10:33 Datalore is a bit different. You can almost think of it as after every cell,
10:38 you have a different environment. So you have whatever you write into the first cell,
10:43 you execute that, or by default, it executes automatically, and you have that environment.
10:49 But then you start adding to another cell. That'll execute on top of the environment from the previous
10:55 cell. Basically, the idea here is that, at least for me working with Jupyter, one thing that gets
11:01 confusing is if I start having to re-execute previous cells, the environment gets a little bit
11:07 confusing for me to keep track of just because, you know, I see this sequential code,
11:12 but that doesn't represent what we actually have in the environment.
11:15 Right. Like, maybe there's 10 cells, and you went back and you fiddled with number three and re-ran it,
11:21 but you didn't run four, which would have actually taken the output of three to make a difference.
11:24 You know, like that kind of thing, right?
11:25 Right. And for us, that's not the reality. For us, like, if you go back and fiddle with three, you don't necessarily have to execute four,
11:34 five, and six right away. But if you do, that will, they will reflect what you had from three.
11:40 Okay.
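A minimal sketch of the two execution models described above. This is a toy model in plain Python, not Datalore's or Jupyter's actual implementation; the function names `run_global` and `run_chained` and the three example cells are assumptions made for illustration.

```python
def run_global(cells, order):
    """Jupyter-style toy: every cell mutates one shared namespace, in click order."""
    env = {}
    for index in order:
        exec(cells[index], env)
    return env


def run_chained(cells):
    """Chained toy: cell i always starts from the environment left by cell i-1."""
    environments = []
    env = {}
    for source in cells:
        env = dict(env)  # fresh namespace per cell, seeded from the previous one
        exec(source, env)
        environments.append(env)
    return environments


cells = ["x = 1", "x = x + 1", "y = x * 10"]

# Run top to bottom, then go back and re-run the second cell only.
shared = run_global(cells, order=[0, 1, 2, 1])
print(shared["x"], shared["y"])  # x moved on, y is stale relative to the code shown

# In the chained model the state after the last cell always matches the code as read.
final = run_chained(cells)[-1]
print(final["x"], final["y"])
```

In the shared-namespace run, `y` no longer equals `x * 10` even though that is what the third cell says, which is the kind of confusion described in the conversation.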
11:41 So I guess that's one pretty big difference. I touched on it already, but we have
11:45 two different computation modes. By default, we have basically live computation. Like,
11:51 as you write the code, we're executing it.
11:53 Even before you say execute cell, like, just like if you write, as you type lines out, it just, if it's valid syntax, it runs?
11:59 Yeah, pretty much.
12:00 Okay.
12:00 It's useful if you're doing lightweight stuff. The idea is, like, you're getting the feedback
12:05 right away. If you write something and you try to reference a variable that doesn't exist or
12:10 something, you know that right there. Of course, if you have something that's really long running,
12:15 you might not want to run it a million times, but we do have a more Jupyter-like computation mode
12:23 where you actually execute the cell. So I would say that's another kind of key difference.
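The "run it as you type, if it's valid syntax" idea can be sketched with the built-in `compile` and `exec`. This is only an illustration under the assumption that each keystroke re-submits the whole cell; `live_run` is a hypothetical helper, not a Datalore function.

```python
def live_run(source, env):
    """Execute `source` only once it compiles; report the outcome as a string."""
    try:
        code = compile(source, "<cell>", "exec")
    except SyntaxError:
        # Half-typed code: do nothing until the text parses.
        return "waiting: not valid syntax yet"
    try:
        exec(code, env)
    except Exception as exc:
        # Valid syntax but a runtime problem, e.g. an undefined variable.
        return f"error: {type(exc).__name__}: {exc}"
    return "ok"


env = {}
print(live_run("total = (1 +", env))      # still being typed
print(live_run("total = (1 + 2)", env))   # complete, so it runs
print(live_run("print(totl)", env))       # typo surfaces immediately
```

A long-running cell would be re-executed on every valid edit under this scheme, which is why an explicit "execute the cell" mode is the better fit for heavy computations.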
12:27 Yeah. Probably the other one, I guess you might say, it's like, it's in the cloud, period, right?
12:33 It's in the cloud. Yes.
12:34 You can't just locally pip install Datalore.
12:37 Nope.
12:37 Datalore notebook sort of thing. So for both the good and the bad that comes with that,
12:42 right? But it's definitely a difference between the two.
12:45 Yeah. It's in the cloud. Some of the things that come with being in the cloud is, like, for one thing,
12:50 it's collaborative.
12:52 Right.
12:53 A lot like a Google Docs. I mean, you can share with coworkers, whoever, that you can have them just view it,
13:00 or they could also edit your workbooks.
13:03 On the collaborative bit, that sounds super interesting. And it's really cool. I think this is actually a trend
13:10 in a decent amount of the tooling to have this sort of collaborative live coding, right?
13:16 I know there's been so many folks who've done all sorts of tricks to get some sort of, like, real-time collaboration,
13:24 even so much as, like, going to Amazon, spinning up a virtual machine, that you can both log into over remote desktop or something, and actually, like, we're going to both type on this.
13:34 And, like, that's probably not the right way to do it.
13:36 But, like, people were forced to do this, right?
13:39 Because we got Google Docs and cool stuff like that, but PyCharm doesn't do it. The other tooling doesn't do it.
13:45 But, for example, Microsoft announced this with Visual Studio, like, this year, where you can remotely share a session,
13:51 and you can even, like, remotely debug the other person's code on their machine.
13:56 I mean, it's pretty interesting. I think this is, you know, tapping into a trend that is making a lot of sense, yeah.
14:02 Anytime you have a project where you need multiple people to work on it together, you need some sort of tooling for collaboration.
14:08 The alternative has been to check it into Git and merge it.
14:12 Yeah. I've only interacted with Git with Jupyter Notebooks once, and it was a painful process.
14:19 I can imagine. Well, and also the exploratory nature doesn't make a lot of sense.
14:24 Like, you're like, oh, I changed a variable, I checked it in, do a pull, right?
14:28 Like, it doesn't, I mean, that's, like, too quick of an iteration, right?
14:31 Like, it's, I added a feature, check it out. That's a different type of thing.
14:35 But Notebooks, like you said, sort of tend to be more iterative and exploratory, and I think that makes a lot of sense.
14:41 What does the collaboration look like?
14:43 So I've played with Datalore, but I haven't played with it with other people.
14:46 So how real time is it?
14:49 You know, is it every word, every character, every cell?
14:52 Pretty much every character that you type.
14:54 You could end up sending several characters at once.
14:57 I don't know exactly how that works.
14:59 But more or less every character, as you type it, it's getting synchronized to your collaborators and so on.
15:05 We have this mode where you can actually follow what your collaborator is doing.
15:09 Like, they have a little avatar in the top right corner.
15:12 If you click on it, you'll start tracking them.
15:14 So as they make edits, you'll see them real time.
15:18 That's pretty awesome.
15:18 Have you seen people do anything or heard of people, like, using this for teaching or, say, like, live presentations where everyone else can, like, come in and, like, here, we're all sharing the notebook and all the students follow along or the participants follow along?
15:34 It's definitely something that you can do.
15:36 And that I think some people have done that.
15:39 It sounds so much better than slides over WebEx.
15:42 Right.
15:43 You know?
15:44 Like, here, open this in your browser and follow along as we do it.
15:47 Yeah.
15:47 I mean, one thing that was kind of cool was we had this little, like, internal workshop.
15:52 One of our data scientists did a presentation on data science, just, like, some overview stuff.
15:58 And he was doing it in Datalore.
15:59 And there is, like, a big difference between just, like, watching what he's doing if he's screen sharing versus just being able to follow him as he moves along.
16:09 And that was really cool.
16:10 Yeah.
16:10 It seems like there's a lot of possibilities.
16:11 Yeah.
16:12 Yeah.
16:12 I think definitely.
16:13 Nice.
16:13 So I guess this puts it kind of in the realm of Google Colaboratory, which I actually haven't used very much.
16:19 I don't know if that has live collaboration, but I'm guessing by the name that it does.
16:23 Yeah.
16:24 I would assume so.
16:25 I haven't used it.
16:26 I have opened it.
16:28 It looks to be just a new format for Google Docs or Sheets or Slides, except it's a notebook instead of one of those things.
16:37 Pretty nice thing to have.
16:39 Yeah, that's cool.
16:40 So I haven't done much with Google Colaboratory, and I haven't heard a ton of people talking about it, but it does seem pretty nice.
16:46 Yeah.
16:46 Cool.
16:46 So maybe we could dig into some of the features that make Datalore cool.
16:53 So there's a few things that caught my attention.
16:56 Maybe I'll bring them up and then you can describe what they are and how they work and why they're cool or whatever.
17:02 So one of them is smart code completion.
17:06 Sort of context-aware code completion.
17:10 I mean, basically all the code completion you have in PyCharm, you also have over in Datalore, right?
17:14 In some ways, we could have even more because we're aware of everything that's in the environment already.
17:20 But yes.
17:21 Okay.
17:22 It feels like it's more of a real editor when you're working in the cells, to me.
17:27 That's the idea, at least.
17:28 That's the hope.
17:29 Yeah.
17:29 I think one of the cool things, just going between Datalore and PyCharm for me, though, was in PyCharm, you're going to get all of the static inspections, completion, everything.
17:40 But you can even get runtime information as you go along and you edit in Datalore because it is running as you go along.
17:48 Oh, I see.
17:49 So maybe a field was added to some object at runtime.
17:52 And because it's just hanging out there in the environment, you could get completion on that.
17:57 Whereas in PyCharm, it's got to be static analysis, right?
18:01 That's what you're saying?
18:02 Yeah.
18:02 I think so.
18:03 Yeah, yeah.
18:03 That's pretty awesome.
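A small sketch of why a live environment can offer completions that static analysis would struggle with. It assumes completion is driven by inspecting the running object with `dir`; `runtime_completions` and the `Record` class are hypothetical names for illustration, not part of Datalore or PyCharm.

```python
class Record:
    def __init__(self):
        self.name = "iris"


def runtime_completions(obj, prefix):
    """List public attributes of a live object that start with `prefix`."""
    return sorted(
        attr for attr in dir(obj)
        if attr.startswith(prefix) and not attr.startswith("_")
    )


rec = Record()

# The attribute name is computed at runtime, so it never appears
# literally in the class definition.
setattr(rec, "n_" + "rows", 150)

print(runtime_completions(rec, "n"))
```

Because the object already exists in the environment, the dynamically added `n_rows` field shows up next to `name`.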
18:04 So then you have all the code inspection, like this thing has a type annotation that takes a list, but you're passing a string.
18:13 What gives here?
18:14 That kind of stuff, right?
18:15 As far as I know, we're still working on adding a bunch of those inspections.
18:18 So we only have a few.
18:19 I think one example is we have an inspection for unused variables.
18:24 If we see a variable that's just sitting around completely unused, we'll highlight it.
18:29 We probably have a couple of others.
18:31 I don't know all of them.
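An unused-variable inspection like the one mentioned can be approximated with the standard `ast` module. This is a deliberately simplified sketch: it ignores scopes, attribute access, and cross-cell usage, and `unused_variables` is a hypothetical helper rather than Datalore's actual inspection.

```python
import ast


def unused_variables(source):
    """Return names that are assigned somewhere in `source` but never read."""
    assigned, used = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                assigned.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                used.add(node.id)
    return sorted(assigned - used)


cell = """
rows = 150
cols = 4
print(rows)
"""
print(unused_variables(cell))
```

Here `cols` is assigned but never read, so an editor built on this idea would highlight it.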
18:32 Sure.
18:33 Navigation.
18:34 So you can navigate, turn parts of your notebook into hyperlinks that take you to like where the function is defined or the variable and things like that?
18:42 Yeah, yeah.
18:42 You can.
18:43 I've only used it a few times, but you definitely can do that.
18:45 I don't remember the shortcut even.
18:48 When I'm developing software, that's one of the most useful things that I use.
18:54 Like I always want to see the function.
18:56 Absolutely.
18:56 I mean, that's one of the things that drives me crazy about some of the more basic or, you know, not multifile type of editors that you can't just say, go to definition on this.
19:09 I don't care where it is, go find it and show me what it is.
19:13 You know what I mean?
19:13 Yeah.
19:16 This portion of Talk Python to Me is brought to you by Linode.
19:19 Are you looking for hosting that's fast, simple, and incredibly affordable?
19:23 Well, look past that bookstore and check out Linode at talkpython.fm/Linode.
19:28 That's L-I-N-O-D-E.
19:30 Plans start at just $5 a month for a dedicated server with a gig of RAM.
19:34 They have 10 data centers across the globe.
19:36 So no matter where you are or where your users are, there's a data center for you.
19:40 Whether you want to run a Python web app, host a private Git server, or just a file server, you'll get native SSDs on all the machines, a newly upgraded 200 gigabit network, 24-7 friendly support, even on holidays, and a seven-day money-back guarantee.
19:54 Need a little help with your infrastructure?
19:56 They even offer professional services to help you with architecture, migrations, and more.
20:01 Do you want a dedicated server for free for the next four months?
20:04 Just visit talkpython.fm/Linode.
20:09 The other thing that we can do in terms of navigation is we have a find, basically like Command F or Control F in your browser, except it just searches the code that you have.
20:21 Okay, that's pretty cool.
20:22 You also have these code intentions, or just intentions.
20:25 Tell us about that.
20:26 That looks pretty cool with like data sets and all sorts of things.
20:29 Right.
20:29 There are these context-aware suggestions for actions that you can take with your code.
20:35 By and large, they're trying to focus on the data science workflow.
20:39 Like if you have a completely empty workbook you just started, you can start out by adding some standard imports like pandas, numpy, matplotlib.
20:50 We also have this internal plotting library, I think.
20:53 You can also start by loading a data set.
20:56 You can upload something, or you can work with, we provide like 10 or 12 example data sets or something like that.
21:04 Well, we don't provide all of them, but you can load them using this intention.
21:08 They can be loaded.
21:09 So like some of them might go and like suck it off of like a website or something, or off an API.
21:14 Right.
21:14 You can do that.
21:15 We do have this file uploader as well.
21:17 You can upload files from disk, or if you know...
21:21 Sure.
21:21 What format do they have to be in?
21:22 Can you just put whatever and you just read it?
21:24 Or is it like, does it get understood when you upload it?
21:27 You're going to end up reading it in the code.
21:29 So if you know how to read it, like programmatically, pretty much anything.
21:34 I see.
21:35 But it's not like you send it a CSV file and it like pre-parses it for you.
21:39 No, it won't do that.
21:40 Yeah.
21:41 Okay.
21:41 If you do upload a CSV or TSV file, it will get suggested by the intention.
21:47 So you said it has all these different data sources and some data sets.
21:51 Like what are some of the data sets that I could work with?
21:54 Well, we have a number of examples.
21:56 I can even give a full list.
21:58 So a lot of these are like just examples that are used, you know, teaching examples.
22:03 Iris, Boston Housing, Breast Cancer.
22:07 We have two different digits data sets.
22:10 One of them is small.
22:11 One of them is large.
22:12 Like the handwriting, you're trying to recognize handwritten digits.
22:15 Right.
22:15 So that's like an OCR, like machine learning one.
22:19 The Iris one is also like an optical image processing data set that's pretty commonly used, I think.
22:26 Yeah.
22:26 Sounds cool.
22:27 If you wanted to do like demonstrate some form of say machine learning around OCR, you could just go.
22:34 And we're just going to use the well-known data set from here, right?
22:37 Yeah.
22:37 Yeah.
22:37 Are you planning on adding more?
22:38 I'm not sure.
22:39 I think for now we're satisfied with it.
22:42 So basically you have the ability to upload.
22:44 So you just like upload your files.
22:46 I think for the most part, we're not even, we don't even have these uploaded internally.
22:51 It's just, we know how to call the functions that are, that will load these.
22:54 Like with Iris, you can get it directly from scikit-learn and same with a few others.
23:00 That sounds pretty cool.
23:02 What else happens with the intentions?
23:04 So it's not just about data sets, right?
23:06 Right.
23:06 It's not just about data sets.
23:07 So you start out by loading and then after you load the data set, you can, I guess, split it, extract out like X and Y where X is all the, what is it called?
23:20 The feature matrix or something like that.
23:21 And Y is your target.
23:24 It will do the train test split for me.
23:26 I remember when I, I always screw up the syntax for train test split.
23:30 Like is it X test, X train?
23:33 Or is it X test then Y test?
23:34 Or I don't know.
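For reference, a short sketch of the scikit-learn call being discussed, loading Iris directly from the library as mentioned earlier. The `test_size` and `random_state` values are arbitrary choices for the example, and this is not the code the Datalore intention generates.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load Iris straight from scikit-learn: X is the feature matrix, y the target.
X, y = load_iris(return_X_y=True)

# Return order: both X pieces first (train, then test), then both y pieces.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape)
```

The order is easy to remember as "for each input array, train then test", so passing `X, y` yields `X_train, X_test, y_train, y_test`.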
23:35 But Datalore knows?
23:36 Datalore knows.
23:36 So I use the intention.
23:38 It'll split the data for me.
23:40 It'll create some set of models.
23:43 Like it knows how it'll create the object for various scikit-learn estimators.
23:49 It can also do a few pre-processing steps if you need to.
23:53 Like if you loaded Titanic, there's some missing values and some columns that like you have a couple of features that need to be transformed in some way.
24:04 So you can do like a one-hot encoding, you can impute, and there are intentions that will do these things for you.
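The two preprocessing steps named here, imputing missing values and one-hot encoding, can be sketched with pandas. The table below is made-up data shaped like a few Titanic-style columns, not the real data set, and median imputation is just one assumed strategy.

```python
import pandas as pd

# Tiny made-up table with a missing value and a categorical column.
df = pd.DataFrame({
    "age": [22.0, None, 35.0, 58.0],
    "embarked": ["S", "C", "S", "Q"],
    "survived": [0, 1, 1, 0],
})

# Impute: fill the missing age with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# One-hot encode: replace the categorical column with one indicator column per value.
encoded = pd.get_dummies(df, columns=["embarked"])

print(list(encoded.columns))
```

After these two steps every column is numeric or boolean, which is the form scikit-learn estimators generally expect.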
24:10 Nice.
24:11 So how does that show up?
24:12 Does it show up like you type a word and hit tab and then it gives you like some big expanded thing?
24:17 Or what does it do?
24:18 Actually, they just manifest as buttons at the bottom of the cell.
24:23 Well, we call them blocks, but for the sake of Jupyter users, we can talk about cells.
24:29 There's just all these buttons that you have at the bottom.
24:32 And the buttons that you see exactly will depend on the context.
24:36 Like what variable do you have focused?
24:38 Or it could just depend on the context of like how far along are you in this machine learning process?
24:43 Like what have you done so far?
24:44 What would be the next step?
24:45 Okay.
24:46 You can just click the button.
24:47 You might have a dialogue with some options or it'll just do something.
24:50 And then it'll...
24:51 Yeah, that sounds really cool.
24:52 What machine learning libraries does it support?
24:55 Is it just scikit-learn?
24:56 Does it do like Keras or any others?
24:58 The intention specifically or Datalore as a whole?
25:00 Yeah.
25:01 When it suggests to do like some training, what does it, you know, where does that work with?
25:04 It's going to work by and large with scikit-learn.
25:07 Okay.
25:07 And I mean, it'll use like Pandas or NumPy to support the process.
25:12 Like I think it generally prefers to work with, we prefer to work with Pandas data frames.
25:16 So another thing that caught my attention, and I think you were touching on it before, is this idea of dependencies between the cells or blocks and the variables and like incremental computation.
25:26 So if you have something that defines like X somewhere and then something else that maybe passes X to a function, Datalore is aware of that relationship and can do like a partial re-execution to make it go back and sync.
25:42 Is that right?
25:43 Yeah.
25:43 Like if you update X way above, it'll know that you need to reevaluate that.
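One way to picture this kind of dependency tracking is to record which names each cell defines and which it uses, then mark later cells stale when they read something that changed. The sketch below does that with the standard `ast` module; it is a simplified assumption about the general technique (top-level names only, no attribute or mutation tracking), and `cell_names` and `stale_cells` are hypothetical helpers, not Datalore internals.

```python
import ast


def cell_names(source):
    """Return (defined, used) names for one cell."""
    defined, used = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            else:
                used.add(node.id)
    return defined, used


def stale_cells(cells, edited):
    """Indices of cells after `edited` that need re-running because of the edit."""
    dirty, _ = cell_names(cells[edited])
    stale = []
    for index in range(edited + 1, len(cells)):
        defined, used = cell_names(cells[index])
        if used & dirty:
            stale.append(index)
            dirty |= defined   # whatever this cell produces is now suspect too
        else:
            dirty -= defined   # cleanly redefined without touching changed names
    return stale


cells = [
    "x = 3",
    "def square(v):\n    return v * v",
    "y = square(x)",
    "z = 100",
    "print(y + z)",
]
print(stale_cells(cells, edited=0))
```

Editing the first cell marks only the cells that depend on `x`, directly or through `y`, so the function definition and the unrelated `z = 100` cell are left alone.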
25:49 I think that's a pretty cool feature.
25:50 Like before you were talking about, you know, running the cells in, say, a Jupyter Notebook in different orders and getting different results.
25:57 Right.
25:57 Which it's not at all obvious how that happened.
26:00 I know it has like a little number along the side of like order of execution.
26:04 But like if you've got 20 cells, like do you really have that held in memory, in your memory of which order it is?
26:10 Not really, right?
26:11 I know from talking to people that for some people it isn't really a problem.
26:15 But I think like if you have a lot of experience working with Jupyter Notebooks, you're probably used to it and you know how to work with it well.
26:22 You know what trouble you can run into.
26:23 So you just don't do it, right?
26:24 Yeah, for me, it's always a problem.
26:26 Yeah.
26:27 Well, I'm not that experienced with Notebooks.
26:29 And I think I imagine for a lot of people trying to break in, it is, you know, a hurdle you have to get through.
26:34 Sure.
26:34 So I guess once you kind of built up your data, maybe train some models with these intentions, things like that, you probably want to look at it, right?
26:43 And so you talked about some libraries, you know, Matplotlib, for example.
26:46 But you also have some built-in plotting libraries, one for like two-dimensional plots and one for geographical ones.
26:54 So I guess what are those?
26:56 And then like why not just import Altair or something, you know, like some other thing, right?
27:02 Like why do you have your own?
27:04 What are the benefits?
27:05 I can't really compare and contrast with a lot of different libraries because I don't know them that well.
27:10 But yeah, we do have, I guess, datalore.plot.
27:14 That's the prefix for the library.
27:17 It's our internal graphics library.
27:19 It's based on the grammar of graphics.
27:20 So it's going to be very similar to like ggplot2 in R.
27:25 And I think there are some Python libraries that might also be based on the grammar of graphics.
27:30 Sure.
27:30 That I don't know about or I don't know the names of off the top of my head.
27:34 My impression is that matplotlib is definitely the most popular library in Python for graphics or plotting.
27:41 Yeah.
27:41 But I have, well, me personally, and I guess enough other people that we decided to develop this library
27:48 have this preference for the grammar of graphics for that different syntax.
27:54 I remember in my days of working with MATLAB, the general torture that was working with graphics in MATLAB.
28:02 And matplotlib is, as far as I understand, based on that or very similar.
28:07 So yeah, I guess the motivation originally was to bring the grammar of graphics to Python, at least within our tool.
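To give a feel for the "different syntax" of the grammar of graphics, here is a toy model of its core idea: a plot is data plus a mapping from columns to visual roles, and drawing layers are composed onto it with `+`. The `Plot` and `Layer` classes are hypothetical and render nothing; they are not the datalore.plot or ggplot2 API.

```python
class Layer:
    """One drawing instruction, such as points or a smoothed line (toy class)."""

    def __init__(self, kind, **options):
        self.kind = kind
        self.options = options


class Plot:
    """Data plus an aesthetic mapping; layers are attached with `+`."""

    def __init__(self, data, mapping, layers=()):
        self.data = data
        self.mapping = mapping
        self.layers = tuple(layers)

    def __add__(self, layer):
        # Return a new Plot so a base plot can be reused with different layers.
        return Plot(self.data, self.mapping, self.layers + (layer,))

    def describe(self):
        axes = ", ".join(f"{role}={column}" for role, column in self.mapping.items())
        kinds = " + ".join(layer.kind for layer in self.layers)
        return f"map({axes}) drawn as {kinds}"


data = {"sepal_length": [5.1, 4.9, 6.3], "petal_length": [1.4, 1.4, 6.0]}
base = Plot(data, {"x": "sepal_length", "y": "petal_length"})
plot = base + Layer("point") + Layer("smooth")
print(plot.describe())
```

The contrast with the MATLAB-style approach is that the plot is declared as a composition of parts rather than built up through a sequence of stateful drawing calls.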
28:14 Okay.
28:14 You can still use the others?
28:16 Like if I wanted to use some other library that typically works in Jupyter, I could install it, import it, run it?
28:22 Yeah, it should work.
28:23 If it doesn't, we appreciate the bug report.
28:25 Sure.
28:27 But the intention is that you can, right?
28:28 Yeah.
28:29 Yeah.
28:29 Okay.
28:30 So I guess another question that makes me start to think of is, what about dependencies?
28:35 Like how do I specify my Python dependencies, the other libraries I'm going to use?
28:40 And then how do I get them?
28:41 Like if it's running on your cloud, how do I get the five libraries off PyPI that I actually need to do my thing?
28:48 So we have a library manager.
28:51 Like we have a tool that lets you set up a custom environment.
28:54 We have a default environment that includes a lot of the basics that you probably want, like NumPy and pandas and scikit-learn and matplotlib.
29:03 Right.
29:03 You might as well just start with those installed, right?
29:06 Right.
29:06 Exactly.
29:07 But then you have the option to use this library manager to add whichever libraries you like from conda, from pip.
29:16 Even if you have access to some GitHub repo.
29:21 Right.
29:22 Where you point pip, with the git prefix, just at the repo base.
29:25 Yeah.
29:25 Yeah.
29:26 So you can install from GitHub.
29:27 Yeah, but basically you can install whatever you really need to to do your work, right?
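As a rough illustration of the "install from GitHub" idea outside of Datalore's own library manager, plain pip accepts a `git+https://...` URL, optionally followed by `@` and a branch, tag, or commit. The sketch below only builds the command; the repository URL and branch name are hypothetical placeholders, and the helper name `pip_git_command` is made up for this example.

```python
import sys


def pip_git_command(repo_url, ref=None):
    """Build a pip command that installs a package straight from a Git repo."""
    target = f"git+{repo_url}"
    if ref:
        # Pin to a branch, tag, or commit with the @ref suffix.
        target += f"@{ref}"
    return [sys.executable, "-m", "pip", "install", target]


# Hypothetical repository, used only for illustration.
cmd = pip_git_command("https://github.com/example-user/example-lib.git", ref="main")
print(" ".join(cmd[1:]))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would perform the actual install in the current interpreter's environment.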
29:33 Yeah.
29:33 For now, as we sort of started at the beginning, Datalore is a cloud thing.
29:38 I know there's a ton of people out there listening going, that's great, but my company won't let me upload my data to a cloud thing.
29:47 Is there a way to run it on premise?
29:49 Maybe even if that's not like pip install Datalore, but maybe it's like, here's a virtual machine you can license from us internally.
29:57 Unfortunately, no, it's not possible right now.
30:01 It's something we're very much aware of because we've heard this feedback, this exact feedback from a number of users through various channels.
30:09 So it's very much something that's like on our radar, something that we could consider.
30:13 But unfortunately, right now it is not supported, no.
30:16 And if it's not, maybe that's fine, right?
30:18 Like Google Docs doesn't have an on-prem Google Docs.
30:22 Yeah.
30:22 Just don't do it, right?
30:23 I do think that a lot of the data science folks are analyzing data that's pretty closely guarded, right?
30:31 Yeah.
30:31 It would be good, but who knows?
30:33 As I said, it's not the first time we've heard it.
30:36 Yeah.
30:37 Is the source code for Datalore available or is it private?
30:41 Right now it's private.
30:42 Okay.
30:43 It doesn't surprise me.
30:44 Even for public, it's probably like deeply bound to your infrastructure.
30:49 You know what I mean?
30:50 Yeah.
30:50 I mean, I can't say if there are any plans to open source it or not.
30:53 We have open sourced some projects and other projects we've never open sourced.
30:57 It's not really my decision in the end either.
31:00 Sure.
31:00 So there's different ways, like different machines or computational units, whatever you can buy to like do your analysis.
31:08 So you start out, you can just play around with like some moderate virtual machine that it's running on, but you can get like a higher end computation or can you run it on, run your stuff on GPUs as well?
31:21 Yes, you can.
31:22 How's that work?
31:23 The only thing that's available.
31:25 So we have two different plans right now.
31:27 We have a free plan, a paid plan.
31:29 The only thing that's available on the free plan is just these medium, like these moderate instances, which work fine for just some general all-purpose stuff.
31:39 But if you upgrade, you have the opportunity to upgrade to larger, more powerful machines.
31:46 All of these right now are hosted on AWS, so you can even look them up on AWS.
31:51 But one of them does include GPUs.
31:55 Right.
31:56 One of them is, I guess, in EC2 parlance, a p2.xlarge, which would be a four CPU and single GPU machine with 61 gigs of RAM.
32:07 That's a pretty serious machine with 12 gigs of GPU RAM.
32:10 Yeah.
32:10 Maybe we talk about pricing as well.
32:12 So this is both a thing you can use for free and also something you can pay to get more for, right?
32:19 Right.
32:19 What do you get for free and then what do you have to pay for?
32:22 What you get for free is a monthly allowance on these moderate machines.
32:28 I think 120 hours per month across all your various computations.
32:33 Is that actual execution time or just like I'm working with my notebook?
32:38 The second one, like when you're working on the notebook.
32:40 Like if I have it open and I'm interacting with it rather than...
32:43 So like what if I leave my browser open and I go to lunch?
32:47 Does that count towards my hour?
32:48 It does.
32:50 Now, eventually we will shut it down if it's been idle for a really long time because we don't want you to just like run through your full allowance by accident without realizing it.
33:00 As long as it's open, we keep the instance running.
33:04 Right.
33:04 So 120 hours is like if you do that eight hours a day, that's 15 days a month.
33:09 I mean, still, you know, that's a lot of time for free.
33:12 So I just want to make sure people understand like what it is when you say hour, what you're actually talking about, right?
33:18 That's fair.
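The arithmetic behind that estimate is simple enough to sketch. The 120-hour figure is the free-plan allowance as quoted in the conversation ("I think 120 hours per month"), and the function name is invented for this example. Note that, as described above, the clock runs whenever the notebook is open, not only while code is executing.

```python
FREE_HOURS_PER_MONTH = 120  # free-plan allowance as quoted in the conversation


def days_of_use(hours_per_day, allowance=FREE_HOURS_PER_MONTH):
    """Return how many days the monthly allowance lasts at a given daily usage."""
    return allowance / hours_per_day


print(days_of_use(8))  # 15.0
```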
33:19 Then if you, so if you upgrade, you unlock the ability to use some of these larger, more powerful machines.
33:26 You still have to pay for them, I guess.
33:28 But like we have various different machines depending on what you need.
33:32 But you basically pay like the EC2 price of the machines if you select them?
33:38 It's marked up.
33:39 I mean, you can look up the EC2 prices, so it's not going to be exactly the same.
33:43 Sure.
33:43 But it's going to be based roughly on the EC2 prices.
33:47 The more expensive machines will obviously cost more.
33:50 The less expensive machines will cost less.
33:52 Well, that sounds pretty cool.
33:54 I mean, the fact that there's quite a bit of free space and computation and whatnot, I think that's what it takes to be interesting to everybody, right?
34:02 Yeah.
34:03 At the end of the day, we are competing with an open source thing, which will always be open source.
34:08 So we'd better do something more.
34:10 Yeah.
34:12 Yeah.
34:12 I think it makes sense.
34:13 It seems pretty fair.
34:14 This portion of Talk Python to Me is brought to you by Brilliant.org.
34:20 Many of you have come to software development and data science through paths that did not include a full-on computer science or mathematics degree.
34:27 Yet, in our technical field, you may find you need to learn exactly these topics.
34:32 You could go back to university.
34:33 But then again, this is the 21st century, and we do have the internet.
34:37 Why not take some engaging online courses to quickly get just the skills that you need?
34:42 That's where Brilliant.org comes in.
34:44 They believe that effective learning is active.
34:47 So master the concepts you need by solving fun, challenging problems yourself.
34:51 Get started today.
34:53 Just visit talkpython.fm/brilliant and sign up for free.
34:57 And don't wait either.
34:58 If you decide to upgrade to a paid account for guided courses and more practice exercises,
35:03 the first 200 people that sign up from Talk Python will get an extra 20% off an annual premium subscription.
35:09 That's talkpython.fm/brilliant.
35:14 I asked if it will run outside of your cloud.
35:18 And right now you say, no, it's hosted at EC2.
35:19 Can I pick where it runs in EC2?
35:22 Like, suppose I work on healthcare and it has to stay, you know, within Europe or has to stay within France or something.
35:30 At the moment, no.
35:31 You can't pick which cloud it runs in.
35:34 But basically, right now, we don't have like a dropdown.
35:37 Like, I want to do my work in Europe or I want to do my work in the US or something like that.
35:41 No, there's nothing like that.
35:43 Yeah.
35:43 Do you know if there's any plans for something like that?
35:45 I'm not sure, actually.
35:47 It's, I mean, one thing we always have to keep our eye on is all the different privacy laws and make sure that we're not doing something untoward with anyone's data.
35:57 Making sure that we're always complying.
36:00 Yeah, I mean, there's certainly value to the privacy laws, but man, they can be hard to keep straight.
36:05 Because it's not just one country, right?
36:08 It's all the countries.
36:08 Right.
36:09 And if you have users in one country, you have to comply with the laws for that country, for that user.
36:14 Yes, exactly.
36:15 It's, you know, welcome to the internet, right?
36:17 Yeah.
36:19 So if people want to go and play around or explore Datalore, you have a set of pre-built notebooks that they can go play with.
36:29 You want to maybe talk us through some of those that they can explore?
36:32 Right.
36:32 We do have some sample workbooks.
36:34 I am not too up to date on what the exact sample workbooks are on, but we actually do have some set here, which is kind of nice.
36:41 So the first one, these links will be somewhere.
36:45 So yeah, the links are super long and crazy, and I'll put them in the show notes.
36:48 So don't worry about trying to say them.
36:50 They're like, you know, UUID type of things.
36:53 Yeah.
36:53 Yeah.
36:53 In any case, if I'm understanding correctly, I think if you create an account, these workbooks will be automatically created for you.
37:01 So you can see them for yourself.
37:03 Exactly.
37:03 Under your home, you've got like, when I log in, at least in my account, I haven't done really much.
37:07 So like, this has got to be what's coming with it.
37:09 You know, there's like a scikit-learn section.
37:12 There's a map section and there's a plotting section.
37:16 And the plotting has the most actually.
37:17 Yeah.
37:18 I guess we're kind of proud of it.
37:19 We want to show it off.
37:20 Yeah, sure.
37:21 In any case, I guess the first one, we're basically showing, first of all, in pandas, you have this scatter matrix function.
37:28 It's very useful for like some exploration.
37:31 You can see distributions of the different features as well as their distributions with each other, whatever the right terminology is.
37:39 I feel like I'm butchering data science terminology as we go along.
37:42 But in any case, so first we're showing how you can generate the scatter matrix in pandas.
37:47 But then we also want to show how you can do it using datalore.plot, using the internal plotting.
37:52 So first of all, you can create a scatter pretty easily for any pair of features that you have.
37:58 You can also create a histogram using the geom_point and geom_histogram functions.
38:04 You can update it to show density instead of just the counts of the different bins.
38:10 Yeah, nice.
38:10 So basically, like most of the plotting you'd want to do with like ggplot, you can do with the Datalore plotting stuff, right?
38:19 Yeah.
38:19 Yeah.
38:20 So these are nice.
38:20 These graphs are pretty interactive.
38:22 Yeah.
38:22 I guess that's something I didn't even talk about yet.
38:24 They are interactive.
38:26 Like if you hover over them, you can see the different points on them, like the different coordinates, which is kind of cool.
38:32 I like it.
38:32 It's quite nice.
38:33 So you have another one on like Bayesian inference, which is pretty cool.
38:38 So one thing here that I'm seeing is, gosh, I'm not entirely sure what the syntax is.
38:46 I think it might be LaTeX.
38:48 So at the top, you've got basically really symbolically nice mathematics in here, right?
38:57 Yeah.
38:57 That's pretty cool that you can actually write the formulas in mathematical notation, not just trying to put, you know, ASCII characters to make like the integral sign look like an integral sign or something.
39:08 Yeah, basically we have support, we have these markdown cells or blocks as I guess we call them.
39:13 And you can include LaTeX into these cells as you can see there, which is always kind of fun.
39:19 It does make for a nice presentation.
39:21 Yeah.
39:22 I have totally mixed feelings about LaTeX coming from a math background.
39:26 Like I love the outcome.
39:28 I hate the creation of it.
39:31 It's so painful.
39:31 It's so not obvious.
39:32 I can fully understand.
39:34 But see, when I was in college, what made LaTeX bearable for me was I started discovering tools like, I don't remember if it was called Overleaf then and it's writeLaTeX now or vice versa.
39:44 But it was basically this real-time LaTeX editor, which was collaborative.
39:49 Oh, interesting.
39:50 You could switch between modes like compile incrementally or just you press a button to compile.
39:55 But it gave you like the real-time feedback about exactly what you were writing, which was always really cool.
40:02 Yeah, that's pretty cool.
40:03 And for people who don't know, LaTeX is basically, it's been around for a while.
40:06 It's kind of like a markdown syntax, but for symbolic mathematics.
40:11 Like I want the sum from n equals 0 to 100, you know, and it'll actually do all the symbols, right?
40:17 And it's used a lot in like math papers and whatnot.
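For instance, a sum of the kind just described could be typed into a markdown cell roughly like this. The summand n squared is an arbitrary choice for illustration, and the `$$` delimiters are the usual convention in Jupyter-style markdown cells; whether Datalore uses exactly the same delimiters is an assumption.

```latex
$$\sum_{n=0}^{100} n^2$$
```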
40:19 So let's see, what else?
40:20 We got a lot of like a whole bunch of the datalore.plot stuff, the geometry and whatnot.
40:28 And you have some California housing data as well graphed out, which has got some cool geometry stuff.
40:35 And it pulls in OpenStreetMap and then like overlays it, right?
40:38 Something like that.
40:39 That's pretty nice.
40:40 So I guess that's just using the California housing data from scikit-learn datasets.
40:46 Exactly.
40:47 It conveniently has the latitude and longitude coordinates.
40:51 So we can just plot every point on the map as we see it, which is kind of neat.
40:54 That's cool.
40:55 So one thing at the bottom that's pretty interesting is it's got a picture of airplanes flying along the Hudson.
41:04 And it overlays like animated flying airplanes using the geography plotting stuff, which is, that's pretty awesome.
41:11 I don't work on this myself, but I'm always amazed at the new features that keep coming out.
41:15 Like all the different things you can do with the maps and the plots.
41:18 I think this is one of my favorite little animations is showing the different routes.
41:22 You sit there and watch them fly along.
41:24 That's pretty cool.
41:25 Yeah.
41:25 So people can check out those shared notebooks on Datalore and play with them.
41:29 Basically, like you said, if you create an account, you get all those like in your account.
41:33 Yep.
41:33 Yeah.
41:34 What about version control, right?
41:36 Like if I have a Jupyter Notebook locally, or especially if I have Python files locally, it's super easy to check that into GitHub, do a diff, do a branch, do a PR.
41:47 Like what does version control look like over here?
41:49 We have our own integrated version control.
41:53 Basically, we have this thing called history.
41:55 Every now and then, if there's some inactivity, we'll just keep a checkpoint, kind of like a commit or something.
42:02 You can think of it.
42:03 Okay.
42:03 Or you can make your own as you need to.
42:05 And then you can compare the diffs.
42:07 You can revert to the previous version.
42:09 I think you can even revert specific changes, but you'd have to check me on that one.
42:14 Basically, like this internal version control.
42:15 That's pretty cool.
42:16 And how do I, like, can I branch?
42:19 Can I do stuff like that?
42:20 No, it's completely linear.
42:21 So no branching involved here.
42:23 So right now, basically, it just, I can go forward or backwards in time for now, right?
42:29 Yeah.
42:29 Yeah.
42:29 You can't really do branches or merges or anything like that.
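To make the "completely linear" idea concrete, here is a toy sketch of a checkpoint history with no branches or merges. It is not Datalore's implementation; the class and method names are invented for illustration, and modelling a revert as appending a copy of the old snapshot is an assumption made to keep the history a straight line.

```python
import difflib


class LinearHistory:
    """Toy linear history: an ordered list of snapshots, no branches or merges."""

    def __init__(self, initial=""):
        self._checkpoints = [initial]

    def checkpoint(self, content):
        """Record a new snapshot at the end of the history."""
        self._checkpoints.append(content)

    def current(self):
        """Return the most recent snapshot."""
        return self._checkpoints[-1]

    def diff(self, a, b):
        """Return unified-diff lines between snapshots a and b."""
        old = self._checkpoints[a].splitlines()
        new = self._checkpoints[b].splitlines()
        return list(difflib.unified_diff(old, new, lineterm=""))

    def revert(self, index):
        """Restore an earlier snapshot by appending a copy of it."""
        self._checkpoints.append(self._checkpoints[index])

    def __len__(self):
        return len(self._checkpoints)


history = LinearHistory("x = 1")
history.checkpoint("x = 1\ny = 2")
history.checkpoint("x = 1\ny = 3")
print("\n".join(history.diff(1, 2)))
history.revert(1)  # go back to the second snapshot
print(history.current())
```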
42:32 So one thing I'm noticing is I can export a Datalore file, which is kind of cool, right?
42:38 So I theoretically could export the Datalore file and then check that into GitHub.
42:43 That's one possibility.
42:44 Yeah.
42:45 Right.
42:45 And then I do save points there.
42:46 You can also export Jupyter Notebook format, the IPYNB format.
42:53 Yeah.
42:53 Can you go the other way?
42:54 Can I import?
42:55 Yes.
42:56 Like, if I create a new notebook, can I say from this Jupyter Notebook or from someone else's Datalore notebook?
43:02 Yeah, I'm pretty sure you can.
43:03 If you go to the, if you're in the file system view, there should be a button.
43:07 It looks kind of like the button to create a workbook, but basically it lets you import a workbook.
43:14 It could have that Datalore format or it could have the IPYNB format.
43:26 So yeah, we can import Jupyter Notebooks into Datalore.
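Part of what makes this kind of import and export workable is that an .ipynb file is plain JSON. As a small sketch using only the standard library, the snippet below parses a tiny hand-written notebook in the nbformat 4 layout and pulls out the code cells. The notebook content and the helper name `code_cells` are made up for illustration; how Datalore itself reads these files is not covered here.

```python
import json

# A tiny hand-written notebook in the nbformat 4 JSON layout (illustrative content).
notebook_json = """
{
  "nbformat": 4,
  "nbformat_minor": 2,
  "metadata": {},
  "cells": [
    {"cell_type": "markdown", "metadata": {}, "source": ["# Demo"]},
    {"cell_type": "code", "metadata": {}, "execution_count": null,
     "outputs": [], "source": ["import math\\n", "print(math.pi)"]}
  ]
}
"""


def code_cells(nb):
    """Return the source of every code cell, one string per cell."""
    cells = []
    for cell in nb["cells"]:
        if cell["cell_type"] == "code":
            source = cell["source"]
            # "source" may be a single string or a list of line strings.
            cells.append(source if isinstance(source, str) else "".join(source))
    return cells


nb = json.loads(notebook_json)
print(code_cells(nb))
```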
43:30 I think that would be a good place, a good way for people who are now doing Jupyter to explore,
43:36 just like, hey, what does it look like if I were to do the same work, but over here?
43:39 Right.
43:40 Yeah.
43:40 I think that could be an interesting idea to try instead of just trying to start from scratch.
43:44 You try to take some established work from Jupyter and see what it looks like working with it in Datalore.
43:49 Right.
43:50 Grab some interesting notebook off of GitHub and throw it in there.
43:53 Okay.
43:54 That sounds pretty good.
43:55 I guess the final thing around Datalore would be, where is it going?
44:01 It's pretty new, right?
44:03 It just went 1.0 not long ago, right?
44:06 Yeah, I think in October.
44:08 What is that?
44:09 Four months, something like that.
44:10 So it's pretty new.
44:11 And now that your app has met users, I'm sure you got some feedback.
44:16 Yeah.
44:17 What are you thinking on working on next?
44:20 I can't really talk too much about the big features that we're working on, or big new features that we're working on.
44:27 Certainly one of the big things that we try to do is, as we get feedback, we want to act on that feedback.
44:32 We want users to get the sense that we actually care about what they're saying.
44:35 I can't think of specific features that were suggested by users.
44:38 I can certainly think of a lot of bugs that were reported, for example, and that they get that extra urgency because they're coming from the users and we want to fix it so that they can continue to work with the tool.
44:47 So that's certainly a big part of the, I guess, philosophy or something.
44:52 Okay.
44:52 I did see in the blog post that announced the release of 1.0, there was a bunch of feedback in the comments around there.
45:00 And one thing, I don't know how much my audience would care, but I know that they were asking, people were asking for Kotlin support, which Kotlin is a language JetBrains created.
45:09 That's like a better Java.
45:11 That's kind of the way I think of it.
45:12 I don't know what happened.
45:13 That's the idea.
45:14 So are you planning on adding other languages like Kotlin or JavaScript or R or whatever?
45:21 Or is it really focused on Python?
45:23 For now, the focus is on Python.
45:25 But the idea of supporting other languages, it's definitely something on our radar.
45:29 We've been asked about, I think, R, Scala, Kotlin, probably a few others.
45:36 So it's definitely something that's on our radar.
45:39 I can't say whether or not we're working directly on it right now.
45:42 Cool.
45:42 Well, it's definitely an interesting entry in this whole computational notebook space.
45:47 I think the smart editor features sort of borrowed from PyCharm is really cool.
45:52 I think the collaboration is really cool.
45:53 So hopefully people find it an interesting thing to consider.
45:57 I hope so, yeah.
45:58 All right.
45:58 So I guess before I let you out of here, let me ask you the final two questions I always ask folks.
46:03 So if you're going to write some Python code, what editor would you use?
46:07 I feel like I only have one correct answer.
46:12 I'm going to go ahead and say PyCharm.
46:14 Though realistically, technically, I'm using IntelliJ IDEA for all my Python editing, but it has the Python plugin, which gives me PyCharm, basically.
46:25 Yeah, exactly.
46:26 So effectively, more or less the same thing.
46:28 And then notable Python package that people might find interesting out there that maybe they don't know about.
46:35 I don't know if I can really name something that a lot of people won't know about.
46:39 I think my personal favorite.
46:41 I really like SymPy.
46:43 At least just the concept of SymPy is really neat to me, like the symbolic mathematics in Python.
46:49 Yeah, that's really cool.
46:50 I haven't used it.
46:51 I've heard of it before.
46:52 Maybe just tell really quick what it is.
46:55 It's bringing symbolic mathematics into Python.
46:58 Like you can define variables and build mathematical formulas, equations in SymPy.
47:04 Right.
47:04 So I could give it like a quadratic equation and say solve for Y and it'll or solve for X and it'll like express it in terms of, you know, things like that.
47:14 It will be able to do that.
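A minimal sketch of that kind of use, assuming SymPy is installed. The particular equations are arbitrary examples chosen for illustration: the first solves a quadratic for x, the second expresses y in terms of x.

```python
import sympy as sp

x, y = sp.symbols("x y")

# Solve the quadratic x**2 - 5*x + 6 = 0 for x.
roots = sp.solve(x**2 - 5*x + 6, x)
print(roots)

# Solve x**2 + y = 4 for y, which gives y expressed in terms of x.
y_expr = sp.solve(sp.Eq(x**2 + y, 4), y)
print(y_expr)
```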
47:15 I saw a demo, which was pretty neat, in which basically it was, you would define this partial differential equation and it would build a numeric solver for it.
47:26 Oh, wow.
47:26 Okay.
47:27 I don't remember how that happened, but I remember seeing that tutorial and I thought it was really neat.
47:32 That's pretty cool.
47:33 Basically, all I remember from partial differential equations is that they were hard and I would love a library to solve them for me.
47:39 Yeah.
47:39 So, yeah, I think for me, the appeal is mostly just like the mathematician in me likes the idea.
47:47 It sounds really cool and it ties in well with like all the other stuff that we're talking about here.
47:51 Great.
47:52 All right.
47:52 Well, final call to action.
47:54 People maybe are interested in Datalore.
47:56 What do they do?
47:56 Go to datalore.io and create an account and check it out if it sounds at all interesting.
48:02 You can play around with the samples that we have.
48:05 You can even upload your own notebooks, see what it's like to play with them in Datalore.
48:10 Awesome.
48:10 Well, Adam, thanks so much for being on the show and sharing your part of Datalore.
48:15 It's cool.
48:15 Thank you for having me.
48:16 You bet.
48:16 Bye.
48:17 Bye.
48:17 This has been another episode of Talk Python to Me.
48:20 Our guest on this episode was Adam Hood and it's been brought to you by Linode and Brilliant.org.
48:26 Linode is your go-to hosting for whatever you're building with Python.
48:30 Get four months free at talkpython.fm/Linode.
48:33 That's L-I-N-O-D-E.
48:35 Brilliant.org wants to help you level up your math and science through fun, guided problem solving.
48:41 Get started for free at talkpython.fm/brilliant.
48:45 Want to level up your Python?
48:47 If you're just getting started, try my Python Jumpstart by Building 10 Apps course.
48:52 Or if you're looking for something more advanced, check out our new async course that digs into all the different types of async programming you can do in Python.
49:00 And of course, if you're interested in more than one of these, be sure to check out our Everything Bundle.
49:05 It's like a subscription that never expires.
49:07 Be sure to subscribe to the show.
49:09 Open your favorite podcatcher and search for Python.
49:12 We should be right at the top.
49:13 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm.
49:22 This is your host, Michael Kennedy.
49:24 Thanks so much for listening.
49:25 I really appreciate it.
49:27 Now get out there and write some Python code.
49:28 I'll see you next time.
49:29 Bye.