#196: Datalore: Hosted smart notebooks Transcript

Recorded on Wednesday, Jan 9, 2019.

00:00 Michael Kennedy: If you're doing any sort of data exploration, you've likely heard about Jupyter Notebooks. In fact, there are quite a few options for running and hosting your Jupyter Notebooks. You may have also heard me rave about PyCharm as an editor. Well, on this episode, you'll meet Adam Hood from the Datalore team at JetBrains. That's a new project that tries to bring the power of PyCharm to notebooks and more. This is Talk Python to Me, Episode 196, recorded January 9th, 2019. Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm @MKennedy. Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython. This episode is sponsored by Linode and brilliant.org. Please check out what they're offering during their segments. It really helps support the show. Adam, welcome to Talk Python. Let's jump right in. How'd you get into programming?

01:07 Adam Hood: How did I get into programming? I guess, well, originally, my dad is a programmer, so he tried to teach me when I was, I don't remember, 10 or 11. It didn't really take then, but then I started taking classes in high school and college. Python, I didn't know anything about Python or start learning Python until college. I took a few classes, but even then, the bulk of my exposure to Python has been working on this project specifically. Before that, it was just a couple of classes that I took.

01:36 Michael Kennedy: Okay, so mostly college experience, you'd say, with getting into programming, or like, did you go to college for computer science? What was the start there?

01:44 Adam Hood: No, my degree is actually in math. I ended up going into programming partly to put off graduate school, and I mean, I had taken a number of programming classes, and I also worked in a lab for two years, and the bulk of my work was programming, in MATLAB granted, but still programming.

02:02 Michael Kennedy: Yeah. I think that's how a lot of math people get into programming, 'cause they start there, and they're like, yeah, this is kind of crummy. What can I do other than this? What can I do other than MATLAB?

02:09 Adam Hood: There you go.

02:13 Michael Kennedy: Yeah, that's actually real similar to my story, actually. I started in math, then got into programming, and I spent some time writing MATLAB stuff as well, but, you know, bailed on it as soon as I could, basically.

02:24 Adam Hood: Right.

02:24 Michael Kennedy: Nice. So, when you joined the Datalore team, or the team that's working on the project that became Datalore, that's more or less when you got into Python.

02:33 Adam Hood: I guess it was, at first we weren't really working on a Python editor, so it was probably another year or so before I actually got into Python.

02:43 Michael Kennedy: Right, so you got to build the foundation of all the tooling and whatnot.

02:46 Adam Hood: Right, right. Working on this project, that's when I really started getting into Python, especially when I started. I went to a couple of conferences. I've demoed the tool a fair amount. So I've certainly had some exposure to like, at least simple Python from that.

03:02 Michael Kennedy: Yeah, sure, that's cool. So, what do you do day-to-day at JetBrains?

03:06 Adam Hood: I'm a developer. I'm working on building features, fixing bugs and so on. I can say I've worked around various different parts of the tool at various points in time. Maybe we could get into that later, once we talk more about the features that we actually have, I guess.

03:21 Michael Kennedy: Yeah, yeah, sure. Well, let's just start there. Let's set the stage a little bit with what this is that you guys have created and got here, and then we'll take a step back. So, what is Datalore? Like, what else is out there that's comparable, or is it something totally new?

03:36 Adam Hood: Well, it's a little bit new. Things that are comparable would be Jupyter Notebooks, especially if you talk about Google Colaboratory, I guess. It's this computational notebook. In our case, it's hosted in the cloud, also completely collaborative. You can share with coworkers, collaborators, whoever. In my limited experience working with Jupyter, and also based on talking to a lot of people, I get the impression that a lot of people, they'll use Jupyter, but they'll use something else as well sometimes to write the code, and then they execute in Jupyter. A lot of people have come to our booth at a conference talking about how they use PyCharm and Jupyter together and so on. So, maybe one idea here is that this is a single, unified tool which kind of takes the ideas from Jupyter as the computational notebook, but also from PyCharm as an IDE. Also, the collaboration that you might find in Google Docs or now in Colaboratory.

04:31 Michael Kennedy: Okay, so let's come back to that collaboration stuff, 'cause I think that's super interesting, but maybe let's just take a step back and just talk about notebooks in general, not the computers with keyboards, but, you know, Jupyter and other computational notebook-type things. There was a recent article, I forget where it was, it was in the Atlantic, and maybe I've mentioned it a time or two on the show, and basically, the title is that the scientific paper is obsolete. It's really kind of provocative. It has this picture of a scientific paper with 15 authors on it, and it's literally on fire, right? Like, it's crazy. And it's actually a super interesting look into the history of where notebooks came from. It looks a little bit at Mathematica and Wolfram, and why that really didn't take off in the scientific space in the same way, you know, how Jupyter and things like Datalore and Google Colaboratory and whatnot are really redefining what it means really to present computational work, hard science-type stuff. Let me get your take on that. What do you think about this positioning that like, notebooks are the new scientific paper for data science or for real science, like hard science, yeah, chemistry, whatever?

05:46 Adam Hood: Well, it might take some doing to convince all the journals of that.

05:50 Michael Kennedy: To be fair, they have a very strong vested financial interest in keeping their position. I think there's actually something a little bit wrong with this whole scientific journal system, because so much of this research is funded by taxpayers, and then is published in these private, ultra-expensive journals. Like, isn't that our research? Didn't we pay for that? So that's a whole 'nother angle, right? Maybe we shouldn't go down that too far, but.

06:18 Adam Hood: There's also a lot of politics involved in my experience, at least from what I've seen. But if we come back to the idea of the notebooks, I mean, I remember seeing Mathematica for the first time, and I thought it was a really awesome tool, and it's exactly pretty much what you were saying about the article. It was pretty much exactly that, everything that you have in the article. Like, you have this idea that you can have the code and these nice visualizations or outputs. You can print out the mathematical formula and so on, all just coming together in one thing. I thought that was really awesome when I first was introduced to it. And yeah, in a lot of ways, you can start to, I don't know if I want to say replace papers, 'cause that's hard to do, but you do get to see the scientific process when you look at a nice, well-organized notebook.

07:13 Michael Kennedy: And I mean, that does assume that it is put together for presentation, right? I mean, you've probably seen a ton of notebooks that are just like, the equivalent of spaghetti code, just scrambled all through there, and you're like, what is this thing? That doesn't actually help. But like, at least it's the canvas upon which that could be built, right?

07:29 Adam Hood: Right, exactly. And I think there's nothing wrong with the spaghetti code-type notebook in the sense that, you know, the first stage, you have to do a lot of exploration, or you're not going to get anywhere, and it's not until you have to present it that you need to really think about it.

07:44 Michael Kennedy: Yeah, so I agree that you look at the way people work with the notebooks, and the people who don't, say, web developers that work in something like PyCharm or Visual Studio Code, like, that sort of iterative exploratory type of working with your tool, and just, you don't get that in these other tools really, not in the same way, right? You've got to rerun them all from scratch. You can't just like, let me tweak this formula and rerun that cell and see how that looks, right? So I definitely think there's this interesting way of working in this exploring data with notebooks, yeah. Let's dig into Datalore.

08:18 Adam Hood: Sure.

08:22 Michael Kennedy: Tell people what, like, if you were at a conference, be it PyCon or whatever, and someone comes up and says Datalore, what's that, what do you tell 'em?

08:30 Adam Hood: I've had a lot of different lines throughout the years, actually. It's an online computational notebook in Python.

08:38 Michael Kennedy: To me, it kind of looks like something that is really familiar to people that use Jupyter, but has a lot of the features of, say, PyCharm.

08:47 Adam Hood: That's definitely one of the ideas, yeah.

08:49 Michael Kennedy: A lot of the code smarts around understanding the code, suggesting fixes and improvements and whatnot like PyCharm does, but in notebook form, right?

08:59 Adam Hood: Yes. And I mean, that's always been one of the big trademarks of the whole company, this idea of understanding the code, helping you with your code, so it would be kind of remiss on our part if we didn't include that.

09:13 Michael Kennedy: Yeah, it definitely would look out of place if it wasn't there. You have all these cool code assistant, code intention tools. Like, why are they not being brought over here? When you talk about building it, who do you imagine your users are?

09:26 Adam Hood: I can't really speak to specifics. I would just say the whole realm of data scientists, data analysts. Anyone who works with data and works with data in Python in particular hopefully could use this tool.

09:39 Michael Kennedy: Sure, pretty similar to the folks that might choose Jupyter?

09:41 Adam Hood: Yes.

09:43 Michael Kennedy: Okay, maybe it's worth doing a compare and contrast between Datalore and Jupyter or JupyterLab. So, how are they different? What does Jupyter have that you guys don't? What do you have that Jupyter doesn't? Maybe position those two things for us.

09:56 Adam Hood: Sure. I guess it's easier for me to talk about what we have that Jupyter doesn't.

10:00 Michael Kennedy: Sure.

10:01 Adam Hood: Not that I want to say Jupyter is awful. I mean, obviously, Jupyter is doing a great job of filling the space right now. One of the big differences is the way that we work with the environment that you have. In Jupyter, you have this one global environment. You execute the code in some cell, and it executes on top of that environment.

10:22 Michael Kennedy: So like basically if I declare a variable in a cell, it's just global to all the other cells, and that kind of thing?

10:29 Adam Hood: At least, that's my understanding of Jupyter, up to this point. Datalore is a bit different. You can almost think of it as after every cell, you have a different environment. So you have whatever you write into the first cell, you execute that, or by default it executes automatically, and you have that environment, but then you start adding to another cell. That'll execute on top of the environment from the previous cell. Basically, the idea here is that at least for me, working with Jupyter, one thing that gets confusing is if I start having to re-execute previous cells, the environment gets a little bit confusing for me to keep track of, just because, you know, I see the sequential code, but that doesn't represent what we actually have in the environment.

11:15 Michael Kennedy: Right, like maybe there's 10 cells, and you went back, and you fiddled with number three and reran it, but you didn't run four, which would have actually taken the output of three to make a difference. You know, like that kind of thing, right?

11:26 Adam Hood: Right, and for us, that's not the reality. For us, like, if you go back and fiddle with three, you don't necessarily have to execute four, five, and six right away, but if you do, they will reflect what you had from three. So I guess that's one pretty big difference. I touched on it already, but we have two different computation modes. By default, we have basically live computation. Like, as you write the code, we're executing it.
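
The per-cell environment idea described above can be illustrated with a toy model. This is only a sketch under assumptions, not how Datalore is implemented: the function name `run_cells` is hypothetical, and each "cell" is just a string of Python source that runs on top of a copy of the environment left by the cell before it.

```python
def run_cells(cells):
    """Run cells in order and keep one environment snapshot per cell.

    Each cell starts from a shallow copy of the previous cell's
    environment, so the state after cell N depends only on cells 0..N.
    (Mutable objects are still shared between snapshots in this toy.)
    """
    snapshots = []
    env = {}
    for source in cells:
        env = dict(env)      # copy, so earlier snapshots are not rebound
        exec(source, env)    # run this cell on top of the previous state
        snapshots.append(env)
    return snapshots


cells = ["x = 1", "y = x + 1", "z = y * 10"]
before = run_cells(cells)

# Edit the first cell and re-run: every later cell reflects the change,
# so the code on screen always matches the state in the environment.
cells[0] = "x = 5"
after = run_cells(cells)

print(before[2]["z"], after[2]["z"])
```

In this model there is no way for cell 3 to see a stale `y` left over from an earlier run, which is the confusion with a single global environment that the conversation describes.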

11:53 Michael Kennedy: Even before you say execute cell?

11:56 Adam Hood: Right.

11:56 Michael Kennedy: As you type lines out, it just, if it's valid syntax, it runs?

11:59 Adam Hood: Yeah, pretty much. It's useful if you're doing lightweight stuff. The idea is you're getting the feedback right away. If you write something and you try to reference a variable that doesn't exist or something, you know that right there. Of course, if you have something that's really long-running, you might not want to run it a million times, but we do have a more Jupyter-like computation mode where you actually execute the cell. So I would say that's another kind of key difference.
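
A rough sketch of the "if it's valid syntax, it runs" behavior, assuming a hypothetical helper `live_run` that is called on every edit. It is not Datalore's actual mechanism; it only shows the idea of parsing first, executing only when the text is complete Python, and surfacing errors such as an undefined variable right away.

```python
import ast


def live_run(source, env):
    """Execute `source` in `env` only if it parses; report problems as text."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # The user is probably still typing; do not run anything yet.
        return "waiting: code is not valid Python yet"
    try:
        exec(compile(tree, "<cell>", "exec"), env)
    except Exception as exc:
        # Immediate feedback, e.g. a reference to a variable that does not exist.
        return f"error: {type(exc).__name__}: {exc}"
    return "ok"


env = {}
print(live_run("total = (1 +", env))      # incomplete expression, not run
print(live_run("total = 1 + 2", env))     # valid, runs and defines total
print(live_run("print(missing)", env))    # valid syntax, fails at runtime
```

A long-running cell would be re-executed on every keystroke under this scheme, which is why an explicit, Jupyter-like run mode is the better fit for heavy computations.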

12:27 Michael Kennedy: Yeah, probably the other one, I guess you might say, is it's in the cloud, period, right?

12:33 Adam Hood: It's in the cloud, yes.

12:35 Michael Kennedy: You can't just locally pip install datalore.

12:37 Adam Hood: Nope.

12:38 Michael Kennedy: Datalore notebook sort of thing, so, for both the good and bad that comes with that, right? But it's definitely a difference between the two.

12:45 Adam Hood: Yeah, it's in the cloud. Some of the things that come with being in the cloud is for one thing, it's collaborative.

12:52 Michael Kennedy: Right.

12:53 Adam Hood: A lot like a Google Docs. I mean, you can share with coworkers, whoever. You can have them just view it, or they could also edit your workbooks.

13:03 Michael Kennedy: On the collaborative bit, that sounds super interesting, and it's really cool. I think this is actually a trend in a decent amount of the tooling, to have this sort of collaborative live coding, right? I know there's been so many folks who have done all sorts of tricks to get some sort of like, realtime collaboration, even so much as like, going to Amazon, spinning up a virtual machine that you can both log into over remote desktop or something, and actually like, we're going to both type on this. And like, that's probably not the right way to do it. But people were forced to do this, right?

13:38 Adam Hood: Right, yeah.

13:39 Michael Kennedy: 'Cause we got Google Docs and cool stuff like that, but PyCharm doesn't do it, the other tooling doesn't do it. But, for example, Microsoft announced something for Visual Studio this year where you can remotely share a session, and you can even remotely debug the other person's code on their machine. It's pretty interesting. I think this is tapping into a trend that is making a lot of sense, you know?

14:02 Adam Hood: Any time you have a project where you need multiple people to work on it together, you need some sort of tooling for collaboration.

14:08 Michael Kennedy: The alternative has been to check it into Git.

14:10 Adam Hood: Yep.

14:12 Michael Kennedy: And merge it.

14:12 Adam Hood: Yeah. I've only interacted with Git with Jupyter Notebooks once, and it was a painful process.

14:20 Michael Kennedy: I can imagine. Well, and also, the exploratory nature doesn't make a lot of sense. You're like, oh, I change a variable, I checked it in, do a pull, right? I mean, that's too quick of an iteration. It's, I added a feature, check it out. That's a different type of thing. But notebooks, like you said, sort of tend to be more iterative and exploratory, and I think that makes a lot of sense. What does the collaboration look like? I've played with Datalore, but I haven't played with it with other people. So how realtime is it? Is it every word, every character, every cell?

14:52 Adam Hood: Pretty much every character that you type. You could end up sending several characters at once. I don't know exactly how that works, but more or less, every character, as you type it, it's getting synchronized to your collaborators and so on. We have this mode where you can actually follow what your collaborator is doing. They have a little avatar in the top-right corner. If you click on it, you'll start tracking them. So as they make edits, you'll see them realtime.

15:18 Michael Kennedy: That's pretty awesome. Have you seen people do anything or heard of people like, using this for teaching, or say, live presentations, where everyone else can come in and like, here we're all sharing the notebook, and all the students follow along or the participants follow along?

15:34 Adam Hood: It's definitely something that you can do, and I think some people have done that.

15:39 Michael Kennedy: It sounds so much better than slides over WebEx.

15:43 Adam Hood: Right.

15:43 Michael Kennedy: You know? Like, here, open this in your browser and follow along as we do it.

15:47 Adam Hood: Yeah, I mean, one thing that was kind of cool was we had this little internal workshop. One of our data scientists did a presentation on data science, just some overview stuff, and he was doing it in Datalore, and there is a big difference between just watching what he's doing if he's screen sharing versus just being able to follow him as he moves along, and that was really cool.

16:10 Michael Kennedy: Yeah, it seems like there's a lot of possibility.

16:12 Adam Hood: Yeah, yeah, I think definitely.

16:13 Michael Kennedy: Nice. So, I guess this puts it kind of in the realm of Google Colaboratory, which I actually haven't used very much. I don't know if that has live collaboration, but I'm guessing by the name that it does.

16:23 Adam Hood: Yeah, I would assume so. I haven't used it. I have opened it. It looks to be just a new format alongside Google Docs or Sheets or Slides, except it's a notebook instead of one of those things. Pretty nice thing to have.

16:39 Michael Kennedy: Yeah, that's cool. So, I haven't done much with Google Colaboratory, and I haven't heard a ton of people talking about it. But it does seem pretty nice. Cool. So, maybe we could dig into some of the features that make Datalore cool. So, there's a few things that caught my attention. Maybe I'll bring them up, and then you can describe what they are and how they work and why they're cool or whatever. So, one of 'em is smart code completion, sort of context-aware code completion. Basically all the code completion you have in PyCharm, you also have over at Datalore, right?

17:14 Adam Hood: In some ways, we could have even more, because we're aware of everything that's in the environment already, but yes.

17:21 Michael Kennedy: Okay. It feels like it's more of a real editor when you're working in the cells to me.

17:27 Adam Hood: That's the idea, at least. That's the hope. I think one of the cool things, just going between Datalore and PyCharm for me, though, was like, in PyCharm, you're going to get all of the static inspections, completions, everything, but you can even get runtime information as you go along and you edit in Datalore, because it is running as you go along.

17:48 Michael Kennedy: Oh, I see, so maybe a field was added to some object at runtime, and because it's just hanging out there in the environment, you could get completion on that, whereas in PyCharm, it's got to be static analysis, right? That's what you're saying?

18:01 Adam Hood: Yeah, I think so.
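
The difference between static and runtime completion can be shown with plain Python introspection. This is a minimal sketch, not Datalore's completion engine: `Experiment` and `complete` are hypothetical names, and the point is only that an attribute added at runtime is visible to a tool that can inspect the live object.

```python
class Experiment:
    def __init__(self, name):
        self.name = name


def complete(obj, prefix):
    """Suggest public attribute names of a live object that start with `prefix`."""
    return sorted(
        attr for attr in dir(obj)
        if attr.startswith(prefix) and not attr.startswith("_")
    )


run = Experiment("baseline")
run.learning_rate = 0.01   # added at runtime, not declared in the class body

print(complete(run, "l"))
print(complete(run, "n"))
```

A purely static analysis of the `Experiment` class would only find `name`; looking at the object in the running environment also finds `learning_rate`.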

18:03 Michael Kennedy: Yeah, yeah, that's pretty awesome. So, then you have like, all the code inspection, like this thing has a type annotation that says it takes a list, but you're passing a string, what goes here, that kind of stuff, right?

18:15 Adam Hood: As far as I know, we're still working on adding a bunch of those inspections. So we only have a few. I think one example is we have an inspection for unused variables. If we see a variable that's just sitting around, completely unused, we'll highlight it. We probably have a couple of others. I don't know all of them.
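
An unused-variable inspection of the kind mentioned here can be approximated with the standard `ast` module. This is a simplified sketch, assuming a hypothetical function `unused_variables`; it ignores scopes, attributes, and imports, and it is not the inspection Datalore or PyCharm actually ships.

```python
import ast


def unused_variables(source):
    """Return names that are assigned somewhere in `source` but never read."""
    assigned, used = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                assigned.add(node.id)   # name is written to
            elif isinstance(node.ctx, ast.Load):
                used.add(node.id)       # name is read from
    return sorted(assigned - used)


code = """
import math
radius = 2
area = math.pi * radius ** 2
scratch = 42
print(area)
"""

print(unused_variables(code))
```

Here `scratch` is assigned and never read, so it is the one name an editor would highlight.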

18:32 Michael Kennedy: Sure. Navigation? So you can navigate, turn parts of your notebook into hyperlinks that take you to like, where the function's defined or the variable and things like that?

18:41 Adam Hood: Yeah, yeah, you can. I've only used it a few times, but you definitely can do that. I don't remember the shortcut, even. When I'm developing software, that's one of the most useful things that I use. Like, I always want to see the function.

18:56 Michael Kennedy: Absolutely. I mean, that's one thing that drives me crazy about some of the more basic or, you know, multi-file type of editors: you can't just say, go to definition on this. I don't care where it is. Go find it, and show me what it is, you know what I mean? This portion of Talk Python to Me is brought to you by Linode. Are you looking for hosting that's fast, simple, and incredibly affordable? Well, look past that bookstore, and check out Linode at talkpython.fm/linode. That's L-I-N-O-D-E. Plans start at just five dollars a month for a dedicated server with a gig of RAM. They have 10 data centers across the globe, so no matter where you are or where your users are, there's a data center for you. Whether you want to run a Python webapp, host a private Git server, or just a file server, you'll get native SSDs on all the machines, a newly-upgraded 200-gigabit network, 24/7 friendly support, even on holidays, and a seven-day money-back guarantee. Need a little help with your infrastructure? They even offer professional services to help you with architecture, migrations, and more. Do you want a dedicated server for free for the next four months? Just visit talkpython.fm/linode.

20:09 Adam Hood: The other thing that we can do in terms of navigation is we have a find, basically like Command + F or Control + F in your browser, except it just searches the code that you have.

20:21 Michael Kennedy: Okay, that's pretty cool. You also have, it says code intentions, or just intentions. Tell us about that. That looks pretty cool, with like datasets and all sorts of things.

20:29 Adam Hood: Right, there are these context-aware suggestions for actions that you can take with your code. By and large, they're trying to focus on the data science workflow. Like, if you have a completely empty workbook, you've just started, you can start out by adding some standard imports, like pandas, NumPy, Matplotlib. We also have this internal plotting library, I think. You can also start by loading a dataset. You can upload something, or you can work with, we provide like 10 or 12 example datasets or something like that. Well, we don't provide all of them, but you can load using this intention.

21:10 Michael Kennedy: So like, some of them might go and like, suck it off of a website or something, or off an API?

21:14 Adam Hood: Right, you can do that. We do have this file uploader as well. You can upload files from disk, or if you know.

21:21 Michael Kennedy: Sure, what format do they have to be in? Can you just put whatever and you just read it, or is it like, does it get understood when you upload it?

21:27 Adam Hood: You're going to end up reading it in the code. So if you know how to read it, like, programmatically, pretty much anything.

21:33 Michael Kennedy: I see. But it's not like you send it a .csv file and it like, pre-parses it for you.

21:38 Adam Hood: No, it won't do that.

21:40 Michael Kennedy: Yeah, okay.

21:41 Adam Hood: If you do upload a .csv or .tsv file, it will get suggested by the intention.

21:47 Michael Kennedy: So, you said it has all these different data sources and some datasets. Like, what are some of the datasets that I can work with?

21:54 Adam Hood: Well, we have a number of examples. I can even give a full list. So a lot of these are like, just examples that are used, you know, teaching examples.

22:02 Michael Kennedy: Mmhmm.

22:03 Adam Hood: Iris, Boston Housing, breast cancer. We have two different digits datasets. One of them is small, one of them is large, like the handwriting, you're trying to recognize handwritten digits.

22:14 Michael Kennedy: Right. So that's like an OCR, machine learning one. The Iris one is also an optical image processing dataset that's pretty commonly used I think. Yeah, sounds cool. If you wanted to do, like, demonstrate some form of, say, machine learning around OCR, you could just go, yeah, we're just going to use the well-known dataset from here, right? Are you planning on adding more?

22:38 Adam Hood: I'm not sure. I think for now, we're satisfied with it.

22:42 Michael Kennedy: So basically, you have the ability to upload, so you just upload your files?

22:46 Adam Hood: I think for the most part, we don't even have these uploaded internally. It's just, we know how to call the functions that will load these. Like, with Iris, you can get it directly from scikit-learn, and same with a few others.

23:00 Michael Kennedy: Yeah, that sounds pretty cool. What else happens with the intentions? So it's not just about datasets, right?

23:06 Adam Hood: Right, it's not just about datasets. So you start out by loading, and then after you load the dataset, you can, I guess, split it, extract out X and Y where X is all the, what is it called, the feature matrix or something like that. Y is your target. We will do the Train/Test Split. For me, I remember, I always screw up the syntax for Train/Test Split, like, is it xtest, xtrain, or is it xtest, then ytest, or, I don't know.

23:35 Michael Kennedy: But Datalore knows?

23:35 Adam Hood: Datalore knows. So I use the intention, and it'll split the data for me. It'll create some set of models. Like, it'll create the object of various scikit-learn estimators. It can also do a few pre-processing steps if you need to. Like, if you loaded Titanic, there's some missing values and some columns that, like, you have a couple of features that need to be transformed in some way. So you can do like a one-hot encoding, you can impute, and there are intentions that will do these things for you.
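
For reference, the load-and-split step the intentions generate looks roughly like the following when written by hand with scikit-learn. This is a sketch of the general pattern, not the exact code Datalore emits; the `test_size` and `random_state` values are arbitrary choices for the example. Iris is a small table of 150 flowers with four measurements each, and `train_test_split` returns the pieces in the order X train, X test, y train, y test.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# X is the feature matrix (150 rows, 4 columns); y is the target (species).
X, y = load_iris(return_X_y=True)

# Return order: X_train, X_test, y_train, y_test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)   # (120, 4) (30, 4)
print(y_train.shape, y_test.shape)   # (120,) (30,)
```

Getting the unpacking order wrong does not raise an error when the shapes happen to be compatible, which is one reason generating this line automatically is convenient.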

24:10 Michael Kennedy: Nice. So, how does that show up? Does it show up like you type a word and hit tab, and then it gives you some big, expanded thing, or what does it do?

24:18 Adam Hood: Actually, they just manifest as buttons at the bottom of the cell. Well, we call them blocks, but for the sake of Jupyter users, we can talk about cells. There's just all these buttons that you have at the bottom, and the buttons that you see exactly will depend on the context, like, what variable do you have focused, or it could just depend on the context of like, how far along are you in this machine learning process? What have you done so far? What would be the next step? You can just click the button. You might have a dialog with some options, or it'll just do something.

24:51 Michael Kennedy: Yeah, that sounds really cool. What machine learning libraries does it support? Is it just scikit-learn? Does it do like, Keras or any others?

24:58 Adam Hood: The intentions specifically, or Datalore as a whole?

25:01 Michael Kennedy: Yeah, when it suggests to do like, some training, what does that work with?

25:04 Adam Hood: It's going to work, by and large, with scikit-learn. And, I mean, it'll use pandas or NumPy to support the process. I think it generally prefers to work with, we prefer to work with pandas DataFrames.

25:16 Michael Kennedy: So another thing that caught my attention, and I think you were touching on it before, is this idea of dependencies between the cells or blocks and the variables, and like, incremental computation. So, if you have something that defines, like, X somewhere, and then something else that maybe passes X to a function, Datalore is aware of that relationship and can do like, a partial re-execution to make it go back in sync? Is that right?

25:43 Adam Hood: Yeah, like if you update X way above, it'll know that you need to reevaluate that.
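
One way to picture this dependency tracking is to record which names each cell defines and which it reads, then walk forward from the edited cell. The sketch below is an assumption about the general technique, not a description of Datalore's internals; `names` and `stale_cells` are hypothetical helpers, and the analysis is deliberately conservative and ignores scopes.

```python
import ast


def names(source):
    """Return (defined, used) sets of variable names for one cell."""
    defined, used = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                used.add(node.id)
    return defined, used


def stale_cells(cells, edited_index):
    """Indices of later cells that need re-running after one cell is edited."""
    dirty, _ = names(cells[edited_index])   # names whose values may have changed
    stale = []
    for i in range(edited_index + 1, len(cells)):
        defined, used = names(cells[i])
        if used & dirty:
            stale.append(i)
            dirty |= defined    # everything this cell produces is suspect too
        else:
            dirty -= defined    # freshly rebound without touching dirty names
    return stale


cells = ["x = 2", "y = x * 3", "label = 'demo'", "z = y + 1"]
print(stale_cells(cells, 0))
print(stale_cells(cells, 2))
```

Editing the first cell marks the second and fourth cells for re-execution because `y` depends on `x` and `z` depends on `y`; the third cell is left alone, and editing it affects nothing downstream.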

25:48 Michael Kennedy: I think that's a pretty cool feature. Before you were talking about, you know, running the cells in, say, Jupyter Notebook in different orders and getting different results, right? And it's not at all obvious how that happened. I know it has a little number along the side, like the order of execution, but if you've got 20 cells, do you really have that held in memory, in your memory, of which order it is? Not really, right?

26:11 Adam Hood: I know from talking to people that for some people, it isn't really a problem, but I think, like, if you have a lot of experience working with Jupyter Notebooks, you're probably used to it, and you know how to work with it well.

26:22 Michael Kennedy: You know what trouble you can run into, so you just don't do it, right?

26:24 Adam Hood: Yeah, for me, it's always a problem. Well, I'm not that experienced with Notebooks, and I think, I imagine for a lot of people trying to break in, it is a hurdle you have to get through.

26:34 Michael Kennedy: Sure. So, I guess once you kind of built up your data, maybe trained some models with these intentions, things like that, you probably want to look at it, right? And so you talked about some libraries, you know, Matplotlib for example, but you also have some built-in plotting libraries, one for two-dimensional plots and one for geographical ones. So I guess, what are those, and then like, why not just import Altair or something, you know, like some other thing, right? Like, why do you have your own? What are the benefits?

27:05 Adam Hood: I can't really compare and contrast with a lot of different libraries, 'cause I don't know them that well. But yeah, we do have, I guess, datalore.plot. That's the prefix for the library. It's our internal graphics library. It's based on the grammar of graphics. So it's going to be very similar to like, ggplot2 in R. And I think there are some Python libraries that might also be based on the grammar of graphics.

27:30 Michael Kennedy: Sure.

27:31 Adam Hood: That I don't know about or I don't know the names of off the top of my head. My impression is that Matplotlib is definitely the most popular library in Python for graphics or plotting. But me personally, and I guess enough other people that we decided to develop this library, have this preference for the grammar of graphics, for that different syntax. I remember in my days of working with MATLAB the general torture that was working with graphics in MATLAB, and Matplotlib is, as far as I understand, based on that, or very similar. So yeah, I guess the motivation originally was to bring the grammar of graphics to Python, at least within our tool.

28:14 Michael Kennedy: Okay. You could still use the others? Like, if I wanted to use some other library that typically works in Jupyter, I could install it, import it, and run it?

28:22 Adam Hood: Yeah, it should work. If it doesn't, we appreciate the bug report.

28:26 Michael Kennedy: Sure. But the intention is that you can, right? Okay, so I guess another question that makes me start to think of is, what about dependencies? Like, how do I specify my Python dependencies to other libraries I'm going to use, and then how do I get them? Like, if it's running on your cloud, how do I get the five libraries off PyPI that I actually need to do my thing?

28:48 Adam Hood: So, we have a library manager. Like, we have a tool that lets you set up a custom environment. We have some default environment, and it includes a lot of the basics that you probably want, like NumPy and pandas and Scikit-learn and Matplotlib.

29:03 Michael Kennedy: Right. You might as well just start with those installed, right?

29:06 Adam Hood: Right, exactly. But then, you have the option to use this library manager to add whichever libraries you like, from Conda, from pip, even if you have access to some GitHub repo.

29:21 Michael Kennedy: Right, where you pip with the git command, just to the repo base, yeah.

29:26 Adam Hood: Yeah, so you can install from GitHub.
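
For reference, plain pip accepts a `git+https://...` URL for installing straight from a repository, optionally pinned to a branch or tag with `@ref`. The sketch below only builds that command. The owner and repository names are placeholders, the helper name is made up, and how Datalore's library manager does this internally is not described in the conversation.

```python
import sys


def pip_git_command(owner, repo, ref=None):
    """Build a pip command that installs a package from a GitHub repo.

    `owner`, `repo` and `ref` are whatever repository you have access to;
    the values used below are placeholders, not a real project.
    """
    url = f"git+https://github.com/{owner}/{repo}.git"
    if ref:
        url += f"@{ref}"  # pin to a branch, tag, or commit
    # Using the current interpreter ensures pip installs into this environment.
    return [sys.executable, "-m", "pip", "install", url]


cmd = pip_git_command("example-user", "example-repo", ref="main")
print(" ".join(cmd[1:]))
# To actually run it: subprocess.run(cmd, check=True)
```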

29:27 Michael Kennedy: Yeah, but basically, you can install whatever you really need to to do your work, right? For now, as we sort of started at the beginning, Datalore is a cloud thing. I know there's a ton of people out there listening going, that's great, but my company won't let me upload my data to a cloud thing. Is there a way to run it on-premise, maybe even if that's not like, pip install datalore, but maybe is like, here's a virtual machine you can license from us internally.

29:58 Adam Hood: Unfortunately, no, it's not possible right now. It's something we're very much aware of, 'cause we've heard this feedback, this exact feedback, from a number of users through various channels. So it's very much something that's on our radar, something that we could consider, but unfortunately right now, it is not supported, no.

30:16 Michael Kennedy: And if it's not, maybe that's fine, right? Like, Google Docs doesn't have an on-prem Google Docs. Just don't do it, right? I do think that, you know, a lot of the data science folks are analyzing data that's pretty closely guarded, right? It would be good, but who knows.

30:33 Adam Hood: As I said, we definitely, it's not the first time we've heard it, so.

30:37 Michael Kennedy: Yeah. Is the source code for Datalore available, or is it private?

30:42 Adam Hood: Right now it's private.

30:42 Michael Kennedy: Okay, doesn't surprise me. Even if it were public, it's probably like, deeply bound to your infrastructure, you know what I mean?

30:50 Adam Hood: Yeah, I mean, I can't say if there are any plans to open source it or not. We have open sourced some projects, and other projects, we've never open sourced. It's not really my decision in the end either.

31:00 Michael Kennedy: Sure. So, there's different ways, like different machines or computational units, whatever you can buy to like, do your analysis so when you start out, you can just play around with some moderate virtual machine that it's running on, but you can get like a higher-end computation, or can you run your stuff on GPUs as well?

31:21 Adam Hood: Yes, you can.

31:21 Michael Kennedy: How's that work?

31:23 Adam Hood: The only thing that's available, so we have two different plans right now. We have a free plan, a paid plan. The only thing that's available on the free plan is just these medium, like these moderate instances, which work fine for just some general all-purpose stuff. But, if you upgrade, you have the opportunity to upgrade to larger, more powerful machines. All of these right now are hosted on AWS, so you can even look them up on AWS, but one of them does include GPUs.

31:53 Michael Kennedy: Right. One of 'em is, I guess, in EC2 parlance, a p2.xlarge, which would be a four-CPU and single GPU machine with 61 gigs of RAM. That's a pretty serious machine, with 12 gigs of GPU RAM. Maybe we talk about pricing as well, and like, so this is both a thing you can use for free, and also something you can pay to get more for, right?

32:19 Adam Hood: Right.

32:20 Michael Kennedy: What do you get for free, and what do you have to pay for?

32:22 Adam Hood: What you get for free is a monthly allowance on these moderate machines, I think 120 hours per month across all your various computations.

32:34 Michael Kennedy: Is that actual execution time, or just like, I'm working with my notebook?

32:38 Adam Hood: The second one, like, when you're working on the notebook.

32:41 Michael Kennedy: Like, if I have it open and I'm interacting with it? What if I leave my browser open and I go to lunch? Does that count towards my hour?

32:50 Adam Hood: It does. Now, eventually, we will shut it down if it's been idle for a really long time, because we don't want you to just run through your full allowance by accident without realizing it. As long as it's open, we keep the instance running.

33:04 Michael Kennedy: Right, so 120 hours is like, if you do that eight hours a day, that's 15 days a month. That's still, you know, that's a lot of time for free. So I just want to make sure people understand like, what it is when you say hour, what you're actually talking about, right?
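
The arithmetic behind that estimate, as a small sketch. The 120-hour figure is the one quoted in the conversation ("I think 120 hours"), and the function name is made up for illustration.

```python
FREE_HOURS_PER_MONTH = 120  # allowance as quoted in the conversation


def days_covered(hours_per_day, allowance=FREE_HOURS_PER_MONTH):
    """How many days the allowance lasts at a given number of open-notebook hours per day."""
    return allowance / hours_per_day


print(days_covered(8))  # 15.0
```

Since the clock runs while the notebook is open, not only while code executes, `hours_per_day` here means time with the workbook open.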

33:18 Adam Hood: That's fair. Then, so if you upgrade, you unlock the ability to use some of these larger, more powerful machines. You still have to pay for them, I guess, but like, we have various different machines depending on what you need.

33:32 Michael Kennedy: But you basically pay like the EC2 price of the machines if you select 'em?

33:38 Adam Hood: It's marked up. I mean, you can look up the EC2 prices, so it's not going to be exactly the same.

33:43 Michael Kennedy: Sure.

33:44 Adam Hood: But, it's going to be based, roughly, on the EC2 prices. The more expensive machines will obviously cost more; the less expensive machines will cost less.

33:52 Michael Kennedy: Well, that sounds pretty cool. I mean, the fact that there's quite a bit of free space and computation and whatnot, I think that's what it takes to be interesting to everybody, right?

34:02 Adam Hood: Yeah. At the end of the day, we are competing with an open-source thing, which will always be open-source, so we better do something more.

34:11 Michael Kennedy: Yeah, I think it makes sense. It seems pretty fair. This portion of Talk Python to Me is brought to you by brilliant.org. Many of you have come to software development and data science through paths that did not include a full-on computer science or mathematics degree. Yet, in our technical field, you may find you need to learn exactly these topics. You could go back to university, but then again, this is the 21st century, and we do have the internet. Why not take some engaging online courses to quickly get just the skills that you need? That's where brilliant.org comes in. They believe that effective learning is active, so master the concepts you need by solving fun, challenging problems yourself. Get started today. Just visit talkpython.fm/brilliant and sign up for free. And don't wait, either. If you decide to upgrade to a paid account for guided courses and more practice exercises, the first 200 people that sign up from Talk Python will get an extra 20% off an annual premium subscription. That's talkpython.fm/brilliant. I asked if it will run outside the cloud, your cloud, and right now, you're saying no, it's hosted in EC2. Can I pick where it runs in EC2? Like, suppose I work on healthcare, and it has to stay within Europe, or it has to stay within France or something.

35:30 Adam Hood: At the moment, no, you can't pick which cloud it runs in.

35:34 Michael Kennedy: But basically, right now, we don't have a dropdown, like I want to do my work in Europe, or I want to do my work in the US or something like that.

35:41 Adam Hood: No, there's nothing like that.

35:43 Michael Kennedy: Yeah, do you know if there's any plans for something like that?

35:45 Adam Hood: I'm not sure, actually. I mean, one thing we always have to keep our eye on is all the different privacy laws and make sure that we're not doing something untoward with anyone's data, making sure that we're always complying.

36:00 Michael Kennedy: Yeah, I mean, there's certainly value to the privacy laws, but man, they can be hard to keep straight. 'Cause it's not just one country, right? It's all the countries.

36:09 Adam Hood: Right, and if you have users in one country, you have to comply with the laws for that country for that user.

36:13 Michael Kennedy: Yes, exactly. Welcome to the internet, right? So if people want to go and play around or explore Datalore, you have a set of prebuilt notebooks that they can go play with. You want to maybe talk us through some of those that they can explore?

36:32 Adam Hood: Right, we do have some sample workbooks. I am not too up to date on what the exact sample workbooks are on, but we actually do have some set here, which is kind of nice. So the first one, these links will be somewhere?

36:45 Michael Kennedy: So yeah, the links are super long and crazy, and I'll put them in the show notes, so don't worry about trying to say them. They're like, you know, UUID type of things, yeah.

36:53 Adam Hood: Yeah, in any case, if I'm understanding correctly, I think if you create an account, these workbooks will be automatically created for you so you can see them for yourself.

37:03 Michael Kennedy: Exactly, under your home you've got, like, when I log in, at least, in my account, I haven't done really much so this has got to be what's coming with it, you know? There's like, a Scikit-learn section, there's a map section, and there's a plotting section, and the plotting has the most, actually.

37:17 Adam Hood: Yeah, I guess we were kind of proud of it and want to show it off.

37:21 Michael Kennedy: Yeah, sure.

37:22 Adam Hood: In any case, I guess the first one, we're basically showing, first of all, in pandas, you have the scatter_matrix function. It's very useful for some exploration. You can see distributions of the different features, as well as their distributions with each other, whatever the right terminology is. I feel like I'm butchering data science terminology as we go along. In any case, so first, we're showing how you can generate the scatter matrix in pandas, but then we also want to show how you can do it using Datalore plot, using the internal plotting. So first of all, you can create a scatter plot pretty easily for any pair of features that you have, using the geom_point function. You can also create a histogram using the geom_histogram function, and you can update it to show density instead of just the counts of the different bins.
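
The difference between a histogram of counts and a histogram of density can be sketched in plain Python. This assumes the usual convention, where density is scaled so the bar areas sum to 1; the conversation does not spell out how the Datalore plot option is defined, and the `histogram` helper and sample values below are made up.

```python
def histogram(values, bins, lo, hi):
    """Return (counts, density) for equal-width bins over [lo, hi]."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        if lo <= v <= hi:
            # A value exactly on the right edge goes into the last bin.
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return counts, [0.0] * bins
    # Density: each count divided by (number of points * bin width),
    # so that sum(density * width) == 1.
    density = [c / (total * width) for c in counts]
    return counts, density


values = [1.0, 1.5, 2.0, 2.5, 2.5, 3.0, 3.5, 4.0]
counts, density = histogram(values, bins=3, lo=1.0, hi=4.0)
print(counts)   # [2, 3, 3]
print(density)  # [0.25, 0.375, 0.375]
```

Counts depend on how many points you have; density does not, which makes it easier to compare distributions of different sizes.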

38:10 Michael Kennedy: Yeah, nice. So basically, like, all the plotting that you, most of the plotting you want to do with ggplot or the Datalore plotting stuff, right? Yeah, so these are nice. These graphs are pretty interactive.

38:22 Adam Hood: Yeah, I guess that's something I didn't even talk about yet. They are interactive. Like, if you hover over them, you can see the different points on them, the different coordinates, which is kind of cool.

38:32 Michael Kennedy: I like it, it's quite nice. So you have another one on Bayesian inference, which is pretty cool. So, one thing here that I'm seeing is, gosh, I'm not entirely sure what this syntax is. I think it might be LaTeX. So at the top, you've got basically really symbolically nice mathematics in here, right? That's pretty cool, that you can actually write the formulas in mathematical notation, not just trying to put ASCII characters to make the integral sign look like an integral sign or something.

39:08 Adam Hood: Yeah, basically, we support, we have these Markdown cells or blocks, as I guess we call them, and you can include LaTeX into these cells, as you can see there, which is always kind of fun. It does make for a nice presentation.

39:22 Michael Kennedy: Yeah, I have totally mixed feelings about LaTeX, coming from a math background. Like, I love the outcome. I hate the creation of it. It's so painful, it's so not obvious.

39:32 Adam Hood: I can fully understand, but see, when I was in college, what made LaTeX bearable for me was I started discovering tools like, I don't remember if it was called Overleaf then and it's LaTeX now or vice versa, but it was basically this real-time LaTeX editor, which was collaborative.

39:49 Michael Kennedy: Oh, interesting.

39:50 Adam Hood: You could switch between modes, like compile incrementally, or you just press a button to compile, but it gave you the real-time feedback about exactly what you were writing, which was always really cool.

40:02 Michael Kennedy: Yeah, that's pretty cool. And for people who don't know, LaTeX is basically, it's been around for a while. It's kind of like Markdown syntax, but for symbolic mathematics, like I want the sum from n equals zero to one, or for 100. It'll actually do all the symbols, and it's used a lot in math papers and whatnot. So, let's see, what else? We got a lot of, like a whole bunch of the Datalore plot stuff, the geometry and whatnot, and you have some California housing data as well graphed out, which has got some cool geometry stuff, and it pulls in OpenStreetMap and then like, overlays it, right?
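
For instance, a sum from n equals zero to 100 could be written in LaTeX as follows. The summand n and the closed form on the right are added here for illustration; they are not from the sample notebook.

```latex
\[
  \sum_{n=0}^{100} n = \frac{100 \cdot 101}{2} = 5050
\]
```

The `\sum` command with `_{...}` and `^{...}` produces the summation sign with its lower and upper limits, and `\frac{...}{...}` typesets the fraction.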

40:38 Adam Hood: Something like that.

40:39 Michael Kennedy: That's pretty nice. So, yeah, I guess that's just using the California housing data from Scikit-learn datasets.

40:46 Adam Hood: Exactly. It conveniently has the latitude and longitude coordinates, so we can just plot every point on the map as we see it, which is kind of neat.

40:54 Michael Kennedy: That's cool. So, one thing at the bottom that's pretty interesting is it's got a picture of airplanes flying along the Hudson, and it overlays animated flying airplanes using the geography plotting stuff, which is, that's pretty awesome.

41:11 Adam Hood: I don't work on this myself, but I'm always amazed at the new features that keep coming out, like all the different things you can do with the maps and plots. I think this is one of my favorite little animations, just showing the different routes.

41:23 Michael Kennedy: Just sit there and watch 'em fly along it? That's pretty cool. Nice, so people can check out those shared notebooks on Datalore and play with them. Basically, like you said, if you create an account, you get all those in your account.

41:33 Adam Hood: Yep.

41:33 Michael Kennedy: Yeah. What about version control, right? Like, if I have Jupyter Notebook local, or especially if I have Python files locally, it's super easy to check that into GitHub, do a diff, do a branch, do a PR. Like, what does version control look like over here?

41:49 Adam Hood: We have our own integrated version control. Basically, we have this thing called history. Every now and then, if there's some inactivity, we'll just keep a checkpoint, kind of like a commit or something, you can think of it, or you can make your own as you need to, and then you can compare the diffs, you can revert to the previous version. I think you could even revert specific changes, but you'd have to check me on that one. Basically, like, this internal version control.
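
A linear history of checkpoints with diffs and reverts can be modeled in a few lines. This is a toy sketch of the concept, not Datalore's implementation: the `LinearHistory` class is hypothetical, and treating a revert as a new checkpoint appended to the end is an assumption made here to keep the history linear.

```python
import difflib


class LinearHistory:
    """A single, unbranched sequence of checkpoints (hypothetical model)."""

    def __init__(self, text=""):
        self._checkpoints = [text]

    @property
    def current(self):
        return self._checkpoints[-1]

    def checkpoint(self, text):
        """Record a new version and return its index."""
        self._checkpoints.append(text)
        return len(self._checkpoints) - 1

    def diff(self, old, new):
        """Unified diff between two checkpoints, as a list of lines."""
        return list(difflib.unified_diff(
            self._checkpoints[old].splitlines(),
            self._checkpoints[new].splitlines(),
            lineterm="",
        ))

    def revert(self, index):
        # Assumption: reverting re-saves the old content as the newest
        # checkpoint, so nothing is deleted and no branch is created.
        return self.checkpoint(self._checkpoints[index])


history = LinearHistory("x = 1")
v1 = history.checkpoint("x = 1\ny = 2")
v2 = history.checkpoint("x = 1\ny = 3")
changes = history.diff(v1, v2)
history.revert(v1)
print(history.current)
```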

42:16 Michael Kennedy: That's pretty cool. And how do I, like, can I branch? Can I do stuff like that?

42:20 Adam Hood: No, it's completely linear, so no branching involved here.

42:23 Michael Kennedy: So right now, basically, it just, I can go forward or backwards in time for now, right?

42:28 Adam Hood: Yeah, can't really do branches or merges or anything like that.

42:32 Michael Kennedy: So one thing I'm noticing is I can export a Datalore file, which is kind of cool, right? So I theoretically could export the Datalore file and then check that into GitHub. That's one possibility. Right, do save points there. You can also export Jupyter Notebook format, the IPYNB format. Can you go the other way? Can I import?

42:54 Adam Hood: Yes.

42:56 Michael Kennedy: Like, if I create a new notebook, can I say, from this Jupyter Notebook, or from someone else's Datalore notebook?

43:02 Adam Hood: Yeah, I'm pretty sure you can. If you're in the file system view, there should be a button. It looks kind of like the button to create a workbook, but basically, it lets you import a workbook. It could have that Datalore format, or it could have the IPYNB, the IPython Notebook format. So yeah, we can import Jupyter notebooks into Datalore.
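
As background on what gets imported: an IPYNB file is a JSON document with a list of cells. The sketch below reads the code cells out of a minimal version 4 notebook using only the standard library. The notebook contents and the `code_cells` helper are made up for illustration, and nothing here reflects how Datalore's importer works.

```python
import json

# A minimal notebook in the nbformat 4 layout, built in memory to stand in
# for the contents of a .ipynb file.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Demo"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["import math\n", "print(math.pi)"],
        },
    ],
}
ipynb_text = json.dumps(notebook)


def code_cells(text):
    """Return the source of every code cell in a notebook JSON string."""
    nb = json.loads(text)
    sources = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            src = cell.get("source", "")
            # The format allows source as one string or a list of lines.
            sources.append("".join(src) if isinstance(src, list) else src)
    return sources


print(code_cells(ipynb_text))
```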

43:31 Michael Kennedy: I think that would be a good place, a good way for people who.

43:32 Adam Hood: Right.

43:33 Michael Kennedy: To be now doing Jupyter and just like, hey, what does it look like if I were to do the same work, but over here?

43:38 Adam Hood: Right, yeah. I think that could be an interesting idea to try. Instead of just trying to start from scratch, you try to take some established work from Jupyter and see what it looks like working with it in Datalore.

43:50 Michael Kennedy: Right, grab some interesting notebook off of GitHub and throw it in there. Okay, that sounds pretty good. I guess the final thing around Datalore would be where is it going? Like, it's pretty new, right? Like, it just went 1.0 not long ago, right?

44:06 Adam Hood: Yeah, I think in October.

44:08 Michael Kennedy: What is that, four months, something like that? So it's pretty new, and now that your app has met users, I'm sure you got some feedback. What are you thinking on working on next?

44:20 Adam Hood: I can't really talk too much about the big features that we're working on, or big new features that we're working on. Certainly one of the big things that we try to do is as we get feedback, we want to act on that feedback. We want users to get the sense that we actually care about what they're saying. I can't think of specific features that were suggested by users. I can certainly think of a lot of bugs that were reported, for example, and they get that extra urgency, because they're coming from the users, and we want to fix it so that they can continue to work with a tool.

44:47 Michael Kennedy: Mmhmm.

44:48 Adam Hood: So that's certainly a big part of the, I guess, philosophy or something.

44:51 Michael Kennedy: Okay. I did see in the blog post that announced the release of 1.0, there was a bunch of feedback in the comments around there, and one thing, I don't know how much my audience would care, but I know they were asking, people were asking, for Kotlin support, which Kotlin is a language JetBrains created that's like, a better Java. That's kind of the way I think of it. I don't know.

45:13 Adam Hood: That's the idea.

45:15 Michael Kennedy: So are you planning on adding other languages like Kotlin or JavaScript or R or whatever, or is it really focused on Python?

45:23 Adam Hood: For now, the focus is on Python, but the idea of supporting other languages, it's definitely something on our radar. We've been asked about I think R, Scala, Kotlin, probably a few others. So it's definitely something that's on our radar. I can't say whether or not we're working directly on it right now.

45:42 Michael Kennedy: Cool, well, it's definitely an interesting entry in this whole computational notebook space. I think the smart editor features sort of borrowed from PyCharm are really cool, I think the collaboration is really cool, so hopefully people find it an interesting thing to consider.

45:57 Adam Hood: I hope so, yeah.

45:57 Michael Kennedy: All right. So, I guess before I let you out of here, let me ask you the final two questions I always ask folks. So, if you're going to write some Python code, what editor would you use?

46:08 Adam Hood: I feel like I only have one correct answer. I'm going to go ahead and say PyCharm.

46:17 Michael Kennedy: Yeah, that sounds good.

46:18 Adam Hood: Technically, I'm using IntelliJ IDEA for all my Python editing, but it has the Python plugin, which gives me PyCharm, basically.

46:25 Michael Kennedy: Yeah, exactly. So, effectively, more or less, the same thing. And then, notable Python package that people might find interesting out there that maybe they don't know about?

46:35 Adam Hood: I don't know if I can really name something that a lot of people won't know about. I think my personal favorite, I really like SymPy, at least just the concept of SymPy is really neat to me, like the symbolic mathematics in Python.

46:49 Michael Kennedy: Yeah, that's really cool. I haven't used it. I've heard of it before. Maybe just tell like, really quick what it does?

46:55 Adam Hood: It's bringing symbolic mathematics into Python. Like, you can define variables and build mathematical formulas, equations, in SymPy.

47:03 Michael Kennedy: Right, so I could give it like, a quadratic equation, and say, solve for Y, or solve for X, and it'll express it in terms of, you know, things like that?

47:14 Adam Hood: It will be able to do that. I saw a demo which was pretty neat in which basically, you would define this partial differential equation, and it would build a numeric solver for it.
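
The quadratic case Michael describes can be sketched with SymPy's `symbols`, `Eq`, and `solve`. The specific equations are made up for illustration, and this does not attempt the partial differential equation demo Adam mentions.

```python
from sympy import Eq, solve, symbols

a, b, c, x = symbols("a b c x")

# Solve the general quadratic a*x**2 + b*x + c = 0 for x, symbolically.
general = solve(Eq(a * x**2 + b * x + c, 0), x)
print(general)  # two expressions in terms of a, b and c

# Solve a concrete quadratic: x**2 - 5*x + 6 = 0.
concrete = solve(Eq(x**2 - 5 * x + 6, 0), x)
print(concrete)
```

The first call returns the two roots expressed in terms of the coefficients, which is the "express it in terms of" behavior from the conversation; the second returns plain numbers.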

47:26 Michael Kennedy: Oh, wow, okay.

47:27 Adam Hood: I don't remember how that happened, but I remember seeing that tutorial, and I thought it was really neat.

47:33 Michael Kennedy: That's pretty cool. Basically all I remember from partial differential equations is that they were hard, and I would love a library to solve them for me.

47:39 Adam Hood: Yeah! So, yeah, I think for me, the appeal is mostly just like, the mathematician in me likes the idea.

47:47 Michael Kennedy: It sounds really cool, and it ties in well with all the other stuff that we're talking about here. Great, all right, well, final call to action. If people maybe are interested in Datalore, what do they do?

47:57 Adam Hood: Go to datalore.io, and create an account and check it out, if it sounds at all interesting. You can play around with the samples that we have. You can even upload your own notebooks, see what it's like to play with them in Datalore.

48:10 Michael Kennedy: Awesome. Well, Adam, thanks so much for being on the show and sharing your part of Datalore. It's cool.

48:15 Adam Hood: Thank you for having me.

48:16 Michael Kennedy: You bet, bye.

48:16 Adam Hood: Bye!

48:18 Michael Kennedy: This has been another episode of Talk Python to Me. Our guest on this episode was Adam Hood. It has been brought to you by Linode and brilliant.org. Linode is your go-to hosting for whatever you're building with Python. Get four months free at talkpython.fm/linode. That's L-I-N-O-D-E. Brilliant.org wants to help you level up your math and science through fun guided problem solving. Get started for free at talkpython.fm/brilliant. Want to level up your Python? If you're just getting started, try my Python Jumpstart by Building 10 Apps course, or, if you're looking for something more advanced, check out our new Async course that digs into all the different types of async programming you can do in Python, and of course, if you're interested in more than one of these, be sure to check out our everything bundle. It's like a subscription that never expires. Be sure to subscribe to the show. Open your favorite podcatcher and search for Python. We should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. This is your host Michael Kennedy. Thanks so much for listening. I really appreciate it. Now, get out there and write some Python code!
