#171: 1M Jupyter notebooks analyzed Transcript

Recorded on Friday, Jul 6, 2018.

00:00 Michael Kennedy: Jupyter Notebooks have transformed the way many developers and data scientists do their jobs. They offer a platform to not just explore but to explain data and computation. But how are they really being used? Adam Rule is here to describe his research and PhD dissertation which analyzed over one million Jupyter Notebooks found out in the wild. This is Talk Python to Me, Episode 171, recorded July 6th, 2018. Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm and follow the show on Twitter via @talkpython. This episode is brought to you by Linode and Studio 3T. Please check out what they're offering during their segments. It really helps support the show. Adam, welcome to Talk Python.

01:06 Adam Rule: Yeah, great to be with you. Thanks for inviting me on the show.

01:09 Michael Kennedy: Yeah, it's great to have you here. I'm really glad our mutual friend, Philip Guo, introduced us and suggested this show.

01:15 Adam Rule: Yeah, you've had a lot of great people on the show. I mean, Philip included, who I think has been a repeat on the show, which is great. Other folks in my area of data analysis or computational notebooks, Matthias from Project Jupyter and DJ Patil, one of the kind of godfathers in data analysis. So, it's an honor to be on here with you today.

01:36 Michael Kennedy: It's great to have you, and it was great to have those people as well. So, we're going to talk about Jupyter Notebooks in a super meta way. I had Matthias on to talk about notebooks as a technology, but we're going to talk about more how people work with notebooks and how it affects researchers and data science and stuff. Before we get to that, let's start with your story. How'd you get into programming, and what led you down this Python path?

02:01 Adam Rule: Oh, gosh. I mean, my original track was, I was an industrial engineer as an undergrad. And I thought, oh, I like people a little bit, but I also like science and math. And so, industrial engineering is this mix of both of those where you're trying to optimize production flows, but then you spend some time on the shop floor talking to people about, "How's this working for you? How can we make this process easier for you?" From there it's a somewhat circuitous path, but eventually took me to understanding, oh, not just designing manufacturing processes, but we can design devices, and products, and software for people. And gosh, designing software for people is really difficult, 'cause you can't just look at, measure how long is their arm and lay out the workspace that way. But you have to get into their head a little bit and try and figure out mental capacity, and how do people think through things? What's a workflow look like for that? Not just how I work a workpiece through a factory. So, with that, I got into programming largely by studying how people use software, needing to do some of my own programming to build software or add-ons to really look deeply at how is it that people are using these technologies, and how can we make them easier to use or just do more powerful things and augmenting what we can learn about how people work in their capacity.

03:21 Michael Kennedy: Oh, that's really cool. It's super interesting how this sort of engineering path led you over here. I think industrial engineering plus software would be a really cool space to work, because there's just so many gadgets and things you could design with just a little bit of smarts and a little bit of programming that would be amazing. That's a skill I don't have.

03:41 Adam Rule: Yeah, yeah. I know that kind of fusion of the two worlds would be really fascinating.

03:45 Michael Kennedy: I think it's a little ironic one of the things I really like about programming is you actually get to build stuff. You get to build stuff that is conceptual but it runs. And I'm speaking coming from a math perspective. Whereas, prove this cool theorem about topological spaces, I guess I'll go look at another theorem. You don't actually have an outcome. Here, even though it's sort of virtual, you still get to build things. But this would take it to another level. Awesome, so you started writing software to kind of understand and study how people interact with it which started you down this meta path. Right?

04:18 Adam Rule: Yeah, yeah, I know I think today, as we get further on in the talk, it'll be very meta as we talk about studying how people use tools like Jupyter Notebooks, by in turn using Jupyter Notebooks to analyze a dataset about people using it. It kind of becomes Turtles all the way down.

04:34 Michael Kennedy: Yeah, I was just thinking there's Turtles all the way down. Absolutely. Yeah, so what led you into notebooks and Python? You could've, say, used C++ to understand how people use software. Would take you longer, but you could've done it.

04:47 Adam Rule: Yeah, yeah, no. And this may be going back a little bit too far, but I was really fascinated by healthcare and how physicians use medical records to track and document their work. It's this really data driven domain. But it's really hard to get into. You can't really say, oh, let me just hack on your enterprise software system in the hospital. I just want to tweak it in this way and see if that makes it easier to care for patients. One, the software systems are super complex and regulated, and you have patient risk there as well. And so, I turned and looked at another very data driven domain, data analysis. And honestly, it's just by hearsay of so many people saying, "Hey, have you checked out these notebooks? They're fantastic, I've been using them for months, or years now for doing analysis. They're pretty amazing, you should really look at this." And so, where I was at UC San Diego, a bunch of people were using these tools for very basic biology, neuroscience research. So, that got me into looking at how people are using these tools to track and talk about very complex data driven work.

05:57 Michael Kennedy: Yeah, that's awesome. So, maybe that's a good place to segue into what you've been doing recently. So, you said you were at UC San Diego, a very nice school down in San Diego. And you just finished your PhD as a Human Computer Interaction researcher, which is a sub-portion of, say cognitive science. I was really surprised how much computer programming and software is involved in cognitive science more broadly.

06:24 Adam Rule: Yeah, no, cognitive science is an interesting field that's kind of a fusion of psychology and computer science. And you go back to some of the early days looking at folks like Herb Simon and others at Carnegie Mellon, and around the world. And they were playing in both fields of developing and testing a lot of software. Trying to figure out can we model how the brain works. And then there's kind of the reverse transition of people who will look at the brain and use that to try and figure out how can we build more efficient algorithms or computer systems with neuromorphic computing. Yeah, I just finished my PhD a month ago, actually. So, still fresh off of that.

07:03 Michael Kennedy: Are you just super relaxed now?

07:08 Adam Rule: I am. I was going to say I'm almost in this academic sabbatical period where we still have funding and I'm still continuing to do some of the research that we'll talk about today. But due to my wife's job, I moved to a different city. I'm now up in Portland staying on the West Coast, but very different in terms of sunshine hours from San Diego.

07:29 Michael Kennedy: Less sunshine, more green.

07:31 Adam Rule: More green, which is great. Yeah, when I moved to San Diego from Seattle, when I was in Seattle really loved the lush green. And so, San Diego has many benefits. The surfing, the burritos, the sunshine. But it lacks in green. So it's good to be back.

07:47 Michael Kennedy: Nice. So, you still are finishing up this research a little bit that we're going to be talking about. And then it's time to hit the real world. Are you thinking academics? Are you thinking industry? Where are you headed?

07:59 Adam Rule: I'm thinking industry at this point, just to try my hand at something slightly different and see what that world is like. Gotten a good dose of academia for the last five years of PhD and two years of master's before that.

08:12 Michael Kennedy: That's a healthy dose.

08:13 Adam Rule: What the working world is like.

08:14 Michael Kennedy: Awesome. All right, so maybe we should start by talking a little bit broadly about what human computer interaction is.

08:22 Adam Rule: I think Philip has covered some of this as well, 'cause he researches really similar topics. It's really studying the design and use of computer technology. So, how do people use current technologies and how do certain aspects of the design make it easier to use for a particular task? As I was saying with cognitive science, it's really a mix of human computer interaction, the subfield is really a mix of computer science and social science. So, some days I'm building software, other days I'm testing it with people, other days I'm just sitting and observing how people use it or don't use it during their tasks. So, it's this flopping between programming and observing and social science, more anthropological skills, that's a lot of fun and goes back to my industrial engineering days of the math and science and the satisfaction of building things, and then the flip side, trying to understand people as well.

09:14 Michael Kennedy: Oh, yeah, it's a really interesting mix. Are folks in that area starting to think about how artificial intelligence is changing this? Things like the Amazon Assistant or the Google Assistant, and stuff like that?

09:27 Adam Rule: There's a bunch of research in that area. I think some of the folks who are farthest out ahead on that are those in the human robot interaction field. Because they've had to think for a while about how are people going to interact with robots, and reason about how is this computer device reasoning about things and making decisions, and should I trust that, or does it not have access to all the information I do. All these things we do very naturally with other humans of like, oh, they don't see that car coming 'cause they're looking this other way. I should let them know. It's harder to do that with computer systems where you're less sure about what are the inputs, what's the processing, what are the outputs going on.

10:04 Michael Kennedy: That's really interesting. It's going to really become more and more so over time, isn't it?

10:08 Adam Rule: Yeah. And then all this work on machine learning interpretability, how are you going to be able to interpret a decision that came out? I know there's work going on in that, and even here in Portland. The next HCI researcher meetup in the area's focused on machine learning and how do we design for this and help people understand what's going on? So, I think both in academia and industry it's a big deal right now.

10:30 Michael Kennedy: Oh, yeah, that's awesome. It sounds like people who want research projects, I suspect that's a good place to focus.

10:35 Adam Rule: A good place to look.

10:36 Michael Kennedy: Yeah, for sure. You decided to focus a little more meta. You wanted to focus on these computational notebooks, which Jupyter is one of. Let's set the landscape. So, in the Python space we hear Jupyter, Jupyter, Jupyter, oh, JupyterLab is slightly better. JupyterLab, and then that's about it. But there's actually a slightly broader view of these things in the history as well, right?

10:59 Adam Rule: So, there's actually a really interesting article in the Atlantic that was coming out and saying, oh, the academic paper is dead, computational notebooks are going to replace them. A lot of that talks about Jupyter Notebooks, but it goes into some of the history of notebook platforms back to really Mathematica is the one that's often credited with being one of the first environments where you could have this literate programming back and forth, typing and running small scripts in a specific language to analyze data or ask questions. So, that was back in the 80s, and there were academic systems like Maple that were in schools in the 90s and 2000s, but it's been in the last couple.

11:39 Michael Kennedy: I remember using Maple. That thing was magic.

11:42 Adam Rule: Yeah, no, I remember using it as well. And I haven't really seen it much outside of the educational context. I don't know if that's just their niche or what. So, it's been around for a while, but it's often been locked away in proprietary software that you had to pay a big license fee for. And so, it's really in the last decade or so, and really the last five years or so that platforms like Jupyter Notebook or RStudio have been providing these open source, and in some cases like Jupyter, free environments for using a notebook like interface to play with data.

12:16 Michael Kennedy: Yeah, what's your take on, when I was working on my master's degree and my PhD and stuff, which I didn't get my PhD, but I did get my masters degree, but anyway, when I was working on that, I was using MATLAB and stuff, and we were doing things like wavelet decomposition. Which I'm pretty sure the license, this is like an add on to MATLAB, it was like $2,000 additional dollars per person that was using it. That's completely insane. And then here comes Jupyter and whatnot and going, "Oh, actually ours is free. Why don't you try that?" That's a big effect, right?

12:52 Adam Rule: Yeah. And thinking about there's a huge push in science for open science and not just sharing your results, sharing the data, but then sharing also your code and how you arrived at that. And the fascinating thing is so many researchers are making, you know, you talked about the wavelet decomposition, they're making their own packages or libraries, especially in the Python ecosystem, and sharing them openly for others to use and download. And that's really seemed to move up the stack to not just be the packages but the language. You know, Python should be open and free. And the environments, the development environments like Jupyter should be open and free.

13:33 Michael Kennedy: Do you see the zen of open source from the scientists interacting with the software? Like flowing into science directly in the sense that people are sort of changing the way they see other stuff by virtue of having these open source experiences.

13:48 Adam Rule: I think so. And a couple points on that. Some of the folks that I talk to and we get to some of the research that I've done, talked about this really strong obligation they felt to make things open source to make them reproducible. And it was almost a religious zeal of like this is just how things should be done. And I hadn't seen that really before in academia, in other contexts. And I think one of the other interesting things I saw is so many labs where half the lab might be wet lab biology folks who are running experiments and creating slices and staining them, using microscopes. And the other half just looks like a startup or a software development group who are doing code reviews on a weekly basis.

14:32 Michael Kennedy: They have a GitHub repository maybe.

14:34 Adam Rule: GitHub repositories, talking about versioning, remote workers calling in from across the world and seemed very much like a software company. But they're in a research lab doing fundamental biology research.

14:49 Michael Kennedy: Yeah, I really do see this computer science skill starting to permeate a lot more stuff. Not just so we have more programmers, but so that, like what you described, these biologists can take some of these software ideas and just really amplify what they're doing in the lab.

15:06 Adam Rule: Yeah, absolutely.

15:09 Michael Kennedy: This portion of Talk Python to Me is brought to you by Linode. Are you looking for bulletproof hosting that's fast, simple, and incredibly affordable? Look past that book store and check out Linode at talkpython.fm/linode. That's L I N O D E. Plans start at just $5 a month for a dedicated server with a gig of ram. They have 10 data centers across the globe, so no matter where you are, there's a data center near you. Whether you want to run your Python web app, host a private Git server or file server, you'll get native SSDs on all the machines, a newly upgraded 200 gigabit network, 24/7 friendly support, even on holidays, and a seven day money back guarantee. Do you need a little help with your infrastructure? They even offer professional services to help you get started with architecture, migrations, and more. Get a dedicated server for free for the next four months. Just visit talkpython.fm/linode. So, I did sort of cut you off a little bit when you were talking about this history. And so we've had Mathematica and Maple and MATLAB, and then we've got Jupyter and RStudio. But then there's a bunch of hosted ones as well that maybe some people have heard of. But there's actually a ton of variety out there for even just getting these and running them online.

16:21 Adam Rule: Yeah, so even just sticking with Jupyter, one of the great things about Jupyter is they have the notebook itself, kind of its front end. But I think some of the more lasting impact from Jupyter might be just the standards that they set, about how are we going to send messages back and forth to a kernel, what's the notebook format, and the very specific JSON structure. So, in a way, they're almost like a standards setter like World Wide Web Consortium, HTML or other things. This is just how scientific computing or data analysis should be documented and shared. And so a number of other groups have come and built on top of Jupyter and those standards. So, like Google Colaboratory is like a Google Docs version of Jupyter Notebooks. Microsoft's Azure notebooks on their Azure Cloud. And then even Sage notebooks. Sage was this project and specific language kind of like Mathematica for doing data analysis and mathematics. And they've now switched over.

17:21 Michael Kennedy: That's the Sage from William Stein up in Seattle, right?

17:26 Adam Rule: Yep, that's exactly right. So, now they have a whole notebook infrastructure, and each of these have made add-ons and different history features or profiling tools that are slightly different from Jupyter Notebooks or JupyterLab. But they're all essentially built on top of Jupyter.

17:42 Michael Kennedy: Yeah, and there's some really interesting cloud computing tie-ins. Like, I don't know Azure and Sage well enough, but I know on Google Colaboratory you hit a hotkey to run a cell, and you hit a slightly different hotkey to run the cell on a GPU. It's like crazy, right?

17:58 Adam Rule: Yeah, and I think that comes into business models of how you're going to monetize these things. Jupyter has this interesting model of being open source and funded by academic time and grants. Whereas others are saying, "Well, We'll provide this software for free, but if you want white glove support or to run it on our cluster, then that'll cost you at that point." So, it's free for smaller uses in educational context. We'll provide the compute infrastructure.

18:25 Michael Kennedy: Ah, that's pretty interesting. There's also a JavaScript one, right?

18:27 Adam Rule: Yeah, so this kind of notion of computational notebooks is now spreading into the web world where ObservableHQ, so Mike Bostock, the mind behind the D3 Data Visualization Library, now has a company that's making a completely on the web computational notebook. So, writing JavaScript code to analyze data. And then Mozilla actually is starting up a project called Iodide that's looking at this. And what's fascinating about these is you can write code not only to analyze the data but to directly manipulate the notebook itself. And so, you get some really fascinating views on data or dashboards where something that you wrote in a cell just changes the complete layout or operation of the notebook itself. So, it mixes the programming for data analysis and then programming to change the tool that you're using to do the data analysis all in the same...

19:22 Michael Kennedy: How interesting. Yeah, because you can reprogram the notebook with the notebook.

19:27 Adam Rule: Yep, so it makes it easy to do things that have been difficult to do in, say Jupyter Notebook, to create widgets. Where if I want a slider that can then change some parameter and a visualization I have, I can just call whatever that DIV is, 'cause I'm programming in JavaScript and select it, tell it to update on this callback.

19:46 Michael Kennedy: Yeah, I definitely see that having an advantage in that integration, as well as that runs on the client browser. So, the hosting cloud side of it is like, you could probably do that on a $5 server. You know what I mean? Like you could host an incredible amount of computation. 'Cause you're just serving up the files more or less, right?

20:04 Adam Rule: Yep. So, all these really interesting models all around this idea of a notebook infrastructure where you incrementally write a few lines of code to analyze data and then get the outputs printed right in line.

20:15 Michael Kennedy: One weakness, I guess I would call it, that I see with a JavaScript computational environment is JavaScript has poor numerical support. It's like integers, for example, are super hard because integers.

20:29 Adam Rule: And I'm not sure, I mean, even beyond that, just the infrastructure, you know, there's all sorts of great infrastructure, especially in Python and R for data manipulation, and cleaning, and analysis. I mean, those libraries I don't think are quite there for JavaScript.

20:45 Michael Kennedy: Right, where's the Pandas for JavaScript? Does it exist? It may. I don't know, actually.

20:50 Adam Rule: Or the Tidyverse. Again, it may exist, but it certainly doesn't have, I think, as many folks working on or using it as you do in the Python or the R worlds.

20:58 Michael Kennedy: Yeah, it's interesting that it exists though. Okay, that sort of sets the stage of this world. And your PhD goal was to go and study that world and actually understand how people use these computational notebooks and if they're really fulfilling their promise of becoming like a computational narrative, or what are people doing, right? So, maybe tell us more about your research.

21:23 Adam Rule: There is this Atlantic article that came out and made this declaration that the science...

21:27 Michael Kennedy: I want to encourage people to check out that article. We're going to put it in the show notes. I think I've talked about this before on Python Bytes, my other podcast. But anyway, it's really provocative, right? There's a scientific paper that's literally on fire, like an animated fire, on the homepage. It's quite something, and it's a big change.

21:48 Adam Rule: Yeah, I know, it makes some bold claims. It roasts Stephen Wolfram quite a bit. And then it goes into the history of Mathematica and also Python, and these different views of a closed type ecosystem like Mathematica where you make your own language, or an open one like the Python community, where anybody can write a library and contribute. Yeah, so one of the things in this article, and really this just reflects some of the zeitgeist right now, is this notion that, well, in the future we're no longer going to be sending around scientific results just in a dead PDF, because that doesn't give you all the information you need to reproduce science. You know, the analyses we're doing today are so complex that you can't just read a three sentence description of them and know what to do.

22:35 Michael Kennedy: Yeah, I think I've heard somewhere some kind of quote saying your academic paper that you write about your computational results is like advertising for that computational result, but it's not actually the research. The software is the research in that sense. So, why are these two separate things?

22:53 Adam Rule: Yeah, no, I think you had some talks with folks behind Journal for Open Source Software.

22:57 Michael Kennedy: Yeah, I think that's where that came from is that one, yeah.

22:59 Adam Rule: Yeah, and so much of the work and research now is software development. Whether it's for building a package to do a certain type of analysis or that particular analysis itself. You know, this article and others are saying really what we need is a new medium where you can share all of the code used to run the analysis, because so much of the analysis is now happening via programming, not in Excel, not with SPSS or other packages for stats. But if you just share the code, that's really not understandable either. You know, we all hope to comment it well. And so what you need is this mixed narrative, and that's what notebooks give. You can write a line of markdown text that explains what the notebook's doing, and then you can just build up your argument and show how you collected, analyzed, and did this data. And so, a lot of folks are saying, hey, scientific paper is dead, notebooks are the new medium. And look, millions of people are using them. We really wondered, is that the case? Is the scientific paper dead? How are people using these notebooks? 'Cause despite being around for decades and having millions of users, we know very little about how people actually use them in their day to day work. So, we set out to kind of understand better how are people actually using these things? Is there much rich computational narrative? You know, we read these things and understand the analysis. Or are people really just using them because they're a nice iterative development environment? And so, that's kind of what got us started down this path.

24:25 Michael Kennedy: Right, is it a place where you load some data and you just sort of iteratively explore it and interact with it? Or are you actually trying to put something like a paper together. Like the next version of that.

24:38 Adam Rule: Those are really compelling reasons to use notebooks. Having this real tight REPL that will let you iterate and just explore some data or having it be this really well curated explanation that you can share with others.

24:52 Michael Kennedy: Yeah, this idea of a REPL is really handy and nice. But sometimes it's just super hard to go back to what you want. You know, you're like, well, it's 20 things back, and a bunch of them are like 10 lines long. It's not very nice to interact with it, say off of the terminal in some situations. But this is perfect, right? You can go back and read it and just jump to where you want by touching it. It's great. So, how did you get your data? How did you find these notebooks to study? Just go ask a couple people you know? Or what'd you do?

25:24 Adam Rule: The first part of the research that we did was trying to just get a big sample of notebooks. So, we said one way that we can tackle this problem is just try and get a bunch of notebooks and look at them and see what the content is. So, we ended up actually scraping all of the Jupyter Notebooks that were on GitHub at the time of the study, which was about a year ago.

25:43 Michael Kennedy: How many is that?

25:44 Adam Rule: So, that was a little over a million, about one and a quarter million notebooks they had on there, which that was a fun process of working around rate limiting to get that to work.

25:55 Michael Kennedy: Yeah, so tell us how do you do that? I mean, was that GitHub API, was that web scraping? What was the flow there?

26:02 Adam Rule: It was a mix of both. GitHub has a great and well documented API. But in order to do what we wanted to do, we had to abuse it a little bit. They don't really want you looking for just one specific file type. So, you can't really just search and say, show me all the files of this type and download all of them. You need to give it some other parameter set. So, we actually had to go and say give me all the Jupyter Notebooks between zero and 100 kilobytes, okay, between 100 and 200, and kind of iterate through them that way. Both to get a list of all the notebooks and because they limit, we're only going to send you 1,000 results. So, even if you can look and see that there's a million of these, we'll only send you the first thousand. So, we had to restrict our query down to get it in packets that were small enough.
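
The size-window trick described here can be sketched as a recursive split: keep halving a file-size range until each window matches no more than the 1,000 results the search will return. This is only an illustration, not the code used in the study. `count_in_window` is a hypothetical callback standing in for whatever request reports the number of matches, and the `extension:` and `size:` qualifier syntax in `to_query` is an assumption about the search string.

```python
from typing import Callable, List, Tuple

MAX_RESULTS = 1000  # per-query result cap mentioned in the episode


def split_size_windows(
    lo: int,
    hi: int,
    count_in_window: Callable[[int, int], int],
) -> List[Tuple[int, int]]:
    """Split the size range [lo, hi] into windows matching at most MAX_RESULTS files."""
    total = count_in_window(lo, hi)
    # Stop when the window is small enough, or cannot be split any further.
    if total <= MAX_RESULTS or lo == hi:
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return (
        split_size_windows(lo, mid, count_in_window)
        + split_size_windows(mid + 1, hi, count_in_window)
    )


def to_query(lo: int, hi: int) -> str:
    """Build one search string per window (qualifier syntax is an assumption)."""
    return f"extension:ipynb size:{lo}..{hi}"
```

Each resulting window can then be paged through in full, since none of them exceeds the cap unless a single size value alone holds more than 1,000 files.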

26:50 Michael Kennedy: I see. You had to come up with an arbitrary filter criterion that would get it below 1,000? Okay.

26:56 Adam Rule: Yep. And then from there it's just a lot of learning how to be a good citizen with GitHub servers and respect when they say, "Okay, you're making too many queries, and slow down." It took a couple weeks, but we eventually got the full dataset that way. And then afterwards, we had to do some web scraping to get the files themselves. So, we essentially, first got a list of what are all the notebooks and where are they. And then used some web scraping to get the files themselves.
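
Being a "good citizen" with a rate-limited server usually comes down to waiting when asked and retrying later. The sketch below is a generic pattern under stated assumptions, not the study's scraper: `fetch` is a hypothetical callable returning a `(status, headers, body)` tuple, throttling is assumed to show up as a 403 or 429 status, and a `Retry-After` header is used when present, with exponential backoff otherwise.

```python
import time
from typing import Callable, Dict, Tuple

Response = Tuple[int, Dict[str, str], str]  # (status code, headers, body)


def fetch_politely(
    fetch: Callable[[], Response],
    sleep: Callable[[float], None] = time.sleep,
    max_tries: int = 5,
    base_delay: float = 1.0,
) -> str:
    """Call fetch() until it succeeds, pausing after every throttled reply."""
    for attempt in range(max_tries):
        status, headers, body = fetch()
        if status == 200:
            return body
        if status not in (403, 429):  # assumed throttling codes
            raise RuntimeError(f"unexpected status {status}")
        # Prefer the server's own hint; otherwise back off exponentially.
        delay = float(headers.get("Retry-After", base_delay * 2 ** attempt))
        sleep(delay)
    raise RuntimeError("still throttled after all retries")
```

Passing `sleep` in as a parameter keeps the function easy to test without actually waiting.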

27:22 Michael Kennedy: Yeah, that's pretty interesting. It only took a couple weeks. I mean, on one hand that's a long time. On the other, you've gone out and gathered all of these million notebooks from all these sources. That's pretty amazing, actually.

27:35 Adam Rule: I think it's fascinating that GitHub provides the tools for us to be able to do something like this. I mean, really only asks us in return that we make the data and any publications open afterwards. That you can use their API to really study how people all over the world are using a tool set. So, it's a great way to get a massive and diverse sample size.

27:58 Michael Kennedy: Yeah, I think GitHub is kind of a special place. I mean, there's lots of source control, and versioning, and issue tracking places in the world. But GitHub stands alone in its sort of reach, in just people using it and so on. How do you go about studying it? You're studying Jupyter Notebooks, did you actually use Jupyter Notebooks to study your Jupyter Notebooks?

28:21 Adam Rule: Yeah, so this is where it gets super meta. So, we download this whole dataset. And for those who don't know, Jupyter Notebooks themselves are really just a JSON file. They have a different ending, .ipynb, but under the hood it's just a JSON file. So, we essentially have a million JSON files to look at. So, we just spun up our own Jupyter Notebooks, imported that dataset, started making some data frames using the Python ecosystem. And so, analyzed it that way. So, in the end it's come full circle. 'Cause the notebooks that I used for my analysis of notebooks on GitHub are now hosted on GitHub. Completely available.
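
Because a notebook is plain JSON, the standard library is enough to look inside one. A minimal sketch, assuming the nbformat 4 layout (a top-level `cells` list whose entries carry `cell_type` and `source`). The sample notebook is invented for illustration and does not come from the study's dataset.

```python
import json
from collections import Counter

# A tiny stand-in for the text of an .ipynb file (nbformat 4 layout assumed).
RAW_NOTEBOOK = """
{
  "nbformat": 4,
  "nbformat_minor": 2,
  "metadata": {},
  "cells": [
    {"cell_type": "markdown", "metadata": {},
     "source": ["# Load data\\n", "Read the CSV."]},
    {"cell_type": "code", "metadata": {}, "execution_count": 1, "outputs": [],
     "source": ["import pandas as pd"]},
    {"cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [],
     "source": []}
  ]
}
"""


def cell_type_counts(raw: str) -> Counter:
    """Parse notebook JSON and count how many cells of each type it contains."""
    notebook = json.loads(raw)
    return Counter(cell["cell_type"] for cell in notebook.get("cells", []))


counts = cell_type_counts(RAW_NOTEBOOK)
print(dict(counts))
```

Running the same function over many files would give one row of counts per notebook, which is the kind of table that loads naturally into a data frame.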

29:00 Michael Kennedy: So, you'll be a dataset in the followup replication study? Is that what you're saying? Yeah, how interesting. This portion of Talk Python to Me is brought to you by Studio 3T, the IDE for MongoDB. NoSQL databases offer maximum flexibility. But what if you could combine the benefits of MongoDB with the benefits of SQL? With Studio 3T you can. With their innovative SQL query feature, you can write SQL joins and expressions to query MongoDB. And the best part is you get to see how your SQL queries translate to MongoDB's native query syntax with the click of a button. You can create MongoDB queries, aggregation statements, and SQL queries. And 3T's novel query code will automatically generate code for you in a variety of languages like Python, JavaScript, and even C#. Studio 3T also offers the richest coding experience with its full featured IntelliShell. It's the built in MongoDB shell interface with smart auto completion of collection names, shell methods, document key names, operators, and field names. By using in place editing within a collection, it's even easier to edit your documents. Try Studio 3T and see why it's used by Fortune 500 companies like Nike, Tesla, Formula 1, Comcast, and many more, saving enterprise users countless hours of development time. Visit talkpython.fm/studio to get a free one month trial. That's talkpython.fm/studio. What are some of the findings or the things you got by studying these? We talked about this tension between an iterative REPL and then an explanatory narrative or computational narrative. What'd you find?

30:37 Adam Rule: One of the big headlines from this bit of research was that very few of the notebooks had what you might consider the baseline requirement for a rich narrative, which is just any explanatory text. So, over a quarter of the notebooks had no markdown text at all. So, they were just code or code blocks. And then even of those that had text, they were pretty short. So, the median was about 150 words. So, just a really short blurb.
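
The two measures mentioned here, the share of notebooks with no markdown text and the median length of the text that is there, could be computed along these lines. This is a sketch on three toy notebooks, not the study's code or data; `cell_text` and `markdown_word_count` are hypothetical helpers, and words are counted by simple whitespace splitting.

```python
import statistics


def cell_text(cell):
    """Return a cell's source as one string (it may be stored as a list of lines)."""
    src = cell.get("source", "")
    return "".join(src) if isinstance(src, list) else src


def markdown_word_count(nb):
    """Count whitespace-separated words across all markdown cells of a notebook."""
    return sum(
        len(cell_text(cell).split())
        for cell in nb.get("cells", [])
        if cell.get("cell_type") == "markdown"
    )


# Toy notebooks, already parsed from JSON (illustrative content only).
notebooks = [
    {"cells": [{"cell_type": "code", "source": "x = 1"}]},
    {"cells": [
        {"cell_type": "markdown", "source": ["Import data\n", "then model it"]},
        {"cell_type": "code", "source": "x = 1"},
    ]},
    {"cells": [{"cell_type": "markdown", "source": "A longer note about the analysis"}]},
]

word_counts = [markdown_word_count(nb) for nb in notebooks]

# Share of notebooks with no markdown text at all.
no_text_share = sum(1 for n in word_counts if n == 0) / len(word_counts)

# Median word count among the notebooks that do have some text.
median_words = statistics.median([n for n in word_counts if n > 0])

print(word_counts, no_text_share, median_words)
```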

31:04 Michael Kennedy: So, maybe those blurbs are almost like comments, they just are in text cells instead of in code with a hash?

31:09 Adam Rule: Yeah, yeah, and they'd often be like, okay, import data, model data, really descriptive of the steps. And I think for us, not that this is a bad use of the notebook, but more they're being used as an interactive programming environment with some light loose notes rather than this view of, oh, there's this really rich description like a scientific paper of what people did. I think there are a host of reasons for that. But that was one of the major findings for us from this.

31:39 Michael Kennedy: Yeah, so how much of this is just people happen to be using and storing their notebooks on GitHub, versus they intend other people to consume those?

31:49 Adam Rule: Yeah, you know, it's hard to say for sure, but we think a lot of it is just, we're going to throw it up as a repository for myself, and I'm not really expecting others to use it. We did an analysis where we actually looked at the descriptions of the GitHub repos where these notebooks lived, so how are people describing these projects, and looked for keywords. When we removed things like notebook or GitHub from that keyword search, the top words are things like machine learning, Kaggle, Udacity, Nanodegree. And so that really showed us that a lot of these seem to be people learning how to do data analysis, learning machine learning in particular, and doing online education assignments, and then hosting their results up online. Whether that's a form of submission, or for a resume, or a portfolio building exercise. A lot of these seemed to be educational.
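
A keyword count over repository descriptions, with terms like "notebook" and "GitHub" filtered out, might look roughly like this. It is a sketch only: the descriptions are invented, the `IGNORED` word list is an assumption, and single words are counted rather than phrases such as "machine learning".

```python
import re
from collections import Counter

# Words to leave out of the ranking: the obvious terms plus a few common
# English words. This list is an assumption made for the example.
IGNORED = {"notebook", "notebooks", "github", "jupyter",
           "a", "the", "for", "my", "of", "and", "in", "with"}


def top_keywords(descriptions, n=3):
    """Return the n most common words across descriptions, skipping IGNORED."""
    counts = Counter()
    for text in descriptions:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in IGNORED)
    return counts.most_common(n)


# Invented repository descriptions, for illustration only.
descriptions = [
    "Jupyter notebooks for the Udacity machine learning nanodegree",
    "My Kaggle machine learning experiments",
    "Notebook for a Udacity deep learning course",
]

print(top_keywords(descriptions))
```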

32:45 Michael Kennedy: Interesting. I can see a lot of people who are students taking courses at a university or something, and their professor says, "All right, what we're going to do is going to create a repo for your course, and everybody's going to put their assignment and just share it with me or make it public." or something.

32:59 Adam Rule: Our search excluded forked repos and forked notebooks. So, at least one form of distribution like that should've been excluded. But, yeah, still think a lot of this is course assignment like that.

33:11 Michael Kennedy: Did you do any refining where you say, well, let's look at repositories that have over 1,000 stars or a lot of followers, or just the ones that are not clearly private?

33:23 Adam Rule: We tried to look for ones that seem to get reused a lot. In fact, the motivation for us was, well, let's see if we can find best practices in notebooks. If we can find ones that are repositories that were starred a lot or forked a lot, maybe that means that they were really useful, and we can kind of glean some best practices on notebook design from this.

33:46 Michael Kennedy: Right, maybe even a lot of PRs, they're getting polished.

33:48 Adam Rule: Exactly. And one of the things we found when we tried to do that was, many of these notebooks that were in highly starred or forked repositories, were just tutorials for various software packages. So, as an example, it could've been something like, here's Pandas up on GitHub, and then here's the notebook as documentation showing how to use Pandas. But the reason why this repository is so starred and forked is that people really like using Pandas, not because the notebooks themselves were all that insightful.

34:20 Michael Kennedy: Yeah, that's interesting. I guess it is a really nice way to have code mixed with description on GitHub, because GitHub renders and executes those now.

34:29 Adam Rule: Yeah, and that's one of the things, as we were initially doing this research that people said why they're using Jupyter Notebook and why they're putting it on GitHub is, hey, a manager that I have, they don't really want to install the software and set up an environment. But I can just send them a link, and they can see the notebooks statically, and that's a really nice way to share results.

34:47 Michael Kennedy: Yeah, that's great. Even managers can run web browsers.

34:51 Adam Rule: Yeah. I think one of the other interesting things that we looked at was, testament to the Python ecosystem, we said okay, what are the packages that people are importing? And just finding the vast majority, around 90% or so of these notebooks are importing external packages. Things like Pandas, NumPy, Matplotlib, were imported in two thirds or more of them. So, just the data science infrastructure that's being provided is a really core component. It's not just having the notebook like Jupyter. It's having the Python ecosystem to be able to do it.
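
One way to find out which packages a notebook imports is to parse the source of its code cells with Python's `ast` module. The sketch below assumes each notebook's code has already been joined into one string; `imported_packages` is a hypothetical helper, and the approach is an assumption, not necessarily how the study did it. IPython magics such as `%matplotlib inline` are not valid Python, so those lines are dropped before parsing.

```python
import ast
from collections import Counter


def imported_packages(code):
    """Return the top-level package names imported by a string of Python code."""
    # Drop IPython magics (%...) and shell escapes (!...), which ast cannot parse.
    lines = [ln for ln in code.splitlines()
             if not ln.lstrip().startswith(("%", "!"))]
    try:
        tree = ast.parse("\n".join(lines))
    except SyntaxError:
        return set()  # unparseable code contributes no imports in this sketch
    packages = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            packages.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            packages.add(node.module.split(".")[0])
    return packages


# One string of code per notebook (invented examples).
notebook_code = [
    "%matplotlib inline\nimport pandas as pd\nimport numpy as np\n"
    "import matplotlib.pyplot as plt",
    "import numpy as np\nfrom sklearn.linear_model import LinearRegression",
    "print('hello')",
]

# Count in how many notebooks each package appears.
usage = Counter()
for code in notebook_code:
    usage.update(imported_packages(code))

share = {pkg: n / len(notebook_code) for pkg, n in usage.items()}
print(usage.most_common())
```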

35:25 Michael Kennedy: It's the foundation. Yeah, that's really awesome. What about the things like R and JavaScript and stuff, were you able to figure out how much is Python, how much is other stuff?

35:34 Adam Rule: The notebooks themselves have a tag for what the language is in there. And the vast majority of those, like 96%, were written in Python. Whereas things like R and Julia, which is why Jupyter's named Jupyter, it's a combination of Julia, Python, and R, each accounted for about a percent, and then there was a long tail of other languages that were supported. But by and large, it was Python. And it kind of surprised me.
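
Reading that language tag could look something like the following. The sketch assumes nbformat 4 style metadata, where the language usually appears under `metadata.language_info.name` or `metadata.kernelspec.language`; older notebook formats store it differently and are not handled here. The sample metadata is invented and `notebook_language` is a hypothetical helper.

```python
from collections import Counter


def notebook_language(nb):
    """Return the lower-cased language recorded in a parsed notebook's metadata."""
    meta = nb.get("metadata", {})
    lang = (meta.get("language_info", {}).get("name")
            or meta.get("kernelspec", {}).get("language"))
    return (lang or "unknown").lower()


# Parsed notebooks reduced to their metadata (illustrative only).
notebooks = [
    {"metadata": {"language_info": {"name": "python"}}},
    {"metadata": {"kernelspec": {"name": "python3", "language": "python"}}},
    {"metadata": {"kernelspec": {"name": "ir", "language": "R"}}},
    {"metadata": {"language_info": {"name": "julia"}}},
    {"metadata": {}},
]

language_counts = Counter(notebook_language(nb) for nb in notebooks)
python_share = language_counts["python"] / len(notebooks)
print(language_counts, python_share)
```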

36:01 Michael Kennedy: 96%, that's actually higher than I would've even guessed, to be honest.

36:06 Adam Rule: Again, I don't know if it's because so many of these are educational, and that's just a good language to teach in, or what the reason may be. But it still is strongly reflecting its IPython roots of being kind of a Python first environment.

36:22 Michael Kennedy: For sure, it sounds like it. So, were you able to find a subset of what you might call academic papers, or narrative type things, and analyze those?

36:31 Adam Rule: That leads into the second line of work we did on this is, okay, pulled down all of these notebooks, we've looked at them. Very few of them have this rich description and seemed to be more just using notebooks as a nice iterative environment. But what if we're just looking at the wrong subset of notebooks? Many of these seem to be for education. It may be that people are just hosting these for themselves.

36:53 Michael Kennedy: Yeah, you probably don't want to look at a notebook that a student has known programming for 10 days as a how should we do things.

37:01 Adam Rule: Exactly, that's exactly it. So, like okay, let's be a little humble with the limitations of the dataset and our assumptions of it. So, we ended up saying, well, what if we look at what some consider the creme de la creme of doing and presenting analysis? What if we look at notebooks that are supplementing academic publications? So, this is back to that Atlantic article saying, you know what, in the future scientific papers should just be in notebooks. And there's a number of folks who have jumped on that bandwagon and said, "You know, I may publish something in Science or Nature, one of the big journals, but I'm going to link to here's the notebook that I used for that analysis so that people can retrace it, recreate it, fork it, and continue the analysis themselves." And as mentioned earlier, there are some in the academic community who have a real strong evangelistic fervor around the need to share their results in this way. So, we ended up looking for those specifically.

37:58 Michael Kennedy: Interesting. And what'd you find there? Any differences? Probably, right?

38:01 Adam Rule: Yeah, so, we ended up finding, again, many of these are on GitHub, so we pulled about 150 of them. And we used a slightly different method where rather than just doing a big data analysis, we hand coded it. We wanted to see with finer grain detail.

38:14 Michael Kennedy: I looked at it and I put it in categories, not like typing programming, right?

38:20 Adam Rule: Exactly. So, putting it in categories, iterating those over time, and then making sure you have other people who can validate, yes, that's a valid category, not just something that you came up with.

38:31 Michael Kennedy: Okay, so you coded these by hand and probably got slightly different results, I'd guess.

38:36 Adam Rule: Okay, so these notebooks have a little more text in them. But surprisingly, they're not using that text really to describe the analysis in any rich way. So, of the notebooks that had any text in them, which was still not all of them, the majority would use that text just to describe the steps of the analysis. Importing data, fitting model.

38:56 Michael Kennedy: Back to the comments as text rather than comments as code comments, yeah, okay.

39:00 Adam Rule: And then only about a third of them have what we might consider a rich description. So, any description of why they did the analysis in a particular way. So, oh, I fit a linear model because these assumptions are met, or we tried this other model and it didn't work for interpreting results. So, if you look at this plot, you'll notice this outlier here. Most would just leave the end figure like it spoke for itself, often without axis labeled and say, "Here's how we got the result." and not really describe what they thought it meant or how they got there. So, that for us was surprising, again, thinking, well, academics will want their work to be easily understood and replicable, yet so many of these still kind of fell short of Fernando Perez, Brian Granger's, and others' vision of this rich computational narrative.

39:47 Michael Kennedy: How interesting. Yeah, so what'd you do? You go ask them why didn't you write more?

39:53 Adam Rule: So, in a way, we did. Not with the folks who had posted notebooks there. But we ended up finding people around campus at UC San Diego who were using notebooks. Again, some of these labs where they have big biology analysis, or genomics work, and were using notebooks as kind of a way of life. So, we went and talked to some of those people. So, we ended up finding 15 of them and just walked them through, show us a notebook you've been working on, which was great. We got to see in progress work rather than just, "Here's my finished product that I post on GitHub."

40:25 Michael Kennedy: Yeah, that's cool. And so, were they also more using it as REPL type explorative stuff? Or what was the story there?

40:33 Adam Rule: It was exactly that. Again, people were using it for this iterative environment. Some would talk about it, "It's my coding playground where I get to test out ideas. It's a very personal thing. It's reflecting my style of programming, and I'm not going to take time to clean it up for others, 'cause maybe they don't want to see it."

40:51 Michael Kennedy: I was going to say, it seems like a really great thing for people to create these notebooks, and then if you're meeting with your research group to pull it up and everybody look at it and walk through it, I wonder how much that played into it. Like, "Here, look what I've been doing this week. And here, let me show you the results and how I got it." and so on.

41:09 Adam Rule: Yeah, we have the same intuition as well. And it seems like that's not the case. In fact, one of...

41:13 Michael Kennedy: Just create while you do it.

41:16 Adam Rule: One of the folks that we talked to said, "You know what, I've tried this in lab meetings, and people just think I didn't take time to prepare. They think I'm just showing up by the skin of my teeth."...

41:26 Michael Kennedy: Winging it.

41:27 Adam Rule: Winging it. "And unless I have a slide deck put together that these aren't solid results, or that I didn't take time to think about what you might want to see." So, it's this really interesting case where there's kind of this entrenched practice of you must present from slide decks or else it means this or that about your work that got in the way of using notebooks as a presentation medium.

41:51 Michael Kennedy: Interesting. I wonder if it's a chicken and egg thing. Like if they were really beautifully formatted and descriptive, maybe they'd go "That's a really great presentation." But if they're sloppy like in and of themselves, maybe the presentation feels sloppy.

42:04 Adam Rule: Or just the social expectation in practice. What if labs just expected that rather than taking time to create a slide deck, that you would take the time to document your code in the notebook?

42:17 Michael Kennedy: That seems like that's more reusable and valuable over time. I've not refactored or reused slides that much. Wow. What do you think needs to be done for notebooks to reach their potential? To become this thing that would actually sort of validate the burning paper on the Atlantic?

42:39 Adam Rule: I'd first want to say I and the researchers who work with me think notebooks are fantastic. And I wouldn't want anyone coming away from the podcast saying, "Oh, notebooks are done. They're not the right way to do analysis." I think they're the best thing that we've got going, and they're a vast improvement over prior ways of doing data analysis, which was often having script one in a file, script two in another file, script 2.5...

43:07 Michael Kennedy: Version control is usually just naming copies of the files.

43:11 Adam Rule: It's a much better way for version control and tracking steps of an analysis. And our research has demonstrated it's a fantastic way to iteratively do the process of analyzing data, especially with Python. And so many people are using it for that. For us, the real question is now, how do we make this wonderful programming environment also a wonderful presentation environment? One where it's easy to share results with others and to support collaboration in that way. And I don't know of any silver bullet to get it done. I think there's things that we can do to tweak the design in how people use the notebooks. But there also have to be some social changes. Things like labs expecting that presentations will be from notebooks, or journals expecting submissions in notebook format rather than PDF. So, I think it'll be a mix.

44:02 Michael Kennedy: Interesting. How much do you think like Software Carpentry type of stuff is involved here? Like bringing a little bit of the CS side of things to the researchers.

44:13 Adam Rule: Yeah, I think that's really vital. And I think the model that Software Carpentry has of doing the workshops and dedicated time to training these best practices. One of the really stand out insights from our last line of work, the interviewing researchers, was that one researcher mentioned that when she had started as a biology student in biology or chemistry labs, she was trained in a very specific way of tracking her results. You know, this is how you write your name, and the date, and the reagents that you're going to use for this experiment and the steps. And if she didn't do it in a particular disciplined way, she'd get docked points by her teaching assistant or professor. 'Cause this was just the practice of how you document and share a biology or chemistry lab. And she said, "You know, we don't really have that for notebooks. I came into the lab, I was shown a notebook and told, 'Have fun.'"

45:07 Michael Kennedy: You type here.

45:09 Adam Rule: Exactly. I had to figure out, oh, I can create my own packages, and I can import external files, and oh, I can move all of the import statements to the very top of the notebook, and had to kind of figure out best practices on their own. So, I think there's a lot of best practices, both from software development and engineering, and from data analysis, that have grown up in the last few years that can be brought to bear through things like Data Carpentry or Software Carpentry workshops that aren't there yet.

45:39 Michael Kennedy: There's been a lot of progress there, but it also seems like there's always more progress to be made. I guess just from an academic perspective every year you start over, in a sense. Every year there's a new grad student who's fresh and they've never done this, and they're at the lab, and you've got to bring them under the fold. So, there's also this sort of mentoring aspect.

46:01 Adam Rule: Be really interesting to look at how would the results be different if we had done this out of enterprise. Like a lot of what we looked at ended up being academic domain, whether it's all of these students working on notebooks that we found on GitHub or looking at folks who are in an academic lab environment. That's largely a factor of who we have access to, who's willing to share their notebooks publicly.

46:21 Michael Kennedy: The rest of them, they're in these buildings right around campus. We'll just go talk to them.

46:25 Adam Rule: But looking at, I think there's similar issues of turnover within organizations, or handing off a project from somebody who's a senior analyst to a more junior less experienced analyst, and onboarding, and disciplined practice. Folks like Hilary Parker over at Stitch Fix, who talked a lot about opinionated data analysis and having a disciplined way of tracking, or sharing, or reviewing data analysis, and needing to develop that in enterprise.

46:52 Michael Kennedy: That's pretty cool. I wonder how accessible that information is. 'Cause I notice some of these companies are somewhat open about what they do. But again, this data analysis, this is partly what drives their company, and they're not going to just give it away. It's not like I could go to Goldman Sachs and go, "Hey, why don't you just publish your notebooks for your analysis? Because that would be so interesting." Like, "We're not doing that."

47:15 Adam Rule: Yeah, I think the domain where I've seen outside of academia, the most open sharing of analysis is journalism. So, folks like...

47:22 Michael Kennedy: Oh, yeah, yeah, yeah.

47:23 Adam Rule: Or Buzzfeed or others who had published a number of notebooks online, and even on GitHub. So, many of them are in our dataset.

47:31 Michael Kennedy: FiveThirtyEight's got an amazing set of data on GitHub. So, FiveThirtyEight spelled out with letters, slash data, I think is it.

47:39 Adam Rule: Really interesting use of notebooks. And again, for them it's kind of the incentives that we want to be open and show that we're a reputable organization and have others validate the claims that we make in our articles. Whereas you say, for companies it's kind of the keys to the kingdom of this is how we generate value is our data and our building on top of it.

47:59 Michael Kennedy: Yeah, and I guess with journalism, once you publish it, you've staked your claim to it. But in business, your sales model one year could be everyone's sales model next year. That's not the same. No one gives you much credit. Yeah, for sure. So, what do you think about the future? We talked a little bit about this, but maybe packaging these things up, it still may be a little bit difficult to say here's my notebook, and go run it with everything that you need. Should we look at containers? What do you see going forward?

48:30 Adam Rule: Notebook environments like Jupyter notebook are going to increasingly become the core infrastructure for data analysis, both in industry, and academia, and journalism. They have just proven so valuable as an iterative programming environment and for presenting results, though it still takes a lot of time to clean it up. But yeah, I think for the future there's a lot of work that'll have to be done in a lot of different domains before notebooks get more widely used. And as you referenced, some of it is in packaging the environment. So, theoretically, you can use notebooks to rerun somebody else's analysis. But what if you don't have the same access to their data? Either because of human subjects restrictions, or it's just a lot of data, or it's stored away on a server, or your version of the libraries are slightly different. So, I think there's a lot of work on how do we containerize and package up not only the programming environment and the language at this point in time, but also the data.

49:25 Michael Kennedy: That's interesting.

49:26 Adam Rule: I think that gets into questions about differential privacy and incentivizing data curation. Right now that's not really recognized often in academia at least as a contribution to the field.

49:41 Michael Kennedy: Yeah, it seems like the cloud computing stuff that we touched on in the beginning, it helps somewhat with that. Like the Azure notebooks, the Google notebooks, and Sage notebooks, and so on. You can share one of those, but at the same time, I guess eventually, maybe those things upgrade the packages that you have access to. And that could theoretically change your results. Like, oh, we fixed a bug in this analysis thing which now doesn't look the same. Who knows?

50:07 Adam Rule: And in discussions with some folks from the library sciences, they'll say for as much flak as things like the PDF document get as a way of presenting results, PDFs and paper have proved to be a really stable way to share insights and knowledge.

50:21 Michael Kennedy: Yeah, the versioning isn't nearly as touchy.

50:25 Adam Rule: We haven't had an issue of pulling a book off the shelf and it not working any more except if maybe the spine broke or something, like we did with trying to run code that's even five or 10 years old. It's just much more stable environment for stored knowledge. So, I think figuring out ways to do that will be really vital for things like notebooks becoming more integral to sharing results over time.

50:47 Michael Kennedy: I agree. I think containers have a lot of promise there, 'cause they can freeze the whole environment. But even them, they're based on some other operating system. They've got to run on a certain version of Linux. It's not perfect.

51:01 Adam Rule: Yep, it's not perfect. Yeah. And then, as you referenced earlier, I think there's in the future a lot to be done in the realm of Software Carpentry, Data Carpentry, scaling education. As much as we can do to tweak these interfaces to make it easier to develop in or to write really clear narratives in, I still think a lot is going to rely on either apprenticeship models, or mentoring, or training, and ways in doing and documenting data analysis. So, figuring out the right way to do that is a super tricky problem.

51:32 Michael Kennedy: Yeah, it sure is. I wonder if we're going to have some more specializations. So, in the early days it's like, well, the people that want to do computer programming, they got electrical engineering degrees for whatever reason. And then we had the CS degree, computer science. I don't know, I don't feel like I've done any science really in computers for a really long time. But that's what it's called even though you're not actually staging a hypothesis and doing things. It's more like engineering. So, we then got software engineering degrees that are slightly different applied ways of doing what computer science was doing. Maybe we will get computational scientist degree specializations or something coming along. It's like half computer science, but half these other data sides of things you're talking about.

52:17 Adam Rule: I think many of the data science institutes that are popping up, either online or at universities, are trying to figure out what of this is specialized and unique to working with data, what is just rehashing things that we've already learned from software engineering and versioning? What do we even have to specialize further, in that biology's use of programming for data analysis will look fundamentally different from astronomy's? The practices will be as different as Astropy and another library are.

52:48 Michael Kennedy: I can see a world where we end up there. As computation becomes more and more the foundation of all these different degrees, that each degree's like, "No, no, we're going to have a biological computationist degree. It's not going to be over in the CS world. It's going to be here, and we're going to run it, and it's going to have some CS, but it's also going to have lots of biology and other aspects."

53:10 Adam Rule: Yeah. And that's one aspect of future work we haven't gotten to, but I think would be fascinating, is looking at how people use notebooks differently in different environments. Both how's enterprise different from academia? How are beginning computer science students different from later ones? Or how does astronomy differ from chemistry and how they use these environments? So, I think that's a fascinating area to look at next.

53:34 Michael Kennedy: It definitely is. All right, well, I think that probably is a good place to leave things. A pretty optimistic future. We're getting a little low on time, so maybe we'll ask you the two questions at the end of the show. First of all, if you're going to edit some Python code, what editor do you use?

53:50 Adam Rule: I'm really a Jupyter fanboy. So, used it whenever I need to analyze data, used it for this study, and really liked the environment. The exception to that will be when I'm doing more software application development. So, if I'm actually doing web development and have Python on the server side and JavaScript on the front end, then I'll often use Atom for that, just 'cause I'm switching back and forth between languages.

54:14 Michael Kennedy: Yeah, if you have a bunch of different files, and they're kind of all working together, especially cross language, like CSS, JavaScript, HTML, etc. Jupyter's not amazing for that. But it is really great for exploring. Awesome. All right, how 'about a notable PyPI package.

54:29 Adam Rule: I think Phillip may have mentioned this too. But especially in the data analysis world, Anaconda, I have yet to find a better way just to get people up and running on doing data analysis and quickly package together Pandas, NumPy, Matplotlib, Seaborn.

54:44 Michael Kennedy: Especially on Windows. Especially on Windows where some of those tools are hard to pip install, because some weird compiler thing is missing. Yeah, that's cool.

54:54 Adam Rule: Pick a specific one and probably be the visualization libraries, Matplotlib or Seaborn are really where I spend my time.

55:00 Michael Kennedy: Awesome. Yeah, I'll throw one out for you. I don't normally do this. But this one is so relevant. I just came across it. Have you heard of PixieDebugger? P I X I E Debugger.

55:10 Adam Rule: No.

55:10 Michael Kennedy: So, PixieDebugger is a visual interactive debugger for Jupyter Notebooks.

55:16 Adam Rule: That is fantastic.

55:17 Michael Kennedy: You just include it, and it gives you below your cell, you put a little decorator like magic command onto a cell, and then you can just step through, step forward, you inspect the variables visually as you're going through that cell. It's pretty awesome.

55:30 Adam Rule: That is awesome. Yeah, so many folks will split cells to try and figure out where does this thing fail. So, that's perfect.

55:36 Michael Kennedy: Yeah, it's really, really, great. All right. So, people are interested. They want to do more with this, maybe look at your research. Is the data available? What do they do?

55:45 Adam Rule: I have a personal website, adamrule.com, that they can go to that links out to everything. And that will have copies of the papers documenting my research, as well as links to the data repositories.

55:56 Michael Kennedy: How big is the data? Is it a lot?

55:58 Adam Rule: It's about 600 gigabytes.

56:00 Michael Kennedy: 600 gigabytes, wow.

56:02 Adam Rule: Thankfully, our university had a very fast internet speed to be able to download all that.

56:09 Michael Kennedy: Oh, my goodness, yeah, no kidding. Where do you host it? That's actually a nontrivial amount of money if you actually had to put it on S3 or something like that. That's 50 bucks per download.

56:19 Adam Rule: Props to UC San Diego again on that. Their library has graciously agreed to host that. So many other sites that we looked at had limits at 100 gigabytes or something for datasets. So, they're hosting that. And again, I have a link off of my website. But we both have the full dataset, and then we have a starter dataset that's about one to two gigabytes with a subset of those, but with all the different datatypes and example notebooks that people can use to begin playing with it.

56:46 Michael Kennedy: Wow, that's really cool. So, they can start to play with it, and then if they're really committed, they can download 600 gigs.

56:51 Adam Rule: Exactly, exactly.

56:54 Michael Kennedy: That's pretty awesome. Well, Adam, this is really interesting research, and thanks for sharing your view into the whole notebook space.

57:01 Adam Rule: Yeah, thanks again for chatting today. It's been a pleasure.

57:03 Michael Kennedy: You bet, bye.

57:03 Adam Rule: Bye.

57:05 Michael Kennedy: This has been another episode of Talk Python to Me. Our guest has been Adam Rule. And this episode has been brought to you by Linode and Studio 3T. Linode is bulletproof hosting for whatever you're building with Python. Get four months free at talkpython.fm/linode. That's L I N O D E. With Studio 3T, you can write SQL queries and translate them automatically to Python. Try their database IDE today at talkpython.fm/studio. Want to level up your Python? If you're just getting started, try my python jumpstart by building 10 apps, or our brand new 100 days of code in Python. And if you're interested in more than one course, be sure to check out the everything bundle. It's like a subscription that never expires. Be sure to subscribe to the show. Open your favorite podcatcher and search for python. We should be right at the top. You can also find the iTunes feed at /itunes, Google Play feed at /play, and direct RSS feed at /rss on talkpython.fm. This is your host Michael Kennedy. Thanks so much for listening. I really appreciate it. Now, get out there and write some Python code.
