1M Jupyter notebooks analyzed

Episode #171, published Sun, Jul 29, 2018, recorded Fri, Jul 6, 2018


Jupyter notebooks have transformed the way many developers and data scientists do their jobs. They offer a platform to not just explore but to explain data and computation.

But how are they *really* being used? Adam Rule is here to describe his research (and Ph.D. dissertation), which analyzed over 1M Jupyter notebooks found in the wild.

Episode Deep Dive

Guest introduction and background

Adam Rule recently completed his PhD at UC San Diego, focusing on human-computer interaction (HCI) and how people use computational notebooks. During his graduate research, he investigated Jupyter notebooks in both academic and enterprise contexts. Adam and his colleagues analyzed over one million publicly available notebooks on GitHub to understand the common practices, challenges, and future directions of notebook usage. His perspective blends software design insights with social science approaches, giving him a unique angle on how notebooks shape modern data analysis and collaboration.

What to Know If You're New to Python

Here are a few essentials to help new Python users follow along with this episode’s notebook-centric discussion:

Understand that Jupyter Notebooks combine code cells and descriptive text (Markdown) to foster both experimentation and explanation.
Familiarize yourself with common Python data libraries such as Pandas, NumPy, and Matplotlib, as they frequently appear in Jupyter contexts.
Know that GitHub often hosts open-source Python and Jupyter projects, making it a valuable resource for finding real-world examples.

Key points and takeaways

Analyzing 1 Million Notebooks: Adam led a study that downloaded and examined over one million Jupyter notebooks from GitHub. This large dataset revealed widespread notebook usage across academia, data science competitions, and personal projects. Surprisingly, a significant portion of notebooks had minimal to no explanatory text beyond raw code cells. This finding underscores a gap between the ideal notion of literate programming and the realities of everyday notebook use.
- Tools and links:
  - GitHub – Main repository host for the analyzed notebooks
  - Jupyter.org – Official site for Jupyter notebooks
Notebooks as Iterative Environments: Many users treat notebooks like an enhanced REPL, an interactive playground for quick exploration of data, testing models, and verifying code snippets. The cell-based structure makes it easy to run partial bits of code, refine logic, and see instant results. However, the ephemeral nature of these “playgrounds” can lead to messy, unstructured files if not cleaned up.
- Tools and links:
  - NumPy – Common library for numerical operations
  - Matplotlib – Key plotting library often used in notebooks
Limited Narrative Within Most Notebooks: Despite Jupyter’s promise of “literate programming,” a surprising number of notebooks contain little or no Markdown text. Many rely solely on code comments or minimal cell headings to explain steps. This undermines their potential as a truly explanatory medium. Instead, they serve more as personal scratchpads than as polished, shareable documents.
- Tools and links:
  - Markdown syntax – Lightweight text formatting used in Jupyter’s text cells
Academic Use, Not Always Polished: Even in academic and research contexts, where reproducibility and clarity are paramount, many published notebooks fell short of fully explaining methods, results, or assumptions. While some did include thoughtful narratives and discussion of outliers or modeling choices, these were the exception rather than the rule. The social expectation in some labs still favors slides over notebooks for official presentations.
- Tools and links:
  - Software Carpentry – Offers training to help researchers adopt better coding practices
Discrepancies Between Open Science Aspirations and Reality: Open science advocates tout notebooks as a future replacement for static PDF papers, but Adam’s analysis shows a mismatch: people often fail to annotate code cells in a way that lets anyone else reproduce the results or even interpret what’s happening. Truly open and reproducible science will require both better tool support and broader cultural shifts around sharing code and data.
- Tools and links:
  - Nature.com and Science.org – Examples of journals where some authors post supplementary notebooks
Collaboration Challenges in Labs and Teams: While notebooks can be shared easily (e.g., via GitHub or as static HTML exports), not everyone embraces them for group presentations or code reviews. Some lab cultures demand a polished slide deck for results. Others want tighter version-control integration and clearer best practices for commenting notebooks. This means notebooks can remain largely personal rather than collaborative artifacts.
- Tools and links:
  - GitHub Pages – A simpler way to share static notebook outputs
  - Pixie Debugger – Interactive debugging for Jupyter
Containerization and Reproducibility: Sharing a notebook alone may not guarantee that others can re-run it exactly. Version mismatches, missing packages, and locked-down data can break reproducibility. Packaging the environment with containers or services like Binder (mybinder.org) can help ensure that code runs “as is.” This approach can also pin library versions for more stable research artifacts.
- Tools and links:
  - Binder – Turn a Git repo into a sharable, live notebook environment
  - Docker – Common container solution
Notebooks Beyond Python: While Jupyter originated with IPython, it supports many kernels, including R and Julia, and the notebook model has also spread to separate JavaScript-based platforms like Observable HQ. Nonetheless, Adam’s data showed Python dominating (around 96%). This speaks to Python’s well-established ecosystem, including data libraries such as Pandas and scikit-learn.
- Tools and links:
  - RStudio – IDE for R, but also fosters an R markdown culture
  - ObservableHQ – Browser-based JavaScript notebooks
Educational Dominance on GitHub: The dataset highlights the rise of data science education. Many notebooks appear to be course assignments, Kaggle practice, or tutorial examples. These “student projects” often focus on incremental skill-building and may not reflect best-practice design. Still, they show how accessible notebooks are for beginners, fueling Python’s popularity.
- Tools and links:
  - Kaggle.com – A major site for data competitions with heavy notebook usage
  - Udacity – Offers nano-degree programs featuring Jupyter-based projects
Future Directions, A More Opinionated Approach: Adam described a need for “opinionated data analysis” frameworks that guide novices toward well-documented, reproducible processes within notebooks. The success of standards like PEP 8 for Python style shows how norms can improve quality. Adapting these for Jupyter notebooks, and seeing them enforced or taught, could significantly raise the bar for clarity and usability.
- Tools and links:
  - Pandas – Example of a library with strong community conventions
  - Pydantic – Shows how typed data models can unify data validation
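
The "limited narrative" point above comes down to how many cells in a notebook are Markdown rather than code. As a rough illustration only (this is not the method from Adam's paper), the sketch below counts cell types in a notebook's JSON, assuming the nbformat v4 layout in which a top-level `cells` list holds objects with a `cell_type` field. The function name `summarize_cells` and the inline two-cell notebook are hypothetical, made up for this example.

```python
import json


def summarize_cells(notebook_json: str) -> dict:
    """Count cells by type in a notebook given as a JSON string.

    Assumes the nbformat v4 layout: a top-level "cells" list whose
    entries each carry a "cell_type" such as "code" or "markdown".
    """
    notebook = json.loads(notebook_json)
    counts = {"code": 0, "markdown": 0, "other": 0}
    for cell in notebook.get("cells", []):
        cell_type = cell.get("cell_type")
        key = cell_type if cell_type in ("code", "markdown") else "other"
        counts[key] += 1
    total = counts["code"] + counts["markdown"] + counts["other"]
    # Share of cells that carry narrative text; 0.0 for an empty notebook.
    counts["markdown_share"] = counts["markdown"] / total if total else 0.0
    return counts


# Hypothetical two-cell notebook, written inline so the sketch is self-contained.
example = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Load the data"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["import pandas as pd"]},
    ],
})

summary = summarize_cells(example)
print(summary)
```

To try this on a real file, read the `.ipynb` file's text and pass it in. A notebook with a `markdown_share` near zero would be one of the "code only" notebooks described above.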

Interesting quotes and stories

"Jupyter notebooks have transformed the way many developers and data scientists do their jobs. They offer a platform to not just explore, but to explain data and computation." -- Episode introduction

"I've tried [showing a notebook in lab meetings], and people just think I didn't take time to prepare if it's not in a slide deck." -- Adam on the social expectation around presentations

"It's my coding playground where I get to test out ideas, but it's a very personal environment." -- Adam summarizing a common sentiment among notebook users

Key definitions and terms

Human-Computer Interaction (HCI): The multidisciplinary field studying how humans interact with computers and design principles that foster better user experiences.
Computational Notebook (e.g., Jupyter): A document combining executable code cells, outputs, and explanatory text (usually Markdown) in one place.
REPL (Read-Eval-Print Loop): An interactive programming environment where you type a command, the computer evaluates it, and returns a result instantly.
Open Science: A movement promoting transparent, reproducible, and accessible research, including sharing of data, code, and methods.
Literate Programming: A programming paradigm introduced by Donald Knuth that combines source code and explanatory prose in one integrated document.
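
To make the "code cells plus explanatory text" definition concrete, here is a small sketch that flattens a notebook-like structure into a plain Python script, turning Markdown cells into comments. It assumes the same nbformat v4 cell layout (`cell_type` and `source` fields, where `source` may be a string or a list of strings). The helper `notebook_to_script` and the sample cells are hypothetical and are not Jupyter's own export code.

```python
def notebook_to_script(notebook: dict) -> str:
    """Turn a notebook dict into a script: prose becomes comments, code stays code."""
    lines = []
    for cell in notebook.get("cells", []):
        source = cell.get("source", "")
        # nbformat stores source either as one string or as a list of strings.
        text = "".join(source) if isinstance(source, list) else source
        if cell.get("cell_type") == "markdown":
            lines.extend("# " + line if line else "#" for line in text.splitlines())
        elif cell.get("cell_type") == "code":
            lines.extend(text.splitlines())
        else:
            continue  # skip raw or unknown cell types
        lines.append("")  # blank line between cells
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical notebook with one narrative cell and one code cell.
notebook = {
    "cells": [
        {"cell_type": "markdown",
         "source": ["# Compute a mean\n", "Average three readings."]},
        {"cell_type": "code",
         "source": ["readings = [1, 2, 3]\n", "mean = sum(readings) / len(readings)"]},
    ]
}

script = notebook_to_script(notebook)
print(script)
```

A notebook with no Markdown cells would come out of this as bare code with no explanation, which is the "personal scratchpad" pattern the study found so often.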

Learning resources

Here are some recommended Talk Python courses to go further.

Python for Absolute Beginners: Ideal if you’re new to programming and the Python ecosystem.
Data Science Jumpstart with 10 Projects: Learn practical data science skills and build real-world projects in Python, perfect for those wanting to go from raw data to analysis in notebooks.
Move from Excel to Python with Pandas: Great for analysts eager to transition from Excel-based workflows to powerful, Python-based data manipulation in notebooks.

Overall takeaway

Computational notebooks remain one of the most influential developments in data science, promising both interactive exploration and a narrative for reproducible research. Adam Rule’s million-plus notebook survey shows that while the iterative development aspect of Jupyter has been transformative, many people do not harness the full potential of literate programming. A shift in both social norms and tooling, particularly around documentation, environment management, and collaboration, can help notebooks evolve from personal scratchpads into more polished and shareable documents. By embracing disciplined best practices, we can bridge the gap between day-to-day iterative coding and truly reproducible, open science.

Links from the show

Adam Rule: adamrule.com
1 Million Notebooks Paper (official): dl.acm.org
1 Million Notebooks Paper (pre-print): adamrule.com/files
Analysis Notebooks for Paper: github.com/
Dataset for Paper: library.ucsd.edu
Atlantic Article - The Scientific Paper is Obsolete: theatlantic.com
Episode #171 deep-dive: talkpython.fm/171
Episode transcripts: talkpython.fm


Episode Transcript


00:00 Jupyter notebooks have transformed the way many developers and data scientists do their jobs.

00:04 They offer a platform to not just explore, but to explain data and computation.

00:08 But how are they really being used? Adam Rule is here to describe his research and PhD dissertation,

00:14 which analyzed over 1 million Jupyter notebooks found out in the wild.

00:18 This is Talk Python To Me, episode 171, recorded July 6, 2018.

00:24 Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities.

00:44 This is your host, Michael Kennedy. Follow me on Twitter, where I'm @mkennedy.

00:49 Keep up with the show and listen to past episodes at talkpython.fm.

00:52 And follow the show on Twitter via @talkpython.

00:55 This episode is brought to you by Linode and Studio 3T.

00:59 Please check out what they're offering during their segments. It really helps support the show.

01:03 Adam, welcome to Talk Python.

01:05 Yeah, great to be with you. Thanks for inviting me on the show.

01:08 Yeah, it's great to have you here. I'm really glad our mutual friend, Philip Guo, introduced us and suggested this show.

01:15 Yeah, no, you've had a lot of great people on the show.

01:17 I mean, Philip included, who I think has been a repeat on the show, which is great.

01:22 But other folks in kind of my area of data analysis or computational notebooks, Min and Matthias from Project Jupyter and DJ Patil, one of the kind of godfathers in data analysis.

01:34 So it's an honor to be on here with you today.

01:36 It's great to have you on. It was great to have those people as well.

01:39 So we're going to talk about Jupyter Notebooks in a super meta way.

01:43 Like, I've had Matthias on to talk about notebooks in, like, as a technology.

01:48 But we're going to talk about more, like, how people work with notebooks and how it affects researchers and data science and stuff.

01:55 Before we get to that, let's start with your story.

01:57 How did you get into programming and what led you down this Python path?

02:01 Oh, gosh.

02:02 I mean, my original track was I was an industrial engineer as an undergrad.

02:06 And I thought, oh, I like people a little bit, but I also like science and math.

02:11 And so industrial engineering is this mix of both of those where you're trying to optimize, like, production flows.

02:17 But then you spend some time on the shop floor talking to people about, like, how's this working for you?

02:22 How can we make this process easier for you?

02:24 From there, it's a somewhat circuitous path, but eventually took me to understanding, oh, not just designing manufacturing processes,

02:31 but we can design devices and products and software for people.

02:36 And, gosh, designing software for people is really difficult because you can't just look at, you know, measure how long is their arm and lay out the workspace that way.

02:45 But you have to get into their head a little bit and try and figure out, okay, mental capacity and how do people think through things?

02:51 What's a workflow look like for that, not just how I work a work piece through a factory?

02:56 So with that, I got into programming largely by studying how people use software and needing to do some of my own programming to build software

03:06 add-ons to really look deeply at how is it that people are using these technologies and how can we make them easier to use or just do more powerful things

03:15 and augmenting what we can learn about how people work in their capacity.

03:20 Yeah.

03:20 Oh, that's really cool.

03:22 It's super interesting how this sort of engineering path led you over here.

03:27 I think industrial engineering plus software would be a really cool space to work because there's just so many gadgets and things you could design with just a little bit of smarts and a little bit of programming that would be amazing.

03:39 Like, that's a skill I don't have.

03:41 Yeah.

03:42 I know that kind of fusion of the two worlds would be really fascinating.

03:45 I think it's a little ironic.

03:46 One of the things I really like about programming is you actually get to build stuff, which you get to build stuff that is, like, conceptual, right?

03:52 But it runs.

03:53 And I'm speaking, like, coming from a math perspective, right?

03:56 Where it's like, prove this cool theorem about topological spaces.

03:59 I guess I'll go look at another theorem, right?

04:02 Like, you don't actually have an outcome, right?

04:03 Here, even though it's sort of virtual, you still get to build things.

04:06 But this would take it to another level.

04:08 Awesome.

04:08 So you started writing software to kind of understand and study how people interact with it, which sort of started you down this meta path, right?

04:17 Yeah.

04:18 Yeah.

04:18 I know.

04:19 I think today as we get further on the talk, it'll be very meta as we talk about studying how people use tools like Jupyter Notebook by, in turn, using Jupyter Notebook to analyze a data set about people using it.

04:32 It kind of becomes turtles all the way down.

04:34 Yeah.

04:35 I was just thinking it's turtles all the way down.

04:36 Absolutely.

04:37 Absolutely.

04:37 Yeah.

04:38 So what led you into, like, Notebooks and Python, right?

04:41 Like, you could have, say, used C++ to understand how people use software.

04:45 It would take you longer, but you could have done it.

04:47 Yeah.

04:48 Yeah.

04:48 No.

04:49 And this may be going back a little bit too far, but I was really fascinated by healthcare and how physicians use medical records to track and document their work.

04:57 It's this really data-driven domain.

04:59 But it's really hard to get into.

05:02 You can't really say, oh, let me just hack on your enterprise software system in the hospital.

05:08 I just want to tweak it in this way and see if that makes it easier to care for patients.

05:12 One, the software systems are super complex and regulated, and you have kind of patient risk there as well.

05:21 And so I turned and looked at another very data-driven domain, data analysis.

05:25 And honestly, it's just by kind of hearsay of so many people saying, hey, have you checked out these notebooks?

05:32 They're fantastic.

05:33 I've been using them for months or years now for doing analysis.

05:37 They're pretty amazing.

05:38 You should really look at this.

05:40 And so where I was at UC San Diego, a bunch of people were using these tools for very basic biology, neuroscience research.

05:50 So that got me into looking in, okay, how are people using these tools to track and talk about very complex data-driven work?

05:57 Yeah, that's awesome.

05:58 So maybe that's a good place to segue into kind of what you've been doing recently.

06:03 So you said you are at UC San Diego, a very nice school down in San Diego.

06:07 And you just finished your PhD as a human-computer interaction researcher, which is a sub-portion of, say, cognitive science.

06:17 And I was really surprised how much computer programming and software is involved in cognitive science more broadly.

06:24 Yeah, I know.

06:25 Cognitive science is an interesting field that's kind of a fusion of psychology and computer science.

06:32 And you go back to some of the early days looking at folks like Herb Simon and others at Carnegie Mellon and around the world.

06:39 And they were playing in both fields of developing and testing a lot of software, trying to figure out, can we model how the brain works?

06:46 And then there's kind of the reverse transition of people will look at the brain and use that to try and figure out how can we build more efficient algorithms or computer systems with neuromorphic computing.

06:57 But yeah, I just finished my PhD a month ago, actually.

07:01 So still fresh off of that.

07:03 Are you just super relaxed now?

07:07 I am.

07:08 No, I was going to say, I'm almost in this academic sabbatical period where we still have funding and I'm still continuing to do some of the research that we'll talk about today.

07:17 But due to my wife's job, I moved to a different city, now up in Portland, staying on the West Coast, but very different in terms of sunshine hours from San Diego.

07:28 Less sunshine, more green.

07:30 More green, which is great.

07:32 Yeah.

07:32 When I moved to San Diego from Seattle and when I was in Seattle, really loved the lush green.

07:39 And so San Diego has many benefits.

07:42 The surfing, the burritos, the sunshine, but it lacks in green.

07:46 So it's good to be back.

07:47 Nice.

07:47 So you still are finishing up this research a little bit that we're going to be talking about.

07:52 And then it's time to hit the real world.

07:55 Are you thinking academics?

07:56 Are you thinking industry?

07:57 Where are you headed?

07:58 Yeah, I'm thinking industry at this point.

08:01 Just to try my hand at something slightly different and see what that world is like.

08:05 Gotten a good dose of academia for the last five years of PhD and two years of master's before that.

08:11 That's a healthy dose.

08:12 What the working world is like.

08:14 Awesome.

08:15 All right.

08:15 So maybe we should start by talking a little bit broadly about what human-computer interaction is.

08:21 Yeah.

08:22 And I think Philip has covered some of this as well because he researches really similar topics.

08:26 But it's really studying the design and use of computer technology.

08:30 So how do people use current technologies?

08:32 And how do certain aspects of the design make it easier to use for particular tasks?

08:38 So as I was saying with cognitive science, it's really a mix.

08:42 Human-computer interaction, the subfield, is really a mix of computer science and social science.

08:47 So some days I'm building software.

08:50 Other days I'm testing it with people.

08:52 Other days I'm just sitting and observing how people use it or don't use it during their tasks.

08:58 So it's this flopping between programming and observing and social science, more anthropological skills.

09:04 That's a lot of fun and goes back to my industrial engineering days of the math and science and the satisfaction of building things.

09:11 And then the flip side, trying to understand people as well.

09:14 Oh, yeah.

09:15 It's a really interesting mix.

09:17 Are folks in that area starting to think about how artificial intelligence is changing this?

09:22 Like things like the Amazon Assistant or the Google Assistant and stuff like that?

09:27 There's a bunch of research in that area.

09:29 I think some of the folks who are farthest out ahead on that are those in the human-robot interaction field.

09:35 Because they've had to think for a while about how are people going to interact with robots and reason about how is this computer device reasoning about things and making decisions?

09:45 And should I trust that?

09:46 Or does it not have access to all the information I do?

09:50 And all these things we do very naturally with other humans of like, oh, they don't see that car coming because they're looking this other way.

09:56 I should let them know.

09:58 It's harder to do that with computer systems where you're less sure about what are the inputs, what's the processing, what are the outputs going on.

10:04 That's really interesting.

10:05 It's going to really become more and more so over time, isn't it?

10:08 Yeah.

10:09 And then all this work on like machine learning interpretability.

10:12 How are you going to be able to interpret a decision that came out?

10:15 I know there's work going on in that.

10:17 And even here in Portland, the next HCI researcher meetup in the area is focused on machine learning and how do we design for this and help people understand what's going on.

10:26 So I think both in academia and industry, it's a big deal right now.

10:29 Oh, yeah.

10:30 That's awesome.

10:30 It sounds like people who want research projects, I suspect that's a good place to focus.

10:35 Good place to look.

10:36 Yeah, for sure.

10:37 You decided to focus a little more meta.

10:39 You wanted to focus on these computational notebooks, which Jupyter is one of.

10:44 But maybe give us like let's set the landscape, right?

10:47 So in the Python space, we hear Jupyter, Jupyter, Jupyter.

10:50 Oh, JupyterLab is slightly better.

10:51 JupyterLab.

10:52 And then that's about it, right?

10:54 But there's actually a slightly broader view of these things in the history as well, right?

10:58 So there's actually a really interesting article in the Atlantic that was coming out and saying, oh, the academic paper is dead and computational notebooks are going to replace them.

11:09 And a lot of that talks about Jupyter notebooks, but it goes into some of the history of notebook platforms back to really Mathematica is the one that's often credited with being one of the first environments where you could have this literate programming back and forth, typing and running small scripts in a specific language to analyze data or ask questions.

11:29 So that was back in the 80s.

11:31 And there were academic systems like Maple that were in schools in the 90s and 2000s.

11:38 But it's been in the last couple.

11:39 I remember using Maple.

11:40 That thing was magic.

11:42 Yeah, no, I remember using it as well.

11:44 And I haven't really seen it much outside of the educational context.

11:47 I don't know if that's just their niche or what.

11:49 But so it's been around for a while, but it's often been locked away in proprietary software that you had to pay a big license fee for.

11:58 And so it's really in the last decade or so and really the last five years or so that platforms like Jupyter Notebook or RStudio have been providing these open source and in some cases, like Jupyter, free environments for using a notebook-like interface to play with data.

12:16 Yeah, what's your take on when I was working on my master's degree and my PhD and stuff, which I didn't get my PhD, but I did get my master's degree.

12:26 But anyway, when I was working on that, you know, I was using MATLAB and stuff and we were doing things like wavelet decomposition, which I'm pretty sure the license was 2000.

12:36 This is like an add on to MATLAB was like 2000 additional dollars per person that was using it like that's completely insane.

12:44 Yeah.

12:44 And then here comes Jupyter and whatnot and going, oh, actually ours is free.

12:48 Why don't you try that?

12:49 Well, that's a big effect, right?

12:52 Yeah.

12:53 And thinking about, you know, there's a huge push in science for open science and not just sharing your results, sharing the data, but then sharing also your code and how you arrived at that.

13:06 And the fascinating thing is so many researchers are making, you know, you talked about the wavelet decomposition.

13:12 They're making their own packages or libraries, especially in the Python ecosystem and sharing them openly for others to use and download.

13:21 And that's really seemed to move up the stack to not just be the packages, but the language, you know, Python should be open and free.

13:28 And the environments, the development environments like Jupyter should be open and free.

13:32 Do you see this like Zen of open source from the scientists interacting with the software, like flowing into science directly in the sense that people are sort of changing the way

13:45 they see other stuff by virtue of having these open source experiences?

13:48 I think so.

13:49 And a couple of points on that.

13:51 Some of the folks that I talked to, and we'll get to some of the research that I've done, talked about this really strong obligation they felt to make things open source, to make them reproducible.

14:01 And it was almost a religious zeal of like, this is just how things should be done.

14:05 And I hadn't seen that really before in academia, in other contexts.

14:10 I think one of the other interesting things I saw is so many labs where, you know, half the lab might be wet lab biology folks who are running experiments and, you know, creating slices and staining them and using microscopes.

14:24 And the other half really just looks like a startup or a software development group who are doing code reviews on a weekly basis.

14:32 They have a GitHub repository, maybe.

14:34 GitHub repositories talking about versioning, remote, you know, workers calling in from across the world and seem very much like just a software company.

14:45 But they're in a research lab doing fundamental biology research.

14:49 Yeah, I really do see this computer science skill starting to permeate a lot more stuff, not just so we have more programmers, but so that, like what you described, these biologists can take some of these software ideas and just really amplify what they're doing in the lab.

15:06 Yeah, absolutely.

15:09 This portion of Talk Python To Me is brought to you by Linode.

15:12 Are you looking for bulletproof hosting that's fast, simple, and incredibly affordable?

15:16 Look past that bookstore and check out Linode at talkpython.fm/Linode.

15:21 That's L-I-N-O-D-E.

15:23 Plans start at just $5 a month for a dedicated server with a gig of RAM.

15:27 They have 10 data centers across the globe.

15:30 So no matter where you are, there's a data center near you.

15:33 Whether you want to run your Python web app, host a private Git server, or file server, you'll get native SSDs on all the machines, a newly upgraded 200 gigabit network, 24-7 friendly support, even on holidays, and a seven-day money-back guarantee.

15:48 Do you need a little help with your infrastructure?

15:50 They even offer professional services to help you get started with architecture, migrations, and more.

15:56 Get a dedicated server for free for the next four months.

15:59 Just visit talkpython.fm/Linode.

16:02 So I did sort of cut you off a little bit when you were talking about the history.

16:07 So we've had Mathematica and Maple and MATLAB, and then we've got Jupyter and RStudio.

16:13 But then there's a bunch of hosted ones as well that maybe some people have heard of, but there's actually a ton of variety out there for even just getting these and running them online.

16:21 Yeah.

16:22 So I mean, even just sticking with Jupyter, one of the great things about Jupyter is they have the notebook itself, kind of this front end.

16:30 But I think some of the more lasting impact from Jupyter might be just the standards that they set about how are we going to send messages back and forth to a kernel?

16:38 What's the notebook format and the very specific JSON structure?

16:43 So in a way, they're almost like a standard setter, like World Wide Web Consortium, HTML, or other things.

16:49 This is just how scientific computing or data analysis should be documented and shared.

16:54 And so a number of other groups have come and built on top of Jupyter in those standards.

17:00 So like Google Colaboratory is like a Google Docs version of Jupyter Notebooks, Microsoft's Azure Notebooks on their, you know, Azure Cloud, and then even Sage Notebooks.

17:11 Sage was this project and specific language, kind of like Mathematica, for doing data analysis and mathematics.

17:19 And they've now switched over.

17:21 Is that the, that's the Sage from William Stein up in Seattle, right?

17:26 Yep, that's exactly right.

17:27 So now they have a whole notebook infrastructure and each of these have made add-ons and different history features or profiling tools that are slightly different from Jupyter Notebook or JupyterLab.

17:40 But they're all essentially built on top of Jupyter.

17:42 Yeah.

17:43 And there's some really interesting cloud computing tie-ins.

17:45 Like I don't know Azure and Sage well enough, but I know on Google Colaboratory, like you hit a hotkey to run a cell and you hit a slightly different hotkey to run the cell on a GPU.

17:55 It's like crazy, right?

17:57 It's just.

17:57 Yeah.

17:58 And I think that comes into business models of how you're going to monetize these things.

18:02 Jupyter has this interesting model of being open source and funded by academic time and grants, whereas others are saying, well, we'll provide this software for free.

18:11 But if you want white glove support or to run it on our cluster, then that'll cost you at that point.

18:19 So it's free for smaller uses in educational context, but we'll provide the compute infrastructure.

18:25 That's pretty interesting.

18:26 There's also a JavaScript one, right?

18:27 Yeah.

18:28 So this kind of notion of computational notebooks is now spreading into the web world where Observable HQ.

18:34 So Mike Bostock, the mind behind the D3 data visualization library, now has a company that's making a completely on the web computational notebook.

18:45 So writing JavaScript code to analyze data.

18:49 And then Mozilla actually is starting up a project called Iodide that's looking at this.

18:54 And what's fascinating about these is you can write code not only to analyze the data, but to directly manipulate the notebook itself.

19:02 And so you get some really fascinating views on data or dashboards or something that you wrote in a cell just changes the complete layout or operation of the notebook itself.

19:14 So it mixes kind of the programming for data analysis and then programming to change the tool that you're using to do the data analysis.

19:22 How interesting.

19:23 Yeah.

19:23 Because you can reprogram the notebook with the notebook.

19:27 Yep.

19:28 So it makes it easy to do things that have been difficult to do in, say, Jupyter Notebook to create widgets where if I want a slider that can then change some parameter and a visualization I have, I can just call whatever that div is because I'm programming in JavaScript and select it and tell it to update on this callback.

19:46 Yeah, I definitely see that having an advantage in that integration as well as that runs on the client browser.

19:54 So the hosting cloud side of it is like, you could probably do that on a $5 server.

19:58 You know what I mean?

19:59 Like you could host an incredible amount of computation because you're just serving up the files more or less, right?

20:04 Yeah.

20:04 So all of these really interesting models all around this idea of a notebook infrastructure where you incrementally write a few lines of code to analyze data and then get the outputs printed right in line.

20:16 One weakness, I guess I would call it, that I see with a JavaScript computational environment is JavaScript has poor numerical support.

20:23 Yeah.

20:24 It's like integers, for example, are super hard because integers.

20:29 And I'm not sure, I mean, even beyond that, just the infrastructure, you know, there's all sorts of great infrastructure, especially in Python and R for data manipulation and cleaning and analysis.

20:40 And those libraries just, I don't think are quite there for JavaScript.

20:45 Right.

20:46 Where's the pandas for JavaScript?

20:47 Does it exist?

20:48 It may, I don't know, actually.

20:49 Or the tidyverse.

20:50 Again, it may exist, but it certainly doesn't have, I think, as many folks working on or using it as you do in the Python or the R worlds.

20:58 Yeah, it's interesting that it exists.

20:59 Okay, so that sort of sets the stage of this world.

21:03 And your PhD goal was to go and study that world and actually understand how people use these computational notebooks and if they're really fulfilling their promise of becoming like a computational narrative or what are people doing this, right?

21:20 So maybe tell us more about your research.

21:23 There's this Atlantic article that came out and made this declaration of the science.

21:27 I want to encourage people to check out that article.

21:30 We're going to put it in the show notes.

21:31 It is, I think I've talked about this before on Python Bytes, my other podcast.

21:35 But anyway, it's really provocative, right?

21:39 Like there's a scientific paper that's literally on fire, like an animated fire on the homepage.

21:45 It's quite something.

21:46 And it's a big change.

21:48 Yeah, no, it makes some bold claims.

21:50 It roasts Stephen Wolfram quite a bit.

21:54 And then it goes into kind of the history of Mathematica and then also Python and these different views of like a closed type ecosystem like Mathematica where you make your own language or an open one like the Python community where anybody can write a library and contribute.

22:10 But yeah, so one of the things in this article, and really this just reflects some of the zeitgeist right now, is this notion that, well, in the future, we're no longer going to be sending around scientific results just in a dead PDF because that doesn't give you all the information you need to reproduce science.

22:27 You know, the analyses we're doing today are so complex that you can't just read a three sentence description of it and know what to do.

22:34 Yeah, I think I've heard somewhere some kind of quote saying your academic paper that you write about your computational results is like advertising for that computational result, but it's not actually the research.

22:48 Like the software is the research in a sense, right?

22:51 So why are these two separate things?

22:53 Yeah, I know, I think you had some talks with folks behind the Journal of Open Source Software.

22:57 Yeah, I think that's where that came from is that one.

22:59 Yeah.

22:59 And just like so much of the work and research now is software development, whether it's for building a package to do a certain type of analysis or that particular analysis itself.

23:10 You know, this article and others are saying really what we need is a new medium where you can share all of the code used to run the analysis because so much of analysis is now happening via programming, not in Excel.

23:22 Not with SPSS or other packages for stats.

23:26 But if you just share the code, that's really not understandable either.

23:29 You know, we all hope to comment it well.

23:31 And so what you need is this mixed narrative.

23:34 And that's what notebooks give.

23:35 You can write a line of markdown text that explains what the notebook's doing.

23:38 And then you can just build up your argument and show how you collected, analyzed, and did this data.

23:44 And so a lot of folks are saying, hey, scientific paper is dead.

23:47 Notebooks are the new medium.

23:49 And look, millions of people are using them.

23:51 And we really wondered, is that the case?

23:54 You know, is the scientific paper dead?

23:57 How are people using these notebooks?

23:59 Because despite being around for decades and having millions of users, we know very little about how people actually use them in their day-to-day work.

24:07 So we set out to kind of understand better how are people actually using these things?

24:11 Is there much rich computational narrative?

24:14 You know, can we read these things and understand the analysis?

24:18 Or are people really just using them because they're a nice iterative development environment?

24:22 And so that's kind of what got us started down this path.

24:25 Right. Is it a place where you load some data and you just sort of iteratively explore it and interact with it?

24:32 Or are you actually trying to put something like a paper together, right?

24:36 Like the next version of that.

24:38 Those are really compelling reasons to use notebooks.

24:41 You know, having this really tight REPL that will let you iterate and just explore some data.

24:47 Or having it be this really well curated explanation that you can share with others.

24:52 Yeah, this idea of a REPL is really handy and nice.

24:55 But sometimes it's just super hard to go back to what you want.

24:58 You know, you're like, well, it's 20 things back.

25:00 And this, you know, a bunch of them are like 10 lines long.

25:02 So you got to go.

25:03 It's just, you know, it's not very, very nice to interact with it, say, off of the terminal in some situations.

25:10 But this, this is perfect, right?

25:11 You can go back and read it.

25:12 You know, just jump to where you want by touching it.

25:14 It's great.

25:15 So how did you, how did you get your data?

25:18 You, how did you find these notebooks to study?

25:21 Just like go ask a couple of people, you know, or what, what do you do?

25:24 The first part of the research that we did was trying to just get a big sample of notebooks.

25:29 So we said, one way that we can tackle this problem is just try and get a bunch of notebooks and look at them and see what the content is.

25:36 So we ended up actually scraping all of the Jupyter notebooks that were on GitHub at the time of the study, which was about a year ago.

25:43 How many is that?

25:44 So that was a little over a million, about one and a quarter million notebooks they had on there, which that was a fun process of working around rate limiting to get that to work.

25:54 Yeah.

25:55 So tell us how do you do that?

25:57 I mean, was that GitHub API?

25:59 Was that web scraping?

26:00 What was the flow there?

26:01 Yeah, and it was a mix of both.

26:03 So GitHub has a great and well-documented API.

26:06 But in order to do what we wanted to do, we had to abuse it a little bit.

26:10 They don't really want you looking for just one specific file type.

26:14 So you can't really just search and say, show me all the files of this type and download all of them.

26:20 You need to give it some other parameter set.

26:22 So we actually had to go and say, OK, give me all the Jupyter notebooks between, you know, zero and 100 kilobytes.

26:28 OK, between 100 and 200 and kind of iterate them through that way, both to get a list of all of the notebooks.

26:36 And because they limit, we're only going to send you a thousand results.

26:40 So even if you can look and see that there's a million of these, we'll only send you the first thousand.

26:45 So we had to restrict our query down to get it in packets that were small enough.

26:50 I see.

26:50 You had to come up with an arbitrary filter criteria that would get it below a thousand.
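
Note: the size-bucketing approach described above can be sketched roughly as follows. This is a minimal illustration, not the study's actual script. The `count_results` callback is a hypothetical stand-in for asking the search API how many notebooks fall in a byte range, and the `extension:ipynb size:lo..hi` query shape is an assumption about how such a search might be phrased.

```python
from typing import Callable, List, Tuple

MAX_RESULTS = 1000  # the per-query result cap mentioned in the episode


def split_size_ranges(
    lo: int,
    hi: int,
    count_results: Callable[[int, int], int],
) -> List[Tuple[int, int]]:
    """Split the inclusive byte range [lo, hi] into sub-ranges that each
    report at most MAX_RESULTS hits (or cannot be split any further)."""
    if lo == hi or count_results(lo, hi) <= MAX_RESULTS:
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return (split_size_ranges(lo, mid, count_results)
            + split_size_ranges(mid + 1, hi, count_results))


def to_query(lo: int, hi: int) -> str:
    # Assumed query shape: restrict by file extension and file size in bytes.
    return f"extension:ipynb size:{lo}..{hi}"


if __name__ == "__main__":
    # Fake counter for demonstration: pretend 10 notebooks exist per byte size.
    fake_count = lambda lo, hi: 10 * (hi - lo + 1)
    for lo, hi in split_size_ranges(0, 399, fake_count):
        print(to_query(lo, hi))
```

Halving a range only when it exceeds the cap keeps the number of queries small where notebooks are sparse and narrows the buckets where they are dense.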

26:55 OK.

26:55 Yeah.

26:57 And then from there, it's just a lot of learning how to be a good citizen with GitHub servers and respect when they say, OK, you're making too many queries and slow down.

27:07 It took a couple of weeks, but we eventually got the full data set that way.
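
Note: being "a good citizen" with a rate-limited API usually means reading the server's hints and pausing before retrying. The sketch below is an illustration under assumptions, not the code used in the study. It relies on the `retry-after`, `x-ratelimit-remaining`, and `x-ratelimit-reset` response headers that GitHub's REST API documents; the function name and the 60-second fallback are made up for this example.

```python
import time
from typing import Mapping, Optional


def seconds_to_wait(status: int, headers: Mapping[str, str],
                    now: Optional[float] = None) -> float:
    """Return how long to pause before retrying, based on the response."""
    now = time.time() if now is None else now
    # Header names are case-insensitive, so normalize them first.
    h = {key.lower(): value for key, value in headers.items()}

    if status not in (403, 429):
        return 0.0  # not a rate-limit response, no pause needed
    if "retry-after" in h:
        return float(h["retry-after"])  # server says exactly how long to wait
    if h.get("x-ratelimit-remaining") == "0" and "x-ratelimit-reset" in h:
        # The reset time is given as epoch seconds.
        return max(0.0, float(h["x-ratelimit-reset"]) - now)
    return 60.0  # hypothetical conservative fallback when no hint is given
```

A download loop would call `time.sleep(seconds_to_wait(...))` after each throttled response and then retry the same request.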

27:11 And then afterwards, we had to do some web scraping to get the files themselves.

27:16 So we essentially first got a list of what are all the notebooks and where are they?

27:19 And then used some web scraping to get the files themselves.
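
Note: the episode does not say exactly how the files were fetched once the list was built. One plausible step, shown here purely as an assumption, is turning a notebook's GitHub page address into the address of the raw file so it can be downloaded as plain JSON. The owner, repository, and path in the usage line are invented.

```python
from urllib.parse import urlsplit


def raw_url(html_url: str) -> str:
    """Convert https://github.com/<owner>/<repo>/blob/<ref>/<path>
    into the matching raw.githubusercontent.com address."""
    parts = urlsplit(html_url)
    segments = parts.path.strip("/").split("/")
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub file page: {html_url}")
    owner, repo, _blob, *rest = segments  # rest = ref followed by the file path
    return "https://raw.githubusercontent.com/" + "/".join([owner, repo, *rest])


if __name__ == "__main__":
    print(raw_url("https://github.com/someuser/somerepo/blob/master/analysis/demo.ipynb"))
```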

27:22 Yeah, that's pretty interesting.

27:23 And it only took a couple of weeks.

27:25 I mean, on one hand, that's a long time.

27:26 On the other, you've gone out and gathered all of these millions of notebooks from all these sources.

27:33 And that's pretty amazing, actually.

27:35 No, I think it's fascinating that, I mean, GitHub provides the tools for us to be able to do something like this.

27:42 And really only asks us in return that we make the data in any publications open afterwards, that you can use their API to really study how people all over the world are using a tool set.

27:54 So it's a great way to get a massive and diverse sample size.

27:58 Yeah, I think GitHub is kind of a special place, right?

28:01 I mean, there's lots of source control and versioning and issue tracking places in the world, but GitHub stands alone

28:07 in its sort of reach and just people using it and so on.

28:11 Yeah.

28:11 So what was, I guess, how do you go about studying it?

28:15 Like, you're studying Jupyter Notebooks.

28:17 Did you actually use Jupyter Notebooks to study your Jupyter Notebooks?

28:20 This is exactly it.

28:21 Yeah, so this is where it gets super meta.

28:22 So we download this whole data set.

28:25 And for those who don't know, Jupyter Notebooks themselves are really just a JSON file.

28:29 They have a different ending, .ipynb, but under the hood, it's just a JSON file.

28:35 So we essentially have a million JSON files to look at.

28:39 And so we just spun up our own Jupyter Notebooks, imported that data set, started making some data frames using the Python ecosystem.

28:47 And so analyzed it that way.
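
Note: because a notebook is just JSON, the standard library is enough to open one and look inside. The sketch below assumes the version 4 notebook layout, where cells sit in a top-level `cells` list and each cell has a `cell_type`; older files may be laid out differently. The sample notebook is made up, and `json.dumps` stands in for reading an `.ipynb` file from disk.

```python
import json
from collections import Counter

# Toy notebook standing in for the contents of one .ipynb file (made-up data).
SAMPLE_NOTEBOOK = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Load data\n", "Read the CSV."]},
        {"cell_type": "code", "metadata": {}, "execution_count": 1, "outputs": [],
         "source": ["import pandas as pd\n", "df = pd.read_csv('data.csv')"]},
        {"cell_type": "code", "metadata": {}, "execution_count": 2, "outputs": [],
         "source": "df.describe()"},
    ],
}


def cell_type_counts(notebook_text: str) -> Counter:
    """Parse the raw text of an .ipynb file and count its cells by type."""
    notebook = json.loads(notebook_text)
    return Counter(cell.get("cell_type", "unknown")
                   for cell in notebook.get("cells", []))


if __name__ == "__main__":
    print(dict(cell_type_counts(json.dumps(SAMPLE_NOTEBOOK))))
```

Per-notebook summaries like this can then be collected into rows of a data frame for the kind of analysis described here.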

28:49 So in the end, it's come full circle because the Notebooks that I used for my analysis of Notebooks on GitHub are now hosted on GitHub.

28:58 Completely available.

28:59 So you will be a data set in the follow-up replication study.

29:03 Is that what you're saying?

29:03 Yeah.

29:04 Yeah.

29:05 How interesting.

29:08 This portion of Talk Python is brought to you by Studio 3T, the IDE for MongoDB.

29:13 NoSQL databases offer maximum flexibility.

29:16 But what if you could combine the benefits of MongoDB with the benefits of SQL?

29:21 With Studio 3T, you can.

29:23 With their innovative SQL query feature, you can write SQL joins and expressions to query MongoDB.

29:29 And the best part is you get to see how your SQL queries translate to MongoDB's native query syntax with the click of a button.

29:35 You can create MongoDB queries, aggregation statements, and SQL queries.

29:40 And 3T's novel query code will automatically generate code for you in a variety of languages like Python, JavaScript, and even C#.

29:47 Studio 3T also offers the richest coding experience with its full-featured IntelliShell.

29:53 It's the built-in MongoDB shell interface with smart auto-completion of collection names, shell methods, document key names, operators, and field names.

30:01 By using in-place editing within a collection, it's even easier to edit your documents.

30:05 Try Studio 3T and see why it's used by Fortune 500 companies like Nike, Tesla, Formula One, Comcast, and many more,

30:12 saving enterprise users countless hours of development time.

30:16 Visit talkpython.fm/studio to get a free one-month trial.

30:21 That's talkpython.fm/studio.

30:23 What are some of the findings or things you got by studying these?

30:29 We talked about this tension between an iterative REPL and then an explanatory narrative or computational narrative.

30:36 What did you find?

30:36 One of the big headlines from this bit of research was that very few of the notebooks had what you might consider the baseline requirement for a rich narrative,

30:47 which is just any explanatory text.

30:49 So over a quarter of the notebooks had no markdown text at all.

30:54 So they were just code or code blocks.

30:56 And then even of those that had text, they were pretty short.

31:00 So like the median was about 150 words.

31:02 So just a really short blurb.
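
Note: the two figures mentioned here (the share of notebooks with no markdown, and the median length of the text when there is some) could be computed along these lines. This is a hedged sketch, not the study's code: it counts whitespace-separated words, which may differ from how the researchers counted, and it takes notebooks as already-parsed dictionaries.

```python
import statistics
from typing import Dict, List


def cell_source(cell: dict) -> str:
    """A cell's source may be stored as one string or as a list of lines."""
    source = cell.get("source", "")
    return "".join(source) if isinstance(source, list) else source


def markdown_word_count(notebook: dict) -> int:
    """Total whitespace-separated words across all markdown cells."""
    return sum(len(cell_source(cell).split())
               for cell in notebook.get("cells", [])
               if cell.get("cell_type") == "markdown")


def summarize(notebooks: List[dict]) -> Dict[str, float]:
    """Share of notebooks without markdown, and median words where present."""
    if not notebooks:
        raise ValueError("need at least one notebook")
    counts = [markdown_word_count(nb) for nb in notebooks]
    with_text = [count for count in counts if count > 0]
    return {
        "share_without_markdown": counts.count(0) / len(counts),
        "median_words_when_present": statistics.median(with_text) if with_text else 0,
    }
```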

31:04 Well, maybe those blurbs are almost like comments.

31:06 They just are in text cells instead of in code with a hash.

31:09 Yeah.

31:10 Yeah.

31:10 And they'd often be like, okay, import data, model data, really descriptive of the steps.

31:17 And I think for us that, you know, hints that more, you know, not that this is a bad use of the notebook,

31:22 but more they're being used as an interactive programming environment with some light, loose notes rather than this view of,

31:29 oh, there's this really rich description, like a scientific paper of what people did.

31:34 I think there are a host of reasons for that.

31:36 But that was one of the major findings for us from this.

31:39 Yeah.

31:39 So how much of this is just people happen to be using and storing their notebooks on GitHub versus they intend other people to consume those?

31:49 Yeah.

31:50 You know, it's hard to say for sure, but we think a lot of it is just we're going to throw it up as a repository for myself.

31:56 And I'm not really expecting others to use it.

31:59 We did an analysis where we actually looked at the descriptions of the GitHub repos where these notebooks lived.

32:05 So like how are people describing these projects and looked for keywords.

32:10 And when we remove things like notebook or GitHub from that keyword search, the top words are things like machine learning, Kaggle, Udacity, Nanodegree.

32:20 And so that really showed us that a lot of these seem to be people learning how to do data analysis, learning machine learning in particular,

32:29 and doing online education assignments and then hosting their results up online, whether that's as a form of submission

32:39 or for a resume or portfolio building exercise.

32:43 But a lot of these seem to be educational.
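
Note: a keyword count over repository descriptions, as described above, might look something like this. The ignore list is an assumption: the episode only mentions removing words like "notebook" and "GitHub", and the common filler words here were added for the example. The descriptions in the usage lines are invented.

```python
import re
from collections import Counter
from typing import Iterable, List, Optional, Tuple

# Words to leave out; only "notebook" and "github" come from the episode.
IGNORED = {
    "notebook", "notebooks", "github", "jupyter", "ipython",
    "a", "an", "the", "for", "and", "of", "my", "in", "on", "to", "with",
}


def top_keywords(descriptions: Iterable[Optional[str]], n: int = 10) -> List[Tuple[str, int]]:
    """Count lowercase words across descriptions, skipping ignored words."""
    counter: Counter = Counter()
    for text in descriptions:
        if not text:  # many repositories have no description at all
            continue
        words = re.findall(r"[a-z]+", text.lower())
        counter.update(word for word in words if word not in IGNORED)
    return counter.most_common(n)


if __name__ == "__main__":
    sample = [
        "Jupyter notebooks for my Udacity machine learning nanodegree",
        "Kaggle machine learning competition notebooks",
        None,
    ]
    print(top_keywords(sample, n=3))
```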

32:45 Interesting.

32:45 I can see a lot of people who are students taking courses at a university or something, and their professor says,

32:52 all right, what we're going to do is going to create a repo for your course,

32:55 and everybody's going to put their assignment and just share it with me or make it public or something.

32:59 Our search excluded forked repos and forked notebooks.

33:03 So at least one form of distribution like that should have been excluded.

33:07 But yeah, I still think a lot of this is course assignments like that.

33:11 Did you do any refining where you say, well, let's look at repositories that have over a thousand stars or a lot of followers,

33:19 or just the ones that are not clearly sort of private?

33:23 We tried to look for ones that seem to get reused a lot.

33:26 And in fact, the motivation for us was, well, let's see if we can find best practices in notebooks, right?

33:33 Like if we can find ones that are in repositories that were starred a lot or forked a lot,

33:38 maybe that means that they were really useful and we can kind of glean some best practices on notebook design from this.

33:45 Right.

33:46 Maybe even a lot of PRs, they're getting like polished.

33:48 Exactly.

33:49 And one of the things we found when we tried to do that was many of these notebooks that were in highly starred or forked repositories were just tutorials for various software packages.

34:01 So as an example, it could have been, you know, something like here's pandas up on GitHub.

34:06 And then here's the notebook as documentation showing how to use pandas.

34:11 But the reason why this repository is so starred and forked is that people really like using pandas,

34:17 not because the notebooks themselves were all that insightful.

34:20 Yeah, that's interesting.

34:21 I guess it is a really nice way to have code mixed with description on GitHub because GitHub renders and executes those now.

34:29 Yeah, and that's one of the things, as we were initially doing this research, that people said why they're using Jupyter notebook and why they're putting it on GitHub is,

34:36 hey, a manager that I have, you know, they don't really want to install the software and set up an environment,

34:41 but I can just send them a link and they can see the notebook statically.

34:44 And that's a really nice way to share results.

34:47 Yeah, that's great.

34:47 Even managers can run web browsers.

34:50 Uh-huh.

34:51 Uh-huh.

34:51 Yeah.

34:52 I think one of the other interesting things that we looked at was in a testament to the Python ecosystems,

34:57 we said, okay, what are the packages that people are importing?

35:00 And just finding that the vast majority, around 90% or so of these notebooks,

35:05 are importing external packages.

35:07 And things like pandas, numpy, matplotlib were imported in two-thirds or more of them.

35:14 So just the data science infrastructure that's being provided is a really core component.

35:20 It's not just having the notebook like Jupyter.

35:23 It's having the Python ecosystem to be able to do it.

35:25 It's the foundation.
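
Note: one simple way to find which packages a notebook imports is to scan its code cells line by line. The sketch below is an assumption about how this could be done, not the study's method. It uses a regular expression rather than Python's parser because notebook cells often contain lines such as `%matplotlib inline` that are not valid Python. It only captures the first package on a line, so `import numpy, pandas` would register just `numpy`.

```python
import re
from typing import List, Set

# Matches "import pkg" and "from pkg import ...", capturing the top-level name.
IMPORT_RE = re.compile(r"^\s*(?:import|from)\s+([A-Za-z_][A-Za-z0-9_]*)")


def imported_packages(notebook: dict) -> Set[str]:
    """Top-level package names imported anywhere in the notebook's code cells."""
    found: Set[str] = set()
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        text = "".join(source) if isinstance(source, list) else source
        for line in text.splitlines():
            match = IMPORT_RE.match(line)
            if match:
                found.add(match.group(1))
    return found


def share_importing(notebooks: List[dict], package: str) -> float:
    """Fraction of notebooks that import the given package at least once."""
    if not notebooks:
        raise ValueError("need at least one notebook")
    return sum(package in imported_packages(nb) for nb in notebooks) / len(notebooks)
```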

35:26 Yeah, that's really awesome.

35:27 What about things like R and JavaScript and stuff?

35:31 Were you able to figure out, well, how much is Python?

35:33 How much is other stuff?

35:34 The notebooks themselves have a tag for what the language is in there.

35:38 And the vast majority of those, like 96%, were written in Python.

35:43 Whereas things like R and Julia, which is why Jupyter is named Jupyter, the combination of Julia, Python, and R, each accounted for about a percent.

35:53 And then there was a long tail of other languages that were supported.

35:56 But by and large, it was Python.
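
Note: the language "tag" lives in the notebook's metadata. Which field the study read is not stated in the episode; the sketch below assumes the version 4 layout, checking `metadata.language_info.name` first and falling back to `metadata.kernelspec.language`. The sample notebooks in the usage lines are invented.

```python
from collections import Counter
from typing import Dict, List


def notebook_language(notebook: dict) -> str:
    """Best-effort language name from notebook metadata, lowercased."""
    metadata = notebook.get("metadata", {})
    language = (metadata.get("language_info", {}).get("name")
                or metadata.get("kernelspec", {}).get("language"))
    return language.lower() if isinstance(language, str) else "unknown"


def language_shares(notebooks: List[dict]) -> Dict[str, float]:
    """Fraction of notebooks per language."""
    counts = Counter(notebook_language(nb) for nb in notebooks)
    total = sum(counts.values())
    return {language: count / total for language, count in counts.items()}


if __name__ == "__main__":
    sample = [
        {"metadata": {"language_info": {"name": "python"}}},
        {"metadata": {"kernelspec": {"language": "R"}}},
        {"metadata": {}},
    ]
    print(language_shares(sample))
```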

35:59 And kind of surprising for us.

36:01 96%?

36:02 That's actually higher than I would have even guessed, to be honest.

36:05 No.

36:06 And again, I don't know if it's because so many of these are educational.

36:09 And that's just a good language to teach in or what the reason may be.

36:15 But it still is strongly reflecting its IPython roots of being kind of a Python-first environment.

36:22 For sure.

36:22 It sounds like it.

36:23 So were you able to find a subset of what you might call academic papers or like these narrative type things and analyze those?

36:31 That leads into the second line of work we did on this is, okay, pull down all of these notebooks.

36:35 We've looked at them.

36:36 Very few of them have this rich description and seem to be more just using notebooks as a nice iterative environment.

36:44 But what if we're just looking at, you know, the wrong subset of notebooks?

36:47 Many of these seem to be for education.

36:50 It may be that people are just hosting these for themselves.

36:53 Yeah.

36:53 You probably don't want to look at a notebook that a student has like known programming for 10 days as a, how should we do things?

37:01 Exactly.

37:02 That's exactly it.

37:03 So like, okay, like, let's be a little humble with the limitations of the data set and our assumptions of it.

37:09 So we ended up saying, well, what if we look at what some consider like the creme de la creme of doing and presenting analysis?

37:16 What if we look at notebooks that are supplementing academic publications?

37:21 So this is back to that Atlantic article saying, you know what, in the future, scientific papers should just be in notebooks.

37:27 And there's a number of folks who have jumped on that bandwagon and said, you know, I may publish something in Science or Nature or one of the big journals, but I'm going to link to here's the notebook that I use for that analysis.

37:39 So that people can retrace it, recreate it, fork it, and continue the analysis themselves.

37:45 And as mentioned earlier, there are some in the academic community who have a real strong kind of evangelistic fervor around the need to share the results in this way.

37:55 So we ended up looking for those specifically.

37:58 Interesting.

37:58 And what did you find there?

37:59 Any differences?

38:00 Probably, right?

38:01 Yeah.

38:01 So we ended up finding, again, many of these are on GitHub.

38:05 So we pulled about 150 of them.

38:07 And we used a slightly different method where rather than just doing a big data analysis, we hand coded it.

38:12 And we wanted to see kind of with finer grain detail.

38:15 I looked at it and I put it in categories, not like typing programming, right?

38:19 Exactly.

38:21 So putting the categories, kind of iterating those over time, and then making sure you have other people who can validate.

38:27 That, yes, that's a valid category, not just something that you came up with.

38:31 Nice.

38:31 Okay.

38:31 So you coded these by hand and probably got slightly different results, I guess.

38:35 Okay.

38:36 So these notebooks have a little more text in them.

38:38 But surprisingly, they're not using that text really to describe the analysis in any rich way.

38:44 So of the notebooks that had any text in them, which was still not all of them, the majority would use that text just to describe the steps of the analysis.

38:53 Importing data, fitting model.

38:56 Back to the comments as text rather than comments as code comments.

38:59 Yeah.

39:00 Okay.

39:00 And then only about a third of them have what we might consider a rich description.

39:04 So any description of why they did the analysis in a particular way.

39:08 So, oh, I fit a linear model because these assumptions are met, or we tried this other model and it didn't work.

39:14 Or interpreting results.

39:16 So, you know, if you look at this plot, you'll notice this outlier here.

39:20 Most would just leave the end figure like it spoke for itself, often without axes labeled.

39:26 And just say, here's how we got the result.

39:29 And not really describe what they thought it meant or how they got there.

39:33 So that for us was surprising.

39:34 Again, thinking, well, academics will want their work to be easily understood and replicable, that so many of these still kind of fell short of Fernando Perez, Brian Granger's, and others' vision of this rich computational narrative.

39:47 How interesting.

39:48 Yeah.

39:48 So what did you do?

39:49 You go ask them, like, why didn't you write?

39:51 Why didn't you write more?

39:53 So in a way we did.

39:54 Not with the folks who had posted notebooks there, but we ended up finding people around campus at UC San Diego who are using notebooks.

40:03 Again, some of these labs where they have big biology analyses or genomics work and where using notebooks is kind of a way of life.

40:10 So we went and talked to some of those people.

40:12 So we ended up finding 15 of them and just walk them through, show us a notebook you've been working on, which was great.

40:19 We got to see kind of in progress work rather than just here's my finished product that I've posted on GitHub.

40:25 Yeah, that's cool.

40:26 And so were they also more using it as like REPL type explorative stuff or what was the story there?

40:33 It was exactly that.

40:34 Again, people were using it for this iterative environment.

40:37 Some would talk about it.

40:38 It's my coding playground where I get to test out ideas, but it's a very personal thing.

40:44 You know, it's reflecting my style of programming.

40:47 And, you know, I'm not going to take time to clean it up for others because maybe they don't want to see it.

40:51 I was going to say, it seems like a really great thing for people to create these notebooks.

40:56 And then if you're meeting with your research group to pull it up and everybody look at it and kind of walk through it.

41:02 I wonder how much that played into it.

41:05 Like, here, look what I've been doing this week.

41:06 And here, let me show you the results and how I got it and so on.

41:09 Yeah, we have the same intuition as well.

41:10 And it seems like that's not the case.

41:13 And in fact, one of the folks that we talked to said, you know what?

41:19 I've tried this in lab meetings and people just think I didn't take time to prepare.

41:23 They think I'm just showing up by the skin of my teeth.

41:26 It's winging it.

41:26 Winging it.

41:27 And unless I have a slide deck put together that these aren't solid results or that I didn't take time to think about what you might want to see.

41:36 So it's this really interesting case where there's kind of this entrenched practice of you must present from slide decks or else it means this or that about your work that got in the way of using notebooks as a presentation medium.

41:50 Huh.

41:50 Interesting.

41:51 I wonder if it's a chicken and egg thing.

41:53 Like, if they were really beautifully formatted and descriptive, maybe that's a really great presentation.

41:59 But if they're sloppy, like, in and of themselves, maybe the presentation feels sloppy.

42:03 Or just the social expectation in practice.

42:06 Like, what if labs just expected that rather than taking time to create a slide deck, that you would take the time to document your code in the notebook?

42:16 That seems like that's more reusable and valuable over time.

42:20 Right?

42:20 Because, I mean, I've not refactored or reused slides that much.

42:24 Uh-huh.

42:26 Okay.

42:26 Wow.

42:28 What do you think needs to be done for notebooks to reach their potential, right?

42:34 To become this thing that would actually sort of validate the burning paper on the Atlantic?

42:39 I first want to say, like, I and the researchers who work with me think notebooks are fantastic.

42:45 I wouldn't want anyone coming away from the podcast saying, oh, you know, notebooks are done.

42:50 They're not the right way to do analysis.

42:53 I think they're the best thing that we've got going.

42:55 And they're a vast improvement over prior ways of doing data analysis, which was often having, you know, script one in a file, script two in another file, script 2.5.

43:06 Version control is usually just naming copies of the files.

43:10 Exactly.

43:10 Yeah.

43:10 It's a much better way for version control and tracking steps of the analysis.

43:14 And, you know, our research has demonstrated it's a fantastic way to iteratively do the process of analyzing data, especially with Python.

43:23 And so many people are using it for that.

43:25 For us, the real question is now, how do we make this wonderful programming environment also a wonderful presentation environment?

43:33 One where it's easy to share results with others and to support kind of collaboration in that way.

43:40 And I don't know a silver bullet to get it done.

43:43 I think there's things that we can do to tweak the design and how people use the notebooks.

43:49 But there also have to be some social changes.

43:51 You know, things like labs expecting the presentations will be from notebooks or journals expecting submissions in notebook format rather than PDF.

44:00 So I think it'll be a mix.

44:01 Interesting.

44:02 How much do you think like software carpentry type of stuff is involved here?

44:08 Like bringing a little bit of the CS side of things to the researchers?

44:12 Yeah.

44:13 No, I think that's really vital.

44:14 And I think the model that software carpentry has of doing the workshops and dedicated time to training these best practices.

44:23 One of the really standout insights from our last line of work, the interviewing researchers, was that one researcher mentioned that when she had started as a biology student in biology or chemistry labs, she was trained in a very specific way of tracking her results.

44:41 You know, this is how you write your name and the date and the reagents that you're going to use for this experiment and the steps.

44:47 And if she didn't do it in a particular disciplined way, she'd get docked points by her teaching assistant or professor because this was just the practice of how you document and share a biology or chemistry lab.

44:59 And she said, you know, we don't really have that for notebooks.

45:02 I came into the lab.

45:04 I was shown a notebook and told, have fun.

45:07 You type here.

45:08 Exactly.

45:09 But I've had to figure out like, oh, I can create my own packages and I can import external files.

45:15 And, oh, I can move all of the import statements to the very top of the notebook and, you know, have to kind of figure out best practices on their own.

45:23 So I think there's a lot of best practices, both from software development and engineering and from data analysis that have grown up in the last few years that can be brought to bear through things like data carpentry or software carpentry workshops that aren't there yet.

45:39 There's been a lot of progress there, but it also seems like, you know, there's always more progress to be made.

45:44 And I guess just from an academic perspective, every year you start over in a sense, right?

45:51 Like every year there's a new grad student who's fresh and they've never done this and they're at the lab and you've got to bring them under the fold, right?

45:58 So there's also this sort of mentoring aspect.

46:00 It'd be really interesting to look at how would our results be different if we'd done this out in enterprise, right?

46:05 Like a lot of what we looked at ended up being academic domain, whether it's all of these students working on notebooks that we found on GitHub or looking at folks who are in an academic lab environment.

46:16 And that's largely a factor of who we have access to and who's willing to share their notebooks publicly.

46:21 The rest of them, they're in these buildings right around campus.

46:23 We'll just go talk to them.

46:24 Uh-huh.

46:24 But looking at, you know, I think there's similar issues of turnover within organizations or handing off a project from somebody who's a senior analyst to a more junior, less experienced analyst.

46:36 And onboarding and disciplined practice.

46:39 Folks like Hilary Parker over at Stitch Fix had talked a lot about opinionated data analysis and having a disciplined way of tracking or sharing or reviewing data analyses and needing to develop that in enterprise.

46:52 That's pretty cool.

46:53 I wonder how accessible that information is because I know some of these companies are somewhat open about what they do.

47:01 But, again, this data analysis, this is partly what drives their company and they're not going to just give it away.

47:06 It's not like I can go to Goldman Sachs and go, hey, why don't you just publish your notebooks for your analysis?

47:11 Because that would be so interesting.

47:13 Yeah.

47:13 Like, yeah, we're not doing that.

47:15 Yeah.

47:15 I think the domain where I've seen outside of academia, the most open sharing of analyses is journalism.

47:21 So folks like FiveThirtyEight or BuzzFeed or others who have published a number of notebooks online and even on GitHub.

47:30 So many of them are in our data set.

47:31 FiveThirtyEight has got an amazing set of data on GitHub.

47:35 So FiveThirtyEight spelled out with letters, slash data, I think it is.

47:40 Really interesting use of notebooks.

47:41 And, again, for them, it's kind of the incentives that we want to be open and show that we're a reputable organization and have others validate the claims that we make in our articles.

47:50 Whereas, you say, for companies, it's kind of the keys to the kingdom of this is how we generate value is our data and our building on top of it.

47:58 Yeah.

47:59 And I guess with journalism, once you publish it, you stake your claim to it.

48:03 Right.

48:03 But in business, like your sales model one year could be everyone's sales model next year.

48:08 Right.

48:08 That's not the same.

48:09 Yeah.

48:09 No one gives you much credit.

48:10 Yeah.

48:12 Yeah, for sure.

48:12 So what do you think about the future?

48:15 Like, we talked a little bit about this, but maybe packaging these things up, it's still maybe a little bit difficult to say, here's my notebook and go run it with everything that you need.

48:26 Should we look at containers?

48:29 What do you see going forward?

48:30 Notebook environments like Jupyter Notebook are going to increasingly become the core infrastructure for data analysis, both in industry and academia and journalism.

48:40 They've just proven so valuable as an iterative programming environment and for presenting results, though it still takes a lot of time to clean it up.

48:48 But yeah, I think for the future, there's a lot of work that will have to be done in a lot of different domains before notebooks get more widely used.

48:55 And as you referenced, some of it is in packaging the environment.

48:59 So theoretically, you can use notebooks to rerun somebody else's analysis.

49:03 But what if you don't have the same access to their data, either because of human subjects restrictions or it's just a lot of data or stored away on a server or your version of the libraries are slightly different?

49:16 So I think there's a lot of work on how do we containerize and package up not only the programming environment and the language at this point in time, but also the data.
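
As a rough illustration of the library-version problem described here, a notebook can record the versions it ran against so a later reader can at least spot a mismatch. This is a minimal sketch, not a method from the study: the helper names `snapshot_versions` and `diff_versions` and the package list are hypothetical, and it relies only on the standard library's `importlib.metadata` (Python 3.8+).

```python
from importlib import metadata
import platform


def snapshot_versions(packages):
    """Return a dict mapping each package name to its installed version.

    Packages that are not installed map to None instead of raising.
    """
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def diff_versions(recorded, current):
    """List the names whose version differs between two snapshots."""
    return sorted(
        name for name in recorded
        if recorded[name] != current.get(name)
    )


# Hypothetical package list; replace with whatever the notebook imports.
snapshot = snapshot_versions(["numpy", "pandas", "matplotlib"])
print(snapshot)
```

Saving such a snapshot next to the notebook does not solve the data-access side of the problem, but comparing it with `diff_versions` on another machine makes "your version of the libraries is slightly different" visible before results are compared.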

49:25 That's interesting.

49:26 I think that gets into questions about, yeah, differential privacy and incentivizing data curation.

49:33 Right now, that's not really recognized often in academia, at least, as a contribution to the field.

49:41 Yeah, it seems like the cloud computing stuff that we touched on at the beginning helps somewhat with that, right?

49:47 Like the Azure notebooks, the Google notebooks, and Sage notebooks, and so on.

49:52 Like, you can share one of those.

49:54 But at the same time, I guess eventually maybe those things upgrade the packages that you have access to.

50:00 And that could theoretically change your results.

50:02 Like, oh, we fixed a bug in this analysis thing, which now doesn't look the same.

50:06 Who knows?

50:06 In discussions with some folks from library sciences, they'll say, you know, for as much flak as things like the PDF document get as a way of presenting results, PDFs and paper have proved to be a really stable way to share insights and knowledge.

50:21 Yeah, the versioning isn't nearly as touchy.

50:25 Like, we haven't had an issue of, you know, pulling a book off the shelf and it not working anymore, except if maybe the spine broke or something like we did with trying to run code that's even five or ten years old.

50:36 Yeah.

50:36 It's just a much more stable environment for storing knowledge.

50:39 So I think figuring out ways to do that will be really vital for things like notebooks becoming more integral to sharing results over time.

50:47 I agree.

50:48 I think containers have a lot of promise there because they can freeze the whole environment.

50:53 But even them, you know, they're based on some other operating system.

50:57 They've got to run on a certain version of Linux.

50:59 It's not perfect.

51:01 Yeah, it's not perfect.

51:02 Yeah.

51:02 And then as you referenced earlier, I think there's in the future a lot to be done in the realm of like software carpentry, data carpentry, scaling education.

51:12 As much as we can do to tweak these interfaces to make it easier to develop in or to write really clear narratives in, I still think a lot is going to rely on either apprenticeship models or mentoring or training in ways of doing and documenting data analysis.

51:27 So figuring out the right way to do that is a super tricky problem.

51:32 Yeah, it sure is.

51:33 I wonder if we're going to have some more specialization.

51:36 So in the early days, it's like, well, the people who want to do computer programming, they got electrical engineering degrees for whatever reason.

51:43 And then we have the CS degree in computer science.

51:46 Like, I don't know.

51:47 I don't feel like I've done any science really in computer for a really long time.

51:51 But it's, you know, that's what it's called.

51:53 Even though you're not actually stating hypotheses and doing things, you're just building.

51:57 It's more like engineering, right?

51:58 So we then got software engineering degrees that are slightly different applied ways of doing what computer science was doing.

52:05 Maybe we will get like computational scientist degree specializations or something coming along.

52:12 It's like half computer science, but half sort of these other data sides of things you're talking about.

52:17 I think many of the data science institutes that are popping up either online or at universities are trying to figure out like, what of this is specialized and unique to working with data?

52:28 What is it just rehashing things that we've already learned from software engineering and versioning?

52:34 What do we even have to specialize further in that biology's use of programming for data analysis will look fundamentally different from astronomy?

52:42 You know, the practices will be as different as AstroPy and another library are.

52:48 I can see a world where we end up there.

52:51 As computation becomes more and more the foundation of all these different degrees, that each degree is like, no, no, we're going to have a biological computationist degree.

53:01 It's not going to be over in the CS world.

53:04 It's going to be here and we're going to run it and it's going to have some CS, but it's also going to have lots of biology and other aspects.

53:10 Yeah.

53:10 And that's one aspect of future work we haven't gotten to, but I think would be fascinating is looking at how people use notebooks differently in different environments.

53:18 Both like how is enterprise different than academia?

53:21 How are beginning computer science students different from later ones?

53:25 Or, you know, how does astronomy differ from chemistry and how they use these environments?

53:32 So I think that's a fascinating area to look at next.

53:34 It definitely is.

53:35 All right.

53:35 Well, I think that probably is a good place to leave things.

53:39 Okay.

53:39 A pretty optimistic future.

53:41 We're getting a little low on time.

53:42 So maybe, maybe we'll ask you the two questions at the end of the show.

53:45 First of all, if you're going to edit some Python code, what editor do you use?

53:50 I'm really a Jupyter fanboy.

53:52 So used it whenever I need to analyze data, used it for this study.

53:56 And again, really like the environment.

53:58 The exception to that will be when I'm doing more software application development.

54:03 So I've actually, if I'm doing web development and have Python on the server side and JavaScript on the front end, then I'll often use Atom for that.

54:11 Just because I'm switching back and forth between languages.

54:14 Yeah.

54:14 If you have a bunch of different files and they're kind of all working together, especially cross language like CSS, JavaScript, HTML, et cetera.

54:21 All right.

54:21 It's something, Jupyter is not amazing for that, but it is really great for exploring.

54:25 Awesome.

54:26 All right.

54:27 How about a notable PyPI package?

54:29 I think Philip may have mentioned this too, but especially in the data analysis world, Anaconda, I have yet to find a better way just to get people up and running on doing data analysis and quickly package together pandas, numpy, matplotlib, seaborn.

54:44 Especially on Windows.

54:45 Especially on Windows where some of those tools are hard to pip install because like some weird compiler thing is missing.

54:53 Yeah, that's cool.

54:54 If I had to pick a specific one, it'd probably be the visualization libraries.

54:57 Matplotlib or seaborn are really where I spend my time.

55:00 Awesome.

55:01 Yeah, I'll throw one out for you.

55:02 I don't normally do this, but this one is like so relevant.

55:05 I just came across it.

55:05 Have you heard of PixieDebugger?

55:07 P-I-X-I-E Debugger?

55:09 No.

55:10 So PixieDebugger is a visual interactive debugger for Jupyter Notebooks.

55:15 That is fantastic.

55:16 You just include it and it gives you like below your cell, you put a little decorator type magic command onto a cell and then you can just step through, step forward, inspect the variables visually as you're going through that cell.

55:29 It's pretty awesome.

55:30 That is awesome.

55:31 Yeah, so many folks will split cells to try and figure out, okay, where does this thing fail?

55:35 So that's perfect.

55:36 Yeah, it's really, really great.

55:37 All right.

55:38 So people are interested.

55:39 They want to do more with this.

55:41 Maybe look at your research.

55:42 Is the data available?

55:44 What do they do?

55:44 I have a personal website, adamrule.com, that they can go to that kind of links out to everything.

55:49 And that will have copies of the papers documenting my research as well as links to the data repositories.

55:56 How big is the data?

55:57 Is it a lot?

55:58 It's about 600 gigabytes.

56:00 600 gigabytes.

56:02 Wow.

56:02 Thankfully, our university had a very fast internet speed to be able to download all of that.

56:08 Oh my goodness.

56:09 Yeah, yeah.

56:10 No kidding.

56:10 Where do you host it?

56:13 That's actually a non-trivial amount of money if you actually had to put it on S3 or something.

56:17 That's 50 bucks per download.

56:19 Props to UC San Diego again on that.

56:21 Their library has graciously agreed to host that.

56:23 So many other sites that we looked at had limits at like 100 gigabytes or something for data sets.

56:29 So they're hosting that.

56:31 And again, I have a link off of my website.

56:32 But we have both the full data set.

56:34 And then we have a starter data set.

56:37 It's about one to two gigabytes with a subset of those, but with all the different data types.

56:42 And example notebooks that people can use to begin playing with it.
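
For anyone starting with the example notebooks mentioned here, a first pass often just counts cell types. The sketch below is illustrative and is not taken from the study's code. It assumes the standard `.ipynb` JSON layout (a top-level `cells` list whose entries carry `cell_type` and `source`); the sample notebook and the function names are made up so the block runs without any download.

```python
import json
from collections import Counter

# A tiny made-up notebook in the nbformat 4 JSON layout.
SAMPLE_NOTEBOOK = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title\n"]},
        {"cell_type": "code", "metadata": {}, "execution_count": 1,
         "outputs": [], "source": ["import pandas as pd\n"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": []},
    ],
})


def count_cell_types(notebook_json):
    """Count how many cells of each type a notebook contains."""
    nb = json.loads(notebook_json)
    return Counter(cell["cell_type"] for cell in nb.get("cells", []))


def has_explanatory_text(notebook_json):
    """Return True if any markdown cell contains non-whitespace text."""
    nb = json.loads(notebook_json)
    return any(
        cell["cell_type"] == "markdown" and bool("".join(cell["source"]).strip())
        for cell in nb.get("cells", [])
    )


counts = count_cell_types(SAMPLE_NOTEBOOK)
print(dict(counts))
print(has_explanatory_text(SAMPLE_NOTEBOOK))
```

To run the same functions on a real file, read its text with `open(path, encoding="utf-8").read()` and pass that string in; `path` here is a placeholder for wherever a notebook from the starter data set was saved.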

56:46 Oh, that's really cool.

56:46 So they can start to play with it.

56:48 And if they're really committed, they can download 600 gigs.

56:51 Exactly.

56:52 Oh, that's pretty awesome.

56:54 Well, Adam, this is really interesting research.

56:56 And thanks for sharing your view into the whole notebook space.

57:01 Yeah.

57:01 Thanks again for chatting today.

57:02 It's been a pleasure.

57:03 You bet.

57:03 Bye.

57:04 Bye.

57:05 This has been another episode of Talk Python To Me.

57:08 Our guest has been Adam Rule.

57:10 And this episode has been brought to you by Linode and Studio 3T.

57:13 Linode is bulletproof hosting for whatever you're building with Python.

57:17 Get four months free at talkpython.fm/linode.

57:22 That's L-I-N-O-D-E.

57:24 With Studio 3T, you can write SQL queries and translate them automatically to Python.

57:29 Try their database IDE today at talkpython.fm/studio.

57:34 Want to level up your Python?

57:36 If you're just getting started, try my Python jumpstart by building 10 apps or our brand new

57:41 100 days of code in Python.

57:43 And if you're interested in more than one course, be sure to check out the Everything Bundle.

57:47 It's like a subscription that never expires.

57:49 Be sure to subscribe to the show.

57:52 Open your favorite podcatcher and search for Python.

57:54 We should be right at the top.

57:55 You can also find the iTunes feed at /itunes, Google Play feed at /play, and

58:01 direct RSS feed at /rss on talkpython.fm.

58:04 This is your host, Michael Kennedy.

58:06 Thanks so much for listening.

58:08 I really appreciate it.

58:09 Now get out there and write some Python code.

58:11 I'll see you next time.