WEBVTT

00:00:00.001 --> 00:00:04.380
Jupyter notebooks have transformed the way many developers and data scientists do their jobs.

00:00:04.380 --> 00:00:08.860
They offer a platform to not just explore, but to explain data and computation.

00:00:08.860 --> 00:00:14.720
But how are they really being used? Adam Rule is here to describe his research and PhD dissertation,

00:00:14.720 --> 00:00:18.540
which analyzed over 1 million Jupyter notebooks found out in the wild.

00:00:18.540 --> 00:00:24.320
This is Talk Python To Me, episode 171, recorded July 6, 2018.

00:00:24.320 --> 00:00:44.920
Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities.

00:00:44.920 --> 00:00:49.060
This is your host, Michael Kennedy. Follow me on Twitter, where I'm @mkennedy.

00:00:49.060 --> 00:00:52.940
Keep up with the show and listen to past episodes at talkpython.fm.

00:00:52.940 --> 00:00:55.480
And follow the show on Twitter via @talkpython.

00:00:55.480 --> 00:00:59.640
This episode is brought to you by Linode and Studio 3T.

00:00:59.640 --> 00:01:03.560
Please check out what they're offering during their segments. It really helps support the show.

00:01:03.560 --> 00:01:05.880
Adam, welcome to Talk Python.

00:01:05.880 --> 00:01:08.900
Yeah, great to be with you. Thanks for inviting me on the show.

00:01:08.900 --> 00:01:15.180
Yeah, it's great to have you here. I'm really glad our mutual friend, Philip Guo, introduced us and suggested this show.

00:01:15.180 --> 00:01:17.460
Yeah, no, you've had a lot of great people on the show.

00:01:17.460 --> 00:01:22.120
I mean, Philip included, who I think has been a repeat on the show, which is great.

00:01:22.260 --> 00:01:34.400
But other folks in kind of my area of data analysis or computational notebooks, Min and Matthias from Project Jupyter and DJ Patil, one of the kind of godfathers in data analysis.

00:01:34.400 --> 00:01:36.640
So it's an honor to be on here with you today.

00:01:36.800 --> 00:01:39.400
It's great to have you on. It was great to have those people as well.

00:01:39.400 --> 00:01:43.120
So we're going to talk about Jupyter Notebooks in a super meta way.

00:01:43.120 --> 00:01:48.300
Like, I've had Matthias on to talk about notebooks in, like, as a technology.

00:01:48.300 --> 00:01:55.220
But we're going to talk about more, like, how people work with notebooks and how it affects researchers and data science and stuff.

00:01:55.860 --> 00:01:57.660
Before we get to that, let's start with your story.

00:01:57.660 --> 00:02:01.140
How did you get into programming and what led you down this Python path?

00:02:01.140 --> 00:02:02.120
Oh, gosh.

00:02:02.120 --> 00:02:06.460
I mean, my original track was I was an industrial engineer as an undergrad.

00:02:06.460 --> 00:02:11.180
And I thought, oh, I like people a little bit, but I also like science and math.

00:02:11.180 --> 00:02:17.400
And so industrial engineering is this mix of both of those where you're trying to optimize, like, production flows.

00:02:17.400 --> 00:02:22.700
But then you spend some time on the shop floor talking to people about, like, how's this working for you?

00:02:22.700 --> 00:02:24.700
How can we make this process easier for you?

00:02:24.700 --> 00:02:31.780
From there, it's a somewhat circuitous path, but eventually took me to understanding, oh, not just designing manufacturing processes,

00:02:31.780 --> 00:02:36.640
but we can design devices and products and software for people.

00:02:36.640 --> 00:02:45.260
And, gosh, designing software for people is really difficult because you can't just look at, you know, measure how long is their arm and lay out the workspace that way.

00:02:45.260 --> 00:02:51.400
But you have to get into their head a little bit and try and figure out, okay, mental capacity and how do people think through things?

00:02:51.400 --> 00:02:56.720
What's a workflow look like for that, not just how I work a work piece through a factory?

00:02:56.720 --> 00:03:06.220
So with that, I got into programming largely by studying how people use software and needing to do some of my own programming to build software

00:03:06.520 --> 00:03:15.440
add-ons to really look deeply at how is it that people are using these technologies and how can we make them easier to use or just do more powerful things

00:03:15.440 --> 00:03:20.240
and augmenting what we can learn about how people work in their capacity.

00:03:20.240 --> 00:03:20.960
Yeah.

00:03:20.960 --> 00:03:22.020
Oh, that's really cool.

00:03:22.020 --> 00:03:27.120
It's super interesting how this sort of engineering path led you over here.

00:03:27.240 --> 00:03:39.860
I think industrial engineering plus software would be a really cool space to work because there's just so many gadgets and things you could design with just a little bit of smarts and a little bit of programming that would be amazing.

00:03:39.860 --> 00:03:41.340
Like, that's a skill I don't have.

00:03:41.340 --> 00:03:41.720
Yeah.

00:03:41.720 --> 00:03:42.300
Yeah.

00:03:42.300 --> 00:03:45.200
I know that kind of fusion of the two worlds would be really fascinating.

00:03:45.200 --> 00:03:46.180
I think it's a little ironic.

00:03:46.300 --> 00:03:52.920
One of the things I really like about programming is you actually get to build stuff, which you get to build stuff that is, like, conceptual, right?

00:03:52.920 --> 00:03:53.700
But it runs.

00:03:53.700 --> 00:03:56.420
And I'm speaking, like, coming from a math perspective, right?

00:03:56.420 --> 00:03:59.560
Where it's like, prove this cool theorem about topological spaces.

00:03:59.560 --> 00:04:02.060
I guess I'll go look at another theorem, right?

00:04:02.060 --> 00:04:03.800
Like, you don't actually have an outcome, right?

00:04:03.800 --> 00:04:06.840
Here, even though it's sort of virtual, you still get to build things.

00:04:06.840 --> 00:04:08.440
But this would take it to another level.

00:04:08.440 --> 00:04:08.900
Awesome.

00:04:08.900 --> 00:04:17.660
So you started writing software to kind of understand and study how people interact with it, which is sort of started you down this meta path, right?

00:04:17.660 --> 00:04:18.260
Yeah.

00:04:18.260 --> 00:04:18.660
Yeah.

00:04:18.660 --> 00:04:19.080
I know.

00:04:19.080 --> 00:04:32.340
I think today as we get further on the talk, it'll be very meta as we talk about studying how people use tools like Jupyter Notebook by, in turn, using Jupyter Notebook to analyze a data set about people using it.

00:04:32.780 --> 00:04:34.800
It kind of becomes turtles all the way down.

00:04:34.800 --> 00:04:35.140
Yeah.

00:04:35.140 --> 00:04:36.740
I was just thinking it's turtles all the way down.

00:04:36.740 --> 00:04:37.180
Absolutely.

00:04:37.180 --> 00:04:37.840
Absolutely.

00:04:37.840 --> 00:04:38.700
Yeah.

00:04:38.700 --> 00:04:41.720
So what led you into, like, Notebooks and Python, right?

00:04:41.720 --> 00:04:45.740
Like, you could have, say, used C++ to understand how people use software.

00:04:45.740 --> 00:04:47.700
It would take you longer, but you could have done it.

00:04:47.700 --> 00:04:48.060
Yeah.

00:04:48.060 --> 00:04:48.560
Yeah.

00:04:48.560 --> 00:04:49.000
No.

00:04:49.000 --> 00:04:57.460
And this may be going back a little bit too far, but I was really fascinated by healthcare and how physicians use medical records to track and document their work.

00:04:57.460 --> 00:04:59.960
It's this really data-driven domain.

00:04:59.960 --> 00:05:02.100
But it's really hard to get into.

00:05:02.360 --> 00:05:08.140
You can't really say, oh, let me just hack on your enterprise software system in the hospital.

00:05:08.140 --> 00:05:12.180
I just want to tweak it in this way and see if that makes it easier to care for patients.

00:05:12.180 --> 00:05:21.140
One, the software systems are super complex and regulated, and you have kind of patient risk there as well.

00:05:21.340 --> 00:05:24.980
And so I turned and looked at another very data-driven domain, data analysis.

00:05:25.680 --> 00:05:32.180
And honestly, it's just by kind of hearsay of so many people saying, hey, have you checked out these notebooks?

00:05:32.180 --> 00:05:33.140
They're fantastic.

00:05:33.140 --> 00:05:37.320
I've been using them for months or years now for doing analysis.

00:05:37.320 --> 00:05:38.720
They're pretty amazing.

00:05:38.720 --> 00:05:40.180
You should really look at this.

00:05:40.380 --> 00:05:49.480
And so where I was at UC San Diego, a bunch of people were using these tools for very basic biology, neuroscience research.

00:05:50.200 --> 00:05:57.480
So that got me into looking in, okay, how are people using these tools to track and talk about very complex data-driven work?

00:05:57.480 --> 00:05:58.140
Yeah, that's awesome.

00:05:58.140 --> 00:06:03.180
So maybe that's a good place to segue into kind of what you've been doing recently.

00:06:03.180 --> 00:06:06.640
So you said you are at UC San Diego, a very nice school down in San Diego.

00:06:07.340 --> 00:06:17.640
And you just finished your PhD as a human-computer interaction researcher, which is a sub-portion of, say, cognitive science.

00:06:17.640 --> 00:06:24.500
And I was really surprised how much computer programming and software is involved in cognitive science more broadly.

00:06:24.500 --> 00:06:25.420
Yeah, I know.

00:06:25.420 --> 00:06:31.940
Cognitive science is an interesting field that's kind of a fusion of psychology and computer science.

00:06:32.380 --> 00:06:39.220
And you go back to some of the early days looking at folks like Herb Simon and others at Carnegie Mellon and around the world.

00:06:39.220 --> 00:06:46.900
And they were playing in both fields of developing and testing a lot of software, trying to figure out, can we model how the brain works?

00:06:46.900 --> 00:06:57.700
And then there's kind of the reverse transition of people will look at the brain and use that to try and figure out how can we build more efficient algorithms or computer systems with neuromorphic computing.

00:06:57.700 --> 00:07:01.080
But yeah, I just finished my PhD a month ago, actually.

00:07:01.080 --> 00:07:03.380
So still fresh off of that.

00:07:03.380 --> 00:07:07.040
Are you just super relaxed now?

00:07:07.040 --> 00:07:08.740
I am.

00:07:08.740 --> 00:07:17.500
No, I was going to say, I'm almost in this academic sabbatical period where we still have funding and I'm still continuing to do some of the research that we'll talk about today.

00:07:17.500 --> 00:07:28.660
But due to my wife's job, I moved to a different city, now up in Portland, staying on the West Coast, but very different in terms of sunshine hours from San Diego.

00:07:28.660 --> 00:07:30.380
Less sunshine, more green.

00:07:30.920 --> 00:07:32.420
More green, which is great.

00:07:32.420 --> 00:07:32.800
Yeah.

00:07:32.800 --> 00:07:39.080
When I moved to San Diego from Seattle and when I was in Seattle, really loved the lush green.

00:07:39.080 --> 00:07:42.260
And so San Diego has many benefits.

00:07:42.260 --> 00:07:46.380
The surfing, the burritos, the sunshine, but it lacks in green.

00:07:46.380 --> 00:07:47.580
So it's good to be back.

00:07:47.580 --> 00:07:47.960
Nice.

00:07:47.960 --> 00:07:51.800
So you still are finishing up this research a little bit that we're going to be talking about.

00:07:52.060 --> 00:07:55.420
And then it's time to hit the real world.

00:07:55.420 --> 00:07:56.480
Are you thinking academics?

00:07:56.480 --> 00:07:57.580
Are you thinking industry?

00:07:57.580 --> 00:07:58.760
Where are you headed?

00:07:58.760 --> 00:08:01.540
Yeah, I'm thinking industry at this point.

00:08:01.540 --> 00:08:05.480
Just to try my hand at something slightly different and see what that world is like.

00:08:05.480 --> 00:08:11.580
Gotten a good dose of academia for the last five years of PhD and two years of master's before that.

00:08:11.580 --> 00:08:12.860
That's a healthy dose.

00:08:12.860 --> 00:08:14.740
What the working world is like.

00:08:14.740 --> 00:08:15.140
Awesome.

00:08:15.140 --> 00:08:15.560
All right.

00:08:15.560 --> 00:08:21.760
So maybe we should start by talking a little bit broadly about what human-computer interaction is.

00:08:21.760 --> 00:08:22.200
Yeah.

00:08:22.200 --> 00:08:26.580
And I think Philip has covered some of this as well because he researches really similar topics.

00:08:26.760 --> 00:08:30.280
But it's really studying the design and use of computer technology.

00:08:30.280 --> 00:08:32.760
So how do people use current technologies?

00:08:32.760 --> 00:08:38.600
And how do certain aspects of the design make it easier to use for particular tasks?

00:08:38.600 --> 00:08:42.200
So as I was saying with cognitive science, it's really a mix.

00:08:42.200 --> 00:08:47.480
Human-computer interaction, the subfield, is really a mix of computer science and social science.

00:08:47.480 --> 00:08:50.160
So some days I'm building software.

00:08:50.160 --> 00:08:52.180
Other days I'm testing it with people.

00:08:52.320 --> 00:08:58.060
Other days I'm just sitting and observing how people use it or don't use it during their tasks.

00:08:58.060 --> 00:09:04.300
So it's this flopping between programming and observing and social science, more anthropological skills.

00:09:04.300 --> 00:09:11.380
That's a lot of fun and goes back to my industrial engineering days of the math and science and the satisfaction of building things.

00:09:11.380 --> 00:09:14.560
And then the flip side, trying to understand people as well.

00:09:14.560 --> 00:09:15.040
Oh, yeah.

00:09:15.040 --> 00:09:17.040
It's a really interesting mix.

00:09:17.920 --> 00:09:22.580
Are folks in that area starting to think about how artificial intelligence is changing this?

00:09:22.580 --> 00:09:27.060
Like things like the Amazon Assistant or the Google Assistant and stuff like that?

00:09:27.060 --> 00:09:29.020
There's a bunch of research in that area.

00:09:29.020 --> 00:09:35.580
I think some of the folks who are farthest out ahead on that are those in the human-robot interaction field.

00:09:35.580 --> 00:09:45.420
Because they've had to think for a while about how are people going to interact with robots and reason about how is this computer device reasoning about things and making decisions?

00:09:45.420 --> 00:09:46.700
And should I trust that?

00:09:46.700 --> 00:09:50.320
Or does it not have access to all the information I do?

00:09:50.320 --> 00:09:56.720
And all these things we do very naturally with other humans of like, oh, they don't see that car coming because they're looking this other way.

00:09:56.720 --> 00:09:57.740
I should let them know.

00:09:58.240 --> 00:10:04.520
It's harder to do that with computer systems where you're less sure about what are the inputs, what's the processing, what are the outputs going on.

00:10:04.520 --> 00:10:05.400
That's really interesting.

00:10:05.400 --> 00:10:08.760
It's going to really become more and more so over time, isn't it?

00:10:08.760 --> 00:10:09.160
Yeah.

00:10:09.160 --> 00:10:12.460
And then all this work on like machine learning interpretability.

00:10:12.460 --> 00:10:15.260
How are you going to be able to interpret a decision that came out?

00:10:15.260 --> 00:10:17.180
I know there's work going on in that.

00:10:17.180 --> 00:10:26.420
And even here in Portland, the next HCI researcher meetup in the area is focused on machine learning and how do we design for this and help people understand what's going on.

00:10:26.420 --> 00:10:29.940
So I think both in academia and industry, it's a big deal right now.

00:10:29.940 --> 00:10:30.420
Oh, yeah.

00:10:30.420 --> 00:10:30.920
That's awesome.

00:10:30.920 --> 00:10:35.540
It sounds like people who want research projects, I suspect that's a good place to focus.

00:10:35.540 --> 00:10:36.640
Good place to look.

00:10:36.640 --> 00:10:37.340
Yeah, for sure.

00:10:37.340 --> 00:10:39.180
You decided to focus a little more meta.

00:10:39.180 --> 00:10:44.380
You wanted to focus on these computational notebooks, which Jupyter is one of.

00:10:44.800 --> 00:10:47.520
But maybe give us like let's set the landscape, right?

00:10:47.520 --> 00:10:50.360
So in the Python space, we hear Jupyter, Jupyter, Jupyter.

00:10:50.360 --> 00:10:51.740
Oh, JupyterLab is slightly better.

00:10:51.740 --> 00:10:52.640
JupyterLab.

00:10:52.640 --> 00:10:54.180
And then that's about it, right?

00:10:54.180 --> 00:10:58.940
But there's actually a slightly broader view of these things in the history as well, right?

00:10:58.940 --> 00:11:09.100
So there's actually a really interesting article in The Atlantic that was coming out and saying, oh, the academic paper is dead and computational notebooks are going to replace them.

00:11:09.100 --> 00:11:29.760
And a lot of that talks about Jupyter notebooks, but it goes into some of the history of notebook platforms back to really Mathematica is the one that's often credited with being one of the first environments where you could have this literate programming back and forth, typing and running small scripts in a specific language to analyze data or ask questions.

00:11:29.760 --> 00:11:31.780
So that was back in the 80s.

00:11:31.780 --> 00:11:38.300
And there were academic systems like Maple that were in schools in the 90s and 2000s.

00:11:38.300 --> 00:11:39.720
But it's been in the last couple.

00:11:39.720 --> 00:11:40.860
I remember using Maple.

00:11:40.860 --> 00:11:41.960
That thing was magic.

00:11:42.220 --> 00:11:44.320
Yeah, no, I remember using it as well.

00:11:44.320 --> 00:11:47.560
And I haven't really seen it much outside of the educational context.

00:11:47.560 --> 00:11:49.580
I don't know if that's just their niche or what.

00:11:49.580 --> 00:11:58.140
But so it's been around for a while, but it's often been locked away in proprietary software that you had to pay a big license fee for.

00:11:58.140 --> 00:12:16.680
And so it's really in the last decade or so and really the last five years or so that platforms like Jupyter notebook or RStudio have been providing these open source and in some cases like Jupyter free environments for using a notebook like interface to play with data.

00:12:16.680 --> 00:12:26.000
Yeah, what's your take on when I was working on my master's degree and my PhD and stuff, which I didn't get my PhD, but I did get my master's degree.

00:12:26.000 --> 00:12:36.440
But anyway, when I was working on that, you know, I was using MATLAB and stuff and we were doing things like wavelet decomposition, which I'm pretty sure the license was 2000.

00:12:36.440 --> 00:12:44.420
This is like an add on to MATLAB was like 2000 additional dollars per person that was using it like that's completely insane.

00:12:44.420 --> 00:12:44.860
Yeah.

00:12:44.860 --> 00:12:48.860
And then here comes Jupyter and whatnot and going, oh, actually ours is free.

00:12:48.860 --> 00:12:49.600
Why don't you try that?

00:12:49.600 --> 00:12:52.520
Well, that's a big effect, right?

00:12:52.520 --> 00:12:53.160
Yeah.

00:12:53.160 --> 00:13:05.840
And thinking about, you know, there's a huge push in science for open science and not just sharing your results, sharing the data, but then sharing also your code and how you arrived at that.

00:13:06.100 --> 00:13:11.760
And the fascinating thing is so many researchers are making, you know, you talked about the wavelet decomposition.

00:13:12.220 --> 00:13:21.200
They're making their own packages or libraries, especially in the Python ecosystem and sharing them openly for others to use and download.

00:13:21.200 --> 00:13:28.200
And that's really seemed to move up the stack to not just be the packages, but the language, you know, Python should be open and free.

00:13:28.600 --> 00:13:32.880
And the environments, the development environments like Jupyter should be open and free.

00:13:32.880 --> 00:13:45.000
Do you see this like Zen of open source from the scientists interacting with the software, like flowing into science directly in the sense that people are sort of changing the way

00:13:45.080 --> 00:13:48.760
they do other stuff by virtue of having these open source experiences?

00:13:48.760 --> 00:13:49.640
I think so.

00:13:49.640 --> 00:13:51.600
And a couple of points on that.

00:13:51.600 --> 00:14:01.240
Some of the folks that I talked to and we get to some of the research that I've done talked about this really strong obligation they felt to make things open source, to make them reproducible.

00:14:01.240 --> 00:14:05.940
And it was almost a religious zeal of like, this is just how things should be done.

00:14:05.940 --> 00:14:10.820
And I hadn't seen that really before in academia, in other contexts.

00:14:10.820 --> 00:14:24.180
I think one of the other interesting things I saw is so many labs where, you know, half the lab might be wet lab biology folks who are running experiments and, you know, creating slices and staining them and using microscopes.

00:14:24.180 --> 00:14:32.180
And the other half really just looks like a startup or a software development group who are doing code reviews on a weekly basis.

00:14:32.180 --> 00:14:34.100
They have a GitHub repository, maybe.

00:14:34.100 --> 00:14:45.000
GitHub repositories talking about versioning, remote, you know, workers calling in from across the world and seem very much like just a software company.

00:14:45.000 --> 00:14:49.160
But they're in a research lab doing fundamental biology research.

00:14:49.300 --> 00:15:06.200
Yeah, I really do see this computer science skill starting to permeate a lot more stuff, not just so we have more programmers, but so that, like what you described, these biologists can take some of these software ideas and just really amplify what they're doing in the lab.

00:15:06.200 --> 00:15:06.900
Yeah, absolutely.

00:15:09.540 --> 00:15:12.380
This portion of Talk Python To Me is brought to you by Linode.

00:15:12.380 --> 00:15:16.320
Are you looking for bulletproof hosting that's fast, simple, and incredibly affordable?

00:15:16.320 --> 00:15:21.240
Look past that bookstore and check out Linode at talkpython.fm/Linode.

00:15:21.240 --> 00:15:23.160
That's L-I-N-O-D-E.

00:15:23.160 --> 00:15:27.640
Plans start at just $5 a month for a dedicated server with a gig of RAM.

00:15:27.640 --> 00:15:30.160
They have 10 data centers across the globe.

00:15:30.160 --> 00:15:32.700
So no matter where you are, there's a data center near you.

00:15:33.340 --> 00:15:48.020
Whether you want to run your Python web app, host a private Git server, or file server, you'll get native SSDs on all the machines, a newly upgraded 200 gigabit network, 24-7 friendly support, even on holidays, and a seven-day money-back guarantee.

00:15:48.020 --> 00:15:50.380
Do you need a little help with your infrastructure?

00:15:50.380 --> 00:15:56.600
They even offer professional services to help you get started with architecture, migrations, and more.

00:15:56.600 --> 00:15:59.680
Get a dedicated server for free for the next four months.

00:15:59.680 --> 00:16:02.520
Just visit talkpython.fm/Linode.

00:16:02.520 --> 00:16:07.940
So I did sort of cut you off a little bit when you were talking about the history.

00:16:07.940 --> 00:16:13.440
So we've had Mathematica and Maple and MATLAB, and then we've got Jupyter and RStudio.

00:16:13.440 --> 00:16:21.520
But then there's a bunch of hosted ones as well that maybe some people have heard of, but there's actually a ton of variety out there for even just getting these and running them online.

00:16:21.800 --> 00:16:22.040
Yeah.

00:16:22.040 --> 00:16:30.020
So I mean, even just sticking with Jupyter, one of the great things about Jupyter is they have the notebook itself, kind of this front end.

00:16:30.020 --> 00:16:38.440
But I think some of the more lasting impact from Jupyter might be just the standards that they set about how are we going to send messages back and forth to a kernel?

00:16:38.440 --> 00:16:42.700
What's the notebook format and the very specific JSON structure?
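
NOTE
A minimal sketch of the notebook format mentioned here, assuming the version 4 JSON layout (top-level nbformat, nbformat_minor, metadata, and cells). The cell ids and cell contents are made up for illustration.
```python
import json
# Hand-built notebook document; every value here is illustrative, not taken from the episode.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        # A markdown cell carries the narrative text.
        {"cell_type": "markdown", "id": "intro", "metadata": {}, "source": ["# Explain the analysis"]},
        # A code cell carries source plus any saved outputs (none yet, since it has not been run).
        {"cell_type": "code", "id": "calc", "metadata": {}, "execution_count": None, "outputs": [], "source": ["print(1 + 1)"]},
    ],
}
# Serialize to JSON text and read it back, as a tool consuming the file would.
text = json.dumps(notebook, indent=1)
roundtrip = json.loads(text)
cell_types = [cell["cell_type"] for cell in roundtrip["cells"]]
```
Saved to a file ending in .ipynb, text like this should open as a two-cell notebook in tools that read the format.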

00:16:43.080 --> 00:16:49.500
So in a way, they're almost like a standard setter, like World Wide Web Consortium, HTML, or other things.

00:16:49.500 --> 00:16:54.860
This is just how scientific computing or data analysis should be documented and shared.

00:16:54.860 --> 00:17:00.480
And so a number of other groups have come and built on top of Jupyter in those standards.

00:17:00.480 --> 00:17:11.100
So like Google Colaboratory is like a Google Docs version of Jupyter Notebooks, Microsoft's Azure Notebooks on their, you know, Azure cloud, and then even Sage Notebooks.

00:17:11.100 --> 00:17:19.380
Sage was this project and specific language, kind of like Mathematica, for doing data analysis and mathematics.

00:17:19.380 --> 00:17:21.080
And they've now switched over.

00:17:21.080 --> 00:17:26.160
Is that the, that's the Sage from William Stein up in Seattle, right?

00:17:26.160 --> 00:17:27.440
Yep, that's exactly right.

00:17:27.440 --> 00:17:40.120
So now they have a whole notebook infrastructure and each of these have made add-ons and different history features or profiling tools that are slightly different from Jupyter Notebook or JupyterLab.

00:17:40.120 --> 00:17:42.800
But they're all essentially built on top of Jupyter.

00:17:42.800 --> 00:17:43.180
Yeah.

00:17:43.180 --> 00:17:45.280
And there's some really interesting cloud computing tie-ins.

00:17:45.280 --> 00:17:55.500
Like I don't know Azure and Sage well enough, but I know on Google Colaboratory, like you hit a hotkey to run a cell and you hit a slightly different hotkey to run the cell on a GPU.

00:17:55.500 --> 00:17:57.020
It's like crazy, right?

00:17:57.020 --> 00:17:57.460
It's just.

00:17:57.460 --> 00:17:57.820
Yeah.

00:17:57.820 --> 00:17:58.720
Yeah.

00:17:58.720 --> 00:18:02.700
And I think that comes into business models of how you're going to monetize these things.

00:18:02.700 --> 00:18:11.740
Jupyter has this interesting model of being open source and funded by academic time and grants, whereas others are saying, well, we'll provide this software for free.

00:18:11.740 --> 00:18:19.000
But if you want white glove support or to run it on our cluster, then that'll cost you at that point.

00:18:19.000 --> 00:18:25.080
So it's free for smaller uses in educational context, but we'll provide the compute infrastructure.

00:18:25.560 --> 00:18:26.060
That's pretty interesting.

00:18:26.060 --> 00:18:27.620
There's also a JavaScript one, right?

00:18:27.620 --> 00:18:28.000
Yeah.

00:18:28.160 --> 00:18:34.780
So this kind of notion of computational notebooks is now spreading into the web world where Observable HQ.

00:18:34.780 --> 00:18:45.500
So Mike Bostock, the mind behind the D3 data visualization library, now has a company that's making a completely on the web computational notebook.

00:18:45.500 --> 00:18:49.940
So writing JavaScript code to analyze data.

00:18:49.940 --> 00:18:54.940
And then Mozilla actually is starting up a project called Iodide that's looking at this.

00:18:54.940 --> 00:19:02.620
And what's fascinating about these is you can write code not only to analyze the data, but to directly manipulate the notebook itself.

00:19:02.620 --> 00:19:14.060
And so you get some really fascinating views on data or dashboards or something that you wrote in a cell just changes the complete layout or operation of the notebook itself.

00:19:14.060 --> 00:19:22.240
So it mixes kind of the programming for data analysis and then programming to change the tool that you're using to do the data analysis.

00:19:22.240 --> 00:19:23.160
How interesting.

00:19:23.160 --> 00:19:23.600
Yeah.

00:19:23.600 --> 00:19:27.720
Because you can reprogram the notebook with the notebook.

00:19:27.720 --> 00:19:28.040
Yep.

00:19:28.040 --> 00:19:46.440
So it makes it easy to do things that have been difficult to do in, say, Jupyter Notebook to create widgets where if I want a slider that can then change some parameter and a visualization I have, I can just call whatever that div is because I'm programming in JavaScript and select it and tell it to update on this callback.

00:19:46.440 --> 00:19:54.080
Yeah, I definitely see that having an advantage in that integration as well as that runs on the client browser.

00:19:54.280 --> 00:19:58.520
So the hosting cloud side of it is like, you could probably do that on a $5 server.

00:19:58.520 --> 00:19:59.200
You know what I mean?

00:19:59.200 --> 00:20:04.060
Like you could host an incredible amount of computation because you're just serving up the files more or less, right?

00:20:04.060 --> 00:20:04.440
Yeah.

00:20:04.440 --> 00:20:16.000
So all of these really interesting models all around this idea of a notebook infrastructure where you incrementally write a few lines of code to analyze data and then get the outputs printed right in line.

00:20:16.000 --> 00:20:23.900
One weakness, I guess I would call it, that I see with a JavaScript computational environment is JavaScript has poor numerical support.

00:20:23.900 --> 00:20:24.200
Yeah.

00:20:24.200 --> 00:20:29.480
It's like integers, for example, are super hard because integers.
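
NOTE
One way to read the point about integers: JavaScript's Number type is an IEEE 754 double, so whole numbers above 2**53 cannot all be represented exactly. Python's float uses the same format, so the effect can be sketched in Python as a stand-in (the variable names are illustrative).
```python
# Largest power of two below which every whole number has an exact double representation.
limit = 2 ** 53
# Python ints are arbitrary precision, so adding one is exact.
exact_next = limit + 1
# Converting to a double (what a JavaScript Number would hold) rounds back down to the limit.
as_double = float(exact_next)
lost_precision = as_double == float(limit)
# Two distinct integers collapse into a single double value.
distinct_as_int = exact_next != limit
```
Here lost_precision and distinct_as_int both come out True, which is the kind of surprise that makes integer-heavy work awkward without a separate integer type.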

00:20:29.480 --> 00:20:40.400
And I'm not sure, I mean, even beyond that, just the infrastructure, you know, there's all sorts of great infrastructure, especially in Python and R for data manipulation and cleaning and analysis.

00:20:40.800 --> 00:20:45.520
And those libraries just, I don't think are quite there for JavaScript.

00:20:45.520 --> 00:20:46.000
Right.

00:20:46.000 --> 00:20:47.340
Where's the pandas for JavaScript?

00:20:47.340 --> 00:20:48.080
Does it exist?

00:20:48.080 --> 00:20:49.300
It may, I don't know, actually.

00:20:49.300 --> 00:20:50.360
Or the tidyverse.

00:20:50.360 --> 00:20:58.300
Again, it may exist, but it certainly doesn't have, I think, as many folks working on or using it as you do in the Python or the R worlds.

00:20:58.460 --> 00:20:59.940
Yeah, it's interesting that it exists.

00:20:59.940 --> 00:21:03.700
Okay, so that sort of sets the stage of this world.

00:21:03.700 --> 00:21:20.960
And your PhD goal was to go and study that world and actually understand how people use these computational notebooks and if they're really fulfilling their promise of becoming like a computational narrative or what are people doing this, right?

00:21:20.960 --> 00:21:23.060
So maybe tell us more about your research.

00:21:23.060 --> 00:21:27.840
There's this Atlantic article that came out and made this declaration of the science.

00:21:27.840 --> 00:21:30.120
I want to encourage people to check out that article.

00:21:30.120 --> 00:21:31.400
We're going to put it in the show notes.

00:21:31.400 --> 00:21:35.660
It is, I think I've talked about this before on Python Bytes, my other podcast.

00:21:35.660 --> 00:21:39.100
But anyway, it's really provocative, right?

00:21:39.100 --> 00:21:45.340
Like there's a scientific paper that's literally on fire, like an animated fire on the homepage.

00:21:45.340 --> 00:21:46.640
It's quite something.

00:21:46.640 --> 00:21:48.380
And it's a big change.

00:21:48.380 --> 00:21:50.200
Yeah, no, it makes some bold claims.

00:21:50.200 --> 00:21:53.520
It roasts Stephen Wolfram quite a bit.

00:21:54.080 --> 00:22:09.880
And then it goes into kind of the history of Mathematica and then also Python and these different views of like a closed type ecosystem like Mathematica where you make your own language or an open one like the Python community where anybody can write a library and contribute.

00:22:10.600 --> 00:22:27.480
But yeah, so one of the things in this article, and really this just reflects some of the zeitgeist right now, is this notion that, well, in the future, we're no longer going to be sending around scientific results just in a dead PDF because that doesn't give you all the information you need to reproduce science.

00:22:27.740 --> 00:22:34.920
You know, the analyses we're doing today are so complex that you can't just read a three sentence description of it and know what to do.

00:22:34.920 --> 00:22:48.820
Yeah, I think I've heard somewhere some kind of quote saying your academic paper that you write about your computational results is like advertising for that computational result, but it's not actually the research.

00:22:48.820 --> 00:22:51.220
Like the software is the research in a sense, right?

00:22:51.220 --> 00:22:52.940
So why are these two separate things?

00:22:53.080 --> 00:22:57.220
Yeah, no, I think you had some talks with folks behind the Journal of Open Source Software.

00:22:57.220 --> 00:22:59.400
Yeah, I think that's where that came from is that one.

00:22:59.400 --> 00:22:59.560
Yeah.

00:22:59.560 --> 00:22:59.920
Yeah.

00:22:59.920 --> 00:23:10.200
And just like so much of the work and research now is software development, whether it's for building a package to do a certain type of analysis or that particular analysis itself.

00:23:10.200 --> 00:23:22.780
You know, this article and others are saying really what we need is a new medium where you can share all of the code used to run the analysis because so much of analysis is now happening via programming, not in Excel.

00:23:22.780 --> 00:23:26.120
Not with SPSS or other packages for stats.

00:23:26.120 --> 00:23:29.660
But if you just share the code, that's really not understandable either.

00:23:29.660 --> 00:23:31.660
You know, we all hope to comment it well.

00:23:31.660 --> 00:23:34.080
And so what you need is this mixed narrative.

00:23:34.080 --> 00:23:35.360
And that's what notebooks give.

00:23:35.360 --> 00:23:38.880
You can write a line of markdown text that explains what the notebook's doing.

00:23:38.880 --> 00:23:44.180
And then you can just build up your argument and show how you collected and analyzed this data.

00:23:44.760 --> 00:23:47.800
And so a lot of folks are saying, hey, the scientific paper is dead.

00:23:47.800 --> 00:23:49.460
Notebooks are the new medium.

00:23:49.460 --> 00:23:51.940
And look, millions of people are using them.

00:23:51.940 --> 00:23:54.480
And we really wondered, is that the case?

00:23:54.480 --> 00:23:57.420
You know, is the scientific paper dead?

00:23:57.420 --> 00:23:59.060
How are people using these notebooks?

00:23:59.060 --> 00:24:07.560
Because despite being around for decades and having millions of users, we know very little about how people actually use them in their day-to-day work.

00:24:07.560 --> 00:24:11.720
So we set out to kind of understand better how are people actually using these things?

00:24:11.720 --> 00:24:14.340
Is there much rich computational narrative?

00:24:14.340 --> 00:24:18.020
You know, can we read these things and understand the analysis?

00:24:18.020 --> 00:24:22.360
Or are people really just using them because they're a nice iterative development environment?

00:24:22.360 --> 00:24:25.180
And so that's kind of what got us started down this path.

00:24:25.400 --> 00:24:32.440
Right. Is it a place where you load some data and you just sort of iteratively explore it and interact with it?

00:24:32.440 --> 00:24:36.420
Or are you actually trying to put something like a paper together, right?

00:24:36.420 --> 00:24:38.520
Like the next version of that.

00:24:38.520 --> 00:24:41.360
Those are really compelling reasons to use notebooks.

00:24:41.360 --> 00:24:47.460
You know, having this really tight REPL that will let you iterate and just explore some data.

00:24:47.460 --> 00:24:52.040
Or having it be this really well curated explanation that you can share with others.

00:24:52.040 --> 00:24:54.860
Yeah, this idea of a REPL is really handy and nice.

00:24:55.180 --> 00:24:58.020
But sometimes it's just super hard to go back to what you want.

00:24:58.020 --> 00:25:00.660
You know, you're like, well, it's 20 things back.

00:25:00.660 --> 00:25:02.820
And this, you know, a bunch of them are like 10 lines long.

00:25:02.820 --> 00:25:03.440
So you got to go.

00:25:03.440 --> 00:25:10.240
It's just, you know, it's not very, very nice to interact with it, say, off of the terminal in some situations.

00:25:10.240 --> 00:25:11.760
But this, this is perfect, right?

00:25:11.760 --> 00:25:12.860
You can go back and read it.

00:25:12.860 --> 00:25:14.900
You know, just jump to where you want by touching it.

00:25:14.900 --> 00:25:15.320
It's great.

00:25:15.640 --> 00:25:18.460
So how did you, how did you get your data?

00:25:18.460 --> 00:25:21.520
You, how did you find these notebooks to study?

00:25:21.520 --> 00:25:24.360
Just like go ask a couple of people, you know, or what, what do you do?

00:25:24.500 --> 00:25:29.260
The first part of the research that we did was trying to just get a big sample of notebooks.

00:25:29.460 --> 00:25:36.640
So we said, one way that we can tackle this problem is just try and get a bunch of notebooks and look at them and see what the content is.

00:25:36.640 --> 00:25:43.780
So we ended up actually scraping all of the Jupyter notebooks that were on GitHub at the time of the study, which was about a year ago.

00:25:43.780 --> 00:25:44.480
How many is that?

00:25:44.580 --> 00:25:54.980
So that was a little over a million, about one and a quarter million notebooks they had on there, and that was a fun process of working around rate limiting to get it to work.

00:25:54.980 --> 00:25:55.380
Yeah.

00:25:55.380 --> 00:25:57.000
So tell us how do you do that?

00:25:57.000 --> 00:25:59.280
I mean, was that GitHub API?

00:25:59.280 --> 00:26:00.580
Was that web scraping?

00:26:00.580 --> 00:26:01.900
What was the flow there?

00:26:01.900 --> 00:26:03.280
Yeah, and it was a mix of both.

00:26:03.280 --> 00:26:06.600
So GitHub has a great and well-documented API.

00:26:06.600 --> 00:26:10.640
But in order to do what we wanted to do, we had to abuse it a little bit.

00:26:10.640 --> 00:26:14.140
They don't really want you looking for just one specific file type.

00:26:14.560 --> 00:26:20.260
So you can't really just search and say, show me all the files of this type and download all of them.

00:26:20.260 --> 00:26:22.740
You need to give it some other parameter set.

00:26:22.740 --> 00:26:28.860
So we actually had to go and say, OK, give me all the Jupyter notebooks between, you know, zero and 100 kilobytes.

00:26:28.860 --> 00:26:36.760
OK, between 100 and 200 and kind of iterate them through that way, both to get a list of all of the notebooks.

00:26:36.760 --> 00:26:40.360
And because they limit, we're only going to send you a thousand results.

00:26:40.360 --> 00:26:45.080
So even if you can look and see that there's a million of these, we'll only send you the first thousand.

00:26:45.080 --> 00:26:50.040
So we had to restrict our query down to get it in packets that were small enough.
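
NOTE
Editor's sketch of the size-window trick described above, not the study's actual code. It assumes a hypothetical count_in_range(lo, hi) callable that reports how many notebooks fall in an inclusive size window, and the query qualifiers are an assumption modeled on GitHub's legacy code search syntax.
```python
from typing import Callable, List, Tuple
def split_size_ranges(count_in_range: Callable[[int, int], int], lo: int, hi: int, cap: int = 1000) -> List[Tuple[int, int]]:
    """Split the inclusive size range [lo, hi] until each piece has at most `cap` hits."""
    total = count_in_range(lo, hi)
    if total == 0:
        return []
    if total <= cap or lo == hi:
        # Small enough to page through, or the window cannot be narrowed further.
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return (split_size_ranges(count_in_range, lo, mid, cap)
            + split_size_ranges(count_in_range, mid + 1, hi, cap))
def build_query(lo: int, hi: int) -> str:
    # Assumed qualifier syntax; check the current search documentation before relying on it.
    return f"extension:ipynb size:{lo}..{hi}"
```
Each returned window can then be queried and paged separately, since no single window exceeds the result cap.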

00:26:50.040 --> 00:26:50.400
I see.

00:26:50.400 --> 00:26:55.000
You had to come up with an arbitrary filter criteria that would get it below a thousand.

00:26:55.000 --> 00:26:55.520
OK.

00:26:55.520 --> 00:26:56.720
Yeah.

00:26:57.260 --> 00:27:07.800
And then from there, it's just a lot of learning how to be a good citizen with GitHub servers and respect when they say, OK, you're making too many queries and slow down.

00:27:07.800 --> 00:27:11.500
It took a couple of weeks, but we eventually got the full data set that way.
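
NOTE
Editor's sketch of "being a good citizen" with a rate-limited API, under stated assumptions. It only decides how long to pause, based on the Retry-After, X-RateLimit-Remaining, and X-RateLimit-Reset response headers that GitHub's REST API documents. The function name is hypothetical, and the plain mapping lookup assumes the caller passes headers with exactly this capitalization.
```python
from typing import Mapping
def seconds_to_wait(headers: Mapping[str, str], now: float) -> float:
    """Return how long to pause before the next request (0 means go ahead)."""
    if "Retry-After" in headers:
        # The server asked for an explicit pause, in seconds.
        return float(headers["Retry-After"])
    if headers.get("X-RateLimit-Remaining") == "0":
        # Quota exhausted: wait until the reset time, given in epoch seconds.
        return max(0.0, float(headers["X-RateLimit-Reset"]) - now)
    return 0.0
```
A crawler loop would sleep for the returned number of seconds before sending its next request.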

00:27:11.500 --> 00:27:15.960
And then afterwards, we had to do some web scraping to get the files themselves.

00:27:16.180 --> 00:27:19.560
So we essentially first got a list of what are all the notebooks and where are they?

00:27:19.560 --> 00:27:22.320
And then used some web scraping to get the files themselves.

00:27:22.320 --> 00:27:23.480
Yeah, that's pretty interesting.

00:27:23.480 --> 00:27:25.140
And it only took a couple of weeks.

00:27:25.140 --> 00:27:26.340
I mean, on one hand, that's a long time.

00:27:26.340 --> 00:27:33.100
On the other, you've gone out and gathered all of these millions of notebooks from all these sources.

00:27:33.100 --> 00:27:35.360
And that's pretty amazing, actually.

00:27:35.580 --> 00:27:42.740
No, I think it's fascinating that, I mean, GitHub provides the tools for us to be able to do something like this.

00:27:42.740 --> 00:27:54.920
And really only asks us in return that we make the data in any publications open afterwards, that you can use their API to really study how people all over the world are using a tool set.

00:27:54.920 --> 00:27:58.000
So it's a great way to get a massive and diverse sample size.

00:27:58.000 --> 00:28:01.260
Yeah, I think GitHub is kind of a special place, right?

00:28:01.340 --> 00:28:07.340
I mean, there's lots of source control and versioning and issue tracking places in the world, but GitHub stands alone.

00:28:07.340 --> 00:28:11.360
In its sort of reach and just people using it and so on.

00:28:11.360 --> 00:28:11.620
Yeah.

00:28:11.620 --> 00:28:15.020
So what was, I guess, how do you go about studying it?

00:28:15.020 --> 00:28:17.120
Like, you're studying Jupyter Notebooks.

00:28:17.120 --> 00:28:20.220
Did you actually use Jupyter Notebooks to study your Jupyter Notebooks?

00:28:20.220 --> 00:28:21.400
This is exactly it.

00:28:21.400 --> 00:28:22.860
Yeah, so this is where it gets super meta.

00:28:22.860 --> 00:28:25.200
So we download this whole data set.

00:28:25.200 --> 00:28:29.460
And for those who don't know, Jupyter Notebooks themselves are really just a JSON file.

00:28:29.940 --> 00:28:35.100
They have a different ending, .ipynb, but under the hood, it's just a JSON file.

00:28:35.100 --> 00:28:39.120
So we essentially have a million JSON files to look at.
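
NOTE
Editor's sketch showing that a notebook really is plain JSON. It assumes the nbformat 4 layout, with a top-level "cells" list whose entries carry a "cell_type" field. The function name is hypothetical.
```python
import json
from collections import Counter
def cell_type_counts(notebook_json: str) -> Counter:
    """Count cells by type (code, markdown, raw) in one notebook's JSON text."""
    notebook = json.loads(notebook_json)
    return Counter(cell.get("cell_type", "unknown") for cell in notebook.get("cells", []))
```
Reading an .ipynb file as text and passing it to this function needs nothing beyond the standard library.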

00:28:39.120 --> 00:28:47.140
And so we just spun up our own Jupyter Notebooks, imported that data set, started making some data frames using the Python ecosystem.

00:28:47.140 --> 00:28:49.320
And so analyzed it that way.

00:28:49.320 --> 00:28:58.340
So in the end, it's come full circle because the Notebooks that I used for my analysis of Notebooks on GitHub are now hosted on GitHub.

00:28:58.340 --> 00:28:59.980
Completely available.

00:28:59.980 --> 00:29:03.240
So you will be a data set in the follow-up replication study.

00:29:03.240 --> 00:29:03.880
Is that what you're saying?

00:29:03.880 --> 00:29:04.680
Yeah.

00:29:04.680 --> 00:29:05.160
Yeah.

00:29:05.160 --> 00:29:05.820
How interesting.

00:29:08.320 --> 00:29:13.100
This portion of Talk Python is brought to you by Studio 3T, the IDE for MongoDB.

00:29:13.100 --> 00:29:16.200
NoSQL databases offer maximum flexibility.

00:29:16.200 --> 00:29:21.100
But what if you could combine the benefits of MongoDB with the benefits of SQL?

00:29:21.100 --> 00:29:23.360
With Studio 3T, you can.

00:29:23.700 --> 00:29:29.080
With their innovative SQL query feature, you can write SQL joins and expressions to query MongoDB.

00:29:29.080 --> 00:29:35.200
And the best part is you get to see how your SQL queries translate to MongoDB's native query syntax with the click of a button.

00:29:35.200 --> 00:29:40.180
You can create MongoDB queries, aggregation statements, and SQL queries.

00:29:40.180 --> 00:29:47.260
And 3T's novel query code will automatically generate code for you in a variety of languages like Python, JavaScript, and even C#.

00:29:47.260 --> 00:29:52.640
Studio 3T also offers the richest coding experience with its full-featured IntelliShell.

00:29:53.020 --> 00:30:01.180
It's the built-in MongoDB shell interface with smart auto-completion of collection names, shell methods, document key names, operators, and field names.

00:30:01.180 --> 00:30:05.940
By using in-place editing within a collection, it's even easier to edit your documents.

00:30:05.940 --> 00:30:12.800
Try Studio 3T and see why it's used by Fortune 500 companies like Nike, Tesla, Formula One, Comcast, and many more,

00:30:12.800 --> 00:30:16.060
saving enterprise users countless hours of development time.

00:30:16.060 --> 00:30:21.100
Visit talkpython.fm/studio to get a free one-month trial.

00:30:21.300 --> 00:30:23.740
That's talkpython.fm/studio.

00:30:23.740 --> 00:30:29.100
What are some of the findings or things you got by studying these?

00:30:29.100 --> 00:30:36.000
We talked about this tension between an iterative REPL and then an explanatory narrative or computational narrative.

00:30:36.000 --> 00:30:36.840
What did you find?

00:30:36.840 --> 00:30:47.260
One of the big headlines from this bit of research was that very few of the notebooks had what you might consider the baseline requirement for a rich narrative,

00:30:47.380 --> 00:30:49.560
which is just any explanatory text.

00:30:49.560 --> 00:30:54.160
So over a quarter of the notebooks had no markdown text at all.

00:30:54.160 --> 00:30:56.540
So they were just code or code blocks.

00:30:56.540 --> 00:31:00.040
And then even of those that had text, they were pretty short.

00:31:00.040 --> 00:31:02.660
So like the median was about 150 words.

00:31:02.660 --> 00:31:04.380
So just a really short blurb.
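
NOTE
Editor's sketch of how the markdown statistics above could be computed, not the study's actual method. It assumes notebooks already parsed into dicts in the nbformat 4 layout, where a cell's "source" is either one string or a list of lines. Function names are hypothetical, and "words" here just means whitespace-separated tokens.
```python
from statistics import median
from typing import Iterable
def markdown_word_count(notebook: dict) -> int:
    """Count whitespace-separated words across all markdown cells."""
    words = 0
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", "")
        # Source may be a single string or a list of line strings.
        text = "".join(source) if isinstance(source, list) else source
        words += len(text.split())
    return words
def median_markdown_words(notebooks: Iterable[dict]) -> float:
    """Median word count among notebooks that contain any markdown text."""
    counts = [markdown_word_count(nb) for nb in notebooks]
    with_text = [c for c in counts if c > 0]
    return median(with_text) if with_text else 0.0
```
Notebooks whose count is zero are the "no markdown text at all" group, so they are left out of the median.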

00:31:04.600 --> 00:31:06.540
Well, maybe those blurbs are almost like comments.

00:31:06.540 --> 00:31:09.860
They just are in text cells instead of in code with a hash.

00:31:09.860 --> 00:31:10.340
Yeah.

00:31:10.340 --> 00:31:10.820
Yeah.

00:31:10.820 --> 00:31:17.460
And they'd often be like, okay, import data, model data, really descriptive of the steps.

00:31:17.600 --> 00:31:22.980
And I think for us that, you know, hints that more, you know, not that this is a bad use of the notebook,

00:31:22.980 --> 00:31:29.480
but more they're being used as an interactive programming environment with some light, loose notes rather than this view of,

00:31:29.480 --> 00:31:34.180
oh, there's this really rich description, like a scientific paper of what people did.

00:31:34.600 --> 00:31:36.880
I think there are a host of reasons for that.

00:31:36.880 --> 00:31:39.180
But that was one of the major findings for us from this.

00:31:39.180 --> 00:31:39.400
Yeah.

00:31:39.400 --> 00:31:49.680
So how much of this is just people happen to be using and storing their notebooks on GitHub versus they intend other people to consume those?

00:31:49.680 --> 00:31:50.120
Yeah.

00:31:50.220 --> 00:31:56.440
You know, it's hard to say for sure, but we think a lot of it is just we're going to throw it up as a repository for myself.

00:31:56.440 --> 00:31:59.080
And I'm not really expecting others to use it.

00:31:59.080 --> 00:32:05.420
We did an analysis where we actually looked at the descriptions of the GitHub repos where these notebooks lived.

00:32:05.420 --> 00:32:09.920
So like how are people describing these projects and looked for keywords.

00:32:10.380 --> 00:32:20.680
And when we remove things like notebook or GitHub from that keyword search, the top words are things like machine learning, Kaggle, Udacity, Nanodegree.

00:32:20.680 --> 00:32:29.840
And so that really showed us that a lot of these seem to be people learning how to do data analysis, learning machine learning in particular,

00:32:29.840 --> 00:32:39.120
and doing online education assignments and then hosting their results up online, whether that's as a form of submission

00:32:39.120 --> 00:32:43.160
or for a resume or portfolio building exercise.

00:32:43.160 --> 00:32:45.200
But a lot of these seem to be educational.

00:32:45.200 --> 00:32:45.680
Interesting.

00:32:45.680 --> 00:32:52.560
I can see a lot of people who are students taking courses at a university or something, and their professor says,

00:32:52.560 --> 00:32:55.520
all right, what we're going to do is create a repo for your course,

00:32:55.520 --> 00:32:59.740
and everybody's going to put their assignment and just share it with me or make it public or something.

00:32:59.740 --> 00:33:03.480
Our search excluded forked repos and forked notebooks.

00:33:03.480 --> 00:33:07.420
So at least one form of distribution like that should have been excluded.

00:33:07.680 --> 00:33:11.720
But yeah, I still think a lot of this is course assignments like that.

00:33:11.720 --> 00:33:19.760
Did you do any refining where you say, well, let's look at repositories that have over a thousand stars or a lot of followers,

00:33:19.760 --> 00:33:23.640
or just the ones that are not clearly sort of private?

00:33:23.640 --> 00:33:26.500
We tried to look for ones that seem to get reused a lot.

00:33:26.500 --> 00:33:33.020
And in fact, the motivation for us was, well, let's see if we can find best practices in notebooks, right?

00:33:33.080 --> 00:33:38.740
Like if we can find ones that are in repositories that were starred a lot or forked a lot,

00:33:38.740 --> 00:33:45.940
maybe that means that they were really useful and we can kind of glean some best practices on notebook design from this.

00:33:45.940 --> 00:33:46.200
Right.

00:33:46.200 --> 00:33:48.480
Maybe even a lot of PRs, they're getting like polished.

00:33:48.480 --> 00:33:49.060
Exactly.

00:33:49.340 --> 00:34:01.340
And one of the things we found when we tried to do that was many of these notebooks that were in highly starred or forked repositories were just tutorials for various software packages.

00:34:01.740 --> 00:34:06.760
So as an example, it could have been, you know, something like here's pandas up on GitHub.

00:34:06.760 --> 00:34:11.140
And then here's the notebook as documentation showing how to use pandas.

00:34:11.140 --> 00:34:17.120
But the reason why this repository is so starred and forked is that people really like using pandas,

00:34:17.120 --> 00:34:20.620
not because the notebooks themselves were all that insightful.

00:34:20.620 --> 00:34:21.640
Yeah, that's interesting.

00:34:21.640 --> 00:34:28.880
I guess it is a really nice way to have code mixed with description on GitHub because GitHub renders those now.

00:34:29.060 --> 00:34:36.200
Yeah, and that's one of the things: when we were initially doing this research, people said the reason they're using Jupyter Notebook and putting it on GitHub is,

00:34:36.200 --> 00:34:41.100
hey, a manager that I have, you know, they don't really want to install the software and set up an environment,

00:34:41.100 --> 00:34:44.720
but I can just send them a link and they can see the notebook statically.

00:34:44.720 --> 00:34:47.240
And that's a really nice way to share results.

00:34:47.240 --> 00:34:47.920
Yeah, that's great.

00:34:47.920 --> 00:34:50.320
Even managers can run web browsers.

00:34:50.320 --> 00:34:51.200
Uh-huh.

00:34:51.200 --> 00:34:51.500
Uh-huh.

00:34:51.500 --> 00:34:52.080
Yeah.

00:34:52.080 --> 00:34:57.080
I think one of the other interesting things that we looked at was, in a testament to the Python ecosystem,

00:34:57.080 --> 00:35:00.420
we said, okay, what are the packages that people are importing?

00:35:00.420 --> 00:35:05.420
And just finding that the vast majority, around 90% or so of these notebooks,

00:35:05.420 --> 00:35:07.440
are importing external packages.

00:35:07.440 --> 00:35:14.800
And things like pandas, numpy, matplotlib were imported by two-thirds or more of them.

00:35:14.800 --> 00:35:20.760
So just the data science infrastructure that's being provided is a really core component.

00:35:20.760 --> 00:35:22.640
It's not just having the notebook like Jupyter.

00:35:23.180 --> 00:35:25.920
It's having the Python ecosystem to be able to do it.

00:35:25.920 --> 00:35:26.620
It's the foundation.
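
NOTE
Editor's sketch of one way to list the packages a notebook imports, not the study's actual method. It is a line-based heuristic over code cells: it catches only the first name on an import line, can be fooled by import-like text inside strings, and does not separate standard-library modules from external packages. Names are hypothetical and the nbformat 4 cell layout is assumed.
```python
import re
from typing import Set
IMPORT_RE = re.compile(r"^\s*(?:import|from)\s+([A-Za-z_]\w*)")
def imported_packages(notebook: dict) -> Set[str]:
    """Collect top-level package names from import lines in code cells."""
    packages = set()
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        text = "".join(source) if isinstance(source, list) else source
        for line in text.splitlines():
            match = IMPORT_RE.match(line)
            if match:
                # "numpy.linalg" and "numpy as np" both reduce to "numpy".
                packages.add(match.group(1))
    return packages
```
A regular expression is used instead of a full Python parse because notebook cells often contain IPython magics such as %matplotlib inline, which are not valid Python syntax.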

00:35:26.620 --> 00:35:27.620
Yeah, that's really awesome.

00:35:27.620 --> 00:35:31.140
What about things like R and JavaScript and stuff?

00:35:31.140 --> 00:35:33.340
Were you able to figure out, well, how much is Python?

00:35:33.340 --> 00:35:34.460
How much is other stuff?

00:35:34.460 --> 00:35:37.900
The notebooks themselves have a tag for what the language is in there.

00:35:38.400 --> 00:35:43.080
And the vast majority of those, like 96%, were written in Python.

00:35:43.080 --> 00:35:53.240
Whereas things like R and Julia, which is why Jupyter is named Jupyter, the combination of Julia, Python, and R, each accounted for about a percent.

00:35:53.240 --> 00:35:56.680
And then there was a long tail of other languages that were supported.

00:35:56.880 --> 00:35:59.020
But by and large, it was Python.

00:35:59.020 --> 00:36:01.660
And kind of surprising for us.
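
NOTE
Editor's sketch of reading the language tag mentioned above and tallying language shares. It assumes the nbformat 4 metadata fields language_info.name and kernelspec.language, either of which may be missing in a given notebook. Function names are hypothetical.
```python
from collections import Counter
from typing import Dict, Iterable
def notebook_language(notebook: dict) -> str:
    """Return the notebook's declared language, lowercased, or "unknown"."""
    metadata = notebook.get("metadata", {})
    name = metadata.get("language_info", {}).get("name") or metadata.get("kernelspec", {}).get("language")
    return name.lower() if isinstance(name, str) else "unknown"
def language_shares(notebooks: Iterable[dict]) -> Dict[str, float]:
    """Fraction of notebooks per declared language."""
    counts = Counter(notebook_language(nb) for nb in notebooks)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()} if total else {}
```
Notebooks with no usable metadata are reported as "unknown" rather than guessed.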

00:36:01.660 --> 00:36:02.560
96%?

00:36:02.560 --> 00:36:05.580
That's actually higher than I would have even guessed, to be honest.

00:36:05.580 --> 00:36:06.140
No.

00:36:06.140 --> 00:36:09.820
And again, I don't know if it's because so many of these are educational.

00:36:09.820 --> 00:36:15.100
And that's just a good language to teach in or what the reason may be.

00:36:15.100 --> 00:36:22.060
But it still is strongly reflecting its IPython roots of being kind of a Python-first environment.

00:36:22.300 --> 00:36:22.500
For sure.

00:36:22.500 --> 00:36:23.040
It sounds like it.

00:36:23.040 --> 00:36:31.080
So were you able to find a subset of what you might call academic papers or like these narrative type things and analyze those?

00:36:31.080 --> 00:36:35.720
That leads into the second line of work we did on this, which is, okay, we've pulled down all of these notebooks.

00:36:35.720 --> 00:36:36.840
We've looked at them.

00:36:36.840 --> 00:36:44.160
Very few of them have this rich description and seem to be more just using notebooks as a nice iterative environment.

00:36:44.160 --> 00:36:47.940
But what if we're just looking at, you know, the wrong subset of notebooks?

00:36:47.940 --> 00:36:49.840
Many of these seem to be for education.

00:36:50.500 --> 00:36:53.320
It may be that people are just hosting these for themselves.

00:36:53.320 --> 00:36:53.700
Yeah.

00:36:53.700 --> 00:37:01.140
You probably don't want to look at a notebook from a student who has, like, known programming for 10 days as a "how should we do things?"

00:37:01.140 --> 00:37:02.040
Exactly.

00:37:02.040 --> 00:37:03.080
That's exactly it.

00:37:03.080 --> 00:37:09.220
So like, okay, like, let's be a little humble with the limitations of the data set and our assumptions of it.

00:37:09.220 --> 00:37:16.560
So we ended up saying, well, what if we look at what some consider like the creme de la creme of doing and presenting analysis?

00:37:16.560 --> 00:37:20.580
What if we look at notebooks that are supplementing academic publications?

00:37:21.040 --> 00:37:27.240
So this is back to that Atlantic article saying, you know what, in the future, scientific papers should just be in notebooks.

00:37:27.240 --> 00:37:39.080
And there's a number of folks who have jumped on that bandwagon and said, you know, I may publish something in Science or Nature or one of the big journals, but I'm going to link to, here's the notebook that I used for that analysis.

00:37:39.080 --> 00:37:45.840
So that people can retrace it, recreate it, fork it, and continue the analysis themselves.

00:37:45.840 --> 00:37:55.840
And as mentioned earlier, there are some in the academic community who have a real strong kind of evangelistic fervor around the need to share the results in this way.

00:37:55.840 --> 00:37:58.000
So we ended up looking for those specifically.

00:37:58.000 --> 00:37:58.480
Interesting.

00:37:58.480 --> 00:37:59.260
And what did you find there?

00:37:59.260 --> 00:38:00.320
Any differences?

00:38:00.320 --> 00:38:01.140
Probably, right?

00:38:01.340 --> 00:38:01.680
Yeah.

00:38:01.680 --> 00:38:05.020
So we ended up finding, again, many of these are on GitHub.

00:38:05.020 --> 00:38:07.060
So we pulled about 150 of them.

00:38:07.060 --> 00:38:12.160
And we used a slightly different method where rather than just doing a big data analysis, we hand-coded it.

00:38:12.160 --> 00:38:15.200
And we wanted to see kind of with finer grain detail.

00:38:15.200 --> 00:38:19.760
I looked at it and I put it in categories, not like typing programming, right?

00:38:19.760 --> 00:38:21.200
Exactly.

00:38:21.200 --> 00:38:27.140
So putting them in categories, kind of iterating those over time, and then making sure you have other people who can validate.

00:38:27.140 --> 00:38:31.120
But yes, that's a valid category, not just something that you came up with.

00:38:31.120 --> 00:38:31.360
Nice.

00:38:31.360 --> 00:38:31.560
Okay.

00:38:31.560 --> 00:38:35.860
So you coded these by hand and probably got slightly different results, I guess.

00:38:35.860 --> 00:38:36.300
Okay.

00:38:36.300 --> 00:38:38.720
So these notebooks have a little more text in them.

00:38:38.720 --> 00:38:44.280
But surprisingly, they're not using that text really to describe the analysis in any rich way.

00:38:44.280 --> 00:38:53.800
So of the notebooks that had any text in them, which was still not all of them, the majority would use that text just to describe the steps of the analysis.

00:38:53.800 --> 00:38:56.400
Importing data, fitting model.

00:38:56.400 --> 00:38:59.800
Back to the comments as text rather than comments as code comments.

00:38:59.800 --> 00:39:00.040
Yeah.

00:39:00.040 --> 00:39:00.560
Okay.

00:39:00.560 --> 00:39:04.820
And then only about a third of them have what we might consider a rich description.

00:39:04.820 --> 00:39:08.440
So any description of why they did the analysis in a particular way.

00:39:08.440 --> 00:39:14.360
So, oh, I fit a linear model because these assumptions are met, or we tried this other model and it didn't work.

00:39:14.360 --> 00:39:16.260
Or interpreting results.

00:39:16.740 --> 00:39:20.100
So, you know, if you look at this plot, you'll notice this outlier here.

00:39:20.100 --> 00:39:26.820
Most would just leave the end figure like it spoke for itself, often without axes labeled.

00:39:26.820 --> 00:39:29.080
And just say, here's how we got the result.

00:39:29.080 --> 00:39:33.000
And not really describe what they thought it meant or how they got there.

00:39:33.080 --> 00:39:34.460
So that for us was surprising.

00:39:34.460 --> 00:39:47.300
Again, thinking, well, academics will want their work to be easily understood and replicable, that so many of these still kind of fell short of Fernando Pérez's, Brian Granger's, and others' vision of this rich computational narrative.

00:39:47.300 --> 00:39:48.300
How interesting.

00:39:48.300 --> 00:39:48.360
Yeah.

00:39:48.360 --> 00:39:49.720
So what did you do?

00:39:49.720 --> 00:39:51.740
You go ask them, like, why didn't you write?

00:39:51.740 --> 00:39:53.140
Why didn't you write more?

00:39:53.140 --> 00:39:54.400
So in a way we did.

00:39:54.400 --> 00:40:02.980
Not with the folks who had posted notebooks there, but we ended up finding people around campus at UC San Diego who are using notebooks.

00:40:03.020 --> 00:40:10.380
Again, some of these labs where they have big biology analyses or genomics work and where using notebooks is kind of a way of life.

00:40:10.380 --> 00:40:12.260
So we went and talked to some of those people.

00:40:12.260 --> 00:40:19.680
So we ended up finding 15 of them and just had them walk us through: show us a notebook you've been working on, which was great.

00:40:19.680 --> 00:40:25.280
We got to see kind of in-progress work rather than just here's my finished product that I posted on GitHub.

00:40:25.280 --> 00:40:26.300
Yeah, that's cool.

00:40:26.300 --> 00:40:32.940
And so were they also more using it as like REPL-type explorative stuff or what was the story there?

00:40:33.180 --> 00:40:34.320
It was exactly that.

00:40:34.320 --> 00:40:37.340
Again, people were using it for this iterative environment.

00:40:37.340 --> 00:40:38.480
Some would talk about it.

00:40:38.480 --> 00:40:44.320
It's my coding playground where I get to test out ideas, but it's a very personal thing.

00:40:44.320 --> 00:40:47.000
You know, it's reflecting my style of programming.

00:40:47.000 --> 00:40:51.700
And, you know, I'm not going to take time to clean it up for others because maybe they don't want to see it.

00:40:51.700 --> 00:40:56.080
I was going to say, it seems like a really great thing for people to create these notebooks.

00:40:56.080 --> 00:41:02.680
And then if you're meeting with your research group to pull it up and everybody look at it and kind of walk through it.

00:41:02.680 --> 00:41:05.540
I wonder how much that played into it.

00:41:05.540 --> 00:41:06.900
Like, here, look what I've been doing this week.

00:41:06.900 --> 00:41:09.020
And here, let me show you the results and how I got it and so on.

00:41:09.020 --> 00:41:10.780
Yeah, we had the same intuition as well.

00:41:10.780 --> 00:41:13.080
And it seems like that's not the case.

00:41:13.080 --> 00:41:19.420
And in fact, one of the folks that we talked to said, you know what?

00:41:19.420 --> 00:41:23.400
I've tried this in lab meetings and people just think I didn't take time to prepare.

00:41:23.400 --> 00:41:26.260
They think I'm just showing up by the skin of my teeth.

00:41:26.260 --> 00:41:26.500
It's winging it.

00:41:26.500 --> 00:41:27.980
Winging it.

00:41:27.980 --> 00:41:35.900
And unless I have a slide deck put together, they think that these aren't solid results or that I didn't take time to think about what you might want to see.

00:41:36.120 --> 00:41:50.140
So it's this really interesting case where there's kind of this entrenched practice of you must present from slide decks or else it means this or that about your work that got in the way of using notebooks as a presentation medium.

00:41:50.140 --> 00:41:50.600
Huh.

00:41:50.600 --> 00:41:51.660
Interesting.

00:41:51.660 --> 00:41:53.140
I wonder if it's a chicken and egg thing.

00:41:53.140 --> 00:41:59.520
Like, if they were really beautifully formatted and descriptive, maybe that's a really great presentation.

00:41:59.520 --> 00:42:03.960
But if they're sloppy, like, in and of themselves, maybe the presentation feels sloppy.

00:42:03.960 --> 00:42:06.860
Or just the social expectation in practice.

00:42:06.860 --> 00:42:16.000
Like, what if labs just expected that rather than taking time to create a slide deck, that you would take the time to document your code in the notebook?

00:42:16.000 --> 00:42:20.500
That seems like that's more reusable and valuable over time.

00:42:20.500 --> 00:42:20.960
Right?

00:42:20.960 --> 00:42:24.880
Because, I mean, I've not refactored or reused slides that much.

00:42:24.880 --> 00:42:26.360
Uh-huh.

00:42:26.360 --> 00:42:26.940
Okay.

00:42:26.940 --> 00:42:28.260
Wow.

00:42:28.860 --> 00:42:34.720
What do you think needs to be done for notebooks to reach their potential, right?

00:42:34.720 --> 00:42:39.380
To become this thing that would actually sort of validate the burning paper on the Atlantic?

00:42:39.380 --> 00:42:45.180
I first want to say, like, I and the researchers who work with me think notebooks are fantastic.

00:42:45.180 --> 00:42:50.980
I wouldn't want anyone coming away from the podcast saying, oh, you know, notebooks are done.

00:42:50.980 --> 00:42:53.300
They're not the right way to do analysis.

00:42:53.300 --> 00:42:55.780
I think they're the best thing that we've got going.

00:42:55.780 --> 00:43:06.240
And they're a vast improvement over prior ways of doing data analysis, which was often having, you know, script one in a file, script two in another file, script 2.5.

00:43:06.560 --> 00:43:10.020
Version control is usually just naming copies of the files.

00:43:10.020 --> 00:43:10.020
Exactly.

00:43:10.020 --> 00:43:10.520
Yeah.

00:43:10.520 --> 00:43:14.900
It's a much better way for version control and tracking steps of the analysis.

00:43:14.900 --> 00:43:23.780
And, you know, our research has demonstrated it's a fantastic way to iteratively do the process of analyzing data, especially with Python.

00:43:23.780 --> 00:43:25.680
And so many people are using it for that.

00:43:25.680 --> 00:43:33.800
For us, the real question is now, how do we make this wonderful programming environment also a wonderful presentation environment?

00:43:33.800 --> 00:43:39.900
One where it's easy to share results with others and to support kind of collaboration in that way.

00:43:40.460 --> 00:43:43.700
And I don't know a silver bullet to get it done.

00:43:43.700 --> 00:43:49.180
I think there's things that we can do to tweak the design and how people use the notebooks.

00:43:49.180 --> 00:43:51.720
But there also have to be some social changes.

00:43:51.720 --> 00:44:00.320
You know, things like labs expecting that presentations will be from notebooks or journals expecting submissions in notebook format rather than PDF.

00:44:00.320 --> 00:44:01.860
So I think it'll be a mix.

00:44:01.860 --> 00:44:02.420
Interesting.

00:44:02.420 --> 00:44:08.020
How much do you think, like, Software Carpentry type of stuff is involved here?

00:44:08.020 --> 00:44:12.800
Like bringing a little bit of the CS side of things to the researchers?

00:44:12.800 --> 00:44:13.440
Yeah.

00:44:13.440 --> 00:44:14.880
No, I think that's really vital.

00:44:14.880 --> 00:44:22.640
And I think the model that Software Carpentry has of doing the workshops and dedicated time to training these best practices.

00:44:23.000 --> 00:44:41.260
One of the really standout insights from our last line of work, the interviewing researchers, was that one researcher mentioned that when she had started as a biology student in biology or chemistry labs, she was trained in a very specific way of tracking her results.

00:44:41.260 --> 00:44:47.620
You know, this is how you write your name and the date and the reagents that you're going to use for this experiment and the steps.

00:44:47.840 --> 00:44:59.960
And if she didn't do it in a particular disciplined way, she'd get docked points by her teaching assistant or professor because this was just the practice of how you document and share a biology or chemistry lab.

00:44:59.960 --> 00:45:02.860
And she said, you know, we don't really have that for notebooks.

00:45:02.860 --> 00:45:04.580
I came into the lab.

00:45:04.580 --> 00:45:07.020
I was shown a notebook and told, have fun.

00:45:07.020 --> 00:45:08.320
You type here.

00:45:08.320 --> 00:45:09.520
Exactly.

00:45:09.520 --> 00:45:15.000
But I've had to figure out like, oh, I can create my own packages and I can import external files.

00:45:15.000 --> 00:45:23.060
And, oh, I can move all of the import statements to the very top of the notebook and, you know, have to kind of figure out best practices on their own.
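
NOTE
Editor's sketch, not from the episode: one way to check whether a notebook's imports are gathered at the top is to list the code cells that contain import statements.
This assumes the standard .ipynb JSON layout ("cells", "cell_type", "source"); the helper name cells_with_imports and the in-memory example notebook are hypothetical.
```python
import ast
import json
# Hypothetical helper (not from the episode): list the code cells that contain imports.
def cells_with_imports(notebook_json):
    """Return indexes of code cells that contain at least one import statement."""
    notebook = json.loads(notebook_json)
    found = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format stores source as either a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        try:
            tree = ast.parse(source)
        except SyntaxError:
            # Cells with IPython magics (e.g. %matplotlib) are not plain Python; skip them.
            continue
        if any(isinstance(node, (ast.Import, ast.ImportFrom)) for node in ast.walk(tree)):
            found.append(index)
    return found
# Tiny in-memory notebook used only for illustration.
example = json.dumps({"cells": [
    {"cell_type": "markdown", "source": ["# Title"]},
    {"cell_type": "code", "source": ["import math\n", "x = math.pi"]},
    {"cell_type": "code", "source": ["y = x * 2"]},
    {"cell_type": "code", "source": ["from os import path"]},
]})
print(cells_with_imports(example))  # prints [1, 3]
```
If the result includes cells beyond the first code cell, the imports are scattered through the notebook rather than collected at the top.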

00:45:23.060 --> 00:45:39.280
So I think there's a lot of best practices, both from software development and engineering and from data analysis that have grown up in the last few years that can be brought to bear through things like Data Carpentry or Software Carpentry workshops that aren't there yet.

00:45:39.280 --> 00:45:44.240
There's been a lot of progress there, but it also seems like, you know, there's always more progress to be made.

00:45:44.240 --> 00:45:51.640
And I guess just from an academic perspective, every year you start over in a sense, right?

00:45:51.640 --> 00:45:58.840
Like every year there's a new grad student who's fresh and they've never done this and they're at the lab and you've got to bring them into the fold, right?

00:45:58.840 --> 00:46:00.780
So there's also this sort of mentoring aspect.

00:46:00.780 --> 00:46:05.600
It'd be really interesting to look at how would our results be different if we'd done this out in enterprise, right?

00:46:05.600 --> 00:46:16.320
Like a lot of what we looked at ended up being academic domain, whether it's all of these students working on notebooks that we found on GitHub or looking at folks who are in an academic lab environment.

00:46:16.880 --> 00:46:21.420
And that's largely a factor of who we have access to and who's willing to share their notebooks publicly.

00:46:21.420 --> 00:46:23.340
The rest of them, they're in these buildings right around campus.

00:46:23.340 --> 00:46:24.240
We'll just go talk to them.

00:46:24.240 --> 00:46:24.740
Uh-huh.

00:46:24.740 --> 00:46:36.040
But looking at, you know, I think there's similar issues of turnover within organizations or handing off a project from somebody who's a senior analyst to a more junior, less experienced analyst.

00:46:36.040 --> 00:46:39.280
And onboarding and disciplined practice.

00:46:39.280 --> 00:46:52.760
Folks like Hilary Parker over at Stitch Fix have talked a lot about opinionated data analysis and having a disciplined way of tracking or sharing or reviewing data analyses and needing to develop that in enterprise.

00:46:52.760 --> 00:46:53.360
That's pretty cool.

00:46:53.360 --> 00:47:01.160
I wonder how accessible that information is because I know some of these companies are somewhat open about what they do.

00:47:01.280 --> 00:47:06.860
But, again, this data analysis, this is partly what drives their company and they're not going to just give it away.

00:47:06.860 --> 00:47:11.440
It's not like I can go to Goldman Sachs and go, hey, why don't you just publish your notebooks for your analysis?

00:47:11.440 --> 00:47:13.000
Because that would be so interesting.

00:47:13.000 --> 00:47:13.720
Yeah.

00:47:13.720 --> 00:47:15.520
Like, yeah, we're not doing that.

00:47:15.520 --> 00:47:15.780
Yeah.

00:47:15.780 --> 00:47:21.560
I think the domain where I've seen outside of academia, the most open sharing of analyses is journalism.

00:47:21.560 --> 00:47:29.800
So folks like FiveThirtyEight or BuzzFeed or others who have published a number of notebooks online and even on GitHub.

00:47:30.160 --> 00:47:31.860
So many of them are in our data set.

00:47:31.860 --> 00:47:35.360
FiveThirtyEight has got an amazing set of data on GitHub.

00:47:35.360 --> 00:47:40.180
So FiveThirtyEight, spelled out with letters, slash data, I think it is.

00:47:40.180 --> 00:47:41.740
Really interesting use of notebooks.

00:47:41.740 --> 00:47:50.920
And, again, for them, it's kind of the incentives that we want to be open and show that we're a reputable organization and have others validate the claims that we make in our articles.

00:47:50.920 --> 00:47:58.880
Whereas, as you say, for companies, it's kind of the keys to the kingdom: this is how we generate value, it's our data and what we're building on top of it.

00:47:58.960 --> 00:47:59.040
Yeah.

00:47:59.040 --> 00:48:03.360
And I guess with journalism, once you publish it, you stake your claim to it.

00:48:03.360 --> 00:48:03.840
Right.

00:48:03.840 --> 00:48:08.340
But in business, like your sales model one year could be everyone's sales model next year.

00:48:08.340 --> 00:48:08.540
Right.

00:48:08.540 --> 00:48:09.340
That's not the same.

00:48:09.340 --> 00:48:09.680
Yeah.

00:48:09.680 --> 00:48:10.840
No one gives you much credit.

00:48:10.840 --> 00:48:12.160
Yeah.

00:48:12.160 --> 00:48:12.680
Yeah, for sure.

00:48:12.680 --> 00:48:15.860
So what do you think about the future?

00:48:15.860 --> 00:48:26.540
Like, we talked a little bit about this, but maybe packaging these things up, it's still maybe a little bit difficult to say, here's my notebook and go run it with everything that you need.

00:48:26.540 --> 00:48:28.700
Should we look at containers?

00:48:29.180 --> 00:48:30.680
What do you see going forward?

00:48:30.680 --> 00:48:39.780
Notebook environments like Jupyter Notebook are going to increasingly become the core infrastructure for data analysis, both in industry and academia and journalism.

00:48:40.340 --> 00:48:48.000
They've just proven so valuable as an iterative programming environment and for presenting results, though it still takes a lot of time to clean it up.

00:48:48.000 --> 00:48:55.840
But yeah, I think for the future, there's a lot of work that will have to be done in a lot of different domains before notebooks get more widely used.

00:48:55.940 --> 00:48:59.240
And as you referenced, some of it is in packaging the environment.

00:48:59.240 --> 00:49:03.980
So theoretically, you can use notebooks to rerun somebody else's analysis.

00:49:03.980 --> 00:49:16.460
But what if you don't have the same access to their data, either because of human subjects restrictions or it's just a lot of data or stored away on a server or your version of the libraries are slightly different?

00:49:16.580 --> 00:49:25.980
So I think there's a lot of work on how do we containerize and package up not only the programming environment and the language at this point in time, but also the data.
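
NOTE
Editor's sketch, not from the episode: a lightweight complement to containers is to record the interpreter and library versions next to the analysis, so a reader can at least see when their versions differ.
It uses only the standard library (importlib.metadata); the helper name environment_snapshot and the package list are hypothetical, and this captures versions only, not the data or the operating system.
```python
import json
import platform
import sys
from importlib import metadata
# Hypothetical helper (not from the episode): record interpreter and package versions.
def environment_snapshot(packages):
    """Map each package name to its installed version, or None if it is not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return {
        "python": platform.python_version(),
        "platform": sys.platform,
        "packages": versions,
    }
# Illustrative package names; adjust to whatever the notebook actually imports.
snapshot = environment_snapshot(["pandas", "numpy", "matplotlib", "seaborn"])
print(json.dumps(snapshot, indent=2))
```
Saving this output alongside a notebook gives a later reader something concrete to compare against before rerunning the analysis.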

00:49:25.980 --> 00:49:26.420
That's interesting.

00:49:26.420 --> 00:49:33.920
I think that gets into questions about, yeah, differential privacy and incentivizing data curation.

00:49:33.920 --> 00:49:41.280
Right now, that's not really recognized often in academia, at least, as a contribution to the field.

00:49:41.280 --> 00:49:47.740
Yeah, it seems like the cloud computing stuff that we touched on at the beginning helps somewhat with that, right?

00:49:47.740 --> 00:49:52.040
Like the Azure notebooks, the Google notebooks, and Sage notebooks, and so on.

00:49:52.040 --> 00:49:54.160
Like, you can share one of those.

00:49:54.160 --> 00:50:00.380
But at the same time, I guess eventually maybe those things upgrade the packages that you have access to.

00:50:00.380 --> 00:50:02.720
And that could theoretically change your results.

00:50:02.720 --> 00:50:06.600
Like, oh, we fixed a bug in this analysis thing, which now doesn't look the same.

00:50:06.600 --> 00:50:06.980
Who knows?

00:50:06.980 --> 00:50:21.560
In discussions with some folks from library sciences, they'll say, you know, for as much flak as things like the PDF document get as a way of presenting results, PDFs and paper have proved to be a really stable way to share insights and knowledge.

00:50:21.560 --> 00:50:24.960
Yeah, the versioning isn't nearly as touchy.

00:50:25.500 --> 00:50:36.000
Like, we haven't had an issue of, you know, pulling a book off the shelf and it not working anymore, except if maybe the spine broke or something like we did with trying to run code that's even five or ten years old.

00:50:36.000 --> 00:50:36.340
Yeah.

00:50:36.340 --> 00:50:39.380
It's just a much more stable environment for storing knowledge.

00:50:39.380 --> 00:50:47.500
So I think figuring out ways to do that will be really vital for things like notebooks becoming more integral to sharing results over time.

00:50:47.700 --> 00:50:48.120
I agree.

00:50:48.120 --> 00:50:53.040
I think containers have a lot of promise there because they can freeze the whole environment.

00:50:53.040 --> 00:50:57.800
But even then, you know, they're based on some other operating system.

00:50:57.800 --> 00:50:59.840
They've got to run on a certain version of Linux.

00:50:59.840 --> 00:51:01.100
It's not perfect.

00:51:01.100 --> 00:51:02.340
Yeah, it's not perfect.

00:51:02.340 --> 00:51:02.980
Yeah.

00:51:02.980 --> 00:51:12.020
And then as you referenced earlier, I think there's in the future a lot to be done in the realm of like Software Carpentry, Data Carpentry, scaling education.

00:51:12.440 --> 00:51:27.780
As much as we can do to tweak these interfaces to make it easier to develop in or to write really clear narratives in, I still think a lot is going to rely on either apprenticeship models or mentoring or training in ways of doing and documenting data analysis.

00:51:27.780 --> 00:51:32.160
So figuring out the right way to do that is a super tricky problem.

00:51:32.160 --> 00:51:33.060
Yeah, it sure is.

00:51:33.060 --> 00:51:36.720
I wonder if we're going to have some more specialization.

00:51:36.720 --> 00:51:43.240
So in the early days, it's like, well, the people who want to do computer programming, they got electrical engineering degrees for whatever reason.

00:51:43.240 --> 00:51:46.480
And then we have the CS degree in computer science.

00:51:46.480 --> 00:51:47.760
Like, I don't know.

00:51:47.760 --> 00:51:51.800
I don't feel like I've done any science really in computer for a really long time.

00:51:51.800 --> 00:51:53.520
But it's, you know, that's what it's called.

00:51:53.520 --> 00:51:57.120
Even though you're not actually stating hypotheses and doing things, you're just building.

00:51:57.120 --> 00:51:58.320
It's more like engineering, right?

00:51:58.320 --> 00:52:05.520
So we then got software engineering degrees that are slightly different applied ways of doing what computer science was doing.

00:52:05.720 --> 00:52:12.680
Maybe we will get like computational scientist degree specializations or something coming along.

00:52:12.680 --> 00:52:17.800
It's like half computer science, but half sort of these other data sides of things you're talking about.

00:52:17.800 --> 00:52:28.720
I think many of the data science institutes that are popping up either online or at universities are trying to figure out like, what of this is specialized and unique to working with data?

00:52:28.720 --> 00:52:33.900
What of it is just rehashing things that we've already learned from software engineering and versioning?

00:52:34.340 --> 00:52:42.500
What do we even have to specialize further, in that biology's use of programming for data analysis will look fundamentally different from astronomy's?

00:52:42.500 --> 00:52:48.760
You know, the practices will be as different as Astropy and another library are.

00:52:48.760 --> 00:52:51.480
I can see a world where we end up there.

00:52:51.480 --> 00:53:01.560
As computation becomes more and more the foundation of all these different degrees, that each degree is like, no, no, we're going to have a biological computationist degree.

00:53:01.560 --> 00:53:04.220
It's not going to be over in the CS world.

00:53:04.220 --> 00:53:10.060
It's going to be here and we're going to run it and it's going to have some CS, but it's also going to have lots of biology and other aspects.

00:53:10.280 --> 00:53:10.460
Yeah.

00:53:10.460 --> 00:53:18.060
And that's one aspect of future work we haven't gotten to, but I think would be fascinating is looking at how people use notebooks differently in different environments.

00:53:18.060 --> 00:53:21.080
Both like how is enterprise different than academia?

00:53:21.080 --> 00:53:25.760
How are beginning computer science students different from later ones?

00:53:25.760 --> 00:53:31.640
Or, you know, how does astronomy differ from chemistry and how they use these environments?

00:53:32.020 --> 00:53:34.400
So I think that's a fascinating area to look at next.

00:53:34.400 --> 00:53:35.120
It definitely is.

00:53:35.120 --> 00:53:35.580
All right.

00:53:35.580 --> 00:53:39.380
Well, I think that probably is a good place to leave things.

00:53:39.380 --> 00:53:39.720
Okay.

00:53:39.720 --> 00:53:41.020
A pretty optimistic future.

00:53:41.020 --> 00:53:42.040
We're getting a little low on time.

00:53:42.040 --> 00:53:45.580
So maybe, maybe we'll ask you the two questions at the end of the show.

00:53:45.580 --> 00:53:50.480
First of all, if you're going to edit some Python code, what editor do you use?

00:53:50.480 --> 00:53:52.000
I'm really a Jupyter fanboy.

00:53:52.000 --> 00:53:56.640
So I use it whenever I need to analyze data, used it for this study.

00:53:56.640 --> 00:53:58.540
And again, really like the environment.

00:53:58.820 --> 00:54:03.340
The exception to that will be when I'm doing more software application development.

00:54:03.340 --> 00:54:11.840
So I've actually, if I'm doing web development and have Python on the server side and JavaScript on the front end, then I'll often use Atom for that.

00:54:11.840 --> 00:54:14.260
Just because I'm switching back and forth between languages.

00:54:14.260 --> 00:54:14.700
Yeah.

00:54:14.700 --> 00:54:21.300
If you have a bunch of different files and they're kind of all working together, especially cross language like CSS, JavaScript, HTML, et cetera.

00:54:21.300 --> 00:54:21.620
All right.

00:54:21.620 --> 00:54:25.700
It's something, Jupyter is not amazing for that, but it is really great for exploring.

00:54:25.700 --> 00:54:26.880
Awesome.

00:54:26.880 --> 00:54:27.720
All right.

00:54:27.740 --> 00:54:29.500
How about a notable PyPI package?

00:54:29.500 --> 00:54:44.100
I think Philip may have mentioned this too, but especially in the data analysis world, Anaconda, I have yet to find a better way just to get people up and running on doing data analysis and quickly package together pandas, NumPy, Matplotlib, seaborn.

00:54:44.100 --> 00:54:45.140
Especially on Windows.

00:54:45.140 --> 00:54:53.200
Especially on Windows where some of those tools are hard to pip install because like some weird compiler thing is missing.

00:54:53.200 --> 00:54:54.660
Yeah, that's cool.

00:54:54.660 --> 00:54:57.560
If I had to pick a specific one, it'd probably be the visualization libraries.

00:54:57.560 --> 00:55:00.700
Matplotlib or seaborn are really where I spend my time.

00:55:00.700 --> 00:55:01.060
Awesome.

00:55:01.060 --> 00:55:02.640
Yeah, I'll throw one out for you.

00:55:02.640 --> 00:55:05.100
I don't normally do this, but this one is like so relevant.

00:55:05.100 --> 00:55:05.980
I just came across it.

00:55:05.980 --> 00:55:07.820
Have you heard of Pixie Debugger?

00:55:07.820 --> 00:55:09.720
P-I-X-I-E Debugger?

00:55:09.720 --> 00:55:10.240
No.

00:55:10.440 --> 00:55:15.760
So Pixie Debugger is a visual interactive debugger for Jupyter Notebooks.

00:55:15.760 --> 00:55:16.880
That is fantastic.

00:55:16.880 --> 00:55:29.600
You just include it and it gives you like below your cell, you put a little decorator type magic command onto a cell and then you can just step through, step forward, inspect the variables visually as you're going through that cell.

00:55:29.600 --> 00:55:30.060
It's pretty awesome.

00:55:30.060 --> 00:55:30.960
That is awesome.

00:55:31.080 --> 00:55:35.120
Yeah, so many folks will split cells to try and figure out, okay, where does this thing fail?

00:55:35.120 --> 00:55:36.180
So that's perfect.

00:55:36.180 --> 00:55:37.380
Yeah, it's really, really great.

00:55:37.380 --> 00:55:38.160
All right.

00:55:38.160 --> 00:55:39.660
So people are interested.

00:55:39.660 --> 00:55:41.180
They want to do more with this.

00:55:41.180 --> 00:55:42.840
Maybe look at your research.

00:55:42.840 --> 00:55:44.120
Is the data available?

00:55:44.120 --> 00:55:44.860
What do they do?

00:55:44.980 --> 00:55:49.560
I have a personal website, adamrule.com, that they can go to that kind of links out to everything.

00:55:49.560 --> 00:55:56.220
And that will have copies of the papers documenting my research as well as links to the data repositories.

00:55:56.220 --> 00:55:57.720
How big is the data?

00:55:57.720 --> 00:55:58.460
Is it a lot?

00:55:58.460 --> 00:56:00.280
It's about 600 gigabytes.

00:56:00.280 --> 00:56:02.220
600 gigabytes.

00:56:02.220 --> 00:56:02.720
Wow.

00:56:02.720 --> 00:56:08.560
Thankfully, our university had a very fast internet speed to be able to download all of that.

00:56:08.560 --> 00:56:09.760
Oh my goodness.

00:56:09.760 --> 00:56:10.200
Yeah, yeah.

00:56:10.200 --> 00:56:10.560
No kidding.

00:56:10.560 --> 00:56:12.680
Where do you host it?

00:56:13.080 --> 00:56:17.820
That's actually a non-trivial amount of money if you actually had to put it on S3 or something.

00:56:17.820 --> 00:56:19.340
That's 50 bucks per download.

00:56:19.340 --> 00:56:21.340
Props to UC San Diego again on that.

00:56:21.340 --> 00:56:23.920
Their library has graciously agreed to host that.

00:56:23.920 --> 00:56:29.500
So many other sites that we looked at had limits at like 100 gigabytes or something for data sets.

00:56:29.500 --> 00:56:31.180
So they're hosting that.

00:56:31.180 --> 00:56:32.860
And again, I have a link off of my website.

00:56:32.860 --> 00:56:34.840
But we have both the full data set.

00:56:34.840 --> 00:56:37.020
And then we have a starter data set.

00:56:37.020 --> 00:56:42.880
It's about one to two gigabytes with a subset of those, but with all the different data types.

00:56:42.880 --> 00:56:46.020
And example notebooks that people can use to begin playing with it.

00:56:46.020 --> 00:56:46.820
Oh, that's really cool.

00:56:46.820 --> 00:56:48.740
So they can start to play with it.

00:56:48.740 --> 00:56:51.480
And if they're really committed, they can download 600 gigs.

00:56:51.480 --> 00:56:51.980
Exactly.

00:56:51.980 --> 00:56:52.760
Exactly.

00:56:52.760 --> 00:56:54.780
Oh, that's pretty awesome.

00:56:54.780 --> 00:56:56.800
Well, Adam, this is really interesting research.

00:56:56.800 --> 00:57:01.080
And thanks for sharing your view into the whole notebook space.

00:57:01.080 --> 00:57:01.420
Yeah.

00:57:01.420 --> 00:57:02.900
Thanks again for chatting today.

00:57:02.900 --> 00:57:03.400
It's been a pleasure.

00:57:03.400 --> 00:57:03.800
You bet.

00:57:03.800 --> 00:57:04.100
Bye.

00:57:04.100 --> 00:57:04.400
Bye.

00:57:05.400 --> 00:57:08.080
This has been another episode of Talk Python To Me.

00:57:08.080 --> 00:57:10.200
Our guest has been Adam Rule.

00:57:10.200 --> 00:57:13.820
And this episode has been brought to you by Linode and Studio 3T.

00:57:13.820 --> 00:57:17.980
Linode is bulletproof hosting for whatever you're building with Python.

00:57:17.980 --> 00:57:22.300
Get four months free at talkpython.fm/linode.

00:57:22.300 --> 00:57:24.080
That's L-I-N-O-D-E.

00:57:24.080 --> 00:57:29.680
With Studio 3T, you can write SQL queries and translate them automatically to Python.

00:57:29.800 --> 00:57:34.860
Try their database IDE today at talkpython.fm/studio.

00:57:34.860 --> 00:57:36.720
Want to level up your Python?

00:57:36.720 --> 00:57:41.900
If you're just getting started, try my Python Jumpstart by Building 10 Apps or our brand new

00:57:41.900 --> 00:57:43.760
100 Days of Code in Python.

00:57:43.760 --> 00:57:47.560
And if you're interested in more than one course, be sure to check out the Everything Bundle.

00:57:47.560 --> 00:57:49.820
It's like a subscription that never expires.

00:57:49.820 --> 00:57:52.000
Be sure to subscribe to the show.

00:57:52.000 --> 00:57:54.220
Open your favorite podcatcher and search for Python.

00:57:54.220 --> 00:57:55.460
We should be right at the top.

00:57:55.460 --> 00:58:01.240
You can also find the iTunes feed at /itunes, Google Play feed at /play, and

00:58:01.240 --> 00:58:04.780
direct RSS feed at /rss on talkpython.fm.

00:58:04.780 --> 00:58:06.640
This is your host, Michael Kennedy.

00:58:06.640 --> 00:58:08.000
Thanks so much for listening.

00:58:08.000 --> 00:58:09.060
I really appreciate it.

00:58:09.060 --> 00:58:11.020
Now get out there and write some Python code.

00:58:11.020 --> 00:58:31.640
I'll see you next time.

