
#333: State of Data Science in 2021 Transcript

Recorded on Thursday, Sep 9, 2021.

00:00 We know that Python and data science are growing in lockstep together, but exactly what's happening

00:05 in the data science space in 2021? Stan Siebert from Anaconda is here to give us a report on what

00:10 they found with their latest State of Data Science in 2021 survey. This is Talk Python to Me,

00:16 episode 333, recorded August 9th, 2021.

00:19 Welcome to Talk Python to Me, a weekly podcast on Python. This is your host, Michael Kennedy.

00:37 Follow me on Twitter where I'm @mkennedy and keep up with the show and listen to past episodes

00:42 at talkpython.fm and follow the show on Twitter via @talkpython. We've started streaming most of our

00:49 episodes live on YouTube. Subscribe to our YouTube channel over at talkpython.fm/youtube to get

00:54 notified about upcoming shows and be part of that episode. This episode is brought to you by Shortcut,

01:01 formerly known as clubhouse.io, and masterworks.io. And the transcripts are brought to you by Assembly

01:07 AI. Please check out what they're offering during their segments. It really helps support the show.

01:11 Stan, welcome to Talk Python to Me.

01:15 Hey, nice to be here.

01:16 Yeah, it's great to have you here. I'm super excited to talk about data science things,

01:21 Anaconda things, and we'll even squeeze in a little of one of my favorites, the Apple M1 stuff, mixed in

01:27 with data science. So it should be a fun conversation.

01:30 I'm also very excited about the M1.

01:32 Nice. Yeah, we can geek out about that a little bit. That'll be fun. But before we get there,

01:37 let's just start with your story. How'd you get into programming in Python?

01:40 Yeah, programming started as a kid, you know, dating myself here. I learned to program

01:44 BASIC on the Osborne 1, a suitcase of a computer that we happened to have as a kid. And then

01:51 eventually picked up C and stuff like that. Didn't learn Python until college, mostly because I was

01:57 frustrated with Perl. I just found that Perl just never fit in my brain right. And so I was like,

02:03 well, what other scripting languages are there? And I found Python. And that was a huge game changer.

02:08 I didn't really use it professionally or like super seriously until grad school when I had a summer

02:14 research job. And I realized that this new thing called NumPy could help me do my analysis.

02:19 And so that was when I really started to pick up Python seriously. And now here I am, basically.

02:25 Yeah, what were you studying in grad school?

02:26 I was doing physics. So I did particle physics and used Python quite extensively, actually,

02:33 throughout my research. And C++, unfortunately, for better or worse. So yeah, that's how it went.

02:39 I always ended up kind of being the software person on experiments. So when I was leaving academia,

02:45 going into software engineering kind of was a logical step for me.

02:49 I was studying math in grad school and did a lot of programming as well. And I sort of trended more and

02:56 more towards the computer side and decided that that was the path as well. But it's cool. A lot of the

03:01 sort of logical thinking, problem solving you learn in physics or math or whatever, they translate

03:06 pretty well to programming.

03:08 Yeah, yeah. And definitely, you know, working on large experiments, a lot of the sort of soft skills

03:14 of software engineering, things like how do you coordinate with people? How do you design software

03:18 for multiple people to use? That sort of thing. I actually, I inadvertently was learning how to be

03:22 a software manager as a physicist and then only realized it later when I went into industry.

03:29 And how about now? You're over at Anaconda, right?

03:31 Yeah. So, you know, maybe I'm doing the same thing. So now I'm both a developer and a manager

03:37 at Anaconda.

03:38 It's a direct path from like PhD physics, particle physics to programming to data science at Anaconda.

03:45 Is that how it goes?

03:45 Yeah. I mean, we employ a surprising number of scientists who are now software engineers.

03:51 And so I manage the team that does a lot of the open source at Anaconda. So we work on stuff like

03:57 Numba and Dask and various projects like that. Just recently hired the Pyston developers to broaden our

04:04 scope into more Python JIT optimization kind of stuff. So yeah, so I'm doing a mix of actual development on

04:11 some projects as well as just managing strategy, the usual kind of stuff.

04:16 Well, I suspect most people out there know what Anaconda is, but I have listeners who come from all

04:20 over, you know, what is Anaconda? It's kind of like a Python you download, but it's also,

04:25 it has its own special advantages, right?

04:28 Yeah. I mean, where we came from, and what is still our main focus, is how to get Python and just

04:35 broader data science tools. One of the interesting things about data science is it's not just Python.

04:40 Most people are going to have to combine Python, maybe they don't realize it, with Fortran

04:44 and C++ and all the things that underpin all of these amazing libraries. And so a lot of what we do is

04:49 try to get Python into the hands of data scientists, you know, get them the latest things and make it

04:54 easy for them to install on whatever platform they're on. Windows, Mac, Linux, that sort of thing.

04:58 So, you know, Anaconda has a free, call it Individual Edition. It's basically a

05:05 package distribution and installer that lets you get started. And then there are thousands

05:10 of Conda packages (Conda is the packaging system) that you can install

05:14 where, you know, we, or, you know, the broader community have done a lot of the hard work to

05:19 make sure all of those compiled packages are built to run on your system.

05:24 That's one of the real big challenges of the data science stuff is getting it compiled for your

05:29 system. Because if I use requests, it's, you know, pip install requests. I probably,

05:34 maybe it runs a setup.py. Maybe it just comes down as a wheel. I don't know, but it's just pure

05:39 Python and there's not a whole lot of magic. If I'm really getting far out there, maybe I'm

05:44 using SQLAlchemy and it has some C optimizations. It will try to compile. And if it doesn't, well,

05:49 it'll run some slower Python version probably. But in the data science world, you've got really heavy

05:55 dependencies, right? Like, as you said, stuff that requires a Fortran compiler on your computer.
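
As an aside for readers: whether pip can hand you a prebuilt wheel or has to build from source depends on the platform tag your interpreter reports. A minimal sketch using only the standard library (the tags in the comments are illustrative examples):

```python
import sysconfig

# The platform tag pip matches against wheel filenames,
# e.g. "macosx-11-arm64", "win-amd64", or "linux-x86_64".
# If no wheel matches this tag, pip falls back to the sdist,
# which is when you suddenly need that C/Fortran toolchain.
print(sysconfig.get_platform())
```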

05:59 I don't know if I have a Fortran compiler on my Mac. I'm pretty sure I don't. Maybe it's in there.

06:05 Probably not. Right. And as maybe C++, probably have a C++ compiler, but maybe not the right one.

06:12 Maybe not the right version. Maybe my path is not set up right. And plus it's slow, right? All of these

06:18 things are a challenge. So Anaconda tries to basically be, let's rebuild that stuff with a tool chain that

06:25 we know will work and then deliver you the final binaries, right? The challenge with that for a lot

06:29 of tooling is it's downloaded and installed to different machines with different architectures,

06:34 right? So you've gone and built stuff for macOS, you built stuff for Linux, you built stuff for

06:40 Windows and whatnot. Is that right?

06:42 Yeah. Yeah. Building software is non-trivial and no matter how much a developer tries to automate it so

06:50 that things just work, it helps to have someone do a little bit of quality control and a little bit of

06:56 just deciding how to set all the switches to make sure that you get a thing that works so that you

07:02 can just get going quickly. Early on, I remember in the sort of 2014, 2015 era, Anaconda was extremely

07:10 popular with Windows users who did not have a lot of good options for how to get this stuff.

07:14 Right.

07:15 Like with Linux, you could kind of get it together and get it going if you were motivated. On Windows,

07:19 it was often just very much, I don't know what to do. And so this made it sort of one-stop

07:26 shopping for all of these packages. And then another thing we wanted to do is make sure that there was a

07:29 whole community of package building around it. It wasn't just us. So things like Conda Forge is a

07:35 community of package builders that we are part of and hugely support. Because there's a long tail,

07:42 there's always going to be stuff that is going to be, you know, we're never going to get around to

07:45 packaging.

07:45 Right. There's important stuff that you're like, this is essential. So NumPy, Matplotlib, and so on.

07:52 Like you all take control of making sure that that one gets out. But there's some, you know,

07:57 biology library that people don't know about that you're not in charge of. And that's what the

08:03 Conda Forge plus Conda is, sort of like pip and PyPI, but in a slightly more structured way.

08:10 Yeah. Yeah. And that was why, you know, Conda was built to help make it so that it is possible for

08:14 this community to grow up, for people to package things that aren't Python at all that you might

08:18 need, all kinds of stuff like that. And yeah, they, you know, there's always going to be, you know,

08:24 in your specific scientific discipline. I mean, so for example, Bioconda is a really interesting

08:28 distribution of packages built by the bioinformatics community, built on Conda, but they have all of the

08:34 packages that they care about. And many of which I've never heard of, aren't in common use, but are really important to that scientific discipline.

08:41 Out in the live stream, we have a question from Neil Heather. Hey Neil, I mentioned Linux, Windows,

08:47 macOS. Neil asked, does Anaconda work on Raspberry Pi OS as in ARM64?

08:53 Yeah. So the answer to that is Anaconda, not yet. Conda Forge does have a set of community-built

09:01 packages for Raspberry Pi OS. The main challenge there is actually, we just a couple months ago

09:08 announced ARM64 support, but it was aimed at the server ARM machines that are running the ARMv8.2

09:14 instruction set, while the Raspberry Pi is ARMv8.0. And so the packages we built, which will work great on

09:21 server ARM, are using some instructions that Raspberry Pis can't support. But Conda Forge,

09:27 so if you go look up Conda Forge and Raspberry Pi, you'll find some instructions on how to install for

09:33 that.

09:33 ARM is interesting, right? So let's talk a little bit about that because I find that this whole Apple

09:40 Silicon move, you know, they created their M1 processor and they said, you know what, we're dropping

09:47 Intel, dropping x86, more importantly, and we're going to switch to basically iPad processors,

09:54 slightly amped up iPad processors that turn out to be really, really fast, which actually

09:59 blew my mind and was unexpected. But I think the success of Apple is actually going to encourage

10:07 others to do this as well. And it's going to add, you know, more platforms that things like

10:14 Anaconda, Condo Forge and stuff are going to have to support, right? So there's a cool article over here

10:21 by you on Anaconda called A Python Data Scientist's Guide to the Apple Silicon Transition.

10:28 Yeah, this was, you know, I've been, I'm a huge chip nerd, just due to background and thinking about

10:33 optimization and performance. And so this came out of, you know, some experiments I was doing

10:40 to just understand, I mean, we got some M1 Mac minis into our data center and started immediately

10:45 playing with them. And I realized after a while that I should take the stuff

10:49 I was learning and finding and put it together in a document for other people, because I couldn't find

10:53 this information anywhere organized in a way that was, you know, for me as a Python developer,

10:59 I was having a hard time putting it all together.

11:02 Right. There was some anecdotal stuff about just like, yeah, this kind of works for me,

11:07 or this is kind of fast or this kind of slow, but this is a little more,

11:10 here's the whole picture and what the history is and where it's going and what it means and

11:14 specifically focused on the Conda side of things. Right.

11:19 Yeah. And even just the, the, the Python side, it's, I mean, it's, it's sort of an interesting

11:23 problem of, you know, Python's an interpreted language. So you're like, well, I don't, I don't

11:27 have any machine code to worry about. Right. But the interpreter of course is compiled. So you at

11:32 least need that. And then many, many Python packages also contain compiled bits and you'll need those

11:37 too. And this is an interesting broad problem for the whole Python

11:42 ecosystem to try and tackle, because it's not too often that a whole new platform just appears,

11:46 you know; adapting to a whole new architecture takes a while.

11:50 It absolutely does. I think there's a lot of interesting benefits to come. I do want to point

11:56 out for people listening. If you jump over to the PSF JetBrains Python developer survey,

12:02 the most recent one from 2020, and you look around a bit, you'll see that while we don't run

12:09 production stuff on macOS that much, 29% of the developers are using macOS to develop Python code.

12:18 Right. So Apple's pledge that they're going to take a hundred percent of this and move it over

12:24 to Apple Silicon means almost a third of the people running Python in a couple of years will be under

12:31 this environment. Right. And even if you have a Windows or Linux machine and you don't care about

12:36 macOS, you may be maintaining a package for people who do. Yeah. And that means Apple Silicon, right?

12:42 Yeah. And it's interesting. There's just other stuff you take

12:46 for granted, you know, the availability of free continuous integration services, which has been

12:54 transformative for the open source community. I mean, it's really improved the software quality that all

12:57 these open source projects can automatically run their tests and build packages every time there's a new

13:02 change. However, then something like this comes out. And until you get, you know, ARM Macs into these services,

13:10 until they're, you know, freely available, a lot of the infrastructure of these open source

13:14 projects, they don't have a way to test on an M1 Mac except manually if they happen to have one and they

13:19 don't have a way to automate their build on an M1 Mac until that sorts itself out. Yeah. And thinking

13:25 about the workflow here, there's two challenges that this presents. One is you want to do a git push to

13:32 production or a git push to some branch or tag it. And that's going to trigger a CI build that might fork

13:38 off to run a Windows compile, a Linux compile, a Mac compile, generate some platform

13:44 specific wheels with like Fortran compiled in there or whatever. And then you're going to ship that off.

13:49 If that CI system doesn't have an Apple Silicon machine, it can't build for Apple Silicon, right?

13:56 Yep. Yep.
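
For readers, a minimal sketch of the kind of build matrix being described, in GitHub Actions syntax. The runner labels and steps here are illustrative assumptions, and, as discussed, no hosted Apple Silicon runner was available at the time of this conversation:

```yaml
# Illustrative cross-platform wheel-building matrix (GitHub Actions).
jobs:
  build-wheels:
    strategy:
      matrix:
        # Note: no Apple Silicon runner in this list at the time discussed.
        os: [ubuntu-latest, windows-latest, macos-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
      - run: pip install build
      # Compiles any C/C++/Fortran bits with each platform's toolchain
      # and produces a platform-specific wheel.
      - run: python -m build --wheel
```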

13:57 And there was a time.

14:00 Yeah. Sorry. I mean, yeah. Well, where do you get M1s in the cloud, right? I know

14:05 there's a few hosted places, but on a normal GitHub or Azure setup, it's not common to just go grab a bunch

14:13 of those and pile them up. Right.

14:15 Yeah. And it'll take time. I mean, eventually in the same way that, you know, I was thinking back to,

14:19 you know, go back four or five years ago, there wasn't a whole lot of options for Windows CI

14:25 available. There were a couple of providers, and then there was sort of a huge change, and then

14:31 pretty much everyone offered a Windows option and they were faster and all of this stuff. And so I think,

14:36 but that took time. And I think that's the thing: the hardware is in people's hands now,

14:41 and it's just going to get more and more. And it's unclear how quickly we can catch up.

14:47 That's going to be a challenge for all of us.

14:49 It's absolutely going to be a challenge. It's interesting. I hope that we get there soon.

14:54 The other problem in this same workflow is I was actually just looking at some NumPy issues,

15:02 specifically issue #18143. I'm sure people have that right off the top of their head.

15:07 The title is "Please provide universal2 wheels for macOS," and there's a huge, long conversation

15:15 about it, I mean, many, many messages in the thread. And one of the

15:22 problems they brought up is like, look, we can find a way to compile the binary bits, the C++ bits for M1,

15:30 but we can't test it. Like we as developers cannot run this output, and

15:36 it's a little sketchy to just compile and ship it to the world. And to be fair, this is on January 9th of

15:43 2021, when it was still hard, you know, these things were still shipping and still arriving there.

15:48 It was not like you just go to the Apple Store and pick one up.

15:51 This portion of Talk Python to Me is brought to you by Shortcut, formerly known as clubhouse.io.

15:58 Happy with your project management tool? Most tools are either too simple for a growing engineering team

16:03 to manage everything, or way too complex for anyone to want to use them without constant prodding.

16:08 Shortcut is different though, because it's worse. No, wait, no, I mean, it's better.

16:12 Shortcut is project management built specifically for software teams. It's fast, intuitive, flexible,

16:18 powerful, and many other nice, positive adjectives. Key features include team-based workflows.

16:23 Individual teams can use default workflows or customize them to match the way they work.

16:29 Org-wide goals and roadmaps. The work in these workflows is automatically tied into larger company

16:34 goals. It takes one click to move from a roadmap to a team's work to individual updates and back.

16:40 Tight version control integration. Whether you use GitHub, GitLab, or Bitbucket,

16:45 Clubhouse ties directly into them, so you can update progress from the command line.

16:49 Keyboard-friendly interface. The rest of Shortcut is just as friendly as their power bar,

16:54 allowing you to do virtually anything without touching your mouse. Throw that thing in the trash.

16:59 Iteration planning. Set weekly priorities and let Shortcut run the schedule for you with accompanying

17:05 burndown charts and other reporting. Give it a try over at talkpython.fm/shortcut.

17:11 Again, that's talkpython.fm/shortcut. Choose shortcut because you shouldn't have to project manage

17:18 your project management.

17:22 Yeah, as an interesting example, Conda Forge was able to get Conda packages for Apple Silicon out pretty

17:29 quickly, but they did it with a clever sort of cross-compilation strategy where they were building

17:34 the ARM packages on x86 Macs and pushing them out. But they had enough people manually testing that they

17:44 had confidence in the process that it was okay. But that's very different than how they build other

17:48 packages, which are built and tested immediately, automatically. And if they fail tests, they don't

17:53 get uploaded. So, you know, it's a risk, but it helped get the software out in

17:58 people's hands quicker. But yeah, long-term we need to get these machines onto all these CI systems so

18:03 that we can use the same techniques we've built up over the years to ensure we have quality software.

18:08 I think we'll get there, but it's just going to take some time, right?

18:12 Yep. Yep. Yeah.

18:13 Let's see. Neil on the livestream says, speaking of open source, Apple is rumored to be hiring experts

18:19 in RISC-V, perhaps to move away from having to pay licensing fees to ARM. Yeah. I'm not

18:25 sure about that, but.

18:26 Yeah. I mean, what's interesting here is, you know, other chip architectures

18:32 have been around for a long, long time, but until very recently, you know, average users

18:38 didn't have to think about x86 versus ARM. ARM was for mobile phones, and you

18:43 never had to worry about PowerPC or anything like that.

18:45 Not for real computers.

18:46 Yeah.

18:46 But now, you know, going from one to two is a big step. Now the floodgates

18:53 are open and now we're thinking about, well, what else is out there? I mean, you know,

18:56 RISC-V, I'm not sure how you say it, I think "risk five" is what you call it, is an interesting

19:01 thing. And even, you know, being a completely open standard, you don't have to pay licensing

19:07 fees, as mentioned. I don't know if Apple's going to make this transition again so quickly. But I

19:15 I can guarantee you that, you know, everyone probably somewhere in a basement is thinking

19:19 about it, maybe doing some experiments. But yeah, chips move slowly, but it's interesting to think

19:24 about.

19:25 Yeah. That's not a thing you can change very frequently and drag developers along. I mean,

19:29 we're talking about all the challenges, you know, that are just down the pipeline from

19:34 that.

19:34 Yeah.

19:35 Very interesting. All right. Well, let's just talk a little bit about this.

19:38 First, you're excited about these as a data scientist.

19:42 Yeah. I'm there really for sort of two reasons. I mean, one thing that's

19:46 interesting is just the power efficiency. There was a talk long ago from the chief

19:51 scientist at NVIDIA, which really made an impression on me, in which he, you know, paraphrasing roughly,

19:55 basically said that, because everything is now power constrained, power efficiency equals performance

20:02 in a way that is, you know, normally you just think, well, just put more power in there, but

20:06 that heat has to go somewhere. We long since hit that wall. And so now you just have to

20:11 get more efficient to get more performance. Right.

20:13 That's an interesting opportunity.

20:15 You can get larger power supplies and larger computers. I have a

20:19 gaming sim computer and it is so loud. If you get it going full power, like if the windows are open,

20:24 you can hear it outside the house. It's literally that loud. But at the same time, it's not just on

20:30 your personal computer, you know, in the cloud and places like that, right? You pay not just for,

20:36 you know, how much performance you get. There's some sort of combination of how much energy does that

20:41 particular processor take to run. And if it's one fifth, you might be able to buy more cloud compute

20:47 per dollar.

20:48 Yeah. Power and cooling is a huge part of, you know, data center expenses.

20:53 And even just, you know, you can put maybe, you know, 100 to 300 watts into a CPU.

20:59 You're not going to put, you know, multiple kilowatts in there or something. And so

21:04 where is that? What else can you do? And a lot of that is that, you know,

21:09 Moore's law is driven a lot by just, every time you shrink the process, you do get more power

21:13 efficient. But now it's interesting to think about architectures differently, given

21:18 that ARM has come into its own in an extremely power constrained environment. And so now

21:23 we're letting it loose on a laptop, which has way more power available compared to a cell phone.

21:29 What could we do if we fed it, you know, right from the socket in the wall?

21:33 Yeah. And you know, what happens when I put it in the data center?

21:36 Yeah.

21:38 So that's, I think ARM in the data center is going to be really important.

21:42 Yeah.

21:42 Yeah.

21:43 Yeah.

21:43 I think, definitely, I'd always expected that to come before the desktop.

21:49 To be honest, I was surprised, as many people were, by the, you know, suddenness of the Apple

21:55 transition, because I had assumed this maybe would happen much after we all got used to ARM

22:00 in the data center, where you're probably running Linux and it's easy to recompile, compared

22:05 to, you know, Mac and stuff like that.

22:06 Yeah. That's what I thought as well. The payoff is so high, right? They spend so much energy

22:12 on both direct electricity, as well as then cooling from the waste heat, from that energy

22:18 that the payoff is just completely clear. Right. All right. So let's see, a couple

22:24 of things that you pointed out that make a big difference here: obviously ARM versus x86,

22:29 a built-in on-chip GPU, the whole system-on-a-chip thing, rather than a bunch of pieces going

22:35 through the motherboard, is pretty interesting. But I think maybe the most interesting one has to do

22:41 with the acceleration, things like the Apple neural engine that's built in and whatnot.

22:46 It sounds like the data science libraries in general are not targeting the built-in neural

22:52 engines yet, but maybe, maybe they will in the future. I don't know.

22:55 Yeah. It's a, it's something that we're going to have to figure out because, I mean, I think it

22:59 was a bit of chicken the egg that, you know, until this happened, you didn't have this kind of

23:02 hardware just sitting on people's desks. and you weren't going to, you know, run, data science

23:07 stuff on your phone. So now that it's here now, the question is, okay, what can we do with it?

23:12 I mean, right now, for example, you know, for the Apple neural engine, you can take advantage

23:16 of it using something called Core ML Tools, which I actually did a webinar on sometime back.

23:22 But that's basically for when you've trained a model and you want to run inference on it

23:27 more efficiently and quickly. But that's, you know, that's it. There's an alpha

23:31 release of TensorFlow that's GPU accelerated, and it would take advantage of the GPU

23:37 on the M1, if you're running it there, but that's super early. And there are

23:41 a lot more opportunities like that, but again, that will take time to adapt.

23:45 It will. I suspect as there are bigger gains to be had, they'll be more likely to be adopted.

23:53 Right. So for example, I have my Mac mini here that I just completely love, but it, it's not that

24:00 powerful say compared to like a GeForce video card or something like that. But if Apple announces

24:06 something like a huge Mac Pro, with many, many cores, you know, 128 instead of 16 or

24:14 whatever, right, then all of a sudden that neural engine becomes

24:18 really interesting, right? And maybe it's worth going to the extra effort of writing specific code for it.

24:23 Yeah. Yeah. Well, that's the other thing that's interesting about this is we've only seen one

24:27 of these chips and it is, by definition, the slowest one that will ever be made. And so

24:34 we don't even know, you know, what is it going to be like to scale up? I mean,

24:37 one of those things is, you know, if you're targeting that big desktop user, how are

24:42 they going to scale this up? This all fits on one package. Can they still do that? Will they

24:46 have to split out into multiple packages? There's a lot of engineering challenges that they

24:51 have to solve, and from the outside we're not sure how they're going to solve them yet. So

24:56 we'll have to see. It's going to be exciting to see that come along here.

25:00 All right. So, let's touch on just a couple of things, getting Python packages for M1.

25:05 What are some of the options there? Yeah. So the status is still roughly how I have it in

25:11 this article, which is basically: you can use pip to install stuff if wheels have been built, and a

25:16 number of packages like NumPy have started to catch up and have wheels that will run on the M1.

25:21 Another option, which works surprisingly well, is to just use an x86 Python packaging distribution.

25:27 I think that's actually what I'm doing because it just runs over Rosetta 2.

25:31 Yeah. And that, yeah, it just works. It is shocking. I mean, with Rosetta 2, on average,

25:37 I'm finding sort of like a 20% speed hit, which for an entire architecture switch is amazing.
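
A quick way for readers to tell which flavor of interpreter they have: a native arm64 Python reports "arm64", while an Intel build running under Rosetta 2 on an M1 reports "x86_64". A minimal sketch using only the standard library:

```python
import platform

# 'arm64'  -> native Apple Silicon interpreter.
# 'x86_64' on an M1 Mac -> an Intel build being translated by Rosetta 2.
print(platform.machine())
```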

25:44 I've never seen that before. Or you can use, as I mentioned earlier, Conda Forge's

25:50 sort of experimental macOS ARM package distribution, which doesn't have

25:55 everything, but has a lot of things, and, you know, it is all built for ARM.

26:00 There's no translation or anything going on there.

26:02 Right. And on python.org, the Python that you go and download, I believe it's

26:09 a universal binary now for sure. So that means it'll adapt and just run on ARM or run on x86.

26:17 You just get one binary. The NumPy conversation was kind of around that as well,

26:23 I believe. All right. You did some performance analysis on the performance cores

26:28 versus efficiency cores. That was pretty interesting, and pretty similar to hyper-threading.
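
As an aside for readers: Python's standard library only reports the total logical core count; it does not distinguish performance cores from efficiency cores, which needs platform-specific tools. A minimal sketch:

```python
import os

# Total logical CPU count. On an M1 this reports 8 (4 performance +
# 4 efficiency cores), but the standard library does not tell you
# which cores are which.
print(os.cpu_count())
```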

26:33 If you want to run Linux or Windows, you basically have to go with Docker or Parallels. And then I guess

26:38 maybe the last thing is, let's wrap up this subtopic with pros and cons for data scientists,

26:43 people out there listening. They're like, ah, I can't take hearing about how cool the M1 is anymore.

26:47 Maybe I'm going to have to get one of these. Should they? Like, what do you think as a data

26:51 scientist? Yeah. As a data scientist, my takeaway from all the testing was you should be really excited

26:55 about this, but I would wait unless you are doing what I would describe as a little bit of data science

27:00 on the side and not a huge amount, mainly because, you know, what they've proven is

27:05 the architecture has great performance and great battery life. The thing we still have to see is how are they

27:10 going to get more RAM in there? How are they going to get more cores in there? And then also,

27:14 when is the rest of the ecosystem going to catch up on package support? So, honestly,

27:19 you know, if you're interested in sort of bleeding edge, knowing what's coming, I would totally jump in.

27:23 If you want this for your day-to-day, I would probably still wait and see what comes out next,

27:27 because I think a data scientist especially is going to want, you know, more cores and

27:32 more RAM than what these machines offer. Right. There's always remote desktop or

27:36 or SSH or something like that. Right. If you've got an Intel machine sitting around,

27:41 you can just connect over the network locally. Yeah. Yeah. Very cool. All right. Excellent.

27:45 I just want to give a quick mention that Paul Everett from JetBrains and I did "A Python Developer

27:50 Explores Apple's M1" way back on December 11th of 2020, right when this thing came out.

27:56 So people can check that. I'll put that in the show notes as well. All right. Let's talk about

28:01 the state of data science, 2021. How'd you all find out about this? How do you know the state?

28:07 Yeah. So, this is something we've been doing for a few years now. I mean, since we have

28:12 a big data scientist audience, you know, a couple of years back, we decided, Hey, let's,

28:17 let's ask them about what challenges they're seeing in their jobs, and then publish the results so

28:22 that the whole industry can learn a little bit more about what are data scientists seeing in their day-to-day

28:26 jobs that's, you know, going well, going poorly, where do they want to see improvements? What are

28:31 they feeling and thinking? So you got a bunch of people to come

28:36 fill out the survey and give you some feedback. Yeah, 140-plus

28:44 countries, so we have pretty good reach across the world. And more than 4,200

28:49 people took the survey, so we got a lot of responses. It's always amazing to

28:55 see. Yeah. Quick side thought here, I guess. So you've got in that survey, which I'll link to the

29:00 PDF results in the show notes, you've got all the countries highlighted and obviously North America

29:06 is basically completely lit up as a popular place of results, as are Western Europe, Australia,

29:12 and even Brazil. Africa is pretty light on results. What else can be done to get

29:19 more Python, more data science going in Africa? Do you have any thoughts on that?

29:24 No, I don't. That's an excellent question. That might actually be a

29:28 good question for a future survey, to be honest. I can speculate: I don't know if it's

29:33 access to the computing, or if it's bandwidth, or if it's

29:39 resources available in the local languages. I mean, there's all sorts of possibilities.

29:43 One thing that is really nice about Python and data science is so much of the stuff is free,

29:47 right? So it's, it's not like, oh, you got to pay, you know, some huge Oracle database license to use

29:54 it or whatever, right? So there's a real possibility there. Yeah, I don't really know

29:59 either. But let's see, there's the standard stuff about education level. I guess one of the

30:05 areas maybe we could start on, it's just, you know, people who are doing data science,

30:09 where, where do they live in the organization, right? Are they the CEO? Are they vice president?

30:15 A good portion of them are senior; 50% are either senior folks or managers. That's kind of interesting,

30:22 right? Yeah, I can see it sort of coming out of data science helping in decision-making

30:28 and that sort of thing. And so I can see it gravitating towards the decision makers in an

30:34 organization, and that sort of thing. I mean, one of the interesting things,

30:38 maybe on one of the later pages, is how spread out data science is across the

30:45 different departments as well. Obviously IT and R&D show up higher

30:51 than the others, but you kind of see a long tail across all the departments. And my

30:57 theory on that is, I think we're seeing data science evolving into sort of a profession and a

31:02 professional skill, if that makes sense. So in the same way that like every, you know,

31:06 knowledge workers are always expected to do writing and to know how to write. Yeah.

31:10 But we also hire professional technical writers. I think we're getting into a space where

31:15 everyone will need to have some numerical literacy and data science skills, even while we

31:21 also employ professional data scientists. Is it the new Excel? Like if I'm a manager

31:26 and I don't know how to use Excel, people are going to go, what is wrong with you?

31:31 How did you get here? Right. You're going to have to know how to use a spreadsheet. I mean,

31:35 it could be Google Sheets or whatever, but something like that to pull in data,

31:39 sum it up, put it in a graph, and so on. And are you seeing that more formal data science,

31:46 you know, Jupyter-type stuff, is kind of edging in on that world?
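To make the spreadsheet comparison concrete, here's a rough sketch of that pull-in-data, sum-it-up, graph-it workflow in pandas. The data and column names are invented for illustration.

```python
import pandas as pd

# Illustrative data standing in for a spreadsheet you'd normally open in Excel.
sales = pd.DataFrame({
    "region": ["East", "West", "East", "West"],
    "revenue": [120, 95, 150, 80],
})

# "Sum it up": total revenue per region, like a pivot table.
totals = sales.groupby("region")["revenue"].sum()
print(totals)

# "Put it in a graph": one line gets you a bar chart.
# totals.plot.bar()  # requires matplotlib; uncomment in a notebook
```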

31:49 Yeah. Again, I think we'll have to see how the tools settle out.

31:53 One thing I know for sure is that you'll have to at least become familiar with the concepts, so

31:58 that even if the people doing the data science and reporting to you are using whatever their

32:03 favorite tool set is at least understanding their workflow and how data, you know, goes through that

32:08 life cycle and, you know, data cleaning and modeling and inference and all of those things,

32:13 you'll have to understand that at least enough to interpret what is being told and ask the

32:17 right questions. Right. So if somebody comes to you and says, you asked me this question,

32:22 so I put together a Jupyter notebook that's using PyTorch forecasting, maybe you can do none of those things,

32:26 but you should kind of understand the realm of what that means. Something like that.

32:30 Yes. You'll have to know at least what steps they had to go through to get to

32:34 the answer, so you can ask good questions. Because if you're a decision maker,

32:38 you need to be able to kind of defend your decision, which means you're going to have to

32:41 at least understand what went into that decision.

32:45 Well, we bought that company because Jeff over in business analytics said it was a good idea.

32:50 Turned out he didn't replace the not-a-number section, and that really broke it. So

32:55 This portion of Talk Python is brought to you by masterworks.io. Do you have an investment portfolio

33:06 worth more than a hundred thousand dollars? Then this message is for you. There's a $6 trillion

33:10 asset class. That's in almost every billionaire's portfolio. In fact, on average, they allocate more

33:16 than 10% of their overall portfolios to it. It's outperformed the S&P, gold, and real estate by

33:23 nearly twofold over the last 25 years. And no, it's not cryptocurrency, which many experts don't

33:29 believe is a real asset class. We're talking about contemporary art. Thanks to a startup revolutionizing

33:35 fine art investing, rather than shelling out $20 million to buy an entire Picasso painting yourself,

33:41 you can now invest in a fraction of it. If you realize just how lucrative it can be,

33:45 contemporary art pieces returned 14% on average per year between 1995 and 2020, beating the S&P by

33:53 174%. Masterworks was founded by a serial tech entrepreneur and top 100 art collector. After he

34:00 made millions on art investing personally, he set out to democratize the asset class for everyone,

34:06 including you. Masterworks has been featured in places like the Wall Street Journal, the New York

34:11 Times and Bloomberg. With more than 200,000 members, demand is exploding. But lucky for you,

34:17 Masterworks has hooked me up with 23 passes to skip their extensive waitlist. Just head over to our

34:23 link and secure your spot. Visit talkpython.fm/masterworks or just click the link in your podcast

34:29 player's show notes. And be sure to check out their important disclosures at masterworks.io slash disclaimer.

34:37 I guess one of the requisite topics we should talk about is probably COVID-19 because that was going

34:42 to be over in a few weeks or months, but then it wasn't. So it's still ongoing. And one of the things

34:46 that you all asked about and studied was basically did COVID-19 and more specifically sort of the shutdown

34:53 as a result of it, result in more data science or less, increased investment or not so much?

35:00 What did you all find there?

35:02 Yeah. So interestingly, I think we found that there was a sort of all different organizations

35:08 had every possible answer. So about a third decreased investment,

35:15 but a quarter increased investment and another quarter stayed the same. And so that's, you know,

35:21 there wasn't one definitive answer that everyone had for that, which I think probably has a lot

35:26 to do with where data science is at in their organization. I mean, on one hand, data

35:30 science is an activity that is easy to do remotely. There are a lot

35:36 of jobs that you can't do remotely; data science is one you can. So that part isn't

35:41 an obstacle so much. But a lot of it also has to do with risk. Everyone, when they

35:46 faced this, was thinking with their business hat on: what is the risk to my

35:51 organization of an unknown economic impact of this pandemic? And so a lot of places might have

35:57 viewed their data science as being a risky, still-early kind of thing. And so let's pull back

36:03 a little bit, let's not spend that money. Is it optional? Okay, we cancel it for a while, we put

36:07 it on hold. Yeah. But clearly for some organizations it was so important,

36:11 they put more money in. So a lot of it had to do with just where you're at in the

36:15 journey, I think. Industries: you found out where people were doing data science,

36:21 obviously technology, right? Tech companies. I'm guessing this is like Airbnb, Netflix,

36:26 those kinds of places. There's a lot of data science happening in those worlds. Academic was number two.

36:31 Yeah. Data science is still an actively researched thing. As you see,

36:38 sometimes it's hard to keep up with all of the new advancements and changes and everything,

36:42 not just in the software, but in techniques. And so academia is super busy on this. You know,

36:47 banking is also a top one, because I kind of think of banking and finance as being some of

36:52 the original corporate data scientists, in some ways. And so obviously

36:58 there, it was interesting to see automotive actually score so highly. That's the

37:03 one that surprised me as well. Automotive is 6% and the highest industry was 10%. So yeah,

37:08 that's really quite high. Yeah. I wonder how much of that is self-driving cars.

37:12 You know, I don't know that. I mean, the other one is, you know, as we've heard with the chip

37:17 shortages, supply chain logistics is an interesting use of data science to try and predict

37:22 how much supply of all the things you're going to have, where and when, and how should you

37:26 transport stuff. And I imagine car manufacturing is especially challenging, especially now.

37:32 Interesting. Yeah. They really shot themselves in the foot, didn't they, when they said,

37:36 you know what, all these extra chips, people aren't going to need cars. They're not going

37:40 to buy cars during this downturn. So let's cancel our order. We'll just pick it back up in six months.

37:44 And six months later, there are no chips to be had. So there we have it. Yeah. I mean, GM,

37:49 I think, is even shutting down a significant portion of their production in the U.S. because

37:53 they're just out of chips, which is crazy. Antonio out in the live stream says he's doing

38:00 data science with his team in the energy, oil, and gas industry. And we're not the only ones.

38:05 Yeah, it's funny that doesn't appear in the list. We don't have energy, but it's

38:09 down at 2%. Again, all of the percentages are low because there are so many

38:14 industries, and it was all over the place. But yeah.

38:17 Team size is interesting. One thing that's interesting here: when I think of

38:22 software developers, they kind of cluster together in like development team groups, right? They've got

38:29 the software development department, maybe in a company or a team building a piece of software or

38:35 running a website. To me, data scientists feel like they might be more embedded within little groups.

38:41 There might be a data scientist in the marketing department, a data scientist in the DevOps

38:46 department, and so on. Is that maybe correct? Yeah. I think we've seen companies actually do both at

38:53 the same time, even where sometimes they'll have, I mean, one of the things we have listed is a data

38:56 science center of excellence. And what that ends up being is, in some sense, a group that

39:01 is pathfinding for an organization. They're saying, okay, these are the best practices. These are the

39:05 tools. This is what to do, figuring that out and then rolling it out to all the departments who have

39:10 their embedded data scientists who can take advantage of that. Because I think it's valuable to have a

39:14 data scientist embedded in the department because one of the most important things as a data scientist

39:18 is your understanding of the data you're analyzing and your familiarity with it. I would

39:23 really prefer the person analyzing, say, car supply chains understand what goes into

39:28 that and also know data science, as opposed to a data scientist for whom it's all numbers and they don't

39:33 know. Right. If you could trade absolute expertise in Git versus really good understanding of the problem

39:40 domain, you're probably better off going, you know what, just keep zipping it up and just really answer

39:44 these questions well. I mean, you don't actually have to make that trade-off, but I agree that domain

39:49 knowledge is more important here. Yeah. So, thinking of the departments where

39:55 data scientists live: IT was pretty high, then R&D, and then this data science center of excellence

40:01 you spoke about, then ops, finance, administration, marketing, human resources. It's really spread out,

40:07 which is sort of what I was getting at before. Yeah. So I think we're

40:12 seeing a lot of organizations build their data science expertise ground-up, department by

40:17 department, and then maybe we'll coalesce some of it into a single department

40:22 at some point. Right. Maybe that department makes like the APIs for the rest of the sort of isolated

40:26 folks and so on. One that was interesting is how do you spend your time? I mean, you think about

40:31 these AI models or these Plotly graphs and all these things that data scientists produce. Then there's the

40:37 quote that data cleaning is not the grunge work. It is the work, right? And you sort of have this chart

40:43 of like, how do you spend your time? And 22% is data preparation, 17% on top of that is data cleaning.

40:49 And so, yeah, that's pretty significant portion of just getting ready to ask questions.
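To give a flavor of what that prep and cleaning time goes into, here's a small sketch in pandas: dropping duplicates, coercing types, and deciding what to do with not-a-number values. The column names and data are invented for illustration; real cleaning depends entirely on the data set and the questions you plan to ask.

```python
import pandas as pd

# Invented raw data with the usual problems: duplicated rows, strings
# where numbers should be, and values that become missing.
raw = pd.DataFrame({
    "car_id": [1, 1, 2, 3],
    "price": ["20000", "20000", "not listed", "31500"],
})

clean = raw.drop_duplicates().copy()           # remove repeated rows
clean["price"] = pd.to_numeric(clean["price"],  # coerce bad strings to NaN
                               errors="coerce")
# Whether you drop NaNs, fill them, or flag them is a judgment call
# that depends on the question you're asking of the data.
clean = clean.dropna(subset=["price"])
print(clean)
```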

40:54 Yeah. And that's really the piece that requires that domain expertise, to know

40:59 what you're looking at, what's relevant, what problems it'll have. No data set is perfect,

41:04 and no data set is perfect for all questions. And so

41:10 you can't ever clean the data just once, because what you're doing is preparing it for the questions

41:13 you're going to ask. And so you need someone who can understand what's going to happen

41:18 there and do that. That's really the expertise you want. Yeah. Cool. Another topic

41:22 you asked about was barriers to going to production. Some pretty intense graphs,

41:28 many, many options across many categories, but basically you asked: what roadblocks do you

41:35 face when moving your models to a production environment? The intense graphs really reflect

41:39 that everyone has a slightly different perception of this depending on what seat they're in.

41:43 Are they the analyst? Are they the data scientist? Are they the DevOps person? Everyone

41:48 has a different answer for what the roadblocks are. Right. Which makes sense, because

41:53 you're going to see what is relevant to your job. When you sum everyone up, you

41:57 kind of see this even split across IT, security, everything. Honestly, what I found interesting was that

42:04 there was both converting your model from Python or R into another language and also converting from

42:09 another language into Python and R. Yeah, exactly. So one of the challenges that people had was just

42:17 like you said, recoding models from Python or R into another language and then the exact reverse.

42:23 And they were almost exactly tied. 24% of the people said, Oh, I got to convert these Python

42:27 models to Java or whatever. The other people are like, I've got this Java model, I've got to get it into

42:32 Python so I can put it in FastAPI on the web. Right. Something like that.

42:36 Yeah, anecdotally. I mean, maybe we'll have to change the phrasing

42:40 of this question in the future, because putting Python and R together might have conflated a

42:45 couple of things, potentially. Because I just know anecdotal evidence: we have

42:50 talked to customers whose data scientists wrote everything in R, but they didn't want to put R in

42:54 production and were asking them to recode it into Python, because Python was okay for production.

43:00 But I've also had the conversation where people are like, we do have our data modeling in Python,

43:04 and Python's not okay for production, Java is okay for production. And so it's this weird

43:10 problem of companies have built up how they do deployments on specific languages. And those aren't

43:15 the languages that people are doing data science in all the time. Right. And I suspect in the Java

43:20 one, it's just like, we have a bunch of Java APIs and apps running and those people that do that stuff,

43:26 they run those apps and you're going to give us a model that's just going to fit into that world.

43:30 But if you are already running Python for your web servers, just put it in production. It's,

43:34 it's already right there, right? Yep. Quite interesting. Okay.

43:40 Let's see, I'll flip through here and find a couple more. One was interesting: it was about open

43:45 source, enterprise adoption of open source. Yeah, you may want to speak to the results there.

43:50 Yeah. I wish we could have asked this question 10 years ago, because I think it would have been

43:54 fascinating to compare to now. Yeah, it's the trend that's super interesting. Yeah.

43:59 One of the, well, less surprising things for me was the outcome that

44:03 87% of organizations said that they allow the use of open source inside

44:09 the organization. I think that's not too surprising. I mean, even just Linux is kind of like this sort

44:14 of baseline. How is your organization functioning without Linux? Yeah. And then, what

44:19 programming language could you choose these days that's not open source, right? You know,

44:25 you've got Java, you've got .NET. Especially .NET was one that wasn't open source and is pretty

44:31 popular, but now that's all open source and installed through package managers. And

44:35 then the move to Python. And yeah, I can hardly think of a language or a place to run where

44:41 you can't use some level of open source. Yeah. But the second question, which was,

44:46 does your employer encourage you to contribute to open source? I was surprised to see 65% said

44:51 yes. That is a huge fraction, and it's interesting because that has not always

44:57 been that high. I know that we have spoken to people who have said, hey,

45:02 I wish I could contribute, but my employer just doesn't have a policy for this, or we don't have

45:07 a way to do that. Yeah, I used to hear that a lot, right? That it's just too complicated,

45:11 I might leak something out, or bring in some GPL stuff and mess up our commercial product

45:19 or whatever. Right. Yeah. So I don't know how all these companies have solved that internally,

45:24 but I am excited to see that there's now a huge potential base of open source contributors

45:29 out there commercially that there wasn't before. I do think there's something about creating

45:34 a culture for software developers and data scientists where they want to be. And people don't want to be

45:39 in a place where they're forced to use just proprietary tools that are old and crusty, and they're not

45:44 allowed to share their work or talk about their work. And, you know, there are people who would do

45:48 that, but I wouldn't love to be in that environment. That's not a great feeling, and,

45:52 you know, talent's hard to come by. So you probably want to create environments that attract

45:56 the best developers, and the best developers don't want to be locked in a basement, told they can't share

46:02 or contribute to anything. Yeah, I definitely agree with that. Another thing that's hot these days,

46:07 hot in the sense that you don't want it, hot-potato style: supply chain stuff and open

46:16 source pipeline issues. Right. And the survey actually mentioned that one of the problems that

46:21 people mentioned, one of the reasons that they don't want to use open source is they believed it

46:26 was insecure because our $20 billion bank is now depending on, you know, this project from Sarah

46:33 about padding numbers or whatever, right? Like if somebody takes over a thing, we're going to pip

46:39 install a virus into the core trading engine. That's going to be bad, right? Like that's an extreme

46:43 example, but you did ask about what people are doing to secure, basically, the code they're

46:48 acquiring through open source. Yeah. And this is something, I mean, we're interested in just

46:52 generally because there's a lot more focus on security and you see more reports about supply chain

46:56 attacks on software. And so we're curious how different organizations are tackling the problem.

47:01 Obviously, and unsurprisingly, the most popular answer, at 45%, was they use a

47:06 managed repository, which I interpret to mean basically that you have a private

47:11 mirror of the packages that are approved in your organization, and everyone pulls from there,

47:15 not from the internet directly. Which is a smart approach, because it gives you a natural

47:21 sort of gating mechanism, where there is a review process to bring new

47:26 software in there. And so there's a lot going on here. I mean,

47:31 obviously even commercially, we sell a repository for conda packages for precisely this reason,

47:37 because customers want some governance and are more than happy to pay us. Yeah.

47:44 Team Edition is our on-prem package repository. And so this was an ask from customers,

47:51 which is why we built this product. They were like, hey, we want your stuff, but we want

47:55 it inside our firewall. We don't want to go directly to your public repo. You want to opt in to say,

48:00 yes, we want the new NumPy, not just, oh, somebody randomly pushed something out. And so

48:06 we're going to just grab it and assume that it's good. Right. You can apply policies as well. That's

48:11 common, as a lot of places will say no GPL software, for various reasons. Or they might say, oh,

48:16 if there are reported CVEs, these security reports that

48:21 go through NIST, they might say, I want no packages with a CVE more severe than some level.

48:27 Every IT department wants some handles to control that kind of policy

48:34 decision-making. And so, yeah, I think that's why that's the most popular

48:39 option: it's the easiest thing to get a handle on. It is, yeah. You can set up a private

48:43 PyPI server. Yep, pretty straightforward. There's a cool article on testdriven.io,

48:49 but yeah, there's also the Conda version that you all offer. That's pretty cool.
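For the pip side of that, pointing a team at a private mirror is mostly configuration. A sketch, where `pypi.internal.example.com` is a placeholder for your own internal index:

```shell
# ~/.pip/pip.conf (or /etc/pip.conf) pointing all installs at an internal
# mirror instead of the public index:
#
#   [global]
#   index-url = https://pypi.internal.example.com/simple
#
# One-off equivalent on the command line:
#
#   pip install --index-url https://pypi.internal.example.com/simple numpy
```

With that in place, everyone pulls from the vetted mirror rather than from the internet directly.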

48:54 45% is high. I didn't expect that many companies to have a private repository. It's good, but

49:02 I just expected it to be, I don't know, lower. Yeah. Although on the other side,

49:07 that means 55% of those were just downloading random stuff from the internet. So it's good; I think

49:13 the message is getting out that you have to think about these things from a risk perspective.

49:16 Another was 33% of the organizations do manual checks against a vulnerability database.

49:22 Yeah. So this is what I was describing earlier. The CVE databases are a common

49:28 source for vulnerability checks, but manual checks are a lot of labor. So it'll be interesting to

49:34 see how many places move to automating that in some fashion. The hard part there

49:39 is, those databases bring you back, again, to data prep and data cleaning. Often, to make use

49:45 of those public databases, you need to do some amount of curation, because there's a lot of stuff

49:49 that ends up in there that's mistagged or unclear or whatever. And so a lot of this manual checking

49:55 is probably also just doing that curation. Yeah. One of the things that's

49:59 nice is GitHub will now do automatic PRs for security problems that it knows about, at least.

50:05 Yeah, that kind of automation is going to be really important, I think, in the future,

50:09 just because you can't manually go through all those things.
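A policy like "no packages with a CVE more severe than some level" is easy to sketch once the advisory data is in hand. Here's a toy filter over invented advisory records; real automation would pull from feeds like NVD or OSV and, as Stan says, still needs curation first.

```python
# Toy advisory records; real ones would come from a vulnerability feed
# such as NVD or OSV, and the package names here are made up.
advisories = [
    {"package": "leftpadder", "cvss": 9.8},
    {"package": "plotlib",    "cvss": 4.3},
    {"package": "webkit-ish", "cvss": 7.5},
]

MAX_ALLOWED_CVSS = 7.0  # the severity threshold an IT department might set

def blocked_packages(records, threshold):
    """Return package names whose CVE score exceeds the severity policy."""
    return sorted(r["package"] for r in records if r["cvss"] > threshold)

print(blocked_packages(advisories, MAX_ALLOWED_CVSS))
```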

50:11 What are you seeing around source control? You know, source code, algorithms, these are

50:17 really important and people want to keep them super secure, but if they put them on their own private

50:22 source code repositories, they lose a lot of benefits like automatic vulnerability checking and stuff like

50:28 that. What about GitHub or GitLab versus other stuff, maybe GitHub Enterprise? What are the trends there?

50:34 The, the, the interesting thing there is, is yeah. you know, everyone is using source control at

50:39 some point and they oftentimes they want it managed inside their firewall. And so yeah, things like

50:43 GitHub enterprise and things and GitLab are pretty popular for that. a lot of times I think what

50:48 a places will do is they'll use, some kind of the, the, the next item here, the 30% said they're

50:53 using a vulnerability scanner. A lot of those vulnerability scanners you can use on your own internal source

50:58 repositories. And so that way they're, they're not taking advantage of GitHub automatically doing that for them,

51:04 but, they at least have some solution probably for looking for stuff.

51:08 20% said they have no idea what they're doing. And then another 20% said we're not doing anything.

51:14 Well, I'm sure of it. Let's maybe close out this overview of the survey results here by talking about

51:22 Python, Python's popularity. Is it growing? Is it shrinking? Is everyone switching to Julia or have

51:29 they all gone to Go? What are they doing? Yeah. So I think Python's

51:34 advantage here is being pretty good at a lot of things. And so it ends up being a natural

51:39 meeting point of people who are interested in, you know, web development and data science or system

51:45 administration and automation and all of that. So I think Python still has some growth to go.

51:49 But what's interesting is, in our survey, the second

51:53 most popular was SQL, which has been around forever and is going nowhere.

51:58 Those are often used. Yeah, exactly. And they're often used in parallel, right? Like,

52:01 yeah, I'm going to do a SQL query and then run some Python code against the results, that type of thing.
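That SQL-then-Python pattern can be as small as the standard library's sqlite3 module feeding rows into plain Python. The table and column names here are invented for illustration.

```python
import sqlite3

# In-memory database with an invented orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("East", 120.0), ("West", 95.0), ("East", 30.0)])

# The SQL query does the aggregation...
rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region"
).fetchall()

# ...then Python code runs against the results.
totals = {region: total for region, total in rows}
print(totals)
```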

52:07 Yeah, definitely. I'm a big believer that there is no one language for everything, and

52:11 there never will be. But there are a lot of different options that people are

52:17 looking to. Go makes sense for a lot of network-service kinds of things;

52:21 Kubernetes is built almost entirely out of Go. But I'm not sure I'd want to do any data

52:27 science in Go at this point. And so it's always going to be a mix. It might not even be that

52:33 you're doing one or the other. You might be doing both. Like, for example, maybe you've written some core

52:38 engine in Rust, but then you wrap it in Python to program against it. Right, it could be both.

52:44 It could even be more of a combination than that. But yeah, the popularity of Python looks

52:49 strong; it's still right up near the top. I mean, obviously the group that you

52:55 polled is somewhat self-selecting, right? But that's still a general trend outside of your space.

53:00 Yeah. This is definitely going to be skewed to Python, because otherwise,

53:03 why are you taking an Anaconda survey? But still, I think it is definitely something you see broadly in the industry as well.

53:10 Well, speaking of different languages, out in the live stream,

53:13 Alexander Semenov says, just learned that I can use rust in JupyterLab with some help from Anaconda.

53:19 My mind is blown. Good job.

53:21 Yeah. The one thing I should mention about Python is, one of the advantages is, if you're using

53:26 Python, you're probably benefiting from most of the languages on the stack, even if you're not writing

53:30 them. And so the ability of Python to connect to anything is, I think, its strength and why it

53:35 continues to top these lists. Yeah, absolutely. And then Paul out there has a question about the

53:43 commercial license. And I guess there were some changes to it. Can you maybe speak to that? I

53:47 don't really track the details well enough to say much.

53:49 Yeah. So what we did was, the Anaconda distribution packages have

53:57 terms of service that say, if you are in an organization above a certain size, we want

54:01 you to have a commercial license if you're using it in your business. I forget the exact threshold

54:05 of where that's at. And the reason there was, one, to help support the development of

54:11 those packages. And I should say, by the way, those terms of service do not apply to Conda-Forge.

54:15 Obviously those are community packages. But if you want the assurances that

54:20 Anaconda is providing on those packages and you are a company of a certain size,

54:23 we would like you to have a commercial license. That allows us to support you more directly,

54:28 and it allows us to fund continued work on those packages. And that's

54:33 sort of a sustainability thing, I think. But for most people, it's not an issue,

54:39 because they're either below that size or just using it individually.

54:42 Do you know what that size is? What that cutoff is?

54:44 I do not recall off the top of my head. And so I'm afraid to quote a number.

54:47 Yeah, sure. No worries. Cool. All right. Well, thanks for giving us that. I mean,

54:52 it seems fair that large companies benefiting from your tools contribute back. I think that statement

54:58 should be applied to open source in general. If your company is built on Python, you should give back

55:04 to the Python space. If your company is built on Java, that's Oracle; I don't know if they need help,

55:08 but you know, in general, if you're built on top of something, there's a lot of support you can give

55:13 back. Right. It's kind of insane to me that, you know, banks that are worth many, many billions

55:17 of dollars do very little in terms of like directly supporting the people who they're built upon. Right.

55:24 They hire or pay for a couple of people building the core libraries. Like if you're using Flask,

55:31 right? Support Flask's Pallets organization, something like that.

55:34 Yeah. And then we in turn, you know, take that licensing money and some fraction of it goes to

55:38 NumFOCUS for the broader sort of data science open source community. In addition to,

55:43 you know, us directly funding some open source projects as well.

55:45 All right. Well, we're about out of time, Stan, but let's talk real quickly about Pyston, because

55:50 Pyston is not rewriting Python in Rust. It's not replacing it with Cython or just moving to Go.

55:59 It's about making core Python faster, right?

56:01 Yeah, this is something, I mean, we've been thinking about performance in Python for a

56:06 long time. One of the early projects that, you know, Anaconda created is called Numba. It's a

56:12 Python compiler. It's focused on numerical use cases, and it really does its best job

56:18 dealing with that kind of numerical, loop-heavy code. But it's not going to optimize your

56:23 entire program; it optimizes specific functions. And so Numba is very good at a very specific

56:28 thing. And so we've been thinking for a long time about how we could broaden our impact. And so when

56:33 I saw that Pyston, which is, you know, among many Python compiler projects, had reemerged in

56:38 2020 with a new version written from scratch, based on Python 3.8, as a just-in-time

56:45 compiler in the interpreter. So it's designed to optimize any Python program. It can't necessarily do any

56:51 given thing as fast as Numba might be for a specific, you know, numerical algorithm, but the

56:56 breadth is really what is interesting to us. And so I saw this project had emerged,

57:01 Pyston 2.0 kind of came on the scene. I started looking more closely at it and we started talking

57:05 with them. And we realized that there's a lot that I think Pyston and Anaconda could do

57:10 together. And so we have hired the Pyston team into our open source group.

57:15 So they are funded to work on Pyston the same way we fund open source developers to work on other

57:20 projects. And the benefit is that there's other help we can

57:25 give, and resources and infrastructure that we can offer this project. And so we're really excited to

57:29 see where this is going to go from here. Yeah. I'm excited as well. All these different things that

57:33 people are doing to make Python faster for everyone, not just, well, let's try to recompile this loop,

57:39 but just you run Python and it just goes better. I think that's pretty exciting. You know, we've got

57:44 the Cinder project from Facebook. Yeah. This is a really good year for Python optimization projects.

57:51 I should be careful about typing that into a search engine, but the Cinder project is

57:58 not something that's publicly available, really. It's not like a supported improvement,

58:03 but it's a "here's what they did at Instagram." There's a bunch of speedups. Maybe you all can

58:07 bring some of that back into regular Python. But yeah, there's a lot of these types of ideas.

58:12 And yeah, awesome. Looking forward to seeing what you'll do with this.
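As an aside for readers: the Numba model Stan describes earlier, compiling one hot numerical function rather than the whole program, can be sketched roughly like this. This is not code from the episode; the decorator falls back to a no-op if Numba isn't installed, so the example runs either way:

```python
import numpy as np

# Numba JIT-compiles individual functions; if it isn't installed,
# fall back to a do-nothing decorator so this sketch still runs.
try:
    from numba import njit
except ImportError:
    def njit(func):
        return func

@njit
def mean_abs_diff(xs, ys):
    # Loop-heavy numerical code: exactly the kind of specific
    # function Numba is good at speeding up.
    total = 0.0
    for i in range(xs.shape[0]):
        total += abs(xs[i] - ys[i])
    return total / xs.shape[0]

a = np.array([0.0, 1.0, 2.0, 3.0])
b = np.array([3.0, 2.0, 1.0, 0.0])
print(mean_abs_diff(a, b))  # 2.0
```

Note that Numba compiles on the first call, so the speedup shows up on repeated calls, and the rest of the program runs on the normal interpreter, which is the contrast with a whole-program JIT like Pyston.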

58:14 And, you know, the CPython core developers have even announced that they're

58:20 undertaking a new effort to speed up CPython. And so we're looking to collaborate

58:25 with them. They're going to have to, you know, figure out what they can do within

58:29 the confines of CPython, because it is the Python interpreter for the world.

58:35 Yeah.

58:35 And so you need to be careful, but there's a lot they're going to do. And we're

58:39 going to try and share ideas as much as we can. because these are both open source projects.

58:43 Right. A lot of the challenges have been in compatibility, right? Like, oh, we could do this,

58:48 but then C extensions don't work. And those are also important for performance in big ways and

58:54 other stuff, but yeah, so they do have to be careful, but that's great. All right. Final comment,

58:58 real quick follow-up from Paul. I'd like my company to do more to support

59:04 open source. Any advice on promoting that? Yeah. I think the best first place to start

59:10 is identifying what open source your company absolutely relies on. And especially if you can

59:15 find an open source project that you absolutely rely on that doesn't seem to be getting a lot

59:19 of support, then go look at those projects and see, you know: do they have an

59:24 established way to donate funds? Do they have other needs? That's something I

59:30 think is easier to sell when you say, look, our organization absolutely depends on X, whatever

59:34 this is, as opposed to picking a project at random. It's easier to show a specific business

59:39 need. Yeah. Yeah, for sure. You can say, look, this is the core thing that we do and it's built

59:43 on this rather than, oh, here's some projects I ran across. We should give some of our money away.

59:47 Yeah. That's a harder sell to stockholders, I guess. All right. Well,

59:52 Stan, this has been really fun. Let me ask you the final two questions before we get out of here.

59:56 if you're going to write some Python code, what editor do you use?

59:59 So if I'm on a terminal, it's Emacs. If I have an actual GUI desktop,

01:00:05 I'm usually using VS Code these days. And then a notable PyPI package or conda package that you're like,

01:00:10 oh, this thing is awesome. People should know about whatever.

01:00:13 Yeah. You know, wearing my GPU fan hat, I think a lot more people should know about

01:00:18 CuPy. C-U-P-Y. It's a Python package that's basically if you took NumPy but made it run on the

01:00:24 GPU. It's the easiest way I can think of to get started in GPU computing, because it just uses

01:00:30 NumPy calls that you're familiar with. So I would highly recommend, if you are at all curious about

01:00:35 GPU computing, go check out CuPy. So over there on that computer, it has a GeForce,

01:00:40 but on this one, it obviously doesn't have NVIDIA, on my Mac. Does that work? CUDA cores,

01:00:47 the CU part of that is for the NVIDIA bits, right? What's my GPU story if I don't have NVIDIA on my

01:00:54 machine? Not as clear. Yeah, you know, CUDA has kind of come to dominate the space,

01:01:00 being sort of first out of the gate. There's a lot more Python projects for CUDA.
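For readers curious what the CuPy recommendation looks like in practice, a common pattern (a minimal sketch, not code from the episode) is to pick the array module at import time, since CuPy mirrors the NumPy API. It also addresses the no-NVIDIA case: the same code falls back to NumPy on the CPU:

```python
# CuPy mirrors NumPy's API, so one import switch lets the same
# code run on an NVIDIA GPU (via CUDA) or fall back to the CPU.
try:
    import cupy as xp      # GPU arrays; needs an NVIDIA GPU + CUDA
except ImportError:
    import numpy as xp     # CPU fallback with the same call surface

a = xp.arange(1_000_000, dtype=xp.float32)
result = xp.sqrt(a).sum()  # executes on the GPU when xp is cupy

# float() copies a 0-d GPU array back to the host if necessary.
print(float(result))
```

The one real difference to keep in mind is that CuPy arrays live in GPU memory, so results have to be explicitly converted (here via `float()`, or `cupy.asnumpy()`) before mixing them with host-side code.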

01:01:06 There are not really clear choices, I think, for AMD or for, like, built-in GPUs

01:01:12 at this point. Although I've definitely been watching the space. You know, Intel is coming

01:01:17 out with their own GPUs, sort of this year and starting next year. And they have been

01:01:22 collaborating with various open source projects, including the Numba project, to build Python

01:01:26 tools to run on Intel GPUs, both embedded and discrete. So, yeah. Okay. So this may change

01:01:33 in the future. It'll be interesting to see. Final call to action. People are excited about,

01:01:37 you know, digging more into these results and learning more about the state of the industry.

01:01:41 What do they do? Go search for "State of Data Science" and Anaconda, and you'll find the results of the survey. There's a lot of detail in there. So I would

01:01:49 definitely go through and take a look at all of the charts and things, because

01:01:53 there's all kinds of topics covered in there. Yeah. I think it's 46 pages or something. And we

01:01:58 just covered some of the highlights. So absolutely. All right, Stan. Well, thank you for being here.

01:02:02 It's been great to chat with you. Thanks. It's been great. You bet.

01:02:05 This has been another episode of Talk Python to Me. Our guest on this episode was Stan Siebert,

01:02:11 and it's been brought to you by Shortcut, Masterworks.io, and the transcripts were brought to you by

01:02:16 AssemblyAI. Choose Shortcut, formerly clubhouse.io, for tracking all of your projects' work, because you

01:02:23 shouldn't have to project manage your project management. Visit talkpython.fm/shortcut. Make

01:02:29 contemporary art your investment portfolio's unfair advantage. With Masterworks, you can invest in

01:02:35 fractional works of fine art. Visit talkpython.fm/masterworks. Do you need a great automatic

01:02:41 speech-to-text API? Get human-level accuracy in just a few lines of code. Visit

01:02:47 talkpython.fm/assemblyai. Want to level up your Python? We have one of the largest catalogs of Python video courses over

01:02:53 at talkpython. Our content ranges from true beginners to deeply advanced topics like memory

01:02:58 and async. And best of all, there's not a subscription in sight. Check it out for yourself at training.talkpython.fm.

01:03:05 Be sure to subscribe to the show, open your favorite podcast app, and search for Python. We should be

01:03:10 right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play,

01:03:16 and the direct RSS feed at /rss on talkpython.fm. We're live streaming most of our recordings these

01:03:23 days. If you want to be part of the show and have your comments featured on the air, be sure to subscribe

01:03:28 to our YouTube channel at talkpython.fm/youtube. This is your host, Michael Kennedy. Thanks so much for

01:03:34 listening. I really appreciate it. Now get out there and write some Python code.

01:03:50 And I'll see you next time.
