#333: State of Data Science in 2021 Transcript
00:00 We know that Python and data science are growing in lockstep together, but exactly what's happening in the data science space in 2021? Stan Seibert from Anaconda is here to give us a report on what they found with their latest State of Data Science 2021 survey. This is Talk Python to Me, episode 333, recorded August 9, 2021.
00:32 Welcome to Talk Python to Me, a weekly podcast on Python. This is your host, Michael Kennedy. Follow me on Twitter where I'm @mkennedy and keep up with the show and listen to past episodes at 'talkpython.fm' and follow the show on Twitter via @talkpython.
00:47 We've started streaming most of our episodes live on YouTube, subscribe to our YouTube channel over at 'talkpython.fm/youtube' to get notified about upcoming shows and be part of that episode.
00:58 This episode is brought to you by Shortcut, formerly known as 'Clubhouse.io' and 'Masterworks.io', and the transcripts are brought to you by 'Assembly AI'.
01:08 Please check out what they're offering during their segments. It really helps support the show. Stan, welcome to Talk Python to me.
01:15 Hey, nice to be here.
01:16 Yeah, it's great to have you here. I'm super excited to talk about data science things and the Conda things, and we'll even squeeze in a little of one of my favorites, the Apple M1 stuff, mixed in with data science, so it should be a fun conversation.
01:30 I'm also very excited about the M1.
01:33 We can geek out about that a little bit. That'll be fun. But before we get there, let's just start with your story. How did you get into programming in Python?
01:40 Yeah, programming started as a kid, dating myself here. I learned to program BASIC on the Osborne 1, a suitcase of a computer that we happened to have when I was a kid, and then eventually picked up C and stuff like that. I didn't learn Python until college, mostly because I was frustrated with Perl. I just found that Perl never fit in my brain, right? And so I was like, well, what other scripting languages are there? And I found Python, and that was a huge game changer. I didn't really use it professionally or, like, super seriously until grad school, when I had a summer research job and I realized that this new thing called 'NumPy' could help me do my analysis. And so that was when I really started to pick up Python seriously. And now here I am.
02:24 What were you studying in grad school?
02:26 I was doing physics. So I did particle physics and used Python quite extensively, actually, throughout my research. And C++, for better or worse. But that's how I always ended up being the software person on experiments. So when I was leaving academia, going into software engineering was a logical step for me.
02:49 I was studying math in Grad school and did a lot of programming as well. And I sort of trended more and more towards the computer side and decided that that was the path as well.
03:00 Cool. A lot of the logical thinking and problem solving you learn in physics or math or whatever translates pretty well to programming.
03:08 Yeah. And definitely working on large experiments, you pick up a lot of the soft skills of software engineering: how do you coordinate with people, how do you design software for multiple people to use, that sort of thing. I was inadvertently learning how to be a software manager as a physicist and only realized it later when I went into industry.
03:28 And how about now? You're over at Anaconda, right?
03:33 Maybe I'm doing the same thing. So now I'm both a developer and a manager at Anaconda.
03:38 A direct path from a particle physics PhD to programming to data science at Anaconda. Is that how it goes?
03:46 We employ a surprising number of scientists who are software engineers. And so I manage the team that does a lot of the open source at Anaconda. We work on stuff like NumPy and Dask and various projects like that. We just recently hired the Pyston developers to broaden our scope into more Python JIT optimization kind of stuff.
04:08 So I'm doing a mix of actual development on some projects, as well as just managing strategy, the usual kind of stuff.
04:16 Well, I suspect most people out there know what Anaconda is, but I have listeners who come from all over. What is Anaconda?
04:23 It's kind of like a Python you download, but it also has its own special advantages, right?
04:29 Where we came out of, and still our main focus, is how to get Python and broader data science tools into people's hands. One of the interesting things about data science: it's not just Python.
04:40 Most people are going to have to combine Python with Fortran and C++ and all the things that underpin these amazing libraries, and maybe they don't realize it. And so a lot of what we do to get Python into the hands of data scientists is get them the latest things and make it easy for them to install on whatever platform they're on: Windows, Mac, Linux, that sort of thing. So Anaconda has a free product called Individual Edition. It's basically a package distribution and installer that lets you get started. And then, in the conda packaging system, there are thousands of conda packages that you can install, where we or the broader community have done a lot of the hard work to make sure all of those compiled packages are built to run on your system.
05:24 That's one of the really big challenges of the data science stuff: getting it compiled for your system. Because if I use requests, it's 'pip install requests'. Probably.
05:35 Maybe it runs a setup.py. Maybe it just comes down as a wheel. I don't know. But it's just pure Python, and there's not a whole lot of magic. If I'm really getting far out there, maybe I'm using SQLAlchemy, and it has some C optimizations it will try to compile; if it can't, well, it runs some slower Python version, probably. But in the data science world, you've got really heavy dependencies, right? As you said, stuff that requires a Fortran compiler on your computer. I don't know if I have a Fortran compiler on my Mac.
06:02 I'm pretty sure I don't.
06:03 I don't know. Maybe it's hidden in there. Probably not, right?
06:06 And with C++, I probably have a C++ compiler, but maybe not the right one. Maybe not the right version. Maybe my path is not set up right. And plus, it's slow.
06:17 All of these things are a challenge. So Anaconda tries to basically say: let's rebuild that stuff with a toolchain that we know will work, and then deliver you the final binaries. The challenge with that for a lot of tooling is that it's downloaded and installed to different machines with different architectures. So you've gone and built stuff for macOS, built stuff for Linux, built stuff for Windows, and whatnot. Is that right? Yeah.
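To make the platform problem concrete for readers: the per-platform builds being described here show up directly in the filenames of the wheels pip downloads. A minimal sketch in Python; the filenames below are illustrative examples, not specific pinned releases:

```python
# A wheel filename encodes: name-version-python_tag-abi_tag-platform_tag.whl
# The platform tag is why one release of a compiled package needs many builds,
# while a pure-Python package can ship a single "any" wheel.

def wheel_platform(filename):
    """Return the platform tag from a wheel filename."""
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    return stem.split("-")[-1]  # the last dash-separated field

print(wheel_platform("requests-2.26.0-py3-none-any.whl"))              # any
print(wheel_platform("numpy-1.21.1-cp39-cp39-macosx_11_0_arm64.whl"))  # macosx_11_0_arm64
print(wheel_platform("numpy-1.21.1-cp39-cp39-win_amd64.whl"))          # win_amd64
```

Pure-Python wheels are tagged `any` and run everywhere; compiled wheels carry an OS-plus-CPU tag, and an installer only accepts ones matching your machine.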
06:45 Building software is non trivial, and no matter how much a developer tries to automate it so that things just work, it helps to have someone do a little bit of quality control and a little bit of just deciding how to set all the switches to make sure that you get a thing that works so that you can just get going quickly early on.
07:04 I remember in the 2014-2015 era, Anaconda was extremely popular with Windows users, who did not have a lot of good options for how to get this stuff. With Linux, you could kind of get it together and get it going if you were motivated; on Windows, it was often very much, I don't know what to do. And so we made it sort of one-stop shopping for all of these packages. Another thing we wanted to do was make sure there was a whole community of package building around it, that it wasn't just us. So things like 'conda-forge', a community of package builders that we are part of and help support, because there's a long tail. There's always going to be stuff that we're never going to get around to packaging.
07:46 There's important stuff where you're like, this is essential, so NumPy, Matplotlib, and so on; you all take control of making sure that that one gets out. But there's some biology library that people don't know about that you're not in charge of, and that's where 'conda-forge' comes in. Conda is sort of like 'pip' and 'PyPI', but in a slightly more structured way.
08:10 Yeah. And that was why Conda was built: to make it possible for this community to grow up, for people to package things you might need that aren't Python at all, all kinds of stuff like that. And there's always going to be something specific to your scientific discipline. So, for example, 'Bioconda' is a really interesting distribution of packages built by the bioinformatics community on top of Conda; they have all of the packages that they care about, many of which I've never heard of and aren't in common use, but are really important to that scientific discipline.
08:42 Out in the livestream, we have a question from Neil. I mentioned Linux, Windows, macOS.
08:48 Neil asked, Does Anaconda work on Raspberry Pi OS as in Arm 64?
08:53 Yeah. So the answer to that is: Anaconda, not yet.
08:57 'Conda-forge' does have a set of community-built packages for Raspberry Pi OS. The main challenge there: Anaconda actually just a couple of months ago announced Arm64 support, but it was aimed at server Arm machines running the ARMv8.2 instruction set, while the Raspberry Pi is ARMv8.0. And so the packages we built, which work great on server Arm, are using some instructions that Raspberry Pis can't support.
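A rough sketch of that instruction-set mismatch: on Linux you can compare the CPU's advertised features against what a binary assumes. The feature names and the `REQUIRED` set below are illustrative assumptions, not what any particular conda package actually needs:

```python
# ARMv8.2 adds instructions (e.g. half-precision float, dot product) that an
# ARMv8.0 core like the Raspberry Pi's doesn't have. A binary compiled to use
# them dies with an illegal-instruction error on the Pi.

REQUIRED = {"fphp", "asimddp"}  # hypothetical v8.2-era features a build assumes

def can_run(cpu_features):
    """cpu_features: space-separated flags, like the Features line
    in /proc/cpuinfo on an Arm Linux machine."""
    return REQUIRED <= set(cpu_features.split())

server_arm = "fp asimd aes sha1 sha2 crc32 atomics fphp asimdhp asimddp"
raspberry_pi = "fp asimd evtstrm crc32 cpuid"  # illustrative v8.0 feature set

print(can_run(server_arm))    # True
print(can_run(raspberry_pi))  # False
```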
09:26 But there's Conda-forge. So if you go look up Conda-forge and Raspberry Pi, you'll find some instructions on how to install for that.
09:35 Let's talk a little bit about that, because I find this whole Apple Silicon move fascinating. They created their M1 processor, and they said, you know what, we're dropping Intel, dropping x86 more importantly, and we're going to switch to basically iPad processors, slightly amped-up iPad processors, that turn out to be really, really fast, which actually blew my mind. And it was unexpected. But I think the success of Apple is going to encourage others to do this as well, and it's going to add more platforms that things like Anaconda and Conda-forge are going to have to support, right? So there's a cool article over here by you on Anaconda called 'A Python Data Scientist's Guide to the Apple Silicon Transition'.
10:29 I'm a huge chip nerd, just due to my background and thinking about optimization and performance. And so this came out of some experiments I was doing to just understand it. We got some M1 Mac minis into our data center and started immediately playing with them. And then I realized after a while that I should take the stuff I was learning and finding and put it together in a document for other people, because I couldn't find this information anywhere organized in a way that worked for me.
10:58 As a Python developer, I was having a hard time putting it all together.
11:02 There was some anecdotal stuff out there, like, this kind of works for me, or this is kind of fast, or this is kind of slow. But this is a little more: here's the whole picture, what the history is, where it's going, and what it means, specifically focused on the Conda side of things, right? Yeah.
11:19 And even just the Python side. I mean, it's sort of an interesting problem. Python is an interpreted language, so you're like, well, I don't have any machine code to worry about, right? But the interpreter, of course, is compiled, so you at least need that. And then many, many Python packages also contain compiled bits, and you'll need those too. And this is an interesting, broad problem for the whole Python ecosystem to try and tackle, because it's not too often a whole new platform just appears.
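A quick way for readers to see those "compiled bits" from Python itself: check whether a module was loaded from a native extension file. This is a sketch; the exact suffixes (`.so`, `.pyd`, and so on) vary by platform and build, which is exactly the portability problem being discussed:

```python
import importlib
from importlib.machinery import EXTENSION_SUFFIXES

def is_compiled_extension(module_name):
    """True if the imported module's file is a native (compiled) extension."""
    mod = importlib.import_module(module_name)
    origin = getattr(mod, "__file__", None)
    if origin is None:  # modules compiled into the interpreter have no file
        return False
    return any(origin.endswith(suffix) for suffix in EXTENSION_SUFFIXES)

print(EXTENSION_SUFFIXES)             # platform-specific, e.g. ['...-linux-gnu.so', ...]
print(is_compiled_extension("json"))  # False: json is pure Python
# On a machine with NumPy installed, its core modules would report True, and
# that file is architecture-specific: the heart of the new-platform problem.
```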
11:48 Making a whole new architecture takes a while.
11:51 It absolutely does. I think there are a lot of interesting benefits to come. I do want to point out, for people listening: if you jump over to the PSF/JetBrains Python Developers Survey, the most recent one from 2020, and you look around a bit, you'll see that while we don't run production stuff on macOS that much, 29% of developers are using macOS to develop Python code.
12:19 So Apple's pledge that they're going to take 100% of this and move it over to Apple Silicon means almost a third of the people running Python, in a couple of years, will be under this environment. Even if you have a Windows or Linux machine and you don't care about macOS, you may be maintaining a package for people who do. And that means Apple Silicon, right?
12:42 Yeah. I mean, it's interesting. There's other stuff you take for granted, like the availability of free continuous integration services, which has been transformative for the open source community. It's really improved software quality that all these open source projects automatically run their tests and build packages every time there's a new change. But then something like this comes out, and until you get Arm Macs into these services, and until they're freely available, a lot of these open source projects don't have a way to test on an M1 Mac, except manually if they happen to have one, and they don't have a way to automate their builds on M1 until that sorts out.
13:26 And thinking about the workflow here, there are two challenges this presents. One is, you want to do a git push to production or to some branch, or tag it, and that's going to trigger a CI build that might fork off to run a Windows compile, a Linux compile, a Mac compile, generate some platform-specific wheels with, like, Fortran compiled in there or whatever, and then you're going to ship that off. If that CI system doesn't have an Apple Silicon machine, it can't build for Apple Silicon, right? Yes.
13:59 And moreover, where do you get an M1 in the cloud? I know there are a few hosted places, but on a normal GitHub Actions or Azure setup, it's not common to just go grab a bunch of those and pile them up, right? Yeah.
14:15 It'll take time, eventually. In the same way, thinking back four or five years ago,
14:22 there weren't a whole lot of options for Windows CI available. There were a couple of providers, and then there was a huge change, and pretty much everyone offered a Windows option, and they were faster and all of this stuff. That took time. I think that's the thing: the hardware is in people's hands now, and it's just going to get more and more common. And it's unclear how quickly we can catch up.
14:47 That's going to be a challenge for all of us.
14:49 It's absolutely going to be a challenge. It's interesting. I hope that we get there soon. The other problem in the same workflow: I was actually just looking at some NumPy issues, specifically issue #18143. I'm sure people have that right off the top of their head. The title is 'Please provide universal2 wheels for macOS', and there's a huge long conversation about this, many, many messages in the thread. And one of the problems they brought up is, look, we can find a way to compile the binary bits, the C++ bits, for M1, but we can't test it.
15:33 We as developers cannot run this output. It's a little sketchy to just compile and ship it to the world. And to be fair, this was on January 9 of 2021, when it was still hard to get these machines. They were still shipping and still arriving. It was not like you could just go to the Apple Store and pick one up.
15:52 This portion of Talk Python to Me is brought to you by Shortcut, formerly known as Clubhouse.io. Happy with your project management tool? Most tools are either too simple for a growing engineering team to manage everything, or way too complex for anyone to want to use them without constant prodding. Shortcut is different, though, because it's worse. No, wait, I mean it's better. Shortcut is project management built specifically for software teams. It's fast, intuitive, flexible, powerful, and many other nice positive adjectives. Key features include team-based workflows: individual teams can use default workflows or customize them to match the way they work. Org-wide goals and roadmaps: the work in these workflows is automatically tied into larger company goals. It takes one click to move from a roadmap to a team's work to individual updates and back. Tight version control integration: whether you use GitHub, GitLab, or Bitbucket, Shortcut ties directly into them, so you can update progress from the command line. A keyboard-friendly interface: the rest of Shortcut is just as friendly as their power bar, allowing you to do virtually anything without touching your mouse. Throw that thing in the trash. Iteration planning: set weekly priorities and let Shortcut run the schedule for you, with accompanying burndown charts and other reporting. Give it a try over at 'talkpython.fm/shortcut'. Again, that's 'talkpython.fm/shortcut'. Choose Shortcut, because you shouldn't have to project manage your project management.
17:23 As an interesting example, Conda-forge was able to get conda packages for Apple Silicon out pretty quickly, but they did it with a clever cross-compilation strategy, where they were building the Arm packages on x86 Macs and pushing them out.
17:41 But they had enough people manually testing that they had confidence in the process, that it was okay. But that's very different from how they build other packages, which are built and tested immediately, automatically; if they fail tests, they don't get uploaded. So it's a risk, but it helped get the software into people's hands quicker. Long term, though, we need to get these machines onto all of these CI systems, so that we can use the same techniques we've built up over the years to ensure we have quality software.
18:08 I think we'll get there, but it's just going to take some time, right? Yep.
18:13 Let's see. Neil on the livestream says: speaking of open source, Apple's rumored to be hiring experts in RISC-V, perhaps to move away from having to pay licensing fees to Arm. I'm not sure about that.
18:28 What's interesting here is that these chip architectures have been around for a long, long time, but until very recently, average users didn't have to think about x86 versus Arm. Arm was for mobile phones, and most people never had to worry about PowerPC or anything like that on their computers.
18:48 But going from one to two is a big step. Now the floodgates are open, and now we're thinking about, well, what else is out there? RISC-V, I think that's what you call it, is an interesting thing, and it's even a completely open standard; you don't have to pay licensing fees at all, as mentioned. I don't know if Apple's going to make this transition again so quickly, but I can guarantee you that everyone, probably somewhere in a basement, is thinking about it, maybe doing some experiments. Slowly, but it's interesting to think about.
19:25 Yeah, that's not a thing you can change very frequently without dragging developers along a lot. We're talking about all the challenges that are just down the pipeline from that. Yeah, very interesting. All right. Well, let's talk a little bit about this first.
19:39 You're excited about these as a scientist? Yeah.
19:43 I'm really excited, for two reasons. I mean, one thing that's interesting is just the power efficiency.
19:48 There was a talk long ago from the chief scientist at NVIDIA, which really made an impression on me, in which he, roughly paraphrasing, basically said that because everything is now power constrained, power efficiency equals performance. Normally you'd just think, we'll just put more power in there, but that heat has to go somewhere. So we long since hit that wall, and now you have to get more efficient to get more performance, right?
20:13 That's interesting. Or you can get larger power supplies and larger computers. I have a gaming sim computer, and it is so loud if you get it going full power. If the windows are open, you can hear it outside the house. It's literally that loud. But at the same time, it's not just your personal computer. In the cloud and places like that, you pay based not just on how much performance you get; there's some combination of how much energy that particular processor takes to run. And if it's one fifth, you might be able to buy more cloud compute per dollar.
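Putting hypothetical numbers on that "compute per dollar" point (every figure below is made up purely to show the shape of the trade-off):

```python
# If chip B delivers 80% of chip A's throughput at one fifth the power, and
# your bill is dominated by energy plus cooling, B wins handily per dollar.

def work_per_dollar(throughput, watts, dollars_per_kwh=0.10, cooling_overhead=0.5):
    """Units of work per dollar of electricity, counting cooling.
    cooling_overhead is a PUE-style fudge factor; all values are illustrative."""
    total_watts = watts * (1 + cooling_overhead)
    cost_per_hour = (total_watts / 1000) * dollars_per_kwh
    return throughput / cost_per_hour

chip_a = work_per_dollar(throughput=100, watts=250)  # big conventional part
chip_b = work_per_dollar(throughput=80, watts=50)    # efficient Arm-style part
print(chip_b / chip_a)  # 4.0: four times the work per energy dollar
```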
20:48 Yeah, power and cooling is a huge part of data center expenses, and you can put maybe 100 to 300 watts into a CPU.
21:00 You're not going to put multiple kilowatts in there or something. So where does that leave you?
21:06 What else can you do? And a lot of it is that Moore's Law is driven a lot by the fact that every time you shrink the process, you do get more power efficient. But now it's interesting to think of architectures like Arm that came into their own in an extremely power-constrained environment. And now we're letting it loose on a laptop, which has way more power available compared to a cell phone.
21:29 And what could it do if we fed it right from the socket in the wall?
21:34 What happens when I put it in the data center?
21:37 I think ARM in the data center is going to be really important.
21:42 Yeah, I think definitely. I'd always expected that to come before the desktop, to be honest. I was surprised, as many people were, by the suddenness of the Apple transition, because I had assumed this would happen well after we all got used to Arm in the data center, where you're probably running Linux and it's easy to recompile, compared to Mac and stuff like that.
22:06 Yeah, that's what I thought as well. The payoff is so high. They spend so much on direct electricity, as well as on cooling away the waste heat from that energy, that the payoff is just completely clear. Alright. So let's see, a couple of things that you pointed out that make a big difference here: obviously Arm versus x86, the built-in on-chip GPU, and the whole system-on-a-chip design, rather than a bunch of pieces going through a motherboard, which is pretty interesting. But I think maybe the most interesting one has to do with the acceleration, things like the Apple Neural Engine that's built in. It sounds like the data science libraries in general are not targeting the built-in neural engines yet, but maybe they will in the future. I don't know.
22:55 Yeah. It's something that we're going to have to figure out, because I think it was a bit of a chicken-and-egg problem: until this happened, you didn't have this kind of hardware just sitting on people's desks, and you weren't going to run data science stuff on your phone. So now that it's here, the question is, okay, what can we do with it? I mean, right now, for example, for the Apple Neural Engine, you can take advantage of it using something called Core ML Tools, which I actually did a webinar on some time back.
23:23 But that's basically for when you've trained a model and you want to run inference on it more efficiently and quickly. But that's it. There's an alpha release of TensorFlow that's GPU accelerated, and it will take advantage of the M1 if you're running it there. But that's super early, and there are a lot more opportunities like that. But again, it will take time to adapt.
23:45 It will. I suspect, as there are bigger gains to be had, they'll be more likely to be adopted. So, for example, I have my Mac mini here that I just completely love, but it's not that powerful, say, compared to a GeForce video card or something like that. But if Apple announces something like a huge Mac Pro with, say, 128 cores instead of 16 or whatever, then all of a sudden that neural engine becomes really interesting.
24:19 Maybe it's worth going to the extra effort of writing specific code for it. Yeah.
24:25 That's the other thing that's interesting about this is we've only seen one of these chips, and it is, by definition, the slowest one that will ever be made.
24:34 So we don't even know what it's going to be like to scale up one of those things. If you're targeting that big desktop user, how are they going to scale this up? Can this all fit on one package? Can they still do that, or will they have to split out into multiple packages?
24:50 There are a lot of engineering challenges that they have to solve, and we're not sure how they'll solve them yet, from the outside. So we're going to have to wait and see.
24:59 It's going to be exciting to see that come along. All right. So let's touch on just a couple of things: getting Python packages for M1.
25:07 What are some of the options there?
25:09 So the status is still roughly what I have in the article, which is basically: you can use pip to install stuff if wheels have been built, and a number of packages like NumPy have started to catch up and have wheels that will run on the M1.
25:23 Another option, which works surprisingly well, is to just use an x86 Python package distribution.
25:29 I think that's actually what I'm doing because it just runs over Rosetta 2.
25:37 Yeah, and it's pretty fast. It is shocking. With Rosetta 2, on average I'm finding something like a 20% speed hit, which, for an entire architecture switch, is amazing.
25:46 I've never seen that before. Or you can use conda-forge. As I mentioned earlier, they have a sort of experimental macOS Arm package distribution, which doesn't have everything but has a lot of things. And using them, it is all built for Arm; there's no translation or anything going on there.
26:03 And on python.org, I believe the Python you download now is a universal binary. It'll adapt and just run on Arm or run on x86. You just get one binary.
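If you're ever unsure which of those worlds your own interpreter landed in, here's a quick check. Note that `sysctl.proc_translated` is a macOS-specific detail; on other systems, this sketch simply reports the native architecture:

```python
import platform
import subprocess

def interpreter_arch():
    """Return (machine, translated): the CPU arch Python reports, and whether
    the process is running under Rosetta 2 translation (macOS only)."""
    machine = platform.machine()  # 'arm64', 'x86_64', 'aarch64', ...
    translated = False
    if platform.system() == "Darwin":
        out = subprocess.run(
            ["sysctl", "-n", "sysctl.proc_translated"],
            capture_output=True, text=True, check=False,
        ).stdout.strip()
        translated = out == "1"  # 1 means an x86_64 process on Apple Silicon
    return machine, translated

machine, translated = interpreter_arch()
print(machine, "via Rosetta 2" if translated else "native")
```

An x86 Python running on an M1 reports `x86_64` here even though the hardware is `arm64`, which is exactly why the Rosetta route "just works" for x86-only packages.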
26:20 The NumPy conversation was kind of around that as well, I believe. All right.
26:26 You did some performance analysis of the performance cores versus the efficiency cores, which was pretty interesting; that's pretty similar to hyper-threading. If you want to run Linux or Windows, you've basically got to go Docker or Parallels. And then, I guess, maybe the last thing: let's wrap up this subtopic with pros and cons for data scientists. People out there listening are like, I can't take hearing about the Apple M1 anymore. Maybe I'm going to have to get one of these. Should they? What do you think, as a data scientist? Yeah.
26:52 My data scientist takeaway from all the testing was: you should be really excited about this, but I would wait, unless you are doing what I would describe as a little bit of data science on the side and not a huge amount. Mainly because what they've proven is that the architecture has great performance and great battery life. The thing we still have to see is: how are they going to get more RAM in there? How are they going to get more cores in there? And then also, when is the rest of the ecosystem going to catch up on package support? So honestly, if you're interested in the bleeding edge, knowing what's coming, I would totally jump in.
27:24 If you want this for your day-to-day, I would probably still wait and see what comes out next, because I think a data scientist especially is going to want more cores and more RAM than what these machines offer.
27:35 There's always remote desktop or SSH or something like that. If you've got an Intel machine sitting around, you can just connect over the network locally.
27:45 Yeah. Very cool. All right. Excellent. I just want to give a quick mention that Paul Everitt from JetBrains and I did 'A Python Developer Explores Apple's M1' way back on December 11, 2020, right when the thing came out, so people can check that out. I'll put it in the show notes as well. All right.
28:01 Let's talk about the 'State of Data Science, 2021'. How do you all find out about this? How do you know the state?
28:09 Yeah. So this is something we've been doing for a few years now. Since we have a big data scientist audience, a couple of years back we decided, hey, let's ask them about what challenges they're seeing in their jobs and then publish the results, so that the whole industry can learn a little bit more about what data scientists are seeing in their day-to-day jobs. What's going well, what's going poorly? Where do they want to see improvements? What are they feeling and thinking?
28:35 So you got a bunch of people to come fill out the survey and give you some feedback.
28:41 And yeah, 140-plus countries. So we have pretty good reach across the world.
28:48 And more than 4,200 people took the survey. So yeah, we got a lot of responses.
28:55 It's always amazing to see. A quick side thought here: in that survey, which I'll link to the PDF results for in the show notes, you've got all the countries highlighted. And obviously North America is basically completely lit up as a popular source of responses, as are Western Europe, Australia, and even Brazil. Africa is pretty light on responses.
29:18 What else can be done to get more Python, more data science going in Africa? Do you have any thoughts on that?
29:25 That's an excellent question. I don't know; that might be a good question for a future survey, to be honest. I can speculate. I don't know if it's access to computing, or bandwidth, or resources available in the local languages.
29:43 I mean, there are all sorts of possibilities. One thing that is really nice about Python and data science is that so much of the stuff is free, right? So it's not like, oh, you've got to pay some huge Oracle database license to use it, ever. So I don't really know either. But let's see, past the standard stuff about education level, I guess one of the areas maybe we could start with is just: people who are doing data science,
30:10 Where do they live in the organization, right? Are they the CEO? Are they vice presidents? A good portion of them, 50%, is either senior folks or managers. That's kind of interesting, right?
30:25 I can see it coming out of data science helping in decision making and that sort of thing, so I can see it gravitating towards the decision makers in an organization. I mean, one of the interesting things, maybe it comes later, one of the pages shows how spread out data science is across the different departments as well. Obviously IT and R&D show up higher than the others, but you see a long tail across all the departments. And my theory on that is, I think we're seeing data science evolving into sort of a professional skill, if that makes sense. So in the same way that knowledge workers are always expected to do writing and to know how to write,
31:11 But we also hire professional technical writers.
31:14 I think we're getting into a space where everyone needs to have some numerical literacy and data science skills, even while we also employ professional data scientists.
31:24 Is it the new Excel?
31:25 If I'm a manager and I don't know how to use Excel, people are going to go, what is wrong with you? How did you get here? Right? You're going to have to know how to use a spreadsheet. I mean, it could be Google Sheets or whatever, but something like that, to pull in data, sum it up, put it in a graph, and so on. And are you seeing that more formal data science, Jupyter-type stuff, is kind of edging in on that world?
31:52 Again, I think we will have to see how the tools settle out. One thing I know for sure is that they will have to at least become familiar with the concepts, so that even if the people doing the data science and reporting to you are using whatever their favorite tool set is, you at least understand their workflow and how data goes through that life cycle: data cleaning and modeling and inference and all of those things. You'll have to understand that at least enough to interpret what you're being told and ask the right questions about it.
32:19 Right. So somebody comes to you and says, you asked me this question, so I put together a Jupyter notebook that's using PyTorch forecasting. Maybe you can't do any of those things yourself, but you should kind of understand the realm of what that means. Something like that.
32:31 Yes. You'll have to know at least what steps they had to go through to get to the answer, so you can ask good questions about it. Because if you're a decision maker, you need to be able to kind of defend your decision, which means you're going to have to at least understand what went into the inputs to that decision.
32:46 Well, we bought that company because whoever in business analytics said it was a good idea. It turned out they didn't replace the not-a-number section, and that really broke it.
33:01 This portion of Talk Python to Me is brought to you by 'Masterworks.io'.
33:45 If you have an investment portfolio worth more than $100,000, then this message is for you. There's a $6 trillion asset class that's in almost every billionaire's portfolio. In fact, on average, they allocate more than 10% of their overall portfolios to it. It's outperformed the S&P, gold, and real estate by nearly two-fold over the last 25 years. And no, it's not cryptocurrency, which many experts don't believe is a real asset class. We're talking about contemporary art, thanks to a startup revolutionizing fine art investing. Rather than shelling out $20 million to buy an entire Picasso painting yourself, you can now invest in a fraction of it. In case you don't realize just how lucrative it can be: contemporary art pieces returned 14% on average per year between 1995 and 2020, beating the S&P by 174%. Masterworks was founded by a serial tech entrepreneur and top-100 art collector. After he made millions on art investing personally, he set out to democratize the asset class for everyone, including you. Masterworks has been featured in places like the Wall Street Journal, the New York Times, and Bloomberg. With more than 200,000 members, demand is exploding, but lucky for you, Masterworks has hooked me up with 23 passes to skip their extensive wait list. Just head over to our link and secure your spot. Visit 'talkpython.fm/masterworks', or just click the link in your podcast player's show notes, and be sure to check out their important disclosures at 'masterworks.io/disclaimer'. I guess one of the requisite topics we should talk about is probably COVID-19, because that was going to be over in a few weeks or months, but then it wasn't. So it's still ongoing. And one of the things that you all asked about and studied was basically: did COVID-19, and more specifically the shutdown as a result of it, result in more data science, less data science, increased investment, or not so much? What did you all find there?
35:03 Yeah, so interestingly, I think we found that different organizations had every possible answer: about a third decreased investment, but a quarter increased investment and another quarter stayed the same. And so there wasn't one definitive answer that everyone had for that, which I think probably has a lot to do with where data science is at in their organization. I mean, on one hand, data science is an activity that is easy to do remotely.
35:37 There are a lot of jobs that you can't do remotely. Data science is one you can do remotely. So that part isn't an obstacle so much, but a lot of it also has to do with risk. Everyone, when they faced this, was thinking with their business hat on.
35:50 What is the risk to my organization of an unknown economic impact of this pandemic? And so a lot of places might have viewed their data science as being a risky, still-early kind of thing, and decided: let's pull back a little bit. Let's not spend that money.
36:06 Is it optional? Okay, we cancel it for a while. We put it on hold.
36:09 But clearly, for some organizations it was so important that they put more money in.
36:14 And so a lot of it had to do just where you're at in the journey.
36:17 I think industries: you found out where people were doing data science. Obviously technology, tech companies. I'm guessing this is like Airbnb, Netflix, those kinds of places. There's a lot of data science happening. Then academic was number two.
36:33 Yeah, data science is still an active research thing. As you see, sometimes it's hard to keep up with all of the new advancements and changes and everything, not just in the software but in techniques. And so academia is super busy on this.
36:48 Banking is also a top one, because I kind of think of banking and finance as being some of the original corporate data scientists in some ways. And so obviously it was interesting to see automotive actually score so highly.
37:02 That's the one that surprised me as well. Automotive is 6% and the highest industry was 10%. So that's really quite high. Yeah. I wonder how much of that is self driving cars.
37:13 I don't know. The other one, as we've heard with the chip shortages: supply chain, logistics is an interesting use of data science, to try and predict how much supply of all the things you're going to have, where and when, and how you should transport stuff. And I imagine car manufacturing is especially challenging, especially now.
37:33 Interesting. Yeah. They really shot themselves in the foot, didn't they? When they said, you know what, all these extra chips, people aren't going to need cars. They're not going to buy cars during this downturn. So let's cancel the orders. We'll just pick it back up in six months. And six months later, there are no chips to be had. GM, I think, is even shutting down a significant portion of their production in the US because they're just out of chips, which is crazy.
37:57 Antonio in the livestream says he's doing data science with his team in the energy, oil and gas industry, and they're not the only ones. It's funny that it doesn't appear in the list.
38:08 We do have energy, but they've got 2%. Again, all of the percentages are low because there are so many industries and everyone is all over the place.
38:18 But team size is interesting. One of the things that's interesting here: when I think of software developers, they kind of cluster together in development team groups. They've got the software development department, maybe, in a company, or a team building a piece of software or running a website. To me, data scientists feel like they might be more embedded within little groups. There might be a data scientist in the marketing department, a data scientist in the DevOps department, and so on.
38:49 Is that maybe correct?
38:51 Yeah. I think we've seen companies actually do both at the same time, even. Sometimes they'll have one of the things listed here, a data science center of excellence.
38:59 And what that ends up being is, in some sense, a group that is pathfinding for an organization. They're saying, okay, these are the best practices, these are the tools, this is what to do, figuring that out and then rolling it out to all the departments, who have their embedded data scientists who can take advantage of that. Because I think it's valuable to have a data scientist embedded in the department: one of the most important things as a data scientist is your understanding of the data you're analyzing and your familiarity with it. I would really prefer the person analyzing car supply chains understand what goes into that and also know data science, as opposed to a data scientist for whom it's all just numbers.
39:34 And if you could trade absolute expertise in Git for a really good understanding of the problem domain, you're probably better off going, you know what, just keep zipping it up and really answer these questions well. You don't actually have to make that trade-off, but I agree that domain knowledge is more important here. So IT had the highest share among departments where data scientists live, then R&D, and then the data science center of excellence you spoke about, then Ops, Finance, Administration, Marketing, Human Resources. It's really spread out, which is sort of what I was getting at before.
40:11 So I think we're seeing a lot of organizations build their data science expertise from the ground up, department by department. And then maybe they'll coalesce some of it into a single department at some point.
40:23 Maybe that department makes the APIs for the rest of the organization and so on. One that was interesting is how you spend your time. You think about these AI models or these Plotly graphs and all these things that data scientists produce, but then there's a quote that data cleaning is not the grunt work, it is the work. Right? And you sort of have this chart of how do you spend your time, and 22% is data preparation, and even on top of that, data cleaning. So, yeah, that's a pretty significant portion of just getting ready to ask questions.
40:56 And really, that's the piece that requires that domain expertise: to know what you're looking at, what's relevant, what problems it'll have. No data set is perfect, and no data set is perfect for all questions.
41:08 And so you can't ever clean the data just once, because what you're doing is preparing it for the questions you're going to ask. And so you need someone who can understand what's going to happen there and do that. And that's really the expertise you want.
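To make the "prepare the data for the question you're asking" idea concrete, here is a minimal sketch using only the standard library. The field names, values, and cleaning rules are entirely hypothetical, just to illustrate question-specific preparation; real work would typically use something like pandas.

```python
# Hypothetical raw data: inconsistent labels and a missing value,
# the kind of mess that data preparation has to deal with.
import statistics

raw_rows = [
    {"region": "north", "revenue": "1200.50"},
    {"region": "North ", "revenue": "980"},
    {"region": "south", "revenue": ""},        # missing value
    {"region": "south", "revenue": "1500.25"},
]

def prepare_for_revenue_question(rows):
    """Clean only what this particular question needs: region labels
    and revenue. A different question might clean differently."""
    cleaned = []
    for row in rows:
        if not row["revenue"]:
            continue  # this question can't use rows with no revenue
        cleaned.append({
            "region": row["region"].strip().lower(),  # normalize labels
            "revenue": float(row["revenue"]),
        })
    return cleaned

rows = prepare_for_revenue_question(raw_rows)
avg = statistics.mean(r["revenue"] for r in rows)
print(len(rows), round(avg, 2))  # 3 1226.92
```

The point is that the dropped row and the normalization are decisions that depend on the question being asked, which is exactly where the domain expertise comes in.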
41:22 Yeah. Cool. Another topic you asked about was barriers to going into production. There are some pretty intense graphs, many, many options across many categories. But basically, you asked: what are the roadblocks you face when moving your models to a production environment?
41:39 The graphs are intense because everyone has a slightly different perception of this depending on what seat they're in. Are they the analyst or the data scientist or the DevOps person? Everyone has a different answer for what the roadblocks are.
41:52 Which just makes sense because you're going to see what is relevant to your job.
41:56 When you sum everyone up, you kind of see this fairly even split across IT, security, and so on.
42:03 Honestly, what I found interesting was that there was both converting your model from Python or R into another language and also converting from another language into Python and R.
42:13 So one of the challenges that people had was, as you said, re-coding models from Python or R into another language, and then the exact reverse. And they were almost exactly tied: 24% of people said, oh, I've got to convert these Python models to Java or whatever, and the other people said, I've got this Java model, I've got to get it into Python so I can put it in FastAPI on the web, right? Something like that.
42:37 Yeah. Anecdotally, I think maybe we'll have to change the phrasing of this question in the future, because putting Python and R together might have conflated a couple of things potentially. So I just have anecdotal evidence.
42:51 We have talked to customers whose data scientists wrote everything in R, but they didn't want to put R in production and were asking them to re-code it into Python, because Python was okay for production. But I've also had the conversation where people are like, we do all of our data modeling in Python, and Python is not okay for production; Java is okay for production.
43:10 And so it's this weird problem of companies have built up how they do deployments on specific languages. And those aren't the languages that people are doing data science in all the time.
43:19 Right. And I suspect in the Java one, it's just: we have a bunch of Java APIs and apps running, and the people that do that stuff run those apps, and you're going to give us a model that's just going to fit into that world. But if you are already running Python for your web servers, put it in production. It's already right there. Right.
43:38 Quite interesting. Okay.
43:41 Let's see. I'll flip through here and find a couple more.
43:45 One was interesting. It was about open source, enterprise adoption of open source.
43:50 You maybe want to speak to the results there.
43:51 Yeah. I wish we could have asked this question ten years ago because I think it would have been fascinating to compare to now as a trend, that's super interesting.
44:01 One of the less surprising outcomes for me was that 87% of organizations said they allowed the use of open source inside the organization. I think that's not too surprising. I mean, even just Linux is kind of this sort of baseline. How is your organization functioning without Linux?
44:19 And then almost what programming language could you choose these days? That's not open source, right.
44:25 You've got Java, you've got .NET, and .NET was one that wasn't open source and is pretty popular, but too late. That's all open source and installed through package managers now. And then there's the move to Python. And yeah, I can hardly think of a language or a place to run where you can't use some level of open source.
44:44 Yeah. But the second question, which was, does your employer encourage you to contribute to open source? I was surprised to see 65% said yes. That is a huge fraction, and it's interesting, because that has not always been that high. I know that we have spoken to people who have said, hey, I wish I could contribute, but my employer just doesn't have a policy for this, or we just don't do that. I hear that a lot.
45:10 Right. It's too complicated. I might leak something out or bring in some GPL stuff and mess up our commercial product or whatever, right?
45:22 So I don't know how all these companies have solved that internally, but I am excited to see that there's now a huge potential base of open source contributors out there, commercially, that wasn't there before.
45:32 I do think there's something about creating a culture for software developers and data scientists where they want to be. People don't want to be in a place where they're forced to use just proprietary tools that are old and crusty, and they're not allowed to share their work or talk about their work. There are people who will do that, but I wouldn't love to be in that environment. So companies will probably create environments that attract the best developers, and the best developers don't want to be locked in a basement where they can't share or contribute.
46:04 Yeah, I definitely agree with that.
46:06 Another thing that's hot these days, hot in the sense that you don't want it, hot potato style, is supply chain stuff and open source pipeline issues. The survey actually mentioned that one of the reasons people don't want to use open source is they believed it was insecure: our $20 billion bank is now depending on this project from some random person or whatever, and if somebody takes over the thing, we're going to pip install a virus into the core trading engine. That's going to be bad, right? That's an extreme example. But you did ask about what people are doing to secure the code they're acquiring through open source.
46:50 Yeah. And this is something we're interested in just generally, because there's a lot more focus on security, and you see more reports about supply chain attacks on software. And so we're curious how different organizations are tackling the problem.
47:03 Unsurprisingly, the most popular answer, at 45%, was that they use a managed repository, which I interpret to mean, basically, you have a private mirror of the packages that are approved in your organization, and everyone pulls from there, not from the Internet directly. Which is a smart approach, because it gives you a natural sort of gating thing that you can do here: there is a review process to bring new software in there. And so there's a lot you can do here. I mean, obviously, even commercially, we sell a repository for conda packages for precisely this reason, because customers want some governance and are more than happy to pay us. Team Edition is our own package repository.
47:50 And so this is an ask for customers, which is why we built this product. They were like, hey, we want your stuff, but we want it inside our firewall. We don't want to go directly to your public repo.
47:59 You want to opt in to say, yes, we want the new NumPy, not just, oh, somebody randomly pushed something out, so we're going to just grab it and assume that it's good. Right.
48:10 You can apply policies as well. That's common. There are a lot of places that say no GPL software, for various reasons. Or, if there are reported CVEs, the security reports that go through this, they might say, I want no packages with a CVE more severe than some level. And every IT department wants some handles to control that kind of policy decision making. I think that's why that's the most popular option: it's the easiest thing to get a handle on.
48:43 Yeah, you can set up a private PyPI server pretty straightforwardly; there's a cool article on 'testdriven.io'. But yeah, the conda and Anaconda version that you all offer, that's pretty cool. 45%, that's high. I didn't expect that many companies to have a private repository. It's good, but I just expected it to be lower.
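For pip users, pointing everyone at a private mirror is typically just a configuration change. This is a sketch of a `pip.conf` (or `pip.ini` on Windows); the mirror URL is hypothetical and would be replaced by your organization's internal index.

```ini
; Hypothetical pip.conf routing all installs through an internal,
; curated mirror instead of the public index.
[global]
index-url = https://pypi.internal.example.com/simple
```

With this in place, `pip install numpy` resolves against the curated mirror, so the review and policy gates discussed above apply automatically.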
49:05 Yeah. Although on the other side, that means 55% of them were just downloading random stuff from the Internet.
49:13 So it's good. I think the message is getting out that you have to think about these things from a risk perspective.
49:17 Another was 33% of the organizations do manual checks against a vulnerability database.
49:23 Yeah. So this is what I was describing earlier, the CVE databases, the common vulnerabilities and exposures databases.
49:31 Manual checks. That's a lot of labor.
49:32 So it'll be interesting to see how many places move to automating that in some fashion. The hard part there is that those databases, again back to data prep and data cleaning, often require some amount of curation to make use of, because there's a lot of stuff that ends up in there that's mis-tagged or unclear or whatever. And so a lot of the manual checking is probably also just doing that curation.
49:58 That's one of the things that's nice: GitHub now does automatic PRs for security problems that it knows about, at least. Yeah.
50:06 That kind of automation is going to be really important. I think in the future, just because you can't manually go through all those things.
50:12 What are you seeing around source control? You know, source code, algorithms, these are really important, and people want to keep them super secure. But if they put them on their own private source code repositories, they lose a lot of benefits, like automatic vulnerability checking and stuff like that. What are the trends there: GitHub or GitLab versus other stuff, maybe GitHub Enterprise?
50:36 The interesting thing there is everyone is using source control at some point, and oftentimes they want it managed inside their firewall. And so yeah, things like GitHub Enterprise and GitLab are pretty popular for that. A lot of times, I think what places will do is use some kind of, the next item here, the 30% that are using a vulnerability scanner. A lot of those vulnerability scanners you can use on your own internal source repositories.
51:00 And so that way they're not taking advantage of GitHub automatically doing that for them, but they at least have some solution, probably, for looking for stuff.
51:10 20% said they have no idea what they're doing, and then another 20% said we're not doing anything. Well, at least they're sure of it.
51:17 Let's maybe close out this overview of the survey results here by talking about Python.
51:25 Python popularity. Is it growing? Is it shrinking? Is everyone switching to 'Julia' or have they all gone to 'Go'? What are they doing?
51:33 Yeah. So I think Python's advantage here is being pretty good at a lot of things. And so it ends up being a natural meeting point for people who are interested in web development and data science, or system administration automation, and all of that. So I think Python still has some growth to go. But what's interesting is, in our survey, I would say the second most popular was SQL, which has been around forever.
51:58 And it's still going. Yeah, exactly. And they're often used in parallel, right?
52:03 I'm going to do a SQL query and then run some Python code against the results. That type of thing.
52:08 Yeah, definitely. I'm a big believer that there is no one language for everything, and there never will be. But there are a lot of different options that people are looking at. 'Go' makes sense for a lot of sort of network service kind of things. Kubernetes is built entirely out of Go, but I'm not sure I'd want to do any data science in Go at this point. Sure.
52:30 So it's going to always be a mix.
52:33 It might not even be that you're doing one or the other. You might be doing both. Like, for example, maybe you've written some core engine in Rust, but then you wrap it in Python to program against it. Right. It could be both. I guess it could even be more of a combination than that. But the popularity of Python looks strong, so it looks like it's still right up near the top. Obviously the group that you poll is somewhat self-selecting, right? But that's still the general trend outside of your space.
53:01 Yeah. This is definitely going to be skewed to Python, because otherwise why are you taking an Anaconda survey? But still, I think it is definitely something you see broadly in the industry as well.
53:12 Speaking of different languages and stuff, out in the live stream, Alexander Simonov says: just learned that I can use Rust in JupyterLab with some help from Anaconda. My mind is blown. Good job.
53:22 Yeah. That's one thing I should mention about Python: one of the advantages is, if you're using Python, you're probably benefiting from most of the languages down the stack, even if you're not writing them. And so the ability of Python to connect to anything, I think, is its strength, and why it continues to top these lists.
53:38 Yeah, absolutely. And then Paul out there has a question about the commercial license, and I guess there are some changes to it. Can you maybe speak to that? I don't really track the details well enough to. Yeah.
53:51 What we did was, the Anaconda distribution packages have a terms of service that says if you are in an organization above a certain size, we want you to have a commercial license if you're using it in your business. I forget the exact threshold where that's at. And the reason there was to help support the development of those packages. I should say, by the way, that terms of service does not apply to conda-forge, obviously; those are community packages. But if you want the assurances we're providing on those packages and you are a company of a certain size, we would like you to have a commercial license that allows us to support you more directly and allows us to fund continued work on those packages. And so it's a sustainability thing, I think.
54:37 But for most people, it's not an issue because they're either below that size or you're just using it individually.
54:43 Do you know what that size is? What that cut off is?
54:45 I do not recall off the top of my head, and so I'm afraid to quote a number.
54:49 Yeah, sure. No worries.
54:51 All right. It seems fair that large companies benefiting from your tools contribute back. I think that statement should be applied to open source in general. If your company is built on Python, you should give back to the Python space. If your company is built on Java, it's Oracle; I don't know if they need help. But in general, if you're built on top of something, there's a lot of support you can give back. Right? It's kind of insane to me that banks that are worth many, many billions of dollars give very little in terms of directly supporting the people they're built upon. Right?
55:27 They could pay for a couple of people building the core libraries. Like if you're using Flask, support the Pallets organization, something like that. Yeah.
55:37 And then we in turn take that licensing money, and some fraction of it goes to NumFOCUS for the broader sort of data science open source community, in addition to us directly funding some open source projects as well.
55:46 Well, we're about out of time, Stan, but let's talk real quickly about Pyston, because Pyston is not rewriting Python in Rust. It's not replacing it with Cython or just moving to 'Go'. It's about making core Python faster, right? Yeah.
56:04 This is something we've been thinking about, performance in Python, for a long time. One of the early projects that Anaconda created is called Numba. It's a Python compiler. It's focused on numerical use cases, and it really does its best job dealing with that kind of numerical, loop-heavy code.
56:23 But it's not going to optimize your entire program; it optimizes specific functions. And so Numba is very good at a very specific thing. And so we've been thinking for a long time about how we could broaden our impact.
56:33 And so I saw that Pyston, which is among many Python compiler projects, had reemerged in 2020 with a version written from scratch based on Python 3.8, as a just-in-time compiler in the interpreter. So it's designed to optimize any Python program. It can't necessarily do any given thing as fast as Numba might for a specific numerical algorithm, but the breadth is really what is interesting to us. And so I saw this project had emerged, Pyston 2.0 kind of came on the scene, I started looking more closely at it, and we started talking with them, and we realized that there's a lot that Pyston and Anaconda could do together.
57:11 And so we have hired the Pyston team onto our open source group. So they are funded to work on Pyston the same way we fund open source developers to work on other projects. And we think there's other help we can give, and resources and infrastructure that we can offer this project. And so we're really excited to see where this is going to go from here.
57:32 Yeah, I'm excited as well. All these different things that people are doing to make Python faster for everyone, not just, well, let's try to recompile this loop. You just run Python, and it just goes better. I think that's pretty exciting. We've got the Cinder project from Facebook.
57:47 It's a really good year for Python optimization projects.
57:53 So be careful about typing that into a search engine.
57:57 But the Cinder project is not something that's publicly available, really. It's not like a supported improvement; it's more, here's what they did at Instagram.
58:07 There are a bunch of speedups. Maybe they can bring some of that back into regular Python. But, yeah, there are a lot of these types of ideas. Awesome. Looking forward to seeing what you all do with this.
58:15 And the CPython core developers have even announced that they're undertaking a new effort to speed up CPython, and so we're looking to collaborate with them. They're going to have to figure out what they can do within the confines of CPython, because they are the Python interpreter for the world, and so they need to be careful. But there's a lot they're going to do. And we're going to try and share ideas as much as we can, because these are both open source projects.
58:45 A lot of the challenges have been in compatibility. We could do this, but then C extensions don't work, and those are also important for performance in big ways and other stuff. So they do have to be careful, but that's great. All right, final comment real quick, a follow-up from Paul: I'd like my company to do more to support open source. Any advice on promoting that?
59:07 Yeah, I think the best first place to start is identifying what open source your company absolutely relies on, and especially if you can find an open source project that you absolutely rely on that doesn't seem to be getting a lot of support. And then go look at those projects and see: do they have an established way to donate funds? Do they have other needs? That's something, I think, that is easier to sell, because you can say, look, our organization absolutely depends on X, whatever this is, as opposed to picking a project at random. It's easier to show a specific business need.
59:41 Yeah, for sure. You can say, look, this is the core thing that we do and it's built on this, rather than, oh, here are some projects I ran across, we should give some of our money away. Yeah, that's a harder sell to stockholders, I guess. All right. Well, Stan, this has been really fun. Let me ask you the final two questions before we get out of here. If you're going to write some Python code, what editor do you use?
01:00:00 So if I'm on a terminal, it's Emacs. If I have an actual GUI desktop.
01:00:05 I'm usually using VS Code these days. And then, notable PyPI package or conda package that you're like, oh, this thing is awesome, people should know about?
01:00:11 Wearing my GPU fan hat, I think a lot more people should know about CuPy.
01:00:19 It's a Python package that's basically NumPy, but made to run on the GPU. It's the easiest way I can think of to get started in GPU computing, because it just uses NumPy calls that you're familiar with. So I would highly recommend, if you are at all curious about GPU computing, go check out CuPy.
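To give a feel for the "NumPy calls you already know" point, here is a tiny sketch. CuPy is designed to mirror the NumPy API, so code like this is meant to run on an Nvidia GPU by swapping the import for `import cupy as xp` (assuming CuPy and a CUDA device are available); the NumPy version is shown here so the example works without a GPU.

```python
# NumPy today; the same lines are intended to work with
# `import cupy as xp` on a machine with CuPy and a CUDA GPU.
import numpy as xp

a = xp.arange(6, dtype=xp.float64).reshape(2, 3)
b = xp.ones((3, 2))

c = a @ b               # matrix multiply
total = float(c.sum())  # reduce to a plain Python float

print(c.shape, total)   # (2, 2) 30.0
```

Writing against a module alias like `xp` is a common pattern for keeping code portable between NumPy and CuPy.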
01:00:37 So over there on that computer I have, it has a GeForce, but this one obviously doesn't; I don't have Nvidia on my Mac.
01:00:46 Does that work without CUDA cores? Part of that is for the Nvidia-rich, right? What's my GPU story if I don't have Nvidia on my machine?
01:00:55 Not as clear.
01:00:56 Yeah, CUDA has kind of come to dominate the space, being sort of first out of the gate; there are a lot more Python projects for CUDA. There aren't really clear choices, I think, for AMD or for built-in GPUs at this point.
01:01:13 Although I definitely watch the space. Intel is coming out with their own GPUs, sort of this year and starting next year, and they have been collaborating with various open source projects, including the Numba project, to build Python tools to run on Intel GPUs, both embedded and discrete.
01:01:31 Okay, so this may change in the future. It'll be interesting to see final call to action.
01:01:36 People are excited about digging more into these results and learning more about the state of the industry. What do they do?
01:01:42 Go search for state of data science and Anaconda and you'll find the results of the survey.
01:01:47 There's a lot of detail in there, so I would definitely go through and take a look at all of the charts and things, because there are all kinds of topics covered in there.
01:01:55 Yeah, I think it's 46 pages or something and we just covered some of the highlights, so absolutely. All right, Stan. Well, thank you for being here. It's been great to chat with you.
01:02:03 Thanks. It's been great.
01:02:04 You bet. This has been another episode of Talk Python to me.
01:02:09 Our guest on this episode was Stan Siebert, and it's been brought to you by 'Shortcut' and 'Masterworks.io', and the transcripts are brought to you by 'Assembly AI'. Choose Shortcut, formerly 'Clubhouse.io', for tracking all of your projects' work, because you shouldn't have to project manage your project management. Visit 'talkpython.fm/shortcut'. Make contemporary art your investment portfolio's unfair advantage. With Masterworks, you can invest in fractional works of fine art. Visit 'talkpython.fm/masterworks'. Do you need a great automatic speech-to-text API? Get human-level accuracy in just a few lines of code. Visit 'talkpython.fm/assemblyai'. Want to level up your Python? We have one of the largest catalogs of Python video courses over at Talk Python. Our content ranges from true beginners to deeply advanced topics like memory and async, and best of all, there's not a subscription in sight. Check it out for yourself at 'training.talkpython.fm'. Be sure to subscribe to the show: open your favorite podcast app and search for Python. We should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. We're live streaming most of our recordings these days. If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at 'talkpython.fm/youtube'.
01:03:31 This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it. Now get out there and write some Python code.