#422: How data scientists use Python Transcript

Recorded on Wednesday, May 31, 2023.

00:00 Regardless of which side of Python you sit on, software developer or data scientist, you surely know that data scientists and software devs seem to have different styles and priorities.

00:10 Why is that? And what are the benefits as well as the pitfalls of this separation?

00:14 That's the topic of this conversation with our guest, Dr. Jodie Burchell, data science developer advocate at JetBrains.

00:22 This is Talk Python to Me, episode 422, recorded May 31st, 2023.

00:27 Welcome to Talk Python to Me, a weekly podcast on Python. This is your host, Michael Kennedy.

00:45 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython, both on fosstodon.org.

00:52 Be careful with impersonating accounts on other instances.

00:55 There are many.

00:56 Keep up with the show and listen to over seven years of past episodes at talkpython.fm.

01:02 We've started streaming most of our episodes live on YouTube.

01:05 Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows and be part of that episode.

01:13 This episode is brought to you by JetBrains, who encourage you to get work done with PyCharm.

01:19 Download your free trial of PyCharm Professional at talkpython.fm/done-with-pycharm.

01:26 And it's brought to you by Prodigy from Explosion AI.

01:30 Spend better time with your data and build better ML-based applications with Prodigy, a radically efficient data annotation tool.

01:37 Get it at talkpython.fm/prodigy and use our code TALKPYTHON, all caps, to save 25% off a personal license.

01:45 Jodie, welcome to Talk Python to Me.

01:48 - Thank you, I am so thrilled to be on the show.

01:50 - I'm so thrilled to have you on the show.

01:52 I've been a fan of your work for a while and we got a chance to get to know each other at this year's PyCon.

01:57 And so here you are on the podcast as well.

02:00 - Thank you, we had some very nice Mexican food actually, or maybe Utah Mexican.

02:06 I don't know how I would interpret it.

02:08 It was very good though.

02:09 - It was very good, yeah.

02:10 The food was excellent.

02:11 I thought the parties were great at the conference, and for people who are maybe still holding out on going: personally, I really enjoyed being there.

02:20 I think it's probably the best conference that I go to yearly.

02:24 And it's like the vibe is so nice.

02:27 On this show, we're going to talk about how data scientists use Python, which is somewhat different than maybe a software developer.

02:35 Which, I guess, I'll put myself solidly into that camp.

02:39 I do a bunch of web development, make APIs, I build apps and ship them.

02:43 That's quite a different story.

02:45 And I think we're going to have a really great time talking about those things.

02:48 But before we do, let's get to know you a bit.

02:51 Let's see, how'd you get into programming, Python, data science?

02:54 Yeah.

02:54 So, probably going hand in hand with maybe not being a developer.

02:59 My story is perhaps a little unconventional.

03:02 So my background is academic, like a lot of data scientists.

03:05 And unsurprisingly, the first language that I learned was R because I was doing psychology and health sciences and a lot of statistics.

03:13 And I was procrastinating once during my PhD.

03:17 You will find any excuse to not work on your thesis.

03:20 And I think I was reading, oh, you know, people who are into statistics, you should really learn Python, it's the future.

03:26 And I was like, I should learn Python.

03:28 So I sat down and-

03:30 - Because I really don't wanna write that next chapter.

03:32 I just don't. - Exactly, exactly.

03:33 So I remember it, like I actually had this long weekend and I worked my way through, I think it was Zed Shaw's "Learn Python the Hard Way." This is showing my age, I think.

03:43 I loved it. Like I completed the course in three days. And then I didn't know what to do with Python because the stats libraries weren't as developed back then. So I just put it aside for a couple of years and ended up picking it up again when I started working in industry, because obviously I've left academia. And you sort of fairly quickly, once you start in data science, move away from more sort of statistical stuff to machine learning.

04:07 And Python really has the libraries for that. So that's my journey. It's a little bit bits and bobs and stops and starts.

04:15 But once I kind of picked up Python, it really was love at first sight.

04:18 Oh, that's that's excellent.

04:19 What's your PhD in?

04:20 It's actually computer science, of course.

04:22 Right.

04:22 Of course.

04:23 Of course.

04:24 You know, it's so funny.

04:25 You are the third person to ask me in two weeks.

04:28 And no one has asked me this question for like two years.

04:30 My PhD was in hurt feelings.

04:32 Hurt feelings?

04:33 Yeah.

04:33 OK.

04:34 I say this a little bit blithely.

04:35 So my PhD being in psychology, I was really interested in emotions research and relationships research.

04:42 So, I kind of wanted to see what happens to people emotionally when close relationships go bad.

04:49 And it's hurt feelings, like things like, you know, infidelities, rejections, all of that.

04:55 It's hurt.

04:55 So, I was just studying what generates and, like, regulates the intensity of hurt, and studied that for four and a half years.

05:04 - Yeah, sounds interesting.

05:05 I'm sure there was a lot of data to process.

05:07 There was a lot of data to process and a lot of very interesting statistics.

05:13 That was sort of how I got into data science.

05:15 I fell in love with stats in undergrad and just kept going down that path.

05:19 - I think a lot of people are drawn to data science not with the intent of waking up one day and saying, "I'm gonna be a data scientist," but they're excited or inspired about something tangential and they're like, "Well, I really need to get something better than Excel to work on this." Right? - Absolutely.

05:36 Yeah, yeah.

05:37 And we'll probably talk about this a little bit later about why data scientists use programming.

05:42 And it kind of is, in some ways, that need to jump from Excel to something way more powerful and reproducible.

05:49 Yeah, yeah, for sure.

05:51 So how about now?

05:52 You said you've left academics.

05:53 And what are you doing these days?

05:56 Yeah, so that leap from academia was a long time ago now, I think that was like seven years ago, again showing my age.

06:03 So for six of those years, I was a data scientist.

06:06 So day to day was, you know, pretty varied, but the job I have now is very different.

06:13 So I currently work as a developer advocate at JetBrains.

06:17 And the way I would describe my job is I'm a liaison between data scientists and JetBrains.

06:23 So I try and advocate for our tools to be as good as they can be.

06:27 And I try to recommend ways that people can use our tools if I think it's useful.

06:31 But I'm definitely not marketing or sales.

06:33 It's more, if I think this is the right fit for you, I'll recommend it.

06:36 So it's like the way I achieve that is really up to me.

06:41 For me, I really like to do a mixture of what I call internal and external activities. So external activities are actually kind of only tangentially related to the products. So this would be an example of an external activity.

06:54 It's just getting out there and educating people about data science or educating data scientists about technical topics, things like conference talks or webinars, you know, all this sort of stuff. And then internal stuff is more focused on things a bit more related to the product. So if I think there's a feature that people would be really interested in, I might make a video about it or create, you know, a blog post. So yeah, it's a real hodgepodge. So this week, for example, I've been working on materials for a free workshop that's being organized at EuroPython. So I'm going to be volunteering to help out with that. It's completely unrelated to anything I'm doing at JetBrains. It's just a volunteer activity.

07:33 But last week I was at a conference, and the week before that as well.

07:36 So as you can see, the job's pretty varied.

07:38 I think developer evangelists, it seems like such a fun job.

07:41 You know, I had your colleague Paul Everitt on and we actually talked.

07:45 It's quite a while ago, a couple of years ago, three, four.

07:48 And we had a whole episode on like a panel on what is the developer advocate, developer relations job.

07:56 But it just seems like such a great mix: you still get to travel a little bit, see people, but you also get to write code and work on influencing technology and products and stuff.

08:08 Yeah.

08:09 And I think the thing that I started to appreciate about the job a few months in is you have a platform with this job, and that means you can choose to promote the message that you want.

08:19 And a message that, no surprise, is very meaningful for me is data science is for everyone.

08:25 Like I hate the gatekeeping that can happen in tech communities.

08:28 I think it's quite bad in terms of like people being intimidated by math in data science.

08:34 And like, I'm here to say to you, if you want it, it's for you.

08:37 It is a very cool field.

08:39 And yeah, doesn't matter what background you come from.

08:41 I absolutely agree.

08:42 And I was kind of hinting at that saying like a lot of people who don't see themselves as developers or programmers, like still find really great places, really great fits in data science and in programming as well.

08:53 And I also want to second that I don't think you really need that much math.

08:57 Maybe if you're trying to build the next machine learning model platform, then yes, okay.

09:04 But that's not what most people do.

09:06 They take the data, they clean it up, they do interesting visualizations and maybe put it into some framework for production, right?

09:12 Yeah.

09:13 And the nice thing is the field is in such a point where you have so many frameworks or tools that will handle a lot of this stuff for you.

09:20 Like, I'm not saying you don't need any understanding of what's going on under the hood, but you can learn it incrementally.

09:26 A lot of it is like with software development, where you develop that smell or that instinct for when something is not right. That will benefit you more than, you know, knowing how backpropagation works from a calculus perspective.

09:39 Like that stuff is maybe a bit too much.

09:41 You don't necessarily need it.

09:42 Yeah.

09:43 Let's get into the main topic and talk a little bit about how does programming in Python differ on the data science side than say me as somebody who builds web apps.

09:52 Yeah. And maybe we can start by doing an orientation to like, what does a data scientist do? Because I think this confuses a lot of people. Yeah.

09:59 Yeah. So basically, the reason you would hire a data scientist is you have a bunch of data and you have an instinct that you can use that data to either improve your internal processes or sell some sort of IP.

10:17 So the reason, you know, we differ from BI analysts is BI analysts are doing analysis, but it's more about business as usual, which is really important.

10:27 You absolutely need BI analysts.

10:29 How are sales going?

10:30 How much have we made versus last year?

10:32 Like those kinds of charts, right?

10:34 Like absolutely fundamental questions.

10:36 So you need to be an analyst before you need a data scientist, but your data scientists are more there to push the envelope in a data driven way.

10:44 So we have two main outputs, I would say.

10:48 You can either create an analysis and do a report, or you can build some sort of model that will go into production.

10:55 So an example might be as a business, I have an instinct that I can get my customers to buy more things based on people like them also buying those things.

11:05 So in that case, your data scientists might be able to build you a recommendation engine. And this will have a business outcome.

11:13 Developers, obviously, on the other hand, have a very different goal.

11:16 Their goal is to create robust software systems.

11:20 So the concerns that they have are things like latency, server load, downtime, things like that.

11:27 And it's very interesting.

11:29 And we'll talk about it a bit more when we talk about code, if we kind of get into that topic. But basically, data scientists are not really interested in creating code for the long term, whereas for software developers, the code they write often becomes the product.

11:43 And you have to think about things like legacy systems, because eventually every Greenfield project becomes a legacy system if it lasts for long enough.

11:51 - Yeah, if you're lucky, right?

11:52 Because the alternative is it never really got used or it didn't add that much value and got discarded, shut down, all right.

11:59 So even though people talk about how much they don't want legacy code, or how they kind of don't necessarily want to work on it 'cause they want to work on something new and shiny, that's kind of the success side of software development, right?

12:12 Yeah, exactly.

12:14 I do think it's super interesting, the different lifecycle of code on the data science side, because you might be just looking to explore a concept or understand an idea better and not necessarily ever intend to put it into production in the traditional software sense, right?

12:32 So I've seen some pretty interesting code written that people would look at and go, oh my goodness, what is this? There's not even a function here.

12:43 How is this possible? You know, it's like copy and paste, reuse almost. And yet it really does go from having no idea to pictures and understanding and then maybe handing that off potentially to be written more robustly or better. Exactly. And it's really interesting. Like, they're kind of two very different processes. And that point actually where engineering and research meet is a very, very interesting one. And I've seen it work in multiple different ways in multiple different workplaces. So, for example, I've worked in places where the data scientists were completely sequestered away from the engineers. And there really wouldn't be that much discussion between the engineers and the data scientists during the research phase, which I do not advise. So what that means is the data scientists will come at the end and hand over this chunk of perhaps very difficult to read code to an engineer and be like, "Hey, so we need to implement this." And the engineer is like, "What is this? Okay, I will schedule that for the next six months." And then I've seen, or I've been, a data scientist embedded within a software development team. And in that case, your project is marching in lockstep with what the engineers are doing. And from the very start, you know, you've been discussing important things, like you need to build a model that has latency constraints. You need to think about this as the data scientist in terms of the model that you run, but also how it's implemented.

14:08 Right. Like how much memory does it use? Because if you run it on your own machine by yourself, then, well, it's kind of limited by your computer, sort of, sometimes. But if you're running a thousand of them concurrently, because people are interacting with the website, all of a sudden that might make a difference.

14:23 Or like one of the most interesting problems I think I ever solved in my career was basically I was working at a job board.

14:30 We're trying to improve the search using natural language processing.

14:34 So we had this idea that we could build a model that found out the probabilistic relations between skills and job titles.

14:42 So if someone typed a skill into the search, we could expand it with job titles and then find all of the jobs that we indexed with that at search time and vice versa with job titles and skills.

14:52 The thing is, we need to find those relations at search time.

14:55 That is a very low latency system.

14:57 And it was super interesting because we had to think about how we could search that vector space really, really quickly.

15:06 Instead of having to calculate the distance between that and every single vector, we had to work out how to do that more efficiently.

15:12 That sort of stuff I really like because it's so applied and it's so...

15:16 It's this really nice intersection between computer science and data science, I think.

15:20 It's super cool.
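To make that concrete, here's a minimal sketch of the idea: query a prebuilt vector index instead of brute-forcing the distance to every single vector. The embeddings, sizes, and the choice of scikit-learn's NearestNeighbors with a ball tree are all illustrative assumptions, not the job board's actual system.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical embeddings: one 64-dimensional vector per job title.
rng = np.random.default_rng(42)
title_vectors = rng.normal(size=(100_000, 64))

# Brute force would compare the query against every vector; a ball tree
# prunes most of the space, which is what makes search-time lookups viable.
index = NearestNeighbors(n_neighbors=5, algorithm="ball_tree").fit(title_vectors)

query = rng.normal(size=(1, 64))    # stand-in for the embedded search term
distances, neighbor_ids = index.kneighbors(query)
print(neighbor_ids)                 # indices of the closest job titles
```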

15:21 One of the things I really like about working with programming broadly is how concrete it is.

15:28 Right? You came from academics. You know, I was in grad school for a while as well.

15:32 And it's, you could debate on and on about a certain idea or concept, and it's like, well, you might be right. Whereas here, you push a button and you get the answer, or it runs. There's a really nice feedback loop of, I built this thing, and look, it's really connecting these people. And then it comes down to, can you do it in real time, and other things like that. But that's a really cool aspect of programming.

16:00 This portion of talk Python to me is brought to you by JetBrains and PyCharm. Are you a data scientist or a web developer looking to take your projects to the next level? Well, I have the perfect tool for you. PyCharm. PyCharm is a powerful integrated development environment that empowers developers and data scientists like us to write clean and efficient code with ease.

16:21 Whether you're analyzing complex datasets or building dynamic web applications, PyCharm has got you covered.

16:28 With its intuitive interface and robust features, you can boost your productivity and bring your ideas to life faster than ever before.

16:35 For data scientists, PyCharm offers seamless integration with popular libraries like NumPy, Pandas, and Matplotlib.

16:41 You can explore, visualize, and manipulate data effortlessly, unlocking valuable insights with just a few lines of code.

16:48 And for us web developers, PyCharm provides a rich set of tools to streamline your workflow.

16:53 From intelligent code completion to advanced debugging capabilities, PyCharm helps you write clean, scalable code that powers stunning web applications.

17:02 Plus, PyCharm's support for popular frameworks like Django, FastAPI, and React makes it a breeze to build and deploy your web projects.

17:10 It's time to say goodbye to tedious configuration and hello to rapid development.

17:15 But wait, there's more.

17:17 With PyCharm, you get even more advanced features like remote development, database integration, and version control, ensuring your projects stay organized and secure.

17:25 So whether you're diving into data science or shaping the future of the web, PyCharm is your go-to tool.

17:30 Join me and try PyCharm today.

17:32 Just visit talkpython.fm/done-with-pycharm, links in your show notes, and experience the power of PyCharm first hand for three months free. PyCharm, it's how I get work done.

17:46 I think it's kind of a shame that a lot of places do set up their engineering and data science teams so separately. Sure, we have quite different roles and we have quite different backgrounds sometimes. But I really think that having the two teams at least planning things together, you can really actually learn a lot from each other about how to approach problems.

18:10 When you were describing, you know, either having those groups really separated or working really closely together, maybe an analogous relationship that people could relate with is maybe front-end developers and people building the APIs in the back-end, right?

18:25 Like the people doing React or Angular or Vue or whatever it is, you know, in the web design.

18:32 Having those completely separated as well is also, you know, it's terrible, not a good idea.

18:36 - It doesn't make any sense.

18:38 And like, I can totally understand it from the point of view of like team composition, because it is, I think, better to have all your data scientists together because they can learn from each other.

18:47 But then I think having, I don't want to use the squad term 'cause I know it's become a little bit unpopular to use it, but you know, this idea of project-oriented teams, I think, is quite important.

18:56 - Let's dive a little bit more into the research side of things that I want to ask you about.

19:01 Why Python?

19:02 Let's talk about how the research process works and maybe why that results in different priorities and styles of code and styles of engineering.

19:10 It starts at a similar point to all software projects, which is business comes to you and they have some sort of goal. Sometimes it's very vague and you need to interpret that and turn it into an executable project. But where the sort of uncertainty starts, and like where it sort of becomes a research project rather than a project, and I don't know if I described that very well, but it becomes research versus something you're building concretely, is that you don't even know at the start of a research project whether it's even possible to answer the question that you're being asked or build the internal product that you're being asked for.

19:45 You might not understand the domain entirely, right? You're trying to gain understanding even.

19:50 In the very worst cases, you won't even know if you have the data, because maybe your company has so much data and it's so poorly organized, again, something I've seen, that you don't even know if the data exists to answer this question. So, first is going and getting your data. And you spend quite a lot of time with the data because the data will be the one that tells you the story. It'll tell you whether what you even want is possible. And you've probably heard data scientists hammering on about, you know, garbage in, garbage out. Like, you can build the most beautiful, sophisticated model you want. But if you have crap data where there's no signal, you're not going to get anything because it's just not there. Like, the relationship you're looking for is not there.

20:30 - Yeah, the side of that I've heard is 80% of the work is actually the data cleanup, data wrangling, data gathering before you just magically hit it with a plot or something, right?

20:41 - Absolutely, and it's interesting because that data cleaning, data wrangling step also doesn't happen in one go, especially if you're building a model.

20:48 So what will happen is you'll try something out and you'll be like, okay, it didn't quite work.

20:52 Maybe I need to manipulate the data in a different way, or I need to create this new variable, and then you'll go again. And it's this super iterative process where you have this tightly coupled relationship between both the models and the data. So it really is sort of, you know, how I was talking about the instinct, this is sort of where that comes in, because you're going to spend like 80% of your time there, honing your skills. But it's the most, I think, valuable part of the process.

21:17 And if the signal's there, you can usually get away with using really dumb and simple models, you know, things that are unfashionable now, like decision trees or linear regression. You can get away with them because you've just got such good data; you just go with a simpler model. It's got all the advantages. This is sort of, I think, what makes it different: you're sort of moving towards a goal, but you don't know what that goal is.

21:42 Estimation is always hard, right? What I've found works best is really just time boxing each step, seeing if you are up to where you thought you'd be up to by a certain point. And if not, you need to just keep having those discussions with the business stakeholders, because otherwise they're going to not be very happy if you've spent six months just looking at something and you have nothing. What have you built? Well, I have some notebooks I could show you. I have 40,000 notebooks and they're all terrible. Yeah. Speaking of the data, Diego in the audience has an interesting question. How big are the data sets businesses will bring to you, typically?

22:18 Enough that you don't need to go out and find more data?

22:21 This is a good question. So I hate to say "it depends," you know. I get to say that, though, because, you know, I was a lead data scientist, so I earned that rank. It really does depend on the problem you need to solve. So typically, business will have enough data to cover at least some of the use cases. So to give you a really concrete example, this job board that I was talking about that I worked at, we actually had like a bunch of different job boards across Europe. So we had some that were a lot bigger, like Germany or the UK, and we had some that were really small, like Poland or Spain. And we wanted to build these multi-language models or models maybe for different languages.

23:02 We played around with both. And I don't think we really had enough data to support the models in these smaller languages.

23:09 So the models were just not as good quality because we didn't have enough data.

23:12 But for the bigger languages we did.

23:14 And then it sort of becomes a case of, OK, well, we have more data for these particular websites because they're the ones that are bringing the most revenue. So then it sort of became like, well, OK, maybe it's good enough that we improve the search on the most important ones.

23:27 And for the smaller ones, we just wait until we accumulate more data.

23:30 So, yeah, most of the time I found that there's a way to make it work for at least part of the solution.

23:37 And then sometimes, like in the case of my last job, we had something like 170 billion auctions per day.

23:44 Sorry.

23:44 So we had so much data.

23:48 We even had problems like processing it.

23:50 So sometimes, you know, that's the other side of the story: when you've got too much, then how do you throw it away?

23:56 Right.

23:56 I mean, you've got this auction story.

23:59 Like another one that comes to mind is the large Hadron collider.

24:02 Oh yes.

24:03 Yes.

24:04 They've got layers and layers of, like, chips on hardware, and then chips or machines right next to the collectors, and then on out, where it's all about, how do we throw away,

24:13 you know, terabytes of data, down to get it to megabytes per second, right?

24:18 Yeah, and it's interesting, because what you can end up with in those situations is, even then, you can have underrepresented groups. So for example, we were working with advertisers and apps, you know, basically trading ads, and we ended up with some apps that were just so small that you were like, "Even with all this data, I really don't have enough to represent this particular combination in this country." - Interesting. Very interesting.

24:45 Why Python? You started out in R, and of course, any distraction from writing a PhD is a good distraction.

24:53 But I do think there's been a really interesting graph.

24:58 If people go and look at, what is it, Stack Overflow Insights.

25:02 If you go look at Stack Overflow Insights, they had a really great graph that shows you the popularity of Python over time.

25:12 And there's just this huge inflection point around 2012.

25:15 And I feel like that's when a lot of the data science libraries really came around and took off.

25:21 It seems like there was a big inflection at one point, but, you know, why?

25:25 To be honest, I can talk about why I like Python from my background.

25:31 I couldn't really tell you exactly what caused that takeoff, but, you know, apart from, you know, this idea that the libraries were maturing enough. But the thing is, looking even at current surveys, around 60% of data scientists do not have a software development or a software engineering background. So, for people like us, we don't really understand, like, it sounds terrible, we don't really understand basic constructs in how a programming language works. And that can actually mean that going to some sort of compiled language even can be quite a steep learning curve.

26:03 Sure. Pointers to pointers, for example. Like, "Oh, no thanks." Yeah. Or having to deal with the fact that in Java, everything is a class. You're just like, "What is this?" But of course, you understand why if you have that background. But if you're trying to learn it yourself, you then have a lot of background you need to cover. But in Python, and in R as well, you don't need to cover those things.

26:26 It's super easy to prototype, it's super easy to script.

26:30 The flexibility of Python is what makes it, I think, the perfect prototyping language.

26:34 And that's essentially what you're doing, you're prototyping.

26:37 So we talked a little bit earlier about like, why not just Excel?

26:41 We didn't quite say that, but this was sort of what we were maybe getting at.

26:45 And yeah, we could do some of our work in Excel.

26:48 I've tried this.

26:50 And first, I can tell you Excel really starts to struggle when you have too many calculations going on under the hood; it gets very, very slow. But to be honest, it's just cleaner to code this sort of logic.

27:03 It's much more reproducible when you need to do this iterative sort of stuff.

27:06 And it also means that you can use much more powerful tools.

27:11 So you can, say, use APIs that developers have made to process your data. You can use powerful...

27:17 - Or get data, right? Like, I need live currency conversion data. That's so much easier than in Excel.

27:23 Yes, yes, yes. Exactly.

27:25 Like, you can scrape data, or you can, yeah, pull data in from an API, or you can use powerful tools like Spark to process 170 billion auctions per day in order to reduce it down to something manageable.

27:40 So, yeah, it just gives you a lot more power.
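As a hedged sketch of that Excel-versus-code point: a few lines of Python can pull live data from an API straight into a data frame. The endpoint and the shape of the JSON here are placeholders, not a real service.

```python
import pandas as pd
import requests

# Placeholder endpoint; substitute whatever API you actually need.
response = requests.get("https://api.example.com/rates?base=USD", timeout=10)
response.raise_for_status()

# Turn the (assumed) JSON payload into a data frame like any other data.
rates = pd.DataFrame(list(response.json()["rates"].items()),
                     columns=["currency", "rate"])
print(rates.sort_values("rate").head())
```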

27:43 But at the same time, why we use programming languages is it's just such a different focus.

27:52 It's a bit overkill to use something like Java.

27:52 I know some people do do natural language processing in Java, but that's more on the engineering side to build maintainable systems.

27:58 - One of the things I like to say, when thinking about people who are coming from tangential interests like biology or whatever, is you can be really effective with Python, and I suspect R as well, with a really partial understanding of what Python is and what it does.

28:14 You pointed out you don't even have to know what a class is or even really how to create a function.

28:19 You just, I can put these six magical incantations in a file and then I can do way more than I otherwise could, right?

28:26 Then you learn one more, you make it better and better as you kind of gain experience.

28:30 Pretty much, and this is where I started, like, obviously I learned what functions and classes were when I first started programming.

28:36 But in the end, you will just, maybe it's not the best thing and we can sort of maybe get into this.

28:43 I suppose part of the confusion or not confusion, but internal debate I've had over the years is how good does data science code really need to be? Like, how much would data scientists benefit from knowing more about computer science topics or software engineering topics, maybe more to the point? And, you know, because like the thing is, every field has so much to learn.

29:03 Don't even get started on what's happening with large language models at the moment. Like, it's just overwhelming. Should we take some of our precious time and learn software engineering concepts? I'm not sure. Like, I'm not sure if I have the answer to that.

29:14 This portion of Talk Python to Me is brought to you by Prodigy, a data annotation tool from Explosion AI. Prodigy is created by Ines Montani and her team at Explosion, and she's been doing machine learning and NLP for a long time. Ines is a friend and frequent guest on the show, and if you've listened to any of her episodes, you know that she knows her ML tools.

29:37 So what is Prodigy? It's a modern, scriptable annotation tool for machine learning developers made by the team behind the popular NLP library, spaCy. You can easily and visually annotate and develop data for named entity recognition, text classification, span categorization, computer vision, audio, video, and more, and put your model in the loop for even faster results.

30:01 After collecting data, you can quickly train and export a custom spaCy model or download annotations to use it with any other library or framework.

30:09 Prodigy is entirely scriptable, in Python of course, the language we all love, and it seamlessly integrates with your favorite libraries and tools.

30:16 Plus, the new alpha version they just released also introduces built-in support for large language models such as OpenAI's GPT models, and new tools for dividing up your data between multiple annotators. Talk Python listeners get a massive discount on a lifetime license.

30:33 You'll get 25% off using the discount code Talk Python.

30:37 But don't wait too long. This offer does expire.

30:40 Get Prodigy at talkpython.fm/Prodigy and use our code TALKPYTHON, all caps, to save 25% off a personal license.

30:50 This link is in your podcast player show notes.

30:52 Thank you to Explosion AI for sponsoring the show.

30:55 I think it really depends on what kind of data scientist you are.

31:02 If what you are is someone doing research, as you described before, you're like, is there a trend between the type of device people use to buy things at our store and how much they're buying, or, you know, how likely they are to come back? Like, if they're using an iPhone, do they tend to spend more than if they're using an Android? And is that a thing that we should consider? Or, you know, that kind of exploration, which you can judge whether or not you should make that exploration, but just put that aside for a minute. That kind of stuff, like once you know that answer, maybe you don't need to run that code again. Maybe you don't care. You just want to kind of discover if there is a trend. And there, maybe you need to know software engineering techniques, but should you be writing unit tests for that?

31:47 I'd say maybe not, honestly. On the other hand, if your job is to create a model that's going to go into production, that's going to run behind a Flask or FastAPI endpoint, then you're kind of in the realm of continuously running for many people over a long time. And I think that really is a different situation.
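The kind of one-off exploration being described might be nothing more than this, no functions, no tests, just an answer. The file and column names here are invented for illustration.

```python
import pandas as pd

# Hypothetical purchase log; the path and columns are made up.
orders = pd.read_csv("orders.csv")

# Does spend differ by device? One throwaway groupby answers it.
print(orders.groupby("device_type")["order_total"].agg(["mean", "median", "count"]))
```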

32:04 I think this is where you actually move from data science to machine learning engineering.

32:08 This term has a lot of different definitions. For me, I base my definition of ML engineers on the two people that I've worked with, who were like true full stack kind of people who could go from research and prototyping to deployment.

32:23 And they were data scientists who really cared enough to actually learn how to do proper engineering, and they could actually deploy their own things.

32:32 But then this leads to another one of my very favorite topics, which is who is responsible for apps in production.

32:39 And here's the thing.

32:41 So I think as good as your data scientist is going to be, or your ML engineer, let's say an ML engineer, let's say that they can actually deploy their own code.

32:49 If they're then responsible for that code in production, that then eats up the time that they can be prototyping and researching new things for you. So the conclusion I've come to over time, and again, this is a matter of debate. This is just my opinion. Basically, I think if your company is above any sort of level of size or complexity in terms of the data products it has, I think you really do need dedicated data science and engineering teams. Because in the end, no matter how good your data scientist code is going to be.

33:20 It needs to be implemented by the person who's going to maintain it.

33:23 And maybe they're not the ones writing the code from scratch.

33:26 Maybe they can adapt the data scientist code if it's good enough.

33:29 But in the end, they need to be comfortable and familiar enough with that code to be like, yeah, if I get pinged at three in the morning, I'm OK knowing what to do with this code. Yeah, that's a good point.

33:40 Yeah. So I think it's just easier to scale these teams in parallel rather than trying to hire this like all in one person who can do everything.

33:47 They're impossible to hire.

33:48 Like I've only ever met two over the course of my career and quickly they become overwhelmed by having to maintain projects.

33:55 - Right, is that the best use of their time?

33:57 - Yeah, and it's not even necessarily whether it's the best use of their time. It's more like, then who's gonna do your research? Because now you've used up that resource on maintaining two or three projects.

34:10 - Right, absolutely.

34:11 Chris May's got an interesting question out here in the audience.

34:14 It kind of turns us around a little bit.

34:15 It says, "Development teams tend to work better when they focus on writing and refactoring code to make it testable and understandable."

34:21 And we've talked a little bit about maybe stuff that data scientists shouldn't care about or whatever.

34:27 So are there ideas that are like good practices for data scientists and teams of them?

34:33 - This is actually a really great question.

34:35 So basically, it's an interesting thing with data scientists that unlike software developers, we often tend to work alone on projects or maybe in very, very small teams, like maybe two or three people.

34:48 And I think it's probably a hangover from the fact that a lot of us are ex-academics. We're just used to having, like, it's not great, but it's...

34:57 A whiteboard, an office in the corner and no one knows what you're doing.

35:01 Exactly. And no one cares. That paper that three people read took me three years. So what I think has been neglected, you know, aside from learning software engineering best practices, is more fundamental things, which is like writing maintainable code. And I don't mean maintainable in the sense of it's a system that needs to be able to run regularly. It's more like, this is a piece of code that I can come back to in six months and understand what I was doing. Because, you know, research projects can be shelved forever, but maybe they need to be revisited and, you know, built upon. So, this was actually a topic I got really interested in when I first moved to industry, like the idea of reproducibility with data science projects.

35:45 It's about the code, but it's also about things like dependency management, where it's notoriously difficult in Python to get reproducible environments later.

35:56 And even the operating system.

35:58 Linux has really dramatically changed over time, so maybe your old dependency, the one you want to keep, won't run on the new operating system. There's a whole spectrum of challenges there.

36:09 Exactly. Exactly. And it's sort of something that can be solved by using Poetry, which is a little bit more robust.

36:17 But even then, you've still got that "it runs on my machine" effect, where your machine will not be the same machine.

36:23 Increasingly, there's actually a move towards doing more sort of cloud-based stuff for data science, which solves a few of these problems.

36:31 And it also solves the additional problem where data scientists often need to do remote development for various reasons.

36:36 Like, you need access to GPUs in order to train models.

36:40 So, you know, obviously, if you have a server, you have a Docker container which has environment specifications, you can power up that exact same environment. And that actually helps with that reproducibility a lot.
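One cheap habit that helps with the "runs on my machine" problem, as a sketch rather than a prescription: record the environment right next to the results, using only the standard library. The package list below is an example; swap in whatever the analysis actually imports.

```python
import sys
import platform
from importlib.metadata import version

# Snapshot the environment so a future reader (probably you) can rebuild it.
print("python:", sys.version.split()[0])
print("os:", platform.platform())
for pkg in ("pandas", "numpy"):     # example packages; list your own
    print(pkg, version(pkg))        # raises if the package isn't installed
```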

36:51 And then another point which I think is really important for data scientists and can be neglected is literate programming.

36:59 So this is an idea from Donald Knuth.

37:01 And it's this idea that you should write your code in such a way that it's actually understandable later. With data science work, it's also that you really need to document a lot of the implicit kind of assumptions that you make or decisions that you make as part of the research project process. And this is one of the reasons, probably a good segue, why Jupyter is so important. Jupyter notebooks are designed to be research documents.

37:26 So this is why you have the markdown cells if you've seen a Jupyter notebook, because it's this idea that you really, really need to document along with the code, the decisions that you made. Like, why did you choose this sample? Why did you decide to create the inputs to your models the way that you did it? You need to document all this stuff. So, yeah, reproducibility is a super interesting topic. And I think it's, yeah, something that really needs to be thought about carefully, even if you're not collaborating with anyone else, because otherwise your piece of research is going to be worthless in three months, 'cause you're not gonna remember what you did.
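In a notebook, that documentation lives in markdown cells, but the spirit of it looks something like this even in plain code: record the why, not just the what. The data, thresholds, and reasons below are all invented for illustration.

```python
import pandas as pd

# Invented toy data standing in for a real session log.
sessions = pd.DataFrame({
    "duration_s": [2, 40, 310, 8],
    "age": [34.0, None, 27.0, 45.0],
})

# Decision: drop sessions shorter than 5 seconds. In this made-up example,
# they stand for bot traffic that would skew the engagement averages.
MIN_SESSION_SECONDS = 5
sessions = sessions[sessions["duration_s"] >= MIN_SESSION_SECONDS]

# Decision: impute missing ages with the median rather than dropping rows,
# so the sample isn't biased toward complete profiles.
sessions["age"] = sessions["age"].fillna(sessions["age"].median())
print(sessions)
```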

37:59 - I think notebooks are quite interesting.

38:01 They go a long way to solving that, when used in the right way.

38:02 But you can also just jam a bunch of non-understandable stuff in there, and it's just, well, now it's not understandable, but it's in a webpage instead of in an editor.

38:11 But I think as in, you know, not just programmers, but tech in general, we're just bad at thinking about the long-term life cycle of information and compute.

38:24 For example, I got a new heat pump to replace the furnace at my house. The manual for it came on a CD, and I'm like, I don't think I have a CD drive. Where did I put that? I would have to go dig through a closet full of electronics. I'm not sure I can read that, right? And CDs seemed so ubiquitous for so long, right? And just simple little mismatches like that just get worse over time. It's going to be tough to keep some of this older research and reproducibility around. Yeah, like it's super interesting that there are packages I used to use, you know, back when I first started in natural language processing.

39:01 Some of them haven't been updated from Python 2.

39:03 So I can't use them anymore.

39:04 Because they were just some, probably like a PhD project, and no one really had the time or energy to maintain it after that person graduated.

39:12 And the person graduated, got a job and doesn't really care that much anymore, potentially.

39:16 Exactly.

39:16 Not enough to keep it going.

39:17 Yeah, it's not even necessarily their fault.

39:19 It's just life.

39:21 Yeah, yeah.

39:21 Let's talk about some of the libraries and tools.

39:24 You mentioned Jupyter.

39:25 I think Jupyter is one of the absolute cornerstones, right?

39:29 So, Jupyter or JupyterLab? What are your thoughts here?

39:33 It's funny, actually, for years I was just working in plain Jupyter on my computer.

39:38 Maybe give people a quick summary of the difference, just for those who don't know.

39:42 Very good idea. So basically, Jupyter is, I suppose you could call it an editor.

39:46 It's basically an interactive document which you run against a Python kernel, or you can run it against different language kernels. There are Julia notebooks, there are Kotlin notebooks. Should I give my little advertisement for JetBrains? Basically, what you can do is run code in cell blocks, and you can also create markdown cells in between them. And this allows you to basically have markdown chunks and then code chunks.

40:10 JupyterLab is hosted remotely. So you have basically a bunch of other functionality built in so you can open terminals, you can create scripts, things like that. But basically, It's like a little Jupyter ecosystem, which is designed to be remotely hosted, and it can be accessed simultaneously by several people.

40:28 So I would say Jupyter is good if you are just starting out and you're dealing with small data sets.

40:37 Maybe you're even retrieving things from databases, but you're not saving anything too heavy locally.

40:42 You're not using a huge amount of memory, like, maybe unless you've got one of those new M2 Macs and a server in your office.

40:48 So go for it.

40:49 Yeah, JupyterLab, I think is good if basically you need to access different types of machines.

40:56 So maybe you need to be able to access GPU machines easily. You kind of want that remote first experience where you don't have to then connect to a remote machine. And I have found JupyterLab helpful in the past for sharing. But the thing you can't do with JupyterLab is real-time collaboration. And that's a bit of a pain in the butt. Obviously, since I started at JetBrains, I kind of, you know, like I'm using our tools and I like them a lot.

41:18 Otherwise I wouldn't advocate for them.

41:20 Yeah, I was going to ask, is this PyCharm, Dataspell?

41:23 When you actually do that, are you using some of those types of tools?

41:26 I am. So I won't turn this into too much of an advertisement for our tools, because it's not really the point of me being here.

41:33 But we've kind of tried, or my teams have tried to solve some of these problems that you might have with just using plain Jupyter notebooks, or even working with JupyterLab, maybe a bit more, like, robustly.

41:46 So we actually have three data science products.

41:50 We have PyCharm and Dataspell, which you've mentioned.

41:52 They're desktop IDEs with the ability to connect to remote machines. They're not really collaborative, but they do give you a really nice experience with using Jupyter, debugging and code completion and all those sorts of things.

42:05 We have another one, which is Datalore.

42:07 And this falls into those managed notebooks that I was talking about, it's cloud hosted.

42:12 And the nice thing about Datalore is actually you can do real-time collaboration. So it sort of helps overcome...

42:17 - Code With Me style, sort of.

42:18 - Yes, it's the same technology, actually. So...

42:21 - Okay.

42:21 - Yeah. So it's kind of a very interesting thing because there will be times where, you know, maybe you're not working on a project with a data scientist, but you need them to have a look at your work. And when I was working with JupyterLab, what we would do is we would clone the notebook to our own folder, and then we were in the same environment, so it was okay. And you would rerun the whole thing again. And sometimes it would be pretty time consuming.

42:46 Datalore is an alternative to that. It may or may not be kind of your style. But it's pretty cool because you can actually just invite someone to the same notebook instance that you're in, and you're basically hosting them. And they have access to everything that you've already run.

43:01 So it's like true kind of real time.

43:04 Yeah, that's nice. Because sometimes a cell has to run for 30 minutes, but then it has this nice little answer and you can work with that afterwards, right?

43:11 - Exactly, or you want a model to be available and maybe you haven't saved it or something.

43:16 Like this is just a way around some of these friction points.

43:19 - I want to circle back just really quickly for a testimony, I guess, out in the audience.

43:24 Michael says, "I started teaching basic Git, Docker and Python packaging to bioinformatics students at UCLA and it's made a huge difference in the handoff." And I think for actual projects, you know, as we were talking about what people should learn in data science and what they shouldn't, a little bit of fluency with some of these tools is really helpful.

43:44 I absolutely agree.

43:45 I know it can be really overwhelming, especially Git initially for students.

43:51 Git is overwhelming at first.

43:53 Yeah. I would say because I tend to work on things by myself.

43:57 Yeah. This falls into the reproducibility and stuff that I was talking about earlier.

44:02 It's super, super important.

44:05 Once you get comfortable with just basic use of these tools, you can get really far.

44:09 Okay. Back to some of the tools, Jupyter, JupyterLab.

44:12 What about JupyterLite?

44:14 Have you played with JupyterLite at all?

44:16 Only a teeny tiny bit because of this workshop that I'm going to be helping out with at EuroPython.

44:21 So they're going to be running the whole thing in JupyterLite, hopefully.

44:25 Couple of bugs to solve, but I think they're overcomable, but yeah, it's a really interesting alternative to Google Colab actually.

44:33 Yeah.

44:33 JupyterLite takes Pyodide, which is CPython running on WebAssembly, and then builds a bunch of the data science libraries, like Matplotlib and stuff, in WebAssembly.

44:44 And then the benefit is you don't need a complex server to handle the compute and run arbitrary Python code, which is a little sketchy.

44:52 You just run it on the front end in WebAssembly, which is pretty cool.

44:55 I interviewed the folks at PySport a little while ago.

44:59 And it's just the ability to just take code and run all these different pieces on your front end without worrying about a server, I think is super cool.

45:08 Whether I got that right or not, anyway, I think running it on top of the browser, like you do with JavaScript, is an interesting thing to throw into the mix for notebooks.

45:20 - Actually, a lot of these projects coming out using Pyodide are really interesting.

45:24 Obviously, PyScript is the big one from last year.

45:26 - Yeah, I think PyScript actually has lots of really interesting possibilities beyond just the data science side, right?

45:33 Whereas Pyodide is a little more focused on just, I think, really providing the data science tools on the client side.

45:40 We'll see where PyScript goes.

45:41 If they can make an equivalent of Vue.js or something like that, where people can start building legitimate front-end interactive web apps, like Airbnb or Google Maps or something, but with Python, that's gonna unlock something that has been locked away for a really long time.

45:58 With Pyodide, you know, that's like a nine or 10 meg download.

46:01 That's too much for the front end, just for, like, a public-facing site, generally, at startup time. But they're moving it to MicroPython as an option.

46:09 And that's a couple hundred K, which is like these other front end frameworks.

46:12 So it's very exciting.

46:14 I think that's going to be, that's definitely the most exciting thing in that area.

46:17 But all right, back to data science.

46:18 Let's see where you want to go next.

46:20 You want to talk pandas maybe?

46:21 - Yeah, let's jump into pandas, which is the other biggie when you're talking about data science.

46:26 So what pandas is really important for is, it's basically the entry point of you working with your data. So it's a library which basically allows you to work with data frames, and data frames are basically tables. And from there, you can do data manipulation, you can explore your data and visualize it. And it also is an entry point to passing your data into models. Sometimes it'll need additional transformations, but with, say, scikit-learn, which we can talk about in a sec, you can basically pass pandas data frames directly into scikit-learn models.
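A minimal sketch of that handoff, with invented toy data: the pandas data frame goes straight into a scikit-learn model, no conversion step.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data frame standing in for real business data.
df = pd.DataFrame({"ad_spend": [10, 20, 30, 40],
                   "sales": [103, 198, 305, 402]})

# scikit-learn accepts the data frame (and column selections) directly.
model = LinearRegression().fit(df[["ad_spend"]], df["sales"])
print(model.predict(pd.DataFrame({"ad_spend": [50]})))
```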

46:59 pandas also, because of its popularity, has kind of opened up this easy access to grid computing and other types of processing and database stuff, where you don't really need to learn those tools, but you get to take advantage of them. And two things come to mind for me. One is Dask.

47:16 Yes.

47:16 It's kind of like pandas code, but instead, you can say, actually run this across this cluster of machines, or handle larger-than-memory stuff on my personal computer, or even just take advantage of all 10 cores on my M2 instead of the one.
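A sketch of what that looks like, with a placeholder file pattern: the Dask API mirrors pandas, but execution is lazy and only fans out across cores or a cluster when you ask for the result.

```python
import dask.dataframe as dd

# Looks like pandas, but nothing is read or computed yet.
df = dd.read_csv("events-2023-*.csv")   # placeholder file pattern
totals = df.groupby("user_id")["amount"].sum()

# .compute() triggers the actual work, parallelized across cores or a cluster.
print(totals.compute())
```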

47:34 Yes.

47:35 Have you done anything with Dask? Are you a fan of it?

47:37 I was kind of there when Dask was new.

47:39 And let's just say we found a lot of the bugs.

47:43 Yeah, yeah.

47:43 So what ended up happening was I ended up learning PySpark instead.

47:46 So I went down a different kind of route.

47:49 But I think, you know, they solve very similar problems.

47:52 It's just Dask is much more similar to pandas.

47:55 And so you don't really need to deal with learning...

47:57 It's similar, but it's a new API.

48:00 Yeah. Another one that I was thinking of, I just had these guys on the show, sort of, is Ponder.

48:05 Oh, I have not heard of this.

48:07 So Ponder, they were at startup row at PyCon.

48:11 And they basically build on top of Modin, which is import modin.pandas as pd.

48:16 And what it does is, instead of pulling all the data back and executing the commands on your machine in memory, where maybe that data transfer is huge, it actually runs it inside of Postgres and other databases, and I think PySpark as well.

48:28 Like it translates all these pandas commands to SQL commands to run inside the database where the data is, which is also a pretty interesting thing.
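Modin's headline trick is being a drop-in import, so existing pandas code runs unchanged; the SQL translation layered on top is Ponder's product and isn't shown here. A sketch, with a placeholder path:

```python
import modin.pandas as pd   # drop-in replacement for `import pandas as pd`

# Same pandas API from here on; Modin decides how to execute it.
df = pd.read_csv("big_file.csv")   # placeholder path
print(df.describe())
```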

48:37 That is amazing.

48:38 So yeah, it's just interpreting the code in a completely different way.

48:41 You can do like query planning and optimize the code.

48:44 Yeah, exactly.

48:45 - I think they said that df.describe is like 300 lines of SQL.

48:50 It's really, really tough.

48:51 But once this thing writes it, then it's good to go.

48:53 And I think the reason I bring this up is like, you don't have to write that code.

48:56 You just have to know Pandas.

48:58 And then all of a sudden, there's these libraries that'll do either grid computing or really complex SQL queries that you don't care about.

49:05 - Yes. - You don't care to write or so on.

49:07 So I think it's, Pandas is interesting on its own, but it's almost like a gateway to the broader data science community.

49:12 - Agreed, agreed.

49:13 And it's such a de facto standard, I think, for data analysis now, or data manipulation and transformation.

49:20 Yeah, like I don't see it going away anytime soon.

49:23 And actually, Pandas 2.0 just came out.

49:27 And instead of being... yeah, so pandas is NumPy under the hood, which is fast, but it's not really equipped to deal with certain kinds of structures, like strings, because, you know, it's not really what NumPy is about.

49:41 And also missing values; the way that it handles them is pretty janky.

49:44 So yeah, it's been rewritten with PyArrow under the hood.

49:48 - Right.

49:48 - Yeah. Apparently, the performance is so much better.

49:52 Something I need to sit down and actually try.

49:54 It's been out for like a month and I'm feeling a bit bad, but yeah.
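For what it's worth, the Arrow backing in pandas 2.0 is opt-in per operation rather than the default; a sketch of trying it, assuming pandas 2.0 or later with pyarrow installed (the path is a placeholder):

```python
import pandas as pd   # assumes pandas >= 2.0 and pyarrow installed

# Ask for Arrow-backed dtypes instead of the NumPy defaults.
df = pd.read_csv("data.csv", dtype_backend="pyarrow")   # placeholder path

# Strings come back as string[pyarrow] rather than generic object,
# with Arrow's native missing-value handling.
print(df.dtypes)
```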

49:57 Yeah, that's cool. It probably has support for some of the serialization formats, to back up what we were saying, like Parquet and some of those types of things.

50:06 I think that comes straight out of PyArrow.

50:08 - Yeah.

50:09 Excellent. So that kind of brings me to a trade-off I wanted to talk to you about before we get off of pandas.

50:14 Although it sounds like pandas 2.0 makes this less important.

50:18 But you know, another sort of competitor that came out is Polars, which is a data frame library for Python written in Rust.

50:26 Many of the things are written in Rust these days when they care about performance.

50:30 It's like a big trend. It's the new C extensions of Python.

50:33 But this one is supposed to also be way faster than pandas 1.

50:37 and I think it's also based on PyArrow amongst other things.

50:41 The details are not super important.

50:43 More what I wanted to ask you is like, well, here's another way.

50:45 This is a totally different API.

50:47 It doesn't try to be compatible, so you got to learn it.
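To make the different-API point concrete, a hedged Polars sketch with invented data; the expression style shown here is version-dependent.

```python
import polars as pl

# Invented toy data; the expression-based API is the point here.
df = pl.DataFrame({"region": ["eu", "eu", "us"], "amount": [10, 25, 40]})

summary = (
    df.filter(pl.col("amount") > 0)
      .group_by("region")             # called `groupby` in older releases
      .agg(pl.col("amount").sum().alias("total"))
)
print(summary)
```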

50:49 The question is, as a data scientist, as a data science team leader, how should you think about it? Do we keep chasing the shiny new thing, or do we stick with stuff that, one, people know, like pandas, but, two, also extends into this broader space as a gateway, as we described? What are your thoughts here?

51:08 This is a super interesting question. So data scientists, in some ways, have the luxury of being able to adopt newer packages faster, because we build these small, kind of atomic projects, and we can just switch to the next library we feel like using in the next project. And maybe we're the only ones who ever look at that code. So it's cool. The problem, though, of course, is if someone else needs to look at your code, they're gonna need to be able to read it, which is maybe not the biggest problem.

51:40 The biggest problem, of course, is with any new library, you have less documentation and you have fewer entries on Stack Overflow.

51:47 So I would say you need to weigh the time you're gonna spend not only learning it, but also debugging it, 'cause that's gonna be slower, and even ChatGPT doesn't know much about Polars.

51:57 Basically, you're going to need to trade that off against whether you're actually gonna see a benefit from it.

52:03 So do you actually have problems with processing your data fast enough?

52:09 If you're working on small data sets, probably not.

52:11 If you're not, then maybe try something like Polars or pandas 2.0.
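
To give a feel for how different that API is, a small sketch using a recent Polars version; the file and column names are hypothetical:

```python
import polars as pl

df = pl.read_csv("sales.csv")

# Polars favors a lazy, expression-based API: describe the query,
# then collect() runs an optimized plan over it.
result = (
    df.lazy()
    .filter(pl.col("amount") > 100)
    .group_by("region")
    .agg(pl.col("amount").sum().alias("total"))
    .collect()
)
print(result)
```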

52:15 - Yeah, that sort of community support side is important.

52:18 And I'm pretty sure there are a lot of data scientists out there who are the one data scientist at their organization.

52:25 And so it's not like, oh, we'll go ask the other expert down the hall, because if it's not you, there's no answer, right?

52:31 - Exactly.

52:32 I do think though, like, it's good to be curious.

52:34 It's good to try out new things as well.

52:37 And again, part of being a data scientist is you can experiment a bit more.

52:41 So--

52:42 - You know, 2017, '18, sort of the peak Python 2 versus 3 tension, I guess, maybe one year before then.

52:52 I noticed that the data scientists were like, I don't know what y'all are arguing about.

52:55 We're done with this.

52:57 What we're arguing about is, when can we take the Python 2 code out to absolutely 100% drop support for it, not when are we moving over?

53:04 Whereas people running that Django site that's been around for eight years, that's still on Python 2, they're starting to get nervous 'cause they don't wanna rewrite it 'cause it works, but they know they're gonna have to.

53:15 And I feel like we talked about legacy code as sort of the dreaded success story of software on the software engineering side.

53:23 Because that's less of a thing in data science, it's easier to go, well, this next project that we're starting in a couple of months, we can start with newer tools.

53:31 - Yeah, and I actually remember the point where I decided, okay, this is the last project I'm doing in Python 2, because the thing that was keeping me on 2 was actually one of those libraries that I mentioned, which was built by a university.

53:45 And I was like, you know what?

53:46 I'm just gonna go find some alternative tool.

53:48 I think at that time it was spaCy, which is a very well-known NLP library, and the company behind it is actually based here in Berlin.

53:56 - Yeah, exactly, basically a neighbor of yours.

53:58 - That's right.

53:59 But I think spaCy was really getting off the ground at that time.

54:03 So I was like, you know what, I'm just gonna switch over to this new library and try that.

54:06 And it's excellent.

54:07 So I didn't look back.

54:09 - Yeah, spaCy's cool.

54:10 Ines Montani is doing really great work, and everyone over at Explosion AI.

54:15 And that's the thing, sometimes it seems like a hassle, right, but if it forces you out of your comfort zone to pick stuff that's being actively developed, maybe it's worth it, right?

54:24 - Exactly.

54:24 - All right, we're getting short on time.

54:26 So you want to give us a lightning round and the other important libraries you think data scientists should pay attention to?

54:31 - Yeah, so let's just quickly go through the visualization side of things.

54:35 So visualization is massive.

54:37 So matplotlib is really the biggie and it's what a lot of libraries are actually built on top of in Python.

54:44 But the syntax is not that friendly.

54:46 So there's a lot of alternatives.

54:49 So Seaborn is a very popular one.

54:51 We actually have an internal one called Lets-Plot, which is a port of ggplot2, and there's another one called plotnine, and I think there may even be one called ggplot.

55:00 Plotly--

55:01 - Some of the fancy new ones that people hear about, they're actually internally just controlling Matplotlib with a cleaner API, right?

55:06 - Pretty much, and let me tell you, Matplotlib needs a clean API, it's a bit, let's say archaic.
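
As one example of that wrapping, here's a minimal Seaborn sketch; it drives Matplotlib underneath, and "tips" is a small demo dataset that Seaborn can fetch for you:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# One high-level seaborn call replaces several lower-level
# matplotlib calls for the same scatter plot.
tips = sns.load_dataset("tips")  # bundled demo dataset
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="day")
plt.show()
```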

55:13 - Although, give it some props for its XKCD graph style.

55:17 I mean-- - Yes, yes.

55:18 - That is pretty cool that you can get it to do that.

55:21 - I've actually done XKCD graphs in Python as well.

55:27 It's a goal that you aim for to do like elite visualizations.

55:31 - It's fun and XKCD is amazing in a lot of ways.

55:35 However, I think it also can serve an important role when you're presenting to, like, leaders of an organization, non-technical people. 'Cause if they look and see a beautiful, pristine, production-ready graph, it's sort of like, we're done.

55:50 No, no, no, this is the prototype. No, we're done. Look, you already got it.

55:53 But if it comes out sort of cartoony, kind of like wireframing for UI design, you're like, oh, there are no expectations that it's done. It's XKCD. We're going to get you the real graphs later, right?

56:03 Yeah, yeah.

56:04 There may be some value there.

56:05 Like a psychological effect where you make it look like a hand-drawn prototype.

56:10 Exactly. It looks just hand-drawn. It's barely done.

56:12 That's right.

56:13 It's really just theme equals.

56:14 It didn't take me two days.

56:15 [Laughter]
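
For the curious, the hand-drawn look really is about one line in Matplotlib; a minimal sketch:

```python
import matplotlib.pyplot as plt
import numpy as np

# plt.xkcd() is a context manager that restyles everything drawn
# inside it to look hand-sketched.
with plt.xkcd():
    x = np.linspace(0, 10, 100)
    plt.plot(x, np.sin(x))
    plt.title("Definitely still a prototype")
    plt.show()
```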

56:17 Scikit-learn, you mentioned that before.

56:19 Yes. So there are a whole bunch of libraries for doing machine learning. Scikit-learn is kind of your all-in-one for classic machine learning. But then, you know, you have this whole other branch of data science, which is around neural nets or deep learning. So you have Keras, TensorFlow, you have PyTorch, and then you have a package for working with a lot of these generative AI models or large language models, called Transformers, from a company called Hugging Face. All of these are actually super accessible. I would say TensorFlow and PyTorch can be tricky, but Keras is like a friendly front end for them. Actually, if anyone is interested in getting into this side of things, there's a book called Deep Learning with Python by an AI researcher at Google called Francois Chollet.

57:08 It is actually, I think, the most popular book ever on Manning.

57:13 So it's an amazing book.

57:16 I can only recommend it.

57:17 And it's very gentle for beginners who have no background in the area.

57:21 - Okay, yeah, cool.

57:22 I'll put that in the show notes.

57:23 - Awesome. - Yeah.

57:25 All right, well, there are many other things we can talk about.

57:28 Maybe just let's close this out with a quick shout out to your PyCon talk.

57:34 Eventually, someday, I'm sure that the talks for PyCon will be on YouTube.

57:41 They were last year, but I looked back and I was so excited near the end of the conference, I'm like, "Look, the talks are up." And I was talking to someone like, "Look, here's your talk." They're like, "No, that's my talk from last year." I'm like, "Oh." - Aw.

57:52 - Yeah, so it was maybe three or four months delayed till it actually came out.

57:56 So maybe this midsummer, the video of your talk will be out, but maybe just give people a quick elevator pitch of your talk here.

58:03 - Yeah, so I decided to give this talk because I kind of had to learn things the hard way in terms of performance with Python.

58:12 So basically I used to do everything with loops and then I had to start working with larger amounts of data and it just doesn't scale.

58:20 So over time, as I got better with Python, I learned more about NumPy, which is another important data science library.

58:27 And it basically allows you to do what's called vectorized operations.

58:30 So in this talk, I basically talk about like the math behind why vectorized operations work.

58:36 You don't need any math background to understand it.

58:38 It's very gentle.

58:39 And then I just show why some of these operations work in NumPy and how you can implement them yourself to get really massive gains in performance.

58:51 - Yeah, that's incredible.

58:52 Move a lot of that stuff down into like a C or a Rust layer and just let it do its magic instead of looping in Python.

58:58 Yeah. - Exactly.
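
A tiny sketch of the kind of before-and-after the talk covers: a Python loop versus the equivalent vectorized NumPy call:

```python
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# Loop version: every element passes through the Python interpreter.
total = 0.0
for x, y in zip(a, b):
    total += x * y

# Vectorized version: the whole dot product runs in compiled code,
# typically orders of magnitude faster.
total_vec = a @ b
```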

59:00 - Yeah, very cool.

59:00 So I don't know when, but eventually this will be out as a video people can check out from me.

59:05 Now they know to go look for it.

59:07 - Yeah, I think the PyCon team is still recovering.

59:10 So much work.

59:11 - I know.

59:12 All right, well, Jody, it's been great to have you on the show.

59:15 Before you get out of here, final two questions.

59:18 If you're gonna write some Python code, what editor are you using these days?

59:21 - So I'm actually using all three that I talked about.

59:24 I use PyCharm if I need to do something like a bit more on the engineering side, which is not that often for me.

59:30 DataSpell, if I'm doing sort of very local development and doing more of the research side, and then if I need some GPUs, I'm using Datalore.

59:39 So a bit boring, but using all of our tools, and I really like them.

59:44 - Yeah, they are good.

59:45 All right, and then notable PyPI package, something you wanna give a shout out to, or if you prefer a conda package, there's a lot of intersection there.

59:54 - I think my favorite package at the moment is Transformers.

59:56 It is amazing.

59:58 And the documentation that Hugging Face have put together is so good.

01:00:01 And just the work they're doing in open data science is so, so important.

01:00:06 So like big props to Hugging Face.

01:00:08 Like we should really support the work that they're doing.
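
For anyone who wants to try it, a minimal sketch of the Transformers pipeline API; the first call downloads a default model, so it needs a network connection:

```python
from transformers import pipeline

# pipeline() hides model download, tokenization, and inference
# behind a single call.
classifier = pipeline("sentiment-analysis")
print(classifier("I had an absolute blast on the show!"))
```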

01:00:10 - Excellent.

01:00:11 All right, well, thanks for being on the show and sharing your experience.

01:00:15 - Thank you so much for having me.

01:00:16 I had an absolute blast.

01:00:17 - Yeah, same.

01:00:18 Bye. - Bye.

01:00:19 - This has been another episode of Talk Python to Me.

01:00:23 Thank you to our sponsors.

01:00:24 Be sure to check out what they're offering.

01:00:26 It really helps support the show.

01:00:28 The folks over at JetBrains encourage you to get work done with PyCharm.

01:00:32 PyCharm Professional understands complex projects across multiple languages and technologies, so you can stay productive while you're writing Python code and other code like HTML or SQL.

01:00:44 Download your free trial at talkpython.fm/done-with-pycharm.

01:00:49 Spend better time with your data and build better ML-based applications.

01:00:53 Use Prodigy from Explosion AI, a radically efficient data annotation tool.

01:00:58 Get it at talkpython.fm/prodigy and use our code TALKPYTHON all caps to save 25% off a personal license.

01:01:05 Want to level up your Python?

01:01:07 We have one of the largest catalogs of Python video courses over at Talk Python.

01:01:11 Our content ranges from true beginners to deeply advanced topics like memory and async.

01:01:16 And best of all, there's not a subscription in sight.

01:01:19 Check it out for yourself at training.talkpython.fm.

01:01:22 Be sure to subscribe to the show, open your favorite podcast app, and search for Python.

01:01:27 We should be right at the top.

01:01:28 You can also find the iTunes feed at /iTunes, the Google Play feed at /play, and the Direct RSS feed at /rss on talkpython.fm.

01:01:38 We're live streaming most of our recordings these days.

01:01:41 If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at talkpython.fm/youtube.

01:01:49 This is your host, Michael Kennedy.

01:01:50 Thanks so much for listening.

01:01:51 I really appreciate it. Now, get out there and write some Python code.

01:01:54 [MUSIC]
