#422: How data scientists use Python Transcript

Recorded on Wednesday, May 31, 2023.

00:00 Regardless of which side of Python you sit on, software developer or data scientist, you surely know that data scientists and software devs seem to have different styles and priorities.

00:10 Why is that? And what are the benefits as well as the pitfalls of this separation?

00:14 That's the topic of this conversation with our guest, Dr. Jodie Burchell, data science developer advocate at JetBrains.

00:22 This is Talk Python to Me, episode 422, recorded May 31st, 2023.

00:27 Welcome to Talk Python to Me, a weekly podcast on Python. This is your host, Michael Kennedy.

00:45 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython, both on fosstodon.org.

00:52 Be careful with impersonating accounts on other instances.

00:55 There are many.

00:56 Keep up with the show and listen to over seven years of past episodes at talkpython.fm.

01:02 We've started streaming most of our episodes live on YouTube.

01:05 Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows and be part of that episode.

01:13 This episode is brought to you by JetBrains, who encourage you to get work done with PyCharm.

01:19 Download your free trial of PyCharm Professional at talkpython.fm/done-with-pycharm.

01:26 And it's brought to you by Prodigy from Explosion AI.

01:30 Spend better time with your data and build better ML-based applications with Prodigy, a radically efficient data annotation tool.

01:37 Get it at talkpython.fm/prodigy and use our code TALKPYTHON, all caps, to save 25% off a personal license.

01:45 Jodie, welcome to Talk Python to Me.

01:48 - Thank you, I am so thrilled to be on the show.

01:50 - I'm so thrilled to have you on the show.

01:52 I've been a fan of your work for a while and we got a chance to get to know each other at this year's PyCon.

01:57 And so here you are on the podcast as well.

02:00 - Thank you, we had some very nice Mexican food actually, or maybe Utah Mexican.

02:06 I don't know how I would interpret it.

02:08 It was very good though.

02:09 - It was very good, yeah.

02:10 The food was excellent.

02:11 I thought the parties were great at the conference, and for people who are maybe still holding out on going: personally, I really enjoyed being there.

02:20 I think it's probably the best conference that I go to yearly.

02:24 And it's like the vibe is so nice.

02:27 On this show, we're going to talk about how data scientists use Python, which is somewhat different than maybe a software developer.

02:35 Which, I guess, I'll put myself solidly into that camp.

02:39 I do a bunch of web development, make APIs, I build apps and ship them.

02:43 That's quite a different story.

02:45 And I think we're going to have a really great time talking about those things.

02:48 But before we do, let's get to know you a bit.

02:51 Let's see, how'd you get into programming, Python, data science?

02:54 Yeah.

02:54 So, probably going hand in hand with maybe not being a developer.

02:59 My story is perhaps a little unconventional.

03:02 So my background is academic, like a lot of data scientists.

03:05 And unsurprisingly, the first language that I learned was R because I was doing psychology and health sciences and a lot of statistics.

03:13 And I was procrastinating once during my PhD.

03:17 You will find any excuse to not work on your thesis.

03:20 And I think I was reading, oh, you know, people who are into statistics, you should really learn Python, it's the future.

03:26 And I was like, I should learn Python.

03:28 So I sat down and-

03:30 - Because I really don't wanna write that next chapter.

03:32 I just don't. - Exactly, exactly.

03:33 So I remember it, like I actually had this long weekend and I worked my way through, I think it was Zed Shaw's "Learn Python the Hard Way." This is showing my age, I think.

03:43 I loved it. Like I completed the course in three days. And then I didn't know what to do with Python because the stats libraries weren't as developed back then. So I just put it aside for a couple of years and ended up picking it up again when I started working in industry, because obviously I've left academia. And you sort of fairly quickly, once you start in data science, move away from more sort of statistical stuff to machine learning.

04:07 And Python really has the libraries for that. So that's my journey. It's a little bit bits and bobs and stops and starts.

04:15 But once I kind of picked up Python, it really was love at first sight.

04:18 Oh, that's that's excellent.

04:19 What's your PhD in?

04:20 It's actually computer science, of course.

04:22 Right.

04:22 Of course.

04:23 Of course.

04:24 You know, it's so funny.

04:25 You are the third person to ask me in two weeks.

04:28 And no one has asked me this question for like two years.

04:30 My PhD was in hurt feelings.

04:32 Hurt feelings?

04:33 Yeah.

04:33 OK.

04:34 I say this a little bit blithely.

04:35 So my PhD being in psychology, I was really interested in emotions research and relationships research.

04:42 So, I kind of wanted to see what happens to people emotionally when close relationships go bad.

04:49 And it's hurt feelings, like things like, you know, infidelities, rejections, all of that.

04:55 It's hurt.

04:55 So, I was just studying what generates and, like, regulates the intensity of hurt, and studied that for four and a half years.

05:04 - Yeah, sounds interesting.

05:05 I'm sure there was a lot of data to process.

05:07 There was a lot of data to process and a lot of very interesting statistics.

05:13 That was sort of how I got into data science.

05:15 I fell in love with stats in undergrad and just kept going down that path.

05:19 - I think a lot of people are drawn to data science not with the intent of waking up one day and saying, "I'm gonna be a data scientist," but they're excited or inspired about something tangential and they're like, "Well, I really need to get something better than Excel to work on this." Right? - Absolutely.

05:36 Yeah, yeah.

05:37 And we'll probably talk about this a little bit later about why data scientists use programming.

05:42 And it kind of is, in some ways, that need to jump from Excel to something way more powerful and reproducible.

05:49 Yeah, yeah, for sure.

05:51 So how about now?

05:52 You said you've left academics.

05:53 And what are you doing these days?

05:56 Yeah, so that leap from academia was a long time ago now, I think that was like seven years ago, again showing my age.

06:03 So for six of those years, I was a data scientist.

06:06 So day to day was, you know, pretty varied, but the job I have now is very different.

06:13 So I currently work as a developer advocate at JetBrains.

06:17 And the way I would describe my job is I'm a liaison between data scientists and JetBrains.

06:23 So I try and advocate for our tools to be as good as they can be.

06:27 And I try to recommend ways that people can use our tools if I think it's useful.

06:31 But I'm definitely not marketing or sales.

06:33 It's more, if I think this is the right fit for you, I'll recommend it.

06:36 So it's like the way I achieve that is really up to me.

06:41 For me, I really like to do a mixture of what I call internal and external activities. So external activities are actually kind of only tangentially related to the products. So this would be an example of an external activity.

06:54 It's just getting out there and educating people about data science or educating data scientists about technical topics, things like conference talks or webinars, you know, all this sort of stuff. And then internal stuff is more focused on things a bit more related to the product. So if I think there's a feature that people would be really interested in, I might make a video about it or create, you know, a blog post. So yeah, it's a real hodgepodge. So this week, for example, I've been working on materials for a free workshop that's being organized at EuroPython. So I'm going to be volunteering to help out with that. It's completely unrelated to anything I'm doing at JetBrains. It's just a volunteer activity.

07:33 But last week I was at a conference, and the week before that as well.

07:36 So as you can see, the job's pretty varied.

07:38 I think developer evangelists, it seems like such a fun job.

07:41 You know, I had your colleague Paul Everitt on and we actually talked.

07:45 It's quite a while ago, a couple of years ago, three, four.

07:48 And we had a whole episode on like a panel on what is the developer advocate, developer relations job.

07:56 But it just seems like such a great mix: you still get to travel a little bit, see people, but you also get to write code and work on influencing technology and products and stuff.

08:08 Yeah.

08:09 And I think the thing that I started to appreciate about the job a few months in is you have a platform with this job, and that means you can choose to promote the message that you want.

08:19 And a message that, no surprise, is very meaningful for me is data science is for everyone.

08:25 Like I hate the gatekeeping that can happen in tech communities.

08:28 I think it's quite bad in terms of like people being intimidated by math in data science.

08:34 And like, I'm here to say to you, if you want it, it's for you.

08:37 It is a very cool field.

08:39 And yeah, doesn't matter what background you come from.

08:41 I absolutely agree.

08:42 And I was kind of hinting at that saying like a lot of people who don't see themselves as developers or programmers, like still find really great places, really great fits in data science and in programming as well.

08:53 And I also want to second that I don't think you really need that much math.

08:57 Maybe if you're trying to build the next machine learning model platform, then yes, okay.

09:04 But that's not what most people do.

09:06 They take the data, they clean it up, they do interesting visualizations and maybe put it into some framework for production, right?

09:12 Yeah.

09:13 And the nice thing is the field is in such a point where you have so many frameworks or tools that will handle a lot of this stuff for you.

09:20 Like, I'm not saying you don't need any understanding of what's going on under the hood, but you can learn it incrementally.

09:26 A lot of it is like with software development, where you develop that smell or that instinct for when something is not right. That will benefit you more than, you know, knowing how backpropagation works from a calculus perspective.

09:39 Like that stuff is maybe a bit too much.

09:41 You don't necessarily need it.

09:42 Yeah.

09:43 Let's get into the main topic and talk a little bit about how does programming in Python differ on the data science side than say me as somebody who builds web apps.

09:52 Yeah. And maybe we can start by doing an orientation to like, what does a data scientist do? Because I think this confuses a lot of people. Yeah.

09:59 Yeah. So basically, the reason you would hire a data scientist is you have a bunch of data and you have an instinct that you can use that data to either improve your internal processes or sell some sort of IP.

10:17 So the reason, you know, we differ from BI analysts is BI analysts are doing analysis, but it's more about business as usual, which is really important.

10:27 You absolutely need BI analysts.

10:29 How are sales going?

10:30 How much have we made versus last year?

10:32 Like those kinds of charts, right?

10:34 Like absolutely fundamental questions.

10:36 So you need to be an analyst before you need a data scientist, but your data scientists are more there to push the envelope in a data driven way.

10:44 So we have two main outputs, I would say.

10:48 You can either create an analysis and do a report, or you can build some sort of model that will go into production.

10:55 So an example might be as a business, I have an instinct that I can get my customers to buy more things based on people like them also buying those things.

11:05 So in that case, your data scientists might be able to build you a recommendation engine. And this will have a business outcome.

11:13 Developers, obviously, on the other hand, have a very different goal.

11:16 Their goal is to create robust software systems.

11:20 So the concerns that they have are things like latency, server load, downtime, things like that.

11:27 And it's very interesting.

11:29 And we'll talk about it a bit more when we talk about code, if we kind of get into that topic. But basically, data scientists are not really interested in creating code for the long term, whereas for software developers, the code they write often becomes the product.

11:43 And you have to think about things like legacy systems, because eventually every Greenfield project becomes a legacy system if it lasts for long enough.

11:51 - Yeah, if you're lucky, right?

11:52 Because the alternative is it never really got used or it didn't add that much value and got discarded, shut down, all right.

11:59 So even though people talk about how much they don't want legacy code, or how they kind of don't necessarily want to work on it 'cause they want to work on something new and shiny, that's kind of the success side of software development, right?

12:12 Yeah, exactly.

12:14 I do think it's super interesting, the different lifecycle of code on the data science side, because you might be just looking to explore a concept or understand an idea better and not necessarily ever intend to put it into production in the traditional software sense, right?

12:32 So I've seen some pretty interesting code written that people would look at and go, oh my goodness, what is this? There's not even a function here.

12:43 How is this possible? You know, it's like copy and paste, reuse almost. And yet it really does go from having no idea to pictures and understanding and then maybe handing that off potentially to be written more robustly or better. Exactly. And it's really interesting. Like, they're kind of two very different processes. And that point actually where engineering and research meet is a very, very interesting one. And I've seen it work in multiple different ways in multiple different workplaces. So, for example, I've worked in places where the data scientists were completely sequestered away from the engineers. And there really wouldn't be that much discussion between the engineers and the data scientists during the research phase, which I do not advise. So what that means is the data scientists will come at the end and hand over this chunk of perhaps very difficult to read code to an engineer and be like, "Hey, so we need to implement this." And the engineer is like, "What is this? Okay, I will schedule that for the next six months." And then I've seen, or I've been, a data scientist embedded within a software development team. And in that case, your project is marching in lockstep with what the engineers are doing. And from the very start, you know, you've been discussing important things, like you need to build a model that has latency constraints. You need to think about this as the data scientist in terms of the model that you run, but also how it's implemented.

14:08 Right. Like how much memory does it use? Because if you run it on your own machine by yourself, then, well, it's kind of limited by your computer, sort of, sometimes. But if you're running a thousand of them concurrently, because people are interacting with the website, all of a sudden that might make a difference.

14:23 Or like one of the most interesting problems I think I ever solved in my career was basically I was working at a job board.

14:30 We're trying to improve the search using natural language processing.

14:34 So we had this idea that we could build a model that found out the probabilistic relations between skills and job titles.

14:42 So if someone typed a skill into the search, we could expand it with job titles and then find all of the jobs that we indexed with that at search time and vice versa with job titles and skills.

14:52 The thing is, we need to find those relations at search time.

14:55 That is a very low latency system.

14:57 And it was super interesting because we had to think about how we could search that vector space really, really quickly.

15:06 Instead of having to calculate the distance between that and every single vector, we had to work out how to do that more efficiently.

15:12 That sort of stuff I really like because it's so applied and it's so...

15:16 It's this really nice intersection between computer science and data science, I think.

15:20 It's super cool.
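To make that concrete, here's a minimal sketch of the idea: query a prebuilt vector index instead of brute-forcing the distance to every single vector. The embeddings, sizes, and the choice of scikit-learn's NearestNeighbors with a ball tree are all illustrative assumptions, not the job board's actual system.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical embeddings: one 64-dimensional vector per job title.
rng = np.random.default_rng(42)
title_vectors = rng.normal(size=(100_000, 64))

# Brute force would compare the query against every vector; a ball tree
# prunes most of the space, which is what makes search-time lookups viable.
index = NearestNeighbors(n_neighbors=5, algorithm="ball_tree").fit(title_vectors)

query = rng.normal(size=(1, 64))    # stand-in for the embedded search term
distances, neighbor_ids = index.kneighbors(query)
print(neighbor_ids)                 # indices of the closest job titles
```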

15:21 One of the things I really like about working with programming broadly is how concrete it is.

15:28 Right? You came from academics. You know, I was in grad school for a while as well.

15:32 And it's, you could debate on and on about a certain idea or concept, and it's like, well, you might be right. Whereas here, you push a button and you get the answer, or it runs. There's a really nice feedback loop of, I built this thing, and look, it's really connecting these people. And then it comes down to, can you do it in real time, and other things like that. But that's a really cool aspect of programming.

16:00 This portion of talk Python to me is brought to you by JetBrains and PyCharm. Are you a data scientist or a web developer looking to take your projects to the next level? Well, I have the perfect tool for you. PyCharm. PyCharm is a powerful integrated development environment that empowers developers and data scientists like us to write clean and efficient code with ease.

16:21 Whether you're analyzing complex datasets or building dynamic web applications, PyCharm has got you covered.

16:28 With its intuitive interface and robust features, you can boost your productivity and bring your ideas to life faster than ever before.

16:35 For data scientists, PyCharm offers seamless integration with popular libraries like NumPy, Pandas, and Matplotlib.

16:41 You can explore, visualize, and manipulate data effortlessly, unlocking valuable insights with just a few lines of code.

16:48 And for us web developers, PyCharm provides a rich set of tools to streamline your workflow.

16:53 From intelligent code completion to advanced debugging capabilities, PyCharm helps you write clean, scalable code that powers stunning web applications.

17:02 Plus, PyCharm's support for popular frameworks like Django, FastAPI, and React makes it a breeze to build and deploy your web projects.

17:10 It's time to say goodbye to tedious configuration and hello to rapid development.

17:15 But wait, there's more.

17:17 With PyCharm, you get even more advanced features like remote development, database integration, and version control, ensuring your projects stay organized and secure.

17:25 So whether you're diving into data science or shaping the future of the web, PyCharm is your go-to tool.

17:30 Join me and try PyCharm today.

17:32 Just visit talkpython.fm/done-with-pycharm, links in your show notes, and experience the power of PyCharm first hand for three months free. PyCharm, it's how I get work done.

17:46 I think it's kind of a shame that a lot of places do set up their engineering and data science teams so separately. Sure, we have quite different roles and we have quite different backgrounds sometimes. But I really think that having the two teams at least planning things together, you can really actually learn a lot from each other about how to approach problems.

18:10 When you were describing, you know, either having those groups really separated or working really closely together, maybe an analogous relationship that people could relate with is maybe front-end developers and people building the APIs in the back-end, right?

18:25 Like the people doing React or Angular or Vue or whatever it is, you know, in the web design.

18:32 Having those completely separated as well is also, you know, it's terrible, not a good idea.

18:36 - It doesn't make any sense.

18:38 And like, I can totally understand it from the point of view of like team composition, because it is, I think, better to have all your data scientists together because they can learn from each other.

18:47 But then I think having, I don't want to use the squad term 'cause I know it's become a little bit unpopular to use it, but you know, this idea of project-oriented teams, I think, is quite important.

18:56 - Let's dive a little bit more into the research side of things that I want to ask you about.

19:01 Why Python?

19:02 Let's talk about how the research process works and maybe why that results in different priorities and styles of code and styles of engineering.

19:10 It starts at a similar point to all software projects, which is business comes to you and they have some sort of goal. Sometimes it's very vague and you need to interpret that and turn it into an executable project. But where the sort of uncertainty starts, and like where it sort of becomes a research project rather than a project, and I don't know if I described that very well, but it becomes research versus something you're building concretely, is that you don't even know at the start of a research project whether it's even possible to answer the question that you're being asked or build the internal product that you're being asked for.

19:45 You might not understand the domain entirely, right? You're trying to gain understanding even.

19:50 In the very worst cases, you won't even know if you have the data, because maybe your company has so much data and it's so poorly organized, again, something I've seen, that you don't even know if the data exists to answer this question. So, first is going and getting your data. And you spend quite a lot of time with the data because the data will be the one that tells you the story. It'll tell you whether what you even want is possible. And you've probably heard data scientists hammering on about, you know, garbage in, garbage out. Like, you can build the most beautiful, sophisticated model you want. But if you have crap data where there's no signal, you're not going to get anything because it's just not there. Like, the relationship you're looking for is not there.

20:30 - Yeah, the side of that I've heard is 80% of the work is actually the data cleanup, data wrangling, data gathering before you just magically hit it with a plot or something, right?

20:41 - Absolutely, and it's interesting because that data cleaning, data wrangling step also doesn't happen in one go, especially if you're building a model.

20:48 So what will happen is you'll try something out and you'll be like, okay, it didn't quite work.

20:52 Maybe I need to manipulate the data in a different way, or I need to create this new variable, and then you'll go again. And it's this super iterative process where you have this tightly coupled relationship between both the models and the data. So it really is sort of, you know, how I was talking about the instinct, this is sort of where that comes in, because you're going to spend like 80% of your time there, honing your skills. But it's the most, I think, valuable part of the process.

21:17 And if the signal's there, you can usually get away with using really dumb and simple models, you know, things that are unfashionable now, like decision trees or linear regression. You can get away with them because you've just got such good data; you just go with a simpler model. It's got all the advantages. This is sort of, I think, what makes it different: you're sort of moving towards a goal, but you don't know what that goal is.

21:42 Estimation is always hard, right? What I've found works best is really just time boxing each step, seeing if you are up to where you thought you'd be up to by a certain point. And if not, you need to just keep having those discussions with the business stakeholders, because otherwise they're going to not be very happy if you've spent six months just looking at something and you have nothing. What have you built? Well, I have some notebooks I could show you. I have 40,000 notebooks and they're all terrible. Yeah. Speaking of the data, Diego in the audience has an interesting question. How big are the data sets businesses will bring to you, typically?

22:18 Enough that you don't need to go out and find more data?

22:21 This is a good question. So I hate to say "it depends," you know. I get to say that, though, because, you know, I was a lead data scientist, so I earned that rank. It really does depend on the problem you need to solve. So typically, business will have enough data to cover at least some of the use cases. So to give you a really concrete example, this job board that I was talking about that I worked at, we actually had like a bunch of different job boards across Europe. So we had some that were a lot bigger, like Germany or the UK, and we had some that were really small, like Poland or Spain. And we wanted to build these multi-language models or models maybe for different languages.

23:02 We played around with both. And I don't think we really had enough data to support the models in these smaller languages.

23:09 So the models were just not as good quality because we didn't have enough data.

23:12 But for the bigger languages we did.

23:14 And then it sort of becomes a case of, OK, well, we have more data for these particular websites because they're the ones that are bringing the most revenue. So then it sort of became like, well, OK, maybe it's good enough that we improve the search on the most important ones.

23:27 And for the smaller ones, we just wait until we accumulate more data.

23:30 So, yeah, most of the time I found that there's a way to make it work for at least part of the solution.

23:37 And then sometimes, like in the case of my last job, we had something like 170 billion auctions per day.

23:44 Sorry.

23:44 So we had so much data.

23:48 We even had problems like processing it.

23:50 So sometimes, you know, that's the other side of the story: when you've got too much, then how do you throw it away?

23:56 Right.

23:56 I mean, you've got this auction story.

23:59 Like another one that comes to mind is the large Hadron collider.

24:02 Oh yes.

24:03 Yes.

24:04 They've got layers and layers of, like, chips on hardware, and then chips or machines right next to the collectors, and then on out, where it's all about, how do we throw away,

24:13 you know, terabytes of data, down to get it to megabytes per second, right?

24:18 Yeah, and it's interesting, because what you can end up with in those situations is, even then, you can have underrepresented groups. So for example, we were working with advertisers and apps, you know, basically trading ads, and we ended up with some apps that were just so small that you were like, "Even with all this data, I really don't have enough to represent this particular combination in this country." - Interesting. Very interesting.

24:45 Why Python? You started out in R, and of course, any distraction from writing a PhD is a good distraction.

24:53 But I do think there's been a really interesting graph.

24:58 If people go and look at, what is it, Stack Overflow Insights.

25:02 If you go look at Stack Overflow Insights, they had a really great graph that shows you the popularity of Python over time.

25:12 And there's just this huge inflection point around 2012.

25:15 And I feel like that's when a lot of the data science libraries really came around and took off.

25:21 It seems like there was a big inflection at one point, but, you know, why?

25:25 To be honest, I can talk about why I like Python from my background.

25:31 I couldn't really tell you exactly what caused that takeoff, but, you know, apart from, you know, this idea that the libraries were maturing enough. But the thing is, looking even at current surveys, around 60% of data scientists do not have a software development or a software engineering background. So, for people like us, we don't really understand, like, it sounds terrible, we don't really understand basic constructs in how a programming language works. And that can actually mean that going to some sort of compiled language even can be quite a steep learning curve.

26:03 Sure. Pointers to pointers, for example. Like, "Oh, no thanks." Yeah. Or having to deal with the fact that in Java, everything is a class. You're just like, "What is this?" But of course, you understand why if you have that background. But if you're trying to learn it yourself, you then have a lot of background you need to cover. But in Python, and in R as well, you don't need to cover those things.

26:26 It's super easy to prototype, it's super easy to script.

26:30 The flexibility of Python is what makes it, I think, the perfect prototyping language.

26:34 And that's essentially what you're doing, you're prototyping.

26:37 So we talked a little bit earlier about like, why not just Excel?

26:41 We didn't quite say that, but this was sort of what we were maybe getting at.

26:45 And yeah, we could do some of our work in Excel.

26:48 I've tried this.

26:50 And first, I can tell you Excel really starts to struggle when you have too many calculations going on under the hood; it gets very, very slow. But to be honest, it's just cleaner to code this sort of logic.

27:03 It's much more reproducible when you need to do this iterative sort of stuff.

27:06 And it also means that you can use much more powerful tools.

27:11 So you can, say, use APIs that developers have made to process your data. You can use powerful...

27:17 - Or get data, right? Like, I need live currency conversion data. That's so much easier than in Excel.

27:23 Yes, yes, yes. Exactly.

27:25 Like, you can scrape data, or you can, yeah, pull data in from an API, or you can use powerful tools like Spark to process 170 billion auctions per day in order to reduce it down to something manageable.

27:40 So, yeah, it just gives you a lot more power.
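As a hedged sketch of that Excel-versus-code point: a few lines of Python can pull live data from an API straight into a data frame. The endpoint and the shape of the JSON here are placeholders, not a real service.

```python
import pandas as pd
import requests

# Placeholder endpoint; substitute whatever API you actually need.
response = requests.get("https://api.example.com/rates?base=USD", timeout=10)
response.raise_for_status()

# Turn the (assumed) JSON payload into a data frame like any other data.
rates = pd.DataFrame(list(response.json()["rates"].items()),
                     columns=["currency", "rate"])
print(rates.sort_values("rate").head())
```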

27:43 But at the same time, why we use programming languages is it's just such a different focus.

27:52 It's a bit overkill to use something like Java.

27:52 I know some people do do natural language processing in Java, but that's more on the engineering side to build maintainable systems.

27:58 - One of the things I like to say, when thinking about people who are coming from tangential interests like biology or whatever, is you can be really effective with Python, and I suspect R as well, with a really partial understanding of what Python is and what it does.

28:14 You pointed out you don't even have to know what a class is or even really how to create a function.

28:19 You just, I can put these six magical incantations in a file and then I can do way more than I otherwise could, right?

28:26 Then you learn one more, you make it better and better as you kind of gain experience.

28:30 Pretty much, and this is where I started, like, obviously I learned what functions and classes were when I first started programming.

28:36 But in the end, you will just, maybe it's not the best thing and we can sort of maybe get into this.

28:43 I suppose part of the confusion or not confusion, but internal debate I've had over the years is how good does data science code really need to be? Like, how much would data scientists benefit from knowing more about computer science topics or software engineering topics, maybe more to the point? And, you know, because like the thing is, every field has so much to learn.

29:03 Don't even get started on what's happening with large language models at the moment. Like, it's just overwhelming. Should we take some of our precious time and learn software engineering concepts? I'm not sure. Like, I'm not sure if I have the answer to that.

29:14 This portion of Talk Python to Me is brought to you by Prodigy, a data annotation tool from Explosion AI. Prodigy is created by Ines Montani and her team at Explosion, and she's been doing machine learning and NLP for a long time. Ines is a friend and frequent guest on the show, and if you've listened to any of her episodes, you know that she knows her ML tools.

29:37 So what is Prodigy? It's a modern, scriptable annotation tool for machine learning developers made by the team behind the popular NLP library, spaCy. You can easily and visually annotate and develop data for named entity recognition, text classification, span categorization, computer vision, audio, video, and more, and put your model in the loop for even faster results.

30:01 After collecting data, you can quickly train and export a custom spaCy model or download annotations to use it with any other library or framework.

30:09 Prodigy is entirely scriptable, in Python of course, the language we all love, and it seamlessly integrates with your favorite libraries and tools.

30:16 Plus, the new alpha version they just released also introduces built-in support for large language models such as OpenAI's GPT models, and new tools for dividing up your data between multiple annotators. Talk Python listeners get a massive discount on a lifetime license.

30:33 You'll get 25% off using the discount code Talk Python.

30:37 But don't wait too long. This offer does expire.

30:40 Get Prodigy at talkpython.fm/Prodigy and use our code TALKPYTHON, all caps, to save 25% off a personal license.

30:50 This link is in your podcast player show notes.

30:52 Thank you to Explosion AI for sponsoring the show.

30:55 I think it really depends on what kind of data scientist you are.

31:02 If what you are is someone doing research, as you described before, you're like, is there a trend between the type of device people use to buy things at our store and how much they're buying, or, you know, how likely they are to come back? Like, if they're using an iPhone, do they tend to spend more than if they're using an Android? And is that a thing that we should consider? Or, you know, that kind of exploration, which you can judge whether or not you should make that exploration, but just put that aside for a minute. That kind of stuff, like once you know that answer, maybe you don't need to run that code again. Maybe you don't care. You just want to kind of discover if there is a trend. And there, maybe you need to know software engineering techniques, but should you be writing unit tests for that?

31:47 I'd say maybe not, honestly. On the other hand, if your job is to create a model that's going to go into production, that's going to run behind a Flask or FastAPI endpoint, then you're kind of in the realm of continuously running for many people over a long time. And I think that really is a different situation.
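The kind of one-off exploration being described might be nothing more than this, no functions, no tests, just an answer. The file and column names here are invented for illustration.

```python
import pandas as pd

# Hypothetical purchase log; the path and columns are made up.
orders = pd.read_csv("orders.csv")

# Does spend differ by device? One throwaway groupby answers it.
print(orders.groupby("device_type")["order_total"].agg(["mean", "median", "count"]))
```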

32:04 I think this is where you actually move from data science to machine learning engineering.

32:08 This term has a lot of different definitions. For me, I base my definition of ML engineers on the two people that I've worked with, who were like true full stack kind of people who could go from research and prototyping to deployment.

32:23 And they were data scientists who really cared enough to actually learn how to do proper engineering, and they could actually deploy their own things.

32:32 But then this leads to another one of my very favorite topics, which is who is responsible for apps in production.

32:39 And here's the thing.

32:41 So I think as good as your data scientist is going to be, or your ML engineer, let's say an ML engineer, let's say that they can actually deploy their own code.

32:49 If they're then responsible for that code in production, that then eats up the time that they can be prototyping and researching new things for you. So the conclusion I've come to over time, and again, this is a matter of debate. This is just my opinion. Basically, I think if your company is above any sort of level of size or complexity in terms of the data products it has, I think you really do need dedicated data science and engineering teams. Because in the end, no matter how good your data scientist code is going to be.

33:20 It needs to be implemented by the person who's going to maintain it.

33:23 And maybe they're not the ones writing the code from scratch.

33:26 Maybe they can adapt the data scientist code if it's good enough.

33:29 But in the end, they need to be comfortable and familiar enough with that code to be like, yeah, if I get pinged at three in the morning, I'm OK knowing what to do with this code. Yeah, that's a good point.

33:40 Yeah. So I think it's just easier to scale these teams in parallel rather than trying to hire this like all in one person who can do everything.

33:47 They're impossible to hire.

33:48 Like I've only ever met two over the course of my career and quickly they become overwhelmed by having to maintain projects.

33:55 - Right, is that the best use of their time?

33:57 - Yeah, and it's not even necessarily whether it's the best use of their time. It's more like, then who's gonna do your research? Because now you've used up that resource on maintaining two or three projects.

34:10 - Right, absolutely.

34:11 Chris May's got an interesting question out here in the audience.

34:14 It kind of turns us around a little bit.

34:15 It says, "Development teams tend to work better when they focus on writing and refactoring code to make it testable and understandable."

34:21 And we've talked a little bit about maybe stuff that data scientists shouldn't care about or whatever.

34:27 So are there ideas that are like good practices for data scientists and teams of them?

34:33 - This is actually a really great question.

34:35 So basically, it's an interesting thing with data scientists that unlike software developers, we often tend to work alone on projects or maybe in very, very small teams, like maybe two or three people.

34:48 And I think it's probably a hangover from the fact that a lot of us are ex-academics. We're just used to having, like, it's not great, but it's...

34:57 A whiteboard, an office in the corner and no one knows what you're doing.

35:01 Exactly. And no one cares. That paper that three people read took me three years. So what I think has been neglected, you know, aside from learning software engineering best practices, is more fundamental things, which is like writing maintainable code. And I don't mean maintainable in the sense of it's a system that needs to be able to run regularly. It's more like, this is a piece of code that I can come back to in six months and understand what I was doing. Because, you know, research projects can be shelved forever, but maybe they need to be revisited and, you know, built upon. So, this was actually a topic I got really interested in when I first moved to industry, like the idea of reproducibility with data science projects.

35:45 It's about the code, but it's also about things like dependency management, where it's notoriously difficult in Python to get reproducible environments later.

35:56 And even the operating system.

35:58 Linux has really dramatically changed over time, so maybe your old dependency, the one you want to keep, won't run on the new operating system. There's a whole spectrum of challenges there.

36:09 Exactly. Exactly. And it's sort of something that can be solved by using Poetry, which is a little bit more robust.

36:17 But even then, you've still got that "it runs on my machine" effect, where your machine will not be the same machine.

36:23 Increasingly, there's actually a move towards doing more sort of cloud-based stuff for data science, which solves a few of these problems.

36:31 And it also solves the additional problem where data scientists often need to do remote development for various reasons.

36:36 Like, you need access to GPUs in order to train models.

36:40 So, you know, obviously, if you have a server, you have a Docker container which has environment specifications, you can power up that exact same environment. And that actually helps with that reproducibility a lot.
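One cheap habit that helps with the "runs on my machine" problem, as a sketch rather than a prescription: record the environment right next to the results, using only the standard library. The package list below is an example; swap in whatever the analysis actually imports.

```python
import sys
import platform
from importlib.metadata import version

# Snapshot the environment so a future reader (probably you) can rebuild it.
print("python:", sys.version.split()[0])
print("os:", platform.platform())
for pkg in ("pandas", "numpy"):     # example packages; list your own
    print(pkg, version(pkg))        # raises if the package isn't installed
```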

36:51 And then another point which I think is really important for data scientists and can be neglected is literate programming.

36:59 So this is an idea from Donald Knuth.

37:01 And it's this idea that you should write your code in such a way that it's actually understandable later. With data science work, it's also that you really need to document a lot of the implicit kind of assumptions that you make or decisions that you make as part of the research project process. And this is one of the reasons, probably a good segue, why Jupyter is so important. Jupyter notebooks are designed to be research documents.

37:26 So this is why you have the markdown cells if you've seen a Jupyter notebook, because it's this idea that you really, really need to document along with the code, the decisions that you made. Like, why did you choose this sample? Why did you decide to create the inputs to your models the way that you did it? You need to document all this stuff. So, yeah, reproducibility is a super interesting topic. And I think it's, yeah, something that really needs to be thought about carefully, even if you're not collaborating with anyone else, because otherwise your piece of research is going to be worthless in three months, 'cause you're not gonna remember what you did.
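In a notebook, that documentation lives in markdown cells, but the spirit of it looks something like this even in plain code: record the why, not just the what. The data, thresholds, and reasons below are all invented for illustration.

```python
import pandas as pd

# Invented toy data standing in for a real session log.
sessions = pd.DataFrame({
    "duration_s": [2, 40, 310, 8],
    "age": [34.0, None, 27.0, 45.0],
})

# Decision: drop sessions shorter than 5 seconds. In this made-up example,
# they stand for bot traffic that would skew the engagement averages.
MIN_SESSION_SECONDS = 5
sessions = sessions[sessions["duration_s"] >= MIN_SESSION_SECONDS]

# Decision: impute missing ages with the median rather than dropping rows,
# so the sample isn't biased toward complete profiles.
sessions["age"] = sessions["age"].fillna(sessions["age"].median())
print(sessions)
```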

37:59 - I think notebooks are quite interesting.

38:01 They go a long way to solving that, when used in the right way.

38:02 But you can also just jam a bunch of non-understandable stuff in there, and it's just, well, now it's not understandable, but it's in a webpage instead of in an editor.

38:11 But I think as in, you know, not just programmers, but tech in general, we're just bad at thinking about the long-term life cycle of information and compute.

38:24 For example, I got a new heat pump to replace the furnace at my house. The manual for it came on a CD, and I'm like, I don't think I have a CD drive. Where did I put that? I would have to go dig through a closet full of electronics. I'm not sure I can read that, right? And CDs seemed so ubiquitous for so long, right? And just simple little mismatches like that just get worse over time. It's going to be tough to keep some of this older research and reproducibility around. Yeah, like it's super interesting that there are packages I used to use, you know, back when I first started in natural language processing.

39:01 Some of them haven't been updated from Python 2.

39:03 So I can't use them anymore.

39:04 Because they were just some, probably like a PhD project, and no one really had the time or energy to maintain it after that person graduated.

39:12 And the person graduated, got a job and doesn't really care that much anymore, potentially.

39:16 Exactly.

39:16 Not enough to keep it going.

39:17 Yeah, it's not even necessarily their fault.

39:19 It's just life.

39:21 Yeah, yeah.

39:21 Let's talk about some of the libraries and tools.

39:24 You mentioned Jupyter.

39:25 I think Jupyter is one of the absolute cornerstones, right?

39:29 So, Jupyter or JupyterLab? What are your thoughts here?

39:33 It's funny, actually, for years I was just working in plain Jupyter on my computer.

39:38 Maybe give people a quick summary of the difference, just for those who don't know.

39:42 Very good idea. So basically, Jupyter is, I suppose you could call it an editor.

39:46 It's basically an interactive document which you run against a Python kernel, or you can run it against different language kernels. There are Julia notebooks, there are Kotlin notebooks. Should I give my little advertisement for JetBrains? Basically, what you can do is run code in cell blocks, and you can also create markdown cells in between them. And this allows you to basically have markdown chunks and then code chunks.

40:10 JupyterLab is hosted remotely. So you have basically a bunch of other functionality built in so you can open terminals, you can create scripts, things like that. But basically, It's like a little Jupyter ecosystem, which is designed to be remotely hosted, and it can be accessed simultaneously by several people.

40:28 So I would say Jupyter is good if you are just starting out and you're dealing with small data sets.

40:37 Maybe you're even retrieving things from databases, but you're not saving anything too heavy locally.

40:42 You're not using a huge amount of memory, like, maybe unless you've got one of those new M2 Macs and a server in your office.

40:48 So go for it.

40:49 Yeah, JupyterLab, I think is good if basically you need to access different types of machines.

40:56 So maybe you need to be able to access GPU machines easily. You kind of want that remote first experience where you don't have to then connect to a remote machine. And I have found JupyterLab helpful in the past for sharing. But the thing you can't do with JupyterLab is real-time collaboration. And that's a bit of a pain in the butt. Obviously, since I started at JetBrains, I kind of, you know, like I'm using our tools and I like them a lot.

41:18 Otherwise I wouldn't advocate for them.

41:20 Yeah, I was going to ask, is this PyCharm, Dataspell?

41:23 When you actually do that, are you using some of those types of tools?

41:26 I am. So I won't turn this into too much of an advertisement for our tools, because it's not really the point of me being here.

41:33 But we've kind of tried, or my teams have tried to solve some of these problems that you might have with just using plain Jupyter notebooks, or even working with JupyterLab, maybe a bit more, like, robustly.

41:46 So we actually have three data science products.

41:50 We have PyCharm and Dataspell, which you've mentioned.

41:52 They're desktop IDEs with the ability to connect to remote machines. They're not really collaborative, but they do give you a really nice experience with using Jupyter, debugging and code completion and all those sorts of things.

42:05 We have another one, which is Datalore.

42:07 And this falls into those managed notebooks that I was talking about, it's cloud hosted.

42:12 And the nice thing about Datalore is actually you can do real-time collaboration. So it sort of helps overcome...

42:17 - Code With Me style, sort of.

42:18 - Yes, it's the same technology, actually. So...

42:21 - Okay.

42:21 - Yeah. So it's kind of a very interesting thing because there will be times where, you know, maybe you're not working on a project with a data scientist, but you need them to have a look at your work. And when I was working with JupyterLab, what we would do is we would clone the notebook to our own folder, and then we were in the same environment, so it was okay. And you would rerun the whole thing again. And sometimes it would be pretty time consuming.

42:46 Datalore is an alternative to that. It may or may not be kind of your style. But it's pretty cool because you can actually just invite someone to the same notebook instance that you're in, and you're basically hosting them. And they have access to everything that you've already run.

43:01 So it's like true kind of real time.

43:04 Yeah, that's nice. Because sometimes a cell has to run for 30 minutes, but then it has this nice little answer and you can work with that afterwards, right?

43:11 - Exactly, or you want a model to be available and maybe you haven't saved it or something.

43:16 Like this is just a way around some of these friction points.

43:19 - I want to circle back just really quickly for a testimony, I guess, out in the audience.

43:24 Michael says, "I started teaching basic Git, Docker and Python packaging to bioinformatics students at UCLA and it's made a huge difference in the handoff." And I think for actual projects, you know, as we were talking about what people should learn in data science and what they shouldn't, a little bit of fluency with some of these tools is really helpful.

43:44 I absolutely agree.

43:45 I know it can be really overwhelming, especially Git initially for students.

43:51 Git is overwhelming at first.

43:53 Yeah. I would say because I tend to work on things by myself.

43:57 Yeah. This falls into the reproducibility and stuff that I was talking about earlier.

44:02 It's super, super important.

44:05 Once you get comfortable with just basic use of these tools, you can get really far.

44:09 Okay. Back to some of the tools, Jupyter, JupyterLab.

44:12 What about JupyterLite?

44:14 Have you played with JupyterLite at all?

44:16 Only a teeny tiny bit because of this workshop that I'm going to be helping out with at EuroPython.

44:21 So they're going to be running the whole thing in JupyterLite, hopefully.

44:25 Couple of bugs to solve, but I think they're overcomable, but yeah, it's a really interesting alternative to Google Colab actually.

44:33 Yeah.

44:33 JupyterLite takes Pyodide, which is CPython running on WebAssembly, and then builds a bunch of the data science libraries, like Matplotlib and stuff, in WebAssembly.

44:44 And then the benefit is you don't need a complex server to handle the compute and run arbitrary Python code, which is a little sketchy.

44:52 You just run it on the front end in WebAssembly, which is pretty cool.

44:55 I interviewed the folks at PySport a little while ago.

44:59 And it's just the ability to just take code and run all these different pieces on your front end without worrying about a server, I think is super cool.

45:08 Whether I got that right or not, anyway, I think running it on top of the browser, like you do with JavaScript, is an interesting thing to throw into the mix for notebooks.

45:20 - Actually, a lot of these projects coming out using Pyodide are really interesting.

45:24 Obviously, PyScript is the big one from last year.

45:26 - Yeah, I think PyScript actually has lots of really interesting possibilities beyond just the data science side, right?

45:33 Whereas Pyodide is a little more focused on just, I think, really providing the data science tools on the client side.

45:40 We'll see where PyScript goes.

45:41 If they can make an equivalent of Vue.js or something like that, where people can start building legitimate front-end interactive web apps, like Airbnb or Google Maps or something, but with Python, that's gonna unlock something that has been locked away for a really long time.

45:58 With Pyodide, you know, that's like a nine or 10 meg download.

46:01 That's too much for the front end, just for, like, a public-facing site, generally, at startup time. But they're moving it to MicroPython as an option.

46:09 And that's a couple hundred K, which is like these other front end frameworks.

46:12 So it's very exciting.

46:14 I think that's going to be, that's definitely the most exciting thing in that area.

46:17 But all right, back to data science.

46:18 Let's see where you want to go next.

46:20 You want to talk pandas maybe?

46:21 - Yeah, let's jump into pandas, which is the other biggie when you're talking about data science.

46:26 So what pandas is really important for is, it's basically the entry point of you working with your data. So it's a library which basically allows you to work with data frames, and data frames are basically tables. And from there, you can do data manipulation, you can explore your data and visualize it. And it also is an entry point to passing your data into models. Sometimes it'll need additional transformations, but with, say, scikit-learn, which we can talk about in a sec, you can basically pass pandas data frames directly into scikit-learn models.
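A minimal sketch of that handoff, with invented toy data: the pandas data frame goes straight into a scikit-learn model, no conversion step.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data frame standing in for real business data.
df = pd.DataFrame({"ad_spend": [10, 20, 30, 40],
                   "sales": [103, 198, 305, 402]})

# scikit-learn accepts the data frame (and column selections) directly.
model = LinearRegression().fit(df[["ad_spend"]], df["sales"])
print(model.predict(pd.DataFrame({"ad_spend": [50]})))
```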

46:59 pandas also, because of its popularity, has kind of opened up this easy access to grid computing and other types of processing and database stuff, where you don't really need to learn those tools, but you get to take advantage of them. And two things come to mind for me. One is Dask.

47:16 Yes.

47:16 It's kind of like pandas code, but instead, you can say, actually run this across this cluster of machines, or handle larger-than-memory stuff on my personal computer, or even just take advantage of all 10 cores on my M2 instead of the one.
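A sketch of what that looks like, with a placeholder file pattern: the Dask API mirrors pandas, but execution is lazy and only fans out across cores or a cluster when you ask for the result.

```python
import dask.dataframe as dd

# Looks like pandas, but nothing is read or computed yet.
df = dd.read_csv("events-2023-*.csv")   # placeholder file pattern
totals = df.groupby("user_id")["amount"].sum()

# .compute() triggers the actual work, parallelized across cores or a cluster.
print(totals.compute())
```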

47:34 Yes.

47:35 Have you done anything with Dask? Are you a fan of it?

47:37 I was kind of there when Dask was new.

47:39 And let's just say we found a lot of the bugs.

47:43 Yeah, yeah.

47:43 So what ended up happening was I ended up learning PySpark instead.

47:46 So I went down a different kind of route.

47:49 But I think, you know, they solve very similar problems.

47:52 It's just Dask is much more similar to pandas.

47:55 And so you don't really need to deal with learning...

47:57 It's similar, but it's a new API.

48:00 Yeah. Another one that I was thinking of, I just had these guys on the show, sort of, is Ponder.

48:05 Oh, I have not heard of this.

48:07 So Ponder, they were at startup row at PyCon.

48:11 And they basically build on top of Modin, which is import modin.pandas as pd.

48:16 And what it does is, instead of pulling all the data back and executing the commands on your machine in memory, where maybe that data transfer is huge, it actually runs it inside of Postgres and other databases, and I think PySpark as well.

48:28 Like it translates all these pandas commands to SQL commands to run inside the database where the data is, which is also a pretty interesting thing.
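Modin's headline trick is being a drop-in import, so existing pandas code runs unchanged; the SQL translation layered on top is Ponder's product and isn't shown here. A sketch, with a placeholder path:

```python
import modin.pandas as pd   # drop-in replacement for `import pandas as pd`

# Same pandas API from here on; Modin decides how to execute it.
df = pd.read_csv("big_file.csv")   # placeholder path
print(df.describe())
```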

48:37 That is amazing.

48:38 So yeah, it's just interpreting the code in a completely different way.

48:41 You can do like query planning and optimize the code.

48:44 Yeah, exactly.

48:45 - I think they said that df.describe is like 300 lines of SQL.

48:50 It's really, really tough.

48:51 But once this thing writes it, then it's good to go.

48:53 And I think the reason I bring this up is like, you don't have to write that code.

48:56 You just have to know Pandas.

48:58 And then all of a sudden, there's these libraries that'll do either grid computing or really complex SQL queries that you don't care about.

49:05 - Yes. - You don't care to write or so on.

49:07 So I think it's, Pandas is interesting on its own, but it's almost like a gateway to the broader data science community.

49:12 - Agreed, agreed.

49:13 And it's such a de facto standard, I think, for data analysis now, or data manipulation and transformation.

49:20 Yeah, like I don't see it going away anytime soon.

49:23 And actually, Pandas 2.0 just came out.

49:27 And instead of being... yeah, so pandas is NumPy under the hood, which is fast, but it's not really equipped to deal with certain kinds of structures, like strings, because, you know, it's not really what NumPy is about.

49:41 And also missing values; the way that it handles them is pretty janky.

49:44 So yeah, it's been rewritten with PyArrow under the hood.

49:48 - Right.

49:48 - Yeah. Apparently, the performance is so much better.

49:52 Something I need to sit down and actually try.

49:54 It's been out for like a month and I'm feeling a bit bad, but yeah.
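For what it's worth, the Arrow backing in pandas 2.0 is opt-in per operation rather than the default; a sketch of trying it, assuming pandas 2.0 or later with pyarrow installed (the path is a placeholder):

```python
import pandas as pd   # assumes pandas >= 2.0 and pyarrow installed

# Ask for Arrow-backed dtypes instead of the NumPy defaults.
df = pd.read_csv("data.csv", dtype_backend="pyarrow")   # placeholder path

# Strings come back as string[pyarrow] rather than generic object,
# with Arrow's native missing-value handling.
print(df.dtypes)
```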

49:57 Yeah, that's cool. It probably has support for some of the serialization formats, to back up what we were saying, like Parquet and some of those types of things.

50:06 I think that comes straight out of PyArrow.

50:08 - Yeah.

50:09 Excellent. So that kind of brings me to a trade-off I wanted to talk to you about before we get off of pandas.

50:14 Although it sounds like pandas 2.0 makes this less important.

50:18 But you know, another sort of competitor that came out is Polars, which is a data frame library for Python written in Rust.

50:26 Many of the things are written in Rust these days when they care about performance.

50:30 It's like a big trend. It's the new C extensions of Python.

50:33 But this one is supposed to also be way faster than pandas 1.

50:37 and I think it's also based on PyArrow amongst other things.

50:41 The details are not super important.

50:43 More what I wanted to ask you is like, well, here's another way.

50:45 This is a totally different API.

50:47 It doesn't try to be compatible, so you got to learn it.
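To make the different-API point concrete, a hedged Polars sketch with invented data; the expression style shown here is version-dependent.

```python
import polars as pl

# Invented toy data; the expression-based API is the point here.
df = pl.DataFrame({"region": ["eu", "eu", "us"], "amount": [10, 25, 40]})

summary = (
    df.filter(pl.col("amount") > 0)
      .group_by("region")             # called `groupby` in older releases
      .agg(pl.col("amount").sum().alias("total"))
)
print(summary)
```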

50:49 The question is, as a data scientist, as a data science team leader, how should you think about it? Do we keep chasing the shiny new thing, or do we stick with stuff that, one, people know, like pandas, but, two, also extends into this broader space as a gateway, as we described? What are your thoughts here?

51:08 This is a super interesting question. So data scientists, in some ways, have the luxury of being able to adopt newer packages faster, because we build these small, kind of atomic projects, and we can just switch to the next library we feel like using in the next project. And maybe we're the only ones who ever look at that code. So it's cool. The problem, though, of course, is if someone else needs to look at your code, they're gonna need to be able to read it, which is maybe not the biggest problem.

51:40 The biggest problem, of course, is with any new library, you have less documentation and you have fewer entries on Stack Overflow.

51:47 So I would say you need to weigh the time you're gonna spend not only learning it, but also debugging it, 'cause that's gonna be slower, and even ChatGPT doesn't know much about Polars.

51:57 Basically, you're going to need to trade that off against whether you're actually gonna see a benefit from it.

52:03 So do you actually have problems with processing your data fast enough?

52:09 If you're working on small data sets, probably not.

52:11 If you're not, then maybe try something like Polars or pandas 2.0.
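
To give a feel for how different that API is, a small sketch using a recent Polars version; the file and column names are hypothetical:

```python
import polars as pl

df = pl.read_csv("sales.csv")

# Polars favors a lazy, expression-based API: describe the query,
# then collect() runs an optimized plan over it.
result = (
    df.lazy()
    .filter(pl.col("amount") > 100)
    .group_by("region")
    .agg(pl.col("amount").sum().alias("total"))
    .collect()
)
print(result)
```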

52:15 - Yeah, that sort of community support side is important.

52:18 And I'm pretty sure there are a lot of data scientists out there who are the one data scientist at their organization.

52:25 And so it's not like, oh, we'll go ask the other expert down the hall, because if it's not you, there's no answer, right?

52:31 - Exactly.

52:32 I do think though, like, it's good to be curious.

52:34 It's good to try out new things as well.

52:37 And again, part of being a data scientist is you can experiment a bit more.

52:41 So--

52:42 - You know, 2017, '18, sort of the peak Python 2 versus 3 tension, I guess, maybe one year before then.

52:52 I noticed that the data scientists were like, I don't know what y'all are arguing about.

52:55 We're done with this.

52:57 What we're arguing about is, when can we take the Python 2 code out to absolutely 100% drop support for it, not when are we moving over?

53:04 Whereas people running that Django site that's been around for eight years, that's still on Python 2, they're starting to get nervous 'cause they don't wanna rewrite it 'cause it works, but they know they're gonna have to.

53:15 And I feel like we talked about legacy code as sort of the dreaded success story of software on the software engineering side.

53:23 Because that's less of a thing in data science, it's easier to go, well, this next project that we're starting in a couple of months, we can start with newer tools.

53:31 - Yeah, and I actually remember the point where I decided, okay, this is the last project I'm doing in Python 2, because the thing that was keeping me on 2 was actually one of those libraries that I mentioned, which was built by a university.

53:45 And I was like, you know what?

53:46 I'm just gonna go find some alternative tool.

53:48 I think at that time it was spaCy, which is a very well-known NLP library, and the company behind it is actually based here in Berlin.

53:56 - Yeah, exactly, basically a neighbor of yours.

53:58 - That's right.

53:59 But I think spaCy was really getting off the ground at that time.

54:03 So I was like, you know what, I'm just gonna switch over to this new library and try that.

54:06 And it's excellent.

54:07 So I didn't look back.

54:09 - Yeah, spaCy's cool.

54:10 Ines Montani is doing really great work, and everyone over at Explosion AI.

54:15 And that's the thing, sometimes it seems like a hassle, right, but if it forces you out of your comfort zone to pick stuff that's being actively developed, maybe it's worth it, right?

54:24 - Exactly.

54:24 - All right, we're getting short on time.

54:26 So you want to give us a lightning round and the other important libraries you think data scientists should pay attention to?

54:31 - Yeah, so let's just quickly go through the visualization side of things.

54:35 So visualization is massive.

54:37 So matplotlib is really the biggie and it's what a lot of libraries are actually built on top of in Python.

54:44 But the syntax is not that friendly.

54:46 So there's a lot of alternatives.

54:49 So Seaborn is a very popular one.

54:51 We actually have an internal one called Lets-Plot, which is a port of ggplot2, and there's another one called plotnine, and I think there may even be one called ggplot.

55:00 Plotly--

55:01 - Some of the fancy new ones that people hear about, they're actually internally just controlling Matplotlib with a cleaner API, right?

55:06 - Pretty much, and let me tell you, Matplotlib needs a clean API, it's a bit, let's say archaic.
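
As one example of that wrapping, here's a minimal Seaborn sketch; it drives Matplotlib underneath, and "tips" is a small demo dataset that Seaborn can fetch for you:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# One high-level seaborn call replaces several lower-level
# matplotlib calls for the same scatter plot.
tips = sns.load_dataset("tips")  # bundled demo dataset
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="day")
plt.show()
```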

55:13 - Although, give it some props for its XKCD graph style.

55:17 I mean-- - Yes, yes.

55:18 - That is pretty cool that you can get it to do that.

55:21 - I've actually done XKCD graphs in Python as well.

55:27 It's a goal that you aim for to do like elite visualizations.

55:31 - It's fun and XKCD is amazing in a lot of ways.

55:35 However, I think it also can serve an important role when you're presenting to, like, leaders of an organization, non-technical people. 'Cause if they look and see a beautiful, pristine, production-ready graph, it's sort of like, we're done.

55:50 No, no, no, this is the prototype. No, we're done. Look, you already got it.

55:53 But if it comes out sort of cartoony, kind of like wireframing for UI design, you're like, oh, there are no expectations that it's done. It's XKCD. We're going to get you the real graphs later, right?

56:03 Yeah, yeah.

56:04 There may be some value there.

56:05 Like a psychological effect where you make it look like a hand-drawn prototype.

56:10 Exactly. It looks just hand-drawn. It's barely done.

56:12 That's right.

56:13 It's really just theme equals.

56:14 It didn't take me two days.

56:15 [Laughter]
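
For the curious, the hand-drawn look really is about one line in Matplotlib; a minimal sketch:

```python
import matplotlib.pyplot as plt
import numpy as np

# plt.xkcd() is a context manager that restyles everything drawn
# inside it to look hand-sketched.
with plt.xkcd():
    x = np.linspace(0, 10, 100)
    plt.plot(x, np.sin(x))
    plt.title("Definitely still a prototype")
    plt.show()
```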

56:17 Scikit-learn, you mentioned that before.

56:19 Yes. So there are a whole bunch of libraries for doing machine learning. Scikit-learn is kind of your all-in-one for classic machine learning. But then, you know, you have this whole other branch of data science, which is around neural nets or deep learning. So you have Keras, TensorFlow, you have PyTorch, and then you have a package for working with a lot of these generative AI models or large language models, called Transformers, from a company called Hugging Face. All of these are actually super accessible. I would say TensorFlow and PyTorch can be tricky, but Keras is like a friendly front end for them. Actually, if anyone is interested in getting into this side of things, there's a book called Deep Learning with Python by an AI researcher at Google called Francois Chollet.

57:08 It is actually, I think, the most popular book ever on Manning.

57:13 So it's an amazing book.

57:16 I can only recommend it.

57:17 And it's very gentle for beginners who have no background in the area.

57:21 - Okay, yeah, cool.

57:22 I'll put that in the show notes.

57:23 - Awesome. - Yeah.

57:25 All right, well, there are many other things we can talk about.

57:28 Maybe just let's close this out with a quick shout out to your PyCon talk.

57:34 Eventually, someday, I'm sure that the talks for PyCon will be on YouTube.

57:41 They were last year, but I looked back and I was so excited near the end of the conference, I'm like, "Look, the talks are up." And I was talking to someone like, "Look, here's your talk." They're like, "No, that's my talk from last year." I'm like, "Oh." - Aw.

57:52 - Yeah, so it was maybe three or four months delayed till it actually came out.

57:56 So maybe this midsummer, the video of your talk will be out, but maybe just give people a quick elevator pitch of your talk here.

58:03 - Yeah, so I decided to give this talk because I kind of had to learn things the hard way in terms of performance with Python.

58:12 So basically I used to do everything with loops and then I had to start working with larger amounts of data and it just doesn't scale.

58:20 So over time, as I got better with Python, I learned more about NumPy, which is another important data science library.

58:27 And it basically allows you to do what's called vectorized operations.

58:30 So in this talk, I basically talk about like the math behind why vectorized operations work.

58:36 You don't need any math background to understand it.

58:38 It's very gentle.

58:39 And then I just show why some of these operations work in NumPy and how you can implement them yourself to get really massive gains in performance.

58:51 - Yeah, that's incredible.

58:52 Move a lot of that stuff down into like a C or a Rust layer and just let it do its magic instead of looping in Python.

58:58 Yeah. - Exactly.
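
A tiny sketch of the kind of before-and-after the talk covers: a Python loop versus the equivalent vectorized NumPy call:

```python
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# Loop version: every element passes through the Python interpreter.
total = 0.0
for x, y in zip(a, b):
    total += x * y

# Vectorized version: the whole dot product runs in compiled code,
# typically orders of magnitude faster.
total_vec = a @ b
```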

59:00 - Yeah, very cool.

59:00 So I don't know when, but eventually this will be out as a video people can check out from me.

59:05 Now they know to go look for it.

59:07 - Yeah, I think the PyCon team is still recovering.

59:10 So much work.

59:11 - I know.

59:12 All right, well, Jody, it's been great to have you on the show.

59:15 Before you get out of here, final two questions.

59:18 If you're gonna write some Python code, what editor are you using these days?

59:21 - So I'm actually using all three that I talked about.

59:24 I use PyCharm if I need to do something like a bit more on the engineering side, which is not that often for me.

59:30 DataSpell, if I'm doing sort of very local development and doing more of the research side, and then if I need some GPUs, I'm using Datalore.

59:39 So a bit boring, but using all of our tools, and I really like them.

59:44 - Yeah, they are good.

59:45 All right, and then notable PyPI package, something you wanna give a shout out to, or if you prefer a conda package, there's a lot of intersection there.

59:54 - I think my favorite package at the moment is Transformers.

59:56 It is amazing.

59:58 And the documentation that Hugging Face have put together is so good.

01:00:01 And just the work they're doing in open data science is so, so important.

01:00:06 So like big props to Hugging Face.

01:00:08 Like we should really support the work that they're doing.
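
For anyone who wants to try it, a minimal sketch of the Transformers pipeline API; the first call downloads a default model, so it needs a network connection:

```python
from transformers import pipeline

# pipeline() hides model download, tokenization, and inference
# behind a single call.
classifier = pipeline("sentiment-analysis")
print(classifier("I had an absolute blast on the show!"))
```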

01:00:10 - Excellent.

01:00:11 All right, well, thanks for being on the show and sharing your experience.

01:00:15 - Thank you so much for having me.

01:00:16 I had an absolute blast.

01:00:17 - Yeah, same.

01:00:18 Bye. - Bye.

01:00:19 - This has been another episode of Talk Python to Me.

01:00:23 Thank you to our sponsors.

01:00:24 Be sure to check out what they're offering.

01:00:26 It really helps support the show.

01:00:28 The folks over at JetBrains encourage you to get work done with PyCharm.

01:00:32 PyCharm Professional understands complex projects across multiple languages and technologies, so you can stay productive while you're writing Python code and other code like HTML or SQL.

01:00:44 Download your free trial at talkpython.fm/done-with-pycharm.

01:00:49 Spend better time with your data and build better ML-based applications.

01:00:53 Use Prodigy from Explosion AI, a radically efficient data annotation tool.

01:00:58 Get it at talkpython.fm/prodigy and use our code TALKPYTHON all caps to save 25% off a personal license.

01:01:05 Want to level up your Python?

01:01:07 We have one of the largest catalogs of Python video courses over at Talk Python.

01:01:11 Our content ranges from true beginners to deeply advanced topics like memory and async.

01:01:16 And best of all, there's not a subscription in sight.

01:01:19 Check it out for yourself at training.talkpython.fm.

01:01:22 Be sure to subscribe to the show, open your favorite podcast app, and search for Python.

01:01:27 We should be right at the top.

01:01:28 You can also find the iTunes feed at /iTunes, the Google Play feed at /play, and the Direct RSS feed at /rss on talkpython.fm.

01:01:38 We're live streaming most of our recordings these days.

01:01:41 If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at talkpython.fm/youtube.

01:01:49 This is your host, Michael Kennedy.

01:01:50 Thanks so much for listening.

01:01:51 I really appreciate it. Now, get out there and write some Python code.

01:01:54 [MUSIC]
