Data Science Panel at PyCon 2024

Episode #467, published Thu, Jun 20, 2024, recorded Sat, May 18, 2024

Episode Deep Dive Links Transcript

I have a special episode for you this time around. We're coming to you live from PyCon 2024. I had the chance to sit down with some amazing people from the data science side of things: Jodie Burchell, Maria Jose Molina-Contreras, and Jessica Greene. We cover a whole set of recent topics from a data science perspective. Though we did have to cut the conversation a bit short as they were coming from and go to talks they were all giving but it was still a pretty deep conversation.

Play on YouTube

Watch the live stream version

Episode Deep Dive

Guests

Jodie Burchell Jodie is a data science developer advocate at JetBrains. She has a strong background in natural language processing (NLP) and statistics. Before becoming a developer advocate, Jodie spent many years in data science roles, focusing on both core statistical modeling with tabular data and cutting-edge NLP problems.
Maria Jose Molina-Contreras Maria is originally from Barcelona and now lives in Berlin, working as a data scientist at a startup tackling sustainability challenges. She’s been involved in data science for around eight years and is passionate about using machine learning tools to address real-world problems, such as sustainable packaging.
Jessica Greene Jessica is an ML engineer at Koja, which she describes as “the search engine for a better planet.” She moved into tech from a career in coffee roasting and has been in the field for six years. Jessica is largely self-taught, with a strong interest in backend engineering before moving into machine learning.

PyCon Community and Experience
- All three guests are excited to be at PyCon 2024. They note the welcoming, community-focused atmosphere at Python conferences and encourage people of all experience levels (especially beginners) to attend. They highlight that there are many programs, grants, and financial aid options that lower the barriers to entry for Python events.
Transition into Data Science
- Jessica’s career-change journey (from coffee roasting to ML engineering) illustrates how accessible data science can be today—there are many resources and communities to help people become self-taught developers.
- Jodie and Maria also mention career transitions, emphasizing that non-traditional backgrounds or prior academic research can lead naturally into data science.
Core Data Science vs. LLM “Hype”
- Despite the excitement around large language models, a major portion of day-to-day data science work still focuses on more “traditional” approaches such as decision trees, linear regression, and solid data cleaning practices.
- The panel points out that many business problems are still well-served by simpler and cheaper-to-deploy models.
Measurement, Validation, and Bias
- Jodie’s upcoming (now recorded) PyCon talk, “Lies, Damn Lies, and Large Language Models,” delves into measuring hallucinations and misinformation in LLMs. She mentions metrics like Truthful QA for factual correctness, highlighting that these evaluations often target specific definitions or language domains.
- Maria brings up the importance of carefully testing and evaluating LLMs for bias, toxicity, and potential security issues. She mentions the open-source library Giskard for model evaluation and fairness checks.
Useful Tools for LLMs and Model Evaluation
- LangChain A popular Python framework that helps build chat-like applications and more general LLM-based pipelines, including retrieval-augmented generation.
- Giskard A tool for inspecting, testing, and evaluating machine learning models (including LLMs) with a focus on bias detection and interpretability.
- CodeCarbon Jessica highlights CodeCarbon for measuring the carbon emissions of Python workloads. It helps teams assess the energy footprint of training or inferencing on large models—important for cost, sustainability, and performance considerations. See episode 318: Measuring your ML impact with CodeCarbon for a deep dive into CodeCarbon.
Responsible Use of LLMs and Smaller Models
- The guests stress that LLMs are expensive to run and come with privacy, security, and ethical concerns (e.g., hallucinations, data leaks, biases).
- They foresee a growing move toward more domain-specific or smaller models to reduce cost and complexity while still maintaining solid, explainable results.
Advice for Beginners in Data Science
- All speakers agree that understanding core data cleaning, measurement, and statistical methods is crucial—no code generator or AI assistant can replace deep problem-solving and domain knowledge.
- They encourage companies to hire more junior data scientists and invest in mentorship, stating that the “science” part of data science will remain essential regardless of how advanced AI tooling becomes.

Overall Takeaway

Data science remains firmly grounded in traditional modeling, solid experimentation, and careful measurement. While there is undeniable excitement around large language models and new AI-powered tools, many day-to-day business problems still benefit most from simpler solutions with a clear return on investment. Tools like LangChain, Giskard, and CodeCarbon address everything from LLM pipelines and bias checks to sustainability concerns, underscoring a key theme: combining a solid foundation in core data science with thoughtful exploration of new technologies.

Links from the show

Jodie Burchell: @t_redactyl
Jessica Greene: linkedin.com
Maria Jose Molina-Contreras: linkedin.com

Talk Python's free Shiny course: talkpython.fm/shiny
Watch this episode on YouTube: youtube.com
Episode #467 deep-dive: talkpython.fm/467
Episode transcripts: talkpython.fm

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy
Episode #467 deep-dive: talkpython.fm/467

Episode Transcript

Collapse transcript

WebVTT format On GitHub

00:00 I have a special episode for you this time around.

00:02 We're coming to you live from PyCon 2024.

00:05 I had the chance to sit down with some amazing people from the data science side of things.

00:10 Jody Burchell, Maria Jose Molina Contreras, and Jessica Green.

00:15 We cover a whole set of recent topics from a data science perspective.

00:19 Though we did have to cut the conversation a bit short as they were coming from and going to talks

00:24 they were all giving, but it's still a pretty deep conversation.

00:27 I know you'll enjoy it.

00:29 This is Talk Python To Me, episode 467 recorded on location in Pittsburgh

00:34 on May 18th, 2024.

00:36 Are you ready for your host?

00:38 Here he is.

00:39 You're listening to Michael Kennedy on Talk Python To Me.

00:42 Live from Portland, Oregon, and this segment was made with Python.

00:46 Welcome to Talk Python To Me, a weekly podcast on Python.

00:53 This is your host, Michael Kennedy.

00:54 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython,

00:59 both on fosstodon.org.

01:02 Keep up with the show and listen to over seven years of past episodes at talkpython.fm.

01:07 We've started streaming most of our episodes live on YouTube.

01:11 Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows and be part of that episode.

01:19 This episode is brought to you by Sentry.

01:21 Don't let those errors go unnoticed.

01:22 Use Sentry like we do here at Talk Python.

01:24 Sign up at talkpython.fm/sentry.

01:28 And it's brought to you by Code Comments, an original podcast from Red Hat.

01:32 This podcast covers stories from technologists who've been through tough tech transitions

01:37 and share how their teams survived the journey.

01:40 Episodes are available everywhere you listen to your podcasts and at talkpython.fm/code dash comments.

01:47 Hello from PyCon.

01:48 Hello, Jessica, Jody, Maria.

01:51 Welcome to Talk Python To Me.

01:52 It's awesome to have you all here.

01:54 And I'm looking forward to talking about data science, some fun LLM questions, maybe,

01:59 some controversial questions, some data science tools, all sorts of good things.

02:03 Of course, before we get to that, you know, Jody, you've been on the show a time or two.

02:07 And people may know you, but maybe not.

02:10 So how about a quick introduction, what you all are into?

02:12 Maria, you want to start?

02:13 Oh, okay.

02:14 Well, my name is Maria.

02:16 I am originally from Barcelona, but I am based in Berlin.

02:20 I work as a data scientist in a small startup that we're trying to solve some sustainability problems.

02:29 And yeah, that is me.

02:31 Excellent.

02:31 Yeah, so my name is Jody, and I am a data science developer advocate.

02:35 I've been working in data science for about eight years.

02:38 And yeah, I'm currently working at JetBrains, as you can see from the shirt.

02:40 And in the background.

02:42 And the background.

02:43 And so I say my interest at the moment is natural language processing, because I worked in that a big chunk of my career.

02:50 But core statistics will always be my love.

02:53 So tabular data, I'm there for you always.

02:55 Beautiful.

02:56 Yeah, my name is Jessica.

02:58 So I'm an ML engineer at Koja, which is the search engine for a better planet.

03:03 I am actually a career changer.

03:06 So I used to roast coffee for a living.

03:07 And I really just got into this field in the last six years.

03:11 So I don't have like any formal training.

03:14 I'm a community slash self-taught engineer.

03:16 And I went through more of a like a backend focused path.

03:20 And now I've started to work in the ML realm.

03:22 So really exciting.

03:24 Yeah, very, very interesting.

03:25 Another thing I absolutely love is coffee.

03:27 Yeah.

03:28 Oh my gosh, coffee is so good.

03:31 I think we're running on it at PyCon.

03:32 Pretty much we are.

03:33 Yeah.

03:34 We're getting farther into the show and more coffee is needed.

03:38 But I do want to ask you, you know, what do you think about being in the data science space?

03:43 That's a really different world that interacting with people all day and working with your hands more or whatever.

03:49 like how has it been with this switch?

03:52 There is a lot of synergies actually when you're stood behind the espresso machine

03:55 and you're getting all the orders in and then you need to like problem solve

03:59 to like how you get everyone their correct order to the way that they like it.

04:03 So there was a lot of transferable skills, I will say.

04:06 But I think what I found really powerful and especially maybe learning at this specific period of time

04:13 is how accessible a lot of the tools are today.

04:16 So like how, I won't say easy because I put a lot of hard work into it, but like how possible it is

04:23 even with a background like mine to get into the field.

04:26 Awesome.

04:26 I switched, I didn't have a formal education either.

04:29 I took two computer college courses just because they match, you know, I needed for something else.

04:35 And yeah, I thought it, I think you can completely succeed here teaching yourself.

04:41 There's so many resources.

04:42 Honestly, the problem is what resources do you choose to learn these days, right?

04:46 You can spend all your time.

04:47 Well, I'm doing another tutorial.

04:48 I'm doing another class.

04:49 Like some point you got to start doing something, right?

04:51 Yeah.

04:52 And I think actually it felt like that probably when we all started.

04:56 Yeah.

04:56 So data science was just getting hot when I started.

04:59 And oh my God, back when I started, this is how long ago it was.

05:02 There were actually like those articles, like R versus Python.

05:05 Like there's no conversations anyone's having anymore, but they have similar conversations.

05:08 And I think it makes it super difficult for beginners because the field felt inaccessible,

05:13 I think, eight years ago.

05:15 The field feels very hostile to beginners right now, I think, because of the AI hype.

05:20 I don't actually think the field has changed that much in fundamentals.

05:24 It's just NLP has become a bigger thing in computer vision recently, but we can get into that later.

05:30 Yeah, I completely agree with you.

05:32 To be honest, for me, data science is super broad world, full of a lot of things that are kind of popping up,

05:42 doing different evolution during time.

05:44 And it's so interesting to see the evolution in the last eight years.

05:50 I started eight years ago in data science.

05:53 And I remember when, how I was doing things eight years ago and how I'm doing things now.

06:00 And I love it.

06:02 I love to see this progression.

06:03 And I am pretty sure that in eight more years, we're going to be in something completely different

06:10 and super exciting.

06:12 Yeah, I totally agree with that.

06:13 I do.

06:14 And I also think data science is interesting because coming into it, you can be a data scientist,

06:20 but because some other reason, right?

06:22 I could be a data scientist because I'm interested in biology or sustainability or something.

06:27 Whereas if you're a web developer or you build APIs or you optimize, you know, whatever,

06:33 you're more focused on, I care about the thing, the code itself, rather than I'm trying to,

06:38 I care about that.

06:38 And this is a tool to address that.

06:40 Yeah, yeah.

06:41 Yeah, actually, I was going to say, I met a bioinformatician yesterday.

06:44 Like, that's also a data scientist, like someone who works in genetic data.

06:48 Yeah, absolutely.

06:49 I had a comment from, I did a show recently from about how Python's used in neurology labs, right?

06:56 And somebody wrote me, this is my favorite episode.

06:58 It speaks to me.

06:58 I'm also a neurologist.

06:59 You know, like, it's really cool.

07:00 All right, we're looking out, kind of the backside a little bit, but we're looking out of the expo hall here at PyCon.

07:07 So I don't know how about you all feel, but for me, this is like my geek holiday.

07:12 I get to come here, and it's really special to me because I get to see my friends

07:17 who I've collaborated with projects on and I admire and I've worked with,

07:21 but I might never see them outside of this week.

07:24 You know, maybe they live in Australia or Europe or some oddly just down the street,

07:30 and yet still, I don't see them except here.

07:32 So maybe, what are your thoughts on PyCon here?

07:36 It's my first time attending, so I'm super stoked.

07:39 I have to say, like, it's slightly overwhelming because there's so many things going on

07:44 and like you mentioned, the opportunity to meet so many folks that I either already knew in some capacity

07:49 but had never met or didn't meet before but have heard of their work.

07:52 So yeah, it's been a real honor to be here, right?

07:55 I mean, we are all based in Berlin, so we do actually know each other, but it's also a pleasure just to come away

08:01 on a geek holiday with friends.

08:03 Yeah, and we were actually all just at PyCon DE just before this, like a month ago.

08:08 Yeah, one month or so.

08:09 Yeah, it's a different scale, let's put it that way, but I think it's a similar feel.

08:13 Like, one thing that I value so much about the Python community is that it's community,

08:18 and I'm very lucky to have gotten involved in a program called Hatchery,

08:23 which you two have also been involved in, the Hatchery we're running is Humble Data.

08:28 And what I like is this program got accepted at a Python conference, which is designed for people who have never coded

08:35 and who are career changers, because I'm also a career changer from academia.

08:39 And this is what makes, I think, Python special, the community, and I think the PyCons

08:44 are an absolute representation of that.

08:46 Yeah, absolutely.

08:47 For me, it's the same feeling.

08:50 I love to go to different conferences of Python, because we have a lot of things in common,

08:58 but also we have differences, and the different conferences bring a different point of value.

09:05 And I think it's awesome.

09:07 And came here and met friends.

09:10 This is my third time in here, and I'm super, super excited and happy.

09:14 And I'm super eager to next year.

09:16 And also the Python in Espanol.

09:19 Yeah, yeah, yeah, of course.

09:20 And also we have even, and here we have a track that is PyCon charlas to be even more welcoming to different people

09:28 from different communities.

09:29 And it's just amazing.

09:31 It's super nice, to be honest.

09:33 Awesome.

09:33 Yeah, I definitely want to encourage people out there listening who feel like,

09:37 oh, I'm not high enough of a level of Python.

09:40 I know.

09:41 To come.

09:41 I'm not ready for PyCon.

09:43 I believe last year, I haven't heard any numbers this year, I believe last year,

09:46 50% of the attendees were first time attendees.

09:49 And I think that's generally true.

09:51 A lot of times people are, it's their first time coming.

09:54 And yeah, it's, I think you can get a lot out of it even if you're not super advanced.

09:58 Maybe even more so than if you are super advanced.

10:01 I definitely have had the opportunity, like the honor, I would actually say,

10:04 to like listen into conversations around topics that I find interesting,

10:09 but aren't part of my day-to-day work.

10:11 And it's just like general vibe that whether it's at lunch or during the breaks

10:15 or after a talk, you get to partake in these conversations, which ultimately will advance you.

10:21 So if you also want to get sponsoring, right?

10:24 Like a lot of people need their work to sponsor them.

10:26 I think there's a lot of reasoning behind asking for PyCon as a conference

10:30 because there's so much value.

10:32 Jessica, that's a great point.

10:33 And I think also, I was talking to someone earlier about how much more affordable this is

10:38 than a lot of tech conferences.

10:39 A lot of them are like, how many thousand dollars is just the ticket?

10:43 And this is not that cheap, but it's relatively cheap compared.

10:47 And also, oh, sorry.

10:49 I was going to say, you could do a plug for EuroPython while you're here.

10:52 We have also the option to have grants.

10:55 There is a different programs, PyLadies grants or the conference organizes grants.

11:01 Also, this is something that could help people to try to apply or come here.

11:08 Yeah, they mentioned that at the opening keynote or the introductions before the keynote.

11:13 It's some significant number of grants that were given.

11:17 I can't remember the number, but it's like half a million dollars or something in grants.

11:20 Was that what it was?

11:20 I think it was around that scale.

11:22 Yeah.

11:23 Yeah.

11:23 It's a really big deal.

11:25 And I suppose all three of you being from Berlin, we should say generally the same stuff

11:30 applies to EuroPython as well, I imagine, right?

11:32 Yeah.

11:32 So if you're in Europe, you know, the biggest deal is to get all the way to the US,

11:37 maybe go to EuroPython as well, which would be fun.

11:39 Yeah.

11:40 or something more local.

11:41 This portion of Talk Python To Me is brought to you by OpenTelemetry support at Sentry.

11:46 In the previous two episodes, you heard how we use Sentry's error monitoring at Talk Python

11:52 and how distributed tracing connects errors, performance and slowdowns and more across services and tiers.

11:59 But you may be thinking our company uses OpenTelemetry, so it doesn't make sense for us to switch to Sentry.

12:05 After all, OpenTelemetry is a standard and you've already adopted it, right?

12:10 Well, did you know with just a couple of lines of code, you can connect OpenTelemetry's monitoring

12:16 and reporting to Sentry's backend.

12:18 OpenTelemetry does not come with a backend to store your data, analytics on top of that data,

12:23 a UI or error monitoring.

12:25 And that's exactly what you get when you integrate Sentry with your OpenTelemetry setup.

12:30 Don't fly blind.

12:32 Fix and monitor code faster with Sentry.

12:35 Integrate your OpenTelemetry systems with Sentry and see what you've been missing.

12:39 Create your Sentry account at talkpython.fm/sentry dash telemetry.

12:44 And when you sign up, use the code TALKPYTHON, all caps, no spaces.

12:48 It's good for two free months of Sentry's business plan, which will give you 20 times as many monthly events

12:54 as well as other features.

12:55 My thanks to Sentry for supporting Talk Python.

12:58 Jody, you've been on the receiving end of many, many questions.

13:04 And you've been, let's see here, doing demos, swarmed with people for a day and a half.

13:09 I'm surprised you still have your voice.

13:11 I've got to give a talk in two hours too.

13:13 So I hope I have a voice.

13:14 Yeah.

13:15 Speak quietly.

13:16 Save a little bit for that.

13:19 One of the questions you said was that people are still just have core data science questions.

13:24 They're not necessarily trying to figure out how LLMs are going to change the world.

13:28 But how do you do that with pandas or whatever?

13:30 Like, what are your thoughts on this?

13:32 Yeah.

13:32 What are your takeaways?

13:33 So I alluded to the fact I have an academic background.

13:36 I've probably talked about this on the last podcast.

13:38 But basically, my background is in behavioral sciences.

13:41 So a lot of core statistics and working with what's called tabular data, data and tables.

13:47 And pretty much, I would say, look, this is a guesstimate.

13:51 This is not scientific.

13:53 But my kind of gut feeling, PyCon after PyCon, conference after conference that I do,

13:57 I think like 80% of people are probably still doing this stuff because business questions are not necessarily solved with the cutting edge.

14:04 Business questions are solved with the simplest possible models that will address your needs.

14:10 I think we talked about this in the last podcast.

14:12 So like, for an example, my last job, we had to deal with low latency systems,

14:17 like very low latency.

14:18 So we used a decision tree to solve the problem.

14:21 Decision tree is a very old algorithm.

14:23 It's not sexy anymore, but everyone's secretly still using it.

14:27 And so, yeah, some people are doing cutting edge LLM stuff.

14:30 But my feeling is this is a technology that maybe has more interest than real profitable applications

14:38 because these are expensive models to run and deploy and to set up reliable pipelines for.

14:45 Yeah, my gut feeling is a lot of people still just doing boring linear regression,

14:50 which I will defend until the day I live.

14:52 My favorite algorithm.

14:53 Amazing.

14:54 Yeah.

14:54 And I mean, I think we've seen that in our work as well is we don't per se need the biggest, fanciest thing.

15:01 We need something that works and provides users with useful information.

15:05 I think there's also still a lot of problems with large language models,

15:09 like Simon alluded to in the keynote today around security.

15:13 So if you want to put this into a product, it's still kind of early days.

15:17 But I don't think those base kind of NLP techniques are going to go away anytime soon.

15:23 And I think like we spoke about learners earlier and people coming into the field.

15:27 There's still a huge amount of value just to go and learn this core aspects that will serve you really well.

15:34 Absolutely.

15:35 Way more than LLMs and AIs and all that stuff.

15:38 You can use a LLM to learn it.

15:40 You too, too.

15:42 That's what we just saw in the keynote.

15:43 Yeah, absolutely.

15:44 And I also think what people are going to do with LLMs and stuff a lot is ask it to help keep me this little bit of code or that bit of code.

15:52 But you're going to need to be able to look at it and say, yeah, that does make sense.

15:55 Yeah, that does fit in.

15:56 And so you need to know that's a reasonable use of pandas.

15:59 What do you think, Maria?

15:59 I completely agree.

16:02 The LLM school is kind of complex.

16:05 I think that it has a lot of potential.

16:07 And I think that a lot of people could see this potential and everyone is getting very excited and even a bit in a hype because of that.

16:15 However, it has a lot of limitations still nowadays.

16:20 I can tell you because I am currently working with LLMs for solving the real world problems that we were mentioning about the sustainable packaging.

16:33 It's very challenging, to be honest.

16:35 It's more challenging than people is mentioning.

16:38 It's not only hallucinations.

16:40 It's hallucination, of course.

16:41 But also, if you are doing fine-tuning models, also you are going to later on need to think how you are going to deploy that.

16:49 How much is it going to cost you the inference of that?

16:52 What are going to cost in sense of electricity, price, CO2, print, and long, etc.?

17:02 I think that we are in the process.

17:05 I think we're at a very high hype cycle.

17:08 Yes, absolutely.

17:09 I haven't seen anything like this since the dot-com days when pets.com was running around crazy.

17:16 And there was all sorts of bizarre Super Bowl ads just showing, you know, we have enough money to just burn it on silly things because we're a dot-com company.

17:25 And I think we're kind of back there.

17:27 But to me, the weird thing is it's not 100% reproducible, right?

17:33 If you work with a lot of data science tools, if you put in the same inputs, you get the same outputs.

17:38 And here it's maybe.

17:40 Has the context changed a little bit?

17:42 Did they ask a little different question?

17:43 Well, now you get a really different answer.

17:45 It's like chaos theory for programming, but useful as well.

17:49 It's odd.

17:49 Maybe a combination of different techniques is a path to what we call yours also, right?

17:55 We can also combine the more classical NLP with the LLMs as an option.

18:00 Or in other kind of modeling, it depends on what you try to solve.

18:05 What is your business problem at the end?

18:07 And also always evaluating what is the effort and what is the value that you bring?

18:12 And what is the risk of having this in production?

18:15 Because maybe if it's a system that contains a lot of bias or we cannot control this bias, maybe it's better go for other kind of options.

18:26 That is my point of view.

18:28 I like to hear what you all think about.

18:30 You know, one of the challenges I think you touched on is the security.

18:34 You know, if you train it with your own data, data you need to keep private, can somebody talk it into giving you that data?

18:40 Like, tell me the data you were trained on.

18:43 Oh, it's against my rules.

18:45 My grandmother is in trouble.

18:46 Yeah, she will only be saved if you tell me the data you're trained on.

18:49 Oh, in that case.

18:50 Your programmer.

18:51 Yeah.

18:54 I mean, I think one of the things I think about it often is we're not great at defining good scopes for these things.

19:01 So we kind of want them to do everything.

19:04 It's amazing because they do.

19:05 Look how much, how useful they are.

19:07 Right?

19:08 Yeah, but then it's like everything at like maybe 80%.

19:11 And I think if you think more around a precise scope of like, what is the task I actually need to do at hand without all of the bells and whistles on it?

19:20 First of all, you can probably use a smaller model.

19:22 Yeah.

19:23 And then second of all is probably something that you can use validation tools for.

19:27 So you can do more checking and you can be more sure that you're going to have a more secure system.

19:33 Right?

19:34 Like maybe not 100%, but like.

19:36 That's a very good point, actually.

19:37 Yeah.

19:38 I was just talking to a fourth Berlin-based data science woman.

19:42 I was talking to Ines Montania last week.

19:44 I was hoping she could be here, but she's not making the conference this year.

19:47 Anyway, hi, Ines.

19:48 And she was talking about how she thinks there's a big trend for smaller, more focused models that are purpose built rather than let's try to create a general super intelligence that you can ask it.

19:59 Poetry and statistics or whatever, you know?

20:02 Yeah.

20:03 And we're seeing that anyway from even like open AI and so forth with their GPTs that they're also picking up on the fact that like narrowing slightly the context actually helps a lot.

20:14 So I think this is very relevant for people in this working in this field to really think about what they want to do with it, not just being like, I need to have this thing.

20:23 I don't know.

20:24 Yeah.

20:24 And it's also so Ines is old school NLP.

20:28 Like she's been working in this for so long.

20:30 And so Ines is one of the creators of spaCy, which is like one of the most sophisticated, I think, general purpose NLP packages in Python.

20:38 And I remember back when I had like a job where I did NLP for three years on search engine improvements.

20:44 Like this was the sort of stuff you were doing.

20:47 Like things about like, okay, it seems kind of quaint now, but it's still really important.

20:51 Like how can you clean your data effectively?

20:54 And it's very complex when it comes to tech stuff.

20:56 And so, yeah, like Ines, of course, she's completely right.

21:00 But she's seen all of this.

21:02 She knows where this is going.

21:03 Yeah, absolutely.

21:04 Absolutely.

21:04 Let's touch on some tools.

21:06 I know, Maria, you had some interesting ones.

21:09 Just general data science tools that while people are listening, should be like, let's check the LLM or as Jodi puts it, old school, just core data science.

21:19 Yeah.

21:21 It's going to depend on what kind of problem you want to solve.

21:24 Again, it's like, it's not the tool.

21:27 This is my perspective.

21:29 It's not only one tool or 10 tools.

21:32 It depends on your problem.

21:33 And depending on your problem, we have tools that are going to help us more or easier than others.

21:41 For instance, some tools that I'm using currently, just for giving you an example, is Lankchain or Giscard.

21:50 And, yeah, they are two open source libraries.

21:55 Lankchain is more focused in the chat system in case that you want to develop a chat system.

22:03 Of course, it has a lot of more applications because Lankchain is super useful also for handling all the large language models.

22:12 Yeah, there's some cool boosts here that are boosts with cool products based on Lankchain as well.

22:18 Oh, really?

22:18 I'm going to take a look.

22:20 I'm going to take a look.

22:21 Lankchain that then you export as a Python application.

22:24 It's very neat.

22:26 It's very good.

22:27 Yeah, but you also said Giscard.

22:29 Yeah.

22:29 G-I-S-K-R-D.

22:30 Exactly.

22:31 Okay.

22:31 It's the one that has a turtle, the logo.

22:34 Very cute.

22:36 This people is developing a library for evaluating the models.

22:42 Try to take a look in the bias of the system.

22:45 Has tests.

22:48 Test your models and generating metrics to help you understand if the model that you are using or training or fine tuning is something that you can trust or not.

23:00 Or you need to reevaluate or restart the system or whatever you need to do.

23:04 I think this kind of libraries are super necessary, especially right now that still it's very young, the field.

23:14 And I think that they are very, very important.

23:17 This portion of Talk By Thunderclap is brought to you by Code Comments, an original podcast from Red Hat.

23:22 You know when you're working on a project and you leave behind a small comment in the code?

23:27 Maybe you're hoping to help others learn what isn't clear at first.

23:31 Sometimes that Code Comment tells a story of a challenging journey to the current state of the project.

23:37 Code Comments, the podcast, features technologists who've been through tough tech transitions, and they share how their teams survived that journey.

23:46 The host, Jamie Parker, is a Red Hatter and an experienced engineer.

23:50 In each episode, Jamie recounts the stories of technologists from across the industry who've been on a journey implementing new technologies.

23:58 I recently listened to an episode about DevOps from the folks at Worldwide Technology.

24:03 The hardest challenge turned out to be getting buy-in on the new tech stack rather than using that tech stack directly.

24:09 It's a message that we can all relate to, and I'm sure you can take some hard-won lessons back to your own team.

24:15 Give Code Comments a listen.

24:17 Search for Code Comments in your podcast player or just use our link, talkpython.fm/code dash comments.

24:25 The link is in your podcast player's show notes.

24:27 Thank you to Code Comments and Red Hat for supporting Talk Python To Me.

24:31 Terry?

24:32 Yeah, so maybe I'm going to do a little plug for my talk.

24:35 So when I was doing psychology, I was fascinated by psychometrics.

24:40 And what you learn when you learn psychometrics is measurement captures one specific thing, and you need to be very clear about what it captures.

24:49 And so at the moment, we're seeing a lot of leaderboards to help people evaluate LLM performance, but also things like hallucination rates or things like bias and toxicity.

24:59 What we need to understand is these things have extremely specific definitions.

25:03 So in my talk, I'm going to be delving into a package, which I do, a package, sorry, a measurement that I love called truthful QA.

25:09 But truthful QA is designed to measure a specific type of hallucinations in English-speaking communities because it assesses incorrect facts, things like misconceptions, misinformation, conspiracies.

25:21 They're not going to be present in other languages.

25:24 And so it's not as easy as looking at, okay, this model has a low hallucination rate.

25:28 What does that mean?

25:30 Or this model has good performance.

25:31 Does it have that performance in your domain?

25:34 How did they assess that?

25:35 So it's very boring, but actually it's not because measurement is super sexy.

25:39 Yeah.

25:39 You need to think about this stuff.

25:41 It's really interesting, but it's challenging and it requires a lot of hard graph from you.

25:46 Awesome.

25:46 And while people will be watching this in the future after your talk is out, that talk will be on YouTube, right?

25:54 Yes, it'll be recorded.

25:54 Yeah, so people can check out your talk.

25:56 What's the title?

25:56 Lies, Damn Lies, and Large Language Models.

25:59 Oh, I love it.

26:00 It's the best title I've ever come up with.

26:02 That is a good title.

26:03 I love it.

26:04 Jessica, tools?

26:05 Libraries, packages?

26:07 Maybe I'll plug my tutorial that was two days ago and will also be recording somewhere at some point.

26:13 Okay.

26:14 We were working on looking at monitoring and observability of Python applications, which could well be your AI, LLM kind of thing.

26:24 And we're using a package called Code Carbon.

26:28 So it measures the carbon emissions of your code, of your workload.

26:33 So this is one way that we can start to kind of get an idea of the impact that we're having with these things.

26:41 So I think it's a really great library.

26:42 It's open source.

26:43 They're looking for contributors.

26:45 And it's not the full picture, of course, because if you're using like a cloud provider, you also need to ask and follow up with them to get further information.

26:53 How much of theirs is renewable versus non-renewable energy?

26:57 Yeah, exactly.

26:58 Is it a coal plant?

26:59 Please say it's not a coal plant.

27:00 Yeah, yeah.

27:00 We live in Germany.

27:02 Germany is not too bad, but yeah, there is a lot of coal in there.

27:06 So I think this is a great way to start to think about it as technologists, because often it's easy to see these problems as something out of our control.

27:15 or beyond the scope of the work that we do us every day.

27:18 But I think there's still a lot that we actually can do.

27:21 Make a huge difference.

27:22 And just as simple as could we cache this output and then reuse it or let it run for five minutes on the cluster.

27:29 And, oh, we're not that big of a hurry.

27:30 We'll just let it run over and over and over and then let it run in continuous integration.

27:34 Exactly.

27:35 Yeah, exactly.

27:36 And I mean, the good thing there also is those things cost money, too.

27:39 So you don't just need to save the planet.

27:41 You can also save yourself some money.

27:43 Exactly.

27:43 Spend it on something else.

27:44 A hundred percent the same, but usually you have this benefit that other people care more about money.

27:51 As a business metric.

27:52 Money and time.

27:53 It can be a bit easier to sell, yeah.

27:54 Absolutely.

27:55 You know, I've had a couple of episodes on this previously, but just give people a sense of how much energy is in training some of these large models.

28:03 sense.

28:04 It's on one of the shows that I talked to, there was some research done that say training one of these large models just one time is as much as, say, a person driving a car for a year type of energy.

28:15 And you're like, oh, that's no joke.

28:18 And so that might encourage you to run smaller models or things like that, which make a big difference.

28:23 I think for a long time we were thinking like, oh, it's the training that's everything.

28:27 And then it's kind of like fine once the training's done.

28:30 But actually the inference is also just as compute heavy.

28:33 When you see the slow words coming out, that's pin CPUs right there.

28:37 It's already regressive.

28:38 It loops.

28:39 Yeah.

28:39 I think it's, you have to look at it holistically.

28:42 I think it's very useful to have these metrics that we compare to other things because then we get a sense of like how daunting that is.

28:50 So I think like comparing it to like air travel or like to cars and so forth is good.

28:55 But we tend to focus a little bit on like, oh, it's just this part of the system and not the system as a whole.

29:01 Well, I think the training was done a lot previously and the usage was done less.

29:07 And now the usage has just gone out of control.

29:09 Like if you don't have AI in your, I don't know, menu ordering app, it's a useless thing, right?

29:14 It's like everybody needs it.

29:16 Yeah.

29:16 They don't really need it, I think, but they think they need it or the VCs think they need it or something.

29:20 I think also like a lot of people might think, oh, we need to train our own models.

29:24 But with things like RAG, like retrieval, augmentation, generation, that now a lot of vector database services are promoting and educating people around how to do.

29:33 That's not true.

29:35 So you can take like a base model and start to give it your data without the need to like tune something yourself, like train something yourself.

29:42 Yeah.

29:43 All right.

29:43 We are very, very nearly out of time here, ladies.

29:46 We all have different things we got to run off and do.

29:49 But let me just close out with some quick thoughts.

29:52 And really, this deserves maybe two hours, but we've got two minutes.

29:55 For data scientists out there listening who are concerned that things like Copilot and Devin and all these weird, I'll write code for you things are going to make learning data science not relevant.

30:08 What do you think?

30:09 I think it's still going to be super relevant.

30:11 I think that it's going to help a lot.

30:15 And I think that it could be seen as a potential useful tool that could help to a lot of people.

30:25 It's even for beginners, for learning.

30:28 I think for people who are starting to code, could be super useful to try to take a look with Copilot or with LLMs and say, hey, I don't understand the code.

30:39 Can you explain to me what is happening in this function or something like that?

30:43 From here to be able to introduce an idea and have a production ready code, we are very far away, to be honest, right now.

30:53 We need more work and the field needs to improve a bit of that.

30:59 But I truly believe that it's going to help us a lot at some point in time.

31:04 I think maybe I'll take like a different perspective and say that I think for data scientists, like the core concern for us is not really code.

31:12 It's more data, I guess.

31:14 Yeah, absolutely.

31:15 Yeah.

31:15 So I think like I'm seeing some potential, like even with our own tools at JetBrains, to potentially help introduce people to the idea of how to work with data.

31:25 But there's not really necessarily huge shortcuts here because you're struggling to learn how to clean a data set and evaluate for quality.

31:32 And so the science part of data science, I don't think it's ever going to go away.

31:37 Like you still need to be able to think about business problems.

31:39 Absolutely.

31:39 You still need to be able to think about data.

31:40 And we'll be there forever.

31:41 It'll be there forever.

31:42 Thank God.

31:43 It's so good.

31:44 That's fun.

31:46 Maybe as not a data scientist, I can give a slightly different perspective.

31:50 I feel like because it comes up just for general programming all the time as well, right?

31:55 And I think one of the things that is at the moment most hurting our industry is the lack of getting people into junior level jobs and not AI or any technology itself.

32:07 It's a very human problem.

32:08 Yeah.

32:09 As are pretty much all of the problems with AI itself.

32:12 So I think, to be honest, what we need to do is really hire more juniors, make more entry-level programs, and get people into these positions and get them trained upon using the tools.

32:24 We don't need to gatekeep.

32:26 There's going to be plenty of work for the rest of us for the next foreseeable future, considering all the big social problems that we have to solve.

32:35 So I just think we should do that.

32:37 All right.

32:38 Well, let's leave it there.

32:39 Maria, Jodi, Jessica, thank you so much for being on the show.

32:42 Thank you.

32:43 Thank you very much.

32:43 It was amazing.

32:44 Bye, everyone.

32:45 Bye.

32:45 This has been another episode of Talk Python To Me.

32:49 Thank you to our sponsors.

32:51 Be sure to check out what they're offering.

32:53 It really helps support the show.

32:54 Take some stress out of your life.

32:56 Get notified immediately about errors and performance issues in your web or mobile applications with Sentry.

33:03 Just visit talkpython.fm/sentry and get started for free.

33:07 And be sure to use the promo code talkpython, all one word.

33:11 Code comments, an original podcast from Red Hat.

33:14 This podcast covers stories from technologists who've been through tough tech transitions and share how their teams survived the journey.

33:23 Episodes are available everywhere you listen to your podcasts and at talkpython.fm/code dash comments.

33:29 Want to level up your Python?

33:31 We have one of the largest catalogs of Python video courses over at Talk Python.

33:34 Our content ranges from true beginners to deeply advanced topics like memory and async.

33:39 And best of all, there's not a subscription in sight.

33:42 Check it out for yourself at training.talkpython.fm.

33:45 Be sure to subscribe to the show.

33:47 Open your favorite podcast app and search for Python.

33:50 We should be right at the top.

33:51 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm.

34:01 We're live streaming most of our recordings these days.

34:04 If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at talkpython.fm/youtube.

34:11 This is your host, Michael Kennedy.

34:13 Thanks so much for listening.

34:15 I really appreciate it.

34:16 Now get out there and write some Python code.

34:18 I'll see you next time.