Scaling data science across Python and R

Episode #236, published Tue, Oct 29, 2019, recorded Fri, Sep 27, 2019


Do you do data science? Imagine you work with over 200 data scientists, many of whom have diverse backgrounds or have come from non-CS backgrounds. Some of them want to use Python. Others are keen to work with R.

Your job is to level the playing field across these experts through technical education and build libraries and tooling that are useful both in Python and R.

It sounds like a fun challenge, doesn't it? That's what Ethan Swan and Bradley Boehmke are up to. And they are here to give us a look inside their world!

Episode Deep Dive

Guests introduction and background

Ethan Swan is a data scientist with a formal engineering background who discovered his passion for coding through MATLAB and later gravitated toward Python. He focuses on teaching and upskilling data scientists internally, particularly around adopting more advanced technical workflows such as Spark.

Bradley Boehmke started in economics and worked extensively with Excel in the Air Force for cost estimation and other analytical tasks. He later adopted R to handle more complex data analysis and has helped lead the effort to build shared data science tooling that addresses both R and Python users at scale.

What to Know If You're New to Python

Below are a few concepts that will help you get the most out of this discussion on scaling data science with Python and R:

Understanding Data Science Workflows: Basic familiarity with reading data (e.g., CSV files), data manipulation (e.g., pandas data frames), and creating simple plots (e.g., matplotlib or Altair) will help you follow how Python can complement or replace tools like Excel and SAS.
Open Source Ecosystem: Recognize that Python libraries are typically community-maintained, meaning new users must learn how to find or vet popular libraries on pypi.org.
Environments and Tooling: Tools like Jupyter Notebooks, Visual Studio Code, PyCharm, or even command-line editors (like Vim) come into play. Becoming comfortable with at least one editor and environment streamlines your learning.
Collaboration Across Languages: Be aware that Python can integrate with R (via rpy2 or reticulate) and large-scale systems such as Spark, which is vital in enterprise data science settings.
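As a rough illustration of the first point above, here is a minimal sketch of the read-then-summarize workflow. It uses only Python's standard library so it runs without installing anything; the column names (`store`, `amount`) and the sample rows are made up for illustration. With pandas, the same step would look more like `pd.read_csv(path).groupby("store")["amount"].sum()`.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data standing in for a CSV file on disk.
SAMPLE_CSV = """store,amount
A,10.50
B,4.25
A,3.00
"""


def total_by_store(csv_text):
    """Sum the 'amount' column for each distinct value of the 'store' column."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["store"]] += float(row["amount"])
    return dict(totals)


totals = total_by_store(SAMPLE_CSV)
print(totals)
```

This is the kind of task that often starts life as an Excel pivot table; writing it as a function makes it repeatable.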

Key points and takeaways

Bridging Python and R in a Large Team In an organization with over 200 data scientists, some favor Python while others prefer R. Providing a unified or parallel experience across these languages has become crucial. The team must ensure consistent business logic, quality control, and synergy between these two ecosystems.
- Links and Tools:
  - rpy2
  - reticulate
Providing Internal Packages and Libraries Ethan, Brad, and their colleagues create custom Python and R packages that encapsulate “golden rules” for data access and business logic. This spares users from rewriting large SQL scripts or applying repetitive rules. However, maintaining identical functionality in two languages demands a careful approach and continuous communication.
- Links and Tools:
  - Artifactory (used for internal repository management)
  - PyPI (mirrored internally for Python packages)
  - CRAN (cran.r-project.org) for R packages
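To make the idea of an internal package concrete, here is a small sketch of a helper that bakes shared rules into one function so analysts do not retype them in every SQL script. Everything specific in it is hypothetical: the table name, the column list, and the "positive spend only" rule are invented for illustration and are not taken from the episode.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical "golden rules": these names and the filter are illustrative only.
TRANSACTIONS_TABLE = "sales.transactions"
DEFAULT_COLUMNS = ("household_id", "store_id", "txn_date", "spend")


@dataclass(frozen=True)
class Query:
    sql: str
    params: tuple


def transactions_query(start, end, columns=DEFAULT_COLUMNS):
    """Build a parameterized query that always applies the shared rules."""
    if start > end:
        raise ValueError("start must not be after end")
    # Only allow known column names, since they are interpolated into the SQL.
    unknown = set(columns) - set(DEFAULT_COLUMNS)
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    sql = (
        f"SELECT {', '.join(columns)} FROM {TRANSACTIONS_TABLE} "
        "WHERE txn_date BETWEEN ? AND ? "
        "AND spend > 0"  # shared rule (assumed): skip refunds and zero-value rows
    )
    return Query(sql=sql, params=(start.isoformat(), end.isoformat()))


q = transactions_query(date(2019, 1, 1), date(2019, 1, 31))
print(q.sql)
print(q.params)
```

An R package would need to expose the same function with the same rules, which is where the "continuous communication" mentioned above comes in.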
Spark as the Backbone for Big Data Spark is central to handling massive amounts of transactional data, such as 10+ years of grocery transactions, at scale. Many advanced data projects are moving from legacy SQL-only workflows to Spark-based distributed systems, increasing the need for data scientists to be comfortable with cluster computing.
- Links and Tools:
  - Spark
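One way a team might soften the move to Spark is a thin wrapper that applies sensible defaults and lets advanced users override them. The sketch below is an assumption about how such a wrapper could look, not the guests' actual code; the default values are illustrative, and `get_spark` needs `pyspark` installed to run.

```python
# Hypothetical team defaults; the values are illustrative only.
TEAM_DEFAULTS = {
    "spark.sql.shuffle.partitions": "200",
    "spark.sql.session.timeZone": "UTC",
}


def resolve_conf(overrides=None):
    """Merge caller overrides on top of the team defaults."""
    conf = dict(TEAM_DEFAULTS)
    conf.update(overrides or {})
    return conf


def get_spark(app_name, overrides=None):
    """Return a SparkSession configured with the merged settings.

    pyspark is imported lazily so resolve_conf works without it installed.
    """
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in resolve_conf(overrides).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


conf = resolve_conf({"spark.sql.shuffle.partitions": "64"})
print(conf)
```

Most users would call `get_spark("my-analysis")` and never touch the settings, while the few who need to tune something pass a small dictionary of overrides.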
Cultural Shift from Closed-Source to Open-Source Tools A large segment of the data science team historically relied on Excel and SAS. Embracing open-source tools requires a new mindset: being ready for multiple competing packages, community support instead of a vendor, frequent version updates, and a broader ecosystem of solutions. Over time, Python and R have steadily replaced these older systems.
- Links and Tools:
  - SAS (Closed-source platform)
  - Excel
Overcoming Resistance to New Languages For many data scientists and analysts who spent years in Excel or SAS, picking up Python or R can feel daunting. The team’s internal education approach and official “intro to Python” classes taught them to adopt open-source tools more organically, with an emphasis on helpful ecosystems and extensive documentation.
- Links and Tools:
  - Jupyter Notebooks
  - RStudio (the company has since been renamed Posit)
Balancing Elegant vs. Practical APIs In building enterprise-wide APIs for data, the difference in usage patterns is stark: advanced developers might want sophisticated customization, while the majority prefer straightforward, plug-and-play calls. Ethan’s team tries to minimize friction, ensuring that the average data scientist can easily adopt shared libraries.
- Links and Tools:
  - Altair
  - tidyverse (for R)
Managing Deployments and Productionization Once a model or analysis is built, it needs to go into production. While notebooks (Jupyter or RStudio) are great for prototyping, teams often move final models into dockerized containers or frameworks that output Java artifacts (like POJOs from H2O or DataRobot) for scoring. This ensures stable and efficient deployments.
- Links and Tools:
  - DataRobot
  - H2O
  - Docker
Automated Notebook Execution Tools like PaperMill (by Netflix) let teams parameterize and run notebooks in batch mode. While some prefer rewriting notebooks into scripts, PaperMill can streamline routine tasks and maintain the simplicity of a notebook-based workflow.
- Links and Tools:
  - PaperMill
  - Databricks (also offers notebook-based automation)
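For a sense of what batch notebook execution looks like, here is a short sketch built around papermill's `execute_notebook` function. The notebook name and the `region` parameter are hypothetical, and `run_all` requires `papermill` to be installed plus a real template notebook; the planning helper runs on its own.

```python
from pathlib import Path


def plan_runs(template, out_dir, param_sets):
    """Pair each parameter set with a distinct output notebook path."""
    out_dir = Path(out_dir)
    stem = Path(template).stem
    return [
        (out_dir / f"{stem}_{i:03d}.ipynb", params)
        for i, params in enumerate(param_sets)
    ]


def run_all(template, out_dir, param_sets):
    """Execute the template notebook once per parameter set."""
    import papermill as pm  # imported lazily; needs `pip install papermill`

    for output_path, params in plan_runs(template, out_dir, param_sets):
        pm.execute_notebook(str(template), str(output_path), parameters=params)


# Hypothetical notebook name and parameters, for illustration only.
runs = plan_runs("weekly_report.ipynb", "out", [{"region": "east"}, {"region": "west"}])
for path, params in runs:
    print(path, params)
```

Papermill injects the supplied values in a new cell placed after the notebook cell tagged `parameters`, so the template keeps its defaults for interactive use.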
Environment and Infrastructure Complexity Operating across multiple on-prem systems, Hadoop environments, and even multiple clouds means data scientists must adapt to cluster configurations, restricted permissions, and single sign-on. This multi-environment reality drives the need for standardizing approaches to tooling and dependency management.
- Links and Tools:
  - Hadoop
Future of Data Science Teams: Hybrid Skillsets Modern data science teams increasingly need people who bridge skill sets: some specialize in advanced modeling, others in software engineering. Enabling large-scale success means having “hybrid” contributors who can build standardized, robust libraries for both Python and R, plus handle infrastructure challenges like CI/CD.
- Links and Tools:
  - CI/CD pipelines

Interesting quotes and stories

"We have about 250 total data scientists... at that scale, so many people are doing similar work that it makes sense to automate some of that stuff." - Ethan

"It's not like we're expecting you to [already] know Spark, but you need to be open and willing to work in that environment." - Brad

"I realized people who came from SAS didn't automatically think, 'Just Google it, look at the docs, or look on Stack Overflow.' That was a brand-new skill for some." - Ethan

Key definitions and terms

Spark Session: The entry point to Apache Spark (`SparkSession`) through which a program configures its connection to a cluster and creates the distributed DataFrames it works with.
reticulate: An R package allowing R to interface directly with Python, thus running Python code within R environments.
PaperMill: A tool that parameterizes and executes Jupyter Notebooks programmatically, useful for batch or scheduled runs.
POJO: Plain Old Java Object. Some machine learning tools generate POJO files for production scoring with Java.

Learning resources

If you'd like to level up your Python skills (especially if you are transitioning from a tool like SAS or Excel), you might find these courses helpful:

Python for Absolute Beginners: Start here if you’re entirely new to programming or Python.
Move from Excel to Python with Pandas: Perfect if you’re used to Excel and want to embrace Python data analysis.
Fundamentals of Dask: Learn how to parallelize and scale your data workflows, a key skill for big data in Python.

Overall takeaway

Scaling data science across large teams means moving beyond personal preferences and building consistent, reusable tools and practices for both Python and R. Successfully merging these technologies demands thoughtful package design, careful versioning, and ongoing education to align diverse skillsets. By standardizing core data workflows and embracing open-source ecosystems, organizations can enable hundreds of analysts to collaborate productively, handle massive datasets, and deploy cutting-edge solutions to production.

Links from the show

Guest: Ethan Swan
Website: ethanswan.com
Twitter: @eswan18
GitHub: github.com/eswan18

Guest: Bradley Boehmke
Website: bradleyboehmke.github.io
Twitter: @bradleyboehmke
Github: github.com/bradleyboehmke

84.51˚ Company
Tech Blog: 8451.com/blog
The Uplow'd Podcast: 8451.com/the-uplowd-by-8451-podcast
Episode #236 deep-dive: talkpython.fm/236
Episode transcripts: talkpython.fm

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy

Episode Transcript


00:00 Do you do data science? Imagine you work with over 200 data scientists, many of whom have diverse

00:07 backgrounds or have come from non-CS backgrounds. Some of them want to use Python, others are keen

00:12 to work with R. Your job is to level the playing field across these experts through technical

00:18 education and to build libraries and tooling that are useful to both Python and R-loving data

00:23 scientists. It sounds like a fun challenge, doesn't it? That's what Ethan Swan and Bradley

00:27 Boehmke are up to, and they're here to give us a look inside their world. This is Talk Python To Me,

00:32 episode 236, recorded September 27th, 2019. Welcome to Talk Python To Me, a weekly podcast on Python,

00:53 the language, the libraries, the ecosystem, and the personalities. This is your host,

00:57 Michael Kennedy. Follow me on Twitter, where I'm @mkennedy. Keep up with the show and listen to

01:01 past episodes at talkpython.fm, and follow the show on Twitter via @talkpython. This episode is

01:07 brought to you by Linode and Tidelift. Please check out what they're offering during their segments.

01:12 It really helps support the show. Ethan, Brad, welcome to Talk Python To Me.

01:15 Thanks. Good to be here.

01:16 Yeah, thanks.

01:17 Yeah, it's great to have you both here. It's going to be really fun to talk about enabling data science

01:22 across large teams and this whole blended world of data science, which it sounds pretty good,

01:29 actually. It sounds like a positive place.

01:31 Yeah, it's definitely getting more and more tangled, too.

01:33 Yeah, I can imagine. I can imagine. So we're going to talk about things like R and Python and

01:39 how those can maybe live together, how to bring maybe some computer science techniques and stuff

01:45 to work together across these different teams and so on. But before we do, let's get started with

01:52 your story. How did you get into programming in Python? Ethan, you want to go first?

01:55 I went into college as an undecided engineering major and didn't really know what I wanted to do,

01:59 but I was pretty sure it wasn't computer science. I was pretty sure that was for people who sat in

02:03 front of a computer and it sounded very boring. And I got into the intro class for engineering and

02:08 picked up MATLAB and just loved it. So from that point forward, I did some C and C++ in college

02:13 and then came out of college and started working in data science. I started with a little bit of R and

02:18 then found I was a lot more comfortable with Python. And now I use it in my job and also a little bit

02:22 outside of work for some personal projects. So I really, really enjoyed it after going through a

02:26 number of languages.

02:26 It's interesting that MATLAB was sort of the beginning programming experience. I think looking

02:33 in from the outside at the computer programming world, a lot of folks probably don't think that,

02:37 but I went through a math program and when I was studying, a lot of people, their first programming

02:42 experience was working in MATLAB and .m files and all that stuff.

02:46 Yeah. Well, it seems to be very useful across other engineering fields. And also,

02:51 it's relatively friendly. It's not like learning C or C++, which would probably scare a lot of people

02:55 away.

02:56 Yeah, absolutely. Absolutely. Brad, how about you?

02:59 My background is much more along economics. So I was in the Air Force doing a lot of life cycle

03:05 cost estimates for weapon systems, aircraft and the like. And a lot of that was done in Excel.

03:11 And when I went up and I did my PhD, I mean, I started getting a lot of my research data.

03:16 It was gnarly, just stuff spread out all over the place, ugly.

03:19 And your PhD was in economics?

03:21 No, it kind of. Yes and no. I had a unique PhD. It's technically called logistics,

03:26 but it was kind of a hybrid of like economics, applied stats and ops research.

03:31 Oh, right. Okay, cool.

03:32 Yep. Yep. So and the problem was, you know, I spent a couple months just trying to figure out,

03:36 like, how can I clean this data up and do my analysis within Excel? It was horrible. And so

03:41 that was about the same time that Johns Hopkins came out with a like an online data science course

03:47 through Coursera. And they featured or primarily focused on R. And that was kind of when I decided,

03:53 all right, you know, I need to, I need to take a programming language to really get through this

03:57 research.

03:57 You've outgrown Excel in the extreme.

03:59 Yes. Yes. Yep.

04:01 So did you abuse it pretty badly? Were you trying to make it do things that just wouldn't?

04:05 You know, it's funny because the work I was in within the Air Force, it was your classic

04:10 abuse Excel as much as possible, right? You open it up, you got a workbook that's got like 26

04:16 worksheets. You got stuff that is hyperlinked all over the place. You got, you know, hard coded

04:21 changes going on in there and you leave for one week, you come back and there's just no way you can

04:26 reproduce anything. And that was exactly what I was running into. And so that's, that's really what

04:31 got me into programming.

04:32 I think there's a lot of people out there who definitely consider themselves not programmers.

04:36 And yet they basically program Excel all the time. Right. Right. And a lot of folks could follow

04:43 your path and just add some programming skills and really be more effective.

04:48 I think that's kind of a theme of past shows. I know you bring that up a bit where, you know,

04:53 a lot of people would benefit from having a little bit of programming skill that they could bring to

04:56 their regular job rather than being full-time programmers. And that seems very true.

04:59 Yeah. And it sounds like exactly like this is a scenario for that. And it's definitely something

05:04 that I'm passionate about. So I bring it up all the time. The other thing I think that's

05:07 interesting about programming, and programming in quotes, in Excel is we did a show called

05:13 Escaping Excel Hell. And one of the themes is Excel is basically full of all of these go-to

05:19 statements, right? Like you just go down and it says, go to that place and then go over here and

05:24 go across this sheet over to that up there. It's totally unclear what the flow of these things are.

05:29 It's bizarre. All right. So definitely programming languages are better. You both work at the same

05:35 company. Let's talk about what you do day to day because you're sort of on the same team, right?

05:40 Sort of. We collaborate very tightly. So I actually work on the education team. So our company is called

05:45 8451. We're a subsidiary of Kroger. We're mainly their data science marketing agency. And we both work

05:50 within the data science function. So my team is mainly involved with upskilling the function just

05:55 generally. That may mean scheduling classes for people that are new starters. It may also mean what

06:00 we call continuing education. So figuring out what people need to learn going forward to stay relevant

06:04 in the industry. I tend to be more on the technical side of that team. And that means that I collaborate

06:09 more tightly with Brad's team, which is more aligned to the technology.

06:12 Yeah, for sure. And Brad, how about you?

06:14 Yeah. So my team really focused on building kind of like internal components or internal packages.

06:19 Yeah, I'm sure we'll talk more about this a little later. But, you know, we have about 200 data

06:24 scientists that are at some point transitioning to using R and Python primarily or already are.

06:30 So we try to standardize certain tasks as much as possible. And, you know, we'll wrap that up into

06:37 like an R or Python package and kind of have like that centralized business logic for our own internal

06:43 capabilities as a package in either R or Python. So our team just focuses a lot on building those

06:48 packages. Yeah, that sounds super fun. It sounds like almost as if you're a small software team

06:54 or company building all these tools for the broader company, right? Or the broader data science

07:00 organization.

07:00 One thing that's definitely becoming more and more clear is, you know, we have kind of like the

07:04 traditional data scientists and then we have the traditional like engineering function within the

07:08 company. And there's kind of like that big void in between that kind of bridges that gap,

07:12 where you have folks that have somewhat the software engineering capabilities, but they're

07:18 coming from it more of a data science perspective, right? And they can build things that are a little

07:22 bit more geared directly to how the data scientists work.

07:25 Yeah. Interesting.

07:26 We have about 250 total data scientists just for a sense of scale, which is one of the reasons that

07:31 we have a dedicated internal team to enable them because at that scale, so many people are doing

07:35 similar work that it makes sense to automate some of that stuff to build it into packages and things like that.

07:40 I can't think of many other companies that have that many data scientists. Why don't you tell folks

07:45 who, what Kroger is? Because I know being here in the U.S. and certainly spending some time in the

07:52 South there, you know, Kroger directly is there, but they also own a bunch of other companies and

07:57 stuff. So maybe just give people a quick background so they know.

08:00 Kroger is in, I believe, 38 states and has something on the order of 3,000 stores. So it's just an

08:06 enormous grocery chain in the U.S. So you may not have seen Kroger itself under the name Kroger

08:11 because they own other chains, Ralphs, Food 4 Less. I think there's 20 different labels. But yeah,

08:17 they're all over the place. And so it makes a lot of sense to have some sort of customer analytics

08:21 organization, which is what we are.

08:24 There's a lot of analytics around grocery stores and things like that and how you place things. You

08:30 know, there's the story of putting the bananas in the back and in back corner and things like this,

08:35 right?

08:35 There's definitely a lot of different areas. So yeah, the banana story or like the milk in the

08:40 back, people often tell what might actually be apocryphal, this idea that like these things are

08:45 in the back because it makes people go get them and walk through the rest of the store. It might be true,

08:49 but at this point, it's so ingrained. I'm not sure anybody knows. But there's other areas too,

08:52 where it's like, what kinds of coupons do you mail people? So in general, when people ask me what my

08:56 company does, the simplest summary is when you get coupons from a grocery store, that's people like us,

09:02 essentially, where based on what you bought in the past, we know that like you would probably

09:06 appreciate these kinds of coupons.

09:07 Largely the way you probably collect data, I can imagine two ways or maybe more, is one,

09:12 just when people pay with a credit card, that credit card number doesn't change usually, right? So you

09:17 can associate that with a person. And then also, a lot of these stores in the US have these

09:23 membership numbers that are free to sign up, but you get a small discount or you get some kind of

09:29 like gas reward. There's some kind of benefit to getting a membership and always using that number.

09:35 And that obviously feeds right back to what you guys need, right?

09:37 Yeah, like that loyalty membership that a lot of folks have, and that is like the majority of

09:41 customers. That's really what allows the data science that we do to kind of like personalize

09:46 shopping experience, right? So if you're going to go online and do like online shopping,

09:50 or if you're going to likely be going to the store in the next week, you know, we can try to

09:54 personalize what do we expect you to be shopping based off of your history, we can link that back

09:58 to your loyalty card number and everything.

10:00 Yeah, super interesting. We could go on all sorts of the stories like the bananas and so on. I don't

10:06 know the truth of them, so I won't go too much into them, but they sound fun. But 250 data scientists,

10:13 that's quite the large group, as I said, and it's a little bit why I touched on the MATLAB story and

10:18 the Excel story, because people seem to come to data science from different directions. I mean,

10:24 you tell me where your people come from, but there's the computer science side, like I want to

10:28 be a programmer, or maybe I'll focus on data, but also just statisticians or people interested in

10:35 marketing or all these different angles. And that's an interesting challenge, right?

10:38 With 200, 250 analysts or data scientists, you have this huge spectrum of kind of like talent and

10:45 backgrounds. And so we kind of categorize our data scientists into like three big buckets, right?

10:51 So we have like the insights folks, and those are the folks that are really focusing on like looking

10:55 at historical trends going on, doing a lot of visualization to try to tell a story about what's

11:00 going on with a product over time, what's going on with their customers. Then we got kind of another

11:05 bucket that is kind of our statistical modelers or machine learning specialists, right? And those are

11:10 the people that, you know, you would typically think of that are more educated on the stats or the

11:16 algorithms that we're applying within the company. And then we got another bucket that's

11:20 technology, right? And those are the folks that are really specialized on usually like the languages

11:25 that we're using, R or Python, really understanding like, how to really be using Git, how to be using

11:31 Linux and, and kind of maneuver around all the servers and different tech stack environments that

11:36 we have going on. Obviously, the largest bucket is that insights. And I don't know what the actual

11:41 number is. But I always say that roughly 60 to probably 70% of our data scientists kind of fall

11:47 towards that insights. And that's kind of where you're going to see a lot of folks that have a

11:51 background that would be typically aligned with like a business analyst, right? Maybe they're coming

11:56 from more of an engineering or economics background. And the folks in that middle bucket, that machine

12:03 learning, that's gonna be more of your folks coming with like a stats, maybe stats, masters or PhD or

12:08 more. They could even be economics, but, you know, they have a stronger focus on econometrics

12:13 than kind of traditional economics. And then you've got that small bucket, which you get

12:18 a lot of people that I think are more like Ethan, Ethan's kind of what I would consider like a classic

12:23 person going in that bucket where they kind of have that computer science background, coming from school.

12:27 And that kind of creates that strong link between traditional software engineering in our data science

12:32 folks. That's a good taxonomy. Specific to our folks, I would say we have a lot of folks that have

12:38 kind of like an economics background. That's definitely a big kind of traditional degree that

12:44 we recruit a lot of people from. We have a lot of people from computer science programs, and then

12:48 kind of the traditional stats, right? So, and Ethan, you can throw in some others. But for my experience,

12:54 those kind of seem to be like the three like major themes of the backgrounds that we see.

12:58 That's definitely very common. I think historically, we leaned more from economics and statistics. And

13:04 recently, there's there's been a lot of changes. Data science as a product is like a newer thing.

13:07 In the past, I think there was less of a need for strong technical skills being a data scientist,

13:12 if that formal title even existed, right? Right. It was so new. It's like,

13:16 can you make graphs out of this big data with? Yeah, we love you just to do that. Right.

13:22 Things have really changed, especially because we've moved into using distributed systems like Spark.

13:26 And those things simply demand a higher level of technical expertise. And that's part of the reason that

13:31 we've shifted to hiring more technical people to at least support and sometimes do different work.

13:35 Sure. And that probably also feeds into why you all are building a lot of internal packages to help

13:40 put a smooth facade on top of some of these more technical things like Spark. That's definitely been a theme of shifting to new platforms. So, you know,

13:48 like probably most companies, we have a monolithic database system that for a long time,

13:53 we've relied upon. So most data scientists are pulling from one primary database. But over the last

13:59 couple of years, as we started to get things like clickstream data and just the needs of our

14:04 modeling changed, we started to push towards Spark. And Spark tends to be a really, I don't know,

14:11 a difficult adjustment for people coming from traditional databases, in my experience. And so

14:15 a lot of the work that Brad and I have done is work on simplifying that transition. Try to hide some of

14:21 the complexity that most people don't need to deal with. You probably don't need to configure

14:24 everything in your Spark environment, because you're not used to doing that in something like Oracle.

14:29 Yeah, absolutely. How much data skill do folks need to have for, as a data scientist, you know,

14:36 when I think data science, I think pandas, I think CSV, I think, you know, those kinds of

14:41 things, matplotlib, NumPy, scikit-learn, these kinds of things, but not just the SQL query language

14:49 and things like Spark and stuff. Although I know that that's also a pretty big part of it. So maybe

14:54 could you just tell us for people out there listening, thinking, hey, I'd like to be a data

14:58 scientist, what skills should I go acquire? And where's that fit into that?

15:02 In my view, it's really a matter of the size of your data. Big data is such a generic term that I

15:07 think it may have lost meaning in a lot of cases.

15:09 Yeah, some person's big data is actually like, oh, that's nothing. We just, that's our test data,

15:12 right?

15:12 It's like, yeah, how big is your laptop's memory? That's really the question. And for us,

15:17 so we literally have every transaction that's happened at Kroger over the last

15:20 at least 10 or 15 years. And so the size of that data is just enormous. To do even trivial things like

15:26 filters, you still need a very powerful system. And so for us, and for large companies with

15:31 transactional records or clickstream records, you generally need very powerful distributed systems

15:37 or a central database. But, you know, historically, people think of pandas as being the primary data

15:43 science package. And that is true once you reduce your data to a manageable size. And perhaps some

15:47 companies have small enough data that they could do that on a single server. But for us, that's

15:51 generally not true.

15:51 Do you guys use things like Dask or stuff for distributed processing?

15:55 We don't really use Dask. There's been some interest in it, I think. So I'm not super familiar with it,

16:01 but I think that it occupies a similar niche to Spark. We are pretty, pretty far down the Spark

16:06 path.

16:07 Sure. Once you kind of place your bets and you invest that many hours of that many people's work,

16:13 it can't just be slightly better or slightly different or whatever. It's got to be changing

16:17 the world type of thing to make you guys move.

16:19 We're also pushing towards migrating a lot of applications to the cloud. And in doing something

16:23 like that, you sometimes are a little more restricted in what you can do in an enterprise

16:27 setting because there's rules about how your environments work and things. And so we don't

16:31 generally get to like customize our own clusters, which you might want to do for Dask. So we have an

16:36 engineering and architecture team that sets up the Spark clusters for us that then we as data

16:40 scientists log into and use for our work.

16:43 That's kind of handy. I know there's a lot of places like that where there's just cluster computing

16:48 available like CERN has got some ginormous set of computers. You can just say run this on that

16:55 somehow, you know, and it just happens.

16:59 This portion of Talk Python To Me is brought to you by Linode. Are you looking for hosting that's fast,

17:04 simple, and incredibly affordable? Well, look past that bookstore and check out Linode at

17:09 talkpython.fm/Linode. That's L-I-N-O-D-E. Plans start at just $5 a month for a dedicated server

17:16 with a gig of RAM. They have 10 data centers across the globe. So no matter where you are or where your

17:21 users are, there's a data center for you. Whether you want to run a Python web app, host a private Git

17:26 server, or just a file server, you'll get native SSDs on all the machines, a newly upgraded 200

17:32 gigabit network, 24-7 friendly support, even on holidays, and a seven-day money-back guarantee.

17:37 Need a little help with your infrastructure? They even offer professional services to help you with

17:42 architecture, migrations, and more. Do you want a dedicated server for free for the next four months?

17:47 Just visit talkpython.fm/Linode. One of the things I think is interesting is this blend

17:55 between Python and R. And it sounds to me like people are coming to one of those two languages,

18:02 maybe even from somewhere else, maybe from Excel or from MATLAB or some of these other closed source

18:09 commercial tools. What's that look like? Because for me, it feels like a lot of times these are

18:15 positioned as an exclusive Python or R conversation. But maybe with that number of people, it's a slightly

18:23 different dynamic. What's it like there for you? I would say historically, at least my experience,

18:28 what I saw a lot of were people that were coming more from a computer science background,

18:32 kind of naturally aligned with the Python mindset and syntax. And the folks that traditionally came from

18:40 like a stats background or more of like a business analyst kind of gravitated towards R. And I still

18:47 see a lot of that, but I think it's starting to change quite a bit because you're getting more of

18:51 these like data science programs in universities. I mean, you're starting to get more of a mix within

18:56 those programs. And those programs are trying to either select one language or they're blending

19:03 two languages throughout the curriculum. So we still see a lot of crossover and folks coming with like

19:09 more of an R or Python. It's just, to me, it's not as easy to kind of pick out who it is, right? I used

19:15 to be able to look at someone and, you know, they said, well, you know, I went to school for computer science.

19:19 I was like, okay, well, obviously you're going to be a Python, more likely a Python than an R.

19:23 That's not always the case. So to me, it's getting a little bit more blurred. I think a lot of it just

19:29 has to do with the environment they're coming from. So if they're coming from a university, then,

19:33 you know, which university and what language are they just kind of like defaulting to?

19:37 Maybe even down to who the professor was and what book they chose.

19:41 Exactly. Yeah.

19:41 Right. It's, it's, I feel like it's almost not even chosen. It's, it's this organic growth of,

19:46 well, I was in this program and I had this professor a lot and that professor

19:50 knew Python or they knew R. So that's what we did, right? That.

19:55 Yep. And then also I think a lot of folks coming from, if you've got experience in industry and

19:59 you're coming from a different company over to 84.51°, then lots of times it just kind of depends on

20:04 the size of that company. It seems like companies that are smaller, that may be working with smaller

20:10 data sets, have a smaller infrastructure. It's easier to work on your local RStudio or

20:15 PyCharm IDE and do your work. Yeah.

20:17 Those companies that are much larger and you need like a larger infrastructure for your tech stack,

20:22 I feel like they're kind of gravitating more towards Python. There's other reasons behind

20:26 that. But so I think the size of the company also determines it.

20:30 It's probably a little bit more aligned with the computer science and DevOps side of the world.

20:37 And, you know, it's probably just, there's a, a greater tendency for those folks to also be using

20:42 Python rather than to also be using R because, you know, if you come from a stats background,

20:46 what do you know about Docker? Right. I mean, probably not much unless you had to just set it up for some

20:51 reason for some research project, right?

20:53 We find that especially at our size, having a very large dedicated engineering function and an

20:58 architecture team and these other more technical teams tend to be a lot more fluent in Python.

21:03 And so even in communicating with them and like when you have proof of concept applications, if you

21:09 want to say, we're going to try to deploy something in a new way, that team is going to be a lot better

21:13 able to support Python in general because it's, it's more like their background. So I've definitely seen

21:18 since I started, when R was a bit more popular, it's shifted to be about 50-50, but Python and R have sort of

21:24 found their niche. I think R is still the superior tool for visualization, which is sad because I like Python a lot

21:30 and I wish it were better, but, and I think there's hope, but R still is really, really good at that and really good at some

21:35 other things, readable code with the pipe operator and things like that. And it seems like R is doing really well in more of our

21:42 ad hoc analysis work. And then in our, our product style, like sciences that we deploy, that tends to be Python.

21:48 Interesting. So yeah, so the research may happen more in R, but the productization and the deployment might happen, might find its way over to Python.

21:56 Yeah. I think the more interactive type of work that you're doing, lots of times it's probably a little bit more majority on the R side, but the more we're trying to like standardize things or put things in some kind of automated procedure for production or whatever, that's when it starts to kind of gravitate towards Python.

22:11 Python, just because that's usually when we start getting the engineers involved a little bit more. And then how do we, how can we integrate this within our tech stack? And there's usually just less friction if we're doing that on the Python side.

22:22 Okay. So you talked about building these packages and libraries for folks to use to make things like Spark easier and so on. What is your view on this blended Python R world? Do you try to build the same basic API for both groups, but keep it Pythonic and, I don't know, R-esque, whatever R's equivalent of Pythonic is? How do you think about that? Are they different functions? Because Python is more on the product side?

22:50 This is a great question. It's been something that Ethan and I and a few other folks have really been trying to get our arms around. We don't know what the best approach is. We've tried a few different things.

23:01 For example, so like we just have a standard process of ingesting data, right? So we got to do some kind of a data query. There's lots of times just common like business rules that we need to apply. We call them golden rules. You know, certain stores, certain products we're going to filter out, certain kind of loyalty membership, whatever, we're going to discard those.

23:20 And that's all business logic. And typically, historically, we've had very large SQL scripts that people were applying the same thing over and over, maybe slight twists.

23:31 A lot of that stuff, we can just kind of bundle up in both an R and a Python package to apply those golden rules or the business logic. And it just makes their work more efficient, right?

23:41 So now their data query goes from like applying this big script to just like, all right, here's a function that does that initial data query, get that output, then go and personalize your science, whatever you're doing.

23:52 Something like that, that's a great way where we can have both an R and a Python capability, as long as it doesn't get too large, right?

24:00 So when we do something like that, you know, we want to try to keep the R and Python packages, one, a similar capability, right?

24:08 So that the output that we get from both packages are going to be the same, that the syntax is going to be very similar, that the functionality is going to be very similar as well, right?

24:18 So basically, you want somebody to look at R and the Python packages, like it's doing the same thing, we're getting the same output, it has no impact on the output of the analysis, regardless of what package you use.
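Editor's sketch: the idea described above, bundling shared "golden rules" into one function so every analyst starts from the same filtered data, might look roughly like the following in Python. The column names, the rule values, and the apply_golden_rules function are all hypothetical; the episode does not describe the guests' actual package or rules.

```python
from typing import Dict, Iterable, List

# Hypothetical business rules; the real "golden rules" are not given in the episode.
EXCLUDED_STORES = {"TEST_STORE", "CLOSED_STORE"}
EXCLUDED_LOYALTY_TIERS = {"employee"}


def apply_golden_rules(rows: Iterable[Dict]) -> List[Dict]:
    """Return only the rows that pass every shared business rule."""
    return [
        row
        for row in rows
        if row["store_id"] not in EXCLUDED_STORES
        and row["loyalty_tier"] not in EXCLUDED_LOYALTY_TIERS
    ]


transactions = [
    {"store_id": "S001", "loyalty_tier": "gold", "amount": 12.5},
    {"store_id": "TEST_STORE", "loyalty_tier": "gold", "amount": 99.0},
    {"store_id": "S002", "loyalty_tier": "employee", "amount": 5.0},
]

# Every analyst calls the same function instead of copying a large SQL script.
filtered = apply_golden_rules(transactions)
print(filtered)
```

Keeping the rules in one place means a rule change is a package release rather than an edit to many private copies of a query. An R package exposing the same function name and output would be the counterpart the guests describe.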

24:30 Yeah, well, it sounds super important, because if you evolve or version that SQL query just a little bit, and they get out of sync, and then you go do a bunch of predictive analysis on top of it, and you say, well, we decided this, but actually earlier, we thought this, but now it's that.

24:47 Like, no, that's just a different query.

24:48 This is the problem, right?

24:50 Like, it's a huge problem.

24:51 Yeah, it seems like you really want to control that. If you can bundle that away into, you call this function, it will tell you what the data is.

24:57 And just maintain that.

24:59 That's great.

24:59 But even that kind of what you're talking about right there, we see the same thing happen when we're building these packages kind of in tandem between the two languages.

25:05 Because it may be easy to kind of like, create that initial package that does a few things, and they're both operating very similar.

25:13 But the second you start getting, you know, eight other folks from across the company, it's like, oh, this is great.

25:18 I want to go and do a pull request and make a slight modification.

25:22 Then it's like, all right, well, I saw the Python just had like eight updates.

25:26 What are we going to do on the R side?

25:27 Are we going to do these exact same implementations or not?

25:30 Or maybe it's a unique thing that's kind of language specific.

25:33 And it's like, well, how do we kind of do that same thing within R?

25:36 And that's where it kind of explodes to be like, okay, there's no way we could actually build every single package we want to build in both R and Python and keep them at the same level.

25:46 Sure.

25:46 That's where it gets difficult to kind of figure out like what direction we need to go.

25:49 Yeah.

25:49 And what's your philosophy?

25:50 Do you let people make these changes and get the best library they can?

25:54 Or is it like, no, they need to be more similar.

25:56 This is a problem.

25:58 We're kind of figuring that out.

25:59 That's been one really interesting experience because in this regard, I mean, both in terms of the size of the data science function and how heterogeneous it is.

26:05 I do think we're maybe, if not totally unusual, we're maybe a little ahead in running into these problems than what I read on the Internet.

26:13 But I haven't read a lot of other people grappling with this problem.

26:15 So, you know, if you're listening and you've done this and you figured out a good strategy, let us know.

26:20 But I think we're still figuring out exactly what it is.

26:22 And so one thing Brad and I have discussed a lot is what are our options for building one underlying set of functionality that then you can interface with from both languages?

26:31 And that's pretty tricky because, you know, there's like an R package called reticulate that you can run Python code in.

26:37 And then there's a Python package called rpy2 that you can run R code in.

26:41 But these things tend to get a little unmanageable because they don't deal with environments the same way that a native Python or R install does.

26:48 And so these things are just challenges.

26:50 We're experimenting right now with a way of tying together R and Python in the same session of a notebook by having them share what's called a Spark session, which is your connection to a Spark cluster.

27:01 And so in theory, under the hood, you could do all the work in one of the languages and return to the user a Spark object, which is translatable to both.

27:08 And so this is one of the things we're experimenting with, but we're trying a few different things.

27:12 But we've definitely found that separately maintaining two identical APIs is extremely challenging.

27:17 And I don't think we can do that for multiple packages going forward.

27:20 Right.

27:21 Yeah.

27:21 You have to have a pretty ironclad decider of the API, and then we'll just manifest that in the two languages.

27:29 And that's also pretty constrained, right?

27:30 Well, it really stifles contributions, right?

27:33 Because like Brad said, people want to issue a pull request.

27:36 And we don't want anybody who contributes to have to know both languages thoroughly enough to build it in both.

27:41 I mean, already we would ask them for documentation and things.

27:43 And it's like you're just broadening the size of the ask and limiting your potential contributors at that point.

27:48 Where's your unit tests?

27:48 And where your unit tests are, right?

27:51 Oh my goodness.

27:51 Yeah.

27:51 Interesting.

27:53 Well, my first thought when you were talking about this as a web developer background was,

27:58 well, maybe you could build some kind of API endpoint that they call.

28:02 And it doesn't matter what that's written in.

28:04 Heck, that could be Java or something.

28:05 Who knows?

28:06 Yeah.

28:06 As long as they get their JSON back in a uniform manner across the different languages, that might work.

28:13 It sounds like the Spark object is a little bit like the data side of that.

28:17 That's the issue, ultimately, that for a lot of the stuff we're doing, we need to actually transform data in some way.

28:23 And so sending a huge, you know, sending many gigabytes of data across a web API is not going to be very efficient.

28:29 Even if you turn on gzip, it's still slow.

28:31 Yeah.

28:32 So that solution is something we've considered also that idea of like, maybe we can subscribe to some kind of REST endpoint and just use that.

28:39 And that works for certain problems.

28:41 But for a lot of our problems, it's ultimately about changing the data in some way.

28:44 So it doesn't work quite as well.

28:45 I see.

28:46 So the ability to directly let the database or Spark cluster do its processing and then give you the answer is really where it has to be.

28:54 Exactly.

28:55 Yeah.

28:56 Okay.

28:56 Interesting.

28:58 What other lessons do you all have from building the packages for two groups?

29:03 People out there thinking, you know, maybe it doesn't even have to be Python or R.

29:07 It could be Python and Java, like I said.

29:09 But there's a lot of these mixed environments out there.

29:12 Although, like I said, I think this is a particularly interesting data science blend at the scale you all are working at.

29:18 One thing I've noticed is that being closely tied into a wide number of people in different parts of your data science function is really important because the way people use things is so different.

29:27 So we talked briefly about how people come from very different backgrounds within our data science function.

29:32 And that means that their understanding of how to use functionality is quite different.

29:37 And one thing I have to resist all the time is building a piece of functionality that to me looks really elegant because I realize the ways that it could be used or it supports some kind of customization.

29:47 For example, and I was talking about this with someone else who works on packages, the idea that maybe the user could pass in a custom function that would then override part of a pipeline or something.

29:56 And I always have to remember that, like, most people aren't going to do that.

29:59 The vast majority of our data scientists aren't attracted to these elegant solutions.

30:03 They just want the purely functional ones.

30:05 Like, what is this Lambda word?

30:06 Why do you make it so complicated?

30:07 Can I just call it?

30:08 Lambdas are a very good example.

30:10 Yeah.

30:10 And so it's good to remember, like, we're building this as a functional thing for people who don't want to learn every aspect of computer science.

30:17 They want to get their data science work done.
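Editor's sketch: the trade-off described here, an optional custom function that can override part of a pipeline while the default call stays simple, could be illustrated as below. The function names and the cleaning rule are made up for illustration and are not from the guests' packages.

```python
from typing import Callable, List, Optional


def default_clean(values: List[float]) -> List[float]:
    """Default cleaning step: drop negative values."""
    return [v for v in values if v >= 0]


def run_pipeline(
    values: List[float],
    clean_step: Optional[Callable[[List[float]], List[float]]] = None,
) -> float:
    """Clean the values, then return their mean.

    Most callers never pass clean_step; the default is used instead.
    """
    step = clean_step if clean_step is not None else default_clean
    cleaned = step(values)
    if not cleaned:
        raise ValueError("no values left after cleaning")
    return sum(cleaned) / len(cleaned)


data = [4.0, -2.0, 6.0]

# The common case: no lambdas, no customization.
simple_result = run_pipeline(data)

# The advanced case: override the cleaning step with a custom function.
custom_result = run_pipeline(data, clean_step=lambda vs: [abs(v) for v in vs])
```

The point made in the conversation is that the first call is the one most data scientists will use, so the override should stay optional and out of the way.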

30:18 Okay.

30:18 Yeah.

30:19 Good advice.

30:20 Brad?

30:21 So I think another thing that we've kind of been running into is, and I think this is more and more common with other companies, is we have kind of a, we have many different tech stacks.

30:30 Basically, we are working on-prem servers.

30:34 We have on-prem Hadoop.

30:36 We are working in two different cloud environments right now.

30:40 So basically, we have, like, four different environments that our data scientists could be using these packages in.

30:46 And so a lot of times it takes a lot of planning.

30:49 Like, are we going to actually try to make this package completely agnostic to whatever environment you're in and be able to use it?

30:56 Or do we just want to say, hey, look, this is a package that does this one capability, but it's specific to this one cloud environment?

31:03 And that takes a lot of planning.
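Editor's sketch: one way a package could stay agnostic across several environments is to look up environment-specific settings from a single switch. The environment names, the DS_ENVIRONMENT variable, and the settings below are invented for illustration; the episode does not say how the guests handle this.

```python
import os
from typing import Dict, Mapping, Optional

# Hypothetical environments and settings, loosely mirroring the four
# environments mentioned in the conversation.
ENVIRONMENT_SETTINGS: Dict[str, Dict[str, str]] = {
    "on_prem_server": {"storage": "local_disk"},
    "on_prem_hadoop": {"storage": "hdfs"},
    "cloud_a": {"storage": "object_store_a"},
    "cloud_b": {"storage": "object_store_b"},
}


def resolve_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Pick settings based on a (hypothetical) DS_ENVIRONMENT variable."""
    env = os.environ if env is None else env
    name = env.get("DS_ENVIRONMENT")
    if name not in ENVIRONMENT_SETTINGS:
        raise ValueError(f"unknown or missing DS_ENVIRONMENT: {name!r}")
    return ENVIRONMENT_SETTINGS[name]


# Passing a mapping explicitly keeps the example independent of the real shell.
settings = resolve_settings({"DS_ENVIRONMENT": "on_prem_hadoop"})
print(settings)
```

The alternative the guests mention, a package tied to one cloud environment, would simply skip the lookup and hard-code one set of settings.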

31:04 I think going into it, you know, like myself, Ethan, several other folks, we have built packages before, but it was largely, like, in more of an isolated environment.

31:15 Or it was just like, I'm just building a package that someone's going to use on their local IDE on their own laptop.

31:22 It's focused, and you know what they're going to try to do with it.

31:25 Right, right.

31:26 So I think we've gotten a lot better at, like, really trying to plan out, like, what do we want this to look like?

31:31 And what are the stages that we're going to take?

31:33 That's still something we have a lot of work to do and get better at.

31:36 But I think the nice thing is we have kind of a group of data scientists that are really getting better at this.

31:42 And it's allowing us to kind of, like, understand, like, good, proper software engineering and approaches to that.

31:48 And I think that's slowly kind of filtering out to the other data scientists.

31:52 As we get smarter, we're trying to upskill other folks on thinking that same way.

31:56 Sure.

31:56 Yeah.

31:57 And building off what Brad's saying about, like, the challenges of building packages in an enterprise environment for people to use them in a variety of different ways.

32:04 One thing that was new to me was, like, building this stuff through enterprise tools is quite different than doing it on your own.

32:10 So a lot of people who maintain things like open source packages are using Travis CI, for example.

32:16 And we have, like, an enterprise CI/CD solution.

32:18 And these things tend to require authentication.

32:21 And they need to be integrated with other enterprise systems.

32:24 And so these things are all, at least for me, things that I never encountered in, like, personal projects or things in school.

32:29 But it is the challenges of working in a large company.

32:31 There's a lot of things that are locked down that require a sign-on in some way.

32:34 You have to pass credentials.

32:35 And these are like a whole new realm of problems to solve.

32:38 Yeah.

32:38 There's definitely more molasses in the gears or whatever in the enterprise world.

32:43 You can't just quickly throw things together, right?

32:45 You might have to do, well, like, my unit tests require a single sign-on.

32:49 Why is that?

32:50 This is really crazy.

32:51 Yeah.

32:51 And mocking gets quite challenging.

32:53 So that's one issue we have where mocking our tests, I mean, it could either be a giant project or we could do it in a mostly correct way.

33:02 You know, we could take a subset of the data and say this is, like, a good enough sample of it.

33:05 But this isn't really representative of what we want this package to do, especially because these are all really integration tests.

33:12 They're all, like, making sure that you actually can connect to the system.

33:15 So if you mock a system, essentially you're taking out one of the things you want to test.

33:18 You want to make sure you actually can connect to the real system because that's the challenge of building this functionality.

33:23 It's such a challenge because sometimes the thing that you're mocking out is simple.

33:28 But sometimes it's such an important system that if you don't do a genuine job of mocking it out, then what's the point of even having the test?

33:37 You know, I'm thinking of, like, complicated databases with 50 tables.

33:41 Yeah, sure, you can tell it's just going to return this data when you do this query.

33:46 But what if the data structure actually changes in the database, right?

33:49 Sure, the tests run because it thinks it has the old data.

33:52 But what does that tell you, right?
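Editor's sketch: the risk described above, a mocked test that keeps passing after the real schema changes, can be shown with Python's standard unittest.mock. The db.query interface, the table, and the column names are hypothetical.

```python
from unittest.mock import Mock


def total_sales(db) -> float:
    """Sum the 'amount' column returned by a (hypothetical) db.query call."""
    rows = db.query("SELECT amount FROM sales")
    return sum(row["amount"] for row in rows)


# The mock is frozen at the schema that existed when the test was written.
mock_db = Mock()
mock_db.query.return_value = [{"amount": 10.0}, {"amount": 5.0}]

# This "test" passes no matter what the real database looks like today.
mocked_total = total_sales(mock_db)


class RenamedColumnDb:
    """Stand-in for the real system after 'amount' was renamed to 'sale_amount'."""

    def query(self, sql: str):
        return [{"sale_amount": 10.0}, {"sale_amount": 5.0}]


# Against the changed schema the same code fails, which the mock never reveals.
try:
    total_sales(RenamedColumnDb())
    real_system_ok = True
except KeyError:
    real_system_ok = False
```

This is why the guests treat many of their tests as integration tests: the connection and the live schema are the things they most need to verify.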

33:53 Or if you're integrating with, say, AWS and talking to S3 and Elastic Transcoder and you've got to get some result or, like, Elastic Transcriber for text.

34:03 And you're going to process those, you know, at some point, you're almost not even testing if you mock it too little.

34:10 And then, like you said, it's a huge project to recreate something like that.

34:13 It's funny you say the 50 tables thing because our central data mart is itself about 50 tables.

34:18 And then occasionally we also rely on things that are created by other data scientists.

34:23 And so, yeah, the scope of it is very large and it changes a lot in the background.

34:26 And then also, I kind of feel that Spark is a much more immature technology than some of the old database technologies.

34:32 And so updates happen that actually change the functionality of the system.

34:36 And suddenly it's like the things that worked before don't work anymore.

34:39 And you're mocking, you know, like if you mock up Spark, that's not going to work.

34:43 It's not going to be the same.

34:43 Yeah, it's going to say the test passed, but it'll wait till production and maybe QA to fail, right?

34:47 Yeah, these are things we have to think about more and more.

34:52 This portion of Talk Python To Me is brought to you by Tidelift.

34:54 Tidelift is the first managed open source subscription, giving you commercial support and maintenance for the open source dependencies you use to build your applications.

35:04 And with Tidelift, you not only get more dependable software, but you pay the maintainers of the exact packages you're using,

35:10 which means your software will keep getting better.

35:13 The Tidelift subscription covers millions of open source projects across Python, JavaScript, Java, PHP, Ruby, .NET, and more.

35:20 And the subscription includes security updates, licensing verification and indemnification,

35:25 maintenance and code improvements, package selection and version guidance,

35:29 roadmap input, and tooling and cloud integration.

35:32 The bottom line is you get the capabilities you'd expect and require from commercial software,

35:37 but now for all the key open source software you depend upon. Just visit talkpython.fm/Tidelift to get started today.

35:47 So you talked about the four different places where code runs, Brad.

35:51 You've got your Hadoop cluster locally, your Spark cluster locally, the two cloud vendors that you're running on.

35:57 Where are you headed?

35:58 Which one of those is legacy and which one is where you're headed?

36:01 Or are they all active?

36:02 We're definitely headed towards a cloud environment.

36:04 Okay.

36:04 The problem that we have, one, we do have data that is quite sensitive still.

36:09 And, you know, we got to make sure that we have all the security aligned within the cloud environment

36:15 before we can transition that.

36:17 And then we just have a lot of historical code still running.

36:20 And so you figure we got, you know, 250 analysts and we have just that many projects going on.

36:27 Like, how do we transition a lot of that code into the cloud?

36:30 So I think it's going to be many years of working in this kind of multi-environment kind of strategy.

36:38 I think ultimately the goal would be to get to a single cloud environment.

36:42 But then also from like a bit, I understand for like a business strategy that locks you into a kind of a certain pricing structure.

36:49 We may try to have multi-cloud environment.

36:52 That's pretty common across companies.

36:53 I think long-term we will try to be mostly in the cloud, whether or not we'll be with one vendor or not.

36:58 That's to be decided.

37:00 Yeah.

37:00 The one thing that I think has changed with our recruiting is that we're definitely looking for folks that aren't scared away from, you know, working in a cloud environment.

37:08 Yeah.

37:08 A lot of students are coming from university that do not have any experience with like a Spark environment.

37:14 And that's fine.

37:15 It's not like we're expecting you to do that.

37:17 But you need to be open and willing and be prepared to work in that environment.

37:21 So that's definitely a big change.

37:22 It's also amazing we got this far without talking about SAS because we, like many, many analytical companies that have been around for more than five or 10 years, still have dependencies on SAS.

37:32 It's just very difficult to migrate off of enterprise tools.

37:35 And so, you know, we've been in the process of migrating from SAS for quite some time.

37:39 And it's funny because when I came in here, which was three and a half years ago, the company was almost entirely on SAS.

37:45 And R was like the upstart language.

37:47 And I think I was one of like two or three Python users in the whole company.

37:50 And things have changed a lot.

37:51 But like making the final cut, severing ties from old technologies is challenging.

37:55 It's one of the reasons we have so many platforms.

37:57 You just end up with production things running on all these platforms and it would be a lot of work to change them.

38:02 So it just moves slowly.

38:04 Well, some of those systems, they're carefully balanced and highly finicky, but important.

38:09 Right.

38:09 And if you try to change them and you break it, all of a sudden that becomes your baby.

38:15 You have to babysit when it cries at night.

38:17 Right.

38:17 Like I'd rather not have SAS, but more than that,

38:22 I'd rather not touch that thing and make it my responsibility because currently it's not.

38:25 Right.

38:25 Like that's certainly an enterprise sort of experience, right?

38:29 Yep.

38:30 For sure.

38:30 Yeah.

38:31 Well, what's the transition been like from this somewhat expensive commercial product over to the combination of Python and R?

38:40 Was that easy?

38:41 Was it hard?

38:42 Did people welcome it?

38:43 Did they resist it?

38:45 You both do some on the education side within the company.

38:48 So you probably have a lot of visibility into how that first impression went.

38:52 So I used to lead our introduction to Python trainings.

38:56 So we have, like I said, some continuing education classes in the company.

39:00 And I will say I was just so surprised by the reaction people had to a new technology training the first time I gave it.

39:07 Because coming out of a computer science program where you sort of get thrown into languages, like I had a class in Java my senior year and I'd never used Java and the professor just sort of expected we'd pick it up.

39:15 I never really thought about this idea that you would be resistant to learning new technologies.

39:20 But when one software tool has dominated your industry for 20 years as SAS had, it's just really unfamiliar.

39:27 So I gave this course and a lot of people were asking questions that to me it was like, well, obviously you would Google that.

39:31 You know, like obviously you would look in the docs.

39:34 Obviously you do this.

39:35 And it's not obvious because these people come from a closed source tool that is carefully maintained and is highly backwards compatible.

39:41 But at the same time, it's not nearly as dynamic an ecosystem as something like Python or R.

39:46 And I think I've watched the culture change a lot since I started.

39:50 And I look at even the people coming out of school that start and they're so much more willing to jump into things, which I think is great.

39:56 And even the people that were here when I started have gotten more that way as well.

40:00 People have just like learned that the culture of open source tools is very different and you have to be more willing to jump from thing to thing.

40:06 And as we introduce new technologies, because we still do, people are more able to learn those, which I think is really great.

40:11 I can certainly see that.

40:12 You know, I'm somewhat sympathetic to those folks.

40:14 If you have spent a long time and you are very good at the tasks that you have to get done in one language or one technology and you've got to switch over, it's all of a sudden like I feel brand new again.

40:28 Like I remember not being able to load a file.

40:31 I remember not being able to actually properly, efficiently query a database.

40:35 I remember like all these things you're like, all of these are, I have these problems again.

40:39 I thought like I was beyond that.

40:40 Right.

40:40 And that's certainly challenging.

40:43 The other thing I think that makes it tricky going from something like that, or, you know, even something from like, say, C# and .NET, where there's like a Microsoft that says, here's what the web stack looks like.

40:54 And we'll, we'll tell you in six months what the changes are going to be is in the open source space.

41:01 There's probably 10 different things that do what you want to do.

41:04 And how do you know which one of those to pick?

41:07 And then once you bet on one and you work on it for a while, all of a sudden, either maybe it gets sort of, it loses cachet or something else comes along.

41:16 There's just, instead of being one or two ways to do a thing, now there's 20.

41:20 And it's like, I'm kind of new here.

41:22 So how do I even decide which of those 20?

41:24 Because it's, it's hard.

41:26 Yeah, that is absolutely a concern that people had.

41:28 I all the time would get this question because I was known as the Python guy early on.

41:31 You know, like, what do I do if this package changes?

41:34 Or like, how do I know this is still going to work?

41:36 There's no company that's behind this tool.

41:38 And if you come from this world, like if you come from the open source side, you think two things.

41:44 One, like most of the time the stuff keeps working.

41:46 Like the core functionality is extremely stable.

41:48 All the most popular open source languages, like they don't just stop being maintained.

41:53 Stuff is extremely, extremely well used.

41:56 And also, you know that if a package is no longer maintained, you look for another one because that stuff happens dynamically.

42:01 And it's unusual.

42:02 You'd have to be using something pretty fringe.

42:04 But it's unusual for you to end up being just out of luck in terms of having some functionality available to you.

42:10 Right.

42:10 Yeah, there's some few edge cases, but it's not common.

42:12 And, you know, there's always the, well, you can fork it and just run it.

42:16 Right.

42:16 If you use something that's pretty mature, the chances that it has a massive show-stopping problem discovered down the road, they're not that high usually.

42:25 Things stick around.

42:26 Right.

42:26 NumPy is probably not going to go unmaintained.

42:28 You know.

42:29 Exactly.

42:29 Django still has users.

42:31 Things like that, right?

42:32 You know, that's one thing that we do try to do internally.

42:34 And it's one thing that we're trying to get a little bit more smart on how we do it.

42:38 But with so many packages and so many capabilities out there, it's like, how do we make sure people are using kind of like a core set of packages that we kind of endorse or do the primary things that we want to do?

42:50 We try to create a little bit more structure around like what packages should we be using internally?

42:55 And then what's the process of bringing in like a newer package, right?

42:58 Do you guys have like a whitelisting process where you sort of vet them or how does that work?

43:03 We're a little bit better about setting up like a sandbox area where, you know, if we find a package that is new or even like a package that's just on GitHub and not PyPI or CRAN, then how can we bring that in?

43:15 Do some testing.

43:16 Make sure that there's not any like interactions going on within our servers or whatever.

43:21 And then as long as we kind of pass all those regression tests, then yeah, okay, we can start bringing that in formally as a standard package in our servers or wherever.

43:32 Do you have a private PyPI server?

43:34 I don't know if you do have a CRAN, but CRAN server as well, like that you have more control over or do you let people just pip install straight off the main?

43:41 We use Artifactory, which is a tool that basically sets up those package repositories and you can have it clone them.

43:48 So, you know, we have what looks like a copy of PyPI, but then we blacklist certain things.

43:52 We whitelist certain things depending on the environment.

43:54 Yeah.

43:55 And it works for CRAN as well.

43:56 That's really cool.

43:57 That's a quite, it seems like a very elaborate system, but for you all, it sounds like it's the right thing.

44:02 The nice thing about that is with our internal packages, we can actually have our CI/CD process push them to that Artifactory so that people can do, you know, pip install with the package name, or install.packages in R with that package name.

44:17 And it's like you are importing it from PyPI or CRAN, but really you're just pulling it from our internal Artifactory.

44:24 Yeah.

44:24 When you have a scale of 250 people, you almost want to say the way that you share code across teams is the same way that you share code across open source, right?

44:33 It's, you create these packages, you put them into, you version them, you put them into Artifactory, maybe even pin your version in your deployment, things like that, right?

44:43 Is that what you do?

44:44 Sort of like getting that standard across the, getting the knowledge of these standards across the business is like one of our chief challenges.

44:50 Because just like open source, people don't necessarily hear about new packages that solve problems they've been encountering a bunch of times.

44:57 So while we encourage people to do things like pinning packages, we're still at an even earlier step where it's like, be aware of what new functionality is in these packages we're using.
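
To illustrate the pinning idea mentioned here, the following is a minimal sketch, not the guests' actual tooling. It assumes pins are written as `name==version` lines, and the package name `acme-spark-helpers` is made up for the example. It uses only the standard library's `importlib.metadata` to compare pins against what is installed.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Return {package_name: version} for every 'name==version' line."""
    pins = {}
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blank lines and unpinned requirements
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins):
    """Compare pins against the installed environment.

    Returns {name: (pinned, installed)}; installed is None when the
    package is not installed at all.
    """
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


# Hypothetical pins; "acme-spark-helpers" is a made-up internal package.
example = """
numpy==1.17.2
acme-spark-helpers==1.0.0  # hypothetical internal package
pandas
"""
pins = parse_pins(example)
```

Calling `find_mismatches(pins)` in a deployment check would flag any package whose installed version has drifted from its pin.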

45:06 Because all the time I see people setting up elaborate configurations with Spark.

45:10 And then I tell them we have a package that we, you know, we first released 1.0 like two months ago.

45:16 And it's like, all of this could be done for you, you know, and we can send as many emails as we want.

45:20 But people who work with 249 other data scientists delete a lot of emails because there's too many.

45:25 So, so finding a good.

45:28 Yeah.

45:28 It's another plague in the enterprise is like everyone thinks that you need to be copied.

45:32 Like if there's even a chance you need to know about it and what it results in is if everything's important, nothing is important, right?

45:39 A big problem.

45:40 Yeah.

45:40 And so figuring out how to socialize these, the way to use these packages and what packages even exist.

45:46 And then beyond that, like how to use them properly and version things properly is always something we have to think carefully about.

45:52 How do you all do that?

45:53 I mean, just letting folks know, like there is now a library that you can install that solves this problem or does that, or it has this challenge.

46:02 We're looking for feedback on how to make it better.

46:03 How do you get the word out about your projects and packages internally?

46:09 So we've tried to start doing a little bit of like beta testing.

46:12 So if we have a brand new package we're developing before we actually do a full release, we'll try to get a group of folks that do some beta testing on it to kind of give feedback.

46:21 You know, one, is the functionality there?

46:23 Are there bugs that we're missing?

46:24 Two, is the syntax kind of logical coming from more of the data science perspective?

46:30 And then three, is the documentation there that they need to basically pick up from no knowledge of it and start applying it?

46:37 And that gives us that good initial feedback.

46:39 And then once we start getting the first release and everything, right now what we are doing is basically doing that email blast to all our data scientists and saying, here's the version number.

46:49 Here's what's new, what you need to know, how it impacts you.

46:53 But ultimately, I think what I've learned is that the most important thing is having advocates across the company that know about this.

47:00 Because often new functionality will arise and will only take over in part of the business.

47:05 When you have 250 people, it's like who knows about what is very different across teams.

47:10 And so one of the things we focused on with like our beta testers is making sure that this is a well-rounded group of people in different teams.

47:16 So those people serve as sort of the evangelists to tell other people on their team like, well, when you run into these problems, you should be doing this.

47:23 And that's really the only way to get that information across.

47:26 Because we can't sit in everybody's meetings and we can't like go and look over people's shoulders as they code.

47:31 So we need other people to do that for us.

47:33 So our first adopters are really the people that help.

47:35 That sounds like a pretty good way to set things up.

47:38 I want to come back to this, building these two packages and one, the same package for both languages.

47:45 And I'm not sure if we exactly covered it.

47:48 Do you try to have the same API for both as close as possible?

47:54 Or do you try to have something Pythonic for the Python one and something that's maybe effectively the same, but very much what R folks expect?

48:03 Like what is your philosophy in trying to build these packages for both groups?

48:06 That's a great question.

48:07 We try to balance the two.

48:09 We want the syntax, the API to be very similar across both.

48:14 But obviously we want folks that are coming from the R side or from the Python side to feel very natural in using it.

48:20 And so that means that we can't always have the exact same comparable syntax across the two.

48:26 You know, in R, it's very common within like the tidyverse packages, if you've heard that, where there's no quoting.

48:33 There are ways that you can remove the quotation marks from argument inputs and everything.

48:38 So that's kind of a natural thing that we do, where if you look at the Python side, you get the same arguments and the same valid inputs that you could supply.

48:45 But you're going to have quotes versus non-quoted.

48:48 And so that can be differences.
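
The quoted versus unquoted difference described here can be sketched as follows. This is a hypothetical function written only for illustration, not part of any package discussed in the episode; the name `total_by` and the sample data are assumptions.

```python
from collections import defaultdict


def total_by(rows, by, value):
    """Sum the `value` field of each row, grouped by the `by` field.

    `by` and `value` are column names passed as quoted strings.
    """
    totals = defaultdict(float)
    for row in rows:
        totals[row[by]] += row[value]
    return dict(totals)


sales = [
    {"region": "east", "sales": 10.0},
    {"region": "west", "sales": 4.0},
    {"region": "east", "sales": 2.5},
]

# Python: column names are strings, so they are quoted.
result = total_by(sales, by="region", value="sales")
# A tidyverse-style R counterpart could accept the same arguments as bare
# names instead, e.g. total_by(sales, by = region, value = sales)
```

The arguments and valid inputs stay the same across the two languages; only the quoting convention changes.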

48:49 And then there's, you know, other kind of differences.

48:51 And underneath the hood, like how do you do logging?

48:54 Obviously, that's going to be a little bit different in both of the two languages.

48:58 And one thing that's really hard to avoid is that fundamentally object orientation is extremely different in R and Python.

49:05 And, you know, R has things.

49:07 I believe the term is method dispatch.

49:09 So methods don't come after object names.

49:11 They come before and they look like standard functions.

49:14 And so things are just, if we wanted to build an exactly identical API, we would actually have to jump through a lot of hoops.

49:20 That wouldn't be very native to either of the languages.

49:23 So like Brad said, it's a fine line to walk.

49:25 We want it to be recognizable and similar, but we don't want to sacrifice the merits of the language for that.
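
For Python readers, the "function comes before the object" style can be approximated with the standard library's `functools.singledispatch`. The sketch below is only an analogy to R's generic functions, not how the guests build their packages; the `Forecast` class and `describe` function are made up for the example.

```python
from functools import singledispatch


class Forecast:
    """Toy model object using Python's usual method-after-object style."""

    def __init__(self, values):
        self.values = values

    def describe(self):
        return f"Forecast with {len(self.values)} values"


@singledispatch
def describe(obj):
    """Generic function: the implementation is chosen by the type of `obj`."""
    return f"Object of type {type(obj).__name__}"


@describe.register
def _(obj: Forecast):
    # Reuse the method so both call styles give the same answer.
    return obj.describe()


@describe.register
def _(obj: list):
    return f"List with {len(obj)} items"


model = Forecast([1.0, 2.0, 3.0])
method_style = model.describe()   # Python-native: object first, then method
function_style = describe(model)  # R-like: function first, dispatch on class
```

Both calls return the same string, which shows why an identical API is possible but would feel unnatural in one language or the other.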

49:30 Sounds like a good balance.

49:31 You know, let's round out our conversation with one more short topic.

49:36 When you think about data science, a lot of times, at least I think about things like Jupyter Notebooks, Jupyter Lab, maybe RStudio, and this exploring data.

49:48 When I think about product, like productizing, putting this behind some REST API or putting it in production, some of this stuff, I don't think notebooks and RStudio anymore.

50:00 What does that transition look like?

50:02 Like, how do you guys take this research and turn it into products like services and APIs and whatnot that can run?

50:09 So there's a few different approaches.

50:11 Historically, so I used to work on our digital team that would build the recommender systems for the Kroger website.

50:17 So much like Amazon's website, Kroger's website has like, because you bought this, you might also like this.

50:23 Right.

50:23 And so we need to find ways to serve recommendations.

50:25 Historically, that was largely done in batch style processing.

50:29 So at the beginning of a given week or something, we would say for each customer identifier, these are the products that they should get.

50:36 And we would send over these flat files.

50:37 But increasingly, we're moving to something that looks more like we will ship you an actual, usually it's a container, but some kind of item that takes input and gives output.

50:47 So we can serve up dynamic recommendations.

50:49 So people, I think a common workflow for things like that is people build their model and do their exploration and do their just like modeling initially in notebooks or in RStudio.

51:00 But then they package this up as some kind of product that ends up being much more polished.

51:05 So in some cases, if we ship a container, people need to actually dockerize that and make sure that it can be used by someone external and then throw it over the wall to whoever actually manages the Kroger website, for example.
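
The contrast between the batch flat file and an item that "takes input and gives output" can be sketched like this. Everything here is assumed for illustration: the scores, customer ids, and function names are hypothetical, and a real deployment would put the on-demand function behind a service inside the container.

```python
import csv
import io

# Hypothetical scores: customer id -> {product: score}.
SCORES = {
    "c1": {"milk": 0.9, "eggs": 0.4, "bread": 0.7},
    "c2": {"coffee": 0.8, "tea": 0.6},
}


def recommend(customer_id, top_n=2):
    """On-demand style: take an input, return recommendations right away."""
    scores = SCORES.get(customer_id, {})
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]


def write_batch_file(customer_ids, out, top_n=2):
    """Batch style: precompute one row per customer into a flat file."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["customer_id", "products"])
    for customer_id in customer_ids:
        writer.writerow([customer_id, "|".join(recommend(customer_id, top_n))])


buffer = io.StringIO()
write_batch_file(sorted(SCORES), buffer)
flat_file = buffer.getvalue()
```

The batch file is fixed until the next run, while `recommend` can respond to whatever customer id arrives at request time.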

51:17 Okay, interesting.

51:17 Brad?

51:18 Yeah, and lots of times, you know, if you're doing like machine learning type of models, you know, you're not going to, you know, basically build a script that's got the machine learning code in it and use that to kind of score incoming observations.

51:31 Most of the time, you're going to have some kind of Java output product from these.

51:35 So DataRobot is a tool that we use internally that allows you to do kind of like automated machine learning tasks.

51:42 H2O is another very popular one.

51:44 And both of those have very similar like R and Python APIs to perform machine learning.

51:50 But the thing is, once you get done with kind of like finding what that optimal model is, then you typically are kicking out, you're going to use like a code gen or like a POJO, which just ends up being kind of like a Java object.

52:02 And that's what we can use to kind of like score new observations.

52:05 And so that gets away completely from using any kind of a notebook or scripting as far as like a Python or R scripting capability.

52:13 Okay.

52:13 Have either of you played around with or entertained the idea of something like Papermill?

52:18 Remember Papermill?

52:19 We do use Papermill internally for a couple of things.

52:22 Maybe tell people just like the quick elevator pitch of what that is so they know.

52:26 A little bit of background.

52:26 So notebooks, if you've used Jupyter notebooks before, they are designed for interactive work.

52:30 They're designed for like run this line of code, see this output, etc.

52:33 They're not really designed to be automated.

52:35 They don't lend themselves to being run from the command line.

52:38 And Papermill is a package that Netflix has produced that is open source that lets you automate notebook runs.

52:44 So you get the benefits of notebooks where you get this output in line with your work.

52:48 And you can do the development of the notebook as you did before.

52:51 But now you can batch these things.

52:54 You can run them every night or whatever you want to do.

52:56 I was going to say one can call the other.

52:57 You can almost treat them like little functions with inputs and outputs.

52:59 Yeah, exactly.
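
Treating a notebook like a function with inputs can be sketched with Papermill's `execute_notebook` call. This is a minimal example under stated assumptions: the notebook paths, parameter names, and helper functions are hypothetical, and the notebook is assumed to have a cell tagged `parameters` for Papermill to override.

```python
from datetime import date


def build_parameters(run_date, region):
    """Values injected into the notebook's cell tagged 'parameters'."""
    return {"run_date": run_date.isoformat(), "region": region}


def run_report(input_path, output_path, run_date, region):
    """Execute a notebook non-interactively and save the executed copy."""
    # Imported here so the helper above works without papermill installed.
    import papermill as pm

    return pm.execute_notebook(
        input_path,
        output_path,
        parameters=build_parameters(run_date, region),
    )


params = build_parameters(date(2019, 9, 27), "midwest")
# Example call (needs papermill plus a real notebook at the hypothetical path):
# run_report("weekly_report.ipynb", "out/weekly_report_2019-09-27.ipynb",
#            date(2019, 9, 27), "midwest")
```

The executed copy keeps its outputs inline, so a nightly scheduler could call `run_report` and leave behind a readable record of each run.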

53:00 Yeah.

53:00 So you used them a little bit?

53:01 We use them internally for a couple of things.

53:03 Mainly, so again, because we are on so many different platforms, it's like different things come to different platforms at different times.

53:09 Especially because these are managed platforms everyone has to work on.

53:12 So I can't just go install Papermill because that's going to affect everyone.

53:16 So we have Papermill set up on some of our on-prem environments.

53:18 And I think we're still working on getting it set up in the cloud.

53:21 But in general, a lot of the stuff that we would want to automate in the cloud is a little easier to end up scripting in the end.

53:28 I think that's a heuristic.

53:29 Not always true, but often.

53:31 Yeah.

53:31 And one thing that we're using or kind of moving towards is using Databricks.

53:35 And there's a lot of functionality within Databricks that allows you to kind of parameterize and automate the runtime of those scripts.

53:41 So it ends up being kind of a notebook that can operate a little bit like Papermill.

53:46 That seems like the shortest or the most native way to bring where the work was originally done into productized data science.

53:54 But also, I see definitely some engineering challenges around that, you know, like testing, refactoring, etc.

54:01 I think historically, what we've primarily gone to is basically having your .R or .py scripts and just automating those with normal batching.

54:11 Yeah, it makes a lot of sense.

54:12 All right.

54:13 Well, I think we're just about out of time, so we'll have to leave it there.

54:17 But this was a really fascinating look inside what you all are doing there because it sounds like you're operating at a scale that most folks don't get to operate at maybe just yet.

54:25 Who knows?

54:25 Yeah, it's good to talk.

54:26 Absolutely.

54:26 Now, before I let you two out of here, I've got to ask you the last two questions.

54:30 So let's go with you, Ethan, first.

54:32 If you're going to write some data science code, I'll generalize it a little bit this time around.

54:37 What editor do you use?

54:38 I'm pretty all in on Vim, which is not all that popular in data science.

54:41 I use Jupyter Notebook sometimes.

54:43 I've found a lot of extensions that let me use the Vim key bindings, but old habits die hard.

54:47 I've dabbled in VS Code recently, but I always come back to Vim.

54:49 All right.

54:50 Cool.

54:50 Brad?

54:50 I do write a lot of R code, so RStudio is kind of my go-to.

54:54 Even when I write my Python, I definitely enjoy writing it within RStudio.

54:58 RStudio actually supports multiple different languages, and it's just one of those editors

55:02 I've gotten used to.

55:03 I do use Notebooks sometimes when I'm teaching, whether it's an RStudio Notebook or Jupyter Notebook.

55:10 But if I want to be truly Pythonic, I go to PyCharm.

55:13 It's funny.

55:13 Editors are one of those things that once you get really comfortable with it, you can just

55:18 be more effective in the one that you like, right?

55:20 Exactly.

55:20 Yeah.

55:20 Cool.

55:21 And then a notable PyPI or, Brad, if you want, CRAN package for folks out there, something

55:27 some library you ran across that people should just know about.

55:30 Maybe not the most popular, but you're like, I found this thing and it was amazing and I

55:34 didn't even know about it.

55:34 I am all in on this package called Altair.

55:37 I think you've had Jake VanderPlas on the pod before.

55:39 I have, yeah.

55:40 Earlier, I alluded to Python's relatively weak visualization ecosystem for data science.

55:45 And it is, I teach some Python classes both internally and at the University of Cincinnati.

55:51 And teaching the visualization ecosystem is just terrible every time.

55:54 It's so bad.

55:55 And Matplotlib and Seaborn are so difficult to use and inconsistent.

55:58 And Altair is like the hope that I have for the Python ecosystem.

56:02 I think it's so, so nice.

56:03 I just want to see more adoption.

56:05 But it's great.

56:05 Like if you work in data science, you should absolutely switch to Altair.

56:09 And if you are used to ggplot's really nice encoding-style syntax and encoding channels,

56:13 Altair is such a good answer in Python.

56:15 Yeah, I've heard really good things about it.

56:17 I haven't actually done that much with it myself.

56:20 But yeah, it's definitely good.

56:22 And Jake does nice work.

56:23 Yeah.

56:24 Brad?

56:24 I've been spending a lot more time in both R and Python.

56:27 So I'm getting more and more drawn towards packages that are available in both languages

56:33 and that kind of have very similar syntax or API.

56:36 So a few good ones.

56:38 So we use DataRobot internally.

56:40 So that's got a similar R and Python API.

56:43 H2O is another machine learning package that I really like.

56:47 And if you look at those, it's really tough to tell the difference between the R and the

56:50 Python syntax.

56:51 It must be nice when those exist for your ecosystem, right?

56:54 Yeah.

56:55 And even like TensorFlow and Keras, I've been doing a lot of stuff with deep learning lately.

56:59 And the R Keras and TensorFlow, I mean, basically it's using reticulate to communicate

57:04 with the Python Keras.

57:06 So the syntax between those two are very similar as well.

57:09 Yeah.

57:09 We talked about mocking stuff earlier.

57:11 I want to just throw out Moto.

57:12 If you guys use AWS, Moto will let you mock out almost every AWS service.

57:17 Really?

57:19 Okay.

57:19 Yeah.

57:20 If you want to mock out the API for EC2, you can do it.

57:23 You want to mock out S3, you can do it.

57:25 It's all in there.

57:25 So Boto is the regular API in Python.

57:28 Moto is the mock Boto, right?

57:30 Gotcha.

57:31 So is that built and maintained by AWS folks?

57:34 I don't think so.

57:35 It definitely doesn't look like it.

57:36 But anyway, there's some interesting things you can do with like a local version of it and

57:40 all sorts of funky stuff.

57:42 It looks like someone put a lot of effort into it.

57:44 Trying to solve that mocking problem.

57:46 It's a lot of work, right?

57:47 Yeah.

57:48 It is.

57:49 Yeah, definitely.

57:49 Cool.

57:50 All right.

57:50 Well, Ethan, Brad, this has been really interesting.

57:53 Final call to action.

57:54 People maybe want to try to create these unified environments or work better across their data

58:00 science teams.

58:01 What will you tell them?

58:02 It's worth investing in having some kind of centralized data science team within your overall

58:06 data science department that works on these resources.

58:09 You know, you have to carve out time for people.

58:11 Brad and I are both, I think, pretty lucky to be able to do this as most of our job.

58:15 If you don't have that time carved out, people just don't have time to contribute to centralized

58:18 resources and you end up with a lot of duplication of work.

58:21 And it's also good for some of your data scientists to have a more technical background

58:25 and be able to think about this stuff.

58:26 And I think we've benefited very much from that.

58:27 Yeah, it sounds like it.

58:28 Brad?

58:29 Yeah.

58:29 And I would say for the data scientists, like, you know, historically, you've been able to

58:33 kind of focus in one language.

58:34 And I think that's becoming less and less common.

58:36 So I think a lot of people need to be flexible in understanding both languages.

58:41 You know, you may be dominant in one, but at least be able to have some read capability

58:46 in the other one.

58:46 And one thing I've definitely benefited a lot from is working closely with Ethan and some

58:51 of the other folks in the company that are strong Python programmers.

58:54 There's a lot of good exchange of knowledge.

58:56 And once you start understanding, like, different types of languages, you kind of see the same

59:00 patterns that exist.

59:01 And that can really help you become a stronger developer.

59:04 There's always good stuff on both sides.

59:06 And if you can bring it over the fence, it's good, right?

59:08 Yeah, definitely.

59:08 Well, thank you both for being on the show.

59:10 It's been really interesting.

59:12 Yeah, thanks, Michael.

59:12 Yeah, thank you very much.

59:14 This has been another episode of Talk Python To Me.

59:17 Our guests on this episode were Ethan Swan and Bradley Boehmke.

59:20 And it's been brought to you by Linode and Tidelift.

59:22 Linode is your go-to hosting for whatever you're building with Python.

59:26 Get four months free at talkpython.fm/linode.

59:30 That's L-I-N-O-D-E.

59:32 If you run an open source project, Tidelift wants to help you get paid for keeping it going

59:37 strong.

59:37 Just visit talkpython.fm/Tidelift, search for your package, and get started today.

59:44 Want to level up your Python?

59:45 If you're just getting started, try my Python Jumpstart by Building 10 Apps course.

59:50 Or if you're looking for something more advanced, check out our new async course that digs into

59:55 all the different types of async programming you can do in Python.

59:58 And of course, if you're interested in more than one of these, be sure to check out our

01:00:02 Everything Bundle.

01:00:03 It's like a subscription that never expires.

01:00:05 Be sure to subscribe to the show.

01:00:07 Open your favorite podcatcher and search for Python.

01:00:09 We should be right at the top.

01:00:10 You can also find the iTunes feed at /itunes.

01:00:14 The Google Play feed at /play.

01:00:15 And the direct RSS feed at /rss on talkpython.fm.

01:00:20 This is your host, Michael Kennedy.

01:00:21 Thanks so much for listening.

01:00:23 I really appreciate it.

01:00:24 Now get out there and write some Python code.

01:00:26 And I'll see you next time.

01:00:26 Bye.

01:00:27 Bye.
