Transcript for Episode #236:
Scaling data science across Python and R

Recorded on Friday, Sep 27, 2019.

0:00 Michael Kennedy: Do you do data science? Imagine you work with over 200 data scientists, many of whom have diverse backgrounds, who have come from non-CS backgrounds. Some of them want to use Python. Others are keen to work with R. Your job is to level the playing field across these experts through technical education and to build libraries and tooling that are useful for both Python and R loving data scientists. It sounds like a fun challenge, doesn't it? That's what Ethan Swan and Bradley Boehmke are up to, and they're here to give us a look inside their world. This is Talk Python To Me, Episode 236, recorded September 27th, 2019. Welcome to Talk Python To Me, a weekly podcast on Python: the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm. And follow the show on Twitter via @talkpython. This episode is brought to you by Linode and Tidelift. Please check out what they're offering during their segments. It really helps support the show. Ethan, Brad, welcome to Talk Python To Me.

1:15 Panelists: Thanks, good to be here. Yeah, thanks.

1:17 Michael Kennedy: Yeah, it's great to have you both here. It's going to be really fun to talk about enabling data science across large teams and this whole blended world of data science, which, it sounds pretty good, actually, sounds like a positive place.

1:31 Panelists: Yeah, it's definitely getting more and more tangled, too.

1:33 Michael Kennedy: Yeah, I can imagine. I can imagine. So we're going to talk about things like R and Python and how those can maybe live together, how to bring maybe some computer science techniques and stuff to work together across these different teams and so on. But before we do, let's get started with your story. How'd you get into programming and Python? Ethan, you want to go first?

1:55 Panelists: I went into college as an undecided engineering major and didn't really know what I wanted to do. But I was pretty sure it wasn't computer science. I was pretty sure that was for people who sat in front of a computer. That sounded very boring. And I got into the intro class for engineering and picked up MATLAB and just loved it. So from that point forward, I did some C and C++ in college and then came out of college and started working in data science. I started with a little bit of R and then found I was a lot more comfortable with Python. And now I use it in my job and also a little bit outside of work for some personal projects. So I really, really enjoyed it after going through a number of languages.

2:26 Michael Kennedy: It's interesting that MATLAB was sort of the beginning programming experience. I think looking in from the outside at the computer programming world, a lot of folks probably don't think that. But you know, I went through a math program. And when I was studying, for a lot of people, their first programming experience was working in MATLAB and .m files and all that stuff.

2:46 Panelists: Yeah, well, it seems to be very useful across other engineering fields. And so, and also it's relatively friendly. It's not like learning C or C++, which probably scare a lot of people away.

2:57 Michael Kennedy: Yeah, absolutely, absolutely. Brad, how about you?

2:59 Panelists: My background's much more along economics. So I was in the Air Force doing a lot of lifecycle cost estimates for weapons systems, aircraft, and the like. And a lot of that was done in Excel. And when I went up and I did my PhD, I started getting a lot of my research data. It was gnarly, just stuff spread out all over the place, ugly.

3:19 Michael Kennedy: And your PhD was in economics?

3:21 Panelists: No, it, kind of yes and no. I had a unique PhD. It's technically called logistics, but it was kind of a hybrid of economics, applied stats, and ops research.

3:31 Michael Kennedy: Oh, right, okay, cool.

3:32 Panelists: Yeah, so and the problem was I spent a couple of months just trying to figure out, how can I clean this data up and do my analysis within Excel? It was horrible. And so that was about the same time that Johns Hopkins came out with an online data science course through Coursera. And they featured, or primarily focused on R. And that was kind of when I decided, all right, I need to take a programming language to really get through this research.

3:57 Michael Kennedy: You've outgrown Excel in the extreme, right?

3:59 Panelists: Yes, yes, yep.

4:02 Michael Kennedy: So did you abuse it pretty badly? Were you trying to make it do things it just wouldn't?

4:05 Panelists: You know, it's funny because the work I was in within the Air Force, it was your classic abuse Excel as much as possible, right? You open it up, you've got a workbook that's got like 26 worksheets. You got stuff that is hyperlinked all over the place. You got hard-coded changes going on in there, and you leave for one week, you come back, and there's just no way you could reproduce anything. And that was exactly what I was running into. And so that's really what got me into programming.

4:32 Michael Kennedy: I think there's a lot of people out there who definitely consider themselves not programmers. And yet they basically program Excel all the time, right?

4:41 Panelists: Right.

4:42 Michael Kennedy: And a lot of folks could follow your path and just add some programming skills and really be more effective.

4:48 Panelists: I think that's kind of the theme of past shows. I know you bring that up a bit, where a lot of people would benefit from having a little bit of programming skill that they could bring to their regular job rather than being full-time programmers. And that seems very true.

4:59 Michael Kennedy: Yeah, and it sounds exactly like this is a scenario for that. And it's definitely something that I'm passionate about, so I bring it up all the time. The other thing I think that's interesting about programming, programming in quotes, in Excel is, we did a show called Escaping Excel Hell. And one of the themes is Excel is basically full of all these go-to statements, right? Like you just go down, it says go to that place, then go over here, then go across this sheet over to that. It's totally unclear what the flow of these things are. It's bizarre. Alright, so definitely programming languages are better. You both work at the same company. Let's talk about what you do day to day, 'cause you're sort of on the same team, right?

5:40 Panelists: Sort of. We collaborate very tightly. So I actually work on the education team. So our company's called 84.51. We're a subsidiary of Kroger. We're mainly their data science marketing agency. And we both work within the data science function. So my team is mainly involved with upskilling the function just generally. That may mean scheduling classes for people that are new starters. It may also mean what we call continuing education, so figuring out what people need to learn going forward to stay relevant in the industry. I tend to be more on the technical side of that team. That means that I collaborate more tightly with Brad's team, which is more aligned to the technology.

6:12 Michael Kennedy: Yeah, for sure. And Brad, how about you?

6:15 Panelists: Yeah, so my team really focuses on building kind of like internal components or internal packages. I'm sure we'll talk more about this a little later, but we have about 200 data scientists that are at some point transitioning to using R and Python primarily or already are. So we try to standardize certain tasks as much as possible. And we'll wrap that up into an R or Python package and kind of have that centralized business logic for our own internal capabilities as a package in either R or Python. So our team just focuses a lot on building those packages.

6:49 Michael Kennedy: Yeah, that sounds super fun. Sounds almost as if you're a small software team or company building all these tools for the broader company, right, or the broader data science organization.

7:01 Panelists: One thing that's definitely becoming more and more clear is we have kind of the traditional data scientists. Then we have the traditional engineering function within the company. And there's kind of that big void in between that kind of bridges that gap, right, where you have folks that have somewhat the software engineering capabilities, but they're coming from more of a data science perspective, right, and they can build things that are a little bit more geared directly to how the data scientists work.

7:25 Michael Kennedy: Yeah, interesting.

7:26 Panelists: We have about 250 total data scientists, just for a sense of scale, which is one of the reasons that we have a dedicated internal team to enable them, because at that scale, so many people are doing similar work that it makes sense to automate some of that stuff, to build it into packages and things like that.

7:40 Michael Kennedy: I can't think of many other companies that have that many data scientists. Why don't you tell folks what Kroger is, because I know being here in the US, certainly spending some time in the south there, Kroger directly is there. But they also own a bunch of other companies and stuff. So maybe just give people a quick background so they know.

8:00 Panelists: Kroger is in I believe 38 states and has something on the order of 3,000 stores. So it's just an enormous grocery chain in the US. So you may not have seen Kroger itself under the name Kroger, because they own other chains, Ralphs, Food 4 Less. I think there's 20 different labels. But yeah, they're all over the place. And so it makes a lot of sense to have some sort of customer analytics organization, which is what we are.

8:24 Michael Kennedy: There's a lot of analytics around grocery stores and things like that and how you place things. There's the story of putting the bananas in the back, in the far back corner, and things like this, right?

8:35 Panelists: There's definitely a lot of different areas. So yeah, the banana story, or like the milk in the back, people often tell a story that might actually be apocryphal, this idea that these things are in the back because it makes people go get them and walk through the rest of the store. It might be true, but at this point it's so ingrained I'm not sure anybody knows. But there's other areas, too, where it's like, what kinds of coupons do you mail people? So in general when people ask me what my company does, the simplest summary is when you get coupons from a grocery store, that's people like us, essentially, where based on what you bought in the past, we know that you would probably appreciate these kinds of coupons.

9:07 Michael Kennedy: Largely the way you probably collect data, I can imagine two ways or maybe more, is one just when people pay with a credit card, that credit card number doesn't change usually, right? So you can associate that with a person. And then also a lot of these stores in the US have these membership numbers that are free to sign up but you get a small discount or you get some kind of gas reward point, there's some kind of benefit to getting a membership and always using that number. And that obviously feeds right back to what you guys need, right?

9:37 Panelists: Yeah, but that loyalty membership that a lot of folks have, and that is the majority of customers, that's really what allows the data science that we do to kind of personalize shopping experience, right? So if you're going to go online and do online shopping or if you're going to likely be going to the store in the next week, we can try to personalize, what do we expect you to be shopping based off of your history? We can link that back to your loyalty card number and everything.

10:00 Michael Kennedy: Yeah, super interesting. We could go on all sorts of stories like the bananas and so on. I don't know the truth of them, so I won't go too much into it. But they sound fun. But 250 data scientists, that's quite the large group, as I said. And it's a little bit why I touched on the MATLAB story and the Excel story, because people seem to come to data science from different directions. I mean, you tell me where your people come from, but there's the computer sciencey side, like I want to be a programmer. Maybe I'll focus on data. But also just statisticians or people interested in marketing or all these different angles. And that's an interesting challenge, right?

10:38 Panelists: With 200, 250 analysts or data scientists, you have this huge spectrum of kind of talent and background. And so we kind of categorize our data scientists into like three big buckets, right? So we have the insights folks, and those are the folks that are really focusing on looking at historical trends going on, doing a lot of visualization to try to tell a story about what's going on with a product over time, what's going on with their customers. Then we got kind of another bucket that is kind of our statistical modelers or machine learning specialists, right, and those are the people that you would typically think of that are more educated on the stats or the algorithms that we're applying within the company. And then we got another bucket that's technology, right? And those are the folks that are really specialized on usually the languages that we're using, R, Python, really understanding how to really be using Git, how to be using Linux and kind of maneuver around all the servers and different tech stack environments that we have going on. Obviously the largest bucket is that insights. And I don't know what the actual number is, but I always say that roughly 60 to probably 70% of our data scientists kind of fall toward that insights. And that's kind of where you're going to see a lot of folks that have a background that would be typically aligned with a business analyst, right? Maybe they're coming from more of an engineering or economics background. And the folks in that middle bucket, that machine learning, that's going to be more of your folks coming with a stats, maybe a stats Masters or PhD or more, they could even be economics. But they kind of had a stronger focus on econometrics than kind of traditional economics. And then you've got that small bucket, which you get a lot of people that I think are more like Ethan. Ethan's kind of what I would consider the classic person going in that bucket, where they kind of have that computer science background, coming from school. And that kind of creates that strong link between traditional software engineering and our data science folks.

12:33 Michael Kennedy: That's a good taxonomy.

12:34 Panelists: Specific to our folks, I would say we have a lot of folks that have kind of like an economics background. That's definitely a big kind of traditional degree that we recruit a lot of people from. We have a lot of people from computer science programs and then kind of the traditional stats, right? So, and Ethan, you can throw in some others, but from my experience, those kind of seem to be the three major themes of the background that we see. That's definitely very common. I think historically we leaned more from economics and statistics. And recently there's been a lot of changes. Data science as a product is a newer thing. In the past, I think there was less of a need for strong technical skills, being a data scientist, if that formal title even existed, right?

13:15 Michael Kennedy: Right, it was so new. It's like, can you make graphs out of this big data? We love you, just do that, right?

13:22 Panelists: Things have really changed, and especially because we've moved into using distributed systems like Spark. And those things simply demand a higher level of technical expertise. That's part of the reason that we've shifted to hiring more technical people to at least support and sometimes do different work.

13:35 Michael Kennedy: Sure, and that probably also feeds into why you all are building a lot of internal packages to help put a smooth facade on top of some of these more technical things like Spark.

13:45 Panelists: That's definitely been a theme of shifting to new platforms. So you know, like probably most companies, we have a monolithic database system that for a long time we've relied upon. So most data scientists are pulling from one primary database. But over the last couple years, as we started to get things like clickstream data and the needs of our modeling changed, we started to push toward Spark. And Spark tends to be a really, I don't know, a difficult adjustment for people coming from traditional databases, in my experience. And so a lot of the work that Brad and I have done is work on simplifying that transition, trying to hide some of the complexity that most people don't need to deal with. You probably don't need to configure everything in your Spark environment, because you're not used to doing that in something like Oracle.

14:29 Michael Kennedy: Yeah, absolutely. How much data skill do folks need to have for, as a data scientist? You know, when I think data science, I think pandas. I think CSV. I think those kinds of things, matplotlib, NumPy, scikit-learn, these kinds of things, but not just the SQL query language and things like Spark and stuff, although I know that that's also a pretty big part of it. So maybe, could you just tell us, for people out there listening, thinking, "Hey, I'd like to be a data scientist. What skills should I go acquire?" Where's that fit into that?

15:02 Panelists: In my view, it's really a matter of the size of your data. Big data's such a generic term that I think it may have lost meaning in a lot of cases.

15:09 Michael Kennedy: Yeah, so some person's big data is actually like, ah, that's nothing. That's our test data, right?

15:12 Panelists: It's like, yeah, how big is your laptop's memory? That's really the question. And for us, so we literally have every transaction that's happened at Kroger over the last at least 10 or 15 years. And so the size of that data is just enormous. To do even trivial things like filters, you still need a very powerful system. And so for us and for large companies with transactional records or clickstream records, you generally need very powerful distributed systems or a central database. But historically people think of pandas as being the primary data science package. And that is true once you reduce your data to a manageable size. And perhaps some companies have small enough data that they could do that on a single server. But for us, that's generally not true.
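Editor's sketch: the point about reducing data to a manageable size before handing it to a single-machine tool can be illustrated with a streaming aggregation. This is a minimal, standard-library-only sketch, not anything from 84.51. The column names `household_id` and `amount` and the function name are hypothetical, and at the scale described in the episode this reduction step would run in a database or on a Spark cluster rather than in one Python process.

```python
import csv
import io
from collections import defaultdict


def total_spend_by_household(lines):
    """Stream transaction rows, keeping only one running total per household.

    Only the aggregated totals are held in memory, never the full set of rows.
    """
    totals = defaultdict(float)
    for row in csv.DictReader(lines):
        totals[row["household_id"]] += float(row["amount"])
    return dict(totals)


# Tiny in-memory stand-in for a transaction file too large to load at once.
sample = io.StringIO(
    "household_id,amount\n"
    "h1,4.50\n"
    "h2,10.00\n"
    "h1,2.25\n"
)

summary = total_spend_by_household(sample)
print(summary)
```

The reduced summary is small enough to pass to whatever in-memory analysis tool the analyst prefers.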

15:51 Michael Kennedy: Do you guys use things like Dask or stuff for distributed processing?

15:56 Panelists: We don't really use Dask. There's been some interest in it. I think, so I'm not super familiar with Dask, but I think that it occupies a similar niche to Spark. We're pretty far down the Spark road.

16:07 Michael Kennedy: Sure, once you kind of place your bets and you invest that many hours of that many people's work, it can't just be slightly better or slightly different or whatever. It's got to be changing the world type of thing to make you guys move.

16:19 Panelists: We're also pushing towards migrating a lot of applications to the cloud. And doing something like that, you sometimes are a little more restricted in what you can do in an enterprise setting, because there's rules about how your environments work and things. And so we don't generally get to customize our own clusters, which you might want to do for Dask. So we have an engineering and architecture team that sets up the Spark clusters for us that then we, as data scientists, log into and use for our work.

16:43 Michael Kennedy: That's kind of handy. I know there's a lot of places like that, where there's just cluster computing available, like CERN has got some ginormous set of computers. You can just say run this on that somehow, and it just happens. This portion of Talk Python To Me is brought to you by Linode. Are you looking for hosting that's fast, simple, and incredibly affordable? Well, look past that bookstore and check out Linode at talkpython.fm/linode. That's L-I-N-O-D-E. Plans start at just $5 a month for a dedicated server with a gig of RAM. They have 10 data centers across the globe, so no matter where you are or where your users are, there's a data center for you. Whether you want to run a Python web app, host a private Git server, or just a file server, you'll get native SSDs on all the machines, a newly upgraded 200-gigabit network, 24/7 friendly support, even on holidays, and a seven day money-back guarantee. Need a little help with your infrastructure? They even offer professional services to help you with architecture, migrations, and more. Do you want a dedicated server for free for the next four months? Just visit talkpython.fm/linode. One of the things I think's interesting is this blend between Python and R. And it sounds to me like people are coming to one of those two languages, maybe even from somewhere else, maybe from Excel or from MATLAB or some of these other closed source commercial tools. What's that look like? Because for me, it feels like a lot of times these are positioned as an exclusive Python or R conversation. But you, maybe with that number of people it's a slightly different dynamic. What's it like there for you?

18:25 Panelists: I would say historically, at least my experience, what I saw a lot of were people that were coming more from a computer science background kind of naturally aligned with the Python mindset and syntax. And the folks that traditionally came from a stats background or more of a business analyst kind of gravitated towards R. And I still see a lot of that. But I think it's starting to change quite a bit, because you're getting more of these data science programs in universities. I mean, you're certainly getting more of a mix within those programs. And those programs are trying to either select one language or they're blending two languages throughout the curriculum. So we still see a lot of crossover in folks coming with more of an R or Python. It's just, to me it's not as easy to kind of pick out who it is, right? I used to be able to look at someone, and they said, "well, I went to school for computer science." I was like, "Oh, okay, well, obviously you're going to be a Python, more likely a Python than an R." That's not always the case. So to me it's getting a little bit more blurred. I think a lot of it just has to do with the environment they're coming from. So if they're coming from a university, then which university, and what language are they just kind of defaulting to?

19:38 Michael Kennedy: Maybe even down to who the professor was and what book they chose.

19:41 Panelists: Exactly.

19:42 Michael Kennedy: I feel like it's almost not even chosen. It's this organic growth of, well, I was in this program, and I had this professor a lot. And that professor knew Python, or they knew R. So that's what we did, right?

19:55 Panelists: Yup, and then also I think a lot of folks coming from, if you got experience in industry and you're coming from a different company, or at 84.51, then lots of times it just kind of depends on the size of that company. It seems like companies that are smaller, that maybe work with smaller data sets, have a smaller infrastructure, it's easier to work on your local RStudio or PyCharm IDE and do your work. Those companies that are much larger and need a larger infrastructure for your tech stack, I feel like they're kind of gravitating more towards Python. There's other reasons behind that, but, so I think the size of the company also determines it.

20:30 Michael Kennedy: It's probably a little bit more aligned with the computer science and DevOps side of the world. And it's probably just, there's a greater tendency for those folks to also be using Python rather than to also be using R, because if you come from a stats background, what do you know about Docker, right? I mean, probably not much unless you had to just set it up for some reason for some research project, right?

20:53 Panelists: We find that, especially at our size, having a very large dedicated engineering function and an architecture team and these other more technical teams, they tend to be a lot more fluent in Python. And so even in communicating with them and like when you have proof of concept applications, if you want to say, "We're going to try to deploy something in a new way," that team is going to be a lot better able to support Python in general because it's more like their background. So I've definitely seen, since I've started, R was a bit more popular. I think it's shifted to be about 50-50. But Python and R have sort of found their niche. I think R is still the superior tool for visualization, which is sad, because I like Python a lot and I wish it were better. And I think there's hope. But R still is really, really good at that and really good at some other things, readable code with the pipe operator and things like that. And it seems like R is doing really well in more of our ad hoc analysis work, and then in our product style sciences that we deploy, that tends to be Python.

21:48 Michael Kennedy: Interesting, so, yeah, so the research may happen more in R, but the productization and the deployment might happen, might find its way over to Python.

21:57 Panelists: Yeah, I think the more interactive type of work that you're doing, lots of times it's probably a little bit more maturity on the R side. But the more we're trying to standardize things or put things in some kind of automated procedure for production or whatever, that's when it starts to kind of gravitate towards Python, just because that's usually when we start getting the engineers involved a little bit more. And then how can we integrate this within our tech stack? And there's usually just less friction if we're doing that in the Python side.

22:23 Michael Kennedy: Okay, so you talked about building these packages and libraries for folks to use to make things like Spark easier and so on. What is your view on this blended Python-R world? Do you try to build the same basic API for both groups but keep it Pythonic and, I don't know, R-esque, whatever R's equivalent of Pythonic is? How do you think about that? Or are they different functions because Python is more on the product side?

22:51 Panelists: This is a great question. It's been something that Ethan and I and a few other folks have really been trying to get our arms around. We don't know what the best approach is. We've tried a few different things. For example, so we just have a standard process of ingesting data, right? So we got to do some kind of a data query. There's lots of times just common business rules that we need to apply. We call them golden rules. Certain stores, certain products we're going to filter out, a certain kind of loyalty membership, whatever, we're going to discard those. And that's all business logic. And typically, historically we've had very large SQL scripts where people were applying the same thing over and over, maybe with slight twists. A lot of that stuff we can just kind of bundle up, both in an R and a Python package, to apply those golden rules, or that business logic. And it just makes their work more efficient, right? So now their data query goes from applying this big script to just like, all right, here's a function that does that initial data query, get that output, then go and personalize your science, whatever you're doing. Something like that, that's a great way where we can have both an R and a Python capability, as long as it doesn't get too large, right? So when we do something like that, we want to try to keep the R and Python packages, one, at a similar capability, right, so that the output that we get for both packages is going to be the same, that the syntax is going to be very similar, that the functionality is going to be very similar as well, right? So basically you want somebody to look at the R and the Python package, and it's like, it's doing the same thing. We're getting the same output. It has no impact on the output of the analysis, regardless of what package you use.
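Editor's sketch: the "golden rules" idea described above, one function that replaces a repeated filtering script, might look roughly like the following. Everything here is an illustrative assumption: the `GoldenRules` class, the `apply_golden_rules` function, and the field names `store_id`, `product_id`, and `membership` are hypothetical and do not come from 84.51's internal packages, which per the episode wrap SQL and Spark queries rather than Python lists.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GoldenRules:
    """Hypothetical container for centrally maintained exclusion lists."""
    excluded_stores: frozenset = field(default_factory=frozenset)
    excluded_products: frozenset = field(default_factory=frozenset)
    excluded_memberships: frozenset = field(default_factory=frozenset)


def apply_golden_rules(rows, rules):
    """Return only the rows that pass every exclusion rule."""
    return [
        row for row in rows
        if row["store_id"] not in rules.excluded_stores
        and row["product_id"] not in rules.excluded_products
        and row["membership"] not in rules.excluded_memberships
    ]


rows = [
    {"store_id": "s1", "product_id": "p1", "membership": "standard"},
    {"store_id": "s9", "product_id": "p1", "membership": "standard"},  # excluded store
    {"store_id": "s1", "product_id": "p7", "membership": "standard"},  # excluded product
    {"store_id": "s1", "product_id": "p2", "membership": "employee"},  # excluded membership
]
rules = GoldenRules(
    excluded_stores=frozenset({"s9"}),
    excluded_products=frozenset({"p7"}),
    excluded_memberships=frozenset({"employee"}),
)

kept = apply_golden_rules(rows, rules)
print(kept)
```

Keeping the rules in one object means an analyst calls a single function instead of copying a long script, and a rule change happens in one place.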

24:30 Michael Kennedy: Yeah, well, it sounds super important, because if you evolve or version that SQL query just a little bit and they get out of sync, and then you go do a bunch of predictive analysis on top of it, and you say, "Well, we decided this, but actually earlier we thought this, but now it's that," it's like, no, that's just a different query. This is a problem, right?

24:50 Panelists: It's a huge problem.

24:52 Michael Kennedy: Yeah, it seems like you really want to control that, and if you can bundle that away into here, call this function, we'll tell you what the data is, and just maintain that, that's great.

24:59 Panelists: But even then, kind of what you're talking about right there, we see the same thing happen when we're building these packages kind of in tandem between the two languages, 'cause it may be easy to kind of create that initial package that does a few things. And they're both operating very similar. But the second you start getting eight other folks from across the company that's like, "Oh, this is great. I want to go and do a pull request and make a slight modification." Then it's like, all right, well, I saw the Python just had like eight updates. What are we going to do on the R side? Are we going to do these exact same implementations, or not? Or maybe it's a unique thing that's kind of language specific. And it's like, well, how do we kind of do that same thing within R? And that's where it kind of explodes to be like, okay, there's no way we could actually build every single package we want to build in both R and Python and keep them at the same level.
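Editor's sketch: one way to notice the drift described above is a parity check that runs both packages on the same fixture data and compares their outputs. The sketch below assumes, purely for illustration, that each package can export its result as CSV text; the function names are hypothetical. The comparison ignores row order and column order but is strict about values, so `10` and `10.0` would count as a mismatch unless the exports are normalized first.

```python
import csv
import io
from collections import Counter


def load_rows(csv_text):
    """Parse CSV text into a multiset of rows, ignoring row and column order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(tuple(sorted(row.items())) for row in reader)


def outputs_match(python_csv, r_csv):
    """True when both exports contain exactly the same rows."""
    return load_rows(python_csv) == load_rows(r_csv)


# Hypothetical exports from the Python package and the R package.
python_out = "store_id,sales\ns1,10\ns2,20\n"
r_out = "sales,store_id\n20,s2\n10,s1\n"
drifted_out = "store_id,sales\ns1,10\n"

same = outputs_match(python_out, r_out)
different = outputs_match(python_out, drifted_out)
print(same, different)
```

A check like this does not solve the maintenance burden of two code bases, but it can flag when an update to one side has changed results.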

25:46 Michael Kennedy: Sure.

25:47 Panelists: That's where it gets difficult to kind of figure out what direction you're going to go.

25:49 Michael Kennedy: Yeah, and what's your philosophy? Are you, let people make these changes and get the best library they can, or is it like, no, they need to be more similar? This is a problem.

25:58 Panelists: We're kind of figuring that out. That's been one really interesting experience, because in this regard, I mean, both in terms of the size of the data science function and how heterogeneous it is, I do think we're maybe, if not totally unusual, we're maybe a little ahead in running into these problems than what I read on the internet. Like, I haven't read a lot of other people grappling with this problem. So if you're listening and you've done this and you figured out a good strategy, let us know. But I think we're still figuring out exactly what it is. And so one thing Brad and I have discussed a lot is what are our options for building one underlying set of functionality that then you can interface with from both languages? And that's pretty tricky, because there's like an R package called reticulate that you can run Python code in. And then there's a Python package called rpy2 that you can run R code in. But these things tend to get, they get a little unmanageable because they don't deal with environments the same way that a native Python or R install does. And so these things are just challenges. We're experimenting right now with a way of tying together R and Python in the same session of a notebook by having them share what's called a Spark session, which is your connection to a Spark cluster. And so in theory, under the hood, you could do all the work in one of the languages and return to the user a Spark object, which is translatable to both. And so this is one of the things we're experimenting with. But we're trying a few different things. But we've definitely found that separately maintaining two identical APIs is extremely challenging, and I don't think we can do that for multiple packages going forward.

27:20 Michael Kennedy: Yeah. You have to have a pretty ironclad decider of the API, and then we'll just manifest that in the two languages. And that's also pretty constrained, right?

27:31 Panelists: Well, it really stifles contributions, right, because like Brad said, people want to issue a pull request, and we don't want anybody who contributes to have to know both languages thoroughly enough to build it in both. I mean, already we would ask them for documentation and things. And it's like, you're just broadening the size of the ask and limiting your potential contributors that way.

27:48 Michael Kennedy: Where's your unit tests? And where are your unit tests for R, right? Like, oh my goodness. Interesting, well, my first thought when you were talking about this as a web developer background was, well, maybe you could build some kind of API endpoints that they call, and it doesn't matter what that's written in. Like that could be Java or something, who knows? Long as they get their JSON back in a uniform manner across the different languages, that might work. It sounds like the Spark object is a little bit like the data side of that.

28:17 Panelists: That's the issue, ultimately, that for a lot of the stuff we're doing, we need to actually transform data in some way. And so sending a huge, sending many gigabytes of data across a web API is not going to be very efficient.

28:29 Michael Kennedy: Even if you turn on gzip, it's still slow.

28:32 Panelists: Yeah, so that solution is something we've considered also, that idea of like, maybe we can subscribe to some kind of REST endpoint and just use that. And that works for certain problems, but for a lot of our problems, it's ultimately about changing the data in some way. So it doesn't work quite as well.

28:45 Michael Kennedy: I see. So the ability to directly let the database or Spark cluster do its processing and then give you the answer is really where it has to be, huh?

28:55 Panelists: Exactly, yeah.

28:56 Michael Kennedy: Okay, interesting. What other lessons do you all have from building the packages for two groups? People out there thinking, maybe it doesn't even have to be Python and R. It could be Python and Java, like I said. But there's a lot of these mixed environments out there, although like I said, I think this is a particularly interesting data science blend at the scale you all are working at.

29:18 Panelists: One thing I've noticed is that being closely tied into a wide number of people and different parts of your data science function is really important, because the way people use things is so different. So we talked briefly about how people come from very different backgrounds within our data science function. And that means that their understanding of how to use functionality is quite different. And one thing I have to resist all the time is building a piece of functionality that to me looks really elegant, because I've realized the ways that it could be used or it supports some kind of customization. For example, and I was talking about this with someone else who works on packages, the idea that maybe the user could pass in a custom function that would then override part of a pipeline or something. And I always have to remember that most people aren't going to do that. The vast majority of our data scientists aren't attracted to these elegant solutions. They just want the purely functional ones.
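
As a rough illustration of the trade-off described above, here is a minimal sketch of a function that accepts an optional custom step but works without one. The names `default_clean` and `total_sales`, and the `sales` field, are hypothetical and not taken from the packages discussed in the episode.

```python
from typing import Callable, Optional


def default_clean(records: list) -> list:
    """Default cleaning step: drop records with a missing 'sales' value."""
    return [r for r in records if r.get("sales") is not None]


def total_sales(records: list, clean: Optional[Callable[[list], list]] = None) -> float:
    """Sum the 'sales' field, using `clean` if given, else the default step."""
    step = clean if clean is not None else default_clean
    return sum(r["sales"] for r in step(records))


rows = [{"sales": 10.0}, {"sales": None}, {"sales": 5.5}]

# The common case: most users call the function with no customization.
simple = total_sales(rows)

# The advanced case: a caller overrides the cleaning step with a lambda
# that keeps only non-missing sales above 6.
custom = total_sales(
    rows,
    clean=lambda rs: [r for r in rs if r["sales"] is not None and r["sales"] > 6],
)
```

Keeping the override optional means the simple call stays simple, while the flexible path is still there for the few users who want it.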

30:05 Michael Kennedy: And sustainable ones.

30:05 Panelists: What is this lambda word?

30:06 Michael Kennedy: Why do you make it so complicated? Can't I just call it?

30:08 Panelists: Lambdas are a very good example, yeah. And so it's good to remember, we're building this as a functional thing for people who don't want to learn every aspect of computer science. They want to get their data science work done.

30:18 Michael Kennedy: Okay, yeah, good advice. Brad?

30:21 Panelists: So I think one thing that we've kind of been running into, and I think this is more and more common with other companies, is we have many different tech stacks. Basically we are working on on-prem servers. We have on-prem Hadoop. We are working in two different cloud environments right now. So basically we have like four different environments that our data scientists could be using these packages in. And so lots of times it takes a lot of planning. Like, are we going to actually try to make this package completely agnostic to whatever environment you're in and be able to use it, or do we just want to say, "Hey, look, this is a package that has this one capability, but it's specific to this one cloud environment." And that takes a lot of planning. I think going into it, like myself, Ethan, several other folks, we have built packages before. But it was largely in more of an isolated environment, or it was just like, I'm just building a package that someone's going to use on their local IDE on their own laptop.

31:22 Michael Kennedy: It's focused, and you know what they're going to try to do with it.

31:25 Panelists: Right, right. So I think we've gotten a lot better at really trying to plan out, like what do we want this to look like, and what are the stages that we're going to take? That's still something we have a lot of work to do and get better at. But I think the nice thing is we have kind of a group of data scientists that are really getting better at this, and it's allowing us to kind of understand good, proper software engineering and approaches to that. And I think that's slowly kind of filtering out to the other data scientists. As we get smarter, we're trying to upskill other folks on thinking that same way.

31:56 Michael Kennedy: Sure.

31:57 Panelists: Yeah, and building off of what Brad's saying about the challenges of building packages in an enterprise environment for people to use them in a variety of different ways, one thing that was new to me was building this stuff through enterprise tools is quite different than doing it on your own. So a lot of people who maintain things like open source packages are using Travis CI, for example. And we have an enterprise CI/CD solution. And these things tend to require authentication. And they need to be integrated with other enterprise systems. And so these things are all, at least for me, things that I never encountered in personal projects or things at school. But it is the challenges of working in a large company. There's a lot of things that are locked down that require sign on in some way. You have to pass credentials. And these are like a whole new realm of problems to solve.

32:38 Michael Kennedy: Yeah, there's definitely more molasses in the gears or whatever in the enterprise world. You can't just quickly throw things together, right? You might have to do, like my unit test requires single sign on. Why is that? This is really crazy.

32:51 Panelists: Yeah, and mocking gets quite challenging. So that's one issue we have, where mocking our tests, I mean, it could either be a giant project, or we could do it in a mostly correct way, you know? We could take a subset of the data and say this is a good enough sample of it. But this isn't really representative of what we want this package to do, especially because these are all really integration tests. They're all like making sure that you actually can connect to the systems. So if you mock a system, essentially you're taking out one of the things you want to test. You want to make sure you actually can connect to the real system, 'cause that's the challenge of building this functionality.

33:23 Michael Kennedy: It's such a challenge because sometimes the thing that you're mocking out is simple, but sometimes it's such an important system that if you don't do a genuine job of mocking it out, then what's the point of even having the test? You know, I'm thinking of complicated databases with 50 tables, right? Yeah, sure, you can tell if it's going to return the data when you do this query, but what if the data structure actually changes in the database, right? Sure, the tests run 'cause it thinks it has the old data. But what did that tell you, right? Or if you're integrating with, say, AWS and talking to S3 and Elastic Transcoder and you've got to get some result, or Elastic Transcriber for text. And you're going to process those. You know, at some point, you're almost not even testing if you mock it too little. And then like you said, it's a huge project to recreate something like that.
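
The stale-mock problem described here can be sketched with Python's standard `unittest.mock`. This is only an illustration: `top_customer`, the `db.query` method, and the column names are hypothetical stand-ins, not the systems discussed in the episode.

```python
from unittest.mock import Mock


def top_customer(db) -> str:
    """Return the customer id with the highest spend.

    `db.query` is a hypothetical client method that returns a list of row dicts.
    """
    rows = db.query("SELECT customer_id, spend FROM sales")
    best = max(rows, key=lambda row: row["spend"])
    return best["customer_id"]


# The mock freezes the schema as it looked when the test was written.
fake_db = Mock()
fake_db.query.return_value = [
    {"customer_id": "a1", "spend": 20.0},
    {"customer_id": "b2", "spend": 35.0},
]
mocked_result = top_customer(fake_db)  # succeeds against the frozen schema

# Suppose the real table later renames `spend` to `total_spend`.
# A stand-in with the new schema shows what the real system would hit.
changed_db = Mock()
changed_db.query.return_value = [
    {"customer_id": "a1", "total_spend": 20.0},
    {"customer_id": "b2", "total_spend": 35.0},
]
try:
    top_customer(changed_db)
    breaks_on_new_schema = False
except KeyError:
    breaks_on_new_schema = True
```

The test built on `fake_db` keeps passing after the rename, which is exactly the false confidence being described: only a run against the real system (or an updated mock) reveals the break.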

34:13 Panelists: It's funny you say the 50 tables thing, because our central data mart is itself about 50 tables. And then occasionally we also rely on things that are created by other data scientists. And so yeah, the scope of it is very large, and it changes a lot in the background. And then also, I kind of feel that Spark is a much more immature technology than some of the old database technologies. And so updates happen that actually change the functionality of the system. Suddenly it's like the things that worked before don't work anymore, and you're mocking, like if you mock up Spark, that's not going to work. It's not going to be the same.

34:44 Michael Kennedy: Yeah, it's going to say the tests passed, but it'll wait until production or maybe QA to fail, right?

34:48 Panelists: Yeah, these are things we have to think about more and more.

34:51 Michael Kennedy: This portion of Talk Python To Me is brought to you by Tidelift. Tidelift is the first managed open source subscription, giving you commercial support and maintenance for the open source dependencies you use to build your applications. And with Tidelift, you not only get more dependable software, but you pay the maintainers of the exact packages you're using, which means your software will keep getting better. The Tidelift subscription covers millions of open source projects across Python, JavaScript, Java, PHP, Ruby, .NET, and more. And the subscription includes security updates, licensing verification and indemnification, maintenance and code improvements, package selection and version guidance, roadmap input, and tooling and cloud integration. The bottom line is you get the capabilities you'd expect and require from commercial software but now for all the key open source software you depend upon. Just visit talkpython.fm/tidelift to get started today. So you talked about the four different places where code runs, Brad. You've got your Hadoop cluster locally, your Spark cluster locally, the two cloud vendors that you're running on. Where are you headed? Which one of those is legacy, and which one is where you're headed? Or are they all active?

36:02 Panelists: We're definitely headed towards a cloud environment. The problem that we have, one, we do have data that is quite sensitive still. And we got to make sure that we have all the security aligned within the cloud environment before we can transition that. And then we just have a lot of historical code still running. And so you figure we got 250 analysts. We have just that many projects going on. How do we transition a lot of that code into the cloud? So I think it's going to be many years of working in this kind of multi-environment strategy. I think ultimately the goal would be to be in a single cloud environment. But then also, I understand that from a business strategy standpoint, that locks you in to kind of a certain pricing structure. We may try to have a multi-cloud environment. That's pretty common across companies. I think long-term we will try to be mostly in the cloud. Whether or not we'll be with one vendor, that's to be decided. The one thing that I think has changed with our recruiting is definitely looking for folks that aren't scared away from being able to work in a cloud environment. A lot of students are coming from university that do not have any experience with a Spark environment. And that's fine. It's not like we're expecting you to do that. But you need to be open and willing and be prepared to work in that environment. So that's definitely a big change. It's also amazing we got this far without talking about SAS, because we, like many, many analytical companies that have been around for more than five or 10 years, still have dependencies on SAS. It's just very difficult to migrate off of enterprise tools. And so we've been in the process of migrating from SAS for quite some time, and it's funny because when I came in here, which was three and a half years ago, the company was almost entirely on SAS. And R was like the upstart language. And I think I was one of two or three Python users in the whole company. 
And things have changed a lot. But making the final cut, severing ties from old technologies is challenging. It's one of the reasons we have so many platforms. You just end up with production things running on all these platforms. And it would be a lot of work to change them. So it just moves slowly.

38:03 Michael Kennedy: Well, some of those systems, they're carefully balanced and highly finicky but important, right? And if you try to change them, if you break it, all of a sudden that becomes your baby. You have to babysit it when it cries at night, right? I'd rather not have SAS, but more than that I'd rather not touch that thing and make it my responsibility, 'cause currently it's not, right? That's certainly an enterprise sort of experience, right?

38:30 Panelists: Yup, for sure.

38:31 Michael Kennedy: Yeah, well, what's the transition been like from the somewhat expensive commercial product over to the combination of Python and R? Was that easy? Was it hard? Did people welcome it? Did they resist it? You both do some on the education side within the company, so you probably have a lot of visibility into how that first impression went.

38:52 Panelists: So I used to lead our introduction to Python trainings. So we have, like I said, some continuing education classes in the company. And I will say I was just so surprised by the reaction people had to a new technology training the first time I gave it, because coming out of a computer science program where you sort of get thrown into languages, like I had a class in Java my senior year, and I never used Java, and the professor just sort of expected we'd pick it up, I never really thought about this idea that you would be resistant to learning new technologies. But when one software tool has dominated your industry for 20 years, as SAS had, it's just really unfamiliar. So I gave this course, and a lot of people were asking questions that, to me it was like, well, obviously you would google that. You know, like obviously you would look at the docs. Obviously you would do this. And it's not obvious, because these people come from a closed source tool that is carefully maintained and is highly backwards compatible but at the same time is not nearly as dynamic an ecosystem as something like Python or R. And I think I watched the culture change a lot since I started. And I look at even the people coming out of school that start, and they're so much more willing to jump into things, which I think is great. And even the people that were here when I started have gotten more that way, as well. People have just learned that the culture of open source tools is very different, and you have to be more willing to jump from thing to thing. And as we introduce new technologies, 'cause we still do, people are more able to learn those, which I think is really great.

40:11 Michael Kennedy: I can certainly see that. You know, I'm somewhat sympathetic to those folks. If you have spent a long time and you are very good at the tasks that you have to get done in one language or one technology, and you got to switch over, it's all of a sudden like, I feel brand new again. Like I remember not being able to load a file. I remember not being able to actually properly efficiently query a database. I remember all these things. You're like, all these are, I have these problems again. I thought I was beyond that, right? And that's certainly challenging. The other thing I think that makes it tricky going from something like that or even something from say C# and .NET where there's a Microsoft that says, here's what the web stack looks like, and we'll tell you in six months what the changes are going to be, is in the open source space, there's probably 10 different things that do what you want to do. And how do you know which one of those to pick? And then once you bet on one and you work on it for a while, all of a sudden either maybe it gets sort of, it loses cachet or something else comes along. Instead of being one or two ways to do a thing, now there's 20. And it's like, I'm kind of new here. So how do I even decide which of those 20, because it's hard.

41:26 Panelists: Yeah, that is absolutely a concern that people had. I all the time would get this question, because I was known as the Python guy early on, like what do I do if this package changes? Or how do I know this is still going to work if there's no company that's behind this tool? And if you come from this world, like if you come from the open source side, you think two things. One, most of the time the stuff keeps working. The core functionality is extremely stable. All the most popular open source languages, they don't just stop being maintained. This stuff is extremely, extremely well-used. And also, you know that if a package is no longer maintained, you look for another one, because that stuff happens dynamically. And it's unusual. You'd have to be using something pretty fringe. But it's unusual for you to end up being just out of luck in terms of having some functionality available to you.

42:10 Michael Kennedy: Right, yeah, there's some few edge cases, but it's not common. And there's always the, well, you can fork it and just run it, right? If you use something that's pretty mature, the chances that it has a massive showstopping problem discovered down the road, they're not that high, usually. Things stick around, right? NumPy's probably not going to go unmaintained.

42:29 Panelists: Exactly.

42:30 Michael Kennedy: Django still has users. Things like that, right?

42:32 Panelists: You know, that's one thing that we do try to do internally, and it's one thing that we're trying to get a little bit more smart on how we do it. But with so many packages and so many capabilities out there, it's like, how do you make sure people are using kind of a core set of packages that we kind of endorse or do the primary things that we want to do? We try to create a little bit more structure around what packages should we be using internally, and then what's the process of bringing in a newer package, right?

42:58 Michael Kennedy: Do you guys have like a whitelisting process where you sort of vet them, or how does that work?

43:03 Panelists: We're getting a little bit better about setting up like a sandbox area where if we find a package that is new or even a package that is just on GitHub and not PyPI or CRAN, then how can we bring that in, do some testing, make sure that there's not any interactions going on within our servers or whatever? And then as long as we kind of pass all those regression tests, then yeah, okay, we can start bringing that in formally as a standard package in our servers or wherever.

43:32 Michael Kennedy: Do you have a private PyPI server? I don't know if you'd have CRAN, but a CRAN server, as well, that you have more control over, or do you let people just pip install straight off the main?

43:41 Panelists: We use Artifactory, which is a tool that basically sets up those package repositories. And you can have it clone them. So we have what looks like a copy of PyPI, but then we blacklist certain things or we whitelist certain things depending on the environment, yeah. And it works for CRAN, as well.

43:56 Michael Kennedy: That's a really cool, that's quite the, it seems like a very elaborate system. But for you all, it sounds like it's the right thing.

44:02 Panelists: The nice thing about that is with our internal packages, we can actually have our CI/CD process push them to that Artifactory so that they could do pip install with the package name, or install.packages in R with that package name. And it's like you are importing it from PyPI or CRAN, but really you're just pulling it from our internal Artifactory.

44:24 Michael Kennedy: Yeah, when you have a scale of 250 people, you almost want to say the way that you share code across teams is the same way that you share code across open source, right? It's you create these packages, you put them into, you version them, you put them into Artifactory, maybe even pin your version in your deployment, things like that, right? Is that what you do?

44:45 Panelists: Sort of like getting that standard across, getting the knowledge of these standards across the business is one of our chief challenges, because just like open source, people don't necessarily hear about new packages that solve problems they've been encountering a bunch of times. So while we encourage people to do things like pinning packages, we're still at an even earlier step, where it's like, be aware of what new functionality is in these packages we're using, 'cause all the time I see people setting up elaborate configurations with Spark. And then I tell them we have a package that would be, we first released 1.0 like two months ago. And it's like, all this could be done for you, you know? And we can send as many emails as we want, but people who work with 249 other data scientists delete a lot of emails, because there's too many. So finding a good...

45:28 Michael Kennedy: Yeah, it's another plague in the enterprise, is like everyone thinks that you need to be copied if there's even a chance you need to know about it. And what it results in is, if everything's important, nothing is important, right?

45:39 Panelists: A big problem, yeah. And so figuring out how to socialize the way to use these packages and what packages even exist and then beyond that, like how to use them properly and version things properly is always something we have to think carefully about.

45:52 Michael Kennedy: How do you all do that? I mean, just letting folks know, there is now a library that you can install that solves this problem or does that, or it has this challenge. We're looking for feedback on how to make it better. How do you get the word out about your projects and packages internally?

46:09 Panelists: So we've tried to start doing a little bit of beta testing. So if we have a brand new package we're developing, before we actually do a full release, we'll try to get a group of folks that do some beta testing on it to kind of give feedback. One, is the functionality there? Are there bugs that we're missing? Two, is the syntax kind of logical, coming from more of the data science perspective? And then three, is the documentation there that they need to basically pick up from no knowledge of it and start applying it? And that gives us that good initial feedback. And then once we start getting a first release and everything, right now what we are doing is basically doing that email blast to all our data scientists and saying, "Here's the version number. Here's what's new, what you need to know, how it impacts you." But ultimately I think what I've learned is that the most important thing is having advocates across the company that know about this, because often new functionality will arise that will only take over in part of the business. When you have 250 people, it's like, who knows about what is very different across teams. And so one of the things we focused on with our beta testers is making sure that this is a well-rounded group of people in different teams. So those people serve as sort of the evangelists, to tell other people on their team, well, when you run into these problems, you should be doing this. And that's really the only way to get that information across, 'cause we can't sit in everybody's meetings. And we can't go and look over people's shoulders as they code. So we need other people to do that for us. So our first adopters are really the people that help.

47:35 Michael Kennedy: That sounds like a pretty good way to set things up. I want to come back to this, building these two packages and the same package for both languages. And I'm not sure if we exactly covered it. Do you try to have the same API for both, as close as possible, or do you try to have something Pythonic for the Python one and something that's maybe effectively the same but very much what R folks expect? What is your philosophy in trying to build these packages for both groups?

48:07 Panelists: That's a great question. We try to balance the two. We want the syntax, the API to be very similar across both. But obviously we want folks that are coming from the R side or from the Python side to feel very natural in using it. And so that means that we can't always have the exact same comparable syntax across the two. You know, in R it's very common within the Tidyverse packages, if you've heard that, where there's no quoting. There's been ways that you can remove the quotations of argument inputs and everything. So that's kind of a natural thing that we do, where if you look at the Python side, you got the same arguments, the same valid inputs that you could supply, but you're going to have quotes versus non-quoted. And so that can be differences. And then there's other kind of differences. And underneath the hood, like how do you do logging? Obviously that's going to be a little bit different in both of the two languages. And one thing that's really hard to avoid is that fundamentally, object orientation is extremely different in R and Python. And R has things, I believe the term is method dispatch. So methods don't come after object names, they come before, and they look like just standard functions. And so things are just, if we wanted to build an exactly identical API, we would actually have to jump through a lot of hoops. That wouldn't be very native to either of the languages. So like Brad said, it's a fine line to walk. We want it to be recognizable and similar, but we don't want to sacrifice the merits of the language for that.
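
To make the contrast concrete, here is a small Python sketch. Python's standard `functools.singledispatch` is only a rough analog of R's generic-function dispatch, and the names `summarize` and `Basket` are hypothetical, but it shows the "function first, object as argument" shape next to the conventional Python method call.

```python
from functools import singledispatch


@singledispatch
def summarize(obj) -> str:
    """Generic function: the fallback used when no type-specific version exists."""
    return f"object of type {type(obj).__name__}"


@summarize.register
def _(obj: list) -> str:
    # Chosen automatically when the first argument is a list.
    return f"list with {len(obj)} items"


@summarize.register
def _(obj: dict) -> str:
    # Chosen automatically when the first argument is a dict.
    return f"dict with keys {sorted(obj)}"


class Basket:
    """Conventional Python style: the method comes after the object name."""

    def __init__(self, items):
        self.items = items

    def summarize(self) -> str:
        return f"basket with {len(self.items)} items"


r_style = summarize([1, 2, 3])             # function first, object as argument
python_style = Basket([1, 2]).summarize()  # object first, then the method
```

Both styles can express the same behavior, but forcing one language's idiom onto the other is the kind of hoop-jumping the speakers want to avoid.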

49:30 Michael Kennedy: Sounds like a good balance. You know, let's round out our conversation with one more short topic. When you think about data science, a lot of times at least I think about things like Jupyter Notebooks, JupyterLab, maybe RStudio, and this exploring data. When I think about product, like productizing, putting this behind some REST API or putting into production some of this stuff, I don't think Notebooks and RStudio anymore. What does that transition look, like how do you guys take this research and turn it into products, like services and APIs and whatnot that can run?

50:10 Panelists: So there's a few different approaches. Historically, so I used to work on our digital team that would build the recommender systems for the Kroger website. So much like Amazon's website, Kroger's website has like, because you bought this, you might also like this.

50:23 Michael Kennedy: Right.

50:24 Panelists: And so we need to find ways to serve up recommendations. Historically, that was largely done in batch style processing. So at the beginning of a given week or something, we would say, for each customer identifier, these are the products that they should get. And we would send over these flat files. But increasingly, we're moving to something that looks more like, we will ship you an actual, well, usually it's a container, but some kind of item that takes input and gives output so we can serve up dynamic recommendations. So people, I think a common workflow for things like that is people build their model and do their exploration and do their just like modeling initially in Notebooks or in RStudio. But then they package this up as some kind of product that ends up being much more polished. So in some cases, if we ship a container, people need to actually Dockerize that and make sure that it can be used by someone external and then thrown over the wall to who actually manages the Kroger website, for example.

51:17 Michael Kennedy: Okay, interesting. Brad?

51:19 Panelists: Yeah, and lots of times if you're doing machine learning type of models, you're not going to basically build a script that's got the machine learning code in it and use that to kind of score incoming observations. Most of the time you're going to have some kind of Java output product from these. So DataRobot is a tool that we use internally that allows you to do kind of like automated machine learning tasks. H2O is another very popular one. And both of those have very similar R and Python APIs to perform machine learning. But the thing is, once you get done with kind of finding what that optimal model is, then you typically are kicking out a codegen export or a POJO, which just ends up being kind of like a Java object. And that's what we can use to kind of score new observations. And so that gets away completely from using any kind of a notebook or a Python or R scripting capability.

52:13 Michael Kennedy: Okay, have either of you played around with or entertained the idea of something like papermill? Familiar with papermill?

52:19 Panelists: We do use papermill internally for a couple things.

52:22 Michael Kennedy: Maybe tell people just like the quick elevator pitch of what that is so they know.

52:26 Panelists: A little bit of background, so notebooks, if you've used Jupyter Notebooks before, they are designed for interactive work. They're designed for like, run this line of code, see this output, et cetera. They're not really designed to be automated. They don't lend themselves to being run from the command line. And papermill is a package that Netflix has produced that is open source that lets you automate notebook runs. So you get the benefits of notebooks, where you get this output inline with your work, and you can do the development of the notebook as you did before. But now you can batch these things. You can run them every night or whatever you want to do.
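
For readers who want to see what that looks like, here is a minimal sketch of a scheduled notebook run. `papermill.execute_notebook` and its `parameters` argument are part of papermill's public API; the helper names, the `runs` directory, and the file naming scheme are assumptions made for illustration.

```python
from datetime import date
from pathlib import Path


def dated_output_path(input_path: str, run_date: date, out_dir: str = "runs") -> Path:
    """Build an output notebook path such as runs/report_2019-09-27.ipynb."""
    stem = Path(input_path).stem
    return Path(out_dir) / f"{stem}_{run_date.isoformat()}.ipynb"


def run_nightly(input_path: str, run_date: date, parameters: dict) -> Path:
    """Execute a notebook with papermill and save the executed copy."""
    import papermill as pm  # imported lazily so the helper above works without it

    output_path = dated_output_path(input_path, run_date)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    # Values in `parameters` override the notebook cell tagged "parameters".
    pm.execute_notebook(input_path, str(output_path), parameters=parameters)
    return output_path
```

A scheduler such as cron could then call something like `run_nightly("report.ipynb", date.today(), {"region": "midwest"})` each night, leaving one executed notebook per run with its outputs inline.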

52:56 Michael Kennedy: Yeah, I was going to say one can call the other. You can almost treat them like little functions with inputs and outputs.

53:00 Panelists: Yeah, exactly.

53:01 Michael Kennedy: Yeah, so you use them a little bit?

53:01 Panelists: We use them internally for a couple things, mainly, so again, because we are on so many different platforms, it's like different things come to different platforms at different times, especially because these are managed platforms everyone has to work on. So I can't just go install papermill, because that's going to affect everyone. So we have papermill set up on some of our on-prem environments. And I think we're still working at getting it set up in the cloud. But in general, a lot of the stuff that we would want to automate in the cloud is a little easier to end up scripting in the end. I think that's a heuristic, not always true, but often.

53:31 Michael Kennedy: Interesting.

53:32 Panelists: Yeah, and one thing that we're using or kind of moving towards is using Databricks. And there's a lot of functionality within Databricks that allows you to kind of parameterize and automate the runtime of those scripts. So it ends up being kind of a notebook that can operate a little bit like papermill.

53:46 Michael Kennedy: That seems like that shortest or the most native way to bring where the work was originally done into productized data science. But also I see definitely some engineering challenges around that, like testing, refactoring, et cetera, yeah.

54:02 Panelists: I think historically what we've primarily gone to is basically having your .r or .py scripts and just automating those with normal batching.

54:11 Michael Kennedy: Yeah, that makes a lot of sense. All right, well, I think we're just about out of time. So we'll have to leave it there. But this was a really fascinating look inside what you all are doing there, 'cause it sounds like you're operating at a scale that most folks don't get to operate at maybe just yet. Who knows.

54:26 Panelists: Yeah, it was good to talk.

54:27 Michael Kennedy: Absolutely. Now, before I let you two out of here, I got to ask you the last two questions. So let's go with you, Ethan, first. If you're going to write some data science code, I'll generalize it a little bit this time around, what editor do you use?

54:38 Panelists: I'm pretty all in on Vim, which is not all that popular in data science. I use Jupyter Notebooks sometimes. I found a lot of extensions to let me use the Vim key bindings. But old habits die hard. I dabbled in VS Code recently, but I always go back to Vim.

54:49 Michael Kennedy: All right, cool. Brad?

54:52 Panelists: I do write a lot of R code, so RStudio is kind of my go-to. Even when I write my Python, I definitely enjoy writing it within RStudio. RStudio actually supports multiple different languages. And it's just one of those editors I've gotten used to. I do use notebooks sometimes when I'm teaching, whether it's an RStudio Notebook or Jupyter Notebook. I mean, if I want to be truly Pythonic, I go to PyCharm.

55:13 Michael Kennedy: It's funny, editors are one of those things that once you get really comfortable with it, you can just be more effective in the one that you like, right?

55:20 Panelists: Exactly, yeah.

55:21 Michael Kennedy: Cool, and then notable PyPI or, Brad, if you want, CRAN package for folks out there, something, some library you ran across that people should just know about, maybe not the most popular, but you're like, I found this thing, and it was amazing, and I didn't even know about it.

55:34 Panelists: I am all in on this package called altair. I think you've had Jake VanderPlas on the pod before.

55:39 Michael Kennedy: I have, yeah.

55:40 Panelists: Earlier I alluded to Python's relatively weak visualization ecosystem for data science. And it is, I teach some Python classes, both internally and at the University of Cincinnati, and teaching the visualization ecosystem is just terrible every time. It's so bad. And matplotlib and seaborn are so difficult to use and inconsistent. And altair is like the hope that I have for the Python ecosystem. I think it's so, so nice. I just want to see more adoption. But it's great. Like if you work in data science, you should absolutely switch to altair. And if you are used to the ggplot, really nice encoding style syntax, encoding channels, altair is such a good answer in Python.
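
To make the "encoding channels" idea concrete, here is a small sketch assuming altair is installed. The sample records and the field names (`weight`, `mpg`, `origin`) are made up for illustration; the `Chart(...).mark_point().encode(...)` chain is altair's standard pattern.

```python
# Hypothetical sample data; a pandas DataFrame would work as well.
CARS = [
    {"weight": 2.1, "mpg": 33.0, "origin": "Japan"},
    {"weight": 3.4, "mpg": 18.5, "origin": "USA"},
    {"weight": 2.6, "mpg": 27.0, "origin": "Europe"},
]

# Each channel maps a visual property to a field plus a data type
# (Q = quantitative, N = nominal), much like aes() in ggplot2.
ENCODINGS = {"x": "weight:Q", "y": "mpg:Q", "color": "origin:N"}


def make_scatter(records, encodings):
    """Return a scatter plot whose channels are driven by the encodings dict."""
    # Imported lazily so the data and encodings can be inspected without altair.
    import altair as alt

    return alt.Chart(alt.Data(values=records)).mark_point().encode(**encodings)
```

Calling `make_scatter(CARS, ENCODINGS).save("cars.html")` would write the chart to a standalone HTML file. Changing the plot means changing the mapping, not rewriting drawing code, which is the part that feels familiar to ggplot users.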

56:15 Michael Kennedy: Yeah, I've heard really good things about it. I haven't actually done that much with it myself. But yeah, it's definitely good. Jake does nice work. Brad?

56:26 Panelists: I've been spending a lot more time in both R and Python, so I'm getting more and more drawn towards packages that are available in both languages and that kind of have very similar syntax or API. So a few good ones, so we use DataRobot internally, so that's got a similar R and Python API. H2O is another machine learning package that I really like. And if you look at those, it's really tough to tell the difference between the R and the Python syntax.

56:52 Michael Kennedy: It must be nice when those exist for your ecosystem, right?

56:54 Panelists: Yeah. And even TensorFlow and Keras, I've been doing a lot of stuff with deep learning lately. And the R Keras and TensorFlow packages, I mean, basically they're using reticulate to communicate with the Python Keras. So the syntax between those two is very similar, as well.

57:09 Michael Kennedy: Yeah, we talked about mocking stuff earlier. I want to just throw out Moto. If you guys use AWS, Moto will let you mock out almost every AWS service.

57:18 Panelists: Really?

57:19 Michael Kennedy: Yeah. If you want to mock out the API for EC2, you can do it. You want to mock out S3, you can do it. It's all in there. So Boto is the regular API in Python. Moto is the mock Boto, right?
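
A rough sketch of mocking S3 this way, assuming boto3 and moto are installed. The function names, bucket, and key are hypothetical. The import fallback is there because the decorator name differs between moto releases (`mock_aws` in moto 5 and later, `mock_s3` before that).

```python
import os


def save_report(s3_client, bucket: str, key: str, body: bytes) -> None:
    """Code under test: writes one object using whatever client it is given."""
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)


def check_save_report_against_moto() -> bytes:
    """Run save_report against moto's in-memory S3 and read the object back."""
    # Imported lazily so save_report itself has no test-only dependencies.
    import boto3

    try:
        from moto import mock_aws  # moto 5 and later
    except ImportError:
        from moto import mock_s3 as mock_aws  # earlier moto releases

    # Dummy credentials so boto3 never looks for real ones.
    os.environ.setdefault("AWS_ACCESS_KEY_ID", "testing")
    os.environ.setdefault("AWS_SECRET_ACCESS_KEY", "testing")

    with mock_aws():
        client = boto3.client("s3", region_name="us-east-1")
        client.create_bucket(Bucket="demo-bucket")
        save_report(client, "demo-bucket", "reports/latest.csv", b"a,b\n1,2\n")
        obj = client.get_object(Bucket="demo-bucket", Key="reports/latest.csv")
        return obj["Body"].read()
```

Inside the `with` block no request leaves the machine, so the same `save_report` function can be exercised in a unit test without an AWS account.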

57:31 Panelists: So is that built and maintained by AWS folks, or?

57:34 Michael Kennedy: I don't think so. It definitely doesn't look like it. But anyway, there's some interesting things you can do with a local version of it, all sorts of funky stuff. It looks like someone put a lot of effort into it. Trying to solve that mocking problem. It's a lot of work, right?

57:49 Panelists: It is, definitely.

57:50 Michael Kennedy: Cool, cool. All right, well, Ethan, Brad, this has been really interesting. Final call to action, people maybe want to try to create these unified environments or work better across their data science teams. What will you tell them?

58:02 Panelists: It's worth investing and having some kind of centralized data science team within your overall data science department that works on these resources. You know, you have to carve out time for people. Brad and I are both, I think, pretty lucky to be able to do this as most of our job. If you don't have that time carved out, people just don't have time to contribute to centralized resources and you end up with a lot of duplication of work. And it's also good for some of your data scientists to have more of a technical background and be able to think about this stuff. And I think we've benefited very much from that.

58:27 Michael Kennedy: Yeah, it sounds like it. Brad?

58:30 Panelists: Yeah, and I would say for the data scientists, you know, historically you've been able to kind of focus in one language. And I think that's becoming less and less common. So I think a lot of people need to be flexible in understanding both languages. You may be dominant in one, but at least be able to have some read capability in the other one. And one thing I've definitely benefited a lot from is working closely with Ethan and some of the other folks in the company that are strong Python programmers. There's a lot of good exchange of knowledge. And once you start understanding different types of languages, you kind of see the same patterns that exist. And that could really help you become a stronger developer.

59:04 Michael Kennedy: There's always good stuff on both sides, and if you can bring it over the fence, it's good, right?

59:08 Panelists: Yeah, definitely.

59:09 Michael Kennedy: Well, thank you both for being on the show. And it's been really interesting.

59:12 Panelists: Yeah, thanks, Michael. Yeah, thank you very much.

59:14 Michael Kennedy: This has been another episode of Talk Python To Me. Our guests on this episode were Ethan Swan and Bradley Boehmke. And it's been brought to you by Linode and Tidelift. Linode is your go-to hosting for whatever you're building with Python. Get four months free at talkpython.fm/linode. That's L-I-N-O-D-E. If you run an open source project, Tidelift wants to help you get paid for keeping it going strong. Just visit talkpython.fm/tidelift, search for your package, and get started today. Want to level up your Python? If you're just getting started, try my Python Jumpstart By Building 10 Apps course. Or if you're looking for something more advanced, check out our new Async course that digs into all the different types of async programming you can do in Python. And of course if you're interested in more than one of these, be sure to check out our Everything Bundle. It's like a subscription that never expires. Be sure to subscribe to the show. Open your favorite podcatcher and search for Python. We should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it. Now get out there and write some Python code.
