Transcript for Episode #226:
Building Flask APIs for data scientists

Recorded on Monday, Aug 5, 2019.

0:00 Michael Kennedy: If you're a data scientist, how do you deliver your analysis and your models to the people who need them? A really good option is to serve them over Flask as an API, but there are some special considerations you might want to keep in mind. How should you structure this API? What type of project structures work best for data science and web apps together? That and much more on this episode of Talk Python to Me with Guest AJ Pryor. It's Episode 226, recorded August 5, 2019. Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter, where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @TalkPython. This episode is brought to you by Linode and Rollbar. Please check out what they're offering during their segments. They really help support the show. AJ, welcome to Talk Python to Me.

1:04 AJ Pryor: Hi Michael, how are you doing?

1:05 Michael Kennedy: I'm doing super well, thanks for being on the show.

1:07 AJ Pryor: Yeah, absolutely, I'm super excited.

1:09 Michael Kennedy: Yeah, it's going to be a lot of fun. We get to talk about some things that are really popular in Python, web development and data science, and then we're going to intersect them together, which, I don't know, is going to make them mega-popular, because if you look at the Python space, all the surveys and sort of where people are working, it seems like it's mostly web development or data science and then just like a whole bunch of others, right, so we're hitting both of those today.

1:36 AJ Pryor: Yeah, I mean, you've got on one hand sort of most of the code that's written is for the web, and then you've got this meteoric rise of Python coming beside it, so I figure why not just do both, and then you're pretty much positioned at the top.

1:47 Michael Kennedy: Yeah, that's definitely a good place to be at, I would say.

1:49 AJ Pryor: One way to look at it at least, right?

1:51 Michael Kennedy: Yeah, it's a positive way for sure. So before we get set, though, let's go ahead and start with your story. How'd you get into programming and Python?

1:56 AJ Pryor: Sure, actually, I didn't grow up around programming at all. I grew up in the middle of nowhere, Georgia. I played sports as a kid, you know, I was always into math and science, but never programming, and I went to college, undergrad, at Georgia Tech and my first semester there, I was declared as an engineering major and took a MATLAB course and immediately, just everything about it clicked. I was hooked, and of course, you know, MATLAB is MATLAB, and so I changed my major immediately to computer science and then I got about one semester further, and started taking some of the sort of pure CS classes, you know, it was very theoretical, and that's fine and all, but I didn't feel as hands on with it, and simultaneous to that, I started taking some physics classes, which had an application of Python, so we were kind of modeling gravitational systems and whatnot, and that really resonated with me because it had both the math side of things that I gravitated towards, but I was also building things using code, which was new, and I got really excited about that, so I remember we had this lab where we built like an orbital system of the moon and a rocket and Earth, and before I know it, I had this little couple of dots on a screen moving around and I'm convinced that I'm like landing Apollo 11. I just, it was really gratifying, and so I've been hooked ever since, and the rest is history.

3:12 Michael Kennedy: Yeah, that's cool. It's really interesting how some of these early wins, like this three-body problem you're simulating in physics or whatever, like, it's not a big deal, right? The little simulation's probably pretty limited, but at the same time, it feels so gratifying, right?

3:26 AJ Pryor: Yeah, absolutely, and then of course that starts off, it's sort of the baby problem, and then as I went on to grad school and my PhD, and this was all physics, the coding got more complicated, the algorithms got more complicated, but at the core, it was still the same problem, right? You're solving some kind of computational problem through coding, and although that wasn't always Python, these days it's pretty much moved from that. I kind of diverted along the way and did some C++ and did CUDA and GPU programming, but Python is a lot of what I do day to day now.

3:56 Michael Kennedy: Yeah, cool, cool. You know, I followed a bit of a similar path, like I was working on my PhD in math, and doing a lot of math stuff, but then also analyzing and programming, and I guess I should've known that I was not really intended to finish my degree, my PhD anyway, and go into math, because I would find when I'm working on these projects, I was really excited, especially about the programming, it was so cool, kind of like you described this, like, how awesome it felt to have the simulation running, and then I'd always kind of get, like, a little bit less excited and like oh, here comes the drudgery work part of it when I got to the math. And then it would go back to the programming, I'd get really excited again and like, eh, it should've been a sign that, okay, that's where I need to focus.

4:38 AJ Pryor: That's when you know, yeah.

4:40 Michael Kennedy: Yeah, but it was all good. So for you, what are you doin' these days?

4:42 AJ Pryor: So I work for a company called American Tire Distributors. It's the largest distributor of replacement tires in the world, so the tire industry is very interesting, because tires are something that pretty much everybody has to buy. You don't buy frequently, and it's an overwhelmingly miserable experience for most people. And it's an antiquated industry, so it turns out that they have a ton of data, but technologically speaking, it's not particularly modern, and so my company has spun up a pretty new analytics team that's data scientists, engineers, and developers, to kind of build out applications to try to revolutionize that industry and bring it up to the modern world in terms of analytics, increasing supply chain efficiency, and all of this, and so as a data scientist, particularly one who is interested in web development and app development like I am, it's kind of like being a kid in a candy store, because there's tons of data, both historical and streaming every day, and basically no solutions other than the traditional sort of BI descriptive backward-looking stuff, which is good, and you need that, but we're also looking to build more predictive analytics and doing forecasting and finding other optimization techniques to kind of squeeze efficiency, 'cause as a distributor, you're a middle man, and efficiency is the name of the game. So I find myself sort of stuck between I build back ends in Python, I write a lot of SQL, interact with databases. Sometimes that means building new databases on our cloud resources, and then I spend a lot of time building front ends to actually expose those applications in a way that's consumable, and that's actually a very important part of this whole process, because at the end of the day, if you're building an app for somebody, doesn't matter how fancy your mathematics is if the consumer who's actually making an actionable decision on that can't do anything useful, it's not worth your time. So that's where I find myself spending my time. I write a lot of code and talk to a lot of people.

6:30 Michael Kennedy: Yeah, yeah, that sounds so fun. I mean, almost any industry that has tons of data and yet no one is doing anything with it, not really, it just sounds really fun. You can come in and go all right, let's see what we can do here. We can bring in a little TensorFlow to do this. We could bring in some APIs to modernize that. Why are you seriously emailing me an Excel spreadsheet? Is that really what's happening?

6:53 AJ Pryor: Oh, it's the story of my life, dude.

6:56 Michael Kennedy: Do you guys use SharePoint?

6:57 AJ Pryor: You're hittin' home.

6:58 Michael Kennedy: Do you like share Excel docs through SharePoint?

7:01 AJ Pryor: Yeah, we use a bit of everything, which is part of the problem. There's historical reasons that it's an Excel spreadsheet or it's emailed, and you know, consolidating that's a lot of work, and that's something that we've got people working diligently on, but it keeps things interesting. You know, you learn about tech from the lens of like your big tech companies, and the reality is, there's an awful lot of businesses out there that are not nearly that refined, and so it's a different problem. It's fun, though. I really love my job.

7:27 Michael Kennedy: Yeah, it's definitely a different problem. I mean, if you're working at Netflix as a data scientist, it would be very different, or you know, somewhere like that, right? As an analogy. But I do think it sounds really fun to come in and have kind of this blank slate and say okay, it's 2019, these are the tools we're using, we could easily apply them here. Because it happens to be open source, you don't even have to get budget approved to like buy the $2,000 license or this or that. You just say okay, let me go after it, I'll do it.

7:57 AJ Pryor: Yeah, and then our analytics team functions basically like a startup inside of a very sizeable company, so we have freedom to pick the tech we want to use. A lot of those decisions get to be made sort of last minute on the fly. We get to use whatever the latest and greatest tech is, so it's a lot of fun.

8:12 Michael Kennedy: Yeah, that's actually, you know, I find those types of environments are really, really good for actually learning a whole lot of technology and techniques, because you're not dropped into, like, well, we already have an enterprise architect, and they say we use this technology and that we use it in this way and not in that way. Now, you go work on this small sliver, right? You're there to just explore, and I don't know what it's like for you, but the times I've been in those situations, people don't really care what technology you're using or how you're using it. If you're showing results and you're like, I've tried these things and look what we're doing, last week we couldn't do this, this week we can, that's all they really care about and they're super happy. Is that kind of the experience?

8:52 AJ Pryor: Yeah, I mean, more or less. Like you were saying earlier, if you worked at a company like a Netflix or an Uber, you can have a whole data science team, you know, whatever, 10, 20, 100 people dedicated to making tiny incremental improvements to an algorithm, and that might be worthwhile, because it's a half percentage point across an enormous profit base, right? And so that might justify it. But in a smaller team, you might have more of the kind of 80/20 approach, where you're trying to get most of the benefit with the least amount of work so that you can touch a lot of different points, because the surface area of your problem space is just huge compared to how many people you have. As a company grows over time, that dynamic shifts.

9:30 Michael Kennedy: Yeah, there's like seasons.

9:32 AJ Pryor: It also brings up some sort of team structure issues.

9:34 Michael Kennedy: Yeah, for sure, for sure.

9:35 AJ Pryor: Yeah, exactly, like seasons. It kind of changes as the evolution of the company progresses.

9:39 Michael Kennedy: Yeah, cool cool. So one of the things that we're going to talk about, and I would like to start off the conversation with, you touched on it before, it's really cool to have these libraries and these predictions and all of this tooling in place, but if you can't share it and expose it over say like a web service or something like that, then it's not super valuable, is it?

10:02 AJ Pryor: No, not at all.

10:03 Michael Kennedy: So let's talk about your work that you're doing with Flask and data science, right? This is the web dev side of the intersection that I was talking about.

10:12 AJ Pryor: Right, basically what is a data science application, and how is that different from a traditional quote web app? The biggest difference is that a data science application is going to have some kind of computational, predictive machine learning element embedded into the actual API results. So if I'm purely a front-end developer and I'm used to a workflow of building an app, hitting an endpoint, receiving data and displaying it, there's basically no difference between a traditional app and something that's driven by data science other than maybe you're cooking up more graphs and visualizations depending on the context. But from the back-end perspective, you're going to be generating these API results calling some kind of prediction method. And so one of the great things about Python with machine learning, which is in my opinion one of the reasons it's become so popular is that it has scikit-learn, which is a very understandable interface to sort of churn out predictions repeatably across many different models, many different parameters, and so it makes the kind of plug-and-play nature of how the data science industry needs to work very friendly, and that means that we can have multiple data scientists trying to solve the same problem, each have their own models, they can be tuning it, we might combine them, but at the end of the day, we can have some kind of predict method that generates data, and if that data can be jsonified, it can be sent back through an API. And so it's really in the nature of the meaning of the data and where it comes from, and what problems it's trying to solve. That's really the only main difference.
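A minimal sketch of the pattern AJ describes here: a Flask endpoint whose result comes from a model's predict method and goes back as JSON. Nothing in this block is from the episode. The `MeanPriceModel` class, the `/predict` route, and the `rows` payload key are hypothetical names chosen for illustration, and the toy model only stands in for a trained scikit-learn-style estimator.

```python
from flask import Flask, jsonify, request


class MeanPriceModel:
    """Hypothetical stand-in for a trained scikit-learn-style estimator."""

    def fit(self, rows, targets):
        # "Training" here is just remembering the average target value.
        self.mean_ = sum(targets) / len(targets)
        return self

    def predict(self, rows):
        # One prediction per input row, returned as a plain list of floats.
        return [self.mean_ for _ in rows]


def create_app(model):
    """Build a Flask app around any object that exposes predict(rows)."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json(silent=True) or {}
        rows = payload.get("rows")
        if not isinstance(rows, list) or not rows:
            return jsonify(error="expected a non-empty 'rows' list"), 400
        # Whatever predict() returns must be JSON-serializable.
        return jsonify(predictions=model.predict(rows))

    return app


# Assumed toy training data, for illustration only.
model = MeanPriceModel().fit([[1.0], [2.0]], [100.0, 120.0])
app = create_app(model)
```

Because `create_app` only relies on a `predict` method, different models can be swapped in without touching the route. A real scikit-learn estimator returns NumPy arrays, which would likely need `.tolist()` before being passed to `jsonify`.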

11:38 Michael Kennedy: Yeah, for sure when you're talking about APIs, I suspect consuming them looks very similar to working with the GitHub API or the Stripe API or whatever, right?

11:49 AJ Pryor: Yeah, exactly.

11:50 Michael Kennedy: You've got some authentication, you call some methods, you get a JSON answer back, and you say, they recommend this tire, or they recommend it's time to buy, or not time to buy, or whatever, right? Like whatever it is you're trying to predict.

12:00 AJ Pryor: Yeah, exactly.

12:01 Michael Kennedy: When it comes to web applications, though, I suspect that also means bringing in other interesting elements. Like for example, there's probably more charting and graphing and interactive data display on something that is data science backed than something that's not. You know, like...

12:20 AJ Pryor: Right, exactly.

12:20 Michael Kennedy: My site that does training, right, it shows you videos, but I don't think there's a graph on the site. There's no graphs, nothing like that, really.

12:27 AJ Pryor: Right, exactly. Maybe on some of your admin dashboards or whatever, but...

12:31 Michael Kennedy: Possibly, yeah.

12:31 AJ Pryor: It depends on the context, so for example, we might be making a pricing tool, right, so let's say you want to do some machine learning analytics to figure out how should I adjust my pricing on different products? Well, the way somebody in the tire industry or the shoe industry are going to handle those problems are very different because they have different problems. So to pull from tires, there's an enormous number of different kinds and styles, and so it's tens of thousands of things that you might be making a decision about pricing. A human can't understand that, but if you make an application that makes it very easy to slice and dice that ecosystem and bubbles to the top the things that are maybe the most egregiously mispriced one way or the other, you allow a human to then make decisions about the important parts of that. But that kind of interaction might not be so meaningful if you're in a product space where there's not so many things and it's not as hard for a human to make a decision, but you have other problems. So it's kind of a dynamic business because it very much depends on the context, and that's why the user, their use case, matters so much in how you design an app, because it's one thing to just solve the same pricing problem twice, but the user experience might be vastly different, and that completely changes the front end that makes it a good product.

13:48 Michael Kennedy: Sure, and maybe whether or not it should be just a server-side Flask type of application or maybe you need to bring in some funky JavaScript for interactivity once you pull down the data or something like that. One thing I did want to ask you about is how does hosting and deployment look relative to, like for a data science web app rather than a standard web app? I know in data science, there's lots of computation, there's often leveraging GPUs, but is that more done in the preparation phase? Like do you train a model up or something, but then the actual execution of it, it doesn't need any special hardware in the deployment stage, or what does that look like?

14:23 AJ Pryor: Great question. It depends. So in some cases, you have a problem where the computational load can be done in advance. Great example is image classification. So Google has trained a number of models that they open source for image classification, like the Inception series of models that you can go download and use them to classify things that they already out of the box know how to do, cats and dogs and things like this. And to get there, Google threw a whole bunch of compute at that, tons of nodes, many GPUs, and it takes a long time to get that model weighted the way it is, get it trained to where it is. But from that point, another data scientist could pick that up where it left off and basically finish out the rest of a model to repurpose it. So in that case, the data scientist's workload is mitigated because Google did some front-end work. But at the same time, you might take that final trained model that the data scientist adds their specific use case to, and once that's trained, you might be able to just throw that into a Docker container and then make predictions with it very quickly, trivially quickly. It was the training that was the hard part. So in this case, you would just Dockerize it, and you can deploy the application as is. However, there are other cases where the actual computation you need to do in the web server is where the complexity is. So for example, I have some apps I've built where they're built around optimizations where you have lots of free variables and the problem that you're solving is completely dependent on the state of the app, you have to do that on the fly, and so your Python server, your Flask app, is now either doing that computation directly or offloading it to some other server via something like a task manager with Celery or some other serverless hook that you've put together, microservice architecture, whatever you're using. There's many different ways to do it.
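A rough sketch of the "offload it" option mentioned above, where the web process hands an expensive computation to a worker and returns right away. This is not how the episode's apps are built. To stay runnable with only the standard library, it swaps Celery for `concurrent.futures.ThreadPoolExecutor`, and `optimize`, `submit_job`, and `job_status` are hypothetical helpers.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a task queue such as Celery; a real deployment would run
# workers in separate processes or on separate machines.
executor = ThreadPoolExecutor(max_workers=2)
jobs = {}  # job id -> Future


def optimize(prices, budget):
    """Hypothetical expensive computation that depends on request state."""
    # Greedy placeholder: take the cheapest items that fit in the budget.
    chosen, total = [], 0.0
    for price in sorted(prices):
        if total + price > budget:
            break
        chosen.append(price)
        total += price
    return {"chosen": chosen, "total": total}


def submit_job(prices, budget):
    """Start the computation and return a job id immediately."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = executor.submit(optimize, prices, budget)
    return job_id


def job_status(job_id):
    """What a polling endpoint could return for a given job id."""
    future = jobs.get(job_id)
    if future is None:
        return {"state": "unknown"}
    if not future.done():
        return {"state": "pending"}
    return {"state": "done", "result": future.result()}
```

In a Flask app, one route could call `submit_job` and return the id, and a second route could call `job_status` so the front end can poll until the result is ready.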

16:06 Michael Kennedy: Yeah, yeah, that's, definitely makes a lot of sense. So I guess depends is the answer, right? Yeah, it depends. Very interesting. How does serverless fit into the world here? I know serverless is good for asyncing stuff, right? I want to send an email, nobody needs to wait, and then I can shoot off something to say AWS Lambda or Azure Functions and just have it go. How's that work in your world?

16:26 AJ Pryor: Personally, serverless doesn't affect me too much. First off, the name serverless, I don't really understand because there's always a server there. It just means serverless in the sense that you don't have to manage provisioning and setting that up and so it's a valuable thing, especially if you're more of building like a front-end only type app, you know, if you want to use a Firebase or something, that kind of model. In my case, I came from more of a back-end development side first, and so I've never had an issue with creating those endpoints myself, and so I tend to have not needed the serverless model. That's not to say it's not good and useful, it just hasn't affected me very much ever.

17:01 Michael Kennedy: Yeah, it's interesting, and the way, I kind of see it the same way as well, and I think it's 'cause I came from the back-end development side first, possibly. So to me, I think there's a lot of value in serverless, it makes a lot of sense some of the time, but my perception, speaking only for me, if I'm going to build something for me, is I am trading code complexity, an application that has multiple things, all of it going on, it's got to keep running. I'm trading that code complexity for infrastructure complexity, right? I might now have 30 Lambda functions that all have to be versioned and migrated and kept in sync, but they're all separate things up in the cloud, and like I have to deal with that somehow and like how do I keep them all running, and so I always feel like I'm trading code complexity for infrastructure complexity, and I feel like I'm better at code than I am at infrastructure, so I lean towards not going that way. Same thing with microservices, right? I just feel like I'm better at managing code complexity than dev-ops-y infrastructure complexity, so let me play to that, you know? This portion of "Talk Python to Me" is brought to you by Linode. Are you looking for hosting that's fast, simple, and incredibly affordable? Well, look past that bookstore and check out Linode at talkpython.fm/linode. That's L-I-N-O-D-E. Plans start at just $5 a month for a dedicated server with a gig of RAM. They have 10 data centers across the globe, so no matter where you are or where your users are, there's a data center for you. Whether you want to run a Python web app, host a private Git server, or just a file server, you'll get native SSDs on all the machines, a newly-upgraded 200-gigabit network, 24/7 friendly support, even on holidays, and a seven-day money-back guarantee. Need a little help with your infrastructure? They even offer professional services to help you with architecture, migrations, and more. Do you want a dedicated server for free for the next four months? Just visit talkpython.fm/linode.

18:59 AJ Pryor: Totally get it. I think Docker is really the game changer there because nowadays, if you can throw it into a Docker container, there are services like Google App Engine or AWS Elastic Beanstalk, where it can sort of transparently scale up and down and yes, of course there's differences between Lambda and those services, but at least as far as I have found, you can get pretty far just by having Docker and using whatever managed service on top of that to handle your kind of auto-scaling.

19:24 Michael Kennedy: Yeah, that's cool. Docker's nice because you can do basically anything that Linux can do, right? You don't have the restrictions like...

19:31 AJ Pryor: Yes, which makes me happy.

19:32 Michael Kennedy: Yes, it has to run within 30 seconds or 10 seconds or whatever it does for serverless.

19:36 AJ Pryor: Right.

19:37 Michael Kennedy: And you can only do, work with these, like you can do whatever you want, right? You need to install some like C library, it doesn't matter.

19:43 AJ Pryor: Absolutely.

19:44 Michael Kennedy: Do you use Docker much for your work?

19:46 AJ Pryor: Yes, oh yeah. Both myself and my team, we Dockerize anything that we can Dockerize, pretty much. It just makes life so much easier. It's easier deployments, it's easier going from a VM to something like App Engine if you're on Google Cloud or whatever your service is. The installations, especially with some of the data science libraries that we're using, for example, CVXPY is an optimization framework, and it has some potential gotchas with compiling it. Putting that in a Docker container solves that problem. I don't have to worry about some obscure gcc version on the Linux box I'm deploying to or my Jenkins server...

20:18 Michael Kennedy: That's a good point.

20:18 AJ Pryor: If I just have it in Docker, so I'm a huge fan.

20:20 Michael Kennedy: Yeah, you get it working once, and then you just never have to touch it again.

20:23 AJ Pryor: Exactly.

20:23 Michael Kennedy: You containerize it and you just say, I depend on that, that works.

20:27 AJ Pryor: And your team can have one guy who's good at it, you know? Everybody doesn't have to learn it, 'cause it's a super copy-pastable kind of thing.

20:33 Michael Kennedy: Yeah, yeah, absolutely. What I found interesting about Docker was as I learned to create the Dockerfiles and build the images and containers and whatnot, it's like, well, what you really need to know is Linux, right? Here's a bunch of commands that you're issuing to configure Linux, it just happens to be you issue them in a Dockerfile format rather than on the terminal, but it's basically, the complexity is not Docker, the complexity is Linux, and if you're comfortable with that, then Docker's actually a small step. So do you use something like Swarm or Docker Compose or Kubernetes on top of that?

21:11 AJ Pryor: Yeah, so generally, for those kind of things, we're either going to be using Kubernetes, but a lot of times, we'll just operate within App Engine, just because it's so simple. If you get your Flask app, whether it's a single application or whether you've got a front and back end separately, you Dockerize it, you can push it to App Engine, and then it will deal with scaling it up, down, making sure it stays up, and then you can also add rules on top of it like for static files, that they'll map to some internal Nginx configuration you never have to deal with. That works pretty well, but Kubernetes if it needs to be a full cluster or integrated microservices for sure.

21:44 Michael Kennedy: Yeah, yeah, like they've got to talk to each other. That's where it gets complex with Docker. All right, thanks for that diversion...

21:49 AJ Pryor: Yes.

21:50 Michael Kennedy: That was interesting. So if I'm building a data science web app, I would say if you went out to the street and just started interviewing random people, where all these people were knowledgeable about data science, they would probably say, like, well...

22:03 AJ Pryor: I want to live there.

22:04 Michael Kennedy: Yeah, for sure. So probably the first thought or the most popular answer of like how do I take my data science stuff and present it on the web, it would be Jupyter Notebooks. What do you think?

22:14 AJ Pryor: Yes. I have such a love/hate relationship with Jupyter, and I think other data scientists and my team would hate me for saying this, but Jupyter Notebooks just cause a lot of problems. They are super great for interactive data science, exploratory analysis, initial data cleansing and whatnot, they're great, but for anything you need to pipeline, productionalize, or make repeatable, the native Jupyter Notebook space just doesn't really work very well. I mean, how many, I guarantee you if you're listening to this and you have shared a Jupyter Notebook with a team before, you've gotten one back at some point that you ran and it immediately errors out because you don't have some file or whatever. Just make sure your Jupyter Notebook runs, start to finish. So yeah, you're right, though Michael, Jupyter Notebooks and apps are very different, and how you kind of connect those dots to production is not trivial.

23:02 Michael Kennedy: Yeah, it took me a while to appreciate Jupyter Notebooks and just the whole notebook style of programming, 'cause to me, coming from a more app-building type of software development, I was all about having different files for the different purposes and like factoring an app so the data access stuff is over here and then I can test this part and I put them together and here's the app, and when I looked at Jupyter, I'm like, all that stuff is missing. I mean, even to some degree, like a lot of times even functions are missing and that like kind of made me, gave me the willies a little bit, but then, you know, I saw people...

23:39 AJ Pryor: Yeah, how do I modularize...

23:40 Michael Kennedy: Exactly, right, how is this maintainable, right?

23:44 AJ Pryor: Unit test it, yeah.

23:45 Michael Kennedy: Yeah, exactly, so but then I saw people working with it and I'm like, they're just working differently. They're solving different problems than the problem I'm trying to solve. And the notebook space makes more sense for them, but to be able to push it actually into full production, I don't know, it doesn't seem like really necessarily the right answer. There are some projects trying to leverage notebooks for production that are pretty interesting and I wanted to ask you if you had any experience with them, not saying that I would recommend ditching Flask to use them, but they are interesting. Things like papermill, have you seen that?

24:22 AJ Pryor: Yes, I think papermill is exactly the kind of solution to the problems, 'cause I kind of hammered on Jupyter Notebooks, but I'm actually a big fan of them, it's just that it can be misused, and what papermill does is it provides you a way to parameterize your notebook and then execute them in a more programmatic way, and if nothing else, this forces the developer of the Jupyter Notebook to kind of think about abstracting parameters out where it makes sense.

24:46 Michael Kennedy: Right, what are the inputs, what are the outputs?

24:48 AJ Pryor: And it makes it a lot easier to productionalize these things.

24:51 Michael Kennedy: Yeah, there's an interesting article I'll link to from Netflix, how they're using papermill to sort of productionalize their Jupyter stuff and do a lot of data science, still with the notebooks, but yeah, I would say if you really want to put something online and really make it accessible, right, you need an API or at least a website as probably the final endgame. What do you think?

25:15 AJ Pryor: Yeah, but I think that, I mean, whether or not it's papermill, I think some kind of integrated tooling is going to be the solution, whether it's something that makes it easy to go back and forth between notebooks and Python files or whether it's a papermill or something like that. I think that where that will take you is a place where it's simpler to move from the data science that you're doing in the notebook to production, because we also can't just say, oh, well, you do your exploratory analysis in a Jupyter Notebook and then we productionalize it, which is a whole scratch rewrite. That's a huge waste of time, and so you have to sort of trade off how painful is it to refactor this versus how easy is it to just kind of build a tool that can make those dots connected for you? I think eventually we'll land on something that's kind of an integrated workflow, and that's going to get picked up very quickly because of how beneficial that time savings is.

26:03 Michael Kennedy: Yeah, it's going to be pretty interesting. Now I do think one of the challenges that people run into in the data science space is that the incredible growth of Python really does have a lot to do with data science and I think that's 'cause Python itself appeals a lot to people who are not necessarily programmers first, but they use programming to do something else amazing, right, like a biologist or a physicist or a statistician or something, but it also means a lot of folks come without necessarily a rigorous software engineering background. That's not a dig against them or anything negative, it just happens to be we all come from different perspectives, but I do feel like there's probably, a lot of teams and folks I talk to, it seems like they could use a little bit of help or some pointers on taking things like testing and maintainability and factoring it in, you talked about continuous integration and whatnot, and all that into their workflow. What are some of the software engineering techniques you think data scientists should pay attention to?

27:09 AJ Pryor: Number one is testing. Why do we care about unit testing? And I like to tell people that testing is actually important mostly for refactoring purposes. It's actually a side effect of writing tests that you verify that your code works. So what I mean is that, you know, if I have my test built and I've got good coverage and I've got the little auto reloader over and every time I change something, hit Control+S to save it and I get those little dots, I can be extremely aggressive with coming in and just gutting whole parts of the app and changing them around because I know I'm backed by those tests, and I can't tell you how many times I have done a refactor that would either have taken a long time to piece by piece change it and make sure nothing broke, or simply would've just not done 'cause it was like ugh, I don't want to touch that 'cause it's going to be super painful. So testing is one, and along with that comes sort of how you write code, 'cause if you're writing code for tests, you will do things like lift variables up or extract things out to be either a dependency injection-type flow.

28:09 Michael Kennedy: Yeah, you'll think about small functions versus large functions...

28:11 AJ Pryor: Small functions, exactly.

28:13 Michael Kennedy: Because large ones are super hard to test.

28:15 AJ Pryor: Which you'll see, and this is not a knock on people who are not quote software engineers, but you get these big long functions and then somewhere embedded in the middle is like a hard-coded call to some API and the endpoint and credentials are buried in there and maybe in source control, and it just becomes very difficult when you want to come in and change something, which inevitably, you know, your client wants you to do, and you come in there and you're like oh, man, this is such a mess, it's going to take me a while to dig myself out, whereas if you take a little bit more time to set yourself up from the beginning, it just flows way faster.
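Editor's note: a minimal sketch of the idea discussed here, pulling the API call and its settings out of a long function so the logic can be tested without a network. All names (`ApiConfig`, `summarize_scores`, `fake_fetch`, the URL and token) are hypothetical and not taken from the episode.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ApiConfig:
    # Hypothetical settings. In practice these would likely come from
    # environment variables, not from source control.
    endpoint: str
    token: str


def summarize_scores(fetch: Callable[[ApiConfig], List[float]], config: ApiConfig) -> dict:
    """Small function: the API call is passed in rather than buried inside."""
    scores = fetch(config)
    mean = sum(scores) / len(scores) if scores else 0.0
    return {"count": len(scores), "mean": mean}


def fake_fetch(config: ApiConfig) -> List[float]:
    # Test double standing in for a real HTTP call.
    return [1.0, 2.0, 3.0]


def test_summarize_scores():
    config = ApiConfig("https://example.invalid/scores", "not-a-real-token")
    assert summarize_scores(fake_fetch, config) == {"count": 3, "mean": 2.0}
```

Because the fetch function is a parameter, swapping the real call for a fake one in a test needs no patching, which is one form of the dependency-injection style mentioned earlier.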

28:44 Michael Kennedy: Yeah, for sure.

28:45 AJ Pryor: So if I had to pick one or two, it's those, but I could go on and on about design patterns and whatnot, but I'll cut it there.

28:51 Michael Kennedy: Yeah, no, that's good advice. The one I see also often missing is proper source control, maybe. I don't know, it depends on the team, right, but if they're mostly scientists who are like leaving MATLAB and going into say Jupyter and Python, source control.

29:08 AJ Pryor: That's actually a good point, and I almost feel like source control gets overlooked as being just assumed that you know it, but interestingly, most people coming fresh out of school have only had a limited amount of exposure to it, but you're completely right, that's a huge part of it. I actually am extremely interested in Git, as a separate aside. I kind of, in our Slack channel, run a Git tip of the day, so you can look at some of the blogs that we'll link. We've actually published a couple of those just kind of as a way of keeping everybody's brains fresh on source control, or how do I revert this branch in a particular way, or what is git reflog, and it's like, I completely agree. If you're fluent in source control, it's also a big time saver.

29:46 Michael Kennedy: Yeah, well, I mean all these things go, everything we're touching on goes really well together. Testing so that I can change my code, writing code that is testable and easy to maintain, so I can have these tests, then having source control so that I can commit it and tag it and then go crazy and go, you know what, it was a horrible idea, we're either dropping this branch...

30:05 AJ Pryor: Trash that branch, yeah.

30:06 Michael Kennedy: Or we're rolling it back and we're going to skip over this bit, or whatever, right, like even if you forget to branch, you can still go back to your last commit or couple commits back. Yeah, they all kind of hit at the same core essence.

30:19 AJ Pryor: Yeah, and then you add your CI/CD on top of that, and now you've got it so every time I push changes, Circle or Travis or Jenkins or whatever grabs it, builds it, pushes it out, and I can be deploying five, 10, 100 times a day. That model works extremely well. You catch bugs more quickly, and you're not afraid to change stuff.

30:35 Michael Kennedy: Yeah, for sure. Cool, that's good advice. So when we're building these apps, I have a pretty good sense for when it makes sense to have, say, one of these JavaScript-rich web apps. I'm thinking of AngularJS, React, Vue.js, something where like a lot of the application logic is actually written in JavaScript, and then there's a bunch of services, probably written in Python, that you're talking to behind the scenes like we were talking about with Flask. I have a good sense for when that maybe makes sense versus when a more server-side backed framework, Flask, Pyramid, Django, something like that. You can just stick with that and not add that extra complexity, and I always feel like there's a little bit of glitchiness in the front-end apps somewhere in some setup with some plugin or whatever, but I do feel like there's actually a tendency for people to assume they need more JavaScript than they actually need and they need these front-end frameworks more than they actually need, like you can go a long way with the server-side framework, but there are times. So maybe what are your thoughts in the data science web space around that?

31:43 AJ Pryor: It's a good question. We do both. There's a time and a place for both. Kind of the single Flask app that does both, the front and the back end, and then when you need to bring JavaScript in, the biggest thing that JavaScript gives you, and that's the whole magic of the web, really, is the interactivity. So if you've got, from a data science perspective, some kind of application that's maybe kind of like a dashboard, where you've got a number of visualization widgets, and maybe you change something and it reaggregates data and it's responsive. That kind of thing really sings as a JavaScript app, whether it's React, Angular, jQuery, whatever you use. When it's more of sort of static content, you know, a blog or a report or some kind of, anything that could be served with just a SQL query.

32:25 Michael Kennedy: Right.

32:26 AJ Pryor: Even if it's kind of...

32:26 Michael Kennedy: Though it's data-driven, it's not static per se, but-

32:27 AJ Pryor: Right, exactly.

32:28 Michael Kennedy: Once it hits the page, you don't need it to be interactively changing unless you open another page, right, like here's a list of things and I click on the book and I have the book details, right, like all that doesn't need any JavaScript, probably.

32:40 AJ Pryor: Yeah, and sometimes those lines get blurred because what happens if that SQL query is either really complicated or it returns a huge number of rows? I still kind of want to interact with it on the front end? Maybe I want to paginate...

32:52 Michael Kennedy: Right, filter it or something, yeah.

32:54 AJ Pryor: Going back to the server, I'd have to run that huge, heavy query. Well that's not great, but if I also have to fit a bunch of data into the browser, that might not work, so where do I put that pagination that's now not always trivial? And so those kind of things tip the decision whether you want to go JS or pure Python.
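Editor's note: one common place to put pagination is in the query itself, so that only one page of rows ever leaves the database. The sketch below uses the standard-library `sqlite3` module with a hypothetical `measurements` table; the table, column names, and `fetch_page` helper are assumptions for illustration.

```python
import sqlite3


def fetch_page(conn, page, page_size=10):
    """Return one page of rows; only `page_size` rows are sent back."""
    offset = (page - 1) * page_size
    cur = conn.execute(
        "SELECT id, value FROM measurements ORDER BY id LIMIT ? OFFSET ?",
        (page_size, offset),
    )
    return cur.fetchall()


# In-memory database with 100 hypothetical rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (id INTEGER PRIMARY KEY, value REAL)")
conn.executemany(
    "INSERT INTO measurements (id, value) VALUES (?, ?)",
    [(i, i * 0.5) for i in range(1, 101)],
)

second_page = fetch_page(conn, page=2, page_size=10)  # rows with id 11 through 20
```

This keeps the browser payload small, but the database still runs the query for every page request, which is the trade-off being described when the underlying query is heavy.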

33:09 Michael Kennedy: Yeah, probably also heavily depends on your team, right? If you have a bunch of people and they just love Python and they don't want to touch JavaScript, like, saying, you know, it's probably better if we do this in JavaScript and we just force everyone to now do JavaScript, that might not actually be better given the people working on it, right, like it's, I think the team's desires and capabilities also should be considered, right?

33:30 AJ Pryor: Yeah, 'cause data scientists typically don't know JavaScript, right? They know Python and depending on their background, they might have some exposure to HTML, CSS, but you can't really depend on it. You can depend on Python or R, and hopefully SQL. So sometimes it's just the nature of who's going to be working on this project. They might choose the technology that fits, so it might be this is Python-only 'cause I'm really good with Jinja, that's fine. Then you use those, and it might be that we've got dedicated front-end engineers who are really good with JS and we put them on those heavier interactivity-type projects, and so it's also very much about sort of where your team's at. We're a small team, and so you kind of play to your strengths.

34:07 Michael Kennedy: Yeah, that's definitely a good idea. One of the ways you can add interaction to these apps and not go and write all that stuff in D3 or HTML5 Canvas in JavaScript, God forbid, is to use something like Bokeh or Plotly or some of these other interactive stuff. There's things you can add to your site as well that may get you close enough for this exploration, what do you think?

34:32 AJ Pryor: This stuff is magical, because it unlocks that interactivity that I was saying is so good about the web to people who don't know JavaScript. So Bokeh's a great example, Plotly is also one that I'm a big fan of. Plotly's nice because the charts that you create in Python are serializable to JSON, that's actually how it renders it into JavaScript eventually, but if you interact with Plotly within JavaScript itself, it's the same data structure, and so it plays very nicely when we have pure web devs and pure data scientists, that they can interact with the same objects, so I'm a big fan of Plotly. There's also a cool project called Dash that's part of the Plotly umbrella, I don't know how you call it, that is a declarative way to create dashboards and visualizations that actually renders into a React app, so the internals are pretty cool. You create this layout, it serializes it, and there's a React app that generates that into charts. So that's another cool way to unlock sort of interactivity. You can hook up widgets and whatnot. So those kind of ecosystems and tools, I see more and more of that. I think it's a great place to go, again, because it just enables people.
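Editor's note: a small sketch of the "same data structure on both sides" point, using only the standard library. The dictionary is hand-written in the general shape Plotly uses for a figure (traces under `data`, display options under `layout`); the chart contents are made up, and no Plotly code is actually run here.

```python
import json

# Hand-written dict in the general shape of a Plotly figure. With plotly
# installed, a Figure object's to_json() method produces JSON of this
# general shape; this sketch builds it by hand to stay dependency-free.
figure = {
    "data": [{"type": "bar", "x": ["a", "b", "c"], "y": [3, 1, 2]}],
    "layout": {"title": {"text": "Widgets per category"}},
}

payload = json.dumps(figure)    # what a Python API could send to the browser
restored = json.loads(payload)  # what front-end code sees after parsing
```

Since the structure survives the round trip unchanged, a data scientist and a front-end developer can both reason about the same `data` and `layout` keys.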

35:35 Michael Kennedy: Yeah, for sure. I think that's one of the things that you learn as you progress in software development is we probably all had the experience of I had this problem, I wrote all this code to solve it, and then I realized there was a library that would solve it in two lines and it took me a week to solve. There's all these things you can find and add in. I guess the challenge there is knowing when adding in something like that will get you 80% of the way there and would be a complete pain trying to finish it, or is it going to be good enough and you'll be happy with it? You know, like that's a real challenge, I think, but a lot of these data science visualization tools you can drop into your website do seem really nice.

36:13 AJ Pryor: Yeah, that's honestly one of the biggest things, I think, that makes experience matter is knowing when do I look for another tool? When should I pay for something? When do I roll my own? Or when do we go with the 80/20? And you know, you can overarchitect things to the point where nothing gets done, and then on the flip side, you can create this spaghetti monster mess that's completely unmaintainable, and I think being pragmatic is the biggest thing you hope to get with experience.

36:38 Michael Kennedy: Yeah, yeah, yeah, and coming back to what you were touching on before, if it's testable, maintainable, and you can evolve it quickly, you can make one choice and then change your mind, right?

36:49 AJ Pryor: Yeah exactly.

36:49 Michael Kennedy: Because it's easy to change, because you have the test that'll tell you no, it's not actually broken yet, and things like that, right? There's also that aspect of it, I think.

36:56 AJ Pryor: Yeah, you're payin' for technical debt, basically, and if you write unit tests and all these other things, you're getting rid of that, you catch up later.

37:05 Michael Kennedy: This portion of "Talk Python to Me" is brought to you by Rollbar. Got a question for you. Have you been outsourcing your bug discovery to your users? Have you been making them send you bug reports? You know, there's two problems with that. You can't discover all the bugs this way, and some users don't bother reporting bugs at all. They just leave, sometimes forever. The best software teams practice proactive error monitoring. They detect all the errors in their production apps and services in real time and debug important errors in minutes or hours, sometimes before users even notice. Teams from companies like Twilio and Instacart and CircleCI use Rollbar to do this. With Rollbar, you get a real-time feed of all the errors so you know exactly what's broken in production, and Rollbar automatically collects all the relevant data and metadata you need to debug the errors so you don't have to sift through logs. If you aren't using Rollbar yet, they have a special offer for you, and it's really awesome. Sign up and install Rollbar at talkpython.fm/rollbar, and Rollbar will send you a $100 gift card to use at Open Collective, where you can donate to any of the 900+ projects listed under the Open Source Collective or to the Women Who Code organization. Get notified of errors in real time and make a difference in open source. Visit talkpython.fm/rollbar today. You recently wrote an article called "Flask Best Practices: Patterns for Building Testable, Scalable, Maintainable APIs" really from the perspective that we're coming from here from the data science side of things, but also the web developer side, so maybe we could touch on that a little bit.

38:38 AJ Pryor: Sure, so the idea with that blog post was for structuring these kind of applications, you know, Flask is very unopinionated, which is great. It's kind of like in the front-end world, you have this sort of React versus Angular, and there's also Vue, but a lot of discussion between those two, and one of the biggest differences, React is very unopinionated and kind of, call it plug-in based, and then Angular is full weight, and sort of opinionated, but kind of has all the bells and whistles included, and in the Python world, Flask kind of fits that sort of plug-and-play-type thing, where you have the freedom to make all of these decisions and do things the way you want, but you also have the burden to make all these decisions and do things the way you want. And so it can cut both ways.

39:20 Michael Kennedy: One of the things that's always bugged me about Flask is I felt like it was presented artificially simplistic, and what I mean by that is like they always, Django's super complicated, and look at all the stuff you do for Pyramid with this cookie cutter thing, but what'd you get, well you just, for Flask, you just create one file, app.py, you create the app, you put app.route, one function, boom, we're done. And that's true, it does say "hello, world" on the screen, but that's not maintainable, that's not how real apps work. They get big and complicated, and like you were saying, there's just an absence of any guidance on the next step, right? How does that not become a 4,000-line app.py?
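Editor's note: for reference, the single-file starting point described here looks roughly like the following. It is the usual minimal Flask example, shown only to make the "one file, one route" picture concrete.

```python
from flask import Flask

app = Flask(__name__)


@app.route("/")
def hello():
    # The whole application: one file, one app object, one route.
    return "hello, world"
```

Everything beyond this, such as where models, validation, and tests live, is left to the developer, which is the gap the rest of the conversation addresses.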

40:00 AJ Pryor: Well, that's what happens. I mean, Armin Ronacher is very humble, so he tends to give talks and basically downplay Flask and say look how simple it is, but you can build big systems with it, you just have to make the right decisions, and because there's not a whole lot of direction out there, you do end up with these multi-thousand-line monstrosities that, you know, you end up having to Control+F for whatever endpoint you're hitting, and it's very difficult to sort of grok. And yeah, so this blog post is basically, after having tried a lot of different things and having some sort of core philosophies like testing that I needed out of the end product, it was a pattern that I landed on that we tried for several months and it really sung, very happy with it, and so this blog post is basically kind of just sharing that experience 'cause it was working within a team on a code base that was a mixture of Python, Typescript, JS, all this kind of things, and so...

40:51 Michael Kennedy: Yeah, so you basically talk about a set of tools, a way of organizing all the files that you actually have on real, actual Flask apps like data access layers and models and whatever, right? Test files and whatnot. The structure is really interesting. I think it's especially good for APIs. It's not the structure that I use for my website, but I do have a very structured way that kind of has similar goals to what you have. So maybe we could just start first talking about this, some of the packages. So you have Flask, obviously, pytest, it's pretty de facto these days, SQLAlchemy I think is probably the most common ORM if you're going to talk to a relational database.

41:30 AJ Pryor: What, there's another one?

41:31 Michael Kennedy: There are, we got Peewee ORM and some other interesting little ones...

41:36 AJ Pryor: Yeah, yeah, yeah, yeah, I know.

41:37 Michael Kennedy: Yeah, I know, but, yeah, you have to choose, though, right? With Flask, you've got to go look 'em all up and decide, right? With Django, it's just Django ORM or whatever. But then some other interesting ones, you have Flask-RESTPlus, which is, it's interesting, you have Flask-accepts, and also Marshmallow, so maybe touch on some of the other ones like Marshmallow, Flask-RESTPlus, and so on.

41:59 AJ Pryor: Well specifically, I wanted to integrate these things, so Marshmallow is not, is really outside of Flask. It's a serialization and deserialization library, and so it has a really rich way of declaring that there are these fields and they have a certain type, and some things might be required, some not, some might have defaults, some might not, and furthermore, it gives you all these hooks for it.

42:22 Michael Kennedy: Yeah, so I give you like a Python dictionary, which maybe originated from a JSON file and then Marshmallow will answer the question like is this valid data? Something like that?

42:31 AJ Pryor: Exactly. And if it's not, it will give you good error messages, so let's say you give me a JSON file, you post it to my server, and I've declared that I want to make a widget and a widget has these fields that I declare with the Marshmallow schema. I call the load method, and it will take the input data and create either a dictionary or you can give it a hook to create some kind of actual Python class, and if one of those inputs fails validation, it gives you a really nice error message that oh, this is not a valid int, and so on the front end, the front end devs, it really helps them out because they say, oh, okay, yeah, AJ's schemas kicked back an error because I'm passing a float instead of an int, or whatever it is. So there's Marshmallow, and then it also does the serialization on the way out, so for example, you have a user model that has a password, you probably don't want to send the password back to the front end, so you can set certain fields that are read or write only to solve that kind of problem. So it's very flexible. On the flip side, there is Flask-RESTPlus, which I was very excited about, which basically just makes it easier to write endpoints where you're going through the full GET, POST, PUT, DELETE flow, because otherwise, you're writing a raw Flask endpoint, and you have to sort of declare the methods it takes and effectively do a switch case on that, but with Flask-RESTPlus, you create these resources that are class-based, and you put the methods that they need for each of the different types of HTTP methods, and it just makes it less code. And so combining these two, though, was not trivial, because on one hand, Flask-RESTPlus has its own validation scheme that, although they declare is going to be deprecated, it doesn't seem like that's ever actually going to happen. They even reference that we're deprecating this because we would like to switch to Marshmallow, but if you look through the Git history, there's years of conversation where they're like, eh, well, it's kind of hard, and we've got this and that problem.

44:24 Michael Kennedy: Sure.

44:25 AJ Pryor: And so I'm sort of at this frustration point, where I want to use both of these cool technologies, and so Flask-accepts is a library, it's just a couple of decorators, really, that allows me to unify RESTPlus and Marshmallow, because what RESTPlus gives you in addition to those resources is Swagger docs, and the Swagger docs allow it to turn your API into a webpage that I can go to and it shows me all the endpoints and I can interactively hit them. Basically, it's like, almost imagine if you had some kind of export config that would let you load into Postman, and it just pre-populated every one of your endpoints with a click, and it would just hit them with the right parameters and whatnot.

45:04 Michael Kennedy: That's pretty awesome.

45:04 AJ Pryor: In order to get both of those, this Flask-accepts library allows you to sort of mix the two, so you would just have an endpoint, you put this accepts decorator, give it a Marshmallow schema, it will apply that validation on the way in, attach something to the Flask request object that you can then go on your merry way and use, and it also supports those Swagger docs.

45:23 Michael Kennedy: Yeah, it looks really nice. And this Flask-accepts is something that you wrote because of your work with Marshmallow and Flask-RESTPlus and you're like, these need to live better together, right?

45:32 AJ Pryor: Yeah, it was purely a personal pain point and I wrote it over like a weekend to solve a problem I had at work, and naturally, then, since it's 2019, you open source things, so.

45:42 Michael Kennedy: Yeah, it's cool.

45:43 AJ Pryor: I don't know if people will find it useful or not. There's other things, but it solves my problem.

45:47 Michael Kennedy: Yeah, super. So those are all really good, and you talk about those in there. One other thing I want to throw out there, if people are just thinking about a new Flask project, potentially, is secure.py. There's a lot of recommendations from OWASP and other web security companies and organizations saying, you know what, you should really add the X-XSS-Protection header and turn it on and set the mode to block, or the X-Frame-Options header should by default be set to same origin, so nobody can embed you into their sites and make it look different or whatever. So like with one line of middleware for Flask, Pyramid, Django, and a whole bunch of others, you can just add this thing and it'll automatically add all those headers, so that's pretty cool. So if you're security-conscious, that might be worth checking out.
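Editor's note: to show what "automatically add all those headers" amounts to, here is a hand-rolled sketch using Flask's `after_request` hook. It does not use secure.py or reproduce its API; it only sets the two headers mentioned in the conversation.

```python
from flask import Flask

app = Flask(__name__)


@app.after_request
def add_security_headers(response):
    # Runs after every request and decorates the outgoing response.
    response.headers["X-Frame-Options"] = "SAMEORIGIN"
    response.headers["X-XSS-Protection"] = "1; mode=block"
    return response


@app.route("/")
def index():
    return "ok"
```

A dedicated library covers more headers and keeps the defaults up to date, which is the argument for delegating this rather than maintaining it by hand.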

46:31 AJ Pryor: Yeah I think, especially with security, it's really smart to delegate that to third parties.

46:36 Michael Kennedy: Yes.

46:36 AJ Pryor: And allow them to be the experts, because it's a full-time career to keep up with that and every day, there's some new hack people come out with, and just delegate that stuff.

46:44 Michael Kennedy: Yeah, and you'll sleep slightly better.

46:46 AJ Pryor: Slightly.

46:46 Michael Kennedy: Slightly. Like it's no panacea, but it's better than not.

46:51 AJ Pryor: You're looking for other problems.

46:52 Michael Kennedy: Exactly. So let's, we don't have a whole ton of time to spend on this, but I do want to talk just a little bit about this and then touch on some high-performance computing bits as well, but maybe give us a quick overview of how you're structuring your Flask app, because I think it's, like I said, it's different than what I'm doing and it's a little bit unique, and people can think about whether it makes sense, but I think it's definitely worth considering. What are you doing there?

47:15 AJ Pryor: And you were right that this is kind of more thought about from an API perspective, so if you're doing a lot of view rendering, this is not directly the same approach. But in essence, the philosophy is you want to be able to break up each of your, call them entities, it's a thing you're going to interact with from the server, into separate directories.

47:33 Michael Kennedy: This is the core tenet of your idea of organizing your code and your files and whatnot, and the reason I think this makes more sense for APIs than it does for apps is APIs are often centered around these entities, right? Like I want to go talk about the users or talk about, right, like this is like the essence of what APIs do often.

47:53 AJ Pryor: Yeah, but I mean even if I were to go set up a site that was more view-based, I would still think about it this way in the sense of...

47:58 Michael Kennedy: Sure.

47:59 AJ Pryor: If I had some kind of module that was like my users, I'm still going to change each of the pages maybe within the users' module...

48:05 Michael Kennedy: Yes.

48:05 AJ Pryor: To be their own single folders.

48:08 Michael Kennedy: Yeah, I agree. So go ahead and tell us about this.

48:11 AJ Pryor: Yeah, so the idea, you have a model which is basically, probably your SQLAlchemy layer, you know, your persistent storage. You have some kind of interface that, we called it an interface, it's 'cause it's basically what a TypeScript interface is, so it defines the type, what are the attributes that need to make this object, and there's a whole discussion there around why typing is super beneficial, so there's great tools out there like mypy and static analysis tools, and they save you an enormous amount of bug finding by highlighting errors where you've typoed something or missed a parameter, and so that gives you a layer to sort of make sure that your functions are behaving the way, at least you're using them the way you've defined them. There's the schema, which is the Marshmallow schema that I referred to, and so that handles the serialization layer on the way in and out of the application. That's where you can do last-minute transforms, change names, export things, pick fields off, whatever. There is the controller. The controller is what's, it's actually, it's the route. The controller, the way I think about it is it takes all these other pieces and it kind of is the glue. You know, gets the parameters, calls the service, which I'll discuss in a second, gets the data, wrangles it, and outputs it. And then lastly, you have the service, which is what's responsible for actually manipulating the entity. And now whether the entity is like a user you're storing in your database or whether it is the result of some optimization calculation, doesn't really matter, it's just a way for you to organize all this code. And then internally, you can use Pandas or NumPy or C++ or whatever, it's just how you think about it.

49:44 Michael Kennedy: Because the last place that those database queries or those Pandas calculations should be is like in the route view method.

49:53 AJ Pryor: Under your route. Yeah, you don't want that.

49:54 Michael Kennedy: It shouldn't be crammed in there, I get you. They're separate.

49:56 AJ Pryor: So the route should be pretty simple...

49:57 Michael Kennedy: Yes, exactly.

49:57 AJ Pryor: Right, you know, five, 10 lines. Get the data, call the service, serialize it.

50:01 Michael Kennedy: Exactly, it should be orchestration between all these other...

50:02 AJ Pryor: Yeah, exactly.

50:03 Michael Kennedy: Pieces is the way I see it. Yeah, cool.

50:07 AJ Pryor: And then tests go alongside with that.

50:08 Michael Kennedy: Yeah, so your main philosophy here is that instead of having a services section and a controller section and a data section and whatnot is you have like a user section with a service controller model kind of in its own self-contained area, right?

50:25 AJ Pryor: Yes, so how long would it take you to delete everything related to users? There's basically your sanity check. If you got to go through and dig eight trees down into every other directory and find it, you're not compliant with what I am proposing.

50:39 Michael Kennedy: Okay, yeah, it's an interesting article and people can check it out for sure.

50:43 AJ Pryor: The gory details, yeah, just for sure.

50:44 Michael Kennedy: Yeah, exactly, but I do definitely think, if nothing else, Flask-RESTPlus, Marshmallow, all these things are pretty interesting and it's cool to see how you're putting them to use there. Great, so let's talk a little bit about computation. You know, I touched on this at the beginning when I said what is the deployment story, right? Do I need to like deploy to a cluster that has GPUs because I got to have GPUs in production, or something like this, right? Let's talk a little bit about that. I mean, I think one of the things that's interesting to just touch on first is Python performance. People will tell me sometimes, usually they're not Python people, they'll tell me Python is slow so I can't use Python for X, right? The performance story of Python is actually complicated and it depends, right? Like yes, maybe this part of Python, if I wrote it in pure Python, is slower than say Java, but if I were to write it in C, it would actually be faster than Java, and so I could actually do something like a Cython version of this code, which then, you know, maps over to C++ or C and then compiles down to native instructions. So is it fast or is it slow? I don't know, it's like this blend. So you can do a lot of tricks to make it fast where it matters, use things like NumPy and whatnot, but maybe, how do you see that world from somebody who does way more computation than I do?

52:05 AJ Pryor: Right, this whole topic of HPC, High Performance Computing, is something I'm super passionate about. It was my whole grad experience. As far as Python's performance, you should compute from Python, but not with Python. So Python is slow if you're talking about doing numeric computation in Python. You should invoke Python libraries that push that computation down to C. So I mean I write loops all the time, but any time you're writing a loop to accomplish something that has a lot of i's, j's, and k's in it, you should at least kind of get the heebie jeebies and think maybe there's a NumPy or Pandas method that's going to push this down to a C or C++ layer that's super fast.
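Editor's note: a small illustration of "compute from Python, but not with Python", assuming NumPy is installed. The triple loop over i, j, and k and the `@` operator compute the same matrix product; the array sizes and the `matmul_loops` helper are arbitrary choices for this sketch.

```python
import numpy as np


def matmul_loops(a, b):
    """Triple loop in pure Python: every multiply-add runs in the interpreter."""
    n, m, p = len(a), len(b), len(b[0])
    out = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                out[i][j] += a[i][k] * b[k][j]
    return out


rng = np.random.default_rng(0)
a = rng.random((20, 30))
b = rng.random((30, 10))

slow = matmul_loops(a.tolist(), b.tolist())
fast = a @ b  # same computation, dispatched to compiled code inside NumPy
```

Both paths give the same numbers up to floating-point rounding; the difference is where the inner loop executes.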

52:45 Michael Kennedy: Right, and if there's not, could you make that one function a Cython function? And really change it, right?

52:51 AJ Pryor: Right, yeah, so Cython's the way to go if you have a custom operation that you want to do that you can't do out of the box with NumPy, because basically all you do is you provide a set of annotations to create the Cython file, and then you run a transpiler that converts that into valid C or C++ and then brings that back into Python. 'Cause Python, if you don't know, is ultimately written in C, so this kind of gives you a way to write Python and compile it and then bring it back in, or you could also just write directly in C++, and that's kind of what you have to do if you want to do custom computing with GPUs, and again, unless you're using a library that enables GPU routines, if you're doing custom stuff, you're going to have to go to the C++ level and bring in like a shared library or something like that. At that point, it's pretty much unavoidable.

53:40 Michael Kennedy: Yeah, it's interesting, I have yet to need to go to C++ for anything, but I'm happy when the libraries that I use have C extensions or parts of them are somehow using C speedups to make them much faster and I don't have to worry about that.

53:57 AJ Pryor: Yeah, so I find writing in raw C++ is helpful if you need to deal with a lot of multi-threaded stuff. Multi-threaded in the C++ context, not in the GIL context, so if you have an algorithm that you want to manually deal with pushing out threads and multiprocessing something and bringing them back, I find it's easier to do in C++, but to be honest, I don't have to do that very much day to day. I did that in a past life, but in industrial data science, I don't need to do that. Pretty much NumPy, SciPy, Pandas, gives me everything I need, more or less.

54:27 Michael Kennedy: All right.

54:27 AJ Pryor: It takes a lot of time, right, it's developer time versus compute time, at the end of the day.

54:30 Michael Kennedy: Yeah, yeah, it's another one of those define fast, right, like if it takes you two days to write the C++ code that then runs in five minutes, or it took you half an hour to write the Python code and it runs in 20 minutes, who solved that problem first, right? It depends how many times you're running it. There's a lot of considerations there.
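
The trade-off in those numbers can be worked out directly. This is a rough sketch that assumes "two days" means two 8-hour working days; the helper names are hypothetical.

```python
WORKDAY_MINUTES = 8 * 60  # assumption: a "day" of development is 8 working hours


def total_minutes(dev_minutes, run_minutes, runs):
    """Total time spent: writing the code once, then running it `runs` times."""
    return dev_minutes + run_minutes * runs


def break_even_runs(dev_fast, run_fast, dev_slow, run_slow):
    """Number of runs at which the faster-running option catches up.

    Assumes run_slow > run_fast, otherwise there is no break-even point.
    """
    return (dev_fast - dev_slow) / (run_slow - run_fast)


cpp_dev, cpp_run = 2 * WORKDAY_MINUTES, 5   # two days to write, 5 minutes per run
py_dev, py_run = 30, 20                     # half an hour to write, 20 minutes per run

runs = break_even_runs(cpp_dev, cpp_run, py_dev, py_run)
print(runs)  # (960 - 30) / (20 - 5) = 62.0
```

Under that assumption the Python version stays ahead until roughly the 62nd run, which is why how many times you run the job matters so much.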

54:50 AJ Pryor: But if you're working for a hedge fund, where a tiny increment in performance, whatever, puts a bunch of zeroes after a dollar sign, then maybe it matters and it makes sense to spend weeks on this tiny micro-optimization. It's all about context.

55:02 Michael Kennedy: The whole hedge fund algorithmic trading space is crazy. Like when you consider server colocation and stuff so that you can drop a few milliseconds' latency because that changes your profit margins, that's a bizarre industry, yes. But it's, that's how it is, right? Cool, all right, so you talked about GPUs. What are some of the things that we compute with GPUs? You know the real simple ones are like I am doing machine learning training, like I'm training a machine-learning model, and I'm going to do that say with GPUs or something like that, but what are some of the more, what are some of the other things I might go and write GPU code with? For, rather?

55:39 AJ Pryor: Well, in practice, you might not need to. You probably don't need to until you have a problem that you don't have a better solution for. So I don't necessarily think it's sort of the general audience, but for the people that it's the right solution for, it's probably the only good solution. The GPUs are very good at doing enormous computations in parallel, so interestingly, I mean, they started out as graphics cards, right, and I think people hear of it, like make the connection between my video game engine and how am I doing science? Well, what actually happened was scientists realized that they could use these native graphics card APIs for texture or vertex shading, and they wrote problems like, whatever, an earthquake simulation or something, using these vertex APIs because along the way, it did the math they needed, and that was sort of where CUDA came out. So the kinds of problems they solve are things that you can do heavily in parallel. So for example, matrix math, GPUs are very good at. Aggregations, GPUs are very good at. Sorting, not very good at, because sorting is sort of inherently, in general, interdependent on the state of the rest of the algorithm, and although there's GPU sorting algorithms, at a high level, those are the kinds of problems, so it's the right tool for the right job kind of.

56:52 Michael Kennedy: Yeah, yeah, cool. You know, I didn't appreciate how much computation GPUs could do until I started working in 3D graphics, which I did a long time ago, but you think, I'm loading up all these models, I'm applying all these, like every time you want to rotate or move any item that is on the screen, it's a matrix multiplication on all the vertices of it to make that happen, and there's usually multiple ones and then, just, if you start to think about a modern graphical simulation, a game or some other 3D environment, and then the amount of matrix multiplications per second at 120 frames a second...

57:29 AJ Pryor: It's mind boggling.

57:30 Michael Kennedy: Like the human mind cannot grasp how much math is happening per second. It's just unbelievable.

57:35 AJ Pryor: Yes, I mean the physics engines and games these days do ricochets and everything on the fly, and then there's also the memory management. You've got memory on the host and the GPU and it's constantly going back and forth. I mean, all of this, it's extremely complex, so it's really amazing technology.

57:50 Michael Kennedy: Yeah, if you can turn that computational engine onto more direct problems like you're talking about, it's pretty interesting. So some other options, you know, like there's grid computing. We could set up like 50 computers in AWS or GPC or GCP, something like that. What's a good story for that, right? Maybe you don't need all these clusters, or if you do have them, maybe you can use something like Dask and you don't actually have to program against it, it just, magic happens if you set them up right.

58:19 AJ Pryor: Yeah, Dask is a great project 'cause it's basically solving these kinds of problems for you. As the user, you just say, I have a thing, I need it to run fast. I don't want to know about how you have your nodes networked together for running an MPI job, but originally, that was what you had to do. So I think the more projects like Dask evolve, and you kind of write what feels like maybe Pandas and it can just scale outwards, it's going to be really good for the community.

58:45 Michael Kennedy: Yeah, that's cool. You know, I always thought Dask was something that would make no sense for me, 'cause I don't really have these large cluster-type computations that I ever have to do, but when I was talking to Matthew Rocklin, I realized, or he pointed out to me, that you can run Dask on a single CPU, on a single machine, and it'll actually do the multiprocessing and parallelism and all that, and so it's just another interesting use case where you might not think of it.

59:09 AJ Pryor: Yeah, I mean, all that comes with overhead, too, though, so it's always right tool for the right job. You wouldn't take an airplane across the street, even though it's faster than a car, right, because by the time you went through all of that, it doesn't make any sense. The overhead's too high. But at some point, it catches up, right?

59:24 Michael Kennedy: Yeah, for sure. Super interesting. Okay, well I guess let's leave with this. You have a few other libraries to leverage HPC without too much expertise. You want to take us through those really quick, and then we'll close it out.

59:37 AJ Pryor: Yeah, so it just, the idea is if you're not somebody who knows CUDA and is as obsessed with all these little micro-optimizations like I've been for a long time, which is the vast majority of the normal working coding population, the way you get the advantage of these is to just use those tools like Dask. That's one option. There's libraries that are inherently parallel. NumPy has some multicore support and whatnot. There's also a lot of analytic sort of more managed services, like Google BigQuery is basically a SQL-like engine that just scales out transparently to you to lots of nodes and comes back very quickly. There's another one called Citus DB that Microsoft I think recently acquired, and it's very good for that kind of thing, so there's technologies that you can leverage, and I think by and large, that's the most straightforward way to get into that without getting a PhD in the subject.

1:00:27 Michael Kennedy: Yeah, start there, then go custom if you got to, right?

1:00:30 AJ Pryor: Exactly.

1:00:31 Michael Kennedy: All right, cool. Well, this has been really fun to talk about all these things. Now before we wrap it up, I got to ask you the two final questions, of course. If you're writing some Python code, what editor do you use?

1:00:41 AJ Pryor: VS Code these days. I used to be diehard Vim, 'cause I knew the shortcuts, I could get around faster, and something about VS Code, I found myself outpacing my productivity, and then at that point, I was sold. I mean I still use Vim when I'm in a remote terminal, but I'm a big VS Code fan, yeah.

1:01:00 Michael Kennedy: That's cool, yeah, it's definitely got the momentum these days and there's a lot of effort to make it better, so I think it's only going to get better.

1:01:06 AJ Pryor: It's great for TypeScript, if you're writing Angular. It's very good for TypeScript. I mean, Microsoft makes both products, go figure.

1:01:12 Michael Kennedy: Yeah, and like it's literally, even the Python extension is written in TypeScript, so it's no surprise that it's good at writing in TypeScript, right?

1:01:21 AJ Pryor: Yeah, Electron apps, yeah, I mean, they use TypeScript for that.

1:01:23 Michael Kennedy: Yeah, absolutely. Cool, and then notable PyPI packages?

1:01:26 AJ Pryor: So there's one called pytest-watch. I mentioned, I mean, I like pytest, and this one's super simple. It just gives you a terminal watcher, so you pip install pytest-watch and there's like a ptw command that just on reload will run your test again. Super simple, but I use it all the time.
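
As a small sketch of that workflow, a test module like the one below is what the watcher would re-run on each save. The file name and the mean function are hypothetical; the two shell commands in the comments are the ones mentioned above.

```python
# test_stats.py (hypothetical file name)
#
# In a terminal, from the project directory:
#   pip install pytest-watch
#   ptw
# The watcher then re-runs the tests whenever a file changes.


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


def test_mean_of_known_values():
    assert mean([1, 2, 3, 4]) == 2.5


def test_mean_of_single_value():
    assert mean([7]) == 7
```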

1:01:42 Michael Kennedy: That's really cool. So if you just hit save, your tests run, basically.

1:01:46 AJ Pryor: Yep, yep, and so I just leave that running pretty much 24/7 and just sit there, change code, Control+S and just glance over, make sure my tests are good, all day long.

1:01:54 Michael Kennedy: That's pretty solid.

1:01:54 AJ Pryor: Very helpful. Another is, I mentioned mypy, and then specifically, there's a flake8 mypy extension for VS Code. Mypy is a static type analysis tool, so it will do things like if I write a function in Python and I say this takes a parameter that is a string, and then I later call that function with a parameter that's an int, I get a little red squiggle and it says error, you've called this function incorrectly, and it catches a whole class of bugs that you might not find until run time because, well, it's a dynamic language. So this is a newer thing for me. It's mostly born out of using TypeScript for a while and moving from JavaScript to TypeScript and realizing like wow, this is a huge productivity boost, and so now that Python seems to be moving towards more this type annotation stuff, I'm just completely jumping on the bandwagon.
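
The string-versus-int case described here looks roughly like the following. The shout function is a made-up example; the commented-out call is the one a type checker would flag before the code ever runs.

```python
def shout(message: str) -> str:
    """Upper-case a message; the annotation says it must be a string."""
    return message.upper() + "!"


print(shout("tests are passing"))

# Calling it with an int is the kind of mistake mypy reports statically,
# with an error along the lines of: incompatible type "int"; expected "str".
# Without a type checker, the call only fails at run time with an
# AttributeError, because int has no .upper() method.
# shout(42)
```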

1:02:44 Michael Kennedy: That's cool, I'm a fan of type annotations as well.

1:02:45 AJ Pryor: It's working out well.

1:02:46 Michael Kennedy: Yeah, I really think they add a lot. And then if you are interested in performance, there's mypyc, which compiles that code, the type annotated Python to C code as well, so those people looking for performance, that's another option.

1:02:58 AJ Pryor: Is that a Facebook thing?

1:03:00 Michael Kennedy: It is, ah, where is it? It's under its own organization. I don't remember. There was one like that, I think, from Facebook and one like that from Dropbox, and I don't know which is which.

1:03:12 AJ Pryor: Now somebody's listening to it and just cursing my name for that. "No, that was my project!" They're like oh, this guy's an idiot.

1:03:16 Michael Kennedy: It's under its own organization, it's not under a different organization. Yeah, so I'm not sure, I...

1:03:21 AJ Pryor: That's a great idea, though, the idea that you annotate this, and now it's almost like Cython, you know, it knows more about what you meant and now it could potentially convert that into better code. I could see it going there.

1:03:31 Michael Kennedy: Yeah, yeah, I don't think it's generally useful yet, but it's getting close, so it's pretty cool.

1:03:36 AJ Pryor: Yeah. You need critical mass.

1:03:38 Michael Kennedy: Yeah, I think they used it to build mypy.

1:03:38 AJ Pryor: It gets to a certain point, there's enough adoption.

1:03:41 Michael Kennedy: Yeah, exactly. Super cool, and then you have one more you want to throw out there before we hit the...

1:03:44 AJ Pryor: Yeah, I have one more. It's called RxPY, so Reactive extensions. You should go to reactivex.io. It's a whole pattern of programming. It's been around for a while, but it's based on the idea of observable streams, so rather than imperatively saying I want my program to do X, then Y, you say I want to sort of declare what my program should do in response to data changes, and it makes certain types of problems tremendously easier to reason about and program, and there's a JavaScript counterpart, RxJS, that's super popular in Angular, and this is the Python implementation. It's not nearly as mainstream, even though that repo has a lot of stars, but it's something I'm experimenting a lot with just because I've seen some of the problems it solves in front end development and found a couple of interesting cases where it can be used in Python.
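
To illustrate the observable-stream idea without relying on any particular RxPY version, here is a hand-rolled sketch in plain Python. The Stream class and its methods are assumptions made for illustration and are not RxPY's API; the real library offers a much richer set of operators.

```python
class Stream:
    """A tiny observable: subscribers are called whenever a value is pushed."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def push(self, value):
        for callback in self._subscribers:
            callback(value)

    def map(self, fn):
        """Return a new stream that emits fn(value) for each value."""
        out = Stream()
        self.subscribe(lambda value: out.push(fn(value)))
        return out

    def filter(self, predicate):
        """Return a new stream that only emits values passing the predicate."""
        out = Stream()
        self.subscribe(lambda value: out.push(value) if predicate(value) else None)
        return out


# Declare what should happen in response to data, before any data arrives.
readings = Stream()
alerts = []
(readings
    .filter(lambda celsius: celsius > 30)
    .map(lambda celsius: f"high temperature: {celsius} C")
    .subscribe(alerts.append))

# Data shows up later; the declared pipeline reacts to each value.
for value in [21, 35, 28, 40]:
    readings.push(value)

print(alerts)
```

The pipeline is declared once up front, and nothing in the loop that pushes values needs to know what reacts to them.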

1:04:33 Michael Kennedy: Yeah, looks interesting. Haven't used that before, but the whole observable data notification stuff is, it's kind of cool. Nice, good recommendation. So all right, so people are out there, maybe they're doing some data science, they want to get it on the web, what's the final call to action? Where do they start?

1:04:50 AJ Pryor: So, well one place you could start is there's another project that sort of was born out of a weekend thing that would be very, it's sort of in its infancy, called Flaskerize, and the idea there was we have front-end apps and back-end apps that you traditionally deploy separately, or you make a Python-only app, but in principle, you could serve your React front end from the same Flask API and if you have a single app, it makes deployment a lot easier, it makes scaling a lot easier, and so we sort of had this idea like why don't I just make a command line tool that could take a static site that you built with whatever, whether it's a JavaScript thing like React or Angular or whether it's a Jekyll or Gatsby or whatever, and you want to now basically embed that in your existing API and potentially Dockerize it with like one command, and so that's what this Flaskerize project is about is sort of making a code generation and templating and dev tool sort of Flask command line interface, and at this point, I sort of played around with it a bit and I'm using some of its dev tool form in production, and I think there's a lot that could be done with this, but I don't necessarily always have the time, so I have a lot of ideas around that. So it'd be interesting for people to check that out or potentially contribute. You know, there's a rich CLI in some of the front-end communities that doesn't exist in Python. I think that could be a real productivity boost in Python. We don't really have a good like templated generator. At least not that I've found, particularly.

1:06:21 Michael Kennedy: Yeah, that's pretty interesting. Take your stuff and basically convert it to Flask. That's cool. All right, yeah, nice. And then I'm going to have a bunch of links from various articles and libraries that you've talked about here, so throw those all in the show notes, people can just click on them in their player.

1:06:35 AJ Pryor: Yeah, and then I'm speaking at a conference at the end of the month if you want to hear me ramble on more about this type of stuff. If you're near the Charlotte area, there's the Data Science North Carolina, 2019. I guess we'll put a link in the show notes below for that as well.

1:06:48 Michael Kennedy: Yeah, do you know if they record those videos and put 'em online?

1:06:51 AJ Pryor: They do. I'll make a reminder to myself once that's out, to put that up. The conference is at the end of August, so it'll be around shortly after...

1:06:57 Michael Kennedy: It'll be close to the time... Close to the time this comes out, yeah, for sure.

1:07:01 AJ Pryor: Through the magic of time travel...

1:07:02 Michael Kennedy: Exactly.

1:07:02 AJ Pryor: We can put that link retro back in.

1:07:05 Michael Kennedy: Yeah, that sounds good. And then credit where credit's due, mypyc is developed by Dropbox.

1:07:10 AJ Pryor: Thank you for preventing me from getting sued. Appreciate that.

1:07:13 Michael Kennedy: Yeah, no problem. All right, well AJ, it's been really fun to talk about this intersection of data science and web development, and thanks for sharing with everyone.

1:07:21 AJ Pryor: Yes, sir, thank you for having me.

1:07:23 Michael Kennedy: This has been another episode of Talk Python to Me. Our guest on this episode was AJ Pryor, and it's been brought to you by Linode and Rollbar. Linode is your go-to hosting for whatever you're building with Python. Get four months free at talkpython.fm/linode. That's L-I-N-O-D-E. Rollbar takes the pain out of errors. They give you the context and insight you need to quickly locate and fix errors that might have gone unnoticed until users complain, of course. Track a ridiculous number of errors for free as Talk Python to Me listeners at talkpython.fm/rollbar. Want to level up your Python? If you're just getting started, try my Python Jumpstart by Building Ten Apps course, or if you're looking for something more advanced, check out our new Async course that digs into all the different types of async programming you can do in Python. And of course, if you're interested in more than one of these, be sure to check out our Everything Bundle. It's like a subscription that never expires. Be sure to subscribe to the show. Open your favorite podcatcher and search for Python. We should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. This is your host, Michael Kennedy. Thanks so much for listening, I really appreciate it. Now get out there and write some Python code.
