#226: Building Flask APIs for data scientists Transcript

Recorded on Monday, Aug 5, 2019.

00:00 If you're a data scientist, how do you deliver your analysis and your models to the people who need them?

00:04 A really good option is to serve them over Flask as an API.

00:07 But there are some special considerations you might want to keep in mind.

00:12 How should you structure this API?

00:13 What type of project structures work best for data science and web apps together?

00:18 That and much more on this episode of Talk Python to Me with guest AJ Pryor.

00:22 It's episode 226, recorded August 5th, 2019.

00:26 Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities.

00:46 This is your host, Michael Kennedy.

00:47 Follow me on Twitter where I'm @mkennedy.

00:50 Keep up with the show and listen to past episodes at talkpython.fm.

00:53 And follow the show on Twitter via @talkpython.

00:56 This episode is brought to you by Linode and Rollbar.

00:59 Please check out what they're offering during their segments.

01:01 It really helps support the show.

01:02 Hey, AJ, welcome to Talk Python to Me.

01:04 Hi, Michael. How are you doing?

01:05 I'm doing super well. Thanks for being on the show.

01:07 Yeah, absolutely. I'm super excited.

01:09 Yeah, it's going to be a lot of fun.

01:10 We get to talk about some things that are really popular in Python, web development, and data science.

01:18 And then we're going to intersect them together, which, I don't know, is going to make them mega popular.

01:22 Because if you look at the Python space, all the surveys and sort of where people are working, it seems like it's mostly web development or data science, and then just like a whole bunch of others, right?

01:35 So we're hitting both of those today.

01:36 Yeah, I mean, you've got on one hand sort of most of the code that's written is for the web.

01:40 And then you've got this meteoric rise of Python coming beside it.

01:43 So I figure why not just do both?

01:45 And then you're pretty much positioned at the top.

01:47 Yeah, it's definitely a good place to be, I would say.

01:49 One way to look at it, at least, right?

01:51 Yeah, it's a positive way for sure.

01:52 So before we get into that, though, let's go ahead and start with your story.

01:55 How did you get into programming and Python?

01:57 Sure.

01:57 Actually, I didn't grow up around programming at all.

02:00 I grew up in the middle of nowhere, Georgia.

02:02 I played sports as a kid.

02:04 You know, I was always into math and science, but never programming.

02:07 And I went to college undergrad at Georgia Tech.

02:10 And my first semester there, I was declared as an engineering major and took a MATLAB course.

02:16 And immediately, just everything about it clicked.

02:18 I was hooked.

02:19 And of course, you know, MATLAB is MATLAB.

02:21 And so I changed my major immediately to computer science.

02:26 And then I got about one semester further and started taking some of the sort of pure CS classes.

02:33 You know, it's very theoretical and that's fine and all, but I didn't feel as hands-on with it.

02:38 And simultaneous to that, I started taking some physics classes, which had an application of Python.

02:43 So we were kind of modeling gravitational systems and whatnot.

02:46 And that really resonated with me because it had both the math side of things that I gravitated towards,

02:51 but I was also building things using code, which was new.

02:55 And I got really excited about that.

02:56 So I remember we had this lab where we built like an orbital system of the moon and a rocket and earth.

03:02 And, you know, before I know it, I had this little couple of dots on a screen moving around.

03:06 And I'm convinced that I'm like landing Apollo 11.

03:08 And I just, it was really gratifying.

03:10 And so I've been hooked ever since.

03:11 And the rest is history.

03:12 Yeah, that's cool.

03:13 It's really interesting how some of these early wins, like this three-body problem you're simulating

03:18 in physics or whatever, like it's not a big deal, right?

03:20 Like the little simulation is probably pretty limited, but at the same time, it feels like

03:25 so gratifying, right?

03:26 Yeah, absolutely.

03:27 And then, you know, of course that starts off as sort of the baby problem.

03:30 And then as I went on to grad school and my PhD, and this was all physics, the coding got

03:35 more complicated.

03:36 The algorithms got more complicated, but at the core, it was still, it was still the same

03:39 problem, right?

03:40 You're solving some kind of computational problem through coding.

03:44 And although that wasn't always Python, these days it's pretty much moved to that.

03:47 You know, I kind of diverted along the way and did some C++ and did CUDA and GPU programming,

03:53 but Python is a lot of what I do day to day now.

03:55 Yeah, cool, cool.

03:56 You know, I followed a bit of a similar path.

03:59 Like I was working on my PhD in math and doing a lot of math stuff, but then also analyzing

04:04 and programming.

04:05 And I guess I should have known that I was not really intended to finish my degree, my PhD

04:11 anyway, and go into math because I would find when I'm working on these projects, like I was

04:18 really excited, especially about the programming.

04:20 And it was so cool, kind of like you described this, like how awesome it felt to have this

04:24 simulation running.

04:25 And then I'd always kind of get like a little bit less excited and like, oh, here comes the

04:30 drudgery work part of it when I got to the math, and then I'd go back to the programming and

04:35 I'd get really excited again.

04:36 I'm like, that should have been a sign.

04:38 That's what you know.

04:39 Yeah.

04:39 Yeah.

04:39 But no, it was all good.

04:40 So for you, what are you doing these days?

04:42 So I work for a company called American Tire Distributors.

04:45 It's the largest distributor of replacement tires in the world.

04:49 So the tire industry is very interesting because tires are something that pretty much everybody

04:54 has to buy.

04:54 But you don't buy them frequently.

04:56 And it's an overwhelmingly miserable experience for most people.

04:59 Yeah.

04:59 And it's an antiquated industry.

05:01 So it turns out that they have a ton of data, but technologically speaking, it's not particularly

05:06 modern.

05:08 And so my company has spun up a pretty new analytics team that's data scientists, engineers,

05:13 and developers to kind of build out applications to try to revolutionize that industry and bring

05:19 it up to the modern world in terms of analytics, increasing supply chain efficiency and all of

05:23 this.

05:24 And so as a data scientist, particularly one who is interested in web development and app

05:28 development, like I am, it's kind of like being a kid in a candy store because there's tons of data,

05:33 both historical and streaming every day.

05:35 And basically no solutions other than the traditional sort of BI descriptive backward

05:41 looking stuff, which is good.

05:43 And you need that.

05:43 But we're also looking to build more predictive analytics and doing forecasting and finding

05:48 other optimization techniques to kind of squeeze efficiency because as a distributor, you're a

05:52 middleman and efficiency is the name of the game.

05:54 So I find myself sort of stuck between I build backends in Python.

05:58 I write a lot of SQL, interact with databases.

06:01 Sometimes that means building new databases on our cloud resources.

06:04 And then I spend a lot of time building front ends to actually expose those applications in a way

06:10 that's consumable.

06:11 And that's actually a very important part of this whole process because at the end of the day,

06:14 if you're building an app for somebody, it doesn't matter how fancy your mathematics is.

06:19 If the consumer who's actually making an actionable decision on that can't do anything useful,

06:24 it's not worth your time.

06:25 So that's where I find myself spending my time.

06:28 I write a lot of code and talk to a lot of people.

06:30 Yeah, yeah.

06:31 That sounds so fun.

06:32 I mean, almost any industry that has like tons of data and yet no one is doing anything with it.

06:39 Not really.

06:40 It just sounds really fun.

06:41 You can come and go, all right, let's see what we can do here.

06:43 We can bring in a little TensorFlow to do this.

06:45 We can bring in some APIs to modernize that.

06:48 Like, why are you seriously emailing me an Excel spreadsheet?

06:51 Is that really what's happening?

06:53 Oh, that's the story of my life.

06:56 Do you guys use SharePoint?

06:56 You're hitting home.

06:57 Do you like share Excel docs through SharePoint?

07:01 Yeah.

07:02 I mean, we use a bit of everything, which is part of the problem.

07:05 There's historical reasons that it's an Excel spreadsheet or it's emailed.

07:08 And, you know, consolidating, that's a lot of work.

07:10 And that's something that we've got people working diligently on.

07:13 But it keeps things interesting.

07:14 You know, you learn about tech from the lens of like your big tech companies.

07:18 And the reality is there's an awful lot of businesses out there that are not nearly that refined.

07:23 And so it's a different problem.

07:25 It's fun, though.

07:26 I really love my job.

07:27 Yeah, it's definitely a different problem.

07:28 I mean, if you're working at Netflix as a data scientist, it would be very different.

07:32 Or, you know, somewhere like that, right?

07:34 Just as an analogy.

07:35 But I do think it sounds really fun to come in and have kind of this blank slate and say, okay, it's 2019.

07:42 These are the tools we're using.

07:44 We could easily apply them here.

07:46 Because it happens to be open source, you don't even have to get budget approved to like buy the $2,000 license for this or that.

07:54 Like you just say, okay, you know, let me go after it.

07:57 I'll do it.

07:57 Yeah.

07:57 I mean, our analytics team functions basically like a startup inside of a very sizable company.

08:03 So we have freedom to pick the tech we want to use.

08:06 A lot of those decisions get to be made sort of last minute on the fly.

08:08 We get to use whatever the latest and greatest tech is.

08:11 So it's a lot of fun.

08:12 Yeah, that's actually, you know, I find those types of environments are really, really good for actually learning a whole lot of technology and techniques.

08:20 Because you're not dropped into like, well, we already have an enterprise architect.

08:25 And they say we use this technology.

08:27 And then we use it in this way and not in that way.

08:29 Now you go work on a small sliver, right?

08:31 Like you're there to just explore.

08:33 And I don't know what it's like for you.

08:36 But the times I've been in those situations, like people don't really care what technology you're using or how you're using it.

08:43 If you're showing results and you're like, I've tried these things and look what we're doing.

08:46 Like last week we couldn't do this.

08:48 This week we can.

08:49 That's all they really care about.

08:51 And they're super happy.

08:51 Is that kind of the experience?

08:52 Yeah, I mean, more or less, like you were saying earlier, if you worked at a company like a Netflix or an Uber, you can have a whole data science team, you know, whatever, 10, 20, 100 people dedicated to making tiny incremental improvements to an algorithm.

09:05 And that might be worthwhile because it's a half percentage point across an enormous profit base, right?

09:11 And so that might justify it.

09:12 But in a smaller team, you might have more of the kind of 80/20 approach where you're trying to get most of the benefit with the least amount of work so that you can touch a lot of different points.

09:22 Because the surface area of your problem space is just huge compared to how many people you have.

09:27 As a company grows over time, that dynamic shifts.

09:30 Yeah.

09:30 There's like seasons.

09:32 It also brings up some sort of team structure issues.

09:34 Yeah, for sure.

09:35 Yeah, exactly.

09:36 Like seasons, it kind of changes as the evolution of the company progresses.

09:39 Yeah.

09:39 Cool, cool.

09:41 So one of the things that we're going to talk about, and I would like to start off the conversation with, you touched on it before.

09:47 It's really cool to have these libraries and these predictions and all of this tooling in place.

09:54 But if you can't share it and expose it over, say, like a web service or something like that, then it's not super valuable, is it?

10:02 No, not at all.

10:02 So let's talk about your work that you're doing with Flask and data science, right?

10:08 This is the web dev side of the intersection that I was talking about.

10:12 Right.

10:13 Basically, you know, what is a data science application and how is that different from a traditional, quote, web app?

10:19 The biggest difference is that a data science application is going to have some kind of computational predictive machine learning element embedded into the actual API results.

10:28 So if I'm purely a front end developer and I'm used to a workflow of building an app, hitting an endpoint, receiving data and displaying it, there's basically no difference between a traditional app and something that's driven by data science.

10:41 Other than maybe you're cooking up more graphs and visualizations, depending on the context.

10:45 But from the back end perspective, you're going to be generating these API results, calling some kind of prediction method.

10:51 And so one of the great things about Python with machine learning, which is, in my opinion, one of the reasons it's become so popular, is that it has scikit-learn, which is a very understandable interface to sort of churn out predictions repeatedly across many different models, many different parameters.

11:07 And so it makes the kind of plug and play nature of how the data science industry needs to work very friendly.

11:14 And that means that we can have multiple data scientists trying to solve the same problem.

11:18 Each have their own models.

11:19 They can be tuning it.

11:20 We might combine them.

11:21 But at the end of the day, we can have some kind of predict method that generates data, and that data can be JSONified.

11:28 It can be sent back through an API.

11:30 And so it's really in the nature of the meaning of the data and where it comes from and what problems it's trying to solve.

11:37 That's really the only main difference.
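
[Editor's sketch] The "predict method whose output is JSONified and sent back through an API" pattern might look roughly like the following in Flask. This is a minimal sketch, not the guest's actual code: `LinearModel`, the `/predict` route, and the `rows` payload shape are all assumptions made for illustration, and the toy model merely stands in for anything that exposes a scikit-learn-style `predict()`.

```python
from flask import Flask, jsonify, request


class LinearModel:
    """Hypothetical stand-in for a trained model with a scikit-learn-style predict()."""

    def __init__(self, weight, bias):
        self.weight = weight
        self.bias = bias

    def predict(self, rows):
        # One prediction per input row; each row is a list of features.
        return [self.weight * row[0] + self.bias for row in rows]


app = Flask(__name__)
model = LinearModel(weight=2.0, bias=1.0)  # assumed to be trained elsewhere


@app.route("/predict", methods=["POST"])
def predict():
    # Parse the JSON body; fall back to an empty dict if it is missing or invalid.
    payload = request.get_json(silent=True) or {}
    rows = payload.get("rows")
    if not isinstance(rows, list) or not rows:
        return jsonify(error="expected a non-empty 'rows' list"), 400
    # The model's output is plain Python data, so it serializes directly.
    return jsonify(predictions=model.predict(rows))
```

Because the endpoint only depends on `predict()`, swapping in a different data scientist's model should not change what the front end sees.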

11:39 Yeah, for sure.

11:40 When you're talking about APIs, I suspect consuming them looks very similar to working with the GitHub API or the Stripe API or whatever, right?

11:49 Yeah, exactly.

11:49 You've got some authentication.

11:50 You call some methods.

11:52 You get a JSON answer back and you say, they recommend this tire or they recommend it's time to buy or not time to buy or whatever, right?

11:59 Like whatever it is you're trying to predict.

12:00 Yeah, exactly.

12:01 When it comes to web applications, though, I suspect that also means bringing in other interesting elements.

12:08 Like, for example, there's probably more charting and graphing and interactive data display on something that is data science backed than something that's not.

12:19 You know, like my site that does like training, right?

12:22 It shows you videos, but I don't think there's a graph on the site.

12:25 Like there's no graphs, nothing like that, really.

12:27 Right, exactly.

12:28 Maybe on some of your like admin dashboards or whatever.

12:30 Possibly, yeah.

12:31 It depends on the context.

12:32 So, for example, we might be making a pricing tool, right?

12:36 So, let's say you want to do some machine learning analytics to figure out how should I adjust my pricing on different products?

12:42 Well, the way somebody in the tire industry or the shoe industry is going to handle those problems is very different because they have different problems.

12:50 So, to pull from tires, there's an enormous number of different kinds of things.

12:55 And so, it's tens of thousands of things that you might be making a decision about pricing.

13:00 A human can't understand that.

13:01 But if you make an application that makes it very easy to slice and dice that ecosystem and bubbles to the top the things that are maybe the most egregiously mispriced one way or the other, you allow a human to then make decisions about the important parts of that.

13:16 But that kind of interaction might not be so meaningful if you're in a product space where there's not so many things and, you know, it's not as hard for a human to make a decision, but you have other problems.

13:26 So, it's kind of a dynamic business because it very much depends on the context.

13:31 And that's why the user, their use case matters so much in how you design an app because it's one thing to just solve the same pricing problem twice, but the user experience might be vastly different.

13:41 And that completely changes the front end that makes it a good product.
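
[Editor's sketch] One simple way to "bubble to the top" the most egregiously mispriced items is to rank them by the relative gap between the current price and a model's suggested price. The catalog below is made-up illustrative data, and the field names (`sku`, `price`, `suggested`) are assumptions, not anything described in the episode.

```python
def most_mispriced(items, top_n=3):
    """Return the top_n items with the largest relative gap between
    the current price and the model's suggested price."""

    def relative_gap(item):
        return abs(item["price"] - item["suggested"]) / item["suggested"]

    return sorted(items, key=relative_gap, reverse=True)[:top_n]


# Hypothetical catalog; in practice "suggested" would come from a pricing model.
catalog = [
    {"sku": "A", "price": 100.0, "suggested": 102.0},
    {"sku": "B", "price": 80.0, "suggested": 120.0},
    {"sku": "C", "price": 150.0, "suggested": 100.0},
    {"sku": "D", "price": 60.0, "suggested": 61.0},
]

worst = most_mispriced(catalog, top_n=2)  # the SKUs a human should review first
```

With tens of thousands of products, a short ranked list like this is what lets a person spend attention only on the cases that matter.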

13:44 Sure. And maybe whether or not it should be just a server-side Flask type of application, or maybe you need to bring in some funky JavaScript for interactivity once you pull down the data or something like that.

13:54 One thing I did want to ask you about: how do hosting and deployment look for a data science web app relative to a standard web app?

14:05 Like, I know in data science, there's lots of computation.

14:09 There's often leveraging GPUs, but is that more done in the preparation phase?

14:15 Like, do you train a model up or something?

14:16 But then the actual execution of it, it doesn't need any special hardware in the deployment stage?

14:22 Or what does that look like?

14:22 Great question. It depends.

14:24 So, in some cases, you have a problem where the computational load can be done in advance.

14:30 Great example is image classification.

14:32 So, Google has trained a number of models that they open source for image classification, like the Inception series of models that you can go download and use them to classify things that they already out of the box know how to do, cats and dogs and things like this.

14:46 And to get there, Google threw a whole bunch of compute at that, you know, tons of nodes, many GPUs, and it takes a long time to get that model weighted the way it is, get it trained to where it is.

14:58 But from that point, another data scientist could pick that up where it left off and basically finish out the rest of a model to repurpose it.

15:05 So, in that case, the data scientist workload is mitigated because Google did some up-front work.

15:11 But at the same time, you might take that final trained model that data scientist adds their specific use case to, and once that's trained, you might be able to just throw that into a Docker container and then make predictions with it very quickly, trivially quickly.

15:24 It was the training that was the hard part.

15:26 So, in this case, you would just Dockerize it and you can deploy the application as is.
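
[Editor's sketch] The split between expensive training and cheap prediction can be illustrated with a toy model: fit once offline, write the fitted parameters to a file, and have the serving process load that file once at startup. Everything here is an assumption for illustration. A real project would more likely persist a scikit-learn or TensorFlow model, and the file would be baked into the Docker image rather than written to a temp directory.

```python
import json
import os
import tempfile


def train(xs, ys):
    """Offline step (the slow part in real life): fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def save_model(params, path):
    with open(path, "w") as fh:
        json.dump(params, fh)


def load_model(path):
    with open(path) as fh:
        return json.load(fh)


def predict(params, x):
    """Serving step: just arithmetic on already-fitted parameters."""
    return params["slope"] * x + params["intercept"]


# Offline: train once and write the artifact that would ship with the app.
model_path = os.path.join(tempfile.mkdtemp(), "model.json")
save_model(train([1, 2, 3, 4], [3, 5, 7, 9]), model_path)

# Serving: load once at startup, then reuse for every request.
MODEL = load_model(model_path)
```

The web process never calls `train()`, which is why it needs no special hardware in this scenario.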

15:31 However, there are other cases where the actual computation you need to do in the web server is where the complexity is.

15:38 So, for example, I have some apps I've built where they're built around optimizations where you have lots of free variables and the problem that you're solving is completely dependent on the state of the app.

15:48 You have to do that on the fly.

15:50 And so, your Python server, your Flask app is now either doing that computation directly or offloading it to some other server via something like a task manager with Celery or some other serverless hook that you've put together, microservice architecture, whatever you're using.

16:04 There's many different ways to do it.
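
[Editor's sketch] When the heavy computation has to happen per request, a common shape is "submit a job, get an id back immediately, poll for the result." The sketch below uses the standard library's `ThreadPoolExecutor` purely as a stand-in for a real task queue such as Celery; the function names, the in-memory `jobs` dict, and the greedy `optimize` placeholder are all hypothetical.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)
jobs = {}  # job id -> Future; in-memory only, so it is lost on restart


def optimize(costs, budget):
    """Placeholder for an expensive optimization: greedily pick the cheapest items."""
    chosen, spent = [], 0
    for name, cost in sorted(costs.items(), key=lambda kv: kv[1]):
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return {"chosen": chosen, "spent": spent}


def submit_job(costs, budget):
    """What a POST endpoint might do: enqueue the work and return an id at once."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = executor.submit(optimize, costs, budget)
    return job_id


def job_status(job_id):
    """What a GET endpoint might do: report progress without blocking."""
    future = jobs[job_id]
    if not future.done():
        return {"state": "pending"}
    return {"state": "done", "result": future.result()}
```

A real deployment would swap the executor for a queue with its own workers so that jobs survive restarts and can run on separate machines.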

16:06 Yeah, yeah.

16:06 That definitely makes a lot of sense.

16:07 So, I guess depends is the answer, right?

16:10 Yeah.

16:10 It depends.

16:11 Very interesting.

16:12 How does serverless fit into the world here?

16:16 I know serverless is good for async stuff, right?

16:18 Like, I want to send an email.

16:19 Like, nobody needs to wait on that.

16:20 I can shoot off something to say AWS Lambda or Azure Functions and just have it go.

16:24 How does that work in your world?

16:26 Personally, serverless doesn't affect me too much.

16:29 First off, the name serverless, I don't really understand because there's always a server there.

16:33 It just means serverless in the sense that you don't have to manage provisioning and setting that up.

16:39 And so, it's a valuable thing, especially if you're more of building like a front-end only type app.

16:44 You know, if you want to use a Firebase or something, that kind of model.

16:47 In my case, I came from more of a back-end development side first.

16:51 And so, I've never had an issue with creating those endpoints myself.

16:55 And so, I tend to have not needed the serverless model.

16:58 It's not to say it's not good and useful.

17:00 It just hasn't affected me very much ever.

17:01 Yeah, it's interesting.

17:02 I kind of see it the same way as well.

17:05 And I think it's because I came from the back-end development side first, possibly.

17:08 So, to me, I think there's a lot of value in serverless.

17:12 It makes a lot of sense some of the time.

17:14 But my perception, speaking only for me, if I'm going to build something for me, is I am trading code complexity, an application that has multiple things, all of it going on.

17:26 It's got to keep running.

17:27 I'm trading that code complexity for infrastructure complexity, right?

17:32 I might now have 30 Lambda functions that all have to be versioned and migrated and kept in sync.

17:39 But they're all separate things up in the cloud.

17:41 And I have to deal with that somehow.

17:43 And how do I keep them all running?

17:45 And so, I always feel like I'm trading code complexity for infrastructure complexity.

17:49 And I feel like I'm better at code than I am at infrastructure.

17:51 So, I lean towards not going that way.

17:54 Same thing with microservices, right?

17:56 I just feel like I'm better at managing code complexity than DevOps-y infrastructure complexity.

18:01 So, let me play to that, you know?

18:05 This portion of Talk Python to Me is brought to you by Linode.

18:08 Are you looking for hosting that's fast, simple, and incredibly affordable?

18:12 Well, look past that bookstore and check out Linode at talkpython.fm/Linode.

18:17 That's L-I-N-O-D-E.

18:19 Plans start at just $5 a month for a dedicated server with a gig of RAM.

18:23 They have 10 data centers across the globe.

18:26 So, no matter where you are or where your users are, there's a data center for you.

18:30 Whether you want to run a Python web app, host a private Git server, or just a file server,

18:34 you'll get native SSDs on all the machines, a newly upgraded 200 gigabit network, 24-7

18:40 friendly support, even on holidays, and a seven-day money-back guarantee.

18:44 Need a little help with your infrastructure?

18:46 They even offer professional services to help you with architecture, migrations, and more.

18:50 Do you want a dedicated server for free for the next four months?

18:53 Just visit talkpython.fm/Linode.

18:56 Totally get it.

18:59 I think Docker is really the game changer there because nowadays, if you can throw it into a Docker container, there are services like Google App Engine

19:07 or AWS Elastic Beanstalk where it can sort of transparently scale up and down.

19:12 And yes, of course, there's differences between Lambda and those services.

19:15 But at least as far as I have found, you can get pretty far just by having Docker

19:21 and using whatever managed service on top of that to handle your kind of auto scaling.

19:24 Yeah, that's cool.

19:25 Docker is nice because you can do basically anything that Linux can do.

19:28 Right.

19:29 You don't have the restrictions like...

19:31 Yes, which makes me happy.

19:31 Yes, it has to run within 30 seconds or 10 seconds or whatever it is for serverless.

19:36 And you can only do work with these dependent...

19:39 Like you can do whatever you want, right?

19:40 You need to install some like C library.

19:43 It doesn't matter.

19:43 Absolutely.

19:44 Do you use Docker much for your work?

19:46 Yes.

19:46 Oh, yeah.

19:47 Both myself and my team, we Dockerize anything that we can Dockerize pretty much.

19:51 It just makes life so much easier.

19:52 It's easier deployments.

19:54 It's easier going from a VM to something like App Engine if you're on Google Cloud or whatever

19:59 your service is.

19:59 The installations, especially with some of the data science libraries that we're using,

20:03 for example, CVXPy is an optimization framework, and it has some potential gotchas with compiling

20:10 it.

20:10 Putting that in a Docker container solves that problem.

20:12 I don't have to worry about some obscure GCC version on the Linux box I'm deploying to

20:16 or my Jenkins server if I just have it in Docker.

20:19 So I'm a huge fan.

20:20 Yeah, you get it working once and then you just never have to touch it again.

20:24 You containerize it and you just say, I depend on that.

20:26 That works.

20:27 And your team can have one guy who's good at it.

20:29 You know, everybody doesn't have to learn it because it's a super copy-pastable kind of

20:33 thing.

20:33 Yeah, yeah, absolutely.

20:34 What I found interesting about Docker was as I learned to create the Dockerfiles and

20:39 build the images and containers and whatnot, it's like, well, what you really need to know

20:44 is Linux, right?

20:46 Here's a bunch of commands that you're issuing to configure Linux.

20:49 It just happens to be you issue them in a Dockerfile format rather than in a terminal

20:55 session.

20:56 But it's basically like the complexity is not Docker.

20:59 The complexity is Linux.

21:01 And if you're comfortable with that, then Docker is actually a small step.

21:04 Yeah, exactly.

21:05 So do you use something like Swarm or Docker Compose or Kubernetes on top of that?

21:10 Yeah, so generally for those kinds of things, we're either going to be using Kubernetes,

21:15 but a lot of times we'll just operate within App Engine just because it's so simple.

21:20 If you get your Flask app, whether it's a single application or whether you've got a front end and

21:25 back end separately, you Dockerize it, you can push it to App Engine, and then it will deal

21:29 with scaling it up, down, making sure it stays up.

21:32 And then you can also add rules on top of it, like for static files that they'll map to some

21:37 internal Nginx configuration you never have to deal with.

21:39 That works pretty well.

21:40 But Kubernetes, if it needs to be a full cluster or integrated microservices, for sure.

21:44 Yeah, yeah.

21:45 Like they've got to talk to each other.

21:46 That's where it gets complex with Docker.

21:48 All right.

21:49 Thanks for the diversion.

21:50 Yes.

21:50 That was interesting.

21:50 So if I'm building a data science web app, I would say, you know, if you went out to the

21:56 street and just started interviewing random people where all these people were knowledgeable

22:00 about data science, they would probably say like, well...

22:03 I want to live there.

22:03 Yeah, for sure.

22:05 So probably the first thought or the most popular answer of like, how do I take my

22:09 data science stuff and present it on the web?

22:11 It would be Jupyter Notebooks.

22:12 What do you think?

22:13 Yes.

22:13 I have such a love-hate relationship with Jupyter.

22:16 And I think other data scientists on my team would hate me for saying this, but Jupyter

22:20 Notebooks just cause a lot of problems.

22:22 They are super great for interactive data science, exploratory analysis, initial data cleansing

22:29 and whatnot.

22:29 They're great.

22:30 But for anything you need to pipeline, productionalize or make repeatable, the native Jupyter Notebook

22:36 space just doesn't really work very well.

22:38 I mean, how many...

22:39 I guarantee you, if you're listening to this and you have shared a Jupyter Notebook with

22:43 a team before, you've gotten one back at some point that you ran and it immediately errors

22:47 out because you don't have some file or whatever.

22:49 Just make sure your Jupyter Notebook runs start to finish.

22:53 So yeah, you're right, Michael.

22:55 Jupyter Notebooks and apps are very different.

22:57 And how you kind of connect those thoughts to production is not trivial.

23:02 Yeah, it took me a while to appreciate Jupyter Notebooks and just the whole notebook style

23:08 of programming.

23:08 Because to me, coming from a more app building type of software development, like I was all

23:15 about having different files for the different purposes and like factoring an app.

23:21 So like the data access stuff is over here and then I can test this part and I put them

23:25 together and then here's the app.

23:27 And when I looked at Jupyter, I'm like, all that stuff is missing.

23:30 I mean, even to some degree, like a lot of times, even functions are missing.

23:33 And that like kind of made me a little, gave me the willies a little bit.

23:38 But then, you know, I saw people...

23:39 Yeah, how do I modularize?

23:40 Exactly, right?

23:41 Like how do I, how is this maintainable, right?

23:43 So...

23:44 Unit test it, yeah.

23:45 Yeah, exactly.

23:45 So, but then I saw people working with it and I'm like, they're just working differently.

23:49 They're solving different problems than the problem I'm trying to solve.

23:53 And the notebook space makes more sense for them.

23:57 But to be able to push it actually into full production, I don't know, it's, it doesn't

24:02 seem like really necessarily the right answer.

24:04 There are some projects trying to leverage notebooks for production that are pretty interesting.

24:12 I want to ask you if you had any experience with them, not saying that I would recommend

24:16 like ditching Flask to use them, but they are interesting.

24:18 Things like Papermill, have you seen that?

24:20 Yes, I think Papermill is exactly the kind of solution to the problems, because I kind

24:25 of hammered on Jupyter Notebooks.

24:27 I'm actually a big fan of them.

24:28 It's just that it can be misused.

24:30 And what Papermill does is it provides you a way to parameterize your notebooks and then

24:35 execute them in a more programmatic way.

24:38 And if nothing else, this forces the developer of the Jupyter Notebook to kind of think about

24:43 abstracting parameters out where it makes sense.

24:45 Right.

24:46 What are the inputs?

24:47 What are the outputs?

24:47 And it makes it a lot easier to productionalize these things.
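
[Editor's sketch] Parameterized execution with Papermill could look something like the following. Papermill's `execute_notebook(input_path, output_path, parameters=...)` is its documented entry point, and it injects the given values after the notebook cell tagged `parameters`. Everything else is assumed for illustration: the notebook name `forecast.ipynb`, the `runs/` output directory (which would need to exist), and the `region` and `horizon_days` parameters are hypothetical.

```python
def build_run(region, horizon_days, template="forecast.ipynb"):
    """Describe one parameterized run: which notebook to execute, where the
    executed copy goes, and which values to inject. All names are hypothetical."""
    return {
        "input_path": template,
        "output_path": f"runs/forecast_{region}_{horizon_days}d.ipynb",
        "parameters": {"region": region, "horizon_days": horizon_days},
    }


def execute(run):
    """Execute one run with Papermill (third-party: pip install papermill)."""
    import papermill as pm  # imported lazily so build_run works without it

    return pm.execute_notebook(
        run["input_path"], run["output_path"], parameters=run["parameters"]
    )


# The same notebook, planned for two regions; call execute(run) on each to run it.
runs = [build_run(region, 30) for region in ("east", "west")]
```

Writing `build_run` at all is the exercise described above: it forces the notebook author to name the inputs and the outputs.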

24:51 Yeah.

24:51 There's an interesting article I'll link to from Netflix, how they're using Papermill to

24:57 sort of productionalize their Jupyter stuff and do a lot of data science still with the notebooks.

25:03 But yeah, I would say like, if you really want to put something online and really make it

25:07 accessible, right?

25:08 You need an API or at least a website is probably the final end game.

25:14 What do you think?

25:15 Yeah.

25:15 But I think that, I mean, whether or not it's Papermill, I think some kind of integrated tooling

25:20 is going to be the solution.

25:21 Whether it's something that makes it easy to go back and forth between notebooks and Python

25:26 files or whether it's Papermill or something like that.

25:29 I think that where that will take you is a place where it's simpler to move from the data science

25:34 that you're doing in the notebook to production, because we also can't just say, oh, well,

25:38 you do your exploratory analysis in a Jupyter notebook and then we productionalize it,

25:42 which is a whole scratch rewrite.

25:44 Like that's a huge waste of time.

25:45 And so you have to sort of trade off how painful is it to refactor this versus how easy is it to

25:52 just kind of build a tool that can make those dots connected for you.

25:55 I think eventually we'll land on something that's kind of an integrated workflow and that's

25:59 going to get picked up very quickly because of how beneficial that time savings is.

26:02 Yeah.

26:03 It's going to be pretty interesting.

26:04 Now, I do think there's a challenge that people run into in the data science space. The

26:11 incredible growth of Python really does have a lot to do with data science.

26:15 And I think that's because Python itself appeals a lot to people who are not necessarily programmers

26:22 first, but they use programming to do something else amazing, right?

26:26 Like a biologist or a physicist or a statistician or something.

26:29 But it also means like a lot of folks come without necessarily a rigorous software engineering

26:38 background.

26:39 That's not a dig against them or anything negative.

26:41 It just happens to be like we all come from different perspectives.

26:44 But I do feel like there's probably a lot of teams and folks I talk to, it seems like

26:48 they could use a little bit of help or some pointers on taking things like testing and maintainability

26:57 and factoring.

26:58 And you talked about continuous integration and whatnot and all that into their workflow.

27:03 What are some of the software engineering techniques you think data scientists should pay attention

27:08 to?

27:08 Number one is testing.

27:09 Why do we care about unit testing?

27:12 And I like to tell people that testing is actually important mostly for refactoring purposes.

27:19 It's actually a side effect of writing tests that you verify that your code works.

27:23 So what I mean is that, you know, if I have my test built and I've got good coverage and I've

27:28 got the little auto reloader running and every time I change something, hit Ctrl+S to save it

27:33 and I get those little dots, I can be extremely aggressive with coming in and just gutting

27:37 whole parts of the app and changing them around because I know I'm backed by those tests.

27:42 And I can't tell you how many times I have done a refactor that would either have taken a long

27:47 time to change piece by piece and make sure nothing broke, or that I simply would not have done

27:52 because it was like, oh, I don't want to touch that because it's going to be super painful.

27:55 So testing is one.

27:57 And along with that comes sort of how you write code, because if you're writing code

28:01 for tests, you will do things like lift variables up or abstract things out into a dependency

28:08 injection type flow.

28:09 Yeah.

28:10 You'll think about small functions versus large functions because large ones are super hard

28:14 to test.

28:15 What you'll see, and this is not a knock on people who are not quote software engineers,

28:19 but you get these big long functions and then somewhere embedded in the middle is like a

28:23 hard coded call to some API and the endpoint and credentials are buried in there.

28:27 And maybe in source control.

28:28 And it just becomes very difficult when you want to come in and change something, which

28:32 inevitably, you know, your client wants you to do.

28:35 And you come in there and you're like, oh man, this is such a mess.

28:37 It's going to take me a while to dig myself out.

28:39 Whereas if you take a little bit more time to set yourself up from the beginning, it just

28:43 flows way faster.
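
A minimal sketch of the dependency-injection idea mentioned here, assuming a hypothetical function that needs data from some external API. Instead of burying the call and its credentials in the middle of a long function, the data source is passed in, so a test can swap in a fake. All names are illustrative.

```python
from typing import Callable, Dict


def average_price(fetch_prices: Callable[[str], Dict[str, float]], symbol: str) -> float:
    """Compute the mean price; the data source is passed in, not hard-coded."""
    prices = fetch_prices(symbol)
    if not prices:
        raise ValueError(f"no prices for {symbol}")
    return sum(prices.values()) / len(prices)


def fake_fetch(symbol: str) -> Dict[str, float]:
    """Stand-in for a real API client, used only in tests."""
    return {"mon": 10.0, "tue": 14.0}


def test_average_price():
    # No network, no credentials: the test controls the inputs completely.
    assert average_price(fake_fetch, "ABC") == 12.0
```

In production code the same function would receive a real client; the small function plus an injected dependency is what keeps the test this short.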

28:44 Yeah, for sure.

28:45 So if I had to pick one or two, it's those, but I could go on and on about design patterns

28:49 and whatnot, but I'll cut it there.

28:51 Yeah, no, that's good advice.

28:53 The one I see also often missing is like proper source control.

28:57 Maybe, I don't know.

28:58 It depends on the team, right?

29:00 But if they're mostly scientists who started out, you know, leaving MATLAB and

29:05 going into say Jupyter and Python, like source control.

29:08 That's actually a good point.

29:10 And I almost feel like source control gets overlooked as being just assumed that you know it, but

29:16 very interestingly, most people coming fresh out of school, you know, they've only seen a limited amount of

29:20 exposure to it, but you're completely right.

29:23 It's a huge part of it.

29:24 I actually am extremely interested in Git as a separate aside.

29:27 You know, in our Slack channel, we run a Git tip of the day.

29:30 So you can look at some of the blogs that we'll link.

29:32 We've actually published a couple of those just to kind of as a way of keeping everybody's

29:35 brains fresh on source control.

29:37 Or how do I revert this branch in a particular way?

29:40 Or what is git reflog?

29:41 And so I completely agree.

29:43 If you're fluent in source control, it's also a big time saver.

29:46 Yeah.

29:46 Well, I mean, all these things go, like everything we're touching on goes really well together.

29:50 Testing so that I can change my code, writing code that is testable and easy to maintain

29:56 so I can have these tests.

29:57 And then having source control so that I can, you know, like commit it and tag it and then

30:02 go crazy and go, you know what?

30:03 It was a horrible idea.

30:04 We're either dropping this branch or we're rolling it back and we're going to skip over

30:09 this bit or whatever, right?

30:10 Like even if you forgot to branch, you can still go back to your last commit or a couple

30:15 commits back.

30:15 Yeah.

30:16 They all kind of hit at the same core essence.

30:19 Yeah.

30:19 And then you add your CI/CD on top of that.

30:21 And now you've got it.

30:22 So every time I push changes, CircleCI or Travis or Jenkins or whatever grabs it, builds

30:27 it, pushes it out.

30:28 And I can be deploying five, 10 times, a hundred times a day.

30:31 That model works extremely well.

30:33 You catch bugs more quickly and you're not afraid to change stuff.

30:35 Yeah, for sure.

30:36 Cool.

30:37 That's good advice.

30:38 So when we're building these apps, I have a pretty good sense for when it makes sense

30:42 to have, say, one of these JavaScript rich web apps.

30:48 I'm thinking of AngularJS, React, VueJS, something where like a lot of the application logic is actually

30:56 written in JavaScript.

30:57 And then there's a bunch of services probably written in Python that you're talking to behind

31:02 the scenes, like we were talking about with Flask.

31:05 I have a good sense for when that maybe makes sense versus when a more server-side backed

31:11 framework, Flask, Pyramid, Django, something like that.

31:14 You can just stick with that and not add that extra complexity.

31:18 I always feel like there's a little bit of glitchiness in the front-end apps somewhere and

31:23 some setup with some plugin or whatever.

31:26 But I do feel like there's actually a tendency for people to assume they need more JavaScript

31:31 than they actually need.

31:32 And they need these front-end frameworks more than they actually need.

31:35 Like you can go a long way with the server-side framework, but there are times.

31:38 So maybe what are your thoughts in the data science web space around that?

31:43 It's a good question.

31:44 We do both.

31:45 There's a time and a place for both kind of the single Flask app that does both the front

31:49 and the back end.

31:49 And then when you need to bring JavaScript in, the biggest thing that JavaScript gives

31:53 you, and it's the whole magic of the web really, is the interactivity.

31:57 So if you've got, from a data science perspective, some kind of application that's maybe kind of

32:03 like a dashboard where you've got a number of visualization widgets and maybe you change

32:06 something and it re-aggregates data and it's responsive.

32:10 That kind of thing really sings as a JavaScript app, whether it's React, Angular, whatever you,

32:16 jQuery, whatever you use.

32:17 When it's more of sort of static content, you know, a blog or a report or some kind of anything

32:23 even that could be served with just a SQL query.

32:25 Right.

32:26 Even if it's data-driven, it's like not static per se.

32:28 But once it hits the page, you don't need it to be interactively changing unless you like

32:34 open another page, right?

32:35 Like here's a list of things and I click on the book and I have the book details, right?

32:38 Like all that doesn't need any JavaScript, probably.

32:40 Yeah.

32:41 And sometimes those lines get blurred, because what happens if that SQL query is either

32:46 really complicated or it returns a huge number of rows? I still kind of want to interact with it

32:51 on the front end.

32:51 Maybe I want to paginate it.

32:53 Right.

32:53 Filter it or something.

32:54 Yeah.

32:54 Going back to the server, I'd have to run that huge heavy query.

32:56 Well, that's not great.

32:57 But if I also have to fit a bunch of data into the browser, that might not work.

33:01 So where do I put that pagination is now not always trivial.

33:04 And so those kinds of things tip the decision, whether you want to go JS or pure Python.
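
One of the two options discussed here, paginating on the server, can be sketched with the standard library's sqlite3 module. The table, column names, and page size are made up for illustration; a real app would likely go through its ORM instead.

```python
import sqlite3


def fetch_page(conn, page, per_page=10):
    """Return one page of rows; the database does the slicing."""
    offset = (page - 1) * per_page
    cur = conn.execute(
        "SELECT id, title FROM books ORDER BY id LIMIT ? OFFSET ?",
        (per_page, offset),
    )
    return cur.fetchall()


# Hypothetical data: 25 books in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO books (id, title) VALUES (?, ?)",
    [(i, f"Book {i}") for i in range(1, 26)],
)

first_page = fetch_page(conn, page=1)
last_page = fetch_page(conn, page=3)
```

If the underlying query is itself expensive, every page request still pays that cost, which is the trade-off being described: the alternative is shipping all the rows to the browser once and paginating there.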

33:08 Well, yeah, probably also heavily depends on your team, right?

33:11 If you have a bunch of people and they just love Python and they don't want to touch JavaScript,

33:15 like saying, you know, it's probably better if we do this in JavaScript and we just force

33:20 everyone to now do JavaScript.

33:21 Like that might not actually be better given the people working on it, right?

33:26 Like it's, I think the team's desires and capabilities also should be considered, right?

33:31 Yeah.

33:31 Because data scientists typically don't know JavaScript, right?

33:34 They know Python and depending on their background, they might have some exposure to HTML, CSS,

33:39 but you can't really depend on it.

33:41 You can depend on Python or R and hopefully SQL.

33:44 So sometimes it's just the nature of who's going to be working on this project.

33:48 They might choose the technology that fits.

33:50 So it might be, this is Python only because I'm really good with Jinja templates.

33:52 So I'm going to use those.

33:54 And it might be that we've got dedicated front end engineers who are really good with JS and

33:57 we put them on those heavier interactivity type projects.

34:00 And so it's also very much about where your team's at.

34:04 We're a small team.

34:05 And so you kind of play to your strengths.

34:07 Yeah, that's definitely a good idea.

34:08 One of the ways you can add interaction to these apps and not go and write all that stuff

34:15 in D3 or HTML5 Canvas and JavaScript, God forbid, is to use something like Bokeh or Plotly

34:23 or some of these other interactive tools.

34:25 There's like things you can add to your site as well that may get you close enough for this

34:30 exploration.

34:31 What do you think?

34:32 This stuff is magical because it unlocks that interactivity that I was saying is so good about

34:37 the web to people who don't know JavaScript.

34:39 So Bokeh is a great example.

34:41 Plotly is also one that I'm a big fan of.

34:43 Plotly is nice because the charts that you create in Python are serializable to JSON.

34:48 That's actually how it renders it into JavaScript eventually.

34:52 But if you interact with Plotly within JavaScript itself, it's the same data structure.

34:57 And so it plays very nicely when we have pure web devs and pure data scientists that they

35:01 can both interact with the same objects.
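
To make the "same data structure on both sides" point concrete, here is a small sketch that uses a hand-written dictionary in the general shape of a Plotly figure (a list of traces under "data" plus a "layout" object) rather than the Plotly library itself. The chart contents are invented for illustration.

```python
import json

# A figure described as plain data: a list of traces plus a layout.
figure = {
    "data": [{"type": "bar", "x": ["a", "b", "c"], "y": [3, 1, 2]}],
    "layout": {"title": {"text": "Widgets per group"}},
}

payload = json.dumps(figure)    # what the server could send over the wire
restored = json.loads(payload)  # what the browser-side code would receive
```

Because the chart survives a JSON round trip unchanged, a data scientist can build it in Python and a front-end developer can adjust the same "data" and "layout" keys in JavaScript.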

35:03 So I'm a big fan of Plotly.

35:05 There's also a cool project called Dash.

35:07 That's part of the Plotly umbrella.

35:09 I don't know how you'd call it that.

35:10 That is a declarative way to create dashboards and visualizations that actually renders into

35:16 a React app.

35:17 So the internals are pretty cool.

35:19 You create this layout, it serializes it, and there's a React app that renders that into

35:23 charts.

35:23 So that's another cool way to unlock sort of interactivity.

35:26 You can hook up widgets and whatnot.

35:27 So those kind of ecosystems and tools, I see more and more of that.

35:31 I think it's a great place to go again because it just enables people.

35:35 Yeah, for sure.

35:36 I think that's one of the things that you learn as you progress in software development

35:39 is we probably all had the experience of, I had this problem.

35:44 I wrote all this code to solve it.

35:46 And then I realized there was a library that would solve it in two lines.

35:49 And it took me a week to solve it.

35:51 Like there's all these things you can find and add in.

35:53 I guess the challenge there is knowing when adding in something like that will get you 80%

35:59 of the way there and then be a complete pain to finish.

36:02 Or is it going to be good enough and you'll be happy with it?

36:05 You know, like that's a real challenge, I think.

36:07 But a lot of these data science visualization tools you can drop into your website do seem

36:12 really nice.

36:13 Yeah, that's honestly one of the biggest things I think that makes experience matter is knowing

36:18 when do I look for another tool?

36:20 When should I pay for something?

36:22 When do I roll my own?

36:24 Or when do we go with the 80-20?

36:25 And you can over-architect things to the point where nothing gets done.

36:30 And then on the flip side, you can create this spaghetti monster mess that's completely

36:33 unmaintainable.

36:34 And I think being pragmatic is the biggest thing you hope to get with experience.

36:38 Yeah.

36:39 And coming back to what you were touching on before, if it's testable, maintainable, and

36:43 you can evolve it quickly, you can make one choice and then change your mind, right?

36:49 Because it's easy to change.

36:50 Because you have the test that'll tell you, no, it's not actually broken yet.

36:53 And things like that, right?

36:54 There's also that aspect of it, I think.

36:56 Yeah, you're paying for technical debt, basically.

36:59 And if you write unit tests and all these other things, you're getting rid of that.

37:02 You catch up later.

37:05 This portion of Talk Python to Me is brought to you by Rollbar.

37:08 Got a question for you.

37:09 Have you been outsourcing your bug discovery to your users?

37:12 Have you been making them send you bug reports?

37:14 You know there's two problems with that.

37:16 You can't discover all the bugs this way.

37:18 And some users don't bother reporting bugs at all.

37:21 They just leave, sometimes forever.

37:22 The best software teams practice proactive error monitoring.

37:27 They detect all the errors in their production apps and services in real time and debug important errors in minutes or hours, sometimes before users even notice.

37:35 Teams from companies like Twilio, Instacart, and CircleCI use Rollbar to do this.

37:40 With Rollbar, you get a real-time feed of all the errors so you know exactly what's broken in production.

37:46 And Rollbar automatically collects all the relevant data and metadata you need to debug the errors so you don't have to sift through logs.

37:53 If you aren't using Rollbar yet, they have a special offer for you, and it's really awesome.

37:57 Sign up and install Rollbar at talkpython.fm/Rollbar.

38:01 And Rollbar will send you a $100 gift card to use at the Open Collective,

38:06 where you can donate to any of the 900-plus projects listed under the Open Source Collective or to the Women Who Code organization.

38:14 Get notified of errors in real time and make a difference in open source.

38:17 Visit talkpython.fm/Rollbar today.

38:20 You recently wrote an article called Flask Best Practices, Patterns for Building Testable, Scalable, and Maintainable APIs,

38:29 really from the perspective that we're coming from here, from the data science side of things, but also the web developer side.

38:37 So maybe we could touch on that a little bit.

38:38 Sure.

38:39 So the idea with that blog post was for structuring these kind of applications, you know, Flask is very unopinionated, which is great.

38:47 You know, it's kind of like in the front end world, you have this sort of React versus Angular, and there's also Vue, but a lot of discussion between those two.

38:54 And one of the biggest differences, React is very unopinionated and kind of call it plugin-based.

39:00 And then Angular is full weight and sort of opinionated, but kind of has all the bells and whistles included.

39:06 And in the Python world, Flask kind of fits that sort of plug-and-play type thing, where you have the freedom to make all of these decisions and do things the way you want, but you also have the burden to make all these decisions and do things the way you want.

39:18 And so it can cut both ways.

39:20 One of the things that's always bugged me about Flask is I felt like it was presented as artificially simplistic.

39:28 And what I mean by that is like, they always say, Django is super complicated.

39:32 And look at all the stuff you do for Pyramid with this cookie cutter thing.

39:35 But for Flask, you just create one file, app.py, you create the app, you put @app.route on one function, boom, we're done.

39:43 And that's true, it does say hello world on the screen, but like, that's not maintainable.

39:47 That's not how real apps work.

39:48 They get big and complicated.

39:49 And like you were saying, there's just an absence of any guidance on the next step, right?

39:57 How does that not become a 4,000 line app.py?

40:00 Well, that's what happens.

40:01 I mean, Armin Ronacher is very humble.

40:04 So he tends to give talks and basically downplay Flask and say, look how simple it is.

40:08 But you can build big systems with it.

40:11 You just have to make the right decisions.

40:13 And because there's not a whole lot of direction out there, you do end up with these multi-thousand line monstrosities that, you know, you end up having to control F for whatever endpoint you're hitting.

40:23 And it's very difficult to sort of grok.

40:24 And yeah, so this blog post is basically, after having tried a lot of different things and having some sort of core philosophies like testing that I needed out of the end product,

40:34 it was a pattern that I landed on that we tried for several months.

40:39 And it really sang.

40:40 I'm very happy with it.

40:41 And so this blog post is basically kind of just sharing that experience because it was working within a team on a code base that was a mixture of Python, TypeScript, JS, all these kinds of things.

40:51 Yeah, so you basically talk about a set of tools, a way of organizing all the files that you actually have on real actual Flask apps, like data access layers and models and whatever, right?

41:02 Test files and whatnot.

41:04 The structure is really interesting.

41:05 I think it's especially good for APIs.

41:08 It's not the structure that I use for my website, but I do have a very structured way that like kind of has similar goals to what you have.

41:15 So maybe we could just start first talking about this with some of the packages.

41:19 So you have Flask, obviously.

41:21 pytest is pretty de facto these days.

41:25 SQLAlchemy, I think, is probably the most common ORM if you're going to talk to a relational database.

41:29 What? There's another one?

41:30 There are.

41:32 You know, we got Peewee ORM and some other interesting little ones.

41:36 Oh, yeah, yeah, yeah.

41:37 I know.

41:37 Yeah, I know.

41:38 But yeah, you have to choose, though, right?

41:40 With Flask, you've got to go look them all up and decide, right?

41:43 Django is just Django ORM or whatever.

41:44 But then some other interesting ones, you have Flask-RESTPlus, which is interesting.

41:50 You have flask_accepts and also Marshmallow.

41:53 So maybe touch on some of the other ones like Marshmallow, Flask-RESTPlus, and so on.

41:59 Well, specifically, I wanted to integrate these things.

42:02 So Marshmallow really lives outside of Flask.

42:06 It's a serialization and deserialization library.

42:08 And so it has a really rich way of declaring that there are these fields and they have a

42:14 certain type.

42:15 And some things might be required, some not.

42:17 Some might have defaults, some might not.

42:19 And furthermore, it gives you all these hooks for it.

42:22 Yeah.

42:22 So I give you like a Python dictionary, which maybe originated from a JSON file.

42:26 And then Marshmallow will answer the question like, is this valid data?

42:30 Something like that?

42:31 Exactly.

42:32 And if it's not, it will give you good error messages.

42:35 So let's say you give me a JSON file.

42:37 You post it to my server and I've declared that I want to make a widget.

42:41 And a widget has these fields that I declare with the Marshmallow schema.

42:44 I call the load method and it will take the input data and create either a dictionary or you

42:50 can give it a hook to create some kind of actual Python class.

42:53 And if one of those inputs fails validation, it gives you a really nice error message that,

42:57 oh, this is not a valid int.

42:58 And so on the front end, the front end devs, it really helps them out because they say, oh,

43:03 okay, AJ's schema has kicked back an error because I'm passing a float instead of an int or whatever

43:08 it is.

43:08 So there's Marshmallow.

43:10 And then it also does the serialization on the way out.

43:13 So for example, you have a user model that has a password.

43:17 You probably don't want to send the password back to the front end.

43:21 So you can set certain fields that are, you know, read or write only to solve that kind of problem.

43:26 So it's very flexible.

43:27 On the flip side,

43:29 there is Flask-RESTPlus, which I was very excited about, which basically just makes it easier to write

43:36 endpoints where you're going through the full, you know, GET, POST, PUT, DELETE flow,

43:41 because otherwise you're writing a raw Flask endpoint.

43:44 You have to sort of declare the methods it takes and effectively do a switch case on that.

43:48 But with Flask-RESTPlus, you create these resources that are class-based and you put the methods that

43:53 they need for each of the different HTTP methods.

43:56 And it just makes it less code.
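
The class-based style described here can be illustrated without any web framework. The Resource base class below is a tiny stand-in written for this sketch, not the Flask-RESTPlus class; it only shows how one method per HTTP verb replaces a hand-written switch on the request method.

```python
HTTP_METHODS = {"get", "post", "put", "delete"}


class Resource:
    """Tiny stand-in for a class-based resource; not a real library class."""

    def dispatch(self, method, *args, **kwargs):
        name = method.lower()
        handler = getattr(self, name, None) if name in HTTP_METHODS else None
        if handler is None:
            return {"message": "Method Not Allowed"}, 405
        return handler(*args, **kwargs)


class WidgetResource(Resource):
    """Hypothetical resource: one method per HTTP verb it supports."""

    def __init__(self):
        self.widgets = {}

    def get(self, widget_id):
        if widget_id not in self.widgets:
            return {"message": "Not Found"}, 404
        return self.widgets[widget_id], 200

    def put(self, widget_id, data):
        self.widgets[widget_id] = data
        return data, 200

    def delete(self, widget_id):
        self.widgets.pop(widget_id, None)
        return {}, 204


resource = WidgetResource()
```

A verb the class does not define, such as POST here, falls through to a 405 response without any extra branching in the resource itself.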

43:58 And so combining these two though was not trivial because on one hand, Flask-RESTPlus has its own

44:06 validation scheme that although they declare is going to be deprecated, it doesn't seem like that's ever actually going to happen.

44:12 They even referenced that we're deprecating this because we would like to switch to Marshmallow.

44:17 But if you look through the Git history, there's years of conversation where they're like,

44:22 eh, well, it's kind of hard and we've got this and that problem.

44:24 Sure.

44:24 And so I'm sort of at this frustration point where I want to use both of these cool technologies.

44:29 And so flask_accepts is a library.

44:32 It's just a couple of decorators really that allow me to unify RESTPlus and Marshmallow.

44:37 Because what RESTPlus gives you in addition to those resources is Swagger docs.

44:41 And the Swagger docs allow it to turn your API into a webpage that I can go to and it shows me all the endpoints and I can interactively hit them.

44:50 Basically, it's like almost imagine if you had some kind of export config that would let you load into Postman and it just pre-populated every one of your endpoints with a click and it would just hit them with the right parameters and whatnot.

45:03 And so in order to get both of those, this flask_accepts library allows you to sort of mix the two.

45:09 So you would just have an endpoint.

45:11 You put this accepts decorator, give it a Marshmallow schema.

45:14 It will apply that validation on the way in, attach something to the Flask request object that you can then go on your merry way and use.

45:20 And it also supports those Swagger docs.

45:22 Yeah, it looks really nice.

45:23 And this flask_accepts is something that you wrote because of your work with Marshmallow and Flask-RESTPlus.

45:30 And you're like, these need to live better together, right?

45:32 Yeah, it was purely a personal pain point.

45:34 And I wrote it over like a weekend to solve a problem I had at work.

45:38 And naturally then since it's 2019, you open source things.

45:42 Yeah, it's cool.

45:43 I don't know if people will find it useful or not.

45:45 There's other things, but it solves my problem.

45:47 Yeah, super.

45:47 So those are all really good.

45:49 And you talk about those in there.

45:50 One other thing I want to throw out there, if people are just thinking about a new Flask project, potentially, is secure.py.

45:59 So there's a lot of recommendations from OWASP and other web security companies and organizations saying, you know what?

46:05 You should really add the X-XSS-Protection header and turn it on and set the mode to block.

46:12 Or the X-Frame-Options header should by default be set to SAMEORIGIN so nobody can embed you into their sites and make it look different or whatever.

46:20 So like with one line of middleware for Flask, Pyramid, Django, and a whole bunch of others, you can just add this thing and it'll automatically add all those headers.

46:27 So that's pretty cool.

46:28 So if you're security conscious, that might be worth checking out.
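
For a sense of what such middleware does, here is a hand-written sketch that applies the two headers mentioned above to a response's headers. This is not secure.py's API; the function name and the use of a plain dictionary for headers are assumptions made for illustration.

```python
# The two headers discussed above, with the values mentioned.
SECURITY_HEADERS = {
    "X-XSS-Protection": "1; mode=block",
    "X-Frame-Options": "SAMEORIGIN",
}


def add_security_headers(headers):
    """Return a copy of the response headers with security defaults applied."""
    merged = dict(headers)
    for name, value in SECURITY_HEADERS.items():
        merged.setdefault(name, value)  # keep any value the app set explicitly
    return merged
```

In a Flask app, logic like this would typically run in a hook that sees every outgoing response, which is why a library can offer it as a one-line addition.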

46:31 Yeah, I think especially with security, it's really smart to delegate that to third parties and allow them to be the experts because it's a full-time career to keep up with that.

46:41 And every day there's some new hack people come out with and just delegate that stuff.

46:45 Yeah, and you'll sleep slightly better.

46:46 Yeah, slightly better.

46:47 Slightly.

46:48 Like it's no fantasy, but it's better than not.

46:51 You wake up for other problems.

46:53 Exactly.

46:53 So we don't have a whole ton of time to spend on this, but I do want to talk just a little bit about this and then touch on some high-performance computing bits as well.

47:01 But maybe give us a quick overview of how you're structuring your Flask app because I think it's, like I said, it's different than what I'm doing and it's a little bit unique and people can think about whether it makes sense, but I think it's definitely worth considering.

47:13 What are you doing there?

47:15 And you were right that this is kind of more thought about from an API perspective.

47:18 So if you're doing a lot of view rendering, this is not directly the same approach.

47:22 But in essence, the philosophy is you want to be able to break up each of your, call them entities, it's a thing you're going to interact with from the server, into separate directories.

47:33 Like this is the core tenet of your idea of organizing your code and your files and whatnot.

47:38 And the reason I think this makes more sense for APIs than it does for apps is APIs are often centered around these entities, right?

47:46 Like I want to go talk about the users or talk about these, right?

47:50 Like this is like the essence of what APIs do often.

47:53 Yeah, but I mean, even if I were to go set up a site that was more view-based, I would still think about it this way in the sense of if I had some kind of module that was like my users, I'm still going to change each of the pages maybe within the users module to be their own single folders.

48:08 Yeah, I agree.

48:09 So go ahead and tell us about this.

48:11 Yeah, so the idea is that you have a model, which is basically probably your SQLAlchemy layer, you know, your persistent storage.

48:17 You have some kind of interface. We called it an interface

48:21 because it is basically what a TypeScript interface is.

48:23 So it defines the type, what are the attributes that are needed to make this object?

48:28 And there's a whole discussion there around why typing is super beneficial.

48:33 So there's great tools out there like mypy and static analysis tools, and they save you an enormous amount of bug finding by highlighting errors where you've typoed something or missed a parameter.

48:44 And so that gives you a layer to sort of make sure that your functions are behaving the way at least you're using them the way you've defined them.
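
One way to express such an interface in Python is typing.TypedDict (Python 3.8+), sketched below with hypothetical field names. Whether the article uses exactly this construct is not stated here; the point is only that the expected attributes are declared once and a static checker can compare call sites against them.

```python
from typing import TypedDict


class WidgetInterface(TypedDict):
    """Declares which attributes a widget needs, and their types."""

    name: str
    purpose: str


def describe(widget: WidgetInterface) -> str:
    return f"{widget['name']}: {widget['purpose']}"


good: WidgetInterface = {"name": "gear", "purpose": "turns"}
print(describe(good))

# A static checker such as mypy would be expected to flag the next line,
# because the key is misspelled and 'purpose' is missing:
# bad: WidgetInterface = {"nmae": "gear"}
```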

48:50 There's the schema, which is the Marshmallow schema that I referred to.

48:54 And so that handles the serialization layer on the way in and out of the application.

48:59 That's where you can do last minute transforms, change names, export things, pick fields off, whatever.

49:05 There is the controller.

49:08 The controller is what's responsible for routing.

49:10 It actually is the route.

49:11 The controller, the way I think about it is it takes all these other pieces, and it kind of is the glue.

49:16 It gets the parameters, calls the service, which I'll discuss in a second, gets the data, wrangles it, and outputs it.

49:23 And then lastly, you have the service, which is what's responsible for actually manipulating the entity.

49:29 Now, whether the entity is like a user you're storing in your database or whether it is the result of some optimization calculation, it doesn't really matter.

49:36 It's just a way for you to organize all this code.

49:39 And then internally, you can use Pandas or NumPy or C++ or whatever, just how you think about it.

49:44 Because the last place that those database queries or those Pandas calculations should be is like in the route view method.

49:52 Under your route, yeah.

49:54 It shouldn't be crammed in there.

49:55 It should be separate.

49:56 The route should be pretty simple, right?

49:57 Yes, exactly.

49:58 You know, five, ten lines.

49:58 It should be orchestration.

50:00 Get the data, call the service.

50:01 Exactly.

50:01 It should be orchestration between all these other pieces is the way I see it.
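
A framework-agnostic sketch of that split, with hypothetical names: the service owns the computation, and the controller only reads parameters, calls the service, and shapes the output. In a real Flask app the controller would be the route function; a plain function stands in for it here.

```python
class WidgetService:
    """Owns the actual work; could wrap SQLAlchemy, Pandas, NumPy, etc."""

    def __init__(self, rows):
        self._rows = rows

    def total_by_group(self, group):
        return sum(r["value"] for r in self._rows if r["group"] == group)


def widget_total_controller(request_args, service):
    """Thin route body: read parameters, call the service, shape the output."""
    group = request_args.get("group")
    if group is None:
        return {"error": "missing 'group' parameter"}, 400
    return {"group": group, "total": service.total_by_group(group)}, 200


service = WidgetService(
    [
        {"group": "a", "value": 2},
        {"group": "a", "value": 3},
        {"group": "b", "value": 10},
    ]
)
```

Because the controller takes the service as an argument, each half can be tested on its own, which ties back to the earlier point about writing code for tests.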

50:04 Yeah.

50:04 Cool.

50:06 And then tests go alongside with that.

50:08 Yeah.

50:08 So your main philosophy here is that instead of having a services section and a controller section and a data section and whatnot,

50:16 is to have like a user section with a service controller model kind of in its own self-contained area, right?

50:25 Yes.

50:26 So how long would it take you to delete everything related to users?

50:30 That's basically your sanity check.

50:32 If you've got to go through and dig eight trees down into every other directory and find it, you're not compliant with what I am proposing.
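
The sanity check can be tried out directly. The sketch below builds a throwaway project in a temporary directory with one folder per entity, then deletes everything related to users in a single call. The file names are illustrative guesses based on the pieces listed above (model, interface, schema, service, controller, tests), not a required layout.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical per-entity files, one for each piece described above.
ENTITY_FILES = [
    "model.py", "interface.py", "schema.py",
    "service.py", "controller.py", "controller_test.py",
]


def make_entity(app_dir: Path, entity: str) -> Path:
    """Create one self-contained folder holding everything for an entity."""
    entity_dir = app_dir / entity
    entity_dir.mkdir(parents=True)
    for filename in ENTITY_FILES:
        (entity_dir / filename).touch()
    return entity_dir


with tempfile.TemporaryDirectory() as tmp:
    app_dir = Path(tmp) / "app"
    make_entity(app_dir, "users")
    make_entity(app_dir, "widgets")

    # The sanity check: removing an entity is a single directory delete.
    shutil.rmtree(app_dir / "users")
    remaining = sorted(p.name for p in app_dir.iterdir())
```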

50:39 Okay.

50:40 Yeah.

50:40 It's an interesting article and people can check it out for sure.

50:43 The gory details.

50:44 Yeah.

50:44 Yeah, exactly.

50:45 But I definitely think, you know, if nothing else, like Flask-RESTPlus, Marshmallow, all these things are pretty interesting.

50:54 And it's cool to see how you're putting them to use there.

50:56 Great.

50:57 So let's talk a little bit about computation.

50:59 You know, I touched on this at the beginning when I said, like, what is the deployment story, right?

51:03 Do I need to like deploy to a cluster that has GPUs because I got to have GPUs in production or something like this, right?

51:10 Let's talk a little bit about that.

51:12 I mean, I think one of the things that's interesting to just touch on first is Python performance.

51:17 Like people will tell me sometimes, usually they're not Python people.

51:21 They'll tell me Python is slow.

51:23 So I can't use Python for X, right?

51:26 The performance story of Python is actually complicated and it depends, right?

51:32 Like, yes, maybe this part of Python, if I wrote it in pure Python is slower than, say, Java.

51:37 But if I were to write it in C, it would actually be faster than Java.

51:42 And so I could actually do something like a Cython version of this code, which then, you know, maps over to C++ or C and then compiles down to native instructions.

51:53 So is it fast or is it slow?

51:55 I don't know.

51:55 Like it's like this blend.

51:57 And so you can do a lot of tricks to make it fast where it matters.

51:59 Use things like NumPy and whatnot.

52:01 But maybe like, like, how do you see that world from somebody who does way more computation than I do?

52:05 Right.

52:06 This whole topic of HPC, high performance computing is something I'm super passionate about.

52:10 It was my whole grad experience.

52:12 As far as Python's performance, you should compute from Python, but not with Python.

52:17 So Python is slow.

52:19 If you're talking about doing numeric computation in Python, you should invoke Python libraries that push that computation down to C.

52:29 So, I mean, I write loops all the time.

52:32 But anytime you're writing a loop to accomplish something that has a lot of i's, j's, and k's in it, you should at least kind of get the heebie-jeebies and think maybe there's a NumPy or Pandas method that's going to push this down to a C or C++ layer that's super fast.
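
As a rough sketch of that advice, here is the same calculation written twice: once as a triple loop over i, j, and k in pure Python, and once with the looping handed to NumPy. The pairwise squared distance problem and both function names are chosen here for illustration; they do not come from the conversation.

```python
import numpy as np


def pairwise_sq_dist_loops(points):
    """Pure-Python triple loop over i, j, k: easy to read, slow on large inputs."""
    n, dims = len(points), len(points[0])
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(dims):
                diff = points[i][k] - points[j][k]
                out[i][j] += diff * diff
    return out


def pairwise_sq_dist_numpy(points):
    """Same result, with the loops pushed down into NumPy's compiled code."""
    arr = np.asarray(points, dtype=float)
    # Broadcasting builds every (i, j) difference at once: shape (n, n, dims).
    diff = arr[:, None, :] - arr[None, :, :]
    return (diff ** 2).sum(axis=-1)
```

The vectorized version trades memory for speed, since it materializes the full (n, n, dims) array of differences, so it is worth measuring on realistic input sizes before committing to it.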

52:45 Right.

52:45 And if there's not, could you make that one function a Cython function and really change it, right?

52:51 Right.

52:52 Yeah.

52:52 So Cython's the way to go if you have a custom operation that you want to do that you can't do out of the box with NumPy.

52:59 Because basically all you do is you provide a set of annotations to create the Cython file and then you run a transpiler that converts that into valid C, C++, and then brings that back into Python.

53:11 Because Python, if you don't know, is ultimately written in C.

53:14 So this kind of gives you a way to write Python and compile it and then bring it back in.

53:19 Or you could also just write directly in C++.

53:22 And that's kind of what you have to do if you want to do custom computing with GPUs.

53:27 You're almost, and again, unless you're using a library that enables GPU routines, if you're doing custom stuff, you're going to have to go to the C++ level and bring in like a shared library or something like that.

53:39 At that point, it's pretty much unavoidable.

53:40 Yeah.

53:40 It's interesting.

53:41 I have yet to need to go to C++.

53:44 But I'm happy when the libraries that I use have C extensions or parts of them are somehow using C speedups to make them much faster.

53:55 And I don't have to worry about it.

53:57 Yeah.

53:57 So I find writing in raw C++ is helpful if you need to deal with a lot of multi-threaded stuff.

54:02 Multi-threaded in the C++ context, not in the GIL context.

54:06 So if you have an algorithm that you want to manually deal with pushing out threads and multi-procking something and bringing them back, I find it's easier to do in C++.

54:15 But to be honest, I don't have to do that very much day to day.

54:18 I did that in the past life.

54:19 But in industrial data science, I don't need to do that.

54:23 Pretty much NumPy, SciPy, Pandas gives me everything I need more or less.

54:26 It takes a lot of time, right?

54:28 It's developer time versus compute time at the end of the day.

54:30 Yeah.

54:30 Yeah.

54:30 It's another one of those define fast, right?

54:32 Like if it takes you two days to write the C++ code that then runs in five minutes, or it took you half an hour to write the Python code and it runs in 20 minutes, who solved that problem first, right?

54:46 Like it depends how many times you're running it.
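
The trade-off in that example can be worked out with a little arithmetic. The sketch below uses the numbers from the conversation and adds one assumption that is not in it: "two days" is taken to mean two 8-hour working days.

```python
import math


def total_minutes(dev_minutes: float, run_minutes: float, runs: int) -> float:
    """Time to write the code once, then run it `runs` times."""
    return dev_minutes + run_minutes * runs


# Assumption: "two days" means two 8-hour working days.
CPP_DEV, CPP_RUN = 2 * 8 * 60, 5   # 960 minutes to write, 5 minutes per run
PY_DEV, PY_RUN = 30, 20            # 30 minutes to write, 20 minutes per run

# C++ catches up when CPP_DEV + CPP_RUN * n == PY_DEV + PY_RUN * n.
break_even_runs = math.ceil((CPP_DEV - PY_DEV) / (PY_RUN - CPP_RUN))
```

Under that assumption the two approaches tie at 62 runs, so the Python version wins for anything run fewer times than that, and the C++ version only pulls ahead from the 63rd run onward.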

54:48 Like there's a lot of considerations there.

54:50 But if you're working for a hedge fund where a tiny increment of performance is whatever, put a bunch of zeros after a dollar sign, then maybe it matters and it makes sense to spend weeks on this tiny micro optimization.

55:01 It's all about context.

55:02 The whole hedge fund algorithmic trading space is crazy.

55:05 Like when you consider like server co-location and stuff so that you can like drop a few milliseconds of latency because that changes your profit margins.

55:14 Like that's a bizarre industry.

55:15 Yes.

55:16 But that's how it is, right?

55:17 Cool.

55:18 All right.

55:19 So you talked about GPUs.

55:20 Like what are, I mean, what are some of the things that we compute with GPUs?

55:24 You know, the real simple ones are like, I'm doing machine learning training, like a training machine learning model.

55:31 And I'm going to do that, say with GPUs or something like that.

55:33 But what are some of the other things I might go and write GPU code for?

55:39 Well, in practice, you might not need to.

55:41 You probably don't need to until you have a problem that you don't have a better solution for.

55:46 So I don't necessarily think it's sort of the general audience, but for the people that it's the right solution for, it's probably the only good solution.

55:54 So GPUs are very good at doing enormous computations in parallel.

56:00 So interestingly, I mean, they started out as graphics cards, right?

56:02 And I think people here make the connection between like my video game engine and how am I doing science?

56:07 Well, what actually happened was scientists realized that they could use these native graphics card APIs for like texture or vertex shading.

56:16 And they wrote problems like whatever, an earthquake simulation or something using these vertex APIs because along the way it did the math they needed.

56:24 Yeah.

56:24 And that was sort of where CUDA came out.

56:26 So the kinds of problems they solve are things that you can do heavily in parallel.

56:31 So for example, matrix math, GPUs are very good at.

56:34 Aggregations, GPUs are very good at.

56:37 Sorting, not very good at because sorting is sort of inherently in general interdependent on the state of the rest of the algorithm.

56:45 And although there's GPU sorting algorithms at a high level, those are the kinds of problems.

56:50 So it's the right tool for the right job.

56:52 Yeah, cool.

56:53 You know, I didn't appreciate how much computation GPUs could do until I started working in 3D graphics, which I did a long time ago.

57:01 But you think I'm loading up all these models.

57:03 I'm applying all these, like every time you want to rotate or move any item that is on the screen, it's a matrix multiplication on all the vertices of it to make that happen.

57:15 And there's usually multiple ones.

57:16 And then just if you start to think about like a modern graphical simulation, a game or some other 3D environment, and then the amount of matrix multiplications per second at 120 frames a second.

57:29 It's mind boggling.

57:30 Like the human mind cannot grasp how much math is happening per second.

57:34 It's just unbelievable.
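
To make the rotation point concrete, here is a small CPU-side sketch with NumPy of rotating every vertex of a model with a single matrix multiplication. A GPU does the same kind of product across far more vertices in parallel. The function name `rotate_z` and the triangle data are illustrative assumptions.

```python
import numpy as np


def rotate_z(vertices, degrees):
    """Rotate an (n, 3) array of vertices about the z-axis with one matrix product."""
    theta = np.radians(degrees)
    c, s = np.cos(theta), np.sin(theta)
    rotation = np.array([[c, -s, 0.0],
                         [s, c, 0.0],
                         [0.0, 0.0, 1.0]])
    # Each row is a vertex; multiplying by the transpose rotates every row at once.
    return np.asarray(vertices, dtype=float) @ rotation.T


triangle = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rotated = rotate_z(triangle, 90)
```

A quarter turn about z sends the x-axis vertex onto the y-axis and leaves the vertex on the z-axis where it was, up to floating-point rounding.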

57:35 Yes.

57:36 I mean, the physics engines in games these days do, you know, ricochets and everything on the fly.

57:40 And then there's also the memory management.

57:43 You've got memory on the host and the GPU, and it's constantly going back and forth.

57:47 I mean, all of this, it's extremely complex.

57:48 So, yeah, it's really amazing technology.

57:50 Yeah, if you can turn that computational engine on to like more direct problems like you're talking about, it's pretty interesting.

57:57 So, some other options, you know, like there's grid computing.

58:01 We could set up like 50 computers in AWS or GCP, something like that.

58:08 Like what's a good story for that, right?

58:09 Like maybe you don't need all these clusters, or if you do have them, maybe you can use something like Dask, and you don't actually have to program against it.

58:17 The magic just happens if you set them up right.

58:19 Yeah, Dask is a great project because it's basically solving these kinds of problems for you.

58:24 You know, as the user, you just say, I have a thing.

58:26 I need it to run fast.

58:28 I don't want to know about how you have your nodes networked together for running an MPI job.

58:34 But originally, that was what you had to do.

58:36 So, I think the more projects like Dask evolve, and you kind of write what feels like maybe pandas, and it can just scale outwards, it's going to be really good for the community.

58:45 Yeah, that's cool.

58:46 You know, I always thought Dask was something that would make no sense for me because I don't really have these large cluster type computations that I ever have to do.

58:53 But when I was talking to Matthew Rocklin, I realized, he pointed out to me, that you can run Dask on a single CPU, or a single machine, and it'll actually do the multiprocessing and parallelism and all that.

59:06 And so, it's just another interesting use case where you might not think of it.

59:09 Yeah.

59:10 I mean, all that comes with overhead, too, though.

59:11 So, it's always the right tool for the right job.

59:13 You wouldn't take an airplane across the street, even though it's faster than a car, right?

59:18 Because by the time you went through all of that, it doesn't make any sense.

59:21 The overhead's too high.

59:22 But at some point, it catches up, right?

59:24 Yeah, for sure.

59:25 Super interesting.

59:26 Okay, well, I guess let's leave it at this.

59:29 You have a few other libraries to leverage HPC without too much expertise.

59:34 You want to take us through those really quick, and then we'll close it out?

59:37 Yeah, so it's just the idea is if you're not somebody who knows CUDA and is obsessed with all these little micro-optimizations, like I've been for a long time,

59:46 which is the vast majority of the normal working coding population, the way to get the advantage of these is to just use tools like Dask.

59:54 That's one option.

59:55 There's libraries that are inherently parallel.

59:57 NumPy has some multi-core support and whatnot.

01:00:00 There's also a lot of analytics, sort of more managed services, like Google BigQuery is basically a SQL-like engine that just scales out transparently to you to lots of nodes.

01:00:12 It comes back very quickly.

01:00:14 There's another one called CitusDB that Microsoft, I think, recently acquired, and it's very good for that kind of thing.

01:00:19 So there's technologies that you can leverage, and I think, by and large, that's the most straightforward way to get into that without getting a PhD in the subject.

01:00:27 Yeah, start there.

01:00:28 Then go custom if you got to, right?

01:00:30 Exactly.

01:00:31 All right, cool.

01:00:31 Well, this has been really fun to talk about all these things.

01:00:33 Now, before we wrap it up, I've got to ask you the two final questions, of course.

01:00:39 If you're writing some Python code, what editor do you use?

01:00:41 VS Code these days.

01:00:43 I used to be diehard Vim because I knew the shortcuts.

01:00:46 I could get around faster.

01:00:48 And something about VS Code, I found myself outpacing my productivity.

01:00:53 And then at that point, I was sold.

01:00:55 I mean, I still use Vim when I'm in a remote terminal.

01:00:57 But I'm a big VS Code fan, yeah.

01:01:00 That's cool.

01:01:01 Yeah, it's definitely got the momentum these days.

01:01:02 And there's a lot of effort to make it better.

01:01:05 So I think it's only going to get better.

01:01:06 It's great for TypeScript.

01:01:08 If you're writing Angular, it's very good for TypeScript.

01:01:10 I mean, Microsoft makes both products.

01:01:12 Go figure it out.

01:01:12 Yeah, and it's literally even the Python extension is written in TypeScript.

01:01:17 So it's no surprise that it's good at writing in TypeScript, right?

01:01:21 Yeah, Electron apps.

01:01:22 Yeah, I mean, they use TypeScript for that.

01:01:23 Yeah, absolutely.

01:01:24 Cool.

01:01:24 And then notable PyPI packages?

01:01:26 So there's one called pytest-watch.

01:01:29 I mentioned, I mean, I like pytest.

01:01:31 This one's super simple.

01:01:32 It just gives you a terminal watcher.

01:01:34 So you pip install pytest-watch.

01:01:36 And there's like a ptw command that just on reload will run your tests again.

01:01:41 Super simple, but I use it all the time.

01:01:43 That's really cool.

01:01:43 So if you just hit save, like your test run, basically.

01:01:46 Yep, yep.

01:01:46 And so I just leave that running pretty much 24-7 and just sit there, change code,

01:01:51 control S, and just glance over and make sure my tests are good all day long.

01:01:54 That's pretty solid.

01:01:55 Very helpful.

01:01:55 Cool.

01:01:56 Another is, I mentioned mypy.

01:01:59 And then specifically, there's a flake8 mypy extension for VS Code.

01:02:03 mypy is a static type analysis tool.

01:02:06 So it will do things like, if I write a function in Python, and I say, this takes a parameter

01:02:10 that is a string.

01:02:12 And then I later call that function with a parameter that's an int.

01:02:15 I get a little red squiggle, and it says, you know, error, you've called this function

01:02:18 incorrectly.

01:02:19 And it catches a whole class of bugs that you might not find until runtime because, well,

01:02:24 it's a dynamic language.
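
A small sketch of the situation being described: a function annotated to take a string, then called with an int. The function names here are made up for illustration. A static checker such as mypy reports the bad call before the program runs; without one, the mistake only shows up when the line executes.

```python
def shout(message: str) -> str:
    """Return the message in upper case; the annotation says it must be a str."""
    return message.upper()


def call_with_wrong_type() -> str:
    """Call shout() with an int to show what happens without a type checker."""
    try:
        return shout(42)  # a static checker such as mypy flags this argument
    except AttributeError as exc:
        # Without static checking, the mistake only surfaces here, at runtime.
        return f"runtime failure: {exc}"
```

The annotations cost nothing at runtime, since Python itself does not enforce them. The value comes from running the checker in the editor or in CI.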

01:02:25 So this is a newer thing for me.

01:02:29 It's mostly born out of using TypeScript for a while and moving from JavaScript to TypeScript

01:02:34 and realizing, like, wow, this is a huge productivity boost.

01:02:37 And so now that Python seems to be moving towards more of this type annotation stuff, I'm just

01:02:42 completely jumping on the bandwagon.

01:02:43 That's cool.

01:02:44 Yeah.

01:02:44 I'm a fan of type annotations as well.

01:02:46 Yeah.

01:02:47 I really think they add a lot.

01:02:48 And then if you're interested in performance, there's mypyc, which compiles that code, the

01:02:53 type-annotated Python, to C code as well.

01:02:56 So those people looking for performance, that's another option.

01:02:59 Is that a Facebook thing?

01:03:00 It is.

01:03:01 Where is it?

01:03:02 It's under its own organization.

01:03:05 I don't remember.

01:03:06 There was one like that, I think, from Facebook and one like that from Dropbox.

01:03:11 And I don't know which.

01:03:12 Now there's somebody listening to it and just cursing my name for that.

01:03:15 So apologies.

01:03:15 It's under its own organization.

01:03:17 It's not like under a different organization.

01:03:20 Yeah.

01:03:21 So I'm not sure.

01:03:21 That's a great idea, though.

01:03:22 The idea that you annotate this and now it's almost like Cython, you know, it knows more about

01:03:27 what you meant and now it could potentially convert that into better code.

01:03:30 See it going there.

01:03:31 Yeah.

01:03:31 I don't think it's generally useful yet, but it's getting close.

01:03:35 So it's pretty cool.

01:03:36 Yeah.

01:03:36 You need critical mass.

01:03:37 Yeah.

01:03:38 I think they use it to build mypy itself.

01:03:40 At a certain point, there's enough adoption.

01:03:41 Yeah, exactly.

01:03:41 Yeah.

01:03:42 Super cool.

01:03:42 And then you have one more you want to throw out there before we hit that.

01:03:44 I have one more.

01:03:45 It's called RxPy.

01:03:46 So reactive extensions, you should go to reactivex.io.

01:03:51 It's a whole pattern of programming that's been around for a while, but it's based on the

01:03:56 idea of observable streams.

01:03:58 So rather than imperatively saying, I want my program to do X and Y, you say, I want to

01:04:03 sort of declare what my program should do in response to data changes.

01:04:07 And it makes certain types of problems tremendously easier to reason about and program.
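
The observable-stream idea can be sketched in a few lines of plain Python. This is a hand-rolled illustration of the pattern only, not RxPY's API: the `Stream` class and its `subscribe`, `emit`, and `map` methods are hypothetical names invented for this example.

```python
from typing import Callable, List


class Stream:
    """A tiny hand-rolled observable stream (illustration only, not RxPY's API)."""

    def __init__(self) -> None:
        self._observers: List[Callable[[float], None]] = []

    def subscribe(self, observer: Callable[[float], None]) -> None:
        """Declare what should happen whenever a new value arrives."""
        self._observers.append(observer)

    def emit(self, value: float) -> None:
        """Push a new value to every subscriber."""
        for observer in self._observers:
            observer(value)

    def map(self, fn: Callable[[float], float]) -> "Stream":
        """Return a derived stream whose values are fn(value)."""
        derived = Stream()
        self.subscribe(lambda value: derived.emit(fn(value)))
        return derived


# Declare the pipeline first; nothing runs until data arrives.
celsius = Stream()
fahrenheit = celsius.map(lambda c: c * 9 / 5 + 32)

readings: List[float] = []
fahrenheit.subscribe(readings.append)

celsius.emit(0)
celsius.emit(100)
```

After the two emits, `readings` holds the converted values in arrival order. The program never asked for a conversion imperatively; it declared the reaction up front and let the data drive it.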

01:04:12 And there's a JavaScript counterpart, RxJS, that's super popular in Angular.

01:04:16 And this is the Python implementation.

01:04:18 It's not nearly as mainstream, even though that repo has a lot of stars, but it's something

01:04:23 I'm experimenting a lot with just because I've seen some of the problems it solves in front

01:04:28 end development and found a couple of interesting cases where it can be used in Python.

01:04:33 Yeah, it looks interesting.

01:04:34 I haven't used that before, but the whole observable data notification stuff is kind of cool.

01:04:40 Nice.

01:04:42 Good recommendation.

01:04:42 So, all right.

01:04:43 So people are out there.

01:04:45 Maybe they're doing some data science.

01:04:46 They want to get it on the web.

01:04:47 What's the final call to action?

01:04:49 Where do they start?

01:04:50 So, well, one place you could start is there's another project that sort of was born out of

01:04:55 a weekend thing that would be very, it's sort of in its infancy called Flaskerize.

01:04:59 And the idea there was we have front end apps and back end apps that you traditionally deploy

01:05:05 separately, or you make a Python only app.

01:05:08 But in principle, you could serve your React front end from the same Flask API.

01:05:13 And if you have a single app, it makes deployment a lot easier.

01:05:16 It makes scaling a lot easier.

01:05:17 And so we sort of had this idea, like, why don't I just make a command line tool that could

01:05:22 take a static site that you built with whatever, whether it's a JavaScript thing like React or

01:05:26 Angular, or whether it's a Jekyll or Gatsby or whatever.

01:05:30 And you want to basically embed that in your existing API and potentially Dockerize it with

01:05:35 like one command.

01:05:36 And so that's what this Flaskerize project is about is sort of making a code generation and

01:05:42 templating and DevTool sort of Flask command line interface.

01:05:47 And at this point, I've sort of played around with it a bit and I'm using some of it in its

01:05:53 DevTool form in production.

01:05:55 I think there's a lot that could be done with this, but I don't necessarily always have the

01:05:59 time.

01:05:59 So I have a lot of ideas around that.

01:06:01 So it'd be interesting for people to check that out or potentially contribute.

01:06:06 You know, there's a rich CLI in some of the front end communities that doesn't exist in

01:06:12 Python.

01:06:12 I think that could be a real productivity boost in Python.

01:06:15 We don't really have a good like templated generator.

01:06:19 At least not one that I've found particularly useful.

01:06:21 That's pretty interesting.

01:06:21 Take your stuff and basically convert it to Flask.

01:06:24 That's cool.

01:06:24 All right.

01:06:25 Yeah, nice.

01:06:26 And then I'm going to have a bunch of links from various articles and libraries that you've

01:06:30 talked about here.

01:06:31 So throw those all in the show notes.

01:06:33 People can just click on them right there in their player.

01:06:34 Yeah.

01:06:35 And then I'm speaking at a conference at the end of the month.

01:06:37 If you want to hear me ramble on more about this type of stuff, if you're near the Charlotte

01:06:42 area, there's the Data Science North Carolina 2019.

01:06:45 I guess we'll put a link in the show notes below for that as well.

01:06:48 Yeah.

01:06:48 Do you know if they record those videos and put them online?

01:06:50 They do.

01:06:51 I'll make a reminder to myself once that's out to put that up.

01:06:54 The conference is at the end of August.

01:06:56 So it'll be around shortly after this airs.

01:06:59 Close to the time this comes out.

01:07:00 Yeah, for sure.

01:07:00 Through the magic of time travel, we can put that link retro back in.

01:07:05 Yeah, that sounds good.

01:07:06 And then credit where credit's due.

01:07:08 mypyc is developed by Dropbox.

01:07:10 Thank you for preventing me from getting sued.

01:07:11 Appreciate that.

01:07:12 Yeah, no problem.

01:07:14 All right.

01:07:14 Well, AJ, it's been really fun to talk about this intersection of data science

01:07:18 and web development.

01:07:19 And thanks for sharing with everyone.

01:07:20 Yes, sir.

01:07:21 Thank you for having me.

01:07:23 This has been another episode of Talk Python to Me.

01:07:26 Our guest on this episode was AJ Pryor.

01:07:28 And it's been brought to you by Linode and Rollbar.

01:07:31 Linode is your go-to hosting for whatever you're building with Python.

01:07:35 Get four months free at talkpython.fm/Linode.

01:07:39 That's L-I-N-O-D-E.

01:07:41 Rollbar takes the pain out of errors.

01:07:44 They give you the context and insight you need to quickly locate and fix errors that might have gone unnoticed.

01:07:49 Until users complain, of course.

01:07:51 Track a ridiculous number of errors for free as Talk Python to Me listeners at talkpython.fm/rollbar.

01:07:57 Want to level up your Python?

01:08:00 If you're just getting started, try my Python Jumpstart by Building 10 Apps course.

01:08:04 Or if you're looking for something more advanced, check out our new async course that digs into all the different types of async programming you can do in Python.

01:08:13 And of course, if you're interested in more than one of these, be sure to check out our Everything Bundle.

01:08:17 It's like a subscription that never expires.

01:08:19 Be sure to subscribe to the show.

01:08:21 Open your favorite podcatcher and search for Python.

01:08:24 We should be right at the top.

01:08:25 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm.

01:08:34 This is your host, Michael Kennedy.

01:08:36 Thanks so much for listening.

01:08:38 I really appreciate it.

01:08:39 Now get out there and write some Python code.
