#220: Machine Learning in the cloud with Azure ML Transcript

Recorded on Thursday, Jun 20, 2019.

00:00 On this episode, you'll meet Francesca Lazzeri and hear the story of how she went from research fellow in economics at the Harvard Business School

00:06 to working on the AI and data science stack on the Azure team at Microsoft.

00:11 This is Talk Python to Me, episode 220, recorded June 20th, 2019.

00:29 Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities.

00:35 This is your host, Michael Kennedy. Follow me on Twitter where I'm @mkennedy.

00:39 Keep up with the show and listen to past episodes at talkpython.fm and follow the show on Twitter via @talkpython.

00:46 This episode is brought to you by Linode and Talk Python Training.

00:49 Be sure to check out what the offers are for both of these segments. It really helps support the show.

00:55 Hey, everyone. Before we get to the interview, I want to quickly tell you about a new course we just launched.

01:00 It's our first major Flask course, and it's called Building Data-Driven Web Apps in Flask and SQLAlchemy.

01:06 This one's a deep dive into Flask. We cover things like routing, models, templates, databases, and migrations, and even deployment and security.

01:14 And we do all of this in the context of building a clone of the pypi.org website.

01:19 Check it out over at training.talkpython.fm.

01:22 If you're not sure if you want to choose Flask just yet for your web app, then give our 100 Days of Web course a look.

01:29 We cover many frameworks and programming models in 25 four-day projects, so you get a super wide view of what's out there.

01:37 Then you could pick Flask or Django or Pyramid or something else.

01:41 Thanks for checking it out. Now let's get to the interview.

01:43 Francesca, welcome to Talk Python to Me.

01:46 Hi, thank you. Thank you for having me.

01:48 It's really great to have you on the show.

01:49 We got a chance to meet at Microsoft Build where I was walking around the expo hall floor,

01:54 and you're doing really cool stuff with AI and Python, and I'm just happy to have you here.

02:00 Thank you so much. I'm very excited to be here.

02:03 Yeah, so we're going to dig into this whole machine learning thing that you're working on,

02:08 and just the industry in general, and machine learning at Microsoft as well.

02:12 And that's going to be a lot of fun, but before we get to all those things, you know,

02:15 let's get started with your story. How did you get into programming and Python?

02:18 Yeah, that's a great question. I ask myself this question all the time.

02:21 I have to say, before joining Microsoft as a data scientist, I was a research fellow in business

02:27 economics at Harvard Business School, and over there I was in charge of performing statistical

02:33 and econometric analysis within the Technology and Operations Management unit.

02:38 So at that time, really, my title wasn't data scientist, but it was, as I said, research fellow.

02:44 But I had to work with a massive amount of data from different data sources, such as patent data, publication data, and social network data.

02:54 The goal of my research was to investigate and measure the impact of external knowledge networks on companies' innovation.

03:02 And that's why we were using that specific type of data.

03:07 Is that like social media or what's an external source of data that you might be looking at?

03:12 Patent data, publication data, and social network data.

03:16 These three were the main three sources of data.

03:19 And we were trying to understand if these knowledge networks were also localized from a geographical point of view,

03:27 or they were more like global networks of knowledge.

03:32 And how these networks, both local and global, could affect a company's innovation.

03:39 As you might know, like, for example, Cambridge, Massachusetts, that is where I'm working from,

03:45 and where my office is at Microsoft Cambridge.

03:48 It's a very powerful tech cluster and also pharma cluster.

03:53 Okay, well, that sounds like a really cool thing to study there.

03:55 So you're studying this data, you're studying these projects at Harvard Business School.

04:00 And of course, this is probably not a challenge that can easily be solved with Excel, and definitely not manually, right?

04:06 Exactly.

04:07 So at that time, Python was emerging as a leader in data science programming.

04:13 And while there are still many people in academia who use R, SPSS for their analysis,

04:20 Python, I think, was getting very, very popular because there were a lot of data science libraries.

04:26 I remember that one of the most popular and still very popular is Pandas.

04:32 So it has many built-in features, such as the ability to read data from many sources.

04:38 You can create large data frames from these sources and also compute aggregate analytics.

04:44 And this was exactly what I needed as a research fellow and also for doing all the data analysis that I needed to do.
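
As a small illustration of the pandas workflow described here (read data from a source, build a data frame, compute aggregates), the sketch below uses a few made-up rows. The column names and values are hypothetical and are not taken from the research discussed in the episode.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for a patent data set.
patents_csv = io.StringIO(
    "company,region,year,patents\n"
    "Acme Bio,Cambridge,2012,4\n"
    "Acme Bio,Cambridge,2013,6\n"
    "Helix Labs,Cambridge,2012,3\n"
    "Nordic Pharma,Copenhagen,2012,2\n"
    "Nordic Pharma,Copenhagen,2013,5\n"
)

# read_csv accepts file paths, URLs, or file-like objects such as this buffer.
patents = pd.read_csv(patents_csv)

# Aggregate analytics: total and mean patent counts per region.
by_region = patents.groupby("region")["patents"].agg(["sum", "mean"])
print(by_region)
```

The same `groupby` and `agg` pattern applies unchanged when the data frame comes from a database query or a much larger file.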

04:52 Yeah, that's really interesting.

04:53 What year was this?

04:54 Like, about?

04:55 It was about 2012, 2013, 2014.

05:00 Because then I started to work at Microsoft as a data scientist exactly in 2014.

05:06 And at that time, I remember that R was still very, very strong for both academics and also data scientists.

05:15 But Python was already there, of course.

05:17 Of course, and the reason I ask is I have this theory that there's a big change at an inflection point around 2012 where a couple of things happened that just made Python in the data science space really start to accelerate out ahead of the other options like R.

05:35 And it's just, it's super interesting that it's this time as well that you're, you sort of picked it up.

05:39 Yeah, totally agree.

05:40 Yeah.

05:40 And so did you know other programming languages before or did you have to teach yourself Python to get into this?

05:46 I had to teach myself Python.

05:47 Of course, at university, I took a couple of classes, just introduction to Python.

05:52 And those classes were very helpful.

05:55 Before that, I was already coding a lot in R.

05:58 SQL was another very powerful language that I'm still using sometimes to just ingest the data.

06:05 You know, sometimes you need to prepare your data, some tables, and you need just to ingest them.

06:10 And from there, you can start using the tools that you prefer to build a machine learning solution.

06:16 And it's Python, the main tool that I use.

06:18 But I have to say, besides these two languages, the third language for me is Python, because it's very, very powerful for data science.

06:28 And I would say data analysis in general.

06:31 So also for people who need to do a lot of statistics, econometrics type of analysis, Python is a very, very good choice.

06:39 Absolutely.

06:40 Did you come up with any really interesting results from your research at your time there that you can talk about briefly or anything like that?

06:47 Yeah, absolutely.

06:48 I mean, I love my time there at Harvard.

06:51 I was working with a great professor and a great research unit.

06:55 And so the results were very interesting.

06:59 So we noticed that for the most important innovations, and we were targeting the biotech industry, the localization was very, very important.

07:10 And like we noticed that data scientists who were located in the same, sorry, not data scientists, scientists in the biotech industry, and also some data scientists, but we were looking just at the general category of scientists who were located in the same geographical cluster.

07:25 They were actually able to innovate more often and much faster.

07:30 So, and of course, we were seeing these in terms of publication, in terms of how they were citing each other.

07:37 And most importantly, in terms of patents, because the only way to really measure knowledge and how the knowledge can be clustered is through patents, unfortunately.

07:48 I mean, I say unfortunately, because it's a sort of extra step.

07:52 You have to make a few assumptions to make this correlation between knowledge and patents.

07:58 But still, it's a good indicator for innovation.

08:02 So we noticed this.

08:04 However, we noticed that for long-term innovation, like if you look at the company and you look at the number of patents that the company and the research unit have filed, you can see that geographical barriers are actually not an issue.

08:17 And right now, especially with the use of technology, innovation and the knowledge do not need to be linked from a geographical point of view.

08:28 So this is what we noticed with our research.

08:31 Yeah, that sounds really interesting.

08:32 So the short term is more important to be around fellow scientists, but long term, the knowledge gets out there.

08:38 Cool.

08:39 All right.

08:40 So you're no longer working.

08:41 You're no longer doing that research at Harvard.

08:43 You're still near Harvard, but you're working at Microsoft.

08:46 What do you do day to day there?

08:47 Right now, I'm a machine learning scientist on the cloud advocacy team.

08:51 And what I do, I work with a lot of external customers and the Azure machine learning community in general.

09:00 Specifically, I work a lot with universities and research institutions, from students and professors to researchers, on machine learning projects.

09:11 Oh, that sounds really interesting.

09:12 And there's something special about universities.

09:14 I mean, you were at Harvard for a while.

09:16 Just being on campus and being in those environments is really nice.

09:20 So it must be fun to work with those research groups just on a day-to-day basis.

09:25 Absolutely.

09:25 And I think, again, talking about innovation, I think that the real innovation starts most of the time from university, from research units, from what we call a spin-off.

09:38 These are a specific type of startup coming from academia.

09:42 So I really think that having these close relations with university is a key for innovation.

09:49 And when I say innovation, I mean also in terms of machine learning solutions.

09:53 So it's nice because most of the time I leverage their knowledge in terms of AI and machine learning.

10:01 But I'm the one that is more like expert, even if it's hard to use this word, because I think that every time we learn from each other.

10:10 But more or less, I have more experience with the technology that they can use.

10:14 And so that their innovation, their research is not just on paper, but they can actually translate these, transform their research into a product, into a service, into a feature that other people can consume.

10:27 So that's why it's so interesting to work with this type of customer.

10:32 I feel very lucky.

10:33 Yeah, for sure.

10:34 They're so interesting.

10:35 And especially in science and biology and pharmaceuticals, those are not the kind of startups that people often can just go start on their own.

10:43 They often spin out of these labs.

10:46 And that's like a pretty cool area, right?

10:47 Absolutely.

10:49 And again, this is a topic that is similar to my Harvard University research.

10:55 Economists call it a spillover type of effect.

11:00 Like when you build the knowledge and then you decide to transfer this knowledge outside.

11:07 And when I say outside, it can be to a different industry.

11:09 Like from academia, you go to the real industry.

11:12 It could be biotech, tech, or any other industry.

11:15 Or you change.

11:17 You change completely your area.

11:19 Like from economics, you go to machine learning.

11:22 As you said, it's something, it's a real nice phenomenon that you can notice when you are like in a cluster, in a tech and knowledge cluster.

11:30 And yeah, it's a very interesting phenomenon.

11:33 Yeah, I'm sure that it is.

11:35 How interchangeable is the knowledge between the different projects that you're working on?

11:40 Like how much do you have to try to study what they're doing?

11:43 Or how interchangeable is like this machine learning, data science world?

11:48 That is a great question.

11:50 So I think that the difference is in the data and in the business problem, or I would say research problem in this case, that you are trying to solve.

12:01 Everything that is in the middle is actually very similar in the sense that the approach can change or the type of technique that you are using.

12:11 If you have like economics background or a biotech background or a machine learning background can be like a little bit different.

12:20 But more or less, you have this scientific mindset, and you know that there are a few techniques, some very powerful techniques, that you can apply to your data to get an answer that is going to solve your problem.

12:35 Again, if you are talking to a company, of course, it's a business problem.

12:38 In this case, since my customers most of the time are university research institutes, the problem is more like a research problem.

12:45 So it's very, very interesting.

12:47 Of course, those people, as you said, they have most of the time very different backgrounds, but they are experts of the data.

12:55 And they know very well what the output from their research needs to be like.

13:00 What is the type of answer?

13:02 What is the type of problem that they want to solve?

13:04 So my role is really between these two points.

13:08 Like I really help them understand how they can use machine learning and the cloud to build end-to-end AI solutions and apps, most of the time to solve their own specific research problems.

13:20 And sometimes it can be just an optimization of a specific process, for example.

13:25 And what is interesting in my role is that I don't just help them understand the potential of using machine learning, but it's also about collecting feedback.

13:35 Because when I work with them, of course, I always have some type of feedback that I can collect.

13:41 And then I support our machine learning product teams, our AI product teams, to build a new feature, to optimize our services for both machine learning scientists and data scientists.

13:55 This portion of Talk Python to Me is brought to you by Linode.

13:58 Are you looking for hosting that's fast, simple, and incredibly affordable?

14:02 Well, look past that bookstore and check out Linode at talkpython.fm/Linode.

14:07 That's L-I-N-O-D-E.

14:09 Plans start at just $5 a month for a dedicated server with a gig of RAM.

14:13 They have 10 data centers across the globe.

14:16 So no matter where you are or where your users are, there's a data center for you.

14:19 Whether you want to run a Python web app, host a private Git server, or just a file server,

14:24 you'll get native SSDs on all the machines, a newly upgraded 200 gigabit network, 24-7 friendly support,

14:31 even on holidays, and a seven-day money-back guarantee.

14:34 Need a little help with your infrastructure?

14:35 They even offer professional services to help you with architecture, migrations, and more.

14:40 Do you want a dedicated server for free for the next four months?

14:43 Just visit talkpython.fm/Linode.

14:48 Artificial intelligence was something, when I was in college, it was like a small research side of computer science.

14:56 And there were some people working on it.

14:58 But it was always, it felt like one of those technologies that's 30 years away, if ever.

15:04 The type of problems they were trying to solve is, well, let's build a little chat bot.

15:08 And then if a person can chat with the chat bot in IRC, and they don't know that it's a machine, it's artificial intelligence,

15:15 that we're very close, right?

15:17 And that seemed like an interesting research project, but not really practical.

15:23 And now we have machine learning taking stuff just so far along, right?

15:29 We've got self-driving cars.

15:30 We have computer algorithms determining whether folks have cancer.

15:35 All sorts of, like, mind-blowing stuff.

15:38 So I feel like AI has really become real, right?

15:42 It's actually become a thing.

15:44 But also, I think it's a little bit misunderstood in some interesting ways.

15:49 Like, how would you describe artificial intelligence to non-technical people or machine learning even?

15:56 Yeah, that's another great question.

15:58 I think that simply speaking, AI is just about programming computers to make decisions.

16:06 And machine learning, of course, focuses more on making predictions about the future.

16:13 So I always like to explain AI with an example.

16:16 Like, there is a very nice app that I have been using here at Microsoft, and it's called Seeing AI.

16:24 It's a Microsoft research project that really brings together the power of the cloud and AI to deliver this intelligent app

16:33 that is designed to help people who are blind or have low vision.

16:40 And it helps them, like, go through their everyday life.

16:45 It's very nice because with this app, you can just point your phone's camera.

16:49 You can select a channel, and you can hear a description of what the app has recognized around you.

16:56 So, again, this is a very, very good example of AI and how AI can help us also to improve some of our everyday actions, I would say.

17:08 As I said at the beginning, it's really about programming computers to do something, to make some decisions.

17:13 That's really incredible, this Seeing AI app here.

17:18 I'll put a link in the show notes.

17:19 So, you just hold it up and it just says, hey, I see there's a car over there and there's a table with two people sitting at it, something like this?

17:26 Exactly.

17:27 Wow.

17:27 Yeah, it's very, very nice because, again, it's with your phone camera.

17:32 You can select a channel and hear, again, a description of what has been recognized around you.

17:39 So, it's really designed to help you navigate your day.

17:43 That's amazing.

17:44 I'll tell you a real quick story that's kind of wild that happened to me about 10 or 15 years ago.

17:48 And this person could have really used this app.

17:50 I was walking to the grocery store where I lived.

17:53 It was maybe three blocks, four blocks away.

17:55 And I was walking along and this guy comes over or is standing next to me at this light and says, hey, could you help me?

18:01 I said, sure, no problem.

18:03 What do you need?

18:03 He says, could you tell me where the grocery store is?

18:06 And it was clearly visible just across, diagonal across the intersection.

18:09 I said, yeah, it's right there.

18:11 He goes, don't point, man.

18:12 I'm blind.

18:13 I'm like, wait, what?

18:14 You know, it was not at all obvious.

18:16 He's like, could you just help me?

18:17 I just got disoriented.

18:18 Could you just help me walk over there?

18:20 And I took him by the arm.

18:22 We walked over there and he said, thank you very much.

18:24 When we got to the store, he just went off on his own and went shopping.

18:26 I was just amazed how well he could function.

18:29 But it seems like just something like this, he could just hold it up and go, there's a store across the street.

18:33 Oh, there it is.

18:34 I see.

18:35 You know, it would be just amazing.

18:36 And it could really change people's lives.

18:38 That's awesome.

18:38 Absolutely.

18:39 Yeah, I totally agree.

18:40 And there are many, many other examples.

18:42 I really like to talk about this app, the Seeing AI app, because I think that it can have a very good impact on people's lives.

18:53 But I totally agree with you.

18:54 There are so many other examples that are just around you.

18:58 And probably we don't even notice them at this time.

19:00 But they are there, AI apps that just help us with the everyday tasks that we have to perform.

19:08 So, yeah.

19:08 Nice story.

19:09 Very nice story.

19:10 Yeah, thanks.

19:11 So, you're describing artificial intelligence and machine learning.

19:14 I saw a funny joke that said, how can you tell the difference between artificial intelligence and machine learning?

19:20 And it said, if it's written in Python, it's probably machine learning.

19:25 If it's written in PowerPoint as a concept, it's probably AI.

19:30 I saw that joke, of course, as well.

19:33 It's a good one.

19:34 And it's a very, very good one.

19:36 And I have to say, probably it's also true.

19:39 There is some truth there.

19:41 I mean, AI is something that, as we said, in some cases is already around us.

19:47 In some other cases, it's not there yet or it needs to be improved.

19:51 But I still think that it depends a lot on how you define AI and what are your expectations from AI.

19:58 But, yes, that joke, it's very popular.

20:00 And I saw that as well.

20:03 That's a good one.

20:03 Yep.

20:03 Nice.

20:04 All right.

20:04 So, let's talk a little bit about the AI stuff you guys have going on at Microsoft.

20:09 And you work on the machine learning cloud advocacy team.

20:12 So, it seems like a lot of stuff these days that Microsoft is doing, especially in the developer space, is something at Azure or something on the cloud or something like that, right?

20:24 It feels to me like Azure has really become the super big focus.

20:28 And once again, developers have kind of really become a big focus at Microsoft, as opposed to, say, just Windows and Office, for example, as the two key pillars of the company or whatever, right?

20:39 What do you think?

20:40 What's your impression?

20:41 That's totally right.

20:42 And we can feel this both from an external and internal point of view.

20:48 I think that everything probably started with Satya Nadella's first emails to us, to Microsoft employees, because it's very interesting.

20:58 Instead of focusing on the past, he wrote about the future and, in particular, the importance of cloud for Microsoft's growth.

21:08 He also said that our industry does not respect tradition, but only respects innovation.

21:13 So, this means that there was, at that time, and still, there is a strong focus on the cloud, on machine learning, and on AI.

21:22 For example, also from an internal point of view, we noticed that in 2017, Microsoft launched an AI division with more than 5,000 computer scientists.

21:36 That was, of course, that was a huge change.

21:38 And also, of course, there was, like, computer scientists, software engineers, AI developers.

21:43 And at that time, we also launched an intelligent cloud division, which included products such as Server and Azure.

21:52 This was, again, a big change that we noticed both from an internal point of view, but also externally.

21:59 And then, another, I think, big change that we noticed is that Microsoft has done a series of, I would say, smart acquisitions.

22:08 For example, GitHub and LinkedIn.

22:12 And I think that he did that because we really wanted to make AI accessible to developers and to our communities in general.

22:22 So, yes, it's a big change.

22:24 Of course, it didn't start this year.

22:26 It has been going on for a while at Microsoft.

22:29 And I think that we are seeing the results right now.

22:32 There is a big focus on cloud, AI, and machine learning.

22:37 Yeah, it definitely seems like a big shift, but in a positive way.

22:41 I think it's a good move.

22:43 So, maybe tell us about some of the AI stuff that you have going on at Azure.

22:48 I know there's a lot of cool things that you have going on there.

22:53 I know you have some stuff around ML DevOps, for example, right?

22:57 Machine learning DevOps and just doing things like productizing machine learning models

23:02 or turning them into production systems like REST endpoints and so on.

23:06 So, maybe we could talk a little bit about all the stuff you got going on there.

23:09 Yeah, absolutely.

23:10 So, right now we have MLOps or DevOps for machine learning capabilities.

23:16 And this includes Azure DevOps.

23:18 This enables Azure DevOps to be used to manage the entire machine learning lifecycle,

23:24 including, for example, model reproducibility, validation, the deployment part, the retraining.

23:33 And it's very interesting as a data scientist to see this because I have to say the DevOps for machine learning includes a lot of the steps that the data scientists need to perform when they are building an end-to-end machine learning solution,

23:48 model training, model management, deployment, and monitoring.

23:55 I can go on and on.

23:57 So, those pipelines allow for what we call the modularization of these different steps or different phases into smaller steps.

24:09 And also, it provides a mechanism for automating, sharing, and most importantly, reproducing the models.

24:17 And not only models, also the different machine learning assets that you need.

24:20 Again, thanks to this, the whole machine learning workflow becomes much, much easier to perform and to reproduce and to automate.

24:30 So, it's great to see this.
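
To make the idea of splitting a workflow into small, reproducible steps concrete, here is a minimal plain-Python sketch. It is not the Azure DevOps or Azure Machine Learning pipeline API; the step names, the `run_pipeline` helper, and the toy data are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Dict], Dict]]


def prepare(ctx: Dict) -> Dict:
    # Data preparation: drop rows with missing values.
    rows = [(x, y) for x, y in ctx["raw"] if x is not None and y is not None]
    return {**ctx, "rows": rows}


def train(ctx: Dict) -> Dict:
    # Training: least-squares fit of y = slope * x (a line through the origin).
    num = sum(x * y for x, y in ctx["rows"])
    den = sum(x * x for x, _ in ctx["rows"])
    return {**ctx, "slope": num / den}


def evaluate(ctx: Dict) -> Dict:
    # Evaluation: mean absolute error of the fitted line on the prepared rows.
    errors = [abs(ctx["slope"] * x - y) for x, y in ctx["rows"]]
    return {**ctx, "mae": sum(errors) / len(errors)}


def run_pipeline(steps: List[Step], ctx: Dict) -> Tuple[Dict, List[str]]:
    # Run each named step in order and record which steps completed.
    completed = []
    for name, step in steps:
        ctx = step(ctx)
        completed.append(name)
    return ctx, completed


steps = [("prepare", prepare), ("train", train), ("evaluate", evaluate)]
raw = [(1.0, 2.0), (2.0, 4.0), (None, 1.0), (3.0, 6.0)]
result, completed = run_pipeline(steps, {"raw": raw})
print(completed, result["slope"], result["mae"])
```

Because each step only reads from and returns a context dictionary, a single step can be rerun, swapped, or automated without touching the others, which is the property the pipelines described above are after.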

24:31 It sounds like it solves a lot of good problems or problems that need solving.

24:35 Yeah.

24:35 So, what does machine learning in production look like, right?

24:39 Is that like TensorFlow behind a RESTful, some like Flask REST service?

24:44 Or what does it look like for folks who are like just web developers or not doing, you know, data science day-to-day?

24:50 Yeah, that is very interesting because I think that data scientists, they love to focus a lot on the machine learning piece.

24:58 But then, once you decided what is the best model or the best models, because it can be, of course, multiple models that you want to push into production,

25:07 it's also very important to understand how you can deploy the solution and how other people can eventually consume the solution.

25:14 Back to your questions.

25:16 In production, it's a web service.

25:19 I have to say that the first format that it takes is a pickle file because this training run, that is the model training,

25:29 produces a Python serialized object that we call, again, a pickle file.

25:34 And this contains the model and the data preprocessing.

25:38 So, this is very, very important because the following step is to make this a web service.

25:47 And the web service is really a REST API that, again, you can call to just consume the service from whatever environment you prefer.

25:56 Like, it can be, again, an app or it can be like just another platform that your company is already using.

26:02 And you want to use that to consume the results from your machine learning model.
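
The flow described here (a training run writes a pickle containing the model and its preprocessing, and a web service loads it to answer requests) can be sketched with the standard library alone. The artifact layout, the parameter values, and the `score` function below are hypothetical stand-ins, not the Azure Machine Learning scoring interface.

```python
import json
import pickle

# Output of a hypothetical training run: preprocessing and model parameters
# are bundled so they are always deployed together.
artifact = {
    "preprocessing": {"mean": 10.0, "scale": 2.0},
    "model": {"weight": 3.0, "bias": 1.0},
}
model_bytes = pickle.dumps(artifact)  # in practice, written to a .pkl file

# At service start-up the pickle is loaded once.
# Only unpickle data from a trusted source.
loaded = pickle.loads(model_bytes)


def predict(x: float) -> float:
    # Apply the stored preprocessing, then the stored linear model.
    pre = loaded["preprocessing"]
    model = loaded["model"]
    scaled = (x - pre["mean"]) / pre["scale"]
    return model["weight"] * scaled + model["bias"]


def score(request_body: str) -> str:
    # One request: JSON in, JSON out, the way a REST endpoint would handle it.
    inputs = json.loads(request_body)["data"]
    return json.dumps({"predictions": [predict(x) for x in inputs]})


print(score('{"data": [10.0, 12.0]}'))
```

A web framework would call something like `score` from its request handler, so any client that can send JSON over HTTP can consume the model.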

26:07 Yeah, that sounds pretty cool.

26:08 So, maybe I have a REST endpoint and I just upload a picture, like a PNG or something.

26:13 Exactly.

26:14 And it takes it, understands it, feeds it through the pre-trained model.

26:18 So, there's the training side of things, which can be super computational, but then maybe the evaluation decision-making process is really quick, right?

26:28 So, I don't know what response times people typically look for, but I suspect it's much, much, much faster, right?

26:33 It's a question that doesn't really have an answer.

26:36 Like, it really depends on the type of solution that you are building.

26:39 And most importantly, it depends on the type of data, the amount of data that you are using.

26:46 So, there is not a clear answer to that.

26:50 But it's interesting to see that right now there are many different solutions that you can use, actually, to accelerate, I would say, not only the deployment and the consumption process, but also the training process.

27:05 Like, for example, we have right now a new feature.

27:09 It's actually not really new because it was launched last September, September 2018.

27:14 And it's called Automated Machine Learning.

27:17 And it's a feature within Azure Machine Learning Service.

27:20 For those who are not familiar with Automated Machine Learning, this is a process of automating the time-consuming, very iterative, I would say, task of machine learning model development.

27:32 And it allows data scientists and also analysts and developers to build machine learning models with, you know, high scale, efficiency, and also productivity.

27:42 Because you don't need, actually, to manually create all these different models by yourself.

27:48 But Automated Machine Learning actually runs many different models for you.

27:53 And it suggests the best model to you, and also all the hyperparameter tuning is, again, done for you.

27:59 So, it's a new feature that can somehow optimize the machine learning flow.

28:04 But, again, going back to your question about time, when you are in a machine learning context, saying, like, a specific time, it's very hard.
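
The core idea of trying many models and hyperparameter settings and keeping the best one can be shown with a toy search loop. This is only a conceptual sketch under simple assumptions (three hand-written candidate models, a made-up data set, mean absolute error on a validation split); it does not reflect how Azure's Automated Machine Learning works internally.

```python
# Made-up training and validation points, roughly following y = 2x + 1.
train = [(1.0, 3.1), (2.0, 4.9), (3.0, 7.2), (4.0, 8.8)]
valid = [(5.0, 11.0), (6.0, 13.1)]


def fit_mean(data):
    # Baseline: always predict the mean of the training targets.
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y


def fit_knn(data, k):
    # k-nearest-neighbour regression; k is the hyperparameter.
    def predict(x):
        nearest = sorted(data, key=lambda point: abs(point[0] - x))[:k]
        return sum(y for _, y in nearest) / k
    return predict


def fit_line(data):
    # Ordinary least squares for y = slope * x + intercept.
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mean_abs_error(predict, data):
    return sum(abs(predict(x) - y) for x, y in data) / len(data)


# Each entry: model name, fitting function, hyperparameter settings to try.
search_space = [
    ("mean", fit_mean, [{}]),
    ("knn", fit_knn, [{"k": 1}, {"k": 2}, {"k": 3}]),
    ("line", fit_line, [{}]),
]


def auto_select(train_data, valid_data):
    # Fit every candidate and keep the one with the lowest validation error.
    results = []
    for name, fit, grid in search_space:
        for params in grid:
            model = fit(train_data, **params)
            error = mean_abs_error(model, valid_data)
            results.append({"name": name, "params": params, "error": error})
    return min(results, key=lambda r: r["error"]), results


best, results = auto_select(train, valid)
print(best["name"], best["params"], round(best["error"], 3))
```

On this data the straight-line fit wins because the validation points lie beyond the training range, where the mean and nearest-neighbour models cannot extrapolate.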

28:12 Super hard, yeah.

28:14 I see.

28:15 So, that's pretty meta to have machine learning teaching your machine learning algorithm, right?

28:20 Yeah, yeah, yeah.

28:21 Automated Machine Learning, it's very interesting because it's actually, again, was developed by Microsoft Research.

28:28 It's like Seeing AI, the app.

28:31 And then our products team was able, actually, to translate these research pieces into a real feature.

28:38 So, again, it's a sort of a big recommender system for your machine learning pipelines.

28:44 And I've been using this now for a while as a data scientist, for the past few months.

28:50 And I have to say that, really, it's not just about saving you some time as a data scientist, but it's also a sort of check.

28:59 Like, sometimes when you prepare your data or, like, you pick a specific type of models to try on your training data set, you somehow are exposed to biases.

29:10 Because, again, it's a selection process that you, as a human, you have to do.

29:14 While if you have also an external voice somehow, an external suggestion that just gives you a different perspective or different suggestions, again, I think that is a sort of sanity check.

29:26 So, again, it's not just about saving time, but I think it's also about making your process more objective and less subjective.

29:36 Yeah, it sounds like it would be really helpful.

29:38 So, normally, when you're doing training, you have to pick some kind of model.

29:42 You say, I think this kind of model with this many nodes and this type of thing, we're going to set it up.

29:48 And here goes the training data where you know what the outcome should be.

29:52 Like, let's say, housing prices, right?

29:54 Like, one of the interesting problems people tried was, given just the description of a house and a neighborhood, predict the price that it should sell for.

30:02 Like, you could feed all that and say the actual price was this.

30:05 And it corrects itself, right?

30:08 But there's still a lot of decisions on the actual model that you feed it to or the setup you feed it to on top of the training data, right?

30:15 That's totally correct.

30:17 And, again, this is something that most data scientists do manually, in the sense that you start with your raw feature.

30:25 Actually, you start with your raw data, and then you do some data cleaning, data preparation.

30:31 And then it starts what we call the feature engineering.

30:34 But, again, the feature engineering is something that you do based on some assumptions that you have in your mind.

30:41 Because feature engineering is really about creating additional features based on the raw data that you have.

30:48 And some of these additional features, let me give you a real example.

30:52 Like, for example, if we are in a time series forecasting scenario, you have the timestamped column,

30:57 and you can build additional features such as is the holiday or is not holidays?

31:02 This is afternoon or no?

31:04 Is the weekend or no?

31:05 And this can help you making your model more accurate in some cases.

31:09 This is not the case sometimes, but they can help.
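A minimal sketch of the calendar features described here, using only the Python standard library. The holiday set and the 12:00 to 18:00 definition of "afternoon" are assumptions made for illustration, not something stated in the conversation.

```python
from datetime import date, datetime

# Hypothetical holiday calendar; a real project would load one for its region.
HOLIDAYS = {date(2019, 7, 4), date(2019, 12, 25)}


def calendar_features(ts: datetime) -> dict:
    """Derive simple calendar features from one timestamp."""
    return {
        "hour": ts.hour,
        "is_afternoon": 12 <= ts.hour < 18,  # assumed definition of afternoon
        "is_weekend": ts.weekday() >= 5,     # Monday is 0, so 5 and 6 are Sat/Sun
        "is_holiday": ts.date() in HOLIDAYS,
    }


# Two example rows from a timestamp column.
rows = [datetime(2019, 7, 4, 15, 0), datetime(2019, 7, 6, 9, 30)]
features = [calendar_features(ts) for ts in rows]
for ts, feats in zip(rows, features):
    print(ts.isoformat(), feats)
```

Each derived column can then be appended to the raw data before training. Whether it actually improves accuracy has to be checked on a validation set.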

31:12 But, again, also the feature engineering is something that you do before you even know what is going to be the output of your model.

31:19 And then after the feature engineering, of course, you start manually selecting and trying a few approaches that,

31:25 based on your experience and knowledge as a data scientist, have worked pretty well in similar scenarios.

31:31 And after that, of course, usually you need like one single machine learning model.

31:37 And, of course, you have been doing a lot of iteration for the hyperparameter tuning part.

31:43 And then you push everything into production.

31:45 But as you said, it's something that you have to try out manually.

31:50 And there are now some new features like automated machine learning that can help you somehow saving some time.

31:57 Because, of course, they do all the training, not only the training, but they do all the experimentation with different machine learning models for you.

32:05 And also, most importantly, the hyperparameter tuning part.
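The manual loop that automated machine learning is said to replace can be sketched roughly as follows. This is a toy illustration in plain Python, not the Azure service: the data, the two candidate model families (a mean baseline and a k-nearest-neighbors regressor), and the values of k are all assumptions.

```python
import random


def make_data(n, seed):
    """Synthetic 1-D regression data: y = 3x + 2 plus Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 3.0 * x + 2.0 + rng.gauss(0, 1.0)) for x in xs]


train = make_data(80, seed=0)
valid = make_data(40, seed=1)


def fit_mean(data):
    """Baseline: always predict the mean training target."""
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y


def fit_knn(data, k):
    """k-nearest-neighbors regression; k is the hyperparameter being tuned."""
    def predict(x):
        nearest = sorted(data, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k
    return predict


def mean_abs_error(model, data):
    return sum(abs(model(x) - y) for x, y in data) / len(data)


# Search space: every (model family, hyperparameters) pair to try.
candidates = [("mean", fit_mean, {})]
candidates += [("knn", fit_knn, {"k": k}) for k in (1, 3, 5, 9, 15)]

results = []
for name, fit, params in candidates:
    model = fit(train, **params)
    results.append((mean_abs_error(model, valid), name, params))

best_error, best_name, best_params = min(results, key=lambda r: r[0])
print(f"best: {best_name} {best_params} (validation MAE {best_error:.2f})")
```

An automated tool runs the same kind of loop over a much larger space of models and settings, which is where the time saving comes from.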

32:08 Yeah, that sounds like a pretty cool service.

32:10 So if I'm going to go work with like the machine learning SDK over on Azure for what you guys are doing,

32:17 like what libraries are supported, you know, are the standard Python libraries the ones we get to use?

32:23 Or do you have to use like a Microsoft ML one or something like that?

32:26 All the standard Python libraries are supported.

32:30 And then, of course, we have the library for the Azure Machine Learning SDK for Python.

32:38 And this is, again, you have to think about it as a sort of a library, Python library.

32:44 And the most important part of this is that the deployment part is going to be much, much easier.

32:51 Why?

32:52 Because, of course, when you prepare your data and you do all the feature engineering part,

32:58 and then you go to the modeling phase, that part, basically, it's Python.

33:02 It's Python.

33:03 And you can, of course, just use the classical Python libraries.

33:08 But then when you go into the deployment part, I have to say that Azure Machine Learning SDK for Python,

33:14 at least I find it very powerful because you just need to write a couple of functions,

33:19 like what we call the init and run functions.

33:22 And those two functions are going basically to define the model and how the data needs to be,

33:28 the data that you use to feed the model.

33:30 And as soon as you have defined these two functions, then it's very easy just to register your model

33:36 and then create that pickle file that I was mentioning before.
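A rough sketch of what an init and run pair can look like. It deliberately avoids the Azure SDK so that it runs anywhere: the environment variable name, the dictionary used as a "model", and the JSON payload shape are all hypothetical. In a real deployment, the SDK decides where the registered model file lives.

```python
import json
import os
import pickle
import tempfile

model = None  # populated once by init()

# Hypothetical variable telling the script where the pickled model is stored.
MODEL_PATH_ENV = "DEMO_MODEL_PATH"


def init():
    """Called once when the service starts: load the model into memory."""
    global model
    with open(os.environ[MODEL_PATH_ENV], "rb") as f:
        model = pickle.load(f)


def run(raw_data):
    """Called per request: parse the JSON payload and return predictions."""
    try:
        rows = json.loads(raw_data)["data"]
        preds = [model["intercept"] + model["slope"] * x for x in rows]
        return json.dumps({"predictions": preds})
    except Exception as exc:  # report the problem to the caller instead of crashing
        return json.dumps({"error": str(exc)})


# Local smoke test: pickle a toy linear model, then call both functions by hand.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    with open(path, "wb") as f:
        pickle.dump({"slope": 2.0, "intercept": 1.0}, f)
    os.environ[MODEL_PATH_ENV] = path
    init()

response = run(json.dumps({"data": [0, 1, 2.5]}))
print(response)
```

Keeping the load step in init means the model is read from disk once, and run only has to describe how incoming data is fed to it.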

33:40 So, yeah, going back to your question, is Python-based.

33:44 And it's very nice to build and run machine learning workflow with what we call the Azure Machine Learning Service.

33:51 Another thing I saw you all talking about is model interpretability.

33:55 What does that mean?

33:56 Yeah, that's, I would say, is one of my favorite topics.

34:01 Because I think that model interpretability is another topic that is very close to what we call the ethics in AI or biases in AI.

34:16 So, first of all, it's a package, Python-based package within the Azure Machine Learning Python SDK.

34:21 And what it does for you is really make the machine learning models not black boxes anymore for you.

34:30 So, like, for example, you can use classes and methods that are in the SDK.

34:36 And you can get, with this model interpretability package, you can get feature importance values for both raw and engineered features.

34:44 You can get interpretability on real-world data set, both during training and inference deployment.

34:52 So, it's very nice.

34:53 And then the most interesting part is that you can also get interactive visualizations to help you understand your data.

35:01 So, like, for example, let's say that we are in a use case where you want to predict the price of a car.

35:08 You create some additional features, and then you try out a couple of machine learning models, and then you deploy those models or just pick one model and you deploy it.

35:18 But then, most of the time, data scientists, they want really to understand what are the different features that actually affect the accuracy of the model, the performance of your model.

35:27 And this is something that, right now, you can get through the model interpretability package within Azure Machine Learning.
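One generic way to measure which features affect a model's accuracy is permutation importance: shuffle one feature column and see how much the error grows. The sketch below uses that technique on made-up car-price data. It is not the Azure interpretability package, and the pricing rule, feature names, and stand-in model are assumptions.

```python
import random

rng = random.Random(0)

# Toy car-price data: price depends strongly on age, weakly on mileage,
# and not at all on color.
rows = []
for _ in range(200):
    car = {
        "age": rng.uniform(0, 10),
        "mileage": rng.uniform(0, 100),
        "color": rng.choice([0, 1, 2]),
    }
    price = 30000 - 2000 * car["age"] - 20 * car["mileage"]
    rows.append((car, price))


def model(features):
    """Stand-in for a trained model; here it is simply the true pricing rule."""
    return 30000 - 2000 * features["age"] - 20 * features["mileage"]


def mean_abs_error(data):
    return sum(abs(model(x) - y) for x, y in data) / len(data)


def permutation_importance(data, feature):
    """Increase in error after shuffling one feature column across rows."""
    shuffled = [x[feature] for x, _ in data]
    rng.shuffle(shuffled)
    permuted = [({**x, feature: v}, y) for (x, y), v in zip(data, shuffled)]
    return mean_abs_error(permuted) - mean_abs_error(data)


importance = {
    name: permutation_importance(rows, name)
    for name in ("age", "mileage", "color")
}
for name, score in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.1f}")
```

A feature the model ignores, like color here, scores zero because shuffling it changes no prediction.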

35:34 So, I think it's something very, very important.

35:37 Yeah, it's really interesting because one of the big problems with machine learning is it's really good at making decisions,

35:43 but there are certain circumstances where you need to know real concretely why a decision was made, right?

35:51 Like, sometimes it's just to improve the system.

35:55 Like, if a car crashes, like a Tesla self-driving car turns the wrong way and crashes, like they need that to not happen again.

36:02 So, understanding how to fix it is important.

36:04 But other times, it's like legal even, right?

36:06 Like, if you get rejected for a mortgage for your house and you apply for a loan, you get rejected.

36:13 A lot of times, there's laws that say you have to be told why you were rejected or there has to be some visibility into that, right?

36:21 To avoid bias.

36:22 Absolutely.

36:23 That's why I mentioned ethics in AI.

36:26 That is a big topic right now.

36:29 And biases.

36:31 Because, as you said, the first place where biases are created is the data set that you are using.

36:37 So, all the training data that you use most of the time can create biases in your model.

36:43 And, as a result, also in the different outputs that your model is giving you in the final results that, you know, you are looking at.

36:52 And, as you said, having, like, sort of visibility, transparency on why the model gave you that specific output is something that we all need to have.

37:04 At least, we have to have this option.

37:05 It's not, I really believe that it's something that, for sure, is going to be interesting for data scientists.

37:12 Because data scientists, I think, they don't want to use machine learning models as a black box at all.

37:17 They really want to understand why specific processes were done and why we had specific results.

37:24 But it's also, as you said, it's also a topic that should be interesting for everybody.

37:29 Because, again, some of the decisions that are made based on the machine learning models that we build can really affect personal life in a good or negative way.

37:39 Yeah, and you definitely want to have visibility around those things, right?

37:43 Absolutely.

37:43 Human biases can get into the algorithms and then the machines just make biased decisions faster.

37:49 Yeah, yeah, absolutely.

37:50 More systematically, which is not so good.

37:53 No.

37:54 So, if people want to play around with some of this stuff, you guys have the Azure Machine Learning Notebooks, right?

37:59 Is that something that's easy to go play with?

38:00 Yeah, and we have Azure Machine Learning Notebooks.

38:03 This is something new that was actually announced at Build, where we met.

38:08 And these are integrated with Azure Machine Learning Service.

38:12 And they provide a code-first experience for Python developers.

38:16 And so, as a Python developer, you can build and deploy your models in a workspace.

38:22 And also, developers and data scientists can then perform every operation that is supported by the Azure Machine Learning Python SDK.

38:30 So, you don't need to install anything else, anything additional.

38:35 You can just connect with this.

38:36 And it's nice because you also have a pretty good compute target, the virtual machine.

38:41 It's a very easy environment to use if you are a data scientist or a developer and you want to use Python with Azure Machine Learning Service.

38:51 Okay, that sounds really cool.

38:52 Another challenge working with machine learning is you have to feed it a lot of data.

38:57 And you have to have, like, the data somewhat cleaned up, right?

39:01 Hence, Pandas is pretty powerful and interesting there.

39:04 Yeah.

39:04 And so on.

39:04 But you have some open data sets as well.

39:07 You want to tell folks what kind of data they get there?

39:10 That sounds pretty helpful.

39:11 Yeah.

39:11 So, this is a part of some of the new open source capabilities that Azure Machine Learning has.

39:20 And Azure Open Data Set, I would say it's one of my favorites because it's really a sort of repo for a public data set that you can use to add scenario-specific features to your machine learning solution to get more accurate models.

39:35 Like, for example, I was building an energy demand forecasting solution and I already added the load data, load historical data from a public data set.

39:46 But I really wanted to include weather data because, as you know, energy consumption can be strongly dependent on weather.

39:56 And specifically, I wanted to use temperature data.

40:02 So, thanks to Azure Open Data Set, I was able to get this public data set for weather.

40:09 But we have also holidays, public safety data, and also location data.

40:14 So, all these external and public data can really enrich your original data set and can help you to build more accurate models.
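Enriching a load data set with public weather data comes down to a join on the timestamp. A minimal sketch in plain Python follows. All values, units, and the missing hour are invented for illustration and do not come from any real data set.

```python
from datetime import datetime, timedelta

start = datetime(2019, 6, 1, 0)

# Hypothetical hourly load readings (MW), keyed by timestamp.
load = {start + timedelta(hours=h): 100.0 + 5 * h for h in range(4)}

# Hypothetical public temperature readings (deg C); hour 2 is missing on purpose.
temperature = {start + timedelta(hours=h): 18.0 + h for h in (0, 1, 3)}

# Left join: keep every load row and attach a temperature when one exists.
enriched = [
    {"timestamp": ts, "load_mw": mw, "temp_c": temperature.get(ts)}
    for ts, mw in sorted(load.items())
]

for row in enriched:
    print(row)
```

Rows without a matching weather reading keep a temperature of None, so the gap stays visible and can be handled explicitly later.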

40:23 The last thing that I think is very important for these open Azure Open Data Set is that they are on the cloud and they are available to, for example, Azure Databricks, Machine Learning Service, and Machine Learning Studio.

40:37 So, you can really consume them from different tools.

40:41 And you can also access them through APIs and use them in other products like visualization products like Power BI or Azure Data Factory.

40:51 Yeah, it's pretty cool that it's already uploaded and accessible there.

40:55 Because, like, one of the data sets is you have the New York City taxi and limousine trip records.

41:00 Yeah.

41:01 But that's 500 million rows and it's five gigs of data.

41:04 So, it would be better if you didn't have to upload that and try to move that around and whatever, right?

41:09 Just download it or just ingest it and process it.

41:13 Yeah, absolutely.

41:14 And again, I did this for the Energy Demand Forecasting Solution.

41:18 And the pain points for me as a data scientist was really not only to look for a public weather data source,

41:26 but because I found that there are around several weather data that you can use.

41:32 But the problem was, like, really to download them and, most importantly, make the end-to-end solution working.

41:39 Because once you download them, you want to consume them over time.

41:43 And having them, like, on the cloud and the fact that you can consume new data and also forecast data,

41:50 because it was, again, a time series forecasting data.

41:53 So, as an input, I needed the forecast weather data.

41:57 It's something that really, really helped me and made my work as a data scientist much, much easier.

42:03 Yeah, that's really cool.

42:04 So, I know that a lot of the stuff that you all are doing is on the cloud.

42:08 And, right, you have, like, virtual machines and ones with GPUs that you can leverage and services and so on.

42:14 But I'm sure that a lot of data scientists, even who use the cloud, also work locally on their computers, right?

42:21 Absolutely, yep.

42:22 Yeah.

42:22 So, what does a proper data science, machine learning computer look like these days?

42:29 You know, like, you have to get some fancy GPU.

42:32 Like, what do you do for a setup?

42:33 What do you see people using for setups in your interactions with them?

42:37 Yeah.

42:37 So, I have to say, most of the researchers, students, and also big corporations I work with,

42:43 they all use right now data science virtual machines, especially when I talk and I work with other data scientists.

42:51 Because those data science virtual machines, and some of them are also called deep learning virtual machines,

42:58 they come with all the different tools that you need as a data scientist, different cluster, GPUs, as you said, the right size.

43:06 And they are already there, and you can just start working with these.

43:10 And it's true that you mentioned that most of the time, or I would say not most of the time, probably sometimes data scientists,

43:17 they still want to use their local machine, especially probably at the beginning of machine learning projects.

43:23 So, some of the editors that I have been seeing using a lot are like Visual Studio Code, that of course there is a local app that they can just download.

43:35 But other very, very popular editors are like JupyterLab, PyCharm.

43:40 So, there are many options out there.

43:43 And I have to say the most popular ones, again, are virtual machines.

43:47 And then when you have to work on your local machine, you probably use editors such as Visual Studio Code and other tools, again, like PyCharm.

43:56 How often do you see data privacy being a problem or a challenge that people run into?

44:02 Just stuff they don't want to put on the cloud, or they're not allowed to leave the building as it is?

44:08 I see it very often.

44:10 I think that is all about the trust, really.

44:13 Because when you start a new data science project or like a new machine learning project, it's all about getting to know each other.

44:21 So, you work as a data scientist, you work with an external entity.

44:25 Again, it can be a big corporation, it can be a university.

44:28 And of course, at the beginning, when you are what we call the scoping phase,

44:32 so you are defining what is the research problem or the business problem that you want to solve,

44:36 what is the data that you're going to use.

44:38 Of course, over there, you are building trust.

44:42 It's the moment that you are building trust.

44:44 And probably you don't use, they just share with you a sample of the data, and you just share with them like a general demo to show like the potential of different machine learning tools and the cloud.

44:58 But then I think it's the real moment where you are building trust between each other.

45:03 And of course, security in the data space, it's a concern that everybody needs to have.

45:10 I mean, we have to be concerned about data because this means that we are aware of what are the different options out there.

45:17 And that's why I think that VM, virtual machines, most of the time are the best choice when you work also with external customers and with external data scientists,

45:27 because it's a sort of common place where you can just connect using and, of course, performing all the security steps that your customers or, you know,

45:37 your research peers wants you to perform, of course.

45:40 But it's like a sort of common place where you can share all your experiments or data and so on.

45:48 Like a safe, safe middle ground.

45:50 Yes, exactly.

45:50 This is exactly what it means.

45:51 Inside anybody's firewalls and things like that.

45:55 Interesting.

45:56 That's pretty cool.

45:57 So we're getting sort of close to the end of our time.

45:59 And I did want to ask you to do a little bit of a prediction for us because you are seeing so many different projects that are coming along,

46:07 what people are doing and researching.

46:09 So what do you think that machine learning is going to create or do for society or how is it going to change society?

46:17 And like, say, five or 10 years that maybe normal people who don't follow this see coming.

46:22 I'm a very optimistic person.

46:24 So I don't really think that, you know, AI is going to impact our everyday life or our world in a negative way.

46:34 I really think that if we are able to just take advantage of what AI and when I say AI, I mean, also all the machine learning algorithms that are behind AI.

46:44 If we are able to really take advantage of this, we can really see some improvements in the different processes and different industries such as healthcare.

46:55 Yeah, healthcare is the first one that I was thinking of as well.

46:58 Healthcare is something because they are, I think, after the finance industry, they are the industry that actually produce the biggest amount of data.

47:08 And of course, the more data you have, better your models and objective the results are going to be.

47:16 So for sure, if I have to mention a few industries where I see AI being successful in the next few years,

47:23 one, as I mentioned, is healthcare.

47:25 The second is transportations.

47:28 I really think that not only driverless cars are going to improve our everyday life, but also what we call the predictive maintenance,

47:37 that is, you know, predicting when a machine is going to fail, is something that is going to have a huge impact in the transportation field.

47:46 And the third one is, I would say, probably is less interesting, but still it's an industry that we all need to interact with.

47:56 That is the finance industry.

47:57 So the finance industry, why I see the biggest impact there?

48:01 Because actually, they are the ones who have been using machine learning for longer time.

48:06 But I have to say that this interaction between AI and machine learning is something new to them as well.

48:14 So I think that we are going to see interesting solution there as well.

48:18 And probably is going to democratize what sometimes looks and sounds inaccessible to many people who are not experts in economics or trading.

48:29 So I think these are the three main industries where we will see a lot of changes and impact from the AI world.

48:39 Yeah, those are all definitely, I think healthcare for sure.

48:42 There's just so much data and it has the opportunity to be just so transformative, right?

48:47 You think of the cancer diagnosis and the other things or, and, you know, probably pharmaceuticals as well.

48:53 It's kind of tied into that, right?

48:54 Like creating new drugs in a much quicker way.

48:56 Absolutely.

48:57 And if you think about healthcare, beside the two excellent examples that you just made, also there is all the insurance world for healthcare.

49:07 Also that can really help facilitate some of the processes that sometimes patients have to go through.

49:14 And if we can optimize time and make these processes better for, you know, people who need healthcare.

49:19 I think that also is a very, very good improvement.

49:23 Yeah.

49:23 One other I'd love to see is the energy sector around renewable energy and stuff like that.

49:28 There's, it seems like that a lot of good stuff could happen there.

49:30 Totally agree.

49:31 All right.

49:31 So I guess before we call it a show, I'm going to ask you the two final questions I always ask.

49:36 The first one is if you're going to write some Python code, what editor do you use?

49:40 I already mentioned this editor is a Visual Studio Code.

49:43 I have to say that I really like right now the Python extension and the Azure Machine Learning extension.

49:50 The Azure Machine Learning for Visual Studio Code that previously was called Visual Studio Code Tools for AI is an extension that you can use as data scientists to build, train, and also deploy your machine learning models, of course, on the cloud.

50:04 Also on the edge, you can leverage the power of Azure Machine Learning service.

50:11 Another editor that I really like is JupyterLab.

50:15 I think I really like it because it provides a high level of integration between notebooks, documents, and different activities.

50:23 And then the other one that I mentioned, and I think that data scientists should also consider, is PyCharm.

50:29 Because I think that it's similar to Visual Studio Code that they both have interesting features such as a code editor, errors highlighting, and also a powerful debugger with a nice graphical interface.

50:43 They seem to have a slightly different philosophy, Visual Studio Code and PyCharm, but they're both really good.

50:48 Excellent.

50:48 And then, of course, the Jupyter stuff is like standard data scientists.

50:53 You've got to fit that into the workflow somewhere, right?

50:55 Absolutely.

50:56 Awesome.

50:57 All right, and then finally, a notable PyPI package.

51:00 Not necessarily the most popular, but something that folks maybe haven't heard of, but it's really cool that you want to share.

51:05 Again, I'm very passionate about time series forecasting.

51:08 And the package that I have been using a lot lately is the forecasting package with AutoML.

51:15 So for forecasting tasks, automated machine learning uses these preprocessing and estimation steps that are specific to time series data.

51:24 So the preprocessing steps will, for example, detect time series samples frequency, if it is hourly, daily, or weekly.

51:33 And they create new records for missing data for missing time stamps to make the series continuous.

51:42 And then they can also impute missing values in the target and feature columns.

51:46 And they can create grain-based features to enable fixed effects across different series.
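The first three preprocessing steps mentioned here (detecting the sampling frequency, adding records for missing timestamps, and imputing missing values) can be sketched in plain Python. This is not the AutoML implementation. The "most common gap" heuristic and the forward-fill imputation are assumptions chosen to keep the example short.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical daily series with one skipped day (June 3) and one missing value.
series = {
    datetime(2019, 6, 1): 10.0,
    datetime(2019, 6, 2): 12.0,
    datetime(2019, 6, 4): None,
    datetime(2019, 6, 5): 18.0,
}


def infer_frequency(timestamps):
    """Guess the sampling step as the most common gap between neighbors."""
    ordered = sorted(timestamps)
    gaps = Counter(b - a for a, b in zip(ordered, ordered[1:]))
    return gaps.most_common(1)[0][0]


def make_continuous(data, step):
    """Insert a record (value None) for every timestamp the series skips."""
    ts, end = min(data), max(data)
    full = {}
    while ts <= end:
        full[ts] = data.get(ts)
        ts += step
    return full


def forward_fill(data):
    """Impute each missing value with the last observed one."""
    filled, last = {}, None
    for ts in sorted(data):
        if data[ts] is not None:
            last = data[ts]
        filled[ts] = last
    return filled


step = infer_frequency(series)
clean = forward_fill(make_continuous(series, step))
for ts, value in clean.items():
    print(ts.date(), value)
```

After these steps the series has one value per day with no gaps, which most forecasting models expect as input.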

51:52 So this is, again, it's a very nice forecasting package that you can use with the AutoML config object within Azure Machine Learning Service.

52:03 And I have to say, I have been using this a lot.

52:06 Yeah, that's cool.

52:07 Yeah.

52:07 Yeah, that's really cool.

52:08 Their example on the PyPI page is like eight lines of code.

52:12 Yeah.

52:12 It's pretty incredible.

52:13 Yeah.

52:14 Yeah.

52:14 Super cool.

52:15 Okay.

52:15 Well, that's a good one.

52:16 And I hadn't heard of it either.

52:17 So quite nice.

52:18 Yeah, yeah.

52:19 You should try it out.

52:20 Yeah, definitely.

52:21 All right.

52:22 Well, thank you for sharing what you've been up to and all of your experience working with all these other groups, universities and companies and so on.

52:29 It's been great to chat with you.

52:30 Thank you.

52:31 Yeah.

52:31 And people want to get started with some of these things you talked about.

52:34 You know, what's the final call to action?

52:36 Where should they go check out?

52:37 Where should they go?

52:38 There are a few links that I would like to share.

52:41 The first one is aka.ms slash Azure ML service.

52:46 So it's very easy to remember.

52:48 And the other one is aka.ms slash Azure ML for VS Code.

52:54 And if you want to get started, the last link is aka.ms get started Azure ML.

53:00 So these are some of the links that you can look at if you want to learn more about Azure Machine Learning service, Azure Machine Learning for Visual Studio Code.

53:09 And if you want just to get started with Azure in general.

53:13 Yeah, that sounds great.

53:15 Those are good links.

53:15 And I'll be sure to put them in the show notes so people can check them out.

53:18 That's great.

53:18 Thank you.

53:19 Yeah, you bet.

53:19 Thanks for being on the show.

53:20 Thank you.

53:20 Bye-bye.

53:22 This has been another episode of Talk Python to Me.

53:25 Our guest on this episode was Francesca Lazeri.

53:28 And it's been brought to you by Linode and Talk Python Training.

53:31 Linode is your go-to hosting for whatever you're building with Python.

53:35 Get four months free at talkpython.fm/Linode.

53:38 That's L-I-N-O-D-E.

53:40 Want to level up your Python?

53:42 If you're just getting started, try my Python Jumpstart by Building 10 Apps course.

53:48 Or if you're looking for something more advanced, check out our new async course that digs into all the different types of async programming you can do in Python.

53:55 And of course, if you're interested in more than one of these, be sure to check out our Everything Bundle.

54:00 It's like a subscription that never expires.

54:02 Be sure to subscribe to the show.

54:04 Open your favorite podcatcher and search for Python.

54:07 We should be right at the top.

54:08 This is your host, Michael Kennedy.

54:19 Thanks so much for listening.

54:20 I really appreciate it.

54:22 Now get out there and write some Python code.
