#220: Machine Learning in the cloud with Azure ML Transcript
00:00 Michael Kennedy: On this episode you'll meet Francesca Lazzeri, and hear her story how she went from research fellow in economics at the Harvard Business School to working on the AI and data science stack on the Azure team at Microsoft. This is Talk Python to Me, Episode 220, recorded June 20th, 2019. Welcome to Talk Python to Me, a weekly podcast on Python: the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter, where I'm @mkennedy, keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython. This episode is brought to you by Linode and Talk Python Training. Be sure to check out what the offers are for both of these segments. It really helps support the show. Hey, everyone, before we get to the interview I want to quickly tell you about a new course we just launched. It's our first major Flask course, and it's called Building Data-Driven Web Apps in Flask and SQLAlchemy. This one's a deep dive into Flask. We cover things like routing, models, templates, databases, and migrations, and even deployment, and security, and we do all of this in the context of building a clone of the PyPI.org website. Check it out over at training.talkpython.fm. If you're not sure if you want to choose Flask just yet for your web app, then give our #100DaysOfWeb course a look. We cover many frameworks and programming models in 25 four-day projects so you get a super-wide view of what's out there. Then you could pick Flask, or Django, or Pyramid, or something else. Thanks for checking it out. Now let's get to the interview. Francesca, welcome to Talk Python to Me.
01:46 Francesca Lazzeri: Hi, thank you. Thank you for having me.
01:48 Michael Kennedy: It's really great to have you on the show. We got a chance to meet at Microsoft Build, where I was walking around the expo hall floor, and you're doing really cool stuff with AI and Python, and I'm just happy to have you here.
02:00 Francesca Lazzeri: Thank you so much, I'm very excited for being here.
02:03 Michael Kennedy: Yeah, so we're going to dig into this whole machine learning thing that you're working on, and just the industry in general, and machine learning at Microsoft as well, and that's going to be a lot of fun, but before we get to all those things, let's get started with your story. How did you get into programming in Python?
02:18 Francesca Lazzeri: Yeah, that's a great question. I ask myself this question all the time. I have to say, before joining Microsoft as a data scientist I was a research fellow in business economics at Harvard Business School, and over there I was in charge of performing statistical and econometric analysis within the Technology and Operations Management Unit, so at that time, really, my title wasn't data scientist, but, as I said, research fellow, but I had to work with a massive amount of data from different sources such as patent data, publication data, and social network data. The goal of my research was to investigate and measure the impact of external knowledge networks on companies' innovation, and that's why we were using that specific type of data.
03:07 Michael Kennedy: Is that like social media, or what's an external source of data that you might be looking at?
03:12 Francesca Lazzeri: Patent data, publication data, and social network data. These were the three main sources of data, and we were trying to understand whether these knowledge networks were localized from a geographical point of view, or whether they were more like global networks of knowledge, and how these networks, both local and global, could affect a company's innovation. As you may know, for example, Cambridge, Massachusetts, which is where I'm working from, and where my office is at Microsoft Cambridge, is a very powerful tech cluster and also a pharma cluster.
03:53 Michael Kennedy: Okay, well that sounds like a really cool thing to study there, and so you're studying this data. You're studying these projects at Harvard Business School, and of course this is probably not a challenge that can easily be solved with Excel, and definitely not manually, right?
04:06 Francesca Lazzeri: Exactly, so at that time Python was emerging as a leader in data science programming, and while there are still many people in academia who use R, SPSS, for their analysis, Python, I think was getting very, very popular because there were a lot of data science libraries. I remember that one of the most popular, and is still very popular, is pandas, so it has many in-built features such as the ability to read the data from many sources. You can create large data frames from these sources, and also compute aggregate analytics, and this was exactly what I needed as a research fellow, and also for doing all the data analysis that I needed to do.
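As a small illustration of the pandas workflow described here (read data from a source, build a data frame, compute aggregates), a minimal sketch might look like the following. The cluster names, column names, and patent counts are made up for illustration and are not from the research discussed in the episode.

```python
import io

import pandas as pd

# Hypothetical patent records; in practice this would come from a file,
# a database, or another data source.
csv_text = """cluster,year,patents
Cambridge,2012,12
Cambridge,2013,15
Bay Area,2012,20
Bay Area,2013,22
"""

# Read the data into a DataFrame (read_csv accepts paths and file-like objects).
df = pd.read_csv(io.StringIO(csv_text))

# Compute aggregate analytics: total and mean patents per cluster.
summary = df.groupby("cluster")["patents"].agg(["sum", "mean"])
print(summary)
```

The same `groupby` and `agg` pattern applies whether the frame has four rows or several million.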
04:52 Michael Kennedy: Yeah, that's really interesting. What year was this, about?
04:55 Francesca Lazzeri: It was about 2012, 2013, 2014, because then I started to work at Microsoft as a data scientist exactly in 2014, and at that time I remember that R was still very, very strong for both academics and also data scientists, but Python was already there, of course.
05:17 Michael Kennedy: Of course, and the reason I ask is I have this theory that there's a big change, an inflection point around 2012, where a couple of things happened that just made Python and the data science space really start to accelerate out ahead of the other options like R, and it's just super-interesting that it's this time, as well, that you sort of picked it up.
05:39 Francesca Lazzeri: Yeah, totally agree.
05:40 Michael Kennedy: Yeah, and so did you know other programming languages before, or did you have to teach yourself Python to get into this?
05:46 Francesca Lazzeri: I had to teach myself Python. Of course, at university I took a couple of classes, just introduction to Python, and those classes were very helpful. Before that, I was already coding a lot in R. SQL was another very powerful language that I'm still using sometimes just to ingest the data. Sometimes you need to prepare your data, some tables, and you just need to ingest them, and from there you can start using the tools that you prefer to build a machine learning solution. So I have to say, besides these two languages, the third language for me is Python, and it's the main tool that I use, because it's very, very powerful for data science, and I would say data analysis in general, so also for people who need to do a lot of statistics and econometrics types of analysis, Python is a very, very good choice.
06:40 Michael Kennedy: Absolutely. Did you come up with any really interesting results from your research during your time there that you can talk about briefly, or anything like that?
06:46 Francesca Lazzeri: Yeah, yeah, absolutely. I mean, I loved my time there at Harvard. I was working with a great professor and a great research unit, and the results were very interesting. We noticed that for the most important innovations that we were targeting in the biotech industry, localization was very, very important. We noticed that scientists in the biotech industry who were located in the same geographical cluster (there were also some data scientists, but we were looking at the general aggregate of scientists) were actually able to innovate more often, and much faster, and of course we were seeing this in terms of publications, in terms of how they were citing each other, and most importantly in terms of patents, because the only way to really measure knowledge, and how the knowledge can be clustered, is through patents, unfortunately. I mean, I say unfortunately because it's a sort of extra step. You have to make a few assumptions to make this correlation between knowledge and patents, but still it's a good indicator for innovation, so we noticed this. However, we noticed that for long-term innovation, like if you look at a company and you look at the number of patents that the company and its research unit have filed, you can see that geographical barriers are actually not an issue, and right now, especially with the use of technology, innovation and knowledge do not need to be linked from a geographical point of view, so this is what we noticed with our research.
08:31 Michael Kennedy: Yeah, that sounds really interesting, so the short term is more important to be around fellow scientists, but long term the knowledge gets out there. Cool, all right,
08:40 Francesca Lazzeri: Exactly.
08:40 Michael Kennedy: so you're no longer doing that research at Harvard. You're still near Harvard, but you're working at Microsoft. What do you do day-to-day there?
08:47 Francesca Lazzeri: Right now I'm a machine learning scientist on the Cloud Advocacy team, and what I do is work with a lot of external customers, and also with the Azure Machine Learning community in general. Specifically, I work a lot with universities and research institutions, with everyone from students and professors to researchers, on machine learning projects.
09:11 Michael Kennedy: Oh, that sounds really interesting, and there is something special about universities. I mean, you were at Harvard for a while. Just being on campus and being in those environments is really nice, so it must be fun to work with those research groups just on a day-to-day basis.
09:25 Francesca Lazzeri: Absolutely, and I think, again, talking about innovation, that real innovation starts, most of the time, from universities, from research units, from what we call spinoffs. They're a specific type of start-up coming from academia, so I really think that having these close relationships with universities is key for innovation, and when I say innovation I mean also in terms of machine learning solutions. It's nice because most of the time I leverage their knowledge in terms of AI and machine learning, but I'm the one who is more of an expert, even if it's hard to use that word because I think that we all learn from each other every time, but I have more experience with the technology that they can use, so that their innovation, their research, is not just on paper, but they can actually transform their research into a product, into a service, into a feature that other people can consume. That's why it's so interesting to work with this type of customer. I feel very lucky.
10:33 Michael Kennedy: Yeah, for sure. They're so interesting, and especially in like science, and biology, and pharmaceuticals. Those are not the kind of start-ups that people often can just go start on their own. They often spin out of these labs, and that's like a particular area, right?
10:48 Francesca Lazzeri: Absolutely, and again, this is a topic that is similar to my Harvard University research. What we call it, economists call it a spillover type of effect, like when you build a knowledge, and then you decide to transfer this knowledge outside, and when I say outside, it can be to a different industry, like from academia you go to the real industry: could be biotech, tech, or any other industry, or you change. You change completely your area, like from economics you go to machine learning. As I said, it's something, it's a real nice phenomenon that you can notice when you are like in a cluster of tech, a knowledge cluster, and yeah, it's a very interesting phenomenon.
11:33 Michael Kennedy: Yeah, I'm sure that it is. How interchangeable is the knowledge between the different projects that you're working on? Like how much do you have to try to study what they're doing, or how interchangeable is this machine learning, data science world?
11:48 Francesca Lazzeri: This is a great question, so I think that the difference is in the data and in the business problem, or I would say research problem in this case, that you're trying to solve. Everything that is in the middle is actually very similar, in the sense that the approach, or the type of technique that you are using, can be a little bit different if you have an economics background, or a biotech background, or a machine learning background, but more or less you have this scientific mindset: you know that there are a few techniques, some very powerful techniques, that you can apply to your data to get an answer that is going to solve your problem. Again, if you are talking to a company, of course it's a business problem. In this case, since my customers, most of the time, are universities and research institutes, the problem is more of a research problem, so it's very, very interesting. Of course those people, as you said, have very different backgrounds most of the time, but they are experts in their data, and they know very well what the output from their research needs to be: what is the type of answer? What is the type of problem that they want to solve? So my role is really between these two points.
I really help them understand how they can use machine learning and the cloud to build end-to-end AI solutions, and apps, most of the time, to solve their own specific research problems, and sometimes it can be just the optimization of a specific process, for example. What is interesting in my role is that I don't just help them understand the potential of machine learning; it's also about collecting feedback, because when I work with them, of course I always have some type of feedback that I can collect, and then I support our machine learning product teams, our AI product teams, in building new features and optimizing our services for both machine learning scientists and data scientists.
13:55 Michael Kennedy: This portion of Talk Python to Me is brought to you by Linode. Are you looking for hosting that's fast, simple, and incredibly affordable? Well, look past that bookstore and check out Linode at talkpython.fm/linode. That's L-I-N-O-D-E. Plans start at just $5 a month for a dedicated server with a gig of RAM. They have 10 data centers across the globe, so no matter where you are, or where your users are, there's a data center for you. Whether you want to run a Python web app, host a private Git server, or just a file server, you'll get native SSDs on all the machines, a newly upgraded 200-gigabit network, 24/7 friendly support, even on holidays, and a seven-day money-back guarantee. Need a little help with your infrastructure? They even offer professional services to help you with architecture, migrations, and more. Do you want a dedicated server for free for the next four months? Just visit talkpython.fm/linode. Artificial intelligence was something, when I was in college it was like a small research side of computer science, and there were some people working on it, but it was always, it felt like one of those technologies that's 30 years away, if ever. The type of problems they were trying to solve was, well, let's build a little chatbot, and then if a person can chat with the chatbot in IRC, and they don't know that it's a machine, it's artificial intelligence, then we're very close, right? And that seemed like an interesting research project, but not really practical, and now we have machine learning taking stuff just so far along, right? We've got self-driving cars. We have computer algorithms determining whether folks have cancer, all sorts of mind-blowing stuff, so I feel like AI has really become real, right? It's actually become a thing, but also I think it's a little bit misunderstood in some interesting ways. How would you describe artificial intelligence to non-technical people, or machine learning even?
15:56 Francesca Lazzeri: Yeah, that's another great question. I think that, simply speaking, AI is just about programming computers to make decisions, and machine learning, of course, focuses more on making predictions about the future. I always like to explain AI with an example: there is a very nice app that I have been using here at Microsoft, and it's called Seeing AI. It's a Microsoft research project that really brings together the power of the cloud and AI to deliver not just an intelligent app, but one that is designed to help people who are blind or have low vision go through their everyday life. It's very nice because with this app you can just point your phone's camera, select a channel, and hear a description of what the app has recognized around you, so again, this is a very, very good example of AI and how AI can help us, also, to improve some of our everyday actions, I would say. As I said at the beginning, it's really about programming computers to do something, to make some decisions.
17:14 Michael Kennedy: That's really incredible, this Seeing AI app here. I'll put a link in the show notes, so you just hold it up, and it just says, "Hey, I see there's a car over there, and there's a table with two people sitting at it." Something like this?
17:27 Francesca Lazzeri: Exactly.
17:27 Michael Kennedy: Wow.
17:27 Francesca Lazzeri: Yeah, it's very, very nice because, again, it's with your phone camera you can select a channel and hear, again, a description of what has been recognized around you, so it's really designed to help you navigate your day.
17:43 Michael Kennedy: That's amazing. I'll tell you a real quick story that's kind of wild that happened to me about 10 or 15 years ago, and this person could have really used this app. I was walking to the grocery store where I lived. It was maybe three blocks, four blocks away, and I was walking along, and this guy comes over, or is standing next to me at this light, and says, "Hey, could you help me?" I said, "Sure, no problem, what do you need?" He says, "Could you tell me where the grocery store is?" and it was clearly visible just diagonally across the intersection. I said, "Yeah, it's right there." He goes, "Don't point, man, I'm blind." I went, "Wait, what?" It was not at all obvious. He's like, "Could you just help me? I just got disoriented. Could you just help me walk over there?" and I took him by the arm, and we walked over there, and he said, "Thank you very much." When we got to the store he just went off on his own and went shopping. I was just amazed how well he could function, but it seems like with something like this, he could just hold it up and go, "There's a store across the street." "Oh, there it is, I see." It would be just amazing, and it could really change people's lives. That's awesome.
18:39 Francesca Lazzeri: Absolutely, yeah, I totally agree, and there are many, many other examples. I really like to talk about this app, the Seeing AI app, because I think that it can have a very good impact on people's lives, but I totally agree with you: There are so many other examples that are just around us, and probably we don't even notice them at this time, but they are there, AI apps that just help us with the everyday tasks that we have to perform, so yeah. Nice story, very nice story.
19:10 Michael Kennedy: Yeah, thanks, so you were describing artificial intelligence and machine learning. I saw a funny joke that said, "How can you tell the difference between artificial intelligence and machine learning?" And it said, "If it's written in Python, it's probably machine learning. If it's written in PowerPoint, as a concept, it's probably AI."
19:30 Francesca Lazzeri: I saw that joke, of course, as well,
19:33 Michael Kennedy: It's a good one, right?
19:34 Francesca Lazzeri: It's a very, very good one, and I have to say, it's probably also true. There is some truth there. I mean, AI is something that, as we said, in some cases is already around us, and in some other cases it's not there yet, or it needs to be improved, but I still think that it depends a lot on how you define AI, and what your expectations of AI are, but yes, that joke is very popular, and I saw that as well.
20:02 Michael Kennedy: It's a good one. Nice. All right, so let's talk a little bit about the AI stuff you guys have going on at Microsoft, and you work on the machine learning Cloud Advocacy team, so it seems like a lot of stuff these days that Microsoft is doing, especially in the developer space, is something at Azure, or something on the cloud, or something like that, right? It feels to me like Azure has really become the super-big focus, and once again, developers have kind of really become a big focus at Microsoft, as opposed to, say, just Windows and Office, for example, as the two key pillars of the company, or whatever, right? What do you think? What's your impression?
20:41 Francesca Lazzeri: That's totally right, and we can feel this both from an external and an internal point of view. I think that everything probably started with Satya Nadella's first e-mail to Microsoft employees, because it's very interesting. Instead of focusing on the past, he wrote about the future, and in particular the importance of cloud for Microsoft's growth. He also said that our industry does not respect tradition, but only respects innovation, so this means that there was at that time, and there still is, a strong focus on the cloud, on machine learning, and on AI. For example, also from an internal point of view, we noticed that in 2017 Microsoft launched an AI division with more than 5,000 computer scientists. That was a huge change, and of course there were computer scientists, software engineers, AI developers, and at that time we also launched an Intelligent Cloud division, which included products such as Server and Azure. This was, again, a big change that we noticed, both from an internal point of view and externally, and then another big change, I think, is that Microsoft has made a series of, I would say, smart acquisitions, for example GitHub and LinkedIn, and I think that Microsoft did that because we really wanted to make AI accessible to developers and to our communities in general, so yes, it's a big change. Of course, it didn't start this year. It has been going on for a while at Microsoft, and I think that we are seeing the results right now. There is a big focus on cloud, AI, and machine learning.
22:38 Michael Kennedy: Yeah, it definitely seems like a big shift, but in a positive way. I think it's a good move, so maybe tell us about some of the AI stuff that you have going on in Azure. I know there's a lot of cool things that you have going on there. I know you have some stuff around ML DevOps, for example, right? Machine learning DevOps, and just doing things like productizing machine learning models, or turning them into production systems like REST endpoints and so on, so maybe you could talk a little bit about all the stuff you got going on there.
23:09 Francesca Lazzeri: Yeah, absolutely, so right now we have MLOps, or DevOps for machine learning, capabilities, and this includes Azure DevOps. This enables Azure DevOps to be used to manage the entire machine learning life cycle, including, for example, model reproducibility, validation, deployment, and retraining, and it's very interesting, as a data scientist, to see this, because I have to say DevOps for machine learning includes a lot of the steps that data scientists need to perform when they are building an end-to-end machine learning solution, such as data preparation, experimentation, model training, model management, deployment, and monitoring. I can go on and on. Those pipelines allow for what we call the modularization of these different phases into smaller steps, and they also provide a mechanism for automating, sharing, and most importantly reproducing the models, and not only the models, but also the different machine learning assets that you need. Again, thanks to this, the whole machine learning workflow becomes much, much easier to perform, to reproduce, and to automate, so it's great to see this.
24:32 Michael Kennedy: It sounds like there are also a lot of good problems there, or problems that need solving, so what does machine learning in production look like, right? Is that like TensorFlow behind some RESTful, like Flask, REST service, or what does it look like for folks who are just web developers or not doing data science day-to-day?
24:50 Francesca Lazzeri: Yeah, that is very interesting because I think that data scientists still have to focus a lot on the machine learning piece, but then, once you've decided what the best model is, or the best models, because there can, of course, be multiple models that you want to push into production, it's also very important to understand how you can deploy the solution, and how other people can eventually consume it. Back to your question: in production, it's a web service. I have to say that the first format it takes is a pickle file, because the training run, that is, the model training, produces a serialized Python object that we call, again, a pickle file, and this contains the model and the data preprocessing. This is very, very important because the following step is to turn this into a web service, and a web service really means there is a REST API that, again, you can call to consume the service from whatever environment you prefer. It can be, again, an app, or it can be just another platform that your company's already using, and you just want to use that to consume the results from your machine learning model.
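To make the "pickle file" idea concrete, here is a minimal sketch using only Python's standard `pickle` module. The "model" is a made-up bundle of preprocessing parameters and linear coefficients, chosen for illustration; a real training run would serialize a fitted model object instead.

```python
import pickle

# A toy "trained model": preprocessing parameters plus learned coefficients,
# stored as plain Python data so the bundle pickles cleanly.
trained = {
    "preprocessing": {"mean": 50.0, "std": 10.0},
    "coefficients": {"slope": 2.0, "intercept": 1.0},
}


def predict(bundle, x):
    """Standardize x with the stored preprocessing, then apply the linear model."""
    prep = bundle["preprocessing"]
    coef = bundle["coefficients"]
    z = (x - prep["mean"]) / prep["std"]
    return coef["slope"] * z + coef["intercept"]


# Serialize to bytes, as a training run would when writing a .pkl file.
blob = pickle.dumps(trained)

# Later (for example inside a web service) load it back and use it.
restored = pickle.loads(blob)
print(predict(restored, 60.0))
```

Keeping the preprocessing in the same serialized object as the model means the service applies exactly the transformation that was used during training. Note that unpickling executes code from the file, so pickle files should only be loaded from trusted sources.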
26:07 Michael Kennedy: Yeah, that sounds pretty cool, so maybe I have a REST endpoint, and I just upload a picture, like a PNG or something, and it takes it, understands it, feeds it through the pretrained model, so there's the training side of things, which can be super-computational, but then maybe the evaluation, decision-making process is really quick, right? So what response times do people typically look for? I suspect it's much, much, much faster, right?
26:33 Francesca Lazzeri: It's a question that doesn't really have an answer. It really depends on the type of solution that you're building, and most importantly it depends on the type of data and the amount of data that you're using, so there is not a clear answer to that, but it's interesting to see that right now there are many different solutions that you can use to accelerate, I would say, not only the deployment and consumption process, but also the training process. For example, we have, right now, a new feature. It's actually not really new, because it was launched last September, September 2018, and it's called automated machine learning, and it's a feature within the Azure Machine Learning service. For those who are not familiar with automated machine learning, this is a process of automating the time-consuming, very iterative, I would say, task of machine learning model development, and it allows data scientists, and also analysts and developers, to build machine learning models with high scale, efficiency, and productivity, because you don't actually need to manually create all these different models by yourself. Automated machine learning runs many different models for you and suggests the best model, and all the hyperparameter tuning is, again, done for you, so it's a new feature that can somehow optimize the machine learning flow, but again, going back to your question about time, when you are in a machine learning context, giving a specific time is very hard.
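The core loop being described (try several models and hyperparameter settings, score each one, suggest the best) can be sketched in a few lines of plain Python. This is only a toy illustration of the idea, not how the Azure feature is implemented; the data, the two candidate model types, and the `l2` hyperparameter grid are all assumptions made up for the example.

```python
# Tiny toy data set: y is roughly 2*x + 1.
train = [(0, 1.0), (1, 3.1), (2, 4.9), (3, 7.2)]
valid = [(4, 9.0), (5, 11.1)]


def fit_mean(data):
    """Baseline: always predict the mean of the training targets."""
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y


def fit_linear(data, l2=0.0):
    """1-D least squares with an optional ridge-style penalty on the slope."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxy = sum((x - mx) * (y - my) for x, y in data)
    sxx = sum((x - mx) ** 2 for x, _ in data)
    slope = sxy / (sxx + l2)
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def mse(model, data):
    """Mean squared error of a fitted model on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


# Search space: each entry is (name, fit function, hyperparameters).
candidates = [("mean", fit_mean, {})]
candidates += [("linear", fit_linear, {"l2": l2}) for l2 in (0.0, 1.0, 10.0)]

# Train every candidate and score it on held-out validation data.
results = []
for name, fit, params in candidates:
    model = fit(train, **params)
    results.append((mse(model, valid), name, params))

# Suggest the candidate with the lowest validation error.
best_score, best_name, best_params = min(results, key=lambda r: r[0])
print(best_name, best_params, round(best_score, 4))
```

Scoring on a held-out validation set rather than on the training data is what keeps the "suggestion" from simply rewarding the most flexible model.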
28:12 Michael Kennedy: Super-hard, yeah, I see, so that's pretty meta, to have machine learning teaching your machine learning algorithm, right?
28:21 Francesca Lazzeri: Yeah, yeah, yeah. Automated machine learning is very interesting because it was actually, again, developed by Microsoft Research, like Seeing AI, the app, and then our product team was able to translate this research piece into a real feature, so again, it's a sort of big recommender system for your machine learning pipelines. I've been using this now as a data scientist for the past few months, and I have to say that, really, it's not just about saving you some time as a data scientist, but it's also a sort of check. Sometimes when you prepare your data, or you pick a specific type of model to try on your training data set, you are somehow exposed to biases because, again, it's a selection process that you, as a human, have to do, while if you also have an external voice somehow, an external suggestion, that just gives you a different perspective or different suggestions, I think that is a sort of sanity check, so again, it's not just about saving time, but I think it's also about making your process more objective and less subjective.
29:36 Michael Kennedy: Yeah, it sounds like it would be really helpful, so normally, when you're doing training, you have to pick some kind of model. You say I think this kind of model, with this many nodes, and this type of thing, we're going to set it up, and here goes the training data, where you know what the outcome should be. Let's say, housing prices, right? Like one of the interesting problems people tried was, given just the description of a house in a neighborhood, predict the price that it should sell for. You could feed all that in, say the actual price was this, and it corrects itself, right? But there's still a lot of decisions on the actual model that you feed it to, or the setup you feed it to on top of the training data, right?
30:15 Francesca Lazzeri: That's totally correct, and again, this is something that most data scientists do manually, in the sense that you start with your raw data, and then you do some data cleaning, data preparation, and then starts what we call the feature engineering, but again, the feature engineering is something that you do based on some assumptions that you have in your mind, because feature engineering is really about creating additional features based on the raw data that you have. Let me give you a real example of these additional features: if we are in a time series forecasting scenario, you have the time stamp column, and you can build additional features such as whether it's a holiday or not, whether it's the afternoon or not, whether it's a weekend or not, and this can help make your model more accurate in some cases. Sometimes it doesn't, but it can help. But again, the feature engineering is also something that you do before you even know what the output of your model is going to be, and then, after the feature engineering, of course, you start manually selecting and trying a few approaches that, based on your experience and knowledge as a data scientist, have worked pretty well in similar scenarios, and after that, of course, you usually end up with one single machine learning model, and, of course, you have been doing a lot of iteration for the hyperparameter tuning part, and then you push everything to production. But as you said, it's something that you have to try out manually, and there are now some new features like automated machine learning that can help you save some time because, of course, they do not only the training, but all the experimentation with different machine learning models for you, and also, most importantly, the hyperparameter tuning part.
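The time stamp features mentioned here (holiday or not, afternoon or not, weekend or not) are easy to sketch with Python's standard `datetime` module. The holiday set and the definition of "afternoon" as 12:00 to 18:00 are assumptions chosen for illustration.

```python
from datetime import date, datetime

# Hypothetical holiday calendar; a real project would load this from a proper source.
HOLIDAYS = {date(2019, 7, 4), date(2019, 12, 25)}


def time_features(ts: datetime) -> dict:
    """Derive calendar features from a single time stamp."""
    return {
        "hour": ts.hour,
        "day_of_week": ts.weekday(),           # Monday=0 ... Sunday=6
        "is_weekend": ts.weekday() >= 5,       # Saturday or Sunday
        "is_afternoon": 12 <= ts.hour < 18,    # assumed definition of "afternoon"
        "is_holiday": ts.date() in HOLIDAYS,
    }


rows = [datetime(2019, 7, 4, 14, 30), datetime(2019, 7, 6, 9, 0)]
for ts in rows:
    print(ts.isoformat(), time_features(ts))
```

Each derived column becomes an extra input to the forecasting model; whether it actually improves accuracy has to be checked on held-out data, as noted above.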
32:08 Michael Kennedy: Yeah, that sounds like a pretty cool service, so if I'm going to go work with like the Machine Learning SDK over on Azure for what you guys are doing, what libraries are supported? Are the standard Python libraries the ones that get used, or do you have to use like a Microsoft ML one, or something like that?
32:27 Francesca Lazzeri: All the standard Python libraries are supported, and then, of course, we have the Azure Machine Learning SDK for Python, and again, you have to think about it as a sort of Python library, and the most important part of this is that the deployment part is going to be much, much easier. Why? Because of course when you prepare your data, and you do all the feature engineering part, and then you go to the modeling phase, that part, basically, it's Python. It's Python, and you can, of course, just use classical Python libraries, but then, when you go into the deployment part, I have to say that I find the Azure Machine Learning SDK for Python very powerful because you just need to write a couple of functions, what we call the init and run functions, and those two functions are going, basically, to define the model and the data that you use to feed the model, and as soon as you have defined these two functions, then it's very easy just to register your model, and then create that pickle file that I was mentioning before. So yeah, going back to your question, it's Python based, and it's very nice to build and run machine learning workflows with what we call the Azure Machine Learning service.
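As a rough sketch of the init and run pattern mentioned above: one function loads the pickled model once, and the other scores each incoming request. The example below runs locally with the standard library only and does not call the Azure Machine Learning SDK. The toy dictionary "model", the optional `model_path` argument, the `MODEL_PATH` environment variable, and the JSON request shape are all assumptions made for illustration.

```python
import json
import os
import pickle
import tempfile

model = None  # populated once by init()


def init(model_path=None):
    """Load the pickled model a single time, when the service starts."""
    global model
    # MODEL_PATH is a hypothetical environment variable used only in this sketch.
    path = model_path or os.environ["MODEL_PATH"]
    with open(path, "rb") as f:
        model = pickle.load(f)


def run(raw_data):
    """Score one JSON request shaped like {"data": [[x1, x2], ...]}."""
    rows = json.loads(raw_data)["data"]
    predictions = [
        model["intercept"] + sum(w * x for w, x in zip(model["weights"], row))
        for row in rows
    ]
    return json.dumps({"predictions": predictions})


# Local smoke test with a toy linear model stored as a plain dict.
toy_model = {"intercept": 1.0, "weights": [2.0, 3.0]}
with tempfile.TemporaryDirectory() as tmp:
    pickle_path = os.path.join(tmp, "model.pkl")
    with open(pickle_path, "wb") as f:
        pickle.dump(toy_model, f)
    init(pickle_path)

result = run(json.dumps({"data": [[1.0, 1.0], [0.0, 2.0]]}))
print(result)
```

In a deployed service the platform, not your own code, would call these two functions, and the model path would come from the service rather than a temporary directory.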
33:51 Michael Kennedy: Another thing I saw you all talking about is model interpretability. What does that mean?
33:56 Francesca Lazzeri: Yeah, I would say it's one of my favorite topics because I think that model interpretability is another topic that is very close to what we call ethics in AI, or biases in AI. So first of all it's a package, a Python-based package within the Azure Machine Learning Python SDK, and what it does for you is really make the machine learning models not black boxes anymore. So for example, you can use classes and methods that are in the SDK, and with this model interpretability package you can get feature importance values for both raw and engineered features. You can get interpretability on real-world data sets, both during training and inference, so it's very nice, and then, the most interesting part is that you can also get interactive visualizations to help you understand your data. So for example, let's say that we are in a use case where you want to predict the price of a car. You create some additional features, and then you try out a couple of machine learning models, and then you deploy those models, or just pick one model and deploy it, but then, most of the time data scientists really want to understand which features actually affected the accuracy of the model, the performance of the model, and this is something that, right now, you can get through the model interpretability package within Azure Machine Learning, so I think it's something very, very important.
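To make "feature importance values" a little more concrete, here is a minimal sketch of one generic technique, permutation importance, applied to a made-up car price model. It is not the Azure Machine Learning package discussed above, and the `model_predict` formula, the three features, and the synthetic data are all invented for illustration. The idea is to shuffle one feature column at a time and measure how much the model's error grows.

```python
import random


def model_predict(row):
    """Toy car price model: uses age and mileage, ignores the color code."""
    age, mileage, color = row
    return 30000 - 1500 * age - 0.05 * mileage


def mse(rows, targets):
    """Mean squared error of the toy model on the given rows."""
    return sum((model_predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(rows, targets, n_features, seed=0):
    """Increase in error after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = mse(rows, targets)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Rebuild each row with only feature j replaced by a shuffled value.
        shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(mse(shuffled, targets) - baseline)
    return importances


# Synthetic data: (age in years, mileage, color code).
data_rng = random.Random(42)
rows = [
    (data_rng.randint(0, 15), data_rng.randint(0, 200000), data_rng.randint(0, 5))
    for _ in range(200)
]
targets = [model_predict(r) for r in rows]

importances = permutation_importance(rows, targets, n_features=3)
for name, value in zip(["age", "mileage", "color"], importances):
    print(name, round(value, 2))
```

Because the toy model never looks at the color code, shuffling that column leaves the error unchanged, while shuffling age or mileage makes the predictions worse.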
35:37 Michael Kennedy: Yeah, it's really interesting because one of the big problems with machine learning is it's really good at making decisions, but there are certain circumstances where you need to know real concretely why a decision was made, right? Like sometimes it's just to improve the system, like if a car crashes, like a Tesla self-driving car turns the wrong way and crashes, they need that to not happen again, so understanding how to fix it is important, but other times it's legal, even, right? Like if you get rejected for a mortgage for your house, and you apply for a loan, you get rejected, a lot of times there's laws that say you have to be told why you were rejected, or there has to be some visibility into that, right, to avoid bias.
36:22 Francesca Lazzeri: Absolutely, that's why I mentioned ethics in AI. That is a big topic right now, and bias is too, because, as you said, the first place where bias is created is the data set that you are using, so all the training data that you use most of the time can create biases in your model, and as a result also in the different outputs that your model is giving you, in the final results that you are looking at, and as you said, having a sort of visibility, transparency on why the model gave you that specific output is something that we all need to have, at least we have to have this option. I really believe that it's something that, for sure, is going to be interesting for data scientists, because data scientists, I think, don't want to use machine learning models as a black box at all. They really want to understand why specific processes were done, and why we had specific results, but, as you said, it's also a topic that should be interesting for everybody because again, some of the decisions that are made based on the machine learning models that we build can really affect personal life in a good or negative way.
37:39 Michael Kennedy: Yeah, and you definitely want to have visibility around those things, right?
37:43 Francesca Lazzeri: Absolutely.
37:44 Michael Kennedy: Human biases can get into the algorithms, and then the machines just make biased decisions faster.
37:49 Francesca Lazzeri: Yeah, yeah, absolutely.
37:52 Michael Kennedy: More systematically,
37:52 Francesca Lazzeri: Yes.
37:53 Michael Kennedy: which is not so good.
37:53 Francesca Lazzeri: No.
37:54 Michael Kennedy: So if people want to play around with some of this stuff, you guys have the Azure Machine Learning Notebooks, right? Is that something that's easy to go play with?
38:01 Francesca Lazzeri: Yeah, we have Azure Machine Learning Notebooks. This is something new that was actually announced at Build, where we met, and these are integrated with Azure Machine Learning service, and they provide a code-first experience for Python developers. As a Python developer you can build and deploy your models in a workspace, and developers and data scientists can then perform every operation that is supported by the Azure Machine Learning Python SDK, so you don't need to install anything additional. You can just connect with these, and it's nice because you also have a pretty good compute target, the virtual machine. It's a very easy environment to use if you are a data scientist or a developer and you want to use Python with Azure Machine Learning service.
38:51 Michael Kennedy: Okay, that sounds really cool. Another challenge working with machine learning is you have to feed it a lot of data, and you have to have the data somewhat cleaned up, right? Hence pandas, it's pretty powerful and interesting there, and so on, but you have some open data sets as well. You want to tell folks what kind of data they can get there? That sounds pretty helpful.
39:11 Francesca Lazzeri: Yeah, so this is part of some of the new open source capabilities that Azure Machine Learning has, and Azure Open Datasets, I would say, is one of my favorites because it's really a sort of repo of public data sets that you can use to add scenario-specific features to your machine learning solutions to get more accurate models. For example, I was building an energy demand forecasting solution, and I had already added the historical load data from a public data set, but I really wanted to include weather data because, as you know, energy consumption can be strongly dependent on weather.
39:57 Michael Kennedy: Right, if it's hot or cold, yeah.
39:59 Francesca Lazzeri: Yes, specifically I wanted to use temperature data, so thanks to Azure Open Datasets I was able to get at this public data set for weather, but we also have holidays, public safety data, and also location data, so all this external public data can really enrich your original data set, and can help you to build more accurate models. The last thing that I think is very important about these Azure Open Datasets is that they are on the cloud, and that they are available to, for example, Azure Databricks, Machine Learning service, and Machine Learning Studio, so you can really consume them from different tools, and you can also access them through APIs, and use them in other products, like visualization products such as Power BI, or Azure Data Factory.
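The enrichment step described above, adding weather to historical load data, amounts to a join on the time stamp. The sketch below shows that idea with plain dictionaries. The readings, the column names, and the `enrich` helper are made up for illustration, and no Azure Open Datasets API is called.

```python
# Hypothetical hourly readings keyed by ISO time stamp (values are invented).
load_mw = {
    "2019-06-01T00:00": 310.0,
    "2019-06-01T01:00": 295.5,
    "2019-06-01T02:00": 288.0,
}
temperature_c = {
    "2019-06-01T00:00": 18.2,
    "2019-06-01T01:00": 17.6,
}


def enrich(load_by_ts, temp_by_ts):
    """Left join: keep every load row, attach temperature where available."""
    return [
        {
            "timestamp": ts,
            "load_mw": load_by_ts[ts],
            # None marks hours with no matching weather reading.
            "temperature_c": temp_by_ts.get(ts),
        }
        for ts in sorted(load_by_ts)
    ]


enriched = enrich(load_mw, temperature_c)
for row in enriched:
    print(row)
```

Keeping every load row, even when the weather reading is missing, leaves the decision about how to fill that gap to a later, explicit step.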
40:52 Michael Kennedy: Yeah, and it's pretty cool that it's already uploaded and accessible there because one of the data sets is you have the New York City taxi and limousine trip records, but that's 500 million rows, and it's five gigs of data, so it would be better if you didn't have to upload that, and try to move that around, and whatever, right? Just download it and, or just ingest it and process it.
41:13 Francesca Lazzeri: Yeah, absolutely, and again, I did this for the energy demand forecasting solution, and the pain point for me, as a data scientist, was not really finding a public weather data source, because I found that there are several weather data sets around that you can use. The problem was really downloading them and, most importantly, making the end-to-end solution work, because once you download them you want to consume them over time. Having them on the cloud, and the fact that you can consume new data, and also forecast data, because it was, again, time series forecasting, so as an input I needed forecast weather data, is something that really, really helped me, and made my work, as a data scientist, much, much easier.
42:03 Michael Kennedy: Yeah, that's really cool, so I know that a lot of the stuff that you all are doing is on the cloud, right? You have like virtual machines, and ones with GPUs that you can leverage, and services, and so on, but I'm sure that a lot of data scientists, even ones who use the cloud, also work locally on their computers, right?
42:21 Francesca Lazzeri: Absolutely, yeah.
42:22 Michael Kennedy: Yeah, so what does a proper data science machine learning computer look like these days? You have to get some fancy GPU? What do you do for a setup? What do you see people using for setups in your interactions with them?
42:37 Francesca Lazzeri: Yeah, so I have to say most of the researchers, students, and also big corporations I work with all use, right now, Data Science Virtual Machines, especially when I talk and work with other data scientists, because those Data Science Virtual Machines, and some of them are also called Deep Learning Virtual Machines, come with all the different tools that you need as a data scientist: different class storage GPUs, as you said, the right size, and they are already there, and you can just start working with these. It's true, as you mentioned, that sometimes, I would not say most of the time, data scientists still want to use their local machine, especially at the beginning of machine learning projects, so one of the editors that I have been seeing used a lot is Visual Studio Code. That, of course, is a local app that they can just download, but other very, very popular editors are JupyterLab and PyCharm, so there are many options out there, and I have to say the most popular option, again, is virtual machines, and then, when you have to work on your local machine you'll probably use an editor such as Visual Studio Code, or other tools, again, like PyCharm.
43:56 Michael Kennedy: How often do you see data privacy being a problem, or a challenge that people run into, just the stuff they don't want to put on the cloud, or they're not allowed to leave the building as it is?
44:08 Francesca Lazzeri: I see it very often. I think that it is all about trust, really, because when you start a new data science project, or a new machine learning project, it's all about getting to know each other, so, as a data scientist, you work with an external entity. Again, it can be a big corporation. It could be a university, and of course at the beginning, when you are in what we call the scoping phase, you are defining what the research problem or the business problem is that you want to solve, and what data you're going to use. Of course, that is the moment when you are building trust, and probably they just share with you a sample of the data, and you just share with them a general demo to show the potential of different machine learning tools in the cloud, but I think it's the real moment where you are building trust between each other, and of course security in the data space is a concern that everybody needs to have. I mean, we have to be concerned about data because this means that we are aware of what the different options out there are. And that's why I think that VMs, virtual machines, most of the time are the best choice when you also work with external customers and with external data scientists, because it's a sort of common place where you can just connect, of course performing all the security steps that your customers, or your research peers, want you to perform, but it's a sort of common place where you can share all your experiments, or data, and so on.
45:48 Michael Kennedy: Yeah, like I say, safe middle ground.
45:50 Francesca Lazzeri: Yes, exactly, this is exactly what I mean.
45:51 Michael Kennedy: You're not inside anybody's firewalls and things like that. Interesting, that's pretty cool, so we're getting sort of close to the end of our time, and I did want to ask you to do a little bit of a prediction for us 'cause you are seeing so many different projects that are coming along, what people are doing and researching. So what do you think that machine learning is going to create or do for society, or how is it going to change society in, say, five or 10 years, in ways that maybe normal people who don't follow this don't see coming?
46:22 Francesca Lazzeri: I'm a very optimistic person, so I don't really think that AI is going to impact our everyday life, or our world, in a negative way. I really think that if we are able to take advantage of AI, and when I say AI I also mean all the machine learning algorithms that are behind AI, if we are able to really take advantage of these, we can really see some improvements in different processes and different industries such as health care.
46:55 Michael Kennedy: Yeah, health care is the first one that I was thinking of, as well.
46:57 Francesca Lazzeri: Yeah, health care is one because I think that, after the finance industry, they are the industry that actually produces the biggest amount of data, and of course the more data you have, the better your models and the more objective the results are going to be. So for sure if I have to mention a few industries where I see AI being successful in the next few years, one, as I mentioned, is health care. The second is transportation. I really think that not only are driverless cars going to improve our everyday life, but also what we call predictive maintenance, that is, predicting when a machine is going to fail, is something that is going to have a huge impact in the transportation field. The third one, I would say, is probably less interesting, but still it's an industry that we all need to interact with. It is the finance industry. Why do I see the biggest impact there? Because actually they are the ones who have been using machine learning for the longest time, but I have to say that this interaction between AI and machine learning is something new to them as well, so I think that we are going to see interesting solutions there as well, and it is probably going to democratize what sometimes looks and sounds inaccessible to many people who are not experts in economics or trading, so I think these are the three main industries where we will see a lot of changes and impact from the AI world.
48:39 Michael Kennedy: Yeah, those are all definitely. I think health care, for sure.
48:42 Francesca Lazzeri: Yes.
48:42 Michael Kennedy: There's just so much data, and it has the opportunity to be just so transformative, right? You think you have the cancer diagnosis, and the other things, and probably pharmaceuticals, as well, is kind of tied into that, right, like creating new drugs in a much quicker way.
48:57 Francesca Lazzeri: Absolutely, and if you think about health care, besides the two excellent examples that you just made, there is also the whole insurance world for health care, where AI can really help facilitate some of the processes that patients sometimes have to go through, and if we can optimize time, and make these processes better for people who need health care, I think that is also a very, very good improvement.
49:23 Michael Kennedy: Yeah, one other I'd love to see is the energy sector around renewable energy and stuff like that. It seems like a lot of good stuff could happen there.
49:31 Francesca Lazzeri: Totally agree.
49:31 Michael Kennedy: All right, so I guess before we call it a show I'm going to ask you the two final questions I always ask. The first one is if you're going to write some Python code, what editor do you use?
49:40 Francesca Lazzeri: I already mentioned this editor. It's Visual Studio Code. I have to say that I really like, right now, the Python extension, and the Azure Machine Learning extension. Azure Machine Learning for Visual Studio Code, which was previously called Visual Studio Code Tools for AI, is an extension that you can use, as a data scientist, to build, train, and also deploy your machine learning models, of course, on the cloud or on the edge, and you can leverage the power of Azure Machine Learning service. Another editor that I really like is JupyterLab. I think I really like it because it provides a high level of integration between notebooks, documents, and different activities, and then the other one that I mentioned, and I think that data scientists should also consider, is PyCharm, because I think that it is similar to Visual Studio Code in that they both have interesting features such as a code editor, error highlighting, and also a powerful debugger with a nice graphical interface.
50:43 Michael Kennedy: They seem to have a slightly different philosophy, Visual Studio Code and PyCharm, but they're both really good, excellent, and then, of course, the Jupyter stuff is like standard. A standard data scientist, you've got to fit that into the workflow somewhere, right?
50:56 Francesca Lazzeri: Absolutely.
50:56 Michael Kennedy: Awesome, all right, and then, finally, a notable PyPI package, not necessarily the most popular, but something that folks maybe haven't heard of, but is really cool, that you want to share.
51:06 Francesca Lazzeri: Again, I'm very passionate about time series forecasting, and the package that I have been using a lot lately is the forecasting package with AutoML. So for forecasting tasks, automated machine learning uses preprocessing and estimation steps that are specific to time series data, so the preprocessing steps will, for example, detect the time series sample frequency, if it is hourly, daily, or weekly, and they create new records for missing time stamps to make the series continuous, and then they can also impute missing values in the target and feature columns, and they can create grain-based features to enable fixed effects across different series, so this is, again, a very nice forecasting package that you can use with an AutoMLConfig object within Azure Machine Learning service, and I have to say I have been using this a lot.
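The preprocessing steps listed above (detect the sampling frequency, add records for missing time stamps, fill in missing values) can be illustrated with a small standard-library sketch. This is not how automated machine learning implements them: the `infer_frequency` and `make_continuous` helpers, the "most common gap" heuristic, and the use of forward fill as the imputation rule are all assumptions chosen to keep the example short.

```python
from collections import Counter
from datetime import datetime, timedelta


def infer_frequency(timestamps):
    """Guess the sampling step as the most common gap between time stamps."""
    ordered = sorted(timestamps)
    gaps = Counter(later - earlier for earlier, later in zip(ordered, ordered[1:]))
    return gaps.most_common(1)[0][0]


def make_continuous(series):
    """Fill missing time stamps and missing values using forward fill.

    `series` maps datetime -> value, where a value of None means "missing".
    """
    step = infer_frequency(series)
    current, end = min(series), max(series)
    filled, last_value = [], None
    while current <= end:
        value = series.get(current)
        if value is None:
            value = last_value  # carry the previous observation forward
        filled.append((current, value))
        last_value = value
        current += step
    return filled


# Hourly series with one missing time stamp (02:00) and one missing value (04:00).
series = {
    datetime(2019, 6, 1, 0): 10,
    datetime(2019, 6, 1, 1): 12,
    datetime(2019, 6, 1, 3): 15,
    datetime(2019, 6, 1, 4): None,
    datetime(2019, 6, 1, 5): 18,
}

step = infer_frequency(series)
filled = make_continuous(series)
for ts, value in filled:
    print(ts.isoformat(), value)
```

Forward fill is only one of several possible imputation rules. Which one is appropriate depends on the series, and a managed service may pick a different default.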
52:06 Michael Kennedy: Yeah, that's cool. Yeah, that's really cool. Their example on the PyPI page is like eight lines of code. Yeah.
52:13 Francesca Lazzeri: It's pretty incredible.
52:14 Michael Kennedy: Yeah, super-cool, okay, well, that's a good one, and I hadn't heard of it either, so quite nice.
52:18 Francesca Lazzeri: Yeah, yeah, you should try it out.
52:20 Michael Kennedy: Yeah, definitely. All right, well thank you for sharing what you've been up to, and all of your experience working with all these other groups, universities, and companies, and so on. It's been great to chat with you.
52:30 Francesca Lazzeri: Thank you.
52:31 Michael Kennedy: Yeah, and if people want to get started with some of these things you talked about, what's the final call to action? What should they go check out? Where should they go?
52:39 Francesca Lazzeri: There are a few links that I would like to share. The first one is aka.ms/azuremlservice, so it's very easy to remember, and the other one is aka.ms/azuremlforvscode, and if you want to get started, the last link is aka.ms/getstartedazureml, so these are some of the links that you can look at if you want to learn more about Azure Machine Learning service, Azure Machine Learning for Visual Studio Code, and if you want just to get started with Azure in general.
53:14 Michael Kennedy: Yeah, that sounds great. Those are good links, and I'll be sure to put them in the show notes so people can check them out.
53:18 Francesca Lazzeri: That's great, thank you.
53:19 Michael Kennedy: Yeah, you bet. Thanks for being on the show.
53:20 Francesca Lazzeri: Thank you, bye-bye.
53:22 Michael Kennedy: This has been another episode of Talk Python to Me. Our guest on this episode was Francesca Lazzeri, and it's been brought to you by Linode and Talk Python Training. Linode is your go-to hosting for whatever you're building with Python. Get four months free at talkpython.fm/linode. That's L-I-N-O-D-E. Want to level up your Python? If you're just getting started, try my Python Jumpstart by Building 10 Apps course, or, if you're looking for something more advanced, check out our new async course that digs into all the different types of async programming you can do in Python, and of course, if you're interested in more than one of these, be sure to check out our Everything Bundle. It's like a subscription that never expires. Be sure to subscribe to the show. Open your favorite podcatcher and search for Python. We should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it. Now, get out there and write some Python code.