#283: Web scraping, the 2020 edition Transcript
00:00 Web scraping is pulling the HTML of a website down and parsing useful data out of it. The use cases for this type of functionality are endless. Have a bunch of data on governmental sites that's only listed online in HTML, without a download option? There's an API for that. Do you want to keep abreast of what your competitors are featuring on their site? There's an API for that. Need alerts for changes on a website, for example, enrollment is now open at your college and you want to be first and avoid that 8am morning slot? There's an API for that as well. That API is screen scraping, and Attila Tóth from Scrapinghub is here to tell us all about it. This is Talk Python To Me, Episode 283, recorded July 22, 2020.
00:54 Welcome to Talk Python To Me, a weekly podcast on Python: the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter, where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython. This episode is brought to you by Linode and us. Python's async and parallel programming support is highly underrated. Have you shied away from the amazing new async and await keywords because you've heard it's way too complicated, or that it's just not worth the effort? For the right workloads, a 100-times speedup is totally possible with minor changes to your code. But you do need to understand the internals, and that's why our course, Async Techniques and Examples in Python, shows you how to write async code successfully as well as how it works. Get started with async and await today with our course at talkpython.fm/async.
01:50 Attila, welcome to Talk Python To Me. Thanks for having me, Michael. Yeah, it's great to have you here. I'm looking forward to talking about web scraping. It's the API that you didn't know exists, but it's out there, if you're willing to find it. It's like the dark API. Yeah, it's a secret API where you do a little work to be able to use it, to earn the right to use it. Yeah. But yeah, I like it. It's useful. It definitely is. And we've got some pretty interesting use cases to talk about, and a little bit of history and stuff. And obviously we're gonna talk about Scrapy, stuff about Scrapinghub, what you guys are doing there. So a lot of fun stuff. But before we get to all that, let's start with your story. How'd you get into programming in Python? Yeah, sure. So actually, I got into programming back in elementary school. I was the sort of kid, you know, I think I was in eighth grade, and at the time, everybody in my class at school was asking for, I think it was the Xbox 360 that had just come out. Okay. Everybody was asking for that for Christmas. Yeah. And I was the one who was asking for a book about programming. What language did you want at that time? Yeah, so actually, I did a lot of research. I think for like three months, I was researching which programming language is the best for beginners. So I was on different forums, Yahoo Answers, and every place on the internet. And many people suggested, at the time, Pascal. So I got this book, I think it was called something like Introduction to Turbo Pascal or something like that. Right. Okay. Yeah. So I got the book, and that was the first step. And then, like 40 pages into the book, I threw it away, because I found that just googling stuff in Google and Stack Overflow, it's just, you know, easier to learn it that way. Nice. So did you end up writing anything in Turbo Pascal? Yeah, I was just doing, you know, the regular things beginners usually do.
Like in the terminal, I printed my name, I printed figures with characters, you know, just the regular things beginners do. But then I quickly switched to Java after Pascal. Right, Java can build web apps and build GUIs, more interesting things. Yeah, at the time, I was interested in Android development. Actually, it was a few years later, when I got to high school, that I started to program in Java to develop Android apps. I created, I don't know, maybe five or six apps for Google Play that I just found useful, minor things like a music player that plays music randomly, whatever's on your phone, and the only input is how long you want to listen to music. Oh, cool. So I was like, I want to listen for 20 minutes, and it just randomly plays music for 20 minutes. So, you know, little things like that. Yeah. Cool. How'd you find your way over to Python? Thanks to web scraping, actually. Yeah, yeah, you know, I'm not sure which was the first point. No, actually, I remember which one.
05:00 The first time I came across web scraping, I was trying to develop some software. I was pretty into sports betting, football and soccer. And I didn't play high stakes, because I wasn't good enough to afford to do it. But I was interested in the analysis part, you know, looking up stats, looking up injury information, looking at different factors that would determine the results, the outcomes of games. And I wanted to create software that would predict the outcome of these games, or some of them. We have, I don't know, 25 games in a weekend, and it would predict, okay, look at these three games, and these are going to be the results. I wanted to create software like this, right? And if that beats the spread that they are offering in Las Vegas, you're like, oh, they're gonna win by 14, oh, they said they're only winning by seven, I wanted to take that. Maybe. Yeah, try that. And I also tried it out, to see how it would go.
06:03 Yeah, the thing is, I wouldn't say it was too good, but it wasn't too bad. I mean, I experimented. First of all, I was betting on over and under, because I found that that's a more predictable thing to bet on. At least that's what I thought back then, because in reality, it isn't. But I bet on over and under. And so this software would tell me that, hey, in this game, you know, if it's soccer, there will be more than two goals in the first half. And so what I would do is, I would watch the game live, because live, there are better odds, because as the time progresses, the odds for scoring in the first half go up. I just waited until, I think it was the 15th minute or something like that, and if there was still no goal scored in the first half, I would bet on it. So I was doing this for, I think, four months without money, just testing the system, right? I was watching and, okay, I would bet right now, and I was calculating it all in this Excel spreadsheet and everything. And in that four-month period, I think it was not profitable at all, but it was close to being just a little profitable in the long term. Yeah. Well, maybe you were onto something. If you get two more factors, maybe bring in a little machine learning, who knows? Yeah, exactly. Maybe I should pick it up. But that's how I came across web scraping, right? Because I needed data, and I needed stats. And a lot of these places, they don't want to give out that data. They will show it on their website, but they're not going to make an API that you can just consume. Right, exactly. At the time, there were different commercial APIs that big news corporations use, but it was just a hobby project, and the only way for me to get data was to extract it from a publicly available website. And so I started to look into it. And at the time, I was programming in Java, so I found jsoup. I found HtmlUnit, which is a testing library, but it can be used for web scraping.
I found, I think it's pronounced Jaunt, and I found other libraries. And everybody was asking, which one is the best? Which one is the best? Everybody was talking on the forums about Scrapy, and I didn't know what Scrapy was. So I looked it up. It's in Python. I didn't know how to program in Python. I could barely program in Java. But I eventually learned Python just to learn Scrapy. Oh, yeah. And it's such a nice way to do web scraping. It's a weird language, this Python. It doesn't have semicolons, curly braces, there's not so many of those. Yeah, compared to Java, it's very clear. Yeah, that's for sure. Java is a little verbose in all of its type information, even for a typed language, compared to C, or I don't know, maybe even C#. It likes its interfaces and all of its things and so on, right? Yeah. The best thing I love about Python, which I cannot do in Java, is that in Python, you can read the code like an actual sentence, right? I mean, not always, but sometimes you can, like a sentence. Yeah, there's this cartoon joke about Python, that there's these two scientists or something like that, and they've got this pseudocode, it says pseudocode on their file, like pseudocode.txt, and one goes, how do we turn this into a program? Oh, you change the .txt to a .py and it'll run. Yeah, it's already there. Yeah, exactly. All right. Well, so that's how you got to it, and obviously your interest in web scraping. So you've continued that interest today, right? What are you doing these days? Yes. So I've been developing a lot of spiders, a lot of scrapers with Scrapy and other tools over the years. And nowadays, I've joined the company who actually created Scrapy,
10:00 Scrapinghub, and I've been with Scrapinghub for over a year now. And what I'm doing day to day is educating people about web scraping, both on the technical side of things and the business side of things: how to extract data from the web, and why it is useful for you, or why it can be useful for you. And so nowadays, I don't really code that much, unless it's for an article or a tutorial, or to showcase some new stuff. I spend more time on creating videos and, just, you know, teaching. Yeah, well, that sounds really fun. You just get to explore, right? Yeah, and the biggest part, or the best part I love about it, is that I get to speak to customers who are doing web scraping, or who are enjoying the benefits of web scraping. And there are some really cool stories about what these customers do with data. Yeah, and people can get super creative about how to make use of, you know, web data, which is right in front of you. You can see the data on the website, and it doesn't look that interesting, but when you extract it and you structure it, you can actually make use of it and, you know, you can do many different things, but you can drive better decisions in companies, which is pretty exciting. Yeah. And I guess if you work at a company, you might think, I'd like to do some comparative analysis against, say, our competitors. What are they doing? What are they doing with sales, in terms of discounts? What are they doing in terms of things they're featuring, right? You could theoretically write a web scraper that looks at your competitors' data, their sites, their presentation, and sort of gives you that over time, relative to you, or something like that. Yeah, many use cases are, as you said, about monitoring competitors, or monitoring the market.
And especially, it's a big thing in e-commerce, where most of the sectors are really crowded. There are a lot of companies doing the same thing, selling the same thing, and it's just really hard for them to sell more products. So with web scraping, they can monitor the competitors' prices, they can monitor the competitors' stock information, they can monitor pretty much everything that is on their website, publicly available. And they can gather this information, and, you know, it can be tens of thousands of products or more. And you can see where you should raise your prices, lower your prices, and how to price your products better and do marketing better. Yeah, and there are a lot of implicit signals as well. Most companies, some do, but most won't say: there were 720 of these sold today, or whatever. Yeah, but you could graph, say, the number of reviews over time or the number of stars over time. Things like that would give you a proxy for that kind of information, and you could come up with a pretty good analysis, a pretty interesting analysis.
13:16 This portion of Talk Python To Me is brought to you by Linode. Whether you're working on a personal project or managing your enterprise's infrastructure, Linode has the pricing, support, and scale that you need to take your project to the next level. With 11 data centers worldwide, including their newest data center in Sydney, Australia, enterprise-grade hardware, S3-compatible storage, and the next-generation network, Linode delivers the performance that you expect at a price that you don't. Get started on Linode today with a $20 credit, and you get access to native SSD storage, a 40-gigabit network, industry-leading processors, their revamped Cloud Manager at cloud.linode.com, root access to your server, along with their newest API and a Python CLI. Just visit talkpython.fm/linode when creating a new Linode account, and you'll automatically get $20 credit for your next project. Oh, and one last thing: they're hiring. Go to linode.com/careers to find out more. Let them know that we sent you.
14:14 Also, some of the things that people monitor with web scraping are more high level. We have the price, the stock, those are really specific things or values. But with web scraping, you can actually monitor company behaviors, what the company is doing at a high level. And you can achieve this by monitoring news, or looking at the text or the blog of the company, extracting some useful information from that and, you know, setting up alerts and things like that. Yeah, super cool. So there are a couple of examples that I wanted to put out there and then get your thoughts on. One is, most of the world has been through the wringer on COVID-19 and all that
15:00 stuff. In the US, we definitely have been recently getting our fair share of disruption from that. So there's an article on towardsdatascience.com where the challenge was that in the early days of the lockdowns, at least in the United States, we had things like Instacart and other grocery delivery services. There are local grocery stores where you could order stuff and they would bring it to your car outside, so you wouldn't have to go in, fewer people in the store, all sorts of stuff that was probably a benefit from that. But if you tried to book it, it was like a week and a half out. Yeah, and it was a mess, right? And so this person wrote an article that talked about using web scraping to find grocery store delivery slots, right? So you could basically say, I'm gonna go to my grocery store, or to something like Instacart, and I'm just gonna watch until one of these slots opens up. And then what he had it do was actually send him a text: go now, book it, place my order, within an hour or whatever. And I think that's a really creative way of using web scraping. What do you think? Yeah, it's definitely really creative and useful. So he wasn't just getting alerted when there was a new spot available, he actually created a bot, so it chooses that spot, right? He would actually place the order. Yeah, that's smart. I mean, we had the same situation in Hungary, where you couldn't find any free spot. And the thing I love about web scraping is that it can be useful for huge companies doing web scraping at scale, but it can also be useful for literally everyone, just doing little things like this. I have a friend who is looking to buy an apartment, and currently the real estate market is really, fragile, I would say. Yeah, yeah.
And what he wants to do, what I suggested to him, is to set up a bot that monitors like ten websites, real estate listing websites, so he will get alerted when there is an apartment available with the attributes he is interested in. Right. Okay, that's awesome. Yes. So there are so many little things like this in web scraping, and it's just awesome. Yeah, I agree. People talk often about this concept popularized by Al Sweigart: automate the boring stuff. Like, oh, I've always got to do this thing in Excel, so could I write a script that just generates the Excel workbook? Or I've always got to do this other thing, copy these files and rename them, right? Something just does that. But this feels like the Automate the Boring Stuff of web interfaces: automate tracking of these sites, and let it find out when there's a new thing. Yeah, Automate the Boring Stuff is a really good resource, and in this case, that's really what it's about. If you didn't create a web scraper for this, what would you do? You would probably look at the website as often as you can, every day, every hour. You'd hear somebody say, I checked it every day until I found the apartment I wanted, right? That's what they would say. Yeah, exactly. Yeah. And what you don't hear them say so often is, my bot alerted me when my dream apartment was available. Yeah, I created a Scrapy spider that alerts me when there's a new apartment listing in an area. Yeah. And I think it's a really great example. There are so many things on your computer that you do where you kind of feel like, ah, this stuff I've just got to do. But on the web it's way more human: I always go to this site, and I do this thing, and I've got to do that, right? Web scraping can kind of open up a ton of that stuff. So let me give you two examples: one really grand, large-scale one, and one extremely small, but really interesting to me in the next, you know, hour.
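A minimal version of the kind of monitoring bot described above, polling a listings page and flagging anything new, might look like the sketch below. It is an illustration only: the URL and the `listing-title` CSS class are made-up placeholders for whatever the real site uses, and a real bot would also send a text or email instead of printing.

```python
import time
import urllib.request
from html.parser import HTMLParser

class ListingParser(HTMLParser):
    """Collects the text of elements whose class list includes 'listing-title'."""
    def __init__(self):
        super().__init__()
        self.titles = []
        self._grab = False

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        if "listing-title" in classes.split():
            self._grab = True

    def handle_data(self, data):
        if self._grab:
            self.titles.append(data.strip())
            self._grab = False

def extract_titles(html):
    parser = ListingParser()
    parser.feed(html)
    return parser.titles

def new_listings(html, seen):
    """Return titles not seen on earlier polls; remember everything we saw."""
    current = extract_titles(html)
    fresh = [t for t in current if t not in seen]
    seen.update(current)
    return fresh

def watch(url, minutes=60, every=300):
    """Poll url every `every` seconds and report listings that just appeared."""
    seen = set()
    for _ in range(int(minutes * 60 / every)):
        with urllib.request.urlopen(url) as resp:
            page = resp.read().decode("utf-8", errors="replace")
        for title in new_listings(page, seen):
            print("New listing:", title)  # a real bot would alert you here
        time.sleep(every)
```

The `watch` loop deliberately sleeps between requests, which matters for the politeness concerns discussed later in the episode.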
So, grand scale first. When Trump was elected in the United States in 2016 and was going to become President in 2017, there was a lot of concern about what was going to happen to some of the data that had so far been hosted on places like whitehouse.gov. There's a lot of climate change data posted there from the scientists, and there was a thought that he might take that down. Not necessarily destroy it, but definitely take it down publicly from the site. And there are all these different places, the CDC, other organizations, that have a bunch of data that a bunch of data scientists, scientists, and other folks were just worried about. So there were these big movements, and I've struggled to find the articles, you know, almost four years out now I can't quite find the right search term to pull them up, but there were these save-the-data hackathon type things. People got together at different universities, like 20 or 30 people, and they would spend the weekend writing web scrapers to go around to all these different organizations and download the data. I think they were coordinated across locations, and then they would put that all in a central location, I think somewhere in Europe, in Switzerland
20:00 or something, so that data could live on. And I think the day that Trump was inaugurated, the climate change data went off of whitehouse.gov. So at least one of the data sources did disappear. And that's a pretty large-scale web scraping effort. Some of it was download-the-CSV, but a lot of it was web scraping, I think. Yeah, the thing is, people say that once you put something on the internet, it's going to stay there forever. And why these stories are interesting is that this is information that is useful for everyone. Yeah. And this is the kind of information that you want to make publicly available. And I think this is one of the things that this movement is about, if I can call it a movement: open data. Yeah, to make data open and accessible for everyone. And web scraping is really a great tool to do that, to extract data from the web, put it in a structured format, in a database, so you can use it. You can just store the information, or you can get some kind of intelligence from the data. And really, without web scraping, what else could you do? Would you copy-paste the whole thing? It would take a lot of by-hand work. Yeah, it wouldn't be good. Yeah. Yeah, that's really cool. And that was an interesting, large-scale one: hey, we have to organize, we've got one month to do it, let's make this happen. So on a much smaller scale, I recently realized that some of the pages on my site were getting indexed by Google, not because anything on the internet was linking to them publicly, but something would be in, like, Gmail, or somebody would write it down, and then Google would find it. So I just went through and put a bunch of noindex meta tags on a bunch of parts of my site. But it really scared me, because I changed a whole bunch of it, and I'm pretty sure I got it right.
But what if I accidentally put noindex on, like, some of my course pages that advertise them, and then all of a sudden the traffic goes from whatever it is to literally zero, because I accidentally told all the search engines, please, yeah, don't tell anyone about this page? So what I was doing, actually, right before we called, unrelated to the fact that this was the topic, was I was going to write a very simple web scraper thing that goes and grabs my sitemap, hits every single page, and just tells me the pages that are not indexed, just for me to double-check that, you know, that template didn't get reused somewhere I didn't think it got reused, you know, like the little shared bit that has the head in it, and so on. Yeah, it's sort of like, in SEO, search engine optimization, we have this thing, technical SEO. Yeah. And actually, it's another use case for web scraping, or crawling, where you want to crawl your own website, just like you did, to learn the sitemap, or to find pages where you don't have proper SEO optimization, like you don't have a meta description on some pages, or you have multiple h1 tags, or whatever, right? And with crawling, you can figure these out. And I just remember this story from, like, 2016, where web scraping was used to reveal lobbying and corruption in Peru. Oh, wow. Okay. Yeah, I'm not a hundred percent sure, but I think it was related to, you know, the Panama Papers. Yeah, yeah, I was gonna guess it might have something to do with the Panama Papers, which was really interesting in and of itself. Yeah. Yeah. And they actually used many tools to scrape the web, but they also used Scrapy to get information from, I guess, government websites, or other websites. But yeah, it's crazy that web scraping can reveal these kinds of things. This is a big thing, yeah, corruption. And web scraping can tell you that, hey, there is something wrong here, you should pay attention.
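The noindex audit Michael describes, crawling your own sitemap and flagging pages that carry a robots noindex meta tag, has a small core that can be sketched with the standard library. This is an assumed implementation, not his actual script; a full version would fetch the sitemap, then fetch each URL with a delay and run this check on it.

```python
from html.parser import HTMLParser

class NoIndexDetector(HTMLParser):
    """Detects a <meta name="robots" content="...noindex..."> tag in a page."""
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # attrs is a list of (name, value); value can be None
        a = {k.lower(): (v or "") for k, v in attrs}
        if a.get("name", "").lower() == "robots" and \
                "noindex" in a.get("content", "").lower():
            self.noindex = True

def has_noindex(html):
    """Return True if the page asks search engines not to index it."""
    detector = NoIndexDetector()
    detector.feed(html)
    return detector.noindex
```

Run over every URL in the sitemap, this turns the "did my template change leak somewhere?" worry into a short report of exactly which pages are deindexed.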
It lets you automate the discovery and the linking of all this different data that was never intended to be put together, right? There's no API that's supposed to take these donations, this investment, this other construction deal, or whatever the heck it was, and link them together. But with web scraping, you can sort of pull it off, right? Yeah, you can structure the whole web. I mean, technically, you could structure the whole web, or you could get a set of websites that have the same kind of data you're looking for. It can be e-commerce, fintech, real estate, whatever. You can crawl those sites, extract the publicly available data, structure it, and then you can do many sorts of things. You can do NLP, you can just search in the data, you can do many different kinds of things. Yeah, that's cool, too. Let's talk a little bit specifically about Scrapy, and maybe Scrapinghub as well, and some of the more general challenges you might run into with web scraping. Right on the Scrapy website, the project homepage, there's a real simple example. It says, here's why you should use it: you can just create this class.
35:40 Talk Python To Me is partially supported by our training courses. How does your team keep their Python skills sharp? How do you make sure new hires get started fast and learn the Pythonic way? If the answer is a series of boring videos that don't inspire, or a subscription service you pay way too much for and use way too little, listen up. At Talk Python Training, we have enterprise tiers for all of our courses. Get just the one course you need for your team with full reporting and monitoring, or ditch that unused subscription for our course bundles, which include all the courses, and you pay about the same price as a subscription, once. For details, visit training.talkpython.fm/business or just email sales@talkpython.fm.
36:24 Yeah, so let's maybe round out our conversation talking about hosting. So I can go and run web scraping on my system, and that sometimes works pretty well; sometimes it doesn't. There was a time a friend of mine was thinking about getting a car and putting it on one of these car sharing sites, to keep it a little bit general. And we wanted to know, okay, for a car that's similar to the kind of car he's thinking about getting, how often is it rented? What does it rent for? So would it be worthwhile as just a pure investment, right? So I wrote up some web scraping stuff that would just monitor a couple of cars I picked. After a little while, they didn't talk to me anymore. The website was mad at me for always asking about this car. Yeah, so that seems like a common problem. And if I do that from my house, I probably can't check out that site anymore for a while, right? I probably shouldn't do it from my house. But even from a dedicated server, that might cause problems. Yeah, it's not just web scraping. Like, I remember I was doing some kind of financial research. It was about one of the cryptocurrencies, and I was using an actual API. It wasn't scraping, I was using an actual API. And it was pretty basic; I was just learning how to use this thing. And I created a for loop, and I put the API request in the for loop, and I was just testing. And by accident, I put in a huge number.
37:49 In the for loop, like, I don't know, a million or something. And it would have created a million requests. I really did try to create a million requests, but it stopped after, I don't know, maybe 10,000 requests in a minute, right? Because, you know, the server didn't like that I was doing that. And it's the same thing with web scraping: you need to be really careful not to hit the website too hard. Because when you scrape the web, it's really important, I mean, it's very important, to be ethical and to be legal. And one of the things that you can do to achieve this is to just put in some kind of delay limit, just to respect the website and make sure that you don't cause any harm to the website. Right. As soon as people think that what you're doing, you think you're just getting data, but they might perceive it as, this is a distributed denial of service attack against my site, we're gonna block this person, right? That's not a good situation you want to be in. It's also not super nice to whoever runs that website to, you know, hammer it. Exactly. The website doesn't know what you want to do; it just sees a bot. And truly, you know, because web scraping itself is legal, you can do it if it's a publicly available website, you need to make sure that you are being, I mean, we use this term in the web scraping community, nice. You need to be nice to the website. And, you know, you can look at the robots.txt file; usually there is a delay specified in the robots.txt file that tells you, hey, you should put, like, three seconds between each request. But in general, even if there is no such rule defined in the robots.txt file, you should pay attention. You should be careful with how many requests you make and how frequently you make those requests. So yeah, it's important. Yeah, interesting. And so I guess one option is, you guys have a cool thing, which I just discovered, I don't know how new it is.
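The be-nice advice above maps directly onto a Scrapy project's settings file. As a sketch, the settings below are all real Scrapy setting names, but the specific values are illustrative rather than recommendations from the episode:

```python
# settings.py (fragment): throttle the crawler so it doesn't hammer the site

ROBOTSTXT_OBEY = True                 # honor the site's robots.txt allow/disallow rules
DOWNLOAD_DELAY = 3                    # roughly three seconds between requests to a site
CONCURRENT_REQUESTS_PER_DOMAIN = 1    # one in-flight request per domain at a time
AUTOTHROTTLE_ENABLED = True           # back off further when responses slow down
```

Scrapy also randomizes the delay a bit by default, which makes the traffic look less like a fixed-interval bot while keeping the average rate polite.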
But Scrapyd, a daemon that lets you basically deploy your Scrapy scripts to it, and it'll schedule and run them. And that's pretty cool, and you can control that, so you could set that up yourself. Or obviously you have the business side of your whole story, which is Scrapinghub.
45:00 a big word, I believe, with the advancements of machine learning and other stuff. So it's gonna be really interesting to see what's going to happen in the next few years. Yeah, for sure. All right. Well, I think that's probably it for the time we've got to talk about web scraping. But the final two questions before we get out of here: if you're gonna write some code, what editor do you use for Python code? PyCharm. PyCharm, right on. And then notable PyPI package? I mean, there's obviously pip install scrapy, but maybe some project that you came across recently, you know, like, oh, this is so awesome, people should know about X. It's a hard question, because I've been sort of out of the game for a lot of months now. Aside from Scrapy, I don't have one specific example, but I just really like to find these GitHub repositories where it's not a library, but a collection of templates, right? You know, like, hey, if you're a beginner, you can use these templates to get started. And for me, that was really useful when I started out with a new tool, Scrapy or other libraries, that you can use this starter code, sort of, right? Right. I wonder if there's a cookiecutter for Scrapy? Possibly. Probably. Yeah. Yeah, there is. I'll throw it into the show notes. But yeah, cool. Just cookiecutter, make me a Scrapy project, let's roll and do some web scraping. Nice. So, you know, just one last note also: I think there's Awesome Web Scraping, or something like that, which has hundreds of tools for web scraping. Oh, yeah. We can link it later if I find it. Okay. Yeah. It's awesome-web-scraping on GitHub. Yeah. Nice. I'll put that in the show notes as well. I love those awesome lists. Yeah, it's easy to get lost, because you're like, oh, I have this one thing and I use it, and then all of a sudden you're like, oh, look at that.
There's like ten other options in this thing that I thought I knew. Yeah, exactly. Awesome. All right, Attila, it's been great to chat about all these web scraping stories, and great to chat with you. So thanks for being here. Yeah, it's been amazing. Thank you very much. Yeah, you bet. Bye. Bye-bye. This has been another episode of Talk Python To Me. Our guest on this episode has been Attila Tóth, and it has been brought to you by Talk Python Training and Linode. Start your next Python project on Linode's state-of-the-art cloud service. Just visit talkpython.fm/linode, L-I-N-O-D-E, and you'll automatically get a $20 credit when you create a new account. Want to level up your Python? If you're just getting started, try my Python Jumpstart by Building 10 Apps course. Or if you're looking for something more advanced, check out our new async course that digs into all the different types of async programming you can do in Python. And of course, if you're interested in more than one of these, be sure to check out our Everything Bundle. It's like a subscription that never expires. Be sure to subscribe to the show: open your favorite podcatcher and search for Python; we should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it. Get out there and write some Python code.