
#283: Web scraping, the 2020 edition Transcript

Recorded on Wednesday, Jul 22, 2020.

00:00 Web scraping is pulling the HTML of a website down and parsing useful data out of it. The use cases for this type of functionality are endless. Have a bunch of data on governmental sites that's only listed online in HTML without a download? There's an API for that. Do you want to keep abreast of what your competitors are featuring on their site? There's an API for that. Need alerts for changes on a website? For example, enrollment is now open at your college and you want to be first and avoid that 8 a.m. morning slot. There's an API for that as well. That API is screen scraping, and Attila Tóth from Scrapinghub is here to tell us all about it. This is Talk Python to Me, Episode 283, recorded July 22, 2020.

00:54 Welcome to Talk Python to Me, a weekly podcast on Python: the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter, where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython. This episode is brought to you by Linode and us. Python's async and parallel programming support is highly underrated. Have you shied away from the amazing new async and await keywords because you've heard it's way too complicated or that it's just not worth the effort? For the right workloads, a 100-times speedup is totally possible with minor changes to your code. But you do need to understand the internals, and that's why our course, Async Techniques and Examples in Python, shows you how to write async code successfully as well as how it works. Get started with async and await today with our course at talkpython.fm/async.

01:50 Attila, welcome to Talk Python to Me. Thanks for having me, Michael. Yeah, it's great to have you here. I'm looking forward to talking about web scraping. It's the API that you didn't know exists, but it's out there if you're willing to find it. It's like the dark API. Yeah, it's a secret API where you do a little work to be able to use it, to earn the right to use it. Yeah. But yeah, I like it. It's useful. It definitely is. And we've got some pretty interesting use cases to talk about, and a little bit of history and stuff. And obviously we're gonna talk about Scrapy, and about Scrapinghub and what you guys are doing there. So a lot of fun stuff. But before we get to all that, let's start with your story. How'd you get into programming in Python? Yeah, sure. So actually, I got into programming back in elementary school. I was the sort of kid, like, you know, I think I was in eighth grade, and at the time, everybody in my class at school was asking for, I think it was the Xbox 360 that had just come out. Okay. Everybody was asking for that for Christmas. Yeah. And I was the one who was asking for a book about programming. What language did you want at that time? Yeah, so actually, I did a lot of research. I think for like three months, I was researching which programming language is the best for beginners. So I was on different forums, Yahoo Answers, and every place on the internet. And many people suggested, at the time, Pascal. So I got this book, I think it was called something like Introduction to Turbo Pascal or something like that. Right. Okay. Yeah. So I got the book, and that was the first step. And then, like 40 pages into the book, I threw it away, because I found that just googling stuff and using Stack Overflow was, you know, easier. It was easier to learn it that way. Nice. So did you end up writing anything in Turbo Pascal? Yeah, just, you know, the regular things beginners usually do. Like, in the terminal, I printed my name in red, I printed figures with characters, you know, just the regular things beginners do. But then I quickly switched to Java after Pascal. Right? Java can build web apps and build GUIs, more interesting things. Yeah, at the time, I was interested in Android development. Actually, it was a few years later, when I got to high school, that I started to program in Java to develop Android apps. I created, I don't know, maybe five or six apps for Google Play that I just found useful. Minor things, like a music player that randomly plays the music that's there on your phone, and the only input is how long you want to listen to music. Cool. So I was like, I want to listen for 20 minutes, and it just randomly plays music for 20 minutes. So, you know, little things like that. Yeah. Cool. How'd you find your way over to Python? Thanks to web scraping, actually. Yeah, yeah. You know, I'm not sure which was the first point. No, I actually remember which one.

05:00 The first time I came across web scraping, I was trying to develop a piece of software. I was pretty into sports betting, football and soccer. And I didn't play high stakes, because I wasn't good enough to afford to do it. But I was interested in the analysis part, you know, looking up stats, looking up injury information, looking at different factors that would determine the results, the outcomes of games. And I wanted to create a software that would predict the outcome of these games, or some of them. We have, I don't know, 25 games in a weekend, and it would predict, like, okay, look at these three games, and these are going to be the results. I wanted to create a software like this, right? And if that beats the spread that they are offering at Las Vegas, you're like, oh, they're gonna win by 14; oh, they said they're only winning by seven; I wanted to take that. Maybe, yeah, try that. And how did it go?

06:03 Yeah, the thing is, I wouldn't say it was too good, but it wasn't too bad. I mean, I experimented. First of all, I was betting on, like, over/under, because I found that that's a more predictable thing to bet on. At least that's what I thought, because in reality, it isn't. But I bet on over/under. And so this software would tell me, hey, in this game, you know, if it's soccer, there will be more than two goals in the first half. And so what I would do is, I would watch the game live, because in live betting, there are better odds, because as the time progresses, the odds on scoring in the first half go up. I just waited until, I think it was the 15th minute or something like that, and if there was still no goal scored in the first half, I would bet on it. So I was doing this for, I think, four months without money, just testing the system. Right? I was watching and, okay, I would bet right now, and I was calculating it all. I had this Excel spreadsheet and everything. And in that four-month period, I think it was not profitable, but it was close to being just a little profitable in the long term. Yeah. Well, maybe you were onto something. If you get a few more factors, maybe bring in a little machine learning, who knows? Yeah, exactly. Maybe I should pick it up. But that's where I came across web scraping, right? Because I needed data and I needed stats. And a lot of these places, they don't want to give out that data. They will show it on their website, but they're not going to make an API that you can just stream. Right, exactly. At the time, there were different commercial APIs that big news corporations use, but it was just a hobby project, and the only way for me to get data was to extract it from a publicly available website. And so I started to look into it. And at the time, I was programming in Java. So I found jsoup, I found HtmlUnit, which is a testing library, but it can be used for web scraping. I found one, I think it's pronounced Jaunt, and I found other libraries. And everybody was asking, which one is the best, which one is the best? Everybody was talking on the forums about Scrapy, and I didn't know what Scrapy was. So I looked it up. It's in Python. I don't know how to program in Python, I can barely program in Java. But I eventually learned Python just to use Scrapy. Oh, yeah. And it's such a nice way to do web scraping. It's a weird language, this Python. It doesn't have semicolons, curly braces; there's not so many of those. Yeah, compared to Java, it's very clear. Yeah, that's for sure. Java is a little verbose in all of its type information, even for a typed language, compared to C, or, I don't know, maybe even C#. It likes its interfaces and all of its things and so on, right? Yeah. The best thing I love about Python, which I cannot do in Java, is that in Python, you can read the code like an actual sentence. I mean, not always, but sometimes you can, like a sentence. Yeah, there's this cartoon joke about Python where there's these two scientists or something like that, and they've got this pseudocode, it says pseudocode on their file, like pseudocode.txt, and one goes, how do we turn this into a program? Oh, you change the .txt to .py and it'll run. Yeah, it's already there. Yeah, exactly. All right. Well, so that's how you got to it, and obviously your interest in web scraping. So you've continued that interest today, right? What are you doing these days? Yes.
So I've been developing a lot of spiders, a lot of scrapers, with Scrapy and other tools over the years. And nowadays, well, I joined the company who actually created Scrapy,

10:00 Scrapinghub, and I'm with Scrapinghub for over a year now. And what I'm doing day to day is educating people about web scraping, both on the technical side of things and the business side of things: how to extract data from the web, why it is useful for you, or why it can be useful for you. And so nowadays, I don't really code that much, unless it's for an article or a tutorial, or to showcase some new stuff. I spend more time on creating videos and just, you know, teaching. Yeah, well, that sounds really fun. You just get to explore, right? Yeah, and the biggest part, or the best part I love about it, is that I get to speak to customers who are doing web scraping, or who are sort of enjoying the benefits of web scraping. And there are some really cool stories about what these customers do with data. Yeah, and people can get super creative about how to make use of, you know, web data. It's in front of you, you can see the data on a website, and it doesn't look that interesting, but when you extract it and you structure it, you can actually make use of it, and you can do many different things. You can drive better decisions in companies, which is pretty exciting. Yeah. And I guess if you work at a company, you're looking to think about, I'd like to do some comparative analysis against, say, our competitors. What are they doing? What are they doing with sales, in terms of discounts? What are they doing in terms of things they're featuring, right? You could theoretically write a web scraper that looks at your competitors' data, their sites, their presentation, and sort of gives you that over time relative to you, right, or something like that. Yeah, many use cases are, as you said, about monitoring competitors or monitoring the market. And it's especially a big thing in e-commerce, where in most sectors there are a lot of companies doing the same thing, selling the same thing, and it's really hard for them to sell more products. So with web scraping, they can monitor the competitors' prices, they can monitor the competitors' stock information, they can monitor pretty much everything that is on their website, publicly available. And they can gather this information, and, you know, it can be tens of thousands of products or more. And you can see where you should raise your prices, lower your prices, and how to price your products better and do marketing better. Yeah, and there's a lot of implicit signals as well. Most companies won't say there were 720 of these sold today or whatever; some do, but most won't. But you could, say, graph the number of reviews over time or the number of stars over time. Things like that would give you a proxy for that kind of information, and you could come up with a pretty good, pretty interesting analysis.

13:16 This portion of Talk Python to Me is brought to you by Linode. Whether you're working on a personal project or managing your enterprise's infrastructure, Linode has the pricing, support, and scale that you need to take your project to the next level. With 11 data centers worldwide, including their newest data center in Sydney, Australia, enterprise-grade hardware, S3-compatible storage, and the next-generation network, Linode delivers the performance that you expect at a price that you don't. Get started on Linode today with a $20 credit, and you get access to native SSD storage, a 40-gigabit network, industry-leading processors, their revamped cloud manager at cloud.linode.com, root access to your server, along with their newest API and a Python CLI. Just visit talkpython.fm/linode when creating a new Linode account, and you'll automatically get $20 credit for your next project. Oh, and one last thing: they're hiring. Go to linode.com/careers to find out more. Let them know that we sent you.

14:14 Also, some of the things that people monitor with web scraping are more, like, higher level. We have the price, the stock; those are really specific things or values. But with web scraping, you can actually monitor company behaviors, what the company's doing at a high level. And you can achieve this by monitoring news, or looking at, like, the text or the blog of the company, and getting out some useful information from that, and, you know, setting up alerts and things like that. Yeah, super cool. So there's a couple of examples that I wanted to put out there and then get your thoughts on. One is, most of the world has been through the wringer on COVID-19 and all that

15:00 stuff. In the US, we definitely have been recently getting our fair share of disruption from that. So there's an article on Towards Data Science where the challenge was that in the early days of the lockdowns, at least in the United States, we had things like Instacart and other grocery delivery things. Like, there's local grocery stores where you could order stuff, and they would bring it to your car outside, so you wouldn't have to go in, fewer people in the store, all sorts of stuff; there was probably a benefit from that. But if you tried to book it, it was like a week and a half out. Yeah. And it was a mess, right? And so this person, Gil, wrote this article that talked about using web scraping to find grocery store delivery slots, right? So you could basically say, I'm gonna go to my grocery store, or to something like Instacart, and I'm just gonna watch until one of these slots opens up. And then what he had it do was actually send him a text, like, go now and book it, place my order, within an hour or whatever. And I think that that's a really creative way of using web scraping. What do you think? Yeah, it's definitely really creative and useful. So he wasn't just getting alerted when there was a new spot available, he actually created the bot so it, like, chooses that spot, right? He would actually place the order. Yeah, that's smart. I mean, we had the same situation in Hungary, where you couldn't find any free spot. And the thing I love about web scraping is that it can be useful for huge companies doing web scraping at scale, but it can be also useful for literally everyone, just doing little things like this. I have a friend who is looking to buy an apartment, and currently the real estate market is really, like, fragile, I would say. Yeah, yeah. And what he wants to do, what I suggested he do, is to set up a bot that monitors like 10 websites, real estate listing websites, and so he will get alerted when there is an apartment available with the attributes he is interested in. Right. Okay, that's awesome. Yes. So there are so many little things like this in web scraping, and it's just awesome. Yeah, I agree. People often talk about this concept popularized by Al Sweigart, like, automate the boring stuff. Like, oh, I always have to do this thing in Excel, so could I write a script that just generates the Excel workbook? Or I've always got to do this other thing, copy these files and rename them, right? Something just does that. But this feels like the automate-the-boring-stuff of web interfaces: automate the tracking of these sites, and let it find out when there's a new thing. Yeah, Automate the Boring Stuff is a really good resource. And in this case, that's really what it's about. If you didn't create a web scraper for this, what would you do? You would probably look at the website every single day, right, as often as you can, every day, every hour. You'd hear somebody say, I checked it every day until I found the apartment I wanted, right? That's what they would say. Yeah, exactly. Yeah. And what you don't hear them say so often is, my bot alerted me when my dream apartment was available. Yeah, I created a Scrapy spider that alerts me when there's a new apartment listing in an area. Yeah. And I think it's a really great example. And it's like, there's so many things on your computer that you do where you kind of feel like, ah, this stuff I just got to do.
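To make that pattern concrete, here is a minimal sketch of the kind of monitor-and-alert bot being described. The URL, the markup it assumes, and the alert step are all hypothetical placeholders, not details from the episode:

    import time

    import requests
    from bs4 import BeautifulSoup

    LISTINGS_URL = "https://example.com/apartments?city=budapest"  # hypothetical
    seen_ids = set()

    def fetch_listing_ids():
        # Download the page and pull out listing IDs with a CSS selector.
        # Assumes markup like <div class="listing" data-id="...">
        html = requests.get(LISTINGS_URL, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        return {div["data-id"] for div in soup.select("div.listing[data-id]")}

    def alert(new_ids):
        # Stand-in for a real notification: an email, or an SMS via a service
        print("New listings:", sorted(new_ids))

    while True:
        current = fetch_listing_ids()
        new = current - seen_ids
        if new:
            alert(new)
        seen_ids |= current
        time.sleep(15 * 60)  # be polite: poll every 15 minutes, not every second

The same loop works for delivery slots, enrollment pages, or anything else where you only care about what changed since the last check.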
But on the web, it's way more manual. You probably end up with, I always go to this site, and I do this thing, and I've got to do that, right? Web scraping can kind of open up a ton of that stuff. So let me give you two examples: one really grand, large-scale one, and one extremely small but really interesting to me in the next, you know, hour. So grand scale first. When Trump was elected in the United States in 2016, and he was going to become President, there was a lot of concern about what was going to happen to some of the data that had so far been hosted on places like whitehouse.gov. So, like, there's a lot of climate change data posted there from the scientists, and there was a thought that he might take that down, not necessarily destroy it, but definitely take it down publicly from the site. And there's, you know, all these different places, the CDC, other organizations, that have a bunch of data that a bunch of data scientists, scientists, and other folks were just worried about. So there were these big movements, and I struggled to find the articles, you know, almost four years out now, I can't quite find the right search term to pull them up. But there were these, like, save-the-data hackathon type things. So people got together at different universities, like 20 or 30 people, and they would spend the weekend writing web scrapers to go around to all these different organizations and download the data. I think they were coordinated across locations, and then they would put that all in a central location, I think somewhere in Europe,

20:00 Switzerland or something, so that data could live on. And I think the day that Trump was elected, the climate change data went off of whitehouse.gov. So at least one of the data sources did disappear. And that's a pretty large-scale web scraping effort. Some of it was, like, download the CSV, but a lot of it was web scraping, I think. Yeah. The thing is, people say that once you put something on the internet, it's going to stay there forever. And why these stories are interesting is that this is information that is useful for everyone. Yeah. And this is the kind of information that you want to make publicly available. And I think this is one of the things that this movement is about, if I can call it a movement: open data. Yeah, to make data open and accessible for everyone. And web scraping is really a great tool to do that, to extract data from the web and put it in a structured format, in a database, so you can just store the information, or you can get some kind of intelligence from the data. And really, without web scraping, you couldn't do anything else; you'd just copy-paste the whole thing, or, you know, what would you do? It would take a lot of by-hand work. Yeah. Wouldn't be good. Yeah. Yeah, that's really cool. And that was an interesting, large-scale, hey, we have to organize, we've got one month to do it, let's make this happen. So on a much smaller scale, I recently realized that some of the pages on my site were getting indexed by Google, not because anything on the internet was linking to them publicly, but something would be in, like, Gmail, or somebody would write it down, and then Google would find it. So I just went through and put a bunch of noindex meta tags on a bunch of parts of my site. But it really scared me, because I changed a whole bunch of it, and I'm pretty sure I got it right, but what if I accidentally put noindex on some of my course pages that advertise them, and then all of a sudden the traffic goes from whatever it is to literally zero, because I accidentally told all the search engines, please don't tell anyone about this page? So what I was doing actually right before we called, unrelated to the fact that this was the topic, was I was going to write a very simple web scraper thing that goes and grabs my sitemap, hits every single page, and just tells me the pages that are not indexed, just for me to double-check that, you know, that template didn't get reused somewhere where I didn't think it got reused, you know, like the little shared bit that has the head in it, and so on. Yeah, it's sort of like, in SEO, Search Engine Optimization, we have this thing, technical SEO. Yeah. And actually, it's another use case for web scraping, or, like, crawling, where you want to crawl your own website, just, you know, like you did, to learn the sitemap, or to find pages where you don't have proper SEO optimization, like you don't have a meta description on some pages, or you have multiple h1 tags, or whatever, right? And with crawling, you can figure these out. And I just remember the story from, like, 2016, where web scraping was used to reveal lobbying and corruption in Peru. Oh, wow. Okay. Yeah, I'm not a hundred percent sure, but I think it was related to, you know, the Panama Papers. When? Yeah, yeah, I was gonna guess it might have something to do with the Panama Papers, which was really interesting in and of itself. Yeah. And they actually used many tools to scrape the web.
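Before going on, here is what the sitemap noindex audit Michael described a moment ago might look like as a minimal sketch. The sitemap URL is a placeholder, and the XML parsing assumes lxml is installed:

    import requests
    from bs4 import BeautifulSoup

    SITEMAP_URL = "https://example.com/sitemap.xml"  # placeholder

    # Parse the sitemap for page URLs ("xml" parsing requires lxml)
    sitemap = BeautifulSoup(requests.get(SITEMAP_URL, timeout=30).text, "xml")
    urls = [loc.text for loc in sitemap.find_all("loc")]

    for url in urls:
        page = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
        robots = page.find("meta", attrs={"name": "robots"})
        if robots and "noindex" in robots.get("content", "").lower():
            print("NOINDEX:", url)  # pages that search engines will drop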
But they also used Scrapy to get information from these, I guess, government websites, or other websites. But yeah, it's crazy that web scraping can reveal these kinds of really big things. Yeah, corruption. And web scraping can tell you that, hey, there is something wrong here, you should pay attention. It lets you automate the discovery and the linking of all this different data that was never intended to be put together, right? There's no API that's supposed to take, like, these donations, this investment, this other construction deal, or whatever the heck it was, and link them together. But with web scraping, you can sort of pull it off, right? Yeah, you can structure the whole web. I mean, technically, you could structure the whole web. Or you could get a set of websites that have the same kind of data you're looking for; it can be e-commerce, fintech, real estate, whatever. You can grab those sites, extract the publicly available data, structure it, and then you can do many sorts of things. You can do, like, NLP, you can just search in the data, you can do many different kinds of things. Yeah, that's cool, too. Let's talk a little bit specifically about Scrapy, and maybe Scrapinghub as well, and some of the more general challenges you might run into with web scraping. Right on the Scrapy website, the homepage of the project site, there's a real simple example that says here's why you should use it. You can just create this class that

25:00 derives from scrapy.Spider, and you give it some URLs, and it has this parse function, and off it goes. Do you want to talk us real quickly through what it's like to write code to do web scraping with this API, with Scrapy? Yeah, yeah. So with Scrapy, if you use Scrapy, what you really get is a full-fledged web crawling framework, or like a web scraping framework. So it really makes it easy for you to extract data from any website. There are other libraries, like Beautiful Soup or lxml, or in Java we have jsoup, that are only focused around parsing HTML and getting data out of HTML and XML files. Right, Beautiful Soup is, I have the text of the HTML source, now tell me stuff about it, right? That's sort of its whole job. Yeah, right. But that, I would say, is maybe 10% of the whole picture, or the big picture. Because with Scrapy, yeah, you can parse HTML, but in a real-world project, that's not the only task you need to do. In a real project, you need to process the data as well, you need to structure the data properly, you need to maybe put it into a database, maybe export it as a CSV file or JSON, you need to clean the data, you need to normalize the data. In real-world scenarios, you need to do a lot of things, not just scrape the data. And Scrapy really makes it easy to develop these spiders that grab the data. And then right there in Scrapy, you can clean it, you can process it, and when the data leaves Scrapy, it's usable for whatever you want to use it for. Right, maybe you get a list or sequence of dictionaries that have the pieces of data that you went after, something like that. Yeah, right. So that's why I really like Scrapy. But also, I think it's the best framework to maintain your spiders, because a big problem in web scraping is that once you write the spider, it works, you get the data, but it will probably not work six months in the future, because the website changes. The layout changes, they've decided to switch JavaScript front-end frameworks or redesign the navigation or something, right? Exactly. And so you need to adjust your code, or just your selectors, and you need to maybe adjust the whole processing part, because maybe in the past you were able to extract only messy code from the HTML, but they did some changes on the website and now it's clean, and you don't need so much processing. And with Scrapy, it's really easy to maintain the code, which is, you know, truly important if you rely on that structured data on a daily basis; you need to maintain the code. Really neat. So, to me, when I look at this simple API here, it's like, you create a class, you give it a set of start URLs, there's a parse function, and when the response comes in, you run some CSS selectors on what you get. For each thing that you find, you can yield that back as part of the result you found, and then you can follow the next one. I think that's probably the biggest differentiator, from what I can tell: as you're parsing it, you're directing it to go do more discovery, maybe on linked pages and things like that. Yeah, like on websites where you need to go deeper, maybe because, you know, you need to go deeper in the sitemap, or you need to, like, paginate, you need to do more requests to get the data. Yeah, in tools like Beautiful Soup, you would need to sort of create your own logic for how you want to do that.
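The homepage example under discussion has roughly this shape: a class deriving from scrapy.Spider with some start URLs, a parse method that runs CSS selectors and yields items, and a followed pagination link. This sketch targets quotes.toscrape.com, Scrapinghub's public practice site:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["http://quotes.toscrape.com/"]

        def parse(self, response):
            # Run CSS selectors over the page and yield structured items
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
            # Direct further discovery: follow the pagination link and let
            # Scrapy schedule the next request
            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)

Running it with, for example, scrapy runspider quotes_spider.py -o quotes.json leaves the scheduling, retries, and JSON export to the framework.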
But in Scrapy, it's already sort of straightforward how you can do that. It's, you know, click this button, click this button, and you paginate, right, because you have these high-level functions and methods that let you paginate and let you do things in a chain. Yeah, yeah, very cool API. So let's talk about some of the challenges you might hit. To me, there are two problems that I would say are the two biggest challenges. One is probably, maybe generally you could call it data quality, but more specifically, the structure changed, right? They've added a column to this table, and it's not what I thought it was anymore. Like, it used to be temperature, and now it's pressure, because they've added that column, and I wrote it to expect three columns in the table, and now it's got four. The other one is, I go and get the source code of the page, and it says, you know, some Vue application, or there's an Angular app or something, and you get just a bunch of script tags that have no data in them, because it's not actually executed yet. Yeah, I mean, these are really all the reasons

30:00 why you need to maintain the code, and it takes time to maintain the code. Actually, that's why there are more and more advanced machine learning technologies that pretty much maintain the code for you. Okay, give us an example of that. Like, find me the title, or find me the stock price off of this page, machine magic? Yes. So actually, it's not that big of a magic. In the future, you won't need to write things like selectors. You won't need to tell Scrapy or, you know, whatever tool you use, hey, this is the price, the title, whatever, I want to extract it, this is the selector where you can find it in the HTML. In the future, you will not need to do that. What you will only need to say is, this is the page, this is a real estate listing, or this is a product page, or this is some kind of financial page. So you just need to say the data type. Yeah. And then the machine will know what fields to look for. So it will know that if it's a product page, you probably need the price, you probably need the product title, or product name, sorry, description, stock information, maybe reviews, if any. So in the future, you just need to specify the data type, and you will get all the data fields that you want, with ML. Yeah, that's like a dream. Well, it's sort of a reality in some cases, because, you know, there are some products out there that do this. At Scrapinghub we have, for example, a news API, which does exactly this thing for articles and news. Okay. Yes. So what you do is, hey, this is the page URL, give me everything, and you get the article title, you get the tags, the body, you get, you know, everything that is on the page. Well, and if any place is gonna have enough data on scraping to answer those kinds of questions more generally, it sounds like you guys might. You guys do so much web scraping, right, way more than a lot of people, in terms of being a platform. I don't know how you, like, share that across projects, I don't see a good way to do that, but it seems like you should be in a position where somehow you could build some cool tools around all that. Yeah, we do a lot of scraping, as you said. And really, the biggest problem is that there is no one schema that websites follow when it comes to data. So when you scrape, you need to figure out the structure that would work for all the websites, and then when you extract, you sort of standardize the data. And that's why the data type is important to specify, right? What about JavaScript? That's what all the cool kids say they should do: don't write server code anymore, don't write templates, just write APIs, and you talk to them with HTML and JavaScript. But that doesn't scrape very well. I'm not sure that's actually what you should be doing entirely, but a lot of people are, and it makes web scraping definitely more challenging. Yeah, nowadays all the websites have some kind of JavaScript running. And if the data you need to extract is rendered with JavaScript, that can be a challenge. If you're writing a spider, first you should look for ways to get the data without executing JavaScript. You can do this with what I call hidden APIs, when there is no documented API for the website, but if you look in the background, there is actually a public API available that the page itself is calling.
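A minimal sketch of that hidden-API move, once the endpoint shows up in the browser's network tab. The URL and the response shape here are assumptions, not a real site:

    import requests

    # Hypothetical endpoint spotted in the browser's network tab; the page's
    # JavaScript calls it and renders the result client side
    API_URL = "https://example.com/api/products"

    resp = requests.get(API_URL, params={"page": 1}, timeout=30)
    resp.raise_for_status()

    for product in resp.json()["results"]:  # assumed response shape
        print(product["name"], product["price"])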
Right, like if you've got a Vue.js front end, it's probably calling some function that's returning, like, a JSON result of the data you actually want. You just got to look in your web browser tools to see that happening, right? Yeah, exactly. With Ajax calls, many times that's the case; you just need to sort of grab the JSON, and you can get the data easily. Well, and it's formatted JSON, right? It's not just HTML. I mean, it's pretty sweet, actually, to get a hold of that feed. Oh, yeah. But on the other side, if you cannot find an API like this, you need to execute JavaScript using some kind of headless browser, like Selenium, or Puppeteer, or Splash, because, you know, you need an actual browser to execute JavaScript, right? But then it's gonna take more hardware resources to run the JavaScript, it's going to be slower, and it is going to make the whole process a little bit more complicated. Yeah. There's probably no mechanism built right into Scrapy that will render the JavaScript, is there? Like, Scrapy itself cannot render JavaScript, but there is a really neat integration with Splash. Okay. Splash is also maintained, it was created by Scrapinghub, so it's really easy to use. Splash is sort of like a headless browser created

35:00 for web scraping. And you can integrate Splash with Scrapy really easily. It's literally just like a middleware in Scrapy, and then it will execute JavaScript for you. Yeah, so you will be able to see the data, and you will be able to extract it. Right, it says: the headless browser designed specifically for web scraping, turn any JavaScript-heavy site into data. That looks pretty neat. That's open source, people can just use that as well, or what's the story there? Yeah, just like Scrapy. They can set up their own server, and you can use it that way. Or you can also host it on Scrapinghub.
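The hookup being described looks roughly like this, following the scrapy-splash project's documented settings; the target page is hypothetical, and Splash is assumed to be running locally, for example via its Docker image:

    # settings.py: wire Splash in as Scrapy middleware
    SPLASH_URL = "http://localhost:8050"
    DOWNLOADER_MIDDLEWARES = {
        "scrapy_splash.SplashCookiesMiddleware": 723,
        "scrapy_splash.SplashMiddleware": 725,
        "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
    }
    SPIDER_MIDDLEWARES = {
        "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
    }
    DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"

    # In the spider: request pages through Splash so the JavaScript runs first
    from scrapy_splash import SplashRequest

    def start_requests(self):
        yield SplashRequest(
            "https://example.com/js-heavy-page",  # hypothetical target
            self.parse,
            args={"wait": 1.0},  # give the page a moment to render
        )

After that, parse receives the rendered HTML, and the usual CSS selectors work as if the site were server-rendered.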

35:40 Talk Python to Me is partially supported by our training courses. How does your team keep their Python skills sharp? How do you make sure new hires get started fast and learn the Pythonic way? If the answer is a series of boring videos that don't inspire, or a subscription service you pay way too much for and use way too little, listen up. At Talk Python Training, we have enterprise tiers for all of our courses. Get just the one course you need for your team with full reporting and monitoring, or ditch that unused subscription for our course bundles, which include all the courses and you pay about the same price as a subscription, once. For details, visit training.talkpython.fm/business or just email sales@talkpython.fm.

36:24 Yeah, so let's maybe round out our conversation talking about hosting. So I can go and run web scraping on my system, and that sometimes works pretty well, sometimes it doesn't. There was a time a friend of mine was thinking about getting a car and putting it on one of these car sharing sites, to keep it a little bit general. And we wanted to know, okay, for a car that's similar to the kind of car he's thinking about getting, how often is it rented? What does it rent for? So would it be worthwhile as just a pure investment, right? So I wrote up some web scraping stuff that would just monitor a couple of cars I picked. After a little while, they didn't talk to me anymore. The website was mad at me for always asking about this car. Yeah, so that seems like a common problem. And if I do that from my house, I probably can't check out that site anymore for a while, right? So I probably shouldn't do it from my house. But even from a dedicated server, that might cause problems. Yeah, and it's not just web scraping. I remember I was doing some kind of financial research, it was about one of the cryptocurrencies, and I was using an actual API, like, it wasn't scraping, I was using an actual API. And it was pretty basic, like, I was just learning how to use this thing. And I created a for loop, and I put the API request in the for loop, and I was just testing. And by accident, I put like a huge number

37:49 in the for loop, like, I don't know, a million or something. And it created a million requests; it really tried to create a million requests, but it stopped after, I don't know, maybe 10,000 requests in a minute, right? Because, you know, the server didn't like that I was doing that. And it's the same thing with web scraping: you need to be really careful not to hit the website too hard. Because when you scrape the web, it's very important to be ethical and be legal. And one of the things that you can do to achieve this is to just put in some kind of delay limit, just to respect the website and make sure that you don't cause any harm to the website. Right, because as soon as people see what you're doing, you think you're just getting data, but they might perceive it as, this is a distributed denial of service attack against my site, we're gonna block this person. Right, that's not a good situation you want to be in, and it's also not super nice to whoever runs that website to, you know, hammer it. Exactly. The website doesn't know what you want to do, it just sees a bot. And truly, you know, because web scraping itself is legal, you can do it if it's a publicly available website, you need to make sure that you are being, I mean, we use this term in the web scraping community, you need to be nice, you need to be nice to the website. And, you know, you can look at the robots.txt file; usually there is a delay specified in the robots.txt file that tells you, hey, you should put like three seconds between each request. But in general, even if there is no such rule defined in the robots.txt file, you should pay attention. You should be careful with how many requests you make and how frequently you make those requests. So yeah, it's important. Yeah, interesting. And so I guess one option is, you guys have a cool thing, which I just discovered, I don't know how new it is, but Scrapyd, a daemon that lets you basically deploy your Scrapy scripts to it, and it'll schedule and run them, and that's pretty cool. And you can control that, so you could set that up yourself. Or obviously you have the business side of your whole story, which is Scrapinghub.
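Those "be nice" knobs live in Scrapy's settings file. Here is a sketch of a conservative configuration; the setting names are real Scrapy settings, while the numbers are illustrative judgment calls, not rules from the episode:

    # settings.py: a conservative, polite crawl configuration
    ROBOTSTXT_OBEY = True               # honor the site's robots.txt rules
    DOWNLOAD_DELAY = 3                  # seconds between requests to a site
    CONCURRENT_REQUESTS_PER_DOMAIN = 1  # never hammer one host in parallel
    AUTOTHROTTLE_ENABLED = True         # back off automatically if the site slows
    AUTOTHROTTLE_START_DELAY = 5
    # Identify yourself so the site operator can reach you (address is hypothetical)
    USER_AGENT = "my-research-bot (+https://example.com/bot-info)"

With AutoThrottle on, Scrapy adapts the delay upward whenever responses slow down, which honors the spirit of robots.txt even when no crawl delay is declared.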

40:00 So what does Scrapinghub do that makes it better, yes, than just, like, getting a server at AWS or Linode or something like that? Yeah, so with Scrapinghub, you can deploy your spiders, but really, it's made for web scraping. All the tools are available in the Scrapinghub platform that you would want to use for web scraping. Like, you know, we mentioned Scrapy and Splash: in the Scrapinghub platform, you can upload your spider, your Scrapy spider, and then with a click of a button, you can add Splash, so it renders JavaScript. Or, if you want to use proxies for web scraping, with a click of a button, you can add proxies to your web scraping. Also, if you have a website, like we mentioned, you know, articles and news, if you want to extract data, we have an API, so you don't have to actually write the spider; you can just use the API and get the data, so you don't need to maintain the code. So Scrapinghub isn't really just about hosting your scraping code. I truly believe that it's the best platform if you're doing web scraping at scale, because you have all the tools in one place that you would want to use for web scraping. Right, right. So a lot of stuff you might bring your own infrastructure for, you guys have as-a-service type stuff. Yeah, exactly. You can sign up, we have a free trial, I think, for most of our services. You can try it, you can see how it works, and there's a monthly fee, and that's it. Okay, yeah, that seems pretty awesome. And also, we have a lot of clients who don't want to deal with the stuff, deal with the technical side, so they just say, hey, I want this type of data from these websites, give me the data. And so we just do the hard work for them. Right, maybe they're not even Python people. Yeah, they might be, like, they don't even know how to code. Right, they could be political scientists who are like, we need to just scrape this, but, like, you guys do scraping, right? Yeah, yeah, exactly. So let's round out our conversation on this, just a quick comment on Scrapinghub, a quick thought on it. So when I talked to Pablo back in 2016, I had a show with him about Scrapy and Scrapinghub and all that, back in Episode 50, four years ago or something. One of the things that really impressed me about what you guys are doing, and continues to be the case, is you've taken pretty straightforward open source projects, not something completely insane like a whole new database model or something, but a pretty straightforward but polished model of web scraping, and you've built a really powerful business on top of it. And I think it's an interesting model for taking open source and making it your job, more than just, I'll do consulting on this thing, right? If I wrote this web framework, I could maybe sustain myself by consulting with people who want to write websites in that framework or so. But this is a pretty interesting next-level type of thing that you guys have pulled off here. What are your thoughts on this? Yeah, a hundred percent agree. I think, you know, Pablo Hoffman and Shane Evans, they did an awesome job when they created Scrapy. At the time, it was the first real web scraping framework, as far as I know, and they open sourced it. And it's really, like, if you look at the GitHub page of Scrapy, I feel like it's a community: in the PRs, people are chatting about things, people are talking about how the community can improve this tool.
And then the fact you've got like 17,000 GitHub questions or something tagged on it adds quite a bit. Yeah. And the fact that they were able to pull this thing into a business, but in a way where people can still use Scrapy without using our services, they can still use Splash without paying, they can use it, it's open source. So they built on top of the open source tools in a way that you don't need to be affiliated with the company to get the benefits of Scrapy and other open source tools. And there are many other tools, it's not just Scrapy, there are so many other tools for web scraping. And I agree, I think it's just really amazing that they have been doing this. Yeah, it seems like it's really adding value, because the core thing is already open source, and people could just use it, you know? Yeah, especially if they want to do it at scale. You just cannot do huge things, like millions of extracted records or things like that, on your computer or on your laptop. You just cannot do that. No, you get about 50,000, then you get blocked. Yeah, yeah, that's an option as well. Yeah, exactly. So yeah, very cool. I think it's a really neat example, because a lot of people are thinking about, how do I take this thing that's got some traction, I'm doing open source, and make this my job, because I'm tired of filing JIRA tickets at this thing that's kind of okay but not really my passion, right? A cool example. So congrats, you guys, keep up the good work. Yeah, we definitely try our best. There are some awesome things to come in

45:00 the web scraping world, I believe, with the advancements of machine learning and other stuff. So it's gonna be really interesting to see what's going to happen in the next few years. Yeah, for sure. All right. Well, I think that's probably it for the time we've got to talk about web scraping. But the final two questions before we get out of here: if you're gonna write some code, what editor do you use for Python code? PyCharm. PyCharm, right on. And then, notable PyPI package? I mean, there's obviously pip install scrapy, but maybe some project that you came across where you're like, oh, this is so awesome, people should know about X. It's a hard question, because I've been sort of out of the game for a lot of months now. Aside from Scrapy, I don't have one specific example. I just really like to find these GitHub repositories where it's not, like, a library, but a collection. There are many GitHub repos that are a collection of templates, right? You know, like, hey, if you're a beginner, you can use these templates to get started. And for me, that was really useful when I was starting out with a new tool, Scrapy or other libraries, that you can use this starter code, sort of. Right, right. I wonder if there's a cookiecutter Scrapy? Possibly, probably. Yeah, there is. I'll throw it into the show notes. But yeah, cool. Just cookiecutter, make me a Scrapy project, let's roll and do some web scraping. And, you know, just one last note also: I think it's called awesome-web-scraping, or something like that, which has hundreds of tools for web scraping. Oh, yeah. We can link it later if I find it. Okay. Yeah, it's awesome-web-scraping on GitHub. Yeah, nice. I'll put that in the show notes as well. I love those awesome lists. Yeah, it's easy to get lost, because you're like, oh, I have this one thing and I use it, then all of a sudden you're like, oh, look at that, there are like 10 other options in this thing that I thought I knew. Yeah, exactly. Awesome. All right, Attila, it's been great to share all these web scraping stories and chat with you. So thanks for being here. Yeah, it's been amazing. Thank you. Thank you very much. Yeah, you bet. Bye. Bye-bye. This has been another episode of Talk Python to Me. Our guest in this episode has been Attila Tóth, and it has been brought to you by Talk Python Training and Linode. Start your next Python project on Linode's state-of-the-art cloud service. Just visit talkpython.fm/linode, L-I-N-O-D-E, and you'll automatically get a $20 credit when you create a new account. Want to level up your Python? If you're just getting started, try my Python Jumpstart by Building 10 Apps course, or if you're looking for something more advanced, check out our new async course that digs into all the different types of async programming you can do in Python. And of course, if you're interested in more than one of these, be sure to check out our Everything Bundle; it's like a subscription that never expires. Be sure to subscribe to the show: open your favorite podcatcher and search for Python; we should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it. Get out there and write some Python code.
