
Web scraping, the 2020 edition

Episode #283, published Wed, Sep 23, 2020, recorded Wed, Jul 22, 2020

Web scraping is pulling down the HTML of a website and parsing useful data out of it. The use cases for this type of functionality are endless. Have a bunch of data on governmental sites that is only listed online in HTML, with no download? There's an API for that! Do you want to keep abreast of what your competitors are featuring on their site? There's an API for that. Need alerts for changes on a website, for example when enrollment opens at your college and you want to be first to get in and avoid the 8 a.m. Monday morning course slot? There's an API for that.

That API is screen scraping, and Attila Tóth from Scrapinghub is here to tell us all about it.
Attila Tóth on LinkedIn: linkedin.com
Scrapy project: scrapy.org
Scrapinghub on Twitter: @scrapinghub
Scrapinghub: scrapinghub.com
cookiecutter template for Scrapy projects: github.com
Splash: headless browser designed specifically for web scraping: scrapinghub.com/splash
Awesome Web Scraping list: github.com

Talk Python episode 50 on web scraping: talkpython.fm
How Web Scraping is Revealing Lobbying and Corruption in Peru: blog.scrapinghub.com
Web Data Extraction Summit event: extractsummit.io
Episode transcripts: talkpython.fm

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy

Episode Transcript


00:00 Web scraping is pulling the HTML of a website down and parsing useful data out of it.

00:04 The use cases for this type of functionality are endless.

00:07 Have a bunch of data on governmental sites that are only listed online in HTML without a download?

00:13 There's an API for that.

00:15 Do you want to keep abreast of what your competitors are featuring on their site?

00:19 There's an API for that.

00:20 Need alerts for changes on a website?

00:23 For example, enrollment is now open at your college, and you want to be first and avoid that 8 a.m. morning slot?

00:29 Well, there's an API for that as well.

00:31 That API is screen scraping, and Attila Tóth from Scrapinghub is here to tell us all about it.

00:36 This is Talk Python to Me, episode 283, recorded July 22, 2020.

00:41 Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities.

01:00 This is your host, Michael Kennedy.

01:02 Follow me on Twitter, where I'm @mkennedy.

01:05 Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython.

01:11 This episode is brought to you by Linode and us.

01:15 Python's async and parallel programming support is highly underrated.

01:20 Have you shied away from the amazing new async and await keywords because you've heard it's

01:24 way too complicated or that it's just not worth the effort?

01:27 With the right workloads, a hundred times speed up is totally possible with minor changes to your code.

01:33 But you do need to understand the internals, and that's why our course, Async Techniques and Examples in Python,

01:38 shows you how to write async code successfully, as well as how it works.

01:44 Get started with async and await today with our course at talkpython.fm/async.

01:49 Attila, welcome to Talk Python to me.

01:52 Thanks for having me, Michael.

01:53 Yeah, it's great to have you here.

01:55 I'm looking forward to talking about web scraping.

01:57 It's the API that you didn't know existed, but it's out there if you're willing to find it.

02:02 It's like the dark API.

02:03 Yeah, it's like a secret API that you need to do a little work to be able to use, to earn the right to use it.

02:11 Yeah.

02:12 But yeah, I like it.

02:13 It's useful.

02:14 It definitely is.

02:15 And we've got some pretty interesting use cases to talk about and a little bit of history and stuff.

02:20 I'm obviously going to talk about Scrapy, some stuff about Scrapinghub, what you guys are doing there.

02:24 So a lot of fun stuff.

02:25 But before we get to all that, let's start with your story.

02:27 How'd you get into programming in Python?

02:28 Yeah, sure.

02:29 So actually, I got into programming back in elementary school.

02:33 I was the sort of kid, I think I was like eighth grade.

02:37 And at the time, everybody in my class in school, everybody was asking for like,

02:42 I think it was Xbox 360 that just came out.

02:46 Everybody was asking for that for Christmas.

02:48 Yep.

02:49 And I was the one who was asking for a book about programming.

02:54 What language did you want at that time?

02:56 Yeah.

02:57 So actually, I did like a lot of research.

03:00 I think for like three months, I was researching which programming language is the best for beginners.

03:07 So I was on different forums, Yahoo Answers, in every place on the internet.

03:13 And many people suggested at the time, Pascal.

03:16 So I got this book.

03:18 I think it was called something like Introduction to Turbo Pascal or something like that.

03:23 Right.

03:24 Okay.

03:24 Yeah.

03:25 So I got the book.

03:26 And that was the first step.

03:28 And then after like 40 pages into the book, I threw it away because I found that just Googling stuff,

06:35 and searching Stack Overflow, it's just easier to learn it that way.

03:41 Nice.

03:42 So did you end up writing anything in Turbo Pascal?

03:45 Yeah.

03:46 I was just, you know, the regular things beginners usually do.

03:49 Like in the terminal, I, you know, I printed my name in red.

03:54 I printed like figures with characters, you know, just the regular things beginners do.

04:00 But then I quickly switched to Java after Pascal.

04:04 Right.

04:05 Java, you can build web apps, you can build GUIs, more interesting things.

04:10 Yeah.

04:10 At the time I was interested in Android, Android development.

04:14 Actually, it was a few years later when I got to high school, I started to like program in Java to develop Android apps.

04:22 I created like, I don't know, maybe like five or six apps for Google Play that I just found useful,

04:30 like minor things, like a music player that randomly plays songs that are on your phone.

04:37 And the only input is how long you want to listen to music.

04:41 Oh, cool.

04:41 So I was like, I want to listen for 20 minutes.

04:44 And it just randomly plays music for 20 minutes.

04:47 So, you know, little things like that.

04:49 Yeah.

04:49 Cool.

04:50 How'd you find your way over to Python?

04:51 Thanks to, thanks to web scraping actually.

04:54 Yeah.

04:54 Yeah.

04:55 You know, I'm not sure which was the first point.

04:58 No, I actually remember which was the first time I came across web scraping.

05:02 I was trying to develop like a software.

05:06 Like I was pretty into sports betting, football and soccer.

05:09 And I didn't play high stakes because I wasn't that good to afford to do it.

05:16 But I was interested in the analysis part, you know, looking up stats, looking up injury information,

05:22 looking at different factors that would determine the result, the outcomes of games.

05:28 And I wanted to create a software that would predict the outcome of these games or some of the,

05:36 we have, I don't know, 25 games in a weekend.

05:38 And it would predict the like, okay, look at these three games.

05:43 And these are going to be the results.

05:45 So I wanted to create a software like this.

05:47 Right.

05:48 And if that beats the spread that they are offering at Las Vegas, you're like,

05:51 oh, they're going to win by 14.

05:53 Oh, they said they're only winning by seven.

05:55 I want to take that.

05:56 Maybe.

05:56 Yeah.

05:57 I tried that.

05:58 And I also tried.

05:59 How did it go?

06:00 Yeah.

06:03 Well, the thing is, I wouldn't say it went too well, but it wasn't too bad.

06:07 I mean, I experimented, like, first of all, I was betting on like over and under,

06:12 because I found that that's like a more predictable thing to bet on.

06:17 That's what I thought, because in reality, it isn't, but I bet on over and under.

06:22 And so this software would tell me that, hey, in this game, you know, if it's soccer,

06:29 there will be more than two goals in the first half.

06:33 And so what I would do is I would watch the game live, because live, there are better odds,

06:40 because as time progresses, the odds for scoring in the first half go up.

06:45 So I just waited for like the, I think it was the 15th minute or something like that.

06:50 And if there were still no goals scored in the first half, I would bet on it.

06:55 So I was doing this for like, I think four months without money, just testing the system.

07:01 Right.

07:02 I was watching and okay, I would bet right now.

07:05 And I was calculating, I had this Excel spreadsheet and everything.

07:08 And in that four months period, I was just, I think it was not profitable at all,

07:14 but it was close to being just a little profitable long term.

07:19 Yeah.

07:19 Well, maybe you were onto something.

07:21 If you get some more factors, maybe bring in a little machine learning, who knows?

07:25 Yeah, actually, maybe I should pick it up.

07:28 But that's when I came across web scraping.

07:30 Right.

07:31 Because I needed data.

07:33 I needed stats.

07:34 And a lot of these places, they don't want to give out that data.

07:37 They'll show it on their website, but they're not going to make an API that you can just stream.

07:40 Right.

07:41 Exactly.

07:42 At the time, there were like different commercial APIs that big news corporations use,

07:50 but it was just a hobby project.

07:51 And the only way for me to get data was to extract it from a publicly available website.

07:59 And so I started to look into it.

08:02 And at the time I was programming in Java.

08:04 So I found Jsoup.

08:06 I found HtmlUnit, which is a testing library, but it can be used for web scraping.

08:12 I found, is it Jaunt?

08:15 I think it's pronounced Jaunt.

08:16 And I found other libraries.

08:19 And everybody was saying that, which one is the best?

08:22 Which one is the best?

08:23 Everybody was talking on the forums about Scrapy.

08:26 And I didn't know what Scrapy was.

08:28 So I looked it up.

08:29 It's in Python.

08:30 I don't know how to program in Python.

08:32 I barely can program in Java.

08:34 But then I eventually learned Python just to learn Scrapy.

08:39 Cool.

08:40 Yeah.

08:40 And it's such a nice way to do web scraping.

08:42 It's a weird language.

08:44 It's Python.

08:45 It doesn't have semicolons, curly braces.

08:47 There's not so many of those.

08:49 Yeah.

08:49 It was like comparing it to Java.

08:51 It's very clean.

08:53 Yeah.

08:53 That's for sure.

08:54 Java is a little verbose in all of its type information, even for a typed language,

08:58 like say compared to C or I don't know, maybe even C#.

09:02 Like it likes its interfaces and all of its things and so on.

09:06 Right.

09:06 Yeah.

09:07 The best thing I love about Python, which I cannot do in Java, is that in Python,

09:11 you can read the code like an actual sentence.

09:15 Right.

09:16 I mean, not always, but sometimes you can, it's like a sentence.

09:19 Yeah.

09:19 There's this cartoon joke about Python that there's these two scientists or something like that.

09:26 And they've got this pseudocode.

09:28 It says pseudocode on their file.

09:29 And it's like pseudocode.txt.

09:31 They're like, oh, how do we turn this into a program?

09:34 Oh, you change the txt to a .py and it'll run.

09:36 Yeah.

09:37 It's already there.

09:38 Yeah, exactly.

09:39 All right.

09:40 Well, so that's how you got into it.

09:42 And obviously your interest in web scraping.

09:44 So, you've continued that interest today, right?

09:47 What are you doing these days?

09:48 Yeah.

09:49 So I've been developing a lot of spiders, a lot of scrapers, with Scrapy and other tools over the years.

09:55 And nowadays I've joined the company that actually created Scrapy: Scrapinghub.

10:01 And I've been with Scrapinghub for over a year now.

10:04 And what I'm doing is as a day-to-day is educating people about web scraping,

10:09 both on the technical side of things and business side of things, like how to extract data from the web,

10:16 why is it useful for you or why it can be useful for you.

10:20 And so nowadays I don't really code that much unless it's for an article or a tutorial or to showcase some new stuff.

10:29 I spend more time on creating videos and just, you know, teaching.

10:34 Yeah.

10:34 Well, that sounds really fun.

10:35 You just get to explore, right?

10:37 Yeah.

10:37 And the biggest part or the best part I love about it is that like, I get to speak to customers who are doing web scraping or who are sort of

10:47 enjoying the benefits of web scraping.

10:49 And there are some really cool stories, like what these customers do with data.

10:55 Yeah.

10:56 And it's really creative.

10:57 Like people can get super creative, like how to make use of, you know, web data,

11:02 which is like, it's in front of you.

11:05 You can see the data on your website.

11:06 It doesn't look like interesting or like, you know, but when you extract it and you structure it,

11:12 you can actually make use of it and drive, you know, you can do many different things,

11:17 but you can drive better decisions in companies, which is pretty exciting.

11:21 Yeah.

11:22 And I guess if you work at a company and you're looking to think about, I'd like to do some comparative analysis against say our competitors and what

11:31 are they doing?

11:32 What are they doing with sales?

11:33 Like in terms of discounts, what are they doing in terms of things are featuring,

11:37 right?

11:38 You could theoretically write a web scraper that looks at your competitors'

11:42 data, their sites, their presentation, and sort of gives you that over time relative to you,

11:48 right?

11:48 Or something like that.

11:49 Yeah.

11:50 Like many use cases are, you know, as you said about monitoring competitors or monitoring the market.

11:59 And especially it's a big thing in e-commerce where most of the sectors in e-commerce are

12:06 really like, there are a lot of companies doing the same thing, selling the same thing.

12:12 And it's just really hard for them to sell more products.

12:16 So with web scraping, they can monitor the competitor prices.

12:21 They can monitor the competitors' stock information.

12:25 They can monitor pretty much everything that is on their website, publicly available.

12:31 And they can gather this information.

12:33 And, you know, it can be like tens of thousands of products or more.

12:38 And you can see where you should raise your prices, lower your prices, and how to price your products better and do marketing better.

12:49 Yeah.

12:50 And there's a lot of implicit signals as well that are not, you know, they're not most companies.

12:54 Some do, but most won't go, there are 720 of these sold today or whatever.

12:59 Yeah.

13:00 But you could, say, graph the number of reviews over time or the number of stars over time,

13:06 things like that would give you a sense of like a proxy, that kind of information.

13:10 Right.

13:13 You could come up with a pretty good and pretty interesting analysis.

13:13 This portion of Talk Python to Me is brought to you by Linode.

13:18 Whether you're working on a personal project or managing your enterprise's infrastructure,

13:22 Linode has the pricing, support, and scale that you need to take your project to the next level.

13:27 With 11 data centers worldwide, including their newest data center in Sydney,

13:31 Australia, enterprise-grade hardware, S3 compatible storage, and the next generation network,

13:38 Linode delivers the performance that you expect, at a price that you don't.

13:42 Get started on Linode today with a $20 credit, and you get access to native SSD storage,

13:47 a 40 gigabit network, industry-leading processors, their revamped cloud manager,

13:52 cloud.linode.com, root access to your server, along with their newest API,

13:56 and a Python CLI.

13:58 Just visit talkpython.fm/Linode when creating a new Linode account, and you'll automatically get $20 credit for your next project.

14:06 Oh, and one last thing, they're hiring.

14:08 Go to linode.com/careers to find out more.

14:11 Let them know that we sent you.

14:15 Also, some of the things that people monitor with web scraping are more high-level.

14:20 We have the price, stock, but those are, like, really specific things, or values.

14:26 But with web scraping, you can actually monitor, like, company behaviors,

14:30 what the company is doing on a high level.

14:33 And you can achieve this by monitoring news, looking for, like, the text,

14:39 or the blog of the company, and getting out some useful information from that.

14:45 And, you know, setting up alerts and things like that.

14:47 Yeah, yeah, super cool.

14:48 So there's a couple of examples that I wanted to put out there, and then get your thought on it.

14:54 One is, you know, most of the world has been through the wringer on COVID-19

14:59 and all of this stuff.

15:00 In the U.S., we definitely have been recently getting our fair share of disruption from that.

15:05 So there's an article on towardsdatascience.com where some, the challenge was that

15:12 in the early days of the lockdowns, at least in the United States, we had things like Instacart

15:18 and other grocery delivery things, like there's local grocery stores, you could order stuff

15:23 and they would bring it to your car outside so you didn't have to go in,

15:26 other people, fewer people in the store, all sorts of stuff that was probably a benefit from that.

15:30 But if you tried to book it, it was like a week and a half out.

15:34 Yeah.

15:34 And it was, it was a mess, right?

15:36 And so this person, as they wrote it, Emmy GIL wrote this article that talked about using web scraping

15:44 to find grocery store delivery slots, right?

15:48 So you could basically say, I'm going to go to my grocery store, or to something like

15:51 Instacart, and I'm just going to watch until one of these slots opens up.

15:55 And then what it would do was actually send him a text, like, go now and book it,

15:59 place my order now or whatever.

16:01 And I think that that's a really creative way of using web scraping.

16:05 What do you think?

16:06 Yeah, it's definitely really creative.

16:08 It's useful.

16:09 So he wasn't just getting alerted when there's a new spot available,

16:14 but he actually created the bot.

16:16 So it like, it chooses that spot.

16:18 Right.

16:19 It would actually place the order.

16:21 Yeah, that's smart.

16:21 I mean, there are, we had the same situation in Hungary where you couldn't find any free spot.

16:28 And the thing I love about web scraping is that it can be useful for huge companies

16:34 to do web scraping at scale, but it can be also useful for literally everyone.

16:40 Yeah.

16:40 In the world, just doing little things like this.

16:43 I have a friend who is looking to buy an apartment and currently the real estate market

16:49 is really like fragile, I would say.

16:51 Yeah, yeah.

16:52 Yeah.

16:52 And what he wants to do, which I suggested to him, is to set up a bot

16:58 that monitors like 10 websites, 10 real estate listing websites.

17:04 And so he will get alerted when there is an apartment available with the attributes

17:11 he's interested in.

17:12 Right.

17:13 Okay.

17:13 That's awesome.

17:14 Yeah.

17:14 So there are so many little things like this in web scraping and it's just awesome.
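Alert bots like these, whether for grocery slots or apartment listings, all share one skeleton: fetch the page, parse it for the thing you care about, and notify yourself. Here is a minimal standard-library sketch of that skeleton; the "available" CSS class is a made-up stand-in for whatever markup the real site uses, and the print call stands in for a real SMS or email alert.

```python
import time
import urllib.request
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Collects the text of elements tagged with a hypothetical 'available' class."""

    def __init__(self):
        super().__init__()
        self._capture = False
        self.listings = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "available" in classes:
            self._capture = True

    def handle_data(self, data):
        if self._capture and data.strip():
            self.listings.append(data.strip())
            self._capture = False


def find_available(html):
    """Return the text of every 'available' item found in the HTML."""
    parser = ListingParser()
    parser.feed(html)
    return parser.listings


def watch(url, interval=600):
    """Poll the page every `interval` seconds and alert when something shows up."""
    while True:
        html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
        hits = find_available(html)
        if hits:
            print("New availability:", hits)  # swap in a real SMS/email alert here
            return hits
        time.sleep(interval)
```

The parsing is deliberately split out of the polling loop so the interesting part can be tested against saved HTML without touching the network.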

17:21 Yeah, I agree.

17:21 People often talk about this concept popularized by Al Sweigart: automate the boring stuff.

17:27 Like, oh, I got to always do this thing in Excel.

17:29 So could I write a script that just generates the Excel workbook or I've always got to

17:33 do this other thing, copy these files and rename them, write something that does that.

17:36 But this feels like the automate the boring stuff of web interfaces, automate tracking

17:41 of these sites and let it like finding out when there's a new thing.

17:45 Yeah.

17:45 Automate the boring stuff is a really good resource.

17:48 In this case, it's really like what it's about.

17:51 If you didn't create a web scraper for this, what would you do?

17:55 You would probably look at the website every single, as often as you can,

18:00 every day, every hour.

18:01 You'd hear somebody say, I checked it every day until I found the apartment I wanted,

18:05 right?

18:06 That's what they would say.

18:07 Yeah, exactly.

18:08 Yeah.

18:09 And what you don't hear them say so often is my bot alerted me when my dream apartment

18:14 was available.

18:15 Yeah.

18:15 I created a Scrapy spider that alerts me when there is a new apartment listing

18:20 in that area.

18:21 Yeah.

18:21 And, but I think it's a really great example and it's like, there's so many things

18:26 on your computer that you do that you kind of feel like, ah, this stuff I just got to do.

18:30 But on the web, there's way more even that you probably end up, I always go to this site

18:34 and then I do this thing and I got to do that, right?

18:35 And web scraping will kind of open up a ton of that stuff.

18:39 So, let me give you two examples.

18:41 One, really grand, large scale.

18:44 One, like, extremely small but like, really interesting to me in the next,

18:48 you know, hour.

18:48 So, grand scale first.

18:50 When Trump was elected in the United States in 2016 and he was going to become president

18:56 in 2017, there was a lot of concern about what was going to happen to some of the data

19:01 that had so far been hosted on places like whitehouse.gov.

19:05 So, like, there's a lot of climate change data hosted there from the scientists

19:09 and there was a thought that he might take that down, not necessarily destroy it,

19:14 but definitely take it down publicly from the site and there's, you know,

19:18 there's all these different places, the CDC and other organizations that have a bunch

19:23 of data that a bunch of data scientists and other folks were just worried

19:28 about.

19:29 So, there were these big movements and I struggle to find the articles, you know,

19:33 almost four years out now.

19:35 Can't quite find the right search term to pull them up but there was these

19:38 like, save the data hackathon type things.

19:42 So, people got together at different universities, like 20, 30 people and they would spend

19:47 the weekend writing web scraping to go around to all these different organizations,

19:51 download the data.

19:53 I think they were coordinated across locations and then they would put that all

19:57 in a central location, I think somewhere in Europe, like in Switzerland or something,

20:00 so the data could live on.

20:02 And I think the day that Trump was elected, I think the climate change data

20:06 went off of whitehouse.gov.

20:07 So, at least one of the data sources did disappear and that's a pretty large

20:12 scale web scraping effort.

20:13 Some of it was just like download CSVs, but a lot of it was web scraping,

20:16 I think.

20:17 Yeah, like the thing is people say that once you put something on the internet,

20:22 it's going to stay there forever.

20:23 And why this story is interesting is that this is information that is useful

20:29 for everyone.

20:30 Yeah.

20:30 And this is the kind of information that you want to make publicly available.

20:35 And I think this is one of the things that this movement is about, if I can call

20:40 it movement, open data to make data open and accessible for everyone.

20:45 And web scraping is really a great tool to do that, to extract data from the web,

20:53 put it in a structured format in a database so you can use it, can just store

20:59 information or you can get some kind of intelligence from the data.

21:03 And it's really like, without web scraping, what else could you do?

21:08 Would you just copy-paste the whole thing, or what would you do?

21:11 It would take a lot of by-hand work.

21:14 Yeah, it wouldn't be good.

21:15 Yeah.

21:15 Yeah, so that's really cool.

21:17 And that was an interesting large scale, like, hey, we have to organize.

21:21 We've got one month to do it.

21:22 Let's make this happen.

21:23 So on a much smaller scale, I recently realized that some of my

21:28 pages on my site were getting indexed by Google, not because anything on the

21:32 internet was linking to them publicly, but something would be in Gmail or

21:36 somebody would write it down and then Google would find it.

21:39 So I just went through and put a bunch of no index meta tags on a bunch of my

21:43 parts of my site.

21:44 But it really scared me because I changed a whole bunch of it and I'm pretty sure I

21:48 got it right.

21:48 But what if I accidentally put no index on some of my course pages that advertise

21:54 them and all of a sudden the traffic goes from whatever it is to literally

21:58 zero because I accidentally told all the search engines, please don't tell

22:03 anyone about this page.

22:04 What I was doing right before we called, actually, unrelated to the fact that

22:07 this was the topic, was writing a very simple Scrapy script

22:11 that goes and grabs my site map, hits every single page and just tells me

22:15 the pages that are not indexed just for me to double check.

22:18 But you know, in case that template got reused somewhere where I didn't think it

22:21 got reused, you know, like the little shared bit that has the head in it and so

22:25 on.

22:25 Yeah, it's sort of like in SEO, like search engine optimization, we have this

22:29 thing, technical SEO.

22:31 Yeah.

22:32 And actually, it's another use case for web scraping or like crawling, where you want

22:38 to crawl your own website, just as you know, you did and to learn the site map or

22:44 to find pages where you don't have proper SEO optimization, like you don't

22:49 have a meta description on some pages, or you have multiple H1 tags or whatever.

22:56 Right.

22:56 And with crawling, you can figure these out.
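The checks discussed above, a stray robots noindex tag, a missing meta description, duplicate H1s, can be automated with a few lines per page. Here is a sketch using only the standard library; in a real audit you would feed it the HTML of every URL in your sitemap.

```python
from html.parser import HTMLParser


class SEOAudit(HTMLParser):
    """Collects three technical-SEO signals from one page's HTML."""

    def __init__(self):
        super().__init__()
        self.noindex = False          # <meta name="robots" content="...noindex...">
        self.has_description = False  # <meta name="description" content="...">
        self.h1_count = 0             # should usually be exactly one

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta":
            name = (a.get("name") or "").lower()
            content = a.get("content") or ""
            if name == "robots" and "noindex" in content.lower():
                self.noindex = True
            elif name == "description" and content.strip():
                self.has_description = True
        elif tag == "h1":
            self.h1_count += 1


def audit_page(html):
    """Run the audit on one page's HTML and return the collected signals."""
    auditor = SEOAudit()
    auditor.feed(html)
    return {
        "noindex": auditor.noindex,
        "has_description": auditor.has_description,
        "h1_count": auditor.h1_count,
    }
```

Fetch each page from the sitemap, run audit_page on it, and flag any page where noindex is unexpectedly True, exactly the double-check described here.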

22:59 And I just remember the story from like 2016, where web scraping was used to

23:07 reveal lobbying and corruption in Peru.

23:10 Oh, wow.

23:11 Okay.

23:12 Yeah, I'm not 100% sure, but I think it was related to, you know, the Panama

23:16 Papers.

23:17 Yeah, I was going to guess it might have something to do with the Panama

23:21 Papers, which was really interesting in and of itself.

23:23 Yeah, yeah, and they actually used many tools to scrape the web, but they

23:29 also used Scrapy to get information from these, I guess, government

23:34 websites or other websites.

23:36 But yeah, it's crazy that web scraping can reveal these kinds of things.

23:42 This is a big thing.

23:44 Yeah.

23:44 Corruption and web scraping can tell you that, hey, there is something wrong

23:49 here, you should pay attention.

23:50 It lets you automate the discovery and the linking of all this different data

23:54 that was never intended to be put together, right?

23:57 There's no API that's supposed to take like these donations, this investment,

24:02 this other construction deal, or whatever the heck it was, and like link

24:06 them together.

24:06 But with web scraping, you can sort of pull it off, right?

24:09 Yeah, you can structure the whole web.

24:11 I mean, technically, you could structure the whole web, or you could get like a

24:17 set of websites that are considered or that has the same kind of data you

24:22 are looking for, like you can be e-commerce, fintech, real estate, whatever.

24:27 You can grab those sites, extract the data, the publicly available data,

24:32 structure it, and then you can do many sorts of things.

24:35 You can do like NLP, you can just search in the data, you can do many different

24:39 kinds of things.

24:41 Yeah, that's cool.

24:42 So let's talk a little bit specifically about Scrapy and maybe Scrapinghub

24:47 as well, and some of the more general challenges you might run into with web scraping.

24:51 So right on the Scrapy website, homepage, project site, there's a real simple

24:57 example that says here's why you should use it.

24:58 You can just create this class, it derives from Scrappy.

25:01 spider, and you give it some URLs, and it has this parse function, and off

25:05 it goes.

25:06 Do you want to talk to us real quickly about what it's like to write code to do

25:09 web scraping with this API?

25:11 With Scrapy?

25:12 Yeah.

25:12 Yeah, so with Scrapy, if you use Scrapy, what you really get is like a

25:18 full-fledged web crawling framework, or like a web scraping framework.

25:22 So it really makes it easy for you to extract data from any website.

25:28 And there are other libraries, like BeautifulSoup, or lxml, or in Java

25:34 we have Jsoup, that are only focused on parsing and getting data

25:40 from HTML and XML files.

25:43 Right, BeautifulSoup is I have the text, I have the HTML source, now tell me

25:47 stuff about it, right?

25:48 Right.

25:49 That's sort of its whole job, yeah.

25:50 Right, but with Scrapy, I would say parsing is maybe 10% of the whole picture,

25:55 or the big picture, because with Scrapy, yeah, you can parse the HTML,

26:00 but in a real-world project, that's not the only task you need to do.

26:05 In a real-world project, you need to process the data as well.

26:10 You need to structure the data properly, you need to maybe put it into a database,

26:16 maybe export it as a CSV file or JSON, you need to clean the data, you need to

26:23 normalize the data.

26:24 In real-world scenarios, you need to do a lot of things, not just scraping the

26:28 data, and Scrapy really makes it easy to develop these spiders that grab the

26:34 data, and then right there in Scrapy, you can clean it, you can process

26:40 it, and when the data leaves Scrapy, it's usable for whatever you want to

26:46 use it for.

26:46 Right, maybe you get a list or sequence of dictionaries that have the pieces

26:51 of data that you went after or something like that.
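In Scrapy, that cleaning and normalizing step typically lives in an item pipeline: a plain Python class whose process_item() method Scrapy calls for every item a spider yields. Here is a minimal sketch; the "price" field and its "$1,299.00"-style formatting are invented for illustration.

```python
class PriceCleanupPipeline:
    """A minimal Scrapy-style item pipeline. Scrapy calls process_item()
    once for every item a spider yields, before the item is exported.
    The 'price' field handled here is an illustrative assumption."""

    def process_item(self, item, spider):
        raw = item.get("price") or ""
        # Normalize strings like " $1,299.00 " into a plain float.
        cleaned = raw.replace("$", "").replace(",", "").strip()
        item["price"] = float(cleaned) if cleaned else None
        return item
```

In a real project you would enable it through the ITEM_PIPELINES setting; because it is plain Python, it is also trivial to unit test on its own.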

26:53 Yeah, right.

26:54 So that's why I really like Scrapy, but also I think it's the best framework to

27:00 maintain your spiders, because a big problem in web scraping is that once

27:05 you write the spider, it works, you get the data, it will probably not work in

27:11 six months in the future, because the website changes or the layout changes.

27:16 They've decided to switch JavaScript front-end frameworks or redesign the

27:21 navigation or something, right?

27:23 Exactly.

27:23 And so you need to adjust your code, adjust your selectors, you need to maybe adjust

27:30 the whole processing part of Scrapy, because maybe in the past you were able

27:36 to extract only messy markup from the HTML, but they made some changes to the website.

27:42 Now it's clean, you don't need so much processing.

27:44 And with Scrapy, it's really easy to maintain the code, which is, you know,

27:49 it's really important if you rely on web-extracted data on a daily basis,

27:53 you need to maintain the code.

27:55 Yeah, really neat.

27:56 So to me, when I look at this simple API here, it's like you create a class,

28:01 you give it a set of start URLs, and there's a parse function, and that comes in,

28:05 you run some CSS selectors on what you get.

28:08 For each thing that you find, you can yield that back as part of the thing, like the

28:12 result you found, and then you can follow the next one.

28:15 I think that's probably the biggest differentiator from what I can tell is,

28:19 it's like as you're parsing it, you're directing it to go do more discovery, maybe

28:25 on linked pages and things like that.

28:27 Yeah, like on websites where you need to go deeper, maybe because, you know, you need to

28:33 go deeper in the sitemap or you need to like paginate, you need to do more

28:39 requests to get the data.

28:41 Yeah.

28:41 In tools like BeautifulSoup, you would need to sort of create your own logic for, like,

28:47 you know, how you want to do that.

28:49 But in Scrapy, it's already sort of straightforward how you want to do that or how you

28:54 can do that.

28:55 And you just need to click this button, click this button, and you paginate with

29:00 Scrapy, because you have these high-level functions and methods that let you paginate and

29:06 let you do things in a chain, you know?

29:08 Yeah.

29:09 Yeah, very cool API.

29:10 So let's talk about some of the challenges you might get.

29:14 I mean, probably to me, the biggest challenge, well, there's two problems

29:17 that I would say, the two biggest challenges.

29:19 One is probably maybe generally you could call it data quality, but more

29:25 specifically like the structure changed, right?

29:28 they've added a column to this table and it's not what I thought it was anymore.

29:32 Like it used to be temperature and now it's pressure because they've added that

29:36 column and I wrote it to have three columns in the table, now it's got four.

29:39 The other one is I go and I get the source code of the page and it says, you know,

29:45 vue.application or there's like an Angular app or something and you get

29:50 just a bunch of script tags that have no data in them because it's not actually executed yet.

29:55 Yeah, I mean, these are really all the reasons why you

30:00 need to maintain the code, which can be, you know, it takes time to maintain the

30:05 code.

30:05 Actually, that's why there are more and more advanced machine learning technologies

30:11 that pretty much maintain the code for you.

30:15 Oh, wow.

30:15 Okay.

30:15 Give us an example of that.

30:17 Like find me the title or find me the stock price off of this page, machine

30:21 magic.

30:21 Yeah.

30:22 So actually it's not that much magic.

30:24 Like in the future, you won't need to write things like selectors.

30:29 You won't need to tell Scrapy or, like, you know, whatever tool you use, hey, this

30:35 is the price title, whatever.

30:38 I want to extract it.

30:39 This is the selector where you can find it in the HTML.

30:43 In the future, you will not need to do that.

30:46 In the future, what you will only need to say is that this is the page.

30:51 this is a real estate listing or this is a product page or this is some kind of

30:57 financial page.

30:59 So you just need to say the data type.

31:01 Yeah.

31:02 And then the machine will know what fields to look for.

31:07 So it will know that if it's a product page, you probably need like price.

31:13 You probably need product title or like product name, sorry, description,

31:17 stock information, maybe reviews, if any.

31:20 So in the future, you just need to specify the data type and you will get all

31:25 the data fields that you want with ML.

31:28 Yeah.

31:28 That's like a dream.

31:29 Well, it's sort of like a reality in some cases because, like, you know, there

31:34 are some products out there that do this. At Scrapinghub,

31:39 we have, for example, a news API, which does exactly this thing for articles and news.

31:45 Okay.

31:46 Yeah.

31:46 So what you do is that, hey, this is the page URL.

31:50 give me everything and you get the article title, you get the text body,

31:55 you get, you know, everything that is on the page.

31:58 Well, and if any place is going to have enough data on scraping to like answer those kinds of

32:03 questions more generally, it sounds like you guys might.

32:06 You guys do so much web scraping, right?

32:08 Like way more than a lot of people, as a platform.

32:12 I don't know how you like share that across projects.

32:15 I don't think that, I don't see a good way to do that, but it seems like you

32:18 should be in a position that somehow you could build some cool tools around all

32:22 that.

32:22 Yeah, we do a lot of web scraping, as you said.

32:25 And really, the biggest problem is that there is no one schema that websites

32:32 follow when it comes to data.

32:34 So when you scrape, you need to like figure out a structure that would work for

32:40 all the websites.

32:41 and then you extract, you sort of like standardize the data.

32:45 And that's why the data type is important to specify.

32:48 Right.

32:49 What about JavaScript?

32:50 That's what all the cool kids say: you don't write server code anymore.

32:55 You don't write templates.

32:56 You just write APIs and you talk to them with HTML and JavaScript.

33:00 But that doesn't scrape very well.

33:02 I'm not sure that's necessarily what you should be doing entirely, but a

33:05 lot of people are, and it makes web scraping definitely more challenging.

33:09 Yeah.

33:09 Like nowadays, all the websites have some kind of JavaScript running.

33:13 And if the data you need to extract is rendered with JavaScript, that can

33:19 be a challenge.

33:21 And first, if you're writing a spider, first, you should look for ways to get

33:26 the data without executing JavaScript.

33:28 You can do this sort of, you know, I call them hidden APIs, when there is

33:34 no documented API for the website.

33:36 But if you look in the background, there is actually a public

33:41 API available.

33:42 Right.

33:42 So it's clearly going back, like if you've got a Vue.js front end, it's probably

33:47 calling some function that's returning like a JSON result of the data you actually

33:52 want.

33:53 You just got to look in your web browser tools to see that happening, right?

33:56 Yeah, exactly.

33:57 Like with AJAX calls, many times that's the case that you just need to sort of grab

34:02 the JSON and you can get the data easily.
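As a sketch of that workflow: the endpoint URL and the JSON shape below are hypothetical; in practice you find the real ones by watching the XHR requests in the browser's network tab.

```python
import json
from urllib.request import Request, urlopen


def parse_listings(payload: str) -> list:
    """Pull the interesting fields out of the JSON an AJAX endpoint returns."""
    data = json.loads(payload)
    return [
        {"title": item["title"], "price": item["price"]}
        for item in data.get("results", [])
    ]


if __name__ == "__main__":
    # Hypothetical "hidden" API spotted in the browser dev tools.
    req = Request(
        "https://example.com/api/search?page=1",
        headers={"User-Agent": "Mozilla/5.0"},
    )
    print(parse_listings(urlopen(req).read().decode("utf-8")))
```

No HTML parsing at all: you get structured data straight from the same endpoint the page's own JavaScript calls.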

34:05 Well, and it's formatted JSON, right?

34:07 It's not just HTML.

34:08 I mean, it's like pretty sweet actually to get a hold of that feed.

34:11 Oh, yeah.

34:12 But on the other side, if you cannot find an API like this, you need to execute JavaScript using

34:18 some kind of headless browser like Selenium or Puppeteer or Splash.

34:23 And because, you know, you need an actual browser to execute JavaScript.

34:27 Right.

34:27 But then it's going to take more hardware resources to run JavaScript.

34:32 So it's going to be slower and it's going to make the whole process a little bit

34:37 more complicated.

34:38 Yeah.

34:38 There's probably no mechanism built right into Scrapy that will render the

34:43 JavaScript, is there?

34:44 Like, Scrapy itself cannot render JavaScript, but there is a really neat integration with

34:50 Splash.

34:50 Okay.

34:51 Splash is also, you know, it was created by Scrapinghub, so it's really easy to,

34:56 and Splash is sort of like a headless browser created only for web scraping,

35:01 and you can integrate Splash with Scrapy really easily.

35:05 It's literally just like a middleware in Scrapy.

35:08 And then it will execute JavaScript for you.
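For reference, the wiring looks roughly like this, based on the scrapy-splash project's documented setup (the middleware priorities and the localhost:8050 address are the defaults its README suggests, assuming a Splash instance is running there):

```python
# settings.py fragment: route requests through Splash so JavaScript runs.
SPLASH_URL = "http://localhost:8050"  # where your Splash instance listens

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"

# In the spider, issue requests via Splash instead of plain Request:
# from scrapy_splash import SplashRequest
# yield SplashRequest(url, self.parse, args={"wait": 1.0})
```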

35:11 Yeah.

35:11 Sweet.

35:11 Yeah.

35:12 You will be able to see the data and then you will be able to extract it.

35:16 Right.

35:16 It says the headless browser designed specifically for web scraping.

35:20 Turn any JavaScript heavy site into data.

35:23 Right.

35:23 Cool.

35:24 I'll put a link to that.

35:25 That's pretty neat.

35:26 And that's open source.

35:27 People can just use that as well or what's the story with that?

35:30 Yeah.

35:30 Just like Scrapy, you can set up your own server and use it that way, or you can

35:36 also host it on Scrapinghub.

35:37 Talk Python to me is partially supported by our training courses.

35:43 How does your team keep their Python skills sharp?

35:46 How do you make sure new hires get started fast and learn the Pythonic way?

35:50 If the answer is a series of boring videos that don't inspire or a subscription service

35:56 you pay way too much for and use way too little, listen up.

35:59 At Talk Python Training, we have enterprise tiers for all of our courses.

36:03 Get just the one course you need for your team with full reporting and monitoring.

36:07 Or ditch that unused subscription for our course bundles which include all the

36:12 courses and you pay about the same price as a subscription once.

36:15 For details, visit training.talkpython.fm/business or just email sales at

36:21 talkpython.fm.

36:24 Yeah, so let's maybe round out our conversation talking about hosting.

36:28 So I can go and run web scraping on my system and that sometimes works pretty

36:35 well, sometimes it doesn't.

36:36 There was this time a friend of mine was thinking about getting a car and putting it

36:40 on one of these car sharing sites, keep it a little bit general and we wanted to

36:44 know like, okay, for a car that's similar to the kind of car he's thinking about getting,

36:47 how often is it rented?

36:49 What does it rent for?

36:50 Like, so would it be worthwhile as just a pure investment, right?

36:54 So I wrote up some web scraping stuff that would just monitor like a couple

36:58 of cars I picked.

36:59 After a little while, they didn't talk to me anymore.

37:02 The website was mad at me for like always asking about this car.

37:06 Yeah.

37:07 So that seems like a common problem, and if I do that from my house, I probably

37:11 can't check out that site anymore for a while, right?

37:14 I probably shouldn't do it from my house, but even from a dedicated server, like

37:17 that might cause problems.

37:18 Yeah, it's not just with web scraping.

37:20 Like I remember I was doing some kind of financial research.

37:24 It was about like one of the cryptocurrencies and I was using an actual API.

37:28 Like I wasn't scraping.

37:29 I was using an actual API and it was pretty basic.

37:32 Like I was just learning how to use this thing and I created a for loop and I put the

37:39 request, the API request in the for loop and I was just testing and by accident, I put like

37:46 a huge number in the for loop, like I don't know, like a million or something.

37:51 And it created a million requests.

37:54 I mean, it tried to create a million requests, but it stopped after like,

37:58 I don't know, maybe like 10,000 requests in a minute.

38:01 Right.

38:02 And because, you know, the server didn't like that I was doing that.

38:04 And it's the same thing with web scraping.

38:07 You need to be really careful not to hit the website too hard because when you scrape the

38:13 web, it's really important.

38:14 I mean, it's very important to be ethical and be legal.

38:19 And one of the things that you can do to achieve this is to just put in some kind of

38:24 delay, just out of respect for the website, and make sure that you don't

38:28 cause any harm to the website.

38:30 Right.

38:31 Right. You think you're just getting data, but

38:35 they might perceive what you're doing as a distributed denial of service attack

38:39 against my site.

38:40 We're going to block this person, right?

38:42 That's not a good situation you want to be in.

38:44 And it's also not super nice to whoever runs that website to hammer it.

38:48 Exactly.

38:48 The website doesn't know what you want to do.

38:51 It just sees a bot and it's really like, you know, because web scraping itself is legal.

38:56 You can do it if it's a publicly available website, and you need to make sure

39:01 that you are being, I mean, we use this term in the web scraping community,

39:05 you need to be nice.

39:06 You need to be nice to the websites, and, you know, you can look at the robots.txt

39:10 file.

39:11 Usually there is, like, a delay specified in the robots.txt file that

39:16 tells you that, hey, you should put like three seconds between each request.

39:21 But just in general, even if there is no such rule defined in the robots.txt

39:26 file, you should pay attention.

39:28 You should be careful with how many requests you make, how frequently you make

39:33 those requests.

39:34 So yeah, it's important.
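The standard library can read that crawl-delay for you. A small sketch (the sample robots.txt is made up, and the Scrapy settings in the comments are the usual politeness knobs):

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt declaring a three-second delay between requests.
robots_txt = """\
User-agent: *
Crawl-delay: 3
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

delay = rp.crawl_delay("*")  # seconds to wait between requests, or None
print(delay)

# Scrapy exposes the same politeness as settings, e.g.:
# ROBOTSTXT_OBEY = True
# DOWNLOAD_DELAY = 3
# AUTOTHROTTLE_ENABLED = True
```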

39:35 Yeah.

39:36 Interesting.

39:36 And so I guess one option is you guys have a cool thing, which I just discovered.

39:42 I don't know how new it is, but Scrapyd, a daemon that lets you basically deploy your

39:48 Scrapy spiders to it, and it'll schedule and run them, and that's pretty cool, and

39:53 you can control that.

39:53 So you could set that up yourself, or obviously you have the business side of your

39:58 whole story, which is Scrapinghub.

40:00 So what does Scrapinghub do that makes this better?

40:03 Yes.

40:04 Than just getting a server at AWS or Linode or something like that.

40:08 Yeah.

40:09 So Scrapinghub is really about, you can deploy your spiders, but it's really, it's made

40:15 for web scraping and all the tools are available in the Scrapinghub platform

40:20 that you would want to use for web scraping.

40:23 Like, you know, we mentioned Scrapy and Splash in the Scrapinghub platform.

40:27 You can upload your spider, your Scrapy spider, and then you can, with a click of

40:31 a button, you can add Splash.

40:33 So it renders JavaScript or you can also, if you want to use proxies for web

40:39 scraping, you can, with a click of a button, you can add proxies to your

40:43 web scraping.

40:44 You can also, if you have a website, like we mentioned, you know, articles and news, if you

40:49 want to extract data, we have an API, so you don't have to actually write the spider.

40:54 You can just use the API and get the data so you don't need to maintain the code.

40:59 So Scrapinghub is really just, I mean, I work at Scrapinghub, but I truly

41:03 believe that it's the best platform if you're doing web scraping at scale, because you have

41:08 all the tools in one place that you would want to use for web scraping.

41:12 Right, right.

41:14 So a lot of the stuff you might bring your own infrastructure for, you guys have like as a

41:19 service type stuff?

41:20 Yeah, exactly.

41:21 You can sign up.

41:22 We have a free trial for, I think, for most of our services.

41:26 You can try it.

41:27 You can see how it works.

41:28 And then there's a monthly fee and that's it.

41:31 Okay.

41:31 Yeah, that seems pretty awesome.

41:33 And also we have a lot of clients who don't want to deal with the stuff,

41:36 deal with the technical details.

41:38 So they just say that, hey, I want this type of data from these websites,

41:41 give me the data.

41:42 And so we just do the hard work for them.

41:45 Right.

41:45 Maybe they're not even Python people.

41:47 Yeah.

41:47 They might be like, they don't even know how to code or like.

41:50 Right.

41:51 They could be political scientists.

41:52 They're like, we need to just scrape this.

41:54 But like you guys do scraping, right?

41:56 Yeah.

41:56 Yeah, exactly.

41:57 So let's round out our conversation on this.

42:00 Just a quick comment on Scraping Hub or a quick thought on it.

42:04 So when I talked to Pablo back in 2016, I had a show with him about Scrapy and Scrapinghub

42:11 and all that back in episode 50, four years ago or something.

42:14 One of the things that really impressed me about what you guys are doing

42:18 and continues to be the case is you've taken pretty straightforward open source projects,

42:24 not something completely insane, like a whole new database model or something, but just pretty

42:29 straightforward, but polished model of web scraping.

42:32 And you built a really powerful business on top of it.

42:35 And I think it's an interesting model for taking open source and making it your job more

42:40 than just I'll do consulting in this thing, right?

42:43 If I wrote this web framework, I could maybe sustain myself by consulting with people who

42:48 want to write websites in that framework or so.

42:50 But this is a pretty interesting next level type of thing that you guys pulled off here.

42:55 And what are your thoughts on this?

42:56 Yeah, 100% agree.

42:58 Like I think, you know, Pablo Hoffman and Shane Evans, they did

43:02 an awesome job when they created Scrapy.

43:05 Like at the time, it was the first real web scraping framework, as far as I know.

43:10 And they open sourced it.

43:11 And it's really like if you look at the GitHub page of Scrapy, it's really like I feel like

43:17 it's a community.

43:18 Like people are chatting about this.

43:20 People are talking about how the community can improve this tool.

43:25 And then the fact...

43:26 You've got like 17,000 GitHub questions or something like that.

43:30 Oh, yeah.

43:30 Tagged on it.

43:31 Yeah, it's quite a bit.

43:31 Yeah.

43:32 And the fact that Pablo and Shane were able to turn this thing into a business, but in a

43:37 really...

43:38 Like people can still use Scrapy without using our services.

43:42 They can still use Splash without paying.

43:45 They can use it.

43:45 It's open source.

43:47 So the fact that they built on top of the open source tools in a way that you don't need

43:53 to be affiliated with the company to get the benefits of Scrapy and other open source tools.

43:58 Like there are many other tools.

44:00 It's not just Scrapy.

44:01 Like there are so many other tools for web scraping.

44:04 And I agree.

44:05 I think it's just really amazing that they have been doing this.

44:09 Yeah.

44:09 It seems like it's really adding value, because if not, the core thing is already open

44:14 source and people would just use it, you know?

44:16 Yeah.

44:16 Like especially if they want to do it at scale.

44:19 Yeah.

44:20 You just cannot do huge things like millions of extracted records or things like that on your

44:26 computer or on your laptop.

44:28 You just cannot do that.

44:29 No, you get about 50,000 and then you get blocked.

44:32 Yeah.

44:33 Yeah.

44:33 That's an option as well.

44:34 Yeah, exactly.

44:36 So yeah, very cool.

44:37 I think it's a really neat example because a lot of people are thinking about

44:40 how do I take this thing that's got some traction I'm doing with open source

44:43 and make this my job because I'm tired of filing Jira tickets at this thing that's kind of

44:49 okay but not really my passion, right?

44:51 And cool example.

44:52 So congrats to you guys.

44:53 Keep up the good work.

44:54 Yeah, we are.

44:55 We definitely try our best.

44:57 There are some awesome things to come in the web scraping world, I believe, with the

45:02 advancements of machine learning and other stuff.

45:04 So it's going to be really interesting to see what's going to happen in the

45:08 next few years.

45:09 Yeah, for sure.

45:10 All right.

45:10 Well, I think that's probably it for the time we got to talk about web scraping.

45:13 But the final two questions before you get out of here, you're going to write some

45:17 code.

45:17 What editor do you use?

45:18 Python code?

45:19 PyCharm.

45:20 PyCharm.

45:21 Right on.

45:21 And then notable PyPI package.

45:23 I mean, there's obviously pip install scrapy, but maybe some project that you

45:28 came across, you know, like, oh, this is so awesome.

45:30 People should know about X.

45:32 Oh, actually, it's a hard question because I've been sort of out of the game for

45:37 a lot of months now.

45:38 Aside from Scrapy, I just really like to, I don't have one specific example.

45:43 I just really like to find these GitHub repositories where it's not like

45:49 a library, but it's like a collection of templates. There are many GitHub

45:54 repos with collections of templates.

45:56 Right.

45:56 You know, like, hey, if you're a beginner, you can use these templates to get

46:00 started.

46:01 And for me, that was really useful when I started out using a new tool

46:05 like Scrapy or other libraries, that you can use these sort of starter codes.

46:12 Right, right.

46:12 I wonder if there's a cookiecutter Scrapy, possibly.

46:15 Probably.

46:15 Yeah.

46:16 Yeah.

46:16 There is.

46:17 I'll throw it into the show notes, but yeah.

46:21 Cool.

46:21 But just cookiecutter, make me a Scrapy project.

46:24 Let's roll and do some web scraping.

46:26 Pretty neat.

46:26 Also, there is, you know, just a last note, an awesome list.

46:28 I think it's called awesome web scraping.

46:31 OK.

46:31 Or something like that, which has hundreds of tools for web scraping.

46:35 Oh, yeah.

46:36 We can link it later.

46:38 Maybe if I find it.

46:39 OK.

46:39 Yeah.

46:40 It's called awesome-web-scraping on GitHub.

46:42 Yeah.

46:43 Nice.

46:43 I'll put that in the show notes as well.

46:45 I love those awesome lists.

46:46 Yeah.

46:46 It's easy to get lost because you were like, oh, I have this one thing and I

46:50 use it.

46:50 Then all of a sudden you're like, oh, look at that.

46:52 There's like 10 other options in this thing that I thought I knew.

46:55 Yeah, exactly.

46:57 Awesome.

46:58 All right.

46:58 Attila, it's been great to chat with all these, share all these web scraping

47:02 stories and chat with you.

47:03 So thanks for being here.

47:04 Yeah, it's been amazing.

47:05 Thank you.

47:05 Thank you very much.

47:06 Yeah, you bet.

47:07 Bye-bye.

47:07 Bye.

47:08 This has been another episode of Talk Python to Me.

47:11 Our guest in this episode has been Attila Tóth and it's been brought to you by Talk

47:15 Python Training and Linode.

47:17 Start your next Python project on Linode's state-of-the-art cloud service.

47:21 Just visit talkpython.fm/Linode, L-I-N-O-D-E.

47:26 You'll automatically get a $20 credit when you create a new account.

47:29 Want to level up your Python?

47:31 If you're just getting started, try my Python Jumpstart by Building 10 Apps course.

47:36 Or if you're looking for something more advanced, check out our new Async course

47:40 that digs into all the different types of Async programming you can do in

47:44 Python.

47:44 And of course, if you're interested in more than one of these, be sure to

47:48 check out our Everything Bundle.

47:49 It's like a subscription that never expires.

47:51 Be sure to subscribe to the show.

47:53 Open your favorite podcatcher and search for Python.

47:55 We should be right at the top.

47:57 You can also find the iTunes feed at /itunes, the Google Play feed at slash Play, and the

48:02 direct RSS feed at /rss on talkpython.fm.

48:06 This is your host, Michael Kennedy.

48:08 Thanks so much for listening.

48:09 I really appreciate it.

48:10 Now get out there and write some Python code.

48:12 you're welcome.

48:13 Bye.

48:13 Thank you.
