#283: Web scraping, the 2020 edition Transcript

Recorded on Wednesday, Jul 22, 2020.

00:00 Web scraping is pulling the HTML of a website down and parsing useful data out of it.

00:04 The use cases for this type of functionality are endless.

00:07 Is there a bunch of data on government sites that's only available online in HTML, without a download?

00:13 There's an API for that.

00:15 Do you want to keep abreast of what your competitors are featuring on their site?

00:19 There's an API for that.

00:20 Need alerts for changes on a website?

00:23 For example, enrollment is now open at your college, and you want to be first and avoid that 8 a.m. morning slot?

00:29 Well, there's an API for that as well.

00:31 That API is screen scraping, and Attila Toth from Scrapinghub is here to tell us all about it.

00:36 This is Talk Python to Me, episode 283, recorded July 22, 2020.

00:41 Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities.

01:00 This is your host, Michael Kennedy.

01:02 Follow me on Twitter, where I'm @mkennedy.

01:05 Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython.

01:11 This episode is brought to you by Linode and us.

01:15 Python's async and parallel programming support is highly underrated.

01:20 Have you shied away from the amazing new async and await keywords because you've heard it's

01:24 way too complicated or that it's just not worth the effort?

01:27 With the right workloads, a hundred times speed up is totally possible with minor changes to your code.

01:33 But you do need to understand the internals, and that's why our course, Async Techniques and Examples in Python,

01:38 shows you how to write async code successfully as well as how it works.

01:44 Get started with async and await today with our course at talkpython.fm/async.

01:49 Attila, welcome to Talk Python to me.

01:52 Thanks for having me, Michael.

01:53 Yeah, it's great to have you here.

01:55 I'm looking forward to talking about web scraping.

01:57 It's the API that you didn't know existed, but it's out there if you're willing to find it.

02:02 It's like the dark API.

02:03 Yeah, it's like a secret API that you need a little work to sort of like be able to use it, to earn the right to use it.

02:11 Yeah.

02:12 But yeah, I like it.

02:13 It's useful.

02:14 It definitely is.

02:15 And we've got some pretty interesting use cases to talk about and a little bit of history and stuff.

02:20 I'm obviously going to talk about Scrapy, some stuff about Scrapinghub, what you guys are doing there.

02:24 So a lot of fun stuff.

02:25 But before we get to all that, let's start with your story.

02:27 How'd you get into programming in Python?

02:28 Yeah, sure.

02:29 So actually, I got into programming back in elementary school.

02:33 I was the sort of kid, I think I was in like eighth grade.

02:37 And at the time, everybody in my class in school, everybody was asking for like,

02:42 I think it was Xbox 360 that just came out.

02:46 Everybody was asking for that for Christmas.

02:48 Yep.

02:49 And I was the one who was asking for a book about programming.

02:54 What language did you want at that time?

02:56 Yeah.

02:57 So actually, I did like a lot of research.

03:00 I think for like three months, I was researching which programming language is the best for beginners.

03:07 So I was on different forums, Yahoo Answers, in every place on the internet.

03:13 And many people suggested at the time, Pascal.

03:16 So I got this book.

03:18 I think it was called something like Introduction to Turbo Pascal or something like that.

03:23 Right.

03:24 Okay.

03:24 Yeah.

03:25 So I got the book.

03:26 And that was the first step.

03:28 And then after like 40 pages into the book, I threw it away, because I found that just Googling stuff

03:35 on Google and Stack Overflow, it's just easier to learn it that way.

03:41 Nice.

03:42 So did you end up writing anything in Turbo Pascal?

03:45 Yeah.

03:46 I was just, you know, the regular things beginners usually do.

03:49 Like in the terminal, I, you know, I printed my name in red.

03:54 I printed like figures with characters, you know, just the regular things beginners do.

04:00 But then I quickly switched to Java after Pascal.

04:04 Right.

04:05 Java, you can build web apps, you can build GUIs, more interesting things.

04:10 Yeah.

04:10 At the time I was interested in Android, Android development.

04:14 Actually, it was a few years later when I got to high school, I started to like program in Java to develop Android apps.

04:22 I created like, I don't know, maybe like five or six apps for Google Play that I just found useful,

04:30 like minor things, like a music player that randomly plays the songs that are on your phone.

04:37 And the only input is how long you want to listen to music.

04:41 Oh, cool.

04:41 So I was like, I want to listen for 20 minutes.

04:44 And it just randomly plays music for 20 minutes.

04:47 So, you know, little things like that.

04:49 Yeah.

04:49 Cool.

04:50 How'd you find your way over to Python?

04:51 Thanks to web scraping, actually.

04:54 Yeah.

04:54 Yeah.

04:55 You know, I'm not sure which was the first point.

04:58 No, I actually remember which was the first time I came across web scraping.

05:02 I was trying to develop like a software.

05:06 Like I was pretty into sports betting, football and soccer.

05:09 And I didn't play high stakes, because I wasn't good enough to afford to do it.

05:16 But I was interested in the analysis part, you know, looking up stats, looking up injury information,

05:22 looking at different factors that would determine the result, the outcomes of games.

05:28 And I wanted to create a software that would predict the outcome of these games or some of the,

05:36 we have, I don't know, 25 games in a weekend.

05:38 And it would predict the like, okay, look at these three games.

05:43 And these are going to be the results.

05:45 So I wanted to create a software like this.

05:47 Right.

05:48 And if that beats the spread that they are offering at Las Vegas, you're like,

05:51 oh, they're going to win by 14.

05:53 Oh, they said they're only winning by seven.

05:55 I want to take that.

05:56 Maybe.

05:56 Yeah.

05:57 I tried that.

05:58 And I also tried.

05:59 How did it go?

06:00 Yeah.

06:03 Well, the thing is, I wouldn't say it went too good, but it wasn't too bad.

06:07 I mean, I experimented, like, first of all, I was betting on like over and under,

06:12 because I found that that's like a more predictable thing to bet on.

06:17 That's what I thought, because in reality, it isn't, but I bet on over and under.

06:22 And so this software would tell me that, hey, in this game, you know, if it's soccer,

06:29 there will be more than two goals in the first half.

06:33 And so what I would do is I would watch the game live, because in live betting, there are better odds,

06:40 because as time progresses, the odds for scoring in the first half go up.

06:45 So I just waited for, I think it was the 15th minute or something like that.

06:50 And if there were still no goals scored in the first half, I would bet on it.

06:55 So I was doing this for like, I think four months without money, just testing the system.

07:01 Right.

07:02 I was watching and okay, I would bet right now.

07:05 And I was calculating, I had this Excel spreadsheet and everything.

07:08 And in that four months period, I was just, I think it was not profitable at all,

07:14 but it was close to being just a little profitable long term.

07:19 Yeah.

07:19 Well, maybe you were onto something.

07:21 If you get some more factors, maybe bring in a little machine learning, who knows?

07:25 Yeah, actually, maybe I should pick it up.

07:28 But that's when I came across web scraping.

07:30 Right.

07:31 Because I needed data.

07:33 I needed stats.

07:34 And a lot of these places, they don't want to give out that data.

07:37 They'll show it on their website, but they're not going to make an API that you can just stream.

07:40 Right.

07:41 Exactly.

07:42 At the time, there were like different commercial APIs that big news corporations use,

07:50 but it was just a hobby project.

07:51 And the only way for me to get data was to extract it from a publicly available website.

07:59 And so I started to look into it.

08:02 And at the time I was programming in Java.

08:04 So I found Jsoup.

08:06 I found HTML unit, which is a testing library, but it can be used for web scraping.

08:12 I found, is it Jaunt?

08:15 I think it's pronounced Jaunt.

08:16 And I found other libraries.

08:19 And everybody was saying that, which one is the best?

08:22 Which one is the best?

08:23 Everybody was talking on the forums about Scrapy.

08:26 And I didn't know what Scrapy was.

08:28 So I looked it up.

08:29 It's in Python.

08:30 I don't know how to program in Python.

08:32 I barely can program in Java.

08:34 But then I eventually learned Python just to learn Scrapy.

08:39 Cool.

08:40 Yeah.

08:40 And it's such a nice way to do web scraping.

08:42 It's a weird language.

08:44 It's Python.

08:45 It doesn't have semicolons, curly braces.

08:47 There's not so many of those.

08:49 Yeah.

08:49 It was like comparing it to Java.

08:51 It's very clean.

08:53 Yeah.

08:53 That's for sure.

08:54 Java is a little verbose in all of its type information, even for a typed language,

08:58 like say compared to C or I don't know, maybe even C#.

09:02 Like it likes its interfaces and all of its things and so on.

09:06 Right.

09:06 Yeah.

09:07 The best thing I love about Python, which I cannot do in Java, is that in Python,

09:11 you can read the code like an actual sentence.

09:15 Right.

09:16 I mean, not always, but sometimes you can, it's like a sentence.

09:19 Yeah.

09:19 There's this cartoon joke about Python that there's these two scientists or something like that.

09:26 And they've got this pseudocode.

09:28 It says pseudocode on their file.

09:29 And it's like pseudocode.txt.

09:31 They're like, oh, how do we turn this into a program?

09:34 Oh, you change the txt to a .py and it'll run.

09:36 Yeah.

09:37 It's already there.

09:38 Yeah, exactly.

09:39 All right.

09:40 Well, so that's how you got into it.

09:42 And obviously your interest in web scraping.

09:44 So, you've continued that interest today, right?

09:47 What are you doing these days?

09:48 Yeah.

09:49 So I've been developing a lot of spiders, a lot of scrapers with Scrapy and other tools over the years.

09:55 And nowadays I've joined the company that actually created Scrapy, Scrapinghub.

10:01 And I've been with Scrapinghub for over a year now.

10:04 And what I'm doing is as a day-to-day is educating people about web scraping,

10:09 both on the technical side of things and business side of things, like how to extract data from the web,

10:16 why is it useful for you or why it can be useful for you.

10:20 And so nowadays I don't really code that much unless it's for an article or a tutorial or to showcase some new stuff.

10:29 I spend more time on creating videos and just, you know, teaching.

10:34 Yeah.

10:34 Well, that sounds really fun.

10:35 You just get to explore, right?

10:37 Yeah.

10:37 And the biggest part or the best part I love about it is that like, I get to speak to customers who are doing web scraping or who are sort of

10:47 enjoying the benefits of web scraping.

10:49 And there are some really cool stories, like what these customers do with data.

10:55 Yeah.

10:56 And it's really creative.

10:57 Like people can get super creative, like how to make use of, you know, web data,

11:02 which is like, it's in front of you.

11:05 You can see the data on your website.

11:06 It doesn't look like interesting or like, you know, but when you extract it and you structure it,

11:12 you can actually make use of it and drive, you know, you can do many different things,

11:17 but you can drive better decisions in companies, which is pretty exciting.

11:21 Yeah.

11:22 And I guess if you work at a company and you're looking to think about, I'd like to do some comparative analysis against say our competitors and what

11:31 are they doing?

11:32 What are they doing with sales?

11:33 Like in terms of discounts, what are they doing in terms of the things they're featuring,

11:37 right?

11:38 You could theoretically write a web scraper that looks at your competitors'

11:42 data, their sites, their presentation, and sort of gives you that over time relative to you,

11:48 right?

11:48 Or something like that.

11:49 Yeah.

11:50 Like many use cases are, you know, as you said about monitoring competitors or monitoring the market.

11:59 And especially it's a big thing in e-commerce where most of the sectors in e-commerce are

12:06 really like, there are a lot of companies doing the same thing, selling the same thing.

12:12 And it's just really hard for them to sell more products.

12:16 So with web scraping, they can monitor the competitor prices.

12:21 They can monitor the competitors' stock information.

12:25 They can monitor pretty much everything that is on their website, publicly available.

12:31 And they can gather this information.

12:33 And, you know, it can be like tens of thousands of products or more.

12:38 And you can see where you should raise your prices, lower your prices, and how to price your products better and do marketing better.

12:49 Yeah.

12:50 And there's a lot of implicit signals as well that aren't published. You know, most companies,

12:54 some do, but most won't say, there were 720 of these sold today or whatever.

12:59 Yeah.

13:00 But you could, say, graph the number of reviews over time or the number of stars over time.

13:06 Things like that would give you a sense, like a proxy, of that kind of information.

13:10 Right.

13:13 You could come up with a pretty good analysis, a pretty interesting analysis.

13:13 This portion of Talk Python to Me is brought to you by Linode.

13:18 Whether you're working on a personal project or managing your enterprise's infrastructure,

13:22 Linode has the pricing, support, and scale that you need to take your project to the next level.

13:27 With 11 data centers worldwide, including their newest data center in Sydney,

13:31 Australia, enterprise-grade hardware, S3 compatible storage, and the next generation network,

13:38 Linode delivers the performance that you expect, at a price that you don't.

13:42 Get started on Linode today with a $20 credit, and you get access to native SSD storage,

13:47 a 40 gigabit network, industry-leading processors, their revamped cloud manager,

13:52 cloud.linode.com, root access to your server, along with their newest API,

13:56 and a Python CLI.

13:58 Just visit talkpython.fm/Linode when creating a new Linode account, and you'll automatically get $20 credit for your next project.

14:06 Oh, and one last thing, they're hiring.

14:08 Go to linode.com/careers to find out more.

14:11 Let them know that we sent you.

14:15 Also, like, some of the things that people monitor with web scraping are more like higher level.

14:20 We have the price, stock, but those are, like, really specific things, or values.

14:26 But with web scraping, you can actually monitor, like, company behaviors,

14:30 what the company is doing on a high level.

14:33 And you can achieve this by monitoring news, looking for, like, the text,

14:39 or the blog of the company, and getting out some useful information from that.

14:45 And, you know, setting up alerts and things like that.

14:47 Yeah, yeah, super cool.

14:48 So there's a couple of examples that I wanted to put out there, and then get your thought on it.

14:54 One is, you know, most of the world has been through the wringer on COVID-19

14:59 and all of this stuff.

15:00 In the U.S., we definitely have been recently getting our fair share of disruption from that.

15:05 So there's an article on towardsdatascience.com, where the challenge was that

15:12 in the early days of the lockdowns, at least in the United States, we had things like Instacart

15:18 and other grocery delivery things, like there's local grocery stores, you could order stuff

15:23 and they would bring it to your car outside so you didn't have to go in,

15:26 other people, fewer people in the store, all sorts of stuff that was probably a benefit from that.

15:30 But if you tried to book it, it was like a week and a half out.

15:34 Yeah.

15:34 And it was, it was a mess, right?

15:36 And so this person, Emmy Gil, wrote this article that talked about using web scraping

15:44 to find grocery store delivery slots, right?

15:48 So you could basically say, I'm going to go to my grocery store or to like something like,

15:51 Instacart and I'm just going to watch until one of these slots opens up.

15:55 And then what it had to do was actually send him a text, like go now and book it,

15:59 place my order now or whatever.
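
In spirit, that kind of script is a small polling loop. Here is a minimal sketch of the idea; the URL, the availability marker, and the alert are all hypothetical placeholders, and the actual article's code and text-message wiring were different.

```python
import time

import requests

CHECK_URL = "https://grocery.example.com/delivery-slots"  # hypothetical URL
POLL_SECONDS = 300  # check every five minutes; don't hammer the site


def slot_available(html):
    # Assumed marker text; you'd inspect the real page to find its version.
    return "No delivery times available" not in html


while True:
    response = requests.get(CHECK_URL, timeout=30)
    if response.ok and slot_available(response.text):
        # Swap this print for an SMS or email alert in a real version.
        print("A delivery slot just opened - go book it!")
        break
    time.sleep(POLL_SECONDS)
```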

16:01 And I think that that's a really creative way of using web scraping.

16:05 What do you think?

16:06 Yeah, it's definitely really creative.

16:08 It's useful.

16:09 So he wasn't just only getting alerted when there's a new spot available,

16:14 but he actually created the bot.

16:16 So it like, it chooses that spot.

16:18 Right.

16:19 It would actually place the order.

16:21 Yeah, that's smart.

16:21 I mean, there are, we had the same situation in Hungary where you couldn't find any free spot.

16:28 And the thing I love about web scraping is that it can be useful for huge companies

16:34 to do web scraping at scale, but it can be also useful for literally everyone.

16:40 Yeah.

16:40 In the world, just doing little things like this.

16:43 I have a friend who is looking to buy an apartment and currently the real estate market

16:49 is really like fragile, I would say.

16:51 Yeah, yeah.

16:52 Yeah.

16:52 And what he wants to do, which I suggested to him, is to set up a bot

16:58 that monitors like 10 websites, 10 real estate listing websites.

17:04 And so he will get alerted when there is an apartment available with the attributes

17:11 he's interested in.

17:12 Right.

17:13 Okay.

17:13 That's awesome.

17:14 Yeah.

17:14 So there are so many little things like this in web scraping and it's just awesome.

17:21 Yeah, I agree.

17:21 People talk often about this concept popularized by Al Sweigart, like automate the boring stuff.

17:27 Like, oh, I got to always do this thing in Excel.

17:29 So could I write a script that just generates the Excel workbook or I've always got to

17:33 do this other thing, copy these files and rename them, write something that does that.

17:36 But this feels like the automate the boring stuff of web interfaces, automate tracking

17:41 of these sites and let it find out when there's a new thing.

17:45 Yeah.

17:45 Automate the boring stuff is a really good resource.

17:48 In this case, it's really like what it's about.

17:51 If you didn't create a web scraper for this, what would you do?

17:55 You would probably look at the website every single, as often as you can,

18:00 every day, every hour.

18:01 You'd hear somebody say, I checked it every day until I found the apartment I wanted,

18:05 right?

18:06 That's what they would say.

18:07 Yeah, exactly.

18:08 Yeah.

18:09 And what you don't hear them say so often is my bot alerted me when my dream apartment

18:14 was available.

18:15 Yeah.

18:15 I created a Scrapy spider that alerts me when there is a new apartment listing

18:20 in that area.

18:21 Yeah.

18:21 And, but I think it's a really great example and it's like, there's so many things

18:26 on your computer that you do that you kind of feel like, ah, this stuff I just got to do.

18:30 But on the web, there's way more even that you probably end up, I always go to this site

18:34 and then I do this thing and I got to do that, right?

18:35 And web scraping will kind of open up a ton of that stuff.

18:39 So, let me give you two examples.

18:41 One, really grand, large scale.

18:44 One, like, extremely small but like, really interesting to me in the next,

18:48 you know, hour.

18:48 So, grand scale first.

18:50 When Trump was elected in the United States in 2016 and he was going to become president

18:56 in 2017, there was a lot of concern about what was going to happen to some of the data

19:01 that had so far been hosted on places like whitehouse.gov.

19:05 So, like, there's a lot of climate change data hosted there from the scientists

19:09 and there was a thought that he might take that down, not necessarily destroy it,

19:14 but definitely take it down publicly from the site and there's, you know,

19:18 there's all these different places, the CDC and other organizations that have a bunch

19:23 of data that a bunch of data scientists and other folks were just worried

19:28 about.

19:29 So, there were these big movements and I struggle to find the articles, you know,

19:33 almost four years out now.

19:35 Can't quite find the right search term to pull them up but there was these

19:38 like, save the data hackathon type things.

19:42 So, people got together at different universities, like 20, 30 people and they would spend

19:47 the weekend writing web scraping to go around to all these different organizations,

19:51 download the data.

19:53 I think they were coordinated across locations and then they would put that all

19:57 in a central location, I think somewhere in Europe, like in Switzerland or something,

20:00 so the data could live on.

20:02 And I think the day that Trump was inaugurated, the climate change data

20:06 went off of whitehouse.gov.

20:07 So, at least one of the data sources did disappear and that's a pretty large

20:12 scale web scraping effort.

20:13 Some of it was just like download CSVs, but a lot of it was web scraping,

20:16 I think.

20:17 Yeah, like the thing is people say that once you put something on the internet,

20:22 it's going to stay there forever.

20:23 And why this story is interesting is that this is information that is useful

20:29 for everyone.

20:30 Yeah.

20:30 And this is the kind of information that you want to make publicly available.

20:35 And I think this is one of the things that this movement is about, if I can call

20:40 it movement, open data to make data open and accessible for everyone.

20:45 And web scraping is really a great tool to do that, to extract data from the web,

20:53 put it in a structured format in a database so you can use it, can just store

20:59 information or you can get some kind of intelligence from the data.

21:03 And it's really like, without web scraping, you couldn't do anything else.

21:08 Would you just copy-paste the whole thing, or what would you do?

21:11 It would take a lot of by-hand work.

21:14 Yeah, it wouldn't be good.

21:15 Yeah.

21:15 Yeah, so that's really cool.

21:17 And that was an interesting large scale, like, hey, we have to organize.

21:21 We've got one month to do it.

21:22 Let's make this happen.

21:23 So on a much smaller scale, I recently realized that some of my

21:28 pages on my site were getting indexed by Google, not because anything on the

21:32 internet was linking to them publicly, but something would be in Gmail or

21:36 somebody would write it down and then Google would find it.

21:39 So I just went through and put a bunch of noindex meta tags on a bunch of

21:43 parts of my site.

21:44 But it really scared me because I changed a whole bunch of it and I'm pretty sure I

21:48 got it right.

21:48 But what if I accidentally put noindex on some of my course pages that advertise

21:54 them and all of a sudden the traffic goes from whatever it is to literally

21:58 zero because I accidentally told all the search engines, please don't tell

22:03 anyone about this page.

22:04 What I was doing actually right before we called, unrelated to the fact that

22:07 this was the topic, was I was going to write a very simple web scraping thing

22:11 that goes and grabs my site map, hits every single page and just tells me

22:15 the pages that are not indexed just for me to double check.

22:18 But, you know, to make sure that template didn't get reused somewhere where I didn't think it

22:21 got reused, you know, like the little shared bit that has the head in it and so

22:25 on.
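
Michael didn't say which tools he used for this, but a sketch of that checker, assuming requests plus BeautifulSoup and a standard XML sitemap, might look like this:

```python
import requests
from bs4 import BeautifulSoup
from xml.etree import ElementTree

SITEMAP_URL = "https://example.com/sitemap.xml"  # your own site's sitemap
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_url):
    # Pull every <loc> entry out of a standard XML sitemap.
    xml = requests.get(sitemap_url, timeout=30).text
    root = ElementTree.fromstring(xml)
    return [loc.text for loc in root.findall(".//sm:loc", SITEMAP_NS)]


for url in sitemap_urls(SITEMAP_URL):
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    robots = soup.find("meta", attrs={"name": "robots"})
    if robots and "noindex" in robots.get("content", "").lower():
        print("noindex:", url)  # double-check these pages by hand
```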

22:25 Yeah, it's sort of like in SEO, like search engine optimization, we have this

22:29 thing, technical SEO.

22:31 Yeah.

22:32 And actually, it's another use case for web scraping or like crawling, where you want

22:38 to crawl your own website, just as you know, you did and to learn the site map or

22:44 to find pages where you don't have proper SEO optimization, like you don't

22:49 have meta description on some pages, you have multiple like H1 tags or whatever.

22:56 Right.

22:56 And with crawling, you can figure these out.

22:59 And I just remember the story from like 2016, where web scraping was used to

23:07 reveal lobbying and corruption in Peru.

23:10 Oh, wow.

23:11 Okay.

23:12 Yeah, I'm not 100% sure, but I think it was related to, you know, the Panama

23:16 Papers.

23:17 Yeah, I was going to guess it might have something to do with the Panama

23:21 Papers, which was really interesting in and of itself.

23:23 Yeah, yeah, and they actually, they use many tools to scrape the web, but they

23:29 also use Scrappy to get like information from these, I guess, government

23:34 websites or other websites.

23:36 But yeah, like it's crazy that web scraping can reveal these kinds of things. Really,

23:42 this is a big thing.

23:44 Yeah.

23:44 Corruption and web scraping can tell you that, hey, there is something wrong

23:49 here, you should pay attention.

23:50 It lets you automate the discovery and the linking of all this different data

23:54 that was never intended to be put together, right?

23:57 There's no API that's supposed to take like these donations, this investment,

24:02 this other construction deal, or whatever the heck it was, and like link

24:06 them together.

24:06 But with web scraping, you can sort of pull it off, right?

24:09 Yeah, you can structure the whole web.

24:11 I mean, technically, you could structure the whole web, or you could get like a

24:17 set of websites that have the same kind of data you

24:22 are looking for, like it can be e-commerce, fintech, real estate, whatever.

24:27 You can grab those sites, extract the data, the publicly available data,

24:32 structure it, and then you can do many sorts of things.

24:35 You can do like NLP, you can just search in the data, you can do many different

24:39 kinds of things.

24:41 Yeah, that's cool.

24:42 So let's talk a little bit specifically about Scrapy and maybe Scrapinghub

24:47 as well, and some of the more general challenges you might run into with web scraping.

24:51 So right on the Scrapy website, homepage, project site, there's a real simple

24:57 example that says here's why you should use it.

24:58 You can just create this class, it derives from scrapy.Spider, and you give it some URLs, and it has this parse function, and off

25:05 it goes.

25:06 Do you want to talk to us real quickly about what it's like to write code to do

25:09 web scraping with this API?

25:11 With Scrapy?

25:12 Yeah.

25:12 Yeah, so with Scrapy, if you use Scrapy, what you really get is like a

25:18 full-fledged web crawling framework, or like a web scraping framework.

25:22 So it really makes it easy for you to extract data from any website.

25:28 And there are other libraries, like BeautifulSoup, or lxml, or in Java

25:34 we have jsoup, that are only focused around parsing and getting data

25:40 from HTML and XML files.

25:43 Right, BeautifulSoup is I have the text, I have the HTML source, now tell me

25:47 stuff about it, right?

25:48 Right.

25:49 That's sort of its whole job, yeah.
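
To make that division of labor concrete, this is roughly all a parse-only library does, shown here with BeautifulSoup on some inline sample HTML:

```python
from bs4 import BeautifulSoup

# BeautifulSoup's whole job: you hand it markup you already fetched,
# and it answers questions about it. No downloading, no crawling.
html = """
<html><body>
  <h1>Widget 3000</h1>
  <span class="price">$19.99</span>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.h1.get_text())                        # Widget 3000
print(soup.select_one("span.price").get_text())  # $19.99
```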

25:53 Right, but Scrapy, I would say it's like maybe 10% of the whole picture

25:55 or the big picture, because with Scrapy, like, yeah, you can parse the HTML,

26:00 but in a real-world project, that's not the only task you need to do.

26:05 In a real-world project, you need to process the data as well.

26:10 You need to structure the data properly, you need to maybe put it into a database,

26:16 maybe export it as a CSV file or JSON, you need to clean the data, you need to

26:23 normalize the data.

26:24 In real-world scenarios, you need to do a lot of things, not just scraping the

26:28 data, and Scrapy really makes it easy to develop these spiders that grab the

26:34 data, and then right there in Scrapy, you can clean it, you can process

26:40 it, and when the data leaves Scrapy, it's usable for whatever you want to

26:46 use it for.

26:46 Right, maybe you get a list or sequence of dictionaries that have the pieces

26:51 of data that you went after or something like that.

26:53 Yeah, right.
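
In Scrapy, that cleaning and processing step typically lives in item pipelines. Here is a minimal sketch of one, assuming items carry a hypothetical raw "price" string; real projects often chain several of these.

```python
# pipelines.py -- a minimal Scrapy item pipeline; the "price" field and
# its "$1,299.00"-style raw format are hypothetical examples.
class PriceCleaningPipeline:
    def process_item(self, item, spider):
        raw = item.get("price")
        if raw:
            # "$1,299.00" -> 1299.0
            item["price"] = float(raw.replace("$", "").replace(",", ""))
        return item
```

You would switch it on in settings.py with something like ITEM_PIPELINES = {"myproject.pipelines.PriceCleaningPipeline": 300}, where the number controls the order pipelines run in.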

26:54 So that's why I really like Scrapy, but also I think it's the best framework to

27:00 maintain your spiders, because a big problem in web scraping is that once

27:05 you write the spider, it works, you get the data, it will probably not work in

27:11 six months, because the website changes or the layout changes.

27:16 They've decided to switch JavaScript front-end frameworks or redesign the

27:21 navigation or something, right?

27:23 Exactly.

27:23 And so you need to adjust your code, adjust your selectors, you need to maybe adjust

27:30 the whole processing part in Scrapy, because maybe in the past you were able

27:36 to extract only messy code from the HTML, but they did some changes on the website.

27:42 Now it's clean, you don't need so much processing.

27:44 And with Scrapy, it's really easy to maintain the code, which is, you know,

27:49 it's really important if you rely on web-extracted data on a daily basis,

27:53 you need to maintain the code.

27:55 Yeah, really neat.

27:56 So to me, when I look at this simple API here, it's like you create a class,

28:01 you give it a set of start URLs, and there's a parse function, and that comes in,

28:05 you run some CSS selectors on what you get.

28:08 For each thing that you find, you can yield that back as part of the thing, like the

28:12 result you found, and then you can follow the next one.

28:15 I think that's probably the biggest differentiator from what I can tell is,

28:19 it's like as you're parsing it, you're directing it to go do more discovery, maybe

28:25 on linked pages and things like that.
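
A minimal spider in that shape, pointed at the quotes.toscrape.com practice site, looks something like this; it's in the spirit of the homepage example rather than a quote of it:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # toscrape.com is a sandbox site made for practicing scraping.
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one dict per quote, located with CSS selectors.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Then direct the crawl onward: follow the "Next" link, if any.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Saved as quotes_spider.py, running scrapy runspider quotes_spider.py -o quotes.json crawls page after page and writes the results out as JSON.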

28:27 Yeah, like on websites where you need to go deeper, maybe because, you know, you need to

28:33 go deeper in the sitemap or you need to like paginate, you need to do more

28:39 requests to get the data.

28:41 Yeah.

28:41 In tools like BeautifulSoup, you would need to sort of create your own logic or like,

28:47 you know, how you want to do that.

28:49 But in Scrapy, it's already sort of straightforward how you want to do that or how you

28:54 can do that.

28:55 And you just need to, click this button, click this button, and you paginate with

29:00 Scrapy, because you have these high-level functions and methods that let you paginate and

29:06 let you do things in a chain, you know?

29:08 Yeah.

29:09 Yeah, very cool API.

29:10 So let's talk about some of the challenges you might get.

29:14 I mean, probably to me, the biggest challenge, well, there's two problems

29:17 that I would say, the two biggest challenges.

29:19 One is probably maybe generally you could call it data quality, but more

29:25 specifically like the structure changed, right?

29:28 they've added a column to this table and it's not what I thought it was anymore.

29:32 Like it used to be temperature and now it's pressure because they've added that

29:36 column and I wrote it to have three columns in the table, now it's got four.

29:39 The other one is I go and I get the source code of the page and it says, you know,

29:45 a Vue application, or there's like an Angular app or something, and you get

29:50 just a bunch of script tags that have no data in them because it's not actually executed yet.

29:55 Yeah, I mean, this is like really all the things that all the reasons why you

30:00 need to maintain the code, which can be, you know, it takes time to maintain the

30:05 code.

30:05 Actually, that's why there are more and more advanced machine learning technologies

30:11 that pretty much maintain the code for you.

30:15 Oh, wow.

30:15 Okay.

30:15 Give us an example of that.

30:17 Like find me the title or find me the stock price off of this page, machine

30:21 magic.

30:21 Yeah.

30:22 So actually it's not that big of a magic trick.

30:24 Like in the future, you won't need to write things like selectors.

30:29 You don't need to tell Scrappy or like, you know, whatever tool you use to, hey, this

30:35 is the price title, whatever.

30:38 I want to extract it.

30:39 This is the selector where you can find it in the HTML.

30:43 In the future, you will not need to do that.

30:46 In the future, the only thing you will need to say is, this is the page,

30:51 this is a real estate listing, or this is a product page, or this is some kind of

30:57 financial page.

30:59 So you just need to say the data type.

31:01 Yeah.

31:02 And then the machine will know what fields to look for.

31:07 So it will know that if it's a product page, you probably need like price.

31:13 You probably need product title or like product name, sorry, description,

31:17 stock information, maybe reviews, if any.

31:20 So in the future, you just need to specify the data type and you will get all

31:25 the data fields that you want with ML.

31:28 Yeah.

31:28 That's like a dream.

31:29 Well, it's sort of like it's a reality in some cases, because, you know, there

31:34 are some products out there that do this. At Scrapinghub,

31:39 we have, for example, a news API, which does exactly this thing for articles and news.

31:45 Okay.

31:46 Yeah.

31:46 So what you do is say, hey, this is the page URL,

31:50 give me everything, and you get the article title, you get the text body,

31:55 you get, you know, everything that is on the page.

31:58 Well, and if any place is going to have enough data on scraping to like answer those kinds of

32:03 questions more generally, it sounds like you guys might.

32:06 You guys do so much web scraping, right?

32:08 Like, way more than a lot of people, in terms of being a platform.

32:12 I don't know how you like share that across projects.

32:15 I don't think that, I don't see a good way to do that, but it seems like you

32:18 should be in a position that somehow you could build some cool tools around all

32:22 that.

32:22 Yeah, we do a lot of web scraping, as you said.

32:25 And really, the biggest problem is that there is no one schema that websites

32:32 follow when it comes to data.

32:34 So when you scrape, you need to like figure out a structure that would work for

32:40 all the websites.

32:41 And then when you extract, you sort of standardize the data.

32:45 And that's why the data type is important to specify.

32:48 Right.

32:49 What about JavaScript?

32:50 That's what all the cool kids say you should do, you don't write server code anymore.

32:55 You don't write templates.

33:00 You just write APIs and you talk to them with HTML and JavaScript.

33:00 But that doesn't scrape very well.

33:02 I'm not sure that's necessarily what you should be doing entirely, but a

33:05 lot of people are and it makes web scraping definitely more challenging.

33:09 Yeah.

33:09 Like nowadays, all the websites have some kind of JavaScript running.

33:13 And if the data you need to extract is rendered with JavaScript, that can

33:19 be a challenge.

33:21 And first, if you're writing a spider, first, you should look for ways to get

33:26 the data without executing JavaScript.

33:28 You can do this sort of, you know, with what I call hidden APIs, when there is

33:34 no documented API for the website.

33:36 But if you look in the background, there is actually a public

33:41 API available.

33:42 Right.

33:42 So it's calling back to the server, like if you've got a Vue.js front end, it's probably

33:47 calling some function that's returning like a JSON result of the data you actually

33:52 want.

33:53 You just got to look in your web browser tools to see that happening, right?

33:56 Yeah, exactly.

33:57 Like with AJAX calls, many times that's the case that you just need to sort of grab

34:02 the JSON and you can get the data easily.

34:05 Well, and it's formatted JSON, right?

34:07 It's not just HTML.

34:08 I mean, it's like pretty sweet actually to get a hold of that feed.

34:11 Oh, yeah.

34:12 But on the other side, if you cannot find an API like this, you need to execute JavaScript using

34:18 some kind of headless browser like Selenium or Puppeteer or Splash.

34:23 And because, you know, you need an actual browser to execute JavaScript.

34:27 Right.

34:27 But then it's going to take more hardware resources to run JavaScript.

34:32 So it's going to be slower and it's going to make the whole process a little bit

34:37 more complicated.

34:38 Yeah.

34:38 There's probably no mechanism built right into Scrapy that will render the

34:43 JavaScript, is there?

34:44 Like Scrapy itself cannot render JavaScript, but there is a really neat integration with

34:50 Splash.

34:50 Okay.

34:51 Splash was also, you know, created by Scrapinghub,

34:56 and Splash is sort of like a headless browser created only for web scraping,

35:01 and you can integrate Splash with Scrapy really easily.

35:05 It's literally just like a middleware in Scrapy.

35:08 And then it will execute JavaScript for you.

35:11 Yeah.

35:11 Sweet.

35:11 Yeah.

35:12 You will be able to see the data and then you will be able to extract it.
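
A sketch of that wiring with the scrapy-splash package, assuming a Splash instance running locally on its default port (for example, via its Docker image), looks roughly like this:

```python
# settings.py -- register Splash as Scrapy middleware (scrapy-splash package).
SPLASH_URL = "http://localhost:8050"  # assumes a local Splash instance
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}
SPIDER_MIDDLEWARES = {"scrapy_splash.SplashDeduplicateArgsMiddleware": 100}
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"

# Then, in a spider, request pages through Splash so JavaScript runs first:
#
#   from scrapy_splash import SplashRequest
#
#   def start_requests(self):
#       yield SplashRequest("https://example.com", self.parse,
#                           args={"wait": 2})  # give the JS time to settle
```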

35:16 Right.

35:16 It says the headless browser designed specifically for web scraping.

35:20 Turn any JavaScript heavy site into data.

35:23 Right.

35:23 Cool.

35:24 I'll put a link to that.

35:25 That's pretty neat.

35:26 And that's open source.

35:27 People can just use that as well or what's the story with that?

35:30 Yeah.

35:30 Just like Scrapy, people can set up their own server and use it that way, or you can

35:36 also host it on Scrapinghub.

35:37 Talk Python to me is partially supported by our training courses.

35:43 How does your team keep their Python skills sharp?

35:46 How do you make sure new hires get started fast and learn the Pythonic way?

35:50 If the answer is a series of boring videos that don't inspire or a subscription service

35:56 you pay way too much for and use way too little, listen up.

35:59 At Talk Python Training, we have enterprise tiers for all of our courses.

36:03 Get just the one course you need for your team with full reporting and monitoring.

36:07 Or ditch that unused subscription for our course bundles which include all the

36:12 courses and you pay about the same price as a subscription once.

36:15 For details, visit training.talkpython.fm/business or just email

36:21 sales@talkpython.fm.

36:24 Yeah, so let's maybe round out our conversation talking about hosting.

36:28 So I can go and run web scraping on my system and that sometimes works pretty

36:35 well, sometimes it doesn't.

36:36 There was this time a friend of mine was thinking about getting a car and putting it

36:40 on one of these car sharing sites, to keep it a little bit general, and we wanted to

36:44 know like, okay, for a car that's similar to the kind of car he's thinking about getting,

36:47 how often is it rented?

36:49 What does it rent for?

36:50 Like, so would it be worthwhile as just a pure investment, right?

36:54 So I wrote up some web scraping stuff that would just monitor like a couple

36:58 of cars I picked.

36:59 After a little while, they didn't talk to me anymore.

37:02 The website was mad at me for like always asking about this car.

37:06 Yeah.

37:07 So that seems like a common problem, and if I do that from my house, I probably

37:11 can't check out that site anymore for a while, right?

37:14 I probably shouldn't do it from my house, but even from a dedicated server, like

37:17 that might cause problems.

37:18 Yeah, it's not just with web scraping.

37:20 Like I remember I was doing some kind of financial research.

37:24 It was about like one of the cryptocurrencies and I was using an actual API.

37:28 Like I wasn't scraping.

37:29 I was using an actual API and it was pretty basic.

37:32 Like I was just learning how to use this thing and I created a for loop and I put the

37:39 request, the API request in the for loop and I was just testing and by accident, I put like

37:46 a huge number in the for loop, like I don't know, like a million or something.

37:51 And it created a million requests.

37:54 I mean, it tried to create a million requests, but it stopped after like,

37:58 I don't know, maybe like 10,000 requests in a minute.

38:01 Right.

38:02 And because, you know, the server didn't like that I was doing that.

38:04 And it's the same thing with web scraping.

38:07 You need to be really careful not to hit the website too hard because when you scrape the

38:13 web, it's really important.

38:14 I mean, it's very important to be ethical and be legal.

38:19 And one of the things that you can do to achieve this is to just put in some kind of

38:24 delay limit, just out of respect for the website, and make sure that you don't

38:28 cause any harm to the website.

38:30 Right.

38:31 As soon as people see what you're doing, you think you're just getting data, but

38:35 they might perceive it as a distributed denial of service attack

38:39 against my site.

38:40 We're going to block this person, right?

38:42 That's not a good situation you want to be in.

38:44 And it's also not super nice to whoever runs that website to hammer it.

38:48 Exactly.

38:48 The website doesn't know what you want to do.

38:51 It just sees a bot and it's really like, you know, because web scraping itself is legal.

38:56 You can do it if it's a publicly available website and you need to make sure

39:01 that you are being, I mean, we use this in the web scraping community.

39:05 You need to be nice.

39:06 You need to be nice to the websites, and, you know, you can look at the robots.txt

39:10 file.

39:11 Usually there is, like, a delay specified in the robots.txt file that

39:16 tells you that, hey, you should put like three seconds between each request.

39:21 But just in general, even if there is no such rule defined in the robots.txt

39:26 file, you should pay attention.

39:28 You should be careful with your, how many requests you make, how frequently you make

39:33 those requests.

39:34 So yeah, it's important.
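
In Scrapy terms, most of that politeness comes down to a handful of settings; the values here are illustrative and should be tuned per site:

```python
# settings.py -- Scrapy's built-in politeness knobs (illustrative values).
ROBOTSTXT_OBEY = True               # honor the site's robots.txt rules
DOWNLOAD_DELAY = 3                  # seconds to wait between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # one request at a time per site
AUTOTHROTTLE_ENABLED = True         # back off automatically under load
```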

39:35 Yeah.

39:36 Interesting.

39:36 And so I guess one option is you guys have a cool thing, which I just discovered.

39:42 I don't know how new it is, but Scrapyd, a daemon that lets you basically deploy your

39:48 Scrapy spiders to it, and it'll schedule and run them, and that's pretty cool, and

39:53 you can control that.
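
Once a project is deployed to it (typically with the scrapyd-deploy tool), Scrapyd exposes a small JSON API for kicking off runs; here is a sketch with placeholder project and spider names:

```python
import requests

# Schedule a run on a local Scrapyd daemon (default port 6800).
# "myproject" and "quotes" are placeholders for your own names.
response = requests.post(
    "http://localhost:6800/schedule.json",
    data={"project": "myproject", "spider": "quotes"},
)
print(response.json())  # e.g. {"status": "ok", "jobid": "..."}
```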

39:53 So you could set that up yourself or obviously you have the business side of your

39:58 whole story, which is Scrapinghub.

40:00 So what does Scrapinghub do that makes this better?

40:03 Yes.

40:04 Than just, like, getting a server at AWS or Linode or something like that.

40:08 Yeah.

40:09 So Scrapinghub is really about, you can deploy your spiders, but it's really, it's made

40:15 for web scraping, and all the tools are available in the Scrapinghub platform

40:20 that you would want to use for web scraping.

40:23 Like, you know, we mentioned Scrapy and Splash. In the Scrapinghub platform,

40:27 you can upload your spider, your Scrapy spider, and then, with a click of

40:31 a button, you can add Splash.

40:33 So it renders JavaScript or you can also, if you want to use proxies for web

40:39 scraping, then with a click of a button, you can add proxies to your

40:43 web scraping.

40:44 You can also, if you have a website, like we mentioned, you know, articles and news, if you

40:49 want to extract data, we have an API, so you don't have to actually write the spider.

40:54 You can just use the API and get the data so you don't need to maintain the code.

40:59 So Scrapinghub is really just, I mean, I work at Scrapinghub, but I truly

41:03 believe that it's the best platform if you're doing web scraping at scale, because you have

41:08 all the tools in one place that you would want to use for web scraping.

41:12 Right, right.

41:14 So a lot of the stuff you might bring your own infrastructure for, you guys have like as a

41:19 service type stuff?

41:20 Yeah, exactly.

41:21 You can sign up.

41:22 We have a free trial for, I think, for most of our services.

41:26 You can try it.

41:27 You can see how it works.

41:28 And then there's a monthly fee and that's it.

41:31 Okay.

41:31 Yeah, that seems pretty awesome.

41:33 And also we have a lot of clients who don't want to deal with the stuff,

41:36 deal with the technical details.

41:38 So they just say that, hey, I want this type of data from these websites,

41:41 give me the data.

41:42 And so we just do the hard work for them.

41:45 Right.

41:45 Maybe they're not even Python people.

41:47 Yeah.

41:47 They might be like, they don't even know how to code or like.

41:50 Right.

41:51 They could be political scientists.

41:52 They're like, we need to just scrape this.

41:54 But like you guys do scraping, right?

41:56 Yeah.

41:56 Yeah, exactly.

41:57 So let's round out our conversation on this.

42:00 Just a quick comment on Scrapinghub, or a quick thought on it.

42:04 So when I talked to Pablo back in 2016, I had a show with him about Scrapy and Scrapinghub

42:11 and all that back in episode 50, four years ago or something.

42:14 One of the things that really impressed me about what you guys are doing

42:18 and continues to be the case is you've taken pretty straightforward open source projects,

42:24 not something completely insane, like a whole new database model or something, but just pretty

42:29 straightforward, but polished model of web scraping.

42:32 And you built a really powerful business on top of it.

42:35 And I think it's an interesting model for taking open source and making it your job more

42:40 than just I'll do consulting in this thing, right?

42:43 If I wrote this web framework, I could maybe sustain myself by consulting with people who

42:48 want to write websites in that framework or so.

42:50 But this is a pretty interesting next level type of thing that you guys pulled off here.

42:55 And what are your thoughts on this?

42:56 Yeah, 100% agree.

42:58 Like I think, you know, Pablo Hoffman and Shane Evans, they did

43:02 an awesome job when they created Scrapy.

43:05 Like at the time, it was the first real web scraping framework, as far as I know.

43:10 And they open sourced it.

43:11 And it's really, like, if you look at the GitHub page of Scrapy, I feel like

43:17 it's a community.

43:18 Like people are chatting about this.

43:20 People are talking about how the community can improve this tool.

43:25 And then the fact...

43:26 You've got like 17,000 GitHub questions or something like that.

43:30 Oh, yeah.

43:30 Tagged on it.

43:31 Yeah, it's quite a bit.

43:31 Yeah.

43:32 And the fact that Pablo and Shane were able to pull this thing into business, but in a

43:37 really...

43:38 Like people can still use Scrapy without using our services.

43:42 They can still use Splash without paying.

43:45 They can use it.

43:45 It's open source.

43:47 So the fact that they built on top of the open source tools in a way that you don't need

43:53 to be affiliated with the company to get the benefits of Scrapy and other open source tools.

43:58 Like there are many other tools.

44:00 It's not just Scrapy.

44:01 Like there are so many other tools for web scraping.

44:04 And I agree.

44:05 I think it's just really amazing that they have been doing this.

44:09 Yeah.

44:09 It seems like it's really adding value, because if not, the core thing is already open

44:14 source and people would just use it, you know?

44:16 Yeah.

44:16 Like especially if they want to do it at scale.

44:19 Yeah.

44:20 You just cannot do huge things like millions of extracted records or things like that on your

44:26 computer or on your laptop.

44:28 You just cannot do that.

44:29 No, you get about 50,000 and then you get blocked.

44:32 Yeah.

44:33 Yeah.

44:33 That's an option as well.

44:34 Yeah, exactly.

44:36 So yeah, very cool.

44:37 I think it's a really neat example because a lot of people are thinking about

44:40 how do I take this thing that's got some traction I'm doing with open source

44:43 and make this my job, because I'm tired of filing Jira tickets at this thing that's kind of

44:49 okay but not really my passion, right?

44:51 And cool example.

44:52 So congrats to you guys.

44:53 Keep up the good work.

44:54 Yeah, we are.

44:55 We definitely try our best.

44:57 There are some awesome things to come in the web scraping world, I believe, with the

45:02 advancements of machine learning and other stuff.

45:04 So it's going to be really interesting to see what's going to happen in the

45:08 next few years.

45:09 Yeah, for sure.

45:10 All right.

45:10 Well, I think that's probably it for the time we got to talk about web scraping.

45:13 But the final two questions before you get out of here, you're going to write some

45:17 code.

45:17 What editor do you use?

45:18 Python code?

45:19 PyCharm.

45:20 PyCharm.

45:21 Right on.

45:21 And then notable PyPI package.

45:23 I mean, there's obviously pip install scrapy, but maybe some project that you

45:28 came across, you know, like, oh, this is so awesome.

45:30 People should know about X.

45:32 Oh, actually, it's a hard question because I've been sort of out of the game for

45:37 a lot of months now.

45:38 Aside from scrappy, I just really like to, I don't have one specific example.

45:43 I just really like to find these GitHub repositories where, when it's not like

45:49 a library, but it's like a collection of, like, there are many libraries, GitHub

45:54 repos with a collection of templates.

45:56 Right.

45:56 You know, like, hey, if you're a beginner, you can use these templates to get

46:00 started.

46:01 And for me, that was really useful when I started out with using a new tool

46:05 like Scrapy or other libraries, that you can use this starter code, sort of.

46:12 Right, right.

46:15 I wonder if there's a cookiecutter Scrapy, possibly.

46:15 Probably.

46:15 Yeah.

46:16 Yeah.

46:16 There is.

46:17 I'll throw it into the show notes, but yeah.

46:21 Cool.

46:21 But just cookiecutter, make me a Scrapy project.

46:24 Let's roll and do some web scraping.

46:26 Pretty neat.

46:26 Also, there is, you know, just one last note, an awesome list.

46:28 I think it's called awesome-web-scraping.

46:31 OK.

46:31 Or something like that, which has hundreds of tools for web scraping.

46:35 Oh, yeah.

46:36 We can link it later.

46:38 Maybe if I find it.

46:39 OK.

46:39 Yeah.

46:40 It's called awesome-web-scraping on GitHub.

46:42 Yeah.

46:43 Nice.

46:43 I'll put that in the show notes as well.

46:45 I love those awesome lists.

46:46 Yeah.

46:46 It's easy to get lost because you were like, oh, I have this one thing and I

46:50 use it.

46:50 Then all of a sudden you're like, oh, look at that.

46:52 There's like 10 other options in this thing that I thought I knew.

46:55 Yeah, exactly.

46:57 Awesome.

46:58 All right.

46:58 Attila, it's been great to chat with all these, share all these web scraping

47:02 stories and chat with you.

47:03 So thanks for being here.

47:04 Yeah, it's been amazing.

47:05 Thank you.

47:05 Thank you very much.

47:06 Yeah, you bet.

47:07 Bye-bye.

47:07 Bye.

47:08 This has been another episode of Talk Python to Me.

47:11 Our guest in this episode has been Attila Toth and it's been brought to you by Talk

47:15 Python Training and Linode.

47:17 Start your next Python project on Linode's state-of-the-art cloud service.

47:21 Just visit talkpython.fm/Linode, L-I-N-O-D-E.

47:26 You'll automatically get a $20 credit when you create a new account.

47:29 Want to level up your Python?

47:31 If you're just getting started, try my Python Jumpstart by Building 10 Apps course.

47:36 Or if you're looking for something more advanced, check out our new Async course

47:40 that digs into all the different types of Async programming you can do in

47:44 Python.

47:44 And of course, if you're interested in more than one of these, be sure to

47:48 check out our Everything Bundle.

47:49 It's like a subscription that never expires.

47:51 Be sure to subscribe to the show.

47:53 Open your favorite podcatcher and search for Python.

47:55 We should be right at the top.

47:57 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the

48:02 direct RSS feed at /rss on talkpython.fm.

48:06 This is your host, Michael Kennedy.

48:08 Thanks so much for listening.

48:09 I really appreciate it.

48:10 Now get out there and write some Python code.

