#283: Web scraping, the 2020 edition Transcript
00:00 Web scraping is pulling the HTML of a website down and parsing useful data out of it.
00:04 The use cases for this type of functionality are endless.
00:07 Got a bunch of data on governmental sites that's only listed online in HTML, without a download?
00:13 There's an API for that.
00:15 Do you want to keep abreast of what your competitors are featuring on their site?
00:19 There's an API for that.
00:20 Need alerts for changes on a website?
00:23 For example, enrollment is now open at your college, and you want to be first and avoid that 8 a.m. morning slot?
00:29 Well, there's an API for that as well.
00:31 That API is screen scraping, and Attila Toth from Scrapinghub is here to tell us all about it.
00:36 This is Talk Python to Me, episode 283, recorded July 22, 2020.
00:41 Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities.
01:00 This is your host, Michael Kennedy.
01:02 Follow me on Twitter, where I'm @mkennedy.
01:05 Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython.
01:11 This episode is brought to you by Linode and us.
01:15 Python's async and parallel programming support is highly underrated.
01:20 Have you shied away from the amazing new async and await keywords because you've heard it's
01:24 way too complicated or that it's just not worth the effort?
01:27 With the right workloads, a hundred times speed up is totally possible with minor changes to your code.
01:33 But you do need to understand the internals, and that's why our course, Async Techniques and Examples in Python,
01:38 shows you how to write async code successfully, as well as how it works.
01:44 Get started with async and await today with our course at talkpython.fm/async.
01:49 Attila, welcome to Talk Python to Me.
01:52 Thanks for having me, Michael.
01:53 Yeah, it's great to have you here.
01:55 I'm looking forward to talking about web scraping.
01:57 It's the API that you didn't know existed, but it's out there if you're willing to find it.
02:02 It's like the dark API.
02:03 Yeah, it's like a secret API that you need to do a little work to be able to use, to earn the right to use it.
02:11 Yeah.
02:12 But yeah, I like it.
02:13 It's useful.
02:14 It definitely is.
02:15 And we've got some pretty interesting use cases to talk about and a little bit of history and stuff.
02:20 We're obviously going to talk about Scrapy, some stuff about Scrapinghub, what you guys are doing there.
02:24 So a lot of fun stuff.
02:25 But before we get to all that, let's start with your story.
02:27 How'd you get into programming in Python?
02:28 Yeah, sure.
02:29 So actually, I got into programming back in elementary school.
02:33 I was the sort of kid, I think I was like eighth grade.
02:42 And at the time, everybody in my class at school was asking for,
02:46 I think it was the Xbox 360, which had just come out.
02:48 Everybody was asking for that for Christmas.
02:48 Yep.
02:49 And I was the one who was asking for a book about programming.
02:54 What language did you want at that time?
02:56 Yeah.
02:57 So actually, I did like a lot of research.
03:00 I think for like three months, I was researching which programming language is the best for beginners.
03:07 So I was on different forums, Yahoo Answers, in every place on the internet.
03:13 And many people suggested at the time, Pascal.
03:16 So I got this book.
03:18 I think it was called something like Introduction to Turbo Pascal or something like that.
03:23 Right.
03:24 Okay.
03:24 Yeah.
03:25 So I got the book.
03:26 And that was the first step.
03:28 And then, after like 40 pages into the book, I threw it away because I found that just Googling stuff
03:35 and using Stack Overflow, it's just easier to learn that way.
03:41 Nice.
03:42 So did you end up writing anything in Turbo Pascal?
03:45 Yeah.
03:46 I was just, you know, the regular things beginners usually do.
03:49 Like in the terminal, I, you know, I printed my name in red.
03:54 I printed like figures with characters, you know, just the regular things beginners do.
04:00 But then I quickly switched to Java after Pascal.
04:04 Right.
04:05 Java, you can build web apps, you can build GUIs, more interesting things.
04:10 Yeah.
04:10 At the time I was interested in Android, Android development.
04:14 Actually, it was a few years later when I got to high school, I started to like program in Java to develop Android apps.
04:22 I created like, I don't know, maybe like five or six apps for Google Play that I just found useful,
04:30 like minor things, like a music player that randomly plays the music that's on your phone.
04:37 And the only input is how long you want to listen to music.
04:41 Oh, cool.
04:41 So I was like, I want to listen for 20 minutes.
04:44 And it just randomly plays music for 20 minutes.
04:47 So, you know, little things like that.
04:49 Yeah.
04:49 Cool.
04:50 How'd you find your way over to Python?
04:51 Thanks to, thanks to web scraping actually.
04:54 Yeah.
04:54 Yeah.
04:55 You know, I'm not sure which was the first point.
04:58 No, I actually remember which was the first time I came across web scraping.
05:02 I was trying to develop like a software.
05:06 Like I was pretty into sports betting, football and soccer.
05:09 And I didn't play high stakes because I wasn't good enough to afford to do it.
05:16 But I was interested in the analysis part, you know, looking up stats, looking up injury information,
05:22 looking at different factors that would determine the result, the outcomes of games.
05:28 And I wanted to create a software that would predict the outcome of these games, or some of them;
05:36 we have, I don't know, 25 games in a weekend,
05:38 and it would predict, like, okay, look at these three games,
05:43 and these are going to be the results.
05:45 So I wanted to create a software like this.
05:47 Right.
05:48 And if that beats the spread that they are offering at Las Vegas, you're like,
05:51 oh, they're going to win by 14.
05:53 Oh, they said they're only winning by seven.
05:55 I want to take that.
05:56 Maybe.
05:56 Yeah.
05:57 I tried that.
05:58 And I also tried.
05:59 How did it go?
06:00 Yeah.
06:03 Well, the thing is, I wouldn't say it went too well, but it wasn't too bad.
06:07 I mean, I experimented, like, first of all, I was betting on like over and under,
06:12 because I found that that's like a more predictable thing to bet on.
06:17 That's what I thought, because in reality, it isn't, but I bet on over and under.
06:22 And so this software would tell me that, hey, in this game, you know, if it's soccer,
06:29 there will be more than two goals in the first half.
06:33 And so what I would do is I would watch the game live, because in live betting, there are better odds,
06:40 because as the time progresses, the odds for scoring in the first half go up.
06:45 So I just waited for, I think it was the 15th minute or something like that.
06:50 And if there were still no goals scored in the first half, I would bet on it.
06:55 So I was doing this for like, I think four months without money, just testing the system.
07:01 Right.
07:02 I was watching and okay, I would bet right now.
07:05 And I was calculating, I had this Excel spreadsheet and everything.
07:08 And in that four-month period, I think it was not profitable at all,
07:14 but it was close to being just a little profitable long term.
07:19 Yeah.
07:19 Well, maybe you were onto something.
07:21 If you get some more factors, maybe bring in a little machine learning, who knows?
07:25 Yeah, actually, maybe I should pick it up.
07:28 But that's when I came across web scraping.
07:30 Right.
07:31 Because I needed data.
07:33 I needed stats.
07:34 And a lot of these places, they don't want to give out that data.
07:37 They'll show it on their website, but they're not going to make an API that you can just stream.
07:40 Right.
07:41 Exactly.
07:42 At the time, there were different commercial APIs that big news corporations used,
07:50 but it was just a hobby project.
07:51 And the only way for me to get data was to extract it from a publicly available website.
07:59 And so I started to look into it.
08:02 And at the time I was programming in Java.
08:04 So I found Jsoup.
08:06 I found HtmlUnit, which is a testing library, but it can be used for web scraping.
08:12 I found, is it Jaunt?
08:15 I think it's pronounced Jaunt.
08:16 And I found other libraries.
08:19 And everybody was saying that, which one is the best?
08:22 Which one is the best?
08:23 Everybody was talking on the forums about Scrapy.
08:26 And I didn't know what Scrapy was.
08:28 So I looked it up.
08:29 It's in Python.
08:30 I don't know how to program in Python.
08:32 I barely can program in Java.
08:34 But then I eventually learned Python just to learn Scrapy.
08:39 Cool.
08:40 Yeah.
08:40 And it's such a nice way to do web scraping.
08:42 It's a weird language.
08:44 It's Python.
08:45 It doesn't have semicolons, curly braces.
08:47 There's not so many of those.
08:49 Yeah.
08:49 It was like comparing it to Java.
08:51 It's very clean.
08:53 Yeah.
08:53 That's for sure.
08:54 Java is a little verbose in all of its type information, even for a typed language,
08:58 like say compared to C or I don't know, maybe even C#.
09:02 Like it likes its interfaces and all of its things and so on.
09:06 Right.
09:06 Yeah.
09:07 The best thing I love about Python, which I cannot do in Java, is that in Python,
09:11 you can read the code like an actual sentence.
09:15 Right.
09:16 I mean, not always, but sometimes you can, it's like a sentence.
09:19 Yeah.
09:19 There's this cartoon joke about Python that there's these two scientists or something like that.
09:26 And they've got this pseudocode.
09:28 It says pseudocode on their file.
09:29 And it's like pseudocode.txt.
09:31 They're like, oh, how do we turn this into a program?
09:34 Oh, you change the txt to a .py and it'll run.
09:36 Yeah.
09:37 It's already there.
09:38 Yeah, exactly.
09:39 All right.
09:40 Well, so that's how you got into it.
09:42 And obviously your interest in web scraping.
09:44 So, you've continued that interest today, right?
09:47 What are you doing these days?
09:48 Yeah.
09:49 So I've been developing a lot of spiders, a lot of scrapers, with Scrapy and other tools over the years.
09:55 And then I joined the company that actually created Scrapy, Scrapinghub.
10:01 And I've been with Scrapinghub for over a year now.
10:04 And what I'm doing is as a day-to-day is educating people about web scraping,
10:09 both on the technical side of things and business side of things, like how to extract data from the web,
10:16 why is it useful for you or why it can be useful for you.
10:20 And so nowadays I don't really code that much unless it's for an article or a tutorial or to showcase some new stuff.
10:29 I spend more time on creating videos and just, you know, teaching.
10:34 Yeah.
10:34 Well, that sounds really fun.
10:35 You just get to explore, right?
10:37 Yeah.
10:37 And the biggest part or the best part I love about it is that like, I get to speak to customers who are doing web scraping or who are sort of
10:47 enjoying the benefits of web scraping.
10:49 And there are some really cool stories, like what these customers do with data.
10:55 Yeah.
10:56 And it's really creative.
10:57 Like people can get super creative, like how to make use of, you know, web data,
11:02 which is just there in front of you.
11:05 You can see the data on the website.
11:06 It doesn't look that interesting, you know, but when you extract it and you structure it,
11:12 you can actually make use of it and drive, you know, you can do many different things,
11:17 but you can drive better decisions in companies, which is pretty exciting.
11:21 Yeah.
11:22 And I guess if you work at a company and you're thinking, I'd like to do some comparative analysis against, say, our competitors: what
11:31 are they doing?
11:32 What are they doing with sales?
11:33 Like in terms of discounts, what are they doing in terms of the things they're featuring,
11:37 right?
11:38 You could theoretically write a web scraper that looks at your competitors'
11:42 data, their sites, their presentation, and sort of gives you that over time relative to you,
11:48 right?
11:48 Or something like that.
11:49 Yeah.
11:50 Like many use cases are, you know, as you said about monitoring competitors or monitoring the market.
11:59 And it's especially a big thing in e-commerce, where in most sectors
12:06 there are a lot of companies doing the same thing, selling the same thing.
12:12 And it's just really hard for them to sell more products.
12:16 So with web scraping, they can monitor the competitor prices.
12:21 They can monitor the competitors' stock information.
12:25 They can monitor pretty much everything that is on their website, publicly available.
12:31 And they can gather this information.
12:33 And, you know, it can be like tens of thousands of products or more.
12:38 And you can see where you should raise your prices, lower your prices, and how to price your products better and do marketing better.
12:49 Yeah.
12:50 And there's a lot of implicit signals as well that most companies don't expose.
12:54 Some do, but most won't say, there were 720 of these sold today or whatever.
12:59 Yeah.
13:00 but you could, say, graph the number of reviews over time or the number of stars over time,
13:06 things like that would give you a sense of like a proxy, that kind of information.
13:10 Right.
13:13 You could come up with a pretty good and pretty interesting analysis.
13:13 This portion of Talk Python to Me is brought to you by Linode.
13:18 Whether you're working on a personal project or managing your enterprise's infrastructure,
13:22 Linode has the pricing, support, and scale that you need to take your project to the next level.
13:27 With 11 data centers worldwide, including their newest data center in Sydney,
13:31 Australia, enterprise-grade hardware, S3 compatible storage, and the next generation network,
13:38 Linode delivers the performance that you expect, at a price that you don't.
13:42 Get started on Linode today with a $20 credit, and you get access to native SSD storage,
13:47 a 40 gigabit network, industry-leading processors, their revamped cloud manager,
13:52 cloud.linode.com, root access to your server, along with their newest API,
13:56 and a Python CLI.
13:58 Just visit talkpython.fm/Linode when creating a new Linode account, and you'll automatically get $20 credit for your next project.
14:06 Oh, and one last thing, they're hiring.
14:08 Go to linode.com/careers to find out more.
14:11 Let them know that we sent you.
14:15 Also, some of the things that people monitor with web scraping are more high-level.
14:20 We have the price, stock, but those are, like, really specific things, or values.
14:26 But with web scraping, you can actually monitor, like, company behaviors,
14:30 what the company is doing on a high level.
14:33 And you can achieve this by monitoring news, looking for, like, the text,
14:39 or the blog of the company, and getting out some useful information from that.
14:45 And, you know, setting up alerts and things like that.
14:47 Yeah, yeah, super cool.
14:48 So there's a couple of examples that I wanted to put out there, and then get your thought on it.
14:54 One is, you know, most of the world has been through the wringer on COVID-19
14:59 and all of this stuff.
15:00 In the U.S., we definitely have been recently getting our fair share of disruption from that.
15:05 So there's an article on towardsdatascience.com where, well, the challenge was that
15:12 in the early days of the lockdowns, at least in the United States, we had things like Instacart
15:18 and other grocery delivery things, like there's local grocery stores, you could order stuff
15:23 and they would bring it to your car outside so you didn't have to go in,
15:26 other people, fewer people in the store, all sorts of stuff that was probably a benefit from that.
15:30 But if you tried to book it, it was like a week and a half out.
15:34 Yeah.
15:34 And it was, it was a mess, right?
15:36 And so this person, Emmy Gil, as they wrote it, wrote this article about using web scraping
15:44 to find grocery store delivery slots, right?
15:48 So you could basically say, I'm going to go to my grocery store or to like something like,
15:51 Instacart and I'm just going to watch until one of these slots opens up.
15:55 And then what he had to do was actually send him a text, like go now and book it,
15:59 place my order now or whatever.
16:01 And I think that that's a really creative way of using web scraping.
16:05 What do you think?
16:06 Yeah, it's definitely really creative.
16:08 It's useful.
16:09 So he wasn't just getting alerted when there's a new spot available,
16:14 but he actually created the bot,
16:16 so it, like, chooses that spot.
16:18 Right.
16:19 It would actually place the order.
16:21 Yeah, that's smart.
16:21 I mean, there are, we had the same situation in Hungary where you couldn't find any free spot.
16:28 And the thing I love about web scraping is that it can be useful for huge companies
16:34 to do web scraping at scale, but it can be also useful for literally everyone.
16:40 Yeah.
16:40 In the world, just doing little things like this.
16:43 I have a friend who is looking to buy an apartment and currently the real estate market
16:49 is really like fragile, I would say.
16:51 Yeah, yeah.
16:52 Yeah.
16:52 And what he wants to do, what I suggested to him, is to set up a bot
16:58 that monitors like 10 websites, 10 real estate listing websites.
17:04 And so he will get alerted when there is an apartment available with the attributes
17:11 he's interested in.
17:12 Right.
17:13 Okay.
17:13 That's awesome.
17:14 Yeah.
17:14 So there are so many little things like this in web scraping and it's just awesome.
17:21 Yeah, I agree.
17:21 People often talk about this concept popularized by Al Sweigart, like Automate the Boring Stuff.
17:27 Like, oh, I got to always do this thing in Excel.
17:29 So could I write a script that just generates the Excel workbook or I've always got to
17:33 do this other thing, copy these files and rename them, write something that does that.
17:36 But this feels like the automate the boring stuff of web interfaces, automate tracking
17:41 of these sites and let it like finding out when there's a new thing.
17:45 Yeah.
17:45 Automate the Boring Stuff is a really good resource.
17:48 In this case, it's really like what it's about.
17:51 If you didn't create a web scraper for this, what you would do?
17:55 You would probably look at the website every single, as often as you can,
18:00 every day, every hour.
18:01 You'd hear somebody say, I checked it every day until I found the apartment I wanted,
18:05 right?
18:06 That's what they would say.
18:07 Yeah, exactly.
18:08 Yeah.
18:09 And what you don't hear them say so often is my bot alerted me when my dream apartment
18:14 was available.
18:15 Yeah.
18:15 I created a Scrapy spider that alerts me when there is a new apartment listing
18:20 in that area.
18:21 Yeah.
18:21 But I think it's a really great example, and there's so many things
18:26 on your computer that you do that you kind of feel like, ah, this stuff I just got to do.
18:30 But on the web, there's way more even, where you end up thinking, I always go to this site
18:34 and then I do this thing and I got to do that, right?
18:35 And web scraping will kind of open up a ton of that stuff.
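To make that concrete, here is a minimal sketch of the kind of alert bot being described. The URL and the CSS selector are hypothetical placeholders, and a real version would need its own selector plus the polite delays discussed later in the episode:

    # A minimal "watch a page and alert me" sketch; URL and selector are made up.
    import time
    import requests
    from bs4 import BeautifulSoup

    LISTINGS_URL = "https://example.com/apartments"  # hypothetical site
    seen_links = set()

    def check_for_new_listings():
        html = requests.get(LISTINGS_URL, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        for link in soup.select("a.listing"):  # hypothetical selector
            href = link.get("href")
            if href and href not in seen_links:
                seen_links.add(href)
                print(f"New listing: {href}")  # swap in an email/SMS alert here

    while True:
        check_for_new_listings()
        time.sleep(15 * 60)  # check every 15 minutes; be nice to the site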
18:39 So, let me give you two examples.
18:41 One, really grand, large scale.
18:44 One, like, extremely small but like, really interesting to me in the next,
18:48 you know, hour.
18:48 So, grand scale first.
18:50 When Trump was elected in the United States in 2016 and he was going to become president
18:56 in 2017, there was a lot of concern about what was going to happen to some of the data
19:01 that had so far been hosted on places like whitehouse.gov.
19:05 So, like, there's a lot of climate change data hosted there from the scientists
19:09 and there was a thought that he might take that down, not necessarily destroy it,
19:14 but definitely take it down publicly from the site and there's, you know,
19:18 there's all these different places, the CDC and other organizations that have a bunch
19:23 of data that a bunch of data scientists and other folks were just worried
19:28 about.
19:29 So, there were these big movements and I struggle to find the articles, you know,
19:33 almost four years out now.
19:35 Can't quite find the right search term to pull them up, but there were these,
19:38 like, save the data hackathon type things.
19:42 So, people got together at different universities, like 20, 30 people and they would spend
19:47 the weekend writing web scraping to go around to all these different organizations,
19:51 download the data.
19:53 I think they were coordinated across locations and then they would put that all
19:57 in a central location, I think somewhere in Europe, like in Switzerland or something,
20:00 so the data could live on.
20:02 And I think the day that Trump was inaugurated, the climate change data
20:06 went off of whitehouse.gov.
20:07 So, at least one of the data sources did disappear and that's a pretty large
20:12 scale web scraping effort.
20:13 Some of it was just like download CSVs, but a lot of it was web scraping,
20:16 I think.
20:17 Yeah, like the thing is people say that once you put something on the internet,
20:22 it's going to stay there forever.
20:23 And why this story is interesting is that this is information that is useful
20:29 for everyone.
20:30 Yeah.
20:30 And this is the kind of information that you want to make publicly available.
20:35 And I think this is one of the things that this movement is about, if I can call
20:40 it movement, open data to make data open and accessible for everyone.
20:45 And web scraping is really a great tool to do that, to extract data from the web,
20:53 put it in a structured format in a database so you can use it, can just store
20:59 information or you can get some kind of intelligence from the data.
21:03 And really, without web scraping, you couldn't do anything else.
21:08 Would you just copy-paste the whole thing, or what would you do?
21:11 It would take a lot of by-hand work.
21:14 Yeah, it wouldn't be good.
21:15 Yeah.
21:15 Yeah, so that's really cool.
21:17 And that was an interesting large scale, like, hey, we have to organize.
21:21 We've got one month to do it.
21:22 Let's make this happen.
21:23 So on a much smaller scale, I recently realized that some of my
21:28 pages on my site were getting indexed by Google, not because anything on the
21:32 internet was linking to them publicly, but something would be in Gmail or
21:36 somebody would write it down and then Google would find it.
21:39 So I just went through and put a bunch of no index meta tags on a bunch of my
21:43 parts of my site.
21:44 But it really scared me because I changed a whole bunch of it and I'm pretty sure I
21:48 got it right.
21:48 But what if I accidentally put no index on some of my course pages that advertise
21:54 them and all of a sudden the traffic goes from whatever it is to literally
21:58 zero because I accidentally told all the search engines, please don't tell
22:03 anyone about this page.
22:04 What I was doing actually right before we called, unrelated to the fact that
22:07 this was the topic, was I was going to write a very simple web scraping thing
22:11 that goes and grabs my sitemap, hits every single page, and just tells me
22:15 the pages that are not indexed, just for me to double-check
22:18 that the template didn't get reused somewhere where I didn't think it
22:21 got reused, you know, like the little shared bit that has the head in it and so
22:25 on.
22:25 Yeah, it's sort of like in SEO, like search engine optimization, we have this
22:29 thing, technical SEO.
22:31 Yeah.
22:32 And actually, it's another use case for web scraping or like crawling, where you want
22:38 to crawl your own website, just as, you know, you did, and to learn the sitemap or
22:44 to find pages where you don't have proper SEO, like you don't
22:49 have a meta description on some pages, or you have multiple H1 tags or whatever.
22:56 Right.
22:56 And with crawling, you can figure these out.
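As a sketch of the kind of audit being described here, assuming a standard XML sitemap; the sitemap URL is a placeholder, and error handling is omitted:

    # Fetch a sitemap, hit each page, and flag any page marked noindex.
    import requests
    from xml.etree import ElementTree
    from bs4 import BeautifulSoup

    SITEMAP_URL = "https://example.com/sitemap.xml"  # hypothetical
    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    tree = ElementTree.fromstring(requests.get(SITEMAP_URL, timeout=30).content)
    urls = [loc.text for loc in tree.findall(".//sm:loc", NS)]

    for url in urls:
        soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
        robots = soup.find("meta", attrs={"name": "robots"})
        if robots and "noindex" in robots.get("content", "").lower():
            print(f"noindex: {url}")  # pages search engines are told to skip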
22:59 And I just remember the story from like 2016, where web scraping was used to
23:07 reveal lobbying and corruption in Peru.
23:10 Oh, wow.
23:11 Okay.
23:12 Yeah, I'm not 100% sure, but I think it was related to, you know, the Panama
23:16 Papers.
23:17 Yeah, I was going to guess it might have something to do with the Panama
23:21 Papers, which was really interesting in and of itself.
23:23 Yeah, yeah, and they actually, they use many tools to scrape the web, but they
23:29 also use Scrappy to get like information from these, I guess, government
23:34 websites or other websites.
23:36 But yeah, it's crazy that web scraping can reveal these kinds of things.
23:42 This is a big thing.
23:44 Yeah.
23:44 Corruption and web scraping can tell you that, hey, there is something wrong
23:49 here, you should pay attention.
23:50 It lets you automate the discovery and the linking of all this different data
23:54 that was never intended to be put together, right?
23:57 There's no API that's supposed to take like these donations, this investment,
24:02 this other construction deal, or whatever the heck it was, and like link
24:06 them together.
24:06 But with web scraping, you can sort of pull it off, right?
24:09 Yeah, you can structure the whole web.
24:11 I mean, technically, you could structure the whole web, or you could get like a
24:17 set of websites that have the same kind of data you
24:22 are looking for, like e-commerce, fintech, real estate, whatever.
24:27 You can grab those sites, extract the data, the publicly available data,
24:32 structure it, and then you can do many sorts of things.
24:35 You can do like NLP, you can just search in the data, you can do many different
24:39 kinds of things.
24:41 Yeah, that's cool.
24:42 So let's talk a little bit specifically about Scrapy and maybe Scrapinghub
24:47 as well, and some of the more general challenges you might run into with web scraping.
24:51 So right on the Scrapy website, homepage, project site, there's a real simple
24:57 example that says here's why you should use it.
24:58 You can just create this class, it derives from scrapy.Spider,
25:01 and you give it some URLs, and it has this parse function, and off
25:05 it goes.
25:06 Do you want to talk to us real quickly about what it's like to write code to do
25:09 web scraping with this API?
25:11 With Scrapy?
25:12 Yeah.
25:12 Yeah, so with Scrapy, if you use Scrapy, what you really get is like a
25:18 full-fledged web crawling framework, or like a web scraping framework.
25:22 So it really makes it easy for you to extract data from any website.
25:28 And there are other libraries, like BeautifulSoup, or lxml, or in Java
25:34 we have jsoup, that are only focused around parsing HTML and getting data
25:40 from HTML and XML files.
25:43 Right, BeautifulSoup is I have the text, I have the HTML source, now tell me
25:47 stuff about it, right?
25:48 Right.
25:49 That's sort of its whole job, yeah.
25:50 Right, but parsing, I would say, is like maybe 10% of the whole picture
25:55 or the big picture, because with Scrapy, like, yeah, you can parse the HTML,
26:00 but in a real-world project, that's not the only task you need to do.
26:05 In a real-world project, you need to process the data as well.
26:10 You need to structure the data properly, you need to maybe put it into a database,
26:16 maybe export it as a CSV file or JSON, you need to clean the data, you need to
26:23 normalize the data.
26:24 In real-world scenarios, you need to do a lot of things, not just scraping the
26:28 data, and Scrapy really makes it easy to develop these spiders that grab the
26:34 data, and then right there in Scrapy, you can clean it, you can process
26:40 it, and when the data leaves Scrapy, it's usable for whatever you want to
26:46 use it for.
26:46 Right, maybe you get a list or sequence of dictionaries that have the pieces
26:51 of data that you went after or something like that.
26:53 Yeah, right.
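For reference, a minimal spider in the spirit of the homepage example under discussion; the start URL and the CSS selectors are hypothetical stand-ins for a real site:

    import scrapy

    class ProductsSpider(scrapy.Spider):
        name = "products"
        start_urls = ["https://example.com/products"]  # hypothetical site

        def parse(self, response):
            # Selectors are placeholders; each real site needs its own.
            for item in response.css("div.product"):
                yield {
                    "title": item.css("h2::text").get(),
                    "price": item.css("span.price::text").get(),
                }

Running it with scrapy runspider would emit one dictionary per product, the sequence of dictionaries mentioned above.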
26:54 So that's why I really like Scrapy, but also I think it's the best framework to
27:00 maintain your spiders, because a big problem in web scraping is that once
27:05 you write the spider, it works, you get the data, it will probably not work in
27:11 the six months in the future because the website changes or layout changes.
27:16 They've decided to switch JavaScript front-end frameworks or redesign the
27:21 navigation or something, right?
27:23 Exactly.
27:23 And so you need to adjust your code, adjust your selectors, you need to maybe adjust
27:30 the whole processing part of Scrappy, because maybe in the past you were able
27:36 to extract only messy code from the HTML, but they did some changes on the website.
27:42 Now it's clean, you don't need so much processing.
27:44 And with Scrapy, it's really easy to maintain the code, which is, you know,
27:49 it's really important if you rely on web-extracted data on a daily basis,
27:53 you need to maintain the code.
27:55 Yeah, really neat.
27:56 So to me, when I look at this simple API here, it's like you create a class,
28:01 you give it a set of start URLs, and there's a parse function, and that comes in,
28:05 you run some CSS selectors on what you get.
28:08 For each thing that you find, you can yield that back as part of the thing, like the
28:12 result you found, and then you can follow the next one.
28:15 I think that's probably the biggest differentiator from what I can tell is,
28:19 it's like as you're parsing it, you're directing it to go do more discovery, maybe
28:25 on linked pages and things like that.
28:27 Yeah, like on websites where you need to go deeper, maybe because, you know, you need to
28:33 go deeper in the sitemap or you need to like paginate, you need to do more
28:39 requests to get the data.
28:41 Yeah.
28:41 In tools like BeautifulSoup, you would need to sort of create your own logic or like,
28:47 you know, how you want to do that.
28:49 But in Scrapy, it's already sort of straightforward how you want to do that or how you
28:54 can do that.
28:55 And you just, like, click this button, click this button, and you paginate with
29:00 Scrapy, because you have these high-level functions and methods that let you paginate and
29:06 let you do things in a chain, you know?
29:08 Yeah.
29:09 Yeah, very cool API.
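To make the pagination idea concrete, a sketch of following a next link with Scrapy's response.follow; again, the URL and selectors are hypothetical:

    import scrapy

    class PagedSpider(scrapy.Spider):
        name = "paged"
        start_urls = ["https://example.com/listings?page=1"]  # hypothetical

        def parse(self, response):
            for href in response.css("a.listing::attr(href)").getall():
                yield {"url": response.urljoin(href)}

            # response.follow resolves relative links and chains the next request
            next_page = response.css("a.next::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)

The chain happens because each parsed page yields a request for the next one, which Scrapy schedules automatically.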
29:10 So let's talk about some of the challenges you might get.
29:14 I mean, probably to me, well, there's two problems
29:17 that I would say are the two biggest challenges.
29:19 One is probably maybe generally you could call it data quality, but more
29:25 specifically like the structure changed, right?
29:28 they've added a column to this table and it's not what I thought it was anymore.
29:32 Like it used to be temperature and now it's pressure because they've added that
29:36 column and I wrote it to have three columns in the table, now it's got four.
29:39 The other one is I go and I get the source code of the page and it's, you know,
29:45 a Vue application or there's like an Angular app or something, and you get
29:50 just a bunch of script tags that have no data in them because it's not actually executed yet.
29:55 Yeah, I mean, this is like really all the things that all the reasons why you
30:00 need to maintain the code, which can be, you know, it takes time to maintain the
30:05 code.
30:05 Actually, that's why there are more and more advanced machine learning technologies
30:11 that pretty much maintain the code for you.
30:15 Oh, wow.
30:15 Okay.
30:15 Give us an example of that.
30:17 Like find me the title or find me the stock price off of this page, machine
30:21 magic.
30:21 Yeah.
30:22 So actually it's not that big of a magic.
30:24 Like in the future, you won't need to write things like selectors.
30:29 You don't need to tell Scrapy or, like, you know, whatever tool you use, hey, this
30:35 is the price, the title, whatever.
30:38 I want to extract it.
30:39 This is the selector where you can find it in the HTML.
30:43 In the future, you will not need to do that.
30:46 In the future, what you will only need to say is that this is the page.
30:51 this is a real estate listing or this is a product page or this is some kind of
30:57 financial page.
30:59 So you just need to say the data type.
31:01 Yeah.
31:02 And then the machine will know what fields to look for.
31:07 So it will know that if it's a product page, you probably need like price.
31:13 You probably need product title or like product name, sorry, description,
31:17 stock information, maybe reviews, if any.
31:20 So in the future, you just need to specify the data type and you will get all
31:25 the data fields that you want with ML.
31:28 Yeah.
31:28 That's like a dream.
31:29 Well, it's sort of a reality in some cases, because, you know, there
31:34 are some products out there that do this. At Scrapinghub,
31:39 we have, for example, a news API, which does exactly this thing for articles and news.
31:45 Okay.
31:46 Yeah.
31:46 So what you do is that, hey, this is the page URL.
31:50 Give me everything, and you get the article title, you get the text body,
31:55 you get, you know, everything that is on the page.
31:58 Well, and if any place is going to have enough data on scraping to like answer those kinds of
32:03 questions more generally, it sounds like you guys might.
32:06 You guys do so much web scraping, right?
32:08 That like way more than a lot of people in terms of as a platform.
32:12 I don't know how you like share that across projects.
32:15 I don't think that, I don't see a good way to do that, but it seems like you
32:18 should be in a position that somehow you could build some cool tools around all
32:22 that.
32:22 Yeah, we do a lot of web scraping, as you said.
32:25 And really, the biggest problem is that there is no one schema that websites
32:32 follow when it comes to data.
32:34 So when you scrape, you need to like figure out a structure that would work for
32:40 all the websites.
32:41 And then when you extract, you sort of standardize the data.
32:45 And that's why the data type is important to specify.
32:48 Right.
32:49 What about JavaScript?
32:50 That's all the cool kids say they should, you don't write server code anymore.
32:55 You don't write templates.
32:56 Just you write just APIs and you talk to them with HTML and JavaScript.
33:00 But that doesn't scrape very well.
33:02 I'm not sure that's necessarily what you should be doing entirely, but a
33:05 lot of people are and it makes web scraping definitely more challenging.
33:09 Yeah.
33:09 Like nowadays, all the websites have some kind of JavaScript running.
33:13 And if the data you need to extract is rendered with JavaScript, that can
33:19 be a challenge.
33:21 And first, if you're writing a spider, first, you should look for ways to get
33:26 the data without executing JavaScript.
33:28 You can do this sort of, you know, I call them hidden APIs when there is
33:34 no documented API for the website.
33:36 But if you look in the background, there is actually a public
33:41 API available.
33:42 Right.
33:42 So it's calling back to the server, like if you've got a Vue.js front end, it's probably
33:47 calling some function that's returning like a JSON result of the data you actually
33:52 want.
33:53 You just got to look in your web browser tools to see that happening, right?
33:56 Yeah, exactly.
33:57 Like with AJAX calls, many times that's the case that you just need to sort of grab
34:02 the JSON and you can get the data easily.
34:05 Well, and it's formatted JSON, right?
34:07 It's not just HTML.
34:08 I mean, it's like pretty sweet actually to get a hold of that feed.
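To illustrate the hidden-API idea, a sketch of calling such a background endpoint directly once it has been spotted in the browser's network tab; the URL, parameters, and response fields are all made up:

    # Hitting a "hidden" JSON endpoint found via the browser dev tools.
    import requests

    resp = requests.get(
        "https://example.com/api/listings",    # hypothetical endpoint
        params={"page": 1, "sort": "newest"},  # hypothetical query params
        timeout=30,
    )
    resp.raise_for_status()

    # The response is already structured JSON, so no HTML parsing is needed.
    for item in resp.json().get("results", []):
        print(item.get("title"), item.get("price"))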
34:11 Oh, yeah.
34:12 But on the other side, if you cannot find an API like this, you need to execute JavaScript using
34:18 some kind of headless browser like Selenium or Puppeteer or Splash.
34:23 And because, you know, you need an actual browser to execute JavaScript.
34:27 Right.
34:27 But then it's going to take more hardware resources to run JavaScript.
34:32 So it's going to be slower and it's going to make the whole process a little bit
34:37 more complicated.
34:38 Yeah.
34:38 There's probably no mechanism built right into Scrapy that will render the
34:43 JavaScript, is there?
34:44 Like Scrapy itself cannot render JavaScript, but there is a really neat integration with
34:50 Splash.
34:50 Okay.
34:51 Splash was also, you know, created by Scrapinghub, so it's really easy,
34:56 and Splash is sort of like a headless browser created only for web scraping.
35:01 And you can integrate Splash with Scrapy really easily.
35:05 It's literally just like a middleware in Scrapy.
35:08 And then it will execute JavaScript for you.
35:11 Yeah.
35:11 Sweet.
35:11 Yeah.
35:12 You will be able to see the data and then you will be able to extract it.
35:16 Right.
35:16 It says the headless browser designed specifically for web scraping.
35:20 Turn any JavaScript heavy site into data.
35:23 Right.
35:23 Cool.
35:24 I'll put a link to that.
35:25 That's pretty neat.
35:26 And that's open source.
35:27 People can just use that as well or what's the story with that?
35:30 Yeah.
35:30 Just like Scrapy, they can set up their own server and you can use it that way, or you can
35:36 also host it on Scrapinghub.
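For a sense of what that integration looks like, a sketch using the scrapy-splash plugin; the settings wiring is abbreviated, so treat this as a starting point rather than a complete configuration:

    import scrapy
    from scrapy_splash import SplashRequest  # pip install scrapy-splash

    # settings.py would also set SPLASH_URL to a running Splash instance
    # (often a local Docker container) and enable the scrapy-splash middlewares.

    class JsSpider(scrapy.Spider):
        name = "js_site"

        def start_requests(self):
            # SplashRequest routes the page through Splash so JavaScript runs
            # before the HTML comes back to the spider.
            yield SplashRequest(
                "https://example.com/js-heavy-page",  # hypothetical URL
                callback=self.parse,
                args={"wait": 2},  # give scripts a moment to render
            )

        def parse(self, response):
            yield {"title": response.css("h1::text").get()}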
35:37 Talk Python to me is partially supported by our training courses.
35:43 How does your team keep their Python skills sharp?
35:46 How do you make sure new hires get started fast and learn the Pythonic way?
35:50 If the answer is a series of boring videos that don't inspire or a subscription service
35:56 you pay way too much for and use way too little, listen up.
35:59 At Talk Python Training, we have enterprise tiers for all of our courses.
36:03 Get just the one course you need for your team with full reporting and monitoring.
36:07 Or ditch that unused subscription for our course bundles which include all the
36:12 courses and you pay about the same price as a subscription once.
36:15 For details, visit training.talkpython.fm/business or just email sales@
36:21 talkpython.fm.
36:24 Yeah, so let's maybe round out our conversation talking about hosting.
36:28 So I can go and run web scraping on my system and that sometimes works pretty
36:35 well, sometimes it doesn't.
36:36 There was this time a friend of mine was thinking about getting a car and putting it
36:40 on one of these car sharing sites, to keep it a little bit general, and we wanted to
36:44 know like, okay, for a car that's similar to the kind of car he's thinking about getting,
36:47 how often is it rented?
36:49 What does it rent for?
36:50 Like, so would it be worthwhile as just a pure investment, right?
36:54 So I wrote up some web scraping stuff that would just monitor like a couple
36:58 of cars I picked.
36:59 After a little while, they didn't talk to me anymore.
37:02 The website was mad at me for like always asking about this car.
37:06 Yeah.
37:07 So that seems like a common problem, and if I do that from my house, I probably
37:11 can't check out that site anymore for a while, right?
37:14 I probably shouldn't do it from my house, but even from a dedicated server, like
37:17 that might cause problems.
37:18 Yeah, it's not just with web scraping.
37:20 Like I remember I was doing some kind of financial research.
37:24 It was about like one of the cryptocurrencies and I was using an actual API.
37:28 Like I wasn't scraping.
37:29 I was using an actual API and it was pretty basic.
37:32 Like I was just learning how to use this thing and I created a for loop and I put the
37:39 request, the API request in the for loop and I was just testing and by accident, I put like
37:46 a huge number in the for loop, like I don't know, like a million or something.
37:51 And it created a million requests.
37:54 I mean, it tried to create a million requests, but it stopped after like,
37:58 I don't know, maybe like 10,000 requests in a minute.
38:01 Right.
38:02 And because, you know, the server didn't like that I was doing that.
38:04 And it's the same thing with web scraping.
38:07 You need to be really careful not to hit the website too hard because when you scrape the
38:13 web, it's really important.
38:14 I mean, it's very important to be ethical and be legal.
38:19 And one of the things that you can do to achieve this is to just put some kind of
38:24 delay or rate limit, just to respect the website and make sure that you don't
38:28 cause any harm to the website.
38:30 Right.
38:31 As soon as people see what you're doing, you think you're just getting data, but
38:35 they might perceive it as a distributed denial of service attack
38:39 against my site.
38:40 We're going to block this person, right?
38:42 That's not a good situation you want to be in.
38:44 And it's also not super nice to whoever runs that website to hammer it.
38:48 Exactly.
38:48 The website doesn't know what you want to do.
38:51 It just sees a bot and it's really like, you know, because web scraping itself is legal.
38:56 You can do it if it's a publicly available website and you need to make sure
39:01 that you are being, I mean, we use this in the web scraping community.
39:05 You need to be nice.
39:06 You need to be nice to the websites, and you know, you can look at the robots.txt
39:10 file.
39:11 Usually there is a crawl delay specified in the robots.txt file that
39:16 tells you that, hey, you should put like three seconds between each request.
39:21 But just in general, even if there is no such rule defined in the robots.txt
39:26 file, you should pay attention.
39:28 You should be careful with your, how many requests you make, how frequently you make
39:33 those requests.
39:34 So yeah, it's important.
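In Scrapy, this kind of politeness is typically expressed in the project settings; a sketch of the relevant options, with illustrative values rather than recommendations:

    # settings.py -- a sketch of the "be nice" knobs in a Scrapy project.
    ROBOTSTXT_OBEY = True                 # respect robots.txt rules
    DOWNLOAD_DELAY = 3                    # seconds between requests to a site
    CONCURRENT_REQUESTS_PER_DOMAIN = 2    # don't hammer one domain in parallel

    # AutoThrottle adapts the delay to how quickly the server responds.
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 5
    AUTOTHROTTLE_MAX_DELAY = 60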
39:35 Yeah.
39:36 Interesting.
39:36 And so I guess one option is you guys have a cool thing, which I just discovered.
39:42 I don't know how new it is, but Scrapyd, a daemon that lets you basically deploy your
39:48 Scrapy scripts to it and it'll schedule and run them, and that's pretty cool, and
39:53 you can control that.
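As a rough sketch of that workflow: Scrapyd exposes a small HTTP API that can be driven from Python. This assumes a Scrapyd instance on its default port 6800, and the project and spider names are placeholders:

    import requests

    SCRAPYD = "http://localhost:6800"

    # Kick off a crawl of an already-deployed project.
    run = requests.post(
        f"{SCRAPYD}/schedule.json",
        data={"project": "myproject", "spider": "myspider"},
        timeout=30,
    ).json()
    print("job id:", run.get("jobid"))

    # List pending/running/finished jobs for the project.
    jobs = requests.get(
        f"{SCRAPYD}/listjobs.json",
        params={"project": "myproject"},
        timeout=30,
    ).json()
    print("running:", [job["id"] for job in jobs.get("running", [])])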
39:53 So you could set that up yourself or obviously you have the business side of your
39:58 whole story, which is Scrapinghub.
40:00 So what does Scrapinghub do that makes this better
40:03 than just getting a server at AWS or Linode or something like that?
40:04 Yes.
40:08 Yeah.
40:09 So Scrapinghub is really about, you can deploy your spiders, but it's really, it's made
40:15 for web scraping, and all the tools that you would want to use for web scraping
40:20 are available in the Scrapinghub platform.
40:23 Like, you know, we mentioned Scrapy and Splash in the Scrapinghub platform.
40:27 You can upload your spider, your Scrapy spider, and then, with a click of
40:31 a button, you can add Splash.
40:33 So it renders JavaScript. Or, if you want to use proxies for web
40:39 scraping, with a click of a button, you can add proxies to your
40:43 web scraping.
40:44 You can also, if you have a website, like we mentioned, you know, articles and news, if you
40:49 want to extract data, we have an API, so you don't have to actually write the spider.
40:54 You can just use the API and get the data so you don't need to maintain the code.
40:59 So, I work at Scrapinghub, but I truly
41:03 believe that it's the best platform if you're doing web scraping at scale, because you have
41:08 all the tools in one place that you would want to use for web scraping.
41:12 Right, right.
41:14 So a lot of the stuff you might bring your own infrastructure for, you guys have
41:19 as-a-service type offerings?
41:20 Yeah, exactly.
41:21 You can sign up.
41:22 We have a free trial for, I think, for most of our services.
41:26 You can try it.
41:27 You can see how it works.
41:28 And then there's a monthly fee and that's it.
41:31 Okay.
41:31 Yeah, that seems pretty awesome.
41:33 And also we have a lot of clients who don't want to deal with the stuff,
41:36 deal with the technical details.
41:38 So they just say that, hey, I want this type of data from these websites,
41:41 give me the data.
41:42 And so we just do the hard work for them.
41:45 Right.
41:45 Maybe they're not even Python people.
41:47 Yeah.
41:47 They might be like, they don't even know how to code or like.
41:50 Right.
41:51 They could be political scientists.
41:52 They're like, we need to just scrape this.
41:54 But like you guys do scraping, right?
41:56 Yeah.
41:56 Yeah, exactly.
41:57 So let's round out our conversation on this.
42:00 Just a quick comment on Scraping Hub or a quick thought on it.
42:04 So when I talked to Pablo back in 2016, I had a show with him about Scrapy and Scrapinghub
42:11 and all that back in episode 50, four years ago or something.
42:14 One of the things that really impressed me about what you guys are doing
42:18 and continues to be the case is you've taken pretty straightforward open source projects,
42:24 not something completely insane, like a whole new database model or something, but just pretty
42:29 straightforward, but polished model of web scraping.
42:32 And you built a really powerful business on top of it.
42:35 And I think it's an interesting model for taking open source and making it your job more
42:40 than just I'll do consulting in this thing, right?
42:43 If I wrote this web framework, I could maybe sustain myself by consulting with people who
42:48 want to write websites in that framework or so.
42:50 But this is a pretty interesting next level type of thing that you guys pulled off here.
42:55 And what are your thoughts on this?
42:56 Yeah, 100% agree.
42:58 Like I think, you know, Pablo Hoffman and Shane Evans, they did
43:02 an awesome job when they created Scrapy.
43:05 Like at the time, it was the first real web scraping framework, as far as I know.
43:10 And they open sourced it.
43:11 And if you look at the GitHub page of Scrapy, it's really like, I feel like
43:17 it's a community.
43:18 Like people are chatting about this.
43:20 People are talking about how the community can improve this tool.
43:25 And then the fact...
43:26 You've got like 17,000 Stack Overflow questions or something like that.
43:30 Oh, yeah.
43:30 Tagged on it.
43:31 Yeah, it's quite a bit.
43:31 Yeah.
43:32 And the fact that Pablo and Shane were able to turn this thing into a business, but in a
43:37 really...
43:38 Like people can still use Scrapy without using our services.
43:42 They can still use Splash without paying.
43:45 They can use it.
43:45 It's open source.
43:47 So the fact that they built on top of the open source tools in a way that you don't need
43:53 to be affiliated with the company to get the benefits of Scrapy and other open source tools.
43:58 Like there are many other tools.
44:00 It's not just Scrapy.
44:01 Like there are so many other tools for web scraping.
44:04 And I agree.
44:05 I think it's just really amazing that they have been doing this.
44:09 Yeah.
44:09 It seems like it's really adding value, because if not, the core thing is already open
44:14 source and people would just use it, you know?
44:16 Yeah.
44:16 Like especially if they want to do it at scale.
44:19 Yeah.
44:20 You just cannot do huge things like millions of extracted records or things like that on your
44:26 computer or on your laptop.
44:28 You just cannot do that.
44:29 No, you get about 50,000 and then you get blocked.
44:32 Yeah.
44:33 Yeah.
44:33 That's an option as well.
44:34 Yeah, exactly.
44:36 So yeah, very cool.
44:37 I think it's a really neat example because a lot of people are thinking about
44:40 how do I take this thing that's got some traction I'm doing with open source
44:43 and make this my job because I'm tired of filing Jira tickets at this thing that's kind of
44:49 okay but not really my passion, right?
44:51 And cool example.
44:52 So congrats to you guys.
44:53 Keep up the good work.
44:54 Yeah, we are.
44:55 We definitely try our best.
44:57 There are some awesome things to come in the web scraping world, I believe, with the
45:02 advancements of machine learning and other stuff.
45:04 So it's going to be really interesting to see what's going to happen in the
45:08 next few years.
45:09 Yeah, for sure.
45:10 All right.
45:10 Well, I think that's probably it for the time we got to talk about web scraping.
45:13 But the final two questions before you get out of here, you're going to write some
45:17 code.
45:17 What editor do you use?
45:18 Python code?
45:19 PyCharm.
45:20 PyCharm.
45:21 Right on.
45:21 And then notable PyPI package.
45:23 I mean, there's obviously pip install scrapy, but maybe some project that you
45:28 came across, you know, like, oh, this is so awesome.
45:30 People should know about X.
45:32 Oh, actually, it's a hard question because I've been sort of out of the game for
45:37 a lot of months now.
45:38 Aside from Scrapy, I just really like to, I don't have one specific example.
45:43 I just really like to find these GitHub repositories where it's not like
45:49 a library, but it's like a collection of, like, there are many GitHub
45:54 repos with a collection of templates.
45:56 Right.
45:56 You know, like, hey, if you're a beginner, you can use these templates to get
46:00 started.
46:01 And for me, that was really useful when I started out with using a new tool
46:05 like Scrapy or other libraries, is that you can use these starter codes, sort of.
46:12 Right, right.
46:12 I wonder if there's a cookiecutter Scrapy, possibly.
46:15 Probably.
46:15 Yeah.
46:16 Yeah.
46:16 There is.
46:17 I'll throw it into the show notes, but yeah.
46:21 Cool.
46:21 But just cookiecutter, make me a Scrapy project.
46:24 Let's roll and do some web scraping.
46:26 Pretty neat.
46:26 Also, there is a, you know, just one last note, an awesome list.
46:28 I think it's called awesome-web-scraping.
46:31 OK.
46:31 Or something like that, which has hundreds of tools for web scraping.
46:35 Oh, yeah.
46:36 We can link it later.
46:38 Maybe if I find it.
46:39 OK.
46:39 Yeah.
46:40 It's called awesome-web-scraping, on GitHub.
46:42 Yeah.
46:43 Nice.
46:43 I'll put that in the show notes as well.
46:45 I love those awesome lists.
46:46 Yeah.
46:46 It's easy to get lost because you were like, oh, I have this one thing and I
46:50 use it.
46:50 Then all of a sudden you're like, oh, look at that.
46:52 There's like 10 other options in this thing that I thought I knew.
46:55 Yeah, exactly.
46:57 Awesome.
46:58 All right.
46:58 Attila, it's been great to share all these web scraping
47:02 stories and to chat with you.
47:03 So thanks for being here.
47:04 Yeah, it's been amazing.
47:05 Thank you.
47:05 Thank you very much.
47:06 Yeah, you bet.
47:07 Bye-bye.
47:07 Bye.
47:08 This has been another episode of Talk Python to Me.
47:11 Our guest in this episode has been Attila Toth and it's been brought to you by Talk
47:15 Python Training and Linode.
47:17 Start your next Python project on Linode's state-of-the-art cloud service.
47:21 Just visit talkpython.fm/Linode, L-I-N-O-D-E.
47:26 You'll automatically get a $20 credit when you create a new account.
47:29 Want to level up your Python?
47:31 If you're just getting started, try my Python Jumpstart by Building 10 Apps course.
47:36 Or if you're looking for something more advanced, check out our new Async course
47:40 that digs into all the different types of Async programming you can do in
47:44 Python.
47:44 And of course, if you're interested in more than one of these, be sure to
47:48 check out our Everything Bundle.
47:49 It's like a subscription that never expires.
47:51 Be sure to subscribe to the show.
47:53 Open your favorite podcatcher and search for Python.
47:55 We should be right at the top.
47:57 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the
48:02 direct RSS feed at /rss on talkpython.fm.
48:06 This is your host, Michael Kennedy.
48:08 Thanks so much for listening.
48:09 I really appreciate it.
48:10 Now get out there and write some Python code.