#50: Web scraping at scale with Scrapy and ScrapingHub Transcript
00:00 What do you do when you're working with an amazing web application that, for whatever reason, doesn't have an API?
00:04 One option is to say, gee, I wish that site had an API and just give up.
00:09 Or you could use Scrapy, an open source web scraping framework from Pablo Hoffman and scrapinghub.com, and create your own API.
00:17 On episode 50 of Talk Python to Me, we'll talk about how to do this, when it makes sense, and even when it's allowed.
00:24 This episode was recorded February 16th, 2016.
00:28 Welcome to Talk Python to Me, a weekly podcast on Python.
00:56 The language, the libraries, the ecosystem, and the personalities.
00:59 This is your host, Michael Kennedy.
01:01 Follow me on Twitter, where I'm @mkennedy.
01:03 Keep up with the show and listen to past episodes at talkpython.fm.
01:07 And follow the show on Twitter via @talkpython.
01:10 This episode is brought to you by Hired and SnapCI.
01:13 Thank them for supporting the show on Twitter via @Hired_HQ and @Snap_CI.
01:20 Hey, everyone.
01:21 Thanks for joining me today.
01:22 We have a great interview on tap with Pablo Hoffman.
01:25 I want to give you a quick Kickstarter update before we get to that, though.
01:28 There are just three days left to join the course via Kickstarter.
01:31 Get a big discount while you're at it.
01:33 Of course, it'll be for sale afterwards, but not at the Kickstarter prices.
01:36 The students who have had early access have had really positive things to say.
01:40 Here's just a taste.
01:42 This is by far the best Python course I've done to date.
01:45 Clear explanations, good apps that incorporate the concepts of particular sessions, and follow
01:49 up on core concepts.
01:51 The course is very engaging with a little humor, which makes it much easier to dedicate time
01:55 to following along.
01:56 If you're looking for an easy to follow, enjoyable Python course, back this project now.
02:00 Get comfortable and enjoy.
02:01 Thanks, Andy.
02:02 You can check it out at talkpython.fm/course.
02:06 Now let's talk about web scraping.
02:08 Pablo, welcome to the show.
02:10 Thank you, Michael.
02:11 Thanks for having me here.
02:12 Yeah, we got some really cool stuff to talk about.
02:14 Web scraping, open source businesses.
02:16 We'll talk a little bit of Python 3, maybe.
02:19 All sorts of cool things.
02:20 But before we dig into them, let's talk about your story.
02:24 How did you get into Python and programming?
02:25 All right.
02:26 So I met Python when I was in college in 2004, and I immediately fell in love with it, with
02:31 the simplicity and the structure of the syntax.
02:34 Back then, I was doing a lot of subjects in college, but I always tried to find an opportunity
02:40 to use Python for whatever.
02:41 I would invent a lot of crazy, just useless stuff just to be able to use Python for it.
02:49 Before Python, in my past, I used PHP, but I don't regret anything I did.
02:56 But yeah, I come from that background, more sysadmin, intranet web application background, sort
03:02 of natural flow for me.
03:04 I started working with Python in 2004, introduced by a colleague in college.
03:10 And I never looked back.
03:12 And yeah, it was only in 2007 that I was able to start a company working solely on Python.
03:19 So I had to wait three years.
03:21 But you've made it.
03:23 You finally made it.
03:24 Yeah, yeah, yeah.
03:25 Absolutely.
03:25 And I'm still here almost 10 years later working pretty much exclusively with Python and enjoying
03:33 every moment of it.
03:34 Yeah, that's really cool.
03:35 What continues to surprise me about Python is, here's a language and ecosystem that's
03:40 25 plus years old and seems to be becoming more popular and growing faster, even though
03:47 it's not like something new that just sprung on the scene.
03:49 So yeah, I think we're in a good place.
03:52 Yeah, yeah.
03:52 I believe it's a combination of a lot of things, right?
03:55 And the community certainly is one of the key aspects of Python, in addition to the language
04:00 being beautiful and elegant by itself.
04:02 So a lot of things that come together for that to happen.
04:07 Yeah, I totally agree.
04:08 So we're going to talk about web scraping.
04:10 There's a whole variety of listeners.
04:12 Some of them go, oh, yeah, web scraping.
04:14 I did that back in, you know, whatever.
04:16 But a lot of people maybe don't really understand or are familiar with this.
04:19 So maybe give me the background on what is web scraping?
04:22 Why do we do this?
04:22 How does it work?
04:23 Right.
04:24 So there's a bunch of information everywhere on the web right now, and there always has been and
04:30 will continue to be.
04:31 The information is growing in ever-increasing amounts.
04:35 And so there's a natural need for extracting and using that information for all kinds of purposes.
04:42 But you can't just take what's on a web page and use it as is.
04:46 You need to extract it and transform it in order to apply whatever
04:52 you need to apply to it.
04:54 And that is where web scraping comes into play.
04:56 And data is great.
04:58 Data is beautiful.
04:59 There's a lot of things that you can do with data once you have it in a format that you can manipulate.
05:05 And so that's where the web scraping discipline sits.
05:07 And given the ever-increasing amount of data in web pages in unstructured form,
05:13 web scraping is definitely here to stay, and it's going to continue growing.
05:18 My interest in web scraping came shortly after I met Python in college.
05:22 So there was this news site ecosystem in my country here in Uruguay, where newspapers had just come online but didn't really get it.
05:31 So they posted the news there however they pleased.
05:37 And the information there wasn't really accessible.
05:39 There were no commenting possibilities.
05:41 So I started off creating a news site that aggregated all the newspapers, the local media available then, which was a really short list.
05:48 And sort of put it together in a single news aggregator with RSS feeds.
05:53 Nice.
05:54 So it was just like a very early simple version of Google News type thing?
05:57 Yeah, Google News existed back then, but Uruguay was completely ignored by them, right?
06:02 Even up to today, Uruguay isn't among Google News's supported countries.
06:08 But you can think of it, yeah, as a Google News with custom extractors for each site.
06:12 So I ended up building this project.
06:15 It was called NotiUY, because it combined the two words Noticias and Uruguay.
06:20 I built it in Python.
06:22 Once the extractors were live, it started getting some attention.
06:26 It was a project that I ran completely in my extra time, a hobby project,
06:31 and through it I got noticed by a company here in Uruguay that introduced me to the company that I later joined to work full-time on Python.
06:42 I eventually created Scrapy and open-sourced it.
06:45 So, yeah, it's funny how you can connect the dots after everything.
06:49 Yeah, you can connect the dots looking backwards, but not forwards, right?
06:53 Yeah, not forwards.
06:54 That's really cool.
06:56 Before we move on to Scrapy: ideally, everything, all the information, would be free.
07:01 Maybe there would be, like, an HTTP JSON service we could access, or, God forbid, like an XML SOAP WCF service or something.
07:09 But a lot of times, even that is not available, right?
07:12 There's just, it's on the web, and technically you can get the text, but nobody's built an API, right?
07:17 And that's really where web scraping kind of comes in, is when you know that it is accessible, but there's no structured API for it, right?
07:24 Exactly. And as for the future of web scraping, I see scraping evolving a lot, with a lot more machine learning incorporated directly at the early extraction phase,
07:33 so that you can turn a website page into an API as quick as possible without requiring manual coding or manual intervention.
07:43 That's really interesting, because the web scraping that I know is, you say, I'm looking for this information, and maybe here's a CSS selector that I could go against the page, and then I'll grab it.
07:52 But that's not machine learning, right?
07:54 That's the web scraping that most people are familiar with.
07:56 But once you need to scale to a lot of websites, that simply doesn't work, and that's the kind of scraping Scrapy was built for.
08:05 And it was because of this need to scale, to maintain a lot of website extractors in a common, more consistent, unified fashion, that the Scrapy idea came to light.
08:16 Because otherwise, once you're maintaining a couple dozen sites, you're fighting with your own code infrastructure rather than writing the XPaths or the CSS selectors.
08:27 That's the easy part, right?
08:33 That's one of the reasons why we came up with Scrapy.
08:33 Very cool.
08:33 So why don't you tell us what it is?
08:35 Scrapy started as an initiative to build a lot of web scrapers more easily and relieve the pain of maintaining them.
08:44 Because Scrapy was built in an environment where we had to maintain hundreds of spiders, as we call them.
08:52 And you need to have a certain structure in the code, certain conventions in place, so that when someone else joins the team, a new developer joins the team,
09:03 and starts developing new spiders, they don't need to go through learning a lot of intricate internals about how the spiders are supposed to work and the rest of the infrastructure.
09:14 So what Scrapy does is try to factor out the common things that you do when you write web scrapers and separate them from the actual extraction rules, the XPaths and CSS selectors that you will type for each website,
09:29 so that you can reuse all the rest.
09:32 That's where Scrapy excels, really.
09:38 Compare that to other ad hoc solutions, like combining Requests with Beautiful Soup, which are great libraries and do a great job.
09:47 And maybe if you're writing a single scraper for a single site, you wouldn't find much difference between using one or the other.
09:54 But if your project involves maintaining spiders for dozens or even hundreds of websites, you'll find the conventions that Scrapy proposes to be very welcome advantages.
10:05 So in a way, it's a framework.
10:07 It's not a library.
10:08 So it's not that it doesn't get in the way.
10:10 It does get in the way.
10:11 But for good reasons.
10:13 It tries to impose some common practices, good practices and mechanisms, so that you don't have to go through the typical journey of first doing the simple stuff and then realizing that you needed something more complex, only to end up implementing a smaller version of what Scrapy is.
10:32 That's a really interesting comparison with the other ones.
10:35 I want to have you maybe quickly walk people through the API.
10:38 But before we kind of get off the beginning of it, can you just tell me the history?
10:42 You said it came out of your work at this company.
10:44 Like, did you actually get permission to kind of extract that framework out or did you leave that company and rebuild it with your knowledge from there?
10:51 What year is that?
10:52 This company, MyDeco, was a furniture aggregator in the UK.
10:56 I joined them in 2007.
10:58 And when I joined MyDeco, there were a lot of parts of Scrapy already built there.
11:03 Like the core framework, the downloader, and the more important internals were already in place.
11:11 It was working fine already.
11:12 You really noticed that there was an improvement over how you would go about writing ad hoc web crawlers.
11:20 So I started in this Scrapy department at MyDeco.
11:24 And I noticed that there was a huge potential for releasing this to the wild and making other people benefit from it.
11:32 By then, I was already a hardcore open source fan.
11:35 And I've been following many open source projects and open source practices.
11:40 And I was looking for an opportunity to get involved and to release my own open source project.
11:46 Pretty much like I was waiting for opportunities to pop up to work on Python in 2004,
11:52 I was looking for open source opportunities in 2007 as well.
11:59 Yeah, fortunately, the people in charge of MyDeco, the technical part of MyDeco, were quite aligned with open source as well.
12:06 Quite fans themselves.
12:08 So that makes things a lot easier.
12:10 So we didn't have many problems convincing the board.
12:13 Well, they didn't have many problems convincing the board to allow us to open source that part.
12:18 What happened afterwards is that there was a lot of work on my side and from a couple of guys who worked at the company back then,
12:26 ironing out and factoring out the common code and packaging it in a way that makes sense for external developers to digest and use.
12:35 That's cool.
12:36 And when you did that, basically you released this open source.
12:39 Did this company like take it back and sort of start using it?
12:42 Or did they just kind of go along in a parallel path?
12:44 We started from something already built, right?
12:47 It's not like we decided to build an open source project and started from scratch there.
12:52 We had a system already working and we wanted to take certain core parts of it and open source it.
12:59 This was a challenge because there was a lot of dependencies in the code that took us some time to factor out.
13:06 But at no point did we want to diverge or fork into two separate projects.
13:11 I never even considered going that path.
13:14 But I think it was crucial and important that we always remained in sync, because MyDeco was the main user of Scrapy and the feedback loop was crucial for Scrapy to succeed.
13:24 If you look at it, it makes no sense to build an open source project in isolation.
13:29 At least if you want to build a successful open source project, for me, you need to have successful companies or users using it, right?
13:37 Just don't build it in the abstract, unless it's an academic thing, of course.
13:40 This allowed us to grow Scrapy.
13:42 The beginning was a lot of de-integration work, to put it somehow, so that all the crawlers at this company remained running while we moved stuff to a state where it could be open sourced.
13:57 And sure, a few spiders broke here and there, but yeah, we learned along the way how to improve spider testing, if you will, for when you make changes in the framework.
14:09 And that's now also part of Scrapy, as one of its features.
14:13 One thing that's cool about the spiders is you can rerun them against the same source if you have to, right?
14:18 You can check if they extract the same data, and that's how you actually check that the spider remains working.
14:24 You need to find a way to do it fast enough if you have thousands of them.
14:28 That's just one caveat to keep in mind, but yeah.
14:33 Sure, sure.
14:34 What was the version of Python that you started on?
14:36 It was 2.5 back then.
14:37 It was only a year or two ago that we dropped support for 2.5 in Scrapy.
14:42 And yeah, right now we're on 2.7 and almost finished with support for Python 3.
14:47 Actually, support is finished, and it's in beta mode now in the latest version of Scrapy, which makes me very proud.
14:53 Yeah, I just saw an announcement.
14:54 What was that, two weeks ago that you guys announced that you are now officially beta in Python 3?
15:01 I had a project where I needed to do some web scraping, and I ended up going the Requests plus Beautiful Soup direction because I didn't want to be painted into the corner of having to do Python 2.
15:13 And I'm like, oh, if I go down the Scrapy path, maybe I will be.
15:15 But so I was like, oh, yes, that's great when that came out.
15:19 So congratulations.
15:20 Was it a lot of work?
15:21 It was because it's a moderately big project.
15:23 But we decided to really, really prioritize this because we realized, maybe a bit late, if you will, that people were actually not adopting, or were leaving, Scrapy because of the lack of Python 3 support.
15:36 But I'm glad that we're on it now and we have it there.
15:39 Yeah, that's fantastic.
15:40 Congrats.
15:41 Thank you.
15:41 Cool.
15:41 So let's talk about the API a little bit.
15:43 If you go to scrapy.org, right there you've got a couple of code snippets showing you how you can use it and so on.
15:50 One of your examples is if I go to like a blog, I want to go grab all the links that are categories.
15:55 Can you maybe just talk us through what that looks like?
15:57 Yeah.
15:57 You want me to go through the code that is on scrapy.org?
16:01 Not exactly.
16:02 Just tell me what, like kind of how do I get started with Scrapy, basically.
16:05 So Scrapy has a pretty good tutorial in the base documentation that I always recommend as a starting point.
16:13 The idea is that, unless you have complex needs, it should be enough to write a spider,
16:19 which can be just a couple of lines, as shown on the scrapy.org website.
16:23 The spider API itself is really simple in the sense that you have this, well, class to group all things related to the spider,
16:30 where the class has methods that are called.
16:33 These are essentially callbacks that are called with responses or web pages that are retrieved from the web,
16:40 that are delivered to these callbacks.
16:42 And then these callbacks process the response however they need, and then return items or subsequent requests to follow.
16:51 That's the very basic idea of the Scrapy API at its best.
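To make that callback idea concrete, here is a minimal sketch of a spider, written with present-day Scrapy conventions; the blog URL, selectors, and field names are hypothetical placeholders, not anything from the episode:

```python
import scrapy


class BlogSpider(scrapy.Spider):
    """Minimal sketch of the spider API; the URL and selectors are made up."""
    name = "blog"
    start_urls = ["https://example.com/blog"]

    def parse(self, response):
        # Scrapy calls this callback with each downloaded response.
        for post in response.css("article.post"):
            # Return items of scraped data...
            yield {
                "title": post.css("h2 a::text").get(),
                "url": post.css("h2 a::attr(href)").get(),
            }
        # ...or subsequent requests to follow.
        next_page = response.css("a.next::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```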
16:56 So using this simple API, you can build a lot of things on top of it.
17:01 I love APIs that are really simple at the bottom and that allows you to do a lot.
17:05 This is the most basic spider API, and all spiders in Scrapy will follow this one.
17:11 But then on top of it, there's variations and improvements, like something called crawl spider,
17:17 that will allow you to set up some rules for URLs that should be followed by the spider.
17:22 So you set up some class attributes with these rules and start URLs that they should start crawling from.
17:28 And the spider automatically follows them and calls certain callbacks when the URLs fit a certain pattern.
17:37 This type of spider is very useful for, for example, crawling a retailer, an e-commerce site,
17:43 where you want to extract the product data for certain URLs.
17:46 And many sites will follow this pattern of having a certain pattern of URLs to follow
17:52 and certain rules to extract the product data out of the pages.
17:57 So all that internally is built with a simple API of receiving a response and returning requests to follow
18:04 and items of scraped data.
18:07 That's basically one of the great things and beauty of it.
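A sketch of what such a crawl spider might look like, with made-up URL patterns and selectors for a hypothetical retailer:

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class RetailerSpider(CrawlSpider):
    """Sketch of the crawl spider idea; patterns and selectors are made up."""
    name = "retailer"
    start_urls = ["https://shop.example.com/"]

    rules = (
        # Follow category listing pages without extracting anything.
        Rule(LinkExtractor(allow=r"/category/")),
        # Call parse_product for URLs matching the product pattern.
        Rule(LinkExtractor(allow=r"/product/\d+"), callback="parse_product"),
    )

    def parse_product(self, response):
        # Extraction rules for product pages, same idea as the basic API.
        yield {
            "name": response.css("h1::text").get(),
            "price": response.css(".price::text").get(),
        }
```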
18:11 That's really cool.
18:12 And at the heart of it, you've got this CSS method that I can give sort of complex hierarchical CSS expressions
18:19 to get a hold of the pieces I need, right?
18:21 Yeah.
18:22 And on the extraction side, we also provide some convenient methods for extracting the data using well-known selectors
18:30 like CSS selectors or even XPaths.
18:33 Yeah.
18:33 This is like the most universal way to address regions of pages.
18:38 So you can do anything with it.
18:40 And with the requests to follow, you can generate further requests to make what you would call Ajax calls
18:47 in order to retrieve extra data from the web pages.
18:50 And here is something to note.
18:52 If you want to use this low level, you need to view in your browser what requests the website is doing
19:00 and replicate those in Scrapy.
19:02 Oh, interesting.
19:03 So suppose I want to go and scrape, like, a single-page app, an SPA-type app, that's written in AngularJS.
19:10 Right?
19:11 If I just go hit that with a direct request, I'm going to get curly this, curly that, meaningless, no data, right?
19:17 Yeah, exactly.
19:18 Exactly.
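As a sketch of what replicating those underlying requests can look like, assuming a hypothetical JSON endpoint spotted in the browser's developer tools:

```python
import json

import scrapy


class ApiBackedSpider(scrapy.Spider):
    """Sketch of scraping an SPA by hitting its data endpoint directly.

    The endpoint URL and the JSON payload shape are hypothetical; in
    practice you would copy them from the browser's network tab.
    """
    name = "api_backed"

    def start_requests(self):
        # Skip the empty HTML shell and request the data endpoint
        # the page's JavaScript would have called itself.
        yield scrapy.Request(
            "https://example.com/api/items?page=1",
            callback=self.parse_api,
        )

    def parse_api(self, response):
        data = json.loads(response.text)
        for item in data.get("items", []):
            yield {"name": item.get("name"), "price": item.get("price")}
```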
19:29 This episode is brought to you by Hired.
19:32 Hired is a two-sided, curated marketplace that connects the world's knowledge workers to the best opportunities.
19:37 Each offer you receive has salary and equity presented right up front, and you can view the offers to accept or reject them before you even talk to the company.
19:45 Typically, candidates receive five or more offers within the first week, and there are no obligations ever.
19:50 Sounds awesome, doesn't it?
19:52 Well, did I mention the signing bonus?
19:54 Everyone who accepts a job from Hired gets a $1,000 signing bonus.
19:57 And as Talk Python listeners, it gets way sweeter.
20:00 Use the link Hired.com slash Talk Python To Me, and Hired will double the signing bonus to $2,000.
20:06 Opportunity's knocking.
20:08 Visit Hired.com slash Talk Python To Me and answer the call.
20:17 Remember that Scrapy was built in a different world, in 2007.
20:21 There was a lot of static content.
20:22 There was not so much Ajax, not so much JavaScript.
20:25 And now, with this crazy world moving to more like apps running in websites rather than plain websites, Scrapy is still able to do it, because Scrapy works, to some extent, at a very low level.
20:38 Whatever you can do in a browser, you can do in Scrapy, because it works at the HTTP request and response level.
20:44 But you sometimes feel that you need something more digestible, so that you have the data readily available in what is delivered to Scrapy.
20:54 Based on this need, we're working on extending Scrapy's JavaScript support.
20:58 Being the hardcore reusable-component fans that we are, we ended up creating a separate component called Splash, which is sort of a mini browser that runs with an HTTP API.
21:13 And it integrates really well with Scrapy, as one of the many libraries that do, so that you can actually have JavaScript-rendered data available in your callbacks to use, as well as execute actions on the website.
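A rough sketch of that integration, assuming the scrapy-splash package is installed and configured and a Splash instance is running; none of that setup is shown in the episode:

```python
import scrapy
from scrapy_splash import SplashRequest  # assumes scrapy-splash is installed


class RenderedSpider(scrapy.Spider):
    """Sketch only: requires a running Splash instance and the
    scrapy-splash middlewares enabled in the project settings."""
    name = "rendered"

    def start_requests(self):
        # Splash fetches the page, runs its JavaScript, and hands the
        # rendered HTML back to the callback like any other response.
        yield SplashRequest(
            "https://example.com/spa",
            callback=self.parse,
            args={"wait": 1.0},  # give the page's scripts time to run
        )

    def parse(self, response):
        # Selectors now run against the JavaScript-rendered DOM.
        yield {"title": response.css("h1::text").get()}
```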
21:30 That's really cool.
21:31 I didn't know it did that.
21:32 So that's really excellent.
21:39 Yeah, it's not one of the things that are most prominently shown when you come across Scrapy.
21:39 Yeah, of course.
21:39 But if you've got to go after a website that's sort of one of these front-end JavaScript data binding type frameworks, you have to have those things.
21:46 That's really cool.
21:47 So another thing that's in your sample that I think is worth bringing up just really quickly, although it's not specific to what you guys are doing, is that you're using the yield keyword in your parse methods.
22:00 And the way that you do it supports sort of these generator methods.
22:05 Yeah.
22:06 And these sort of progressive get-as-much-as-you-like type of methods.
22:09 So do you want to maybe just talk really quickly about generator methods?
22:12 Because I don't think we've talked about it on the show very often, if at all.
22:16 Yeah, absolutely.
22:17 So this spider callback that I was talking about actually can return a generator or an iterator in its most general form.
22:24 So what Scrappy does internally, it calls this callback and starts iterating the output of it.
22:30 So it can potentially return an infinite generator.
22:34 And Scrappy sometimes is used that way.
22:37 You start in a page and start monitoring the page and keep returning data and your requests to follow as they appear.
22:44 Or you generate just incremental numbers or whatever of pages to check.
22:49 And yeah, Scrappy uses memory efficiently in order not to consume the whole generator at once.
22:55 It consumes its incisable, manageable parts.
22:59 And it has a bunch of flow control directives in place to make sure that no place in the framework consumes too much and overflows memory.
23:09 This has the result that in the spider code, you can use nine things like GIL,
23:14 where you just send the data out into Scrappy framework side.
23:20 And you'll be sure that it's going to get processed and you don't have to check yourself that if you're sending too much data or too little or whatever.
23:28 So yeah, that's how you can make benefits of how Scrappy makes benefits.
23:33 Yeah.
23:33 So internally, it uses iterators everywhere.
23:36 And you can just layer on your own iterators, right?
23:40 Because if it would build like lists and, you know, fill them up, that would kind of be useless, right?
23:44 Exactly.
23:44 It's iterator and generator friendly by default.
23:47 And you can take advantage of that.
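For anyone who hasn't run into them, here is a tiny standalone illustration of a generator; the names are just for the example:

```python
def page_numbers():
    # Nothing runs until a consumer asks for the next value, so this
    # can be conceptually infinite without ever building a list.
    n = 1
    while True:
        yield n
        n += 1


pages = page_numbers()
print(next(pages))  # 1
print(next(pages))  # 2
# A consumer like a crawling framework can pull values at its own pace,
# applying backpressure instead of materializing everything at once.
```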
23:49 Yeah, that's really awesome.
23:50 If I want to go to a website, you know, maybe let's just take somewhere like the New York Times, say.
23:57 So the New York Times, they're a well-known newspaper.
24:00 Their articles are highly valued.
24:01 But there's also usage restrictions.
24:03 They have their so-called paywall about trying to make sure you have an account and stuff.
24:08 Is it legal for me to go and turn Scrappy loose on the New York Times?
24:12 Or what's the story sort of around like when I can and can't use the web scraping?
24:17 We like to see scraping as a tool.
24:19 And as with many other tools, you can use it with legal purposes in mind or with illegal purposes in mind.
24:26 Even tools like BitTorrent, which is the de facto example, can be used for completely authentic and genuine stuff, legal stuff.
24:37 So the use, what you're doing with the tool, is ultimately what makes it legal or not.
24:42 In the case of the New York Times, you should probably obey whatever rules they have to access the content.
24:48 I hope that someday they will realize that they are doing it wrong.
24:53 You can't, at least I don't, see scraping as a way to steal data that isn't supposed to be taken.
25:01 There are a lot of projects and things out there that use scraping just to gather data that is already available, in order to process it.
25:12 Right.
25:12 Like a really common example would be Google.
25:14 Exactly.
25:14 I don't know that they use Scrapy, but obviously they do that.
25:17 Conceptually, they do that, right?
25:18 Exactly.
25:19 Google is the largest scraper in the world.
25:21 And they get the data and add value to it by showing it in their search results.
25:28 Similar thing applies everywhere for us and how we tackle projects.
25:34 We generally go with public information only when we work on scraping projects.
25:42 We don't want to deal with all the legal stuff like this.
25:48 And yeah, we always recommend customers to get proper assistance and possible consent from the websites to get the data.
25:55 Because in many cases, believe it or not, the websites don't care if you take the data from them.
26:01 But they just don't want to go through the trouble of packaging the data and sending it to you.
26:06 Yeah, right.
26:06 Or maybe they would like to, but they're technically incapable.
26:10 Exactly.
26:10 Not everybody who has a website is a programmer.
26:13 Yeah.
26:14 I've heard website owners say so many times, okay, if you can take the data and you don't cause me any problems, then go for it.
26:22 And you wouldn't believe how common that sentiment is.
26:26 Yeah, yeah.
26:26 That's really cool.
26:27 That's really cool.
26:27 Let's talk about large-scale web crawling.
26:30 I mentioned Google.
26:31 That's kind of large-scale.
26:32 You had your experience with thousands of crawlers.
26:36 What do you need to worry about when you're running stuff at that scale rather than I have my app that goes to one page and just get some information?
26:43 Right.
26:44 Scraping can kind of scale in two directions, I would say.
26:48 One is the code direction and another is the infrastructure and volumes direction.
26:55 Large-scale at the code side could be having to maintain a couple thousand spiders in your project.
27:03 Let's call it more vertical scaling, if you will.
27:06 Whereas scaling on the horizontal or infrastructure side would be having a single codebase, perhaps more automated or based on sophisticated extraction, that needs to crawl a lot of web pages, like hundreds of millions of web pages.
27:22 Okay.
27:23 And that's more like what you're talking about with the machine learning.
27:25 Maybe you would do Scrapy plus scikit-learn in a single code base to just go understand the web, versus: I know that these are the 27 retailers' websites and I want to get the price from all of them.
27:38 So I'll write specific code to get the price of these common objects.
27:42 Something like this, right?
27:42 Yeah, exactly.
27:43 And you can only scale so much, vertically scale so much with numbers of spiders because the cost starts increasing quite a lot to keep these spiders running and well-maintained.
27:53 So because spiders break, you need to monitor them, react when they break, and the size of the team that you need to have to maintain them goes up really quick.
28:03 You get a little nice happy message from Target saying, we've just redesigned Target.com.
28:09 And you're like, oh, no, there it goes, right?
28:11 Yeah.
28:11 Yeah.
28:12 Don't get me started with Black Friday and holiday seasons.
28:14 Challenges vary depending on which case we're talking.
28:19 The code scalability challenge is somewhat minimized, or addressed as much as possible, by Scrapy's common conventions.
28:26 It's as much as you can do.
28:28 I mean, in the end, you have to write the XPaths or CSS selectors anyway at some point.
28:34 But aside from that, anything else that can be automated, on the infrastructure or review side, we try to automate it.
28:43 The other type of scalability problems are related with crawling a huge number of pages.
28:49 That involves not only the massive amount of data that you need to go through and digest, but things like revisiting policies, for when you have to keep track of a huge number of pages in a very large website or group of websites.
29:07 And you only want to revisit them depending on how often they get updated.
29:12 So there's a few common algorithms used in this case.
29:16 But the big challenge here is keeping this big queue of URLs that you need to keep control of and prioritize.
29:26 Because the crawler just kind of takes URLs from a queue.
29:29 Crawler just will ask a queue, okay, what's the next URL you want me to fetch?
29:33 And it will take that URL, fetch it, return the process results.
29:38 But keeping this queue is the challenging thing when doing it at scale.
29:43 Yeah, yeah, of course.
29:44 Because you want to, you know, cache, sort of conceptually cache as much of that as possible and not hit it if it's not going to have changed.
29:51 Is that a place where you could plug in, like, machine learning and have it watch and sort of tell it, like, this time I learned something new, this time I didn't.
29:59 And it could maybe be smart about this?
30:01 Yeah, yeah, exactly.
30:02 As I was saying, there's a few common algorithms used to revisit simply based on frequency.
30:08 For example, frequency of updates.
30:10 Like if you visit a page two times within a certain period and the page is the same.
30:16 Or actually the extracted data is the same.
30:18 Exactly.
30:18 That's what matters, right?
30:19 Just to, yeah, to take out overhead changes.
30:23 Then you perhaps double the time that you're going to check that page again.
30:28 So if you revisit in a day, it didn't change.
30:32 Then you're going to visit again at least in two days from now.
30:35 And if the page changes, then for that page, you reduce to the half of the time that you waited.
30:41 So you're going to revisit in a half day.
30:43 And let the wait times adjust automatically.
30:48 That works as a basic algorithm.
30:50 It works really well.
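A minimal sketch of that doubling and halving heuristic; the function name, units, and bounds are made-up placeholders, not Scrapy's actual scheduler logic:

```python
def next_revisit_interval(current_hours, page_changed,
                          min_hours=1.0, max_hours=24.0 * 7):
    """Adjust how long to wait before re-crawling a page.

    Sketch of the adaptive revisit policy described above; the
    bounds and units are arbitrary.
    """
    if page_changed:
        # The extracted data changed since the last visit: check twice as soon.
        interval = current_hours / 2
    else:
        # Unchanged: back off and wait twice as long before the next check.
        interval = current_hours * 2
    # Clamp so intervals never collapse to zero or grow without bound.
    return max(min_hours, min(interval, max_hours))


# Unchanged after a one-day wait: next check in two days.
print(next_revisit_interval(24.0, page_changed=False))  # 48.0
# Changed after a one-day wait: next check in half a day.
print(next_revisit_interval(24.0, page_changed=True))   # 12.0
```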
30:52 And Scrapy has its own internal scheduler, which serves as this in-memory queue for requests.
31:00 It can only grow so much, because Scrapy runs all in one process, an operating system process.
31:05 And there's memory limitations when running a single process.
31:10 So when you need to scale to very large number of web pages, you need to use something external.
31:17 And on this side of the equation, we've been working for just over a year now on a new project called Frontera,
31:25 which is an extension to Scrapy to deal with large crawls.
31:30 And it essentially manages what is called the crawl frontier.
31:33 I should have mentioned this before, but this queue of requests to follow is called the crawl frontier in scraping and web crawling terms.
31:43 Yeah, very cool.
31:43 You guys also built like a web crawling as a service or web crawling infrastructure as a service, if you will, right?
31:51 Tell me about that.
31:52 Yeah, so one of the things that is common to many scraping projects, if not all of them, is the infrastructure required by them.
32:01 As I was saying, writing the CSS selectors or XPaths is something that you may need to do separately for each website.
32:08 But all the rest, like running the spider, getting the data, reviewing the data with your colleagues or customer, iterating over it, is pretty much the same.
32:20 And surprisingly, that's what ends up taking more time than writing the CSS selectors or the XPaths themselves.
32:27 Because when you're dealing with data, sometimes there are different expectations of how the customer needs the data and how you produce it from your spider.
32:37 So being able to have those in sync with an efficient tool is of crucial importance.
32:45 So we realized around 2010 that there was a business opportunity to launch a tool for taking the next step after Scrapy.
32:53 Because Scrapy kind of solves the problems at the developer level.
32:58 You can run a Scrapy spider with the scrapy crawl command.
33:02 It all works the same on any machine, on any platform.
33:04 But what after that?
33:06 What if you need to collaborate with teams?
33:08 Then it starts getting a bit complicated.
33:11 We started Scrapinghub to deal with that, with running Scrapy spiders in the cloud in the friendliest way possible.
33:21 So that developers can benefit from it.
33:24 Yeah, that's cool.
33:25 And that's just at scrapinghub.com, right?
33:27 Yeah, exactly.
33:27 Yeah.
33:28 So people can check it out there.
33:29 Maybe tell us, like, what is the main use case?
33:32 Like, why do people come and set up a system to run on your platform?
33:37 The nice thing about Scrapinghub is that any Scrapy spider that already runs on your machine will also run on the Scrapinghub cloud.
33:45 So that's the premise of the service.
33:47 We don't require you to make any changes to your already working spiders.
33:53 You can just run a single deploy command, and you have it deployed in our cloud.
34:00 And you can run it using the web panel from there.
34:04 Think of it similar to Heroku here.
34:06 Not sure if you're familiar with it.
34:07 But Heroku is the same for websites.
34:10 You have a website running locally.
34:12 And yes, you need to configure some manifest file for sure to indicate a few things.
34:18 But that's all you need to do.
34:20 And then with a Heroku deploy, you have the website running.
34:23 It doesn't matter where; you have the URL,
34:25 and anyone can access it.
34:28 We wanted to build the same for spiders.
34:31 There wasn't anything like it.
34:32 And still, there isn't.
34:34 And we realized that we had kind of the best framework for writing spiders.
34:40 And kind of the next obvious thing to do was to provide the friendliest way possible to run them and collaborate with your team.
34:49 You can run it and see the stuff, the data that is being scraped, with the images and all those things nicely rendered.
34:57 And you can add inline comments to the data and mention colleagues to check if it's okay.
35:04 The other big piece is that it's very hard to have this infrastructure when you're small.
35:12 So it doesn't make sense to build this if you're just crawling a few sites.
35:16 Although big companies that do a lot of web crawling like Google have their own sophisticated crawling infrastructures already in place.
35:25 And having a sophisticated crawling infrastructure in place is a pretty daunting task for a small startup or for a small company.
35:33 So we wanted to make that accessible for small companies when we started Scrapinghub: all of, or as much as possible of, the nice infrastructure perks that one has with a reasonably large crawling infrastructure.
35:48 Make it available to everyone.
35:50 That's really an interesting mission.
35:52 I mean, if I was going to do a startup and a key component of that was to go out and just gather all this data through web scraping, you know, you could set up your own infrastructure on AWS or DigitalOcean or wherever.
36:03 Right.
36:04 But that's not your job.
36:05 Why do you want to do that?
36:06 Right.
36:06 Just drop it over there and let you guys deal with it.
36:08 Continuous delivery isn't just a buzzword.
36:25 It's a shift in productivity that will help your whole team become more efficient.
36:29 With SnapCI's continuous delivery tool, you can test, debug, and deploy your code quickly and reliably.
36:36 Get your product in the hands of your users faster and deploy from just about anywhere at any time.
36:41 And did you know that ThoughtWorks literally wrote the book on continuous integration and continuous delivery?
36:46 Connect Snap to your GitHub repo and they'll build and run your first pipeline automagically.
36:51 Thank SnapCI for sponsoring this episode by trying them for free at snap.ci slash talkpython.
36:58 That's what we try to convey.
37:09 Like, we are the Heroku of, we're trying to become the Heroku of web scraping.
37:14 You can always build an AWS server or DigitalOcean server and deploy everything there.
37:21 And it will work, because everything is based on open source tools and we have remained firmly committed
37:28 against any type of vendor lock-in.
37:30 We release as much as we can on open source.
37:33 And I'm not overstating this in any way.
37:36 The things that we haven't yet open sourced are the ones we haven't cleaned up enough to send out there,
37:43 where they would cause more harm than good being out there.
37:46 But eventually we see ourselves open sourcing everything and charging for the infrastructure, essentially, to run those things.
37:54 We don't want to charge for code licenses.
37:56 Right.
37:56 Do you see it as somewhat analogous to OpenStack?
37:59 Yes, absolutely.
38:00 From a business model.
38:01 Yeah, absolutely.
38:03 Because the infrastructure is something that you can never really open source, right?
38:07 So it's not like a major flaw that we could hit at some point.
38:14 And so, yeah, it's very similar in that regard to OpenStack.
38:18 There's also two types of customers, right?
38:21 As I sometimes say: developers with no money but very eager to learn and do stuff,
38:30 and companies with deep pockets and no time, who want things working, like, tomorrow.
38:38 We're trying to fulfill both targets, right?
38:42 We have a business division in Scrapinghub where you can get the spider written and everything,
38:48 and you don't need to do anything, just get the data.
38:51 But internally, this uses all of the rest of our platform that we share with developers
38:56 in the world.
38:57 And we both work on the platform.
39:00 So if you're a developer, you can use the platform.
39:03 And if you're a business, you can hire developers who, using the same platform, get the job done
39:08 for you.
39:09 And I believe that trying to fulfill both audiences is key for a business built around open source
39:17 to succeed.
39:18 Well, I guess you could extend that to any business.
39:21 But it has been one of our best decisions.
39:25 And I've always been a fan of building developer tools.
39:29 Coming from a developer background, and I'm still a programmer first and foremost.
39:34 I love tools that are well done that allow you to enjoy working and having fun while you
39:41 work.
39:41 So I love building these things.
39:44 I always, when I wanted to work on something, first tried to build tools to make that something
39:52 more efficient before actually getting to work.
39:55 Yeah, that's cool.
39:56 And now you just get to enable other people to build their thing, right?
39:59 And you get to just keep working on the tools.
40:00 Yeah.
40:01 So I want to come back to this idea of open source and business.
40:05 These are very interesting mixes to me.
40:09 Real quickly, what's the future of web scraping like?
40:11 From where we are now, what do you see coming up in the next five or 10 years that's going
40:16 to be different?
40:17 Well, the web is going to be there still, that's for sure.
40:20 Technologies will change quite a bit in five years if you look back at what happened in
40:26 the last five years.
40:27 And I think the world is a lot less proprietary now in terms of internet technologies.
40:36 Perhaps the last big one was the fact that Flash lost to HTML5.
40:43 It's kind of the last example of a big proprietary versus open technology.
40:49 Yeah, that's a big technical wall coming crashing down that will let you into that area now,
40:55 right?
40:55 Yeah.
40:56 So of course, we benefit from the fact that everything is in HTML.
41:03 There's going to be more HTML out there.
41:03 And it will continue to be.
41:05 But the point is that technologies may change a bit, but the information that is available
41:11 out there will need to be retrieved and used by companies.
41:15 So scraping in general, I don't think that it will change much the concept.
41:20 The earlier concept of screen scraping, of extracting data from screens rather than from web pages,
41:27 still remains to this day.
41:29 And if anything, it will change a bit in terms of how you extract the data.
41:35 And it's going to involve a lot more JavaScript processing and kind of tool automation for reproducing
41:43 actions rather than following links to follow the pattern of how web applications are transforming.
41:50 But I don't expect a lot of that to change aside from that, really.
41:56 But there's going to be a lot of sophisticated technologies that will keep increasing on the
42:02 side of keeping web crawlers away, detecting web crawlers and banning them for sure.
42:08 There's a bit of an arms race there, right?
42:10 Yeah, yeah, exactly.
42:11 It's a...
42:12 User-Agent equals MSIE 10.
42:15 Yeah.
42:17 Things like this, yes?
42:18 That summarizes well.
42:21 But yeah, the fact that the companies are going to be more protective of their data, or some
42:28 of them, the ones that consider it a core business value, will mean that the scraping technologies
42:33 will need to evolve and become more sophisticated as well to follow it.
42:38 So there's going to be a large, very expensive, private market where you will be able to find,
42:45 I don't know, mechanisms to still get the data out of web pages with an even bigger infrastructure.
42:51 That's not like the focus of what we're trying to achieve, right?
42:57 Right, of course.
42:57 So two other things that are sort of not knowing anything about this come to mind for me is,
43:02 one, you talked about the JavaScript front-end stuff, the fact that you guys support grabbing
43:08 sort of AngularJS style apps.
43:11 I think there's more of those, right?
43:12 So that'll be interesting.
43:13 But what about HTTP2?
43:15 Will that have any effect on you guys, or better scalability, or no effect?
43:20 What do you think?
43:20 No, I don't see HTTP...
43:23 I mean, it won't change really much.
43:26 It's going to be part of...
43:29 The framework will be adapted to take HTTP/2 enhancements on board, but the discipline of scraping the data,
43:38 I don't see it changing much because of that.
43:42 Yeah.
43:42 Web sockets may require, if you think of it, a more...
43:47 It's a different approach to it.
43:49 Yeah, when the data starts coming out of web sockets.
43:50 Yeah.
43:51 I guess that we haven't done pretty much any web sockets scraping recently, or at all.
43:58 And I really haven't put a lot of thought to it.
44:02 It hasn't crossed my mind much lately.
44:05 Sure.
44:05 So let's talk a little bit about, while we have a few minutes left, let's talk about business, open source, making a living on open source.
44:11 There's some really powerful examples of companies doing this and people doing it.
44:16 But there's also a lot of people who dream of it, but don't see a path forward.
44:20 Right?
44:21 Notable examples of people making it work for companies.
44:24 Continuum, the Anaconda guys; Red Hat; OpenStack; MongoDB.
44:31 Those are all really companies that have somehow made open source work.
44:35 Somehow made it work.
44:36 Somehow.
44:37 But there's got to be a ton of failed attempts, unfortunately.
44:41 So maybe tell me, how is this going for you guys?
44:43 What's it like to run sort of a distributed company that builds an open source product, but still somehow makes it a business?
44:52 We started by doing a lot of professional service work, and we still do a lot of consulting using our own tools.
45:00 So you need to keep in mind this audience of companies and businesses that could benefit from your open source project, if you want to make a living out of it.
45:10 In our case, it wasn't too difficult, right?
45:14 Because there's a natural need from data.
45:16 Data is king, and many companies are after data.
45:20 And the open source tool that we developed and we maintain is related to extracting data.
45:27 So we started Scrapinghub with no external funding, and we still don't have any, by providing solutions to these companies, first and foremost.
45:36 And then we were able to evolve the open source products as the needs of our customers require, improving it along the way when we found the opportunity.
45:45 And even up to this point, the modus operandi is very similar.
45:49 So try to find, I would say, try to find where your paying customers will be, like companies that will benefit from your open source project.
46:07 And try to build a company that provides these offerings.
46:13 In our case, we did it by providing development expertise to write web crawlers.
46:20 But it's very similar how we did it with how continuous that's it, for example.
46:25 And I'm a big fan of them.
46:27 And it's a really nice model to follow.
46:31 They do a lot of consulting, but they keep this Anaconda platform and everything that makes their jobs easier.
46:39 And at the same time, allows them to spend working on open source a lot.
46:43 And I always wanted to be able to just live in the open source world and keep coding on Scrappy.
46:52 Unfortunately, I'm not able to code much on Scrappy these days.
46:56 But I plan to return to it once the company runs on autopilot,
47:01 and I'm able to retire to get back to Scrapy and side projects.
47:07 But yeah, it's about trying to connect the dots to where companies will make use of the open source project that you're working on.
47:18 Like if you're working on something like Bootstrap, find a consulting company that builds websites quickly, prototype websites quickly, those type of things.
47:28 So let me think through if I were trying to plan this out for Scrappy, how it might go.
47:34 So if I were, you know, you guys five or eight years ago, whenever, before you started Scraping Hub, I'm thinking, okay, well, I've got this really successful web scraping open source project.
47:46 And people are using it.
47:47 The company is using it.
47:48 They're really just, they're using it as a tool.
47:50 But the thing they actually want is they want lots of data off web pages fast.
47:55 And so then how do you build a business?
47:57 I think it makes perfect sense to say, well, let's build like this infrastructure where you can take the thing you're already doing and just push a button, drop it onto it and deliver the data fast in a reliable way without all the other stuff, right?
48:09 Without adding anything to it.
48:10 Yeah, yeah, yeah.
48:11 Is that kind of the thinking you guys went through?
48:12 That's a perfect example.
48:14 And in fact, we didn't focus early on on just pushing a button and getting the data automatically.
48:21 We focused more on the provides engineer hours and consulting and training for our tools.
48:28 But we are very much focused on data marketplaces and automatic APS to do extraction right now.
48:35 But the model is just what you just described is also perfectly possible.
48:40 I mean, it's another way to go about doing that.
48:43 And it's even better if you find a way to productize your open source project in a way that companies can benefit from it without having to go through all the installation, configuration, and customization phases.
48:58 So if you find a way to productize your open source project and provide it in a hosted manner, that's another very common model, with a few successful examples.
49:07 Like GetSentry, for example, this tool for monitoring websites.
49:12 They provide a nice hosted solution of the open source tool and they're doing awesome.
49:17 Yeah, that's another model that works really well.
49:22 It's in the end a matter of finding where your tool produces value to business and try to sort of connect the dots or try to offer it so that when you're working for your customer, you're implicitly forced to work in your open source project and remain connected to it and growing it.
49:42 You need to find a way to put your open source project in your work agenda.
49:47 Right.
49:52 So if you do consulting, maybe you make sure that your contracts include clauses that if I'm helping you with consulting using my open source project and there's a natural enhancement that I could add to the project, but that would be kind of driven by your need.
50:06 Make that possible for me to put that in my open source project without any conflict.
50:10 Something like that, right?
50:11 Exactly.
50:11 In our case, we include a clause like that.
50:14 Sometimes we even get Scrappy enhancements sponsored by our customers, which is great.
50:20 You can get that win-win situation.
50:23 You need to be a bit creative, of course, and be patient as well.
50:27 For a long time, at the early stages of Scrapy, I was working on it in my extra time and I wasn't really sure where the project was going or if it was going to be successful at all.
50:38 But I just wanted to build something.
50:40 That was my main motivation back then.
50:42 Create something of value to developers, because I was a developer full-time back then, and something that made my job a little more fun.
50:51 And that's what I did.
50:53 And then it was later that I realized I had a nice thing going on, a nice open source project that the community had gathered around.
51:01 First, it was just a couple of guys and I tried to help them and answer a few questions here and there.
51:08 And yeah, there was this community that encouraged me to quit my job and focus solely on Scrapy.
51:15 Was that a pretty happy day?
51:16 Yeah.
51:17 At the time, it was a bit crazy, like leaving my job and then just starting this.
51:24 But to be honest, I already had a few customers that were going to become immediate Scrapinghub customers.
51:33 So it was enough to sustain a small business, because we kind of had the market proven before I started Scrapinghub.
51:43 I know you can make even riskier moves, but I constantly feel proud about the things that Scrapy has achieved.
51:54 And a lot of things that are achieved lately are not thanks to me.
51:57 Like I'm not super involved day to day in the project itself.
52:02 But it's like when you see your kid, right?
52:05 That just graduated from college and is doing great things out there.
52:11 It really makes me very proud.
52:18 And just last month, the first Scrapy book came out.
52:18 And that was a really happy thing for me when I realized.
52:22 I'm sure you're really proud of it.
52:24 That's great.
52:24 Congratulations.
52:25 So the companies I named earlier, possibly with the exception of Continuum, were really large companies.
52:31 But you're not a super large company, right?
52:32 How many people work with you?
52:34 At the moment, we are 130 people working fully distributed around the world, which to me is pretty big.
52:41 But of course, not comparable to those.
52:42 Okay, that's bigger than I realized.
52:44 That's awesome.
52:45 Yeah, cool.
52:46 So yeah, we were kind of growing and doubling in size every year, more or less.
52:50 We never kind of set a goal of like, let's double in size or let's grow to this size.
52:56 It kind of just naturally happened.
53:04 Like, customers were requesting more work and we kept adding people to the team as there was more work.
53:04 And now we have a large part of the team working on the platform that already drives a lot of demand.
53:10 And running a remote company comes with its challenges, right?
53:18 I mean, you have a lot of time zone issues.
53:18 Coordination between teams, like communication is key.
53:21 Self-management is very important.
53:25 But in exchange, you have a global pool of talent available to you.
53:30 And that's been one of the key things, if not possibly the most important thing, in being able to grow Scrapinghub to the size that we've grown it so far.
53:38 Like, we've been focusing really hard on hiring the best talent out there.
53:43 It took us a while, but I'm really proud of the team that we have assembled.
53:47 And I'm sure that the best is still to come.
53:50 I've always kind of worked remotely since I started working professionally in 2000 because I was sort of a sysadmin at the beginning.
54:00 So, yeah, I managed the servers from home.
54:03 Then I worked for this company in the UK that built this aggregator.
54:07 And I was also remote there.
54:09 And it kind of felt natural to continue working this way for me.
54:13 And also, the things that I learned while managing the Scrapy community, I tried to apply as much as possible to managing Scrapinghub as well.
54:21 Like the whole culture of it's better to ask for forgiveness than permission, those types of things.
54:31 I really applied that to Scrapinghub.
54:33 Sometimes I try to see it as a really big open source project with some commercial interest that we're trying to manage.
54:41 Yeah, you must be super proud of what you built.
54:43 That's awesome.
54:43 And I think people will really enjoy hearing this sort of success story from the beginning until now.
54:50 That's great.
54:50 So, we just have like a couple minutes for final questions.
54:54 Questions I always see at the end of the show.
54:56 If you're going to write some Python code, what editor do you open?
54:58 Well, I'm a Vim guy.
55:01 What can I say?
55:02 All right.
55:04 Awesome.
55:04 That may be the most common answer.
55:06 I think it probably is.
55:08 I'm quite fond of Sublime.
55:11 Yeah.
55:11 Because I'm not writing much code these days.
55:14 I've tried Sublime a couple of times.
55:17 And it's the first one that made me consider leaving Vim.
55:22 But.
55:23 Yeah.
55:23 It hasn't pulled you over to its side yet.
55:25 No.
55:26 Not yet.
55:27 All right.
55:27 And from all the packages on PyPI, there's 70 something thousand these days.
55:32 There's so many that are awesome that maybe people don't know about.
55:35 Like, what would you recommend?
55:36 Like, obviously, pip install scrapy.
55:39 And then anything else?
55:40 There's a whole bunch of interesting stuff there.
55:43 It's hard to name, to pick one.
55:46 I'll just try to think of one that is not part of the Scrapy ecosystem.
55:51 Do definitely give a try to Sentry if you're running a website.
55:55 It's really cool stuff to monitor websites.
55:58 Okay.
55:58 Yeah.
55:58 Sentry's nice.
55:59 Yeah.
55:59 That's, I guess I will go with that one.
56:03 Go with Sentry.
56:03 Awesome.
56:04 Okay.
56:04 Before we let you go, any final calls to action?
56:06 Things people should go try?
56:09 Yeah.
56:09 Yeah.
56:09 Definitely.
56:09 Come try Scrapinghub; you won't regret it if you have any scraping needs at any level.
56:16 Like, we are building the most awesome tools to make your job easier.
56:19 So, yeah, I'm always happy to get on calls with people doing things with scraping.
56:25 And we'd like to learn more about what your needs are and how we can make it easier.
56:30 Awesome.
56:31 So, everyone check it out.
56:32 Pablo, it's been really fun to talk about web scraping.
56:34 I've learned a ton from it.
56:35 Thank you.
56:36 Thank you.
56:36 Likewise.
56:37 It's a very interesting job.
56:39 Yeah.
56:39 Thanks for being on the show.
56:40 Thank you.
56:40 Bye-bye.
56:41 Thanks.
56:41 Bye.
56:41 This has been another episode of Talk Python to Me.
56:45 Today's guest was Pablo Hoffman.
56:46 And this episode has been sponsored by Hired and SnapCI.
56:49 Thank you guys for supporting the show.
56:51 Hired wants to help you find your next big thing.
56:53 Visit Hired.com slash Talk Python to Me to get five or more offers with salary and equity
56:58 presented right up front and a special listener signing bonus of $2,000.
57:01 SnapCI is modern continuous integration and delivery.
57:06 Build, test, and deploy your code directly from GitHub, all in your browser with debugging,
57:10 Docker, and parallelism included.
57:12 Try them for free at snap.ci slash Talk Python.
57:15 It's the final few days for my video course Kickstarter.
57:18 The campaign is open until March 18th, and you'll find all the details at talkpython.fm slash
57:23 course.
57:24 Hurry on over there and sign up before it closes.
57:26 You can find the links from today's show at talkpython.fm/episodes/show/50.
57:32 Be sure to subscribe to the show.
57:34 Open your favorite podcatcher and search for Python.
57:36 We should be right near the top.
57:37 You can also find the iTunes and direct RSS feeds in the footer of the website.
57:41 Our theme music is Developers, Developers, Developers by Corey Smith, who goes by Smix.
57:45 You can hear the entire song on talkpython.fm.
57:48 This is your host, Michael Kennedy.
57:50 Thank you so much for listening.
57:52 Smix takes us out of here.