
#50: Web scraping at scale with Scrapy and ScrapingHub Transcript

Recorded on Tuesday, Feb 16, 2016.

00:00 What do you do when you are working with an amazing web application that, for whatever reason, doesn't have an API? One option is to say, "I wish that site had an API," and give up. Or, you could use Scrapy, an open source web scraping framework from Pablo Hoffman and scrapinghub.com, and create your own API!

00:00 On episode 50 of Talk Python To Me, we'll talk about how to do this, when it makes sense, and even when it's allowed. This episode was recorded on February 16th, 2016.

00:00 Welcome to Talk Python to Me, a weekly podcast on Python: the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm and follow the show on Twitter via @talkpython.

00:00 This episode is brought to you by Hired and Snap CI. Thank them for supporting the show on twitter via @Hired_HQ and @snap_ci.

00:00 Hey everyone. Thanks for joining me today. We have a great interview on tap with Pablo Hoffman.

00:00 I want to give you a quick Kickstarter update. There are just 3 days left to join the course via Kickstarter and get a big discount while you're at it. Of course, it'll be for sale afterwards, but not at the Kickstarter prices.

00:00 The students who have had early access have had really positive things to say. Here's just a taste.

00:00 [This is] by far the best Python course I've done to date, clear explanations, good apps that incorporate the concepts of the particular session and a follow up on core concepts. The course is very engaging with a little humor which makes it much easier to dedicate time following along.

00:00 If you are looking for an easy-to-follow, enjoyable Python course - back the project now, get comfortable and enjoy!

00:00 Andy T.

00:00 Check it out at talkpython.fm/course. Now, let's talk about web scraping.

02:08 Michael: Pablo, welcome to the show.

02:11 Pablo: Thank you, Michael. Thanks for having me here.

02:13 Michael: Yeah, we got some really cool stuff to talk about- web scraping, open source businesses, we'll talk a little bit about Python 3 maybe, all sorts of cool things. But before we dig into them, let's talk about your story, how did you get into Python and programming?

02:26 Pablo: All right, so I met Python when I was in college, in 2004, and immediately fell in love with it, with the simplicity and the structure of the syntax. Back then, I was always trying to find an opportunity to use Python for whatever; I would make a lot of crazy, just useless stuff just to be able to use Python for it. Before Python I used PHP and Perl, and I don't regret any of it, but yeah, I come from that background, more of a sys admin, web application background, so it was sort of a natural flow for me. I started working with Python in 2004 in college and never looked back, and yeah, it was only in 2007 that I was able to start a company working solely on Python, so I had to wait 3 years, but the time came.

03:23 Michael: But you made it, you finally made it.

03:25 Pablo: Yeah. Absolutely. And I am still here, almost ten years later working pretty much exclusively with Python, and enjoying every moment of it.

03:35 Michael: Yeah, that's really cool. What continues to surprise me about Python is that here is a language and an ecosystem that is 25+ years old, and it seems to be becoming more popular and growing faster, even though it's not something new- I think we are in a good place.

03:53 Pablo: Yeah, I believe it's a combination of a lot of things that are right, and the community is certainly one of the key aspects of Python, in addition to the language being beautiful and elegant by itself, so a lot of things came together for that to happen.

04:08 Michael: Yeah, I totally agree, so we are going to talk about web scraping. There is a whole variety of listeners, some of them will go, "Oh yeah, web scraping, I did that back in- whenever" but a lot of people maybe are not really familiar with it, so maybe give me the background: what is web scraping, why do we do this, how does it work?

04:24 Pablo: Right. So there is a bunch of information everywhere on the web right now; it's always been there, it continues to be, and the amount of information keeps growing. And so, there is a natural need for extracting and using that information for all kinds of purposes, but you can't just take a web page and use it as is; you need to extract it and transform it in order to apply whatever you need to apply to it, and that is where web scraping comes into play. And that is great, that is beautiful, there are a lot of things that you can do with data once you have it in a format that you can manipulate. So that's where the web scraping discipline sits, and given the ever-increasing amount of unstructured data in web pages, web scraping is definitely here to stay and it's going to keep growing. My interest in web scraping came shortly after I met Python in college, because there was this news site ecosystem in my country, here in Uruguay, where newspapers had just come online but didn't really get it, so they posted the news however they liked, the information wasn't really accessible, and there was no commenting possibility. So I started creating a news site that aggregated all the newspapers and local media available then, which was a really short list, and sort of pulled it together in a single news aggregator.

05:53 Michael: Nice. So is this like very early, simple version of Google news type thing?

05:58 Pablo: Yeah, Google News existed back then, but Uruguay was completely ignored by them, and even up to today, Uruguay isn't a supported country in Google News. But you can think of it as a Google News with custom extractors for each site. So I ended up building this project, it was called NotUi, from Noticias and Uruguay, the two words combined together. I built it in Python, the site went live, and it started getting some attention. It was thanks to that project, which ran completely in my extra time as a hobby, that I got noticed by a company here in Uruguay, which introduced me to the company I later joined to work full time on Python, where I actually created Scrapy and open sourced it. So yeah, it's funny how you can connect the dots after everything.

06:51 Michael: Yeah, you could connect the dots looking backwards, but not forward, right?

06:54 Pablo: Yeah, forward no.

06:55 Michael: That's really cool. Before we move on to Scrapy- ideally all the information should be free, maybe there should be an HTTP JSON service we could access, or an XML, SOAP, or WCF service or something, but a lot of times even that is not available, right? It's just on the web; technically you can get the text, but nobody has built an API. That's really where web scraping comes in: when you know the data is accessible but there is no structured API for it.

07:25 Pablo: Exactly, and as for the future of web scraping, I see it evolving a lot, with a lot more machine learning incorporated directly at the early extraction phase, so that you can turn a website into an API as quickly as possible without requiring manual code.

07:43 Michael: That's really interesting, because the web scraping that I know is: you say I am looking for this information, and maybe here is the CSS selector that I could run against the page, and then I'll grab it, but that's not machine learning, right?

07:55 Pablo: That's the web scraping that most people are familiar with, but once you need to scale to a lot of websites that simply doesn't work. That is the scraping that Scrapy was built for, and it was because of this need to scale, to maintain a lot of website extractors in a common, more consistent, unified fashion, that the Scrapy idea came to light. Otherwise, after you maintain a couple dozen sites, you end up fighting with just your own code infrastructure; writing the XPath or the CSS selectors, that's the easy part, right? That's one of the main reasons why we came up with Scrapy.

08:33 Michael: Very cool, so why don't you tell us what it is?

08:36 Pablo: Scrapy started as an initiative to build a lot of web scrapers more easily and relieve the pain of maintaining them. Scrapy was built in an environment where we had to maintain hundreds of spiders, as we call them, and you need to have a certain structure in the code, certain conventions in place, so that when any developer joins the team and starts developing new spiders, they don't need to go through learning a lot of intricate internals about how the spiders are supposed to work and the rest of the infrastructure. So, what Scrapy does is it tries to factor out the common things that you do when you write web scrapers, and separate them from the actual extraction rules, the XPath and CSS selectors, that you write for each website. You have all the rest as a given, and you can focus just on what needs to be extracted for each site. That is where Scrapy really excels compared to other ad hoc solutions, like combining Requests with Beautiful Soup, which are great libraries and do a great job, and maybe if you are writing a single extractor for a single site, you wouldn't find much difference between using one or the other. But if you are writing and maintaining spiders for hundreds of websites, you will find the conventions that Scrapy proposes to be very welcome advantages. So, in a way, it's a framework, not a library; it's not that it doesn't get in the way, it does get in the way, but for good reasons: it tries to impose some common and good practices and mechanisms, so that you don't have to go through the typical journey of first doing the simple stuff and then realizing that you need something more complex, only to end up implementing it yourself. That's what Scrapy is.

10:33 Michael: That's a really interesting comparison with the other ones. I want to have you quickly walk people through the API, but before we get past the beginning of it, can you just tell me the history- you said it came out of your work at this company; did you actually get permission to extract that framework out, or did you leave that company and rebuild it with your knowledge from there? What year was that?

10:54 Pablo: This company, Mydeco, was a furniture aggregator in the UK; I joined them in 2007, and at Mydeco a lot of the parts of Scrapy were already built, like the core framework, the downloader, and the more important internals; they were already in place. And it was working fine already; you really noticed that there was an improvement over how you would go about writing ad hoc crawlers. So, I started the scraping department in Mydeco, and I noticed that there was a huge potential for releasing this and making other people benefit from it. By then I was already a hard core open source fan, I had been following many open source projects and open source practices, and I was looking for an opportunity to get involved and release my own open source project, pretty much like I had been waiting for an opportunity to work on Python back in 2004; I was looking for open source opportunities in 2007 as well. Fortunately, the people in charge of the technical part of Mydeco were quite aligned with open source as well, quite fans themselves, so that made things a lot easier. So we didn't have many problems convincing them, and they didn't have many problems convincing the board to allow us to open source that part. What happened afterwards is that there was a lot of work on my side, and from a couple of guys that worked with me back then, ironing out and factoring out the common code in a way that made sense for external developers to digest and use.

12:36 Michael: That's cool, and when you did that, when you released it as open source, did this company take it back and start using it, or did they just kind of go along on a parallel path?

12:45 Pablo: We started from something already built, right? It's not like we decided to build an open source project and started from scratch; we had a system already working and we wanted to take certain core parts of it and open source them. This was a challenge because there were a lot of dependencies in the code that took us some time to factor out, but at no point did we want to diverge or fork into separate projects; we never even considered taking that path. I think it was crucial and important that we always remained in sync, because Mydeco was the main user of Scrapy, and that feedback loop was crucial for Scrapy to succeed. If you look at it, it makes no sense to build an open source project in the abstract; at least if you want to build a successful open source project, you need to have a successful company using it, unless it's an academic thing, of course. This allowed us to grow Scrapy. In the beginning, there was a lot of integration work to pull it apart somehow, so that all the crawlers at the company remained running while we moved things to a state where they could be open sourced. I am sure we had spiders break here and there, but yeah, we learned along the way how to improve spider testing, if you will, for when you make changes in the framework, and that is also part of Scrapy now as one of its features.

14:14 Michael: One thing that is cool about the spiders is that you can rerun them against the same source if you have to, right?

14:18 Pablo: You can check if they extract the same data, and that's how you actually check that a spider remains working. You need to find a way to do it fast enough if you have thousands of them; that's just one thing to keep in mind, but yeah.
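
One minimal way to picture that kind of check is to feed a spider callback a saved, known copy of a page and assert that it still extracts the same data. The spider, HTML, and expected item below are invented for illustration; this is not Scrapy's built-in test tooling, just a sketch of the idea.

```python
import scrapy
from scrapy.http import HtmlResponse, Request

# A saved copy of a page, inlined here so the example is self-contained.
FIXTURE_HTML = b"""
<html><body>
  <h1 class="title">Example product</h1>
  <span class="price">19.99</span>
</body></html>
"""


class ProductSpider(scrapy.Spider):
    # Hypothetical spider used only for this regression-style check.
    name = "product"

    def parse(self, response):
        yield {
            "title": response.css("h1.title::text").get(),
            "price": response.css("span.price::text").get(),
        }


def test_parse_against_fixture():
    url = "http://example.com/p/1"
    # Build a fake response from the fixture instead of hitting the network.
    response = HtmlResponse(url=url, body=FIXTURE_HTML, request=Request(url))
    items = list(ProductSpider().parse(response))
    # The spider "remains working" if it still extracts the same data.
    assert items == [{"title": "Example product", "price": "19.99"}]


test_parse_against_fixture()
print("spider still extracts the expected data")
```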

14:34 Michael: Sure, what was the version of Python that you started on?

14:36 Pablo: It was 2.5 back then. It was only a year or two ago that we dropped support for 2.5 in Scrapy, and now we run on 2.7 and have almost finished support for Python 3- actually, the support is finished and it's available now in the latest version of Scrapy, which makes me very proud.

14:54 Michael: Yeah, I just saw the announcement- what was that, two weeks ago?- that you guys are now officially in beta on Python 3. I had a project where I needed to do some web scraping and I ended up going the Requests + Beautiful Soup direction because I didn't want to be painted into the corner of having to do Python 2. I was like, "If I go with Scrapy, the path may be-" and then I was like, "Oh that's great" when it came out, so congratulations. Was it a lot of work?

15:21 Pablo: It was, because it was a moderately big project, but we decided to really prioritize it because we realized, maybe a bit late if you will, that people were actually not using Scrapy, or were leaving it, because of the lack of Python 3 support. But I am glad that we are on it now and that we have it there.

15:39 Michael: Yeah, that's fantastic. Congrats.

15:41 Pablo: Thank you.

15:41 Michael: Cool, so let's talk about the API a little bit, if you go to Scrapy.org, right there you've got like a couple of code snippets showing you how you can use it and so on. One of your examples is if I go to like a blog I want to go grab all the links that are categorized- can you maybe just talk us through what that looks like?

15:58 Pablo: Yeah, you want me to go through the code that is on the Scrapy site, or-

16:02 Michael: Not exactly, just tell how do I get started with Scrapy, basically?

16:05 Pablo: Yeah, so Scrapy has a pretty good tutorial in the documentation, which I recommend as a starting point. The idea is that, unless you have complex needs, it should be enough to write a spider, which can be just a couple of lines as shown on our website. The spider API itself is really simple in the sense that you have a class to group all the things related to the spider, and the class has methods that are callbacks; responses, the web pages retrieved from the web, are delivered to these callbacks, the callbacks process the response however they need, and then return items or subsequent requests to follow. That's the very basic idea of the Scrapy API. Using this simple API, you can build a lot of things on top of it; I love APIs that are really simple at the bottom and allow you to do a lot. This is the most basic spider API, and all spiders in Scrapy follow it, but on top of it there are variations and improvements, like something called the "crawl spider", which allows you to set up some rules for URLs that should be followed by the spider. You set up some class attributes with these rules and the start URLs that it should start crawling from, and the spider automatically follows them and calls certain callbacks when the URLs fit certain patterns. This type of spider is very useful, for example, for crawling a retailer or ecommerce site where you want to extract product data; many sites follow this pattern of having certain URL patterns to follow and certain rules to extract the product data out of the pages. So all of that, internally, is built with the simple API of receiving a response and returning requests to follow and items of scraped data. That's basically one of the great things and the beauty of it.
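
As a rough illustration of that callback model, a minimal Scrapy spider might look like the sketch below. The site URL and CSS selectors are made up for the example, not taken from the interview.

```python
import scrapy


class BlogSpider(scrapy.Spider):
    # Illustrative spider: the name, URLs and selectors are invented.
    name = "blog"
    start_urls = ["http://example.com/blog"]

    def parse(self, response):
        # Each downloaded page is delivered to this callback as a response.
        for post in response.css("article"):
            # Return items of scraped data...
            yield {
                "title": post.css("h2 a::text").get(),
                "url": post.css("h2 a::attr(href)").get(),
            }

        # ...and/or subsequent requests to follow, handled by a callback.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
```

The "crawl spider" variant he mentions (CrawlSpider in Scrapy) builds on these same primitives, declaring rules and link extractors as class attributes instead of following links by hand.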

18:12 Michael: That's really cool, and at the heart of it, you've got the css method where I can give sort of complex hierarchical CSS expressions to get hold of the pieces I need, right?

18:22 Pablo: Yeah, on the extraction side we also provide some convenient methods for extracting the data using well known selectors, like CSS selectors or even XPaths. That is the most universal way to address parts of pages, so you can do anything with it. And with the requests to follow, you can generate the requests to make what you would call Ajax calls, in order to retrieve extra data from the web pages. And here is something to know- if you want to use this at a low level, you will need to look at what requests the website is making in your browser and replicate those in Scrapy.
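
A quick sketch of those two selector styles on the same snippet of HTML, using Scrapy's Selector directly; the HTML here is invented for illustration.

```python
from scrapy.selector import Selector

html = """
<ul id="products">
  <li class="product"><a href="/p/1">Lamp</a><span class="price">19.99</span></li>
  <li class="product"><a href="/p/2">Chair</a><span class="price">49.00</span></li>
</ul>
"""

sel = Selector(text=html)

# CSS selectors...
names_css = sel.css("li.product a::text").getall()          # ['Lamp', 'Chair']

# ...and the equivalent XPath, a very universal way to address parts of a page.
names_xpath = sel.xpath("//li[@class='product']/a/text()").getall()

prices = sel.css("li.product span.price::text").getall()    # ['19.99', '49.00']
print(names_css, names_xpath, prices)
```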

18:57 Michael: Oh interesting, so suppose I want to go scrape a single page app that's written in AngularJS. If I just go and hit that with a direct request, I am going to get curly-brace this, curly-brace that, a meaningless template with no data, right?

19:18 Pablo: Yeah, exactly.

19:18 [music]

19:18 This episode is brought to you by Hired. Hired is a two-sided, curated marketplace that connects the world's knowledge workers to the best opportunities.

19:18 Each offer you receive has salary and equity presented right up front and you can view the offers to accept or reject them before you even talk to the company.

19:18 Typically, candidates receive 5 or more offers in just the first week and there are no obligations ever.

19:18 Sounds awesome, doesn't it? Well did I mention the signing bonus? Everyone who accepts a job from Hired gets a $1,000 signing bonus. And, as Talk Python listeners, it gets way sweeter! Use the link hired.com/talkpythontome and Hired will double the signing bonus to $2,000!

19:18 Opportunity is knocking, visit hired.com/talkpythontome and answer the call.

19:18 [music]

20:17 Pablo: Remember that Scrapy was built in a different world, in 2007; the web was a lot more static.

20:24 Michael: There was not so much Ajax, not so much Javascript.

20:25 Pablo: And now we are in this crazy world moving more toward apps running in websites. Scrapy is still able to handle it, because Scrapy is to some extent very low level: whatever you can do in a browser, you can do in Scrapy, because it works at the HTTP request/response level. But you sometimes feel that you need something more digestible so that you have the data readily available. Based on this need, we worked on extending Scrapy's Javascript support, and being hard core fans of reusable components, we ended up creating a separate component called Splash, which is a sort of mini browser with an HTTP API. It integrates really well with Scrapy- there are libraries that integrate it with Scrapy so that you can actually have the Javascript-rendered data available in your callbacks for you to use.
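
A hedged sketch of what that Scrapy + Splash combination can look like with the scrapy-splash library, assuming a Splash instance running locally; the target URL is made up, and the full middleware configuration comes from the scrapy-splash docs rather than being shown here.

```python
import scrapy
from scrapy_splash import SplashRequest


class AngularAppSpider(scrapy.Spider):
    # Illustrative spider; assumes Splash is running at localhost:8050.
    name = "angular_app"
    custom_settings = {
        "SPLASH_URL": "http://localhost:8050",
        # Plus the scrapy-splash downloader/spider middlewares from its README.
    }

    def start_requests(self):
        # SplashRequest routes the page through the headless browser first,
        # so the response body contains the Javascript-rendered HTML.
        yield SplashRequest(
            "http://example.com/app",   # made-up single page app URL
            callback=self.parse,
            args={"wait": 1.0},         # give the page time to render
        )

    def parse(self, response):
        yield {"heading": response.css("h1::text").get()}
```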

21:30 Michael: That's really cool, I didn't know it did that, so that's really excellent.

21:34 Pablo: Yeah, it's not one of the things that is most prominently shown when you come across Scrapy.

21:39 Michael: Yes, of course, but if you've got to go after a website that uses one of these front end Javascript data binding type frameworks, you have to have those things, so that's really cool. So, another thing that's in your sample that I think is worth bringing up really quickly, although it's not specific to what you guys are doing, is that you are using the yield keyword, and your parse methods support sort of these generator methods- these progressive, get-as-much-as-you-like type of methods. Do you want to maybe just talk really quickly about generator methods? Because I don't think we have talked about them on the show very often, if at all.

22:16 Pablo: Yeah, absolutely. So, these spider callbacks I was mentioning can actually return a generator, or an iterator in its more general form. What Scrapy does internally is call the callback and start iterating over its output, so it can potentially return an infinite generator. Scrapy is sometimes used that way: starting on a page and then monitoring it, and it keeps returning data and requests to follow as they appear. Or you just generate an incremental number of pages to check. And Scrapy uses memory efficiently; in order not to consume the whole generator at once, it consumes it in manageable parts, and it has a bunch of controls in place to make sure that no place in the framework consumes too much and overflows memory. The reason is that in the spider code you can just use things like yield; you send the data out to the Scrapy framework side and you can be sure it's going to get processed, and you don't have to check yourself whether you are sending too much data or too little, or whatever. So yeah, that's how you can benefit from it.
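
For instance, a callback or start_requests can be an open-ended generator that Scrapy consumes lazily rather than all at once. This sketch pages through numbered listing URLs until a cap is reached; the domain, selectors, and the cap itself are invented for illustration.

```python
import scrapy


class IncrementalPagesSpider(scrapy.Spider):
    # Illustrative only: the site and the page cap are made up.
    name = "incremental"

    def start_requests(self):
        # A generator of requests; Scrapy pulls from it gradually instead of
        # materializing every request up front.
        page = 1
        while page <= 100_000:
            yield scrapy.Request(
                f"http://example.com/listing?page={page}", callback=self.parse
            )
            page += 1

    def parse(self, response):
        # Yield items as they are found; they stream out to the framework.
        for row in response.css("div.result"):
            yield {"name": row.css("::text").get()}
```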

23:34 Michael: Yeah, so internally it uses iterators everywhere and you can just layer on your own iterators, right? Because if it would build lists and fill them up, that would kind of be useless, right?

23:44 Pablo: Exactly. It's iterator friendly by default and you can take advantage of that.

23:49 Michael: Yeah, that's really awesome. If I want to go to a website- let's just take somewhere like The New York Times. The New York Times is the well known newspaper, their articles are highly valued, but there are also usage restrictions; they have their so-called paywall trying to make you have an account and stuff. Is it legal for me to go and turn Scrapy loose on The New York Times, or what is the story around when I can and can't use web scraping?

24:17 Pablo: We like to say Scrapy is a tool, and as with many other tools you can use it for legal or illegal purposes. Even tools like BitTorrent, which is the de facto example, can be used for completely legitimate, legal stuff; what you are doing with the tool is ultimately what makes it legal or not. In the case of The New York Times, you should probably obey whatever rules they have, and at some point you will realize if you are doing it wrong. At least I don't see scraping as a way to steal data that isn't supposed to be taken; there are a lot of projects and things out there that use Scrapy to just gather the data that is available there, in order to process it-

25:12 Michael: Right, like a really common example would be Google.

25:14 Pablo: Exactly.

25:14 Michael: And I don't think that they use Scrapy, but obviously, conceptually, they do that, right?

25:19 Pablo: Exactly. Google is the largest scraper in the world. They get the data and show it in their search results. A similar thing applies everywhere for us and how we do our projects. We generally deal with public information only when we work on scraping projects; we don't want to deal with all the legal stuff like that. And yeah, we always recommend customers get proper assistance and, if possible, consent from the websites to get the data, because in many cases, believe it or not, the websites don't care if you take the data from them, they just don't want to go through the trouble of packaging the data and sending it to you.

26:07 Michael: Yeah, right, or maybe they would like to but they are technically incapable.

26:10 Pablo: Exactly.

26:11 Michael: Not everybody who has a website is a programmer.

26:15 Pablo: Yeah, I've heard so many times the website owner say, "Ok, if you can take the data and you don't cause me any problems, then go for it." You wouldn't believe how common that is.

26:26 Michael: Yeah, yeah, that's really cool. Let's talk about large scale web crawling- I mentioned Google, that's kind of large scale. You have your experience with thousands of crawlers; what do you need to worry about when you are running stuff at that scale, rather than just having an app that goes to one page and gets some information?

26:44 Pablo: Right, scraping scales in two directions, I would say. One is the code direction, and the other is the infrastructure and volume direction. Large scale on the code side could be having to maintain a couple of thousand spiders in your project; let's call that vertical scaling, if you will. Whereas scaling in the horizontal or infrastructure direction would be having a single code base, perhaps more automated or based on sophisticated extraction, that needs to crawl a lot of web pages, like hundreds of millions of web pages.

27:23 Michael: Ok, and that's more like what you were talking about with the machine learning; maybe you would do Scrapy plus scikit-learn and a single code base that just goes and understands the web, versus: I know that these are the 27 retailers' websites and I want to give you the prices from all of them, so I write specific code to get the price of these common objects, something like that, right?

27:43 Pablo: Yeah, exactly. And you can only vertically scale so much with the number of spiders, because the cost starts increasing quite a lot to keep each spider running and well maintained. Spiders break, you need to monitor them and react when they break, and the size of the team that you need to maintain them goes up really quickly.

28:04 Michael: You get a nice little happy message from Target saying, "We've just redesigned target.com" and you are like, "Oh no, there it goes," right?

28:11 Pablo: Yeah, don't get me started on Black Friday and the holiday season. The challenges vary depending on which case we are talking about. The code scalability is somewhat minimized, or addressed as much as possible, by Scrapy's common conventions; that's as much as you can do, because in the end you have to write the XPaths or CSS selectors at some point anyway, but aside from that, anything else that can be automated, infrastructure or otherwise, we try to automate. The other type of scalability problem is related to crawling a huge number of pages. That involves not only the massive amounts of data that you need to go through and digest, but things like revisit policies, for when you have to keep track of a huge number of pages in a very large website or group of websites and you only want to revisit them depending on how often they get updated. There are a few common algorithms used in this case, but the big challenge here is keeping this big queue of URLs that you need to control and prioritize, because the crawlers just take URLs from the queue; a crawler will ask the queue, "Ok, what's the next URL you want me to fetch?", take that URL, fetch it, and return it for processing. But maintaining that queue is the challenge.

29:43 Michael: Yeah, yeah, of course, because you want to conceptually cache as much of that as possible and not hit it if it's not going to have changed. Is that a place where you could plug in machine learning and have it watch and sort of tell it, "this time I learned something new, this time I didn't," and maybe it could be smart about this?

30:02 Pablo: Yeah, exactly. As I was saying, there are a few common algorithms used for revisiting, simply based on the frequency of updates: if you visit a page two times during a certain period and the page is the same, or actually the extracted data is the same-

30:19 Michael: Exactly, that's what matters, right?

30:20 Pablo: Just to rule out incidental changes. Then you perhaps double the time before you check that page again. So if you visited after a day and it didn't change, then you are going to visit again two days from now, and if the page changed, then for that page you reduce the wait to half the time, so you are going to revisit in half a day, and you let the wait times adjust automatically. That works as a basic algorithm, and it works really well. Scrapy has its own internal scheduler which serves as an in-memory queue for requests. It can only grow so much because Scrapy runs in a single operating system process, and there are more limitations when running a single process. So when you need to scale to a very large number of web pages, you need to use something external. On that side of the equation, we have been working for just over a year now on a new project called "Frontera", which is an extension to Scrapy to deal with large crawls. It essentially manages what is called the crawl frontier- I should have mentioned it before, but this queue of requests to follow is called the crawl frontier in scraping terms.
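
That doubling/halving policy is easy to sketch outside of any framework. Here is a toy version of the idea (not Frontera's actual implementation); the interval bounds and fingerprinting scheme are illustrative assumptions.

```python
import hashlib

# Toy revisit scheduler: double the interval when the extracted data is
# unchanged, halve it when it changed. Bounds are made up for the example.
MIN_INTERVAL_H = 1          # never revisit more often than hourly
MAX_INTERVAL_H = 24 * 30    # never wait longer than roughly a month


def fingerprint(extracted_data: dict) -> str:
    # Hash the *extracted* data, not the raw HTML, to ignore incidental changes.
    return hashlib.sha1(repr(sorted(extracted_data.items())).encode()).hexdigest()


def next_interval(hours_waited: float, old_fp: str, new_fp: str) -> float:
    if new_fp == old_fp:
        # Page (effectively) unchanged: wait twice as long before revisiting.
        return min(hours_waited * 2, MAX_INTERVAL_H)
    # Page changed: come back in half the time.
    return max(hours_waited / 2, MIN_INTERVAL_H)


# Example: revisited after 24h and the data was identical -> next visit in 48h;
# if the data changed, the next visit drops to 12h.
print(next_interval(24, "abc", "abc"))  # 48
print(next_interval(24, "abc", "def"))  # 12.0
```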

31:42 Michael: Yeah, very cool. You guys also built web crawling as a service, or web crawling infrastructure as a service if you will, right? Tell me about that.

31:53 Pablo: Yeah, so one of the things that are common to many scraping projects is the infrastructure required by them. As I was saying, writing the CSS selectors and XPaths is something that you may need to do separately for each website, but all the rest- running the spider, getting the data, reviewing the data with your colleagues or customers, iterating over it- is pretty much the same, and it's surprisingly what ends up taking more time than writing the CSS selectors or the XPaths themselves, because when you are dealing with data, sometimes there are different expectations between how the customer needs the data and how you produce it from your spider. So being able to keep those in sync with an efficient tool is of crucial importance. Realizing this, there was an opportunity to launch a tool that takes the next step after Scrapy, because Scrapy solves the problem at the developer level: you can run a Scrapy spider with the scrapy crawl command and it all works the same on any machine, on any platform. But what comes after that? What if you need to collaborate with teams? Then it starts getting a bit complicated. Scrapinghub deals with that- with running Scrapy spiders in the cloud in the friendliest way possible, so that developers can benefit from it.

33:25 Michael: Yeah, that's cool, and that's just at scrapinghub.com, right?

33:27 Pablo: Yeah, exactly.

33:28 Michael: Yeah, so people can check it out there, maybe tell us like what is the main use case, like why do people come and set up system to run on your platform?

33:38 Pablo: The nice thing is that any Scrapy spider that already runs on your machine will run on Scrapy Cloud; that's the premise of the service. We don't require you to make any changes to your already working spiders; you can just run a single deploy command and you have it deployed in our cloud, and you can run it from the web panel there. Think of it as similar to Heroku- I am not sure if you are familiar with it, but Heroku is the same thing for websites: you have a website running locally, and yes, you need to configure a manifest file to indicate a few things, but that's all you need to do, and then with a Heroku deploy you have the website running; you have the URL and anyone can access it. We wanted to build the same thing for spiders- there wasn't anything like it, and there still isn't- and we realized that we have kind of the best framework for writing web spiders, so the next obvious thing to do was to provide the friendliest way possible to run the spiders and collaborate with your team. You can run them and see the data that is being scraped, with images and all those things nicely rendered, and you can add inline comments to the data to check with your colleagues whether it's ok. The other big piece is the infrastructure; it's very hard to have this infrastructure when you are small, and it doesn't make sense to build it if you are just crawling a few sites, although big companies that do a lot of web crawling, like Google, have their own sophisticated crawling infrastructure already in place. Having a sophisticated crawling infrastructure in place is pretty daunting for a small startup or a small company, so we wanted to make that accessible to small companies when we started Scrapinghub: take all of this, or as many of these nice infrastructure perks that one has with a decently large crawling infrastructure as possible, and make it available to everyone.

35:51 Michael: That's a really interesting mission, I mean if I was going to do a startup and a key component of that was to go out and just gather all this data through web scraping, you could setup your own infrastructure on AWS or Digital Ocean or whatever, but that's not your job, right, why do you want to do that, right, just drop it over there and let you guys deal with it.

35:51 [music]

35:51 Continuous delivery isn't just a buzz word; it's a shift in productivity that will help your whole team become more efficient.

35:51 With Snap CI's continuous delivery tool you can test, debug, and deploy your code quickly and reliably. Get your product in the hands of your users faster and deploy from just about anywhere at any time.

35:51 Did you know that ThoughtWorks literally wrote the book on continuous integration and continuous delivery? Connect Snap to your GitHub Repo and they'll build and run your first pipeline auto-magically.

35:51 Thank Snap CI for sponsoring this episode by trying them for free at snap.ci/talkpython.

35:51 [music]

37:07 Pablo: That's what we try to convey- we are trying to become the Heroku of web scraping. You can always set up an AWS server or Digital Ocean server and deploy everything there and it will work, because everything is based on open source tools and we have remained firmly against any type of vendor lock-in. We release as much as we can as open source, and I am not understating this in any way; the things that we haven't yet open sourced are the ones we haven't cleaned up enough to send out there, where it would cause more harm than good being out there. But eventually we see ourselves open sourcing everything and essentially charging for the infrastructure to run those things. We don't want to charge for code licenses or anything like that.

37:57 Michael: Right, do you see it as somewhat analogous to OpenStack?

37:59 Pablo: Yes, absolutely.

38:00 Michael: And from a business model perspective as well?

38:02 Pablo: Yeah, absolutely, because the infrastructure is something that you can never really expect to open source, right? So it's not like there is a major flaw that we could hit at some point; it's very similar in that regard to OpenStack. There are also two types of customers, right: developers with no money but very eager to learn and do stuff, and companies with big pockets and no time that want things working tomorrow. We try to serve both. We have a business offering at Scrapinghub where you can get the spiders written and everything, so that you don't need to do anything, you just get the data; but internally, this uses all the rest of our platform that we share with developers around the world, and we both work on the same platform. So if you are a developer, you can use the platform, and if you are a business, you can hire a developer to get the job done for you using the same platform. I believe that trying to serve both audiences is the key for a business built around open source to succeed. I guess you could extend that to any business, but this has been one of our best decisions. And I was always a fan of building developer tools, coming from my developer background- I am still a programmer first and foremost. I love tools that are well done, that help you get work done and make it fun while you work, so I love building these things. Whenever I wanted to work on something, I first tried to build tools to make that something more efficient before actually going to work on it.

39:55 Michael: Yeah, that's cool, and you just get to enable other people to build their thing, right, and you get to keep working on the tools. So, I want to come back to this idea of open source and business; these are very interesting mixes to me. Real quickly, what is the future of web scraping? From where we are now, what do you see coming up in the next 5 or 10 years that's going to be different?

40:17 Pablo: Well, technologies will change quite a bit in 5 years if you look back at what happened in the last 5 years, and I think the web is a lot less proprietary now in terms of internet technologies; perhaps the last big one was the fact that Flash lost to HTML5, the last example of a proprietary versus an open technology.

40:50 Michael: Yeah. That's a big technical wall coming crashing down, like we'll let you into that area now, right?

40:56 Pablo: Yeah. So, of course we benefit from the fact that everything is in HTML, and there is going to be even more HTML out there, and it will continue to be that way. The point is that technologies may change a bit, but the information that is available out there will still need to be retrieved and used by companies, so scraping in general- I don't think the concept will change much. The earlier concept of screen scraping, extracting data from screens rather than from web pages, still remains to this day, and if anything, what will change a bit is how you extract the data: it will involve more Javascript execution, and more tool automation for reproducing actions rather than following links, to follow the pattern of where applications are heading. But I don't expect a lot of change aside from that, really. There are going to be a lot of sophisticated technologies that will keep improving on the side of keeping web crawlers away- detecting web crawlers and banning them, for sure.

42:09 Michael: There is a bit of an arms race there, right?

42:11 Pablo: Yeah, exactly, that summarizes it well. The fact that companies are going to be more protective of their data- some of them consider it core business value- will mean that scraping technologies will need to evolve and become more sophisticated as well, to keep up. So there is going to be a large, very expensive private market where you will be able to find mechanisms to still get the data out of web pages with even bigger infrastructure. That's not the focus of what we are trying to achieve.

42:56 Michael: Right, of course. So, two other things that I don't know much about come to mind for me. One, you talked about the Javascript front end stuff- the fact that you guys support grabbing these AngularJS style apps; I think there will be more of these, right, so that will be interesting. But what about HTTP/2? Will that have any effect on you guys, better scalability, or no effect- what do you think?

43:20 Pablo: No, I don't think it will change much; parts of the framework will be adapted to the HTTP/2 enhancements, but the discipline of scraping data- I don't see that changing much because of it.

43:42 Michael: Yeah.

43:43 Pablo: Web sockets may require a different approach, if you think about it.

43:49 Michael: Yeah, when the data start coming out of web sockets.

43:51 Pablo: Yeah, I guess that we haven't done pretty much any web sockets scraping recently or at all, and really I haven't put a lot of thought to it, it hasn't crossed my mind a lot lately.

44:05 Michael: Sure. So let's talk a little bit about- we have a few minutes left- let's talk about business, open source, making a living on open source. There are some really powerful examples of companies and people doing this, but there are also a lot of people who dream of it but don't see a path forward. Notable examples of companies making it work: Continuum, the Anaconda guys; Red Hat; OpenStack; MongoDB- those are all companies that have somehow made open source work.

44:36 Pablo: Somehow made it work.

44:35 Michael: Somehow, but there has got to be a ton of like failed attempts unfortunately. So, maybe tell me how is it going for you guys, what's it like to run sort of a distributed company that builds an open source product, but still, you know, somehow makes it a business.

44:52 Pablo: We started by doing a lot of professional services, and we still do a lot of consulting using our own tools. So, you need to keep in mind this audience of companies and businesses that could benefit from your open source project if you want to make a living out of it. In our case, it wasn't difficult, because there is a natural need for data- data is key and many companies are after data- and the open source tool that we developed and maintain is related to extracting data, so we started Scrapinghub by providing solutions to these companies first and foremost. Then we evolved the open source products as the needs of our customers required, improving them along the way when we found the opportunity. Even up to this point it's been very similar; the platform and the other services I mentioned grew out of needs from our customers. So, as general advice, try to find where your paying customers will be- companies that will benefit from your open source project- and try to build a company that provides that. In our case, we did it by providing development expertise to write web crawlers, and it's very similar to how Continuum does it, for example. I am a big fan of them and it's a really nice model to follow. They do a lot of consulting, but they keep the Anaconda platform and everything that makes their jobs easier, and at the same time it allows them to spend a lot of time working on open source. I always wanted to be able to just live in the open source world and keep coding on Scrapy. Unfortunately I am not able to code much on Scrapy these days, but I plan to return to it once the company runs on autopilot and I may be able to retire back to Scrapy. But yeah, try to connect the dots to where companies will make use of the open source project that you are working on- like, if you are working on something like Bootstrap, fund a consulting company that builds websites quickly.

47:27 Michael: Right. So let me think through how it might go if I were trying to plan this out for Scrapy. If I were you guys, five or eight years ago, before you started Scrapinghub, I'd say: ok, I've got this really successful web scraping open source project and people are using it, companies are using it, but they are really just using it as a tool, and the thing they actually want is lots of data off web pages, fast. And so then, how do you build a business? I think it makes perfect sense to say: well, let's build this infrastructure where you can take the thing you are already doing, push a button, drop it onto the platform, and have it deliver the data fast and in a reliable way, without all the other stuff, right, without adding anything to it?

48:10 Pablo: Yeah.

48:12 Michael: Is that kind of the thinking that you guys went through or?

48:13 Pablo: That's a perfect example, and in fact we didn't focus early on on just pushing a button and getting the data automatically- we were focused more on providing engineering hours, consulting, and training for our tools- but we are very much focused on data marketplaces and automatic APIs to do extraction right now. The model just as you describe it is also perfectly possible; it's a way to go about it, and even better if you find a way to productize your open source project so that companies can benefit from it without having to go through all the installation, configuration, and customization phases. If you find a way to productize your open source project and provide it in a hosted manner- that's a pretty common model with a few successful examples, like Sentry, this tool for monitoring websites; they provide a nice hosted version of the open source tool and they are doing awesome. Yeah, that's another model that works really well. In the end it's a matter of finding where your tool provides value to businesses, and trying to connect the dots or offer it so that when you are working for your customer, you are implicitly forced to work on your open source project and remain connected to it and keep growing it. You need to find a way to put your open source project on your work agenda so that it's not something that you do in your extra time or as a hobby.

49:52 Michael: Right, so if you do consulting, maybe you make sure that your contracts include clauses saying: if I am helping you with this consulting using my open source project, and there is a natural enhancement that I could add to the project- driven by your need- make it possible for me to put that into my open source project without any conflicts. Something like that, right?

50:11 Pablo: Exactly. In our case, we include a clause like that, and sometimes we even get Scrapy enhancements sponsored by our customers, which is great- that is a win-win situation. You need to be a bit creative, of course, and be patient as well. For a long time I was working on Scrapy in my extra time, and I wasn't really sure where the project was going, or if it was going to be successful at all, but I just wanted to build something- that was my main motivation back then- to create something of value to developers, because I was a full time developer back then, and something that made my job a little more fun, and that's what I did. It was later that I realized I had a nice thing going on, a nice open source project that a community had gathered around. First, it was just a couple of guys, and I tried to help them and answer a few questions here and there, and eventually it was this community that encouraged me to quit my job and focus solely on Scrapy.

51:16 Michael: Was that a pretty happy day?

51:18 Pablo: At the time it was a bit crazy, leaving my job and just starting this, but to be honest, I already had a few customers lined up that were going to be Scrapinghub customers, so it was enough to sustain a small business; we kind of had the market proven before I started Scrapinghub. I constantly feel proud about the things that Scrapy has achieved, and a lot of the things achieved lately are not thanks to me- I am not super involved day to day in the project itself- but it's like when you see your kid graduate from college and he is doing great things out there; it really makes me very proud. Just last month the first Scrapy book came out, and that was a really happy thing for me when I realized-

52:23 Michael: I am sure you are really proud of it, that's great, congratulations. So the companies I named earlier, with the possible exception of Continuum, were really large companies. But you are not a super large company, right? How many people work with you?

52:35 Pablo: At the moment we are 130 people working full time, which to me is pretty big, but of course not huge.

52:43 Michael: Ok, that's bigger than I realized, that's awesome, yeah. Cool.

52:47 Pablo: Yeah, so we are growing, doubling in size every year, more or less. We have never set a goal like "let's double in size" or "let's grow to this size"; it just naturally happened- customers were requesting more work and we kept adding people to the team, and now we have a large part of the team working on the platform, which already drives a lot of demand. Running a remote company comes with its challenges, right? You have a lot of coordination issues between teams, communication is key, self management is very important, but in exchange you have a global pool of talent available to you, and that's been one of the keys, if not the most important thing, in being able to grow Scrapinghub. We've been focusing really hard on hiring the best talent out there; it took us a while, but I am really proud of the team that we have assembled and I am sure that the best is still to come. I've always kind of worked remotely since I started working professionally- I was a sys admin at the beginning, so I managed the servers from home, then I worked for this company in the UK that built the aggregator and I was also remote there- so it felt natural for me to continue working this way. Also, the things that I learned while managing the Scrapy community I try to apply as much as possible to managing Scrapinghub, like the whole culture of asking for forgiveness rather than permission and those types of things; sometimes I try to see it as a really big open source project with some commercial interest that we are trying to manage.

54:41 Michael: Yeah. You must be super proud of what you built, that's awesome. And I think people really enjoy hearing this sort of success story from the beginning until now, that's great. So we just have like a couple of minutes for final questions, questions I always ask at the end of the show- if you are going to write some python code, what editor do you open up?

54:59 Pablo: Well, I am a Vim guy, what can I say.

55:04 Michael: All right, awesome, that may be the most common answer. I think it probably is.

55:08 Pablo: I am quite fond of Sublime, though. I am not writing much code these days, but I have tried Sublime a couple of times and it's the first one that made me consider leaving Vim.

55:24 Michael: Yeah, but it hasn't pulled you over to its side, yeah?

55:27 Pablo: No, not yet.

55:28 Michael: All right, and from all the packages on PyPI- there are 70,000+ these days- there are so many that are awesome that maybe people do not know about; what would you recommend? Obviously pip install Scrapy, and then, anything else?

55:43 Pablo: There is a whole bunch of interesting stuff there; it's hard to pick one. I'll just try to think of one that is not part of the Scrapy ecosystem: definitely try Sentry if you are running a website, it's really cool stuff for running a website.

55:59 Michael: Ok, Sentry is nice.

55:59 Pablo: Yeah, I guess I will go with that one.

56:02 Michael: Awesome, ok. Before I let you go, any final calls to action, things people should go try?

56:09 Pablo: Yeah, definitely come try Scrapinghub- you won't regret it. If you have any scraping needs at any level, we are building the most awesome tools to make your job easier.

56:32 Michael: Awesome, so everyone check it out. Pablo, it's been really fun to talk about web scraping, I learned a ton from it.

56:36 Pablo: Thank you. Likewise, it's been very interesting talk.

56:39 Michael: Yeah, thanks for being on the show.

56:40 Pablo: Thank you, bye bye.

56:40 This has been another episode of Talk Python To Me.

56:40 Today's guest was Pablo Hoffman and this episode has been sponsored by Hired and SnapCI. Thank you guys for supporting the show!

56:40 Hired wants to help you find your next big thing. Visit hired.com/talkpythontome to get 5 or more offers with salary and equity right up front and a special listener signing bonus of $4,000 USD.

56:40 Snap CI is modern continuous integration and delivery. Build, test, and deploy your code directly from github, all in your browser with debugging, docker, and parallelism included. Try them for free at snap.ci/talkpython

56:40 It's the final few days for my video course kickstarter. The campaign is open until March 18th and you'll find all the details at talkpython.fm/course. Hurry on over there and sign up before it closes!

56:40 You can find the links from the show at talkpython.fm/episodes/show/50

56:40 Be sure to subscribe to the show. Open your favorite podcatcher and search for Python. We should be right at the top. You can also find the iTunes and direct RSS feeds in the footer on the website.

56:40 Our theme music is Developers Developers Developers by Cory Smith, who goes by Smixx. You can hear the entire song on our website.

56:40 This is your host, Michael Kennedy. Thanks for listening!

56:40 Smixx, take us out of here.
