
Web scraping, the 2020 edition

Episode #283, published Wed, Sep 23, 2020, recorded Wed, Jul 22, 2020.

This episode is carbon neutral.
Web scraping is pulling the HTML of a website down and parsing useful data out of it. The use cases for this type of functionality are endless. Have a bunch of data on governmental sites that is only listed online in HTML, with no download? There's an API for that! Do you want to keep abreast of what your competitors are featuring on their site? There's an API for that. Need alerts for changes on a website, for example when enrollment opens at your college and you want to be first in line to avoid the 8am Monday morning course slot? There's an API for that.

That API is screen scraping, and Attila Tóth from Scrapinghub is here to tell us all about it.
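
To make the "pull the HTML down and parse useful data out of it" idea concrete, here is a minimal sketch using only Python's standard library. It is not how Scrapy or Scrapinghub do it; the sample page, the LinkExtractor class, and the extract_links helper are all hypothetical names made up for illustration, and the HTML is an inline string standing in for a downloaded page.

```python
from html.parser import HTMLParser

# Hypothetical page content. In a real scraper this string would be the body
# of an HTTP response rather than a hard-coded sample.
SAMPLE_HTML = """
<html><body>
  <h1>Course catalog</h1>
  <ul>
    <li><a href="/courses/cs101">CS 101</a> <span>open</span></li>
    <li><a href="/courses/cs201">CS 201</a> <span>closed</span></li>
  </ul>
</body></html>
"""


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None  # href of the <a> we are inside, if any
        self._text_parts = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears between <a ...> and </a>.
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


def extract_links(html):
    """Parse an HTML string and return its links as (href, text) tuples."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_HTML):
        print(href, text)
```

A framework such as Scrapy layers crawling, scheduling, and selectors on top of this basic download-then-parse loop, which is where the conversation in the episode picks up.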

Links from the show

Attila Tóth on LinkedIn: linkedin.com
Scrapy project: scrapy.org
Scrapinghub on Twitter: @scrapinghub
Scrapinghub: scrapinghub.com
cookiecutter template for Scrapy projects: github.com
Splash: headless browser designed specifically for web scraping: scrapinghub.com/splash
Awesome Web Scraping list: github.com

Talk Python episode 50 on web scraping: talkpython.fm
How Web Scraping is Revealing Lobbying and Corruption in Peru: blog.scrapinghub.com
Web Data Extraction Summit event: extractsummit.io

Attila Tóth

