Scraping Websites Using Python
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis. In this blog I will discuss three libraries designed specifically for web scraping: first Beautiful Soup 4, then Selenium, and finally Scrapy.
Beautiful Soup
pip install beautifulsoup4
Beautiful Soup is a Python library for pulling data out of HTML and XML files, and it is realistically the easiest to learn and use among these three options. Beautiful Soup comes with its downsides, though: it has some dependencies, such as the requests library to make requests to the website and an external parser to extract data, for example an XML or HTML parser. These dependencies can make it quite a bit more difficult to transfer code between projects. Let's take a look at Beautiful Soup in use. For this example I will use a .py file from my last blog post about Programming a Twitter Bot, which happened to use this library:
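A rough sketch of what that looks like (the URL and tags below are placeholders, not the actual targets from the bot):

import requests
from bs4 import BeautifulSoup

# Fetch the page we want to scrape (placeholder URL)
response = requests.get("https://example.com")

# Parse the raw HTML with Python's built-in html.parser
soup = BeautifulSoup(response.text, "html.parser")

# Extract data: here, the page title and the destination of every link
print(soup.title.string)
for link in soup.find_all("a"):
    print(link.get("href"))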
As shown above, just a couple of lines of code are needed to extract data with Beautiful Soup, but we are still required to import requests to access the URL that we want to extract data from, and html.parser to actually parse the content. Let's take a look at some alternatives to this.
Selenium
pip install selenium
Selenium requires a driver to interface with the chosen browser:
Chrome:
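For Chrome, that driver is ChromeDriver. With it installed and discoverable on your PATH, a minimal sketch of Selenium in use might look like this (the URL is a placeholder):

from selenium import webdriver

# Assumes ChromeDriver is installed and available on the PATH
driver = webdriver.Chrome()

# Load the page in a real browser session (placeholder URL)
driver.get("https://example.com")

# page_source contains the fully rendered HTML, including
# content generated by JavaScript after the page loaded
print(driver.page_source)

# Always close the browser when finished
driver.quit()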