Analytics Vidhya

Analytics Vidhya is a community of Generative AI and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Scraping Websites Using Python

Photo by Nathan Dumlao on Unsplash

Web scraping, also called web harvesting or web data extraction, is data scraping used to extract data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis. In this blog I will discuss three Python libraries that are commonly used for web scraping: first Beautiful Soup 4, then Selenium, and finally Scrapy.

Beautiful Soup

pip install beautifulsoup4

Beautiful Soup is a Python library for pulling data out of HTML and XML files, and it is realistically the easiest of these three options to learn and use. Beautiful Soup comes with its drawbacks, though: it cannot fetch pages on its own, so it is typically paired with the requests library to make requests to the website, and it relies on a separate parser to extract data (Python’s built-in html.parser, or an external one such as lxml for HTML or XML). These dependencies can make it more difficult to transfer code between projects. Let’s take a look at Beautiful Soup in use. For this example I will use a .py file from my last blog post, about programming a Twitter bot, that happened to use this library:

As shown above, just a couple of lines of code are needed to extract data with Beautiful Soup, but we are still required to import requests to gain access to the URL that we want to extract data from, and to name a parser such as html.parser to actually parse the content. Let’s take a look at some alternatives to this.
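
For reference, the requests plus html.parser pattern described here can be sketched roughly as follows. This is a minimal illustration, not the file from the Twitter bot post: the helper names extract_links and fetch_html and the sample HTML are assumptions made up for this example, and the parsing step runs on an inline string so it works without network access.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Download a page with requests and return its HTML as text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def extract_links(html: str) -> list[tuple[str, str]]:
    """Return (link text, href) pairs for every <a> tag that has an href."""
    # "html.parser" is the parser bundled with Python's standard library.
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
    ]


# Hypothetical sample page, used in place of a live request.
SAMPLE_HTML = """
<html><body>
  <h1>Example page</h1>
  <a href="/first">First link</a>
  <a href="https://example.com/second">Second link</a>
  <a>No href here</a>
</body></html>
"""

links = extract_links(SAMPLE_HTML)
title = BeautifulSoup(SAMPLE_HTML, "html.parser").h1.get_text()
print(title)
print(links)
```

Against a live site, the same parsing step would be fed by something like extract_links(fetch_html(url)), where url is whatever page you are allowed to scrape.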

Selenium

pip install selenium

Selenium requires a driver to interface with the chosen browser:

Chrome:

Published in Analytics Vidhya

Written by @lee-rowe

Data Scientist/Engineer | Create Value | if data: data.science()
